This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: Enhancements to block IO & IO scheduler tapset
- From: Prerna Saxena <prerna at linux dot vnet dot ibm dot com>
- To: "Frank Ch. Eigler" <fche at redhat dot com>
- Cc: systemtap at sourceware dot org
- Date: Wed, 18 Nov 2009 17:11:59 +0530
- Subject: Re: Enhancements to block IO & IO scheduler tapset
- References: <4AD2E8DE.90506@linux.vnet.ibm.com> <y0m63afs6lb.fsf@fche.csb>
Hi,
Reworked the patch per suggestions..
On 10/16/2009 08:56 PM, Frank Ch. Eigler wrote:
Prerna Saxena<prerna@linux.vnet.ibm.com> writes:
Here is a patch to add tracepoint-based probes to block IO and IO
scheduler tapsets.
Thank you.
I had to add new probe aliases instead of adding fallbacks to
existing probes because the tracepoint that flags an event, say ,
elv_add_request, is not defined at function entry but somewhere in
the interior. [...]
We don't have to have an exact match. Likely a function-entry or
function-return probe would do, filtered with predicates that skip
inapplicable calls.
For example, your elv_add_request = kernel.trace("block_rq_insert")
seems to match recent kernel.function("elv_insert").
I reworked this probe to have kprobe based fallback. Unfortunately, this
was the only place where it was feasible.
trace("block_rq_abort") ~= kernel.function("blk_start_request"),
perhaps with some filtering (cmd_flags | REQ_QUIET?).
In this case, a kprobe based fallback is not feasible because the same
function is called from multiple places, and a tracepoint is placed just
before a singular instance of this call. So a kprobe based probe in this
place will generate far more number of calls than the tracepoint hit.
Its a similar situation for most other tracepoint based probes -- it is
unfortunate that most probes cannot have kprobe fallbacks.
Also, the handlers for a lot of probe aliases are duplicated -- they
essentially expose the same set of local variables available in the
probe. Is there some way I could reduce duplication here ? [...]
probe ioscheduler_trace.plug = kernel.trace("block_plug"),
probe ioscheduler_trace.unplug_io = kernel.trace("block_unplug_io")
{
..do something..
}
We don't have a general preprocessor for such stuff. Among the
possibilities:
- extending our own little preprocessor
- so something like %( foo = token token token %) to define,
%( foo %) to use
- running cpp
- but that's bad because it'd mess up embedded-C code like #includes/#ifdefs
- running m4
- possible; could run m4 on tapset/**/*.stp.m4 at systemtap build time
or at run time
- could make contextual script errors tricky
- would not mess with embedded-C
- m4_define(`foo',`token token token') to define,
foo to use
- could export CONFIG_*, kernel_v, etc. variables for m4 conditionals
- doing nothing
- so cut& paste as today
I've just replicated handlers for now !
- FChE
I've added testcases and man page updates. Looking fwd to feedback..
Regards,
--
Prerna Saxena
Linux Technology Centre,
IBM Systems and Technology Lab,
Bangalore, India
Index: stap-git-Oct-01/tapset/ioblock.stp
===================================================================
--- stap-git-Oct-01.orig/tapset/ioblock.stp
+++ stap-git-Oct-01/tapset/ioblock.stp
@@ -145,7 +145,7 @@ probe ioblock.request = kernel.function
* Variables:
* devname - block device name
* ino - i-node number of the mapped file
- * byte_done - number of bytes transferred
+ * bytes_done - number of bytes transferred
* sector - beginning sector for the entire bio
* flags - see below
* BIO_UPTODATE 0 ok after I/O completion
@@ -186,3 +186,161 @@ probe ioblock.end = kernel.function("bio
%)
size = $bio->bi_size
}
+
+/**
+ * probe ioblock_trace.bounce
+ *
+ * tracepoint based probe.
+ * Fires whenever a buffer bounce is needed for at least one page of
+ * a block IO request.
+ *
+ * Context :
+ * The process creating a block IO request.
+ *
+ * Variables :
+ * $bio struct bio *
+ * $q struct request_queue*
+ * devname device for which a buffer bounce was needed.
+ * ino - i-node number of the mapped file
+ * bytes_done - number of bytes transferred
+ * sector - beginning sector for the entire bio
+ * flags - see below
+ * BIO_UPTODATE 0 ok after I/O completion
+ * BIO_RW_BLOCK 1 RW_AHEAD set, and read/write would block
+ * BIO_EOF 2 out-out-bounds error
+ * BIO_SEG_VALID 3 nr_hw_seg valid
+ * BIO_CLONED 4 doesn't own data
+ * BIO_BOUNCED 5 bio is a bounce bio
+ * BIO_USER_MAPPED 6 contains user pages
+ * BIO_EOPNOTSUPP 7 not supported
+ * error - 0 on success
+ * rw - binary trace for read/write request
+ * vcnt - bio vector count which represents number of array element (page,
+ * offset, length) which makes up this I/O request
+ * idx - offset into the bio vector array
+ * phys_segments - number of segments in this bio after physical address
+ * coalescing is performed.
+ * size - total size in bytes
+ */
+probe ioblock_trace.bounce = kernel.trace("block_bio_bounce")
+{
+ devname = __bio_devname($bio)
+ ino = __bio_ino($bio)
+
+ bytes_done = $bio->bi_size
+ sector = $bio->bi_sector
+ flags = $bio->bi_flags
+ rw = $bio->bi_rw
+ vcnt = $bio->bi_vcnt
+ idx = $bio->bi_idx
+ phys_segments = $bio->bi_phys_segments
+ size = $bio->bi_size
+}
+
+/* probe ioblock_trace.request
+ *
+ * tracepoint based probe.
+ * Fires just as a generic block I/O request is created for a bio.
+ *
+ * Context:
+ * The process makes block I/O request
+ *
+ * Variables:
+ * $bio struct bio* for which IO request is to be submitted
+ * $q struct request_queue* to which the request is to be added
+ * devname - block device name
+ * ino - i-node number of the mapped file
+ * sector - beginning sector for the entire bio
+ * flags - see below
+ * BIO_UPTODATE 0 ok after I/O completion
+ * BIO_RW_BLOCK 1 RW_AHEAD set, and read/write would block
+ * BIO_EOF 2 out-out-bounds error
+ * BIO_SEG_VALID 3 nr_hw_seg valid
+ * BIO_CLONED 4 doesn't own data
+ * BIO_BOUNCED 5 bio is a bounce bio
+ * BIO_USER_MAPPED 6 contains user pages
+ * BIO_EOPNOTSUPP 7 not supported
+ *
+ * rw - binary trace for read/write request
+ * vcnt - bio vector count which represents number of array element (page,
+ * offset, length) which make up this I/O request
+ * idx - offset into the bio vector array
+ * phys_segments - number of segments in this bio after physical address
+ * coalescing is performed.
+ * size - total size in bytes
+ * bdev - target block device
+ * bdev_contains - points to the device object which contains the
+ * partition (when bio structure represents a partition)
+ * p_start_sect - points to the start sector of the partition
+ * structure of the device
+ */
+
+probe ioblock_trace.request = kernel.trace("block_bio_queue")
+{
+ devname = __bio_devname($bio)
+ ino = __bio_ino($bio)
+
+ bytes_done = $bio->bi_size
+ error = $error
+ sector = $bio->bi_sector
+ flags = $bio->bi_flags
+ rw = $bio->bi_rw
+ vcnt = $bio->bi_vcnt
+ idx = $bio->bi_idx
+ phys_segments = $bio->bi_phys_segments
+ size = $bio->bi_size
+ bdev_contains = $bio->bi_bdev->bd_contains
+ bdev = $bio->bi_bdev
+ p_start_sect = __bio_start_sect($bio)
+}
+
+/* probe ioblock_trace.end
+ *
+ * Tracepoint based probe.
+ * Fires whenever a block I/O transfer is complete.
+ *
+ * Context:
+ * The process signals the transfer is done.
+ *
+ * Variables:
+ * q - request queue on which this bio was queued.
+ * devname - block device name
+ * ino - i-node number of the mapped file
+ * bytes_done - number of bytes transferred
+ * sector - beginning sector for the entire bio
+ * flags - see below
+ * BIO_UPTODATE 0 ok after I/O completion
+ * BIO_RW_BLOCK 1 RW_AHEAD set, and read/write would block
+ * BIO_EOF 2 out-out-bounds error
+ * BIO_SEG_VALID 3 nr_hw_seg valid
+ * BIO_CLONED 4 doesn't own data
+ * BIO_BOUNCED 5 bio is a bounce bio
+ * BIO_USER_MAPPED 6 contains user pages
+ * BIO_EOPNOTSUPP 7 not supported
+
+ * error - 0 on success
+ * rw - binary trace for read/write request
+ * vcnt - bio vector count which represents number of array element (page,
+ * offset, length) which makes up this I/O request
+ * idx - offset into the bio vector array
+ * phys_segments - number of segments in this bio after physical address
+ * coalescing is performed.
+ * size - total size in bytes
+ */
+probe ioblock_trace.end = kernel.trace("block_bio_complete")
+{
+ q = $q
+ devname = __bio_devname($bio)
+ ino = __bio_ino($bio)
+
+ bytes_done = $bio->bi_size
+ error = $error
+
+ sector = $bio->bi_sector
+ flags = $bio->bi_flags
+ rw = $bio->bi_rw
+ vcnt = $bio->bi_vcnt
+ idx = $bio->bi_idx
+ phys_segments = $bio->bi_phys_segments
+ size = $bio->bi_size
+}
Index: stap-git-Oct-01/tapset/ioscheduler.stp
===================================================================
--- stap-git-Oct-01.orig/tapset/ioscheduler.stp
+++ stap-git-Oct-01/tapset/ioscheduler.stp
@@ -68,22 +68,24 @@ probe ioscheduler.elv_next_request.retur
}
/**
- * probe ioscheduler.elv_add_request - A request was added to the request queue
+ * probe ioscheduler.elv_add_request.kp - kprobe based probe to indicate that a request was added to the request queue
* @elevator_name: The type of I/O elevator currently enabled
+ * @q: pointer to request queue
* @req: Address of the request
* @req_flags: Request flags
* @disk_major: Disk major number of the request
* @disk_minor: Disk minor number of the request
*/
// when a request is added to the request queue
-probe ioscheduler.elv_add_request
- = kernel.function("__elv_add_request")
+probe ioscheduler.elv_add_request.kp
+ = kernel.function("elv_insert")
{
%( kernel_v >= "2.6.10" %?
elevator_name = kernel_string($q->elevator->elevator_type->elevator_name)
%:
elevator_name = kernel_string($q->elevator->elevator_name)
%)
+ q = $q
if($rq == 0) {
disk_major = -1
disk_minor = -1
@@ -142,6 +144,189 @@ probe ioscheduler.elv_completed_request
%)
}
+/**
+ * probe ioscheduler.elv_add_request.tp : tracepoint based probe to indicate a request is added to the request queue.
+ * @elevator_name : The type of I/O elevator currently enabled.
+ * @q : Pointer to request queue.
+ * @rq : Address of request.
+ * @rq_flags : Request flags.
+ * @disk_major : Disk major no of request.
+ * @disk_minor : Disk minor number of request.
+ *
+ */
+probe ioscheduler.elv_add_request.tp
+ = kernel.trace("block_rq_insert")
+{
+q = $q
+elevator_name = kernel_string($q->elevator->elevator_type->elevator_name)
+rq = $rq
+
+if ($rq == 0 || $rq->rq_disk ==0) {
+ disk_major = -1
+ disk_minor = -1
+} else {
+ disk_major = $rq->rq_disk->major
+ disk_minor = $rq->rq_disk->first_minor
+}
+
+rq_flags = $rq==0? 0:$rq->cmd_flags
+}
+
+/**
+ * probe ioscheduler.elv_add_request : probe to indicate request is added to the request queue.
+ * @elevator_name : The type of I/O elevator currently enabled.
+ * @q : Pointer to request queue.
+ * @rq : Address of request.
+ * @rq_flags : Request flags.
+ * @disk_major : Disk major no of request.
+ * @disk_minor : Disk minor number of request.
+ *
+ */
+probe ioscheduler.elv_add_request =
+ ioscheduler.elv_add_request.tp !, ioscheduler.elv_add_request.kp
+{}
+
+/**
+ * probe ioscheduler_trace.elv_completed_request : Fires when a request is
+ * completed.
+ * @elevator_name : The type of I/O elevator currently enabled.
+ * @rq : Address of request.
+ * @rq_flags : Request flags.
+ * @disk_major : Disk major no of request.
+ * @disk_minor : Disk minor number of request.
+ *
+ */
+probe ioscheduler_trace.elv_completed_request
+ = kernel.trace("block_rq_complete")
+{
+elevator_name = kernel_string($q->elevator->elevator_type->elevator_name)
+rq = $rq
+
+if ($rq == 0 || $rq->rq_disk ==0) {
+ disk_major = -1
+ disk_minor = -1
+} else {
+ disk_major = $rq->rq_disk->major
+ disk_minor = $rq->rq_disk->first_minor
+}
+
+rq_flags = $rq==0? 0:$rq->cmd_flags
+}
+
+/**
+ * probe ioscheduler_trace.elv_issue_request : Fires when a request is
+ * scheduled.
+ * @elevator_name : The type of I/O elevator currently enabled.
+ * @rq : Address of request.
+ * @rq_flags : Request flags.
+ * @disk_major : Disk major no of request.
+ * @disk_minor : Disk minor number of request.
+ *
+ */
+probe ioscheduler_trace.elv_issue_request
+ = kernel.trace("block_rq_issue")
+{
+elevator_name = kernel_string($q->elevator->elevator_type->elevator_name)
+rq = $rq
+
+if ($rq == 0 || $rq->rq_disk ==0) {
+ disk_major = -1
+ disk_minor = -1
+} else {
+ disk_major = $rq->rq_disk->major
+ disk_minor = $rq->rq_disk->first_minor
+}
+
+rq_flags = $rq==0? 0:$rq->cmd_flags
+}
+
+/**
+ * probe ioscheduler_trace.elv_requeue_request : Fires when a request is
+ * put back on the queue, when the hadware cannot accept more requests.
+ * @elevator_name : The type of I/O elevator currently enabled.
+ * @rq : Address of request.
+ * @rq_flags : Request flags.
+ * @disk_major : Disk major no of request.
+ * @disk_minor : Disk minor number of request.
+ *
+ */
+probe ioscheduler_trace.elv_requeue_request
+ = kernel.trace("block_rq_requeue")
+{
+elevator_name = kernel_string($q->elevator->elevator_type->elevator_name)
+rq = $rq
+
+if ($rq == 0 || $rq->rq_disk ==0) {
+ disk_major = -1
+ disk_minor = -1
+} else {
+ disk_major = $rq->rq_disk->major
+ disk_minor = $rq->rq_disk->first_minor
+}
+
+rq_flags = $rq==0? 0:$rq->cmd_flags
+}
+
+/**
+ * probe ioscheduler_trace.elv_abort_request : Fires when a request is aborted.
+ * @elevator_name : The type of I/O elevator currently enabled.
+ * @rq : Address of request.
+ * @rq_flags : Request flags.
+ * @disk_major : Disk major no of request.
+ * @disk_minor : Disk minor number of request.
+ *
+ */
+probe ioscheduler_trace.elv_abort_request
+ = kernel.trace("block_rq_abort")
+{
+elevator_name = kernel_string($q->elevator->elevator_type->elevator_name)
+rq = $rq
+
+if ($rq == 0 || $rq->rq_disk ==0) {
+ disk_major = -1
+ disk_minor = -1
+} else {
+ disk_major = $rq->rq_disk->major
+ disk_minor = $rq->rq_disk->first_minor
+}
+
+rq_flags = $rq==0? 0:$rq->cmd_flags
+}
+
+/**
+ * probe ioscheduler_trace.plug - Fires when a request queue is plugged;
+ * ie, requests in the queue cannot be serviced by block driver.
+ * @rq_queue : request queue
+ *
+ */
+probe ioscheduler_trace.plug = kernel.trace("block_plug")
+{
+ rq_queue = $q
+}
+
+/**
+ * probe ioscheduler_trace.unplug_io - Fires when a request queue is unplugged;
+ * Either, when number of pending requests in the queue exceeds threshold
+ * or, upon expiration of timer that was activated when queue was plugged.
+ * @rq_queue : request queue
+ *
+ */
+probe ioscheduler_trace.unplug_io = kernel.trace("block_unplug_io")
+{
+ rq_queue = $q
+}
+
+/**
+ * probe ioscheduler_trace.unplug_timer - Fires when unplug timer associated
+ * with a request queue expires.
+ * @rq_queue : request queue
+ *
+ */
+probe ioscheduler_trace.unplug_timer = kernel.trace("block_unplug_timer")
+{
+ rq_queue = $q
+}
+
function disk_major_from_request:long(var_q:long)
%{ /* pure */
struct request_queue *q = (struct request_queue *)((long)THIS->var_q);
Index: stap-git-Oct-01/man/stapprobes.iosched.3stap.in
===================================================================
--- stap-git-Oct-01.orig/man/stapprobes.iosched.3stap.in
+++ stap-git-Oct-01/man/stapprobes.iosched.3stap.in
@@ -54,6 +54,7 @@ Fires when return from retrieving a requ
.TP
.B ioscheduler.elv_add_request
Fires when add a request into request queue
+Resolves to a kprobe based fallback if the tracepoint probe isnt available.
.B Arguments:
@@ -94,6 +95,124 @@ Fires when a request is completed
.I disk_minor
disk minor number of the request
+.P
+.TP
+.B ioscheduler_trace.elv_completed_request
+Tracepoint based probe that fires when a request is completed
+
+.B Arguments:
+
+.I elevator_name
+ The elevator name
+
+.I req
+ Address of the request
+
+.I req_flags
+ request flags
+
+.I disk_major
+ disk major number of the request
+
+.I disk_minor
+ disk minor number of the request
+
+.P
+.TP
+.B ioscheduler_trace.elv_issue_request
+Tracepoint based probe that fires when a request is scheduled.
+
+.B Arguments:
+
+.I elevator_name
+ The elevator name
+
+.I req
+ Address of the request
+
+.I req_flags
+ request flags
+
+.I disk_major
+ disk major number of the request
+
+.I disk_minor
+ disk minor number of the request
+
+.P
+.TP
+.B ioscheduler_trace.elv_requeue_request
+Tracepoint based probe that fires when a request is put back on the queue, cos the hardware cannot accept more requests.
+
+.B Arguments:
+
+.I elevator_name
+ The elevator name
+
+.I req
+ Address of the request
+
+.I req_flags
+ request flags
+
+.I disk_major
+ disk major number of the request
+
+.I disk_minor
+ disk minor number of the request
+
+.P
+.TP
+.B ioscheduler_trace.elv_abort_request
+Tracepoint based probe that fires when a request is aborted.
+
+.B Arguments:
+
+.I elevator_name
+ The elevator name
+
+.I req
+ Address of the request
+
+.I req_flags
+ request flags
+
+.I disk_major
+ disk major number of the request
+
+.I disk_minor
+ disk minor number of the request
+
+.P
+.TP
+.B ioscheduler_trace.plug
+Tracepoint based probe that fires when a request queue is plugged.
+
+.B Arguments:
+
+.I req_queue
+ Pointer to the request queue.
+
+.P
+.TP
+.B ioscheduler_trace.unplug_io
+Tracepoint based probe that fires when a request queue is unplugged.
+
+.B Arguments:
+
+.I req_queue
+ Pointer to the request queue.
+
+.P
+.TP
+.B ioscheduler_trace.unplug_timer
+Tracepoint based probe that fires when unplug timer associated with a request queue expires.
+
+.B Arguments:
+
+.I req_queue
+ Pointer to the request queue.
+
.SH SEE ALSO
.IR stap (1),
.IR stapprobes (3stap)
Index: stap-git-Oct-01/testsuite/buildok/ioscheduler.stp
===================================================================
--- stap-git-Oct-01.orig/testsuite/buildok/ioscheduler.stp
+++ stap-git-Oct-01/testsuite/buildok/ioscheduler.stp
@@ -7,3 +7,16 @@ probe ioscheduler.*
printf("ppname: %s, elv_name: %s, %d, %d", probefunc(),
elevator_name, disk_major, disk_minor)
}
+
+probe ioscheduler_trace.elv*
+{
+ printf("ppname: %s, request %p, elv_name: %s, %d, %d", probefunc(),
+ rq, elevator_name, disk_major, disk_minor)
+}
+
+probe ioscheduler_trace.plug, ioscheduler_trace.unplug_io, ioscheduler_trace.unplug_timer
+{
+ printf("ppname: %s, request %p, elv_name: %s, %d, %d", probefunc(),
+ rq_queue)
+}
+
Index: stap-git-Oct-01/testsuite/buildok/ioblock_test.stp
===================================================================
--- stap-git-Oct-01.orig/testsuite/buildok/ioblock_test.stp
+++ stap-git-Oct-01/testsuite/buildok/ioblock_test.stp
@@ -25,3 +25,10 @@ probe ioblock.end {
devname, sector, flags, rw, bio_rw_str(rw), vcnt, idx, phys_segments, size, bytes_done, error, ino)
%)
}
+
+probe ioblock_trace.*
+{
+ log(pp())
+ printf("%s\t%p\t%d\t%d\t%d\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
+ devname, q, sector, flags, rw, bio_rw_str(rw), vcnt, idx, phys_segments, size, bytes_done, error, ino)
+}