
RE: Questions regarding systemtap syntax for profiling...


> Not quite - in the "stopwatch" case, we don't respond to changes in
> the time fluent, but rather to the changes in PC.

I'm not sure I understand. What do you mean by "respond to changes in
the time fluent"?

Regarding the various syntax examples, it is common to want ratios of
event counts, which means collecting multiple counts at the same time
(limited by the PMU hardware). For example, in stopwatch mode, one may
want instructions retired, clockticks and cache misses. Can you string
together the pmc_startwatch() syntax?

probe kernel.pmc_startwatch("instructions_retired").pmc_startwatch("clockticks").pmc_startwatch("L2_cache_miss").function("foo")

In this case I'm guessing it would be $pmc_value[0], $pmc_value[1],
$pmc_value[2] to access the different pmc counters. If so, it seems like
this can become confusing (and prone to errors) if someone accidentally
did:

kernel.pmc_startwatch("L2_cache_miss").pmc_startwatch("L3_cache_miss").function("foo")

kernel.pmc_startwatch("L3_cache_miss").pmc_startwatch("L2_cache_miss").function("foo").return

In this case pmc_value[0] has two different meanings in the two
different probes, and a casual inspection of the code may not spot this.
It seems like there should be a way to be more explicit and say that a
given counter name always maps to, e.g., "instructions_retired" throughout
the script - possibly via a global declaration:

global pmc_value[0] = kernel.pmc_startwatch("event")
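
Spelled out for the cache-miss example above, the explicit binding might
look something like this (hypothetical syntax and names, just to make the
idea concrete):

global l2_miss = kernel.pmc_startwatch("L2_cache_miss")
global l3_miss = kernel.pmc_startwatch("L3_cache_miss")
global entry_l2, entry_l3

probe kernel.function("foo") {
      # names, not positions, identify the counters
      entry_l2 = l2_miss
      entry_l3 = l3_miss
}
probe kernel.function("foo").return {
      delta_l2 = l2_miss - entry_l2
      delta_l3 = l3_miss - entry_l3
}

That way, reordering the declarations can't silently change what a given
index means.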

Another usage model that is common is that some PMUs allow you to get
extra information during profiling. For example, when you get an
L2_cache_miss, there are registers available (on the ia64 architecture)
which let you get both the exact data address and the exact
instruction address of the miss (no skid). Thus, it would also be useful
to define a syntax to simply read or write specific PMU registers for
users who know what they are doing. One possibility (following along the
lines of the global declaration above):

global data_address = kernel.pmc_register("pmd[2]"),
       instr_address = kernel.pmc_register("pmd[17]"),
       data_access_latency = kernel.pmc_register("pmd[3]")
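
A profiling probe could then read those registers directly in its handler,
e.g. (hypothetical syntax, following the kernel.pmc_<event>(count)
profiling form discussed below):

probe kernel.pmc_L2_cache_miss(10000) {
      # every 10000 misses, report the precise miss addresses and latency
      printf("data=0x%x instr=0x%x latency=%d\n",
             data_address, instr_address, data_access_latency)
}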

If we look at the profiling side, one may also want to do some
combination of profiling with event counting. For example, every 10ms,
get the IP and a count of the instructions retired and cpu cycles. This
type of information is commonly used to get an overview of the system
(the lower the cycles/instruction ratio, the better you are utilizing
the cpu hardware).

probe kernel.pmc_startwatch("instructions_retired").pmc_startwatch("clockticks").time_ms(10)
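
In the handler, the two counts give the ratio directly. A rough sketch
(hypothetical, assuming the positional $pmc_value[] indexing from above,
and scaled by 100 since script arithmetic is integer-only):

probe kernel.pmc_startwatch("instructions_retired").pmc_startwatch("clockticks").time_ms(10) {
      # $pmc_value[0] = instructions retired, $pmc_value[1] = cycles
      cpi_x100 = $pmc_value[1] * 100 / $pmc_value[0]
      printf("cycles/instruction (x100) so far: %d\n", cpi_x100)
}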

Given the syntax difference between stopwatch and profile, it seems like
there can be some confusion regarding specifying an event (sometimes
it's "instructions_retired" and sometimes it's
pmc_instructions_retired(val)). Would it be reasonable to put more of
the parsing inside the string parameter? Something like
pmc_profile("instructions_retired:1000000")? I'm assuming you are trying
to stay away from two parameters: pmc_profile("instructions_retired",
1000000).

Which leads to: do you have any thoughts regarding who's parsing the
quoted string? It seems like there are a couple of possibilities:

1) have the systemtap translator do the translation directly
2) have the systemtap translator call another library (pass in the
string, get back a list of register/value pairs that need to be
touched).

There is a lot of variety in the PMU space for different processors and
those events can have a lot of additional options. Since work has been
done by other people to create event libraries, is there interest in
trying to leverage their efforts?

As for virtualization: are we willing to consider using the perfmon2
infrastructure?

Homepage: http://www.hpl.hp.com/research/linux/perfmon
API spec: http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html

My understanding is that Stephane is trying to get this into the upstream
kernel and has some initial code he's trying out for Pentium M and
x86_64 (as well as ia64). In addition to virtualization, it supports
software extension of the counters to 64 bits and arbitration between
PMU requestors. His API set was designed more for user-mode access, but
we could probably find a way to access it from the kernel (he's using
syscalls to do user->kernel parameter passing, so there might be a way to
EXPORT some of his functions to make them available to a kernel module).

-- charles


-----Original Message-----
From: fche@redhat.com [mailto:fche@redhat.com] 
Sent: Wednesday, June 08, 2005 12:49 PM
To: Spirakis, Charles
Cc: systemtap@sources.redhat.com
Subject: Re: Questions regarding systemtap syntax for profiling...


Hi -


"Spirakis, Charles" <charles.spirakis@intel.com> writes:

> [...]
> 1) Stopwatch (aka event counting): For example, see how many 
> instructions retired you have for a function
> 2) profiling (ala oprofile/papi): For example, see where your system 
> is retiring the most instructions.
> In both of these cases, time is just another event (wallclock and 
> process virtual time).

Not quite - in the "stopwatch" case, we don't respond to changes in the
time fluent, but rather to the changes in PC.


> So, a simple example to see the number of instructions retired for
> execve:
> 
> Probe begin
> 	// how do we say we want to start a pmu counter for instructions
>          retired?
> [...]
> 
> Probe kernel.function("sys_execve")
> [...]
> 	// need the current value for instructions retired. [...]
> 	thread->entry_time = kernel.pmc_instructions_retired
> 
> Probe kernel.function("sys_execve").return
> [...]
>     delta_instr = kernel.pmc_instructions_retired - 
> thread->entry_time;
> 
> Probe end
> 	// Need to indicate we are done using the pmu

I suspect a better model would be to associate PMU usage with an
individual probe handler.  How about something like this:

global value
probe kernel.pmc_startwatch("instructions_retired").function("foo") {
      value = $pmc_value;
}
probe kernel.pmc_stopwatch("instructions_retired").function("foo").return {
      delta = $pmc_value - value
}

The translator (via an internal "tapset") would map these patterns to a
pair of underlying normal (kernel.function("foo") / .return) probes, and
would also add extra code to reserve / sample / release the named counter.
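
Conceptually, the expansion might behave as if the script author had
written explicit reserve/sample/release calls (helper names made up here
purely to illustrate the mapping):

probe begin { pmc_reserve("instructions_retired") }
probe kernel.function("foo") {
      value = pmc_sample("instructions_retired")
}
probe kernel.function("foo").return {
      delta = pmc_sample("instructions_retired") - value
}
probe end { pmc_release("instructions_retired") }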


> For profiling, you may want to limit the event to a subset of cpu's,
> but it is more common to be interested in the whole system vs.
> per-process.
> Based on your email below, I'm assuming you mean:
> 
> Probe kernel.pmc_instructions_retired(1000000)
>   // capture information every 1M instructions. No cpu(0) means
>   // system wide...

Yes.


> How would you specify that you want the pmu to be virtualized 
> per-process?

systemtap is unlikely to hook into the kernel deeply enough to
*perform* such virtualization.  If such logic is already present in pmc
management APIs, then systemtap scripts could *activate* that logic
using any invented syntax, such as adding a
cpu("0-3,7,12") component to the probe point specification.  The parsing
logic for that string would again best reside within the translator.
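
For example (made-up syntax, assuming the underlying pmc management API
can restrict the counter to the listed cpus):

probe kernel.pmc_instructions_retired(1000000).cpu("0-3,7,12") {
      # handler runs every 1M retired instructions, on those cpus only
      printf("sample on cpu %d\n", cpu())
}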


- FChE

