This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: Pass 4 (non-)optimization speedup


Hi Frank,

On Sat, 2009-07-11 at 21:51 -0400, Frank Ch. Eigler wrote:
> Mark Wielaard <mjw@redhat.com> writes:
> 
> > To make pass 4 a bit more flexible I added -O[0123s] as arguments to
> > stap (commit 5a5732). [...]
> > The default is -O0 which makes pass 4 a lot faster, so I think this is a
> > good default. [...]
> 
> No, it is unlikely to be a good default.

Maybe, like I said: "But maybe the default could be tuned...". It
definitely makes a difference when running stap interactively. When
trying out some probes to pinpoint something good to measure, it
really matters whether a quick try takes 6 or 3 seconds. It just feels that
much snappier. But for non-interactive work, it might be good if stap
defaulted to some higher optimization. Anyway, the user has a choice now
and we can always try to tweak the default depending on work load.

>   Frankly, I'm surprised it
> even compiles, since some kernel code is unbuildable without
> optimization.

And that surprises me. Obviously I tried it out on some different
kernels/architectures (i386/x86_64/fedora/2.6.29/rhel/2.6.18) before
implementing it. And afterwards of course I ran make installcheck to
make sure there were no regressions (there weren't, it just was minutes
faster!). If you have any examples of things that won't compile with the
new default please do let me know (and preferably add them to the
testsuite). 

>   Anyway, the code generated by systemtap is complex
> enough that with optimization disabled, it is bound to run measurably
> slower.  Please test some nontrivial probes with -t before & after.

Indeed, I should have measured that also. Here are some measurements for
the topsys.stp example, modified to report only 12 times (add a global
count = 12; and an if (--count == 0) exit() to the timer probe).
gcc (GCC) 4.4.0 20090506 (Red Hat 4.4.0-4), 2.6.29.5-191.fc11.i586.

$ stap -O0 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 6850usr/1160sys/8130real ms.
Pass 5: starting run.
[...]
probe syscall.* (topsys.stp:10:1), hits: 21048, cycles: 1045min/2736avg/10021max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 147235min/153708avg/171006max
Pass 5: run completed in 40usr/80sys/60163real ms.

$ stap -O1 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 8090usr/980sys/10815real ms.
Pass 5: starting run.
[...]
probe syscall.* (topsys.stp:10:1), hits: 252713, cycles: 726min/1561avg/537218max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 68442min/96426avg/128172max
Pass 5: run completed in 40usr/80sys/60197real ms.

$ stap -O2 -v -t topsys.stp
[...]
Pass 4: compiled C into .[...] in 10120usr/1170sys/11433real ms.
Pass 5: starting run.
probe syscall.* (topsys.stp:10:1), hits: 79132, cycles: 671min/1669avg/5962max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 82896min/90516avg/117381max
Pass 5: run completed in 30usr/100sys/60151real ms.

So each optimization level adds between roughly 1 and 2 seconds of user
time to the pass 4 compile (6.9s, 8.1s, 10.1s), and the average number
of cycles per probe is clearly lower at -O1 than at -O0. The win between
-O1 and -O2 is much smaller: the timer probe average drops a bit further
(96426 to 90516) while the syscall.* average stays about the same (1561
vs 1669).
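
To put numbers on that comparison, here is a small Python sketch that
recomputes the extra pass 4 time and the change in average cycles per
hit between levels. The figures are copied from the stap output above;
the dictionary layout and the compare() helper are only for
illustration.

```python
# Figures copied from the three runs above: pass 4 times in ms and
# average cycles per hit for each probe.
runs = {
    "-O0": {"pass4_usr": 6850, "pass4_real": 8130,
            "syscall_avg": 2736, "timer_avg": 153708},
    "-O1": {"pass4_usr": 8090, "pass4_real": 10815,
            "syscall_avg": 1561, "timer_avg": 96426},
    "-O2": {"pass4_usr": 10120, "pass4_real": 11433,
            "syscall_avg": 1669, "timer_avg": 90516},
}


def compare(base, other):
    """Return extra pass 4 time (ms) and avg-cycle ratios of other vs base."""
    b, o = runs[base], runs[other]
    return {
        "extra_usr_ms": o["pass4_usr"] - b["pass4_usr"],
        "extra_real_ms": o["pass4_real"] - b["pass4_real"],
        # A ratio below 1.0 means fewer cycles per hit than the base level.
        "syscall_ratio": round(o["syscall_avg"] / b["syscall_avg"], 2),
        "timer_ratio": round(o["timer_avg"] / b["timer_avg"], 2),
    }


for base, other in (("-O0", "-O1"), ("-O1", "-O2")):
    print(base, "->", other, compare(base, other))
```

Note that the syscall.* hit counts differ a lot between the runs (21048,
252713, 79132), so the per-hit averages come from different workloads
and are only a rough guide.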

What would be some good other scenarios to try out?

I would really like to have -O0 be the default for interactive (-e
<script>) usage, since it is really so much faster compiling in pass 4.
But you are right that it does impact the overhead per probe at runtime
in pass 5 significantly. Maybe we could make -O1 the default when stap
is given a script and -O2 for when the compile server is used? 
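
Spelled out as code, the proposed defaults would look something like the
Python sketch below. This is purely hypothetical and not how stap is
implemented: the function name and both parameters are made up, and
letting the compile server case win when both apply is my assumption.

```python
def default_opt_level(inline_script, use_compile_server):
    """Pick a pass 4 -O level following the defaults proposed above.

    Hypothetical helper, not part of stap; names are invented.
    """
    if use_compile_server:
        return "-O2"  # compile server: highest level proposed
    if inline_script:
        return "-O0"  # interactive -e usage: fastest pass 4
    return "-O1"      # script file given: lower per-probe overhead


print(default_opt_level(inline_script=True, use_compile_server=False))
```

An explicit -O option on the command line would of course still
override whatever default is picked.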

Cheers,

Mark

