This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

4/20/05 meeting - minutes from morning session


Morning notes and bin list
Systemtap F2F
20 April 2005
Santa Clara CA

Attendees:
Brad Chen (Intel)
Charles Spirakis (Intel)
Ulrich Drepper (RedHat)
John Healy (RedHat)
Frank Eigler (RedHat)
K. Sridharan (Intel)
Martin Hunt (RedHat)
Will Cohen (RedHat)
Vara Prasad (IBM)
Rohit Seth (Intel)
Douglas Armstrong (Intel)
Karen Bennett (RedHat)
Elena Zannoni (RedHat)
Andrew Kleen (RedHat)
Rusty Lynch (Intel)
Anil Keshavamurthy (Intel)
Jim Keniston (IBM)
Hiam Nguyen (IBM)
Josh Stone (Intel)
Sri Doddapaneni (Intel)

Morning Presentations
 9:30	Run time and kernel to user data transport (Martin)
10:00	Kprobes/Jprobes and their enhancements (Jim and Hien)
10:30	Build and test (Will)
11:00	Safety (Brad)
11:30	Scheduler and lock tapset (Doug)

======================================================================

Frank - Systemtap overview and history
- Solaris DTrace has made a lot of noise in the press
- Translator-based system, using simple scripting language like awk or
DTrace
- compile into kernel modules

Progress since initial concept
- a number of disposable prototypes
- RedHat work
  - Martin's runtime code
  - translator work
- IBM side
  - kprobes extensions
  - contributions from Jim and others to study open-holes in design

Expectation: something shareable/workable starts in a few weeks.

Q: When did development start in earnest (Rusty)
Frank: three or four months ago; mid-late January. Coding has mostly 
been happening in the background, partly because design is still 
developing. Some of code is throw away, hopefully not all of it.

Q: Is translator optional or required?
Frank: required.

Karen: If you see key opportunities or things to fix, please speak up.

===========================================================

Martin - Systemtap Runtime
[see also slides Martin posted to the mailing list]

The Runtime is:
- C code used by translator
- compiled into kernel module
- could be used to write providers (an option that might or might not 
be supported)

Goals
- safety. Don't crash or corrupt the system
- Low system impact. No kernel stack usage. (limited stack usage)
- High performance
- flexible
- Ease of use - least of goals; mostly used by translator.

Uli - May need to use signal stack.
Martin - exists for all architectures? 
Uli - Yes.
[Some detailed discussion with Uli and Rohit about use of signal 
stack and how to use it and avoid abusing it. Note binlist item
on figuring out what to do for a stack...]
Rohit: we might also want to talk about branches vs. breakpoints, 
at some point.

Possible kprobes work item: worry about use of the signal stack.
Will: haven't run into this problem in oprofile yet; routine is simple
and 
light; 2-3 call levels deep at most.
Charles: VTune driver is similar in not using too much stack.
Uli: Some device drivers are il-behaved. Safest bet is to use a separate
stack

TEST CASE: find the worst-case case signal-stack kernel code 
(probably a driver), and instrument it kprobes/systemtap to see what 
happens.

Uli: safest plan: make a new private stack of limited size, don't use 
the kernel or signal stack.
Vara: Will kernel folks be okay with a new stack?
Rohit: kprobes is already there; they shouldn't be overly concerned.
Jim voluteers to go the next step on this issue.

[back to Martin]
What's in the Runtime:
- associative arrays 
	- "maps"
	- keys can be 1 or 2 strings or longs
	- values: int64, string, or statistics
	- maps have a max size, preallocated
- K to U transport 
	- relayfs or netlink, both very fast
	- Q: Security issues for uplink? (Rusty) [BIN LIST]
- output formatting
	- relatively straightforward
- backtraces
	- backtrace kernel and user space works today
	- issue: don't want to do symbolic lookups in user space
		- might use dcookies; oprofile style
	- 
- copy from userspace funcs
- backtrace and register dump
- safe strings
	- strings should not be allocated dynamically
	- can't use the stack either
	- soln: runtime uses statically allocated string space

Arch diagram:
	- stpd daemon interacts with netlink or relayfs, generates files
	Frank: Why a daemon rather than a command-line tool?
	Martin: historical
	- one stpd per loadable systemtap module
	- multiple stpds can run simultaneously
	- stpd loads and unloads modules
	- data saved to per-cpu data files
	- note: two users can't load the same module at the same time
	  potential issue: two users who want to run the same module,
with
	  pre-compiled modules
	Q: How big would the maps for modules be? Space is limited (GB)
	Frank: not nearly that big
- Postprocessing:
	- done by stpd
	- deferred symbolic lookups
	- per-cpu data files need to be integrated and saved, etc.
- other things needed by translator

=================================================================

Jim - KProbes
- a facility for inserting a breakpoint in the kernel; specify a handler

function when the breakpoint is hit.
- currently suports x86, x86-64, ppc-64
- Rusty: IPF work underway; no important differences so far between 
AMD64 and EM64T
- Key new features for Systemtap: multi probes, return probes

Multi-probes: multiple handlers for the same address
- two competing implementations: Prasanna:
- Ananth has posted a patch which seems to be okay to most people
Will be submitting this patch to MM. Patch is out on lkml.

Return probes: fires when a function returns
- Basic implementation: probe point at entry replaces RA on stack
  with address of "trampoline." On return, transfer to trampoline
  where a probe is waiting. Handler is invoked etc. Trampoline
  code is a no-op; just a place to install a probe-point (kprobe)
- Q: Has this been tried on IA64? [Jim - No]. The IA64 might need
  some special attention...
- GDB might provide a relevant point of comparison for IA64 support.
- Vara - x86, x86-64, and PPC all supportable.
- Brad: Intel folks need to check that API is supportable on Itanium.
- Jim - refer to design document on systemtap mailing list "return
  probes" or "exit probes".
- Ananth on phone: maintains kprobes; in residence at RedHat (Westford)
  for six months.
- Performance improvements: smarter locking for SMP support.
  Ananth talks about specific work he's doing. Simpler locking
  mechanisms to address outstanding issue...

Last items on Jim's list. Ongoing discussion on mailing list for
these issues (x86-64 RIP-relative addressing, x86_64 wart removal.)

Rohit: holding a semaphore in the kernel? how would you instrument
the interrupt path?
When you register a probe, must acquire a spin-lock...
Only in the probes setup, not use.
Current solution: pre-allocate the memory, one "slot" or cache-line per
CPU.
- eliminates need for semaphore
- concern about cache ping-pong: don't want to have processors bouncing
  cache-lines back and forth as a part of kprobe setup, or (?)
  singlestepping of probed instructions.
- basic problem is self-modifying code. Want all self-modifying code
  to happen in a single unique address that is not shared by other CPUs.
- issue of allocating multiple locations instead of a single location:
  execute bit, and 2G relative addressing limitation for x86-64
- Rohit taking the lead on understanding this issue for Intel

=======================================================================

Will - Testing
Goals
- ensure systemtap doesn't crash machines
- show customers it's tested and safe
- find and quickly fix faults
- provide guidiance on instrumentation costs and limitations

Rohit - customer visits. Everybody mentions DTrace, everybody
expects Systemtap to be as robust as DTrace.
Will - 

What's checked in:
Components
- runtime, parser/translater, kprobes
system 
- integrated components

Testing Issues
- User space environment different than kernel
	- limited kernel stack space
	- copy from user
	- some operations not available in user space, such as accessing
	  PMU hardware
- need testing on various hardware and kernel configs, big systems and
apps
Frank: (to IBM) To what degree do you expect to be able to stress-test
system?
Frank: to what degree is IBM and Intel interested in this system working
on
"vanilla upstream" kernels?
Rohit: we are very interested
Vara: provides some detail on what kind of testing...
Rusty: the more public the test repository, the easier it is for Intel
to 
help with the testing.
- non-determinism - fact that kernel is non-deterministic.

Runtime test plan
- Runtime library user-space function testing
  - verify arg checking
  - verify in user space that all paths through code exercised 
    (use GCC profiling support)
  - check that memory leaks don't occur
- Runtime library kernel space testing
  - individual functions
  - data transport (only works in the kernel)
  - failure modes (out of memory, etc.)

Parser/Translator component testing
- identify invalid input
- make sure parser/translator generate valid constructs
  [tap validator might have a role on these]

KProbe and Kernel Instrumentation Component Testing
- Verify expected functionality of instrumentation
- torture tests
  - maximum rate calling instrumented function
  - repeated activation and deactivation
  - attempt to create race conditions
  - attempt to create recursive kprobes
- non-instrumentable locations
- get overhead estimates, max number of probes, etc.

RedHat Test Grid
- internal system to reserve and manage test resources
- collection of machines
- remote KVM - allows bios and boot-up config, power-cycle
- CVS repository for tests (gdb, oprofile, kernel torture (e.g.
cerberus)
- Most of the machines are in Westford.
- not accessible outside
- problem: testing on really large systems (64-way, terabyte memory...)
- Roland: might be good to figure out how to share kernel test suites
- Intel and IBM folks on-site at RedHat do have access to these test
suites

RPM and RedHat build system
...

========================================================================
Brad on Safety

Brad presented the Systemtap plans for safety. Much of
this has been worked through on the mailing list already.
Two undecided items:
- Tap Validator. Disassemble the tap and analyze for
  safety properties. Brad did a simple demo of how illegal
  instructions and confusing control flow could be exposed.
- Memory Portal, and, separating Policy from Mechanism.
  Brad presented a number of simple policies, including
  proposals for definitions of "safety" and "guru" modes.
  separating Policy and Mechanism could be done with or 
  without the memory portal; Brad will be posting to the
  mailing list to start a discussion on this topic.

Frank liked this table; suggested we put it into the arch paper.

				       Language Design
                                 |  Language Implementation
                                 |  |  insmod checks
                                 |  |  |  Runtime Checks
                                 |  |  |  |  Memory Portal
                                 |  |  |  |  |  Static Validator
                                 |  |  |  |  |  |
-------------------------------------------------------
infinite loops                      x     x     o
recursion                           x     x     o
division by zero                    x     x     o
-------------------------------------------------------
array bounds errors              x  x     x     o
invalid pointer errors           x              o
heap memory bugs                 x              o
-------------------------------------------------------
illegal instructions             x           o
privileged instructions          x           o
memory read/write restrictions   x        x  o  o
-------------------------------------------------------
memory execute restrictions      x        x  o  o
version alignment                      x
end-to-end safety                            x  x
separate policy from mechanism                  x
-------------------------------------------------------

There are lines to add to this table...

Another suggestion to draft some material for a security
section; that Systemtap doesn't change security of system,
possibility of someday supporting unpriviledged users with
appropriate memory protection etc.

========================================================================

Douglas Armstrong 
Synchronization and locking tapset

Intel Thread Profiler team
[other tool is Thread Checker]
Most requirements derived from Intel Thread Profiler; others
from Thread Profiler or other general parallel programming needs.

Tool Interface:
- kernel tools: safety, hard to maintain
- expensive ring transitions
- tap events buffered and dumped to file: space issues
- tap events buffered and sent to user level tools in batches
- double buffering/memory mapped....
Vara: relayfs is a memory-mapped kernel buffer

Q from Will: User threads or Kernel threads?
- We'd like to catch both...
- Even with kernel threads, there are parts of synchronization that
  are only implemented in user space.

Q: Frank - why do you want so much raw data, as opposed to aggregation?
- critical path analysis: requires script
- why not in kernel? memory usage, existing user-space implementation
- memory requirements: 500 GB of raw events

Q: Frank, Vara - it would be very helpful to understand the high-level
goals, instead of just talking about low-level details. Not
understanding
why traces are required.
Douglas: events are commonly calls into pthreads

Synchronization of Interest
- system synch objects
- user synch objects implemented by kernel
- user synch objects implemented in user space
- user synch objects; multi-level implementation

User synch objects
- mutex
- ...
- some objects have implementation split across user and kernel level
- 3rd party synchronization libraries

Blocking operations of interest
- File I/O
- Standard I/O
- ...

Synchronization taps:
- spin
- block
- ...
- notion of a tap that might correspond to a union of a set of
  kernel events

vs. DTrace: 
- Dtrace break down is based on object type. It would be nicer to 
have a generic set, with top of object and particular object 
available as data
- different flavors of acquire and release locks to expose contention.
- also a "cancel" event (timeout waiting on lock)

Sync Tap Data - What you would want (partial list)
- thread
- signaled thread
- call stack
- pid
- ...

Frequency of events?
- mutexes - 10s of thousands per second
  - mostly user-only, uncontended
  - would not commonly be seen in well-written applications

Q: How do you confirm that your measurement is not adversely impacting
  the application behavior?
  - have seen cases where 100s of microseconds causes big problems
  - in other cases, milliseconds of events are not a problem

Roland: Overall being able to have zero or tiny overhead for 
uncontended cases is important.

Frank: Relation of fast instrumentation of non-contested locks and
DTrace predicates? Are predicates fast enough? Worthy of immitating
in Systemtap? Possibility of using a predicate to identify
contended/uncontended, with predicate running at user level?

Process and thread create/destroy/manage events
- thread needs separate creation and start events
- internal scheduler events: when scheduler makes internal
  decisions that impact scheduler policy.

Frank: so far, all this looks easy to implement in script

Scheduler perspective on active/inactive/on/off/yield/switch events
- contrast to synch perspective
- aggregating to reduce instrumentation points is a useful optimization

Will: how about interrupts?
- Worried about accurate accounting for interrupt handling time?
- Probably should be included in the list

Brad: also would like to expose affinity requests

Random taps:
- I/O
- Sampling (with randomization)
- Module taps - loading/unloading of modules
- Signal taps - events for signals and handlers
- begin/end taps
- Spin - ability to have the user-level probe
- User function entry/exit
  - all, rather than selective
  - it appears that DTrace star notation lets you express this
- User-code traps...
  - all, rather than selective

Sri: recently, JVMs exporting hooks that support DTrace

======================================================================

Bin List
- Stack - "minimal stack usage" vs. alternate stack for kprobles [JK]
- container/keys larger than long for associative array indexing? [MH]
  Physical address extension - 36 bits. 
- Security model - user transport [MH]
  SE Linux, more complex than root vs. !root (one possible model)
  deferred; won't be complete for U2
- kernel models - implications of trying to load same module [FchE]
  (in some form) twice. Also, other limitations of module address space,
  vmalloc, static area, etc.
- IA64 stack & instruction architecture is "unique". Be aware [RS]
  that traditional IA32 style assumptions will lead to issues.




Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]