This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: [Newbie] Help request: understanding slowdowns in a network file system


Hi, Daniel -


md1clv wrote:

> I'm trying to understand what is causing occasional slowdowns to disks
> I/O in a virtual environment I manage. [...]
> This means that any disk request from an application on a virtual
> server goes through something similar to the following layers:

> (1) Linux VFS on guest system
> (2) Hypervisor on host system
> (3) Linux VFS on host system
> (4) Gluster client FUSE module on host system
> (5) Network layer on host system
> (6) Physical network
> (7) Network layer on Gluster server system
> (8) Gluster server FUSE module on Gluster server system
> (9) Linux VFS on Gluster server system
> (10) Filesystem code on Gluster server system
> (11) Physical disk on Gluster server system
> [...]

Yup.

> The questions I have are:
> 1) Is Systemtap the right tool to help me get to the bottom of this
> problem?  If not, the rest of the questions don't matter...

It is a plausible tool to gather data for your analysis; other tools
can do at least some of the job too.


> 2) As an administrator rather than a developer I don't really know
> which system calls I need to be monitoring.  What is the best way to
> work this out?

Hey, your list of affected layers/systems didn't even include the
syscalls/userspace!  But basically read/write, if those are the
dominant operations, as opposed to memory-mapped I/O.  (You can
speculatively trace all syscalls for a process, and e.g. take official
notice of only those that take a long time to complete.)
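
As a rough illustration of the "take notice only of the slow ones" idea,
here is a minimal Python sketch that post-processes a trace.  It is not
systemtap code.  It assumes a hypothetical event log in which every
syscall entry and return has already been recorded with a microsecond
timestamp and a thread id; the Event layout, the field names and the
threshold are all made up for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    """One trace record (hypothetical layout, not a real tool's format)."""
    ts_us: int   # timestamp in microseconds
    tid: int     # thread id that made the syscall
    name: str    # syscall name, e.g. "read"
    phase: str   # "enter" or "return"


def slow_syscalls(events: Iterable[Event],
                  threshold_us: int) -> List[Tuple[int, str, int]]:
    """Pair each entry with its return and keep only the slow calls.

    Returns (tid, syscall name, elapsed microseconds) tuples, in the
    order in which the slow calls returned.
    """
    pending = {}  # tid -> (name, entry timestamp); one syscall per thread
    slow = []
    for ev in events:
        if ev.phase == "enter":
            pending[ev.tid] = (ev.name, ev.ts_us)
        elif ev.phase == "return" and ev.tid in pending:
            name, started = pending.pop(ev.tid)
            elapsed = ev.ts_us - started
            if name == ev.name and elapsed >= threshold_us:
                slow.append((ev.tid, name, elapsed))
    return slow
```

Everything faster than the threshold is simply dropped, so the output
stays small even when the process makes a great many syscalls.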


> 3) Is there a neat way to tie together requests going out of the
> client with requests coming into the server?

That's deeply protocol-specific, and is partly what makes such a
big-bang analysis job so difficult.  One needs to follow the data flow
throughout the layers, as it's encapsulated and transformed.  No
generic tool can do the job: one has to encode an understanding of all
these mappings at some point.  I'm aware of no tool that currently can
do all this, already hard-coded.  systemtap has the advantage of deep
programmability, so that you can experiment, and encode the knowledge
pretty directly (searching through data structures, following data as
it's being passed between code points, ...), without needing to
firehose-dump absolutely everything else that's going on on the
machine.


> 4) Are there any hints anyone can give on the best way to approach
> troubleshooting across several different processes, layers and
> services like this?

stap is probably suitable for gathering information on a per-host
basis, including tracking the data flow as it goes from userspace out
to the (virtual) network devices, without collecting too much extra
data.

It sounds like you have multiple hosts whose (presumably timestamped)
data you'll need to combine and analyze further.  This would need
some sort of programmable viewer/aggregator, or simply keen eyeballs.
stap per se (like most basic tracing tools) will be of no direct use
here.  We're working on arranging smoothish data flow into PCP
(performance co-pilot), so that its graphical event viewers / clients
could be used for this; this is work in progress.
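
In the meantime, a very small aggregator of the kind meant above could
look like the Python sketch below.  It assumes each host writes one
record per line as "<timestamp> <message>", that each host's log is
already in time order, and that the hosts' clocks are synchronized well
enough for the timestamps to be comparable.  The function names and the
log format are invented for the example.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[float, str, str]  # (timestamp, host, message)


def parse(host: str, lines: Iterable[str]) -> Iterator[Record]:
    """Turn '<timestamp> <message>' lines into (timestamp, host, message)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ts, _, message = line.partition(" ")
        yield (float(ts), host, message)


def merge_host_logs(logs: Dict[str, Iterable[str]]) -> List[Record]:
    """Interleave per-host logs into one timeline ordered by timestamp.

    Each host's lines must already be sorted by time; heapq.merge then
    combines the streams without sorting everything from scratch.
    """
    streams = [parse(host, lines) for host, lines in logs.items()]
    return list(heapq.merge(*streams))
```

With the client and server records interleaved on one timeline, a slow
request shows up as a long gap between a record on one host and the
matching record on the other.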

- FChE

