This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.


Re: help needed with froked process becoming zombies

From: Christoph Anton Mitterer <calestyo at scientia dot net>
To: Carlos O'Donell <carlos at systemhalted dot org>
Cc: libc-help at sourceware dot org
Date: Mon, 27 Aug 2012 03:18:27 +0200
Subject: Re: help needed with froked process becoming zombies
References: <1345773623.3377.41.camel@fermat.scientia.net> <CADZpyiwadZWocSQQ21APTXmYthXbDh1yg0dc_2AZ-V-tSBiOAg@mail.gmail.com>

Hi Carlos.

On Sun, 2012-08-26 at 10:04 -0400, Carlos O'Donell wrote:
> (1) What problem are you trying to solve?
Well, the idea is that checks are executed remotely via ssh.
So Icinga/Nagios will do something like:
ssh <somehost> <somecommand>

To speed this up dramatically, ssh's control channel multiplexing
feature is used, in auto mode.

That means the first time I call
ssh <somehost> ...
ssh forks a mux process that will live forever (one can configure it to
stop when it hasn't been used for XXXX seconds).

The parent process is the "normal" ssh connection that executes
<somecommand>. The child is the mux process mentioned above.

Next time one calls ssh again, the mux process will be used for the
connection... everything's much faster.

When I do this from the shell, the first time, exactly what is described
above happens.
The mux process gets forked, the parent process runs the remote command
and immediately returns with its stdout.
The mux process lives on as it should.

What happens on the first time when this is done from within
Nagios/Icinga:
The mux process gets forked,.. the parent process runs the remote
command... but then becomes a zombie.
The mux process lives on as it should.

On further connections everything works well, as the mux is already
there, thus no need anymore to fork.

The forker.c was just a simpler simulation of all that.
And I "proved" with it that the parent actually runs through, i.e. I
wrote to some file just before its exit().
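
A minimal sketch of what such a simulation might look like. The actual
forker.c is not included in this message, so the function names, the
marker file and the lifetimes below are hypothetical; only the overall
shape (a switch on the result of fork(), a long-lived child, a parent
that writes a marker and exits) follows the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for forker.c: fork a child that outlives the caller.
 * Returns the child's PID in the parent, or -1 if fork() failed. */
pid_t fork_long_lived_child(unsigned int child_lifetime_seconds)
{
    pid_t pid = fork();

    switch (pid) {
    case -1:
        perror("fork");
        return -1;
    case 0:
        /* Child: plays the role of the ssh mux process. */
        sleep(child_lifetime_seconds);
        _exit(EXIT_SUCCESS);
    default:
        /* Parent: plays the role of the ssh process running the command. */
        return pid;
    }
}

/* Append a marker line to `path` as evidence that the parent got this far.
 * Returns 0 on success, -1 if the file could not be opened or closed. */
int write_marker(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "a");
    if (f == NULL)
        return -1;
    fprintf(f, "parent %ld about to exit\n", (long)getpid());
    return fclose(f) == 0 ? 0 : -1;
}
```

A main() for this would call fork_long_lived_child() with a large
lifetime, then write_marker() with some writable path, and then exit();
the child keeps running and is reparented once the parent is gone.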

> It seems to me that the zombie processes are a result of the problem
> and not the problem itself. The problem itself is that you get
> timeouts.
Phew well,... I have really absolutely no idea.
But I'd guess not (but again,.. I'm really not too deep into
Nagios/Icinga code or Linux/POSIX process/signal handling)... because
the parent check process becomes a zombie, immediately.... and not after
the 60s timeout of Nagios/Icinga.

And when that timeout expires, the parent (which is by then a zombie)
goes away... while the child (the mux process in case of ssh) continues
to run (as it should).

I even have no idea how Icinga/Nagios can make the zombie go away,...
because obviously it cannot SIGKILL it, right?

> If that's the case then you need to provide a better test case.
Well using ssh is the real test case... I could send you the
configuration how to setup the control channel multiplexing... but that
of course would require you to have an Icinga installation.

> (2) Strace logs.
> 
> In order to determine what is going wrong you need to provide full
> strace logs of everything you are running. I suggest using the -ff and
> -ttt options and -o options to output one log file per PID. That way
> we can look at the behaviour and see what goes wrong.
Ok I'll try to do that if you still think it makes sense, after what I wrote in this email.
Could take me some time though, as I don't have real test systems for
this.
Just some production nodes.

> (3) forker.c code.
> 
> I'm not going to look at Nagios code, but I *am* going to look at your
> example forker.c code.
Hmm I really fear that the problem is rather in the Nagios/Icinga
code,...given that everything works just well when I call it from the
shell; both with the real ssh case, or with the forker.c dummy:
The parent exits, the child stays alive.

> You must call wait on the child or the kernel keeps around the child
> in order to deliver the return value.
Uhm... I can't do this, right?
If I wait/waitpid on the child, the parent won't exit until the child
does (or at least changes its status)... but the child is intended to
live on "forever".
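
For reference, waitpid() only blocks when called without WNOHANG. A
minimal sketch of a non-blocking check, assuming hypothetical helper
names (spawn_sleeper and try_reap are not from forker.c):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that sleeps for `seconds`, then exits.
 * Returns the child's PID, or -1 if fork() failed. */
pid_t spawn_sleeper(unsigned int seconds)
{
    pid_t pid = fork();
    if (pid == 0) {
        sleep(seconds);
        _exit(EXIT_SUCCESS);
    }
    return pid;
}

/* Non-blocking reap attempt.
 * Returns 1 if the child had terminated and was reaped,
 *         0 if it is still running,
 *        -1 on error (for example, `pid` is not a child of the caller). */
int try_reap(pid_t pid)
{
    assert(pid > 0);

    pid_t r = waitpid(pid, NULL, WNOHANG);
    if (r == pid)
        return 1;
    if (r == 0)
        return 0;
    return -1;
}
```

This does not help a parent that wants to exit while the child lives on,
since there is nobody left to call try_reap() later; in that case the
child is simply reparented when the parent exits.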

> The alternative is to ignore
> SIGCHLD in the parent and then the kernel knows it should not keep the
> child around and should reap the child on exit (not leaving a zombie).
I tried that,... added a:
signal(SIGCHLD, SIG_IGN);
to the default: case of the switch... no change.
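
Note that the SIGCHLD disposition applies to the children of the process
that sets it. Set in the forker.c parent, it only affects how that
parent's own child is handled once it terminates; it does not change
whether whoever started the parent reaps it. A minimal sketch of the
effect on Linux, with a hypothetical function name:

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ignore SIGCHLD, fork a child that exits at once, then try to wait for it.
 * Returns the errno set by waitpid(), 0 if waitpid() unexpectedly
 * succeeded, or -1 if signal() or fork() failed. */
int wait_errno_with_sigchld_ignored(void)
{
    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(EXIT_SUCCESS);    /* child terminates immediately */

    assert(pid > 0);

    /* With SIGCHLD ignored, the child's exit status is discarded when it
     * terminates, so no zombie is left and there is nothing to collect. */
    if (waitpid(pid, NULL, 0) == -1)
        return errno;
    return 0;
}
```

Here waitpid() is expected to fail with ECHILD, which shows that the
calling process's own terminated child left no zombie behind.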

> If the parent exits or dies before the child is waited upon then the
> child is reparented to init, and then init will reap the child.
I can confirm this... (I mean that init becomes the new parent of the
check process's child, the mux process, not of the check process itself.)
But then.... AFAIK init only reaps periodically, right? So likely it
reaps after Icinga's/Nagios' timeout of 60 seconds kicks in.
But now,... the child you're speaking about is the mux process or the
child of forker.c, right? It's not the child that Icinga/Nagios created
(i.e. the "actual" ssh.... respectively the parent-forker.c)

So the mux process will become a child of init,... and that's just fine,
isn't it... it should live on there "forever".
It's the "actual" ssh (the process that executes the remote command),
respectively the parent-forker.c, that becomes a zombie.

wait(2) says:
       A child that terminates, but has not been waited for becomes a
       "zombie".  The kernel maintains a minimal set of information about
       the zombie process (PID, termination status, resource usage
       information) in order to allow the parent to later perform a wait to
       obtain information about the child.  As long as a zombie is not
       removed from the system via a wait, it will consume a slot in the
       kernel process table, and if this table fills, it will not be
       possible to create further processes.  If a parent process
       terminates, then its "zombie" children (if any) are adopted by
       init(8), which automatically performs a wait to remove the zombies.

Now IMHO the process tree at first looks like the following:
nagios/icinga
|
+--ssh(command) / parent-forker.c
   |
   +--ssh(mux) / child-forker.c

Then ssh(command)/parent-forker.c exits while ssh(mux)/child-forker.c
does not...
What happens (as my tests show):
nagios/icinga
|
+-(zombie)-ssh(command) / parent-forker.c

init
|
+--ssh(mux) / child-forker.c

Then I guess Nagios/Icinga thinks the process
(ssh(command)/parent-forker.c) is still running, and after 60s it tries
to "kill" it; then it really goes away, but Nagios/Icinga thinks it
timed out.

Maybe Nagios/Icinga don't wait for ssh(command)/parent-forker.c ... and
that's why it becomes a zombie.
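
For comparison, a minimal sketch of the reaping step on the side of
whatever starts the check. This is not Nagios/Icinga code (I have not
looked at how they do it); the function name and the simulated check are
hypothetical. It only illustrates what wait(2) describes: the zombie
entry is removed once the parent waits for the child.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a simulated "check" that exits with `check_exit_code`, then reap it.
 * Returns the check's exit code, or -1 on error or abnormal termination. */
int run_check_and_reap(int check_exit_code)
{
    assert(check_exit_code >= 0 && check_exit_code <= 255);

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(check_exit_code);     /* child: the simulated check */

    /* Parent: until this waitpid() returns, a terminated child stays in
     * the process table as a zombie. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the starting process never reaches such a waitpid() call for the
check, the terminated check remains a zombie for as long as that process
keeps running.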

Thanks,
Chris.


Follow-Ups:
- Re: help needed with froked process becoming zombies
  - From: Siddhesh Poyarekar

References:
- help needed with froked process becoming zombies
  - From: Christoph Anton Mitterer
- Re: help needed with froked process becoming zombies
  - From: Carlos O'Donell
