This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Signal handling and SuspendThread


I've been trying to track down a lockup problem that I found while
running the mysql build under the Electric Cloud build accelerator
on a dual-processor hyperthreaded system running Windows Server
2003 SP1.  A bash process would sit there hogging a CPU and making
no progress on the build.  This reproduced once every three to
five complete builds.

I loaded up a debug bash and a debug build of cygwin1.dll (1.5.18)
and attached with gdb.  The main thread's call stack was:

#0  0x7c82ed54 in ntdll!LdrAlternateResourcesEnabled ()
   from /cygdrive/c/WINDOWS/system32/ntdll.dll
#1  0x7c822194 in ntdll!ZwYieldExecution ()
   from /cygdrive/c/WINDOWS/system32/ntdll.dll
#2  0x77e4ad7b in SwitchToThread ()
   from /cygdrive/c/WINDOWS/system32/kernel32.dll
#3  0x61051285 in low_priority_sleep (secs=2)
    at ../../../../winsup/cygwin/miscfuncs.cc:339
#4  0x61086256 in _sigfe () at ../../../../winsup/cygwin/cygserver.h:82
#5  0x610abee0 in wait3 () at ../../../../winsup/cygwin/cygerrno.h:31
#6  0x0041d934 in sigchld_handler (sig=20) at jobs.c:2508
#7  0x610862cc in _sigbe () at ../../../../winsup/cygwin/cygserver.h:82

In thread 1:

(gdb) info w32 selector $fs
Selector "$fs"
0x03b: base=0x7ffde000 limit=0x00008000 32-bit Data (Read/Write, Exp-up)
Priviledge level = 3. Byte granular.
(gdb) x/3x 0x7ffde000
0x7ffde000:     0x0023ef88      0x00240000      0x0023a000
(gdb) print ((_cygtls *)0x240000)[-1].stacklock
$19 = 1
(gdb) print ((_cygtls *)0x240000)[-1].spinning
$20 = 1
(gdb) print ((_cygtls *)0x240000)[-1].incyg
$21 = 0

So someone else had the stack lock and _sigfe() was spinning on
it.  The signal thread was innocently blocked on ReadFile():

#0  0x6108c20b in wait_sig (self=0x610e9d18)
    at ../../../../winsup/cygwin/sigproc.cc:1029
#1  0x61003414 in cygthread::stub (arg=0x610e9d18)
    at ../../../../winsup/cygwin/cygthread.cc:73
#2  0x61004084 in _cygtls::call2 (func=0x610032f0 <cygthread::stub (void*)>,
    arg=0x1f38, buf=0xe2f010) at ../../../../winsup/cygwin/cygtls.cc:93
#3  0x610040ca in _cygtls::call (func=0, arg=0x0)
    at ../../../../winsup/cygwin/cygtls.cc:82

I spent quite a bit of time puzzling over what could be going wrong;
eventually, I gave our chief architect the guided tour of the signal
handling code and he wondered if SuspendThread() might return before
the suspended thread has halted.  (The MSDN documentation doesn't
actually specify this one way or the other.)

So I crafted a small test program (attached) to verify this, and with
enough repetitions, it does indeed show the problem:  I've seen the
loop counter increment by as much as 32 after SuspendThread has
returned.

On a 3.06GHz machine, it'll reproduce inside of ten minutes.  On a
2.80GHz machine, it takes more like half an hour.  (It reproduces faster
if you run the test in multiple windows at the same time; I usually
see it within two minutes if I have three windows running suspend.bat
on the 3.06GHz machine.)

In setup_handler(), Cygwin does this:

      res = SuspendThread (hth);
      /* Just set pending if thread is already suspended */
      if (res)
	{
	  ResumeThread (hth);
	  break;
	}
      if (tls->incyg || tls->spinning || tls->locked ())
	sigproc_printf ("incyg %d, spinning %d, locked %d\n",
			tls->incyg, tls->spinning, tls->locked ());
      else
	{
	  cx.ContextFlags = CONTEXT_CONTROL | CONTEXT_INTEGER;
	  if (!GetThreadContext (hth, &cx))
	    system_printf ("couldn't get context of main thread, %E");
	  else if (interruptible (cx.Eip))
	    interrupted = tls->interrupt_now (&cx, sig, handler, siga);
	}

If the timing is *just* wrong, you can get this sequence of events:

main thread                     signal thread
-----------                     -------------
enter _sigfe()
                                SuspendThread()
                                read tls->incyg
                                read tls->spinning
                                read tls->stacklock
acquire stacklock
halt
                                GetThreadContext()
                                ...
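
To make the ordering problem concrete, here is a small single-threaded
model that replays the steps in the table one at a time.  It is only a
sketch: TlsModel, looks_interruptible() and
replay_interrupts_lock_holder() are hypothetical names of mine, not
Cygwin code, and the model assumes (as the test program suggests) that
the target keeps running for a moment after SuspendThread() returns.
For contrast, it also replays the same check made after the target has
really halted.

```cpp
#include <cassert>

// Hypothetical stand-in for the three per-thread flags that
// setup_handler() inspects; the real _cygtls structure is much larger.
struct TlsModel {
    bool incyg = false;
    bool spinning = false;
    bool stacklock = false;
};

// Mirrors the test in setup_handler(): interrupt only if no flag is set.
bool looks_interruptible(const TlsModel &tls) {
    return !(tls.incyg || tls.spinning || tls.stacklock);
}

// Replays the interleaving step by step in a single thread.
// If check_after_halt is false, the flags are read right after
// SuspendThread() returns, while the target is still running.
// Returns true if the signal thread would go on to interrupt a thread
// that is holding its stack lock.
bool replay_interrupts_lock_holder(bool check_after_halt) {
    TlsModel tls;                      // main thread has just entered _sigfe()
    assert(looks_interruptible(tls));  // the model starts with all flags clear
    bool decision = false;

    // Signal thread: SuspendThread() has returned, target not yet halted.
    if (!check_after_halt)
        decision = looks_interruptible(tls);  // reads all-clear flags

    // Main thread: still running, acquires the stack lock, then halts.
    tls.stacklock = true;

    // Signal thread: the target is now really stopped.
    if (check_after_halt)
        decision = looks_interruptible(tls);  // sees the lock and backs off

    // A halted lock holder must not be interrupted.
    return decision && tls.stacklock;
}
```

In this model, replay_interrupts_lock_holder(false) returns true (the
signal thread acts on a stale all-clear reading), while
replay_interrupts_lock_holder(true) returns false.  The model says
nothing about how to make the real code wait for the halt; it only
shows why the order of the reads matters.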

I *think* the right thing to do is to move the call to tls->unlock()
to just before the calls to ResumeThread(), but IANACE
<http://cygwin.com/acronyms/#IANACE>.

I'll try to have a patch by end of day today; I'll be on vacation
until January 3, though, so it'll be a while before I can
respond to critique.


An attachment named suspend.bat was removed from this document as it
constituted a security hazard.  If you require this document, please contact
the sender and arrange an alternate means of receiving it.


// gcc -mno-cygwin -Wall suspend.cpp -o suspend_mingw.exe -lstdc++
// cl /EHsc /Zi /I C:/cygwin/usr/local/tools/i686_win32/vc7/Vc7/include /I C:/cygwin/usr/local/tools/i686_win32/vc7/Vc7/PlatformSDK/Include suspend.cpp /link /libpath:C:/cygwin/usr/local/tools/i686_win32/vc7/Vc7/lib /libpath:C:/cygwin/usr/local/tools/i686_win32/vc7/Vc7/PlatformSDK/Lib

#include <windows.h>
#include <cstdlib>   // exit()
#include <iostream>

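struct counter {
    // volatile: both fields are shared between threads, so the compiler
    // must not cache them in registers.
    volatile int count;
    volatile int done;
};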

DWORD WINAPI Counter(LPVOID addr)
{
    counter *c = reinterpret_cast<counter *>(addr);

    while (!c->done) {
        ++c->count;
    }
    return c->count;
}

int main(int argc, char *argv[])
{
    counter c = { 0, 0 };
    DWORD id;

    HANDLE h = CreateThread(NULL, 0, Counter, &c, 0, &id);
    if (h == NULL) {
        std::cerr << "CreateThread():  " << GetLastError() << std::endl;
        exit(1);
    }

    Sleep(1000);

    // SuspendThread() returns a DWORD; failure is (DWORD)-1, never < 0.
    if (SuspendThread(h) == (DWORD)-1) {
        std::cerr << "SuspendThread():  " << GetLastError() << std::endl;
        exit(1);
    }
    int early = c.count;
    Sleep(1000);
    int late = c.count;
    c.done = 1;
    // ResumeThread() also reports failure as (DWORD)-1.
    if (ResumeThread(h) == (DWORD)-1) {
        std::cerr << "ResumeThread():  " << GetLastError() << std::endl;
        exit(1);
    }

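    // Wait for the thread to exit; otherwise GetExitCodeThread() may
    // report STILL_ACTIVE instead of the thread's return value.
    WaitForSingleObject(h, INFINITE);

    DWORD final;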
    if (!GetExitCodeThread(h, &final)) {
        std::cerr << "GetExitCodeThread():  " << GetLastError() << std::endl;
        exit(1);
    }

    if (early != late) {
        std::cout << early << ", " << late << std::endl;
        exit(2);
    }

    // If I just put "return 0;" here, I get random crashes in the exit
    // code when I run three parallel "repeat 100 suspend.exe" jobs.
    exit(0);
}
