This is the mail archive of the glibc-linux@ricardo.ecn.wfu.edu mailing list for the glibc project.


glibc 2.1.3pre1 with Linuxthreads 2.1.3pre2 made my application work again


This mail gives some positive feedback.

During the last 2 years I've been working on a multithreaded server
application written in C++.

It's a kind of special purpose database, (random) i/o centric, using a
"one thread per client tcp/ip connection" approach.

It must run day and night without interruption, handles millions of jobs
each day, transferring about 20 GB of data per day (on a single i686).

I tried to use RedHat 6.0 as the operating system, but that didn't work.
Every few hours, after running under load with multiple clients, my
application aborted with unknown problems (I was unable to do post
mortem debugging).
When doing only the (internal) database update stuff, without multiple
clients fetching data, the application worked fine.

I suspected problems in my application, like overwritten memory, but
couldn't find anything.

However, as soon as I switched the production machine back to RedHat
5.2, everything worked without a single problem, and the software ran
for weeks! So I stayed with RedHat 5.2.

Last month I gave the new RedHat 6.1 a try, but the same problem
occurred: the application never ran longer than a few hours.

But meanwhile I had multiple redundant production machines at hand that
were able to substitute for each other dynamically. Because of that I was
able to begin a detailed error analysis.

Using "error checking mutexes", I found out that after a while a
condition occurred that looked like a glibc internal inconsistency:

pthread_mutex_t Handle = {__m_reserved = 1074995304, __m_count = 0,
__m_owner = 0x0, __m_kind = 2, __m_lock = {__status = 1, __spinlock = 0}}

As far as I understand the glibc source, this combination of values
should not occur in a pthread_mutex_t struct: while __status == 1,
__m_owner should contain a value != 0.

I noticed this problem because pthread_mutex_unlock returned "EPERM
(the calling thread does not own the mutex)".
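
For readers unfamiliar with error-checking mutexes, the following minimal sketch shows the normal, intended way to get EPERM: a thread unlocks a mutex that another thread holds. It uses the POSIX name PTHREAD_MUTEX_ERRORCHECK on a current system, not the exact calls of the application described here, and the helper names `unlock_foreign_mutex` and `unlock_result_for_non_owner` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes PTHREAD_MUTEX_ERRORCHECK */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical thread body: tries to unlock a mutex owned by another
   thread and hands the error code back through its exit value. */
static void *unlock_foreign_mutex(void *arg)
{
    pthread_mutex_t *m = arg;
    return (void *)(intptr_t)pthread_mutex_unlock(m);
}

/* Hypothetical helper: returns what pthread_mutex_unlock reports to a
   thread that does not own an error-checking mutex. */
int unlock_result_for_non_owner(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    pthread_t t;
    void *res = NULL;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);   /* the calling thread is now the owner */
    assert(rc == 0);
    rc = pthread_create(&t, NULL, unlock_foreign_mutex, &m);
    assert(rc == 0);
    rc = pthread_join(t, &res);    /* collect the other thread's result */
    assert(rc == 0);

    rc = pthread_mutex_unlock(&m); /* the owner itself may unlock */
    assert(rc == 0);
    pthread_mutex_destroy(&m);

    (void)rc;                      /* silences warnings when NDEBUG is set */
    return (int)(intptr_t)res;
}
```

Built with `-pthread`, the helper is expected to return EPERM, because the second thread never locked the mutex. What makes the report above notable is that EPERM showed up together with a mutex marked as locked but without a recorded owner, which is not something a correct caller should be able to produce.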

I was able to reproduce this condition multiple times.

Reading the mailing list, I found your recent bugfix
       * spinlock.c: __pthread_lock queues back any received restarts
         that don't belong to it instead of assuming ownership of lock
         upon any restart; fastlock can no longer be acquired by two
         threads simultaneously.
which made me believe it could have something to do with my problem.

So I gave the glibc 2.1.3pre1 with Linuxthreads 2.1.3pre2 a try, built
RPMs from the prerelease and replaced the system glibc.
(To be exact, I also applied all the RedHat patches that came with 6.1)

Since then, over two weeks ago, I haven't had any more problems.

The software has now been running around the clock without a single
problem or crash.
I didn't change my own software, the compiler, or anything else on the
system, so I know for sure that the newer glibc solved the problem.

Even if it's not the bug I mentioned above, I know for sure you fixed a
major bug between glibc 2.1.2 (as shipped with RedHat 6.1) and the
mentioned prerelease.

In my opinion, this is worth a major announcement, at least to
everyone using LinuxThreads.

I suppose there are more people developing multithreaded applications
who suffer from this problem, and I think everyone will be glad to hear
about the fix.

Thank you very much, to everyone involved.

My best regards
                 Kai
