This is the mail archive of the ecos-discuss@sourceware.org mailing list for the eCos project.



RE: Possible sockets/fd race condition


Do you have assertions turned on? That could help identify the problem
earlier.
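If they are not, eCos assertions are controlled by the infrastructure package's debug options. A minimal configuration fragment along these lines (option names from the standard infra CDL; import it into your configuration with ecosconfig, then rebuild) should enable them:

```
cdl_component CYGPKG_INFRA_DEBUG {
    user_value 1
};
cdl_option CYGDBG_USE_ASSERTS {
    user_value 1
};
```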

I have seen other, similar problems (with assertions turned on), but have
not yet spent any time trying to resolve them. Assertions such as:

eCos panic: cyg_fp_free() will close fd
eCos panic: sbdrop
eCos panic: soaccept: !NOFDREF
eCos panic: sofree: not queued
eCos panic: soclose: NOFDREF
eCos panic: Locking mutex I already own
eCos panic: sbflush: locked

I searched the eCos archives for these panics and came across this:
http://sourceware.org/ml/ecos-discuss/2000-04/msg00153.html

I could find no evidence that this issue was ever addressed.

Jay

-----Original Message-----
From: Ross Wightman [mailto:rwightman@gmail.com]
Sent: Wednesday, November 16, 2005 12:45 AM
To: ecos-discuss@ecos.sourceware.org
Subject: [ECOS] Possible sockets/fd race condition


Another engineer and I have recently encountered what appears to be some
sort of race condition in the sockets (BSD stack) or fd layer.

Two threads are relevant to the problem. One of them sends out small
(~64-byte) UDP multicast packets at a configurable rate of 1-100 per
second. The other runs a basic single-threaded HTTP server that serves
one connection at a time (but serves many small requests quite rapidly).

Running by itself, either thread can send hundreds of thousands of UDP
packets or serve millions of HTTP requests without problems. As soon as
both threads are running, however, they last anywhere from a few hundred
to a few thousand served requests before the UDP thread dies a horrible
death trying to close its socket.

The exact crash occurs when tcp_usr_detach in tcp_usrreq.c tries to
dereference a NULL pointer (which should point to a TCP control block).
This is very bizarre, because we are supposed to be closing a UDP socket.
Looking at the socket itself a few stack frames down, it is an inactive
TCP socket with the SS_NOFDREF and SS_INCOMP flags set and no TCP control
block; it is certainly not the active UDP socket it should be.

Every time the crash has occurred, the HTTP server is waiting in
unlock_inner, reached from an splx call made through various TCP routines.

At the time of the crash, the socket/file-descriptor numbers for the UDP
and TCP sockets are 4 and 5, respectively.

We have yet to pinpoint exactly when things start to go wrong. We have,
however, looked for possible race conditions, and we noticed that the
socreate function does not have the splnet protection that most other
functions at that level have. We added it, and after half a day of running
the problem has not recurred. However, it is not clear whether we fixed
the problem correctly or merely hit it with a blunt hammer and masked the
real cause.

Any thoughts would be much appreciated. More details follow below.

Regards,

Ross



This is the pseudo-code sequence of the UDP thread:

{
    socket s;

    loop
    {
        wait on flag()

        s = socket(...);
        setsockopt(s, ...);
        ioctl(s, ...);    // for address-change check
        sendto(s, ...);
        close(s);
    }
}

The HTTP server thread runs a typical TCP serving loop and shares nothing
with the UDP thread. The only interaction between the two threads is
through whatever locking or data sharing occurs inside the TCP/IP stack.

The relevant portion of the UDP thread stack is as follows:

#0  0x001f01b0 in tcp_disconnect (tp=0x0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:1070
#1  0x001eedb4 in tcp_usr_detach (so=0x2cc010)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:181
#2  0x001c33ec in cyg_soclose (so=0x2cc010)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_socket.c:321
#3  0x0016f244 in bsd_close (fp=0x2b99ec)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:730
#4  0x0019c17c in fp_ucount_dec (fp=0x2b99ec)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:160
#5  0x0019c284 in fd_close (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:189
#6  0x0019c55c in cyg_fd_free (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:260
#7  0x0019f8ac in close (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/io.cxx:216
#8  0x000524b0 in Packet::PacketSend() (this=0x32d5a0)
    at src/App/Packet.cpp:141

The printout of the socket (so) follows. It should be a UDP socket, but
instead it is an inactive TCP socket.

(gdb) print/x *(struct socket*)(0x2cc010)
$2 = {so_zone = 0x2cc180, so_type = 0x1, so_options = 0x20, so_linger = 0x0,
  so_state = 0x801, so_pcb = 0x2cb478, so_proto = 0x2a51a0,
  so_head = 0x2cc0c0, so_incomp = {tqh_first = 0x0, tqh_last = 0x0},
  so_comp = {tqh_first = 0x0, tqh_last = 0x0},
  so_list = {tqe_next = 0x0, tqe_prev = 0x2cc0d8},
  so_qlen = 0x0, so_incqlen = 0x0, so_qlimit = 0x0,
  so_timeo = 0x0, so_error = 0x0, so_sigio = 0x0, so_oobmark = 0x0,
  so_aiojobq = {tqh_first = 0x0, tqh_last = 0x2cc054},
  so_rcv = {sb_cc = 0x0, sb_hiwat = 0x4470, sb_mbcnt = 0x0,
    sb_mbmax = 0x22380, sb_lowat = 0x1, sb_mb = 0x0,
    sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
    sb_timeo = 0x0},
  so_snd = {sb_cc = 0x40, sb_hiwat = 0x4470, sb_mbcnt = 0x80,
    sb_mbmax = 0x22380, sb_lowat = 0x800, sb_mb = 0x2ccf80,
    sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
    sb_timeo = 0x0},
  so_upcall = 0x0, so_upcallarg = 0x0, so_cred = 0x0,
  cyg_so_gencnt = 0x8646, so_emuldata = 0x0, so_accf = 0x0}

The printout of the inpcb (so_pcb) is:

(gdb) print/x *(struct inpcb*)(0x2cb478)
$3 = {inp_hash = {le_next = 0x0, le_prev = 0x0}, inp_fport = 0x0,
  inp_lport = 0x0, inp_list = {le_next = 0x0, le_prev = 0x2a6e18},
  inp_flow = 0x0,
  inp_dependfaddr = {inp46_foreign = {ia46_pad32 = {0x0, 0x0, 0x0},
      ia46_addr4 = {s_addr = 0x0}},
    inp6_foreign = {__u6_addr = {__u6_addr8 = {0x0 <repeats 16 times>},
        __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
  inp_dependladdr = {inp46_local = {ia46_pad32 = {0x0, 0x0, 0x0},
      ia46_addr4 = {s_addr = 0x0}},
    inp6_local = {__u6_addr = {__u6_addr8 = {0x0 <repeats 16 times>},
        __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
  inp_ppcb = 0x0, inp_pcbinfo = 0x308310, inp_socket = 0x2cc010,
  inp_flags = 0x0,
  inp_dependroute = {inp4_route = {ro_rt = 0x0, ro_u = {rou_dst_ss = {
          ss_len = 0x0, ss_family = 0x0,
          __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0,
          __ss_pad2 = {0x0 <repeats 112 times>}},
        ro_dst_sa = {sa_len = 0x0, sa_family = 0x0,
          sa_data = {0x0 <repeats 30 times>}}}},
    inp6_route = {ro_rt = 0x0, ro_u = {rou_dst_ss = {
          ss_len = 0x0, ss_family = 0x0,
          __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0,
          __ss_pad2 = {0x0 <repeats 112 times>}},
        ro_dst_sa = {sa_len = 0x0, sa_family = 0x0,
          sa_data = {0x0 <repeats 30 times>}}}}},
  inp_sp = 0x0, inp_vflag = 0x1, inp_ip_ttl = 0x40, inp_ip_p = 0x0,
  inp_depend4 = {inp4_ip_tos = 0x0, inp4_options = 0x0,
    inp4_moptions = 0x0},
  inp_depend6 = {inp6_options = 0x0, inp6_inputopts = 0x0,
    inp6_outputopts = 0x0, inp6_moptions = 0x0, inp6_icmp6filt = 0x0,
    inp6_cksum = 0x0, inp6_ifindex = 0x0, inp6_hops = 0x0},
  inp_portlist = {le_next = 0x0, le_prev = 0x0}, inp_phd = 0x0,
  inp_gencnt = 0x8d9d}


The relevant portion of the HTTP serving thread's stack is:

#0  Cyg_Scheduler::unlock_inner(unsigned) (new_lock=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sched/sched.cxx:220
#1  0x00266858 in Cyg_Scheduler::unlock() ()
    at /cygdrive/d/work/firmware/micro/ecos/conf/avppc1/ecos/install/include/cyg/kernel/sched.inl:100
#2  0x001868cc in Cyg_Mutex::unlock() (this=0x2ff580)
    at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sync/mutex.cxx:445
#3  0x0017e470 in cyg_mutex_unlock (mutex=0x2ff580)
    at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/common/kapi.cxx:1081
#4  0x001a6048 in cyg_splx (old_state=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/ecos/synch.c:209
#5  0x001c43d8 in cyg_sosend (so=0x2cbeb0, addr=0x0, uio=0x2b8310,
    top=0x2cce80, control=0x0, flags=0, p=0x0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_socket.c:622
#6  0x0016f9dc in bsd_sendit (fp=0x2b99cc, mp=0x2b83b0, flags=0, retsize=0x2b83a0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:1039
#7  0x0016eb64 in bsd_sendmsg (fp=0x2b99cc, m=0x2b83b0, flags=0, retsize=0x2b83a0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:572
#8  0x001a2450 in sendto (s=4, buf=0x32d7fc, len=543, flags=0, to=0x0, tolen=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:612
#9  0x001a251c in send (s=4, buf=0x32d7fc, len=543, flags=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:629
#10 0x00143694 in fsend (pData=0x32d5e8, len=543)
    at src/Interface/serv.cpp:518

-- 
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss
