This is the mail archive of the ecos-patches@sources.redhat.com mailing list for the eCos project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: timed condition variable patch


Nick Garnett <nickg@calivar.demon.co.uk> writes:

> Jonathan Larmour <jifl@eCosCentric.com> writes:
> 
> > Wade Jensen wrote:
> > > Hello,
> > > There is a very rare condition in Cyg_Condition_Variable::wait_inner(
> > > Cyg_Mutex *mx, cyg_tick_count timeout ) that causes the "Unnecessary call to
> > > unlock_inner()" assertion to fail.
> > 
> > This looks familiar.
> > 
> > Nick, I seem to recall I discussed this with you IRL in the Red Hat
> > office about this, or something similar - i.e. what happens about
> > timers that trigger "now" or in the past, and I think it was also in
> > the context of select(). Do you remember what the outcome was? I can't
> > find any kernel change resulting certainly, but then I can't remember
> > the problem :-).
> 
> Yes, I also recall such a conversation, and would have sworn I made
> changes to fix it. Maybe I only did this in a branch. 
> 

I've been thinking about this a bit more, and have recalled the
conversation we has IRL. The patch I supplied previously should not be
applied.

The problem is that since the introduction of unlock_reschedule() it
is now possible for a thread to run through unlock_inner() with all
the conditions tested for in that assertion false. This is rare and
benign but it is not possible to separate this case from places where
it should not happen.

Changing the behaviour of wait_inner() to avoid this is not a good
solution, it bloats the routine with an extra call to
Cyg_Scheduler::unlock() in all cases, not just the one rare case that matters.

The ultimate cause of the problem (at least when we discussed it
before) is that select() is being called with a ridiculously small
timeout. A timeout value of <= 1 tick can, if a timer interrupt occurs
during select() processing, result in the current time being passed to
wait_inner().

The solution we recommended before was to simply change the
application to supply a longer timeout. Any application that passes a
very short timeout to select() is effectively busy-waiting and is
broken.

I can see two solutions:

1. Remove the assertion from unlock_inner(). This was useful during
   development for catching programming bugs in the kernel. The kernel
   is now reasonably stable, and the added semantics of
   unlock_reschedule() now make it less useful -- and in this case a
   positive menace. We have has problems with this assertion before,
   so maybe it is time for it to go.

2. Fix the calculation of the end time in select() so that the current
   time is only added after the scheduler has been locked and it
   cannot advance any further.

Neither of these is ideal, but both should probably be applied.

I'll put together a patch to do this.

-- 
Nick Garnett - eCos Kernel Architect


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]