This is the mail archive of the ecos-discuss@sourceware.org mailing list for the eCos project.



do not use the ARM FIQ: there's a bug in the code


Hello,

I was using the FIQ pin of my AT91 ARM microcontroller (uC) as the Ethernet interrupt. When connected to a network, our platforms (two different ones) crash after a while. The debugger always reports "scheduler lock not zero" (first reported here: http://sourceware.org/ml/ecos-discuss/2006-08/msg00083.html; recently reported here: http://sourceware.org/ml/ecos-discuss/2007-07/msg00169.html; some further info here: http://sourceware.org/ml/ecos-discuss/2006-11/msg00094.html).
I now use a different IRQ pin, and the problems are gone.


We (Wim Dumon and I, working for Televic) were able to track down the bug, but have not (yet) been able to solve it.
Here is a first report. Wim will mail a more detailed report when he has time.
The bug appears randomly: sometimes it takes an hour, sometimes 10 seconds before it appears.
With UDP traffic as a test, the bug shows up as "scheduler lock not zero" in the idle thread. This is probably because UDP handling is rather simple, so the processor spends much of its time in the idle thread (no other application threads are running during the tests).
With TCP traffic as a test, the bug shows up as various other weird errors (I have a test report of them).


Because with UDP the bug always shows up as "scheduler lock not zero", we could track it down by setting a breakpoint there and toggling a pin of the uC just before it. That pin was connected to a logic analyzer as the trigger input. The logic analyzer monitored the 16 lowest address pins (going to the uC's SRAM, where the code runs) and stored everything before the trigger in its memory. This way Wim traced back what the software had executed.
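
The trigger-pin technique can be sketched roughly as follows. This is only an illustration, not our actual code: the pin mask, the function names and the way the port register is passed in are assumptions, and on real hardware the pointer would refer to the memory-mapped output register of the uC's I/O port.

```c
#include <stdint.h>

/* Assumed pin position; the real pin depends on the board. */
#define TRIGGER_PIN_MASK (1u << 3)

/* Toggle the trigger pin so the logic analyzer stops recording
 * and keeps everything captured before this moment. */
static void trigger_toggle(volatile uint32_t *gpio_out)
{
    *gpio_out ^= TRIGGER_PIN_MASK;
}

/* Hypothetical check placed where "scheduler lock not zero" is detected.
 * Returns 1 if the lock is zero, 0 otherwise. On failure the pin is
 * toggled first; the breakpoint would sit right after the toggle. */
static int sched_lock_ok(uint32_t lock, volatile uint32_t *gpio_out)
{
    if (lock != 0) {
        trigger_toggle(gpio_out);  /* fire the analyzer trigger */
        return 0;                  /* breakpoint location */
    }
    return 1;
}
```

Toggling the pin before the breakpoint is hit matters because the analyzer then holds the address trace leading up to the failure, rather than whatever happens after the debugger stops the core.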
Wim found that at the time of the bug the first 5 registers of the ARM were always wrong, and always held the same values. Those values come from some stack: the bug causes those 5 registers not to be restored correctly at a context switch. Register 2 (r2) holds the scheduler lock value and is indeed not zero at the time of the bug (as reported by the assertion), but when the scheduler lock was read from its address in SRAM, it was correctly zero! r2 was always 0xFFDF_FFDF.
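
The reasoning above (register copy wrong, memory correct) can be written down as a small check. This is a sketch of the diagnostic idea only; the type and function names are made up for illustration and nothing like this exists in eCos as far as I know.

```c
#include <stdint.h>

typedef enum {
    LOCK_OK,            /* register copy and memory are both zero      */
    LOCK_REALLY_HELD,   /* both agree on a non-zero value              */
    LOCK_COPY_MISMATCH  /* register copy differs from the SRAM content */
} lock_diag_t;

/* reg_copy:       the value the failing check actually used (r2 in our case)
 * lock_in_memory: address of the scheduler lock variable in SRAM */
static lock_diag_t diagnose_lock(uint32_t reg_copy,
                                 const volatile uint32_t *lock_in_memory)
{
    uint32_t mem = *lock_in_memory;  /* fresh read from memory */

    if (reg_copy == 0 && mem == 0)
        return LOCK_OK;
    if (reg_copy == mem)
        return LOCK_REALLY_HELD;
    return LOCK_COPY_MISMATCH;
}
```

In our case the result would be the mismatch case: r2 held 0xFFDFFFDF while the variable in SRAM was zero, which points at the register save/restore path rather than at code that really left the scheduler locked.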
Wim is convinced that the /hal/arm/arch/.../src/vectors.S code contains the bug(s).
Our eCos tree is from 2006-02-15, so it already includes the bugfix of 2006-02-06 from Sergei Organov. But we think a similar bug is still present.


Mark: the comment in /hal/arm/at91/var/.../cdl/hal_arm_at91.cdl about "CYGHWR_HAL_ARM_AT91_FIQ" is wrong. This is (more or less) the correct comment:
" Enable this option if you want to use the FIQ. Interrupts in eCos
may not be interrupted. Therefore, it is needed to handle FIQ
interrupts in the normal way, i.e. a FIQ interrupt must be treated
as a normal IRQ using the highest priority"


During debugging with the JTAG monitor BDI2000, we often saw spurious interrupts. But we checked the hardware with an oscilloscope, and the interrupt signals are clean. I see 3 possible causes for the spurious interrupts:
- the monitoring itself,
- or that bug,
- or the use of level-sensitive interrupts.
When I have some time, I will check the last one by using an edge-sensitive IRQ instead, and I will check the second one now that we are no longer using the FIQ.
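
To show why the third cause is plausible, here is a toy model of a level-sensitive versus an edge-sensitive interrupt input. It is a simplification under my own assumptions, not a model of the AT91 interrupt controller: it covers one known mechanism, where the handler acknowledges the interrupt without making the device release the line.

```c
#include <stdbool.h>

typedef enum { TRIG_LEVEL, TRIG_EDGE } trigger_t;

typedef struct {
    trigger_t mode;
    bool prev_line;  /* line state seen at the previous sample */
    bool pending;    /* interrupt waiting for the CPU */
} irq_input_t;

/* Sample the interrupt line once and update the pending flag. */
static void irq_sample(irq_input_t *in, bool line)
{
    if (in->mode == TRIG_LEVEL) {
        in->pending = line;          /* pending as long as the line is active */
    } else if (line && !in->prev_line) {
        in->pending = true;          /* latch on the inactive-to-active edge */
    }
    in->prev_line = line;
}

/* Acknowledge: this only clears an edge latch; a level input
 * stays pending until the device releases the line. */
static void irq_ack(irq_input_t *in)
{
    if (in->mode == TRIG_EDGE)
        in->pending = false;
}

/* Count handler entries while the device holds the line active for
 * `hold` samples and the handler acknowledges without clearing the device. */
static int count_entries(trigger_t mode, int hold)
{
    irq_input_t in = { mode, false, false };
    int entries = 0;

    for (int t = 0; t < hold; ++t) {
        irq_sample(&in, true);
        if (in.pending) {
            ++entries;
            irq_ack(&in);
        }
    }
    return entries;
}
```

In this model the level-sensitive input enters the handler on every sample while the line stays active, whereas the edge-sensitive input enters it once. Whether this is what happens on our boards still has to be checked.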


It is up to my boss to decide whether he wants to spend any more money on solving this bug. It will also depend on whether I can use the workaround for all versions of our platforms...

I could also try to solve it for fun in my free time of course ;-),
kind regards,
Juergen



