Monday, May 21, 2012

[nslu2-linux] Re: Catching SIGSEGV

--- In nslu2-linux@yahoogroups.com, David Given <dg@...> wrote:
>
> On 18/05/12 23:09, clerew5 wrote:
> [...]
> > Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).
>
> This is definitely getting out of my comfort zone (signals and threads
> mix like oil and cats), but I was under the impression that outside
> signals are sent to a single thread *at random* that has the signal
> unblocked? So you control which thread you want to receive the signals
> by blocking them from everywhere else.

Yes, that is what I thought, and the thread in question was the only one allowed to see the signals (the prime purpose of that thread is to catch people who suddenly press buttons). But it seems that SIGSEGV (unless generated externally) is only sent to the thread in which it happened, or maybe its parent.
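
For the record, the usual arrangement looks something like this (a minimal sketch, not lifted from my actual program): block the external signals in every thread with pthread_sigmask(), then let one dedicated thread collect them with sigwait(). SIGSEGV and friends, being synchronous, still land on whichever thread faulted.

    /* Minimal sketch: route external signals (e.g. SIGINT, SIGTERM) to
     * one dedicated thread.  Synchronous signals such as SIGSEGV are NOT
     * routed this way -- the kernel delivers them to the faulting thread
     * itself. */
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    static sigset_t ext_signals;

    static void *signal_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            int sig;
            if (sigwait(&ext_signals, &sig) == 0)
                printf("caught signal %d\n", sig);  /* safe: not a handler */
        }
        return NULL;
    }

    int main(void)
    {
        sigemptyset(&ext_signals);
        sigaddset(&ext_signals, SIGINT);
        sigaddset(&ext_signals, SIGTERM);

        /* Block these in the main thread before creating any others;
         * new threads inherit the mask, so only signal_thread sees them. */
        pthread_sigmask(SIG_BLOCK, &ext_signals, NULL);

        pthread_t tid;
        pthread_create(&tid, NULL, signal_thread, NULL);

        /* ... real work ... */
        pthread_join(tid, NULL);
        return 0;
    }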

> > But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.
>
> I've tried the test program in the backtrace man page on my armhf box,
> and it doesn't work. I'm afraid that it's possible that backtrace simply
> doesn't work on ARM.

Actually, it is doing better than I thought. Here is some actual output:

./heat(pthread_create+0x710)[0x9da0]
/lib/libc.so.6(__default_rt_sa_restorer_v2+0x0)[0x4010bcf0]
/lib/libc.so.6(_IO_printf+0x34)[0x40120dfc]
./heat[0x14a18]
./heat[0x17274]
./heat(pthread_create+0x680)[0x9d10]

The addresses within [...] are indeed the addresses of code being obeyed all down the stack. But [0x9da0] is NOT within pthread_create (nm shows it is actually within my handler). Likewise [0x9d10] is within 'main', and [0x14a18] and [0x17274] are at identifiable places which gave me the clue as to where the fault was happening (though one stack frame seems to be missing from the trace altogether).

BUT I then assumed that the claimed routines shown within /lib/libc.so.6 would be equally bogus, whereas research using /proc/<pid>/maps and nm showed that they were in fact correct, and the fault was indeed in printf (I had mistyped '%d' as '%n'; %n writes the count of characters printed so far through an int* argument, so the plain int I had passed got dereferenced as a wild pointer).

So there is definitely a bug in backtrace which causes wrong identification within code compiled by myself, but which operates fine within the system libraries.
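
Though for what it's worth, the backtrace(3) man page warns about exactly this: the symbol names come from the dynamic symbol table, so unless the executable is linked with -rdynamic, addresses in one's own code get attributed to the nearest exported symbol (which would explain pthread_create turning up). A minimal sketch of the handler end of it, assuming glibc's <execinfo.h>:

    /* Minimal backtrace-dumping handler (glibc <execinfo.h>).  Link the
     * program with -rdynamic so the backtrace_symbols* functions can name
     * non-exported functions; without it, addresses in the executable are
     * attributed to the nearest exported symbol. */
    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    #define MAX_FRAMES 32

    static void errHandler(int sig)
    {
        void *frames[MAX_FRAMES];
        int n = backtrace(frames, MAX_FRAMES);

        /* backtrace_symbols_fd() writes straight to a file descriptor,
         * avoiding the malloc() inside backtrace_symbols(), which is not
         * async-signal-safe. */
        backtrace_symbols_fd(frames, n, STDERR_FILENO);
        (void)sig;
        _exit(1);
    }

    /* installed at startup with, e.g., signal(SIGSEGV, errHandler); */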
>
> > My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed...
>
> You may be treading on thin ice here. Depending on what causes the seg
> fault, it's quite possible that the system will be in a bad state ---
> for example, if you call fwrite(), and the buffer is unreadable, then
> it's entirely likely that the signal will be thrown while it's in the
> middle of modifying the stdio state. Which means that trying to use
> stdio again will hang, crash, etc. The magic keyword to search for is
> 'async signal safe'.

Yes, you have to be aware of what code you are going to obey, and of what exit() is likely to do, and there is a risk that these might trigger the same fault again. But in my case the program is controlling a heating system, turning real boilers on and off, and it is absolutely essential that, upon a crash, the boilers are most definitely left OFF. Also, there are a few variables that ought to be preserved in permanent storage.
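
Something along these lines is what I mean (a hypothetical sketch rather than my actual code -- turn_boilers_off() and the state file are made-up names): the handler sticks to async-signal-safe calls such as open(), write() and _exit(), and leaves stdio and plain exit() well alone.

    /* Hypothetical sketch of an async-signal-safe crash handler: force
     * the outputs off, dump a few words of state, and die so that the
     * wrapper script can restart the program.  open()/write()/close()/
     * _exit() are async-signal-safe; printf()/fwrite()/exit() are not. */
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    extern int boiler_state[2];          /* hypothetical saved variables */
    extern void turn_boilers_off(void);  /* hypothetical: must itself avoid
                                            stdio and malloc */

    static void crashHandler(int sig)
    {
        (void)sig;
        turn_boilers_off();

        int fd = open("/var/run/heat.state",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, boiler_state, sizeof boiler_state);
            close(fd);
        }
        _exit(70);                       /* nonzero, so the wrapper restarts us */
    }

    /* installed at startup with, e.g., signal(SIGSEGV, crashHandler); */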

But so far it all seems to be working fine. The system crashed, restarted itself within one second, and crashed again shortly after, until eventually the temperatures had risen to the point where the offending piece of code no longer needed to be called.

