linuxgeek: [nslu2-linux] Re: Catching SIGSEGV

Friday, May 18, 2012

--- In nslu2-linux@yahoogroups.com, David Given <dg@...> wrote:

> I'm not quite sure what you're trying to do here --- AIUI sigwait()
> blocks until a signal *for the blocked thread* is received, so no code
> that can generate the signal will be run... (unless you send it manually
> from another thread).

Ah! I had thought that all signals would be passed to all threads (signals arriving from outside the process can at least be delivered to any thread that has not blocked them).
>
> If you're trying to catch a signal thrown by code in a different thread,
> then I don't think that will work.
>
> The simplest approach is to just register a SIGSEGV signal handler in
> the thread that's going to be doing the work. Then, when the signal is
> thrown, your handler will run. If you want to do the work in a different
> thread, then you'll need some sort of IPC between threads (but beware
> that you can only run a small subset of functions safely from inside a
> signal handler).
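One common way to do the "IPC between threads" part while staying inside the small set of safe functions is the self-pipe trick. The sketch below is only an illustration under assumptions: the names notify_install, notify_handler and notify_wait are hypothetical, and the handler calls nothing but write(), which POSIX lists as async-signal-safe. A monitoring thread would sit in notify_wait().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int notify_fd[2] = { -1, -1 };   /* [0] read end, [1] write end */

/* Handler: the only library call is write(), which is async-signal-safe. */
static void notify_handler(int signum)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signum;

    if (write(notify_fd[1], &byte, 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
    errno = saved_errno;
}

/* Create the pipe (once) and install the handler. Returns 0 on success. */
int notify_install(int signum)
{
    struct sigaction sa;

    if (notify_fd[0] < 0 && pipe(notify_fd) != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = notify_handler;
    sa.sa_flags = SA_RESTART;
    return sigaction(signum, &sa, NULL);
}

/* Called by the monitoring thread: blocks until a signal number arrives.
 * Returns the signal number, or -1 on error. */
int notify_wait(void)
{
    unsigned char byte;
    ssize_t n;

    do {
        n = read(notify_fd[0], &byte, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? (int)byte : -1;
}
```

This pattern suits asynchronous signals such as SIGUSR1 or SIGTERM. For SIGSEGV the handler cannot simply return, because the faulting instruction would be executed again, so the handler still has to end the process itself after sending the notification.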

Yes, I have now set up a sigaction() handler in main(), and it seems to be inherited by processes created after that point. It more or less works, but I need to know where the SIGSEGV came from in order to know where to look for the bug. Here is my current handler:

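#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
#include <signal.h>     /* sigaction(), siginfo_t */
#include <string.h>     /* strsignal() */

void *tracePtrs[100];   /* global buffer for the return addresses */

void errHandler(int signum, siginfo_t *info, void *ptr) {
    /* Write the raw backtrace straight to stderr (fd 2). */
    int count = backtrace(tracePtrs, 100);
    backtrace_symbols_fd(tracePtrs, count, 2);
    /* fatal() reports the error and then calls exit(). */
    fatal("Heat failed with %s: si_code=%d",
          strsignal(signum), info->si_code);
}

/* In main(); errAction is a struct sigaction. */
sigemptyset(&errAction.sa_mask);   /* block no extra signals while the handler runs */
errAction.sa_sigaction = errHandler;
errAction.sa_flags = SA_SIGINFO;
sigaction(SIGSEGV, &errAction, NULL);
sigaction(SIGBUS, &errAction, NULL);
sigaction(SIGILL, &errAction, NULL);
sigaction(SIGFPE, &errAction, NULL);
sigaction(SIGSYS, &errAction, NULL);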

The call to 'fatal' seems to work fine (given that it has to deal with all those signals, that is as much information as I can extract from info).
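For SIGSEGV, SIGBUS, SIGILL and SIGFPE, POSIX also fills in info->si_addr (the faulting memory address or the faulting instruction), and for SIGSEGV the si_code value can be decoded. The following is a rough sketch of how that could be printed without printf(); the helper names segv_code_name, format_hex and report_fault are hypothetical and not part of my program.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Name the two SIGSEGV si_code values defined by POSIX. */
const char *segv_code_name(int si_code)
{
    switch (si_code) {
    case SEGV_MAPERR: return "SEGV_MAPERR (address not mapped)";
    case SEGV_ACCERR: return "SEGV_ACCERR (invalid permissions)";
    default:          return "other si_code";
    }
}

/* Format `value` as "0x..." into buf without printf().
 * Returns the string length, or 0 if buf is too small. */
size_t format_hex(uintptr_t value, char *buf, size_t size)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof value];
    size_t n = 0, len = 0;

    do {                          /* collect digits, least significant first */
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0);

    if (size < n + 3)             /* "0x" + digits + terminating NUL */
        return 0;
    buf[len++] = '0';
    buf[len++] = 'x';
    while (n > 0)                 /* copy the digits back in reading order */
        buf[len++] = tmp[--n];
    buf[len] = '\0';
    return len;
}

static void put(const char *s, size_t len)
{
    if (write(STDERR_FILENO, s, len) < 0) {
        /* nothing useful can be done about a failed write here */
    }
}

/* Meant to be called from a SIGSEGV handler installed with SA_SIGINFO. */
void report_fault(const siginfo_t *info)
{
    char addr[2 * sizeof(uintptr_t) + 3];
    size_t len = format_hex((uintptr_t)info->si_addr, addr, sizeof addr);
    const char *name = segv_code_name(info->si_code);

    put("fault address ", 14);
    put(addr, len);
    put(": ", 2);
    put(name, strlen(name));
    put("\n", 1);
}
```

A fault address of 0x0 or some other very small value would point at a NULL pointer dereference, which already narrows the search a little.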

But to trace the bug, I need to know where the signal came from. SlugOS 5.3 doesn't do core dumps (for good reason), so I have tried to generate a backtrace instead (#include <execinfo.h>; note that tracePtrs is declared globally). But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but nothing after that is recognizable. Nor is this just because it runs within a handler: I have invoked it manually from elsewhere in the program, and it still does not work.
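One possible alternative to a full backtrace is to record only the program counter at which the signal arrived, which is available through the third argument of an SA_SIGINFO handler. This is a sketch under assumptions, not something I have verified on the slug: the helper names are hypothetical, and it relies on the ARM Linux glibc layout in which uc_mcontext has an arm_pc field. On other architectures the field has a different name, so the sketch returns 0 there.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

volatile uintptr_t last_pc;          /* program counter seen by the handler */
volatile sig_atomic_t handler_ran;   /* set to 1 once the handler has run */

/* Extract the interrupted program counter from the handler's third
 * argument. Only 32-bit ARM Linux is covered; elsewhere this returns 0. */
static uintptr_t context_pc(void *ctx)
{
#if defined(__arm__)
    ucontext_t *uc = (ucontext_t *)ctx;
    return (uintptr_t)uc->uc_mcontext.arm_pc;
#else
    (void)ctx;
    return 0;
#endif
}

static void pc_handler(int signum, siginfo_t *info, void *ctx)
{
    (void)signum;
    (void)info;
    last_pc = context_pc(ctx);
    handler_ran = 1;
}

/* Install the handler with SA_SIGINFO so that the context is passed in.
 * Returns 0 on success, -1 on error. */
int install_pc_handler(int signum)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = pc_handler;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(signum, &sa, NULL);
}
```

The recorded address can then be resolved offline against a copy of the binary built with debug information, for example with addr2line or with "info line *ADDRESS" in gdb.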

So does anybody have any experience of using backtrace() in SlugOS?

My plan is to run the whole program (which has to run 24/7) under a shell script that observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things are cleaned up before the program goes away, which is why the handler calls 'fatal': it then calls exit, which in turn runs all my 'atexit' functions and the cleanup functions of those classes which have them.
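The restart logic itself is small. The plan above is a shell script; purely to stay in the same language as the handler, here is the same idea sketched in C with fork() and waitpid(). The function names run_once and supervise and the return-value convention are assumptions made for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program named by argv[0] once and wait for it to finish.
 * Returns the exit status (0..255) after a normal exit, minus the signal
 * number if a signal killed it, or -1000 if it could not be run. */
int run_once(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1000;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                   /* only reached if exec failed */
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1000;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return -1000;
}

/* Restart the program until it exits cleanly with status 0. */
void supervise(char *const argv[])
{
    for (;;) {
        int result = run_once(argv);

        if (result == 0)
            break;
        /* A real supervisor would record `result` in a log file here. */
        sleep(1);                     /* avoid a tight restart loop */
    }
}
```

Note that because the handler ends the program through exit(), the supervisor sees a normal exit with a nonzero status rather than a death by signal, so 'fatal' would need to use a distinctive exit status if the two cases are to be told apart.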

And indeed it caught several SIGSEGVs this morning (but they appear to occur randomly when no one is looking, and obviously I need to find the cause).