Friday, May 18, 2012

Re: [nslu2-linux] Re: Catching SIGSEGV

On 18/05/12 23:09, clerew5 wrote:
[...]
> Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).

This is definitely getting out of my comfort zone (signals and threads
mix like oil and cats), but I was under the impression that outside
signals are sent to a single thread, picked *at random* from those that
have the signal unblocked? So you control which thread receives the
signals by blocking them everywhere else.

[...]
> void errHandler(int signum, siginfo_t *info, void *ptr) {
> int count = backtrace( tracePtrs, 100 );
> backtrace_symbols_fd(tracePtrs, count, 2);
[...]
> But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.

I've tried the test program in the backtrace man page on my armhf box,
and it doesn't work. I'm afraid that it's possible that backtrace simply
doesn't work on ARM.

> My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed...

You may be treading on thin ice here. Depending on what causes the seg
fault, it's quite possible that the program will be in a bad state. For
example, if you call fwrite() and the buffer is unreadable, then it's
entirely likely that the signal will be raised while fwrite() is in the
middle of modifying the stdio state, which means that trying to use
stdio again will hang, crash, etc. The magic keyword to search for is
'async-signal-safe'.

(Related: there is a very limited list of operations that you can safely
do inside a signal handler. Calling exit() is not one of them! See here:
https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers)

This means that once your program's crashed, you may not be able to
safely clean up afterwards.

What sort of cleanup do you need to do? Recording the program's state,
or freeing resources? If the latter, is there any way you can persuade a
different process to do the cleanup for you? That way, you can just let
your program crash without needing to do any actual work from your
signal handler.
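
Something along these lines is what I have in mind, as a sketch only: a
small parent process forks the real program, waits for it, and does the
cleanup itself if the child was killed by a signal. The names
run_supervised, crashing_worker and healthy_worker are invented for the
example; in your case the child would exec the real program and the
parent would log, clean up and restart it in a loop.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run worker() in a child process. Returns the number of the signal that
 * killed it, 0 if it exited normally, or -1 on error. */
static int run_supervised(void (*worker)(void)) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        worker();
        _exit(0);  /* _exit() avoids flushing stdio buffers copied by fork() */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status)) {
        /* The child is gone, but this process is healthy: release
         * resources, write a log entry, restart, and so on from here. */
        return WTERMSIG(status);
    }
    return 0;
}

/* Stand-in for a program that crashes: no handler is installed at all. */
static void crashing_worker(void) {
    signal(SIGSEGV, SIG_DFL);  /* make sure the default action applies */
    raise(SIGSEGV);
}

/* Stand-in for a program that finishes cleanly. */
static void healthy_worker(void) {
}
```

Calling run_supervised(crashing_worker) should report SIGSEGV, while
run_supervised(healthy_worker) should report 0. The crashing program
needs no signal handler of its own.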

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ "Never attribute to malice what can be adequately explained by
│ stupidity." --- Nick Diamos (Hanlon's Razor)
