General system problem solving tools

Matthew Burgess matthew at linuxfromscratch.org
Tue Dec 21 12:53:53 PST 2010


Hi all,

I've been trying to track down a problem at work for the last
couple of days, but my efforts have been scuppered by a general
lack of knowledge of low-ish level debugging/tracing tools.

The problem we have is that a closed-source multi-threaded program
we are using runs into trouble when we get it to spawn more than
roughly 250 threads.  I'm deliberately preserving the program's
anonymity so as to protect the (presumed) innocent at this point.

When the program fails, we end up with messages reporting that
select() calls have failed, with both 'bad file descriptor'
(EBADF) and 'invalid argument' (EINVAL) errors.
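
For reference, the first of those errors is easy to reproduce in
isolation.  The sketch below is only an illustration in Python and
has nothing to do with the program in question: it calls select()
on a descriptor that has already been closed and returns the name
of the errno that comes back.  The helper name
select_errno_on_closed_fd is made up for this example.

```python
import errno
import os
import select


def select_errno_on_closed_fd():
    """Call select() on a closed descriptor and return the errno name."""
    read_end, write_end = os.pipe()
    os.close(write_end)
    os.close(read_end)  # read_end is now a stale descriptor number
    try:
        # Zero timeout: poll once rather than block.
        select.select([read_end], [], [], 0)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    return None  # select() unexpectedly succeeded


if __name__ == "__main__":
    print(select_errno_on_closed_fd())
```

This only shows one way a 'bad file descriptor' failure can arise;
it says nothing about what is actually happening in our case.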

At the moment, we're not sure whether this is a problem with the
closed-source program itself, or with some user-space or kernel
configuration or bug.  The server we're running this on is a 4-CPU
quad-core, 16GB RAM box, so it has plenty of grunt.  The program
also runs on a Windows laptop with 500 threads without any issues!

Obvious things like 'top' and 'df' don't show any problems.  I've
tried running a couple of systemtap scripts (the nettop and
socket-trace examples) but they don't appear to show anything
useful.  When I attach 'strace' to the running process, it just
seems to be hanging on futex_wait so I can't see any select()
calls or their arguments.
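
One more thing that can be read without any special tooling is the
/proc filesystem.  The following is a rough sketch, assuming a
Linux box with /proc mounted and permission to read the target
process's entries; the function name thread_and_fd_summary is
hypothetical.  It counts the threads and open file descriptors of
a process and reports the highest descriptor number in use.  Since
select() works on an fd_set, which glibc sizes at FD_SETSIZE (1024)
descriptors, the highest number seemed worth a look, though whether
it is related to the failures is purely a guess.

```python
import os
import sys


def thread_and_fd_summary(pid):
    """Return (thread_count, fd_count, highest_fd) for a Linux process."""
    # /proc/<pid>/task holds one directory per thread.
    tasks = os.listdir(f"/proc/{pid}/task")
    # /proc/<pid>/fd holds one entry per open descriptor, named by number.
    fds = [int(name) for name in os.listdir(f"/proc/{pid}/fd")]
    return len(tasks), len(fds), max(fds, default=-1)


if __name__ == "__main__":
    # Inspect the PID given on the command line, or this process itself.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    threads, fd_count, highest = thread_and_fd_summary(pid)
    print(f"pid {pid}: {threads} threads, {fd_count} open fds, "
          f"highest fd {highest}")
```

Run against the failing process's PID just before it falls over,
this would at least show how the descriptor count grows with the
thread count.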

So, does anyone have any secrets in their sysadmin toolbox that
may be of use here?  Any help at all would be greatly appreciated,
even if it's just pointers to more suitable mailing lists.

Thanks in advance,

Matt.



