| Title | Debugging with strace | |
| Date | Friday October 24, @03:56PM | |
| Author | Gerard Beekmans | |
| Topic | Debugging tools |
In an ideal world, whenever a program cannot perform a function it exits with a useful error message, giving you enough clues to fix the problem. Unfortunately, we do not live in an ideal world -- at least, not all the time. Sometimes when an application encounters an error, you are none the wiser.
This is where debugging tools come in. One tool I cannot live without is called strace which is a system call tracer. You can trace calls made by a program that is already running (bind strace to an existing PID) or you can tell strace to start a process to examine.
Let me demonstrate the use of strace with a few real world examples.
Problems with starting KDE
Some time ago I was debugging a problem starting KDE. The error message that was returned did not give me any clues:
_KDE_IceTransSocketCreateListener: failed to bind listener _KDE_IceTransSocketUNIXCreateListener: ...SocketCreateListener() failed _KDE_IceTransMakeAllCOTSServerListeners: failed to create listener for local Cannot establish any listening sockets DCOPServer self-test failed.
To me, this did not mean much. All I got out of this was that dcopserver, a crucial KDE program used for inter-process communication, could not be started. The other thing I got out of the error was its relation to the ICE (Inter Client Exchange) protocol that is part of XFree86. Other than that, it left me in the dark as to what was causing the problem.
I decided to deploy strace to look behind the scenes to see what went on when dcopserver tried to start by running dcopserver:
strace -f -F -o ~/dcop-strace.txt dcopserver
The -f -F options tell strace to also trace fork(2) and
vfork(2) processes respectively, in case there are any. The -o
option saves all strace output to the dcop-strace.txt file, and
dcopserver is the program strace will run and trace.
After I got the error again, I had a look at the dcop-strace.txt file. There is a lot of information in here including all the system calls that were made. The relevant parts are the system calls that were made just before the error occurred:
27207 mkdir("/tmp/.ICE-unix", 0777) = -1 EEXIST (File exists)
27207 lstat64("/tmp/.ICE-unix", {st_mode=S_IFDIR|S_ISVTX|0755, st_size=4096, ...}) = 0
27207 unlink("/tmp/.ICE-unix/dcop27207-1066844596") = -1 ENOENT (No such file or directory)
27207 bind(3, {sin_family=AF_UNIX, path="/tmp/.ICE-unix/dcop27207-1066844596"}, 38) = -1 EACCES (Permission denied)
27207 write(2, "_KDE_IceTrans", 13) = 13
27207 write(2, "SocketCreateListener: failed to "..., 46) = 46
27207 close(3) = 0 27207 write(2, "_KDE_IceTrans", 13) = 13
27207 write(2, "SocketUNIXCreateListener: ...Soc"..., 59) = 59
27207 umask(0) = 0 27207 write(2, "_KDE_IceTrans", 13) = 13
27207 write(2, "MakeAllCOTSServerListeners: fail"..., 64) = 64
27207 write(2, "Cannot establish any listening s"..., 39) = 39
The first line shows that something tried to create the /tmp/.ICE-unix directory with mode 0777. This failed because the directory already existed. The second call (lstat64) retrieves the status of the directory and it shows that it has mode 0755. This is the first clue something is wrong. The system call wanted to create the directory with mode 0777, but the existing directory has mode 0755. The third call (unlink) tried to remove a file, but it did not exist. This was of no consequence because it was just trying to remove a potential old file.
However, the fourth line confirmed my suspicion. It tried to bind to /tmp/.ICE-unix/dcop27207-1066844596 which failed with a permission denied error. The .ICE-unix directory is owned by root and group root with only the owner having write permission. A non-root user can never create a file in that directory. If its permissions were set to mode 0777 it would work and this is what the mkdir command tried to accomplish.
So I ran chmod 0777 /tmp/.ICE-unix and tried to start KDE
again. This time KDE started properly. The problem was solved; and it only
took a few minutes to run strace and interpret the output file.
Note: running chmod 0777 was a test. I do not like having
directories world-writable without having the sticky bit set on the
directory. Sticky bits on a directory prevent people from removing other
people's files regardless of a directory being world writable. Often you
will find your /tmp directory having the sticky bit set for this exact
reason. After KDE worked again, I set the sticky bit by running chmod
+t /tmp/.ICE-unix.
Resolving library-related problems
Another use of strace is to aid in resolving library related problems.
When you run ldd on a program, it tells you which libraries
the program depends on and where the libraries are found. However, if you
are on a system with an old Glibc version (old as in around version 2.2 or
older) you may have a buggy ldd program. The problem with this
version is that it may find a library in one directory, but upon actually
running the program the dynamic linker (/etc/ld-linux.so.2) may find it in
another place. This is usually caused by /etc/ld.so.conf and
/etc/ld.so.cache being out of sync, or the /etc/ld.so.cache file being
corrupted. Since I cannot seem to reproduce this on my Glibc-2.3.2 system
that bug might have been squashed already.
Additionally, ldd does not show all the libraries that a
program might use. The system call dlopen(3) can load a dynamic
library on demand, and these libraries do not show up in the
ldd output. The Name Server Switch (NSS) libraries, which are
part of Glibc, are prime examples of this. One of the things NSS does is
tell applications where the system's password database is located.
Applications do not link to the NSS libraries directly. These libraries are
loaded by Glibc itself using this dlopen system call. If they happen
to be missing, you will not be told about it, but applications will stop
being able to resolve things like userids into usernames. Let me
demonstrate this:
The whoami program tells you what your username is. This
can be useful when running from a script that needs to know the username of
the person running it. Sample output when running whoami:
# whoami
root
Say something went wrong during a Glibc upgrade and the NSS library that is required to do userid to username mapping ended up missing. We can simulate this by moving the /lib/libnss_files.so.2 symbolic link out of the way:
# mv /lib/libnss_files.so.2 /lib/libnss_files.so.2.backup whoami whoami: cannot find username for UID 0
As you can see, there is something wrong now. Userids are not resolved
into usernames anymore, but there is nothing from the whoami
output that tells you there is a problem with a library. The
ldd output also does not show anything useful:
# ldd /usr/bin/whoami libc.so.6 => /lib/libc.so.6 (0x4001f000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
All you see here is that libc.so.6 and ld-linux.so.2 are required for
whoami to run. It does not mention anything about other
libraries that are also required for proper operation.
Here is where strace comes in handy again. Run the following to strace
the whoami output:
strace -o whoami-strace.txt whoami
If you look at the whoami-strace.txt file you will see that there are calls to find the libnss_files.so.2 file in various directories, but none were found:
open("/lib/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib/i686/mmx/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686/mmx", 0xbffff190) = -1 ENOENT (No such file or directory)
open("/lib/i686/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686", 0xbffff190) = -1 ENOENT (No such file or directory)
open("/lib/mmx/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/mmx", 0xbffff190) = -1 ENOENT (No such file or directory)
open("/lib/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib", {st_mode=S_IFDIR|0755, st_size=2352, ...}) = 0
open("/usr/lib/i686/mmx/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686/mmx", 0xbffff190) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
Without a program like strace it was not obvious that a library was missing. With strace at least you have a place to start. Now it is a simple matter of figuring out where libnss_files.so.2 went and putting it back in place.
Limit strace to trace only specific system calls
If you already know what you are looking for, you can tell strace to show only a certain type of call. Say you wanted a list of all programs run by a configure script. The system call you will be looking for is execve. To tell strace to show only that call use:
strace -f -o configure-strace.txt -e execve ./configure
The output might look something along the lines of:
2720 execve("/usr/bin/expr", ["expr", "a", ":", "\\(a\\)"], [/* 31 vars */]) = 0
2725 execve("/bin/basename", ["basename", "./configure"], [/* 31 vars */]) = 0
2726 execve("/bin/chmod", ["chmod", "+x", "conftest.sh"], [/* 31 vars */]) = 0
2729 execve("/bin/rm", ["rm", "-f", "conftest.sh"], [/* 31 vars */]) = 0
2731 execve("/usr/bin/expr", ["expr", "99", "+", "1"], [/* 31 vars */]) = 0
2736 execve("/bin/ln", ["ln", "-s", "conf2693.file", "conf2693"], [/* 31 vars */]) = 0
As you can see, strace can be used by programmers and non-programmers alike. Granted, its output is not always easy to understand, but a lot of the output does not concern most people (i.e., memory allocation calls). You learn to sift through the output and choose what you might need: permission denied errors, no such file or directory errors, and so forth.
| Links |
printed from DevChannel, Debugging with strace on 2003-11-29 19:36:11