Resources for Enterprise Developers

Title    Debugging with strace
Date    Friday October 24, @03:56PM
Author    Gerard Beekmans
Topic    Debugging tools

In an ideal world, whenever a program cannot perform a function it exits with a useful error message, giving you enough clues to fix the problem. Unfortunately, we do not live in an ideal world -- at least, not all the time. Sometimes when an application encounters an error, you are none the wiser.

This is where debugging tools come in. One tool I cannot live without is called strace which is a system call tracer. You can trace calls made by a program that is already running (bind strace to an existing PID) or you can tell strace to start a process to examine.

Let me demonstrate the use of strace with a few real world examples.

Problems with starting KDE

Some time ago I was debugging a problem starting KDE. The error message that was returned did not give me any clues:

_KDE_IceTransSocketCreateListener: failed to bind listener
_KDE_IceTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_KDE_IceTransMakeAllCOTSServerListeners: failed to create listener for local
Cannot establish any listening sockets DCOPServer self-test failed.

To me, this did not mean much. All I got out of this was that dcopserver, a crucial KDE program used for inter-process communication, could not be started. The other thing I got out of the error was its relation to the ICE (Inter Client Exchange) protocol that is part of XFree86. Other than that, it left me in the dark as to what was causing the problem.

I decided to deploy strace to look behind the scenes to see what went on when dcopserver tried to start by running dcopserver:

strace -f -F -o ~/dcop-strace.txt dcopserver

The -f -F options tell strace to also trace fork(2) and vfork(2) processes respectively, in case there are any. The -o option saves all strace output to the dcop-strace.txt file, and dcopserver is the program strace will run and trace.

After I got the error again, I had a look at the dcop-strace.txt file. There is a lot of information in here including all the system calls that were made. The relevant parts are the system calls that were made just before the error occurred:

27207 mkdir("/tmp/.ICE-unix", 0777)     = -1 EEXIST (File exists)
27207 lstat64("/tmp/.ICE-unix", {st_mode=S_IFDIR|S_ISVTX|0755, st_size=4096, ...}) = 0
27207 unlink("/tmp/.ICE-unix/dcop27207-1066844596") = -1 ENOENT (No such file or directory)
27207 bind(3, {sin_family=AF_UNIX, path="/tmp/.ICE-unix/dcop27207-1066844596"}, 38) = -1 EACCES (Permission denied) 
27207 write(2, "_KDE_IceTrans", 13)     = 13
27207 write(2, "SocketCreateListener: failed to "..., 46) = 46
27207 close(3) = 0 27207 write(2, "_KDE_IceTrans", 13)     = 13
27207 write(2, "SocketUNIXCreateListener: ...Soc"..., 59) = 59
27207 umask(0) = 0 27207 write(2, "_KDE_IceTrans", 13)     = 13
27207 write(2, "MakeAllCOTSServerListeners: fail"..., 64) = 64
27207 write(2, "Cannot establish any listening s"..., 39) = 39

The first line shows that something tried to create the /tmp/.ICE-unix directory with mode 0777. This failed because the directory already existed. The second call (lstat64) retrieves the status of the directory and it shows that it has mode 0755. This is the first clue something is wrong. The system call wanted to create the directory with mode 0777, but the existing directory has mode 0755. The third call (unlink) tried to remove a file, but it did not exist. This was of no consequence because it was just trying to remove a potential old file.

However, the fourth line confirmed my suspicion. It tried to bind to /tmp/.ICE-unix/dcop27207-1066844596 which failed with a permission denied error. The .ICE-unix directory is owned by root and group root with only the owner having write permission. A non-root user can never create a file in that directory. If its permissions were set to mode 0777 it would work and this is what the mkdir command tried to accomplish.

So I ran chmod 0777 /tmp/.ICE-unix and tried to start KDE again. This time KDE started properly. The problem was solved; and it only took a few minutes to run strace and interpret the output file.

Note: running chmod 0777 was a test. I do not like having directories world-writable without having the sticky bit set on the directory. Sticky bits on a directory prevent people from removing other people's files regardless of a directory being world writable. Often you will find your /tmp directory having the sticky bit set for this exact reason. After KDE worked again, I set the sticky bit by running chmod +t /tmp/.ICE-unix.

Resolving library-related problems

Another use of strace is to aid in resolving library related problems. When you run ldd on a program, it tells you which libraries the program depends on and where the libraries are found. However, if you are on a system with an old Glibc version (old as in around version 2.2 or older) you may have a buggy ldd program. The problem with this version is that it may find a library in one directory, but upon actually running the program the dynamic linker (/etc/ may find it in another place. This is usually caused by /etc/ and /etc/ being out of sync, or the /etc/ file being corrupted. Since I cannot seem to reproduce this on my Glibc-2.3.2 system that bug might have been squashed already.

Additionally, ldd does not show all the libraries that a program might use. The system call dlopen(3) can load a dynamic library on demand, and these libraries do not show up in the ldd output. The Name Server Switch (NSS) libraries, which are part of Glibc, are prime examples of this. One of the things NSS does is tell applications where the system's password database is located. Applications do not link to the NSS libraries directly. These libraries are loaded by Glibc itself using this dlopen system call. If they happen to be missing, you will not be told about it, but applications will stop being able to resolve things like userids into usernames. Let me demonstrate this:

The whoami program tells you what your username is. This can be useful when running from a script that needs to know the username of the person running it. Sample output when running whoami:

# whoami

Say something went wrong during a Glibc upgrade and the NSS library that is required to do userid to username mapping ended up missing. We can simulate this by moving the /lib/ symbolic link out of the way:

# mv /lib/ /lib/ whoami
whoami: cannot find username for UID 0 

As you can see, there is something wrong now. Userids are not resolved into usernames anymore, but there is nothing from the whoami output that tells you there is a problem with a library. The ldd output also does not show anything useful:

# ldd /usr/bin/whoami => /lib/ (0x4001f000)
    /lib/ => /lib/ (0x40000000)

All you see here is that and are required for whoami to run. It does not mention anything about other libraries that are also required for proper operation.

Here is where strace comes in handy again. Run the following to strace the whoami output:

strace -o whoami-strace.txt whoami

If you look at the whoami-strace.txt file you will see that there are calls to find the file in various directories, but none were found:

open("/lib/", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib/i686/mmx/", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686/mmx", 0xbffff190)     = -1 ENOENT (No such file or directory) 
open("/lib/i686/", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686", 0xbffff190)         = -1 ENOENT (No such file or directory)
open("/lib/mmx/", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/mmx", 0xbffff190)          = -1 ENOENT (No such file or directory) 
open("/lib/", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib", {st_mode=S_IFDIR|0755, st_size=2352, ...}) = 0
open("/usr/lib/i686/mmx/", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686/mmx", 0xbffff190) = -1 ENOENT (No such file or directory) 
open("/usr/lib/i686/", O_RDONLY) = -1 ENOENT (No such file or directory)

Without a program like strace it was not obvious that a library was missing. With strace at least you have a place to start. Now it is a simple matter of figuring out where went and putting it back in place.

Limit strace to trace only specific system calls

If you already know what you are looking for, you can tell strace to show only a certain type of call. Say you wanted a list of all programs run by a configure script. The system call you will be looking for is execve. To tell strace to show only that call use:

strace -f -o configure-strace.txt -e execve ./configure

The output might look something along the lines of:

2720  execve("/usr/bin/expr", ["expr", "a", ":", "\\(a\\)"], [/* 31 vars */]) = 0
2725  execve("/bin/basename", ["basename", "./configure"], [/* 31 vars */]) = 0
2726  execve("/bin/chmod", ["chmod", "+x", ""], [/* 31 vars */]) = 0
2729  execve("/bin/rm", ["rm", "-f", ""], [/* 31 vars */]) = 0
2731  execve("/usr/bin/expr", ["expr", "99", "+", "1"], [/* 31 vars */]) = 0
2736  execve("/bin/ln", ["ln", "-s", "conf2693.file", "conf2693"], [/* 31 vars */]) = 0

As you can see, strace can be used by programmers and non-programmers alike. Granted, its output is not always easy to understand, but a lot of the output does not concern most people (i.e., memory allocation calls). You learn to sift through the output and choose what you might need: permission denied errors, no such file or directory errors, and so forth.


  1. "strace" -
  2. "KDE" -
  3. "XFree86" -

Copyright 2003 - DevChannel, All Rights Reserved

printed from DevChannel, Debugging with strace on 2003-11-29 19:36:11