Zombie Process: Killing the Undead

Is your Ubuntu MOTD warning you of a zombie process?

Welcome to Ubuntu 11.10 (GNU/Linux 3.0.0-20-server x86_64)

* Documentation: https://help.ubuntu.com/11.10/serverguide/C

System information as of Thu Jun 28 18:36:57 EDT 2012

  System load:  0.0               Processes:           94
  Usage of /:   68.2% of 1.79TB   Users logged in:     1
  Memory usage: 29%               IP address for eth0: 10.0.0.10
  Swap usage:   8%

=> There is 1 zombie process.

What’s the scoop with that last line, “There is 1 zombie process”? Is my operating system getting caught up in the current climate of zombie infatuation? Sadly, no; it’s more boring than that. A zombie process is what remains when a child process ends but its parent hasn’t yet “reaped” it by collecting its exit status. For a much better rundown on what a zombie process is, check out the Wikipedia article: Zombie process.
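If you’d like a zombie of your own to poke at, here is a minimal sketch of one way to make one. It assumes Linux with /proc mounted, and the variable names are just for this example. The subshell starts “sleep 1” as a child and then replaces itself with “sleep 10”, which never waits for children, so the finished “sleep 1” has nobody to reap it until “sleep 10” exits.

```shell
#!/bin/bash
# Create a short-lived zombie for experimentation (assumes Linux with /proc).
# The subshell forks "sleep 1", then exec replaces the subshell with "sleep 10".
# sleep never calls wait(), so the finished "sleep 1" stays a zombie
# until "sleep 10" exits and the orphan is adopted and reaped.
(sleep 1 & exec sleep 10) >/dev/null 2>&1 &
parent=$!

sleep 2   # give the child time to exit

# Scan /proc for children of $parent and show their state.
# In /proc/<pid>/stat, field 3 is the state and field 4 the parent pid.
zombie_state=""
for stat in /proc/[0-9]*/stat; do
    read -r pid comm state ppid rest 2>/dev/null < "$stat" || continue
    if [ "$ppid" = "$parent" ]; then
        zombie_state=$state
        echo "pid $pid $comm is in state $state"
    fi
done
```

The zombie cleans itself up after about ten seconds, when “sleep 10” exits.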

Here is a quick rundown of some terminology. A process is just a fancy name for a running instance of a program. A child process (or just “child”) is a process started by another process. The process that did the starting is the child’s “parent process”.

With no options, the ‘ps’ command shows the processes we are running in the current terminal.

user@host:~$ ps
PID TTY TIME CMD
5828 pts/4 00:00:00 bash
6122 pts/4 00:00:00 ps

The ‘pstree’ command can show a family tree (of sorts) for processes, parents, their children, the children of their children, etc. Our shell is bash, and as we can see in the output from ‘ps’ above, the process ID number (pid) of our bash prompt is 5828.

user@host:~$ pstree -Gpl 5828
bash(5828)───pstree(6123)

Here we can see that bash is the parent process of the pstree command itself when we run it from the bash prompt. The pstree command exits shortly after it displays this information, and bash will go back to being childless. If we run another instance of bash from inside of the current bash prompt, the new bash instance will be a child of the first.

user@host:~$ bash
user@host:~$ pstree -Gpl 5828
bash(5828)───bash(6124)───pstree(6389)

So you can see our original bash process, pid 5828, has begotten a new child bash process with pid 6124. The new bash process is where we are running the ‘pstree’ command from, so pstree is a child of 6124.

For an interesting look at your system’s family tree, try running ‘pstree -Gpl 1’.

Hopefully you have a good handle on the whole parent/child thing. Now we’ll go zombie hunting. The system has told us that there is a zombie, but we know nothing about it. The ps command has options that will print the status of a process in a column of its output.

root@host:~# ps aux |grep Z
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2925  0.0  0.0   9256   880 pts/2    S+   18:40   0:00 grep Z
root     28766  0.0  0.0      0     0     ?    Z    Jun06   0:00 [apt] <defunct>

Here we’ve used the ‘grep’ command to search for the pattern “Z”. Because there is a “Z” in the “VSZ” column header, the ‘ps’ header line is printed too, and the grep command shows up as well because “Z” appears in its own command line. Over in the “STAT” column we can see that something called “[apt]” has the mark of the zombie (Z).
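To match on the STAT column alone, and skip the stray hits from the header and from grep itself, something like the following should work. This is a sketch assuming the procps version of ps; only_zombies is a helper name made up for this example.

```shell
#!/bin/bash
# Keep the header line plus any row whose third column (STAT) starts with Z.
# Zombies can also show up as Zs, Z+ and so on, hence the prefix match.
only_zombies() {
    awk 'NR == 1 || $3 ~ /^Z/'
}

# Columns requested from ps: pid, parent pid, state, command name.
ps -eo pid,ppid,stat,comm | only_zombies
```

Asking ps for the ppid column has the side benefit of showing each zombie’s parent right away.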

At this point you might be thinking about using the ‘kill’ command to kill this zombie dead. The problem with killing a zombie is that, by definition, it is already dead. Unlike in motion pictures, the way to kill a Linux zombie isn’t by shooting it in the head, but by killing its parent (maybe we should call them vampires instead?).

Both ‘kill’ and ‘kill -9’ are futile against zombies.

root@host:~# kill 28766
root@host:~# ps aux |grep Z
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2925  0.0  0.0   9256   880 pts/2    S+   18:40   0:00 grep Z
root     28766  0.0  0.0      0     0     ?    Z    Jun06   0:00 [apt] <defunct>
root@host:~# kill -9 28766
root@host:~# ps aux |grep Z
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2925  0.0  0.0   9256   880 pts/2    S+   18:40   0:00 grep Z
root     28766  0.0  0.0      0     0     ?    Z    Jun06   0:00 [apt] <defunct>

So what gives? The ‘kill’ command is for killing processes, so what good is it if we can’t kill these processes? Zombie processes have already ended; they are no more. Ghosts might be a more fitting term: traces of them exist in the system (an entry in the process table holding the exit code), but they are no longer functioning. They are waiting to gasp their last breath of exit code to their parent and have that entry wiped from the face of the system. The problem is that the parent isn’t cooperating. It’s conceivable that this ignorance of its child’s death is intentional, but it is rare for a zombie condition to persist by design. If you see a zombie process and it doesn’t clear itself up in a moment, there is a good chance you’ll need to take matters into your own hands... or, you know, just ignore it.
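For contrast, here is what a cooperative parent looks like. In a shell script, reaping a child comes down to calling ‘wait’ on it, which hands back the child’s exit code. A small sketch; the exit status 7 is arbitrary and only there so we can recognize it.

```shell
#!/bin/bash
# Start a child that exits with status 7, then collect that status with wait.
sh -c 'exit 7' &
child=$!

status=0
wait "$child" || status=$?   # wait returns the child's exit status
echo "child $child exited with status $status"
```

A parent that never gets around to this step is what leaves a zombie behind.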

Using pstree we can find child pids if we know the parent, but how can we find the parent pid number from a child pid?

Finding your own parent PID is easy in bash; it’s stored in the PPID variable.

user@host:~$ echo $PPID
5828

What about finding the parent of an arbitrary process ID? If you have the proc file system (you probably do), you can see lots of information about a given process including the parent pid by looking at the ‘stat’ file for that pid.

root@host:~# cat /proc/6124/stat
6124 (bash) S 5828 6124 5828 34820 6398 4202496 1166 3867 0 0 4 0 0 1 20 0 1 0 7883109 25640960 617 18446744073709551615 4194304 5111244 140736553088608 140736553087152 140358224629054 0 65536 3686404 1266761467 18446744071579277349 0 0 17 0 0 0 0 0 0

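The 4th value in the ‘stat’ file is ppid, the “parent pid” of the process. (Counting values by whitespace assumes the process name in parentheses, the 2nd value, contains no spaces.)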

The ‘stat’ file for any pid on a procfs-enabled system can be found at /proc/[pid]/stat, where [pid] is replaced with the pid number you are interested in. For a description of the ‘stat’ file format, search for ‘/proc/[pid]/stat’ at the URL below:
http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html

To see just the parent pid number and ignore the other information we’re not currently interested in, we can use the ‘awk’ command to select only the 4th field.

root@host:~# awk '{print $4}' /proc/6124/stat
5828
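An alternative that doesn’t depend on counting fields is the ‘status’ file in the same directory, which labels each value; the parent pid is on the line beginning with “PPid:”. A minimal sketch, where ppid_of is a helper name invented for this example:

```shell
#!/bin/bash
# Print the parent pid of the process whose pid is given as $1,
# using the labelled PPid: line in /proc/<pid>/status.
ppid_of() {
    awk '/^PPid:/ {print $2}' "/proc/$1/status"
}

ppid_of $$      # parent of the current shell
echo "$PPID"    # normally the same number
```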

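Armed with the information above, I’ve created a quick little zombie hunting script for use in the cron scheduler or on the command line. The script first tries to nudge the parent process into reaping its child by sending it the SIGCHLD signal. If the zombie is still around after that, the parent gets SIGKILL; once the parent is gone, the orphaned zombie is adopted by init (pid 1), which reaps it. Keep in mind that SIGKILL ends the parent for good, so make sure you can live without it.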

Zombie Hunter

#!/bin/bash
# Collect the pids of every process whose STAT begins with Z (zombie).
zombies=($(ps ax | awk '$3 ~ /^Z/ {print $1}'))
for zombie in "${zombies[@]}"
do
    echo "Found a zombie process $(awk '{print $2}' /proc/$zombie/stat) [pid:$zombie]"
    # The 4th value in the stat file is the parent pid.
    parent="$(awk '{print $4}' /proc/$zombie/stat)"
    echo "Asking parent process $(awk '{print $2}' /proc/$parent/stat) [pid:$parent] to come quietly..."
    kill -SIGCHLD "$parent"
    sleep 10 # This seems awfully patient
    # Escalate only if the zombie has not been reaped in the meantime.
    if [ -f /proc/$zombie/stat ]; then
        echo "Asking not so nicely"
        kill -9 "$parent"
    fi
    sleep 1
    if ! [ -f /proc/$zombie/stat ]; then
        echo "Zombie vanquished"
    fi
done

root@host:~# ./zombie-hunter
Found a zombie process (apt) [pid:28766]
Asking parent process (run-parts) [pid:28763] to come quietly...
Asking not so nicely
Zombie vanquished
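
Before letting a script send signals from cron, it may be worth running a report-only pass first. The following is a sketch under the same assumptions as above (Linux with /proc); report_zombies is a name made up for this example, and it sends no signals at all.

```shell
#!/bin/bash
# List zombies and their parents without signalling anything.
# /proc/<pid>/status labels its values, so names with spaces are handled.
report_zombies() {
    for status in /proc/[0-9]*/status; do
        awk '
            /^Name:/  { sub(/^Name:[ \t]+/, ""); name = $0 }
            /^State:/ { state = $2 }
            /^Pid:/   { pid = $2 }
            /^PPid:/  { ppid = $2 }
            END {
                if (state == "Z")
                    print "zombie " name " [pid:" pid "] parent [pid:" ppid "]"
            }
        ' "$status" 2>/dev/null
    done
}

report_zombies
```

If the parent it names is something you can restart safely, the hunter above can take it from there.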