Dealing With Zombie Processes On Linux And Unix

Understanding Zombie Processes

A zombie process, also called a defunct process, refers to a process that has completed execution but still has an entry in the process table. This occurs when a child process has terminated but the parent process has not cleaned up or awaited its termination via the wait() system call.

Some key attributes of zombie processes:

  • They consume a slot in the process table and a PID.
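  • They use no CPU time and almost no memory; only the small process-table entry holding the exit status remains.
  • They cannot be killed, as they are already dead; signals, including SIGKILL, have no effect on them.
  • A short-lived zombie is normal; one that lingers usually indicates a bug in the parent's cleanup code.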

Zombie processes persist until the parent process calls wait() or waitpid() to read their exit status and other information. This allows the parent process to obtain necessary information from the child before it vanishes. However, if the parent fails to call wait(), the zombie will remain in the process table.
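The lifecycle described above can be observed directly. The following is a minimal sketch, not a canonical implementation: zombie_demo and proc_state are illustrative names, and reading the state letter from /proc/<pid>/stat is a Linux-specific assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the state letter from /proc/<pid>/stat, or '?' if the entry cannot
 * be read (for example because the process has already been reaped). */
static char proc_state(pid_t pid) {
  char path[64];
  char state = '?';
  snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
  FILE *f = fopen(path, "r");
  if (f == NULL) return '?';
  /* Fields: pid (comm) state ... ; skip the pid and the parenthesised name. */
  if (fscanf(f, "%*d (%*[^)]) %c", &state) != 1) state = '?';
  fclose(f);
  return state;
}

/* Fork a child that exits at once and record its state before and after the
 * parent reaps it. Returns 0 on success, -1 on failure. */
int zombie_demo(char *before, char *after) {
  pid_t pid = fork();
  if (pid == -1) return -1;
  if (pid == 0) _exit(0);                 /* child: terminate immediately */

  /* Wait until the child has exited WITHOUT reaping it: WNOWAIT leaves the
   * zombie in the process table. */
  siginfo_t info;
  if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1) return -1;
  *before = proc_state(pid);              /* zombie state while unreaped */

  if (waitpid(pid, NULL, 0) == -1) return -1;   /* reap the child */
  *after = proc_state(pid);               /* entry is gone after reaping */
  return 0;
}
```

Calling zombie_demo(&before, &after) from a test program should leave 'Z' in before and '?' in after on a Linux system with /proc mounted.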

What Causes Zombie Processes

The main causes of zombie processes are:

  1. Parent process not calling wait() after child terminates.
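  2. Parent process terminating before the child when the adopting process does not reap, for example a container whose PID 1 is an ordinary application rather than an init. A normal init reaps adopted children automatically.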
  3. Bugs, crashes, or other errors preventing parent wait() calls.

Problems with Zombie Processes

While individual zombies are harmless, a large number of accumulated zombies indicates an issue and can cause problems like:

  • Exhaustion of available PID numbers, preventing new processes from being created.
  • Unnecessary consumption of slots in the process table.
  • Confusing ps output displaying dead processes.

Identifying Zombie Processes

Zombie processes can be identified using commands like ps and top, where they have a state value of Z (zombie). pstree helps locate the parent responsible for them.

Identifying Zombies with ps

Using ps axo to include state information:

$ ps axo pid,ppid,comm,state    
   PID    PPID COMM       STATE
    1       0 init        S
   42       1 sshd        S
  912      42 sshd        S
 4567     912 bash        S
 5012    4567 zombie      Z

Here process 5012 is a zombie owned by parent PID 4567.

Identifying Zombies with top

In top, zombie processes show Z in the S (state) column, and the summary area reports how many exist:

top - 15:17:17 up 21:56,  1 user,  load average: 0.07, 0.02, 0.00
Tasks: 280 total,   1 running, 278 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si
KiB Mem :  2519036 total,   349216 free,  1561408 used,   928412 buff/cache
KiB Swap:  2097148 total,  2095900 free,     1248 used.   197452 avail Mem

The “Tasks” line includes a count of current zombie processes on the system.

Identifying Zombies with pstree

pstree shows child/parent process relationships, making it easier to find the parent of a zombie:

$ pstree
init─┬─crond
     ├─dockerd─┬─docker-containe─┬─sleep
     │         │                 ├─sleep
     │         │                 └─sleep
     │         ├─{dockerd}
     │         ├─{dockerd}
     │         └─{dockerd}
     ├─sshd─┬─sshd─┬─bash─┬─pstree
     │      │       └─sleep
     │      └─sshd───bash
     ├─systemd-journal
     └─systemd-udevd

Names shown in {} braces are threads of the parent process, not zombies. pstree does not mark zombies itself; use it to find the parent of a PID that ps or top reports in state Z (pstree -p prints the PIDs).

Preventing Zombie Processes

There are a few techniques that can help application developers prevent zombie processes from accumulating over time:

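  1. Always call wait() or waitpid() after forking child processes to properly clean up terminated children.
  2. Handle SIGCHLD signals promptly in the parent process to be notified of terminated children.
  3. Set SIGCHLD to SIG_IGN in the parent if it does not need child exit statuses; on Linux and other POSIX.1-2001 systems the kernel then reaps the children automatically.
  4. Use waitpid() with the WNOHANG flag so that reaping does not block the main program.
  5. Fork twice if the parent must never wait on a long-running child. Only a process's own parent can wait on it, so a separate helper cannot reap on the parent's behalf; with a double fork the intermediate child exits at once and is reaped immediately, while the grandchild is adopted and later reaped by init.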
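As an illustration of the SIG_IGN technique from the list above, here is a minimal sketch. spawn_without_zombie is an illustrative name, and the behavior relied on (wait calls failing with ECHILD once SIGCHLD is ignored) is what Linux and POSIX.1-2001 specify; older systems may differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ignore SIGCHLD, fork a child, and check that it left no zombie behind.
 * Returns 0 if the child was reaped automatically, 1 if it was still
 * waitable, and -1 on error. */
int spawn_without_zombie(void) {
  struct sigaction sa;
  sa.sa_handler = SIG_IGN;          /* ask the kernel to discard child status */
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  if (sigaction(SIGCHLD, &sa, NULL) == -1) return -1;

  pid_t pid = fork();
  if (pid == -1) return -1;
  if (pid == 0) _exit(0);           /* child: exit immediately */

  /* With SIGCHLD ignored, waitpid() blocks until the child is gone and then
   * fails with ECHILD, because there is no zombie left to collect. */
  if (waitpid(pid, NULL, 0) == -1 && errno == ECHILD) return 0;
  return 1;
}
```

The trade-off is that the parent can no longer learn how its children exited, so this suits fire-and-forget workers only.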

Additionally, some coding best practices can prevent bugs that lead to zombies:

  • Check all return codes from fork(), wait(), and other calls for errors.
  • Include cleanup sections to call wait() if processes crash or exit prematurely.
  • In languages with exceptions, wrap sections containing fork() calls in try/finally blocks so children are still waited on when an error occurs.
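The return-code checks above can be sketched in C as follows. run_child is a hypothetical helper used only for illustration; the child simply exits with the requested code as a stand-in for real work.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then reap it.
 * Returns the child's exit status, or -1 on any failure. */
int run_child(int code) {
  pid_t pid = fork();
  if (pid == -1) {                    /* fork() can fail, e.g. at PID limits */
    perror("fork");
    return -1;
  }
  if (pid == 0) _exit(code);          /* child: stand-in for real work */

  int status;
  pid_t r;
  do {
    r = waitpid(pid, &status, 0);
  } while (r == -1 && errno == EINTR);  /* retry if a signal interrupted us */
  if (r == -1) {
    perror("waitpid");
    return -1;
  }

  if (WIFEXITED(status)) return WEXITSTATUS(status);
  if (WIFSIGNALED(status))
    fprintf(stderr, "child killed by signal %d\n", WTERMSIG(status));
  return -1;
}
```

Retrying on EINTR matters because a parent that gives up after an interrupted waitpid() leaves exactly the kind of stray zombie this section is about.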

Killing Existing Zombie Processes

Since zombies are already dead, they cannot be killed directly. The only way to eliminate zombie processes is by acting on the parent process. Common techniques include sending signals to the parent prompting it to wait(), or just restarting the parent entirely.

Using SIGCHLD Signal

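SIGCHLD is the signal the kernel sends to a parent when a child process exits. A parent that handles it can reap each child as soon as it terminates. Sending SIGCHLD manually (kill -s SIGCHLD followed by the parent's PID) only helps if the parent already has such a handler, because the default action for SIGCHLD is to ignore it.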

An example SIGCHLD handler cleaning up zombies in C:

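#include <errno.h>
#include <sys/wait.h>

/* Reap every child that has already exited; never block. */
void sigchld_handler(int sig) {
  (void)sig;
  int saved_errno = errno;            /* waitpid() may overwrite errno */
  while (waitpid(-1, NULL, WNOHANG) > 0);
  errno = saved_errno;
}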

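The WNOHANG flag makes waitpid() return immediately when no exited child remains, so the loop reaps all existing zombies without blocking. The loop matters because several children exiting close together may produce only a single SIGCHLD.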

To install the signal handler to start catching SIGCHLD signals:

 
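/* Needs <signal.h>, <stdio.h>, and <stdlib.h>. */
struct sigaction sa;
sa.sa_handler = sigchld_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;  /* restart interrupted calls; skip stop/continue events */
if (sigaction(SIGCHLD, &sa, NULL) == -1) {
  perror("Error installing signal handler");
  exit(1);
}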

Now each SIGCHLD triggers the handler, which reaps whatever zombie children exist at that moment.
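To see the handler pattern at work, the sketch below forks several children and waits until a handler of the same shape has reaped them all. The names reap_n_children and counting_sigchld_handler, and the reaped counter, are illustrative additions rather than part of the code above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped = 0;   /* children collected so far */

/* Same loop as the handler above, but counting each reaped child. */
static void counting_sigchld_handler(int sig) {
  (void)sig;
  int saved_errno = errno;
  while (waitpid(-1, NULL, WNOHANG) > 0) reaped++;
  errno = saved_errno;
}

/* Fork n children that exit at once and return how many the handler reaped
 * (-1 if the handler could not be installed). */
int reap_n_children(int n) {
  struct sigaction sa;
  sa.sa_handler = counting_sigchld_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
  if (sigaction(SIGCHLD, &sa, NULL) == -1) return -1;

  /* Block SIGCHLD while forking so it is only delivered inside sigsuspend();
   * this avoids missing a signal between the check and the wait. */
  sigset_t block, old, wait_mask;
  sigemptyset(&block);
  sigaddset(&block, SIGCHLD);
  if (sigprocmask(SIG_BLOCK, &block, &old) == -1) return -1;
  wait_mask = old;
  sigdelset(&wait_mask, SIGCHLD);

  reaped = 0;
  int started = 0;
  for (int i = 0; i < n; i++) {
    pid_t pid = fork();
    if (pid == -1) break;             /* stop forking on failure */
    if (pid == 0) _exit(0);           /* child: exit immediately */
    started++;
  }

  /* Atomically unblock SIGCHLD and sleep until the handler has run. */
  while (reaped < started) sigsuspend(&wait_mask);

  sigprocmask(SIG_SETMASK, &old, NULL);
  return reaped;
}
```

Because pending SIGCHLD signals are not queued, several exits may arrive as one signal; the while loop inside the handler is what keeps the count correct in that case.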

Restarting Parent Processes

If the parent cannot be fixed or signaled into reaping, terminating it eliminates its zombies: they are re-parented to init (or to the nearest subreaper), which reaps them.

# Ask parent process ID 4567 to exit; use kill -9 only if it ignores SIGTERM
kill 4567

# A supervisor such as systemd may then restart the parent

This works because init adopts and reaps the orphaned zombies, not because a restarted parent waits on them; a new instance of the parent has no relationship to the old children.

Other Methods for Handling Zombies

In addition to signals and process restarts, there are a few other creative methods for handling zombie processes:

Init Re-Parenting

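When a parent exits, init (PID 1) adopts its children, including any zombies, and reaps them automatically; no signal is needed to trigger this.
Commands such as the following are sometimes suggested to force init to call wait():

kill -SIGSTOP 1
kill -SIGCONT 1

On Linux they have no effect: the kernel only delivers signals to PID 1 for which init has installed a handler, and SIGSTOP cannot be handled. Zombies that remain under PID 1 point to an init that does not reap, which is most common in containers that run an application as PID 1.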

Running a Dedicated Zombie Reaper

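A process can only reap its own children, so a standalone script or cron job cannot clean up zombies that belong to other processes. The shell's wait builtin likewise collects only children of that shell:

# Reaps only background jobs started by this shell; wait -n needs bash 4.3 or later
while true; do
  wait -n
  sleep 60
done

A dedicated reaper is effective when it is an ancestor of the processes involved, for example a minimal init running as PID 1 in a container or, on Linux, a process marked as a child subreaper with prctl(PR_SET_CHILD_SUBREAPER).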
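Since only an ancestor can collect orphaned descendants, one Linux-specific way to build a working reaper is the child subreaper flag. The sketch below is an assumption-laden illustration, not a production supervisor: subreaper_demo is a made-up name, and prctl(PR_SET_CHILD_SUBREAPER) requires Linux 3.4 or later.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Mark this process as a subreaper, create an orphaned grandchild, and reap
 * everything. Returns the number of processes reaped, or -1 on error. */
int subreaper_demo(void) {
  /* Orphaned descendants are now re-parented to us instead of to init. */
  if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == -1) return -1;

  pid_t child = fork();
  if (child == -1) return -1;
  if (child == 0) {
    pid_t grandchild = fork();
    if (grandchild == 0) {
      struct timespec ts = {0, 100000000L};   /* outlive the child: 100 ms */
      nanosleep(&ts, NULL);
      _exit(0);
    }
    _exit(0);               /* child exits first, orphaning the grandchild */
  }

  /* Reap until no children remain: first the child, then the grandchild
   * that was re-parented to this process. */
  int reaped = 0;
  for (;;) {
    if (wait(NULL) > 0) {
      reaped++;
      continue;
    }
    if (errno == EINTR) continue;   /* interrupted: try again */
    break;                          /* ECHILD: nothing left to reap */
  }
  return reaped;
}
```

When the intermediate child and the grandchild are both created successfully, the function collects two processes, showing that the grandchild never reached init.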

System Call Interception

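LD_PRELOAD can be used to interpose on library functions such as sigwaitinfo() in a buggy application, so that a wrapper running inside the application reaps its children after the original call returns. Build it as a shared library and link with -ldl on older glibc versions:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>

/* Wrapper loaded via LD_PRELOAD: call the real sigwaitinfo(), then reap. */
int sigwaitinfo(const sigset_t *set, siginfo_t *info) {
  /* Look up the next definition of sigwaitinfo, normally the one in libc. */
  int (*real_sigwaitinfo)(const sigset_t *, siginfo_t *) =
      (int (*)(const sigset_t *, siginfo_t *))dlsym(RTLD_NEXT, "sigwaitinfo");
  if (real_sigwaitinfo == NULL) { errno = ENOSYS; return -1; }

  int ret = real_sigwaitinfo(set, info);
  int saved_errno = errno;

  /* Reap every child that has already exited, without blocking. */
  while (waitpid(-1, NULL, WNOHANG) > 0);

  errno = saved_errno;
  return ret;
}

This helps work around waiting issues in closed-source or older applications. The wrapper discards exit statuses, so it only suits applications that never wait for their children themselves.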

When to Accept Zombies

While it’s generally best practice to reap child processes properly, for some transient or very short-lived parent processes, allowing temporary zombies is not harmful. Some examples:

  • Scripts or other single-use programs exiting immediately after children.
  • Tools used infrequently where zombies last milliseconds.
  • Batch processing programs not needing child exit codes.

As long as they do not accumulate faster than init or a reaper process can catch them, a few transient zombies are often acceptable.

Summary – Best Practices for Zombie Reaping

Preventing zombie accumulation boils down to a few key practices:

  • Calling wait()/waitpid() reliably after forking children.
  • Handling SIGCHLD promptly in parent processes.
  • Setting SIGCHLD to SIG_IGN if unneeded.
  • Checking all return codes and errors.
  • Having a reaping strategy in place for stray zombies.

Properly written and structured applications will avoid zombies. But unplanned terminations or bugs can still cause zombie processes to accumulate over time, so having reactive zombie handling and reaping strategies is crucial for system stability.
