Investigating Unkillable Processes On Unix-Like Systems

Understanding Unkillable Processes

An unkillable process refers to any process on a Unix-like operating system that cannot be terminated with conventional kill signals like SIGTERM or SIGKILL. These defiant processes continue running despite attempts to shut them down, often requiring special intervention to eliminate.

Common causes leading to unkillable processes include:

Processes stuck in uninterruptible sleep (D state)
Processes waiting on unavailable resources like network, memory, or I/O
Processes with state corruption bugs freezing their operation
Processes the kernel shields from standard kill signals, such as init (PID 1) and kernel threads

Unkillable processes can waste system resources like CPU cycles, memory, and disk I/O. They can also trigger cascading failures when dependent services get blocked waiting on the troubled processes. So detecting and properly eliminating unkillable processes is an important administrator skill.

Identifying Unkillable Processes

The first step in managing unkillable processes is finding them. This involves inspecting the process list and understanding process states.

Checking Process Status with ps

The ps command lists detailed information about active processes, including their state codes. Important states related to unkillable processes include:

D state – uninterruptible sleep, process waiting on I/O
R state – running, processing instructions normally
S state – interruptible sleep, process waiting on event
T state – stopped by a job-control signal such as SIGSTOP (lowercase t marks a debugger tracing stop)
Z state – zombie, terminated but not reaped

Run ps aux to see all processes with their states. Then inspect output for any concerning or abnormal states:

$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2  19232  7988 ?        Ss   Jan05   0:07 /usr/lib/systemd/systemd --switched-root --system --deserialize 22

In particular, pay close attention to any process in uninterruptible sleep (D state). The kernel defers signal delivery, including SIGKILL, until the blocking call (usually I/O) returns, so SIGTERM and SIGKILL appear to have no effect.
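As a minimal sketch of this inspection step, the Python below filters ps output for processes whose state code starts with D or Z. The helper names find_by_state and list_suspects are hypothetical, and the column selection (pid, stat, comm) is just one reasonable choice.

```python
import subprocess

def find_by_state(ps_output, states=("D", "Z")):
    """Return (pid, stat, command) tuples whose STAT code starts with one of `states`."""
    matches = []
    for line in ps_output.splitlines():
        # Columns: pid, stat, command (the command itself may contain spaces)
        fields = line.strip().split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        pid, stat, command = fields
        if stat[0] in states:
            matches.append((int(pid), stat, command))
    return matches

def list_suspects():
    """Run ps and return the D/Z-state processes it reports (hypothetical helper)."""
    # An empty header after '=' on every column suppresses the header line.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "stat=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_by_state(out)
```

Calling list_suspects() on a healthy system will usually return an empty list; D-state entries that persist across several calls are the ones worth investigating.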

Investigating Process Details

For any concerning processes found with ps, dig deeper using utilities like strace, lsof, and top, and by reading files under /proc.

Strace displays the system calls and signals of a running process, showing where it is stuck (attaching to a process in D state may itself block until that process wakes):

# strace -p <PID>

Lsof lists the open file handles that may be blocking process termination:

# lsof -p <PID>

Top provides real-time monitoring of resources used per process like CPU, memory, and execution state.

Reading proc files such as /proc/<PID>/status, /proc/<PID>/stack, and /proc/<PID>/task can also provide insight.

Use these diagnostic tools to pinpoint why a specific process is in an unkillable state.
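The proc files mentioned above are plain text, so they are easy to read from a script. The following is a rough sketch, assuming a Linux /proc layout; the function names and the proc_root parameter are hypothetical and exist only to keep the example testable.

```python
from pathlib import Path

def parse_status(text):
    """Parse the 'Key: value' lines of a /proc/<PID>/status file into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def describe(pid, proc_root="/proc"):
    """Summarize name, state, and kernel wait channel for one PID."""
    base = Path(proc_root) / str(pid)
    status = parse_status((base / "status").read_text())
    try:
        # wchan names the kernel function the task is sleeping in; it may be
        # "0" or unreadable depending on kernel configuration and privileges.
        wchan = (base / "wchan").read_text().strip() or "-"
    except OSError:
        wchan = "-"
    return {"name": status.get("Name"), "state": status.get("State"), "wchan": wchan}
```

For a process stuck in D state, the State field reads "D (disk sleep)", and the wait channel, when available, hints at which kernel path it is blocked in.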

Example Code for Listing Processes

Here is some sample code for programmatically iterating processes and their states in Python. Adjust as needed for unkillable process hunting:

#!/usr/bin/env python3
import psutil

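# Report processes in uninterruptible sleep (D) or zombie (Z) state
suspect_states = {psutil.STATUS_DISK_SLEEP, psutil.STATUS_ZOMBIE}

for process in psutil.process_iter(['pid', 'name', 'username', 'status']):
    if process.info['status'] in suspect_states:
        print(process.info)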

Terminating Unkillable Processes

Once unkillable processes are identified, administrators need mechanisms to eliminate them. This may involve specialized kill signals, restarting related services, or process state tools.

Using Kill Signals

When SIGTERM fails to terminate a process, a few other signals can help diagnose it or nudge it along:

SIGSTOP – Pauses process execution
SIGCONT – Resumes stopped process
SIGUSR1/SIGUSR2 – Application-defined signals

SIGSTOP and SIGCONT do not terminate anything themselves, and no signal takes effect on a process in uninterruptible sleep until its blocking kernel call returns. SIGCONT is useful when the target is merely stopped (T state): a pending SIGTERM is delivered only once the process resumes. SIGUSR1 and SIGUSR2 have application-specific meanings, and a process that installs no handler for them is terminated by default, so check the application's documentation first.

Send them when the process state suggests they apply:

# kill -SIGCONT <PID>
# kill -SIGUSR1 <PID>

Leveraging SIGKILL

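SIGKILL (-9) cannot be caught, blocked, or ignored, so the kernel ends the process without giving it a chance to shut down gracefully. That makes it a last resort: unflushed data is lost, and application files or locks may be left in an inconsistent state. SIGKILL still cannot remove a zombie, and a process in D state dies only once its blocking kernel call returns.

# kill -9 <PID>

So reserve SIGKILL for processes that ignore SIGTERM or are actively causing damage. Cleaner options are preferable.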
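One common way to keep SIGKILL as a fallback is to send SIGTERM first and escalate only after a grace period. The sketch below illustrates that pattern under some assumptions: it relies on the Linux /proc filesystem to detect exit, the function names are hypothetical, and it does not guard against the PID being reused by an unrelated process during the wait.

```python
import os
import signal
import time

def is_running(pid):
    """Return True if the PID exists and is not a zombie (Linux /proc only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return False  # no /proc entry: the process is gone
    # The state letter is the first field after the parenthesized command name.
    state = data.rsplit(")", 1)[1].split()[0]
    return state != "Z"

def terminate_with_escalation(pid, grace=5.0, poll=0.1):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    Returns the name of the last signal sent. A process in D state may
    outlive both signals until its blocking kernel call returns.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not is_running(pid):
            return "SIGTERM"
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "SIGTERM"  # it exited just after the final check
    return "SIGKILL"
```

A return value of "SIGKILL" only means the signal was sent; checking is_running(pid) afterwards shows whether the process actually went away.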

Example Kill Command

Here is sample Python code to terminate processes by PID based on input arguments:

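#!/usr/bin/env python3
import os
import sys

pid = int(sys.argv[1])
sig = int(sys.argv[2])  # e.g. 9 for SIGKILL

try:
    # os.kill only sends the signal; it does not wait for the process to exit
    os.kill(pid, sig)
except OSError as err:
    # Covers a missing PID (ESRCH) and insufficient permission (EPERM)
    print(f"Failed to send signal {sig} to PID {pid}: {err}", file=sys.stderr)
    sys.exit(1)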

This allows flexibility in targeting specific processes and kill signals from the command line:

$ python3 kill_script.py 32415 9

Restarting Hanging Services

Sometimes an unkillable process is a symptom of a broader application or service fault, and terminating the process alone does not resolve the underlying issue.

In these scenarios, administrators need to restart the affected applications and services to kill the associated processes and refresh state.

Identifying Affected Services

To identify which services an unkillable process relates to, use utilities like:

ps – Look up the service by process name and details
lsof – Show files and sockets held by the process
netstat – Display active network connections

These will correlate the rogue process back to related services needing restart.
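On systemd hosts, another way to make this correlation is to read the process's control group, since systemd places each service in a cgroup named after its unit. The sketch below assumes that layout; unit_from_cgroup and unit_for_pid are hypothetical helper names, and hosts without systemd will simply yield None.

```python
def unit_from_cgroup(cgroup_text):
    """Return the systemd service unit named in /proc/<PID>/cgroup content, or None."""
    for line in cgroup_text.splitlines():
        # Each line has the form hierarchy-ID:controller-list:cgroup-path
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        # Walk the path from the innermost cgroup outwards.
        for component in reversed(parts[2].split("/")):
            if component.endswith(".service"):
                return component
    return None

def unit_for_pid(pid):
    """Look up the service unit for a live PID (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as f:
        return unit_from_cgroup(f.read())
```

The returned name, for example httpd.service, can then be passed to systemctl status or systemctl restart.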

Restarting via Init Scripts

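Most server applications ship with init scripts or systemd unit files that make restarting a service straightforward. Relevant locations include:

/etc/init.d/ – SysV-style init scripts
/usr/sbin/ – daemon binaries and the service command itself
/etc/rc.d/ – rc scripts on BSD and older Red Hat-style systems

Use the service or systemctl commands rather than running these files by hand: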

# service mysqld restart
# systemctl restart crond

This stops and restarts the service's processes, which often clears the stuck one. A process in D state may still linger until its blocked I/O completes.

Example Service Restart

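Here is a simple Bash script that restarts a service, such as httpd, when a process of the same name is running:

#!/bin/bash
# Usage: restart_service.sh <name>
# Assumes the process name matches the systemd unit name (true for httpd).

process_name=${1:?usage: $0 <service-name>}

# pgrep -c prints the number of matching processes
if [[ $(pgrep -c "$process_name") -gt 0 ]]; then
  echo "Restarting $process_name service"
  systemctl restart "$process_name"
else
  echo "$process_name not running"
fi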

Invoke it when httpd processes seem stuck:

$ bash restart_service.sh httpd
Restarting httpd service

Often restarting associated services will eliminate troublesome processes.

Preventing Unkillable Processes

Unkillable processes are difficult to avoid entirely, but proactive configuration can make them less frequent.

Setting Resource Limits

Resource limits restrict how much of a system resource a process can consume. This guards against runaway processes.

Common resource limits to enforce include:

RLIMIT_CPU – max CPU seconds per process
RLIMIT_NOFILE – max open files per process
RLIMIT_DATA – max data segment size per process

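Set limits with the ulimit shell builtin, the prlimit command, or the setrlimit() system call in code, sized to the application's real needs. Limits that are too aggressive cause failures of their own, such as a server refusing connections once it runs out of file descriptors.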
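From Python, setrlimit() is exposed through the standard resource module. The sketch below starts a child command with lowered soft limits for CPU time and open files; run_limited and its default values are hypothetical, and preexec_fn is used for brevity even though the subprocess documentation warns it is unsafe in multithreaded programs.

```python
import resource
import subprocess

def _clamp(wanted, hard):
    """Keep the requested soft limit at or below the existing hard limit."""
    if hard == resource.RLIM_INFINITY:
        return wanted
    return min(wanted, hard)

def run_limited(cmd, cpu_seconds=60, max_open_files=1024):
    """Run `cmd` with soft RLIMIT_CPU and RLIMIT_NOFILE applied to the child only."""
    def apply_limits():
        # Runs in the child between fork() and exec(), so the parent is unaffected.
        for res, wanted in (
            (resource.RLIMIT_CPU, cpu_seconds),
            (resource.RLIMIT_NOFILE, max_open_files),
        ):
            _soft, hard = resource.getrlimit(res)
            resource.setrlimit(res, (_clamp(wanted, hard), hard))

    return subprocess.run(cmd, preexec_fn=apply_limits, capture_output=True, text=True)
```

Only the soft limits are lowered here, so the child could raise them again up to the hard limit; lowering the hard limit as well would make the cap permanent for that process.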

Enforcing Process Quotas

Setting quotas around total processes created by users can also limit process storms.

Edit /etc/security/limits.conf to set per-user/group process limits:

*                hard    nproc           512
@student        soft    nproc           32

Configuring the OOM Killer

The Linux out-of-memory (OOM) killer terminates processes when memory gets critically low.

Tune the OOM killer thresholds via /proc/sys/vm/ files like:

overcommit_ratio
panic_on_oom
oom_kill_allocating_task

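Also adjust /proc/<PID>/oom_score_adj (range -1000 to 1000) to influence which processes are killed first. Higher values make a process a more likely victim, and -1000 exempts it.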
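Adjusting oom_score_adj amounts to writing an integer to a proc file. The following is a small sketch of doing so from Python, assuming a Linux host; the function names and the proc_root parameter are hypothetical, and lowering the value for a process generally requires root privileges.

```python
from pathlib import Path

OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000

def read_oom_score_adj(pid, proc_root="/proc"):
    """Return the current oom_score_adj of a process as an int."""
    return int((Path(proc_root) / str(pid) / "oom_score_adj").read_text())

def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write a new oom_score_adj after checking it is within the kernel's range."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be between {OOM_ADJ_MIN} and {OOM_ADJ_MAX}")
    # Raises PermissionError if the caller may not change this process's score.
    (Path(proc_root) / str(pid) / "oom_score_adj").write_text(f"{value}\n")
```

For services managed by systemd, the same setting is usually better expressed in the unit file so that it survives restarts.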

Careful OOM killer settings help contain runaway memory usage before stuck processes wedge the system. Set incorrectly, they introduce instability from overly eager killing.