Improving Shell Script Reliability And Robustness

Handling Errors Gracefully

Robust shell scripts anticipate, detect, handle and recover from errors gracefully. Instead of failing catastrophically, they continue working or fail safely. Several techniques help make scripts more resilient.

Checking Return Codes to Detect Errors

Most commands return an exit code indicating success or failure. By checking these return codes after critical commands, scripts can detect errors and take appropriate actions:

rsync -avz /source /destination
rc=$?
if [ "$rc" -ne 0 ]; then
  echo "Backup failed with exit code $rc." >&2
  cleanup
  exit 1
fi

Here the return code is checked after an rsync command to detect backup failures. The script then logs the failure, runs cleanup procedures, and exits with a non-zero failure code.

Trapping Errors with Bash’s trap

Bash’s built-in trap command lets you specify handlers for signals and errors in scripts. By trapping errors, robust scripts can avoid unexpected failures:

trap 'echo "Error on line $LINENO" >&2' ERR

rm important_file
echo "Deleting complete" # Still runs; the ERR trap logs the error without stopping the script

Here trap intercepts the ERR condition. If the rm command fails, the handler runs and logs the error, and execution then continues so the script can attempt recovery.

Displaying Clear Error Messages

Informative error messages are critical when things go wrong. They let users and support staff quickly diagnose and fix problems. Best practices include:

  • Logging errors to STDERR with echo >&2
  • Including key variables like $LINENO and exit statuses $?
  • Adding context, e.g. “Backup failed on serverXYZ”

For example:

tar cf archive.tar /var/logs &>>backup.log || {
  rc=$?
  echo "ERROR: Backup failed on $HOSTNAME with code $rc." >&2
  echo "TAR command was: tar cf archive.tar /var/logs" >&2
}

This example shows a script backing up logs, with clear error handling that logs context and details to STDERR on failures.

Allowing the Script to Fail Safely

When failure is unavoidable, robust scripts should terminate gracefully to avoid data loss or corruption issues. Strategies like cleanup sections, final traps and locking prevent unsafe termination:

# Register the cleanup handler before doing any risky work
cleanup() {
  rm -f /tmp/script.lock
}
trap cleanup EXIT

# Perform critical action first
transfer_data

# Entering unsafe zone: even if the script dies in here,
# the EXIT trap still removes the lock file
lock /tmp/script.lock
delete_temp_files
remove_lock /tmp/script.lock

This script uses file locking and an EXIT trap to guarantee removal of the lock file. The overall workflow can still fail, but an unsafe exit no longer causes collateral damage.

Defensive Coding Practices

Carefully written, risk-averse code avoids assumptions and prevents errors. Defensive tactics like input validation, race condition checks and safe defaults help prevent reliability problems.

Quoting Variables to Prevent Globbing/Word Splitting

Unquoted variables undergo filename expansion and word splitting, exposing scripts to unintended behavior:

file="My Documents/report.txt"
wc -l $file # Oops, splits into two arguments: "My" and "Documents/report.txt"

By quoting variables, scripts avert ambiguity and remain in control:

  
file="My Documents/report.txt"
wc -l "$file" # Quoted: passed as a single argument

Quoting variables is especially important when passing user-supplied input to commands.
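
As a brief sketch (the prompt and archive name below are hypothetical, not taken from the article's examples):

# Hypothetical example: the user-supplied path may contain spaces or glob characters
read -rp "File to archive: " user_file
tar -czf archive.tar.gz "$user_file"   # Quoted: passed to tar as one literal argument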

Avoiding Race Conditions with File Locking

Scripts editing shared files risk race conditions – interleaved output and lost transactions. Coordinating access with locking prevents this:

lockfile script.lock   # Blocks until the lock is acquired (procmail's lockfile utility)

read -r data           # Read the update to apply
write_data "$data"     # Placeholder for writing to the shared file

rm -f script.lock      # Release the lock

This serializes access to the shared data so concurrent runs cannot interleave writes and corrupt it. Such race condition safeguards boost reliability.
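
If procmail's lockfile utility is unavailable, the flock command from util-linux offers a similar guarantee. A minimal sketch, where update_shared_file stands in for the hypothetical critical section:

exec 9>/tmp/script.lock   # Open (or create) the lock file on file descriptor 9
flock -x 9                # Block until an exclusive lock is acquired

update_shared_file        # Hypothetical critical section touching the shared data

flock -u 9                # Release the lock (also released automatically when the script exits)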

Setting Failsafe Defaults and Timeouts

Defensive code sets safe defaults that prevent runaway jobs. For example:

  
# A long-running backup with no bound can hang indefinitely if a source stalls
tar -czf /backups/backup.tar.gz /home /etc

# Wrapping it in the timeout utility aborts the job after 60 seconds
timeout 60 tar -czf /backups/backup.tar.gz /home /etc

Such timeout checks, along with sane defaults for umask, ulimit and similar settings, limit the damage from unanticipated problems.
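
A sketch of such failsafe defaults placed near the top of a script (the specific limits are illustrative, not recommendations):

umask 077         # New files are readable and writable only by the owner
ulimit -t 300     # Abort the script if it uses more than 300 seconds of CPU time
ulimit -f 102400  # Refuse to create files larger than ~100 MB (limit is in 1024-byte blocks)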

Validating All Inputs

Careful checking and sanitizing of inputs prevents security issues and bad data. For example:

read -rp "Enter file path: " f

# Input validation 
if [[ $f != /* ]] || [[ ! -f $f ]]; then
   echo "Invalid file path $f" >&2
   exit 1
fi
 
cat "$f"

Here the script confirms the input is an absolute path to an existing regular file before passing it to cat. Such checks ensure program assumptions hold before inputs are processed.

Automated Testing

Thorough automated testing improves reliability by catching flaws and gaps before customers do. Unit tests validate modules while integration testing checks overall workflows.

Unit Testing Individual Functions

Scripts can unit test internal functions without user interaction, for example:

#!/bin/bash
# Helper function
hostname_valid() {
  # Anchored regex: the whole string must be a valid hostname label
  [[ $1 =~ ^[a-zA-Z][a-zA-Z0-9-]{0,30}$ ]]
}

# Unit test    
if ! hostname_valid "validHost123"; then
  echo "Failed valid hostname check" >&2
  exit 1
fi

if hostname_valid "Invalid!"; then
  echo "Failed invalid hostname check" >&2
  exit 1
fi
echo "Passed checks"

Such tests can run automatically on every code change, catching regressions within seconds and well before they trigger failures in production.
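
One lightweight way to run them on every change is a small driver script; the tests/ directory layout here is an assumption for illustration:

#!/bin/bash
# Hypothetical test runner: execute every tests/test_*.sh script, stop on the first failure
for t in tests/test_*.sh; do
   [[ -e $t ]] || continue   # Nothing to do if no test scripts match the glob
   bash "$t" || { echo "FAILED: $t" >&2; exit 1; }
done
echo "All unit tests passed"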

Integration Testing Full Workflows

Higher level integration tests check script functionality and reliability by running full workflows on representative datasets. These detect gaps in operational assumptions:

# Test setup
mkdir test_backup
cp -r /var/logs/* test_backup

# Run backup script      
./backup.sh test_backup

# Validation      
if [[ $(ls test_backup | wc -l) -lt 100 ]]; then
   echo "Insufficient files backed up" >&2
   exit 1
fi

# Test tear down
rm -rf test_backup

By automating whole-workflow tests, scripts keep working reliably even as upstream data changes.

Test Coverage for Critical Code Paths

Exercising high risk areas like error handling logic detects flaws before customers encounter them:

# Unit test error handling
trap_err() { echo "Trapped error: $BASH_COMMAND"; }
trap trap_err ERR

rm -r /tmp/missing 2>/dev/null   # Fails when /tmp/missing does not exist
if [[ $? -eq 0 ]]; then
  echo "Failed to trigger and trap the expected error" >&2
  exit 1
fi

echo "Caught expected error"

Such focused testing hardens the code and improves resilience even for risky paths.

Portability Best Practices

Shell scripts can trigger obscure environment-specific errors. Portability best practices minimize reliance on external factors so functionality persists across environments.

Supporting Multiple Shell Environments

Subtle shell differences can manifest as surprising ‘Syntax error’ crashes. One workaround is defensive checks:

# Check shell compatibility using POSIX syntax every shell understands
if [ -z "$BASH_VERSION" ]; then
   echo "Please run me in Bash" >&2
   exit 1
fi

# Bash-specific logic
[[ $1 == foo ]] && echo "Hello!"

This ensures Bash-only features like [[ ]] cannot trigger syntax errors when the script is launched under another shell. Explicit version targeting prevents nasty surprises.
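
Version targeting can also be made explicit; a minimal sketch, where the Bash 4 requirement and the associative-array usage are purely illustrative:

#!/usr/bin/env bash
# Refuse to run on Bash releases older than 4 (assumed here to be needed for associative arrays)
if (( BASH_VERSINFO[0] < 4 )); then
   echo "This script requires Bash 4 or newer" >&2
   exit 1
fi

declare -A seen_hosts   # Bash 4+ feature used later in the script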

Avoiding Dependence on External Commands

Calls to external utilities risk failures when a command is missing. Embedding the logic in the shell itself reduces external dependencies:

# Brittle dependence on external 'timeout' utility 
timeout 10s sleep 20

# Robust bash-native approach
secs=10
while ((secs > 0)); do
   echo -ne "Next check in $secs seconds...\r"
   sleep 1
   : $((secs--))
done

Such shell native implementations enhance reliability across diverse systems with spotty tooling.

Conditional Logic for Different OS Variants

OS specific variations can impede script portability. Segregating logic prevents compatibility issues:

uname_str=$(uname)
# Pick ps arguments that suit the detected kernel
case "$uname_str" in
  Linux*)  ps_arg='-ef' ;;
  *)       ps_arg='aux' ;;
esac

ps $ps_arg | grep '[s]shd'   # Bracket trick keeps grep from matching its own process

Checking $OSTYPE or the output of uname allows adjusting for differences across OS flavors. This prevents false failures when assumptions become invalid.
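
An equivalent check can use Bash's built-in $OSTYPE variable instead of running uname; the patterns below cover common values but should be verified on the target systems:

case "$OSTYPE" in
  linux*)          ps_arg='-ef' ;;
  darwin*|*bsd*)   ps_arg='aux' ;;
  *)               echo "Unsupported OS: $OSTYPE" >&2; exit 1 ;;
esac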

Recovery and Cleanup

Graceful error handling is important but robust scripts also know how to pick up the pieces after failures. Techniques like cleanup handlers and logging aid recovery and debuggability.

Temporary Files and Cleanup Traps

Reliable scripts cleanly manage temporary files to avoid clutter buildup:

  
temp_dir=$(mktemp -d)
echo $$ > "$temp_dir/pid"

cleanup() {
   # Delete temp files
   if [[ -d $temp_dir ]]; then
      rm -rf "$temp_dir"
   fi
}
trap cleanup EXIT

Here temporary files get isolated into disposable directories and removed via EXIT trap handlers. This prevents unexpected exits from littering the filesystem.

Logging for Auditing and Debuggability

Comprehensive logging improves reliability by tracing script execution and inputs to enable better debugging for postmortem analysis:

#!/bin/bash
# Setup logging 
exec &>>"${0##*/}.log"   # Append all output to a log file named after the script
echo "Script execution started at $(date)"

# Log input parameters
echo "Input parameters: $*"

countdown() {
   # Stop recursing once the counter reaches zero
   (( $1 <= 0 )) && return
   echo "T-minus $1..."
   sleep 1
   countdown $(($1 - 1))
}

countdown 5
echo "Blastoff!"

Logging inputs alongside key workflow events makes it far easier to reconstruct what happened and troubleshoot unreliable executions.

Checkpointing State for Resume on Failure

Some workflows involve complex state requiring expensive rebuild after failures. Checkpointing saves state periodically to enable fast resume:

  
# Initial state
downloads_left=1236

save_state() {
  echo "$downloads_left" > /tmp/remaining
}

SECONDS=0
while ((downloads_left > 0)); do
   # Checkpoint roughly every 30 secs using Bash's SECONDS counter
   if (( SECONDS >= 30 )); then
      save_state
      SECONDS=0
   fi
   download_file
   ((downloads_left--))
done

This avoids wastefully restarting multi-hour processes from scratch. Checkpointing makes long-running work incremental and therefore more reliable.
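
On the next run, the script can read that checkpoint back before starting work; a minimal sketch reusing the /tmp/remaining file from the example above:

# Resume from the last checkpoint if one exists
if [[ -f /tmp/remaining ]]; then
   downloads_left=$(</tmp/remaining)
   echo "Resuming with $downloads_left downloads remaining"
fi

With the saved state restored, the loop picks up where the previous run left off.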
