Improving Shell Script Reliability And Robustness
Handling Errors Gracefully
Robust shell scripts anticipate, detect, handle and recover from errors gracefully. Instead of failing catastrophically, they continue working or fail safely. Several techniques help make scripts more resilient.
Checking Return Codes to Detect Errors
Most commands return an exit code indicating success or failure. By checking these return codes after critical commands, scripts can detect errors and take appropriate actions:
rsync -avz /source /destination
rc=$?   # Save the exit code immediately; every subsequent command overwrites $?
if [ "$rc" -ne 0 ]; then
    echo "Backup failed with exit code $rc." >&2
    cleanup
    exit 1
fi
Here the return code is saved and checked immediately after the rsync command, since $? is overwritten by every subsequent command. On failure the script logs the error, runs cleanup procedures, and exits with a non-zero code.
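When many commands need the same treatment, the pattern can be wrapped in a small helper so each critical command is checked uniformly. The `require` name and the commands it guards below are purely illustrative:

```shell
#!/bin/bash

# Hypothetical helper: run a command; on failure, log context and exit
require() {
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "Command failed (exit $rc): $*" >&2
        exit "$rc"
    fi
}

require mkdir -p /tmp/demo_backup_staging
require touch /tmp/demo_backup_staging/marker
echo "Critical commands succeeded"
```

Because the helper exits with the failing command's own status, callers higher up the stack can still distinguish failure modes.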
Trapping Errors with Bash’s trap
Bash’s built-in trap command lets you specify handlers for signals and errors in scripts. By trapping errors, robust scripts can avoid unexpected failures:
trap 'echo "Error on line $LINENO" >&2' ERR
rm important_file
echo "Continuing"   # Still reached on errors: the ERR trap runs, then execution resumes
Here trap installs a handler for the ERR condition, a Bash pseudo-signal fired whenever a command exits non-zero. If the rm command fails, the handler logs the error and, because set -e is not enabled, execution continues so the script can attempt recovery.
Displaying Clear Error Messages
Informative error messages are critical when things go wrong. They let users and support staff quickly diagnose and fix problems. Best practices include:
- Logging errors to STDERR with echo >&2
- Including key variables like $LINENO and exit statuses $?
- Adding context, e.g. “Backup failed on serverXYZ”
tar cf archive.tar /var/logs &>>backup.log || {
    rc=$?
    echo "ERROR: Backup failed on $HOSTNAME with code $rc." >&2
    echo "TAR command was: tar cf archive.tar /var/logs" >&2
}
This example shows a script backing up logs, with clear error handling that logs context and details to STDERR on failures.
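These practices can be bundled into a small helper so every failure path reports the same details. The `die` name, the line-number lookup, and the guarded command below are illustrative only:

```shell
#!/bin/bash

# Illustrative helper: print a contextual error to STDERR and exit
die() {
    local rc=$1; shift
    # BASH_LINENO[0] is the line in the caller that invoked die
    echo "ERROR on $HOSTNAME (line ${BASH_LINENO[0]}): $*" >&2
    exit "$rc"
}

mkdir -p /tmp/demo_die_dir || die 1 "could not create staging directory"
echo "Setup complete"
```

Centralizing the message format means every error automatically carries the hostname and line number that support staff need.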
Allowing the Script to Fail Safely
When failure is unavoidable, robust scripts should terminate gracefully to avoid data loss or corruption issues. Strategies like cleanup sections, final traps and locking prevent unsafe termination:
# Install the safety net first: an EXIT trap guarantees the lock is removed
cleanup() {
    rm -f /tmp/script.lock
}
trap cleanup EXIT

# Perform critical action first
transfer_data

# Entering unsafe zone
lock /tmp/script.lock
delete_temp_files
remove_lock /tmp/script.lock
# Even if the script dies between lock and remove_lock, the EXIT trap cleans up
This script uses file locking and a final EXIT trap to guarantee removal of a lock file. This lets the overall workflow fail but avoid collateral damage from an unsafe exit.
Defensive Coding Practices
Carefully written and risk-averse code avoids assumptions and prevents errors. Defensive tactics like input validation, race condition checks and safe defaults help minimize robustness issues.
Quoting Variables to Prevent Globbing/Word Splitting
Unquoted variables undergo filename expansion and word splitting, exposing scripts to unintended behavior:
file="My Documents/report.txt"
wc -l $file    # Oops: splits into two arguments, "My" and "Documents/report.txt"

By quoting variables, scripts avert ambiguity and remain in control:

file="My Documents/report.txt"
wc -l "$file"   # Passed as a single argument
Quoting variables is especially important when passing user-supplied input to commands.
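For instance, a user-supplied path containing a space survives intact only when quoted. The file path and contents below are made up for the demonstration:

```shell
#!/bin/bash

# Simulated user input: a path containing a space
user_file="/tmp/my report.txt"
printf 'error: disk full\n' > "$user_file"

# Quoting keeps the whole path as one argument, so grep finds the file
matches=$(grep -c "error" "$user_file")
echo "Matches: $matches"   # Matches: 1
```

Unquoted, the same `grep -c "error" $user_file` would look for two files named `/tmp/my` and `report.txt` and fail on both.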
Avoiding Race Conditions with File Locking
Scripts editing shared files risk race conditions: interleaved output and lost updates. Coordinating access with locking prevents this:
lockfile script.lock   # Blocks until the lock can be acquired
read data
write_data "$new_data"
rm -f script.lock
This serializes access to the shared data, so overlapping writers cannot corrupt it. Such race condition checks boost reliability.
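On Linux, the util-linux `flock` utility is a common alternative for the same serialization. A minimal sketch, using a throwaway counter file to show that concurrent read-modify-write cycles no longer lose updates:

```shell
#!/bin/bash

LOCK=/tmp/demo_counter.lock
COUNTER=/tmp/demo_counter.txt
echo 0 > "$COUNTER"

increment() {
    # flock -x blocks until an exclusive lock on FD 9 is granted,
    # serializing the read-modify-write inside the subshell
    (
        flock -x 9
        n=$(< "$COUNTER")
        echo $((n + 1)) > "$COUNTER"
    ) 9>"$LOCK"
}

increment &
increment &
wait
cat "$COUNTER"   # 2: neither concurrent update was lost
```

Tying the lock to a file descriptor also means the kernel releases it automatically if the process dies, avoiding the stale-lock problem of hand-rolled lock files.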
Setting Failsafe Defaults and Timeouts
Defensive code sets safe defaults that prevent runaway jobs. For example:
tar -czvf /backups/backup.tar /home /etc
# Oops: a hung tar (e.g. on a stalled network mount) can run indefinitely

timeout 60 tar -czvf /backups/backup.tar /home /etc
# coreutils 'timeout' kills the job if it runs longer than 60 seconds
Such timeouts, along with sane default umask and ulimit settings, restrict the adverse impact of unanticipated problems.
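Failsafe defaults like these are typically set near the top of a script. A short sketch (GNU `stat` shown; the file path is arbitrary):

```shell
#!/bin/bash

# Failsafe defaults near the top of a script
umask 077        # New files are private to the owner (mode 600)
ulimit -t 600    # Kernel kills the job after 600s of CPU time

rm -f /tmp/demo_private.txt
touch /tmp/demo_private.txt
stat -c '%a' /tmp/demo_private.txt   # 600: owner read/write only
```

Lowering soft limits with `ulimit` is always permitted, so this costs nothing on healthy runs and caps the damage on runaway ones.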
Validating All Inputs
Careful checking and sanitizing of inputs prevents security issues and bad data. For example:
read -rp "Enter file path: " f

# Input validation: require an absolute path to an existing regular file
if [[ $f != /* ]] || [[ ! -f $f ]]; then
    echo "Invalid file path: $f" >&2
    exit 1
fi
cat "$f"
Here the script confirms the input is an absolute path to an existing regular file before passing it to cat. Such checks ensure program assumptions hold before processing inputs.
Automated Testing
Thorough automated testing improves reliability by catching flaws and gaps before customers do. Unit tests validate modules while integration testing checks overall workflows.
Unit Testing Individual Functions
Scripts can unit test internal functions without user interaction, for example:
#!/bin/bash

# Helper function: anchored regex, so partial matches don't slip through
hostname_valid() {
    [[ $1 =~ ^[a-zA-Z][a-zA-Z0-9-]{0,30}$ ]]
}

# Unit tests
if ! hostname_valid "validHost123"; then
    echo "Failed valid hostname check" >&2
    exit 1
fi
if hostname_valid "Invalid!"; then
    echo "Failed invalid hostname check" >&2
    exit 1
fi
echo "Passed checks"
Such tests run automatically on code changes to check for regressions near instantly. This catches issues early before they trigger failures.
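A tiny discovery loop can run every such check automatically. The `test_*` naming convention and the two sample checks below are illustrative, not a standard:

```shell
#!/bin/bash

# Hypothetical convention: every function named test_* is a unit test
test_arithmetic() { [ $((2 + 2)) -eq 4 ]; }
test_glob_match() { [[ "hostname01" == host* ]]; }

failures=0
# declare -F lists defined functions; run each one matching test_*
for t in $(declare -F | awk '{print $3}' | grep '^test_'); do
    if "$t"; then
        echo "PASS: $t"
    else
        echo "FAIL: $t"
        failures=$((failures + 1))
    fi
done
echo "$failures failure(s)"
```

Wiring this into a git hook or CI job then gives the near-instant regression signal described above.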
Integration Testing Full Workflows
Higher level integration tests check script functionality and reliability by running full workflows on representative datasets. These detect gaps in operational assumptions:
# Test setup
mkdir test_backup
cp -r /var/logs test_backup   # -r is required to copy a directory

# Run backup script
./backup.sh test_backup

# Validation
if [[ $(ls test_backup | wc -l) -lt 100 ]]; then
    echo "Insufficient files backed up" >&2
    exit 1
fi

# Test teardown
rm -rf test_backup
By automating overall workflow tests, scripts continue working reliably even amid upstream data changes.
Test Coverage for Critical Code Paths
Exercising high risk areas like error handling logic detects flaws before customers encounter them:
# Unit test error handling
trap_err() { echo "Trapped error: $BASH_COMMAND"; }
trap trap_err ERR

rm /tmp/no_such_file_here   # Expected to fail: the file does not exist
if [[ $? -eq 0 ]]; then
    echo "Failed to trap error exit code" >&2
    exit 1
fi
echo "Caught expected error"
Such focused testing hardens the code and improves resilience even for risky paths.
Portability Best Practices
Shell scripts can trigger obscure environment-specific errors. Portability best practices minimize reliance on external factors so functionality persists across environments.
Supporting Multiple Shell Environments
Subtle shell differences can manifest as surprising ‘Syntax error’ crashes. One workaround is defensive checks:
# Check shell compatibility using POSIX syntax that any shell can parse
if [ -n "$ZSH_VERSION" ]; then
    echo "Please run me in Bash" >&2
    exit 1
fi

# Bash-specific logic
[[ $1 == foo ]] && echo "Hello!"
This ensures Bash-only features such as [[ ]] are only reached under a compatible shell. Explicit version targeting prevents nasty surprises.
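Targeting can go beyond shell detection to minimum versions. A minimal sketch using Bash's built-in `BASH_VERSINFO` array (the associative-array demo is just one feature that needs Bash 4+):

```shell
#!/bin/bash

# Require Bash 4+ before using features such as associative arrays
if [ -z "${BASH_VERSINFO:-}" ] || [ "${BASH_VERSINFO[0]}" -lt 4 ]; then
    echo "This script requires Bash 4 or newer" >&2
    exit 1
fi

declare -A colors=([sky]=blue [grass]=green)
echo "Running Bash ${BASH_VERSION}: grass is ${colors[grass]}"
```

Failing fast with a clear message beats a cryptic syntax error halfway through the script.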
Avoiding Dependence on External Commands
Calls to external utilities risk command-not-found failures on minimal systems. Embedding logic in the shell itself reduces external dependencies:
# Brittle dependence on the external 'timeout' utility
timeout 10s sleep 20

# Shell-native countdown using arithmetic built-ins instead
secs=10
while ((secs > 0)); do
    echo -ne "Next check in $secs seconds...\r"
    sleep 1
    : $((secs--))
done
Such shell native implementations enhance reliability across diverse systems with spotty tooling.
Conditional Logic for Different OS Variants
OS specific variations can impede script portability. Segregating logic prevents compatibility issues:
uname_str=$(uname)

# Pick ps arguments for the detected userspace (GNU vs BSD style)
case "$uname_str" in
    Linux*) ps_arg='-ef' ;;
    *)      ps_arg='aux' ;;
esac
ps $ps_arg | grep sshd
Checking $OSTYPE or the output of uname allows adjusting for differences across OS flavors. This prevents false failures when assumptions become invalid.
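Bash's `$OSTYPE` variable offers the same branching without spawning a process. The choice of checksum utility here is just an example of a name that differs across platforms:

```shell
#!/bin/bash

# Branch on Bash's built-in $OSTYPE; no external uname call needed
case "$OSTYPE" in
    linux*)  md5_cmd='md5sum' ;;   # GNU coreutils name
    darwin*) md5_cmd='md5' ;;      # BSD/macOS name
    *)       echo "Unhandled OS: $OSTYPE" >&2; exit 1 ;;
esac

echo "Using $md5_cmd on this platform"
```

Failing loudly in the default branch is deliberate: an unknown platform should surface immediately rather than run with wrong assumptions.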
Recovery and Cleanup
Graceful error handling is important but robust scripts also know how to pick up the pieces after failures. Techniques like cleanup handlers and logging aid recovery and debuggability.
Temporary Files and Cleanup Traps
Reliable scripts cleanly manage temporary files to avoid clutter buildup:
temp_dir=$(mktemp -d)
echo $$ > "$temp_dir/pid"

cleanup() {
    # Delete temp files
    if [[ -d $temp_dir ]]; then
        rm -rf "$temp_dir"
    fi
}
trap cleanup EXIT
Here temporary files get isolated into disposable directories and removed via EXIT trap handlers. This prevents unexpected exits from littering the filesystem.
Logging for Auditing and Debuggability
Comprehensive logging improves reliability by tracing script execution and inputs to enable better debugging for postmortem analysis:
#!/bin/bash

# Set up logging: append all output to a log file named after the script
exec &>>"$0.log"
echo "Script execution started at $(date)"

# Log input parameters
echo "Input parameters: $*"

countdown() {
    (( $1 <= 0 )) && return   # Base case: stop the recursion at zero
    echo "T-minus $1..."
    sleep 1
    countdown $(($1 - 1))
}

countdown 5
echo "Blastoff!"
Logging inputs alongside key workflow events makes it far easier to reconstruct and troubleshoot an unreliable run after the fact.
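A small timestamped helper keeps log entries uniform across the script; the `log` name, log path, and messages below are illustrative:

```shell
#!/bin/bash

LOGFILE=/tmp/demo_audit.log
: > "$LOGFILE"   # Start with an empty log for this run

# Illustrative helper: timestamp every message and mirror it to the log
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOGFILE"
}

log "Backup started for host $HOSTNAME"
log "Backup finished"
```

Because `tee -a` writes to both the console and the file, interactive runs and postmortem analysis see the same record.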
Checkpointing State for Resume on Failure
Some workflows involve complex state requiring expensive rebuild after failures. Checkpointing saves state periodically to enable fast resume:
save_state() {
    echo "$downloads_left" > /tmp/remaining
}

# Initial state: resume from the last checkpoint if one exists
downloads_left=1236
[[ -f /tmp/remaining ]] && downloads_left=$(< /tmp/remaining)

i=0
while ((downloads_left > 0)); do
    (( ++i % 30 == 0 )) && save_state   # Checkpoint every 30 downloads
    download_file
    ((downloads_left--))
done
save_state   # Record completion
This avoids wastefully restarting multi-hour processes from scratch. Checkpointing improves reliability with incrementality.