Recursive Copying of PDF Files from Directory Trees with rsync

Finding PDFs in Directory Trees

The first step in recursively copying PDF files from directories is locating all the .pdf files that need to be copied. The Linux find command can be used to search directory structures and match file names ending in .pdf.

For example, to find all PDFs under /home/user/documents and all subdirectories:

find /home/user/documents -type f -name '*.pdf' -print

This will match and print the paths of any discovered PDFs under /home/user/documents and any of its child folders to standard output.

The relevant options used here are:

-type f – Only match regular files, excluding directories, symlinks, etc.
-name '*.pdf' – Match file names ending with .pdf; the quotes stop the shell from expanding the pattern before find sees it
-print – Print the full matched file path to standard output
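The effect of quoting the pattern can be tried safely in a scratch directory. The sketch below is only an illustration: the directory layout and file names are made up, and it assumes a find that supports -iname (GNU and BSD find both do), which also matches upper-case .PDF extensions.

```shell
# Build a small scratch tree (hypothetical names) to search.
root=$(mktemp -d)
mkdir -p "$root/docs/2023" "$root/docs/scans"
touch "$root/docs/report.pdf" "$root/docs/2023/invoice.pdf" \
      "$root/docs/scans/SCAN 01.PDF" "$root/docs/notes.txt"

# Quoted pattern: find receives *.pdf literally and applies it at every level.
find "$root/docs" -type f -name '*.pdf' -print

# -iname matches case-insensitively, so "SCAN 01.PDF" is counted as well.
pdf_count=$(find "$root/docs" -type f -iname '*.pdf' -print | wc -l | tr -d ' ')
echo "PDFs found (any case): $pdf_count"
```

If upper-case extensions are common in the source tree, -iname is probably the safer choice for the later copy steps too.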

find always descends recursively. To list the directories themselves, use -type d to match directories, together with -print to output paths:

find /home -type d -print

This will descend through all subdirectories under /home, printing their paths. Listing the directory structure this way helps pinpoint where PDFs may reside within deeply nested folder hierarchies.
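To go one step further and list only the directories that actually hold PDFs, the two ideas can be combined. This is a minimal sketch on a throwaway tree with made-up names; it prints the parent directory of every PDF and removes duplicates.

```shell
# Scratch tree with hypothetical names; use a real path in place of "$root".
root=$(mktemp -d)
mkdir -p "$root/a/deep/nested" "$root/b" "$root/empty"
touch "$root/a/deep/nested/manual.pdf" "$root/a/deep/nested/spec.pdf" "$root/b/form.pdf"

# Print the parent directory of each PDF, then collapse duplicates.
pdf_dirs=$(find "$root" -type f -name '*.pdf' -exec dirname {} \; | sort -u)
printf '%s\n' "$pdf_dirs"
```

Directories without PDFs, such as the empty one above, do not appear in the result.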

Constructing the rsync Command

With one or more source directories containing PDF files located, rsync can carry out the recursive copying process. The basic rsync command structure is:

rsync [options] <source> <destination>

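For example, to recursively sync everything under /home/user/documents (PDFs and any other files) to /backups; restricting the copy to PDFs requires filter rules or a file list built with find: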

rsync -r -a /home/user/documents/ /backups

Key parameters and options include:

-r – Copy data recursively, descending into subdirectories
-a – Archive mode: implies -r and preserves symlinks, permissions, modification times, group, and owner
-v – Increase verbosity to get detailed output

The source path can be given explicitly, or the list of PDFs can be built dynamically with find. The destination receives a copy that mirrors the source directory hierarchy.

Excluding Unneeded Files and Folders

When doing bulk recursive directory copies, rsync will duplicate everything under the path by default. This often copies over cached files, temporary folders, logs, and other data not needed in the backup.

The --exclude option specifies file and folder patterns that should not be copied:

rsync -r -a --exclude 'temp*' /home/user/ /backups

Here temp* excludes files and folders starting with temp from the replication.

Likewise, cache folders, hidden configuration files whose names begin with a dot, and log files such as *.log can be omitted with additional --exclude patterns.

Preserving PDF Metadata and Attributes

When mirroring directories from one file system location to another, rsync can also carry over file permissions, ownership, timestamps, and other metadata, provided the right options are set.

The -a archive mode option already preserves these PDF attributes. It is shorthand for a group of options that can also be given individually, including:

-p – Preserve file permissions
-t – Preserve modification times
-o – Preserve the original owner (effective only when rsync runs as root on the receiving side)
-g – Preserve the original group

This means copied PDF documents retain their original modification dates, owners, and permissions when synced from source to destination, which is useful when maintaining an up-to-date archive. Creation times are not covered by -a.

Monitoring Copy Progress

Since rsync recursively copies full directory trees, a single run may transfer gigabytes of data. Verbose output and progress indicators let an administrator monitor the transfer.

Include the -v parameter to enable verbose output:

rsync -rav --exclude '*.log' /volumes/ /backups/

This prints each file and directory transferred, followed by a summary of bytes sent and received, the average transfer rate, and the total size. For per-file progress with an estimated time remaining, add --progress.

Watch for errors related to permission issues accessing files or invalid paths. Also monitor overall throughput: if speed drops significantly or stalls, a network or storage issue may be occurring.

Verifying Results

Once an rsync directory mirroring operation concludes, the full set of copies should be verified for integrity and completeness.

Use directory listing tools like ls and tree to confirm that file and folder counts match between the source and destination. Spot-check the copies as well, both by comparing names and sizes and by opening a few of the important PDFs. Running ls from inside each directory keeps the differing top-level paths out of the listings, so diff reports only real differences:

(cd /volumes/documents && ls -R) > live_docs.txt
(cd /backups/documents && ls -R) > copied_docs.txt
diff live_docs.txt copied_docs.txt

Any mismatched entries indicate inconsistencies that should be investigated and rectified if problems or omissions occurred during the rsync. Don’t assume flawless execution – always verify.

Automating Recurrent PDF Backups

Manually issuing rsync commands whenever backups are needed leads to forgotten executions and gaps. Creating scripts that automate syncing based on scheduled times obviates this.

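Shell scripts encapsulate the find commands that locate PDFs together with the rsync directives that copy them from sources to destinations. Make the script executable and place it under /usr/local/bin so that it can be invoked by name, either by a user or by cron. Feeding a NUL-separated list from find to rsync through --files-from keeps file names with spaces intact and recreates each PDF's relative path under /backups:

#!/bin/bash

# Copy every PDF under /volumes to /backups, keeping the relative directory layout.
cd /volumes || exit 1
find . -type f -name '*.pdf' -print0 | rsync -a --from0 --files-from=- . /backups/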

The cron daemon, which reads /etc/crontab among other schedule files, can trigger these backup scripts hourly, daily, or weekly as appropriate. After the initial seeding, incremental rsync runs copy only new or changed PDFs, which keeps them efficient.
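A schedule entry in the system crontab format might look like the one generated below. The script name pdf-backup.sh, the log path, and the 02:30 daily schedule are all placeholders chosen for illustration; the snippet only writes the entry to a temporary file for review and does not install anything.

```shell
# Write a candidate cron entry to a temporary file for review.
cron_file=$(mktemp)
cat > "$cron_file" <<'EOF'
# minute hour day-of-month month day-of-week user command
30 2 * * * root /usr/local/bin/pdf-backup.sh >> /var/log/pdf-backup.log 2>&1
EOF

cat "$cron_file"
```

Once reviewed, the entry line can be appended to /etc/crontab by an administrator. The user field (root here) is part of the system crontab format and is not used in per-user crontabs edited with crontab -e.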

Combining cron with such scripts provides recurring, automated archiving, so users can focus on content instead of backups.