Tag: linux

Backup Files to S3 using Bash
Description

A bash script copies a file from a Linux server to an S3 bucket, then computes a checksum to verify the upload. Finally, it prints the local file size, the local ETag, the AWS file size, and the AWS ETag for easy comparison. This should give the end user enough confidence that the uploaded file has maintained its integrity.

The script assumes you have an AWS account with login credentials, and that the AWS CLI tools are installed with their settings stored in /home/user/.aws/config and /home/user/.aws/credentials. These two files are needed to authenticate to the S3 bucket.
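As a minimal sketch of that prerequisite, the script could check for both files before doing any work. The function name `check_aws_files` is hypothetical and not part of the original script.

```bash
#!/bin/bash
# Hypothetical pre-flight check (not part of the original script).
# Returns 0 only if both AWS CLI files are readable in the given directory.
check_aws_files() {
   local aws_home="${1:-$HOME/.aws}"
   local f
   for f in config credentials; do
      if [[ ! -r "$aws_home/$f" ]]; then
         echo "Missing or unreadable: $aws_home/$f" >&2
         return 1
      fi
   done
   return 0
}

# Example use near the top of backup.sh:
#   check_aws_files /home/user/.aws || exit 1
```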

Amazon Web Service S3 Bucket

S3 is a flat object store. There are no real folders or directories. The “full” name (the key) of a file includes all the subdirectories as well, i.e. “folder1/folder2/file3.txt” is the file name, not “file3.txt”. The AWS console shows these prefixes as folders, for ease of human navigation.
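To illustrate, here is a small sketch using a made-up key. The "folder" and "file" parts only appear when we split the key on "/" ourselves; S3 stores the whole string as the name.

```bash
#!/bin/bash
# Illustration only: the key below is a made-up example.
s3_key="folder1/2023/Apr/log-17.log.gz"

# The whole string is the object's name. The "folder" and "file" parts
# exist only if we split the key on "/" ourselves.
key_prefix="${s3_key%/*}"    # everything before the last "/"
key_leaf="${s3_key##*/}"     # everything after the last "/"

echo "full key: $s3_key"
echo "prefix:   $key_prefix"
echo "leaf:     $key_leaf"
```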

Begin

Start the script by declaring that it runs under bash, and add any notes to the head.
```
#!/bin/bash
# 04/18/23, backup.sh
# created by mbb
```
Send any error output to a custom log file, and tell bash to exit the script if any command fails, including any command in a pipeline.
```
exec 2>> /var/log/backups/aws.log
set -euo pipefail
```
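The effect of the pipefail option can be seen in isolation with a short sketch. The variable names are only for illustration.

```bash
#!/bin/bash
# Without pipefail, a pipeline reports only the status of its last command,
# so the failure of "false" is hidden by "true".
set +o pipefail
if false | true; then default_result="success"; else default_result="failure"; fi

# With pipefail, the pipeline fails if any command in it fails.
set -o pipefail
if false | true; then pipefail_result="success"; else pipefail_result="failure"; fi
set +o pipefail

echo "without pipefail: $default_result"
echo "with pipefail:    $pipefail_result"
```

This matters later in the script, where md5sum output is piped through awk or cut: without pipefail, a failed md5sum could go unnoticed.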
Get the number of processing units available and add it to a variable.
```
NUM_PARALLEL=$(nproc)
```
Define the remaining local variables.
```
HOSTNAME=`hostname`
DATE=`date "+%F %T"`

DAY=$(date -d "1 day ago" +%d)
MONTH=$(date -d "1 day ago" +%b)
YEAR=$(date -d "1 day ago" +%Y)

LOG="/var/log/backups/aws.log"

src_dir="/mnt/logs/$YEAR/$MONTH"
src_file="log-$DAY.log.gz"
```
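The three date calls above each read the clock separately. As a hedged alternative, the parts can be derived from one reference date so they always describe the same day. The helper name `backup_parts` is hypothetical, and the sketch assumes GNU date.

```bash
#!/bin/bash
# Hypothetical helper: derive all three path parts from one reference date,
# so DAY, MONTH, and YEAR always describe the same day.
# Assumes GNU date (for -d) and forces the C locale so %b yields Jan..Dec.
backup_parts() {
   local ref="$1"    # e.g. "2023-04-18"; the backup is for the day before
   # Noon is used as the base time to stay clear of daylight-saving edges.
   LC_ALL=C date -d "$ref 12:00 1 day ago" "+%Y %b %d"
}

read -r YEAR MONTH DAY < <(backup_parts "$(date +%F)")
echo "/mnt/logs/$YEAR/$MONTH/log-$DAY.log.gz"
```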
Define the AWS variables.
```
aws_dir="s3://bucket01/folder1/$YEAR/$MONTH" 

s3_bucket="bucket01"
s3_dir="folder1/$YEAR/$MONTH"
s3_key="$s3_dir/$src_file"     # name of the file in s3 
s3_body="$src_dir/$src_file"   # the location of the data to be uploaded
```
When a file is uploaded to S3, AWS calculates what is called an ETag value, which acts as a checksum of the uploaded object. To verify file integrity, we will compare the ETag that AWS calculated against an ETag calculated from the local file.

The ETag matches a true MD5 hash only when the file is uploaded in a single part. A single upload request cannot exceed 5 GB, so a file > 5 GB is always uploaded in multiple parts. The aws ‘cp’ command actually switches to multipart much earlier: with the default CLI settings (multipart_threshold and multipart_chunksize, both 8 MB), any file of 8 MB or more is broken into 8 MB chunks, and several chunks are uploaded in parallel until the upload is complete. An MD5 is calculated for each uploaded chunk. The resulting ETag is the MD5 of all the chunk MD5s concatenated together, followed by a dash and the number of chunks, rather than a true MD5 hash of the completed file.

In order to verify that the ETags match, we must calculate the local file’s ETag the same way AWS did, then compare the two values. The script contains two methods to calculate the ETag, so you will need to review them and decide which one you need. In my case, I always know the file I upload will be > 5 GB.
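A quick way to tell which method applies is to look at the ETag itself: a multipart ETag ends in a dash and a part count. The two helpers below are a hypothetical sketch (the names are not from the original script) that compute the expected part count for a file and read the part count from an ETag.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# Number of parts expected for a file size and part size, both in bytes.
expected_parts() {
   local size="$1" part="$2"
   echo $(( (size + part - 1) / part ))
}

# Part count encoded in an ETag: "<hex>-<N>" for multipart uploads.
# Prints 0 when there is no suffix, meaning a single-part upload.
etag_parts() {
   local etag="$1"
   if [[ "$etag" == *-* ]]; then
      echo "${etag##*-}"
   else
      echo 0
   fi
}

# A 20 MB file split into 8 MB parts needs 3 parts.
expected_parts $((20 * 1024 * 1024)) $((8 * 1024 * 1024))
```

If the part count read from the S3 ETag differs from the expected count, the upload most likely used a different chunk size than the local calculation assumes.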

To calculate the local file’s ETAG value for a file uploaded in a single part (with the default CLI settings, a file smaller than 8 MB), use:
```
# calculate ETAG value of local file, if uploaded in a single part.
md5_hash="$(md5sum "$src_dir/$src_file" | awk '{ print $1 }')"
```
For multipart uploads, which includes every file > 5 GB, we can use the code from https://gist.github.com/rajivnarayan/1a8e5f2b6783701e0b3717dbcfd324ba.
```
# ---------- Begin Code ---------------
MULTIPART_MINSIZE=$((8*1024*1024))

file="$src_dir/$src_file"
partSizeInMb=8

if [[ ! -f "$file" ]]; then
   echo "$DATE Error: $file not found to compute hash." >> "$LOG"
   exit 1;
fi

# Calculate checksum for a specified file chunk
# inputs: file, partSizeInMb, chunk
# output: chunk md5sum
hash_chunk(){
   file="$1"
   partSizeInMb="$2"
   chunk="$3"
   skip=$((partSizeInMb * chunk))
   # output chunk + md5 (to allow sorting later)
   dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | echo -e "$chunk $(md5sum)"
}

# Integer quotient a/b after rounding up 
div_round_up(){
   echo $((($1 + $2 - 1)/$2))
}

partSizeInB=$((partSizeInMb * 1024 * 1024))
fileSizeInB=$(du -b "$file" | cut -f1 )
parts=$(div_round_up "$fileSizeInB" "$partSizeInB")

if [[ $fileSizeInB -ge $MULTIPART_MINSIZE ]]; then
   export -f hash_chunk
   etag=$(seq 0 $((parts-1)) | \
       xargs -P ${NUM_PARALLEL} -I{} bash -c 'hash_chunk "$@"' -- "$file" "$partSizeInMb" {} | \
       sort -n -k1,1 |tr -s ' '|cut -f2,3 -d' '|xxd -r -p|md5sum|cut -f1 -d' ')"-$parts"
else
   etag=$(md5sum "$file"|cut -f1 -d' ')
fi
# ---------------- end of code ----------------------
# Calculate ETAG value of local file for a multipart upload
md5_hash=$etag
```
Next, we will copy the file to the S3 bucket using the ‘cp’ command. We use the high-level CLI copy command rather than the s3api command, because a single s3api put-object request cannot handle files larger than 5 GB. Copy the content to S3 and tell AWS that the data is just a plain text file.
```
aws s3 cp "$src_dir/$src_file" "s3://$s3_bucket/$s3_dir/$src_file" --content-type=text/plain
```
Get the ETAG value that AWS calculated during the upload.
```
s3_md5_hash=$(aws s3api head-object --bucket "$s3_bucket" --key "$s3_key" --query ETag --output text | sed 's/"//g')
```
Next, we will get both the local file size and the uploaded file size.
```
aws_list=`aws s3 ls "$aws_dir/$src_file" --human-readable --output text`
local_list=`ls -alhF "$src_dir/$src_file"`
```
Finally, display the file sizes and the ETAG values of both the uploaded file and the local file side by side for comparison.
```
echo "File size check:"
echo "$HOSTNAME: $local_list"
echo "s3.amazon: $aws_list"
echo ""
echo "ETAG check (md5 sum):"
echo "$HOSTNAME: $src_file  ETAG: $md5_hash"
echo "s3.amazon: $src_file  ETAG: $s3_md5_hash"
```
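The comparison above is left to the reader's eye. As an optional sketch, the script could also compare the two values itself and fail on a mismatch. The function name `etags_match` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: compare two ETag strings and report the result.
# Strips any double quotes first, since S3 returns the ETag quoted.
etags_match() {
   local local_etag="${1//\"/}"
   local remote_etag="${2//\"/}"
   # An empty local value is treated as a mismatch, never as a match.
   if [[ -n "$local_etag" && "$local_etag" == "$remote_etag" ]]; then
      echo "ETAG check: OK ($local_etag)"
      return 0
   fi
   echo "ETAG check: MISMATCH (local=$local_etag s3=$remote_etag)"
   return 1
}

# Example use at the end of backup.sh:
#   etags_match "$md5_hash" "$s3_md5_hash" || exit 1
```

A non-zero exit status lets cron or a calling script notice a bad upload without anyone reading the output.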
April 17, 2024

Validate the Integrity of a File Backup using Ansible

Introduction

Running nightly file backups is a common task for administrators. How do we know the file was copied successfully with no errors? In this post, we will set up an Ansible script that runs a file integrity check using MD5 on both the source and the destination files to verify the file was not corrupted during the copy process. The Ansible server is assumed to be a separate server from both the source server and the destination server.

Specifically, we will tell Ansible to execute a bash script on the source and destination servers, gather the results, and store them in a temporary text file. Ansible will then place the contents of the text file in the body of an email and send it to interested parties for review.

Create the Ansible Script

Add comments to the head of the script. I like to include an example of the command, so that it can be easily copied to the command line.

# Ansible script, validate.yml
# cmd: ansible-playbook -i inventory.ini validate.yml

Add the play definition to the script. Ansible scripts conventionally start with three dashes. Also note that Ansible is very sensitive to indentation. The name, hosts, and tasks keys must line up exactly or the script will not execute.

---
- name: Verify integrity using md5 checksums 
  hosts: server1.company.com:server2.company.com
  gather_facts: False

Add the tasks that must be executed.

  tasks:
    - name: Check file integrity
      script: /home/user1/validate.sh   
      ignore_errors: False 
      register: results
    - name: Make a header for the results txt file.
      shell: echo '<-- results -->'
      register: title
      delegate_to: localhost
      run_once: true
    - name: Create new results txt file. 
      copy:
         content: "{{ title.stdout }}"
         dest: /home/user1/validate.txt
      delegate_to: localhost
      run_once: true
    - name: Append results to txt file. 
      lineinfile:
         dest: /home/user1/validate.txt
         line: "{{ results.stdout }}"
         insertafter: EOF
      delegate_to: localhost
    - name: get the date
      shell: "date +%Y-%m-%d"
      register: tstamp
      delegate_to: localhost

Finally, we will email the results to interested parties.

    - name: Sending email
      mail:
         host: exchange.company.com
         port: 25
         from: hostname@company.com
         to:
         - user1@company.com
         - user2@company.com
         subject: Copied to backup server.
         body: "Date: {{ tstamp.stdout }}\n\n
               NOTE: The below results are for yesterday's files.\n\n
               {{ lookup('file','/home/user1/validate.txt') }}"
      delegate_to: localhost
      run_once: True

Build the Bash Script

Ansible executes the code on all servers in parallel, so we don’t know which server’s results will be returned to Ansible first. That is why the output includes the server hostname.

Create the headers.

#!/bin/bash
# validate.sh

Create the variables.

date=$(date -d yesterday +%Y-%m-%d)
host=$(hostname | awk '{print tolower($0)}')
day=$(date -d yesterday +%d)
month=$(date -d yesterday +%b)
# Use yesterday's year too, so the path is still right on January 1.
year=$(date -d yesterday +%Y)

src_dir1="/var/ossec/logs/$year/$month"
dst_dir1="/mnt/storage/logs/$year/$month"
file="filename-$day.log"

Execute the commands to gather the needed data.

# Get source file results
results1=$(ls -alhF "$src_dir1/$file")
md5sum1=$(md5sum "$src_dir1/$file" | awk '{ print $1 }' )
# Get destination file results
results2=$(ssh backupserver "ls -alhF $dst_dir1/$file")
md5sum2=$(ssh backupserver "md5sum $dst_dir1/$file" | awk '{ print $1 }')

Output the results. Remember these results will be returned to Ansible.

# output results, compare file size and MD5 hash.
echo " $host File: $results1"
echo "Backup File: $results2"
echo ""
echo " $host MD5_Sum: $md5sum1"
echo "Backup MD5_Sum: $md5sum2"
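The email leaves the hash comparison to the reader. As a hedged sketch, the script could also print a one-word verdict. The function `files_match` is hypothetical and compares two local files for simplicity; in validate.sh the second hash would come from the ssh call instead.

```bash
#!/bin/bash
# Hypothetical helper: print MATCH or MISMATCH for two local files.
files_match() {
   local sum1 sum2
   sum1=$(md5sum "$1" | awk '{ print $1 }')
   sum2=$(md5sum "$2" | awk '{ print $1 }')
   # An empty hash means md5sum failed (for example, a missing file),
   # so it is never counted as a match.
   if [[ -n "$sum1" && "$sum1" == "$sum2" ]]; then
      echo "MATCH"
      return 0
   fi
   echo "MISMATCH"
   return 1
}

# Example use:
#   echo "Verdict: $(files_match /path/to/source /path/to/copy)"
```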

This is my own method for verifying files were copied correctly. I hope you find it useful.

April 16, 2024
