How to keep: daily backups for a week, weekly for a month, monthly for a year, and yearly after that


Solution 1

You are seriously over-engineering this. Badly.

Here's some pseudocode:

  • Every day:
    • make a backup, put into daily directory
    • remove everything but the last 7 daily backups
  • Every week:
    • make a backup, put into weekly directory
    • remove everything but the last 5 weekly backups
  • Every month:
    • make a backup, put into monthly directory
    • remove everything but the last 12 monthly backups
  • Every year:
    • make a backup, put into yearly directory

The amount of logic you have to implement is about the same, eh? KISS.
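The daily step above can be sketched in a few lines of shell. This is a minimal sketch for plain local directories; the `$BACKUP_ROOT` layout is an assumption, and the "make a backup" command is simulated with dated dummy files so the sketch is self-contained (GNU date required):

```shell
#!/bin/sh
# Minimal sketch of the daily rotation step (local directories).
# $BACKUP_ROOT is an assumption; replace the touch with your real backup command.
BACKUP_ROOT=${BACKUP_ROOT:-$(mktemp -d)}
mkdir -p "$BACKUP_ROOT/daily"

# "make a backup, put into daily directory" -- simulated with ten dated files
for d in 0 1 2 3 4 5 6 7 8 9; do
    touch "$BACKUP_ROOT/daily/backup-$(date +%F -d "-$d day").tar.gz"
done

# "remove everything but the last 7 daily backups":
# %F-dated names sort chronologically, so keep the 7 lexically largest
ls -1 "$BACKUP_ROOT/daily" | sort -r | tail -n +8 | while read -r f; do
    rm -- "$BACKUP_ROOT/daily/$f"
done

ls -1 "$BACKUP_ROOT/daily" | wc -l   # should print 7
```

The weekly, monthly, and yearly steps are the same loop with a different directory and keep-count.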

If everything ends up in a single S3 bucket instead of separate directories, pruning by timestamp looks easier:

s3cmd ls s3://backup-bucket/daily/ | \
    awk '$1 < "'$(date +%F -d '1 week ago')'" {print $4;}' | \
    xargs --no-run-if-empty s3cmd del

Or, by file count instead of age:

s3cmd ls s3://backup-bucket/daily/ | \
    awk '$1 != "DIR"' | \
    sort -r | \
    awk 'NR > 7 {print $4;}' | \
    xargs --no-run-if-empty s3cmd del

Solution 2

If you just want to keep, for example, 8 daily backups and 5 weekly backups (one per Sunday), it works like this:

# requires GNU date; keep ends up as a sparse array indexed by YYYYMMDD dates
for i in {0..7}; do ((keep[$(date +%Y%m%d -d "-$i day")]++)); done               # last 8 days
for i in {0..4}; do ((keep[$(date +%Y%m%d -d "sunday-$((i+1)) week")]++)); done  # last 5 Sundays
echo ${!keep[@]}  # the array's indices are the dates to keep

As of today (2014-11-10), this will output:

20141012 20141019 20141026 20141102 20141103 20141104
20141105 20141106 20141107 20141108 20141109 20141110

As an exercise left for you: just delete all backup files whose names do not appear in the keep array.
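That exercise can be sketched like this. The `backup-YYYYMMDD.tar.gz` naming and the demo directory are assumptions; the sketch marks the keep-dates with plain assignment (equivalent to the increment loops above) and only echoes what it would remove, so swap the echo for `rm` once it looks right:

```shell
#!/bin/bash
# Sketch: drop every backup whose date stamp is not in the keep array.
# backup-YYYYMMDD.tar.gz naming is an assumption; adapt to your layout.

# mark the dates to keep (8 daily + 5 weekly), as in the loops above
for i in {0..7}; do keep[$(date +%Y%m%d -d "-$i day")]=1; done
for i in {0..4}; do keep[$(date +%Y%m%d -d "sunday-$((i+1)) week")]=1; done

# demo directory: one current backup (kept) and one ancient one (dropped)
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/backup-$(date +%Y%m%d).tar.gz" "$BACKUP_DIR/backup-20010101.tar.gz"

for f in "$BACKUP_DIR"/backup-*.tar.gz; do
    stamp=${f##*backup-}; stamp=${stamp%.tar.gz}   # extract YYYYMMDD
    if [[ -z ${keep[$stamp]} ]]; then
        echo "would delete: $f"                    # swap echo for: rm -- "$f"
    fi
done
```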

If you also want to keep 13 monthly backups (first Sunday of every month) and 6 yearly backups (first Sunday of every year), things get a little more complicated:

# last 8 days and last 5 Sundays, as before (GNU date required)
for i in {0..7}; do ((keep[$(date +%Y%m%d -d "-$i day")]++)); done
for i in {0..4}; do ((keep[$(date +%Y%m%d -d "sunday-$((i+1)) week")]++)); done
# first Sunday of each of the last 13 months
for i in {0..12}; do
        # DW = weeks between today and the 1st of the month $i months back;
        # %W only counts weeks within one year, so add each intervening
        # year's final week number to cross year boundaries
        # (the 15th is used to sidestep month-length edge cases)
        DW=$(($(date +%-W)-$(date -d $(date -d "$(date +%Y-%m-15) -$i month" +%Y-%m-01) +%-W)))
        for (( AY=$(date -d "$(date +%Y-%m-15) -$i month" +%Y); AY < $(date +%Y); AY++ )); do
                ((DW+=$(date -d $AY-12-31 +%W)))
        done
        # "sunday-$DW weeks" lands on the first Sunday of that month
        ((keep[$(date +%Y%m%d -d "sunday-$DW weeks")]++))
done
# first Sunday of each of the last 6 years, same week-counting trick
for i in {0..5}; do
        DW=$(date +%-W)
        for (( AY=$(($(date +%Y)-i)); AY < $(date +%Y); AY++ )); do
                ((DW+=$(date -d $AY-12-31 +%W)))
        done
        ((keep[$(date +%Y%m%d -d "sunday-$DW weeks")]++))
done
echo ${!keep[@]}

As of today (2014-11-10), this will output:

20090104 20100103 20110102 20120101 20130106 20131103
20131201 20140105 20140202 20140302 20140406 20140504
20140601 20140706 20140803 20140907 20141005 20141012
20141019 20141026 20141102 20141103 20141104 20141105
20141106 20141107 20141108 20141109 20141110

Same as above, just delete all backup files not found in this array.

Solution 3

I recently had the same problem. IMHO, trying to write a shell script to do this is painful; it is much easier to write reusable logic in a higher-level language with built-ins like sets and dictionaries. The general idea is to take configuration indicating how many files of each period you want to keep, then decide for each file whether it should be kept.

There is a fairly popular Python-based script that looks really nice and has easy-to-understand source. Being Python-based rather than shell-based also gives it a cross-platform advantage: https://github.com/xolox/python-rotate-backups
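If you go that route, that project's command-line tool can express the whole policy from the question in one line. A sketch, assuming the option names from the project's README (check `rotate-backups --help` for your version, and adjust `/var/backups` to your own path):

```shell
pip install rotate-backups

# daily for a week, weekly for a month, monthly for a year, yearly forever
rotate-backups --daily=7 --weekly=4 --monthly=12 --yearly=always /var/backups
```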

Comments

  • voretaq7
    voretaq7 over 10 years
    ...my normal suggestion would be "Use Bacula" (or some other backup software that can handle retention and rotation for you) :-)
  • kraymer
    kraymer over 6 years
    this question made me write cronicle <github.com/Kraymer/cronicle> because the accepted answer has the obvious defect of duplicating backups into the daily/weekly/etc folders. cronicle relies on symlinks and takes care of the rotation, deleting the underlying files when no folder contains a symlink pointing to them.
  • Florin Andrei
    Florin Andrei over 10 years
    I actually don't have separate directories. It was written to dump files into an S3 bucket. Once everything is in one place, the total amount of logic that you need to implement is about the same, no matter how you go about it.
  • MadHatter
    MadHatter about 10 years
    Evidently it isn't.
  • takeshin
    takeshin over 9 years
    Nice, and how do I do rm /dir/*.* except keep[@]?
  • gbyte
    gbyte almost 4 years
    @takeshin If backups are named like 'auto_20200912', you can do something like this: for backup in "${backups[@]}"; do if [[ " ${!keep[*]} " != *"$(echo "$backup" | cut -d'_' -f 2)"* ]]; then echo "Deleting: $backup"; fi; done
  • Ken Williams
    Ken Williams almost 3 years
    I'm not sure how you get from the simple question "I need daily/weekly/monthly/yearly backups" to "you are seriously over-engineering this."