How can one efficiently use S3 to back up files incrementally?

Solution 1

Since this question was last answered, there is a new AWS command line tool, aws.

It can do rsync-like syncs between local storage and s3. Example usage:

aws s3 sync s3://mybucket /some/local/dir/

If your system's Python environment is set up properly, you can install the AWS client using pip:

pip install awscli
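
For backing up, the same command works in the other direction, local to S3. Here is a minimal sketch; the bucket name, paths, and the optional --exclude/--delete flags are placeholders, not a prescription:

# hypothetical backup run from local disk to S3
# --delete removes remote objects that no longer exist locally
aws s3 sync /some/local/dir/ s3://mybucket/backup/ \
    --exclude "*.tmp" \
    --delete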

Solution 2

The s3cmd tool has a great sync option. I use it to sync local backups, using something like:

s3cmd sync --skip-existing $BACKUPDIR/weekly/ s3://MYBACKUP/backup/mysql/

The --skip-existing option means it doesn't try to compare checksums for files that already exist: if a file with that name is already there, it just quickly skips it and moves on. There is also a --delete-removed option which removes files that no longer exist locally, but I want to keep files on S3 even after I've cleaned them up locally, so I don't use it.
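
To run this unattended, a hypothetical crontab entry along these lines would do; the local path and log file are placeholders:

# run the weekly sync every Sunday at 03:00
0 3 * * 0 s3cmd sync --skip-existing /var/backups/weekly/ s3://MYBACKUP/backup/mysql/ >> /var/log/s3cmd-sync.log 2>&1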

Solution 3

Alternatively, you can use the MinIO client, aka mc. The 'mc mirror' command will do the job.

$ mc mirror share/sharegain/ s3/MyS3Bucket/share/sharegain 
  • mc: minio client
  • share/sharegain: local directory
  • s3: Alias for https://s3.amazonaws.com
  • MyS3Bucket: My remote S3 bucket
  • share/sharegain: My object on s3

You can write a simple script to run as a cron job that keeps things in sync at a periodic interval.
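
A minimal sketch of such a script, assuming mc is installed at /usr/local/bin and reusing the placeholder paths from above:

#!/bin/sh
# hypothetical cron wrapper around mc mirror
# --overwrite updates remote objects that have changed locally
/usr/local/bin/mc mirror --overwrite share/sharegain/ s3/MyS3Bucket/share/sharegain \
    >> /var/log/mc-mirror.log 2>&1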

Hope it helps.

Solution 4

I don't want to tell anyone what to do, but may I wave a flag for duplicity, or some other incremental backup solution? Syncing is all very well, but if you back up nightly and don't notice a problem for two days, it's too late: your local files and your backup are mirrors of each other, and neither has the data you need. You really should consider incremental backups or snapshots so you can recover to a particular moment in time, and doing that efficiently requires incremental backups. And if losing your data would be an end-of-the-world scenario, keep copies with different providers as well; any single one could be lost or hacked, you never know.

I use duplicity with S3. It's fine but CPU-intensive, and it does genuine incremental backups. In an emergency, when you want to restore a directory or a particular file as it was last Wednesday, or last January, without restoring the other files on the same partition, you need incremental backups and a tool that lets you request just the files you need.
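
For example, a single file can be pulled back from a given point in time. This is a hypothetical restore invocation; the file path, bucket URL, and credentials are placeholders:

# restore one file as it was three days ago, writing it to /tmp/file.txt
PASSPHRASE=securegpgpassphrase \
AWS_ACCESS_KEY_ID=xxxxxx AWS_SECRET_ACCESS_KEY=xxxxxx \
duplicity restore --file-to-restore some/dir/file.txt --time 3D \
    s3://s3-eu-west-1.amazonaws.com/mybucket /tmp/file.txt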

I have a cron job that does a full backup every x months and incrementals otherwise, deletes anything older than x months to keep S3 storage totals down, and finally runs a collection-status so I get mailed the status each morning. You need to keep an eye on it regularly so you notice when your backup isn't working.

It requires significant local temp space to keep the local signatures, so set up the temp dir carefully. The script below backs up /mnt, excluding various directories inside /mnt. This is good for backing up data; for system partitions, use Amazon imaging or snapshot tools.

PHP script:

<?php
# Duplicity Backups

$exclude  = "--exclude /mnt/ephemeral ".
            "--exclude /mnt/logs ".
            "--exclude /mnt/service ".
            "--exclude /mnt/mail ".
            "--exclude /mnt/mysql ";

$key = "PASSPHRASE=securegpgpassphrase";

$tmp = "/mnt/mytempdir";

system("mkdir -p $tmp");

# Amazon

$aws = "AWS_ACCESS_KEY_ID=xxxxxx ".
       "AWS_SECRET_ACCESS_KEY=xxxxxx ";

$ops = "-v5 --tempdir=$tmp --archive-dir=$tmp --allow-source-mismatch --s3-european-buckets --s3-use-new-style --s3-use-rrs";
$target = " s3://s3-eu-west-1.amazonaws.com/mybucket";

# Clean + Backup

system("$key $aws /usr/bin/duplicity $ops --full-if-older-than 2M $exclude /mnt $target");
system("$key $aws /usr/bin/duplicity $ops remove-older-than 6M --force $target");
system("$key $aws /usr/bin/duplicity $ops cleanup --force --extra-clean $target");
system("$key $aws /usr/bin/duplicity $ops collection-status $target")

Solution 5

S3 is a general purpose object storage system that provides enough flexibility for you to design how you want to use it.

I'm not sure from your question what issues you've run into with rsync (other than indexing) or with the '3rd party' tools.

If you have a large, well-structured set of files, you can run multiple s3 syncs on your sub-folders.
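
A hypothetical way to do that with the aws CLI, using placeholder sub-folders and bucket names:

# sync several sub-folders in parallel
for dir in images documents archives; do
    aws s3 sync "/data/$dir" "s3://mybucket/$dir" &
done
wait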

The nice folks at Amazon also offer an import/export service, letting you ship a portable hard drive for large transfers into S3 or EBS -- http://aws.amazon.com/importexport/ -- which you can use for the first upload.

See Amazon s3 best practices here -- http://aws.amazon.com/articles/1904

As for the different tools, try them and see what works best for you. Regarding pricing, there is reduced redundancy pricing if it suits your needs -- http://aws.amazon.com/s3/pricing/

General recommendation -- have a fast multicore CPU and a good network pipe.

UPDATE: A note about checksumming on S3

S3 stores data as key-value pairs, and there is no concept of directories. s3sync verifies checksums (S3 has a mechanism to send a checksum as a header for verification -- the Content-MD5 header). The Data Integrity section of the best-practices link above covers this in detail. S3 allows you to send, verify, and retrieve checksums. Plenty of folks are doing incremental backups with duplicity. Even though there is no rsync running on S3's side, you can use checksums as mentioned here.
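
As a hypothetical illustration with the aws CLI (bucket, key, and file name are placeholders), you can compute an MD5 locally and send it as Content-MD5 so S3 verifies the upload and rejects it if the data was corrupted in transit:

# compute the base64-encoded MD5 and let S3 verify it on upload
md5=$(openssl md5 -binary backup.tar.gz | base64)
aws s3api put-object --bucket mybucket --key backup/backup.tar.gz \
    --body backup.tar.gz --content-md5 "$md5"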

rsync is a proven tool, and most modern tools use the same algorithm, use the rsync library, or call rsync externally.


Comments

  • Jaimie Sirovich
    Jaimie Sirovich almost 2 years

    I understand how rsync works on a high-level, but there are 2 sides. With S3 there is no daemon to speak of — well there is, but it's basically just HTTP.

    There look to be a few approaches.

    s3rsync (but this just bolts on rsync to s3). Straightforward. Not sure I want to depend on something 3rd party. I wish s3 just supported rsync.

There are also some rsync 'clones' like duplicity that claim to support s3 without said bolt-on. But how can they do this? Are they keeping an index file locally? I'm not sure how that can be as efficient.

    I obviously want to use s3 because it's cheap and reliable, but there are things that rsync is the tool for, like backing up a giant directory of images.

    What are the options here? What do I lose by using duplicity + s3 instead of rsync + s3rsync + s3?

    • Admin
      Admin almost 12 years
      S3 is cheap? That's news to me. Reliable? For sure, but not cheap.
    • Admin
      Admin almost 12 years
      Well, s3 is $0.13/gb or less as you store more or want less redundancy. A quick search reveals evbackup.com for rsync storage. Far more expensive. What's cheaper and has some level of redundancy?
    • Admin
      Admin about 5 years
      If I were to design rsync, it would support plugins so that new protocols (e.g. s3://) could be added. However, at present, rsync doesn't support this, so I don't believe rsync can be used directly for backing up to S3.
    • Admin
      Admin about 5 years
      The next issue is that I don't think S3 stores metadata such as ownership or permissions, so using e.g. "aws s3 sync" to do backups will work but probably isn't suitable for a full-blown backup of a Unix filesystem, since too much data would be lost on restore. I also think symlinks, hardlinks, and other special files would be lost.
  • Jaimie Sirovich
    Jaimie Sirovich almost 12 years
    I don't see how this answers the question. I was asking how duplicity manages to do what rsync does without a daemon on the other side. It has no ability to even get a checksum, or maybe it does, but then how would it incrementally update the files?
  • Jaimie Sirovich
    Jaimie Sirovich almost 12 years
OK. So you're saying that Duplicity uses this hash from S3, but it also claims to work over FTP. FTP has no hashing mechanism. I tend to err on the safe side and use the 'proven' tools. Rsync is proven, yes, but it won't do s3 backups without the s3 add-on service s3rsync. I'm a bit scared of duplicity, but it has wider protocol appeal if I can get some level of rsync-like functionality with s3 without said accessory service. I just don't get how well it works (and possibly differently with various protocols). How the heck does it do FTP syncing? :)
  • David Given
    David Given over 10 years
    Danger, Will Robinson! This is really expensive as you're not getting any benefits of the rsync low-bandwidth communication --- s3fs will end up reading (and then writing, if it changes) the entire file, which means Amazon will bill you twice. Instead consider using an EC2 instance and using rsync remotely to that via ssh. Transfers to S3 from an EC2 instance are free, so all you pay for is rsync's low-bandwidth communication from your local machine to the EC2 instance. Running an EC2 micro instance on demand costs practically nothing.
  • Ants-double
    Ants-double over 10 years
    This! There's a lot of bad advice out there for those that do not understand rsync and S3...
  • ceejayoz
    ceejayoz over 9 years
    @JaimieSirovich Test it and see. If you had, you'd have known Duplicity builds "manifest" files in less time than it took you to type all these comments about what it might be doing.
  • Dan Pritts
    Dan Pritts almost 9 years
    The one downside of this is that now you have a micro instance to manage. Trivial if you know how, but a barrier to entry for many. On the plus side, EC2-attached EBS storage is about half the price per byte of S3.
  • ryebread
    ryebread over 8 years
In my experience, this uploads everything, not just a delta of changes. For example, I was pushing a static site to a dev server with rsync, and it took an average of 1 second, with just the changes going out over my slow connection. aws s3 sync, on the other hand, took about 5 minutes, retransferring each and every file.
  • Dan Pritts
    Dan Pritts over 8 years
I believe you that it doesn't work, but the docs say "A local file will require uploading if the size of the local file is different than the size of the s3 object, the last modified time of the local file is newer than the last modified time of the s3 object, or the local file does not exist under the specified bucket and prefix." Make sure you have the latest version of aws-cli - if you can reproduce this, file a bug with them on github. They were responsive when I filed a bug a while ago.
  • alkar
    alkar over 7 years
    There's also a -w flag now, which will use fsnotify to watch for changes. It can easily be set up as a system service or similar.
  • Carlo S
    Carlo S over 6 years
    The command should be: aws s3 sync /some/local/dir/ s3://mybucket
  • Dan Pritts
    Dan Pritts over 6 years
    Carlos, I'm not sure what your point is. If you mean to suggest that my example command is wrong, we are both right. The s3 sync can work in either direction.
  • mcmillab
    mcmillab about 5 years
    turn versioning on for the s3 bucket, then it will keep old copies
  • Edward Falk
    Edward Falk about 5 years
    Late to the party, but here's what happening: When uploading to S3, the quick check rules apply (upload if size or date has changed). When downloading, there are no quick check rules, and everything is downloaded unconditionally.