Is it safe to use a HDD while rsync is running?


Solution 1

As others have already pointed out, it is safe to read from the source disk, or to use the target disk outside of the target directory, while rsync is running. It is also safe to read within the target directory, especially if the target directory is being populated exclusively by the rsync run.

What's not generally safe is writing within the source directory while rsync is running. A "write" here is anything that modifies the contents of the source directory or any subdirectory thereof, and so includes file updates, deletions, creations, and so on.

Doing so won't actually break anything, but the change may or may not actually get picked up by rsync for copying to the target location. That depends on the type of change, whether rsync has scanned that particular directory yet, and whether rsync has copied the file or directory in question yet.

However, there is an easy way around that: Once it finishes, run rsync again, with the same parameters. (Unless you have some funky delete parameter; if you do, then be a bit more careful.) Doing so will cause it to re-scan the source, and transfer any differences that weren't picked up during the original run.

The second run should transfer only the differences that arose during the previous rsync run, and as such will complete much faster. Thus, you can feel free to use the computer normally during the first run, but should avoid making changes to the source as much as possible during the second run. If you can, strongly consider remounting the source file system read-only before starting the second rsync run. (Something like mount -o ro,remount /media/source should do.)

Solution 2

This depends on the backup system you use, but in general it is a bad idea to modify the contents of a device while you're backing it up. You can, however, safely read its contents, though doing so will slow down the process.

In your case, rsync will build up a file list (all at once or incrementally, depending on the version and options) and then start the backup. Therefore a file you add to the source HDD after that part of the tree has been scanned will not be copied.

What I do is avoid using the device at all during a backup. That is the safest way to obtain a fast and consistent backup.

Solution 3

It is safe to read data from the source areas while rsync is operating, but if you update anything the copy that rsync creates/updates is likely to be inconsistent:

  1. If you update a file that rsync has already scanned, it will not see the update until a future run. If you update a file it has yet to scan, the change will be reflected in the destination. If you update files that both have and have not been scanned, you will end up with a mix of old and new versions in the destination.

  2. If you add a file to a directory that has already been scanned it will be missed from the destination copy this time around. If you remove a file from a directory that has already been scanned it will be left in the destination copy this time. Depending on how you invoke rsync the whole tree may be scanned at the start or it may be incrementally scanned as the sync process happens.

  3. In some circumstances rsync will see the inconsistency and warn you. If you remove a file or sub-directory from a directory that has already been scanned itself but has not had its contents scanned you will get an error message about the object being missing. In similar circumstances it can sometimes (if the size and/or timestamp has changed) also warn about files changing mid-scan.

For some backups this inconsistency may not be a massive issue, but for most it will be, so it is recommended that you don't try to sync an actively changing source.

If you use LVM to partition your storage system, you could use a temporary snapshot to take a point-in-time backup. This requires that you have enough free space in the volume group to create a snapshot volume large enough to hold all the changes that will happen during the time the snapshot is needed. Check the LVM documentation (or one of many online examples: search for "LVM snapshot backup" or similar) for more details.
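As a rough sketch of the LVM approach: the volume group name (vg0), logical volume name (data), snapshot size, and mount points below are all hypothetical placeholders, and the commands need root:

```shell
# Point-in-time backup via a temporary LVM snapshot.
# /dev/vg0/data, the 5G snapshot size, and the mount points are
# placeholders -- adjust for your volume group and expected churn.
lvcreate --snapshot --size 5G --name data-snap /dev/vg0/data

# Mount the frozen view read-only.
mkdir -p /mnt/data-snap
mount -o ro /dev/vg0/data-snap /mnt/data-snap

# Back up the snapshot; the live volume can keep changing meanwhile.
rsync -a /mnt/data-snap/ /media/backup/data/

# Clean up: unmount and discard the snapshot.
umount /mnt/data-snap
lvremove -f /dev/vg0/data-snap
```

The snapshot only needs to be large enough to absorb the writes that occur while it exists; if it fills up, LVM invalidates it, so size it generously for busy volumes.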

Even without LVM, some filesystems support snapshots themselves (Btrfs and ZFS, for example), so you may wish to look into that option too.
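On a filesystem with native snapshot support, the same idea is even simpler. This sketch assumes Btrfs; /data and /media/backup are placeholder paths, and the commands need root:

```shell
# Point-in-time backup via a read-only Btrfs snapshot (placeholder paths).
btrfs subvolume snapshot -r /data /data/.backup-snap

# Back up the frozen view; /data itself can keep changing.
rsync -a /data/.backup-snap/ /media/backup/data/

# Discard the snapshot once the backup completes.
btrfs subvolume delete /data/.backup-snap
```

Unlike an LVM snapshot, a Btrfs snapshot doesn't need a pre-reserved size; it shares unchanged blocks with the live subvolume.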

If you want to back up large, active volumes without long downtime and can't use snapshots, it may be sufficient to run the "live" rsync to completion, then stop access to the volume and run rsync again. The second run may take far less time: if very little has changed, it will just scan the directory tree and copy the few updated files. This way, the period during which you must avoid changes can be much shorter.

Solution 4

  • You can read anything from the source HDD while rsync is running.

  • You can write to the source HDD, as long as the writes don't touch content that rsync is copying.

  • You can read anything from the destination HDD while rsync is running.

  • You can write to the destination HDD, on the condition that sufficient space remains for the synced content.

Of course, in any of these cases, there will be some performance reduction.


Author: mfbayrarian
Updated on September 18, 2022

Comments

  • mfbayrarian
    mfbayrarian over 1 year

    I plan to back up my large HDDs with rsync, and anticipate that it will take a few days. Is it safe to use the original HDD (adding files) while rsync is working? Or is it better to leave the HDDs untouched until the rsync is finished?

  • gerlos
    gerlos over 7 years
    One can even do a third run after a second run: it may take even less time... ;-)
  • Monty Harder
    Monty Harder over 7 years
    @gerlos A pattern seems to be emerging. It sounds almost like one could just keep running the rsync command at the end of each use session, and within a few days it would be done in no time.
  • user
    user over 7 years
    @gerlos If you remount read-only before running rsync the second time, that won't be necessary and the backup will be all but guaranteed to be consistent while minimizing the time during which you cannot write to the source file system.
  • Martin Ueding
    Martin Ueding over 7 years
    I usually let it run and then do a second run of rsync which will finish in a few seconds because only the files that I have changed during the run will be copied. Everything will be in the caches, so it is way easier to refrain from modifications during that period.
  • gerlos
    gerlos over 7 years
    @MichaelKjörling you're right. It obviously depends on what and how you want to backup your data.
  • gerlos
    gerlos over 7 years
    @MontyHarder anyways your rsync command can't run for less time than the time needed to scan the file system for files to backup. If there are lots of files and directories it can take a long time, even if there's nothing more to copy.
  • user
    user over 7 years
    @gerlos As an aside, that's why I have an entry much like @reboot root find / -print &>/dev/null in my system crontab, to populate the cache. (The actual entry is more complex to account for a few special cases on my particular system.) It uses some RAM and some wallclock time early after startup to improve directory-tree scanning quite a bit IME.
  • ibennetch
    ibennetch over 7 years
    I like your answer best because you go into detail about what happens if files are modified. You not only provide an alternative but also address the inconsistencies it can cause (missing an update, warning about a missing file, etc.). In my situation, using rsync to seed a long backup and then refreshing it days later is no big deal, and that sounds like the OP's situation as well. It doesn't sound like he/she requires an enterprise-level backup the first time through, but just wants to use the computer in the meantime. I say just run rsync a second time to catch the updated files.
  • Olivier Dulac
    Olivier Dulac over 7 years
    @MichaelKjörling: interesting idea to cache the hierarchy. But maybe you should run updatedb (building locate's database) or slocate -u (same, if you have slocate) instead? That way you still cache the hierarchy, but you also build up the databases of locate or slocate, allowing you to use those commands to quickly find many files?
  • user
    user over 7 years
    @OlivierDulac Wouldn't that depend on every other program also using locatedb? Also, it's really rare for me to want to find all files with a given name; I practically always restrict my searches to a directory subtree, and as far as I can tell, using locate for such a use case is a fair bit more complex than a simple find . -iname whatever.
  • Olivier Dulac
    Olivier Dulac over 7 years
    @MichaelKjörling : locate (or slocate) does NOT replace find, but is a quick and easy command to help find some files. their updating program will use find to crawl over the hierarchy (as your crontab does) and the saved db will allow users to quickly find most files (not a necessity, but a very nice thing to have at hand in most cases). it has nothing to do with "every other program using locatedb". Iow, it is a bit like your find, with the added bonus of updating also the locate (or slocate) db, in case some users may want to use those
  • user
    user over 7 years
    If the media is marginal or even potentially marginal, dd is not the best choice. Use ddrescue instead; it handles partial failures much better. But that was not a consideration in the original question.
  • Zak
    Zak over 7 years
    @MichaelKjörling That is a good point.