Options to efficiently synchronize 1 million files with remote servers?


Solution 1

Since instant updates are also acceptable, you could use lsyncd.
It watches directories (via inotify) and rsyncs changes to the slaves.
At startup it performs a full rsync, which will take some time, but after that only changes are transmitted.
Recursive watching of directories is possible; if a slave server is down, the sync is retried until it comes back.
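
A minimal sketch of such a setup, assuming illustrative hostnames and paths (they are not from the question): the one-shot form below mirrors the tree to a single slave over SSH; for all 10 slaves you would instead list one sync{} block (using default.rsyncssh) per host in an lsyncd configuration file.

    # watch /srv/playlists with inotify and keep one slave in sync via rsync+ssh
    # (hostname and paths are placeholders)
    lsyncd -rsyncssh /srv/playlists slave1.example.com /srv/playlists

    # with a config file listing one sync{} block per slave, start the daemon with
    lsyncd /etc/lsyncd/lsyncd.conf.lua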

If this is all in a single directory (or a static list of directories) you could also use incron.
The drawback is that it does not allow recursive watching of directories, and you need to implement the sync functionality yourself.
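
For illustration, a single incron entry might look like the sketch below (the watched path and the push-playlist.sh script are hypothetical; you would have to write that script yourself, and you would need one entry per directory since incron does not recurse):

    # $@ expands to the watched directory and $# to the file name;
    # the entry fires on writes, moves-in and deletions
    printf '%s\n' \
        '/srv/playlists IN_CLOSE_WRITE,IN_MOVED_TO,IN_DELETE /usr/local/bin/push-playlist.sh $@/$#' \
        > /tmp/playlists.incron
    incrontab /tmp/playlists.incron   # load the table for the current user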

Solution 2

Consider using a distributed filesystem, such as GlusterFS. Being designed with replication and parallelism in mind, GlusterFS may scale up to 10 servers much more smoothly than ad-hoc solutions involving inotify and rsync.

For this particular use-case, one could build a 10-server GlusterFS volume of 10 replicas (i.e. 1 replica/brick per server), so that each replica would be an exact mirror of every other replica in the volume. GlusterFS would automatically propagate filesystem updates to all replicas.

Clients in each location would contact their local server, so read access to files would be fast. The key question is whether write latency could be kept acceptably low. The only way to answer that is to try it.
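
A sketch of what that trial could look like on the command line, assuming placeholder hostnames and brick paths:

    # create a 10-way replicated volume, one brick per server, and start it
    gluster volume create playlists replica 10 \
        server{1..10}.example.com:/data/glusterfs/playlists/brick1
    gluster volume start playlists

    # clients in each location mount the volume from their nearest server
    mount -t glusterfs server1.example.com:/playlists /srv/playlists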

Solution 3

I doubt rsync would work for this in the normal way, because scanning a million files and comparing them to the remote system ten times over would take too long. I would try to implement a system with something like inotify that keeps a list of modified files and pushes them to the remote servers (unless these changes are already logged some other way). You can then use this list to quickly identify the files that need to be transferred, perhaps even with rsync (or better, 10 parallel instances of it).
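
A rough sketch of that idea, assuming inotify-tools is installed and using placeholder hosts and paths: one process logs changed paths, and a periodic job ships the accumulated batch to all slaves in parallel.

    # log every written, moved-in or deleted playlist (paths relative to the tree)
    inotifywait -m -r -e close_write,moved_to,delete --format '%w%f' /srv/playlists \
        | sed 's|^/srv/playlists/||' >> /var/tmp/playlist-changes.txt &

    # periodically take the current batch and push it to all 10 slaves in parallel
    # (simplified: a real implementation should rotate the change log atomically)
    sort -u /var/tmp/playlist-changes.txt > /var/tmp/batch.txt
    : > /var/tmp/playlist-changes.txt
    for host in slave{1..10}.example.com; do
        rsync -aW --files-from=/var/tmp/batch.txt /srv/playlists/ "$host:/srv/playlists/" &
    done
    wait
    # note: deletions still need separate handling (e.g. an explicit remove list),
    # since --files-from only transfers files that still exist on the source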

Edit: With a little bit of work, you could even use this inotify/log watch approach to copy the files over as soon as the modification happens.

Solution 4

Some more alternatives:

  • Insert a job into RabbitMQ or Gearman to asynchronously go off and delete (or add) the same file on all remote servers whenever you delete or add a file on the primary server.
  • Store the files in a database and use replication to keep the remote servers in sync.
  • If you have ZFS you can use ZFS replication (incremental snapshot send/receive); a sketch follows this list.
  • Some SANs have file replication. I have no idea if this can be used over the Internet.
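
For the ZFS option, the sketch below shows the usual snapshot send/receive pattern (pool, dataset and host names are placeholders, and "prev" stands for whatever snapshot was replicated last time):

    # take a new snapshot, then send only the delta since the previous one
    now="playlists-$(date +%Y%m%dT%H%M%S)"
    zfs snapshot "tank/playlists@$now"
    for host in slave{1..10}.example.com; do
        zfs send -i tank/playlists@prev "tank/playlists@$now" \
            | ssh "$host" zfs receive -F tank/playlists
    done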

Solution 5

This seems like a textbook use case for MongoDB, and perhaps GridFS. Since the files are relatively small, MongoDB alone should be enough, although it may be convenient to use the GridFS API.

MongoDB is a NoSQL database and GridFS is a file-storage layer built on top of it. MongoDB has many built-in options for replication and sharding, so it should scale very well in your use case.

In your case you would probably start with a replica set consisting of the master located in your primary datacenter (perhaps a second node in the same location if you want local failover) and your ten "slaves" distributed around the world. Then run load tests to check whether the write performance is sufficient, and measure the replication lag to your nodes. If you need more performance, you could turn the setup into a sharded one (mostly to distribute the write load across more servers). MongoDB was designed to scale huge setups on "cheap" hardware, so you can throw in a batch of inexpensive servers to improve performance.
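
A sketch of what initiating such a replica set could look like (hostnames are placeholders; remote members get priority 0 so they serve local reads but never become primary, and note that only 7 members may vote, so with 10 remote locations some would also need votes: 0):

    # run once against the primary to create the replica set
    mongo --host primary.dc1.example.com --eval '
    rs.initiate({
        _id: "playlists",
        members: [
            { _id: 0, host: "primary.dc1.example.com:27017", priority: 2 },
            { _id: 1, host: "standby.dc1.example.com:27017", priority: 1 },
            { _id: 2, host: "eu1.example.com:27017", priority: 0 },
            { _id: 3, host: "us1.example.com:27017", priority: 0 }
            // one additional priority-0 entry per remaining remote location
        ]
    })'

Clients in each location would then read from their local member with a secondary read preference.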

Updated on September 18, 2022

Comments

  • Zilvinas
    Zilvinas over 1 year

    At a company I work for we have such a thing called "playlists", which are small files of ~100-300 bytes each. There are about a million of them, and about 100,000 of them change every hour. These playlists need to be uploaded to 10 other remote servers on different continents every hour, and it needs to happen quickly, ideally in under 2 minutes. It's very important that files deleted on the master are also deleted on all the replicas. We currently use Linux for our infrastructure.

    I was thinking about trying rsync with the -W option to copy whole files without comparing contents. I haven't tried it yet but maybe people who have more experience with rsync could tell me if it's a viable option?

    What other options are worth considering?

    Update: I have chosen the lsyncd option as the answer but only because it was the most popular. Other suggested alternatives are also valid in their own way.

    • Oliver
      Oliver almost 12 years
      Do you have a log indicating what files have been changed or deleted?
    • hookenz
      hookenz almost 12 years
      If only the playlists were MySQL records. You could then use database replication and let MySQL work out what needs to be sent/received.
    • Zilvinas
      Zilvinas almost 12 years
      @oliver we do. However, then you need to trust that log, meaning the code generating it must be correct, and you need custom code to process that log, which also needs to be correct. I'd rather avoid in-house code for this in favour of something that has been extensively tested by the community.
    • faker
      faker almost 12 years
      Do you want the change to only get applied every hour? Or is instant replication also acceptable?
    • Zilvinas
      Zilvinas almost 12 years
      @faker instant is ok. Atm we can only get the playlists changed every hour. However in the future we're hoping to get as close to real time as possible/needed.
    • Oliver
      Oliver almost 12 years
      Don't underestimate the time it takes rsync to work through a million files. Just try it and you will see what you are up against. If you have that log, use it, or try any of the other proposed solutions.
  • Zilvinas
    Zilvinas almost 12 years
    Again a brilliant tip :)
  • Philip
    Philip almost 12 years
    While the storage would be synchronized, you'd have to notify the application, so you'd be back to square one, or the app would have to poll the storage every time someone accesses one of these playlists. Performance would be horrible in either case.
  • Philip
    Philip almost 12 years
    +1 This is essentially a cache coherency problem, a monitor that pushes changes is the easiest solution. lsyncd implements that...
  • Bobe Kryant
    Bobe Kryant almost 12 years
    The application doesn't need to poll the storage every time someone accesses the playlists, just enough times within the hour to ensure that the application is running without stale data. Also, if S3 is used as a backend, why would the application need to poll the files in the first place? They will always be up to date.
  • Old Pro
    Old Pro almost 12 years
    I would investigate lsyncd and inotify deeply as they apply to your specific server OS. There is a limit on the number of inotify watches available. I believe the default is around 1500 or 8000 depending on your particular Linux version. Most kernels let you raise the limit, but monitoring 1 million files may be more than is practical. It didn't work for me in 2008. Also, the inotify event queue can overflow, causing you to lose events, and you need a way to recover from that. A carefully tuned lsyncd implementation plus a daily rsync might work now in 2012 to cover your bases.
  • faker
    faker almost 12 years
    Actually it sets an inotify watch on the directory, not the individual files. How many directories can you watch? Check /proc/sys/fs/inotify/max_user_watches (usually 8192); a sketch for checking and raising that limit follows these comments.
  • Zilvinas
    Zilvinas almost 12 years
    We probably have about ~50k directories as well at the moment which hold those playlists. I understand this might be a concern then.
  • enedene
    enedene almost 12 years
    Great program! I've set up 2 directories for monitoring and it's working well, but does anyone know where the setup files are? I have no idea where the info about the directories I've added for monitoring is stored.
  • Tom O'Connor
    Tom O'Connor almost 12 years
    +1 for Glusterfs
  • Dragos
    Dragos almost 12 years
    There might be a problem with lsyncd watching so many files via inotify for changes. You could try putting all the files in one directory, or, as another solution, use inotify itself with a custom upload script.
  • sbrattla
    sbrattla over 11 years
    I'd like to add that Csync2 might be a good substitute for Rsync in this case. Csync2 uses the Rsync transfer algorithm but keeps a local cache of the file tree on all nodes, which means that it does not have to go through all files to find changes. It works together with inotify, and synchronizes batches of file changes to other nodes. See axivo.com/community/threads/… for more details.