What are the best practices for backing up a Cassandra cluster?


The traditional "backup and restore" documentation is here: http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_backup_restore_c.html

Essentially, you take a snapshot on each node and copy the snapshot files somewhere safe. It is pretty much "take a snapshot and rsync it somewhere". Incremental backups can help reduce backup sizes. The link explains all of this in more detail.
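As a rough illustration of "take a snapshot and rsync it somewhere", here is a minimal per-node sketch. It assumes the default data directory `/var/lib/cassandra/data` and the usual `<keyspace>/<table>/snapshots/<tag>` layout; the tag, keyspace, and destination names are hypothetical, and the function names are mine, not part of any Cassandra tooling. It defaults to a dry run that only returns the commands it would execute.

```python
import subprocess
from pathlib import Path

# Assumption: default data directory; check data_file_directories in cassandra.yaml.
DATA_DIR = Path("/var/lib/cassandra/data")


def snapshot_command(tag, keyspace):
    """Build the nodetool call that snapshots one keyspace under a named tag."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def find_snapshot_dirs(data_dir, keyspace, tag):
    """Return every <table>/snapshots/<tag> directory of the keyspace."""
    pattern = f"{keyspace}/*/snapshots/{tag}"
    return sorted(p for p in Path(data_dir).glob(pattern) if p.is_dir())


def rsync_command(snapshot_dir, data_dir, destination):
    """Build an rsync call that keeps the keyspace/table/snapshots/tag layout."""
    relative = Path(snapshot_dir).relative_to(data_dir)
    # With --relative, the "/./" marker tells rsync where the preserved path starts.
    return ["rsync", "-a", "--relative", f"{data_dir}/./{relative}", destination]


def backup_keyspace(keyspace, tag, destination, data_dir=DATA_DIR, dry_run=True):
    """Snapshot one keyspace on this node, then copy the snapshot files away."""
    commands = [snapshot_command(tag, keyspace)]
    if not dry_run:
        subprocess.run(commands[0], check=True)
    for snap in find_snapshot_dirs(data_dir, keyspace, tag):
        cmd = rsync_command(snap, data_dir, destination)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Something like `backup_keyspace("my_keyspace", "nightly", "backup-host:/backups/node1/")` would have to run on every node, since each node only holds its own share of the data.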

However, if all you want is a "secondary" that can take over if the machines get hit by a meteor, a common approach is to add another datacenter (often with fewer nodes) and set the replication factor on the keyspace(s) so that data is also replicated to this "backup" datacenter. Your apps would normally write to the "main" datacenter at LOCAL_QUORUM, while the backup datacenter serves, well, as a backup. If the backup DC is powerful enough, it can even serve as a hot standby.
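A small sketch of what that keyspace setting could look like. It only builds the CQL text; the keyspace name `app` and the datacenter names `main` and `backup` are hypothetical and would have to match the names your snitch reports (see `nodetool status`). The helper names are mine as well.

```python
def replication_cql(keyspace, replicas_per_dc):
    """Build an ALTER KEYSPACE statement with per-datacenter replica counts."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{', '.join(parts)}}};"


def local_quorum(replication_factor):
    """Replicas in the local datacenter that must acknowledge a LOCAL_QUORUM request."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    # Three replicas in the main DC, two in the smaller backup DC (illustrative numbers).
    print(replication_cql("app", {"main": 3, "backup": 2}))
    print(local_quorum(3))
```

Changing the replication settings does not move existing data by itself; nodes in a newly added datacenter are typically populated with `nodetool rebuild` (or a repair) afterwards.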

With this setup, Cassandra replicates data to the backup datacenter as it is written, which avoids cumbersome snapshot-based backups with files stored on a network share. However, it will not protect you from a dev mistakenly deleting data from Cassandra, because deletes are replicated too. Things like `DROP KEYSPACE ...` can be recovered for a certain period of time, but if you mistakenly delete some rows, they're gone.

Hope that helps.

Author: Andrew

Updated on June 26, 2022

Comments

  • Andrew
    Andrew almost 2 years

    I have a cassandra cluster with ~20 nodes in multiple datacenters. I want to back up the cassandra database. I want it to be possible to restore the backup to a new cluster even if every node in the existing one is simultaneously hit by a meteor.

    1. What exactly do I need to copy off of the server(s) and preserve in order to make a from-scratch restore of a cassandra database possible, and where are these items stored? I gather that this is not as simple as "take a snapshot and rsync it somewhere".
    2. How do I perform the backup and restore?
    3. Where is this process documented?
  • Andrew
    Andrew almost 9 years
    The reason I suggest it isn't just "take a snapshot and rsync it somewhere" comes from the very page (well, subpages thereof) you're pointing to. e.g. the "Restore from a snapshot" page suggests that I should also be (separately) backing up the schema, and the "restoring to a new cluster" page suggests that I need a token list from the old cluster as well.
  • Andrew
    Andrew almost 9 years
    (hilariously, the latter assumes the old cluster will be alive when describing how to restore to a new one. That was the point at which I decided to ask here instead)
  • ashic
    ashic almost 9 years
    If you are using vnodes, token lists may not be necessary. Another cluster with the same schema should work; I've restored data to Vagrant boxes, for example. But again, the easiest way of getting data from A to B is Cassandra replication, which is what it's built for. You could "rsync backup" the backup :)
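
The schema and token list mentioned in the comments above could be captured alongside the snapshots with something like the sketch below. It assumes `cqlsh -e "DESCRIBE SCHEMA"` for the schema dump and the usual `nodetool ring` layout, where each row starts with the node address and ends with a token; the sample text and IP addresses are made up, and the function name is hypothetical.

```python
# Assumption: run on a node where cqlsh can connect; redirect the output to a file.
SCHEMA_DUMP_COMMAND = ["cqlsh", "-e", "DESCRIBE SCHEMA"]


def tokens_for_node(ring_output, address):
    """Collect one node's tokens from `nodetool ring` text as a comma-separated list."""
    tokens = []
    for line in ring_output.splitlines():
        fields = line.split()
        # Data rows begin with the node address; the token is the last column.
        if fields and fields[0] == address:
            tokens.append(fields[-1])
    return ",".join(tokens)


# Made-up sample that mimics the column layout of `nodetool ring`.
SAMPLE_RING = """\
Datacenter: main
==========
Address    Rack   Status State   Load      Owns    Token
                                                  9100000000000000000
10.0.0.1   rack1  Up     Normal  1.2 GiB   50.00%  -9100000000000000000
10.0.0.2   rack1  Up     Normal  1.1 GiB   50.00%  -100
10.0.0.1   rack1  Up     Normal  1.2 GiB   50.00%  4500
"""

if __name__ == "__main__":
    print(tokens_for_node(SAMPLE_RING, "10.0.0.1"))
```

Saving this list per node at backup time means a restore does not depend on the old cluster still being alive; the string is in the form the `initial_token` setting in cassandra.yaml expects.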