How to find the total space occupied by a Cassandra keyspace?


Solution 1

What is Compaction?

SSTables are immutable: once a memtable is flushed to disk, the resulting SSTable remains unchanged until it is deleted (expired) or compacted. Compaction is the process of merging SSTables together. This matters when your workload is update-heavy and several versions of a CQL row may be stored across your SSTables (see "SSTables per read" in nodetool cfhistograms). When you read that row, Cassandra may have to scan multiple SSTables to find the latest version of the data (in C*, last write wins). Compaction can temporarily take up additional disk space (especially size-tiered compaction, which may need up to 50% of your data size while compacting; this is a theoretical maximum), so it is important to keep free disk space. However, compaction does not take data away from your keyspace directory, so it does not explain where your data went.

Then where did my data go?

You're right to suspect that data that has not yet been flushed to disk is sitting in memtables. This data will make it to disk as soon as your commitlog fills up (default 1 GB in 2.0 or 8 GB in 2.1) or as soon as your memtables grow past the limit set by memtable_total_space_in_mb.

If you want to see your data in sstables, you can flush it manually:

nodetool flush

Your memtables will then be written to your keyspace directory in the form of SSTables. Or just be patient and wait until you hit either the commitlog or the memtable threshold.
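As a rough sketch of the flush-then-measure step on a single node: the function name flush_and_measure is hypothetical, and the default data directory /mnt/data is only an assumption taken from the path in the question, so adjust both to your installation. It relies on nodetool flush accepting a keyspace name.

```shell
# Flush one keyspace's memtables to disk, then report the on-disk size
# of that keyspace's directory on the local node.
# $1 = keyspace name
# $2 = data directory (assumed default: /mnt/data, as in the question)
flush_and_measure() {
    keyspace="$1"
    data_dir="${2:-/mnt/data}"

    # Write the keyspace's memtables out as SSTables; stop if the flush fails.
    nodetool flush "$keyspace" || return 1

    # Summarize the keyspace directory size in human-readable units.
    du -sh "$data_dir/$keyspace"
}

# Example call (hypothetical keyspace name):
# flush_and_measure my_keyspace /mnt/data
```

This only covers the node it runs on; the result still has to be collected from every node to get a cluster-wide figure.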

But aren't Cassandra writes durable?

Yes, your memtable data is also stored in the commitlog. If your machine loses power, for example, the data that has been written is still persisted to disk, and the commitlog data will be replayed on startup.

Solution 2

I use nodetool status <keyspace>. The Load column value is roughly the same as the value I get using df -h (my Cassandra installations are on different partitions than the system).


Knight71
Author by

Knight71

Enthusiastic programmer.

Updated on September 15, 2022

Comments

  • Knight71
    Knight71 over 1 year

    I am trying to find the total physical size occupied by a Cassandra keyspace.

    I have a message generator that dumps a lot of messages into Cassandra. I want to find the total physical size of the messages in a Cassandra table.

    When I run du -h /mnt/data/keyspace, Linux reports only 12 KB. I am sure the data size is much greater than that. The rest of the data must either be in memtables or in compaction.

    How do I find the total space occupied in Cassandra for that keyspace?

    I tried the

         nodetool cfstats <keyspace>
    

    But it gives me the size only for that particular node, and it also counts bytes that are still in memtables. I actually want the total size of keyspaces that are actually written to disk across all nodes in the cluster. Is there any command to find this?

    Thanks for the help.

    • phact
      phact about 9 years
      You can run du -h on the keyspace data directory on each node in your cluster and add up the results. There may also be an SSTable size MBean in JMX, but I think it's per table, not per keyspace.
    • Knight71
      Knight71 about 9 years
      I have a message generator that dumps a lot of messages into Cassandra. I want to find the total physical size of the messages in a Cassandra table. When I run du -h /mnt/data/keyspace it reports only 12 KB, whereas I am sure the data size is much more than that. So the actual data is either in memtables or in compaction. How do I find the total space occupied in Cassandra for that keyspace? Thanks for the help.
  • FGreg
    FGreg over 8 years
    This doesn't seem to answer the question 'I actually want the total size of keyspaces that are actually written to disk across all nodes in the cluster. Is there any command to find this?'
  • törzsmókus
    törzsmókus over 3 years
    AFAIK, nodetool status does not take a <keyspace> argument but shows status for the whole cluster.
  • törzsmókus
    törzsmókus over 3 years
    Use numfmt or awk to convert the bytes to human-readable units (i.e. KiB, MiB etc.): awk '{ split( "B KiB MiB GiB TiB PiB" , v ); s=1; while( $1>1024 ){ $1/=1024; s++ } printf "%.2f %s", $1, v[s] }'
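The "run du on every node and add" suggestion, combined with the numfmt hint, could be sketched roughly as follows. The helper names sum_bytes and human_bytes are hypothetical, the host names in the usage comment are placeholders, and du -sb and numfmt assume GNU coreutils on every node.

```shell
# Sum byte counts given one integer per line on stdin and print the total.
# "%.0f" is used instead of "%d" so large totals are not truncated by
# awk implementations with a 32-bit integer format.
sum_bytes() {
    awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Convert a byte count to a human-readable IEC string (KiB, MiB, ...)
# using numfmt from GNU coreutils.
human_bytes() {
    numfmt --to=iec-i --suffix=B "$1"
}

# Usage sketch (assumes SSH access and the same data path on each node;
# node1..node3 are placeholder host names):
#
# total=$(for host in node1 node2 node3; do
#             ssh "$host" du -sb /mnt/data/keyspace | cut -f1
#         done | sum_bytes)
# human_bytes "$total"
```

Because each replica stores its own copy, the summed figure reflects raw disk usage across the cluster, including replication, rather than the logical size of the data.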