How to find the total space occupied by a Cassandra keyspace?

cassandra

13,924

Solution 1

What is Compaction?

SSTables are immutable: once a memtable is flushed to disk, the resulting SSTable remains unchanged until it is deleted (expired) or compacted. Compaction is the process of merging SSTables together. This matters when your workload is update-heavy and several versions of a CQL row may be stored across your SSTables (see "SSTables per read" in nodetool cfhistograms). When you read that row, Cassandra may have to scan multiple SSTables to find the latest version of the data (in C*, last write wins). Compaction can temporarily take up additional disk space (especially size-tiered compaction, which may need up to 50% of your data size while compacting; this is a theoretical maximum), so it is important to keep free disk space. However, compaction does not take data away from your keyspace directory, so it is not where your missing data is.

Then where did my data go?

You're right to suspect that data that has not yet been flushed to disk is sitting in memtables. This data will be flushed to disk as soon as your commitlog fills up (default 1 GB in 2.0 or 8 GB in 2.1) or as soon as your memtables get too big (see memtable_total_space_in_mb).

If you want to see your data in sstables, you can flush it manually:

nodetool flush

and your memtables will be written to your keyspace directory as SSTables. Or just be patient and wait until you hit either the commitlog or the memtable threshold.
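As a rough sketch of checking the result after a flush, the helper below prints the on-disk size of one keyspace directory on the current node. The function name `keyspace_disk_usage`, the data path, and the keyspace name are assumptions for illustration; substitute the `data_file_directories` location from your own installation.

```shell
#!/bin/sh
# Hypothetical helper: print the size, in KiB, of one keyspace directory
# on this node. Both arguments are assumptions about your installation.
keyspace_disk_usage() {
    data_dir=$1
    keyspace=$2
    # du -sk prints "<size in 1 KiB blocks> <path>"; keep only the number.
    du -sk "$data_dir/$keyspace" | awk '{ print $1 }'
}

# Usage on a Cassandra node (not executed here):
#   nodetool flush my_keyspace
#   keyspace_disk_usage /var/lib/cassandra/data my_keyspace
```

Note that the figure also includes any snapshot or backup directories stored under the keyspace, so it can be larger than the live SSTable size.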

But aren't cassandra writes durable?

Yes, your memtable data is also stored in the commitlog. If your machine loses power, for example, the data that has been written is still persisted on disk, and the commitlog is replayed on startup.

Solution 2

I use nodetool status <keyspace>. The Load column value is roughly the same as the value I get from df -h (my Cassandra installations are on different partitions than the system).
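For a cluster-wide figure based on disk usage rather than the Load column, one option is to run du on every node and add up the results. The sketch below assumes SSH access to each node, GNU du (for the -b flag), and the same data path everywhere; the function names, host names, and path are all hypothetical.

```shell
#!/bin/sh
# Add up the first field (a byte count) of every input line.
sum_bytes() {
    awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Hypothetical wrapper: SSH access, host names, and the data path are
# assumptions about your cluster, not something Cassandra provides.
cluster_keyspace_bytes() {
    keyspace=$1
    shift
    for host in "$@"; do
        # GNU du -sb prints the apparent size of the directory in bytes.
        ssh "$host" "du -sb /var/lib/cassandra/data/$keyspace" </dev/null
    done | sum_bytes
}

# Usage (not executed here):
#   cluster_keyspace_bytes my_keyspace node1 node2 node3
```

The total counts every replica, so with a replication factor of 3 it is roughly three times the size of the logical data.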


Knight71

Enthusiastic programmer.

Updated on September 15, 2022

Comments

Knight71 over 1 year
I am trying to find the total physical size occupied by a Cassandra keyspace.

I have a message generator which dumps a lot of messages into Cassandra. I want to find out the total physical size of the messages in a Cassandra table.

When I run du -h /mnt/data/keyspace, Linux reports only 12 KB. I am sure that the data size is much greater than that. The rest of the data must be in memtables.

How do I find the total space occupied in Cassandra for that keyspace?

I tried:
```
     nodetool cfstats <keyspace>
```
But it only gives me figures for that particular node. Also, some of the bytes are still in memtables. What I actually want is the total size of the keyspace that has actually been written to disk across all nodes in the cluster. Is there a command to find this?

Thanks for the help.
- phact about 9 years
  
  You can run du -h on the keyspace data directory on each node in your cluster and add up the results. There may also be an SSTable size MBean in JMX, but I think it's per table, not per keyspace.
- Knight71 about 9 years
  
  I have a message generator which dumps a lot of messages into Cassandra. I want to find out the total physical size of the messages in a Cassandra table. When I run du -h /mnt/data/keyspace it reports only 12 KB, whereas I am sure that the data size is much more than that. So the actual data is either in memtables or in compaction. How do I find the total space occupied in Cassandra for that keyspace? Thanks for the help.
FGreg over 8 years

This doesn't seem to answer the question 'I actually want the total size of keyspaces that are actually written to disk across all nodes in the cluster. Is there any command to find this?'
törzsmókus over 3 years

AFAIK, nodetool status does not take a <keyspace> argument but shows status for the whole cluster.
törzsmókus over 3 years

Use numfmt or awk to convert the bytes to a human-readable unit (KiB, MiB, etc.): awk '{ split( "B KiB MiB GiB TiB PiB" , v ); s=1; while( $1>1024 ){ $1/=1024; s++ } printf "%.2f %s", $1, v[s] }'
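The same conversion can be wrapped in a small shell function. This is only a sketch of the awk approach above: the name `human_bytes` is made up, and the loop is bounded so that values beyond the PiB range do not run past the last unit.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count to a binary-prefixed unit.
human_bytes() {
    awk -v n="$1" 'BEGIN {
        split("B KiB MiB GiB TiB PiB", unit, " ")
        i = 1
        # Divide by 1024 until the value fits, stopping at the last unit.
        while (n >= 1024 && i < 6) { n /= 1024; i++ }
        printf "%.2f %s\n", n, unit[i]
    }'
}

human_bytes 1536   # prints "1.50 KiB"
```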