commitLog and SSTables in Cassandra database
You are almost there in your understanding; you are only missing a few small details.
To explain things in a structured way, the life cycle of a Cassandra write operation is divided into these steps:
- commitlog write
- memtable write
- sstable write
Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is considered successful once it has been written to the commit log and to memory, so there is very little disk I/O at the time of the write. Whenever the memtable crosses one of its configured thresholds (for example, it grows past its size limit or has been held in memory past a time limit), its contents are written out to an SSTable, which is immutable. This mechanism is called flushing. Once the data has been flushed to an SSTable, you can see the corresponding files in the data folder, which in your case is S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data. Each SSTable is composed mainly of two files, an index file and a data file:
- Index file: contains the Bloom filter and the (key, offset) pairs
  - Bloom filter: a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Cassandra uses Bloom filters to save I/O when performing a key lookup: each SSTable has a Bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free.
  - (Key, offset) pairs: these point into the data file.
- Data file: contains the actual column data
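To make the "false positives are possible, false negatives are not" property concrete, here is a minimal Bloom filter sketch in Java. It is only an illustration of the idea, not Cassandra's implementation: the class name ToyBloomFilter, the hash functions, and the sizes are all made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Toy Bloom filter. Hypothetical illustration only; Cassandra's real
// filter uses different hashing and sizing.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two simple base hashes
    // (double hashing). floorMod keeps the index non-negative even
    // when the int arithmetic overflows.
    private int position(String key, int i) {
        int h1 = 17;
        int h2 = 31;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = h1 * 31 + b;
            h2 = h2 * 131 + b;
        }
        return Math.floorMod(h1 + i * h2, size);
    }

    // Set every bit position derived from the key.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added.
    // true: the key may have been added (or it is a false positive).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        filter.add("Row1");
        System.out.println("Row1 might exist: " + filter.mightContain("Row1"));
        System.out.println("Row2 might exist: " + filter.mightContain("Row2"));
    }
}
```

A key that was added always answers true, because all of its bits were set. A key that was never added usually answers false, and in that case the data file does not need to be touched at all.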
As for the commit log files, they are binary files maintained internally by Cassandra. They are not encrypted, but they are not plain text either, which is why a text editor cannot display them properly.
UPDATE:
The memtable is an in-memory cache whose contents are stored as key/column pairs, sorted by key. Each column family has its own memtable, and column data is retrieved from it by key. Because the memtable lives only in memory, there is no file for it that you could locate on disk.
In your case the memtable is not full: its thresholds have not been breached yet, so nothing has been flushed. You can read more about MemtableThresholds here, though it is recommended not to touch that dial.
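The following toy model in Java may help to picture why two small writes leave the data folder empty. Everything in it is an assumption made for illustration: the class ToyWritePath, the in-memory lists standing in for on-disk files, and especially the threshold, which here is a simple key count rather than whatever limits your Cassandra version really applies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the write path. Hypothetical names throughout; this is
// not how Cassandra is implemented internally.
final class ToyWritePath {
    // Stands in for the append-only commit log on disk.
    private final List<String> commitLog = new ArrayList<>();
    // The memtable: held in memory only, sorted by key.
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Stands in for the immutable SSTable files in the data folder.
    private final List<Map<String, String>> sstables = new ArrayList<>();
    // Simplified flush threshold (a key count, chosen for illustration).
    private final int flushThreshold;

    ToyWritePath(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. append to the commit log
        memtable.put(key, value);         // 2. update the memtable
        if (memtable.size() >= flushThreshold) {
            flush();                      // 3. flush only once the threshold is hit
        }
    }

    // Turn the current memtable into an immutable "SSTable" and start
    // a fresh memtable. The flushed entries no longer need the log.
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableMap(memtable));
        memtable = new TreeMap<>();
        commitLog.clear();
    }

    int memtableSize() { return memtable.size(); }
    int sstableCount() { return sstables.size(); }
    int commitLogSize() { return commitLog.size(); }

    public static void main(String[] args) {
        ToyWritePath node = new ToyWritePath(1000);
        node.write("Row1", "Test One");
        node.write("Row2", "Engineer");
        System.out.println("SSTables after two writes: " + node.sstableCount());
        node.flush();
        System.out.println("SSTables after a forced flush: " + node.sstableCount());
    }
}
```

With a large threshold, the two writes sit in the commit log and the memtable only, and no SSTable exists until flush() is called. On a real node, running nodetool flush is the usual way to force that step by hand instead of tuning the thresholds.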
SSTable structure:
- Your data folder
  - KEYSPACE
    - CF
      - CompressionInfo.db
      - Data.db
      - Filter.db
      - Index.db
      - Statistics.db
      - snapshots // if snapshots are taken
    - CF
  - KEYSPACE
For more information, refer to sstable.
arsenal
Updated on September 15, 2022
Comments
-
arsenal over 1 year
I recently started working with the Cassandra database. I have installed a single node cluster on my local box, and I am working with Cassandra 1.2.3. I was reading an article on the internet and I found these lines:
Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table).
To understand the above lines, I wrote a simple program that writes to the Cassandra database using the Pelops client, and I was able to insert the data. Now I am trying to see how my data was written into the commit log and where that commit log file is. I would also like to know how SSTables are generated, where I can find them on my local box, and what they contain. I want to look at these two kinds of files so that I can better understand how Cassandra works behind the scenes.
In my cassandra.yaml file, I have something like this
# directories where Cassandra should store data on disk.
data_file_directories:
    - S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data

# commit log
commitlog_directory: S:\Apache Cassandra\apache-cassandra-1.2.3\storage\commitlog

# saved caches
saved_caches_directory: S:\Apache Cassandra\apache-cassandra-1.2.3\storage\savedcaches
But when I open the commit log, first of all it contains so much data that Notepad++ cannot open it properly, and when it does open, I cannot read it, apparently because of some encoding issue. And in my data folder, I cannot find anything.
Meaning this folder is empty for me-
S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data\my_keyspace\users
Is there anything I am missing here? Can anybody explain to me how to read the commit log and SSTable files, and where I can find them? And what exactly happens behind the scenes whenever I write to the Cassandra database?
Updated:-
Code I am using to insert into Cassandra Database-
public class MyPelops {

    private static final Logger log = Logger.getLogger(MyPelops.class);

    public static void main(String[] args) throws Exception {

        // -------------------------------------------------------------
        // -- Nodes, Pool, Keyspace, Column Family ---------------------
        // -------------------------------------------------------------

        // A comma separated List of Nodes
        String NODES = "localhost";

        // Thrift Connection Pool
        String THRIFT_CONNECTION_POOL = "Test Cluster";

        // Keyspace
        String KEYSPACE = "my_keyspace";

        // Column Family
        String COLUMN_FAMILY = "users";

        // -------------------------------------------------------------
        // -- Cluster --------------------------------------------------
        // -------------------------------------------------------------

        Cluster cluster = new Cluster(NODES, 9160);
        Pelops.addPool(THRIFT_CONNECTION_POOL, cluster, KEYSPACE);

        // -------------------------------------------------------------
        // -- Mutator --------------------------------------------------
        // -------------------------------------------------------------

        Mutator mutator = Pelops.createMutator(THRIFT_CONNECTION_POOL);

        log.info("- Write Column -");

        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Name ".getBytes()).setValue(" Test One ".getBytes()).setTimestamp(new Date().getTime()));

        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Work ".getBytes()).setValue(" Engineer ".getBytes()).setTimestamp(new Date().getTime()));

        log.info("- Execute -");
        mutator.execute(ConsistencyLevel.ONE);

        // -------------------------------------------------------------
        // -- Selector -------------------------------------------------
        // -------------------------------------------------------------

        Selector selector = Pelops.createSelector(THRIFT_CONNECTION_POOL);

        int columnCount = selector.getColumnCount(COLUMN_FAMILY, "Row1", ConsistencyLevel.ONE);
        System.out.println("- Column Count = " + columnCount);

        List<Column> columnList = selector
                .getColumnsFromRow(COLUMN_FAMILY, "Row1", Selector.newColumnsPredicateAll(true, 10), ConsistencyLevel.ONE);
        System.out.println("- Size of Column List = " + columnList.size());

        for (Column column : columnList) {
            System.out.println("- Column: (" + new String(column.getName()) + "," + new String(column.getValue()) + ")");
        }

        System.out.println("- All Done. Exit -");
        System.exit(0);
    }
}
Keyspace and Column family that I have created-
create keyspace my_keyspace
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};

use my_keyspace;

create column family users
    with column_type = 'Standard'
    and comparator = 'UTF8Type';
-
arsenal about 11 years
Thanks a lot for the detailed explanation. That clears up most of my doubts, but I have come across some more questions. :) I hope you don't mind answering those as well. Firstly, you mentioned it goes to an in-memory table structure called a memtable. Is there any location for this table structure where I can see what it looks like and what it actually contains? Secondly, you mentioned it gets written to an SSTable only when the memtable runs out of space. In my case, I cannot see any SSTable being created inside the data folder. I have created users as the column family and I inserted two rows into it.
-
arsenal about 11 years
Continuing from above: it might be that the memtable is not full in my case, as I inserted only two rows, and that's why it has not been flushed to an SSTable, right? I have updated my question with the code I am using to insert into the Cassandra database. Thirdly, if I need to see an SSTable, what it looks like and what it contains, then I need to make sure the memtable is full, and only then will it be flushed to an SSTable and created inside the data folder, right? If yes, how can I make sure the memtable is full from my program?
-
Andy almost 11 years
@abhi who told you commit logs are encrypted? That would have affected performance. I can read them in Cassandra 1.2.4.
-
krithikaGopalakrishnan over 7 years
I see a lot of information about commit log syncing, and it is quite unclear to me. What information is available in the commit log? In what format is data stored in the commit log? Can anyone shed some light on this?