How to batch select data from Cassandra effectively?


Solution 1

Large IN queries create GC pauses and heap pressure, which lead to slower overall performance. When you execute a large IN query, you are waiting on a single coordinator node to give you a response; it keeps all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.

Approach 1:

Try to convert your IN query into a range query (>=, <=):

SELECT * FROM visit WHERE enterprise_id = ? AND id >= ? AND id <= ?
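
A range slice on id is only valid here because, in the table from the question, enterprise_id is the partition key and id is a clustering column, so the slice stays within a single partition. A sketch (the literal values are illustrative):

```
-- enterprise_id is the partition key; id is the clustering column,
-- so this slice reads one contiguous range from one partition:
SELECT * FROM visit WHERE enterprise_id = 1 AND id >= 'a' AND id <= 'm';
```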

Approach 2:

Use executeAsync. Java example:

PreparedStatement statement = session.prepare(
    "SELECT * FROM visit WHERE enterprise_id = ? AND id = ?");

// Fire off all the queries without blocking.
List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 1; i < 4; i++) {
    // id is text in the schema, so bind it as a String
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(i, String.valueOf(i)));
    futures.add(resultSetFuture);
}

// Now collect the responses.
List<String> results = new ArrayList<>();
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    Row row = rows.one();
    if (row != null) {
        results.add(row.getString("name"));
    }
}
return results;

Approach 3:

If possible, instead of an IN query, create another table: whenever the data you would otherwise fetch with IN is inserted or updated, also write it to the new table. Then you can query the new table directly, without IN.
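
A sketch of this dual-write pattern, assuming a hypothetical lookup table named visit_by_id keyed the way you want to query (a logged batch keeps the two writes atomic, at some coordinator cost):

```
-- hypothetical lookup table keyed by id alone
CREATE TABLE visit_by_id (
    id text PRIMARY KEY,
    enterprise_id int
    -- duplicate whatever other visit columns you need to read
);

-- on every insert/update, write both tables together
BEGIN BATCH
    INSERT INTO visit (enterprise_id, id) VALUES (?, ?);
    INSERT INTO visit_by_id (id, enterprise_id) VALUES (?, ?);
APPLY BATCH;
```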

Source:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Solution 2

My preferred way to issue these kinds of queries is to unroll the IN part. That simply means issuing multiple queries in parallel: the token-aware driver treats each one as a single independent query and spreads them among different nodes, making each node the coordinator for the queries that reach it.

You should run at most X queries at a time and wait until at least one of them finishes (I use Java):

final int X = 1000;  // max in-flight queries; tune for your cluster
ArrayList<ResultSetFuture> futures = new ArrayList<>();
ArrayList<ResultSet> results = new ArrayList<>();
for (int i = 0; i < allTheRowsINeedToFetch; i++) {
    futures.add(session.executeAsync(myBeautifulPreparedStatement.bind(xxx, yyy, zzz)));
    // Drain: block when the window is full, and opportunistically collect
    // any future at the head of the list that has already completed.
    while (futures.size() >= X || (futures.size() > 0 && futures.get(0).isDone())) {
        ResultSetFuture rsf = futures.remove(0);
        results.add(rsf.getUninterruptibly());
    }
}

// Collect whatever is still in flight.
while (futures.size() > 0) {
    ResultSetFuture rsf = futures.remove(0);
    results.add(rsf.getUninterruptibly());
}

// Now use the results

This is known as backpressure, and it is used to move the pressure from the cluster to the client.

The nice thing about this method is that you can go truly parallel (X = allTheRowsINeedToFetch), truly serial (X = 1), or anywhere in between; the right value depends only on your cluster hardware. Low values of X mean you are not using your cluster's capacity enough; high values mean you're asking for trouble, because you'll start to see timeouts. So you really need to tune it.
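
The bounded-window pattern above can be sketched without a live Cassandra cluster, using plain java.util.concurrent futures in place of ResultSetFuture. BoundedAsync and runBounded are illustrative names, and the stub tasks stand in for executeAsync calls; the draining rule mirrors the answer's loop:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedAsync {

    // Runs all tasks, keeping at most windowSize in flight (the answer's X).
    static List<Integer> runBounded(List<Callable<Integer>> tasks, int windowSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(windowSize);
        Deque<Future<Integer>> inFlight = new ArrayDeque<>();
        List<Integer> results = new ArrayList<>();
        try {
            for (Callable<Integer> task : tasks) {
                inFlight.add(pool.submit(task));
                // Same draining rule as the answer: block when the window is
                // full, and collect any head-of-queue future already done.
                while (inFlight.size() >= windowSize
                        || (!inFlight.isEmpty() && inFlight.peek().isDone())) {
                    results.add(inFlight.remove().get());
                }
            }
            // Collect whatever is still in flight.
            while (!inFlight.isEmpty()) {
                results.add(inFlight.remove().get());
            }
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            final int v = i;
            tasks.add(() -> v * v);  // stub "query" returning v squared
        }
        // Results come back in submission order because we always drain the head.
        System.out.println(runBounded(tasks, 3));
    }
}
```

Because the queue is drained from the head, results preserve submission order even though up to windowSize tasks execute concurrently.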

Author: pangpang

Updated on June 05, 2022

Comments

  • pangpang (almost 2 years ago)

    I know Cassandra doesn't support batch queries, and using IN is not recommended because it can degrade performance. But I have to get the data by id, for example:

    select * from visit where id in ([visit_id array])
    

    desc table:

    CREATE TABLE visit (
        enterprise_id int,
        id text,
        ........
        PRIMARY KEY (enterprise_id, id)
    );

    The array may have thousands of elements. Is there any way to do this efficiently?