Is the IN relation in Cassandra bad for queries?

10,478

Solution 1

I remembered seeing someone answer this question in the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Ironically, Cassandra Evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers...points #13 and #22). But the answer is "yes" that in some cases, multiple, parallel queries would be faster than using an IN. The underlying reason can be found in the DataStax SELECT documentation.

When not to use IN

...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.

So based on that, it would seem that this becomes more of a problem as your cluster gets larger.

Therefore, the best way to solve this problem (and not have to use IN at all) would be to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) that are shared by ticket IDs 1, 2, 3, and 4. Maybe using something like level or group (if tickets are for a particular venue) or maybe even an event (id), instead.

Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes it harder to query your data (in Cassandra) later on. If you could come up with a different column to partition your data on, that would certainly help you in this case. Regardless, creating a new, specific column family (table) to handle queries for those rows is going to be a better approach than using IN or multiple queries.

Solution 2

Yes, its better to query individually than using IN in Cassandra.

For this query, the coordinator has to get the data from 4 different partitions and if each partition is very big then the data gets filled in JVM which can cause problem.

Instead querying the data using multiple queries is better as each query is individual and don't have to wait for other partitions data to send it back to user.

Share:
10,478
Andy Leung
Author by

Andy Leung

Updated on June 05, 2022

Comments

  • Andy Leung
    Andy Leung almost 2 years

    Given an example of the following select in CQL:

    SELECT * FROM tickets WHERE ID IN (1,2,3,4)
    

    Given ID is a partition key, is using IN relation better than doing multiple queries or is there no difference?

  • pinkpanther
    pinkpanther about 9 years
    what about delete query? I currently have a query DELETE FROM xyz WHERE pkey IN(1,2,3,4). Is IN operator for DELETE better than the SELECT?
  • Aaron
    Aaron about 9 years
    @pinkpanther It depends if you care about that delete operation performing well. If you have several nodes, I would think that the same performance issues would befall you. But if you're just doing a one-off maintenance delete, then I wouldn't worry about it.
  • Andrew Weaver
    Andrew Weaver almost 8 years
    The advise to consolidate records under a higher-cardinality partition key is good to optimize for the IN query, however you must be mindful of the partition size. The sweet spot is around 100MB so if your partitions become much larger than that, you should think of another way to split it up. Since the data model is the single-biggest determiner of performance in Cassandra, I suggest taking the data modeling course from DataStax Academy.
  • User3518958
    User3518958 about 4 years
    what if IN clause is used on a clustering column? Is it still not recommended?
  • Aaron
    Aaron about 4 years
    @User3518958 If you're providing the partition key and using IN on a clustering key, you'll benefit from being able to retrieve data by hitting only one node. However, you would still be forcing Cassandra to read randomly (and not sequentially) within the partition. The performance of such a query would largely depend on the partition size and number of clustering keys.
  • User3518958
    User3518958 about 4 years
    ok, got it, 'you would still be forcing Cassandra to read randomly (and not sequentially) within the partition', this is good, thanks