Cassandra read timeout


If you are using the Java client from DataStax, pagination is enabled by default with a fetch size of 5,000 rows. If you still get a timeout, you may try to reduce this using

public Statement setFetchSize(int fetchSize)

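For example, here is a minimal sketch with the DataStax Java driver (2.x-era API). The contact point is a placeholder, the query is the one from the question, and 500 is just an arbitrary smaller page size:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class FetchSizeExample {
    public static void main(String[] args) {
        // Placeholder contact point; adjust for your cluster.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        Statement stmt = new SimpleStatement(
                "SELECT KeywordId, Date, HourOfDay, Impressions, Clicks, AveragePosition, "
              + "ConversionRate, AOV, AverageCPC, Bid "
              + "FROM StatisticsKeyspace.hourlystatistics "
              + "WHERE Date >= '2014-03-22' AND Date <= '2014-03-24'");
        // Ask for 500 rows per page instead of the default 5000.
        stmt.setFetchSize(500);

        // The driver fetches the next page transparently as the iterator advances.
        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {
            System.out.println(row.getString("KeywordId") + " " + row.getInt("Impressions"));
        }

        cluster.close();
    }
}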

If you are using the CLI, you may need to experiment with some kind of manual pagination:

SELECT KeywordId, Date, HourOfDay, Impressions, Clicks, AveragePosition, ConversionRate, AOV, AverageCPC, Bid
FROM StatisticsKeyspace.hourlystatistics 
WHERE Date >= '2014-03-22' AND Date <= '2014-03-24' 
LIMIT 100;

SELECT * FROM ....  WHERE token(KeywordId) > token([Last KeywordId received]) AND ...
LIMIT 100;
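In driver code, that manual token-based paging could look roughly like the sketch below. This is only an illustration: it reuses the session from the snippet above, builds the CQL by hand (a bound SimpleStatement would be safer), and stops when a page comes back empty.

String lastKeywordId = null;
while (true) {
    String cql = (lastKeywordId == null)
            ? "SELECT * FROM StatisticsKeyspace.hourlystatistics LIMIT 100"
            : "SELECT * FROM StatisticsKeyspace.hourlystatistics"
              + " WHERE token(KeywordId) > token('" + lastKeywordId + "') LIMIT 100";
    ResultSet page = session.execute(cql);
    if (page.isExhausted()) {
        break;  // no rows left
    }
    for (Row row : page) {
        // Remember the last partition key seen so the next page can start after it.
        lastKeywordId = row.getString("KeywordId");
        // ... process the row ...
    }
}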

To detect cluster issues, you can also try the same select with a LIMIT of 1; maybe there is an underlying problem.

Hope that helps.

If you are still experiencing performance issues with your query, I would look at your secondary index, since the amount of data transferred seems to be reasonable (only 'small' data types are returned). If I am right, changing the fetch size will not change much. Instead, do you insert dates only in your "Date" (timestamp) column? If you are inserting actual timestamps, the secondary index on this column will be very slow due to its high cardinality. If you insert a date only, the timestamp defaults to date + "00:00:00" + TZ, which should reduce the cardinality and thus improve look-up speed (watch out for timezone issues!). To be absolutely sure, try a secondary index on a column with a different data type, such as an int for Date (counting the days since 1970-01-01 or something similar).
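As a rough sketch of that last idea (the table name HourlyStatisticsByDay, column name DateDay, and the inserted sample values are all made up for illustration; the session is the one from the snippets above):

// Hypothetical variant of the table with an int "days since 1970-01-01" column
// plus a secondary index on it.
session.execute(
        "CREATE TABLE StatisticsKeyspace.HourlyStatisticsByDay ("
      + "KeywordId text, DateDay int, HourOfDay int, "
      + "Impressions int, Clicks int, AveragePosition double, "
      + "ConversionRate double, AOV double, AverageCPC double, "
      + "Cost double, Bid double, "
      + "PRIMARY KEY (KeywordId, DateDay, HourOfDay))");
session.execute("CREATE INDEX ON StatisticsKeyspace.HourlyStatisticsByDay (DateDay)");

// Days since the epoch for 2014-03-22; no time-of-day or timezone component involved.
int dateDay = (int) java.time.LocalDate.parse("2014-03-22").toEpochDay();
session.execute("INSERT INTO StatisticsKeyspace.HourlyStatisticsByDay "
        + "(KeywordId, DateDay, HourOfDay, Impressions, Clicks) "
        + "VALUES ('53961673d446bd71503d8bde', " + dateDay + ", 10, 120, 7)");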


Comments

  • Wild Goat almost 2 years

    I am pulling a big amount of data from Cassandra 2.0, but unfortunately I am getting a timeout exception. My table:

    CREATE KEYSPACE StatisticsKeyspace
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
    
    
    CREATE TABLE StatisticsKeyspace.HourlyStatistics (
        KeywordId text,
        Date timestamp,
        HourOfDay int,
        Impressions int,
        Clicks int,
        AveragePosition double,
        ConversionRate double,
        AOV double,
        AverageCPC double,
        Cost double,
        Bid double,
        PRIMARY KEY (KeywordId, Date, HourOfDay)
    );
    CREATE INDEX ON StatisticsKeyspace.HourlyStatistics(Date);
    

    My query:

    SELECT KeywordId, Date, HourOfDay, Impressions, Clicks, AveragePosition, ConversionRate, AOV, AverageCPC, Bid
    FROM StatisticsKeyspace.hourlystatistics 
    WHERE Date >= '2014-03-22' AND Date <= '2014-03-24'
    

    I've changed the configuration in my cassandra.yaml file:

    read_request_timeout_in_ms: 60000
    range_request_timeout_in_ms: 60000
    write_request_timeout_in_ms: 40000
    cas_contention_timeout_in_ms: 3000
    truncate_request_timeout_in_ms: 60000
    request_timeout_in_ms: 60000
    

    But it still throws a timeout after approximately 10 seconds. Any ideas how I can fix this problem?

  • Wild Goat almost 10 years
    Thanks! I actually changed SocketOptions and set the timeout inside my DataStax Java client. Right now it does not time out, but it takes ages. Do you think I can improve performance by tweaking the fetch size?
  • John almost 10 years
    I updated my answer. Try whether reducing the fetch size helps to pinpoint the issue. Maybe it's the secondary index, though (see my answer).
  • Wild Goat almost 10 years
    Thanks for your reply. I still don't get why a timestamp would reduce performance, since I am rounding it to midnight; in my understanding the number of index entries should not differ from the number of days since 1970, but I will definitely try right now! Also, do you think I should make Date the primary index and KeywordId the secondary one, and how would that affect my INSERT/READ performance? Thanks a lot!
  • John almost 10 years
    Well, the main impact of the PK is the distribution among your nodes. For optimal write performance you want an even distribution. Using only time-related attributes will always result in hotspots (for example, every write between 10:00 and 11:00 may go to the same node). Could you give some information on your "keywordId" field? If there are a limited number of keyword IDs, you may add this as another secondary index at any time and see if this increases lookup speed. Also, try to monitor read/write throughput, for example using DataStax OpsCenter or similar.
  • Wild Goat almost 10 years
    Thanks! I've tried using int days since 1970 and it looks like it improved performance, but I have only one node. Could you please explain that behavior, and why it is faster considering the fact that I was rounding all Date values to midnight 00:00:00 and running on one node? Also, my keyword is a string in the following format: 53961673d446bd71503d8bde
  • John almost 10 years
    How can you have just 1 node, but a replication_factor of 3 (in your question)? This may cause problems; documentation: "When replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met." Regarding the secondary index performance of rounded timestamps vs. integers, I am not sure how timestamps are indexed by Cassandra. Secondary indexes are not distributed like reverse-lookup tables, so a lookup hits each node, and performance is fine if cardinality is not too high. Maybe the lookup is just costly for timestamps.
  • Wild Goat almost 10 years
    Thanks! Should I put replication_factor: 1 if I am only on one node?
  • John almost 10 years
  • Stevel over 7 years
    @omni how does a "hotspot" in the PK impact distribution across nodes? Isn't the distribution based on the hash of the PK, which removes the concern of such hotspots in the key?
  • John over 7 years
    @Stevel This may warrant a different question, but here we go: the original post states that he uses dates in his PK, which makes all the difference. The distribution across nodes is determined by the partition key, in this case the KeywordId. If he used his "date" as the partition key instead, every row from the same day would map to the same hashed partition key, since a hash of the same date value always returns the same hashed value. All writes for data from that day would hit the same nodes, creating a hotspot.