How do I reduce Elasticsearch scroll response time?

14,682

Solution 1

.setSize(5000) means that each client.prepareSearchScroll call is going to retrieve 5000 records per shard. You are requesting the source back, and if your records are big, assembling 5000 records per shard in memory might take a while. I would suggest trying a smaller number. Try 100 and 10 to see whether you get better performance.

.setFrom(0) is not necessary.
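One way to compare batch sizes is a small timing harness like the sketch below. It is only an illustration: the class name ScrollSizeTrial, the method timeBatchSizes and the scrollJob callback are hypothetical and not part of the Elasticsearch API. The callback is assumed to run one complete scroll with the given size (for example by passing it to .setSize(...)); here a placeholder stands in for it so the code runs without a cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntConsumer;

public class ScrollSizeTrial {

    // Runs the supplied scroll job once per candidate batch size and records the
    // wall-clock time of each run in milliseconds. The job is a hypothetical
    // callback: in real code it would issue the search with .setSize(batchSize)
    // and page through every scroll response until no hits remain.
    static Map<Integer, Long> timeBatchSizes(int[] batchSizes, IntConsumer scrollJob) {
        Map<Integer, Long> elapsedMillis = new LinkedHashMap<>();
        for (int size : batchSizes) {
            long start = System.nanoTime();
            scrollJob.accept(size);
            elapsedMillis.put(size, (System.nanoTime() - start) / 1_000_000);
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        // Placeholder job so the sketch runs without a cluster; replace with a real scroll.
        IntConsumer fakeJob = size -> {
            int totalHits = 10_000;
            for (int fetched = 0; fetched < totalHits; fetched += size) {
                // a real job would fetch and process one batch of `size` hits here
            }
        };
        timeBatchSizes(new int[] {10, 100, 1000, 5000}, fakeJob)
                .forEach((size, ms) -> System.out.println("size=" + size + " took " + ms + " ms"));
    }
}
```

Running each size more than once, and in a different order, helps separate the effect of the batch size from warm-up or caching effects.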

Solution 2

I'm going to add another answer here, because I was very puzzled by this behaviour and it took me a long time to find the answer in the comments by @AaronM.

This applies to ES 1.7.2, using the Java API.

I was scrolling/scanning an index of 500m records, but with a query that returns about 400k rows.

I started off with a scroll size of 1,000 which seemed to me a reasonable trade-off in terms of network versus CPU.

This query ran terribly slowly, taking about 30 minutes to complete, with very long pauses between fetches from the cursor.

I worried that maybe it was just the query I was running and did not believe that decreasing the scroll size could help, as 1000 seemed tiny.

However, seeing AaronM's comment above, I tried a scroll size of 10.

The whole job completed in 30 seconds (and this was whether I had restarted ES or not, so presumably nothing to do with caching) - a speed-up of about 60x!!!

So if you're having performance problems with scroll/scan, I highly recommend that you try decreasing the scroll size. I couldn't find much about this on the internet, so I'm posting it here.

Solution 3

  • Query a data node, not a client node or master node
  • Select only the fields you need with the filter_path property
  • Set the scroll size according to your document size; there is no magic rule, so you have to pick a value, measure, and adjust
  • Monitor your network bandwidth
  • If that's not enough, try multithreading:

Keep in mind that an Elasticsearch index is composed of multiple shards. This design means you can parallelize operations.

Let's say your index has 3 shards and your cluster has 3 nodes (it is good practice to have more nodes than shards per index).

You could run 3 Java "workers", each in its own thread, with each worker scrolling a different shard on a different node, and use a queue to "centralize" the results.

This way you should get good performance.

This is what the elasticsearch-hadoop library does.

To retrieve shard and node details for an index, use the search shards API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shards.html
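The worker-and-queue idea can be sketched as follows. This is a minimal illustration under stated assumptions, not how elasticsearch-hadoop is implemented: ShardScroller, fakeShard and collectAll are hypothetical names, and the shards are simulated with in-memory lists so the code runs without a cluster. In real code each ShardScroller would wrap a scroll restricted to a single shard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ParallelShardScroll {

    // Hypothetical stand-in for "scroll one shard": an empty batch means the shard is exhausted.
    interface ShardScroller {
        List<String> nextBatch();
    }

    // In-memory fake shard so the sketch runs without a cluster.
    static ShardScroller fakeShard(List<String> docs, int batchSize) {
        return new ShardScroller() {
            private int pos = 0;

            @Override
            public List<String> nextBatch() {
                int end = Math.min(pos + batchSize, docs.size());
                List<String> batch = new ArrayList<>(docs.subList(pos, end));
                pos = end;
                return batch;
            }
        };
    }

    // Runs one worker thread per shard (expects at least one shard); each worker
    // pushes its hits onto a shared queue, which is copied out once all workers finish.
    static List<String> collectAll(List<ShardScroller> shards) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        for (ShardScroller shard : shards) {
            pool.submit(() -> {
                List<String> batch;
                while (!(batch = shard.nextBatch()).isEmpty()) {
                    queue.addAll(batch);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        List<ShardScroller> shards = Arrays.asList(
                fakeShard(Arrays.asList("a1", "a2", "a3"), 2),
                fakeShard(Arrays.asList("b1", "b2"), 2),
                fakeShard(Arrays.asList("c1"), 2));
        System.out.println(collectAll(shards).size() + " hits collected");
    }
}
```

For simplicity the sketch reads the queue only after all workers have finished. With a large result set you would more likely have a consumer thread take hits from the queue while the workers are still scrolling, so that results do not pile up in memory.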

Author: dranxo

Updated on June 23, 2022

Comments

  • dranxo
    dranxo almost 2 years

    I have a query returning ~200K hits from 7 different indices distributed across our cluster. I process my results as:

    while (true) {
        scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                .setScroll(new TimeValue(600000))
                .execute().actionGet();

        for (SearchHit hit : scrollResp.getHits()) {
            // process hit
        }

        // Break condition: no hits are returned
        if (scrollResp.getHits().getHits().length == 0) {
            break;
        }
    }
    

    I'm noticing that the client.prepareSearchScroll line can hang for quite some time before returning the next set of search hits. This seems to get worse the longer I run the code for.

    My setup for the search is:

    SearchRequestBuilder searchBuilder = client.prepareSearch(index_names)
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000)) // TimeValue?
        .setQuery(qb)
        .setFrom(0) // ?
        .setSize(5000); // number of JSON documents to get in each search; what should it be? I have no idea.
    SearchResponse scrollResp = searchBuilder.execute().actionGet();
    

    Is it expected that scanning and scrolling just take a long time when examining many results? I'm very new to Elasticsearch, so keep in mind that I may be missing something very obvious.

    My query:

    QueryBuilder qb = QueryBuilders.boolQuery().must(QueryBuilders.termsQuery("tweet", interesting_words));