How to prevent Elasticsearch from throttling indexing?


Solution 1

The setting that actually corresponds to the maxNumMerges in the log file is called index.merge.scheduler.max_merge_count. Increasing this along with index.merge.scheduler.max_thread_count (where max_thread_count <= max_merge_count) will increase the number of simultaneous merges which are allowed for segments within an individual index's shards.
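
Before changing anything, it can help to confirm which values an index is actually running with. A minimal sketch, assuming a node reachable on localhost:9200 and index_name as a placeholder; note that settings still at their defaults will not appear in the response:

# Show the settings explicitly set on the index (any merge scheduler/policy overrides included)
curl -XGET 'http://localhost:9200/index_name/_settings?pretty'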

If you have a very high indexing rate that produces many GB in a single index, you probably also want to revisit the assumptions the Elasticsearch defaults make about segment size. Try raising floor_segment (the minimum size before a segment is considered for merging), max_merged_segment (the maximum size of a single segment), and segments_per_tier (the number of segments of roughly equivalent size that accumulate before they are merged into a new tier). On an application with a high indexing rate and finished index sizes of roughly 120GB with 10 shards per index, we use the following settings:

# Assumes a node reachable on localhost:9200; replace index_name with your index
curl -XPUT 'http://localhost:9200/index_name/_settings' -d '
{
  "settings": {
    "index.merge.policy.max_merge_at_once": 10,
    "index.merge.scheduler.max_thread_count": 10,
    "index.merge.scheduler.max_merge_count": 10,
    "index.merge.policy.floor_segment": "100mb",
    "index.merge.policy.segments_per_tier": 25,
    "index.merge.policy.max_merged_segment": "10gb"
  }
}'

Also, one important thing you can do to improve recovery time after a node loss or restart on applications with high indexing rates is to take advantage of index recovery prioritization (available in ES >= 1.7). Tune this setting so that the indices receiving the most indexing activity are recovered first. As you may know, the "normal" shard initialization process simply copies the already-indexed segment files between nodes. However, if indexing activity occurs against a shard before or during initialization, the translog containing the new documents can become very large. In the scenario where merging goes through the roof during recovery, the replay of this translog against the shard is almost always the culprit. By using index recovery prioritization to recover those shards first and delay shards with less indexing activity, you minimize the eventual size of the translog, which dramatically improves recovery time.
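
For example, you can give the indices that take heavy writes a higher index.priority so they come back before quieter indices. A minimal sketch, assuming a node reachable on localhost:9200; the busy index name is taken from the logs in the question purely for illustration, some_archive_index is a hypothetical cold index, and the priority values are arbitrary (higher is recovered first):

# Recover the heavily-written index first after a node failure or restart
curl -XPUT 'http://localhost:9200/siphon_20150313/_settings' -d '
{
  "index.priority": 10
}'

# A colder index can wait its turn
curl -XPUT 'http://localhost:9200/some_archive_index/_settings' -d '
{
  "index.priority": 1
}'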

Solution 2

We are using 1.7 and noticed a similar problem: indexing was getting throttled even though the I/O was not saturated (Fusion IO in our case).

After increasing "index.merge.scheduler.max_thread_count", the problem seems to be gone -- we have not seen any more throttling logged so far.

I would try setting "index.merge.scheduler.max_thread_count" to at least the max reported numMergesInFlight (6 in the logs above).

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/index-modules-merge.html#scheduling
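
Depending on the exact version, this may be settable through the index settings API, or it may need to go into elasticsearch.yml and take effect after a restart. A minimal sketch of the dynamic form, assuming a node reachable on localhost:9200 and index_name as a placeholder:

# Allow at least as many concurrent merge threads as the merges reported in flight (6 in the logs above)
curl -XPUT 'http://localhost:9200/index_name/_settings' -d '
{
  "index.merge.scheduler.max_thread_count": 6
}'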

Hope this helps!


Comments

  • grouma

    I have a 40-node Elasticsearch cluster that is hammered by a high index request rate. Each of these nodes makes use of an SSD for the best performance. As suggested by several sources, I have tried to prevent index throttling with the following configuration:

    indices.store.throttle.type: none
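
    (For what it's worth, the same setting can also be applied at runtime through the cluster settings API instead of elasticsearch.yml; a minimal sketch, assuming a node reachable on localhost:9200:)

    # Disable store-level throttling cluster-wide without a restart ("transient" is lost on a full cluster restart)
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
    {
      "transient": {
        "indices.store.throttle.type": "none"
      }
    }'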
    

    Unfortunately, I'm still seeing performance issues, as the cluster still periodically throttles indexing. This is confirmed by the following logs:

    [2015-03-13 00:03:12,803][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphonaudit_20150313][19] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
    [2015-03-13 00:03:12,829][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphonaudit_20150313][19] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
    [2015-03-13 00:03:13,804][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphonaudit_20150313][19] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
    [2015-03-13 00:03:13,818][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphonaudit_20150313][19] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
    [2015-03-13 00:05:00,791][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphon_20150313][6] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
    [2015-03-13 00:05:00,808][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphon_20150313][6] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
    [2015-03-13 00:06:00,861][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphon_20150313][6] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
    [2015-03-13 00:06:00,879][INFO ][index.engine.internal    ] [CO3SCH010160941] [siphon_20150313][6] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
    

    The throttling occurs after one of the 40 nodes dies for various expected reasons. The cluster immediately enters a yellow state, in which a number of shards will begin initializing on the remaining nodes.

    Any idea why the cluster continues to throttle after explicitly configuring it not to? Any other suggestions to have the cluster more quickly return to a green state after a node failure?