Elasticsearch is using way too much disk space

35,858

Elasticsearch does not shrink your data automagically; that is true for any database. Besides storing the raw data, every database also stores metadata alongside it. Conventional databases index only the columns the DB admin chose up front (for faster search). Elasticsearch is different: it indexes every field by default. That makes the index very large, but in return gives excellent performance when retrieving data.

In typical configurations you see a 4- to 6-fold increase over the raw data after indexing, although it heavily depends on the actual data. This is intended behavior.
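The figures in the question below actually land near that range: 622 MB of index data for 173 MB of raw logs is roughly a 3.6× expansion.

```python
# Expansion ratio from the du(1) figures quoted in the question below.
index_size_mb = 622  # total of the logstash-2014.10.07/ index directory
raw_logs_mb = 173    # total of the raw td-agent log files
ratio = index_size_mb / raw_logs_mb
print(f"expansion: {ratio:.1f}x")  # prints "expansion: 3.6x"
```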

So to decrease the database size, you have to go the opposite way from what you would do in an RDBMS: exclude fields you don't need from being indexed or stored.
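As a sketch of what that looks like in the 1.x mapping API (the type name and field name here are invented for illustration), the JSON body of an index-creation request can disable the catch-all `_all` field and mark a field as not indexed:

```json
{
  "mappings": {
    "logs": {
      "_all": { "enabled": false },
      "properties": {
        "raw_message": { "type": "string", "index": "no" }
      }
    }
  }
}
```

With `"index": "no"` the field is still returned from `_source` but costs no index space; fields you never even display can additionally be excluded from `_source`.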

Additionally, you could turn on compression, but that only helps when your "documents" are large, which is probably not the case for log file entries.
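For reference, the explicit store-compression switch dates from the 0.19 era and, as far as I know, became redundant once Lucene started compressing stored fields by default; on older versions you would have set it in `elasticsearch.yml` like this:

```yaml
# Legacy setting (introduced in 0.19); stored-field compression is
# enabled by default in recent versions, so this is normally a no-op on 1.x.
index.store.compress.stored: true
```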

There are some comparisons and useful tips here: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk

But remember: searching comes at a cost, and the cost is disk space; in return you gain flexibility. If you outgrow your storage, scale out horizontally. That is where Elasticsearch wins.

Author: mac

Updated on September 18, 2022

Comments

  • mac
    mac over 1 year

    I have a CentOS 6.5 server on which I installed Elasticsearch 1.3.2.

    My elasticsearch.yml configuration file is a minimally modified version of the default one shipped with Elasticsearch. Stripped of all commented lines, it looks like:

    cluster.name: xxx-kibana
    
    node:
        name: "xxx"
        master: true
        data: true
    
    index.number_of_shards: 5
    
    index.number_of_replicas: 1
    
    path:
        logs: /log/elasticsearch/log
        data: /log/elasticsearch/data
    
    
    transport.tcp.port: 9300
    
    http.port: 9200
    
    discovery.zen.ping.multicast.enabled: false
    

    Elasticsearch should have compression ON by default, and I have read various benchmarks putting the compression ratio anywhere from 50% to 95%. Unluckily, the compression ratio in my case is -400%, or in other words: data stored in ES takes 4 times as much disk space as the text files with the same content. See:

    12K     logstash-2014.10.07/2/translog
    16K     logstash-2014.10.07/2/_state
    116M    logstash-2014.10.07/2/index
    116M    logstash-2014.10.07/2
    12K     logstash-2014.10.07/4/translog
    16K     logstash-2014.10.07/4/_state
    127M    logstash-2014.10.07/4/index
    127M    logstash-2014.10.07/4
    12K     logstash-2014.10.07/0/translog
    16K     logstash-2014.10.07/0/_state
    109M    logstash-2014.10.07/0/index
    109M    logstash-2014.10.07/0
    16K     logstash-2014.10.07/_state
    12K     logstash-2014.10.07/1/translog
    16K     logstash-2014.10.07/1/_state
    153M    logstash-2014.10.07/1/index
    153M    logstash-2014.10.07/1
    12K     logstash-2014.10.07/3/translog
    16K     logstash-2014.10.07/3/_state
    119M    logstash-2014.10.07/3/index
    119M    logstash-2014.10.07/3
    622M    logstash-2014.10.07/  # <-- This is the total!
    

    versus:

    6,3M    /var/log/td-agent/legacy_api.20141007_0.log
    8,0M    /var/log/td-agent/legacy_api.20141007_10.log
    7,6M    /var/log/td-agent/legacy_api.20141007_11.log
    6,7M    /var/log/td-agent/legacy_api.20141007_12.log
    8,0M    /var/log/td-agent/legacy_api.20141007_13.log
    7,6M    /var/log/td-agent/legacy_api.20141007_14.log
    7,6M    /var/log/td-agent/legacy_api.20141007_15.log
    7,7M    /var/log/td-agent/legacy_api.20141007_16.log
    5,6M    /var/log/td-agent/legacy_api.20141007_17.log
    7,9M    /var/log/td-agent/legacy_api.20141007_18.log
    6,3M    /var/log/td-agent/legacy_api.20141007_19.log
    7,8M    /var/log/td-agent/legacy_api.20141007_1.log
    7,1M    /var/log/td-agent/legacy_api.20141007_20.log
    8,0M    /var/log/td-agent/legacy_api.20141007_21.log
    7,2M    /var/log/td-agent/legacy_api.20141007_22.log
    3,8M    /var/log/td-agent/legacy_api.20141007_23.log
    7,5M    /var/log/td-agent/legacy_api.20141007_2.log
    7,3M    /var/log/td-agent/legacy_api.20141007_3.log
    8,0M    /var/log/td-agent/legacy_api.20141007_4.log
    7,5M    /var/log/td-agent/legacy_api.20141007_5.log
    7,5M    /var/log/td-agent/legacy_api.20141007_6.log
    7,8M    /var/log/td-agent/legacy_api.20141007_7.log
    7,8M    /var/log/td-agent/legacy_api.20141007_8.log
    7,2M    /var/log/td-agent/legacy_api.20141007_9.log
    173M    total
    

    What am I doing wrong? Why is data not being compressed?

    I have provisionally added index.store.compress.stored: 1 to my configuration file, as I found it in the Elasticsearch 0.19.5 release notes (that is when store compression first came out), but I can't yet tell whether it is making a difference, and anyhow compression should be ON by default nowadays...

    • mailq
      mailq over 9 years
      Did you ever consider the overhead it takes to store and index that data? This is where the difference comes from.
    • mac
      mac over 9 years
      @mailq - AFAIK, Elastic compresses both data and indices, and you still should notice a decrease in space usage on your disk, compared to text logs. I assume mileage may vary according to log structure, but logs are typically very repetitive in nature, so indexing shouldn't be the most space-consuming of operations. ...or am I getting this wrong?
    • mailq
      mailq over 9 years
      Logs are not really repetitive. User A logs in at time 1. User B logs in at time 2. What is repetitive? Both tuples have to be indexed and stored separately. In addition to the log entry itself.
    • mailq
      mailq over 9 years
    • mac
      mac over 9 years
      @mailq - Supercool mailq, thank you a ton. If you expand your comment into a proper answer, I'd be glad to mark it as accepted (otherwise I will do it later on, but I don't want to steal your thunder!).