Elasticsearch is using way too much disk space
Elasticsearch does not shrink your data automagically. This is true of any database: besides the raw data, each database has to store metadata alongside it. Conventional databases only build an index (for faster search) on the columns the DB admin chose upfront. Elasticsearch is different in that it indexes every field by default. That makes the index extremely large, but in return gives excellent performance when retrieving data.
In normal configurations you see the raw data grow by a factor of 4 to 6 after indexing, although this depends heavily on the actual data. This is intended behavior.
So to decrease the database size, you have to do the same thing you would in an RDBMS: exclude the fields you do not need from being indexed or stored.
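As a sketch (the type name and field names here are hypothetical, using the 1.x-era mapping API), you could disable the `_all` field and turn off indexing for fields you only need returned, not searched:

```json
{
  "mappings": {
    "logs": {
      "_all": { "enabled": false },
      "properties": {
        "message":   { "type": "string", "index": "no" },
        "client_ip": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

"index": "no" skips indexing entirely (the field is still returned from `_source`), while "not_analyzed" indexes the raw value without tokenizing it, which is considerably cheaper than full analysis.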
Additionally, you could turn on compression, but this only helps when your "documents" are large, which is probably not the case for log file entries.
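For completeness: later Elasticsearch releases (2.0 and up, so not the 1.x series discussed here) also let you trade CPU time for disk space with a stored-field codec setting, applied at index creation time. A minimal sketch from the later documentation:

```json
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```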
There are some comparisons and useful tips here: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk
But remember: searching comes at a cost, and the cost is disk space; in exchange you gain flexibility. If you outgrow your storage, grow horizontally! This is where Elasticsearch wins.
mac
Updated on September 18, 2022

Comments
-
mac over 1 year
I have a CentOS 6.5 server on which I installed Elasticsearch 1.3.2.
My elasticsearch.yml configuration file is a minimal modification of the default one shipped with Elasticsearch. Stripped of all commented lines, it looks like:

cluster.name: xxx-kibana
node:
  name: "xxx"
  master: true
  data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path:
  logs: /log/elasticsearch/log
  data: /log/elasticsearch/data
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.multicast.enabled: false
Elasticsearch should have compression ON by default, and I have read various benchmarks putting the compression ratio anywhere from 50% to 95%. Unluckily, the compression ratio in my case is -400%, or in other words: data stored in ES takes 4 times as much disk space as the text file with the same content. See:
12K   logstash-2014.10.07/2/translog
16K   logstash-2014.10.07/2/_state
116M  logstash-2014.10.07/2/index
116M  logstash-2014.10.07/2
12K   logstash-2014.10.07/4/translog
16K   logstash-2014.10.07/4/_state
127M  logstash-2014.10.07/4/index
127M  logstash-2014.10.07/4
12K   logstash-2014.10.07/0/translog
16K   logstash-2014.10.07/0/_state
109M  logstash-2014.10.07/0/index
109M  logstash-2014.10.07/0
16K   logstash-2014.10.07/_state
12K   logstash-2014.10.07/1/translog
16K   logstash-2014.10.07/1/_state
153M  logstash-2014.10.07/1/index
153M  logstash-2014.10.07/1
12K   logstash-2014.10.07/3/translog
16K   logstash-2014.10.07/3/_state
119M  logstash-2014.10.07/3/index
119M  logstash-2014.10.07/3
622M  logstash-2014.10.07/    # <-- This is the total!
versus:
6,3M  /var/log/td-agent/legacy_api.20141007_0.log
8,0M  /var/log/td-agent/legacy_api.20141007_10.log
7,6M  /var/log/td-agent/legacy_api.20141007_11.log
6,7M  /var/log/td-agent/legacy_api.20141007_12.log
8,0M  /var/log/td-agent/legacy_api.20141007_13.log
7,6M  /var/log/td-agent/legacy_api.20141007_14.log
7,6M  /var/log/td-agent/legacy_api.20141007_15.log
7,7M  /var/log/td-agent/legacy_api.20141007_16.log
5,6M  /var/log/td-agent/legacy_api.20141007_17.log
7,9M  /var/log/td-agent/legacy_api.20141007_18.log
6,3M  /var/log/td-agent/legacy_api.20141007_19.log
7,8M  /var/log/td-agent/legacy_api.20141007_1.log
7,1M  /var/log/td-agent/legacy_api.20141007_20.log
8,0M  /var/log/td-agent/legacy_api.20141007_21.log
7,2M  /var/log/td-agent/legacy_api.20141007_22.log
3,8M  /var/log/td-agent/legacy_api.20141007_23.log
7,5M  /var/log/td-agent/legacy_api.20141007_2.log
7,3M  /var/log/td-agent/legacy_api.20141007_3.log
8,0M  /var/log/td-agent/legacy_api.20141007_4.log
7,5M  /var/log/td-agent/legacy_api.20141007_5.log
7,5M  /var/log/td-agent/legacy_api.20141007_6.log
7,8M  /var/log/td-agent/legacy_api.20141007_7.log
7,8M  /var/log/td-agent/legacy_api.20141007_8.log
7,2M  /var/log/td-agent/legacy_api.20141007_9.log
173M  total
What am I doing wrong? Why is data not being compressed?
I have provisionally added

index.store.compress.stored: 1

to my configuration file, as I found it in the elasticsearch 0.19.5 release notes (that's when store compression first came out), but I'm not yet able to tell if it is making a difference, and anyhow compression should be ON by default nowadays...
-
mailq over 9 years: Did you ever consider the overhead it takes to store and index that data? This is where the difference comes from.
-
mac over 9 years: @mailq - AFAIK, Elastic compresses both data and indices, and you still should notice a decrease in space usage on your disk, compared to text logs. I assume mileage may vary according to log structure, but logs are typically very repetitive in nature, so indexing shouldn't be the most space-consuming of operations. ...or am I getting this wrong?
-
mailq over 9 years: Logs are not really repetitive. User A logs in at time 1. User B logs in at time 2. What is repetitive? Both tuples have to be indexed and stored separately, in addition to the log entry itself.
-
mailq over 9 years: Try those recommendations. github.com/jordansissel/experiments/tree/master/elasticsearch/…
-
mac over 9 years: @mailq - Super cool mailq, thank you a ton. If you expand on your comment and write a proper answer, I'd be glad to mark it as accepted (otherwise I will do it later on, but I don't want to steal your thunder!).