Auto compact the deleted space in mongodb?


Solution 1

In general, if you don't need to shrink your datafiles, you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation, and the more space MongoDB can allocate in datafiles, the less fragmentation you will have.

So, you should try to provide as much disk space as possible for the database.

However, if you must shrink the database, you should keep two things in mind.

  1. MongoDB grows its data files by doubling, so the datafiles may be 64MB, then 128MB, etc., up to 2GB (at which point it stops doubling and caps each file at 2GB).

  2. As with almost any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So, you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.

Server-side JavaScript

You can use server-side JavaScript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service) ...

Assuming a database called foo containing a collection called foo, you would save the JavaScript below into a file called bar.js and run ...

$ mongo foo bar.js

The JavaScript file would look something like ...

// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();

print('Storage Size: ' + tojson(storage));

print('TotalSize: ' + tojson(total));

print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');

// Run repair
db.repairDatabase()

// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();

print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));

This will run and return something like ...

MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153

Run this on a schedule (during off-peak hours) and you are good to go.
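
For example, a crontab entry along these lines would run the shrink every Sunday at 3am (the mongo binary path, database name, and log path here are only examples to adjust for your environment):

# Run the shrink script against the foo database every Sunday at 3am.
0 3 * * 0 /usr/bin/mongo foo /path/to/bar.js >> /var/log/mongo_shrink.log 2>&1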

Capped Collections

However, there is one other option: capped collections.

Capped collections are fixed sized collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.

In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.

Basically, you can limit the size of (or the number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.

This is a great way to keep a large amount of data, discarding the older data as time goes by while keeping the amount of disk space used constant.
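
As a rough sketch of how that looks from the mongo shell (the collection names and sizes here are only placeholders):

// A 20GB capped collection; the oldest documents are discarded automatically
// once the size limit is reached.
db.createCollection("downloads", { capped: true, size: 20 * 1024 * 1024 * 1024 });

// You can also cap the document count in addition to the byte size.
db.createCollection("recent_logs", { capped: true, size: 100 * 1024 * 1024, max: 100000 });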

Solution 2

I have another solution that might work better than doing db.repairDatabase() if you can't afford to have the system locked, or don't have double the storage available.

You must be using a replica set.

My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.

The process is time-consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().

Also, this cannot be automated. Well, it could, but I don't think I'm willing to try.
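
For reference, the replica-set side of that sequence from the mongo shell might look like the sketch below; the OS-level stop/wipe/start in between can be done by hand or scripted as in Solution 4.

// Only needed when the member you are about to resync is currently the primary:
// hand the primary role to another member first.
rs.stepDown();

// After stopping mongod on that member, wiping its dbpath and starting it again,
// watch it move from STARTUP2 back to SECONDARY as the initial sync runs.
rs.status().members.forEach(function(m) { print(m.name + ' : ' + m.stateStr); });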

Solution 3

Running db.repairDatabase() requires that you have free space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left, or the data you need to retain, would use much less space than what is currently allocated and you do not have enough space to perform the repair.

As an alternative, if there are only a few collections you actually need to retain, or you only want a subset of the data, you can move the data you need to keep into a new database and drop the old one. If you need the same database name, you can then move it back into a fresh db by the same name. Just make sure you recreate any indexes.

// Step 1: make sure the scratch database is empty.
use cleanup_database
db.dropDatabase();

// Step 2: copy the documents you want to keep into the scratch database.
use oversize_database

db.collection.find({},{}).forEach(function(doc){
    db.getSiblingDB("cleanup_database").collection_subset.insert(doc);
});

// Step 3: drop the bloated database to release its datafiles.
use oversize_database
db.dropDatabase();

// Step 4: copy the documents back into a fresh database with the original name.
use cleanup_database

db.collection_subset.find({},{}).forEach(function(doc){
    db.getSiblingDB("oversize_database").collection.insert(doc);
});

// Step 5: recreate any indexes, for example:
use oversize_database
db.collection.ensureIndex({field:1});

// Step 6: drop the scratch database now that everything is back in place.
use cleanup_database
db.dropDatabase();

An export/drop/import operation for databases with many collections would likely achieve the same result, but I have not tested it.
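
Untested here as well, but with the standard tools that route would look roughly like this (the dump path is just an example):

# Dump the database, drop it to release the bloated datafiles, then restore a fresh copy.
mongodump --db oversize_database --out /tmp/dump
mongo oversize_database --eval 'db.dropDatabase()'
mongorestore --db oversize_database /tmp/dump/oversize_database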

Also, as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except the indexes would be lost, and your db and collections will be recreated when the inserts for the processes next run. Just make sure your jobs include creating any necessary indexes at an appropriate time.
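
A sketch of that policy, with hypothetical database and collection names:

// Transient job data lives in its own database ...
var work = db.getSiblingDB("processing_db");
work.tasks.insert({url: "http://example.com/file.bin", state: "queued"});
work.tasks.ensureIndex({state: 1});

// ... and once the jobs complete, dropping the whole database returns the disk space.
work.dropDatabase();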

Solution 4

If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.

To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.

By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
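
To sanity-check the oplog window before taking a member down for a resync, the standard shell helper prints the configured oplog size and the time span it currently covers:

// Run against the primary; "log length start to end" is the window a resyncing
// member has to finish within.
db.printReplicationInfo();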

To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:

NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES

#!/bin/bash

# First arg is the host MongoDB is running on, second arg is the MongoDB port

MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath

# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host "$MONGOHOST" --port "$MONGOPORT" --eval 'db.isMaster().ismaster')" = "true" ]
do
    $MONGO --quiet --host "$MONGOHOST" --port "$MONGOPORT" --eval 'rs.stepDown()'
    sleep 2
done
echo "Node is no longer primary!"

# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@"$MONGOHOST" sudo service mongodb stop

# Wipe the data files for that server

ssh -t user@"$MONGOHOST" sudo rm -rf "$DBPATH"
ssh -t user@"$MONGOHOST" sudo mkdir "$DBPATH"
ssh -t user@"$MONGOHOST" sudo chown mongodb:mongodb "$DBPATH"

# Start up server again
# similar to shutdown something like
ssh -t user@"$MONGOHOST" sudo service mongodb start
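
Saved as, say, resync_member.sh (the script name and hostname below are placeholders), you would then invoke it per member like:

$ ./resync_member.sh mongo-secondary-1.example.com 27017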

Comments

  • Zealot Ke
    Zealot Ke almost 2 years

    The MongoDB documentation says that

    To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).

    in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space

    I wonder how to make MongoDB free deleted disk space automatically?

    p.s. We stored many download tasks in MongoDB, up to 20GB, and finished them within half an hour.

  • Zealot Ke
    Zealot Ke over 13 years
    Thanks for the great post. If I don't shrink the datafiles, mongod will keep using a lot of memory. How could I solve that?
  • Justin Jenkins
    Justin Jenkins over 13 years
    @Zealot ... See my answer on memory use, it might be helpful. stackoverflow.com/questions/4468873/…
  • Zealot Ke
    Zealot Ke over 13 years
    I got it. We have 16GB of memory and MongoDB uses 4GB, so I may not need to care about it. Thank you for these answers.
  • Justin Jenkins
    Justin Jenkins over 12 years
    Note, while there is still no "auto" compact as of 1.9 there is a "compact" feature which can be used per collection: mongodb.org/display/DOCS/compact+Command
  • tcbcw
    tcbcw over 10 years
    Thank you. This works great for replica sets and was exactly what we needed for a replica set that ran out of space.
  • Keeth
    Keeth over 9 years
    This should be the top answer. It is simple and works in a real-world deployment.
  • Gary
    Gary over 8 years
    @JustinJenkins As of mongo 3.0, the compact command only frees disk space when using the WiredTiger storage engine. docs.mongodb.org/manual/reference/command/compact/#disk-space
  • Gary
    Gary over 8 years
    @JustinJenkins The example script you have here is only printing the size of one collection (foo) within the database. Seems you could instead use the dataSize, storageSize, and fileSize fields from a db.stats() result to print the same stats for the entire database. This has the added bonus of not having any db/collection name hard-coded in the script.
  • scho
    scho almost 8 years
    Be aware that replication from scratch does not work if the oplog size is too small (or you have a lot of data). Then the initial syncing takes longer than the oplog's time span and replication stops somewhere in between.
  • scho
    scho almost 8 years
    The situation in my comment is described here: stackoverflow.com/questions/30250872/…