Which NoSQL database should I use for logging?
Solution 1
I've decided to revise this accepted answer as the state of the art has moved significantly in the last 18 months, and much better alternatives exist.
New Answer
MongoDB is a sub-par choice for a scalable logging solution. The usual reasons apply (poor write performance under load, for example). I'd like to put forward one more: it solves only a single use case in a logging solution.
A strong logging solution needs to cover at least the following stages:
- Collection
- Transport
- Processing
- Storage
- Search
- Visualisation
MongoDB as a choice only solves the Storage use case (albeit somewhat poorly). Once the complete chain is analysed, there are more appropriate solutions.
@KazukiOhta mentions a few options. My preferred end-to-end solution these days involves:
- Logstash-Forwarder for Collection & Transport
- Logstash & Riemann for Processing
- ElasticSearch for Storage & Queries
- Kibana3 for Visualisation
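As a sketch of how these pieces fit together, a minimal Logstash pipeline for this chain might look like the following. This is illustrative only: the lumberjack input is the protocol Logstash-Forwarder speaks, but the port, certificate paths, and grok pattern are all assumptions.

```
# Illustrative Logstash pipeline: receive from Logstash-Forwarder,
# parse, and index into Elasticsearch (ports and paths are assumptions).
input {
  lumberjack {
    port => 5043
    ssl_certificate => "/etc/pki/logstash.crt"
    ssl_key => "/etc/pki/logstash.key"
  }
}
filter {
  grok {
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}
output {
  elasticsearch {
    host => "localhost"
  }
}
```

Kibana3 then points at the same Elasticsearch instance for search and visualisation.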
Using ElasticSearch for log data storage puts the current best-of-breed NoSQL solution behind the logging and searching use case. The fact that Logstash-Forwarder, Logstash, ElasticSearch, and Kibana3 all sit under the Elasticsearch umbrella makes for an even more compelling argument.
Since Logstash can also act as a Graphite proxy, a very similar chain can be built for the associated problem of collecting and analysing metrics (not just logs).
Old Answer
MongoDB Capped Collections are extremely popular and suitable for logging, with the added bonus of being 'schema less', which is usually a semantic fit for logging. Often we only know what we want to log well into a project, or after certain issues have been found in production. Relational databases or strict schemas tend to be difficult to change in these cases, and attempts to make them 'flexible' tend just to make them 'slow' and difficult to use or understand.
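For illustration, a capped collection can be created from the mongo shell as below. The collection name, size, and example document are arbitrary, and a running MongoDB is assumed; this is a sketch, not a recommended configuration.

```
// Create a 100 MB capped collection for logs; once full, the oldest
// documents are overwritten automatically (size is in bytes).
db.createCollection("app_log", { capped: true, size: 100 * 1024 * 1024 })

// Inserts need no predefined schema; fields can vary per document:
db.app_log.insert({ ts: new Date(), level: "error", msg: "disk full", host: "web1" })
```

The automatic overwrite behaviour is what makes capped collections attractive for logs: there is no separate purge job to run.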
But if you'd like to manage your logs in the dark, have lasers going, and make it look like you're from space, there's always Graylog2, which uses MongoDB as part of its overall infrastructure but provides a whole lot more on top, such as a common, extensible format, a dedicated log collection server, a distributed architecture, and a funky UI.
Solution 2
I've seen a lot of companies using MongoDB to store application logs. Its schema-freeness is really flexible for application logs, whose schema tends to change from time to time. Also, its capped collection feature is really useful because it automatically purges old data, keeping the data set small enough to fit in memory.
People aggregate the logs with normal grouping or MapReduce, but neither is that fast. In particular, MongoDB's MapReduce runs in a single thread, and its JavaScript execution overhead is huge. The newer aggregation framework can alleviate this problem.
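As a sketch of what the aggregation framework computes, the pipeline below counts log entries per level; a pure-Python equivalent is included to show the semantics. The collection name and field names are assumptions for illustration.

```python
from collections import Counter

# An aggregation pipeline that counts log documents per "level" field;
# with pymongo this would be run as db.app_log.aggregate(pipeline).
pipeline = [
    {"$group": {"_id": "$level", "count": {"$sum": 1}}},
]

def group_count(docs, field="level"):
    """Pure-Python equivalent of the $group/$sum stage above."""
    counts = Counter(d.get(field) for d in docs)
    return [{"_id": level, "count": n} for level, n in counts.items()]

logs = [
    {"level": "info", "msg": "started"},
    {"level": "error", "msg": "disk full"},
    {"level": "info", "msg": "request served"},
]
print(group_count(logs))  # → [{'_id': 'info', 'count': 2}, {'_id': 'error', 'count': 1}]
```

Unlike MapReduce, the aggregation framework runs natively in the server rather than through a JavaScript interpreter, which is where the speedup comes from.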
When you use MongoDB for logging, the main concern is lock contention under high write throughput. Although MongoDB's inserts are fire-and-forget by default, calling insert() at a high rate causes heavy write lock contention. This can affect application performance and prevent readers from aggregating or filtering the stored logs.
One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are launched on every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so they are a lot more efficient than writing directly from the apps. This link describes how to put logs into Fluentd from a PHP program.
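The buffering-and-batching idea these collectors implement can be sketched in a few lines of Python. The class and the sink below are made up for illustration (real collectors add persistence, retries, and time-based flushing); the flush callback stands in for a bulk insert into MongoDB or another store.

```python
import threading

class BatchLogBuffer:
    """Buffers log records and flushes them to a sink in batches,
    mimicking what Fluentd/Logstash-style collectors do: the app
    appends cheaply, and bulk writes hit the datastore less often."""

    def __init__(self, flush_fn, batch_size=100):
        self._flush_fn = flush_fn      # e.g. a bulk insert into MongoDB
        self._batch_size = batch_size
        self._buffer = []
        self._lock = threading.Lock()

    def log(self, record):
        with self._lock:
            self._buffer.append(record)
            if len(self._buffer) >= self._batch_size:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []

# Usage with a fake sink standing in for a bulk datastore write:
batches = []
buf = BatchLogBuffer(batches.append, batch_size=2)
buf.log({"level": "info", "msg": "a"})
buf.log({"level": "info", "msg": "b"})   # reaches batch_size, flushes 2 records
buf.log({"level": "error", "msg": "c"})
buf.flush()                              # flushes the remaining record
print(len(batches))  # → 2
```

The key point is that lock contention on the datastore is paid once per batch rather than once per log record.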
Here are some tutorials about MongoDB + Fluentd.
- Fluentd + MongoDB: The Easiest Way to Log Your Data Effectively on 10gen blog
- Fluentd: Store Apache Logs into MongoDB
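As a sketch of what those tutorials set up, a Fluentd configuration using fluent-plugin-mongo to tail a JSON log file and batch it into MongoDB might look like this; the paths, tag, and database/collection names are assumptions for illustration.

```
# Illustrative Fluentd config (fluent-plugin-mongo): tail an app log
# and write it to MongoDB in buffered batches.
<source>
  type tail
  path /var/log/app/app.log
  format json
  tag app.log
</source>

<match app.**>
  type mongo
  host localhost
  port 27017
  database logs
  collection app
  flush_interval 10s
</match>
```

Because the output plugin is just configuration, swapping MongoDB for another backend later means changing the match block, not the application code.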
MongoDB's problem is that it starts slowing down when the data volume exceeds the available memory. At that point, you can switch to other solutions like Apache Hadoop or Cassandra. If you have a distributed logging layer as mentioned above, you can switch to another solution instantly as you grow. This tutorial describes how to store logs in HDFS using Fluentd.
ikrain
Updated on July 09, 2022
Comments
-
ikrain almost 2 years
Do you have any experience logging to NoSQL databases for scalable apps? I have done some research on NoSQL databases for logging and found that MongoDB seems to be a good choice. Also, I found log4mongo-net which seems to be a very straightforward option.
Would you recommend this kind of approach? Are there any other suggestions?
-
Matt Zukowski almost 11 years
Just as a word of warning, we've run into serious issues with MongoDB when writing more than a few thousand events per second to log collections. MongoDB's lackluster write performance may be the culprit.
-
L0j1k almost 10 years
About Graylog2, please be advised: "All running on the existing JVM in your datacenter." If you miss this, you won't see anything until you're looking in the third or fourth paragraph of the installation instructions of the download package ("You also must use Java 7!"). I always think it's kind of funny how Java-based projects conveniently forget to mention they're Java-based projects when selling themselves. Just IMO.
-
uylmz almost 10 years
Is the answer applicable to both: 1) business-level logs (which shouldn't be lost and must be durable), and 2) statistical logs like website access logs, where losing some records does not really matter?
-
yamen over 9 years
There are some examples of Elasticsearch losing data, although we haven't seen that happening in our implementations. If true audit capability is required, I would use a dual approach of saving the raw files off to (say) HDFS for archiving, and also sending them to Elasticsearch for real-time indexing and search. The archive can be used to reconcile or reload the Elasticsearch index if required.
-
jonathancardoso over 8 years
You should put a disclaimer at the top that you are affiliated with ElasticSearch. At least it looks like it.
-
Patrissol Kenfack over 2 years
Thank you very very much. It's really helpful.