Which NoSQL database should I use for logging?

Solution 1

I've decided to revise this accepted answer as the state of the art has moved significantly in the last 18 months, and much better alternatives exist.

New Answer

MongoDB is a sub-par choice for a scalable logging solution. There are the usual reasons for this (write performance under load for example). I'd like to put forward one more, which is that it only solves a single use case in a logging solution.

A strong logging solution needs to cover at least the following stages:

  • Collection
  • Transport
  • Processing
  • Storage
  • Search
  • Visualisation

MongoDB as a choice only solves the Storage use case (albeit somewhat poorly). Once the complete chain is analysed, there are more appropriate solutions.

@KazukiOhta mentions a few options. My preferred end-to-end solution these days involves:

  • Logstash-Forwarder for collection
  • Logstash for transport and processing
  • ElasticSearch for storage and search
  • Kibana3 for visualisation

Using ElasticSearch as the underlying store for log data means building on the current best-of-breed NoSQL solution for the logging and search use case. The fact that Logstash-Forwarder, Logstash, ElasticSearch and Kibana3 are all under the ElasticSearch umbrella makes for an even more compelling argument.
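To make the storage/search stage concrete, here is a minimal sketch, assuming the official `elasticsearch` Python client (8.x API) and a node on localhost; the index naming and event fields are illustrative only:

```python
# Minimal sketch: index a structured log event into Elasticsearch
# for later searching/visualisation (e.g. via Kibana). Assumes the
# official `elasticsearch` Python client (8.x) and a local node.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "ERROR",
    "message": "payment gateway timeout",
    "host": "web-03",
}

# Logstash-style daily index, so old indices can be dropped cheaply.
index = "logstash-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
es.index(index=index, document=event)

# Full-text search across all stored log indices.
hits = es.search(index="logstash-*", query={"match": {"message": "timeout"}})
print(hits["hits"]["total"])
```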

Since Logstash can also act as a Graphite proxy, a very similar chain can be built for the associated problem of collecting and analysing metrics (not just logs).

Old Answer

MongoDB Capped Collections are extremely popular and suitable for logging, with the added bonus of being 'schema less', which is usually a semantic fit for logging. Often we only know what we want to log well into a project, or after certain issues have been found in production. Relational databases or strict schemas tend to be difficult to change in these cases, and attempts to make them 'flexible' tend just to make them 'slow' and difficult to use or understand.
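As a minimal sketch of this pattern, assuming pymongo and a local mongod (the collection name, size, and event fields below are illustrative only):

```python
# Minimal sketch: a MongoDB capped collection as a fixed-size,
# insertion-ordered log store. Assumes pymongo and a local mongod.
from datetime import datetime, timezone

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["app"]

# Create once: a 100 MB capped collection; the oldest documents are
# overwritten automatically when the size limit is reached.
if "logs" not in db.list_collection_names():
    db.create_collection("logs", capped=True, size=100 * 1024 * 1024)

# 'Schema less': each event can carry whatever fields make sense.
db.logs.insert_one({
    "ts": datetime.now(timezone.utc),
    "level": "WARN",
    "message": "cache miss rate above threshold",
    "extra": {"rate": 0.42},
})

# Natural order is insertion order in a capped collection, so the
# most recent events come back with a reverse $natural sort.
for doc in db.logs.find().sort("$natural", -1).limit(5):
    print(doc["level"], doc["message"])
```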

But if you'd like to manage your logs in the dark, have lasers going, and make it look like you're from space, there's always Graylog2. It uses MongoDB as part of its overall infrastructure but provides a whole lot more on top, such as a common, extensible format, a dedicated log collection server, a distributed architecture and a funky UI.

Solution 2

I've seen a lot of companies using MongoDB to store application logs. Its schemalessness is really flexible for application logs, whose schema tends to change from time to time. Also, its Capped Collection feature is really useful because it automatically purges old data, keeping the working set in memory.

People aggregate the logs with normal grouping or MapReduce, but it's not that fast. In particular, MongoDB's MapReduce only works within a single thread, and its JavaScript execution overhead is huge. The newer aggregation framework could solve this problem.
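For example, a per-level event count, which old-style MapReduce would have run as single-threaded JavaScript, can be expressed as a native aggregation pipeline. A minimal pymongo sketch (field names are illustrative):

```python
# Minimal sketch: counting log events per level with MongoDB's
# aggregation framework, which runs natively (no JavaScript) and is
# typically much faster than the old map-reduce. Assumes pymongo.
from pymongo import MongoClient

logs = MongoClient("mongodb://localhost:27017")["app"]["logs"]

pipeline = [
    {"$match": {"level": {"$in": ["WARN", "ERROR"]}}},
    {"$group": {"_id": "$level", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in logs.aggregate(pipeline):
    print(row["_id"], row["count"])
```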

When you use MongoDB for logging, the main concern is lock contention under high write throughput. Although MongoDB's inserts are fire-and-forget style by default, a large number of insert() calls causes heavy write lock contention. This can affect application performance and prevent readers from aggregating / filtering the stored logs.
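Note that in current pymongo the fire-and-forget behaviour is opt-in rather than the default: writes are acknowledged unless you request an unacknowledged write concern (w=0). A minimal sketch, which trims client-side latency but does nothing about the server-side lock contention described above:

```python
# Minimal sketch: requesting unacknowledged (fire-and-forget) writes
# via write concern w=0 in pymongo. Collection name is illustrative.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

db = MongoClient("mongodb://localhost:27017")["app"]

# Same collection, but with an unacknowledged write concern: the
# client does not wait for the server to confirm each insert.
fire_and_forget = db.get_collection("logs", write_concern=WriteConcern(w=0))
fire_and_forget.insert_one({"level": "INFO", "message": "request served"})
```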

One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are launched on every application node and take the logs from the app processes.

Fluentd plus MongoDB

They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so it's a lot more efficient than writing directly from the apps. This link describes how to put logs into Fluentd from a PHP program.
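To illustrate why batching helps, here is a hypothetical Python sketch of the buffering a collector daemon performs internally; in practice you would use Fluentd or Logstash rather than rolling your own:

```python
# Hypothetical sketch of a collector's internal buffering: accumulate
# events in memory and flush them to MongoDB in batches. Assumes
# pymongo; class and field names are illustrative only.
from pymongo import MongoClient


class BufferedLogWriter:
    def __init__(self, collection, batch_size=500):
        self.collection = collection
        self.batch_size = batch_size
        self.buffer = []

    def log(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One round trip per batch instead of one per event.
            self.collection.insert_many(self.buffer)
            self.buffer = []


writer = BufferedLogWriter(MongoClient()["app"]["logs"])
for i in range(1000):
    writer.log({"seq": i, "message": "worker heartbeat"})
writer.flush()  # drain whatever is left in the buffer
```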

Here are some tutorials about MongoDB + Fluentd.

MongoDB's problem is that it starts slowing down when the data volume exceeds the memory size. At that point, you can switch to other solutions like Apache Hadoop or Cassandra. If you have the distributed logging layer mentioned above, you can switch to another solution instantly as you grow. This tutorial describes how to store logs in HDFS using Fluentd.

Comments

  • ikrain
    ikrain almost 2 years

    Do you have any experience logging to NoSQL databases for scalable apps? I have done some research on NoSQL databases for logging and found that MongoDB seems to be a good choice. Also, I found log4mongo-net which seems to be a very straightforward option.

    Would you recommend this kind of approach? Are there any other suggestions?

  • Matt Zukowski
    Matt Zukowski almost 11 years
Just as a word of warning, we've run into serious issues with MongoDB when writing more than a few thousand events per second to log collections. MongoDB's lackluster write performance may be the culprit.
  • L0j1k
    L0j1k almost 10 years
    About Graylog2, please be advised: "All running on the existing JVM in your datacenter." If you miss this, you won't see anything until you're looking in the third or fourth paragraph of the installation instructions of the download package ("You also must use Java 7!"). I always think it's kind of funny how Java-based projects conveniently forget to mention they're Java-based projects when selling themselves. Just IMO.
  • uylmz
    uylmz almost 10 years
Is the answer applicable to both: 1) business-level logs (which shouldn't be lost and must be durable), and 2) statistics logs like website access logs, where losing some records doesn't really matter?
  • yamen
    yamen over 9 years
    There are some examples of Elasticsearch losing data, although we haven't seen that happening in our implementations. If true audit capability is required, I would use a dual approach of saving the raw files off to (say) HDFS for archiving, and also sending them to Elasticsearch for real-time indexing and search. The archive can be used to reconcile or reload the Elasticsearch index if required.
  • jonathancardoso
    jonathancardoso over 8 years
You should put a disclaimer at the top that you are affiliated with ElasticSearch. At least it looks like you are.
  • Patrissol Kenfack
    Patrissol Kenfack over 2 years
    Thank you very very much. It's really helpful