Database choice for large data volume?


Solution 1

I've used PostgreSQL in an environment where we're seeing 100K-2M new rows per day, most added to a single table. However, those rows tend to be reduced to samples and then deleted within a few days, so I can't speak about long-term performance with more than ~100M rows.

I've found that insert performance is quite reasonable, especially if you use the bulk COPY command. Query performance is fine, although the choices the planner makes sometimes puzzle me, particularly when doing JOINs / EXISTS. Our database requires fairly regular maintenance (VACUUM/ANALYZE) to keep it running smoothly. I could avoid some of this by tuning autovacuum and other settings more carefully, and it's less of an issue if you're not doing many DELETEs. Overall, there are some areas where I feel it's more difficult to configure and maintain than it should be.
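As a rough sketch of the bulk loading and maintenance described above (the table name, columns, and file path are hypothetical, not from the original setup):

    -- Hypothetical table receiving the high-volume inserts
    CREATE TABLE measurements (
        device_id   integer          NOT NULL,
        recorded_at timestamptz      NOT NULL,
        value       double precision
    );

    -- Bulk load a batch of rows from a server-side CSV file
    COPY measurements (device_id, recorded_at, value)
        FROM '/data/measurements_batch.csv' WITH (FORMAT csv);

    -- Periodic maintenance after heavy insert/delete activity
    VACUUM ANALYZE measurements;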

I have not used Oracle, and MySQL only for small datasets, so I can't compare performance. But PostgreSQL does work fine for large datasets.

Solution 2

Do you have a copy of "The Data Warehouse Toolkit"?

The suggestion there is to do the following.

  1. Separate fact values (measurable, numeric) from the dimensions that qualify or organize those facts. One big table isn't really the best idea; rather, a fact table dominates the design, plus a number of small dimension tables to allow "slicing and dicing" the facts.

  2. Keep the facts in simple flat files until you want to do SQL-style reporting. Don't create and back up a database; create and back up files, and load a database only for the reports you must do from SQL.

  3. Where possible, create summaries or extra data marts for analysis. In some cases, you may need to load the whole thing into a database. If your files reflect your table design, all databases have bulk loader tools that can populate and index SQL tables from the files (see the sketch after this list).
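To illustrate point 3, here is a rough sketch of loading flat files into PostgreSQL only when a SQL report is needed; the table names, columns, and file paths are made up for the example:

    -- Hypothetical dimension and fact tables, created only when a
    -- SQL-style report is required
    CREATE TABLE dim_date (
        date_key  integer PRIMARY KEY,
        full_date date    NOT NULL
    );

    CREATE TABLE fact_sales (
        date_key    integer NOT NULL REFERENCES dim_date (date_key),
        product_key integer NOT NULL,
        quantity    integer,
        amount      numeric(12,2)
    );

    -- The flat files remain the system of record; the bulk loader
    -- populates the tables from them
    COPY dim_date   FROM '/data/dim_date.csv'   WITH (FORMAT csv);
    COPY fact_sales FROM '/data/fact_sales.csv' WITH (FORMAT csv);

    -- Index after loading, not before, to keep the bulk load fast
    CREATE INDEX ON fact_sales (date_key);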

Solution 3

Google's BigTable and Hadoop are two engines that can handle large amounts of data.

Solution 4

The amount of data (roughly 200M records per year, at 500,000 per day) is not really big, and any standard database engine should handle it.

The case is even easier if you do not need live reports on it. I'd mirror and pre-aggregate the data on some other server in, e.g., a daily batch (see the sketch below). As S.Lott suggested, you might like to read up on data warehousing.
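A minimal sketch of such a daily pre-aggregation batch, reusing the hypothetical measurements table from the earlier sketch:

    -- Summary table on the reporting server; reports read this instead
    -- of scanning the big table
    CREATE TABLE daily_summary (
        day         date    NOT NULL,
        device_id   integer NOT NULL,
        row_count   bigint,
        total_value double precision,
        PRIMARY KEY (day, device_id)
    );

    -- Daily batch: roll up yesterday's raw rows into one row per device
    INSERT INTO daily_summary (day, device_id, row_count, total_value)
    SELECT recorded_at::date, device_id, count(*), sum(value)
    FROM   measurements
    WHERE  recorded_at >= date_trunc('day', now()) - interval '1 day'
    AND    recorded_at <  date_trunc('day', now())
    GROUP  BY recorded_at::date, device_id;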

Solution 5

Some interesting points regarding Google BigTable are:

Bigtable vs. DBMS

  • Fast query rate
  • No joins, no SQL support; column-oriented database
  • Uses one Bigtable instead of many normalized tables
  • Is not even in 1NF from a traditional point of view
  • Designed to support historical queries via a timestamp field, e.g. what did this webpage look like yesterday?
  • Data compression is easier because rows are sparse

I highlighted the lack of joins and SQL support because you mentioned you will need to run a series of reports. I don't know how much impact (if any) not having the ability to do this would have on your reporting if you were to use it.


Comments

  • Marko about 4 years

    I'm about to start a new project which should have a rather large database.

    The number of tables will not be large (<15); the majority of the data (99%) will be contained in one big table, which is almost insert/read only (no updates).

    The estimated amount of data in that one table is going to grow by 500,000 records a day, and we should keep at least one year of them to be able to do various reports.

    There needs to be a (read-only) replicated database as a backup/failover, and maybe for offloading reports at peak times.

    I don't have first-hand experience with databases that large, so I'm asking those who have: which DB is the best choice in this situation? I know that Oracle is the safe bet, but I am more interested in whether anyone has experience with PostgreSQL or MySQL in a similar setup.

  • Marko over 15 years
    Those are not SQL databases. How do they fare wrt reporting?
  • MrValdez over 15 years
    I don't have direct experience programming these two engines, but from what I gather from reading papers online, they have an advantage over SQL when it comes to selecting specific data from a large database. I'll look for the papers on my hard drive at home and see if I can post them here.
  • hansvb over 15 years
    Can BigTable be used outside of Google AppEngine?
  • Cerin about 12 years
    There are other considerations than simply "can it store 200m records". Of course most databases can handle that, but not all handle it equally well, which is really what the OP is asking. I've used both MySQL and PostgreSQL for this, and PostgreSQL wins hands down. In my experience, PG runs queries (especially complicated ones) on large tables faster and can dump/load its contents faster.
  • mahesh over 11 years
    Currently, I have my data stored in files only, and every day there are around 50k new entries. Now I want to use this data for reporting. The reporting queries will mostly be aggregations; the data contains only 3 to 4 fields, so no joins. What do you suggest?