How big is too big for a PostgreSQL table?

Solution 1

Rows per table won't be an issue on its own.

So, roughly speaking, 1 million rows a day for 90 days is 90 million rows. Without knowing all the details of what you are doing, I see no reason Postgres can't deal with that.

Depending on your data distribution you can use a mixture of indexes, filtered (partial) indexes, and table partitioning of some kind to speed things up once you see what performance issues you may or may not have. Your problem will be the same on any other RDBMS that I know of. If you only need three months' worth of data, design in a process to prune off the data you don't need any more; that way you keep a consistent volume of data in the table (see the partitioning sketch after the test-data example below). You're lucky that you know how much data will exist, so test at your volume and see what you get. Testing one table with 90 million rows may be as easy as:

-- generates 90 million rows of dummy data; wrap in "create table ... as" to materialize a test table
select x, 1 as c2, 2 as c3
from generate_series(1, 90000000) x;
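
If you do go the partitioning route for the retention requirement, a minimal sketch of the prune-by-dropping approach (hypothetical table and column names, PostgreSQL 11+ declarative partitioning; adapt to your schema) could look like this:

-- hypothetical messages table, partitioned by month on created_at
create table messages (
    sender_id   bigint not null,
    receiver_id bigint not null,
    body        text,
    created_at  timestamptz not null
) partition by range (created_at);

create table messages_2022_06 partition of messages
    for values from ('2022-06-01') to ('2022-07-01');
create table messages_2022_07 partition of messages
    for values from ('2022-07-01') to ('2022-08-01');

-- an index on the parent is created on each partition automatically (PostgreSQL 11+)
create index on messages (sender_id, created_at);

-- "pruning" a month of old data is then a near-instant drop instead of a slow delete + vacuum
drop table messages_2022_06;

A three-month retention window then just means keeping roughly the last three monthly partitions around.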

https://wiki.postgresql.org/wiki/FAQ

Limit                       Value
Maximum Database Size       Unlimited
Maximum Table Size          32 TB
Maximum Row Size            1.6 TB
Maximum Field Size          1 GB
Maximum Rows per Table      Unlimited
Maximum Columns per Table   250 - 1600 depending on column types
Maximum Indexes per Table   Unlimited

Solution 2

Another way to speed up your queries significantly on a table with > 100 million rows is to cluster the table on the index that is most often used in your queries. Do this in your database's "off" hours. We have a table with > 218 million rows and have found 30X improvements.

Also, for a very large table, it's a good idea to create an index on your foreign keys.
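
PostgreSQL indexes the referenced side of a foreign key (its primary or unique key) automatically, but not the referencing columns, so those indexes have to be created explicitly. A minimal sketch for the question's scenario, assuming a messages table with hypothetical foreign-key columns sender_id and receiver_id:

-- hypothetical table and column names; adjust to your schema
create index if not exists messages_sender_id_idx on messages (sender_id);
create index if not exists messages_receiver_id_idx on messages (receiver_id);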

EXAMPLE:

  1. Assume we have a table named investment in a database named ccbank.
  2. Assume the index most used in our queries is (bankid, record_date).

Here are the steps to create and cluster an index:

  1. psql -c "drop index investment_bankid_rec_dt_idx;" ccbank
  2. psql -c "create index investment_bankid_rec_dt_idx on investment(bankid, record_date);"
  3. psql -c "cluster investment_bankid_rec_dt_idx on investment;"
  4. vacuumdb -d ccbank -z -v -t investment

In steps 1-2 we replace the old index with a new, optimized one. In step 3 we cluster the table: this physically rewrites the table in the order of the index, so queries that scan a range of that index read mostly adjacent pages instead of jumping around the heap. In step 4 we vacuum and analyze the table so the query planner has up-to-date statistics.
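
Note that CLUSTER takes an ACCESS EXCLUSIVE lock while it rewrites the table (hence the off-hours advice), and the ordering is not maintained for rows inserted afterwards, so it has to be re-run periodically. A rough way to check whether a table is still in index order is the correlation column of the standard pg_stats view; values close to 1.0 or -1.0 mean the physical row order matches that column's order (a sketch using the example table above):

-- correlation near 1.0 (or -1.0) means rows are stored roughly in this column's order
select attname, correlation
from pg_stats
where tablename = 'investment'
  and attname in ('bankid', 'record_date');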

Comments

  • Dylan Karr
    Dylan Karr almost 2 years

    I'm working on the design for a RoR project for my company, and our development team has already run into a bit of a debate about the design, specifically the database.

    We have a model called Message that needs to be persisted. It's a very, very small model with only three db columns other than the id; however, there will likely be A LOT of these records when we go to production. We're looking at as many as 1,000,000 insertions per day. The records will only ever be searched by two foreign keys, which can be indexed. Also, the records never have to be deleted, but we also don't have to keep them once they're about three months old.

    So, what we're wondering is whether implementing this table in Postgres will present a significant performance issue. Does anyone have experience with very large SQL databases who can tell us whether or not this will be a problem? And if so, what alternative should we go with?

    • Sam
      Sam over 10 years
      With a good caching layer and a little configuration in PG you should be fine. Tackle performance issues case by case and avoid premature optimization. That said, partitioning and replication are always great options you can take advantage of once you hit bottlenecks.
    • Erwin Brandstetter
      Erwin Brandstetter over 10 years
      Related question here and here.
    • Frank Heikens
      Frank Heikens over 10 years
      We process about 30 million messages per day in one 5+ TB PostgreSQL database, works fine.
    • Al Chou
      Al Chou over 7 years
      FYI, I happened to be reading postgresql.org/about today and noticed that it says that (in principle) the number of rows in a table is unlimited.
    • favq
      favq over 5 years
      @FrankHeikens - I know this is an old comment, but I'm interested in knowing details about the use case and whether this setup is still working for you - could you elaborate, please?
    • Arya
      Arya almost 5 years
      @FrankHeikens can you tell me about the hardware spec you are using?
  • Mike Sherrill 'Cat Recall'
    Mike Sherrill 'Cat Recall' over 10 years
    I agree that 90 million rows won't be a problem for PostgreSQL. But it might be a problem for an ORM with PostgreSQL. (An ORM with any dbms, actually.)
  • Abdul Saqib
    Abdul Saqib over 10 years
    @MikeSherrill'Catcall' Good point, I was just focused on "How big is too big for a PostgreSQL table?"
  • yeyo
    yeyo almost 9 years
    @MikeSherrill'CatRecall' why might it be a problem for an ORM? :)
  • Mike Sherrill 'Cat Recall'
    Mike Sherrill 'Cat Recall' almost 9 years
    @yeyo: Because ORMs usually use a lot of queries to get data that could be returned with only one or two. The OP is using Ruby on Rails.
  • Stefan Theard
    Stefan Theard about 7 years
    This is a little late but I think that in a lot of cases (especially with rails / active record) it is common to completely remove the ORM from the equation and write a raw sql string to query for performance reasons. Don't let your ORM make data decisions for you! It's an accessory not an essential.
  • Shorn
    Shorn almost 6 years
    The about URL quoted in the answer does not show these limits currently - anyone know where they've moved to?
  • spy
    spy over 4 years
    > in the off hours cluster the table on the index that is most often used in your queries... Can you explain how this is done?
  • James Doherty
    James Doherty over 4 years
    Yes here is a step by step EXAMPLE: 1) The table I am referring to is called investment in this example. 2) The index most used in queries is (bankid,record_date) So here is your step by step: 1) psql -c "drop index investment_bankid_rec_dt_idx;" dbname 2) psql -c "create index investment_bankid_rec_dt_idx on investment(bankid, record_date);" 3) psql -c "cluster investment_bankid_rec_dt_idx on investment;" 4) vacuumdb -d ccbank -z -v -t investment So in step one and two we drop the index and recreate it.
  • James Doherty
    James Doherty over 4 years
    Step 3 we create the cluster, this basically puts the DB table in the physical order of the index, so when postgresql performs a query it caches the most likely next rows. Step 4 we vacuum the database to reset the statistics for the query planner
  • TiggerToo
    TiggerToo almost 3 years
    It is not correct to say that Rails/ActiveRecord uses a lot of queries to get data that could be returned with only one or two. It is totally capable of doing the same thing in one or two queries, but you have to actually tell it to do so rather than relying on default behavior. The problem is that because it's so easy to use and so abstracted, a lot of queries are written by people with less experience who aren't necessarily aware of such things, let alone actively taking them into consideration. That's a general problem with making programming tools easy, not an ORM thing or a Rails thing.
  • KodornaRocks
    KodornaRocks over 2 years
    I believe the real question should be how much is too much for Postgres with given specs. With unlimited resources, your only barrier would be the hard limits of your build, which, again, with unlimited resources, could be rebuilt for the specific hardware architecture and expected limits. But I can't see how a billion 1 kB rows would hit any kind of limit in any kind of RDBMS that I know of.