How well does PostgreSQL perform with a large number of databases?

Solution 1

At the low end, it basically boils down to "can you absolutely say that you have no shared data?" Unlike MySQL, the database is an absolute boundary in PostgreSQL. You cannot SELECT zip_code FROM common.city_zip WHERE city=... if you go with separate databases (at least not without dblink).
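For completeness, dblink does make cross-database reads possible, but it is clunky; something along these lines (the connection string and the city value are purely illustrative) is what you would be signing up for:

    -- dblink ships as a contrib module; CREATE EXTENSION works on 9.1+.
    CREATE EXTENSION IF NOT EXISTS dblink;

    -- Pull shared data out of a separate "common" database by hand.
    SELECT zip_code
    FROM dblink('dbname=common',
                'SELECT zip_code FROM city_zip WHERE city = ''Springfield''')
         AS t(zip_code text);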

If you have any shared data at all, PostgreSQL's "schema" is similar to what MySQL calls a "database". You can CREATE SCHEMA clienta; CREATE TABLE clienta.customer (...);. You would create a schema for each client, that client's user would have their schema first in their search path, and permissions would be granted so that Client A's user would have access to the clienta and public schemas (and their tables).
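A minimal sketch of that layout (the role name, password, and the shared city_zip table are placeholders, and in practice you might let each client role create its own tables):

    -- Hypothetical shared lookup table kept in the public schema.
    CREATE TABLE public.city_zip (city text, zip_code text);

    -- One login role and one schema per client.
    CREATE ROLE clienta LOGIN PASSWORD 'placeholder';
    CREATE SCHEMA clienta AUTHORIZATION clienta;
    CREATE TABLE clienta.customer (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );
    ALTER TABLE clienta.customer OWNER TO clienta;

    -- Client A may use its own schema plus read the shared data.
    GRANT USAGE  ON SCHEMA public TO clienta;
    GRANT SELECT ON public.city_zip TO clienta;

    -- Resolve unqualified table names against the client's schema first.
    ALTER ROLE clienta SET search_path = clienta, public;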

Your issue is going to be that at the high end of the number of clients, each table is stored as a file, so whether you go with one database per client, one schema per client, or table names like ${client}_customer, you will likely run into file descriptor limits with 10k clients even if you only had one table per client (plus one file descriptor per connection). Of course, you can adjust the kernel's maximum number of file descriptors on the fly using sysctl, but the per-process limit (ulimit) will require restarting PostgreSQL if you set it too low the first time around.
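If you want to sanity-check that from inside PostgreSQL, the relevant setting and a rough count of on-disk relation files look something like this (a sketch; which relkinds matter to you may vary):

    -- Per-backend cap on simultaneously open files (postgresql.conf setting).
    SHOW max_files_per_process;

    -- Rough count of relations that each map to one or more files on disk:
    -- ordinary tables (r), indexes (i), and TOAST tables (t).
    SELECT relkind, count(*)
    FROM pg_class
    WHERE relkind IN ('r', 'i', 't')
    GROUP BY relkind;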

The alternative is to have "one big table" with a client column that identifies which client each row belongs to (ideally by username, if you have one user per client; this makes the stuff below a LOT easier). By not granting the clients any access at all to this table, you can create client-specific views (or use session_user to identify the current client). Updates can't be done directly through a view, though, so you would need to define functions to insert/update/delete on the table (one set of functions per client, or else using session_user), with the functions using SECURITY DEFINER to execute as a special user with permission to insert/update/delete on the tables (note: session_user is used because user and current_user are based on the current context, and within a SECURITY DEFINER function this would always be the user who defined the function).
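Continuing with the illustrative clienta role from above, a rough sketch of that shared-table variant (a sketch only, not a hardened implementation):

    -- One big customer table; clients get no direct privileges on it.
    CREATE TABLE customer (
        client_name name NOT NULL,   -- which client owns this row
        id          serial,
        cust_name   text NOT NULL,
        PRIMARY KEY (client_name, id)
    );
    REVOKE ALL ON customer FROM PUBLIC;

    -- Each client sees only its own rows through a view keyed on session_user.
    CREATE VIEW my_customer AS
        SELECT id, cust_name FROM customer WHERE client_name = session_user;
    GRANT SELECT ON my_customer TO clienta;

    -- Writes go through SECURITY DEFINER functions owned by a privileged role;
    -- session_user (not current_user) still identifies the connected client.
    CREATE FUNCTION add_customer(p_name text) RETURNS void AS $$
        INSERT INTO customer (client_name, cust_name) VALUES (session_user, $1);
    $$ LANGUAGE sql SECURITY DEFINER;
    REVOKE EXECUTE ON FUNCTION add_customer(text) FROM PUBLIC;
    GRANT  EXECUTE ON FUNCTION add_customer(text) TO clienta;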

Performance-wise, beyond the file descriptor issue, I honestly don't know what would happen with 10,000 databases in PostgreSQL versus one large table with 10,000 clients' worth of data in it. Proper index design should keep the large table from being slow to query.
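For the shared-table sketch above, that mostly means leading each index with the client column so a per-client query only touches that client's slice of the table; a hypothetical example:

    -- Hypothetical: keeps per-client lookups by customer name cheap.
    CREATE INDEX customer_client_cust_name_idx
        ON customer (client_name, cust_name);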

I will say that I went with separate databases for each client here (we add servers to keep the system usable, shifting client databases to new servers as needed, so we will never get to 10k databases on one server). I've regularly had to restore individual clients' data from backups for debugging or due to user error, something that would be an absolute nightmare with the "one big table" design. Also, if you intend to sell customization of your product to your clients, the "one big table" design might end up hobbling your ability to customize the data model.

Solution 2

Without more details about your application, it's difficult to say that you will get any additional security from this setup. If each client connects to the web app and there is a shared user from the web app to the database, then you haven't isolated your data in a way that is any different from using a single monolithic database. Accessing your data through properly parameterized stored procedures will provide the level of isolation you're looking for without the administrative headache of managing 10,000+ databases on any number of servers.

I have personally run a similar setup on a single database server using nothing more than parameterized stored procedures hitting a single database. If you can guarantee that the only access to the database is through stored procedures, there is no danger of data co-mingling in the results.
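As a sketch of what that looks like (the table and function names are made up for illustration), with a single shared application login and every read funnelled through a function that takes the tenant id as a parameter:

    -- All rows live in one table, tagged with the owning client.
    CREATE TABLE orders (
        id        serial PRIMARY KEY,
        client_id integer NOT NULL,
        total     numeric NOT NULL
    );

    -- The application never queries the table directly, only the function.
    CREATE FUNCTION get_orders(p_client_id integer)
    RETURNS SETOF orders AS $$
        SELECT * FROM orders WHERE client_id = $1;   -- $1 is p_client_id
    $$ LANGUAGE sql STABLE;

    -- Example call from the web app:
    --   SELECT * FROM get_orders(42);

The guarantee above rests on the application only ever going through functions like this one rather than touching the tables directly.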

If you do want to go forward with your design, here are my primary concerns:

  1. running out of open file descriptors (ulimit -n) on your host OS
  2. tuning 10,000+ databases for different querying patterns
  3. administering 10,000+ databases with different security concerns (backups and potential restores: do you really want to restore 10,000+ databases if there is a server failure?)
  4. rolling out changes across 10,000+ databases

Comments

  • Carlos Melo
    Carlos Melo over 1 year

    We have a web application whose architecture requires that every registered user (a company, actually) be isolated from the others, i.e., I'll run the same web app with the same data models, but with a different data set for every customer.

    So, we did think about creating a different database in Postgres for every customer. Can this solution scale to, say, 10-20K databases? How well?

    Does anyone have a better solution for this?

    Thanks in advance.

  • Carlos Melo
    Carlos Melo about 13 years
    Hi, DerfK. I can't use the "one big table" approach due to the reason you stated. Even if the data models are the same for every user today, we can't guarantee that they will always be the same. Also, I didn't know about the absolute database boundary in PostgreSQL (as we do have some shared data). I think that leaves me with the schema and table-naming approaches. In your experience, how difficult is it to manage this number of databases (even across different servers)?
  • Carlos Melo
    Carlos Melo about 13 years
    And how hard would it be to back up and restore a client's data? Is it easier to do this with stored procedures or with schemas? As you stated, the app design does use only one shared user to connect to the database. At first, the multiple-database approach was considered for management concerns rather than security.
  • DerfK
    DerfK about 13 years
    @Eduardo The biggest difficulty I have with it is making sure that when the data model needs to change for everyone, it gets done. Someday we'll adapt something like Rails's system for managing changes to the data model; until then I have a script that loops through the clients and executes the same command on every database. Since we don't do shared data at all, everything else has been pretty easy. If you go with one db with multiple schemas, you can still dump one client schema at a time using pg_dump -n (be sure to dump your common schema too!). To list schemas: psql -E then \dn
  • Jeremiah Peschka
    Jeremiah Peschka about 13 years
    @Eduardo don't design for features that you don't have right now. If that were the case, my car would be a submarine, would repel bears, and would be able to fly to the moon. There are plenty of solid database design patterns that will let you start with a large table and add additional features as needed. The key is to ask yourself what you need today and what your ops team will be able to support based on growth projections.
  • Carlos Melo
    Carlos Melo about 13 years
    @DerfK, what is the web stack you use today?
  • Carlos Melo
    Carlos Melo about 13 years
    @Jeremiah, you have a good point. Do you have experience with multitenant applications?
  • Jeremiah Peschka
    Jeremiah Peschka about 13 years
    @Eduardo I have a great deal of experience designing, developing, and administering multi-tenant applications, specifically from the database side of things.
  • Carlos Melo
    Carlos Melo about 13 years
    @Jeremiah in your experience, using a multi-tenant application for small companies, what's the average max user capacity for a high-end server?
  • Jeremiah Peschka
    Jeremiah Peschka about 13 years
    @Eduardo A moderately priced server with connection pooling turned on at the clients can theoretically handle a pretty heavy load, but it depends on the load: the type of queries you're running, different querying patterns, network connectivity, and a host of other things that are pretty difficult to determine through Stack Overflow. Enabling connection pooling, using gigabit Ethernet or InfiniBand, multiple Ethernet cards, and proper storage configuration go a long way toward serving a huge number of concurrent users.
  • mikerobi
    mikerobi about 11 years
    Parameterized stored procedures don't protect against anything but SQL injections. If one of those procedures does a SELECT * WHERE clientId = 3, you have a security leak.