Are there performance issues storing files in PostgreSQL?

Solution 1

You have basically two choices. You can store the data right in the row or you can use the large object facility. Since PostgreSQL now uses something called TOAST to move large fields out of the table there should be no performance penalty associated with storing large data in the row directly. There remains a 1 GB limit in the size of a field. If this is too limited or if you want a streaming API, you can use the large object facility, which gives you something more like file descriptors in the database. You store the LO ID in your column and can read and write from that ID.
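To make the two options concrete, here is a minimal sketch; the table, column, and file names are hypothetical, not from the original answer:

```sql
-- Option 1: inline storage. TOAST moves large values out of line
-- automatically, so no special handling is needed (1 GB per-field limit).
CREATE TABLE documents (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    data     bytea
);

-- Option 2: the large object facility. The table stores only an OID
-- referring to the object, which you read and write through the LO API.
CREATE TABLE documents_lo (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    data_oid oid
);

-- lo_import (server-side) creates a large object from a file on the
-- database server's file system and returns its OID.
INSERT INTO documents_lo (filename, data_oid)
VALUES ('report.pdf', lo_import('/tmp/report.pdf'));
```

Client drivers additionally expose a streaming interface over that OID (server-side equivalents are lo_open, loread, and lowrite), which is what gives you the file-descriptor-like behavior.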

I personally would suggest you avoid the large object facility unless you absolutely need it. With TOAST, most use cases are covered by just using the database the way you'd expect. With large objects, you give yourself additional maintenance burden, because you have to keep track of the LO IDs you've used and be sure to unlink them when they're not used anymore (but not before) or they'll sit in your data directory taking up space forever. There are also a lot of facilities that have exceptional behavior around them, the details of which escape me because I never use them.
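As a sketch of that cleanup burden (hypothetical table and column names): deleting a row does not delete the large object it points to, so you must unlink the object yourself:

```sql
-- Unlink the object first, then drop the row that referenced it;
-- otherwise the object's chunks stay in pg_largeobject indefinitely.
SELECT lo_unlink(data_oid) FROM attachments WHERE id = 42;
DELETE FROM attachments WHERE id = 42;
```

The contrib tool vacuumlo can find and remove large objects that are no longer referenced by any OID column, and the contrib "lo" module provides a lo_manage trigger that unlinks objects automatically on UPDATE and DELETE.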

For most people, the big performance penalty associated with storing large data in the database is that your ORM software will pull out the big data on every query unless you specifically instruct it not to. You should take care to tell Hibernate or whatever you're using to treat these columns as large and only fetch them when they're specifically requested.
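In plain SQL the fix is simply to avoid SELECT * and name the columns you need; a sketch with a hypothetical table:

```sql
-- Listing rows without dragging the payload across the wire.
SELECT id, filename, octet_length(data) AS size_bytes
FROM documents;

-- Fetch the payload only when it is actually requested.
SELECT data FROM documents WHERE id = 42;
```

In an ORM, the equivalent is configuring the property as lazily fetched (in Hibernate, for example, lazy fetching of a basic attribute) so that listing entities doesn't pull the binary column.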

Solution 2

The BLOB (LO) type stores data in 2 KB chunks within standard PostgreSQL heap pages, which default to 8 KB in size. The chunks are not stored as independent, cohesive files in the file system - for example, you wouldn't be able to locate a file, do a byte-by-byte comparison and expect it to match the original data you loaded into the database, since Postgres heap page headers and structures delineate the chunks.

You should avoid the Large Object (LO) interface if your application needs to update the binary data frequently, particularly with many small, random-access writes. Because of the way PostgreSQL implements concurrency control (MVCC), every update creates a new row version, which can lead to an explosion in the amount of disk space used until you VACUUM the database. The same outcome probably also applies to data stored inline in a bytea column, whether TOASTed or not.
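A sketch of that failure mode, using a hypothetical table name: even a one-byte change to a bytea value rewrites the entire value as a new row version, and the old version occupies disk space until VACUUM reclaims it:

```sql
-- Flips a single byte, but under MVCC the whole (possibly TOASTed)
-- value is written again as a new row version.
UPDATE documents
SET data = overlay(data PLACING '\x00'::bytea FROM 1 FOR 1)
WHERE id = 42;
```

The LO interface does allow seeking and writing into the middle of an object without rewriting the rest, but the touched chunks in pg_largeobject are still versioned under MVCC and still need vacuuming.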

However, if your data follows a Write-Once-Read-Many pattern (e.g. upload a PNG image and never modify it afterwards), it should be fine from the standpoint of disk usage.

See this pgsql-general mailing list thread for further discussion.

Author: Renato Dinhani

Updated on July 22, 2022

Comments

  • Renato Dinhani
    Renato Dinhani almost 2 years

    Is it OK to store files like HTML pages, images, PDFs, etc. in a PostgreSQL table, or is it slow? I've read some articles saying this is not recommended, but I don't know if it's true.

    The column types I have in mind are BLOB (as far as I know, it stores the data in a file) or the bytea type, but others are applicable also.

  • Renato Dinhani
    Renato Dinhani over 12 years
    Nice answer, thank you. Is the bytea type used to store the content in the table?
  • araqnid
    araqnid over 12 years
    bytea data will be stored in the table by default if it's small, and moved to the auxiliary ("toast") table and compressed for larger values. See: postgresql.org/docs/9.1/static/storage-toast.html for an introduction. You can disable compression of auxiliary storage, which will improve the performance of fetching only parts of the values.
  • Daniel Lyons
    Daniel Lyons over 12 years
    bytea is a good choice for binary data. You can also use text or varchar if the data is textual and in the same encoding as the database.
  • Nicholas DiPiazza
    Nicholas DiPiazza over 9 years
    It would be great to insert something into this post about the time taken for a PostgreSQL backup. Does it make it take much longer? Sounds like not.
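The storage setting araqnid mentions can be applied per column; a minimal sketch with hypothetical table and column names:

```sql
-- EXTERNAL keeps out-of-line TOAST storage but disables compression,
-- which makes fetching substrings of a large value cheaper.
ALTER TABLE documents ALTER COLUMN data SET STORAGE EXTERNAL;
```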