Why is Solr so much faster than Postgres?


Solution 1

First, Solr doesn't use B-trees. A Lucene index (Lucene is the underlying library used by Solr) is made of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this dictionary is done with a binary search, so the cost of a single-term lookup is O(log(t)), where t is the number of terms. By contrast, using the index of a standard RDBMS costs O(log(d)), where d is the number of documents. When many documents share the same value for some field, this can be a big win.

Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)

But Solr can only do this because it doesn't have all the constraints that an RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).

Solution 2

You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.

Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
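
For illustration, the rewrite followed roughly this shape. This is a minimal sketch of the CTE-plus-window-function pattern described above; every table and column name here is hypothetical, not taken from the actual report:

-- Stage the filtered rows once in a CTE, then compute the
-- aggregate with a window function instead of repeated queries.
-- All table and column names are hypothetical.
WITH filtered AS (
  SELECT account_id, event_time, amount
    FROM events
    WHERE event_time >= date '2016-01-01'
)
SELECT account_id,
       event_time,
       amount,
       sum(amount) OVER (PARTITION BY account_id
                         ORDER BY event_time) AS running_total
  FROM filtered
  ORDER BY account_id, event_time;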

Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.

I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just loading War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits of indexing tend to be much bigger.

-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
  (
    lineno serial PRIMARY KEY,
    linetext text NOT NULL,
    tsv tsvector
  );

-- Load from downloaded data into database.
COPY war_and_peace (linetext)
  FROM '/home/kgrittn/Downloads/war-and-peace.txt';

-- "Digest" data to lexemes.
UPDATE war_and_peace
  SET tsv = to_tsvector('english', linetext);

-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
  ON war_and_peace
  USING gist (tsv);

-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;

Once set up for indexing, I show a few searches with row counts and timings for both types of indexes:

-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  WHERE tsv @@ to_tsquery('english', 'gentlemen');

84 rows, gist: 2.006 ms, gin: 0.194 ms

-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  WHERE tsv @@ to_tsquery('english', 'ladies');

184 rows, gist: 3.549 ms, gin: 0.328 ms

-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');

1 row, gist: 0.971 ms, gin: 0.104 ms

Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.

Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.

The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.

The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
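
A minimal sketch of such a trigger, using the tsvector_update_trigger function that ships with PostgreSQL:

-- Keep tsv current on INSERT and UPDATE using the built-in
-- tsvector_update_trigger function.
CREATE TRIGGER war_and_peace_tsv_update
  BEFORE INSERT OR UPDATE ON war_and_peace
  FOR EACH ROW
  EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', linetext);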

With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
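
On the "loosely defined fields" point above, the same indexing approach carries over to a jsonb column. A minimal sketch, assuming a hypothetical listings table and PostgreSQL 9.4 or later:

-- Hypothetical table with loosely-defined attributes in a jsonb column.
CREATE TABLE listings
  (
    id serial PRIMARY KEY,
    attrs jsonb NOT NULL
  );

-- jsonb_path_ops gives a smaller, faster GIN index for containment queries.
CREATE INDEX listings_attrs
  ON listings
  USING gin (attrs jsonb_path_ops);

-- Find listings whose attributes contain make = Mazda.
SELECT * FROM listings
  WHERE attrs @> '{"make": "Mazda"}';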


UPDATE (Dec 9 '16)

I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether the intervening performance tuning helps this example. The single-term queries only improved by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed to 0.053 ms (i.e., 53 microseconds) when using the GIN (inverted) index.

Solution 3

Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required of an RDBMS. So it (or rather Lucene) concentrates purely on indexing data.

As you've no doubt discovered, Solr offers the ability to both search and retrieve data from its index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"

The answer is a qualified yes, with caveats that have been discussed at length elsewhere.

My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.

Solution 4

The biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search, not to be the primary source of the data. So your database may be in third normal form, but the index will be completely denormalized and contain mostly just the data needed to be searched.
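
In RDBMS terms, the closest analogue is a denormalized search table or materialized view built from the normalized schema. A minimal sketch, assuming hypothetical vehicles and makes tables (materialized views need PostgreSQL 9.3 or later):

-- Flatten a normalized schema into a single search-oriented relation,
-- much as a Lucene/Solr index does (hypothetical tables).
CREATE MATERIALIZED VIEW vehicle_search AS
SELECT v.id,
       v.mileage,
       v.price,
       m.name AS make
  FROM vehicles v
  JOIN makes m ON m.id = v.make_id;

CREATE INDEX vehicle_search_idx
  ON vehicle_search (make, price, mileage);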

Another possible reason is that databases generally suffer from internal fragmentation: they have to perform too many semi-random I/O operations to satisfy large requests.

What that means is that, in a typical database index architecture, the query leads to the indexes, which in turn lead to the data. If the data to be retrieved is widely scattered, fetching it takes a long time, and that seems to be what happens in databases.

Solution 5

Please read this and this.

Solr (Lucene) creates an inverted index, which is why retrieving data is so much faster. I read that PostgreSQL also has a similar facility, but I'm not sure whether you used it.
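
That facility is a GIN index over a tsvector, PostgreSQL's own inverted index. A minimal sketch, assuming a hypothetical documents table:

-- GIN over a tsvector expression is PostgreSQL's inverted index.
-- The query must repeat the same expression for the index to be used.
CREATE INDEX documents_fts
  ON documents
  USING gin (to_tsvector('english', body));

SELECT * FROM documents
  WHERE to_tsvector('english', body)
        @@ to_tsquery('english', 'ladies & gentlemen');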

The performance differences that you observed can also be attributed to "what is being searched for?" and "what do the user queries look like?"


Comments

  • cberner
    cberner over 3 years

    I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."

    I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres, though, it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).

    As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.

    What differences in architecture would explain this?

  • cberner
    cberner about 12 years
    Thanks! Those were very interesting. I was hoping for something a bit more technical though. Like an architecture overview of Solr, or something like that.
  • Yavar
    Yavar about 12 years
    @Tejas: Even databases can create inverted indexes. What's stopping them from creating inverted indexes?
  • Yavar
    Yavar about 12 years
    Great answer (the first paragraph) +1.
  • Yavar
    Yavar about 12 years
    Though I have one thing to say here. You say "Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted." So if the terms are lexicographically sorted and a lookup takes O(log t) time (binary search), does that mean the terms are stored in an array? Is that correct? (I mean, since you rejected a balanced tree structure for storing the index, and a hash table can't directly store sorted terms, we are left with a 2D array, with term and pointer-to-postings-list as its elements, on which the binary search is carried out.)
  • Yavar
    Yavar about 12 years
    Another thing to add here: even if the Lucene index is not a B-tree, it is nonetheless an inverted index (like most search engines use). Your answer came as something new to me, though, as I was expecting a balanced tree structure for storing the terms. In that case too the search would be O(log t), each node would also contain a pointer to its postings list, and with a balanced tree we would even be able to maintain the terms in lexicographic order.
  • jpountz
    jpountz about 12 years
    There is no need for a balanced tree, since the data never needs to be updated. When data is added to a Lucene index, a new segment is created. This segment takes precedence over the previous segments. When there are too many segments, a MergeScheduler chooses the segments to merge according to a MergePolicy (these are the names of the classes in Lucene).
  • jpountz
    jpountz about 12 years
    When a new SegmentReader (an IndexReader over a single segment) is opened, it loads by default every (n*16)th term of the index into a Java array in memory (16 is the indexDivisor). The lookup is then performed as a binary search in memory over this array, followed by one disk seek and at most 15 term comparisons on disk. So the overall cost is O(log(t/16)) + O(1) + O(15) = O(log(t)).
  • Tejas Patil
    Tejas Patil about 12 years
    Yavar: I didn't say that databases cannot create inverted indexes. In fact, in the second line I pointed to a link about PostgreSQL's GIN inverted index. There is another type, the GiST (Generalized Search Tree) index in PostgreSQL, which is SLOWER than GIN. The actual index type used by @cberner will be one factor in the low performance of PostgreSQL.
  • Hassan Syed
    Hassan Syed almost 12 years
    If I understand your rationale correctly, you are only talking about a scenario where FTS is implemented naively in an un-inverted relational fashion? What about comparing the specialised FTS indexing methods of Postgres?
  • jpountz
    jpountz almost 12 years
    I don't know how PostgreSQL full-text search works; I was only comparing B-tree indexing (used by default in most relational databases, including PostgreSQL in a standard configuration, i.e. not full-text) with Lucene/Solr indexing. But this has a cost: segment-based storage makes fine-grained transactions much more complex. For example, Lucene currently doesn't support individual field updates (you need to re-index the whole document).
  • ArtemGr
    ArtemGr about 9 years
    Note that GiST is no longer that much faster to maintain, cf. blog.pgaddict.com/posts/…
  • kgrittn
    kgrittn almost 9 years
    PostgreSQL in a standard configuration supports B-tree, hash, GiST, SP-GiST, GIN, and (starting with 9.5) BRIN indexes. It is true that if you omit the USING clause in a CREATE INDEX statement it defaults to USING btree. Configuring full-text search to behave as you prefer can take more work, though; a court system in PostgreSQL with tens of thousands of legal opinions, indexed with custom "dictionaries" for legal terms and a custom ability to recognize statute cites, etc., would typically return the list of documents matching user-supplied search criteria in about 300 ms.