MongoDB text index search slow for common words in large table

10,277

Well I worked around these performance issues by allowing MongoDB full text search to search in OR based format. I'm prioritizing my results by fine tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because my weighted results that appear at the top will most likely be consumed before my user gets to less relevant results at the bottom.

If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs leaps better.

hth

Share:
10,277
kmehta
Author by

kmehta

Updated on July 25, 2022

Comments

  • kmehta
    kmehta almost 2 years

    I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.

    Its text index includes ten fields with varying weights.

    index specification

    Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.

    Searching takes 15-60 seconds when words that are very common in the index are used in the search query.

    I seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask mongodb to evaluate those last (lazy) and on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask mongodb to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?

    I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common term filter on the application side after it has received records found by db. It is preferable for all query logic to happen on the database, but am open to application side processing for a speed payout.

    There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but am willing to do this for a speed increase.

    Does anyone have any insight and/or advice on how to speed up this system. I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that mongodb has seen. Thanks!

    • Nico
      Nico over 6 years
      It's now 2018 (5 years later), and mongodb still has the exact same issue :(
    • kmehta
      kmehta over 6 years
      because of this paired with mongo's significant performance hit by implementing this, by which we determined the use of mongo in such a way was not a "supported" or "intended" primary use, we decided to ditch mongo altogether. sorry for the cold water.
    • Burf2000
      Burf2000 about 6 years
      Did you find that writing records, got a lot slower over time? Did you add the index at the end?
    • Black Mamba
      Black Mamba about 2 years
      Even in 2022.... close to 10 years now
  • Amalia
    Amalia over 10 years
    Exactly. If you are using search terms in quotes (which I think is what you mean by AND format), MongoDB's text search will first use the text index over stemmed words, and then check every document to make sure that (a) both words are present, and (b) the unstemmed version of the words is identical to the quoted terms you passed in. This is much less performant than the other option (not quoting the terms), in which the text index is used and there is no second pass through each document. The weighting should take care of ranking results with both words present above results with only one.