Fastest way to find string by substring in SQL?

39,431

Solution 1

If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it.

You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.

Solution 2

if you dont care about storage, then you can create another table with partial Title entries, beginning with each substring (up to 255 entries per normal title ).

in this way, you can index these substrings, and match only to the beginning of the string, should greatly improve performance.

Solution 3

Sounds like you've ruled out all good alternatives.

You already know that your query

SELECT * FROM t WHERE TITLE LIKE '%abc%'

won't use an index, it will do a full table scan every time.

If you were sure that the string was at the beginning of the field, you could do

SELECT * FROM t WHERE TITLE LIKE 'abc%'

which would use an index on Title.

Are you sure full text search wouldn't help you here?

Depending on your business requirements, I've sometimes used the following logic:

  • Do a "begins with" query (LIKE 'abc%') first, which will use an index.
  • Depending on if any rows are returned (or how many), conditionally move on to the "harder" search that will do the full scan (LIKE '%abc%')

Depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.

Solution 4

You can add another calculated column on the table: titleLength as len(title) PERSISTED. This would store the length of the "title" column. Create an index on this.

Also, add another calculated column called: ReverseTitle as Reverse(title) PERSISTED.

Now when someone searches for a keyword, check if the length of keyword is same as titlelength. If so, do a "=" search. If length of keyword is less than the length of the titleLength, then do a LIKE. But first do a title LIKE 'abc%', then do a reverseTitle LIKE 'cba%'. Similar to Brad's approach - ie you do the next difficult query only if required.

Also, if the 80-20 rules applies to your keywords/ substrings (ie if most of the searches are on a minority of the keywords), then you can also consider doing some sort of caching. For eg: say you find that many users search for the keyword "abc" and this keyword search returns records with ids 20, 22, 24, 25 - you can store this in a separate table and have this indexed. And now when someone searches for a new keyword, first look in this "cache" table to see if the search was already performed by an earlier user. If so, no need to look again in main table. Simply return results from "cache" table.

You can also combine the above with SQL Server TextSearch. (assuming you have a valid reason not to use it). But you could nevertheless use Text search first to shortlist the result set. and then run a SQL query against your table to get exact results using the Ids returned by the TExt Search as a parameter along with your keyword.

All this is obviously assuming you have to use SQL. If not, you can explore something like Apache Solr.

Share:
39,431

Related videos on Youtube

msergey
Author by

msergey

Updated on January 26, 2020

Comments

  • msergey
    msergey over 4 years

    I have huge table with 2 columns: Id and Title. Id is bigint and I'm free to choose type of Title column: varchar, char, text, whatever. Column Title contains random text strings like "abcdefg", "q", "allyourbasebelongtous" with maximum of 255 chars.

    My task is to get strings by given substring. Substrings also have random length and can be start, middle or end of strings. The most obvious way to perform it:

    SELECT * FROM t LIKE '%abc%'
    

    I don't care about INSERT, I need only to do fast selects. What can I do to perform search as fast as possible?

    I use MS SQL Server 2008 R2, full text search will be useless, as far as I see.

    • paxdiablo
      paxdiablo almost 13 years
      Welcome to the wonderful world of incredibly poor database performance :-)
    • Adir D
      Adir D almost 13 years
      Why will full text search be useless?
    • sgtz
      sgtz almost 13 years
      could substrings be tokens? If you can split words by space, comma, or hyphen, I have an idea. Let me know.
    • Magnus
      Magnus almost 13 years
      How many rows does the table have?
    • msergey
      msergey almost 13 years
      Strings in Title are not sentences and they don't have any word boundaries. I wonder if full text search still help here.
    • JeffO
      JeffO almost 13 years
      How poorly is querying a properly indexed table with a single LIKE clause on a 255 character column performing?
    • MatBailie
      MatBailie almost 13 years
      @Jeff O - With a LIKE '%<anything>%' query, there is no such thing as 'properly indexed'. No index will ever be usable, simply because of the first %.
    • JeffO
      JeffO almost 13 years
      @Dems - Just thought I'd avoid a performance problem due to another query.
  • sgtz
    sgtz almost 13 years
    make that table a clustered non-unique index... that's probably as good as you are going to get with raw SQL.
  • msergey
    msergey almost 13 years
    Given substring most likely in the middle of Title, but I'll try to measure performance with your approach.
  • msergey
    msergey almost 13 years
    Thanks, I'm unable to try your and Randy's solution right now, but I'll try it ASAP.
  • JeffO
    JeffO almost 13 years
    Or a Clustered Index Scan. Probably no major difference.
  • topski
    topski almost 13 years
    @Jeff - That's just semantics. A Clustered Index Scan is simply a "full table scan" on a table that has a clustered index (which most tables should anyway). Either way, it has to read every record.
  • Uğur Gümüşhan
    Uğur Gümüşhan over 12 years
    I find the N-Ary tree method in other reply hilarious.