using REPLACE in WHERE clause to check spelling permutations - MS SQL

35,715

Solution 1

Depending on how complex your scenario can get, this will be lots of work, and slow too. But there's a more flexible approach. Consider something like this, referred to as initialTable:

| id | lastname | firstname |
|  1 | o'malley | josé      |
|  2 | omállèy  | dònáld    |
|  3 | o'neill  | jámès     |
|  4 | onackers | sharon    |

Maybe a bit much, but it illustrates the general problem. I had to implement a "fuzzy" search for our intranet website based on character data that looked very similar - there's many accents in french or spanish names or street addresses for example.

What I did was define a function that performed all replacements for a given string, for example (pseudocode):

function string replacestuff(string input)
{
  input = replace(input, "è", "e");
  input = replace(input, "é", "e");
  input = replace(input, "ò", "o");
  input = replace(input, "ó", "o");
  input = replace(input, "'", "");
  ...
  return input;
}

Using this conversion function, create a second table fuzzyTable that has the following content:

| id | lastname | firstname |
|  1 | omalley  | jose      |
|  2 | omalley  | donald    |
|  3 | oneill   | james     |
|  4 | onackers | sharon    |

Now, assume you'll get an input string for your search of josè. This can't be found in either table. What you'll have to do is this:

declare @input varchar(50)
declare @input_mod varchar(50)
set @input = 'josè'
set @input_mod = replacestuff(@input)

SELECT id FROM initialTable WHERE firstname like @input OR firstname like @input_mod
UNION
SELECT id FROM fuzzyTable WHERE firstname like @input OR firstname like @input_mod
GROUP BY id

(Of course, you'd have to add % to make LIKE work.) The key here is to modify your input search string using the replacement function; this way you'll get a match if searching for against a content of because both come down to se when being processed by the replacement function.

You could even do a two-level search; first check only the unmodified string against the proper table and then with the statement shown above do a fuzzy search if the user says so.

This is a very flexible approach and can handle all sorts of stuff, like finding german letters ä, ö, ü, ß by using two-letter expressions ae, oe, ue, ss. The disadvantage is that you'll have to keep duplicates of some data, and change those duplicates within fuzzyTable as the initialTable (or the replacement function) changes. In our current use case, the intranet database gets updated once a night, so it's not a problem.

EDIT

You need to be aware that, using this, in some cases you'll get false positives. For example, we're using this for an employee search, and if you've got a dutch name spelled Hoek, you'd also find this name searching for Hök, because in german the replacement for ö would be oe. This could be solved using country-aware replacement functions, but we never took the concept this far. Depending on your input data this is more or less academic, for our use case I can't remember anyone complaining.

The main reason why we came up with this approach in the first place was that some of the data we had to work with was riddled with spelling errors, ie. in french many vowels were accented the wrong way around, but still we needed to deliver a result.

Solution 2

I believe the trouble you're having is that SQL-Server's repalce function doesn't accept [^A-Za-z] to mean "non-alpa characters". Instead it's actually looking for that exact string to replace it.

http://msdn.microsoft.com/en-us/library/ms186862%28v=sql.90%29.aspx


In terms of using Regular Expressions, I've only done that by using the CLR, which seems to be getting much too involved for this particular problem.


My advice would be to hold the searchable fields, in the two different formats, in the table itself. And then use a simple LIKE search.

WHERE last_name LIKE @last_name OR last_name_stripped LIKE @last_name

last_name_stripped could then be a computed column (maybe using a function to strip all non_alpha characters), or handled by your client at Insert time.

Solution 3

Using:

WHERE ( REPLACE(people.lastname, '[^A-Za-z]', '') LIKE @last_name + '%' )

or

WHERE ( ComplexFunction( field ) LIKE whatever )

will most likely have the result that your query will not use the index (if there is one) of field people.lastname and thus scan the whole table every time you run the query.

I see two ways to avoid this:

One, add another field lastnameStripped to the table, where the ComplexFunction(lastname) is stored and an index to this field. Then you can search with either:

WHERE ( lastnameStripped LIKE REPLACE(@last_name, '[^A-Za-z]', '') + '%' )

or

WHERE ( lastnameStripped LIKE @last_name + '%' )

and both will use the index of lastnameStripped.

Two, create an indexed view with the ComplexFunction( lastname ) as a field.

Solution 4

If you need to do relatively complex lookups on a column on a large table, it could be more efficient to create a second column that contains the data formatted for efficient searches (with the immediate caveat that "like" searches are rarely efficient). So where you have column LastName, add a new column like LastNameLookup, and populate that column with the data formatted appropriately for your search criteria. If the formatting rules are relatively simple, you could implement this as a computed column column; if performance is important, make it a persisted computed column.

Also to mention, SQL does not support regular expressions (though there is a limited form tied in to the LIKE clause in SQL 2008).

Solution 5

Hmm...using classic asp example. I'm guessing this is from a form. For this example I'm calling your textbox field 'namesearch'. So the page where you request.form("namesearch"), just assign strSearch = request.form("namesearch"). Then before you run it into the SQL query do something like this:

strSearch = request.form("namesearch") 'to get textbox info from form

strSearch = replace(strSearch," ", "") 'to remove spaces
strSearch = replace(strSearch,"'", "") 'to remove apostrophes

For the SQL

SELECT id, lastname, firstname FROM people WHERE people.lastname like '%"& strSearch &"%' ORDER BY lastname

Tested and works using VBScript and SQL 2005 Server

Share:
35,715
Steph Rose
Author by

Steph Rose

I'm a web developer focusing on database architecture, CMS design, and programming in PHP and Ruby on Rails.

Updated on September 15, 2020

Comments

  • Steph Rose
    Steph Rose almost 4 years

    I have a table like:

    | id | lastname | firstname |
    |  1 | doe      | john      |
    |  2 | oman     | donald    |
    |  3 | o'neill  | james     |
    |  4 | onackers | sharon    |
    

    Essentially, users are going to be searching by the first letters of the last name.

    I want to be able to return results that contain and don't contain punctuation from the database. For instance, when a user searches for: on

    I want to return both: o'neill, onackers

    I want someone to be able to search "o, on, oneill, o neill, etc" to get o'neill.

    So the best way to do this seems to take the lastname column value and have two permutations of it searched in the WHERE clause with an OR. One where any special characters are replaced with the _ in SQL, and one where all non-alpha chars (including spaces) are gone.

    I figure I can use the underscore in the SQL replace to keep the one space available.

    I'm having a little trouble with the WHERE clause. I'd prefer to do this with a simple REPLACE rather than creating a regex function if possible. If that's a no-go though, I understand:

    @last_name (this is the nvarchar input)
    
    SELECT id, lastname, firstname
    FROM people
    WHERE ((REPLACE(people.lastname, '[^A-Za-z]', '_') like @last_name + '%')
    OR (REPLACE(people.lastnname,'[^A-Za-z ]', '') like @last_name + '%'))
    ORDER BY lastname
    

    I'm pretty sure the replace part has to be on the other side of the LIKE. I'm messing up the structure but need some help.

    I am Using MSSQL Server 2005.

    Thank you so much in advance.

    UPDATE

    It seems like I have two options:

    1. Create a regular expression function using CLR (excuse me if I'm saying this wrong, I'm new to it)
    2. Create extra columns on the table or create a new "fuzzyTable" with the cleaned up last names.

    The database gets updated once a night. I have actually already begun the new table approach, as it was what I was originally going to do. However, I'm beginning to think it's smarter to add the "fuzzy" columns to the main table and then on the nightly update to add the adjusted lastnames to the new / updated rows.

    Stack Overflow: Which approach is better? User-defined REGEX function I can use in the SQL, and thus avoid extra columns? Or adding the extra column or two to the table? Or a new table?

  • Steph Rose
    Steph Rose about 13 years
    So you're essentially adjusting the string in the code before running the query. I'm using a stored procedure. I'd much rather run it all in one query. And that doesn't fix the idea that your variable strSearch (say, oneill) won't match o'neill in the database. Hence why the adjustment needs to be done to the column itself in the query.
  • MatBailie
    MatBailie about 13 years
    -1 : Agreed - Seems to miss the point; Need to search the stored lastname with and/or without non-alpha characters...
  • Steph Rose
    Steph Rose about 13 years
    I was planning on doing so -- that was EXACTLY my original intention because I thought it the best way to do it -- but the person who manages the DB was trying to avoid such. Said a replace would handle it, but I did notice it didn't look like REPLACE accepted patterns. Let me go approach the person again. I feel a tad vindicated that I was the one on the right track the first time.
  • MatBailie
    MatBailie about 13 years
    DBA's are a unique breed, don't give him hell, even though you want to. He'll cry and just get in the way forever after ;)
  • Steph Rose
    Steph Rose about 13 years
    Sounds like what I was originally thinking about doing, albeit with a bit more flexibility. Would you recommend doing this in a second table, or just a second column in the main table called, say, lastname_mod?
  • takrl
    takrl about 13 years
    It all depends on what your data looks like. If you've only got firstname and lastname, I'd probably go for extra columns within the same table - that might make updating the modified fields easier when the main content changes. In our case, we had twelve to fourteen fields that needed to be modified like this, so we went for a second table. That only had an id column, a language id, the text column and a texttype column to distinguish the 14 different text types. This approach also later on greatly simplified a fuzzy full text search on all of those fields.
  • takrl
    takrl about 13 years
    If it's just names, you could also go for a single field containing "lastname, firstname" or "firstname lastname".
  • Steph Rose
    Steph Rose about 13 years
    The main table has a nice number of columns (not my design). Probably 30-50. So second table with a UNION is probably best. I was also thinking about a firstname lastname field too. Would you keep a space between firstname lastname in the column?
  • takrl
    takrl about 13 years
    The main question then is if you really need a fuzzy search on all columns, or if you can reduce it to a subset. We had about 30 columns too but only use this method on 14 of them. That definetly helps performance, as you really need to index all of the fields to boost performance. As a side note, you could even go as far as defining different replacement functions for different fields; ie. have all space and hyphens removed from phone numbers and then match 1 2-3 against 123. It just depends on your needs.
  • Steph Rose
    Steph Rose about 13 years
    Just the lastname and possibly firstname columns. Nothing too dramatic. The rest just other info, not info we're querying by.
  • takrl
    takrl about 13 years
    Good, that makes life easier. You could even use a trigger to update the simplified fields automatically as the main content changes. We chose not do to this, but our content is static during the day and changes once per night, your case may be different.
  • Steph Rose
    Steph Rose about 13 years
    Table updates once per night using SSIS. Going to have to issue an update at the same time for this. Thanks for all your help.