Dropping noise words in SQL Server 2005 full text indexing

11,084

Solution 1

Noise words are stripped out before the indexing is stored. So it is impossible to write a query that searches on a stop word. If you REALLY want to enable this behavior, you need to edit the list of stop words. (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then re-build your index.

Solution 2

I had the same question and after a thorough search I've come to the conclusion there's no good solution.

As a compromise, I'm implementing the brute force solution:

1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all the text in there.

2) Paste into a code file in the application, replace line breaks with "," to get a List initializer like this:

public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

3) Before you submit the search string, break it into words and remove any words in the noise words like this:

List<string> goodWords = new List<string>();
string[] words = searchString.Split(' ');
foreach (string word in words)
{
   if (!_noiseWords.Contains(word))
      goodWords.Add(word);
}

Not an ideal solution, but should work as long as the noise words file doesn't change. Multiple language support would use a dictionary of lists by language.

Solution 3

Here's a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData .

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function

Solution 4

You can also remove noise words BEFORE making the query. List of language id: http://msdn.microsoft.com/en-us/library/ms190303.aspx

Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, ConnectionString, 1033)

Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String

    Dim r As New System.Text.StringBuilder
    Try
        If inputText.Contains(CChar("""")) Then
            r.Append(inputText)
        Else
            Using cn As New SqlConnection(cnStr)

                Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
                cn.Open()
                Dim cmd As New SqlCommand(q, cn)
                With cmd.Parameters
                    .Add(New SqlParameter("@q", """" & inputText & """"))
                    .Add(New SqlParameter("@l", languageID))
                End With
                Dim dr As SqlDataReader = cmd.ExecuteReader
                While dr.Read
                    If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
                        r.Append(dr.Item("display_term").ToString)
                        r.Append(" ")
                    End If
                End While
            End Using
        End If
    Catch ex As Exception
        ' ...        
    End Try
    Return r.ToString

End Function
Share:
11,084
jamiecon
Author by

jamiecon

Updated on June 09, 2022

Comments

  • jamiecon
    jamiecon about 2 years

    In a pretty typical scenario, I have a 'Search' text box on my web application which has user input passed directly to a stored procedure which then uses full text indexing to search on two fields in two tables, which are joined using appropriate keys.

    I am using the CONTAINS predicate to search the fields. Before passing the search string in, I do the following:

    SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'
    

    Changing the castle to "the*" OR "castle*", for example. This is necessary because I want people to be able to search on cas and get results for castle.

    WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
    

    The problem is that now that I have appended a wildcard to the end of each word, noise words (e.g. the) also have a wildcard appended and therefore no longer appear to get dropped. This means that a search for the castle will return items with words such as theatre etc.

    Changing OR to AND was my first thought, but that appears to simply return no matches if a noise word is then used in the query.

    All I am trying to achieve is to allow the user to enter multiple, space separated words that respresent either the entirety or a prefix of the words they are searching on, in any order - and drop noise words such as the from their input (otherwise when they search for the castle they get a big list of items with the result they need somewhere in the middle of the list.

    I could go ahead and implement my own noise word removal procedure, but it seems like something that full text indexing ought to be able to handle.

    Grateful for any help!

    Jamie