Dropping noise words in SQL Server 2005 full text indexing
Solution 1
Noise words are stripped out before the indexing is stored. So it is impossible to write a query that searches on a stop word. If you REALLY want to enable this behavior, you need to edit the list of stop words. (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then re-build your index.
Solution 2
I had the same question and after a thorough search I've come to the conclusion there's no good solution.
As a compromise, I'm implementing the brute force solution:
1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all the text in there.
2) Paste into a code file in the application, replace line breaks with "," to get a List initializer like this:
public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };
3) Before you submit the search string, break it into words and remove any words in the noise words like this:
List<string> goodWords = new List<string>();
string[] words = searchString.Split(' ');
foreach (string word in words)
{
if (!_noiseWords.Contains(word))
goodWords.Add(word);
}
Not an ideal solution, but should work as long as the noise words file doesn't change. Multiple language support would use a dictionary of lists by language.
Solution 3
Here's a working function. The file noiseENU.txt
is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData
.
Public Function StripNoiseWords(ByVal s As String) As String
Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
Return Result
End Function
Solution 4
You can also remove noise words BEFORE making the query. List of language id: http://msdn.microsoft.com/en-us/library/ms190303.aspx
Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, ConnectionString, 1033)
Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String
Dim r As New System.Text.StringBuilder
Try
If inputText.Contains(CChar("""")) Then
r.Append(inputText)
Else
Using cn As New SqlConnection(cnStr)
Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
cn.Open()
Dim cmd As New SqlCommand(q, cn)
With cmd.Parameters
.Add(New SqlParameter("@q", """" & inputText & """"))
.Add(New SqlParameter("@l", languageID))
End With
Dim dr As SqlDataReader = cmd.ExecuteReader
While dr.Read
If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
r.Append(dr.Item("display_term").ToString)
r.Append(" ")
End If
End While
End Using
End If
Catch ex As Exception
' ...
End Try
Return r.ToString
End Function
jamiecon
Updated on June 09, 2022Comments
-
jamiecon about 2 years
In a pretty typical scenario, I have a 'Search' text box on my web application which has user input passed directly to a stored procedure which then uses full text indexing to search on two fields in two tables, which are joined using appropriate keys.
I am using the CONTAINS predicate to search the fields. Before passing the search string in, I do the following:
SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'
Changing the castle to "the*" OR "castle*", for example. This is necessary because I want people to be able to search on cas and get results for castle.
WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
The problem is that now that I have appended a wildcard to the end of each word, noise words (e.g. the) also have a wildcard appended and therefore no longer appear to get dropped. This means that a search for the castle will return items with words such as theatre etc.
Changing OR to AND was my first thought, but that appears to simply return no matches if a noise word is then used in the query.
All I am trying to achieve is to allow the user to enter multiple, space separated words that respresent either the entirety or a prefix of the words they are searching on, in any order - and drop noise words such as the from their input (otherwise when they search for the castle they get a big list of items with the result they need somewhere in the middle of the list.
I could go ahead and implement my own noise word removal procedure, but it seems like something that full text indexing ought to be able to handle.
Grateful for any help!
Jamie