Lucene query: bla~* (match words that start with something fuzzy), how?

14,009

Solution 1

I do not believe Lucene supports anything like this, nor do I believe it has a trivial solution.

"Fuzzy" searches do not operate on a fixed number of characters. bla~ may for example match blah and so it must consider the entire term.

What you could do is implement a query expansion algorithm that took the query bla~* and converted it into a series of OR queries

bla* OR blb* OR blc OR .... etc.

But that is really only viable if the string is very short or if you can narrow the expansion based on some rules.

Alternatively if the length of the prefix is fixed you could add a field with the substrings and perform the fuzzy search on that. That would give you what you want, but will only work if your use case is sufficiently narrow.

You don't specify exactly why you need this, perhaps doing so will elicit other solutions.

One scenario I can think of is dealing with different form of words. E.g. finding car and cars.

This is easy in English as there are word stemmers available. In other languages it can be quite difficult to implement word stemmers, if not impossible.

In this scenario you can however (assuming you have access to a good dictionary) look up the search term and expand the search programmatically to search for all forms of the word.

E.g. a search for cars is translated into car OR cars. This has been applied successfully for my language in at least one search engine, but is obviously non-trivial to implement.

Solution 2

in the development trunk of lucene (not yet a release), there is code to support use cases like this, via AutomatonQuery. Warning: the APIs might/will change before its released, but it gives you the idea.

Here is an example for your case:

// a term representative of the query, containing the field. 
// the term text is not so important and only used for toString() and such
Term term = new Term("yourfield", "bla~*");

// builds a DFA that accepts all strings within an edit distance of 2 from "bla"
Automaton fuzzy = new LevenshteinAutomata("bla").toAutomaton(2);

// concatenate this DFA with another DFA equivalent to the "*" operator
Automaton fuzzyPrefix = BasicOperations.concatenate(fuzzy, BasicAutomata.makeAnyString());

// build a query, search with it to get results.
AutomatonQuery query = new AutomatonQuery(term, fuzzyPrefix);

Solution 3

It's for an address search service, where I want to suggest addresses based on partially typed and possibly mistyped streetnames/citynames/etc (any combination). (think ajax, users typing partial street addresses in a text field)

For this case the suggested query expansion is perhaps not so feasible, as the partial string (street address) may become longer than "short" :)

Normalization

One possibility I can think of is to use string "normalization", instead of fuzzy searches, and simply combine that with wildcard queries. A street address of

"miklabraut 42, 101 reykjavík", would become "miklabrat 42 101 rekavik" when normalized.

So, building index like this:

1) build the index with records containing "normalized" versions of street names, city names etc, with one street address per document (1 or several fields).

And search the index like this:

2) Normalize inputstrings (e.g. mikl reyk) used to form the queries (i.e. mik rek). 3) use the wildcard op to perform the search (i.e. mik* AND rek*), leaving the fuzzy part out.

That would fly, provided the normalization algorithm is good enough :)

Share:
14,009
Pimin Konstantin Kefaloukos
Author by

Pimin Konstantin Kefaloukos

Current problems: Data analytics, data science, machine learning, prediction. Current technologies: Apache Spark, Python, Amazon Web Services.

Updated on July 19, 2022

Comments

  • Pimin Konstantin Kefaloukos
    Pimin Konstantin Kefaloukos almost 2 years

    In the Lucene query syntax I'd like to combine * and ~ in a valid query similar to: bla~* //invalid query

    Meaning: Please match words that begin with "bla" or something similar to "bla".

    Update: What I do now, works for small input, is use the following (snippet of SOLR schema):

    <fieldtype name="text_ngrams" class="solr.TextField">
      <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    

    In case you don't use SOLR, this does the following.

    Indextime: Index data by creating a field containing all prefixes of my (short) input.

    Searchtime: only use the ~ operator, as prefixes are explicitly present in the index.

  • JBoy
    JBoy about 14 years
    I do not believe this would accomplish what the OP is asking as it would basically become "bar~ OR bar*" which is not the same thing as "bar~*" and would not find (for example) "brafoo".
  • Pimin Konstantin Kefaloukos
    Pimin Konstantin Kefaloukos about 14 years
    Althoug fuzzy searches don't operate on a fixed number of characters, for my case simply using ~ wont work (to big diff in char count). I want to match e.g. Sunla to Sundlaugarvegur.
  • Pimin Konstantin Kefaloukos
    Pimin Konstantin Kefaloukos about 14 years
    yup, this is not what I want :)
  • Pimin Konstantin Kefaloukos
    Pimin Konstantin Kefaloukos about 14 years
    of course, if i could tell lucene to only match on the first x characters of each word in the index, using ~ would work...
  • Mikos
    Mikos about 14 years
    Ok, thanks for clarifying, one approach I have used (for matching names of proteins etc.) using string distances like Smith-Waterman, Jaro-Winkler etc. A tool like SimMetrics might be of some help dcs.shef.ac.uk/~sam/simmetrics.html
  • Mikos
    Mikos about 14 years
    You would need to go beyond Lucene here, use a string comparison algorithm like Levenstein, Jaro-Winkler etc. (qv. below)
  • Pimin Konstantin Kefaloukos
    Pimin Konstantin Kefaloukos about 14 years
    This is the second time I hear of SimMetrics, maybe it's time to give it a spin :) thank you
  • Pimin Konstantin Kefaloukos
    Pimin Konstantin Kefaloukos over 13 years
    I just came back to this question, and saw your answer again. Have you tried it? What I'm doing now (works for small input), is to generate all prefixes of my input and put the prefixes in the index. Then I only need to use the ~ operator, and get functionality like ~*
  • Robert Muir
    Robert Muir over 13 years
    Your workaround is just fine for small inputs... but as you hinted, will be a problem for large inputs: you will add a huge amount of terms and postings for all these prefixes... this will make the pre-Lucene 4.0 fuzzy query even slower as it does a linear scan of all terms.
  • wrschneider
    wrschneider about 11 years
    Is there a Lucene query syntax that gives you access to the Automaton query through Solr without coding?