ElasticSearch - Searching For Human Names


First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543

If you go there, switch to the "Analysis" tab to see how the text is transformed.

Note, for example, that Heaney is tokenized as [hn, heanei] by the search_analyzer but as [HN, heanei] by the index_analyzer. The metaphone term differs in case, because the search_analyzer runs lowercase after my_metaphone while the index_analyzer runs it before. Thus, the metaphone term never matches.
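To make the effect of filter order concrete, here is a toy simulation of the two chains. It is only a sketch: `toy_phonetic` and `toy_stem` are hypothetical stand-ins I made up for illustration, not the real metaphone encoder or Porter stemmer, and the other filters (standard, stop, asciifolding, my_delimiter, synonym) are left out on the assumption that they do not change this token.

```python
VOWELS = set("aeiouy")


def toy_phonetic(token):
    """Hypothetical stand-in for a phonetic encoder (NOT real metaphone):
    first letter plus the later non-vowel letters, upper-cased."""
    lowered = token.lower()
    return (lowered[:1] + "".join(c for c in lowered[1:] if c not in VOWELS)).upper()


def phonetic_filter(tokens):
    # Mirrors "replace": false, i.e. emit the code and keep the original token.
    out = []
    for token in tokens:
        out.extend([toy_phonetic(token), token])
    return out


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def toy_stem(tokens):
    # Crude stand-in for the stemmer: only rewrites a trailing "y" to "i".
    return [t[:-1] + "i" if len(t) > 2 and t.endswith("y") else t for t in tokens]


def index_chain(text):
    # Order taken from index_analyzer: lowercase, porter_stem, my_metaphone.
    return phonetic_filter(toy_stem(lowercase_filter([text])))


def search_chain(text):
    # Order taken from search_analyzer: my_metaphone, lowercase, porter_stem.
    return toy_stem(lowercase_filter(phonetic_filter([text])))


print(index_chain("Heaney"))   # ['HN', 'heanei']
print(search_chain("Heaney"))  # ['hn', 'heanei']
```

The phonetic code comes out upper-case when the phonetic filter runs last, and lower-case when lowercase runs after it, which is the mismatch described above.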

The fuzzy query does not do query-time text analysis. Thus, you end up comparing Heavey with heanei. These differ in three places (H/h, v/n, y/i), which is a larger Damerau-Levenshtein distance than your parameters allow.
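If you want to check such distances by hand, the following is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. The function name `osa_distance` is my own, and I am assuming this variant is close enough to what Lucene computes for the purpose of illustration.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions
    and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance("Heavey", "heanei"))  # 3: unanalyzed query term vs. indexed term
print(osa_distance("heavey", "heaney"))  # 1: both lower-cased, no stemming
```

The comparison is case-sensitive, so the capital H alone costs one edit when the query term is not analyzed.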

What you really want is to use the fuzzy functionality of the match query. Match does do query-time text analysis, and it has a fuzziness parameter.

As for the fuzziness, this changed a bit in Lucene 4. Before, it was typically specified as a float. Now it should be specified as the allowed edit distance. There's an outstanding pull request to clarify that: https://github.com/elasticsearch/elasticsearch/pull/4332/files
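As a rough sketch of what that looks like, the helper below builds a match request body with fuzziness given as a whole number of edits. The helper name `fuzzy_match` and the range check are my own additions; the body is only printed, and sending it to your cluster is left out.

```python
import json

MAX_EDITS = 2  # largest edit distance that Lucene's fuzzy matching supports


def fuzzy_match(field, text, max_edits=1):
    """Build a match query body with fuzziness as an allowed edit distance."""
    if not isinstance(max_edits, int) or not 0 <= max_edits <= MAX_EDITS:
        raise ValueError("fuzziness must be an integer edit distance from 0 to 2")
    return {
        "query": {
            "match": {
                field: {
                    "query": text,           # analyzed at query time
                    "fuzziness": max_edits,  # an edit distance, not a float similarity
                }
            }
        }
    }


print(json.dumps(fuzzy_match("pty_surname", "Heaney"), indent=4))
```

Passing an old-style float such as 0.2 raises an error here, which is a deliberate choice in this sketch to catch values carried over from min_similarity.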

The reason you are getting people without the forename Michael is that you are using bool.should, which has OR semantics. It is sufficient that one clause matches, although the more clauses that match, the higher the score.
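One possible way to require the forename, sketched here as a Python dict, is to move that clause into `must` and keep the surname alternatives in a nested `should`. This restructuring is my suggestion rather than something taken from your query, and the helper name `name_query` is hypothetical; the field names come from your mapping.

```python
import json


def name_query(forename, surname):
    """Require the forename; accept any one of the surname alternatives."""
    surname_alternatives = {
        "bool": {
            # A bool query with only should clauses needs at least one to match.
            "should": [
                {"match": {"pty_surname": {"query": surname}}},
                {"match": {"pty_surname": {"query": surname, "fuzziness": 1}}},
            ]
        }
    }
    return {
        "query": {
            "bool": {
                # Every must clause has to match, so both the forename and
                # one of the surname alternatives are required.
                "must": [
                    {"match": {"pty_forename": {"query": forename}}},
                    surname_alternatives,
                ]
            }
        }
    }


print(json.dumps(name_query("Michael", "Heaney"), indent=4))
```

Documents matching more of the surname alternatives still score higher, but a document without a matching forename is no longer returned.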

Lastly, combining all that filtering into the same term is not necessarily the best approach. For example, you cannot recognize exact spellings and boost them. Consider using a multi_field to process the field in several ways.

Here's an example you can play with, with the curl commands to recreate it below. I'd skip using the "porter" stemmer entirely for this, however. I kept it just to show how multi_field works. Using a combination of match, match with fuzziness and phonetic matching should get you far. (Make sure you don't allow fuzziness when you do phonetic matching - or you'll get uselessly fuzzy matching. :-)

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "text": [
                "Michael",
                "Heaney",
                "Heavey"
            ],
            "analyzer": {
                "metaphone": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "my_metaphone"
                    ]
                },
                "porter": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "porter_stem"
                    ]
                }
            },
            "filter": {
                "my_metaphone": {
                    "encoder": "metaphone",
                    "replace": false,
                    "type": "phonetic"
                }
            }
        }
    },
    "mappings": {
        "jr": {
            "properties": {
                "pty_surname": {
                    "type": "multi_field",
                    "fields": {
                        "pty_surname": {
                            "type": "string",
                            "analyzer": "simple"
                        },
                        "metaphone": {
                            "type": "string",
                            "analyzer": "metaphone"
                        },
                        "porter": {
                            "type": "string",
                            "analyzer": "porter"
                        }
                    }
                }
            }
        }
    }
}'


# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heaney"}
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heavey"}
'

# Do searches

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "bool": {
            "should": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey",
                                        "fuzziness": 1
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.metaphone": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.porter": {
                                        "query": "heavey"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
'
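As a variation on the search above, here is a sketch of how the unmodified spelling could be boosted over the fuzzy and phonetic clauses. It assumes the sub-field is reachable as pty_surname.metaphone; the helper name `surname_query` and the boost value of 3.0 are arbitrary choices of mine that you would need to tune.

```python
import json


def surname_query(surname, exact_boost=3.0):
    """Combine plain, fuzzy and phonetic matching in a single should list."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Plain match on the main field, boosted so that
                    # unmodified spellings rank first.
                    {"match": {"pty_surname": {"query": surname, "boost": exact_boost}}},
                    # Fuzzy match on the main field, one edit allowed.
                    {"match": {"pty_surname": {"query": surname, "fuzziness": 1}}},
                    # Phonetic match, deliberately without fuzziness.
                    {"match": {"pty_surname.metaphone": {"query": surname}}},
                ]
            }
        }
    }


print(json.dumps(surname_query("heavey"), indent=4))
```

A single level of bool.should is used here; each clause contributes to the score independently, so a document with the same spelling as the query matches all three.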
Author by

Nathan Smith

Updated on June 03, 2022

Comments

  • Nathan Smith
    Nathan Smith almost 2 years

    I have a large database of names, primarily from Scotland. We're currently producing a prototype to replace an existing piece of software which carries out the search. This is still in production and we're aiming to get our results as close as possible to the current results of the same search.

    I was hoping someone could help me out. I am entering a search into Elasticsearch for "Michael Heaney", and I get some wild results. The current search returns two main surnames, "Heaney" and "Heavey", all with the forename "Michael". I can get the "Heaney" results in Elasticsearch, but I can't obtain "Heavey", and ES also returns people without the forename "Michael"; I appreciate that this is due to it being part of the fuzzy query. I know this is a narrow use case, as it's only one search, but getting this result and knowing how to obtain it will help.

    Thanks.

    Mapping

    {
       "jr": {
        "_all": {
            "enabled": true,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer"
        },
        "properties": {
            "pty_forename": {
                "type": "string",
                "index": "analyzed",
                "boost": 2,
                "index_analyzer": "index_analyzer",
                "search_analyzer": "search_analyzer",
                "store": "yes"
            },
            "pty_full_name": {
                "type": "string",
                "index": "analyzed",
                "boost": 4,
                "index_analyzer": "index_analyzer",
                "search_analyzer": "search_analyzer",
                "store": "yes"
            },
            "pty_surname": {
                "type": "string",
                "index": "analyzed",
                "boost": 4,
                "index_analyzer": "index_analyzer",
                "search_analyzer": "search_analyzer",
                "store": "yes"
            }
         }
       }
    }
    

    Index Settings

    {
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "index_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "my_delimiter",
                        "lowercase",
                        "stop",
                        "asciifolding",
                        "porter_stem",
                        "my_metaphone"
                    ]
                },
                "search_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "my_metaphone",
                        "synonym",
                        "lowercase",
                        "stop",
                        "asciifolding",
                        "porter_stem"
                    ]
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms_path": "synonyms/synonyms.txt"
                },
                "my_delimiter": {
                    "type": "word_delimiter",
                    "generate_word_parts": true,
                    "catenate_words": false,
                    "catenate_numbers": false,
                    "catenate_all": false,
                    "split_on_case_change": false,
                    "preserve_original": false,
                    "split_on_numerics": false,
                    "stem_english_possessive": false
                },
                "my_metaphone": {
                    "type": "phonetic",
                    "encoder": "metaphone",
                    "replace": false
                }
            }
         }
       }
    }
    

    Fuzzy

    {
    "from":0, "size":100,
    "query": {
        "bool": {
            "should": [
                {
                    "fuzzy": {
                        "pty_surname": {
                            "min_similarity": 0.2,
                            "value": "Heaney",
                            "prefix_length": 0,
                            "boost": 5
                        }
                    }
                },
                {
                    "fuzzy": {
                        "pty_forename": {
                            "min_similarity": 1,
                            "value": "Michael",
                            "prefix_length": 0,
                            "boost": 1
                        }
                    }
                }
            ]
         }
      }
    }
    
  • Nathan Smith
    Nathan Smith over 10 years
    Thank you, Alex. Let me get my head round all this information and I'll report back. Answer looks very thorough.
  • Alex Brasetvik
    Alex Brasetvik over 10 years
    We just published an article on fuzzy search which may also be of interest: found.no/foundation/fuzzy-search
  • Nathan Smith
    Nathan Smith over 10 years
    Will bookmark that. Thanks a lot for your help, I've learnt a lot.
  • Arthur
    Arthur about 6 years
    I don't understand why you need the two layers of should and bool?
  • James Daily
    James Daily almost 5 years
    All of the found.no links are dead