Changing the default analyzer in ElasticSearch or LogStash

12,191

Solution 1

As you know, elasticsearch uses standard analyzer when no analyzer is specified explicitly. So while setting the templates, you can set your custom analyzer which is named as standard. And there you can set you own rules of setting analyzer, tokenzier, token filters.

Here are some helpful links that will help you understand better:

http://elasticsearch-users.115913.n3.nabble.com/How-we-can-change-Elasticsearch-default-analyzer-td4040411.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html

Solution 2

According this page analyzers can be specified per-query, per-field or per-index.

At index time, Elasticsearch will look for an analyzer in this order:

  • The analyzer defined in the field mapping.
  • An analyzer named default in the index settings.
  • The standard analyzer.

At query time, there are a few more layers:

  • The analyzer defined in a full-text query.
  • The search_analyzer defined in the field mapping.
  • The analyzer defined in the field mapping.
  • An analyzer named default_search in the index settings.
  • An analyzer named default in the index settings.
  • The standard analyzer.

On the other hand, this page point to important thing:

An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived.

So the only way to define a custom analyzer as default is overriding one of pre-defined analyzers, in this case the default analyzer. it means we can not use an arbitrary name for our analyzer, it must be named default

here a simple example of index setting:

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "char_filter": {
        "charMappings": {
          "type": "mapping",
          "mappings": [
            "\\u200C => "
          ]
        }
      },
      "filter": {
        "persian_stop": {
          "type": "stop",
          "stopwords_path": "stopwords.txt"
        }
      },
      "analyzer": {
        "default": {<--------- analyzer name must be default
          "tokenizer": "standard",
          "char_filter": [
            "charMappings"
          ],
          "filter": [
            "lowercase",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
Share:
12,191
Brian Hicks
Author by

Brian Hicks

Updated on July 26, 2022

Comments

  • Brian Hicks
    Brian Hicks over 1 year

    I've got data coming in from Logstash that's being analyzed in an overeager manner. Essentially, the field "OS X 10.8" would be broken into "OS", "X", and "10.8". I know I could just change the mapping and re-index for existing data, but how would I change the default analyzer (either in ElasticSearch or LogStash) to avoid this problem in future data?

    Concrete Solution: I created a mapping for the type before I sent data to the new cluster for the first time.

    Solution from IRC: Create an Index Template