Changing the default analyzer in ElasticSearch or LogStash
Solution 1
As you know, Elasticsearch uses the standard analyzer when no analyzer is specified explicitly. So when setting up your templates, you can register a custom analyzer under that name (standard) and define your own rules there: the analyzer type, tokenizer, and token filters.
Here is a helpful link that will help you understand this better:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html
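As a sketch of that approach, a legacy index template (the `_template` API, removed in favor of composable templates in recent Elasticsearch versions) registering a custom analyzer might look like the following; the template name, index pattern, and analyzer definition are placeholders, and note that, as Solution 2 below explains, most Elasticsearch versions expect the override to be registered under the name default rather than standard:

```json
PUT _template/my_defaults
{
  "template": "*",
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Because the template matches every new index (pattern `*`), any index created afterwards picks up this analyzer without per-index configuration.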
Solution 2
According to this page, analyzers can be specified per-query, per-field, or per-index.
At index time, Elasticsearch will look for an analyzer in this order:

- The analyzer defined in the field mapping.
- An analyzer named default in the index settings.
- The standard analyzer.
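For illustration, the first case (an analyzer pinned to a single field in the mapping) could look like this; the index name and field name are hypothetical, and note that a mapping with properties directly under mappings is the 7.x+ typeless form, while older versions require a type name in between:

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
```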
At query time, there are a few more layers:

- The analyzer defined in a full-text query.
- The search_analyzer defined in the field mapping.
- The analyzer defined in the field mapping.
- An analyzer named default_search in the index settings.
- An analyzer named default in the index settings.
- The standard analyzer.
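The search_analyzer layer exists because it is sometimes useful to analyze a field differently at index time and at search time (the classic example being edge n-grams for autocomplete at index time, with plain analysis for queries). A hedged sketch, where "autocomplete" stands for a custom analyzer that would have to be defined under the index's analysis settings:

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
```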
On the other hand, this page points out an important detail:
An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived.
So the only way to define a custom analyzer as the default is to override one of the pre-defined analyzers, in this case the analyzer named default. That means we cannot use an arbitrary name for our analyzer; it must be named default.
Here is a simple example of such an index setting (note the analyzer name, which must be default):
```json
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "char_filter": {
        "charMappings": {
          "type": "mapping",
          "mappings": [
            "\\u200C => "
          ]
        }
      },
      "filter": {
        "persian_stop": {
          "type": "stop",
          "stopwords_path": "stopwords.txt"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "char_filter": [
            "charMappings"
          ],
          "filter": [
            "lowercase",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
```
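To verify which analyzer actually applies, the _analyze API can be pointed at the index (the index name and sample text here are placeholders). With no analyzer or field specified, the request runs the index's default analyzer, so the returned tokens show exactly how incoming documents will be split:

```json
GET my_index/_analyze
{
  "text": "OS X 10.8"
}
```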
Brian Hicks
Updated on July 26, 2022

Comments
- Brian Hicks, over 1 year ago:
I've got data coming in from Logstash that's being analyzed in an overeager manner. Essentially, the field "OS X 10.8" would be broken into "OS", "X", and "10.8". I know I could just change the mapping and re-index for existing data, but how would I change the default analyzer (either in ElasticSearch or LogStash) to avoid this problem in future data?
Concrete solution: I created a mapping for the type before I sent data to the new cluster for the first time.
Solution from IRC: Create an Index Template
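As a sketch of that suggestion for the "OS X 10.8" case, a legacy template targeting Logstash indices could mark the field as not analyzed so it is stored as a single token. The template name and field name ("os") are hypothetical, and this uses pre-5.x conventions (string type, not_analyzed, and the _default_ mapping); on 5.x and later the equivalent is "type": "keyword", and _default_ mappings were later removed:

```json
PUT _template/logstash_strings
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "os": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
```

Because the template is applied when an index is created, it fixes future daily logstash-* indices without touching existing data.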