Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working

html elasticsearch filter full-text-search mapping

10,081

You seem to expect the "_source" field in the response to contain the document as it was analyzed and indexed. That expectation is incorrect: _source is returned exactly as it was submitted, whatever analyzer the field uses.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
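To make the distinction concrete, here is a toy in-memory model in Python. It is only a sketch and not how Elasticsearch is implemented: ToyIndex, analyze and html_strip are made-up names, and the regular expression is a crude stand-in for the real html_strip char filter. It shows that the analyzed tokens decide what is searchable, while the stored source comes back unchanged.

```python
import html
import re


def html_strip(text):
    """Crude stand-in for a char filter: drop tags, then decode entities."""
    return html.unescape(re.sub(r"<[^>]*>", " ", text))


def analyze(text):
    """Stand-in for a custom analyzer: char filter first, then split into word tokens."""
    return re.findall(r"\w+", html_strip(text))


class ToyIndex:
    """Hypothetical toy index: keeps the original document and a token lookup separately."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, stored verbatim
        self.postings = {}  # token -> set of doc ids containing that token

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc  # the "source" is never modified
        for token in analyze(doc["body"]):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, term):
        # Matching uses the analyzed tokens; the result is the stored original.
        return [self.sources[i] for i in sorted(self.postings.get(term, ()))]


idx = ToyIndex()
idx.index("1", {"body": "Title <p>Some d&eacute;j&agrave; vu "
                        "<a href='http://somedomain.com'>website</a></p>"})

hits = idx.search("website")   # found via the analyzed tokens
no_hits = idx.search("href")   # tag content was stripped before tokenizing
```

In this model hits[0]["body"] still contains the markup, even though searching for "href" finds nothing.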

Ideally, when you want to format the source data for presentation purposes, as in this case, that should be done on the client side.

That said, one way to achieve it for this use case is to use script fields and the keyword tokenizer, as follows:

PUT test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_html_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": [
                  "html_strip"
               ]
            },
            "parsed_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "char_filter": [
                  "html_strip"
               ]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "body": {
               "type": "string",
               "analyzer": "my_html_analyzer",
               "fields": {
                  "parsed": {
                     "type": "string",
                     "analyzer": "parsed_analyzer"
                  }
               }
            }
         }
      }
   }
}


PUT test/test/1 
{
    "body" : "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}

GET test/_search
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "body.parsed"
        }
    }
  }
}

Result:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
      "terms": [
         "Title \n Some déjà vu  website   this is inline \n "
      ]
   }
}

Note: I believe the above is a bad idea, since stripping the HTML tags can easily be achieved on the client side, where you would have much more control over formatting than with a workaround such as this. More importantly, it may be more performant to do it on the client side.
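As a rough illustration of the client-side route, the following Python sketch strips tags and decodes entities using only the standard library's html.parser module. The names strip_html and _TextExtractor are hypothetical, and collapsing whitespace at the end is an assumption about the desired presentation rather than anything Elasticsearch does.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; in text nodes.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    """Return the text content of markup with whitespace runs collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered by the parser
    return " ".join(parser.text().split())


body = ("Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> "
        "website </a> <span> this is inline </span></p> ")
clean = strip_html(body)
```

Applied to the "body" value returned in _source, this leaves the index untouched and keeps formatting decisions in the client.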


DaddyMoe


Updated on July 01, 2022

Comments

DaddyMoe almost 2 years

Given I have specified my html_strip char filter in my custom analyser

When I index a document with HTML content

Then I expect the HTML to be stripped out of the indexed content

And on retrieval the returned doc from the index should not contain HTML

ACTUAL: The indexed doc contained HTML. The retrieved doc contained HTML.

I have tried specifying the analyzer as index_analyzer, as one would expect, and, out of desperation, a few others: search_analyzer and analyzer. None seem to have any effect on the doc being indexed or retrieved.

Test Doc Indexing against HTML_Strip Analysed field :

REQUEST : Example POST document with html content

POST /html_poc_v2/html_poc_type/02
{
  "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}

Expected: indexed data to have been parsed through the HTML analyser. Actual: data is indexed with HTML.

RESPONSE

{
   "_index": "html_poc_v2",   "_type": "html_poc_type",   "_id": "02", ...
   "_source": {
      "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
   }
}

Settings and Doc Mapping

PUT /html_poc_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    },
    "mappings": {
      "html_poc_type": {
        "properties": {
          "body": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "description": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "title": {
            "type": "string",
            "search_analyser": "my_html_analyzer"
          },
          "urlTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}

Test to prove the custom analyser works perfectly:

REQUEST

GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
{<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>}

Response

{
   "tokens": [
      {
         "token": "Some",… "position": 1
      },
      {
         "token": "déjà",… "position": 2
      },
      {
         "token": "vu",…  "position": 3
      },
      {
         "token": "website",… "position": 4
      }
   ]
}

Under the hood

Going under the hood with an inline script further proves that my HTML analyser must have been skipped.

REQUEST

GET /html_poc_v2/html_poc_type/_search?pretty=true
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "title"
        }
    }
  }
}

RESPONSE

{ …
   "hits": { ..
      "hits": [
         {
            "_index": "html_poc_v2",
            "_type": "html_poc_type",
            …
            "fields": {
               "terms": [
                  [
                     "a",
                     "agrave",
                     "d",
                     "eacute",
                     "href",
                     "http",
                     "j",
                     "p",
                     "some",
                     "somedomain.com",
                     "title",
                     "vu",
                     "website"
                  ]
               ]
            }
         }
      ]
   }
}

Similar to this question here : Why HTML tag is searchable even if it was filtered in elastic search

I have also read this amazing doc : https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

ES version : 1.7.2

Please Help.