Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working

You are expecting the "_source" field in the response to contain the analyzed, indexed form of the document. It does not: "_source" always returns the original document exactly as it was submitted, while the analyzer only affects the terms written to the index.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
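To make the distinction concrete, here is a toy in-memory model in Python. It is not how Elasticsearch is implemented; `ToyIndex` and its methods are hypothetical names, and the regex-based `analyze` is only a rough stand-in for html_strip plus a standard-like tokenizer. It shows that matching uses the analyzed terms while hits hand back the untouched source.

```python
import re
from html import unescape


class ToyIndex:
    """Toy model of the _source vs. indexed-terms split (not Elasticsearch)."""

    def __init__(self):
        self.sources = {}   # doc id -> original body, stored verbatim
        self.postings = {}  # term -> set of doc ids, consulted when searching

    @staticmethod
    def analyze(text):
        # Rough stand-in for html_strip + a standard-like tokenizer.
        no_tags = re.sub(r"<[^>]*>", " ", text)
        return re.findall(r"\w+", unescape(no_tags))

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc  # analysis never modifies the stored source
        for value in doc.values():
            for term in self.analyze(value):
                self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Matching uses analyzed terms; results return the original source.
        return [self.sources[i] for i in sorted(self.postings.get(term, ()))]


idx = ToyIndex()
idx.index("1", {"body": "Body <p>Some d&eacute;j&agrave; vu <a href='http://somedomain.com'>website</a>"})
hits = idx.search("déjà")
```

In this sketch a search for "href" finds nothing because the tag never reached the term list, yet `hits[0]["body"]` still contains the original markup.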

Ideally, when you want to format the source data for presentation, as in this case, that should be done on the client side.

That said, one way to achieve it for this use case is to combine script fields with the keyword tokenizer, as follows:

PUT test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_html_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": [
                  "html_strip"
               ]
            },
            "parsed_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "char_filter": [
                  "html_strip"
               ]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "body": {
               "type": "string",
               "analyzer": "my_html_analyzer",
               "fields": {
                  "parsed": {
                     "type": "string",
                     "analyzer": "parsed_analyzer"
                  }
               }
            }
         }
      }
   }
}


PUT test/test/1 
{
    "body" : "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}

GET test/_search
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "body.parsed"
        }
    }
  }
}

Result:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
      "terms": [
         "Title \n Some déjà vu  website   this is inline \n "
      ]
   }
}
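The result holds a single value because the keyword tokenizer emits the whole (already stripped) input as one token, whereas the standard tokenizer splits it into words. The Python sketch below only mimics that difference; the helper names are hypothetical and the regex stripping is a crude approximation of html_strip, so the exact whitespace will not match what Elasticsearch returns.

```python
import re
from html import unescape


def strip_html(text):
    # Crude stand-in for the html_strip char filter: drop tags, decode entities.
    return unescape(re.sub(r"<[^>]*>", " ", text))


def standard_like_tokens(text):
    # Splits the stripped text into word tokens, roughly like "standard".
    return re.findall(r"\w+", strip_html(text))


def keyword_like_tokens(text):
    # The "keyword" tokenizer keeps the entire input as a single token.
    return [strip_html(text)]


sample = "Title <p> Some d&eacute;j&agrave; vu </p>"
words = standard_like_tokens(sample)
whole = keyword_like_tokens(sample)
```

Here `words` is a list of separate terms, while `whole` contains one string with the markup removed, which is why the "body.parsed" sub-field can serve as a stripped copy of the text.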

Note: I believe the above is a bad idea, since stripping the HTML tags can easily be done on the client side, where you have much more control over formatting than with a workaround such as this. More importantly, doing it on the client side may perform better.
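As a minimal sketch of the client-side approach, assuming a Python client, the standard library's `html.parser` can remove tags and decode entities from the "_source" value after it is fetched. `TextExtractor` and `strip_tags` are hypothetical helper names, and collapsing whitespace is just one possible formatting choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse the whitespace runs left behind by removed tags.
    return " ".join("".join(parser.parts).split())


source_body = (
    "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> "
    "website </a> <span> this is inline </span></p> "
)
clean = strip_tags(source_body)
```

With this in place the index mapping needs no extra sub-field, and the client decides how block elements, links, and whitespace should be rendered.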

Author: DaddyMoe

Updated on July 01, 2022

Comments

  • DaddyMoe
    DaddyMoe almost 2 years

    Given I have specified the html_strip char filter in my custom analyser

    When I index a document with HTML content

    Then I expect the HTML to be stripped out of the indexed content

    And on retrieval the document returned from the index should not contain HTML

    ACTUAL: The indexed doc contained HTML. The retrieved doc contained HTML.

    I have tried specifying the analyzer as index_analyzer, as one would expect, and, out of desperation, as search_analyzer and analyzer. None seems to have any effect on the document being indexed or retrieved.

    Test Doc Indexing against HTML_Strip Analysed field :

    REQUEST : Example POST document with html content

    POST /html_poc_v2/html_poc_type/02
    {
      "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
    }
    

    Expected: indexed data to have been parsed by the html analyser. Actual: data is indexed with HTML.

    RESPONSE

    {
       "_index": "html_poc_v2",   "_type": "html_poc_type",   "_id": "02", ...
       "_source": {
          "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
          "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
          "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
       }
    }
    

    Settings and Doc Mapping

    PUT /html_poc_v2
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_html_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "char_filter": [
                "html_strip"
              ]
            }
          }
        },
        "mappings": {
          "html_poc_type": {
            "properties": {
              "body": {
                "type": "string",
                "analyzer": "my_html_analyzer"
              },
              "description": {
                "type": "string",
                "analyzer": "my_html_analyzer"
              },
              "title": {
                "type": "string",
                "search_analyser": "my_html_analyzer"
              },
              "urlTitle": {
                "type": "string"
              }
            }
          }
        }
      }
    }
    

    Test to prove the custom analyser works perfectly:

    REQUEST

    GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
    {<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>}
    

    Response

    {
       "tokens": [
          {
             "token": "Some",… "position": 1
          },
          {
             "token": "déjà",… "position": 2
          },
          {
             "token": "vu",…  "position": 3
          },
          {
             "token": "website",… "position": 4
          }
       ]
    }
    

    Under the hood

    Going under the hood with an inline script further proves that my html analyser must have been skipped.

    REQUEST

    GET /html_poc_v2/html_poc_type/_search?pretty=true
    {
      "query" : {
        "match_all" : { }
      },
      "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "title"
            }
        }
      }
    }
    

    RESPONSE

    { …
       "hits": { ..
          "hits": [
             {
                "_index": "html_poc_v2",
                "_type": "html_poc_type",
                …
                "fields": {
                   "terms": [
                      [
                         "a",
                         "agrave",
                         "d",
                         "eacute",
                         "href",
                         "http",
                         "j",
                         "p",
                         "some",
                         "somedomain.com",
                         "title",
                         "vu",
                         "website"
                      ]
                   ]
                }
             }
          ]
       }
    }
    

    Similar to this question here : Why HTML tag is searchable even if it was filtered in elastic search

    I have also read this amazing doc : https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

    ES version : 1.7.2

    Please Help.