How to extract only text from HTML in Golang?

12,247

Solution 1

As suggested by @Eric Pauley, I track the StartTagToken and TextToken tokens and skip any text that directly follows a <script> start tag. Here is my solution:

    // Needs "fmt", "strings" and "golang.org/x/net/html".
    s := `
    <p>Links:</p><ul><li><a href="foo">Foo</a><li>
    <a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
    <script type='text/javascript'>
    /* <![CDATA[ */
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
    /* ]]> */
    </script>`

    domDocTest := html.NewTokenizer(strings.NewReader(s))
    var previousStartTokenTest html.Token // Last start tag seen, empty at first
    loopDomTest:
    for {
        tt := domDocTest.Next()
        switch tt {
        case html.ErrorToken:
            break loopDomTest // End of the document (or a read error), done
        case html.StartTagToken:
            previousStartTokenTest = domDocTest.Token()
        case html.EndTagToken:
            // Forget the start tag so text after </script> is not skipped too
            previousStartTokenTest = html.Token{}
        case html.TextToken:
            if previousStartTokenTest.Data == "script" {
                continue // Skip the JavaScript/CDATA body
            }
            // Text() already returns unescaped text, so html.UnescapeString is not needed
            TxtContent := strings.TrimSpace(string(domDocTest.Text()))
            if len(TxtContent) > 0 {
                fmt.Printf("%s\n", TxtContent)
            }
        }
    }

Solution 2

If you use github.com/PuerkitoBio/goquery it's pretty easy to achieve what you want.

  • You first need to use document.Find() to select the elements you want to remove, in your case the scripts, so document.Find("script")

  • Then, you need to remove it from the document using element.Remove()

  • Finally, you print/get the text using document.Text()

So, the final code would be

    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span><script type='text/javascript'>/* <![CDATA[ */var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};/* ]]> */</script>`

        doc, err := goquery.NewDocumentFromReader(strings.NewReader(s))
        if err != nil {
            log.Fatal(err)
        }

        // Drop every <script> element before reading the text
        doc.Find("script").Remove()

        fmt.Println(doc.Text()) // Links:FooBarBazTEXT I WANT
    }
Author: LeMoussel

Problem solver. I research, develop, test, keep up with new technology, ... Passionate about all development techniques (.NET, C#, PHP, JavaScript, ...).

Updated on June 16, 2022

Comments

  • LeMoussel, almost 2 years ago

    To extract text from HTML, I use a fully HTML5-compliant tokenizer and parser (golang.org/x/net/html), like this:

        s := `
    <p>Links:</p><ul><li><a href="foo">Foo</a><li>
    <a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
    <script type='text/javascript'>
    /* <![CDATA[ */
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
    /* ]]> */
    </script>`
    
        domDocTest := html.NewTokenizer(strings.NewReader(s))
        for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; {
            if tokenType != html.TextToken {
                tokenType = domDocTest.Next()
                continue
            }
            TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
            if len(TxtContent) > 0 {
                fmt.Printf("%s\n", TxtContent)
            }
            tokenType = domDocTest.Next()
        }
    

    but I get this result:

    Links:
    Foo
    BarBaz
    TEXT
    I
    WANT
    /* <![CDATA[ */
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
    /* ]]> */
    

    I don't want the CDATA content of the script. Any idea how to get only the text content?