Extract the text out of HTML string using JavaScript

Solution 1

Create an element, store the HTML in it, and get its textContent:

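function extractContent(s) {
  // Parse the HTML into a detached element; it is never added to the page.
  var span = document.createElement('span');
  span.innerHTML = s;
  // textContent is the standard property; innerText is a fallback for old browsers.
  return span.textContent || span.innerText;
}

alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // alerts "HelloW3C"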

Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:

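function extractContent(s, space) {
  var span = document.createElement('span');
  span.innerHTML = s;
  if (space) {
    // Append a space to the text of every descendant element.
    // Caveat: assigning textContent replaces an element's children with a
    // single text node, so nested elements are flattened and do not get
    // their own separating spaces.
    var children = span.querySelectorAll('*');
    for (var i = 0; i < children.length; i++) {
      if (children[i].textContent)
        children[i].textContent += ' ';
      else
        children[i].innerText += ' ';
    }
  }
  // Collapse runs of spaces into a single space.
  return [span.textContent || span.innerText].toString().replace(/ +/g, ' ');
}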
    
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>"));

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>",true));

Solution 2

textContent is a very good technique for achieving the desired result, but sometimes we don't want to (or can't) use the DOM. A simple workaround is the following regular expression:

let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');
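
One difference from the textContent approach is worth noting: the regular expression only removes tags, so character references such as &amp;amp; stay in the output undecoded. The following is a minimal sketch of one way to handle that, not a complete decoder. The names stripTags, decodeEntities, htmlToText and the ENTITIES table are hypothetical, and the table covers only a handful of named entities by assumption:

```javascript
// Hypothetical helpers for illustration; ENTITIES is deliberately tiny (an assumption).
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0' };

// Same regular expression as above: drop anything that looks like a tag.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, '');
}

// Decode numeric references (&#33; / &#x41;) and the few named ones listed above.
function decodeEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, function (match, body) {
    if (body.charAt(0) === '#') {
      const isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(ENTITIES, body) ? ENTITIES[body] : match;
  });
}

function htmlToText(html) {
  return decodeEntities(stripTags(html));
}

console.log(htmlToText("<p>Tom &amp; Jerry</p>"));
```

Decoding happens after the tags are stripped, so an escaped tag such as &amp;lt;b&amp;gt; ends up as literal text rather than being removed.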

Solution 3

Use this regex to remove the HTML tags and keep only the inner text.

For the example string it yields just HelloW3C:

var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');
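
As a quick check of what the flags do here, the sketch below runs the same pattern with and without the m flag, plus the [\s\S] spelling, on a tag that spans two lines. The sample string and the variable names are made up for illustration:

```javascript
// Illustrative input: the opening <p> tag contains a line break.
const value = "<p\n  class='greeting'>Hello</p><a href='http://w3c.org'>W3C</a>";

// The m flag only changes how ^ and $ match, and this pattern uses neither.
const withM = value.replace(/<(?:.|\n)*?>/gm, '');
const withoutM = value.replace(/<(?:.|\n)*?>/g, '');

// [\s\S] matches any character including newlines, so it is an equivalent spelling.
const withCharClass = value.replace(/<[\s\S]*?>/g, '');

console.log(withM, withoutM, withCharClass);
```

All three calls remove the same tags, so the choice between them is a matter of readability.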

Solution 4

Try this:

<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
function extractContent(value) {
  var div = document.createElement('div');
  div.innerHTML = value;
  var text = div.textContent;
  return text;
}
window.onload = function () {
  alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
};
</script>
</body>
</html>
Author: Toshkuuu

Updated on July 27, 2022

Comments

  • Toshkuuu
    Toshkuuu almost 2 years

    I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code:

    function extractContent(value) {
      var content_holder = "";
    
      for (var i = 0; i < value.length; i++) {
        if (value.charAt(i) === '>') {
          continue;
          while (value.charAt(i) != '<') {
            content_holder += value.charAt(i);
          }
        }
    
      }
      console.log(content_holder);
    }
    
    extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");

    The problem is that nothing gets printed on the console (content_holder stays empty). I think the problem is caused by the === operator.

  • NewToJS
    NewToJS about 9 years
    You don't need an array, if you want the results to be a string but having an array will allow the user to access each result/value.
  • Admin
    Admin about 9 years
    Right approach, but you don't need an element in the DOM to do this. Just create an element with var div = document.createElement('div') and proceed from there.
  • Admin
    Admin about 9 years
    You haven't fixed the basic logic error in the OP code. Did you test this?
  • Admin
    Admin about 9 years
    Did you test this? It fails to extract "W3C" as it should.
  • Rana Ahmer Yasin
    Rana Ahmer Yasin about 9 years
    Could you please give me a reason?
  • Admin
    Admin about 9 years
    Also, this will fail with nested HTML elements, such as <p>Hello<i>Bob</i></p><a>...</a>. It will retain the markup inside the p element.
  • davidkonrad
    davidkonrad about 9 years
    Outputs HelloW3C - really what OP wanted? Not Hello W3C?
  • Toshkuuu
    Toshkuuu about 9 years
    No, white spaces are not required :) Sorry for not mentioning it!
  • Rick Hitchcock
    Rick Hitchcock about 9 years
    Added a version that can add spaces between nodes.
  • Admin
    Admin about 9 years
    Please try your solution with the string Hello, <p>Buggy<i>World</i></p>.
  • Admin
    Admin about 9 years
    delete span accomplishes nothing.
  • Rick Hitchcock
    Rick Hitchcock about 9 years
    @torazaburo, thanks, I wasn't sure about that. Edited.
  • Admin
    Admin about 9 years
    If you are going to use regexp, then a simpler version would be /<[\s\S]*?>/, or /<[^]*?>/. Your m flag accomplishes nothing; it relates to the behavior of ^ and $.
  • NewToJS
    NewToJS about 9 years
    I'm going to guess "not"
  • Kelly
    Kelly almost 5 years
    I know this is a very old comment, but could you please explain the meaning of the expression /<[^>]+>/g ? I'm having trouble understanding what each individual character means.
  • Kade
    Kade over 4 years
    @Kelly The symbols you are referring to are a regular expression. It's kind of like a mini-programming language for parsing text. Here's a link to where you can learn more about each symbol: developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/…
  • Kade
    Kade over 4 years
    It essentially says to find and remove each < that has stuff that is not a > between it and a >.
  • GD- Ganesh Deshmukh
    GD- Ganesh Deshmukh almost 4 years
    Most helpful. Regex is one of the best tools/mini-languages for coders.
  • Xerillio
    Xerillio almost 4 years
    fyi, the part about adding spaces does not work properly for all types of nested nodes: extractContent("<div>foo<div>bar</div></div>", true) produces "foobar "
  • Rick Hitchcock
    Rick Hitchcock almost 4 years
    @Xerillio, good point. It would take more code to differentiate block-level from inline elements, especially considering that CSS could change block-level to inline and vice-versa.
  • hanism
    hanism over 3 years
    Different techniques for different cases. This is the right approach for my case: Telegram bot development, which has no innerHTML or anything else that is only available in a browser.
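
For completeness, the character-by-character loop from the question can also be made to work without the DOM or a regular expression. In the original code the continue statement runs before the while loop, so the loop body is never reached and content_holder stays empty; the === operator is not the cause. The sketch below is one possible repair, assuming that everything between < and > is a tag. The insideTag flag is an addition for illustration and was not in the question:

```javascript
function extractContent(value) {
  var content_holder = "";
  var insideTag = false; // true while scanning between '<' and '>'

  for (var i = 0; i < value.length; i++) {
    var ch = value.charAt(i);
    if (ch === '<') {
      insideTag = true;            // a tag starts: stop collecting
    } else if (ch === '>' && insideTag) {
      insideTag = false;           // the tag ends: resume collecting
    } else if (!insideTag) {
      content_holder += ch;        // ordinary text outside any tag
    }
  }
  return content_holder;
}

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
```

Like the regular expressions above, this treats a > inside an attribute value as the end of the tag, so it is only suitable for simple, well-formed input.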