Extract the text out of HTML string using JavaScript
Solution 1
Create an element, store the HTML in it, and get its textContent:
function extractContent(s) {
  var span = document.createElement('span');
  span.innerHTML = s;
  return span.textContent || span.innerText;
}
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:
function extractContent(s, space) {
  var span = document.createElement('span');
  span.innerHTML = s;
  if (space) {
    var children = span.querySelectorAll('*');
    for (var i = 0; i < children.length; i++) {
      if (children[i].textContent)
        children[i].textContent += ' ';
      else
        children[i].innerText += ' ';
    }
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g, ' ');
}
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>"));
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>", true));
Solution 2
textContent is a very good technique for achieving the desired result, but sometimes we don't want to load a DOM. A simple workaround is the following regular expression:
let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');
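One difference from the textContent approach is worth noting: the regular expression only removes tags, so character references such as &amp; stay encoded in the result. A minimal sketch of one way to handle that, assuming only a handful of entities matter (stripTagsAndDecode and its entity table are hypothetical and deliberately incomplete):

```javascript
// Hypothetical helper: strip tags with the regex above, then decode a few entities.
function stripTagsAndDecode(html) {
  // Assumed, incomplete entity table; extend it for your own input.
  var entities = {
    '&lt;': '<',
    '&gt;': '>',
    '&quot;': '"',
    '&#39;': "'",
    '&nbsp;': '\u00a0',
    '&amp;': '&'
  };
  return html
    .replace(/<[^>]+>/g, '') // remove the tags
    // Decode in a single pass so "&amp;lt;" becomes "&lt;" and not "<".
    .replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, function (match) {
      return entities[match];
    });
}

console.log(stripTagsAndDecode("<p>Fish &amp; chips</p><a href='http://w3c.org'>W3C</a>"));
```

This runs without a DOM, so it should also work outside the browser, but it is not a substitute for a real HTML parser on untrusted or malformed input.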
Solution 3
Use this regex to remove the HTML tags and keep only the inner text.
For the example string it yields only HelloW3C:
var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');
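As a quick check of this pattern, the sketch below compares it with two shorter spellings of "any character, including a newline" and with the same pattern minus the m flag. The test string (with a newline inside a tag) is made up for illustration. The m flag only changes how ^ and $ match, and neither appears in the pattern, so dropping it should make no difference:

```javascript
// Made-up input with a newline inside the first tag.
var html = "<p\nclass='x'>Hello</p><a href='http://w3c.org'>W3C</a>";

var patterns = [
  /<(?:.|\n)*?>/gm, // the pattern from this answer
  /<(?:.|\n)*?>/g,  // same pattern without the m flag
  /<[\s\S]*?>/g,    // [\s\S] matches any character, newlines included
  /<[^]*?>/g        // [^] also matches any character in JavaScript
];

var results = patterns.map(function (re) {
  return html.replace(re, '');
});

console.log(results); // every entry should be the same string
```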
Solution 4
Try this:
<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
function extractContent(value){
var div = document.createElement('div')
div.innerHTML=value;
var text= div.textContent;
return text;
}
window.onload=function()
{
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
};
</script>
</body>
</html>
Author: Toshkuuu
Updated on July 27, 2022
Comments
-
Toshkuuu almost 2 years
I am trying to get the inner text of HTML string, using a JS function(the string is passed as an argument). Here is the code:
function extractContent(value) {
  var content_holder = "";
  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder += value.charAt(i);
      }
    }
  }
  console.log(content_holder);
}
extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
The problem is that nothing gets printed on the console (content_holder stays empty). I think the problem is caused by the === operator.
-
NewToJS about 9 years
You don't need an array if you want the result to be a string, but having an array will allow the user to access each result/value.
-
Admin about 9 years
Right approach, but you don't need an element in the DOM to do this. Just create an element with var div = document.createElement('div') and proceed from there.
-
Admin about 9 years
You haven't fixed the basic logic error in the OP code. Did you test this?
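For reference, the logic error in the question's loop is not the === operator: the continue statement skips the inner while loop entirely, and that while loop never advances i, so it could not terminate even if it were reached. A minimal sketch of one possible repair, keeping the character-by-character approach (the insideTag flag is an added, hypothetical name, and the function returns the text instead of logging it):

```javascript
function extractContent(value) {
  var content_holder = "";
  var insideTag = false; // true while scanning between "<" and ">"
  for (var i = 0; i < value.length; i++) {
    var ch = value.charAt(i);
    if (ch === '<') {
      insideTag = true;       // a tag starts: stop collecting
    } else if (ch === '>') {
      insideTag = false;      // the tag ends: resume collecting
    } else if (!insideTag) {
      content_holder += ch;   // plain text outside any tag
    }
  }
  return content_holder;
}

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // HelloW3C
```

Like the regex answers, this assumes that no ">" appears inside an attribute value.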
-
Admin about 9 years
Did you test this? It fails to extract "W3C" as it should.
-
Rana Ahmer Yasin about 9 years
Could you please give me a reason?
-
Admin about 9 years
Also, this will fail with nested HTML elements, such as <p>Hello<i>Bob</i></p><a>...</a>. It will retain the markup inside the p element.
-
davidkonrad about 9 years
Outputs HelloW3C - is that really what the OP wanted? Not Hello W3C?
-
Toshkuuu about 9 years
No, white spaces are not required :) Sorry for not mentioning it!
-
Rick Hitchcock about 9 years
Added a version that can add spaces between nodes.
-
Admin about 9 years
Please try your solution with the string Hello, <p>Buggy<i>World</i></p>.
-
Admin about 9 years
delete span accomplishes nothing.
-
Rick Hitchcock about 9 years
@torazaburo, thanks, I wasn't sure about that. Edited.
-
Admin about 9 years
If you are going to use a regexp, then a simpler version would be /<[\s\S]*?>/, or /<[^]*?>/. Your m flag accomplishes nothing; it relates to the behavior of ^ and $.
-
NewToJS about 9 years
I'm going to guess "not"
-
Kelly almost 5 years
I know this is a very old comment, but could you please explain the meaning of the expression /<[^>]+>/g ? I'm having trouble understanding what each individual character means.
-
Kade over 4 years
@Kelly The symbols you are referring to are a regular expression. It's kind of like a mini-programming language for parsing text. Here's a link to where you can learn more about each symbol: developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/…
-
Kade over 4 years
It essentially says to find and remove each < that has stuff that is not a > between it and a >.
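To spell that out piece by piece, here is a small sketch that rebuilds the same pattern from commented parts and shows which substrings it matches (the variable names are just for illustration):

```javascript
// Same pattern as /<[^>]+>/g, assembled from commented pieces.
var tagPattern = new RegExp(
  '<' +     // a literal "<" opens the tag
  '[^>]+' + // one or more characters that are not ">"
  '>',      // a literal ">" closes the tag
  'g'       // global flag: match every tag, not just the first one
);

var htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>";

// The substrings that will be removed:
console.log(htmlString.match(tagPattern));
// [ "<p>", "</p>", "<a href='http://w3c.org'>", "</a>" ]

// Replacing each of them with an empty string leaves the text:
console.log(htmlString.replace(tagPattern, '')); // HelloW3C
```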
-
GD- Ganesh Deshmukh almost 4 years
Most helpful. Regex is one of the best tools/mini-languages for coders.
-
Xerillio almost 4 years
FYI, the part about adding spaces does not work properly for all types of nested nodes: extractContent("<div>foo<div>bar</div></div>", true) produces "foobar ".
-
Rick Hitchcock almost 4 years
@Xerillio, good point. It would take more code to differentiate block-level from inline elements, especially considering that CSS could change block-level to inline and vice-versa.
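A rough sketch of what that extra code might look like without a DOM, under a strong assumption: a hard-coded list of tag names is treated as block-level and CSS is ignored entirely. BLOCK_TAGS and extractContentSpaced are hypothetical names, and the tag list is incomplete:

```javascript
// Assumed, incomplete list of tags treated as block-level; CSS is ignored.
var BLOCK_TAGS = 'address|article|aside|blockquote|br|div|h[1-6]|li|ol|p|pre|section|table|tr|ul';

// Matches an opening or closing tag whose name is in the list above.
var blockTagPattern = new RegExp('<\\/?(?:' + BLOCK_TAGS + ')\\b[^>]*>', 'gi');

function extractContentSpaced(html) {
  return html
    .replace(blockTagPattern, ' ') // block-level tags become a space
    .replace(/<[^>]+>/g, '')       // remaining (inline) tags are dropped
    .replace(/\s+/g, ' ')          // collapse runs of whitespace
    .trim();
}

console.log(extractContentSpaced("<div>foo<div>bar</div></div>")); // foo bar
console.log(extractContentSpaced("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>"));
// Hello W3C. Nice to seeyou!
```

Inline elements such as em and strong add no space here, which is why "seeyou!" stays joined.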
-
hanism over 3 years
Different techniques suit different cases, and this is the right approach for mine: Telegram bot development, which doesn't involve innerHTML or the other things needed in web development.