Parse an HTML string with JS
Solution 1
Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.
var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
el.getElementsByTagName( 'a' ); // Live HTMLCollection of your anchor elements
Edit: adding a jQuery answer to please the fans!
var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");
$('a', el) // All the anchor elements
Solution 2
It's quite simple:
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');
According to MDN at the time of writing, Chrome did not support 'text/html' here, so you had to parse as XML instead, like so:
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');
At the time of writing, this was unsupported by WebKit, so you had to follow Florian's answer instead, and it was not known to work reliably on mobile browsers.
Edit: 'text/html' is now widely supported.
Solution 3
EDIT: The solution below is only for HTML "fragments", since html, head and body are removed. I guess the solution for this question is DOMParser's parseFromString() method:
const parser = new DOMParser();
// Named doc: redeclaring the global document with const fails at the top level of a script
const doc = parser.parseFromString(html, "text/html");
For HTML fragments, the solutions listed here work for most HTML, but they fail in certain cases.
For example, try parsing <td>Test</td>. This doesn't work with the div.innerHTML solution, DOMParser.prototype.parseFromString, or the range.createContextualFragment solution. The td tag goes missing and only the text remains.
Only jQuery handles that case well.
So the future solution (MS Edge 13+) is to use the template tag:
function parseHTML(html) {
    var t = document.createElement('template');
    t.innerHTML = html;
    return t.content;
}
var documentFragment = parseHTML('<td>Test</td>');
For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99
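To make the effect of the template approach easier to see, here is a minimal sketch built on the same idea. The helper name parseFragment and its optional doc parameter are assumptions made for illustration (the parameter only exists so the function can be exercised outside a page); in a browser it falls back to the global document.

```javascript
// Hypothetical helper (not from the original answer): parses an HTML
// fragment through a <template> element and returns its top-level elements.
function parseFragment(html, doc = document) {
  const t = doc.createElement('template');
  t.innerHTML = html;
  // t.content is a DocumentFragment; children lists only its element nodes
  return Array.from(t.content.children);
}

// Browser-only usage: the <td> elements are kept even though there is
// no surrounding <table> in the input string.
if (typeof document !== 'undefined') {
  const cells = parseFragment('<td>1</td><td>2</td>');
  console.log(cells.map((el) => el.tagName));
}
```

Returning an array instead of the raw DocumentFragment is a design choice for this sketch only; keep t.content if you intend to append the result to the page in one step.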
Solution 4
var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");
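Since the question is about extracting links, the step from the parsed document to a list of link targets might look like the sketch below. The helper name extractLinks is an assumption made for illustration; it only relies on querySelectorAll and getAttribute.

```javascript
// Hypothetical helper: returns the raw href attribute of every <a href>
// found under root (a parsed Document, an element, or a fragment).
function extractLinks(root) {
  return Array.from(root.querySelectorAll('a[href]'), (a) => a.getAttribute('href'));
}

// Browser-only usage; DOMParser is not defined in plain Node.js.
if (typeof DOMParser !== 'undefined') {
  const html = "<html><head><title>titleTest</title></head><body>" +
    "<a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
  const doc = new DOMParser().parseFromString(html, 'text/html');
  console.log(extractLinks(doc)); // the href values from the question's markup
}
```

getAttribute('href') is used rather than the href property because it returns the value exactly as written in the string, whereas the property returns a resolved URL.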
Solution 5
const parse = Range.prototype.createContextualFragment.bind(document.createRange());
document.body.appendChild( parse('<p><strong>Today is:</strong></p>') );
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );
Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:
// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);
// Returns Text "1 2" because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');
// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');
// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);
// Returns [<td>, <td>] element array
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');
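The context-switching trick above can be wrapped in a small function so that each call states its own parent element. This is only a sketch: the name parseInContext and the optional doc parameter are assumptions made for illustration, and the function uses the same setStart and createContextualFragment calls as the answer.

```javascript
// Hypothetical helper: parses html as if it were placed inside an element
// named contextTag (for example 'tr' or 'tbody').
function parseInContext(html, contextTag, doc = document) {
  const range = doc.createRange();
  // Starting the range inside a detached element makes that element the parsing context
  range.setStart(doc.createElement(contextTag), 0);
  return range.createContextualFragment(html);
}

// Browser-only usage: with <tr> as the context, the <td> elements are kept.
if (typeof document !== 'undefined') {
  const fragment = parseInContext('<td>1</td><td>2</td>', 'tr');
  console.log(fragment.children.length);
}
```

Creating a new Range per call avoids carrying the context over from a previous parse, at the cost of one extra object per call.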
stage
Updated on July 17, 2022
Comments
-
stage almost 2 years
I want to parse a string which contains HTML text. I want to do it in JavaScript.
I tried the Pure JavaScript HTML Parser library but it seems that it parses the HTML of my current page, not from a string. Because when I try the code below, it changes the title of my page:
var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);
My goal is to extract links from an HTML external page that I read just like a string.
Do you know an API to do it?
-
stage about 12 years
Just a note: With this solution, if I do an "alert(el.innerHTML)", I lose the <html>, <body> and <head> tags...
-
Rob W about 12 years
Why are you prefixing $? Also, as mentioned in the linked duplicate, text/html is not supported very well, and has to be implemented using a polyfill.
-
Mathieu about 12 years
I copied this line from a project. I'm used to prefixing variables with $ in JavaScript applications (not in libraries), just to avoid a conflict with a library. That's not very useful now that almost every variable is scoped, but it used to be. It also (maybe) helps to identify variables easily.
-
stage about 12 years
Problem: I need to get links from the <frame> tag. But with this solution, the frame tags are deleted...
-
Florian Margaine about 12 years
You can clone the <frame> and work on the clone. This way, you keep the original untouched and work on the cloned element (which you can delete/whatever). To clone, you can use: var c = el.cloneNode( true ); or with jQuery: var c = $( el ).clone();
-
stage about 12 years
I think I didn't understand, because when I try it, it doesn't work: var c = el.cloneNode( true ); alert(c.innerHTML); The frame tag is still deleted.
-
Florian Margaine about 12 years
It does work in there: jsfiddle.net/Ralt/nkPjp . If what you want is getting elements from an iframe on another domain, then it is not possible for security reasons.
-
stage about 12 years
I've got this: jsfiddle.net/aHWJ8 . I cannot grab the link. As you can see, even the <body>, <head>, <html> are deleted.
-
stage about 12 years
The link is in the "src" of the frame. <FRAME SRC='web-pages/page.html'>
-
Florian Margaine about 12 years
Well, that's completely different from what your question states. You should ask another question for this.
-
Florian Margaine about 12 years
But the problem is that you can't do that. Even jQuery will strip off the frame tags, since it's just using innerHTML. I don't think using frames is a good idea btw.
-
stage about 12 years
But this is what I asked: "My goal is to extract links from a HTML external page that I read just like a String." I extract links from <img>, <script>, <a>... I just miss FRAME because it's deleted by the innerHTML method.
-
Florian Margaine about 12 years
In an HTML page, a link is an anchor tag (a), that's how everybody answered you :-). You can't get the FRAME source. innerHTML is the only way to do this, so you can't do it. Your only way would be to send the html server side with ajax so that you can work with it.
-
Jokester about 11 years
-
Nick over 10 years
Thanks for posting an answer that involves vanilla JavaScript! In almost 99.999% of cases there's no need to use jQuery! Occasionally, I get lazy and use $.get/post, but that's it.
-
Sebastian Carroll over 10 years
I couldn't get this to work on IE8. I get the error "Object doesn't support this property or method" for the first line in the function. I don't think the createHTMLDocument function exists.
-
John Slegers over 10 years
What exactly is your use case? If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following: (1) var div=document.createElement("DIV"); (2) div.innerHTML = markup; (3) result = div.childNodes; --- This gives you a collection of child nodes and should work not just in IE8 but even in IE6-7.
-
Sebastian Carroll over 10 years
Thanks for the alternate option, I'll try it if I need to do this again. For now though I used the jQuery solution above.
-
omninonsense about 9 years
@stage I'm a little bit late to the party, but you should be able to use document.createElement('html'); to preserve the <head> and <body> tags.
-
Ry- almost 9 years
Note that, like (the simple) innerHTML, this will execute an <img>’s onerror.
-
Munawwar over 8 years
An issue with this is that HTML like '<td>test</td>' would ignore the td in the document.body context (and only create a 'test' text node). OTOH, if it is used internally in a templating engine, then the right context would be available.
-
Munawwar over 8 years
Also BTW, IE 11 supports createContextualFragment.
-
JMRC over 8 years
I was afraid of ID collisions, but this did not happen. Just in case another newbie was wondering the same thing.
-
aendra over 8 years
Worth noting that in 2016 DOMParser is now widely supported. caniuse.com/#feat=xml-serializer
-
Toothbrush over 7 years
@SebastianCarroll Note that IE8 doesn't support the trim method on strings. See stackoverflow.com/q/2308134/3210837.
-
John Slegers over 7 years
@Toothbrush : Is IE8 support still relevant at the dawn of 2017?
-
Toothbrush over 7 years
@JohnSlegers For some companies, yes.
-
symbiont almost 7 years
It looks like you are putting an html element within an html element.
-
Jeff Laughlin over 6 years
If you want to write forward-compatible code that also works on old browsers you can polyfill the <template> tag. It depends on custom elements which you may also need to polyfill. In fact you might just want to use webcomponents.js to polyfill custom elements, templates, shadow dom, promises, and a few other things all at one go.
-
ceving over 6 years
Worth noting that all relative links in the created document are broken, because the document gets created by inheriting the documentURL of window, which most likely differs from the URL of the string.
-
Glitch over 6 years
In my case, my page needs to repeat this activity over and over again. Would repeatedly creating a dummy DOM element get memory intensive? Is there a way to dispose of the DOM element once the innerHTML has been extracted? I'm not quite familiar with how the browser handles JavaScript variables.
-
Jack G about 6 years
Worth noting that you should only call new DOMParser once and then reuse that same object throughout the rest of your script.
-
Justin over 5 years
I'm concerned this is upvoted as the top answer. The parse() solution below is more reusable and elegant.
-
Justin over 5 years
The parse() solution below is more reusable and specific to HTML. This is nice if you need an XML document, however.
-
sea26.2 about 5 years
The question was how to parse with JS, not Chrome or Firefox.
-
Shariq Musharaf about 5 years
How can I display this parsed webpage in a dialog box or something? I was not able to find a solution for that.
-
Leif Arne Storset over 4 years
Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.
-
Leif Arne Storset over 4 years
Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input.
-
Hardik Mandankaa almost 4 years
How to convert HTML to a string using JavaScript?
-
Rene Koch over 3 years
This does not answer the question. OP wants to extract links.
-
Nathan B over 3 years
That's not an ideal solution, since if the html string contains images, for example, the browser will try to fetch them! So this is a side effect of the parsing that we may not want. See this example: var html = "<div><img src=\"img_girl.jpg\" width=\"500\" height=\"600\"></div>"; var div = document.createElement('div'); div.innerHTML = html;
-
Timo about 3 years
@HardikMandankaa html IS a string, so no need to convert. It is already there as a string representation.
-
Timo about 3 years
Since Chrome 31, text/html is possible. I wonder if there is anybody using this version of Chrome or lower...
-
Timo about 3 years
I can imagine that html is there twice: it is created and then used in innerHTML.
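One of the comments above points out that relative links in a parsed document resolve against the current page rather than the page the string came from. A minimal sketch of one way around that, assuming you know the URL the HTML was fetched from, is to read the raw href values and resolve them yourself with the standard URL constructor. The helper name resolveLinks and the example URLs are assumptions made for illustration.

```javascript
// Hypothetical helper: resolves raw href values against the URL the HTML
// string was originally fetched from (baseUrl is supplied by the caller).
function resolveLinks(hrefs, baseUrl) {
  return hrefs.map((href) => new URL(href, baseUrl).href);
}

const resolved = resolveLinks(
  ['test0', '/top', 'https://other.example/x'],
  'https://example.com/dir/page.html'
);
console.log(resolved);
// → ['https://example.com/dir/test0', 'https://example.com/top', 'https://other.example/x']
```

The raw values can be collected with getAttribute('href') on each anchor, which returns the attribute as written instead of a URL already resolved against the current page.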