Parse an HTML string with JS

javascript html dom html-parsing

581,991

Solution 1

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";

el.getElementsByTagName( 'a' ); // Live HTMLCollection of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");

$('a', el) // All the anchor elements

Solution 2

It's quite simple:

var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');

According to MDN, older versions of Chrome did not support 'text/html' here, so you had to parse as XML instead (this only works when the markup is well-formed XML):

var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');

~~It is currently unsupported by webkit and you'd have to follow Florian's answer, and it is unknown to work in most cases on mobile browsers.~~

Edit: Now widely supported
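
If you do end up on the 'text/xml' route, keep in mind that real-world HTML is often not well-formed XML. As far as I know, parseFromString does not throw in that case; it returns a document containing a <parsererror> element. A minimal sketch of a check follows. The helper name hasParserError and the sample markup are my own, not part of any API:

```javascript
// Sketch: detect a failed XML parse. `hasParserError` is an illustrative name.
function hasParserError(doc) {
  // On malformed input, parseFromString with an XML type does not throw;
  // the returned document contains a <parsererror> element instead.
  return doc.getElementsByTagName('parsererror').length > 0;
}

// Browser-only usage, guarded so the snippet also loads where DOMParser is absent.
if (typeof DOMParser !== 'undefined') {
  const parser = new DOMParser();
  const markup = '<p>line one<br>line two</p>'; // <br> is never closed: fine as HTML, not as XML

  const xmlDoc = parser.parseFromString(markup, 'text/xml');
  console.log('text/xml parse failed:', hasParserError(xmlDoc));

  const htmlDoc = parser.parseFromString(markup, 'text/html');
  console.log('text/html parse failed:', hasParserError(htmlDoc));
}
```

The 'text/html' parser is forgiving by design, so this check is mainly useful when you are stuck with an XML type.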

Solution 3

EDIT: The solution below works only for HTML "fragments", since the html, head and body tags are removed. For full documents, as in this question, I guess the solution is DOMParser's parseFromString() method:

const parser = new DOMParser();
const doc = parser.parseFromString(html, "text/html"); // named doc so it does not clash with the global document

For HTML fragments, the solutions listed here work for most HTML, but they fail in certain cases.

For example, try parsing <td>Test</td>. It does not work with the div.innerHTML solution, with DOMParser.prototype.parseFromString, or with the range.createContextualFragment solution: the td tag goes missing and only the text remains.

Only jQuery handles that case well.

So the future solution (MS Edge 13+) is to use the template tag:

function parseHTML(html) {
    var t = document.createElement('template');
    t.innerHTML = html;
    return t.content;
}

var documentFragment = parseHTML('<td>Test</td>');

For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99

Solution 4

var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");
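
Since the question is about extracting links from an external page, here is a minimal sketch that builds on the two lines above. It reads the raw attribute values and resolves them against the URL the HTML came from, because relative links in the parsed document would otherwise resolve against the current page. The function name collectLinks, the selector list, the baseUrl parameter and the example.com address are all assumptions for illustration. A full-document 'text/html' parse should keep <frame> elements, but check that against your own input:

```javascript
// Sketch: collect link targets from an already-parsed document.
// `collectLinks`, the selector list and `baseUrl` are illustrative assumptions.
function collectLinks(doc, baseUrl) {
  const selector = 'a[href], img[src], script[src], frame[src], iframe[src]';
  const links = [];
  for (const el of doc.querySelectorAll(selector)) {
    // getAttribute returns the raw attribute text, not a browser-resolved URL.
    const raw = el.getAttribute('href') || el.getAttribute('src');
    if (!raw) continue;
    try {
      // Resolve relative values against the page the HTML string came from.
      links.push(new URL(raw, baseUrl).href);
    } catch (e) {
      // Skip values that cannot be resolved into a URL.
    }
  }
  return links;
}

// Browser-only usage, guarded so the snippet also loads where DOMParser is absent.
if (typeof DOMParser !== 'undefined') {
  const html = "<a href='test0'>test01</a><a href='/test1'>test02</a>";
  const doc = new DOMParser().parseFromString(html, 'text/html');
  console.log(collectLinks(doc, 'https://example.com/dir/page.html'));
}
```

Passing the document in, rather than parsing inside the function, keeps the helper usable with any of the parsing approaches on this page.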

Solution 5

const parse = Range.prototype.createContextualFragment.bind(document.createRange());

document.body.appendChild( parse('<p><strong>Today is:</strong></p>') );
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );

Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:

// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);

// Returns Text "1 2" because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');

// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');

// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);

// Returns [<td>, <td>] element array
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');

Author by

stage

Updated on July 17, 2022

Comments

stage almost 2 years
I want to parse a string which contains HTML text. I want to do it in JavaScript.

I tried the Pure JavaScript HTML Parser library, but it seems to parse the HTML of my current page rather than a string: when I try the code below, it changes the title of my page:
```
var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);
```
My goal is to extract links from an HTML external page that I read just like a string.

Do you know an API to do it?
stage about 12 years

Just a note: with this solution, if I do "alert(el.innerHTML)", I lose the <html>, <body> and <head> tags.
Rob W about 12 years

Why are you prefixing $? Also, as mentioned in the linked duplicate, text/html is not supported very well, and has to be implemented using a polyfill.
Mathieu about 12 years

I copied this line from a project. I'm used to prefixing variables with $ in JavaScript applications (not in libraries). It's just to avoid a conflict with a library. That's not very useful, as almost every variable is scoped, but it used to be. It also (maybe) helps to identify variables easily.
stage about 12 years

Problem: I need to get links from <frame> tags. But with this solution, the frame tags are deleted...
Florian Margaine about 12 years

You can clone the <frame> and work on the clone. This way, you keep the original untouched and work on the cloned element (which you can delete/whatever). To clone, you can use: var c = el.cloneNode( true ); or with jQuery: var c = $( el ).clone();.
stage about 12 years

I think I didn't understand because when I try it, it doesn't work: var c = el.cloneNode( true ); alert(c.innerHTML); The frame tag is still deleted
Florian Margaine about 12 years

It does work in there: jsfiddle.net/Ralt/nkPjp . If what you want is getting elements from an iframe on another domain, then it is not possible for security reasons.
stage about 12 years

I've got this: jsfiddle.net/aHWJ8 and I cannot grab the link. As you can see, even the <body>, <head> and <html> tags are deleted.
stage about 12 years

The link is in the "src" of the frame. <FRAME SRC='web-pages/page.html'>
Florian Margaine about 12 years

Well, that's completely different from what your question states. You should ask another question for this.
Florian Margaine about 12 years

But the problem is that you can't do that. Even jQuery will strip off the frame tags, since it's just using innerHTML. I don't think using frames is a good idea btw.
stage about 12 years

But this is what I asked: "My goal is to extract links from a HTML external page that I read just like a String." I extract links from <img>, <script>, <a>... I just miss FRAME because it's deleted by the innerHTML method.
Florian Margaine about 12 years

In an HTML page, a link is an anchor tag (a), that's how everybody answered you :-). You can't get the FRAME source. innerHTML is the only way to do this, so you can't do it. Your only way would be to send the html server side with ajax so that you can work with it.
Jokester about 11 years

Sadly, DOMParser doesn't work on text/html in Chrome either; this MDN page gives a workaround.
Nick over 10 years

Thanks for posting an answer that involves vanilla JavaScript! In almost 99.999% of cases there's no need to use jQuery! Occasionally I get lazy and use $.get/post, but that's it.
Sebastian Carroll over 10 years

I couldn't get this to work on IE8. I get the error "Object doesn't support this property or method" for the first line in the function. I don't think the createHTMLDocument function exists
John Slegers over 10 years

What exactly is your use case? If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following : (1) var div=document.createElement("DIV"); (2) div.innerHTML = markup; (3) result = div.childNodes; --- This gives you a collection of childnodes and should work not just in IE8 but even in IE6-7.
Sebastian Carroll over 10 years

Thanks for the alternate option, I'll try it if I need to do this again. For now though I used the JQuery solution above.
omninonsense about 9 years

@stage I'm a little bit late to the party, but you should be able to use document.createElement('html'); to preserve the <head> and <body> tags.
Ry- almost 9 years

Note that, like (the simple) innerHTML, this will execute an <img>’s onerror.
Munawwar over 8 years

An issue with this is that HTML like '<td>test</td>' would lose the td in the document.body context (and only create a 'test' text node). OTOH, if it is used internally in a templating engine, then the right context would be available.
Munawwar over 8 years

Also BTW, IE 11 supports createContextualFragment.
JMRC over 8 years

I was afraid of ID collisions, but this did not happen. Just in case another newbie was wondering the same thing.
aendra over 8 years

Worth noting that in 2016 DOMParser is now widely supported. caniuse.com/#feat=xml-serializer
Toothbrush over 7 years

@SebastianCarroll Note that IE8 doesn't support the trim method on strings. See stackoverflow.com/q/2308134/3210837.
John Slegers over 7 years

@Toothbrush : Is IE8 support still relevant at the dawn of 2017?
Toothbrush over 7 years

@JohnSlegers For some companies, yes.
symbiont almost 7 years

it looks like you are putting an html element within an html element
Jeff Laughlin over 6 years

If you want to write forward-compatible code that also works on old browsers you can polyfill the <template> tag. It depends on custom elements which you may also need to polyfill. In fact you might just want to use webcomponents.js to polyfill custom elements, templates, shadow dom, promises, and a few other things all at one go.
ceving over 6 years

Worth noting that all relative links in the created document are broken, because the document gets created by inheriting the documentURL of window, which most likely differs from the URL of the string.
Glitch over 6 years

In my case, my page needs to repeat this activity over and over again. Would repeatedly creating a dummy dom element get memory intensive? Is there a way to dispose of the dom element once the innerHtml has been extracted? I'm not quite familiar with how the browser handles javascript variables.
Jack G about 6 years

Worth noting that you should only call new DOMParser once and then reuse that same object throughout the rest of your script.
Justin over 5 years

I'm concerned this is upvoted as the top answer. The parse() solution below is more reusable and elegant.
Justin over 5 years

The parse() solution below is more reusable and specific to HTML. This is nice if you need an XML document, however.
sea26.2 about 5 years

The question was how to parse with JS - not Chrome or Firefox
Shariq Musharaf about 5 years

How can I display this parsed webpage on a dialog box or something? I was not able to find solution for that
Leif Arne Storset over 4 years

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.
Leif Arne Storset over 4 years

Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input.
Hardik Mandankaa almost 4 years

how to convert HTML to string using javascript?
Rene Koch over 3 years

This does not answer the question. OP wants to extract links.
Nathan B over 3 years

That's not an ideal solution, since if the html string contains images for example, the browser will try to fetch them! So this is a side effect of the parsing that we may not want: See this example: var html = "<div><img src=\"img_girl.jpg\" width=\"500\" height=\"600\"></div>"; var div = document.createElement('div'); div.innerHTML = html;
Timo about 3 years

@HardikMandankaa html IS a string, so no need to convert. It is already there as string rep.
Timo about 3 years

Since chrome 31, text/html is possible. I wonder if there is anybody using this version of chrome or lower..
Timo about 3 years

I can imagine that html is there twice: it is created as the element and then used again in innerHTML.