Parse an HTML string with JS

Solution 1

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";

el.getElementsByTagName( 'a' ); // Live HTMLCollection of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");

$('a', el) // All the anchor elements

Solution 2

It's quite simple:

var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');

According to MDN at the time of writing, Chrome did not support 'text/html' here, so you had to parse as XML instead. This only works if the markup is well-formed XML:

var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');

When this was written, WebKit did not support 'text/html' either, so you had to follow Florian's answer, and support on mobile browsers was unknown.

Edit: 'text/html' is now widely supported.
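
One reason the 'text/xml' variant is fragile for real-world HTML is that ill-formed input does not throw. Instead, the returned document contains a parsererror element. A minimal sketch of detecting that case follows; parseXmlStrict and the ParserCtor parameter are illustrative names (the parameter only exists so the function can be exercised outside a browser), not part of the answer above.

```javascript
// Sketch: parse as XML and fail loudly when the input is not well-formed.
// ParserCtor defaults to the browser's DOMParser; it is a hypothetical
// injection point, not a standard API.
function parseXmlStrict(txt, ParserCtor = globalThis.DOMParser) {
    const xmlDoc = new ParserCtor().parseFromString(txt, 'text/xml');
    // XML parse failures are reported inside the document, not as exceptions.
    const errors = xmlDoc.getElementsByTagName('parsererror');
    if (errors.length > 0) {
        throw new Error('Not well-formed XML: ' + errors[0].textContent);
    }
    return xmlDoc;
}
```

Typical HTML, with unclosed tags or unquoted attributes, would end up in the error branch, which is why 'text/html' is the better choice where it is available.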

Solution 3

EDIT: The solution below is only for HTML "fragments", since the html, head and body tags are removed. For full documents, the answer to this question is DOMParser's parseFromString() method:

const parser = new DOMParser();
// Named doc rather than document, so the global document is not shadowed
const doc = parser.parseFromString(html, "text/html");

For HTML fragments, the solutions listed here work for most HTML, but not for all of it.

For example, try parsing <td>Test</td>. It fails with the div.innerHTML solution, with DOMParser.prototype.parseFromString and with the range.createContextualFragment solution: the td tag goes missing and only the text remains.

Only jQuery handles that case well.

So the forward-looking solution (MS Edge 13+, and supported by all current browsers) is to use the template tag:

function parseHTML(html) {
    var t = document.createElement('template');
    t.innerHTML = html;
    return t.content;
}

var documentFragment = parseHTML('<td>Test</td>');
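
If an array of elements is more convenient than a DocumentFragment, the same idea can be wrapped as below. This is a sketch under assumptions: parseElements and the doc parameter are hypothetical names, and doc only exists so the function can be called with something other than the global document.

```javascript
// Sketch: parse a fragment via <template> and return its element children.
function parseElements(html, doc = globalThis.document) {
    const t = doc.createElement('template');
    t.innerHTML = html;
    // t.content is a DocumentFragment; copy its element children into an array.
    return Array.from(t.content.children);
}

// In a browser:
// const cells = parseElements('<td>1</td><td>2</td>');
// cells.length is 2 and cells[0].tagName is 'TD'
```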

For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99

Solution 4

var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");
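
Since the goal in the question is to extract links from an external page, the relative hrefs usually need to be resolved against that page's URL rather than the current one. The sketch below shows one way to do that. The names extractLinks, baseUrl and ParserCtor are assumptions made for illustration, and baseUrl has to be supplied by the caller because the HTML string itself does not carry it.

```javascript
// Sketch: parse markup into a detached document and return absolute link URLs.
// ParserCtor defaults to the browser's DOMParser (hypothetical injection point).
function extractLinks(markup, baseUrl, ParserCtor = globalThis.DOMParser) {
    const doc = new ParserCtor().parseFromString(markup, 'text/html');
    const links = [];
    for (const a of doc.querySelectorAll('a[href]')) {
        // Read the raw attribute: a.href would resolve against the current page.
        const raw = a.getAttribute('href');
        try {
            links.push(new URL(raw, baseUrl).href);
        } catch (e) {
            // Skip values that cannot be resolved to a URL.
        }
    }
    return links;
}

// In a browser:
// extractLinks("<a href='test0'>test01</a>", 'https://example.com/dir/page.html');
```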

Solution 5

const parse = Range.prototype.createContextualFragment.bind(document.createRange());

document.body.appendChild( parse('<p><strong>Today is:</strong></p>') );
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );

Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:

// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);

// Returns a fragment holding only the text "1 2", because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');

// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');

// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);

// Returns a fragment whose element children are <td>, <td>
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');
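
Building on the examples above, the context node can be chosen from the first tag in the string, loosely in the spirit of what jQuery does internally. This is only a sketch: contextTagFor, parseInContext and the lookup table are hypothetical, and the table covers just a few table- and select-related tags.

```javascript
// Sketch: map the first tag of a fragment to a parent tag that accepts it.
const CONTEXT_FOR = new Map([
    ['td', 'tr'], ['th', 'tr'],
    ['tr', 'tbody'],
    ['tbody', 'table'], ['thead', 'table'], ['tfoot', 'table'],
    ['caption', 'table'], ['colgroup', 'table'],
    ['col', 'colgroup'],
    ['option', 'select'], ['optgroup', 'select'],
]);

function contextTagFor(html) {
    const match = /^\s*<([a-zA-Z][\w-]*)/.exec(html);
    const tag = match ? match[1].toLowerCase() : '';
    // Anything not listed is assumed to be valid inside a <div>.
    return CONTEXT_FOR.get(tag) ?? 'div';
}

function parseInContext(html, doc = globalThis.document) {
    const range = doc.createRange();
    // The start of the range becomes the parsing context.
    range.selectNodeContents(doc.createElement(contextTagFor(html)));
    return range.createContextualFragment(html);
}
```

With this, parseInContext('<td>1</td> <td>2</td>') would be parsed in a tr context without the caller having to call setStart by hand.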
Author: stage

Updated on July 17, 2022

Comments

  • stage
    stage almost 2 years

    I want to parse a string which contains HTML text. I want to do it in JavaScript.

    I tried the Pure JavaScript HTML Parser library but it seems that it parses the HTML of my current page, not from a string. Because when I try the code below, it changes the title of my page:

    var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);
    

    My goal is to extract links from an HTML external page that I read just like a string.

    Do you know an API to do it?

  • stage
    stage about 12 years
    Just a note: With this solution, if I do a "alert(el.innerHTML)", I lose the <html>, <body> and <head> tag....
  • Rob W
    Rob W about 12 years
    Why are you prefixing $? Also, as mentioned in the linked duplicate, text/html is not supported very well, and has to be implemented using a polyfill.
  • Mathieu
    Mathieu about 12 years
    I copied this line from a project. I'm used to prefixing variables with $ in JavaScript applications (not in libraries). It's just to avoid a conflict with a library. That's not very useful now that almost every variable is scoped, but it used to be. It may also help to identify variables easily.
  • stage
    stage about 12 years
    Problem: I need to get links from <frame> tag. But with this solution, the frame tag are deleted...
  • Florian Margaine
    Florian Margaine about 12 years
    You can clone the <frame> and work on the clone. This way, you keep the original untouched and work on the cloned element (which you can delete/whatever). To clone, you can use: var c = el.cloneNode( true ); or with jQuery: var c = $( el ).clone();.
  • stage
    stage about 12 years
    I think I didn't understand because when I try it, it doesn't work: var c = el.cloneNode( true ); alert(c.innerHTML); The frame tag is still deleted
  • Florian Margaine
    Florian Margaine about 12 years
    It does work in there: jsfiddle.net/Ralt/nkPjp . If what you want is getting elements from an iframe on another domain, then it is not possible for security reasons.
  • stage
    stage about 12 years
    I've got this: jsfiddle.net/aHWJ8. I cannot grab the link. As you can see, even the <body>, <head> and <html> tags are deleted.
  • stage
    stage about 12 years
    The link is in the "src" of the frame. <FRAME SRC='web-pages/page.html'>
  • Florian Margaine
    Florian Margaine about 12 years
    Well, that's completely different from what your question states. You should ask another question for this.
  • Florian Margaine
    Florian Margaine about 12 years
    But the problem is that you can't do that. Even jQuery will strip off the frame tags, since it's just using innerHTML. I don't think using frames is a good idea btw.
  • stage
    stage about 12 years
    But this is what I asked: "My goal is to extract links from a HTML external page that I read just like a String." I extract links from <img>, <script>, <a>... I just miss FRAME because it's deleted by the innerHTML method.
  • Florian Margaine
    Florian Margaine about 12 years
    In an HTML page, a link is an anchor tag (a), that's how everybody answered you :-). You can't get the FRAME source. innerHTML is the only way to do this, so you can't do it. Your only way would be to send the html server side with ajax so that you can work with it.
  • Jokester
    Jokester about 11 years
    Sadly, DOMParser doesn't work on text/html in Chrome either; this MDN page gives a workaround.
  • Nick
    Nick over 10 years
    Thanks for posting an answer that involves vanilla Javascript! Almost in 99.999% of the cases there's no need to use jQuery! Occasionally, I get lazy and use $.get/post, but that's it.
  • Sebastian Carroll
    Sebastian Carroll over 10 years
    I couldn't get this to work on IE8. I get the error "Object doesn't support this property or method" for the first line in the function. I don't think the createHTMLDocument function exists
  • John Slegers
    John Slegers over 10 years
    What exactly is your use case? If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following : (1) var div=document.createElement("DIV"); (2) div.innerHTML = markup; (3) result = div.childNodes; --- This gives you a collection of childnodes and should work not just in IE8 but even in IE6-7.
  • Sebastian Carroll
    Sebastian Carroll over 10 years
    Thanks for the alternate option, I'll try it if I need to do this again. For now though I used the JQuery solution above.
  • omninonsense
    omninonsense about 9 years
    @stage I'm a little bit late to the party, but you should be able to use document.createElement('html'); to preserve the <head> and <body> tags.
  • Ry-
    Ry- almost 9 years
    Note that, like (the simple) innerHTML, this will execute an <img>’s onerror.
  • Munawwar
    Munawwar over 8 years
    An issue with this is that, html like '<td>test</td>' would ignore the td in the document.body context (and only create 'test' text node).OTOH, if it used internally in a templating engine then the right context would be available.
  • Munawwar
    Munawwar over 8 years
    Also BTW, IE 11 supports createContextualFragment.
  • JMRC
    JMRC over 8 years
    I was afraid for ID collision, but this did not happen. Just in case another newbie was wondering the same thing.
  • aendra
    aendra over 8 years
    Worth noting that in 2016 DOMParser is now widely supported. caniuse.com/#feat=xml-serializer
  • Toothbrush
    Toothbrush over 7 years
    @SebastianCarroll Note that IE8 doesn't support the trim method on strings. See stackoverflow.com/q/2308134/3210837.
  • John Slegers
    John Slegers over 7 years
    @Toothbrush : Is IE8 support still relevant at the dawn of 2017?
  • Toothbrush
    Toothbrush over 7 years
    @JohnSlegers For some companies, yes.
  • symbiont
    symbiont almost 7 years
    it looks like you are putting an html element within an html element
  • Jeff Laughlin
    Jeff Laughlin over 6 years
    If you want to write forward-compatible code that also works on old browsers you can polyfill the <template> tag. It depends on custom elements which you may also need to polyfill. In fact you might just want to use webcomponents.js to polyfill custom elements, templates, shadow dom, promises, and a few other things all at one go.
  • ceving
    ceving over 6 years
    Worth noting that all relative links in the created document are broken, because the document gets created by inheriting the documentURL of window, which most likely differs from the URL of the string.
  • Glitch
    Glitch over 6 years
    In my case, my page needs to repeat this activity over and over again. Would repeatedly creating a dummy dom element get memory intensive? Is there a way to dispose of the dom element once the innerHtml has been extracted? I'm not quite familiar with how the browser handles javascript variables.
  • Jack G
    Jack G about 6 years
    Worth noting that you should only call new DOMParser once and then reuse that same object throughout the rest of your script.
  • Justin
    Justin over 5 years
    I'm concerned this is upvoted as the top answer. The parse() solution below is more reusable and elegant.
  • Justin
    Justin over 5 years
    The parse() solution below is more reusable and specific to HTML. This is nice if you need an XML document, however.
  • sea26.2
    sea26.2 about 5 years
    The question was how to parse with JS - not Chrome or Firefox
  • Shariq Musharaf
    Shariq Musharaf about 5 years
    How can I display this parsed webpage on a dialog box or something? I was not able to find solution for that
  • Leif Arne Storset
    Leif Arne Storset over 4 years
    Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.
  • Leif Arne Storset
    Leif Arne Storset over 4 years
    Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input.
  • Hardik Mandankaa
    Hardik Mandankaa almost 4 years
    how to convert HTML to string using javascript?
  • Rene Koch
    Rene Koch over 3 years
    This does not answer the question. OP wants to extract links.
  • Nathan B
    Nathan B over 3 years
    That's not an ideal solution, since if the html string contains images for example, the browser will try to fetch them! So this is a side effect of the parsing that we may not want: See this example: var html = "<div><img src=\"img_girl.jpg\" width=\"500\" height=\"600\"></div>"; var div = document.createElement('div'); div.innerHTML = html;
  • Timo
    Timo about 3 years
    @HardikMandankaa html IS a string, so no need to convert. It is already there as string rep.
  • Timo
    Timo about 3 years
    Since chrome 31, text/html is possible. I wonder if there is anybody using this version of chrome or lower..
  • Timo
    Timo about 3 years
    I can imagine that html is twice there, it is created and then used in innerHtml.