Does the .NET Framework offer methods to parse an HTML string?

Solution 1

Relevant API: HtmlDocument, HtmlDocument.GetElementById, and HtmlElement (all in System.Windows.Forms).

You can create a dummy HTML document in a WebBrowser control and write your string into it:

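// Requires: using System; using System.Windows.Forms;
// WebBrowser is a Windows Forms control and must be created on an STA thread.
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);                 // load an empty page so that w.Document is available
HtmlDocument doc = w.Document;
// Write the HTML string to parse into the empty document.
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);                              // number of direct children of <body>
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));    // look up an element by id, read an attribute
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();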

Output:

2

file:///c:

about:myUrl

Editing elements:

HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
        "src=\"c:\"",
        "src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));

Output:

file:///d:

Solution 2

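Assuming you're dealing with well-formed HTML, you could simply treat the text as an XML document. System.Xml covers what you're asking for: XmlDocument loads the string, XPath queries find elements by id or by tag, and XmlElement lets you read, modify, and create attributes.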

http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
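As a rough sketch of this approach (not a definitive implementation), the snippet below loads a well-formed HTML string into an XmlDocument, finds an element by id with XPath, and sets an attribute on it. The class name XmlHtmlSketch and its method names are made up for illustration, and the code assumes the markup parses as XML and that the id contains no single quote.

```csharp
using System;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class XmlHtmlSketch
{
    // Sets one attribute on the element with the given id and returns the
    // serialized document. LoadXml throws XmlException if the markup is not
    // well-formed XML.
    public static string SetAttributeById(string markup, string id, string name, string value)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        // XmlDocument.GetElementById only works when a DTD declares ID attributes,
        // so query the plain "id" attribute with XPath instead.
        // Assumption: id contains no single quote.
        var element = doc.SelectSingleNode("//*[@id='" + id + "']") as XmlElement;
        if (element == null)
            throw new ArgumentException("No element with id '" + id + "'.");

        element.SetAttribute(name, value); // creates the attribute if it does not exist
        return doc.OuterXml;
    }

    // Counts the elements with the given tag name anywhere in the document.
    public static int CountByTag(string markup, string tagName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);
        return doc.GetElementsByTagName(tagName).Count;
    }
}
```

For example, XmlHtmlSketch.SetAttributeById(html, "myImage", "src", "d:") returns the document with the image's src changed, and passing an attribute name that is not present yet adds it.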

Solution 3

Aside from the HTML Agility Pack and porting HtmlUnit to C#, the options that sound solid are:

  • Most obviously, use regex (System.Text.RegularExpressions).
  • Use an XML parser (since HTML is a system of tags, treat it like an XML document?).
  • LINQ?
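To make the options in this list more concrete, here is a minimal sketch of the regex route and of LINQ (taken here to mean LINQ to XML, which is an assumption about what the list item intends). The class and method names are hypothetical. The regex version assumes double-quoted attribute values and is easy to break with real-world HTML; the LINQ to XML version only works when the markup is well-formed XML.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class ParsingOptionsSketch
{
    // Regex option: read one attribute from the first tag that carries the given id.
    // Assumes double-quoted attribute values; returns null when nothing matches.
    public static string GetAttributeWithRegex(string html, string id, string attribute)
    {
        // Step 1: find the whole opening tag that contains id="...".
        Match tag = Regex.Match(
            html,
            "<\\w+\\b[^>]*\\bid\\s*=\\s*\"" + Regex.Escape(id) + "\"[^>]*>",
            RegexOptions.IgnoreCase);
        if (!tag.Success) return null;

        // Step 2: pull the requested attribute value out of that tag.
        Match attr = Regex.Match(
            tag.Value,
            "\\b" + Regex.Escape(attribute) + "\\s*=\\s*\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        return attr.Success ? attr.Groups[1].Value : null;
    }

    // LINQ to XML option: find an element by tag name and id, then set an attribute.
    // XDocument.Parse throws XmlException if the markup is not well-formed XML.
    public static string SetAttributeWithLinq(string markup, string tagName, string id,
                                              string name, string value)
    {
        XDocument doc = XDocument.Parse(markup);
        XElement element = doc.Descendants(tagName)
            .FirstOrDefault(e => (string)e.Attribute("id") == id);
        if (element == null)
            throw new ArgumentException("No <" + tagName + "> element with id '" + id + "'.");

        element.SetAttributeValue(name, value); // adds the attribute if it is missing
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

The regex helper only reads values; editing markup reliably with regular expressions gets complicated quickly, which is one reason a real parser is usually the safer choice.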

One thing I do know is that parsing HTML as XML may cause you to run into a few problems. XML and HTML are not the same, and valid HTML is often not well-formed XML. Read about it: here

Also, here is a post about Linq vs Regex.
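The difference between HTML and XML is easy to check for a given string. The sketch below (the class and method names are hypothetical) simply tries to load the markup with XmlDocument and reports whether that succeeds; an unclosed tag such as <br> is fine in HTML but makes the XML parser throw.

```csharp
using System;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class WellFormedCheck
{
    // Returns true if the markup can be loaded as XML, false if the parser rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Typical causes: unclosed tags such as <br>, or HTML entities
            // such as &nbsp; that are not declared in XML.
            return false;
        }
    }
}
```

With this helper, "<html><body>line1 <br> line2</body></html>" is rejected, while the same string with <br/> is accepted, so a check like this can decide whether the XML-based approaches are usable for a particular input.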

Author: Jelly Ama

Updated on June 22, 2022

Comments

  • Jelly Ama
    Jelly Ama almost 2 years

    Knowing that I can't use the HTML Agility Pack, only straight .NET, say I have a string that contains some HTML that I need to parse and edit in the following ways:

    • find specific controls in the hierarchy by id or by tag
    • modify (and ideally create) attributes of the elements found

    Are there methods available in .NET to do so?

  • porges
    porges about 12 years
    This requires you to load up the document in a Winforms control.
  • Jelly Ama
    Jelly Ama about 12 years
    Correct me if I'm wrong, but this requires a WebBrowser control and doesn't allow for direct HTML string parsing.
  • Alexei Levenkov
    Alexei Levenkov about 12 years
    @JellyAma, yes, but isn't that what you seem to want in "modify (and ideally create) attributes of those found elements"?
  • Jelly Ama
    Jelly Ama about 12 years
    @Alexei, most importantly, I need to parse strings of HTML.
  • L.B
    L.B about 12 years
    Try to parse this well-formed HTML: <html><body>line1 <br> line2</body></html>
  • L.B
    L.B about 12 years