C#.net Use HTMLDocument from Console?

Solution 1

As an alternative, you could use the free Html Agility Pack library. That can parse HTML and will let you query it with LINQ. I used an older version for a project at home and it worked great.

EDIT: You may also want to use the WebClient or WebRequest classes to download the web page. See my blog post on Web scraping in .NET. (Note that I haven't tried this in a console app.)
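
As a rough sketch of the LINQ approach, assuming the HtmlAgilityPack NuGet package is referenced. The `LinkExtractor` class and `ExtractLinks` method are hypothetical names for illustration, and the HTML is parsed from an inline sample string so the example runs without network access:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of Html Agility Pack.
static class LinkExtractor
{
    // Returns the href value of every <a> element that has a non-empty one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses from a string; no download involved

        return doc.DocumentNode
                  .Descendants("a")
                  .Select(a => a.GetAttributeValue("href", ""))
                  .Where(href => href.Length > 0)
                  .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Inline sample markup (an assumption for this example).
        string html = "<p><a href='/one'>One</a> <a>no href</a> <a href='/two'>Two</a></p>";

        foreach (string href in LinkExtractor.ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

To work on a live page, the string could come from `WebClient.DownloadString` instead of the inline sample.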

Solution 2

Add the [STAThread] attribute to your Main method:

    [STAThread]
    static void Main(string[] args)
    {
    }

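That should fix the ThreadStateException: the WebBrowser ActiveX control can only be instantiated on a single-threaded apartment (STA) thread.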
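
As the comments point out, `[STAThread]` alone is not enough for `WebBrowser` in a console app: without a message pump, `DocumentCompleted` never fires and `Main` returns almost immediately. A minimal sketch, assuming a reference to System.Windows.Forms and using `Application.Run()` as the pump. The class name `BrowserProgram` and the URL are placeholders, not something from the original answer:

```csharp
using System;
using System.Windows.Forms;

static class BrowserProgram
{
    [STAThread] // the WebBrowser ActiveX control requires a single-threaded apartment
    static void Main()
    {
        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true; // avoid script-error dialogs

            browser.DocumentCompleted += (sender, e) =>
            {
                // The event can fire more than once (e.g. for frames);
                // wait until the whole document is ready.
                if (browser.ReadyState != WebBrowserReadyState.Complete) return;

                foreach (HtmlElement a in browser.Document.GetElementsByTagName("a"))
                {
                    Console.WriteLine(a.GetAttribute("href"));
                }

                Application.ExitThread(); // stop the message loop so Main can return
            };

            browser.Navigate("http://google.com");
            Application.Run(); // pump messages until ExitThread is called
        }
    }
}
```

This is noticeably more ceremony than parsing the downloaded HTML directly, which is why a parser library is usually the simpler choice in a console app.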

Author: mpen

Updated on June 04, 2022

Comments

  • mpen
    mpen almost 2 years

    I'm trying to use System.Windows.Forms.HtmlDocument in a console application. First, is this even possible? If so, how can I load a page from the web into it? I was trying to use WebBrowser, but it throws:

    Unhandled Exception: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.

    There seems to be a severe lack of tutorials on the HTMLDocument object (or Google is just turning up useless results).


    Just discovered mshtml.HTMLDocument.createDocumentFromUrl, but that throws:

    Unhandled Exception: System.Runtime.InteropServices.COMException (0x80010105): The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
       at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
       at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
       at iget.Program.Main(String[] args)

    What the heck? All I want is a list of <a> tags on a page. Why is this so hard?


    For those that are curious, here's the solution I came up with, thanks to TrueWill:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Net;
    using System.IO;
    using HtmlAgilityPack;
    
    namespace iget
    {
        class Program
        {
            static void Main(string[] args)
            {
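                // Download the page and let Html Agility Pack parse it;
                // no WebBrowser control or STA thread is needed.
                WebClient wc = new WebClient();
                HtmlDocument doc = new HtmlDocument();
                doc.Load(wc.OpenRead("http://google.com"));

                // SelectNodes returns null (not an empty list) when nothing matches.
                HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) return;

                foreach (HtmlNode a in anchors)
                {
                    Console.WriteLine(a.Attributes["href"].Value);
                }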
            }
        }
    }
    
  • mpen
    mpen over 14 years
    It's not XHTML. RegEx is a hack... I have no idea how malformed the HTML I'll be working with is. I need the links (hrefs) in the anchors.
  • Wil P
    Wil P over 14 years
    Why is regex a hack? It's easy enough to get the hrefs too. Plus, regex is fast.
  • TrueWill
    TrueWill over 14 years
    As for why regex (in this case) is a hack, see codinghorror.com/blog/archives/001311.html
  • mpen
    mpen over 14 years
    Not familiar with LINQ, but a quick glance over that front page mentions XPATH, which is good! Might give this a go if chris's solution doesn't work.
  • mpen
    mpen over 14 years
    I don't think it solves the problem though. I've created a WebBrowser object and then called Navigate on google.com. I've attached a DocumentCompleted event handler so I know when it's done loading, but it never fires. In fact, the program runs to completion almost immediately, which tells me it's not waiting for the page to load at all. I don't think it likes being single-threaded.
  • TrueWill
    TrueWill over 14 years
    @Mark: You don't have to use LINQ - when I was using the library that feature hadn't been added. It was still pretty easy. You could create an XPathNavigator, call Select on that and pass in an XPath string, then iterate over the result. SelectSingleNode is the other major method I used.
  • TrueWill
    TrueWill over 14 years
    Looks like you'd also need a message pump. See stackoverflow.com/questions/764869/c-console-app-event-handling
  • mpen
    mpen over 14 years
    I added some code to my question. Works great in a console :)
  • TrueWill
    TrueWill over 14 years
    @Mark: Thanks! Your code is very concise. One aside: It probably isn't relevant in your program, but WebClient is IDisposable.
  • mpen
    mpen over 14 years
    That sounds nasty. Way too much work just to read an HTML doc from the web :) Thanks though.
  • Josh
    Josh over 14 years
    mshtml is definitely not designed for console use. It's been long recommended against using it in server-side applications for the same reasons. HTML agility pack is a great alternative for parsing though.