C#.net Use HTMLDocument from Console?
Solution 1
As an alternative, you could use the free Html Agility Pack library. That can parse HTML and will let you query it with LINQ. I used an older version for a project at home and it worked great.
EDIT: You may also want to use the WebClient or WebRequest classes to download the web page. See my blog post on Web scraping in .NET. (Note that I haven't tried this in a console app.)
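For illustration, here is a minimal sketch of the LINQ-style querying mentioned above, assuming the HtmlAgilityPack package is referenced. The LinkExtractor class, the GetLinks helper, and the sample HTML string are hypothetical names made up for this example; in a real program the string could come from WebClient.DownloadString.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses a string; tolerant of malformed markup

        return doc.DocumentNode
                  .Descendants("a")                            // every <a> element
                  .Select(a => a.GetAttributeValue("href", "")) // "" when href is missing
                  .Where(href => href.Length > 0)
                  .ToList();
    }
}

static class Program
{
    static void Main()
    {
        // Made-up input for the sketch.
        string html = "<p><a href='/one'>1</a> <a>no href</a> <a href='/two'>2</a></p>";
        foreach (string href in LinkExtractor.GetLinks(html))
            Console.WriteLine(href);
    }
}
```

Descendants returns a plain IEnumerable of nodes, so the usual LINQ operators apply without any XPath.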
Solution 2
Add the [STAThread] attribute to your Main method:
[STAThread]
static void Main(string[] args)
{
}
That should fix it.
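A rough sketch of how the attribute might be used with WebBrowser in a console app. It assumes a reference to System.Windows.Forms, and it also assumes a message loop is needed for DocumentCompleted to fire, since a console app does not run one by default. Treat it as one possible arrangement, not the definitive way to do it.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // WebBrowser wraps an ActiveX control, which requires a single-threaded apartment
    static void Main()
    {
        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true; // avoid script error dialogs

            browser.DocumentCompleted += (sender, e) =>
            {
                // The event can fire more than once (e.g. for frames); wait until fully loaded.
                if (browser.ReadyState != WebBrowserReadyState.Complete) return;

                foreach (HtmlElement a in browser.Document.GetElementsByTagName("a"))
                    Console.WriteLine(a.GetAttribute("href"));

                Application.ExitThread(); // end the message loop so Main can return
            };

            browser.Navigate("http://google.com");
            Application.Run(); // pump messages until ExitThread is called
        }
    }
}
```

Without Application.Run, Main would return right after Navigate and the handler would never get a chance to run.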
Comments
-
mpen almost 2 years
I'm trying to use System.Windows.Forms.HTMLDocument in a console application. First, is this even possible? If so, how can I load up a page from the web into it? I was trying to use WebBrowser, but it's telling me:
Unhandled Exception: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.
There seems to be a severe lack of tutorials on the HTMLDocument object (or Google is just turning up useless results).
Just discovered mshtml.HTMLDocument.createDocumentFromUrl, but that throws me:
Unhandled Exception: System.Runtime.InteropServices.COMException (0x80010105): The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
   at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
   at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
   at iget.Program.Main(String[] args)
What the heck? All I want is a list of <a> tags on a page. Why is this so hard?
For those that are curious, here's the solution I came up with, thanks to TrueWill:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace iget
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient wc = new WebClient();
            HtmlDocument doc = new HtmlDocument();
            doc.Load(wc.OpenRead("http://google.com"));
            foreach (HtmlNode a in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                Console.WriteLine(a.Attributes["href"].Value);
            }
        }
    }
}
-
mpen over 14 years
It's not XHTML. RegEx is a hack... I have no idea how malformed the HTML I'll be working with is. I need the links (hrefs) in the anchors.
-
Wil P over 14 years
Why is regex a hack? It's easy enough to get the hrefs too. Plus, regex is fast.
-
TrueWill over 14 years
As for why regex (in this case) is a hack, see codinghorror.com/blog/archives/001311.html
-
mpen over 14 years
Not familiar with LINQ, but a quick glance over that front page mentions XPath, which is good! Might give this a go if chris's solution doesn't work.
-
mpen over 14 years
I don't think it solves the problem though. I've created a WebBrowser object, and then I Navigate to google.com. I've attached a DocumentCompleted event handler so I know when it's done loading, but it never gets fired. In fact, the program just runs to completion almost immediately, which tells me it's not waiting for the page to load at all. I don't think it likes being single-threaded.
-
TrueWill over 14 years
@Mark: You don't have to use LINQ. When I was using the library, that feature hadn't been added yet, and it was still pretty easy. You could create an XPathNavigator, call Select on that and pass in an XPath string, then iterate over the result. SelectSingleNode is the other major method I used.
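A minimal sketch of the XPathNavigator route described in this comment, assuming an Html Agility Pack build that includes XPath support. The XPathLinks class and its two helper methods are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.XPath;
using HtmlAgilityPack;

static class XPathLinks
{
    // Hypothetical helper: collects hrefs by iterating an XPath selection.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator it = nav.Select("//a[@href]");

        var links = new List<string>();
        while (it.MoveNext())
            links.Add(it.Current.GetAttribute("href", "")); // "" = no namespace
        return links;
    }

    // Hypothetical helper: SelectSingleNode returns the first match, or null if none.
    public static string FirstLink(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//a[@href]");
        return first == null ? null : first.Attributes["href"].Value;
    }
}
```

Calling XPathLinks.GetLinks on a page's HTML would give the same hrefs as a SelectNodes("//a[@href]") loop, just through the standard System.Xml.XPath types.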
-
TrueWill over 14 years
Looks like you'd also need a message pump. See stackoverflow.com/questions/764869/c-console-app-event-handling
-
mpen over 14 years
I added some code to my question. Works great in a console :)
-
TrueWill over 14 years
@Mark: Thanks! Your code is very concise. One aside: It probably isn't relevant in your program, but WebClient is IDisposable.
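To illustrate the IDisposable point, here is a sketch of the same idea with the WebClient and its stream wrapped in using blocks. LinkReader and ReadLinks are hypothetical names introduced for this example. The null check is there because SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;

static class LinkReader
{
    // Hypothetical helper: parses HTML from a stream and returns every href found.
    public static List<string> ReadLinks(Stream html)
    {
        var doc = new HtmlDocument();
        doc.Load(html);

        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links; // no matches: SelectNodes gives null

        foreach (HtmlNode a in anchors)
            links.Add(a.Attributes["href"].Value);
        return links;
    }
}

static class Program
{
    static void Main()
    {
        // Both objects are disposed when the block exits, even if an exception is thrown.
        using (var wc = new WebClient())
        using (Stream stream = wc.OpenRead("http://google.com"))
        {
            foreach (string href in LinkReader.ReadLinks(stream))
                Console.WriteLine(href);
        }
    }
}
```

For a short-lived console program the difference is small, but the using form costs nothing and carries over safely to longer-running code.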
-
mpen over 14 years
That sounds nasty. Way too much work just to read an HTML doc from the web :) Thanks though.
-
Josh over 14 years
mshtml is definitely not designed for console use. Using it in server-side applications has long been discouraged for the same reasons. Html Agility Pack is a great alternative for parsing though.