Scrape data from HTML pages using Java, output to database
Solution 1
First, you need to get familiar with an HTML DOM parser for Java, such as JTidy. This will help you extract the data you want from an HTML file. Once you have the data, you can use JDBC to insert it into the database.
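As a rough sketch of the parse-then-store flow described above, the example below walks a DOM tree to collect links and then writes them through JDBC. It makes several assumptions that are not from the answer: it uses the JDK's built-in XML parser, which only accepts well-formed XHTML (real-world pages would need a tolerant parser such as JTidy), and the class `LinkScraper`, the helper `Link`, and the table `links(url, title)` are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkScraper {

    /** A scraped link: target URL plus visible text (hypothetical helper type). */
    static final class Link {
        final String url;
        final String text;

        Link(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    /** Parses well-formed XHTML into a DOM and collects every <a> that has an href. */
    static List<Link> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<Link> links = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(new Link(a.getAttribute("href"), a.getTextContent().trim()));
            }
        }
        return links;
    }

    /** Inserts the links as one JDBC batch; assumes a table links(url, title) exists. */
    static void saveLinks(Connection conn, List<Link> links) throws SQLException {
        String sql = "INSERT INTO links (url, title) VALUES (?, ?)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Link link : links) {
                stmt.setString(1, link.url);
                stmt.setString(2, link.text);
                stmt.addBatch();
            }
            stmt.executeBatch();
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>See <a href=\"https://example.com/a\">Page A</a> and "
                + "<a href=\"https://example.com/b\">Page B</a>.</p></body></html>";
        for (Link link : extractLinks(page)) {
            System.out.println(link.url + " -> " + link.text);
        }
        // Storing the rows needs a JDBC driver on the classpath and a Connection
        // for your own database, which is then passed to saveLinks(conn, links).
    }
}
```

The extraction and the database write are kept in separate methods so the parser can be swapped out without touching the JDBC code. A `PreparedStatement` is used so that scraped text is never concatenated into the SQL string.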
It might be tempting to use regular expressions for this job, but don't. HTML is not a regular language, so a regex cannot reliably match nested elements.
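To illustrate the point about regular expressions, here is a small sketch of how a plausible-looking pattern goes wrong on nested markup. The class name `RegexPitfall`, the method name, and the sample HTML are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPitfall {

    /** Returns what a naive regex captures as the content of the first <div>. */
    static String firstDivContent(String html) {
        Matcher m = Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div>outer <div>inner</div> tail</div>";
        // The match ends at the first </div>, which closes the inner element,
        // so the captured "content" is cut short and holds an unclosed <div>.
        System.out.println(firstDivContent(html)); // prints: outer <div>inner
    }
}
```

Making the quantifier greedy does not fix this; it just moves the failure to pages with sibling elements. A DOM parser tracks the nesting for you.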
Solution 2
I am running a scraper built with JSoup. I'm a beginner, yet I found it very intuitive and easy to work with. It can also parse a wide range of sources: HTML, XML, RSS, etc.
I also experimented with HtmlUnit, with little to no success.
Tanith
Updated on June 30, 2022
Comments
-
Tanith almost 2 years
I need to know how to create a scraper (in Java) to gather data from HTML pages and output it to a database. I do not have a clue where to start, so any information you can give me on this would be great. Also, you can't be too basic or simple here. Thanks :)
-
radai over 14 years
I've done this before, and I found JTidy to be a little fragile. I'd go with TagSoup: home.ccil.org/~cowan/XML/tagsoup