Scrape data from HTML pages using Java, output to database

13,854

Solution 1

First, get familiar with an HTML DOM parser for Java, such as JTidy. It will let you extract the data you want from an HTML file. Once you have that data, you can use JDBC to insert it into the database.
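To make the parse-then-store flow concrete, here is a minimal sketch. It is not JTidy code: it uses the JDK's built-in XML DOM parser as a stand-in, so it only handles well-formed XHTML (cleaning up messy real-world HTML is the part a parser like JTidy would do first). The `links` table, its columns, and the class and method names are all made up for illustration.

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkScraperSketch {

    /** One scraped hyperlink: target URL plus visible text (hypothetical type). */
    static final class Link {
        final String href;
        final String text;

        Link(String href, String text) {
            this.href = href;
            this.text = text;
        }
    }

    // Step 1: parse the markup into a DOM and collect every <a href="...">.
    // Assumption: the input is well-formed XHTML; tag-soup HTML will be rejected.
    static List<Link> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<Link> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) {
                    links.add(new Link(a.getAttribute("href"),
                                       a.getTextContent().trim()));
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Step 2: write the extracted rows with a batched PreparedStatement.
    // Assumes a hypothetical table:
    //   CREATE TABLE links (href VARCHAR(2048), link_text VARCHAR(512))
    static void saveLinks(Connection conn, List<Link> links) throws SQLException {
        String sql = "INSERT INTO links (href, link_text) VALUES (?, ?)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Link link : links) {
                stmt.setString(1, link.href);
                stmt.setString(2, link.text);
                stmt.addBatch();
            }
            stmt.executeBatch();
        }
    }
}
```

You would get the `Connection` from `DriverManager.getConnection(...)` with whatever JDBC driver and URL your database uses. Using a `PreparedStatement` rather than string concatenation matters here, because scraped text can contain quotes and other characters that would otherwise break (or inject into) the SQL.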

It might be tempting to use regular expressions for this job, but don't. HTML is not a regular language, so regexes are not the right tool.
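A small demonstration of why, using only `java.util.regex`. The class and method names are made up for illustration. Both the lazy and the greedy version of an obvious "grab the div contents" pattern go wrong as soon as tags nest or repeat, because a regex cannot keep track of which closing tag belongs to which opening tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPitfallSketch {

    // Lazy quantifier: stops at the FIRST </div>, even if it closes an inner div.
    static String lazyDiv(String html) {
        Matcher m = Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Greedy quantifier: runs to the LAST </div>, swallowing sibling divs.
    static String greedyDiv(String html) {
        Matcher m = Pattern.compile("<div>(.*)</div>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Nested divs: the lazy match is cut off in the middle of the outer div.
        System.out.println(lazyDiv("<div>outer <div>inner</div> tail</div>"));
        // prints: outer <div>inner

        // Sibling divs: the greedy match merges two separate elements.
        System.out.println(greedyDiv("<div>a</div><div>b</div>"));
        // prints: a</div><div>b
    }
}
```

A DOM parser avoids both problems because it builds the tree structure first, and you then ask for elements by name or position.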

Solution 2

I am running a scraper using JSoup. I'm a noob, yet I found it very intuitive and easy to work with. It is also capable of parsing a wide range of sources: HTML, XML, RSS, etc.

I experimented with HtmlUnit, with little to no success.

Author by

Tanith

Updated on June 30, 2022

Comments

  • Tanith, almost 2 years

    I need to know how to create a scraper (in Java) to gather data from HTML pages and output it to a database. I don't have a clue where to start, so any information you can give me on this would be great. Also, you can't be too basic or simple here. Thanks :)

  • radai, over 14 years
    I've done these things before, and I found JTidy to be a little fragile. I'd go with TagSoup: home.ccil.org/~cowan/XML/tagsoup