Extract and Parse HTML Table using Jsoup

Solution 1

XPath for the column labels: //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong

XPath for the values: //*[@id="phone_details"]/tbody/tr[3]/td[3]

@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.

Replace the numbers (tr[N] / td[N]) with the appropriate values.
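
As a rough sketch of that translation: XPath positions are 1-based, and so is the CSS :nth-child pseudo-class, so tr[3]/td[2] can be carried over more or less directly. The inline HTML, the class name XPathToSelector and the helper cellText are made up for illustration; only the table id comes from the page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class XPathToSelector {

    // Hypothetical stand-in for the real page: same table id, invented rows.
    static final String SAMPLE_HTML =
          "<table id=\"phone_details\">"
        + "<tr><td colspan=\"3\">header</td></tr>"
        + "<tr><td colspan=\"3\">spacer</td></tr>"
        + "<tr><td rowspan=\"2\">Network</td><td><strong>Network Type</strong></td><td>GSM</td></tr>"
        + "<tr><td><strong>Bands</strong></td><td>Quad band</td></tr>"
        + "</table>";

    // Builds the CSS equivalent of //*[@id="phone_details"]/tbody/tr[row]/td[col]
    // and returns the text of the first match, or null if nothing matches.
    static String cellText(Document doc, int row, int col, String tail) {
        String css = "#phone_details > tbody > tr:nth-child(" + row + ")"
                   + " > td:nth-child(" + col + ")" + tail;
        Element cell = doc.selectFirst(css);
        return cell == null ? null : cell.text();
    }

    public static void main(String[] args) {
        // jsoup's HTML parser adds the <tbody> element itself, as browsers do.
        Document doc = Jsoup.parse(SAMPLE_HTML);
        String label = cellText(doc, 3, 2, " > strong");
        String value = cellText(doc, 3, 3, "");
        System.out.println(label + " -> " + value);
    }
}
```

Note that :nth-child counts every sibling element, so this only lines up with the XPath when the row contains nothing but td cells.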

Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can split the text on a delimiter or read after N characters to extract the data.
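
A minimal sketch of the delimiter approach, using only the standard library. The text dump itself is not reproduced here, so the "Label: value" line format, the class name TextDumpParser and the sample lines are all assumptions; adjust the delimiter to whatever the real dump uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TextDumpParser {

    // Splits each line on the first occurrence of the delimiter.
    // Lines without the delimiter (or with an empty label) are skipped.
    static Map<String, String> parse(String text, String delimiter) {
        Map<String, String> specs = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {   // \R matches any line break
            int at = line.indexOf(delimiter);
            if (at <= 0) {
                continue;
            }
            String key = line.substring(0, at).trim();
            String value = line.substring(at + delimiter.length()).trim();
            if (!key.isEmpty()) {
                specs.put(key, value);
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        // Hypothetical dump; the real text layout may differ.
        String dump = "Network Type: GSM\n\nBattery: Li-Ion 820 mAh\nno delimiter here";
        parse(dump, ":").forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```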

Solution 2

Here is an attempt at a solution to your problem:

Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();

for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        Elements tds = row.select("td:not([rowspan])");
        // Skip rows without a label/value pair to avoid an IndexOutOfBoundsException
        if (tds.size() > 1) {
            System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
        }
    }
}

Parsing HTML is tricky, and if the HTML changes, your code needs to change as well.

You need to study the HTML markup to come up with your parsing rules first.

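  • There are multiple tables in the HTML, so you first filter on the correct one: table[id=phone_details]
  • The first few table rows contain only markup for formatting, so skip them: tr:gt(2). jsoup's :gt(n) compares a zero-based sibling index, so this keeps the rows from index 3 onward; adjust the number if the layout differs.
  • Every other row starts with the global description for the content type (a cell with a rowspan attribute), so filter it out: td:not([rowspan])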

For more complex selector options, see http://jsoup.org/cookbook/extracting-data/selector-syntax
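
If you want the values grouped under their category (Network, Battery and so on), one possible variation is to keep the rowspan cell as a group heading instead of discarding it. This is only a sketch under assumptions: the inline HTML is invented to mimic the layout described above, and the class GroupedSpecs and method extract are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GroupedSpecs {

    // Hypothetical markup: a td with rowspan opens each category.
    static final String SAMPLE_HTML =
          "<table id=\"phone_details\">"
        + "<tr><td colspan=\"3\">Motorola L7</td></tr>"
        + "<tr><td rowspan=\"2\">Network</td><td>Network Type</td><td>GSM</td></tr>"
        + "<tr><td>Bands</td><td>Quad band</td></tr>"
        + "<tr><td rowspan=\"1\">Battery</td><td>Type</td><td>Li-Ion</td></tr>"
        + "</table>";

    // Returns category -> (label -> value), preserving document order.
    static Map<String, Map<String, String>> extract(Document doc, String tableId) {
        Map<String, Map<String, String>> groups = new LinkedHashMap<>();
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return groups;
        }
        String current = "";
        for (Element row : table.select("tr")) {
            // A rowspan cell starts a new category that covers the following rows too.
            Element heading = row.selectFirst("td[rowspan]");
            if (heading != null) {
                current = heading.text();
            }
            Elements tds = row.select("td:not([rowspan])");
            if (tds.size() < 2) {
                continue; // formatting-only row, no label/value pair
            }
            groups.computeIfAbsent(current, k -> new LinkedHashMap<>())
                  .put(tds.get(0).text(), tds.get(1).text());
        }
        return groups;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        extract(doc, "phone_details").forEach((category, specs) ->
            specs.forEach((label, value) ->
                System.out.println(category + " -> " + label + " -> " + value)));
    }
}
```

Against the real page you would swap Jsoup.parse(SAMPLE_HTML) for Jsoup.connect(url).get() and check that the rowspan assumption actually holds for its markup.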

Solution 3

This is how I get the data from an HTML table.

// doc is the Document loaded earlier, e.g. via Jsoup.connect(url).get()
String cadena = "";
org.jsoup.nodes.Element tablaRegistros = doc.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        // Elements tds = row.select("td");
        // cadena += tds.get(0).text() + "->" + tds.get(1).text() + " \n";
        // Append every cell of the row, separated by commas
        cadena += column.text() + ",";
    }
    // One output line per table row
    cadena += "\n";
}
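
The loop above leaves a trailing comma on every line and does not escape commas inside cells. A possible variation, sketched here with invented names (TableToCsv, toCsvLines, quote) and inline sample HTML, collects the cells per row and joins them with String.join:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TableToCsv {

    // Wraps a value in double quotes when it contains a comma, quote or line break.
    static String quote(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Returns one CSV line per table row that has at least one td cell.
    static List<String> toCsvLines(Element table) {
        List<String> lines = new ArrayList<>();
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(quote(cell.text())); // empty cells are kept as ""
            }
            if (!cells.isEmpty()) {
                lines.add(String.join(",", cells));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical table; "tableId" mirrors the id used in the answer above.
        String html = "<table id=\"tableId\">"
                    + "<tr><td>a</td><td>b, c</td></tr>"
                    + "<tr><td></td><td>d</td></tr>"
                    + "</table>";
        Element table = Jsoup.parse(html).getElementById("tableId");
        toCsvLines(table).forEach(System.out::println);
    }
}
```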

Solution 4

Here is a generic solution for extracting a table from an HTML page with Jsoup.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractTableDataUsingJSoup {

    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
    }

    public static void extractTableUsingJsoup(String url, String tableId){
        Document doc;
        try {
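            // Jsoup.connect needs the protocol (http:// or https://) in the URL
            doc = Jsoup.connect(url).get();

            // Pass the id of any table on any page and the code below prints its contents.
            // Store the extracted data in appropriate data structures for further processing.
            Element table = doc.getElementById(tableId);
            if (table == null) {
                // No element with this id on the page: nothing to print
                System.err.println("No element with id '" + tableId + "' found at " + url);
                return;
            }

            Elements tds = table.getElementsByTag("td");

            // getElementsByTag also returns the cells of nested tables, so check for nesting if the markup has it
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }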

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Author: KNU

Updated on July 05, 2022

Comments

  • KNU (almost 2 years ago)

    How can I use Jsoup to extract the specification data from this website separately for each row, e.g. Network -> Network Type, Battery, etc.?

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    public class mobilereviews {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
            for (Element table : doc.select("table")) {
                for (Element row : table.select("tr")) {
                    Elements tds = row.select("td");
                    System.out.println(tds.get(0).text());   
                }
            }
        }
    }