How to extract content from a <div> tag in Java

18,749

Solution 1

I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:

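import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DivText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // Select every div, nested ones included
        Elements divs = document.select("div");
        for (Element div : divs) {
            // ownText() returns only this element's text, not that of nested elements
            System.out.println(div.ownText());
        }
    }
}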
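
To see why the regex route is brittle, here is a small JDK-only sketch that runs the pattern quoted in the question against a few inputs. The class name RegexDivDemo and the helper firstMatch are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDivDemo {
    // The pattern from the question: a bare <div>, one run of non-whitespace, then </div>.
    static final Pattern DIV = Pattern.compile("<div>(\\S+)</div>");

    // Hypothetical helper: returns the first captured group, or null when nothing matches.
    static String firstMatch(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches only the trivial case: no attributes, no spaces in the text.
        System.out.println(firstMatch("<div>text</div>"));
        // No match: the literal "<div>" does not fit a tag that carries attributes.
        System.out.println(firstMatch("<div class=\"sub-content\">Sub content here</div>"));
        // No match: \S+ cannot span the spaces between words.
        System.out.println(firstMatch("<div>Sub content here</div>"));
    }
}
```

Loosening the pattern fixes one case and tends to break another (nested divs in particular), which is the reason for handing the job to a parser.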

In response to comment: if you want to put the content of the div elements into an array of Strings you can simply do:

    String[] divsTexts = new String[divs.size()];
    for (int i = 0; i < divs.size(); i++) {
        divsTexts[i] = divs.get(i).ownText();
    }

In response to comment: if you have nested elements and you want the own text of each element, then you can use the jQuery-style multiple selector syntax. Here's an example:

public static void main(String[] args) {
    String html = "<html><head/><body><div class=\"main-content\">" +
            "<div class=\"sub-content\">" +
            "<p>a paragraph <b>with some bold text</b></p>" +
            "Sub content here</div>" +
            "Main content here </div></body></html>";
    Document document = Jsoup.parse(html);
    Elements divs = document.select("div, p, b");
    for (Element div : divs) {
        System.out.println(div.ownText());
    }
}

The code above will parse the following HTML:

<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>

and print the following output:

Main content here
Sub content here
a paragraph
with some bold text
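
The difference between text() and ownText() comes up repeatedly in the comments, so a side-by-side sketch may help. It assumes jsoup is on the classpath; the class name TextVsOwnText and the helper mainDiv are invented for the example, and the markup is modelled on the question rather than taken from a real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TextVsOwnText {
    // Illustrative markup modelled on the question.
    static final String HTML =
            "<div class=\"main-content\">" +
            "<div class=\"sub-content\">Sub content here</div>" +
            " Main content here </div>";

    // Hypothetical helper: parses the markup and returns the outer div.
    static Element mainDiv() {
        Document document = Jsoup.parse(HTML);
        return document.select("div.main-content").first();
    }

    public static void main(String[] args) {
        Element main = mainDiv();
        // text(): this element's text plus the text of every descendant.
        System.out.println("text():    " + main.text());
        // ownText(): only the text nodes that are direct children of this element.
        System.out.println("ownText(): " + main.ownText());
    }
}
```

Calling text() on every div in a loop is what prints the inner div's text twice: once as part of the outer div and once for the inner div itself.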

Solution 2

<div class="main-content" id="mainCon">
    <div class="sub-content" id="subCon">Sub content here</div>
 Main content here </div>

Given this markup, to get the result you mentioned:

Use document.getElementById("mainCon").innerHTML. It returns "Main content here" together with the sub div's markup, so you still have to parse that result.

Similarly, for the sub div you can use the same snippet: document.getElementById("subCon").innerHTML
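
The innerHTML property belongs to the browser DOM, so the snippet above is JavaScript rather than Java. A rough Java counterpart, assuming jsoup as in Solution 1, could use getElementById together with html(). The class name ByIdDemo and the helper innerHtml are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ByIdDemo {
    // The markup from Solution 2, including its id attributes.
    static final String HTML =
            "<div class=\"main-content\" id=\"mainCon\">" +
            "<div class=\"sub-content\" id=\"subCon\">Sub content here</div>" +
            " Main content here </div>";

    // Hypothetical helper: inner HTML of the element with the given id, or null if absent.
    static String innerHtml(String id) {
        Document document = Jsoup.parse(HTML);
        Element element = document.getElementById(id);
        return element == null ? null : element.html();
    }

    public static void main(String[] args) {
        // Includes the nested div's markup, so it still needs further processing.
        System.out.println(innerHtml("mainCon"));
        // Contains only text, because the sub div has no child elements.
        System.out.println(innerHtml("subCon"));
    }
}
```

As with innerHTML, the outer div's result still contains the nested markup, so ownText() is usually the shorter path when only the text is wanted.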

Author by

kyo21

Updated on June 15, 2022

Comments

  • kyo21
    kyo21 almost 2 years

    I have a serious problem. I would like to extract the content from a tag such as:

    <div class="main-content">
        <div class="sub-content">Sub content here</div>
          Main content here </div>
    

    The output I would expect is:

    Sub content here
    Main content here

    I've tried using regex, but the result isn't impressive. Using:

    Pattern.compile("<div>(\\S+)</div>");
    

    would return all the strings before the first </div> tag.
    So, could anyone help me, please?

  • Ankit
    Ankit about 13 years
    @kyo21: Yes, you give each div an id manually, and you can also assign it dynamically with JavaScript.
  • kyo21
    kyo21 about 13 years
    Err... what if I would like to add each <div> content into an array? Any suggestions? Thanks
  • MarcoS
    MarcoS about 13 years
    @kyo21: I added some code to my answer to address your question about putting the div contents into an array.
  • kyo21
    kyo21 about 13 years
    Oh, sorry, I need your explanation again. I use the method element.text() to get all the text inside the <div> tag, and I've added a <p> tag to the div content, but the result is: - Sub content here Main content here - Sub content here. How could this happen?
  • MarcoS
    MarcoS about 13 years
    @kyo21: text() gets the combined text of this element and all its children. See jsoup javadocs
  • MarcoS
    MarcoS almost 13 years
    @kyo21: I'm not quite sure what HTML you have now, but note that if you have two nested div tags each containing text, and you use the text() method, then the text of the innermost div tag is printed twice: once when you call text() on the outer div, and once when you call it on the inner div (the for loop processes all div tags). I hope this helps.
  • kyo21
    kyo21 almost 13 years
    I'm working on extracting content from news web pages. Most of them have nested div tags containing text tags: <p>, <b>, etc. Using the text() method would obviously get all the text content, but it prints the innermost <div> contents twice, as you said. How can I prevent this? Do you have any ideas or other methods? The result I would expect: - Main content here - Sub content here. Thanks
  • MarcoS
    MarcoS almost 13 years
    @kyo21: well, why don't you use ownText() as in my example? That returns only the text of the element, and not that of its nested elements. So, you can select the elements that you're interested in, process them one by one, and call ownText() to retrieve their own text (if any).
  • kyo21
    kyo21 almost 13 years
    Yeah, that's true, but ownText() will not return the content inside <p> and <b> tags. Err... or should I just remove the <p> and <b> tags so they don't bother me anymore? Anyway, thanks for your help ... :)
  • MarcoS
    MarcoS almost 13 years
    @kyo21: oh that's easy: Jsoup supports CSS/jquery selector syntax, so you can write document.select("div, p, b") :) I've edited my answer to address your comment. I hope this helps.
  • kyo21
    kyo21 almost 13 years
    @MarcoS out of curiosity, does jsoup have any method to remove a particular tag? For example, I would like to remove the tag <div class="...."> inside my HTML. I guess it will help me a lot in future
  • MarcoS
    MarcoS almost 13 years
    @kyo21: as far as I know, yes: have a look at remove in the jsoup javadoc
  • kyo21
    kyo21 almost 13 years
    Ah, there is... it says this method will also remove all the child nodes, right? If I want to remove only the <div> tag in '<div>text</div>', it's impossible for me to retain the "text", isn't it?
  • MarcoS
    MarcoS almost 13 years
    @kyo21: I don't remember: try ... and if not, browse the javadoc to look at what other methods do :)
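
The last question in the thread (dropping a tag while keeping its text) is left open. As a sketch, assuming a jsoup version that provides unwrap() alongside remove(), the two behave differently. The class name UnwrapVsRemove, its two helpers, and the "wrapper" CSS class are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UnwrapVsRemove {
    // Illustrative fragment; the class name "wrapper" is invented.
    static final String HTML = "<p>intro</p><div class=\"wrapper\">kept words</div>";

    // remove(): deletes the matched elements together with everything inside them.
    static String afterRemove() {
        Document document = Jsoup.parseBodyFragment(HTML);
        document.select("div.wrapper").remove();
        return document.body().html();
    }

    // unwrap(): deletes only the matched tags and moves their children up to the parent.
    static String afterUnwrap() {
        Document document = Jsoup.parseBodyFragment(HTML);
        document.select("div.wrapper").unwrap();
        return document.body().html();
    }

    public static void main(String[] args) {
        System.out.println("remove(): " + afterRemove());
        System.out.println("unwrap(): " + afterUnwrap());
    }
}
```

So remove() does discard the text along with the div, while unwrap() keeps the text and drops only the surrounding tag.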