How to extract content from a <div> tag in Java
Solution 1
I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:
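import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Place this method inside any class.
public static void main(String[] args) {
    String html = "<html><head/><body><div class=\"main-content\">" +
            "<div class=\"sub-content\">Sub content here</div>" +
            "Main content here </div></body></html>";
    Document document = Jsoup.parse(html);
    Elements divs = document.select("div");
    for (Element div : divs) {
        // ownText() returns only this element's text, not its children's
        System.out.println(div.ownText());
    }
}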
In response to comment: if you want to put the content of the div elements into an array of Strings, you can simply do:
String[] divsTexts = new String[divs.size()];
for (int i = 0; i < divs.size(); i++) {
    divsTexts[i] = divs.get(i).ownText();
}
In response to comment: if you have nested elements and you want to get the own text of each element, then you can use the jQuery-style multiple selector syntax. Here's an example:
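import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Place this method inside any class.
public static void main(String[] args) {
    String html = "<html><head/><body><div class=\"main-content\">" +
            "<div class=\"sub-content\">" +
            "<p>a paragraph <b>with some bold text</b></p>" +
            "Sub content here</div>" +
            "Main content here </div></body></html>";
    Document document = Jsoup.parse(html);
    // "div, p, b" selects every div, p and b element
    Elements divs = document.select("div, p, b");
    for (Element div : divs) {
        System.out.println(div.ownText());
    }
}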
The code above will parse the following HTML:
<html>
  <head />
  <body>
    <div class="main-content">
      <div class="sub-content">
        <p>a paragraph <b>with some bold text</b></p>
        Sub content here</div>
      Main content here</div>
  </body>
</html>
and print the following output:
Main content here
Sub content here
a paragraph
with some bold text
Solution 2
<div class="main-content" id="mainCon">
<div class="sub-content" id="subCon">Sub content here</div>
Main content here </div>
Given this markup, if you want to get the result you mentioned, use document.getElementById("mainCon").innerHTML (this is browser JavaScript, not Java).
It returns "Main content here" together with the markup of the sub div, so you still have to parse that part out.
Similarly, for the sub div you can use the same snippet, i.e. document.getElementById("subCon").innerHTML
kyo21
Updated on June 15, 2022
Comments
-
kyo21 almost 2 years
I have a serious problem. I would like to extract the content from tags such as:
<div class="main-content"> <div class="sub-content">Sub content here</div> Main content here </div>
The output I would expect is:
Sub content here
Main content here
I've tried using regex, but the result isn't so impressive. Using:
Pattern.compile("<div>(\\S+)</div>");
would return all the strings before the first </div> tag.
So, could anyone help me, please?
-
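To illustrate why the regex approach struggles with this markup, here is a minimal sketch using only java.util.regex. The class name DivRegexDemo, the helper firstMatch, and the second, looser pattern are assumptions added for illustration; they are not from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivRegexDemo {
    // Sample markup from the question.
    static final String HTML = "<div class=\"main-content\"> "
            + "<div class=\"sub-content\">Sub content here</div> "
            + "Main content here </div>";

    // Returns the first captured group, or null when the pattern does not match.
    // (Hypothetical helper, for illustration only.)
    static String firstMatch(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Pattern from the question: a literal "<div>" leaves no room for
        // attributes, and \S+ cannot cross the spaces in the text.
        System.out.println(firstMatch("<div>(\\S+)</div>", HTML));

        // A looser pattern (an assumption, not from the post) does match,
        // but it stops at the first </div>, so the nesting is lost.
        System.out.println(firstMatch("<div[^>]*>(.*?)</div>", HTML));
    }
}
```

On the sample markup the first pattern finds nothing, and the looser one captures everything up to the first closing tag, including the opening tag of the inner div. A regex has no notion of which </div> closes which <div>, which is why the answers above reach for an HTML parser instead.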
Ankit about 13 years
@kyo21: Yes, you can give each div an id manually, and you can also assign ids dynamically with JavaScript.
-
kyo21 about 13 years
Err... what if I would like to add each <div>'s content to an array? Any suggestions? Thanks.
-
MarcoS about 13 years
@kyo21: I added some code to my answer to address your question about putting the div contents into an array.
-
kyo21 about 13 years
Oh, sorry, I need your explanation again. I use the method element.text() to get all the text inside the <div> tag. I've added a <p> tag to the div content, but the result is: - Sub content here Main content here - Sub content here. How could this happen?
-
MarcoS about 13 years
@kyo21: text() gets the combined text of this element and all its children. See the jsoup javadocs.
-
MarcoS almost 13 years
@kyo21: I'm not quite sure what HTML you have now, but note that if you have two nested div tags each containing text, and you use the text() method, then the text of the innermost div tag is printed twice: once when you call text() on the outer div, and once when you call it on the inner div (the for loop processes all div tags). I hope this helps.
-
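The duplication described above can be seen side by side in a small sketch. It assumes jsoup is on the classpath; the class name TextVsOwnText is hypothetical, and the markup is the sample from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsOwnText {
    // Sample markup from the question: a div nested inside another div.
    static final String HTML = "<div class=\"main-content\">"
            + "<div class=\"sub-content\">Sub content here</div>"
            + "Main content here </div>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);
        for (Element div : document.select("div")) {
            // text() includes the text of nested elements, so the inner
            // div's text shows up for the outer div and again for itself.
            System.out.println("text():    " + div.text());
            // ownText() only includes text that is a direct child of this div.
            System.out.println("ownText(): " + div.ownText());
        }
    }
}
```

The exact whitespace that text() puts between the inner and outer text may differ between jsoup versions, so the sketch prints the values rather than asserting a particular string.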
kyo21 almost 13 years
I'm working on extracting content from news web pages. Most of them have nested div tags containing text tags: <p>, <b>, etc. Using the text() method would obviously get all the text content, but it prints the innermost <div> content twice, as you said. How can I prevent this? Do you have any idea, or any other method? The result I would expect: - Main content here - Sub content here. Thanks.
-
MarcoS almost 13 years
@kyo21: well, why don't you use ownText() as in my example? That returns only the text of the element, and not that of its nested elements. So, you can select the elements that you're interested in, process them one by one, and call ownText() to retrieve their own text (if any).
-
kyo21 almost 13 years
Yeah, that's true, but ownText() will not return the content inside <p> and <b> tags. Err... or should I just remove the <p> and <b> tags, so they won't bother me anymore? Anyway, thanks for your help ... :)
-
MarcoS almost 13 years
@kyo21: oh that's easy: Jsoup supports CSS/jQuery selector syntax, so you can write document.select("div, p, b") :) I've edited my answer to address your comment. I hope this helps.
-
kyo21 almost 13 years
@MarcoS out of curiosity, does jsoup have any method to remove a particular tag? For example, I would like to remove the tag <div class="...."> inside my HTML. I guess it will help me a lot in the future.
-
MarcoS almost 13 years
@kyo21: as far as I know, yes: have a look at remove in the jsoup javadoc.
-
kyo21 almost 13 years
Ah, there is... It says this method will also remove all the child nodes, right? So if I want to remove only the <div> tag in '<div>text</div>', it's impossible for me to retain the "text", isn't it?
-
MarcoS almost 13 years
@kyo21: I don't remember: try ... and if not, browse the javadoc to look at what other methods do :)
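As a follow-up to the last question, here is a minimal sketch that contrasts remove() with unwrap(). It assumes a jsoup version that provides Elements.unwrap(), which is not mentioned in the thread; the class name RemoveVsUnwrap and the two helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RemoveVsUnwrap {
    // Deletes every div together with everything inside it.
    static String afterRemove(String html) {
        Document document = Jsoup.parse(html);
        document.select("div").remove();
        return document.body().html();
    }

    // Drops only the div tags themselves; their children move up to the parent.
    // Assumes a jsoup version that has Elements.unwrap().
    static String afterUnwrap(String html) {
        Document document = Jsoup.parse(html);
        document.select("div").unwrap();
        return document.body().html();
    }

    public static void main(String[] args) {
        String html = "<div>text</div>";
        System.out.println("remove(): [" + afterRemove(html) + "]");
        System.out.println("unwrap(): [" + afterUnwrap(html) + "]");
    }
}
```

With remove() the text disappears along with the div, whereas unwrap() should leave "text" in the body, so under that assumption it is possible to drop the tag and keep its content.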