How to extract texts between tags

java html parsing jsoup

30,509

Selecting by tag name should do the job:

Elements e = doc.select("p");

Here is a list of all selectors you can use.
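Since the question asks about both p and li tags, here is a minimal sketch of how the two could be selected in one call. It relies on jsoup's comma syntax for grouping selectors; the class name, method name and sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class ParagraphAndListText {
    // Returns the text of every <p> and <li> element, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // A comma groups selectors: match elements that are either p or li.
        Elements matches = doc.select("p, li");
        List<String> texts = new ArrayList<>();
        for (Element el : matches) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String html = "<p>intro</p><ul><li>first</li><li>second</li></ul>";
        System.out.println(extract(html)); // [intro, first, second]
    }
}
```

Keeping one string per element (rather than one big string) is probably more convenient if the next step is tokenizing for an index.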

Suppose you have this HTML:

String html = "<p>some <strong>bold</strong> text</p>";

To get "some bold text" as the result, you can use either of these:

Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String bodyText = doc.body().text(); // some bold text

String pText = p.text(); // some bold text
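One thing to watch for in the snippet above: first() returns null when nothing matches, so calling text() on the result can throw a NullPointerException on pages without a p tag. A small sketch of a guarded version follows; the class and method names are hypothetical, and returning an empty string for "no match" is just one possible choice.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FirstParagraph {
    // Returns the text of the first <p>, or "" if the document has none.
    static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first(); // null when there is no match
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        System.out.println(firstParagraphText("<p>some <strong>bold</strong> text</p>")); // some bold text
        System.out.println(firstParagraphText("<div>no paragraphs</div>").isEmpty());     // true
    }
}
```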

Now suppose you have the following, more complex HTML:

String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";

To get the values from the two p tags, you can do something like this:

Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");

// text() trims each paragraph, and nothing is added between them
String pConcatenated = "";
for (Element x : p) {
  pConcatenated += x.text();
}

System.out.println(pConcatenated); // some textanother p tag
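If you want the paragraphs separated rather than run together, one option is to collect the texts into a list and join them. This is only a sketch: the class name, the method name and its id and separator parameters are made up for illustration, and it returns an empty string when the id is not found.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class JoinedParagraphs {
    // Joins the text of all <p> elements under the element with the given id.
    static String joinParagraphs(String html, String id, String separator) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(id);
        if (container == null) {
            return ""; // no element with that id
        }
        List<String> parts = new ArrayList<>();
        for (Element p : container.getElementsByTag("p")) {
            parts.add(p.text()); // text() trims and normalizes whitespace
        }
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        String html = "<div id=someid><p>some text</p><span>some other text</span>"
                + "<p> another p tag</p></div>";
        System.out.println(joinParagraphs(html, "someid", " ")); // some text another p tag
    }
}
```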

You can find more info here also

Hope this helped

Author by

rena-c

Updated on April 09, 2020

Comments

rena-c about 4 years
I want to extract the text inside the p and li tags of one or more HTML pages, so that I can tokenize each page and build an inverted index for it in order to answer search queries.

How can I get the p tags using jsoup?
```
Elements e = doc.select(""); 
```
What string should I pass as that parameter?

rena-c almost 11 years
Yeah, I know that gets the p tags, as in the cookbook, but with complex structures like or <p class... etc. it doesn't work. It must produce the same result however the HTML is written. How can I do that?

QuangDT over 7 years
Note: when using select("p").first(), it will return the second element if the first element is empty, e.g. for  test, the function will return "test" rather than " ". I had to use getElementsByTag to work around it.