How do I preserve line breaks when using jsoup to convert html to plain text?

java jsoup

68,299

Solution 1

A solution that actually preserves line breaks looks like this:

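import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // called Safelist in newer jsoup releases

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // prettyPrint(false) makes html() preserve line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // mark each <br> and <p> with the literal two-character sequence \n
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // turn the literal \n markers into real newlines
    String s = document.html().replaceAll("\\\\n", "\n");
    // strip all remaining tags without reformatting the text
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}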

It satisfies the following requirements:

If the original HTML contains a newline (\n), it is preserved.
If the original HTML contains br or p tags, they are translated to newlines (\n).
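
Both requirements rest on one detail that is easy to misread: the Java literal "\\n" is a two-character marker (a backslash followed by n), not a newline, and the regex "\\\\n" matches exactly that pair. The sketch below illustrates this with the JDK alone; the class and method names are made up for illustration and are not part of jsoup.

```java
// Illustrative helper, not part of jsoup: shows how the literal marker
// written by br2nl is turned back into a real newline.
final class LiteralNewlineMarker {

    // The marker appended to <br> and prepended to <p>: backslash + 'n'.
    static final String MARKER = "\\n";

    // Regex form, as used in br2nl: matches a literal backslash followed by 'n'.
    static String restoreWithRegex(String s) {
        return s.replaceAll("\\\\n", "\n");
    }

    // Equivalent literal form: String.replace does no regex parsing.
    static String restoreLiteral(String s) {
        return s.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        String marked = "line one" + MARKER + "line two";
        System.out.println(MARKER.length());
        System.out.println(restoreWithRegex(marked));
        System.out.println(restoreLiteral(marked).equals(restoreWithRegex(marked)));
    }
}
```

One consequence, assuming the page text can itself contain a backslash followed by n (for example in a code sample): that pair is converted to a newline as well, because it cannot be told apart from the marker.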

Solution 2

With

Jsoup.parse("A\nB").text();

you have output

"A B"

and not

A

B

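To work around this, I replace each <br> with a marker before parsing and turn the marker back into a newline afterwards:

// "br2n" is an arbitrary marker that survives text(); it must not occur in the input
String descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
String text = descrizione.replaceAll("br2n", "\n");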
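
The marker approach can be wrapped in a small helper that also guards against the marker already occurring in the input. This is only a sketch: BrPlaceholder, uniqueMarker, textKeepingBreaks and the toText parameter are hypothetical names, and toText stands in for the jsoup call so that the string handling can be shown with the JDK alone.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class BrPlaceholder {

    // Same pattern as above: any <br ...> tag, case-insensitive.
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    // Extend the marker until it no longer occurs anywhere in the input.
    static String uniqueMarker(String html) {
        String marker = "br2n";
        while (html.contains(marker)) {
            marker = marker + "1";
        }
        return marker;
    }

    // toText is an assumed hook for the HTML-to-text step,
    // e.g. h -> Jsoup.parse(h).text() when jsoup is available.
    static String textKeepingBreaks(String html, UnaryOperator<String> toText) {
        String marker = uniqueMarker(html);
        // The marker contains only letters and digits, so it is safe
        // to use as a regex replacement string.
        String marked = BR.matcher(html).replaceAll(marker);
        return toText.apply(marked).replace(marker, "\n");
    }
}
```

With jsoup on the classpath, the call would look like BrPlaceholder.textKeepingBreaks(html, h -> Jsoup.parse(h).text()). Because the regex runs over raw markup rather than a parsed DOM, it also rewrites a <br> sequence that sits inside an attribute value.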

Solution 3

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

Solution 4

As of jsoup v1.11.2, we can use Element.wholeText():

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() returns the text without normalizing whitespace, so the line breaks and spacing of the original are preserved.

Solution 5

Try this by using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabling prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

Billy

Updated on July 08, 2022

Comments

Billy almost 2 years
I have the following code:
```
import org.jsoup.Jsoup;

public class NewClass {
    public String noTags(String str) {
        return Jsoup.parse(str).text();
    }

    public static void main(String[] args) {
        String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
            "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

        NewClass text = new NewClass();
        System.out.println(text.noTags(strings));
    }
}
```
And I have the result:
```
hello world yo googlez
```
But I want to break the line:
```
hello world
yo googlez
```
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?
- Robin Green about 13 years
  
  edit your text - there is no line break showing up in your question. In general please read the preview of your question before posting it, to check everything is showing up right.
- Eduardo almost 13 years
  
  I asked the same question (without the jsoup requirement) but I still do not have a good solution: stackoverflow.com/questions/2513707/…
- Jang-Ho Bae over 4 years
  
  see @zeenosaur 's answer.
Billy about 13 years

<p><b>hello world</b></p> <p><br /><b>yo</b> <a href="google.com">googlez</a></p> but i need hello world yo googlez (without html tags)
SRG almost 12 years

Indeed this is an easy palliative, but IMHO this should be fully handled by the Jsoup library itself (which at this time has a few disturbing behaviors like this one; otherwise it's a great library!).
Mike Samuel almost 11 years

Doesn't JSoup give you a DOM? Why not just replace all <br> elements with text nodes containing new lines and then call .text() instead of doing a regex transform that will cause incorrect output for some strings like <div title=<br>'not an attribute'></div>
Vito Meuli over 10 years

The answer by @MircoAttocchi works best for me. This solution leaves entities as such... that's not good! I.e. "La porta &egrave; aperta" remains unchanged, whereas I want "La porta è aperta".
Dr NotSoKind about 10 years

Good one, but you don't need recursion, just add this line: while(dirtyHTML.contains(linebreakerString)) linebreakerString = linebreakerString + "1";
Chris6647 about 10 years

Ah, yes. Completely true. Guess my mind got caught up in for once actually being able to use recursion :)
DD. over 9 years

br2nl is not the most helpful or accurate method name
adarshr over 9 years

This should be the only correct answer. All others assume that only br tags produce new lines. What about any other block element in HTML such as div, p, ul etc? All of them introduce new lines too.
user2043553 over 9 years

This is the best answer. But how about for (Element e : document.select("br")) e.after(new TextNode("\n", "")); which appends a real newline rather than the sequence \n? See Node::after() and Elements::append() for the difference. The replaceAll() is not needed in this case. The same applies to p and other block elements.
Steve Waters about 9 years

Nice, but where does that "descrizione" come from?
karth500 almost 9 years

@user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply StringEscapeUtils.unescapeHtml(...) Apache commons to the output from the Jsoup clean.
KajMagnus over 8 years

This answer doesn't return plain text; it returns HTML with newlines inserted.
KajMagnus over 8 years

I think you should test if isBlock in tail(node, depth) instead, and append \n when leaving the block rather than when entering it? I'm doing that (i.e. using tail) and that works fine. However if I use head like you do, then this: <p>line one<p>line two ends up as a single line.
JohnC over 8 years

With this solution, the html "<html><body><div>line 1</div><div>line 2</div><div>line 3</div></body></html>" produced the output: "line 1line 2line 3" with no new lines.
Malcolm Smith almost 7 years

See github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/… for a comprehensive answer to this problem.
Grumblesaurus over 6 years

This doesn't work for me; <br>'s aren't creating line breaks.
Ashu almost 6 years

Nice, it works for me with a small change: new Document.OutputSettings().prettyPrint(true)
enigma969 almost 6 years

"descrizione" represents the variable the plain text gets assigned to
user3338098 almost 5 years

<p>Line one</p>Line 2 should NOT be \nLine one Line 2 newlines have to be inserted before AND after the relevant block tags. and it's missing MANY block tags such as <div> and <li>.
Andrei Volgin over 4 years

This solution leaves "&nbsp;" as text instead of converting it to a space.
Andrei Volgin over 4 years

You need to prepend a new line to <div> tags as well. Otherwise, if a div follows <a> or <span> tags, it will not be on a new line.
Pshemo over 2 years

new NodeTraversor(nodeVisitor).traverse(element); no longer works on newer Jsoup versions (currently 1.14.3). Now all traverse methods in NodeTraversor are static so should be called like NodeTraversor.traverse(nodeVisitor, element);.
Mustafa almost 2 years

Yes this does a good job.