How do I preserve line breaks when using jsoup to convert html to plain text?

68,299

Solution 1

A solution that preserves line breaks looks like this:

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // Disable pretty-printing so that html() preserves line breaks and spacing.
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // Insert the literal two-character sequence \n as a placeholder.
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newlines, then strip all remaining tags.
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}

It satisfies the following requirements:

  1. If the original HTML contains newlines (\n), they are preserved.
  2. If the original HTML contains br or p tags, they are translated to newlines (\n).

Solution 2

With

Jsoup.parse("A\nB").text();

you get the output

"A B" 

and not

A

B

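To work around this, I replace each <br> with a placeholder token before parsing and turn the token back into a newline afterwards: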

String descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
String text = descrizione.replaceAll("br2n", "\n");
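
A fixed token such as br2n can collide with text that already occurs in the input. One way to avoid that, sketched here with only the Java standard library, is to lengthen the token until the input no longer contains it and only then substitute it for each <br>. The names PlaceholderSketch, uniqueToken and markBreaks are illustrative and not part of jsoup; the marked string would then go through Jsoup.parse(...).text() as above, followed by replacing the token with "\n".

```java
import java.util.regex.Matcher;

final class PlaceholderSketch {

    // Extend the seed until it no longer occurs anywhere in the input.
    static String uniqueToken(String html, String seed) {
        String token = seed;
        while (html.contains(token)) {
            token = token + "1";
        }
        return token;
    }

    // Replace every <br>, <BR>, <br/> and so on with the token (same regex as above).
    static String markBreaks(String html, String token) {
        return html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(token));
    }

    public static void main(String[] args) {
        String html = "uses br2n already<br>second line<BR/>third";
        String token = uniqueToken(html, "br2n");
        System.out.println(token);
        System.out.println(markBreaks(html, token));
    }
}
```

This only guards against the token appearing in the raw input; it does not address malformed markup where a regex can misread what counts as a <br> tag.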

Solution 3

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed. (jsoup 1.14.1 renamed Whitelist to Safelist, so on newer versions use Safelist.none().)

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

Solution 4

As of jsoup 1.11.2, we can use Element.wholeText().

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() also preserves the alignment of the text.
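
To make the difference concrete, here is a small sketch that runs the same input through text() and wholeText(). It assumes jsoup 1.11.2 or later on the classpath; WholeTextSketch and its two helper methods are illustrative names, not jsoup API.

```java
import org.jsoup.Jsoup;

final class WholeTextSketch {

    // text() normalises whitespace, so a newline in the source becomes a space.
    static String normalised(String html) {
        return Jsoup.parse(html).text();
    }

    // wholeText() returns the text nodes without whitespace normalisation.
    static String whole(String html) {
        return Jsoup.parse(html).wholeText();
    }

    public static void main(String[] args) {
        String html = "A\nB";
        System.out.println(normalised(html));
        System.out.println(whole(html));
    }
}
```

How wholeText() treats <br> elements may differ between jsoup versions, so it is worth checking against the version in use.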

Solution 5

Try this with jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabled prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
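
Several comments point out that handling only <br> and <p> misses other block elements such as <div> and <li>, and that line breaks belong both before and after a block. One possible way to address this is to walk the parsed tree with a NodeVisitor and build the text by hand, as in the minimal sketch here. BlockAwareTextSketch, toPlainText, atLineStart and breakLine are hypothetical names, not jsoup API. The sketch relies on Element.isBlock() to decide what counts as a block and collapses whitespace inside text nodes, so it is not suitable for <pre> content.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

final class BlockAwareTextSketch {

    // Hypothetical helper: one newline per <br>, and a line break around each block element.
    static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() decodes entities and collapses runs of whitespace.
                    String text = ((TextNode) node).text();
                    // Skip whitespace-only nodes that would only indent a new line.
                    if (!(text.trim().isEmpty() && atLineStart(out))) {
                        out.append(text);
                    }
                } else if (node instanceof Element) {
                    Element el = (Element) node;
                    if (el.tagName().equals("br")) {
                        out.append('\n');
                    } else if (el.isBlock()) {
                        breakLine(out); // break before the block starts
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && ((Element) node).isBlock()) {
                    breakLine(out); // break after the block ends
                }
            }
        });
        return out.toString().trim();
    }

    static boolean atLineStart(StringBuilder out) {
        return out.length() == 0 || out.charAt(out.length() - 1) == '\n';
    }

    // Append a newline unless the output is empty or already ends with one.
    static void breakLine(StringBuilder out) {
        if (!atLineStart(out)) {
            out.append('\n');
        }
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<div>line 1</div><div>line 2</div><div>line 3</div>"));
    }
}
```

Because the text is collected from text nodes rather than from serialized HTML, entities such as &egrave; come out decoded. Adjacent blocks are separated by a single newline, not a blank line; that is a design choice of this sketch.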
Author: Billy

Updated on July 08, 2022

Comments

  • Billy
    Billy almost 2 years

    I have the following code:

     public class NewClass {
         public String noTags(String str){
             return Jsoup.parse(str).text();
         }
    
    
         public static void main(String args[]) {
             String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
             "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
    
             NewClass text = new NewClass();
             System.out.println(text.noTags(strings));
         }
     }
    

    And I have the result:

    hello world yo googlez
    

    But I want to break the line:

    hello world
    yo googlez
    

    I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

    If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

    • Robin Green
      Robin Green about 13 years
      edit your text - there is no line break showing up in your question. In general please read the preview of your question before posting it, to check everything is showing up right.
    • Eduardo
      Eduardo almost 13 years
      I asked the same question (without the jsoup requirement) but I still do not have a good solution: stackoverflow.com/questions/2513707/…
    • Jang-Ho Bae
      Jang-Ho Bae over 4 years
      see @zeenosaur 's answer.
  • Billy
    Billy about 13 years
    <p><b>hello world</b></p> <p><br /><b>yo</b> <a href="google.com">googlez</a></p> but i need hello world yo googlez (without html tags)
  • SRG
    SRG almost 12 years
    Indeed this is an easy palliative, but IMHO this should be fully handled by the Jsoup library itself (which has at this time a few disturbing behaviors like this one - otherwise it's a great library !).
  • Mike Samuel
    Mike Samuel almost 11 years
    Doesn't JSoup give you a DOM? Why not just replace all <br> elements with text nodes containing new lines and then call .text() instead of doing a regex transform that will cause incorrect output for some strings like <div title=<br>'not an attribute'></div>
  • Vito Meuli
    Vito Meuli over 10 years
    the answer by @MircoAttocchi works best for me. this solution leaves entities as such...that's not good! i.e. "La porta &egrave; aperta" remains unchanged, whereas I want "La porta è aperta".
  • Dr NotSoKind
    Dr NotSoKind about 10 years
    Good one, but you don't need recursion, just add this line: while(dirtyHTML.contains(linebreakerString)) linebreakerString = linebreakerString + "1";
  • Chris6647
    Chris6647 about 10 years
    Ah, yes. Completely true. Guess my mind got caught up in for once actually being able to use recursion :)
  • DD.
    DD. over 9 years
    br2nl is not the most helpful or accurate method name
  • adarshr
    adarshr over 9 years
    This should be the only correct answer. All others assume that only br tags produce new lines. What about any other block element in HTML such as div, p, ul etc? All of them introduce new lines too.
  • user2043553
    user2043553 over 9 years
    This is the best answer. But how about for (Element e : document.select("br")) e.after(new TextNode("\n", "")); which appends a real newline and not the sequence \n? See Node::after() and Elements::append() for the difference. The replaceAll() is not needed in this case. The same applies to p and other block elements.
  • Steve Waters
    Steve Waters about 9 years
    Nice, but where does that "descrizione" come from?
  • karth500
    karth500 almost 9 years
    @user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply StringEscapeUtils.unescapeHtml(...) Apache commons to the output from the Jsoup clean.
  • KajMagnus
    KajMagnus over 8 years
    This answer doesn't return plain text; it returns HTML with newlines inserted.
  • KajMagnus
    KajMagnus over 8 years
    I think you should test if isBlock in tail(node, depth) instead, and append \n when leaving the block rather than when entering it? I'm doing that (i.e. using tail) and that works fine. However if I use head like you do, then this: <p>line one<p>line two ends up as a single line.
  • JohnC
    JohnC over 8 years
    With this solution, the html "<html><body><div>line 1</div><div>line 2</div><div>line 3</div></body></html>" produced the output: "line 1line 2line 3" with no new lines.
  • Malcolm Smith
    Malcolm Smith almost 7 years
    See github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/… for a comprehensive answer to this problem.
  • Grumblesaurus
    Grumblesaurus over 6 years
    This doesn't work for me; <br>'s aren't creating line breaks.
  • Ashu
    Ashu almost 6 years
    Nice, it works for me with a small change: new Document.OutputSettings().prettyPrint(true)
  • enigma969
    enigma969 almost 6 years
    "descrizione" represents the variable the plain text gets assigned to
  • user3338098
    user3338098 almost 5 years
    <p>Line one</p>Line 2 should NOT be \nLine one Line 2 newlines have to be inserted before AND after the relevant block tags. and it's missing MANY block tags such as <div> and <li>.
  • Andrei Volgin
    Andrei Volgin over 4 years
    This solution leaves "&nbsp;" as text instead of parsing them into a space.
  • Andrei Volgin
    Andrei Volgin over 4 years
    You need to prepend a new line to <div> tags as well. Otherwise, if a div follows <a> or <span> tags, it will not be on a new line.
  • Pshemo
    Pshemo over 2 years
    new NodeTraversor(nodeVisitor).traverse(element); no longer works on newer Jsoup versions (currently 1.14.3). Now all traverse methods in NodeTraversor are static so should be called like NodeTraversor.traverse(nodeVisitor, element);.
  • Mustafa
    Mustafa almost 2 years
    Yes this does a good job.