How do I preserve line breaks when using jsoup to convert html to plain text?
Solution 1
The real solution that preserves linebreaks should be like this:
public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // makes html() preserve linebreaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // mark each <br> and <p> with a literal "\n" sequence
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // turn the literal "\n" sequences into real newlines
    String s = document.html().replaceAll("\\\\n", "\n");
    // strip all remaining tags without reformatting the text
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}
It satisfies the following requirements:
- if the original html contains newline(\n), it gets preserved
- if the original html contains br or p tags, they get translated to newline(\n).
Solution 2
With
Jsoup.parse("A\nB").text();
the output is
"A B"
and not
A
B
To keep the breaks, I swap each <br> for a placeholder before parsing and restore it afterwards:
String descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
String text = descrizione.replaceAll("br2n", "\n");
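The placeholder br2n only works if it never occurs in the input. One of the comments suggests extending the marker until the input no longer contains it. Below is a minimal sketch of that idea using only the Java standard library; the class and method names are hypothetical, and the tag-stripping step (for example Jsoup.parse(marked).text()) is left out and would sit between markBreaks and restoreBreaks.

```java
import java.util.regex.Matcher;

public class PlaceholderSketch {
    // Hypothetical helper: grow the marker until it does not occur in the input.
    static String uniquePlaceholder(String html, String base) {
        String marker = base;
        while (html.contains(marker)) {
            marker = marker + "1";
        }
        return marker;
    }

    // Swap every <br> variant for the marker (same regex as in the answer).
    static String markBreaks(String html, String marker) {
        return html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(marker));
    }

    // After the tags have been stripped, turn the markers into real newlines.
    static String restoreBreaks(String text, String marker) {
        return text.replace(marker, "\n");
    }

    public static void main(String[] args) {
        String html = "uses br2n already<br/>second line";
        String marker = uniquePlaceholder(html, "br2n");
        String marked = markBreaks(html, marker);
        System.out.println(restoreBreaks(marked, marker));
    }
}
```

This reduces the chance of a collision but is not a full guarantee for every input, so a DOM-based approach may still be preferable.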
Solution 3
Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
We're using this method here:
public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist,
                           Document.OutputSettings outputSettings)
By passing it Whitelist.none() we make sure that all HTML is removed.
By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.
Solution 4
As of jsoup v1.11.2, we can use Element.wholeText().
String cleanString = Jsoup.parse(htmlString).wholeText();
user121196's answer still works, but wholeText() preserves the alignment of the text.
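To make the difference concrete, here is a small sketch that runs text() and wholeText() on the same input. It assumes jsoup 1.11.2 or later on the classpath; the class and method names are hypothetical. The input contains a newline inside a paragraph and no <br> tags, since how <br> is rendered by wholeText() may depend on the jsoup version.

```java
import org.jsoup.Jsoup;

public class WholeTextSketch {
    // text() normalizes whitespace, so the newline collapses to a single space.
    static String normalized(String html) {
        return Jsoup.parse(html).text();
    }

    // wholeText() returns the text nodes as they are, newline included.
    static String whole(String html) {
        return Jsoup.parse(html).wholeText();
    }

    public static void main(String[] args) {
        String html = "<p>A\nB</p>";
        System.out.println(normalized(html)); // A B
        System.out.println(whole(html));      // A and B on separate lines
    }
}
```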
Solution 5
Try this by using jsoup:
public static String cleanPreserveLineBreaks(String bodyHtml) {
    // get pretty-printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabling prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
Billy
Updated on July 08, 2022
Comments
-
Billy almost 2 years
I have the following code:
import org.jsoup.Jsoup;

public class NewClass {
    public String noTags(String str) {
        return Jsoup.parse(str).text();
    }

    public static void main(String args[]) {
        String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">"
            + "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
        NewClass text = new NewClass();
        System.out.println((text.noTags(strings)));
    }
}
And I have the result:
hello world yo googlez
But I want a line break in the output:
hello world
yo googlez
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.
If there's a <br> in the markup I parse, how can I get a line break in my resulting output?
-
Robin Green about 13 years
Edit your text: there is no line break showing up in your question. In general, please read the preview of your question before posting it, to check that everything shows up right.
-
Eduardo almost 13 years
I asked the same question (without the jsoup requirement) but I still do not have a good solution: stackoverflow.com/questions/2513707/…
-
Jang-Ho Bae over 4 years
See @zeenosaur's answer.
-
Billy about 13 years
<p><b>hello world</b></p> <p><br /><b>yo</b> <a href="google.com">googlez</a></p> but I need hello world yo googlez (without HTML tags)
-
SRG almost 12 years
Indeed this is an easy palliative, but IMHO this should be fully handled by the jsoup library itself (which at this time has a few disturbing behaviors like this one; otherwise it's a great library!).
-
Mike Samuel almost 11 years
Doesn't jsoup give you a DOM? Why not just replace all <br> elements with text nodes containing new lines and then call .text(), instead of doing a regex transform that will cause incorrect output for some strings like <div title=<br>'not an attribute'></div>
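The regex concern in this comment can be checked with plain Java, no jsoup required. The sketch below applies the same pattern as Solution 2 to an ordinary <br> and to the string from the comment; the class and method names are hypothetical. It only shows what the regex does to the raw string, not how a parser would later interpret the result.

```java
public class RegexPitfallSketch {
    // Same substitution as in Solution 2: every <br ...> becomes the marker.
    static String markBreaks(String html) {
        return html.replaceAll("(?i)<br[^>]*>", "br2n");
    }

    public static void main(String[] args) {
        // An ordinary line break tag is replaced as intended.
        System.out.println(markBreaks("line one<BR />line two"));
        // The regex also rewrites the "<br>" that sits inside the div's start tag.
        System.out.println(markBreaks("<div title=<br>'not an attribute'></div>"));
    }
}
```

Because the regex has no notion of where a tag starts or ends, it changes the markup before the parser ever sees it, which is the case the comment warns about.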
-
Vito Meuli over 10 years
The answer by @MircoAttocchi works best for me. This solution leaves entities as such, and that's not good! E.g. the entity for "è" in "La porta è aperta" remains unchanged, whereas I want "La porta è aperta".
-
Dr NotSoKind about 10 years
Good one, but you don't need recursion, just add this line: while(dirtyHTML.contains(linebreakerString)) linebreakerString = linebreakerString + "1";
-
Chris6647 about 10 years
Ah, yes. Completely true. Guess my mind got caught up in for once actually being able to use recursion :)
-
DD. over 9 years
br2nl is not the most helpful or accurate method name
-
adarshr over 9 years
This should be the only correct answer. All others assume that only br tags produce new lines. What about any other block element in HTML such as div, p, ul etc? All of them introduce new lines too.
-
user2043553 over 9 years
This is the best answer. But how about
for (Element e : document.select("br")) e.after(new TextNode("\n", ""));
appending a real newline and not the sequence \n? See Node::after() and Elements::append() for the difference. The replaceAll() is not needed in this case. Similar for p and other block elements.
-
Steve Waters about 9 years
Nice, but where does that "descrizione" come from?
-
karth500 almost 9 years
@user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply StringEscapeUtils.unescapeHtml(...) from Apache Commons to the output of the jsoup clean.
-
KajMagnus over 8 years
This answer doesn't return plain text; it returns HTML with newlines inserted.
-
KajMagnus over 8 years
I think you should test isBlock in tail(node, depth) instead, and append \n when leaving the block rather than when entering it? I'm doing that (i.e. using tail) and that works fine. However, if I use head like you do, then this: <p>line one<p>line two ends up as a single line.
-
JohnC over 8 years
With this solution, the html "<html><body><div>line 1</div><div>line 2</div><div>line 3</div></body></html>" produced the output: "line 1line 2line 3" with no new lines.
-
Malcolm Smith almost 7 years
See github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/… for a comprehensive answer to this problem.
-
Grumblesaurus over 6 years
This doesn't work for me; <br>'s aren't creating line breaks.
-
Ashu almost 6 years
Nice, it works for me with a small change: new Document.OutputSettings().prettyPrint(true)
-
enigma969 almost 6 years
"descrizione" represents the variable the plain text gets assigned to
-
user3338098 almost 5 years
<p>Line one</p>Line 2 should NOT be \nLine one Line 2; newlines have to be inserted before AND after the relevant block tags. And it's missing MANY block tags such as <div> and <li>.
-
Andrei Volgin over 4 years
This solution leaves "&nbsp;" as text instead of parsing it into a space.
-
Andrei Volgin over 4 years
You need to prepend a new line to <div> tags as well. Otherwise, if a div follows <a> or <span> tags, it will not be on a new line.
-
Pshemo over 2 years
new NodeTraversor(nodeVisitor).traverse(element); no longer works on newer jsoup versions (currently 1.14.3). Now all traverse methods in NodeTraversor are static, so they should be called like NodeTraversor.traverse(nodeVisitor, element);.
-
Mustafa almost 2 years
Yes this does a good job.
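Several comments above point toward a traversal-based approach: check isBlock in tail() so the newline is added when leaving a block, and call the static NodeTraversor.traverse(visitor, element). The following is a rough sketch of how those suggestions might fit together, not the code of any answer on this page. It assumes a jsoup version with the static traverse method; the class name is hypothetical, only <br> is handled as an inline break, and trailing newlines are trimmed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class BlockAwareTextSketch {
    static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        NodeVisitor visitor = new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the node's text with whitespace normalized
                    out.append(((TextNode) node).text());
                } else if ("br".equals(node.nodeName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // add the newline when leaving a block element, not when entering it
                if (node instanceof Element && ((Element) node).isBlock()) {
                    out.append('\n');
                }
            }
        };
        NodeTraversor.traverse(visitor, Jsoup.parse(html).body());
        // body is itself a block element, so drop the trailing newlines
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>line one<p>line two"));
        System.out.println(toPlainText("<div>line 1</div><div>line 2</div><div>line 3</div>"));
    }
}
```

Whitespace between tags in real-world HTML shows up as extra text nodes, so a production version would likely need additional cleanup of blank runs.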