How to decode HTML codes using Java?
Solution 1
Use the methods provided by Apache Commons Lang:
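// Commons Lang 2.x package; Lang 3 moved this to org.apache.commons.lang3 with unescapeHtml4
import org.apache.commons.lang.StringEscapeUtils;
// ...
// beforeDecoding holds the extracted, still-encoded field
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);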
Solution 2
Do not try to solve everything with regexps.
While you can do some parts that way, such as replacing entities, the much better approach is to use a robust HTML parser.
See this question: "RegEx match open tags except XHTML self-contained tags" for why this is a bad idea to do with the regexp Swiss army chainsaw. Seriously, read this question and the top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;.
And if you are really unlucky, some web site relies on some browsers' capability to guess character meanings. &#153;, for example, is not valid, yet many browsers will interpret it as ™.
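As a rough illustration of why this gets messy (a sketch only, not a substitute for a library): the hypothetical EntityForms.resolve helper below takes the text between & and ; and returns a code point. The three-entry name table and the single remapping for 153 are assumptions picked for this example; the real tables are far larger.

```java
import java.util.Map;

final class EntityForms {
    // Tiny illustrative table (assumption); HTML defines far more named references.
    private static final Map<String, Integer> NAMED = Map.of(
            "lambda", 0x03BB,
            "trade", 0x2122,
            "amp", 0x26);

    /** Resolves the text between '&' and ';' to a code point, or -1 if unknown. */
    static int resolve(String ref) {
        try {
            if (ref.startsWith("#x") || ref.startsWith("#X")) {
                return remapC1(Integer.parseInt(ref.substring(2), 16)); // hexadecimal form
            }
            if (ref.startsWith("#")) {
                return remapC1(Integer.parseInt(ref.substring(1), 10)); // decimal form
            }
        } catch (NumberFormatException e) {
            return -1; // e.g. "#", "#x" or non-digit garbage
        }
        return NAMED.getOrDefault(ref, -1); // named form
    }

    // Browsers read references in 0x80..0x9F as Windows-1252 bytes.
    // Only the 0x99 (trade mark sign) case is handled in this sketch.
    static int remapC1(int codePoint) {
        return codePoint == 0x99 ? 0x2122 : codePoint;
    }

    public static void main(String[] args) {
        for (String ref : new String[] {"lambda", "#955", "#x3BB", "#153"}) {
            String decoded = new String(Character.toChars(resolve(ref)));
            System.out.println("&" + ref + "; -> " + decoded);
        }
    }
}
```

Three spellings, one character, plus a special case for invalid input that browsers tolerate anyway, and this still covers only a sliver of what real pages contain.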
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
- Feed string into a robust HTML parser
- Get parsed (and fully decoded) string back
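The two steps above can be sketched with the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). Picking that parser is an assumption made to keep the example free of dependencies; it is dated (it targets HTML 3.2), and a maintained third-party parser is likely a better fit for real pages. The names HtmlText and extract are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class HtmlText {
    /** Returns the text content of an HTML fragment, with character references decoded. */
    static String extract(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data); // the parser hands over already-decoded text
            }
        };
        try {
            // true = ignore any charset declared in the markup; the input is already a String
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Paging Lucene&#39;s search results</p>"));
    }
}
```

The callback collects all text nodes into one string; for field extraction you would track the start and end tags as well (handleStartTag and handleEndTag on the same callback) and keep only the text inside the element you care about.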
user
Updated on August 19, 2020
Comments
-
user almost 4 years
Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
I need to extract paragraphs (like the title in StackOverflow) from an HTML file.
I can use regular expressions in Java to extract the fields I need, but I have to decode the fields obtained.
EXAMPLE
field extracted:
Paging Lucene&#39;s search results (with &#39; between e and s)
field after decoding:
Paging Lucene's search results
Is there any class in Java that will allow me to convert these HTML codes?
-
user over 11 years
I need to extract from HTML files with the same structure and tags (like Wikipedia), so I think regex is a good approach.
-
Has QUIT--Anony-Mousse over 11 years
@MrCarAsus: NO IT IS NOT. Use an HTML parser, and DOM for extraction. That is what they are for!
-
Has QUIT--Anony-Mousse over 11 years
Try using DBpedia, btw. It is an already parsed version of Wikipedia.
-
user over 11 years
And do you know of a parsed version of StackOverflow? I tried regex on StackOverflow HTML pages and it works: I extract the title and answers with a set of regexps applied to the HTML.
-
Has QUIT--Anony-Mousse over 11 years
Use an HTML parser. Every time you mangle HTML with a regexp parsing attempt, god kills a kitten.
-
Has QUIT--Anony-Mousse over 11 years
Seriously, read up on why using regexps to parse HTML is wrong. HTML is a Chomsky Type 2 (context-free) language, while regular expressions only describe Type 3 (regular) languages. You need a Type 2 parser.
-
Has QUIT--Anony-Mousse over 11 years
Plus there are plenty of HTML parsers around. Why don't you just try using them? The StackOverflow data dump is also quite well pre-parsed, btw; you can get a lot of information out of it with a simple XML pull parser, without having to do the parsing yourself.
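A minimal sketch of the pull-parser idea, using the StAX API (javax.xml.stream) from the JDK. It assumes the dump stores each post as a row element with a Title attribute; check that against the actual dump files before relying on it. DumpTitles and titles are hypothetical names. The XML parser decodes character references in attribute values on its own, so no separate unescaping step is needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DumpTitles {
    /** Collects the Title attribute of every row element that has one. */
    static List<String> titles(String xml) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Pull the next event; only element starts named "row" are of interest.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    String title = reader.getAttributeValue(null, "Title");
                    if (title != null) {
                        found.add(title); // already decoded by the XML parser
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "<posts><row Id=\"1\" Title=\"Paging Lucene&#39;s search results\"/></posts>";
        System.out.println(titles(sample));
    }
}
```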
-
Mike Samuel over 11 years
Re "&#153; for example is not valid": isn't it perfectly valid, though possibly interpreted inconsistently by user-agents? Section 4.6 of HTML 5 puts no bounds on the code points that can be represented by decimal numeric character references, and that code point is a valid control character code point.
-
Has QUIT--Anony-Mousse over 11 years
@MikeSamuel The page says in number 3: "not ... in the range U+0080–U+009F". 0x0099 is in this range.
-
Mike Samuel over 11 years
@Anony-Mousse, Ah, thanks.