How to decode HTML codes using Java?

java html regex decode

42,334

Solution 1

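Use the methods provided by Apache Commons Lang. The snippet below uses the 2.x API; in Commons Lang 3 the corresponding method is StringEscapeUtils.unescapeHtml4 in the org.apache.commons.lang3 package.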

import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);

Solution 2

Do not try to solve everything by regexp.

While you can handle some parts with it, such as replacing entities, a much better approach is to use a robust HTML parser.

See the question "RegEx match open tags except XHTML self-contained tags" for why parsing HTML with the regexp Swiss army chainsaw is a bad idea. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as &lambda;, &#955; or &#X03bb;
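To make the point concrete, here is a minimal sketch (plain JDK, no libraries) that decodes the decimal and hexadecimal numeric forms of a reference to the same code point. The class and the helper decodeNumeric are hypothetical names made up for this illustration. It deliberately ignores named references such as &lambda;, which need a lookup table of entity names and are one more reason to prefer a library.

```java
final class NumericReferenceSketch {

    /** Hypothetical helper: decodes one numeric reference such as "&#955;" or "&#X03bb;". */
    static String decodeNumeric(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a numeric reference: " + ref);
        }
        // Strip the leading "&#" and the trailing ";".
        String body = ref.substring(2, ref.length() - 1);
        // An "x" or "X" prefix marks a hexadecimal reference; otherwise it is decimal.
        boolean hex = body.startsWith("x") || body.startsWith("X");
        int codePoint = hex
                ? Integer.parseInt(body.substring(1), 16)
                : Integer.parseInt(body);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String lambda = "\u03bb";
        System.out.println(decodeNumeric("&#955;").equals(lambda));   // prints true
        System.out.println(decodeNumeric("&#X03bb;").equals(lambda)); // prints true
    }
}
```

Even this small helper has to know about two notations and a case-insensitive prefix, and it still covers only part of what real pages contain.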

And if you are really unlucky, a website relies on browsers' ability to guess what a character means. &#153;, for example, is not valid, yet many browsers will interpret it as ™.
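The browser behaviour can be imitated in a few lines. The sketch below assumes the reference in question is number 153 (0x99, the value discussed in the comments) and that the lenient interpretation is a Windows-1252 lookup for the range 0x80 to 0x9F. The class and method names are hypothetical, and this is only an approximation of what browsers do, not a full implementation of their rules.

```java
import java.nio.charset.Charset;

final class LegacyReferenceSketch {

    /** Literal reading: the number is taken as a Unicode code point. */
    static String strict(int number) {
        return new String(Character.toChars(number));
    }

    /** Lenient reading (assumption): 0x80..0x9F are looked up as Windows-1252 bytes. */
    static String browserStyle(int number) {
        if (number >= 0x80 && number <= 0x9F) {
            byte[] single = { (byte) number };
            return new String(single, Charset.forName("windows-1252"));
        }
        return strict(number);
    }

    public static void main(String[] args) {
        // The literal reading yields the control character U+0099.
        System.out.println((int) strict(153).charAt(0));          // prints 153
        // The lenient reading yields the trademark sign U+2122.
        System.out.println(browserStyle(153).equals("\u2122"));   // prints true
    }
}
```

A hand-written regular expression would have to encode this special case as well, which is the kind of detail a dedicated library already handles.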

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

Feed string into a robust HTML parser
Get parsed (and fully decoded) string back
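The two steps above can be sketched with the HTML parser that ships with the JDK (the Swing ParserDelegator). It is used here only because it needs no extra dependency; it is an old parser, and a dedicated library such as jsoup would be the more usual choice for real pages. The class name and the helper textOf are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class ParserDecodeSketch {

    /** Hypothetical helper: feeds HTML into the parser and collects the decoded text. */
    static String textOf(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // The parser hands over text with character references already decoded.
                out.append(data);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Paging Lucene&#39;s search results</h1></body></html>";
        System.out.println(textOf(html));
    }
}
```

The calling code never touches the entity syntax: it only sees the text the parser reports.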

Author by

user

Updated on August 19, 2020

Comments

user almost 4 years
Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract paragraphs (like the title on Stack Overflow) from an HTML file.

I can use regular expressions in Java to extract the fields I need but I have to decode the fields obtained.

EXAMPLE

field extracted:
```
Paging Lucene&#39;s search results
```
field after decoding:
```
Paging Lucene's search results
```
Is there any class in Java that will allow me to convert these HTML codes?
user over 11 years

I need to extract data from HTML pages that share the same structure and tags (like Wikipedia), so I think regex is a good approach.
Has QUIT--Anony-Mousse over 11 years

@MrCarAsus: NO IT IS NOT. Use an HTML parser, and DOM for extraction. That is what they are for!
Has QUIT--Anony-Mousse over 11 years

Try using DBPedia, btw. It is an already parsed version of Wikipedia.
user over 11 years

And do you know of a parsed version of Stack Overflow? I tried using regex on Stack Overflow HTML pages and it works. I extract the title and answers with a set of regexps applied to the HTML.
Has QUIT--Anony-Mousse over 11 years

Use an HTML parser. Every time you attempt to parse HTML with a regexp, God kills a kitten.
Has QUIT--Anony-Mousse over 11 years

Seriously, read up on why using regexps to parse HTML is wrong. HTML is a Chomsky Type 2 language, and regexps are Type 3. You need a Type 2 parser.
Has QUIT--Anony-Mousse over 11 years

Plus there are plenty of HTML parsers around. Why don't you just try using them? The Stack Overflow data dump is also quite well pre-parsed, btw: you can get a lot of information out of it with a simple XML pull parser, without having to parse anything yourself.
Mike Samuel over 11 years

Re "&#153; for example is not valid": isn't &#153; perfectly valid, though possibly interpreted inconsistently by user agents? Section 4.6 of HTML 5 puts no bounds on the code points that can be represented by decimal numeric character references, and that code point is a valid control character code point.
Has QUIT--Anony-Mousse over 11 years

@MikeSamuel The page says in number 3: "not ... in the range U+0080–U+009F". 0x0099 is in this range.
Mike Samuel over 11 years

@Anony-Mousse, Ah, thanks.
useranon about 7 years

commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/‌… - Latest link