How to convert HTML entities to readable text?
Solution 1
With Free recode (formerly known as GNU recode):
recode html < file
If you don't have recode or HTML::Entities and only need to decode &#x<hex>; entities, you could do it by hand with:
perl -Mopen=locale -pe 's/&#x([\da-f]+);/chr hex $1/gie'
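For comparison, the same hand-rolled substitution can be sketched in Python. This is only an illustration and not part of the original answer: the function name decode_hex_entities is hypothetical, and like the Perl one-liner it handles &#x<hex>; references only, leaving decimal and named entities untouched.

```python
import re

# Matches hexadecimal numeric character references such as &#x119;
# (case-insensitive, like the /i flag in the Perl one-liner).
HEX_ENTITY = re.compile(r"&#x([0-9a-f]+);", re.IGNORECASE)


def decode_hex_entities(text):
    """Replace each &#x<hex>; reference with the character it names.

    Hypothetical helper: named entities (&amp;) and decimal
    references (&#281;) are deliberately left as they are.
    """
    return HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Calling decode_hex_entities("zapyta&#x107;") returns "zapytać". For anything beyond hex references, a full decoder such as html.unescape is probably the safer choice.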
Solution 2
From "How can I decode HTML entities?" on Stack Overflow, you may be able to implement a simple Perl solution such as:
perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt
For example, using your example text:
$ perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt
chciałabym zapytać, czy rozważa Pan takze udział w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobrą znajomością Angular.js do projektu, który dotyczy systemu, służącego do monitorowania i zarządzania flotą pojazdów. Zespół, do którego poszukujemy
With -Mopen=locale, I/O is done in the locale's character set. That includes input from email.txt. It looks like email.txt contains only ASCII characters (the whole point of encoding those characters using the &#x<hex>; notation, I suppose), but if not, you may need to adapt the above to also decode that file using the right charset (if it's not the same as the locale's one) instead of using open=locale.
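The charset caveat above can be illustrated with a short Python sketch that decodes the input file with an explicitly chosen encoding instead of the locale's. The file name email.txt comes from the answer; the function name decode_file and the ISO-8859-2 default are assumptions made purely for illustration.

```python
import html
import sys


def decode_file(path, input_encoding="iso-8859-2"):
    """Read `path` as `input_encoding` and return its text with entities decoded.

    Both the helper name and the default encoding are hypothetical;
    pass whatever charset the file really uses.
    """
    # newline="" leaves the file's original line endings untouched.
    with open(path, encoding=input_encoding, newline="") as f:
        return html.unescape(f.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Output still goes through sys.stdout, i.e. in the locale's charset.
    sys.stdout.write(decode_file(sys.argv[1]))
```

Saved under a name of your choosing, it would be run with the file as its only argument, for example email.txt. The input charset is stated explicitly, while output encoding is still left to the locale.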
Solution 3
A Python 3.4+ version (html.unescape was added in 3.4) that can be used in a pipe:
python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]' < file
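The one-liner above can also be written out as a small function, which makes the line-by-line behaviour easier to see. This is a sketch under stated assumptions: the name unescape_stream is hypothetical, and the demo uses in-memory streams so that it runs without any input.

```python
import html
import io


def unescape_stream(src, dst):
    """Copy `src` to `dst` one line at a time, decoding HTML entities.

    Hypothetical helper: iterating over `src` reads a line at a time,
    so the whole input never has to be held in memory.
    """
    for line in src:
        dst.write(html.unescape(line))


# Demo with in-memory streams; pass sys.stdin and sys.stdout
# instead to use the function as a filter in a pipe.
src = io.StringIO("a &amp; b\nzapyta&#x107;\n")
dst = io.StringIO()
unescape_stream(src, dst)
```

Processing per line preserves line boundaries, with one trade-off: an entity split across two lines would not be decoded.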
Tim
My name is Jakub T. Jankiewicz, I'm coding mostly in JavaScript.
Updated on September 18, 2022
Comments
-
Tim over 1 year
I have HTML numeric entities like &#x119; and want to convert them to real characters. I have emails, mostly from LinkedIn, that look like this:
chciałabym zapytać, czy rozważa Pan takze udział w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobrą znajomością Angular.js do projektu, który dotyczy systemu, służącego do monitorowania i zarządzania flotą pojazdów. Zespół, do którego poszukujemy
I'm using Claws Mail; switching to HTML doesn't convert it to text. I've tried to copy it and use
xclip -o -sel clip | html2text | less
but it didn't convert the entities. Is there a way to get that text using command-line tools?
The only way I can think of is to use
data:text/html,<PASTE THE EMAIL>
and open it in a browser, but I would prefer the command line.
-
Tim over 9 years
To get upvotes you should probably write shell code that will convert &#x119; to echo -e "\x01\x19"; it should be possible with sed.
-
Tim over 9 years
Also, this doesn't work because it's one character, and I don't get it when I run your command.
-
Tim over 9 years
\u119 works, but I'm not able to make it work with sed. So far I have
c-v | sed -e 's/&#x\([^;]*\);/\\u\1/g' -e 's/.*/echo -e "&"/' | bash
-
Stéphane Chazelas over 9 years
You should use the -Mopen=locale option so that the text is output in the user's charset (and make that warning go away).
-
Tim over 9 years
This works perfectly:
c-v | html2text | recode html
-
ariddell about 6 years
Cleaner:
python3 -c'import html,sys;print(html.unescape(sys.stdin.read()), end="")'
-
Aissen about 6 years
@ariddell: your version isn't line-by-line, and I wanted to preserve line boundaries; otherwise it blocks the pipe until everything has been read from stdin (until the pipe is exhausted).
-
Pysis over 4 years
Didn't have html2text; not sure it matters. This example fails with "recode: Request 'html' is erroneous". It seems it needs to be run this way now, with a range instead of a single identifier: recode html..utf-8. A bit strange, but I guess it's all similar, translating codes at some level.
-
Stéphane Chazelas over 4 years
@Pysis, you'll notice the first version of this answer had html.., later changed to html in 2014. html alone definitely works with the latest version (git head from December 2019) or with 3.6 from 2008. Is it possible you have a very old version?
-
Pysis over 4 years
Just installed it to use in Cygwin; I think it was from Choco? recode 3.7-beta2
-
Diomidis Spinellis about 4 years
With recode 3.7-beta2 the command that currently works is recode HTML..utf-8.