How to convert HTML entities to readable text?

Solution 1

With Free recode (formerly known as GNU recode):

recode html < file

If you don't have recode or HTML::Entities and only need to decode &#x<hex>; entities, you could do it by hand with:

perl -Mopen=locale -pe 's/&#x([\da-f]+);/chr hex $1/gie'
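
For comparison, the same hex-only substitution can be sketched in Python. This is a rough equivalent of the one-liner above, not part of the original answer: the name decode_hex_entities is made up for illustration, and like the Perl version it only touches &#x<hex>; references, leaving named and decimal entities as they are.

```python
import re

# Matches hexadecimal numeric character references such as &#x142;
# (case-insensitive, like the /i flag in the Perl one-liner).
_HEX_ENTITY = re.compile(r"&#x([0-9a-f]+);", re.IGNORECASE)


def decode_hex_entities(text: str) -> str:
    """Replace each &#x<hex>; reference with the character it names.

    decode_hex_entities is a hypothetical helper name, used only for
    illustration.
    """
    # int(..., 16) parses the hex digits; chr() maps the code point to a character.
    return _HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Calling decode_hex_entities("chcia&#x142;abym") should give "chciałabym", while something like "&amp;" passes through unchanged.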

Solution 2

Based on "How can I decode HTML entities?" on Stack Overflow, a simple Perl solution could be:

perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt

For example, using your sample text:

$ perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt
chciałabym zapytać, czy rozważa Pan takze udział w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobrą znajomością Angular.js do projektu, który dotyczy systemu, służącego do monitorowania i zarządzania flotą pojazdów. Zespół, do którego poszukujemy

With -Mopen=locale, I/O is done in the locale's character set. That includes input from email.txt. It looks like email.txt contains only ASCII characters (the whole point of encoding those characters using the &#x<hex>; notation I suppose), but if not you may need to adapt the above to also decode that file using the right charset (if it's not the same as the locale's one) instead of using open=locale.
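
To illustrate the charset point, here is a minimal sketch in Python rather than Perl, so it swaps HTML::Entities for the standard library's html.unescape. It assumes you already know the file's encoding; the function name decode_entities_bytes and the default "utf-8" are illustrative choices, not something taken from the answer.

```python
import html


def decode_entities_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes with an explicit charset, then expand HTML entities.

    decode_entities_bytes is a hypothetical helper; `encoding` must be the
    charset the file was actually written in (an assumption the caller makes).
    """
    # Decode first, so non-ASCII characters already present in the file
    # are interpreted correctly regardless of the current locale.
    text = raw.decode(encoding)
    # html.unescape handles named, decimal and hexadecimal references.
    return html.unescape(text)
```

For a file stored as ISO-8859-2, for instance, you would read it in binary mode and call decode_entities_bytes(data, "iso-8859-2"); the result is an ordinary str that can then be written out in whatever charset the terminal expects.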

Solution 3

A Python 3.4+ version (html.unescape was added in Python 3.4) that can be used in a pipe:

python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]' < file
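
The one-liner can also be written as a small script, which may be easier to read and extend. This is only a sketch of the same line-by-line approach; the name unescape_lines is hypothetical.

```python
import html
from typing import Iterable, Iterator


def unescape_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line with HTML entities expanded.

    Lines are processed one at a time, so output can start before the
    whole input has been read. Line endings are passed through as-is.
    unescape_lines is a hypothetical name used for illustration.
    """
    for line in lines:
        yield html.unescape(line)


# Typical use in a pipe (shown as a comment, not executed here):
#     import sys
#     sys.stdout.writelines(unescape_lines(sys.stdin))
```

Because each line is handled independently, an entity that happens to be split across a line break would not be decoded; for typical mail text that should not matter.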

Author: Tim


Updated on September 18, 2022

Comments

  • Tim
    Tim over 1 year

    I have HTML numeric entities like &#x119; and want to convert them to the real characters. I have emails, mostly from LinkedIn, that look like this:

    chcia&#x142;abym zapyta&#x107;, czy rozwa&#x17c;a Pan takze udzia&#x142; w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobr&#x105; znajomo&#x15b;ci&#x105; Angular.js do projektu, kt&#xf3;ry dotyczy systemu, s&#x142;u&#x17c;&#x105;cego do monitorowania i zarz&#x105;dzania flot&#x105; pojazd&#xf3;w. Zesp&#xf3;&#x142;, do kt&#xf3;rego poszukujemy

    I'm using Claws Mail; switching to the HTML view doesn't convert them to text. I tried copying the text and running

    xclip -o -sel clip | html2text | less
    

    but it didn't convert the entities. Is there a way to have that text using command line tools?

    The only way I can think of is to use data:text/html,<PASTE THE EMAIL> and open it in a browser, but would prefer the command line.

  • Tim
    Tim over 9 years
    To get upvotes, you should probably write shell code that converts &#x119; to echo -e "\x01\x19"; that should be possible with sed.
  • Tim
    Tim over 9 years
    Also, this doesn't work: it's a single character, and I don't get it when I run your command.
  • Tim
    Tim over 9 years
    \u119 works, but I'm not able to make it work with sed. So far I have c-v | sed -e 's/&#x\([^;]*\);/\\u\1/g' -e 's/.*/echo -e "&"/' | bash
  • Stéphane Chazelas
    Stéphane Chazelas over 9 years
    You should use the -Mopen=locale option so that the text is output in the user's charset (and make that warning go away).
  • Tim
    Tim over 9 years
    This works perfectly: c-v | html2text | recode html
  • ariddell
    ariddell about 6 years
    Cleaner: python3 -c'import html,sys;print(html.unescape(sys.stdin.read()), end="")'
  • Aissen
    Aissen about 6 years
    @ariddell : your version isn't line-by-line, and I wanted to preserve line boundaries; otherwise it blocks a pipe until everything is read on stdin (pipe is exhausted).
  • Pysis
    Pysis over 4 years
    Didn't have html2text; not sure it matters. This example fails with recode: Request 'html' is erroneous. Seems it needs to be run this way now with a range instead of a single identifier: recode html..utf-8. A bit strange, but I guess it's all similar translating codes at some levels.
  • Stéphane Chazelas
    Stéphane Chazelas over 4 years
    @Pysis, you'll notice the first version of this answer had html.., later changed to html in 2014. html alone definitely works with the latest version (git head from December 2019) and with 3.6 from 2008. Is it possible you have a very old version?
  • Pysis
    Pysis over 4 years
    Just installed to use in cygwin, I think it was from Choco? recode 3.7-beta2
  • Diomidis Spinellis
    Diomidis Spinellis about 4 years
    With recode 3.7-beta2 the command that currently works is recode HTML..utf-8.