Tool to convert accented characters to HTML entities?


Recode can convert to HTML entities:

$ echo "é" | recode ..html
&eacute;

There are a few slightly different HTML transformations available in recode; see info recode HTML.

To recode one or more files in place, list them after the request:

$ recode ..html one_file another_file ...

For recursive action, use the find command, e.g.

$ find your_directory -type f -name "*.html"

The find command above only lists the files. Check that it matches exactly the files you want: no binaries and nothing from unwanted directories. It is also a good idea to make a backup first, or to work on a copy rather than the real files. Once the find command lists the right files, append -exec your_command {} +, where your_command is the recode ..html from above and {} stands for the file name(s) that find passes to recode:

$ find your_directory -type f -name "*.html" -exec recode ..html {} +
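The "check the list first, then convert a copy" advice could be sketched roughly as the two helper functions below. The function names list_html_files and convert_copy are made up for illustration, and the directory names in the usage note are placeholders; treat it as a starting point under those assumptions, not a finished script.

```shell
#!/bin/sh
# list_html_files DIR: dry run, print the files the conversion would touch.
# (hypothetical helper name, not part of recode or find)
list_html_files() {
    find "$1" -type f -name "*.html"
}

# convert_copy SRC WORK: copy SRC to WORK (WORK must not exist yet),
# then run recode on the copy only, so SRC stays untouched.
# (hypothetical helper name)
convert_copy() {
    cp -a "$1" "$2" &&
        find "$2" -type f -name "*.html" -exec recode ..html {} +
}
```

You would first run list_html_files your_directory and read through the output, and only then convert_copy your_directory your_directory.work.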

But wait a moment, there's one big caveat: recode ..html assumes that your input files are in the same character set (encoding) that you are using on the command line. If all of your files use the "modern" UTF-8, it will work fine, because Ubuntu uses UTF-8 by default. But if some of your files use the older ISO-8859-1 or other charsets, it gets a lot more complicated, because each file's source charset then has to be given to recode explicitly.
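For a mixed bunch of files, one possible approach is to let file(1) guess each file's encoding and then name the source charset explicitly in the recode request (recode accepts requests of the form BEFORE..AFTER). The helper names charset_of and recode_by_charset are invented for this sketch, and the guess from file is only a heuristic, so check a few files by hand before trusting it.

```shell
#!/bin/sh
# charset_of FILE: print the encoding guessed by file(1),
# e.g. utf-8, iso-8859-1 or us-ascii. (hypothetical helper name)
charset_of() {
    file -b --mime-encoding "$1"
}

# recode_by_charset FILE: convert FILE in place, stating the source
# charset instead of relying on the locale. (hypothetical helper name)
recode_by_charset() {
    case "$(charset_of "$1")" in
        utf-8)      recode UTF-8..html "$1" ;;
        iso-8859-1) recode ISO-8859-1..html "$1" ;;
        us-ascii)   ;;  # no accented characters to convert
        *)          echo "skipping $1: unrecognised charset" >&2 ;;
    esac
}
```

It could then be applied per file, for example with find your_directory -type f -name "*.html" | while IFS= read -r f; do recode_by_charset "$f"; done (this simple loop assumes no newlines in file names).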


Author: bafromca

Updated on September 18, 2022

Comments

  • bafromca
    bafromca over 1 year

    Is there a tool (command-line is fine) that can convert accented characters to HTML entities in Ubuntu? Preferably recursively and without also converting html/php tags.

    e.g.
    from: é
    to: &eacute;
    or: &#233;
    
  • bafromca
    bafromca about 13 years
    I'm aware of those tools but I need to convert hundreds of files (so gedit is out) and I need to convert all accented characters (and there are a lot of those).
  • Denwerko
    Denwerko about 13 years
    If you need to convert hundreds of files, you can use that sed with find, maybe like this: find /folder_where_you_have_files -name '*.html' -exec sh -c 'sed "s/é/\&#233;/g" "$1" > "$1.new"' sh {} \; sed can also read its instructions from a file (sed -f), so you can replace all the characters at once. I'm not sure that I typed the command exactly right; I will try it on some examples and post if something changes.
  • bafromca
    bafromca about 13 years
    Ya I ran a rename command to get rid of all the spaces in the files with rename 's/\ /_/g' * and then for i in *.php; do iconv --from-code=ISO-8859-15 --to-code=UTF-8 $i > $i.iconv; mv $i.iconv $i; done to convert to UTF-8. Problem with that program is that it does every character imaginable, including html and php tags.
  • elmicha
    elmicha about 13 years
    You didn't need to rename the files. You can use double quotes around your variable values, i.e. "$i". These double quotes make sure that your variable values are not split.
  • That Brazilian Guy
    That Brazilian Guy over 2 years
    The link for the solution is broken.
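To illustrate the quoting advice from the comments: with "$i" in double quotes, the iconv loop also works for file names that contain spaces, so the rename step is not needed. This is only a sketch; the function name to_utf8 is made up, and ISO-8859-15 as the source encoding is taken from the comment above.

```shell
#!/bin/sh
# to_utf8 FILE...: convert each file from ISO-8859-15 to UTF-8 in place.
# (hypothetical helper name)
to_utf8() {
    for i in "$@"; do
        # The quotes keep a name such as "my file.php" as a single argument.
        iconv --from-code=ISO-8859-15 --to-code=UTF-8 "$i" > "$i.iconv" &&
            mv "$i.iconv" "$i"
    done
}
```

It would be called as to_utf8 *.php in the directory that holds the files.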