Best way to convert text files between character sets?

Solution 1

Stand-alone utility approach

iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING  the encoding of the input
-t ENCODING  the encoding of the output

You don't have to specify either of these arguments; each defaults to the encoding of your current locale, which is usually UTF-8.
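If you don't know the input encoding, file(1) can usually guess it, and its output can be fed straight to -f. A minimal sketch (assuming a system with file and iconv; in.txt and out.txt are placeholder paths):

```shell
# Detect the source encoding with file(1); it prints a name such as
# "iso-8859-1" or "utf-8" that iconv accepts. Then convert to UTF-8.
src_enc=$(file -b --mime-encoding in.txt)
iconv -f "$src_enc" -t UTF-8 in.txt > out.txt
```

If iconv aborts with an "illegal input sequence" error, -c skips invalid bytes, and appending //TRANSLIT to the target encoding substitutes approximate characters instead of failing.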

Solution 2

Try VIM

If you have vim you can use this:

Not tested for every encoding.

The cool part about this is that you don't have to know the source encoding.

vim +"set nobomb | set fenc=utf8 | x" filename.txt

Be aware that this command modifies the file directly, in place.


Explanation part!

  1. + : Used by vim to run a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
  2. | : Separator of multiple commands (like ; in bash)
  3. set nobomb : do not write a UTF-8 BOM
  4. set fenc=utf8 : Set the new file encoding to UTF-8
  5. x : Save and close the file
  6. filename.txt : path to the file
  7. " : quotes are needed because of the pipes (otherwise bash would interpret them as shell pipes)

Solution 3

Under Linux you can use the very powerful recode command to convert between different charsets and also fix line-ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.

Solution 4

iconv(1)

iconv -f FROM-ENCODING -t TO-ENCODING file.txt

Also there are iconv-based tools in many languages.
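For example, Python's standard library can do the same conversion without the external iconv binary. A minimal sketch (assumes python3 is on the PATH; in.txt and out.txt are placeholder paths):

```shell
# Re-encode in.txt (ISO-8859-1) as out.txt (UTF-8) using Python's
# built-in codecs: read with one encoding, write with the other.
python3 - <<'EOF'
with open('in.txt', encoding='iso-8859-1') as src, \
     open('out.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())
EOF
```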

Solution 5

On Windows with PowerShell:

Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT

The shortest version, if you can assume that the input BOM is correct:

gc FILE.TXT | Out-File -en utf7 file-utf7.txt
Author: devlearn

Updated on October 21, 2021

Comments

  • devlearn
    devlearn over 2 years

    What is the fastest, easiest tool or method to convert text files between character sets?

    Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.

    Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.

    Best solutions so far:

    On Linux/UNIX/OS X/cygwin:

• GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:

        $ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
      

      As pointed out by Ben, there is an online converter using iconv.

• recode, suggested by Cheekysoft, will convert one or several files in place. Example:

        $ recode UTF8..ISO-8859-15 in.txt
      

      This one uses shorter aliases:

        $ recode utf8..l9 in.txt
      

      Recode also supports surfaces which can be used to convert between different line ending types and encodings:

      Convert newlines from LF (Unix) to CR-LF (DOS):

        $ recode ../CR-LF in.txt
      

      Base64 encode file:

        $ recode ../Base64 in.txt
      

      You can also combine them.

      Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:

        $ recode utf8/Base64..l1/CR-LF/Base64 file.txt
      

    On Windows with Powershell (Jay Bazuzi):

    • PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt

    (No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)

    Edit

Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:

    gc -en string in.txt | Out-File -en utf8 out.txt
    

    Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".

  • coredumperror
    coredumperror almost 12 years
    For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OSX (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings. iconv -l | grep UTF will tell you all the UTF-related encodings that your copy of iconv does support.
  • Larry Battle
    Larry Battle almost 12 years
    Here's a shorter version that works better. gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt
  • Jay Bazuzi
    Jay Bazuzi almost 12 years
    @LarryBattle: How does Set-Content work better than Out-File?
  • Larry Battle
    Larry Battle almost 12 years
    ...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same file-utf8.txt file for input since they both had the same output file as file-utf7.txt.
  • Noah
    Noah over 10 years
This would be really great, except that it doesn't support UTF16. It supports UTF32, but not UTF16! I wouldn't need to convert files, except that a lot of Microsoft software (e.g. SQL Server bcp) insists on UTF16 - and then their utility won't convert to it. Interesting to say the least.
  • OZZIE
    OZZIE over 10 years
I tried gc -en Ascii readme.html | Out-File -en UTF8 readme.html but it converts the file to utf-8 but then it's empty! Notepad++ says the file is Ansi-format but reading up as I understand it that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF
  • rob
    rob over 10 years
    @OZZIE I don't think you can edit a file in place like that. Try saving the content to a temporary file first
  • Stew
    Stew over 9 years
    Don't know the encoding of your input file? Use chardet in.txt to generate a best guess. The result can be used as ENCODING in iconv -f ENCODING.
  • knb
    knb about 9 years
    Prevent exit at invalid characters (avoiding illegal input sequence at position messages), and replace "weird" characters with "similar" characters: iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt.
  • LMZ
    LMZ about 9 years
it's better style to use tmp=$(mktemp) to create a temporary file. Also, the line with rm is redundant.
  • frIT
    frIT over 8 years
    I like this because it's standard on most NIX platforms. But also see the VIM command option (alias: ex) below. Additional info: (1) you (probably) don't need to specify the -f (from) option with iconv. (2) the file --mime-encoding <file/s> command can help you to figure out the encoding in the first place.
  • mlibre
    mlibre almost 8 years
    can you complete this function with auto detect input format?
  • DomQ
    DomQ almost 8 years
    Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)?
  • jjwdesign
    jjwdesign over 7 years
Thank you for the explanation! I was having a difficult time with the beginning of the file until I read up on the bomb/nobomb setting.
  • jjwdesign
    jjwdesign over 7 years
    This statement works great when converting strings, but not for files.
  • Boop
    Boop over 7 years
    np, additionaly you can view the bom if you use vim -b or head file.txt|cat -e
  • Plato
    Plato over 7 years
    FWIW the file command reported my source as UTF-16 Little Endian; running iconv -f UTF-16 -t UTF-8... transformed it incorrectly to ASCII, i had to explicitly specify iconv -f UTF-16LE... to output UTF-8
  • Brian J. Miller
    Brian J. Miller about 7 years
    I had to adapt this solution a bit to work on Mac OS X, at least at my version. find . -type f -iname *.txt -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper(\$0)}") -t UTF-8 > converted "$1" && mv converted "$1"' -- {} \;
  • Gabriel
    Gabriel about 7 years
    for example: find -regextype posix-extended -type f -regex ".*\.(h|cpp|rc|fx|cs|props|xaml)" -exec vim +'set nobomb | set fenc=utf8 | x' {} \;
  • NightOwlPrgmr
    NightOwlPrgmr almost 7 years
    I used this to convert the encoding of CSV files and was really excited when I saw the charset had indeed changed. Unfortunately, when I went to load the file into MySQL, it had a different number of columns than what it previously had before running the vim command. Wonder if it would be possible to just open the file, convert the encoding, and save/close the file while leaving all other file content the same?
  • philwalk
    philwalk over 6 years
    beware, this function deletes the input file without verifying that the iconv call succeeded.
  • silvioprog
    silvioprog over 6 years
    Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it!
  • Serge Stroobandt
    Serge Stroobandt almost 5 years
    @rmuller The sed command is there on purpose, enabling the automatic detection of character encoding. I have expanded the answer to explain this now. It would be courteous with regards to the readership to delete any remaining irrelevant comments. Thank you.
  • rmuller
    rmuller almost 5 years
    @SergeStroobandt Maybe i was not clear enough. My point is when you use "file -b --mime-encoding" instead of "file -bi" there is no need for filtering the result with sed. This command already returns the file encoding only. So in your example "us-ascii"
  • Boop
    Boop over 4 years
many ways: 1 - Use @Gabriel's command, 2 - Shell expansion vim +'set nobomb | set fenc=utf8 | x' *.yaml (e.g.), 3 - A loop for f in a.txt b.txt; do vim +'set nobomb | set fenc=utf8 | x' "${f}"; done (none of these has been tested)
  • Aaron Franke
    Aaron Franke about 4 years
    This doesn't actually seem to do anything for me on Linux. I saved a file as UTF-8 with BOM and expected it to convert to UTF-8 without BOM and it didn't.
  • Aaron Franke
    Aaron Franke about 4 years
This changes the contents of the text file. I ran this on a UTF-8 with BOM file expecting to get out a UTF-8 without BOM file, but it prepended the BOM characters at the start of the file.
  • Aaron Franke
    Aaron Franke about 4 years
    What about auto-detecting the original encoding?
  • Aaron Franke
    Aaron Franke about 4 years
    How do you convert to LF? There is /CR and /CR-LF but no /LF
  • Ali80
    Ali80 over 3 years
    exactly what the Doc. ordered
  • paradox
    paradox over 2 years
    that's what I get when I run it: Usage: iconv [-c] [-s] [-f fromcode] [-t tocode] [file ...] or: iconv -l Try 'iconv --help' for more information.
  • tbotalla
    tbotalla over 2 years
    For anyone wondering: for a file with about 6M rows (2.3 GB) it took 7 min to convert
  • Fabien Haddadi
    Fabien Haddadi over 2 years
    Very fast, even with a 4GB file.