Best way to convert text files between character sets?

Solution 1

Stand-alone utility approach

iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING  the encoding of the input
-t ENCODING  the encoding of the output

You don't have to specify either of these arguments; each defaults to the encoding of your current locale, which is usually UTF-8.
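If you don't know the input encoding, file(1) can usually guess it, and its output can be fed straight to -f. A minimal sketch (assuming a system with file and iconv; in.txt and out.txt are placeholder paths):

```shell
# Detect the source encoding with file(1); it prints a name such as
# "iso-8859-1" or "utf-8" that iconv accepts. Then convert to UTF-8.
src_enc=$(file -b --mime-encoding in.txt)
iconv -f "$src_enc" -t UTF-8 in.txt > out.txt
```

If iconv aborts with an "illegal input sequence" error, -c skips invalid bytes, and appending //TRANSLIT to the target encoding substitutes approximate characters instead of failing.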

Solution 2

Try VIM

If you have vim you can use this:

Not tested for every encoding.

The cool part about this is that you don't have to know the source encoding.

vim +"set nobomb | set fenc=utf8 | x" filename.txt

Be aware that this command modifies the file directly, in place.


Explanation part!

  1. + : Used by vim to run a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
  2. | : Separator of multiple commands (like ; in bash)
  3. set nobomb : do not write a UTF-8 BOM
  4. set fenc=utf8 : Set the new file encoding to UTF-8
  5. x : Save and close the file
  6. filename.txt : path to the file
  7. " : quotes are needed because of the pipes (otherwise bash would interpret them as shell pipes)

Solution 3

Under Linux you can use the very powerful recode command to convert between different charsets and also fix line-ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.

Solution 4

iconv(1)

iconv -f FROM-ENCODING -t TO-ENCODING file.txt

Also there are iconv-based tools in many languages.
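For example, Python's standard library can do the same conversion without the external iconv binary. A minimal sketch (assumes python3 is on the PATH; in.txt and out.txt are placeholder paths):

```shell
# Re-encode in.txt (ISO-8859-1) as out.txt (UTF-8) using Python's
# built-in codecs: read with one encoding, write with the other.
python3 - <<'EOF'
with open('in.txt', encoding='iso-8859-1') as src, \
     open('out.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())
EOF
```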

Solution 5

On Windows with PowerShell:

Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT

The shortest version, if you can assume that the input BOM is correct:

gc FILE.TXT | Out-File -en utf7 file-utf7.txt
Author: devlearn

Updated on October 21, 2021

Comments

  • devlearn
    devlearn over 2 years

    What is the fastest, easiest tool or method to convert text files between character sets?

    Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.

    Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.

    Best solutions so far:

    On Linux/UNIX/OS X/cygwin:

• GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:

        $ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
      

      As pointed out by Ben, there is an online converter using iconv.

• recode, suggested by Cheekysoft, will convert one or several files in place. Example:

        $ recode UTF8..ISO-8859-15 in.txt
      

      This one uses shorter aliases:

        $ recode utf8..l9 in.txt
      

      Recode also supports surfaces which can be used to convert between different line ending types and encodings:

      Convert newlines from LF (Unix) to CR-LF (DOS):

        $ recode ../CR-LF in.txt
      

      Base64 encode file:

        $ recode ../Base64 in.txt
      

      You can also combine them.

      Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:

        $ recode utf8/Base64..l1/CR-LF/Base64 file.txt
      

    On Windows with Powershell (Jay Bazuzi):

    • PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt

    (No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)

    Edit

Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:

    gc -en string in.txt | Out-File -en utf8 out.txt
    

    Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".

  • coredumperror
    coredumperror almost 12 years
    For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OSX (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings. iconv -l | grep UTF will tell you all the UTF-related encodings that your copy of iconv does support.
  • Larry Battle
    Larry Battle almost 12 years
    Here's a shorter version that works better. gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt
  • Jay Bazuzi
    Jay Bazuzi almost 12 years
    @LarryBattle: How does Set-Content work better than Out-File?
  • Larry Battle
    Larry Battle almost 12 years
    ...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same file-utf8.txt file for input since they both had the same output file as file-utf7.txt.
  • Noah
    Noah over 10 years
This would be really great, except that it doesn't support UTF16. It supports UTF32, but not UTF16! I wouldn't need to convert files, except that a lot of Microsoft software (e.g. SQL Server bcp) insists on UTF16 - and then their utility won't convert to it. Interesting to say the least.
  • OZZIE
    OZZIE over 10 years
I tried gc -en Ascii readme.html | Out-File -en UTF8 readme.html but it converts the file to utf-8 but then it's empty! Notepad++ says the file is Ansi-format but reading up as I understand it that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF
  • rob
    rob over 10 years
    @OZZIE I don't think you can edit a file in place like that. Try saving the content to a temporary file first
  • Stew
    Stew over 9 years
    Don't know the encoding of your input file? Use chardet in.txt to generate a best guess. The result can be used as ENCODING in iconv -f ENCODING.
  • knb
    knb about 9 years
    Prevent exit at invalid characters (avoiding illegal input sequence at position messages), and replace "weird" characters with "similar" characters: iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt.
  • LMZ
    LMZ about 9 years
it's better style to use tmp=$(mktemp) to create a temporary file. Also, the line with rm is redundant.
  • frIT
    frIT over 8 years
    I like this because it's standard on most NIX platforms. But also see the VIM command option (alias: ex) below. Additional info: (1) you (probably) don't need to specify the -f (from) option with iconv. (2) the file --mime-encoding <file/s> command can help you to figure out the encoding in the first place.
  • mlibre
    mlibre almost 8 years
    can you complete this function with auto detect input format?
  • DomQ
    DomQ almost 8 years
    Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)?
  • jjwdesign
    jjwdesign over 7 years
Thank you for the explanation! I was having a difficult time with the beginning of the file until I read up on the bomb/nobomb setting.
  • jjwdesign
    jjwdesign over 7 years
    This statement works great when converting strings, but not for files.
  • Boop
    Boop over 7 years
    np, additionaly you can view the bom if you use vim -b or head file.txt|cat -e
  • Plato
    Plato over 7 years
    FWIW the file command reported my source as UTF-16 Little Endian; running iconv -f UTF-16 -t UTF-8... transformed it incorrectly to ASCII, i had to explicitly specify iconv -f UTF-16LE... to output UTF-8
  • Brian J. Miller
    Brian J. Miller about 7 years
    I had to adapt this solution a bit to work on Mac OS X, at least at my version. find . -type f -iname *.txt -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper(\$0)}") -t UTF-8 > converted "$1" && mv converted "$1"' -- {} \;
  • Gabriel
    Gabriel about 7 years
    for example: find -regextype posix-extended -type f -regex ".*\.(h|cpp|rc|fx|cs|props|xaml)" -exec vim +'set nobomb | set fenc=utf8 | x' {} \;
  • NightOwlPrgmr
    NightOwlPrgmr almost 7 years
    I used this to convert the encoding of CSV files and was really excited when I saw the charset had indeed changed. Unfortunately, when I went to load the file into MySQL, it had a different number of columns than what it previously had before running the vim command. Wonder if it would be possible to just open the file, convert the encoding, and save/close the file while leaving all other file content the same?
  • philwalk
    philwalk over 6 years
    beware, this function deletes the input file without verifying that the iconv call succeeded.
  • silvioprog
    silvioprog over 6 years
    Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it!
  • Serge Stroobandt
    Serge Stroobandt almost 5 years
    @rmuller The sed command is there on purpose, enabling the automatic detection of character encoding. I have expanded the answer to explain this now. It would be courteous with regards to the readership to delete any remaining irrelevant comments. Thank you.
  • rmuller
    rmuller almost 5 years
    @SergeStroobandt Maybe i was not clear enough. My point is when you use "file -b --mime-encoding" instead of "file -bi" there is no need for filtering the result with sed. This command already returns the file encoding only. So in your example "us-ascii"
  • Boop
    Boop over 4 years
many ways: 1 - Use @Gabriel's command, 2 - Shell expansion vim +'set nobomb | set fenc=utf8 | x' *.yaml (e.g.), 3 - A loop for f in a.txt b.txt; do vim +'set nobomb | set fenc=utf8 | x' "${f}"; done (none of these has been tested)
  • Aaron Franke
    Aaron Franke about 4 years
    This doesn't actually seem to do anything for me on Linux. I saved a file as UTF-8 with BOM and expected it to convert to UTF-8 without BOM and it didn't.
  • Aaron Franke
    Aaron Franke about 4 years
This changes the contents of the text file. I ran this on a UTF-8 with BOM file expecting to get out a UTF-8 without BOM file, but it prepended the BOM characters at the start of the file.
  • Aaron Franke
    Aaron Franke about 4 years
    What about auto-detecting the original encoding?
  • Aaron Franke
    Aaron Franke about 4 years
    How do you convert to LF? There is /CR and /CR-LF but no /LF
  • Ali80
    Ali80 over 3 years
    exactly what the Doc. ordered
  • paradox
    paradox over 2 years
    that's what I get when I run it: Usage: iconv [-c] [-s] [-f fromcode] [-t tocode] [file ...] or: iconv -l Try 'iconv --help' for more information.
  • tbotalla
    tbotalla over 2 years
    For anyone wondering: for a file with about 6M rows (2.3 GB) it took 7 min to convert
  • Fabien Haddadi
    Fabien Haddadi over 2 years
    Very fast, even with a 4GB file.