Best way to convert text files between character sets?
Solution 1
Stand-alone utility approach
iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING : the encoding of the input
-t ENCODING : the encoding of the output
You don't have to specify either argument; each defaults to your current locale's encoding, which is usually UTF-8.
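For example, a quick round trip between ISO-8859-1 and UTF-8 (a throwaway sketch using a sample file; the octal escape \351 is "é", byte 0xE9, in ISO-8859-1):

```shell
# Create a one-line ISO-8859-1 file containing "café" (\351 = é in Latin-1)
printf 'caf\351\n' > in.txt

# Convert to UTF-8, then back, and confirm the round trip is lossless
iconv -f ISO-8859-1 -t UTF-8 in.txt > utf8.txt
iconv -f UTF-8 -t ISO-8859-1 utf8.txt > back.txt
cmp -s in.txt back.txt && echo "lossless round trip"
```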
Solution 2
Try VIM
If you have vim, you can use this (not tested for every encoding). The cool part is that you don't have to know the source encoding:

vim +"set nobomb | set fenc=utf8 | x" filename.txt

Be aware that this command modifies the file in place.
Explanation part!

- + : used by vim to run a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
- | : separator of multiple commands (like ; in bash)
- set nobomb : no UTF-8 BOM
- set fenc=utf8 : set the new encoding to UTF-8 (doc link)
- x : save and close the file
- filename.txt : path to the file
- " : quotes are here because of the pipes (otherwise bash would use them as bash pipes)
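To watch the conversion happen, the file's encoding can be checked before and after with the file command (a rough sketch, assuming vim and file are installed; vim -e -s runs silently in ex mode, which is handier in scripts than the interactive invocation):

```shell
# Create an ISO-8859-1 sample ("café"; \351 is é in Latin-1)
printf 'caf\351\n' > sample.txt
file -b --mime-encoding sample.txt    # typically reports iso-8859-1

# Same command as above, wrapped for non-interactive use
vim -e -s +"set nobomb | set fenc=utf8 | x" sample.txt </dev/null

file -b --mime-encoding sample.txt    # should now report utf-8
```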
Solution 3
Under Linux you can use the very powerful recode command to convert between different charsets, as well as fix line-ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
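A minimal sketch, assuming the recode package is installed: convert a Latin-1 file to UTF-8 and switch it to DOS line endings in a single in-place pass (the charset pair is written SOURCE..TARGET, with the line-ending "surface" appended after a slash):

```shell
# "café" in ISO-8859-1 (\351 = é), with a Unix line ending
printf 'caf\351\n' > notes.txt

# In place: ISO-8859-1 -> UTF-8 text, LF -> CR-LF line endings
recode ISO-8859-1..utf8/CR-LF notes.txt
```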
Solution 4
iconv -f FROM-ENCODING -t TO-ENCODING file.txt
Also there are iconv-based tools in many languages.
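When the source encoding is unknown, a common pattern is to let the file command take a best guess and feed that to iconv (a sketch; the guess is heuristic, so it can be wrong on short or ambiguous files):

```shell
# Ask file for its best guess at the encoding (e.g. "iso-8859-1")
enc=$(file -b --mime-encoding in.txt)

# Convert from the guessed encoding to UTF-8
iconv -f "$enc" -t UTF-8 in.txt > out.txt
```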
Solution 5
Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT
The shortest version, if you can assume that the input BOM is correct:
gc FILE.TXT | Out-File -en utf7 file-utf7.txt
devlearn
Updated on October 21, 2021

Comments
- devlearn (over 2 years):
What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.
Best solutions so far:
On Linux/UNIX/OS X/cygwin:
- GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
- recode (manual), suggested by Cheekysoft, will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (DOS):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
On Windows with Powershell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Edit
Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
- CsCvt - Kalytta's Character Set Converter is another great command-line conversion tool for Windows.
- coredumperror (almost 12 years): For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OS X (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings. iconv -l | grep UTF will tell you all the UTF-related encodings that your copy of iconv does support.
- Larry Battle (almost 12 years): Here's a shorter version that works better: gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt
- Jay Bazuzi (almost 12 years): @LarryBattle: How does Set-Content work better than Out-File?
- Larry Battle (almost 12 years): ...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same file-utf8.txt file for input, since they both had the same output file, file-utf7.txt.
- Noah (over 10 years): This would be really great, except that it doesn't support UTF-16. It supports UTF-32, but not UTF-16! I wouldn't need to convert files, except that a lot of Microsoft software (e.g. SQL Server bcp) insists on UTF-16, and then their utility won't convert to it. Interesting, to say the least.
- OZZIE (over 10 years): I tried gc -en Ascii readme.html | Out-File -en UTF8 readme.html but it converts the file to UTF-8 and then it's empty! Notepad++ says the file is ANSI format, but reading up, as I understand it, that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF
- rob (over 10 years): @OZZIE I don't think you can edit a file in place like that. Try saving the content to a temporary file first.
- Stew (over 9 years): Don't know the encoding of your input file? Use chardet in.txt to generate a best guess. The result can be used as ENCODING in iconv -f ENCODING.
- knb (about 9 years): Prevent exit at invalid characters (avoiding "illegal input sequence at position" messages), and replace "weird" characters with "similar" characters: iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt
- LMZ (about 9 years): It's better style to use tmp=$(mktemp) to create a temporary file. Also, the line with rm is redundant.
- frIT (over 8 years): I like this because it's standard on most *NIX platforms. But also see the VIM command option (alias: ex) below. Additional info: (1) you (probably) don't need to specify the -f (from) option with iconv. (2) the file --mime-encoding <file/s> command can help you figure out the encoding in the first place.
- mlibre (almost 8 years): Can you complete this function with auto-detection of the input format?
- DomQ (almost 8 years): Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)?
- jjwdesign (over 7 years): Thank you for the explanation! I was having a difficult time with the beginning of the file until I read up about the bomb/nobomb setting.
- jjwdesign (over 7 years): This statement works great when converting strings, but not for files.
- Boop (over 7 years): np; additionally, you can view the BOM if you use vim -b or head file.txt | cat -e
- Plato (over 7 years): FWIW the file command reported my source as UTF-16 Little Endian; running iconv -f UTF-16 -t UTF-8... transformed it incorrectly to ASCII; I had to explicitly specify iconv -f UTF-16LE... to output UTF-8.
- Brian J. Miller (about 7 years): I had to adapt this solution a bit to work on Mac OS X, at least on my version: find . -type f -iname *.txt -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper(\$0)}") -t UTF-8 > converted "$1" && mv converted "$1"' -- {} \;
- Gabriel (about 7 years): For example: find -regextype posix-extended -type f -regex ".*\.(h|cpp|rc|fx|cs|props|xaml)" -exec vim +'set nobomb | set fenc=utf8 | x' {} \;
- NightOwlPrgmr (almost 7 years): I used this to convert the encoding of CSV files and was really excited when I saw the charset had indeed changed. Unfortunately, when I went to load the file into MySQL, it had a different number of columns than it previously had before running the vim command. Wonder if it would be possible to just open the file, convert the encoding, and save/close the file while leaving all other file content the same?
- philwalk (over 6 years): Beware, this function deletes the input file without verifying that the iconv call succeeded.
- silvioprog (over 6 years): Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it!
- Serge Stroobandt (almost 5 years): @rmuller The sed command is there on purpose, enabling the automatic detection of character encoding. I have expanded the answer to explain this now. It would be courteous with regards to the readership to delete any remaining irrelevant comments. Thank you.
- rmuller (almost 5 years): @SergeStroobandt Maybe I was not clear enough. My point is that when you use "file -b --mime-encoding" instead of "file -bi", there is no need to filter the result with sed. That command already returns only the file encoding. So in your example, "us-ascii".
- Boop (over 4 years): Many ways: 1. use @Gabriel's command; 2. shell expansion: vim +'set nobomb | set fenc=utf8 | x' *.yaml (e.g.); 3. a loop: for f in a.txt b.txt; do vim +'set nobomb | set fenc=utf8 | x' "${f}"; done (none of these has been tested)
- Aaron Franke (about 4 years): This doesn't actually seem to do anything for me on Linux. I saved a file as UTF-8 with BOM and expected it to convert to UTF-8 without BOM, and it didn't.
- Aaron Franke (about 4 years): This changes the contents of the text file. I ran this on a UTF-8-with-BOM file expecting to get out a UTF-8-without-BOM file, but it prepended  at the start of the file.
- Aaron Franke (about 4 years): What about auto-detecting the original encoding?
- Aaron Franke (about 4 years): How do you convert to LF? There is /CR and /CR-LF but no /LF.
- Ali80 (over 3 years): Exactly what the doctor ordered.
- paradox (over 2 years): That's what I get when I run it: Usage: iconv [-c] [-s] [-f fromcode] [-t tocode] [file ...] or: iconv -l Try 'iconv --help' for more information.
- tbotalla (over 2 years): For anyone wondering: for a file with about 6M rows (2.3 GB) it took 7 min to convert.
- Fabien Haddadi (over 2 years): Very fast, even with a 4 GB file.