iconv any encoding to UTF-8
Solution 1
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently of language.
Note that, in general, autodetecting the current encoding is a difficult problem (the same byte sequence can be valid text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of candidate encodings). You can use enconv to convert text files to a single encoding.
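A quick sketch of how detection and conversion fit together, assuming enca is installed. Czech is an example language hint (pick one from the supported list above), and sample.txt is a throwaway file created just for the demonstration; the -x utf8 / -L options match the syntax shown in the comments below.

```shell
# Throwaway file with some Latin-2-style high-bit bytes to detect.
printf 'p\xf8\xedli\x9a\n' > sample.txt

if command -v enca >/dev/null 2>&1; then
    enca -L czech sample.txt              # print the detected charset
    enconv -L czech -x utf8 sample.txt    # convert the file in place
else
    echo "enca is not installed"
fi
```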
Solution 2
You can get what you need using the standard GNU utilities file and awk. Example:
file -bi .xsession-errors
gives me:
"text/plain; charset=us-ascii"
so file -bi .xsession-errors |awk -F "=" '{print $2}'
gives me
"us-ascii"
I use it in scripts like so:
CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
    iconv -f "$CHARSET" -t utf-8 "$i" -o outfile
fi
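As an aside, recent versions of file can print just the charset, which skips the awk step entirely. A sketch with a throwaway file (demo-ascii.txt is an assumed name):

```shell
# Create a small ASCII file to inspect.
echo 'hello' > demo-ascii.txt

# -b suppresses the filename; --mime-encoding prints only the charset
CHARSET="$(file -b --mime-encoding demo-ascii.txt)"
echo "$CHARSET"   # e.g. us-ascii
```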
Solution 3
Putting it all together: go to the directory and create dir2utf8.sh:
#!/bin/bash
# Convert all regular files in the current directory to UTF-8
for f in *
do
    if test -f "$f"; then
        echo -e "\nConverting $f"
        CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
        if [ "$CHARSET" != utf-8 ]; then
            # Write to a temporary file first: iconv cannot safely
            # read and write the same file
            iconv -f "$CHARSET" -t utf-8 "$f" -o "$f.tmp" && mv "$f.tmp" "$f"
        fi
    else
        echo -e "\nSkipping $f - not a regular file";
    fi
done
Solution 4
Here is my solution to convert all files in place using recode and uchardet:
#!/bin/bash
apt-get -y install recode uchardet > /dev/null
find "$1" -type f | while read -r FFN
do
    encoding=$(uchardet "$FFN")
    echo "$FFN: $encoding"
    # Map uchardet's x-mac-* names onto the names recode expects
    enc=$(echo "$encoding" | sed 's#^x-mac-#mac#')
    recode "$enc..UTF-8" "$FFN"
done
put it into convert-dir-to-utf8.sh
and run:
bash convert-dir-to-utf8.sh /path/to/my/trash/dir
Note that the sed call is a workaround for Mac encodings here: uchardet reports names with an x-mac- prefix, while recode expects a mac prefix. Many uncommon encodings need workarounds like this.
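The name mapping can be checked in isolation. For example, x-mac-cyrillic (the form the script expects from uchardet) is rewritten to maccyrillic:

```shell
# Strip the "x-mac-" prefix and replace it with "mac"
enc='x-mac-cyrillic'
echo "$enc" | sed 's#^x-mac-#mac#'   # prints: maccyrillic
```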
Solution 5
First answer
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        echo "Converting ($CHARSET) $LINE_FILE"
        # NOTE: Convert/reconvert to utf8 via a temporary file, since
        # iconv cannot safely read and write the same file. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE.tmp" && mv "$LINE_FILE.tmp" "$LINE_FILE"
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files were not correctly converted (characters were lost) or were "truncated". I suspect this has to do with the iconv tool or with the charset information obtained with the uchardet tool. I was curious about the solution presented by @demofly because it could be safer.
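A likely cause of such truncation is iconv writing to the very file it is still reading: iconv has no safe in-place mode, so the conversion should go through a temporary file that replaces the original only on success. A minimal sketch with a throwaway file:

```shell
# "café" encoded as Latin-1 (0xE9 = é)
printf 'caf\xe9\n' > demo.txt

# Convert via a temporary file; replace the original only if iconv succeeds
iconv -f ISO-8859-1 -t UTF-8 demo.txt -o demo.txt.tmp && mv demo.txt.tmp demo.txt

od -An -tx1 demo.txt   # é is now the two-byte UTF-8 sequence c3 a9
```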
Another answer
Based on @demofly's answer:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        # Map uchardet's x-mac-* names onto the names recode expects
        REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: recode failed; retry with iconv, via a temporary file
            # since iconv cannot safely overwrite its input. By Questor
            iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE.tmp" 2> STDERR_OP 1> STDOUT_OP &&
                mv "$LINE_FILE.tmp" "$LINE_FILE"
            STDERR_OP=$(cat STDERR_OP)
            rm -f STDERR_OP
        fi
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        if [ -n "$STDERR_OP" ] ; then
            echo "ERROR: \"$STDERR_OP\""
        fi
        STDOUT_OP=$(cat STDOUT_OP)
        rm -f STDOUT_OP
        if [ -n "$STDOUT_OP" ] ; then
            echo "RESULT: \"$STDOUT_OP\""
        fi
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
Third answer
A hybrid solution with recode and vim:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        # Map uchardet's x-mac-* names onto the names recode expects
        REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: recode failed; fall back to rewriting the file as
            # UTF-8 with vim. By Questor
            bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
        else
            # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
            # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
            # https://stackoverflow.com/a/45240995/3223785 ]
            sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        fi
    done
This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.
- WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
- TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can be executed after a preliminary comparison with the merge tool, after a conversion without it, since it can cause "differences".
- NOTE: The search using find brings all non-binary files from the given path and its subfolders.
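The BOM-stripping sed used in the scripts above can be tried on a throwaway file (bom-demo.txt is an assumed name; the \xHH escapes require GNU sed):

```shell
# Write a UTF-8 BOM (EF BB BF) followed by "hello", then strip the BOM
printf '\xef\xbb\xbfhello\n' > bom-demo.txt
sed -i '1s/^\xEF\xBB\xBF//' bom-demo.txt
cat bom-demo.txt   # prints: hello
```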
Blainer
Updated on July 09, 2022
Comments
-
Blainer almost 2 years: I am trying to point iconv at a directory so that all files are converted to UTF-8 regardless of their current encoding.
I am using this script, but you have to specify what encoding you are going FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary

if [ $# -lt 3 ]
then
    echo "$0 dir from_charset to_charset"
    exit
fi

for f in $1/*
do
    if test -f $f
    then
        echo -e "\nConverting $f"
        /bin/mv $f $f.old
        $ICONVBIN -f $2 -t $3 $f.old > $f
    else
        echo -e "\nSkipping $f - not a regular file";
    fi
done
terminal line
sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8
-
tripleee over 11 years: The heuristics used by file can be fairly crude, though. Watch out.
-
trante about 10 years: Your Enca link doesn't work. Is this the updated one? freecode.com/projects/enca
-
Michal Kottman about 10 years: It seems like Enca moved to GitHub since then. Notice that the freecode site also links to a nonexistent Gitorious page. Updated the link in the answer.
-
Daniel Dropik over 9 years: I wonder if you meant iconv rather than econv, because I can't find econv in the manual.
-
Éderson T. Szlachta about 6 years: uchardet saved my script.
-
Eduardo Lucio over 5 years: TIP: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
-
phyatt about 5 years: Two minor notes: I would replace <YOUR_FOLDER_PATH> with "$1" and let the end user pass in the folder path. And for macOS users, you need to run brew install recode uchardet gnu-sed, and then change sed to gsed to get it to work. And nice job removing binary files using grep -I. Top marks!
-
Pablo Bianchi almost 4 years: recode doesn't seem to be maintained any more besides this fork.
Eduardo Lucio almost 4 years: Your suggestions were accepted almost completely. I didn't keep the change 'I would replace <YOUR_FOLDER_PATH> with "$1"' because I thought the previous approach is clearer for more people. Thanks! =D
-
glerYbo almost 4 years: Syntax: enca -x utf8 -L mylanguage file.srt
-
Peter Krauss about 3 years: List of valid languages in your version: enca -l languages ... But Ubuntu is slow to update; my enca --version is from 2005! How to upgrade it?
-
rofrol over 2 years: You shouldn't give iconv the same file for input and output: unix.stackexchange.com/questions/10241/… stackoverflow.com/questions/17872302/…
-
rofrol over 2 years: I read that it is better to use in-place iconv: iconv -f UTF-32 -t UTF-8 file.csv stackoverflow.com/questions/64860/…