iconv any encoding to UTF-8
Solution 1
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently of language.
Note that, in general, autodetecting the current encoding is a difficult problem (the same byte sequence can be valid text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of candidate encodings). You can use enconv to convert text files to a single encoding.
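A quick sketch of how detection and conversion fit together, assuming enca is installed. Czech is an example language hint (pick one from the supported list above), and sample.txt is a throwaway file created just for the demonstration; the -x utf8 / -L options match the syntax shown in the comments below.

```shell
# Throwaway file with some Latin-2-style high-bit bytes to detect.
printf 'p\xf8\xedli\x9a\n' > sample.txt

if command -v enca >/dev/null 2>&1; then
    enca -L czech sample.txt              # print the detected charset
    enconv -L czech -x utf8 sample.txt    # convert the file in place
else
    echo "enca is not installed"
fi
```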
Solution 2
You can get what you need using the standard GNU utilities file and awk. Example:
file -bi .xsession-errors
gives me:
"text/plain; charset=us-ascii"
so file -bi .xsession-errors |awk -F "=" '{print $2}'
gives me
"us-ascii"
I use it in scripts like so:
CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
    iconv -f "$CHARSET" -t utf-8 "$i" -o outfile
fi
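As an aside, recent versions of file can print just the charset, which skips the awk step entirely. A sketch with a throwaway file (demo-ascii.txt is an assumed name):

```shell
# Create a small ASCII file to inspect.
echo 'hello' > demo-ascii.txt

# -b suppresses the filename; --mime-encoding prints only the charset
CHARSET="$(file -b --mime-encoding demo-ascii.txt)"
echo "$CHARSET"   # e.g. us-ascii
```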
Solution 3
Putting it all together: go to the directory and create dir2utf8.sh:
#!/bin/bash
# Convert all regular files in the current directory to UTF-8
for f in *
do
    if test -f "$f"; then
        echo -e "\nConverting $f"
        CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
        if [ "$CHARSET" != utf-8 ]; then
            # Write to a temporary file first: iconv cannot safely
            # read and write the same file
            iconv -f "$CHARSET" -t utf-8 "$f" -o "$f.tmp" && mv "$f.tmp" "$f"
        fi
    else
        echo -e "\nSkipping $f - not a regular file";
    fi
done
Solution 4
Here is my solution to convert all files in place using recode and uchardet:
#!/bin/bash
apt-get -y install recode uchardet > /dev/null
find "$1" -type f | while read -r FFN
do
    encoding=$(uchardet "$FFN")
    echo "$FFN: $encoding"
    # Map uchardet's x-mac-* names onto the names recode expects
    enc=$(echo "$encoding" | sed 's#^x-mac-#mac#')
    recode "$enc..UTF-8" "$FFN"
done
put it into convert-dir-to-utf8.sh
and run:
bash convert-dir-to-utf8.sh /path/to/my/trash/dir
Note that the sed call is a workaround for Mac encodings here: uchardet reports names with an x-mac- prefix, while recode expects a mac prefix. Many uncommon encodings need workarounds like this.
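The name mapping can be checked in isolation. For example, x-mac-cyrillic (the form the script expects from uchardet) is rewritten to maccyrillic:

```shell
# Strip the "x-mac-" prefix and replace it with "mac"
enc='x-mac-cyrillic'
echo "$enc" | sed 's#^x-mac-#mac#'   # prints: maccyrillic
```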
Solution 5
First answer
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        echo "Converting ($CHARSET) $LINE_FILE"
        # NOTE: Convert/reconvert to utf8 via a temporary file, since
        # iconv cannot safely read and write the same file. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE.tmp" && mv "$LINE_FILE.tmp" "$LINE_FILE"
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files were not correctly converted (characters were lost) or were "truncated". I suspect this has to do with the iconv tool or with the charset information obtained with the uchardet tool. I was curious about the solution presented by @demofly because it could be safer.
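A likely cause of such truncation is iconv writing to the very file it is still reading: iconv has no safe in-place mode, so the conversion should go through a temporary file that replaces the original only on success. A minimal sketch with a throwaway file:

```shell
# "café" encoded as Latin-1 (0xE9 = é)
printf 'caf\xe9\n' > demo.txt

# Convert via a temporary file; replace the original only if iconv succeeds
iconv -f ISO-8859-1 -t UTF-8 demo.txt -o demo.txt.tmp && mv demo.txt.tmp demo.txt

od -An -tx1 demo.txt   # é is now the two-byte UTF-8 sequence c3 a9
```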
Another answer
Based on @demofly's answer:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        # Map uchardet's x-mac-* names onto the names recode expects
        REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: recode failed; retry with iconv, via a temporary file
            # since iconv cannot safely overwrite its input. By Questor
            iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE.tmp" 2> STDERR_OP 1> STDOUT_OP &&
                mv "$LINE_FILE.tmp" "$LINE_FILE"
            STDERR_OP=$(cat STDERR_OP)
            rm -f STDERR_OP
        fi
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        if [ -n "$STDERR_OP" ] ; then
            echo "ERROR: \"$STDERR_OP\""
        fi
        STDOUT_OP=$(cat STDOUT_OP)
        rm -f STDOUT_OP
        if [ -n "$STDOUT_OP" ] ; then
            echo "RESULT: \"$STDOUT_OP\""
        fi
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]
Third answer
A hybrid solution with recode and vim:
#!/bin/bash
find "<YOUR_FOLDER_PATH>" -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        # Map uchardet's x-mac-* names onto the names recode expects
        REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: recode failed; fall back to rewriting the file as
            # UTF-8 with vim. By Questor
            bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
        else
            # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
            # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
            # https://stackoverflow.com/a/45240995/3223785 ]
            sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        fi
    done
This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.
- WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
- TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can be executed after a preliminary comparison with the merge tool, after a conversion without it, since it can cause "differences".
- NOTE: The search using find brings all non-binary files from the given path and its subfolders.
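The BOM-stripping sed used in the scripts above can be tried on a throwaway file (bom-demo.txt is an assumed name; the \xHH escapes require GNU sed):

```shell
# Write a UTF-8 BOM (EF BB BF) followed by "hello", then strip the BOM
printf '\xef\xbb\xbfhello\n' > bom-demo.txt
sed -i '1s/^\xEF\xBB\xBF//' bom-demo.txt
cat bom-demo.txt   # prints: hello
```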
Blainer
Updated on July 09, 2022
Comments
-
Blainer almost 2 years: I am trying to point iconv at a directory so that all files are converted to UTF-8 regardless of their current encoding.
I am using this script, but you have to specify what encoding you are going FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary

if [ $# -lt 3 ]
then
    echo "$0 dir from_charset to_charset"
    exit
fi

for f in $1/*
do
    if test -f $f
    then
        echo -e "\nConverting $f"
        /bin/mv $f $f.old
        $ICONVBIN -f $2 -t $3 $f.old > $f
    else
        echo -e "\nSkipping $f - not a regular file";
    fi
done
terminal line
sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8
-
tripleee over 11 years: The heuristics used by file can be fairly crude, though. Watch out.
-
trante about 10 years: Your Enca link doesn't work. Is this the updated one? freecode.com/projects/enca
-
Michal Kottman about 10 years: It seems like Enca moved to GitHub since then. Notice that the freecode site also links to a nonexistent Gitorious page. Updated the link in the answer.
-
Daniel Dropik over 9 years: I wonder if you meant iconv rather than econv, because I can't find econv in the manual.
-
Éderson T. Szlachta about 6 years: uchardet saved my script.
-
Eduardo Lucio over 5 years: TIP: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
-
phyatt about 5 years: Two minor notes: I would replace <YOUR_FOLDER_PATH> with "$1" and let the end user pass in the folder path. And for macOS users, you need to run brew install recode uchardet gnu-sed, and then change sed to gsed to get it to work. And nice job removing binary files using grep -I. Top marks!
-
Pablo Bianchi almost 4 years: recode doesn't seem to be maintained any more besides this fork.
Eduardo Lucio almost 4 years: Your suggestions were accepted almost completely. I didn't keep the change 'I would replace <YOUR_FOLDER_PATH> with "$1"' because I thought the previous approach is clearer for more people. Thanks! =D
-
glerYbo almost 4 years: Syntax: enca -x utf8 -L mylanguage file.srt
-
Peter Krauss about 3 years: List of valid languages in your version: enca -l languages ... But Ubuntu is slow to update; my enca --version is from 2005! How to upgrade it?
-
rofrol over 2 years: You shouldn't give iconv the same file for input and output: unix.stackexchange.com/questions/10241/… stackoverflow.com/questions/17872302/…
-
rofrol over 2 years: I read that it is better to use in-place iconv: iconv -f UTF-32 -t UTF-8 file.csv stackoverflow.com/questions/64860/…