gedit can't recognize character encoding, but gvim can

Solution 1

iconv is probably what you'll want to use. iconv -l will show you the available encodings and then you can use a couple of commands to recode them all:

# all text files are in ./originals/
# new files will be written to ./newversions/

mkdir -p newversions
cd originals
for file in *.txt; do
    # Quote "$file" so names containing spaces survive the loop;
    # iconv reads the file directly, so no cat is needed.
    iconv -f ASCII -t UTF-8 "$file" > "../newversions/$file"
done
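
To spot-check the results, file can report the encoding it detects for the new copies (this assumes a file build recent enough to support the --mime-encoding option; it prints one "name: encoding" line per file):

file --mime-encoding ../newversions/*.txt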

If you want to do this with files you don't know the encoding of (because they're all over the place), you want to bring in a few more commands: find, file, awk and sed. The last two are just there to process the output of file.

for file in $(find . -type f -exec file --mime {} \; | grep "ascii" | awk '{print $1}' | sed 's/.$//'); do
    ...
done

I've no idea if this actually works, so I certainly wouldn't run it anywhere but the least important directory you have (make a test folder with some known ASCII files in it). Word-splitting the output of find like this will also break on filenames with spaces. I'd hope that somebody else with more bash experience could jump in and sort it out so it does the right thing.
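
In that spirit, here is a sketch of a more defensive version, equally untested: find emits NUL-separated paths so filenames with spaces survive, each file is checked individually, and the directory names and the "us-ascii" match string are placeholders to adapt:

mkdir -p newversions
find originals -type f -name '*.txt' -print0 |
while IFS= read -r -d '' f; do
    # Only recode files that file(1) reports as plain ASCII.
    if file --mime "$f" | grep -q 'us-ascii'; then
        # Note: same-named files in different subdirectories
        # will collide in newversions/.
        iconv -f ASCII -t UTF-8 "$f" > "newversions/$(basename "$f")"
    fi
done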

Solution 2

Gedit can detect the correct character set only if it is listed under "File → Open → Character encoding". You can alter this list, but keep in mind that the order matters.
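
On gedit versions that keep this list in GSettings, it can also be edited from a terminal. The schema and key below (org.gnome.gedit.preferences.encodings, auto-detected) are from gedit 3.x and the key names vary between releases, so list what your version actually exposes first:

# See which encoding keys this gedit version has:
gsettings list-recursively org.gnome.gedit.preferences.encodings

# A sketch: try WINDOWS-1252 ahead of the ISO encodings.
gsettings set org.gnome.gedit.preferences.encodings auto-detected \
    "['UTF-8', 'WINDOWS-1252', 'ISO-8859-15', 'CURRENT', 'UTF-16']"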

Comments

  • Peter.O
    Peter.O over 1 year

    I have a lot of plain text files which come from a Windows environment.
    Many of them use a whacky default Windows code-page, which is neither ASCII (7 bits) nor UTF-8.

    gvim has no problem opening these files, but gedit fails to do so.
    gvim reports the encoding as latin1.

    I assume that gvim is making a "smart" assumption about the code-page.
    (I believe this code-page still has international variants).

    Some questions arise from this:

    • (1). Is there some way that gedit can be told to recognize this code-page?
      ** NB. [Update] For this point (1), see my answer, below.
      ** For points (2) and (3), see Oli's answer.

    • (2). Is there a way to scan the file system to identify these problem files?

    • (3). Is there a batch converting tool to convert these files to UTF-8?

    (.. this old-world text mayhem was actually the final straw that brought me over to Ubuntu... UTF-8 system-wide by default. Brilliant!)

    [UPDATE]
    ** NB: ** I now consider the following update to be partially irrelevant, because the "problem" files aren't really the problem (see my answer below).
    I've left it here because it may be of some general use to someone.


    I've worked out a rough and ready way to identify the problem files...
    The file command was not suitable, because it identified my example file as ASCII... but an ASCII file is 100% UTF-8 compliant...

    As I mentioned in a comment below, the test for an invalid first byte of a UTF-8 codepoint is:

    • if the first byte (of a UTF-8 codepoint) is between 0x80 and 0xBF (reserved for additional bytes), or greater than 0xF7 (bytes 0xF8-0xFF never occur in valid UTF-8), that is considered an error

    I know sed (a bit, via a Win32 port), so I've managed to cobble together a RegEx pattern which finds these offending bytes.

    It's an ugly line, so look away now if regular expressions scare you :)

    I'd really appreciate it if someone points out how to use hex values in a range [] expression... I've just used the or operator \| (a couple of alternatives are sketched at the end of this comment).

    fqfn="/my/fully/qualified/filename"  
    sed -n "/\x80\|\x81\|\x82\|\x83\|\x84\|\x85\|\x86\|\x87\|\x88\|\x89\|\x8A\|\x8B\|\x8C\|\x8D\|\x8E\|\x8F\|\x90\|\x91\|\x92\|\x93\|\x94\|\x95\|\x96\|\x97\|\x98\|\x99\|\x9A\|\x9B\|\x9C\|\x9D\|\x9E\|\x9F\|\xA0\|\xA1\|\xA2\|\xA3\|\xA4\|\xA5\|\xA6\|\xA7\|\xA8\|\xA9\|\xAA\|\xAB\|\xAC\|\xAD\|\xAE\|\xAF\|\xB0\|\xB1\|\xB2\|\xB3\|\xB4\|\xB5\|\xB6\|\xB7\|\xB8\|\xB9\|\xBA\|\xBB\|\xBC\|\xBD\|\xBE\|\xBF\|\xF8\|\xF9\|\xFA\|\xFB\|\xFC\|\xFD\|\xFE\|\xFF/p" "${fqfn}"  
    

    So, I'll now graft this into Oli's batch solution... Thanks Oli!

    PS. Here is the invalid byte it found in my sample file ...
    "H.Bork, Gøte-borg." ... the "ø" = F8 hex... a byte which can never occur in valid UTF-8.

  • Peter.O
    Peter.O over 13 years
    Aha, thanks; that sorts out the batch encoding... I actually used a Win32 port of iconv a couple of years ago, to convert between UTF-8 and UTF-16... I thought it was only for conversion between different Unicode encodings, but as you've pointed out, it does much more... Do you know of any way to actually find these files? They are scattered over many, many directories...
  • Oli
    Oli over 13 years
    You'd want to expand the script to use find (to find all .txt files) and its -exec flag to run something against that file. Then you'd want the file command to detect the charset. Hang on a minute. I'll edit my answer to show you what I'm thinking.
  • Peter.O
    Peter.O over 13 years
    I tried file, but it returns ASCII... I think this is what I need to detect ... If the first byte (of a UTF-8 codepoint) is between 0x80 and 0xBF (reserved for additional bytes), or greater than 0xF7 ("overlong form"), that is considered an error
  • carnendil
    carnendil about 11 years
    In case you don't have a clue, is there any way to "detect" from the terminal which codepage you need?
  • Peter.O
    Peter.O about 11 years
    Trying to establish which codepage was intended by the original author is not possible in many situations. Sub-sets of several codepages often coincide and produce a "valid" state (for that codepage and that text). The matching codepage can be technically valid, but text may show as the wrong symbol. In these cases, it really requires that the text be proof-read by a person!
  • Peter.O
    Peter.O about 11 years
    Regarding ways to detect the codepage via the terminal: iconv and recode can be used, but it is clumsy, as there are often multiple "valid" possibilities. You need to just keep testing codepages (valid and invalid) until you find a technically valid one (see the sketch below)... I just use emacs instead... which, by the way, is scriptable, i.e. you can call Emacs's elisp scripting language from a bash script (just as you may call awk).
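
    A minimal sketch of that trial-and-error loop, assuming a hypothetical file.txt and an arbitrary candidate list. Note that ISO-8859-1 assigns a character to every possible byte, so it always "succeeds"; that is exactly why a human still has to read the result:

    for enc in UTF-8 WINDOWS-1252 ISO-8859-15 ISO-8859-1; do
        # iconv exits non-zero as soon as it hits a byte that is
        # invalid in $enc.
        if iconv -f "$enc" -t UTF-8 file.txt > /dev/null 2>&1; then
            echo "file.txt decodes cleanly as $enc"
        fi
    done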