iconv illegal input sequence: why?


Solution 1

The file is encoded in ISO-8859-1, not in UTF-8:

$ hd 0606461.txt | grep -B1 '^0002c520'
0002c510  64 75 6d 20 66 65 72 69  65 6e 74 20 72 75 69 6e  |dum ferient ruin|
0002c520  e6 0d 0a 2d 2d 48 6f 72  61 63 65 2e 0d 0a 0d 0a  |...--Horace.....|

A lone "e6" byte is not a valid UTF-8 sequence; in UTF-8, "æ" would be encoded as "c3 a6".

So tell iconv the actual source encoding:

iconv -f latin1 -t ascii//TRANSLIT file
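To see the difference without the original file, here is a minimal sketch that builds a one-line sample containing the same lone e6 byte. The temporary file and the variable names are made up for illustration. It converts to UTF-8 rather than to ASCII//TRANSLIT, because transliteration output can vary with the iconv implementation and locale.

```shell
# Hypothetical sample: "ruin" followed by the single ISO-8859-1 byte e6 (octal 346).
sample=$(mktemp)
printf 'ruin\346\n' > "$sample"

# Declared as UTF-8, the input is rejected: e6 opens a multi-byte
# sequence, but the next byte (0a) is not a continuation byte.
if iconv -f UTF-8 -t ASCII//TRANSLIT "$sample" >/dev/null 2>&1; then
    utf8_status=ok
else
    utf8_status=failed
fi

# Declared as ISO-8859-1, the same bytes convert cleanly; dump the
# UTF-8 result as hex to see e6 turn into the two-byte sequence c3 a6.
latin1_hex=$(iconv -f ISO-8859-1 -t UTF-8 "$sample" | od -An -tx1 | tr -d ' \n')

echo "decode as UTF-8: $utf8_status"
echo "ISO-8859-1 -> UTF-8 bytes: $latin1_hex"
rm -f "$sample"
```

Only the -f argument changes between the two calls, and that is the whole fix: iconv does not guess the source encoding, it trusts what you declare.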

Solution 2

The file you linked appears to be UTF-8 inside an HTML document:

$ file 0606461.txt 
0606461.txt: HTML document, ASCII text, with CRLF line terminators

If you run it through an HTML-to-text converter first, e.g.

iconv -f UTF-8 -t ascii//TRANSLIT < <(html2text 0606461.txt)

then the fragment you are having trouble with appears to transliterate without error, i.e.

Si fractus illabatur orbis.
Impavidum ferient ruinæ

becomes

Si fractus illabatur orbis.
Impavidum ferient ruinae

The html2text utility may not be installed on your system. If you can't locate or install it, there are other converters, including a Python module.


Updated on September 18, 2022



  • vinc17
    vinc17 almost 10 years
    No, the file is not encoded in UTF-8, but in ISO-8859-1. BTW, the file command says ASCII, but the reason is that it just looks at the beginning of the file, and the ISO-8859-1 character appears far away, at position 181536.
  • user13107
    user13107 almost 10 years
    @vinc17 how did you find out the file was in ISO-8859?
  • vinc17
    vinc17 almost 10 years
    @user13107 by looking at the encoding of the offending character: it is the byte "e6", not the UTF-8 sequence "c3 a6". Emacs also detected the file as being in ISO-8859-1.
  • user13107
    user13107 almost 10 years
    How do I find out which byte corresponds to the offending character? I tried the hexdump -C file command and got 0002b220 72 75 69 6e e6 0a 20 2d 2d 20 48 6f 72 61 63 65 |ruin.. -- Horace| as output.
  • vinc17
    vinc17 almost 10 years
    In what you got, you can see that the only top-bit-set byte (a byte whose value is ≥ 80 in hexadecimal) is e6. This doesn't correspond to a valid UTF-8 sequence (in UTF-8, non-ASCII characters need at least 2 top-bit-set bytes). In ISO-8859-1, e6 is the encoding of the character "æ", which corresponds to the expected text; so, this confirms that the ISO-8859-1 encoding (or similar) is used for this file.
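The check described in these comments (look for top-bit-set bytes, then ask whether they form valid UTF-8) can be sketched from the shell without reading a hex dump by eye. The sample file and variable names below are illustrative assumptions, not part of the original thread.

```shell
# Hypothetical sample reproducing the offending line.
sample=$(mktemp)
printf 'Impavidum ferient ruin\346\n--Horace.\n' > "$sample"

# Delete every 7-bit byte (00-7f); whatever remains has the top bit set.
high_count=$(LC_ALL=C tr -d '\000-\177' < "$sample" | wc -c | tr -d ' ')
high_hex=$(LC_ALL=C tr -d '\000-\177' < "$sample" | od -An -tx1 | tr -d ' \n')

# A UTF-8 to UTF-8 round trip acts as a validator: iconv exits
# non-zero as soon as it meets an invalid sequence.
if iconv -f UTF-8 -t UTF-8 "$sample" >/dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi

echo "top-bit-set bytes: $high_count ($high_hex)"
echo "valid UTF-8: $valid_utf8"
rm -f "$sample"
```

A single isolated high byte, as here, points to a single-byte encoding such as ISO-8859-1. UTF-8 text would show those bytes in groups of two or more per non-ASCII character.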