iconv illegal input sequence - why?

text-processing character-encoding unicode

242

Solution 1

The file is encoded in ISO-8859-1, not in UTF-8:

$ hd 0606461.txt | grep -B1 '^0002c520'
0002c510  64 75 6d 20 66 65 72 69  65 6e 74 20 72 75 69 6e  |dum ferient ruin|
0002c520  e6 0d 0a 2d 2d 48 6f 72  61 63 65 2e 0d 0a 0d 0a  |...--Horace.....|

And the byte "e6" alone is not a valid UTF-8 sequence.

So, use iconv -f latin1 -t ascii//TRANSLIT file.
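The same check can be scripted. Below is a minimal Python sketch, not part of the original answer: it asks Python's strict UTF-8 decoder where decoding fails, then decodes the same bytes as ISO-8859-1. The helper name first_invalid_utf8 is hypothetical, and the sample bytes are copied from the hex dump above rather than read from the real file.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# Sample bytes taken from the hex dump above (an assumption: not the whole file).
sample = b"dum ferient ruin\xe6\r\n--Horace.\r\n"

bad = first_invalid_utf8(sample)
if bad is not None:
    offset, value = bad
    print(f"invalid UTF-8 at offset {offset}: byte {value:#04x}")

# ISO-8859-1 maps every byte value to a character, so this decode cannot fail.
text = sample.decode("iso-8859-1")

# Re-encoding as UTF-8 shows the two-byte form a UTF-8 file would contain.
print(text.splitlines()[0].encode("utf-8"))
```

For a real file, the bytes would come from open(path, "rb").read(); the offset reported is relative to the start of whatever bytes are passed in.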

Solution 2

The file you linked appears to be UTF-8 text inside an HTML document:

$ file 0606461.txt 
0606461.txt: HTML document, ASCII text, with CRLF line terminators

If you run it through an HTML-to-text converter first, e.g.

iconv -f UTF-8 -t ascii//TRANSLIT < <(html2text 0606461.txt)

then the UTF-8 fragment you are having trouble with transliterates without error, i.e.

Si fractus illabatur orbis.
Impavidum ferient ruinæ
--Horace.

becomes

Si fractus illabatur orbis.
Impavidum ferient ruinae
--Horace.

The html2text utility may not be installed on your system. If you can't locate or install it, there are other converters, including a Python module.
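If no converter is available, a rough substitute can be put together from Python's standard library alone. The sketch below is an assumption about what "good enough" might look like, not a replacement for html2text: the names TextExtractor and html_to_text are hypothetical, it only drops tags and decodes character references, and it does no layout and does not skip script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags; character references are decoded."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Illustrative input only; the real file would be read and decoded first.
snippet = "<p>Impavidum ferient ruin&aelig;</p><p>--Horace.</p>"
print(html_to_text(snippet).encode("utf-8"))
```

The result is a str, so it can be encoded to whichever output encoding is needed afterwards.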

user1132254

Updated on September 18, 2022

Comments

vinc17 almost 10 years

No, the file is not encoded in UTF-8, but in ISO-8859-1. BTW, the file command says ASCII, but the reason is that it just looks at the beginning of the file, and the ISO-8859-1 character appears far away, at position 181536.
user13107 almost 10 years

@vinc17 how did you find out the file was in ISO-8859?
vinc17 almost 10 years

@user13107 by looking at the encoding of the offending character: it is the byte "e6", not the UTF-8 sequence "c3 a6". Emacs also detected the file as being in ISO-8859-1.
user13107 almost 10 years

How do I find out which byte corresponds to the offending character? I tried the hexdump -C file command and got 0002b220 72 75 69 6e e6 0a 20 2d 2d 20 48 6f 72 61 63 65 |ruin.. -- Horace| as output.
vinc17 almost 10 years

In what you got, you can see that the only top-bit-set byte (a byte whose value is ≥ 80 in hexadecimal) is e6. This doesn't correspond to a valid UTF-8 sequence (in UTF-8, non-ASCII characters need at least 2 top-bit-set bytes). In ISO-8859-1, e6 is the encoding of the character "æ", which corresponds to the expected text; so, this confirms that the ISO-8859-1 encoding (or similar) is used for this file.
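The reasoning in this last comment can be checked mechanically. Here is a minimal Python sketch, with the hypothetical helper name high_bytes and the sample bytes copied from the hexdump output quoted in the comment above:

```python
from typing import List, Tuple


def high_bytes(data: bytes) -> List[Tuple[int, int]]:
    """List (offset, value) for every byte with the top bit set (>= 0x80)."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]


# Bytes from the hexdump line quoted in the comments.
line = bytes.fromhex("72 75 69 6e e6 0a 20 2d 2d 20 48 6f 72 61 63 65")

for offset, value in high_bytes(line):
    print(f"offset {offset}: {value:02x}")  # offset 4: e6

# The same character (U+00E6) in the two encodings under discussion.
print("\u00e6".encode("iso-8859-1").hex())  # e6
print("\u00e6".encode("utf-8").hex())       # c3a6
```

A lone top-bit-set byte surrounded by ASCII, as found here, cannot be UTF-8, because every non-ASCII character in UTF-8 is encoded with at least two such bytes in a row.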