iconv: illegal input sequence - why?
Solution 1
The file is encoded in ISO-8859-1, not in UTF-8:
$ hd 0606461.txt | grep -B1 '^0002c520'
0002c510 64 75 6d 20 66 65 72 69 65 6e 74 20 72 75 69 6e |dum ferient ruin|
0002c520 e6 0d 0a 2d 2d 48 6f 72 61 63 65 2e 0d 0a 0d 0a |...--Horace.....|
And the byte "e6" alone is not a valid UTF-8 sequence.
So, use iconv -f latin1 -t ascii//TRANSLIT file.
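As a rough cross-check of the reasoning above, the same bytes can be tried in Python. This is only a sketch: the helper name decode_latin1_to_ascii and its tiny replacement table are hypothetical stand-ins for iconv's //TRANSLIT, not a reimplementation of it.

```python
# Bytes taken from the hex dump above: "ruin", then 0xe6, then CRLF.
raw = b"ruin\xe6\r\n"


def decode_latin1_to_ascii(data: bytes) -> str:
    """Decode ISO-8859-1 bytes, then replace a few non-ASCII letters.

    The table below is an assumption made for illustration; it covers
    only the characters listed and is far smaller than what iconv knows.
    """
    text = data.decode("latin-1")  # every byte value is valid in latin-1
    table = str.maketrans({"æ": "ae", "Æ": "AE", "ß": "ss"})
    return text.translate(table)


# Decoding as UTF-8 fails: 0xe6 announces a multi-byte sequence, but the
# bytes that follow (CR, LF) are not continuation bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

ascii_text = decode_latin1_to_ascii(raw)
print(utf8_ok, repr(ascii_text))
```

Decoding as latin-1 never fails, because every byte maps to a character; that is also why it says nothing about whether latin-1 is the right guess, so the decoded text still has to be checked by eye.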
Solution 2
The file you linked appears to be UTF-8 inside an HTML document:
$ file 0606461.txt
0606461.txt: HTML document, ASCII text, with CRLF line terminators
If you run it through an HTML-to-text converter first, e.g.
iconv -f UTF-8 -t ascii//TRANSLIT < <(html2text 0606461.txt)
then the fragment you are having trouble with appears to transliterate without error, i.e.
Si fractus illabatur orbis.
Impavidum ferient ruinæ
--Horace.
becomes
Si fractus illabatur orbis.
Impavidum ferient ruinae
--Horace.
The html2text utility may not be installed on your system. If you can't locate or install it, there are other converters, including a Python module.
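If no converter is available, a very small HTML-to-text step can be sketched with Python's standard html.parser module. The class name TextExtractor, the function html_to_text, and the sample markup are all made up for illustration; the sample is not taken from the file in question, and a real converter such as html2text handles far more cases.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and turn <br> into a newline (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &aelig; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Invented sample markup, used only to exercise the helper.
sample = "<p>Impavidum ferient ruin&aelig;<br>--Horace.</p>"
plain = html_to_text(sample)
print(plain)
```

The resulting string could then be written out and passed to iconv, or transliterated in Python directly.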
user1132254
Updated on September 18, 2022
Comments
-
user1132254 over 1 year
I'm working on Matlab code that takes video frames and puts them inside a cell array, such as
frames{fr} = read(videoObj, fr);
For a sample video, I check the frames cell array: it contains 84 frames (the video has 84 frames), and every cell holds an image.
I have a mex file with the standard entry point:
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
From what I have read about creating mex files, the input prhs is const mxArray, so that whatever happens inside the mex code, the input does not change. I pass Matlab's frames cell array to the mex file as prhs[0] and do some processing (I swear I don't change what I receive in prhs[0]). However, as I debug the Matlab code, I see that right after I call the mex file (with frames as input), the last cell of my frames cell array seems to be deleted. When I look at the 84th cell, it is [], and sometimes it is a random number such as 8 or 4. I don't understand how or why the last column of my frames cell array disappears when I call the mex file.
You may think I have an error in my Matlab code, but when I debug, frames has 84 cells right before the mex call and the last cell is an image as expected; right after the call, frames still has 84 cells but the last cell is []: empty.
I would appreciate any help: why would such a thing happen? Does const mxArray not mean 'input will not be changed in the mex'?
-
vinc17 almost 10 years
No, the file is not encoded in UTF-8, but in ISO-8859-1. BTW, the file command says ASCII, but the reason is that it just looks at the beginning of the file, and the ISO-8859-1 character appears far away, at position 181536.
-
user13107 almost 10 years
@vinc17 how did you find out the file was in ISO-8859-1?
-
vinc17 almost 10 years
@user13107 By looking at the encoding of the offending character: it is the byte "e6", not the UTF-8 sequence "c3 a6". Emacs also detected the file as being in ISO-8859-1.
-
user13107 almost 10 years
How do I find out which byte corresponds to the offending character? I tried the hexdump -C file command and got
0002b220 72 75 69 6e e6 0a 20 2d 2d 20 48 6f 72 61 63 65 |ruin.. -- Horace|
as output.
-
vinc17 almost 10 years
In what you got, you can see that the only top-bit-set byte (a byte whose value is ≥ 80 in hexadecimal) is e6. This doesn't correspond to a valid UTF-8 sequence (in UTF-8, non-ASCII characters need at least 2 top-bit-set bytes). In ISO-8859-1, e6 is the encoding of the character "æ", which corresponds to the expected text; so, this confirms that the ISO-8859-1 encoding (or similar) is used for this file.
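The check described in this comment can be sketched in Python: list the offsets of top-bit-set bytes, then see whether the data decodes as UTF-8. The function names high_bit_offsets and is_valid_utf8 are hypothetical, and the first input is the 16 bytes from the hexdump line quoted in the comments, not the whole file.

```python
def high_bit_offsets(data: bytes) -> list:
    """Return the offsets of all bytes with the top bit set (value >= 0x80)."""
    return [i for i, b in enumerate(data) if b >= 0x80]


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The bytes from the hexdump line in the comments: a lone 0xe6 after "ruin".
latin1_line = b"ruin\xe6\n -- Horace"
# The same text as it would look if it really were UTF-8 (c3 a6 for the ligature).
utf8_line = "ruin\u00e6\n -- Horace".encode("utf-8")

for label, line in (("as found", latin1_line), ("if UTF-8", utf8_line)):
    print(label, high_bit_offsets(line), is_valid_utf8(line))
```

A single isolated offset, as in the first case, is the pattern the comment describes: one top-bit-set byte cannot form a UTF-8 sequence, whereas the UTF-8 form shows two adjacent ones.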