How to convert xml file which is in non UTF-8 format to xml that is UTF-8 compliant
Use the character set conversion tool iconv (it writes the converted text to standard output):
iconv -f ISO-8859-1 -t UTF-8 filename.txt
See gnu-page
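A minimal sketch of the conversion step, assuming the source really is ISO-8859-1 (a guess that has to be confirmed for your file). The helper names `to_utf8` and `is_valid_utf8` are made up for illustration. The check relies on iconv exiting with a non-zero status when the input contains a byte sequence that is invalid in the declared source encoding.

```shell
#!/bin/bash
# to_utf8 SRC_ENCODING INPUT OUTPUT (hypothetical helper)
# Converts INPUT from SRC_ENCODING to UTF-8 and writes the result to OUTPUT.
to_utf8() {
    local src_enc=$1 input=$2 output=$3
    iconv -f "$src_enc" -t UTF-8 "$input" > "$output"
}

# is_valid_utf8 FILE (hypothetical helper)
# Succeeds only if the whole file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example usage (file names are placeholders):
#   to_utf8 ISO-8859-1 vendor.xml vendor.utf8.xml
#   is_valid_utf8 vendor.utf8.xml && echo "looks like valid UTF-8"
```

Writing to a separate output file keeps the original around in case the guessed source encoding turns out to be wrong.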
...and in file http://standards.ieee.org/develop/regauth/oui/oui.txt "aglaia" (as in your example above) is reported as:
00-0B-91 (hex) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
000B91 (base 16) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
Tiniusstr. 12-15
Berlin D-13089
GERMANY
It seems like "ü" is the character that gets mangled.
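To see why a single "ü" can upset a UTF-8 parser, it can help to look at the raw bytes. The sketch below assumes the file is in a Latin-1 style encoding, where "ü" is the single byte 0xFC; in UTF-8 the same character is the two bytes 0xC3 0xBC. The helper name `hex_of` is made up for illustration.

```shell
#!/bin/bash
# hex_of (hypothetical helper): print stdin as a run of hex digits.
hex_of() {
    od -An -tx1 | tr -d ' \n'
}

# "ü" as a single ISO-8859-1 byte (octal 374 = hex fc)
printf '\374' | hex_of; echo

# the same character after conversion to UTF-8
printf '\374' | iconv -f ISO-8859-1 -t UTF-8 | hex_of; echo
```

A lone 0xFC byte is not a valid UTF-8 sequence, which is consistent with the parser rejecting the file.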
Update
When downloading "oui.txt" using wget, I see the character "ü" in the file. If you don't, something is broken in your download. Consider using one of these:
- try using
wget --header='Accept-Charset: utf-8'
- try using
curl -o oui.txt
instead
If none of the above works, just open the link in your favorite browser and do a "save as". In that case, comment out the wget
line in the script below.
I had success with the following script (update BEGIN & END to get a valid XML-file)
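#!/bin/bash
# Download the OUI list; -O overwrites oui.txt instead of creating oui.txt.1
wget -O oui.txt http://standards.ieee.org/develop/regauth/oui/oui.txt
# Convert the download to UTF-8 (source encoding assumed to be ISO-8859-15)
iconv -f iso-8859-15 -t utf-8 oui.txt > converted
awk 'BEGIN {
    print "HTML-header"
}
# Matches lines such as: 000B91 (base 16) Aglaia Gesellschaft ...
/base 16/ {
    printf("<vendor name=\"%s\">\n", $4)
    # Description: everything from the first word of the company name onwards
    desc = substr($0, index($0, $4))
    printf("<vendorOUI oui=\"%s\" description=\"%s\"/>\n", $1, desc)
    print "</vendor>"
}
END {
    print "HTML-footer"
}
' converted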
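One thing the script above does not handle: a vendor name containing a character such as "&" or a double quote would produce invalid XML once it lands inside an attribute value. A minimal sketch of an escaping filter follows, assuming such names can occur in the input; the function name `xml_escape` is made up for illustration.

```shell
#!/bin/bash
# xml_escape (hypothetical helper): escape XML special characters on stdin.
xml_escape() {
    awk '{
        # "&" must be handled first so the entities added below are not re-escaped.
        # In an awk replacement string a bare & means "the matched text",
        # so a literal ampersand is written as \\&.
        gsub(/&/, "\\&amp;")
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        gsub(/"/, "\\&quot;")
        print
    }'
}

# Example usage:
#   printf '%s\n' 'Smith & Sons "Ltd"' | xml_escape
```

The same four gsub calls could be applied to desc inside the awk block of the main script instead of running a separate filter.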
Hope this helps!
Nohsib
Updated on June 04, 2022
Comments
-
Nohsib almost 2 years
I have a huge XML file whose sample data is as follows:
<vendor name="aglaia">
<vendorOUI oui="000B91" description="Aglaia Gesellschaft für Bildverarbeitung ud Kommunikation m" />
</vendor>
<vendor name="ag">
<vendorOUI oui="0024A9" description="Ag Leader Technology" />
</vendor>
As can be seen, there is text such as "Gesellschaft für Bildverarbeitung" that is not UTF-8 compliant, because of which I am getting errors from the XML validator, such as:
Import failed: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
So the question is: how can I convert the XML file to a UTF-8 compliant format in a Linux environment? Or is there a way in bash to ensure, while creating the XML in the first place, that all variables/strings are stored in a UTF-8 compliant format?
-
Jim Garrison almost 13 years
That assumes the current codepage is ISO-8859-1, which it may not be.
-
Fredrik Pihl almost 13 years
@Nohsib what does
file filename.txt
output?
-
Nohsib almost 13 years
@Fredrik: I have a file vendor.xml whose encoding I am not sure about, which has to be converted to a UTF-8 compliant file, so my usage as per your advice is iconv -f ISO-8859-1 -t UTF-8 vendor.xml. Hope that answers what you are asking.
-
Nohsib almost 13 years
The encoding is whatever is used in standards.ieee.org/develop/regauth/oui/oui.txt, because that is my basic input file. So can we find out the encoding used there?
-
gatkin almost 13 years
You can't convert it to UTF-8 unless you know what encoding the file is in now. You need to get ahold of the people or program that generated it and find out what encoding was in effect. If it really is 8859-1, fine. If you just guess that it's 8859-1 and you guess wrong, you're making a bigger mess.
-
Nohsib almost 13 years
The file is posted at standards.ieee.org/develop/regauth/oui/oui.txt, so what are my options? I doubt they would reply, but I have mailed them nonetheless. Is there an alternative?
-
Fredrik Pihl almost 13 years
file
reports oui.txt as ASCII
-
Nohsib almost 13 years
I am parsing the oui.txt file with a bash script to generate the XML, and the XML I am getting is:
<vendor name="aglaia">
<vendorOUI oui="000B91" description="Aglaia Gesellschaft für Bildverarbeitung ud Kommunikation m" />
</vendor>
<vendor name="ag">
<vendorOUI oui="0024A9" description="Ag Leader Technology" />
</vendor>
-
Fredrik Pihl almost 13 years
Type
locale
at your prompt and post the output here (it should be UTF-8). It should be trivial to create the XML file from oui.txt using Python. I can do it tomorrow if you haven't solved it by then. Well past midnight now and my wife hates me when doing SO late ;-)
-
Nohsib almost 13 years
This is what it looks like after I download the file using wget:
00-0B-91 (hex) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
000B91 (base 16) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
Tiniusstr. 12-15
Berlin D-13089
GERMANY
-
Nohsib almost 13 years
locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Thanks for the help!!! Hope you can give me a breakthrough tomorrow with a bash script.
-
Nohsib almost 13 years
Thanks a bunch, Fredrik, for going out of your way to help!!! Sincere gratitude for your support. I modified the XML parser to take care of it, so thank you once again.