How to convert an XML file that is not UTF-8 encoded into UTF-8-compliant XML

Use the character set conversion tool:

iconv -f ISO-8859-1 -t UTF-8 filename.txt

See the GNU iconv page.
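
As a minimal sketch of the command above: iconv writes the converted text to standard output, so the result normally has to be redirected into a new file. The helper name to_utf8 and the sample file names are made up for illustration, and ISO-8859-1 as the source charset is an assumption that must match the real file.

```shell
#!/bin/bash
# to_utf8 is a hypothetical helper: convert INFILE from charset FROM to UTF-8.
# Usage: to_utf8 FROM INFILE OUTFILE
to_utf8() {
    # iconv prints to stdout; write to a separate file, never onto the input.
    iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Demo input: byte 0xFC (octal 374) is "ü" in ISO-8859-1.
printf 'Gesellschaft f\374r Bildverarbeitung\n' > sample-latin1.txt
to_utf8 ISO-8859-1 sample-latin1.txt sample-utf8.txt
```

Redirecting to the same name as the input would truncate the file before iconv reads it, which is why the sketch takes a separate output path.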

...and in the file http://standards.ieee.org/develop/regauth/oui/oui.txt, "aglaia" (as in your example above) is reported as:

00-0B-91   (hex)            Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
000B91     (base 16)        Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m
                            Tiniusstr. 12-15
                            Berlin  D-13089
                            GERMANY

It seems that "ü" is the character that gets mangled.
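
To see why this one character causes trouble, it can help to look at the raw bytes. In ISO-8859-1, "ü" is the single byte 0xFC, while UTF-8 encodes it as the two bytes 0xC3 0xBC; a lone 0xFC is never valid in UTF-8. The helper hex_bytes below is hypothetical, added only for this illustration.

```shell
#!/bin/bash
# hex_bytes is a hypothetical helper: print stdin as one plain hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# "für" as ISO-8859-1: the middle byte is 0xFC (octal 374).
printf 'f\374r' | hex_bytes

# The same text after conversion: 0xFC becomes the UTF-8 pair 0xC3 0xBC.
printf 'f\374r' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes
```

The first call prints 66fc72 and the second 66c3bc72, so the converted text is one byte longer.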

Update

When I download "oui.txt" using wget, I see the character "ü" in the file. If you don't, something is broken in your download. Consider using one of these:

  • wget --header='Accept-Charset: utf-8'
  • try using curl -o oui.txt instead

If none of the above works, just open the link in your favorite browser and do a "save as". In that case, comment out the wget line in the script below.
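
Whichever way the file is fetched, it may be worth checking whether it is already valid UTF-8 before converting, since converting a UTF-8 file "from" ISO-8859-1 would encode the special characters a second time. The following is a sketch only; is_utf8 and ensure_utf8 are hypothetical helper names, and the source charset passed in remains an assumption you have to supply. Note that pure ASCII input also counts as valid UTF-8.

```shell
#!/bin/bash
# is_utf8 is a hypothetical helper: succeed if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# ensure_utf8 (also hypothetical): copy INFILE unchanged when it is already
# UTF-8, otherwise convert it from the charset FROM (an assumption).
# Usage: ensure_utf8 FROM INFILE OUTFILE
ensure_utf8() {
    if is_utf8 "$2"; then
        cp "$2" "$3"
    else
        iconv -f "$1" -t UTF-8 "$2" > "$3"
    fi
}
```

For example, ensure_utf8 ISO-8859-15 oui.txt converted would only run the conversion when oui.txt fails the UTF-8 check.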

I had success with the following script (update the BEGIN and END blocks to get a valid XML file):

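#!/bin/bash

# Download the OUI list and convert it from ISO-8859-15 to UTF-8.
wget http://standards.ieee.org/develop/regauth/oui/oui.txt
iconv -f iso-8859-15 -t utf-8 oui.txt > converted

awk 'BEGIN {
         print "HTML-header"    # placeholder: XML declaration and root start tag go here
     }

     # Matches lines such as: 000B91     (base 16)        Aglaia Gesellschaft ...
     /base 16/ {
         printf("<vendor name=\"%s\">\n", $4)
         # The description is the rest of the line, starting at the first word of the name.
         desc = substr($0, index($0, $4))
         printf("<vendorOUI oui=\"%s\" description=\"%s\"/>\n", $1, desc)
         print "</vendor>"      # close the element so the output is well formed
     }
     END {
         print "HTML-footer"    # placeholder: root end tag goes here
     }
    ' converted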
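
The script above writes company names straight into attribute values. Names can contain characters such as "&", "<" or double quotes, which have to be escaped for the result to be well-formed XML. A minimal sketch of such escaping follows; xml_escape is a hypothetical helper name, not part of the original script.

```shell
#!/bin/bash
# xml_escape is a hypothetical helper: escape XML special characters
# in every line read from stdin.
xml_escape() {
    awk '{
        # "&" must be handled first so the entities added below stay intact.
        # In awk, "\\&" in the replacement yields a literal ampersand.
        gsub(/&/, "\\&amp;")
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        gsub(/"/, "\\&quot;")
        print
    }'
}

printf 'AT&T "x" <y>\n' | xml_escape
```

The same gsub calls could be applied to the desc variable inside the awk program, using the three-argument form such as gsub(/&/, "\\&amp;", desc).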

Hope this helps!

Author by

Nohsib

Updated on June 04, 2022

Comments

  • Nohsib
    Nohsib almost 2 years

    I have a huge XML file; a sample of the data is as follows:

     <vendor name="aglaia">
          <vendorOUI oui="000B91" description="Aglaia Gesellschaft für Bildverarbeitung ud Kommunikation m" />
     </vendor>
     <vendor name="ag">
          <vendorOUI oui="0024A9" description="Ag Leader Technology" />
     </vendor>
    

    As can be seen, there is text such as "Gesellschaft für Bildverarbeitung" that is not UTF-8 compliant, because of which I am getting errors from the XML validator, such as:

    Import failed:
    com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    

    So the question is: how can I convert the XML file to a UTF-8-compliant format in a Linux environment? Or is there a way in bash, while creating the XML in the first place, to ensure that all variables/strings are stored in a UTF-8-compliant format?

  • Jim Garrison
    Jim Garrison almost 13 years
    That assumes the current codepage is ISO-8859-1, which it may not be.
  • Fredrik Pihl
    Fredrik Pihl almost 13 years
    @Nohsib what does file filename.txt output?
  • Nohsib
    Nohsib almost 13 years
    @Fredrik: I have a file vendor.xml whose encoding I am not sure about, and it has to be converted to a UTF-8-compliant file. So my usage as per your advice is iconv -f ISO-8859-1 -t UTF-8 vendor.xml. Hope that answers what you are asking.
  • Nohsib
    Nohsib almost 13 years
    The encoding is the one used in standards.ieee.org/develop/regauth/oui/oui.txt, because that is my basic input file. So can we find out the encoding used there?
  • gatkin
    gatkin almost 13 years
    You can't convert it to UTF-8 unless you know what encoding the file is in now. You need to get hold of the people or program that generated it and find out what encoding was in effect. If it really is 8859-1, fine. If you just guess that it's 8859-1 and you guess wrong, you're making a bigger mess.
  • Nohsib
    Nohsib almost 13 years
    The file is posted at standards.ieee.org/develop/regauth/oui/oui.txt, so what are my options? I doubt they would reply, but I have mailed them nonetheless. Is there an alternative?
  • Fredrik Pihl
    Fredrik Pihl almost 13 years
    file reports oui.txt as ASCII
  • Nohsib
    Nohsib almost 13 years
    I am parsing the oui.txt file with a bash script to generate the XML, and the XML I get is: <vendor name="aglaia"> <vendorOUI oui="000B91" description="Aglaia Gesellschaft für Bildverarbeitung ud Kommunikation m" /> </vendor> <vendor name="ag"> <vendorOUI oui="0024A9" description="Ag Leader Technology" /> </vendor>
  • Fredrik Pihl
    Fredrik Pihl almost 13 years
    Type locale at your prompt and post the output here (it should be UTF-8). It should be trivial to create the XML file from oui.txt using Python. I can do it tomorrow if you haven't solved it by then. It's well past midnight now and my wife hates me when I'm doing SO late ;-)
  • Nohsib
    Nohsib almost 13 years
    This is what it looks like after I download the file using wget: 00-0B-91 (hex) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m 000B91 (base 16) Aglaia Gesellschaft für Bildverarbeitung und Kommunikation m Tiniusstr. 12-15 Berlin D-13089 GERMANY
  • Nohsib
    Nohsib almost 13 years
    locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL= Thanks for the help!!! Hope you can provide a breakthrough tomorrow with a bash script.
  • Nohsib
    Nohsib almost 13 years
    Thanks a bunch, Fredrik, for going out of your way to help!!! Sincere gratitude for your support. I modified the XML parser to take care of it, so thank you once again.