Convert ISO-8859-1 to UTF-8 using Groovy


Solution 1

def f=new File('c:/data/myiso88591.xml').getText('ISO-8859-1')
new File('c:/data/myutf8.xml').write(f,'utf-8')

(I just gave it a try, it works :-)

Same as in Java: the libraries do the conversion for you. As deceze said: when you specify an encoding while reading, the bytes are converted to the internal string format (UTF-16, afaik). When you specify another encoding when you write the string, it is converted to that encoding.
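A minimal sketch of that two-step conversion on a single character, using only the JDK's standard charsets. The variable names are illustrative and not part of the original answer:

```groovy
// 'Ü' (U+00DC) is one byte in ISO-8859-1 (0xDC) and two bytes in UTF-8
byte[] isoBytes = [0xDC] as byte[]

// Decoding: bytes + source charset -> String (held internally as UTF-16)
String text = new String(isoBytes, 'ISO-8859-1')

// Encoding: String + target charset -> bytes
byte[] utf8Bytes = text.getBytes('UTF-8')

// Print the resulting bytes as hex to see that the byte sequence changed
println utf8Bytes.collect { String.format('%02X', it & 0xFF) }.join(' ')
```

The string itself never "has" an encoding; only the byte arrays on either side of it do.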

But if you work with XML, you shouldn't have to worry about the encoding anyway, because the XML parser takes care of it. It reads the first characters (<?xml) to determine the basic encoding family; after that, it can read the encoding declared in your XML header and use it.
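Because the parser trusts the declared encoding, a file rewritten as UTF-8 should not keep declaring ISO-8859-1. The following is a rough sketch of one way to update the declaration before writing. The helper name declareUtf8 and the regular expression are assumptions for illustration, not part of the original answer, and the regex only covers a simple declaration on the first line:

```groovy
// Hypothetical helper: rewrite the encoding named in the XML declaration, if there is one
String declareUtf8(String xml) {
    // Group 1: everything up to the opening quote; group 2: the closing quote
    xml.replaceFirst(/(<\?xml[^>]*encoding\s*=\s*["'])[^"']*(["'])/, '$1UTF-8$2')
}

def source = '<?xml version="1.0" encoding="ISO-8859-1" ?>\n<HelloEncodingWorld>Test</HelloEncodingWorld>'
def converted = declareUtf8(source)
println converted
```

With Solution 1, this would mean writing declareUtf8(f) instead of f. Text without a declaration is returned unchanged.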

Solution 2

Making it a little more Groovy, and not requiring the whole file to fit in memory, you can use the readers and writers to stream the file. This was my solution when I had files too big for plain old Unix iconv(1).

new FileOutputStream('out.txt').withWriter('UTF-8') { writer ->
    new FileInputStream('in.txt').withReader('ISO-8859-1') { reader ->
        writer << reader
    }
}
Author by

Booyeoo

Updated on September 07, 2020

Comments

  • Booyeoo
    Booyeoo over 3 years

    I need to convert an ISO-8859-1 file to UTF-8 encoding without losing any content...

    I have a file which looks like this:

    <?xml version="1.0" encoding="ISO-8859-1" ?> 
    <HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>
    

    Now I want to encode it into UTF-8. I tried the following:

    f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
    ts=new String(f.getBytes("UTF-8"), "UTF-8")
    g=new File('c:/temp/myutf8.xml').write(ts)
    

    This didn't work due to String incompatibilities. Then I read something about byte stream readers/writers, StreamingMarkupBuilder and others...

    Then I tried:

    f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
    mb = new groovy.xml.StreamingMarkupBuilder()
    mb.encoding = "UTF-8"
    
    new OutputStreamWriter(new FileOutputStream('c:/temp/myutf8.xml'),'utf-8') << mb.bind {
        mkp.xmlDeclaration()
        out << f
    }
    

    This was not at all what I wanted.

    I just want to read the content of an XML file with an ISO-8859-1 reader and then write it into a new (or the old) file. Why is this so complicated? :-/

    The result should just be the following, and the file should really be encoded in UTF-8:

    <?xml version="1.0" encoding="UTF-8" ?> 
    <HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>
    

    Thanks for any answers Cheers

  • user772401
    user772401 over 12 years
    <?xml? Isn't that the same in UTF-8 and ASCII and others? :)
  • rdmueller
    rdmueller over 12 years
    Some UTF encodings start the file with a BOM. And in some encodings, like EBCDIC, the <?xml characters are not the same. See w3.org/TR/xml/#sec-guessing for details. It's very interesting and a good reason not to write your own code to guess the encoding.
  • Booyeoo
    Booyeoo over 12 years
    Sorry, but it is not right that it really works. It is stored like this: <?xml version="1.0" encoding="ISO-8859-1" ?> <HelloEncodingWorld>ÃöäüÃÃà Test!!!</HelloEncodingWorld> and the encoding shown is still ISO-8859-1 (using Notepad++). Maybe the first line forces the editor to show it as... ahh, OK, that was the case. OMG, I already tried this way so often but never realized that the data was encoded in UTF-8 but shown as ANSI. Thanks a lot.
  • David
    David about 11 years
    Very groovy solution. I like the way you use the withReader/Writer. I tried it out myself and it worked great :)