Unmarshal an ISO-8859-1 XML input in Go

Solution 1

Updated answer for 2015 & beyond:

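import (
    "bytes"
    "encoding/xml"

    "golang.org/x/net/html/charset"
)

// theXml is the raw, non-UTF-8 XML as a []byte; parsed is the target struct.
reader := bytes.NewReader(theXml)
decoder := xml.NewDecoder(reader)
// NewReaderLabel converts the encoding named in the XML declaration to UTF-8.
decoder.CharsetReader = charset.NewReaderLabel
err = decoder.Decode(&parsed)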

Solution 2

Expanding on @anschel-schaffer-cohen's suggestion and @mjibson's comment: with the go-charset package mentioned above, you can use these three lines

decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReader
err = decoder.Decode(&parsed)

to achieve the required result. Just remember to tell charset where its data files are by setting

charset.CharsetDir = ".../src/code.google.com/p/go-charset/datafiles"

at some point when the app starts up.

EDIT

Instead of setting charset.CharsetDir as above, it's more sensible to just import the data files. They are treated as an embedded resource:

import (
    "code.google.com/p/go-charset/charset"
    _ "code.google.com/p/go-charset/data"
    ...
)

go install will just do its thing, and this also avoids the deployment headache (where and how do I find the data files relative to the executing app?).

Importing a package with an underscore runs only the package's init() func, which loads the required data into memory.

Solution 3

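Here's a sample Go program which uses a CharsetReader function to convert XML input from ISO-8859-1 to UTF-8. It is written against the pre-Go 1 standard library (os.Error, the "utf8" and "xml" import paths, xml.NewParser). The program prints the XML comments and processing instructions from the test file.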

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"
    "strings"
    "utf8"
    "xml"
)

type CharsetISO88591er struct {
    r   io.ByteReader
    buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
    buf := bytes.NewBuffer(make([]byte, 0, utf8.UTFMax))
    return &CharsetISO88591er{r.(io.ByteReader), buf}
}

func (cs *CharsetISO88591er) ReadByte() (b byte, err os.Error) {
    // http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
    // Date: 1999 July 27; Last modified: 27-Feb-2001 05:08
    if cs.buf.Len() <= 0 {
        r, err := cs.r.ReadByte()
        if err != nil {
            return 0, err
        }
        if r < utf8.RuneSelf {
            return r, nil
        }
        cs.buf.WriteRune(int(r))
    }
    return cs.buf.ReadByte()
}

func (cs *CharsetISO88591er) Read(p []byte) (int, os.Error) {
    // Use ReadByte method.
    return 0, os.EINVAL
}

func isCharset(charset string, names []string) bool {
    charset = strings.ToLower(charset)
    for _, n := range names {
        if charset == strings.ToLower(n) {
            return true
        }
    }
    return false
}

func IsCharsetISO88591(charset string) bool {
    // http://www.iana.org/assignments/character-sets
    // (last updated 2010-11-04)
    names := []string{
        // Name
        "ISO_8859-1:1987",
        // Alias (preferred MIME name)
        "ISO-8859-1",
        // Aliases
        "iso-ir-100",
        "ISO_8859-1",
        "latin1",
        "l1",
        "IBM819",
        "CP819",
        "csISOLatin1",
    }
    return isCharset(charset, names)
}

func IsCharsetUTF8(charset string) bool {
    names := []string{
        "UTF-8",
        // Default
        "",
    }
    return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, os.Error) {
    switch {
    case IsCharsetUTF8(charset):
        return input, nil
    case IsCharsetISO88591(charset):
        return NewCharsetISO88591(input), nil
    }
    return nil, os.NewError("CharsetReader: unexpected charset: " + charset)
}

func main() {
    // Print the XML comments from the test file, which should
    // contain most of the printable ISO-8859-1 characters.
    r, err := os.Open("ISO88591.xml")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer r.Close()
    fmt.Println("file:", r.Name())
    p := xml.NewParser(r)
    p.CharsetReader = CharsetReader
    for t, err := p.Token(); t != nil && err == nil; t, err = p.Token() {
        switch t := t.(type) {
        case xml.ProcInst:
            fmt.Println(t.Target, string(t.Inst))
        case xml.Comment:
            fmt.Println(string([]byte(t)))
        }
    }
}

To unmarshal XML with encoding="ISO-8859-1" from an io.Reader r into a structure result, while using the CharsetReader function from the program to translate from ISO-8859-1 to UTF-8, write:

p := xml.NewParser(r)
p.CharsetReader = CharsetReader
err := p.Unmarshal(&result, nil)
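
On current Go releases, where xml.NewParser and os.Error no longer exist, the same token loop can be sketched with xml.NewDecoder. The version below is a simplified illustration under a few assumptions: latin1ToUTF8 and comments are hypothetical names, only the label "ISO-8859-1" is accepted, and the converter reads the remaining input into memory instead of streaming it.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1ToUTF8 is a hypothetical CharsetReader. It reads the remaining input
// into memory and re-encodes it, which is simple but not streaming.
func latin1ToUTF8(label string, input io.Reader) (io.Reader, error) {
	if !strings.EqualFold(label, "ISO-8859-1") {
		return nil, fmt.Errorf("unexpected charset: %s", label)
	}
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b) // each ISO-8859-1 byte equals its Unicode code point
	}
	return strings.NewReader(string(runes)), nil
}

// comments returns the text of every XML comment found in data.
func comments(data []byte) ([]string, error) {
	d := xml.NewDecoder(bytes.NewReader(data))
	d.CharsetReader = latin1ToUTF8
	var out []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		if c, ok := tok.(xml.Comment); ok {
			out = append(out, string(c)) // copy: token bytes are reused by the decoder
		}
	}
}

func main() {
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><!-- caf\xe9 --><doc/>")
	fmt.Println(comments(data))
}
```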

Solution 4

There appears to be an external library which handles this: go-charset. I haven't tried it myself; does it work for you?

Solution 5

Edit: do not use this, use the go-charset answer.

Here's an updated version of @peterSO's code that works with go1:

package main

import (
    "bytes"
    "io"
    "strings"
)

type CharsetISO88591er struct {
    r   io.ByteReader
    buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
    buf := bytes.Buffer{}
    return &CharsetISO88591er{r.(io.ByteReader), &buf}
}

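func (cs *CharsetISO88591er) Read(p []byte) (n int, err error) {
    // Each ISO-8859-1 byte equals its Unicode code point, so WriteRune
    // re-encodes it as one or two UTF-8 bytes.
    for range p {
        b, rerr := cs.r.ReadByte()
        if rerr != nil {
            // Report the underlying error once the buffer is drained.
            if cs.buf.Len() == 0 {
                return 0, rerr
            }
            break
        }
        cs.buf.WriteRune(rune(b))
    }
    return cs.buf.Read(p)
}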

func isCharset(charset string, names []string) bool {
    charset = strings.ToLower(charset)
    for _, n := range names {
        if charset == strings.ToLower(n) {
            return true
        }
    }
    return false
}

func IsCharsetISO88591(charset string) bool {
    // http://www.iana.org/assignments/character-sets
    // (last updated 2010-11-04)
    names := []string{
        // Name
        "ISO_8859-1:1987",
        // Alias (preferred MIME name)
        "ISO-8859-1",
        // Aliases
        "iso-ir-100",
        "ISO_8859-1",
        "latin1",
        "l1",
        "IBM819",
        "CP819",
        "csISOLatin1",
    }
    return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, error) {
    if IsCharsetISO88591(charset) {
        return NewCharsetISO88591(input), nil
    }
    return input, nil
}

Called with:

d := xml.NewDecoder(reader)
d.CharsetReader = CharsetReader
err := d.Decode(&dst)
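
For experimenting without any external package, a self-contained variant of the same idea might look like the sketch below. The names latin1Reader, charsetReader, person, and decodeLatin1 are invented for this example, and only two labels are recognised; for real use, the library-based answers remain the better choice.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1Reader is a hypothetical helper that converts ISO-8859-1 bytes to
// UTF-8 as they are read from src.
type latin1Reader struct {
	src io.Reader
	buf bytes.Buffer // UTF-8 bytes not yet handed to the caller
}

func (l *latin1Reader) Read(p []byte) (int, error) {
	if l.buf.Len() == 0 {
		in := make([]byte, len(p))
		n, err := l.src.Read(in)
		if n == 0 {
			return 0, err
		}
		for _, b := range in[:n] {
			l.buf.WriteRune(rune(b)) // byte value == Unicode code point
		}
	}
	return l.buf.Read(p)
}

// charsetReader has the signature expected by xml.Decoder.CharsetReader.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(label) {
	case "iso-8859-1", "latin1":
		return &latin1Reader{src: input}, nil
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// person is a made-up target type for the sample document.
type person struct {
	Name string `xml:"name"`
}

func decodeLatin1(data []byte) (person, error) {
	var p person
	d := xml.NewDecoder(bytes.NewReader(data))
	d.CharsetReader = charsetReader
	err := d.Decode(&p)
	return p, err
}

func main() {
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" +
		"<person><name>S\xe9guret</name></person>")
	p, err := decodeLatin1(data)
	fmt.Println(p.Name, err)
}
```

Unlike the code above, this variant wraps a plain io.Reader, so it does not depend on the input also implementing io.ByteReader.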

Author: Denys Séguret

Author of several popular open-source programs and libraries. Also known as dystroy @ Miaou or Canop @ GitHub. I'm also available as a freelance programmer and consultant to solve your problems or design your next system. My current focus is Rust but I have a wide full-stack experience. Contact information on https://dystroy.org

Updated on September 09, 2020

Comments

  • Denys Séguret
    Denys Séguret over 3 years

    When your XML input isn't encoded in UTF-8, the Unmarshal function of the xml package seems to require a CharsetReader.

    Where do you find such a thing?

    • Denys Séguret
      Denys Séguret about 6 years
      The best answer to this common problem changes as Go changes. I have already moved the "accept" mark twice so that people don't use an obsolete solution.
  • Denys Séguret
    Denys Séguret almost 13 years
    I saw the example in the tests; indeed, it's not exactly useful. In fact, I failed to understand how this CharsetReader should work.
  • Denys Séguret
    Denys Séguret almost 13 years
    I'll try, thanks. For now I have just made a trivial CharsetReader that works only for the ASCII subset, and I'll try a complete translation as soon as I've resolved my other issues. But I'm surprised that Go seems to assume that the whole world uses UTF-8 today.
  • Anschel Schaffer-Cohen
    Anschel Schaffer-Cohen almost 13 years
    I think it's more a matter of considering UTF-8 to be the best internal representation for Unicode, and not having finished the libraries yet.
  • peterSO
    peterSO almost 13 years
    @dystroy: To protect yourself and your users, establish an audit trail by including a clear acknowledgement of the source (Stack Overflow) and the author (peterSO) of the code. Include a full link to this question or my answer. Include a full link to my Stack Overflow user page. I'm glad you found the code useful.
  • offby1
    offby1 about 11 years
    Yikes. Is this still the best answer? Everyone who wants to read ISO-8859 has to jump through these hoops?
  • mjibson
    mjibson almost 11 years
    Using go-charset is actually the best answer, so I'd rather peterSO's answer remain wrong to point people in that direction. (I only discovered go-charset after porting his code.)
  • Denys Séguret
    Denys Séguret over 10 years
    Have an upvote, but it's disappointing to need external resources for an operation as simple as this conversion.
  • Jonno
    Jonno over 10 years
    Agreed. It's probably just "still early days for Go". I'm pretty happy with "go get", though, as it makes the whole thing pretty close to painless.
  • chakrit
    chakrit about 10 years
    This should be packaged into a library so that other people can just drop it into the XML parser.
  • eatingthenight
    eatingthenight about 9 years
    This answer is a little outdated. It uses many packages that have since been restructured, and it will not work in Go 1.3. I have an answer below that works with newer versions of Go and also handles many charsets that this answer does not. It is still worth understanding how this one is implemented.
  • Denys Séguret
    Denys Séguret about 9 years
    @peterSO Your answer was the best one when I asked in 2011 and was helpful to me and other developers. But we're now in 2015 and developers are still coming to this question, looking for a solution. I've been asked many times to accept the solution which, today, is the best one, and I'll do it now. Sorry for having to unaccept your answer.
  • Denys Séguret
    Denys Séguret about 9 years
    @mschuett In 2011, peterSO's answer was the best one, and so I accepted it. You should have a look at all the answers before starting to code. Anyway, I'll accept this one now to avoid confusion.
  • eatingthenight
    eatingthenight about 9 years
    @dystroy Sorry, I posted this before I read the comment up top and just forgot to delete it.