script to translate xml to json

ubuntu text-processing scripting xml json

10,403

Solution 1

in fact, you could get away here even w/o Python programming, just using 2 unix utilities:

jtm - that one allows xml <-> json lossless conversion
jtc - that one allows manipulating JSONs

Thus, assuming your xml is in file.xml, jtm would convert it to the following json:

bash $ jtm file.xml 
[
   {
      "quiz": [
         {
            "que": "The question her"
         },
         {
            "ca": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         }
      ]
   }
]
bash $

and then, applying a series of JSON transformations, you can arrive to a desired result:

bash $ jtm file.xml | jtc -w'<quiz>l:[1:][-2]' -ei echo { '"answer[-]"': {} }\; -i'<quiz>l:[1:]' | jtc -w'<quiz>l:[-1][:][0]' -w'<quiz>l:[-1][:]' -s | jtc -w'<quiz>l:' -w'<quiz>l:[0]' -s | jtc -w'<quiz>l: <>v' -u'"text"'
[
   {
      "answer1": "text",
      "answer2": "text",
      "answer3": "text",
      "answer4": "text",
      "text": "The question her"
   }
]
bash $

Though, due to involved shell scripting (echo command), it'll be slower than Python's - for 5000 questions, I'd expect it would run around a minute. (In the future version of the jtc I plan to allow interpolations even in statically specified JSONs, so that for templating no external shell-scripting would be required, then the operations will be blazing fast)

if you're curious about jtc syntax, you could find a user guide here: https://github.com/ldn-softdev/jtc/blob/master/User%20Guide.md

Solution 2

I assume your Ubuntu has python installed

#!/usr/bin/python3
import io
import json
import xml.etree.ElementTree

d = """<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
"""

s = io.StringIO(d)
# root = xml.etree.ElementTree.parse("filename_here").getroot()
root = xml.etree.ElementTree.parse(s).getroot()
out = {}
i = 1
for child in root:
    name, value = child.tag, child.text
    if name == 'que':
        name = 'question'
    else:
        name = 'answer%s' % i
        i += 1
    out[name] = value

print(json.dumps(out))

save it and chmod to executable you can easily modify to take a file as input instead of just text

EDIT Okey, this is a more complete script:

#!/usr/bin/python3
import json
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assule we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
    print(json.dumps(out))


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")

I made a file with following, which I named xmltest:

<questions>
    <quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
     <quiz>
            <que>Question number 1</que>
            <ca>blabla</ca>
            <ia>stuff</ia>
    </quiz>
</questions>

So you have a list of quiz inside some other container.

Now, I launch it like this: $ chmod u+x scratch.py, then scratch.py filenamewithxml

This gives me the answer:

$ ./scratch4.py xmltest
[{"answer3": "text", "answer2": "text", "question": "The question her", "answer4": "text", "answer1": "text"}, {"answer2": "stuff", "question": "Question number 1", "answer1": "blabla"}]

Solution 3

The xq tool from https://kislyuk.github.io/yq/ turns your XML into

{
  "quiz": {
    "que": "The question her",
    "ca": "text",
    "ia": [
      "text",
      "text",
      "text"
    ]
  }
}

by just using the identity filter (xq . file.xml).

We can massage this into something closer the form you want using

xq '.quiz | { text: .que, answers: .ia }' file.xml

which outputs

{
  "text": "The question her",
  "answers": [
    "text",
    "text",
    "text"
  ]
}

To fix up the answers bit so that you get your enumerated keys:

xq '.quiz |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    )' file.xml

This adds the enumerated answer keys with the values from the ia nodes by iterating over the ia nodes and manually producing a set of keys and values. These are then turned into real key-value pairs using from_entries and added to the proto-object we've created ({ text: .que }).

The output:

{
  "text": "The question her",
  "answer1": "text",
  "answer2": "text",
  "answer3": "text"
}

If your XML document contains multiple quiz nodes under some root node, then change .quiz in the jq expression above to .[].quiz[] to transform each of them, and you may want to put the resulting objects into an array:

xq '.[].quiz[] |
    [ { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    ) ]' file.xml

Solution 4

Here ruby answer is.

script xml2json.rb:

require 'nokogiri'
require 'json'
    
in_doc = File.open(ARGV[0]) do |f|
  Nokogiri::XML(f)
end

for quiz in in_doc.xpath('/contents/quiz') do
  result = {
    text:     quiz.xpath('que').text,
    answer1:  quiz.xpath('ca').text
  }
  quiz.xpath('ia').each_with_index do |ia, ix|
    result["answer#{ix + 2}"] = ia.text
  end
  print result.to_json, ",\n"
end

input quiz.xml:

<contents>
  <quiz>
    <que>The question her</que>
    <ca>text a1</ca>
    <ia>text a2</ia>
    <ia>text a3</ia>
    <ia>text a4</ia>
  </quiz>
  <quiz>
    <que>The question foo</que>
    <ca>text b1</ca>
    <ia>text b2</ia>
    <ia>text b3</ia>
    <ia>text b4</ia>
  </quiz>
</contents>

execution:

$ ./xml2json.rb quiz.xml

View more solutions

10,403

Silver

Updated on September 18, 2022

Comments

Silver over 1 year
I have 5000 questions in txt file like this:
```
<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
```
I want to write a script in Ubuntu to convert all questions like this:
```
  {
   "text":"The question her",
   "answer1":"text",
   "answer2":"text",
   "answer3":"text",
   "answer4":"text"
  },
```
- dgan about 5 years
  
  that's better be done in python perl etc.. than in pure bash. so OS is irrelevant, it's more stack overflow question
- Silver about 5 years
  
  yes it is - Ark
- Stéphane Chazelas about 5 years
  
  What kind of assumptions can you make on the input? Are questions and answers always on separate lines, are the <quiz> and </quiz> tags always on their own on a line. Could the text have some encoding (CDATA, {, &eacute....), could it contain double quotes or backslashes? Is the XML file encoded in UTF-8?
- Silver about 5 years
  
  patch = small file contain code and work by double click in ubuntu.
- Silver about 5 years
  
  sorry I mean script not patch- thank you Ark.
- roaima about 5 years
  
  If it's XML and you have 5000 questions, there must be a higher level element in the source (<quizzes/>, for example?) and a corresponding [ ... ] in the output JSON. Please show that in your question.
Stéphane Chazelas about 5 years

That only works if the input has just one <quiz>
Victor Yarema about 4 years

Your solution looks really good. I didn't have time to check your concern regarding performance. But I'm curious: isn't it possible to replace multiple echo invocations with piping through sed?