Regex with sed command to parse json text


Solution 1

Do not parse complex nested data structures like JSON or XML with regular expressions. Use a proper JSON parser instead, such as jshon.

First you need to install it:

sudo apt-get install jshon

Then you have to provide it the JSON data to parse via standard input, so you can either redirect another command's output there with a pipe (|) or redirect a file to it (< filename).

The arguments it needs to extract the data you want look like this:

jshon -e "buildStatus" -e "status" -u
  • -e "buildStatus" picks the element with the "buildStatus" index from the top level dictionary.
  • -e "status" picks the element with the "status" index from the second level dictionary picked above.
  • -u converts the selected data from JSON to plain data (i.e. here it removes the quotes around the string)

So the command you run, depending on where you get the data from, looks like one of these:

jshon -e "buildStatus" -e "status" -u < YOUR_INPUT_FILE
YOUR_JSON_PRODUCING_COMMAND | jshon -e "buildStatus" -e "status" -u
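
As a rough illustration of what the chained -e arguments do, the same walk can be sketched in Python. The helper name pick and the inline sample document are assumptions made for this sketch; they are not part of jshon.

```python
import json

# Trimmed-down sample in the shape used in the question (an assumption for this sketch).
SAMPLE = '{"buildStatus": {"status": "ERROR", "periods": []}}'

def pick(data, *keys):
    """Descend one level per key, like one -e argument per level."""
    for key in keys:
        data = data[key]  # raises KeyError if a level is missing
    return data

doc = json.loads(SAMPLE)
status = pick(doc, "buildStatus", "status")
print(status)  # a Python str prints without JSON quotes, similar in effect to -u
```

Each key narrows the selection by one level, which is why the order of the -e arguments matters.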

To learn more about jshon, read its man page, which is available online or by typing man jshon.

Solution 2

This is a job for jq:

jq -r '.["buildStatus"]["status"]' file.json

Can be shortened to:

jq -r '.buildStatus.status' file.json

-r (--raw-output) prints the string without JSON string formatting, i.e. without the surrounding quotes.

Example:

% cat file.json                   
{
    "buildStatus" : {
        "status" : "ERROR",
        "conditions" : [{
                "status" : "OK",
                "metricKey" : "bugs"
            }, {
                "status" : "ERROR",
                "metricKey" : "test_success_density"
            }, {
                "status" : "OK",
                "metricKey" : "vulnerabilities"
            }
        ],
        "periods" : []
    }
}

% jq -r '.["buildStatus"]["status"]' file.json
ERROR

% jq -r '.buildStatus.status' file.json       
ERROR
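
To make the difference between raw and JSON-formatted output concrete, here is a minimal Python sketch of the two forms. It only mirrors the idea behind -r; the variable names and the shortened inline document are assumptions for illustration.

```python
import json

# Shortened stand-in for file.json (an assumption for this sketch).
value = json.loads('{"buildStatus": {"status": "ERROR"}}')["buildStatus"]["status"]

json_form = json.dumps(value)  # JSON-encoded string, quotes included
raw_form = value               # the plain string, comparable to what -r prints

print(json_form)
print(raw_form)
```

The raw form is usually what you want when assigning the result to a shell variable or comparing it in a script.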

If jq is not installed already, install it (it is available in the Universe repository):

sudo apt-get install jq 

Solution 3

As has been mentioned, complex structured data is best parsed with an appropriate API. Python has the json module for that, which I personally use quite a lot in my scripts, and it makes extracting the fields you want quite easy:

$ python3 -c 'import sys,json;print(json.load(sys.stdin)["buildStatus"]["status"])' < input.txt
ERROR

What happens here is that we redirect the input file to Python's stdin and read it with json.load(). The result is a Python dictionary with the key "buildStatus", whose value is another dictionary with the key "status". Thus, we are merely printing the value of a key in a dictionary that is stored within another dictionary. Fairly simple.
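
If the one-liner grows into a script, a slightly more defensive version might look like the sketch below. The function name build_status and the choice to return None on bad input are assumptions for illustration, and io.StringIO stands in for sys.stdin so the example is self-contained.

```python
import io
import json

def build_status(stream):
    """Return buildStatus.status from a JSON stream, or None if unavailable."""
    try:
        return json.load(stream)["buildStatus"]["status"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Invalid JSON, a missing key, or a top level that is not an object.
        return None

# io.StringIO plays the role of sys.stdin here.
print(build_status(io.StringIO('{"buildStatus": {"status": "ERROR"}}')))
```

In a real script you would pass sys.stdin or an open file object instead of the StringIO wrapper.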

Aside from simplicity, another advantage is that Python and its json module come preinstalled with Ubuntu by default.

Solution 4

You can actually do this in sed, but I strongly urge you to use a more sophisticated language that has tools written to handle JSON data. You could try perl or python, for example.

Now, in your simple example, all you want is the first occurrence of "status", so you could do:

$ sed -nE '/status/{s/.*:\s*"(.*)",/\1/p;q}' file.json 
ERROR

The trick is to use -n to suppress automatic printing. Then, if the line matches status (/status/), you remove everything but the part you want with s/.*:\s*"(.*)",/\1/, print the line, and quit.
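
The same first-match logic can be sketched in Python to make each step explicit. This is only an illustration of the line-oriented approach, not a recommendation; the function name first_status and the shortened sample are assumptions.

```python
import re

# Pretty-printed sample, shortened from the question's input.
SAMPLE = """{
    "buildStatus" : {
        "status" : "ERROR",
        "conditions" : [{
                "status" : "OK",
                "metricKey" : "bugs"
            }
        ],
        "periods" : []
    }
}"""

def first_status(text):
    """Return the value on the first line containing 'status', or None."""
    for line in text.splitlines():
        if "status" in line:  # plays the role of the /status/ address
            match = re.search(r'.*:\s*"(.*)",', line)  # same pattern as the s command
            return match.group(1) if match else None   # stop here either way, like q
    return None

print(first_status(SAMPLE))
```

Note that this depends entirely on the layout of the file: if "conditions" happened to be listed before "status", the first matching line would belong to a condition and the result would be wrong.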


Personally, I find this equivalent grep command much simpler:

$ grep -m1 -oP '"status"\s*:\s*"\K[^"]+' file.json 
ERROR

Or this one:

$ perl -ne 'if(s/.*"status"\s*:\s*"([^"]+).*/$1/){print;exit}' file.json 
ERROR

Seriously though, if you plan to be parsing JSON files, do not try to do this manually. Use a proper JSON parser.

Solution 5

I'm not saying you should use sed (I think someone downvoted me just for not writing the obligatory caveat). But if you need to search for something on the line after buildStatus, as you seem to be trying in your own attempt, you need to tell sed to read the next line with the N command:

$ sed -rn '/buildStatus/N;s/.*buildStatus.*\n.*: "(.*)",/\1/p' file
ERROR

Notes:

  • -n don't print anything until we ask for it
  • -r use ERE (same as -E)
  • /buildStatus/N find this pattern and read the next line too
  • s/old/new/ replace old with new
  • .* any number of any characters on the line
  • \n newline
  • : "(.*)", save any characters occurring between : " and ",
  • \1 back reference to saved pattern
  • p print the part we worked on
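
For comparison, the two-line match that N makes possible can be sketched with Python's re module, where the pattern can span the newline directly. The shortened sample and the variable names are assumptions for illustration.

```python
import re

# Pretty-printed sample, shortened from the question's input.
SAMPLE = """{
    "buildStatus" : {
        "status" : "ERROR",
        "conditions" : [],
        "periods" : []
    }
}"""

# '.' does not match a newline by default, so each '.*' stays on one line,
# and the explicit \n joins the buildStatus line to the line that follows it.
match = re.search(r'buildStatus.*\n.*: "(.*)",', SAMPLE)
next_line_status = match.group(1) if match else None
print(next_line_status)
```

As with the sed version, this only works while the status value sits on the line directly after buildStatus.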

Author: user1876040

Updated on September 18, 2022

Comments

  • user1876040
    user1876040 over 1 year

    I have this json text:

    {
        "buildStatus" : {
            "status" : "ERROR",
            "conditions" : [{
                    "status" : "OK",
                    "metricKey" : "bugs"
                }, {
                    "status" : "ERROR",
                    "metricKey" : "test_success_density"
                }, {
                    "status" : "OK",
                    "metricKey" : "vulnerabilities"
                }
            ],
            "periods" : []
        }
    }
    

    I want to extract the overall status of the buildStatus, i.e. the expected output is "ERROR":

    "buildStatus" : {
        "status" : "ERROR",
        ....
    }
    

    I tried the sed expression below, but it isn't working, it returns OK:

    status= sed -E 's/.*\"buildStatus\":.*\"status\":\"([^\"]*)\",.*/\1/' jsonfile
    

    What am I doing wrong?
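
One plausible explanation for the OK result, assuming the real input is compact single-line JSON (the expression leaves no room for whitespace around the colons, so it cannot match the pretty-printed version): the .* between buildStatus and status is greedy and runs to the last "status" that is followed by a comma. The sketch below reproduces this with Python's re module; the compact sample is a reconstruction and therefore an assumption.

```python
import re

# Compact single-line form of the question's JSON (assumed layout).
COMPACT = ('{"buildStatus":{"status":"ERROR","conditions":['
           '{"status":"OK","metricKey":"bugs"},'
           '{"status":"ERROR","metricKey":"test_success_density"},'
           '{"status":"OK","metricKey":"vulnerabilities"}],"periods":[]}}')

# Greedy .* skips ahead to the LAST "status":"...", pair.
greedy = re.search(r'"buildStatus":.*"status":"([^"]*)",', COMPACT).group(1)

# Lazy .*? stops at the FIRST one (a Python/PCRE feature).
lazy = re.search(r'"buildStatus":.*?"status":"([^"]*)",', COMPACT).group(1)

print(greedy, lazy)
```

POSIX extended regular expressions, as used by sed -E, have no lazy quantifier, so this is not a drop-in fix for the sed command; it only shows where the OK comes from.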

  • slowko
    slowko over 7 years
    or this one: grep -m 1 status file.json | tr -cd '[[:alnum:]]:' | cut -f2 -d':'
  • terdon
    terdon over 7 years
    @user1876040 you're welcome. Please remember to accept one of the answers (I recommend ByteCommander's; his is a better solution) so the question can be marked as answered.
  • muru
    muru over 7 years
    There's also jq: jq -r .buildStatus.status
  • muru
    muru over 7 years
    You're probably talking of stackoverflow.com/a/1732454/2072269
  • HTNW
    HTNW over 7 years
  • Sergiy Kolodyazhnyy
    Sergiy Kolodyazhnyy over 7 years
    Consider improving your answer by showing example of how jsontool can be used for OP's specific case
  • Barb Hammond
    Barb Hammond over 7 years
    @HTNW I've never liked that answer, because a "single XML open tag" (which is what the question is asking about) is a regular language (and you could in principle build a full XML parser by using regexes to match tags, comments, and CDATA sections, and using a simple stack to handle the nested context). However, the most 'interesting' regular language in JSON is a string literal.
  • Pysis
    Pysis over 7 years
    Lol @muru, correct, that is one of the posts attempting to deter users from parsing XML/JSON with regex! I was mostly recommending jq, which muru and heemayl describe and already have examples for, and just posting the reasoning behind it: askubuntu.com/a/863948/230721