How to parse HTTP headers using Bash?

15,866

Solution 1

Full bash solution. This demonstrates how to easily parse other headers without requiring awk:

shopt -s extglob # Required to trim whitespace; see below

while IFS=':' read key value; do
    # trim whitespace in "value"
    value=${value##+([[:space:]])}; value=${value%%+([[:space:]])}

    case "$key" in
        Server) SERVER="$value"
                ;;
        Content-Type) CT="$value"
                ;;
        HTTP*) read PROTO STATUS MSG <<< "$key${value:+:$value}"
                ;;
     esac
done < <(curl -sI http://www.google.com)
echo "$STATUS"
echo "$SERVER"
echo "$CT"

Producing:

302
GFE/2.0
text/html; charset=UTF-8

According to RFC-2616, HTTP headers follow the format described in "Standard for the Format of ARPA Internet Text Messages" (RFC-822), which states in section 3.1.2:

The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF. (While CR and/or LF may be present in the actual text, they are removed by the action of unfolding the field.)

So the above script should catch any RFC-[2]822 compliant header with the notable exception of folded headers.
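The folded-header exception could be handled by joining continuation lines before parsing. The following is only a sketch, not part of the original answer: it assumes a continuation line is any line starting with a space or tab, and the function name `unfold_headers` and the canned response are made up for illustration (in practice the input would come from `curl -sI`).

```bash
#!/usr/bin/env bash
# Sketch: join folded (continuation) header lines into single lines.
# unfold_headers is a hypothetical helper; it reads headers on stdin.
unfold_headers() {
    local line current="" re='^[[:blank:]]+(.*)$'
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}                  # drop the CR of a CRLF line ending
        if [[ $line =~ $re ]]; then
            # Continuation line: append it to the header being built,
            # replacing the leading whitespace with a single space.
            current+=" ${BASH_REMATCH[1]}"
        else
            # A new header starts: emit the previous one, if any.
            if [[ -n $current ]]; then printf '%s\n' "$current"; fi
            current=$line
        fi
    done
    if [[ -n $current ]]; then printf '%s\n' "$current"; fi
}

# Canned response standing in for: curl -sI "$url"
sample=$'HTTP/1.1 200 OK\r\nX-Demo: first part,\r\n  second part\r\nServer: demo\r\n\r\n'
unfolded=$(unfold_headers <<< "$sample")
printf '%s\n' "$unfolded"
```

The unfolded output can then be fed to the `while IFS=':' read key value` loop in place of the raw curl output.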

Solution 2

If you wanted to extract more than a couple of headers, you could stuff all the headers into a bash associative array. Here's a simple-minded function which assumes that any given header only occurs once. (Don't use it for Set-Cookie; see below.)

# Call this as: headers ARRAY URL
headers () {
  {
    # (Re)define the specified variable as an associative array.
    unset $1;
    declare -gA $1;
    local line rest

    # Get the first line, assuming HTTP/1.0 or above. Note that these fields
    # have Capitalized names.
    IFS=$' \t\n\r' read $1[Proto] $1[Status] rest
    # Drop the CR from the message, if there was one.
    declare -gA $1[Message]="${rest%$'\r'}"
    # Now read the rest of the headers. 
    while true; do
      # Get rid of the trailing CR if there is one.
      IFS=$'\r' read line rest;
      # Stop when we hit an empty line
      if [[ -z $line ]]; then break; fi
      # Make sure it looks like a header
      # This regex also strips leading and trailing spaces from the value
      if [[ $line =~ ^([[:alnum:]_-]+):\ *(( *[^ ]+)*)\ *$ ]]; then
        # Force the header to lower case, since headers are case-insensitive,
        # and store it into the array
        declare -gA $1[${BASH_REMATCH[1],,}]="${BASH_REMATCH[2]}"
      else
        printf "Ignoring non-header line: %q\n" "$line" >&2
      fi
    done
  } < <(curl -Is "$2")
}

Example:

$ headers so http://stackoverflow.com/
$ for h in "${!so[@]}"; do printf "%s=%s\n" "$h" "${so[$h]}"; done | sort
Message=OK
Proto=HTTP/1.1
Status=200
cache-control=public, no-cache="Set-Cookie", max-age=43
content-length=224904
content-type=text/html; charset=utf-8
date=Fri, 25 Jul 2014 17:35:16 GMT
expires=Fri, 25 Jul 2014 17:36:00 GMT
last-modified=Fri, 25 Jul 2014 17:35:00 GMT
set-cookie=prov=205fd7f3-10d4-4197-b03a-252b60df7653; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
vary=*
x-frame-options=SAMEORIGIN

Note that the SO response includes one or more cookies, in Set-Cookie headers, but we can only see the last one because the naive script overwrites entries with the same header name. (As it happens, there was only one but we can't know that.) While it would be possible to augment the script to special case Set-Cookie, a better approach would probably be to provide a cookie-jar file, and use the -b and -c curl options in order to maintain it.
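If special-casing repeated headers is preferred over a cookie jar, one possible approach is to append each new value to the existing entry instead of overwriting it. This is a minimal sketch under assumptions of my own: the array name `hdr`, the helper `store_header`, and the sample headers are hypothetical, and a newline is used as the separator because cookie values can themselves contain commas.

```bash
#!/usr/bin/env bash
# Sketch: keep every value of a repeated header (requires bash 4+).
declare -A hdr=()

# store_header NAME VALUE (hypothetical helper)
store_header() {
    local name=${1,,} value=$2          # header names are case-insensitive
    if [[ -n ${hdr[$name]+set} ]]; then
        hdr[$name]+=$'\n'"$value"       # seen before: append on a new line
    else
        hdr[$name]=$value
    fi
}

re='^([[:alnum:]_-]+):[[:space:]]*(.*)$'
while IFS= read -r line; do
    line=${line%$'\r'}                  # tolerate CRLF input
    if [[ $line =~ $re ]]; then
        store_header "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    fi
done <<'EOF'
Set-Cookie: a=1; path=/
Content-Type: text/html
Set-Cookie: b=2; path=/
EOF

printf '%s\n' "${hdr[set-cookie]}"
```

Here `${hdr[set-cookie]}` holds both cookie lines, one per line, while single-valued headers such as `content-type` are stored unchanged.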

Solution 3

Using process substitution (<( ... )), you can read the values into shell variables:

sh$ read STATUS SERVER < <(
      curl -sI http://www.google.com | 
      awk '/^HTTP/ { STATUS = $2 } 
           /^Server:/ { SERVER = $2 } 
           END { printf("%s %s\n",STATUS, SERVER) }'
    )

sh$ echo $STATUS
302
sh$ echo $SERVER
GFE/2.0
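A variation on the same idea, sketched here rather than taken from the answer, is to capture the response once in a variable and query it several times with a here-string, so that only one request is made. The canned `response` value below is an assumption standing in for `response=$(curl -sI "$url")`.

```bash
#!/usr/bin/env bash
# Sketch: fetch once, extract several values from the saved response.
# Stand-in for: response=$(curl -sI "$url")
response=$'HTTP/1.1 200 OK\r\nServer: demo-server\r\nContent-Type: text/html\r\n\r\n'

# Status code: second whitespace-separated field of the status line.
http_status=$(awk '/^HTTP/ { print $2; exit }' <<< "$response")

# Server header: match the name case-insensitively and strip the trailing CR.
server=$(awk -F': *' 'tolower($1) == "server" { sub(/\r$/, "", $2); print $2; exit }' <<< "$response")

echo "$http_status"
echo "$server"
```

This runs awk once per value, which is fine for a couple of headers. For many headers, a single parsing loop scales better.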
Author by

jpshook

Updated on June 04, 2022

Comments

  • jpshook
    jpshook almost 2 years

    I need to get 2 values from a web page header that I am getting using curl. I have been able to get the values individually using:

    response1=$(curl -I -s http://www.example.com | grep HTTP/1.1 | awk {'print $2'})
    response2=$(curl -I -s http://www.example.com | grep Server: | awk {'print $2'})
    

    But I cannot figure out how to grep the values separately using a single curl request like:

    response=$(curl -I -s http://www.example.com)
    http_status=$response | grep HTTP/1.1 | awk {'print $2'}
    server=$response | grep Server: | awk {'print $2'}
    

    Every attempt either leads to an error message or empty values. I am sure it is just a syntax issue.

    • Stephen Garle
      Stephen Garle almost 10 years
      doing $response |... won't work because the value of $response is not a command. echo $response should work.
  • jpshook
    jpshook almost 10 years
    What if there were 20 properties to be read, would you suggest the same approach?
  • Sylvain Leroux
    Sylvain Leroux almost 10 years
    @JPShook As for me, I would use either awk or bash. In most cases, it doesn't add much to use both of them. But without enough background, I could only speculate that you wanted a hybrid solution.
  • Sylvain Leroux
    Sylvain Leroux almost 10 years
    @JPShook I posted another answer demonstrating how you could use bash alone. Depending on your needs, this might be a better solution.
  • jpshook
    jpshook almost 10 years
    Why is the HTTP* case different than the others? I am a bash n00b, so please forgive me if the question is really basic.
  • Sylvain Leroux
    Sylvain Leroux almost 10 years
    @JPShook IFS=':' means I break input as key/value based on the : character. The HTTP status line does not have that format. So it is a special case.
  • rici
    rici almost 10 years
    I think the HTTP* case could be better written as read PROTO STATUS MSG <<<"$key$value" in case the message contains a colon (and making use of <<<, which is conceptually simpler than spawning a child to echo.)
  • Sylvain Leroux
    Sylvain Leroux almost 10 years
    @rici Thank you for your comment. Very good catch! I changed my answer accordingly.
  • jpshook
    jpshook almost 10 years
    I tried this method. I am also reading a header that returns "online", and when I echo it, it shows as online, but when I try to compare it to "online" it does not match. Any ideas? Do they need to be trimmed or something?
  • Sylvain Leroux
    Sylvain Leroux almost 10 years
    @JPShook Remember that in the case statement you place the "field-name". For RFC-822/RFC-2822 compliant headers, space is not allowed as a field name character. You don't have to trim anything (I've added a bit of background info in my answer). May I suggest you capture the full header to a file in order to examine how it is structured -- and if you are not able to fix the script to deal with that particular header, ask another question with the relevant pieces of data.
  • jpshook
    jpshook almost 10 years
    @SylvainLeroux I have not included any spaces in the field name. The value that is coming back has some hidden characters, newlines or something causing it to not only be comprised of the value I am looking for. Any way to trim/remove newline chars from the value?
  • jpshook
    jpshook almost 10 years
    @SylvainLeroux Ended up using a wildcard in the condition. Would it be possible for you to update your answer to strip the results of each header value of any extra spaces, newlines, etc?
  • Sylvain Leroux
    Sylvain Leroux almost 10 years
    @JPShook I've updated my answer to trim whitespace in the value field. You now have all the basic building blocks to adapt to your special need. BTW the issue you had while parsing your very specific header is weird. You definitely should post that (incl. the raw header data, of course) as a new question. That would be an interesting puzzle to solve...
  • jpshook
    jpshook almost 10 years
    The whitespace trim fixed the issue so I no longer have to use the wildcard match. Thanks for the help!
  • djuarezg
    djuarezg over 5 years
    How do you add a timeout to your solution?
  • rici
    rici over 5 years
    @djuarez: I'd probably use the timeout command to wrap the entire script, but bash's read builtin has an option for setting a timeout if it were acceptable to do the timeout per line.
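The per-line timeout mentioned in the last comment can be sketched with `read -t`. This is only an illustration under stated assumptions: `slow_source` is a made-up stand-in for a slow `curl -sI` call, and the one-second limit is arbitrary. For a real request, curl's own `--max-time` option or wrapping the script in the `timeout` command may be simpler.

```bash
#!/usr/bin/env bash
# Sketch: give up if any single header line takes longer than 1 second.
# slow_source is a hypothetical stand-in for a slow: curl -sI "$url"
slow_source() {
    printf 'HTTP/1.1 200 OK\r\n'
    sleep 2                              # simulate a stalled server
    printf 'Server: late\r\n'
}

lines=0
timed_out=no
while true; do
    if IFS= read -r -t 1 line; then
        lines=$((lines + 1))             # a complete line arrived in time
    else
        rc=$?                            # exit status of the failed read
        # bash reports a read timeout with an exit status above 128;
        # any other failure here means end of input.
        if (( rc > 128 )); then timed_out=yes; fi
        break
    fi
done < <(slow_source 2>/dev/null)

echo "lines=$lines timed_out=$timed_out"
```

With the simulated stall, the first line is read and the second read times out, so the loop stops instead of blocking.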