Parse HTML using shell

bash shell parsing awk

36,681

Solution 1

awk  -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file

Output:

Another:

awk  -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file

Solution 2

awk is not an HTML parser. Use xpath or even xslt for that. xmllint is a commandline tool which is able to execute XPath queries and xsltproc can be used to perform XSL transformations. Both tools belong to the package libxml2-utils.

Also you can use a programming language which is able to parse HTML

Solution 3

$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0

Solution 4

You really should to use some real HTML parser for this job, like:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

prints:

But for this you need to have perl, and installed Mojolicious package.

(it is easy to install with:)

curl -L get.mojolicio.us | sh

Solution 5

BSD/GNU `grep`/`ripgrep`

For simple extracting, you can use grep, for example:

Your example using grep:

$ egrep -o "[0-9][^<]\?\+" file.html
54
1
0 (0/0)
0

and using ripgrep:

$ rg -o ">([^>]+)<" -r '$1' <file.html | tail +2
54
1
0 (0/0)
0

Extracting outer html of H1:

$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>

Other examples:

Extracting the body:

$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...

^{Instead of xargs you can also use tr '\n' ' '.}

For multiple tags, see: Text between two tags.

If you're dealing with large datasets, consider using ripgrep which has similar syntax, but it's a way faster since it's written in Rust.

View more solutions

36,681

Author by

Lenny

Updated on April 05, 2021

Comments

Lenny about 3 years

I have a HTML with lots of data and part I am interested in:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I try to use awk which now is:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

but what I want is to have:

Right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

Recents

Why Is PNG file with Drop Shadow in Flutter Web App Grainy?

How to troubleshoot crashes detected by Google Play Store for Flutter app

Cupertino DateTime picker interfering with scroll behaviour

Why does awk -F work for most letters, but not for the letter "t"?

Flutter change focus color and icon color but not works

How to print and connect to printer using flutter desktop via usb?

Critical issues have been reported with the following SDK versions: com.google.android.gms:play-services-safetynet:17.0.0

Flutter Dart - get localized country name from country code

navigatorState is null when using pushNamed Navigation onGenerateRoutes of GetMaterialPage

Android Sdk manager not found- Flutter doctor error

Flutter Laravel Push Notification without using any third party like(firebase,onesignal..etc)

How to change the color of ElevatedButton when entering text in TextField

Shell/Bash parsing text file

find ip address of my system for a particular interface with shell script (bash)

Unix to verify file has no content and empty lines

decode Base64 from shell

Awk & Sort-Output as Comma Delimited?

syntax error near unexpected token `&' (using "|&")

Awk using index with Substring

using awk to check between two dates

How can I replace dot in Bash shell?

Substring extraction using bash shell scripting and awk