Simple way to extract value from HTML
Solution 1
You can extract the value in your example with grep and assign it to a variable as follows:
$ x=$(wget -O - 'http://foo/bar.html' | grep -Po '<value.*strValue="\K[[:digit:]]*')
$ echo $x
57
Explanation:
- $(): command substitution
- grep -P: grep with Perl-compatible regular expressions enabled
- grep -o: grep prints only the matched part of the line
- \K: drops everything matched up to this point from the output
- wget -O -: prints the downloaded document to standard output (not to a file)
However, as a general approach it is better to use a dedicated parser for HTML code.
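The pattern above can be tried without any network access. The following is a minimal sketch, assuming GNU grep built with -P support; the variable name sample is only for illustration, and its content is the document from the question.

```shell
# Sample document from the question, held in a variable (no download needed).
sample='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>'

# Same grep invocation as in Solution 1, fed by printf instead of wget.
x=$(printf '%s\n' "$sample" | grep -Po '<value.*strValue="\K[[:digit:]]*')

echo "$x"   # prints 57
```

Replacing the printf with wget -O - and a real URL gives the one-liner shown earlier.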
Solution 2
I have no idea what wget you're talking about, but I am guessing that you want to download the file. If so, yes, you can download it and parse it with no intermediate temp file:
$ value=$(wget -O - http://example.com/file.html | grep -oP 'strValue="\K[^"]+')
$ echo $value
57
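The [^"]+ in this pattern takes everything up to the closing quote, so it is not limited to digits. A small sketch of the difference, assuming GNU grep with -P support; the input line with the value 15,4 is a hypothetical example, not taken from a real page.

```shell
# Hypothetical line whose attribute holds a comma-separated decimal.
line='<value strValue="15,4" unit="%">154</value>'

# \d+ stops at the first non-digit character.
digits=$(printf '%s\n' "$line" | grep -oP 'strValue="\K\d+')

# [^"]+ runs up to (but not including) the closing quote.
whole=$(printf '%s\n' "$line" | grep -oP 'strValue="\K[^"]+')

echo "$digits"   # prints 15
echo "$whole"    # prints 15,4
```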
Solution 3
Apart from the wget -O - ... technique, you can also use curl -Ss ... to avoid the hassle of a temporary file.
The following illustrates the use of pup (https://github.com/ericchiang/pup), which supports a CSS-based query language.
a) To extract the "text" value of the <value> tag:
pup 'value text{}' # yields 572
b) To extract the value of the strValue attribute of the <value> tag:
pup 'value attr{strvalue}' # yields 57
njordan
Updated on September 18, 2022
Comments
-
njordan over 1 year
I have a very simple html file with a value inside. Value is 57 in this case.
<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>
What is an easy way to extract this value in a bash script and store it in a variable? Is there a way to skip writing the wget output to a file as an intermediate step, so that I don't have to open and read a stored file, and instead work directly with the wget output?
To clarify: I could simply run wget, save to a file, and check the file for the value. Or is there a better way to run wget somewhere in RAM, without an explicit file being stored?
Thanks a million times, highly appreciated. Norbert
-
eyoung100 over 9 years
HTML is a subset of XML. You need to read up on using an XML Reader in Linux, which is most likely why you were downvoted.
-
Pablo A over 4 years
@eyoung100 HTML5 is not XML
-
-
jimmij over 9 years
See updated edit. \K works only with the -P option.
-
Hackaholic over 9 years
+1 for \K using Perl regex
-
jimmij over 9 years
Useless use of cat, and it doesn't work anyway. If you really want to involve sed, try sed 's/.*strValue="\([[:digit:]]*\).*/\1/' file.
-
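The sed suggestion can be checked against the document from the question. This is a minimal sketch that reads from a variable (named sample here purely for illustration) instead of a file; the -n flag and the trailing p are an addition so that nothing is printed when the pattern does not match.

```shell
# Sample document from the question.
sample='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>'

# Replace the whole line by the captured digits; print only if a substitution happened.
value=$(printf '%s\n' "$sample" | sed -n 's/.*strValue="\([[:digit:]]*\).*/\1/p')

echo "$value"   # prints 57
```

Without -n and p, sed would print a non-matching line unchanged, which is easy to mistake for a result.
-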
terdon over 9 years
+1, but since you're using -P, why not use \d+ instead of [[:digit:]]*?
-
jimmij over 9 years
Yes, you are right, \d+ is shorter. Also, [^"]+ is better because the value inside "" probably(?) doesn't need to be numerical.
-
DisplayName over 9 years
Yeah, I tried..
-
geedoubleya over 9 years
Nice explanations. No temp file would be nice.
-
DisplayName over 9 years
I suck at everything.
-
jimmij over 9 years
No, you just need some practice, and you ask very good questions. I especially like this one: unix.stackexchange.com/q/159489/80886, for obvious reasons. BTW, it was not me who downvoted.
-
DisplayName over 9 years
It doesn't matter who downvoted, I don't really care about internet points that much :).
-
njordan over 9 years
I know it is a stupid question, but can you show how it would look if I run wget against a website, so that there is no need to store an intermediate local file?
-
jimmij over 9 years
@njordan See the update; you just need to use the -O - option with wget, as in terdon's answer. The - means to use standard output for the downloaded document, not a file.
-
njordan over 9 years
One more question: it seems that [[:digit:]]* only extracts an integer value. I used the same great line to extract another parameter that is a float (e.g., 15,4), and it cuts off at 15. What do I have to do to take the complete string inside the "" as a float variable?
-
jimmij over 9 years
Try grep -Po '<value.*strValue="\K[[:digit:]]*(,[[:digit:]]+){0,1}'. The {0,1} means that the group inside () can be present zero times or once. So 57,11,22 will match 57,11.
-
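The behaviour of the optional group can be seen on a few inputs. This is a minimal sketch, assuming GNU grep with -P support; the helper name extract and the shortened <value> lines it builds are hypothetical and only serve the demonstration.

```shell
# Hypothetical helper: wrap a test value in a minimal <value> tag and apply the pattern.
extract() {
  printf '<value strValue="%s">\n' "$1" |
    grep -Po '<value.*strValue="\K[[:digit:]]*(,[[:digit:]]+){0,1}'
}

int=$(extract 57)          # no comma group present
flt=$(extract 15,4)        # one comma group
multi=$(extract 57,11,22)  # only the first comma group is taken

echo "$int"     # prints 57
echo "$flt"     # prints 15,4
echo "$multi"   # prints 57,11
```
-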
njordan over 9 years
Should my last question not also work directly this way: grep -Po '<value.*strValue="\K[[:digit:]]*,[[:digit:]] assuming that there has to be a float value with one ","? Also, isn't there a simpler way to just take out everything found between the ""?
-
jimmij over 9 years
@njordan grep -Po '<value.*strValue="\K[[:digit:]]*,[[:digit:]] will fail if the value is an integer, and in the case of a float it will match only the first digit after the ,. If that is what you are looking for, that's fine. And yes, there is a very easy way to take everything between "" with awk, in your case: awk -F'"' '{print $6}'. However, you must guarantee that your value is exactly at the 6th position.
-
njordan over 9 years
Sorry, I have another use case now: what if I need to take a STRING, i.e. everything between ""? Thanks
-
jimmij over 9 years
@njordan As I said in my last comment, with awk that would be awk -F'"' '{print $6}'; just change 6 to the position of the string. If you want grep, then the crucial regexp would be "[^"]*", but the whole command would depend on the specific case.
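Both suggestions can be tried on the document from the question. This is a minimal sketch, assuming GNU grep with -P support for the last command; the variable names are illustrative, and the grep line keyed on the unit attribute is just one possible way to build a full command around [^"]*.

```shell
# Sample document from the question.
sample='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>'

# With " as the field separator, attribute values land in the even-numbered fields.
value=$(printf '%s\n' "$sample" | awk -F'"' '{print $6}')   # strValue
unit=$(printf '%s\n' "$sample" | awk -F'"' '{print $8}')    # unit

# grep variant: select by attribute name rather than by position.
unit_grep=$(printf '%s\n' "$sample" | grep -oP 'unit="\K[^"]*')

echo "$value"       # prints 57
echo "$unit"        # prints %
echo "$unit_grep"   # prints %
```

The awk form breaks if attributes are added or reordered, while the grep form depends only on the attribute name.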