Extract text from a file using the terminal?
Solution 1
Not so much a one-liner (although the command to run it is a one-liner :) ), but here is a Python option:
#!/usr/bin/env python3
import sys

file = sys.argv[1]
with open(file) as src:
    text = src.read()
# (start, end) index pairs: start is just past "id_ad=", end is at the next "&action"
starters = [(i+6, text[i:].find("&action")+i) for i in range(len(text)) if text[i:i+6] == "id_ad="]
if len(starters) > 0:
    for item in starters:
        print(text[item[0]:item[1]])
The script first lists all occurrences (indexes) of the (start) string "id_ad=", in combination with (end) string "&action". Then it prints all that is between those "markers".
Extracted from a prepared file:
" I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action"
The result is:
1929170
1889170
1889170
1929990
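The marker scan described above can also be sketched with str.find in a loop, which jumps from one marker to the next instead of testing every index. This is only a minimal sketch, not the script from this answer: the function name extract_between and the sample string are made up for illustration.

```python
def extract_between(text, start="id_ad=", end="&action"):
    """Return every substring found between `start` and the next `end`."""
    results = []
    pos = text.find(start)
    while pos != -1:
        begin = pos + len(start)          # first character after the start marker
        stop = text.find(end, begin)      # next end marker after that
        if stop == -1:
            # no closing marker left, so nothing more can be extracted
            break
        results.append(text[begin:stop])
        pos = text.find(start, begin)     # look for the next start marker
    return results


# hypothetical sample text, modelled on the question
sample = "foo 12 id_ad=1929170&action bar 7 id_ad=1889170&action baz"
print("\n".join(extract_between(sample)))
```

Unlike the script above, this version stops quietly when a start marker has no end marker after it.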
How to use
Paste the script into an empty file and save it as extract.py.
Run it with the command:
python3 extract.py <file>
Note
If there is only one occurrence in the text file, the script can be much shorter:
#!/usr/bin/env python3
import sys

file = sys.argv[1]
with open(file) as src:
    text = src.read()
print(text[text.find("id_ad=")+6:text.find("&action")])
Solution 2
For example:
egrep "id_ad=[[:digit:]]+&action" file.txt | tr "=&" " " | cut -d " " -f2
...but I am sure there are more elegant ways ;-).
Step by step:
egrep "id_ad=[[:digit:]]+&action" file.txt
scans file.txt for the pattern (regular expression) composed of a literal id_ad=, followed by one or more digits (the meaning of [[:digit:]]+), followed by a literal &action, and sends the matching lines to standard output.
tr "=&" " "
replaces each "=" and each "&" with a space.
cut -d " " -f2
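prints the second space-separated field of its standard input. This yields the number only if nothing containing a space, "=" or "&" comes before id_ad= on the line.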
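For comparison, the three pipeline stages can be mirrored in Python. This is a sketch under assumptions: the helper name pipeline and the sample lines are hypothetical, and the code imitates the egrep, tr and cut steps rather than calling them.

```python
import re


def pipeline(lines):
    """Imitate the egrep | tr | cut pipeline on a list of lines."""
    pattern = re.compile(r"id_ad=[0-9]+&action")
    table = str.maketrans("=&", "  ")        # tr: each = and & becomes a space
    out = []
    for line in lines:
        if not pattern.search(line):         # egrep: keep only matching lines
            continue
        fields = line.translate(table).split(" ")
        out.append(fields[1])                # cut -d " " -f2: second field
    return out


# hypothetical sample lines
sample = ["id_ad=1929170&action", "no match here 42", "see id_ad=1889170&action"]
print(pipeline(sample))
```

With this sample, the last line produces id_ad instead of the number, because the word "see" shifts the fields by one. The field-based step is only reliable when id_ad= is the first thing on the line.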
Solution 3
With sed:
sed 's/id_ad=\(.*\)&action/\1/' filename
Explanation:
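The above command replaces everything from the START word (id_ad=) through the END word (&action) with the text between them (.*) on each line of filename. Note that sed prints every line by default, so lines without a match are printed unchanged.
\(...\) is used for capturing groups: \( starts a capturing group and \) ends it. With \1 we print the contents of the first group (we have only one capture group).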
A better sed command for the above solution could be:
sed 's/^id_ad=\([0-9]*\)&action/\1/' filename
^ : Start of the line.
[0-9]* : Zero or more digits.
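The capture-group substitution can be tried out with Python's re module, where a group is written ( ... ) without backslashes. This is a sketch with made-up sample lines. sed_like is a hypothetical helper, and its only_matches switch imitates running sed with -n and the p flag.

```python
import re

# same idea as the sed pattern, without the ^ anchor
PATTERN = re.compile(r"id_ad=([0-9]*)&action")


def sed_like(lines, only_matches=False):
    """Replace the first id_ad=...&action on each line with the captured digits."""
    out = []
    for line in lines:
        # count=1: like sed's s command without the g flag
        new_line, n = PATTERN.subn(r"\1", line, count=1)
        if n or not only_matches:
            out.append(new_line)
    return out


# hypothetical sample lines
sample = ["some text 42", "id_ad=1929170&action", "x id_ad=1889170&action y"]
print(sed_like(sample))
print(sed_like(sample, only_matches=True))
```

The first call keeps the non-matching line, as plain sed does. In both calls, text around the match ("x" and "y" here) survives the substitution, since only the matched part of the line is replaced.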
With grep:
grep -Po '(?<=id_ad=)[0-9]*(?=&action)' filename
Explanation:
From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
-P, --perl-regexp
Interpret PATTERN as a Perl compatible regular expression (PCRE)
This returns any run of zero or more digits ([0-9]*) between the START word (id_ad=) and the END word (&action) in filename.
(?<=pattern) : Positive lookbehind. A pair of parentheses, with the opening parenthesis followed by a question mark, a "less than" symbol, and an equals sign.
(?<=id_ad=)[0-9]* : (positive lookbehind) matches zero or more digits that come right after id_ad= in filename.
(?=pattern) : Positive lookahead. A pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.
[0-9]*(?=&action) : (positive lookahead) matches zero or more digits that are followed by the pattern (&action), without making the pattern (&action) part of the match.
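The same lookbehind and lookahead syntax works in Python's re module, which makes it easy to experiment with. A small sketch, assuming a made-up sample string; it uses [0-9]+ rather than [0-9]* so that at least one digit is required.

```python
import re

# hypothetical sample: two complete markers and one without "&action"
text = "a 12 id_ad=1929170&action b id_ad=1889170&action c id_ad=77 d"

# lookbehind and lookahead check for the markers without making them part of the match
lookaround = re.findall(r"(?<=id_ad=)[0-9]+(?=&action)", text)

# a plain capture group selects the same digits
captured = re.findall(r"id_ad=([0-9]+)&action", text)

print(lookaround)
print(captured)
```

Both patterns pick out the same two numbers here; the 77 is skipped because it is not followed by &action.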
Read more about Lookahead and Lookbehind
Extra links:
Advanced Grep Topics
GREP for Designers
Solution 4
Another Python answer, using the re module. Example stolen from Jacob's post.
script.py
#!/usr/bin/python3
import sys
import re

file = sys.argv[1]
L = []  # collects the extracted numbers
with open(file) as src:
    for j in src:  # iterate through all the lines
        # extract the digits found between `id_ad=` and `&action`
        for i in re.findall(r'id_ad=(\d+)&action', j):
            L.append(i)  # append the extracted digits to the list L
for f in L:  # iterate through all the elements in the list L
    print(f)  # print each element on its own line
Run the above script as,
python3 script.py /path/to/the/file
Example:
$ cat fi
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains
id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains
id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action
$ python3 script.py ~/file
1929170
1889170
1889170
1929990
bcsteeve
Updated on September 18, 2022
Comments
-
bcsteeve over 1 year
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :(
There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains
id_ad=1929170&action
and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers.
So intuitively I know I just want to ignore everything up to (and including) id_ad= and ignore everything after (and including) &action and I'll be left with the integer I want. And I know I can use regular expressions to achieve this. But I can't seem to figure it out. I'd like to do this as a one-liner from the terminal if possible.
-
Jacob Vlijm over 9 years
the result should be 1929170 right? does it only occur once in the body?
-
bcsteeve over 9 years
Well, in that example yes that is the result. And it may (or may not) occur elsewhere. I want to pull any numbers in that position
-
mickmackusa about 4 years
This question would have been much clearer if you would have presented a realistic sample body of text. This way we could determine if lookingahead for the substring that follows the digits is necessary. @bcsteeve
-
bcsteeve about 4 years
@mickmackusa 2014. 6 YEARS ago.
-
mickmackusa about 4 years
@bcs I saw the timestamp.
-
bcsteeve over 9 years
Thanks! Can you explain why :digit: is inside double square brackets please?? I'm guessing the internal bracket is simply part of the specified structure while the outer is what says we're matching that (while the left and right literals are not in brackets so therefore not matched???). Am I close?
-
Rmano over 9 years
No, it's just the syntax of the regular expression used by egrep. See man egrep, scroll down to "Character Classes and Bracket Expressions".
-
steeldriver over 9 years
With sed, you'd need to at least use the -n switch and just print the substitution, I think, i.e. sed -n 's/id_ad=\(.*\)&action/\1/p' (otherwise, sed prints all lines by default), although personally I'd make the match a bit more specific, e.g. sed -n 's/^id_ad=\([0-9]*\)&action/\1/p'
-
bcsteeve over 9 years
I'm choosing this answer because it has really sent me down the right track! The other answers work too, of course. But this provided enough info, context and example to help me understand a lot. Thank you.
-
Avinash Raj over 9 years
@Kasiya as steeldriver said, your sed solution won't work if two or more id_ad=00&action present on the same line.
-
Avinash Raj over 9 years
And you don't need to go for a lookbehind. This grep -Po 'id_ad=\K[0-9]*(?=&action)' filename would be enough.
-
Jacob Vlijm over 9 years
I tested it, on large files, this becomes faster, thanks for mentioning it.