extract text from a file using terminal?

command-line regex text-processing

Solution 1

Not quite a one-liner (although the command to run it is one :) ), but here is a Python option:

#!/usr/bin/env python3
import sys
file = sys.argv[1]

with open(file) as src:
    text = src.read()

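# (start, end) index pairs: start is just past "id_ad=", end is the next "&action"
starters = [(i+6, text.find("&action", i)) for i in range(len(text)) if text.startswith("id_ad=", i)]
for start, end in starters:
    if end != -1:                      # skip a start marker that has no end marker after it
        print(text[start:end])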

The script first lists the index of every occurrence of the start string "id_ad=", paired with the index of the next end string "&action". It then prints everything between each pair of markers.
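The same marker-scanning idea can also be wrapped in a small function. This is only a sketch and not the script above: `extract_between` is a hypothetical helper name, the sample string is made up, and it uses `str.find` with a start offset instead of checking every index of the text.

```python
def extract_between(text, start="id_ad=", end="&action"):
    """Return every substring found between a start marker and the next end marker."""
    results = []
    pos = text.find(start)
    while pos != -1:
        begin = pos + len(start)        # first character after the start marker
        stop = text.find(end, begin)    # next end marker after that point
        if stop == -1:                  # no end marker left, so stop scanning
            break
        results.append(text[begin:stop])
        pos = text.find(start, begin)   # look for the next start marker
    return results


# Made-up sample text, for illustration only
sample = "noise 12 id_ad=1929170&action more noise 7 id_ad=1889170&action end"
print(extract_between(sample))
```

Because the markers are parameters, the same function could be reused for other start/end strings without touching a regular expression.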

Tested on a prepared file containing:

" I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action"

The result is:

1929170
1889170
1889170
1929990

How to use

Paste the script into an empty file, save it as extract.py, and run it with the command:

python3 <script> <file>

Note

If there is only one occurrence in the text file, the script can be much shorter:

#!/usr/bin/env python3
import sys
file = sys.argv[1]

with open(file) as src:
    text = src.read()
print(text[text.find("id_ad=")+6:text.find("&action")])

Solution 2

For example:

egrep -o "id_ad=[[:digit:]]+&action" file.txt | tr "=&" "  " | cut -d " " -f2

...but I am sure there are more elegant ways ;-).

Step by step:

egrep -o "id_ad=[[:digit:]]+&action" file.txt

scans file.txt for the pattern (regular expression) made up of a literal id_ad=, followed by one or more digits (the meaning of [[:digit:]]+), followed by a literal &action. The -o option prints only the matched text rather than the whole line, so the number is always the second field, even when other text precedes it on the line. The output goes to standard output.

tr "=&" "  "

replaces each "=" and each "&" with a space.

cut -d " " -f2

prints the second (space-separated) field of each line of standard input.
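For anyone who finds the pipeline easier to follow as a program, the three stages can be sketched in Python. This is a rough analogue, not what the shell actually runs: `extract_ids` and the sample lines are made up, `[0-9]` stands in for `[[:digit:]]`, and the sketch assumes only the matched text is passed on to the later stages (comparable to grep's -o option).

```python
import re

# Same shape as the egrep pattern; [0-9] stands in for [[:digit:]]
PATTERN = re.compile(r"id_ad=[0-9]+&action")
# Translation table equivalent to: tr "=&" "  "
TO_SPACES = str.maketrans("=&", "  ")


def extract_ids(lines):
    """Mimic: match the pattern | tr "=&" "  " | cut -d " " -f2"""
    ids = []
    for line in lines:
        for match in PATTERN.findall(line):      # stage 1: keep only the matched text
            spaced = match.translate(TO_SPACES)  # stage 2: "id_ad 123 action"
            ids.append(spaced.split(" ")[1])     # stage 3: second space-separated field
    return ids


# Made-up sample lines, for illustration only
sample = ["junk 42 here", "foo id_ad=1929170&action bar", "id_ad=1889170&action"]
print(extract_ids(sample))
```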

Solution 3

With sed:

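sed -n 's/id_ad=\(.*\)&action/\1/p' filename

Explanation:

The command above replaces the text from the START word (id_ad=) to the END word (&action) with whatever string (.*) lies between them, on each line of filename where both occur. Any text on the line outside the match is left untouched. The -n switch together with the trailing p flag prints only the lines on which a substitution was made; without them, sed prints every line, matching or not.
\(...\) is a capturing group: \( starts the group and \) ends it. \1 refers back to the text captured by the first group (there is only one here).

A better sed command for the above solution could look like this:

sed -n 's/^id_ad=\([0-9]*\)&action/\1/p' filename

^: Start of the line.
[0-9]*: Zero or more digits.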
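To see what the capture group and \1 do, here is a rough Python analogue using `re.subn`. It is an illustration with made-up sample lines, not a replacement for sed, and `substitute` is a hypothetical name. As with sed's s command, only the matched part of a line is replaced; keeping only the lines where a substitution happened is comparable to running sed with -n and the p flag.

```python
import re


def substitute(lines):
    """Replace id_ad=<digits>&action with just the digits; keep only changed lines."""
    changed = []
    for line in lines:
        # \1 stands for whatever the first (...) group captured
        new_line, count = re.subn(r"id_ad=([0-9]*)&action", r"\1", line)
        if count:                  # a substitution happened on this line
            changed.append(new_line)
    return changed


# Made-up sample lines, for illustration only
lines = ["id_ad=1929170&action", "foo id_ad=1889170&action bar", "no marker here"]
for result in substitute(lines):
    print(result)
```

The second sample line shows why a more specific pattern matters: the text around the match survives the substitution.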

With grep:

grep -Po '(?<=id_ad=)[0-9]*(?=&action)' filename

Explanation:

From man grep:

-o, --only-matching
      Print only the matched (non-empty) parts of a matching line,
      with each such part on a separate output line.
-P, --perl-regexp
      Interpret PATTERN as a Perl compatible regular expression (PCRE)

Returns any run of zero or more digits ([0-9]*) found between the START word (id_ad=) and the END word (&action) in filename.

(?<=pattern): Positive lookbehind. The construct is a pair of parentheses, with the opening parenthesis followed by a question mark, a "less than" symbol, and an equals sign.

(?<=id_ad=)[0-9]* matches zero or more digits that come directly after id_ad= in filename, without making id_ad= part of the match.

(?=pattern): Positive lookahead. The construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

[0-9]*(?=&action) matches zero or more digits that are followed by the pattern &action, without making &action part of the match.
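Python's re module accepts the same lookbehind and lookahead syntax, so the effect can be tried out interactively. A minimal sketch, assuming a made-up sample string; it uses [0-9]+ rather than [0-9]* so that empty matches are not returned:

```python
import re

# Made-up sample: the middle number is followed by "&other", not "&action"
text = "x 12 id_ad=1929170&action y 7 id_ad=1889170&other id_ad=1929990&action"

# (?<=id_ad=) must come right before the digits and (?=&action) right after;
# neither marker becomes part of the match itself.
ids = re.findall(r"(?<=id_ad=)[0-9]+(?=&action)", text)
print(ids)
```

Only the numbers wrapped by both markers come back; the stray integers and the number followed by "&other" are skipped.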

Solution 4

Another Python answer, this time using the re module. The example text is borrowed from Jacob's post.

script.py

#!/usr/bin/python3
import sys
import re
file = sys.argv[1]
L = []                                                  # Collected matches
with open(file) as src:
    for j in src:                                       # Iterate over the lines of the file
        for i in re.findall(r'id_ad=(\d+)&action', j):  # Capture the digits between `id_ad=` and `&action`
            L.append(i)                                 # Store each captured number in L
for f in L:                                             # Iterate over the collected numbers
    print(f)                                            # Print each number on its own line

Run the above script as,

python3 script.py /path/to/the/file

Example:

$ cat fi
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

 id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

 id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action

$ python3 script.py ~/file
1929170
1889170
1889170
1929990

bcsteeve

Updated on September 18, 2022

Comments

bcsteeve over 1 year
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :(

There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains
```
id_ad=1929170&action
```
and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers.

So intuitively I know I just want to ignore everything up to (and including) id_ad= and ignore everything after (and including) &action and I'll be left with the integer I want. And I know I can use regular expressions to achieve this. But I can't seem to figure it out.

I'd like to do this as a one liner from terminal if possible.
- Jacob Vlijm over 9 years
  
  the result should be 1929170 right? does it only occur once in the body?
- bcsteeve over 9 years
  
  Well, in that example yes that is the result. And it may (or may not) occur elsewhere. I want to pull any numbers in that position
- mickmackusa about 4 years
  
  This question would have been much clearer if you had presented a realistic sample body of text. That way we could determine whether looking ahead for the substring that follows the digits is necessary. @bcsteeve
- bcsteeve about 4 years
  
  @mickmackusa 2014. 6 YEARS ago.
- mickmackusa about 4 years
  
  @bcs I saw the timestamp.
bcsteeve over 9 years

Thanks! Can you explain why :digit: is inside double square brackets please?? I'm guessing the internal bracket is simply part of the specified structure while the outer is what says we're matching that (while the left and right literals are not in brackets so therefore not matched???). Am I close?
Rmano over 9 years

No, it's just the syntax of the regular expression used by egrep. See man egrep, scroll down to "Character Classes and Bracket Expressions".
steeldriver over 9 years

With sed, you'd need to at least use the -n switch and just print the substitution, I think, i.e. sed -n 's/id_ad=\(.*\)&action/\1/p' (otherwise, sed prints all lines by default), although personally I'd make the match a bit more specific, e.g. sed -n 's/^id_ad=\([0-9]*\)&action/\1/p'
bcsteeve over 9 years

I'm choosing this answer because it has really sent me down the right track! The other answers work too, of course. But this provided enough info, context and example to help me understand a lot. Thank you.
Avinash Raj over 9 years

@Kasiya as steeldriver said, your sed solution won't work if two or more id_ad=00&action are present on the same line.
Avinash Raj over 9 years

And you don't need to go for a lookbehind. This grep -Po 'id_ad=\K[0-9]*(?=&action)' filename would be enough.
Jacob Vlijm over 9 years

I tested it, on large files, this becomes faster, thanks for mentioning it.