Extract text from a file using the terminal?

Solution 1

Not so much a one-liner (although the command to run it is a one-liner :) ), but here is a Python option:

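#!/usr/bin/env python3
import sys

file = sys.argv[1]

with open(file) as src:
    text = src.read()

# Collect (start, end) index pairs: start is just past "id_ad=",
# end is the position of the next "&action" after it.
starters = [
    (i + 6, text.find("&action", i))
    for i in range(len(text))
    if text.startswith("id_ad=", i)
]
for start, end in starters:
    if end != -1:  # skip an "id_ad=" that has no closing "&action"
        print(text[start:end])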

The script first lists the index of every occurrence of the start string "id_ad=", each paired with the index of the next end string "&action". Then it prints everything between those "markers".

Extracted from a prepared file:

" I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action"

The result is:

1929170
1889170
1889170
1929990

How to use

Paste the script into an empty file, save it as extract.py, and run it with the command:

python3 <script> <file>

Note

If there is only one occurrence in the text file, the script can be much shorter:

#!/usr/bin/env python3
import sys
file = sys.argv[1]

with open(file) as src:
    text = src.read()
print(text[text.find("id_ad=")+6:text.find("&action")])
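
A slightly more defensive take on the single-occurrence case can be sketched with str.partition. The function name extract_between is made up for illustration; it returns None when either marker is missing, and it only looks for "&action" after "id_ad=" has been found:

```python
def extract_between(text, start="id_ad=", end="&action"):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition splits around the first occurrence; the middle item is ""
    # when the separator is absent
    _, found, rest = text.partition(start)
    if not found:
        return None
    value, found, _ = rest.partition(end)
    return value if found else None


sample = "some text 42 id_ad=1929170&action more text 7"
print(extract_between(sample))
```

With the sample string above this prints 1929170.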

Solution 2

For example:

 egrep "id_ad=[[:digit:]]+&action" file.txt |  tr "=&" "  " | cut -d " " -f2 

...but I am sure there are more elegant ways ;-).

Step by step:

egrep "id_ad=[[:digit:]]+&action" file.txt 

scans file.txt for lines matching the pattern (regular expression) made up of a literal id_ad=, followed by one or more digits (the meaning of [[:digit:]]+), followed by a literal &action. Matching lines are sent to standard output.

tr "=&" "  " 

replaces each "=" and each "&" with a space.

cut -d " " -f2

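prints the second space-separated field of each line of standard input. This assumes the matching line starts with id_ad=; any text before it on the same line would shift the field numbering.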
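
For readers more at home in Python, the three stages can be mimicked roughly as below. This is only a sketch: the function name pipeline is invented here, and [0-9] stands in for [[:digit:]]. It keeps the same behaviour as cut, counting fields from the start of the line:

```python
import re


def pipeline(lines):
    """Rough Python counterpart of the egrep | tr | cut pipeline."""
    pattern = re.compile(r"id_ad=[0-9]+&action")
    table = str.maketrans("=&", "  ")       # tr "=&" "  "
    for line in lines:
        line = line.rstrip("\n")
        if not pattern.search(line):        # egrep: keep matching lines only
            continue
        fields = line.translate(table).split(" ")
        if len(fields) > 1:
            yield fields[1]                 # cut -d " " -f2


lines = ["id_ad=1929170&action", "noise 12", "id_ad=1889170&action tail"]
print(list(pipeline(lines)))
```

For the three sample lines this prints ['1929170', '1889170']; a line such as "x id_ad=5&action" would yield "id_ad" instead of the number, because the second field is no longer the digits.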

Solution 3

With sed:

sed 's/id_ad=\(.*\)&action/\1/' filename

Explanation:

The command above replaces id_ad=, whatever follows it (.*), and &action with just the text found between the START word (id_ad=) and the END word (&action) in filename.
\(...\) defines a capturing group: \( opens the group and \) closes it. \1 in the replacement refers to the first (and here only) group, so the whole match is replaced by the captured text.

A more specific sed command for the same task could look like this:

sed 's/^id_ad=\([0-9]*\)&action/\1/' filename

^: start of the line, so this version only matches when id_ad= begins the line.
[0-9]*: zero or more digits.
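
To see what the capture group and \1 do, the substitution can be mirrored in Python. This is only a sketch in the spirit of the commands above: the function name sed_like and its quiet flag are invented here, with quiet=True standing in for sed's -n option combined with the p flag. By default, as with sed without -n, lines that do not match pass through unchanged:

```python
import re


def sed_like(lines, quiet=False):
    # Python counterpart of the sed pattern id_ad=\([0-9]*\)&action
    pattern = re.compile(r"id_ad=([0-9]*)&action")
    out = []
    for line in lines:
        # Like s/// without the g flag, replace only the first match per line
        new, count = pattern.subn(r"\1", line, count=1)
        if count or not quiet:
            out.append(new)
    return out


lines = ["noise 12", "x id_ad=1929170&action y"]
print(sed_like(lines))
print(sed_like(lines, quiet=True))
```

The first call prints ['noise 12', 'x 1929170 y'] and the second prints ['x 1929170 y'], which shows why printing only the substituted lines matters when the file contains other text.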

With grep:

grep -Po '(?<=id_ad=)[0-9]*(?=&action)' filename

Explanation:

From man grep:

-o, --only-matching
      Print only the matched (non-empty) parts of a matching line,
      with each such part on a separate output line.
-P, --perl-regexp
      Interpret PATTERN as a Perl compatible regular expression (PCRE)

The command returns the digits ([0-9]*, zero or more of them) found between the START word (id_ad=) and the END word (&action) in filename.

(?<=pattern): Positive lookbehind. A pair of parentheses, with the opening parenthesis followed by a question mark, a "less than" symbol, and an equals sign.

(?<=id_ad=)[0-9]*: (positive lookbehind) matches zero or more digits that come directly after id_ad=, without making id_ad= part of the match.

(?=pattern): Positive lookahead. A pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

[0-9]*(?=&action): (positive lookahead) matches zero or more digits that are followed by the pattern &action, without making &action part of the match.
Read more about Lookahead and Lookbehind
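
The same lookbehind and lookahead syntax also works in Python's re module, so the idea can be tried out as follows. This is a sketch, not a translation of grep's behaviour in every detail: the function name extract_ids is invented here, and [0-9]+ is used instead of [0-9]* so that empty matches are not returned:

```python
import re

# Digits that are preceded by "id_ad=" and followed by "&action";
# neither marker becomes part of the match.
LOOKAROUND = re.compile(r"(?<=id_ad=)[0-9]+(?=&action)")


def extract_ids(text):
    return LOOKAROUND.findall(text)


sample = "a 12 id_ad=1929170&action b 7 id_ad=1889170&action id_ad=x&action"
print(extract_ids(sample))
```

For the sample string this prints ['1929170', '1889170']; the stray 12 and 7 are skipped because they are not surrounded by the two markers.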


Solution 4

Another Python answer, this time using the re module. The example is borrowed from Jacob's post.

script.py

#!/usr/bin/python3
import sys
import re

file = sys.argv[1]
matches = []                                   # collected id_ad values
with open(file) as src:
    for line in src:                           # iterate over all the lines
        # findall returns the digits captured between `id_ad=` and `&action`
        matches.extend(re.findall(r'id_ad=(\d+)&action', line))
for m in matches:                              # print each value on its own line
    print(m)

Run the script as:

python3 script.py /path/to/the/file

Example:

$ cat fi
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

 id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

 id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action

$ python3 script.py ~/file
1929170
1889170
1889170
1929990

Author: bcsteeve

Updated on September 18, 2022

Comments

  • bcsteeve
    bcsteeve over 1 year

    I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :(

    There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

    id_ad=1929170&action
    

    and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers.

    So intuitively I know I just want to ignore everything up to (and including) id_ad= and ignore everything after (and including) &action and I'll be left with the integer I want. And I know I can use regular expressions to achieve this. But I can't seem to figure it out.

    I'd like to do this as a one liner from terminal if possible.

    • Jacob Vlijm
      Jacob Vlijm over 9 years
      the result should be 1929170 right? does it only occur once in the body?
    • bcsteeve
      bcsteeve over 9 years
      Well, in that example yes that is the result. And it may (or may not) occur elsewhere. I want to pull any numbers in that position
    • mickmackusa
      mickmackusa about 4 years
      This question would have been much clearer if you had presented a realistic sample body of text. That way we could determine whether looking ahead for the substring that follows the digits is necessary. @bcsteeve
    • bcsteeve
      bcsteeve about 4 years
      @mickmackusa 2014. 6 YEARS ago.
    • mickmackusa
      mickmackusa about 4 years
      @bcs I saw the timestamp.
  • bcsteeve
    bcsteeve over 9 years
    Thanks! Can you explain why :digit: is inside double square brackets please?? I'm guessing the internal bracket is simply part of the specified structure while the outer is what says we're matching that (while the left and right literals are not in brackets so therefore not matched???). Am I close?
  • Rmano
    Rmano over 9 years
    No, it's just the syntax of the regular expression used by egrep. See man egrep, scroll down to "Character Classes and Bracket Expressions".
  • steeldriver
    steeldriver over 9 years
    With sed, you'd need to at least use the -n switch and just print the substitution, I think, i.e. sed -n 's/id_ad=\(.*\)&action/\1/p' (otherwise, sed prints all lines by default), although personally I'd make the match a bit more specific, e.g. sed -n 's/^id_ad=\([0-9]*\)&action/\1/p'
  • bcsteeve
    bcsteeve over 9 years
    I'm choosing this answer because it has really sent me down the right track! The other answers work too, of course. But this provided enough info, context and example to help me understand a lot. Thank you.
  • Avinash Raj
    Avinash Raj over 9 years
    @Kasiya as steeldriver said, your sed solution won't work if two or more id_ad=00&action present on the same line.
  • Avinash Raj
    Avinash Raj over 9 years
    And you don't need to go for a lookbehind. This grep -Po 'id_ad=\K[0-9]*(?=&action)' filename would be enough.
  • Jacob Vlijm
    Jacob Vlijm over 9 years
    I tested it, on large files, this becomes faster, thanks for mentioning it.