How to filter data from txt using grep or sed?

Solution 1

For your specific input this will work:

grep -Po '\s[a-z0-9-]{2,}(?=\..{2,4})' file.txt
  • -P : enables Perl-compatible regular expressions, which are needed for the lookahead.
  • -o : prints only the matched part of each line.
  • \s : requires the match to start with a whitespace character.
  • [a-z0-9-]{2,} : followed by two or more lowercase letters, digits, or hyphens.
  • (?=\..{2,4}) : a lookahead that requires a dot and 2 to 4 characters (the domain suffix) to follow, without including them in the match.

Here is the output:

wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys
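
To try this without the original file, the sample from the question can be written to a temporary file first. The following is only a sketch under a few assumptions: `sample` and `names` are hypothetical variable names, GNU grep with PCRE support (`-P`) is assumed, and the pattern is a slight variant that adds `\K` so the leading whitespace is dropped from the reported match and restricts the suffix to lowercase letters.

```shell
# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id,created_at,text
842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
EOF

# \s\K             : require a whitespace character, then drop it from the match
# [a-z0-9-]{2,}    : the name itself (lowercase letters, digits, hyphens)
# (?=\.[a-z]{2,4}) : a dot plus 2 to 4 letters must follow, but is not printed
names=$(grep -Po '\s\K[a-z0-9-]{2,}(?=\.[a-z]{2,4})' "$sample")
printf '%s\n' "$names"

rm -f "$sample"
```

Because the names on the "was just registered" lines are preceded by a comma rather than whitespace, they are not matched, which is why this works for this particular input only.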

A better idea (based on your comment) is to select the lines by date and phrase with awk first, then extract the names:

awk '/2017-05-20/ && /Auctions were started/' file.txt | grep -Po '\s[a-z0-9-]{1,}(?=\..{2,4})'
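
As a rough sketch of how the date filter could be combined with the length sort asked for in the comments, the pipeline below decorates each name with its length, sorts on that number, and removes it again. Several things here are assumptions: "sorted by length" is read as shortest first, the third input line is hypothetical (it is not in the original data and only shows the date filter excluding something), `sample` and `sorted` are made-up variable names, and `\K` is added so the leading whitespace is not printed.

```shell
# Recreate the two auction lines from the question, plus one HYPOTHETICAL line
# with a different date (not in the original data) to show the date filter at work.
sample=$(mktemp)
cat > "$sample" <<'EOF'
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
999999,2017-05-21 09:00:00,"Auctions were started for otherday.com"
EOF

# 1. awk keeps lines that contain both the date and the phrase.
# 2. grep extracts the names; \K drops the leading whitespace from the match.
# 3. awk prefixes each name with its length, sort orders numerically on that
#    prefix (-s keeps input order for equal lengths), and cut removes it again.
sorted=$(
  awk '/2017-05-20/ && /Auctions were started/' "$sample" |
    grep -Po '\s\K[a-z0-9-]{1,}(?=\..{2,4})' |
    awk '{ print length($0), $0 }' |
    sort -s -n -k1,1 |
    cut -d' ' -f2-
)
printf '%s\n' "$sorted"

rm -f "$sample"
```

With this sample, the seven-character names come first in input order, followed by coffeetec and new-fun-boys; otherday is filtered out by the date test.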

Solution 2

Here are a couple of options.

KISS approach using two greps:

$ grep 'Auctions were started for' file | grep -o '\S*\.com'
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
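
Note that this prints the full domain names, whereas the expected output in the question omits the suffix. One possible way to drop it, shown here only as a sketch, is an extra sed step that removes a trailing `.com`. The variables `input` and `stripped` are hypothetical names, and `input` merely holds two lines of the sample data.

```shell
# Two lines from the question's sample input: one that should be ignored
# and one that contains the auction list.
input='336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"'

# Same two greps as above, followed by sed to strip the literal ".com"
# at the end of each extracted name.
stripped=$(
  printf '%s\n' "$input" |
    grep 'Auctions were started for' |
    grep -o '\S*\.com' |
    sed 's/\.com$//'
)
printf '%s\n' "$stripped"
```

The dot is escaped in the sed expression so that only a literal `.com` at the end of the line is removed.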

More elegant:

$ perl -lne 'if (/"Auctions were started for (.*)"/) {print for split(/, | and /, $1)}' file
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com

Solution 3

You can easily achieve this with a combination of grep to find all lines in file.txt containing the text "Auctions were started for", and sed to extract only the domain names without TLD and print one per line:

grep -Po '(?<="Auctions were started for ).*(?=")' file.txt | sed -r 's/and |,|\.com//g;y/ /\n/'

Here's a breakdown of the command:

grep -Po '(?<="Auctions were started for ).*(?=")' file.txt

This scans file.txt line by line and matches anything (.*) that is preceded by the string "Auctions were started for and followed by another ". We need grep's -P option to enable PCRE regular expressions (otherwise we could not use the (?<=...) and (?=...) regex lookarounds) and its -o option to only print the matched part of a line (excluding the lookarounds) instead of the whole line.

In a second step, we pipe the output of this first command into this sed command:

sed -r 's/and |,|\.com//g;y/ /\n/'

This sed line actually contains two commands, s/and |,|\.com//g and y/ /\n/.

First, s/PATTERN/REPLACEMENT/ searches for the pattern and |,|\.com (an extended regular expression, because of the -r option), which matches "and ", ",", or a literal ".com". The dot is escaped because an unescaped dot would match any character, so names that merely contain "com" could be damaged. Each match is replaced with nothing, so these pieces are removed from the input line. The g at the end enables global search and replacement instead of processing only the first match on every line.

Second, y/CHARACTERS/REPLACEMENTS/ translates all characters in the first field to their corresponding characters in the second field. Here I am using this to simply convert all remaining spaces to line breaks.
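
To make the individual steps easier to follow, the sketch below runs them one at a time on a single line from the question and keeps each intermediate result. The variable names (`line`, `stage1`, `stage2`, `stage3`) are hypothetical, the dot in `\.com` is escaped so that it matches only a literal dot, and GNU grep and GNU sed are assumed.

```shell
line='842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"'

# Stage 1: keep only the text between the fixed phrase and the closing quote.
stage1=$(printf '%s\n' "$line" | grep -Po '(?<="Auctions were started for ).*(?=")')

# Stage 2: delete "and ", "," and a literal ".com" everywhere on the line.
stage2=$(printf '%s\n' "$stage1" | sed -r 's/and |,|\.com//g')

# Stage 3: translate the remaining spaces into newlines.
stage3=$(printf '%s\n' "$stage2" | sed 'y/ /\n/')

printf 'after grep : %s\n' "$stage1"
printf 'after s/// : %s\n' "$stage2"
printf 'after y/// :\n%s\n' "$stage3"
```

After the first stage the line is reduced to the comma-separated domain list, after the second only space-separated names remain, and the third puts each name on its own line.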

Author: Kasaram Bala

Updated on September 18, 2022

Comments

  • Kasaram Bala
    Kasaram Bala about 1 year

    I am trying to fetch data from Twitter. I am able to read each line, but I do not know which commands to use to filter the data the way I want. Any suggestions?

    Input file: file.txt

    id,created_at,text
    842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
    336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
    84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
    842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
    

    Expected output:

    wantit1
    wantit2
    wantit3
    wantit4
    sidefun
    coffeetec
    lifeout
    new-fun-boys
    

    Code I have:

    cat file.txt | while read line; 
    do
    
    echo "$line"  >> out1.txt
    
    done
    
    • David Foerster
      David Foerster over 6 years
      What's the pattern here and what are you trying to achieve? Do you want the prefix of every word ending in .com from the third comma-separated column?
    • Kasaram Bala
      Kasaram Bala over 6 years
      Yes, I am looking for a pattern that gives me the list of .com domain names from the lines containing the text 'Auctions were started for'.
  • Kasaram Bala
    Kasaram Bala over 6 years
    This worked the way I wanted, but I still have another question. I want to get only the data with the date '2017-05-20', and the output should be sorted by the length of the domain name.
  • Ravexina
    Ravexina over 6 years
    @KasaramBala I updated my answer for the first part of your question; for the second, I've got another answer: here.