How to filter data from txt using grep or sed?
Solution 1
For your specific input this will work:
grep -Po '\s[a-z0-9-]{2,}(?=\..{2,4})' file.txt
-P: enables Perl-compatible regular expressions, which we need for the lookahead.
-o: only show the matches.
\s: only consider matches that start with a whitespace character.
[a-z0-9-]{2,}: followed by two or more lowercase letters, digits or hyphens.
(?=\..{2,4}): which must be followed by a dot and 2 to 4 characters (the domain suffix); the lookahead keeps them out of the match.
Here is the output:
wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys
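One detail worth knowing: because \s is part of the pattern, -o prints that whitespace character as well, so each line above actually starts with a space. Here is a minimal sketch of one way around it, assuming GNU grep built with PCRE support. \K throws away everything matched so far, so the space is required but not printed. The helper name extract_names and the tighter suffix [a-z]{2,4} are illustrative choices, not part of the command above.

```shell
# Two rows copied from the question's file.txt.
sample='842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"'

# extract_names is a hypothetical helper name used only in this sketch.
#   \s\K    require a whitespace character, then drop it from the match
#   (?=...) look ahead for a dot plus 2 to 4 letters without consuming them
extract_names() {
    grep -Po '\s\K[a-z0-9-]{2,}(?=\.[a-z]{2,4})'
}

printf '%s\n' "$sample" | extract_names
```

On these two rows it should print wantit1 to wantit4, one per line and without the leading space; goldring is skipped because a comma, not whitespace, comes before it.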
A better idea (based on your comment) is to use:
awk '(/2017-05-20/ && /Auctions were started/)' file.txt | grep -Po '\s[a-z0-9-]{1,}(?=\..{2,4})'
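For the second half of that request (output sorted by the length of the domain name), a rough sketch could look like the following. It assumes only .com domains matter, prefixes each name with its length, sorts numerically and then cuts the length off again. The function name names_by_length, the alphabetical tie-break and the extra 2017-05-21 row are assumptions added here purely for illustration.

```shell
# The two auction rows from the question plus one made-up row with another date.
sample='84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
842435,2017-05-21 09:00:00,"Auctions were started for other-day.com"'

# names_by_length is a hypothetical helper; it reads the CSV on stdin.
names_by_length() {
    awk '/2017-05-20/ && /Auctions were started/' |   # keep only the wanted rows
        grep -oE '[a-z0-9-]+\.com' |                  # one domain per line
        sed 's/\.com$//' |                            # drop the suffix
        awk '{ print length($0), $0 }' |              # prefix each name with its length
        LC_ALL=C sort -k1,1n -k2 |                    # by length, ties alphabetically
        cut -d' ' -f2-                                # remove the length column
}

printf '%s\n' "$sample" | names_by_length
```

The made-up 2017-05-21 row is filtered out by the first awk, and the shortest names come first.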
Solution 2
Here are a couple of options.
KISS approach using two greps:
$ grep 'Auctions were started for' file | grep -o '\S*\.com'
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
More elegant:
$ perl -lne 'if (/"Auctions were started for (.*)"/) {print for split(/, | and /, $1)}' file
wantit1.com
wantit2.com
wantit3.com
wantit4.com
sidefun.com
coffeetec.com
lifeout.com
new-fun-boys.com
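Both options print the names with .com still attached, while the expected output in the question has it removed. A small sketch of the two-grep variant with one more step to strip the suffix, assuming every wanted domain ends in .com. The function name auction_names is made up for this example, and [^[:space:]] is used as the POSIX spelling of \S.

```shell
# Two rows copied from the question's file.txt.
sample='842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"'

# auction_names is a hypothetical helper; it reads the CSV on stdin.
auction_names() {
    grep 'Auctions were started for' |      # keep only the auction lines
        grep -o '[^[:space:]]*\.com' |      # print each domain on its own line
        sed 's/\.com$//'                    # strip the trailing .com
}

printf '%s\n' "$sample" | auction_names
```

The goldring.com row never reaches the second grep, so only the four auction names are printed.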
Solution 3
You can easily achieve this with a combination of grep to find all lines in file.txt containing the text "Auctions were started for", and sed to extract only the domain names without TLD and print one per line:
grep -Po '(?<="Auctions were started for ).*(?=")' file.txt | sed -r 's/and |,|\.com//g;y/ /\n/'
Here's a breakdown of the command:
grep -Po '(?<="Auctions were started for ).*(?=")' file.txt
This scans file.txt line by line and matches anything (.*) that is preceded by the string "Auctions were started for and followed by another ". We need grep's -P option to enable PCRE regular expressions (otherwise we could not use the (?<=...) and (?=...) regex lookarounds) and its -o option to only print the matched part of a line (excluding the lookarounds) instead of the whole line.
In a second step, we pipe the output of this first command into this sed command:
sed -r 's/and |,|\.com//g;y/ /\n/'
This sed line actually contains two commands, s/and |,|\.com//g and y/ /\n/.
First, s/PATTERN/REPLACEMENT/ searches for the regular expression pattern and |,|\.com (an extended regex, because of the -r option), which means "and " (with a trailing space), "," or ".com". The dot is escaped so that it matches only a literal dot rather than any character. It then replaces each match with nothing, so these patterns get removed from the input line. The g at the end enables global search and replacement instead of just processing the first match on every line.
Second, y/CHARACTERS/REPLACEMENTS/ translates all characters in the first field to their corresponding characters in the second field. Here I am using this to simply convert all remaining spaces to line breaks.
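To see why the dot in the .com alternative is worth escaping, here is a small comparison, assuming GNU sed (-E is the portable spelling of -r). An unescaped dot matches any character, so any letter followed by "com" inside a name gets eaten as well. The helper names strip_loose and strip_strict and the domain welcome.com are made up for this illustration.

```shell
# Hypothetical helpers: the same sed script with an unescaped and an escaped dot.
strip_loose()  { sed -E 's/and |,|.com//g;y/ /\n/'; }
strip_strict() { sed -E 's/and |,|\.com//g;y/ /\n/'; }

# "welcome.com" is an invented domain that contains "lcom" in the middle.
list='welcome.com, lifeout.com and new-fun-boys.com'

echo "$list" | strip_loose    # first line comes out as "wee"
echo "$list" | strip_strict   # first line comes out as "welcome"
```

For the sample data in the question both versions give the same result; the difference only shows up for names that happen to contain "com" after some other character.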
Kasaram Bala
Updated on September 18, 2022
Comments
-
Kasaram Bala about 1 year
I am trying to fetch data from Twitter. I am able to read each line, but I do not know which commands to use to filter the data the way I want. Any suggestions?
Input file : file.txt
id,created_at,text
842433,2017-05-20 14:45:05,goldring.com was just registered https://t.co/xt9345d
336353,2017-05-20 14:45:04,stretch.com was just registered https://t.co/QBEX965hf
84244e,2017-05-20 14:45:03,"Auctions were started for wantit1.com, wantit2.com, wantit3.com and wantit4.com"
842434,2017-05-20 14:45:02,"Auctions were started for sidefun.com, coffeetec.com, lifeout.com and new-fun-boys.com"
Expecting output:
wantit1
wantit2
wantit3
wantit4
sidefun
coffeetec
lifeout
new-fun-boys
Code I have :
cat file.txt | while read line; do
    echo "$line" >> out1.txt
done
-
David Foerster over 6 years
What's the pattern here and what are you trying to achieve? Do you want the prefix of every word ending in .com from the third comma-separated column?
-
Kasaram Bala over 6 years
Yes, I am looking for a pattern that gives me the list of .com domain names from the lines containing the text 'Auctions were started for'.
-
Kasaram Bala over 6 years
This worked the way I wanted, but I still have another question: I want to get the data for the date 2017-05-20 only, and the output should be sorted by the length of the domain name.
-
Ravexina over 6 years
@KasaramBala I updated my answer for the first part of your question; for the second I've got another answer: here.