Compare two text files and find matching lines
Solution 1
Since your patterns are only four to six lines, why not use them in an OR pattern? An example limiting to 10 matches that operates on a second file "bigDNA.txt":
grep -E 'GAGA|CAGA|GGGT|TATT' -m 10 bigDNA.txt
The following will save you from manually typing the patterns from file patt.txt. It joins the lines with | (append | to each line, remove the newlines, then remove the trailing |):
grep -E "$(sed 's#$#|#' patt.txt | tr -d '\n' | sed 's#|$##')" -m 10 bigDNA.txt
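To see what that command substitution actually produces, you can run the inner pipeline on its own (patt.txt is recreated here with the question's four example patterns):

```shell
# Build the alternation from patt.txt and print the resulting pattern.
printf 'GAGA\nCAGA\nGGGT\nTATT\n' > patt.txt
sed 's#$#|#' patt.txt | tr -d '\n' | sed 's#|$##'
# prints: GAGA|CAGA|GGGT|TATT
```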
Solution 2
Have you tried iterating through "file A" with a while loop?
while read string
do grep "$string" file-B | head -10
done < file-A
Or in one line:
while read string; do grep "$string" file-B | head -10; done < file-A
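If the patterns may contain backslashes, surrounding whitespace, or regex metacharacters, a slightly hardened variant of the same loop (a sketch, using the same file names) is:

```shell
# Hardened variant: IFS= and -r keep each line intact as read, -F matches
# the pattern as a fixed string rather than a regex, and grep -m 10
# replaces the "| head -10" pipe.
while IFS= read -r string; do
    grep -F -m 10 -- "$string" file-B
done < file-A
```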
Solution 3
This will print the first 10 lines that match any of your strings:
grep -m 10 -Ff motifs sequence.fa
This one will read each motif and print the first ten lines matching it, so it will print 10 lines for each motif:
while read mot; do grep -m 10 "$mot" sequence.fa; done < motifs
However, that looks like a DNA sequence, which means that the line breaks are completely arbitrary and you can have matches like this:
ACTGGA
GA
With these approaches, the GAGA above will not count as a match, and this is probably not what you want. Instead, I suggest you put everything on a single line before you search. Since you are asking for matching lines, I assume you want each of these motifs in their context. So, to do this properly, matching motifs that are split across newlines, first transform your file to TBL format. I've been using the same little awk script written by a colleague (thanks Pep) for years:
#!/bin/sh
gawk '{
    if (substr($1,1,1) == ">")      # header line: print identifier and a tab
        if (NR > 1)
            printf "\n%s\t", substr($0,2,length($0)-1)
        else
            printf "%s\t", substr($0,2,length($0)-1)
    else                            # sequence line: append without a newline
        printf "%s", $0
} END {printf "\n"}' "$@"
Save the script above as FastaToTbl somewhere in your $PATH (/usr/local/bin for example) and make it executable (chmod a+x /usr/local/bin/FastaToTbl). Then, you can simply pipe FASTA format sequences to it and it will print out .tbl format, where the identifier and the sequence are all on the same line.
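As a quick sanity check (a sketch: seq1/seq2 are made-up identifiers, and plain awk is used here to run the same program the gawk script above contains):

```shell
# Tiny FASTA snippet -> TBL: each record becomes "id<TAB>sequence" on one line.
printf '>seq1\nACTGGA\nGAGATT\n>seq2\nTTTT\n' |
awk '{
    if (substr($1,1,1) == ">")
        if (NR > 1) printf "\n%s\t", substr($0,2)
        else        printf "%s\t", substr($0,2)
    else
        printf "%s", $0
} END {printf "\n"}'
# prints:
# seq1    ACTGGAGAGATT
# seq2    TTTT
```

Note how the sequence of seq1, split across two FASTA lines, ends up as one unbroken string, which is what makes cross-newline motif matching possible.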
So, once you have FastaToTbl set up, you can run:
while read mot; do
    FastaToTbl sequence.fa | grep -Po ".{10}$mot.{10}" | head -n 10
done < motifs
The above will give you the first 10 matches for each pattern and will also match motifs that are split across newlines. It will also print the 10 characters on either side of the matched pattern; change the {10} to another number to control this behavior.
Solution 4
Here is a hopefully readable script. FIRSTFILE contains one item per line (with no extra spaces, etc.), and BIGFILE contains the big list that you want to match:
awk -F, '
BEGIN {
    regexp = "__NOTMATCHING__"
    linematched = 0
    while (( getline line < "FIRSTFILE" ) > 0) {
        nb_items[line] = 0          # initialise a counter for each item
        regexp = regexp "|" line    # build an "egrep-like" regexp matching any item
    }
}
# main: for each line of BIGFILE,
#  - save each matching line,
#  - and increment the counter of every item it contains.
{
    if ( $0 ~ regexp ) {
        matchinglines[++linematched] = $0
        for ( item in nb_items ) {
            # for each matching item, we also increment that item's counter
            if ( $0 ~ item ) {
                nb_items[item]++
            }
        }
    }
}
END { # at the end, print every item that has nb_items[item] >= 10
    for ( item in nb_items ) {
        if ( nb_items[item] >= 10 ) {
            print "for this item:", item
            for ( i = 1; i <= linematched; i++ ) {  # i must be reset for each item
                if ( matchinglines[i] ~ item ) {
                    print matchinglines[i]
                }
            }
        }
    }
}
' BIGFILE
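A quick way to try the approach (a throwaway sketch: the file names match the script above, but the contents are fabricated for the demo):

```shell
# One item in FIRSTFILE, 12 matching lines in BIGFILE, so the item
# clears the >= 10 threshold and its matching lines are printed.
printf 'GAGA\n' > FIRSTFILE
seq 1 12 | sed 's/^/xx_GAGA_line/' > BIGFILE
awk '
BEGIN {
    regexp = "__NOTMATCHING__"
    while ((getline line < "FIRSTFILE") > 0) {
        nb_items[line] = 0
        regexp = regexp "|" line
    }
}
$0 ~ regexp {
    matchinglines[++linematched] = $0
    for (item in nb_items) if ($0 ~ item) nb_items[item]++
}
END {
    for (item in nb_items)
        if (nb_items[item] >= 10) {
            print "for this item:", item
            for (i = 1; i <= linematched; i++)
                if (matchinglines[i] ~ item) print matchinglines[i]
        }
}' BIGFILE
# prints "for this item: GAGA" followed by the 12 matching lines
```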
Alejandro
Updated on September 18, 2022
Comments
-
Alejandro over 1 year
I have two files A and B. A looks like this (4 to 6 lines):
GAGA
CAGA
GGGT
TATT
file B is a really big file with thousands of lines, here is a short example:
AAATGTCAAGAGACAGAAATGTCAAGAGGGT
AAGGGGGTTTATAATCATAAATCAAAGAAAT
ATATACAGAAATGTCAAGAGACAGAAATGTC
TCAAGAGACAGAAATGTCAAGAGGGTCTATA
AAGAGGGTCTATAATCATAAATCAAAGAAAT
AAGAGGGTCTATAATCATAAATCAAAGAAAT
ATACAGAAATGTCAAAACAGAAATGTCAAGG
ATATACAGAATATACAGAAATGTCAAGTTAT
ACAGAATATACAGAAATGTCAAGTTATATAC
ATATACAGAAATGTCAAGAGACAGAAATGTC
TCAGAATATAGTATTCTATTATATACAGAAA
AATATAGTATTCTATTATATACAGAAATGTC
GAATATACAGAAATGTCAAGTTATATACAGA
TATACAGAATATAGTATTCTATTATATACAG
CAGAATATAGTATTCTATTATATACAGAATA
AGTTATATACAGAATATAGTATTCTATTATA
TACAGAATATAGTATTCTATTATATACAGAA
CAGAAATGTCAAGTTATATACAGAATATAGT
I need to search every string in file A in all the lines in file B, and recover the first 10 lines from file B that contain each string from A. I have tried grep and awk but not with good results. Thanks
-
Admin about 10 years
If a line in B contains both strings in A, will it print twice?
-
Admin about 10 years
What if the string falls at the end of a line? Something like GA\nGA?
-
-
orion about 10 years
+1 but I think the question wanted 10 lines for EACH line, so head maybe goes inside the loop.
-
Alejandro about 10 years
I forgot to mention that file A changes with time; it is not the same in every case. That is why I need something more "complex".
-
Lekensteyn about 10 years
@Alejandro It doesn't matter if it changes periodically, the second command can handle that. If you need to do some pre-processing, that is still possible. E.g. to print the last word if a line matches FOO, you would use a subcommand like $(awk '/FOO/{print $NF}' patt.txt | sed 's#$#|#' | tr ...
-
Olivier Dulac about 10 years
It could be optimized: each time you increment a matching nb_items counter, check whether all of them are at or above 10; if so, you can stop reading BIGFILE and go to the END section!
-
Alejandro about 10 years
Thanks! It seems to be working right. The only thing I noticed is that if I run it, then go and change the order of the strings inside file A (and save it), I get a weird result (more than ten lines, and they are not in order). Any idea why this is happening?
-
h3rrmiller about 10 years
Are you doing this while the loop is running in another terminal?
-
Alejandro about 10 years
No, I just waited until it was done, then changed file A and ran it again in the same terminal window.
-
airstrike about 10 years
@Alejandro, if I understand correctly, file A would be patt.txt
-
Lekensteyn about 10 years
Correct, patt.txt would be your "file A"; bigDNA.txt is "file B".