Compare two text files and find matching lines

grep awk search string

26,513

Solution 1

Since your pattern file is only four to six lines long, why not combine the patterns into a single OR expression? For example, this stops after the first 10 matching lines of the second file, "bigDNA.txt":

grep -E 'GAGA|CAGA|GGGT|TATT' -m 10 bigDNA.txt

To avoid typing the patterns by hand, you can build the same expression from the file patt.txt. The command below joins its lines with | (append | to each line, remove the newlines, then strip the trailing |):

grep -E "$(sed 's#$#|#' patt.txt | tr -d '\n' | sed 's#|$##')" -m 10 bigDNA.txt
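A possibly simpler way to build the same alternation is paste, which joins lines with a chosen delimiter. This is only a sketch: the function name build_pattern is made up for illustration, and blank lines are dropped first on the assumption that an empty alternative is unwanted (in GNU grep, for instance, it can match every line).

```shell
# Hypothetical helper: join the non-empty lines of a pattern file with "|".
build_pattern() {
    grep -v '^$' "$1" | paste -sd '|' -
}

# Usage, with the file names from the answer above:
#   grep -E -m 10 "$(build_pattern patt.txt)" bigDNA.txt
```

As with the sed version, the lines of patt.txt are treated as extended regular expressions, so any regex metacharacters in them keep their special meaning.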

Solution 2

Have you tried iterating through "file A" with a while loop?

while read string
 do grep "$string" file-B | head -10
done < file-A

Or in one line:

while read string; do grep "$string" file-B | head -10; done < file-A
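A slightly more defensive variant of the same loop might look like the sketch below. The function name first_matches and its arguments are illustrative, not part of the original answer. It assumes a grep that supports -m (GNU and BSD grep do, though POSIX does not require it), reads each line literally with IFS= read -r, and uses -F and -- so that a pattern is matched as a fixed string even if it starts with a dash.

```shell
# Hypothetical helper: for each line of pattern_file, print up to `limit`
# lines of data_file that contain it as a fixed string.
first_matches() {
    pattern_file=$1
    data_file=$2
    limit=${3:-10}
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while IFS= read -r string || [ -n "$string" ]; do
        [ -n "$string" ] || continue    # skip blank lines
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -F -m "$limit" -- "$string" "$data_file" || true
    done < "$pattern_file"
}

# Usage, with the file names from the answer above:
#   first_matches file-A file-B 10
```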

Solution 3

This will print the 1st 10 lines that match any of your strings:

grep -m 10 -Ff motifs sequence.fa

This one will read each motif and print the first ten lines matching it, so it will print up to 10 lines for each motif:

while read mot; do grep -m 10 "$mot" sequence.fa; done < motifs

However, that looks like a DNA sequence, which means that the line breaks are completely arbitrary and you can have matches like this:

ACTG GA
GA

With these approaches, the GAGA above will not count as a match and this is probably not what you want. Instead, I suggest you put everything in a single line before you search. Since you are asking for matching lines, I assume you want each of these motifs in their context. So, to do this properly, matching motifs that are split across newlines, first transform your file to TBL format. I've been using the same little awk script written by a colleague (thanks Pep) for years:

#!/bin/sh
gawk '{
    if (substr($1,1,1)==">") {
        if (NR>1)
            printf "\n%s\t", substr($0,2,length($0)-1)
        else
            printf "%s\t", substr($0,2,length($0)-1)
    } else {
        printf "%s", $0
    }
}END{printf "\n"}'  "$@"

Save the script above as FastaToTbl somewhere in your $PATH (/usr/local/bin for example) and make it executable (chmod a+x /usr/local/bin/FastaToTbl). Then you can simply pipe FASTA-format sequences into it (or pass file names as arguments) and it will print them in .tbl format, where the identifier and the sequence are on the same line, separated by a tab.
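To make the format concrete, here is a small sketch of the same conversion written as a shell function around plain awk. The name fasta_to_tbl is hypothetical and the code is an approximation of the script above, not a replacement for it. It assumes every record starts with a ">" header line.

```shell
# Hypothetical inline approximation of FastaToTbl, using plain awk.
fasta_to_tbl() {
    awk '
        /^>/ {                            # header line: start a new record
            if (seen++) printf "\n"       # close the previous record first
            printf "%s\t", substr($0, 2)  # identifier without ">", then a tab
            next
        }
        { printf "%s", $0 }               # sequence line: append, no newline
        END { if (NR) printf "\n" }       # close the last record
    ' "$@"
}

# Example: a two-record FASTA stream becomes two tab-separated lines.
#   printf '>seq1\nACTG\nGA\n>seq2\nTT\n' | fasta_to_tbl
```

With that input, the sequence of seq1 ends up joined as ACTGGA on a single line, which is what allows a motif spanning the original line break to be found.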

So, once you have FastaToTbl set up, you can run:

while read mot; 
do 
    FastaToTbl sequence.fa | grep -Po ".{10}$mot.{10}" | head -n 10 
done < motifs

The above will give you the first 10 matches for each pattern and will also match motifs that are split across newlines. It also prints the 10 characters on either side of the matched pattern. Change the {10} to another number to control this behavior.
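If the sequence file has no FASTA headers at all (as in the sample shown in the question), a rough alternative is to delete the newlines with tr before searching. The sketch below makes that assumption. The function name motif_context is invented for illustration, and it uses grep -oE with .{0,N} instead of -P with .{N}, so that a motif closer than N characters to either end of the sequence is still reported.

```shell
# Hypothetical helper: print up to `limit` occurrences of a motif, each with
# up to `flank` characters of context, ignoring line breaks in a
# headerless sequence file.
motif_context() {
    motif=$1
    file=$2
    flank=${3:-10}
    limit=${4:-10}
    tr -d '\n' < "$file" |
        grep -oE ".{0,$flank}$motif.{0,$flank}" |
        head -n "$limit"
}

# Usage, one call per motif:
#   while IFS= read -r mot; do motif_context "$mot" sequence.fa; done < motifs
```

Because the whole file becomes one long line, this reports occurrences rather than original lines, and matches found by grep -o do not overlap.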

Solution 4

Here is a hopefully readable script.

FIRSTFILE contains one item per line (with no extra spaces, etc.). BIGFILE contains the big list that you want to match against.

awk -F, '
  BEGIN{
     regexp="__NOTMATCHING__"
     linematched=0
     while(( getline line<"FIRSTFILE") > 0 ) {
        nb_items[line]=0; #initialise a counter in items["...."]
        regexp=regexp"|"line  #we create a "egrep-like" regexp matching each item
     }

  }

#main : read each line. 
#           - save each matching lines. 
#           - and increment each corresponding counters. 
  {  if ( $0 ~ regexp ) {
        matchinglines[++linematched]=$0
        for ( item in nb_items ) {
           #for each matching item, we also increment that item s number
           if ( $0 ~ item ) { 
              nb_items[item]++ ; 
           }
        }
     }
  }

END  {  #at the end, we print all items which have nb_item[item]>=10
        for ( item in nb_items ) {
           if (nb_items[item] >= 10) {
              print "for this item:",item
              for (i=1;i<=linematched;i++) {  #restart from the first saved line for every item
                 if ( matchinglines[i] ~ item ) {
                    print matchinglines[i] ; 
                 }
              }
           }
        }
     }
   ' BIGFILE
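The question asks for only the first 10 lines per item, and reading can stop as soon as every item has that many. A sketch of that variant follows. The function name first_n_per_item is made up, the items are matched as fixed strings with index() rather than as regular expressions, and the NR == FNR idiom assumes the item file is not empty.

```shell
# Hypothetical helper: print the first `limit` lines of the big file for each
# item, stopping early once every item has enough lines.
first_n_per_item() {
    awk -v limit="${3:-10}" '
        NR == FNR {                       # first file: load the items
            if ($0 != "") items[++n] = $0
            next
        }
        {                                 # second file: collect matching lines
            for (i = 1; i <= n; i++) {
                if (count[i] < limit && index($0, items[i])) {
                    hits[i, ++count[i]] = $0
                    if (count[i] == limit) full++
                }
            }
            if (full == n) exit           # every item has enough lines
        }
        END {
            for (i = 1; i <= n; i++) {
                print "for this item:", items[i]
                for (j = 1; j <= count[i]; j++) print hits[i, j]
            }
        }
    ' "$1" "$2"
}

# Usage:
#   first_n_per_item FIRSTFILE BIGFILE 10
```

Unlike the script above, this prints every item (even one with fewer than `limit` matches) and keeps the items in the order they appear in FIRSTFILE.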

Alejandro

Updated on September 18, 2022

Comments

Alejandro over 1 year

I have two files A and B. A looks like this (4 to 6 lines):

GAGA
CAGA
GGGT
TATT

file B is a really big file with thousands of lines, here is a short example:

AAATGTCAAGAGACAGAAATGTCAAGAGGGT
AAGGGGGTTTATAATCATAAATCAAAGAAAT
ATATACAGAAATGTCAAGAGACAGAAATGTC
TCAAGAGACAGAAATGTCAAGAGGGTCTATA
AAGAGGGTCTATAATCATAAATCAAAGAAAT
AAGAGGGTCTATAATCATAAATCAAAGAAAT
ATACAGAAATGTCAAAACAGAAATGTCAAGG
ATATACAGAATATACAGAAATGTCAAGTTAT
ACAGAATATACAGAAATGTCAAGTTATATAC
ATATACAGAAATGTCAAGAGACAGAAATGTC
TCAGAATATAGTATTCTATTATATACAGAAA
AATATAGTATTCTATTATATACAGAAATGTC
GAATATACAGAAATGTCAAGTTATATACAGA
TATACAGAATATAGTATTCTATTATATACAG
CAGAATATAGTATTCTATTATATACAGAATA
AGTTATATACAGAATATAGTATTCTATTATA
TACAGAATATAGTATTCTATTATATACAGAA
CAGAAATGTCAAGTTATATACAGAATATAGT

I need to search every string in file A in all the lines in file B, and recover the first 10 lines from file B that contain each string from A. I have tried grep and awk but not with good results. Thanks

Admin about 10 years

If a line in B contains both strings in A, will it be printed twice?

Admin about 10 years

What if the string falls at the end of a line? Something like GA\nGA?

orion about 10 years

+1 but I think the question wanted 10 lines for EACH line, so head maybe goes inside the loop.

Alejandro about 10 years

I forgot to mention that file A changes over time; it is not the same in every case. That is why I need something more "complex".

Lekensteyn about 10 years

@Alejandro It doesn't matter if it changes periodically, the second command can handle that. If you need to do some pre-processing, that is still possible. E.g. to print the last word if a line matches FOO, you would use a subcommand like $(awk '/FOO/{print $NF}' patt.txt | sed 's#$#|#' | tr ...

Olivier Dulac about 10 years

It could be optimized: each time you increment a matching nb_items, if all are above 10, you can stop reading BIGFILE and go to the END section!

Alejandro about 10 years

Thanks! It seems to be working right. The only thing I noticed is that if I run it, and then go and change the order of the strings inside file A (and save it), I get a weird result (more than ten lines, and they are not in order). Any idea why this is happening?

h3rrmiller about 10 years

Are you doing this while the loop is running in another terminal?

Alejandro about 10 years

No, I just waited until it was done, then changed file A and ran it again in the same terminal window.

airstrike about 10 years

@Alejandro, if I understand correctly, file A would be patt.txt

Lekensteyn about 10 years

Correct, patt.txt would be your "file A". bigDNA.txt is "file B".