Compare two text files and find matching lines


Solution 1

Since your patterns are only four to six lines, why not use them in an OR pattern? An example that operates on a second file "bigDNA.txt" and stops after 10 matching lines in total (not 10 per pattern):

grep -E 'GAGA|CAGA|GGGT|TATT' -m 10 bigDNA.txt

To avoid typing the patterns manually, you can build the expression from the file patt.txt. The command below joins its lines with | (append | to each line, remove the newlines, remove the trailing |):

grep -E "$(sed 's#$#|#' patt.txt | tr -d '\n' | sed 's#|$##')" -m 10 bigDNA.txt
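As a possible alternative to the sed/tr pipeline above, paste can join the lines in a single step. This is only a minimal sketch: it assumes patt.txt has no empty lines and no regex metacharacters, and the temporary directory and sample data are made up for illustration.

```shell
#!/bin/sh
# Sample data (hypothetical, for illustration only).
tmp=$(mktemp -d)
printf '%s\n' GAGA CAGA GGGT TATT > "$tmp/patt.txt"
printf '%s\n' AAATGTCAAGAGACAG TTTTTTTT CCCTATTCCC > "$tmp/bigDNA.txt"

# paste -s joins all lines of the file and -d '|' uses | as the separator,
# so there is no trailing | to remove afterwards.
regex=$(paste -s -d '|' "$tmp/patt.txt")
echo "$regex"

# Equivalent grep call, using the generated expression.
matches=$(grep -m 10 -E "$regex" "$tmp/bigDNA.txt")
echo "$matches"
rm -rf "$tmp"
```

The generated expression can be checked with echo before it is handed to grep, which helps when patt.txt changes between runs.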

Solution 2

Have you tried iterating through "file A" with a while loop?

while read string
 do grep "$string" file-B | head -10
done < file-A

Or in one line:

while read string; do grep "$string" file-B | head -10; done < file-A

Solution 3

This will print the first 10 lines that match any of your strings:

grep -m 10 -Ff motifs sequence.fa 

This one will read each motif and print the first ten lines matching it, so it will print up to 10 lines for each motif:

while read mot; do grep -m 10 "$mot" sequence.fa; done < motifs
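The difference between the two commands above can be seen on a small example. This is a sketch with made-up data and a hypothetical `limit` variable set to 2 instead of 10, so the effect is visible on five lines. It also uses `-F` and `IFS= read -r` in the loop, which is an assumption about what you want (fixed strings, motif lines read verbatim).

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' GAGA TATT > "$tmp/motifs"
printf '%s\n' AAGAGA CCGAGA GGGAGA AATATT CCTATT > "$tmp/sequence.fa"
limit=2   # small limit so the difference shows on five lines

# One grep for all motifs: stops after $limit matching lines in total.
total=$(grep -m "$limit" -Ff "$tmp/motifs" "$tmp/sequence.fa")

# One grep per motif: up to $limit lines for each motif.
# IFS= and -r keep each line of the motif file intact.
per_motif=$(while IFS= read -r mot; do
    grep -m "$limit" -F -- "$mot" "$tmp/sequence.fa"
done < "$tmp/motifs")

printf 'total:\n%s\nper motif:\n%s\n' "$total" "$per_motif"
rm -rf "$tmp"
```

With this data the single grep never reaches the TATT lines, while the loop reports two lines for each motif.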

However, that looks like a DNA sequence, which means that the line breaks are completely arbitrary and you can have matches like this:

ACTG GA
GA

With these approaches, the GAGA above will not count as a match, which is probably not what you want. Since you are asking for matching lines, I assume you want each of these motifs in its context. To match motifs that are split across newlines, first put each sequence on a single line by converting your file to TBL format. I've been using the same little awk script written by a colleague (thanks Pep) for years:

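#!/bin/sh
# Convert FASTA to TBL: one record per line, "identifier<TAB>sequence".
gawk '{
    if (substr($1,1,1)==">") {
        # Header line: start a new record (after a newline, unless it is the first).
        if (NR>1)
            printf "\n%s\t", substr($0,2,length($0)-1)
        else
            printf "%s\t", substr($0,2,length($0)-1)
    } else {
        # Sequence line: append to the current record without a newline.
        printf "%s", $0
    }
}END{printf "\n"}'  "$@"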

Save the script above as FastaToTbl somewhere in your $PATH (/usr/local/bin for example) and make it executable (chmod a+x /usr/local/bin/FastaToTbl). Then, you can simply pipe FASTA format sequences and it will print out .tbl format, where the identifier and the sequence are all on the same line.

So, once you have FastaToTbl set up, you can run:

while read mot; 
do 
    FastaToTbl sequence.fa | grep -Po ".{10}$mot.{10}" | head -n 10 
done < motifs   

The above will give you the first 10 matches for each pattern and will also match motifs that are split across newlines. It also prints the 10 characters on either side of the matched pattern; change the {10} to another number to control this behavior. Note that a motif with fewer than 10 characters before or after it on the line will not be reported.
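To illustrate why joining the lines matters, here is a small sketch on a made-up FASTA record. It does not use FastaToTbl. As a simplified stand-in, it drops the header lines and joins the rest with tr, which is only reasonable for a single-record file. It also uses `grep -Eo` with `{0,3}` rather than `-P` with `{10}`, so that it runs where PCRE support is missing and still reports motifs near the ends of the sequence.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical FASTA record whose sequence wraps in the middle of GAGA.
printf '%s\n' '>seq1' ACTGGA GATTC > "$tmp/sequence.fa"

# Line-by-line search: neither sequence line contains GAGA.
# (|| true because grep exits non-zero when nothing matches.)
linewise=$(grep -c GAGA "$tmp/sequence.fa" || true)

# Stand-in for FastaToTbl: drop header lines and join the rest.
joined=$(grep -v '^>' "$tmp/sequence.fa" | tr -d '\n')

# Search the joined sequence, keeping up to 3 characters of context.
context=$(printf '%s\n' "$joined" | grep -Eo '.{0,3}GAGA.{0,3}')

echo "line-wise matching lines: $linewise"
echo "joined sequence: $joined"
echo "match with context: $context"
rm -rf "$tmp"
```

The line-wise count is 0, while the joined sequence yields one match with its surrounding characters.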

Solution 4

Here is a hopefully readable script.

FIRSTFILE contains one item per line (with no extra spaces, etc.). BIGFILE contains the big list that you want to match against.

awk -F, '
  BEGIN{
     regexp="__NOTMATCHING__"
     linematched=0
     while(( getline line<"FIRSTFILE") > 0 ) {
        nb_items[line]=0; #initialise a counter in items["...."]
        regexp=regexp"|"line  #we create a "egrep-like" regexp matching each item
     }

  }

#main : read each line. 
#           - save each matching lines. 
#           - and increment each corresponding counters. 
  {  if ( $0 ~ regexp ) {
        matchinglines[++linematched]=$0
        for ( item in nb_items ) {
           #for each matching item, we also increment that item s number
           if ( $0 ~ item ) { 
              nb_items[item]++ ; 
           }
        }
     }
  }

END  {  #at the end, we print all items which have nb_item[item]>=10
        for ( item in nb_items ) {
           if (nb_items[item] >= 10) {
              print "for this item:",item
              for (i=1;i<=linematched;i++) {  #start at 1 for every item
                 if ( matchinglines[i] ~ item ) {
                    print matchinglines[i] ; 
                 }
              }
           }
        }
     }
   ' BIGFILE
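If the goal is the first 10 lines per item rather than every line for items with at least 10 hits, a variant along the following lines might be closer to the question. This is a sketch, not the script above: it reads both files as arguments (the NR == FNR idiom), swaps the regex match for a fixed-string test with index(), and stops reading once every item has reached its limit. The `max` variable and the sample data are hypothetical (max is 2 here to keep the output short), and FIRSTFILE is assumed to be non-empty.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' GAGA TATT > "$tmp/FIRSTFILE"
printf '%s\n' AAGAGA CCGAGA GGGAGA AATATT > "$tmp/BIGFILE"

# max is a hypothetical variable name; 2 keeps the demo output short.
result=$(awk -v max=2 '
  # First file: remember every item and the order it appeared in.
  NR == FNR { items[++n] = $0; next }
  # Second file: store a line for each item it contains, up to max per item.
  {
    for (i = 1; i <= n; i++) {
      if (count[i] < max && index($0, items[i]) > 0) {
        hits[i, ++count[i]] = $0
        if (count[i] == max) full++
      }
    }
    if (full == n) exit   # every item has max lines: stop reading
  }
  END {
    for (i = 1; i <= n; i++) {
      print "for this item:", items[i]
      for (j = 1; j <= count[i]; j++) print hits[i, j]
    }
  }
' "$tmp/FIRSTFILE" "$tmp/BIGFILE")
echo "$result"
rm -rf "$tmp"
```

Items are reported in the order they appear in FIRSTFILE, and a line that contains several items is listed under each of them.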
Alejandro
Author by

Alejandro

Updated on September 18, 2022

Comments

  • Alejandro
    Alejandro over 1 year

    I have two files A and B. A looks like this (4 to 6 lines):

    GAGA
    CAGA
    GGGT
    TATT
    

    file B is a really big file with thousands of lines, here is a short example:

    AAATGTCAAGAGACAGAAATGTCAAGAGGGT
    AAGGGGGTTTATAATCATAAATCAAAGAAAT
    ATATACAGAAATGTCAAGAGACAGAAATGTC
    TCAAGAGACAGAAATGTCAAGAGGGTCTATA
    AAGAGGGTCTATAATCATAAATCAAAGAAAT
    AAGAGGGTCTATAATCATAAATCAAAGAAAT
    ATACAGAAATGTCAAAACAGAAATGTCAAGG
    ATATACAGAATATACAGAAATGTCAAGTTAT
    ACAGAATATACAGAAATGTCAAGTTATATAC
    ATATACAGAAATGTCAAGAGACAGAAATGTC
    TCAGAATATAGTATTCTATTATATACAGAAA
    AATATAGTATTCTATTATATACAGAAATGTC
    GAATATACAGAAATGTCAAGTTATATACAGA
    TATACAGAATATAGTATTCTATTATATACAG
    CAGAATATAGTATTCTATTATATACAGAATA
    AGTTATATACAGAATATAGTATTCTATTATA
    TACAGAATATAGTATTCTATTATATACAGAA
    CAGAAATGTCAAGTTATATACAGAATATAGT
    

    I need to search every string in file A in all the lines in file B, and recover the first 10 lines from file B that contain each string from A. I have tried grep and awk but not with good results. Thanks

    • Admin
      Admin about 10 years
      If a line in B contains both strings in A, it will print two times?
    • Admin
      Admin about 10 years
      What if the string falls at the end of a line? Something like GA\nGA?
  • orion
    orion about 10 years
    +1 but I think the question wanted 10 lines for EACH line, so head maybe goes inside the loop.
  • Alejandro
    Alejandro about 10 years
    I forgot to mention that file A changes over time; it is not the same in every case. That is why I need something more "complex".
  • Lekensteyn
    Lekensteyn about 10 years
    @Alejandro It doesn't matter if it changes periodically, the second command can handle that. If you need to do some pre-processing, that is still possible. E.g. to print the last word if a line matches FOO, you would use a subcommand like $(awk '/FOO/{print $NF}' patt.txt | sed 's#$#|#' | tr ...
  • Olivier Dulac
    Olivier Dulac about 10 years
    It could be optimized: each time you increment a matching nb_items counter, if all of them are at 10 or above, you can stop reading BIGFILE and go to the END section!
  • Alejandro
    Alejandro about 10 years
    Thanks! It seems to be working right. The only thing I noticed is that if I run it, then change the order of the strings inside file A (and save it), I get a weird result (more than ten lines, and they are not in order). Any idea why this is happening?
  • h3rrmiller
    h3rrmiller about 10 years
    Are you doing this while the loop is running in another terminal?
  • Alejandro
    Alejandro about 10 years
    No, I just waited until it was done and then change file A, and run it again, in the same terminal window
  • airstrike
    airstrike about 10 years
    @Alejandro, if I understand correctly, file A would be patt.txt
  • Lekensteyn
    Lekensteyn about 10 years
    Correct, patt.txt would be your "file A". bigDNA.txt is "file B".