counting multiple patterns in a single pass with grep?

6,541
IFS=$'\n'
gzip -dc file.gz | grep -v '^>' | grep -Foe "${tri[*]}" | sort | uniq -c

But by the way, AAAC matches both AAA and AAC, but grep -o will output only one of them. Is that what you want? Also, how many occurrences of AAA in AAAAAA? 2 or 4 ([AAA]AAA, A[AAA]AA, AA[AAA]A, AAA[AAA])?

Maybe you want instead:

gzip -dc file.gz | grep -v '^>' | fold -w3 | grep -Fxe "${tri[*]}" | sort | uniq -c

That is split the lines in groups of 3 characters and count the occurrences as full lines (would find 0 occurrence of AAA in ACAAATTCG (as that's ACA AAT TCG)).

Or on the other hand:

gzip -dc file.gz | awk '
  BEGIN{n=ARGC;ARGC=0}
  !/^>/ {l = length - 2; for (i = 1; i <= l; i++) a[substr($0,i,3)]++}
  END{for (i=1;i<n;i++) printf "%s: %d\n", ARGV[i], a[ARGV[i]]}' "${tri[@]}"

(would find 4 occurrences of AAA in AAAAAA).

Share:
6,541

Related videos on Youtube

Stephen Henderson
Author by

Stephen Henderson

Updated on September 18, 2022

Comments

  • Stephen Henderson
    Stephen Henderson almost 2 years

    I've written a grep loop to iteratively count DNA trinucleotides within a gzipped DNA fasta file containing DNA sequences e.g.

    declare -a tri=(AAA AAC AAG AAT CAA .. etc)
    
    for i in ${tri[@]}
    do
       gzip -cd gencode.v18.pc_transcripts.fa.gz | grep -v "^>" | grep -o $i | wc -l
    done
    

    Where the fasta file is in this format (though much much bigger)

    head test.fa
    >id1
    TTTTTAAAAA
    >id2
    GGGGGCCCCC
    etc..
    

    Whilst this works (i.e. counts occurrences of each trinucleotide) it is to my mind quite inefficient as it has to pass through the data 64 times (once for each possible trinucleotide).

    My question is how using bash or grep is there a way I can count each trinucleotide in a single pass through the file (as the files are quite large)?

    thx

    • Admin
      Admin over 10 years
    • Admin
      Admin over 10 years
      Fasta files are usually folded at 60 characters per line, this means that your grep will not report the counts accurately, since you are not taking into account line breaks. Also, do you need to take frames into account?
    • Admin
      Admin over 10 years
      these fasta are not folded, perhaps to avoid that sort of problem. I think the behaviour of 3rd awk solution below is what I want as I'm calculating the fraction of potential different mutation sites (hmmm probably) not looking at the actual codon code.
    • Admin
      Admin over 10 years
      Why would you be reading triplets to check mutation sites? All it takes is a single nt. Reading triplets implies codons and, if this is raw DNA data, you have 6 possible frames you need to check and you're only checking one. You might want to post a question on Biology or biostars.org or here (if you make it about the *nix side of things). I am not completely clear on what you are attempting but it sounds strange.
    • Admin
      Admin over 10 years
      @terdon Yes but different types of mutagen cause point mutations at quite different DNA contexts triplets e.g UV, smoking, age, mismatch repair, not at all randomly across single sites. To deconvolute the relative effect size of mutagen signals you also need to (OK, I need to) estimate the empirical fractions of these different sites (amongst other things). A similar approach is explained here: go.nature.com/PRdkuZ
    • Admin
      Admin over 10 years
      My point exactly, to do that type of analysis, you'll need to treat ACTGA as ACT,CTG and TGA since you are looking at the context of each central nucleotide.
    • Admin
      Admin over 10 years
      @terdon Yes, that's exactly what the awk script below does.
    • Admin
      Admin over 10 years
      Watch out that, presumably, you don't want to count the line CATTAG as including ATT. Stephane's answer addresses this issue.
    • Admin
      Admin over 10 years
      so it does, sorry I thought you were using the fold approach.
  • Stephen Henderson
    Stephen Henderson over 10 years
    thx, I will test this now. Those are good points my previous loop would count AAA and AAC but was intended to count 2 AAA not 4 in AAAAAA. But now you mention it these seem incompatible to me. I will try and test your suggestions to get the behaviour I need. Many thx again.
  • Stephen Henderson
    Stephen Henderson over 10 years
    Thx again that's actually more than answered my first question and pushed on to all sorts of different approaches. I'd up vote more if I could :)