counting multiple patterns in a single pass with grep?

bash shell-script text-processing grep bioinformatics

6,541

IFS=$'\n'
gzip -dc file.gz | grep -v '^>' | grep -Foe "${tri[*]}" | sort | uniq -c

But by the way, AAAC matches both AAA and AAC, but grep -o will output only one of them. Is that what you want? Also, how many occurrences of AAA in AAAAAA? 2 or 4 ([AAA]AAA, A[AAA]AA, AA[AAA]A, AAA[AAA])?

Maybe you want instead:

gzip -dc file.gz | grep -v '^>' | fold -w3 | grep -Fxe "${tri[*]}" | sort | uniq -c

That is split the lines in groups of 3 characters and count the occurrences as full lines (would find 0 occurrence of AAA in ACAAATTCG (as that's ACA AAT TCG)).

Or on the other hand:

gzip -dc file.gz | awk '
  BEGIN{n=ARGC;ARGC=0}
  !/^>/ {l = length - 2; for (i = 1; i <= l; i++) a[substr($0,i,3)]++}
  END{for (i=1;i<n;i++) printf "%s: %d\n", ARGV[i], a[ARGV[i]]}' "${tri[@]}"

(would find 4 occurrences of AAA in AAAAAA).

6,541

Stephen Henderson

Updated on September 18, 2022

Comments

Stephen Henderson almost 2 years
I've written a grep loop to iteratively count DNA trinucleotides within a gzipped DNA fasta file containing DNA sequences e.g.
```
declare -a tri=(AAA AAC AAG AAT CAA .. etc)

for i in ${tri[@]}
do
   gzip -cd gencode.v18.pc_transcripts.fa.gz | grep -v "^>" | grep -o $i | wc -l
done
```
Where the fasta file is in this format (though much much bigger)
```
head test.fa
>id1
TTTTTAAAAA
>id2
GGGGGCCCCC
etc..
```
Whilst this works (i.e. counts occurrences of each trinucleotide) it is to my mind quite inefficient as it has to pass through the data 64 times (once for each possible trinucleotide).

My question is how using bash or grep is there a way I can count each trinucleotide in a single pass through the file (as the files are quite large)?

thx
- Admin over 10 years
  
  you could refer to superuser.com/questions/529821/grep-count-multiple-occurrenc‌es
- Admin over 10 years
  
  Fasta files are usually folded at 60 characters per line, this means that your grep will not report the counts accurately, since you are not taking into account line breaks. Also, do you need to take frames into account?
- Admin over 10 years
  
  these fasta are not folded, perhaps to avoid that sort of problem. I think the behaviour of 3rd awk solution below is what I want as I'm calculating the fraction of potential different mutation sites (hmmm probably) not looking at the actual codon code.
- Admin over 10 years
  
  Why would you be reading triplets to check mutation sites? All it takes is a single nt. Reading triplets implies codons and, if this is raw DNA data, you have 6 possible frames you need to check and you're only checking one. You might want to post a question on Biology or biostars.org or here (if you make it about the *nix side of things). I am not completely clear on what you are attempting but it sounds strange.
- Admin over 10 years
  
  @terdon Yes but different types of mutagen cause point mutations at quite different DNA contexts triplets e.g UV, smoking, age, mismatch repair, not at all randomly across single sites. To deconvolute the relative effect size of mutagen signals you also need to (OK, I need to) estimate the empirical fractions of these different sites (amongst other things). A similar approach is explained here: go.nature.com/PRdkuZ
- Admin over 10 years
  
  My point exactly, to do that type of analysis, you'll need to treat ACTGA as ACT,CTG and TGA since you are looking at the context of each central nucleotide.
- Admin over 10 years
  
  @terdon Yes, that's exactly what the awk script below does.
- Admin over 10 years
  
  Watch out that, presumably, you don't want to count the line CATTAG as including ATT. Stephane's answer addresses this issue.
- Admin over 10 years
  
  so it does, sorry I thought you were using the fold approach.
Stephen Henderson over 10 years

thx, I will test this now. Those are good points my previous loop would count AAA and AAC but was intended to count 2 AAA not 4 in AAAAAA. But now you mention it these seem incompatible to me. I will try and test your suggestions to get the behaviour I need. Many thx again.
Stephen Henderson over 10 years

Thx again that's actually more than answered my first question and pushed on to all sorts of different approaches. I'd up vote more if I could :)