counting multiple patterns in a single pass with grep?
IFS=$'\n'
gzip -dc file.gz | grep -v '^>' | grep -Foe "${tri[*]}" | sort | uniq -c
But by the way, AAAC
matches both AAA
and AAC
, but grep -o
will output only one of them. Is that what you want? Also, how many occurrences of AAA
in AAAAAA
? 2 or 4 ([AAA]AAA
, A[AAA]AA
, AA[AAA]A
, AAA[AAA]
)?
Maybe you want instead:
gzip -dc file.gz | grep -v '^>' | fold -w3 | grep -Fxe "${tri[*]}" | sort | uniq -c
That is split the lines in groups of 3 characters and count the occurrences as full lines (would find 0 occurrence of AAA
in ACAAATTCG
(as that's ACA AAT TCG
)).
Or on the other hand:
gzip -dc file.gz | awk '
BEGIN{n=ARGC;ARGC=0}
!/^>/ {l = length - 2; for (i = 1; i <= l; i++) a[substr($0,i,3)]++}
END{for (i=1;i<n;i++) printf "%s: %d\n", ARGV[i], a[ARGV[i]]}' "${tri[@]}"
(would find 4 occurrences of AAA
in AAAAAA
).
Related videos on Youtube
Stephen Henderson
Updated on September 18, 2022Comments
-
Stephen Henderson almost 2 years
I've written a grep loop to iteratively count DNA trinucleotides within a gzipped DNA fasta file containing DNA sequences e.g.
declare -a tri=(AAA AAC AAG AAT CAA .. etc) for i in ${tri[@]} do gzip -cd gencode.v18.pc_transcripts.fa.gz | grep -v "^>" | grep -o $i | wc -l done
Where the fasta file is in this format (though much much bigger)
head test.fa >id1 TTTTTAAAAA >id2 GGGGGCCCCC etc..
Whilst this works (i.e. counts occurrences of each trinucleotide) it is to my mind quite inefficient as it has to pass through the data 64 times (once for each possible trinucleotide).
My question is how using
bash
orgrep
is there a way I can count each trinucleotide in a single pass through the file (as the files are quite large)?thx
-
Admin over 10 yearsyou could refer to superuser.com/questions/529821/grep-count-multiple-occurrences
-
Admin over 10 yearsFasta files are usually folded at 60 characters per line, this means that your grep will not report the counts accurately, since you are not taking into account line breaks. Also, do you need to take frames into account?
-
Admin over 10 yearsthese fasta are not folded, perhaps to avoid that sort of problem. I think the behaviour of 3rd awk solution below is what I want as I'm calculating the fraction of potential different mutation sites (hmmm probably) not looking at the actual codon code.
-
Admin over 10 yearsWhy would you be reading triplets to check mutation sites? All it takes is a single nt. Reading triplets implies codons and, if this is raw DNA data, you have 6 possible frames you need to check and you're only checking one. You might want to post a question on Biology or biostars.org or here (if you make it about the *nix side of things). I am not completely clear on what you are attempting but it sounds strange.
-
Admin over 10 years@terdon Yes but different types of mutagen cause point mutations at quite different DNA contexts triplets e.g UV, smoking, age, mismatch repair, not at all randomly across single sites. To deconvolute the relative effect size of mutagen signals you also need to (OK, I need to) estimate the empirical fractions of these different sites (amongst other things). A similar approach is explained here: go.nature.com/PRdkuZ
-
Admin over 10 yearsMy point exactly, to do that type of analysis, you'll need to treat
ACTGA
asACT
,CTG
andTGA
since you are looking at the context of each central nucleotide. -
Admin over 10 years@terdon Yes, that's exactly what the awk script below does.
-
Admin over 10 yearsWatch out that, presumably, you don't want to count the line
CATTAG
as includingATT
. Stephane's answer addresses this issue. -
Admin over 10 yearsso it does, sorry I thought you were using the fold approach.
-
-
Stephen Henderson over 10 yearsthx, I will test this now. Those are good points my previous loop would count AAA and AAC but was intended to count 2 AAA not 4 in AAAAAA. But now you mention it these seem incompatible to me. I will try and test your suggestions to get the behaviour I need. Many thx again.
-
Stephen Henderson over 10 yearsThx again that's actually more than answered my first question and pushed on to all sorts of different approaches. I'd up vote more if I could :)