R extract part of string

r regex string bioinformatics vcf-variant-call-format

Solution 1

Try this:

sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"

Solution 2

Assuming semicolons separate your elements, and equals signs occur exclusively between key/value pairs, a non-strictly-regex method would be:

bits <- unlist(strsplit(a, ';'))
do.call(rbind, strsplit(bits, '='))

      [,1] [,2]               
 [1,] "DP" "26"               
 [2,] "AN" "2"                
 [3,] "DB" "1"                
 [4,] "AC" "1"                
 [5,] "MQ" "56"               
 [6,] "MZ" "0"                
 [7,] "ST" "5:10,7:2"         
 [8,] "CQ" "SYNONYMOUS_CODING"
 [9,] "GN" "NOC2L"            
[10,] "PA" "1^1:0.720&2^1:0"

Then it's just a matter of selecting the appropriate element.
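As a minimal sketch of that selection step, the key/value matrix can be turned into a named character vector and indexed by key. The names `kv` and `info` are illustrative only, and this assumes every element has exactly one `=`; a flag-style INFO entry with no value would break the `rbind` step.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Split into "key=value" pieces, then into a two-column matrix
bits <- unlist(strsplit(a, ";", fixed = TRUE))
kv <- do.call(rbind, strsplit(bits, "=", fixed = TRUE))

# Named vector: names are the keys (column 1), values are column 2
info <- setNames(kv[, 2], kv[, 1])

info[["GN"]]
# [1] "NOC2L"
```

Looking up by name rather than by row number keeps working if the order of the fields changes.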

Solution 3

One way would be:

# Anchor on the GN key so the result does not depend on the order of the fields
sub(".*GN=(\\w+).*", "\\1", a, perl = TRUE)

I am sure there are more elegant ways to do it.

Solution 4

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
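# Match "GN=" plus everything up to the next semicolon (or the end of the string)
m <- regexpr("GN=[^;]*", a)
# Skip the 3 characters of "GN=" and keep the rest of the match
substr(a, m + 3, m + attr(m, "match.length") - 1)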

Solution 5

Since the string comes from a VCF file, we can use the VariantAnnotation package:

library(VariantAnnotation)

# read dummy VCF file
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")

# see the first 3 rows and first 5 variables of the INFO column
info(vcf)[1:3, 1:5]
# DataFrame with 3 rows and 5 columns
#                  LDAF   AVGPOST       RSQ     ERATE     THETA
#             <numeric> <numeric> <numeric> <numeric> <numeric>
# rs7410291      0.3431    0.9890    0.9856     2e-03    0.0005
# rs147922003    0.0091    0.9963    0.8398     5e-04    0.0011
# rs114143073    0.0098    0.9891    0.5919     7e-04    0.0008

# Now extract one column, e.g.: LDAF
info(vcf)[1:3, "LDAF"]
# [1] 0.3431 0.0091 0.0098

The example VCF object above has no "GN" column, but the idea is the same, so in your case the following should work:

# extract gene name
info(vcf)[, "GN"]

Author by

Lisann

Updated on July 09, 2022

Comments

Lisann almost 2 years
I have a question about extracting a part of a string. For example I have a string like this:
```
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
```
I need to extract everything between GN= and ;. So here it will be NOC2L.

Is that possible?

Note: This is the INFO column from the VCF file format. GN is Gene Name, so we want to extract the gene name from the INFO column.
Lisann about 12 years

Thanks, Kohske. And what if NOC2L is at the end of the line? Then the whole line is selected!
kohske about 12 years

How is your string exactly? Could you please provide an example?
Lisann about 12 years

like this: a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L"
kohske about 12 years

try this: sub(".*?GN=(.*?)(;.*|$)", "\\1", a)
Rotail about 8 years

Thanks for the question/answer. What if there is no such key in "a"? In that case, I would like this to return NA. It doesn't as written. Any idea?
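
One possible way to get NA for a missing key is to test whether the pattern matched at all, instead of relying on `sub()`, which returns the input unchanged when nothing matches. The sketch below uses base R's `regexec()` and `regmatches()`; `extract_info` is a hypothetical helper name, not something from the answers above, and it assumes the key is plain alphanumeric text with no regex metacharacters.

```r
extract_info <- function(x, key = "GN") {
  # The key must start the string or follow a semicolon; the value runs
  # up to the next semicolon or the end of the string
  pattern <- paste0("(^|;)", key, "=([^;]*)")
  m <- regmatches(x, regexec(pattern, x))
  # No match gives character(0); on a match, the value is capture group 2,
  # which is the third element (after the full match and group 1)
  vapply(m, function(hit) if (length(hit) == 3) hit[3] else NA_character_,
         character(1))
}

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
extract_info(a)                  # "NOC2L"
extract_info("DP=26;GN=NOC2L")   # "NOC2L" (key at the end of the string)
extract_info("DP=26;AN=2")       # NA
```

Because the function is vectorised over `x`, it can be applied directly to a whole INFO column.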