R extract part of string
Solution 1
Try this:
sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"
Solution 2
Assuming semicolons separate your elements, and equals signs occur exclusively between key/value pairs, a non-strictly-regex method would be:
bits <- unlist(strsplit(a, ';'))
do.call(rbind, strsplit(bits, '='))
      [,1] [,2]
 [1,] "DP" "26"
 [2,] "AN" "2"
 [3,] "DB" "1"
 [4,] "AC" "1"
 [5,] "MQ" "56"
 [6,] "MZ" "0"
 [7,] "ST" "5:10,7:2"
 [8,] "CQ" "SYNONYMOUS_CODING"
 [9,] "GN" "NOC2L"
[10,] "PA" "1^1:0.720&2^1:0"
Then it's just a matter of selecting the appropriate element.
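A minimal sketch of that selection step, assuming each key occurs only once in the string. The names kv and values are introduced here for illustration and are not part of the original answer:

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Split into "key=value" pieces, then into a two-column key/value matrix
bits <- unlist(strsplit(a, ";", fixed = TRUE))
kv <- do.call(rbind, strsplit(bits, "=", fixed = TRUE))

# Turn the matrix into a named character vector (hypothetical helper name)
values <- setNames(kv[, 2], kv[, 1])

# Look the value up by its key
values[["GN"]]
# [1] "NOC2L"
```

Indexing by name rather than by row number means the lookup does not depend on where GN sits in the string.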
Solution 3
One way would be:
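gsub(".+=(\\w+);.+", "\\1", a, perl = TRUE)
# Caveat: no key is named in the pattern, so this returns the last value made up
# only of word characters and followed by ";", which happens to be GN here.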
I am sure there are more elegant ways to do it.
Solution 4
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
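# Match "GN=" plus everything up to (not including) the next ";"
m <- regexpr("GN=[^;]*", a)
# Skip the 3 characters of "GN=" and keep the rest of the match
substr(a, m + 3, m + attr(m, "match.length") - 1)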
Solution 5
As the string comes from a VCF file, we can use the VariantAnnotation package:
library(VariantAnnotation)
# read dummy VCF file
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
# show the first 3 rows and first 5 variables of the info column
info(vcf)[1:3, 1:5]
# DataFrame with 3 rows and 5 columns
#                  LDAF   AVGPOST       RSQ     ERATE     THETA
#             <numeric> <numeric> <numeric> <numeric> <numeric>
# rs7410291      0.3431    0.9890    0.9856     2e-03    0.0005
# rs147922003    0.0091    0.9963    0.8398     5e-04    0.0011
# rs114143073    0.0098    0.9891    0.5919     7e-04    0.0008
# Now extract one column, e.g.: LDAF
info(vcf)[1:3, "LDAF"]
# [1] 0.3431 0.0091 0.0098
The example VCF object above has no "GN" column, but the idea is the same, so in your case the following should work:
# extract gene name
info(vcf)[, "GN"]
Lisann
Updated on July 09, 2022
Comments
-
Lisann almost 2 years
I have a question about extracting a part of a string. For example I have a string like this:
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
I need to extract everything between "GN=" and ";". So here it would be NOC2L. Is that possible?
Note: This is the INFO column from the VCF file format. GN is Gene Name, so we want to extract the gene name from the INFO column.
-
Lisann about 12 years
Thanks, Kohske. And what if NOC2L is at the end of the line? Then the whole line is selected!
-
kohske about 12 years
What does your string look like exactly? Could you please provide an example?
-
Lisann about 12 years
Like this: a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L"
-
kohske about 12 years
Try this:
sub(".*?GN=(.*?)(;.*|$)", "\\1", a)
-
Rotail about 8 years
Thanks for the question/answer. What if there is no such thing in "a"? In that case, I would like this to return NA. It doesn't as written. Any idea?
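One possible way to get NA when the key is missing, as a sketch rather than anything posted in the thread: use base R's regexec() and regmatches() instead of sub(), since sub() returns the input unchanged when nothing matches. The function name extract_gn is hypothetical, and the pattern assumes GN= appears either at the start of the string or right after a semicolon:

```r
# Hypothetical helper: return the GN value, or NA if there is no GN key
extract_gn <- function(x) {
  # Each list element holds the full match plus the two capture groups,
  # or character(0) when the pattern does not match
  hits <- regmatches(x, regexec("(^|;)GN=([^;]*)", x))
  vapply(
    hits,
    function(hit) if (length(hit) == 3) hit[3] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
extract_gn(a)
# [1] "NOC2L"

# GN at the end of the line, and no GN at all
extract_gn("DP=26;CQ=SYNONYMOUS_CODING;GN=NOC2L")
# [1] "NOC2L"
extract_gn("DP=26;AN=2")
# [1] NA
```

Because the value is captured with [^;]*, the same function also covers the end-of-line case discussed above, and it works on a whole vector of INFO strings at once.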