Split a column of values delimited by colons into separate columns for each value
Solution 1
call that table vcf
vcf.info <- data.frame(t(sapply(vcf[,2], function(y) strsplit(y,split=":")[[1]])))
then cbind
that with the original vcf column(s) that you want
vcf.info2 <- cbind(vcf[,1],vcf.info)
but in a real vcf I would
vcf.info2 <- cbind(vcf[,c(1,2,4,5,6,8,9)],vcf.info)
Something else you may find useful, in this case I am just getting the read depth, replace n with however many samples you have, and the 3 with 1 to 5 for GT,AD,DP,GQ,PL
selectReadDepth <- apply(vcf[,10:n],2,function(x) sapply(x, function(y) strsplit(y,split=":")[[1]][3]))
Solution 2
I've included a family of functions called concat.split
in my "splitstackshape" package, one of which is concat.split.multiple
. Under the hood, it is like @agstudy's answer, but allows you to split multiple columns at once.
Usage is simple:
library(splitstackshape)
### Three required arguments: The input dataset,
### a vector of the columns that need to be split up
### (can also be the numeric column position), and the
### separator that should be used (can be different
### for each column).
concat.split.multiple(data = dat, split.cols = c("V2"), seps = ":")
# V1 V2_1 V2_2 V2_3 V2_4 V2_5
# 1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
# 2 GT:AD:DP:GQ:PL 1/1 2,42 44 16 288,16,0
# 3 GT:AD:DP:GQ:PL 1/1 3,37 40 14 147,14,0
# 4 GT:AD:DP:GQ:PL 0/1 7,50 57 55 250,0,55
See also this answer and this Gist for an idea for where the development of the function might be headed. The "data.table" variant will be much faster on larger datasets, but the data must be "rectangular" (that is, the resulting number of columns after the split must be balanced).
Solution 3
Another way to do it
data.frame(DF$V1, do.call(rbind, strsplit(DF$V2, split = ":", fixed = TRUE)))
## DF.V1 X1 X2 X3 X4 X5
## 1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
## 2 GT:AD:DP:GQ:PL 1/1 2,42 44 16 288,16,0
## 3 GT:AD:DP:GQ:PL 1/1 3,37 40 14 147,14,0
## 4 GT:AD:DP:GQ:PL 0/1 7,50 57 55 250,0,55
Solution 4
Using read.table
twice with 2 different separators:
txt = ' V1 V2
1 GT:AD:DP:GQ:PL 0/1:10,45:55:70:106,0,70
2 GT:AD:DP:GQ:PL 1/1:2,42:44:16:288,16,0
3 GT:AD:DP:GQ:PL 1/1:3,37:40:14:147,14,0
4 GT:AD:DP:GQ:PL 0/1:7,50:57:55:250,0,55'
## here replace text=txt with your file name
dat <- read.table(text=txt,header=TRUE,stringsAsFactors=FALSE)
data.frame(x1=dat$V1,read.table(text=dat$V2,sep=':'))
x1 V1 V2 V3 V4 V5
1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
2 GT:AD:DP:GQ:PL 1/1 2,42 44 16 288,16,0
3 GT:AD:DP:GQ:PL 1/1 3,37 40 14 147,14,0
4 GT:AD:DP:GQ:PL 0/1 7,50 57 55 250,0,55
user3201532
Updated on June 16, 2022Comments
-
user3201532 almost 2 years
I have a table of stings and numbers as below:
V1 V2 1 GT:AD:DP:GQ:PL 0/1:10,45:55:70:106,0,70 2 GT:AD:DP:GQ:PL 1/1:2,42:44:16:288,16,0 3 GT:AD:DP:GQ:PL 1/1:3,37:40:14:147,14,0 4 GT:AD:DP:GQ:PL 0/1:7,50:57:55:250,0,55
For vector
V2
, I would like to split the ':'- delimited (colon-delimited) values into separate columns for each value, e.g.:V1 V2 V3 V4 V5 V6 1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
-
agstudy over 10 years@Jeremy I don't get your point. Looks that you try to speak about a general case but it is not clear for me (at least the vcf format the way you describe it). You assume something that is not in the OP ( I can't guess that the OP has other columns)..Maybe You can add an example to your answer ...
-
JeremyS over 10 yearsyeah I know more about the data than the OP gave. It is two columns from a variant call file (vcf) obtained from aligning genetic sequence data to a reference and calling variants. Your solution works fine for the sample data given.
-
agstudy over 10 years@Jeremy What I mean , it will be helpful and more interesting is if you provide the general fil format to adapt my solution ( even I don't see how it fails)
-
JeremyS over 10 yearsOh right, this site has the first few lines of one: 1000genomes.org/wiki/Analysis/Variant%20Call%20Format/…