Split a column of values delimited by colons into separate columns for each value

r split delimiter

11,183

Solution 1

call that table vcf

vcf.info <- data.frame(t(sapply(vcf[,2], function(y) strsplit(y,split=":")[[1]])))

then cbind that with the original vcf column(s) that you want

vcf.info2 <- cbind(vcf[,1],vcf.info)

but in a real vcf I would

vcf.info2 <- cbind(vcf[,c(1,2,4,5,6,8,9)],vcf.info)

Something else you may find useful, in this case I am just getting the read depth, replace n with however many samples you have, and the 3 with 1 to 5 for GT,AD,DP,GQ,PL

selectReadDepth <- apply(vcf[,10:n],2,function(x) sapply(x, function(y) strsplit(y,split=":")[[1]][3]))

Solution 2

I've included a family of functions called concat.split in my "splitstackshape" package, one of which is concat.split.multiple. Under the hood, it is like @agstudy's answer, but allows you to split multiple columns at once.

Usage is simple:

library(splitstackshape)
### Three required arguments: The input dataset,
###   a vector of the columns that need to be split up
###   (can also be the numeric column position), and the 
###   separator that should be used (can be different 
###   for each column).
concat.split.multiple(data = dat, split.cols = c("V2"), seps = ":")
#               V1 V2_1  V2_2 V2_3 V2_4     V2_5
# 1 GT:AD:DP:GQ:PL  0/1 10,45   55   70 106,0,70
# 2 GT:AD:DP:GQ:PL  1/1  2,42   44   16 288,16,0
# 3 GT:AD:DP:GQ:PL  1/1  3,37   40   14 147,14,0
# 4 GT:AD:DP:GQ:PL  0/1  7,50   57   55 250,0,55

See also this answer and this Gist for an idea for where the development of the function might be headed. The "data.table" variant will be much faster on larger datasets, but the data must be "rectangular" (that is, the resulting number of columns after the split must be balanced).

Solution 3

Another way to do it

data.frame(DF$V1, do.call(rbind, strsplit(DF$V2, split = ":", fixed = TRUE)))
##            DF.V1  X1    X2 X3 X4       X5
## 1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
## 2 GT:AD:DP:GQ:PL 1/1  2,42 44 16 288,16,0
## 3 GT:AD:DP:GQ:PL 1/1  3,37 40 14 147,14,0
## 4 GT:AD:DP:GQ:PL 0/1  7,50 57 55 250,0,55

Solution 4

Using read.table twice with 2 different separators:

txt = '           V1                  V2
1  GT:AD:DP:GQ:PL  0/1:10,45:55:70:106,0,70
2  GT:AD:DP:GQ:PL  1/1:2,42:44:16:288,16,0
3  GT:AD:DP:GQ:PL  1/1:3,37:40:14:147,14,0
4  GT:AD:DP:GQ:PL  0/1:7,50:57:55:250,0,55'

## here replace text=txt with your file name
dat <- read.table(text=txt,header=TRUE,stringsAsFactors=FALSE)
data.frame(x1=dat$V1,read.table(text=dat$V2,sep=':'))

              x1  V1    V2 V3 V4       V5
1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
2 GT:AD:DP:GQ:PL 1/1  2,42 44 16 288,16,0
3 GT:AD:DP:GQ:PL 1/1  3,37 40 14 147,14,0
4 GT:AD:DP:GQ:PL 0/1  7,50 57 55 250,0,55

View more solutions

11,183

Author by

user3201532

Updated on June 16, 2022

Comments

user3201532 almost 2 years

I have a table of stings and numbers as below:

           V1                  V2
1  GT:AD:DP:GQ:PL  0/1:10,45:55:70:106,0,70
2  GT:AD:DP:GQ:PL  1/1:2,42:44:16:288,16,0
3  GT:AD:DP:GQ:PL  1/1:3,37:40:14:147,14,0
4  GT:AD:DP:GQ:PL  0/1:7,50:57:55:250,0,55

For vector V2, I would like to split the ':'- delimited (colon-delimited) values into separate columns for each value, e.g.:

   V1              V2   V3     V4  V5  V6
1  GT:AD:DP:GQ:PL  0/1  10,45  55  70  106,0,70

agstudy over 10 years

@Jeremy I don't get your point. Looks that you try to speak about a general case but it is not clear for me (at least the vcf format the way you describe it). You assume something that is not in the OP ( I can't guess that the OP has other columns)..Maybe You can add an example to your answer ...
JeremyS over 10 years

yeah I know more about the data than the OP gave. It is two columns from a variant call file (vcf) obtained from aligning genetic sequence data to a reference and calling variants. Your solution works fine for the sample data given.
agstudy over 10 years

@Jeremy What I mean , it will be helpful and more interesting is if you provide the general fil format to adapt my solution ( even I don't see how it fails)
JeremyS over 10 years

Oh right, this site has the first few lines of one: 1000genomes.org/wiki/Analysis/Variant%20Call%20Format/…