R: Split Variable Column into multiple (unbalanced) columns by comma
From Ananda's splitstackshape
package:
cSplit(df, "Events", sep=",")
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Or with tidyr
:
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
# Name Age Number Events_1 Events_2 Events_3 Events_4 First
#1 Karen 24 8 Triathlon/IM Marathon 10k 5k 0
#2 Kurt 39 2 Half-Marathon 10k <NA> <NA> 0
#3 Leah 18 0 NA <NA> <NA> <NA> 1
With the data.table
package:
setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=F]
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Data
df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt",
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))
Kfruge
Updated on June 19, 2022Comments
-
Kfruge almost 2 years
I have a dataset of 25 variables and over 2 million observations. One of my variables is a combination of a few different "categories" that I want to split to where it shows 1 category per column (similar to what split would do in stata). For example:
# Name Age Number Events First # Karen 24 8 Triathlon/IM,Marathon,10k,5k 0 # Kurt 39 2 Half-Marathon,10k 0 # Leah 18 0 1
And I want it to look like:
# Name Age Number Events_1 Event_2 Events_3 Events_4 First # Karen 24 8 Triathlon/IM Marathon 10k 5k 0 # Kurt 39 2 Half-Marathon 10k NA NA 0 # Leah 18 0 NA NA NA NA 1
I have looked through stackoverflow but have not found anything that works (everything gives me an error of some sort). Any suggestions would be greatly appreciated.
Note: May not be important but the largest number of categories 1 person has is 19 therefore I would need to create Event_1:Event_19
Comment: Previous stack overflows have suggested the separate function, however this function does not seem to work with my dataset. When I input the function the program runs but when it is finished nothing is changed, there is no output, and no error code. When I tried to use other suggestions made in other threads I received error messages. However, I finally got it is work by using the cSplit function. Thank for the help!!!
-
akrun almost 9 yearsOr
library(data.table); setDT(df)[, paste0('Events', 1:4):= tstrsplit(Events, ',')]