Multiple separators for the same file input in R

Solution 1

Try this:

# dummy data
df <- read.table(text="
Name    Name1   *XYZ_Name3_KB_MobApp_M-18-25_AU_PI ANDROID  2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000
Name    Name2   *CCC_Name3_KB_MobApp_M-18-25_AU_PI ANDROID  2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000
", as.is = TRUE)

# replace "_" with "-" so that only one separator remains
df_V3 <- gsub(pattern="_", replacement="-", df$V3, fixed = TRUE)

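# split on "-" and bind the pieces into a data frame with readable column names
df_V3 <- as.data.frame(do.call(rbind, strsplit(df_V3, split = "-", fixed = TRUE)),
                       stringsAsFactors = FALSE)
names(df_V3) <- paste("V3", seq_along(df_V3), sep = "_")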

# output, merge columns
output <- cbind(df[, c(1:2)],
                df_V3,
                df[, c(4:ncol(df))])
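
The gsub step is optional if the split pattern is a regular-expression character class that matches either separator. A minimal base-R sketch, assuming the same two values as the V3 column of the dummy data above (the names x, pieces and split_df are made up for illustration):

```r
# Illustrative input: the V3 values from the dummy data above
x <- c("*XYZ_Name3_KB_MobApp_M-18-25_AU_PI",
       "*CCC_Name3_KB_MobApp_M-18-25_AU_PI")

# "[_-]" matches either an underscore or a hyphen, so one strsplit call is enough
pieces <- strsplit(x, split = "[_-]")

# bind into a data frame; this assumes every row yields the same number of pieces
split_df <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(split_df) <- paste("V3", seq_along(split_df), sep = "_")
split_df
```

If the rows can produce different numbers of pieces, rbind will recycle values, possibly with only a warning, so this shortcut is only safe for a column with a fixed layout.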

Building on the comments below, here is another related option, but one which uses read.table instead of strsplit.

splitCol <- "V3"
temp <- read.table(text = gsub("-", "_", df[, splitCol]), sep = "_")
names(temp) <- paste(splitCol, seq_along(temp), sep = "_")
cbind(df[setdiff(names(df), splitCol)], temp)

Solution 2

I find the functions in package splitstackshape convenient in cases like this.

library(splitstackshape)

# split concatenated column by `_`
results2 <- concat.split(data = results, split.col = "V3", sep = "_", drop = TRUE)

# split the remaining concatenated part by `-`
results3 <- concat.split(data = results2, split.col = "V3_5", sep = "-", drop = TRUE)
results3

Solution 3

library(stringr)

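# read the tab-delimited file; header = FALSE gives the default names V1, V2, ...
results <- read.delim("~/results", header = FALSE)
# split V3 on "_" or "-" into exactly 9 pieces and append them as new columns
results <- cbind(results, str_split_fixed(results$V3, "[_-]", 9))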

(this assumes you're OK with having the original column still in place)
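
str_split_fixed needs the number of pieces up front (9 here). In the CSV sample from the question, the values ending in KB_ANDROID and KB_IOS_IPAD split into different numbers of pieces, so a fixed count may not fit every row. Here is a rough base-R sketch that pads shorter rows with NA and leaves all other columns, including the timestamps, alone. The helper name split_multi and its arguments are hypothetical and not part of any package, and the two input rows are shortened copies of rows from the question's sample:

```r
# Hypothetical helper: split one column on several separators,
# padding rows that yield fewer pieces with NA
split_multi <- function(data, col, pattern = "[_-]") {
  pieces <- strsplit(as.character(data[[col]]), split = pattern)
  width  <- max(lengths(pieces))
  # `length<-` extends each vector to `width`, filling with NA
  padded <- lapply(pieces, `length<-`, width)
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste(col, seq_len(width), sep = "_")
  # keep every other column untouched and drop the original column
  cbind(data[setdiff(names(data), col)], out)
}

# Two shortened rows modelled on the question's CSV sample
csv <- '"1","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0"
"8","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0"'
dat <- read.csv(text = csv, header = FALSE, stringsAsFactors = FALSE)

res <- split_multi(dat, "V4")
res
```

Only the named column is passed to strsplit, so the hyphens inside the date-time columns are never touched.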

Author: CArnold

Updated on November 30, 2020

Comments

  • CArnold, over 3 years ago

    I've looked for answers, but have only found ones referring to C or C#. I realise that much of R is written in C, but my knowledge of it is non-existent. I am also relatively new to R. I am using the current version of RStudio.

    This question is similar to what I want, I think: "Read the data efficiently with multiple separating lines in R".

    I have a CSV file, but one variable is a string with values separated by _ and -. I would like to know whether there is a package or extra code that does the following when the file is read in.

    "1","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",0,218,4,93,1377907200000
    "2","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",0,390,5,157,1377993600000
    "3","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",0,376,5,193,1.37808e+12
    "4","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",1,35,1,15,1377907200000
    "5","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",12,11258,117,2843,1377993600000
    "6","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",5,4659,56,1826,1.37808e+12
    "7","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",7,7296,136,2684,1377907200000
    "8","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0","2013-10-16 13:58:00.0",0,4533,35,1632,1377907200000
    "9","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0","2013-10-16 13:58:00.0",0,421,6,161,1377993600000
    "10","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0","2013-10-16 13:58:00.0",0,57,2,23,1.37808e+12
    

    Example row:

    Name    Name1   *XYZ_Name3_KB_MobApp_M-18-25_AU_PI ANDROID  2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000
    

    So it's easy enough to read in

    results <- read.delim("~/results", header = FALSE)
    

    but then I still have the string *XYZ_Name3_KB_MobApp_M-18-25_AU_PI

    Desired output (split on _ and on -):

    Name    Name1   *XYZ   Name3  KB   MobApp   M 18 25  AU  PI ANDROID 2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000
    

    The time strings should not be split up.

    ---- Thanks @Henrik and @AnandaMahto for the code and package. ----

    library(splitstackshape)
    
    # split concatenated column by `_`
    df4 <- concat.split(data = df3, split.col = "V3", sep = "_", drop = TRUE)
    
    # split the remaining concatenated part by `-`
    df5 <- concat.split(data = df4, split.col = "V3_5", sep = "-", drop = TRUE)