Create new column in dataframe based on partial string matching other column

32,064

Solution 1

Since you have only two conditions, you can use a nested ifelse:

#random data; it wasn't easy to copy-paste yours  
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10), 
  c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12", 
     "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))

DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials", 
         ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))

DF
#   GL         GLDESC      KIND
#1   8        e gas l Materials
#2   1  c payroll12 y   Payroll
#3  10      m GaSer v Materials
#4   6       t asdf n     Other
#5   2      w qweaa t     Other
#6   4 r PayROll-12 q   Payroll
#7   9      n asdfg a     Other
#8   5     d GAS--2 w Materials
#9   7     s fghfgh e     Other
#10  3      g qweee k     Other

EDIT 10/3/2016 (..after receiving more attention than expected)

A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:

ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))    
    empty = seq_along(x)

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]  
        empty = empty[!greps]
    }

    return(ans)
}

ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll"   "Materials" "Other"     "Other"     "Payroll"   "Other"     "Materials" "Other"     "Other"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat1a|pat1b", "pat2", "pat3"), 
   c("1", "2", "3"), fill = "empty")
#[1] "1"     "1"     "3"     "empty"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat2", "pat1a|pat1b", "pat3"), 
   c("2", "1", "3"), fill = "empty")
#[1] "2"     "1"     "3"     "empty"

Solution 2

I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.

If you wanna create new labels, use a named vector.

DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
  c(
    "gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
    "asdfg", "GAS--2", "fghfgh", "qweee"
  ), sample(letters, 10),
  sep = " "
))


lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))

DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"

DF
#>    GL         GLDESC     KIND
#> 1   6        x gas f Material
#> 2   3  t payroll12 q  Payroll
#> 3   5      a GaSer h Material
#> 4   4       s asdf x    Other
#> 5   1      m qweaa y    Other
#> 6  10 y PayROll-12 r  Payroll
#> 7   7      g asdfg a    Other
#> 8   2     k GAS--2 i Material
#> 9   9     e fghfgh j    Other
#> 10  8      l qweee p    Other

Created on 2021-11-13 by the reprex package (v2.0.1)

Share:
32,064
user2948714
Author by

user2948714

Updated on November 14, 2021

Comments

  • user2948714
    user2948714 over 2 years

    I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.

    The dataframe is as follows:

          GL                             GLDESC
    1 515100         Payroll-Indir Salary Labor
    2 515900 Payroll-Indir Compensated Absences
    3 532300                           Bulk Gas
    4 539991                     Area Charge In
    5 551000        Repairs & Maint-Spare Parts
    6 551100                 Supplies-Operating
    7 551300                        Consumables
    

    For each row of the data table:

    • If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
    • If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
    • In all other cases I want KIND to be Other

    I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.