Dictionary style replace multiple items
Solution 1
map = setNames(c("0101", "0102", "0103"), c("AA", "AC", "AG"))
foo[] <- map[unlist(foo)]
This assumes that map covers every value in foo; any value without an entry in map becomes NA. It would feel less like a 'hack', and be more efficient in both space and time, if foo were a matrix (of character()); then
matrix(map[foo], nrow=nrow(foo), dimnames=dimnames(foo))
Both matrix and data frame variants run afoul of R's 2^31-1 limit on vector size when there are millions of SNPs and thousands of samples.
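A minimal sketch of the same lookup for the case where the dictionary does not cover every value (the example data in the question contains "AT", "GG" and "GC", which have no entry). The helper name recode_keep is hypothetical and not from any package; it leaves unmatched values, including NA, untouched instead of turning them into NA.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)
dict <- c(AA = "0101", AC = "0102", AG = "0103")

# Hypothetical helper: look each value up by name, keep the original on a miss.
recode_keep <- function(x, dict) {
  hit <- match(x, names(dict))            # position in the dictionary, NA if absent
  ifelse(is.na(hit), x, unname(dict[hit]))
}

# foo[] keeps the data.frame structure while replacing every column.
foo[] <- lapply(foo, recode_keep, dict = dict)
foo
```

Working column by column with lapply() also avoids building one long vector with unlist(foo).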
Solution 2
If you're open to using packages, plyr is a very popular one, and its mapvalues() function does just what you're looking for. It requires an atomic vector, so apply it to each column rather than to the data frame itself:
foo[] <- lapply(foo, plyr::mapvalues, from=c("AA", "AC", "AG"), to=c("0101", "0102", "0103"), warn_missing=FALSE)
Note that it works for data types of all kinds, not just strings.
Solution 3
Here is a quick solution
dict = list(AA = '0101', AC = '0102', AG = '0103')
foo2 = foo
# dict[[key]] extracts the string itself; dict[i] would pass a one-element list
for (key in names(dict)) {foo2 <- replace(foo2, foo2 == key, dict[[key]])}
Solution 4
Note: this answer started as an attempt to solve the much simpler problem posted in "How to replace all values in data frame with a vector of values?". Unfortunately, that question was closed as a duplicate of this one, so I'll suggest a solution based on replacing factor levels that covers both cases.
If there is only a vector (or one data frame column) whose values need to be replaced, and there are no objections to using a factor, we can coerce the vector to a factor and change the factor levels as required:
x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
x <- factor(x)
x
#[1] 1 1 4 4 5 5 1 1 2
#Levels: 1 2 4 5
replacement_vec <- c("A", "T", "C", "G")
levels(x) <- replacement_vec
x
#[1] A A C C G G A A T
#Levels: A T C G
Using the forcats package, this can be done in a one-liner:
x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
forcats::lvls_revalue(factor(x), replacement_vec)
#[1] A A C C G G A A T
#Levels: A T C G
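Assigning a plain vector to levels(x) relies on the replacements being listed in the same order as the sorted levels. As a hedged alternative sketch, base R also accepts a named list in which each name is the new level and each value is the old level, which makes the pairing explicit. The variable name recoded is only for illustration.

```r
x <- factor(c(1, 1, 4, 4, 5, 5, 1, 1, 2))

# name = new level, value = old level (the old levels are the strings "1", "2", "4", "5")
levels(x) <- list(A = "1", T = "2", C = "4", G = "5")

# Convert back to a plain character vector if a factor is not wanted.
recoded <- as.character(x)
recoded
```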
In case all values of multiple columns of a data frame need to be replaced, the approach can be extended.
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
snp2 = c("AA", "AT", "AG", "AA"),
snp3 = c(NA, "GG", "GG", "GC"),
stringsAsFactors=FALSE)
level_vec <- c("AA", "AC", "AG", "AT", "GC", "GG")
replacement_vec <- c("0101", "0102", "0103", "0104", "0302", "0303")
foo[] <- lapply(foo, function(x) forcats::lvls_revalue(factor(x, levels = level_vec),
replacement_vec))
foo
# snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103 0104 0303
#3 0101 0103 0303
#4 0101 0101 0302
Note that level_vec and replacement_vec must have equal lengths.
More importantly, level_vec should be complete, i.e., include all possible values in the affected columns of the original data frame (use unique(sort(unlist(foo))) to verify). Otherwise, any values missing from level_vec will be coerced to <NA>. Note that this is also a requirement for Martin Morgan's answer.
So, if only a few different values need to be replaced, you will probably be better off with one of the other answers, e.g., Ramnath's.
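The completeness check can be sketched as follows, using a deliberately incomplete level_vec so that the uncovered values show up. The name missing_keys is an assumption made for this illustration.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

# Deliberately incomplete, to show what the check reports.
level_vec <- c("AA", "AC", "AG")

# Values that occur in the data but have no entry in level_vec (NA is ignored).
missing_keys <- setdiff(unique(na.omit(unlist(foo))), level_vec)
missing_keys

if (length(missing_keys) > 0) {
  message("Not covered by level_vec: ", paste(missing_keys, collapse = ", "))
}
```

Running a check like this before recoding shows which values would otherwise silently become <NA>.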
Solution 5
One of the most readable ways to replace values in a string or a vector of strings with a dictionary is stringr::str_replace_all, from the stringr package. The pattern argument of str_replace_all can be a dictionary (a named character vector), e.g.,
# 1. Make your dictionary
dictio_replace= c("AA"= "0101",
                  "AC"= "0102",
                  "AG"= "0103") # short example of a dictionary.
# 2. Replace all patterns according to the dictionary values (only a single vector of strings, or a single string)
foo$snp1 <- stringr::str_replace_all(string = foo$snp1,
                                     pattern= dictio_replace) # only 'pattern' is used here: 'replacement' is unnecessary because the dictionary supplies the values.
Repeat step 2 with foo$snp2 and foo$snp3. If you have more vectors to transform, it's a good idea to use another function that replaces the values in each column/vector of the data frame without repeating yourself.
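One way to avoid repeating step 2 for every column, sketched here with base R's lapply() and the same dictionary (this assumes the stringr package is installed):

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)
dictio_replace <- c("AA" = "0101", "AC" = "0102", "AG" = "0103")

# Each column is passed as the 'string' argument; foo[] keeps the data.frame shape.
foo[] <- lapply(foo, stringr::str_replace_all, pattern = dictio_replace)
foo
```

Keep in mind that the names of the dictionary are treated as regular expressions and are matched anywhere in the string, so a value such as "AAG" would become "0101G". If only whole-value matches are wanted, anchoring the names (for example "^AA$") is one option.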
Stedy
Updated on July 09, 2022
Comments
-
Stedy almost 2 years
I have a large data.frame of character data that I want to convert based on what is commonly called a dictionary in other languages.
Currently I am going about it like so:
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), snp2 = c("AA", "AT", "AG", "AA"), snp3 = c(NA, "GG", "GG", "GC"), stringsAsFactors=FALSE)
foo <- replace(foo, foo == "AA", "0101")
foo <- replace(foo, foo == "AC", "0102")
foo <- replace(foo, foo == "AG", "0103")
This works fine, but it is obviously not pretty and seems silly to repeat the replace statement each time I want to replace one item in the data.frame. Is there a better way to do this since I have a dictionary of approximately 25 key/value pairs?
-
Mark
Is your dictionary an R list?
-
Ramnath over 12 years
i wouldn't advise the use of the global assignment operator <<-.
-
joran over 12 years
@Ramnath Agreed, <<- can be risky, but it's not inherently bad.
-
Uwe about 7 years
Unfortunately, this throws an Error in plyr::mapvalues(foo, from = c("AA", "AC", "AG"), to = c("0101", : x must be an atomic vector. This also documented in ?mapvalues.
-
IRTFM about 6 years
This is the only answer that could handle the variant where the original had keys of 0:2 and the task was to convert to equivalent character values. The highest voted answer failed because 0 is not an acceptable index. Ramnaths's and c.gutierrez' answers also failed in my hands. (I didn't test all the answers.) This is the link to the question: stackoverflow.com/questions/49504035/…
-
Frank almost 6 years
It looks like your input is a data.frame and your output is a matrix. I guess you could coerce back at the end, though.
-
Scientist almost 4 years
Looks like the best option for me, but I cannot make it work some reason. Output makes little sense.
-
Scott over 3 years
FYI - if you are using tidyverse and have foo as a tibble, you have to coerce it to a data.frame prior to assigning map[unlist(foo)], otherwise the row count of assigned vs existing data will differ.
-
Jane Kathambi about 3 years
This works absolutely well! Thank you c.gutierrez.