How to convert a factor to integer/numeric without loss of information?
Solution 1
See the Warning section of ?factor:
In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
The FAQ on R has similar advice.
Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?
as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(f) values, rather than on nlevels(f) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
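The reasoning above can be checked with a small sketch. The data here are an assumed example (a long factor with only three numeric-looking levels), not taken from the original answer:

```r
# Assumed example data: 1000 elements drawn from three numeric-looking strings.
set.seed(1)
f <- factor(sample(c("0.5", "10", "2.25"), 1000, replace = TRUE))

# Route 1: convert the few levels once, then index the result by the factor.
via_levels <- as.numeric(levels(f))[f]

# Route 2: expand to a full character vector first, then convert every element.
via_character <- as.numeric(as.character(f))

identical(via_levels, via_character)  # both routes give the same values
nlevels(f)  # how many strings route 1 has to parse
length(f)   # how many strings route 2 has to parse
```

Both routes return the same numbers; they differ only in how many strings are parsed along the way.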
Some timings
library(microbenchmark)
microbenchmark(
  as.numeric(levels(f))[f],
  as.numeric(levels(f)[f]),
  as.numeric(as.character(f)),
  paste0(x),
  paste(x),
  times = 1e5
)
## Unit: microseconds
##                         expr   min    lq      mean median     uq      max neval
##     as.numeric(levels(f))[f] 3.982 5.120  6.088624  5.405  5.974 1981.418 1e+05
##     as.numeric(levels(f)[f]) 5.973 7.111  8.352032  7.396  8.250 4256.380 1e+05
##  as.numeric(as.character(f)) 6.827 8.249  9.628264  8.534  9.671 1983.694 1e+05
##                    paste0(x) 7.964 9.387 11.026351  9.956 10.810 2911.257 1e+05
##                     paste(x) 7.965 9.387 11.127308  9.956 11.093 2419.458 1e+05
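The benchmark above does not show how f was built. If you want to try a comparison yourself, here is a rough sketch using only base R's system.time; the factor size, the number of levels, and the repetition count are all assumptions chosen for illustration, and actual timings will vary by machine and R version:

```r
# Assumed setup: a long factor with few levels, where the difference should show most.
set.seed(42)
f <- factor(sample(c("1.5", "2.5", "3.5"), 1e5, replace = TRUE))

# Repeat each conversion a few times so the elapsed time is measurable.
t_levels <- system.time(for (i in 1:20) r1 <- as.numeric(levels(f))[f])
t_char   <- system.time(for (i in 1:20) r2 <- as.numeric(as.character(f)))

# Elapsed seconds for each approach (no expected values shown: they are machine dependent).
elapsed <- c(levels_first = t_levels[["elapsed"]],
             as_character = t_char[["elapsed"]])
elapsed

identical(r1, r2)  # sanity check: both approaches return the same vector
```

For serious measurements a dedicated tool such as microbenchmark, as used above, is the better choice.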
Solution 2
R has a number of (undocumented) convenience functions for converting factors:
- as.character.factor
- as.data.frame.factor
- as.Date.factor
- as.list.factor
- as.vector.factor
- ...
But annoyingly, there is nothing to handle the factor -> numeric conversion. As an extension of Joshua Ulrich's answer, I suggest overcoming this omission by defining your own idiomatic function:
as.double.factor <- function(x) {as.numeric(levels(x))[x]}
that you can store at the beginning of your script, or even better in your .Rprofile file.
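If you would rather not give the helper a name that looks like an S3 method, a plainly named function avoids any question of method dispatch. The following is a minimal sketch; the name factor_to_numeric is hypothetical and not part of base R:

```r
# Hypothetical helper (the name is an assumption, not a base R function).
factor_to_numeric <- function(x) {
  stopifnot(is.factor(x))      # fail early on non-factor input
  as.numeric(levels(x))[x]     # convert the levels once, then index by the factor
}

f <- factor(c("3.14", "2.71", "3.14", "10"))

vals  <- factor_to_numeric(f)  # the values 3.14, 2.71, 3.14, 10
codes <- as.integer(f)         # the level codes 3, 2, 3, 1 (levels sort as "10", "2.71", "3.14")
vals
codes
```

Because it is an ordinary function, calling as.numeric(f) elsewhere in your code still behaves exactly as documented.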
Solution 3
Note: this particular answer is not for converting numeric-valued factors to numerics, it is for converting categorical factors to their corresponding level numbers.
None of the other answers in this post worked for me; they all produced NAs.
y2 <- factor(c("A", "B", "C", "D", "A"))
as.numeric(levels(y2))[y2]
# [1] NA NA NA NA NA
# Warning message: NAs introduced by coercion
What worked for me is this:
as.integer(y2)
# [1] 1 2 3 4 1
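The two situations are easy to confuse, so here is a small sketch contrasting them. The example factors are assumptions for illustration; the first mirrors the one used in the comments below:

```r
y_num <- factor(c("5", "15", "20", "2"))     # levels look like numbers
y_cat <- factor(c("A", "B", "C", "D", "A"))  # levels are plain labels

as.integer(y_num)                  # level codes: 4 1 3 2 (levels sort as "15", "2", "20", "5")
as.numeric(levels(y_num))[y_num]   # original values: 5 15 20 2
as.integer(y_cat)                  # level codes: 1 2 3 4 1

# Which levels cannot be parsed as numbers (and would therefore become NA)?
bad <- levels(y_cat)[is.na(suppressWarnings(as.numeric(levels(y_cat))))]
bad                                # "A" "B" "C" "D"
```

If bad is non-empty, the "NAs introduced by coercion" warning is expected, and the level codes from as.integer() are probably what you are after.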
Solution 4
The easiest way would be to use the unfactor function from the varhandle package, which accepts a factor vector or even a data frame:
unfactor(your_factor_variable)
This example can be a quick start:
x <- rep(c("a", "b", "c"), 20)
y <- rep(c(1, 1, 0), 20)
class(x) # -> "character"
class(y) # -> "numeric"
x <- factor(x)
y <- factor(y)
class(x) # -> "factor"
class(y) # -> "factor"
library(varhandle)
x <- unfactor(x)
y <- unfactor(y)
class(x) # -> "character"
class(y) # -> "numeric"
You can also use it on a data frame, for example the iris dataset:
sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
# load the package
library("varhandle")
# pass the iris to unfactor
tmp_iris <- unfactor(iris)
# check the classes of the columns
sapply(tmp_iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"  "character"
# check if the last column is correctly converted
tmp_iris$Species
  [1] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
  [6] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [11] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [16] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [21] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [26] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [31] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [36] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [41] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [46] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [51] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [56] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [61] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [66] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [71] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [76] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [81] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [86] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [91] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [96] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
[101] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[106] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[111] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[116] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[121] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[126] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[131] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[136] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[141] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[146] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
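For readers who prefer not to add a dependency, something similar can be sketched in base R. This is not how varhandle implements unfactor; the helper name unfactor_columns and the example data frame are assumptions for illustration. It relies on type.convert(), so whole-number columns come back as integer rather than double:

```r
# Hypothetical base R helper: convert every factor column of a data frame.
unfactor_columns <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) {
      # Go through character, then let type.convert() pick a suitable type.
      type.convert(as.character(col), as.is = TRUE)
    } else {
      col  # leave non-factor columns untouched
    }
  })
  df
}

# Assumed example data.
d <- data.frame(
  id  = factor(c("1", "0", "1")),
  grp = factor(c("a", "b", "a")),
  val = c(1.5, 2.5, 3.5)
)

out <- unfactor_columns(d)
sapply(out, class)  # id: "integer", grp: "character", val: "numeric"
```

The as.is = TRUE argument keeps non-numeric columns as character instead of turning them back into factors.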
Solution 5
This is possible only when the factor labels match the original values. I will explain with an example.
Assume the data is the vector x:
x <- c(20, 10, 30, 20, 10, 40, 10, 40)
Now I will create a factor with four labels:
f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))
1) x is of type double, while f is of type integer. This is the first unavoidable loss of information. Factors are always stored as integers.
> typeof(x)
[1] "double"
> typeof(f)
[1] "integer"
2) It is not possible to revert back to the original values (10, 20, 30, 40) having only f available. We can see that f holds only integer values 1, 2, 3, 4 and two attributes - the list of labels ("A", "B", "C", "D") and the class attribute "factor". Nothing more.
> str(f)
Factor w/ 4 levels "A","B","C","D": 2 1 3 2 1 4 1 4
> attributes(f)
$levels
[1] "A" "B" "C" "D"
$class
[1] "factor"
To revert back to the original values we have to know the values of levels used in creating the factor, in this case c(10, 20, 30, 40). If we know the original levels (in correct order), we can revert back to the original values.
> orig_levels <- c(10, 20, 30, 40)
> x1 <- orig_levels[f]
> all.equal(x, x1)
[1] TRUE
This works only when labels have been defined for all possible values in the original data.
So if you need the original values, you have to keep them. Otherwise there is a high chance that you cannot recover them from the factor alone.
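One way to "keep them" is to store the label-to-value mapping next to the factor. The sketch below reuses the x and levels from this answer; the named lookup vector is an assumption added for illustration:

```r
x <- c(20, 10, 30, 20, 10, 40, 10, 40)
orig_levels <- c(10, 20, 30, 40)
lab <- c("A", "B", "C", "D")

f <- factor(x, levels = orig_levels, labels = lab)

# Keep the mapping from label to original value alongside the factor.
lookup <- setNames(orig_levels, lab)

# Recover the original values by label, independent of the level order.
x1 <- unname(lookup[as.character(f)])
identical(x, x1)  # TRUE

# A value with no defined level is already lost when the factor is created.
f2 <- factor(c(10, 50), levels = orig_levels, labels = lab)
is.na(f2)         # FALSE TRUE
```

Looking values up by label rather than by position means the recovery does not depend on remembering the levels in the correct order.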
Adam SO
Updated on July 16, 2022
Comments
-
Adam SO almost 2 years
When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.
f <- factor(sample(runif(5), 20, replace = TRUE))
## [1] 0.0248644019011408 0.0248644019011408 0.179684827337041
## [4] 0.0284090070053935 0.363644931698218 0.363644931698218
## [7] 0.179684827337041 0.249704354675487 0.249704354675487
## [10] 0.0248644019011408 0.249704354675487 0.0284090070053935
## [13] 0.179684827337041 0.0248644019011408 0.179684827337041
## [16] 0.363644931698218 0.249704354675487 0.363644931698218
## [19] 0.179684827337041 0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218
as.numeric(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
as.integer(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
I have to resort to paste to get the real values:
as.numeric(paste(f))
## [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
## [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901
Is there a better way to convert a factor to numeric?
-
CJB over 8 years
The levels of a factor are stored as character data type anyway (attributes(f)), so I don't think there is anything wrong with as.numeric(paste(f)). Perhaps it would be better to think why (in the specific context) you are getting a factor in the first place, and try to stop that. E.g., is the dec argument in read.table set correctly?
-
davsjob over 5 years
If you use a dataframe you can use convert from hablar: df %>% convert(num(column)). Or if you have a factor vector you can use as_reliable_num(factor_vector).
-
Denis Cousineau almost 2 years
Thank God for this question. It is SO frustrating to see numbers get transformed into other numbers pretty much randomly.
-
-
Ari B. Friedman over 12 years
For timings see this answer: stackoverflow.com/questions/6979625/…
-
Sam about 10 years
Many thanks for your solution. Can I ask why as.numeric(levels(f))[f] is more precise and faster? Thanks.
-
Joshua Ulrich about 10 years
There's nothing to handle the factor-to-integer (or numeric) conversion because it's expected that as.integer(factor) returns the underlying integer codes (as shown in the examples section of ?factor). It's probably okay to define this function in your global environment, but you might cause problems if you actually register it as an S3 method.
-
Jealie about 10 years
That's a good point and I agree: a complete redefinition of the factor->numeric conversion is likely to break a lot of things. I found myself writing the cumbersome factor->numeric conversion a lot before realizing that it is in fact a shortcoming of R: some convenience function should be available... Calling it as.numeric.factor makes sense to me, but YMMV.
-
Joshua Ulrich about 10 years
If you find yourself doing that a lot, then you should do something upstream to avoid it altogether.
-
Jonathan almost 10 years
@Sam as.character(f) requires a "primitive lookup" to find the function as.character.factor(), which is effectively levels(f)[f].
-
jO. over 9 years
as.numeric.factor returns NA?
-
Jealie over 9 years
@jO.: in the cases where you used something like v=NA;as.numeric.factor(v) or v='something';as.numeric.factor(v), then it should, otherwise you have a weird thing going on somewhere.
-
CJB over 8 years
The unfactor function converts to character data type first and then converts back to numeric. Type unfactor at the console and you can see it in the middle of the function. Therefore it doesn't really give a better solution than what the asker already had.
-
CJB over 8 years
Having said that, the levels of a factor are of character type anyway, so nothing is lost by this approach.
-
maycca about 8 years
When I apply as.numeric(levels(f))[f] or as.numeric(as.character(f)), I get a warning: "Warning message: NAs introduced by coercion". Do you know where the problem could be? Thank you!
-
Mehrad Mahmoudian almost 8 years
The unfactor function takes care of things that cannot be converted to numeric. Check the examples in help("unfactor").
-
Selrac over 7 years
Error: could not find function "unfactor"
-
Mehrad Mahmoudian over 7 years
@Selrac I've mentioned that this function is available in the varhandle package, meaning you should load the package (library("varhandle")) first (as I mentioned in the first line of my answer!!)
-
Gregor Thomas over 7 years
I appreciate that your package probably has some other nice functions too, but installing a new package (and adding an external dependency to your code) isn't as nice or easy as typing as.character(as.numeric()).
-
Mehrad Mahmoudian over 7 years
@Gregor adding a light dependency usually does no harm, and of course if you are looking for the most efficient way, writing the code yourself might perform faster. But as you can also see in your comment, this is not trivial, since you put as.numeric() and as.character() in the wrong order ;) What your code chunk does is turn the factor's level index into a character matrix, so what you will have at the end is a character vector that contains some numbers that were once assigned to certain levels of your factor. Functions in that package are there to prevent these confusions.
-
user08041991 over 7 years
@maycca did you overcome this issue?
-
MrFlick about 7 years
Are you sure you had a factor? Look at this example: y<-factor(c("5","15","20","2")); unclass(y) %>% as.numeric. This returns 4,1,3,2, not 5,15,20,2. This seems like incorrect information.
-
Indi about 7 years
Ok, this is similar to what I was trying to do today: y2<-factor(c("A","B","C","D","A")); as.numeric(levels(y2))[y2] gave [1] NA NA NA NA NA with "Warning message: NAs introduced by coercion", whereas unclass(y2) %>% as.numeric gave me the results that I needed.
-
Indi about 7 years
Let me update my scenario in the answer that I had provided.
-
MrFlick about 7 years
OK, well that's not the question that was asked above. In this question the factor levels are all "numeric". In your case, as.numeric(y) should have worked just fine, no need for the unclass(). But again, that's not what this question was about. This answer isn't appropriate here.
-
Indi about 7 years
Well, I really hope it helps someone who was in a hurry like me and read just the title!
-
Phil almost 7 years
@jogo %>% is from the magrittr package.
-
MrFlick over 5 years
Is there a reason you would recommend using trimws over as.character as described in the accepted answer? It seems to me like unless you actually had whitespace you needed to remove, trimws is just going to do a bunch of unnecessary regular expression work to return the same result.
-
Jerry T about 5 years
as.numeric(levels(f))[f] might be a bit confusing and hard to remember for beginners. trimws does no harm.
-
aimme over 4 years
If you have characters representing the integers as factors, this is the one I would recommend. This is the only one that worked for me.
-
MBorg over 3 years
@user08041991 I have the same issue as maycca. I suspect this is from gradual changes in R over time (this answer was posted in 2010), and this answer is now outdated.
-
qwerty about 3 years
as.numeric(as.character.factor(x)) just did the trick for me
-
Phil almost 3 years
Nice simple solution, as fast as other solutions too.
-
luchonacho over 2 years
This is the answer so many of us are after and the first hit in Google. I can't find a similar question.
-
Jealie over 2 years
@rui-barradas comment: as a historical anomaly, R has two names for its floating-point vectors: numeric and double. According to the documentation, it is better to write code for the double type, thus as.double.factor seems like a more proper name. Link to documentation: stat.ethz.ch/R-manual/R-devel/library/base/html/numeric.html . Thanks @rui-barradas !
-
Denis Cousineau almost 2 years
?? On R 4.1, it does work.
-
Denis Cousineau almost 2 years
However, loading another package just for that single operation is not parsimonious.