Apply function to each column in a data frame observing each column's existing data type
Solution 1
If it were an ordered factor, things would be different. That is not to say I like ordered factors (I don't), only that some relationships are defined for ordered factors that are not defined for plain factors. Factors are treated as ordinary categorical variables. What you are seeing is the natural sort order of the factor levels, which is alphabetical (lexical) order for your locale. If you want automatic coercion to numeric for every column, dates and factors and all, then try:
sapply(df, function(x) max(as.numeric(x)) ) # not generally a useful result
Or, if you want to test for factors first and return what you expect:
sapply(df, function(x) if ("factor" %in% class(x)) {
  max(as.numeric(as.character(x)))  # assumes the factor labels are numbers
} else { max(x) })
@Darren's comment does work better:
sapply(df, function(x) max(as.character(x)) )
max does succeed with character vectors.
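To illustrate the distinction drawn at the start of this answer, here is a minimal sketch with made-up data (the vector and its level names are illustrative, not from the question). It shows max() failing on a plain factor, following the declared level order on an ordered factor, and sorting alphabetically on a character vector.

```r
sizes <- c("low", "high", "mid")

# Plain (unordered) factor: max() is not defined and signals an error
f <- factor(sizes)
unordered_result <- tryCatch(as.character(max(f)), error = function(e) "error")

# Ordered factor: max() follows the declared level order
o <- factor(sizes, levels = c("low", "mid", "high"), ordered = TRUE)
ordered_result <- as.character(max(o))

# Character vector: max() uses the sort order of the strings
character_result <- max(sizes)

print(c(unordered_result, ordered_result, character_result))
```

Note that the ordered factor and the character vector disagree about the maximum, which is why the choice of representation matters.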
Solution 2
The reason that max works with apply is that apply coerces your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
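The coercion described above can be seen directly. This is a small sketch on an invented two-column data frame (the column names and values are assumptions for illustration); it only inspects types and does not rely on any particular sort order.

```r
df <- data.frame(name = c("ZEBRA", "ant"),
                 value = c(-99.5, 3),
                 stringsAsFactors = TRUE)

# Inside the data frame, each column keeps its own class
col_classes <- sapply(df, class)
print(col_classes)

# as.matrix(), which apply() uses for data frames, yields a single type
m <- as.matrix(df)
print(typeof(m))

# so max() ends up comparing strings in every column
apply_result <- apply(df, 2, max)
print(apply_result)
```

The numeric column is formatted to a common width during the conversion, which would explain padded strings such as " -99.5" in the output.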
The default behavior when you create a data frame (in R versions before 4.0.0) is for character columns to be stored as factors. Unless you specify that it is an ordered factor, operations like max and min are undefined, since R assumes that you've created an unordered factor.
You can change this behavior by setting options(stringsAsFactors = FALSE), which changes the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
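As a rough sketch of the stringsAsFactors argument mentioned above (the data frames d_fac and d_chr are invented for illustration), the same input can be stored either as a factor or as a character column, and only the latter works with max() out of the box:

```r
d_fac <- data.frame(v = c("b", "a", "c"), stringsAsFactors = TRUE)
d_chr <- data.frame(v = c("b", "a", "c"), stringsAsFactors = FALSE)

print(class(d_fac$v))  # stored as a factor; max() would signal an error here
print(class(d_chr$v))  # stored as character

# Character columns are compared alphabetically
chr_max <- max(d_chr$v)
print(chr_max)
```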
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
# Some test data
library(plyr)  # provides colwise()
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
                v3 = rnorm(10), v4 = LETTERS[1:10], stringsAsFactors = TRUE)
d[4, ] <- NA
# Similar function to DWin's answer
fun <- function(x) {
  if (is.numeric(x)) {
    max(x, na.rm = TRUE)
  } else {
    max(as.character(x), na.rm = TRUE)
  }
}
# Use colwise from the plyr package
colwise(fun)(d)
v1 v2 v3 v4
1 0.8478983 j 1.999435 J
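If plyr is not available, a similar result can probably be had in base R. The sketch below is an assumption-laden alternative, not the method used above: it uses lapply, which returns a list and therefore keeps each column's result in its own type. The helper name col_max and the small data frame are hypothetical.

```r
d <- data.frame(v1 = c(0.2, 0.8, NA, 0.5),
                v2 = c("a", "c", NA, "b"),
                stringsAsFactors = TRUE)

# Hypothetical helper: numeric max for numbers, alphabetical max otherwise
col_max <- function(x) {
  if (is.numeric(x)) {
    max(x, na.rm = TRUE)
  } else {
    max(as.character(x), na.rm = TRUE)
  }
}

# lapply() returns a list, so no common type is forced on the results
res <- lapply(d, col_max)
str(res)

# A one-row data frame, similar in shape to the colwise() output
print(as.data.frame(res))
```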
Solution 3
If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile and max of numeric columns, and the frequency of the top levels of factor columns.
Solution 4
The best way to do this is to avoid the base *apply functions: apply coerces the entire data frame to a matrix, possibly losing information, and sapply simplifies its result to a single type.
If you want to apply a function such as as.numeric to every column, a simple way is to use mutate_all from dplyr:
t %>% mutate_all(as.numeric)
Alternatively, use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
t %>% (colwise(as.numeric))
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert from base R or type_convert from readr.
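A minimal sketch of the base R route, assuming a data frame whose columns were all read in as character (the data frame raw and its contents are invented for illustration). Passing as.is = TRUE keeps non-numeric text as character rather than converting it to factor.

```r
raw <- data.frame(a = c("1", "2", "3"),
                  b = c("x", "y", "z"),
                  c = c("1.5", "2.5", NA),
                  stringsAsFactors = FALSE)

# type.convert() guesses a suitable type for each column
typed <- type.convert(raw, as.is = TRUE)

col_classes <- sapply(typed, class)
print(col_classes)
```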
Less interesting answer: we can apply a function to each column with a for loop (parse_guess is from readr):
for (i in seq_len(ncol(t))) { t[[i]] <- parse_guess(t[[i]]) }
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
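One base R idiom that may fit here is assigning the result of lapply into df[], which replaces the columns while keeping the data frame's class and names. This is a sketch on invented data, not something taken from the answers above.

```r
df <- data.frame(a = c("1", "2"),
                 b = c("3.5", "4.5"),
                 stringsAsFactors = FALSE)

# The empty brackets keep the data frame structure; only the columns change
df[] <- lapply(df, as.numeric)

str(df)
```

Without the empty brackets, df <- lapply(df, as.numeric) would leave a plain list instead of a data frame.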
Solution 5
Building on @ltamar's answer:
Use summary and munge the output into something useful!
library(tidyr)
library(dplyr)
df %>%
summary %>%
data.frame %>%
select(-Var1) %>%
separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
rename(column_name=Var2) %>%
mutate(value=as.numeric(value),
metric = trimws(metric,'both')
) %>%
filter(!is.na(value)) -> metrics
It's not pretty and it is certainly not fast but it gets the job done!
Darren Cook
Updated on July 08, 2022
Comments
-
Darren Cook almost 2 years
I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.