Count unique values for every column
Solution 1
You could use apply:
apply(Testdata, 2, function(x) length(unique(x)))
# var_1 var_2 var_3
# 1 1 3
Solution 2
In dplyr:
Testdata %>% summarise_all(n_distinct)
(For those curious about the complete syntax:
In dplyr >= 0.8.0, using purrr-style lambda syntax:
Testdata %>% summarise_all(list(~n_distinct(.)))
In dplyr < 0.8.0:
Testdata %>% summarise_all(funs(n_distinct(.)))
)
More information on summarizing multiple columns can be found here: https://dplyr.tidyverse.org/reference/summarise_all.html
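For readers on dplyr 1.0.0 or later, where the `summarise_all()` family is superseded by `across()`, the same count can be sketched as follows. This assumes dplyr >= 1.0.0 is installed and rebuilds the question's `Testdata` so the snippet runs on its own.

```r
library(dplyr)

# Example data from the question.
Testdata <- data.frame(var_1 = c("a", "a", "a"),
                       var_2 = c("b", "b", "b"),
                       var_3 = c("c", "d", "e"))

# Apply n_distinct() to every column; the result is a one-row data frame.
result <- Testdata %>% summarise(across(everything(), n_distinct))
print(result)
```

The result has one column per input column, as with `summarise_all(n_distinct)`.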
Solution 3
This is actually an improvement on the comment by @Ananda Mahto. It didn't fit in a comment, so I decided to add it as an answer.
sapply is actually marginally faster than lapply, and gives the output in a more compact form, just like the output from apply.
A test run result on actual data:
> start <- Sys.time()
> apply(datafile, 2, function(x)length(unique(x)))
symbol. date volume
1371 261 53647
> Sys.time() - start
Time difference of 1.619567 secs
>
> start <- Sys.time()
> lapply(datafile, function(x)length(unique(x)))
$symbol.
[1] 1371
$date
[1] 261
$volume
[1] 53647
> Sys.time() - start
Time difference of 0.07129478 secs
>
> start <- Sys.time()
> sapply(datafile, function(x)length(unique(x)))
symbol. date volume
1371 261 53647
> Sys.time() - start
Time difference of 0.06939292 secs
The datafile has around 3.5 million rows.
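A comparison along these lines can be reproduced with `system.time()`, which avoids the manual `Sys.time()` bookkeeping. The sketch below uses a made-up stand-in for `datafile` (the column names, sizes, and values are assumptions, not the original data), so the timings will differ from those above. One likely reason for the gap is that `apply` first coerces the data frame to a matrix, while `lapply` and `sapply` iterate over the columns directly.

```r
# Hypothetical stand-in for `datafile`: 1e5 rows, three columns.
set.seed(1)
datafile <- data.frame(
  symbol = sample(letters, 1e5, replace = TRUE),
  date   = sample(1:261, 1e5, replace = TRUE),
  volume = sample(1:50000, 1e5, replace = TRUE)
)

count_unique <- function(x) length(unique(x))

# Time each approach and keep its result for comparison.
t_apply  <- system.time(res_apply  <- apply(datafile, 2, count_unique))
t_sapply <- system.time(res_sapply <- sapply(datafile, count_unique))

# Counts per column, then elapsed seconds for each approach.
print(res_sapply)
print(c(apply = t_apply[["elapsed"]], sapply = t_sapply[["elapsed"]]))
```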
Quoting the help text:
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
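The relationship described in the help text can be checked directly on the question's `Testdata`. This is a small illustration; `count_unique` is a helper name introduced here, not part of the original answers.

```r
# Example data from the question.
Testdata <- data.frame(var_1 = c("a", "a", "a"),
                       var_2 = c("b", "b", "b"),
                       var_3 = c("c", "d", "e"))

count_unique <- function(x) length(unique(x))

as_list   <- lapply(Testdata, count_unique)   # named list
as_vector <- sapply(Testdata, count_unique)   # named integer vector

# With simplification switched off, sapply returns what lapply returns.
same <- identical(
  sapply(Testdata, count_unique, simplify = FALSE, USE.NAMES = FALSE),
  as_list
)
print(same)
print(as_vector)
```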
Solution 4
Using the lengths function:
lengths(lapply(Testdata, unique))
# var_1 var_2 var_3
# 1 1 3
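If the two-column layout from the question (Variable, Unique_Values) is wanted rather than a named vector, one way to get there in base R is sketched below. The column names are taken from the question; the rest is an illustrative choice.

```r
# Example data from the question.
Testdata <- data.frame(var_1 = c("a", "a", "a"),
                       var_2 = c("b", "b", "b"),
                       var_3 = c("c", "d", "e"))

# Named integer vector of distinct counts, one entry per column.
counts <- lengths(lapply(Testdata, unique))

# Move the names into their own column to match the requested layout.
result <- data.frame(Variable = names(counts),
                     Unique_Values = unname(counts))
print(result)
#   Variable Unique_Values
# 1    var_1             1
# 2    var_2             1
# 3    var_3             3
```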
Solution 5
Here, I've used dplyr and tidyr to count (using your Testdata data frame):
Testdata %>%
  gather(var, value) %>%
  distinct() %>%
  count(var)
# # A tibble: 3 × 2
# var n
# <chr> <int>
# 1 var_1 1
# 2 var_2 1
# 3 var_3 3
Zfunk
Updated on July 09, 2022
Comments
-
Zfunk almost 2 years
I would like to return the count of the unique (distinct) values for every column in a data frame. For example, if I have the table:
Testdata <- data.frame(var_1 = c("a","a","a"), var_2 = c("b","b","b"), var_3 = c("c","d","e"))
var_1 | var_2 | var_3
a | b | c
a | b | d
a | b | e
I would like the output to be:
Variable | Unique_Values
var_1 | 1
var_2 | 1
var_3 | 3
I have tried playing around with loops using the unique function, e.g.
for(i in names(Testdata)){
  # Code using unique function
}
However I suspect there is a simpler way.
-
A5C1D2H2I1M1N2O1R2T1 over 10 years
@user2721117, I would suggest lapply over apply as an approach that scales better. For example: lapply(Testdata, function(x) length(unique(x))). Some bigger test data: Testdata <- data.frame(replicate(15, sample(letters[1:sample(26, 1)], 1e6, replace = TRUE)))
-
Roy Scheffers over 5 years
While this might answer the author's question, it lacks some explaining words and/or links to documentation. Raw code snippets are not very helpful without some phrases around them. You may also find how to write a good answer very helpful. Please edit your answer.