Count unique values for every column

Solution 1

You could use apply:

apply(Testdata, 2, function(x) length(unique(x)))
# var_1 var_2 var_3 
#     1     1     3
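
The question asks for a two-column table rather than a named vector. A minimal sketch of that reshaping, assuming the column names Variable and Unique_Values from the question; it uses sapply instead of apply, since apply first coerces the data frame to a matrix:

```r
Testdata <- data.frame(var_1 = c("a", "a", "a"),
                       var_2 = c("b", "b", "b"),
                       var_3 = c("c", "d", "e"),
                       stringsAsFactors = FALSE)

# Named integer vector: one count of unique values per column
counts <- sapply(Testdata, function(x) length(unique(x)))

# Reshape into the two-column layout requested in the question
result <- data.frame(Variable = names(counts),
                     Unique_Values = unname(counts),
                     stringsAsFactors = FALSE)
result
#   Variable Unique_Values
# 1    var_1             1
# 2    var_2             1
# 3    var_3             3
```

Note that length(unique(x)) counts NA as a value of its own, which may or may not be what you want.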

Solution 2

In dplyr:

Testdata %>% summarise_all(n_distinct)

🙂

For those curious about the complete syntax:

In dplyr >= 0.8.0, using purrr-style lambda syntax:

Testdata %>% summarise_all(list(~n_distinct(.)))

In dplyr < 0.8.0:

Testdata %>% summarise_all(funs(n_distinct(.)))

For more information on summarizing multiple columns, see: https://dplyr.tidyverse.org/reference/summarise_all.html
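
In dplyr 1.0.0 and later, the scoped verbs such as summarise_all are superseded by across(). A sketch of the equivalent call, assuming dplyr >= 1.0.0 is installed and using the Testdata frame from the question:

```r
library(dplyr)

Testdata <- data.frame(var_1 = c("a", "a", "a"),
                       var_2 = c("b", "b", "b"),
                       var_3 = c("c", "d", "e"),
                       stringsAsFactors = FALSE)

# across(everything(), ...) applies n_distinct to every column
res <- Testdata %>% summarise(across(everything(), n_distinct))
res
#   var_1 var_2 var_3
# 1     1     1     3
```

To leave missing values out of the count, n_distinct accepts na.rm = TRUE, e.g. across(everything(), ~ n_distinct(.x, na.rm = TRUE)).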

Solution 3

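This is actually an improvement on the comment by @Ananda Mahto. It didn't fit in a comment, so I decided to add it as an answer.

In this test run, sapply was marginally faster than lapply, though a difference this small is within the noise of a single timing. Both were far faster than apply, which first coerces the data frame to a matrix. sapply also gives the output in a more compact form, just like the output from apply.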

A test run result on actual data:

> start <- Sys.time()
> apply(datafile, 2, function(x)length(unique(x)))
          symbol.           date     volume 
             1371            261      53647 
> Sys.time() - start
Time difference of 1.619567 secs
> 
> start <- Sys.time()
> lapply(datafile, function(x)length(unique(x)))
$symbol.
[1] 1371

$date
[1] 261

$volume
[1] 53647

> Sys.time() - start
Time difference of 0.07129478 secs
> 
> start <- Sys.time()
> sapply(datafile, function(x)length(unique(x)))
          symbol.              date             volume 
             1371               261              53647 
> Sys.time() - start
Time difference of 0.06939292 secs

The datafile has around 3.5 million rows.

Quoting the help text:

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
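
To repeat the comparison on other data, system.time() is a convenient alternative to subtracting Sys.time() values. The following is only a sketch: the data frame big, its size, and the helper count_unique are hypothetical, and the timings will vary by machine.

```r
set.seed(1)

# Hypothetical test data: three character columns, 1e5 rows
big <- data.frame(a = sample(letters, 1e5, replace = TRUE),
                  b = sample(LETTERS[1:5], 1e5, replace = TRUE),
                  c = as.character(sample(1e4, 1e5, replace = TRUE)),
                  stringsAsFactors = FALSE)

count_unique <- function(x) length(unique(x))

# apply() converts the data frame to a character matrix before iterating
t_apply  <- system.time(r_apply  <- apply(big, 2, count_unique))
# sapply() iterates over the columns directly
t_sapply <- system.time(r_sapply <- sapply(big, count_unique))

# Both approaches should report the same counts
stopifnot(all(r_apply == r_sapply))

t_apply["elapsed"]
t_sapply["elapsed"]
```

For more reliable numbers, repeat each call several times and compare the medians rather than a single run.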

Solution 4

Using the lengths function:

lengths(lapply(Testdata, unique))

# var_1 var_2 var_3 
#     1     1     3 

Solution 5

Here, I've used dplyr and tidyr to count (using your Testdata data frame):

Testdata %>% 
  gather(var, value) %>% 
  distinct() %>% 
  count(var)

# # A tibble: 3 × 2
#     var     n
#   <chr> <int>
# 1 var_1     1
# 2 var_2     1
# 3 var_3     3
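
In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same pipeline with pivot_longer, assuming tidyr >= 1.0.0 and keeping the var and value column names used above:

```r
library(dplyr)
library(tidyr)

Testdata <- data.frame(var_1 = c("a", "a", "a"),
                       var_2 = c("b", "b", "b"),
                       var_3 = c("c", "d", "e"),
                       stringsAsFactors = FALSE)

res <- Testdata %>%
  # Stack all columns into (var, value) pairs
  pivot_longer(everything(), names_to = "var", values_to = "value") %>%
  # Keep each (var, value) pair once
  distinct() %>%
  # Count the remaining rows per original column
  count(var)
res
```

This approach stacks every column into one value column, so it works most smoothly when all columns share a type, as they do here.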

Author by

Zfunk

Updated on July 09, 2022

Comments

  • Zfunk almost 2 years

    I would like to return the count of the unique (distinct) values for every column in a data frame. For example, if I have the table:

     Testdata <- data.frame(var_1 = c("a","a","a"), var_2 = c("b","b","b"), var_3 = c("c","d","e"))
    
     var_1 | var_2 | var_3
     a     | b     | c 
     a     | b     | d
     a     | b     | e
    

    I would like the output to be:

     Variable | Unique_Values
     var_1    | 1
     var_2    | 1
     var_3    | 3
    

    I have tried playing around with loops using the unique function, e.g.

     for(i in names(Testdata)){
        # Code using unique function
     }
    

    However I suspect there is a simpler way.

  • A5C1D2H2I1M1N2O1R2T1 over 10 years
    @user2721117, I would suggest lapply over apply as an approach that scales better. For example: lapply(Testdata, function(x) length(unique(x))). Some bigger test data: Testdata <- data.frame(replicate(15, sample(letters[1:sample(26, 1)], 1e6, replace = TRUE)))
  • Roy Scheffers over 5 years
    While this might answer the author's question, it lacks explanatory text and/or links to documentation. Raw code snippets are not very helpful without some phrases around them. You may also find how to write a good answer very helpful. Please edit your answer.