Splitting a continuous variable into equal sized groups

148,442

Solution 1

try this:

split(das, cut(das$anim, 3))

if you want to split based on the value of wt, then

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

anyway, you can do that by combining cut, cut2 and split.
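To make the result of split() concrete, here is a minimal base-R sketch that uses the example data frame from the question. It only inspects the list that split() returns; the variable name groups is an assumption made for the illustration.

```r
# Example data from the question
das <- data.frame(anim = 1:15,
                  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                         189.3, 301, 354, 369, 205, 199, 394, 231.3))

# split() returns a named list of data frames, one per level of the factor
groups <- split(das, cut(das$anim, 3))

sapply(groups, nrow)                     # number of rows in each group
sapply(groups, function(d) mean(d$wt))   # mean wt within each group
```

Each list element is an ordinary data frame, so any per-group computation can be applied with lapply() or sapply().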

UPDATED

if you want a group index as an additional column, then

das$group <- cut(das$anim, 3)

if the column should be index like 1, 2, ..., then

das$group <- as.numeric(cut(das$anim, 3))
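As a small illustration of the two column types above, the sketch below compares the factor returned by cut() with its integer codes, and uses the labels argument to name the groups. The label names are made up for the example.

```r
das <- data.frame(anim = 1:15)

grp_factor <- cut(das$anim, 3)                  # factor with interval labels
grp_index  <- cut(das$anim, 3, labels = FALSE)  # plain integer codes 1, 2, 3
# Hypothetical group names, chosen only for this illustration
grp_named  <- cut(das$anim, 3, labels = c("first", "second", "third"))

class(grp_factor)   # "factor"
class(grp_index)    # "integer"
levels(grp_named)
```

The integer codes are the same values that as.numeric() produces from the factor, so either form can be used for the index column.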

UPDATED AGAIN

try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

Solution 2

Or see cut_number from the ggplot2 package, e.g.

library(ggplot2) # cut_number
das$wt_2 <- as.numeric(cut_number(das$wt,3))

Note that cut(..., 3) divides the range of the original data into three intervals of equal length; it does not necessarily put the same number of observations in each group if the data are unevenly distributed. You can replicate what cut_number does by passing the output of quantile() as the breaks to cut(), but cut_number is a nice convenience function. Hmisc::cut2() with the g= argument, on the other hand, does split by quantiles, so it is more or less equivalent to ggplot2::cut_number. I would have expected something like cut_number to have made its way into dplyr by now, but as far as I can tell it hasn't.
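The difference between equal-width and equal-count binning can be seen with base R alone. The sketch below uses the example data from the question; the variable names equal_width and equal_count are assumptions made for the illustration, and the quantile-based version is only an approximation of what cut_number does, not its actual implementation.

```r
# Example data from the question
das <- data.frame(anim = 1:15,
                  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                         189.3, 301, 354, 369, 205, 199, 394, 231.3))

# Equal-width bins: the range of wt is cut into three intervals of equal length
equal_width <- cut(das$wt, 3, labels = FALSE)

# Equal-count bins: the tertile boundaries from quantile() are used as breaks
breaks <- quantile(das$wt, probs = seq(0, 1, length.out = 4))
equal_count <- cut(das$wt, breaks = breaks, labels = FALSE, include.lowest = TRUE)

table(equal_width)   # 11, 1 and 3 observations
table(equal_count)   # 5 observations in each group
```

include.lowest = TRUE is needed so that the minimum value, which sits exactly on the first break, is not dropped as NA.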

Solution 3

Here's another solution using the bin_data() function from the mltools package.

library(mltools)

# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")

# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")

# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")

das
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)

Solution 4

If you want to split into 3 groups with equal numbers of observations, the answer is the same as Ben Bolker's answer above: use ggplot2::cut_number(). For the sake of completeness, here are the 3 ggplot2 functions for converting a continuous variable to a categorical one (binning).

  • cut_number(): Makes n groups with (approximately) equal numbers of observations
  • cut_interval(): Makes n groups with equal range
  • cut_width(): Makes groups of a given width

My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        ) 

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>           
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)

# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)
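With heavily tied data like the skewed example above, quantile breaks cannot always give groups of exactly the same size, because identical values always land in the same bin. If equal group sizes matter more than keeping ties together, one option is to bin by rank. The following is a minimal base-R sketch; rank_bins is a hypothetical helper written for this illustration, not a function from ggplot2 or any other package.

```r
# Hypothetical helper: assign each value to one of n groups by rank, so that
# group sizes differ by at most one even when many values are tied
rank_bins <- function(x, n) {
  r <- rank(x, ties.method = "first")  # ranks 1..length(x), ties broken by position
  ceiling(r * n / length(x))
}

# Same skewed counts as in the example above
counts <- c(1:100, 1:50, 1:20, rep(1:10, 3),
            rep(1:5, 5), rep(1:2, 10), rep(1, 20))

bins <- rank_bins(counts, 4)
table(bins)
```

The trade-off is that two observations with the same value can end up in different groups, so this is only appropriate when the group sizes are the priority.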

Solution 5

Alternative without using cut2.

das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))

or

das$wt2 <- as.factor( cut(das$wt,3, labels=F))

As pointed out by @ben-bolker, this splits into equal-width rather than equal-occupancy bins. Using quantiles as the breaks approximates equal occupancy:

x = rnorm(10)
x
 [1] -0.1074316  0.6690681 -1.7168853  0.5144931  1.6460280  0.7014368
 [7]  1.1170587 -0.8503069  0.4462932 -0.1089427
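bin = 3 # 3 for thirds, 4 for quarters, 100 for percentiles, etc.
# quantile() takes probs=, not breaks=; an unrecognised breaks= is ignored and the
# default quartiles are used, which is why this originally gave four groups (3 2 2 3)
xx = cut(x, quantile(x, probs=seq(0, 1, length.out=bin+1)), labels=F, include.lowest=T)
table(xx) # now three groups of roughly equal size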
baz
Author by

baz

Updated on July 08, 2022

Comments

  • baz
    baz almost 2 years

    I need to split/divide up a continuous variable into 3 equal sized groups.

    Example data frame:

    das <- data.frame(anim = 1:15,
                      wt = c(181,179,180.5,201,201.5,245,246.4,
                             189.3,301,354,369,205,199,394,231.3))
    

    After being cut up (according to the value of wt), I would need to have the 3 classes under the new variable wt2 like this:

    > das 
       anim    wt wt2
    1     1 181.0   1
    2     2 179.0   1
    3     3 180.5   1
    4     4 201.0   2
    5     5 201.5   2
    6     6 245.0   2
    7     7 246.4   3
    8     8 189.3   1
    9     9 301.0   3
    10   10 354.0   3
    11   11 369.0   3
    12   12 205.0   2
    13   13 199.0   1
    14   14 394.0   3
    15   15 231.3   2
    

    This would be applied to a large data set.

  • Ben
    Ben almost 9 years
    You can remove the as.numeric and use cut(das$anim, 3, labels=FALSE)
  • pir
    pir over 8 years
    This should be updated so it is clear that it is different from the answer by @Ben below. I mistakenly used this code in the belief that it would divide the observations evenly.
  • Ben Bolker
    Ben Bolker over 8 years
    are you sure that the Hmisc::cut2() solution doesn't? Can you give a small example where it doesn't?
  • Ben Bolker
    Ben Bolker over 8 years
    I think this splits into equal-width rather than equal-occupancy bins ?
  • wolfsatthedoor
    wolfsatthedoor over 4 years
    This should be best answer, wish I had seen this first...!
  • ForceLeft415
    ForceLeft415 about 3 years
    Confusing to me why this is the accepted answer, when the question specifically says "equal sized groups", which cut() doesn't achieve.