Splitting a continuous variable into equal sized groups
Solution 1
try this:
split(das, cut(das$anim, 3))
if you want to split based on the value of wt
, then
library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))
anyway, you can do that by combining cut
, cut2
and split
.
UPDATED
if you want a group index as an additional column, then
das$group <- cut(das$anim, 3)
if the column should be index like 1, 2, ..., then
das$group <- as.numeric(cut(das$anim, 3))
UPDATED AGAIN
try this:
> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
anim wt wt2
1 1 181.0 1
2 2 179.0 1
3 3 180.5 1
4 4 201.0 2
5 5 201.5 2
6 6 245.0 2
7 7 246.4 3
8 8 189.3 1
9 9 301.0 3
10 10 354.0 3
11 11 369.0 3
12 12 205.0 2
13 13 199.0 1
14 14 394.0 3
15 15 231.3 2
Solution 2
Or see cut_number
from the ggplot2
package, e.g.
das$wt_2 <- as.numeric(cut_number(das$wt,3))
Note that cut(...,3)
divides the range of the original data into three ranges of equal lengths; it doesn't necessarily result in the same number of observations per group if the data are unevenly distributed (you can replicate what cut_number
does by using quantile
appropriately, but it's a nice convenience function). On the other hand, Hmisc::cut2()
using the g=
argument does split by quantiles, so is more or less equivalent to ggplot2::cut_number
. I might have thought that something like cut_number
would have made its way into dplyr
by so far, but as far as I can tell it hasn't.
Solution 3
Here's another solution using the bin_data()
function from the mltools package.
library(mltools)
# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")
# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")
# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")
das
anim wt wt2 wt3 wt4
1 1 181.0 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
2 2 179.0 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
3 3 180.5 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
4 4 201.0 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
5 5 201.5 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
6 6 245.0 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
7 7 246.4 [245.466666666667, 394] [179, 250.666666666667) [-Inf, 250)
8 8 189.3 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
9 9 301.0 [245.466666666667, 394] [250.666666666667, 322.333333333333) [250, 322)
10 10 354.0 [245.466666666667, 394] [322.333333333333, 394] [322, Inf]
11 11 369.0 [245.466666666667, 394] [322.333333333333, 394] [322, Inf]
12 12 205.0 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
13 13 199.0 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
14 14 394.0 [245.466666666667, 394] [322.333333333333, 394] [322, Inf]
15 15 231.3 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
Solution 4
If you want to split into 3 equally distributed groups, the answer is the same as Ben Bolker's answer above - use ggplot2::cut_number()
. For sake of completion here are the 3 methods of converting continuous to categorical (binning).
-
cut_number()
: Makes n groups with (approximately) equal numbers of observation -
cut_interval()
: Makes n groups with equal range -
cut_width()
: Makes groups of width
My go-to is cut_number()
because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.
library(tidyverse)
skewed_tbl <- tibble(
counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
rep(1:5, 5), rep(1:2, 10), rep(1, 20))
) %>%
mutate(
counts_cut_number = cut_number(counts, n = 4),
counts_cut_interval = cut_interval(counts, n = 4),
counts_cut_width = cut_width(counts, width = 25)
)
# Data
skewed_tbl
#> # A tibble: 265 x 4
#> counts counts_cut_number counts_cut_interval counts_cut_width
#> <dbl> <fct> <fct> <fct>
#> 1 1 [1,3] [1,25.8] [-12.5,12.5]
#> 2 2 [1,3] [1,25.8] [-12.5,12.5]
#> 3 3 [1,3] [1,25.8] [-12.5,12.5]
#> 4 4 (3,13] [1,25.8] [-12.5,12.5]
#> 5 5 (3,13] [1,25.8] [-12.5,12.5]
#> 6 6 (3,13] [1,25.8] [-12.5,12.5]
#> 7 7 (3,13] [1,25.8] [-12.5,12.5]
#> 8 8 (3,13] [1,25.8] [-12.5,12.5]
#> 9 9 (3,13] [1,25.8] [-12.5,12.5]
#> 10 10 (3,13] [1,25.8] [-12.5,12.5]
#> # ... with 255 more rows
summary(skewed_tbl$counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 3.00 13.00 25.75 42.00 100.00
# Histogram showing skew
skewed_tbl %>%
ggplot(aes(counts)) +
geom_histogram(bins = 30)
# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
ggplot(aes(counts_cut_number)) +
geom_bar()
# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
ggplot(aes(counts_cut_interval)) +
geom_bar()
# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
ggplot(aes(counts_cut_width)) +
geom_bar()
Created on 2018-11-01 by the reprex package (v0.2.1)
Solution 5
Alternative without using cut2.
das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))
or
das$wt2 <- as.factor( cut(das$wt,3, labels=F))
As pointed out by @ben-bolker this splits into equal-widths rather occupancy.
I think that using quantiles
one can approximate equal-occupancy
x = rnorm(10)
x
[1] -0.1074316 0.6690681 -1.7168853 0.5144931 1.6460280 0.7014368
[7] 1.1170587 -0.8503069 0.4462932 -0.1089427
bin = 3 #for 1/3 rd, 4 for 1/4, 100 for 1/100th etc
xx = cut(x, quantile(x, breaks=1/bin*c(1:bin)), labels=F, include.lowest=T)
table(xx)
1 2 3 4
3 2 2 3
baz
Updated on July 08, 2022Comments
-
baz almost 2 years
I need to split/divide up a continuous variable into 3 equal sized groups.
Example data frame:
das <- data.frame(anim = 1:15, wt = c(181,179,180.5,201,201.5,245,246.4, 189.3,301,354,369,205,199,394,231.3))
After being cut up (according to the value of
wt
), I would need to have the 3 classes under the new variablewt2
like this:> das anim wt wt2 1 1 181.0 1 2 2 179.0 1 3 3 180.5 1 4 4 201.0 2 5 5 201.5 2 6 6 245.0 2 7 7 246.4 3 8 8 189.3 1 9 9 301.0 3 10 10 354.0 3 11 11 369.0 3 12 12 205.0 2 13 13 199.0 1 14 14 394.0 3 15 15 231.3 2
This would be applied to a large data set.
-
Ben almost 9 yearsYou can remove the as.numeric and use
cut(das$anim, 3, labels=FALSE)
-
pir over 8 yearsThis should be updated so it is clear that it is different from the answer by @Ben below. I mistakenly used this code in the belief that it would divide the observations evenly.
-
Ben Bolker over 8 yearsare you sure that the
Hmisc::cut2()
solution doesn't? Can you give a small example where it doesn't? -
Ben Bolker over 8 yearsI think this splits into equal-width rather than equal-occupancy bins ?
-
wolfsatthedoor over 4 yearsThis should be best answer, wish I had seen this first...!
-
ForceLeft415 about 3 yearsConfusing to me why this is the accepted answer, when the question specifically says "equal sized groups", which
cut()
doesn't achieve.