Random Sample of a subset of a dataframe in Pandas

Solution 1

You could add a "section" column to your data then perform a groupby and sample:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
#            x  section
# 0          0        0
# 1          1        0
# 2          2        0
# 3          3        0
# 4          4        0
# ...      ...      ...
# 99995  99995       99
# 99996  99996       99
# 99997  99997       99
# 99998  99998       99
# 99999  99999       99
#
# [100000 rows x 2 columns]

sample = df.groupby("section").sample(50)
# >>> sample
#            x  section
# 907      907        0
# 494      494        0
# 775      775        0
# 20        20        0
# 230      230        0
# ...      ...      ...
# 99740  99740       99
# 99272  99272       99
# 99863  99863       99
# 99198  99198       99
# 99555  99555       99
#
# [5000 rows x 2 columns]

If you are interested in only one particular section, append .query("section == 42") (or whatever condition you need) to the result.
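For the single-section case, a minimal sketch could filter first and sample afterwards, so the other 99 sections are never sampled. It assumes the frame is ordered in consecutive blocks of 1,000 rows, as in the question; deriving `section` by integer division of the row position and passing `random_state=0` are assumptions added here for illustration and reproducibility.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000 * 100)})

# For block-ordered data, row position // 1000 gives the section number.
df["section"] = np.arange(len(df)) // 1_000

# Keep only section 42, then draw 50 rows from it without replacement.
one_section = df.query("section == 42").sample(50, random_state=0)
```

This uses only DataFrame.sample, so it does not depend on the groupby sample method.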

Note that this requires pandas 1.1.0 or later; see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html

For older versions, see the answer by @msh5678

Solution 2

You can use the sample method*:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2

*On one of the section DataFrames.
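As a sketch of how "one of the section DataFrames" could be obtained, the block below slices one section by position and samples from it. It assumes the sections are consecutive blocks of 1,000 rows; the names `section_size`, `i` and `section_df` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000 * 100)})

section_size = 1_000
i = 3  # zero-based section number (illustrative)

# Rows i*1000 up to (but not including) (i+1)*1000 form section i.
section_df = df.iloc[i * section_size : (i + 1) * section_size]

# Draw 50 rows from that section only.
section_sample = section_df.sample(50)
```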

Note: If the sample size is larger than the size of the DataFrame, this will raise an error unless you sample with replacement.

In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
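The difference between the two modes can be checked directly. The following is a small sketch using the same four-row frame; the variable names are illustrative. Without replacement, a sample as large as the frame is a reshuffle of every row; with replacement, rows can repeat. In both cases the original frame is left untouched.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

# Without replacement: each row can appear at most once,
# so a sample of size 4 contains every row in shuffled order.
without_repl = df.sample(4)

# With replacement: rows may repeat, so the sample can exceed len(df).
with_repl = df.sample(10, replace=True)

# sample() returns a new DataFrame; df itself still has 4 rows.
rows_after = len(df)
```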

Solution 3

One solution is to use the choice function from numpy.

Say you want 50 entries out of the first 1,000 rows; you can use:

import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]

This is of course not considering your block structure. If you want a 50 item sample from block i for example, you can do:

import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
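The same idea can be sketched with NumPy's Generator interface, which makes the draw reproducible when seeded. The seed value, `block_size` and the choice of `i = 7` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000 * 100)})

rng = np.random.default_rng(seed=0)  # seeded only for reproducibility
block_size = 1_000
i = 7  # zero-based block number (illustrative)

# Draw 50 distinct offsets within a block, then shift them to block i.
offsets = rng.choice(block_size, size=50, replace=False)
df_block_sample = df.iloc[i * block_size + offsets]
```

Because `offsets` is a NumPy array, adding the scalar block start shifts every position at once.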

Solution 4

Thank you, Jeff, but I received an error:

AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method

So, instead of sample = df.groupby("section").sample(50), I suggest using the command below:

df.groupby('section').apply(lambda grp: grp.sample(50))
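A slightly fuller sketch of this workaround is shown below. Passing `group_keys=False` is an addition made here, not part of the original suggestion: it keeps the original row index rather than adding "section" as an extra index level. On recent pandas releases, apply may emit a deprecation warning about grouping columns, and the groupby sample method is the simpler choice there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)

# Sample 50 rows from each section by calling DataFrame.sample per group.
# group_keys=False keeps the original row labels as the result's index.
sample = df.groupby("section", group_keys=False).apply(
    lambda grp: grp.sample(50)
)
```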
Author: WGP

Updated on March 08, 2021

Comments

  • WGP
    WGP about 3 years

    Say I have a dataframe with 100,000 entries and want to split it into 100 sections of 1000 entries.

    How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered such that the first 1000 results are the first section, the next 1000 are the next section, and so on.

    many thanks

  • hoang tran
    hoang tran over 5 years
    Can you please explain what replace does? The documentation is not clear to me. Thank you!
  • Andy Hayden
    Andy Hayden over 5 years
    @hoang it takes a "sample with replacement", so if you have a dataset of size 5 you can take a sample of size 10. Also, if you take a sample of N elements from a dataset of size N, then without replacement the sample will contain every element, whereas with replacement it may not. E.g. see statisticshowto.datasciencecentral.com/…
  • goryh
    goryh almost 5 years
    @hoang tran replace means whether to sample with or without replacement. Without replacement means once a line is picked it cannot be picked again (e.g. I pull a marble out of the bag and do not put it back in so I cannot draw it again). With replacement means that I can sample the same line again (e.g. after drawing a marble I put it back in the bag before drawing the next marble so I can get the same one again).
  • Whynote
    Whynote almost 4 years
    @goryh Until when does this happen? I mean, if you repeat this over a certain number of iterations, you should end up with an empty dataframe, right?
  • goryh
    goryh almost 4 years
    @whynote pandas.DataFrame.sample() does not actually change the dataframe. My marble explanation is about what sampling with or without replacement means generally, not how pandas implements it.
  • Jeff
    Jeff about 3 years
    this doesn't answer the question - it misses taking samples from each of the groups, I've added an answer on how to do that.
  • Jeff
    Jeff about 3 years
    groupby sample was just added in pandas 1.1.0, see the docs here: pandas.pydata.org/pandas-docs/stable/reference/api/…. Thanks for pointing out, I'll update my answer.