Random Sample of a subset of a dataframe in Pandas
Solution 1
You could add a "section" column to your data, then perform a groupby and sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
# x section
# 0 0 0
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 0
# ... ... ...
# 99995 99995 99
# 99996 99996 99
# 99997 99997 99
# 99998 99998 99
# 99999 99999 99
#
# [100000 rows x 2 columns]
sample = df.groupby("section").sample(50)
# >>> sample
# x section
# 907 907 0
# 494 494 0
# 775 775 0
# 20 20 0
# 230 230 0
# ... ... ...
# 99740 99740 99
# 99272 99272 99
# 99863 99863 99
# 99198 99198 99
# 99555 99555 99
#
# [5000 rows x 2 columns]
Append .query("section == 42") (or a similar filter) if you are interested in only one particular section.
Note this requires pandas 1.1.0 or later; see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
For older versions, see the answer by @msh5678
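If the DataFrame is already ordered in blocks, as in the question, the "section" column can be derived from the row position. The following is a minimal sketch, assuming 1,000 rows per section and pandas 1.1.0 or later; the names section_size, sample_42 and direct_42, and the random_state=0 seed, are illustrative choices and not part of the original answer.

```python
import numpy as np
import pandas as pd

section_size = 1_000  # assumed block length, taken from the question
df = pd.DataFrame({"x": np.arange(100 * section_size)})

# Integer-divide the row position by the block length to label each row.
df["section"] = np.arange(len(df)) // section_size

# 50 rows from every section, then keep only section 42.
sample_42 = (
    df.groupby("section")
    .sample(50, random_state=0)
    .query("section == 42")
)

# Same kind of sample, without sampling the other 99 sections first.
direct_42 = df[df["section"] == 42].sample(50, random_state=0)
```

Filtering before sampling does less work when only one section is needed; the groupby form is more convenient when a sample from every section is wanted.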
Solution 2
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*On one of the section DataFrames.
Note: If you request a sample larger than the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
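To apply this to the question, one of the section DataFrames has to be selected first. The following is a rough sketch, assuming the 100,000-row frame is ordered in consecutive blocks of 1,000 rows; section_size, i, section_df and section_sample are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100_000)})

section_size = 1_000  # assumed block length
i = 42                # hypothetical section number, counting from 0

# Rows i*1000 .. (i+1)*1000 - 1 form section i.
section_df = df.iloc[i * section_size : (i + 1) * section_size]

# 50 distinct rows from that section; random_state makes the draw repeatable.
section_sample = section_df.sample(50, random_state=0)
```

sample returns a new DataFrame and leaves both df and section_df unchanged.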
Solution 3
One solution is to use the choice function from numpy.
Say you want 50 entries out of 1000; you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This of course does not take your block structure into account. If you want a 50-item sample from block i, for example, you can do:
import numpy as np
# i is the zero-based block number, e.g. i = 42
block_start_idx = 1000 * i
# offsets within the block, drawn without replacement
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
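The same idea can be wrapped in a small helper. This is only a sketch: sample_block and its parameters are hypothetical names, and it uses NumPy's newer Generator interface (np.random.default_rng) rather than the np.random.choice call shown above, so that a seed can be passed for repeatable draws.

```python
import numpy as np
import pandas as pd


def sample_block(df, i, block_size=1_000, n=50, seed=None):
    """Return n distinct rows from block i (0-based) of an ordered DataFrame."""
    rng = np.random.default_rng(seed)
    # Offsets within the block, drawn without replacement.
    offsets = rng.choice(block_size, size=n, replace=False)
    # Shift the offsets to the start of block i and select by position.
    return df.iloc[i * block_size + offsets]


df = pd.DataFrame({"x": np.arange(100_000)})
block_7 = sample_block(df, i=7, seed=0)
```

Because selection is positional (iloc), this works regardless of the DataFrame's index labels, as long as the rows are stored in block order.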
Solution 4
Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So instead of sample = df.groupby("section").sample(50), I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))
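With apply, the result is typically indexed by both the section and the original row label. If a flat index is preferred, one alternative that does not rely on DataFrameGroupBy.sample is to iterate over the groups and concatenate the per-group samples. This is a sketch of that alternative, not the method from the answer above; pieces is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)

# Iterating over a groupby yields (key, sub-DataFrame) pairs;
# sample 50 rows from each sub-DataFrame.
pieces = [grp.sample(50) for _, grp in df.groupby("section")]

# Stack the per-section samples; the original row labels are kept.
sample = pd.concat(pieces)
```

The result has 50 rows per section, in section order, with the original single-level index.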
WGP
Updated on March 08, 2021
Comments
-
WGP about 3 years
Say I have a dataframe with 100,000 entries and want to split it into 100 sections of 1000 entries.
How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered such that the first 1000 results are the first section, the next 1000 are the next section, and so on.
Many thanks
-
hoang tran over 5 years
Can you please explain what replace does? The documentation is not clear to me. Thank you!
-
Andy Hayden over 5 years
@hoang It takes a "sample with replacement", so if you have a dataset of size 5 you can take a sample of size 10. Also, if you take a sample of N elements from a dataset of size N, without replacement it will contain every element; with replacement it may not. E.g. see statisticshowto.datasciencecentral.com/…
-
goryh almost 5 years
@hoang tran replace means whether to sample with or without replacement. Without replacement means once a line is picked it cannot be picked again (e.g. I pull a marble out of the bag and do not put it back in, so I cannot draw it again). With replacement means that I can sample the same line again (e.g. after drawing a marble I put it back in the bag before drawing the next marble, so I can get the same one again).
-
Whynote almost 4 years
@goryh Until when does this happen? I mean, if you repeat this over a certain number of iterations you should end up with an empty dataframe, right?
-
goryh almost 4 years
@whynote pandas.DataFrame.sample() does not actually change the dataframe. My marble explanation is about what sampling with or without replacement means generally, not how pandas implements it.
-
Jeff about 3 years
This doesn't answer the question - it misses taking samples from each of the groups. I've added an answer on how to do that.
-
Jeff about 3 years
groupby sample was just added in pandas 1.1.0; see the docs here: pandas.pydata.org/pandas-docs/stable/reference/api/…. Thanks for pointing that out, I'll update my answer.