What exactly does the Pandas random_state do?
As described in the documentation of pandas.DataFrame.sample, the random_state parameter accepts either an integer (as in your case) or a numpy.random.RandomState, which is a container for a Mersenne Twister pseudo-random number generator.
If you pass it an integer, pandas uses it as the seed for a pseudo-random number generator. As the name says, such a generator does not produce true randomness. Instead it has an internal state, initialized from the seed, that fully determines every number it will produce (for NumPy's global generator you can inspect this state by calling np.random.get_state()). When initialized with the same seed, it reproduces the same sequence of "random numbers", so the same rows are sampled. A different seed gives a different starting state and therefore a different sample.
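A minimal sketch of this behaviour, assuming a small made-up DataFrame in place of the CSV file from the question (the column name "value" and the sizes are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the CSV in the question.
df = pd.DataFrame({"value": np.arange(1000)})

# Same integer seed twice, then a different seed.
a = df.sample(n=5, random_state=123)
b = df.sample(n=5, random_state=123)
c = df.sample(n=5, random_state=12)

same_seed_matches = a.equals(b)        # same seed, same rows in the same order
different_seed_matches = a.equals(c)   # different seed, different rows in practice

print(same_seed_matches)
print(different_seed_matches)
```

The integer itself has no meaning beyond selecting the generator's starting state, so 123 is not "more random" than 12; it simply picks a different, equally reproducible sequence.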
If you pass it a RandomState, pandas uses this (already initialized and seeded) RandomState to generate pseudo-random numbers. This also gives you reproducible results: set a fixed seed when creating the RandomState and then pass that object around. You should in fact prefer this over setting the seed of NumPy's internal (global) RandomState. The reasoning is explained in this answer by Robert Kern and the comments on it. The idea is to have an independent stream, which prevents other parts of the program from breaking your reproducibility by changing the seed of NumPy's internal RandomState.
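The independent-stream idea can be sketched as follows. The DataFrame, the helper name draw_two_samples and the seed values are assumptions made for illustration. The helper draws two samples from one private RandomState, optionally reseeding NumPy's global generator in between.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; any DataFrame would do.
df = pd.DataFrame({"value": np.arange(1000)})

def draw_two_samples(disturb_global):
    """Draw two samples from a private, seeded RandomState."""
    rng = np.random.RandomState(123)  # independent stream with a fixed seed
    first = df.sample(n=5, random_state=rng)
    if disturb_global:
        # Simulate unrelated code touching NumPy's global generator.
        np.random.seed(999)
        np.random.rand(10)
    second = df.sample(n=5, random_state=rng)
    return first, second

f1, s1 = draw_two_samples(disturb_global=False)
f2, s2 = draw_two_samples(disturb_global=True)

# The private stream is unaffected by what happens to the global one.
independent = f1.equals(f2) and s1.equals(s2)
print(independent)
```

Note that reusing the same RandomState object advances its state, so the second call continues the sequence rather than repeating the first sample. Passing the integer 123 each time would instead restart from the same seed.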
Updated on October 13, 2022
Comments
-
Newskooler over 1 year
I have the following code where I use the Pandas random_state:

    import pandas as pd

    randomState = 123
    sampleSize = 750
    # filePath is the path to a whitespace-delimited data file
    df = pd.read_csv(filePath, delim_whitespace=True)
    df_s = df.sample(n=sampleSize, random_state=randomState)
This generates a sample dataframe df_s. Every time I run the code with the same randomState, I get the same sample df_s. When I change the value from 123 to 12, the sample changes as well, so I guess that's what the random_state does.
My silly question: how does changing the number affect the sample? I read the Pandas documentation and the Numpy documentation, but could not get a clear picture.
Any straightforward explanation with an example will be much appreciated.
-
shaik moeed over 2 years
Will setting np.random.seed() be enough to reproduce results when we use only numpy, pandas and sklearn?