What exactly does the Pandas random_state do?

18,327

As described in the documentation of pandas.DataFrame.sample, the random_state parameter accepts either an integer (as in your case) or a numpy.random.RandomState, which is a container for a Mersenne Twister pseudo random number generator.

If you pass it an integer, pandas uses it as the seed for a pseudo random number generator. As the name suggests, the generator does not produce true randomness. Instead, it has an internal state, initialized from the seed, that fully determines the numbers it produces (for NumPy's global generator you can inspect this state by calling np.random.get_state()). When initialized with the same seed, it reproduces the same sequence of "random numbers", which is why the same seed always gives you the same sample and a different seed gives a different one.
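A minimal sketch of this behaviour, assuming a small made-up DataFrame in place of the CSV file from the question (the column name `value` and the sizes are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the data read from the CSV file.
df = pd.DataFrame({"value": np.arange(100)})

# The same integer seed initializes the generator to the same state,
# so both calls draw exactly the same rows.
first = df.sample(n=5, random_state=123)
second = df.sample(n=5, random_state=123)
same_seed_matches = first.equals(second)

# A different seed starts from a different state, so the rows drawn
# will generally differ from the ones above.
other = df.sample(n=5, random_state=12)

print(same_seed_matches)
print(first.index.tolist())
print(other.index.tolist())
```

The seed value itself has no meaning beyond selecting a starting state; 123 is not "more random" than 12.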

If you pass it a RandomState, it will use this (already initialized/seeded) RandomState to generate pseudo random numbers. This also gives you reproducible results: set a fixed seed when initializing the RandomState and then pass that RandomState around. In fact, you should prefer this over setting the seed of NumPy's internal (global) RandomState. The reasoning is explained in this answer by Robert Kern and the comments on it. The idea is to have an independent stream, which prevents other parts of the program from messing up your reproducibility by changing the seed of NumPy's internal RandomState.
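As a rough illustration of the independent-stream idea, again assuming a made-up DataFrame (the names `rng`, `rng_again` and the seeds are arbitrary choices for this sketch):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; any DataFrame would do.
df = pd.DataFrame({"value": np.arange(100)})

# A dedicated generator, seeded once and then passed around.
rng = np.random.RandomState(123)
a = df.sample(n=5, random_state=rng)
b = df.sample(n=5, random_state=rng)  # continues the same stream

# Repeat the experiment, but reseed NumPy's global generator in between,
# as some other part of a program might do.
rng_again = np.random.RandomState(123)
a_again = df.sample(n=5, random_state=rng_again)
np.random.seed(999)  # affects only the global generator, not rng_again
b_again = df.sample(n=5, random_state=rng_again)

# Both draws are reproduced despite the global reseeding.
reproducible = a.equals(a_again) and b.equals(b_again)
print(reproducible)
```

Note that the RandomState advances with each use, so `a` and `b` are successive draws from one stream rather than two copies of the same sample.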


Newskooler
Author by

Newskooler

I love to ski and snowmobile. When off the mountain, I enjoy spending time surfing and bjj. All questions, answers, comments and code are from me personally, and in no way represent the opinions of my employer.

Updated on October 13, 2022

Comments

  • Newskooler
    Newskooler over 1 year

    I have the following code where I use the Pandas random_state

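    import pandas as pd

    randomState = 123   # seed for the pseudo random number generator
    sampleSize = 750    # number of rows to draw
    # filePath is the path to a whitespace-delimited text file (defined elsewhere)
    df = pd.read_csv(filePath, sep=r"\s+")
    df_s = df.sample(n=sampleSize, random_state=randomState)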
    

    This generates a sample dataframe df_s. Every time I run the code with the same randomState, I get the same sample df_s. When I change the value from 123 to 12 the sample changes as well, so I guess that's what the random_state does.

    My silly question: How does changing the number affect the sample? I read the Pandas documentation and the NumPy documentation, but could not get a clear picture.

    Any straightforward explanation with an example will be much appreciated.

  • shaik moeed
    shaik moeed over 2 years
    Is setting np.random.seed() enough to reproduce results when we use only numpy, pandas and sklearn?