What is "random-state" in sklearn.model_selection.train_test_split example?

python numpy machine-learning random scikit-learn

98,939

Solution 1

Isn't that obvious? 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything.

On a serious note, random_state simply sets the seed of the random number generator, so that your train-test split is deterministic: the same seed produces the same split on every run. If you don't set a seed, the split is different each time.

Relevant documentation:

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
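As a rough illustration of the three accepted forms described in the documentation quote, the sketch below passes an int, a RandomState instance, and None to train_test_split. The toy data mirrors the question's example; the variable names (split_a, rng, and so on) are made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

# 1) int: used as the seed, so two calls give the same split.
split_a = train_test_split(X, y, test_size=0.33, random_state=42)
split_b = train_test_split(X, y, test_size=0.33, random_state=42)
same_with_int = np.array_equal(split_a[1], split_b[1])  # index 1 is X_test
print(same_with_int)  # True

# 2) RandomState instance: used directly as the random number generator.
rng = np.random.RandomState(42)
split_c = train_test_split(X, y, test_size=0.33, random_state=rng)

# 3) None: falls back to NumPy's global random state, so the split
#    can differ from one call (or one run) to the next.
split_d = train_test_split(X, y, test_size=0.33, random_state=None)
print(split_d[1].shape)  # (2, 2): two of the five rows go to the test set
```

Note that a RandomState instance advances each time it is used, so passing the same rng object to a second call would generally not repeat the first split.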

Solution 2

If you don't specify random_state in the code, then every time you run (execute) your code a new random value is generated, and the train and test datasets will have different values each time.

However, if you assign a fixed value such as random_state=0, 1, 42, or any other integer, then no matter how many times you execute your code the result will be the same, i.e., the same values in the train and test datasets.
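A small sketch of this behaviour, assuming a toy list of 20 integers as the dataset. The helper name held_out is hypothetical; it just returns the test portion of one split so repeated calls can be compared.

```python
from sklearn.model_selection import train_test_split

data = list(range(20))  # toy dataset, assumed for illustration


def held_out(random_state=None):
    """Return the test portion of one split as a tuple."""
    _, test = train_test_split(data, test_size=0.25, random_state=random_state)
    return tuple(test)


# Five splits with a fixed seed collapse to a single distinct test set.
seeded_runs = {held_out(random_state=0) for _ in range(5)}
print(len(seeded_runs))  # 1

# Five splits without a seed will usually produce several distinct test sets.
unseeded_runs = {held_out() for _ in range(5)}
print(len(unseeded_runs))
```

The particular integer does not matter; 0, 1, and 42 each give a different but equally repeatable split.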

Solution 3

Random state ensures that the splits that you generate are reproducible. Scikit-learn uses random permutations to generate the splits. The random state that you provide is used as a seed to the random number generator. This ensures that the random numbers are generated in the same order.
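The idea of a seeded permutation can be sketched with NumPy alone. This is not scikit-learn's internal code, only an illustration that two generators created with the same seed emit the same sequence of permutations in the same order.

```python
import numpy as np

# Two independent generators created with the same seed.
rng_one = np.random.RandomState(42)
rng_two = np.random.RandomState(42)

# Each generator is asked for two permutations of the indices 0..9.
first_a, second_a = rng_one.permutation(10), rng_one.permutation(10)
first_b, second_b = rng_two.permutation(10), rng_two.permutation(10)

# The sequences match draw for draw.
print(np.array_equal(first_a, first_b))    # True
print(np.array_equal(second_a, second_b))  # True

# Each draw is a reordering of the same indices.
print(sorted(first_a.tolist()) == list(range(10)))  # True
```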

Solution 4

When random_state is not defined in the code, the training data changes on every run, and the accuracy may change with it. When random_state is set to a constant integer, the training data stays the same on every run, which makes debugging easier.
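To make the effect on accuracy concrete, here is a minimal sketch. The iris dataset and the k-nearest-neighbours classifier are choices made for this example only (the classifier has no randomness of its own, so any variation comes from the split), and accuracy_for_split is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)


def accuracy_for_split(random_state):
    """Split with the given seed, fit a classifier, return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=random_state)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    return model.score(X_test, y_test)


# Same seed every time: identical training data, identical accuracy.
fixed = [accuracy_for_split(42) for _ in range(3)]
print(fixed)

# Different seeds: different training data, so the accuracy may vary.
varied = [accuracy_for_split(seed) for seed in range(3)]
print(varied)
```

With a constant seed, any change in the score must come from a change in the code or the model, which is what makes debugging easier.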

Solution 5

The random state is simply the lot number of the set generated randomly in any operation. We can specify this lot number whenever we want the same set again.



Saurabh

Updated on July 09, 2022

Comments

Saurabh almost 2 years

Can someone explain to me what random_state means in the example below?

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

Why is it hard coded to 42?

Kim Kern over 3 years

Does this answer your question? Random state (Pseudo-random number) in Scikit learn

Danrex over 5 years

That first sentence was more than enough.

Pleastry over 3 years

@cs95 Do I have to generate a new random_state for subsequent methods in my code? For example, if I set the random state as 42 for the train_test_split, do I set the random state also as 42 for the classifier I will be using on the split data? What about if I want to compare two different classifiers, do I use the same random state for both classifiers?

cs95 over 3 years

@Turtle I think you are looking to set a global seed so your pipeline is deterministic. This will only make the split deterministic, nothing else. Consider using something like np.random.seed or creating a random state object that is then reused across functions.

vanetoj over 2 years

But if you use it in train_test_split, do you still need to use it when you run each algorithm?

Maxl Gemeinderat almost 2 years

How is the random_state saved? For example does it matter if I run my code on different Colab-Notebooks on different accounts?