train_test_split with multiple features

python python-3.x pandas dataframe scikit-learn

11,079

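If you look at the signature of sklearn.model_selection.train_test_split, you can see it takes an *arrays argument, i.e. any number of arrays with the same number of rows, all split on the same indices. To split the first three of your arrays, therefore, you could use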

CS_tr, CS_te, EN_tr, EN_te, SN_tr, SN_te = train_test_split(CS, EN, SN)

(of course, you can pass more arrays than that).

Note that current versions of sklearn return sparse matrices when given sparse matrices, so the one-hot encoded arrays stay sparse after the split.
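As a rough illustration of the call above, here is a minimal sketch on toy data. The small arrays are hypothetical stand-ins for the real columns (they are not taken from the question's CSV), and test_size and random_state are only chosen to make the example reproducible.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows each (stand-ins for the real columns).
CS = sparse.csr_matrix(np.eye(10))                     # sparse, like an OHL output
EN = sparse.csr_matrix(np.arange(20).reshape(10, 2))   # another sparse array
PW = np.arange(10, dtype=float)                        # dense 1-D numeric column

# Every array is split on the same shuffled row indices.
CS_tr, CS_te, EN_tr, EN_te, PW_tr, PW_te = train_test_split(
    CS, EN, PW, test_size=3, random_state=0
)

print(CS_tr.shape, CS_te.shape)   # sparse inputs come back sparse
print(PW_tr.shape, PW_te.shape)   # dense inputs come back as ndarrays
```

Because the same indices are used for every array, row i of CS_tr still corresponds to row i of EN_tr and PW_tr.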


Ekkasit Smithipanon

Updated on June 08, 2022

Comments

Ekkasit Smithipanon almost 2 years

I'm currently trying to train a decision tree classifier on a data set, but I couldn't get train_test_split to work.

In the code below, CS is the target output and EN, SN, JT, FT, PW, YR, LO and LA are the input features.

All variables that went through OHL are in sparse matrix format, whereas the others are arrays taken straight from the dataframe.

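import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def OHL(x, column):  # one-hot encode a single dataframe column
    le = LabelEncoder()
    enc = OneHotEncoder()
    # Map the column's values (cast to str) to integer labels
    Labeled = le.fit_transform(x[column].astype(str))
    # Encode the labels as a sparse one-hot matrix
    return enc.fit_transform(Labeled.reshape(-1,1))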

###------------------------------------------------------------------------

df = pd.read_csv('h1b_kaggle.csv')
df = df.drop(['Unnamed: 0','WORKSITE'], axis=1)

###------------------------------------------------------------------------

CS = OHL(df, 'CASE_STATUS')
EN = OHL(df, 'EMPLOYER_NAME')
SN = OHL(df, 'SOC_NAME')
JT = OHL(df, 'JOB_TITLE')
FT = OHL(df, 'FULL_TIME_POSITION')
PW = np.array(df['PREVAILING_WAGE'])
YR = OHL(df, 'YEAR')
LO = np.array(df['lon'])
LA = np.array(df['lat'])

Ekkasit Smithipanon about 6 years

But after I do this and want to use tree.DecisionTreeClassifier, don't I have to group these into one variable? The fit function takes only one feature array and one target.
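One possible way to get a single feature variable is to stack the columns side by side with scipy.sparse.hstack before splitting. This is a minimal sketch on hypothetical toy data, not the original dataset. It assumes the dense columns are reshaped to column vectors first and that the target is kept as a 1-D array of class labels rather than a one-hot sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 40

# Hypothetical stand-ins for the real columns.
EN = sparse.csr_matrix(np.eye(4)[rng.integers(0, 4, size=n)])  # sparse one-hot feature
PW = rng.normal(size=n)                                        # dense numeric feature
y = rng.integers(0, 2, size=n)                                 # 1-D class labels

# Stack all features into one sparse matrix of shape (n_samples, n_features).
# The dense 1-D column is turned into a sparse column vector first.
X = sparse.hstack([EN, sparse.csr_matrix(PW.reshape(-1, 1))], format="csr")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_tr, y_tr)          # one feature matrix, one target
pred = clf.predict(X_te)
print(X.shape, pred.shape)
```

With the features combined like this, train_test_split only needs two arrays (X and y), and fit gets the single feature matrix it expects.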