Python convert string to categorical - numpy

The conversion with astype('category') works just fine for me (pandas 0.19.0):

In [155]: train
Out[155]:
         day  clustDep  clustArr car2  clustRoute  scheduled_seg  delayed
0   Saturday        12        15   AA           1              5        1
1    Tuesday        12        15   AA           1              1        1
2    Tuesday        12        15   AA           1              5        1
3   Saturday        12        13   AA           4              3        1
4   Saturday         2        13   AB           6              3        1
5  Wednesday         2        13   IB           6              3        1
6     Monday         2        13   EY           6              3        0
7     Friday         2        13   EY           6              3        1
8   Saturday        11        13   AC           6              5        1
9     Friday        11        13   DL           6              5        1

In [156]: train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 7 columns):
day              10 non-null object
clustDep         10 non-null int64
clustArr         10 non-null int64
car2             10 non-null object
clustRoute       10 non-null int64
scheduled_seg    10 non-null int64
delayed          10 non-null int64
dtypes: int64(5), object(2)
memory usage: 640.0+ bytes

In [157]: train.day = train.day.astype('category')

In [158]: train.car2 = train.car2.astype('category')

In [159]: train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 7 columns):
day              10 non-null category
clustDep         10 non-null int64
clustArr         10 non-null int64
car2             10 non-null category
clustRoute       10 non-null int64
scheduled_seg    10 non-null int64
delayed          10 non-null int64
dtypes: category(2), int64(5)
memory usage: 588.0 bytes
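
The message in the question is a SettingWithCopyWarning, not an exception, so the assignment may still have gone through. One common way to avoid it is to take an explicit copy of the column subset before assigning. A minimal sketch, assuming a small stand-in `df` built from a few of the sample rows (the stand-in frame and the reduced column list are illustrative, not the asker's full data):

```python
import pandas as pd

# Hypothetical stand-in for the asker's df, using values from the sample rows.
df = pd.DataFrame({
    "day": ["Saturday", "Tuesday", "Tuesday", "Saturday", "Saturday"],
    "clustDep": [12, 12, 12, 12, 2],
    "car2": ["AA", "AA", "AA", "AA", "AB"],
    "delayed": [1, 1, 1, 1, 1],
})

# An explicit copy makes `train` an independent DataFrame, so assigning
# columns to it no longer looks like writing into a slice of `df`.
train = df[["day", "clustDep", "car2", "delayed"]].copy()

# Convert the string columns to the categorical dtype.
for col in ["day", "car2"]:
    train[col] = train[col].astype("category")

print(train.dtypes)
```

Note that in the question's code, `train` is reassigned from `df` a few lines after the conversion, so any converted column would be replaced by the original string column again.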
Author: Jan Sila

Student of life, quantitative finance, mathematics and economics from Czech Republic https://cz.linkedin.com/in/jansila

Updated on June 29, 2022

Comments

  • Jan Sila
    Jan Sila almost 2 years

    I'm desperately trying to convert my string variables, day and car2, to categorical in the following dataset.

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 23653 entries, 0 to 23652
    Data columns (total 7 columns):
    day              23653 non-null object
    clustDep         23653 non-null int64
    clustArr         23653 non-null int64
    car2             23653 non-null object
    clustRoute       23653 non-null int64
    scheduled_seg    23653 non-null int64
    delayed          23653 non-null int64
    dtypes: int64(5), object(2)
    memory usage: 1.4+ MB
    None
    

    I have tried everything that is on SO, as you can see in the code sample below. I'm running Python 2.7 and numpy 1.11.1. I tried scikits.tools.categorical, but to no avail; it won't even load the namespace. This is my code:

    import numpy as np
    #from scikits.statsmodels import sm
    
    trainId = np.random.choice(range(df.shape[0]), size=int(df.shape[0]*0.8), replace=False)
    train = df[['day', 'clustDep', 'clustArr', 'car2', 'clustRoute', 'scheduled_seg', 'delayed']]
    
    #for col in ['day', 'car2', 'scheduled_seg']:
    #    train[col] = train.loc[:, col].astype('category')
    
    train['day'] = train['day'].astype('category')
    #train['day'] = sm.tools.categorical(train, cols='day', drop=True)
    #train['car2C'] = train['car2'].astype('category')
    #train['scheduled_segC'] = train['scheduled_seg'].astype('category')
    
    
    train = df.loc[trainId, train.columns]
    testId = np.in1d(df.index.values, trainId, invert=True)
    test = df.loc[testId, train.columns]
    
    
    #from sklearn import tree
    #clf = tree.DecisionTreeClassifier()
    #clf = clf.fit(train.drop(['delayed'], axis=1), train['delayed'])
    

    This yields the following warning:

    /Users/air/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    

    Any help would be greatly appreciated. Thanks a lot!

    --- UPDATE --- sample data:

             day  clustDep  clustArr car2  clustRoute  scheduled_seg  delayed
    0   Saturday        12        15   AA           1              5        1
    1    Tuesday        12        15   AA           1              1        1
    2    Tuesday        12        15   AA           1              5        1
    3   Saturday        12        13   AA           4              3        1
    4   Saturday         2        13   AB           6              3        1
    5  Wednesday         2        13   IB           6              3        1
    6     Monday         2        13   EY           6              3        0
    7     Friday         2        13   EY           6              3        1
    8   Saturday        11        13   AC           6              5        1
    9     Friday        11        13   DL           6              5        1
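
Using the sample rows above, here is a rough sketch of one way to keep the categorical dtype through the train/test split: convert the columns once on a copy, then split. The names `data`, `train` and `test`, the reduced column list, and the use of `DataFrame.sample` with a fixed `random_state` are assumptions for illustration, not the asker's original approach.

```python
import pandas as pd

# Sample rows from the question (only some of the columns, for brevity).
df = pd.DataFrame({
    "day": ["Saturday", "Tuesday", "Tuesday", "Saturday", "Saturday",
            "Wednesday", "Monday", "Friday", "Saturday", "Friday"],
    "car2": ["AA", "AA", "AA", "AA", "AB", "IB", "EY", "EY", "AC", "DL"],
    "delayed": [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# Convert on an explicit copy first, so both splits inherit the dtype
# and share the same set of categories.
data = df.copy()
for col in ["day", "car2"]:
    data[col] = data[col].astype("category")

# 80/20 split: sample the training rows, then drop them to get the test rows.
train = data.sample(frac=0.8, random_state=0)
test = data.drop(index=train.index)

print(train.dtypes)
print(len(train), len(test))
```

Converting before splitting means a category that happens to appear only in the test rows is still known to the training frame, which can matter when the columns are encoded later.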