KeyError: "None of [Int64Index([...], dtype='int64')] are in the [columns]"


Solution 1

You created your scaled_inputs_all DataFrame using the loc function, so it most likely does not have a consecutive integer index.

On the other hand, you created shuffled_indices as a shuffle of just a range of consecutive numbers.

Remember that scaled_inputs_all[shuffled_indices] performs label-based lookup, not positional lookup: given a list of integers, a DataFrame searches its column labels for them (which is why the error says none of them "are in the [columns]"), while a Series would search its index labels.

Maybe you should write:

scaled_inputs_all.iloc[shuffled_indices]

Note that iloc provides integer-location based indexing, regardless of index values, i.e. just what you need.
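To see the difference concretely, here is a minimal sketch with a made-up DataFrame whose index labels do not match its row positions:

    import numpy as np
    import pandas as pd

    # Hypothetical frame whose index is not 0..n-1 (e.g. after filtering)
    df = pd.DataFrame({'a': [10, 20, 30]}, index=[5, 7, 9])

    shuffled_indices = np.arange(df.shape[0])   # positions [0, 1, 2]
    np.random.shuffle(shuffled_indices)

    # df[shuffled_indices] treats the integers as COLUMN labels -> KeyError
    # df.loc[shuffled_indices] looks for index LABELS 0, 1, 2 -> KeyError too
    print(df.iloc[shuffled_indices])            # positional: always works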

Solution 2

You might also get the same error when working with KFold in machine learning.

The solution is as follows:


You need to use iloc:

 X_train, X_test = X.iloc[train_index], X.iloc[test_index]

 y_train, y_test = y.iloc[train_index], y.iloc[test_index]
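Here is a minimal, self-contained sketch of that pattern on synthetic data (the names and shapes here are placeholders, not from the question):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    # Synthetic data whose index labels differ from row positions
    X = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'f3'],
                     index=np.arange(100, 110))
    y = pd.Series(np.random.randint(0, 2, 10), index=X.index)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_index, test_index in kf.split(X):
        # split() yields row POSITIONS, so index with iloc, not []
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]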

Solution 3

I had this problem too. I solved it by converting the DataFrame and Series to arrays.

Try the following line of code:

scaled_inputs_all.iloc[shuffled_indices].values 
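Spelled out with the question's variable names, the array route looks like this (a sketch; .to_numpy() assumes pandas 0.24+, otherwise keep .values):

    # NumPy arrays are indexed purely by position, so the shuffle just works
    inputs_array = scaled_inputs_all.to_numpy()
    targets_array = targets_all.to_numpy()

    shuffled_inputs = inputs_array[shuffled_indices]
    shuffled_targets = targets_array[shuffled_indices]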

Solution 4

If you reset your DataFrame's index after dropping rows, this should prevent the KeyError.

You can do this by running the following after df.drop:

df = df.reset_index(drop=True)

Or, equivalently:

df.reset_index(drop=True, inplace=True)
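For example, with a small hypothetical frame:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3, 4]})
    df = df.drop(index=1)            # index labels are now 0, 2, 3
    df = df.reset_index(drop=True)   # back to 0, 1, 2: label == position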



Comments

  • Ashley E. over 3 years

    I'm trying to shuffle my indices using the np.random.shuffle() method, but I keep getting an error that I don't understand. I'd appreciate it if someone could help me puzzle this out. Thank you!

    I've tried using delimiter=',' and delim_whitespace=0 when creating my raw_csv_data variable at the beginning, since I saw that given as the solution to a similar problem, but it kept throwing the same error.

        import pandas as pd 
        import numpy as np 
        from sklearn.preprocessing import StandardScaler
    
        #%%
        raw_csv_data= pd.read_csv('Absenteeism-data.csv')
        print(raw_csv_data)
        #%%
        df= raw_csv_data.copy()
        print(display(df))
        #%%
        pd.options.display.max_columns=None
        pd.options.display.max_rows=None
        print(display(df))
        #%%
        print(df.info())
        #%%
        df=df.drop(['ID'], axis=1)
    
        #%%
        print(display(df.head()))
    
        #%%
        #Our goal is to see who is more likely to be absent. Let's define
        #our targets from our dependent variable, Absenteeism Time in Hours
        print(df['Absenteeism Time in Hours'])
        print(df['Absenteeism Time in Hours'].median())
        #%%
        targets= np.where(df['Absenteeism Time in Hours']>df['Absenteeism Time in Hours'].median(),1,0)
        #%%
        print(targets)
        #%%
        df['Excessive Absenteeism']= targets
        #%%
        print(df.head())
    
        #%%
        #Let's Separate the Day and Month Values to see if there is correlation
        #between Day of week/month with absence
        print(type(df['Date'][0]))
        #%%
        df['Date']= pd.to_datetime(df['Date'], format='%d/%m/%Y')
        #%%
        print(df['Date'])
        print(type(df['Date'][0]))
        #%%
        #Extracting the Month Value
        print(df['Date'][0].month)
        #%%
        list_months=[]
        print(list_months)
        #%%
        print(df.shape)
        #%%
        for i in range(df.shape[0]):
            list_months.append(df['Date'][i].month)
        #%%
        print(list_months)
        #%%
        print(len(list_months))
        #%%
        #Let's Create a Month Value Column for df
        df['Month Value']= list_months
        #%%
        print(df.head())
        #%%
        #Now let's extract the day of the week from date
        df['Date'][699].weekday()
        #%%
        def date_to_weekday(date_value):
            return date_value.weekday()
        #%%
        df['Day of the Week']= df['Date'].apply(date_to_weekday)
        #%%
        print(df.head())
        #%%
        df= df.drop(['Date'], axis=1)
        #%%
        print(df.columns.values)
        #%%
        reordered_columns= ['Reason for Absence', 'Month Value', 'Day of the Week',
                            'Transportation Expense', 'Distance to Work', 'Age',
                            'Daily Work Load Average', 'Body Mass Index', 'Education',
                            'Children', 'Pets', 'Absenteeism Time in Hours',
                            'Excessive Absenteeism']
        #%%
        df=df[reordered_columns]
        print(df.head())
        #%%
        #First Checkpoint
        df_date_mod= df.copy()
        #%%
        print(df_date_mod)
    
        #%%
        #Let's Standardize our inputs, ignoring the Reasons and Education columns
        #because they are labelled by a separate categorical criteria, not numerically
        print(df_date_mod.columns.values)
        #%%
        unscaled_inputs= df_date_mod.loc[:, ['Month Value', 'Day of the Week',
                                             'Transportation Expense', 'Distance to Work',
                                             'Age', 'Daily Work Load Average',
                                             'Body Mass Index', 'Children', 'Pets',
                                             'Absenteeism Time in Hours']]
        #%%
        print(display(unscaled_inputs))
        #%%
        absenteeism_scaler= StandardScaler()
        #%%
        absenteeism_scaler.fit(unscaled_inputs)
        #%%
        scaled_inputs= absenteeism_scaler.transform(unscaled_inputs)
        #%%
        print(display(scaled_inputs))
        #%%
        print(scaled_inputs.shape)
        #%%
        scaled_inputs= pd.DataFrame(scaled_inputs,
                                    columns=['Month Value', 'Day of the Week',
                                             'Transportation Expense', 'Distance to Work',
                                             'Age', 'Daily Work Load Average',
                                             'Body Mass Index', 'Children', 'Pets',
                                             'Absenteeism Time in Hours'])
        print(display(scaled_inputs))
        #%%
        df_date_mod= df_date_mod.drop(['Month Value', 'Day of the Week',
                                       'Transportation Expense', 'Distance to Work',
                                       'Age', 'Daily Work Load Average',
                                       'Body Mass Index', 'Children', 'Pets',
                                       'Absenteeism Time in Hours'], axis=1)
        print(display(df_date_mod))
        #%%
        df_date_mod=pd.concat([df_date_mod,scaled_inputs], axis=1)
        print(display(df_date_mod))
        #%%
        df_date_mod= df_date_mod[reordered_columns]
        print(display(df_date_mod.head()))
        #%%
        #Checkpoint
        df_date_scale_mod= df_date_mod.copy()
        print(display(df_date_scale_mod.head()))
        #%%
        #Let's Analyze the Reason for Absence Category
        print(df_date_scale_mod['Reason for Absence'])
        #%%
        print(df_date_scale_mod['Reason for Absence'].min())
        print(df_date_scale_mod['Reason for Absence'].max())
        #%%
        print(df_date_scale_mod['Reason for Absence'].unique())
        #%%
        print(len(df_date_scale_mod['Reason for Absence'].unique()))
        #%%
        print(sorted(df['Reason for Absence'].unique()))
        #%%
        reason_columns= pd.get_dummies(df['Reason for Absence'])
        print(reason_columns)
        #%%
        reason_columns['check']= reason_columns.sum(axis=1)
        print(reason_columns)
        #%%
        print(reason_columns['check'].sum(axis=0))
        #%%
        print(reason_columns['check'].unique())
        #%%
        reason_columns=reason_columns.drop(['check'], axis=1)
        print(reason_columns)
        #%%
        reason_columns=pd.get_dummies(df_date_scale_mod['Reason for Absence'], 
        drop_first=True)
        print(reason_columns)
        #%%
        print(df_date_scale_mod.columns.values)
        #%%
        print(reason_columns.columns.values)
        #%%
        df_date_scale_mod= df_date_scale_mod.drop(['Reason for Absence'], 
        axis=1)
        print(df_date_scale_mod)
        #%%
        reason_type_1= reason_columns.loc[:, 1:14].max(axis=1)
        reason_type_2= reason_columns.loc[:, 15:17].max(axis=1)
        reason_type_3= reason_columns.loc[:, 18:21].max(axis=1)
        reason_type_4= reason_columns.loc[:, 22:].max(axis=1)
        #%%
        print(reason_type_1)
        print(reason_type_2)
        print(reason_type_3)
        print(reason_type_4)
        #%%
        print(df_date_scale_mod.head())
        #%%
        df_date_scale_mod= pd.concat([df_date_scale_mod, 
        reason_type_1,reason_type_2, reason_type_3, reason_type_4], axis=1)
        print(df_date_scale_mod.head())
        #%%
        print(df_date_scale_mod.columns.values)
        #%%
        column_names= ['Month Value','Day of the Week','Transportation Expense',
         'Distance to Work','Age','Daily Work Load Average','Body Mass Index',
         'Education','Children','Pets','Absenteeism Time in Hours',
         'Excessive Absenteeism', 'Reason_1', 'Reason_2', 'Reason_3', 
         'Reason_4']
    
        df_date_scale_mod.columns= column_names
        print(df_date_scale_mod.head())
        #%%
        column_names_reordered= ['Reason_1', 'Reason_2', 'Reason_3', 
        'Reason_4','Month Value','Day of the Week','Transportation Expense',
         'Distance to Work','Age','Daily Work Load Average','Body Mass Index',
         'Education','Children','Pets','Absenteeism Time in Hours',
         'Excessive Absenteeism']
    
        df_date_scale_mod=df_date_scale_mod[column_names_reordered]
        print(display(df_date_scale_mod.head()))
        #%%
        #Checkpoint
        df_date_scale_mod_reas= df_date_scale_mod.copy()
        print(df_date_scale_mod_reas.head())
        #%%
        #Let's Look at the Education column now
        print(df_date_scale_mod_reas['Education'].unique())
        #This shows us that education is rated from 1-4 based on level
        #of completion
        #%%
        print(df_date_scale_mod_reas['Education'].value_counts())
        #The overwhelming majority of workers are highschool educated, while the
        #rest have higher degrees
        #%%
        #We'll create our dummy variables as highschool and higher education
        df_date_scale_mod_reas['Education']= df_date_scale_mod_reas['Education'].map({1:0, 2:1, 3:1, 4:1})
        #%%
        print(df_date_scale_mod_reas['Education'].unique())
        #%%
        print(df_date_scale_mod_reas['Education'].value_counts())
        #%%
        #Checkpoint
        df_preprocessed= df_date_scale_mod_reas.copy()
        print(display(df_preprocessed.head()))
        #%%
        #Split Inputs from targets
        scaled_inputs_all= df_preprocessed.loc[:,'Reason_1':'Absenteeism Time in Hours']
        print(display(scaled_inputs_all.head()))
        print(scaled_inputs_all.shape)
        #%%
        targets_all= df_preprocessed.loc[:,'Excessive Absenteeism']
        print(display(targets_all.head()))
        print(targets_all.shape)
        #%%
        #Shuffle Inputs and targets
        shuffled_indices= np.arange(scaled_inputs_all.shape[0])
        np.random.shuffle(shuffled_indices)
        shuffled_inputs= scaled_inputs_all[shuffled_indices]
        shuffled_targets= targets_all[shuffled_indices]
    

    This is the error I keep getting when I try to shuffle my indices:

    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>
          1 shuffled_indices= np.arange(scaled_inputs_all.shape[0])
          2 np.random.shuffle(shuffled_indices)
    ----> 3 shuffled_inputs= scaled_inputs_all[shuffled_indices]
          4 shuffled_targets= targets_all[shuffled_indices]

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
       2932                 key = list(key)
       2933                 indexer = self.loc._convert_to_indexer(key, axis=1,
    -> 2934                                                        raise_missing=True)
       2935
       2936         # take() does not accept boolean indexers

    ~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
       1352                 kwargs = {'raise_missing': True if is_setter else
       1353                           raise_missing}
    -> 1354                 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
       1355             else:
       1356                 try:

    ~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
       1159         self._validate_read_indexer(keyarr, indexer,
       1160                                     o._get_axis_number(axis),
    -> 1161                                     raise_missing=raise_missing)
       1162         return keyarr, indexer
       1163

    ~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
       1244             raise KeyError(
       1245                 u"None of [{key}] are in the [{axis}]".format(
    -> 1246                     key=key, axis=self.obj._get_axis_name(axis)))
       1247
       1248         # We (temporarily) allow for some missing keys with .loc, except in

    KeyError: "None of [Int64Index([560, 320, 405, 141, 154, 370, 656,  26, 444, 307,
                ...
                429, 542, 676, 588, 315, 284, 293, 607, 197, 250],
               dtype='int64', length=700)] are in the [columns]"

    • Nilesh Ingle about 5 years
      A reproducible example with a few rows of the dataframe would help to look into this issue.
    • Ashley E. about 5 years
      @NileshIngle do you want me to send you the dataset that I was working with?
  • Ashley E. about 5 years
    omg it worked! Thank you! Can you please explain why again? I didn't really understand your explanation up above
  • Ashley E. about 5 years
    would I need to write targets_all.iloc[shuffled_indices] too?
  • Valdi_Bo about 5 years
    Yes, but make sure that both DataFrames contain the same number of rows. Otherwise you may get an index out of range error. Maybe the range of indices should be different.
  • mayaaa over 4 years
    I have the same error when I run the following code. Could you tell me how to solve it? skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) for train_index, test_index in skf.split(x, y): x_train, x_test = x[train_index], x[test_index] y_train, y_test = y[train_index], y[test_index]
  • Sean William over 2 years
    This one solved my problem!