TypeError: 'KFold' object is not iterable


Solution 1

KFold is a splitter, so you have to give it something to split.

example code:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]])
y = np.array([1, 2, 3, 4])

# To create the KFold splitter you only pass the number of splits and whether to shuffle.
fold = KFold(n_splits=2, shuffle=False)

# To iterate over the folds, use split()
for train_index, test_index in fold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # ... then fit the classifier

If you want an index for each train/test iteration, just add enumerate:

for i, (train_index, test_index) in enumerate(fold.split(X)):
    print('Iteration:', i)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
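
For reference, running that loop on the toy X and fold defined above prints exactly n_splits (here 2) train/test index pairs; the loop runs once per fold, not once per sample:

for i, (train_index, test_index) in enumerate(fold.split(X)):
    print('Iteration:', i, 'TRAIN:', train_index, 'TEST:', test_index)
# Iteration: 0 TRAIN: [2 3] TEST: [0 1]
# Iteration: 1 TRAIN: [0 1] TEST: [2 3]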

I hope this works

Solution 2

That depends on how you have imported KFold.

If you did this:

from sklearn.cross_validation import KFold

Then your code should work, because that version takes three parameters: the length of the array, the number of folds, and shuffle.

But if you are doing this:

from sklearn.model_selection import KFold

then that call will not work: you only pass the number of splits and shuffle, not the length of the array, and you also have to change the enumerate() loop so that it iterates over fold.split(X) instead of the KFold object itself.
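
To make the difference concrete, here is a minimal sketch of the two call patterns (the old cross_validation module was deprecated and later removed, so only the second form works on recent versions):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)   # any array-like with 5 samples

# Old API (sklearn.cross_validation), shown for comparison only:
#   kf = KFold(len(X), n_folds=5, shuffle=False)   # first argument: number of samples
#   for train_index, test_index in kf:             # the KFold object itself is iterable
#       ...

# New API (sklearn.model_selection):
kf = KFold(n_splits=5, shuffle=False)              # only n_splits and shuffle
for train_index, test_index in kf.split(X):        # iterate over kf.split(X), not kf
    print(train_index, test_index)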

By the way, model_selection is the new module and is the recommended one to use. Try it like this:

fold = KFold(n_splits=5, shuffle=False)

for train_index, test_index in fold.split(x_train_data):

    # Call the logistic regression model with a certain C parameter
    lr = LogisticRegression(C = c_param, penalty = 'l1')
    # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
    lr.fit(x_train_data.iloc[train_index,:], y_train_data.iloc[train_index,:].values.ravel())

    # Predict values using the test indices in the training data
    y_pred_undersample = lr.predict(x_train_data.iloc[test_index,:].values)

    # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
    recall_acc = recall_score(y_train_data.iloc[test_index,:].values,y_pred_undersample)
    recall_accs.append(recall_acc)
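
For completeness, here is a minimal self-contained sketch of that loop under the new API. The random x_train_data / y_train_data frames are hypothetical stand-ins for the kernel's undersampled training data, and solver='liblinear' is an added assumption because recent scikit-learn releases require an explicit solver that supports the 'l1' penalty:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical stand-in data; in the kernel these are the undersampled training frames.
rng = np.random.RandomState(0)
x_train_data = pd.DataFrame(rng.randn(100, 4))
y_train_data = pd.DataFrame(rng.randint(0, 2, size=(100, 1)))

c_param = 0.01
fold = KFold(n_splits=5, shuffle=False)

recall_accs = []
for iteration, (train_index, test_index) in enumerate(fold.split(x_train_data), start=1):
    # Recent scikit-learn versions need an explicit solver that supports the 'l1' penalty.
    lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

    # Fit on the training portion of the fold, then score the held-out portion.
    lr.fit(x_train_data.iloc[train_index, :], y_train_data.iloc[train_index, :].values.ravel())
    y_pred_undersample = lr.predict(x_train_data.iloc[test_index, :].values)

    recall_acc = recall_score(y_train_data.iloc[test_index, :].values.ravel(), y_pred_undersample)
    recall_accs.append(recall_acc)
    print('Iteration', iteration, ': recall score =', recall_acc)

print('Mean recall score:', np.mean(recall_accs))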
Author: kevinH

Updated on June 15, 2022

Comments

  • kevinH almost 2 years

    I'm following one of the kernels on Kaggle, mainly, I'm following A kernel for Credit Card Fraud Detection.

    I reached the step where I need to perform KFold in order to find the best parameters for Logistic Regression.

    The following code is shown in the kernel itself, but for some reason (probably an older version of scikit-learn) it gives me some errors.

    def printing_Kfold_scores(x_train_data,y_train_data):
        fold = KFold(len(y_train_data),5,shuffle=False) 
    
        # Different C parameters
        c_param_range = [0.01,0.1,1,10,100]
    
        results_table = pd.DataFrame(index = range(len(c_param_range),2), columns = ['C_parameter','Mean recall score'])
        results_table['C_parameter'] = c_param_range
    
        # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
        j = 0
        for c_param in c_param_range:
            print('-------------------------------------------')
            print('C parameter: ', c_param)
            print('-------------------------------------------')
            print('')
    
            recall_accs = []
            for iteration, indices in enumerate(fold,start=1):
    
                # Call the logistic regression model with a certain C parameter
                lr = LogisticRegression(C = c_param, penalty = 'l1')
    
                # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
                # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
                lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
    
                # Predict values using the test indices in the training data
                y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
    
                # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
                recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
                recall_accs.append(recall_acc)
                print('Iteration ', iteration,': recall score = ', recall_acc)
    
                # The mean value of those recall scores is the metric we want to save and get hold of.
            results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
            j += 1
            print('')
            print('Mean recall score ', np.mean(recall_accs))
            print('')
    
        best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
    
        # Finally, we can check which C parameter is the best amongst the chosen.
        print('*********************************************************************************')
        print('Best model to choose from cross validation is with C parameter = ', best_c)
        print('*********************************************************************************')
    
        return best_c
    

    The errors I'm getting are as follows. For this line: fold = KFold(len(y_train_data),5,shuffle=False) the error is:

    TypeError: __init__() got multiple values for argument 'shuffle'

    If I remove shuffle=False from this line, I get the following error:

    TypeError: shuffle must be True or False; got 5

    If I remove the 5 and keep shuffle=False, I get the following error:

    TypeError: 'KFold' object is not iterable, which comes from this line: for iteration, indices in enumerate(fold,start=1):

    If someone can help me solve this issue and suggest how it can be done with the latest version of scikit-learn, it will be much appreciated.

    Thanks.

  • kevinH over 6 years
    Hey Tzomas, thank you for the answer. It indeed solves the error, but I don't really get why I should split X: in the kernel itself it iterates 5 times for each C parameter, but in my case it iterates as many times as the length of X, which is much higher than 5. What is the problem here?
  • kevinH over 6 years
    Sorry, I think I had an error because of a typo; it did solve my issue, thank you!
  • kevinH over 6 years
    Thank you for the extra information Vivek. I'm indeed using model_selection and therefore didn't understand what caused this error; now I know, after you clarified it for me. Thank you.