How to fix "IndexError: tuple index out of range" in Python?

Solution 1

The root cause of your issue is that, while you ask for the evaluation of 6 models in GridSearchCV, you provide parameters only for the first 2:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

The result of enumerate(zip(models, params)) in this setting, i.e.:

for i, (model, param) in enumerate(zip(models, params)):
    print((model, param))

is

(SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), {'C': [0.01, 1]})
(RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False), {'n_estimators': [10, 20]})

i.e. the last 4 models are simply ignored, so you get empty entries for them in cv:

print(cv)
# result:
[[5950.6018771284835, 5987.293514740653, 6055.368320208183, 6099.316091619069, 6146.478702335218], [3625.3243553665975, 3301.3552182952058, 3404.3321983193728, 3521.5160621260898, 3561.254684271113], [], [], [], []]

which causes the downstream error when trying to compute np.mean(cv, 1).
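To see concretely why those empty entries break the mean, here is a minimal reproduction with a ragged list of the same shape (cv_demo and its numbers are purely illustrative; the exact exception depends on your NumPy version, with older releases raising the IndexError: tuple index out of range from the traceback and recent ones refusing to build the ragged array in the first place):

import numpy as np

# Same ragged shape as cv above: scores for the first two models,
# empty lists for the remaining four (the numbers are arbitrary).
cv_demo = [[5950.6, 5987.3], [3625.3, 3301.4], [], [], [], []]

try:
    print(np.mean(cv_demo, 1))
except Exception as e:          # exception type varies with the NumPy version
    print(type(e).__name__, e)  # e.g. IndexError: tuple index out of range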

The solution, as already correctly pointed out by Psi in their answer, is to use empty dictionaries for the models on which you don't actually perform any CV search. Omitting the XGBRegressor (not installed here), these are the results:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso()]
params2 = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}]

cv = [[] for _ in range(len(models))]
fold = KFold(5, shuffle=False)
for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)

where print(cv) gives:

[[4048.660483326826, 3973.984055352062, 3847.7215568088545, 3907.0566348092684, 3820.0517432992765], [1037.9378737329769, 1025.237441119364, 1016.549294695313, 993.7083268195154, 963.8115632611381], [2.2948917095935095e-26, 1.971022007799432e-26, 4.1583774042712844e-26, 2.0229469068846665e-25, 1.9295075684919642e-26], [0.0003350178681602639, 0.0003297411022124562, 0.00030834076832371557, 0.0003355298330301431, 0.00032049282437794516], [10.372789356303688, 10.137748082073076, 10.136028304131141, 10.499159069700834, 9.80779910439471]]

and print(np.mean(cv, 1)) works OK, giving:

[3.91949489e+03 1.00744890e+03 6.11665355e-26 3.25824479e-04
 1.01907048e+01]

So, in your case, you should indeed change params to:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]

as already suggested by Psi.
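If you also want to pick up the best-fitted models with their parameters, as asked in the question, a minimal sketch (not part of the original answers) is to read GridSearchCV's best_params_ attribute inside the same loop; best_params is just an illustrative name, and the snippet reuses models, params2, fold, X and y from above:

cv = [[] for _ in range(len(models))]
best_params = [[] for _ in range(len(models))]

for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        cv[i].append(mean_squared_error(y[ts], best_m.predict(X[ts])))
        # an empty grid {} yields best_params_ == {}, i.e. the model's defaults
        best_params[i].append(best_m.best_params_)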

Solution 2

When you define

cv = [[] for _ in range(len(models))]

it has an empty list for each model. In the loop, however, you go over enumerate(zip(models, params)), which has only two elements, since your params list has two elements (because list(zip(x, y)) has length equal to min(len(x), len(y))).
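A quick way to see this truncation (the names below are purely illustrative):

models_demo = ["SVR", "RandomForest", "Linear", "Ridge", "Lasso", "XGB"]
params_demo = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

print(list(zip(models_demo, params_demo)))
# [('SVR', {'C': [0.01, 1]}), ('RandomForest', {'n_estimators': [10, 20]})]
# only two pairs are produced, so cv[2:] is never touched and those lists stay empty.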

Hence, you get an IndexError because some of the lists in cv are empty (all but the first two) when you calculate the mean with np.mean.

Solution: If you don't need to use GridSearchCV on the remaining models, you may just extend the params list with empty dictionaries:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]

Comments

  • Jerry07 almost 2 years

    I am using sklearn modules to find the best-fitting models and model parameters. However, I get an unexpected IndexError, shown below:

    > IndexError                                Traceback (most recent call last)
    > <ipython-input-38-ea3f99e30226> in <module>
    >      22             s = mean_squared_error(y[ts], best_m.predict(X[ts]))
    >      23             cv[i].append(s)
    > ---> 24     print(np.mean(cv, 1))
    > IndexError: tuple index out of range
    

    What I want to do is find the best-fitting regressor and its parameters, but I get the above error. I looked into SO and tried this solution, but the same error still comes up. Any idea how to fix this bug? Can anyone point out why this error is happening? Any thoughts?

    my code:

    import warnings
    import numpy as np
    
    from sklearn.model_selection import KFold, GridSearchCV
    from sklearn.metrics import mean_squared_error
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from xgboost.sklearn import XGBRegressor
    from sklearn.datasets import make_regression
    
    models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
    params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]
    
    X, y = make_regression(n_samples=10000, n_features=20)
    
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")
        cv = [[] for _ in range(len(models))]
        fold = KFold(5, shuffle=False)
        for tr, ts in fold.split(X):
            for i, (model, param) in enumerate(zip(models, params)):
                best_m = GridSearchCV(model, param)
                best_m.fit(X[tr], y[tr])
                s = mean_squared_error(y[ts], best_m.predict(X[ts]))
                cv[i].append(s)
        print(np.mean(cv, 1))
    

    desired output:

    If there is a way to fix the above error, I expect to pick up the best-fitted models with their parameters and then use them for estimation. Any idea how to improve the above attempt? Thanks

    • Jerry07 almost 5 years
      @desertnaut How do you think I can optimize this code? Any better idea?
    • desertnaut almost 5 years
      That's a very general question, but doing a grid search in each one of 5 folds sounds like overkill. I kindly suggest you open another question asking for advice on this (be sure to make your code fully reproducible, including all relevant imports).
    • hpaulj almost 5 years
      The error can be reproduced with np.mean([], 1), which supports the idea that cv is [] or contains [] lists.
  • Jerry07 almost 5 years
    I don't think this is the answer to this question. Please read the SO community rules.
  • Psi almost 5 years
    @Dan Since you haven't posted an MWE I can't verify with certainty that this is the solution, but it works with your code after importing the appropriate modules, and it matches the output you gave in the comments for cv (see the last edit for the specific change you would have to make to params).
  • desertnaut almost 5 years
    This is the correct answer indeed (upvoted); can't understand the downvotes. I proceed to explain in more detail...