How to estimate the progress of a GridSearchCV from verbose output in Scikit-Learn?

The math is simple, but a little misleading at first sight:

  1. When each task starts, the logging mechanism writes a '[CV] ...' line to stdout announcing the start of execution. When the task ends, it writes another '[CV] ...' line with the time spent on that task appended at the end of the line.

  2. Additionally, at intervals, the logging mechanism writes a progress line to stderr (or to stdout if you set verbose above 50) reporting the number of completed tasks (fits) and the total elapsed time so far, like this one:

    [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s

For your case, you have 5842368 total fits, i.e. tasks.
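That total can be checked by hand: the number of candidates is the product of the lengths of the parameter lists, and the number of fits is that product times the number of folds. A minimal sketch, with the list lengths read off the `param_test` grid in the question (the names `grid_sizes` and `n_folds` are only illustrative):

```python
from math import prod

# Number of values per hyperparameter, read off the param_test grid in the question.
grid_sizes = {
    "loss": 1,
    "learning_rate": 7,
    "min_samples_split": 12,
    "min_samples_leaf": 12,
    "max_depth": 3,
    "max_features": 2,
    "min_impurity_split": 3,
    "criterion": 2,
    "subsample": 7,
    "n_estimators": 1,
}
n_folds = 23  # size of the custom cross-validation list

n_candidates = prod(grid_sizes.values())  # every combination of parameter values
total_fits = n_candidates * n_folds       # one fit per candidate per fold
print(n_candidates, total_fits)
```

Both numbers agree with the "Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits" line in the log.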

You counted 7047332 '[CV] ...' lines, which corresponds to roughly 7047332/2 = 3523666 finished tasks, since each finished task produces two lines. The progress line shows exactly how many tasks have completed: 3523550. The line count is only approximate because some tasks had started but not yet finished at the time of counting.
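To turn these numbers into a progress estimate, one option is to parse the total from the "Fitting ..." header and the completed count from the latest progress line. The sketch below is only a rough illustration: the helper names are hypothetical, the regular expressions assume the line formats shown in this log (elapsed time in `s` or `min`), and the remaining-time figure assumes every fit takes about the same time on average, which need not hold across different hyperparameter settings.

```python
import re

# Patterns assume the line formats that appear in the log shown in the question.
TOTAL_RE = re.compile(r"totalling\s+(\d+)\s+fits")
PROGRESS_RE = re.compile(r"Done\s+(\d+)\s+tasks\s*\|\s*elapsed:\s*([\d.]+)\s*(s|min)")


def parse_total_fits(header_line):
    """Extract N from the 'Fitting ... totalling N fits' header line."""
    match = TOTAL_RE.search(header_line)
    if match is None:
        raise ValueError("no 'totalling N fits' found in line")
    return int(match.group(1))


def parse_progress(progress_line):
    """Return (completed_tasks, elapsed_minutes) from a 'Done N tasks | elapsed: ...' line."""
    match = PROGRESS_RE.search(progress_line)
    if match is None:
        raise ValueError("line does not look like a progress line")
    done = int(match.group(1))
    value = float(match.group(2))
    minutes = value / 60.0 if match.group(3) == "s" else value
    return done, minutes


def estimate(total_fits, done, elapsed_min):
    """Fraction complete and remaining minutes, assuming a constant average time per fit."""
    if done == 0:
        raise ValueError("no completed tasks yet, cannot extrapolate")
    fraction = done / total_fits
    remaining_min = elapsed_min * (total_fits - done) / done
    return fraction, remaining_min


header = "Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits"
last = "[Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min"

total = parse_total_fits(header)
done, elapsed = parse_progress(last)
fraction, remaining = estimate(total, done, elapsed)
print(f"{fraction:.1%} done, roughly {remaining:.0f} min remaining")

# Cross-check against the '[CV]' line count: two lines per finished task.
cv_lines = 7047332
approx_finished = cv_lines // 2
print(approx_finished - done, "more tasks implied by the '[CV]' count than by the progress line")
```

The small gap between the two counts is consistent with tasks that had already printed their start line but had not finished when the lines were counted.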

Author by

O.rka

I am an academic researcher studying machine-learning and microorganisms

Updated on June 25, 2022

Comments

  • O.rka
    O.rka almost 2 years

    Right now I'm running a pretty aggressive grid search. I have n=135 samples and I am running 23 folds using a custom cross-validation train/test list. I set verbose=2.

    The following is what I ran:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    # cv_indices: the custom cross-validation train/test list (23 folds), defined elsewhere
    param_test = {
        "loss": ["deviance"],
        "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
        "min_samples_split": np.linspace(0.1, 0.5, 12),
        "min_samples_leaf": np.linspace(0.1, 0.5, 12),
        "max_depth": [3, 5, 8],
        "max_features": ["log2", "sqrt"],
        "min_impurity_split": [5e-6, 1e-7, 5e-7],
        "criterion": ["friedman_mse", "mae"],
        "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
        "n_estimators": [10],
    }

    Mod_gsearch = GridSearchCV(estimator=GradientBoostingClassifier(),
                               param_grid=param_test, scoring="accuracy",
                               n_jobs=32, iid=False, cv=cv_indices, verbose=2)
    

    I took a look at the verbose output in stdout:

    $ head gridsearch.o8475533
    Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits
    

    Based on this, it looks like there are 5842368 permutations of cross-validation pairs using my grid params.

    $ grep -c  "[CV]" gridsearch.o8475533
    7047332 
    

    It looks like there are around 7 million cross-validations that have been done so far but that's more than the 5842368 total fits...

    7047332/5842368 = 1.2062458236
    

    Then when I look at the stderr file:

    $ cat ./gridsearch.e8475533
    [Parallel(n_jobs=32)]: Done 132 tasks      | elapsed:    1.2s
    [Parallel(n_jobs=32)]: Done 538 tasks      | elapsed:    2.8s
    [Parallel(n_jobs=32)]: Done 1104 tasks      | elapsed:    4.8s
    [Parallel(n_jobs=32)]: Done 1834 tasks      | elapsed:    7.9s
    [Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s
    ...
    [Parallel(n_jobs=32)]: Done 3396203 tasks      | elapsed: 250.2min
    [Parallel(n_jobs=32)]: Done 3420769 tasks      | elapsed: 276.5min
    [Parallel(n_jobs=32)]: Done 3447309 tasks      | elapsed: 279.3min
    [Parallel(n_jobs=32)]: Done 3484240 tasks      | elapsed: 282.3min
    [Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min
    

    My goal:

    How can I know the progress of my gridsearch with respect to the total time it may take?

    What I'm confused about:

    What is the relationship between [CV] lines in stdout, total # of fits in stdout, and tasks in stderr?