How to estimate the progress of a GridSearchCV from verbose output in Scikit-Learn?
The math is simple, but a little misleading at first sight:
When each task starts, the logging mechanism writes a '[CV] ...' line to stdout noting the start of execution, and when the task ends it writes another line that adds the time spent on that particular task (at the end of the line).
Additionally, at certain time intervals, the logging mechanism writes a progress line to stderr (or to stdout if you set verbose to >50) indicating the number of completed tasks out of the total tasks (fits) and the total time spent so far, like this one:
[Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
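A progress line in that format can be read with a short regular expression. The following is only a minimal sketch: `parse_progress_line` and `PROGRESS_RE` are hypothetical names, and the pattern assumes the exact layout of the line shown above, with the elapsed time given in `s` or `min`.

```python
import re

# Matches lines such as:
#   [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
# Assumption: the elapsed time is reported either in "s" or in "min".
PROGRESS_RE = re.compile(
    r"Done\s+(\d+)\s+tasks\s*\|\s*elapsed:\s*([\d.]+)\s*(s|min)"
)


def parse_progress_line(line):
    """Return (completed_tasks, elapsed_seconds), or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    done = int(match.group(1))
    value = float(match.group(2))
    # Convert minutes to seconds so both units are comparable.
    seconds = value * 60.0 if match.group(3) == "min" else value
    return done, seconds


if __name__ == "__main__":
    sample = "[Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s"
    print(parse_progress_line(sample))
```

Applying it to the last line of the stderr file gives the most recent count of completed tasks.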
In your case, you have 5842368 total fits, i.e. tasks.
You counted 7047332 '[CV] ...' lines. Since each finished task produces two such lines, that corresponds to around 7047332/2 = 3523666 finished tasks, while the progress line shows exactly how many tasks are completed: 3523550. The first figure is only approximate because some tasks could have started but not yet ended at the time of counting.
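To turn these numbers into a progress estimate, one rough approach is to divide completed tasks by total fits and extrapolate the elapsed time linearly. This is a sketch under a strong assumption, namely that the remaining fits take about as long on average as the finished ones, which need not hold across a parameter grid. `estimate_progress` is a hypothetical helper; the figures are the ones quoted in the question.

```python
def estimate_progress(done_tasks, total_tasks, elapsed_minutes):
    """Return (fraction_done, estimated_minutes_remaining).

    Assumes the remaining tasks cost as much on average as the
    completed ones (simple linear extrapolation).
    """
    if done_tasks <= 0:
        raise ValueError("need at least one completed task to extrapolate")
    fraction = done_tasks / total_tasks
    remaining = elapsed_minutes * (total_tasks - done_tasks) / done_tasks
    return fraction, remaining


# Figures taken from the question's stdout and stderr files.
total_fits = 23 * 254016   # folds * candidates = 5842368
cv_lines = 7047332         # '[CV]' lines counted with grep
done_tasks = 3523550       # last "Done ... tasks" line
elapsed_min = 285.3        # elapsed time on that line

fraction, remaining_min = estimate_progress(done_tasks, total_fits, elapsed_min)

# Half the '[CV]' lines should sit close to the reported count; the small
# gap comes from tasks that had started but not finished when counted.
gap = cv_lines // 2 - done_tasks

print(f"{fraction:.1%} done, roughly {remaining_min:.0f} min remaining")
print(f"gap between [CV]-based and reported counts: {gap} tasks")
```

The estimate is only as good as the linear assumption, so it is best recomputed from the latest progress line as the search runs.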
O.rka
I am an academic researcher studying machine-learning and microorganisms
Updated on June 25, 2022
Comments
O.rka almost 2 years
Right now I'm running a pretty aggressive grid search. I have n=135 samples and I am running 23 folds using a custom cross-validation train/test list. I have my verbose=2.
The following is what I ran:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# cv_indices is the custom list of train/test index pairs (23 folds), defined elsewhere.
param_test = {
    "loss": ["deviance"],
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 12),
    "min_samples_leaf": np.linspace(0.1, 0.5, 12),
    "max_depth": [3, 5, 8],
    "max_features": ["log2", "sqrt"],
    "min_impurity_split": [5e-6, 1e-7, 5e-7],
    "criterion": ["friedman_mse", "mae"],
    "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
    "n_estimators": [10],
}
Mod_gsearch = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=param_test,
    scoring="accuracy",
    n_jobs=32,
    iid=False,
    cv=cv_indices,
    verbose=2,
)
I took a look at the verbose output in stdout:
$ head gridsearch.o8475533
Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits
Based on this, it looks like there are 5842368 permutations of cross-validation pairs using my grid params.
$ grep -c "[CV]" gridsearch.o8475533
7047332
It looks like there are around 7 million cross-validations that have been done so far, but that's more than the 5842368 total fits...
7047332/5842368 = 1.2062458236
Then when I look at the stderr file:
$ cat ./gridsearch.e8475533
[Parallel(n_jobs=32)]: Done 132 tasks | elapsed: 1.2s
[Parallel(n_jobs=32)]: Done 538 tasks | elapsed: 2.8s
[Parallel(n_jobs=32)]: Done 1104 tasks | elapsed: 4.8s
[Parallel(n_jobs=32)]: Done 1834 tasks | elapsed: 7.9s
[Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
...
[Parallel(n_jobs=32)]: Done 3396203 tasks | elapsed: 250.2min
[Parallel(n_jobs=32)]: Done 3420769 tasks | elapsed: 276.5min
[Parallel(n_jobs=32)]: Done 3447309 tasks | elapsed: 279.3min
[Parallel(n_jobs=32)]: Done 3484240 tasks | elapsed: 282.3min
[Parallel(n_jobs=32)]: Done 3523550 tasks | elapsed: 285.3min
My goal:
How can I know the progress of my gridsearch with respect to the total time it may take?
What I'm confused about:
What is the relationship between [CV] lines in stdout, the total # of fits in stdout, and tasks in stderr?