How to extract model hyper-parameters from spark.ml in PySpark?


Solution 1

Ran into this problem as well. I found out that you need to go through the model's underlying Java object; I don't know why the Python wrappers don't expose these getters. So just do this:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# cross-validate a linear regression over a (here trivial) parameter grid
evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder() \
    .addGrid(lr.maxIter, [500]) \
    .addGrid(lr.regParam, [0]) \
    .addGrid(lr.elasticNetParam, [1]) \
    .build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                       evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel

Printing out the parameters you want:

>>> print('Best Param (regParam): ', bestModel._java_obj.getRegParam())
Best Param (regParam):  0
>>> print('Best Param (MaxIter): ', bestModel._java_obj.getMaxIter())
Best Param (MaxIter):  500
>>> print('Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam())
Best Param (elasticNetParam):  1

This workaround applies to other methods like extractParamMap() as well, which returns an empty map through the Python API. Hopefully they fix this soon.
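
For example, a minimal sketch (assuming the same bestModel as above) that dumps the whole Java-side parameter map; toString() is called explicitly because py4j does not render Java objects as strings automatically:

# dump every parameter of the best model via the underlying Java ParamMap
print(bestModel._java_obj.extractParamMap().toString())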

Solution 2

This might not be as good as wernerchao's answer (because it's not convenient to store hyper-parameters in variables), but you can quickly look at the best hyper-parameters of a cross-validation model this way:

import numpy as np

# index the parameter grid by the position of the best average metric
cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
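
Note that np.argmax assumes a larger metric is better; for error metrics such as MAE or RMSE you would want np.argmin instead. A sketch that picks the right one, assuming your PySpark version exposes getEvaluator() on the fitted CrossValidatorModel:

import numpy as np

# choose argmax when the evaluator's metric should be maximized
# (e.g. areaUnderROC) and argmin when it should be minimized (e.g. RMSE)
pick = np.argmax if cvModel.getEvaluator().isLargerBetter() else np.argmin
best_params = cvModel.getEstimatorParamMaps()[pick(cvModel.avgMetrics)]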

Solution 3

Assuming cvModel3Day is your model's name, the params can be extracted as shown below in Spark Scala:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.GBTClassificationModel

// cast once: the best model is a PipelineModel whose third stage is the GBT model
val gbt = cvModel3Day.bestModel.asInstanceOf[PipelineModel]
  .stages(2).asInstanceOf[GBTClassificationModel]

val params       = gbt.extractParamMap()
val depth        = gbt.getMaxDepth
val iter         = gbt.getMaxIter
val bins         = gbt.getMaxBins
val features     = gbt.getFeaturesCol
val step         = gbt.getStepSize
val samplingRate = gbt.getSubsamplingRate
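
For reference, a rough PySpark equivalent (a sketch, assuming as above that the fitted GBT model sits at stage index 2 of the best pipeline):

# pull the fitted GBT model out of the best pipeline and read its params
# through the underlying Java object, as in Solution 1
gbt = cvModel3Day.bestModel.stages[2]
depth = gbt._java_obj.getMaxDepth()
step = gbt._java_obj.getStepSize()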

Solution 4

I also banged my head against this wall. Unfortunately, you can only get specific parameters for specific models. Happily, for logistic regression you can access the intercept and weights; sadly, you cannot retrieve the regParam. This can be done in the following way:

best_lr = cv.bestModel

# get weights
best_lr.weights
>>> DenseVector([3.1573])

# or better
best_lr.coefficients
>>> DenseVector([3.1573])

# get intercept
best_lr.intercept
>>> -1.0829958115287153

As I wrote above, each model has only a few parameters that can be extracted. Overall, getting the relevant models out of a Pipeline (e.g. cv.bestModel when the CrossValidator runs over a Pipeline) can be done with:

best_pipeline = cv.bestModel
best_pipeline.stages
>>> [Tokenizer_4bc8884ad68b4297fd3c, CountVectorizer_411fbdeb4100c2bfe8ef, PCA_4c538d67e7b8f29ff8d0, LogisticRegression_4db49954edc7033edc76]

Each model is then obtained by simple list indexing:

best_lr = best_pipeline.stages[3]

Now the above can be applied.
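
And if you do need regParam after all, the Java-object workaround from Solution 1 works here too (a sketch using the same _java_obj trick):

# read the parameter the Python wrapper does not expose,
# straight from the underlying Java model
best_lr._java_obj.getRegParam()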

Solution 5

There are actually two questions here:

  • what are the properties of the fitted model (like coefficients and intercept), and
  • what were the hyper-parameters with which the bestModel was fitted.

Unfortunately, the Python API of the fitted estimators (the models) doesn't allow (easy) direct access to the parameters of the estimator, which makes it hard to answer the latter question.

However, there is a workaround using the Java API. For completeness, here is first a full setup of a cross-validated model:

%pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
logit = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[logit])
paramGrid = ParamGridBuilder() \
    .addGrid(logit.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
    .addGrid(logit.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
    .build()
evaluator = BinaryClassificationEvaluator(metricName='areaUnderPR')
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
tuned_model = crossval.fit(train)  # train is your training DataFrame
model = tuned_model.bestModel

One could then use the generic methods on the Java object to get the parameter values, without explicitly referring to getters like getRegParam():

# grab the underlying Java model of the last pipeline stage, then look up
# each grid parameter by name through the generic Java getters
java_model = model.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name))
    for param in paramGrid[0]}

This executes the following steps:

  1. Get the fitted logit model as created by the estimator from the last stage of the best model: crossval.fit(..).bestModel.stages[-1]
  2. Get the internal Java object from _java_obj.
  3. Get all configured parameter names from the paramGrid (which is a list of dictionaries). Only the first row is used, on the assumption that it is an actual grid, i.e. that each row contains the same keys; otherwise you would need to collect all names ever used in any row.
  4. Get the corresponding Param<T> parameter identifier from the Java object.
  5. Pass the Param<T> instance to the getOrDefault() function to get the actual value.
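
Applied to the grid above, the comprehension returns a plain Python dict; a hypothetical result (the winning values depend entirely on your data) would look like:

# hypothetical output; actual values depend on your data
{'regParam': 0.01, 'elasticNetParam': 0.5}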

Comments

  • Paul, almost 4 years ago:

    I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    
    dataset = sqlContext.createDataFrame(
        [(Vectors.dense([0.0]), 0.0),
         (Vectors.dense([0.4]), 1.0),
         (Vectors.dense([0.5]), 0.0),
         (Vectors.dense([0.6]), 1.0),
         (Vectors.dense([1.0]), 1.0)] * 10,
        ["features", "label"])
    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
    evaluator = BinaryClassificationEvaluator()
    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
    cvModel = cv.fit(dataset)
    

    Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't seem to find the value of lr.regParam selected by the cross-validation procedure. Any ideas?

    In [3]: cvModel.bestModel.coefficients
    Out[3]: DenseVector([3.1573])
    
    In [4]: cvModel.bestModel.explainParams()
    Out[4]: ''
    
    In [5]: cvModel.bestModel.extractParamMap()
    Out[5]: {}
    
    In [15]: cvModel.params
    Out[15]: []
    
    In [36]: cvModel.bestModel.params
    Out[36]: []