Cross validation in PySpark


1) The area under the ROC curve (AUC) is defined only for binary classification, hence you cannot use it for regression tasks, as you are trying to do here.
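
For context, here is a minimal pure-Python sketch (with hypothetical labels and scores, not taken from the question) of what AUC measures on binary labels: the fraction of (positive, negative) pairs that the scores rank correctly. This is why the metric simply has no meaning for continuous regression labels.

```python
def auc(labels, scores):
    # Pairwise (rank-based) definition of AUC on binary labels:
    # the fraction of positive/negative pairs where the positive
    # example gets the higher score; ties count as half.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    pairs = [(p, n) for p in pos for n in neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```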

2) The objectiveHistory for each iteration is available only when the solver argument of the regression is "l-bfgs" (see the documentation); here is a toy example:

spark.version
# u'2.1.1'

from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dataset = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.2),
         (Vectors.dense([0.4]), 1.4),
         (Vectors.dense([0.5]), 1.9),
         (Vectors.dense([0.6]), 0.9),
         (Vectors.dense([1.2]), 1.0)] * 10,
         ["features", "label"])

lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here

modelEvaluator = RegressionEvaluator()  # RMSE by default
pipeline = Pipeline(stages=[lr])  # shown for completeness; lr is fit directly below
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=3)

cvModel = crossval.fit(dataset)

trainingSummary = cvModel.bestModel.summary

trainingSummary.totalIterations
# 2
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
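
Since objectiveHistory is just a Python list with one entry per iteration, you can enumerate it to see the objective at each step; a sketch, assuming the two values returned above:

```python
# Hypothetical objectiveHistory, as returned by trainingSummary.objectiveHistory above
objectiveHistory = [0.49, 0.4511834723904831]

for i, objective in enumerate(objectiveHistory):
    print("iteration %d: objective = %f" % (i, objective))
```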

3) You have already defined a RegressionEvaluator, which you can use to evaluate your test set; note that, if used without arguments, it defaults to the RMSE metric. Here is a way to define evaluators with different metrics and apply them to your test set (continuing the code from above):

test = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.2),
         (Vectors.dense([0.4]), 1.1),
         (Vectors.dense([0.5]), 0.9),
         (Vectors.dense([0.6]), 1.0)],
        ["features", "label"])

modelEvaluator.evaluate(cvModel.transform(test))  # rmse by default, if not specified
# 0.35384585061028506

eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")

eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506

eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124
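
To make the metrics themselves concrete, here is a pure-Python sketch (with made-up labels and predictions, not the values above) of the formulas behind the "rmse" and "r2" metric names:

```python
import math

def rmse(labels, preds):
    # Root of the mean squared error between labels and predictions.
    return math.sqrt(sum((l - p) ** 2 for l, p in zip(labels, preds)) / len(labels))

def r2(labels, preds):
    # Coefficient of determination:
    # 1 - (residual sum of squares) / (total sum of squares).
    mean = sum(labels) / len(labels)
    ss_res = sum((l - p) ** 2 for l, p in zip(labels, preds))
    ss_tot = sum((l - mean) ** 2 for l in labels)
    return 1 - ss_res / ss_tot

labels, preds = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
print(rmse(labels, preds))  # ~0.1414
print(r2(labels, preds))    # ~0.97
```

A negative R², as in the output above, simply means the model fits the test set worse than a constant predictor at the label mean would.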
Author: Alex, updated on June 04, 2022

Comments

  • Alex (almost 2 years ago)

    I used cross validation to train a linear regression model using the following code:

    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    
    lr = LinearRegression(maxIter=maxIteration)
    modelEvaluator=RegressionEvaluator()
    pipeline = Pipeline(stages=[lr])
    paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
    
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=modelEvaluator,
                              numFolds=3)
    
    cvModel = crossval.fit(training)
    

    Now I want to draw the ROC curve. I used the following code:

    trainingSummary = cvModel.bestModel.stages[-1].summary
    trainingSummary.roc.show()
    print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
    

    but I get this error:

    'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'

    I also want to check the objectiveHistory at each iteration; I know that I can get it at the end:

    print("numIterations: %d" % trainingSummary.totalIterations)
    print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
    

    but I want to get it at each iteration; how can I do this?

    Moreover, I want to evaluate the model on the test data; how can I do that?

    prediction = cvModel.transform(test)
    

    I know that for the training data set I can write:

    print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
    print("r2: %f" % trainingSummary.r2)
    

    but how can I get these metrics for the test data set?