Save and load two ML models in pyspark

I figured out a way to do it by simply saving them together into one folder. The user then only needs to know and provide the path to this single folder.

import sys
import os
from pyspark.ml.classification import RandomForestClassifier

# Train one model per feature set, both on the same dataframe.
trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)

# The user supplies a single folder path; both models are saved
# into hardcoded subdirectories inside it.
path = sys.argv[1]
os.mkdir(path)
model_1.save(os.path.join(path, 'model_1'))
model_2.save(os.path.join(path, 'model_2'))
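
The user-supplied folder then ends up containing one subdirectory per model, roughly like this (each model directory in turn holds the metadata and data that Spark writes):

<path>/
    model_1/
    model_2/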

The names model_1 and model_2 are hardcoded, so the user does not need to know them.

import sys
import os
from pyspark.ml.classification import RandomForestClassificationModel

# Load both models back from the same folder, using the hardcoded names.
model_1 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_1'))
model_2 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_2'))
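
From here the two models can be used independently. A quick usage sketch, assuming a dataframe df that contains both feature columns:

predictions_1 = model_1.transform(df)  # reads the "features_1" column
predictions_2 = model_2.transform(df)  # reads the "features_2" column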

This should solve the problem. Is this the best way to do it, or could there be an even better way to bundle the models together using functionality from the Spark library?
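
For what it's worth, the Spark library itself does offer a way to bundle the two models as one object: make both estimators stages of a Pipeline, and the fitted PipelineModel can then be saved and loaded as a single unit under one path. A minimal sketch (note that each stage must be given distinct output column names, since Pipeline.fit() runs the stages in sequence on the same dataframe):

import sys

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier

# Give each stage its own output columns; with the defaults, the second
# stage would fail because "prediction" etc. already exist after the
# first stage's transform.
trainer_1 = RandomForestClassifier(
    featuresCol="features_1", predictionCol="prediction_1",
    rawPredictionCol="rawPrediction_1", probabilityCol="probability_1")
trainer_2 = RandomForestClassifier(
    featuresCol="features_2", predictionCol="prediction_2",
    rawPredictionCol="rawPrediction_2", probabilityCol="probability_2")

pipeline = Pipeline(stages=[trainer_1, trainer_2])
model = pipeline.fit(df_training_data)

model.save(sys.argv[1])                  # one path covers both models
model = PipelineModel.load(sys.argv[1])  # restores both stages at once
model_1, model_2 = model.stages          # the individual fitted models

The trade-off is that transform() on the PipelineModel applies both models at once; if the two models sometimes need to be applied separately, the folder approach above may still be the simpler option.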

Author by

PaulMag

Physicist, Python user, game development enthusiast (Unreal Engine, C++). Use numpy, scipy, scikit-image, astropy, and matplotlib. Have also touched Visual Basic, Java, C, MATLAB, IDL, and Bash.

Updated on June 04, 2022

Comments

  • PaulMag almost 2 years

    First I train two ML models and save them to two separate paths. Note that both models are trained on the same dataframe; features_1 and features_2 are different sets of features extracted from the same dataset.

    import sys
    from pyspark.ml.classification import RandomForestClassifier
    
    trainer_1 = RandomForestClassifier(featuresCol="features_1")
    trainer_2 = RandomForestClassifier(featuresCol="features_2")
    model_1 = trainer_1.fit(df_training_data)
    model_2 = trainer_2.fit(df_training_data)
    
    model_1.save(sys.argv[1])
    model_2.save(sys.argv[2])
    

    Then, when I later want to use the models, I have to load them both from their respective paths, providing the paths e.g. via sys.argv.

    import sys
    from pyspark.ml.classification import RandomForestClassificationModel
    
    model_1 = RandomForestClassificationModel.load(sys.argv[1])
    model_2 = RandomForestClassificationModel.load(sys.argv[2])
    

    What I want is an elegant way to save these two models together, as one, under the same path. I want this mainly so that the user does not have to keep track of two separate pathnames every time the models are saved and loaded. These two models are closely connected and will generally always be created and used together, so they are effectively one model.

    Is this the kind of thing pipelines are intended for?