How to serve a Spark MLlib model?

apache-spark machine-learning apache-spark-mllib

10,909

Solution 1

On the one hand, a machine learning model built with Spark can't be served in the traditional way, as you would with Azure ML or Amazon ML.

Databricks claims to be able to deploy models using its notebooks, but I haven't actually tried that yet.

On the other hand, you can use a model in three ways:

Train on the fly inside an application, then apply predictions. This can be done in a Spark application or a notebook.
Train a model and save it, if it implements MLWritable (the trait that provides an MLWriter), then load it in an application or a notebook and run it against your data.
Train a model with Spark and export it to PMML format using jpmml-sparkml. PMML allows different statistical and data mining tools to speak the same language. In this way, a predictive solution can be moved easily among tools and applications without the need for custom coding, for example from Spark ML to R.

Those are the three possible ways.
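To make the PMML option more concrete, here is a minimal sketch of what "speaking the same language" can look like on the consuming side. It assumes a hand-written PMML document describing a small linear regression (it is not output from Spark or from any exporter), and the helper names `load_linear_model` and `predict` are hypothetical. It only uses the Python standard library, so the scoring side needs no Spark at all.

```python
import xml.etree.ElementTree as ET

# Hand-written PMML document, for illustration only (an assumption, not
# produced by Spark): the linear model y = 1.5 + 2.0 * x1 - 0.5 * x2.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header description="illustrative linear model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Namespace prefix used in the element paths below.
NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def load_linear_model(pmml_text):
    """Extract (intercept, {feature: coefficient}) from a PMML RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    if table is None:
        raise ValueError("no RegressionTable found in the PMML document")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, record):
    """Score one record (a dict of feature name -> value) with the linear model."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
prediction = predict(model, {"x1": 3.0, "x2": 4.0})
print(prediction)
```

A real deployment would rather rely on an existing PMML evaluator than on a hand-rolled parser like this one, which only understands a single regression table; the point is just that the model travels as a plain document.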

Of course, you can design an architecture with a RESTful service in front, built for example with spark-jobserver, to train and deploy models, but that needs some development. It's not an out-of-the-box solution.
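As a rough idea of the development involved, the sketch below puts a tiny HTTP endpoint in front of a model using only the Python standard library. It does not use spark-jobserver; the `/predict` path, the JSON payload shape, and the `predict` stand-in (a made-up threshold rule) are all assumptions for illustration, and in practice `predict` would call into whatever model you loaded.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a real model (hypothetical rule, only to keep this runnable)."""
    return 1.0 if sum(features) > 1.0 else 0.0


def handle_predict(raw_body):
    """Turn a raw JSON request body into (HTTP status, JSON response bytes)."""
    try:
        payload = json.loads(raw_body)
        features = [float(x) for x in payload["features"]]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'features' list of numbers"}
        return 400, json.dumps(error).encode("utf-8")
    return 200, json.dumps({"prediction": predict(features)}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Accepts POST /predict with a body like {"features": [0.7, 0.6]}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(host="127.0.0.1", port=8080):
    """Build the HTTP server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the service, after which a client can POST `{"features": [0.7, 0.6]}` to `/predict`. Keeping the request logic in `handle_predict` separate from the HTTP handler makes it easy to test without opening a socket.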

You might also use projects like Oryx 2 to create your full lambda architecture to train, deploy and serve a model.

Unfortunately, describing each of the solutions mentioned above is quite broad and doesn't fit the scope of SO.

Solution 2

One option is to use MLeap to serve a Spark PipelineModel online with no dependencies on Spark/SparkContext. Not having to use the SparkContext is important as it will drop scoring time for a single record from ~100ms to single-digit microseconds.

In order to use it, you have to:

Serialize your Spark Model with MLeap utilities
Load the model in MLeap (does not require a SparkContext or any Spark dependencies)
Create your input record in JSON (not a DataFrame)
Score your record with MLeap

MLeap is well integrated with all the Pipeline Stages available in Spark MLlib (with the exception of LDA at the time of this writing). However, things might get a bit more complicated if you are using custom Estimators/Transformers.

Take a look at the MLeap FAQ for more info about custom transformers/estimators, performance, and integration.

Solution 3

You are comparing two rather different things. Apache Spark is a computation engine, while the Amazon and Microsoft solutions you mention are services. These services might well have Spark with MLlib behind the scenes. They save you the trouble of building a web service yourself, but you pay extra.

A number of companies, like Domino Data Lab, Cloudera or IBM, offer products that you can deploy on your own Spark cluster and use to easily build a service around your models (with varying degrees of flexibility).

Naturally, you can build a service yourself with various open source tools. Which specifically? It all depends on what you are after. How should users interact with the model? Should there be some sort of UI or just a REST API? Do you need to change some parameters of the model or the model itself? Are the jobs more batch or real-time in nature? You can of course build an all-in-one solution, but that's going to be a huge effort.

My personal recommendation would be to take advantage, if you can, of one of the available services from Amazon, Google, Microsoft or another provider. Need on-premises deployment? Check Domino Data Lab: their product is mature and makes it easy to work with models (from building to deployment). Cloudera is more focused on cluster computing (including Spark), but it will take a while before they have something mature.

[EDIT] I'd recommend having a look at Apache PredictionIO, an open source machine learning server. It's an amazing project with lots of potential.


Author by

Luis Leal


Updated on June 12, 2022

Comments

Luis Leal almost 2 years

I'm evaluating tools for production ML-based applications and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained.

For example in Azure ML, once trained, the model is exposed as a web service which can be consumed from any application, and it's a similar case with Amazon ML.

How do you serve/deploy ML models in Apache Spark?
Elmar Macek over 7 years

I would give spark-jobserver a chance. You can easily cache a model (even a complete Spark pipeline) and quickly answer ML-related queries such as classifications. It also gives you the opportunity to cache aggregated tables and quickly return JSON containing this data, or parts of it, for a visualisation or for further processing in another application.
botchniaque about 2 years

MLeap's feature list looks like the missing puzzle piece for Spark ML pipelines. However, when I tried to serialize a simple PySpark ML pipeline containing an Imputer, I failed to make it work. All the examples show various transformers, which also work for me, but sadly Imputer does not seem to be supported :(