PCA Analysis in PySpark
Spark >= 1.5.0
Although PySpark 1.5 introduces distributed data structures (pyspark.mllib.linalg.distributed), the API is rather limited and there is no implementation of the computePrincipalComponents method.
It is possible to use either pyspark.ml.feature.PCA or pyspark.mllib.feature.PCA, though. In the first case the expected input is a data frame with a vector column:
from pyspark.ml.feature import PCA as PCAml
from pyspark.ml.linalg import Vectors  # Pre 2.0: pyspark.mllib.linalg

df = sqlContext.createDataFrame([
    (Vectors.dense([1, 2, 0]),),
    (Vectors.dense([2, 0, 1]),),
    (Vectors.dense([0, 1, 0]),)], ("features", ))

pca = PCAml(k=2, inputCol="features", outputCol="pca")
model = pca.fit(df)
transformed = model.transform(df)
In Spark 2.0 or later you should use pyspark.ml.linalg.Vectors in place of pyspark.mllib.linalg.Vectors.
For the mllib version you'll need an RDD of Vectors:
from pyspark.mllib.feature import PCA as PCAmllib
from pyspark.mllib.linalg import Vectors

rdd = sc.parallelize([
    Vectors.dense([1, 2, 0]),
    Vectors.dense([2, 0, 1]),
    Vectors.dense([0, 1, 0])])

model = PCAmllib(2).fit(rdd)
transformed = model.transform(rdd)
Spark < 1.5.0
PySpark <= 1.4.1 doesn't support distributed data structures yet, so there is no built-in method to compute PCA. If the input matrix is relatively thin you can compute the covariance matrix in a distributed manner, collect the results, and perform the eigendecomposition locally on the driver.
The order of operations is more or less like the one below. Distributed steps are followed by the name of the operation, local ones by "*" and an optional method.
1. Create an RDD[Vector] where each element is a single row of the input matrix. You can use numpy.ndarray for each row (parallelize).
2. Compute column-wise statistics (reduce).
3. Use the results from 2. to center the matrix (map).
4. Compute the outer product for each row (map outer).
5. Sum the results to obtain the covariance matrix (reduce +).
6. Collect and compute the eigendecomposition * (numpy.linalg.eigh).
7. Choose the top-n eigenvectors *.
8. Project the data (map).
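The steps above can be sketched locally with plain NumPy (the comments mark which RDD operation each step would correspond to in the distributed version; the function name is hypothetical):

```python
import numpy as np

def pca_via_covariance(rows, k):
    """Top-k PCA projection via the covariance matrix + eigendecomposition."""
    X = np.asarray(rows, dtype=float)
    mean = X.mean(axis=0)                  # 2. column-wise statistics (reduce)
    Xc = X - mean                          # 3. center the matrix (map)
    # 4.-5. sum of outer products of centered rows -> covariance (map + reduce)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # 6. local eigendecomposition on the driver; eigh returns ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]  # 7. choose top-k eigenvectors
    components = eigvecs[:, order]
    return Xc @ components                 # 8. project the data (map)

projected = pca_via_covariance([[1, 2, 0], [2, 0, 1], [0, 1, 0]], k=2)
```

In Spark, steps 2.-5. become reduce/map over the RDD of rows and only the small d x d covariance matrix is ever collected to the driver.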
Regarding scikit-learn: you can use NumPy (it is already used in MLlib), SciPy, or scikit-learn locally on the driver or on a worker the same way as usual.
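For example, once the data (or a sample of it) has been collected to the driver, ordinary local tools apply; a minimal sketch using NumPy's SVD (sklearn.decomposition.PCA would work the same way on the collected array; the variable names are hypothetical):

```python
import numpy as np

# Hypothetical: rows collected to the driver, e.g. X = np.array(rdd.collect())
X = np.array([[1, 2, 0], [2, 0, 1], [0, 1, 0]], dtype=float)

Xc = X - X.mean(axis=0)                    # center the columns
# SVD of the centered matrix: rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T                  # project onto the top-2 components
```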
lapolonio
Updated on June 14, 2022

Comments
- lapolonio (almost 2 years ago): Looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html, the examples seem to only contain Java and Scala. Does Spark MLlib support PCA analysis for Python? If so, please point me to an example. If not, how do I combine Spark with scikit-learn?
- Mehdi LAMRANI (over 6 years ago): What about Spark > 2? The syntax seems to have changed.
- iratelilkid (about 3 years ago): @MehdiLAMRANI it worked for me. I am using Databricks.
- iratelilkid (about 3 years ago): I have a question for @zero323: how do I apply this to the actual data frame? Any help would be appreciated, thanks.
- 3nomis (over 2 years ago): I still haven't understood whether the PCA library automatically standardizes the data before performing PCA. Does anyone know?