How to get correlation matrix values in PySpark
You are almost there! There is no need to use the old RDD-based MLlib API.
This is the method I use to generate a pandas DataFrame, which you can then export to Excel, CSV, or other formats.
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

def correlation_matrix(df, corr_columns, method='pearson'):
    # Assemble the selected columns into a single vector column
    vector_col = "corr_features"
    assembler = VectorAssembler(inputCols=corr_columns, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    # The result has one row and one column, named "<method>(<vector_col>)"
    matrix = Correlation.corr(df_vector, vector_col, method)
    result = matrix.collect()[0][0].toArray()
    return pd.DataFrame(result, columns=corr_columns, index=corr_columns)
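Since the question also asks for a plain list, here is a minimal sketch of regrouping the matrix entries by hand, without pandas. It assumes the entries have already been brought to the driver as a flat Python list (for example from the `values` attribute of the collected `DenseMatrix`, which is stored in column-major order unless `isTransposed` is set). The helper names `flat_to_rows` and `to_csv_text` and the sample numbers are hypothetical and only for illustration.

```python
import csv
import io


def flat_to_rows(values, n_rows, n_cols, column_major=True):
    """Regroup a flat sequence of matrix entries into a list of row lists."""
    if len(values) != n_rows * n_cols:
        raise ValueError(
            "expected {} values, got {}".format(n_rows * n_cols, len(values))
        )
    if column_major:
        # Entry (i, j) sits at index j * n_rows + i in column-major storage.
        return [[values[j * n_rows + i] for j in range(n_cols)]
                for i in range(n_rows)]
    # Row-major storage: entry (i, j) sits at index i * n_cols + j.
    return [[values[i * n_cols + j] for j in range(n_cols)]
            for i in range(n_rows)]


def to_csv_text(rows, labels):
    """Render a labeled square matrix as CSV text (header plus one row per label)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + list(labels))
    for label, row in zip(labels, rows):
        writer.writerow([label] + list(row))
    return buffer.getvalue()


# Illustrative 3x3 correlation values (made up, not taken from the question's data).
labels = ["a", "b", "c"]
flat = [1.0, 0.5, -0.2,
        0.5, 1.0, 0.3,
        -0.2, 0.3, 1.0]

rows = flat_to_rows(flat, 3, 3)
csv_text = to_csv_text(rows, labels)
print(csv_text)
```

Writing `csv_text` to a file with a `.csv` extension gives something Excel can open. Because a correlation matrix is symmetric, the storage order does not change the result here, but the `column_major` flag matters for non-symmetric matrices.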
Author: Juan David
Updated on July 10, 2022
I have a correlation matrix calculated as follows in PySpark 2.2:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

datos = sql("""select * from proceso_riesgos.jdgc_bd_train_mn_ingresos""")

Variables_corr = ['ingreso_final_mix', 'ingreso_final_promedio',
                  'ingreso_final_mediana', 'ingreso_final_trimedia', 'ingresos_serv_q1',
                  'ingresos_serv_q2', 'ingresos_serv_q3', 'prom_ingresos_serv', 'y_correc']

assembler = VectorAssembler(
    inputCols=Variables_corr,
    outputCol="features")

datos1 = datos.select(Variables_corr).filter("y_correc is not null")
output = assembler.transform(datos)
r1 = Correlation.corr(output, "features")
The result is a DataFrame with a single column called "pearson(features)", of type matrix:
[Row(pearson(features)=DenseMatrix(20, 20, [1.0, 0.9428, 0.8908, 0.913, 0.567, 0.5832, 0.6148, 0.6488, ..., -0.589, -0.6145, -0.5906, -0.5534, -0.5346, -0.0797, -0.617, 1.0], False))]
I need to take those values and export them to Excel, or otherwise be able to manipulate the result. A list would be desirable.
Thanks for help!!