Should we parallelize a DataFrame like we parallelize a Seq before training?

scala apache-spark pyspark apache-spark-sql apache-spark-ml


Solution 1

A DataFrame is a distributed data structure. It is neither required nor possible to parallelize it. The SparkContext.parallelize method is used only to distribute local data structures that reside in driver memory. It shouldn't be used to distribute large datasets, let alone to redistribute RDDs or higher-level data structures, as you do in your previous question:

sc.parallelize(trainingData.collect())
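A minimal sketch of why that round trip is unnecessary, assuming Spark 2.x or later and a local SparkSession (the app name, the "id" column, and the row count are made up for illustration). It only contrasts a small local Seq, which is what parallelize is meant for, with a DataFrame that is already partitioned:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session purely for illustration.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("parallelize-sketch")
  .getOrCreate()

// parallelize is meant for small local collections held by the driver.
val localData = Seq(("foo", 1), ("bar", 2))
val fromLocal = spark.sparkContext.parallelize(localData)

// A DataFrame is already distributed: it has partitions without any extra step.
val df = spark.range(0, 1000).toDF("id")
val partitions = df.rdd.getNumPartitions

// Anti-pattern: collect() pulls every row into driver memory,
// then parallelize ships the same rows back out to the executors.
val roundTrip = spark.sparkContext.parallelize(df.collect().toSeq)
```

The last line produces the same rows as df.rdd, but only after copying the whole dataset through the driver, which is the step that fails on large data.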

If you want to convert between an RDD and a DataFrame (Dataset), use the methods designed for that:

from DataFrame to RDD:

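import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
// Needed for toDF outside spark-shell; in Spark 1.x use sqlContext.implicits._
import spark.implicits._

val df: DataFrame = Seq(("foo", 1), ("bar", 2)).toDF("k", "v")
val rdd: RDD[Row] = df.rdd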

from RDD to DataFrame:

val rdd: RDD[(String, Int)] = sc.parallelize(Seq(("foo", 1), ("bar", 2)))
val df1: DataFrame = rdd.toDF
// or
val df2: DataFrame = spark.createDataFrame(rdd) // in Spark 1.x use sqlContext.createDataFrame(rdd)

Solution 2

It may help to read up on the difference between an RDD and a DataFrame and how to convert between the two: Difference between DataFrame and RDD in Spark

To answer your question directly: a DataFrame is already optimized for parallel execution. You do not need to do anything, and you can pass it directly to any Spark estimator's fit() method. The parallel execution is handled in the background.
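As a rough sketch of what that looks like, assuming Spark 2.x or later with the spark.ml API (the local session, app name, and tiny in-memory dataset are illustrative only; in practice the DataFrame would come from a reader such as spark.read):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session purely for illustration.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("fit-sketch")
  .getOrCreate()

// A small DataFrame with the default "label" and "features" columns.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// The DataFrame goes straight into fit(); no parallelize call is involved.
val model = lr.fit(training)
```

The same fit(training) call applies unchanged when training is loaded from files or tables, since the estimator works on whatever partitions the DataFrame already has.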


Author by

Abhishek

Updated on January 07, 2020

Comments

Abhishek over 4 years

Consider the code given here,

https://spark.apache.org/docs/1.2.0/ml-guide.html

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val training = sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)

val model1 = lr.fit(training)

Assuming we read "training" as a DataFrame using sqlContext.read, should we still do something like

val model1 = lr.fit(sparkContext.parallelize(training)) // or some variation of this

or will the fit function automatically take care of parallelizing the computation/data when passed a DataFrame?

Regards,
