How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)

You should use the Spark DataFrame API (available from Spark 1.3): https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations

Something like:

dat.select("a", "b", "c").filter("length(a) > 0")
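Note that DataFrame.filter takes a Column expression or a SQL string, not a Python lambda. A fuller, self-contained sketch of this approach (a minimal example assuming Spark 1.3+, a local SparkContext, and a placeholder Parquet path; the filter string mirrors the SQL condition used below):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")
sqc = SQLContext(sc)

# parquetFile returns a DataFrame in Spark 1.3+; selecting only a, b, c up
# front lets the Parquet reader skip the other columns on disk instead of
# loading them and discarding them later.
dat = sqc.parquetFile("/path/to/parquet")  # placeholder path
result = dat.select("a", "b", "c").filter("length(a) > 0")
result.show()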

Or you can use Spark SQL:

dat.registerTempTable("dat")
sqc.sql("select a, b, c from dat where length(a) > 0")
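If you want to check that only those columns are actually read, one option (a sketch, assuming dat is registered as above; DataFrame.explain is available from Spark 1.3) is to look at the physical plan:

pruned = sqc.sql("select a, b, c from dat where length(a) > 0")
# The Parquet scan node in the printed physical plan should list only
# columns a, b, c, which indicates the reader is pruning the rest.
pruned.explain(True)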

Comments

  • jarfa (almost 2 years ago)

    I'm loading in high-dimensional parquet files but only need a few columns. My current code looks like:

    dat = sqc.parquetFile(path) \
              .filter(lambda r: len(r.a)>0) \
              .map(lambda r: (r.a, r.b, r.c))
    

    My mental model of what's happening is that it's loading in all the data, then throwing out the columns I don't want. I'd obviously prefer it to not even read in those columns, and from what I understand about parquet that seems to be possible.

    So there are two questions:

    1. Is my mental model wrong? Or is Spark smart enough to read in only columns a, b, and c in the example above?
    2. How can I force sqc.parquetFile() to read in data more efficiently?
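
    A minimal sketch of the pruned version of this pipeline, following the SQL route from the answer above, which should also apply on 1.2.0 (assuming sqc is a SQLContext and path points at the Parquet data; the WHERE clause mirrors the answer's filter):

    dat = sqc.parquetFile(path)
    dat.registerTempTable("dat")
    # Selecting only a, b, c in SQL lets the Parquet scan skip the other
    # columns instead of loading them and dropping them afterwards.
    pruned = sqc.sql("select a, b, c from dat where length(a) > 0")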