How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)
You should use the Spark DataFrame API: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations
Something like:
dat.select("a", "b", "c").filter(dat.a != "")
(DataFrame.filter() takes a Column or a SQL expression string, not a Python lambda as with plain RDDs.)
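For example, a minimal end-to-end sketch; the path and the column names a, b, c are placeholders taken from the question, and it assumes Spark 1.3+, where parquetFile() returns a DataFrame:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")
sqc = SQLContext(sc)

# In Spark 1.3+ parquetFile() returns a DataFrame (a SchemaRDD in 1.2).
dat = sqc.parquetFile("/path/to/data.parquet")  # placeholder path

# select() is a logical projection, so Spark pushes it down to the Parquet
# reader and only columns a, b and c are actually read from disk.
# dat.a != "" is the Column-expression equivalent of len(r.a) > 0.
subset = dat.select("a", "b", "c").filter(dat.a != "")
subset.show()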
Or you can use Spark SQL:
dat.registerTempTable("dat")
sqc.sql("select a, b, c from dat where length(a) > 0")
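For instance, a sketch of the same query through the SQL interface, reusing the placeholder data from above; a != '' is used here instead of length(a) > 0 to stay within the basic SQLContext parser, while with a HiveContext the length(a) > 0 form should also work:

# Register the loaded data as a temporary table so it can be queried with SQL.
# registerTempTable() exists on SchemaRDD in 1.2 as well as on DataFrame in 1.3.
dat.registerTempTable("dat")

# The projection on a, b, c is pushed down to the Parquet reader here too.
subset = sqc.sql("SELECT a, b, c FROM dat WHERE a != ''")
subset.take(5)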
Original question, by jarfa:
I'm loading in high-dimensional parquet files but only need a few columns. My current code looks like:
dat = sqc.parquetFile(path) \
    .filter(lambda r: len(r.a) > 0) \
    .map(lambda r: (r.a, r.b, r.c))
My mental model of what's happening is that it's loading in all the data, then throwing out the columns I don't want. I'd obviously prefer it to not even read in those columns, and from what I understand about parquet that seems to be possible.
So there are two questions:
- Is my mental model wrong? Or is Spark smart enough to only read in columns a, b, and c in the example above?
- How can I force sqc.parquetFile() to read in data more efficiently?