How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)

You should use the Spark DataFrame API (available from Spark 1.3): https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations

Something like:

dat.select("a", "b", "c").filter("length(a) > 0")
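Note that DataFrame.filter takes a Column expression or a SQL string, not a Python lambda. A fuller, self-contained sketch of this approach (a minimal example assuming Spark 1.3+, a local SparkContext, and a placeholder Parquet path; the filter string mirrors the SQL condition used below):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")
sqc = SQLContext(sc)

# parquetFile returns a DataFrame in Spark 1.3+; selecting only a, b, c up
# front lets the Parquet reader skip the other columns on disk instead of
# loading them and discarding them later.
dat = sqc.parquetFile("/path/to/parquet")  # placeholder path
result = dat.select("a", "b", "c").filter("length(a) > 0")
result.show()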

Or you can use Spark SQL:

dat.registerTempTable("dat")
sqc.sql("select a, b, c from dat where length(a) > 0")
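If you want to check that only those columns are actually read, one option (a sketch, assuming dat is registered as above; DataFrame.explain is available from Spark 1.3) is to look at the physical plan:

pruned = sqc.sql("select a, b, c from dat where length(a) > 0")
# The Parquet scan node in the printed physical plan should list only
# columns a, b, c, which indicates the reader is pruning the rest.
pruned.explain(True)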

Comments

  • jarfa (almost 2 years ago)

    I'm loading in high-dimensional parquet files but only need a few columns. My current code looks like:

    dat = sqc.parquetFile(path) \
              .filter(lambda r: len(r.a)>0) \
              .map(lambda r: (r.a, r.b, r.c))
    

    My mental model of what's happening is that it's loading in all the data, then throwing out the columns I don't want. I'd obviously prefer it to not even read in those columns, and from what I understand about parquet that seems to be possible.

    So there are two questions:

    1. Is my mental model wrong? Or is Spark smart enough to read in only columns a, b, and c in the example above?
    2. How can I force sqc.parquetFile() to read in data more efficiently?
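
    A minimal sketch of the pruned version of this pipeline, following the SQL route from the answer above, which should also apply on 1.2.0 (assuming sqc is a SQLContext and path points at the Parquet data; the WHERE clause mirrors the answer's filter):

    dat = sqc.parquetFile(path)
    dat.registerTempTable("dat")
    # Selecting only a, b, c in SQL lets the Parquet scan skip the other
    # columns instead of loading them and dropping them afterwards.
    pruned = sqc.sql("select a, b, c from dat where length(a) > 0")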