Select specific columns in a PySpark dataframe to improve performance

Solution 1

IMO it makes sense to filter them beforehand:

df = sqlContext.sql('select col_1, col_2, col_3 from mytable')

so you don't waste resources on data you will never use.

If you can't do it that way, then selecting the columns after loading, as you did, is fine.
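
If you prefer the DataFrame API over raw SQL, a minimal sketch of the same idea (assuming a Hive table named mytable with columns col_1, col_2 and col_3, as in the question; on older Spark versions you would use sqlContext instead of a SparkSession):

from pyspark.sql import SparkSession

# Hypothetical session; enableHiveSupport() is needed to read Hive tables.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Load only the columns you actually need instead of the whole table.
df = spark.table('mytable').select('col_1', 'col_2', 'col_3')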

Solution 2

It is certainly good practice, but it is rather unlikely to result in a performance boost unless you try to pass the data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer the projections and push them as early as possible into the execution plan.
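
A quick way to convince yourself of this is to inspect the execution plan. A minimal sketch, reusing the table and column names from the question (the aggregate itself is hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# No explicit select() here -- the whole table is referenced.
df = spark.table('mytable')

# Only col_1 and col_2 are needed to compute this aggregate.
agg = df.groupBy('col_1').agg(F.avg('col_2'))

# The printed plan should show the scan projected down to col_1 and col_2,
# i.e. the optimizer pushes the projection without an explicit select().
agg.explain(True)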

It is also worth noting that calling df.count() after df.cache() is useless most of the time (if not always). In general, count is rewritten by the optimizer as

SELECT SUM(1) FROM table

so what is typically requested from the source is:

SELECT 1 FROM table

Long story short, there is nothing useful to cache here.
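
You can see this in the plan itself. A minimal sketch with synthetic data (df.count() is an action, so the equivalent groupBy().count() is used to have something to call explain() on):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic two-column DataFrame standing in for the table above.
df = spark.range(1000).selectExpr('id AS col_1', 'id * 2 AS col_2')

# The optimized plan reduces to an aggregate that only counts rows;
# no column values are needed, which is why caching beforehand adds nothing.
df.groupBy().count().explain(True)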


Comments

  • Ivan over 1 year

    Working with Spark DataFrames imported from Hive, I sometimes end up with several columns that I don't need. Supposing that I don't want to filter them with

    df = sqlContext.sql('select cols from mytable')
    

    and I'm importing the entire table with

    df = sqlContext.table('mytable')
    

    does a select and a subsequent cache improve performance and decrease memory usage, like

    df = df.select('col_1', 'col_2', 'col_3')
    df.cache()
    df.count()
    

    or is it just a waste of time? I will be doing lots of operations and data manipulations on df, like avg, withColumn, etc.

  • Ivan over 7 years
    Thanks again, @zero323! I was thinking more along the lines of doing a count right after a cache to commit the cache operation and check a few numbers as a byproduct. I noticed that sometimes doing a select before the cache makes performance worse.
  • zero323 over 7 years
    Well, reasoning about caching in Spark SQL is relatively hard, and tricks like cache / count are not the best idea. You can see some performance boost that is not related to Spark caching when data is, for example, memory mapped, but IMHO it is more a ritual than anything else. If you just want to confirm that a DataFrame is marked as cached, see the sketch below.
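
If the point of the cache / count trick is just to confirm that a DataFrame is marked as cached, a more direct check is possible. A minimal sketch with synthetic data (is_cached and storageLevel are DataFrame attributes in reasonably recent PySpark versions; note they report the requested storage level, not whether the data has actually been materialized yet):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).selectExpr('id AS col_1', 'id * 2 AS col_2')
df.cache()

# Inspect the caching status directly instead of forcing a count().
print(df.is_cached)     # True once cache() has been requested
print(df.storageLevel)  # e.g. Disk Memory Deserialized 1x Replicated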