Select specific columns in a PySpark dataframe to improve performance
Solution 1
IMO it makes sense to filter them beforehand:
df = sqlContext.sql('select col_1, col_2, col_3 from mytable')
so you won't waste resources...
If you can't do it this way, then you can do it as you did it...
Solution 2
It is certainly a good practice, but it is rather unlikely to result in a performance boost unless you try to pass data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer projections and push them as early as possible in the execution plan.
Also it is worth noting that calling `df.count()` after `df.cache()` will be useless most of the time (if not always). In general, `count` is rewritten by the optimizer as
SELECT SUM(1) FROM table
so what is typically requested from the source is
SELECT 1 FROM table
Long story short, there is nothing useful to cache here.
Comments
-
Ivan over 1 year
Working with Spark dataframes imported from Hive, sometimes I end up with several columns that I don't need. Supposing that I don't want to filter them with `df = sqlContext.sql('select cols from mytable')`, and I'm importing the entire table with `df = sqlContext.table(mytable)`, does a `select` and subsequent `cache` improve performance/decrease memory usage, like `df = df.select('col_1', 'col_2', 'col_3'); df.cache(); df.count()`, or is it just a waste of time? I will do lots of operations and data manipulations on `df`, like `avg`, `withColumn`, etc.
-
Ivan over 7 years Thanks again, @zero323! I was thinking more along the lines of doing a count just after a cache to commit the cache operation and check a few numbers as a byproduct. I noticed that sometimes doing a `select` makes the performance worse before the cache.
-
zero323 over 7 years Well, reasoning about cache in Spark SQL is relatively hard, and tricks like `cache`/`count` are not the best idea. You can see some performance boost that is not related to Spark caching when data is, for example, memory mapped, but IMHO it is more a ritual than anything else.