Spark - Converting a DataFrame to a list: improving performance


Solution 1

If you really need a local list there is not much you can do here, but one improvement is to collect only a single column rather than the whole DataFrame. On Spark 2.x+ this has to go through the underlying RDD, since flatMap is no longer exposed on DataFrames directly:

df.select(col_name).rdd.flatMap(lambda x: x).collect()
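
For context, here is a minimal self-contained sketch of that approach; the local master setting, the sample data, and the column name are assumptions for illustration:

from pyspark.sql import SparkSession

# Assumed local session, purely for illustration
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([('Alice', 1), ('Bob', 2)], ['name', 'age'])

# Collect a single column; flatMap flattens each one-field Row into its value
names = df.select('name').rdd.flatMap(lambda x: x).collect()
print(names)  # ['Alice', 'Bob']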

Solution 2

You can do it this way:

>>> [list(row) for row in df.collect()]

Example:

>>> d = [['Alice', 1], ['Bob', 2]]
>>> df = spark.createDataFrame(d, ['name', 'age'])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
>>> to_list = [list(row) for row in df.collect()]
>>> print(to_list)
[['Alice', 1], ['Bob', 2]]
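
If you need the column names alongside the values, Row objects also provide an asDict() method (a small variant, continuing with the same df):

>>> [row.asDict() for row in df.collect()]
[{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}]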

Solution 3

You can use an iterator, toLocalIterator, to save memory. The iterator will consume only as much memory as the largest partition of the DataFrame. And if you need to iterate over the result only once, the iterator is a perfect fit for this case.

d = [['Bender', 12], ['Flex', 123], ['Fry', 1234]]
df = spark.createDataFrame(d, ['name', 'value'])
df.show()
+------+-----+
|  name|value|
+------+-----+
|Bender|   12|
|  Flex|  123|
|   Fry| 1234|
+------+-----+

values = [row.value for row in df.toLocalIterator()]
print(values)
[12, 123, 1234]

Also, the toPandas() method should only be used if the resulting pandas DataFrame is expected to be small, as all of the data is loaded into the driver's memory.
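
That said, if you do go through toPandas(), enabling Arrow-based conversion usually cuts the transfer cost substantially. A sketch, assuming Spark 3.x (on Spark 2.3+ the key was spark.sql.execution.arrow.enabled), with col_name as in the question:

# Arrow-based columnar transfer speeds up toPandas() considerably
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# col_name is assumed to be a column of df, as in the question
values = df.select(col_name).toPandas()[col_name].tolist()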

Author: YAKOVM

Updated on April 17, 2020

Comments

  • YAKOVM
    YAKOVM about 4 years

I need to convert a column of the Spark dataframe to a list, to use later for matplotlib:

    df.toPandas()[col_name].values.tolist()
    

It looks like there is a high performance overhead; this operation takes around 18 sec. Is there another way to do that, or to improve the performance?

  • YAKOVM
    YAKOVM about 8 years
It didn't really help me. Maybe something else could be done?
  • zero323
    zero323 about 8 years
    Other than dropping a whole idea? Not really. Why do you want a local list?
  • YAKOVM
    YAKOVM about 8 years
For matplotlib. Maybe there is some other way?
  • zero323
    zero323 about 8 years
Well, for starters you can double check your pipeline. Is there any reason to expect faster execution? Do you cache reused data? Other than that, consider using smarter visualization techniques (sampling, bucketing, different methods of extrapolation, shading) which don't require full data - see the sampling sketch after these comments. How much data do you collect right now?
  • thewaywewere
    thewaywewere almost 7 years
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Please read this how-to-answer guide for providing a quality answer.
  • Davos
    Davos over 4 years
    A few months later you answered this question pointing out that flatMap is not supported on dataframe anymore stackoverflow.com/a/37225736/1335793
  • David Maddox
    David Maddox almost 4 years
    You can use df.toLocalIterator() instead of df.collect() for superior performance as per @Artem Osipov's answer
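
To make zero323's sampling suggestion above concrete, here is a hypothetical sketch that collects only a fraction of a column for plotting; the 10% fraction and the seed are illustrative assumptions:

# Collect only ~10% of the rows, so the driver and matplotlib
# handle a fraction of the data (keyword form requires Spark 2.3+)
sampled = (df.select(col_name)
             .sample(fraction=0.1, seed=42)
             .rdd.flatMap(lambda x: x)
             .collect())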