Spark - Converting a DataFrame to a list: improving performance


Solution 1

If you really need a local list there is not much you can do here, but one improvement is to collect only a single column rather than the whole DataFrame. On Spark 2.x+ this has to go through the underlying RDD, since flatMap is no longer exposed on DataFrames directly:

df.select(col_name).rdd.flatMap(lambda x: x).collect()
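
For context, here is a minimal self-contained sketch of that approach; the local master setting, the sample data, and the column name are assumptions for illustration:

from pyspark.sql import SparkSession

# Assumed local session, purely for illustration
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([('Alice', 1), ('Bob', 2)], ['name', 'age'])

# Collect a single column; flatMap flattens each one-field Row into its value
names = df.select('name').rdd.flatMap(lambda x: x).collect()
print(names)  # ['Alice', 'Bob']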

Solution 2

You can do it this way:

>>> [list(row) for row in df.collect()]

Example:

>>> d = [['Alice', 1], ['Bob', 2]]
>>> df = spark.createDataFrame(d, ['name', 'age'])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
>>> to_list = [list(row) for row in df.collect()]
>>> print(to_list)
[['Alice', 1], ['Bob', 2]]
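
If you need the column names alongside the values, Row objects also provide an asDict() method (a small variant, continuing with the same df):

>>> [row.asDict() for row in df.collect()]
[{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}]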

Solution 3

You can use an iterator, toLocalIterator, to save memory. The iterator will consume only as much memory as the largest partition of the DataFrame. And if you need to iterate over the result only once, the iterator is a perfect fit for this case.

d = [['Bender', 12], ['Flex', 123], ['Fry', 1234]]
df = spark.createDataFrame(d, ['name', 'value'])
df.show()
+------+-----+
|  name|value|
+------+-----+
|Bender|   12|
|  Flex|  123|
|   Fry| 1234|
+------+-----+

values = [row.value for row in df.toLocalIterator()]
print(values)
[12, 123, 1234]

Also, the toPandas() method should only be used if the resulting pandas DataFrame is expected to be small, as all of the data is loaded into the driver's memory.
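
That said, if you do go through toPandas(), enabling Arrow-based conversion usually cuts the transfer cost substantially. A sketch, assuming Spark 3.x (on Spark 2.3+ the key was spark.sql.execution.arrow.enabled), with col_name as in the question:

# Arrow-based columnar transfer speeds up toPandas() considerably
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# col_name is assumed to be a column of df, as in the question
values = df.select(col_name).toPandas()[col_name].tolist()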

Author: YAKOVM

Updated on April 17, 2020

Comments

  • YAKOVM
    YAKOVM about 4 years

I need to convert a column of the Spark dataframe to a list, to use later for matplotlib:

    df.toPandas()[col_name].values.tolist()
    

It looks like there is a high performance overhead; this operation takes around 18 sec. Is there another way to do that, or to improve the performance?

  • YAKOVM
    YAKOVM about 8 years
It didn't really help me. Maybe something else could be done?
  • zero323
    zero323 about 8 years
    Other than dropping a whole idea? Not really. Why do you want a local list?
  • YAKOVM
    YAKOVM about 8 years
For matplotlib. Maybe there is some other way?
  • zero323
    zero323 about 8 years
Well, for starters you can double check your pipeline. Is there any reason to expect faster execution? Do you cache reused data? Other than that, consider using smarter visualization techniques (sampling, bucketing, different methods of extrapolation, shading) which don't require full data - see the sampling sketch after these comments. How much data do you collect right now?
  • thewaywewere
    thewaywewere almost 7 years
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Please read this how-to-answer guide for providing a quality answer.
  • Davos
    Davos over 4 years
    A few months later you answered this question pointing out that flatMap is not supported on dataframe anymore stackoverflow.com/a/37225736/1335793
  • David Maddox
    David Maddox almost 4 years
    You can use df.toLocalIterator() instead of df.collect() for superior performance as per @Artem Osipov's answer
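
To make zero323's sampling suggestion above concrete, here is a hypothetical sketch that collects only a fraction of a column for plotting; the 10% fraction and the seed are illustrative assumptions:

# Collect only ~10% of the rows, so the driver and matplotlib
# handle a fraction of the data (keyword form requires Spark 2.3+)
sampled = (df.select(col_name)
             .sample(fraction=0.1, seed=42)
             .rdd.flatMap(lambda x: x)
             .collect())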