spark - Converting dataframe to list improving performance
Solution 1
If you really need a local list there is not much you can do here, but one improvement is to collect only a single column rather than the whole DataFrame:
df.select(col_name).flatMap(lambda x: x).collect()
Solution 2
You can do it this way:
>>> [list(row) for row in df.collect()]
Example:
>>> d = [['Alice', 1], ['Bob', 2]]
>>> df = spark.createDataFrame(d, ['name', 'age'])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
| Bob| 2|
+-----+---+
>>> to_list = [list(row) for row in df.collect()]
>>> print(to_list)
[['Alice', 1], ['Bob', 2]]
Solution 3
You can use an iterator, toLocalIterator, to save memory. The iterator will consume only as much memory as the largest partition of the DataFrame, and if you need the result just once, it is a perfect fit.
d = [['Bender', 12], ['Flex', 123],['Fry', 1234]]
df = spark.createDataFrame(d, ['name', 'value'])
df.show()
+------+-----+
| name|value|
+------+-----+
|Bender| 12|
| Flex| 123|
| Fry| 1234|
+------+-----+
values = [row.value for row in df.toLocalIterator()]
print(values)
[12, 123, 1234]
Also, the toPandas() method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.
YAKOVM
Updated on April 17, 2020
Comments
-
YAKOVM about 4 years
I need to convert a column of the Spark dataframe to a list, to use later for matplotlib:
df.toPandas()[col_name].values.tolist()
It looks like there is a high performance overhead; this operation takes around 18 seconds. Is there another way to do that, or to improve the performance?
-
YAKOVM about 8 years
It didn't really help me. Maybe something else could be done?
-
zero323 about 8 years
Other than dropping the whole idea? Not really. Why do you want a local list?
-
YAKOVM about 8 years
For matplotlib. Maybe there is some other way?
-
zero323 about 8 years
Well, for starters you can double-check your pipeline. Is there any reason to expect faster execution? Do you cache reused data? Other than that, consider using smarter visualization techniques (sampling, bucketing, different methods of extrapolation, shading) which don't require the full data. How much data do you collect right now?
-
thewaywewere almost 7 years
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Please read this how-to-answer guide on providing a quality answer.
-
Davos over 4 years
A few months later you answered this question, pointing out that flatMap is no longer supported on DataFrames: stackoverflow.com/a/37225736/1335793
-
David Maddox almost 4 years
You can use df.toLocalIterator() instead of df.collect() for superior performance, as per @Artem Osipov's answer.