How to combine and collect elements of an RDD into a list in pyspark
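I'm assuming
lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()
gave you the following error:
NameError: name 'x' is not defined
The lambda body ends at the comma after r.latitude, so Python treats x.longitude as a second argument to map and evaluates it before any x exists. Use a single-argument lambda that returns both fields as a list:
lat_lon = df.rdd.map(lambda x : [x.latitude, x.longitude]).collect()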
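This is plain Python behaviour rather than anything Spark-specific, so it can be reproduced without a cluster. Below is a minimal sketch, assuming a namedtuple as a stand-in for Spark's Row and the built-in map in place of rdd.map. The Row type and the sample rows are hypothetical, with values copied from the table in the question.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, for illustration only.
Row = namedtuple("Row", ["name", "latitude", "longitude"])
rows = [Row("M", 1.3, 22.5), Row("S", 1.6, 22.9), Row("H", 1.7, 23.4)]

# Broken form: the lambda body is only `r.latitude`, so `x.longitude`
# is a separate argument that Python evaluates before calling map.
try:
    list(map(lambda r, x: r.latitude, x.longitude))
except NameError as exc:
    error_message = str(exc)

# Fixed form: one parameter, and the body returns both fields as a list.
lat_lon = list(map(lambda x: [x.latitude, x.longitude], rows))

print(error_message)
print(lat_lon)
```

The same single-argument lambda is what df.rdd.map expects, since it calls the function once per row.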
Author: msharky
Updated on June 22, 2022
Comments
-
msharky almost 2 years
I am working with Apache Spark for Python and have created a Spark DataFrame with name, latitude, and longitude as the column names.
My DataFrame is in the form:
name  latitude  longitude
M     1.3       22.5
S     1.6       22.9
H     1.7       23.4
W     1.4       23.3
C     1.1       21.2
...   ...       ...
I know that to collect only the latitude I can do
list_of_lat = df.rdd.map(lambda r: r.latitude).collect()
print(list_of_lat)
[1.3,1.6,1.7,1.4,1.1,...]
However, I need to collect the latitude and longitude values together in a list in the form:
[[1.3,22.5],[1.6,22.9],[1.7,23.4]...]
I have tried
lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()
However, this does not work.
I need to use Spark because the dataset is very large (~1M rows).
Any help would be greatly appreciated. Thanks.
-
msharky almost 7 years
Thank you, this works! That's exactly the error it gave. Apologies for omitting that from my original post.