How to combine and collect elements of an RDD into a list in PySpark


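I'm assuming that

lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()

gave you the following error:

NameError: name 'x' is not defined

The lambda ends at the first comma, so x.longitude is evaluated as a separate second argument to map, where no x exists. Try a single-parameter lambda that returns both values as a list: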

lat_lon = df.rdd.map(lambda x : [x.latitude, x.longitude]).collect()
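To see the difference without a cluster, here is a minimal sketch in plain Python. It assumes a hypothetical namedtuple Row as a stand-in for Spark's Row objects and uses the built-in map in place of RDD.map; the names rows, try_broken and error_message are illustrative only and do not come from the original post.

```python
from collections import namedtuple

# Hypothetical stand-in for Spark Row objects (assumption: plain Python, no Spark).
Row = namedtuple("Row", ["name", "latitude", "longitude"])
rows = [Row("M", 1.3, 22.5), Row("S", 1.6, 22.9), Row("H", 1.7, 23.4)]


def try_broken():
    """Reproduce the failing call shape and return the error message."""
    try:
        # The lambda body ends at the first comma, so x.longitude is
        # evaluated as a second argument to map(), where no x is defined.
        return list(map(lambda r, x: r.latitude, x.longitude))
    except NameError as exc:
        return str(exc)


error_message = try_broken()

# One parameter, one list-valued body: each row becomes [latitude, longitude].
lat_lon = list(map(lambda x: [x.latitude, x.longitude], rows))

print(error_message)
print(lat_lon)
```

The same reasoning should carry over to df.rdd.map, since the argument list is parsed by Python before Spark ever sees it.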

Author by

msharky

Updated on June 22, 2022

Comments

  • msharky
    msharky almost 2 years

    I am working with Apache Spark for Python (PySpark) and have created a Spark DataFrame with name, latitude, and longitude as the column names.

    My DataFrame has the form:

    name     latitude      longitude
    
    M          1.3           22.5
    S          1.6           22.9
    H          1.7           23.4
    W          1.4           23.3
    C          1.1           21.2
    ...        ...           ....
    

    I know that to collect only the latitude I can do:

    list_of_lat = df.rdd.map(lambda r: r.latitude).collect()
    
    print(list_of_lat)
    
    [1.3,1.6,1.7,1.4,1.1,...]
    

    However, I need to collect the latitude and longitude values together in a list in the form:

    [[1.3,22.5],[1.6,22.9],[1.7,23.4]...]
    

    I have tried:

    lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()
    

    However, this does not work.

    I need to use Spark because the dataset is very large (~1M rows).

    Any help would be greatly appreciated. Thanks

  • msharky
    msharky almost 7 years
    Thank you, this works! That is exactly the error it gave. Apologies for omitting that from my original post.