PySpark: Convert DataFrame to RDD[String]


A PySpark Row is a subclass of tuple and can be used as one. All you need here is a simple map (or flatMap, if you also want to flatten the rows) with list:

data.map(list)

or, if the columns may hold non-string types:

data.map(lambda row: [str(c) for c in row])
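
To tie the two one-liners above to the question's title, here is a minimal sketch of both readings of "RDD[String]": a list of strings per row, and a single joined string per row. The helper names (row_to_strings, row_to_csv_line, dataframe_to_string_rdd) and the comma separator are assumptions made for illustration, not part of PySpark; only df.rdd and RDD.map are actual PySpark calls.

```python
def row_to_strings(row):
    """Convert one Row (or any tuple-like record) to a list of strings."""
    # Row subclasses tuple, so it can be iterated field by field.
    # Caveat: str(None) yields the literal "None" for null columns.
    return [str(c) for c in row]


def row_to_csv_line(row, sep=","):
    """Join the stringified fields of one row into a single string."""
    return sep.join(row_to_strings(row))


def dataframe_to_string_rdd(df, sep=","):
    """Hypothetical helper: assumes df is a pyspark.sql.DataFrame.

    Returns an RDD whose elements are plain strings, one per row.
    """
    # df.rdd exposes the DataFrame's underlying RDD of Row objects.
    return df.rdd.map(lambda row: row_to_csv_line(row, sep))
```

With the DataFrame from the question, dataframe_to_string_rdd(df).first() should give one comma-joined string for the first row, while df.rdd.map(row_to_strings) keeps each row as a list of strings. Note that the naive join does no quoting or escaping, so it is only safe when the fields cannot contain the separator.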
Author: Toren

Updated on November 14, 2020

Comments

  • Toren
    Toren over 3 years

    I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].

    I converted a DataFrame df to an RDD data:

    data = df.rdd
    type(data)
    ## pyspark.rdd.RDD
    

    The new RDD data contains Row objects:

    first = data.first()
    type(first)
    ## pyspark.sql.types.Row
    
    data.first()
    Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
    

    I'd like to convert each Row to a list of strings, like the example below:

    u'aaa',u'bbb',u'ccc',u'ddd'
    

    Thanks

  • Toren
    Toren about 8 years
    Thanks @zero323, with your answers my learning curve is getting better.