RDD to DataFrame in pyspark (columns from rdd's first element)

11,102

You have to remove the header row from your RDD before converting it. One way to do that, assuming your RDD is in the variable rdd:

>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+ 
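Note that filtering on row != header also drops any data row that happens to equal the header. A minimal sketch of an index-based alternative follows, which removes only the first element and does not collect the RDD to the driver. The helper names (split_fields, drop_first, rdd_to_dataframe) and the separator pattern are assumptions made for illustration, not part of the original answer; spark is assumed to be an existing SparkSession.

```python
import re

# Hypothetical separator pattern: the question mentions '#####' and '#@#'.
# Adjust it to whatever separators the real file uses.
SEPARATOR = re.compile(r"#####|#@#")


def split_fields(line):
    """Split one raw text line into a list of field strings."""
    return SEPARATOR.split(line)


def drop_first(indexed_row):
    """Return True for every (row, index) pair except the one at index 0."""
    _, index = indexed_row
    return index != 0


def rdd_to_dataframe(spark, path):
    """Build a DataFrame whose column names come from the first line of the file.

    `spark` is assumed to be an existing SparkSession.
    """
    rows = spark.sparkContext.textFile(path).map(split_fields)
    header = rows.first()  # e.g. ['mailid', 'age', 'address']
    data = (
        rows.zipWithIndex()                 # pair each row with its position
            .filter(drop_first)             # drop only the element at index 0
            .map(lambda pair: pair[0])      # discard the index again
    )
    return data.toDF(header)
```

Usage would look like rdd_to_dataframe(spark, '/path/data.csv').show(). Keep in mind that zipWithIndex has to run an extra Spark job when the RDD has more than one partition, so the simpler value-based filter above may still be preferable when no data row can match the header.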
Author by

Satya

Trust me, I want to be a programmer, and I am still confused about whether I already am one or whether my status is still in-progress. Either way I like my status, preferably the "in-progress" one.

Updated on June 08, 2022

Comments

  • Satya
    Satya almost 2 years

    I have created an RDD from a CSV file whose first row is the header line. Now I want to create a DataFrame from that RDD and take the column names from the first element of the RDD.

    The problem is that I can create the DataFrame with the column names from rdd.first(), but the resulting DataFrame also contains the header as its first data row. How can I remove it?

    lines = sc.textFile('/path/data.csv')
    # The separator can be several characters long (e.g. #### or #@#), so the CSV can't be read directly into a DataFrame
    rdd = lines.map(lambda x: x.split('#####'))
    # rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]  (the first element is the header)
    df = rdd.toDF(rdd.first())  # retain the column names from rdd.first()
    df.show()
    # mailid  age  address
    # mailid  age  address   <- I don't want this row in the DataFrame data
    # satya    23  Mumbai
    # abc      27  Goa
    

    How can I keep that first element from ending up in the DataFrame data? Is there an option I can pass to rdd.toDF(rdd.first()) to do that?

    Note: I can't collect the RDD into a list, remove the first item from that list, parallelize the list back into an RDD, and then call toDF().

    Please suggest! Thanks.