Read a csv into an RDD using Spark 2.0

You can convert a DataFrame to an RDD using .rdd, as shown below:

rdd = session.read.csv("myCSV.csv", header=True).rdd
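
For reference, here is a minimal, self-contained sketch of that approach, assuming a SparkSession named session and a local file myCSV.csv with a header row (the file name and the column values in the comments are placeholders). The resulting RDD contains pyspark.sql.Row objects, which you can map to plain tuples if you want raw records:

from pyspark.sql import SparkSession

# Assumes Spark 2.x and a CSV file named myCSV.csv with a header row (placeholder name)
session = SparkSession.builder.appName("csv-to-rdd").getOrCreate()

# Read with the built-in csv reader, then drop down to the underlying RDD of Row objects
rdd = session.read.csv("myCSV.csv", header=True).rdd

# Each element is a pyspark.sql.Row; convert to plain tuples for "raw" records
tuples = rdd.map(tuple)

print(rdd.first())     # e.g. Row(col1='a', col2='b')
print(tuples.first())  # e.g. ('a', 'b')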

Comments

  • Sal
    Sal almost 2 years

    I want to read a csv file into an RDD using Spark 2.0. I can read it into a dataframe using

    df = session.read.csv("myCSV.csv", header=True)
    

    and I can load it as a text file and then process it using

    import csv
    import itertools

    rdd = context.textFile("myCSV.csv")
    header = rdd.first().replace('"', '').split(',')
    # drop the header row in the first partition, then parse each partition with csv.reader
    rdd = (rdd.mapPartitionsWithIndex(lambda idx, itr: itertools.islice(itr, 1, None) if idx == 0 else itr)
              .mapPartitions(lambda x: csv.reader(x)))
    

    Is there a way to use the built-in csv reader (spark-csv) to go straight to an RDD, without first reading into a dataframe and converting it? Or is the above RDD method good enough, since the built-in reader presumably does something similar under the hood?

    Edit: 1) Again, I don't want to read into a dataframe and then convert to an RDD. That builds up an entire structure only to have it immediately dropped, which seems pointless. 2) Yes, I can time the above (against the DF -> RDD conversion), but that will only tell me whether my RDD read solution is better than the conversion. A built-in csv-to-RDD method would most likely be more optimized than the above code.

  • Sal
    Sal over 7 years
    Like I said in the original question, without having to convert. :) Seems silly to build up some structure and then drop it immediately.
  • David
    David over 7 years
    Ah, I missed that. Well, it doesn't cost anything computationally to convert from a DF to an RDD, but you're right that you may be wasting some resources building up the structure. The overhead from building the structure should be quite small, since a CSV's schema is defined by its first line (i.e. there is no need to scan the whole file). The only way to know for sure is to do some timing tests (a rough sketch of such a test follows these comments).
  • Sal
    Sal over 7 years
    I could do timing tests, but if a built-in method existed, I'm sure it would be more optimized than my solution above. So timing will only tell me whether the above solution works better than converting.
  • David
    David over 7 years
    Fair point. Unfortunately, I don't believe there is a built-in method to do this. My guess is that it's because the overhead of determining the schema to build a DF in this situation is bordering on inconsequential: CSVs have a dataframe-like format (as opposed to JSON files, which are more RDD-like).
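
A rough sketch of the kind of timing test discussed above, assuming a SparkSession named session, a SparkContext named context, and a local myCSV.csv with a header row (all placeholder names, not from the original thread). It simply forces both reads with count() and compares wall-clock time:

import csv
import itertools
import time

def time_it(label, build_rdd):
    # count() forces a full read, so we measure the actual work rather than just plan construction
    start = time.time()
    n = build_rdd().count()
    print("%s: %d rows in %.2fs" % (label, n, time.time() - start))

# Approach 1: built-in csv reader, then drop to the underlying RDD
time_it("read.csv + .rdd", lambda: session.read.csv("myCSV.csv", header=True).rdd)

# Approach 2: textFile + csv module, skipping the header in the first partition
def text_file_rdd():
    rdd = context.textFile("myCSV.csv")
    return (rdd.mapPartitionsWithIndex(
                lambda idx, itr: itertools.islice(itr, 1, None) if idx == 0 else itr)
               .mapPartitions(lambda x: csv.reader(x)))

time_it("textFile + csv.reader", text_file_rdd)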