Read a csv into an RDD using Spark 2.0

You can convert a DataFrame to an RDD using .rdd, as shown below:

rdd = session.read.csv("myCSV.csv", header=True).rdd
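
For reference, here is a minimal, self-contained sketch of that approach, assuming a SparkSession named session and a local file myCSV.csv with a header row (the file name and the column values in the comments are placeholders). The resulting RDD contains pyspark.sql.Row objects, which you can map to plain tuples if you want raw records:

from pyspark.sql import SparkSession

# Assumes Spark 2.x and a CSV file named myCSV.csv with a header row (placeholder name)
session = SparkSession.builder.appName("csv-to-rdd").getOrCreate()

# Read with the built-in csv reader, then drop down to the underlying RDD of Row objects
rdd = session.read.csv("myCSV.csv", header=True).rdd

# Each element is a pyspark.sql.Row; convert to plain tuples for "raw" records
tuples = rdd.map(tuple)

print(rdd.first())     # e.g. Row(col1='a', col2='b')
print(tuples.first())  # e.g. ('a', 'b')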

Comments

  • Sal
    Sal almost 2 years

    I want to read a csv file into an RDD using Spark 2.0. I can read it into a dataframe using

    df = session.read.csv("myCSV.csv", header=True)
    

    and I can load it as a text file and then process it using

    import csv
    import itertools

    rdd = context.textFile("myCSV.csv")
    header = rdd.first().replace('"', '').split(',')
    # drop the header row in the first partition, then parse each partition with csv.reader
    rdd = (rdd.mapPartitionsWithIndex(lambda idx, itr: itertools.islice(itr, 1, None) if idx == 0 else itr)
              .mapPartitions(lambda x: csv.reader(x)))
    

    Is there a way to use the built-in csv reader (spark-csv) to go straight to an RDD, without first reading into a dataframe and converting it? Or is the above RDD method good enough, since the built-in reader presumably does something similar under the hood?

    Edit: 1) Again, I don't want to read into a dataframe and then convert to an RDD. That builds up an entire structure only to have it immediately dropped, which seems pointless. 2) Yes, I can time the above (against the DF -> RDD conversion), but that will only tell me whether my RDD read solution is better than the conversion. A built-in csv-to-RDD method would most likely be more optimized than the above code.

  • Sal
    Sal over 7 years
    Like I said in the original question, without having to convert. :) Seems silly to build up some structure and then drop it immediately.
  • David
    David over 7 years
    Ah, I missed that. Well, it doesn't cost anything computationally to convert from a DF to an RDD, but you're right that you may be wasting some resources building up the structure. The overhead from building the structure should be quite small, since a CSV's schema is defined by its first line (i.e. there is no need to scan the whole file). The only way to know for sure is to do some timing tests (a rough sketch of such a test follows these comments).
  • Sal
    Sal over 7 years
    I could do timing tests, but if a built-in method existed, I'm sure it would be more optimized than my solution above. So timing will only tell me whether the above solution works better than converting.
  • David
    David over 7 years
    Fair point. Unfortunately, I don't believe there is a built-in method to do this. My guess is that it's because the overhead of determining the schema to build a DF in this situation is bordering on inconsequential: CSVs have a dataframe-like format (as opposed to JSON files, which are more RDD-like).
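
A rough sketch of the kind of timing test discussed above, assuming a SparkSession named session, a SparkContext named context, and a local myCSV.csv with a header row (all placeholder names, not from the original thread). It simply forces both reads with count() and compares wall-clock time:

import csv
import itertools
import time

def time_it(label, build_rdd):
    # count() forces a full read, so we measure the actual work rather than just plan construction
    start = time.time()
    n = build_rdd().count()
    print("%s: %d rows in %.2fs" % (label, n, time.time() - start))

# Approach 1: built-in csv reader, then drop to the underlying RDD
time_it("read.csv + .rdd", lambda: session.read.csv("myCSV.csv", header=True).rdd)

# Approach 2: textFile + csv module, skipping the header in the first partition
def text_file_rdd():
    rdd = context.textFile("myCSV.csv")
    return (rdd.mapPartitionsWithIndex(
                lambda idx, itr: itertools.islice(itr, 1, None) if idx == 0 else itr)
               .mapPartitions(lambda x: csv.reader(x)))

time_it("textFile + csv.reader", text_file_rdd)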