Read a csv into an RDD using Spark 2.0
You can convert a dataframe to an RDD by using .rdd, as below:
rdd = session.read.csv("myCSV.csv", header=True).rdd
Sal
Updated on June 17, 2022

Comments
-
Sal almost 2 years
I want to read a csv file into an RDD using Spark 2.0. I can read it into a dataframe using
df = session.read.csv("myCSV.csv", header=True)
and I can load it as a text file and then process it using
import csv
from itertools import islice

rdd = context.textFile("myCSV.csv")
header = rdd.first().replace('"', '').split(',')
rdd = (rdd.mapPartitionsWithIndex(lambda idx, itr: islice(itr, 1, None) if idx == 0 else itr)
          .mapPartitions(lambda x: csv.reader(x)))
Is there a way to use the built-in csv reader (spark-csv) to go straight to an RDD without first building a dataframe? Or maybe the above RDD method is good enough, since the built-in reader does something similar under the hood?
Edit: 1) Again, I don't want to read into a dataframe and then convert to RDD. This will build up an entire structure only to have it immediately dropped. Seems pointless. 2) Yes, I can time the above (against DF -> RDD conversion), but that will only tell me if my RDD read solution is better than conversion. A built in csv to RDD method will most likely be more optimized than the above code.
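(For reference, the header-drop and per-partition parsing in the question reduce to running `csv.reader` over an iterator of lines. A minimal sketch outside Spark, using a hypothetical hard-coded list of lines standing in for the first partition of the file:)

```python
import csv
from itertools import islice

# Hypothetical stand-in for the lines of the file's first partition.
lines = ['"name","sal"', 'Alice,100', 'Bob,200']

# Header: strip quotes from the first line and split on commas.
header = lines[0].replace('"', '').split(',')

# Skip the header line (what mapPartitionsWithIndex does for partition 0),
# then parse the remaining lines with the csv module.
rows = list(csv.reader(islice(iter(lines), 1, None)))

print(header)  # ['name', 'sal']
print(rows)    # [['Alice', '100'], ['Bob', '200']]
```
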
-
Sal over 7 years
Like I said in the original question, without having to convert. :) Seems silly to build up some structure and then drop it immediately.
-
David over 7 years
Ah, I missed that. Well, it doesn't cost anything computationally to convert from a DF to an RDD, but you're right that you may be wasting some resources building up the structure. The overhead of building the structure should be quite small, since a CSV's schema is defined by its first line (i.e., there's no need to scan the whole file). The only way to know for sure is to do some timing tests.
-
Sal over 7 years
I could do timing tests, but if a built-in method existed I'm sure it would be more optimized than my solution above. So timing will only tell me whether the above solution works better than converting.
-
David over 7 years
Fair point. Unfortunately, I don't believe there is a built-in method to do this. I'd guess that's because the overhead of determining the schema to make a DF in this situation is bordering on inconsequential. CSVs have a dataframe-like format (as opposed to JSON files, which are more RDD-like).