How to assign and use column headers in Spark?
Solution 1
The solution to this question really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and assign column names with toDF, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.
filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
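If the file's first row is itself a header, Spark 2.0+ can pick the names up directly; a minimal sketch, reusing the placeholder path from above:
filename = "/path/to/file.csv"
# header=True uses the first row as column names; inferSchema=True guesses column types
df = spark.read.csv(filename, header=True, inferSchema=True)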
Solution 2
Here is how to add column names when building a DataFrame from an RDD.
Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:
f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))
Suppose the data has 3 columns:
data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]
Now you can specify the column names when converting this RDD to a DataFrame using toDF():
df_withcol = data_rdd.toDF(['height','color','width'])
df_withcol.printSchema()
root
|-- height: string (nullable = true)
|-- color: string (nullable = true)
|-- width: string (nullable = true)
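Note that every column comes out as a string, because the RDD held raw text split on commas. A minimal sketch of casting to more useful types (double for the numeric fields is an assumption based on the sample row):
from pyspark.sql.functions import col

# cast the numeric-looking fields from string to double
df_typed = df_withcol \
    .withColumn('height', col('height').cast('double')) \
    .withColumn('width', col('width').cast('double'))
df_typed.printSchema()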
If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:
df_default = data_rdd.toDF()
df_default.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
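If you are stuck with the default names, you can rename them afterwards with withColumnRenamed; a sketch reusing the column names from above:
# rename the auto-generated columns one at a time
df_renamed = df_default \
    .withColumnRenamed('_1', 'height') \
    .withColumnRenamed('_2', 'color') \
    .withColumnRenamed('_3', 'width')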
Comments
GoldenPlatinum:
I am reading a dataset as below.
f = sc.textFile("s3://test/abc.csv")
My file contains 50+ fields and I want to assign a column header to each field so I can reference them later in my script.
How do I do that in PySpark? Is a DataFrame the way to go here?
PS - Newbie to Spark.