How to change dataframe column names in pyspark?

Solution 1

There are many ways to do that:

  • Option 1. Using selectExpr.

     data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                       ["Name", "askdaosdka"])
     data.show()
     data.printSchema()
    
     # Output
     #+-------+----------+
     #|   Name|askdaosdka|
     #+-------+----------+
     #|Alberto|         2|
     #| Dakota|         2|
     #+-------+----------+
    
     #root
     # |-- Name: string (nullable = true)
     # |-- askdaosdka: long (nullable = true)
    
     df = data.selectExpr("Name as name", "askdaosdka as age")
     df.show()
     df.printSchema()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
     #root
     # |-- name: string (nullable = true)
     # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; note that this method allows you to "overwrite" the same column. (On Python 2 you can use xrange instead of range.)

     from functools import reduce
    
     oldColumns = data.schema.names
     newColumns = ["name", "age"]
    
     df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                 range(len(oldColumns)), data)
     df.printSchema()
     df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

     from pyspark.sql.functions import col
    
     data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
     data.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

     sqlContext.registerDataFrameAsTable(data, "myTable")
     df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
     df2.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    

Solution 2

df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

The advantage of this approach: with a long list of columns, you can rename just a few of them and leave the rest untouched. It is especially handy when joining tables that have duplicate column names, as sketched below.
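
For example (the DataFrames and the join key below are made up purely for illustration), renaming a clashing column on one side before the join keeps the result unambiguous:

df1 = sqlContext.createDataFrame([(1, "Alberto"), (2, "Dakota")], ["id", "name"])
df2 = sqlContext.createDataFrame([(1, "Bogota"), (2, "Fargo")], ["id", "name"])

# Rename the clashing "name" column on the right-hand side before joining,
# so the joined DataFrame does not end up with two columns called "name"
joined = df1.join(df2.withColumnRenamed("name", "city"), on="id")

joined.columns
# ['id', 'name', 'city']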

Solution 3

If you want to change all column names, try df.toDF(*cols), as sketched below.
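
A minimal sketch, reusing the example data from Solution 1; toDF assigns the new names positionally, so pass exactly one name per existing column, in order:

data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])

# One new name per existing column, in the original column order
df = data.toDF("name", "age")
df.printSchema()

# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)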

Solution 4

If you would like to apply a simple transformation to all column names, this code does the trick (here, replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

Solution 5

df = df.withColumnRenamed('age', 'age2')
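
This renames only the named column and leaves every other column untouched. Since DataFrames are immutable, withColumnRenamed returns a new DataFrame, so assign or chain the result. A small sketch, reusing the DataFrame from Solution 1:

df = data.withColumnRenamed("askdaosdka", "age")
df.printSchema()

# root
#  |-- Name: string (nullable = true)
#  |-- age: long (nullable = true)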

Updated on July 08, 2022

Comments

  • Shubhanshu Mishra
    Shubhanshu Mishra almost 2 years

    I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:

    df.columns = new_column_name_list
    

    However, the same doesn't work in pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

    df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
    oldSchema = df.schema
    for i,k in enumerate(oldSchema.fields):
      k.name = new_column_name_list[i]
    df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
    

    This basically defines the variable twice: it infers the schema first, renames the columns in that schema, and then loads the dataframe again with the updated schema.

    Is there a better and more efficient way to do this, like we do in pandas?

    My Spark version is 1.5.0.

  • Felipe Gerard
    Felipe Gerard over 7 years
    I did it with a for loop + withColumnRenamed, but your reduce option is very nice :)
  • Alberto Bonsanto
    Alberto Bonsanto over 7 years
    @FelipeGerard that is a very bad idea
  • Felipe Gerard
    Felipe Gerard over 7 years
    Well since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same!
  • Alberto Bonsanto
    Alberto Bonsanto over 7 years
    @FelipeGerard Please check this post, bad things may happen if you have many columns.
  • Felipe Gerard
    Felipe Gerard over 7 years
    That happens when you use withColumn many times. I used withColumnRenamed just like you, but iterated with a loop instead of a reduce. I'll bet you the DF is the exact same one if you do for idx in range(n): data = data.withColumnRenamed(oldColumns[idx], newColumns[idx]) as what you posted :)
  • Alberto Bonsanto
    Alberto Bonsanto over 7 years
    @FelipeGerard I was just saying
  • Felipe Gerard
    Felipe Gerard over 7 years
  • Admin
    Admin about 7 years
    @AlbertoBonsanto How to select column as alias if there are more than 100 columns which is the best option
  • Quetzalcoatl
    Quetzalcoatl over 6 years
    Is there a variant of this solution that leaves all other columns unchanged? With this method, and others, only the explicitly named columns remained (all others were removed).
  • Isma
    Isma about 6 years
    While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
  • Quetzalcoatl
    Quetzalcoatl about 6 years
    this solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution.
  • NuValue
    NuValue almost 6 years
    Hi @Alberto Bonsanto, thanks for your answer. Using your withColumnRenamed() answer I receive an error: "NameError: name 'reduce' is not defined". Surely I am missing something trivial.
  • Andrew Myers
    Andrew Myers almost 6 years
    Pankaj Kumar's answer and Alberto Bonsanto's answer (which are from 2016 and 2015, respectively) already suggest using withColumnRenamed.
  • mnis.p
    mnis.p almost 6 years
    +1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed.
  • joaofbsm
    joaofbsm over 5 years
    @NuValue, you should first run from functools import reduce
  • user989762
    user989762 over 5 years
    @Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names
  • Quetzalcoatl
    Quetzalcoatl over 5 years
    @user989762: agreed; my initial understanding was incorrect on this one...!
  • Sahan Jayasumana
    Sahan Jayasumana over 5 years
    Thanks, yes, but there are a couple of different syntaxes; maybe we should collect them into a more formal answer? data.withColumnRenamed(oldColumns[idx], newColumns[idx]) vs data.withColumnRenamed(columnname, new columnname). I think it depends on which version of pyspark you're using.
  • Ed Bordin
    Ed Bordin over 5 years
    This is not a different syntax. The only difference is you did not store your column names in an array.
  • Simon30
    Simon30 almost 5 years
    Hi @AlbertoBonsanto thank you great answer. I have a 3K+ rows dataframe and my code is inspired from your reduce solution, however it seems to run forever, do you know which one I should use for my use case? (nb: i also have 2K+ columns)
  • Simon30
    Simon30 almost 5 years
    I found the solution: use df.toDF(*new_cols) instead of weird reduce
  • Tom N Tech
    Tom N Tech almost 5 years
    In PySpark 2.4 with Python 3.6.8, the only methods that work out of these are df.select('id').withColumnRenamed('id', 'new_id') and spark.sql("SELECT id AS new_id FROM df").
  • sogu
    sogu over 4 years
    On Python 3.7 I recommend Option 3; it works flawlessly (Options 1 and 2 have some issues, e.g. library locations have moved: stackoverflow.com/questions/8689184/… )
  • Darth Egregious
    Darth Egregious over 4 years
    Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because old_columns would be the same as df.columns.
  • KGS
    KGS over 4 years
    I think I'll use this df.selectExpr(*["{0} as {0}_other".format(x) for x in df.columns]) to rename all the columns at once (I'll be doing a crossJoin with itself)
  • HanaKaze
    HanaKaze about 4 years
    I think this should be selected as the best answer
  • Pete
    Pete about 4 years
    Option 2: use range() instead of xrange() for Python 3.*
  • Nic Scozzaro
    Nic Scozzaro almost 4 years
    For me I was getting the header names from a pandas dataframe, so I just used df = df.toDF(*my_pandas_df.columns)
  • rbatt
    rbatt almost 4 years
    This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having cols be the new column names, and just assuming that the order of names in cols corresponds to the column order of the dataframe?
  • Powers
    Powers almost 4 years
    This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1
  • Powers
    Powers almost 4 years
    This is great for renaming a few columns. See my answer for a solution that can programmatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run withColumnRenamed (running withColumnRenamed that many times would also be inefficient, as explained here).
  • Frank
    Frank over 3 years
    option 2 works for me. IMO option 2 is the best if you want to rename all your columns.
  • John Haberstroh
    John Haberstroh over 3 years
    I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the toDF function and a list comprehension that implements whatever logic is desired was much more succinct. For example, def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF(*[c + suffix for c in spark_df.columns])
  • Haha TTpro
    Haha TTpro over 3 years
    Why use withColumn to create another duplicate column with a different name when you can use withColumnRenamed?
  • astentx
    astentx over 3 years
    There were a lot of similar answers, so there was no need to post another duplicate.
  • Sheldore
    Sheldore about 3 years
    The first argument in withColumnRenamed is the old column name. Your Method 1 is wrong
  • Sheldore
    Sheldore about 3 years
    @Powers: I checked your article but it contains a command to cast all the string types to integers. Could you please write here how one needs to use your Medium article's code to rename, say, only two columns as written in the above answer?
  • Sheldore
    Sheldore about 3 years
    Since mapping is a dictionary, why can't you simply use mapping[c] instead of mapping.get(c, c)?
  • poiter
    poiter almost 3 years
    This is clearly the best answer if you want to pass in a list of columns
  • Krunal Patel
    Krunal Patel almost 3 years
    @rbatt Using df.select in combination with pyspark.sql.functions col-method is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Checkout the comment for code snippet: stackoverflow.com/a/62728542/8551891
  • Krunal Patel
    Krunal Patel almost 3 years
    Option 3 is the right way to do it in spark (scala/python). Check this out: stackoverflow.com/a/62728542/8551891