How to change dataframe column names in pyspark?
Solution 1
There are many ways to do that:
-
Option 1. Using selectExpr:
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], ["Name", "askdaosdka"])
data.show()
data.printSchema()

# Output
#+-------+----------+
#|   Name|askdaosdka|
#+-------+----------+
#|Alberto|         2|
#| Dakota|         2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)

df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|Alberto|  2|
#| Dakota|  2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
-
Option 2. Using withColumnRenamed; note that this method allows you to "overwrite" the same column. For Python 3, replace xrange with range.
from functools import reduce

oldColumns = data.schema.names
newColumns = ["name", "age"]

df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
df.printSchema()
df.show()
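A Python 3 version of the same reduce fold can be sketched as follows; the FakeDF class is a made-up stand-in so the mechanics are runnable without a Spark session (a real pyspark DataFrame exposes the same withColumnRenamed signature):

```python
from functools import reduce

def rename_all(df, old_names, new_names):
    # Fold withColumnRenamed over the index range (Python 3 uses range, not xrange).
    return reduce(
        lambda acc, idx: acc.withColumnRenamed(old_names[idx], new_names[idx]),
        range(len(old_names)),
        df,
    )

# Minimal stand-in for a DataFrame, for illustration only.
class FakeDF:
    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, old, new):
        # Like pyspark, return a new object instead of mutating in place.
        return FakeDF([new if c == old else c for c in self.columns])

df = rename_all(FakeDF(["Name", "askdaosdka"]), ["Name", "askdaosdka"], ["name", "age"])
print(df.columns)  # -> ['name', 'age']
```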
-
Option 3. Using alias; in Scala you can also use as.
from pyspark.sql.functions import col

data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|Alberto|  2|
#| Dakota|  2|
#+-------+---+
-
Option 4. Using sqlContext.sql, which lets you run SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()

# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|Alberto|  2|
#| Dakota|  2|
#+-------+---+
Solution 2
df = df.withColumnRenamed("colName", "newColName")\
.withColumnRenamed("colName2", "newColName2")
The advantage of this approach: with a long list of columns you may want to change only a few of the names, which makes this very convenient. It is especially useful when joining tables that have duplicate column names.
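For the join case, one common pattern is to rename the clashing columns on one side before joining. A small sketch (the column names "id" and "ts" are made up for illustration):

```python
# Hypothetical clash: suppose both join inputs carry "id" and "ts" columns.
# Build (old, new) pairs, then chain withColumnRenamed over them on one side.
clashing = ["id", "ts"]
renames = [(c, "left_" + c) for c in clashing]
print(renames)  # -> [('id', 'left_id'), ('ts', 'left_ts')]

# Applied to a real dataframe (not run here):
# for old, new in renames:
#     left = left.withColumnRenamed(old, new)
# joined = left.join(right, left["left_id"] == right["id"])
```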
Solution 3
If you want to change all column names, try df.toDF(*cols)
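A short sketch of how toDF is typically used (the lowercasing rule here is just an example transformation, not from the answer):

```python
# toDF takes the complete list of new names, applied positionally,
# so the list must have one entry per column of the dataframe.
cols = ["Name", "askdaosdka"]          # stand-in for df.columns
new_cols = [c.lower() for c in cols]   # example rule: lowercase every name
print(new_cols)  # -> ['name', 'askdaosdka']

# Applied to a real dataframe (not run here):
# df = df.toDF(*new_cols)
```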
Solution 4
In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):
new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))
df = df.toDF(*new_column_name_list)
Thanks to @user8117731 for the toDF trick.
Solution 5
df.withColumnRenamed('age', 'age2')
Updated on July 08, 2022

Comments
-
Shubhanshu Mishra almost 2 years
I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:
df.columns = new_column_name_list
However, the same doesn't work in pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This basically defines the variable twice: it infers the schema first, then renames the columns, and then loads the dataframe again with the updated schema.
Is there a better and more efficient way to do this, as we do in pandas?
My Spark version is 1.5.0
-
Felipe Gerard over 7 years I did it with a for loop + withColumnRenamed, but your reduce option is very nice :)
-
Alberto Bonsanto over 7 years @FelipeGerard that is a very bad idea
-
Felipe Gerard over 7 years Well, since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same!
-
Alberto Bonsanto over 7 years @FelipeGerard Please check this post, bad things may happen if you have many columns.
-
Felipe Gerard over 7 years That happens when you use withColumn many times. I used withColumnRenamed just like you, but iterated with a loop instead of a reduce. I'll bet you the DF is exactly the same if you do for idx in range(n): data = data.withColumnRenamed(oldColumns[idx], newColumns[idx]) rather than what you posted :)
-
Alberto Bonsanto over 7 years @FelipeGerard I was just saying
-
Felipe Gerard over 7 years Let us continue this discussion in chat.
-
Admin about 7 years @AlbertoBonsanto How to select column as alias if there are more than 100 columns? Which is the best option?
-
Quetzalcoatl over 6 years Is there a variant of this solution that leaves all other columns unchanged? With this method, and others, only the explicitly named columns remained (all others were removed).
-
Isma about 6 years While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
-
Quetzalcoatl about 6 years This solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution.
-
NuValue almost 6 years Hi @Alberto Bonsanto, thanks for your answer. Using your withColumnRenamed() answer I receive an error: "NameError: name 'reduce' is not defined". Sure I am missing something trivial.
-
Andrew Myers almost 6 years Pankaj Kumar's answer and Alberto Bonsanto's answer (which are from 2016 and 2015, respectively) already suggest using withColumnRenamed.
-
mnis.p almost 6 years +1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed.
-
joaofbsm over 5 years @NuValue, you should first run from functools import reduce
-
user989762 over 5 years @Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names.
-
Quetzalcoatl over 5 years @user989762: agreed; my initial understanding was incorrect on this one...!
-
Sahan Jayasumana over 5 years Thanks, yes, but there are a couple of different syntaxes; maybe we should collect them into a more formal answer? data.withColumnRenamed(oldColumns[idx], newColumns[idx]) vs data.withColumnRenamed(columnname, new columnname). I think it depends on which version of pyspark you're using.
-
Ed Bordin over 5 years This is not a different syntax. The only difference is you did not store your column names in an array.
-
Simon30 almost 5 years Hi @AlbertoBonsanto, thank you, great answer. I have a 3K+ rows dataframe and my code is inspired from your reduce solution, however it seems to run forever, do you know which one I should use for my use case? (nb: I also have 2K+ columns)
-
Simon30 almost 5 years I found the solution: use df.toDF(*new_cols) instead of the weird reduce
-
Tom N Tech almost 5 years In PySpark 2.4 with Python 3.6.8 the only methods that work out of these are df.select('id').withColumnRenamed('id', 'new_id') and spark.sql("SELECT id AS new_id FROM df")
-
sogu over 4 years On Python 3.7 I recommend Option 3; it works flawlessly (Options 1 and 2 have some issues, e.g. library locations moved: stackoverflow.com/questions/8689184/…)
-
Darth Egregious over 4 years Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because old_columns would be the same as df.columns.
-
KGS over 4 years I think I'll use this df.selectExpr(*["{0} as {0}_other".format(x) for x in df.columns]) to rename all the columns at once (I'll be doing a crossJoin with itself)
-
HanaKaze about 4 years I think this should be selected as the best answer
-
Pete about 4 years Option 2: use range() instead of xrange() for Python 3.*
-
Nic Scozzaro almost 4 years For me I was getting the header names from a pandas dataframe, so I just used df = df.toDF(*my_pandas_df.columns)
-
rbatt almost 4 years This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having cols be the new column names, and just assuming the order of names in cols corresponds to the column order of the dataframe?
-
Powers almost 4 years This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1
-
Powers almost 4 years This is great for renaming a few columns. See my answer for a solution that can programmatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run withColumnRenamed (running withColumnRenamed that many times would also be inefficient, as explained here).
-
Frank over 3 years Option 2 works for me. IMO Option 2 is the best if you want to rename all your columns.
-
John Haberstroh over 3 years I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the toDF function and a list comprehension that implements whatever logic is desired was much more succinct. For example: def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF([c + suffix for c in spark_df.columns])
-
Haha TTpro over 3 years Why use withColumn to create another duplicate column with a different name when you can use withColumnRenamed?
-
astentx over 3 years There were a lot of similar answers, so there was no need to post another duplicate.
-
Sheldore about 3 years The first argument in withColumnRenamed is the old column name. Your Method 1 is wrong.
-
Sheldore about 3 years @Powers: I checked your article but it contains a command to cast all the string types to integers. Could you please write here how one needs to use your Medium article's code to rename, say, only two columns as written in the above answer?
-
Sheldore about 3 years Since mapping is a dictionary, why can't you simply use mapping[c] instead of mapping.get(c, c)?
-
poiter almost 3 years This is clearly the best answer if you want to pass in a list of columns
-
Krunal Patel almost 3 years @rbatt Using df.select in combination with the pyspark.sql.functions col method is a reliable way to do this since it maintains the mapping/alias applied, and thus the order/schema is maintained after the rename operations. Check out the comment for a code snippet: stackoverflow.com/a/62728542/8551891
-
Krunal Patel almost 3 years Option 3 is the right way to do it in Spark (Scala/Python). Check this out: stackoverflow.com/a/62728542/8551891