How to change dataframe column names in pyspark?

Solution 1

There are many ways to do that:

  • Option 1. Using selectExpr.

     data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                       ["Name", "askdaosdka"])
     data.show()
     data.printSchema()
    
     # Output
     #+-------+----------+
     #|   Name|askdaosdka|
     #+-------+----------+
     #|Alberto|         2|
     #| Dakota|         2|
     #+-------+----------+
    
     #root
     # |-- Name: string (nullable = true)
     # |-- askdaosdka: long (nullable = true)
    
     df = data.selectExpr("Name as name", "askdaosdka as age")
     df.show()
     df.printSchema()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
     #root
     # |-- name: string (nullable = true)
     # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; note that this method allows you to "overwrite" the same column. (On Python 2 you can use xrange instead of range.)

     from functools import reduce
    
     oldColumns = data.schema.names
     newColumns = ["name", "age"]
    
     df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                 range(len(oldColumns)), data)
     df.printSchema()
     df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

     from pyspark.sql.functions import col
    
     data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
     data.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

     sqlContext.registerDataFrameAsTable(data, "myTable")
     df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
     df2.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    

Solution 2

df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

The advantage of this approach: with a long list of columns, you can rename just a few of them and leave the rest untouched. It is especially handy when joining tables that have duplicate column names, as sketched below.
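
For example (the DataFrames and the join key below are made up purely for illustration), renaming a clashing column on one side before the join keeps the result unambiguous:

df1 = sqlContext.createDataFrame([(1, "Alberto"), (2, "Dakota")], ["id", "name"])
df2 = sqlContext.createDataFrame([(1, "Bogota"), (2, "Fargo")], ["id", "name"])

# Rename the clashing "name" column on the right-hand side before joining,
# so the joined DataFrame does not end up with two columns called "name"
joined = df1.join(df2.withColumnRenamed("name", "city"), on="id")

joined.columns
# ['id', 'name', 'city']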

Solution 3

If you want to change all column names, try df.toDF(*cols), as sketched below.
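
A minimal sketch, reusing the example data from Solution 1; toDF assigns the new names positionally, so pass exactly one name per existing column, in order:

data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])

# One new name per existing column, in the original column order
df = data.toDF("name", "age")
df.printSchema()

# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)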

Solution 4

If you would like to apply a simple transformation to all column names, this code does the trick (here, replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

Solution 5

df = df.withColumnRenamed('age', 'age2')
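
This renames only the named column and leaves every other column untouched. Since DataFrames are immutable, withColumnRenamed returns a new DataFrame, so assign or chain the result. A small sketch, reusing the DataFrame from Solution 1:

df = data.withColumnRenamed("askdaosdka", "age")
df.printSchema()

# root
#  |-- Name: string (nullable = true)
#  |-- age: long (nullable = true)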

Updated on July 08, 2022

Comments

  • Shubhanshu Mishra
    Shubhanshu Mishra almost 2 years

    I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:

    df.columns = new_column_name_list
    

    However, the same doesn't work in pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

    df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
    oldSchema = df.schema
    for i,k in enumerate(oldSchema.fields):
      k.name = new_column_name_list[i]
    df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
    

    This basically defines the variable twice: it infers the schema first, renames the columns in that schema, and then loads the dataframe again with the updated schema.

    Is there a better and more efficient way to do this, like we do in pandas?

    My Spark version is 1.5.0.

  • Felipe Gerard
    Felipe Gerard over 7 years
    I did it with a for loop + withColumnRenamed, but your reduce option is very nice :)
  • Alberto Bonsanto
    Alberto Bonsanto over 7 years
    @FelipeGerard that is a very bad idea
  • Felipe Gerard
    Felipe Gerard over 7 years
    Well since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same!
  • Alberto Bonsanto
    Alberto Bonsanto over 7 years
    @FelipeGerard Please check this post, bad things may happen if you have many columns.
  • Felipe Gerard
    Felipe Gerard over 7 years
    That happens when you use withColumn many times. I used withColumnRenamed just like you, but iterated with a loop instead of a reduce. I'll bet you the DF is the exact same one if you do for idx in range(n): data = data.withColumnRenamed(oldColumns[idx], newColumns[idx]) as what you posted :)
  • Alberto Bonsanto
    Alberto Bonsanto over 7 years
    @FelipeGerard I was just saying
  • Felipe Gerard
    Felipe Gerard over 7 years
  • Admin
    Admin about 7 years
    @AlbertoBonsanto How to select column as alias if there are more than 100 columns which is the best option
  • Quetzalcoatl
    Quetzalcoatl over 6 years
    Is there a variant of this solution that leaves all other columns unchanged? With this method, and others, only the explicitly named columns remained (all others were removed).
  • Isma
    Isma about 6 years
    While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
  • Quetzalcoatl
    Quetzalcoatl about 6 years
    this solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution.
  • NuValue
    NuValue almost 6 years
    Hi @Alberto Bonsanto, thanks for your answer. Using your withColumnRenamed() answer I receive an error: "NameError: name 'reduce' is not defined". Surely I am missing something trivial.
  • Andrew Myers
    Andrew Myers almost 6 years
    Pankaj Kumar's answer and Alberto Bonsanto's answer (which are from 2016 and 2015, respectively) already suggest using withColumnRenamed.
  • mnis.p
    mnis.p almost 6 years
    +1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed.
  • joaofbsm
    joaofbsm over 5 years
    @NuValue, you should first run from functools import reduce
  • user989762
    user989762 over 5 years
    @Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names
  • Quetzalcoatl
    Quetzalcoatl over 5 years
    @user989762: agreed; my initial understanding was incorrect on this one...!
  • Sahan Jayasumana
    Sahan Jayasumana over 5 years
    Thanks, yes, but there are a couple of different syntaxes; maybe we should collect them into a more formal answer? data.withColumnRenamed(oldColumns[idx], newColumns[idx]) vs data.withColumnRenamed(columnname, new columnname). I think it depends on which version of pyspark you're using.
  • Ed Bordin
    Ed Bordin over 5 years
    This is not a different syntax. The only difference is you did not store your column names in an array.
  • Simon30
    Simon30 almost 5 years
    Hi @AlbertoBonsanto thank you great answer. I have a 3K+ rows dataframe and my code is inspired from your reduce solution, however it seems to run forever, do you know which one I should use for my use case? (nb: i also have 2K+ columns)
  • Simon30
    Simon30 almost 5 years
    I found the solution: use df.toDF(*new_cols) instead of weird reduce
  • Tom N Tech
    Tom N Tech almost 5 years
    In PySpark 2.4 with Python 3.6.8, the only methods that work out of these are df.select('id').withColumnRenamed('id', 'new_id') and spark.sql("SELECT id AS new_id FROM df").
  • sogu
    sogu over 4 years
    On Python 3.7 I recommend Option 3; it works flawlessly (Options 1 and 2 have some issues, e.g. library locations have moved: stackoverflow.com/questions/8689184/… )
  • Darth Egregious
    Darth Egregious over 4 years
    Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because old_columns would be the same as df.columns.
  • KGS
    KGS over 4 years
    I think I'll use this df.selectExpr(*["{0} as {0}_other".format(x) for x in df.columns]) to rename all the columns at once (I'll be doing a crossJoin with itself)
  • HanaKaze
    HanaKaze about 4 years
    I think this should be selected as the best answer
  • Pete
    Pete about 4 years
    Option 2: use range() instead of xrange() for Python 3.*
  • Nic Scozzaro
    Nic Scozzaro almost 4 years
    For me I was getting the header names from a pandas dataframe, so I just used df = df.toDF(*my_pandas_df.columns)
  • rbatt
    rbatt almost 4 years
    This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having cols be the new column names, and just assuming that the order of names in cols corresponds to the column order of the dataframe?
  • Powers
    Powers almost 4 years
    This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1
  • Powers
    Powers almost 4 years
    This is great for renaming a few columns. See my answer for a solution that can programmatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run withColumnRenamed (running withColumnRenamed that many times would also be inefficient, as explained here).
  • Frank
    Frank over 3 years
    option 2 works for me. IMO option 2 is the best if you want to rename all your columns.
  • John Haberstroh
    John Haberstroh over 3 years
    I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the toDF function and a list comprehension that implements whatever logic is desired was much more succinct. For example, def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF(*[c + suffix for c in spark_df.columns])
  • Haha TTpro
    Haha TTpro over 3 years
    Why use withColumn to create another duplicate column with a different name when you can use withColumnRenamed?
  • astentx
    astentx over 3 years
    There were a lot of similar answers, so there was no need to post another duplicate.
  • Sheldore
    Sheldore about 3 years
    The first argument in withColumnRenamed is the old column name. Your Method 1 is wrong
  • Sheldore
    Sheldore about 3 years
    @Powers: I checked your article but it contains a command to cast all the string types to integers. Could you please write here how one needs to use your Medium article's code to rename, say, only two columns as written in the above answer?
  • Sheldore
    Sheldore about 3 years
    Since mapping is a dictionary, why can't you simply use mapping[c] instead of mapping.get(c, c)?
  • poiter
    poiter almost 3 years
    This is clearly the best answer if you want to pass in a list of columns
  • Krunal Patel
    Krunal Patel almost 3 years
    @rbatt Using df.select in combination with pyspark.sql.functions col-method is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Checkout the comment for code snippet: stackoverflow.com/a/62728542/8551891
  • Krunal Patel
    Krunal Patel almost 3 years
    Option 3 is the right way to do it in spark (scala/python). Check this out: stackoverflow.com/a/62728542/8551891