How to add suffix and prefix to all columns in python/pyspark dataframe
Solution 1
You can use the withColumnRenamed method of the DataFrame to create a new DataFrame with the renamed column:
df.withColumnRenamed('testing user', '`testing user`')
Edit: if you have a list of columns, you can do the following:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
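The same rename can also be done without the round trip through the RDD, since DataFrame.toDF accepts the new column names as positional arguments. A minimal sketch, assuming df is an existing PySpark DataFrame; the helper names wrap_in_backticks and rename_with_backticks are hypothetical and not part of PySpark:

```python
def wrap_in_backticks(columns):
    """Return the given column names wrapped in backticks, in the same order."""
    return ["`" + name + "`" for name in columns]


def rename_with_backticks(df):
    """Return a new DataFrame whose column names are wrapped in backticks.

    Assumes `df` is a PySpark DataFrame. toDF(*names) keeps the data and the
    column order and only replaces the names.
    """
    return df.toDF(*wrap_in_backticks(df.columns))


# The name logic can be checked on a plain list, without a Spark session.
print(wrap_in_backticks(["First", "Last", "Age"]))
```

With a real DataFrame, the call would be df_new = rename_with_backticks(df).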
Solution 2
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also lets you put custom Python logic inside the alias() call, for example: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
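The conditional alias can be pulled into a small helper so the naming rule is easy to check on its own. This is only a sketch: new_name and rename_selected are hypothetical names, and df is assumed to be a PySpark DataFrame.

```python
def new_name(column, columns_to_change, prefix="prefix_", suffix="_suffix"):
    """Return the renamed column if it is in columns_to_change, else the original name."""
    if column in columns_to_change:
        return prefix + column + suffix
    return column


def rename_selected(df, columns_to_change, prefix="prefix_", suffix="_suffix"):
    """Rename only the listed columns of a PySpark DataFrame in a single select."""
    # Imported here so new_name() can be tried without Spark installed.
    from pyspark.sql import functions as F

    targets = set(columns_to_change)
    return df.select(
        [F.col(c).alias(new_name(c, targets, prefix, suffix)) for c in df.columns]
    )


# Checking the naming rule on plain strings:
print([new_name(c, {"Age"}) for c in ["First", "Last", "Age"]])
```

Columns that are not listed keep their names, and the column order of df is preserved because the comprehension walks df.columns in order.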
Solution 3
To add prefix or suffix:
- Use df.columns to get the list of column names ([col_1, col_2...]) of the DataFrame whose columns you want to prefix or suffix.
df.columns
- Iterate through the list above and build another list of aliased columns that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
- When passing the list to select, unpack it with an asterisk (*); select also accepts the list itself, as in Solution 2. The result can be assigned back to the same DataFrame or to a different one.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) column names.
Solution 4
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
For example:
def add_prefix(sdf, prefix):
    # Rename every column in turn, carrying the renamed DataFrame forward.
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can replace sdf.columns with any subset of column names you want to rename.
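One way to restrict the loop to a subset is to pass the column names in as an optional argument. A minimal sketch under that assumption; add_prefix_to and its columns parameter are hypothetical and not part of the original answer:

```python
def add_prefix_to(sdf, prefix, columns=None):
    """Prefix the given columns of a PySpark DataFrame (all columns if None)."""
    # Take the names up front: sdf is rebound inside the loop.
    targets = list(sdf.columns) if columns is None else list(columns)
    for c in targets:
        # withColumnRenamed returns a new DataFrame and leaves other columns alone.
        sdf = sdf.withColumnRenamed(c, "{}{}".format(prefix, c))
    return sdf
```

For example, add_prefix_to(sdf, "x_", ["a", "b"]) would rename only a and b and leave the remaining columns untouched.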
Solution 5
I had a DataFrame that I duplicated and then joined together. Since both copies had the same column names, I used:
from functools import reduce
# Rename the column at each index, threading the DataFrame through reduce.
df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my DataFrame then had the '_prec' suffix, which allowed me to do sweet stuff.
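The same reduce can be written over the column names directly instead of over indices, which avoids rebuilding the name list on every step. A sketch only, assuming df is a PySpark DataFrame; add_suffix is a hypothetical helper name:

```python
from functools import reduce


def add_suffix(df, suffix):
    """Return a DataFrame with `suffix` appended to every column name."""
    return reduce(
        # acc is the DataFrame renamed so far; name is the next original name.
        lambda acc, name: acc.withColumnRenamed(name, name + suffix),
        df.columns,  # original names, read once before any rename
        df,          # initial value of the accumulator
    )
```

In the self-join case above, one copy could be passed through add_suffix(df, '_prec') before the join so that the two sides no longer share column names.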
Admin
Updated on July 09, 2022
Comments
-
Admin almost 2 years
I have a data frame in pyspark with more than 100 columns. For every column name, I would like to add backticks (`) at the start and at the end of the name.
For example:
The column name is testing user. I want `testing user`.
Is there a way to do this in pyspark/python? Applying the code should return a data frame.
-
Pushkr about 7 years
Updated my answer
-
Pushkr about 7 years
If you are just trying to export data from MySQL to Hive, you might as well use Sqoop; unless you are performing specialized processing on the data, you don't have to go through Spark.
-
Ralf almost 5 years
Could you explain in more detail how this answers the question?
-
Dwindwin almost 5 years
The question asked was how to add a suffix or a prefix to all the columns of a dataframe. Here I added a suffix, but you can do both by simply changing the second parameter of withColumnRenamed. In the person's case it would be "'" + list(df.schema.names)[idx] + "'"
-
Krunal Patel almost 3 years
Thanks for the step-by-step breakdown. Using df.select in combination with the col method from pyspark.sql.functions is a reliable way to do this, since it maintains the mapping/alias applied, and thus the order/schema is maintained after the rename operations. Below is the sample select_list content: [Column<b'XYZ AS prefix_XYZ'>, Column<b'ABC_ID AS prefix_ABC_ID'>]