Remove spaces from all column names in pyspark
Solution 1
I would use select
in conjunction with a list
comprehension:
from pyspark.sql import functions as F
renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
Solution 2
Two ways to remove the spaces from the column names: 1. Use a schema while importing the data to spark data frame: for example:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
Schema1 = StructType([StructField('field1', IntegerType(), True),
StructField('field2', StringType(), True),
StructField('field3', IntegerType(), True)])
df = spark.read.csv('/path/to/your/file.csv', header=True, schema=Schema1)
If you have already got the data imported into a dataframe, use dataframe.withColumnRenamed function to change the name of the column:
df=df.withColumnRenamed("field name","fieldName")
Solution 3
NewColumns=(column.replace(' ', 'any special character') for column in df.columns)
df = df.toDF(*NewColumns)
Admin
Updated on June 08, 2022Comments
-
Admin almost 2 years
I am new to pySpark. I have received a csv file which has around 1000 columns. I am using databricks. Most of these columns have spaces in between eg "Total Revenue" ,"Total Age" etc. I need to updates all the column names with space with underscore'_'.
I have tried this
foreach(DataColumn c in cloned.Columns) c.ColumnName = String.Join("_", c.ColumnName.Split());
but it didn't work in Pyspark on databricks.
-
Balki over 3 yearsInteresting.. when I did
df =df.withColumnRenamed("field name", "fieldname")
, it did not work for me on databricks. This was the reason I was looking to replace "field name" with "field_name". -
nam almost 2 yearsExcellent response (+1). If you want to replace only the leading an trailing spaces you can do:
renamed_df = df.select([F.col(col).alias(col.strip()) for col in df.columns])