Remove spaces from all column names in pyspark


Solution 1

I would use select in conjunction with a list comprehension:

from pyspark.sql import functions as F

renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
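A quick sanity check on a toy DataFrame (a minimal sketch; the sample data and column names below are just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame whose column names contain spaces (illustrative only)
df = spark.createDataFrame([(100, 25)], ['Total Revenue', 'Total Age'])

renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
print(renamed_df.columns)   # ['Total_Revenue', 'Total_Age']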

Solution 2

There are two ways to remove the spaces from the column names:

  1. Use a schema while importing the data into a Spark DataFrame, for example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Put the column names you want (without spaces) directly in the schema;
# with header=True the file's header row is skipped and these names are used instead.
Schema1 = StructType([StructField('field1', IntegerType(), True),
                      StructField('field2', StringType(), True),
                      StructField('field3', IntegerType(), True)])
df = spark.read.csv('/path/to/your/file.csv', header=True, schema=Schema1)
  2. If you have already imported the data into a DataFrame, use the withColumnRenamed function to rename a column (a sketch for renaming every column this way follows the snippet below):

    df = df.withColumnRenamed("field name", "fieldName")
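
With around 1000 columns (as in the question), calling withColumnRenamed by hand is impractical; one way is to fold it over df.columns (a minimal sketch, assuming df is the DataFrame read above):

from functools import reduce

# Rename every column, replacing spaces with underscores,
# by chaining withColumnRenamed calls over df.columns
df = reduce(
    lambda acc, name: acc.withColumnRenamed(name, name.replace(' ', '_')),
    df.columns,
    df,
)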

Solution 3

# Replace the space with an underscore (or any other character) in every column name
NewColumns = [column.replace(' ', '_') for column in df.columns]
df = df.toDF(*NewColumns)
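
If the names can also contain tabs or repeated spaces, the same idea works with a regular expression instead of str.replace (a variation not in the original answer):

import re

# Collapse any run of whitespace in each column name into a single underscore
NewColumns = [re.sub(r'\s+', '_', column) for column in df.columns]
df = df.toDF(*NewColumns)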

Comments

  • Admin, almost 2 years

    I am new to PySpark. I have received a CSV file with around 1000 columns, and I am using Databricks. Most of these column names contain spaces, e.g. "Total Revenue", "Total Age", etc. I need to update all column names that contain a space, replacing the space with an underscore '_'.

    I have tried this

    foreach(DataColumn c in cloned.Columns) c.ColumnName = String.Join("_", c.ColumnName.Split());

    but it didn't work in PySpark on Databricks.

  • Balki, over 3 years
    Interesting... when I did df = df.withColumnRenamed("field name", "fieldname"), it did not work for me on Databricks. This was the reason I was looking to replace "field name" with "field_name".
  • nam, almost 2 years
    Excellent response (+1). If you want to remove only the leading and trailing spaces, you can do: renamed_df = df.select([F.col(col).alias(col.strip()) for col in df.columns])