How to change case of whole pyspark dataframe to lower or upper


Solution 1

Both answers seem to be fine, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Split the schema into string and non-string fields
val fields = sourceDF.schema.fields
val stringFields = fields.filter(f => f.dataType == StringType)
val nonStringFields = fields.filter(f => f.dataType != StringType).map(f => col(f.name))

// Upper-case only the string columns, keeping their original names
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)

Now the types are correct even when you have non-string fields (e.g. numeric fields). If you know that every column is of string type, use one of the other answers - they are correct in that case :)

The equivalent code in Python (PySpark):

from pyspark.sql.functions import col, upper
from pyspark.sql.types import StringType

sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])

# Split the schema into string and non-string fields
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name),
                      filter(lambda f: not isinstance(f.dataType, StringType), fields))

# Upper-case only the string columns, keeping their original names
stringFieldsTransformed = map(lambda f: upper(col(f.name)).alias(f.name), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
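
For a quick sanity check, you can print the schema and the data (a minimal sketch, reusing the spark session from above; the commented output is what Spark should print for this two-column example):

# The numeric column 'n' keeps its long type, while the string
# column 'n1' is upper-cased in place.
df.printSchema()
# root
#  |-- n1: string (nullable = true)
#  |-- n: long (nullable = true)
df.show()
# +---+---+
# | n1|  n|
# +---+---+
# |  A|  1|
# +---+---+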

Solution 2

Assuming df is your dataframe, this should do the trick:

from pyspark.sql import functions as F

# Lower-case every column (note: this casts non-string columns to string)
for colname in df.columns:
    df = df.withColumn(colname, F.lower(F.col(colname)))
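
If, as noted in Solution 1, you want to avoid casting numeric columns to strings, a type-aware variant of the same loop (a sketch, using the same df) only touches the string columns:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Only lower-case the string columns; numeric/date columns keep their types
for field in df.schema.fields:
    if isinstance(field.dataType, StringType):
        df = df.withColumn(field.name, F.lower(F.col(field.name)))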

Solution 3

You can generate an expression using a list comprehension:

from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]

Then just apply it to your existing dataframe:

>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+

>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
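
Since the original goal was case-insensitive row hashing, you can feed the normalized columns into pyspark.sql.functions.hash, which computes one hash per row (a sketch under that assumption; the row_hash column name is illustrative):

from pyspark.sql import functions as psf

# Normalize case first, then hash all columns of each row into one value
normalized = df.select([psf.lower(psf.col(c)).alias(c) for c in df.columns])
hashed = normalized.withColumn("row_hash", psf.hash(*normalized.columns))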
Author: Jack

Updated on July 23, 2022

Comments

  • Jack
    Jack over 1 year

    I am trying to apply the PySpark SQL hash function to every row of two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are considered two different values, so I want to change the case of both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers, not for the dataframe values. Please help.

    # Code for dataframe column headers
    self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
    
  • Steven
    Steven about 6 years
    According to the OP, he wants to create a hash from string columns ... therefore they're all supposed to be StringType; no need to check the type.
  • T. Gawęda
    T. Gawęda about 6 years
    @Steven We can assume that in this case. It's just an additional answer for anyone with a similar problem whose DataFrame also has numeric columns ;)
  • Jack
    Jack about 6 years
    My dataframes contain all kinds of datatypes (string, numeric, date & many more). I am going with hash-based matching since table key information is not available. Please share it in Python if possible. Thanks a ton
  • T. Gawęda
    T. Gawęda about 6 years
    @Jack Sorry, I was busy at work. I will try to convert it to Python today or tomorrow afternoon :)
  • Jack
    Jack about 6 years
    Thanks a ton, T. Gawęda, for helping me resolve this problem
  • Jack
    Jack about 6 years
    I am getting an error on this line: stringFieldsTransformed = stringFields.map(lambda f: col(f.name)).map(lambda f: lower(col(f)).alias(f))
  • Jack
    Jack about 6 years
    The error is: AttributeError: 'filter' object has no attribute 'map'
  • T. Gawęda
    T. Gawęda about 6 years
    @Jack Please try again with the code at the bottom of the answer
  • Jack
    Jack about 6 years
    Thanks a lot for helping me out despite your busy schedule. There are no changes in the dataframe after df = sourceDF.select(allFields), since we are passing the columns back to the same dataframe. Please correct me if I am wrong.
  • T. Gawęda
    T. Gawęda about 6 years
    @Jack You are right, I've changed it. I have checked it now :)
  • Koppula
    Koppula about 3 years
    The psf approach works for me on small data sets; is it safe to use on a table with 40 million records? My client is now asking to convert all the data to upper case.