How to change case of whole pyspark dataframe to lower or upper


Solution 1

Both answers seem to be fine, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Split the schema into string and non-string fields
val fields = sourceDF.schema.fields
val stringFields = fields.filter(f => f.dataType == StringType)
val nonStringFields = fields.filter(f => f.dataType != StringType).map(f => col(f.name))

// Upper-case only the string columns, keeping their original names
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)

Now the types are correct even when you have non-string fields (e.g. numeric fields). If you know that every column is of string type, use one of the other answers - they are correct in that case :)

The equivalent code in Python (PySpark):

from pyspark.sql.functions import col, upper
from pyspark.sql.types import StringType

sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])

# Split the schema into string and non-string fields
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name),
                      filter(lambda f: not isinstance(f.dataType, StringType), fields))

# Upper-case only the string columns, keeping their original names
stringFieldsTransformed = map(lambda f: upper(col(f.name)).alias(f.name), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
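
For a quick sanity check, you can print the schema and the data (a minimal sketch, reusing the spark session from above; the commented output is what Spark should print for this two-column example):

# The numeric column 'n' keeps its long type, while the string
# column 'n1' is upper-cased in place.
df.printSchema()
# root
#  |-- n1: string (nullable = true)
#  |-- n: long (nullable = true)
df.show()
# +---+---+
# | n1|  n|
# +---+---+
# |  A|  1|
# +---+---+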

Solution 2

Assuming df is your dataframe, this should do the trick:

from pyspark.sql import functions as F

# Lower-case every column (note: this casts non-string columns to string)
for colname in df.columns:
    df = df.withColumn(colname, F.lower(F.col(colname)))
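
If, as noted in Solution 1, you want to avoid casting numeric columns to strings, a type-aware variant of the same loop (a sketch, using the same df) only touches the string columns:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Only lower-case the string columns; numeric/date columns keep their types
for field in df.schema.fields:
    if isinstance(field.dataType, StringType):
        df = df.withColumn(field.name, F.lower(F.col(field.name)))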

Solution 3

You can generate an expression using a list comprehension:

from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]

Then just apply it to your existing dataframe:

>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+

>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
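
Since the original goal was case-insensitive row hashing, you can feed the normalized columns into pyspark.sql.functions.hash, which computes one hash per row (a sketch under that assumption; the row_hash column name is illustrative):

from pyspark.sql import functions as psf

# Normalize case first, then hash all columns of each row into one value
normalized = df.select([psf.lower(psf.col(c)).alias(c) for c in df.columns])
hashed = normalized.withColumn("row_hash", psf.hash(*normalized.columns))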
Author: Jack

Updated on July 23, 2022

Comments

  • Jack
    Jack over 1 year

    I am trying to apply the PySpark SQL hash function to every row of two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are considered two different values, so I want to change the case of both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers, not for the dataframe values. Please help.

    # Code for dataframe column headers
    self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
    
  • Steven
    Steven about 6 years
    According to the OP, he wants to create a hash from string columns ... therefore they're all supposed to be StringType; no need to check the type.
  • T. Gawęda
    T. Gawęda about 6 years
    @Steven We can assume that in this case. It's just an additional answer for anyone with a similar problem whose DataFrame also has numeric columns ;)
  • Jack
    Jack about 6 years
    My dataframes contain all kinds of datatypes (string, numeric, date & many more). I am going with hash-based matching since table key information is not available. Please share it in Python if possible. Thanks a ton
  • T. Gawęda
    T. Gawęda about 6 years
    @Jack Sorry, I was busy at work. I will try to convert it to Python today or tomorrow afternoon :)
  • Jack
    Jack about 6 years
    Thanks a ton, T. Gawęda, for helping me resolve this problem
  • Jack
    Jack about 6 years
    I am getting an error on this line: stringFieldsTransformed = stringFields.map(lambda f: col(f.name)).map(lambda f: lower(col(f)).alias(f))
  • Jack
    Jack about 6 years
    The error is: AttributeError: 'filter' object has no attribute 'map'
  • T. Gawęda
    T. Gawęda about 6 years
    @Jack Please try again with the code at the bottom of the answer
  • Jack
    Jack about 6 years
    Thanks a lot for helping me out despite your busy schedule. There are no changes in the dataframe after df = sourceDF.select(allFields), since we are passing the columns back to the same dataframe. Please correct me if I am wrong.
  • T. Gawęda
    T. Gawęda about 6 years
    @Jack You are right, I've changed it. I have checked it now :)
  • Koppula
    Koppula about 3 years
    The psf approach works for me on small data sets; is it safe to use on a table with 40 million records? My client is now asking to convert all the data to upper case.