Replace string in PySpark
You need to escape the dot to match it literally, as "." is a special regex character that matches almost any character:
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
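The same escaping rule can be checked locally with Python's re module (a sketch for illustration; Spark's regexp_replace uses Java regex, but escaping "." behaves the same way in both):

```python
import re

value = "-1.269,75"

# Unescaped "." matches every character, so the whole string is removed
print(re.sub(r".", "", value))   # -> ""

# Escaped "\." matches only the literal dot
no_dots = re.sub(r"\.", "", value)
print(no_dots)                   # -> "-1269,75"

# Then swap the decimal comma for a dot and convert to a number
print(float(no_dots.replace(",", ".")))  # -> -1269.75
```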
Author: cph_sto (updated on June 08, 2022)

Comments
- cph_sto, almost 2 years ago:
I have a dataframe with numbers in European format, which I imported as strings: comma as the decimal separator and dot as the thousands separator.

from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import FloatType

df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+

df.printSchema()
root
 |-- revenue: string (nullable = true)
Desired output:

df.show()
+--------+
| revenue|
+--------+
|-1269.75|
+--------+

df.printSchema()
root
 |-- revenue: float (nullable = true)
I am using the function regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast to float:

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))
But when I attempt the replacement below, I get an empty string. Why? I was expecting -1269,75.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df.show()
+-------+
|revenue|
+-------+
|       |
+-------+