Replace string in PySpark


You need to escape the . to match it literally, because . is a regex metacharacter that matches any character (except a newline):

df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
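Putting it together with the comma-to-dot replacement and the float cast from the question, a minimal end-to-end sketch (assuming an existing SparkSession):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])

# 1. Drop the thousands separator: "\\." matches only the literal dot.
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
# 2. Turn the decimal comma into a dot.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
# 3. Cast the cleaned string to float.
df = df.withColumn('revenue', df['revenue'].cast("float"))

df.show()
# +--------+
# | revenue|
# +--------+
# |-1269.75|
# +--------+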
Author: cph_sto

Updated on June 08, 2022

Comments

  • cph_sto almost 2 years

    I have a dataframe with numbers in European format, which I imported as a String: a comma as the decimal separator and a dot as the thousands separator -

    from pyspark.sql.functions import regexp_replace,col
    from pyspark.sql.types import FloatType
    df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
    df.show()
    +---------+
    |  revenue|
    +---------+
    |-1.269,75|
    +---------+
    df.printSchema()
    root
     |-- revenue: string (nullable = true)
    

    Desired output of df.show():

    +---------+
    |  revenue|
    +---------+
    | -1269.75|
    +---------+
    df.printSchema()
    root
     |-- revenue: float (nullable = true)
    

    I am using the function regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast to FloatType.

    df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
    df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
    df = df.withColumn('revenue', df['revenue'].cast("float"))
    

    But when I attempt only the first replacement, shown below, I get an empty string. Why? I was expecting -1269,75 (a short sketch after the output contrasts the two patterns).

    df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
    +-------+
    |revenue|
    +-------+
    |       |
    +-------+
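    For illustration, and tying back to the answer above: the unescaped . matches every character, so all nine characters of -1.269,75 are replaced. A minimal sketch contrasting the unescaped and escaped patterns on this df (before any other replacements):

    from pyspark.sql.functions import regexp_replace, col

    # Unescaped ".": every character matches, so the whole value is removed.
    df.withColumn('revenue', regexp_replace(col('revenue'), ".", "")).show()
    # +-------+
    # |revenue|
    # +-------+
    # |       |
    # +-------+

    # Escaped "\\.": only the literal dot is removed.
    df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", "")).show()
    # +--------+
    # | revenue|
    # +--------+
    # |-1269,75|
    # +--------+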