How to use regexp_replace to remove special characters from a column in a PySpark dataframe


Solution 1

df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
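A minimal, runnable version of this answer (the toy DataFrame and SparkSession setup are assumed for illustration; note that Spark's regexp_replace takes a Java regex as a bare string, with no /.../ delimiters):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("9%",), ("$5",)], ["batch"])

# Replace every run of non-digit characters with an empty string.
df = df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
df.show()
# +-----+
# |batch|
# +-----+
# |    9|
# |    5|
# +-----+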

Solution 2

You can use this regex:

\W+

\W matches any non-word character (equivalent to [^a-zA-Z0-9_]).
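Applied to the question's column, that could look like the sketch below. One caveat: \W leaves letters, digits, and underscores in place, so '$5' becomes '5', but a value like 'a$5' would become 'a5' rather than '5'.

from pyspark.sql.functions import col, regexp_replace

# Remove runs of non-word characters such as % and $.
df = df.withColumn("batch", regexp_replace(col("batch"), r"\W+", ""))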

Solution 3

What have you tried so far?

SELECT regexp_replace("'$5','9%'", "[^0-9A-Za-z]", "")
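From PySpark, the same expression can be tried via spark.sql (a sketch assuming an existing SparkSession named spark; the double quotes delimit a SQL string literal, so the input here is the single string '$5','9%'):

spark.sql('''SELECT regexp_replace("'$5','9%'", "[^0-9A-Za-z]", "") AS cleaned''').show()
# +-------+
# |cleaned|
# +-------+
# |     59|
# +-------+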
Author: Chaitanya Kirty

Updated on June 04, 2022

Comments

  • Chaitanya Kirty almost 2 years

    There is a column batch in the dataframe. It has values like '9%', '$5', etc.

    I need to use regexp_replace in a way that removes the special characters from values like these and keeps just the numeric part.

    For example, 9% and $5 should become 9 and 5 respectively, in the same column.

  • undefined_variable over 6 years
    Creating a column batch with only numbers as the value... everything apart from numbers will be replaced with a blank.
  • Chaitanya Kirty over 6 years
    Sorry for the delay in responding. I tried your code and it worked fine. Thanks a lot.
  • dileepvarma over 3 years
    What if we want to do it for all columns in a single command?
  • undefined_variable over 3 years
    @dileepvarma regexp_replace can be used in select... df.select(regexp_replace(col("col1"), "[^0-9]+", ""), regexp_replace(col("col2"), "[^0-9]+", ""))
  • dileepvarma over 3 years
    @undefined_variable I have hundreds of columns, so it's tough to add this statement for each column. Is there any recursive way? Thanks.
  • undefined_variable over 3 years
    @dileepvarma use df.columns.map to build the expressions and use them in your select.
  • undefined_variable over 3 years
    val cols = df.columns.map(x => regexp_replace(col(x), "[^0-9]+", ""))... df.select(cols:_*)
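For PySpark users, a Python equivalent of that Scala one-liner could look like this (a sketch assuming every column is a string to be reduced to its digits):

from pyspark.sql.functions import col, regexp_replace

# Build one cleaned expression per column, keeping the original names.
cleaned = df.select(*[regexp_replace(col(c), "[^0-9]+", "").alias(c) for c in df.columns])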