How to use regexp_replace to remove special characters from a column in a PySpark dataframe


Solution 1

df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
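A minimal, runnable version of this answer (the toy DataFrame and SparkSession setup are assumed for illustration; note that Spark's regexp_replace takes a Java regex as a bare string, with no /.../ delimiters):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("9%",), ("$5",)], ["batch"])

# Replace every run of non-digit characters with an empty string.
df = df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
df.show()
# +-----+
# |batch|
# +-----+
# |    9|
# |    5|
# +-----+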

Solution 2

You can use this regex:

\W+

\W matches any non-word character (equivalent to [^a-zA-Z0-9_]).
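Applied to the question's column, that could look like the sketch below. One caveat: \W leaves letters, digits, and underscores in place, so '$5' becomes '5', but a value like 'a$5' would become 'a5' rather than '5'.

from pyspark.sql.functions import col, regexp_replace

# Remove runs of non-word characters such as % and $.
df = df.withColumn("batch", regexp_replace(col("batch"), r"\W+", ""))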

Solution 3

What have you tried so far?

SELECT regexp_replace("'$5','9%'", "[^0-9A-Za-z]", "")
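From PySpark, the same expression can be tried via spark.sql (a sketch assuming an existing SparkSession named spark; the double quotes delimit a SQL string literal, so the input here is the single string '$5','9%'):

spark.sql('''SELECT regexp_replace("'$5','9%'", "[^0-9A-Za-z]", "") AS cleaned''').show()
# +-------+
# |cleaned|
# +-------+
# |     59|
# +-------+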
Author: Chaitanya Kirty

Updated on June 04, 2022

Comments

  • Chaitanya Kirty almost 2 years

    There is a column batch in the dataframe. It has values like '9%', '$5', etc.

    I need to use regexp_replace in a way that removes the special characters from values like these and keeps just the numeric part.

    For example, 9% and $5 should become 9 and 5 respectively, in the same column.

  • undefined_variable over 6 years
    Creating a column batch with only numbers as the value... everything apart from numbers will be replaced with a blank.
  • Chaitanya Kirty over 6 years
    Sorry for the delay in responding. I tried your code and it worked fine. Thanks a lot.
  • dileepvarma over 3 years
    What if we want to do it for all columns in a single command?
  • undefined_variable over 3 years
    @dileepvarma regexp_replace can be used in select... df.select(regexp_replace(col("col1"), "[^0-9]+", ""), regexp_replace(col("col2"), "[^0-9]+", ""))
  • dileepvarma over 3 years
    @undefined_variable I have hundreds of columns, so it's tough to add this statement for each column. Is there any recursive way? Thanks.
  • undefined_variable over 3 years
    @dileepvarma use df.columns.map to build the expressions and use them in your select.
  • undefined_variable over 3 years
    val cols = df.columns.map(x => regexp_replace(col(x), "[^0-9]+", ""))... df.select(cols:_*)
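For PySpark users, a Python equivalent of that Scala one-liner could look like this (a sketch assuming every column is a string to be reduced to its digits):

from pyspark.sql.functions import col, regexp_replace

# Build one cleaned expression per column, keeping the original names.
cleaned = df.select(*[regexp_replace(col(c), "[^0-9]+", "").alias(c) for c in df.columns])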