How to use regex_replace to remove special characters from a column in a pyspark dataframe
Solution 1
from pyspark.sql.functions import regexp_replace, col

# Replace every run of non-digit characters with an empty string
df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
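A minimal runnable sketch of Solution 1 on the sample values from the question ('9%' and '$5'); the DataFrame built here is illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the question: a "batch" column holding '9%' and '$5'
df = spark.createDataFrame([("9%",), ("$5",)], ["batch"])

# Strip everything that is not a digit, keeping only the numeric part
df = df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
df.show()
# rows come back as '9' and '5'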
Solution 2
You can use this regex:
\W+
\W matches any non-word character (equal to [^a-zA-Z0-9_]).
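Continuing the sketch from Solution 1, the same call with \W+ instead. Note that \W keeps letters and underscores as well as digits, so for '9%' and '$5' it also leaves just 9 and 5, but it behaves differently from [^0-9]+ on alphanumeric input.

from pyspark.sql.functions import regexp_replace, col

# \W+ strips runs of non-word characters; digits, letters and '_' survive,
# so '9%' -> '9' and '$5' -> '5' here as well
df = df.withColumn("batch", regexp_replace(col("batch"), r"\W+", ""))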
Solution 3
What have you tried so far?
select regexp_replace("'$5','9%'","[^0-9A-Za-z]","")
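The same pattern can also be exercised through Spark SQL. A sketch applying it to the batch column rather than to a literal, assuming the DataFrame from the Solution 1 sketch; batch_table is a hypothetical view name.

# "batch_table" is an illustrative temp-view name for the earlier DataFrame
df.createOrReplaceTempView("batch_table")
spark.sql(
    "SELECT regexp_replace(batch, '[^0-9A-Za-z]', '') AS batch FROM batch_table"
).show()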
Author: Chaitanya Kirty
Updated on June 04, 2022

Comments
-
Chaitanya Kirty almost 2 years There is a column batch in the dataframe. It has values like '9%', '$5', etc. I need to use regex_replace in a way that it removes the special characters from the above example and keeps just the numeric part, e.g. 9 and 5 replacing '9%' and '$5' respectively in the same column.
-
undefined_variable over 6 years Creating a column batch with only numbers as values... Everything apart from numbers will be replaced with a blank.
-
Chaitanya Kirty over 6 years Sorry for the delay in response. I tried your code and it worked fine. Thanks a lot.
-
dileepvarma over 3 years What if we want to do it for all columns in a single command, please?
-
undefined_variable over 3 years @dileepvarma regexp_replace can be used in select: df.select(regexp_replace(col("col1"), "[^0-9]+", ""), regexp_replace(col("col2"), "[^0-9]+", ""))
-
dileepvarma over 3 years @undefined_variable I have hundreds of columns, so it's tough for me to add this statement for each column. Is there any recursive way? Thanks.
-
undefined_variable over 3 years @dileepvarma Use df.columns.map to create the expressions and use them in your select.
-
undefined_variable over 3 years
val cols = df.columns.map(x => regexp_replace(col(x), "[^0-9]+", ""))
df.select(cols:_*)
(a PySpark equivalent is sketched below)
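A PySpark counterpart of the Scala snippet above, sketched under the assumption that every column is string-typed (cast first if not):

from pyspark.sql.functions import regexp_replace, col

# Build one cleaned expression per column, then select them all in one pass
cleaned = [regexp_replace(col(c), "[^0-9]+", "").alias(c) for c in df.columns]
df = df.select(*cleaned)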