Remove blank space from data frame column values in Spark
Solution 1
While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:
from pyspark.sql.functions import regexp_replace, col
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])
df.select(regexp_replace(col("v"), " ", ""))
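Note that select with regexp_replace produces an auto-generated column name unless you add an alias, e.g. regexp_replace(col("v"), " ", "").alias("v"). The pattern's effect can be sanity-checked locally with Python's re module, whose behavior agrees with Java's regex for a literal space (a local sketch, not Spark code):

```python
import re

# Mirror regexp_replace(col("v"), " ", "") on the sample values
values = ["foo bar", "foobar ", " "]
print([re.sub(" ", "", v) for v in values])  # ['foobar', 'foobar', '']
```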
If you want to normalize empty lines, use trim:
from pyspark.sql.functions import trim
df.select(trim(col("v")))
If you want to keep leading/trailing spaces and only blank out whitespace-only values, you can adjust regexp_replace (note the raw string, so \s is not treated as an escape sequence):

df.select(regexp_replace(col("v"), r"^\s+$", ""))
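Because the pattern is anchored with ^ and $, it only blanks out values that consist entirely of whitespace; all other values keep their leading and trailing spaces. Locally (with Python's re module, which agrees with Java regex for this pattern) that behavior looks like:

```python
import re

values = ["foo bar", "foobar ", "   "]
# Only the all-whitespace value matches the anchored pattern
print([re.sub(r"^\s+$", "", v) for v in values])  # ['foo bar', 'foobar ', '']
```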
Solution 2
Here's a function that removes all whitespace in a string:
import pyspark.sql.functions as F
def remove_all_whitespace(col):
    return F.regexp_replace(col, "\\s+", "")
You can use the function like this:
actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)
The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim methods to manage whitespace. PySpark defines ltrim, rtrim, and trim methods to manage whitespace.
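The behavior of ltrim, rtrim, and trim on a string with surrounding spaces roughly corresponds to Python's lstrip, rstrip, and strip restricted to the space character (a local analogy for illustration, not Spark code):

```python
s = "  foo bar  "

# ltrim-like: strip leading spaces only
print(s.lstrip(" "))  # "foo bar  "
# rtrim-like: strip trailing spaces only
print(s.rstrip(" "))  # "  foo bar"
# trim-like: strip both ends
print(s.strip(" "))   # "foo bar"
```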
Solution 3
As @zero323 said, you probably shadowed the replace function somewhere. I tested your code and it works perfectly.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()
#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
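One caveat: the lambda calls s.replace directly, so a null in the column will raise an AttributeError inside the UDF. A null-safe variant of the same row function (plain Python, so it can be checked without a Spark session; the name delete_spaces is illustrative):

```python
def delete_spaces(s):
    # Pass nulls through instead of raising AttributeError
    return s.replace(" ", "") if s is not None else None

# Wire it up exactly like spaceDeleteUDF above:
# spaceDeleteUDF = udf(delete_spaces, StringType())
print(delete_spaces("aaa 111"))  # aaa111
print(delete_spaces(None))       # None
```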
Solution 4
As shown by @Powers, there is a very nice and easy-to-read function to remove whitespace provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn Here are the instructions on how to install it if working in a Databricks workspace: https://docs.databricks.com/libraries.html
Here again an illustration of how it works:
# import library
import quinn
from pyspark.sql.functions import col

# create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])

# function call to remove whitespace. Note: withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)
Iz M
Updated on July 27, 2022

Comments
-
Iz M almost 2 years
I have a data frame (business_df) of schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)

I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces. My code is:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()
I keep receiving this error:
NameError: global name 'replace' is not defined
What's wrong with this code?
-
Iz M about 8 years
What I want to do is related to my previous post that you answered about sql select statements and bind variables. For this question, the set-up is the same as before: I have a list of restaurants and their cities e.g. [[Le Bernadin, Manhattan][...,...]] that I would like to loop through and retrieve matching restaurants from a data frame that contains data following the schema above. However, because the blank spaces are producing errors, I want to eliminate these spaces from the data frame queried data so that I can compare the single word strings such as 'LeBernadin'
-
Iz M about 8 years
The regex solution does seem better; however, when I run it the df schema is replaced with the regex expression (root |-- regexp_replace(name,[\s'():^],): string (nullable = true)). I'm a bit puzzled, do you know why this is?
-
zero323 about 8 years
Because you don't use aliases. If you want a specific name use withColumn or alias.
-
Andre Carneiro over 5 years
As I said, my experience with regexp_replace was not good because it is too slow! I had better performance using rdd.map
-
E B over 5 years
@AndreCarneiro, can you share your revised code with rdd.map that you found faster than using regex.. I am trying to do some checking on UK postal codes and regexp is taking a very long time to process
-
Powers over 5 years
@AndreCarneiro - I don't think your code would run faster than a regexp_replace. Native Spark functions are visible to the optimizer, so they can be optimized in execution plans. It's possible my understanding of Spark is off, but I don't think so :)
-
Andre Carneiro over 5 years
@EB Well, if you never tried it you'll never know! Feel free to run a 'benchmark' if you want to! I'll do it when I can. For now, this solved my problems! So, it was good enough for me!
-
E B over 5 years
@Andre Carneiro, if I understand your function, you would just append one space in the string and replace anything with more than 1 space as long as it is not leading or trailing space, correct?
-
Andre Carneiro over 5 years
Yes! But the same principle can be applied! But I realize that regexp_replace is the best solution for this problem. I think I was fooled by my VM. Something like:

df = df.withColumn(colName, regexp_replace(df[colName], r"\s+", ""))