Remove blank space from data frame column values in Spark

Solution 1

While the problem you've described is not reproducible with the provided code, using Python UDFs for simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:

from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

df.select(regexp_replace(col("v"), " ", ""))

If you want to normalize whitespace-only values to empty strings (and strip leading/trailing spaces elsewhere), use trim:

from pyspark.sql.functions import trim

df.select(trim(col("v")))

If you want to keep leading/trailing spaces and only blank out whitespace-only values, adjust the regexp_replace pattern:

df.select(regexp_replace(col("v"), r"^\s+$", ""))
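
For reference, here is a minimal sketch (assuming a SparkSession named spark instead of the sc.parallelize(...).toDF() setup above) that puts the three variants side by side, with alias() so the result columns keep readable names:

from pyspark.sql.functions import regexp_replace, trim, col

df = spark.createDataFrame(
    [(1, "foo bar"), (2, "foobar "), (3, "   ")], ["k", "v"]
)

df.select(
    regexp_replace(col("v"), " ", "").alias("no_spaces"),     # "foobar", "foobar", ""
    trim(col("v")).alias("trimmed"),                          # "foo bar", "foobar", ""
    regexp_replace(col("v"), r"^\s+$", "").alias("blanked"),  # "foo bar", "foobar ", ""
).show()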

Solution 2

Here's a function that removes all whitespace in a string:

import pyspark.sql.functions as F

def remove_all_whitespace(col):
    return F.regexp_replace(col, "\\s+", "")

You can use the function like this:

import quinn
from pyspark.sql.functions import col

actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)

The remove_all_whitespace function is defined in the quinn library, which also provides single_space and anti_trim helpers for managing whitespace. PySpark itself provides ltrim, rtrim, and trim.
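
For comparison, a quick sketch of PySpark's built-in trimming functions; they only touch leading and/or trailing whitespace and leave interior spaces alone (again assuming a SparkSession named spark):

from pyspark.sql.functions import ltrim, rtrim, trim, col

df = spark.createDataFrame([("  foo bar  ",)], ["v"])

df.select(
    ltrim(col("v")).alias("ltrim"),   # "foo bar  "  (leading spaces removed)
    rtrim(col("v")).alias("rtrim"),   # "  foo bar"  (trailing spaces removed)
    trim(col("v")).alias("trim"),     # "foo bar"    (both sides removed)
).show()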

Solution 3

As @zero323 said, you probably shadowed the replace function somewhere. I tested your code and it works perfectly.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# create the contexts if they do not already exist (e.g. outside the pyspark shell)
sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()

#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
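
One way that exact NameError can arise (an assumption about the environment, not something the question confirms) is calling replace as a free function instead of as a str method inside the UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Raises NameError ("name 'replace' is not defined") when the UDF is executed,
# because replace is a str method, not a built-in function.
bad_udf = udf(lambda s: replace(s, " ", ""), StringType())

# Works: call replace on the string itself.
good_udf = udf(lambda s: s.replace(" ", ""), StringType())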

Solution 4

As shown by @Powers, there is a very nice and easy-to-read function to remove whitespace, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn Here are the instructions on how to install it if you are working in a Databricks workspace: https://docs.databricks.com/libraries.html

Here again is an illustration of how it works:

# import the library
import quinn
from pyspark.sql.functions import col

# create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

# remove all whitespace; note that withColumn replaces column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)

The output:
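
A sketch of what it shows, assuming quinn.remove_all_whitespace removes every whitespace character (the same behavior as the regexp_replace pattern in Solution 2):

df.show()

# k=1: "foo bar"  -> "foobar"   (interior space removed)
# k=2: "foobar "  -> "foobar"   (trailing space removed)
# k=3: "   "      -> ""         (whitespace-only value becomes empty)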

Author by Iz M

Updated on July 27, 2022

Comments

  • Iz M, almost 2 years ago

    I have a data frame (business_df) of schema:

    |-- business_id: string (nullable = true)
    |-- categories: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- city: string (nullable = true)
    |-- full_address: string (nullable = true)
    |-- hours: struct (nullable = true)
    |-- name: string (nullable = true)
    

    I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.

    My code is:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import HiveContext
    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import StringType
    
    udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
    new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
    new_df.registerTempTable("vegas")
    new_df.printSchema()
    vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()
    

    I keep receiving this error:

    NameError: global name 'replace' is not defined

    What's wrong with this code?

  • Iz M, about 8 years ago
    What I want to do is related to my previous post that you answered about SQL select statements and bind variables. For this question, the set-up is the same as before: I have a list of restaurants and their cities, e.g. [[Le Bernadin, Manhattan][...,...]], that I would like to loop through and retrieve matching restaurants from a data frame that contains data following the schema above. However, because the blank spaces are producing errors, I want to eliminate these spaces from the data queried from the data frame so that I can compare single-word strings such as 'LeBernadin'.
  • Iz M, about 8 years ago
    The regex solution does seem better; however, when I run it, the df schema is replaced with the regex expression (root |-- regexp_replace(name,[\s'():^],): string (nullable = true)). I'm a bit puzzled, do you know why this is?
  • zero323, about 8 years ago
    Because you don't use aliases. If you want a specific name, use withColumn or alias.
  • Andre Carneiro, over 5 years ago
    As I said, my experience with regexp_replace was not good because it is too slow! I had better performance using rdd.map.
  • E B, over 5 years ago
    @AndreCarneiro, can you share your revised code with rdd.map that you found faster than using regex? I am trying to do some checking on UK postal codes and regexp is taking a very long time to process.
  • Powers, over 5 years ago
    @AndreCarneiro - I don't think your code would run faster than a regexp_replace. Native Spark functions are visible to the compiler, so they can be optimized in execution plans. It's possible my understanding of Spark is off, but I don't think so :)
  • Andre Carneiro, over 5 years ago
    @EB Well, if you never tried it you'll never know! Feel free to run a 'benchmark' if you want to! I'll do it when I can. For now, this solved my problems! So, it was good enough for me!
  • E B, over 5 years ago
    @Andre Carneiro, if I understand your function, you would just append one space in the string and replace anything with more than one space, as long as it is not a leading or trailing space, correct?
  • Andre Carneiro, over 5 years ago
    Yes! But the same principle can be applied! But I realize that regexp_replace is the best solution for this problem. I think I was fooled by my VM. Something like: df = df.withColumn(colName, regexp_replace(df[colName], r"\s+", ""))