How to remove blank spaces in Spark table column (Pyspark)

Solution 1

If you use a dataframe, this should work:

from pyspark.sql.functions import trim
dataset = dataset.withColumn("purch_location", trim(dataset.purch_location))
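For example, a minimal sketch with made-up values (the SparkSession setup and sample data below are assumptions for illustration), showing that trim strips spaces on both sides:

from pyspark.sql import SparkSession
from pyspark.sql.functions import trim

spark = SparkSession.builder.getOrCreate()

# made-up sample mimicking the purch_location column from the question
df = spark.createDataFrame(
    [("  BRAMPTON ",), ("TORONTO  ",), ("   ",)],
    ["purch_location"],
)

# trim removes both leading and trailing whitespace,
# leaving "BRAMPTON", "TORONTO" and an empty string
df.withColumn("purch_location", trim(df.purch_location)).show()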

Solution 2

Just use pyspark.sql.functions.trim:

Trim the spaces from both ends for the specified string column.

from pyspark.sql.functions import trim

dataset.select(trim("purch_location"))

To convert empty (whitespace-only) values to null:

from pyspark.sql.functions import length, trim, when

df.select(when(length(trim("purch_location")) != 0, trim("purch_location")))
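And if you want to overwrite the column while keeping all the other columns, a sketch combining the same functions with withColumn (reusing the dataset variable and column name from the question):

from pyspark.sql.functions import length, trim, when

# replace purch_location with its trimmed value,
# or null when the trimmed value is empty
dataset = dataset.withColumn(
    "purch_location",
    when(length(trim("purch_location")) != 0, trim("purch_location")),
)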

I am using a spark table rather than a dataframe

SparkSession.table returns a DataFrame:

spark.range(1).createOrReplaceTempView("df")
type(spark.table("df"))
# pyspark.sql.dataframe.DataFrame
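
So the same approach should apply directly to the table from the question, for example (assuming the trans table name used there):

from pyspark.sql.functions import col, trim

# spark.table returns a DataFrame, so withColumn/trim work as above
dataset = spark.table("trans")
dataset = dataset.withColumn("purch_location", trim(col("purch_location")))
dataset.show(5)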
Updated on June 04, 2022

Comments

  • Bisbot almost 2 years

    I would like to remove blank spaces from all values in a specific column (purch_location). I am using a spark table rather than a dataframe or SQL table (but I can use a dataframe or SQL table if need be).

    Here is how my spark table was generated:

    dataset = spark.table("trans")
    cols = dataset.columns
    dataset.show(5)
    +----------+-------------+-------+--------------------+--------------+---+---+
    |purch_date|  purch_class|tot_amt|       serv-provider|purch_location| id|  y|
    +----------+-------------+-------+--------------------+--------------+---+---+
    |06/11/2017|     Printing|  -0.66|   CARDS AND POCKETS|              |  0|  0|
    |03/11/2017|Uncategorized| -17.53|             HOVER  |              |  1|  0|
    |02/11/2017|    Groceries| -70.05|1774 MAC'S CONVEN...|     BRAMPTON |  2|  1|
    |31/10/2017|Gasoline/Fuel|  -20.0|              ESSO  |              |  3|  0|
    |31/10/2017|       Travel|   -9.0|TORONTO PARKING A...|      TORONTO |  4|  0|
    +----------+-------------+-------+--------------------+--------------+---+---+
    

    I have attempted the following, using pyspark.sql.functions.regexp_replace:

    from pyspark.sql.functions import regexp_replace, col
    dataset1 = dataset.select(regexp_replace(col("purch_location"), "\\s+", ""))
    

    This removes the blank spaces AFTER the value in the column but not before. It also gets rid of the rest of my columns, which I would like to keep.

    +-------------------------------------+
    |regexp_replace(purch_location, \s+, )|
    +-------------------------------------+
    |                                     |
    |                                     |
    |                             BRAMPTON|
    |                                     |
    |                              TORONTO|
    |                                     |
    |                             BRAMPTON|
    |                             BRAMPTON|
    |                                     |
    |                                 null|
    |                          MISSISSAUGA|
    |                                     |
    |                             BRAMPTON|
    

    Any idea how I can resolve this? Thanks in advance.
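
    For reference, the shape I was aiming for is roughly this (just a sketch, assuming regexp_replace can be wrapped in withColumn so the other columns are kept, with an anchored pattern that strips only leading and trailing spaces):

    from pyspark.sql.functions import col, regexp_replace

    # withColumn keeps every other column; the anchored pattern removes
    # leading and trailing whitespace but leaves inner spaces intact
    dataset1 = dataset.withColumn(
        "purch_location",
        regexp_replace(col("purch_location"), r"^\s+|\s+$", ""),
    )
    dataset1.show(5)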