How to remove blank spaces in Spark table column (Pyspark)
Solution 1
If you use a dataframe, this should work:
from pyspark.sql.functions import trim
dataset = dataset.withColumn("purch_location", trim(df. purch_location))
Solution 2
Just use pyspark.sql.functions.trim
:
Trim the spaces from both ends for the specified string column.
from pyspark.sql.functions import trim
dataset.select(trim("purch_location"))
To convert to null:
from pyspark.sql.functions import length trim, when
df.select(when(length(trim("purch_location")) != 0, trim("purch_location")))
I am using a spark table rather then a dataframe
SparkSession.table
returns DataFrame
:
spark.range(1).createOrReplaceTempView("df")
type(spark.table("df"))
# pyspark.sql.dataframe.DataFrame
Related videos on Youtube
Bisbot
Updated on June 04, 2022Comments
-
Bisbot almost 2 years
I would like to remove blank spaces from all values in a specific column(purch_location). I am using a spark table rather then a dataframe or SQL table (but I can use dataframe or SQL table if need be).
Here is how my spark table was generated:
dataset = spark.table("trans") cols = dataset.columns dataset.show(5) +----------+-------------+-------+--------------------+--------------+---+---+ |purch_date| purch_class|tot_amt| serv-provider|purch_location| id| y| +----------+-------------+-------+--------------------+--------------+---+---+ |06/11/2017| Printing| -0.66| CARDS AND POCKETS| | 0| 0| |03/11/2017|Uncategorized| -17.53| HOVER | | 1| 0| |02/11/2017| Groceries| -70.05|1774 MAC'S CONVEN...| BRAMPTON | 2| 1| |31/10/2017|Gasoline/Fuel| -20.0| ESSO | | 3| 0| |31/10/2017| Travel| -9.0|TORONTO PARKING A...| TORONTO | 4| 0| +----------+-------------+-------+--------------------+--------------+---+---+
I have attempted the following function pyspark.sql.functions import regexp_replace
from pyspark.sql.functions import regexp_replace dataset1=dataset.select(regexp_replace(col("purch_location"),"\\s+",""))
Which removes the blank spaces AFTER the value in the column but not before. It also gets rid of the rest of my columns which I would like to keep.
+-------------------------------------+ |regexp_replace(purch_location, \s+, )| +-------------------------------------+ | | | | | BRAMPTON| | | | TORONTO| | | | BRAMPTON| | BRAMPTON| | | | null| | MISSISSAUGA| | | | BRAMPTON|
Any idea how I can resolve this? Thanks in advance.