Split string in a spark dataframe column by regular expressions capturing groups
Solution 1
Here's how you can do this without using a udf:
import pyspark.sql.functions as f

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)
df.show(truncate=False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+
First use pyspark.sql.functions.regexp_replace to replace each sequence of 3 digits with that sequence followed by a comma, then split the resulting string on the comma.
The replacement pattern "$1," means: the first capturing group, followed by a comma. The match pattern also includes a negative lookahead for end of string, (?!$), to avoid adding a trailing comma.
Reference: REGEXP_REPLACE capturing groups
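As a sanity check, the same two-step trick can be reproduced with Python's own re module (note that Python writes the backreference as \1 where Spark's regexp_replace uses $1; the sample string is taken from the question's data):

```python
import re

# Insert a comma after every 3-digit group except the last, then split.
# (?!$) is a negative lookahead: it rejects the match that ends at
# end-of-string, so no trailing comma is added.
s = "003000000"
with_commas = re.sub(r"([0-9]{3})(?!$)", r"\1,", s)
print(with_commas)           # 003,000,000
print(with_commas.split(","))  # ['003', '000', '000']
```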
Solution 2
split will remove the pattern the string is split on; you need to create a udf for this:
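A quick plain-Python analogue shows why splitting directly on the matched pattern leaves only empty strings (using one of the question's sample values):

```python
import re

# split consumes its delimiter, and here the delimiter is the entire
# content of the string, so only the empty strings between (and around)
# the matches remain.
pieces = re.split(r"[0-9]{3}", "000000000")
print(pieces)  # ['', '', '', '']
```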
import re
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

# create a udf with re.findall
split_by_three = f.udf(lambda s: re.findall(r'\d{3}', s), ArrayType(StringType()))
df.withColumn('numbers', split_by_three('numbers')).show(3, False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+
df.withColumn('numbers', split_by_three('numbers')).printSchema()
#root
# |-- id: long (nullable = true)
# |-- numbers: array (nullable = true)
# | |-- element: string (containsNull = true)
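The udf above simply wraps re.findall, which keeps (rather than consumes) each match; on plain Python strings taken from the question's data it behaves like this:

```python
import re

# findall returns every non-overlapping 3-digit match, left to right.
parts = re.findall(r"\d{3}", "003000000000000000")
print(parts)  # ['003', '000', '000', '000', '000', '000']
print(re.findall(r"\d{3}", "000000"))  # ['000', '000']
```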
Solution 3
Both @pault's and @Psidom's approaches are great! Here is another alternative:
>>> import pyspark.sql.functions as F
>>> split_udf = F.udf(lambda x: ','.join([''.join(i) for i in zip(*[iter(x)]*3)]))
>>> df.withColumn('numbers', F.split(split_udf('numbers'), ',')).show(truncate=False)
+---+------------------------------+
|id |numbers |
+---+------------------------------+
|742|[000, 000, 000] |
|744|[000, 000] |
|746|[003, 000, 000, 000, 000, 000]|
+---+------------------------------+
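The zip(*[iter(x)]*3) idiom chunks a string into groups of 3: the same iterator object is repeated three times, so each zip step consumes three characters. In plain Python (sample value from the question):

```python
# One iterator, referenced three times: zip pulls from it round-robin,
# yielding 3-character tuples.
x = "003000000"
chunks = ["".join(i) for i in zip(*[iter(x)] * 3)]
print(chunks)            # ['003', '000', '000']
print(",".join(chunks))  # 003,000,000
```

One caveat worth noting: if the string length is not a multiple of 3, zip silently drops the leftover characters, whereas the re.findall approach would also drop them only if they form fewer than 3 digits.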
Shadab Shariff
Updated on June 04, 2022

Comments
-
Shadab Shariff almost 2 years
Given the data frame below, I want to split the numbers column into an array whose elements are 3-character chunks of the original number.
Given data frame :
+---+------------------+
| id|           numbers|
+---+------------------+
|742|         000000000|
|744|            000000|
|746|003000000000000000|
+---+------------------+
Expected dataframe :
+---+------------------------------+
| id|                       numbers|
+---+------------------------------+
|742|               [000, 000, 000]|
|744|                    [000, 000]|
|746|[003, 000, 000, 000, 000, 000]|
+---+------------------------------+
I tried different regular expressions with the split function shown below, starting with the regex I felt should have worked on the very first try:

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        [742, '000000000'],
        [744, '000000'],
        [746, '003000000000000000'],
    ],
    ["id", "numbers"]
)
df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))
df.show()
The result, however, is:

+---+--------------+
| id|       numbers|
+---+--------------+
|742|      [, , , ]|
|744|        [, , ]|
|746|[, , , , , , ]|
+---+--------------+
I want to understand what I am doing wrong. Is there a way to set a global flag to get all the matches, or have I missed something in the regular expression altogether?
-
pault over 5 years Whenever I see "You need to create a udf for this", I take it as a personal challenge.
-
Psidom over 5 years This is pretty clever!
-
pault over 5 years@Psidom one downside is that it assumes no other commas exist in the string.