Split string in a spark dataframe column by regular expressions capturing groups
Solution 1
Here's how you can do this without using a udf:
import pyspark.sql.functions as f

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)
df.show(truncate=False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+
First use pyspark.sql.functions.regexp_replace to replace each sequence of 3 digits with that sequence followed by a comma, then split the resulting string on the comma.
The replacement pattern "$1," means: the first capturing group, followed by a comma. The match pattern also includes a negative lookahead for end of string, (?!$), to avoid adding a trailing comma.
Reference: REGEXP_REPLACE capturing groups
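As a sanity check, the same two-step trick can be reproduced with Python's own re module (note that Python writes the backreference as \1 where Spark's regexp_replace uses $1; the sample string is taken from the question's data):

```python
import re

# Insert a comma after every 3-digit group except the last, then split.
# (?!$) is a negative lookahead: it rejects the match that ends at
# end-of-string, so no trailing comma is added.
s = "003000000"
with_commas = re.sub(r"([0-9]{3})(?!$)", r"\1,", s)
print(with_commas)           # 003,000,000
print(with_commas.split(","))  # ['003', '000', '000']
```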
Solution 2
split will remove the pattern the string is split on; you need to create a udf for this:
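A quick plain-Python analogue shows why splitting directly on the matched pattern leaves only empty strings (using one of the question's sample values):

```python
import re

# split consumes its delimiter, and here the delimiter is the entire
# content of the string, so only the empty strings between (and around)
# the matches remain.
pieces = re.split(r"[0-9]{3}", "000000000")
print(pieces)  # ['', '', '', '']
```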
import re
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

# create a udf with re.findall
split_by_three = f.udf(lambda s: re.findall(r'\d{3}', s), ArrayType(StringType()))
df.withColumn('numbers', split_by_three('numbers')).show(3, False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+
df.withColumn('numbers', split_by_three('numbers')).printSchema()
#root
# |-- id: long (nullable = true)
# |-- numbers: array (nullable = true)
# | |-- element: string (containsNull = true)
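The udf above simply wraps re.findall, which keeps (rather than consumes) each match; on plain Python strings taken from the question's data it behaves like this:

```python
import re

# findall returns every non-overlapping 3-digit match, left to right.
parts = re.findall(r"\d{3}", "003000000000000000")
print(parts)  # ['003', '000', '000', '000', '000', '000']
print(re.findall(r"\d{3}", "000000"))  # ['000', '000']
```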
Solution 3
Both @pault's and @Psidom's approaches are great! Here is another alternative:
>>> import pyspark.sql.functions as F
>>> split_udf = F.udf(lambda x: ','.join([''.join(i) for i in zip(*[iter(x)]*3)]))
>>> df.withColumn('numbers', F.split(split_udf('numbers'), ',')).show(truncate=False)
+---+------------------------------+
|id |numbers |
+---+------------------------------+
|742|[000, 000, 000] |
|744|[000, 000] |
|746|[003, 000, 000, 000, 000, 000]|
+---+------------------------------+
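The zip(*[iter(x)]*3) idiom chunks a string into groups of 3: the same iterator object is repeated three times, so each zip step consumes three characters. In plain Python (sample value from the question):

```python
# One iterator, referenced three times: zip pulls from it round-robin,
# yielding 3-character tuples.
x = "003000000"
chunks = ["".join(i) for i in zip(*[iter(x)] * 3)]
print(chunks)            # ['003', '000', '000']
print(",".join(chunks))  # 003,000,000
```

One caveat worth noting: if the string length is not a multiple of 3, zip silently drops the leftover characters, whereas the re.findall approach would also drop them only if they form fewer than 3 digits.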
Shadab Shariff
Updated on June 04, 2022

Comments
-
Shadab Shariff almost 2 years
Given the data frame below, I want to split the numbers column into an array whose elements are 3-character chunks of the original number.
Given data frame :
+---+------------------+
| id|           numbers|
+---+------------------+
|742|         000000000|
|744|            000000|
|746|003000000000000000|
+---+------------------+
Expected dataframe :
+---+------------------------------+
| id|                       numbers|
+---+------------------------------+
|742|               [000, 000, 000]|
|744|                    [000, 000]|
|746|[003, 000, 000, 000, 000, 000]|
+---+------------------------------+
I tried different regular expressions with the split function shown below, starting with the regex I felt should have worked on the very first try:

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        [742, '000000000'],
        [744, '000000'],
        [746, '003000000000000000'],
    ],
    ["id", "numbers"]
)
df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))
df.show()
The result, however, is:

+---+--------------+
| id|       numbers|
+---+--------------+
|742|      [, , , ]|
|744|        [, , ]|
|746|[, , , , , , ]|
+---+--------------+
I want to understand what I am doing wrong. Is there a way to set a global flag to get all the matches, or have I missed something in the regular expression altogether?
-
pault over 5 years Whenever I see "You need to create a udf for this", I take it as a personal challenge.
-
Psidom over 5 years This is pretty clever!
-
pault over 5 years@Psidom one downside is that it assumes no other commas exist in the string.