Split a string in a Spark DataFrame column by regular expression capturing groups


Solution 1

Here's how you can do this without using a udf:

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)

df.show(truncate=False)
#+---+------------------------------+
#|id |numbers                       |
#+---+------------------------------+
#|742|[000, 000, 000]               |
#|744|[000, 000]                    |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+

First use pyspark.sql.functions.regexp_replace to replace sequences of 3 digits with the sequence followed by a comma. Then split the resulting string on a comma.

The replacement pattern "$1," means first capturing group, followed by a comma.

In the match pattern, we also include a negative lookahead for end of string, (?!$), to avoid adding a comma to the end of the string.
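For illustration, here is a small sketch of the intermediate string, run on the original df before the transformation above (the with_commas column name is hypothetical and used only for display):

# Inspect the string produced by regexp_replace before splitting
df.withColumn(
    "with_commas",
    f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,")
).show(truncate=False)
#+---+------------------+-----------------------+
#|id |numbers           |with_commas            |
#+---+------------------+-----------------------+
#|742|000000000         |000,000,000            |
#|744|000000            |000,000                |
#|746|003000000000000000|003,000,000,000,000,000|
#+---+------------------+-----------------------+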

Reference: REGEXP_REPLACE capturing groups

Solution 2

split removes the pattern the string is split on; you need to create a udf for this:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
import re

# create a udf that extracts every group of 3 digits with re.findall
split_by_three = udf(lambda s: re.findall(r'\d{3}', s), ArrayType(StringType()))
df.withColumn('numbers', split_by_three('numbers')).show(3, False)

#+---+------------------------------+
#|id |numbers                       |
#+---+------------------------------+
#|742|[000, 000, 000]               |
#|744|[000, 000]                    |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+

df.withColumn('numbers', split_by_three('numbers')).printSchema()
#root
# |-- id: long (nullable = true)
# |-- numbers: array (nullable = true)
# |    |-- element: string (containsNull = true)
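One caveat worth noting, not covered in the original answer: a plain Python udf like this fails on null input, because re.findall receives None. A minimal null-safe sketch, under the same imports:

# Hypothetical null-safe variant: pass nulls through instead of
# raising a TypeError inside the Python worker
split_by_three_safe = udf(
    lambda s: re.findall(r'\d{3}', s) if s is not None else None,
    ArrayType(StringType())
)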

Solution 3

Both @pault's and @Psidom's ways are awesome! Here is another alternative:

>>> import pyspark.sql.functions as F
>>> split_udf = F.udf(lambda x: ','.join([''.join(i) for i in zip(*[iter(x)]*3)]))
>>> df.withColumn('numbers', F.split(split_udf('numbers'), ',')).show(truncate=False)
+---+------------------------------+
|id |numbers                       |
+---+------------------------------+
|742|[000, 000, 000]               |
|744|[000, 000]                    |
|746|[003, 000, 000, 000, 000, 000]|
+---+------------------------------+
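The zip(*[iter(x)]*3) idiom chunks the string into non-overlapping groups of 3 characters: a single iterator over x is repeated three times, so each zip step consumes three consecutive characters. A plain-Python sketch of what the udf computes:

# Illustration of the chunking idiom, outside of Spark
x = '003000000000000000'
chunks = [''.join(i) for i in zip(*[iter(x)] * 3)]
print(chunks)            # ['003', '000', '000', '000', '000', '000']
print(','.join(chunks))  # 003,000,000,000,000,000

Note that zip silently drops any trailing characters if the string length is not a multiple of 3.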
Comments

  • Shadab Shariff almost 2 years

    Given the below data frame, I wanted to split the numbers column into an array whose elements are consecutive 3-character chunks of the original number.

    Given data frame :

    +---+------------------+
    | id|           numbers|
    +---+------------------+
    |742|         000000000|
    |744|            000000|
    |746|003000000000000000|
    +---+------------------+
    

    Expected dataframe :

    +---+----------------------------------+
    | id|           numbers                |
    +---+----------------------------------+
    |742| [000, 000, 000]                  |
    |744| [000, 000]                       |
    |746| [003, 000, 000, 000, 000, 000]   |
    +---+----------------------------------+
    

    I tried different regular expressions with the split function shown below, starting with the regex that I felt should have worked on the very first try:

    import pyspark.sql.functions as f
    
    df = spark.createDataFrame(
        [
            [742, '000000000'], 
            [744, '000000'], 
            [746, '003000000000000000'], 
        ],
        ["id", "numbers"]
    )
    
    df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))
    
    df.show()
    

    The result however is

    +---+--------------+
    | id|       numbers|
    +---+--------------+
    |742|      [, , , ]|
    |744|        [, , ]|
    |746|[, , , , , , ]|
    +---+--------------+
    

    I want to understand what I am doing wrong. Is there a way to set a global flag to get all the matches, or have I missed something in the regular expression altogether?

  • pault over 5 years
    Whenever I see "You need to create a udf for this", I take it as a personal challenge.
  • Psidom over 5 years
    This is pretty clever.
  • pault over 5 years
    @Psidom one downside is that it assumes no other commas exist in the string.
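Following up on that last comment: if the data could contain commas, the same approach works with a delimiter that cannot appear in the input. A minimal sketch, where the choice of "\u0001" is an assumption rather than anything from the original answers:

import pyspark.sql.functions as f

# Use a control character as the split delimiter instead of a comma
DELIM = "\u0001"
df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", "$1" + DELIM), DELIM)
)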