PySpark explode list into multiple columns based on name

Solution 1

You can use flatMap and pivot to achieve this. Starting from the results of the first stage:

rdd = sc.parallelize([(1,['AA 1234  ZXYV','BB A 890','CC B 321']),
                      (2,['AA 1234  LMNO','BB D 123','CC E 321']),
                      (3,['AA 1234  ZXYV','CC B 321'])])

df = rdd.toDF(['index', 'result'])

You can first explode the array into multiple rows using flatMap and extract the two-letter identifier into a separate column:

df_flattened = df.rdd.flatMap(lambda x: [(x[0], y, y[0:2], y[3:]) for y in x[1]])\
                     .toDF(['index', 'result', 'identifier', 'identifiertype'])

and use pivot to turn the two-letter identifier into column names:

from pyspark.sql.functions import first

df_result = df_flattened.groupby(df_flattened.index)\
                        .pivot("identifier")\
                        .agg(first("identifiertype"))\
                        .join(df, 'index')

I added the join to get the original result column back.

Solution 2

An alternate way to explode the array without converting to an RDD:

from pyspark.sql import functions as F

udf1 = F.udf(lambda x: x.split()[0])

# The alias 'id' cannot be referenced within the same select,
# so derive idtype in a second step:
df1 = df.select('index', F.explode('result').alias('id'))\
        .withColumn('idtype', udf1(F.col('id')))
df1.show()

+-----+-------------+------+
|index|           id|idtype|
+-----+-------------+------+
|    1|AA 1234  ZXYV|    AA|
|    1|     BB A 890|    BB|
|    1|     CC B 321|    CC|
|    2|AA 1234  LMNO|    AA|
|    2|     BB D 123|    BB|
|    2|     CC E 321|    CC|
|    3|AA 1234  ZXYV|    AA|
|    3|     CC B 321|    CC|
+-----+-------------+------+ 

df1.groupby('index').pivot('idtype').agg(F.first('id')).join(df,'index').show()
Author by robarthur1

Updated on June 17, 2022

Comments

  • robarthur1, almost 2 years

    Hi, I'm dealing with a slightly difficult file format that I'm trying to clean for future processing. I've been using PySpark to process the data into a dataframe.

    The file looks similar to this:

    AA 1234  ZXYW
    BB A 890
    CC B 321
    AA 1234  LMNO
    BB D 123
    CC E 321
    AA 1234  ZXYW
    CC E 456
    

    Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record types. They are always identified with a two-letter code at the start of each line. There can be one or many different record types in each group (i.e. not all record types are present for each group).

    As a first stage, I've managed to group the records together in this format:

    +----------------+---------------------------------+
    |           index|                           result|
    +----------------+---------------------------------+
    |               1|[AA 1234  ZXYV,BB A 890,CC B 321]|
    |               2|[AA 1234  LMNO,BB D 123,CC E 321]|
    |               3|[AA 1234  ZXYV,CC B 321]         |
    +----------------+---------------------------------+
    

    And as a second stage I really want to get data into the following columns in a dataframe:

    +----------------+---------------------------------+-------------+--------+--------+
    |           index|                           result|           AA|      BB|      CC|
    +----------------+---------------------------------+-------------+--------+--------+
    |               1|[AA 1234  ZXYV,BB A 890,CC B 321]|AA 1234  ZXYV|BB A 890|CC B 321|
    |               2|[AA 1234  LMNO,BB D 123,CC E 321]|AA 1234  LMNO|BB D 123|CC E 321|
    |               3|[AA 1234  ZXYV,CC B 321]         |AA 1234  ZXYV|    Null|CC B 321|
    +----------------+---------------------------------+-------------+--------+--------+
    

    Because at that point extracting the information that I need should be trivial.

    Does anyone have any suggestions as to how I might be able to do this?

    Many Thanks.

  • robarthur1, over 6 years
    That worked absolutely perfectly, exactly what I needed. Thanks very much for your help.