Combine PySpark DataFrame ArrayType fields into single ArrayType field

python apache-spark dataframe pyspark apache-spark-sql

28,603

Solution 1

Spark >= 2.4

You can use the concat function (SPARK-23736):

from pyspark.sql.functions import col, concat 

df.select(concat(col("tokens"), col("tokens_bigrams"))).show(truncate=False)

# +---------------------------------+
# |concat(tokens, tokens_bigrams)   |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |null                             |
# +---------------------------------+

To keep the data when one of the values is NULL, coalesce each column with an empty array:

from pyspark.sql.functions import array, coalesce      

df.select(concat(
    coalesce(col("tokens"), array()),
    coalesce(col("tokens_bigrams"), array())
)).show(truncate=False)

# +--------------------------------------------------------------------+
# |concat(coalesce(tokens, array()), coalesce(tokens_bigrams, array()))|
# +--------------------------------------------------------------------+
# |[one, two, two, one two, two two]                                   |
# |[three]                                                             |
# +--------------------------------------------------------------------+

Spark < 2.4

Unfortunately, to concatenate array columns in the general case you'll need a UDF, for example:

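from itertools import chain
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType


def concat(element_type):
    # Build a UDF that concatenates any number of array columns,
    # treating NULL (None) inputs as empty arrays.
    def concat_(*args):
        return list(chain.from_iterable(arg if arg else [] for arg in args))
    return udf(concat_, ArrayType(element_type))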

which can be used as:

df = spark.createDataFrame(
    [(["one", "two", "two"], ["one two", "two two"]), (["three"], None)], 
    ("tokens", "tokens_bigrams")
)

concat_string_arrays = concat(StringType())
df.select(concat_string_arrays("tokens", "tokens_bigrams")).show(truncate=False)

# +---------------------------------+
# |concat_(tokens, tokens_bigrams)  |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |[three]                          |
# +---------------------------------+
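
The NULL handling in the UDF comes entirely from the plain Python inner function, so it can be checked without a Spark session. A minimal sketch, assuming the hypothetical name flatten_arrays as a stand-in for concat_ (it is not part of the original answer):

```python
from itertools import chain


def flatten_arrays(*args):
    # Same expression as the body of concat_; None (how a SQL NULL reaches
    # a Python UDF) and empty lists both contribute no elements.
    return list(chain.from_iterable(arg if arg else [] for arg in args))


# The two rows from the example DataFrame.
row_1 = flatten_arrays(["one", "two", "two"], ["one two", "two two"])
row_2 = flatten_arrays(["three"], None)

print(row_1)  # ['one', 'two', 'two', 'one two', 'two two']
print(row_2)  # ['three']
```

Because the function accepts *args, the same UDF should also work for three or more array columns.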

Solution 2

In Spark 2.4.0 (2.3 on the Databricks platform) you can do this natively in the DataFrame API using the concat function. In your example you could do this:

from pyspark.sql.functions import col, concat

df.withColumn('tokens_bigrams', concat(col('tokens'), col('bigrams')))

The related JIRA issue is SPARK-23736.

Solution 3

I was using Spark < 2.4 and the above solutions didn't work for me; I was getting the error 'input to function concat should have StringType or BinaryType'. This worked for me:

from pyspark.sql import functions as F

df.select(
    "*",
    F.array(F.concat_ws(',', F.col('tokens'), F.col('bigrams'))).alias('concat_cols')
)
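
One caveat with the snippet above: concat_ws joins all elements into a single string, and array then wraps that string, so the result is a one-element array rather than an array of the original elements. A plain-Python analogy of the two behaviours for the first example row (this is not Spark code, and the variable names are only illustrative):

```python
tokens = ["one", "two", "two"]
bigrams = ["one two", "two two"]

# Analogy for array(concat_ws(',', tokens, bigrams)):
# join everything into one string, then wrap it in a list.
joined_then_wrapped = [",".join(tokens + bigrams)]

# Analogy for a true array concatenation.
concatenated = tokens + bigrams

print(len(joined_then_wrapped))  # 1
print(len(concatenated))         # 5
```

If the individual elements are needed afterwards, the joined string has to be split again, which is only safe when no token contains the separator.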


Author by

zemekeneng

Updated on June 12, 2021

Comments

zemekeneng almost 3 years

I have a PySpark DataFrame with 2 ArrayType fields:

>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]

I would like to combine them into a single ArrayType field:

>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]

The syntax that works with strings does not seem to work here:

df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)

Thanks!
