How do you concatenate multiple columns in a DataFrame into another column when some values are null?

Solution 1

Given the following dataframe:

df = sqlContext.createDataFrame([("foo", "bar"), ("baz", None)], 
                                ('a', 'b'))
df.show()

+---+----+
|  a|   b|
+---+----+
|foo| bar|
|baz|null|
+---+----+

One solution is to use a udf to filter/replace the null(s), e.g.:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

concat_udf = F.udf(lambda cols: "".join([x if x is not None else "*" for x in cols]), StringType())
df.withColumn("unique_id", concat_udf(F.array("a", "b"))).show()

which yields:

+---+----+---------+
|  a|   b|unique_id|
+---+----+---------+
|foo| bar|   foobar|
|baz|null|     baz*|
+---+----+---------+

Alternatively:

import pyspark.sql.functions as F

def myConcat(*cols):
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])

df.withColumn("unique_id", myConcat("a", "b")).show()

which also yields:

+---+----+---------+
|  a|   b|unique_id|
+---+----+---------+
|foo| bar|   foobar|
|baz|null|     baz*|
+---+----+---------+

Solution 2

# Assumes the alias: from pyspark.sql import functions as sf
df = df.withColumn('joined_column', sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))

Solution 3

You need to import the PySpark SQL functions first, as below:

from pyspark.sql import functions as sf
df = df.withColumn('joined_column', sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
Author by

Swetha Baskaran

Updated on July 05, 2022

Comments

  • Swetha Baskaran almost 2 years

    from functools import reduce
    from pyspark.sql.functions import concat
    def column_concat(a, b): return concat(a, b)
    searches_df = searches_df.withColumn('unique_id', reduce(column_concat, (searches_df[col] for col in search_parameters)))

    This works except when a column contains a null value; then the whole concatenated string is null. I'd like a placeholder or some other character to appear in the concatenated string instead.

  • Swetha Baskaran over 7 years
    Thank you! I overcame the problem by using concat_ws instead; however, these solutions are useful and necessary if a placeholder is required.
  • Sampat Kumar almost 4 years
    Is there a way to write the above myConcat function in Scala?