In pyspark, how do you add/concat a string to a column?


Solution 1

from pyspark.sql.functions import concat, col, lit


df.select(concat(col("firstname"), lit(" "), col("lastname"))).show(5)
+------------------------------+
|concat(firstname,  , lastname)|
+------------------------------+
|                Emanuel Panton|
|              Eloisa Cayouette|
|                   Cathi Prins|
|             Mitchel Mozdzierz|
|               Angla Hartzheim|
+------------------------------+
only showing top 5 rows

http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

Solution 2

Another option here is to use pyspark.sql.functions.format_string(), which gives you C printf-style formatting.

Here's an example where the values in the column are integers.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (10,), (100,)], ["col1"])
df.withColumn("col2", f.format_string("%03d", "col1")).show()
#+----+----+
#|col1|col2|
#+----+----+
#|   1| 001|
#|   2| 002|
#|   3| 003|
#|  10| 010|
#| 100| 100|
#+----+----+

Here the format "%03d" means pad the integer with leading zeros to a total width of 3. This is why 10 gets mapped to 010 while 100 does not change at all.
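format_string() follows the same printf conventions as Python's % operator, so the padding behaviour is easy to check locally without Spark:

```python
# "%03d" zero-pads only up to a total width of 3:
print("%03d" % 1)    # 001
print("%03d" % 10)   # 010
print("%03d" % 100)  # 100  (already 3 digits, so no padding is added)
```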

Or, if you want to prepend exactly three zeros regardless of the number's width:

df.withColumn("col2", f.format_string("000%d", "col1")).show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1|  0001|
#|   2|  0002|
#|   3|  0003|
#|  10| 00010|
#| 100|000100|
#+----+------+
Author: ASU_TY (updated on July 09, 2022)

Comments

  • ASU_TY
    ASU_TY about 2 years

I would like to add a string to an existing column. For example, df['col1'] has values '1', '2', '3', etc., and I would like to concatenate the string '000' on the left of col1 so I can get a column (new, or replacing the old one, doesn't matter) with values '0001', '0002', '0003'.

    I thought I should use df.withColumn('col1', '000'+df['col1']), but of course it does not work, since PySpark DataFrames are immutable.

    This should be an easy task, but I didn't find anything online. Hope someone can give me some help!

    Thank you!

  • ASU_TY
    ASU_TY over 6 years
    Hi Steven, thank you for your help! I think your solution works for my case, and I made a small modification to suit it: df = df.withColumn('col1', concat(lit("000"), col("col1"))). In addition, is using lit the only way to add a constant when modifying column values in PySpark? In pandas I would just use df['col1'] = '000' + df['col1'], but I'm not sure whether PySpark offers multiple ways to achieve the same thing.
  • Josh
    Josh almost 5 years
    This doesn't really answer the question posed by OP, it just shows how to concatenate two columns.
  • earl
    earl over 4 years
    I am using Python 2.7 and I do not find lit in it. What is the alternative for python 2.7 for the same issue?
  • Pablo Adames
    Pablo Adames almost 4 years
    This answer is a little more general, and its examples address the exact question; however, I found the previous one more useful for what I was looking for when I found this question. Variety is gold!