In PySpark, how do you add/concat a string to a column?
Solution 1
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("firstname"), lit(" "), col("lastname"))).show(5)
+------------------------------+
|concat(firstname, , lastname)|
+------------------------------+
| Emanuel Panton|
| Eloisa Cayouette|
| Cathi Prins|
| Mitchel Mozdzierz|
| Angla Hartzheim|
+------------------------------+
only showing top 5 rows
http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions
Solution 2
Another option here is to use pyspark.sql.functions.format_string(), which lets you apply C printf-style formatting.
Here's an example where the values in the column are integers.
import pyspark.sql.functions as f
df = sqlCtx.createDataFrame([(1,), (2,), (3,), (10,), (100,)], ["col1"])
df.withColumn("col2", f.format_string("%03d", "col1")).show()
#+----+----+
#|col1|col2|
#+----+----+
#| 1| 001|
#| 2| 002|
#| 3| 003|
#| 10| 010|
#| 100| 100|
#+----+----+
Here the format "%03d" means print an integer left-padded with zeros to a width of 3 characters. This is why 10 gets mapped to 010 and 100 does not change at all.
Or if you wanted to add exactly 3 zeros in the front:
df.withColumn("col2", f.format_string("000%d", "col1")).show()
#+----+------+
#|col1| col2|
#+----+------+
#| 1| 0001|
#| 2| 0002|
#| 3| 0003|
#| 10| 00010|
#| 100|000100|
#+----+------+
ASU_TY
Updated on July 09, 2022
Comments
-
ASU_TY about 2 years
I would like to add a string to an existing column. For example, df['col1'] has values '1', '2', '3', etc., and I would like to concat the string '000' on the left of col1, so I can get a column (new, or replacing the old one, doesn't matter) with values '0001', '0002', '0003'. I thought I should use df.withColumn('col1', '000' + df['col1']), but of course it does not work, since PySpark DataFrames are immutable. This should be an easy task, but I didn't find anything online. Hope someone can give me some help! Thank you!
-
ASU_TY over 6 years
Hi Steven, thank you for your help! I think your solution works for my case, and I did a little modification to suit it: df = df.withColumn('col1', concat(lit("000"), col("col1"))). In addition, is using lit the only way to add a constant to modify column values in PySpark? In pandas I would just use df['col1'] = '000' + df['col1'], but I'm not sure whether PySpark offers multiple ways to achieve this.
-
Josh almost 5 years
This doesn't really answer the question posed by the OP; it just shows how to concatenate two columns.
-
earl over 4 years
I am using Python 2.7 and I do not find lit in it. What is the alternative for Python 2.7 for the same issue?
-
Pablo Adames almost 4 years
This answer is a little more general, and its examples address the exact question; however, I found the previous one more useful for what I was looking for when I found this question. Variety is gold!