Pyspark: Replacing value in a column by searching a dictionary

python apache-spark dataframe pyspark apache-spark-sql

23,828

Solution 1

You can use either na.replace:

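deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}

df = spark.createDataFrame([
    ('Tablet', ), ('Phone', ),  ('PC', ), ('Other', ), (None, )
], ["device_type"])

# When to_replace is a dict, the second argument (value) is ignored.
df.na.replace(deviceDict, 1).show()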

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+

or map literal:

from itertools import chain
from pyspark.sql.functions import create_map, lit

mapping = create_map([lit(x) for x in chain(*deviceDict.items())])


df.select(mapping[df['device_type']].alias('device_type')).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|       null|
|       null|
+-----------+
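The `chain(*deviceDict.items())` call is the least obvious part of the map literal. `create_map` expects its arguments as alternating keys and values, and the chain flattens the dict into exactly that shape. A small plain-Python sketch (no Spark session needed) of what the flattening produces:

```python
from itertools import chain

deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}

# .items() yields (key, value) pairs; chain(*pairs) concatenates them
# into one flat sequence: key1, value1, key2, value2, ...
flat = list(chain(*deviceDict.items()))

print(flat)
# ['Tablet', 'Mobile', 'Phone', 'Mobile', 'PC', 'Desktop']
```

Each element of this list is then wrapped in `lit` so that `create_map` receives literal columns.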

Please note that the latter solution converts values not present in the mapping to NULL. If that is not the desired behavior, you can wrap the lookup in coalesce so it falls back to the original value:

from pyspark.sql.functions import coalesce


df.select(
    coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+

Solution 2

After a lot of searching and trying alternatives, I think the simplest way to replace values using a Python dict is the PySpark DataFrame method replace:

deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df_replace = df.replace(deviceDict,subset=['device_type'])

This replaces every value that appears as a key in the dict with the corresponding dict value. You get the same result from df.na.replace() when you pass the dict together with a subset argument. The docs do not make this clear: searching for replace turns up two entries, pyspark.sql.DataFrame.replace and pyspark.sql.DataFrameNaFunctions.replace, but the sample code in both uses df.na.replace, so it is not obvious that df.replace works as well.

Solution 3

Here is a little helper function, inspired by the R recode function, that abstracts the previous answers. As a bonus, it adds the option for a default value.

from itertools import chain
from pyspark.sql.functions import col, create_map, lit, when, isnull
from pyspark.sql.column import Column

df = spark.createDataFrame([
    ('Tablet', ), ('Phone', ),  ('PC', ), ('Other', ), (None, )
], ["device_type"])

deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}

df.show()
+-----------+
|device_type|
+-----------+
|     Tablet|
|      Phone|
|         PC|
|      Other|
|       null|
+-----------+

Here is the definition of recode.

def recode(col_name, map_dict, default=None):
    # Accept either a column name (str) or a Column instance.
    if not isinstance(col_name, Column):
        col_name = col(col_name)
    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    lookup = mapping_expr.getItem(col_name)
    if default is None:
        return lookup
    # Unmatched (and null) inputs look up to null, so substitute the default there.
    return when(~isnull(lookup), lookup).otherwise(default)

Creating a column without a default gives null/None for all unmatched values.

df.withColumn("device_type", recode('device_type', deviceDict)).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|       null|
|       null|
+-----------+

On the other hand, specifying a value for default replaces all unmatched values, including nulls, with this default.

df.withColumn("device_type", recode('device_type', deviceDict, default='Other')).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|      Other|
+-----------+

Solution 4

You can do this using df.withColumn too:

from itertools import chain
from pyspark.sql.functions import create_map, lit

deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}

mapping_expr = create_map([lit(x) for x in chain(*deviceDict.items())])

df = df.withColumn('device_type', mapping_expr[df['device_type']])
df.show()

Solution 5

A simple way to do it is to apply a UDF to the column:

    from pyspark.sql.functions import col, udf

    deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
    # Values missing from the dict are returned unchanged; udf defaults to a string return type.
    map_func = udf(lambda row: deviceDict.get(row, row))
    df = df.withColumn("device_type", map_func(col("device_type")))
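The function wrapped by `udf` is ordinary Python applied to one value at a time, so its behaviour can be checked without Spark. A minimal sketch, assuming null cells reach the function as `None`; the names `map_value`, `samples` and `mapped` are only illustrative:

```python
deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}

def map_value(value):
    # Same logic as the lambda: look the value up, fall back to the value itself.
    return deviceDict.get(value, value)

samples = ['Tablet', 'Phone', 'PC', 'Other', None]
mapped = [map_value(v) for v in samples]

print(mapped)
# ['Mobile', 'Mobile', 'Desktop', 'Other', None]
```

Unmatched values and `None` pass through unchanged, which matches the output of the na.replace and coalesce variants in Solution 1.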


Author by

Yuehan Lyu

Mathematics. Probability_Theory. Stochastic_Process. Statistics. Data_Science.

Updated on July 28, 2022

Comments

Yuehan Lyu almost 2 years
I'm a newbie in PySpark.

I have a Spark DataFrame df that has a column 'device_type'.

I want to replace every value that is "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".

In Python (with pandas) I can do the following,
```
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict,inplace=False)
```
How can I achieve this using PySpark? Thanks!
gilgamash over 5 years

Greetings. Even though it is more than a year later: I want to use the mapping approach with pyspark 2.1. However, in contrast to the example, when my table contains a "NULL" entry I get the error: "Py4JJavaError: An error occurred while calling o6564.collectToPython. : java.lang.RuntimeException: Cannot use null as map key!". Am I misunderstanding this, or can you give a hint about where the problem comes from? Thanks
mytabi almost 4 years

How can I do it in Scala?
Ali AzG almost 4 years

@mytabi I think there is no create_map and lit for scala and spark. However match and case in scala can be an alternative solution to achieve the same result.
jgtrz almost 4 years

how can you avoid hard coding 'device_type'? @yardsale8
GiovaniSalazar over 3 years

Thanks. Is there an option to set None when the value in the column does not match any key?
narjes Karmeni over 3 years

A proper way to do it:

    def mapping_func(x, deviceDict):
        try:
            return deviceDict.get(x, x)
        except Exception:
            return None

    map_func = udf(lambda row: mapping_func(row, deviceDict))
    df = df.withColumn("device_type", map_func(col("device_type")))
yardsale8 over 3 years

Since device_type is a column name, I am not sure you want to abstract that out. If you did, you could put the expression in a function that had the df, column name, and translation dict as arguments.
mang4521 about 2 years

@AliAzG is there a way to Remove those rows from a pyspark dataframe whose entries from a column [of the pyspark] are not present in a dictionary's list of keys?