replace key value from dictionary-CodePudding

Below is my DF:

deviceDict = {'TABLET' : 'MOBILE', 'PHONE':'MOBILE', 'PC':'Desktop', 'CEDEX' : '', 'ST' : 'SAINT', 'AV' : 'AVENUE', 'BD': 'BOULEVARD'}
df = spark.createDataFrame([('TABLET', 'DAF ST PAQ BD'), ('PHONE', 'AVOTHA'),  ('PC', 'STPA CEDEX'), ('OTHER', 'AV DAF'), (None, None)], ["device_type", 'City'])
df.show()

Output:

 ----------- ------------- 
|device_type|         City|
 ----------- ------------- 
|     TABLET|DAF ST PAQ BD|
|      PHONE|       AVOTHA|
|         PC|   STPA CEDEX|
|      OTHER|       AV DAF|
|       null|         null|
 ----------- -------------

The aim is to replace key/value, solution from Pyspark: Replacing value in a column by searching a dictionary

tests = df.na.replace(deviceDict, 1)

Result:

 ----------- ------------- 
|device_type|         City|
 ----------- ------------- 
|     MOBILE|DAF ST PAQ BD|
|     MOBILE|       AVOTHA|
|    Desktop|   STPA CEDEX|
|      OTHER|       AV DAF|
|       null|         null|
 ----------- -------------

It worked for device_type but I wasn't able to change the city (even when using subset)

Expected output:

 ----------- ------------------------ 
|device_type|                    City|
 ----------- ------------------------ 
|     MOBILE| DAF SAINT PAQ BOULEVARD|
|     MOBILE|                  AVOTHA|
|    Desktop|                    STPA|
|      OTHER|              AVENUE DAF|
|       null|                    null|
 ----------- ------------------------

CodePudding user response：

The replacement doesn't occur for the column City because you're trying to do some partial replacement in the column values. Whereas function DataFrame.replace uses the entire value as a mapping.

To achieve what you want for column City, you can use multiple nested regexp_replace expressions that you can dynamically generate using Python functools.reduce for example:

from functools import reduce
import pyspark.sql.functions as F

m = list(deviceDict.items())

df1 = df.na.replace(deviceDict, 1).withColumn(
    "City",
    reduce(
        lambda acc, x: F.regexp_replace(acc, rf"\b{x[0]}\b", x[1]),
        m[1:],
        F.regexp_replace(F.col("City"), rf"\b{m[0][0]}\b", m[0][1]),
    )
)

df1.show(truncate=False)
# ----------- ----------------------- 
#|device_type|City                   |
# ----------- ----------------------- 
#|MOBILE     |DAF SAINT PAQ BOULEVARD|
#|MOBILE     |AVOTHA                 |
#|Desktop    |STPA                   |
#|OTHER      |AVENUE DAF             |
#|null       |null                   |
# ----------- -----------------------

CodePudding user response：

I am using pandas dataframe and I think it will not be very different.

You will need to use "to_replace" option with a regex

# this will replace the old with new values given in deviceDict only if its a full string match
>>> print(deviceDict)

{'TABLET': 'MOBILE', 
'PHONE': 'MOBILE', 
'PC': 'Desktop', 
'CEDEX': '', 
'ST': 'SAINT', 
'AV': 'AVENUE', 
'BD': 'BOULEVARD'}

>>> print(df.replace(to_replace=deviceDict))

device_type           City
0      MOBILE  DAF ST PAQ BD
1      MOBILE         AVOTHA
2     Desktop     STPA CEDEX
3       OTHER         AV DAF
4        None           None

>>> df.replace(to_replace=r'\s*ST\s ', value=' SAINT ', regex = True)

device_type              City
0      TABLET  DAF SAINT PAQ BD
1       PHONE            AVOTHA
2          PC        STPA CEDEX
3       OTHER            AV DAF
4        None              None


>>> print(df.replace(to_replace=r'\s*AV\s ', value=' AVENUE ', regex = True))

  device_type           City
0      TABLET  DAF ST PAQ BD
1       PHONE         AVOTHA
2          PC     STPA CEDEX
3       OTHER     AVENUE DAF
4        None           None

The more generic solution will be

But I dont think you want to change AVOTHA --> AVENUEOTHA or, STPA --> SAINTPA

If you are fine with that change then this will work

>>> print(df.replace(to_replace=deviceDict, regex=True))

device_type                     City
0      MOBILE  DAF SAINT PAQ BOULEVARD
1      MOBILE               AVENUEOTHA
2     Desktop                 SAINTPA 
3       OTHER               AVENUE DAF
4        None                     None