I have a data frame looking like this
code 1 code 2 Fruit_Group temp_code
(string) (string) (string) (boolean)
12E5-11 12E5-11 Apple True
12E5-11 ERE5-11,12E5-11 Apple True
12E5-11 MMMM-11 Apple True #color to be changed
12E5-11 XXXX-11 Apple False #color to be changed
12E5-11 12E5-11 Apple True
12E5-11 12E5-11, ERE5-11 Apple True
By default the Fruit_Group is assigned to some fruits:.
The color remains Apple when the first four digits of "code 1" match with first four digits of "code 2" (also consider ',' - the four digits should be matched with any one of "code 2")
I have to change the Fruit group for rows 3 and 4 since no match is found:
- If digits do not match then check
temp_code: iftemp_code == True--> Banana else Orange
Can anyone help me with this?
Data:
x = [("12E5-11", "12E5-11", "Apple", True), ("12E5-11", "ERE5-11,12E5-11", "Apple", True), ("12E5-11", "MMMM-11", "Apple", True ), ("12E5-11", "XXXX-11" ,"Apple", False), ("12E5-11", "12E5-11", "Apple", True), ("12E5-11", "ERE5-11,12E5-11", "Apple", True)]
Fruits_df = spark.createDataFrame(x, schema=["code 1", "code 2","Fruit_Group","temp_code"])
CodePudding user response:
You can use rlike to check if code 2 contains a code starting with the first 4 characters of code 1. If those characters are 1234, the first subtlety is to avoid matching with something that contains 1234 but does not start with 1234. For that, we can add (^|,) at the beginning of the regex. The second subtlety is that rlike in spark (python and scala) can only be used with a fixed string and not with a column, expect when used inside expr :-) Once you have checked whether or not code 2 matches your criteria, you can use when to change your fruit accordingly:
Fruits_df\
.withColumn("pattern", F.concat(F.lit("(^|,)"), F.substring("code 1", 0, 4)))\
.withColumn("Fruit_Group",
F.when(F.expr("`code 2` rlike `pattern`"), F.col("Fruit_Group"))
.when(F.col("temp_code"), F.lit("Banana"))
.otherwise(F.lit("Orange"))
).show()
------- --------------- ----------- --------- ---------
| code 1| code 2|Fruit_Group|temp_code| pattern|
------- --------------- ----------- --------- ---------
|12E5-11| 12E5-11| Apple| true|(^|,)12E5|
|12E5-11|ERE5-11,12E5-11| Apple| true|(^|,)12E5|
|12E5-11| MMMM-11| Banana| true|(^|,)12E5|
|12E5-11| XXXX-11| Orange| false|(^|,)12E5|
|12E5-11| 12E5-11| Apple| true|(^|,)12E5|
|12E5-11|ERE5-11,12E5-11| Apple| true|(^|,)12E5|
------- --------------- ----------- --------- ---------
