Home > Mobile >  How to join a spark dataframe twice with different id type
How to join a spark dataframe twice with different id type

Time:01-20

I have a spark.DataFrame called events and I want to join with another spark.DataFrame called users. Therefore the User can ben identified on events dataframe with two different types of Id. The schema of dataframes can be seen below: Events:

Id IdType Name Date EventType
324 UserId Daniel 2022-01-15 purchase
350 UserId Jack 2022-01-16 purchase
3247623322 UserCel Michelle 2022-01-10 claim

Users:

Id Name Cel
324 Daniel 5511737379
350 Jack 3247623817
380 Michelle 3247623322

What I want to do is to left join the events dataframe twice in order to extract all the events despite the IdType used on events dataframe

The final dataframe I want must be as follows:

Id Name Cel Date EventType
324 Daniel 5511737379 2022-01-15 Purchase
350 Jack 3247623817 2022-01-16 Purchase
380 Michelle 3247623322 2022-01-10 Claim

I guess the python (PySpark code) for this join might be close to:

(users.join(events, on = [users.Id == events.Id], how = 'left')
      .join(events, on = [users.Cel == events.Id], how = 'left'))

CodePudding user response:

You can do that with the following code

with_id = (users.join(events, on=users["Id"]==events["Id"], how='inner')
                .select(events["Id"], events["Name"],"Cel","Date","EventType"))

incorrect_id = (users.join(events, on=users["Id"]==events["Id"], how='leftanti')
                        .join(events, on=users["Cel"]==events["Id"])
                        .select(users["Id"], events["Name"],"Cel","Date","EventType"))


result = with_id.unionAll(incorrect_id)

The result

result.show()
 --- -------- ---------- ---------- --------- 
| Id|    Name|       Cel|      Date|EventType|
 --- -------- ---------- ---------- --------- 
|324|  Daniel|5511737379|2022-01-15| purchase|
|350|    Jack|3247623817|2022-01-16| purchase|
|380|Michelle|3247623322|2022-01-10|    claim|
 --- -------- ---------- ---------- --------- 
  •  Tags:  
  • Related