How to join a spark dataframe twice with different id type-CodePudding

I have a spark.DataFrame called events and I want to join with another spark.DataFrame called users. Therefore the User can ben identified on events dataframe with two different types of Id. The schema of dataframes can be seen below: Events:

Id	IdType	Name	Date	EventType
324	UserId	Daniel	2022-01-15	purchase
350	UserId	Jack	2022-01-16	purchase
3247623322	UserCel	Michelle	2022-01-10	claim

Users:

Id	Name	Cel
324	Daniel	5511737379
350	Jack	3247623817
380	Michelle	3247623322

What I want to do is to left join the events dataframe twice in order to extract all the events despite the IdType used on events dataframe

The final dataframe I want must be as follows:

Id	Name	Cel	Date	EventType
324	Daniel	5511737379	2022-01-15	Purchase
350	Jack	3247623817	2022-01-16	Purchase
380	Michelle	3247623322	2022-01-10	Claim

I guess the python (PySpark code) for this join might be close to:

(users.join(events, on = [users.Id == events.Id], how = 'left')
      .join(events, on = [users.Cel == events.Id], how = 'left'))

CodePudding user response：

You can do that with the following code

with_id = (users.join(events, on=users["Id"]==events["Id"], how='inner')
                .select(events["Id"], events["Name"],"Cel","Date","EventType"))

incorrect_id = (users.join(events, on=users["Id"]==events["Id"], how='leftanti')
                        .join(events, on=users["Cel"]==events["Id"])
                        .select(users["Id"], events["Name"],"Cel","Date","EventType"))


result = with_id.unionAll(incorrect_id)

The result

result.show()
 --- -------- ---------- ---------- --------- 
| Id|    Name|       Cel|      Date|EventType|
 --- -------- ---------- ---------- --------- 
|324|  Daniel|5511737379|2022-01-15| purchase|
|350|    Jack|3247623817|2022-01-16| purchase|
|380|Michelle|3247623322|2022-01-10|    claim|
 --- -------- ---------- ---------- ---------