I have a spark.DataFrame called events and I want to join with another spark.DataFrame called users. Therefore the User can ben identified on events dataframe with two different types of Id.
The schema of dataframes can be seen below:
Events:
| Id | IdType | Name | Date | EventType |
|---|---|---|---|---|
| 324 | UserId | Daniel | 2022-01-15 | purchase |
| 350 | UserId | Jack | 2022-01-16 | purchase |
| 3247623322 | UserCel | Michelle | 2022-01-10 | claim |
Users:
| Id | Name | Cel |
|---|---|---|
| 324 | Daniel | 5511737379 |
| 350 | Jack | 3247623817 |
| 380 | Michelle | 3247623322 |
What I want to do is to left join the events dataframe twice in order to extract all the events despite the IdType used on events dataframe
The final dataframe I want must be as follows:
| Id | Name | Cel | Date | EventType |
|---|---|---|---|---|
| 324 | Daniel | 5511737379 | 2022-01-15 | Purchase |
| 350 | Jack | 3247623817 | 2022-01-16 | Purchase |
| 380 | Michelle | 3247623322 | 2022-01-10 | Claim |
I guess the python (PySpark code) for this join might be close to:
(users.join(events, on = [users.Id == events.Id], how = 'left')
.join(events, on = [users.Cel == events.Id], how = 'left'))
CodePudding user response:
You can do that with the following code
with_id = (users.join(events, on=users["Id"]==events["Id"], how='inner')
.select(events["Id"], events["Name"],"Cel","Date","EventType"))
incorrect_id = (users.join(events, on=users["Id"]==events["Id"], how='leftanti')
.join(events, on=users["Cel"]==events["Id"])
.select(users["Id"], events["Name"],"Cel","Date","EventType"))
result = with_id.unionAll(incorrect_id)
The result
result.show()
--- -------- ---------- ---------- ---------
| Id| Name| Cel| Date|EventType|
--- -------- ---------- ---------- ---------
|324| Daniel|5511737379|2022-01-15| purchase|
|350| Jack|3247623817|2022-01-16| purchase|
|380|Michelle|3247623322|2022-01-10| claim|
--- -------- ---------- ---------- ---------
