I have a delta table A as shown below.
| point | cluster | points_in_cluster |
|---|---|---|
| 37 | 1 | [37,32] |
| 45 | 2 | [45,67,84] |
| 67 | 2 | [45,67,84] |
| 84 | 2 | [45,67,84] |
| 32 | 1 | [37,32] |
Also I have a table B as shown below.
| id | point |
|---|---|
| 101 | 37 |
| 102 | 67 |
| 103 | 84 |
I want a query like the following. Here in obviously doesn't work for a list. So, what would be the right syntax?
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
As a result I should have a table like the following
| id | point |
|---|---|
| 101 | 37 |
| 101 | 32 |
| 102 | 45 |
| 102 | 67 |
| 102 | 84 |
| 103 | 45 |
| 103 | 67 |
| 103 | 84 |
CodePudding user response:
Based on your data sample, I'd do an equi-join on point column and then an explode on points_in_cluster :
from pyspark.sql import functions as F
# assuming A is df_A and B is df_B
df_A.join(
df_B,
on="point"
).select(
"id",
F.explode("points_in_cluster").alias("point")
)
Otherwise, you use array_contains:
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
