Group rows by similar name in column-CodePudding

For calculations, I've already created this dataframe

df1:

 -------------- ------ ------------ 
|name          |MG    |questions   |
 -------------- ------ ------------ 
|toto          |MG1   |[Q1, Q2]    |
|toto          |MG3   |[Q4, Q6, Q7]|
|toto-123      |MG1   |[Q1, Q2]    |
|toto-456      |MG1   |[Q1, Q2]    |
|titi          |MG1   |[Q1, Q2]    |
|...           |...   |...         |
 -------------- ------ ------------

Until now, it was enough, but now, for a new need, I encountered a problem I do not achieve to resolve: I have to group rows by name and MG not with the exact name, but with similar name, like that

df1:

 -------------- ------ ----------------------------- 
|name          |MG    |questions                    |
 -------------- ------ ----------------------------- 
|toto          |MG1   |[Q1, Q2] [Q1, Q2] [Q1, Q2]   |
|toto          |MG3   |[Q4, Q6, Q7]                 |
|titi          |MG1   |[Q1, Q2]                     |
|...           |...   |...                          |
 -------------- ------ -----------------------------

Here, questions for toto are questions for toto, toto-123 and toto-456. They can stay in different arrays or in the same. Also ideally the rows for toto-123 and toto-456 still remain.

I've tried the groupby function with condition thought SQL expression but without success

Do you have any hint or solution? Thanks a lot

CodePudding user response：

You can first, use self join to find rows that have similar name and mg values, using like operator in join condition:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("toto", "MG1", ["Q1", "Q2"]), ("toto", "MG3", ["Q4", "Q6", "Q7"]),
    ("toto-123", "MG1", ["Q1", "Q2"]), ("toto-456", "MG1", ["Q1", "Q2"]),
    ("titi", "MG1", ["Q1", "Q2"])], ["name", "mg", "questions"])

similar_df = df.alias("a").join(
    df.alias("b"),
    F.expr("a.name like concat('%', b.name, '%') and a.mg = b.mg and a.name != b.name"),
).selectExpr("a.name", "a.mg", "b.name as similar_name")

similar_df.show()
# -------- --- ------------ 
#|    name| mg|similar_name|
# -------- --- ------------ 
#|toto-123|MG1|        toto|
#|toto-456|MG1|        toto|
# -------- --- ------------

Then, join again with original dataframe and change the values of corresponding names with the similar one found in the first step, and finally group by name and mg to collect list of questions:

result = df.alias("df").join(
    similar_df.alias("s"),
    ["name", "mg"],
    'left'
).select(
    F.coalesce(F.col("similar_name"), F.col("df.name")).alias("name"),
    F.col("mg"),
    F.col("df.questions"),
).groupBy("name", "mg").agg(
    F.collect_list("questions").alias("questions")
)

result.show(truncate=False)
# ---- --- ------------------------------ 
#|name|mg |questions                     |
# ---- --- ------------------------------ 
#|toto|MG1|[[Q1, Q2], [Q1, Q2], [Q1, Q2]]|
#|titi|MG1|[[Q1, Q2]]                    |
#|toto|MG3|[[Q4, Q6, Q7]]                |
# ---- --- ------------------------------