Check if an array of arrays contains an array

Here are two columns of my dataframe (df):

A	B
["a"]	[["a"], ["b"]]
["c"]	[["a"], ["b"]]

I want to create a column that tells whether the array in column A is one of the arrays in the array of arrays in column B, like this:

A	B	C
["a"]	[["a"], ["b"]]	True
["c"]	[["a"], ["b"]]	False

I tried df.select("A", "B", array_contains(F.col("B"), F.col("A")).alias("C")) but got an error:

org.apache.spark.SparkRuntimeException: The feature is not supported: literal for '' of class java.util.ArrayList

It seems that array_contains doesn't work on an array of arrays in PySpark. Is there a workaround?

CodePudding user response：

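You can use the exists and transform higher-order functions in a SQL expression passed to F.expr. transform compares each sub-array of B with A, and exists checks whether any comparison is true.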

df = df.withColumn('C', F.expr('exists(transform(B, x -> x == A), x -> x is true)'))
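
To see what that expression computes, here is a plain-Python model of its two steps. This is only a sketch of the logic, not Spark code: the function name a_in_b_exists is made up for illustration, and Spark's handling of nulls is ignored.

```python
def a_in_b_exists(a, b):
    """Plain-Python model of exists(transform(B, x -> x == A), x -> x is true)."""
    # transform(B, x -> x == A): compare each sub-array of B with A as a whole
    matches = [x == a for x in b]
    # exists(..., x -> x is true): True if at least one comparison succeeded
    return any(m is True for m in matches)


# The two rows from the question
rows = [(["a"], [["a"], ["b"]]), (["c"], [["a"], ["b"]])]
results = [a_in_b_exists(a, b) for a, b in rows]
print(results)  # [True, False]
```

The comparison is between whole arrays, so A only matches when it equals one of the sub-arrays of B exactly.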

CodePudding user response：

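You can check it with arrays_overlap after applying flatten to column "B". Note that this tests whether A shares at least one element with the sub-arrays of B, which gives the expected result for this data but is a looser check than testing whether A itself is one of the sub-arrays.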

F.arrays_overlap("A", F.flatten("B"))

Full test:

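from pyspark.sql import SparkSession, functions as F

# Reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
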
df = spark.createDataFrame(
    [(["a"], [["a"], ["b"]]),
     (["c"], [["a"], ["b"]])],
    ['A', 'B'])

df = df.withColumn("C", F.arrays_overlap("A", F.flatten("B")))

df.show()
# +---+----------+-----+
# |  A|         B|    C|
# +---+----------+-----+
# |[a]|[[a], [b]]| true|
# |[c]|[[a], [b]]|false|
# +---+----------+-----+
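
The sample rows cannot tell an overlap check apart from exact membership, because every array in A has a single element. The following plain-Python sketch adds a row where the two differ. It is not Spark code, both helper names are hypothetical, and null handling is left out.

```python
def overlaps_flattened(a, b):
    """Model of arrays_overlap(A, flatten(B)): any shared element counts."""
    flat = [item for sub in b for item in sub]
    return any(item in flat for item in a)


def is_member(a, b):
    """Exact membership: A must equal one of the sub-arrays of B."""
    return a in b


b = [["a"], ["b"]]
for a in (["a"], ["c"], ["a", "c"]):
    print(a, overlaps_flattened(a, b), is_member(a, b))
# ['a'] True True
# ['c'] False False
# ['a', 'c'] True False
```

If A can hold more than one element and an exact match is required, the overlap approach may report true where a whole-array comparison would not.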