Here are two columns of my dataframe (df):
| A | B |
|---|---|
| ["a"] | [["a"], ["b"]] |
| ["c"] | [["a"], ["b"]] |
I want to create an array that tells whether the array in column A is in the array of array which is in column B, like this:
| A | B | C |
|---|---|---|
| ["a"] | [["a"], ["b"]] | True |
| ["c"] | [["a"], ["b"]] | False |
I tried df.select("A", "B", array_contains(F.col("B"), F.col("A")).alias("C")) but got an error:
org.apache.spark.SparkRuntimeException: The feature is not supported: literal for '' of class java.util.ArrayList
It seems that array of array isn't implemented in PySpark. Is there a workaround?
CodePudding user response:
You can use exists and transform functions in SQL expr.
df = df.withColumn('C', F.expr('exists(transform(B, x -> x == A), x -> x is true)'))
CodePudding user response:
You can check it using arrays_overlap after flatten is used on the column "B".
F.arrays_overlap("A", F.flatten("B"))
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(["a"], [["a"], ["b"]]),
(["c"], [["a"], ["b"]])],
['A', 'B'])
df = df.withColumn("C", F.arrays_overlap("A", F.flatten("B")))
df.show()
# --- ---------- -----
# | A| B| C|
# --- ---------- -----
# |[a]|[[a], [b]]| true|
# |[c]|[[a], [b]]|false|
# --- ---------- -----
