PySpark DataFrame: Count Of List Values in Column-CodePudding

I have a PySpark DataFrame with a string column [text] and a separate list [word_list] and I need to count how many of the [word_list] values appear in each [text] row (can be counted more than once).

df = spark.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

The result would be:

| text                  | list_count   |
| Hello my name is John |       6      |
| Yo go Bengals         |       2      |
| this is a text        |       2      |

For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from [word_list] that appears is "o" and it appears twice. In the third value for [text], the only value from [word_list] that appears is "is" and it appears twice.

The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.

CodePudding user response：

You can do this with a UDF as below

UDF

df = sql.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

def count_values(inp,map_list=None):
    
    count = 0
    
    for pattern in map_list:
        if re.findall(pattern,inp):
            count  = 1
    
    return count

count_values_udf = F.udf(partial(count_values,map_list=word_list),IntegerType())

df.select(
            count_values_udf(F.col('text')).alias('count_values')
         ).show()

 ------------ 
|count_values|
 ------------ 
|           4|
|           1|
|           1|
 ------------