I have a PySpark DataFrame with a string column [text] and a separate list [word_list] and I need to count how many of the [word_list] values appear in each [text] row (can be counted more than once).
df = spark.createDataFrame(
[(1,'Hello my name is John'),
(2,'Yo go Bengals'),
(3,'this is a text')
]
, ['id','text']
)
word_list = ['is', 'm', 'o', 'my']
The result would be:
| text | list_count |
| Hello my name is John | 6 |
| Yo go Bengals | 2 |
| this is a text | 2 |
For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from [word_list] that appears is "o" and it appears twice. In the third value for [text], the only value from [word_list] that appears is "is" and it appears twice.
The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.
CodePudding user response:
You can do this with a UDF as below
UDF
df = sql.createDataFrame(
[(1,'Hello my name is John'),
(2,'Yo go Bengals'),
(3,'this is a text')
]
, ['id','text']
)
word_list = ['is', 'm', 'o', 'my']
def count_values(inp,map_list=None):
count = 0
for pattern in map_list:
if re.findall(pattern,inp):
count = 1
return count
count_values_udf = F.udf(partial(count_values,map_list=word_list),IntegerType())
df.select(
count_values_udf(F.col('text')).alias('count_values')
).show()
------------
|count_values|
------------
| 4|
| 1|
| 1|
------------
