I have a dataframe with a single column which contains the names of people as shown below.
name
--------------
john doe
john david doe
doe henry john
john henry
I want to count the number of time each two words appear together in a name regardless of the order. In this example, the words john and doe appear in the same same three times names john doe, john henry doe and doe john.
Expected output
name1 | name2 | count
----------------------
david | doe | 1
doe | henry | 1
doe | john | 3
henry | john | 2
Notice that name1 is the word that comes first in alphabetical order. Currently I have a brute force solution.
- Create a list of all unique words in the dataframe
- For each unique word
Win this list, filter the records in the original data frame which contain thisW - From the filtered records, count frequency of other words. This gives the number of time
Wappears with various other words
Question: This works fine for small number of records but is not efficient if we have a large number of records as it runs in quadratic complexity. How can it generate the output in a faster way? Is there any function or package that can give these counts?
Note: I tried using n-gram extraction from NLP packages but this over estimates the counts because it internally appends all the names to form a long string due to which the last word on a name and the first word of the next name shows up as a a sequence of words in the appended string which adds up to the count.
CodePudding user response:
Here's a solution which is still quadratic but over smaller n and also hides most of it in compiled code (which should hopefully execute faster):
from itertools import combinations
combs = df['name'].apply(lambda x:list(combinations(sorted(x.split()),2)))
counts = Counter(combs.explode())
res = pd.Series(counts).rename_axis(['name1', 'name2']).reset_index(name='count')
Output for your sample data:
name1 name2 count
0 doe john 3
1 david doe 1
2 david john 1
3 doe henry 1
4 henry john 2
CodePudding user response:
I suggest the below:
import itertools
# 1) Create combinations of 2 from all the names using itertools().
l = [a for b in df.name.tolist() for a in b]
c = {c for c in itertools.combinations(l, 2) if c[0] != c[1]}
df_counts = pd.DataFrame(c, columns=["name1", "name2"])
# 2) Create a new column iterating through rows to check of each name is contained in each list of words & sum the boolean outputs.
df_counts["counts"] = df_counts.apply(lambda row: sum([row["name1"] in l and row["name2"] in l for l in df.name.to_list()]), axis=1)
I hope this helps.
CodePudding user response:
This problem is an instance of 'Association Rule Mining' with just 2 items per transaction. It has simple algorithms like 'Aperiori' as well as efficient ones like 'FP-Growth' and you can find a lot of resources to learn that.
