How to efficiently count the number of two word combinations in human names?-CodePudding

I have a dataframe with a single column which contains the names of people as shown below.

name
--------------
john doe
john david doe
doe henry john
john henry

I want to count the number of time each two words appear together in a name regardless of the order. In this example, the words john and doe appear in the same same three times names john doe, john henry doe and doe john.

Expected output

name1 | name2 | count
----------------------
david | doe   | 1
doe   | henry | 1
doe   | john  | 3
henry | john  | 2

Notice that name1 is the word that comes first in alphabetical order. Currently I have a brute force solution.

Create a list of all unique words in the dataframe
For each unique word W in this list, filter the records in the original data frame which contain this W
From the filtered records, count frequency of other words. This gives the number of time W appears with various other words

Question: This works fine for small number of records but is not efficient if we have a large number of records as it runs in quadratic complexity. How can it generate the output in a faster way? Is there any function or package that can give these counts?

Note: I tried using n-gram extraction from NLP packages but this over estimates the counts because it internally appends all the names to form a long string due to which the last word on a name and the first word of the next name shows up as a a sequence of words in the appended string which adds up to the count.

CodePudding user response：

Here's a solution which is still quadratic but over smaller n and also hides most of it in compiled code (which should hopefully execute faster):

from itertools import combinations

combs = df['name'].apply(lambda x:list(combinations(sorted(x.split()),2)))
counts = Counter(combs.explode())
res = pd.Series(counts).rename_axis(['name1', 'name2']).reset_index(name='count')

Output for your sample data:

   name1  name2  count
0    doe   john      3
1  david    doe      1
2  david   john      1
3    doe  henry      1
4  henry   john      2

CodePudding user response：

I suggest the below:

import itertools

# 1) Create combinations of 2 from all the names using itertools().
l = [a for b in df.name.tolist() for a in b]
c = {c for c in itertools.combinations(l, 2) if c[0] != c[1]}
df_counts = pd.DataFrame(c, columns=["name1", "name2"])

# 2) Create a new column iterating through rows to check of each name is contained in each list of words & sum the boolean outputs.
df_counts["counts"] = df_counts.apply(lambda row: sum([row["name1"] in l and row["name2"] in l for l in df.name.to_list()]), axis=1)

I hope this helps.

CodePudding user response：

This problem is an instance of 'Association Rule Mining' with just 2 items per transaction. It has simple algorithms like 'Aperiori' as well as efficient ones like 'FP-Growth' and you can find a lot of resources to learn that.