Suppose that I have 2 objects:
A is a list of names
B is a pandas DataFrame with 3 columns: 'name', 'friend1', 'friend2', which list a person's name and the names of their 2 best friends
For my application, I would like to know: for each person in A, a list of people in B for which the person in A is among the 2 best friends. To be specific, for each person in A, I would like a list my_bool of booleans that can be computed as follows:
for current_name in A:
    my_bool = (B['friend1'] == current_name) | (B['friend2'] == current_name)
    # ... other computation using my_bool ...
The computation works, but I'm trying to improve on its efficiency. For example, when A has length 15k and B has 50k rows, the computation time is very long.
My intuition is that it's inefficient for the loop to scan through all 50k rows of B for each person in A. Is there a way to vectorize the computation to create, say, a 15k x 50k matrix all_bools in 1 shot (without a loop), then read off my_bool (as the rows of all_bools) later as needed? In another language, I can implement this idea, but I'm unable to do it in Python. If this idea is garbage too, please feel free to put forth your suggestion.
CodePudding user response:
You can use the pd.Series.isin method, which implicitly converts the list to a hash map with a more efficient look-up time.
my_bool = B['friend1'].isin(A) | B['friend2'].isin(A)
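A small runnable sketch of the isin idea, using made-up sample data (the names below are illustrative, not from the question). Note that isin tests against the whole list at once, so it yields a single mask marking rows of B that name anyone from A, not one mask per person. The helper mask_for is a hypothetical name, added here only to show the per-person mask for comparison.

```python
import pandas as pd

# Illustrative sample data (assumed, not from the question)
A = ['Bob', 'Becky', 'Mark', 'Joe', 'Zeke']
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

# One mask for the whole list: True where either friend appears anywhere in A
any_in_A = B['friend1'].isin(A) | B['friend2'].isin(A)

def mask_for(current_name):
    """Per-person mask: True where current_name is friend1 or friend2."""
    return B[['friend1', 'friend2']].eq(current_name).any(axis=1)

print(any_in_A.tolist())
print(mask_for('Joe').tolist())
```

If the later computation needs a separate mask for each person in A, the combined isin mask alone is not enough; it can still be used to discard rows of B that mention nobody from A before looping.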
CodePudding user response:
You can try this:
import numpy as np
import pandas as pd
A = np.array(['Bob', 'Becky', 'Mark', 'Joe', 'Zeke'])
B = pd.DataFrame([['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']], columns=['name', 'friend1', 'friend2'])
# resulting shape is (len(A), len(B.friend1))
friend1 = np.equal(A.reshape(-1, 1), B.friend1.values)
friend2 = np.equal(A.reshape(-1, 1), B.friend2.values)
# your final all_bools for later reference
all_bools = friend1 | friend2
# processing one at a time:
for i in range(all_bools.shape[0]):
    my_bool = all_bools[i]
    # names of the people in B who list A[i] as one of their friends
    in_friends = B.loc[my_bool, 'name'].values
    if in_friends.size > 0:
        print(f"My name is {A[i]} and I'm friends with {in_friends}")
Because the comparison is done with NumPy broadcasting, the work runs in compiled code rather than in a Python-level loop, so it is fast.
However, the downside of creating all_bools in one go is memory. NumPy stores one byte per boolean, so a 15k x 50k array takes about 750 MB, and friend1, friend2 and all_bools are each that size.
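One way to avoid holding a len(A) x len(B) matrix in memory is to scan B once and build a lookup from each friend name to the row positions where it appears, then rebuild my_bool per person only when needed. This is a minimal sketch on made-up sample data, not something proposed in the thread; rows_by_friend and mask_for are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (assumed, not from the question)
A = ['Bob', 'Becky', 'Mark', 'Joe', 'Zeke']
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

# Stack both friend columns into one long table, remembering each row position
rows = np.arange(len(B))
long = pd.DataFrame({
    'friend': np.concatenate([B['friend1'].to_numpy(), B['friend2'].to_numpy()]),
    'row': np.concatenate([rows, rows]),
})

# Single pass over B: friend name -> list of row positions in B
rows_by_friend = long.groupby('friend')['row'].agg(list).to_dict()

def mask_for(current_name):
    """Rebuild the boolean mask for one person from the lookup."""
    idx = np.asarray(rows_by_friend.get(current_name, []), dtype=np.intp)
    my_bool = np.zeros(len(B), dtype=bool)
    my_bool[idx] = True
    return my_bool

masks = {current_name: mask_for(current_name) for current_name in A}
print(masks['Bob'])
```

The lookup holds at most 2 * len(B) row positions in total, and each person in A costs one dictionary access instead of a full scan of B. If the later computation only needs the matching rows, B.iloc with the stored positions can be used directly and the mask skipped.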
