I have three columns: id (a non-unique identifier), X (categorical) and Y (categorical). (I don't have a dataset to share yet. I'll try to replicate what I have with a smaller dataset and edit as soon as possible.)
I ran a for loop on a very small subset and, based on those results, the full run might take over 4 hours. I'm looking for a faster way to do this task using pandas (maybe using iterrows, or iterating over previous rows within apply).
For each row I check
- whether the current X matches any of previous Xs (check_X = X[:row] == X[row])
- whether the current Y matches any of previous Ys (check_Y = Y[:row] == Y[row])
- whether the current id does not match any of previous ids (check_id = id[:row] != id[row])
If sum(check_X & check_Y & check_id) > 0, I append 1 to the result array; otherwise I append 0.
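The check described above can be written out as a runnable baseline. This is only a minimal sketch, assuming the three columns are available as array-likes; the function name flag_rows_loop and the sample values are made up for illustration. It is quadratic in the number of rows, so it is mainly useful for validating a faster version on a small subset.

```python
import numpy as np

def flag_rows_loop(ids, X, Y):
    """Hypothetical helper: brute-force version of the per-row check."""
    ids, X, Y = np.asarray(ids), np.asarray(X), np.asarray(Y)
    flags = []
    for row in range(len(ids)):
        check_X = X[:row] == X[row]        # earlier rows with the same X
        check_Y = Y[:row] == Y[row]        # earlier rows with the same Y
        check_id = ids[:row] != ids[row]   # earlier rows with a different id
        # 1 if at least one earlier row satisfies all three conditions at once
        flags.append(int((check_X & check_Y & check_id).any()))
    return flags

# Illustrative data only, not the real dataset
sample_flags = flag_rows_loop(ids=[0, 0, 0, 1, 0],
                              X=[1, 1, 2, 1, 1],
                              Y=[2, 2, 2, 2, 2])
print(sample_flags)  # [0, 0, 0, 1, 1]
```

Because & is applied element-wise, the three conditions have to hold for the same earlier row, which is why the last row is flagged here: row 3 has the same X and Y but a different id.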
CodePudding user response:
You are probably looking for duplicated:
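import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})
# Among rows whose (X, Y) pair appeared earlier, mark those whose id is unique in that subset
df['dup'] = ~df[df.duplicated(['X', 'Y'])].duplicated('id', keep=False).loc[lambda x: ~x]
# Rows outside that subset become NaN on assignment; treat them as 0
df['dup'] = df['dup'].fillna(False).astype(int)
print(df)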
# Output
   id  X  Y  dup
0   0  1  2    0
1   0  1  2    0
2   0  2  2    0
3   1  1  2    1
4   0  1  2    0
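The chained expression is compact, so it may help to see it unpacked into steps. The following is a sketch that is intended to give the same result on this sample; the intermediate names repeated_xy, unique_id and flagged_index are mine and not part of the original one-liner.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

# Step 1: rows whose (X, Y) pair already appeared in an earlier row
repeated_xy = df[df.duplicated(['X', 'Y'])]

# Step 2: within that subset, keep=False marks every row that shares its id
# with another row, so the negation keeps ids that occur exactly once
unique_id = ~repeated_xy.duplicated('id', keep=False)
flagged_index = unique_id[unique_id].index

# Step 3: write 1 for the flagged rows and 0 everywhere else
df['dup'] = df.index.isin(flagged_index).astype(int)
print(df['dup'].tolist())  # [0, 0, 0, 1, 0]
```

Building the column with isin avoids the intermediate NaN values, so no fillna step is needed in this variant.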
CodePudding user response:
EDIT: the answer from @Corralien using duplicated() will likely be much faster and is the best answer for this specific problem. However, apply is more flexible if you have different things to check.
You could do it with iterrows() or apply(). As far as I know, apply() is faster than iterrows():
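check_id, check_x, check_y = set(), set(), set()

def apply_func(row):
    global check_id, check_x, check_y
    # Column names follow the question (id, X, Y)
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        row['duplicate'] = 1
    else:
        row['duplicate'] = 0
    # Record the current values so later rows can be compared against them
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
    return row

# apply returns a new DataFrame, so keep the result
df = df.apply(apply_func, axis=1)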
With iterrows():
check_id, check_x, check_y = set(), set(), set()
for i, row in df.iterrows():
    # Column names follow the question (id, X, Y)
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        df.loc[i, 'duplicate'] = 1
    else:
        df.loc[i, 'duplicate'] = 0
    # Record the current values so later rows can be compared against them
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
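Both versions above test each column against its own set, so the three conditions can be satisfied by different earlier rows. If they are meant to hold for one and the same earlier row, as the element-wise & in the question suggests, one option is to remember which ids have been seen for each (X, Y) pair. A minimal sketch, assuming the columns are named id, X and Y; flag_rows_single_pass and seen_ids are illustrative names:

```python
from collections import defaultdict

import pandas as pd

def flag_rows_single_pass(df):
    """Hypothetical helper: one pass over the rows, one set of ids per (X, Y) pair."""
    seen_ids = defaultdict(set)
    flags = []
    for row_id, x, y in zip(df['id'], df['X'], df['Y']):
        earlier = seen_ids[(x, y)]
        # Flag the row if some earlier row with the same (X, Y) had a different id
        has_other_id = len(earlier) > 1 or (len(earlier) == 1 and row_id not in earlier)
        flags.append(int(has_other_id))
        earlier.add(row_id)
    return flags

# Illustrative data only
sample = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                       'X': [1, 1, 2, 1, 1],
                       'Y': [2, 2, 2, 2, 2]})
sample['duplicate'] = flag_rows_single_pass(sample)
print(sample['duplicate'].tolist())  # [0, 0, 0, 1, 1]
```

Iterating over zipped columns also avoids building a Series for every row, which is usually where iterrows() and apply(axis=1) spend most of their time.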
CodePudding user response:
This is essentially like @Corralien's answer. What you want can be achieved using duplicated because it returns a Series indicating whether each value has occurred in the preceding values, which is precisely "whether the current X matches any of previous Xs". Then the condition for "id" is just the negation of it. Since you want 1 if all of them evaluate to True and 0 otherwise in each row, you can do it using the & operator and converting the resulting boolean Series to dtype int:
check_X = df['X'].duplicated()
check_Y = df['Y'].duplicated()
check_id = ~df['id'].duplicated()
out = (check_X & check_Y & check_id).astype(int)
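Note that the three duplicated calls look at each column on its own. If the loop in the question is read literally, the matching X, the matching Y and the differing id must all come from the same earlier row, and a grouped count may be closer to that. The following is a sketch under that assumption, not a drop-in replacement; the column name flag and the intermediate names are made up, and missing values in the key columns are not considered.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

# Number of earlier rows with the same (X, Y)
earlier_same_xy = df.groupby(['X', 'Y']).cumcount()
# Number of earlier rows with the same (X, Y) and the same id
earlier_same_xy_and_id = df.groupby(['X', 'Y', 'id']).cumcount()

# If the first count is larger, at least one earlier row shares X and Y
# but has a different id
df['flag'] = (earlier_same_xy > earlier_same_xy_and_id).astype(int)
print(df['flag'].tolist())  # [0, 0, 0, 1, 1]
```

On this sample the last row is flagged because row 3 has the same X and Y with a different id, whereas the per-column version gives 0 for it.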
