I have two dataframes, something like below:
df1:
date col1 col2 col3
15-5-2022 ABC 1 PQR
16-5-2022 BCD 2 ABC
17-5-2022 CDE 4 XYZ
df2:
date col1 col2 col3
5-4-2022 XYZ 1 ABC
6-4-2022 PQR 2 ABC
7-4-2022 BCD 4 PQR
My task is to get total number of unique values that are in df2.col1 but not in df1.col1. The way I am doing this is by creating first a list of all col1 unique values from df1 and then from df2 and then comparing these two lists and creating a third list with what exists in second list but not the first. Since I need the count of items in the final list, I am doing a len on third list. My code is like below:
list1 = df1.col1.unique()
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
This is getting my task done, but taking a very long time to run, probably because my dfs are quite big. I was wondering if there is a smarter and more efficient way of doing this please. I would appreciate any help
CodePudding user response:
Use:
df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].unique()
output: array(['XYZ', 'PQR'], dtype=object)
Or, with sets:
set(df2['col1']) - set(df1['col1'])
output: {'PQR', 'XYZ'}
CodePudding user response:
I encounter a similar problem but more difficult. My problem is to compare the unique combination and get the differences between df1 and df2. Here I post the 2 solutions in case you need them.
Solution key idea: Using a trick of concat, groupby or merge.
If you want to get the joint or the mutual of the two dataframes. By getting unique of df1, and df2, called u1 and u2. You concat u1, and u2, then use groupby to count for the number of occurance. If more than 1 then it appears in both u1 and u2. If it is 1, then it appears in one of the two u1 or u2.
If you want to get the uniques from df1 or df2 only, then use merge trick with option indicator=True
Data set for replication:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id':['a','a','b','b','c','d','e'],
'val': [1,2,1,3,4,5,6]})
df2 = pd.DataFrame({'id':['a','a','b','b','c','c','e','f','f','d'],
'val': [1,2,1,3,4,5,5,7,8,9]})
Problem 1: a single column comparison
# Getting unique in each dataframe and concat
u1 = pd.DataFrame(df1['id'].unique(), columns=['id'])
u2 = pd.DataFrame(df2['id'].unique(), columns=['id'])
u = pd.concat([u1,u2])
# Groupby, count the number of occurrence with function `size`:
u.groupby('id').size().reset_index()
# You can do the rest by your choice if you want the joint or the not joint part
Problem 2: multiple columns combination
# Getting unique combination of `id` and `val` by using trick of `size()` in `groupby`:
u1 = df1.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u2 = df2.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u = pd.concat([u1,u2])
# Groupby, count the number of occurrence with function `size`:
u.groupby('id').size().reset_index()
# You can do the rest by your choice if you want the joint or the not joint part
If you need uique of df1 or df2 only
# This will tell you the uniques combinations belong to df1 or df2 or both:
pd.merge(u1, u2, how='outer', on='id', indicator=True)
This should speed up your old code.
Hope this help
