efficient way of getting unique values by comparing columns in two dataframes pandas-CodePudding

I have two dataframes, something like below:

  df1:

   date             col1          col2         col3
 15-5-2022          ABC            1            PQR
 16-5-2022          BCD            2            ABC
 17-5-2022          CDE            4            XYZ


  df2:

   date           col1          col2         col3
 5-4-2022          XYZ            1           ABC
 6-4-2022          PQR            2           ABC
 7-4-2022          BCD            4           PQR

My task is to get total number of unique values that are in df2.col1 but not in df1.col1. The way I am doing this is by creating first a list of all col1 unique values from df1 and then from df2 and then comparing these two lists and creating a third list with what exists in second list but not the first. Since I need the count of items in the final list, I am doing a len on third list. My code is like below:

list1 = df1.col1.unique()    
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)

This is getting my task done, but taking a very long time to run, probably because my dfs are quite big. I was wondering if there is a smarter and more efficient way of doing this please. I would appreciate any help

CodePudding user response：

Use:

df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].unique()

output: array(['XYZ', 'PQR'], dtype=object)

Or, with sets:

set(df2['col1']) - set(df1['col1'])

output: {'PQR', 'XYZ'}

CodePudding user response：

I encounter a similar problem but more difficult. My problem is to compare the unique combination and get the differences between df1 and df2. Here I post the 2 solutions in case you need them.

Solution key idea: Using a trick of concat, groupby or merge.

If you want to get the joint or the mutual of the two dataframes. By getting unique of df1, and df2, called u1 and u2. You concat u1, and u2, then use groupby to count for the number of occurance. If more than 1 then it appears in both u1 and u2. If it is 1, then it appears in one of the two u1 or u2.

If you want to get the uniques from df1 or df2 only, then use merge trick with option indicator=True

Data set for replication:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id':['a','a','b','b','c','d','e'],
                   'val': [1,2,1,3,4,5,6]})
df2 = pd.DataFrame({'id':['a','a','b','b','c','c','e','f','f','d'],
                   'val': [1,2,1,3,4,5,5,7,8,9]})

Problem 1: a single column comparison

# Getting unique in each dataframe and concat
u1 = pd.DataFrame(df1['id'].unique(), columns=['id'])
u2 = pd.DataFrame(df2['id'].unique(), columns=['id'])
u = pd.concat([u1,u2])

# Groupby, count the number of occurrence with function `size`:
u.groupby('id').size().reset_index()

# You can do the rest by your choice if you want the joint or the not joint part

Problem 2: multiple columns combination

# Getting unique combination of `id` and `val` by using trick of `size()` in `groupby`:

u1 = df1.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u2 = df2.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u = pd.concat([u1,u2])

# Groupby, count the number of occurrence with function `size`:
u.groupby('id').size().reset_index()

# You can do the rest by your choice if you want the joint or the not joint part

If you need uique of df1 or df2 only

# This will tell you the uniques combinations belong to df1 or df2 or both:    
pd.merge(u1, u2, how='outer', on='id', indicator=True)

This should speed up your old code.

Hope this help