Home > Blockchain >  pandas: comparing non-identical list of panda dataframes based on values from a certain column
pandas: comparing non-identical list of panda dataframes based on values from a certain column

Time:02-09

I have a two lists of panda dataframes as follows,

import pandas as pd
import numpy as np
list_one = [pd.DataFrame({'sent_a.1': [0, 3, 2, 1], 'sent_a.2': [0, 1, 4, 0], 'sent_b.3': [0, 6, 0, 8],'sent_b.4': [1, 1, 8, 6],'ID':['id_1','id_1','id_1','id_1']}),
        pd.DataFrame({'sent_a.1': [0, 3], 'sent_a.2': [0, 2], 'sent_b.3': [0, 6],'sent_b.4': [1, 1],'ID':['id_2','id_2']})]

list_two = [pd.DataFrame({'sent_a.1': [0, 5], 'sent_a.2': [0, 1], 'sent_b.3': [0, 6],'sent_b.4': [1, 1],'ID':['id_2','id_2']}),
            pd.DataFrame({'sent_a.1': [0, 5, 3, 1], 'sent_a.2': [0, 2, 3, 1], 'sent_b.3': [0, 6, 6, 8],'sent_b.4': [1, 5, 8, 5],'ID':['id_1','id_1','id_1','id_1']})]

I would like to compare the dataframes in these two lists and if the values are the same, I would like to replace the value with 'True' and if the values are different, I would like to set them to 'False' and save the result in a different list of panda dataframes. I have done the following,

for dfs in list_one:
    for dfs2 in list_two:
       g = np.where(dfs == dfs2, 'True', 'False')
       print (g)

but I get the error,

ValueError: Can only compare identically-labeled DataFrame objects

how can I sort values in these two lists, based on the values from column 'ID'?

Edit I would like the dataframes that have the same value for column 'ID' to be compared. meaning that dataframes that have 'ID' == 'id_1' are to be compared with one another and dataframes that have 'ID' == 'id_2' to be compared with each other (not a cross comparison)

so the desired output is:

output = [   sent_a.1  sent_a.2  sent_b.3  sent_b.4    ID
        0    True         True      True       True   id_1

        1    False        False     True       False  id_1

        2    False        False     False      True   id_1

        3    False        False     True       True   id_1, 
             sent_a.1  sent_a.2  sent_b.3  sent_b.4    ID
        0    True         True      True       True   id_2
        1    True         True      False      False  id_2]

CodePudding user response:

Based on your current example

For your first question:

how can I sort values in these two lists, based on the values from column 'ID'?

list_one = sorted(list_one,key=lambda x: x['ID'].unique()[0][3:], reverse=False)
list_two =sorted(list_two,key=lambda x: x['ID'].unique()[0][3:], reverse=False)

ValueError: Can only compare identically-labeled DataFrame objects

  • error due to different index values order in dataframes or dataframes are of different shapes

First way of comparison:

for dfs in list_one:
    for dfs2 in list_two:
        if dfs.shape == dfs2.shape:
            g = np.where(dfs == dfs2, 'True', 'False')
            print (g)

Second way:

I would like the dataframes that have the same value for column 'ID' to be compared

for dfs in list_one:
    for dfs2 in list_two:
        if (dfs['ID'].unique() == dfs2['ID'].unique()) and (dfs.shape == dfs2.shape):
            g = np.where(dfs == dfs2, 'True', 'False')
            print (g)
  •  Tags:  
  • Related