How to intersect 3 dataframes of the same dimensions and output a dataframe of common occurrences in-CodePudding

I have three dataframes of the same dimensions. I would like to find common occurrences (1s and -1s) in at least 2 out of the 3 dataframes. I want an output dataframe of the same dimensions in which the elements are in at least two dataframes.

I have written the following example for everyone to understand the problem. In the example I want to find in which positions there is a 1 or a -1 in at least two dataframes.

import pandas as pd

A= {'a': [0, '.', 0, -1],'b': [0, '.', 1, 0], 'c':[1,'.', 0, 1]   }
A = pd.DataFrame(data=A)

    a  b  c
0   0  0  1
1   .  .  .
2   0  1  0
3  -1  0  1

B= {'a': [0, '.', 0, -1],'b': [1, '.', 1, 0], 'c':[1,'.', 0, -1]   }
B = pd.DataFrame(data=B)

B
    a  b   c
0   0  1   1
1   .  .   .
2   0  1   0
3  -1  0  -1

C = {'a': [0, '.', 0, 0],'b': [1, '.', 1, 0], 'c':[0,'.', 0, 0]   }
C = pd.DataFrame(data=C)

C
   a  b  c
0  0  1  0
1  .  .  .
2  0  1  0
3  0  0  0

The desired output would be:

    a  b  c
0   0  1  1
1   .  .  .
2   0  1  0
3  -1  0  0

I have tried few things but none of them worked.

I would appreciate any help.

Thanks a lot!

CodePudding user response：

You can use the underlying numpy arrays to count the values per value and use numpy.select to map an arbitrary number of choices. For are treated here independently but could also be added as a value to check.

dfs = [A,B,C]
vals = [1, -1]

masks = [sum(x.eq(val).astype(int) for x in dfs).ge(2)
         for val in vals]

pd.DataFrame(np.select(masks, vals),
             columns=A.columns, index=A.index).mask(A.eq('.'), '.')

Output:

    a  b  c
0   0  1  1
1   .  .  .
2   0  1  0
3  -1  0  0

Dot handled as 1/-1:

dfs = [A,B,C]
vals = ['.', 1, -1]
masks = [sum(x.eq(val).astype(int) for x in dfs).ge(2)
         for val in vals]
pd.DataFrame(np.select(masks, vals),
             columns=A.columns, index=A.index)

CodePudding user response：

I could imagine that there are better solutions, but you could use:

from collections import Counter

final_df = pd.DataFrame(columns=["a", "b", "c"])

for i in range(0, len(A)):
    a = 0
    temp = []
    for j in A.columns:
        count = Counter([A[j][i], B[j][i], C[j][i]]).most_common()
        if count[0][1] > 1:
            temp.append(count[0][0])
        else:
            temp.append(0)
        a =1
    final_df = pd.concat([final_df, pd.DataFrame([temp], columns=["a", "b", "c"])])

Output:

    a   b   c
0   0   1   1
0   .   .   .
0   0   1   0
0   -1  0   0

CodePudding user response：

First solving the problem

concat the dfs
groupby the lower index
aggregate with values count keeping the first element

df = (pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
        .groupby(level=1)
        .agg(lambda x: (x.value_counts().iloc[0] >= 2) * x.value_counts().index[0])
)

Output

    a  b   c
0   0  1   1
1   .  .   .
2   0  1   0
3  -1  0   0

Cleaning the code

The line .agg(lambda x: (x.value_counts().iloc[0] >= 2) * x.value_counts().index[0]):

is not fancy

raises a warning under pandas 1.1.5:

DeprecationWarning: In future, it will be an error for 'np.bool_' scalars to be interpreted as an index
This is separate from the ipykernel package so we can avoid doing imports until

With a function in replacement of the lambda we can enhance on both points:

def series_rule(s):
    vc = s.value_counts()
    return  vc.index[0] if (vc.iloc[0] >= 2) else 0  
    
df = (pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
        .groupby(level=1)
        .agg(series_rule)
)

Generalizing

We can now define a reusable function :

def Rachael_rule(pandas_serie):
    vc = pandas_serie.value_counts()
    return  vc.index[0] if (vc.iloc[0] >= 2) else 0  

def df_list_apply(rule, df_list):
    """
    Applies a rule to a list of dataframe of same shapes
    
    Parameters;
    - rule : a function taking a pandas.Series argument and returnig a value
    - df_list a list of pandas.DataFRame

    Return:
      the dataframe obtained by apllyng the rule on the dimenion of the list
    """    
    return (pd.concat(df_list, axis=0, keys=list(range(len(df_list))))
              .groupby(level=1)
              .agg(rule)
           )

Then:

>>> df_list_apply(Rachael_rule, [A, B, C])
[Out]
    a  b   c
0   0  1   1
1   .  .   .
2   0  1   0
3  -1  0   0