I have three dataframes of the same dimensions. I would like to find common occurrences (1s and -1s) in at least 2 out of the 3 dataframes. I want an output dataframe of the same dimensions in which the elements are in at least two dataframes.
I have written the following example for everyone to understand the problem. In the example I want to find in which positions there is a 1 or a -1 in at least two dataframes.
import pandas as pd
A= {'a': [0, '.', 0, -1],'b': [0, '.', 1, 0], 'c':[1,'.', 0, 1] }
A = pd.DataFrame(data=A)
a b c
0 0 0 1
1 . . .
2 0 1 0
3 -1 0 1
B= {'a': [0, '.', 0, -1],'b': [1, '.', 1, 0], 'c':[1,'.', 0, -1] }
B = pd.DataFrame(data=B)
B
a b c
0 0 1 1
1 . . .
2 0 1 0
3 -1 0 -1
C = {'a': [0, '.', 0, 0],'b': [1, '.', 1, 0], 'c':[0,'.', 0, 0] }
C = pd.DataFrame(data=C)
C
a b c
0 0 1 0
1 . . .
2 0 1 0
3 0 0 0
The desired output would be:
a b c
0 0 1 1
1 . . .
2 0 1 0
3 -1 0 0
I have tried few things but none of them worked.
I would appreciate any help.
Thanks a lot!
CodePudding user response:
You can use the underlying numpy arrays to count the values per value and use numpy.select to map an arbitrary number of choices. For are treated here independently but could also be added as a value to check.
dfs = [A,B,C]
vals = [1, -1]
masks = [sum(x.eq(val).astype(int) for x in dfs).ge(2)
for val in vals]
pd.DataFrame(np.select(masks, vals),
columns=A.columns, index=A.index).mask(A.eq('.'), '.')
Output:
a b c
0 0 1 1
1 . . .
2 0 1 0
3 -1 0 0
Dot handled as 1/-1:
dfs = [A,B,C]
vals = ['.', 1, -1]
masks = [sum(x.eq(val).astype(int) for x in dfs).ge(2)
for val in vals]
pd.DataFrame(np.select(masks, vals),
columns=A.columns, index=A.index)
CodePudding user response:
I could imagine that there are better solutions, but you could use:
from collections import Counter
final_df = pd.DataFrame(columns=["a", "b", "c"])
for i in range(0, len(A)):
a = 0
temp = []
for j in A.columns:
count = Counter([A[j][i], B[j][i], C[j][i]]).most_common()
if count[0][1] > 1:
temp.append(count[0][0])
else:
temp.append(0)
a =1
final_df = pd.concat([final_df, pd.DataFrame([temp], columns=["a", "b", "c"])])
Output:
a b c
0 0 1 1
0 . . .
0 0 1 0
0 -1 0 0
CodePudding user response:
First solving the problem
- concat the dfs
- groupby the lower index
- aggregate with values count keeping the first element
df = (pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
.groupby(level=1)
.agg(lambda x: (x.value_counts().iloc[0] >= 2) * x.value_counts().index[0])
)
Output
a b c
0 0 1 1
1 . . .
2 0 1 0
3 -1 0 0
Cleaning the code
The line .agg(lambda x: (x.value_counts().iloc[0] >= 2) * x.value_counts().index[0]):
- is not fancy
- raises a warning under pandas 1.1.5:
DeprecationWarning: In future, it will be an error for 'np.bool_' scalars to be interpreted as an index This is separate from the ipykernel package so we can avoid doing imports until
With a function in replacement of the lambda we can enhance on both points:
def series_rule(s):
vc = s.value_counts()
return vc.index[0] if (vc.iloc[0] >= 2) else 0
df = (pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
.groupby(level=1)
.agg(series_rule)
)
Generalizing
We can now define a reusable function :
def Rachael_rule(pandas_serie):
vc = pandas_serie.value_counts()
return vc.index[0] if (vc.iloc[0] >= 2) else 0
def df_list_apply(rule, df_list):
"""
Applies a rule to a list of dataframe of same shapes
Parameters;
- rule : a function taking a pandas.Series argument and returnig a value
- df_list a list of pandas.DataFRame
Return:
the dataframe obtained by apllyng the rule on the dimenion of the list
"""
return (pd.concat(df_list, axis=0, keys=list(range(len(df_list))))
.groupby(level=1)
.agg(rule)
)
Then:
>>> df_list_apply(Rachael_rule, [A, B, C])
[Out]
a b c
0 0 1 1
1 . . .
2 0 1 0
3 -1 0 0
