Custom function applied between two different pandas dataframes-CodePudding

I am trying to find a way to apply a custom function between the same columns in two separate pandas dataframes, to get a new dataframe.

So, I have 2 dataframes likes so:

import pandas as pd
df1 = pd.DataFrame({'num':['NUM1','NUM2','NUM3','NUM4','NUM5'],'mean':[3,4,6,8,2],'count':[2,2,1,4,5]})
df2 = pd.DataFrame({'num':['NUM1','NUM4','NUM5','NUM6','NUM7'],'mean':[4,2,2,5,1],'count':[2,3,1,1,1]})

df1
    num     mean    count
0   NUM1    3        2
1   NUM2    4        2
2   NUM3    6        1
3   NUM4    8        4
4   NUM5    2        5

df2

    num     mean    count
0   NUM1    4        2
1   NUM4    2        3
2   NUM5    2        1
3   NUM6    5        1
4   NUM7    1        1

I also got a custom function that updates the mean of two separate samples:

def mean_updater(mean1,n1,mean2,n2):
    return round(((mean1*n1) (mean2*n2))/(n1 n2),2)

And a function to update the count:

def count_updater(n1,n2):
return n1 n2

What I am trying to achieve is a new df that contains updated mean and count for matching records (present in both dfs), with records not present in both dfs appended with no changes.

So, the result should be:

new_df

    num     mean    count
0   NUM1    3.5      4
1   NUM2    4        2
2   NUM3    6        1
3   NUM4    5.43     7
4   NUM5    2        6
5   NUM6    5        1
6   NUM7    1        1

CodePudding user response：

Not the fanciest solution, but might work for your use case:

# merge the data
common = pd.merge(df1, df2, on="num", how="outer")

common["mean"] = common[["mean_x", "mean_y"]].mean(axis=1)
common["count"] = common[["count_x", "count_y"]].sum(axis=1)

common[["num", "mean", "count"]]

CodePudding user response：

You can pass columns to your functions, then join them in concat and last replace missing values by original df11 and df22:

#changed to np.round for working with arrays
def mean_updater(mean1,n1,mean2,n2):
    return np.round(((mean1*n1) (mean2*n2))/(n1 n2),2)

def count_updater(n1,n2):
    return n1 n2

#for match by num columns
df11 = df1.set_index('num')
df22 = df2.set_index('num')

s1 = mean_updater(df11['mean'], df11['count'], df22['mean'], df22['count'])
s2 = count_updater(df11['count'], df22['count'])


df = (pd.concat([s1.rename('mean'), s2], axis=1)
        .fillna(df11)
        .fillna(df22)
        .reset_index())
print (df)
    num  mean  count
0  NUM1  3.50    4.0
1  NUM2  4.00    2.0
2  NUM3  6.00    1.0
3  NUM4  5.43    7.0
4  NUM5  2.00    6.0
5  NUM6  5.00    1.0
6  NUM7  1.00    1.0