I have three DataFrames (df1, df2, df3). I want to merge them on a column and add two new columns: one should say which DataFrames matched, and the other how many of them matched.
import pandas as pd

# df1
data = {'ID': ["M1", "M2", "M3", "M4"],
'Movie': ["Top gun", "Thor", "Batman", "MadMax"],
'Actor' : ["Tom", "Chris", "Bale", "Tom"],
'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
# df2
data = {'ID': ["M1", "M2", "M3"],
'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
# df3
data = {'ID': ["M2", "M3"],
'no of parts': [3, 3],
'co-star' : ["portman", "neeson"],
'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
Expected output will be
match and No of match are the new columns.
Thank you for your time
Any help would be much appreciated
CodePudding user response:
You can merge your three dataframes on ID, then use the indicator parameter of merge to determine which dataframes had valid data, and use that information to build the match column. Counting the | characters in match and adding 1 then gives the No of match column:
import pandas as pd
data = {'ID': ["M1", "M2", "M3", "M4"], 'Movie': ["Top gun", "Thor", "Batman", "MadMax"], 'Actor' : ["Tom", "Chris", "Bale", "Tom"], 'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
data = {'ID': ["M1", "M2", "M3"], 'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
data = {'ID': ["M2", "M3"], 'no of parts': [3, 3], 'co-star' : ["portman", "neeson"], 'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
df = df1.merge(df2, on='ID', how='left', indicator='df1df2').merge(df3, on='ID', how='left',indicator='df3')
df['match'] = df['df1df2'].map({'both':'df1|df2', 'left_only':'df1'}) + df['df3'].map({'both':'|df3', 'left_only':''})
df['No of match'] = df['match'].str.count(r'\|') + 1
df = df.drop(['df1df2', 'df3'], axis=1)
Output:
ID Movie Actor type highlight no of parts co-star award match No of match
0 M1 Top gun Tom Action Flight school NaN NaN NaN df1|df2 2
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes df1|df2|df3 3
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi NaN NaN NaN NaN df1 1
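To see what the indicator parameter contributes before any mapping is applied, here is a minimal sketch of the first merge only. The two frames are trimmed copies of the question's df1 and df2, and the variable names step and flags are just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ["M1", "M2", "M3", "M4"],
                    'Movie': ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({'ID': ["M1", "M2", "M3"],
                    'highlight': ["Flight school", "Love and thunder", "I am Batman"]})

# Passing a string to indicator adds a categorical column with that name.
# With how='left' its values are 'both' (ID found in df2) or 'left_only'.
step = df1.merge(df2, on='ID', how='left', indicator='df1df2')

flags = step['df1df2'].astype(str).tolist()
print(flags)  # ['both', 'both', 'both', 'left_only']
```

Mapping 'both' to 'df1|df2' and 'left_only' to 'df1' is then a plain dictionary lookup on that column.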
CodePudding user response:
Use DataFrame.merge with a left join and the indicator parameter to see which DataFrames matched. Then use DataFrame.pop to remove each indicator column while mapping it through a dictionary with Series.map, append the mapped df3 column to the result, and finally count the | characters with Series.str.count:
df = (df1.merge(df2, on='ID', how='left', indicator='df2')
         .merge(df3, on='ID', how='left', indicator='df3'))
df['match'] = (df.pop('df2').map({'both':'df1|df2', 'left_only':'df1'}) +
               df.pop('df3').map({'both':'|df3', 'left_only':''}))
df['No of match'] = df['match'].str.count(r'\|') + 1
print (df)
ID Movie Actor type highlight no of parts co-star award \
0 M1 Top gun Tom Action Flight school NaN NaN NaN
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes
3 M4 MadMax Tom SciFi NaN NaN NaN NaN
match No of match
0 df1|df2 2
1 df1|df2|df3 3
2 df1|df2|df3 3
3 df1 1
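The counting step relies on Series.str.count treating its pattern as a regular expression, which is why the pipe is escaped. A small sketch on a hand-built match column (the values are copied from the output above; n_regex and n_plain are illustrative names):

```python
import pandas as pd

match = pd.Series(['df1|df2', 'df1|df2|df3', 'df1|df2|df3', 'df1'])

# Series.str.count takes a regex, so the pipe has to be escaped.
# A raw string avoids Python's invalid-escape warning for '\|'.
n_regex = match.str.count(r'\|') + 1

# Same result without regex: plain Python str.count on each value.
n_plain = match.map(lambda s: s.count('|') + 1)

print(n_regex.tolist())  # [2, 3, 3, 1]
```

Both variants add 1 because n separators mean n + 1 names.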
CodePudding user response:
You can also try this one, which chains the merges with functools.reduce and a lambda:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]
# outer-merge all frames on ID, one after another
df_final = reduce(lambda left,right: pd.merge(left,right,on='ID',how='outer'), dfs)
# one non-ID column per frame, used to detect whether that frame matched
df_temp = df_final[[df1.columns[1],df2.columns[1],df3.columns[1]]]
df_final["match"] = df_temp.apply(lambda x: "|".join(["df" + str(idx + 1) for idx,i in enumerate(x) if pd.notna(i)]),axis=1)
df_final["No of match"] = df_final["match"].apply(lambda x: x.count("|") + 1)
Output:
ID Movie Actor type ... co-star award match No of match
0 M1 Top gun Tom Action ... NaN NaN df1|df2 2
1 M2 Thor Chris SciFi ... portman yes df1|df2|df3 3
2 M3 Batman Bale Comic ... neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi ... NaN NaN df1 1
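Because this version uses how='outer', it should also report IDs that are missing from df1, which the left-join answers drop. A minimal sketch under that assumption; the small frames and the extra ID "M5" are made up for illustration, and the approach presumes the probe column of each frame never holds NaN in the source data.

```python
from functools import reduce
import pandas as pd

# Hypothetical data: 'M5' exists only in df2.
df1 = pd.DataFrame({'ID': ['M1', 'M2'], 'Movie': ['Top gun', 'Thor']})
df2 = pd.DataFrame({'ID': ['M1', 'M5'], 'highlight': ['Flight school', 'Extra']})
df3 = pd.DataFrame({'ID': ['M2'], 'no of parts': [3]})

dfs = [df1, df2, df3]
merged = reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'), dfs)

# First non-ID column of each frame serves as the "did this frame match" probe.
probe = merged[[d.columns[1] for d in dfs]]
merged['match'] = probe.apply(
    lambda row: '|'.join(f'df{i}' for i, v in enumerate(row, start=1) if pd.notna(v)),
    axis=1)

result = merged.set_index('ID')['match'].to_dict()
print(result['M5'])  # df2
```

Keying the result by ID sidesteps any dependence on the row order the outer merge produces.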
CodePudding user response:
You can use pandas.concat on a list of the input DataFrames. This works for any number of input DataFrames (not just 3):
# dataframes will be later named in order: 1->2->3
# you can easily tweak this solution to use a dictionary
# and custom names if desired
dfs = [df1, df2, df3]
out = (pd
   .concat([d.set_index('ID').assign(ID=f'df{i}')
            for i,d in enumerate(dfs, start=1)], axis=1)
   .assign(**{'match': lambda d: d[['ID']].agg(lambda x: '|'.join(x.dropna()),
                                               axis=1),
              'No of matches': lambda d: d[['ID']].notna().sum(axis=1)
             })
   .drop('ID', axis=1).reset_index()
)
NB: this approach uses a temporary ID column, so make sure the name you use for it is not already a column in any of the input DataFrames. You can choose another name for safety if needed.
output:
ID Movie Actor type highlight no of parts co-star award match No of matches
0 M1 Top gun Tom Action Flight school NaN NaN NaN df1|df2 2
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes df1|df2|df3 3
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi NaN NaN NaN NaN df1 1
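The code comments above mention that the solution can be adapted to a dictionary with custom names. One way that might look is sketched below; the labels 'movies', 'highlights' and 'extras', the temporary column name '_src', and the trimmed frames are all assumptions made for this example, and apply is used in place of agg.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['M1', 'M2', 'M3'], 'Movie': ['Top gun', 'Thor', 'Batman']})
df2 = pd.DataFrame({'ID': ['M1', 'M2'], 'highlight': ['Flight school', 'Love and thunder']})
df3 = pd.DataFrame({'ID': ['M2'], 'no of parts': [3]})

# Hypothetical labels chosen for this sketch; any unique strings work.
frames = {'movies': df1, 'highlights': df2, 'extras': df3}
TMP = '_src'  # temporary column name, assumed absent from every input

# Each frame gets a temporary column holding its own label.
wide = pd.concat([d.set_index('ID').assign(**{TMP: name})
                  for name, d in frames.items()], axis=1)

src = wide[[TMP]]  # selects every temporary column (the name is duplicated)
out = (wide.drop(columns=TMP)
           .assign(match=src.apply(lambda r: '|'.join(r.dropna()), axis=1),
                   **{'No of matches': src.notna().sum(axis=1)})
           .reset_index())

print(out[['ID', 'match', 'No of matches']])
```

Using a name like '_src' instead of reusing ID keeps the temporary column visibly separate from the join key.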
CodePudding user response:
May not be the best way, but you can try this:
import numpy as np
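def f(x):
    # x is one row: the current match list followed by the three df3 columns
    if str(x.iloc[1]) == 'nan' and str(x.iloc[2]) == 'nan' and str(x.iloc[3]) == 'nan':
        # no df3 data for this row: keep what the first merge found
        if len(x.iloc[0]) == 1:
            return ['df1']
        if len(x.iloc[0]) == 2:
            return ['df1', 'df2']
    else:
        return ['df1', 'df2', 'df3']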
df = df1.merge(df2,how='outer')
df['match'] = df['highlight'].apply(lambda x: ['df1','df2'] if str(x)!= 'nan' else ['df1'])
df = df.merge(df3,how='outer')
df['match'] = df[['match','no of parts','co-star','award']].apply(f,axis=1)
df['No of match'] = df['match'].apply(lambda x:len(x))
df['match'] = df['match'].apply(lambda x:'|'.join(x))
df = df[['ID', 'Movie', 'Actor', 'type', 'highlight', 'no of parts',
'co-star', 'award','match', 'No of match']]
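If only the match and No of match values are needed, the hand-written branching can probably be avoided. The sketch below swaps in a different technique from the answer above: a Series.isin membership test per frame. Only the ID columns from the question are used, and flags, match and count are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['M1', 'M2', 'M3', 'M4']})
df2 = pd.DataFrame({'ID': ['M1', 'M2', 'M3']})
df3 = pd.DataFrame({'ID': ['M2', 'M3']})

named = {'df1': df1, 'df2': df2, 'df3': df3}

# One boolean column per frame: is this df1 ID present in that frame?
flags = pd.DataFrame({name: df1['ID'].isin(d['ID']) for name, d in named.items()})

# Join the names of the frames that matched, and count them.
match = flags.apply(lambda row: '|'.join(name for name, hit in row.items() if hit), axis=1)
count = flags.sum(axis=1)

print(match.tolist())  # ['df1|df2', 'df1|df2|df3', 'df1|df2|df3', 'df1']
```

These two Series line up with df1's rows, so they can be assigned onto a left-merged result that keeps df1's order.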

