I'm trying to make a Champions League Fantasy statistic table, and I'm getting data from 2 different sites that name the player slightly different between each other.
I have df1 from site 1:
name age team skill cost gls ast
0 Lionel Messi 34-175 Paris 4 11.3 5 0
1 Ryan Gravenberch 19-214 Ajax 3 6.2 0 0
2 Junior Messias 30-217 Milan 3 6.5 1 0
3 Kepa Arrizabalaga 27-074 Chelsea 1 5.0 0 0
4 Kenneth Taylor 19-214 Ajax 3 5.0 0 0
5 Alisson 30-320 Liverpool 1 6.1 0 0
And df2 from site 2:
name age team gls ast
0 Kepa 27-074 Chelsea 0 0
1 Lionel 34-175 Paris 5 0
2 Junior 30-217 Milan 1 0
3 Kenneth 19-214 Ajax 0 0
4 Neymar 29-314 Paris 0 0
5 Ryan 19-214 Ajax 0 0
My goal is to match the names based on multiple conditions:
- Age (string in
df2equal to string indf1) - Team (string in
df2equal to string indf1) - Name (string in
df2is contained in string indf1)
The reason I want to pass the name as last condition is because there are cases were two players were born the exact same day and play for the same team like Kenneth Taylor and Ryan Gravenberch
I'm thinking of something like this:
df2.loc[(df2['team'] == df1['team']) & (df2['age'] == df1['age']) & (df2['name'].str.contains(df1['name'].str)), 'name'] = df1['name']
But I'm getting this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
The desire output for df2 is:
name age team gls ast
0 Kepa Arrizabalaga 27-074 Chelsea 0 0
1 Lionel Messi 34-175 Paris 5 0
2 Junior Messias 30-217 Milan 1 0
3 Kenneth Taylor 19-214 Ajax 0 0
4 Neymar 29-314 Paris 0 0
5 Ryan Gravenberch 19-214 Ajax 0 0
Where all the names from df2 that matched the conditions were replaced with the names from df1
CodePudding user response:
def name_match(row):
if len(row['potential_match']) > len(row['name']):
larger = row['potential_match']
smaller = row['name']
else:
larger = row['name']
smaller = row['potential_match']
if smaller in larger:
return row['potential_match']
else:
return row['name']
df2['potential_match'] = df2[['age', 'team']].merge(
df1[['name', 'age', 'team']],
on=['age', 'team'],
how='left')['name']
df2['name'] = df2.apply(name_match , axis=1)
df2.drop('potential_match', axis=1, inplace=True)
CodePudding user response:
Use this to get the answer you desire. There is no need of separate condition again based on name.
df2.loc[(df2['team'] == df1['team']) & (df2['gls'] == df1['gls']), 'name'] = df1['name']
CodePudding user response:
Try with merge:
matches = df2.merge(df1[["name", "age", "team"]],
on=["age", "team"],
how="left")
matches["name_y"] = matches["name_y"].fillna(matches["name_x"])
matches = matches.where(matches.apply(lambda x: x["name_x"] in x["name_y"], axis=1)).dropna()
output = matches.drop("name_x", axis=1).rename(columns={"name_y": "name"}).reindex(df2.columns, axis=1)
>>> output
name age team gls ast
0 Kepa Arrizabalaga 27-074 Chelsea 0.0 0.0
1 Lionel Messi 34-175 Paris 5.0 0.0
2 Junior Messias 30-217 Milan 1.0 0.0
4 Kenneth Taylor 19-214 Ajax 0.0 0.0
5 Neymar 29-314 Paris 0.0 0.0
6 Ryan Gravenberch 19-214 Ajax 0.0 0.0
CodePudding user response:
(i) Merge df2 from the right to df1 on age and team.
(ii) Assign names in df2 but not in df1 to the name column from df1 (which is name_x).
(iii) Filter out the names not matching across name_x and name_y columns and drop name_y.
df3 = df1[['name','age','team']].merge(df2, on=['age','team'], how='right')
mask = pd.isna(df3['name_x'])
df3.loc[mask,'name_x'] = df3.loc[mask,'name_y'].to_numpy()
df3 = df3[df3.apply(lambda x: x['name_y'] in x['name_x'], axis=1)].drop('name_y', axis=1)
Output:
name_x age team gls ast
0 Kepa Arrizabalaga 27-074 Chelsea 0 0
1 Lionel Messi 34-175 Paris 5 0
2 Junior Messias 30-217 Milan 1 0
4 Kenneth Taylor 19-214 Ajax 0 0
5 Neymar 29-314 Paris 0 0
6 Ryan Gravenberch 19-214 Ajax 0 0
