Pandas: change value based on multiple conditions of different type-CodePudding

I'm trying to make a Champions League Fantasy statistic table, and I'm getting data from 2 different sites that name the player slightly different between each other.

I have df1 from site 1:

                name     age       team skill  cost  gls  ast
0       Lionel Messi  34-175      Paris     4  11.3    5    0
1   Ryan Gravenberch  19-214       Ajax     3   6.2    0    0
2     Junior Messias  30-217      Milan     3   6.5    1    0
3  Kepa Arrizabalaga  27-074    Chelsea     1   5.0    0    0
4     Kenneth Taylor  19-214       Ajax     3   5.0    0    0
5            Alisson  30-320  Liverpool     1   6.1    0    0

And df2 from site 2:

      name     age       team  gls  ast
0     Kepa  27-074    Chelsea    0    0
1   Lionel  34-175      Paris    5    0
2   Junior  30-217      Milan    1    0
3  Kenneth  19-214       Ajax    0    0
4   Neymar  29-314      Paris    0    0
5     Ryan  19-214       Ajax    0    0

My goal is to match the names based on multiple conditions:

Age (string in df2 equal to string in df1)
Team (string in df2 equal to string in df1)
Name (string in df2 is contained in string in df1)

The reason I want to pass the name as last condition is because there are cases were two players were born the exact same day and play for the same team like Kenneth Taylor and Ryan Gravenberch

I'm thinking of something like this:

df2.loc[(df2['team'] == df1['team']) & (df2['age'] == df1['age']) & (df2['name'].str.contains(df1['name'].str)), 'name'] = df1['name']

But I'm getting this error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

The desire output for df2 is:

                name     age     team  gls  ast
0  Kepa Arrizabalaga  27-074  Chelsea    0    0
1       Lionel Messi  34-175    Paris    5    0
2     Junior Messias  30-217    Milan    1    0
3     Kenneth Taylor  19-214     Ajax    0    0
4             Neymar  29-314    Paris    0    0
5   Ryan Gravenberch  19-214     Ajax    0    0

Where all the names from df2 that matched the conditions were replaced with the names from df1

CodePudding user response：

def name_match(row):
    if len(row['potential_match']) > len(row['name']):
        larger = row['potential_match']
        smaller = row['name']
    else:
        larger = row['name']
        smaller = row['potential_match']
    if smaller in larger:
        return row['potential_match']
    else:
        return row['name']

df2['potential_match'] = df2[['age', 'team']].merge(
  df1[['name', 'age', 'team']], 
  on=['age', 'team'], 
  how='left')['name']

df2['name'] = df2.apply(name_match , axis=1)

df2.drop('potential_match', axis=1, inplace=True)

CodePudding user response：

Use this to get the answer you desire. There is no need of separate condition again based on name.

df2.loc[(df2['team'] == df1['team']) & (df2['gls'] == df1['gls']), 'name'] = df1['name']

CodePudding user response：

Try with merge:

matches = df2.merge(df1[["name", "age", "team"]],
                    on=["age", "team"],
                    how="left")
matches["name_y"] = matches["name_y"].fillna(matches["name_x"])
matches = matches.where(matches.apply(lambda x: x["name_x"] in x["name_y"], axis=1)).dropna()
output = matches.drop("name_x", axis=1).rename(columns={"name_y": "name"}).reindex(df2.columns, axis=1)

>>> output
                 name     age     team  gls  ast
0  Kepa Arrizabalaga  27-074  Chelsea  0.0  0.0
1       Lionel Messi  34-175    Paris  5.0  0.0
2     Junior Messias  30-217    Milan  1.0  0.0
4     Kenneth Taylor  19-214     Ajax  0.0  0.0
5             Neymar  29-314    Paris  0.0  0.0
6   Ryan Gravenberch  19-214     Ajax  0.0  0.0

CodePudding user response：

(i) Merge df2 from the right to df1 on age and team.

(ii) Assign names in df2 but not in df1 to the name column from df1 (which is name_x).

(iii) Filter out the names not matching across name_x and name_y columns and drop name_y.

df3 = df1[['name','age','team']].merge(df2, on=['age','team'], how='right')
mask = pd.isna(df3['name_x'])
df3.loc[mask,'name_x'] = df3.loc[mask,'name_y'].to_numpy()
df3 = df3[df3.apply(lambda x: x['name_y'] in x['name_x'], axis=1)].drop('name_y', axis=1)

Output:

              name_x     age     team  gls  ast
0  Kepa Arrizabalaga  27-074  Chelsea    0    0
1       Lionel Messi  34-175    Paris    5    0
2     Junior Messias  30-217    Milan    1    0
4     Kenneth Taylor  19-214     Ajax    0    0
5             Neymar  29-314    Paris    0    0
6   Ryan Gravenberch  19-214     Ajax    0    0