I'm a beginner to Pandas, so bear with me.
Here is a simplified version of my series:
| Name |
|---|
| James |
| Michael |
| Jim |
| Bob |
| Jim |
| Bob |
I want to create a df that adds a column for 'Team.' Here is my team distribution:
team1 = [
    'Michael',
    'James',
]
team2 = [
    'Jim',
    'Bob'
]
My first instinct was to define a function with an if statement and isin, like so:
def Team(row):
    if row['Name'].isin(team1):
        return 'Team 1'
    elif row['Name'].isin(team2):
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.apply(Team, axis=1)
df
With axis=1, I get "TypeError: Teams() got an unexpected keyword argument 'axis'". When I remove the axis argument, I get "TypeError: string indices must be integers".
Any idea if there is a better approach? Thanks!
CodePudding user response:
I'm not sure I understand your errors, but note that the traceback mentions Teams() rather than Team().
In any case, in your example row is a pandas Series, and indexing it with row['Name'] returns a plain Python string, which has no isin() method. Using the in operator in your function should work:
def Team(row):
    if row['Name'] in team1:
        return 'Team 1'
    elif row['Name'] in team2:
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.apply(Team, axis=1)
df
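To illustrate the point about isin(), here is a minimal sketch, assuming the data lives in a DataFrame with a 'Name' column. The extra name 'Pam' is hypothetical and only there to exercise the 'No Team' case, and using numpy.select is my choice for the illustration, not something from the question. It shows that a single cell is a plain str, while the whole column is a Series on which isin() does work:

```python
import numpy as np
import pandas as pd

team1 = ["Michael", "James"]
team2 = ["Jim", "Bob"]
# 'Pam' is a hypothetical extra name that belongs to neither team
df = pd.DataFrame({"Name": ["James", "Michael", "Jim", "Bob", "Jim", "Bob", "Pam"]})

# A single cell is a plain Python str, which has no .isin method
first_name = df.loc[0, "Name"]
print(type(first_name).__name__, hasattr(first_name, "isin"))

# The whole column is a Series, so .isin works and returns a boolean mask
in_team1 = df["Name"].isin(team1)
in_team2 = df["Name"].isin(team2)

# np.select takes the label of the first condition that is True,
# and falls back to the default when none match
df["Team"] = np.select([in_team1, in_team2], ["Team 1", "Team 2"], default="No Team")
print(df)
```

This avoids calling a Python function once per row, which tends to matter more as the frame grows.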
I would also suggest applying the function to the Name column (a pandas Series) directly instead of to the whole DataFrame. That should be faster as well. Series.apply() works much like DataFrame.apply(), but it passes each value to the function, so you don't need the axis=1 argument.
def Team(name):
    if name in team1:
        return 'Team 1'
    elif name in team2:
        return 'Team 2'
    else:
        return 'No Team'

df['Team'] = df.Name.apply(Team)
df
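Since the function is really just a lookup, the same result can probably be had with a dictionary and Series.map(). This is only a sketch under my own assumptions: the name_to_team dict and the extra name 'Pam' are hypothetical and not part of the question.

```python
import pandas as pd

team1 = ["Michael", "James"]
team2 = ["Jim", "Bob"]
# 'Pam' is a hypothetical name that is on neither team
names = pd.Series(["James", "Michael", "Jim", "Bob", "Jim", "Bob", "Pam"], name="Name")

# Invert the team lists into a name -> team lookup (hypothetical helper dict)
name_to_team = {name: "Team 1" for name in team1}
name_to_team.update({name: "Team 2" for name in team2})

# map() looks each name up in the dict; names missing from it become NaN,
# which fillna() then replaces with the fallback label
teams = names.map(name_to_team).fillna("No Team")
df = pd.DataFrame({"Name": names, "Team": teams})
print(df)
```

A dict lookup is also constant time per name, whereas `name in team1` scans the list each time.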
CodePudding user response:
You could do as follows (I am using a simple example):
import pandas as pd
team1 = ["A", "B"]
team2 = ["C", "D"]
df = pd.DataFrame({"Name":["A", "B", "C", "D", "E"]})
df.set_index("Name", inplace=True)
df.assign(Team=df.index.map(lambda x: "Team1" if x in team1 else "Team2" if x in team2 else "No Team"))
OUTPUT
         Team
Name
A       Team1
B       Team1
C       Team2
D       Team2
E     No Team
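One thing worth keeping in mind with this approach: assign() returns a new DataFrame and leaves the original untouched, so the result has to be stored somewhere. A small sketch of that, keeping Name as a regular column rather than the index (the label function is a hypothetical helper of mine, not from the answer):

```python
import pandas as pd

team1 = ["A", "B"]
team2 = ["C", "D"]
df = pd.DataFrame({"Name": ["A", "B", "C", "D", "E"]})

def label(name):
    """Hypothetical helper: return the team label for a single name."""
    if name in team1:
        return "Team1"
    if name in team2:
        return "Team2"
    return "No Team"

# assign() builds and returns a new DataFrame; df itself is not modified
result = df.assign(Team=df["Name"].map(label))

print("Team" in df.columns)      # the original frame has no Team column
print("Team" in result.columns)  # the returned frame does
```

Writing `df = df.assign(...)` rebinds the name if the new column should replace the old frame.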
CodePudding user response:
You can use pd.merge to help. The advantage of this over apply is that it should be faster and more efficient with a larger dataset.
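import pandas as pd

s = pd.Series(['James', 'Michael', 'Jim', 'Bob', 'Jim', 'Bob'], name='Name')
team1 = [
    'Michael',
    'James',
]
team2 = [
    'Jim',
    'Bob'
]
d = {
    'Team 1': team1,
    'Team 2': team2
}
# Melt the team table into (variable, value) pairs and index it by name,
# then merge it onto the names and rename 'variable' to 'Team'
lookup = pd.DataFrame(d).melt().set_index('value')
df = pd.DataFrame(s).merge(lookup, right_index=True, left_on='Name').rename({'variable': 'Team'}, axis=1)
df.sort_index(inplace=True)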
Result:
>>> df
      Name    Team
0    James  Team 1
1  Michael  Team 1
2      Jim  Team 2
3      Bob  Team 2
4      Jim  Team 2
5      Bob  Team 2
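Two caveats with the merge above, as far as I can tell: pd.DataFrame(d) needs the team lists to have the same length, and merge defaults to an inner join, so a name without a team would be dropped. A possible variation is sketched below. It assumes teams of different sizes and uses two hypothetical names, 'Pam' (no team) and 'Dwight' (on a team but absent from the data), purely for illustration.

```python
import pandas as pd

# 'Pam' has no team; 'Dwight' is on a team but not in the data (both hypothetical)
s = pd.Series(["James", "Michael", "Jim", "Bob", "Pam"], name="Name")
teams = {"Team 1": ["Michael", "James"], "Team 2": ["Jim", "Bob", "Dwight"]}

# One row per (Team, Name) pair; this works even when the lists differ in length
lookup = pd.DataFrame(
    [(team, name) for team, members in teams.items() for name in members],
    columns=["Team", "Name"],
)

# how="left" keeps every row of s in its original order;
# names with no match get NaN in Team, which is filled afterwards
df = s.to_frame().merge(lookup, on="Name", how="left")
df["Team"] = df["Team"].fillna("No Team")
print(df)
```

With a left merge on a column there is no need to sort the index afterwards, since the row order of the left frame is preserved.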
CodePudding user response:
I would do this using a join. You can define your teams (or read them in from somewhere), create a temporary df, and then join it onto your main df.
# Define teams
teams = {
    'Team 1': [
        'Michael',
        'James',
    ],
    'Team 2': [
        'Jim',
        'Bob'
    ]
}

# Correct format for df creation
name_to_team = dict()
for team, names in teams.items():
    for name in names:
        name_to_team[name] = team

# Get df with name as index and 'Team' as the column
team_df = pd.DataFrame.from_dict(name_to_team, orient='index', columns=['Team'])

# Join df using 'Name'
df = df.merge(team_df, left_on='Name', right_index=True)
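As a variation on the same idea, DataFrame.join() with on='Name' does a left join by default, so rows without a team are kept with NaN instead of being dropped by the inner merge. A rough sketch, assuming a df with a 'Name' column; the name 'Pam' is hypothetical, and I use a named Series as the lookup here instead of the team_df above:

```python
import pandas as pd

# 'Pam' is a hypothetical name that is on neither team
df = pd.DataFrame({"Name": ["James", "Michael", "Jim", "Bob", "Jim", "Bob", "Pam"]})
teams = {"Team 1": ["Michael", "James"], "Team 2": ["Jim", "Bob"]}

# Same name -> team mapping as the nested loops, written as a comprehension
name_to_team = {name: team for team, names in teams.items() for name in names}

# A named Series indexed by name can be joined like a one-column DataFrame
team_lookup = pd.Series(name_to_team, name="Team")

# join(on=...) matches df['Name'] against the lookup's index (left join by default)
joined = df.join(team_lookup, on="Name")

# Rows that did not match any team have NaN in the Team column
unassigned = joined[joined["Team"].isna()]
print(joined)
print(unassigned)
```

Keeping the unmatched rows makes it easy to spot names that are missing from the team definitions.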
