I have a dataframe like:
lst = [["High", "A"], ["High", "A"], ["High", "B"],["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns =["Class", "Grade"])
I need to get the mode (majority vote) of "Grade" in each "Class". If it's a tie vote, assign "x".
Below is what I expect to get:
| Class | Grade | Majority_vote |
|---|---|---|
| High | A | A |
| High | A | A |
| High | B | A |
| Medium | A | x |
| Medium | B | x |
| Medium | C | x |
This is my code:
df['majority_vote'] = df.groupby(['Class'])['Grade'].transform(lambda x: x.mode()[0])
I think the code will return 'nan' if it's a tie vote. Then, I will change 'nan' to 'x' later.
However, what I get is below:
| Class | Grade | Majority_vote |
|---|---|---|
| High | A | A |
| High | A | A |
| High | B | A |
| Medium | A | A |
| Medium | B | A |
| Medium | C | A |
At class "Medium", the code returns the 1st element ("A") instead of 'nan'.
Any other method is appreciated. Could you please help me? Thank you in advance.
CodePudding user response:
The issue with using x.mode()[0] is that pd.Series(['A', 'B', 'C']).mode() evaluates to ['A', 'B', 'C']. Meanwhile, pd.Series(['A', 'A', 'B']).mode() evaluates to ['A'].
Here is a function that will return the mode (if there is only one) and "x" if there is a tie (i.e., multiple modes).
import pandas as pd
lst = [["High", "A"], ["High", "A"], ["High", "B"],["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns=["Class", "Grade"])
def get_mode_or_x(series):
mode = series.mode()
if mode.size == 1:
return mode[0]
return "x"
df.loc[:, "majority_vote"] = df.groupby("Class")["Grade"].transform(get_mode_or_x)
| index | Class | Grade | majority_vote |
|---|---|---|---|
| 0 | High | A | A |
| 1 | High | A | A |
| 2 | High | B | A |
| 3 | Medium | A | x |
| 4 | Medium | B | x |
| 5 | Medium | C | x |
