I have this dataframe which looks like this:
user_id : Represents user
question_id : Represent question number
user_answer : which option user has opted for the specific question from (A,B,C,D)
correct_answer: What is correct answer for that specific question
correct : if 0.0 it means users answer was not right , if 1.0 it means user answer is right
elapsed_time : it represents time in minutes user took to answer that question
| user_iD | question_id | user_answer | correct answer | correct | elapsed_time | solving_id | bundle_id |
|---|---|---|---|---|---|---|---|
| 1 | 1 | A | D | 0.0 | 5.00 | 1 | 1 |
| 1 | 1 | B | D | 0.0 | 3.00 | 1 | 1 |
| 1 | 1 | D | D | 1.0 | 7.00 | 1 | 1 |
| 2 | 10 | C | B | 0.0 | 5.00 | 10 | 5 |
| 2 | 10 | B | B | 1.0 | 15.0 | 10 | 5 |
| 2 | 10 | B | B | 1.0 | 2.00 | 10 | 5 |
| 3 | 25 | C | A | 0.0 | 5.00 | 25 | 7 |
| 3 | 25 | A | A | 1.0 | 1.00 | 25 | 7 |
What I want another dataframe to look like is this
| user_iD | question_id | user_answer | correct answer | correct | elapsed_time | solving_id | bundle_id |
|---|---|---|---|---|---|---|---|
| 1 | 1 | D | D | 1.0 | 7.00 | 1 | 1 |
| 2 | 10 | B | B | 1.0 | 15.0 | 10 | 5 |
| 3 | 25 | A | A | 1.0 | 1.00 | 25 | 7 |
It should remove the duplicate questions attempted by a user and keeps the one which user corrected the very first time.
What I have done so far , this gives me an error
df = df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])
ERROR : AttributeError: 'DataFrame' object has no attribute 'drop_duplicated'
CodePudding user response:
drop_duplicated wrong typo. use drop_duplicates.
Also, check keep parameter to preserve row you want.
CodePudding user response:
For your boolean mask, you need duplicated instead of drop_duplicated. So instead of
df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])
you want
mask = df['correct'].eq(1) & ~df.duplicated(['user_iD','question_id','correct'])
This will give you a boolean mask. Then your desired outcome is obtained using this mask on df:
out = df[mask]
You can get the same outcome by filtering for correct first and then using drop_duplicates method:
out = df[df['correct']==1].drop_duplicates('correct answer'))
Output:
user_iD question_id user_answer correct answer correct elapsed_time \
2 1 1 D D 1.0 7.0
4 2 10 B B 1.0 15.0
7 3 25 A A 1.0 1.0
solving_id bundle_id
2 1 1
4 10 5
7 25 7
CodePudding user response:
groupby and agg are good at conditionally isolating. In this case, filter the ones and find the max in each group of the user_iD
df[df['correct'].eq(1.0)].groupby(df['user_iD']).agg('max')
user_iD question_id user_answer correct answer
correct \
user_iD
1 1 1 D D 1.0
2 2 10 B B 1.0
3 3 25 A A 1.0
elapsed_time solving_id bundle_id
user_iD
1 7.0 1 1
2 15.0 10 5
3 1.0 25 7
CodePudding user response:
Use Series.eq with Groupby.idxmax and df.loc:
In [174]: ix = df[df.correct.eq(1)].groupby(['user_iD', 'question_id'])['elapsed_time'].idxmax()
In [175]: df.loc[ix]
Out[175]:
user_iD question_id user_answer correct answer correct elapsed_time solving_id bundle_id
2 1 1 D D 1.0 7.0 1 1
4 2 10 B B 1.0 15.0 10 5
7 3 25 A A 1.0 1.0 25 7
