I've a dataset
| id | ref | name | conditionCol |
|---|---|---|---|
| 1 | 123 | a | no_error |
| 1 | 456 | b | error |
| 1 | 789 | c | no_error |
| 2 | 231 | d | no_error |
| 2 | 312 | e | no_error |
| 2 | 546 | f | no_error |
| 3 | 645 | g | error |
| 3 | 879 | h | error |
| 4 | 789 | i | no_error |
| 4 | 978 | j | no_error |
I'm trying to create a custom error_flag, condition being:
- for each unique
idcolumn elements - if any row in the
conditionColhas the keyworderror, then - for each row should be flagged as
yesin theerror_flag - if for any element in
idcolumn - not even a single row has the keyword
errorinconditionColcolumn, then - for each row should be flagged as
noin theerror_flag
E.g. For id:1, all the values of error_flag is yes, as for id value 1, row #2 of conditionCol has error
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 1 | 123 | a | no_error | yes |
| 1 | 456 | b | error | yes |
| 1 | 789 | c | no_error | yes |
But, for id:2, all the values of error_flag is no, as for id value 2, no row of conditionCol has error
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 2 | 231 | d | no_error | no |
| 2 | 312 | e | no_error | no |
| 2 | 546 | f | no_error | no |
Similarly for id value 3 & 4:
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 3 | 645 | g | no_error | no |
| 3 | 879 | h | no_error | no |
| 4 | 789 | i | error | yes |
| 4 | 978 | j | error | yes |
And final output being:
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 1 | 123 | a | no_error | yes |
| 1 | 456 | b | error | yes |
| 1 | 789 | c | no_error | yes |
| 2 | 231 | d | no_error | no |
| 2 | 312 | e | no_error | no |
| 2 | 546 | f | no_error | no |
| 3 | 645 | g | no_error | no |
| 3 | 879 | h | no_error | no |
| 4 | 789 | i | error | yes |
| 4 | 978 | j | error | yes |
Update:
If you wish to play around with the dataset:
import pandas as pd
import numpy as np
id_col = [1,1,1,2,2,2,3,3,4,4]
ref_col = [123,456, 789, 231, 312, 546, 645, 879, 789, 978]
name_col = ['a','b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
conditionCol = ['no_error', 'error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'error', 'error']
df = pd.DataFrame(zip(id_col, ref_col, name_col, conditionCol), columns=['id','ref','name','conditionCol'])
df
update2: Is there a way to work with thresholds, i.e.:
- current question: atleast one occurrence of keyword
errorin conditionCol column for each individual uniqueids, then the value inerror_flagwould beyesfor all the rows in that id value - atleast 4 or atleast 5 occurrence of keyword
errorin conditionCol column for uniqueids, then only the value inerror_flagwould beyesfor all the rows in that id value
CodePudding user response:
Use numpy.where with test if at least one value error per groups by id:
m = df['id'].isin(df.loc[df['conditionCol'].eq('error'), 'id'])
#alternative
#m = df['conditionCol'].eq('error').groupby(df['id']).transform('any')
df['error_flag'] = np.where(m, 'yes', 'no')
print (df)
id ref name conditionCol error_flag
0 1 123 a no_error yes
1 1 456 b error yes
2 1 789 c no_error yes
3 2 231 d no_error no
4 2 312 e no_error no
5 2 546 f no_error no
6 3 645 g no_error no
7 3 879 h no_error no
8 4 789 i error yes
9 4 978 j error yes
