I have a dataset with missing cases (NAs) and "impossible" values, which is defined as any value different from 1,2,3,4, or 5.
df = pd.DataFrame.from_dict({'aut_a_p1_r1': {131: 52.0, 106: 4.0, 80: 4.0, 108: 3.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 2.0, 9: 3.0, 248: 5.0, 219: 3.0, 134: 5.0, 105: 3.0, 176: 3.0, 245: 1.0, 271: 4.0, 249: 4.0}, 'aut_a_p1_r2': {131: 4.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 5.0, 248: 4.0, 219: 4.0, 134: 3.0, 105: 4.0, 176: 4.0, 245: 3.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r3': {131: 5.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 5.0, 9: 5.0, 248: 5.0, 219: 5.0, 134: 5.0, 105: 5.0, 176: 5.0, 245: 4.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r4': {131: 3.0, 106: 2.0, 80: 2.0, 108: 3.0, 303: 1.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 4.0, 248: 3.0, 219: 3.0, 134: 2.0, 105: 2.0, 176: 3.0, 245: 2.0, 271: 2.0, 249: 2.0}})
My goal is to convert all variables to numeric and then create a new column with the summation of other specific variables ignoring the missing cases. If the value of the cell is outside some pre-defined range, convert it to missing. If possible, not sum columns in which missing (Nan) is present instead of return 0 as result.
This is the code I´m trying:
#Convert to numeric
df.loc[:,df.columns.str.contains("aut_a_")] = df.loc[:,df.columns.str.contains("aut_a_")].apply(pd.to_numeric, errors='coerce')
# Convert values != 1,2,3,4,5 to missing
????
# Sum
df["aut_sum"] = df.loc[:,df.columns.str.contains("aut_a_")].sum(axis=1)
df["aut_sum"]
Please, feel free to improve my code.
CodePudding user response:
You want to use filter for searching a label in index/columns:
target = df.filter(regex='aut_a_*')
df['aut_sum'] = target.dropna().where(target.isin(np.arange(1, 6)), 0).sum(1)
Output:
aut_a_p1_r1 aut_a_p1_r2 aut_a_p1_r3 aut_a_p1_r4 aut_sum
131 52.0 4.0 5.0 3.0 12.0
106 4.0 5.0 5.0 2.0 16.0
80 4.0 5.0 5.0 2.0 16.0
108 3.0 4.0 4.0 3.0 14.0
303 5.0 5.0 5.0 1.0 16.0
145 NaN NaN NaN NaN NaN
172 NaN NaN NaN NaN NaN
103 NaN NaN NaN NaN NaN
67 NaN NaN NaN NaN NaN
59 NaN NaN NaN NaN NaN
7 2.0 4.0 5.0 4.0 15.0
9 3.0 5.0 5.0 4.0 17.0
248 5.0 4.0 5.0 3.0 17.0
219 3.0 4.0 5.0 3.0 15.0
134 5.0 3.0 5.0 2.0 15.0
105 3.0 4.0 5.0 2.0 14.0
176 3.0 4.0 5.0 3.0 15.0
245 1.0 3.0 4.0 2.0 10.0
271 4.0 5.0 5.0 2.0 16.0
249 4.0 4.0 4.0 2.0 14.0
CodePudding user response:
Use mask:
df['aut_sum'] = df.dropna(how='all').filter(like='aut_a_').mask((df < 1) | (df > 5)).sum(axis=1)
print(df)
# Output
aut_a_p1_r1 aut_a_p1_r2 aut_a_p1_r3 aut_a_p1_r4 aut_sum
131 52.0 4.0 5.0 3.0 12.0
106 4.0 5.0 5.0 2.0 16.0
80 4.0 5.0 5.0 2.0 16.0
108 3.0 4.0 4.0 3.0 14.0
303 5.0 5.0 5.0 1.0 16.0
145 NaN NaN NaN NaN NaN
172 NaN NaN NaN NaN NaN
103 NaN NaN NaN NaN NaN
67 NaN NaN NaN NaN NaN
59 NaN NaN NaN NaN NaN
7 2.0 4.0 5.0 4.0 15.0
9 3.0 5.0 5.0 4.0 17.0
248 5.0 4.0 5.0 3.0 17.0
219 3.0 4.0 5.0 3.0 15.0
134 5.0 3.0 5.0 2.0 15.0
105 3.0 4.0 5.0 2.0 14.0
176 3.0 4.0 5.0 3.0 15.0
245 1.0 3.0 4.0 2.0 10.0
271 4.0 5.0 5.0 2.0 16.0
249 4.0 4.0 4.0 2.0 14.0
CodePudding user response:
Try
df["aut_sum"] = (df.applymap(lambda x: x if x in [1,2,3,4,5] else np.nan)
.filter(like="aut_a_").dropna().sum(axis=1) )
