Home > database >  Transform values different from a pre-defined range into missing and compute a new varible based on
Transform values different from a pre-defined range into missing and compute a new varible based on

Time:01-22

I have a dataset with missing cases (NAs) and "impossible" values, which is defined as any value different from 1,2,3,4, or 5.

df = pd.DataFrame.from_dict({'aut_a_p1_r1': {131: 52.0, 106: 4.0, 80: 4.0, 108: 3.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 2.0, 9: 3.0, 248: 5.0, 219: 3.0, 134: 5.0, 105: 3.0, 176: 3.0, 245: 1.0, 271: 4.0, 249: 4.0}, 'aut_a_p1_r2': {131: 4.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 5.0, 248: 4.0, 219: 4.0, 134: 3.0, 105: 4.0, 176: 4.0, 245: 3.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r3': {131: 5.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 5.0, 9: 5.0, 248: 5.0, 219: 5.0, 134: 5.0, 105: 5.0, 176: 5.0, 245: 4.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r4': {131: 3.0, 106: 2.0, 80: 2.0, 108: 3.0, 303: 1.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 4.0, 248: 3.0, 219: 3.0, 134: 2.0, 105: 2.0, 176: 3.0, 245: 2.0, 271: 2.0, 249: 2.0}})

My goal is to convert all variables to numeric and then create a new column with the summation of other specific variables ignoring the missing cases. If the value of the cell is outside some pre-defined range, convert it to missing. If possible, not sum columns in which missing (Nan) is present instead of return 0 as result.

This is the code I´m trying:

#Convert to numeric
df.loc[:,df.columns.str.contains("aut_a_")] = df.loc[:,df.columns.str.contains("aut_a_")].apply(pd.to_numeric, errors='coerce')

# Convert values != 1,2,3,4,5 to missing
????

# Sum
df["aut_sum"] = df.loc[:,df.columns.str.contains("aut_a_")].sum(axis=1)
df["aut_sum"]

Please, feel free to improve my code.

CodePudding user response:

You want to use filter for searching a label in index/columns:

target = df.filter(regex='aut_a_*') 
df['aut_sum'] = target.dropna().where(target.isin(np.arange(1, 6)), 0).sum(1)

Output:

    aut_a_p1_r1  aut_a_p1_r2  aut_a_p1_r3  aut_a_p1_r4  aut_sum
131         52.0          4.0          5.0          3.0     12.0
106          4.0          5.0          5.0          2.0     16.0
80           4.0          5.0          5.0          2.0     16.0
108          3.0          4.0          4.0          3.0     14.0
303          5.0          5.0          5.0          1.0     16.0
145          NaN          NaN          NaN          NaN      NaN
172          NaN          NaN          NaN          NaN      NaN
103          NaN          NaN          NaN          NaN      NaN
67           NaN          NaN          NaN          NaN      NaN
59           NaN          NaN          NaN          NaN      NaN
7            2.0          4.0          5.0          4.0     15.0
9            3.0          5.0          5.0          4.0     17.0
248          5.0          4.0          5.0          3.0     17.0
219          3.0          4.0          5.0          3.0     15.0
134          5.0          3.0          5.0          2.0     15.0
105          3.0          4.0          5.0          2.0     14.0
176          3.0          4.0          5.0          3.0     15.0
245          1.0          3.0          4.0          2.0     10.0
271          4.0          5.0          5.0          2.0     16.0
249          4.0          4.0          4.0          2.0     14.0

CodePudding user response:

Use mask:

df['aut_sum'] = df.dropna(how='all').filter(like='aut_a_').mask((df < 1) | (df > 5)).sum(axis=1)
print(df)

# Output
     aut_a_p1_r1  aut_a_p1_r2  aut_a_p1_r3  aut_a_p1_r4  aut_sum
131         52.0          4.0          5.0          3.0     12.0
106          4.0          5.0          5.0          2.0     16.0
80           4.0          5.0          5.0          2.0     16.0
108          3.0          4.0          4.0          3.0     14.0
303          5.0          5.0          5.0          1.0     16.0
145          NaN          NaN          NaN          NaN      NaN
172          NaN          NaN          NaN          NaN      NaN
103          NaN          NaN          NaN          NaN      NaN
67           NaN          NaN          NaN          NaN      NaN
59           NaN          NaN          NaN          NaN      NaN
7            2.0          4.0          5.0          4.0     15.0
9            3.0          5.0          5.0          4.0     17.0
248          5.0          4.0          5.0          3.0     17.0
219          3.0          4.0          5.0          3.0     15.0
134          5.0          3.0          5.0          2.0     15.0
105          3.0          4.0          5.0          2.0     14.0
176          3.0          4.0          5.0          3.0     15.0
245          1.0          3.0          4.0          2.0     10.0
271          4.0          5.0          5.0          2.0     16.0
249          4.0          4.0          4.0          2.0     14.0

CodePudding user response:

Try

df["aut_sum"] = (df.applymap(lambda x: x if x in [1,2,3,4,5] else np.nan)
.filter(like="aut_a_").dropna().sum(axis=1) )
  •  Tags:  
  • Related