Change value if it repeats a certain number of times in a month-CodePudding

I have a dataframe with time data in the format:

    date           values
0       2013-01-01 00:00:00 0.0
1       2013-01-01 01:00:00 0.0
2       2013-01-01 02:00:00 -9999
3       2013-01-01 03:00:00 -9999
4       2013-01-01 04:00:00 0.0
..         ...          ...
8754    2016-12-31 18:00:00 427.5
8755    2016-12-31 19:00:00 194.9
8756    2016-12-31 20:00:00 -9999
8757    2016-12-31 21:00:00 237.6
8758    2016-12-31 22:00:00 -9999
8759    2016-12-31 23:00:00 0.0

And I want every month that the value -9999 is repeated more than 175 times those values get changed to NaN. Imagine that we have this other dataframe with the number of times the value is repeated per month:

    date    values
0   2013-01 200
1   2013-02 0
2   2013-03 2
3   2013-04 181
4   2013-05 0
5   2013-06 0
6   2013-07 66
7   2013-08 0
8   2013-09 7

In this case, the month of January and April passed the stipulated value and that first dataframe should be:

        date       values
0       2013-01-01 00:00:00 0.0
1       2013-01-01 01:00:00 0.0
2       2013-01-01 02:00:00 NaN 
3       2013-01-01 03:00:00 NaN
4       2013-01-01 04:00:00 0.0
..         ...          ...
8754    2016-12-31 18:00:00 427.5
8755    2016-12-31 19:00:00 194.9
8756    2016-12-31 20:00:00 -9999
8757    2016-12-31 21:00:00 237.6
8758    2016-12-31 22:00:00 -9999
8759    2016-12-31 23:00:00 0.0

I imagined creating a list using tolist() that separates the months that the value appears more than 175 times and then creating a condition if df["values"]==-9999 and df["date"] in list_with_months and then change the values.

CodePudding user response：

You can do this using a transform call where you calculate the number of values per month in the same dataframe. Then you create a new column conditionally on this:

import numpy as np

MISSING = -9999
THRESHOLD = 175

# Create a month column
df['month'] = df['date'].dt.to_period('M')
# Count number of MISSING per month and assign to dataframe
df['n_missing'] = (
    df.groupby('month')['values']
    .transform(lambda d: (d == MISSING).sum()
)
# If value is MISSING and number of missing is above THRESHOLD, replace with NaN, otherwise keep original values
df['new_value'] = np.where(
    (df['values'] == MISSING) & (df['n_missing'] > THRESHOLD), 
    np.nan, 
    df['values']
)