Drop all rows for the month if a column has more than one value that crossed the threshold

Time:01-27

I have a dataframe with time data in the format:

    date           values
0       2013-01-01 00:00:00 0.0
1       2013-01-01 01:00:00 0.0
2       2013-01-01 02:00:00 -9999
3       2013-01-01 03:00:00 -9999
4       2013-01-01 04:00:00 0.0
..         ...          ...
8754    2016-12-31 18:00:00 427.5
8755    2016-12-31 19:00:00 194.9
8756    2016-12-31 20:00:00 -9999
8757    2016-12-31 21:00:00 237.6
8758    2016-12-31 22:00:00 -9999
8759    2016-12-31 23:00:00 0.0

Suppose the value -9999 appears 200 times in January and the threshold is 150. In that case the entire month of January, i.e. all of its rows, must be deleted.

    date    values repeated
1   2013-02 0
2   2013-03 2
4   2013-05 0
5   2013-06 0
6   2013-07 66
7   2013-08 0
8   2013-09 7

With this I think I can drop the rows that repeat, but I want to drop the whole month.

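import numpy as np
df['month'] = df['date'].dt.to_period('M')
# n_missing: number of -9999 readings in each row's month
df['n_missing'] = df['values'].eq(-9999).groupby(df['month']).transform('sum')
# Blank out only the -9999 rows in months over the threshold, then drop them
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150), np.nan, df['values'])
df = df.dropna(subset=['new_value'])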

How can I do that?

CodePudding user response:

One way is to use pandas.to_datetime together with pandas.DataFrame.groupby and the filter method of the resulting groups.

Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:

                    date  values
0    2013-01-01 00:00:00     0.0
1    2013-01-01 01:00:00     0.0
2    2013-01-01 02:00:00 -9999.0
3    2013-01-01 03:00:00 -9999.0
4    2013-01-01 04:00:00     0.0
5    2013-02-01 12:00:00 -9999.0
6    2013-03-01 12:00:00     0.0
8754 2016-12-31 18:00:00   427.5
8755 2016-12-31 19:00:00   194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00   237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00     0.0

Then we do filtering:

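import pandas as pd

# Group rows by "YYYY-MM" and keep only months with fewer than 2 readings of -9999
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)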

Output:

                 date  values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00     0.0

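You can see that the months with 2 or more repeats are deleted. For the threshold in the question, replace < 2 with <= 150 so that only months with more than 150 repeats are dropped.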
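
As a variation on the filter approach above, here is a minimal sketch that takes the threshold as a parameter and builds a boolean mask with groupby.transform instead of a lambda. The names drop_months_over_threshold, threshold and sentinel are hypothetical and not part of the original post, and the sample frame is a shortened version of the one shown earlier.

```python
import pandas as pd


def drop_months_over_threshold(df, threshold, sentinel=-9999):
    """Drop every row of any month whose sentinel count exceeds threshold."""
    # Assumes df has a "date" column parseable by pd.to_datetime and a "values" column
    month = pd.to_datetime(df["date"]).dt.to_period("M")
    # Count of sentinel readings per month, broadcast back to every row
    n_missing = df["values"].eq(sentinel).groupby(month).transform("sum")
    return df[n_missing <= threshold]


sample = pd.DataFrame({
    "date": ["2013-01-01 00:00:00", "2013-01-01 02:00:00", "2013-01-01 03:00:00",
             "2013-02-01 12:00:00", "2013-03-01 12:00:00"],
    "values": [0.0, -9999.0, -9999.0, -9999.0, 0.0],
})

# January has two -9999 readings, which exceeds a threshold of 1, so it is dropped
result = drop_months_over_threshold(sample, threshold=1)
print(result)
```

With the data from the question, the call would be drop_months_over_threshold(df, threshold=150). The mask-based form avoids calling a Python function once per group, which may matter on large frames, but the result should match the filter version.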
