I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
Suppose the value -9999 was repeated 200 times in the month of January and the threshold is 150. Practically the entire month of January must be deleted or all its rows must be deleted.
date values repeated
1 2013-02 0
2 2013-03 2
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
With this I think I can drop the rows that repeat but I want drop the whole month.
import numpy as np
df['month'] = df['date'].dt.to_period('M')
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150),np.nan,df['values'])
df.dropna()
How can I do that ?
CodePudding user response:
One way using pandas.to_datetime with pandas.DataFrame.groupby.filter.
Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999.0
3 2013-01-01 03:00:00 -9999.0
4 2013-01-01 04:00:00 0.0
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00 0.0
Then we do filtering:
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)
Output:
date values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
You can see the months with 2 or more repeats are deleted.
