I have a pandas dataframe of some time series data which contains some missing dates. I want to fill these based on the dates in the adjacent rows.
For example, the entries in the dataframe are chronologically ordered thus I know that the two missing dates in the below dataframe must also be 15/3/22 because the above and below entires are both 15/3/22.
| Date | Time | etc. |
|---|---|---|
| 15/3/22 | 1:34:40 AM | ... |
| NaN | 1:36:18 AM | ... |
| NaN | 1:36:20 AM | ... |
| 15/3/22 | 3:08:23 AM | ... |
Alternatively, when the entries are days apart, you won't be able to tell which date the missing values should be (3/3, 4/3, or 5/3) and thus they should be left as NaN.
| Date | Time | etc. |
|---|---|---|
| 03/3/22 | 8:50:15 AM | ... |
| NaN | 10:40:00 AM | ... |
| NaN | 12:10:40 AM | ... |
| 05/3/22 | 3:08:23 AM | ... |
How can I do this in Pandas? Thanks
Sample code to create the example dataframes:
data1 = [['15/3/22', '1:34:40'], ['', '1:36:18'], ['', '1:36:20'],['15/3/22', '3:08:23']]
data2 = [['03/3/22', '8:50:15'], ['', '10:40:00'], ['', '12:10:40'], ['05/3/22', '3:08:23']]
df1 = pd.DataFrame(data1, columns = ['Date', 'Time'])
df2 = pd.DataFrame(data2, columns = ['Date', 'Time'])
df1['Date'] = pd.to_datetime( df1['Date'], format='%d/%m/%y', errors="coerce" )
df1['Time'] = pd.to_datetime( df1['Time'], format='%H:%M:%S', errors="coerce" )
df2['Date'] = pd.to_datetime( df2['Date'], format='%d/%m/%y', errors="coerce" )
df2['Time'] = pd.to_datetime( df2['Time'], format='%H:%M:%S', errors="coerce" )
CodePudding user response:
There's a lot of good examples of various strategies for filling missing data in the documentation here: https://pandas.pydata.org/docs/user_guide/missing_data.html#filling-missing-values-fillna
I don't think a particular method exists that follows the filling strategy you've described (though it is a very reasonable one), but I think we can use a trick to achieve it quite easily: by doing a forward-fill and a backward-fill and then only keeping the changes when these two give the same result, which must be when it's the same date before and after the run of NaNs.
date_ff = df["Date"].fillna(method="ffill")
date_bf = df["Date"].fillna(method="bfill")
df.loc[date_ff == date_bf, "Date"] = date_ff
In case you're unfamiliar with the .loc syntax, that last line is filtering to just the rows where date_ff and date_bf are equal and setting the Date column to the corresponding value from date_ff (we could have used date_bf - the point is that they give exactly the same result).
Note that since the non-NaN values aren't touched by fillna() they will be equal in date_ff and date_bf so are effectively left unchanged.
