Find day difference between two dates-CodePudding

I have two dataframes one is failures dataframe and the other is maintenance dataframe.

I want to find the number of days difference between the records of both these dataframes based on some conditions.

Condition is-

There are multiple machineIDs and component numbers comp1, comp2, comp3, comp4. So, the operation needs to be performed on each machineID and component number such that for every machineID and its component number I have the day difference between failure and maintenance date.
First remove duplicate dates from maintenance dataframe.
Maintenance date should be older than failure date and there can be multiple dates from maintenance dataframe which are older than failures date.
So, averaging out after finding their difference and appending it in the failures dataframe as a new column.

The failures dataframe is-

            datetime        machineID   failure
0          2021-02-04           1        comp3
1          2021-03-21           1        comp1
2          2021-04-05           1        comp4
3          2021-05-05           1        comp3
4          2021-05-20           1        comp2
5          2021-06-04           1        comp4
6          2021-06-19           1        comp2
7          2021-08-03           1        comp3
8          2021-08-03           1        comp4
9          2021-11-01           1        comp4

And this is the maintenance dataframe

    datetime        machineID   comp
0   2020-07-01          1      comp4
1   2020-09-14          1      comp1
2   2020-09-14          1      comp2
3   2020-11-13          1      comp3
4   2021-01-05          1      comp1
5   2021-01-20          1      comp1
6   2021-02-04          1      comp3
7   2021-02-19          1      comp3
8   2021-03-06          1      comp3
9   2021-03-21          1      comp1

I am taking an example from failures dataframe for index 1 with machineID 1, datetime 2021-03-21 and failure of Component 1. Now, if see for the same machineID and component1 in maintenance dataframe, there are 3 dates which are older than the one in failures dataframe i.e., 2021-03-21. Now, I want to calculate day difference of 2021-03-21 from all these 3 dates of maintenance dataframe and average it out and append it the same index of failure dataframe.

failure datetime- 2021-03-21

maintenance datetime- 2020-09-14; 2021-01-05; 2021-01-20

Corresponding day difference- 188 days, 75 days; 60 days.

Average of these 3- (188 75 60)/3= 108 days.

This is something I want-

            datetime        machineID   failure     day_dif

1          2021-03-21           1        comp1       108 days

Similarly for component 2, 3 and 4

Edit1- As per @oskros answer, I am getting this error-

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
   1028                 result, _ = self.grouper.aggregate(
-> 1029                     block.values, how, axis=1, min_count=min_count
   1030                 )

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in aggregate(self, values, how, axis, min_count)
    580         return self._cython_operation(
--> 581             "aggregate", values, how, axis, min_count=min_count
    582         )

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
    510 
--> 511         func, values = self._get_cython_func_and_vals(kind, how, values, is_numeric)
    512 

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_cython_func_and_vals(self, kind, how, values, is_numeric)
    410         try:
--> 411             func = self._get_cython_function(kind, how, values, is_numeric)
    412         except NotImplementedError:

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_cython_function(self, kind, how, values, is_numeric)
    385             raise NotImplementedError(
--> 386                 f"function is not implemented for this dtype: "
    387                 f"[how->{how},dtype->{dtype_str}]"

NotImplementedError: function is not implemented for this dtype: [how->max,dtype->object]

During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
   1060                     # unwrap DataFrame to get array
-> 1061                     assert len(result._data.blocks) == 1
   1062                     result = result._data.blocks[0].values

AssertionError: 

During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)
<ipython-input-119-02d194427401> in <module>
      1 # 1) Remove duplicates in maintenance dataframe, by getting the latest maintenance date
      2 #    for each combination of machineID and comp
----> 3 maint_df = maint.groupby(['machineID', 'comp']).max().reset_index()
      4 
      5 # 2) Merge the two dataframes so we have maintenance and failure dates for each

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in f(self, **kwargs)
   1376                 # try a cython aggregation if we can
   1377                 try:
-> 1378                     return self._cython_agg_general(alias, alt=npfunc, **kwargs)
   1379                 except DataError:
   1380                     pass

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
   1002     ) -> DataFrame:
   1003         agg_blocks, agg_items = self._cython_agg_blocks(
-> 1004             how, alt=alt, numeric_only=numeric_only, min_count=min_count
   1005         )
   1006         return self._wrap_agged_blocks(agg_blocks, items=agg_items)

/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
   1065 
   1066             finally:
-> 1067                 assert not isinstance(result, DataFrame)
   1068 
   1069                 if result is not no_result:

AssertionError:

The datatypes of maintenance dataframe is

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1038 entries, 0 to 1046
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Asset      1038 non-null   object        
 1   Type       1038 non-null   object        
 2   datetime   1038 non-null   datetime64[ns]
 3   machineID  1038 non-null   int64         
 4   comp       1038 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 48.7  KB

CodePudding user response：

If we defined your maintenance dataframe as maint_df and your failure dataframe as fail_df, the following code should do the trick

# 1) Remove duplicates in maintenance dataframe, by getting the latest maintenance date
#    for each combination of machineID and comp
maint_df = maint_df.groupby(['machineID', 'comp']).max().reset_index()

# 2) Merge the two dataframes so we have maintenance and failure dates for each 
#    combination
fail_df = fail_df.rename(columns={'failure': 'comp'})  # Rename column allowing the merge
merged = maint_df.merge(fail_df, 
                        on=['machineID', 'comp'], 
                        suffixes=('_maintenance', '_failure'))

# 3) Filter out rows where maintenance date isn't older than failure date
merged = merged[merged['datetime_maintenance'] < merged['datetime_failure']]

# 4) Calculate the day difference between the two dates. 
#    NB: This assumes your date columns are strings - if they are already date objects, 
#    you can remove the strptime
merged['diff_days'] = merged.apply(
    lambda x: (dt.datetime.strptime(x.datetime_failure, '%Y-%m-%d') - 
               dt.datetime.strptime(x.datetime_maintenance, '%Y-%m-%d')).days, 
    axis=1)

# 5) Average the day difference across machineID and comp
print(merged.groupby(['machineID', 'comp']).mean())