I have two dataframes one is failures dataframe and the other is maintenance dataframe.
I want to find the number of days difference between the records of both these dataframes based on some conditions.
Condition is-
- There are multiple machineIDs and component numbers
comp1, comp2, comp3, comp4. So, the operation needs to be performed on each machineID and component number such that for every machineID and its component number I have the day difference between failure and maintenance date. - First remove duplicate dates from maintenance dataframe.
- Maintenance date should be older than failure date and there can be multiple dates from maintenance dataframe which are older than failures date.
- So, averaging out after finding their difference and appending it in the failures dataframe as a new column.
The failures dataframe is-
datetime machineID failure
0 2021-02-04 1 comp3
1 2021-03-21 1 comp1
2 2021-04-05 1 comp4
3 2021-05-05 1 comp3
4 2021-05-20 1 comp2
5 2021-06-04 1 comp4
6 2021-06-19 1 comp2
7 2021-08-03 1 comp3
8 2021-08-03 1 comp4
9 2021-11-01 1 comp4
And this is the maintenance dataframe
datetime machineID comp
0 2020-07-01 1 comp4
1 2020-09-14 1 comp1
2 2020-09-14 1 comp2
3 2020-11-13 1 comp3
4 2021-01-05 1 comp1
5 2021-01-20 1 comp1
6 2021-02-04 1 comp3
7 2021-02-19 1 comp3
8 2021-03-06 1 comp3
9 2021-03-21 1 comp1
I am taking an example from failures dataframe for index 1 with machineID 1, datetime 2021-03-21 and failure of Component 1. Now, if see for the same machineID and component1 in maintenance dataframe, there are 3 dates which are older than the one in failures dataframe i.e., 2021-03-21. Now, I want to calculate day difference of 2021-03-21 from all these 3 dates of maintenance dataframe and average it out and append it the same index of failure dataframe.
failure datetime- 2021-03-21
maintenance datetime- 2020-09-14; 2021-01-05; 2021-01-20
Corresponding day difference- 188 days, 75 days; 60 days.
Average of these 3- (188 75 60)/3= 108 days.
This is something I want-
datetime machineID failure day_dif
1 2021-03-21 1 comp1 108 days
Similarly for component 2, 3 and 4
Edit1- As per @oskros answer, I am getting this error-
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1028 result, _ = self.grouper.aggregate(
-> 1029 block.values, how, axis=1, min_count=min_count
1030 )
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in aggregate(self, values, how, axis, min_count)
580 return self._cython_operation(
--> 581 "aggregate", values, how, axis, min_count=min_count
582 )
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
510
--> 511 func, values = self._get_cython_func_and_vals(kind, how, values, is_numeric)
512
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_cython_func_and_vals(self, kind, how, values, is_numeric)
410 try:
--> 411 func = self._get_cython_function(kind, how, values, is_numeric)
412 except NotImplementedError:
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_cython_function(self, kind, how, values, is_numeric)
385 raise NotImplementedError(
--> 386 f"function is not implemented for this dtype: "
387 f"[how->{how},dtype->{dtype_str}]"
NotImplementedError: function is not implemented for this dtype: [how->max,dtype->object]
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1060 # unwrap DataFrame to get array
-> 1061 assert len(result._data.blocks) == 1
1062 result = result._data.blocks[0].values
AssertionError:
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last)
<ipython-input-119-02d194427401> in <module>
1 # 1) Remove duplicates in maintenance dataframe, by getting the latest maintenance date
2 # for each combination of machineID and comp
----> 3 maint_df = maint.groupby(['machineID', 'comp']).max().reset_index()
4
5 # 2) Merge the two dataframes so we have maintenance and failure dates for each
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in f(self, **kwargs)
1376 # try a cython aggregation if we can
1377 try:
-> 1378 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1379 except DataError:
1380 pass
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1002 ) -> DataFrame:
1003 agg_blocks, agg_items = self._cython_agg_blocks(
-> 1004 how, alt=alt, numeric_only=numeric_only, min_count=min_count
1005 )
1006 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1065
1066 finally:
-> 1067 assert not isinstance(result, DataFrame)
1068
1069 if result is not no_result:
AssertionError:
The datatypes of maintenance dataframe is
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1038 entries, 0 to 1046
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Asset 1038 non-null object
1 Type 1038 non-null object
2 datetime 1038 non-null datetime64[ns]
3 machineID 1038 non-null int64
4 comp 1038 non-null object
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 48.7 KB
CodePudding user response:
If we defined your maintenance dataframe as maint_df and your failure dataframe as fail_df, the following code should do the trick
# 1) Remove duplicates in maintenance dataframe, by getting the latest maintenance date
# for each combination of machineID and comp
maint_df = maint_df.groupby(['machineID', 'comp']).max().reset_index()
# 2) Merge the two dataframes so we have maintenance and failure dates for each
# combination
fail_df = fail_df.rename(columns={'failure': 'comp'}) # Rename column allowing the merge
merged = maint_df.merge(fail_df,
on=['machineID', 'comp'],
suffixes=('_maintenance', '_failure'))
# 3) Filter out rows where maintenance date isn't older than failure date
merged = merged[merged['datetime_maintenance'] < merged['datetime_failure']]
# 4) Calculate the day difference between the two dates.
# NB: This assumes your date columns are strings - if they are already date objects,
# you can remove the strptime
merged['diff_days'] = merged.apply(
lambda x: (dt.datetime.strptime(x.datetime_failure, '%Y-%m-%d') -
dt.datetime.strptime(x.datetime_maintenance, '%Y-%m-%d')).days,
axis=1)
# 5) Average the day difference across machineID and comp
print(merged.groupby(['machineID', 'comp']).mean())
