Given an example dataframe:
example_df = pd.DataFrame({"app_id": [1,2,3,4,5,6] ,
"payment_date":["2021-01-01", "2021-02-01", "2020-03-02", "2020-04-05", "2020-01-05","2020-01-04"],
"user_id": [12,12,12,13,13,13],
"application_date":["2021-02-01", "2021-02-01", "2020-03-02", "2020-04-05", "2020-01-05", "2020-01-04"] , "flag": [1,0,0,1,0,1], "order_column": [1,2,3,4,5, 6]})
What should be done is:
- I will explain what I want to do with an example:
- Iterate through all rows
- If the flag column is equal to 1 do as stated below
- For the first row
flagcolumn is 1 and theuser_idfor the row is 12. Look at all instances withuser_id= 12 and compare theirapplication_datewith thepayment_dateof the first row. We see that the second row has anapplication_dategreater than thepayment_dateof the first row. Then the label of the first row is 1. Third row also belongs touser_id= 12 but itsapplication_dateis not greater than the payment_date of the first row. If there is one or more observation that hasapplication_dategreater thanpayment_dateof the first row, overall label of the first row is 1. If there are no such observations the overall label is 0.
I wrote the code for this with iterrows, but I want a more compact vectorized solution since iterrows can be slow with larger datasets. Like
example_df.groupby("something").filter(lambda row: row. ...)
My code is:
labels_dict = {}
for idx, row in example_df.iterrows():
if row.flag == 1:
app_id = row.app_id
user_id = row.user_id
user_df = example_df[example_df.user_id == user_id]
labelss = []
for idx2, row2 in user_df.iterrows():
if (row2.order_column != row.order_column) & (row.payment_date < row2.application_date):
label = 1
labelss.append(label)
elif (row2.order_column != row.order_column) & (row.payment_date >= row2.application_date):
label = 0
labelss.append(label)
labels_dict[app_id] = labelss
final_labels = {}
for key, value in labels_dict.items():
if 1 in value:
final_labels[key] = 1
else:
final_labels[key] = 0
final_labels is the expected output. Basically, I am asking for all rows with flag= 1 to be labelled as 1 or 0 given the criteria I explained.
Desired output :
{1: 1, 4: 0, 6: 1}
Here keys are app_id and values are labels (either 0 or 1)
CodePudding user response:
I would first built a temp dataframe with the only rows having 1 in flag and merge it with the full dataframe on user_id.
Then I will add a new boolean column being true if application_date is greater than payment_date and if the original app_id is different from the on from temp (ie different rows)
Finally it will be enough to count the number of true values per app_id and give a 1 if the number is greater than 0.
Pandas code could be:
tmp = example_df.loc[example_df['flag'] == 1,
['app_id', 'user_id', 'payment_date']]
tmp = tmp.merge(example_df.drop(columns = 'payment_date'), on='user_id')
tmp['k'] = ((tmp['app_id_x'] != tmp['app_id_y'])
& (tmp['application_date'] > tmp['payment_date']))
d = (tmp.groupby('app_id_x')['k'].sum() != 0).astype('int').to_dict()
With your data, it gives as expected:
{1: 1, 4: 0, 6: 1}
CodePudding user response:
(i) Convert all dates to datetime objects
(ii) groupby "user_id" and for each group find the first "payment_date" using first and transform it for the entire DataFrame. Then compare it with the "application_date"s using lt (less than).
(iii) groupby "user_id" again to find how many entries satisfy the condition and assign values depending on whether the sum is greater than 1 or not.
example_df['payment_date'] = pd.to_datetime(example_df['payment_date'])
example_df['application_date'] = pd.to_datetime(example_df['application_date'])
example_df['flag_cumsum'] = example_df['flag'].cumsum()
example_df['first_payment_date < application_date'] = (example_df
.groupby(['flag_cumsum','user_id'])['payment_date']
.transform('first')
.lt(example_df['application_date']))
out = (example_df.groupby('flag_cumsum').agg({'app_id':'first',
'first_payment_date < application_date':'sum'})
.set_index('app_id')['first_payment_date < application_date']
.gt(0).astype(int)
.to_dict())
Output:
{1: 1, 4: 0}
