There are two dataframes. The first one is named products_purchased, the second one is named products_suggested.
In the products_purchased alias(pp) table there is a customer_id (unique) column, an item_id column, and a purchased column (with the value of 1).
The products_suggested alias (ps) table has a customer_id (non-unique) and an item_id column. This table has more customer_id's than the product_product purchased table, as not all customers who are suggested items, purchase them.
I would like to join the two tables, retaining the purchased column for the places where ps.customer_id (non-unique) and an ps.item_id match pp.customer_id (unique) column, an pp.item_id. I would also like to keep any records where pp.customer_id (unique) match ps.customer_id(non-unique).
The idea is to have a table where only the records relate to the customers who went on to purchase an item. That item would be labeled with a 1 in the purchased column, their other suggested items would be labeled 0.
Product Suggested Table
customer_id|item_id |
---------- ---------------
| 16413| 51654|
| 16413| 75950|
| 16413| 1366117|
| 78450| 56107|
| 94038| 72358|
| 94038| 1451889|
| 113067| 75077|
| 89578| 53279|
Product Purchased Table
customer_id|item_id |purchased
----------- -------------- ---------
| 16413| 75950| 1|
| 78450| 56107| 1|
| 94038| 72358| 1|
Final Table
customer_id|item_id |purchased
----------- -------------- ---------
| 16413| 51654| 0|
| 16413| 75950| 1|
| 16413| 1366117| 0|
| 78450| 56107| 1|
| 94038| 72358| 1|
| 94038| 1451889| 0|
I tried a left join on customer_id and item_suggested. I got what I expected, a table with all the suggested items regardless of their customer purchased, then the purchased status attached:
final = products_suggested.join(
products_purchased,on =["customer_id",'item_id'], how= 'left')
Final Table
customer_id|item_id |purchased
----------- -------------- ---------
| 16413| 51654| 0|
| 16413| 75950| 1|
| 16413| 1366117| 0|
| 78450| 56107| 1|
| 94038| 72358| 1|
| 94038| 1451889| 0|
| 113067| 75077| 0|
| 89578| 53279| 0|
I tried an inner join as well on just the customer_id. That made it so all my purchased columns were 1. I'm guessing thats because anywhere a customer_id matched the purchased version, it just placed the 1.
I also tried filtering after the left join .where(pp['customer_id']==ps['customer_id]), but that didn't seem to work either.
CodePudding user response:
I created another dataframe that only had the customer_id from the product purchased. Then I joined the merged left join table I tried above, with an inner join. This filtered out the remaining non purchasing customers.
purchase_only_customers = left_join_table.join(purchase_table, on =["customer_id"], how= 'inner')
