Home > Mobile >  How to Divide Dataset based on repeated Column Values in Pandas
How to Divide Dataset based on repeated Column Values in Pandas

Time:01-12

I have a dataset like this:

df = pd.DataFrame({'customer_key': [13453, 16345, 14643, 15346, 13453],
                   'data_purchased': ['08-07-2021','06-07-2021','05-09-2020','02-09-2021','01-04-2020'],
                   'price_value': [56, 45, 29, 22, 43]})

Glimpse of the dataset:

customer_key  data_purchased   price_value

13453          08-07-2021         56
16345          06-07-2021         45
14643          05-09-2020         29
15346          02-09-2021         22
13453          01-04-2020         43

I want to divide this dataset into 2 different datasets. First dataset will contain only those rows where customer_keys values are repeated (ex- customer_key = 13453 in above dataset). and the second dataset contains only those rows where there is no repeated values of customer_key. How to do that in pandas ??

CodePudding user response:

Here's one approach:

repeat_customer_mask = df.groupby("customer_key")["customer_key"].transform("size") > 1

Then you can split your DataFrame like so:

In [6]: repeat_customers = df[repeat_customer_mask]

In [7]: repeat_customers
Out[7]:
   customer_key date_purchased  price_value
0         13453     08-07-2021           56
4         13453     01-04-2020           43

In [8]: first_time_customers = df[~repeat_customer_mask]

In [9]: first_time_customers
Out[9]:
   customer_key date_purchased  price_value
1         16345     06-07-2021           45
2         14643     05-09-2020           29
3         15346     02-09-2021           22

From there, you'd probably want to .reset_index().

CodePudding user response:

You can use duplicated method to create a boolean Series denoting duplicate rows (select keep=False to mark all duplicates as True). Then use this mask to filter the DataFrame:

has_duplicates_mask = df['customer_key'].duplicated(keep=False)
df1 = df[has_duplicates_mask]
df2 = df[~has_duplicates_mask]

df1:

   customer_key data_purchased  price_value
0         13453     08-07-2021           56
4         13453     01-04-2020           43

df2:

   customer_key data_purchased  price_value
1         16345     06-07-2021           45
2         14643     05-09-2020           29
3         15346     02-09-2021           22
  •  Tags:  
  • Related