I am trying to convert an existing Python code into an R equivalent. I am getting stuck on the following portion of Python code:
customer_history['new_customer_count'] = customer_history.groupby(['customer_id']).new_customer_count.ffill()\
customer_history.groupby(customer_history.new_customer_count.notnull().cumsum()).cumcount()
customer_history['return_customer_count'] = customer_history.groupby(['customer_id']).return_customer_count.ffill()\
customer_history.groupby(customer_history.return_customer_count.notnull().cumsum()).cumcount()
I know this code is taking the "customer_history" data frame:
| customer_id | date | new_customer_count | return_customer_count |
|---|---|---|---|
| 1 | 1/1/22 | 1 | N/A |
| 1 | 1/2/22 | N/A | N/A |
| 1 | 1/3/22 | N/A | N/A |
| 2 | 1/1/22 | 1 | N/A |
| 2 | 1/2/22 | N/A | N/A |
| 2 | 1/3/22 | N/A | N/A |
| 2 | 1/4/22 | N/A | 1 |
| 2 | 1/5/22 | N/A | N/A |
| 2 | 1/6/22 | N/A | N/A |
*Marks customer as "1" to indicate this customer is new/return. Customer considered "return" after three "new" visits.
And performing two steps in one action:
- It is filling all N/As for each customer id with the previous non-N/A value:
| customer_id | date | new_customer_flag | return_customer_flag |
|---|---|---|---|
| 1 | 1/1/22 | 1 | N/A |
| 1 | 1/2/22 | 1 | N/A |
| 1 | 1/3/22 | 1 | N/A |
| 2 | 1/1/22 | 1 | N/A |
| 2 | 1/2/22 | 1 | N/A |
| 2 | 1/3/22 | 1 | N/A |
| 2 | 1/4/22 | N/A | 1 |
| 2 | 1/5/22 | N/A | 1 |
| 2 | 1/6/22 | N/A | 1 |
- It overwrites the count columns with a cumulative count:
| customer_id | date | new_customer_flag | return_customer_flag |
|---|---|---|---|
| 1 | 1/1/22 | 1 | N/A |
| 1 | 1/2/22 | 2 | N/A |
| 1 | 1/3/22 | 3 | N/A |
| 2 | 1/1/22 | 1 | N/A |
| 2 | 1/2/22 | 2 | N/A |
| 2 | 1/3/22 | 3 | N/A |
| 2 | 1/4/22 | N/A | 1 |
| 2 | 1/5/22 | N/A | 2 |
| 2 | 1/6/22 | N/A | 3 |
I tried to following code in R to accomplish the same thing:
customer_history <- customer_history %>%
group_by(customer_id) %>%
fill(new_customer_count, .direction=c("down"))
customer_history <- customer_history %>%
group_by(customer_id) %>%
fill(return_customer_count, .direction=c("down"))
customer_history <- customer_history %>%
group_by(customer_id, new_customer_count) %>%
mutate(new_customer_count = seq(n()))
customer_history <- customer_history %>%
group_by(customer_id, return_customer_count) %>%
mutate(return_customer_count = seq(n()))
This worked except for when a new customer turns into a return one. This R code would just continue both its new and return count instead of ceasing the new count when the customer became a return customer, like for customer 2 (see example output below):
| customer_id | date | new_customer_flag | return_customer_flag |
|---|---|---|---|
| 1 | 1/1/22 | 1 | N/A |
| 1 | 1/2/22 | 2 | N/A |
| 1 | 1/3/22 | 3 | N/A |
| 2 | 1/1/22 | 1 | N/A |
| 2 | 1/2/22 | 2 | N/A |
| 2 | 1/3/22 | 3 | N/A |
| 2 | 1/4/22 | 4 | 1 |
| 2 | 1/5/22 | 5 | 2 |
| 2 | 1/6/22 | 6 | 3 |
How do I fix this mistake to simulate the Python code? Thank you!!
CodePudding user response:
Here is a way all in one go.
The main trick is to combine the N/A values in both columns in a new column, flag. Then use that flag to know up to where to cumsum. And cumsum is applued twice.
library(dplyr)
customer_history %>%
mutate(flag = cumsum(!is.na(new_customer_count) | !is.na(return_customer_count))) %>%
group_by(customer_id, flag) %>%
mutate(new_customer_count = cumsum(cumsum(!is.na(new_customer_count))),
return_customer_count = cumsum(cumsum(!is.na(return_customer_count))),
new_customer_count = ifelse(new_customer_count == 0, NA, new_customer_count),
return_customer_count = ifelse(return_customer_count == 0, NA, return_customer_count)) %>%
select(-flag)
#> Adding missing grouping variables: `flag`
#> # A tibble: 9 x 5
#> # Groups: customer_id, flag [3]
#> flag customer_id date new_customer_count return_customer_count
#> <int> <int> <date> <int> <int>
#> 1 1 1 2022-01-01 1 NA
#> 2 1 1 2022-02-01 2 NA
#> 3 1 1 2022-03-01 3 NA
#> 4 2 2 2022-01-01 1 NA
#> 5 2 2 2022-02-01 2 NA
#> 6 2 2 2022-03-01 3 NA
#> 7 3 2 2022-04-01 NA 1
#> 8 3 2 2022-05-01 NA 2
#> 9 3 2 2022-06-01 NA 3
Created on 2022-02-01 by the reprex package (v2.0.1)
Data
customer_history <- read.table(text = "
customer_id date new_customer_count return_customer_count
1 1/1/22 1 N/A
1 1/2/22 N/A N/A
1 1/3/22 N/A N/A
2 1/1/22 1 N/A
2 1/2/22 N/A N/A
2 1/3/22 N/A N/A
2 1/4/22 N/A 1
2 1/5/22 N/A N/A
2 1/6/22 N/A N/A
", header = TRUE, na.strings = "N/A")
customer_history$date <- as.Date(customer_history$date, "%d/%m/%y")
