How to perform a cumulative count & sum in R?-CodePudding

I am trying to convert an existing Python code into an R equivalent. I am getting stuck on the following portion of Python code:

customer_history['new_customer_count'] = customer_history.groupby(['customer_id']).new_customer_count.ffill()\
                                         customer_history.groupby(customer_history.new_customer_count.notnull().cumsum()).cumcount()

customer_history['return_customer_count'] = customer_history.groupby(['customer_id']).return_customer_count.ffill()\
                                         customer_history.groupby(customer_history.return_customer_count.notnull().cumsum()).cumcount()

I know this code is taking the "customer_history" data frame:

customer_id	date	new_customer_count	return_customer_count
1	1/1/22	1	N/A
1	1/2/22	N/A	N/A
1	1/3/22	N/A	N/A
2	1/1/22	1	N/A
2	1/2/22	N/A	N/A
2	1/3/22	N/A	N/A
2	1/4/22	N/A	1
2	1/5/22	N/A	N/A
2	1/6/22	N/A	N/A

*Marks customer as "1" to indicate this customer is new/return. Customer considered "return" after three "new" visits.

And performing two steps in one action:

It is filling all N/As for each customer id with the previous non-N/A value:

customer_id	date	new_customer_flag	return_customer_flag
1	1/1/22	1	N/A
1	1/2/22	1	N/A
1	1/3/22	1	N/A
2	1/1/22	1	N/A
2	1/2/22	1	N/A
2	1/3/22	1	N/A
2	1/4/22	N/A	1
2	1/5/22	N/A	1
2	1/6/22	N/A	1

It overwrites the count columns with a cumulative count:

customer_id	date	new_customer_flag	return_customer_flag
1	1/1/22	1	N/A
1	1/2/22	2	N/A
1	1/3/22	3	N/A
2	1/1/22	1	N/A
2	1/2/22	2	N/A
2	1/3/22	3	N/A
2	1/4/22	N/A	1
2	1/5/22	N/A	2
2	1/6/22	N/A	3

I tried to following code in R to accomplish the same thing:

      customer_history <- customer_history %>%
                          group_by(customer_id) %>%
                          fill(new_customer_count, .direction=c("down"))
      
      customer_history <- customer_history %>%
                          group_by(customer_id) %>%
                          fill(return_customer_count, .direction=c("down"))  
              
      customer_history <- customer_history %>%
                          group_by(customer_id, new_customer_count) %>% 
                          mutate(new_customer_count = seq(n()))

      customer_history <- customer_history %>%
                          group_by(customer_id, return_customer_count) %>% 
                          mutate(return_customer_count = seq(n()))

This worked except for when a new customer turns into a return one. This R code would just continue both its new and return count instead of ceasing the new count when the customer became a return customer, like for customer 2 (see example output below):

customer_id	date	new_customer_flag	return_customer_flag
1	1/1/22	1	N/A
1	1/2/22	2	N/A
1	1/3/22	3	N/A
2	1/1/22	1	N/A
2	1/2/22	2	N/A
2	1/3/22	3	N/A
2	1/4/22	4	1
2	1/5/22	5	2
2	1/6/22	6	3

How do I fix this mistake to simulate the Python code? Thank you!!

CodePudding user response：

Here is a way all in one go.
The main trick is to combine the N/A values in both columns in a new column, flag. Then use that flag to know up to where to cumsum. And cumsum is applued twice.

library(dplyr)

customer_history %>%
  mutate(flag = cumsum(!is.na(new_customer_count) | !is.na(return_customer_count))) %>%
  group_by(customer_id, flag) %>%
  mutate(new_customer_count = cumsum(cumsum(!is.na(new_customer_count))),
         return_customer_count = cumsum(cumsum(!is.na(return_customer_count))),
         new_customer_count = ifelse(new_customer_count == 0, NA, new_customer_count),
         return_customer_count = ifelse(return_customer_count == 0, NA, return_customer_count)) %>%
  select(-flag)
#> Adding missing grouping variables: `flag`
#> # A tibble: 9 x 5
#> # Groups:   customer_id, flag [3]
#>    flag customer_id date       new_customer_count return_customer_count
#>   <int>       <int> <date>                  <int>                 <int>
#> 1     1           1 2022-01-01                  1                    NA
#> 2     1           1 2022-02-01                  2                    NA
#> 3     1           1 2022-03-01                  3                    NA
#> 4     2           2 2022-01-01                  1                    NA
#> 5     2           2 2022-02-01                  2                    NA
#> 6     2           2 2022-03-01                  3                    NA
#> 7     3           2 2022-04-01                 NA                     1
#> 8     3           2 2022-05-01                 NA                     2
#> 9     3           2 2022-06-01                 NA                     3

^{Created on 2022-02-01 by the reprex package (v2.0.1)}

Data

customer_history <- read.table(text = "
customer_id     date    new_customer_count  return_customer_count
1   1/1/22  1   N/A
1   1/2/22  N/A     N/A
1   1/3/22  N/A     N/A
2   1/1/22  1   N/A
2   1/2/22  N/A     N/A
2   1/3/22  N/A     N/A
2   1/4/22  N/A     1
2   1/5/22  N/A     N/A
2   1/6/22  N/A     N/A
", header = TRUE, na.strings = "N/A")
customer_history$date <- as.Date(customer_history$date, "%d/%m/%y")