Home > Back-end >  How to perform a cumulative count & sum in R?
How to perform a cumulative count & sum in R?

Time:02-02

I am trying to convert an existing Python code into an R equivalent. I am getting stuck on the following portion of Python code:

customer_history['new_customer_count'] = customer_history.groupby(['customer_id']).new_customer_count.ffill()\
                                         customer_history.groupby(customer_history.new_customer_count.notnull().cumsum()).cumcount()

customer_history['return_customer_count'] = customer_history.groupby(['customer_id']).return_customer_count.ffill()\
                                         customer_history.groupby(customer_history.return_customer_count.notnull().cumsum()).cumcount()

I know this code is taking the "customer_history" data frame:

customer_id date new_customer_count return_customer_count
1 1/1/22 1 N/A
1 1/2/22 N/A N/A
1 1/3/22 N/A N/A
2 1/1/22 1 N/A
2 1/2/22 N/A N/A
2 1/3/22 N/A N/A
2 1/4/22 N/A 1
2 1/5/22 N/A N/A
2 1/6/22 N/A N/A

*Marks customer as "1" to indicate this customer is new/return. Customer considered "return" after three "new" visits.

And performing two steps in one action:

  1. It is filling all N/As for each customer id with the previous non-N/A value:
customer_id date new_customer_flag return_customer_flag
1 1/1/22 1 N/A
1 1/2/22 1 N/A
1 1/3/22 1 N/A
2 1/1/22 1 N/A
2 1/2/22 1 N/A
2 1/3/22 1 N/A
2 1/4/22 N/A 1
2 1/5/22 N/A 1
2 1/6/22 N/A 1
  1. It overwrites the count columns with a cumulative count:
customer_id date new_customer_flag return_customer_flag
1 1/1/22 1 N/A
1 1/2/22 2 N/A
1 1/3/22 3 N/A
2 1/1/22 1 N/A
2 1/2/22 2 N/A
2 1/3/22 3 N/A
2 1/4/22 N/A 1
2 1/5/22 N/A 2
2 1/6/22 N/A 3

I tried to following code in R to accomplish the same thing:

      customer_history <- customer_history %>%
                          group_by(customer_id) %>%
                          fill(new_customer_count, .direction=c("down"))
      
      customer_history <- customer_history %>%
                          group_by(customer_id) %>%
                          fill(return_customer_count, .direction=c("down"))  
              
      customer_history <- customer_history %>%
                          group_by(customer_id, new_customer_count) %>% 
                          mutate(new_customer_count = seq(n()))

      customer_history <- customer_history %>%
                          group_by(customer_id, return_customer_count) %>% 
                          mutate(return_customer_count = seq(n()))

This worked except for when a new customer turns into a return one. This R code would just continue both its new and return count instead of ceasing the new count when the customer became a return customer, like for customer 2 (see example output below):

customer_id date new_customer_flag return_customer_flag
1 1/1/22 1 N/A
1 1/2/22 2 N/A
1 1/3/22 3 N/A
2 1/1/22 1 N/A
2 1/2/22 2 N/A
2 1/3/22 3 N/A
2 1/4/22 4 1
2 1/5/22 5 2
2 1/6/22 6 3

How do I fix this mistake to simulate the Python code? Thank you!!

CodePudding user response:

Here is a way all in one go.
The main trick is to combine the N/A values in both columns in a new column, flag. Then use that flag to know up to where to cumsum. And cumsum is applued twice.

library(dplyr)

customer_history %>%
  mutate(flag = cumsum(!is.na(new_customer_count) | !is.na(return_customer_count))) %>%
  group_by(customer_id, flag) %>%
  mutate(new_customer_count = cumsum(cumsum(!is.na(new_customer_count))),
         return_customer_count = cumsum(cumsum(!is.na(return_customer_count))),
         new_customer_count = ifelse(new_customer_count == 0, NA, new_customer_count),
         return_customer_count = ifelse(return_customer_count == 0, NA, return_customer_count)) %>%
  select(-flag)
#> Adding missing grouping variables: `flag`
#> # A tibble: 9 x 5
#> # Groups:   customer_id, flag [3]
#>    flag customer_id date       new_customer_count return_customer_count
#>   <int>       <int> <date>                  <int>                 <int>
#> 1     1           1 2022-01-01                  1                    NA
#> 2     1           1 2022-02-01                  2                    NA
#> 3     1           1 2022-03-01                  3                    NA
#> 4     2           2 2022-01-01                  1                    NA
#> 5     2           2 2022-02-01                  2                    NA
#> 6     2           2 2022-03-01                  3                    NA
#> 7     3           2 2022-04-01                 NA                     1
#> 8     3           2 2022-05-01                 NA                     2
#> 9     3           2 2022-06-01                 NA                     3

Created on 2022-02-01 by the reprex package (v2.0.1)


Data

customer_history <- read.table(text = "
customer_id     date    new_customer_count  return_customer_count
1   1/1/22  1   N/A
1   1/2/22  N/A     N/A
1   1/3/22  N/A     N/A
2   1/1/22  1   N/A
2   1/2/22  N/A     N/A
2   1/3/22  N/A     N/A
2   1/4/22  N/A     1
2   1/5/22  N/A     N/A
2   1/6/22  N/A     N/A
", header = TRUE, na.strings = "N/A")
customer_history$date <- as.Date(customer_history$date, "%d/%m/%y")
  •  Tags:  
  • Related