Home > Enterprise >  How to fill in missing rows by extending last available row?
How to fill in missing rows by extending last available row?

Time:01-21

I have a database where each row has both a calendar month ("Period_1") and number of months elapsed since element inception ("Period_2") designation.

For analytical purposes, the data is parsed using both Period_1 and Period_2 groupings (using Shiny, which has been removed from the below example for simplicity).

The problem is when grouping by months elapsed since element inception (Period_2), there are fewer rows when the underlying element was originated closer to the Period_1 terminal period (2020-04 in this example). For example, in the below data frame "data" you can see how two elements are originated through 2020-04, but the first element is originated two months later than the second element (calendar-wise) and therefore there are only two months elapsed (or two "Period_2's") for that first element versus four periods elapsed for the second element.

Data frame and code that generates it:

> data
  ID Period_1 Period_2 ColA ColB
1  1  2020-03        1   10   15
2  1  2020-04        2   20   25
3  2  2020-01        1   30   35
4  2  2020-02        2   40   45
5  2  2020-03        3   50   55
6  2  2020-04        4   52   87
    
data <- data.frame(
    ID = c(1,1,2,2,2,2),
    Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04"),
    Period_2 = c(1, 2, 1, 2, 3, 4),
    ColA = c(10, 20, 30, 40, 50, 52),
    ColB = c(15, 25, 35, 45, 55, 87)
    )

Now grouping the data by Period_2 with the code generating it shown beneath:

> groupData
# A tibble: 4 x 3
  Period_2  ColA  ColB   [Notes...]
     <dbl> <dbl> <dbl>
1        1    40    50   [Elements 1 and 2 included]
2        2    60    70   [Elements 1 and 2 included]
3        3    50    55   [Element 2 only]
4        4    52    87   [Element 2 only]
    
groupData <- 
  data %>%
    group_by(Period_2) %>%
      select("ColA","ColB") %>%
      summarise(across(everything(), sum))

What I would like to do is when a given ID has a maximum Period_2 (maximum Period_2 in this data frame example for element 1 = two) less than than the maximum for the entire data frame (maximum in this data frame example is four), then additional (the missing) rows are placed underneath that row, copying down the last available ColA and ColB amounts for that ID. (In my actual dataset, there has been so much decay prior to the terminal month that remaining amounts aren't very material so copying down is OK - but they sure do throw off ratio calculations as row counts decrease). So in this example, the data frame would be extended as follows:

> data
  ID Period_1 Period_2 ColA ColB   [Notes...]
1  1  2020-03        1   10   15   [Original data OK]
2  1  2020-04        2   20   25   [Original data OK]
3  1  NA             3   20   25   [Copy values down from last available Period_2 (=2) to fill in missing periods 3 and 4] 
4  1  NA             4   20   25   [Copy values down from last available Period_2 (=2) to fill in missing periods 3 and 4]
5  2  2020-01        1   30   35   [Original data OK]
6  2  2020-02        2   40   45   [Original data OK]
7  2  2020-03        3   50   55   [Original data OK]
8  2  2020-04        4   52   87   [Original data OK]

This data frame extension only needs to happen when the data is parsed using Period_2. Everything is fine when data is parsed using Period_1: no extension necessary.

Any ideas how to do this? Some sort of apply() function might work? Does dplyr() have a solution?

CodePudding user response:

I will give you a simple pure tidyverse solution. First, you have to expand grid by combinations of the variables, you can use expand() or complete() to make implicit missing values explicitly missing. Then you want to do LOCF (last observation carried forward), this can be achieved either by the fill argument inside complete() or by using fill() function. All there functions are inside the tidyr package.

library(tidyverse)

data <- data.frame(
  ID = c(1,1,2,2,2,2),
  Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04"),
  Period_2 = c(1, 2, 1, 2, 3, 4),
  ColA = c(10, 20, 30, 40, 50, 52),
  ColB = c(15, 25, 35, 45, 55, 87)
)

data %>%
  tidyr::complete(ID, nesting(Period_2)) %>%
  tidyr::fill(ColA, ColB, .direction = "down")
#> # A tibble: 8 x 5
#>      ID Period_2 Period_1  ColA  ColB
#>   <dbl>    <dbl> <chr>    <dbl> <dbl>
#> 1     1        1 2020-03     10    15
#> 2     1        2 2020-04     20    25
#> 3     1        3 <NA>        20    25
#> 4     1        4 <NA>        20    25
#> 5     2        1 2020-01     30    35
#> 6     2        2 2020-02     40    45
#> 7     2        3 2020-03     50    55
#> 8     2        4 2020-04     52    87

Created on 2022-01-21 by the reprex package (v2.0.1)

CodePudding user response:

And a data.table approach:

library(data.table)

data <- data.frame(
  ID = c(1,1,2,2,2,2),
  Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04"),
  Period_2 = c(1, 2, 1, 2, 3, 4),
  ColA = c(10, 20, 30, 40, 50, 52),
  ColB = c(15, 25, 35, 45, 55, 87)
)

setDT(data)

vals <- expand.grid(ID = unique(data$ID),
                    Period_2 = unique(data$Period_2))

setDT(vals)

data_final = merge(vals,data, on=c("ID", "Period_2"), all=TRUE)
setcolorder(data_final, c("ID", "Period_1", "Period_2", "ColA", "ColB"))

data_final[, `:=`(ColA= data_final[!is.na(ColA)][data_final, ColA, roll = T],
                  ColB= data_final[!is.na(ColB)][data_final, ColB, roll = T])]

Output:

> data_final
   ID Period_1 Period_2 ColA ColB
1:  1  2020-03        1   10   15
2:  1  2020-04        2   20   25
3:  1     <NA>        3   20   25
4:  1     <NA>        4   20   25
5:  2  2020-01        1   30   35
6:  2  2020-02        2   40   45
7:  2  2020-03        3   50   55
8:  2  2020-04        4   52   87
  •  Tags:  
  • Related