Estimating the percentage of common set members over time in a panel-CodePudding

I have a time-series panel dataset that is structured in the following way: There are 2 funds that each own different stocks at each time period.

df <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1, 1, 2,2,2,2),
  time_Q = c(1,1,1,2,2,2,2,3, 3, 1,1,2,2),
  stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")

)

> df
   fund_id time_Q stock_id
1        1      1        A
2        1      1        B
3        1      1        C
4        1      2        A
5        1      2        C
6        1      2        D
7        1      2        E
8        1      3        D
9        1      3        E
10       2      1        A
11       2      1        B
12       2      2        B
13       2      2        C

For each fund, I would like to calculate the percentage of stocks held in that current time_Q that were also held in the previous one to 2 quarters. So basically for every fund and every time_Q, I would like to have 2 columns with past 1 time_Q, past 2 time_Q which show what percentage of stocks held on that time were also present in each of those past time_Qs. Here is what the result should look like:

result <- data.frame(
  fund_id = c(1,1,1,2,2),
  time_Q = c(1,2,3,1,2),
  past_1Q = c("NA",0.5,1,"NA",0.5),
  past_2Q = c("NA","NA",0,"NA","NA")
)

>  result
  fund_id time_Q past_1Q past_2Q
1       1      1      NA      NA
2       1      2     0.5      NA
3       1      3       1       0
4       2      1      NA      NA
5       2      2     0.5      NA

I'm currently thinking about using either setdiff or intersect function but I'm not sure how to format it in the panel dataset. I'm looking for a scalable dplyr or data.table solution that would be able to cover multiple funds, stocks and time periods and also look into common elements in up to 12 lagged time-periods. I would appreciate any help as I've been stuck on this problem for quite a while.

CodePudding user response：

We can use dplyr and purrr to programmatically build up a lagged ownership variable and then summarize() across all of them using across(). First, we just need a dummy variable for ownership and group our data by fund and stock.

library(dplyr)
library(purrr)

df_grouped <- df %>%
  mutate(owned = TRUE) %>%
  group_by(fund_id, stock_id)

Then we can generate lagged ownership for each stock, based on time_Q, join all of them together, and for each fund and time_Q, calculate proportion of ownership.

map(
  1:2,
  ~df_grouped %>%
    mutate(
      "past_{.x}Q" := lag(owned, n = .x, order_by = time_Q)
    )
) %>%
  reduce(left_join, by = c("fund_id", "stock_id", "time_Q", "owned")) %>%
  group_by(fund_id, time_Q) %>%
  summarize(
    across(
      starts_with("past"),
      ~if (all(is.na(.x))) NA else sum(.x, na.rm = T) / n()
    )
  )
#> # A tibble: 5 × 4
#>   fund_id time_Q past_1Q past_2Q
#>     <dbl>  <dbl>   <dbl> <lgl>  
#> 1       1      1    NA   NA     
#> 2       1      2     0.5 NA     
#> 3       1      3     1   NA     
#> 4       2      1    NA   NA     
#> 5       2      2     0.5 NA

CodePudding user response：

Here's a dplyr-only solution:

library(dplyr)

df %>% 
  group_by(fund_id, time_Q) %>% 
  summarise(new = list(stock_id)) %>% 
  mutate(past_1Q = lag(new, 1),
         past_2Q = lag(new, 2)) %>% 
  rowwise() %>% 
  transmute(time_Q, 
            across(past_1Q:past_2Q, ~ length(intersect(new, .x)) / length(new)))

output

  fund_id time_Q past_1Q past_2Q
    <dbl>  <dbl>   <dbl>   <dbl>
1       1      1     0         0
2       1      2     0.5       0
3       1      3     1         0
4       2      1     0         0
5       2      2     0.5       0