R noob here, please bear with me!
I have a dataset in full tidy form that I'm trying to summarise. The records each have a state, year, and two different categories, which for now we'll say take on levels of A, B, C and X, Y, Z:
state year cat1 cat2
AK 2010 A X
AK 2010 B Z
AK 2010 A Y
AK 2010 C Z
...
(For scale, there are ~500k records)
For each state-year combination, I'm trying to get a cross-tab of how many records there are with each combination of cat1 and cat2 level. I can do this with group_by and summarise:
data %>%
group_by(state, year, cat1, cat2) %>%
summarise(count = n())
Which yields something like:
state year cat1 cat2 count
AK 2010 A X 32
AK 2010 A Y 501
AK 2010 A Z 109
AK 2010 B X 47
AK 2010 B Y 670
AK 2010 B Z 38
AK 2010 C X 812
AK 2010 C Y 17
AK 2010 C Z 294
...
And so on for each state-year combo.
BUT the problem is that for some state-year combos, certain combinations of cat1 and cat2 don't exist, mostly b/c of small sample sizes. For instance:
state year cat1 cat2 count
RI 2010 A X 10
RI 2010 A Y 17
RI 2010 B Y 4
RI 2010 C X 32
RI 2010 C Z 12
...
In these cases, I would like to force the result of group_by summarise to show all nine possible cat1 and cat2 pairs for each state-year combination, filling in zero values if there are no records of that combination. So the example above should then be:
state year cat1 cat2 count
RI 2010 A X 10
RI 2010 A Y 17
RI 2010 A Z 0
RI 2010 B X 0
RI 2010 B Y 4
RI 2010 B Z 0
RI 2010 C X 32
RI 2010 C Y 0
RI 2010 C Z 12
...
How do I force these explicit zero values for the missing combinations?
(I have looked some at complete but cannot figure out how to make it work here, if it's relevant.)
CodePudding user response:
You can accomplish this with
- coercing the columns you're grouping by to factors
- setting
.drop = FALSEin thegroup_bystatement.
Example:
library(tidyverse)
df <- tibble(
x = factor(letters[1:5]),
y = factor(c("x", "x", "y", "y", "z")),
z = runif(length(x))
)
df %>%
group_by(x, y) %>%
summarize(z = mean(z), .groups = "drop")
#> # A tibble: 5 × 3
#> x y z
#> <fct> <fct> <dbl>
#> 1 a x 0.0468
#> 2 b x 0.392
#> 3 c y 0.0914
#> 4 d y 0.793
#> 5 e z 0.741
df %>%
group_by(x, y, .drop = FALSE) %>% # attn
summarize(z = mean(z), .groups = "drop")
#> # A tibble: 15 × 3
#> x y z
#> <fct> <fct> <dbl>
#> 1 a x 0.0468
#> 2 a y NaN
#> 3 a z NaN
#> 4 b x 0.392
#> 5 b y NaN
#> 6 b z NaN
#> 7 c x NaN
#> 8 c y 0.0914
#> 9 c z NaN
#> 10 d x NaN
#> 11 d y 0.793
#> 12 d z NaN
#> 13 e x NaN
#> 14 e y NaN
#> 15 e z 0.741
Created on 2022-10-26 with reprex v2.0.2
CodePudding user response:
If you don't want to coerce to a factor, you can use complete
library(tidyverse)
data %>%
count(state, year, cat1, cat2) %>%
complete(expand(., cat1, cat2), fill = list(n = 0)) %>%
fill(state, year) %>%
select(state, year, cat1, cat2, n)
#> # A tibble: 9 x 5
#> state year cat1 cat2 n
#> <chr> <int> <chr> <chr> <int>
#> 1 AK 2010 A X 1
#> 2 AK 2010 A Y 1
#> 3 AK 2010 A Z 0
#> 4 AK 2010 B X 0
#> 5 AK 2010 B Y 0
#> 6 AK 2010 B Z 1
#> 7 AK 2010 C X 0
#> 8 AK 2010 C Y 0
#> 9 AK 2010 C Z 1
Created on 2022-10-26 with reprex v2.0.2
Data taken from question
data <- structure(list(state = c("AK", "AK", "AK", "AK"), year = c(2010L,
2010L, 2010L, 2010L), cat1 = c("A", "B", "A", "C"), cat2 = c("X",
"Z", "Y", "Z")), class = "data.frame", row.names = c(NA, -4L))
