I have a dataframe Segments with a company identifier gvkey, year Year, and two columns with an indicator of the industry the company operates in SICS1 and SICS2. I would like to aggregate the data as such that I receive the count of unique industry identicators per company/year combination.
My data frame looks as follows (indicative):
gvkey Year SICS1 SICS2
1 1209 2017 3569 3533
2 1209 2017 2813 3569
3 1209 2017 2813 3569
4 1209 2018 2813 3569
5 1209 2018 2813 7280
6 1209 2018 1908 3569
7 1209 2018 1412 3569
8 1209 2018 3569 3200
9 1503 2017 3569 3533
10 1503 2017 2813 3569
11 1503 2018 2813 3569
12 1503 2018 3569 3533
13 1503 2018 2813 3569
My desired output is should be somewhat like:
gvkey Year n_unique
1209 2017 3
1209 2018 6
1503 2017 3
1503 2018 3
What I tried to so far only gave me the unique values per column:
Segments %&%
group_by(gvkey, Year) %&%
summarize(across(SICS1:SICS2, n_distinct))
gvkey Year SICS1 SICS2
1209 2017 2 2
1209 2018 4 3
1503 2017 2 2
1503 2018 2 2
Simply summing up the unique values from SICS1 and SICS2 will not do the trick unfortunately, because it would result in counting some industry identifiers multiple times.
P.S. This is my first question on this platform. Please let me know how I can improve asking questions and thanks in advance for your help!
CodePudding user response:
You can filter the duplicated rows and then use summarize and count with group_by like this:
library(dplyr)
Segments %>%
filter(complete.cases(.) & !duplicated(.)) %>%
group_by(gvkey, Year) %>%
summarize(n_unique = n())
Output:
# A tibble: 4 × 3
# Groups: gvkey [2]
gvkey Year n_unique
<int> <int> <int>
1 1209 2017 2
2 1209 2018 5
3 1503 2017 2
4 1503 2018 2
CodePudding user response:
You should use dput to make a reproducible example as shown below. This also shows an alternative use of mutate and unique to get the desired result.
library(dplyr)
# the output from dput(df) is used to make reproducible examples
df <- structure(
list(gvkey = c(1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1503L, 1503L, 1503L, 1503L, 1503L), Year = c(2017L,
2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2017L,
2018L, 2018L, 2018L), SICS1 = c(3569L, 2813L, 2813L, 2813L, 2813L,
1908L, 1412L, 3569L, 3569L, 2813L, 2813L, 3569L, 2813L), SICS2 = c(3533L,
3569L, 3569L, 3569L, 7280L, 3569L, 3569L, 3200L, 3533L, 3569L,
3569L, 3533L, 3569L)), class = "data.frame", row.names = c(NA,
-13L))
summary <-
df %>%
mutate(combination=paste(SICS1, SICS2)) %>%
group_by(gvkey, Year) %>%
unique() %>%
summarize(count = n())
summary
# A tibble: 4 × 3
# Groups: gvkey [2]
gvkey Year count
<int> <int> <int>
1 1209 2017 2
2 1209 2018 5
3 1503 2017 2
4 1503 2018 2
