I have a dataset that is basically formatted backwards from how I need it to perform a specific analysis. It represents entities and the articles they are found in, represented by id numbers (see below. Column headings [article 1, 2, 3, etc.] are just the 1st, 2nd, 3rd articles they appear in. The index in the cell is the substantive part). What I'd like to get is a count of how many entities appear in each article, which I think I could do with something like dplyr's group_by and summarise, but I can't find anywhere where you can apply it to a range of columns (there are actually 97 article columns in the dataset).
| entity | article 1 | article 2 | article 3 |
|---|---|---|---|
| Bach | 51 | 72 | 122 |
| Mozart | 2 | 83 | 95 |
Two specific transformations that would be useful for me are
- The number of entities in each article calculated as the count of the times each unique ID appears in an entity row. eg:
| id | count |
|---|---|
| 51 | 5424 |
| 72 | 1001 |
| 122 | 4000 |
- The entities in each article. eg:
| id | entity 1 | entity 2 | entity 3 |
|---|---|---|---|
| 51 | Bach | Mozart | etc |
| 72 | Mozart | Liszt | etc |
All this should be possible from this dataset, I just can't figure out how to get it into a workable format. Thanks for your help!
CodePudding user response:
For number 1, you can pivot to long format, then get the counts for each unique id for each entity using tally.
library(tidyverse)
df %>%
pivot_longer(-entity) %>%
group_by(entity, value) %>%
tally()
# A tibble: 6 × 3
# Groups: entity [2]
entity value n
<chr> <dbl> <int>
1 Bach 51 1
2 Bach 72 2
3 Bach 122 1
4 Mozart 2 1
5 Mozart 83 2
6 Mozart 95 1
It is a little unclear exactly what you are looking for, as the output seems different than what you describe. So, if you just want the total counts for each unique id, then you could drop entity in the group_by statement.
df %>%
pivot_longer(-entity) %>%
group_by(value) %>%
tally()
# A tibble: 6 × 2
value n
<dbl> <int>
1 2 1
2 51 1
3 72 2
4 83 2
5 95 1
6 122 1
For number 2, you could do something like this:
df %>%
pivot_longer(-entity) %>%
group_by(value) %>%
mutate(name = paste0("entity " , 1:n())) %>%
pivot_wider(names_from = "name", values_from = "entity")
# A tibble: 6 × 3
# Groups: value [6]
value `entity 1` `entity 2`
<dbl> <chr> <chr>
1 51 Bach NA
2 72 Bach Bach
3 122 Bach NA
4 2 Mozart NA
5 83 Mozart Mozart
6 95 Mozart NA
Data
df <- structure(
list(
entity = c("Bach", "Mozart"),
article.1 = c(51, 2),
article.2 = c(72, 83),
article.3 = c(122, 95),
article.4 = c(72, 83)
),
class = "data.frame",
row.names = c(NA,-2L)
)
