Summarizing a difficult dataset-CodePudding

I have a dataset that is basically formatted backwards from how I need it to perform a specific analysis. It represents entities and the articles they are found in, represented by id numbers (see below. Column headings [article 1, 2, 3, etc.] are just the 1st, 2nd, 3rd articles they appear in. The index in the cell is the substantive part). What I'd like to get is a count of how many entities appear in each article, which I think I could do with something like dplyr's group_by and summarise, but I can't find anywhere where you can apply it to a range of columns (there are actually 97 article columns in the dataset).

entity	article 1	article 2	article 3
Bach	51	72	122
Mozart	2	83	95

Two specific transformations that would be useful for me are

The number of entities in each article calculated as the count of the times each unique ID appears in an entity row. eg:

id	count
51	5424
72	1001
122	4000

The entities in each article. eg:

id	entity 1	entity 2	entity 3
51	Bach	Mozart	etc
72	Mozart	Liszt	etc

All this should be possible from this dataset, I just can't figure out how to get it into a workable format. Thanks for your help!

CodePudding user response：

For number 1, you can pivot to long format, then get the counts for each unique id for each entity using tally.

library(tidyverse)

df %>%
  pivot_longer(-entity) %>% 
  group_by(entity, value) %>% 
  tally()

# A tibble: 6 × 3
# Groups:   entity [2]
  entity value     n
  <chr>  <dbl> <int>
1 Bach      51     1
2 Bach      72     2
3 Bach     122     1
4 Mozart     2     1
5 Mozart    83     2
6 Mozart    95     1

It is a little unclear exactly what you are looking for, as the output seems different than what you describe. So, if you just want the total counts for each unique id, then you could drop entity in the group_by statement.

df %>%
  pivot_longer(-entity) %>% 
  group_by(value) %>% 
  tally()

  # A tibble: 6 × 2
    value     n
    <dbl> <int>
  1     2     1
  2    51     1
  3    72     2
  4    83     2
  5    95     1
  6   122     1

For number 2, you could do something like this:

df %>%
  pivot_longer(-entity) %>%
  group_by(value) %>%
  mutate(name = paste0("entity " , 1:n())) %>%
  pivot_wider(names_from = "name", values_from = "entity")

# A tibble: 6 × 3
# Groups:   value [6]
  value `entity 1` `entity 2`
  <dbl> <chr>      <chr>     
1    51 Bach       NA        
2    72 Bach       Bach      
3   122 Bach       NA        
4     2 Mozart     NA        
5    83 Mozart     Mozart    
6    95 Mozart     NA

Data

df <- structure(
  list(
    entity = c("Bach", "Mozart"),
    article.1 = c(51, 2),
    article.2 = c(72, 83),
    article.3 = c(122, 95),
    article.4 = c(72, 83)
  ),
  class = "data.frame",
  row.names = c(NA,-2L)
)