Home > OS >  Summarizing a difficult dataset
Summarizing a difficult dataset

Time:01-05

I have a dataset that is basically formatted backwards from how I need it to perform a specific analysis. It represents entities and the articles they are found in, represented by id numbers (see below. Column headings [article 1, 2, 3, etc.] are just the 1st, 2nd, 3rd articles they appear in. The index in the cell is the substantive part). What I'd like to get is a count of how many entities appear in each article, which I think I could do with something like dplyr's group_by and summarise, but I can't find anywhere where you can apply it to a range of columns (there are actually 97 article columns in the dataset).

entity article 1 article 2 article 3
Bach 51 72 122
Mozart 2 83 95

Two specific transformations that would be useful for me are

  1. The number of entities in each article calculated as the count of the times each unique ID appears in an entity row. eg:
id count
51 5424
72 1001
122 4000
  1. The entities in each article. eg:
id entity 1 entity 2 entity 3
51 Bach Mozart etc
72 Mozart Liszt etc

All this should be possible from this dataset, I just can't figure out how to get it into a workable format. Thanks for your help!

CodePudding user response:

For number 1, you can pivot to long format, then get the counts for each unique id for each entity using tally.

library(tidyverse)

df %>%
  pivot_longer(-entity) %>% 
  group_by(entity, value) %>% 
  tally()

# A tibble: 6 × 3
# Groups:   entity [2]
  entity value     n
  <chr>  <dbl> <int>
1 Bach      51     1
2 Bach      72     2
3 Bach     122     1
4 Mozart     2     1
5 Mozart    83     2
6 Mozart    95     1

It is a little unclear exactly what you are looking for, as the output seems different than what you describe. So, if you just want the total counts for each unique id, then you could drop entity in the group_by statement.

df %>%
  pivot_longer(-entity) %>% 
  group_by(value) %>% 
  tally()

  # A tibble: 6 × 2
    value     n
    <dbl> <int>
  1     2     1
  2    51     1
  3    72     2
  4    83     2
  5    95     1
  6   122     1

For number 2, you could do something like this:

df %>%
  pivot_longer(-entity) %>%
  group_by(value) %>%
  mutate(name = paste0("entity " , 1:n())) %>%
  pivot_wider(names_from = "name", values_from = "entity")

# A tibble: 6 × 3
# Groups:   value [6]
  value `entity 1` `entity 2`
  <dbl> <chr>      <chr>     
1    51 Bach       NA        
2    72 Bach       Bach      
3   122 Bach       NA        
4     2 Mozart     NA        
5    83 Mozart     Mozart    
6    95 Mozart     NA    

Data

df <- structure(
  list(
    entity = c("Bach", "Mozart"),
    article.1 = c(51, 2),
    article.2 = c(72, 83),
    article.3 = c(122, 95),
    article.4 = c(72, 83)
  ),
  class = "data.frame",
  row.names = c(NA,-2L)
)
  •  Tags:  
  • Related