How do I combine multiple observations in a dataframe to create a list in a column?

Time:02-04

I would like to use the ggupset package to create an upset plot but I am struggling to format my data correctly. My data is currently in a tibble similar to the one below.

> tibble
# A tibble: 13 × 3
   locus pathway fold_change
   <chr> <chr>         <dbl>
 1 0001  A               1  
 2 0001  B               1  
 3 0001  C               1  
 4 0001  D               1  
 5 0002  B              -2  
 6 0002  D              -2  
 7 0003  C               1  
 8 0004  C               3  
 9 0004  E               3  
10 0004  F               3  
11 0004  G               3  
12 0004  H               3  
13 0005  D               2.5  

ggupset requires the pathway column to be a list-column that holds all pathways for each locus, as in the fake tibble below (the correct formatting is also shown in the tidy_movies dataset that ships with ggupset).

>fake_tibble
# A tibble: 5 x 3
    locus   pathways            fold_change
    <chr>   <list>              <dbl>
1   0001    "A" "B" "C" "D"     1
2   0002    "B" "D"             -2
3   0003    "C"                 1
4   0004    "C" "E" "F" "G" "H" 3
5   0005    "D"                 2.5

The real dataset is too large to build a list for each locus by hand, so any help wrangling this data would be appreciated.

CodePudding user response:

Group by locus and fold_change, then use summarise with list to collapse pathway into a list-column.

library(dplyr)

df %>% 
  group_by(locus, fold_change) %>% 
  summarise(pathway = list(pathway), .groups = "drop")  # one row per locus/fold_change pair

  locus fold_change pathway  
  <int>       <dbl> <list>   
1     1         1   <chr [4]>
2     2        -2   <chr [2]>
3     3         1   <chr [1]>
4     4         3   <chr [5]>
5     5         2.5 <chr [1]>

data

df <- structure(list(locus = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 
4L, 4L, 4L, 5L), pathway = c("A", "B", "C", "D", "B", "D", "C", 
"C", "E", "F", "G", "H", "D"), fold_change = c(1, 1, 1, 1, -2, 
-2, 1, 3, 3, 3, 3, 3, 2.5)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"
))
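
To connect this back to the original goal, here is a minimal sketch of passing the summarised result on to ggupset. It assumes dplyr, ggplot2 and ggupset are installed; the names upset_df and p are only illustrative, the list-column is called pathways as in the question, and geom_bar with scale_x_upset is just one common way to draw the plot.

```r
library(dplyr)
library(ggplot2)
library(ggupset)

# Example data from the question, with locus kept as character
df <- data.frame(
  locus = c("0001", "0001", "0001", "0001", "0002", "0002", "0003",
            "0004", "0004", "0004", "0004", "0004", "0005"),
  pathway = c("A", "B", "C", "D", "B", "D", "C",
              "C", "E", "F", "G", "H", "D"),
  fold_change = c(1, 1, 1, 1, -2, -2, 1, 3, 3, 3, 3, 3, 2.5)
)

# Collapse to one row per locus, with the pathways stored as a list-column
upset_df <- df %>%
  group_by(locus, fold_change) %>%
  summarise(pathways = list(pathway), .groups = "drop")

# Map the list-column to x and let ggupset draw the combination matrix
p <- ggplot(upset_df, aes(x = pathways)) +
  geom_bar() +
  scale_x_upset()
```

Printing p draws the plot. Each bar counts the loci that share exactly the same set of pathways.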

CodePudding user response:

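library(tibble)

# aggregate() returns its groups in sorted order, so sort locus the same way to keep rows aligned
tibble("locus" = sort(unique(df$locus)),
       "pathway" = aggregate(df$pathway, list(df$locus), FUN = list)$x,
       "fold_change" = aggregate(df$fold_change, list(df$locus), FUN = unique, simplify = TRUE)$x)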

If your fold_change column comes back as a list, then at least one locus has more than one distinct fold_change value. In that case you can change FUN to, for example, mean if you want to force a plain numeric vector.
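
If you want to see which loci are responsible before picking a summary function, a base R check along these lines might help. This is only a sketch on made-up toy data (locus 0002 is deliberately given two different fold_change values); the names n_distinct_fc, conflicting and fold_mean are illustrative.

```r
# Hypothetical toy data: locus "0002" has two different fold_change values
df <- data.frame(
  locus = c("0001", "0001", "0002", "0002", "0003"),
  fold_change = c(1, 1, -2, -1.5, 1)
)

# Count the distinct fold_change values per locus
n_distinct_fc <- tapply(df$fold_change, df$locus, function(x) length(unique(x)))

# Loci with more than one distinct value are the ones that force a list
conflicting <- names(n_distinct_fc)[n_distinct_fc > 1]
print(conflicting)

# Using mean instead of unique always yields one number per locus
fold_mean <- aggregate(df$fold_change, list(locus = df$locus), FUN = mean)
print(fold_mean)
```

Whether averaging is appropriate depends on what the differing values mean in your data, so it is worth inspecting the conflicting loci first.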
