averaging mass spec peak counts by sample column names

Hopefully this is straightforward, and I'm just thinking too hard. I have a matrix of peak counts from mass spec (MS) where rows are peaks and columns are samples. Each sampling location has several sampling sites, and I would like to add up the counts across the sites within each location.

For example, one location with three sites is identified as "S19S_0010_Sed_Field_ICR.D_p2", "S19S_0010_Sed_Field_ICR.M_p2", and "S19S_0010_Sed_Field_ICR.U_p2", where it's the same location sampled downstream (D), midstream (M), and upstream (U). The first two samples each have one count of a specific peak, so I would like to merge the three samples into a single column, "S19S_0010_Sed_Field_ICR.all_p2", with two counts of that peak. Example dataset:

> dput(data.sed.ex)
structure(list(S19S_0004_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0, 
0, 0, 0, 0, 0, 0), S19S_0006_Sed_Field_ICR.D_p2 = c(0, 0, 0, 
0, 0, 0, 1, 1, 0, 0), S19S_0006_Sed_Field_ICR.M_p2 = c(0, 0, 
0, 0, 0, 0, 1, 0, 0, 0), S19S_0006_Sed_Field_ICR.U_p2 = c(0, 
0, 0, 0, 0, 0, 1, 1, 0, 0), S19S_0008_Sed_Field_ICR.M_p15 = c(0, 
0, 0, 0, 0, 0, 0, 1, 0, 0), S19S_0009_Sed_Field_ICR.M_p2 = c(0, 
0, 1, 0, 0, 0, 1, 0, 0, 0), S19S_0009_Sed_Field_ICR.U_p2 = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.D_p15 = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.M_p15 = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.U_p15 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c("200.002276", "200.015107", 
"200.0564158", "200.0565393", "200.0578394", "200.0677581", "200.092796", 
"200.1291723", "200.1292836", "200.9238455"), class = "data.frame")

TIA

CodePudding user response：

Maybe wrangling to a long format can help. In this format, you can summarize by group (e.g. by sample, or by sample and location) using sum, mean, sd, among others.

Hope this helps.

Convert to long format

## dd is the `data.sed.ex` object above

> library(tidyverse)
> ddLong <- dd %>%
      rownames_to_column(var = "peak") %>%                        ## rownames to column
      pivot_longer(cols = matches("^S")) %>%                      ## pivot longer
      mutate(sample = gsub("(.*)\\.(.*)", "\\1", name),           ## pull sample info
             location = gsub("(.*)\\.([DMU])_(.*)", "\\2", name), ## pull D M U
             p = gsub("(.*)\\.([DMU])_(p.*)", "\\3", name),       ## get p2, p15
             peak = as.numeric(peak))                             ## coerce peak to numeric
> ddLong
# A tibble: 100 × 6
    peak name                          value sample               location p
   <dbl> <chr>                         <dbl> <chr>                <chr>    <chr>
 1  200. S19S_0004_Sed_Field_ICR.M_p15     0 S19S_0004_Sed_Field… M        p15
 2  200. S19S_0006_Sed_Field_ICR.D_p2      0 S19S_0006_Sed_Field… D        p2
 3  200. S19S_0006_Sed_Field_ICR.M_p2      0 S19S_0006_Sed_Field… M        p2
 4  200. S19S_0006_Sed_Field_ICR.U_p2      0 S19S_0006_Sed_Field… U        p2
 5  200. S19S_0008_Sed_Field_ICR.M_p15     0 S19S_0008_Sed_Field… M        p15
 6  200. S19S_0009_Sed_Field_ICR.M_p2      0 S19S_0009_Sed_Field… M        p2
 7  200. S19S_0009_Sed_Field_ICR.U_p2      0 S19S_0009_Sed_Field… U        p2
 8  200. S19S_0010_Sed_Field_ICR.D_p15     0 S19S_0010_Sed_Field… D        p15
 9  200. S19S_0010_Sed_Field_ICR.M_p15     0 S19S_0010_Sed_Field… M        p15
10  200. S19S_0010_Sed_Field_ICR.U_p15     0 S19S_0010_Sed_Field… U        p15
# … with 90 more rows

Summarize by one or more groups

> ## summarise using group_by verbs
> ddLong %>%
      group_by(sample, location) %>%
      summarise(n = n(),
                sum.value = sum(value),
                mean.peak = mean(peak))
`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
# A tibble: 10 × 5
# Groups:   sample [5]
   sample                  location     n sum.value mean.peak
   <chr>                   <chr>    <int>     <dbl>     <dbl>
 1 S19S_0004_Sed_Field_ICR M           10         0      200.
 2 S19S_0006_Sed_Field_ICR D           10         2      200.
 3 S19S_0006_Sed_Field_ICR M           10         1      200.
 4 S19S_0006_Sed_Field_ICR U           10         2      200.
 5 S19S_0008_Sed_Field_ICR M           10         1      200.
 6 S19S_0009_Sed_Field_ICR M           10         2      200.
 7 S19S_0009_Sed_Field_ICR U           10         1      200.
 8 S19S_0010_Sed_Field_ICR D           10         1      200.
 9 S19S_0010_Sed_Field_ICR M           10         1      200.
10 S19S_0010_Sed_Field_ICR U           10         0      200.
> ddLong %>%
      group_by(sample, p) %>%
      summarise(n = n(),
                sum.value = sum(value),
                mean.peak = mean(peak))
`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
# A tibble: 5 × 5
# Groups:   sample [5]
  sample                  p         n sum.value mean.peak
  <chr>                   <chr> <int>     <dbl>     <dbl>
1 S19S_0004_Sed_Field_ICR p15      10         0      200.
2 S19S_0006_Sed_Field_ICR p2       30         5      200.
3 S19S_0008_Sed_Field_ICR p15      10         1      200.
4 S19S_0009_Sed_Field_ICR p2       20         3      200.
5 S19S_0010_Sed_Field_ICR p15      30         2      200.
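
The grouped summaries above add the counts over all ten peaks. If the goal is the one stated in the question (keep one row per peak and add the D, M and U columns of each location into a single ".all_" column), a minimal sketch could look like the following. It assumes every column name contains exactly one ".D_", ".M_" or ".U_" site tag. The names `merged` and `dd_all` are made up for illustration, and `dd` is rebuilt here from the question's dput() output so the block runs on its own.

```r
library(tidyverse)

## Example data copied from the question's dput() output
dd <- data.frame(
  S19S_0004_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  S19S_0006_Sed_Field_ICR.D_p2  = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0),
  S19S_0006_Sed_Field_ICR.M_p2  = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  S19S_0006_Sed_Field_ICR.U_p2  = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0),
  S19S_0008_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  S19S_0009_Sed_Field_ICR.M_p2  = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  S19S_0009_Sed_Field_ICR.U_p2  = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  S19S_0010_Sed_Field_ICR.D_p15 = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  S19S_0010_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  S19S_0010_Sed_Field_ICR.U_p15 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  row.names = c("200.002276", "200.015107", "200.0564158", "200.0565393",
                "200.0578394", "200.0677581", "200.092796", "200.1291723",
                "200.1292836", "200.9238455")
)

dd_all <- dd %>%
  rownames_to_column(var = "peak") %>%                    ## keep peak as text so row names survive
  pivot_longer(cols = -peak) %>%                          ## one row per peak/sample pair
  mutate(merged = sub("\\.[DMU]_", ".all_", name)) %>%    ## assumed naming: ".D_" -> ".all_"
  group_by(peak, merged) %>%
  summarise(value = sum(value), .groups = "drop") %>%     ## add counts across sites
  pivot_wider(names_from = merged, values_from = value) %>%
  column_to_rownames(var = "peak")                        ## back to peaks as row names

dd_all
```

With the example data, peak 200.092796 ends up with 3 counts in S19S_0006_Sed_Field_ICR.all_p2 and 2 counts in S19S_0010_Sed_Field_ICR.all_p15. Locations with a single site, such as S19S_0004, keep their counts and only get the renamed column. Swapping `sum()` for `mean()` would give the average across sites instead.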