Hopefully this is straightforward and I'm just overthinking it. I have a matrix of peak counts from mass spectrometry (MS) where rows are peaks and columns are samples. Each sampling location has several sampling sites, and I would like to sum the counts across the sites within each location.
For example, one sample with three replicates is identified as "S19S_0010_Sed_Field_ICR.D_p2", "S19S_0010_Sed_Field_ICR.M_p2", and "S19S_0010_Sed_Field_ICR.U_p2": the same location sampled downstream (D), midstream (M), and upstream (U). The first two samples each have one count of a specific peak, so I would like to merge the three samples into a single column, "S19S_0010_Sed_Field_ICR.all_p2", with two counts of that peak. Example dataset:
> dput(data.sed.ex)
structure(list(S19S_0004_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), S19S_0006_Sed_Field_ICR.D_p2 = c(0, 0, 0,
0, 0, 0, 1, 1, 0, 0), S19S_0006_Sed_Field_ICR.M_p2 = c(0, 0,
0, 0, 0, 0, 1, 0, 0, 0), S19S_0006_Sed_Field_ICR.U_p2 = c(0,
0, 0, 0, 0, 0, 1, 1, 0, 0), S19S_0008_Sed_Field_ICR.M_p15 = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0), S19S_0009_Sed_Field_ICR.M_p2 = c(0,
0, 1, 0, 0, 0, 1, 0, 0, 0), S19S_0009_Sed_Field_ICR.U_p2 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.D_p15 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.M_p15 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.U_p15 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c("200.002276", "200.015107",
"200.0564158", "200.0565393", "200.0578394", "200.0677581", "200.092796",
"200.1291723", "200.1292836", "200.9238455"), class = "data.frame")
TIA
CodePudding user response:
Reshaping the data to a long format may help. In that format you can summarise by one or more groups (for example by sample, or by sample and location) using functions such as sum, mean, or sd. Hope this helps.
Convert to long format
> library(tidyverse)
> ## dd is the `data.sed.ex` object above
> dd <- data.sed.ex
> ddLong <- dd %>%
    rownames_to_column(var = "peak") %>%                       ## rownames to column
    pivot_longer(cols = matches("^S")) %>%                     ## pivot longer
    mutate(sample = gsub("(.*)\\.(.*)", "\\1", name),          ## pull sample info
           location = gsub("(.*)\\.([DMU])_(.*)", "\\2", name), ## pull D M U
           p = gsub("(.*)\\.([DMU])_(p.*)", "\\3", name),       ## get p2, p15
           peak = as.numeric(peak))                             ## coerce peak to numeric
> ddLong
# A tibble: 100 × 6
    peak name                          value sample               location p
   <dbl> <chr>                         <dbl> <chr>                <chr>    <chr>
 1  200. S19S_0004_Sed_Field_ICR.M_p15     0 S19S_0004_Sed_Field… M        p15
 2  200. S19S_0006_Sed_Field_ICR.D_p2      0 S19S_0006_Sed_Field… D        p2
 3  200. S19S_0006_Sed_Field_ICR.M_p2      0 S19S_0006_Sed_Field… M        p2
 4  200. S19S_0006_Sed_Field_ICR.U_p2      0 S19S_0006_Sed_Field… U        p2
 5  200. S19S_0008_Sed_Field_ICR.M_p15     0 S19S_0008_Sed_Field… M        p15
 6  200. S19S_0009_Sed_Field_ICR.M_p2      0 S19S_0009_Sed_Field… M        p2
 7  200. S19S_0009_Sed_Field_ICR.U_p2      0 S19S_0009_Sed_Field… U        p2
 8  200. S19S_0010_Sed_Field_ICR.D_p15     0 S19S_0010_Sed_Field… D        p15
 9  200. S19S_0010_Sed_Field_ICR.M_p15     0 S19S_0010_Sed_Field… M        p15
10  200. S19S_0010_Sed_Field_ICR.U_p15     0 S19S_0010_Sed_Field… U        p15
# … with 90 more rows
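The name parsing above assumes every column name follows the pattern <sample>.<site>_<p>. A name that does not match is returned unchanged by gsub()/sub(), so it can end up in the sample, location, and p columns without any warning. Below is a minimal base R sketch of a check that could be run before pivoting. The three names are copied from the example data; `Blank_01` is a hypothetical non-matching name used only for illustration.

```r
# Column names taken from the example data
nms <- c("S19S_0004_Sed_Field_ICR.M_p15",
         "S19S_0006_Sed_Field_ICR.D_p2",
         "S19S_0010_Sed_Field_ICR.U_p15")

# One anchored pattern: sample, site letter (D/M/U), and p tag
pattern <- "^(.*)\\.([DMU])_(p.*)$"

# Stop early if any name does not follow the expected layout
stopifnot(all(grepl(pattern, nms)))

# Split each name into its three parts
parts <- data.frame(
  name     = nms,
  sample   = sub(pattern, "\\1", nms),
  location = sub(pattern, "\\2", nms),
  p        = sub(pattern, "\\3", nms),
  stringsAsFactors = FALSE
)
parts

# Hypothetical non-matching name: sub() returns it unchanged
bad <- "Blank_01"
unmatched <- sub(pattern, "\\2", bad)
unmatched
```

If the real dataset contains columns that do not follow this layout (blanks, standards, and so on), they would need to be dropped or handled separately before summarising.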
Summarize by one or more groups
> ## summarise using group_by verbs
> ddLong %>%
    group_by(sample, location) %>%
    summarise(n = n(),
              sum.value = sum(value),
              mean.peak = mean(peak))
`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
# A tibble: 10 × 5
# Groups:   sample [5]
   sample                  location     n sum.value mean.peak
   <chr>                   <chr>    <int>     <dbl>     <dbl>
 1 S19S_0004_Sed_Field_ICR M           10         0      200.
 2 S19S_0006_Sed_Field_ICR D           10         2      200.
 3 S19S_0006_Sed_Field_ICR M           10         1      200.
 4 S19S_0006_Sed_Field_ICR U           10         2      200.
 5 S19S_0008_Sed_Field_ICR M           10         1      200.
 6 S19S_0009_Sed_Field_ICR M           10         2      200.
 7 S19S_0009_Sed_Field_ICR U           10         1      200.
 8 S19S_0010_Sed_Field_ICR D           10         1      200.
 9 S19S_0010_Sed_Field_ICR M           10         1      200.
10 S19S_0010_Sed_Field_ICR U           10         0      200.
> ddLong %>%
    group_by(sample, p) %>%
    summarise(n = n(),
              sum.value = sum(value),
              mean.peak = mean(peak))
`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
# A tibble: 5 × 5
# Groups:   sample [5]
  sample                  p         n sum.value mean.peak
  <chr>                   <chr> <int>     <dbl>     <dbl>
1 S19S_0004_Sed_Field_ICR p15      10         0      200.
2 S19S_0006_Sed_Field_ICR p2       30         5      200.
3 S19S_0008_Sed_Field_ICR p15      10         1      200.
4 S19S_0009_Sed_Field_ICR p2       20         3      200.
5 S19S_0010_Sed_Field_ICR p15      30         2      200.
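The summaries above add the counts over all ten peaks, whereas the question asks for one merged column per sample with the counts kept per peak. A possible sketch of that step is shown here: it keeps `peak` as a grouping variable, sums over the D/M/U sites, and pivots back to a wide table. The data frame is a reduced copy of the example (three peaks, two samples) so the block runs on its own. The `merged_name` column and the ".all_" naming are assumptions based on the naming requested in the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Reduced copy of the example data: three peaks, two samples
dd <- data.frame(
  S19S_0006_Sed_Field_ICR.D_p2  = c(0, 1, 1),
  S19S_0006_Sed_Field_ICR.M_p2  = c(0, 1, 0),
  S19S_0006_Sed_Field_ICR.U_p2  = c(0, 1, 1),
  S19S_0010_Sed_Field_ICR.D_p15 = c(0, 1, 0),
  S19S_0010_Sed_Field_ICR.M_p15 = c(0, 1, 0),
  S19S_0010_Sed_Field_ICR.U_p15 = c(0, 0, 0),
  row.names = c("200.0677581", "200.092796", "200.1291723")
)

merged <- dd %>%
  rownames_to_column(var = "peak") %>%                   # keep peak labels as text
  pivot_longer(cols = -peak) %>%                         # one row per peak and column
  mutate(merged_name = sub("\\.[DMU]_", ".all_", name)) %>%  # D/M/U -> all
  group_by(peak, merged_name) %>%
  summarise(value = sum(value), .groups = "drop") %>%    # sum over sites, per peak
  pivot_wider(names_from = merged_name, values_from = value) %>%
  column_to_rownames(var = "peak")                       # back to a peaks-by-samples table

merged
```

Because the merged name keeps the `_p2` / `_p15` suffix, sites of the same sample that carry different p tags would stay in separate columns; whether that is desirable depends on what the p tag means in the real data.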
