Correlation data in separate variables in R -- stragegies for analysis?-CodePudding

First time post, thanks for your patience. In this sample data frame, there a 4 pairs of data points I want to evaluate.

To get you oriented: variable letters "a" and "b" represent 2 different methods used to measure 2 different structures "RUPV" and "RLPV". So in this df I want to compare 4 pairs of data.

Row 1 represents a single comparison (RUPV, measured using methods "a" and "b")
Row 2 represents 2 comparisons (RUPVa with RUPVb, RLPVa with RLPVb. No other comparisons)
Row 3 is also a single comparison, this time for RLPV
Rows 4 and 5 are missing data, and would not be used for analysis. But it represents what I'm dealing with in the original data set.

df1 <- data.frame(RUPVa = c(1,2,NA,NA,NA), 
                    RLPVa = c(NA,3,4,NA,NA), 
                    RUPVb = c(1,2,NA,NA,5), 
                    RLPVb = c(NA,3,4,NA,NA))

Question: Even though the values are distributed differently in the df, I want to collapse the 4 pairs of data for analysis, which includes scatterplots and correlations (regardless of the structure). Should I code using these original columns, or is there a way to bring all the pairs together into 2 new variables (say "a" and "b")?

I've tried commands like pivot_longer() but I don't know how to merge the random pairs into just 2 columns, one for method "a" and one for "b".

If you want to use my actual column names there are 8 total,listed below. Within the names, the characters "mean" = method "a", and "direct" = method "b". For a given row, the values will be randomly present or missing. I have not included them here, but can if needed.

rupv_mean_27
direct_rupv_pre rlpv_mean_32
direct_rlpv_pre lupv_mean_37
direct_lupv_pre llpv_mean_42
direct_llpv_pre

CodePudding user response：

I'm not entirely sure, but this may help out. I would add a column to indicate the row number, since you are considering pairs of values within a row. You can use pivot_longer to make the data tidy - with each row an observation.

Using pivot_longer you can put your data into long form, which will be more tidy. The arguments include:

cols which are columns to pivot (in this case, all columns except rn.
names_pattern which is a regex pattern (or regular expression) to separate the columns into two groups. The first group is \\w which is one or more word characters, like RUPV or RLPV. The second group is a|b which is the letter "a" or "b". The $ at the end is the end of the string.
names_to which has the names of the two new columns created for the matching two groups: structure and method.

df1 %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c("structure", "method"), names_pattern = "(\\w )(a|b)$")

      rn structure method value
   <int> <chr>     <chr>  <dbl>
 1     1 RUPV      a          1
 2     1 RLPV      a         NA
 3     1 RUPV      b          1
 4     1 RLPV      b         NA
 5     2 RUPV      a          2
 6     2 RLPV      a          3
 7     2 RUPV      b          2
 8     2 RLPV      b          3
 9     3 RUPV      a         NA
10     3 RLPV      a          4
11     3 RUPV      b         NA
12     3 RLPV      b          4
13     4 RUPV      a         NA
14     4 RLPV      a         NA
15     4 RUPV      b         NA
16     4 RLPV      b         NA
17     5 RUPV      a         NA
18     5 RLPV      a         NA
19     5 RUPV      b          5
20     5 RLPV      b         NA

You can then remove missing data. First, filter and remove NA in the value column. Then, for each structure/row combination only include structure/rn combinations when pairs of values are present. Finally, you can use pivot_wider to put into the desired wide form.

Here is the entire code:

library(tidyverse)

df1 %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c("structure", "method"), names_pattern = "(\\w )(a|b)") %>%
  filter(!is.na(value)) %>%
  group_by(rn, structure) %>%
  filter(n() == 2) %>%
  pivot_wider(id_cols = c(rn, structure), names_from = method, values_from = value)

Output

     rn structure     a     b
  <int> <chr>     <dbl> <dbl>
1     1 RUPV          1     1
2     2 RUPV          2     2
3     2 RLPV          3     3
4     3 RLPV          4     4