First time post, thanks for your patience. In this sample data frame, there a 4 pairs of data points I want to evaluate.
To get you oriented: variable letters "a" and "b" represent 2 different methods used to measure 2 different structures "RUPV" and "RLPV". So in this df I want to compare 4 pairs of data.
- Row 1 represents a single comparison (RUPV, measured using methods "a" and "b")
- Row 2 represents 2 comparisons (RUPVa with RUPVb, RLPVa with RLPVb. No other comparisons)
- Row 3 is also a single comparison, this time for RLPV
- Rows 4 and 5 are missing data, and would not be used for analysis. But it represents what I'm dealing with in the original data set.
df1 <- data.frame(RUPVa = c(1,2,NA,NA,NA),
RLPVa = c(NA,3,4,NA,NA),
RUPVb = c(1,2,NA,NA,5),
RLPVb = c(NA,3,4,NA,NA))
Question: Even though the values are distributed differently in the df, I want to collapse the 4 pairs of data for analysis, which includes scatterplots and correlations (regardless of the structure). Should I code using these original columns, or is there a way to bring all the pairs together into 2 new variables (say "a" and "b")?
- I've tried commands like
pivot_longer()but I don't know how to merge the random pairs into just 2 columns, one for method "a" and one for "b".
If you want to use my actual column names there are 8 total,listed below. Within the names, the characters "mean" = method "a", and "direct" = method "b". For a given row, the values will be randomly present or missing. I have not included them here, but can if needed.
rupv_mean_27
direct_rupv_pre
rlpv_mean_32
direct_rlpv_pre
lupv_mean_37
direct_lupv_pre
llpv_mean_42
direct_llpv_pre
CodePudding user response:
I'm not entirely sure, but this may help out. I would add a column to indicate the row number, since you are considering pairs of values within a row. You can use pivot_longer to make the data tidy - with each row an observation.
Using pivot_longer you can put your data into long form, which will be more tidy. The arguments include:
colswhich are columns to pivot (in this case, all columns exceptrn.names_patternwhich is a regex pattern (or regular expression) to separate the columns into two groups. The first group is\\wwhich is one or more word characters, likeRUPVorRLPV. The second group isa|bwhich is the letter "a" or "b". The$at the end is the end of the string.names_towhich has the names of the two new columns created for the matching two groups:structureandmethod.
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c("structure", "method"), names_pattern = "(\\w )(a|b)$")
rn structure method value
<int> <chr> <chr> <dbl>
1 1 RUPV a 1
2 1 RLPV a NA
3 1 RUPV b 1
4 1 RLPV b NA
5 2 RUPV a 2
6 2 RLPV a 3
7 2 RUPV b 2
8 2 RLPV b 3
9 3 RUPV a NA
10 3 RLPV a 4
11 3 RUPV b NA
12 3 RLPV b 4
13 4 RUPV a NA
14 4 RLPV a NA
15 4 RUPV b NA
16 4 RLPV b NA
17 5 RUPV a NA
18 5 RLPV a NA
19 5 RUPV b 5
20 5 RLPV b NA
You can then remove missing data. First, filter and remove NA in the value column. Then, for each structure/row combination only include structure/rn combinations when pairs of values are present. Finally, you can use pivot_wider to put into the desired wide form.
Here is the entire code:
library(tidyverse)
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c("structure", "method"), names_pattern = "(\\w )(a|b)") %>%
filter(!is.na(value)) %>%
group_by(rn, structure) %>%
filter(n() == 2) %>%
pivot_wider(id_cols = c(rn, structure), names_from = method, values_from = value)
Output
rn structure a b
<int> <chr> <dbl> <dbl>
1 1 RUPV 1 1
2 2 RUPV 2 2
3 2 RLPV 3 3
4 3 RLPV 4 4
