I have a data frame with 4 columns
| ID1 | ID2 | Value1 | Value2 |
|---|---|---|---|
| 1 | 2 | zinc | sulfur |
| 1 | 2 | zinc | calcium |
| 1 | 2 | calcium | NA |
| 3 | 4 | carbon | iron |
| 3 | 4 | iron | iron |
| 3 | 4 | iron | carbon |
The first two columns will always correspond so I only essentially need to work with one of the ID columns in configurating this.
For each pair of IDs I want every value that corresponds to either of the IDs, and for it to only appear once.
So for the example above I want:
| ID1 | ID2 | Values |
|---|---|---|
| 1 | 2 | zinc, sulfur, calcium |
| 3 | 4 | carbon, iron |
But a list might also work if it simply lists each the amount of times each value (uniquely appears) for each ID.
What I eventually want is the number of times each Value appears in the dataset
eg:
| Value | Count |
|---|---|
| calcium | 2 |
| zinc | 1 |
| iron | 1 |
| sulfur | 1 |
I guess a list might be a better way to get to this endpoint, but I’m more well-versed in working with data frames. For now, I’m mostly interested in the first part (getting each value for the pair of IDs).
CodePudding user response:
Here is a dplyr approach to solve your first problem.
First group_by your ID columns, then use summarise to paste and collapse them into single row of comma-separated string per ID pair (and also removes NA by na.omit).
Finally, str_split the comma separated string to remove duplicated elements, paste them back together, and ungroup.
library(dplyr)
df %>%
group_by(ID1, ID2) %>%
summarize(Values = paste0(na.omit(Value1), ",", na.omit(Value2), collapse = ",")) %>%
mutate(Values = paste0(unique(sort(str_split(Values, ",", simplify = T))), collapse = ",")) %>%
ungroup()
# A tibble: 2 × 3
ID1 ID2 Values
<int> <int> <chr>
1 1 2 calcium,sulfur,zinc
2 3 4 carbon,iron
Data
df <- read.table(header = T, text = "
ID1 ID2 Value1 Value2
1 2 zinc sulfur
1 2 zinc calcium
1 2 calcium NA
3 4 carbon iron
3 4 iron iron
3 4 iron carbon")
CodePudding user response:
With dplyr an option is also to transpose the 'Value' columns, get the unique and paste with toString after grouping
library(dplyr)
df %>%
group_by(ID1, ID2) %>%
summarise(Values = toString(na.omit(unique(c(t(across(starts_with("Value"))))))),
.groups = 'drop')
-output
# A tibble: 2 × 3
ID1 ID2 Values
<int> <int> <chr>
1 1 2 zinc, sulfur, calcium
2 3 4 carbon, iron
CodePudding user response:
What about this?
data.frame(table(unlist(df[, sapply(df, is.character)])))
Var1 Freq
1 calcium 2
2 carbon 2
3 iron 4
4 sulfer 1
5 zinc 2
data
df <- structure(list(ID1 = c(1, 1, 1, 3, 3, 3), ID2 = c(2, 2, 2, 4,
4, 4), Value1 = c("zinc", "zinc", "calcium", "carbon", "iron",
"iron"), Value2 = c("sulfer", "calcium", NA, "iron", "iron",
"carbon")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
# A tibble: 6 x 4
ID1 ID2 Value1 Value2
<dbl> <dbl> <chr> <chr>
1 1 2 zinc sulfer
2 1 2 zinc calcium
3 1 2 calcium NA
4 3 4 carbon iron
5 3 4 iron iron
6 3 4 iron carbon
CodePudding user response:
Within tidyverse the process is to gather the value1 and value2 in a tidy dataset, group_by your id variables, and summarize using the concatenation of your string values.
library(tidyverse)
id1 <- c(1,1,1,3,3,3)
id2 <- c(2,2,2,4,4,4)
Value1 <- c("zinc", "zinc", "calcium", "carbon", "iron", "iron")
Value2 <- c("sulfer", "calcium", NA, "iron", "iron", "carbon")
df <- as_tibble(cbind(id1,id2,Value1,Value2))
df_cln <- df %>%
gather(var, value, -id1, -id2, na.rm = T) %>%
select(-var) %>%
unique() %>%
group_by(id1, id2) %>%
summarise(minerals=paste0(value, collapse=","))
print(df_cln)
# A tibble: 2 × 3
# Groups: id1 [2]
id1 id2 minerals
<chr> <chr> <chr>
1 1 2 zinc,calcium,sulfer
2 3 4 carbon,iron
