I have a dataframe that looks like this
> df
Col1 Col2 P_value
Cell1 Cell2 0.001
Cell2 Cell1 0.05
Cell4 Cell1 0.01
Cell5 Cell2 0.03
Cell2 Cell3 0.008
Cell1 Cell4 0.008
I want to subset to a new dataframe, in which only strings present in BOTH Col1 and Col2 in both orders. So here, Cell1 and Cell2, when matched, appear in both Col1 and Col2.
> df
Col1 Col2 P_value
Cell1 Cell2 0.001
Cell2 Cell1 0.05
Cell1 Cell4 0.008
Cell4 Cell1 0.01
So here, Cell1 and Cell2, when matched, appear in both Col1 and Col2. Same for Cell1 and Cell4. This never happens for other strings.
CodePudding user response:
Perhaps this is over-simplifying, but ...
df %>%
filter(Col1 %in% Col2 & Col2 %in% Col1)
# Col1 Col2 P_value
# 1 Cell1 Cell2 <0.05
# 2 Cell2 Cell1 <0.05
# 3 Cell4 Cell1 <0.05
# 4 Cell1 Cell4 <0.05
CodePudding user response:
We may use
library(dplyr)
library(stringr)
df %>%
mutate(Col = str_c(pmin(Col1, Col2), pmax(Col1, Col2))) %>%
filter(duplicated(Col)|duplicated(Col, fromLast = TRUE)) %>%
select(-Col)
Or may do
df %>%
add_count(pmin(Col1, Col2), pmax(Col1, Col2)) %>%
filter(n > 1) %>%
select(names(df))
-output
Col1 Col2 P_value
1 Cell1 Cell2 <0.05
2 Cell2 Cell1 <0.05
3 Cell4 Cell1 <0.05
4 Cell1 Cell4 <0.05
data
df <- structure(list(Col1 = c("Cell1", "Cell2", "Cell4", "Cell5", "Cell2",
"Cell1"), Col2 = c("Cell2", "Cell1", "Cell1", "Cell2", "Cell3",
"Cell4"), P_value = c("<0.05", "<0.05", "<0.05", "<0.05", "<0.05",
"<0.05")), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
A possible solution, based on inner_join:
library(dplyr)
inner_join(df, df[-3], by = c("Col2" = "Col1", "Col1" = "Col2"))
#> Col1 Col2 P_value
#> 1 Cell1 Cell2 0.001
#> 2 Cell2 Cell1 0.050
#> 3 Cell4 Cell1 0.010
#> 4 Cell1 Cell4 0.008
