I have a dataframe like this: If the person with that id has that fruit the value is 1, else it is 0. ID column is a primary key.
| ID | Apple | Orange | Pear | Grapes |
|---|---|---|---|---|
| E1 | 1 | 0 | 1 | 1 |
| E2 | 0 | 0 | 1 | 0 |
| E3 | 0 | 1 | 1 | 0 |
| E4 | 1 | 1 | 0 | 0 |
| E5 | 1 | 0 | 0 | 1 |
I want a output as a matrix like this. This is a count of people who has both i and j fruit. Where i and j both are same, eg: cell apple x apple, we are looking at the count of person who has only apple and no other fruit. There is no person who has only apple, so the value of that cell is 0. Similarly, we have only one person who has both apple and orange so the count is 1 in both (orange x apple) and (apple x orange) cells. We have 2 people in the above dataframe who has both grapes and apple, so value in that cell is 2.
| Apple | Orange | Pear | Grapes | |
|---|---|---|---|---|
| Apple | 0 | 1 | 1 | 2 |
| Orange | 1 | 0 | 1 | 0 |
| Pear | 1 | 1 | 1 | 1 |
| Grapes | 2 | 0 | 1 | 0 |
I am new to R and Python, and not sure how to achieve this. Any help in either R or Python would be really appreciated! Please feel free to ask questions if anything is not clear or anything is missing. Thanks!
CodePudding user response:
Here is the example with Python3 and itertools.permutations:
import itertools
import pandas as pd
# Create original dataframe
columns = ['ID', 'Apple', 'Orange', 'Pear', 'Grapes']
rows = [
['E1', 1, 0, 1, 1],
['E2', 0, 0, 1, 0],
['E3', 0, 1, 1, 0],
['E4', 1, 1, 0, 0],
['E5', 1, 0, 0, 1],
]
df = pd.DataFrame(rows, columns=columns)
# Count values of resulting matrix
def get_ones_indexes(els):
for i, el in enumerate(els):
if el == 1:
yield i
res_n = len(df.columns) - 1
res = [[0] * res_n for _ in range(res_n)]
for _, row in df.drop('ID', axis=1).iterrows():
indexes = list(get_ones_indexes(row.to_list()))
if len(indexes) == 1:
idx = indexes[0]
res[idx][idx] = 1
else:
for i, j in itertools.permutations(indexes, 2):
res[i][j] = 1
# Convert resultinng matrix to dataframe
_, *fruit_cols = df.columns
res_df = pd.DataFrame(res, index=fruit_cols, columns=fruit_cols)
print(res_df)
# Apple Orange Pear Grapes
# Apple 0 1 1 2
# Orange 1 0 1 0
# Pear 1 1 1 1
# Grapes 2 0 1 0
CodePudding user response:
Here is a tidyverse R option:
library(tidyverse)
data %>%
pivot_longer(-ID) %>%
filter(value > 0) %>%
select(-value) %>%
group_by(ID) %>%
nest() %>%
mutate(data = map(data, ~ expand.grid(.x$name, .x$name))) %>%
unnest(data) %>%
group_by(Var1, Var2) %>%
summarise(n = n(), .groups = "drop") %>%
filter(Var1 != Var2) %>%
pivot_wider(names_from = Var1, values_from = n, values_fill = 0)
## A tibble: 4 x 5
# Var2 Apple Pear Grapes Orange
# <fct> <int> <int> <int> <int>
#1 Pear 1 0 1 1
#2 Grapes 2 1 0 0
#3 Orange 1 1 0 0
#4 Apple 0 1 2 1
The idea is to use expand.grid to generate all pairwise combinations. The rest is counting occurrences and reshaping. Re-order rows & columns as necessary.
PS. I should mention that the result is slightly different from the expected output. For example, I don't understand why (Pear, Pear) = 1. I assume this in an error. Since you're asking for co-occurrences, all diagonal elements should be zero.
Sample data
data <- read.table(text = "ID Apple Orange Pear Grapes
E1 1 0 1 1
E2 0 0 1 0
E3 0 1 1 0
E4 1 1 0 0
E5 1 0 0 1", header = T)
