Home > Blockchain >  Matrix from dataframe in R/Python
Matrix from dataframe in R/Python

Time:01-24

I have a dataframe like this: If the person with that id has that fruit the value is 1, else it is 0. ID column is a primary key.

ID Apple Orange Pear Grapes
E1 1 0 1 1
E2 0 0 1 0
E3 0 1 1 0
E4 1 1 0 0
E5 1 0 0 1

I want a output as a matrix like this. This is a count of people who has both i and j fruit. Where i and j both are same, eg: cell apple x apple, we are looking at the count of person who has only apple and no other fruit. There is no person who has only apple, so the value of that cell is 0. Similarly, we have only one person who has both apple and orange so the count is 1 in both (orange x apple) and (apple x orange) cells. We have 2 people in the above dataframe who has both grapes and apple, so value in that cell is 2.

Apple Orange Pear Grapes
Apple 0 1 1 2
Orange 1 0 1 0
Pear 1 1 1 1
Grapes 2 0 1 0

I am new to R and Python, and not sure how to achieve this. Any help in either R or Python would be really appreciated! Please feel free to ask questions if anything is not clear or anything is missing. Thanks!

CodePudding user response:

Here is the example with Python3 and itertools.permutations:

import itertools

import pandas as pd

# Create original dataframe
columns = ['ID', 'Apple', 'Orange', 'Pear', 'Grapes']
rows = [
    ['E1', 1, 0, 1, 1],
    ['E2', 0, 0, 1, 0],
    ['E3', 0, 1, 1, 0],
    ['E4', 1, 1, 0, 0],
    ['E5', 1, 0, 0, 1],
]
df = pd.DataFrame(rows, columns=columns)

# Count values of resulting matrix
def get_ones_indexes(els):
    for i, el in enumerate(els):
        if el == 1:
            yield i

res_n = len(df.columns) - 1
res = [[0] * res_n for _ in range(res_n)]

for _, row in df.drop('ID', axis=1).iterrows():
    indexes = list(get_ones_indexes(row.to_list()))

    if len(indexes) == 1:
        idx = indexes[0]
        res[idx][idx]  = 1
    else:
        for i, j in itertools.permutations(indexes, 2):
            res[i][j]  = 1

# Convert resultinng matrix to dataframe
_, *fruit_cols = df.columns
res_df = pd.DataFrame(res, index=fruit_cols, columns=fruit_cols)

print(res_df)
#         Apple  Orange  Pear  Grapes
# Apple       0       1     1       2
# Orange      1       0     1       0
# Pear        1       1     1       1
# Grapes      2       0     1       0

CodePudding user response:

Here is a tidyverse R option:

library(tidyverse)
data %>%
    pivot_longer(-ID) %>%
    filter(value > 0) %>%
    select(-value) %>%
    group_by(ID) %>%
    nest() %>%
    mutate(data = map(data, ~ expand.grid(.x$name, .x$name))) %>%
    unnest(data) %>%
    group_by(Var1, Var2) %>%
    summarise(n = n(), .groups = "drop") %>%
    filter(Var1 != Var2) %>%
    pivot_wider(names_from = Var1, values_from = n, values_fill = 0)
## A tibble: 4 x 5
#  Var2   Apple  Pear Grapes Orange
#  <fct>  <int> <int>  <int>  <int>
#1 Pear       1     0      1      1
#2 Grapes     2     1      0      0
#3 Orange     1     1      0      0
#4 Apple      0     1      2      1

The idea is to use expand.grid to generate all pairwise combinations. The rest is counting occurrences and reshaping. Re-order rows & columns as necessary.

PS. I should mention that the result is slightly different from the expected output. For example, I don't understand why (Pear, Pear) = 1. I assume this in an error. Since you're asking for co-occurrences, all diagonal elements should be zero.


Sample data

data <- read.table(text = "ID   Apple   Orange  Pear    Grapes
E1  1   0   1   1
E2  0   0   1   0
E3  0   1   1   0
E4  1   1   0   0
E5  1   0   0   1", header = T)
  •  Tags:  
  • Related