Home > Back-end >  Make small pandas dataframes based on combinations of values in two columns
Make small pandas dataframes based on combinations of values in two columns

Time:01-07

Lets say I have a dataframe below.

A   B
a   1
a   2
a   3
b   1
b   2
b   3
c   1
c   2
c   3
d   1

I want to create dataframes with two columns: A : unique values in columns A and B: Possible values in column B for a given unique value in A. Note that all these dataframes must be unique. Example of few possible dataframes are given in the image below:

enter image description here

CodePudding user response:

The main idea is to group the values by column A, collecting values in B in lists, create a cartesian product of all these lists, and format back into a dataframe

First we load the data and groupby by column A, collecting all values in B into lists:

from io import StringIO
import pandas as pd
data = StringIO('''
A   B
a   1
a   2
a   3
b   1
b   2
b   3
c   1
c   2
c   3
d   1
''')
df = pd.read_csv(data, sep = '\s ')
df2 = df.groupby('A').agg(list)
df2

df2 looks like this:


    B
A   
a   [1, 2, 3]
b   [1, 2, 3]
c   [1, 2, 3]
d   [1]

next we extract all lists in column B into a list of lists:

all_lists = df2['B'].to_list()
all_lists

it looks like this:

[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1]]

Now the main step, create a product of the lists using MultiIndex.from_product functionality

mi = pd.MultiIndex.from_product(all_lists)
mi

mi looks like this:

MultiIndex([(1, 1, 1, 1),
            (1, 1, 2, 1),
            (1, 1, 3, 1),
            (1, 2, 1, 1),
            (1, 2, 2, 1),
            (1, 2, 3, 1),
            (1, 3, 1, 1),
            (1, 3, 2, 1),
            (1, 3, 3, 1),
            (2, 1, 1, 1),
            (2, 1, 2, 1),
            (2, 1, 3, 1),
            (2, 2, 1, 1),
            (2, 2, 2, 1),
            (2, 2, 3, 1),
            (2, 3, 1, 1),
            (2, 3, 2, 1),
            (2, 3, 3, 1),
            (3, 1, 1, 1),
            (3, 1, 2, 1),
            (3, 1, 3, 1),
            (3, 2, 1, 1),
            (3, 2, 2, 1),
            (3, 2, 3, 1),
            (3, 3, 1, 1),
            (3, 3, 2, 1),
            (3, 3, 3, 1)],
           )

we are almost there -- now convert into a df and make sure we have the right label. Also remove duplicates (none in this case but in general a good idea)

(pd.DataFrame(index = mi)
    .reset_index()
    .drop_duplicates()
    .set_axis(df2.index, axis=1) 
    .T
)

and we get a dataframe:

    0   1   2   3   4   5   6   7   8   9   ... 17  18  19  20  21  22  23  24  25  26
A                                                                                   
a   1   1   1   1   1   1   1   1   1   2   ... 2   3   3   3   3   3   3   3   3   3
b   1   1   1   2   2   2   3   3   3   1   ... 3   1   1   1   2   2   2   3   3   3
c   1   2   3   1   2   3   1   2   3   1   ... 3   1   2   3   1   2   3   1   2   3
d   1   1   1   1   1   1   1   1   1   1   ... 1   1   1   1   1   1   1   1   1   1

where the index is the old column A and each column is a unique combination of relevant values from old column B

You can split it into separate dataframes if you need to, by looping over columns if you wish

  •  Tags:  
  • Related