Creating new columns in pandas dataframe as summed permutations of other columns-CodePudding

Hopefully this simple explanation will get across what I'm trying to do. Suppose I had a Pandas dataframe that had the following columns:

Does anyone know a way (without doing it manually) that I could create a new set of columns within that table that sum across subsets of permutations of the columns? I've been doing some searches and not had any luck given this being a rather specific query.

For instance if I wanted all combinations of sums of subsets of 3 columns from the 5 columns - some of the additional rows might like look like:

ABC ACE BDE
  7   9  12
  8   7   9
 15  12  12

...and so in (in this case there'd be 5C3 = 10 rows - but I'm actually looking at a 7C4 example)

Thanks in advance !

CodePudding user response：

Use combinations instead permutations and for each values of tuple sum values together:

from itertools import combinations

cc = list(combinations(df.columns,3))
df = pd.concat([df.loc[:, c].sum(axis=1) for c in cc], axis=1, keys=cc)
df.columns = df.columns.map(''.join)
print (df)
   ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
0    7    9    7   11    9   11   12   10   12   14
1    8    9   10    6    7    8    7    8    9    6
2   15   12   15    9   12    9   12   15   12    9

CodePudding user response：

Using the underlying numpy array for efficiency.

from itertools import combinations

a = df.to_numpy()
df2 = pd.DataFrame(np.vstack([a[:,list(c)].sum(1)
                              for c in combinations(range(df.shape[1]), 3)
                              ]).T,
                   columns=map(''.join, combinations(df, 3))
                   )

output:

   ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
0    7    9    7   11    9   11   12   10   12   14
1    8    9   10    6    7    8    7    8    9    6
2   15   12   15    9   12    9   12   15   12    9

NB. This is running ~10 times faster than the other pandas-based solution.

CodePudding user response：

Or as an alternative:

import itertools
    
pd.DataFrame({'{}{}{}'.format(a, b, c): df[a]   df[b]   df[c] 
              for a, b, c in itertools.combinations(df.columns, 3)})

Output:

   ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
0    7    9    7   11    9   11   12   10   12   14
1    8    9   10    6    7    8    7    8    9    6
2   15   12   15    9   12    9   12   15   12    9

CodePudding user response：

Because I found it very interesting, I compared the three solutions.

jezrael

%%timeit

df = pd.DataFrame({"A":[1,3,4],
                  "B":[2,4,7],
                  "C":[4,1,4],
                  "D":[6,2,1],
                  "E":[4,3,4]})

cc = list(combinations(df.columns,3))
df = pd.concat([df.loc[:, c].sum(axis=1) for c in cc], axis=1, keys=cc)
df.columns = df.columns.map(''.join)
df

Result: 5.99 ms ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

mozway

%%timeit

df = pd.DataFrame({"A":[1,3,4],
                  "B":[2,4,7],
                  "C":[4,1,4],
                  "D":[6,2,1],
                  "E":[4,3,4]})
    
a = df.to_numpy()
df2 = pd.DataFrame(np.vstack([a[:,list(c)].sum(1)
                              for c in combinations(range(df.shape[1]), 3)
                              ]).T,
                   columns=map(''.join, combinations(df, 3))
                   )
df2

Result: 458 µs ± 868 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Marco_CH

%%timeit

df = pd.DataFrame({"A":[1,3,4],
                  "B":[2,4,7],
                  "C":[4,1,4],
                  "D":[6,2,1],
                  "E":[4,3,4]})
    
df = pd.DataFrame({'{}{}{}'.format(a, b, c): df[a]   df[b]   df[c] 
              for a, b, c in itertools.combinations(df.columns, 3)})
df

Result: 2.25 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, if performance is key, mozways solution is the best by far. If performance is not that important, then we have three different approaches for every preference :-)