Hopefully this simple explanation will get across what I'm trying to do. Suppose I had a Pandas dataframe that had the following columns:
A B C D E
1 2 4 6 4
3 4 1 2 3
4 7 4 1 4
Does anyone know a way (without doing it manually) that I could create a new set of columns within that table that sum across subsets of permutations of the columns? I've been doing some searches and not had any luck given this being a rather specific query.
For instance if I wanted all combinations of sums of subsets of 3 columns from the 5 columns - some of the additional rows might like look like:
ABC ACE BDE
7 9 12
8 7 9
15 12 12
...and so in (in this case there'd be 5C3 = 10 rows - but I'm actually looking at a 7C4 example)
Thanks in advance !
CodePudding user response:
Use combinations instead permutations and for each values of tuple sum values together:
from itertools import combinations
cc = list(combinations(df.columns,3))
df = pd.concat([df.loc[:, c].sum(axis=1) for c in cc], axis=1, keys=cc)
df.columns = df.columns.map(''.join)
print (df)
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
0 7 9 7 11 9 11 12 10 12 14
1 8 9 10 6 7 8 7 8 9 6
2 15 12 15 9 12 9 12 15 12 9
CodePudding user response:
Using the underlying numpy array for efficiency.
from itertools import combinations
a = df.to_numpy()
df2 = pd.DataFrame(np.vstack([a[:,list(c)].sum(1)
for c in combinations(range(df.shape[1]), 3)
]).T,
columns=map(''.join, combinations(df, 3))
)
output:
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
0 7 9 7 11 9 11 12 10 12 14
1 8 9 10 6 7 8 7 8 9 6
2 15 12 15 9 12 9 12 15 12 9
NB. This is running ~10 times faster than the other pandas-based solution.
CodePudding user response:
Or as an alternative:
import itertools
pd.DataFrame({'{}{}{}'.format(a, b, c): df[a] df[b] df[c]
for a, b, c in itertools.combinations(df.columns, 3)})
Output:
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
0 7 9 7 11 9 11 12 10 12 14
1 8 9 10 6 7 8 7 8 9 6
2 15 12 15 9 12 9 12 15 12 9
CodePudding user response:
Because I found it very interesting, I compared the three solutions.
jezrael
%%timeit
df = pd.DataFrame({"A":[1,3,4],
"B":[2,4,7],
"C":[4,1,4],
"D":[6,2,1],
"E":[4,3,4]})
cc = list(combinations(df.columns,3))
df = pd.concat([df.loc[:, c].sum(axis=1) for c in cc], axis=1, keys=cc)
df.columns = df.columns.map(''.join)
df
Result:
5.99 ms ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mozway
%%timeit
df = pd.DataFrame({"A":[1,3,4],
"B":[2,4,7],
"C":[4,1,4],
"D":[6,2,1],
"E":[4,3,4]})
a = df.to_numpy()
df2 = pd.DataFrame(np.vstack([a[:,list(c)].sum(1)
for c in combinations(range(df.shape[1]), 3)
]).T,
columns=map(''.join, combinations(df, 3))
)
df2
Result:
458 µs ± 868 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Marco_CH
%%timeit
df = pd.DataFrame({"A":[1,3,4],
"B":[2,4,7],
"C":[4,1,4],
"D":[6,2,1],
"E":[4,3,4]})
df = pd.DataFrame({'{}{}{}'.format(a, b, c): df[a] df[b] df[c]
for a, b, c in itertools.combinations(df.columns, 3)})
df
Result:
2.25 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, if performance is key, mozways solution is the best by far. If performance is not that important, then we have three different approaches for every preference :-)
