I have a DataFrame which looks like this toy example:
import pandas as pd
df = np.array([
[20.078,19.679,19.585,19.406,19.37,14.97,13.992,20.122,20.736],
[20.443,19.115,18.918,18.749,18.698,14.638,14.041,21.646,21.456],
[19.723,19.593,19.353,19.175,19.258,15.193,14.354,21.122,21.09],
[19.683,19.393,19.273,18.995,18.95,15.545,14.53,22.465,20.091],
[19.769,19.233,19.083,18.983,18.768,14.978,14.224,21.684,20.314],
[19.908,19.5,19.065,18.838,18.354,13.837,13.016,21.307,21.234]
])
df = pd.DataFrame(df, columns = ['u', 'g', 'r', 'i', 'zmag', 'W1', 'W2', 'NUV', 'FUV'])
and I want to combine the columns pairwise into subtractive combinations, which is what this snippet does:
df['FUV_NUV'] = dataset['FUV'] - dataset['NUV']
df['FUV_u'] = df['FUV'] - df['u']
df['u_g'] = df['u'] - df['g']
df['g_r'] = df['g'] - df['r']
df['r_i'] = df['r'] - df['i']
df['i_z'] = df['i'] - df['zmag']
df['z_W1'] = df['zmag'] - df['W1']
df['W1_W2'] = df['W1'] - df['W2']
But of course, there's a nicer way to do this, which I found here:
combs = list(chain.from_iterable(combinations(df.columns, i)
for i in range(2, len(df.columns) 1)))
for cols in combs:
df['_'.join(cols)] = df.loc[:, cols].sum(axis=1)
However, this produces all combinations (such as u g W1 W2 ...) additively.
How can I change it so that it produces
- only combinations of two columns maximum
- subtractive (i.e. not
sum)?
CodePudding user response:
Try out df.diff(axis=1) and see the function's documentation.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).
When we use axis=1, this function gets the difference between one column and the previous column. So you might have to re-order your columns for this function to work properly for your purpose.
Here is another option that uses loops.
In my opinion, the method using itertools is less readable than simply writing out the columns you are subtracting.
I would also suggest using the df.loc[] syntax when setting data on a dataframe.
pairs = [
("FUV", "NUV"),
("FUV", "u"),
("u", "g"),
("g", "r"),
("r", "i"),
("i", "zmag"),
("zmag", "W1"),
("W1", "W2"),
]
for col1, col2 in pairs:
new_col_name = f"{col1}_{col2}"
df.loc[:, new_col_name] = df[col1] - df[col2]
CodePudding user response:
You could generate the list a 2 elem combinations of adjacent columns with wraparound using this list comprehension Then you could generate additional columns by iterating through this list
comb = [(x, df.columns[(i 1) % len(df.columns)]) for i, x in enumerate(df.columns)]
for x, y in comb:
df[f'{x}_{y}'] = df[x] - df[y]
this produces the output:
u g r i zmag W1 W2 NUV FUV u_g g_r r_i i_zmag zmag_W1 W1_W2 W2_NUV NUV_FUV FUV_u
0 20.078 19.679 19.585 19.406 19.370 14.970 13.992 20.122 20.736 0.399 0.094 0.179 0.036 4.400 0.978 -6.130 -0.614 0.658
1 20.443 19.115 18.918 18.749 18.698 14.638 14.041 21.646 21.456 1.328 0.197 0.169 0.051 4.060 0.597 -7.605 0.190 1.013
2 19.723 19.593 19.353 19.175 19.258 15.193 14.354 21.122 21.090 0.130 0.240 0.178 -0.083 4.065 0.839 -6.768 0.032 1.367
3 19.683 19.393 19.273 18.995 18.950 15.545 14.530 22.465 20.091 0.290 0.120 0.278 0.045 3.405 1.015 -7.935 2.374 0.408
4 19.769 19.233 19.083 18.983 18.768 14.978 14.224 21.684 20.314 0.536 0.150 0.100 0.215 3.790 0.754 -7.460 1.370 0.545
5 19.908 19.500 19.065 18.838 18.354 13.837 13.016 21.307 21.234 0.408 0.435 0.227 0.484 4.517 0.821 -8.291 0.073 1.326
