Home > Enterprise >  Using `itertools` to combine DataFrame columns
Using `itertools` to combine DataFrame columns

Time:01-13

I have a DataFrame which looks like this toy example:

import pandas as pd
df = np.array([
    [20.078,19.679,19.585,19.406,19.37,14.97,13.992,20.122,20.736],
    [20.443,19.115,18.918,18.749,18.698,14.638,14.041,21.646,21.456],
    [19.723,19.593,19.353,19.175,19.258,15.193,14.354,21.122,21.09],
    [19.683,19.393,19.273,18.995,18.95,15.545,14.53,22.465,20.091],
    [19.769,19.233,19.083,18.983,18.768,14.978,14.224,21.684,20.314],
    [19.908,19.5,19.065,18.838,18.354,13.837,13.016,21.307,21.234]
])

df = pd.DataFrame(df, columns = ['u', 'g', 'r', 'i', 'zmag', 'W1', 'W2', 'NUV', 'FUV'])

and I want to combine the columns pairwise into subtractive combinations, which is what this snippet does:

df['FUV_NUV'] = dataset['FUV'] - dataset['NUV']
df['FUV_u'] = df['FUV'] - df['u']
df['u_g'] = df['u'] - df['g']
df['g_r'] = df['g'] - df['r']
df['r_i'] = df['r'] - df['i']
df['i_z'] = df['i'] - df['zmag']
df['z_W1'] = df['zmag'] - df['W1']
df['W1_W2'] = df['W1'] - df['W2']

But of course, there's a nicer way to do this, which I found here:

combs = list(chain.from_iterable(combinations(df.columns, i)
                                 for i in range(2, len(df.columns)   1)))
for cols in combs:
    df['_'.join(cols)] = df.loc[:, cols].sum(axis=1)

However, this produces all combinations (such as u g W1 W2 ...) additively.

How can I change it so that it produces

  1. only combinations of two columns maximum
  2. subtractive (i.e. not sum)?

CodePudding user response:

Try out df.diff(axis=1) and see the function's documentation.

Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).

When we use axis=1, this function gets the difference between one column and the previous column. So you might have to re-order your columns for this function to work properly for your purpose.


Here is another option that uses loops.

In my opinion, the method using itertools is less readable than simply writing out the columns you are subtracting.

I would also suggest using the df.loc[] syntax when setting data on a dataframe.

pairs = [
    ("FUV", "NUV"),
    ("FUV", "u"),
    ("u", "g"),
    ("g", "r"),
    ("r", "i"),
    ("i", "zmag"),
    ("zmag", "W1"),
    ("W1", "W2"),
]

for col1, col2 in pairs:
    new_col_name = f"{col1}_{col2}"
    df.loc[:, new_col_name] = df[col1] - df[col2]

CodePudding user response:

You could generate the list a 2 elem combinations of adjacent columns with wraparound using this list comprehension Then you could generate additional columns by iterating through this list

comb = [(x, df.columns[(i 1) % len(df.columns)]) for i, x in enumerate(df.columns)]


for x, y in comb:
  df[f'{x}_{y}'] = df[x] - df[y]

this produces the output:

        u       g       r       i    zmag      W1      W2     NUV     FUV    u_g    g_r    r_i  i_zmag  zmag_W1  W1_W2  W2_NUV  NUV_FUV  FUV_u
0  20.078  19.679  19.585  19.406  19.370  14.970  13.992  20.122  20.736  0.399  0.094  0.179   0.036    4.400  0.978  -6.130   -0.614  0.658
1  20.443  19.115  18.918  18.749  18.698  14.638  14.041  21.646  21.456  1.328  0.197  0.169   0.051    4.060  0.597  -7.605    0.190  1.013
2  19.723  19.593  19.353  19.175  19.258  15.193  14.354  21.122  21.090  0.130  0.240  0.178  -0.083    4.065  0.839  -6.768    0.032  1.367
3  19.683  19.393  19.273  18.995  18.950  15.545  14.530  22.465  20.091  0.290  0.120  0.278   0.045    3.405  1.015  -7.935    2.374  0.408
4  19.769  19.233  19.083  18.983  18.768  14.978  14.224  21.684  20.314  0.536  0.150  0.100   0.215    3.790  0.754  -7.460    1.370  0.545
5  19.908  19.500  19.065  18.838  18.354  13.837  13.016  21.307  21.234  0.408  0.435  0.227   0.484    4.517  0.821  -8.291    0.073  1.326
  •  Tags:  
  • Related