Creating a new dataframe column based on operations applied to nested arrays in another column?

Time:02-04

Let me start off by saying this unfortunately cannot be solved by doing something as simple as df[A] = df[B] - df[C].

I have a column containing arrays (let's call it df[A]). I want to z-score the items in each array (with respect to only the values in that array), then store this new array of z-scored values in the corresponding row of a new column.

To hopefully make it a bit clearer, each entry in df[A] looks like [[1, 2, 3, ..., 4170945]] and is of length 4170945. (The nesting is due to how the arrays are loaded into the dataframe, and not important.) I have 69 rows of such entries (example image below).

I then want each row of df['zscores'] to contain a corresponding array of (row[A][0] - row[A][0].mean()) / row[A][0].std().

[image: example rows of the dataframe]

I have tried the following:

1.

df['zscores'] = (df['A'] - df['A'].mean()) / df['A'].std()

This gives the following error:

ValueError: operands could not be broadcast together with shapes (69,) (1,4170945) 

My suspicion is that it's returning a single series where the first item of each row of df[A] is z-scored, then the second, etc., essentially iterating item-wise through each row.

2.

for idx, row in df.iterrows():
    if idx == 1:
        _series = pd.Series((row['A'][0] - row['A'][0].mean()) / row['A'][0].std())
    else:
        _ = pd.Series((row['A'][0] - row['A'][0].mean()) / row['A'][0].std())
        _series.append(_)

My aim was to extract each array, operate on it, and append it to a series. I then wanted to do something like df['zscores'] = _series.

My ideal result looks like this:

    A                                               zscores
0   [[43.7916, 10.7261, 30.9748, ...    [[2.5077,  2.1846,  2.2108, ...
1   [[53.8916, 16.7261, 3.5668, ...     [[1.0177,  5.1846,  0.2108, ...
...

CodePudding user response:

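You need apply for this. The following might either solve it or give you some insight. Note axis=1, so that the lambda receives one row at a time rather than one column, and the assignment that stores the result:

df['zscores'] = df.apply(lambda x: (x['A'][0] - x['A'][0].mean()) / x['A'][0].std(), axis=1)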
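
For reference, here is a minimal, self-contained sketch of the same per-row idea applied to the column directly with Series.apply. The helper name zscore and the toy data (3 rows of shape (1, 5) instead of 69 rows of shape (1, 4170945)) are assumptions made for illustration, not part of the original question. The sketch z-scores the whole nested array, so the [[...]] nesting shown in the desired output is kept.

```python
import numpy as np
import pandas as pd


def zscore(arr):
    """Z-score an array against its own mean and standard deviation.

    Hypothetical helper for illustration. NumPy's std() uses the
    population form (ddof=0) by default; pass ddof=1 to std() if the
    sample standard deviation is wanted instead.
    """
    arr = np.asarray(arr, dtype=float)
    return (arr - arr.mean()) / arr.std()


# Toy stand-in for the real data: 3 rows, each holding a nested (1, 5) array.
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": [rng.normal(size=(1, 5)) for _ in range(3)]})

# Series.apply calls zscore once per row; each result array is stored
# in the matching row of the new column.
df["zscores"] = df["A"].apply(zscore)
```

The loop from the question could probably be made to work as well by collecting each result in a plain Python list and assigning that list to df['zscores'] afterwards, since the result of _series.append(...) is never assigned back to _series. With rows this large, the new column holds as many values as the original one, so memory use is worth keeping in mind.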