Let me start off by saying this unfortunately cannot be solved by doing something as simple as df[A] = df[B] - df[C].
I have a column containing arrays (let's call it df[A]). I want to z-score the items in each array (with respect to only the values in that array), then store this new array of z-scored values in the corresponding row of a new column.
To hopefully make it a bit clearer, each entry in df[A] looks like [[1, 2, 3, ..., 4170945]] and is of length 4170945. (The nesting is due to how the arrays are loaded into the dataframe, and not important.) I have 69 rows of such entries.
I then want each row of df['zscores'] to contain a corresponding array of (row[A][0] - row[A][0].mean()) / row[A][0].std().
I have tried the following:
1.
df['zscores'] = (df['A'] - df['A'].mean()) / df['A'].std()
This gives the following error:
ValueError: operands could not be broadcast together with shapes (69,) (1,4170945)
My suspicion is that it's returning a single series where the first item of each row of df[A] is z-scored, then the second, etc., essentially iterating item-wise through each row.
2.
for idx, row in df.iterrows():
    if idx == 1:
        _series = pd.Series((row['A'][0] - row['A'][0].mean()) / row['A'][0].std())
    else:
        _ = pd.Series((row['A'][0] - row['A'][0].mean()) / row['A'][0].std())
        _series.append(_)
My aim was to extract each array, operate on it, and append it to a series. I then wanted to do something like df['zscores'] = _series.
My ideal result looks like this:
   A                                 zscores
0  [[43.7916, 10.7261, 30.9748, ...  [[2.5077, 2.1846, 2.2108, ...
1  [[53.8916, 16.7261, 3.5668, ...   [[1.0177, 5.1846, 0.2108, ...
...
CodePudding user response:
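You need a row-wise apply for this: pass axis=1 so the lambda receives one row at a time (with the default axis=0 it receives whole columns, and x['A'] raises a KeyError). This might either solve it or give you an insight:
df['zscores'] = df.apply(lambda x: (x['A'][0] - x['A'][0].mean()) / x['A'][0].std(), axis=1)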
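If it helps, here is a small self-contained sketch of the same idea applied to the column directly with Series.apply. It assumes each cell of A holds a nested NumPy array of shape (1, n), as described in the question. The toy data and the helper name zscore_row are made up for illustration. Note that NumPy's std() uses the population standard deviation (ddof=0) by default, so pass ddof=1 if you want the sample version.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (assumption): 3 rows, each cell a nested
# array of shape (1, 5) instead of (1, 4170945).
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": [rng.normal(50, 10, size=(1, 5)) for _ in range(3)]})


def zscore_row(arr):
    """Z-score one nested array against its own mean and std (hypothetical helper)."""
    values = arr[0]  # drop the outer nesting
    return (values - values.mean()) / values.std()  # std() uses ddof=0 here


# Series.apply calls the helper once per cell; each result is stored as an
# array object in the corresponding row of the new column.
df["zscores"] = df["A"].apply(zscore_row)

# Each row's z-scores should have mean ~0 and std ~1.
first = df["zscores"].iloc[0]
print(first.shape, first.mean(), first.std())
```

This stores flat 1-D arrays in zscores. If you want to keep the original [[...]] nesting, return the z-scored values wrapped in a list, or z-score arr itself rather than arr[0].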

