Why when I apply DataFrame.sub() I get the new DF with a higher number of columns?

Time:02-01

I can't understand pandas' behaviour when I apply the .sub() method in this specific situation, where my goal is to center the rows of the DataFrame.

Basically, I want to compute the mean of each row and then subtract it from every element of that row, so that the DataFrame is row-centered.

I think the issue is probably related to the element-wise operation, but the reason remains unclear to me even after reading the documentation.

import pandas as pd
val = {1: [1,2,3,4],
       2: [4,5,6,7],
       3: [8,9,10,11],
       4: [12,13,14,15]}

df = pd.DataFrame(val)
# Computing for each row the mean value
avg = df.mean(axis=1)
# Subtracting the row mean from each element in each row
df_centered = df.sub(avg,axis=1)

print(df)
print(avg)
print(df_centered)


   1  2   3   4
0  1  4   8  12
1  2  5   9  13
2  3  6  10  14
3  4  7  11  15
0    6.25
1    7.25
2    8.25
3    9.25
dtype: float64
    0     1     2     3   4
0 NaN -6.25 -4.25 -1.25 NaN
1 NaN -5.25 -3.25 -0.25 NaN
2 NaN -4.25 -2.25  0.75 NaN
3 NaN -3.25 -1.25  1.75 NaN
  1. I can't understand why I get this output.
  2. I would like my output to be 4x4, not 4x5. Why do I have a 5th column?
  3. Why do I get NaN from the operation?

Edit

Expected output:

 df_centered
         1      2      3      4
    0 -5.25  -2.25    1.75   5.75
    1 -5.25  -2.25    1.75   5.75
    2 -5.25  -2.25    1.75   5.75
    3 -5.25  -2.25    1.75   5.75

I think the fact that all the rows are equal is just specific to this example.

Now, if you calculate the mean of each row of df_centered, it will be 0, as expected from a "centering" process.

CodePudding user response:

The answers to all your questions are in this part of the documentation for the axis parameter:

For Series input, axis to match Series index on.

In your case the column labels, from print(df.columns), are:

Int64Index([1, 2, 3, 4], dtype='int64')

while print(avg) gives:

0    6.25
1    7.25
2    8.25
3    9.25
dtype: float64

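so here the index labels are 0, 1, 2, 3.

According to the piece of documentation above, with axis=1 pandas matches the index of avg (0, 1, 2, 3) against the column labels of df (1, 2, 3, 4). Label 0 exists only in avg, so an extra column 0 appears in the result. Label 4 exists only in df, so avg has no value to subtract from that column. In both cases one operand is missing, so the result is NaN.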
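The label matching described above can be checked directly. The following is a minimal sketch that reuses the data from the question; the variable names result_columns, only_in_avg, only_in_df and all_nan are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({1: [1, 2, 3, 4],
                   2: [4, 5, 6, 7],
                   3: [8, 9, 10, 11],
                   4: [12, 13, 14, 15]})
avg = df.mean(axis=1)  # indexed by the row labels 0, 1, 2, 3

# With axis=1 the Series index is matched against the column labels,
# so the result has one column per label found on either side.
result = df.sub(avg, axis=1)
result_columns = list(result.columns)

# Labels present on only one side have no partner to subtract with.
only_in_avg = list(avg.index.difference(df.columns))
only_in_df = list(df.columns.difference(avg.index))

# Columns of the result that contain nothing but NaN.
all_nan = [c for c in result.columns if result[c].isna().all()]

print(result_columns)
print(only_in_avg, only_in_df)
print(all_nan)
```

The all-NaN columns should be exactly the labels that appear on only one side, which accounts for both the fifth column and the NaN values.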

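One way to avoid the NaN values is to relabel the index of avg so that it matches the columns of df:

avg.index = df.columns

Note, however, that df.sub(avg, axis=1) then subtracts the first mean from column 1, the second mean from column 2, and so on. This removes the NaN values, but it does not center the rows.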

But, as far as I can see, using level=0 or level=1 yields the same result.

I guess the problem lies in the way the operation is broadcast, so I'd suggest the following solution.

Data

import numpy as np
import pandas as pd

val = {1: [1,2,3,4],
       2: [4,5,6,7],
       3: [8,9,10,11],
       4: [12,13,14,15]}

df = pd.DataFrame(val)

Generate matrix to subtract

We first get the row means as a NumPy array

avg = df.mean(axis=1).values

Then we repeat each row mean once per column of df (using len(df), the number of rows, would only work for a square DataFrame)

rep = np.repeat(avg, df.shape[1])

And finally we reshape it to the shape of df

mat = rep.reshape(df.shape)
[[6.25 6.25 6.25 6.25]
 [7.25 7.25 7.25 7.25]
 [8.25 8.25 8.25 8.25]
 [9.25 9.25 9.25 9.25]]

Now df.sub works as expected

df_centered = df.sub(mat)

which is the dataframe you are looking for

      1     2     3     4
0 -5.25 -2.25  1.75  5.75
1 -5.25 -2.25  1.75  5.75
2 -5.25 -2.25  1.75  5.75
3 -5.25 -2.25  1.75  5.75
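As an alternative to building the full matrix by hand, NumPy broadcasting can do the repetition implicitly. This is a minimal sketch, not part of the original answer; the names row_means and centered_values are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({1: [1, 2, 3, 4],
                   2: [4, 5, 6, 7],
                   3: [8, 9, 10, 11],
                   4: [12, 13, 14, 15]})

# Row means as a column vector of shape (n_rows, 1).
row_means = df.mean(axis=1).to_numpy()[:, np.newaxis]

# Broadcasting stretches the (n_rows, 1) vector across all columns,
# so no explicit repeat/reshape step is needed.
centered_values = df.to_numpy() - row_means

# Wrap the array back into a DataFrame with the original labels.
df_centered = pd.DataFrame(centered_values, index=df.index, columns=df.columns)

print(df_centered)
```

Because the arithmetic happens on plain arrays, no label alignment takes place, so the result keeps the 4x4 shape regardless of how the index and the columns are labelled.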

CodePudding user response:

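Use axis=0 in DataFrame.sub to subtract per row. The index of the Series is then matched against the row index of the DataFrame, so 6.25 is subtracted from every element of the first row, 7.25 from the second row, and so on: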

df_centered = df.sub(avg,axis=0)
print(df_centered)
      1     2     3     4
0 -5.25 -2.25  1.75  5.75
1 -5.25 -2.25  1.75  5.75
2 -5.25 -2.25  1.75  5.75
3 -5.25 -2.25  1.75  5.75
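To check that axis=0 really centers the rows, and that it does not rely on the DataFrame being square, here is a small sketch on a hypothetical 4x3 frame. The column names "a", "b", "c" and the values are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical non-square frame: 4 rows, 3 columns.
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [4, 5, 6, 7],
                   "c": [10, 11, 12, 13]})

avg = df.mean(axis=1)              # one mean per row, indexed by row label
df_centered = df.sub(avg, axis=0)  # match avg's index against df's row index

# After centering, every row mean should be (numerically) zero.
row_means_after = df_centered.mean(axis=1)

print(df_centered)
print(np.allclose(row_means_after, 0))
```

The shape and the column labels are unchanged, because avg and df share the same row index and nothing has to be added during alignment.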

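Why do I get NaN from the operation?

Because with axis=1 the index labels of the Series avg (0, 1, 2, 3) are matched against the column labels of the DataFrame (1, 2, 3, 4). The labels 0 and 4 have no match on the other side, so pandas creates those columns in the result and fills them with NaN.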
