There is a behavior of pandas DataFrames that I can't explain. I wish someone could walk me through it.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 5, 10]]), columns=["Jan", "Fév", "Mar"])
df2 = pd.DataFrame(np.array([[4, 4, 4]]), columns=["Jan", "Fév", "Mar"])
df
Jan Fév Mar
0 1 5 10
df2
Jan Fév Mar
0 4 4 4
So the booleans df < df2 and df >= df2 are respectively:
df < df2
Jan Fév Mar
0 True False False
df >= df2
Jan Fév Mar
0 False True True
However, if I run this sequence of code:
df3 = df2
df3[df < df2] = 0
df3[df >= df2] = 7
I get this result:
df3
Jan Fév Mar
0 7 7 7
df2
Jan Fév Mar
0 7 7 7
My question is: why does my code also modify the values of df2?
Is it because of the df3 = df2?
Answer:
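Yes, it is because of df3 = df2. In Python, = never copies an object: it only binds a second name to the same DataFrame, so any in-place change made through one name is visible through the other. To get an independent object, call .copy(). (pandas also distinguishes views from copies when you slice a DataFrame, but plain assignment is simpler than that: there is only one object.) Consider the following simple example: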
import pandas as pd
df1 = pd.DataFrame({'x':[1,2,3]})
df2 = df1
df3 = df1.copy()
df3['x'] = 0
print(df1)
output
x
0 1
1 2
2 3
then
df2['x'] = 0
print(df1)
gives output
x
0 0
1 0
2 0
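Applied to the data frames from the question, a minimal sketch of a fix could look like the following. It assumes the intent was to leave df2 untouched and to build both masks from its original values; the names lower and not_lower are made up for illustration. This also explains the 7 7 7 result: because df3 and df2 were the same object, the first assignment turned df2 into 0 4 4, so df >= df2 was True in every column by the time the second assignment ran.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 5, 10]]), columns=["Jan", "Fév", "Mar"])
df2 = pd.DataFrame(np.array([[4, 4, 4]]), columns=["Jan", "Fév", "Mar"])

# Build both masks up front, while df2 still holds its original values.
lower = df < df2
not_lower = df >= df2

# .copy() creates an independent DataFrame; plain `df3 = df2` would not.
df3 = df2.copy()
df3[lower] = 0
df3[not_lower] = 7

print(df3)  # values: 0 7 7
print(df2)  # values: 4 4 4 (unchanged)
```

Computing the masks before any assignment is optional once df3 is a real copy, since df2 no longer changes, but it makes the intent explicit.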
If you want to know more, read "Views and Copies in pandas" in Practical Data Science.
Note that built-in Python collections behave the same way, e.g. dicts:
d1 = dict(x=1,y=2)
d2 = d1
d3 = d1.copy()
d3['x'] = 0
print(d1) # {'x': 1, 'y': 2}
d2['x'] = 0
print(d1) # {'x': 0, 'y': 2}
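If you are unsure whether two names refer to the same object, the is operator tells you directly. The snippet below is a small illustrative check that reuses the variable names from the examples above:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = df1         # second name for the same DataFrame
df3 = df1.copy()  # new, independent DataFrame

print(df2 is df1)  # True
print(df3 is df1)  # False

d1 = dict(x=1, y=2)
d2 = d1           # second name for the same dict
d3 = d1.copy()    # new dict

print(d2 is d1)  # True
print(d3 is d1)  # False
```

Whenever is returns True, a change made in place through either name will show up through the other.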
