I'm trying to run a for loop over two DataFrames called df and df2. Each DataFrame contains two price columns. I'm trying to calculate the z-score of each price in each DataFrame, so that in the end each DataFrame has four columns. I have code that does this, but it also creates a third DataFrame. Why?
Here's an example of my two DataFrames:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)
df2 = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)
Here's the result I would like:
print(df)
            Price1  Price2  Price1_z  Price2_z
2021-01-31      25      30 -0.262111 -0.716465
2021-02-28      30      25  0.087370 -0.885044
2021-03-31      50      50  1.485297 -0.042145
2021-04-30      10     100 -1.310556  1.643654
print(df2)
            Price1  Price2  Price1_z  Price2_z
2021-01-31      25      30 -0.262111 -0.716465
2021-02-28      30      25  0.087370 -0.885044
2021-03-31      50      50  1.485297 -0.042145
2021-04-30      10     100 -1.310556  1.643654
My problem is that the loop creates a third DataFrame named frame, which is equal to the last DataFrame in the list (df2).
If you run the code below, you will see it when you call print(df, df2, frame):
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)
df2 = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)
for frame in [df, df2]:
    # Snapshot the column names first, because the loop adds new columns
    cols = list(frame.columns)
    for col in cols:
        col_zscore = col + '_z'
        frame[col_zscore] = (frame[col] - frame[col].mean()) / frame[col].std(ddof=0)
print(df, df2, frame)
How can I get the same result without creating the third DataFrame? Thanks!
CodePudding user response:
My understanding is that this is standard behavior for for loops in Python: the loop variable still exists after the loop ends. If that causes you problems, you could solve it in two ways. Option 1: delete the variable frame right after the loop:
del frame
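As a small illustration of what del does here, the sketch below uses plain dictionaries standing in for the DataFrames (the names first, second and item are made up for the example). Deleting the loop variable removes only the name; the object it referred to remains available under its original name.

```python
# Plain dicts stand in for the DataFrames; all names here are illustrative.
first = {"Price1": [25, 30, 50, 10]}
second = {"Price1": [25, 30, 50, 10]}

for item in [first, second]:
    item["touched"] = True  # modifies the original objects in place

del item  # removes the leftover name only, not the object it referred to

# Looking up the deleted name now raises NameError.
try:
    item
except NameError:
    name_removed = True
else:
    name_removed = False

print(name_removed)  # True
print(second)        # {'Price1': [25, 30, 50, 10], 'touched': True}
```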
Alternatively, you could put df and df2 in a list and loop over its indices, so that no loop variable is bound to a DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)
df2 = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)
dflist = [df, df2]
for num in range(len(dflist)):
    # Snapshot the column names first, because the loop adds new columns
    cols = list(dflist[num].columns)
    for col in cols:
        col_zscore = col + '_z'
        dflist[num][col_zscore] = (dflist[num][col] - dflist[num][col].mean()) / dflist[num][col].std(ddof=0)
Hopefully that is helpful!
CodePudding user response:
You are referencing the same object under a different name. No third DataFrame is created; frame is just another name for df2.
Consider the following code. In the loop, a, b, and c are each referenced in turn by the name letter.
a = 10
b = 20
c = 30
for letter in [a, b, c]:
    pass
print(letter)
# 30
print(c is letter)
# True
On the last iteration, letter is bound to the same object as c, and it stays bound to that object after the loop finishes. We can confirm this using print(c is letter).
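Because integers cannot be modified in place, the example above cannot show what sharing an object means in practice. The sketch below uses lists instead (an assumption made purely for illustration): a change made through the loop name is visible through the original name, whereas assigning a new value to the loop name leaves the original untouched.

```python
a = [10]
b = [20]
c = [30]

for letter in [a, b, c]:
    letter.append(0)  # in-place change, visible through a, b and c as well

print(letter is c)  # True
print(c)            # [30, 0]

letter = [99]       # rebinding: 'letter' now names a different list
print(letter is c)  # False
print(c)            # [30, 0]
```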
In the code that you provided, frame and df2 are related in the same way: both names refer to the same DataFrame. This is confirmed by appending print(df2 is frame) to the code block.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)
df2 = pd.DataFrame(
{"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
"Price1": [25, 30, 50, 10],
"Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)
for frame in [df, df2]:
    # Snapshot the column names first, because the loop adds new columns
    cols = list(frame.columns)
    for col in cols:
        col_zscore = col + '_z'
        frame[col_zscore] = (frame[col] - frame[col].mean()) / frame[col].std(ddof=0)
print(df2 is frame)
# True
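If the leftover name is unwanted, another option (not part of the answers above, so treat it as a sketch) is to move the loop into a function, since names assigned inside a function are local to it. The helper name add_zscores is hypothetical, and the sample frames are simplified by leaving out the date index.

```python
import pandas as pd


def add_zscores(*frames: pd.DataFrame) -> None:
    """Add a '<column>_z' z-score column for every column of each frame, in place."""
    for frame in frames:  # 'frame' exists only inside this function
        # Snapshot the column names first, because the loop adds new columns.
        for col in list(frame.columns):
            frame[col + "_z"] = (frame[col] - frame[col].mean()) / frame[col].std(ddof=0)


# Simplified sample data (no date index), same prices as in the question.
df = pd.DataFrame({"Price1": [25, 30, 50, 10], "Price2": [30, 25, 50, 100]})
df2 = df.copy()

add_zscores(df, df2)

print(df.columns.tolist())  # ['Price1', 'Price2', 'Price1_z', 'Price2_z']
```

After the call, both df and df2 carry the new columns, and no name frame is left behind in the calling scope.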
