How to make a for loop over two or more DataFrames?

Time:01-15

I'm trying to run a for loop over two DataFrames called df and df2. Each DataFrame contains two price columns. I'm trying to calculate the z-score of each price in each DataFrame, so that in the end each DataFrame has four columns. I have code that does this, but it also seems to create a third DataFrame. Why?

Here's an example of my two DataFrames:

import pandas as pd
import numpy as np

df = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)

df2 = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)

Here's the result I would like:

print(df)
            Price1  Price2  Price1_z  Price2_z
2021-01-31      25      30 -0.262111 -0.716465
2021-02-28      30      25  0.087370 -0.885044
2021-03-31      50      50  1.485297 -0.042145
2021-04-30      10     100 -1.310556  1.643654

print(df2)
            Price1  Price2  Price1_z  Price2_z
2021-01-31      25      30 -0.262111 -0.716465
2021-02-28      30      25  0.087370 -0.885044
2021-03-31      50      50  1.485297 -0.042145
2021-04-30      10     100 -1.310556  1.643654

My problem is that the loop seems to create a third DataFrame named frame, which is equal to the last DataFrame in the list (df2).

If you run the code below, you can see this in the output of print(df, df2, frame):

import pandas as pd
import numpy as np

df = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)

df2 = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)

for frame in [df,df2]:
    cols = list(frame.columns)
    for col in cols:
        col_zscore = col + '_z'
        frame[col_zscore] = (frame[col] - frame[col].mean())/frame[col].std(ddof=0)

print(df,df2,frame)

How can I get the same result without creating the third DataFrame? Thanks!

CodePudding user response:

My understanding is that this is standard behavior for for loops in Python: the loop variable still exists after the loop ends. You could solve it in two ways if it causes you problems. Option 1: delete the variable frame right after the loop:

del frame
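As a quick check of what del does here, the sketch below uses two small stand-in DataFrames (the column name Price1 follows the question; the shortened frames are only for illustration). It suggests that del frame removes only the name frame, while the DataFrame that df2 refers to is left untouched.

```python
import pandas as pd

# Small stand-in frames; shortened versions of the question's data.
df = pd.DataFrame({"Price1": [25, 30, 50, 10]})
df2 = pd.DataFrame({"Price1": [25, 30, 50, 10]})

for frame in [df, df2]:
    frame["Price1_z"] = (frame["Price1"] - frame["Price1"].mean()) / frame["Price1"].std(ddof=0)

# After the loop, `frame` is just a second name for the object df2 refers to.
same_object = frame is df2

del frame  # removes the name only; the DataFrame itself is not deleted

# Looking up the deleted name now raises NameError.
try:
    frame
    name_gone = False
except NameError:
    name_gone = True

df2_columns = list(df2.columns)
print(same_object, name_gone, df2_columns)
```

So df2 keeps its new column after the del; only the extra name disappears.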

Alternatively, you could make a list that references df and df2, then loop over its indices so that no extra DataFrame name is created:

import pandas as pd
import numpy as np

df = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)

df2 = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)

dflist = [df, df2]

for num in range(len(dflist)):
    cols = list(dflist[num].columns)
    for col in cols:
        col_zscore = col + '_z'
        dflist[num][col_zscore] = (dflist[num][col] - dflist[num][col].mean())/dflist[num][col].std(ddof=0)
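Another possibility, sketched here as an assumption rather than something from the question, is to move the loop into a small helper function. The name add_zscores is hypothetical. Because the loop variable is local to the function, no extra name is left behind at module level after the call.

```python
import pandas as pd


def add_zscores(*frames: pd.DataFrame) -> None:
    """Add a '<col>_z' population z-score column for every column, in place.

    Hypothetical helper: the loop variables below are local to this
    function, so they do not linger in the caller's namespace.
    """
    for frame in frames:
        # Snapshot the column names first so the new '_z' columns
        # are not themselves z-scored.
        for col in list(frame.columns):
            frame[col + "_z"] = (frame[col] - frame[col].mean()) / frame[col].std(ddof=0)


df = pd.DataFrame({"Price1": [25, 30, 50, 10], "Price2": [30, 25, 50, 100]})
df2 = pd.DataFrame({"Price1": [25, 30, 50, 10], "Price2": [30, 25, 50, 100]})

add_zscores(df, df2)
print(df.round(6))
print(df2.round(6))
```

The DataFrames are modified in place, so df and df2 each end up with four columns, as in the desired output.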

Hopefully that is helpful!

CodePudding user response:

You are referencing the same object under a different name; no third object is created.

Consider the following code. In the loop, a, b, and c are each referenced in turn by letter.

a = 10
b = 20
c = 30

for letter in [a,b,c]:
    pass

print(letter)
# 30

print(c is letter)
# True

On the last iteration, letter refers to the same object as c, and it still refers to that object after the loop finishes. We can confirm this using print(c is letter).
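To make the distinction between a name and an object more concrete, here is a minimal sketch using plain lists (the names prices_a, prices_b, and item are made up for illustration). Changing the object through the loop name is visible through the original name, while assigning a new value to the loop name leaves the original object alone.

```python
prices_a = [25, 30]
prices_b = [50, 10]

for item in [prices_a, prices_b]:
    item.append(0)  # mutates the list that `item` currently refers to

# After the loop, `item` is still a name for the last list visited.
still_same = item is prices_b

# Rebinding: `item` now names a brand-new list; prices_b is unaffected.
item = [1, 2, 3]
rebound_same = item is prices_b

print(still_same, rebound_same)  # True False
print(prices_a, prices_b)        # [25, 30, 0] [50, 10, 0]
```

This is the same mechanism that lets the loop in the question add columns to df and df2 through the name frame.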


In the code that you provided, the same is true of df2 and frame. You can confirm this by appending print(df2 is frame) to the code block.

import pandas as pd
import numpy as np

df = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df.set_index("date", inplace=True)
df.index = pd.to_datetime(df.index)

df2 = pd.DataFrame(
         {"date": ["2021-01-31","2021-02-28", "2021-03-31","2021-04-30"],
          "Price1": [25, 30, 50, 10],
          "Price2": [30, 25, 50, 100]})
df2.set_index("date", inplace=True)
df2.index = pd.to_datetime(df2.index)

for frame in [df,df2]:
    cols = list(frame.columns)
    for col in cols:
        col_zscore = col + '_z'
        frame[col_zscore] = (frame[col] - frame[col].mean())/frame[col].std(ddof=0)
        
print(df2 is frame)
# True