Create Populated Python Dataframe in Loop

I have many CSV files that all need the same manipulation. I want to write a loop that reads each .csv into a pandas DataFrame, performs some basic manipulations, and leaves the resulting DataFrame available to the rest of my Python code (for other work). I create the empty DataFrame, run the loop, and confirm that inside the loop the DataFrame has been populated from the .csv, but once the loop has completed the DataFrame is still empty.

def r_insight_history_loop(f):
    df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])
    read_file = pd.read_csv(f)
    read_file1 = read_file[['INSTANCE_ID', ' USER_ID']]
    df_a = df_a.append(read_file1)
    print(df_a)
    print('loop complete')


df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])
df_a.info()
g = r"C:\Users\MYCOMPUTER\R_INSIGHT_HISTORY_2_1 (1).csv"
r_insight_history_loop(g)

print(df_a)

All of the prints were just troubleshooting, to confirm the loop was running. What I get is:

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   INSTANCE_ID  0 non-null      object
 1    USER_ID     0 non-null      object
dtypes: object(2)
memory usage: 0.0  bytes
                               INSTANCE_ID  \
0     b74eb5ba-dd27-469a-b8ae-e0b6b4f0b71b
1     b83859d2-86aa-4e27-b8d6-c72aa24b7465
2     28cbafca-bbf6-4218-ad91-5444816b28c6
3     eeb598b2-35c5-441c-9a8d-c0095944d423
4     70ddbb80-5e9e-4f74-a0cf-2e0841ef68a9
...                                    ...
3586  bc181bb9-d1f8-475d-93fa-72cb8f2d29a2

                                   USER_ID
0     b74eb5ba-dd27-469a-b8ae-e0b6b4f0b71b
1     b83859d2-86aa-4e27-b8d6-c72aa24b7465
2     28cbafca-bbf6-4218-ad91-5444816b28c6
3     eeb598b2-35c5-441c-9a8d-c0095944d423
4     70ddbb80-5e9e-4f74-a0cf-2e0841ef68a9
...                                    ...
3586  bc181bb9-d1f8-475d-93fa-72cb8f2d29a2

[3587 rows x 2 columns]
loop complete
Empty DataFrame
Columns: [INSTANCE_ID,  USER_ID]
Index: []

CodePudding user response:

Because r_insight_history_loop assigns to df_a, Python treats that name as local to the function, and this local df_a hides the global df_a defined outside it. Hence, the global df_a is never updated. The simplest change, though not a recommended one, is to declare df_a as global inside the function:

def r_insight_history_loop(f):
    global df_a  # rebind the module-level df_a instead of creating a local one
    # The empty-DataFrame line is no longer needed here.
    read_file = pd.read_csv(f)
    read_file1 = read_file[['INSTANCE_ID', ' USER_ID']]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    df_a = pd.concat([df_a, read_file1])
    print(df_a)
    print('loop complete')
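
The scoping behaviour described above can be reproduced without pandas. The following is only a minimal sketch: counter, bump_local and bump_returned are made-up names for illustration and do not come from the question.

```python
counter = 0  # module-level (global) name


def bump_local():
    # Assigning to `counter` here creates a new local variable;
    # the module-level `counter` is left untouched.
    counter = 100
    return counter


def bump_returned(value):
    # No global state involved: take the value in, hand the new value back.
    return value + 1


bump_local()
after_local = counter             # the global was not changed by bump_local
counter = bump_returned(counter)  # the caller rebinds the global explicitly
after_return = counter
```

Here after_local still holds the original value, while after_return holds the updated one, because only the second call's result is assigned back to the global name.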

A cleaner version of the function would take df_a as an argument, update it and return the result, as follows:

import pandas as pd


def r_insight_history_loop(f, _df_a):
    read_file = pd.read_csv(f)
    read_file1 = read_file[['INSTANCE_ID', ' USER_ID']]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    _df_a = pd.concat([_df_a, read_file1])
    print(_df_a)
    print('loop complete')
    return _df_a


df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])
df_a.info()
g = r"C:\Users\MYCOMPUTER\R_INSIGHT_HISTORY_2_1 (1).csv"
# Rebind the global name to the DataFrame the function returns.
df_a = r_insight_history_loop(g, df_a)
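
Since the question mentions many CSV files, the same return-the-result idea might be extended to a whole folder. This is a sketch under assumptions rather than the asker's actual setup: load_history, csv_dir and pattern are hypothetical names, and it assumes every file contains both columns with exactly the header spelling used in the question (including the leading space in ' USER_ID').

```python
from pathlib import Path

import pandas as pd

# Column names copied from the question; note the leading space in ' USER_ID'.
COLUMNS = ['INSTANCE_ID', ' USER_ID']


def load_history(csv_dir, pattern="*.csv"):
    """Read every matching CSV in csv_dir and return one combined DataFrame."""
    frames = []
    for path in sorted(Path(csv_dir).glob(pattern)):
        # usecols keeps only the two columns of interest while reading.
        frames.append(pd.read_csv(path, usecols=COLUMNS))
    if not frames:
        # No files matched: return an empty frame with the expected columns.
        return pd.DataFrame(columns=COLUMNS)
    # A single concat at the end avoids re-copying the data on every iteration.
    return pd.concat(frames, ignore_index=True)
```

The caller would then bind the result once, for example df_a = load_history(some_folder), where some_folder is whichever directory holds the files. Collecting the per-file frames in a list and concatenating once tends to be cheaper than growing a DataFrame inside the loop.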