I have many CSV files that all need the same manipulation. I want to write a loop that reads each .csv into a pandas DataFrame, performs some basic manipulations, and leaves the resulting DataFrame available to the rest of my Python code (for other work). I create an empty DataFrame, run the loop, and confirm that inside the loop the DataFrame has been populated from the .csv, but once the loop has completed the DataFrame is still empty.
import pandas as pd

def r_insight_history_loop(f):
    df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])
    read_file = pd.read_csv(f)
    read_file1 = read_file[['INSTANCE_ID', ' USER_ID']]
    df_a = df_a.append(read_file1)
    print(df_a)
    print('loop complete')

df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])
df_a.info()
g = r"C:\Users\MYCOMPUTER\R_INSIGHT_HISTORY_2_1 (1).csv"
r_insight_history_loop(g)
print(df_a)
All of the prints were just troubleshooting, to confirm the loop was running. What I get is:
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   INSTANCE_ID  0 non-null      object
 1    USER_ID     0 non-null      object
dtypes: object(2)
memory usage: 0.0 bytes
INSTANCE_ID \
0 b74eb5ba-dd27-469a-b8ae-e0b6b4f0b71b
1 b83859d2-86aa-4e27-b8d6-c72aa24b7465
2 28cbafca-bbf6-4218-ad91-5444816b28c6
3 eeb598b2-35c5-441c-9a8d-c0095944d423
4 70ddbb80-5e9e-4f74-a0cf-2e0841ef68a9
... ...
3586 bc181bb9-d1f8-475d-93fa-72cb8f2d29a2
USER_ID
0 b74eb5ba-dd27-469a-b8ae-e0b6b4f0b71b
1 b83859d2-86aa-4e27-b8d6-c72aa24b7465
2 28cbafca-bbf6-4218-ad91-5444816b28c6
3 eeb598b2-35c5-441c-9a8d-c0095944d423
4 70ddbb80-5e9e-4f74-a0cf-2e0841ef68a9
... ...
3586 bc181bb9-d1f8-475d-93fa-72cb8f2d29a2
[3587 rows x 2 columns]
loop complete
Empty DataFrame
Columns: [INSTANCE_ID, USER_ID]
Index: []
CodePudding user response:
Assigning to df_a inside r_insight_history_loop creates a local variable that shadows the global df_a defined outside the function, so the global df_a is never updated. The simplest change, though not a recommended one, is to declare df_a as global inside the function:
def r_insight_history_loop(f):
    global df_a  # rebind the module-level df_a instead of creating a local one
    # df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])  # no longer needed here
    read_file = pd.read_csv(f)
    read_file1 = read_file[['INSTANCE_ID', ' USER_ID']]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    df_a = pd.concat([df_a, read_file1])
    print(df_a)
    print('loop complete')
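To see the scoping rule in isolation, here is a minimal sketch with no pandas involved. The names `total`, `add_local`, and `add_and_return` are made up for illustration and do not come from the question.

```python
total = 0  # module-level name, analogous to the outer df_a


def add_local(amount):
    # Assigning to `total` here creates a new local variable;
    # the module-level `total` is left untouched.
    total = amount
    return total


def add_and_return(current, amount):
    # No outer name is rebound here; the caller decides what to do
    # with the returned value.
    return current + amount


inside_value = add_local(5)         # value seen inside the function
after_local = total                 # outer name as seen by the caller afterwards
total = add_and_return(total, 5)    # caller rebinds the outer name explicitly
after_return = total
```

After `add_local(5)` the outer `total` is unchanged, which mirrors the empty DataFrame printed at the end of the question. It only changes once the caller assigns the returned value back to it.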
A cleaner version of the function takes df_a as an argument, updates it, and returns the result:
import pandas as pd

def r_insight_history_loop(f, _df_a):
    read_file = pd.read_csv(f)
    read_file1 = read_file[['INSTANCE_ID', ' USER_ID']]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    _df_a = pd.concat([_df_a, read_file1])
    print(_df_a)
    print('loop complete')
    return _df_a

df_a = pd.DataFrame(columns=['INSTANCE_ID', ' USER_ID'])
df_a.info()
g = r"C:\Users\MYCOMPUTER\R_INSIGHT_HISTORY_2_1 (1).csv"
df_a = r_insight_history_loop(g, df_a)  # rebind the global name to the returned result
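Since the question mentions many CSV files, the same return-the-result idea can be extended to a list of inputs. The following is only a sketch under assumptions: `load_history` and `COLUMNS` are hypothetical names, and the in-memory `io.StringIO` objects stand in for real files so that the example runs anywhere.

```python
import io

import pandas as pd

# Column names copied from the question (note the leading space in ' USER_ID')
COLUMNS = ['INSTANCE_ID', ' USER_ID']


def load_history(sources, columns=COLUMNS):
    """Read each CSV in `sources`, keep `columns`, and return one combined DataFrame."""
    frames = [pd.read_csv(src)[columns] for src in sources]
    if not frames:
        # Nothing to read: return an empty frame with the expected columns
        return pd.DataFrame(columns=columns)
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(frames, ignore_index=True)


# Stand-in data (hypothetical), shaped like the files in the question
csv_a = io.StringIO("INSTANCE_ID, USER_ID,OTHER\na1,u1,x\na2,u2,y\n")
csv_b = io.StringIO("INSTANCE_ID, USER_ID,OTHER\nb1,u3,z\n")

df_a = load_history([csv_a, csv_b])
print(df_a)
```

With real files, `sources` would be a list of paths, for example one built with `glob.glob`. Collecting the pieces in a list and calling `pd.concat` once avoids copying the accumulated DataFrame on every iteration.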
