I have an "Archive" dataframe with hundreds of columns, each representing a time series (S1, S2...)
S1 S2
Date1 5 5
Date2 8 10
I need to update the archive, importing fresh data from several dfs (I have more than one "new_data" df for each new date). So, for example:
new_data1:
S3
Date3 8
new_data2:
S2 S4
Date3 9 5
new_data3:
S3
Date4 5
new_data4:
S4
Date4 9
So each new_data df shares some of the columns of the archive df, but could also have some new columns. This is how the Archive df is supposed to be after the updates:
S1 S2 S3 S4
Date1 5 5 NaN NaN
Date2 8 10 10 9
Date3 NaN 9 8 5
Date4 NaN NaN 5 9
I see from this question that I could outer merge the Archive df with the new_data dfs and later combining the duplicate columns (_x and _y) that the merge would create:
dataframes = [new_data1, new_data2, new_data3, new_data4]
for i in dataframes:
# Merge the dataframe
archive = archive.merge(i, how='outer', on='Date')
# Get the series names
series_names = i.columns
# Combine duplicate columns
for series_name in series_names:
if series_name "_x" in archive.columns:
x = series_name "_x"
y = series_name "_y"
archive[series_name] = archive[y].fillna(archive[x])
archive.drop([x, y], 1, inplace=True)
I was wondering if there is a more efficient way to do the same thing. Thanks
CodePudding user response:
What you describe sound like "upsert" for the SQL systems. The equivalent in pandas is combine_first:
for i in dataframes:
archive = i.combine_first(archive)
