Most efficient way to update dataframe with fresh data-CodePudding

I have an "Archive" dataframe with hundreds of columns, each representing a time series (S1, S2...)

        S1  S2
Date1   5   5
Date2   8   10

I need to update the archive, importing fresh data from several dfs (I have more than one "new_data" df for each new date). So, for example:

new_data1:

        S3
Date3   8

new_data2:

        S2  S4
Date3   9   5

new_data3:

        S3
Date4   5

new_data4:

        S4
Date4   9

So each new_data df shares some of the columns of the archive df, but could also have some new columns. This is how the Archive df is supposed to be after the updates:

        S1  S2  S3  S4
Date1   5   5   NaN NaN
Date2   8   10  10  9
Date3   NaN 9   8   5
Date4   NaN NaN 5   9

I see from this question that I could outer merge the Archive df with the new_data dfs and later combining the duplicate columns (_x and _y) that the merge would create:

dataframes = [new_data1, new_data2, new_data3, new_data4]

for i in dataframes:
    # Merge the dataframe
    archive = archive.merge(i, how='outer', on='Date')

    # Get the series names
    series_names = i.columns

    # Combine duplicate columns
    for series_name in series_names:
       if series_name "_x" in archive.columns:
           x = series_name "_x"
           y = series_name "_y"
           archive[series_name] = archive[y].fillna(archive[x])
           archive.drop([x, y], 1, inplace=True)

I was wondering if there is a more efficient way to do the same thing. Thanks

CodePudding user response：

What you describe sound like "upsert" for the SQL systems. The equivalent in pandas is combine_first:

for i in dataframes:
    archive = i.combine_first(archive)