I have a large pandas dataframe, and want to merge a couple smaller dataframes into it, thus adding more columns. However, it seems there is an implicit of copy of the large dataframe after each merge, which I want to avoid. What's the most efficient way to do this? (Note the resulting dataframe will have the same rows, as it is growing with more columns.) map seems better, as it keeps the original dataframe, but there is overhead to create dictionary. Also not sure it works with merging multiple col into the main one. Or maybe the merge may not be deep copying everything internally?
Base case:
id(df) # before merge
df = df.merge(df1[["sid", "col1"]], how="left", on=["sid"])
id(df) # will be different <-- trying to avoid copying df every time a smaller one merged into it
df = df.merge(df2[["sid", "col2"]], how="left", on=["sid", "key2"])
id(df) # will be different
...
Using map():
d_col1 = {d["sid"]:d["col1"] for d in df1[["sid", "col1"]].to_dict("records")}
df["col1"] = df["sid"].map(d_col1)
id(df) # this is the same object
Some post referred dask, haven't tested that yet.
CodePudding user response:
here is another way. First map can be done with a Series and as df1 is already built, I don't know if it is less efficient than using a dictionary though.
df["col1"] = df["sid"].map(df1.set_index('sid')['col1'])
Now with two or more columns, you can play with index
df['col2'] = (
df2.set_index(['sid','key2'])['col2']
.reindex(pd.MultiIndex.from_frame(df[['sid','key2']]))
.to_numpy()
)
