Let's say I have multiple data frames df1, df2, df3 (in reality, I have about 11). They were created from Excel worksheets.
For example:
df1:
DateTime | Col1 | Col2 | Col3 |
-------------------------------
jan      | 20   | 30   | 40   |
Feb      | NaN  | NaN  | NaN  |
Mar      | NaN  | NaN  | NaN  |
Apr      | NaN  | NaN  | NaN  |
df2:
DateTime | Col1 | Col2 | Col3 |
-------------------------------
jan      | NaN  | NaN  | NaN  |
Feb      | 20   | 30   | 40   |
Mar      | NaN  | NaN  | NaN  |
Apr      | NaN  | NaN  | NaN  |
df3:
DateTime | Col1 | Col2 | Col3 |
-------------------------------
jan      | NaN  | NaN  | NaN  |
Feb      | NaN  | NaN  | NaN  |
Mar      | 10   | 20   | 80   |
Apr      | NaN  | NaN  | NaN  |
I would like the output to be:
Out=
DateTime | Col1 | Col2 | Col3 |
-------------------------------
jan      | 20   | 30   | 40   |
Feb      | 20   | 30   | 0    |
Mar      | 10   | 20   | 80   |
Apr      | NaN  | NaN  | NaN  |
All the data frames have the same column names, and the date column has the same number of rows and the same labels in every spreadsheet. I tried join, merge and concatenate, but they either overwrite the values or add extra rows or columns. In the end, I would like the output to have the same number of rows and columns as each input.
I am new to Python, so I am still trying to figure this out!
CodePudding user response:
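Use concat followed by an aggregate sum; min_count=1 keeps the result NaN where every frame is missing a value, instead of returning 0. If the dates are in a column (DateTime in the question):
df = pd.concat([df1, df2, df3]).groupby('DateTime', sort=False).sum(min_count=1)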
If the dates are in the index (for example a DatetimeIndex):
df = pd.concat([df1, df2, df3]).groupby(level=0, sort=False).sum(min_count=1)
print(df)
     Col1  Col2  Col3
jan  20.0  30.0  40.0
Feb  20.0  30.0  40.0
Mar  10.0  20.0  80.0
Apr   NaN   NaN   NaN
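For reference, here is a self-contained sketch of the index-based variant. The frames are rebuilt by hand from the question's example tables rather than read from Excel, and the helper frame_with_row is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd

months = ["jan", "Feb", "Mar", "Apr"]
cols = ["Col1", "Col2", "Col3"]


def frame_with_row(month, values):
    """Build an all-NaN frame and fill a single row (illustrative helper)."""
    df = pd.DataFrame(np.nan, index=months, columns=cols)
    df.loc[month] = values
    return df


# Assumed inputs, mirroring the three example tables in the question.
df1 = frame_with_row("jan", [20, 30, 40])
df2 = frame_with_row("Feb", [20, 30, 40])
df3 = frame_with_row("Mar", [10, 20, 80])

# Stack all rows, then collapse the rows that share an index label.
# sort=False keeps the labels in order of first appearance;
# min_count=1 leaves NaN where no frame supplies a value.
combined = (
    pd.concat([df1, df2, df3])
    .groupby(level=0, sort=False)
    .sum(min_count=1)
)
print(combined)
```

With 11 frames, the same call should work by passing a longer list to pd.concat.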
CodePudding user response:
Given the examples in the question, you can also perform a simple sum. The operation aligns automatically on the index and columns.
Example input (with one added NaN to show that missing values propagate):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Col1': [20, 0, 0], 'Col2': [30, 0, 0], 'Col3': [40, 0, 0]},
                   index=['Jan', 'Feb', 'Mar'])
df2 = pd.DataFrame({'Col1': [0, 20, 0], 'Col2': [0, 30, 0], 'Col3': [0, 40, np.nan]},
                   index=['Jan', 'Feb', 'Mar'])
df3 = pd.DataFrame({'Col1': [0, 0, 10], 'Col2': [0, 0, 20], 'Col3': [0, 0, 80]},
                   index=['Jan', 'Feb', 'Mar'])
Code:
dfs = [df1, df2, df3]
df_out = sum(dfs)
Output:
     Col1  Col2  Col3
Jan    20    30  40.0
Feb    20    30  40.0
Mar    10    20   NaN
NB: if you have other columns, please specify how they should be handled.
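Because a plain sum propagates NaN, frames that hold NaN rather than 0 (as in the question) need a different combination step. One possible sketch, not part of the original answer, chains DataFrame.add with fill_value=0 through functools.reduce; the small frames below are assumed sample data.

```python
from functools import reduce

import numpy as np
import pandas as pd

idx = ["Jan", "Feb", "Mar", "Apr"]

# Assumed sample data: each frame holds one populated row, NaN elsewhere.
df1 = pd.DataFrame({"Col1": [20, np.nan, np.nan, np.nan],
                    "Col2": [30, np.nan, np.nan, np.nan]}, index=idx)
df2 = pd.DataFrame({"Col1": [np.nan, 20, np.nan, np.nan],
                    "Col2": [np.nan, 30, np.nan, np.nan]}, index=idx)
df3 = pd.DataFrame({"Col1": [np.nan, np.nan, 10, np.nan],
                    "Col2": [np.nan, np.nan, 20, np.nan]}, index=idx)

# add(..., fill_value=0) treats a NaN as 0 when the other operand has a value,
# but leaves the cell NaN when both operands are missing.
total = reduce(lambda a, b: a.add(b, fill_value=0), [df1, df2, df3])
print(total)
```

Here the Apr row stays NaN because no frame has a value for it, which matches the behavior of sum(min_count=1) in the concat approach.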
