How do I merge data without duplicating the columns?

Let's say I have multiple data frames df1, df2, df3 (in reality, I have about 11). They were created from Excel worksheets.

For example:

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | 20   | 30   | 40   |
   Feb      | NaN  | NaN  | NaN  |
   Mar      | NaN  | NaN  | NaN  |
   Apr      | NaN  | NaN  | NaN  |

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | NaN  | NaN  | NaN  |
   Feb      | 20   | 30   | 40   |
   Mar      | NaN  | NaN  | NaN  |
   Apr      | NaN  | NaN  | NaN  |

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | NaN  | NaN  | NaN  |
   Feb      | NaN  | NaN  | NaN  |
   Mar      | 10   | 20   | 80   |
   Apr      | NaN  | NaN  | NaN  |

I would like the output to be

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | 20   | 30   | 40   |
   Feb      | 20   | 30   | 40   |
   Mar      | 10   | 20   | 80   |
   Apr      | NaN  | NaN  | NaN  |

All the data frames have the same column names, and the date column has the same rows (same count and same labels) in every spreadsheet. I tried join, merge, and concatenate, but they either overwrite the values or add extra rows or columns. In the end, I would like the output to have the same number of rows and columns as each input.

I am new to Python, so I am still trying to figure this out!

CodePudding user response：

Use concat, then group by the date and aggregate with sum. If the dates are in a column (here named date):

df = pd.concat([df1, df2, df3]).groupby('date', sort=False).sum(min_count=1)

If the dates are in the index (for example a DatetimeIndex):

df = pd.concat([df1, df2, df3]).groupby(level=0, sort=False).sum(min_count=1)

print(df)
     Col1  Col2  Col3
jan  20.0  30.0  40.0
Feb  20.0  30.0  40.0
Mar  10.0  20.0  80.0
Apr   NaN   NaN   NaN
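
To make the role of min_count=1 concrete, here is a minimal self-contained sketch that rebuilds the question's three frames and applies the index-based variant. The helper one_row_frame and the names stacked, merged, and zero_filled are hypothetical, introduced only for illustration; it also assumes the month labels are in the index rather than in a column.

```python
import numpy as np
import pandas as pd

months = ["jan", "Feb", "Mar", "Apr"]
cols = ["Col1", "Col2", "Col3"]


def one_row_frame(month, values):
    """Hypothetical helper: a frame that is all NaN except for one month's row."""
    df = pd.DataFrame(np.nan, index=months, columns=cols)
    df.loc[month] = values
    return df


df1 = one_row_frame("jan", [20, 30, 40])
df2 = one_row_frame("Feb", [20, 30, 40])
df3 = one_row_frame("Mar", [10, 20, 80])

# Stack the frames vertically: every month now appears three times.
stacked = pd.concat([df1, df2, df3])

# Collapse the repeated months back into one row each, keeping first-seen order.
grouped = stacked.groupby(level=0, sort=False)

# min_count=1: a group needs at least one non-NaN value, otherwise it stays NaN.
merged = grouped.sum(min_count=1)

# Default (min_count=0): an all-NaN group sums to 0 instead.
zero_filled = grouped.sum()

print(merged)
print(zero_filled)
```

With min_count=1 the Apr row should remain NaN, whereas the default sum would report 0 for it, which may be misleading if 0 is a meaningful value in the data.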

CodePudding user response：

Based on the examples in the question, you can also perform a simple sum. The operation aligns automatically on the index and columns.

example input (with added NaN):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Col1': [20, 0, 0], 'Col2': [30, 0, 0], 'Col3': [40, 0, 0]},
                   index=['Jan', 'Feb', 'Mar'])
df2 = pd.DataFrame({'Col1': [0, 20, 0], 'Col2': [0, 30, 0], 'Col3': [0, 40, np.nan]},
                   index=['Jan', 'Feb', 'Mar'])
df3 = pd.DataFrame({'Col1': [0, 0, 10], 'Col2': [0, 0, 20], 'Col3': [0, 0, 80]},
                   index=['Jan', 'Feb', 'Mar'])

code:

dfs = [df1, df2, df3]
df_out = sum(dfs)

output:

     Col1  Col2  Col3
Jan    20    30  40.0
Feb    20    30  40.0
Mar    10    20   NaN

NB. If you have other columns, please specify how they should be handled.
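
As the Mar/Col3 cell in the output shows, a plain sum propagates NaN, so it would blank out most cells when the inputs are mostly NaN, as in the question. One possible workaround, sketched below and not part of the answer above, is to fold the frames together with DataFrame.add and fill_value=0, which treats a missing value as 0 unless both sides are missing. The two small frames and the names plain and tolerant are assumptions made for illustration.

```python
from functools import reduce

import numpy as np
import pandas as pd

idx = ["Jan", "Feb", "Mar"]
df1 = pd.DataFrame({"Col1": [20, np.nan, np.nan], "Col2": [30, np.nan, np.nan]}, index=idx)
df2 = pd.DataFrame({"Col1": [np.nan, 20, np.nan], "Col2": [np.nan, 30, np.nan]}, index=idx)
dfs = [df1, df2]

# Plain sum: number + NaN is NaN, so every cell ends up NaN for this input.
plain = sum(dfs)

# Pairwise add with fill_value=0: a NaN on one side counts as 0,
# but a cell that is NaN in both frames stays NaN.
tolerant = reduce(lambda left, right: left.add(right, fill_value=0), dfs)

print(plain)
print(tolerant)
```

Here the Mar row should stay NaN in tolerant because no frame supplies a value for it, which matches the behavior the question asks for.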