How do I merge data without duplicating the columns?

Time:02-04

Let's say I have multiple data frames df1, df2, df3 (in reality, I have about 11). Each one was loaded from an Excel worksheet.

For example:

df1:

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | 20   | 30   | 40   |
   Feb      | NaN  | NaN  | NaN  |
   Mar      | NaN  | NaN  | NaN  |
   Apr      | NaN  | NaN  | NaN  |

df2:

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | NaN  | NaN  | NaN  |
   Feb      | 20   | 30   | 40   |
   Mar      | NaN  | NaN  | NaN  |
   Apr      | NaN  | NaN  | NaN  |

df3:

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | NaN  | NaN  | NaN  |
   Feb      | NaN  | NaN  | NaN  |
   Mar      | 10   | 20   | 80   |
   Apr      | NaN  | NaN  | NaN  |

I would like the output to be

Out:

   DateTime | Col1 | Col2 | Col3 |
   -------------------------------
   jan      | 20   | 30   | 40   |
   Feb      | 20   | 30   | 40   |
   Mar      | 10   | 20   | 80   |
   Apr      | NaN  | NaN  | NaN  |

All the data frames have the same column names, and the date column has the same rows (same count, same labels) in every spreadsheet. I tried join, merge and concat, but each of them either overwrites values or adds extra rows or columns. The output should have the same number of rows and columns as each input.

I am new to Python, so I am still trying to figure this out!

CodePudding user response:

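Use concat to stack the data frames, then group by the date and sum. Passing min_count=1 keeps a group NaN when all of its values are NaN, instead of turning it into 0. If the dates are in a column (named date here; substitute your own column name, e.g. DateTime):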

df = pd.concat([df1, df2, df3]).groupby('date', sort=False).sum(min_count=1)

If the dates are in the index (for example a DatetimeIndex), group on the index level instead:

df = pd.concat([df1, df2, df3]).groupby(level=0, sort=False).sum(min_count=1)

print(df)
     Col1  Col2  Col3
jan  20.0  30.0  40.0
Feb  20.0  30.0  40.0
Mar  10.0  20.0  80.0
Apr   NaN   NaN   NaN
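
For reference, here is a self-contained sketch of the index-based variant above, run on the sample data from the question. The helper make_frame and the way the frames are built are assumptions made for illustration only; in practice the frames would come from the Excel worksheets. It also shows what changes when min_count=1 is left out.

```python
import numpy as np
import pandas as pd

months = ["jan", "Feb", "Mar", "Apr"]
cols = ["Col1", "Col2", "Col3"]


def make_frame(month, values):
    """Hypothetical helper: an all-NaN frame with a single filled row."""
    df = pd.DataFrame(np.nan, index=months, columns=cols)
    df.loc[month] = values
    return df


df1 = make_frame("jan", [20, 30, 40])
df2 = make_frame("Feb", [20, 30, 40])
df3 = make_frame("Mar", [10, 20, 80])

# Stack the frames vertically: 12 rows, each month label appears three times.
stacked = pd.concat([df1, df2, df3])

# Collapse the repeated labels back into one row per month.
# min_count=1 requires at least one non-NaN value per group,
# so the all-NaN month (Apr) stays NaN.
merged = stacked.groupby(level=0, sort=False).sum(min_count=1)

# Without min_count, an all-NaN group sums to 0 instead of NaN.
merged_default = stacked.groupby(level=0, sort=False).sum()

print(merged)
```

The result has the same four rows and three columns as each input, which is what the question asks for.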

CodePudding user response:

Given the examples in the question, you can also perform a simple sum. The addition aligns automatically on the index and columns. Note that a NaN in any of the inputs propagates to the result.

example input (with added NaN):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Col1': [20, 0, 0], 'Col2': [30, 0, 0], 'Col3': [40, 0, 0]},
                   index=['Jan', 'Feb', 'Mar'])
df2 = pd.DataFrame({'Col1': [0, 20, 0], 'Col2': [0, 30, 0], 'Col3': [0, 40, np.nan]},
                   index=['Jan', 'Feb', 'Mar'])
df3 = pd.DataFrame({'Col1': [0, 0, 10], 'Col2': [0, 0, 20], 'Col3': [0, 0, 80]},
                   index=['Jan', 'Feb', 'Mar'])

code:

dfs = [df1, df2, df3]
df_out = sum(dfs)

output:

     Col1  Col2  Col3
Jan    20    30  40.0
Feb    20    30  40.0
Mar    10    20   NaN

NB. If you have other columns (for example non-numeric ones), please specify how they should be handled.
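
Because the plain sum propagates NaN, it only works when the "empty" cells hold 0, as in the example input above. With NaN placeholders like those in the question, one possible variant (a sketch, not part of the original answer) is to chain DataFrame.add with fill_value=0 via functools.reduce. A missing value is then treated as 0 whenever the other operand has a value, and stays NaN only when both are missing.

```python
from functools import reduce

import numpy as np
import pandas as pd

idx = ["Jan", "Feb", "Mar"]
df1 = pd.DataFrame({"Col1": [20, 0, 0], "Col2": [30, 0, 0], "Col3": [40, 0, 0]}, index=idx)
df2 = pd.DataFrame({"Col1": [0, 20, 0], "Col2": [0, 30, 0], "Col3": [0, 40, np.nan]}, index=idx)
df3 = pd.DataFrame({"Col1": [0, 0, 10], "Col2": [0, 0, 20], "Col3": [0, 0, 80]}, index=idx)
dfs = [df1, df2, df3]

# Plain sum: the NaN in df2 (Mar, Col3) makes the whole cell NaN.
plain = sum(dfs)

# Pairwise add with fill_value=0: a NaN on one side is replaced by 0
# before adding, so the 80 from df3 survives.
filled = reduce(lambda a, b: a.add(b, fill_value=0), dfs)

print(plain)
print(filled)
```

Which behavior is right depends on whether a NaN should mean "no data from this sheet" or "unknown, so the total is unknown too".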
