Home > database >  pandas merge ignore duplicate merged rows
pandas merge ignore duplicate merged rows

Time:01-12

I am trying to merge below two data frames but I am not getting the expected result.

import pandas as pd
previous_dict = [{"category1":"Home", "category2":"Power","usage":"15","amount":"65"},
                 {"category1":"Home", "category2":"Power","usage":"2","amount":"15"},
                 {"category1":"Home", "category2":"Vehicle","usage":"6","amount":"5"}
                ]
current_dict = [{"category1":"Home", "category2":"Power","usage":"16","amount":"79"},
                 {"category1":"Home", "category2":"Power","usage":"0.5","amount":"2"},
                 {"category1":"Home", "category2":"Vehicle","usage":"3","amount":"4"}
                ]

df_previous = pd.DataFrame.from_dict(previous_dict)
print(df_previous)

df_current = pd.DataFrame.from_dict(current_dict)
print(df_current)

df_merge = pd.merge(df_previous, df_current, on=['category1','category2'], how='outer',indicator=True, suffixes=('', '_y'))
print(df_merge)

A previous year data frame

  category1 category2 usage amount
0      Home     Power    15     65
1      Home     Power     2     15
2      Home   Vehicle     6      5

A current year data frame

  category1 category2 usage amount
0      Home     Power    16     79
1      Home     Power   0.5      2
2      Home   Vehicle     3      4

Current result:

  category1 category2 usage amount usage_y amount_y _merge
0      Home     Power    15     65      16       79   both
1      Home     Power    15     65     0.5        2   both
2      Home     Power     2     15      16       79   both
3      Home     Power     2     15     0.5        2   both
4      Home   Vehicle     6      5       3        4   both

But my expected result is,

  category1 category2 usage amount usage_y amount_y _merge
0      Home     Power    15     65      16       79   both
3      Home     Power     2     15     0.5        2   both
4      Home   Vehicle     6      5       3        4   both

When category 1 and category2 have the same values a couple of times in both tables, I just want to match it with the correct order. How can I get the values as my expectation?

CodePudding user response:

I think this is occurring due to the duplication in the columns you are joining on. One way to fix this is to also use the index as follows:

df_merge = pd.merge(df_previous.reset_index(), df_current.reset_index(), on=['category1','category2', 'index'], how='outer',indicator=True, suffixes=('', '_y'))

   index category1 category2 usage amount usage_y amount_y _merge
0      0      Home     Power    15     65      16       79   both
1      1      Home     Power     2     15     0.5        2   both
2      2      Home   Vehicle     6      5       3        4   both

CodePudding user response:

It looks like what you're trying to do is add the columns of one dataframe to the other rather than what one typically considers a "merge". With that in mind, consider the following.

df_new = df_previous.copy()
df_new = df_new.rename(columns = {"usage":"usage_prev","amount":"amount_prev"})
df_new[["usage_current","amount_current"]] = df_current[["usage","amount"]]

print(df_new)

Resulting output:

  category1 category2 usage_prev amount_prev usage_current amount_current
0      Home     Power         15          65            16             79
1      Home     Power          2          15           0.5              2
2      Home   Vehicle          6           5             3              4
  •  Tags:  
  • Related