I have two dataframes that contain unique ID's, a review column (positive in one, and negative in the other) and a rating column (again one for positive and one for negative):
df1:
| unique_id | pos_review | pos_rating |
|---|---|---|
| 1 | "Great, would recommend... | 8 |
| 1 | "Really cool, you should go... | 7 |
| 2 | "I had a great time, you.. | 9 |
| 3 | "Good way to spend your night... | 8 |
| 4 | "I might go again for how good it was... | 9 |
df2:
| unique_id | neg_review | neg_rating |
|---|---|---|
| 1 | "Really boring... | 4 |
| 2 | "I'll never try this again... | 2 |
| 2 | "I would not recommend.. | 3 |
| 3 | "Could have been better... | 4 |
| 3 | "No one should ever go... | 1 |
I'm trying to combine the two so that the unique id's can line up, but if one had more reviews than the other, the missing review would result in a NaN value which I'd later replace with "No Review". So ideally I'd end up with:
df3:
| unique_id | pos_review | pos_rating | neg_review | neg_rating |
|---|---|---|---|---|
| 1 | "Great, would recommend... | 8 | "Really boring... | 4 |
| 1 | "Really cool, you should go... | 7 | NaN | NaN |
| 2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
| 2 | NaN | NaN | "I would not recommend.. | 3 |
| 3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
| 3 | NaN | NaN | "No one should ever go... | 1 |
| 4 | "I might go again for how good it was... | 9 | NaN | NaN |
I've tried using df3 = df1.merge(df2, on='unique_id', how='inner'), but that just repeats the first review of my df1 for every review in my df2, like such (looking at unique_id 2 below):
| unique_id | pos_review | pos_rating | neg_review | neg_rating |
|---|---|---|---|---|
| 1 | "Great, would recommend... | 8 | "Really boring... | 4 |
| 1 | "Really cool, you should go... | 7 | NaN | NaN |
| 2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
| 2 | "I had a great time, you.. | 9 | "I would not recommend.. | 3 |
| 3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
| 3 | NaN | NaN | "No one should ever go... | 1 |
| 4 | "I might go again for how good it was... | 9 | NaN | NaN |
Any ideas on how I can get the above df3?
CodePudding user response:
You need a cumulative counter as second argument for your groupby.
df3 = pd.merge(
df1,df2,
left_on=['unique_id',df1.groupby('unique_id').cumcount()],
right_on=['unique_id',df2.groupby('unique_id').cumcount()],
how='outer')
delivers the desired result
CodePudding user response:
Change inner to outer , and create the sub key with cumcount
df1['key'] = df1.groupby('unique_id').cumcount()
df2['key'] = df2.groupby('unique_id').cumcount()
df3 = df1.merge(df2,on = ['unique_id','key'],how='outer').sort_values('unique_id')
Out[134]:
unique_id pos_review pos_rating key neg_review neg_rating
0 1 reatwouldrecommend... 8.0 0 Really 4.0
1 1 Really 7.0 1 NaN NaN
2 2 had 9.0 0 I'll 2.0
5 2 NaN NaN 1 I 3.0
3 3 Good 8.0 0 Could 4.0
6 3 NaN NaN 1 No 1.0
4 4 I 9.0 0 NaN NaN
# you can also drop the key column with df3 = df3.drop(['key'],axis=1)
CodePudding user response:
Updated after @HenryEcker pointed out that append is to be depreciated.
I would use pd.concat instead of DataFrame.merge since 'unique_id' is not actually unique in the sense of a table value.
df3 = pd.concat([df1, df2], ignore_index=True)
Mabye merge is confusing your sense of what the output table should be. I think your example of your ideal df3 needs to include additional rows with NaNs
E.g. for unique_id = 1 you should have three rows:
- two with NaNs in the negative columns
- one with NaNs in the positive columns
I'm not sure why you would assign a negative review to only one one row of unique_id = 1 but not to others. Best to just keep all rows and use NaNs in all where appropriate
Then if you want to aggregate use DataFrame.groupby. E.g. for the mean rating
grouped_mean = df3.groupby('unique_id').mean()
Note that this would give you a new df with a mean for the negative reviews and a mean for the positive reviews since they are in different columns in df3
