I have a dataframe which looks like this (see table). For simplicity sake I've "aapl" is the only ticker shown. However, the real dataframe has more tickers.
| ticker | year | return |
|---|---|---|
| aapl | 1999 | 1 |
| aapl | 2000 | 3 |
| aapl | 2000 | 2 |
What I'd like to do is first group the dataframe by ticker, then by year. Next, I'd like to remove any duplicate years. In the end the dataframe should look like this:
| ticker | year | return |
|---|---|---|
| aapl | 1999 | 1 |
| aapl | 2000 | 3 |
I have a working solution, but it's not very "Pandas-esque", and involves for loops. I'm semi-certain that if I come back to the solution in three months, it'll be completely foreign to me.
Right now, I've been working on the following, with little luck:
df = df.groupby('ticker').groupby('year').drop_duplicates(subset=['year'])
This however, produces the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
Any help here would be greatly appreciated, thanks.
CodePudding user response:
@QuangHoang provided the simplest version in the comments:
df.drop_duplicates(['ticker', 'year'])
Alternatively, you can use .groupby twice, inside two .applys:
df.groupby("ticker", group_keys=False).apply(lambda x:
x.groupby("year", group_keys=False).apply(lambda x: x.drop_duplicates(['year']))
)
Alternatively, you can use the .duplicated function:
df.groupby('ticker', group_keys=False).apply(lambda x:
x[~x['year'].duplicated(keep='first')])
)
CodePudding user response:
You can try to sort the values first and then groupby.tail
df.sort_values('return').groupby(['ticker','year']).tail(1)
ticker year return
0 aapl 1999 1
1 aapl 2000 3
CodePudding user response:
I'm almost sure you want to do this:
df.drop_duplicates(subset=["ticker","year"])
