I am trying to keep only the latest row for each year in a dataframe (covering 2000 through 2018), and then convert the date from dd-mm-yyyy to just the year number.
So far I have only imported the data:
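import pandas as pd
import pandas_datareader.data as web  # assumed source of the `web` alias used below

df_spx = web.DataReader('^GSPC', 'yahoo', start='2000', end='2018')
df_spx.reset_index(inplace=True)  # move the Date index into a regular column
df_spx['Date'] = pd.to_datetime(df_spx['Date'])
df_spx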
The output looks like this (linked as an image, since I can't post pictures yet):

CodePudding user response:
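First, sort the dataframe by the 'Date' column.
Then replace each date with its year, so the year can serve as the grouping key.
Finally, group by year and take the last row of each group. Because the rows are already in date order, that row is the latest one for its year: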
import pandas as pd
# sample dataframe
df = pd.DataFrame({
    'Date': ['2000-01-03', '2000-01-04', '2018-12-19', '2018-12-18'],
    'High': [1, 2, 3, 4],
})
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by=['Date'], inplace=True)  # sort chronologically
df['Date'] = df['Date'].dt.year  # rows are already in date order, so keep only the year
df = df.groupby('Date').tail(1)  # last row of each group = latest row per year
print(df)
#    Date  High
# 1  2000     2   # latest row for 2000 (2000-01-04)
# 2  2018     3   # latest row for 2018 (2018-12-19)
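
The same steps can be tried on a frame shaped like the one in the question. This is only a sketch: the business-day dates and the 'Close' values are synthetic stand-ins (an assumption, so that no download is needed), and the 'Year' column name is made up for illustration. It groups on the year without overwriting 'Date' first, so the full date stays available until the last step.

```python
import pandas as pd

# Synthetic stand-in for df_spx: one row per business day.
# The 'Close' values are invented placeholders, not real index prices.
dates = pd.bdate_range("2000-01-03", "2002-12-31")
df_spx = pd.DataFrame({"Date": dates, "Close": range(len(dates))})

# Sort chronologically, then keep the final row of each calendar year.
df_spx = df_spx.sort_values("Date")
last_per_year = df_spx.groupby(df_spx["Date"].dt.year).tail(1).copy()

# Only now swap the full date for the year number.
last_per_year["Year"] = last_per_year["Date"].dt.year
last_per_year = last_per_year.drop(columns="Date").reset_index(drop=True)
print(last_per_year)
```

Grouping on `df_spx["Date"].dt.year` instead of a rewritten 'Date' column makes it easy to inspect which exact trading day was kept before the date is dropped.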
