I am a beginner to quantitative analysis of stocks with time series analysis and intend to convert the data column in a Pandas series to a datetime format. Earlier, the index of the series was
infy.index
Output:
Index(['03-01-2000', '04-01-2000', '05-01-2000', '06-01-2000', '07-01-2000',
'10-01-2000', '11-01-2000', '12-01-2000', '13-01-2000', '14-01-2000',
...
'16-04-2021', '19-04-2021', '20-04-2021', '22-04-2021', '23-04-2021',
'26-04-2021', '27-04-2021', '28-04-2021', '29-04-2021', '30-04-2021'],
dtype='object', name='Date', length=5306)
Currently, the format is DD-MM-YYYY. So now, I applied the following code to change the format
infy = pd.to_datetime(infy.index)
infy.head()
Output:
DatetimeIndex(['2000-03-01', '2000-04-01', '2000-05-01', '2000-06-01',
'2000-07-01', '2000-10-01', '2000-11-01', '2000-12-01',
'2000-01-13', '2000-01-14',
...
'2021-04-16', '2021-04-19', '2021-04-20', '2021-04-22',
'2021-04-23', '2021-04-26', '2021-04-27', '2021-04-28',
'2021-04-29', '2021-04-30'],
dtype='datetime64[ns]', name='Date', length=5306, freq=None)
So, the problem is that the first few dates are in the format YYYY-DD-MM, but if you look in the end, the format is now YYYY-MM-DD. I thought I might've made some error in the beginning while converting the DataFrame to a Series but upon multiple iterations of rechecking, I couldn't find any problem.
Why is this happening?
CodePudding user response:
The datetime default format in pandas is YYYY-MM-DD. So it always shows datetime objects with that format. If you want to show the dates in another format you can use:
date_as_strings=pd.to_datetime(infy.index).dt.strftime("%y-%d-%m")
Note: strftime changes the type to string.
CodePudding user response:
pd.to_datetime, without any additional arguments, can flexibly parse a single column with multiple formats for a given date. While this can be extremely powerful it is also very problematic.
The main issue here is that the default for the dayfirst argument of pd.to_datetime is False.
This means that for your first few dates, which it can properly parse because the first part is <= 12, pandas will parse this as MM-DD-YYYY (dayfirst is False so the Month comes first). Later when it encounters a date where the first part is >12 it's smart enough to know there are only 12 months in a year so it assumes the format is DD-MM-YYYY and it parses those dates as such.
This is clearly not your desired behavior so you should either specify dayfirst=True or pass the specific format that works for all of your dates.
pd.to_datetime(infy.index, format='%d-%m-%Y')
# or
pd.to_datetime(infy.index, dayfirst=True)
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
'2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
'2000-01-13', '2000-01-14',
...
'2021-04-16', '2021-04-19',
'2021-04-20', '2021-04-22', '2021-04-23', '2021-04-26',
'2021-04-27', '2021-04-28', '2021-04-29', '2021-04-30'],
dtype='datetime64[ns]', freq=None)
