Home > Back-end >  Extracting dates from text Pandas
Extracting dates from text Pandas

Time:02-04

I have a column of texts from a column called 'internal notes' from a CRM and i am trying to using pandas to extract the dates into an new column as an array. Everything is loaded, I'd love to give some coding examples I've tried but everything I've attempted is nowhere near.

An couple of examples are:
1.

14/05/21 - Called in and saw owner and left wholesale list with her. 
Owner was in NZ though returning this weekend.
Follow up 21/05/21


2.

26/02/07 Hold feb 07 till contacted (around mid March) - was missed in the send out as was originally entered as 
"Oth-White" had sufficient funds at beginning of Feb but has used the money OK
28/03/2007 emailed for update - purchase etc
13/4/07 - did not cancel due to mix up above - contact for Aug 07 - if no response cancel then JP
8/8/08 - NO AUG 08 BUT KEEN TO STAY A MEMBER
01/02/09 - REQUESTED NO FEB 09
01/02/10 REQUESTED NO FEB 10
01/08/09 REQUESTED NO AUG 09 - OK
18/12/09 - Changed to Email
01/08/2010 NO AUG 10 REQUESTED      
Cancelled as per members request 15/02/2011

The output i need would be

  1. {'14/05/21', '21/05/21'}
  2. {'26/02/07', '28/03/07', '08/08/08', '01/02/09', '01/02/10', '01/08/09', '18/12/09', '01/08/10'}

Any questions or any other info needed please let me know.

CodePudding user response:

You could use regex to find the day, month, year.

In this case one expects to have a combination of day/month/year in every string. Otherwise you have to implement a try and except statement.

Here I will show one example. You can easily write this to a function and apply it to each row in your DataFrame, but I do not have your DataFrame and I think that it is sufficient to show the method how it works, so you can convert it to a function yourself.

import re
import datetime

s = "26/02/07 Hold feb 07 till contacted (around mid March) - was missed in the send out as was originally entered as"
dates = re.search(r'(?P<day>\d )/(?P<month>\d )/(?P<year>\d )', s).groupdict()
# This will create a dictionary:
# {'day': '26', 'month': '02', 'year': '07'}
#Then you can for example check the year, because it can be "07" or "2007":
if len(dates['year']) < 4:
    dates['year'] = '20'  dates['year']
dt = datetime.datetime(year=int(dates['year']), month=int(dates['month']), day=int(dates['day']))

# OR:
dt = datetime.datetime.strptime('-'.join(dates.values()), '%d-%m-%Y')

What happens here is that regex will return a named dictionary. After that you can check you year format and use that as input for a datetime object.

CodePudding user response:

The input you have is still ambiguous so let's assume this one:

df = pd.DataFrame({'internal notes': ['14/05/21 - Called in and saw owner and left wholesale list with her. \nOwner was in NZ though returning this weekend.\nFollow up 21/05/21\n',
                                      '26/02/07 Hold feb 07 till contacted (around mid March) - was missed in the send out as was originally entered as \n"Oth-White" had sufficient funds at beginning of Feb but has used the money OK\n28/03/2007 emailed for update - purchase etc\n13/4/07 - did not cancel due to mix up above - contact for Aug 07 - if no response cancel then JP\n8/8/08 - NO AUG 08 BUT KEEN TO STAY A MEMBER\n01/02/09 - REQUESTED NO FEB 09\n01/02/10 REQUESTED NO FEB 10\n01/08/09 REQUESTED NO AUG 09 - OK\n18/12/09 - Changed to Email\n01/08/2010 NO AUG 10 REQUESTED      \nCancelled as per members request 15/02/2011\n',
                                     ]})

#                                       internal notes
# 0  14/05/21 - Called in and saw owner and left wh...
# 1  26/02/07 Hold feb 07 till contacted (around mi...

You can use str.extractall to extract the dates using the \d{2}/\d{2}/\d{2} regex, then combined all the matches with groupby apply and aggregate to python set to have unique elements.

(df['internal notes']
 .str.extractall('(\d{2}/\d{2}/\d{2})')[0]
 .rename('New_col')
 .groupby(level=0).apply(set)
)

output:

0                                 {14/05/21, 21/05/21}
1    {18/12/09, 26/02/07, 15/02/20, 01/08/09, 01/08...
Name: New_col, dtype: object

regex (this is a simple one and would also match incorrect dates such as 99/99/99):

\d{2}   # two digits
/       # a literal /
\d{2}   # two digits
/       # a literal /
\d{2}   # two digits
  •  Tags:  
  • Related