I need to extract the date and month from various strings such as '14th oct', '14oct', '14.10', '14 10' and '14/10'. For these cases my code below works fine.
import re
query = '14.oct'
print(re.search(r'(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})', query, re.I).groupdict())
Result:
{'date': '14', 'month': 'oct'}
But for this case (1410), it still captures a date and a month. I don't want that, since the entire string is a number in another format and should not be treated as a date and month. The result should be None.
How can I change the search pattern for this (using groupdict() only)?
CodePudding user response:
How to change the search pattern for this?
You might try using a negative lookbehind assertion for a literal ( combined with a negative lookahead assertion for a literal ), as follows:
import re
query = '14.oct'
noquery = '(1410)'
print(re.search(r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})(?!\))', query, re.I).groupdict())
print(re.search(r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})(?!\))', noquery, re.I))
Output
{'date': '14', 'month': 'oct'}
None
Beware that this rejects all bracketed forms, i.e. not only (1410) but also (14 10), (14/10) and so on.
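To see that caveat in practice, here is a small sketch that runs the same pattern over a few bracketed inputs. The helper name find_date_month is only for illustration and is not part of the question. Note also that, as far as I can tell, a bare 1410 without brackets still matches with this pattern, so it only helps if the brackets are really part of the input.

```python
import re

# Same pattern as above: no "(" directly before the date,
# no ")" directly after the month.
PATTERN = re.compile(
    r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)'
    r'(?P<month>\d{1,2}|[a-z]{3,9})(?!\))',
    re.I,
)

def find_date_month(text):
    """Hypothetical helper: groupdict of the first match, or None."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None

# Bracketed forms are all rejected; the unbracketed 1410 is not.
for s in ['14.oct', '(1410)', '(14 10)', '(14/10)', '1410']:
    print(s, '->', find_date_month(s))
```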
CodePudding user response:
Not sure whether you want to exclude 1410 as a bare 4-digit number or (1410) with the parentheses, but to exclude both you can assert that the date does not start a run of exactly 4 consecutive digits:
(?P<date>\b(?!\d{4}\b)\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})
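A quick way to check this pattern against the strings from the question is sketched below. The helper name parse is made up for the example. One thing to be aware of: the lookahead only rejects a run of exactly four digits, so a longer run such as 14102 would still produce a match.

```python
import re

# (?!\d{4}\b) fails when the date would start a run of exactly 4 digits.
PATTERN = re.compile(
    r'(?P<date>\b(?!\d{4}\b)\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*'
    r'(?P<month>\d{1,2}|[a-z]{3,9})',
    re.I,
)

def parse(text):
    """Hypothetical helper: groupdict of the first match, or None."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None

for s in ['14th oct', '14oct', '14.10', '14 10', '14/10', '1410', '(1410)']:
    print(f'{s!r} -> {parse(s)}')
```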
To not match any date between parentheses:
\([^()]*\)|(?P<date>\b\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})
\([^()]*\)   Match from opening till closing parenthesis
|   Or
(?P<date>\b\d{1,2})   Match 1-2 digits
(?:st|[nr]d|th)?   Optionally match st, nd, rd or th
[\s./_\\,-]*   Optionally repeat matching any of the listed characters
(?P<month>\d{1,2}|[a-z]{3,9})   Match 1-2 digits or 3-9 chars a-z
For example
import re
pattern = r"\([^()]*\)|(?P<date>\b\d{1,2})(?:st|[nr]d|th)?(?:[\s./_\\,-]*)(?P<month>\d{1,2}|[a-z]{3,9})"
strings = ["14th oct", "14oct", "14.10", "14 10", "14/10", "1410", "(1410)"]
for s in strings:
    m = re.search(pattern, s, re.I)
    # The date group is only set when the match is not a parenthesized part
    if m and m.group("date"):
        print(m.groupdict())
    else:
        print(f"{s} --> Not valid")
Output
{'date': '14', 'month': 'oct'}
{'date': '14', 'month': 'oct'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
(1410) --> Not valid
