hi I've to preprocess a column that has comma separated values and i can't apply .split(',\s*') because there are places where commas and spaces shouldn't be separate so therefore i'm looking for a regex pattern.
column:
0 12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
1 11 AM to 11 PM
2 11:30 AM to 4:30 PM, 6:30 PM to 11 PM
3 12 Noon to 2 AM
4 12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
...
100 11 AM to 11 PM
101 10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
102 12 Noon to 11 PM
103 8am to 12:30AM (Mon-Sun)
104 11:30 AM to 3 PM, 7 PM to 12 Midnight
what i've tried is
import re
pattern = '([\w \:*\s*\w*(w{2})*]*\s*to\s*[\w \:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s*
[a-zA-Z]*\s*\)]*)'
timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()
output:
array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
list(['11 am to 11 pm']),
list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
list(['12 noon to 2 am']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
list(['7 am to 10 pm']), list(['12 noon to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
list(['11:30 am to 1 am']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
list(['11 am to 11 pm']), list(['12 noon to 10:30 pm']),
list(['11:30 am to 1 am']), list(['12 noon to 12 midnight']),
list(['12 noon to 11 pm']),
list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
list(['11:30 am to 3 pm, 7 pm to 11 pm']),
list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
list(['10 am to 5 am']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
list(['10 am to 1 am']), list(['4:30 pm to 5:30 am']),
list(['11 am to 12 midnight']),
list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
list(['11 am to 12 midnight']),
list(['', '6am to 12midnight (mon-sun)', '']),
list(['12 noon to 11 pm']),
list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
list(['12 noon to 4 pm, 7 pm to 11 pm']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12 noon to 10:30 pm']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['10 am to 10 pm']), list(['10 am to 10 pm']),
list(['7 am to 1 am']), list(['12 noon to 11:30 pm']),
list(['', '12noon to 11:30pm (mon-sun)', '']),
list(['12 noon to 11:30 pm']), list(['12 noon to 11 pm']),
list(['6 am to 10:30 pm']),
list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
list(['12 noon to 11 pm']), list(['11 am to 11 pm']),
list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['8 am to 11:30 pm']),
list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
list(['11:30 am to 1 am']), list(['9 am to 10 pm']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '5pm to 12midnight (mon-sun)', '']),
list(['11 am to 11:30 pm']),
list(['', '11:30am to 11pm (mon-sun)', '']),
list(['12 noon to 10:30 pm']), list(['1 pm to 11 pm']),
list(['11:30 am to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
list(['4 am to 12 midnight']), list(['9 am to 1 am']),
list(['10:30 am to 11 pm']), list(['7 am to 11 pm']),
list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['', '11am to 11pm (mon-sun)', '']),
list(['6 am to 11:30 pm']), list(['11:30 am to 5 am']),
list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['', '6pm to 2am (mon-sun)', '']),......)
but what i want is something like this:
[['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'] .....] something like this
i think i've to design a better regex pattern in order to separated these values. so can anyone design a better regex pattern? Thanks in advance:).
CodePudding user response:
Here's my attempt:
import re, pandas
data = pandas.read_excel('C:\\Users\\Administrator\\Desktop\\test.xls')
pattern = '(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
re.findall(pattern, data["myData"].str.cat(sep=", "))
With the call to re.findall() my output was:
['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM', '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)', '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)', '11:30 AM to 3 PM', '7 PM to 12 Midnight']
