Home > Back-end >  finding regex pattern to separated unformatted comma separated values in the column of a dataframe i
finding regex pattern to separated unformatted comma separated values in the column of a dataframe i

Time:01-29

hi I've to preprocess a column that has comma separated values and i can't apply .split(',\s*') because there are places where commas and spaces shouldn't be separate so therefore i'm looking for a regex pattern.

column:

    0          12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
    1                                         11 AM to 11 PM
    2                  11:30 AM to 4:30 PM, 6:30 PM to 11 PM
    3                                        12 Noon to 2 AM
    4      12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
                         ...                        
    100                                       11 AM to 11 PM
    101    10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
    102                                     12 Noon to 11 PM
    103                             8am to 12:30AM (Mon-Sun)
    104                11:30 AM to 3 PM, 7 PM to 12 Midnight

what i've tried is

    import re
    pattern = '([\w \:*\s*\w*(w{2})*]*\s*to\s*[\w \:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s* 
    [a-zA-Z]*\s*\)]*)'
    timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()

output:

   array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
   list(['11 am to 11 pm']),
   list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
   list(['12 noon to 2 am']),
   list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
   list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
   list(['7 am to 10 pm']), list(['12 noon to 12 midnight']),
   list(['12 noon to 12 midnight']),
   list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
   list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
   list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
   list(['11:30 am to 1 am']),
   list(['', '12noon to 12midnight (mon-sun)', '']),
   list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
   list(['11 am to 11 pm']), list(['12 noon to 10:30 pm']),
   list(['11:30 am to 1 am']), list(['12 noon to 12 midnight']),
   list(['12 noon to 11 pm']),
   list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
   list(['11:30 am to 3 pm, 7 pm to 11 pm']),
   list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
   list(['10 am to 5 am']),
   list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
   list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
   list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
   list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
   list(['10 am to 1 am']), list(['4:30 pm to 5:30 am']),
   list(['11 am to 12 midnight']),
   list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
   list(['11 am to 12 midnight']),
   list(['', '6am to 12midnight (mon-sun)', '']),
   list(['12 noon to 11 pm']),
   list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
   list(['12 noon to 4 pm, 7 pm to 11 pm']),
   list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
   list(['12 noon to 10:30 pm']),
   list(['', '12noon to 11pm (mon-sun)', '']),
   list(['10 am to 10 pm']), list(['10 am to 10 pm']),
   list(['7 am to 1 am']), list(['12 noon to 11:30 pm']),
   list(['', '12noon to 11:30pm (mon-sun)', '']),
   list(['12 noon to 11:30 pm']), list(['12 noon to 11 pm']),
   list(['6 am to 10:30 pm']),
   list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
   list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
   list(['12 noon to 11 pm']), list(['11 am to 11 pm']),
   list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
   list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
   list(['8 am to 11:30 pm']),
   list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
   list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
   list(['11:30 am to 1 am']), list(['9 am to 10 pm']),
   list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
   list(['', '5pm to 12midnight (mon-sun)', '']),
   list(['11 am to 11:30 pm']),
   list(['', '11:30am to 11pm (mon-sun)', '']),
   list(['12 noon to 10:30 pm']), list(['1 pm to 11 pm']),
   list(['11:30 am to 12 midnight']),
   list(['12 noon to 12 midnight']),
   list(['', '12noon to 12midnight (mon-sun)', '']),
   list(['', '12noon to 11pm (mon-sun)', '']),
   list(['12 noon to 3 pm, 7 pm to 11 pm']),
   list(['12 noon to 3 pm, 7 pm to 11 pm']),
   list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
   list(['4 am to 12 midnight']), list(['9 am to 1 am']),
   list(['10:30 am to 11 pm']), list(['7 am to 11 pm']),
   list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
   list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
   list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
   list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
   list(['', '11am to 11pm (mon-sun)', '']),
   list(['6 am to 11:30 pm']), list(['11:30 am to 5 am']),
   list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
   list(['', '6pm to 2am (mon-sun)', '']),......)

but what i want is something like this:

    [['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'] .....] something like this

i think i've to design a better regex pattern in order to separated these values. so can anyone design a better regex pattern? Thanks in advance:).

CodePudding user response:

Here's my attempt:

import re, pandas
data = pandas.read_excel('C:\\Users\\Administrator\\Desktop\\test.xls')
pattern = '(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
re.findall(pattern, data["myData"].str.cat(sep=", "))

With the call to re.findall() my output was:

['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM', '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)', '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)', '11:30 AM to 3 PM', '7 PM to 12 Midnight']
  •  Tags:  
  • Related