Regex to split by comma, but ignore commas proceeding words near a colon

Time:01-18

I am trying to split a string on commas using Python, while still allowing users to include commas inside the values of some key:value pairs. Here are two examples of the strings I am working with:

title.search:The relation between visualization size, grouping, and user performance,publication_year:2020

author.id:c33432,title.search:The relation between visualization size, grouping, and user performance,publication_year:2020

What I want this to turn into is:

["title.search:The relation between visualization size, grouping, and user performance", "publication_year:2020"]

["author.id:c33432", "title.search:The relation between visualization size, grouping, and user performance", "publication_year:2020"]

What helps me is that the part before the colon (the key) will always be written in one of three formats, such as:

  1. type
  2. author.id
  3. author.institutions.country_code

So it can be a single word, two words separated by a period, or three words separated by periods.
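The key format described above can be written down as a small pattern. This is only a sketch: the names KEY_PATTERN and is_valid_key are made up for illustration, and it assumes that a "word" means letters, digits, and underscores (what \w matches in Python).

```python
import re

# Hypothetical pattern: one word, optionally followed by up to two ".word" parts.
KEY_PATTERN = re.compile(r"\w+(?:\.\w+){0,2}")

def is_valid_key(key: str) -> bool:
    """Return True if key is one, two, or three dot-separated words."""
    return KEY_PATTERN.fullmatch(key) is not None

for key in ["type", "author.id", "author.institutions.country_code", "a.b.c.d"]:
    print(key, is_valid_key(key))
```

The first three keys are accepted and the four-part key is rejected, since {0,2} allows at most two extra dot-separated parts.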

Any ideas on if this is possible?

CodePudding user response:

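As far as I can see, you want to split on commas that sit directly between two word characters. The regex for that is \w,\w, but when splitting it should be written with lookarounds, (?<=\w),(?=\w), so that the characters on either side of the comma are not consumed. This relies on commas inside a value always being followed by a space, as in your examples.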
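A quick way to see how a "comma between two word characters" pattern behaves with re.split is to compare the plain form with a lookaround form. This is a sketch under the assumption that commas inside a value are always followed by a space, which holds for the sample strings but may not hold for all input.

```python
import re

s = ("author.id:c33432,title.search:The relation between visualization size, "
     "grouping, and user performance,publication_year:2020")

# Plain \w,\w consumes the word character on each side of the comma,
# so those characters are lost from the resulting pieces.
lossy = re.split(r"\w,\w", s)

# Lookarounds only assert the neighboring characters without consuming them.
clean = re.split(r"(?<=\w),(?=\w)", s)

print(lossy)
print(clean)
```

The first split loses a character on each side of every split point (for example the piece "author.id:c3343"), while the lookaround version keeps all three pieces intact.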

CodePudding user response:

Would you please try the following:

#!/usr/bin/python

import re

s = ['title.search:The relation between visualization size, grouping, and user performance,publication_year:2020',
'author.id:c33432,title.search:The relation between visualization size, grouping, and user performance,publication_year:2020']

for line in s:
    # Split on commas that are followed by a key (word characters and/or dots) and a colon.
    m = re.split(r',(?=\s*[\w.]+:)', line)
    print(m)

Output:

['title.search:The relation between visualization size, grouping, and user performance', 'publication_year:2020']
['author.id:c33432', 'title.search:The relation between visualization size, grouping, and user performance', 'publication_year:2020']

The regex ,(?=\s*[\w.]+:) matches a comma followed by

  • zero or more whitespace characters
  • one or more word characters and/or dots
  • a colon character

in order.
The string is then split on every comma that satisfies the condition above. Because the condition is a lookahead, only the comma itself is removed.
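If the pieces are needed as key/value pairs, the same idea can be taken one step further. The following is a minimal sketch, not part of the answer above: parse_filters and SPLIT_PATTERN are hypothetical names, and the lookahead is tightened to the one-to-three-word key format from the question. It would still split incorrectly if a value itself contained a comma followed by something that looks like a key and a colon.

```python
import re

# Split on a comma only when it is followed by optional whitespace,
# a key of one to three dot-separated words, and a colon.
SPLIT_PATTERN = re.compile(r",(?=\s*\w+(?:\.\w+){0,2}:)")

def parse_filters(text: str) -> dict:
    """Split text into key:value items and return them as a dict."""
    pairs = {}
    for item in SPLIT_PATTERN.split(text):
        # partition splits on the first colon only, so colons in the value survive.
        key, _, value = item.strip().partition(":")
        pairs[key] = value
    return pairs

example = ("author.id:c33432,title.search:The relation between visualization size, "
           "grouping, and user performance,publication_year:2020")
print(parse_filters(example))
```

Using partition rather than split(":") keeps any later colons inside the value, which matters for titles that contain a colon.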
