I am new to regex and I am trying to write one (Python flavour) that splits at every punctuation mark or whitespace character, except for a single hyphen (e.g. 9-5 and Mon-Fri would not be split). However, the text that I want to process sometimes contains a sequence of hyphens like -------------, used for separating paragraphs or thematically distinct sections of the document. Therefore, I want a solution that splits on one or more occurrences of any punctuation mark except the hyphen, and that also splits on a run of 2 or more hyphens.
I have tried with the following code:
re.split(r"[-{2,}\.,:\s]", mystring)
but the -{2,} part gets interpreted literally. I have also tried to incorporate it into a group, but again, the parentheses are interpreted literally.
I am aware that I could write a first regex to replace multiple hyphens with the null character, and a second regex that looks at all other whitespace and punctuation marks; however, I am wondering if there is a way to do it in a single step.
CodePudding user response:
Most things inside a character class [...] are literals, EXCEPT a hyphen in certain contexts and the backslash (and / in some regex flavors...). So [-{2,}\.,:\s] matches each of the characters -, {, 2, the comma, }, ., and :, each taken literally, plus any whitespace character via \s. A few other characters keep a special meaning inside a class, such as a leading ^ for negation, but most regex metacharacters, including the {2,} quantifier and grouping parentheses, no longer work there.
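To see this concretely, here is a small sketch using the pattern from the question. The sample strings are made up for illustration; they are not from the original post.

```python
import re

# The pattern from the question: everything between [ and ] is one character class.
pattern = r"[-{2,}\.,:\s]"

# Inside the class, -, {, 2, the comma and } are all literal members,
# so each of them matches on its own.
literal_hits = re.findall(pattern, "a-{2,}b")
print(literal_hits)  # ['-', '{', '2', ',', '}']

# As a result a single hyphen, and even the digit 2, acts as a separator.
parts = re.split(pattern, "9-5 x2y")
print(parts)  # ['9', '5', 'x', 'y']
```

This is why 9-5 gets split even though the intent was to require two or more hyphens.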
I think you might be looking for alternation:
[,.\/]|-{2,}
(Add whatever punctuation you want to split on inside the square brackets.)
(In Python, which has no regex delimiters, you can use / inside a character class without escaping it: [,./]|-{2,})
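Putting the alternation to work with re.split might look like the sketch below. The exact set of punctuation in the class, the name split_tokens, and the sample text are assumptions for illustration; adjust the class to whatever you need to split on. Wrapping the alternation in (?:...)+ is an addition here, so that a run of mixed separators (for example a full stop, a space and a line of hyphens) counts as a single split point.

```python
import re

# Assumed separator set: common punctuation and whitespace, OR two or more hyphens.
# The hyphen is deliberately left out of the character class, so a single
# hyphen between word characters never matches.
SEPARATORS = re.compile(r"(?:[.,:;!?\s]|-{2,})+")

def split_tokens(text):
    """Split text on separator runs, dropping empty strings at the edges."""
    return [token for token in SEPARATORS.split(text) if token]

print(split_tokens("Open Mon-Fri, 9-5. ----- Closed: Sundays"))
# ['Open', 'Mon-Fri', '9-5', 'Closed', 'Sundays']
```

The list comprehension is only there because re.split returns empty strings when the text starts or ends with a separator.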
