I would like to restrict the number of repeating characters in a string, given that different characters have different limits.
Suppose I have the string
Mary,,, had!!!!! a--- little ? lamb........
and a list of characters that are allowed a higher limit, chars = '.!?'. This means that all other punctuation marks such as , and - (suppose I have a list of those) should occur only once in a row, while characters from chars can occur at most 3 times in a row.
Thus the final string will be formatted like this:
Mary, had!!! a- little ? lamb...
Could anyone give me a hint as to the fastest way to do that, please? I suppose I will have to use groupby from itertools, but I can't quite wrap my head around it. Any tips are appreciated! Thank you in advance!
CodePudding user response:
You can use re.sub together with a lambda function which handles the replacement logic:
import re
n_max = {**dict.fromkeys('-,', 1), **dict.fromkeys('.!?', 3)}
test_string = 'Mary,,, had!!!!! a--- little ? lamb........'
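result = re.sub(
    # a restricted character followed by one or more copies of itself
    r'([{chars}])\1+'.format(chars=''.join(re.escape(c) for c in n_max)),
    # keep only as many characters of the run as that character allows
    lambda m: m.group(0)[:n_max[m.group(1)]],
    test_string,
)
print(result)
# Mary, had!!! a- little ? lamb...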
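To reuse the idea with other limits, the same substitution can be wrapped in a small function. This is a minimal sketch; the name limit_repeats and its limits parameter are illustrative and not part of the answer above.

```python
import re

def limit_repeats(text, limits):
    """Trim runs of each character in `limits` to at most limits[char] repeats."""
    # Character class of all restricted characters; re.escape keeps '-' literal.
    char_class = ''.join(re.escape(c) for c in limits)
    # ([...])\1+ matches a restricted character followed by one or more copies of itself.
    pattern = re.compile(r'([{}])\1+'.format(char_class))
    # Keep only the first limits[char] characters of each matched run.
    return pattern.sub(lambda m: m.group(0)[:limits[m.group(1)]], text)

limits = {**dict.fromkeys('-,', 1), **dict.fromkeys('.!?', 3)}
print(limit_repeats('Mary,,, had!!!!! a--- little ? lamb........', limits))
# Mary, had!!! a- little ? lamb...
```

Runs that are already within their limit (for example two dots) are matched but returned unchanged, because the slice is longer than the run.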
CodePudding user response:
Another solution with re.sub that works without a callback function:
import re
only_once = ',-'
only_thrice = '.!?'
regex = f"([{re.escape(only_once)}])\\1+|([{re.escape(only_thrice)}])\\2{{3,}}"
# example
s = 'Mary,,, had!!!!! a--- little ? lamb........'
result = re.sub(regex, r"\1\2\2\2", s)
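The replacement string works because only one of the two groups takes part in any given match, and since Python 3.5 re.sub substitutes an empty string for a group that did not participate. A small sketch of that behaviour, using the same pattern (the sample strings are chosen only for illustration):

```python
import re

only_once = ',-'
only_thrice = '.!?'
regex = f"([{re.escape(only_once)}])\\1+|([{re.escape(only_thrice)}])\\2{{3,}}"

for sample in (',,,', '!!!!!', '..'):
    m = re.fullmatch(regex, sample)
    # groups() shows which alternative matched; the other group is None.
    print(repr(sample), m.groups() if m else None, '->', re.sub(regex, r"\1\2\2\2", sample))
# ',,,' (',', None) -> ,
# '!!!!!' (None, '!') -> !!!
# '..' None -> ..
```

The second alternative only fires for four or more repeats, so runs of up to three characters are left alone.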
CodePudding user response:
You could indeed use groupby and set up a dictionary with the number of allowed repetitions for each character that has a restriction:
from itertools import groupby,islice
from collections import Counter
maxRep = Counter(",-"*1 + ".!?"*3)
output:
S = "Mary,,, had!!!!! a--- little ? lamb........"
S = "".join(c for g,r in groupby(S) for c in islice(r,0,maxRep.get(g)))
print(S)
# Mary, had!!! a- little ? lamb...
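To see why the one-liner works, it may help to look at the intermediate values. groupby yields one (character, run) pair per run of identical characters, and maxRep.get returns None for characters without a restriction, which islice treats as "no limit". The short sample string below is only for illustration.

```python
from itertools import groupby, islice
from collections import Counter

# Counts per character: ',' and '-' once, '.', '!' and '?' three times.
maxRep = Counter(",-" * 1 + ".!?" * 3)

for char, run in groupby("aa,,!!!!!"):
    limit = maxRep.get(char)                # None means the character is unrestricted
    kept = "".join(islice(run, 0, limit))   # stop after `limit` items, or never if None
    print(repr(char), limit, repr(kept))
# 'a' None 'aa'
# ',' 1 ','
# '!' 3 '!!!'
```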
Note that this is slower than using regular expressions (the re module). If you do use regular expressions, it is simpler and faster to delete the superfluous characters than to replace each repetition with its maximum allowed streak:
import re
pattern = "[{0}]+(?=[{0}]{{{1},{1}}})" # look ahead for x reps
max1 = pattern.format(r",-",1) # [,-]+(?=[,-]{1,1})
max3 = pattern.format(r".!?",3) # [.!?]+(?=[.!?]{3,3})
restrictions = re.compile(max1 + "|" + max3)
Note that you will have to escape any restricted character that has a special meaning inside a regular-expression character class (e.g. a closing square bracket: r"\]").
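One way to avoid hand-escaping is to build the character class with re.escape. The sketch below assumes a hypothetical extra restriction on the caret and the closing square bracket, which are not part of the original question; the names lookahead_pattern and extra are also made up for illustration.

```python
import re

def lookahead_pattern(chars, max_reps):
    # re.escape makes '^' and ']' literal; unescaped, "[^]]" would mean
    # "any character except ]".
    cls = "[" + re.escape(chars) + "]"
    # Match the surplus characters that are still followed by max_reps more.
    return "{0}+(?={0}{{{1}}})".format(cls, max_reps)

extra = re.compile(lookahead_pattern("^]", 1))
print(extra.sub("", "x]]] y^^^^"))
# x] y^
```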
output:
S = "Mary,,, had!!!!! a--- little ? lamb........"
S = restrictions.sub("",S)
print(S)
# Mary, had!!! a- little ? lamb...
This is roughly 3x faster than the groupby solution.
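One difference from the groupby version may matter in practice: the look-ahead accepts any character from the class, so a mixed run such as ?!?! is trimmed as if it were a repetition. If only runs of the same character should be shortened, a backreference can be used instead. The following is a sketch of that variant, not the pattern timed above, and its speed has not been measured here; the names mixed_class and same_char are illustrative.

```python
import re

# Look-ahead on the whole class: mixed punctuation counts as a repeat.
mixed_class = re.compile(r"[,-]+(?=[,-]{1,1})|[.!?]+(?=[.!?]{3,3})")
print(mixed_class.sub("", "Really?!?!"))
# Really!?!

# Capture one restricted character, then require N more copies of that same
# character ahead of it; each such surplus character is deleted.
same_char = re.compile(r"([,\-])(?=\1)|([.!?])(?=\2{3})")
print(same_char.sub("", "Really?!?!"))
# Really?!?!
print(same_char.sub("", "Mary,,, had!!!!! a--- little ? lamb........"))
# Mary, had!!! a- little ? lamb...
```

Which behaviour is wanted depends on whether mixed punctuation should count as repetition; the question as asked only covers runs of a single character.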
