I have a list with repeating patterns. I want to remove the repeated patterns to make the list as short as possible. For example:
[a, b, a, b, a, b] => [a, b]
[a, b, c, a, b, c] => [a, b, c]
[a, b, c, d, a, b, c, d] => [a, b, c, d]
[a, a, a, b, b, b, c, c] => [a, b, c]
What is the best way to cover all the possible cases?
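I tried converting the list to a string and applying a regular expression to it:
import re

list_in = ['a', 'a', 'b', 'c', 'a', 'b', 'c']
# Join with commas and add a trailing comma so every element ends the same way
temp = ",".join(list_in) + ","
last_temp = ""
# Keep collapsing repeats until the string stops changing
while temp != last_temp:
    last_temp = temp
    temp = re.sub(r'(.+?)\1+', r'\1', temp)
    print(temp)
deduped = temp[:-1]  # drop the trailing comma
output = deduped.split(',')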
This works as expected and gives: ['a', 'b', 'c']
However, there is one issue. If the input list is:
['hello', 'sell', 'hello', 'sell', 'hello', 'sell']
The result will be: ['helo', 'sel']
As you can see, the regular expression also replaced 'll' with 'l', which is not what I want.
How can I fix this in my function, or is there a better way? Thanks.
CodePudding user response:
I don't see why you would use a regex in this case. Why not use a set instead:
my_set=set(['hello', 'sell', 'hello', 'sell', 'hello', 'sell'])
print(my_set)
my_set=set(['a', 'a', 'b', 'c', 'a', 'b', 'c'])
print(my_set)
This gives (note that a set does not preserve the original order):
{'hello', 'sell'}
{'b', 'a', 'c'}
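If the order of first appearance matters, a minimal sketch along the same lines is to use dict.fromkeys, which keeps insertion order in Python 3.7+. The helper name unique_in_order is only an illustration, not something from the question. Be aware that this, like set, drops every duplicate, so it is not the same as collapsing adjacent repeating patterns: ['a', 'b', 'a', 'c'] becomes ['a', 'b', 'c'].

```python
def unique_in_order(items):
    """Drop all duplicates while keeping the order of first appearance.

    Hypothetical helper: relies on dicts preserving insertion order (Python 3.7+).
    """
    return list(dict.fromkeys(items))


print(unique_in_order(['hello', 'sell', 'hello', 'sell', 'hello', 'sell']))
print(unique_in_order(['a', 'a', 'b', 'c', 'a', 'b', 'c']))
```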
CodePudding user response:
'sell' becomes 'sel' because the pattern is not limited to whole list elements: it also matches the repeated character 'l' inside a word, and re.sub collapses it.
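A small reproduction makes this visible. This is only a sketch that assumes the pattern from the question was (.+?)\1+ and applies it once to a short joined string:

```python
import re

# Pattern assumed from the question: any shortest unit followed by copies of itself
naive = re.compile(r'(.+?)\1+')

text = "hello,sell,"
# The only adjacent repeats here are the two 'l' characters inside each word
naive_result = naive.sub(r'\1', text)
print(naive_result)  # helo,sel,
```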
You can tweak your regular expression to avoid matching those cases.
For example, you can match repeating patterns only from the beginning of the string:
temp = re.sub(r'^(.+?)\1+', r'\1', temp)
Or make sure the pattern ends with a comma:
temp = re.sub(r'(.+?,)\1+', r'\1', temp)
Edit: given your last example, it's probably best to check patterns between commas:
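import re

list_in = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
# Wrap the joined string in commas so every element is delimited on both sides
temp = "," + ",".join(list_in) + ","
last_temp = ""
# Keep collapsing repeats until the string stops changing
while temp != last_temp:
    last_temp = temp
    temp = re.sub(r'(?<=,)(.+?,)\1+', r'\1', temp)
    print(temp)
deduped = temp[1:-1]  # drop the leading and trailing commas
output = deduped.split(',')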
A look-behind makes sure your pattern is preceded by a comma as well.
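If you would rather avoid the string round trip (which also breaks when an element itself contains a comma), the same idea can be sketched directly on the list. This is an illustration, not a tested library routine: the function name collapse_repeats is made up, and it mimics the regex loop by looking for the shortest block that is immediately repeated, collapsing it, and repeating until nothing changes.

```python
def collapse_repeats(items):
    """Collapse immediately repeated blocks of elements until the list is stable.

    Hypothetical helper that mirrors the regex loop: leftmost position first,
    shortest block first, then swallow every adjacent copy of that block.
    """
    items = list(items)
    changed = True
    while changed:
        changed = False
        result = []
        i = 0
        n = len(items)
        while i < n:
            collapsed = False
            # Try the shortest block first, like the lazy (.+?) in the regex
            for size in range(1, (n - i) // 2 + 1):
                block = items[i:i + size]
                j = i + size
                # Consume every copy of the block that follows directly
                while items[j:j + size] == block:
                    j += size
                if j > i + size:
                    result.extend(block)  # keep a single copy
                    i = j
                    changed = collapsed = True
                    break
            if not collapsed:
                result.append(items[i])
                i += 1
        items = result
    return items


print(collapse_repeats(['a', 'a', 'b', 'c', 'a', 'b', 'c']))
print(collapse_repeats(['hello', 'sell', 'hello', 'sell', 'hello', 'sell']))
```

Because whole elements are compared, 'hello' and 'sell' are never altered internally.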
