I have a string containing words in the form word1_word2, word3_word4, word5_word1 (so a word can appear at the left or at the right). I want a regex that looks for all the occurrences of a specific word, and returns the "super word" containing it. So if I'm looking for word1, I expect my regex to return word1_word2, word5_word1. Since the word can appear on the left or on the right, I wrote this:
re.findall("( {}_)?[\u0061-\u007a\u00e0-\u00e1\u00e8-\u00e9\u00ec\u00ed\u00f2-\u00f3\u00f9\u00fa]*(_{} )?".format("w1", "w1"), string)
The optional blocks sit at the beginning and at the end of the pattern. However, it takes forever to execute, and I think something is wrong: when I removed the optional blocks and wrote two separate regexes, one for the beginning and one for the end, they were much faster (but I don't want to use two regexes). Am I missing something, or is this normal?
CodePudding user response:
This would be the regex solution to your problem:
re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString)
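As a rough sketch of how that one-liner could be wrapped up: the function name find_super_words is made up for illustration, and the re.escape call is an extra assumption for search words that might contain regex metacharacters.

```python
import re

def find_super_words(word, text):
    # Hypothetical helper, not part of the original answer.
    # re.escape guards against regex metacharacters in the search word.
    w = re.escape(word)
    # The non-capturing group (?:...) makes findall return whole matches.
    pattern = rf'\b(?:{w}_\w+?|\w+?_{w})\b'
    return re.findall(pattern, text)

print(find_super_words('word1', 'word1_word2, word3_word4, word5_word1'))
# ['word1_word2', 'word5_word1']
```

The lazy \w+? is forced by the trailing \b to run to the end of the token, so each match covers the full "super word".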
CodePudding user response:
Python's built-in string methods can do this without a regex, once the words are in a list:
a=['word1_word2', 'word3_word4', 'word5_word1']
b = [x for x in a if x.startswith("word1") or x.endswith('word1')]
print(b) # ['word1_word2', 'word5_word1']
Reference link
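Since the question starts from a single string rather than a list, here is a minimal sketch that splits it first. It assumes the words are separated by commas; filter_super_words is a hypothetical name, and the underscore is added to the prefix and suffix checks so that a word like word10 is not mistaken for word1.

```python
def filter_super_words(text, target):
    # Assumption: entries are comma-separated, possibly with surrounding spaces.
    tokens = [t.strip() for t in text.split(',')]
    # Include the underscore so 'word10_x' does not count as a match for 'word1'.
    return [t for t in tokens
            if t.startswith(target + '_') or t.endswith('_' + target)]

s = 'word1_word2, word3_word4, word5_word1'
print(filter_super_words(s, 'word1'))  # ['word1_word2', 'word5_word1']
```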
CodePudding user response:
import re
s = 'word1_word2, word3_word4, word5_word1'
matches = re.finditer(r'(\w+_word1)|(word1_\w+)', s)
result = [m.group() for m in matches]
print(result)  # ['word1_word2', 'word5_word1']
This is one method, but having seen @Carl's answer, I upvoted it: it is faster and cleaner. I will just leave this here as one of many regex options.
CodePudding user response:
This regex will do the job for word1:
regex = r'(?:word\d_)*word1(?:_word\d)*'
re.findall(regex, string)
You can also use this:
re.findall(rf'\b(word{number}_\w+?|\w+?_word{number})\b', string)
CodePudding user response:
Try the following regex.
In the following, replace word1 with the word you're looking for. This assumes that the words consist only of alphanumeric characters.
([a-zA-Z0-9]*_word1)|(word1_[a-zA-Z0-9]*)
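One thing to watch with a two-group alternation like this: re.findall returns a tuple per match (one slot per capturing group) rather than the matched text. A small sketch, using + instead of * as an assumption so that the neighbouring word cannot be empty, and pulling the full match out with group(0):

```python
import re

s = 'word1_word2, word3_word4, word5_word1'
pattern = r'([a-zA-Z0-9]+_word1)|(word1_[a-zA-Z0-9]+)'

# findall yields one tuple per match; the group that did not match is ''.
as_tuples = re.findall(pattern, s)
print(as_tuples)  # [('', 'word1_word2'), ('word5_word1', '')]

# group(0) is the whole match regardless of which alternative matched.
whole = [m.group(0) for m in re.finditer(pattern, s)]
print(whole)  # ['word1_word2', 'word5_word1']
```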
