I am trying to create a regex that would remove any word that either starts or ends with a hyphen (not both).
word1- -> remove
-word2 -> remove
sub-word -> keep
My attempt is the following:
def begin_end_hyphen_removal(line):
    return re.sub(r"((\s+|^)(-[A-Za-z]+)(\s+|$))|((\s+|^)([A-Za-z]+-)(\s+|$))", "", line)
However, when I try to apply it on the following lines:
here are some word sub-words -word1 word2- sub-word2 word3- -word4
-word5 example
word6-
word7-
another one -word8
-word9
I get the same input as output again.
CodePudding user response:
You can use
r'\b(?<!-)[A-Za-z0-9]+-\B|\B-[A-Za-z0-9]+\b(?!-)'
or
r'\b(?<!-)\w+-\B|\B-\w+\b(?!-)'
See the regex demo. Details:
\b(?<!-)\w+-\B - one or more word chars that are not preceded with a hyphen, and then a hyphen that is either at the end of string or before a non-word char
| - or
\B-\w+\b(?!-) - a hyphen that is either at the start of string or after a non-word char, and then one or more word chars that are not followed with a hyphen.
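To make the description above concrete, here is a minimal sketch that checks the second pattern against single tokens. It assumes the + quantifiers after each \w, and the helper name has_edge_hyphen_word is made up for this illustration.

```python
import re

# Pattern from the answer, assuming "+" quantifiers after each \w.
EDGE_HYPHEN = re.compile(r'\b(?<!-)\w+-\B|\B-\w+\b(?!-)')

def has_edge_hyphen_word(token):
    """Hypothetical helper: True if the whole token matches the pattern."""
    return EDGE_HYPHEN.fullmatch(token) is not None

samples = ["word1-", "-word2", "sub-word", "-some-", "plain"]
results = {s: has_edge_hyphen_word(s) for s in samples}
print(results)
# {'word1-': True, '-word2': True, 'sub-word': False, '-some-': False, 'plain': False}
```

Note that -some- is left alone because the lookarounds reject a hyphen on the opposite side, which matches the "not both" requirement in the question.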
See the Python demo:
import re
rx = re.compile(r' *(?:\b(?<!-)\w+-\B|\B-\w+\b(?!-))')
text = 'here are -some- word sub-words -word1 word2- sub-word2 word3- -word4\n-word5 example\nword6-\nword7-\nanother one -word8\n-word9'
print( rx.sub('', text) )
Output:
here are -some- word sub-words sub-word2
example
another one
CodePudding user response:
import re
pattern = r"(?=\S*['-])([a-zA-Z'-]+)"
test_string = '''here are some word sub-words -word1 word2- sub-word2 word3- -word4
-word5 example
word6-
word7-
another one -word8
-word9'''
result = re.findall(pattern, test_string)
print(result)
CodePudding user response:
You could repeat matching word characters preceded or followed by a hyphen.
If you have words that are separated by a hyphen and that also start or end with a hyphen, and you want to remove those as well (for example sugar-free-):
(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)
In parts, the pattern matches:
(?<!\S) Whitespace boundary to the left
(?: Non capture group
-\w+(?:-\w+)* Match - and word chars, optionally repeated by - and word chars
| Or
\w+(?:-\w+)*- Match word chars, optionally repeated by - and word chars, followed by -
) Close non capture group
(?!\S) Whitespace boundary to the right
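The same breakdown can be written directly into the pattern with re.VERBOSE. This is only a sketch: it assumes the + quantifiers after each \w, and the sample string and the name WS_BOUNDED are invented for illustration.

```python
import re

# Same pattern as above, split over lines; re.VERBOSE ignores the
# whitespace and treats everything after "#" as a comment.
WS_BOUNDED = re.compile(r"""
    (?<!\S)              # whitespace boundary to the left
    (?:
        -\w+(?:-\w+)*    # leading hyphen: -word or -multi-part
      |
        \w+(?:-\w+)*-    # trailing hyphen: word- or multi-part-
    )
    (?!\S)               # whitespace boundary to the right
""", re.VERBOSE)

found = WS_BOUNDED.findall("sugar-free- -low-fat sub-word -both- plain")
print(found)
# ['sugar-free-', '-low-fat']
```

Because both alternatives must be followed by whitespace or the end of the string, a token such as -both- with hyphens on both sides is not matched.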
See a regex demo and a Python demo.
Note that the pattern you tried uses \s, which can also match a newline.
If you don't want to remove the newlines, you can use [^\S\n]* instead of \s*.
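A small sketch of the difference between the two prefixes, assuming the + quantifiers after each \w. The sample text and the function names are made up for this comparison.

```python
import re

CORE = r"(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)"

# \s* also consumes the newline in front of a removed word.
EAT_NEWLINES = re.compile(r"\s*" + CORE)
# [^\S\n]* consumes whitespace except newlines, so line breaks survive.
KEEP_NEWLINES = re.compile(r"[^\S\n]*" + CORE)

def remove_eating_newlines(text):
    return EAT_NEWLINES.sub("", text)

def remove_keeping_newlines(text):
    return KEEP_NEWLINES.sub("", text)

sample = "keep -drop1 sub-word\ndrop2- tail\n-drop3"
print(repr(remove_eating_newlines(sample)))   # 'keep sub-word tail'
print(repr(remove_keeping_newlines(sample)))  # 'keep sub-word\n tail\n'
```

With the second variant the line structure is preserved, at the cost of possibly leaving a leading space or an empty line where a word was removed.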
Example
import re
def begin_end_hyphen_removal(line):
    return re.sub(r"\s*(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)", "", line)
s = ("here are some word sub-words -word1 word2- sub-word2 word3- -word4\n"
"-word5 example\n"
"word6-\n"
"word7-\n"
"another one -word8\n"
"-word9")
print(begin_end_hyphen_removal(s))
Output
here are some word sub-words sub-word2 example
another one
