I am trying to remove any word that might contain non-Arabic characters. So, words like ذهb or word should be removed.
I have managed to remove the non-Arabic characters using the regex below:
re.sub(r'([^،-٩] )',' ', 'ذهb')
But how would I remove the whole word? Preceding the regex with \b doesn't seem to work.
CodePudding user response:
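You might want to try string.ascii_letters. Note that this strips the ASCII letters themselves rather than the whole word, so ذهb becomes ذه.
import string
def remove_ascii_letters(text):
    # Drop every ASCII letter, then trim leading/trailing whitespace
    return "".join(char for char in text if char not in string.ascii_letters).strip()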
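If the whole word should go rather than just its Latin letters, one option is to filter tokens instead of characters. The following is only a rough sketch: the name keep_arabic_words is made up for illustration, it assumes words are separated by whitespace, and it reuses the \u0621-\u064A letter range from the regex answer in this thread, so tokens carrying digits, punctuation or diacritics are dropped as well.

```python
import re

# Assumption: an "Arabic word" is one or more letters in U+0621..U+064A
ARABIC_WORD = re.compile(r'[\u0621-\u064A]+')

def keep_arabic_words(text):
    # Split on whitespace and keep only tokens made up entirely of Arabic letters
    return " ".join(tok for tok in text.split() if ARABIC_WORD.fullmatch(tok))
```

For example, keep_arabic_words('ذهب word ذهb كتاب') keeps only the two purely Arabic tokens.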
CodePudding user response:
You can use
re.sub(r'\s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b', '', text)
The pattern \s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b matches:
\s* - zero or more whitespaces
\b - a word boundary
[\u0621-\u064A]* - zero or more Arabic letters
[^\W\d_\u0621-\u064A] - any Unicode letter but Arabic letter
[^\W\d_]* - any zero or more Unicode letters
\b - a word boundary
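To see the pattern in action, here is a small sketch that wraps it in a function. The name remove_non_arabic_words and the final .strip() are additions for illustration, not part of the answer above; the .strip() only tidies the whitespace left behind when the removed word was the first one in the text.

```python
import re

# Same pattern as above: optional whitespace, then a word that contains
# at least one letter outside the Arabic range U+0621..U+064A
MIXED_WORD = re.compile(r'\s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b')

def remove_non_arabic_words(text):
    # Delete every matching word together with the whitespace before it,
    # then trim what may remain at the edges (illustrative addition)
    return MIXED_WORD.sub('', text).strip()

print(remove_non_arabic_words('ذهب word ذهb كتاب'))
```

Purely Arabic words and digit-only tokens are left alone, because the pattern requires at least one non-Arabic letter inside the word.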
