I have a data frame that looks like the following:
I want to remove every word that appears in a list, e.g. ['King', 'sEAttle', 'California']. Here is my code:
import pandas as pd
import re
remove_words = ['King', 'sEAttle', 'California']
remove_words_lower = (map(lambda x: x.lower(), remove_words))
pattern = '|'.join(remove_words_lower)
t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})
clean_tweets = []
for i, tweet in enumerate(df.tweets):
    tweet = tweet.lower()
    clean_tweet = re.sub(pattern, "", tweet)
    clean_tweets.append(clean_tweet)
df['clean_tweets'] = clean_tweets
df
Here is the result:
Is there a way I can modify the regex to also remove the leftover @, county, city, and #? In other words, remove the whole word if it contains a word from the given list. The regex pattern has to be as generic as possible (i.e. I can't hard-code @kingcounty to have it removed).
Expected output:
CodePudding user response:
I am not a regex expert, but you could extend the match around each remove word up to the next whitespace (and back to the previous whitespace, in case the remove word appears at the end of a word rather than the beginning), and also match # and @ if they are present:
import pandas as pd
import re
remove_words = ['King', 'sEAttle', 'California']
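# Wrap each word so the match covers the whole whitespace-delimited token, including an optional leading # or @
remove_words_lower = map(lambda x: r'((#|@)?[^\s]*' + x.lower() + r'[^\s]*)?', remove_words)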
pattern = ''.join(remove_words_lower)
t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})
df['clean_tweets'] = df.tweets.map(lambda x : re.sub(pattern, "", x.lower()).strip())
Output:
      Id                                   tweets  clean_tweets
0  user1  Hello! @kingcounty Seattle, #California        hello!
1  user2                 hello! seattlecity #king        hello!
Or, when the remove word appears at the end of a word:
      Id                                   tweets  clean_tweets
0  user1  Hello! @countyking Seattle, #California        hello!
1  user2                 hello! cityseattle #king        hello!
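A more compact variant of the same idea, offered as a sketch rather than a definitive solution: join the words into one alternation and let \S* on both sides stretch the match to the whole whitespace-delimited token. The helper names build_pattern and strip_words are made up for this example. It assumes that any token containing a listed word as a substring should go (so "viking" would be removed too), and that collapsing the leftover spaces is wanted.

```python
import re

def build_pattern(words):
    # Hypothetical helper: escape each word so characters like '.' or '+' match literally.
    alternation = "|".join(re.escape(w) for w in words)
    # \S* on both sides extends the match to the full token, which also
    # swallows a leading '#' or '@' and any trailing punctuation.
    return re.compile(rf"\S*(?:{alternation})\S*", flags=re.IGNORECASE)

def strip_words(text, pattern):
    # Hypothetical helper: lowercase as in the question, then drop matching tokens.
    cleaned = pattern.sub("", text.lower())
    # Collapse the runs of spaces left behind by the removed tokens.
    return " ".join(cleaned.split())

remove_words = ['King', 'sEAttle', 'California']
pattern = build_pattern(remove_words)

print(strip_words('Hello! @kingcounty Seattle, #California', pattern))  # hello!
print(strip_words('hello! cityseattle #king', pattern))                 # hello!
```

With the DataFrame from the question, something like df['clean_tweets'] = df.tweets.map(lambda t: strip_words(t, pattern)) should apply it row by row.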



