Using regular expression to filter out pandas data frames

Time:02-04

I have a data frame that looks like the following:

[Image: input data frame]

I want to filter out all words within a list, e.g. ['King', 'sEAttle', 'California']. Here is my code:

import pandas as pd
import re 

remove_words = ['King', 'sEAttle', 'California']

remove_words_lower = (map(lambda x: x.lower(), remove_words))
pattern = '|'.join(remove_words_lower)

t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})


clean_tweets = []
for tweet in df.tweets:
    tweet = tweet.lower()
    clean_tweet = re.sub(pattern, "", tweet)
    clean_tweets.append(clean_tweet)

df['clean_tweets'] = clean_tweets
df

Here is the result:

[Image: resulting data frame]
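
The leftover fragments come from re.sub deleting only the matched substring rather than the whole word that contains it. A minimal sketch using only the standard library (no pandas), with the same pattern and sample tweets as the code above; the `results` list is introduced here only for illustration:

```python
import re

# Same pattern as in the question: a plain alternation of the lowercased words.
pattern = 'king|seattle|california'

tweets = [
    'Hello! @kingcounty Seattle, #California',
    'hello! seattlecity #king',
]

# re.sub removes only the characters that matched, so '@', 'county',
# 'city', ',' and '#' are left behind.
results = [re.sub(pattern, "", tweet.lower()) for tweet in tweets]
for cleaned in results:
    print(repr(cleaned))
# 'hello! @county , #'
# 'hello! city #'
```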

Is there a way I can modify the regex so that the leftover @county, city, and # are removed as well? In other words, I want to remove the whole word if it contains a word from the given list. The pattern has to be as generic as possible (i.e. I can't hard-code @county to have it removed).

Expected output:

[Image: expected output data frame]

CodePudding user response:

I am not a regex expert, but you could extend the match around each remove word to the next whitespace on both sides (so it also works when the remove word appears at the end of a word rather than at the beginning), and also match a leading # or @ if present:

import pandas as pd
import re 

remove_words = ['King', 'sEAttle', 'California']

remove_words_lower = (map(lambda x: r'((#|@)?[^\s]*' + x.lower() + r'[^\s]*)?', remove_words))
pattern = ''.join(remove_words_lower)
t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})

df['clean_tweets'] = df.tweets.map(lambda x: re.sub(pattern, "", x.lower()).strip())
print(df)

Output:

      Id                                   tweets clean_tweets
0  user1  Hello! @kingcounty Seattle, #California       hello!
1  user2                 hello! seattlecity #king       hello!

Or, when the remove word appears at the end of a word:

     Id                                   tweets clean_tweets
0  user1  Hello! @countyking Seattle, #California       hello!
1  user2                 hello! cityseattle #king       hello!
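
As a possible alternative, the same idea can be sketched with a single alternation wrapped in \S* on both sides. This is a sketch under a few assumptions, not the answer's own method: the helper names `build_pattern` and `clean_tweet` are made up for illustration, a "word" is taken to mean any run of non-whitespace characters, and case is ignored via re.IGNORECASE instead of lowercasing the tweet.

```python
import re

def build_pattern(words):
    # Escape each word so that any regex metacharacters in the list
    # are matched literally.
    alternatives = "|".join(re.escape(w) for w in words)
    # \S* on both sides stretches the match to the whole
    # whitespace-delimited token, which also covers a leading '@' or '#'
    # and trailing punctuation such as ','.
    return re.compile(r"\S*(?:" + alternatives + r")\S*", flags=re.IGNORECASE)

def clean_tweet(text, pattern):
    # Remove every token containing a listed word, then collapse the
    # whitespace that is left over.
    return " ".join(pattern.sub("", text).split())

remove_words = ['King', 'sEAttle', 'California']
pattern = build_pattern(remove_words)

t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! cityseattle #king'
print(clean_tweet(t1, pattern))  # Hello!
print(clean_tweet(t2, pattern))  # hello!
```

With the data frame from the question, this could be applied as df['clean_tweets'] = df.tweets.map(lambda t: clean_tweet(t, pattern)). Unlike the code above, it keeps the original casing of the words that remain; call .lower() on the result if lowercase output is wanted.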