NLTK TweetTokenizer incorrectly separates contractions

I have a personal Python project in which I am trying to tokenize tweets with NLTK's TweetTokenizer. I am running into an issue where contractions are incorrectly split apart.

Example: "can't" -> ["can", "'", "t"]

I am struggling to find any documentation on this error. I have pasted relevant code below.

An important note: TweetTokenizer works correctly with strings that I hardcode into my program, but not with strings that originate from Twitter.

from nltk import pos_tag
from nltk.tokenize import TweetTokenizer

def tweetsTagger(tweets):  # Tokenizes and POS-tags the tweets
    tt = TweetTokenizer()  # create the tokenizer once and reuse it
    tweetsTagged = []
    for tweet in tweets:  # tweet is a status object from Twitter's Tweepy API
        # Use full_text when the status object has it; otherwise fall back to text
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        tweetTokenized = tt.tokenize(text)
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged

I think the error may be caused by TweetTokenizer not recognizing certain Unicode apostrophes, but I may be wrong about that.
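
One way to check the Unicode hypothesis is to list the non-ASCII characters in a tweet's text along with their code points. The snippet below is a minimal sketch using only the standard library; the helper name describe_non_ascii and the sample string are made up for illustration and are not part of NLTK or Tweepy.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

# Hypothetical tweet text containing a typographic apostrophe
print(describe_non_ascii("I can’t believe it"))
# [('’', 'U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```

If the apostrophe in a failing tweet shows up in this list, it is not the plain ASCII apostrophe (U+0027) that appears in hardcoded strings typed on a keyboard.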

CodePudding user response：

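The NLTK TweetTokenizer does not treat typographic (curly) quotes such as ’ (U+2019) as apostrophes, so it splits contractions that contain them, as the example below shows. I would advise pre-processing your data to normalize these quotes to regular straight ones.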

For reference:

>>> from nltk.tokenize import TweetTokenizer
>>> TweetTokenizer().tokenize("can't") 
["can't"]
>>> TweetTokenizer().tokenize("can’t") 
['can', '’', 't']

The question "Python: Replace typographical quotes, dashes, etc. with their ascii counterparts" may help with this.
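
As a rough sketch of that kind of pre-processing, the quotes can be normalized with a translation table before the text reaches the tokenizer. The table below is an illustrative assumption, not an exhaustive list of quote-like characters, and the names QUOTE_TABLE and normalize_quotes are hypothetical.

```python
# Map common typographic quotation marks to their ASCII counterparts.
# This table is an illustrative assumption, not an exhaustive list.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the usual curly apostrophe)
    "\u02BC": "'",  # modifier letter apostrophe
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
})

def normalize_quotes(text):
    """Return text with typographic quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_TABLE)

print(normalize_quotes("I can’t say “no”"))  # I can't say "no"
```

The normalized string can then be passed to the tokenizer, for example TweetTokenizer().tokenize(normalize_quotes(text)).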

CodePudding user response：

Replace the curly apostrophe with a straight one before tokenizing:

from nltk import pos_tag
from nltk.tokenize import TweetTokenizer

def tweetsTagger(tweets):  # Tokenizes and POS-tags the tweets
    tt = TweetTokenizer()  # create the tokenizer once and reuse it
    tweetsTagged = []
    for tweet in tweets:  # tweet is a status object from Twitter's Tweepy API
        # Use full_text when the status object has it; otherwise fall back to text
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        # Replace the curly apostrophe (U+2019) with a straight one
        tweetTokenized = tt.tokenize(text.replace("’", "'"))  # << HERE
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged