Finding the number of personal pronouns used in text using regex with mixed case-sensitive and case-insensitive matching

I want to count the number of personal pronouns such as I, we, my, ours and us in a sentence using regex. I want it to ignore US as it could be the name of a country.

My code is as follows

import re

pronounRegex = re.compile(r'I|we|my|ours|us',re.I)
pronouns = pronounRegex.findall(' I me  you We and all of us make this team tweek, he is from US')
print(pronouns)

which prints

['I', 'We', 'us', 'i', 'we', 'i', 'US']

It is picking up the "i" in "this" and "is", and the "we" in "tweek". I'm not sure how to ignore those cases.

CodePudding user response：

To prevent re from matching inside words such as this and tweek, you can use word boundaries. Add \b before and after each alternative separated by the | operator.

Like this,

r'\bI\b|\bwe\b|\bmy\b|\bours\b|\bus\b'
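As a quick check of what word boundaries alone change, here is a minimal sketch that compiles the pattern above with the re.I flag from the question (keeping that flag is an assumption; the names bounded, text and matches are only illustrative). The partial matches inside this, tweek and is disappear, but US is still reported because the search remains case-insensitive.

```python
import re

# Pattern from the answer above, compiled with the question's re.I flag
# (keeping re.I here is an assumption made for this illustration).
bounded = re.compile(r'\bI\b|\bwe\b|\bmy\b|\bours\b|\bus\b', re.I)

text = ' I me  you We and all of us make this team tweek, he is from US'
matches = bounded.findall(text)
print(matches)  # ['I', 'We', 'us', 'US']
```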

Now, to avoid matching US, drop the re.I flag and list the accepted spellings of each pronoun explicitly. For example, the pronoun we can be written as We or we, but not wE.

So, rewrite your regex like this:

pronounRegex = re.compile(r'\bI\b|\bwe\b|\bWe\b|\bmy\b|\bMy\b|\bours\b|\bus\b')
print(pronounRegex.findall(' I me  you We and all of us make this team tweek, he is from US'))
# => ['I', 'We', 'us']

Notice that the pronouns ours and us are not listed in capitalized form, on the assumption that they will not appear at the start of a sentence. If your text can contain sentences such as "Ours is ...", add \bOurs\b as well.
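If sentence-initial spellings such as Ours or Us also need to be counted (an assumption that goes beyond the answer above), the accepted forms can be generated instead of typed by hand. The following is only a sketch: base_pronouns, forms, pattern and the sample sentence are hypothetical names and data chosen for illustration.

```python
import re

# Base list taken from the question.
base_pronouns = ['I', 'we', 'my', 'ours', 'us']

# Accept each pronoun as written plus its capitalized spelling,
# e.g. 'we' and 'We'. All-caps 'US' is never generated.
forms = sorted({f for word in base_pronouns for f in (word, word.capitalize())})

# Case-sensitive alternation with a word boundary on each side of every form.
pattern = re.compile('|'.join(r'\b' + re.escape(f) + r'\b' for f in forms))

# Hypothetical sample sentence, not from the question.
sample = 'Ours is a small team, but we and US partners trust us.'
found = pattern.findall(sample)
print(found)       # ['Ours', 'we', 'us']
print(len(found))  # 3
```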

CodePudding user response：

You matched US because your regex has a us alternative and the re.I flag enables case insensitive search.

You get partial matches inside words because your regex is context-unaware: it is not "anchored" in any way. If you need to match whole words, use word boundaries. You do not need to put them around every alternative, though; you can use a grouping construct and put \b only on both ends of the group.

You can use

pronounRegex = re.compile(r'\b(I|we|my|ours|(?-i:us))\b',re.I)

Details:

\b - a word boundary (immediately to the left there must be the start of the string or a non-word char)
( - start of capturing group 1:
    I|we|my|ours - one of the words I, we, my, ours (matched case-insensitively because of re.I)
    | - or
    (?-i:us) - an inline modifier group where matching is CASE SENSITIVE, so it only matches us (not US)
) - end of the group
\b - a word boundary (as the previous char is a word char, the next position must be the end of the string or be followed by a non-word char).

See the Python demo:

import re
pronounRegex = re.compile(r'\b(I|we|my|ours|(?-i:us))\b',re.I)
pronouns = pronounRegex.findall(' I me  you We and all of us make this team tweek, he is from US')
print(pronouns)
# => ['I', 'We', 'us']
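Since the question asks for a count rather than the list itself, here is a minimal sketch of one way to tally the matches. It reuses the pattern above but swaps the capturing group for a non-capturing (?:...) group, which is a change of mine so that findall returns the whole match; the names pronoun_regex, counts and total are illustrative. Scoped inline flags such as (?-i:...) require Python 3.6 or later.

```python
import re
from collections import Counter

# Same alternatives as above; (?:...) is non-capturing, so findall
# returns the full matched word. Only 'us' is matched case-sensitively.
pronoun_regex = re.compile(r'\b(?:I|we|my|ours|(?-i:us))\b', re.I)

text = ' I me  you We and all of us make this team tweek, he is from US'
found = pronoun_regex.findall(text)

# Total number of pronouns, plus a per-pronoun tally normalized to lowercase.
total = len(found)
counts = Counter(word.lower() for word in found)

print(total)         # 3
print(dict(counts))  # {'i': 1, 'we': 1, 'us': 1}
```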

See this regex demo (note PCRE option is selected, as there is a bug with the Python option at regex101).

CodePudding user response：

Add word boundaries, \b, before and after each alternative separated by the | operator:

r'\bI\b|\bwe\b|\bmy\b|\bours\b|\bus\b'