Search for a word in a DataFrame column and ignore regex and substring-CodePudding

I have a list of programming languages, and I want to check if a dataframe column contains said language (and create a new empty column with the language as the column name if it does). However, I'm running into 2 issues:

Some programming languages (like "A ") register as a regex, causing an error
I'm only trying to find matching words, but not substrings.

I've managed to work around my first problem (see minimal example) by parametrizing str.contains to ignore regex, but not the second.

How do I match specific words only (so for example, the programming language "Tea" shouldn't return True if the word is "Teapot"), without causing regex issues?

See the following for a minimal example:

df = pd.DataFrame({'Text': ['A   python', 'Teapot warmeR'],})
languages = ["Python", "R", "A  ", "TEA"]

for lang in languages:
    if lang not in df.columns:
        if df['Text'].str.lower().str.contains(lang.lower(), regex=False).any():
            df[lang] = pd.Series(dtype='int')

This will however create a column for "R" and "Tea" too, which I do not want:

Text            Python R   A   TEA
A   python      NaN    NaN NaN NaN
Teapot warmeR   NaN    NaN NaN NaN

What I'd like to see is an exact word match:

Text           A   Python 
A   python     NaN NaN
Teapot warmeR  NaN NaN

CodePudding user response：

Here is a solution and some thoughts about this:

A working solution if, as in your example, the strings are always separated by whitespaces in Text:

import pandas as pd


df = pd.DataFrame({'Text': ['A   python', 'Teapot warmeR'],})
languages = ["Python", "R", "A  ", "TEA"]

# Extracting column as list and convert to lower case
text_col = df['Text'].tolist()
text_col = [x.lower() for x in text_col]

# To lower case too
languages = [x.lower() for x in languages]

# Finding "whole words"
to_add = [lang for lang in languages for langs_list in text_col if lang in langs_list.split(" ")]

# Adding columns
for lang in to_add:
    df[lang] = pd.Series(dtype='int')

print(df)

Output:

            Text  python  a  
0     A   python     NaN  NaN
1  Teapot warmeR     NaN  NaN

Thoughts:

In fact this is an interesting multi-causal problem.

1st cause: "A " ends with 2 plus signs which are regex special characters that need to be escaped
2nd: You need to find whole words, so we should use regex boudaries \b "as usual" but:
3rd: \b will match "Python", but \b won't work after the plus sign (a non-word character) of "A " and the whitespace after it because \b is a zero width match that will match between a word character (\w) and a non-word character (\W) or between a word character and the start of end of string.
4th: We could replace the ending \b with \B, and the the regex will match "A " because \B is \b negated. But this time, it will not match "Python" anymore and it will match "TEA"...

We could analyse this like that :

Here is the "final" (non-working) code and after that an explanation of the steps followed to write it:

for lang in languages:
    if lang not in df.columns:
        needle = re.escape(lang)
        needle = r'\b{}\B'.format(needle)
        if df['Text'].str.lower().str.contains(needle, case=False, regex=True).any():
            df[lang] = pd.Series(dtype='int')

For clarity, we use case=False and remove .str.lower() and lang.lower()
We set regex=True in order to use regex to match whole words. But as is, the regex will fail becasue "A " needs to be escaped.
We escape the strings with needle = re.escape(lang). But now we get substrings: Pyton R, A and TEA.
So we use word boundary \b: needle = r'\b{}\b'.format(needle). But now we only get Python...
So we use word boundary \B at the end: needle = r'\b{}\B'.format(needle). Now, we get A , but this does not match Python anymore and we also get TEA...

To conclude we can't use a simple regex that will work with all cases.

But, if, as in your example, the strings are always separated by whitespaces in Text, we could split on whitespaces and check if the whole words are in the resulting lists using in operator.

CodePudding user response：

You can use adaptive word boundaries here:

import re
for lang in languages:
    if lang not in df.columns:
        if df['Text'].str.lower().str.contains(fr'(?:(?!\w)|\b(?=\w)){re.escape(lang)}(?:\b(?<=\w)|(?<!\w))', regex=True, case=False).any():
            df[lang] = pd.Series(dtype='int')

Ouput:

>>> df
            Text  Python  A  
0     A   python     NaN  NaN
1  Teapot warmeR     NaN  NaN

You can read more about these word boundary patterns in Word boundary with words starting or ending with special characters gives unexpected results and watch this YT video of mine with more explanations and a Python demo.