How to create a specific dummy variable using a regular expression?

I have a pandas dataframe:

col1
johns id is 81245678316
eric bought 82241624316 yesterday
mine is87721624316
frank is a genius
i accepted new 82891224316again

I want to create a new column of dummy variables (0 or 1) based on col1. If a row contains a run of 11 consecutive digits starting with 8, the value must be 1; otherwise it must be 0.

So I wrote this code:

df["is_number"] = df.col1.str.contains(r"\b8\d{10}").map({True: 1, False: 0})

However, the output is:

col1                                         is_number
johns id is 81245678316                        1
eric bought 82241624316 yesterday              1
mine is87721624316                             0
frank is a genius                              0
i accepted new 82891224316again                0

As you can see, the third and fifth rows have 0 in "is_number", but I want them to have 1 even though the space between the words and the numbers is missing in some places. How can I do that? I want:

col1                                         is_number
johns id is 81245678316                        1
eric bought 82241624316 yesterday              1
mine is87721624316                             1
frank is a genius                              0
i accepted new 82891224316again                1

CodePudding user response:

You can use digit boundaries (lookarounds) instead of word boundaries, because the numbers in your input can be "glued" to letters. Letters and digits are both word characters, so there is no word boundary between a letter and the 8:

df["is_number"] = df['col1'].str.contains(r"(?<!\d)8\d{10}(?!\d)").map({True: 1, False: 0})

Output:

>>> df
                                col1  is_number
0            johns id is 81245678316          1
1  eric bought 82241624316 yesterday          1
2                 mine is87721624316          1
3                  frank is a genius          0
4    i accepted new 82891224316again          1
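
To make the difference between the two kinds of boundary concrete, here is a minimal, self-contained sketch that rebuilds the question's dataframe and compares both patterns. The first pattern puts `\b` on both sides of the number; that is an assumption made here because it reproduces the zeros in the third and fifth rows (the snippet in the question shows only the leading `\b`). The names `with_word_boundary` and `with_digit_boundary` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"col1": [
    "johns id is 81245678316",
    "eric bought 82241624316 yesterday",
    "mine is87721624316",
    "frank is a genius",
    "i accepted new 82891224316again",
]})

# \b needs a transition between a word character and a non-word character.
# "s" and "8" (row 3) and "6" and "a" (row 5) are all word characters,
# so no boundary exists there and the match fails.
with_word_boundary = df["col1"].str.contains(r"\b8\d{10}\b")

# The lookarounds only require that the neighbouring characters are not digits.
with_digit_boundary = df["col1"].str.contains(r"(?<!\d)8\d{10}(?!\d)")

df["is_number"] = with_digit_boundary.astype(int)

print(with_word_boundary.tolist())  # [True, True, False, False, False]
print(df["is_number"].tolist())     # [1, 1, 1, 0, 1]
```

Using `.astype(int)` on the boolean result gives the same 0/1 column as `.map({True: 1, False: 0})`.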


CodePudding user response:

The solution can be as simple as yours, except that the '\b' must be removed: it requires a word boundary at that position, and there is none between a letter and a digit:

df.col1.str.contains(r"8\d{10}").astype(int)

If you want exactly 11 digits, not more, then require that the characters before and after the eleven digits either do not exist or are not digits:

df.col1.str.contains(r"(?:^|\D)8\d{10}(?:$|\D)").astype(int)
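
The difference between the two patterns only shows up when a digit run is longer than 11, which none of the question's rows contain. The sketch below uses hypothetical sample strings (not from the question) to illustrate it. The groups are written as non-capturing `(?:...)` and the patterns as raw strings, since pandas warns when a `str.contains` pattern has capturing groups.

```python
import pandas as pd

# Hypothetical sample values, chosen to include a digit run longer than 11.
s = pd.Series([
    "id 81245678316",       # exactly 11 digits starting with 8
    "id 181245678316123",   # 15 digits; an 8 followed by 10 digits sits inside
    "no digits here",
    "x81245678316y",        # 11 digits glued to letters on both sides
])

# No boundary at all: any 8 followed by ten more digits counts.
loose = s.str.contains(r"8\d{10}").astype(int)

# The neighbours must be missing (start/end of string) or non-digits.
strict = s.str.contains(r"(?:^|\D)8\d{10}(?:$|\D)").astype(int)

print(loose.tolist())   # [1, 1, 0, 1]
print(strict.tolist())  # [1, 0, 0, 1]
```

Which variant is right depends on whether an 11-digit slice of a longer number should count as a hit.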

CodePudding user response:

You just need to remove the \b, which stands for a word boundary, since you do not care whether there is a boundary or not.

df["is_number"] = df.col1.str.contains(r"8\d{10}").map({True: 1, False: 0})