Python Pandas column regex extract substring to end of line (\n or \r) in multi-line string

Time:01-21

I have a column that's multiline text. I want to match a specific substring and extract it along with everything up until new line. The challenge I'm facing is that there are both "\n" and "\r" and I haven't discovered how to remove the "\r" from my result. Here's the test:

    \r
empty
\n
\r\n
basic text
\r
WHATIWANT: - this is cool ! = . a1^@#%\r\n
more lines\r\r\r
\n
\r
\n

And the result I want is:

WHATIWANT: - this is cool ! = . a1^@#%

I tried using:

(WHATIWANT:\s. (.*?)\s )

But get this (can't get rid of the \n and \r):

WHATIWANT: - this is cool ! = . a1^@#%\n\r

CodePudding user response:

Using str.extract in multiline mode should work here:

import re  # needed for the re.M flag
df["value"] = df["col"].str.extract(r'^(WHATIWANT:.*?)\s*$', flags=re.M)
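
A minimal sketch of that suggestion on made-up data, assuming the target line starts right after an LF and ends with CRLF. The sample string and the `expand=False` argument (which returns a Series for a single capture group) are assumptions added for illustration:

```python
import re
import pandas as pd

# Hypothetical sample: the wanted line follows "\n" and ends with "\r\n".
df = pd.DataFrame(
    {"col": ["basic text\nWHATIWANT: - this is cool ! = . a1^@#%\r\nmore lines"]}
)

# The lazy group grows until the rest of the line is only whitespace,
# so the trailing "\r" stays outside the capture.
df["value"] = df["col"].str.extract(
    r"^(WHATIWANT:.*?)\s*$", flags=re.M, expand=False
)

print(repr(df["value"].iloc[0]))
```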

CodePudding user response:

In your pattern, the lazy (.*?) group captures an empty string, because the greedy dot pattern in front of it already matches the whole line and only backtracks if something obligatory follows it. The only thing that follows is the trailing \s, so backtracking gives up just the \n, and the \r stays captured by the greedy part (one \n is enough for the \s to match).
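
The effect can be reproduced with a simplified stand-in. This is a sketch only: the `KEY:` patterns and the sample text below are assumptions, not the pattern from the question.

```python
import re

# Hypothetical text with a CRLF line ending.
text = "KEY: value\r\nnext line"

# Greedy: the dot matches everything except "\n", including "\r",
# so the trailing \s only has the "\n" left to match.
greedy = re.search(r"KEY:(.*)\s", text)

# Lazy: the group matches as little as possible, and \s is satisfied
# immediately by the space after the colon.
lazy = re.search(r"KEY:(.*?)\s", text)

print(repr(greedy.group(1)))  # ' value\r'
print(repr(lazy.group(1)))    # ''
```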

Since your multiline string contains a mixture of line ending sequences (CR, LF, CRLF), you need to account for a \r that is not followed by \n: (?m)^ does not match at the position after a bare CR char. It only matches at the start of the string or at the position right after \n, the LF char.
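
A small check of this behaviour with the standard `re` module. The sample text is made up for illustration:

```python
import re

# Hypothetical text mixing CR, LF and CRLF line endings.
text = "one\rtwo\nthree\r\nfour"

# (?m)^ matches at the string start and after "\n" only,
# so "two" (which follows a bare "\r") is not seen as a line start.
starts = re.findall(r"(?m)^\w+", text)

# Also accepting "\r" (optionally followed by "\n") as a line start
# picks up the word after the bare CR as well.
with_cr = re.findall(r"(?m)(?:\r\n?|^)(\w+)", text)

print(starts)   # ['one', 'three', 'four']
print(with_cr)  # ['one', 'two', 'three', 'four']
```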

So, I suggest

import pandas as pd
df = pd.DataFrame({'col':["basic text\rWHATIWANT: - this is cool ! = . a1^@#%\r\nmore lines"]})
df['col'].str.extract(r'(?m)(?:\r\n?|^)WHATIWANT:\s*(.*\S)\s*$')
# >                             0
# > 0  - this is cool ! = . a1^@#%

df['col'].str.extract(r'(?m)(?:\r\n?|^)(WHATIWANT:.*\S)\s*$')
# >                                         0
# > 0  WHATIWANT: - this is cool ! = . a1^@#%

Here,

  • (?m) - a re.M flag inline modifier
  • (?:\r\n?|^) - CR and an optional LF char, or a position after an LF char or start of string
  • WHATIWANT: - a string
  • \s* - zero or more whitespaces
  • (.*\S) - Group 1: any zero or more chars other than a line feed char (this includes \r) and then any one non-whitespace char, so the group always ends on a non-whitespace char and never on \r
  • \s*$ - zero or more whitespaces and end of line.

When using inline flag modifiers like (?m), you do not have to import re explicitly in your script.
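
As a rough check, the inline modifier and the `flags` argument should give the same result. The sketch below reuses the sample frame from above; the variable names and `expand=False` (to get a Series back) are choices made for this illustration:

```python
import re
import pandas as pd

df = pd.DataFrame(
    {"col": ["basic text\rWHATIWANT: - this is cool ! = . a1^@#%\r\nmore lines"]}
)

# Pattern body without any flag, shared by both calls.
body = r"(?:\r\n?|^)(WHATIWANT:.*\S)\s*$"

# Variant 1: inline modifier, no "re" import needed for this call.
inline = df["col"].str.extract("(?m)" + body, expand=False)

# Variant 2: the same flag passed through the flags argument.
flagged = df["col"].str.extract(body, flags=re.M, expand=False)

print(inline.equals(flagged))
```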
