Get everything between delimiter on multiple lines

Time:01-30

I have a dataset made up of sentences that span multiple lines. Three labels appear consistently in the dataset and mark the beginning and end of each sentence. It looks like the following:

text = """
DOC
TITLE
start
This is a test
to see if this work
DOC
TITLE
start
Testing to see if this works
Testing this out
Another test case
DOC"""

Note: the \n is a newline not a value within the text.

Edit: the \n have been removed as the triple quoted string already included newlines.

I need everything between start and DOC. I am trying to split the results so that I have a list like so:

[(DOC\nTITLE\nstart\n, This is a test\n, to see if this work\n), (DOC\nTITLE\nstart\n, Testing to see if this works\nTesting this out\nAnother test case\n)]

So far I have tried re.split(r'(?<=\start)(.*\n)(?=\DOC)', text), but it does not split the text. When I checked the pattern on RegExr, it showed no match. I am not sure what I am doing wrong.

Answer:

Assuming there is only 1 newline between lines of text, you could use:

import re

# re.DOTALL lets "." match newlines, so each group can span several lines.
re.findall('(DOC\n.*?\nstart)\n(.+?(?=\nDOC))', text, flags=re.DOTALL)

Output:

[('DOC\nTITLE\nstart', 'This is a test\nto see if this work'),
 ('DOC\nTITLE\nstart', 'Testing to see if this works\nTesting this out\nAnother test case')]
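
As for why the attempt in the question does not split anything: inside a regex, \s means "any whitespace character" and \D means "any non-digit", so \start and \DOC do not match the literal words. Also, without re.DOTALL, .*\n can only cover a single line. The sketch below reproduces that behaviour on the sample text; the adjusted split pattern at the end is only one possible correction, not something taken from the question.

```python
import re

text = """
DOC
TITLE
start
This is a test
to see if this work
DOC
TITLE
start
Testing to see if this works
Testing this out
Another test case
DOC"""

# Pattern from the question: "\s" + "tart" and "\D" + "OC" are not the
# literal words "start" and "DOC", so the lookarounds never succeed here.
attempt = r'(?<=\start)(.*\n)(?=\DOC)'
parts = re.split(attempt, text)
print(parts == [text])  # the text comes back as a single, unsplit piece

# One possible correction (an assumption, not the asker's code): literal
# labels, a lazy quantifier, and DOTALL so the body can span several lines.
fixed_parts = re.split(r'(?<=start\n)(.*?\n)(?=DOC)', text, flags=re.DOTALL)
for piece in fixed_parts:
    print(repr(piece))
```

With re.split, the captured body is returned in between the surrounding header pieces, which is why re.findall with two groups tends to be easier to work with for this task.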

How the regex works
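
The pattern can be read piece by piece. Below is a commented restatement using re.VERBOSE; it is a sketch of how the expression behaves on the sample text, and the names pattern, matches and as_lines are just illustrative.

```python
import re

text = """
DOC
TITLE
start
This is a test
to see if this work
DOC
TITLE
start
Testing to see if this works
Testing this out
Another test case
DOC"""

pattern = re.compile(
    r"""
    (DOC\n.*?\nstart)   # group 1: DOC, then as little as possible up to a line reading start
    \n                  # the newline after start, kept out of both groups
    (.+?                # group 2: the body, as few characters as possible
      (?=\nDOC)         # lookahead: stop just before the next newline followed by DOC
    )
    """,
    re.DOTALL | re.VERBOSE,  # DOTALL: "." also matches newlines
)

matches = pattern.findall(text)
for header, body in matches:
    print(repr(header), repr(body))

# Optional reshaping: split each body into its individual lines.
as_lines = [(header, body.splitlines()) for header, body in matches]
print(as_lines)
```

Because the closing DOC is only checked by a lookahead, it is not consumed, so the same DOC line can serve as the start of the next match.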
