I have a dataset made up of sentences spread over multiple lines. Three labels (DOC, TITLE and start) appear consistently in the dataset and mark where each group of sentences begins and ends. It looks like the following:
text = """
DOC
TITLE
start
This is a test
to see if this work
DOC
TITLE
start
Testing to see if this works
Testing this out
Another test case
DOC"""
Note: \n stands for a newline, not for literal characters in the text.
Edit: the explicit \n characters have been removed from the sample, since the triple-quoted string already contains the newlines.
I need everything between start and DOC. I am trying to split the results so that I have a list like so:
[(DOC\nTITLE\nstart\n, This is a test\n, to see if this work\n), (DOC\nTITLE\nstart\n, Testing to see if this works\nTesting this out\nAnother test case\n)]
So far I have tried re.split(r'(?<=\start)(.*\n)(?=\DOC)', text), but it does not split the text. When I tested the pattern on RegExr, it reported no match. I am not sure what I am doing wrong.
Answer:
Assuming consecutive lines of text are separated by exactly one newline, you could use:
import re
re.findall(r'(DOC\n.*?\nstart)\n(.+?(?=\nDOC))', text, flags=re.DOTALL)
Output:
[('DOC\nTITLE\nstart', 'This is a test\nto see if this work'),
('DOC\nTITLE\nstart', 'Testing to see if this works\nTesting this out\nAnother test case')]
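
As for why the original attempt finds nothing: in a regex, \s is the whitespace class and \D is the non-digit class, so \start and \DOC do not match the literal words start and DOC. In addition, without re.DOTALL the dot cannot cross a newline, so (.*\n) can cover at most one line. The following is a minimal sketch, not the only way to do it, that demonstrates this and also keeps the trailing newlines and splits the body into separate lines, closer to the list shown in the question. The names pattern and records are illustrative, and the sketch assumes that no sentence line itself begins with DOC.

```python
import re

text = """
DOC
TITLE
start
This is a test
to see if this work
DOC
TITLE
start
Testing to see if this works
Testing this out
Another test case
DOC"""

# The original pattern never matches, so re.split returns the text unsplit:
# \s means "any whitespace" and \D means "any non-digit", not the letters s and D.
unsplit = re.split(r'(?<=\start)(.*\n)(?=\DOC)', text)

# Group 1: the header, from DOC through the start line (newline included).
# Group 2: one or more complete lines, as few as possible, up to the next DOC.
# Assumption: no sentence line begins with the letters DOC.
pattern = re.compile(r'(DOC\n(?:.*\n)*?start\n)((?:.+\n)+?)(?=DOC)')

# Split each body into its lines, keeping the trailing newline on each one.
records = [
    (header, *body.splitlines(keepends=True))
    for header, body in pattern.findall(text)
]

for record in records:
    print(record)
```

Because the lookahead (?=DOC) does not consume the closing DOC, that same DOC can open the next record, which is what lets consecutive blocks share a delimiter.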
