I want to split '10.1 This is a sentence. Another sentence.'
into ['10.1 This is a sentence', 'Another sentence'], and '10.1. This is a sentence. Another sentence.' into ['10.1. This is a sentence', 'Another sentence'].
I have tried:
s.split(r'\D.\D')
It doesn't work. How can this be solved?
CodePudding user response:
If you want to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
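As a quick check, here is a minimal sketch that runs the split pattern r'(?<!\d)\.(?!\d|$)' on both strings from the question. The helper name split_sentences is hypothetical, and stripping the leftover whitespace is an extra step that is not part of the pattern itself.

```python
import re

# Split on a "." that is not preceded by a digit and is not followed by
# a digit or by the end of the string.
PATTERN = re.compile(r'(?<!\d)\.(?!\d|$)')

def split_sentences(text):
    # Hypothetical helper: split, then trim the space left after each ".".
    return [part.strip() for part in PATTERN.split(text)]

print(split_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence.']
print(split_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence.']
```

The last piece keeps its trailing period because the pattern deliberately skips a . at the end of the string; calling .rstrip('.') on it removes the period if it is unwanted.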
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
- (?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of:
  - \d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of a . and one or more digits ((?:\.\d+)*), and then an optional . char (\.?)
  - | - or
  - [^.] - any char other than a . char.
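The extracting approach can be sketched as follows, assuming the pattern is r'(?:\d+(?:\.\d+)*\.?|[^.])+' with its + quantifiers in place. The helper name extract_sentences is hypothetical, and the strip() call is an added clean-up step.

```python
import re

# A number such as "10.1" or "10.1." is consumed as one unit; any other
# character except "." is consumed one at a time. A match therefore
# stops at the first "." that is not part of a number.
PATTERN = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract_sentences(text):
    # Hypothetical helper: collect all matches and trim surrounding spaces.
    return [match.strip() for match in PATTERN.findall(text)]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```

Because the sentence-ending periods are never matched, this variant drops them from every piece, including the last one.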
CodePudding user response:
You have multiple issues:
- You're not using re.split(), you're using str.split().
- You haven't escaped the ., use \. instead.
- You're not using lookahead and lookbehinds, so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that is itself preceded by a non-digit character. (?=\D) then makes sure there's a non-digit after the current position. Because both are zero-width assertions, the split happens at that position without consuming any characters.
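To make the difference concrete, here is a small sketch on the second string from the question. It assumes Python 3.7 or later, where re.split() splits on empty matches; the final clean-up step is an addition for illustration and not part of the fix itself.

```python
import re

s = '10.1. This is a sentence. Another sentence.'

# str.split() treats its argument as a literal separator, not a regex,
# so the string comes back unsplit.
unsplit = s.split(r'\D.\D')
print(unsplit)
# ['10.1. This is a sentence. Another sentence.']

# Zero-width split: after a "." preceded by a non-digit, before a non-digit.
parts = re.split(r'(?<=\D\.)(?=\D)', s)
print(parts)
# ['10.1. This is a sentence.', ' Another sentence.']

# Optional clean-up: trim spaces and the trailing period of each piece.
cleaned = [p.strip().rstrip('.') for p in parts]
print(cleaned)
# ['10.1. This is a sentence', 'Another sentence']
```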
CodePudding user response:
All sentences (except the very last one) end with a period followed by whitespace, so split on that. Worrying about the clause number is pointless. The way I structured my regex, it should also swallow arbitrary whitespace between paragraphs.
import re
data = '10.1 This is a sentence. Another sentence.'
# Split on a period followed by one or more whitespace characters.
lines = re.split(r'\.\s+', data)
print(lines)
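The whitespace claim can be checked with a short sketch. It assumes the pattern is r'\.\s+' (a period followed by one or more whitespace characters); the sample text is made up for illustration.

```python
import re

# Hypothetical sample: a blank line and indentation between two sentences.
text = 'First sentence.\n\n   Second sentence. Third sentence.'

# \s+ consumes the newlines and spaces along with the period.
pieces = re.split(r'\.\s+', text)
print(pieces)
# ['First sentence', 'Second sentence', 'Third sentence.']
```

The last piece keeps its period because nothing follows it, so the pattern does not match there.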
