Split by '.' when not preceded by digit

I want to split '10.1 This is a sentence. Another sentence.' as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']

I have tried

s.split(r'\D.\D')

It doesn't work, how can this be solved?

CodePudding user response：

If you want to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:

re.split(r'(?<!\d)\.(?!\d|$)', text)

See the regex demo.
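
As a quick check, the split pattern above can be run against both strings from the question. This is only a minimal sketch: the helper name split_sentences is made up for illustration. Note that the pieces keep the leading space and the final period, so some cleanup may still be needed afterwards.

```python
import re

# Hypothetical helper name; the pattern itself is the one from the answer above.
def split_sentences(text):
    # Split on a '.' that is not preceded by a digit, not followed by a digit,
    # and not at the very end of the string.
    return re.split(r'(?<!\d)\.(?!\d|$)', text)

print(split_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', ' Another sentence.']
print(split_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', ' Another sentence.']
```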

If your strings can contain more special cases, you could use a more customizable extracting approach:

re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)

See this regex demo. Details:

(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
- \d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
- | - or
- [^.] - any char other than a . char.
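
The extracting approach can be sketched as follows, assuming the "one or more" quantifiers described above are written as +. The names SENTENCE_RE and extract_sentences are made up for this example, and the strip() call is an addition to remove the space left over after each period.

```python
import re

# Pattern as described in the details above, with "one or more" written as +.
SENTENCE_RE = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

# Hypothetical helper name, used only for this illustration.
def extract_sentences(text):
    # Each match is a run of dotted numbers and non-period characters;
    # strip() drops the whitespace that follows the previous period.
    return [m.strip() for m in SENTENCE_RE.findall(text)]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```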

CodePudding user response：

You have multiple issues:

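You're not using re.split(), you're using str.split(), which treats its argument as a literal string rather than a regex.
You haven't escaped the ., so it matches any character; use \. instead.
You're not using lookaheads and lookbehinds, so the three characters matched by the pattern (the . and its two neighbours) are consumed by the split.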
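
The three points can be seen on a tiny made-up string ('ab.cd' is chosen only for illustration, it is not from the question):

```python
import re

s = 'ab.cd'  # made-up sample string

# str.split() looks for the literal text \D.\D, which is not in s,
# so the string comes back unsplit.
literal = s.split(r'\D.\D')
print(literal)      # ['ab.cd']

# With re.split() but an unescaped dot, '.' matches any character,
# so the match lands on 'ab.' instead of around the period.
unescaped = re.split(r'\D.\D', s)
print(unescaped)    # ['', 'cd']

# With the dot escaped, the split is in the right place, but the three
# matched characters ('b.c') are consumed.
consumed = re.split(r'\D\.\D', s)
print(consumed)     # ['a', 'd']

# Lookarounds match a position rather than characters, so nothing is lost
# (splitting on an empty match needs Python 3.7 or later).
kept = re.split(r'(?<=\D\.)(?=\D)', s)
print(kept)         # ['ab.', 'cd']
```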

Fixed code:

>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']

Basically, (?<=\D\.) matches a position right after a . that is itself preceded by a non-digit character. (?=\D) then makes sure a non-digit character follows that position. Where both conditions hold, the string is split at that position, so no characters are lost.
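
If the exact lists from the question are wanted (no trailing period, no leading space), one possible follow-up is to clean each piece after splitting. This is a sketch rather than part of the answer above; split_and_clean is a hypothetical helper name.

```python
import re

# Hypothetical helper: lookaround split from above plus a cleanup step.
def split_and_clean(text):
    # Split at positions that follow a non-digit + '.' and precede a non-digit.
    # Splitting on an empty match needs Python 3.7 or later.
    parts = re.split(r'(?<=\D\.)(?=\D)', text)
    # Drop surrounding whitespace, then the trailing period of each sentence.
    return [p.strip().rstrip('.') for p in parts]

print(split_and_clean('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(split_and_clean('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```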

CodePudding user response：

All sentences (except the very last one) end with a period followed by a space, so split on that. Worrying about the clause number is pointless. The way the regex is structured, it should also swallow any run of whitespace between sentences or paragraphs.

import re

data = '10.1 This is a sentence. Another sentence.'
lines = re.split(r'\.\s+', data)  # split on a period followed by whitespace
print(lines)
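
For comparison, here is the same idea run on both strings from the question, assuming the intended pattern is \.\s+ (a period followed by one or more whitespace characters). The helper name split_on_period_space is made up. With this pattern the second string is also split right after the clause number, which differs from the result the question asks for.

```python
import re

# Hypothetical helper name; assumes the pattern is a period followed by whitespace.
def split_on_period_space(text):
    # The period and the whitespace run after it are both consumed by the split.
    return re.split(r'\.\s+', text)

print(split_on_period_space('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence.']
print(split_on_period_space('10.1. This is a sentence. Another sentence.'))
# ['10.1', 'This is a sentence', 'Another sentence.']
```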