I'm trying to build a regex that stops when a line is equal to "--- admonition".
For example, I have:
??? ad-question Quels sont les deux types de bornages ?
Il y en a deux :
- Le bornage amiable.
- Le bornage judiciaire.
test
--- admonition
I can have the same capture format multiple times on a page.
In every match, I want to retrieve in a first group:
Quels sont les deux types de bornages ?
and in a second group:
Il y en a deux :
Le bornage amiable.
Le bornage judiciaire.
test
I tried:
^\?{3} ad-question {1}(.+)\n*((?:\n(?:^[^#].{0,2}$|^[^#].{3}(?<!---).*))+)
or
^\?{3} ad-question {1}(.+)\n*((?:\n(?:^[^\n#].{0,2}$|^[^\n#](?<!----).*))+)
but neither stopped at "\n--- admonition", and both captured the newline between the two groups.
Can someone help me build this regex?
PS: There must be a newline between the two groups and between group 2 and "--- admonition", so these newlines must be kept out of the groups.
Thanks for your help.
CodePudding user response:
If you want 2 capture groups without matching the newlines in between the groups, but there must be at least a whole empty line in between the groups:
^\?{3} ad-question (.+)\n{2,}((?:(?!---).*\n)*?)\n+---
The pattern matches:
^ - Start of string
\?{3} ad-question - Match ??? ad-question followed by a space
(.+) - Capture group 1, match the whole line
\n{2,} - Match 2 or more newlines, so that there is at least one empty line in between
( - Capture group 2
(?:(?!---).*\n)*? - Repeat as few times as possible, matching whole lines (including the newline) that do not start with ---
) - Close group 2
\n+--- - Match 1 or more newlines and ---
If there should be at least a single newline present:
^\?{3} ad-question (.+)\n+((?:(?!---).*\n)*?)\n*---
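A minimal Python sketch of the first pattern above, assuming the `re` module with `re.MULTILINE` (so that `^` matches at every line start) and a sample `text` modelled on the question with an empty line between every line. The variable names and the second sample block are illustrative, not part of the original post:

```python
import re

# Sample input modelled on the question (assumption: one empty line
# between every line, and two blocks on the same page).
text = (
    "??? ad-question Quels sont les deux types de bornages ?\n\n"
    "Il y en a deux :\n\n"
    "- Le bornage amiable.\n\n"
    "- Le bornage judiciaire.\n\n"
    "test\n\n"
    "--- admonition\n"
    "\n"
    "??? ad-question 2 types de bornages ?\n\n"
    "Il y en a deux :\n\n"
    "test 2\n\n"
    "--- admonition\n"
)

# re.MULTILINE makes ^ match at the start of each line, not only at
# the start of the whole string.
pattern = re.compile(
    r"^\?{3} ad-question (.+)\n{2,}((?:(?!---).*\n)*?)\n+---",
    re.MULTILINE,
)

matches = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
for title, body in matches:
    print(repr(title))
    print(repr(body))
```

With this input, group 2 still ends with the newline that terminates its last line; calling `body.rstrip("\n")` removes it if that matters.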
CodePudding user response:
Try this regex:
\?{3}\s*(.+)\s*((?:(?!-{3} admonition)[\s\S])*?)\s*-{3} admonition
Explanation:
\?{3} - matches 3 occurrences of ?
\s* - matches 0 or more whitespace characters
(.+) - matches 1 or more occurrences of any character except a newline and captures them in group 1
\s* - matches 0 or more whitespace characters
((?:(?!-{3} admonition)[\s\S])*?) - matches 0 or more characters, as few as possible, none of which starts the text --- admonition, and captures them in group 2
\s*-{3} admonition - after those characters, matches 0 or more whitespace characters followed by the text --- admonition
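As a rough illustration of how this pattern behaves, here is a small Python sketch. The sample `text` (with an empty line between every line) and the variable names are assumptions for the example. Note that group 1 captures the rest of the first line, so it includes the `ad-question` prefix; the last two lines show one hypothetical way to drop it.

```python
import re

# Sample input modelled on the question (assumption: blank line between lines).
text = (
    "??? ad-question Quels sont les deux types de bornages ?\n\n"
    "Il y en a deux :\n\n"
    "- Le bornage amiable.\n\n"
    "- Le bornage judiciaire.\n\n"
    "test\n\n"
    "--- admonition\n"
)

# No flags needed: [\s\S] matches any character, newlines included.
pattern = re.compile(
    r"\?{3}\s*(.+)\s*((?:(?!-{3} admonition)[\s\S])*?)\s*-{3} admonition"
)

m = pattern.search(text)
title, body = m.group(1), m.group(2)

# Group 1 is the whole remainder of the first line, prefix included.
prefix = "ad-question "
question = title[len(prefix):] if title.startswith(prefix) else title

print(repr(title))
print(repr(question))
print(repr(body))
```

Because the trailing `\s*` sits outside group 2, the captured body carries no leading or trailing blank lines here.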
CodePudding user response:
There are many ways to do this, I guess; my two cents:
^\?{3}\h+ad-question\h+(.+)\n+((?:.*\n?)+?)\n+^---\h+admonition$
^\?{3}\h+ad-question\h+ - Start-line anchor followed by three literal question marks, 1+ (greedy) horizontal whitespace characters, the literal 'ad-question' and another 1+ horizontal whitespace characters;
(.+) - Your 1st capture group, with 1+ (greedy) characters other than newline;
\n+ - 1+ (greedy) newline characters;
((?:.*\n?)+?) - A 2nd capture group with a nested non-capture group matched 1+ (lazy) times, each repetition matching 0+ characters up to an optional newline character;
\n+ - 1+ (greedy) newline characters;
^---\h+admonition$ - From start-line anchor to end-line anchor, match '---', 1+ horizontal whitespace characters and 'admonition'.
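The `\h` shorthand is a PCRE feature that Python's built-in `re` module does not support, so trying this pattern in Python needs a small adaptation. The sketch below swaps `\h` for the character class `[ \t]` (space or tab only, which is narrower than PCRE's `\h`); that substitution, the sample `text` and the variable names are assumptions for illustration:

```python
import re

# Sample input modelled on the question (assumption: blank line between lines).
text = (
    "??? ad-question Quels sont les deux types de bornages ?\n\n"
    "Il y en a deux :\n\n"
    "- Le bornage amiable.\n\n"
    "- Le bornage judiciaire.\n\n"
    "test\n\n"
    "--- admonition\n"
)

# [ \t]+ stands in for PCRE's \h+; re.MULTILINE lets ^ and $ match
# at line boundaries.
pattern = re.compile(
    r"^\?{3}[ \t]+ad-question[ \t]+(.+)\n+((?:.*\n?)+?)\n+^---[ \t]+admonition$",
    re.MULTILINE,
)

m = pattern.search(text)
title = m.group(1)
body = m.group(2).strip()  # drop any newline left at the edge of group 2

print(repr(title))
print(repr(body))
```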
CodePudding user response:
You most probably need the re.DOTALL and re.MULTILINE flags. You can also set them as inline flags within the pattern: '(?s)' and '(?m)'.
DOTALL lets '.' also match '\n', which it normally does NOT match (re.DOTALL is Python; other dialects have similar flags, e.g. JS, Java).
You can capture yours with r'\?\?\?(.*?)\?(.*?)--- admonition' and those 2 flags.
Python example:
import re

# Sample text with an empty line between every line, as in the output below
text = """??? ad-question Quels sont les deux types de bornages ?

Il y en a deux :

- Le bornage amiable.

- Le bornage judiciaire.

test

--- admonition
??? ad-question 2 types de bornages ?

Il y en a deux :

- Le bornage judiciaire.

test 2

--- admonition"""

pattern = r'\?\?\?(.*?)\?(.*?)--- admonition'

for f in re.finditer(pattern, text, re.MULTILINE | re.DOTALL):
    print(f)
    print(f.groups())  # tuple of groups (A, B, ..) of grouped matches
Output:
<re.Match object; span=(0, 144), match='??? ad-question Quels sont les deux types de born>
(' ad-question Quels sont les deux types de bornages ',
'\n\nIl y en a deux :\n\n- Le bornage amiable.\n\n- Le bornage judiciaire.\n\ntest\n\n')
<re.Match object; span=(145, 251), match='??? ad-question 2 types de bornages ?\n\nIl y en>
(' ad-question 2 types de bornages ',
'\n\nIl y en a deux :\n\n- Le bornage judiciaire.\n\ntest 2\n\n')
Pattern '\?\?\?(.*?)\?(.*?)--- admonition' explained:
\?\?\? - 3 literal question marks (QM)
(.*?)\? - non greedy capture (including \n) up to 1st QM
(.*?)--- admonition - non-greedy capture (including \n) up to --- admonition
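For comparison, a short sketch of the inline-flag form mentioned at the top of this answer: `(?s)` at the start of the pattern should behave like passing `re.DOTALL`. The sample `text` and the `cleaned` post-processing step are assumptions added for illustration:

```python
import re

# Sample input modelled on the question (assumption: blank line between lines).
text = (
    "??? ad-question Quels sont les deux types de bornages ?\n\n"
    "Il y en a deux :\n\n"
    "- Le bornage amiable.\n\n"
    "- Le bornage judiciaire.\n\n"
    "test\n\n"
    "--- admonition\n"
)

# Same pattern twice: once with the flag argument, once with the inline flag.
flagged = re.findall(r"\?\?\?(.*?)\?(.*?)--- admonition", text, re.DOTALL)
inline = re.findall(r"(?s)\?\?\?(.*?)\?(.*?)--- admonition", text)

# Strip the surrounding whitespace and newlines that both groups capture.
cleaned = [(title.strip(), body.strip()) for title, body in inline]
print(flagged == inline)
print(cleaned)
```

Two caveats with this pattern: group 1 keeps the `ad-question` prefix, and it is cut at the first question mark after `???`, so a title containing an earlier `?` would be split there.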
