Can anyone help me write a regex pattern for the requirements below? It should match:
- Section tags that don't have numbers
- Section tags whose number is not followed by a dot character
- Only the number immediately after the opening section tag should be considered.
Test String:
<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>
*<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>*
<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>
*<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>*
*<sectiona>Compound Interest<role>back</role></sectiona>*
Pattern:
<section[a-z]>[\d]*[^\.]*<\/section[a-z]
The regex pattern should match the strings below:
<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>
<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>
<sectiona>Compound Interest<role>back</role></sectiona>
CodePudding user response:
You can use this regex:
/<section[a-z]>([^\d]|\d+(?![.])).*?<\/section[a-z]>/g
Explanation:
<section[a-z]> - match literal <section and a letter and >
( - begin a group
[^\d]|\d+ - match either a non-digit OR one or more digits
(?![.]) - NOT followed by a dot .
) - end group
.*?<\/section[a-z]> - match any character zero or more times, as few as possible, followed by the literal string </section followed by a letter and >
This will not match if the number right after the opening tag is followed by a dot.
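As a quick check, the pattern can be run against the five test lines from the question. This is a minimal sketch that assumes Python's `re` module (the thread names no regex flavor) and reads the pattern with `\d+` and the closing `>`, as the explanation describes. `TEST_LINES` and `matching_lines` are names made up for the example.

```python
import re

# Assumption: Python's re flavor; the thread does not say which engine is used.
# The pattern is read with `\d+` and a closing `>`, per the explanation.
PATTERN = re.compile(r"<section[a-z]>([^\d]|\d+(?![.])).*?<\/section[a-z]>")

# The five test lines from the question, without the emphasis asterisks.
TEST_LINES = [
    "<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>",
    "<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>",
    "<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>",
    "<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>",
    "<sectiona>Compound Interest<role>back</role></sectiona>",
]


def matching_lines(pattern, lines):
    """Return the lines in which the pattern finds a match."""
    return [line for line in lines if pattern.search(line)]


for line in matching_lines(PATTERN, TEST_LINES):
    print(line)
```

Under these assumptions the sketch prints the `2 Surface Model` and `Compound Interest` lines. The `3. 2. 1` line is not printed because its first number is followed by a dot. Note that a multi-digit number such as `12.` would still get through, since `\d+` can backtrack to `1` and then see `2` rather than a dot.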
CodePudding user response:
This matches the updated requirements:
<section\w+>(((\d+\.\s*)*(\d+[^\.]))|[^\d]).*?<\/section\w>
<section\w+> \w is mostly the same as [a-z], with + to allow for one or more characters (<sectiona>, <sectionabc>); remove the + for exactly one letter
(\d+\.\s*)* zero or more repetitions of digits, a dot and any number of spaces - matches the updated row, which is now 3. 2. 1 with spaces after the dots
(\d+[^\.]) must match one or more digits followed by a character that is not a dot
((...)|[^\d]) or the section content does not start with a digit (matches row 5)
.*? followed by any characters, as few times as possible, up to the following </section. This could likely be simplified with a lookahead, but, for me, this keeps the "no digits" clause separate.
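The pieces described above can be tried on the question's test lines. This is a minimal sketch assuming Python's `re` module (the thread names no regex flavor) and reading each quantifier on `\d` and on the opening `\w` as `+`. `SECTION_RE`, `SAMPLE_LINES` and `select_sections` are hypothetical names used only for this example.

```python
import re

# Assumption: Python's re flavor, with `+` on `\d` and on the opening `\w`.
SECTION_RE = re.compile(
    r"<section\w+>"                      # opening tag, one or more word characters
    r"(((\d+\.\s*)*(\d+[^\.]))|[^\d])"   # dotted prefix ending in an undotted number, or no leading digit
    r".*?<\/section\w>"                  # shortest run up to the closing tag
)

# The five test lines from the question, without the emphasis asterisks.
SAMPLE_LINES = [
    "<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>",
    "<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>",
    "<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>",
    "<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>",
    "<sectiona>Compound Interest<role>back</role></sectiona>",
]


def select_sections(lines):
    """Return the lines in which SECTION_RE finds a match."""
    return [line for line in lines if SECTION_RE.search(line)]


for line in select_sections(SAMPLE_LINES):
    print(line)
```

Under these assumptions it prints the `2 Surface Model`, `3. 2. 1` and `Compound Interest` lines, which are the three lines the question asks for.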
