Regular Expression - Ignore multiple spaces and Consider only one space in the match

Time:02-04

I am stuck on a regular expression and have been unable to fix it after trying several different approaches.

Here is my attempt: [image not available]

CodePudding user response:

You can allow only single spaces by editing your CircuitID part so that each character is either one of the non-space characters [a-zA-Z0-9\-\/] or a space character that isn't followed by another space character, which the negative lookahead (?! ) enforces.

So the CircuitID part becomes (?<CircuitID>(?:[a-zA-Z0-9\-\/]| (?! )){6,26})

regex101
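
As a rough check of the CircuitID fragment on its own, here is a minimal Python sketch. It assumes Python's re module, which spells named groups as (?P<name>...) rather than (?<name>...); the helper name is_circuit_id is made up for illustration, and the rest of the original pattern is not reproduced because it is not shown in the question.

```python
import re

# Python writes named groups as (?P<name>...); the (?<CircuitID>...) spelling
# in the answer is the .NET/PCRE form. Only the CircuitID fragment is tested.
CIRCUIT_ID = re.compile(r"(?P<CircuitID>(?:[a-zA-Z0-9\-\/]| (?! )){6,26})")


def is_circuit_id(text):
    """Return True if the whole string satisfies the CircuitID fragment."""
    return CIRCUIT_ID.fullmatch(text) is not None


print(is_circuit_id("A123 4599"))               # single space: True
print(is_circuit_id("A123  4599"))              # two consecutive spaces: False
print(is_circuit_id("AB/PQRS/012345/ /ABC /"))  # isolated single spaces: True
```

Note that a space at the very end of the string would also be accepted, since the lookahead only rejects a space that is followed by another space.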

CodePudding user response:

The main task is to establish the rules for identifying the Circuit ID. Once that is done it can be determined if those strings can be extracted with the use of a regular expression.

After having posited rules for identifying the Circuit ID my objective was to construct a regular expression that avoided the need to enumerate company names in the regex, considering that attempts to match those names could be problematic due to misspellings or other variations.

Let's first look at the beginnings of the example strings that contain what I understand to be the circuit ids, which I've indicated by ^^^^^.

XYZ INTERNATIONAL      A1B101012        AB ...<tab>...
                       ^^^^^^^^^
XYZ INTERNATIONAL<tab>AB/PQRS/012345/ /ABC /<tab>PQR...
                      ^^^^^^^^^^^^^^^^^^^^^^
XYZ INTERNATIONAL<tab>AB/ABYX/271703/ /ABC /<tab>ABC...
                      ^^^^^^^^^^^^^^^^^^^^^^
XYZ WORLDWIDE PTE. LTD.<tab>A1234597<tab>N/A...
                            ^^^^^^^^
Convent Inc<tab>A123 4599<tab>N/A...
                ^^^^^^^^^
XYZ INTERNATIONAL<tab>B124565<tab>N/A...
                      ^^^^^^^ 
XYZ INC<tab>WA/OGGS/186815/ /ABC /<tab>ABC...
            ^^^^^^^^^^^^^^^^^^^^^^

The triple dots are placeholders for additional text.

On the basis of this limited sample I will posit a rule for identifying the circuit id string. If my assumptions are incorrect the OP needs to clarify the rule.

I assume that a circuit id string is:

  • comprised of 2-4 uppercase letters and digits, of which there is at least one letter, followed by 4-8 digits, the string being preceded and followed by a space; or
  • the text between the first two tabs in the string.

A string may have multiple substrings that satisfy one of these two requirements, but it appears from the data that, if there is at least one match, the first match will return the circuit id string. Therefore, all matches after the first are to be ignored.

The rule I have set out may of course have to be modified by the OP, which would require a change to the regular expression I suggest using, which is as follows.

(?<= )(?=.{0,3}[A-Z])[A-Z\d]{2,4}\d{4,8}(?= )|(?<=\t).*?(?=\t)

Demo
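
To try the expression outside the demo, here is a small Python sketch. It assumes Python's re module (which accepts these fixed-width lookbehinds) and takes only the first match, as described above. The function name first_circuit_id is hypothetical, and the sample lines keep the literal "..." placeholders rather than real trailing text.

```python
import re

PATTERN = re.compile(
    r"(?<= )(?=.{0,3}[A-Z])[A-Z\d]{2,4}\d{4,8}(?= )|(?<=\t).*?(?=\t)"
)


def first_circuit_id(line):
    """Return the first match in the line, or None if nothing matches."""
    match = PATTERN.search(line)  # search() stops at the leftmost match
    return match.group(0) if match else None


# Sample lines from the question; "..." is a literal placeholder here.
SAMPLES = [
    "XYZ INTERNATIONAL      A1B101012        AB ...\t...",
    "XYZ INTERNATIONAL\tAB/PQRS/012345/ /ABC /\tPQR...",
    "XYZ WORLDWIDE PTE. LTD.\tA1234597\tN/A...",
    "Convent Inc\tA123 4599\tN/A...",
    "XYZ INC\tWA/OGGS/186815/ /ABC /\tABC...",
]

for line in SAMPLES:
    print(repr(first_circuit_id(line)))
```

Using search() rather than findall() is what implements the "ignore all matches after the first" part of the rule.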

The regular expression can be broken down as follows.

(?<= )        # positive lookbehind asserts previous character is a space
(?=           # begin positive lookahead
  .{0,3}[A-Z] # match 0-3 characters followed by a capital letter
)             # end positive lookahead
[A-Z\d]{2,4}  # match 2-4 characters from the character class
\d{4,8}       # match 4-8 digits
(?= )         # positive lookahead asserts next character is a space
|             # or
(?<=\t)       # positive lookbehind asserts previous character is a tab
.*?           # match zero or more characters, as few as possible
(?=\t)        # positive lookahead asserts next character is a tab
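
The commented layout above can be kept in code with a free-spacing flag. The sketch below assumes Python's re.VERBOSE mode; because that mode ignores bare spaces in the pattern, the literal space in each lookaround is written as [ ]. The name VERBOSE_PATTERN is made up for illustration.

```python
import re

VERBOSE_PATTERN = re.compile(
    r"""
    (?<=[ ])        # previous character is a space ([ ] survives VERBOSE mode)
    (?=             # begin positive lookahead
      .{0,3}[A-Z]   #   0-3 characters followed by a capital letter
    )               # end positive lookahead
    [A-Z\d]{2,4}    # 2-4 characters from the character class
    \d{4,8}         # 4-8 digits
    (?=[ ])         # next character is a space
    |               # or
    (?<=\t)         # previous character is a tab
    .*?             # zero or more characters, as few as possible
    (?=\t)          # next character is a tab
    """,
    re.VERBOSE,
)

match = VERBOSE_PATTERN.search("Convent Inc\tA123 4599\tN/A...")
print(match.group(0) if match else None)
```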