How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by-CodePudding

This question relates to PCRE regular expressions.

Part of my big dataset are address data like this:

12H MARKET ST. Canada
123 SW 4TH Street     USA
ONE HOUSE                       USA
1234 Quantity Dr          USA  
123 Quality Court          Canada 
1234 W HWY 56A                 USA   
12345 BERNARDO CNTR DRIVE      Canada  
12 VILLAGE PLAZA              USA  
1234 WEST SAND LAKE RD ?567    USA
1234 TELEGRAM BLVD SUITE D      USA  
1234-A SOUTHWEST FRWY          USA
123 CHURCH STREET              USA
123 S WASHINGTON               USA   
123 NW-SE BLVD                USA 
#                              USA
1234 E MAIN STREET USA

I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).

For the purpose of this question I just manually clipped the interesting part from the data for this example.

The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.

This regular expression

([a-zA-Z0-9?\-#.] \s)

already isolates the words making up what I am after, including one space after them. Unfortunately there are cases, where the country after the to-be-extracted street information is only separated by a single space from the country, like e.g. in the first and in the last example.

Since I want to capture the matching parts glued together, I place a sign behind my regular expression:

([a-zA-Z0-9?\-#.] \s)

but then in the two nasty cases with only one separating space before the country, the country is also caught!

Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:

([a-zA-Z0-9?\-#.] \s)(?!USA|Canada)

which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this. So I would add a plus sign to the group on the left:

([a-zA-Z0-9?\-#.] \s) (?!USA|Canada)

But then ST. and STREET and the Country, separated by only a single space, are caught again together with the country, which I want to exclude from my result!

How would you proceed in such a case?

If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.

I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.

CodePudding user response：

If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.

So you should be able to set the 'global' and 'multiline' options and then use the following regex:

^(.*)(?=\s (USA|Canada)\s*$)

Explanation:

^(.*) match all characters from start of line

(?=\s (USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.

That should give you a list with all addresses.