I'm trying to match monetary amounts. Here are all the possibilities:
$5.6 million
$4,1 million
$8,1M
$6.3M
$333,333
$2 million
$5 million
I already have this regex:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
See online demo.
But it does not match these:
$5.6 million
$4,1 million
$8,1M
$6.3M
Any help would be appreciated.
CodePudding user response:
You can use
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?
If you need to make sure you do not match an m that is part of another word:
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b
See the regex demo. Details:
(?i) - case insensitive option
\$ - a $ char
\d+ - one or more digits
(?:[.,]\d+)* - zero or more repetitions of . or , and then one or more digits
(?:\s+(?:thousand|[mb]illion)|m)? - an optional occurrence of
  \s+(?:thousand|[mb]illion) - one or more whitespaces and then thousand, million or billion
  | - or
  m - an m char
\b - a word boundary.
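As a quick check, here is a minimal Python sketch that runs the word-boundary version of the pattern against the amounts from the question. It assumes the + quantifiers that the details describe ("one or more digits", "one or more whitespaces"). AMOUNT_RE, SAMPLES and find_amounts are illustrative names, and re.IGNORECASE stands in for the inline (?i).

```python
import re

# Word-boundary version of the pattern; re.IGNORECASE replaces the inline (?i).
AMOUNT_RE = re.compile(
    r"\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b",
    re.IGNORECASE,
)

# The amounts listed in the question.
SAMPLES = [
    "$5.6 million", "$4,1 million", "$8,1M", "$6.3M",
    "$333,333", "$2 million", "$5 million",
]


def find_amounts(text):
    """Return every amount found in text, in order of appearance."""
    return [m.group(0) for m in AMOUNT_RE.finditer(text)]


if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, "->", find_amounts(sample))
```

With the trailing \b, an input such as "$5 millionaire" yields only "$5": the optional suffix is dropped rather than matched inside the longer word.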
CodePudding user response:
Let's look at your regular expression:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
\$\d{1,3} is fine. What follows? One way to answer that is to consider the following three possibilities.
The string to be matched ends with ' million'
This string (which begins with a space, in case you missed that) is preceded by an empty string or a single digit preceded by a comma or period:
(?:[,.]\d)? million
Evidently, "million" can also be "thousand" or "billion", and the first letter of "million" or "billion" might be capitalized, so we change the expression to
(?:[,.]\d)? (?:[MmBb]illion|thousand)
One potential problem is that this matches '$5.6 million' inside '$5.6 millionaire'. We can avoid that by tacking on a word boundary, which prevents the match from being followed by a word character:
(?:[,.]\d)? (?:[MmBb]illion|thousand)\b
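The effect of the word boundary can be checked with a short Python sketch. It assumes the fragment is prefixed with \$\d{1,3} as discussed at the start of this answer. PREFIX, SUFFIX and the other names are purely illustrative.

```python
import re

PREFIX = r"\$\d{1,3}"
SUFFIX = r"(?:[,.]\d)? (?:[MmBb]illion|thousand)"

# Same pattern without and with the trailing word boundary.
without_boundary = re.compile(PREFIX + SUFFIX)
with_boundary = re.compile(PREFIX + SUFFIX + r"\b")

text = "$5.6 millionaire"
loose = without_boundary.search(text)   # matches the "$5.6 million" prefix
strict = with_boundary.search(text)     # no match: "million" is followed by "a"

if __name__ == "__main__":
    print(loose.group(0) if loose else None)
    print(strict.group(0) if strict else None)
```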
The string ends with 'M'
In this case the 'M' must be preceded by a single digit preceded by a comma or period:
[,.]\dM\b
You could accept 'B' as well by changing M to [MB].
The string ends with three digits preceded by a comma
Here we need
,\d{3}\b
Here the word boundary avoids matching, for example, '$333,3333'. It will not match, however, '$333,333,333' or '$333,333,333,333'. If we want to match those we could change the expression to
(?:,\d{3})+\b
or to match '$333' as well, change it to
(?:,\d{3})*\b
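The difference between the two quantifiers can be seen in a small Python sketch, again assuming the \$\d{1,3} prefix. It takes the first variant to be one or more comma groups (+) and the second to be zero or more (*). ONE_OR_MORE, ZERO_OR_MORE and full are hypothetical names.

```python
import re

# One or more ",ddd" groups versus zero or more.
ONE_OR_MORE = re.compile(r"\$\d{1,3}(?:,\d{3})+\b")
ZERO_OR_MORE = re.compile(r"\$\d{1,3}(?:,\d{3})*\b")


def full(pattern, text):
    """True when the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["$333", "$333,333", "$333,333,333", "$333,3333"]:
        print(text, full(ONE_OR_MORE, text), full(ZERO_OR_MORE, text))
```

One caveat with the * form: a search (rather than a full match) over '$333,3333' still finds the leading '$333', because zero comma groups is then an acceptable match.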
Construct the alternation
We therefore can use the following regular expression.
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)\b|[,.]\dM\b|,\d{3}\b)
Factoring out the word boundary we obtain
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b
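A minimal Python sketch for trying the factored expression, with the word boundary written as \b, on the amounts from the question. FINAL_RE, SAMPLES and REJECTED are illustrative names, and the rejected inputs are the counterexamples mentioned in this answer.

```python
import re

FINAL_RE = re.compile(
    r"\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b"
)

# The amounts listed in the question.
SAMPLES = [
    "$5.6 million", "$4,1 million", "$8,1M", "$6.3M",
    "$333,333", "$2 million", "$5 million",
]

# Inputs the word boundary is meant to rule out.
REJECTED = ["$333,3333", "$5.6 millionaire"]

matched = [s for s in SAMPLES if FINAL_RE.fullmatch(s)]
rejected = [s for s in REJECTED if FINAL_RE.search(s) is None]

if __name__ == "__main__":
    print("matched: ", matched)
    print("rejected:", rejected)
```

Note that this form requires one of the three endings, so a bare '$333' is not matched.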
