plus-minus (±) sign in regex

Time:10-17

tl;dr:

The ± character is two bytes long in UTF-8, so without the /u modifier preg_match treats it as two separate one-byte characters. To have it matched as a single character, add the /u modifier to the pattern.

I need a regex pattern that validates UTC offsets. These are typically formatted as UTC+05:30 or UTC-01:00. It seemed simple enough to match as follows (being permissive about spaces):

^UTC[ ]?[+\-±][ ]?[01][0-9]:[034][05]$

[Note: I updated this pattern based on feedback from @Barmar]

There is a corner case in which the offset is written UTC±00:00. However, the plus-minus sign is throwing things off. Using PHP, for example:

echo preg_match("/^±$/","±");
echo preg_match("/^[±]$/","±");
echo preg_match("/^[\±]$/","±");

This prints 1 (a match) for the first call but 0 (no match) for the other two.

So my question is, does the ± require special handling in Regex? I can't find any reference to this symbol in the docs. Thx.

CodePudding user response:

Don't put the - between two characters inside []. In that position it creates a range (as in [0-9]) rather than matching the - character literally.

You should put the - at the beginning or end, or escape it.

^UTC[ ]?[+\-±][ ]?[01][0-9]:[034][05]$

Also, don't put | inside a [] character set; there it matches a literal | character. Alternation with | belongs outside character sets, typically inside ().
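
As a small illustration of both points, here is a sketch using made-up character classes (they are not taken from the question's original pattern). It assumes nothing beyond PHP's built-in preg_match, which returns 1 for a match and 0 for no match.

```php
<?php
// Between two characters, the hyphen defines a range.
// [+-9] spans '+' (0x2B) through '9' (0x39), so it also accepts ',' (0x2C).
$rangeMatchesComma = preg_match('/^[+-9]$/', ',');

// Escaped, or placed last, the hyphen is a literal character.
$escapedMatchesComma   = preg_match('/^[+\-9]$/', ',');
$escapedMatchesHyphen  = preg_match('/^[+\-9]$/', '-');
$trailingMatchesHyphen = preg_match('/^[+9-]$/', '-');

// Inside [], the pipe is a literal character, not alternation.
$pipeIsLiteral = preg_match('/^[a|b]$/', '|');

// Outside [], the pipe separates alternatives.
$pipeAlternates = preg_match('/^(a|b)$/', '|');

var_dump($rangeMatchesComma, $escapedMatchesComma, $escapedMatchesHyphen,
         $trailingMatchesHyphen, $pipeIsLiteral, $pipeAlternates);
```

The unescaped range silently accepts characters such as , . and /, which is why the hyphen needs to be escaped or moved to the edge of the class.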

CodePudding user response:

It looks like @Barmar probably solved the first issue you were having (matching the UTC string). However, to explain what you were seeing with:

preg_match("/^±$/","±"); // true
preg_match("/^[±]$/","±"); // false
preg_match("/^[\±]$/","±"); // false

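The ± character is two bytes long in UTF-8, so without the /u modifier preg_match interprets it as two one-byte characters. As a literal (/^±$/) that still works, because the pattern is simply those two bytes in sequence. Inside a character class, however, [±] means "either one of these two bytes", which consumes only a single byte of the subject, so the match fails at $. To match in the way you expect, you have to use the /u modifier. This tells preg_match to treat both the pattern and the subject as UTF-8, so ± is interpreted as a single character instead of two.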

preg_match("/^[±]$/u","±"); // true
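
To make the byte-level behaviour visible, here is a minimal sketch. It assumes PHP 7.0 or later for the "\u{00B1}" escape, which produces the UTF-8 bytes of ± regardless of how the source file is saved; the variable names are only illustrative.

```php
<?php
$pm = "\u{00B1}"; // ± encoded as UTF-8

$byteLength = strlen($pm);  // strlen counts bytes, not characters
$hexBytes   = bin2hex($pm); // the raw bytes in hex

// Without /u the class holds two separate bytes, so it matches one byte at a time.
$classMatchesLeadByte  = preg_match('/^[' . $pm . ']$/', "\xC2");
$classMatchesWholeChar = preg_match('/^[' . $pm . ']$/', $pm);
$classTwiceMatches     = preg_match('/^[' . $pm . ']{2}$/', $pm);

// As a literal, the two bytes are matched in sequence, so no /u is needed.
$literalMatches = preg_match('/^' . $pm . '$/', $pm);

// With /u, pattern and subject are decoded as UTF-8 and ± is one character.
$unicodeMatches = preg_match('/^[' . $pm . ']$/u', $pm);

var_dump($byteLength, $hexBytes, $classMatchesLeadByte, $classMatchesWholeChar,
         $classTwiceMatches, $literalMatches, $unicodeMatches);
```

The class matching the lone byte "\xC2" but not the full character is the same failure seen in the question's second and third calls.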

And to include an example that matches your UTC sample:

// with the /u modifier (works as expected)
preg_match("/^UTC[ ]?[+\-±][ ]?[01][0-9]:[034][05]$/u", "UTC±05:30"); // true

// without the /u modifier (does not match)
preg_match("/^UTC[ ]?[+\-±][ ]?[01][0-9]:[034][05]$/", "UTC±05:30"); // false
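
If the check is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the function name isUtcOffset is hypothetical, the sign class is assumed to accept +, - and ±, and the "\u{00B1}" escape requires PHP 7.0 or later.

```php
<?php
// Hypothetical helper; the name and signature are illustrative, not from the thread.
function isUtcOffset(string $value): bool
{
    // "\u{00B1}" is ± as UTF-8; the /u modifier makes PCRE treat it as one character.
    // "\\-" yields an escaped hyphen so it cannot form a range inside the class.
    $pattern = "/^UTC[ ]?[+\\-\u{00B1}][ ]?[01][0-9]:[034][05]$/u";

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $value) === 1;
}

var_dump(isUtcOffset("UTC+05:30"));
var_dump(isUtcOffset("UTC-01:00"));
var_dump(isUtcOffset("UTC\u{00B1}00:00"));
var_dump(isUtcOffset("UTC05:30")); // no sign
```

Comparing against 1 rather than relying on truthiness keeps an error result (false) from being confused with a non-match in calling code.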