tl;dr:
The ± character is two bytes long in UTF-8, so without the /u modifier preg_match interprets it as two characters. To match the way you expect, you have to use the /u modifier.
I need to produce a regex pattern that verifies UTC offsets. These are typically formatted as UTC 05:30 or UTC-01:00. It seemed simple enough to match as follows (being permissive for spaces):
^UTC[ ]?[ \-±][ ]?[01][0-9]:[034][05]$
[Note: I updated this pattern based on feedback from @Barmar]
There is an edge case in which the offset is written UTC±00:00. However, the plus-minus sign is throwing things off. Using PHP for example:
echo preg_match("/^±$/","±");
echo preg_match("/^[±]$/","±");
echo preg_match("/^[\±]$/","±");
This prints 1 (a match) for the first pattern but 0 (no match) for the other two.
So my question is, does the ± require special handling in Regex? I can't find any reference to this symbol in the docs. Thx.
CodePudding user response:
You mustn't put the - between two characters inside []; that creates a range (as when you write [0-9]) rather than matching the - character literally.
You should put the - at the beginning or end, or escape it.
^UTC[ ]?[ \-±][ ]?[01][0-9]:[034][05]$
Also, you don't put | inside [] character sets. That's used inside () to create alternative patterns.
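As a rough illustration of both points, here is a minimal sketch. The patterns are made up for the demonstration and are not taken from the question; they only show how - and | behave inside a character class.

```php
<?php
// Illustrative patterns only (assumptions for this demo, not from the question).
$range   = '/^[+-9]$/';   // unescaped "-" between two characters: range from "+" (0x2B) to "9" (0x39)
$escaped = '/^[+\-9]$/';  // escaped "-": matches only "+", "-", or "9"
$leading = '/^[-+9]$/';   // "-" placed first: also a literal "-"
$pipe    = '/^[a|b]$/';   // "|" inside [] is a literal pipe, not alternation

var_dump(preg_match($range, ','));    // int(1): "," (0x2C) falls inside the range
var_dump(preg_match($escaped, ','));  // int(0): "," is no longer accepted
var_dump(preg_match($escaped, '-'));  // int(1)
var_dump(preg_match($leading, '-'));  // int(1)
var_dump(preg_match($pipe, '|'));     // int(1): the pipe itself matches
```

The unescaped version quietly accepts characters you never listed, which is why escaping the - or moving it to the edge of the class is the safer habit.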
CodePudding user response:
It looks like @Barmar probably solved the first issue you were having (matching the UTC string). However, to explain what you were seeing with:
preg_match("/^±$/","±"); // true
preg_match("/^[±]$/","±"); // false
preg_match("/^[\±]$/","±"); // false
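The ± character is two bytes long in UTF-8 (0xC2 0xB1), so without the /u modifier preg_match interprets it as two characters. A character class such as [±] then matches just one of those two bytes, which is why the anchored class patterns fail while the plain /^±$/ (a two-byte sequence) succeeds. To match the way you expect, you have to use the /u modifier. This tells preg_match to treat the pattern and the subject as UTF-8, so ± is interpreted as a single character instead of two.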
preg_match("/^[±]$/u","±"); // true
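If you want to see the two bytes for yourself, a small sketch like the following may help. It writes ± as explicit bytes ("\xC2\xB1") so the result does not depend on how the source file is saved; the variable names are just for illustration.

```php
<?php
// U+00B1 (±) spelled out as its UTF-8 bytes; variable names are illustrative.
$pm = "\xC2\xB1";

var_dump(strlen($pm));   // int(2): strlen counts bytes
var_dump(bin2hex($pm));  // string(4) "c2b1"

// Count what "." sees with and without /u.
$unitsWithoutU = preg_match_all('/./', $pm);   // 2: each byte is its own "character"
$unitsWithU    = preg_match_all('/./u', $pm);  // 1: one UTF-8 code point

// Without /u the class holds two separate bytes and matches exactly one of them.
$oneUnit  = preg_match('/^[\xC2\xB1]$/', $pm);     // 0: the subject is two bytes long
$twoUnits = preg_match('/^[\xC2\xB1]{2}$/', $pm);  // 1: two bytes from the class in a row

// With /u the raw bytes in the pattern are decoded as the single character ±.
$withU = preg_match('/^[' . $pm . ']$/u', $pm);    // 1

var_dump($unitsWithoutU, $unitsWithU, $oneUnit, $twoUnits, $withU);
```

The {2} case is only there to show what the byte-mode class is really doing; it is not a fix, since it would also accept other two-byte combinations of 0xC2 and 0xB1.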
And to include an example that matches your UTC sample:
// with the /u modifier (works as expected)
preg_match("/^UTC[ ]?[ \-±][ ]?[01][0-9]:[034][05]$/u", "UTC±05:30"); // true
// without the /u modifier (does not match)
preg_match("/^UTC[ ]?[ \-±][ ]?[01][0-9]:[034][05]$/", "UTC±05:30"); // false
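For reuse, the check could be wrapped in a small helper along these lines. This is only a sketch: isUtcOffset is a hypothetical name, and the pattern is the one from this thread with ± written as \x{B1} (a PCRE code point escape that is valid with /u), so it does not depend on the file's encoding.

```php
<?php
// Hypothetical helper; the name and the sample list are assumptions for illustration.
function isUtcOffset(string $value): bool
{
    // Same pattern as above, with ± spelled as \x{B1}; /u makes PCRE read it as U+00B1.
    $pattern = '/^UTC[ ]?[ \-\x{B1}][ ]?[01][0-9]:[034][05]$/u';

    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (for example, a subject that is not valid UTF-8).
    return preg_match($pattern, $value) === 1;
}

$samples = ["UTC-01:00", "UTC 05:30", "UTC\u{B1}00:00", "UTC*05:30", "UTC-25:00"];
foreach ($samples as $sample) {
    echo $sample, ' => ', isUtcOffset($sample) ? 'match' : 'no match', "\n";
}
// The first three samples match; the last two do not.
```

Comparing against 1 rather than relying on truthiness keeps the error case (false) from being mistaken for an ordinary non-match elsewhere in your code.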
