I am trying to select only capital letters in polytonic Greek text using regex. The specific application is PHP, but I had trouble with it so I started playing around with it in RegExr:
([Α-ΩΗΙΟΥΩᾼῌῼΡΆΈΉΊΌΎΏᾺῈῊῚῸῪῺἈἘἨἸὈὨᾈᾘᾨἌἜἬἼὌὬᾌᾜᾬἊἚἪἺὊὪᾊᾚᾪἎἮἾὮᾎᾞᾮἉἙἩἹὉὙὩᾉᾙᾩῬἍἝἭἽὍὝὭᾍᾝᾭἋἛἫἻὋὛὫᾋᾛᾫἏἯἿὟὯᾏᾟᾯΪΫᾹῙῩᾸῘῨ])
When the JavaScript engine is selected, the behaviour is as expected. However, if I select PCRE, it matches not only the capital letters but also a bunch of seemingly random lowercase letters.
Can anyone shed some light on what is going on here? Is this a bug? Is there a way to get the desired result using the PCRE engine?
Answer:
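You need to tell the PCRE regex engine that the pattern and the input are to be parsed as Unicode strings. Otherwise PCRE works byte by byte, so the character class becomes a set of individual UTF-8 bytes, and any lowercase letter that shares one of those bytes gets matched too.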
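As a rough illustration of the byte-wise behaviour, here is a minimal PHP sketch. It assumes the source file is saved as UTF-8, and the one-character class and subject are made up for the demonstration rather than taken from the question.

```php
<?php
// 'Α' (U+0391) is the bytes CE 91 in UTF-8; 'α' (U+03B1) is CE B1.
$subject = 'α'; // a single lowercase alpha

// Without /u the class is effectively [\xCE\x91], so the lead byte CE
// of the lowercase alpha counts as a match.
$bytewise = preg_match_all('/[Α]/', $subject);

// With /u the class holds the single code point U+0391, which 'α' is not.
$unicode = preg_match_all('/[Α]/u', $subject);

var_dump($bytewise, $unicode); // int(1) and int(0)
```

The same effect, multiplied across the many characters in the long class, is what produces the "random" lowercase hits.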
In a PCRE regex, you can prepend the pattern with a (*UTF) verb. The pattern (*UTF)[Α-ΩΗΙΟΥΩᾼῌῼΡΆΈΉΊΌΎΏᾺῈῊῚῸῪῺἈἘἨἸὈὨᾈᾘᾨἌἜἬἼὌὬᾌᾜᾬἊἚἪἺὊὪᾊᾚᾪἎἮἾὮᾎᾞᾮἉἙἩἹὉὙὩᾉᾙᾩῬἍἝἭἽὍὝὭᾍᾝᾭἋἛἫἻὋὛὫᾋᾛᾫἏἯἿὟὯᾏᾟᾯΪΫᾹῙῩᾸῘῨ] highlights the correct matches.
However, you can also make it a bit shorter with
(*UTF)(?=\p{Lu})\p{Greek}
Here,
(*UTF) - a PCRE verb telling the PCRE engine the input is a Unicode string
(?=\p{Lu}) - a positive lookahead requiring the next char to be an uppercase letter
\p{Greek} - a Greek char.
Note: if your PCRE implementation supports the u flag, that is most probably the way to go (in PHP, for example, /(?=\p{Lu})\p{Greek}/u).
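For the PHP case, a minimal sketch of how the shorter pattern might be used with preg_match_all. The sample text and variable names are assumptions chosen for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Sample polytonic text plus a Latin reference (illustrative only).
$text = 'Ἐν ἀρχῇ ἦν ὁ Λόγος (John 1:1)';

// /u makes PCRE treat both the pattern and the subject as UTF-8.
// (?=\p{Lu}) requires an uppercase letter; \p{Greek} requires Greek script,
// so the Latin capital "J" is not matched.
preg_match_all('/(?=\p{Lu})\p{Greek}/u', $text, $matches);

print_r($matches[0]); // the two Greek capitals: Ἐ and Λ
```

Note that this matches any uppercase letter in the Greek script, which may be slightly broader than the hand-written character class in the question.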
