You can check the regex101 page from here.
I have a list of addresses in different formats, and they are not in English. Assume my list looks like this:
KENNEDY CAD. SİRKECİ ARABALI VAPUR İSKELESİ FATİH/ İSTANBUL
YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL
HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL
UĞUR MUMCU MAH. YUNUS EMRE CAD. NO:25 KARTAL/ İSTANBUL
The regex I've written is as follows:
(?:(?:\p{L}* M[Aa]?[Hh][. ])? *|(?:\p{L}* C[Aa]?[Dd][. ])? *)
My regex returns each character as a match, but I need to get 4 matches, which are:
KENNEDY CAD.
YAVUZTÜRK MAH. KARADENİZ CAD.
HAMİDİYE MAH.
UĞUR MUMCU MAH. YUNUS EMRE CAD.
How can I solve that problem?
CodePudding user response:
You can use
^\p{L}+(?:\s+\p{L}+)*\s+(?:M[Aa]?[Hh]|C[Aa]?[Dd])\.?(?:\s+\p{L}+(?:\s+\p{L}+)*\s+(?:M[Aa]?[Hh]|C[Aa]?[Dd]))*\.?
Details:
^ - start of string
\p{L}+(?:\s+\p{L}+)* - a word and then zero or more whitespace-separated words
\s+ - one or more whitespaces
(?:M[Aa]?[Hh]|C[Aa]?[Dd]) - M, an optional A or a and then h or H, or C, an optional A or a and then D or d
\.? - an optional dot
(?:\s+\p{L}+(?:\s+\p{L}+)*\s+(?:M[Aa]?[Hh]|C[Aa]?[Dd]))* - zero or more sequences of one or more whitespaces and the pattern described above
\.? - an optional dot
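As a rough check, the pattern above can be run against the four sample addresses from the question. This is only a sketch in Python: the built-in re module does not support \p{L}, so [^\W\d_] is used as a stand-in for "any letter" (an assumption, not part of the answer), and the names LETTER, ABBR, PREFIX_RE and street_prefix are made up for illustration.

```python
import re

# Assumption: the built-in re module has no \p{L}, so [^\W\d_] (a word
# character that is neither a digit nor an underscore) stands in for it.
LETTER = r"[^\W\d_]"
# M + optional A/a + H/h, or C + optional A/a + D/d, as in the answer.
ABBR = r"(?:M[Aa]?[Hh]|C[Aa]?[Dd])"

# Same structure as the pattern above, with \p{L} swapped for LETTER.
PREFIX_RE = re.compile(
    rf"^{LETTER}+(?:\s+{LETTER}+)*\s+{ABBR}\.?"
    rf"(?:\s+{LETTER}+(?:\s+{LETTER}+)*\s+{ABBR})*\.?"
)

ADDRESSES = [
    "KENNEDY CAD. SİRKECİ ARABALI VAPUR İSKELESİ FATİH/ İSTANBUL",
    "YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL",
    "HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL",
    "UĞUR MUMCU MAH. YUNUS EMRE CAD. NO:25 KARTAL/ İSTANBUL",
]


def street_prefix(address):
    """Return the leading MAH./CAD. part of an address, or None if absent."""
    match = PREFIX_RE.search(address)
    return match.group(0) if match else None


for address in ADDRESSES:
    print(street_prefix(address))
```

With the third-party regex module, or in an engine such as PCRE or Java, \p{L} can be used directly instead of the stand-in.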
See the regex demo. Or, a bit less precise and efficient, but shorter:
^(?:\s*[\p{L}\s]+(?:M[Aa]?[Hh]|C[Aa]?[Dd])\.?)+
See this regex demo. Details:
^ - start of string
(?:\s*[\p{L}\s]+(?:M[Aa]?[Hh]|C[Aa]?[Dd])\.?)+ - one or more sequences of
  \s* - zero or more whitespaces
  [\p{L}\s]+ - one or more letters or whitespaces
  (?:M[Aa]?[Hh]|C[Aa]?[Dd]) - M, an optional A or a and then h or H, or C, an optional A or a and then D or d
  \.? - an optional dot
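The shorter pattern can be tried the same way. Again this is a sketch under assumptions: Python's built-in re module lacks \p{L}, so the class [\p{L}\s] is approximated as (?:[^\W\d_]|\s), and SHORT_RE and short_prefix are illustrative names only.

```python
import re

# Assumption: [\p{L}\s] is approximated as "letter or whitespace", where
# [^\W\d_] stands in for \p{L} because the built-in re module lacks it.
LETTER_OR_SPACE = r"(?:[^\W\d_]|\s)"
ABBR = r"(?:M[Aa]?[Hh]|C[Aa]?[Dd])"

# One or more "words ... MAH./CAD." sequences anchored at the start.
SHORT_RE = re.compile(rf"^(?:\s*{LETTER_OR_SPACE}+{ABBR}\.?)+")


def short_prefix(address):
    """Return the leading MAH./CAD. part using the shorter pattern, or None."""
    match = SHORT_RE.search(address)
    return match.group(0) if match else None


print(short_prefix("YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL"))
print(short_prefix("HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL"))
```

Because nothing forces whitespace before the abbreviation here, this version can also stop inside a longer word that happens to contain MAH, MH, CAD or CD, which is one way it is less precise than the first pattern.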
CodePudding user response:
Try (regex101):
^(?=.*C[Aa][Dd]\s*\.).*?C[Aa][Dd]\.|^.*?M[Aa][Hh]\s*\.
This matches everything from the start of the string up to CAD. or, if CAD. is not found, up to MAH.
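A quick way to try this pattern is the Python sketch below. The pattern is copied unchanged, since it uses no \p{L}; the names CAD_FIRST_RE and prefix_until_cad_or_mah are assumptions made for the example.

```python
import re

# The lookahead checks that "CAD." occurs somewhere in the string; if so,
# the lazy .*? stops at the first "CAD.". Otherwise the second alternative
# matches lazily up to the first "MAH.".
CAD_FIRST_RE = re.compile(
    r"^(?=.*C[Aa][Dd]\s*\.).*?C[Aa][Dd]\.|^.*?M[Aa][Hh]\s*\."
)


def prefix_until_cad_or_mah(address):
    """Return the text up to CAD. (preferred) or MAH., or None if neither occurs."""
    match = CAD_FIRST_RE.search(address)
    return match.group(0) if match else None


print(prefix_until_cad_or_mah("UĞUR MUMCU MAH. YUNUS EMRE CAD. NO:25 KARTAL/ İSTANBUL"))
print(prefix_until_cad_or_mah("HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL"))
```

Unlike the previous answer's patterns, this one does not accept the MH or CD spellings and requires the trailing dot.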
