I need to convert the regex below, that works fine in Oracle, to HIVE. The regex Pattern Removes all punctuation except ('-' and '&' apostrophe, and spaces).
select
REGEXP_REPLACE (
TRANSLATE(
REGEXP_REPLACE( 'UNI''C...ER - BÉBI,,, DAS DE PORTUGAL, SGPS, S.A.', '[^[:alnum:] &-]', '' ),
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü', 'ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'
),
' {2,}', ' ') SSS
from dual;
The problem is with the outputs:
- Oracle Output : "UNICER - BEBIDAS DE PORTUGAL SGPS SA"
- HIVE Output : "UNIC...ER - BEBI,,, DAS DE PORTUGAL, SGPS, S.A." (keeps '.' and ',')
I think that the problem resides on [[:alnum:]] POSIX Character Class interpretation.
Many thank's in advance.
CodePudding user response:
You can use
'[^\\p{L}\\p{N} &-]'
This regex matches any single character that is not ([^...] is a negated character class)
\p{L}- any Unicode letter\p{N}- any Unicode digit- a space&- a&char-- a-char.
