I am looking for a regular expressions pattern which will remove articles(a, an, the), special chars(;,:,% etc) and expand abbreviation(inc.-> 'incorporation', & -> 'and' etc) in snowflake. I am able to do this in snowflake but it not completely correct. Below is my code. The issue is that i want to give pattern (for example output of 'a good book' should be 'good book' but string 'give a book' should remain as
'''
select REGEXP_REPLACE((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE (
(select REGEXP_REPLACE(concat (' ', lower('a book of the great man'), ' '), '(^an )|(^the )|
(^a )'))
, '\\.|\\,|\\(|\\)|\\!|\\\\|/|£|\\$|%|\\^|\\*|-|\\ |=|_|{|}|\\[|\\]|#|~|;|:|''|`|@|<|>|\\?|
¬|\\|')
), ' & ', ' and ')
), ' ltd ', ' limited ')
), '', '')
'''
CodePudding user response:
Instead of using REGEXP_REPLACE, I suggest you write a UDF (JavaScript or Java), and use regexp of JavaScript (or java). It will be much cleaner and maintainable.
CREATE OR REPLACE FUNCTION transform_text_2 (STR VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS $$
var abbreviations = { "inc.": "incorporation", "&": "and" };
// remove articles from the beginning
var Result = STR.replace( /\b(an?|the)\b /i, "" );
// convert abbreviations
for (var abv in abbreviations) Result = Result.replace( abv, abbreviations[abv] );
// remove the special characters
Result = Result.replace( /[^A-Za-z0-9 ]/g, "" );
return (Result);
$$
;

