Home > Enterprise >  Regex_replace in snowflake using pattern
Regex_replace in snowflake using pattern

Time:01-08

I am looking for a regular expressions pattern which will remove articles(a, an, the), special chars(;,:,% etc) and expand abbreviation(inc.-> 'incorporation', & -> 'and' etc) in snowflake. I am able to do this in snowflake but it not completely correct. Below is my code. The issue is that i want to give pattern (for example output of 'a good book' should be 'good book' but string 'give a book' should remain as

'''
select REGEXP_REPLACE((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((
select REGEXP_REPLACE ((


select REGEXP_REPLACE (

  
  (select REGEXP_REPLACE(concat (' ', lower('a book of the great man'), ' '), '(^an )|(^the )| 
  (^a )'))
  , '\\.|\\,|\\(|\\)|\\!|\\\\|/|£|\\$|%|\\^|\\*|-|\\ |=|_|{|}|\\[|\\]|#|~|;|:|''|`|@|<|>|\\?| 
 ¬|\\|')

  ), ' & ', ' and ')
  ), ' ltd ', ' limited ')

  ), '', '')
  '''

CodePudding user response:

Instead of using REGEXP_REPLACE, I suggest you write a UDF (JavaScript or Java), and use regexp of JavaScript (or java). It will be much cleaner and maintainable.

enter image description here

CREATE OR REPLACE FUNCTION transform_text_2 (STR VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS $$
  var abbreviations = { "inc.": "incorporation", "&": "and" };

  // remove articles from the beginning
  var Result = STR.replace( /\b(an?|the)\b /i, "" );

  // convert abbreviations
  for (var abv in abbreviations) Result = Result.replace( abv, abbreviations[abv] );

  // remove the special characters
  Result = Result.replace( /[^A-Za-z0-9 ]/g, "" );


  return (Result);
$$
;
  •  Tags:  
  • Related