Curious thing. Using Databricks, but getting two different behaviors when using the replace_regex functionality.
%sql
select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d [)]', '')) as result
returns my expectation: TEST (PA)
Yet:
%python
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d [)]', '')) as result"""))
returns: TEST (PA) (1234)
Should these not be equivalent in the two different contexts (spark sql, and spark / python / pyspark)? Why doesn't the second context work?
CodePudding user response:
You don't take into account character escaping rules. In Python your \\ is converted into a single \, so your regex changes to [(]\d [)]. You have choice:
use raw string
r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d [)]', '')) as result"""double backslashes:
[(]\\\\d [)]
with any of these changes, it produces correct result
