PL SQL Remove HTML Encoding-CodePudding

I know you can remove HTML tags with a command such as this:

REGEXP_REPLACE(overview, '<. ?>')

But, some of the text has actual HTML encoding, where the application actually encoded things, like single quotes to be: &#39 or ’

I'm assuming these are pretty standard. Is there a way to remove them and replace them with the actual character, or am I stuck with REPLACE and listing them?

Many thanks!

CodePudding user response：

You can use utl_i18n.unescape_references():

utl_i18n.unescape_reference(regexp_replace(overview, '<. ?>'))

As a demo:

-- sample data
with t (overview) as (
  select '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>'
  from dual
)
select REGEXP_REPLACE(overview, '<. ?>') as result1,
  utl_i18n.unescape_reference(regexp_replace(overview, '<. ?>')) as result2
from t

gets

RESULT1	RESULT2
Some entities: & ' < > to be handled	Some entities: & ' < > to be handled

db<>fiddle

_{I'm not endorsing (or attacking) the notion of using regular expressions; that's handled and refuted and discussed elsewhere. I'm just addressing the part about encoded entities.}

CodePudding user response：

Use a proper XML parser:

with t (overview) as (
  SELECT '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>' from dual UNION ALL
  SELECT '<html><head><title>Test</title></head><body><p>&lt;test&gt;</p></body></html>' from dual
)
SELECT x.*
FROM   t
       CROSS JOIN LATERAL (
         SELECT LISTAGG(value) WITHIN GROUP (ORDER BY ROWNUM) AS text
         FROM   XMLTABLE(
                  '//*'
                  PASSING XMLTYPE(t.overview)
                  COLUMNS
                    value CLOB PATH './text()'
                )
      ) x

Which outputs:

TEXT

Some entities: & ' < > to be handled

Test<test>

TEXT
Some entities: & ' < > to be handled
Test<test>

db<>fiddle here