I know you can remove HTML tags with a command such as this:
REGEXP_REPLACE(overview, '<. ?>')
But, some of the text has actual HTML encoding, where the application actually encoded things, like single quotes to be: ' or ’
I'm assuming these are pretty standard. Is there a way to remove them and replace them with the actual character, or am I stuck with REPLACE and listing them?
Many thanks!
CodePudding user response:
You can use utl_i18n.unescape_references():
utl_i18n.unescape_reference(regexp_replace(overview, '<. ?>'))
As a demo:
-- sample data
with t (overview) as (
select '<div><p>Some entities: & ' < > to be handled </p></div>'
from dual
)
select REGEXP_REPLACE(overview, '<. ?>') as result1,
utl_i18n.unescape_reference(regexp_replace(overview, '<. ?>')) as result2
from t
gets
| RESULT1 | RESULT2 |
|---|---|
| Some entities: & ' < > to be handled | Some entities: & ' < > to be handled |
I'm not endorsing (or attacking) the notion of using regular expressions; that's handled and refuted and discussed elsewhere. I'm just addressing the part about encoded entities.
CodePudding user response:
Use a proper XML parser:
with t (overview) as (
SELECT '<div><p>Some entities: & ' < > to be handled </p></div>' from dual UNION ALL
SELECT '<html><head><title>Test</title></head><body><p><test></p></body></html>' from dual
)
SELECT x.*
FROM t
CROSS JOIN LATERAL (
SELECT LISTAGG(value) WITHIN GROUP (ORDER BY ROWNUM) AS text
FROM XMLTABLE(
'//*'
PASSING XMLTYPE(t.overview)
COLUMNS
value CLOB PATH './text()'
)
) x
Which outputs:
TEXT Some entities: & ' < > to be handled Test<test>
db<>fiddle here
