Improve performance of NOT EXISTS in case of large tables-CodePudding

What I am trying to accomplish is getting rows from one table that do not match another table based on specific filters. The two tables are relatively huge so I am trying to filter them based on a certain time range.

The steps I went through so far.

Get the IDs from "T1" for the last 3 days

SELECT
id 
FROM T1
WHERE STARTTIME BETWEEN '3 days ago' AND 'now';

Execution time is 4.5s.

Get the IDs from "T2" for the last 3 days

SELECT
id 
FROM T2
WHERE STARTTIME BETWEEN '3 days ago' AND 'now';

Execution time is 2.5s.

Now I try to use NOT EXISTS to merge the results from both statements into one

SELECT
CID
FROM T1
WHERE STARTTIME BETWEEN '3 days ago' AND 'now'
AND NOT EXISTS (
  SELECT NULL FROM T2
  WHERE T1.ID = T2.ID 
  AND STARTTIME BETWEEN '3 days ago' AND 'now'
);

Execution time is 23s.

I also tried the INNER JOIN logic from this answer thinking it makes sense, but I get no results so I cannot properly evaluate.

Is there a better way to construct this statement that could possibly lead to a faster execution time?

CodePudding user response：

Is there a better way to construct this statement that could possibility lead to a faster execution time?

Your basic responsibility is to write the SQL staement, the basic responsibility of Oracle is to come with an execution plan

If you are not satified (but you should know that a combination of two sources using NOT EXISTS will take longer that the sum of the time to extract the data from the sources) your fist step should be to verify the execution plan (and not try to rewrite the statement).

See some more details how to proceede here

EXPLAIN PLAN  SET STATEMENT_ID = 'stmt1' into   plan_table  FOR
SELECT
PAD
FROM T1
WHERE STARTTIME BETWEEN date'2021-01-11' AND date'2021-01-13'
AND NOT EXISTS (
  SELECT NULL FROM T2
  WHERE T1.ID = T2.ID 
  AND STARTTIME BETWEEN date'2021-01-11' AND date'2021-01-13'
);

SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'stmt1','ALL'));

This is what you should see

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  1999 |   150K| 10175   (1)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI|      |  1999 |   150K| 10175   (1)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T2   |  2002 | 26026 |  4586   (1)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL  | T1   |  4002 |   250K|  5589   (1)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("T1"."ID"="T2"."ID")
   2 - filter("STARTTIME"<=TO_DATE(' 2021-01-13 00:00:00', 'syyyy-mm-dd 
              hh24:mi:ss') AND "STARTTIME">=TO_DATE(' 2021-01-11 00:00:00', 
              'syyyy-mm-dd hh24:mi:ss'))
   3 - filter("STARTTIME"<=TO_DATE(' 2021-01-13 00:00:00', 'syyyy-mm-dd 
              hh24:mi:ss') AND "STARTTIME">=TO_DATE(' 2021-01-11 00:00:00', 
              'syyyy-mm-dd hh24:mi:ss'))

Note that the hash join (here anti due to the not exists) is the best way to join two large row sources. Note also that the plan does not use indexes. The reason is the same - to access large data you do not want to go over index.

Contrary to the case of low cardinality row sources (OTPL) where you expects to see index access and NESTED LOOPS ANTI.

Some times is Oracle confused (e.g. while seeing stale statistics) and decide to go the NESTED LOOPway even for large data - which leads to long elapsed time.

This should help you at least to decide if you have a problem or not.

CodePudding user response：

Perhaps a simple MINUS operation will accomplish what you are looking for:

    select id 
      from ( select id 
               from t1
              where starttime between '3 days ago' and 'now' 
             MINUS
             select id 
               from t2
              where starttime between '3 days ago' and 'now' 
           );

for however you actually define starttime between '3 days ago' and 'now'. This literally uses your current queries as is the MINUS operation removes from the first those values which do exist in the second and returns the result. See MINUS demo here.