|
|
Archives of the TeradataForum
Message Posted: Wed, 02 Jun 2004 @ 07:04:24 GMT
Subj: | | Re: Exclusion Joins |
|
From: | | Dieter Noeth |
Christie, Jon wrote:
| NOT EXISTS is much better than NOT IN when there are multiple terms. With NOT IN, multiple terms, and nullable columns, you get very
strange explains with verbage about skipping steps. Basically, the amps decide at execution time which of too similar plans to use. There's a
reason for this that actually does make sense, but it's a rather long story. | |
The reason is simple: those damn NULLs ;-)
explain
select * from x
where i not in (select j from y);
Explanation -------------------------------------------------- | |
| 1) | First, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.y.
| |
| 2) | Next, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.x.
| |
| 3) | We lock TERADATA_EDUCATION.y for read, and we lock TERADATA_EDUCATION.x for read.
| |
| 4) | We do an all-AMPs SUM step to aggregate from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions. Aggregate
Intermediate Results are computed globally, then placed in Spool 4.
| |
| 5) | We do an all-AMPs SUM step to aggregate from TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. Aggregate
Intermediate Results are computed globally, then placed in Spool 6.
| |
| 6) | We execute the following steps in parallel.
| |
| |
| 1) | We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of an all-rows scan into Spool 2 (all_amps), which is duplicated on all
AMPs.
| | |
| |
| 2) | We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is duplicated on all
AMPs.
| | |
| 7) | We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to
TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an
exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j"), and null value information in Spool 3 and
Spool 2. Skip this join step if null exists. The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1
is estimated with index join confidence to be 2 rows. The estimated time for this step is 0.92 seconds.
| |
| 8) | We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to
TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an
exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j"), and null value information in Spool 3 (Last
Use) and Spool 2 (Last Use). Skip this join step if there is no null. The result goes into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with index join confidence to be 2 rows. The estimated time for this step is 0.92 seconds.
| |
| 9) | Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
| |
| -> | The contents of Spool 1 are sent back to the user as the result of statement 1.
| |
But if all columns are NOT NULL:
Explanation -------------------------------------------------- | |
| 1) | First, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.y.
| |
| 2) | Next, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.x.
| |
| 3) | We lock TERADATA_EDUCATION.y for read, and we lock TERADATA_EDUCATION.x for read.
| |
| 4) | We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to
TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an
exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j"). The result goes into Spool 1 (group_amps),
which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 2 rows. The estimated time for this step
is 0.58 seconds.
| |
| 5) | Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
| |
| -> | The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.58 seconds.
| |
And this is almost the same as NOT EXISTS, the only difference is "where unknown comparison will be ignored"
| | | |
| 4) | We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to
TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an
exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j") where unknown comparison will be ignored. The
result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 2
rows. The estimated time for this step is 0.58 seconds.
| |
In my trainings i always warn about using NOT IN on NULLable columns: Imagine selecting all customers with unpayed bills to write
reminders. Now a NULL is inserted and returned by the subquery -> the answer set will be empty forever.
First you're glad about your customers paying timely, but some months later you'll go bancrupt ;-)
| The equivalent NOT EXISTS, though less intuitive to read, suffers from none of these problems. | |
Dieter
| |