Archives of the TeradataForum

Message Posted: Wed, 02 Jun 2004 @ 07:04:24 GMT


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Subj:		Re: Exclusion Joins

From:		Dieter Noeth

Christie, Jon wrote:

NOT EXISTS is much better than NOT IN when there are multiple terms. With NOT IN, multiple terms, and nullable columns, you get very strange explains with verbage about skipping steps. Basically, the amps decide at execution time which of too similar plans to use. There's a reason for this that actually does make sense, but it's a rather long story.

The reason is simple: those damn NULLs ;-)

     explain
     select * from x
     where i not in (select j from y);

Explanation --------------------------------------------------

1) First, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.y.

2) Next, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.x.

3) We lock TERADATA_EDUCATION.y for read, and we lock TERADATA_EDUCATION.x for read.

4) We do an all-AMPs SUM step to aggregate from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions. Aggregate Intermediate Results are computed globally, then placed in Spool 4.

5) We do an all-AMPs SUM step to aggregate from TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. Aggregate Intermediate Results are computed globally, then placed in Spool 6.

6) We execute the following steps in parallel.

1) We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of an all-rows scan into Spool 2 (all_amps), which is duplicated on all AMPs.

2) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is duplicated on all AMPs.

7) We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j"), and null value information in Spool 3 and Spool 2. Skip this join step if null exists. The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 2 rows. The estimated time for this step is 0.92 seconds.

8) We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j"), and null value information in Spool 3 (Last Use) and Spool 2 (Last Use). Skip this join step if there is no null. The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 2 rows. The estimated time for this step is 0.92 seconds.

9) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of statement 1.

But if all columns are NOT NULL:

`Explanation --------------------------------------------------`
	`1)`	`First, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.y.`
	`2)`	`Next, we lock a distinct TERADATA_EDUCATION."pseudo table" for read on a RowHash to prevent global deadlock for TERADATA_EDUCATION.x.`
	`3)`	`We lock TERADATA_EDUCATION.y for read, and we lock TERADATA_EDUCATION.x for read.`
	`4)`	We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 2 rows. The estimated time for this step is 0.58 seconds.
	`5)`	`Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.`
	`->`	`The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.58 seconds.`

And this is almost the same as NOT EXISTS, the only difference is "where unknown comparison will be ignored"


	`4)`	We do an all-AMPs JOIN step from TERADATA_EDUCATION.x by way of an all-rows scan with no residual conditions, which is joined to TERADATA_EDUCATION.y by way of an all-rows scan with no residual conditions. TERADATA_EDUCATION.x and TERADATA_EDUCATION.y are joined using an exclusion merge join, with a join condition of ("TERADATA_EDUCATION.x.i = TERADATA_EDUCATION.y.j") where unknown comparison will be ignored. The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 2 rows. The estimated time for this step is 0.58 seconds.

In my trainings i always warn about using NOT IN on NULLable columns: Imagine selecting all customers with unpayed bills to write reminders. Now a NULL is inserted and returned by the subquery -> the answer set will be empty forever.

First you're glad about your customers paying timely, but some months later you'll go bancrupt ;-)

The equivalent NOT EXISTS, though less intuitive to read, suffers from none of these problems.

Dieter


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Attachments

Library

Quick Reference