|
|
Archives of the TeradataForum
Message Posted: Tue, 21 Jun 2011 @ 13:54:41 GMT
Subj: | | Re: Distinct vs. group by in TD 13 onwards |
|
From: | | Dieter Noeth |
Todd Walter wrote:
| 100% is the intent. By the time it gets to plan generation there is only one representation internally so the decisions are all made from
that one form. Then Opt decides which algorithm to use based on what it knows about demographics, especially information about how unique the
source relation already is. | |
Hi Todd,
i'm curious why the optimizer seems to do it different for subqueries.
A DISTINCT was automatically added in older releases and this was a good opportunity for GROUP BY. Now i often see it changed back to DISTINCT
optimization.
Dieter
TD13.00.00.22, 2 AMP demo
trans: 3.000.000 rows, 31 distinct trans_dates, stats collected
GROUP BY/DISTINCT: same plan, as expected
SELECT DISTINCT trans_date FROM au.trans
Explanation -------------------------------------------------- | |
| 3) | We do an all-AMPs SUM step to aggregate from au.trans by way of an all-rows scan with no residual conditions , grouping by field1
(au.trans.TRANS_DATE). Aggregate Intermediate Results are computed globally, then placed in Spool 1. The input table will not be cached in
memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with low confidence to be 31 rows (651 bytes). The
estimated time for this step is 8.51 seconds.
| |
GROUP BY/DISTINCT: same plan, but a bad one.
Seems to be due to the cost estimation for the trans_date: wrong "high confidence to be 3,000,000 rows" in step 4, but correct "low confidence
to be 31 rows" after the join.
SELECT * FROM sys_calendar.CALENDAR AS c WHERE calendar_date IN
(
SELECT trans_date FROM au.trans GROUP BY 1
)
Explanation -------------------------------------------------- | |
| 4) | We do an all-AMPs RETRIEVE step from au.trans by way of an all-rows scan with no residual conditions into Spool 5 (all_amps), which is
redistributed by the hash code of (au.trans.TRANS_DATE) to all AMPs. Then we do a SORT to order Spool 5 by row hash and the sort key in spool
field1 eliminating duplicate rows. The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool
file will not be cached in memory. The size of Spool 5 is estimated with high confidence to be 3,000,000 rows (74,793,250 bytes). The estimated
time for this step is 27.43 seconds.
| |
| 5) | We do an all-AMPs JOIN step from SYS_CALENDAR.CALDATES in view CALENDAR by way of an all-rows scan with no residual conditions, which is
joined to Spool 5 (Last Use) by way of an all-rows scan. SYS_CALENDAR.CALDATES and Spool 5 are joined using an inclusion merge join, with a join
condition of ("SYS_CALENDAR.CALDATES.cdate = TRANS_DATE"). The result goes into Spool 4 (group_amps), which is built locally on the AMPs. The
size of Spool 4 is estimated with low confidence to be 31 rows (2,697 bytes). The estimated time for this step is 0.11 seconds.
| |
| |