 

Archives of the TeradataForum

Message Posted: Tue, 21 Jun 2011 @ 13:54:41 GMT


     


Subj:   Re: Distinct vs. group by in TD 13 onwards
 
From:   Dieter Noeth

Todd Walter wrote:

  100% is the intent. By the time it gets to plan generation there is only one representation internally so the decisions are all made from that one form. Then Opt decides which algorithm to use based on what it knows about demographics, especially information about how unique the source relation already is.  


Hi Todd,

I'm curious why the optimizer seems to do it differently for subqueries.

A DISTINCT was automatically added in older releases, and this was a good opportunity for a GROUP BY. Now I often see it changed back to the DISTINCT optimization.
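For context: because an IN-subquery only feeds a membership test, duplicates in it don't matter, so all three of the following forms are logically equivalent (a minimal sketch, table and column names are just placeholders):

     SELECT * FROM t1 WHERE c1 IN (SELECT c2 FROM t2)
     SELECT * FROM t1 WHERE c1 IN (SELECT DISTINCT c2 FROM t2)
     SELECT * FROM t1 WHERE c1 IN (SELECT c2 FROM t2 GROUP BY c2)

The question is only which duplicate-elimination strategy (sort-based DISTINCT vs. aggregate-based GROUP BY) the optimizer chooses when it adds that step itself.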


Dieter


TD13.00.00.22, 2 AMP demo
trans: 3,000,000 rows, 31 distinct trans_dates, stats collected
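("stats collected" means something like the following was run beforehand; the exact statement is an assumption:)

     COLLECT STATISTICS ON au.trans COLUMN (trans_date)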

GROUP BY/DISTINCT: same plan, as expected

     SELECT DISTINCT trans_date FROM au.trans
Explanation
--------------------------------------------------
 
  3) We do an all-AMPs SUM step to aggregate from au.trans by way of an all-rows scan with no residual conditions, grouping by field1 (au.trans.TRANS_DATE). Aggregate Intermediate Results are computed globally, then placed in Spool 1. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with low confidence to be 31 rows (651 bytes). The estimated time for this step is 8.51 seconds.
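For reference, the GROUP BY form that produced the identical plan:

     SELECT trans_date FROM au.trans GROUP BY 1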

GROUP BY/DISTINCT: same plan, but a bad one.

This seems to be due to the row estimate for trans_date: a wrong "high confidence to be 3,000,000 rows" in step 4, but a correct "low confidence to be 31 rows" after the join in step 5.

     SELECT * FROM sys_calendar.CALENDAR AS c WHERE calendar_date IN
       (
        SELECT trans_date FROM au.trans GROUP BY 1
       )
Explanation
--------------------------------------------------
 
  4) We do an all-AMPs RETRIEVE step from au.trans by way of an all-rows scan with no residual conditions into Spool 5 (all_amps), which is redistributed by the hash code of (au.trans.TRANS_DATE) to all AMPs. Then we do a SORT to order Spool 5 by row hash and the sort key in spool field1 eliminating duplicate rows. The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 5 is estimated with high confidence to be 3,000,000 rows (74,793,250 bytes). The estimated time for this step is 27.43 seconds.
  5) We do an all-AMPs JOIN step from SYS_CALENDAR.CALDATES in view CALENDAR by way of an all-rows scan with no residual conditions, which is joined to Spool 5 (Last Use) by way of an all-rows scan. SYS_CALENDAR.CALDATES and Spool 5 are joined using an inclusion merge join, with a join condition of ("SYS_CALENDAR.CALDATES.cdate = TRANS_DATE"). The result goes into Spool 4 (group_amps), which is built locally on the AMPs. The size of Spool 4 is estimated with low confidence to be 31 rows (2,697 bytes). The estimated time for this step is 0.11 seconds.
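For completeness, the DISTINCT form of the same query (the other half of the GROUP BY/DISTINCT comparison above), which ends up with this same sort-based plan:

     SELECT * FROM sys_calendar.CALENDAR AS c WHERE calendar_date IN
       (
        SELECT DISTINCT trans_date FROM au.trans
       )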



     