 

Archives of the TeradataForum

Message Posted: Sun, 02 Mar 2003 @ 09:13:22 GMT


     

Subj:   Re: Performance for Distinct values
 
From:   Dieter Nöth

Anomy Anom wrote:

  This question is related to the performance between two different methods. I need to create a Dimension table from a Fact table. I will be loading all distinct values of one column into a table. I expect to get maybe 10 rows into the target table from a source table of 100,000 rows.  


Don't think too much about it, if it's such a small table ;-)


  I have different methods and am considering:  


  1. Do a GROUP BY or Distinct on the source of a INSERT/SELECT. This of course will require the sorting and redistribution and aggregation (ARSA).  


Because Teradata optimizes GROUP BY and DISTINCT totally differently, there's a rule of thumb: if the number of rows/value is large, GROUP BY is faster, because duplicates are eliminated locally *before* redistribution. If the number of rows/value is small, DISTINCT is faster.
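For a 10-row dimension like yours, either form should do; a sketch of both (table and column names are placeholders, not from the original post):

```sql
-- Method 1, DISTINCT flavour: redistribute, sort, eliminate duplicates.
INSERT INTO dim_table (col1)
SELECT DISTINCT col1
FROM fact_table;

-- Method 1, GROUP BY flavour: aggregate locally first, then redistribute.
INSERT INTO dim_table (col1)
SELECT col1
FROM fact_table
GROUP BY col1;
```

Both load exactly the same set of values; only the execution plan differs.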

e.g. 10,000,000 rows, 100 AMPs

50 values, 200,000 rows/value
DISTINCT: 10,000,000 rows redistributed to 50 AMPs (large spool), then
sorted eliminating duplicate rows.
GROUP BY: 100,000 rows/AMP locally aggregated and reduced to 50
rows/AMP, 50*100 = 5000 rows redistributed to 50 AMPs (small spool) and
aggregated again.

500,000 values, 20 rows/value
DISTINCT: 10,000,000 rows redistributed to 100 AMPs (large spool), then
sorted eliminating duplicate rows.
GROUP BY: 100,000 rows/AMP locally aggregated, but there's probably only
1 row/value on each AMP, so it's almost useless. 100,000*100 =
10,000,000 rows redistributed to all AMPs (large spool) and aggregated
again.

10,000 values, 1,000 rows/value
DISTINCT: 10,000,000 rows redistributed to 100 AMPs (large spool), then
sorted eliminating duplicate rows.
GROUP BY: 100,000 rows/AMP locally aggregated and reduced to 10,000
rows/AMP, but aggregate cache (~100KB) will overflow and there will be a
second local aggregation, 10,000*100 = 1,000,000 rows redistributed to
all AMPs (medium spool) and aggregated again.

This assumes an even distribution; if the data is skewed, the results may differ...
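You can check which strategy the optimizer actually chose by comparing the two EXPLAIN outputs (placeholder names again; the exact phrasing varies by release):

```sql
-- DISTINCT plans typically show a redistribute plus a sort step
-- that mentions eliminating duplicate rows:
EXPLAIN SELECT DISTINCT col1 FROM fact_table;

-- GROUP BY plans typically show an aggregate (SUM) step computing
-- aggregate intermediate results locally, then globally:
EXPLAIN SELECT col1 FROM fact_table GROUP BY col1;
```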

Btw, if you use a subquery, there's always a DISTINCT applied by default, but you can specify GROUP BY to override it.
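For example (same placeholder tables as above), the IN subquery below gets an implicit DISTINCT; writing GROUP BY in the subquery overrides that default:

```sql
-- Implicit DISTINCT on the subquery result:
SELECT d.col1
FROM dim_table d
WHERE d.col1 IN (SELECT col1 FROM fact_table);

-- Same result, but the GROUP BY overrides the default DISTINCT,
-- so the aggregate plan is used instead:
SELECT d.col1
FROM dim_table d
WHERE d.col1 IN (SELECT col1 FROM fact_table GROUP BY col1);
```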


  2. Make the target table a SET table, do not use any GROUP BY or DISTINCT on the INSERT/SELECT, and let the Duplicate Row check for the SET table weed out the duplicates.  


This is similar to a DISTINCT, but probably with more overhead: the SET table's duplicate row check compares each incoming row against the rows already stored under the same row hash, and with only ~10 distinct values almost every incoming row is a duplicate.
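A sketch of that approach (placeholder names; note the table must be created as SET — a MULTISET table would silently keep the duplicates):

```sql
-- SET table: the duplicate row check discards exact duplicate rows.
CREATE SET TABLE dim_table
( col1 INTEGER )
PRIMARY INDEX (col1);

-- No DISTINCT/GROUP BY; on an INSERT/SELECT into a SET table,
-- duplicate rows are silently discarded rather than raising an error.
INSERT INTO dim_table (col1)
SELECT col1
FROM fact_table;
```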


Dieter



     
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 27 Dec 2016