

Archives of the TeradataForum
Message Posted: Sun, 02 Mar 2003 @ 09:13:22 GMT
Subj:   Re: Performance for Distinct values 

From:   Dieter Nöth 
Anomy Anom wrote:
 This question concerns the relative performance of two different methods. I need to create a dimension table from a fact table by
loading all distinct values of one column into it. I expect to get maybe 10 rows into the target table from a source table of
100,000 rows.  
Don't think too much about it, if it's such a small table ;)
 I am considering two different methods:  
 1. Do a GROUP BY or DISTINCT on the source of an INSERT/SELECT. This of course will require the sorting,
redistribution, and aggregation (ARSA).  
As Teradata optimizes GROUP BY and DISTINCT totally differently, there's a rule of thumb: if the number of rows per value is large, GROUP BY is
faster, because duplicates are eliminated locally *before* redistribution; if the number of rows per value is small, DISTINCT is faster.
Example: 10,000,000 rows, 100 AMPs (100,000 rows/AMP).

Case 1: 50 values, 200,000 rows/value
  DISTINCT: 10,000,000 rows redistributed to 50 AMPs (large spool), then
  sorted, eliminating duplicate rows.
  GROUP BY: 100,000 rows/AMP locally aggregated and reduced to 50
  rows/AMP; 50*100 = 5,000 rows redistributed to 50 AMPs (small spool) and
  aggregated again.

Case 2: 500,000 values, 20 rows/value
  DISTINCT: 10,000,000 rows redistributed to 100 AMPs (large spool), then
  sorted, eliminating duplicate rows.
  GROUP BY: 100,000 rows/AMP locally aggregated, but there's probably only
  1 row/value on each AMP, so the local step is almost useless; 100,000*100 =
  10,000,000 rows redistributed to all AMPs (large spool) and aggregated
  again.

Case 3: 10,000 values, 1,000 rows/value
  DISTINCT: 10,000,000 rows redistributed to 100 AMPs (large spool), then
  sorted, eliminating duplicate rows.
  GROUP BY: 100,000 rows/AMP locally aggregated and reduced to 10,000
  rows/AMP, but the aggregate cache (~100KB) overflows, causing a
  second local aggregation; 10,000*100 = 1,000,000 rows redistributed to
  all AMPs (medium spool) and aggregated again.
This assumes an even distribution of values; if the data is skewed, the numbers may look quite different...
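As a sketch, the two variants of option 1 could look like this (fact_sales, dim_region, and region_code are assumed names, not from the original question):

```sql
-- Assumed: fact_sales is the ~100,000-row source, dim_region the target.

-- Variant A: DISTINCT - all qualifying rows are redistributed first,
-- then sorted to eliminate duplicates.
INSERT INTO dim_region (region_code)
SELECT DISTINCT region_code
FROM fact_sales;

-- Variant B: GROUP BY - duplicates are eliminated locally on each AMP
-- before redistribution, so the redistributed spool is small when
-- there are many rows per value.
INSERT INTO dim_region (region_code)
SELECT region_code
FROM fact_sales
GROUP BY region_code;
```

With ~10 distinct values in 100,000 rows (~10,000 rows/value), the rule of thumb above favours the GROUP BY form; running EXPLAIN on both statements shows the different plans.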
Btw., if you use a subquery (e.g. with IN), there's always an implicit DISTINCT by default, but you can specify GROUP BY to override it.
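For example (same assumed names as above), the subquery below is implicitly de-duplicated, and adding GROUP BY replaces that implicit DISTINCT with local pre-aggregation:

```sql
-- Implicit DISTINCT on the subquery result:
SELECT *
FROM dim_region
WHERE region_code IN (SELECT region_code FROM fact_sales);

-- GROUP BY overrides the implicit DISTINCT:
SELECT *
FROM dim_region
WHERE region_code IN (SELECT region_code FROM fact_sales
                      GROUP BY region_code);
```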
 2. Make the target table a SET table, do not use any GROUP BY or DISTINCT on the INSERT/SELECT, and let the Duplicate Row check
for the SET table weed out the duplicates.  
This is similar to a DISTINCT, but the row-at-a-time duplicate-row check probably adds more overhead.
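A sketch of option 2, with the same assumed names: the target is a SET table, and the plain INSERT/SELECT relies on the duplicate-row check to discard repeats:

```sql
-- SET tables reject exact duplicate rows; an INSERT/SELECT into a
-- SET table silently discards them instead of failing.
CREATE SET TABLE dim_region (
    region_code INTEGER
) PRIMARY INDEX (region_code);

-- No DISTINCT/GROUP BY: every source row is hashed to its target AMP,
-- where the duplicate-row check compares it against existing rows with
-- the same row hash and drops duplicates one by one.
INSERT INTO dim_region (region_code)
SELECT region_code
FROM fact_sales;
```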
Dieter
 