

Archives of the TeradataForum
Message Posted: Sun, 02 Mar 2003 @ 09:13:22 GMT
Subj:   Re: Performance for Distinct values 

From:   Dieter Nöth 
Anomy Anom wrote:
 This question concerns the relative performance of two different methods. I need to create a dimension table from a fact table by
loading all distinct values of one column into it. I expect to get maybe 10 rows into the target table from a source table of
100,000 rows.  
Don't think too much about it, if it's such a small table ;)
 I am considering two different methods:  
 1. Do a GROUP BY or DISTINCT on the source of an INSERT/SELECT. This of course will require the sorting,
redistribution, and aggregation (ARSA).  
As Teradata optimizes GROUP BY and DISTINCT totally differently, there's a rule of thumb: if the number of rows per value is large, GROUP BY is
faster, because duplicates are eliminated locally *before* redistribution; if the number of rows per value is small, DISTINCT is faster.
Example: 10,000,000 rows, 100 AMPs (100,000 rows/AMP).

Case 1: 50 values, 200,000 rows/value
  DISTINCT: 10,000,000 rows redistributed to 50 AMPs (large spool), then
  sorted, eliminating duplicate rows.
  GROUP BY: 100,000 rows/AMP locally aggregated and reduced to 50
  rows/AMP; 50*100 = 5,000 rows redistributed to 50 AMPs (small spool) and
  aggregated again.

Case 2: 500,000 values, 20 rows/value
  DISTINCT: 10,000,000 rows redistributed to 100 AMPs (large spool), then
  sorted, eliminating duplicate rows.
  GROUP BY: 100,000 rows/AMP locally aggregated, but there's probably only
  1 row/value on each AMP, so the local step is almost useless; 100,000*100 =
  10,000,000 rows redistributed to all AMPs (large spool) and aggregated
  again.

Case 3: 10,000 values, 1,000 rows/value
  DISTINCT: 10,000,000 rows redistributed to 100 AMPs (large spool), then
  sorted, eliminating duplicate rows.
  GROUP BY: 100,000 rows/AMP locally aggregated and reduced to 10,000
  rows/AMP, but the aggregate cache (~100KB) overflows, causing a
  second local aggregation; 10,000*100 = 1,000,000 rows redistributed to
  all AMPs (medium spool) and aggregated again.
This assumes an even distribution of values; if the data is skewed, the numbers may look quite different...
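As a sketch, the two variants of option 1 could look like this (fact_sales, dim_region, and region_code are assumed names, not from the original question):

```sql
-- Assumed: fact_sales is the ~100,000-row source, dim_region the target.

-- Variant A: DISTINCT - all qualifying rows are redistributed first,
-- then sorted to eliminate duplicates.
INSERT INTO dim_region (region_code)
SELECT DISTINCT region_code
FROM fact_sales;

-- Variant B: GROUP BY - duplicates are eliminated locally on each AMP
-- before redistribution, so the redistributed spool is small when
-- there are many rows per value.
INSERT INTO dim_region (region_code)
SELECT region_code
FROM fact_sales
GROUP BY region_code;
```

With ~10 distinct values in 100,000 rows (~10,000 rows/value), the rule of thumb above favours the GROUP BY form; running EXPLAIN on both statements shows the different plans.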
Btw., if you use a subquery (e.g. with IN), there's always an implicit DISTINCT by default, but you can specify GROUP BY to override it.
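For example (same assumed names as above), the subquery below is implicitly de-duplicated, and adding GROUP BY replaces that implicit DISTINCT with local pre-aggregation:

```sql
-- Implicit DISTINCT on the subquery result:
SELECT *
FROM dim_region
WHERE region_code IN (SELECT region_code FROM fact_sales);

-- GROUP BY overrides the implicit DISTINCT:
SELECT *
FROM dim_region
WHERE region_code IN (SELECT region_code FROM fact_sales
                      GROUP BY region_code);
```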
 2. Make the target table a SET table, do not use any GROUP BY or DISTINCT on the INSERT/SELECT, and let the Duplicate Row check
for the SET table weed out the duplicates.  
This is similar to a DISTINCT, but the row-at-a-time duplicate-row check probably adds more overhead.
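A sketch of option 2, with the same assumed names: the target is a SET table, and the plain INSERT/SELECT relies on the duplicate-row check to discard repeats:

```sql
-- SET tables reject exact duplicate rows; an INSERT/SELECT into a
-- SET table silently discards them instead of failing.
CREATE SET TABLE dim_region (
    region_code INTEGER
) PRIMARY INDEX (region_code);

-- No DISTINCT/GROUP BY: every source row is hashed to its target AMP,
-- where the duplicate-row check compares it against existing rows with
-- the same row hash and drops duplicates one by one.
INSERT INTO dim_region (region_code)
SELECT region_code
FROM fact_sales;
```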
Dieter
 