Archives of the TeradataForum

Message Posted: Wed, 01 Feb 2006 @ 21:18:44 GMT


     


Subj:   Re: Use of Multiset for Large Table
 
From:   Anomy Anom

<-- Anonymously Posted: Wednesday, February 01, 2006 14:01 -->

MULTISET addresses the symptom but does not fix the problem.

Symptom(s): INSERT/UPDATE into a table is slow because of DUP row checks. Mass Maintenance takes too long and exceeds batch windows.

The "cure" for the symptom is to make it MULTISET.

The symptom disappears, but the problem of having a PI that exceeds the recommended number of repeated rows per PI value continues to exist. Moreover, the "cure" introduces a new problem: the risk of duplicate rows and data corruption.

Depending on their level of risk tolerance, some will accept this risk and some will not. Some will pass the buck and some won't. It becomes an exercise in Risk Management and CYA if the DW suddenly is not the "single version of the truth" anymore.

The only "cure" is to address the problem rather than the symptom.

The real cure is to design a PI that remains within the threshold of dup rows per PI value and still delivers good access to the data. This approach reduces the number of dup checks to an acceptable level so that maintenance completes within the allotted windows.

It has been my experience that very large tables (1B+ rows) are better off with a SET definition, a UPI preferably defined on the natural identifier (PK), and a PPI defined on one of the UPI columns (a DDL sketch follows this list), for these reasons:

1. Fast maintenance

2. No Dup Rows risk

3. Easy to perform RI where this table is the source of a FK

4. Easy to invoke Dynamic Partition Elimination

5. Users (developers and business) learn to write better SQL if they know the key to fast access to very large tables is to use the PPI column or to join with the DIMENSION that implements the PPI column.
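For illustration only, a minimal DDL and query sketch of the approach described in this list; all table names, column names, and date-range values are hypothetical placeholders:

    /* Hypothetical sketch: SET table, UPI on the natural key,
       PPI on one of the UPI columns (txn_date) */
    CREATE SET TABLE sales_txn
    ( txn_id     DECIMAL(18,0) NOT NULL
    , txn_date   DATE          NOT NULL
    , store_id   INTEGER
    , amount     DECIMAL(15,2)
    )
    UNIQUE PRIMARY INDEX ( txn_id, txn_date )   /* no dup rows, cheap dup checks */
    PARTITION BY RANGE_N ( txn_date BETWEEN DATE '2000-01-01'
                                    AND     DATE '2010-12-31'
                                    EACH INTERVAL '1' MONTH );

    /* Hypothetical query for point 5: constraining the date dimension and
       joining on the PPI column (txn_date) lets the optimizer apply
       dynamic partition elimination, so only the qualifying monthly
       partitions of sales_txn are scanned */
    SELECT s.store_id
         , SUM(s.amount)
    FROM   sales_txn s
    JOIN   date_dim  d
      ON   d.calendar_date = s.txn_date
    WHERE  d.fiscal_month = 200601
    GROUP BY s.store_id;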


The mantra of data distribution (for join or aggregate operations) tends to get in the way of doing good DW designs for very large tables. Data distribution becomes secondary on very large tables for these reasons:

1. Resource consumption from a 300M-row aggregate, regardless of the level of granularity, will be unacceptable whether the data is on the same AMP or on different AMPs (the query is CPU bound)

2. Many joins of that table will be done using columns (FKs) that do not cover the entire NUPI definition

3. Many joins of that table will perform ALL-AMP retrieves and REDISTRIBUTIONS (based on partial keys), as sketched below.
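As a hypothetical illustration of points 2 and 3 (the names and the NUPI choice are made up): assume a table whose NUPI is (cust_id, txn_date). A join that supplies only cust_id covers just part of the NUPI, so the rows are not co-located and the optimizer must redistribute one table, or duplicate the other, across all AMPs before the join can run:

    /* Hypothetical partial-key join: cust_id alone does not cover the
       (cust_id, txn_date) NUPI, so the plan needs an all-AMP retrieve
       plus a redistribution or duplication step before the join */
    SELECT c.cust_segment
         , SUM(f.amount)
    FROM   txn_fact f
    JOIN   cust_dim c
      ON   c.cust_id = f.cust_id
    GROUP BY c.cust_segment;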


Anonym



     