
Archives of the TeradataForum

Message Posted: Thu, 02 Feb 2006 @ 22:41:53 GMT


     


Subj:   Re: Use of Multiset for Large Table
 
From:   Anomy Anom

<-- Anonymously Posted: Thursday, February 02, 2006 17:05 -->

The performance penalty for enforcing Referential Integrity and Referential Constraints (RI and RC) in the RDBMS, especially with Teradata, is a price few, if any, would be willing to pay. When I discussed RI with Teradata PS people, they looked at me like I was from another planet. The reason is that declared RI requires a UPI or USI on every referenced table, which would make the system perform very badly due to poor data distribution and the USI maintenance overhead incurred on mass operations.
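
For anyone who has not seen it, declared RI in Teradata looks roughly like the sketch below; the table and column names are made up for illustration, and the parent key has to be a UPI or USI for the REFERENCES clause to be accepted.

     /* Parent table: the referenced column must be NOT NULL and a UPI or USI */
     CREATE TABLE customer
     ( cust_id    INTEGER NOT NULL
     , cust_name  VARCHAR(100)
     ) UNIQUE PRIMARY INDEX (cust_id);

     /* Child table: declared (enforced) RI back to the parent */
     CREATE MULTISET TABLE sales
     ( sale_id  INTEGER NOT NULL
     , cust_id  INTEGER NOT NULL
     , amount   DECIMAL(12,2)
     , FOREIGN KEY (cust_id) REFERENCES customer (cust_id)
     ) PRIMARY INDEX (sale_id);

If I recall correctly, "soft" RI, which is declared for the optimizer but never checked, uses REFERENCES WITH NO CHECK OPTION instead.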

In my DW we do not enforce it, but I follow Reagan's advice to "trust but verify". I have a set of scripts that roam the DW looking for RI problems and report back on them, to keep the ETL developers honest. Developers are becoming much better at creating "quarantine" tables in their processes, as they do not like receiving my emails showing RI problems.
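
The checks themselves do not have to be fancy; something along the lines of the sketch below (hypothetical table and column names) is enough to report orphaned child rows.

     /* Report child rows whose parent key does not exist */
     SELECT s.cust_id
          , COUNT(*) AS orphan_rows
     FROM   sales s
     LEFT OUTER JOIN customer c
            ON s.cust_id = c.cust_id
     WHERE  c.cust_id IS NULL
     GROUP  BY s.cust_id;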

That said, I personally advocate implementing RI and RC where they belong: in the business process. The problem with a DW is that such an animal (a business process) does not exist, since everything (well, almost everything) is handled via mass inserts and updates through ETL or ELT. It is up to those ETL/ELT processes to implement RI and RC policies. That, in my view, is a high-risk proposition, but there is no way around it short of creating business processes. Judging by the unwillingness to pay the price for dup checks, I doubt anyone would tolerate writing code to handle the insert, update, and delete operations one row at a time.
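
As a sketch of what an ETL-side RI policy with a quarantine table might look like (the staging, target, and quarantine table names are hypothetical):

     /* Route staged rows with no matching parent to the quarantine table */
     INSERT INTO sales_quarantine
     SELECT st.*
     FROM   sales_stage st
     WHERE  NOT EXISTS
            ( SELECT 1 FROM customer c WHERE c.cust_id = st.cust_id );

     /* Load only the rows that pass the RI check */
     INSERT INTO sales
     SELECT st.*
     FROM   sales_stage st
     WHERE  EXISTS
            ( SELECT 1 FROM customer c WHERE c.cust_id = st.cust_id );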

MULTISET tables do not create an RI problem; they create a data corruption problem. Some RI/RC problems are easier to live with, such as a bad ZIP CODE or a bad GENDER CODE. On the other hand, two duplicate rows could have perfect RI and RC, yet their existence is a data corruption problem. Reporting double the sales or double the quantity on hand is far less appealing, and finding dups, especially on very large tables, is a costly affair requiring aggregates (as opposed to exclusion merge operations). Removing the dups is yet another issue that must be coded for and carried out.
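
Finding them comes down to a full-table aggregate over the columns that are supposed to identify a row; a sketch, with hypothetical names:

     /* Report rows that occur more than once in a MULTISET table */
     SELECT sale_id, cust_id, amount
          , COUNT(*) AS copies
     FROM   sales
     GROUP  BY sale_id, cust_id, amount
     HAVING COUNT(*) > 1;

Getting rid of them usually means copying the DISTINCT rows into a new (or SET) table and swapping it in, since two identical rows cannot be told apart by a WHERE clause.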

The only denormalization I know of that enhances performance and simplifies processing is against 1NF (reintroducing repeating groups). Denormalizing 2NF (partial key dependencies) does not do much in Teradata, thanks to its ability to perform amazing joins against the small tables that hold those partial key dependencies. Folding them in is a waste of processing time (fewer rows per block) and of disk space, and I have noticed very little gain in my dimensional models from denormalizing 2NF rules. No bang for the buck, and it can create maintenance problems if the denormalized values change (a price we pay in the form of slowly changing dimensions).
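
By a repeating group I mean something like the structure below (hypothetical names): twelve monthly buckets carried on one row instead of twelve rows, which saves a join or an aggregate at query time.

     /* Repeating group: monthly buckets on one row (denormalizes 1NF) */
     CREATE TABLE acct_monthly_summary
     ( acct_id      INTEGER  NOT NULL
     , fiscal_year  SMALLINT NOT NULL
     , amt_m01      DECIMAL(12,2)
     , amt_m02      DECIMAL(12,2)
     /* ... amt_m03 through amt_m11 ... */
     , amt_m12      DECIMAL(12,2)
     ) PRIMARY INDEX (acct_id);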

Storing a flat file (from a source or a summary report) as a Teradata table is not "denormalization". That's simply storing a file (with records) as a table (with rows). Creating a dimension table in a star schema, whether it is a single table, a collection of tables, or a hierarchy collapsed into a single table, is not "denormalization" either. That's dimensional modeling.
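
For what I mean by a hierarchy collapsed into a single dimension table, a sketch with hypothetical names:

     /* Product hierarchy collapsed into one dimension table */
     CREATE TABLE product_dim
     ( product_id      INTEGER NOT NULL
     , product_name    VARCHAR(100)
     , subcategory_nm  VARCHAR(50)
     , category_nm     VARCHAR(50)
     , department_nm   VARCHAR(50)
     ) UNIQUE PRIMARY INDEX (product_id);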


Anonym



     