Home Page for the TeradataForum
 

Archives of the TeradataForum

Message Posted: Mon, 30 Jan 2006 @ 18:15:29 GMT


     
  <Prev Next>   <<First <Prev Next> Last>>  


Subj:   Re: Use of Multiset for Large Table
 
From:   Anomy Anom

<-- Anonymously Posted: Monday, January 30, 2006 13:12 -->

  Hmm, do you really compromise the most important property on a table (the PI), for the sake of uniqueness?  


Not for the sake of uniqueness. That is not the point. If the table has 20B rows (1Tb) I have to do something to speed up inserts while preserving the integrity of the data. A PPI and a USI make it easy to maintain without sacrificing access and without the use of a multiset definition.


  How many business questions can you answer (using the PI), if you have put say 10 columns into it, to make it unique?  


In 17 years of designing and working with Relational databases I never encountered a "natural identifier" that was 10 columns long. 5, maybe 6 at most and in some those instances a surrogate key was used for inheritance purposes (FK). The particular table in question with 20B rows has 3 columns as the UPI, all 3 columns are FKs to dimensions. In a properly design Dimensional Model, it is very easy to get the dimensions that make up the PI of the large table into the SQL picture and help the OPTIMIZER put 2+2 together. The key is to ensure people know the data model, know how to properly use the tables, know the nuances that Teradata's PIs play in access to the information, and know the nuances that adding context in the form of tables supplying FKs to the PI definition can help the optimizer make the connection to use the PI (whether UPI or NUPI, makes no difference).


  Can you "educate" folks (the business people I guess you mean), on which business questions they are allowed to ask?  


Yes, that's part of being a good Data Administrator. But I would not say "ALLOWED TO ASK". I would rewrite that to read "ASK ANY QUESTION THEY LIKE AS LONG AS THEY DO IT IN AN EDUCATED WAY", that means ensuring report writers know the Data Model very well. Only a few selected Business Users have access to the tables directly and their queries are monitored to ensure they are using the model correctly. If they are not I meet with them to review the data model and ensure they PLAN the queries on the data model FIRST and then WRITE the SQL last. Access to the 20B row table is not restricted at all because everyone knows how to use it properly.


  Yes, the table scans can be costly, but star-joins are great. Sadly you wont get them, when you put a lot of columns into the PI.  


I would say STAR JOINS are SOMETIMES great. I have seen the OPTIMIZER tank and skew something fierce on BIG TABLE/SMALL TABLE star joins. Also, surrogate keys can be used as UPIs and Join Indexes could be used for data access too. Even a 2 column PI with zero skew won't be used if only 1 column is specified. Uniqueness has nothing to do with efficient access. Many Teradata DBAs get lost on the quest for data distribution (i.e., Best PI, dup checks, etc) and spend little time educating folks to ensure they know the data models that support the Data Warehouse and know how to use them to access that data. The best distributed PIs and multiset tables that don't check for duplicates won't make up for a poorly designed data warehouse and for poorly written SQL.


  Do you really consider the Data warehouse to be yours? In my humble opinion, it belongs to the business.  


I must explain that I am not a DBA (although I play one here). While the hardware and Software belong to the business, I am charged by the business to be the STEWARD of the DATA WAREHOUSE. As the Steward, I am responsible for its well being and the "owners" expect that. MULTISETS are not in the best interest of the data warehouse or the business for two reasons: They violate RELATIONAL theory and they have the potential to introduce serious data integrity problems. The first one is reason enough to keep them out of the data model, the second one violates the premise of the DW being a single version of the truth (if duplicate rows yield improper results). Teradata already makes it difficult to implement reasonable Referential Integrity because it requires USI's and USI's are the Kiss of Death for performance. No need to aggravate the situation by adding Multisets to the mix. Let's remember that a PI is not a PK and DUPLICATE Primary Identifiers are indeed possible under Teradata's NUPIs.

Anyone with some relational database background that takes the 2 week class in Atlanta can be a Teradata DBA, but not everyone is able to bring more value to the organization by doing the role of a Data Architect and Data Administrator. In that role I make sure the data warehouse has a well designed, full blown data model that faithfully represents the business and is implemented in the Teradata box in a consistent state (RI maintained). I spend a good deal of my time evangelizing the proper use of its contents by promoting the concepts stored in the data warehouse and ensuring they are used properly. So yes, I sort of "own" the data warehouse because I act as the gatekeeper charged to ensure that whatever needs to get into the data warehouse gets properly modeled and defined. Nothing goes in or out unless it is modeled first (I do the modeling, so you get the idea).

I am not asking anyone not use Multisets. I am simply sharing the reasons why I don't use them or why I have no need for them (A UPI takes care of that pronto without sacrificing access or data distribution). That's what this forum is all about. Now folks may know a little more about different points of view on this subject and whether they wish to use multisets at all.


Anonym



     
  <Prev Next>   <<First <Prev Next> Last>>  
 
 
 
 
 
 
 
 
  
  Top Home Privacy Feedback  
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 15 Jun 2023