Archives of the TeradataForum
Message Posted: Fri, 26 May 2000 @ 20:27:43 GMT
Barry did an excellent job explaining the I/O part of the reduction due to dropping a column from a Primary Index.
I'll try to explain the CPU reduction a little more. First, though, to some degree Barry's and Bethany's approach assumes that you are mainly accessing the tables with the small table / large table join approach. Is this your case? If you are mostly using full table scans, then you may or may not see the performance gains they talk about, and the difference will depend on your queries.
First I'll explain the large table / small table join approach. In this case the initial PI of the large table is Store, Product, Date / Period. In the past we added all of the pieces of the candidate key to the Primary Index to help manage the amount of uniqueness. Recently, due to technical advances, you may choose to leave out a particular column. Which column to leave out of the PI depends on the customer requirements, but I tend to look at Date / Period; others may leave out Store or Item. Larger data blocks (64 KB), MULTISET tables, faster hardware, and improved optimization in Teradata all enable this newer approach.
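As a sketch of the two PI choices, using hypothetical table and column names (your fact table and columns will differ):

```sql
-- Older approach: all candidate-key columns in the PI to manage uniqueness.
CREATE TABLE Sales_Old
( Store_Id    INTEGER NOT NULL
, Product_Id  INTEGER NOT NULL
, Sale_Date   DATE    NOT NULL
, Sales_Amt   DECIMAL(18,2)
) PRIMARY INDEX (Store_Id, Product_Id, Sale_Date);

-- Newer approach: leave Date / Period out of the PI.  Larger data blocks,
-- MULTISET tables, and faster hardware make the extra rows per PI value
-- manageable, and all dates for a Store/Product pair land on the same AMP,
-- often in the same data block.
CREATE MULTISET TABLE Sales_New
( Store_Id    INTEGER NOT NULL
, Product_Id  INTEGER NOT NULL
, Sale_Date   DATE    NOT NULL
, Sales_Amt   DECIMAL(18,2)
) PRIMARY INDEX (Store_Id, Product_Id);
```

MULTISET matters in the second case because the two-column PI is no longer near-unique, and duplicate-row checking on a SET table would get expensive.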
With this type of physical schema, the Teradata optimizer usually executes what is called the small table / large table join, as long as you include the join to each of the smaller dimension tables involved (Store table, Product table, Date / Period table). This can be done in the SQL or in a view to simplify things for the users. This is a case where the joins actually perform much better than going after the single table alone.

If this approach is being used and you are going after a relatively small amount of the table (less than 8-10% of the table per query in the resulting spool), then the optimizer will product join the restricted rows from the WHERE clause of the Store, Product, and Date / Period tables. It does this by selecting the restricted rows (for example, Stores) from the smallest estimated result and redistributing those rows to the AMPs with the next smallest resulting dimension (Items), doing a product join with those restricted rows, and then finally doing the same with the next dimension (Date / Period) via a second redistribution and product join. The resulting spool now holds the entire set of primary index values for all possible combinations, which is used to execute a merge join against the Sales, Inventory, or related tables. I call this approach an "on the fly index," and it is pretty efficient even at high numbers of rows (< 8%) and for high numbers of queries.
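A sketch of the kind of query that tends to get this plan, again with hypothetical table and column names; note that all three dimension tables are joined in even though the restrictions are simple predicates:

```sql
SELECT  s.Region
,       p.Category
,       SUM(f.Sales_Amt)
FROM    Sales        f
JOIN    Store_Dim    s  ON s.Store_Id   = f.Store_Id
JOIN    Product_Dim  p  ON p.Product_Id = f.Product_Id
JOIN    Date_Dim     d  ON d.Cal_Date   = f.Sale_Date
WHERE   s.Region    = 'Northeast'
AND     p.Category  = 'Beverages'
AND     d.Cal_Month = '2000-05'
GROUP BY s.Region, p.Category;
```

To hide the joins from end users, the same join pattern can be wrapped in a view so they only supply the restrictions:

```sql
REPLACE VIEW Sales_V AS
SELECT  f.Store_Id, f.Product_Id, f.Sale_Date, f.Sales_Amt
,       s.Region, p.Category, d.Cal_Month
FROM    Sales        f
JOIN    Store_Dim    s  ON s.Store_Id   = f.Store_Id
JOIN    Product_Dim  p  ON p.Product_Id = f.Product_Id
JOIN    Date_Dim     d  ON d.Cal_Date   = f.Sale_Date;
```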
Ideally most, or at least 2 of the 3, dimensions will have a restriction. Sometimes it can work with one restriction on a very narrow segment of data, e.g., a single item across all stores and all dates. By removing the 3rd column from the PI, the query can skip the 2nd redistribution and 2nd product join needed to identify the candidate combinations.
The 3rd restriction, if supplied in the WHERE clause, is still applied as an extra restriction during the merge join, so it is not wasted. This is what reduces the CPU, BYNET traffic, and some of the I/O, all in addition to the I/O reduction Barry explained from all the rows being in one data block on the same AMP. As a result, the approach using fewer columns is much more efficient, and it benefits these and other queries by making more efficient use of the system's resources for all queries.
I hope this helps,
Eric J. Kohut