Re: Surrogate keys [From Alan Friel: Fri, 26 May 2000 @ 16:40 GMT]

Message Posted: Fri, 26 May 2000 @ 16:40:39 GMT

Subj:

Re: Surrogate keys

From:

Alan Friel

Just to review some points:

- I assume that Bethany is talking about a sale history table (all stores, all products for some number of weeks). I also assume that the natural key would be (store, product, date). I understand the difference between index and key, and expect that the primary index, in this case, will match the natural key. Other than the one-off query, most of the queries against a history table would include date as a dimension. Since date would generally be given in a query, why would removing it from the primary index help performance?

- I don't think that it should be necessary to drop date from the primary index to change the table from set to multiset. Even though the rows in a table might actually be unique, the table can still be defined as being multiset.

- While being multiset should help the load process, I don't understand why it would help the performance of queries against the history table. I understood Bethany's note to say that dropping date from the index (although she said key) resulted in queries running faster and with less I/O. I trust that this was actually measured, but I don't understand why there would be a performance improvement - hence my query.

- Interestingly enough, she noted that their batch load times had increased slightly. Shouldn't they have gone down because the table was multiset and duplicate checking was skipped? Maybe the longer batch times includes duplicate checking being performed before the data is loaded to the Teradata.

- Having thought about it more after I sent my note, I'm not sure that removing date from the index of a sale history table would have a significant impact on the data distribution. As Sam pointed out, some lumpiness in data distribution is probably not much of a problem and could actually be a byproduct of a good design.

- Although I've heard that the change to the hashing algorithm has been a big improvement, I don't think it would do much for you if your data is prone to hash synonyms. By this I mean that if I have rows with a primary indexs of 1, 2, 3, 1, 2, 3, 1, 2, 4 on a 4 VAMP system, then I have a synonym problem. If I have a large number of rows with an even distribution of primary index values, then a good hashing algorithm should largely prevent synonyms problems.

- Even given that, wouldn't adding date to a primary index of (store, product) help increase the distribution of hash values while removing date from the primary index of (store, product, date) decrease the distribution? Actually, if removing date causes a decrease in the hash distribution (therefore increasing synonyms), wouldn't that explain the longer batch times?

So I'm back to my original question: Why would removing date from the primary index result in improved performance?

Attachments

Library

Quick Reference

Archives of the TeradataForum

Message Posted: Fri, 26 May 2000 @ 16:40:39 GMT