Home Page for the TeradataForum
 

Archives of the TeradataForum

Message Posted: Fri, 26 May 2000 @ 17:11:24 GMT


     
  <Prev Next>   <<First <Prev Next> Last>>  


Subj:   Re: Surrogate keys
 
From:   Barry Hull

I work with Beth and I'm familiar with the change... I put in explanations after your questions below.

-----Original Message-----
From: Alan Friel
Sent: Friday, May 26, 2000 12:41 PM
Subject: Re: surrogate keys

Just to review some points:

- I assume that Bethany is talking about a sale history table (all stores, all products for some number of weeks). I also assume that the natural key would be (store, product, date). I understand the difference between index and key, and expect that the primary index, in this case, will match the natural key. Other than the one-off query, most of the queries against a history table would include date as a dimension. Since date would generally be given in a query, why would removing it from the primary index help performance?

[Barry Hull] A couple of reasons. First, and probably most importantly, the change to the index dramatically reduces IO. The degree to which IO is reduced is dependent on the date range specified in the query. The larger the date range, the greater the benefit. The reason is that if you have all of the rows for a given store/item stored in one data block (which is the likely consequence of having the NUPI on store/item), you only have to do one IO to retrieve the data for each store/item that qualifies. With a NUPI of store/item/date, you would be doing one IO for each store/item/date combination, since the sales rows for a given store/item would be distributed across the AMPs. This helps performance when accessing the table via the NUPI or through a NUSI. Second reason: In attempting to perform a star- join, the optimizer now only has to do a product join between the qualifying stores and the qualifying items, so this results in a smaller spool file that will then be joined to the NUPI. When we had the NUPI on the store/item/date, then the optimizer had to do a product join of all qualifying stores, items, and dates, which resulted in a larger spool file.


- I don't think that it should be necessary to drop date from the primary index to change the table from set to multiset. Even though the rows in a table might actually be unique, the table can still be defined as being multiset.

[Barry Hull] The reasoning was just the opposite. We made the table multiset table because the Multiload performance on the SET table with a primary index of store/item was much worse. So, the intent was to change the primary index to store/item to improve the query performance. The multiset change was something we had to do to get acceptable Multiload performance after we mad this change. By the way, when the primary index of the table was on store/item/date, this was still a NUPI, not a UPI. There are other columns in the table that are required to make the row unique. So, we had a NUPI before the change and a NUPI after the change - the only difference being the NUPI after the change was much more non-unique, which is why we went with Multiset.


- While being multiset should help the load process, I don't understand why it would help the performance of queries against the history table. I understood Bethany's note to say that dropping date from the index (although she said key) resulted in queries running faster and with less I/O. I trust that this was actually measured, but I don't understand why there would be a performance improvement - hence my query.

[Barry Hull] Multiset did nothing to improve query performance. It improved the Multiload performance. The change in the NUPI improved the query performance.


- Interestingly enough, she noted that their batch load times had increased slightly. Shouldn't they have gone down because the table was multiset and duplicate checking was skipped? Maybe the longer batch times includes duplicate checking being performed before the data is loaded to the Teradata.

[Barry Hull] Although dup checking no longer had to be done, the Multiload still had to work it's way down the NUPI chain prior to updating or inserting a row to see which row within the NUPI qualified for the update or to insert the row at the end of the NUPI chain.


- Having thought about it more after I sent my note, I'm not sure that removing date from the index of a sale history table would have a significant impact on the data distribution. As Sam pointed out, some lumpiness in data distribution is probably not much of a problem and could actually be a byproduct of a good design.

[Barry Hull] There was no significant impact on data distribution. Because there are still lots store/item combinations in the table (approx. 500 million), there is almost perfect distribution across the AMPs.


- Although I've heard that the change to the hashing algorithm has been a big improvement, I don't think it would do much for you if your data is prone to hash synonyms. By this I mean that if I have rows with a primary indexs of 1, 2, 3, 1, 2, 3, 1, 2, 4 on a 4 VAMP system, then I have a synonym problem. If I have a large number of rows with an even distribution of primary index values, then a good hashing algorithm should largely prevent synonyms problems.

[Barry Hull] I agree.


- Even given that, wouldn't adding date to a primary index of (store, product) help increase the distribution of hash values while removing date from the primary index of (store, product, date) decrease the distribution? Actually, if removing date causes a decrease in the hash distribution (therefore increasing synonyms), wouldn't that explain the longer batch times?

[Barry Hull] I think that this was explained above.


So I'm back to my original question: Why would removing date from the primary index result in improved performance?

[Barry Hull] Explained above.


[Barry Hull] There was another issue raised in this thread that there was some risk to making the table multiset. In our case, the risk is very limited. We had a NUPI before, and all of the data maintenance did upserts against the table based on the entire primary key (not primary index). Therefore, if someone would have sent in a duplicate row before, we would have updated the row twice. This is still the case. We still have the same Multiload upserting to the table with the same primary key before. In our case, we have to make sure through other methods that re-processing of a file is not possible. So, it would be impossible to insert duplicate rows unless the Multiload malfunctioned, which would present us with bigger issues anyway.



     
  <Prev Next>   <<First <Prev Next> Last>>  
 
 
 
 
 
 
 
 
  
  Top Home Privacy Feedback  
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 15 Jun 2023