Archives of the TeradataForum
Message Posted: Tue, 29 Apr 2003 @ 15:59:51 GMT
This discussion really points to operational design issues: the efficient collection of statistics is a data-dependent process, not a schedule- or load-dependent one. Most shops I have worked with end up treating stats collection as a separate batch process in the weekend batch window, with its own dependent metadata. The demographic information needed to determine the interval at which to collect statistics should be gathered at the time the modeling is done - and periodically validated.
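One way to validate those demographics periodically is to compare what the dictionary says about the last collection against the load metadata. A minimal sketch (the table name is illustrative only):

```sql
-- Show the existing statistics for a table: which columns have stats,
-- when they were last collected, and the unique-value counts recorded.
-- Compare these against the load metadata to decide whether the weekend
-- batch needs to re-collect.
HELP STATISTICS call_detail;
```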
I have typically used a 3% to 5% data-change-per-column rule, with one exception - boundary conditions. This happens most often with dates, but can also happen with system-assigned values such as account numbers and other values of that ilk. I have found that the optimizer can be quite sensitive to date values falling more than two distinct values outside the boundaries recorded in the statistics.
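That boundary check can be automated by keeping the high value captured at the last collection in a small control table and comparing it to the current maximum. A hedged sketch - the table and column names (call_detail, call_date, stats_control) are illustrative, not from the original post:

```sql
-- Flag a column for re-collection when its current high value has moved
-- more than two distinct values past the boundary recorded at the last
-- COLLECT STATISTICS. DATE + 2 is date arithmetic in days.
SELECT sc.last_high_value,
       MAX(cd.call_date) AS current_high
FROM call_detail cd, stats_control sc
WHERE sc.tbl_name = 'call_detail'
  AND sc.col_name = 'call_date'
GROUP BY sc.last_high_value
HAVING MAX(cd.call_date) > sc.last_high_value + 2;
```

A row coming back means the stats on call_date are stale enough to re-collect in the next batch window.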
You also need to be very careful with statistics on reference tables whose row counts are just under the number of VAMPs. The date-statistics issue can be especially problematic when a predicate reduces the qualifying range of dates to fewer values than the number of VAMPs.
My general rules of thumb:
1) statistics must be collected on any column that is used in a join, predicate, or aggregation.
2) statistics should be updated any time the data in a column falls outside the high or low boundary for that column
3) statistics should be updated when the data distribution across the statistics intervals changes by more than 3-5%
4) statistics for reference tables should be collected periodically (monthly or quarterly).
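The rules above map directly onto the collection statements. A sketch using made-up table and column names:

```sql
-- Rule 1: collect on any column used in a join, predicate, or aggregation.
COLLECT STATISTICS ON sales COLUMN customer_id;
COLLECT STATISTICS ON sales COLUMN sale_date;

-- Rules 2 and 3: simply re-run the same statement to refresh a statistic
-- once the boundary or interval demographics have drifted.
COLLECT STATISTICS ON sales COLUMN sale_date;

-- Rule 4: with no column named, Teradata re-collects every statistic
-- previously defined on the table - handy for the periodic reference-table
-- refresh.
COLLECT STATISTICS ON state_lookup;
```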
In rare cases I have found it can be better to remove the statistics on a low-cardinality column joined to a compound primary index. A phone number is a good example. Call detail records with a PI of SID-NPA-NXX (ex: 800-322-1012) contain many rows, but the SID column has low cardinality. When joining to a lookup table with a PI of SID, the optimizer will choose to redistribute the large table onto the small table. Removing the stats on the *reference* table's PI of SID fixed this problem.
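The fix itself is a one-liner; the names here (sid_lookup, sid) are hypothetical stand-ins for the reference table and its PI column:

```sql
-- Drop the single-column statistic on the small reference table's PI so
-- the optimizer stops choosing to redistribute the large call-detail
-- table onto it.
DROP STATISTICS ON sid_lookup COLUMN sid;
```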
I suspect block optimization will expose some of these problems under V2R5 when the optimizer is no longer blind across the derived table boundary. It's good news and bad news...
|Copyright 2016 - All Rights Reserved|
|Last Modified: 27 Dec 2016|