Home Page for the TeradataForum
 

Archives of the TeradataForum

Message Posted: Thu, 31 May 2012 @ 16:01:01 GMT


     
  <Prev Next>   <<First <Prev Next> Last>>  


Subj:   Re: Why incorrect stats are worse than not collecting stats?
 
From:   Dieter Noeth

Dave Wellman wrote:

  Before TD12, if there were statistics on a column then they were accepted as correct. However, this could lead to a situation such as:  


  - the current date is during September  


  - a query includes selection which looks for rows between 1st August and 31st August (i.e. the previous month)  


  - the statistics on the relevant column include a maximum value of 31st July (because statistics have not been refreshed)  


  - the optimizer assumes that there are very few rows (I think one per AMP but not sure about that) that meet the selection criteria although inreality there may be many thousands or even hundreds of thousands.  


It was a "high confidence 1 row".


  In this scenario, it is very possible that the plan generated by the optimizer would be 'less than optimal'. Common situations that I've seen is that the spool file built in such a situation is then product joined with something else, the actual processing then taking hours to complete.  


I've seen a system with such a bad stats (because of SAMPLE on the partitoning column in an old release, V2R6? ) where the optimizer created plans like "duplicated" or "redistributed" based on that assumption. Actually it might have been a few hundred million rows (CDRs) and the first step was that small high conf spool and all queries were running out of spool (duplicated) or running for multile hours (redistributed). Dropping the stats was the emergency fix :-)


  With TD12 and upwards, the optimizer now compares any statistics with random amp samples and decides (again, don't know how) if the statistics are out of date. If they are out of date then the statistics are ignored.  


The optimizer is not ignoring those stats, it's trying to adjust them: If the number of rows increased and it's a DATE then it's assuming those new rows were added *after* the max date from stats and adjusts that max date based on the average number of rows per day.

This works based on the count returmned by a Random AMP Sample which is compared to the One-/All-AMPSampleEst from stats.

But this is only working when the number of rows increased, when it decreased it's still using it without any modification.


  Having said that, accurate, up to date stats is one of my markers for a well administered Teradata system.  


Full Ack :-)


Dieter



     
  <Prev Next>   <<First <Prev Next> Last>>  
 
 
 
 
 
 
 
 
  
  Top Home Privacy Feedback  
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 15 Jun 2023