|
|
Archives of the TeradataForum
Message Posted: Thu, 31 May 2012 @ 16:01:01 GMT
Subj: | | Re: Why incorrect stats are worse than not collecting stats? |
|
From: | | Dieter Noeth |
Dave Wellman wrote:
| Before TD12, if there were statistics on a column then they were accepted as correct. However, this could lead to a situation such
as: | |
| - the current date is during September | |
| - a query includes selection which looks for rows between 1st August and 31st August (i.e. the previous month) | |
| - the statistics on the relevant column include a maximum value of 31st July (because statistics have not been refreshed) | |
| - the optimizer assumes that there are very few rows (I think one per AMP but not sure about that) that meet the selection criteria
although inreality there may be many thousands or even hundreds of thousands. | |
It was a "high confidence 1 row".
| In this scenario, it is very possible that the plan generated by the optimizer would be 'less than optimal'. Common situations that
I've seen is that the spool file built in such a situation is then product joined with something else, the actual processing then taking hours to
complete. | |
I've seen a system with such a bad stats (because of SAMPLE on the partitoning column in an old release, V2R6? ) where the optimizer created
plans like "duplicated" or "redistributed" based on that assumption. Actually it might have been a few hundred million rows (CDRs) and the first
step was that small high conf spool and all queries were running out of spool (duplicated) or running for multile hours (redistributed). Dropping
the stats was the emergency fix :-)
| With TD12 and upwards, the optimizer now compares any statistics with random amp samples and decides (again, don't know how) if the
statistics are out of date. If they are out of date then the statistics are ignored. | |
The optimizer is not ignoring those stats, it's trying to adjust them: If the number of rows increased and it's a DATE then it's assuming those
new rows were added *after* the max date from stats and adjusts that max date based on the average number of rows per day.
This works based on the count returmned by a Random AMP Sample which is compared to the One-/All-AMPSampleEst from stats.
But this is only working when the number of rows increased, when it decreased it's still using it without any modification.
| Having said that, accurate, up to date stats is one of my markers for a well administered Teradata system. | |
Full Ack :-)
Dieter
| |