Archives of the TeradataForum
Message Posted: Wed, 18 May 2005 @ 19:02:42 GMT
Hola Teradudes y dudettes!
I would appreciate your wise counsel and comments (uh, even wise cracks would be appreciated) on the following issue. I've submitted a "question" incident to NCR, but I would like some out-in-the-field (My dad made me play right field when I was in Little League), real-world input.
We have a query involving two very large tables. When we specify a date range criteria of a month on the significantly smaller (400 million row) table, the optimizer extracts records from both tables, puts them into spool and then joins those two spools, and returns results in several minutes. When we specify a date range criteria of a week, the optimizer thinks that the spool from the the smaller table is small enough (it thinks approximately 20 or so rows), and uses this to do a product merge join to the bigger table, which never returns.
We've determined that it thinks thinks because, while most intervals for in the complete statistics cover a total of 4 million rows and range over a month to a month and a half's worth of data, the last interval covers the current month and future dates, some of which are bad. So, for example, when you look at the 99th interval in the stats, it has a max value of 2/25/2005 (and a min determined from the 98th interval of 1/1/2005), the 100th interval has a max of 6/26/6700 (yes, the year 6,700). Most of the data ACTUALLY exists within the normal 2/25/2005 to current data range, but there are a small proportion of records that are real outliers. So when you do a simple COUNT(*) of the number of records from 1/1/05 to 1/7/05 from that smaller table, you get about 600 thousand, which is close to the optimizer's guess of 400 thousand. But when you do a COUNT(*) from 3/1/05 to 3/7/05, which you get close to the same number of actual records (600 thousand), the optimizer guesses that you should be getting 88 or so!
The garbage dates in the future ARE errors, but are also historically accurate for the source data. So what we tried, which worked, was to use the Statistics Wizard to go into that last interval and change the max value to something like 6/26/2005, instead of 6/26/6700. The optimizer now estimates the number of records more accurately for the 3/1 - 3/7 date range (about 200 thousand records). And when we EXPLAIN the more complex query, it now generates a working plan just like the larger date range does (3/1 - 3/31).
So what do you all think about normal people doing these kind of shenanigans to the statistics? Getting rid of the outlier dates is not possible, but dates for this in the future are not really important for the business, except for historical reasons. So question #1 is,
"Is this a good way of doing things?"
Question #2 is,
"What are the caveats and dangers we should be aware of?"
Question #3 is,
"Is there a way to 'manually' update the interval statistics using some sort of BATCH method?"
After all, we can easily get the current date, and if we could run some sort of SQL or command that would update the max value for that 100th interval, that would help us immensely.
Question #4 involves some tables that we update daily that are just way too big to collect complete statistics on the dates in the nightly batch window. Currently, since sampled statistics on dates creates all kinds of bad plans, we'd wondering,
"Would it be possible to calculate our own fake 'complete' stats for these dates and put them in?"
We know how many records have come in, and we know that generally, these date columns are pretty evenly distributed. Anyway, amigos mios, I was wondering if any of you have ever considered engaging in such craziness.
pquedo a sus ordenes,
|Copyright 2016 - All Rights Reserved|
|Last Modified: 28 Jun 2020|