|
|
Archives of the TeradataForum
Message Posted: Thu, 15 Dec 2011 @ 17:07:34 GMT
Subj: | | Re: PI on multicolumns? |
|
From: | | frank.c.martinez |
It's not a matter of "perfect distribution". There is a process, a criteria, for deciding on the PI. I have encapsulated my own opinions (but
as Issac Newton said, "on the shoulders of giants") in the following SQL:
DATABASE DTNG_XXYZZY;
/*****
* PURPOSE: The number of records is needed in order to compare the
* to the number of distinct values. Also, the number of
* NULL's in the candidate index can be determined at the
* same time.
*****/
SELECT CAST(COUNT(*) AS DECIMAL(18)),
SUM(CASE
WHEN XXYZZY_OUTBND_ID IS NULL
THEN CAST(1 AS DECIMAL(18))
ELSE 0
END) AS NullCnt
FROM T_XXYZZY_OUTBND_DTL;
/*****
* NAME: ATM Statistics
* PURPOSE: This calculates the maximum and average number of records per value
* and the number of distinct values, for use in index selection analysis.
* The average and max should be low (< 200 if possible) and the number of
* distinct values should be high (some large percentage of the number of rows).
*****/
SELECT AVG(TheCount) AS AvgRecs,
MAX(TheCount) AS MaxRecs,
CAST(COUNT(*) AS DECIMAL(18)) AS DistinctVals
FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS TheCount
FROM T_XXYZZY_OUTBND_DTL
GROUP BY XXYZZY_OUTBND_ID) AS Temp;
/*****
* NAME: Distribution - Mode
* PURPOSE: This determines the "mode" of the distribution of a specific index.
* NOTE: The mode is the score in the population that occurs most frequently.
* (Internet Glossary of Statistical Terms:
http://www.animatedsoftware.com/statglos/sgmode.htm)
* It should be low (as the average and max).
*****/
SELECT NumberOfRecords, Occurs
FROM (SELECT NumberOfRecords,
CAST(COUNT(*) AS DECIMAL(18)) AS Occurs
FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords
FROM T_XXYZZY_OUTBND_DTL
GROUP BY XXYZZY_OUTBND_ID) AS Temp
GROUP BY 1) AS Temp2
QUALIFY RANK(Occurs) = 1;
/*****
* NAME: Distribution - Record Count by AMP
* PURPOSE: This calculates where records would be placed on the AMPs, based on a specific index.
* It can show uneven distribution caused by a specific value or by no value (AMP 0).
*****/
SELECT HASHAMP(HASHBUCKET(HASHROW(XXYZZY_OUTBND_ID))) AS AmpNo,
CAST(COUNT(*) AS DECIMAL(18))
FROM T_XXYZZY_OUTBND_DTL
GROUP BY 1
ORDER BY 1;
/*****
* NAME: Distribution - Skew Factor
* PURPOSE: This calculates percentage of record skew between the AMP's, using both
* the WinDDI definition of skew, or Mike Buckland's definition. Both of these
* were originally defined using CurrentPerm usage.
* NOTE: For the fact table of a Star Schema, the second skew should be less than 25%.
*****/
SELECT 100 - (100 * AVG(NumberOfRecords)/MAX(NumberOfRecords)
(DECIMAL(18,2))) AS WinDDI,
(100 * MAX(NumberOfRecords) (DECIMAL(18,2)))/MIN(NumberOfRecords) - 100 AS Buckland
FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(XXYZZY_OUTBND_ID))) AS AmpNo,
CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords
FROM T_XXYZZY_OUTBND_DTL
GROUP BY 1) AS T1;
/*****
* NAME: Distribution - Exceeds 200 Summary
* PURPOSE: This shows the number of records with 200 or more records per specific index.
* It can be used to compare to the total number of records as a percentage.
*****/
SELECT SUM(NumberOfRecords)
FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords
FROM T_XXYZZY_OUTBND_DTL
GROUP BY XXYZZY_OUTBND_ID
HAVING NumberOfRecords > 199) AS T1;
I come from a time when the suggested number of duplicates had gone up from 100 per AMP to 200. The "rule of thumb" may have gone back
down to 100.
You can alter this script to check multi-column PI candidates. It's not sophisticated, just documented. And like I said, it exists because of
the collective wisdom of others.
Speaking of which, Mike Buckland of Teradata (who had some good ideas for how to implement star schemas in Teradata) suggested that with a
normal Teradata table, the criteria for selecting a PI was distribution and then joining, but with a star schema table it was joining and then
distribution, but distribution wasn't off the list. Ok, just my two cents, one penny of which came from better minds than mine. Of course, I
would love to see a discussion of these principles on the lista.
iv
| |