Re: PI on multicolumns? [From frank.c.martinez: Thu, 15 Dec 2011 @ 17:07 GMT]

Message Posted: Thu, 15 Dec 2011 @ 17:07:34 GMT

Subj:

Re: PI on multicolumns?

From:

frank.c.martinez

It's not a matter of "perfect distribution". There is a process, a criteria, for deciding on the PI. I have encapsulated my own opinions (but as Issac Newton said, "on the shoulders of giants") in the following SQL:

DATABASE DTNG_XXYZZY; /***** * PURPOSE: The number of records is needed in order to compare the * to the number of distinct values. Also, the number of * NULL's in the candidate index can be determined at the * same time. *****/ SELECT CAST(COUNT(*) AS DECIMAL(18)), SUM(CASE WHEN XXYZZY_OUTBND_ID IS NULL THEN CAST(1 AS DECIMAL(18)) ELSE 0 END) AS NullCnt FROM T_XXYZZY_OUTBND_DTL; /***** * NAME: ATM Statistics * PURPOSE: This calculates the maximum and average number of records per value * and the number of distinct values, for use in index selection analysis. * The average and max should be low (< 200 if possible) and the number of * distinct values should be high (some large percentage of the number of rows). *****/ SELECT AVG(TheCount) AS AvgRecs, MAX(TheCount) AS MaxRecs, CAST(COUNT(*) AS DECIMAL(18)) AS DistinctVals FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS TheCount FROM T_XXYZZY_OUTBND_DTL GROUP BY XXYZZY_OUTBND_ID) AS Temp; /***** * NAME: Distribution - Mode * PURPOSE: This determines the "mode" of the distribution of a specific index. * NOTE: The mode is the score in the population that occurs most frequently. * (Internet Glossary of Statistical Terms: http://www.animatedsoftware.com/statglos/sgmode.htm) * It should be low (as the average and max). *****/ SELECT NumberOfRecords, Occurs FROM (SELECT NumberOfRecords, CAST(COUNT(*) AS DECIMAL(18)) AS Occurs FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords FROM T_XXYZZY_OUTBND_DTL GROUP BY XXYZZY_OUTBND_ID) AS Temp GROUP BY 1) AS Temp2 QUALIFY RANK(Occurs) = 1; /***** * NAME: Distribution - Record Count by AMP * PURPOSE: This calculates where records would be placed on the AMPs, based on a specific index. * It can show uneven distribution caused by a specific value or by no value (AMP 0). *****/ SELECT HASHAMP(HASHBUCKET(HASHROW(XXYZZY_OUTBND_ID))) AS AmpNo, CAST(COUNT(*) AS DECIMAL(18)) FROM T_XXYZZY_OUTBND_DTL GROUP BY 1 ORDER BY 1; /***** * NAME: Distribution - Skew Factor * PURPOSE: This calculates percentage of record skew between the AMP's, using both * the WinDDI definition of skew, or Mike Buckland's definition. Both of these * were originally defined using CurrentPerm usage. * NOTE: For the fact table of a Star Schema, the second skew should be less than 25%. *****/ SELECT 100 - (100 * AVG(NumberOfRecords)/MAX(NumberOfRecords) (DECIMAL(18,2))) AS WinDDI, (100 * MAX(NumberOfRecords) (DECIMAL(18,2)))/MIN(NumberOfRecords) - 100 AS Buckland FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(XXYZZY_OUTBND_ID))) AS AmpNo, CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords FROM T_XXYZZY_OUTBND_DTL GROUP BY 1) AS T1; /***** * NAME: Distribution - Exceeds 200 Summary * PURPOSE: This shows the number of records with 200 or more records per specific index. * It can be used to compare to the total number of records as a percentage. *****/ SELECT SUM(NumberOfRecords) FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords FROM T_XXYZZY_OUTBND_DTL GROUP BY XXYZZY_OUTBND_ID HAVING NumberOfRecords > 199) AS T1;

I come from a time when the suggested number of duplicates had gone up from 100 per AMP to 200. The "rule of thumb" may have gone back down to 100.

You can alter this script to check multi-column PI candidates. It's not sophisticated, just documented. And like I said, it exists because of the collective wisdom of others.

Speaking of which, Mike Buckland of Teradata (who had some good ideas for how to implement star schemas in Teradata) suggested that with a normal Teradata table, the criteria for selecting a PI was distribution and then joining, but with a star schema table it was joining and then distribution, but distribution wasn't off the list. Ok, just my two cents, one penny of which came from better minds than mine. Of course, I would love to see a discussion of these principles on the lista.

Attachments

Library

Quick Reference

Archives of the TeradataForum

Message Posted: Thu, 15 Dec 2011 @ 17:07:34 GMT