Home Page for the TeradataForum
 

Archives of the TeradataForum

Message Posted: Thu, 15 Dec 2011 @ 17:07:34 GMT


     
  <Prev Next>   <<First <Prev
Next>
Last>>
 


Subj:   Re: PI on multicolumns?
 
From:   frank.c.martinez

It's not a matter of "perfect distribution". There is a process, a criteria, for deciding on the PI. I have encapsulated my own opinions (but as Issac Newton said, "on the shoulders of giants") in the following SQL:

     DATABASE DTNG_XXYZZY;
     /*****
      * PURPOSE: The number of records is needed in order to compare the
      *          to the number of distinct values.  Also, the number of
      *          NULL's in the candidate index can be determined at the
      *          same time.
      *****/
     SELECT CAST(COUNT(*) AS DECIMAL(18)),
            SUM(CASE
                   WHEN XXYZZY_OUTBND_ID IS NULL
                      THEN CAST(1 AS DECIMAL(18))
                      ELSE 0
                   END) AS NullCnt
       FROM T_XXYZZY_OUTBND_DTL;

     /*****
      * NAME:    ATM Statistics
      * PURPOSE: This calculates the maximum and average number of records per value
      *          and the number of distinct values, for use in index selection analysis.
      *          The average and max should be low (< 200 if possible) and the number of
      *          distinct values should be high (some large percentage of the number of rows).
      *****/
     SELECT AVG(TheCount)   AS AvgRecs,
            MAX(TheCount)   AS MaxRecs,
            CAST(COUNT(*) AS DECIMAL(18))       AS DistinctVals
       FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS TheCount
               FROM T_XXYZZY_OUTBND_DTL
              GROUP BY XXYZZY_OUTBND_ID) AS Temp;

     /*****
      * NAME:    Distribution  - Mode
      * PURPOSE: This determines the "mode" of the distribution of a specific index.
      * NOTE:    The mode is the score in the population that occurs most frequently.
      *          (Internet Glossary of Statistical Terms:
     http://www.animatedsoftware.com/statglos/sgmode.htm)
      *          It should be low (as the average and max).
      *****/
     SELECT NumberOfRecords, Occurs
       FROM (SELECT NumberOfRecords,
                    CAST(COUNT(*) AS DECIMAL(18))         AS Occurs
               FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords
                       FROM T_XXYZZY_OUTBND_DTL
                      GROUP BY XXYZZY_OUTBND_ID) AS Temp
              GROUP BY 1) AS Temp2
     QUALIFY RANK(Occurs) = 1;

     /*****
      * NAME:    Distribution - Record Count by AMP
      * PURPOSE: This calculates where records would be placed on the AMPs, based on a specific index.
      *          It can show uneven distribution caused by a specific value or by no value (AMP 0).
      *****/
     SELECT HASHAMP(HASHBUCKET(HASHROW(XXYZZY_OUTBND_ID))) AS AmpNo,
            CAST(COUNT(*) AS DECIMAL(18))
       FROM T_XXYZZY_OUTBND_DTL
      GROUP BY 1
      ORDER BY 1;

     /*****
      * NAME:    Distribution - Skew Factor
      * PURPOSE: This calculates percentage of record skew between the AMP's, using both
      *          the WinDDI definition of skew, or Mike Buckland's definition. Both of these
      *          were originally defined using CurrentPerm usage.
      * NOTE:    For the fact table of a Star Schema, the second skew should be less than 25%.
      *****/
     SELECT 100 - (100 * AVG(NumberOfRecords)/MAX(NumberOfRecords)
     (DECIMAL(18,2))) AS WinDDI,
            (100 * MAX(NumberOfRecords) (DECIMAL(18,2)))/MIN(NumberOfRecords) - 100 AS Buckland
       FROM (SELECT HASHAMP(HASHBUCKET(HASHROW(XXYZZY_OUTBND_ID))) AS AmpNo,
                    CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords
               FROM T_XXYZZY_OUTBND_DTL
              GROUP BY 1) AS T1;

     /*****
      * NAME:    Distribution - Exceeds 200 Summary
      * PURPOSE: This shows the number of records with 200 or more records per specific index.
      *          It can be used to compare to the total number of records as a percentage.
      *****/
     SELECT SUM(NumberOfRecords)
       FROM (SELECT CAST(COUNT(*) AS DECIMAL(18)) AS NumberOfRecords
               FROM T_XXYZZY_OUTBND_DTL
              GROUP BY XXYZZY_OUTBND_ID
             HAVING NumberOfRecords > 199) AS T1;

I come from a time when the suggested number of duplicates had gone up from 100 per AMP to 200. The "rule of thumb" may have gone back down to 100.

You can alter this script to check multi-column PI candidates. It's not sophisticated, just documented. And like I said, it exists because of the collective wisdom of others.

Speaking of which, Mike Buckland of Teradata (who had some good ideas for how to implement star schemas in Teradata) suggested that with a normal Teradata table, the criteria for selecting a PI was distribution and then joining, but with a star schema table it was joining and then distribution, but distribution wasn't off the list. Ok, just my two cents, one penny of which came from better minds than mine. Of course, I would love to see a discussion of these principles on the lista.


iv



     
  <Prev Next>   <<First <Prev
Next>
Last>>
 
 
 
 
 
 
 
 
 
  
  Top Home Privacy Feedback  
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 15 Jun 2023