Home Page for the TeradataForum
 

Archives of the TeradataForum

Message Posted: Fri, 18 Dec 2004 @ 07:19:51 GMT


     
  <Prev Next>  
<<First
<Prev
Next>
Last>>
 


Subj:   PI Assessment Stored Procedure - Calc Skew & Dupe PI Count Summary Statistics for Candidate PI's
 
From:   Terry Stover

This extremely useful stored procedure calculates table skew and PI duplicate summary statistics to support primary index selection. Inputs are the database.tablename and the comma separated candidate PI column list. Skew and variance are calculated using the hash functions, so the PI being evaluated does not have to be the actual table PI. The skew & variance calculations are lifted from the DB Design manual, the PI duplicate summary stats include min, max, average and std deviation of the number of rows per PI value.

When evaluating a new table I typically run the procedure for all the reasonable candidate PI's in one queryman session. The databasename, tablename and candidate PI list is included in the output to simplify tracking. DDL for the referenced tables is included at the bottom of the message. Don't forget to change the stored procedure databasename. The stored procedure runs dynamic SQL using DBC.SysExecSql, so the creator has to be the immediate owner of the stored procedure or you'll just generate errors.

     Input Parameters    Comment
     DBName.Tablename    Repeated in output for reference
     PICols              Comma separated list of candidate PI columns.
     Repeated in output.

     Output Parameters   Description
     row_skew            (Max Rows -  Avg Rows) / Avg Rows
     row_variance        (Max Rows -  Min Rows) / Min Rows
     amp_min_rows        Minimum rows on amp
     amp_max_rows        Maximum rows per amp
     amp_avg_rows        Average rows per amp
     PI_avg_rows         Average number of rows per PI value
     PI_min_rows         Minimum number of rows per PI value
     PI_max_rows         Maximum number of rows per PI value
     PI_row_stddev       Standard Deviation of number of rows per PI value

     REPLACE PROCEDURE userdb.Check_PI(INOUT dbtabname varchar(61), INOUT PICols varchar(200),
     OUT row_skew decimal(9,3),  OUT row_variance decimal(9,3),  OUT amp_min_rows
     integer,  OUT amp_max_rows integer, OUT amp_avg_rows decimal(18,3),
     OUT PI_avg_rows decimal(9,3),  OUT PI_min_rows integer,  OUT PI_max_rows
     integer,  OUT PI_row_stddev decimal(9,3)
     )
     BEGIN
     /*
     Author:  Terry Stover, Levi Strauss & Co.  2004-11-14.
              Comments & feedback are welcome: tstover@levi.com or terry.stover@comcast.net.

     This procedure runs the PI duplicate and table skew analysis to support
     primary index selection.

     Note that the candidate PI referenced when calling the procedure does not
     have to be the actual PI of the table being analyzed.

     The design guidelines are to keep number of rows per PI value < 100
     (reference V2R5.1 DB Design Manual, p 10-62, "Duplicate Row Checks for
     NUPI"),        and to keep row variance < 5% (reference V2R5.1 DB Design
     Manual, p 10-32, "Analyzing Column Distribution Demographics to Optimize
     PI Selection")

     Always remember that the primary criteria for PI are access paths and usage,
     and secondarily, the columns should not be volatile.


     Inputs are the table name (databasename.tablename) and candidate PI column
     list (comma separated).  The columns do not have to be the actual PI.
     References tables userdb.PI_dupe_stats and userdb.PI_skew to hold output of
     dynamic sql for loading into output parameters.

     Procedure is called with syntax:
        call userdb.Check_PI(., ,
                             row_skew , row_variance, amp_min_rows, amp_max_rows, amp_avg_rows,
                        --output parameters
                             PI_avg_rows , PI_min_rows, PI_max_rows, PI_row_stddev);
                        --output parameters

     Example

        call userdb.Check_PI('dw_data.monthly_sales', 'cust_id, prod_id, fiscal_month',
                             row_skew , row_variance, amp_min_rows, amp_max_rows, amp_avg_rows,
                             PI_avg_rows , PI_min_rows, PI_max_rows, PI_row_stddev);

     Multiple procedure calls can be run in a single queryman window to speed up analysis.

     Input Parameters    Comment
     DBName.Tablename    Repeated in output for reference
     PICols              Comma separated list of candidate PI columns.
     Repeated in output.

     Output Parameters   Description
     row_skew            (Max Rows -  Avg Rows) / Avg Rows
     row_variance        (Max Rows -  Min Rows) / Min Rows
     amp_min_rows        Minimum rows on amp
     amp_max_rows        Maximum rows per amp
     amp_avg_rows        Average rows per amp.
     PI_avg_rows         Average number of rows per PI value
     PI_min_rows         Minimum number of rows per PI value
     PI_max_rows         Maximum number of rows per PI value
     PI_row_stddev       Standard Deviation of number of rows per PI value

     */


     declare sqlstr varchar(3000);
     declare dbtabstr varchar(63);
     declare picolstr varchar(202);
     declare colgrp varchar(20);
     declare nbrcols byteint;
     declare commapos byteint;
     declare tmppi varchar(202);
     declare quote char(1);
     set quote = '''';
     set picolstr = quote || PICols || quote;
     set dbtabstr = quote || dbtabname ||  quote;
     set nbrcols = 1;
     set colgrp = '1';

     -- ############  This section calculate PI skew values  ############
     delete from userdb.PI_skew;
     select 'locking '|| :dbtabname || ' for access  '
      || 'insert into userdb.PI_skew(dbtabname, PI_columns, row_skew,
     row_variance, amp_min_rows, amp_max_rows, amp_avg_rows)'
      || 'select ' ||  :dbtabstr || ',' ||  :picolstr || ', cast((cast((maxrows -
     avgrows) as decimal(18,5))/(nullif(maxrows,0)))*100 as decimal(9,3)) AS Rowskew, '
      || ' cast( (cast((maxrows - minrows) as
     decimal(18,5))/(nullif(minrows,0)))*100 as decimal(9,3)) AS RowVariance,'
      || 'cast(MIN(sub.nbr_rows) as decimal(12,0)) AS minrows, '
      || 'cast(MAX(sub.nbr_rows) as decimal(12,0))  AS maxrows, '
      || 'cast( AVE(sub.nbr_rows)as decimal(18,5))  AS avgrows'
      || ' FROM '
      || '(select hashamp(hashbucket(hashrow( '
      || :PICols
      || '  ))) as primaryamp, count(*) nbr_rows '
      || 'from ' || :dbtabname
      || ' group by 1 ) sub ;' into :sqlstr;
     CALL dbc.sysexecsql(:sqlstr);


     -- ############  This section calculate PI dupe statistics  ############
     --count number of columns in PI based on commas and create group by clause
     set tmppi = PIcols;
     set commapos = INDEX( tmppi, ',');
     WHILE (characters(tmppi) > 1)  and (commapos > 1)  and (nbrcols < 10)  DO
            set nbrcols = nbrcols +1;
            set commapos = commapos + 1;
            set colgrp = colgrp || ', '|| cast(nbrcols as varchar(5));
            set tmppi = substring  (tmppi from commapos for  9999);
            set commapos = INDEX( tmppi, ',');
     END WHILE;

     delete from userdb.PI_dupe_stats;

     SELECT 'locking '|| :dbtabname || ' for access  '
     || 'INSERT INTO userdb.PI_Dupe_Stats(Dbtabname, Pi_Columns, Pi_Avg_Rows,
     Pi_Min_Rows, Pi_Max_Rows, Pi_Row_Stddev) '
     || ' SELECT ' ||  :dbtabstr || ' as Tabname,' ||  :picolstr || ' as
     PI_Columns, average(pi_rows) piavg,  min(pi_rows) minpi, max(pi_rows) maxpi,
     stddev_pop(pi_rows) stddevPI '
     || ' FROM '
     || ' ( '
     || '     SELECT  pi_rows,  count(*) nbr_PI  '
     || '     FROM (   '
     || '              SELECT  '
     ||                 :PICols
     || '               , count(*) pi_rows'
     || '              FROM   '
     ||                 :dbtabname
     || '              GROUP BY ' || :colgrp
     || '        ) pisub '
     || ' GROUP BY 1 '
     || ' ) statsub ; ' into :sqlstr;

     CALL dbc.sysexecsql(:sqlstr);

     -- ############  Load Values into ouput parameters  ############
     select PI_avg_rows, Pi_Row_Stddev, Pi_Min_Rows, Pi_Max_Rows  into
     :Pi_Avg_Rows,  :Pi_Row_Stddev, :Pi_Min_Rows, :Pi_Max_Rows  from
     userdb.PI_dupe_stats;
     select row_skew, row_variance, amp_min_rows, amp_max_rows,amp_avg_rows into
     :row_skew, :row_variance, :amp_min_rows, :amp_max_rows, :amp_avg_rows from
     userdb.PI_skew;

     END;



     /*
     -- Referenced Tables
     Create Table userdb.Pi_Dupe_Stats
     (Dbtabname Varchar(60),
      Pi_Columns Varchar(150),
      Pi_Avg_Rows Decimal(18,3),
      Pi_Row_Stddev Decimal(12,1),
      Pi_Min_Rows Decimal(12,0),
      Pi_Max_Rows Decimal(12,0)
      )
     Primary Index (Dbtabname, Pi_Columns);


     create table userdb.PI_skew
     (dbtabname varchar(60),
      PI_columns varchar(150),
      row_skew decimal(9,3),
      row_variance decimal(9,3),
      amp_min_rows decimal(12,0),
      amp_max_rows decimal(12,0),
      amp_avg_rows decimal(18,3)
      )
     primary index (Dbtabname, PI_columns);
     */


     
  <Prev Next>  
<<First
<Prev
Next>
Last>>
 
 
 
 
 
 
 
 
 
  
  Top Home Privacy Feedback  
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 27 Dec 2016