PI Assessment Stored Procedure - Calc Skew & Dupe PI Count Summary Statistics for Candidate PI's [From Terry Stover: Fri, 18 Dec 2004 @ 07:19 GMT]

Message Posted: Fri, 18 Dec 2004 @ 07:19:51 GMT

<Prev

Next>

<<First

<Prev

Next>

Last>>

Subj:

PI Assessment Stored Procedure - Calc Skew & Dupe PI Count Summary Statistics for Candidate PI's

From:

Terry Stover

This extremely useful stored procedure calculates table skew and PI duplicate summary statistics to support primary index selection. Inputs are the database.tablename and the comma separated candidate PI column list. Skew and variance are calculated using the hash functions, so the PI being evaluated does not have to be the actual table PI. The skew & variance calculations are lifted from the DB Design manual, the PI duplicate summary stats include min, max, average and std deviation of the number of rows per PI value.

When evaluating a new table I typically run the procedure for all the reasonable candidate PI's in one queryman session. The databasename, tablename and candidate PI list is included in the output to simplify tracking. DDL for the referenced tables is included at the bottom of the message. Don't forget to change the stored procedure databasename. The stored procedure runs dynamic SQL using DBC.SysExecSql, so the creator has to be the immediate owner of the stored procedure or you'll just generate errors.

Input Parameters Comment DBName.Tablename Repeated in output for reference PICols Comma separated list of candidate PI columns. Repeated in output. Output Parameters Description row_skew (Max Rows - Avg Rows) / Avg Rows row_variance (Max Rows - Min Rows) / Min Rows amp_min_rows Minimum rows on amp amp_max_rows Maximum rows per amp amp_avg_rows Average rows per amp PI_avg_rows Average number of rows per PI value PI_min_rows Minimum number of rows per PI value PI_max_rows Maximum number of rows per PI value PI_row_stddev Standard Deviation of number of rows per PI value

REPLACE PROCEDURE userdb.Check_PI(INOUT dbtabname varchar(61), INOUT PICols varchar(200), OUT row_skew decimal(9,3), OUT row_variance decimal(9,3), OUT amp_min_rows integer, OUT amp_max_rows integer, OUT amp_avg_rows decimal(18,3), OUT PI_avg_rows decimal(9,3), OUT PI_min_rows integer, OUT PI_max_rows integer, OUT PI_row_stddev decimal(9,3) ) BEGIN /* Author: Terry Stover, Levi Strauss & Co. 2004-11-14. Comments & feedback are welcome: [email protected] or [email protected]. This procedure runs the PI duplicate and table skew analysis to support primary index selection. Note that the candidate PI referenced when calling the procedure does not have to be the actual PI of the table being analyzed. The design guidelines are to keep number of rows per PI value < 100 (reference V2R5.1 DB Design Manual, p 10-62, "Duplicate Row Checks for NUPI"), and to keep row variance < 5% (reference V2R5.1 DB Design Manual, p 10-32, "Analyzing Column Distribution Demographics to Optimize PI Selection") Always remember that the primary criteria for PI are access paths and usage, and secondarily, the columns should not be volatile. Inputs are the table name (databasename.tablename) and candidate PI column list (comma separated). The columns do not have to be the actual PI. References tables userdb.PI_dupe_stats and userdb.PI_skew to hold output of dynamic sql for loading into output parameters. Procedure is called with syntax: call userdb.Check_PI(., , row_skew , row_variance, amp_min_rows, amp_max_rows, amp_avg_rows, --output parameters PI_avg_rows , PI_min_rows, PI_max_rows, PI_row_stddev); --output parameters Example call userdb.Check_PI('dw_data.monthly_sales', 'cust_id, prod_id, fiscal_month', row_skew , row_variance, amp_min_rows, amp_max_rows, amp_avg_rows, PI_avg_rows , PI_min_rows, PI_max_rows, PI_row_stddev); Multiple procedure calls can be run in a single queryman window to speed up analysis. Input Parameters Comment DBName.Tablename Repeated in output for reference PICols Comma separated list of candidate PI columns. Repeated in output. Output Parameters Description row_skew (Max Rows - Avg Rows) / Avg Rows row_variance (Max Rows - Min Rows) / Min Rows amp_min_rows Minimum rows on amp amp_max_rows Maximum rows per amp amp_avg_rows Average rows per amp. PI_avg_rows Average number of rows per PI value PI_min_rows Minimum number of rows per PI value PI_max_rows Maximum number of rows per PI value PI_row_stddev Standard Deviation of number of rows per PI value */ declare sqlstr varchar(3000); declare dbtabstr varchar(63); declare picolstr varchar(202); declare colgrp varchar(20); declare nbrcols byteint; declare commapos byteint; declare tmppi varchar(202); declare quote char(1); set quote = ''''; set picolstr = quote || PICols || quote; set dbtabstr = quote || dbtabname || quote; set nbrcols = 1; set colgrp = '1'; -- ############ This section calculate PI skew values ############ delete from userdb.PI_skew; select 'locking '|| :dbtabname || ' for access ' || 'insert into userdb.PI_skew(dbtabname, PI_columns, row_skew, row_variance, amp_min_rows, amp_max_rows, amp_avg_rows)' || 'select ' || :dbtabstr || ',' || :picolstr || ', cast((cast((maxrows - avgrows) as decimal(18,5))/(nullif(maxrows,0)))*100 as decimal(9,3)) AS Rowskew, ' || ' cast( (cast((maxrows - minrows) as decimal(18,5))/(nullif(minrows,0)))*100 as decimal(9,3)) AS RowVariance,' || 'cast(MIN(sub.nbr_rows) as decimal(12,0)) AS minrows, ' || 'cast(MAX(sub.nbr_rows) as decimal(12,0)) AS maxrows, ' || 'cast( AVE(sub.nbr_rows)as decimal(18,5)) AS avgrows' || ' FROM ' || '(select hashamp(hashbucket(hashrow( ' || :PICols || ' ))) as primaryamp, count(*) nbr_rows ' || 'from ' || :dbtabname || ' group by 1 ) sub ;' into :sqlstr; CALL dbc.sysexecsql(:sqlstr); -- ############ This section calculate PI dupe statistics ############ --count number of columns in PI based on commas and create group by clause set tmppi = PIcols; set commapos = INDEX( tmppi, ','); WHILE (characters(tmppi) > 1) and (commapos > 1) and (nbrcols < 10) DO set nbrcols = nbrcols +1; set commapos = commapos + 1; set colgrp = colgrp || ', '|| cast(nbrcols as varchar(5)); set tmppi = substring (tmppi from commapos for 9999); set commapos = INDEX( tmppi, ','); END WHILE; delete from userdb.PI_dupe_stats; SELECT 'locking '|| :dbtabname || ' for access ' || 'INSERT INTO userdb.PI_Dupe_Stats(Dbtabname, Pi_Columns, Pi_Avg_Rows, Pi_Min_Rows, Pi_Max_Rows, Pi_Row_Stddev) ' || ' SELECT ' || :dbtabstr || ' as Tabname,' || :picolstr || ' as PI_Columns, average(pi_rows) piavg, min(pi_rows) minpi, max(pi_rows) maxpi, stddev_pop(pi_rows) stddevPI ' || ' FROM ' || ' ( ' || ' SELECT pi_rows, count(*) nbr_PI ' || ' FROM ( ' || ' SELECT ' || :PICols || ' , count(*) pi_rows' || ' FROM ' || :dbtabname || ' GROUP BY ' || :colgrp || ' ) pisub ' || ' GROUP BY 1 ' || ' ) statsub ; ' into :sqlstr; CALL dbc.sysexecsql(:sqlstr); -- ############ Load Values into ouput parameters ############ select PI_avg_rows, Pi_Row_Stddev, Pi_Min_Rows, Pi_Max_Rows into :Pi_Avg_Rows, :Pi_Row_Stddev, :Pi_Min_Rows, :Pi_Max_Rows from userdb.PI_dupe_stats; select row_skew, row_variance, amp_min_rows, amp_max_rows,amp_avg_rows into :row_skew, :row_variance, :amp_min_rows, :amp_max_rows, :amp_avg_rows from userdb.PI_skew; END; /* -- Referenced Tables Create Table userdb.Pi_Dupe_Stats (Dbtabname Varchar(60), Pi_Columns Varchar(150), Pi_Avg_Rows Decimal(18,3), Pi_Row_Stddev Decimal(12,1), Pi_Min_Rows Decimal(12,0), Pi_Max_Rows Decimal(12,0) ) Primary Index (Dbtabname, Pi_Columns); create table userdb.PI_skew (dbtabname varchar(60), PI_columns varchar(150), row_skew decimal(9,3), row_variance decimal(9,3), amp_min_rows decimal(12,0), amp_max_rows decimal(12,0), amp_avg_rows decimal(18,3) ) primary index (Dbtabname, PI_columns); */

<Prev

Next>

<<First

<Prev

Next>

Last>>

Attachments

Library

Quick Reference

Archives of the TeradataForum

Message Posted: Fri, 18 Dec 2004 @ 07:19:51 GMT