|
|
Archives of the TeradataForum
Message Posted: Fri, 18 Dec 2004 @ 07:19:51 GMT
Subj: | | PI Assessment Stored Procedure - Calc Skew & Dupe PI Count Summary Statistics for Candidate PI's |
|
From: | | Terry Stover |
This extremely useful stored procedure calculates table skew and PI duplicate summary statistics to support primary index selection. Inputs
are the database.tablename and the comma separated candidate PI column list. Skew and variance are calculated using the hash functions, so the PI
being evaluated does not have to be the actual table PI. The skew & variance calculations are lifted from the DB Design manual, the PI duplicate
summary stats include min, max, average and std deviation of the number of rows per PI value.
When evaluating a new table I typically run the procedure for all the reasonable candidate PI's in one queryman session. The databasename,
tablename and candidate PI list is included in the output to simplify tracking. DDL for the referenced tables is included at the bottom of the
message. Don't forget to change the stored procedure databasename. The stored procedure runs dynamic SQL using DBC.SysExecSql, so the creator
has to be the immediate owner of the stored procedure or you'll just generate errors.
Input Parameters Comment
DBName.Tablename Repeated in output for reference
PICols Comma separated list of candidate PI columns.
Repeated in output.
Output Parameters Description
row_skew (Max Rows - Avg Rows) / Avg Rows
row_variance (Max Rows - Min Rows) / Min Rows
amp_min_rows Minimum rows on amp
amp_max_rows Maximum rows per amp
amp_avg_rows Average rows per amp
PI_avg_rows Average number of rows per PI value
PI_min_rows Minimum number of rows per PI value
PI_max_rows Maximum number of rows per PI value
PI_row_stddev Standard Deviation of number of rows per PI value
REPLACE PROCEDURE userdb.Check_PI(INOUT dbtabname varchar(61), INOUT PICols varchar(200),
OUT row_skew decimal(9,3), OUT row_variance decimal(9,3), OUT amp_min_rows
integer, OUT amp_max_rows integer, OUT amp_avg_rows decimal(18,3),
OUT PI_avg_rows decimal(9,3), OUT PI_min_rows integer, OUT PI_max_rows
integer, OUT PI_row_stddev decimal(9,3)
)
BEGIN
/*
Author: Terry Stover, Levi Strauss & Co. 2004-11-14.
Comments & feedback are welcome: [email protected] or [email protected].
This procedure runs the PI duplicate and table skew analysis to support
primary index selection.
Note that the candidate PI referenced when calling the procedure does not
have to be the actual PI of the table being analyzed.
The design guidelines are to keep number of rows per PI value < 100
(reference V2R5.1 DB Design Manual, p 10-62, "Duplicate Row Checks for
NUPI"), and to keep row variance < 5% (reference V2R5.1 DB Design
Manual, p 10-32, "Analyzing Column Distribution Demographics to Optimize
PI Selection")
Always remember that the primary criteria for PI are access paths and usage,
and secondarily, the columns should not be volatile.
Inputs are the table name (databasename.tablename) and candidate PI column
list (comma separated). The columns do not have to be the actual PI.
References tables userdb.PI_dupe_stats and userdb.PI_skew to hold output of
dynamic sql for loading into output parameters.
Procedure is called with syntax:
call userdb.Check_PI(., ,
row_skew , row_variance, amp_min_rows, amp_max_rows, amp_avg_rows,
--output parameters
PI_avg_rows , PI_min_rows, PI_max_rows, PI_row_stddev);
--output parameters
Example
call userdb.Check_PI('dw_data.monthly_sales', 'cust_id, prod_id, fiscal_month',
row_skew , row_variance, amp_min_rows, amp_max_rows, amp_avg_rows,
PI_avg_rows , PI_min_rows, PI_max_rows, PI_row_stddev);
Multiple procedure calls can be run in a single queryman window to speed up analysis.
Input Parameters Comment
DBName.Tablename Repeated in output for reference
PICols Comma separated list of candidate PI columns.
Repeated in output.
Output Parameters Description
row_skew (Max Rows - Avg Rows) / Avg Rows
row_variance (Max Rows - Min Rows) / Min Rows
amp_min_rows Minimum rows on amp
amp_max_rows Maximum rows per amp
amp_avg_rows Average rows per amp.
PI_avg_rows Average number of rows per PI value
PI_min_rows Minimum number of rows per PI value
PI_max_rows Maximum number of rows per PI value
PI_row_stddev Standard Deviation of number of rows per PI value
*/
declare sqlstr varchar(3000);
declare dbtabstr varchar(63);
declare picolstr varchar(202);
declare colgrp varchar(20);
declare nbrcols byteint;
declare commapos byteint;
declare tmppi varchar(202);
declare quote char(1);
set quote = '''';
set picolstr = quote || PICols || quote;
set dbtabstr = quote || dbtabname || quote;
set nbrcols = 1;
set colgrp = '1';
-- ############ This section calculate PI skew values ############
delete from userdb.PI_skew;
select 'locking '|| :dbtabname || ' for access '
|| 'insert into userdb.PI_skew(dbtabname, PI_columns, row_skew,
row_variance, amp_min_rows, amp_max_rows, amp_avg_rows)'
|| 'select ' || :dbtabstr || ',' || :picolstr || ', cast((cast((maxrows -
avgrows) as decimal(18,5))/(nullif(maxrows,0)))*100 as decimal(9,3)) AS Rowskew, '
|| ' cast( (cast((maxrows - minrows) as
decimal(18,5))/(nullif(minrows,0)))*100 as decimal(9,3)) AS RowVariance,'
|| 'cast(MIN(sub.nbr_rows) as decimal(12,0)) AS minrows, '
|| 'cast(MAX(sub.nbr_rows) as decimal(12,0)) AS maxrows, '
|| 'cast( AVE(sub.nbr_rows)as decimal(18,5)) AS avgrows'
|| ' FROM '
|| '(select hashamp(hashbucket(hashrow( '
|| :PICols
|| ' ))) as primaryamp, count(*) nbr_rows '
|| 'from ' || :dbtabname
|| ' group by 1 ) sub ;' into :sqlstr;
CALL dbc.sysexecsql(:sqlstr);
-- ############ This section calculate PI dupe statistics ############
--count number of columns in PI based on commas and create group by clause
set tmppi = PIcols;
set commapos = INDEX( tmppi, ',');
WHILE (characters(tmppi) > 1) and (commapos > 1) and (nbrcols < 10) DO
set nbrcols = nbrcols +1;
set commapos = commapos + 1;
set colgrp = colgrp || ', '|| cast(nbrcols as varchar(5));
set tmppi = substring (tmppi from commapos for 9999);
set commapos = INDEX( tmppi, ',');
END WHILE;
delete from userdb.PI_dupe_stats;
SELECT 'locking '|| :dbtabname || ' for access '
|| 'INSERT INTO userdb.PI_Dupe_Stats(Dbtabname, Pi_Columns, Pi_Avg_Rows,
Pi_Min_Rows, Pi_Max_Rows, Pi_Row_Stddev) '
|| ' SELECT ' || :dbtabstr || ' as Tabname,' || :picolstr || ' as
PI_Columns, average(pi_rows) piavg, min(pi_rows) minpi, max(pi_rows) maxpi,
stddev_pop(pi_rows) stddevPI '
|| ' FROM '
|| ' ( '
|| ' SELECT pi_rows, count(*) nbr_PI '
|| ' FROM ( '
|| ' SELECT '
|| :PICols
|| ' , count(*) pi_rows'
|| ' FROM '
|| :dbtabname
|| ' GROUP BY ' || :colgrp
|| ' ) pisub '
|| ' GROUP BY 1 '
|| ' ) statsub ; ' into :sqlstr;
CALL dbc.sysexecsql(:sqlstr);
-- ############ Load Values into ouput parameters ############
select PI_avg_rows, Pi_Row_Stddev, Pi_Min_Rows, Pi_Max_Rows into
:Pi_Avg_Rows, :Pi_Row_Stddev, :Pi_Min_Rows, :Pi_Max_Rows from
userdb.PI_dupe_stats;
select row_skew, row_variance, amp_min_rows, amp_max_rows,amp_avg_rows into
:row_skew, :row_variance, :amp_min_rows, :amp_max_rows, :amp_avg_rows from
userdb.PI_skew;
END;
/*
-- Referenced Tables
Create Table userdb.Pi_Dupe_Stats
(Dbtabname Varchar(60),
Pi_Columns Varchar(150),
Pi_Avg_Rows Decimal(18,3),
Pi_Row_Stddev Decimal(12,1),
Pi_Min_Rows Decimal(12,0),
Pi_Max_Rows Decimal(12,0)
)
Primary Index (Dbtabname, Pi_Columns);
create table userdb.PI_skew
(dbtabname varchar(60),
PI_columns varchar(150),
row_skew decimal(9,3),
row_variance decimal(9,3),
amp_min_rows decimal(12,0),
amp_max_rows decimal(12,0),
amp_avg_rows decimal(18,3)
)
primary index (Dbtabname, PI_columns);
*/
| |