The procedures in this document were based on prior work by Colby Guilbeau and were shamelessly stolen from his work.
Teradata Multi-value Compression is a new feature of V2R5.0 which is an expansion of the older single value compression. This
procedure is to enable you to perform an Analysis to determine what compression gains you can get and select values to compress.
Note: Primary Key Columns, and VARCHARs cannot be compressed.
Note: This compression requires knowledge of specific column values. Therefore, you must have data in a table to compress it.
Future Enhancements:
1. Building a method of comparing previous compression candidate values to the current table demographics to flag
tables that should be recompressed.
2. Enhancing this procedure to make the analysis automatic. Kim Lee will be heading up this effort.
3. Add logic to calculate whether a VARCHAR column should be converted to a CHAR and compressed.
Tools for the Analysis:
The code for the following Table and Macros can be found in Appendix A
Tables
DBA.T_CMPR_CAND (Compression Candidates)
This holds the counts for each value in the columns compared. It is a historical record to enable enhancement 1 to
occur. The use of this table eliminates a second pass on the source tables to build the compression value list.
Macros
DBA.M_GET_COL_COUT_CMD (Get Column Count Commands)
Builds the calls to the stored procedure SYSSTP.COL_COUT. Given a database, it will generate these commands for every
table and column combination that can be compressed. Primary Key columns and VARCHARs are not included.
DBA.M_GET_CMPR_RSLT (Get Compression Results)
Gets the cost savings for each compression number of values plateau (1,3,7,15,31,63, 127, and 255). These numbers are
recommended by NCR since the compression implementation adds an addition byte once you exceed that level. You cannot compress more
than 255 values. This builds a spreadsheet loadable list (one row per database, table, and column) that lists number of rows
compressed (RANKxxx), amount of space saved (SAVINGSxxx), and the DELTA from the next lower value (DELTAxxx). This data is used to
make compression decisions.
DBA.M_BLD_CMPR_LIST (Build Compression List)
Gives you the actual values to compress for a given database, table, column, and rank (or number of values to
compress). You use this output to modify the original Table DDL to build the compressed version of the table.
Stored Procedures
SYSSTP.COL_COUT (Column Counter)
Build the table DBA.T_CMPR_CAND. It selects the frequency distribution of each value for a given, database, table, and
column.
The code for the Stored Procedure can be found in Appendix B
Procedure:
Step 1.
Set up the frequency counters.
Input: Which database do you want to compress.
Output: A list of the calls to SYSTP.COL_COUT
Execute DBA.M_GET_COL_COUT_CMD for your database to build the calls to collect column demographics for every table in
the database.
Example:
Execute DBA.M_GET_COL_COUT_CMD('DB1') gives you (list abbreviated for clarity):
CALL SYSSTP.COL_COUT ('DB1','TABLE1','COLUMN8');
CALL SYSSTP.COL_COUT ('DB1','TABLE1','COLUMN10');
CALL SYSSTP.COL_COUT ('DB1','TABLE1','COLUMN9');
CALL SYSSTP.COL_COUT ('DB1','TABLE2','COLUMN8');
CALL SYSSTP.COL_COUT ('DB1','TABLE2','COLUMN9');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN8');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN1');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN6');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN5');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN7');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN2');
CALL SYSSTP.COL_COUT ('DB1','TABLE3','COLUMN9');
You can either proceed to Step 2 or cull this list for tables/clumns that you want to analyze. This step takes
seconds (16 seconds for our test run on TD12).
Step 2.
Perform the demographics step (be prepared to wait a while)
Input: List of commands from step 1.
Output: Entries in table DBA.T_CMPR_CAND for the databases, tables, and columns requested.
Run the list of commands. This takes a while. For the entire XEWI410T database (on TD12) it took 27 minutes and 31
seconds.
Note: To analyze VARCHAR or Primary Key values, there is nothing stopping you from manually build these commands and
running the analysis on the columns.
Step 3.
Get the analysis results for the database.
Input: DBA.T_CMPR_CAND.
Output: A list of table columns with the space saving possibilities.
Execute M_GET_CMPR_RSLT for your database and collection date to get your input for analysis.
Example: Execute M_GET_CMPR_RSLT('DB1',date). Note: If step 3 crosses over a date boundary, then you may have to run
this command once for each possible date. Note I found it easier to run this step from SQL Assistant and move this data into a
spreadsheet
Output (modified for clarity):
COMPILE_ DB TABLE COLUMN Col NUM_ MAX RANK SAVINGS RANK SAVINGS DELTA
DATE NAME NAME NAME Len ROWS SIZE 001 001 003 003 003
-------- ---- ------ ------- --- ---- ---- ---- ------- ---- ------- -----
1/8/2004 DB1 TABLE1 COLUMN1 4 39 156 20 78 39 146 69
1/8/2004 DB1 TABLE1 COLUMN2 4 39 156 11 43 32 120 77
1/8/2004 DB1 TABLE1 COLUMN3 4 39 156 0 0 3 11 11
1/8/2004 DB1 TABLE2 COLUMN1 4 6 24 6 23 6 23 -1
1/8/2004 DB1 TABLE2 COLUMN2 4 6 24 0 0 0 0 0
1/8/2004 DB1 TABLE3 COLUMN1 4 6774 27096 1,423 5,514 2,077 7,789 2,275
1/8/2004 DB1 TABLE3 COLUMN2 3 6774 20322 263 756 711 1,955 1,199
1/8/2004 DB1 TABLE3 COLUMN3 1 6774 6774 3,532 3,091 6,774 5,081 1,990
1/8/2004 DB1 TABLE3 COLUMN4 4 6774 27096 5,630 21,816 6,774 25,403 3,586
1/8/2004 DB1 TABLE3 COLUMN5 1 6774 6774 3,782 3,309 6,774 5,081 1,771
1/8/2004 DB1 TABLE3 COLUMN6 3 6774 20322 659 1,895 1,625 4,469 2,574
1/8/2004 DB1 TABLE3 COLUMN7 4 6774 27096 0 0 846 3,173 3,173
This step took 3 seconds.
Step 4.
Analysis using your MACH 1 eyeballs.
Input: Spreadsheet with values from Step 3.
Output: Tables, Columns, and the number of values to compress.
Note: There are two concerns that cannot be addressed in this analysis data.
1. The table header has a limit of 64K. If you exceed that limit then in Step xx. The table build will fail. If this
happens, reevaluate your analysis, accepting less compression to fit within this limit.
2. If the data demographics are very volatile, then the compression may quickly become invalid and could hurt you. If
you know that the column demographics change drastically over time, either skip this column or plan to reevaluate the compression on
a regular basis.
For each compression level, the spreadsheet gives you how many rows are compressed (RANKxxx), the space savings for
that level (compared to no compression) (SAVINGSxxx), and the difference in savings between this level and the previous level
(DELTAxxx). This is the gain that you get by moving to this level of compression. Note; There is no DELTA001 since SAVINGS001 is
the same value.
This obvious point to compress is the last positive DELTAxxx. If the DELTA < 0, then you are actually using more
space at that level of compression than at the lower level. However, you might decide to stop at a low positive value to simplify
the DDL and keep the header from exceeding its 64K limit.
Example:
Looking at a couple of columns in DB1.TABLE1 (COLUMN1 and COLUMN2)
COLUMN1 is 3 bytes long. There are 6774 rows in the table. Therefore an uncompressed column would take 20,322 bytes
(MAXSIZE). Thus this column contributes 20,322 bytes to the total size of the table.
In order to exhibit the logic, I reformatted the spreadsheet data.
Rank RANKxxx SAVINGSxxx DELTAxxx
---- ------- ---------- --------
001 263 756
003 711 1,955 1,199
007 1,271 3,336 1,381
015 1,822 4,555 1,219
031 2,426 5,762 1,207
063 3,136 7,056 1,294
127 5,188 8,604 1,548
255 5,977 9,906 1,302
Now since DELTAxxx is always positive and the values are very similar, I would probably choose to compress 255
values. I would keep this one in mind in case you run out of header space.
COLUMN2 is 4 bytes long. There are 6774 rows in the table. Therefore an uncompressed column would take 27,096 bytes
(MAXSIZE). Thus this column contributes 27,096 bytes to the total size of the table.
In order to exhibit the logic, I reformatted the spreadsheet data.
Rank RANKxxx SAVINGSxxx DELTAxxx
---- ------- ---------- --------
001 5,630 21,816
003 6,774 25,043 3,586
007 6,774 24,556 -847
015 6,774 23,709 -847
031 6,774 22,862 -847
063 6,774 22,016 -847
127 6,774 21,169 -847
255 6,774 20,322 -847
Obviously, you would never choose to compress more than three values, since by compressing 3 values, you get 100%
compression. The bulk of your compression is gained by compressing the first value.
Less obvious, unless you compare all of the different columns.this is the best column to compress.
This step will probably take a few minutes per table.
Step 5.
Getting the column values to compress.
Input: DBA.T_CMPR_CAND and the decisions made (database, table, column, and Compression Rank) from Step 4.
Output: Compression Values
Execute DBA.M_BLD_CMPR_LIST to get the list of columns and the values to compress in a format that will allow you to
quickly modify the table DDL to build a compressed table. Vairables are (in order): Databasename, Tablename, Columnname, number of
values to compress, and columnlength. Note: I had some problems with strings that ended in spaces. The columnlength was an attempt
to solve this problem. It didn't seem to work, so I have a little bit of cleanup to do with character fields.
Example
EXECUTE DBA.M_BLD_CMPR_LIST('DB1','TABLE3','COLUMN5',3,date, 9)
COLUMNNAME CASE expression
---------- ---------------
COLUMN5 ,1.
COLUMN5 ,3.
This step took 1 second.
Character columns will have single quotes automatically added to the value, e.g., 'ALPHA1'.
Step 6.
Modifying the DDL
INPUT: The results of Step 5 (electronically) and the table schema. I used get_table_changes to pull out the DDL.
OUTPUT: Compressed DDL.
a. After the column definition add the text "COMPRESS("
b. Copy the values from the second column in Step 5 after the open parenthesis. This gives you one value per line
which is very easy to read and will not cause problems with bteq when performing a show table. Delete the first column.
c. Add a closed parenthesis ")" after the last value.
d. Build the new table and copy over your data and clean up.
Note: Obviously, if you are recompressing an compressed column, you can ignore a and c.
Note: If you use get_table_changes (see below) the rename of the uncompressed tabe and the data copy are taken care
of. Then all you need to do is drop the renamed table when finished. Otherwise, you have to take care of everything.
Example:
Using get_table_changes.
Before Editing
RENAME TABLE DB1.TABLE1 AS DB1.TABLE1_GKB;
CREATE SET TABLE DB1.TABLE1 ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
COLUMN0 CHAR(4) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN1 CHAR(3) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN2 CHAR(3) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN3 VARCHAR(60) CHARACTER SET LATIN NOT CASESPECIFIC,
COLUMN4 VARCHAR(25) CHARACTER SET LATIN NOT CASESPECIFIC,
COLUMN5 DECIMAL(5,0) NOT NULL,
COLUMN6 CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN7 CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN8 DATE FORMAT 'DD-MMM-YYYY' NOT NULL,
COLUMN9 DATE FORMAT 'DD-MMM-YYYY')
UNIQUE PRIMARY INDEX XPKTABLE3 ( COLUMN0 )
INDEX XIF336TABLE3 ( COLUMN1 )
INDEX XIF337TABLE3 ( COLUMN2 )
INDEX XIF338TABLE3 ( COLUMN5 );
INSERT INTO DB1.TABLE3 SELECT * FROM DB1.TABLE3_GKB;
REVOKE ALL ON DB1.TABLE3 FROM O10782;
After Editing
RENAME TABLE DB1.TABLE3 AS DB1.TABLE3_GKB;
CREATE SET TABLE DB1.TABLE3 ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
COLUMN0 CHAR(4) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN1 CHAR(3) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN2 CHAR(3) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN3 VARCHAR(60) CHARACTER SET LATIN NOT CASESPECIFIC,
COLUMN4 VARCHAR(25) CHARACTER SET LATIN NOT CASESPECIFIC,
COLUMN5 DECIMAL(5,0) NOT NULL COMPRESS (1.,3.),
COLUMN6 CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN7 CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
COLUMN8 DATE FORMAT 'DD-MMM-YYYY' NOT NULL,
COLUMN9 DATE FORMAT 'DD-MMM-YYYY')
UNIQUE PRIMARY INDEX XPKTABLE3 ( COLUMN0 )
INDEX XIF336TABLE3 ( COLUMN1 )
INDEX XIF337TABLE3 ( COLUMN2 )
INDEX XIF338TABLE3 ( COLUMN5 );
INSERT INTO DB1.TABLE1 SELECT * FROM DB1.TABLE1_GKB;
REVOKE ALL ON DB1.TABLE1 FROM O10782;
Now, run the above script and you end up with two tables DB1.TABLE1 is compressed and DB1.TABLE1_GKB is the
original table.
If you are satisfied that the data in your compressed table is valid, drop DB1.TABLE1_GKB.
Appendix A: Code for Tables and Macros
CREATE SET TABLE DBA.T_CMPR_CAND,
NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
( DATABASENAME VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
TABLENAME VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
COLUMNNAME VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
VAL_TXT VARCHAR(3000) CHARACTER SET LATIN NOT CASESPECIFIC,
COUT_NBR DECIMAL(18,0),
COMPILE_DT DATE FORMAT 'YY/MM/DD' DEFAULT DATE ,
RANK_NBR INTEGER)
UNIQUE PRIMARY INDEX UPI ( DATABASENAME, TABLENAME,
COLUMNNAME, VAL_TXT, COMPILE_DT )
INDEX NUSI1 ( DATABASENAME, TABLENAME, COLUMNNAME );
REPLACE MACRO DBA.M_GET_COL_COUT_CMD
( DATABASENAME VARCHAR(30)
)
AS ( select 'CALL SYSSTP.COL_COUT
(''' || TRIM(DATABASENAME) || ''','''
|| TRIM(TABLENAME) || ''','''
|| TRIM(COLUMNNAME) || ''');'
from dbc.columns
where databasename = :DATABASENAME
AND columntype in ('I', 'DA', 'D', 'CF', 'F')
AND (databasename, tablename, columnname)
NOT IN (SELECT databasename, tablename, columnname
FROM DBC.indices
where indextype in ('K', 'P') )
AND (DATABASENAME, TABLENAME)
IN (SELECT DATABASENAME, TABLENAME
FROM DBC.TABLES
WHERE TABLEKIND = 'T')
ORDER BY 1;
);
REPLACE MACRO DBA.M_GET_CMPR_RSLT
( DATABASENAME VARCHAR(30)
, COMPILE_DT DATE
)
AS ( LOCKING DBA.T_CMPR_CAND FOR ACCESS
SELECT COMPILE_DT
, DBA.T_CMPR_CAND.DATABASENAME
, DBA.T_CMPR_CAND.TABLENAME
, DBA.T_CMPR_CAND.COLUMNNAME
, COLUMNLENGTH
, SUM(COUT_NBR) AS NUM_ROWS
, CAST(NUM_ROWS * COLUMNLENGTH AS DECIMAL(18,0)) AS MAXSIZE
, SUM(CASE WHEN RANK_NBR < 2 THEN COUT_NBR ELSE 0 END) AS RANK001
, RANK001 * COLUMNLENGTH - RANK001 *1/8 AS SAVINGS001
, SUM(CASE WHEN RANK_NBR < 4 THEN COUT_NBR ELSE 0 END) AS RANK003
, RANK003 * COLUMNLENGTH - RANK003 *2/8 AS SAVINGS003
, SAVINGS003 - SAVINGS001 AS DELTA003
, SUM(CASE WHEN RANK_NBR < 8 THEN COUT_NBR ELSE 0 END) AS RANK007
, RANK007 * COLUMNLENGTH - RANK007 *3/8 AS SAVINGS007
, SAVINGS007 - SAVINGS003 AS DELTA007
, SUM(CASE WHEN RANK_NBR < 16 THEN COUT_NBR ELSE 0 END) AS RANK015
, RANK015 * COLUMNLENGTH - RANK015 *4/8 AS SAVINGS015
, SAVINGS015 - SAVINGS007 AS DELTA015
, SUM(CASE WHEN RANK_NBR < 32 THEN COUT_NBR ELSE 0 END) AS RANK031
, RANK031 * COLUMNLENGTH - RANK031 *5/8 AS SAVINGS031
, SAVINGS031 - SAVINGS015 AS DELTA031
, SUM(CASE WHEN RANK_NBR < 64 THEN COUT_NBR ELSE 0 END) AS RANK063
, RANK063 * COLUMNLENGTH - RANK063 *6/8 AS SAVINGS063
, SAVINGS063 - SAVINGS031 AS DELTA063
, SUM(CASE WHEN RANK_NBR < 128 THEN COUT_NBR ELSE 0 END) AS RANK127
, RANK127 * COLUMNLENGTH - RANK127 *7/8 AS SAVINGS127
, SAVINGS127 - SAVINGS063 AS DELTA127
, SUM(CASE WHEN RANK_NBR < 256 THEN COUT_NBR ELSE 0 END) AS RANK255
, RANK255 * COLUMNLENGTH - RANK255 *8/8 AS SAVINGS255
, SAVINGS255 - SAVINGS127 AS DELTA255
FROM DBA.T_CMPR_CAND
WHERE DBA.T_CMPR_CAND.DATABASENAME = DBC.COLUMNS.DATABASENAME
AND DBA.T_CMPR_CAND.TABLENAME = DBC.COLUMNS.TABLENAME
AND DBA.T_CMPR_CAND.COLUMNNAME = DBC.COLUMNS.COLUMNNAME
AND DBA.T_CMPR_CAND.DATABASENAME = :DATABASENAME
AND DBA.T_CMPR_CAND.COMPILE_DT = :COMPILE_DT
GROUP BY 1,2,3,4,5
ORDER BY 1,2,3,4,5;
);
REPLACE MACRO DBA.M_BLD_CMPR_LIST
( DATABASENAME VARCHAR(30)
, TABLENAME VARCHAR(30)
, COLUMNNAME VARCHAR(30)
, RANK_NBR INTEGER
, COMPILE_DT DATE
, COLUMNLENGTH INTEGER
)
AS ( LOCKING DBA.T_CMPR_CAND FOR ACCESS
SELECT DBA.T_CMPR_CAND.COLUMNNAME,
CASE
WHEN DBC.COLUMNS.COLUMNTYPE = 'CF'
THEN ',''' || SUBSTR(VAL_TXT, 1, :COLUMNLENGTH) || ''''
WHEN DBC.COLUMNS.COLUMNTYPE = 'DA'
THEN ',' || TRIM(CAST(CAST(VAL_TXT as DATE) as INTEGER))
ELSE ',' || TRIM(VAL_TXT) END
FROM DBA.T_CMPR_CAND,
DBC.COLUMNS
WHERE DBA.T_CMPR_CAND.DATABASENAME = :DATABASENAME
AND DBA.T_CMPR_CAND.TABLENAME = :TABLENAME
AND ((DBA.T_CMPR_CAND.COLUMNNAME = :COLUMNNAME
AND RANK_NBR < :RANK_NBR + 1))
AND COMPILE_DT = :COMPILE_DT
AND DBA.T_CMPR_CAND.DATABASENAME = DBC.COLUMNS.DATABASENAME
AND DBA.T_CMPR_CAND.TABLENAME = DBC.COLUMNS.TABLENAME
AND DBA.T_CMPR_CAND.COLUMNNAME = DBC.COLUMNS.COLUMNNAME
ORDER BY 1,2;
);
Appendix B: Code for the Stored Procedure
SPLText
REPLACE PROCEDURE SYSSTP.COL_COUT
( IN DATABASENAME VARCHAR(30)
, IN TABLENAME VARCHAR(30)
, IN COLUMNNAME VARCHAR(30)
)
/*******************************************************************/
/* NAME: COL_COUT - COLUMN_COUNTER */
/* DESC - INSERTS COLUMN COUNTS INTO TABLE FOR COMPRESSION STUDY */
/* INPUTS: */
/* DATABASENAME */
/* HISTORY: */
/* 12/22/2003 CREATED. */
/* 1/7/2004 Cleaned up to add lessons learned */
/*******************************************************************/
COLUMN_LABEL: BEGIN
DELETE FROM DBA.T_GLBL_CMPR_CAND;
CALL DBC.SYSEXECSQL (
'INSERT INTO DBA.T_GLBL_CMPR_CAND (DATABASENAME, TABLENAME, COLUMNNAME
, VAL_TXT, COUT_NBR) SELECT '''
|| TRIM(:DATABASENAME) || ''','''
|| TRIM(:TABLENAME) || ''','''
|| TRIM(:COLUMNNAME) || ''','
|| ' CAST( ' || :COLUMNNAME || ' AS VARCHAR(3000))'
|| ', COUNT(*) FROM '
|| TRIM(:DATABASENAME)
|| '.'
|| TRIM(:TABLENAME)
|| ' GROUP BY 1,2,3,4;' ) ;
UPDATE DBA.T_GLBL_CMPR_CAND
FROM
(SELECT DATABASENAME, TABLENAME, COLUMNNAME, VAL_TXTD, RANK() OVER
(PARTITION BY DATABASENAME, TABLENAME, COLUMNNAME ORDER BY COUT_NBR DESC)
FROM DBA.T_GLBL_CMPR_CAND
WHERE DBA.T_GLBL_CMPR_CAND.DATABASENAME = :DATABASENAME
AND DBA.T_GLBL_CMPR_CAND.TABLENAME = :TABLENAME
AND DBA.T_GLBL_CMPR_CAND.COLUMNNAME = :COLUMNNAME
AND DBA.T_GLBL_CMPR_CAND.COMPILE_DT = DATE
) A(DATABASENAME, TABLENAME, COLUMNNAME, VAL_TXT, RANK_NBR)
SET RANK_NBR = A.RANK_NBR
WHERE DBA.T_GLBL_CMPR_CAND.DATABASENAME = A.DATABASENAME
AND DBA.T_GLBL_CMPR_CAND.TABLENAME = A.TABLENAME
AND DBA.T_GLBL_CMPR_CAND.COLUMNNAME = A.COLUMNNAME
AND DBA.T_GLBL_CMPR_CAND.VAL_TXT = A.VAL_TXT
AND DBA.T_GLBL_CMPR_CAND.DATABASENAME = :DATABASENAME
AND DBA.T_GLBL_CMPR_CAND.TABLENAME = :TABLENAME
AND DBA.T_GLBL_CMPR_CAND.COLUMNNAME = :COLUMNNAME
AND DBA.T_GLBL_CMPR_CAND.COMPILE_DT = DATE;
INSERT INTO DBA.T_CMPR_CAND SELECT * FROM DBA.T_GLBL_CMPR_CAND;
END COLUMN_LABEL;
This page was developed from attachments provided by Glen Blood. The attachments can be found in the
Attachment area .
|