 

Archives of the TeradataForum

Message Posted: Mon, 22 Oct 2007 @ 14:02:39 GMT


     


Subj:   Re: Compression Algorithm
 
From:   Dieter Noeth

David Clough wrote:

  I'm writing a Stored Procedure (well, actually, I've written it) which works out the break point for the number of columns to compress on for a table, column by column.  


You probably already do it, but don't forget to add CASESPECIFIC to any [Var]Char during calculation.
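
For example (made-up table and column names), the difference shows up in a simple distinct count. On a NOT CASESPECIFIC column the first query treats 'Open' and 'OPEN' as a single value:

   SELECT COUNT(DISTINCT order_status) FROM orders;
   SELECT COUNT(DISTINCT order_status (CASESPECIFIC)) FROM orders;

The second count reflects the values as they are actually stored, which is what a compress list has to match.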


  So, as you can see, I'm calculating overhead as a fraction of one byte, multiplied by the total number of rows in the table.  


I'd do all the calculations for all columns in bits and only multiply by the row count at the end.
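
A made-up example of what I mean: a CHAR(10) LATIN column is 80 bits wide. If one candidate value occurs in 30% of the rows and compressing it costs one extra compress bit in every row, the net saving is 0.3 * 80 - 1 = 23 bits per row; on a 1,000,000 row table that's roughly 2.9 million bytes.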


  The final calculation is a simple comparison of the space saving derived from a _particular cumulative set_ against the overhead taken up _across all rows_.  


  Actually, I've got all sorts of bits of flexibility written into the process, for example being able to specify a minimum column percentage compression saving or a maximum number of compression values to look at, but fundamentally I wonder whether it can be looked at in this simplistic way.  


I wrote the basic part some years ago (extracting the 255 most common values + NULL), but stopped after some easy calculations (how many bytes are saved by x compressed values compared to uncompressed/VarChar). As this was just a kind of exercise using all those nice SP features, I stopped when it came to the exact calculation :-)
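
If someone wants to try the same exercise, the extraction step might look roughly like this (table and column names are made up; the QUALIFY keeps the 255 most frequent values, and NULL is handled separately as mentioned above):

   SELECT
      order_status (CASESPECIFIC) AS val,
      COUNT(*) AS freq
   FROM orders
   WHERE order_status IS NOT NULL
   GROUP BY 1
   QUALIFY ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) <= 255;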

- The table header size will increase by at least (number of compressed values * number of AMPs), because the compress values are stored in the table header on every AMP. But if the table is large, this might be negligible.

- The number of additional Presence Bytes per row should be (NULLable columns + Compress Bits) / 8, rounded up to whole bytes (see the rough numbers after this list).

- In the worst case, adding a single presence *bit* might add a whole presence *byte*. But rows are word-aligned anyway, so maybe this doesn't matter.

- If it's a VarChar to be compressed, then subtract 2 bytes of overhead (its entry in the offset array), but if it's the last remaining VarChar, then 4 bytes.
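
To put some made-up numbers on the row overhead (just the kind of arithmetic I mean, not an exact rule):

- A table with 10 NULLable columns and no compression needs 10/8 rounded up = 2 presence bytes per row.

- Compress 3 of those columns on 255 values each: 255 values should need 8 compress bits per column, so (10 + 3*8)/8 = 4.25, rounded up to 5 presence bytes. That's 3 extra bytes per row, or about 3 MB on a 1,000,000 row table.

- On top of that comes the table header growth on every AMP for storing the 3 * 255 compress values themselves.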


And there are probably other considerations; this is why I never did that exact calculation...

Btw, now there are 64-bit systems, so this is even worse, due to pad byte fields and alignment issues :-(


Dieter



     