Archives of the TeradataForum

Message Posted: Wed, 29 Jun 2005 @ 17:35:22 GMT
Swapnil,

If you have the data already loaded, you can run a query like the following to see how evenly it will distribute:

     /* Just replace A.session_id with your column and
        databasename.tablename with the database and table
        that the data resides in. */
     SELECT HASHAMP(HASHBUCKET(HASHROW(A.session_id)))
           ,COUNT(*)
     FROM   databasename.tablename A
     GROUP BY 1
     ORDER BY 1;

     /* A concatenated key such as
        (A.FLT_DT || A.FLT_NR || A.DPT_ARP_CD || A.ARR_ARP_CD)
        can be used inside HASHROW for compound index checks. */

Also find the highest occurrence counts for the column:

     SELECT account_id
           ,COUNT(*)
     FROM   tablename
     GROUP BY 1
     HAVING COUNT(*) > 500
     ORDER BY 2 DESC;

Basically, you will start to see performance degradation when the rows for a single primary index value no longer fit into one data block. I have found the threshold to be at 2-5 times this rule before major performance problems occur.

To figure out how many rows will fit in a block, divide the block size by the average row length. To find the block size of the table, look at the table definition for the DATABLOCKSIZE, for example:

     DATABLOCKSIZE = 130560 BYTES

If the table does not have an explicit DATABLOCKSIZE, look at the file system setting in DBS Control:

     3. PermDBSize = 127 (sectors)

127 sectors = 130560 bytes, which is the most common setting for Teradata V2R4.1 and above. Look in the manuals for other combinations if you have something else defined.

Determining the average record length is much more complicated because of row overhead, compression, VARCHAR columns and PPI factors, so I don't have time to detail it here. I believe one of the Teradata manuals details how to calculate average row length. (A rough shortcut is sketched after this post.)

So, as an example, if DATABLOCKSIZE = 130560 and the average row length is 100:

     130560 / 100 = 1305 rows per block

This means you could have 1,305 rows with the same primary index value before you would start to experience some performance degradation. In my experience, you would see severe performance degradation on some activity once you reach 5,000 or more rows with the same primary index value in this situation.

The degradation occurs on inserting data into the table, because of duplicate row checks. This is often referred to as a NUPI DUP problem. Two ways to alleviate duplicate row checks are a unique secondary index or a MULTISET table (see the sketch after this post). Note that these do not stop or change the data skew.

This NUPI duplicate problem also often occurs when data has to be redistributed by a column (or columns) that is not nearly unique. For this, try influencing the plan by collecting statistics on the column (example after this post) so the plan will not redistribute the data. Also, queries that use the 'offending' primary index values can be impacted, depending on the request.

I hope this clears up the issue, as not many understood it. Just correcting this type of issue alone will really improve performance for the whole system.

Thanks,

Bob Diehl
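A rough shortcut for the average-row-length step above, offered as a minimal sketch rather than the manual's method: divide the table's current perm space from DBC.TableSize by its row count. The databasename.tablename names are the same placeholders used in the post, and the result includes row and block overhead, so treat it as an approximation.

     /* Approximate average row length: total perm space for the table
        (summed across AMPs) divided by the total row count.
        Includes row/block overhead, so it slightly overstates the
        true row length. */
     SELECT SUM(ts.CurrentPerm) / MAX(rc.row_cnt) AS avg_bytes_per_row
     FROM   DBC.TableSize ts
           ,(SELECT COUNT(*) AS row_cnt
             FROM   databasename.tablename) rc
     WHERE  ts.DataBaseName = 'databasename'
     AND    ts.TableName    = 'tablename';

Dividing the DATABLOCKSIZE by this figure gives a rough rows-per-block number along the lines of the 130560 / 100 = 1305 example above.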
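To illustrate the MULTISET option, here is a minimal sketch of a table definition; the web_clicks table and its columns are made up for the example, and the DATABLOCKSIZE clause simply echoes the value discussed above.

     /* A MULTISET table skips the duplicate row check on insert, so
        loading many rows with the same NUPI value no longer degrades
        into row-by-row comparisons. The skew on session_id itself is
        unchanged. */
     CREATE MULTISET TABLE databasename.web_clicks
          ,DATABLOCKSIZE = 130560 BYTES
          (
           session_id   INTEGER NOT NULL
          ,click_ts     TIMESTAMP(0) NOT NULL
          ,page_id      INTEGER
          )
     PRIMARY INDEX ( session_id );

On a SET table, the other option mentioned in the post, a unique secondary index (CREATE UNIQUE INDEX (col1, col2) ON tablename;), has a similar effect: the system enforces uniqueness through the index instead of comparing each new row against every existing row with the same NUPI value.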
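For the redistribution case, the statistics collection suggested in the post would look something like the following; databasename.tablename and account_id are again placeholders, not real objects.

     /* Collect statistics on the skewed column so the optimizer can
        see how non-unique it is and, ideally, choose a plan that does
        not redistribute spool on that column. */
     COLLECT STATISTICS ON databasename.tablename COLUMN account_id;

HELP STATISTICS databasename.tablename; will show what has already been collected.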