Archives of the TeradataForum

Message Posted: Fri, 28 Oct 2005 @ 22:11:12 GMT


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Subj:		Re: Skewed tables and VARCHAR column

From:		Fred W Pluebell

Skew is a measure of deviation from average. In the case of a parallel system like Teradata where you're only as good (or as fast) as the weakest link, it's the maximum deviation that tends to dominate. For example, Teradata tools define "Skew Factor" as

     (100 - (AVG(...)/MAX(...)*100))

so 0 would be perfectly balanced and larger numbers are (rapidly) worse.

Suppose we have 1,000,000 rows, and 900,000 of them each have unique RowHash values but the remaining 100,000 all have the same RowHash. Note that the average rows per hash (1.1) seems reasonable. But if we have 10 AMPs and each one gets roughly 90,000 of the "unique" rows, one unlucky AMP will get the remaining 100,000 (so Skew Factor = 47). This would almost certainly cause performance issues since one AMP has to do more than twice the work of the rest.

At the other extreme, suppose there were only 10 unique RowHash values but the 1,000,000 rows were evenly distributed (and further, that we were very lucky and each hash was assigned to a different AMP). Then the skew factor would look fine but you could have performance problems due to the 100,000 rows per hash.

Skew and uniqueness are not the only factors you need to consider, though. Suppose you include many columns in the PI so your RowHash is unique. Rows per hash and skew factor would both seem "perfect" - but your performance may still be poor if you can't use the PI for retrieval or JOIN processing.


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Attachments

Library

Quick Reference