Archives of the TeradataForum
Message Posted: Fri, 28 Oct 2005 @ 22:11:12 GMT
Subj: | | Re: Skewed tables and VARCHAR column |
|
From: | | Fred W Pluebell |
Skew is a measure of deviation from average. In the case of a parallel system like Teradata where you're only as good (or as fast) as the
weakest link, it's the maximum deviation that tends to dominate. For example, Teradata tools define "Skew Factor" as
(100 - (AVG(...)/MAX(...)*100))
so 0 would be perfectly balanced and larger numbers are (rapidly) worse.
Suppose we have 1,000,000 rows, and 900,000 of them each have unique RowHash values but the remaining 100,000 all have the same RowHash. Note
that the average rows per hash (1.1) seems reasonable. But if we have 10 AMPs and each one gets roughly 90,000 of the "unique" rows, one unlucky
AMP will get the remaining 100,000 (so Skew Factor = 47). This would almost certainly cause performance issues since one AMP has to do more than
twice the work of the rest.
At the other extreme, suppose there were only 10 unique RowHash values but the 1,000,000 rows were evenly distributed (and further, that we
were very lucky and each hash was assigned to a different AMP). Then the skew factor would look fine but you could have performance problems due
to the 100,000 rows per hash.
Skew and uniqueness are not the only factors you need to consider, though. Suppose you include many columns in the PI so your RowHash is
unique. Rows per hash and skew factor would both seem "perfect" - but your performance may still be poor if you can't use the PI for retrieval or
JOIN processing.
|