Archives of the TeradataForum
Message Posted: Wed, 06 Jun 2007 @ 16:44:20 GMT
Indeed, probably because of that fear. The method as such is not the most sophisticated, but it seems adequate for the job. The randomness of the sample, however, is seriously in doubt.
Let's take another look at the original example:
"In one case, a table with 12M, evenly distributed (UPI) rows had a column with 70K distinct values. A standard collect stats shows this number.
Sampling with a default CollectStatsSample parm value (2%) showed about 1.2M distinct values."
The ratio of distinct values in the population was 70,000 / 12,000,000 = 0.58%.
In the sample, Teradata seems to have found 1,200,000 / 50 = 24,000 distinct values (here I reverse the extrapolation rule communicated by Carrie) among 12,000,000 * 2% = 240,000 sampled rows. The sample ratio of distinct values is then 24,000 / 240,000 = 10%.
That is roughly 17 times the population ratio of 0.58%, which is very bad for a 2% sample.
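For anyone who wants to check the figures, here is a minimal Python sketch of the arithmetic. It assumes the extrapolation is a plain linear scale-up by 1 / sample fraction (a factor of 50 for a 2% sample), which is how I read the rule and may not be exactly what Teradata does internally. The function and variable names are mine, purely for illustration.

```python
def distinct_ratios(total_rows, distinct_full, extrapolated_distinct, sample_fraction):
    """Compare the distinct-value ratio of the full table with that of the sample.

    Assumption: the sampled estimate was produced by linear scale-up, so
    multiplying it by the sample fraction recovers the count seen in the sample.
    """
    sample_rows = round(total_rows * sample_fraction)
    sample_distinct = round(extrapolated_distinct * sample_fraction)
    return {
        "sample_rows": sample_rows,
        "sample_distinct": sample_distinct,
        "population_ratio": distinct_full / total_rows,
        "sample_ratio": sample_distinct / sample_rows,
    }


# Figures quoted from the original example in this thread.
result = distinct_ratios(
    total_rows=12_000_000,
    distinct_full=70_000,
    extrapolated_distinct=1_200_000,
    sample_fraction=0.02,
)

print(f"rows in sample:            {result['sample_rows']:,}")
print(f"distinct values in sample: {result['sample_distinct']:,}")
print(f"population distinct ratio: {result['population_ratio']:.2%}")
print(f"sample distinct ratio:     {result['sample_ratio']:.2%}")
```

Changing `sample_fraction` or the counts lets you repeat the same check against other tables.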