Archives of the TeradataForum

Message Posted: Wed, 06 Jun 2007 @ 16:44:20 GMT



Subj:		Re: Minimal Stats Sampling and a well-informed

From:		Victor Sokovin

Carrie, this seems odd to me. Two percent is a large sample -- far more than is needed for a 99% confidence level. Why doesn't the sampler simply accept what it finds? Is it because of a fear that the sample isn't "truly random"?

Probably, yes. The method itself is not the most sophisticated, but it seems adequate for the job. The randomness of the sample, however, is seriously in doubt.

Let's take another look at the original example:

"In one case, a table with 12M, evenly distributed (UPI) rows had a column with 70K distinct values. A standard collect stats shows this number.

Sampling with a default CollectStatsSample parm value (2%) showed about 1.2M distinct values."

The ratio of distinct values in the population was 70,000 / 12,000,000 = 0.58%.

In the sample, Teradata seems to have found 1,200,000 / 50 = 24,000 distinct values (here I reverse the extrapolation rule Carrie described) among the 12,000,000 * 2% = 240,000 sampled rows. The sample ratio of distinct values is then 24,000 / 240,000 = 10%.

A 10% ratio in the sample, against 0.58% in the population, is a very bad result for a 2% sample.
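
For anyone who wants to re-run the arithmetic with other numbers, here is a minimal Python sketch. It assumes the extrapolation is a plain linear scale-up (multiply by 100 / sample percent), which is how I reversed it above; it is not a description of what the Teradata sampler actually does internally. The function and variable names are made up for illustration.

```python
def population_distinct_ratio(total_rows, distinct_values):
    """Ratio of distinct values to rows in the full table."""
    return distinct_values / total_rows


def reverse_extrapolation(total_rows, sample_pct, extrapolated_distinct):
    """Undo an assumed linear scale-up to recover the sample-level figures.

    Returns (sample_rows, sample_distinct, sample_ratio).
    """
    scale = 100 / sample_pct                        # 2% sample -> factor of 50
    sample_rows = total_rows * sample_pct / 100     # rows actually read
    sample_distinct = extrapolated_distinct / scale # distinct values seen in the sample
    return sample_rows, sample_distinct, sample_distinct / sample_rows


# Figures quoted from the original example in this thread.
TOTAL_ROWS = 12_000_000
TRUE_DISTINCT = 70_000
SAMPLE_PCT = 2
REPORTED_DISTINCT = 1_200_000

pop_ratio = population_distinct_ratio(TOTAL_ROWS, TRUE_DISTINCT)
rows, distinct, sample_ratio = reverse_extrapolation(
    TOTAL_ROWS, SAMPLE_PCT, REPORTED_DISTINCT
)

print(f"population ratio: {pop_ratio:.2%}")
print(f"sample rows:      {rows:,.0f}")
print(f"sample distinct:  {distinct:,.0f}")
print(f"sample ratio:     {sample_ratio:.2%}")
```

Changing SAMPLE_PCT or REPORTED_DISTINCT shows how the implied sample ratio moves under the same linear assumption.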

Regards,

Victor

