Archives of the TeradataForum

Message Posted: Wed, 06 Jun 2007 @ 16:44:20 GMT


     


Subj:   Re: Minimal Stats Sampling and a well-informed
 
From:   Victor Sokovin

  Carrie, this seems odd to me. Two percent is a large sample -- far more than is needed for a 99% confidence level. Why doesn't the sampler simply accept what it finds? Is it because of a fear that the sample isn't "truly random"?  


Probably, yes, because of that fear. The extrapolation method as such is not the most sophisticated, but it seems adequate for the job. The randomness of the sample, however, is seriously in doubt.

Let's take another look at the original example:

"In one case, a table with 12M, evenly distributed (UPI) rows had a column with 70K distinct values. A standard collect stats shows this number.

Sampling with a default CollectStatsSample parm value (2%) showed about 1.2M distinct values."

The ratio of distinct values in the population was 70,000 / 12,000,000 = 0.58%.

In the sample, Teradata seems to have found 1,200,000 / 50 = 24,000 distinct values (here I am reversing the extrapolation rule Carrie described) among 12,000,000 * 2% = 240,000 sampled rows. The sample ratio of distinct values is then 24,000 / 240,000 = 10%.

This is very bad for a 2% sample.
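
The arithmetic above can be re-checked with a short Python sketch. The first part only reproduces the figures quoted in this thread. The second part is an added baseline, not something from the original discussion: it estimates how many distinct values a truly random 2% row sample would be expected to contain, assuming every value occurs equally often and each row is sampled independently. The function name expected_distinct_in_sample is hypothetical, and the real Teradata sampler may work quite differently.

```python
# Figures quoted in the thread.
population_rows = 12_000_000
population_distinct = 70_000
sample_fraction = 0.02               # default CollectStatsSample value (2%)
extrapolated_distinct = 1_200_000    # what the sampled collect stats reported

# Reverse the extrapolation: a 2% sample is scaled up by 1 / 0.02 = 50.
scale = round(1 / sample_fraction)
sample_rows = round(population_rows * sample_fraction)
sample_distinct = extrapolated_distinct // scale

population_ratio = population_distinct / population_rows
sample_ratio = sample_distinct / sample_rows


def expected_distinct_in_sample(rows, distinct, fraction):
    """Expected distinct values in a random row sample.

    Assumptions (not from the thread): all values are equally frequent
    and each row is included independently with probability `fraction`.
    """
    rows_per_value = rows / distinct
    # A value is missing from the sample only if none of its rows is drawn.
    p_missing = (1 - fraction) ** rows_per_value
    return distinct * (1 - p_missing)


baseline = expected_distinct_in_sample(
    population_rows, population_distinct, sample_fraction
)

print(f"population ratio of distinct values: {population_ratio:.2%}")
print(f"sample rows: {sample_rows}, implied distinct in sample: {sample_distinct}")
print(f"sample ratio of distinct values: {sample_ratio:.2%}")
print(f"distinct values expected in a random sample (assumed model): {baseline:.0f}")
```

Under those assumptions the baseline comes out well above the 24,000 distinct values implied by the reported figure, so the comparison depends heavily on how the sample rows are actually chosen.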


Regards,

Victor



     
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 15 Jun 2023