Archives of the TeradataForum

Message Posted: Mon, 08 Jul 2002 @ 18:50:15 GMT


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Subj:		Re: Uniform Distributions from RANDOM number function

From:		Geoffrey Rommel

Doug has raised some interesting points here.

First, if you would like an overview of how "random number" algorithms work, along with plenty of warnings about the possible pitfalls, I recommend Chapter 3 of Donald Knuth's Art of Computer Programming, Vol. 2 (3rd edition, Addison-Wesley, 1998). There's a lot of higher math that I don't understand, but the main concepts are clear enough.

The documentation seems to imply that a different seed to the algorithm is used for each call which should produce another totally random number sequence independent of any prior sequences. However, this does not seem to be the case.

Yes, a new seed is used each time, but you have no control over it. The algorithm (lrand48, I believe) is described in the SQL Reference, Vol. 5 (V2R4.1, B035-1101-061A, June 2001), p. 7-43. Now, is the sequence "independent" of the prior sequences? Strictly speaking, no: the algorithm is deterministic; it's just designed to produce results that look random. Generally speaking, though, each sequence should look different enough from previous ones that a casual observer would never notice the pattern.

Since you are trying to produce unique values, I guess you are no longer a casual observer!

The trouble arises when I attempt to split this process into multiple passes inserting a batch of rows (100K) at a time. On the second or subsequent pass of the same SQL I begin to see some number ranges that have a high rate of duplication in each pass.... I am looking for ideas on how to work around this issue so that I can generate a set of random numbers without duplicates with some level of confidence.

In the words of the Wizard, you are a victim of disorganized thinking! If you pick two separate random sequences, uh, at random, they may very well overlap. For instance, let's say you generate two randomly selected groups of 10 integers from 0 to 99. There is a chance that they will contain one or more identical integers. (I'm not savvy enough to compute what that chance is, but it's greater than 0.) The same applies when you generate two groups of 100K random numbers from a universe of 2**31, although obviously the chances are smaller. And, of course, the algorithm isn't really random, so perhaps the chances are greater than they would be if the numbers were truly random. Demoralizing, isn't it?

I believe the only way to accomplish what you wish is to generate a single set of random numbers and then somehow insert them 100,000 at a time into your final table.

Hope this helps,

wgr


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Attachments

Library

Quick Reference