Archives of the TeradataForum

Message Posted: Thu, 03 Mar 2011 @ 14:00:18 GMT


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Subj:		Re: Collecting Statistics for NULLS

From:		Maccha Narayana Rao Naidu

Hope this Helps

Teradata 12 includes statistics collection enhancements to better iden-tify rows whose multicolumns contain partial NULLs, as opposed to all NULLs. In previous releases there was no way to make this determina-tion, which resulted in all of the all NULL and partial NULL values counting as one distinct value. That could lead to skewed redistribu-tions when some rows had all NULLs and others had partial NULLs.

Because the optimizer no longer considers partial and all NULLs as the same, not only are redistributions more even, but better join costing and more accurate aggregations on those columns can also be performed.

This distinction between full and partial NULLs is a result of the changes made to Interval 0 of a Teradata 12 statistics histogram. An exist- ing field and a new field help the optimizer make this differentiation:

* NumNulls. A count of all rows that have NULL on any column in the statistic

* AllNulls. A count of rows that have NULL on all the columns of the statistic

From the values in those two fields, the optimizer can derive how many rows carry partial NULL values (NumNulls - AllNulls = Partial NULLs).

One aspect of this change is that the database assumes each row reflects a distinct value when its multicolumn values are partially NULL, even when the non-NULL portions are the same. For that reason, if a multicolumn statistic includes a column with a high number of NULLs, you might want to consider removing that column from the statistic to make the statistic more usable.

Below is the actual formula used by the optimizer for determining the number of distinct values in the multicolumn statistic:

# Distinct values = # of distinct values from rows with no NULLs + (1 if AllNulls > 0 else 0) + # of rows with partial NULL.

To help clarify this, consider a table that includes two columns (Region and State), as illustrated in table 2. Assume that multicol-umn statistics have been collected on these two columns. In the rightmost column is text indicating how the number of distinct values for the statistics was changed by the contents of those two columns in that particular row.

Thanks

Machha


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Attachments

Library

Quick Reference