Archives of the TeradataForum

Message Posted: Fri, 28 Sep 2007 @ 21:22:27 GMT


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Subj:		Re: Number of pi duplicates

From:		Anomy.Anom

<-- Anonymously Posted: Fri, 28 Sep 2007 12:59:43 -->

Hi,

Yes, I'd please like to make a comment on NUPI duplicates.

"Hysterically, the rule of thumb was 100. It went up to 200 for a while, but then went back down. The actual "number" is the number of rows that fit in a block! Which, of course, depends on your row size, compression, etc., etc. So go figure. Anybody else want to comment.".

I think the answer to the original question in this thread is quite simple and relatively easy to understand, however it looks to me that a lot of confusion may frequently arise about this subject.

If you consider the number of PI or NUPI duplicates from an end user perspective, if you need to recover or join a relatively small percentage of rows in a very large table, you can work with a very high number of PI duplicates very effectively, for instance I've recently seen an extreme case with the implementation of a large table where a few NUPI values have more than 200K rows, which of course span many, many blocks, but queries against the table are very efficient, because the NUPI covers all access and the Optimizer never needs to duplicate or redistribute it.

On the other hand, from the data load/refresh perspective, if you have many NUPI duplicates and you keep the table as SET and there are no unique indexes defined, like for instance a USI on the PK, any bulk SQL INSERT/UPDATE or MultiLoad process against the table may never end, or run for many hours/days, because you are asking Teradata to check for row duplicates in every INSERT/UPDATE operation, and the system has no other choice but examine and compare every row having the same row hash as the NUPI value on a column per column basis, demanding an enormous amount of resources, both CPU time and I/O operations, to be able to reject duplicate rows.

So the above rule of thumb ONLY applies to SET tables with a NUPI and without USI and when considered from a data load/refresh perspective, and in fact it may be a right approach to avoid wasting valuable resources in checking duplicate rows when updating the table, i.e. limiting the number of NUPI duplicates to a single block.

But while you get efficient update processing for the table, it may happen that when you are forced to choose a NUPI with a relatively small number of duplicate values you may be spoiling the performance of joins and SELECTs, if this NUPI is not useful enough for instance for your BI queries, and the Optimizer then needs to redistribute the table to change its PI, or duplicate its rows in every AMP.

What many physical designers do in order to choose a useful NUPI in spite of the fact it may have many duplicate values is for instance define a USI on the PK of the table, so Teradata can reduce the row duplicates check operation just to that index uniqueness. But there's a tradeoff, you need perm space for the unique index structure, and of course additional processing to keep the USI in sync, and you can't use MultiLoad to update the table. Furthermore you still need to verify if the resources Teradata needs in order to check the USI for every NUPI duplicate are reasonable.

Or you could ask Teradata to not perform the duplicate row checking, in order to accomplish this you must define the table as MULTISET, but in this case you become responsible for avoiding undesired row duplicates in the table, for instance implementing a robust ETL or ELT strategy, as many Teradata users do.

Or you could review the physical definition of your table, perhaps it could be possible to find a good PI candidate as UPI instead of NUPI. I see many cases where the physical designer tries to put together both current and previous instances of information in the same table, so what would have been a natural choice of UPI becomes a NUPI. In these cases perhaps it may be better to keep history and current data in two separate tables, the one which holds current data would be UPI and the other one would be NUPI, updates to history may not be so frequent.

Or you may consider if the solution could be to define a Partitioned PI (PPI) for the table, but in this case you need to analyze if it must perform joins using the same PI against non PPI tables.

I think the main thing is to remember Teradata due to its massive parallel architecture can only perform direct joins on Primary Indexes, because each AMP or parallel unit must have all rows from both tables which belong to same domain values of the joined columns, and you also need to remember that the PI is always the best possible access method for rows in the table.

So the natural choice of the PI is to pick up the PK of the main entity as UPI, and define the FK of the most frequently joined related table as NUPI, if you are working in a normalized physical implementation. After your choice you need to make sure the NUPI table may have a reasonable update performance, applying any of the approaches described above.

Or if your data model is dimensional, you would normally choose for the fact table the concatenation of the columns belonging to the dimensions used to access its rows, which normally can be a UPI, then each table representing each dimension should have UPI as well.

Just my 2 cents.

Regards.


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Attachments

Library

Quick Reference