 

Archives of the TeradataForum

Message Posted: Wed, 30 Nov 2005 @ 11:27:34 GMT


     


Subj:   Re: MULTISET Tables
 
From:   Dunweber, Ole

Hello,

We use MULTISET tables as a defined standard on our big fact tables (of course we use SET tables with UPIs for the lookup tables). It works very well.
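As a rough illustration of that standard (the table and column names here are made up for the example, not our real ones):

    CREATE MULTISET TABLE sales_fact
    ( sale_date  DATE          NOT NULL
    , store_id   INTEGER       NOT NULL
    , item_id    INTEGER       NOT NULL
    , sales_amt  DECIMAL(18,2)
    , sales_qty  INTEGER
    )
    PRIMARY INDEX ( store_id, item_id, sale_date );

    CREATE SET TABLE store_dim
    ( store_id   INTEGER NOT NULL
    , store_name VARCHAR(100)
    )
    UNIQUE PRIMARY INDEX ( store_id );

The MULTISET fact table skips the duplicate-row check on insert, while the SET lookup table with its UPI rejects duplicate keys anyway.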

Coming from OLTP systems some five years ago, I had to get used to this, but now I couldn't imagine working without them.

Yes, you somehow have to make sure you don't load the same data twice.

But if you have a "decent" source system and a structured ETL process, it should very rarely happen that data gets inserted twice.

And if it does happen, we use the following (very fast) technique to get rid of the dupes:

We just load one instance of the dupes with negated values in all the fact columns. The application layer then of course always has to sum the rows in the answer sets. As we use MicroStrategy, this is an automatic option.
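A sketch of that negation trick, using the made-up sales_fact table from above and assuming each dupe was loaded exactly twice:

    /* Insert one fully negated copy of each duplicated row, so that */
    /* SUMs over the fact columns net out to a single instance.      */
    INSERT INTO sales_fact
    SELECT sale_date
         , store_id
         , item_id
         , -sales_amt
         , -sales_qty
    FROM   sales_fact
    QUALIFY ROW_NUMBER() OVER
            ( PARTITION BY sale_date, store_id, item_id, sales_amt, sales_qty
              ORDER BY sale_date ) = 1
        AND COUNT(*) OVER
            ( PARTITION BY sale_date, store_id, item_id, sales_amt, sales_qty ) > 1;

After this, summing the fact columns per key gives the same answer as if the data had been loaded only once.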

Regarding the structured ETL process:

We've built our own methodology.

We use a staging approach with staging tables all the way through, and a final commit unit (a macro) at the end that updates all the permanent tables in one commit unit.

This way any (BTEQ) script is restartable, and we know exactly whether the input data has been loaded or not.
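A minimal sketch of such a commit unit, with made-up staging and target table names (our real macros carry more bookkeeping than this):

    REPLACE MACRO commit_sales_load AS
    (
      /* In Teradata session mode a macro runs as one implicit      */
      /* transaction, so either all permanent tables are updated    */
      /* or none are.                                               */
      INSERT INTO sales_fact
      SELECT * FROM stg_sales_fact;

      INSERT INTO load_audit (load_id, load_ts)
      SELECT load_id, CURRENT_TIMESTAMP FROM stg_load_control;

      DELETE FROM stg_sales_fact;
    );

If the BTEQ script dies before the EXEC of the macro, nothing permanent has changed; if it dies after, the audit row tells the restart logic the load is already done.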

Another huge advantage of using MULTISET tables:

If we get dirty data loaded, we just do a negation of the dirty data and then load the corrected data when it becomes available.

This way we can correct dirty data very quickly.

Also make sure to use UNION ALL (and EXCEPT ALL) instead of UNION/EXCEPT; otherwise duplicates will be silently removed. (For some sources we do have real duplicates in the input.)
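For example (again with made-up staging table names), combining two sources where genuine duplicates have to survive:

    /* UNION would silently collapse identical rows across (and within) */
    /* the two sources; UNION ALL keeps every row.                      */
    SELECT sale_date, store_id, item_id, sales_amt, sales_qty
    FROM   stg_sales_pos
    UNION ALL
    SELECT sale_date, store_id, item_id, sales_amt, sales_qty
    FROM   stg_sales_web;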

A little disadvantage is that I would never dare to use an UPDATE on this kind of table.

So if we have to update something (in fact we have daily processes doing this, for example redoing currency conversions when the correct exchange rates become available), then we use this method (a sketch follows the list below):

- Extract the rows to be updated into a process table, and apply the changes in this extraction step

- Then, in one commit unit (macro):

  Delete the extracted rows from the source table.

  Insert the changed rows from the process table into the source table.
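A sketch with made-up names (fx_corrections holds the corrected rates, wrk_fx_redo is the process table; neither is one of our real objects):

    /* Step 1: extract the affected rows into the process table,    */
    /* applying the corrected exchange rates on the way.            */
    INSERT INTO wrk_fx_redo
    SELECT f.sale_date
         , f.store_id
         , f.item_id
         , f.sales_amt * c.new_rate / c.old_rate
         , f.sales_qty
    FROM   sales_fact     f
    JOIN   fx_corrections c
      ON   f.sale_date = c.rate_date;

    /* Step 2: one commit unit (macro): delete the extracted rows,  */
    /* then insert the corrected ones.                              */
    REPLACE MACRO apply_fx_redo AS
    (
      DELETE FROM sales_fact
      WHERE  sale_date IN ( SELECT rate_date FROM fx_corrections );

      INSERT INTO sales_fact
      SELECT * FROM wrk_fx_redo;

      DELETE FROM wrk_fx_redo;
    );

Because the extraction joins on the same dates that the macro deletes, the swap nets out to an update of exactly those rows.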


Another (bigger) disadvantage is that it gets harder to implement a normalized model (say, basket_header and basket_detail).

Because we obviously don't want duplicates in the basket_header table.

Well, in the case of basket data we've decided to go for the denormalized model, ensuring we can use our "negation logic" and get optimal insert performance.

And also ensuring we can use PPI the way we want. Regarding PPI:

That's also a defined standard on our big tables.

We make sure the partitioning column (usually/always a date) is included in the PI.

This way the load pattern matches our partitioning, and loads are very fast compared to the same table without PPI.
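As an illustration, the made-up fact table from above would then be created along these lines (the date range and interval are just examples):

    CREATE MULTISET TABLE sales_fact
    ( sale_date  DATE          NOT NULL
    , store_id   INTEGER       NOT NULL
    , item_id    INTEGER       NOT NULL
    , sales_amt  DECIMAL(18,2)
    , sales_qty  INTEGER
    )
    PRIMARY INDEX ( store_id, item_id, sale_date )
    PARTITION BY RANGE_N ( sale_date BETWEEN DATE '2004-01-01'
                                         AND DATE '2006-12-31'
                                    EACH INTERVAL '1' DAY );

A daily load then only has to touch the partitions for the dates being loaded, so the inserts hit far fewer data blocks per AMP.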

Note: this advantage gets smaller as the number of NUSIs on the table increases.

Finally, a note on using USIs on big fact tables:

In our business and DW (retail and a dimensional model) they don't make any sense, because we would never use them for access or joins, and we handle uniqueness ourselves.


Regards

Ole Dunweber
Coop Nordic



     