 

Archives of the TeradataForum

Message Posted: Wed, 30 Nov 2005 @ 11:27:34 GMT


     


Subj:   Re: MULTISET Tables
 
From:   Dunweber, Ole

Hello,

We use MULTISET tables as a defined standard on our big fact tables (of course we use SET tables with UPIs for the lookup tables). It works very well.
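As a rough illustration of that standard (the table and column names here are made up for the example, not our real ones):

    CREATE MULTISET TABLE sales_fact
    ( sale_date  DATE          NOT NULL
    , store_id   INTEGER       NOT NULL
    , item_id    INTEGER       NOT NULL
    , sales_amt  DECIMAL(18,2)
    , sales_qty  INTEGER
    )
    PRIMARY INDEX ( store_id, item_id, sale_date );

    CREATE SET TABLE store_dim
    ( store_id   INTEGER NOT NULL
    , store_name VARCHAR(100)
    )
    UNIQUE PRIMARY INDEX ( store_id );

The MULTISET fact table skips the duplicate-row check on insert, while the SET lookup table with its UPI rejects duplicate keys anyway.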

Coming from OLTP systems some five years ago, I had to get used to this, but now I couldn't imagine working without them.

Yes, you somehow have to make sure you don't load the same data twice.

But if you have a "decent" source system and a structured ETL process, it should very rarely happen that data gets inserted twice.

And if it does happen, we use the following (very fast) technique to get rid of the dupes:

We just load one instance of the dupes with negated values in all the fact columns. The application layer then of course always has to sum the rows in the answer sets. As we use MicroStrategy, this is an automatic option.
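A sketch of that negation trick, using the made-up sales_fact table from above and assuming each dupe was loaded exactly twice:

    /* Insert one fully negated copy of each duplicated row, so that */
    /* SUMs over the fact columns net out to a single instance.      */
    INSERT INTO sales_fact
    SELECT sale_date
         , store_id
         , item_id
         , -sales_amt
         , -sales_qty
    FROM   sales_fact
    QUALIFY ROW_NUMBER() OVER
            ( PARTITION BY sale_date, store_id, item_id, sales_amt, sales_qty
              ORDER BY sale_date ) = 1
        AND COUNT(*) OVER
            ( PARTITION BY sale_date, store_id, item_id, sales_amt, sales_qty ) > 1;

After this, summing the fact columns per key gives the same answer as if the data had been loaded only once.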

Regarding the structured ETL process:

We've built our own methodology.

We use a staging approach with staging tables all the way through, and a final commit unit (a macro) at the end that updates all the permanent tables in one commit unit.

This way any (BTEQ) script is restartable, and we know exactly whether the input data has been loaded or not.
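A minimal sketch of such a commit unit, with made-up staging and target table names (our real macros carry more bookkeeping than this):

    REPLACE MACRO commit_sales_load AS
    (
      /* In Teradata session mode a macro runs as one implicit      */
      /* transaction, so either all permanent tables are updated    */
      /* or none are.                                               */
      INSERT INTO sales_fact
      SELECT * FROM stg_sales_fact;

      INSERT INTO load_audit (load_id, load_ts)
      SELECT load_id, CURRENT_TIMESTAMP FROM stg_load_control;

      DELETE FROM stg_sales_fact;
    );

If the BTEQ script dies before the EXEC of the macro, nothing permanent has changed; if it dies after, the audit row tells the restart logic the load is already done.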

Another huge advantage of using MULTISET tables:

If we get dirty data loaded, we just do a negation of the dirty data and then load the corrected data when it becomes available.

This way we can correct dirty data very quickly.

Also make sure to use UNION ALL (and EXCEPT ALL) instead of UNION/EXCEPT; otherwise duplicates will be silently removed. (For some sources we do have real duplicates in the input.)
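For example (again with made-up staging table names), combining two sources where genuine duplicates have to survive:

    /* UNION would silently collapse identical rows across (and within) */
    /* the two sources; UNION ALL keeps every row.                      */
    SELECT sale_date, store_id, item_id, sales_amt, sales_qty
    FROM   stg_sales_pos
    UNION ALL
    SELECT sale_date, store_id, item_id, sales_amt, sales_qty
    FROM   stg_sales_web;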

A little disadvantage is that I would never dare to use an UPDATE on this kind of table.

So if we have to update something (in fact we have daily processes doing this, for example redoing currency conversions when the correct exchange rates become available), then we use this method (a sketch follows the list below):

- Extract the rows to be updated into a process table, and apply the changes in this extraction step

- Then, in one commit unit (macro):

  Delete the extracted rows from the source table.

  Insert the changed rows from the process table into the source table.
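A sketch with made-up names (fx_corrections holds the corrected rates, wrk_fx_redo is the process table; neither is one of our real objects):

    /* Step 1: extract the affected rows into the process table,    */
    /* applying the corrected exchange rates on the way.            */
    INSERT INTO wrk_fx_redo
    SELECT f.sale_date
         , f.store_id
         , f.item_id
         , f.sales_amt * c.new_rate / c.old_rate
         , f.sales_qty
    FROM   sales_fact     f
    JOIN   fx_corrections c
      ON   f.sale_date = c.rate_date;

    /* Step 2: one commit unit (macro): delete the extracted rows,  */
    /* then insert the corrected ones.                              */
    REPLACE MACRO apply_fx_redo AS
    (
      DELETE FROM sales_fact
      WHERE  sale_date IN ( SELECT rate_date FROM fx_corrections );

      INSERT INTO sales_fact
      SELECT * FROM wrk_fx_redo;

      DELETE FROM wrk_fx_redo;
    );

Because the extraction joins on the same dates that the macro deletes, the swap nets out to an update of exactly those rows.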


Another (bigger) disadvantage is that it gets harder to implement a normalized model (say, basket_header and basket_detail).

Because we obviously don't want duplicates in the basket_header table.

Well, in the case of basket data we've decided to go for the denormalized model, ensuring we can use our "negation logic" and get optimal insert performance.

And also ensuring we can use PPI the way we want. Regarding PPI:

That's also a defined standard on our big tables.

We make sure the partitioning column (usually/always a date) is included in the PI.

This way the load pattern matches our partitioning, and loads are very fast compared to the same table without PPI.
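As an illustration, the made-up fact table from above would then be created along these lines (the date range and interval are just examples):

    CREATE MULTISET TABLE sales_fact
    ( sale_date  DATE          NOT NULL
    , store_id   INTEGER       NOT NULL
    , item_id    INTEGER       NOT NULL
    , sales_amt  DECIMAL(18,2)
    , sales_qty  INTEGER
    )
    PRIMARY INDEX ( store_id, item_id, sale_date )
    PARTITION BY RANGE_N ( sale_date BETWEEN DATE '2004-01-01'
                                         AND DATE '2006-12-31'
                                    EACH INTERVAL '1' DAY );

A daily load then only has to touch the partitions for the dates being loaded, so the inserts hit far fewer data blocks per AMP.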

Note: this advantage gets smaller as the number of NUSIs on the table increases.

Finally, a note on using USIs on big fact tables:

In our business and DW (retail and a dimensional model) they don't make any sense, because we would never use them for access or joins, and we handle uniqueness ourselves.


Regards

Ole Dunweber
Coop Nordic



     