Archives of the TeradataForum

Message Posted: Mon, 06 Feb 2006 @ 13:45:53 GMT


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Subj:		Re: Use of Multiset for Large Table

From:		Dunweber, Ole

Hello Michelle

We are hand coding the SQL to verify - because we don't use an ELT-tool.

The reason for this is, that there are not that many ELT-tools for Teradata, and because our organization unfortunately is reluctant to spend the money.

So we've built our own methodology, which has been (and still is) great fun.

In my opinion, the must important things are:

1) Every process in the ELT must be restart able.

2) Each input-batch should be assigned a tracking number - we call this a batch number. Even if you are "active" this probably is sufficient.

3) There should be a formal validation process in the ELT

4) That you can account for all transactions in a given input (#Input = #Loaded + #Error-transactions)

5) Keep things simple - only do one process in each stage. If more processing is required introduce an extra stage.

We use staging tables all the way from loading until our one and only one final commit to the permanent tables (base-table and summary tables).

This commit is 1 macro, and is 1-1 INSERT/SELECTs from staging tables.

We use BTEQ and have 1script for each stage in the ELT.

The commit-macro must either be the only macro/process in the commit-script, or the last statement in that script.

If you should want to check for dupes in an input, I would use a simple GROUP BY on the unique columns.

Summing all transactions can easily check integrity between base- and summary-tables.

If you use PPI, you can probably efficiently check the most recent data, if you decide this is sufficient on a daily basis, and only do the full check from time to time.

Ensuring integrity between base- and summary-tables is a fun challenge - especially when the dimensions change.

My point of view is, that the best overall solution is to rebuild the summary-tables, with "some" frequency.

This probably can't be done each day, so we've decided to "freeze" some dimensions during the week, and only do the rebuild once a week for those. Others are rebuild every day.

Others are partly rebuild each day.

To this matter, it becomes extremely important, that the PI of the base-table aggregates well, that is local.

Of course it can't do that in every dimension, but it should be possible in one of them.

If you apply these principles I think it's very unlikely that the same data gets loaded twice.

In our experience it's more likely, that one/some of the summary tables along some dimension gets inconsistent with the base-table!

Should any of these happen, it's not the end of the world.

Of course you have to analyze why it happened, but then just get one with the job of correcting it.

Double data is easily removed using the negate-logic described in an earlier post. (This also quickly removes bad data loaded).

An inconsistent summary-table can be rebuild by our rebuild-logic at hand.

Probably to a new table, which then is renamed to the production version.

If you deem the risk of inconsistencies between base- and summary-tables unacceptable, I suggest using AJI.

Best Regards,

Ole Dunweber
Coop Nordic


	<Prev	Next>		<<First	<Prev	Next>	Last>>

Attachments

Library

Quick Reference