Re: Info on DataStage EE For Teradata [From Brian Christjohn: Tue, 20 Mar 2007 @ 10:41 GMT]

Message Posted: Tue, 20 Mar 2007 @ 10:41:45 GMT

Subj:

Re: Info on DataStage EE For Teradata

From:

Brian Christjohn

First rule of thumb, the Teradata Enterprise Stage in version 7.5.1A is very buggy and poorly concieved. We decided not to use it at all. May be different in new 8.x version.

1. You are correct. All loads are based on Fastload of the inbound link to a temp table created by the TES (Teradata Enterprise Stage). Then it creates CLI sessions to insert/update target from the temp. It never cleans up all the TEMP tables. In addition, if you have 8 degrees of parallelism in DataStage parallel stages, for example, it will create 8 SQL sessions for the post-load. SQL only goes through 1 of those sessions. If you get a couple of these going at once then you have many unused sessions. Also if the load aborts, and you hit RESET in datastage, it never cleans up the fastload table - even though it has a separate control table that keeps track of all the jobs and the temp tables it creates. Also note that the TES can also be used as a source. This works well, but invokes a FastExport. If you use this 2-3 times for source tables/lookups in a job , and you run a couple of jobs at once then you have invoked too many fastexports. For explicit exports or big data pulls the TES can work fine, or you can use the MLOAD stage (which does Fastexport if it is a source stage)

2. Yes. CLI on postload.

3.) If you are considering using fastload to the target table, then you must have an empty target. The additional time to do insert/select to empty table + fastload is not that different than just a fastload, as long as you have an appropriate PI. The size of your box vs. the size of the data is the key. But if you are loading to non-empty table, then you should just use the MLOAD stage (Which allows you to use MLOAD or TPUMP). Your descision between MLOAD or TPUMP would be based on whether you have any special index considerations on the target table, and also how big the DataStage box is. In most cases, your datastage box cannot feed a MLOAD fast enough to make MLOAD faster than TPUMP. As point of reference, 8 simultanesou input sessions to a MLOAD spread across 2-Nodes (4-CPU) each on the Datastage server cannot feed the MLOAD anywhere close to a straight disc file. So even with a box that size, and running all by yourself, the MLOAD won't run faster than than the TPUMP (unless you have a bunch of SI on target table). The DataStage inbound processes is a great architecture but has a huge amount of CPU overhead.

4.) These are two separate things. If you are using the TES(uggg) or MLOAD/TPUMP stage, the partitioned Datastage inputs (transformer stagees etc.) will converge to a single MLOAD/TPUMP or TES stage. The single utility instance is invoked which then logs on 'n' number of Teradata sessions based on the normal Teradata utilitiy rules (sessions are allocated across Teradata Nodes through the PE control so their is balanced data aquisition processes).

5.) There is is also an API stage which I highly recommend. This stage can be run with sequential property or parallel property - so it will follow the partitioning level of the rest of the job. The blocking on Teradata is and issue. The way you resolve it depends on the way you implement the load. For TES, it is non-issue due to the fact that you run fast load, then a single SQL stamet through a single CLI session. For MLOAD, it is also non-issue. If you use TPUMP, you need to make sure that use follow normal TPUMP standards (Set the KEY statements equal to the PI and turn SERIALIZE on). IF you use the API stage and use it in parallel, then you must set the artitioning property to HASH and then include the columns = to all columns in the PI of the target tables. Only under the API will get occassional blocks due to Teradata hash synonyms only, but this should be pretty minimal on most low volume loads .

6.) Restart on TES works basically by starting over from the begining. It does not cleanup the old Fastload tables (lose space, screw up backups etc). The restart on MLOAD is a problem if you've used a straight feed (named pipe). The entire set of options is long, but I'd recommend not doing the DROPs in datastage, do them manually, and then if there is a failure in aquisition just reset the job. IF the failure is in the application phase, then you need to manually restart the MLOAD (from UNIX) You can write your own control wrapper around this to do all this if you want much as you would if you were writing a custom MLOAD. IF you are using TPUMP, just write the cleanup of log and macro in the MLOAD/TPUMP stage, Restart when fail, but make sure you put UPSERT logic in the stage (easily auto genereated) For the API, there is no cleanup, but again just write as UPSERT. The API will genereate upsert code automatically, but it will be 2 SQL statements, not atomic upsert. If you want ATOMIC upsert you will have to take the auto generated code and then modify slightly to the ATOMIC upsert syntax.

7.) Connections are created at the begging of the job (well the TES SQL post load session aren't created until the Fastload is done). Datastage streams data from begiinng to end as soon as the first row is available in most cases, so it has to fire up all connections. This is another reason to be careful with the TES or any other bulk loader utilitiy - they will stay active for as long as the job is active, even if you are waiting for a single slow connection from a source system (DB2, SQL server etc).

Attachments

Library

Quick Reference

Archives of the TeradataForum

Message Posted: Tue, 20 Mar 2007 @ 10:41:45 GMT