Archives of the TeradataForum

Message Posted: Fri, 23 May 2003 @ 22:12:54 GMT



Subj:		Re: Capturing Data Leaving Teradata Machine

From:		Anomy Anom

<-- Anonymously Posted: Friday, May 23, 2003 17:21 -->

Responses are below. Everyone's input on this very frustrating situation is greatly appreciated.

Just to state your situation again: You have a specific application that is LAN connected to the Teradata. Occasionally when this application executes a macro, the answer set that is returned has been corrupted (the nature of the corruption has not been specified). If the application re-executes that same macro, then the answer set is returned correctly.

Correct.

I don't think that you have said whether this is a new problem or an application that has been working and just recently started having problems.

The application has been in place for several years. The problem first occurred in January and then went away. It resurfaced in mid-April and was occurring more frequently. This week it has been light; only a couple of failures have been reported. We have not made any changes to the Teradata machine. The application team claims that they have not changed their code.

A sniffer has been used at the point where the application's server makes its connection to the LAN. The sniffer indicates that the data coming into the server from the Teradata has been corrupted. Since the corruption can be detected at the LAN, it's unlikely to be a problem with the application or its hardware.

Keep in mind that for practical purposes, a LAN is just a chunk of wire and that a wire has two ends. Also recognize that your piece of wire probably has routers/bridges between those two ends. So the next logical step might be putting the sniffer on the LAN at the point where the connection with the Teradata is made (i.e., the other end of the wire). If you can catch the corrupted data there, then it's definitely a Teradata problem.
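The two-ends-of-the-wire reasoning above can be sketched as a small decision function. This is only an illustration: it assumes the sniffer captures at each end have already been reduced to the raw answer-set bytes, and that a known-good copy of the same answer set (for example, from re-executing the macro) is available as a reference. The function name and return labels are hypothetical.

```python
def localize_corruption(reference: bytes, teradata_side: bytes, app_side: bytes) -> str:
    """Guess where corruption was introduced from two captures of one response.

    reference     -- known-good answer set (assumed available from a clean re-execution)
    teradata_side -- payload captured where the Teradata connects to the LAN
    app_side      -- payload captured where the application server connects to the LAN
    """
    if teradata_side != reference:
        # Already bad when it left the Teradata end of the wire.
        return "teradata"
    if app_side != reference:
        # Clean leaving the Teradata, bad on arrival: routers/bridges/cabling in between.
        return "network"
    # Clean at both ends of the wire: look at the application server itself.
    return "application"


good = b"1001|WIDGET|42.50\n"
bad = b"1001|WIDGET|4r.50\n"
print(localize_corruption(good, bad, bad))    # corrupted at both ends
print(localize_corruption(good, good, bad))   # corrupted only on the application side
```

The comparison only means something if both captures are of the same response, so in practice the two sniffers would need to be matched up by session and time.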

If it's a Teradata problem, then the question of whether it's hardware or software comes next.

I wouldn't be too quick to dismiss the idea that it's a software/hardware problem. In the past, there was a case where an extract job that ran daily (and had been stable for years) would occasionally return a few rows out of order (the query did have an order-by clause that should have guaranteed sort order). It would only happen every couple of months. It was finally proven to be a hardware problem.

There was another case where SQL which had been in use for years would return the wrong answer as a one-off. This would only occur once every couple of weeks and never hit the same query twice. It was finally tracked down to a bug in the handling of single-bit memory errors. There are lots of intermittent-type bugs that I could reminisce about, but the bottom line is that they weren't self-evident and wouldn't have been isolated and fixed if there hadn't been a deliberate and persistent effort to get to the real problem.

I assume that the application always communicates with the Teradata through the same Host group (as defined by Config and the LAN connection made to the Teradata). The Host group is probably made up of two or more PEs (aka COP/IFP) with the work being round-robined amongst those PEs.

Correct.

One question is whether the application-in-question shares the Host group with other applications? If so (and excluding the macro being at fault), it seems odd that only the one application is having trouble with corrupted data and that the problem isn't pandemic.

It does not share the Host group. It is the only application that accesses Teradata from its server.

When corrupted data is returned to the application, can you identify the Teradata session that was involved? If so, that should allow you to identify which PE was handling the query. Was it always the same PE?

At this time we can't identify the session involved. We may enable access logging. We may also limit the application to logging on to one node for several days and then move it to another node. We only have a 3-node machine.

If you can't track the problem back to a PE, then you have to go back and look at the macro (from your description so far, it's the only other thing you have to work with).

It sounds like there is a single macro involved. What's the failure rate? Are there 100's of successful executions for every failure? 1000's? Can you put the execute macro into a loop to see how repeatable the problem is? You may not be able to get the macro to fail without submitting it from the application server that is having the problem.

It turns out that it is more than one macro. There can be 100,000's of successful executions before we see a failure, but sometimes we have seen failures within minutes of each other. On the application server I created a script to loop through and execute the macros through bteq. It executed 151,200 macros serially. The script was running when a failure was reported by the application, but there was no corrupted data in the bteq output. We had hoped to capture corrupt data by running the script, which would have pointed more towards a Teradata or network error.
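A loop of this kind can check its own output rather than relying on someone reading it afterwards. The sketch below is not the script described above; it assumes each execution can be captured as bytes and that the macro returns the same, deterministically ordered answer set every time. `make_flaky_runner` is a hypothetical stand-in for "execute the macro and capture the answer set".

```python
import hashlib
from typing import Callable, List, Tuple


def find_divergent_runs(
    run_once: Callable[[], bytes], baseline: bytes, iterations: int
) -> List[Tuple[int, str]]:
    """Run the same request repeatedly; return (iteration, sha256) for each mismatch."""
    expected = hashlib.sha256(baseline).hexdigest()
    mismatches = []
    for i in range(iterations):
        digest = hashlib.sha256(run_once()).hexdigest()
        if digest != expected:
            mismatches.append((i, digest))
    return mismatches


def make_flaky_runner(good: bytes, bad: bytes, fail_on: int) -> Callable[[], bytes]:
    """Hypothetical runner that returns a corrupted answer set on one chosen call."""
    calls = {"n": 0}

    def run_once() -> bytes:
        n = calls["n"]
        calls["n"] += 1
        return bad if n == fail_on else good

    return run_once


good = b"1001|WIDGET|42.50\n"
bad = b"1001|WIDGET|4r.50\n"
runner = make_flaky_runner(good, bad, fail_on=3)
failures = find_divergent_runs(runner, baseline=good, iterations=10)
print([i for i, _ in failures])  # prints [3]
```

Recording the iteration number (and, in a real run, a timestamp) for each mismatch makes it possible to line a failure up with sniffer captures taken at the same time.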

For what it's worth: I have seen a case where a query would intermittently fail. The situation in that case was that statistics were not collected on the tables involved and the explain would, from time-to-time, change into something nasty. The majority of the time, the plan would be fine. Other times the categories that were being selected would be significantly different and we would get incorrect results.

There are a lot of things that can be wrong before I would blame the macro, but if it's the same macro that is always associated with the corruption, then I would give it more attention while waiting to see the results of placing the sniffer on the Teradata side.

So that it isn't forgotten, in which way is the data corrupted? Are numbers changed to alpha? Are you finding unprintable (or binary) characters? Is there byte flipping? Are you losing sign bits? Are fields shifted? The nature of the corruption can give a pretty good picture of the cause.
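Given one good and one corrupted copy of the same answer set, those questions can be asked mechanically. The sketch below is only illustrative: the function names and report keys are hypothetical, the data is treated as ASCII bytes, and the shift check is a crude heuristic that can report false matches on repetitive data.

```python
import string
from typing import Optional

PRINTABLE = set(string.printable.encode("ascii"))


def describe_corruption(expected: bytes, actual: bytes) -> dict:
    """Compare two byte strings position by position and classify each difference."""
    report = {
        "length_changed": len(expected) != len(actual),
        "differing_offsets": [],
        "single_bit_flips": [],
        "unprintable_offsets": [],
        "digit_to_alpha_offsets": [],
    }
    for offset, (e, a) in enumerate(zip(expected, actual)):
        if e == a:
            continue
        report["differing_offsets"].append(offset)
        if bin(e ^ a).count("1") == 1:        # exactly one bit differs
            report["single_bit_flips"].append(offset)
        if a not in PRINTABLE:                # binary or control character
            report["unprintable_offsets"].append(offset)
        if chr(e).isdigit() and chr(a).isalpha():
            report["digit_to_alpha_offsets"].append(offset)
    return report


def detect_shift(expected: bytes, actual: bytes, max_shift: int = 8) -> Optional[int]:
    """Return k if actual looks like expected shifted right by k bytes, else None."""
    for k in range(1, max_shift + 1):
        if len(actual) > k and actual[k:] == expected[: len(actual) - k]:
            return k
    return None


print(describe_corruption(b"A1Z", b"AqZ"))
print(detect_shift(b"HELLO", b"\x00HELL"))
```

A report dominated by single-bit flips points in a different direction than one showing whole fields shifted or overwritten, which is why the pattern is worth recording for every failure.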

See the previous post, which contains the corrupted data.

