Archives of the TeradataForum

Message Posted: Fri, 18 Apr 2003 @ 19:40:17 GMT



Subj:		Re: Capturing Data Leaving Teradata Machine

From:		John Hall

Just to state your situation again: You have a specific application that is LAN connected to the Teradata. Occasionally when this application executes a macro, the answer set that is returned has been corrupted (the nature of the corruption has not been specified). If the application re-executes that same macro, then the answer set is returned correctly.

I don't think that you have said whether the application has always had this problem, or whether it had been working and only recently started having problems.

A sniffer has been used at the point where the application's server makes its connection to the LAN. The sniffer indicates that the data coming into the server from the Teradata has been corrupted. Since the corruption can be detected at the LAN, it's unlikely to be a problem with the application or its hardware.

Keep in mind that for practical purposes, a LAN is just a chunk of wire and that a wire has two ends. Also recognize that your piece of wire probably has routers/bridges between those two ends. So the next logical step might be putting the sniffer on the LAN at the point where the connection with the Teradata is made (i.e., the other end of the wire). If you can catch the corrupted data there, then it's definitely a Teradata problem.

If it's a Teradata problem, then the question of whether it's hardware or software comes next.

I wouldn't be too quick to dismiss the idea that it's a software/hardware problem. In the past, there was a case where an extract job that ran daily (and had been stable for years) would occasionally return a few rows out of order (the query did have an ORDER BY clause that should have guaranteed sort order). It would only happen every couple of months. It was finally proven to be a hardware problem.

There was another case where SQL which had been in use for years would return the wrong answer as a one-off. This would only occur once every couple of weeks and never hit the same query twice. It was finally tracked down to a bug in the handling of single-bit memory errors. There are lots of intermittent-type bugs that I could reminisce about, but the bottom line is that they weren't self-evident and wouldn't have been isolated and fixed if there hadn't been a deliberate and persistent effort to get to the real problem.

I assume that the application always communicates with the Teradata through the same Host group (as defined by Config and the LAN connection made to the Teradata). The Host group is probably made up of two or more PEs (aka COP/IFP) with the work being round-robined amongst those PEs.

One question is whether the application in question shares the Host group with other applications. If so (and excluding the macro being at fault), it seems odd that only the one application is having trouble with corrupted data and that the problem isn't more widespread.

When corrupted data is returned to the application, can you identify the Teradata session that was involved? If so, that should allow you to identify which PE was handling the query. Was it always the same PE?
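
If you can get as far as pairing each execution with the PE that handled it, the "always the same PE?" question is just a tally. Here is a rough Python sketch of that bookkeeping. The record layout (a PE label plus a corrupted yes/no flag) and the names are my own assumptions for illustration; how you actually get from a session to its PE is up to your site and is not shown here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def failures_by_pe(observations: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, int]]:
    """Return {pe_id: (corrupted_count, total_count)} for each PE seen.

    Each observation is a hypothetical (pe_id, corrupted) pair that you
    have collected by hand or from your own logging.
    """
    total = Counter()
    bad = Counter()
    for pe_id, corrupted in observations:
        total[pe_id] += 1
        if corrupted:
            bad[pe_id] += 1
    # A Counter returns 0 for a PE that never produced a bad answer set.
    return {pe_id: (bad[pe_id], total[pe_id]) for pe_id in sorted(total)}


# Made-up observations, purely to show the shape of the output.
sample = [
    ("PE-1", False),
    ("PE-2", True),
    ("PE-1", False),
    ("PE-2", False),
    ("PE-2", True),
]
summary = failures_by_pe(sample)
for pe_id, (bad_count, total_count) in summary.items():
    print(f"{pe_id}: {bad_count} corrupted out of {total_count}")
```

If every failure lands on one PE, that points at that PE. If the failures are spread across all of them, the PE is probably not the place to look.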

If you can't track the problem back to a PE, then you have to go back and look at the macro (from your description so far, it's the only other thing you have to work with).

It sounds like there is a single macro involved. What's the failure rate? Are there 100's of successful executions for every failure? 1000's? Can you put the execute macro into a loop to see how repeatable the problem is? You may not be able to get the macro to fail without submitting it from the application server that is having the problem.
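
For the loop itself, something along these lines would give you a failure rate. This is only a sketch in Python: `run_macro` stands for whatever your application already uses to execute the macro and fetch the answer set (I am not assuming any particular driver), and the fake macro at the bottom exists only so the sketch can be run on its own.

```python
import hashlib
from typing import Callable, Iterable, List, Sequence


def fingerprint(rows: Iterable[Sequence]) -> str:
    """Hash an answer set, row order included, so any change shows up."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def measure_failures(run_macro: Callable[[], List[Sequence]],
                     expected: str,
                     attempts: int) -> List[int]:
    """Execute the macro `attempts` times and return the attempt numbers
    (starting at 1) whose answer set does not match the known-good hash."""
    failures = []
    for attempt in range(1, attempts + 1):
        if fingerprint(run_macro()) != expected:
            failures.append(attempt)
    return failures


def make_fake_macro(bad_attempts):
    """Stand-in for the real macro call: corrupts one field on chosen calls."""
    state = {"calls": 0}

    def run():
        state["calls"] += 1
        rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]
        if state["calls"] in bad_attempts:
            rows[1] = (2, "bet\x00")  # simulated corruption
        return rows

    return run


known_good = fingerprint([(1, "alpha"), (2, "beta"), (3, "gamma")])
failed = measure_failures(make_fake_macro({4, 9}), known_good, attempts=10)
print(f"{len(failed)} failures in 10 attempts, at attempts {failed}")
```

Run it once from the application server that sees the problem and once from some other client. If only the first one ever records a failure, that tells you something too.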

For what it's worth: I have seen a case where a query would intermittently fail. The situation in that case was that statistics were not collected on the tables involved and the explain would, from time-to-time, change into something nasty. The majority of the time, the plan would be fine. Other times the categories that were being selected would be significantly different and we would get incorrect results.

There are a lot of things that can be wrong before I would blame the macro, but if it's the same macro that is always associated with the corruption, then I would give it more attention while waiting to see the results of placing the sniffer on the Teradata side.

So that it isn't forgotten, in which way is the data corrupted? Are numbers changed to alpha? Are you finding unprintable (or binary) characters? Is there byte flipping? Are you losing sign bits? Are fields shifted? The nature of the corruption can give a pretty good picture of the cause.
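
If you can save a good copy and a bad copy of the same field or record, even a crude byte comparison will answer some of those questions. The Python below is a minimal sketch that only recognizes three patterns (a single flipped bit, two adjacent bytes swapped, and the whole value shifted by a few bytes). The categories and the function name are my own, and real corruption may well fall outside them.

```python
def classify_corruption(expected: bytes, actual: bytes) -> str:
    """Give a rough label for how `actual` differs from `expected`."""
    if expected == actual:
        return "identical"
    if len(expected) != len(actual):
        return "length changed"

    diffs = [i for i in range(len(expected)) if expected[i] != actual[i]]

    # Exactly one byte differs, and by exactly one bit.
    if len(diffs) == 1:
        changed_bits = expected[diffs[0]] ^ actual[diffs[0]]
        if changed_bits & (changed_bits - 1) == 0:
            return f"single-bit flip at byte {diffs[0]}"

    # Two neighbouring bytes have traded places.
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        first, second = diffs
        if expected[first] == actual[second] and expected[second] == actual[first]:
            return f"adjacent bytes swapped at byte {first}"

    # The data has slid right or left by up to half its length.
    for k in range(1, len(expected) // 2 + 1):
        if actual[k:] == expected[:-k]:
            return f"shifted right by {k}"
        if actual[:-k] == expected[k:]:
            return f"shifted left by {k}"

    return "other"


print(classify_corruption(b"1234", b"1235"))
print(classify_corruption(b"1234", b"1324"))
print(classify_corruption(b"ABCDEFGH", b"\x00ABCDEFG"))
```

A single-bit flip would make me think of memory or the wire, while swapped or shifted bytes would make me look harder at whatever is packing and unpacking the data.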

