Archives of the TeradataForum

Message Posted: Fri, 27 Sep 2002 @ 20:59:35 GMT
Gary,

We are investigating something that may be related to your problem. We are running 4.1.2.23, headed for 4.1.3.15 shortly. We have a large insert select that occasionally paralyzes a single node on the system. Because the query is run ad hoc, we can't be sure what percentage of runs are actually failing. The job also runs 2-3 hours, so testing is inconvenient.

The ratio between user and system CPU usage on the failing node is the indicator we use for the problem. The information shows on the Teradata Manager DUC display or in the sar data (shown below).

Here's the timeline for that failure:

12:15  Job starts; system is heavily loaded but responsive, ~90% user / 10% system.
14:00  System makes a sudden transition to the slowdown state, 39% user / 61% system.
14:30  System has degraded to 27% user / 73% system; system response is very poor.
14:35  Job aborted; busy node begins completing its backlog of tasks.
14:50  System returns to its normal load profile.

And the sar data from the affected node:

12:00:00   %usr   %sys    %sys   %wio   %idle
                  local   remote
12:00:00     65     11       0      2      22
12:05:00     22     11       0     13      54
12:10:00     16      5       0      3      76
12:15:00     89     10       0      0       0
12:20:00     88     12       0      0       0
12:25:00     87     12       0      0       0
12:30:00     88     12       0      0       0
12:35:00     90     10       0      0       0
12:40:00     85     15       0      0       0
12:45:00     83     17       0      0       0
12:50:00     87     13       0      0       0
12:55:00     92      8       0      0       0
13:00:00     92      8       0      0       0
13:05:00     91      9       0      0       0
13:10:00     89     11       0      0       0
13:15:00     90     10       0      0       0
13:20:00     93      7       0      0       0
13:25:00     91      9       0      0       0
13:30:00     90     10       0      0       0
13:35:00     86     14       0      0       0
13:40:00     89     11       0      0       0
13:45:00     87     13       0      0       0
13:50:00     88     12       0      0       0
13:55:00     62     38       0      0       0
14:00:00     39     61       0      0       0
14:05:00     35     65       0      0       0
14:10:00     33     67       0      0       0
14:15:00     33     67       0      0       0
14:20:00     28     72       0      0       0
14:25:00     28     72       0      0       0
14:30:00     27     73       0      0       0
14:35:00     29     71       0      0       0
14:40:00     77     23       0      0       0
14:45:00     78     22       0      0       0
14:50:00     22     21       0     36      22
14:55:00     14     14       0     17      55
15:00:00     10     23       0     38      29

The other nodes on the system show a similar sar report until the slowdown state is entered on the paralyzed node.
After that, they show increasing amounts of idle time as tasks begin to queue on the problem node. We have an incident open, but we're stalled until we can reproduce the problem reliably. Any insight would be welcome.

/dave hough
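[Editor's note: not part of the original post.] The user/system ratio check described above is easy to automate. Below is a minimal sketch that scans sar -u output and flags intervals where system CPU time dominates; the six-column line layout (time, %usr, %sys local, %sys remote, %wio, %idle) and the 50% threshold are assumptions for illustration, not anything the poster specified.

```python
def slowdown_intervals(sar_lines, sys_threshold=50):
    """Return (time, %usr, total %sys) tuples for intervals where
    combined system CPU (local + remote) meets the threshold.

    Assumes data lines of six whitespace-separated fields, e.g.
    "14:00:00  39  61  0  0  0"; header and blank lines are skipped.
    """
    flagged = []
    for line in sar_lines:
        fields = line.split()
        # Skip anything that is not a six-field numeric data row.
        if len(fields) != 6 or not fields[1].isdigit():
            continue
        time = fields[0]
        usr = int(fields[1])
        sys_total = int(fields[2]) + int(fields[3])  # local + remote
        if sys_total >= sys_threshold:
            flagged.append((time, usr, sys_total))
    return flagged

sample = [
    "12:00:00  %usr  %sys  %sys  %wio  %idle",  # header row
    "13:55:00  62  38  0  0  0",
    "14:00:00  39  61  0  0  0",
    "14:30:00  27  73  0  0  0",
]
print(slowdown_intervals(sample))
# -> [('14:00:00', 39, 61), ('14:30:00', 27, 73)]
```

On the data above, such a check would have fired at 14:00, right at the transition into the slowdown state.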
Copyright 2016 - All Rights Reserved
Last Modified: 15 Jun 2023