Archives of the TeradataForum

Message Posted: Fri, 27 Sep 2002 @ 20:59:35 GMT
Gary,

We are investigating something that may be related to your problem. We are running 4.1.2.23, headed for 4.1.3.15 shortly. We have a large insert select that occasionally paralyzes a single node on the system. Because the query is run ad hoc, we can't be sure what percentage of runs are actually failing. The job also runs 2-3 hours, so testing is inconvenient.

The ratio between user and system CPU usage on the failing node is the indicator we use for the problem. The information shows on the Teradata Manager DUC display or in the sar data (shown below).

Here's the timeline for that failure:

12:15  Job starts; system is heavily loaded but responsive, ~90% user / 10% system.
14:00  System makes a sudden transition to the slowdown state, 39% user / 61% system.
14:30  System has degraded to 27% user / 73% system; system response is very poor.
14:35  Job aborted; busy node begins completing its backlog of tasks.
14:50  System returns to its normal load profile.

And the sar data from the affected node:

12:00:00   %usr   %sys    %sys   %wio   %idle
                  local   remote
12:00:00     65     11       0      2      22
12:05:00     22     11       0     13      54
12:10:00     16      5       0      3      76
12:15:00     89     10       0      0       0
12:20:00     88     12       0      0       0
12:25:00     87     12       0      0       0
12:30:00     88     12       0      0       0
12:35:00     90     10       0      0       0
12:40:00     85     15       0      0       0
12:45:00     83     17       0      0       0
12:50:00     87     13       0      0       0
12:55:00     92      8       0      0       0
13:00:00     92      8       0      0       0
13:05:00     91      9       0      0       0
13:10:00     89     11       0      0       0
13:15:00     90     10       0      0       0
13:20:00     93      7       0      0       0
13:25:00     91      9       0      0       0
13:30:00     90     10       0      0       0
13:35:00     86     14       0      0       0
13:40:00     89     11       0      0       0
13:45:00     87     13       0      0       0
13:50:00     88     12       0      0       0
13:55:00     62     38       0      0       0
14:00:00     39     61       0      0       0
14:05:00     35     65       0      0       0
14:10:00     33     67       0      0       0
14:15:00     33     67       0      0       0
14:20:00     28     72       0      0       0
14:25:00     28     72       0      0       0
14:30:00     27     73       0      0       0
14:35:00     29     71       0      0       0
14:40:00     77     23       0      0       0
14:45:00     78     22       0      0       0
14:50:00     22     21       0     36      22
14:55:00     14     14       0     17      55
15:00:00     10     23       0     38      29

The other nodes on the system show a similar sar report until the slowdown state is entered on the paralyzed node.
After that, they show increasing amounts of idle time as tasks begin to queue on the problem node. We have an incident open, but we're stalled until we can reproduce the problem reliably. Any insight would be welcome.

/dave hough
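[Editor's note: not part of the original post.] The user/system ratio check described above is easy to automate. Below is a minimal sketch that scans sar -u output and flags intervals where system CPU time dominates; the six-column line layout (time, %usr, %sys local, %sys remote, %wio, %idle) and the 50% threshold are assumptions for illustration, not anything the poster specified.

```python
def slowdown_intervals(sar_lines, sys_threshold=50):
    """Return (time, %usr, total %sys) tuples for intervals where
    combined system CPU (local + remote) meets the threshold.

    Assumes data lines of six whitespace-separated fields, e.g.
    "14:00:00  39  61  0  0  0"; header and blank lines are skipped.
    """
    flagged = []
    for line in sar_lines:
        fields = line.split()
        # Skip anything that is not a six-field numeric data row.
        if len(fields) != 6 or not fields[1].isdigit():
            continue
        time = fields[0]
        usr = int(fields[1])
        sys_total = int(fields[2]) + int(fields[3])  # local + remote
        if sys_total >= sys_threshold:
            flagged.append((time, usr, sys_total))
    return flagged

sample = [
    "12:00:00  %usr  %sys  %sys  %wio  %idle",  # header row
    "13:55:00  62  38  0  0  0",
    "14:00:00  39  61  0  0  0",
    "14:30:00  27  73  0  0  0",
]
print(slowdown_intervals(sample))
# -> [('14:00:00', 39, 61), ('14:30:00', 27, 73)]
```

On the data above, such a check would have fired at 14:00, right at the transition into the slowdown state.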
Copyright 2016 - All Rights Reserved
Last Modified: 15 Jun 2023