Re: Myth or Reality - Database Corruption [From Anomy Anom: Sun, 02 Sep 2001 @ 19:26 GMT]

Message Posted: Sun, 02 Sep 2001 @ 19:26:55 GMT

Subj: Re: Myth or Reality - Database Corruption

From: Anomy Anom

<-- Anonymously Posted: Sunday, September 02, 2001 12:37 ->

I've worked with the Teradata database for about 15 years. Early on, it got very ugly at times - I could tell some scary stories - but I've seen a vast improvement over the years. I love the advances that have come from RAID technology and from virtual amps that migrate to another node when a node fails - what a wonderful thing! The database software seems to be pretty good at detecting abnormalities (I hope) and protecting itself by making any amps in question go fatal. The fsgwizard utility that the support center runs (we wouldn't dare run this ourselves) has saved us many times by flushing memory from the buddy node in order to bring amps back online without any further grief (unlike the old days of NVRAM failures).

Data corruption can mean a lot of different things. It can be triggered by a hardware failure or by software logic. The machine might puke on it (usually in the middle of the night), or it might go unnoticed, which I refer to as "incorrect results" (usually a software gotcha). NCR has come a long way in sharing information on these types of software problems via tech alerts (of course, seeing them gives us much more to fret about, but at least we can try to determine our exposure). But when the data breaks so badly that the machine restarts and a table appears to be corrupted, the recovery will depend on what it is that broke.

RAID saves you from the loss of a drive, which is goodness. Most of the time, drives fail without anyone noticing. But I've seen a few hardware problems with a disk or a controller that look just like serious data corruption - the machine restarts with ugly error messages and amps go fatal - and in the end, the problem was fixed by replacing a drive and reconstructing it or by replacing a controller. No data restores or table rebuilds needed. What interests me, however, are those times when RAID doesn't save you. If you're really concerned about data integrity and getting the database back online quickly, then spend the money on fallback. Don't believe the hype that says you don't need fallback if you have RAID. Once you've been burned, suddenly everyone in the shop wants to run fallback.

Admittedly, the types of problems where fallback saves you are rare events. They seem much more likely to occur on a larger system (especially as it grows to roughly 50 nodes, then to roughly 100 nodes, etc). At this size, I expect one of these problems to occur about once or maybe twice a year. It might be just one table that's broken, but 80% of the time it's your largest table. A table rebuild run from fallback data will be much, much quicker, easier, and less error-prone than a data restore. When it's the transient journal that breaks, you lose all non-fallback data on the machine (it happened to us once). With fallback, a table rebuild run on one amp saves you lots and lots of work finding dump data sets and running restore jobs for every non-fallback table on the machine.

The benefits of running with fallback data:

- When an amp goes fatal (this happens to us several times a year), then jobs & users can continue to run after the database restart completes (assuming mloads run with ampcheck none, etc). You might not need the fallback data for the recovery, but you will need it in order to stay up while NCR figures out what's wrong with the amp. BTW - Don't make the mistake of creating work tables as non-fallback if you'll need them to be accessible to run critical jobs or onlines.

- With fallback, a checktable ("level two") will have something to compare: primary data vs fallback data. Otherwise it doesn't do much besides checking the indexes. Checktables should be scheduled to run about once a month, during a quiet time (to avoid dictionary locks). BTW, the scandisk utility should be run about once a month as well, also during a quiet time (to avoid superfluous errors). It does a lower level check of the file system. Any errors should be reported to NCR support center.

- When data corruption is discovered, it's usually isolated to a particular table or amp. With fallback, the table rebuild utility gives you the option of rebuilding a broken table or rebuilding all the tables on that amp - the primary data on that amp is built from the fallback data and the fallback data on the amp is built from the primary data on the other amps in the cluster. BTW - make sure that amps in the same cluster are in different cabinets.
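
To make the rebuild and "level two" ideas above a bit more concrete, here is a toy Python sketch. It is only an illustration under loud assumptions: the placement rule, the function names (placement, load, rebuild_amp, level_two_check) and the amp names are all made up for this example, and real Teradata row hashing, the table rebuild utility and checktable work on internal structures that this does not try to reproduce. It just shows why a lost amp can be rebuilt from the other amps in its cluster, and why a primary-vs-fallback compare has something to compare.

```python
# Toy model of fallback within one cluster. Everything here is hypothetical:
# the placement rule and all names are illustrative, not Teradata internals.

_MISSING = object()  # sentinel so a missing copy never compares equal to a value


def placement(key, cluster):
    """Return (primary_amp, fallback_amp) for an integer row key.

    The fallback copy always lands on a different amp in the same cluster.
    Assumes the cluster has at least two amps.
    """
    n = len(cluster)
    p = key % n
    offset = 1 + (key // n) % (n - 1)  # 1..n-1, so never the primary amp
    return cluster[p], cluster[(p + offset) % n]


def load(rows, cluster):
    """Store every row twice: one primary copy and one fallback copy."""
    primary = {amp: {} for amp in cluster}
    fallback = {amp: {} for amp in cluster}
    for key, value in rows.items():
        p_amp, f_amp = placement(key, cluster)
        primary[p_amp][key] = value
        fallback[f_amp][key] = value
    return primary, fallback


def rebuild_amp(failed, cluster, primary, fallback):
    """Rebuild both copies on one amp from the other amps in the cluster.

    Primary rows come from fallback copies held elsewhere; fallback rows
    come from primary copies held elsewhere.
    """
    primary[failed], fallback[failed] = {}, {}
    for amp in cluster:
        if amp == failed:
            continue
        for key, value in fallback[amp].items():
            if placement(key, cluster)[0] == failed:
                primary[failed][key] = value
        for key, value in primary[amp].items():
            if placement(key, cluster)[1] == failed:
                fallback[failed][key] = value


def level_two_check(cluster, primary, fallback):
    """Return the keys whose primary and fallback copies differ or are missing."""
    p_all = {k: v for amp in cluster for k, v in primary[amp].items()}
    f_all = {k: v for amp in cluster for k, v in fallback[amp].items()}
    return sorted(
        k for k in p_all.keys() | f_all.keys()
        if p_all.get(k, _MISSING) != f_all.get(k, _MISSING)
    )


if __name__ == "__main__":
    cluster = ["amp0", "amp1", "amp2", "amp3"]
    rows = {k: f"row-{k}" for k in range(20)}
    primary, fallback = load(rows, cluster)

    primary["amp2"], fallback["amp2"] = {}, {}  # simulate losing amp2
    print("rows flagged before rebuild:", len(level_two_check(cluster, primary, fallback)))

    rebuild_amp("amp2", cluster, primary, fallback)
    print("rows flagged after rebuild:", len(level_two_check(cluster, primary, fallback)))
```

Without the fallback dictionary there would be nothing to rebuild amp2 from and nothing for the compare to check against, which is the point of the second and third bullets.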

Of course, the drawbacks of running fallback are the doubling of the space required ($) and the performance impact of doing extra I/Os ($). That has to be weighed against the heat that you'll get when you tell management that the data will be unavailable for a day while the data is reloaded (note: journaling won't save you from restoring the data if you have non-fallback tables). It's all a matter of what risk you can accept vs. how much money you have to spend.
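
The trade-off in the paragraph above can be put into rough arithmetic. The sketch below is a back-of-the-envelope Python example, not a sizing method: the function names and every dollar figure are hypothetical, and only the "once or maybe twice a year" frequency and the "unavailable for a day" outage are taken from the post.

```python
# Back-of-the-envelope comparison. All dollar amounts are hypothetical
# placeholders; plug in your own shop's numbers.

def annual_cost_with_fallback(extra_storage_cost, extra_io_cost):
    """Yearly cost of running fallback: doubled space plus the extra I/O."""
    return extra_storage_cost + extra_io_cost


def annual_cost_without_fallback(events_per_year, outage_hours_per_event,
                                 cost_per_outage_hour):
    """Expected yearly cost of outages while non-fallback data is restored."""
    return events_per_year * outage_hours_per_event * cost_per_outage_hour


def break_even_cost_per_hour(fallback_cost, events_per_year,
                             outage_hours_per_event):
    """Outage cost per hour above which fallback is the cheaper option."""
    return fallback_cost / (events_per_year * outage_hours_per_event)


if __name__ == "__main__":
    # From the post: roughly one to two events a year, about a day each.
    events_per_year = 1.5
    outage_hours = 24

    # Hypothetical yearly figures for the extra disk and the extra I/O.
    fallback_cost = annual_cost_with_fallback(400_000, 100_000)

    threshold = break_even_cost_per_hour(fallback_cost, events_per_year, outage_hours)
    print(f"fallback pays for itself if an outage hour costs more than ${threshold:,.0f}")
```

If an hour of unavailable data costs your business more than the break-even figure, fallback is the cheaper choice on these assumptions; if it costs less, you are accepting the risk to save money.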


Archives of the TeradataForum
