Archives of the TeradataForum

Message Posted: Thu, 16 Mar 2005 @ 23:51:00 GMT


     


Subj:   Re: Key Tables and Primary Indexes
 
From:   McCall, Glenn David

Hi there. Hopefully mailing lists such as this one can help clear up any questions and doubts you have.

First off, I'm not sure what you mean by a key table. Is it some sort of table that translates a code value into descriptive text? Such as:

     Code    meaning
     0       normal
     1       under review
     2       suspended
     Etc?

Selection of a primary index is a fairly important part of the definition of a table.

As you probably know, Teradata is a massively parallel database engine. This means that it uses multiple processes, multiple CPUs and potentially multiple connected computer systems (nodes) to service a query. To get the best performance it is desirable to distribute the workload evenly across all of these processors, which is typically achieved by distributing the data evenly across them.

The way Teradata distributes the data in a table is determined by the primary index you select for that table. Teradata (as do most RDBMSs) will perform a calculation (hashing) on the values of the columns you specify in the primary index to determine where to place each row of data. Ideally you want to pick a primary index that has a wide range of unique values (high cardinality).

The best primary indexes are primary keys - where the values are unique for every single row (e.g. customer number in a customer table).
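To make that concrete, here is a minimal sketch of what declaring a primary index looks like in Teradata DDL. The table and column names (Customer, Cust_Num and so on) are just made up for illustration:

     CREATE TABLE Customer
     ( Cust_Num    INTEGER NOT NULL
     , Cust_Name   VARCHAR(100)
     , Status_Cd   BYTEINT
     )
     UNIQUE PRIMARY INDEX (Cust_Num);

If you leave the UNIQUE keyword off you get a non-unique primary index (NUPI), which is fine as long as the number of rows per value stays small.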

If you pick a set of columns that have many rows with the same values, you will find that Teradata (and other RDBMSs) will have no choice but to place all of these rows in the same place. For other RDBMSs this might not be a problem because they are single-node systems (i.e. all the data is in the same place anyway), or because you are doing single-row retrievals (e.g. an OLTP application). But for Teradata, where you are doing massively parallel DSS-style queries (i.e. analysing all data from all processors), you can get a situation where one processor is doing more than its fair share of the work because it has more than its fair share of the data. This condition is often referred to as a "hot AMP" or skewing.

The result of skewing is - as you have found - that one AMP will run slower than the others. More precisely, it will take longer to complete its portion of the work than the others did.

The solution is to evenly distribute the data by picking a primary index with high cardinality.
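If you want to see how evenly a candidate primary index spreads the rows, one approach (using the made-up Customer table from the sketch above) is to ask Teradata which AMP each row hashes to and count the rows per AMP:

     SELECT HASHAMP(HASHBUCKET(HASHROW(Cust_Num))) AS Amp_No
          , COUNT(*)                               AS Row_Count
     FROM   Customer
     GROUP BY 1
     ORDER BY 2 DESC;

A badly skewed index shows up as one AMP (or a handful of AMPs) carrying far more rows than the rest - in the null case, every row with a null PI value hashes to the same AMP.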

You ask whether you can do without the null values - well, it depends. Is it OK to delete the records with null PI values? If so, then delete them.

However, if you need to keep these rows and are asking whether you can somehow get Teradata to apply different rules when hashing the null values, then the answer is: only by changing the primary index definition. Try to think of it like this - imagine you are filing name cards and you have 27 drawers, one for each letter of the alphabet and one for blank. If you have 100 name cards, 50 of which are blank and 50 of which have names starting with different letters, and you have to file them into the drawers, you will end up with 26 drawers each holding a small number of cards and one drawer holding 50 cards (this is your null). You can't change the distribution unless you change the rules (i.e. the index columns) - e.g. file the name cards by phone number or some other criterion.
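One way to change the rules, if your queries allow it, is to add another column to the primary index so that rows with a null in the original column no longer all land in the same drawer. A rough sketch with hypothetical names:

     CREATE TABLE Txn_Hist
     ( Cust_Num   INTEGER              -- may be null for some rows
     , Txn_Id     INTEGER NOT NULL
     , Txn_Dt     DATE
     )
     PRIMARY INDEX (Cust_Num, Txn_Id);

The trade-off is that the primary index also determines how the table is accessed and joined, so you need to make sure the new index still suits the way the table is actually queried.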


I hope this helps.

Glenn McCall


P.S. Another tip: if you are joining tables, make sure that the data types are the same for the join columns in any given join expression (e.g. int = int and char = char). Different expressions in a join can use different data types. For example, I had one site that was doing this:

     Select ...
     From Account ACT Join Transaction TRN on
         ACT.Cust_Num = TRN.Cust_Num and
         ACT.Acct_Num = TRN.Acct_Num

Which is fine - the only problem was that ACT.Cust_Num was numeric but TRN.Cust_Num was VARCHAR(12), though it still contained numbers. This was a problem because the hashing takes the data type into account, i.e. 1234 is hashed differently to '1234'. Making them the same data type (it didn't matter which way we went, both integer or both character, just as long as they were the same) made the query run a heck of a lot faster.
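In case it is useful, one common way to make the types match (sketched here with made-up columns, since I don't know the real Transaction definition) is to rebuild the table with the matching data type and copy the rows across with a CAST, rather than casting inside every query:

     CREATE TABLE Transaction_New
     ( Cust_Num   INTEGER NOT NULL
     , Acct_Num   INTEGER NOT NULL
     , Txn_Amt    DECIMAL(12,2)
     )
     PRIMARY INDEX (Cust_Num, Acct_Num);

     INSERT INTO Transaction_New (Cust_Num, Acct_Num, Txn_Amt)
     SELECT CAST(Cust_Num AS INTEGER), Acct_Num, Txn_Amt
     FROM   Transaction;

Casting inside the join itself will give you the right answer, but it still forces a conversion at query time, so fixing the column definition is usually the better long-term option.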


P.P.S. Most of the stuff I talk about here is not specific to Teradata, with the exception of the massively parallel stuff. Good index selection, matching data types and so on applies to most if not all RDBMSs.



     