Archives of the TeradataForum

Message Posted: Fri, 13 Aug 2010 @ 16:25:57 GMT
Hi Naveen,

It sounds like you've got some very strange data! Using NOPI, if you load the data from an external file then the data rows will be "round-robin'd" across the AMPs in the config. With 300K rows, that should not give you a distribution problem. As someone else has said, if you're doing an INSERT/SELECT from an existing table then different rules come into play.

However, if you've got 20-30 columns in the table definition (like the one that you showed before) then I doubt that you need to go that far. Unless (as I said before) you've got some very strange data (all/mainly nulls or zeros?), I would expect to find a number of column combinations which provide good distribution. Obviously, the final choice of which of those to use will depend on the normal PI criteria (joining, access, duplicate count, etc.).

If you've got a table that is loaded on your Teradata system, then you can model the distribution of any candidate PI without having to actually build the table. Do this using the hash functions as shown below.

Example 1: choose two columns from a table.

   SELECT (100.00 * AVG(rowcount) / MAX(rowcount)) AS ParEff
         ,AVG(RowCount)
         ,MAX(RowCount)
   FROM (SEL HASHAMP(HASHBUCKET(HASHROW(thedate,thetime))) AS AmpNo
            ,CAST(COUNT(*) AS FLOAT) AS RowCount
         FROM dbc.resusagespma
         GROUP BY 1
        ) AS dt1;

   ParEff   Average(RowCount)   Maximum(RowCount)
    82.62              135.50              164.00

Example 2: choose a single column from a table.

   SELECT (100.00 * AVG(rowcount) / MAX(rowcount)) AS ParEff
         ,AVG(RowCount)
         ,MAX(RowCount)
   FROM (SEL HASHAMP(HASHBUCKET(HASHROW(thedate))) AS AmpNo
            ,CAST(COUNT(*) AS FLOAT) AS RowCount
         FROM dbc.resusagespma
         GROUP BY 1
        ) AS dt1;

   ParEff   Average(RowCount)   Maximum(RowCount)
    51.36              295.82              576.00

Note that the only things you have to change from my examples above are the name of your table (dbc.resusagespma in the examples) and the column name(s): either "thedate" or "thedate,thetime" in my examples.
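The ParEff arithmetic in those queries is just 100 * AVG(rows per AMP) / MAX(rows per AMP), e.g. 100 * 135.50 / 164.00 = 82.62 in Example 1. You can model the idea outside Teradata too; this is only a rough sketch, not Teradata's real hashing: md5 stands in for the HASHROW/HASHBUCKET/HASHAMP chain, and the 10-AMP config and sample data are invented for illustration.

```python
# Rough model of the ParEff queries above. Assumptions: md5 is a stand-in
# for Teradata's HASHROW/HASHBUCKET/HASHAMP chain, and the AMP count and
# sample rows are made up for illustration.
import hashlib
from collections import Counter

def amp_for(value, num_amps):
    # Hash the candidate-PI value and map it to an AMP number.
    digest = hashlib.md5(repr(value).encode()).hexdigest()
    return int(digest, 16) % num_amps

def parallel_efficiency(rows, key, num_amps):
    # Count rows landing on each AMP, then ParEff = 100 * avg / max.
    # Like the GROUP BY in the SQL, AMPs that receive no rows drop out
    # of the average.
    counts = Counter(amp_for(key(r), num_amps) for r in rows)
    avg = sum(counts.values()) / len(counts)
    return 100.0 * avg / max(counts.values())

num_amps = 10
rows = [{"thedate": d, "thetime": t} for d in range(100) for t in range(300)]

# A well-spread composite key: every (thedate, thetime) pair is distinct.
good = parallel_efficiency(rows, lambda r: (r["thedate"], r["thetime"]), num_amps)

# A skewed key: every date from 10 onwards collapses to one value.
bad = parallel_efficiency(rows, lambda r: r["thedate"] if r["thedate"] < 10 else 0, num_amps)

print(f"well-spread composite key ParEff: {good:.2f}")
print(f"skewed key ParEff: {bad:.2f}")
```

The well-spread key should report a ParEff in the high nineties, while the skewed key comes out far lower, which is the same signal the SQL gives you.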
Note also that this query reports "ParEff" (Parallel Efficiency) rather than skew. For Parallel Efficiency you want higher values (ideally 100%, but certainly 95%+). Run those queries a few times, choosing different combinations of columns, and see what that tells you.

Cheers,
Dave

Ward Analytics Ltd - Information in motion ( www.ward-analytics.com )
Copyright 2016 - All Rights Reserved
Last Modified: 15 Jun 2023