Home Page for the TeradataForum
 

Archives of the TeradataForum

Message Posted: Thu, 09 Oct 2003 @ 19:30:46 GMT


     
  <Prev Next>  
<<First
<Prev
Next> Last>>  


Subj:   Guru needed: OLAP function quandary
 
From:   Jeremy Christiansen

I'm having problems with OLAP functions again. I've written to the list before about this issue but never got anywhere. It seems that no one really knows what goes on inside Teradata when it comes to exactly how OLAP functions accomplish their tasks. Maybe a Teradata guru out there could provide some insight. (Hint, Hint.) Here is the problem:

OLAP FUNCTIONS USE WAY TOO MUCH SPOOL!

Case in point: I have 1 volatile table loaded with data. Let's call it V1. My current spool is 15 gig. I have nothing else running. Now I run a query on V1 that simply selects all data in it with the addition of 1 OLAP calculation, and then inserts the results into another table, call it V2.

Insert into V2
Select a.*, csum(1,primary_index,more_fields)
from V1 a
group by primary_index
;


Here is the explain:

Explanation
--------------------------------------------------
 
  1)First, we do an all-AMPs STAT FUNCTION step from CHRISTIANSENJDCHBULL.a by way of an all-rows scan with no residual conditions into Spool 5 (Last Use), which is assumed to be redistributed by value to all AMPs. The result rows are put into Spool 3, which is built locally on the AMPs.  
  2)Next, we do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of an all-rows scan into Spool 1, which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 1 by row hash. The size of Spool 1 is estimated with no confidence to be 48 rows. The estimated time for this step is 0.04 seconds.  
  3)We do a MERGE into CHRISTIANSENJDCHBULL.vt_eligiblemonth2 from Spool 1 (Last Use).  
  4)Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.  
  ->No rows are returned to the user as the result of statement 1.  


Guess what happens? Before the job blows on spool, I end up with about 90 gig of spool being used. 90! What the heck? There is no hot-amping going on here. All amps have roughly equal amounts of data. Why on earth does one OLAP function cause my spool to balloon to over 6 times the size of the original table? It makes no sense. It's like Teradata is internally making 4 or 5 copies of every row. This sounds a bit outrageous, but it may not be far from the truth. I did a further test. I rewrote the query like this:

Insert into V2
Select a.*,b.OLAP_Field
from v1 a
inner join
(
Select primary_index,csum(1,primary_index,more_fields) as OLAP_Field
from V1
group by primary_index
) b
on a.primary_index = b.primary_index
;

This method deals only with a small portion of each row when performing the OLAP function--it uses the primary_index and a couple of other fields. After the OLAP function completes, the result is put into a derived table. This is then joined back to the rest of the data in V1, and then inserted into V2. So it does the same thing as the first version of the query, except in a more round about way.

What I found is that the second version not only uses far less spool, but also runs much faster. The only real change to the query is that the OLAP function no longer operates on the entire row from V1, but rather just a subset (the keys and whatnot). This supports my hypothesis that Teradata is creating a lot of extra "work data".

It would be nice to know why OLAP functions behave as they do (with regard to spool usage) and whether or not any changes are coming down the pike that will reduce their space requirements.



     
  <Prev Next>  
<<First
<Prev
Next> Last>>  
 
 
 
 
 
 
 
 
  
  Top Home Privacy Feedback  
 
 
Copyright for the TeradataForum (TDATA-L), Manta BlueSky    
Copyright 2016 - All Rights Reserved    
Last Modified: 15 Jun 2023