Archives of the TeradataForum

Message Posted: Thu, 18 Jun 1999 @ 00:47:19 GMT


     
 


Subj:   Some changes to EXPLAINs in V2R3
 
From:   John Hall

I've just gotten access to a machine with V2R3. Take a look at what they've done to the EXPLAIN (there's more after the EXPLAIN):

Just a few details focusing on Step #2: The LC_CONT table has a total of 2,285,906 rows. The help statistics (and they are current) for this table:

     Column      Unique Values
     ------      -------------
     LOC                 5,168
     CLU_NO             29,836
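
(For reference, the unique-value counts above come from HELP STATISTICS. A request along these lines should also report the date and time each statistic was last collected, which is how you can tell whether they are current:)

     HELP STATISTICS PDB.LC_CONT;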


     EXPLAIN
     LOCKING PDB.CLS_LOC_HST FOR ACCESS
     LOCKING PDB.SUBLN FOR ACCESS
     LOCKING PDB.CSTORES FOR ACCESS
     LOCKING PDB.LC_CONT FOR ACCESS
     LOCKING PDB.CDIV FOR ACCESS

     SEL D.DIV_SEQ
       , D.GRO_NO
       , D.CAT_NO
       , D.VBS_NO
       , S.DIV_NO
       , S.LN_NO
       , C.MTH_NO
       , COUNT(*)
       , SUM(H.REG_GRS_SLL_DLR)

       FROM PDB.CLS_LOC_HST   H
          , PDB.SUBLN         S
          , PDB.CSTORES       C
          , PDB.LC_CONT       L
          , PDB.CDIV          D

       WHERE D.CAT_NO <> 'Y'
         AND L.CLU_NO = 57735
         AND C.CYR_NO = 1999
         AND C.YR_NO = 1999
         AND H.SLS_CD in ('F','L','R','S','T')
         AND S.DIV_NO = D.DIV_NO
         AND C.LOC = L.LOC
         AND H.DIV_NO = S.DIV_NO
         AND H.LN_NO = S.LN_NO
         AND H.SBL_NO = S.SBL_NO
         AND H.MTH_NO = C.DB_MTH_NO
         AND H.LOC = C.LOC

       GROUP BY D.DIV_SEQ
           , D.GRO_NO
           , D.CAT_NO
           , D.VBS_NO
           , S.DIV_NO
           , S.LN_NO
           , C.MTH_NO

       ORDER BY D.DIV_SEQ
           , D.GRO_NO
           , D.CAT_NO
           , D.VBS_NO
           , S.DIV_NO
           , S.LN_NO
           , C.MTH_NO;

     *** Help information returned. 78 rows.
     *** Total elapsed time was 1 second.

Explanation
--------------------------------------------------
 
  1)First, we lock PDB.SUBLN for access, we lock PDB.CDIV for access, we lock PDB.CLS_LOC_HST for access, we lock PDB.LC_CONT for access, and we lock PDB.CSTORES for access.  
  2)Next, we do a single-AMP RETRIEVE step from PDB.L by way of the primary index with no residual conditions into Spool 3, which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 3 by row hash. The size of Spool 3 is estimated with high confidence to be 67 to 2,281,593 rows. The estimated time for this step is 0.05 seconds to 17 minutes and 10 seconds.  
  3)We execute the following steps in parallel.  
   
  1) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of a RowHash match scan, which is joined to PDB.C with a condition of ("(PDB.C.YR_NO = 1999) AND (PDB.C.CYR_NO = 1999)"). Spool 3 and PDB.C are joined using a merge join, with a join condition of ("PDB.C.LOC = LOC"). The result goes into Spool 4, which is duplicated on all AMPs. The size of Spool 4 is estimated with low confidence to be 107,136 to 40,393,577,472 rows. The estimated time for this step is 0.76 seconds to 5 hours and 36 minutes.
 
   
  2) We do an all-AMPs RETRIEVE step from PDB.D by way of an all-rows scan with a condition of ("PDB.D.CAT_NO <> 'Y'") into Spool 5, which is duplicated on all AMPs. The size of Spool 5 is estimated with no confidence to be 25,728 to 28,416 rows. The estimated time for this step is 0.39 seconds.
 
  4)We do an all-AMPs JOIN step from PDB.S by way of an all-rows scan with no residual conditions, which is joined to Spool 4 (Last Use). PDB.S and Spool 4 are joined using a product join, with a join condition of ("(1=1)"). The result goes into Spool 6, which is built locally on the AMPs. The size of Spool 6 is estimated with low confidence to be 3,406,032 to 1,284,179,150,464 rows. The estimated time for this step is 3.87 seconds to 391 hours and 41 minutes.  
  5)We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of an all-rows scan, which is joined to Spool 6 (Last Use). Spool 5 and Spool 6 are joined using a product join, with a join condition of ("DIV_NO = DIV_NO"). The result goes into Spool 7, which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 7 by row hash. The size of Spool 7 is estimated with no confidence to be 3,169,502 to 1,319,850,793,533 rows. The estimated time for this step is 29.08 seconds to 5,456 hours and 49 minutes.  
  6)We do an all-AMPs JOIN step from Spool 7 (Last Use) by way of a RowHash match scan, which is joined to PDB.H with a condition of ("(PDB.H.SLS_CD = 'F') OR ((PDB.H.SLS_CD = 'L') OR ((PDB.H.SLS_CD = 'R') OR ((PDB.H.SLS_CD = 'S') OR (PDB.H.SLS_CD = 'T'))))"). Spool 7 and PDB.H are joined using a merge join, with a join condition of ("(PDB.H.LOC = LOC) AND ((PDB.H.MTH_NO = DB_MTH_NO) AND ((LOC = PDB.H.LOC) AND ((PDB.H.SBL_NO = SBL_NO) AND ((PDB.H.LN_NO = LN_NO) AND ((PDB.H.DIV_NO = DIV_NO) AND (DIV_NO=PDB.H.DIV_NO ))))))"). The input table PDB.H will not be cached in memory. The result goes into Spool 2, which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 2 is estimated with no confidence to be 8,570,233 to 172,589,078 rows. The estimated time for this step is 2 minutes and 34 seconds to 3 hours and 28 minutes.  
  7)We do a SUM step to aggregate from Spool 2 (Last Use) by way of an all-rows scan, and the grouping identifier in field 1. Aggregate Intermediate Results are computed globally, then placed in Spool 8. The aggregate spool file will not be cached in memory. The size of Spool 8 is estimated to be 8,570,233 rows.  
  8)We do an all-AMPs RETRIEVE step from Spool 8 (Last Use) by way of an all-rows scan into Spool 1, which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 8,570,233 to 172,589,078 rows. The estimated time for this step is 37.57 seconds to 10 minutes and 28 seconds.  
  9)Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.  
  ->The contents of Spool 1 are sent back to the user as the result of statement 1.  


Seeing the differences was pretty confusing, so with a lot of help from Anita Richards, I've pulled together some useful info:


--------------------------------------------------

Some text from DR41584:

For join and retrieve steps, the confidence in the estimate and an estimate range (instead of only a single estimate) may be included in the explain. Note that this additional information increases the size of explains somewhat.

Confidence Phrases

with high confidence
with low confidence
with no confidence
with index join confidence


For base relations, a level of confidence is assigned as follows:

* If there exist applicable statistics on indexes/columns with restricting conditions, the base relation will have an estimate with high confidence.

* If there are no statistics but the optimizer can make an estimate based on random sampling of an index, or if statistics can be used but the restricting condition is an ANDed condition on non-indexed columns or an ORed condition, the base relation will have an estimate with low confidence.

* Otherwise, the base relation will have an estimate with no confidence.


A join will have a confidence level that is the lowest confidence level of its input relations, or no confidence if statistics do not exist on either join field. Alternatively, a join may be assigned index join confidence if there is a join connection via a unique index.
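
For example, if one input relation is estimated with high confidence and the other with low confidence, the join result will be estimated with low confidence.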

For a retrieve from a spool, the confidence is the same as that of the step generating the spool (in cases where the cost is not reported, the confidence will also not be reported, but it can easily be derived from the step that generated the spool).

Low or no confidence may indicate a higher degree of variance in the actual execution time and actual rows compared to the estimates. Collecting statistics on appropriate indexes and columns may improve the confidence, improve the estimates, reduce variances in the estimate from actuals, or generate an improved plan. See also the information for estimate phrases.
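
As a rough sketch (the column choices here are only illustrative, taken from the restricting and join conditions in the query above; pick the columns that match your own conditions), statements along these lines collect statistics on a restricting column and a couple of the join columns:

     COLLECT STATISTICS ON PDB.CDIV COLUMN CAT_NO;
     COLLECT STATISTICS ON PDB.CSTORES COLUMN LOC;
     COLLECT STATISTICS ON PDB.CLS_LOC_HST COLUMN DIV_NO;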

Estimate Phrases

l row[s]
l to h row[s]
ltime
ltime to htime


The "s" after "row" is omitted if the preceding count is 1.

Time is expressed as "x hours", "x hours and y minutes", "y minutes", "y minutes and z seconds", or "z.zz seconds" as appropriate. If only seconds are reported for both ltime and htime, "seconds" is omitted for ltime. If the hours, minutes, or seconds value is 1, the singular is used for hour, minute, or second, respectively.

If a row or time estimate is too large to display, three asterisks (***) are displayed instead.

The high end of a range (and the preceding " to ") is not included in the explain if it is the same as the low end of the range, if it is not available, or if the low end overflows (that is, if asterisks are printed for the low end).
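
Several of these forms appear in the EXPLAIN above and in the example at the end of this text, for instance:

     estimated with high confidence to be 67 to 2,281,593 rows   (l to h rows)
     estimated to be 8,570,233 rows                              (single estimate; high end omitted)
     The estimated time for this step is 0.39 seconds.           (ltime only)
     0.05 seconds to 17 minutes and 10 seconds                   (ltime to htime)
     0.49 to 0.62 seconds                                        ("seconds" omitted for ltime)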

The ltime (or low end) cost estimate is used in choosing the lowest cost plan and is calculated based on the l (or low end) row estimate. The low end row estimate is calculated as in previous releases. The low end estimates (rather than midpoint or high end estimates) are used in choosing the plan in order to produce the most aggressive reasonable plan.

The htime (or high end) cost estimate has no effect on planning decisions and is calculated based on the h (or high end) row estimate. The high end row estimate gives an indication of a reasonable upper bound on the number of rows (cardinality) when more conservative assumptions are used compared to the low end estimate.

A low end row estimate, based on any statistics collected for a table, is used to estimate the size of the spool file needed to accommodate spooled data. If statistics have not been collected for a table, this estimate may be grossly incorrect (see the COLLECT STATISTICS statement).

An estimated time is an approximation based on average times for the suboperations that comprise the overall operation, and the estimated number of rows involved in the operation. The accuracy of a time estimate is affected by the accuracy of the corresponding row estimate. The time shown is not clock time; you should consider it to be an arbitrary unit of measure that compares different operations.

The high end row estimate is propagated from one step to the next.

For a base relation, the high end row estimate is equal to the low end row estimate.

For a join, the high end row estimate is calculated as follows:

* If statistics are collected on the join fields or random sampling on indexes can be used, the high end row estimate will be calculated using the statistics or random sampling information and the high end row estimates of the input relations.

* Otherwise, the high end row estimate of a join result will be the product of the high end row estimates of the input relations.
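
As a hypothetical illustration of the second case (the numbers are invented, not taken from the EXPLAIN above): if one input relation has a high end estimate of 50,000 rows and the other of 2,000 rows, and there are no statistics on the join fields, the high end estimate of the join result is 50,000 x 2,000 = 100,000,000 rows, even though the low end estimate might be only a few thousand rows.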


The high end row estimate of a join relation may be the same as the low end row estimate when the total number of rows in the join result is estimated with low or high confidence.

A large variance between the low and high end estimates may indicate that the query and plan should be examined more closely, that statistics need to be collected, that the runtime may be significantly different from the low end estimate, or that the execution of the query should be closely monitored. Usually, a large difference between the low and high end estimates means that statistics are missing; the user should try to correct this before running the query.

When the low or high end cost appears to be excessive, this may indicate the possibility of a runaway plan. The user may then examine the query to see if anything can be improved (check for an incorrect or poorly expressed query, or for missing statistics) or choose to run it only when there are enough resources. The user may want to closely monitor the execution of the query to see if it is executing closer to the low end cost estimate or actually runs much longer.

For a query, a significant difference in the high end estimates of one plan compared to a previous plan may indicate that a closer comparison of the two plans is needed (did a merge join become a product join?). Even though the low end estimates may be similar (leading the optimizer to choose a different plan), a difference in the high end estimate may indicate the potential for a longer execution time. Again, this may indicate that statistics are missing or out of date.

As is true for the low end estimate, the high end estimate does not include the effects of the actual concurrent workload. Both can be affected by statistics that are out of date and no longer represent the demographics of the data. That is, out of date statistics may lead to overly pessimistic or optimistic estimates (estimates are based on the assumption that the statistics are representative of the data demographics).

The estimates may not reflect the actual number of rows returned or the actual execution time. The actuals may be outside the range. These estimates should be used only as indications and in comparing execution plans in a relative sense.

Note that a plan is chosen based on the low end cost estimate. If the actual number of rows is close to the low end row estimate, this should be a good plan. However, if the actual number is closer to the high end row estimate, then it is possible that another plan (with a higher low end cost estimate but a lower high end cost estimate) may be better. Statistics will help the optimizer make better estimates and choose better plans.

Example:

       
  8)We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of an all-rows scan, which is joined to TPCDU.N2 with a condition of ("(TPCDU.N2.N_NAME = 'FRANCE') OR (TPCDU.N2.N_NAME = 'GERMANY')"). Spool 3 and TPCDU.N2 are joined using a product join, with a join condition of ("((Spool_3.N_NAME = 'FRANCE') OR (TPCDU.N2.N_NAME = 'FRANCE')) AND ((TPCDU.N2.N_NAME = 'GERMANY') OR (Spool_3.N_NAME = 'GERMANY'))"). The input table TPCDU.N2 will not be cached in memory, but it is eligible for synchronized scanning. The result goes into Spool 4, which is duplicated on all AMPs. The result spool file will not be cached in memory. The size of Spool 4 is estimated with low confidence to be 4 to 8 rows. The estimated time for this step is 0.49 to 0.62 seconds.  



     