Introduction
We are entering a unique period in the history of business where technologies are rapidly converging to enable the next leap
forward in data warehousing - the ability to affordably keep all business data accessible to the warehouse -
forever.
The amount of business data is exploding at an unprecedented rate. Businesses are retaining enormous amounts of detailed data,
such as call detail records, transaction history, and web click-streams, and mining it to identify business value. Regulatory and
legal retention requirements are leading businesses to keep many years of historical data accessible to the warehouse.
Since its inception, Teradata has been the number one choice for the largest data warehouses in existence. As we enter the era of
petabyte-scale data warehouses, advanced technologies such as data compression and high-density bulk disk drives are enabling
Teradata to extend its lead in very large data warehouses and to make it cost effective to keep enormous data volumes in the
warehouse.
The trend among Teradata customers is to run more and more applications on Teradata and to retain history data for
longer periods of time. This trend is paced by the affordability of massive warehouses. By recognizing that the performance
requirements for data managed by the warehouse are varied, we can bring technologies to bear that will reduce the effective price of
massive data warehouses.
Many customers have varied performance requirements for time-based or history data. Recent data may be accessed frequently, while
older data is accessed less often. Let's use hot, warm, cool, and dormant to describe the average access rate of different classes
of data. In a warehouse with massive history, the volume of data will typically be inversely proportional to its usage. In
other words, there will be an enormous amount of cool history data that is accessed lightly, and a relatively small volume of hot and
warm data that is accessed frequently. The temperature of the data reflects the access rate to that data. The best
storage for hot, warm, and cool data is online disk that is directly accessible to the data warehouse. Truly dormant data is a
candidate for alternative storage - such as tape, optical, or perhaps holographic storage in the future.
We distinguish between data that is classified as "cool" vs. data that is truly dormant. While the cool data may be accessed
lightly on average, it still has temporal hot spots, for example, when doing comparative analysis between time periods. Similarly, if
the data model is modified via the addition of new columns or if data types are changed, then the data is not dormant. Furthermore, if
the data is periodically accessed to recast the data to make historical data relevant in the current business context, then the data
is not dormant. Dormant data is rarely, if ever, accessed.
Customers want seamless management of data as the query and update frequency, or data temperature, of that data changes over time.
They also want the price of storage to be representative of the changing temperature of the data.
The technologies we can use to reduce the effective price of data warehouse storage when the warehouse contains hot, warm, cool,
and dormant data are:
- High capacity disk drives
- High capacity configurations
- Teradata compression
- Alternative storage for truly dormant data
This white paper focuses on Teradata compression, a technology that both reduces the effective price of logical data storage
capacity and improves query performance. Customers using Teradata compression in V2R4 have experienced capacity savings of up to 50%
and scan performance increases of up to 35%.
The Benefits of Compression
Teradata compression reduces storage cost by storing more logical data per unit of physical capacity. Performance is improved
because there is less physical data to retrieve during scan-oriented queries. Performance is further enhanced since data remains
compressed in memory, so the Teradata cache can hold more logical rows. The compression algorithm used by Teradata is extremely
efficient, and the savings in compute resources from reduced disk accesses more than compensate for the CPU used for compression.
Another way to improve performance is to use the space saved by compression for advanced indexes. All of these benefits will be
explored in the remaining sections of this white paper.
Teradata Compression in V2R4
Teradata uses a lossless compression method. This means that although the data is compacted, there is no loss of
information.
The granularity of Teradata compression is the individual field of a row. This is the finest level of granularity possible
and is superior for query processing, updates, and concurrency. Field compression offers better performance than row level or
block level compression schemes, which require extra work to uncompress an entire row or block whenever it might contribute
to a result set. Furthermore, field compression allows compression to be independently optimized for the data domain of each column.
Teradata performs database operations directly on the compressed rows - there is no need to reconstruct a decompressed row for
query processing. Of course, external representations and answer sets include the fully uncompressed results.
In Teradata V2R4, one value in each column can be compressed out of the row body. If the column is nullable, then NULLs are also
compressed. The best candidate for compression is the most frequently occurring value in each column. The compressed value is
stored in the table header. A bit field in each row header indicates whether the field is compressed and whether or not the value is
NULL.
Fixed width fields that are not part of the primary index are candidates for Teradata compression. The following data types are
compressible. The native number of bytes used for each data type is indicated in parentheses.
- Integer Date (4)
- CHAR (N, where N < 256)
- BYTEINT (1)
- SMALLINT (2)
- INTEGER (4)
- FLOAT/REAL (8)
- DOUBLE (8)
- DECIMAL (1, 2, 4 or 8)
- BYTE (N, where N < 256)
When a column has a frequently occurring value it can be highly compressed. Some examples include the following:
- NULLs
- Zeros
- Default values
- Flags
- Spaces
- Binary indicators (e.g., T/F)
Teradata compression is completely transparent to applications,
ETL, queries and views. Compression is easily specified when tables are created or
columns are added to an existing table. For example, here is the syntax for compressing
a frequently occurring city value:
CREATE TABLE Properties (
Address VARCHAR(40),
City CHAR(20) COMPRESS ('Chicago'),
StateCode CHAR(2)
);
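Compression can likewise be specified for a column added to an existing table. A minimal sketch, assuming a hypothetical ZipCode column and a hypothetical most-common value (check the ALTER TABLE options in your release's DDL reference):
ALTER TABLE Properties
ADD ZipCode CHAR(5) COMPRESS ('60601');  -- compress the most frequently occurring zip code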
Compression and the Optimizer
Teradata utilizes a leading-edge, cost-based optimizer. The optimizer evaluates the relative cost of many potential plans and picks
a low-cost plan. One of the costs considered is the number of estimated I/O operations needed to execute a plan. Tables using
Teradata compression have fewer physical data blocks and therefore need fewer physical I/O operations to accomplish the same
tasks. Hence, the plan chosen by the optimizer for a query operating on compressed tables will naturally take
advantage of compression.
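This behavior can be observed with the EXPLAIN modifier, which reports the plan the optimizer chose along with its estimated row counts and costs. A minimal sketch against the Properties table defined earlier (the exact wording of EXPLAIN output varies by release):
EXPLAIN
SELECT City, COUNT(*)
FROM Properties
GROUP BY City;
On a compressed table, the step estimates in the output reflect the smaller number of physical data blocks.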
How Much Compression?
The amount of space savings for a particular column is determined by the percentage of values that can be compressed and the
column data type.
As discussed earlier, compression is implemented by including a bit field in every row header. There is a tradeoff between the
capacity cost of adding the bit field and the savings from compression. The following table gives the break-even point for different
field widths. For example, if a column has an integer data type with a field width of 4 bytes, and the most common value (and
NULLs) occurs more than 3.13% of the time, then compression will reduce the storage needed for that column.
Field Width (Bytes)    Break-Even Frequency of Occurrence
1                      12.50%
2                      6.25%
3                      4.17%
4                      3.13%
5                      2.50%
6                      2.08%
7                      1.79%
8                      1.56%
> 12                   < 1%
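These break-even points follow from a simple cost model (our own sketch of the arithmetic, consistent with the table): compressing a column adds one presence bit, or 1/8 byte, to every row header, while saving W bytes in each row that holds the compressed value. If f is the frequency of the most common value (and NULLs), compression pays off when f × W > 1/8, that is, when f > 1/(8 × W). For a 4-byte integer this gives f > 1/32, or 3.13%, matching the table.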
The break-even trendline makes the point clearly: for any field larger than two bytes, it takes less than a 5% frequency of
occurrence to make compression a winner.
The next question is how much compression will be achieved. The following table shows example compressibility for fields of 1 and
4 bytes. "Percent compression" is the savings of the compressed size relative to the uncompressed size.
Field Width (Bytes)    Frequency of Single Value (and NULL)    Percent Compression
1                      20%                                     7%
1                      50%                                     37%
1                      80%                                     67%
4                      20%                                     17%
4                      50%                                     47%
4                      80%                                     77%
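The same model predicts these figures (again, our own sketch of the arithmetic): the fractional savings for a column is approximately f − 1/(8 × W), where f is the frequency of the compressed value and W is the field width in bytes. For example, a 4-byte field with f = 50% saves about 0.50 − 1/32 ≈ 47%, as shown in the table.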
Note that the compression savings are for a particular column. Overall system savings depend on the number of columns that
are compressed and the individual savings from each column.
Compression Success Stories
A number of Teradata customers have had huge wins with compression as described in the following customer experiences.
Retail Industry Successes
- 42% capacity savings on four tables, each with over 1 billion rows. The average rows per data block increased by
72%. The biggest capacity savings came from compressing decimal zeros.
- 58% capacity savings. Table scan and archive times were reduced proportionately. Large capacity savings on single
byte flag fields.
Financial Industry Successes
- 50% capacity savings. Response time improved by 30%.
- 40% capacity savings. Multiload elapsed time decreased by 30%.
Communications Industry Successes
- 50% capacity savings compressing spaces and zeros on a financial application
- 43% capacity savings (over a terabyte) on a table that was initially 2.8 terabytes
- 33% compression on a 50 million row table with over 100 columns compressed. On a query mix, I/O was reduced by 31%
and CPU by 7%.
- 10% average capacity savings - range of 1% to 37% capacity savings on different tables.
Capacity savings and application performance are not the only winners with Teradata compression. Data maintenance
activities, such as loads, integrity checks, statistics collection, and backups take less time when the data is compressed. Reduced
time and resources for data maintenance yields additional benefit to the end-users in terms of system and application availability
and performance.
Storing Data Efficiently
Teradata compression is one of several techniques available for storing data efficiently. Storage for decimal and integer data
types is a function of the range of values to be accommodated in the field. You can achieve a double win in space savings by using
Teradata compression in conjunction with an efficient decimal or integer size. An example would be to define the decimal range such
that a 4-byte representation is used instead of an 8-byte representation. Then compression can be applied to the most frequently
occurring value.
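As a sketch of this double win (the Charges table and its columns are hypothetical, as is the assumption that zero is the most common amount):
CREATE TABLE Charges (
ChargeId INTEGER,
ChargeDate DATE,
Amount DECIMAL(9,2) COMPRESS (0.00)  -- 9 digits fit the 4-byte decimal representation
);
Here DECIMAL(9,2) takes 4 bytes per row, where a DECIMAL(18,2) definition would take 8, and compressing the frequent zero amount saves most of the remainder.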
For character-based data, an alternative to Teradata compression is the VARCHAR(N) data type. The number of bytes used to store
each field is the length of the data item plus 2 bytes. Contrast this with a fixed-width CHAR(N) data type, which takes N bytes per
row regardless of the actual number of characters in each individual field. Combining Teradata compression with a fixed-width
character data type can be a very effective space saver.
The data demographics will determine whether variable length character data type or fixed width plus compression is more efficient.
The most important factors are:
- the maximum field length
- the average field length
- the compressibility of the data
Which is best - compression or VARCHAR?
- VARCHAR will be more efficient when the difference between maximum and average field length is large and compressibility is
low.
- Compression and fixed CHAR will be more efficient when the difference between maximum and average field length is small and
compressibility is high.
- When neither is a clear winner, use VARCHAR, since it consumes slightly less CPU resource.
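To make the tradeoff concrete, consider a hypothetical 20-character city column where the average value is 8 characters long and 'Chicago' appears in 40% of the rows. VARCHAR(20) costs about 8 + 2 = 10 bytes per row. CHAR(20) with COMPRESS ('Chicago') costs about (1 − 0.40) × 20 = 12 bytes per row plus the presence bit, so VARCHAR wins. If 'Chicago' instead appeared in 80% of the rows, the compressed fixed-width column would cost about 4 bytes per row and win easily.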
Capacity Planning for a New System with Compression
How does the use of Teradata compression affect the capacity planning for a new Teradata system? We must consider both compute and
storage resources to answer this question.
The compute savings from compression are modest, since the database must still execute the application analytics on the logical
data. Therefore, when considering the use of Teradata compression for a new system, the number of compute nodes required for the
workload will be the same as for a system without compression.
Compression will reduce the amount of physical storage capacity required for a given amount of logical data. The savings can be
estimated by examining sample data. Teradata compression will result in more cost-effective storage for the warehouse.
Using Reclaimed Capacity on an Existing System
On an existing system, the introduction of compression is unlikely to free up a significant amount of compute resource. On systems
where CPU resource is available, the space savings from compression may allow more applications to be introduced prior to the next
system upgrade. If CPU is fully utilized then we must look for alternative ways to use reclaimed capacity.
One quick and easy way to use this capacity is to keep more infrequently accessed history data online.
Another way to use the capacity is to improve system availability with the use of fallback for all or some tables.
Alternately, the excess capacity can be leveraged to improve performance through the use of advanced index techniques such as:
- Join index
- Aggregate join index
- Covering index
- Value-ordered index
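For example, an aggregate join index can precompute a frequently requested summary so that queries read a small summary structure instead of scanning the base table. A minimal sketch, assuming hypothetical SalesFact and Store tables (join indexes carry restrictions that are spelled out in the Teradata SQL reference):
CREATE JOIN INDEX SalesByRegion AS
SELECT s.Region, SUM(f.SalesAmount) AS TotalSales  -- precomputed join and aggregate
FROM SalesFact f, Store s
WHERE f.StoreId = s.StoreId
GROUP BY s.Region;
The optimizer can then satisfy matching aggregate queries from SalesByRegion without touching SalesFact.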
Teradata is a strong advocate of normalized database design. However, in some situations, augmenting the normalized
base tables with denormalized extensions such as the following may be an effective way to improve performance:
- Pre-joined summary tables
- Pre-aggregated summary tables
- Duplicated vertical or horizontal partitions of the base table
Keep in mind that these de-normalizations can carry significant data maintenance costs. They will only be effective when the
savings from a high query volume more than compensate for the additional data maintenance.
Put Everything in the Data Warehouse and Keep it Forever
Teradata compression delivers a significant storage savings, while simultaneously improving data warehouse performance, especially
for scan-oriented queries. Compression is fully supported by Teradata's cost-based optimizer. The space saved by compression can
be used to reduce storage cost, to keep more online history, to further enhance performance with advanced indexing, or to
improve availability with increased use of fallback. Teradata compression is one of many technologies that will enable customers to
put everything in the data warehouse and keep it forever.