A Low Power 2-D DCT Chip Design Using Direct 2-D Algorithm

Liang-Gee Chen, Juing-Ying Jiu, Hao-Chieh Chang, Yung-Pin Lee, and Chung-Wei Ku

DSP/IC Design Lab, Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan, R.O.C.
Tel: 886-2-363-5251 ext 332
Fax: 886-2-363-8247
e-mail: {lgchen, howard}@video.ee.ntu.edu.tw

Abstract—In this paper, a low power 8×8 2-D DCT architecture based on direct 2-D approach is proposed. The direct 2-D consideration reduces computational complexity. According to this algorithm, a parallel distributed arithmetic (DA) architecture at reduced supply voltage is derived. In the real circuit implementation of the chip, a hybrid-architecture adder of low power consumption is designed, as well as a powersaving ROM and a low voltage two-port SRAM with sequential access. The resultant 2-D DCT chip is realized by 0.6 μm single-poly double-metal technology. Critical path simulation indicates a maximum input rate of 133MHz, and it consumes 138mW at 100MHz.

I. INTRODUCTION

The Discrete Cosine Transform (DCT), among various transforms, is the most popular and effective one in image and video compression, such as JPEG, MPEG, H.261, and H.263. Since these standards recently apply to battery operated systems like portable computers (Notebook), personal digital assistants (PDA) and wireless communication equipments, it becomes imperative to develop low power DCT chip as one component of these energy-crucial desktops.

Since DCT has been standardized in recent years, many researchers and companies have took lots of resources to implement it. The conventional row-column approach has the advantage of regularity for VLSI implementation, which causes most 2-D DCT chips to be designed in this way. However, the computational complexity of the row-column approach is more than that of the direct method. And low computational amount is considered mainly in low power algorithm level. Although the direct method incurs the irregularity in realizing 2-D DCT chips, the feature of low computational complexity is still attractive for low power DCT chip design. This fact motivates our research for fewer computations and regular 2-D DCT architecture for real chip implementation with the direct method.

As to low power DCT design, T. Kuroda et al.[3] proposed a 0.9V, 150MHz, 10mW 2-D DCT with variable threshold-voltage scheme implemented by 0.3μm CMOS triple-well technology. However, the chip achieved low power by only taking the circuit and device level into account, not including algorithm level consideration. Therefore, we propose a 2-D DCT chip incorporating low power considerations in algorithm, architecture, and circuit design levels.

The paper is organized as follows. In Section II, the direct 2-D DCT algorithm is briefly discussed. The architecture exploiting this algorithm is described in Section III. In Section IV, the main circuit module designs, including adders and memories, are presented. The core characteristics are shown in Section V. Finally, a conclusion is given in Section VI.

II. THE DIRECT 2-D DCT ALGORITHM

The 2-D DCT of an $N \times N$ real signal $x_{n_1,n_2}$, with kernel factor $2c(n_1)c(n_2)/N$ neglected, is defined as

\[
Y_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} x_{n_1,n_2} \cos \left( \frac{2\pi(2n_1+1)k_1}{4N} \right) \cos \left( \frac{2\pi(2n_2+1)k_2}{4N} \right)
\]

(1)

\[n_1, n_2, k_1, k_2 = 0, 1, \ldots, N - 1,\]

In the following, assume that $N$ is to be a power of 2. Using the permutation, signal $x_{n_1,n_2}$ can be permuted as:

\[
y_{n_1,n_2} = \begin{cases} x_{2n_1,2n_2} & n_1 = 0, \ldots, N/2 - 1, n_2 = 0, \ldots, N/2 - 1 \\ x_{2N-2n_1-1,2n_2} & n_1 = N/2, \ldots, N - 1, n_2 = 0, \ldots, N/2 - 1 \\ x_{2n_1,2N-2n_2-1} & n_1 = 0, \ldots, N/2 - 1, n_2 = N/2, \ldots, N - 1 \\ x_{2N-2n_1-1,2N-2n_2-1} & n_1 = N/2, \ldots, N - 1, n_2 = N/2, \ldots, N - 1 \end{cases}
\]
Thus, $Y_{k_1,k_2}$ can be rewritten as:

$$Y_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} y_{n_1,n_2} \cos \left[ \frac{2\pi (4n_1 + 1)k_1}{4N} \right] \cos \left[ \frac{2\pi (4n_2 + 1)k_2}{4N} \right]$$

(2)

Now consider the following expression:

$$U_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} y_{n_1,n_2} W_{4N}^{(4n_1+1)k_1+(4n_2+1)k_2},$$

where $W_{4N} = \exp(-j2\pi/N)$.  

(3)

It is not difficult to find that $Y_{k_1,k_2}$ can be computed from $U_{k_1,k_2}$ by the following set of expressions:

$$Y_{k_1,k_2} = \frac{1}{2} \left[ \text{Re}(U_{k_1,k_2}) - \text{Im}(U_{N-k_1,k_2}) \right],$$

$$Y_{k_1,k_2} = \frac{1}{2} \left[ -\text{Im}(U_{k_1,k_2}) - \text{Re}(U_{N-k_1,k_2}) \right].$$

(4)

Note that (4) requires $U_{k_1,k_2}$ in (3) to be computed for all $k_1$ and only a sufficient subset of $k_2$ such that $\{k_2, N-k_2\}$ covers all possible values of $k_2$.

By the following relation [4]

$$4n_2 + 1 = (4t+1)(4n_1+1) \mod 4N,$$

(5)

where $0 \leq t, n_1, n_2 \leq N - 1$, the signal $y_{n_1,n_2}$ is mapped as $y_{nt}$. If $n_1$ is fixed, the mapping from $n_2$ to $t$ is one-to-one. However, with different $n_1$, the mapping order is not the same.

By substituting (5) into (3), (3) can be rewritten as:

$$U_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{t=0}^{N-1} y_{nt} W_{4N}^{(4n_1+1)k_1+(4t+1)k_2}$$

(6a)

$$= \sum_{t=0}^{N-1} \left[ \sum_{n_1=0}^{N-1} y_{nt} W_{4N}^{(4n_1+1)k_1+(4t+1)k_2} \right]$$

(6b)

$$= \sum_{t=0}^{N-1} (-j)^t \sum_{n_1=0}^{N-1} y_{nt} W_{4N}^{(4n_1+1)k_1}.$$

(6c)

In the above deduction, we let $k_1 + (4t+1)k_2 = aN + b$, where $a \in \text{integer}$ and $0 \leq b \leq N - 1$. Let the computation of $n_1$'s summation be represented by $U_{t,b}$. Then we can find:

$$U_{t,b} = \sum_{n_1=0}^{N-1} y_{nt} W_{4N}^{(4n_1+1)b}, \quad 0 \leq b \leq N - 1$$

$$= \sum_{n_1=0}^{N-1} y_{nt} \left[ \cos \frac{2\pi (4n_1 + 1)b}{4N} - j \sin \frac{2\pi (4n_1 + 1)(N-b)}{4N} \right].$$

(6d)

Although $U_{t,b}$ is a complex number, its real part is indeed an $N$-point 1-D DCT, and its imaginary part can be obtained by $\text{Im}(U_{t,0}) = 0$, and

$$\text{Im}(U_{t,b}) = -\text{Re}(U_{t,N-b}), \quad 1 \leq b \leq N - 1.$$

This reveals that $U_{t,b}$ can be achieved by calculating $N$-point 1-D DCT. Since multiplying $(-j)^b$ does not need any multiplication, but only affects the addition, the $N \times N$ 2-D DCT can therefore be realized by $N \times N$-point 1-D DCT's with some additions. Besides, comparing with the row-column method which needs $2N \times N$-point 1-D DCT's to perform an $N \times N$ 2-D DCT, this approach with less operation complexity is more suitable for low power considerations in the algorithm level.

III. LOW POWER 2-D DCT ARCHITECTURE

Since the direct 2-D DCT algorithm discussed above reduces the computation complexity, it is obvious that an architecture based on it shall lead to the goal of low power. By reviewing (4) and (6c), the computation from the input $x_{n_1,n_2}$ to the output $Y_{n_1,n_2}$ is shown in the following.

In our computation, first comes the data mapping from $x_{n_1,n_2}$ to $y_{nt}$. Then, $U_{t,b}$ is obtained by calculating the 1-D complex DCT of $y_{nt}$. Before the final output $Y_{k_1,k_2}$ is finished by (4), $U_{k_1,k_2}$ is computed by the summation with respect to $t$ depicted in (6c).

After the direct 2-D DCT algorithm is discussed, it is time to depict the whole low power 2-D DCT chip architecture shown in Figure 1. Since the DCT input and output is ranging from -255 to 255 and -2040 to 2040, respectively, the wordlength of the input data is 9-bit and that of the output data is 12-bit. However, for convenience, the kernel factor $2c(n_1)c(n_2)/N$ is neglected in deducing the direct 2-D DCT method. Therefore, the wordlength of the output data turns out to be 16-bit for covering all the output range. Besides, since the 1-D DCT computation is implemented with DA method, two-port SRAMs operating in ping-pong mode are employed for re-ordering the input and output data. Hence, 9-bit input data are fed word-serially and through the input SRAM, the data are converted into 64 bit-serial data for 2-D DCT. After these data are processed, the output SRAM changes.
the 64 word-parallel data to 16 bit-parallel data for next stage, usually zig-zag scan.

In order to deduce the architecture which processes the data from $U_{4, 16}$ to $U_{4, 16, 2}$, it is found after a deep investigation that each $U_{4, 16, 2}$ can be gained through a butterfly-like adder/subtractor network, which is constructed by re-arranging the value $a$ in summation of $(-j)^a$ shown in (6c).

Take the network deduction of $U_{4, 16, 2}$ where $k_1 = 3, 4, 5, 6; k_2 = 1, 5$ for example. Figure 2 shows the deduction process. In computing $U_{4, 16, 1}$ shown in (6c), as the summation index $t$ goes from 0 to 7, the exponential value $a$ of $(-j)^a$ changes to 0, 1, 2, 3, 4, 5, 0 in sequence. Since adding 1 to $a$ means a rotation by 90 degree, the original $a$ values can be divided into several segments for seeking some general rules. Therefore, as shown in the left side of Figure 2, the $a$ values for computing $U_{4, 16, 1}$ are divided into 3 segments. Note that the sum of each row in the dashed box is either equal to the original $a$ value, or led/trailed by 360 degree. Besides, the three segments of $a$ value mean that there will be an adder tree of three-stage to complete the simulation with index $t$ from 0 to 7. After analyzing the three segments of $a$ values, it is not difficult to find that the summation with respect to $t$ can be also separated into two groups: one is the even values of $t$, and the other is the odd values of $t$.

According to the segmentation of $a$ value and the grouping of $t$, the final network configuration for calculating $U_{4, 16, 1}$ for $k_1=3$ to 6 is shown in Figure 3. The dash-line represents the subtrahend in the subtraction. Moreover, the arrangement of the $a$ value for $U_{4, 4, 5, 6, 5}$ network deduction is also shown in Figure 2 as another example. Furthermore, the final architecture of $U_{4, 4, 5, 6, 1}$ combined with that of $U_{3, 4, 5, 6, 5}$ is shown in Figure 3, too.

Just as the depiction of (4), there is no necessity for processing the whole set $k_1, k_2$. In order to complete all $Y_{k_1, k_2}$, it takes efforts to compute $U_{k_1, k_2}$ for all $k_1$ and only a subset of $k_2$, where $k_2$ is 0, 1, 2, 3, 5 in our design.

So far, the butterfly-like adder/subtractor tree for computing $U_{k_1, k_2}$ are presented. Since our goal is to design a low power 2-D DCT chip at reduced supply voltage, a parallel architecture is needed to compensate for the speed loss due to lowering operating voltage. Hence, combining the butterfly-like adder/subtractor tree and the 1-D DCT computation with DA method in parallel form realizes the core of the 2-D DCT chip shown in Figure 4. After that, $U_{4, 16, 2}$ with $k_1 = 0, 1, 2, 4, 5$ are obtained and all $Y_{k_1, k_2}$ are solved with these values by another adder/subtractor stage according to (4).

IV. Chip Implementation

The proposed low power 2-D DCT chip consists of mainly adders, memories and registers. Thus, reducing the power consumption in these components will make more contribution to achieve low power.

A. Adder Design

The adder is used as the accumulator in calculating the 1-D DCT result. Since the adder is also operating at low voltage, the parallelism is employed in order to compensate for the speed loss. First, the adder adopts the square-root carry-select structure shown in Figure 5 for its propagation delay is proportional to $\sqrt{N}$. After dividing the larger adder into several stages, the stages are implemented with Manchester adder for its improvement on the carry-lookahead by using a single gate for generating carry $C_i$. Therefore, a large-bit adder is formed by combining the square-root carry-select adder in architecture and Manchester adder in stage circuits. This adder has two characteristics inherited from the two adders mentioned above: carry-select for high speed and Manchester for low power.

B. Power-Saving ROM

Since the 1-D DCT in our chip is implemented by DA method, the ROM is needed to hold the content of the look-up table which is pre-computed. In order to eliminate the static power consumption due to the DC path existing in static pseudo-nMOS ROM, a better approach is to use precharged logic. The ROM decoder and data circuits are shown in Figure 6. An address transition detection (ATD) circuit is employed to generate the precharge.
signal \( \text{pre} \), which is activated only when the input addresses change. The ROM decoder and data circuits are shown in Figure 6. During the precharge phase, \( \text{pre} = 0 \) and the bit-lines are precharged to \( V_{DD} \). Meanwhile, the AND gates in decoder ensure that all pull-down paths through the NMOS are off during precharging. In the evaluation phase, \( \text{pre} = 1 \) and if the word-line is activated high, the bit-line is discharged. For the PMOS and NMOS are not turned on simultaneously during precharging or evaluation phase, there is no DC path from \( V_{DD} \) to \( \text{GND} \), and thus, no static DC power dissipation.

C. Low-Voltage Two-Port SRAM

Since the proposed 2-D DCT is implemented with DA parallel architecture, the data reordering is needed for bit-serial word-parallel data operation. Thus, the two-port SRAM shown in Figure 7 is used for data mapping and data reordering. Note that the input port size \( n \) is different from the output port size \( m \). While the two-port SRAM \( (n = 9, m = 64) \) is for the input ping-pong mode, the \( (n = 64, m = 16) \) two-port SRAM is for the output ping-pong mode. The sense amplifier consists of a cross-coupled pair of PMOS transistors and NMOS input devices. This differential pair applies the positive feedback to accelerate the sense speed.

D. The Register

Since the single clocking strategy is adopted in our design, the TSPC DFF is used for simplicity and its low transistor-count. After the original TSPC DFF[5] structure is re-examined, the pull-up network (PUN) and the pull-down network (PDN) are recognized from the first stage of this structure. They are shown in Figure 8(a). Hence the logic function can be put into the PUN- and PDN-network complementarily. For example, assume a NAND gate is to be combined with the TSPC DFF. The PUN consists of two parallel PMOS transistors and the PDN is composed of two series NMOS transistors. Thus, the TSPC DFF including NAND logic is shown as Figure 8(b) and the additional input is used for reset signal \( \text{RST} \). These modified registers can be used to design the
TABLE I
Chip Characteristics

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Internal Wordlength</td>
<td>16 bits</td>
</tr>
<tr>
<td>Technology</td>
<td>0.6µm CMOS SPDM</td>
</tr>
<tr>
<td>No. of Transistors</td>
<td>153017</td>
</tr>
<tr>
<td>Core Size</td>
<td>7.85mm x 8.45mm</td>
</tr>
<tr>
<td>Die Size</td>
<td>8.98mm x 7.79mm</td>
</tr>
<tr>
<td>Clock Rate</td>
<td>100 MHz</td>
</tr>
<tr>
<td>Latency</td>
<td>198 cycles</td>
</tr>
<tr>
<td>Block size</td>
<td>8 x 8</td>
</tr>
<tr>
<td>Supply Voltage</td>
<td>2.0 V</td>
</tr>
<tr>
<td>Power</td>
<td>138 mW</td>
</tr>
</tbody>
</table>

sequential addresser in SRAM and counters for their logic function. And they have better performance in power and speed than that of static CMOS DFFs for less transistor-count.

V. Chip Performance and Specifications

By incorporating the module circuits discussed above, the proposed low power 2-D DCT chip with direct method is implemented. Figure 9 shows the photomicrograph of the whole system chip. The core characteristics are summarized in Table I.

Besides, in order to understand more details about the power distribution in the designed chip, a power simulation at 100 MHz by components is shown in Table II. From this table, it is obvious that registers consume most power than others do. Then, excluding the clock buffers, the first runner up is memory modules. Hence, reducing the power consumption of registers and memories will contribute more to achieve the proposed chip. That is the reason why we design low power components such as registers, memories and adders.

Since the DCT is applied to portable applications recently, the power consumption becomes a critical point in designing a 2-D DCT chip. The implementation in [1] and the product presented in [2] are not dedicated to low power design. Thus, they consumes larger power. The chip reported by [3] which utilized variable threshold-voltage scheme by controlling back-bias voltage and better technology achieved a 10mW 2-D DCT core processor. The main features of these chip implementation are summarized in Table III.

Although the chip presented by [3] consumes low power, its implementation lacks the low power consideration in algorithm level. Our chip is design by taking the low power algorithm, architecture, and circuits into consideration. The ideas in both chips do not conflict. Hence, combining the low power algorithm and architecture in our chip and the variable threshold-voltage scheme in [3] will lead to a 2-D DCT chip with lower power dissipation than both two chips.

VI. Conclusion

A low-power high-performance 2-D DCT chip is implemented. The design features that contributes most to
TABLE III
Processor Comparison

<table>
<thead>
<tr>
<th>Authors</th>
<th>Tech.</th>
<th>Core area</th>
<th>Trans.</th>
<th>Voltage</th>
<th>Clock rate</th>
<th>Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>D. Slawecki et al. [1]</td>
<td>2μm</td>
<td>72.68 mm²</td>
<td>67929</td>
<td>5 V</td>
<td>60 MHz</td>
<td>1 W</td>
</tr>
<tr>
<td>SGS-THOMSON [2]</td>
<td>—</td>
<td>—</td>
<td>—</td>
<td>5 V</td>
<td>20 MHz</td>
<td>1.5 W</td>
</tr>
<tr>
<td>T. Kuroda et al. [3]</td>
<td>0.3μm</td>
<td>4 mm²</td>
<td>120000</td>
<td>0.9 V</td>
<td>150 MHz</td>
<td>10 mW</td>
</tr>
<tr>
<td>Our Chip</td>
<td>0.6μm</td>
<td>50.6 mm²</td>
<td>152017</td>
<td>2 V</td>
<td>100 MHz</td>
<td>138mW</td>
</tr>
</tbody>
</table>

TABLE II
Simulated Power Dissipation by Components

<table>
<thead>
<tr>
<th>Module</th>
<th>Counts</th>
<th>Power (mW)</th>
<th>Percentage (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Registers</td>
<td>2923</td>
<td>35.38</td>
<td>25.64%</td>
</tr>
<tr>
<td>Clock buffers</td>
<td>1</td>
<td>29.35</td>
<td>21.27%</td>
</tr>
<tr>
<td>SRAM32 × 16</td>
<td>4</td>
<td>21.76</td>
<td>15.77%</td>
</tr>
<tr>
<td>ROM</td>
<td>64</td>
<td>17.51</td>
<td>12.69%</td>
</tr>
<tr>
<td>13-bit adder</td>
<td>64</td>
<td>15.48</td>
<td>11.22%</td>
</tr>
<tr>
<td>SRAM64 × 9</td>
<td>2</td>
<td>11.08</td>
<td>8.03%</td>
</tr>
<tr>
<td>One-bit ALU</td>
<td>320</td>
<td>5.81</td>
<td>4.21%</td>
</tr>
<tr>
<td>Controller</td>
<td>1</td>
<td>1.27</td>
<td>&lt;0.02%</td>
</tr>
</tbody>
</table>

Fig. 9. The photomicrograph of the whole chip including I/O pads

this result are as follows. First, the usage of the direct 2-D DCT algorithm reduces the 2-D DCT into 1-D DCT and some additions. Also, a fast algorithm of 1-D DCT is employed. Both of these decrease the computational complexity which means low power consumption per block operation. Besides, a parallel distributed arithmetic (DA) architecture with the direct 2-D DCT approach is proposed in order to compensate the speed loss due to the reduced internal supply voltage.

In addition to the considerations in algorithm and architecture level, low power design methodologies in logic-style and circuit level are applied to the real circuit implementation of the proposed 2-D DCT. Since adders, memories and registers are the main modules of the proposed DCT design, a power-saving in these circuits contribute to the goal significantly.

Finally, the proposed low power 2-D DCT chip with direct method is implemented. The maximum frequency simulated of the chip is 133MHz at last. It meets the requirement of the real-time HDTV signal processing for the chrominance format 4:2:0 and 4:2:2. The power simulated is 138mW at 100MHz by 0.6 μm single-poly double-metal technology.

REFERENCES


