Cost-effective Low-Power Architectures of Video Coding Systems

Jie Chen* and K. J. Ray Liu

*Bell Laboratories, Lucent Technologies
Murray Hill, NJ 07974, USA

Department of Electrical Engineering and Institute for Systems Research
University of Maryland, College Park, MD 20742, USA

Abstract

A new low-power design technique, multirate, has been used along with other methods such as look-ahead and pipelining to design cost-effective low-power architectures for video coding systems. We demonstrate that both low power and high speed can be achieved at the algorithm/architecture level. Based on the calculation and simulation results, the design can achieve a significant power saving in the range of 60% – 80%, or a speedup factor of two, according to the needs of users.

I. Introduction

In recent years, the need for personal mobile communications – “anytime, anywhere” access to multimedia and communication services – has become increasingly clear. Digital cellular telephony, such as the U.S. third generation code-division multiple access PCS and the European GSM systems, has seen rapid acceptance and growth in the marketplace. Because of the limited power-supply capability of current battery technology, low-power design that prolongs the operating time of mobile handsets is vital to success. However, the development of low-power video coding systems is still in its infancy. In this paper, we focus on the combined low-power design of the DCT and motion estimation units, which serve as the computing engine of a video coding system. Current low-power video coding systems achieve their savings at the device/process level, such as the low-power video coder design in [1], which uses 0.5 μm VLSI fabrication technology. However, such approaches are the most expensive of all low-power techniques, which range from the system and algorithm/architecture levels down to the circuit/logic and device/process levels [2].

In this paper, we extend our video coding architectures in [3], [4] to low-power applications. Our low-power design is achieved at the algorithm/architecture level, which offers the most leverage for reducing power consumption when both effectiveness and cost are taken into consideration. In principle, algorithm/architecture-level low-power design works by reformulating the algorithms and mapping them to efficient low-power VLSI architectures that compensate for the speed loss caused by a lowered supply voltage. As a result, we trade silicon area for power consumption under the current technology, without requiring expensive new devices or advanced VLSI fabrication technology. Compared with other low-power techniques, our algorithmic/architectural approach is one of the most cost-effective ways to save power.

Unlike the conventional video coder design in the MPEG standards, motion estimation in our low-power video coder is performed in the DCT domain instead of the spatial domain. As a result, we can naturally combine the DCT and motion estimation processors into one processing unit, which saves considerable silicon area and also enables the combined low-power design. In addition, all the advantages mentioned in [3], [4], i.e., high throughput, numerical stability, and a multiplier-free, modular, and solely locally connected structure, are inherited by our low-power design. Furthermore, our design can meet both low-power and high-speed requirements, which are often considered to be opposing goals, according to the needs of users. Based on the calculation and simulation results, the proposed design can be readily applied to high-speed video communication with a speedup factor of two under the normal supply voltage (5V). Alternatively, the same design can be operated at half the operating frequency under a lowered supply voltage (3.08V) while retaining

![Diagram](image)

Figure 1: Low-power architecture for video coding.

*This work is supported in part by the NSF NYI award MIP9457397 and the ONR grant N00014-93-15566.
0-7803-5471-0/99/$10.00 ©1999 IEEE

the original data throughput rate. This feature enables us to achieve a significant power saving in the range of 60% – 80% without sacrificing system performance (details are given in Section II).

The proposed low-power design has a fully pipelined parallel architecture, as shown in Fig. 1. A new low-power design technique, multirate [5], [6], has been used along with other methods such as look-ahead and pipelining to achieve low-power/high-speed performance. In what follows, we explain the architecture of each building block in Fig. 1 in detail. We then present the simulation results in Section III to demonstrate the performance of our design. Finally, the paper is concluded in Section IV.

II. Low-power/High-speed Architectures

A. Two-stage Look-ahead Type-II DCT/IDCT Coder:

Unlike conventional DCT encoder designs based on matrix factorization, we adopt the time-recursive DCT [7], [8], which simultaneously generates the type-II DCT and DST coefficients $X^t_1(k)$ and $X^t_2(k)$ needed by the pseudo-phase computation module discussed later. Because of the inherent time-recursive characteristic, we use the look-ahead method to reduce power consumption. In principle, the speed-up provided by look-ahead compensates for the speed loss caused by the reduced supply voltage, at the cost of increased hardware complexity.

The two-stage look-ahead time-recursive updating of the DCT and DST coefficients is given by:

$$
\begin{align*}
\begin{bmatrix}
X^{t+2}_1(k) \\
X^{t+2}_2(k)
\end{bmatrix} &= \begin{bmatrix}
\cos\frac{2\pi k}{N} & \sin\frac{2\pi k}{N} \\
-\sin\frac{2\pi k}{N} & \cos\frac{2\pi k}{N}
\end{bmatrix}
\left(
\begin{bmatrix}
X^t_1(k) \\
X^t_2(k)
\end{bmatrix}
+ \frac{2}{N} C(k)
\begin{bmatrix}
\delta_t(k)\cos\frac{\pi k}{2N} + \delta_{t+1}(k)\cos\frac{3\pi k}{2N} \\
\delta_t(k)\sin\frac{\pi k}{2N} + \delta_{t+1}(k)\sin\frac{3\pi k}{2N}
\end{bmatrix}
\right) \tag{1}
\end{align*}
$$

where $t$ is the time index, $\delta_t(k) = (-1)^k x(t+N) - x(t)$ combines the input sample entering the length-$N$ window with the sample leaving it, and $X^t_1(k)$ and $X^t_2(k)$ are defined as:

$$
\begin{align*}
X^t_1(k) &= \frac{2}{N} C(k) \sum_{n=t}^{t+N-1} x(n) \cos \left[\frac{\pi k}{N} \left(n - t + \frac{1}{2}\right) \right], \quad k \in \{0, 1, \ldots, N-1\}, \\
X^t_2(k) &= \frac{2}{N} C(k) \sum_{n=t}^{t+N-1} x(n) \sin \left[\frac{\pi k}{N} \left(n - t + \frac{1}{2}\right) \right], \quad k \in \{1, 2, \ldots, N\},
\end{align*}
$$

where $C(k) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{for } k = 0 \text{ or } N, \\ 1, & \text{otherwise.} \end{cases}$
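The two-stage look-ahead idea can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the standard windowed type-II DCT/DST with scale factor $\frac{2}{N}C(k)$ (an assumed normalization), and the names `scale`, `direct`, `look_ahead_step`, and `max_error` are hypothetical helpers, not part of the original design. It advances the coefficient pair by two time steps with a single rotation by $2\pi k/N$ and compares the result with direct evaluation of the sums.

```python
import math


def scale(k, n):
    """C(k): 1/sqrt(2) for k = 0 or N, otherwise 1 (assumed convention)."""
    return 1.0 / math.sqrt(2.0) if k in (0, n) else 1.0


def direct(x, t, k, n):
    """DCT/DST pair of the window x[t], ..., x[t+N-1], evaluated from the sums."""
    c = 2.0 / n * scale(k, n)
    theta = math.pi * k / n
    xc = c * sum(x[t + i] * math.cos(theta * (i + 0.5)) for i in range(n))
    xs = c * sum(x[t + i] * math.sin(theta * (i + 0.5)) for i in range(n))
    return xc, xs


def look_ahead_step(xc, xs, x, t, k, n):
    """Advance the pair from time t to t+2 using a single rotation by 2*pi*k/N."""
    c = 2.0 / n * scale(k, n)
    half = math.pi * k / (2 * n)
    sign = -1.0 if k % 2 else 1.0
    d0 = sign * x[t + n] - x[t]          # signed entering sample minus leaving sample
    d1 = sign * x[t + n + 1] - x[t + 1]  # the same difference one step later
    a = xc + c * (d0 * math.cos(half) + d1 * math.cos(3 * half))
    b = xs + c * (d0 * math.sin(half) + d1 * math.sin(3 * half))
    phi = 2.0 * math.pi * k / n
    return (a * math.cos(phi) + b * math.sin(phi),
            -a * math.sin(phi) + b * math.cos(phi))


def max_error(n=8, steps=5):
    """Largest gap between the recursion and direct evaluation over all k."""
    x = [math.sin(0.7 * i) + 0.1 * i for i in range(n + 2 * steps)]
    worst = 0.0
    for k in range(n + 1):
        xc, xs = direct(x, 0, k, n)
        for s in range(steps):
            xc, xs = look_ahead_step(xc, xs, x, 2 * s, k, n)
        ref_c, ref_s = direct(x, 2 * steps, k, n)
        worst = max(worst, abs(xc - ref_c), abs(xs - ref_s))
    return worst


print(max_error())  # a value at floating-point rounding level is expected
```

Because each step consumes two input samples, the recursive loop only has to complete once per two samples. This is the slack that the look-ahead design can spend either on a faster clock or on a lower supply voltage.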

Both the two-stage look-ahead DCT and its inverse counterpart, the IDCT, follow a computing procedure similar to (1), except for minor differences in the data inputs and rotation angles. To save chip area, we interleave them into a unified structure that contains three CORDIC (COordinate Rotation DIgital Computer [9]) processors, as shown in Fig. 2.
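For readers unfamiliar with CORDIC, the following is a minimal floating-point sketch of the textbook rotation-mode algorithm. It is not the fixed-point processor of Fig. 2; the function name `cordic_rotate` and the iteration count are illustrative assumptions. In hardware, each multiplication by a power of two is a wired shift, and the constant gain is usually folded into a later scaling stage, which is what makes the rotation multiplier-free.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) counter-clockwise by `angle` radians using shift-add steps.

    Converges for |angle| up to about 1.74 rad (the sum of all atan(2^-i)).
    """
    # Constant gain accumulated by the micro-rotations (precomputed in hardware).
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** (-i)  # a right shift by i bits in fixed-point hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)  # table lookup in hardware
    return x / gain, y / gain


# Rotating the unit vector by 0.5 rad approximates (cos 0.5, sin 0.5).
print(cordic_rotate(1.0, 0.0, 0.5))
```

A rotation matrix of the form $[\cos\phi,\ \sin\phi;\ -\sin\phi,\ \cos\phi]$ corresponds to calling this sketch with the angle $-\phi$.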

Clearly, the look-ahead system can be clocked at twice the rate of the original system for high-speed applications. Alternatively, by reducing the supply voltage from 5V to 3.08V, we increase the propagation delay of the look-ahead system until it equals that of the original system. In other words, we achieve a low-power design while keeping the same system throughput. The ratio of the power consumption of the 2-stage look-ahead design, $P_{\text{2-stage}}$, to that of the original design, $P_{\text{orig}}$, can be written as:

$$
\frac{P_{\text{2-stage}}}{P_{\text{orig}}} = \frac{C_{\text{2-stage}} \, (3.08\,\text{V})^2 \, \frac{f}{2}}{C_{\text{orig}} \, (5\,\text{V})^2 \, f} = \frac{3}{2} \cdot \frac{1}{2} \cdot \frac{(3.08)^2}{5^2} \approx 0.28,
$$

where $f$ is the original operating frequency, and $C_{\text{2-stage}}$ and $C_{\text{orig}}$ are the total switching capacitances of the look-ahead and original implementations, respectively. Provided that the capacitances due to the CORDICs dominate the circuit and are roughly proportional to the number of CORDICs, the 2-stage design yields a 72% power saving without sacrificing system throughput, at the expense of increased output latency and 50% hardware overhead. In essence, we trade silicon area for low power consumption.
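The 72% figure can be reproduced with a few lines of arithmetic. The sketch below assumes the usual dynamic-power model $P = C V^2 f$, a capacitance ratio of 3/2 (the 50% hardware overhead), and a halved operating frequency. The function name `power_ratio` and its parameter names are illustrative, not taken from the original design.

```python
def power_ratio(cap_ratio, v_new, v_old, freq_ratio):
    """P_new / P_old under the dynamic-power model P = C * V^2 * f."""
    return cap_ratio * (v_new / v_old) ** 2 * freq_ratio


# Two-stage look-ahead: 50% more CORDIC capacitance, half the frequency,
# supply lowered from 5 V to 3.08 V.
look_ahead = power_ratio(cap_ratio=3 / 2, v_new=3.08, v_old=5.0, freq_ratio=1 / 2)
print(f"power ratio {look_ahead:.3f}, saving {100 * (1 - look_ahead):.1f}%")
```

Under these assumptions the ratio comes out at roughly 0.28, which agrees with the quoted saving once rounded.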

Because the two-dimensional DCT can be decomposed into two pipelined stages of one-dimensional computation, we adopt the same approach as in [3] to extend our low-power DCT design to two dimensions.

B. Pipelining Design for DCT Coefficients Conversion:

Note that the type-I DCT coefficients, $Z_{t-1}$, required by the pseudo-phase computation in Fig. 1 can be obtained by a plane rotation of their type-II counterparts, $X_{t-1}$, as in [3]. To achieve a high-speed design, we insert flip-flops (D) across the feed-forward cut-set, as shown in Fig. 3. The pipelined design can then run twice as fast as the original design because the critical path has been halved. Alternatively, we can reduce the supply voltage from 5V to 3.08V while maintaining the same system throughput. The ratio of the power consumption of the pipelined design, $P_{\text{pipe}}$, to that of the original design, $P_{\text{rot}}$, is given by:

$$
\frac{P_{\text{pipe}}}{P_{\text{rot}}} \approx \frac{1}{2} \cdot \frac{(3.08\,\text{V})^2}{(5\,\text{V})^2} \approx 0.19,
$$

which leads to an 81% power saving at the cost of increased system latency.
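The text does not state how the 3.08V operating point was chosen. One common first-order model takes the CMOS gate delay to be proportional to $V_{dd}/(V_{dd}-V_t)^2$. The sketch below uses that model, with an assumed threshold voltage $V_t = 0.7$V (a value not given in the paper), to find the supply voltage at which the delay doubles relative to 5V. The function names `gate_delay` and `supply_for_slowdown` are hypothetical.

```python
def gate_delay(v_dd, v_t):
    """First-order CMOS gate delay, up to a constant factor: V / (V - Vt)^2."""
    return v_dd / (v_dd - v_t) ** 2


def supply_for_slowdown(slowdown, v_ref=5.0, v_t=0.7):
    """Bisect for the supply whose delay is `slowdown` (>= 1) times that at v_ref."""
    target = slowdown * gate_delay(v_ref, v_t)
    low, high = v_t + 1e-6, v_ref  # delay falls monotonically as the supply rises
    for _ in range(100):
        mid = 0.5 * (low + high)
        if gate_delay(mid, v_t) > target:
            low = mid   # still too slow at this voltage: raise the lower bound
        else:
            high = mid
    return 0.5 * (low + high)


v_half_speed = supply_for_slowdown(2.0)
print(f"supply for 2x delay: {v_half_speed:.2f} V")
```

With the assumed threshold this model gives about 3.09V, close to the 3.08V used in the text. The exact value depends on the threshold voltage and on the delay model of the target process.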

C. Multirate Design for Pseudo-phase Computation:

Traditionally, the multirate technique has been widely used in sub-band coding [5]. Our interest, on the other hand, is to apply this technique to compensate for the speed loss due to a lowered supply voltage, or simply to speed up the design under normal conditions. For the pseudo-phase computation module in the original design, shown in Fig. 4 (a), the processing rate of the operator has to be as fast as the input data rate. With the multirate low-power design, the pseudo-phases are computed by a reformulated circuit that uses the decimated sequences (M = 2), as shown in Fig. 4 (b). The multirate design thus operates at two different rates. Because the operating frequency of the pseudo-phase computation is reduced to half of the input data rate while the overall throughput rate remains the same, the speed penalty is compensated at the architectural level. As stated previously, we can keep the overall throughput rate while reducing the supply voltage from 5V to 3.08V. The multirate design needs 20 CORDICs, which is twice the number in the original design. The ratio of the power consumption of the multirate design, $P_{\text{multi}}$, to that of the original design, $P_{\text{phase}}$, can be obtained as:

$$\frac{P_{\text{multi}}}{P_{\text{phase}}} = \frac{20 \times (3.08\,\text{V})^2 \times \frac{f}{2}}{10 \times (5\,\text{V})^2 \times f} = \frac{(3.08)^2}{5^2} \approx 0.38.$$

Overall, we can achieve a power saving of 62% or a speed-up factor of two at the cost of doubled hardware complexity.
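The multirate principle can be illustrated on a much simpler operator than the pseudo-phase module. In the sketch below a plain FIR filter is used as a hypothetical stand-in (the reformulated pseudo-phase circuit of Fig. 4 (b) is not reproduced here). The input is split into even and odd decimated sequences (M = 2), four half-length sub-filters process them at half the input rate, and the interleaved result matches the full-rate output. The function names `fir` and `fir_two_branch` are illustrative.

```python
def fir(x, h):
    """Full-rate reference: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_two_branch(x, h):
    """Same output computed from the two decimated input sequences (M = 2)."""
    if len(x) % 2:
        raise ValueError("this sketch assumes an even number of input samples")
    x_even, x_odd = x[0::2], x[1::2]   # decimated inputs, each at half rate
    h_even, h_odd = h[0::2], h[1::2]   # polyphase components of the filter
    ee, oo = fir(x_even, h_even), fir(x_odd, h_odd)
    eo, oe = fir(x_odd, h_even), fir(x_even, h_odd)
    y = []
    for m in range(len(x_even)):
        y.append(ee[m] + (oo[m - 1] if m > 0 else 0.0))  # y[2m]
        y.append(eo[m] + oe[m])                           # y[2m+1]
    return y


x = [float(i * i % 7) for i in range(12)]
h = [1.0, -2.0, 3.0, 0.5, 4.0]
print(fir_two_branch(x, h) == fir(x, h) or "outputs agree up to rounding")
```

The four sub-filters together hold about twice as many taps as the original filter but are clocked at half the rate. This mirrors the trade made in the text: doubled hardware in exchange for a halved operating frequency at the same throughput.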

D. Two-stage Look-ahead Half-pel Motion Estimator:

To obtain motion at half-pel accuracy, we first compute the integer-pel motion vector (m,n) and then use the “half-pel motion estimator” in Fig. 1 to compute the half-pel motion vector. With this approach, we avoid the conventional interpolation procedure [10], because the half-pel motion vector can be determined by considering only the nine positions $u \in \{m - 0.5, m, m + 0.5\}$ and $v \in \{n - 0.5, n, n + 0.5\}$ surrounding the integer-pel motion vector (m,n), as illustrated in the upper right corner of Fig. 5. In other words, the peak position among the nine values of $\hat{DCS}(u,v)$ and $\hat{DSC}(u,v)$,

$$\hat{DCS}(u,v) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} C(k)C(l)f(k,l) \cos \left[\frac{k\pi}{N} \left(u + \frac{1}{2}\right)\right] \sin \left[\frac{l\pi}{N} \left(v + \frac{1}{2}\right)\right],$$

$$\hat{DSC}(u,v) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} C(k)C(l)f(k,l) \sin \left[\frac{k\pi}{N} \left(u + \frac{1}{2}\right)\right] \cos \left[\frac{l\pi}{N} \left(v + \frac{1}{2}\right)\right],$$

indicates the half-pel motion. To compute both $\hat{DCS}(u,v)$ and $\hat{DSC}(u,v)$, we decompose the computations into hierarchical one-dimensional calculations of the type-II IDCT, as illustrated in Fig. 5 (here we use $\hat{DSC}(u,v)$ as an example). By taking a close look at

III. Simulation Results and Hardware Cost

The Cadence circuit design tool Verilog™ has been used to simulate the performance of our design. We use "Miss America" in QCIF format as the test sequence. The original frame 91 and the frame reconstructed by our proposed low-power design are shown in Fig. 6 (a) and (b), respectively. The simulation results demonstrate that our low-power design achieves video quality comparable to that of the original design.

To compare the speed of the original design [4] and our low-power/high-speed design for each module in Fig. 1, we use the synthesis tool to check the static timing of each block. The resulting speed-up factors are listed in Table 1. Based on the simulation results, we observe that our low-power/high-speed design can operate at about twice the clock rate of the original design, which agrees with our previous derivations.

Table 1: Simulation results for the speed-up factor

<table>
<thead>
<tr>
<th>Unit</th>
<th>Type-II DCT/IDCT</th>
<th>DCT Coeff. Conversion</th>
<th>Pseudo-phase Computation</th>
<th>Half-pel Estimator</th>
</tr>
</thead>
<tbody>
<tr>
<td>Speedup Factor</td>
<td>1.87</td>
<td>1.95</td>
<td>1.81</td>
<td>1.85</td>
</tr>
</tbody>
</table>

Table 2: Hardware cost and data throughput rate

<table>
<thead>
<tr>
<th>Component</th>
<th>CORDICs</th>
<th>Adders</th>
<th>Registers</th>
<th>Throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>Type-II DCT/IDCT</td>
<td>9N</td>
<td>27N+12</td>
<td>N + 6N²</td>
<td>O(N)</td>
</tr>
<tr>
<td>Type Conversion</td>
<td>4N</td>
<td>0</td>
<td>0</td>
<td>O(N)</td>
</tr>
<tr>
<td>Pseudo Phase</td>
<td>20N</td>
<td>3N</td>
<td>0</td>
<td>O(N)</td>
</tr>
<tr>
<td>Peak Searching</td>
<td>0</td>
<td>0</td>
<td>2N²</td>
<td>O(N)</td>
</tr>
<tr>
<td>Half-pel Motion Estimation</td>
<td>3N+9</td>
<td>7N+33</td>
<td>3N + N²</td>
<td>O(N)</td>
</tr>
<tr>
<td>Total</td>
<td>26N+9</td>
<td>36N+45</td>
<td>4N + 8N²</td>
<td>O(N)</td>
</tr>
</tbody>
</table>

29N + 12 adders to achieve integer-pel accuracy and requires additional 3N + 9 CORDICs, 7N + 33 adders to achieve half-pel accuracy.

IV. Conclusion

Anticipating the future trend of running video applications on portable personal devices, we propose cost-effective low-power/high-speed architectures for video coding systems. Unlike the existing low-power video codec design that uses costly 0.3μm fabrication technology, our low-power/high-speed design is achieved at the algorithmic/architectural level. Essentially, we trade additional silicon area or system latency for low power consumption or high-speed performance under current technology, without requiring dedicated circuit design, expensive new devices, or advanced VLSI fabrication technology. Compared with other approaches, our algorithmic/architectural low-power approach is one of the most economical ways to save power. Techniques such as look-ahead, multirate, and pipelining have been used in our design. Based on the simulation results, our low-power/high-speed design achieves performance comparable to that of the original system at a speed-up factor of two (equivalent to a power saving in the range of 60% – 80%).

REFERENCES

