Accuracy-Configurable Adder for Approximate Arithmetic Designs

Andrew B. Kahng†‡ and Seokhyeong Kang†
†ECE and ‡CSE Departments, University of California at San Diego
abk@cs.ucsd.edu, shkang@vlsicad.ucsd.edu
ABSTRACT
Various approximate arithmetic designs have been previously
proposed. Lu [7] introduces a faster adder which has shorter carry
chains and considers only the previous k bits of input in computing
a carry bit. Verma et al. [12] provide a variable latency speculative
adder (VLSA), which is a reliable version of the Lu adder [7]
with error detection and correction. Shin et al. [10] also propose
a data path redesign technique for various adders which cuts the
critical path in the carry chain. Zhu et al. [14] [13] propose three
approximate adders – ETAI, ETAII and ETAIIM. ETAI is
divided into an accurate part and an inaccurate part to achieve
approximate results. ETAII cuts carry propagation to speed up the
adder, and ETAIIM modifies ETAII by connecting carry chains
in accurate MSB parts. Kulkarni et al. [5] present a 2x2 underdesigned
multiplier, and use it to build large power-efficient approximate
multipliers. George et al. [3] define the concept of probabilistic
CMOS (PCMOS), and implement efficient arithmetic using
PCMOS. Shin et al. [11] propose a logic synthesis approach
to design an approximate circuit.
The approximate designs produce almost-correct results at the
given required accuracy, and obtain power reductions or performance improvements in return. In some applications, however,
more accurate or totally accurate results are required under certain conditions – e.g., image processing in security cameras would
require cleaner images after detecting a motion. In contexts where
the required accuracy changes during runtime, the accuracy of results should be configurable to maximize the benefit of approximate
operations. Figure 1 illustrates how power benefits can be achieved
with an accuracy-configurable design. The accuracy-configurable
design can adapt to changing accuracy constraints by using different modes in each situation. To our knowledge, no previous work
can configure the output accuracy during runtime, and each is thus
restricted (or, best-suited) to particular application contexts. In contexts where the accuracy requirement can change dynamically, the
previous methods’ benefits from the accuracy tradeoff are reduced
since the implementation must be targeted to the maximum accuracy requirement.
Approximation can increase performance or reduce power consumption with a simplified or inaccurate circuit in application contexts
where strict requirements are relaxed. For applications related to
human senses, approximate arithmetic can be used to generate sufficient results rather than absolutely accurate results. Approximate
design exploits a tradeoff of accuracy in computation versus performance and power. However, required accuracy varies according
to applications, and 100% accurate results are still required in some
situations. In this paper, we propose an accuracy-configurable approximate (ACA) adder for which the accuracy of results is configurable during runtime. Because of its configurability, the ACA
adder can adaptively operate in both approximate (inaccurate) mode
and accurate mode. The proposed adder can achieve significant
throughput improvement and total power reduction over conventional adder designs. It can be used in accuracy-configurable applications, and improves the achievable tradeoff between performance/power and quality. The ACA adder achieves approximately
30% power reduction versus the conventional pipelined adder at the
relaxed accuracy requirement.
Categories and Subject Descriptors
B.7.2 [Hardware]: INTEGRATED CIRCUITS—Design Aids; J.6
[Computer Applications]: COMPUTER-AIDED ENGINEERING
General Terms
Algorithms, Design, Performance
Keywords
Approximate Arithmetic, Error-Tolerance, Power Minimization,
Accuracy-Configurable Adder
1. INTRODUCTION
Guardbands for dynamic variations severely limit performance
and energy efficiency of conventional IC designs. To overcome
consequences of overdesign, several recent mechanisms for variation-resilient design [4] allow timing errors and manage design
reliability dynamically. Relaxing the requirement of correctness for
designs may dramatically reduce costs of manufacturing, verification and test [16]. In resilient designs, errors can be corrected with
redundancy techniques (error-tolerance), or accepted in some applications relating to human senses such as hearing and sight (error-acceptance). In the error-acceptance regime, approximation via a
simplified or inaccurate circuit can increase performance and/or reduce power consumption.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
DAC 2012, June 3-7, 2012, San Francisco, California, USA.
Copyright 2012 ACM 978-1-4503-1199-1/12/06 ...$10.00.
[Figure 1 plot: normalized power and required accuracy (80%, 90%, 100%) over time for an accurate design and an accuracy-configurable design, with approximate mode, accurate mode and an "event occurred" marker.]
Figure 1: Power benefits from accuracy-configurable design.
In this paper, we propose an accuracy-configurable approximate
(ACA) adder, which can configure the accuracy of results during
runtime. The main contributions of our work are the following.
• The proposed ACA adder has runtime-configurable accuracy
to better enable tradeoff of accuracy in computation versus
performance and power.
• We provide quantitative metrics for an approximate arithmetic design. We compare the ACA adder to previous approximate adders based on these metrics.
• We demonstrate the power benefits of the ACA adder over
previous approximate and conventional adder designs for accuracy-configurable applications.
The rest of the paper is organized as follows. Section 2 presents
the proposed ACA adder design. Section 3 provides experimental results and analysis. Section 4 summarizes and concludes the
paper.
In the general implementation, the N-bit adder has a parameter k, which is the bit-width of the sub-adder
result. Each divided sub-module produces a k-bit result except for the last sub-module, which produces a 2k-bit result.
The approximate adder thus consists of (N/k − 1) sub-modules
as described in Equation (1).
$$SUM[N - ik - 1 : N - (i+1)k] = A[N - ik - 1 : N - (i+2)k] + B[N - ik - 1 : N - (i+2)k], \quad i = 0, \ldots, N/k - 2 \qquad (1)$$
2. ACCURACY-CONFIGURABLE ADDER
2.1 Approximate Adder Implementation
In modern adder designs, such as carry-lookahead (CLA), carry-select and Kogge-Stone adders, the path depth and area are asymptotically proportional to $\log_2 N$ and $N \log_2 N$ respectively, where
N is the bit-width of the adder [15]. Based on this, we can express delay, area and power consumption of the proposed adder in
terms of the parameters N and k. The proposed ACA adder has
(N/k − 1) sub-adders, each of which is a 2k-bit adder. Therefore,
delay of the critical path can be expressed with Equation (2) and
area can be estimated with Equation (3), where $C_{delay}$ and $C_{area}$
are constants for delay and area, respectively.
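[Figure 2 diagram labels: AH = A[15:8], AM = A[11:4], AL = A[7:0]; inputs A[15:0] and B[15:0]; 8-bit adders AH+BH (SUMH: carry SUM[16], SUM[15:12]) and AM+BM (SUMM).]
$$\text{delay} = C_{delay}\,(\log_2 k + 1) \qquad (2)$$
$$\text{area} = C_{area}\,(N - 2k)(\log_2 k + 1) \qquad (3)$$
$$Power_{dyn} = C_{power}\,(N - 2k)(\log_2 k + 1)^2 \qquad (4)$$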
Power consumption of the ACA adder can be roughly estimated
as follows. Dynamic power consumption with voltage scaling at
a fixed frequency is proportional to capacitance · $V_{dd}^2$, where
the capacitance is proportional to the area. Cell delay is proportional
to $1/(V_{dd} - V_t)^{\beta}$, and $V_{dd}^2$ is roughly proportional to
1/(cell delay) if we assume that β is 2. Since (cell delay) ×
(path depth) is constant at a fixed frequency, $V_{dd}^2$ is proportional to
the path depth, which is $\log_2 k + 1$. Consequently, dynamic power
with voltage scaling can be expressed using Equation (4), where
$C_{power}$ is a constant fixed for given $V_{dd}$ for dynamic power consumption. Static power consumption of the adder can be roughly
estimated as proportional to the area in Equation (3).
In our proposed adder design, the output of each sub-adder (except the last sub-adder) is incorrect when a carry input should be
propagated to the results. In Figure 2, when carry[4] (the carry
bit from AL + BL) is ‘1’ and $SUM_M[3:0]$ is 1111 (binary), the
output result has an error in SUM[11:8]. In the general implementation, the output result will be correct when there are no
errors in all (N/k − 1) sub-adders. In the ith sub-adder, errors
occur when (1) the LSB part of the result ($SUM_i[k-1:0]$)
has all ‘1’ values (probability $P = 1/2^k$) and (2) the LSB part
([k − 1 : 0]) of the (i + 1)th sub-adder produces a carry bit (probability $P = \frac{1}{4} + \frac{1}{2}\cdot\frac{1}{4} + \frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{4} + \ldots$). Therefore, with a random
input vector, the probability of having a correct result in the proposed adder is given by Equation (5).
[Figure 2: block diagram of three overlapping 8-bit sub-adders (AH+BH, AM+BM, AL+BL) producing SUM.]
Figure 2: Proposed approximate adder – 16-bit adder case.
Previous approximate adders [7] [10] [14] have difficulty detecting and correcting errors since they are designed for error-acceptable applications with a target accuracy. However, accurate computations are still required at certain times, according to the application. VLSA [12] can provide accurate results, but has large delay
and area overhead for the error detection and correction. The central contribution of our present work is to propose an approximate
adder which supports both accurate and inaccurate computation
with error-correction and accuracy-configuration capability. Figure
2 shows our proposed approximate circuit for the case of a 16-bit
adder. In the adder, the carry chain is cut to reduce critical-path
delay, and three sub-adders generate results of partial summations.
With the reduced critical-path delay, high performance (by increasing the clock frequency) or low power consumption (by decreasing
the operating voltage) is obtained. A middle sub-adder (AM +BM )
is introduced to increase accuracy. Without the middle sub-adder
(as in ETAII [13]), error occurs when the eighth carry bit is high,
and for random input patterns the error rate is 50.1%. On the
other hand, with the introduction of the middle sub-adder, error rate
for random input patterns is reduced to 5.5%. (In the real implementation, all redundant parts (four-LSB output of AH + BH and
AM + BM sub-adders) are optimized only for carry-generation.)
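The sub-adder arrangement described above can be illustrated with a small behavioural model. This is a minimal sketch and not the authors' RTL: the function name `aca_add` and its arguments are assumptions introduced here, and the model assumes N is a multiple of k with N ≥ 2k. Each 2k-bit window is added without a carry-in, and only its upper k bits are kept (the lowest window keeps all 2k bits).

```python
def aca_add(a, b, n=16, k=4):
    """Behavioural model of the approximate adder (hypothetical helper).

    Overlapping 2k-bit windows of A and B are added with no carry-in.
    The lowest window contributes SUM[2k-1:0]; every other window
    contributes only the upper k bits of its sum.
    """
    assert n % k == 0 and n >= 2 * k
    window_mask = (1 << (2 * k)) - 1
    block_mask = (1 << k) - 1
    result = 0
    for lo in range(0, n - 2 * k + 1, k):          # lo = lowest bit of the window
        s = ((a >> lo) & window_mask) + ((b >> lo) & window_mask)
        if lo == 0:
            result |= s & window_mask              # SUM[2k-1:0]
        else:
            result |= ((s >> k) & block_mask) << (lo + k)
        if lo == n - 2 * k:
            result |= (s >> (2 * k)) << n          # carry-out of the top window, SUM[N]
    return result


# 16-bit case of Figure 2 (k = 4, three 8-bit sub-adders)
exact_case = aca_add(0x1234, 0x0101)   # no carry crosses a window boundary
error_case = aca_add(0x00FF, 0x0001)   # carry out of bit 3 is not seen by AM+BM
print(hex(exact_case), hex(error_case))
```

In the second call the carry generated in the four LSBs must ripple through SUM[7:4] = 1111 into SUM[11:8], which is exactly the error condition described for the middle sub-adder.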
The probability of a correct result for a random input vector is
$$P(N,k) = \left(1 - \frac{1}{2^k}\cdot\frac{2^k - 1}{2^{k+1}}\right)^{N/k - 2} \qquad (5)$$
Table 1 shows the estimated results of 16-bit ACA adders with
different parameter values k. With smaller k value, the minimum
clock period and dynamic power can be reduced, but the pass rate
(probability of having a correct result) will be decreased. The estimations come from Equations (2), (3), (4) and (5). In Section 3.3
below, we validate the above estimation with real implementations.
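The estimates behind Table 1 can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the authors' script: `aca_estimates` is a hypothetical name, the total sub-adder width is taken as (N/k − 1)·2k bits (the sub-adder count and width stated in the text), and the conventional N-bit adder is assumed to scale as log2 N (delay), N·log2 N (area) and N·(log2 N)^2 (dynamic power, applying the same voltage-scaling argument).

```python
import math


def aca_estimates(n, k):
    """Normalized (delay, area, dynamic power, pass rate) of an N-bit ACA adder.

    All values are relative to a conventional N-bit adder. The width term
    (n / k - 1) * 2k is an assumed reading of the sub-adder structure.
    """
    depth = math.log2(k) + 1                      # path depth of a 2k-bit sub-adder
    width = (n / k - 1) * 2 * k                   # total sub-adder bits (assumption)
    delay = depth / math.log2(n)
    area = width * depth / (n * math.log2(n))
    power = width * depth ** 2 / (n * math.log2(n) ** 2)
    p_err = (1 / 2 ** k) * ((2 ** k - 1) / 2 ** (k + 1))   # per sub-adder, Eq. (5)
    pass_rate = (1 - p_err) ** (n / k - 2)
    return delay, area, power, pass_rate


for k in (2, 3, 4, 5, 6):
    print(k, [round(v, 3) for v in aca_estimates(16, k)])
```

Under these assumptions the printed values agree with the Table 1 entries to within about 0.01.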
[Figure 3 diagram: sub-adders take the operand slices A[N-1:N-k], A[N-k-1:N-2k], A[N-2k-1:N-3k], ... and the matching slices of B, and produce carry, SUM[N-1:N-k], SUM[N-k-1:N-2k], SUM[N-2k-1:N-3k], ... (N: bit width, k: 1/2 carry-chain depth).]
Figure 3: General implementation for the proposed adder.
We can generalize the implementation of the proposed approximate adder. Figure 3 shows the general implementation of an N-bit adder.
Table 1: Estimated minimum clock cycle, area, dynamic power and pass rate for
each k value when N = 16 (normalized to the conventional CLA 16-bit adder).
                     k=2     k=3     k=4     k=5     k=6
min. clock period    0.5     0.65    0.75    0.83    0.89
area                 0.87    1.05    1.12    1.15    1.12
dynamic power        0.44    0.68    0.84    0.95    1.00
pass rate            0.554   0.829   0.942   0.982   0.995
2.2 Error Detection and Correction for Accurate Computation
tiple stages. Figure 6 shows the pipelined adder implementation
(k = N/8 case), in which four pipeline stages are required to
achieve a 100% accurate result. In the pipelined adder, each stage
generates a result with different accuracy; the output accuracy increases as the number of pipeline stages increases. According to
the accuracy requirement, we can turn off the later stages with a
power gating technique, and we can reduce the power consumption
further with the accuracy tradeoff.
Since the proposed adder supports both approximate and accurate results, it can be used in applications that require accurate results only under certain conditions. Conventional accurate designs
are energy-inefficient in the error-acceptable application context,
because they always compute the exact function. Previous approximate designs cannot handle a varying accuracy requirement, and
this limits the benefit of the accuracy tradeoff: as noted above, the
approximate function must meet the maximum accuracy threshold
across all applications. Moreover, if the application requests an exact computation, additional accurate circuits must be added to the
previous approximate designs. By contrast, the ACA design efficiently exploits a tradeoff between accuracy and power/performance
with its runtime accuracy configurability.
As described in Section 2.1, our proposed adder is incorrect when
a carry bit is propagated between sub-adders. However, the error
can be detected and corrected with a small overhead. We detect an
error for each sub-adder by checking the output of the sub-adder
and the carry-in signal that comes from the previous sub-adder. Error detection can be implemented with several ‘and’ gates. To correct the error, ‘1’ should be added to the approximate (inaccurate)
output, and the error correction can be implemented with an incrementor circuit.
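The detection and correction idea can be sketched behaviourally as follows. This is an illustrative model under assumptions, not the authors' circuit: the names `aca_add` and `detect_and_correct` are hypothetical, the error flag of a sub-adder is modelled as "its k LSB sum bits are all ones AND the k bits just below it generate a carry", and the incrementor is modelled as adding 1 at the kept output bits of the flagged sub-adder with the carry allowed to ripple upward. Timing (and hence the single-cycle limit discussed later for small k) is not modelled.

```python
def aca_add(a, b, n=16, k=4):
    """Approximate sum from overlapping 2k-bit windows (hypothetical helper)."""
    window_mask = (1 << (2 * k)) - 1
    block_mask = (1 << k) - 1
    result = 0
    for lo in range(0, n - 2 * k + 1, k):
        s = ((a >> lo) & window_mask) + ((b >> lo) & window_mask)
        if lo == 0:
            result |= s & window_mask
        else:
            result |= ((s >> k) & block_mask) << (lo + k)
        if lo == n - 2 * k:
            result |= (s >> (2 * k)) << n
    return result


def detect_and_correct(a, b, n=16, k=4):
    """Return (approximate sum, error flag, corrected sum)."""
    block_mask = (1 << k) - 1
    approx = aca_add(a, b, n, k)
    increment = 0
    for lo in range(k, n - 2 * k + 1, k):
        # LSB part of this sub-adder's sum (computed without carry-in)
        low_sum = ((a >> lo) & block_mask) + ((b >> lo) & block_mask)
        # carry generated by the k bits directly below the window
        below = ((a >> (lo - k)) & block_mask) + ((b >> (lo - k)) & block_mask)
        carry_in = below >> k
        if low_sum == block_mask and carry_in:     # 'and' of all-ones and carry-in
            increment += 1 << (lo + k)             # incrementor on the kept bits
    return approx, increment != 0, approx + increment


approx, error, corrected = detect_and_correct(0x00FF, 0x0001)
print(hex(approx), error, hex(corrected))
```

For the example input the model flags the middle sub-adder and the incremented result equals the exact sum 0x0100.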
[Figure 4 diagram; labels: approximate adder (sub-adder i, sub-adder i+1; sum_i, carry_{i+1}, error_i), EDC circuit (incrementor), IN, OUT, SUMapprox, SUMcorrect, error, data stall.]
Figure 4: Error detection and correction with the approximate adder.
With these simple error detection and correction circuits, our
proposed adder can be implemented to have variable latency like
the previous VLSA adder [12], with a small overhead for an error detection and correction (EDC) system. Figure 4 shows an
EDC system with our proposed adder. The error detection circuit (‘and’ gates) checks the carry propagation and generates an
error signal. The error correction (incrementor) circuit produces
an error-free output by adding compensation data, and requires an
additional clock cycle. When errors are detected from input patterns, the error signal is activated. The error signal holds the input
pattern during the error correction and chooses the error-corrected
value (SUMcorrect) as an output. With this approach, our approximate adder can provide accurate results at a higher clock frequency
than that of conventional adders (e.g., CLA). According to the estimated results in Table 1, clock period can be reduced by 25% with
6% (= error rate) recovery-cycle overhead (16-bit ACA, k = 4).
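As a rough worked example of what these two numbers imply for throughput, assume (this is an assumption made here, not a statement from the measurements) that every detected error costs exactly one extra clock cycle. With the normalized clock period 0.75 and error rate ER ≈ 0.06 quoted above for the 16-bit ACA with k = 4:

```latex
T_{\mathrm{eff}} = T_{\mathrm{clk}}\,(1 + ER) = 0.75 \times 1.06 \approx 0.80,
\qquad
\frac{\mathrm{throughput}_{\mathrm{ACA}}}{\mathrm{throughput}_{\mathrm{CLA}}}
  = \frac{1}{T_{\mathrm{eff}}} \approx 1.26
```

That is, the estimate corresponds to roughly a 26% throughput gain over the conventional adder once the recovery cycles are included.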
2.3
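[Figure 5 diagram: conventional two-stage pipelined adder (Stage 1: N/2-bit adder on AL, BL producing SUML and carry; Stage 2: N/2-bit adder on AH, BH producing SUMH) and approximate pipelined adder (Stage 1: approximate adder on A, B producing SUMapprox and error; Stage 2: error correction producing SUMcorrect, with power gating switches and an accurate mode signal).]
Figure 5: Pipelined adder implementation – conventional adder (above) and approximate adder (below). In approximate operation, the error correction stage is
power-gated.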
3. EXPERIMENTAL SETUP AND RESULTS
3.1 Experimental Setup
To test approximate designs, we have written each design in Verilog and synthesized it to a TSMC 65GP cell library with Synopsys
DesignCompiler [17]. We then perform gate-level simulations using Cadence NC-Sim [18]. In the simulation, gate delay is taken
from an SDF (standard delay format) file. For voltage scaling experiments, we prepare Synopsys Liberty (.lib) files for each voltage
from 1.00V to 0.60V in 0.01V increments, using Cadence Library
Characterizer v9.1 [19]. The prepared libraries are used for SDF
file generation and power estimation at each voltage. Each simulation is performed with input patterns for one million cycles. During
the simulation, each output value is compared with a reference (correct) value to produce the accuracy metrics. For the input patterns,
we use random data, as well as actual data from SPEC 2006 [20]
benchmarks. We extract operand data from ADD instructions in
the SPEC benchmarks.
Accuracy Configuration with Pipelined Architecture
When our proposed adder is combined with a pipelined architecture, we can obtain accurate results with the same throughput as a
conventional adder. In the pipelined architecture, approximate additions are computed at the first pipeline stage, and error correction
can be completed at the second stage. Figure 5 shows the conventional pipelined adder (above) and the approximate adder (below).
The pipelined implementation of approximate adder has a structural analogy with the pipelined adder of the 2006 U.S. patent [8] in
which partial summations are performed at the first stage and carry
bits are added at the later stages. However, the patent is clearly
directed to accurate operations, not approximate computations. In
addition, we use our approximate adder (Figure 3) in the first stage.
In the pipelined approach, there is no improvement of the clock frequency since the achievable clock period is the same as that of the
conventional adder. However, power benefits are obtained through
configuration of accuracy: in the approximate mode, the error correction stage is power-gated with foot (or, head) switches in Figure
5, and power reduction over the conventional adder design can be
achieved. We compare the conventional and approximate pipelined
adders in Section 3.
In the proposed adder implementation, to achieve higher performance or lower power consumption, we can reduce the carry chain
depth (k) of sub-adders (see Table 1). However, when k is less than
N/4, it is impossible to correct all errors and achieve 100% correct results within one clock cycle since the error-correction paths
become critical. To achieve correct results in the pipelined implementation, the error-correction stage should be extended to mul-
3.2 Metric for Approximate Design
To quantify errors in approximate designs, two metrics have been
previously proposed [1]. Error rate (ER) is the percentage of cycles in which output value is different from the correct value. Error
significance (ES) is the numerical difference between correct and
output results; this quantifies the amount of error. In image/video
applications, [2] uses the product of ES and ER as a metric of
error tolerance. [10] introduces a criterion for acceptability: ES
× ER ≤ acceptance threshold, where the acceptance threshold is
specified according to the application. For the error significance
(ES) metric, [14] considers only amplitude of error. This is useful for many digital signal processing (DSP) systems that process,
e.g., sound and image data. However, in communication systems
that mainly handle information data, the number of incorrect bits
[Figure 6 diagram: four-stage pipeline. Stage 1: approximate adder on A, B producing sum segments S3, S2, S1, S0 (S0 correct, the others approximate). Stages 2 to 4: correction on S1, S2 and S3 in turn, driven by the errors on S1, S2 and S3, until all segments of SUMcorrect are correct.]
Figure 6: Accuracy-configurable implementation for pipelined adder.
Table 2: Accuracy metrics for error significance (ES).
metric    definition           data type
ACCamp    1 − |Rc − Re|/Rc     amplitude data
ACCinf    1 − Be/Bw            information data
Table 3: ACA adder results with different k values.
k                             2      3      4      5
min. clock period (ps)        180    190    220    230
area (um2)                    550    990    920    840
pass rate (%)                 55.3   82.8   94.0   98.1
throughput improvement (%)    11.3   24.6   22.3   21.4
Table 4: Design comparison for each adder design.
                          CLA     LU      ACA     ETAI    ETAIIM
area (um2)                910     1356    923     576     678
min. clock period (ps)    280     210     200     200     260
pass rate (%)             100     99.2    94.1    10.0    97.0
ACCamp (maximum)          1.000   0.998   0.997   0.999   0.999
ACCinf (maximum)          1.000   0.999   0.993   0.694   0.996
area overhead for EDC     N/A     75%     28%     N/A     15%
adders: CLA, Lu’s adder [7], ETAI, ETAIIM [14] and the proposed ACA adder (without error correction). In the experiment,
the same carry-chain width (8-bit) is selected for the four approximate adders. In the implementation, a register (flip-flop) is inserted
in each output port to detect timing errors.
Table 4 shows area, pass rate, accuracy, minimum clock period
and EDC overhead for each adder design. According to the results, the ETAI adder has the smallest design area, but has a low
pass rate and limited accuracy with respect to the ACCinf metric.
Therefore, the ETAI adder is preferred for applications which allow
low accuracy in results. The ETAIIM adder shows fairly high accuracy, but does not have speed (clock period) benefit. Lu’s adder
shows a smaller error rate and high accuracy with respect to both
ACCamp and ACCinf metrics. However, it requires larger area
than the other designs. The proposed adder shows similar results
for both metrics as Lu’s adder. However, the area of the ACA adder
is smaller than that of Lu’s adder, and EDC is possible with small
area overhead (28%). With the ACA adder, the minimum clock
period can be reduced by 26% compared to the accurate CLA.
(Hamming distance) is a more meaningful metric for accuracy –
e.g. a (32,28) Reed-Solomon code can correct up to 2-byte errors.
This consideration for the ES metric is required when approximate
arithmetic is applied to error-tolerant systems with a redundancy
technique.
Table 2 shows two accuracy metrics for amplitude data and information data. ACCamp, used in [14], quantifies the amplitude of
errors, where Rc and Re are the correct and obtained results, respectively. We propose another accuracy metric, ACCinf, which
measures error significance as Hamming distance, where Be is the
number of error bits and Bw is the bit-width of the data. For example, when the correct (reference) data is 1000_0000 (binary) and the
result data is 1100_0000 (binary), accuracy with ACCamp and ACCinf
will be 1/2 and 7/8, respectively. To evaluate the approximate circuits, we obtain average values of accuracy metrics ACCamp and
ACCinf over the entire simulation to consider both ER and ES.
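The two metrics of Table 2 are straightforward to compute per result. The sketch below is a minimal illustration; the function names `acc_amp` and `acc_inf` are introduced here, and the handling of a zero reference value (for which ACCamp is undefined as written) is left out as an assumption that Rc is nonzero.

```python
def acc_amp(correct, obtained):
    """ACCamp = 1 - |Rc - Re| / Rc (amplitude accuracy); assumes Rc != 0."""
    return 1 - abs(correct - obtained) / correct


def acc_inf(correct, obtained, bit_width):
    """ACCinf = 1 - Be / Bw, where Be is the Hamming distance of the results."""
    diff = (correct ^ obtained) & ((1 << bit_width) - 1)
    error_bits = bin(diff).count("1")
    return 1 - error_bits / bit_width


r_c, r_e = 0b1000_0000, 0b1100_0000      # the example from the text
print(acc_amp(r_c, r_e), acc_inf(r_c, r_e, 8))   # 0.5 0.875
```

Averaging these per-cycle values over a simulation run gives the figures that combine error rate and error significance.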
3.3 Approximate Adder with Different Parameters
We explore the proposed adder with different parameters (k: half
of carry-chain depth). Table 3 summarizes results – minimum clock
period, area, error rate and throughput improvements – for each implementation of the 16-bit adder with different k values. According
to the results, with smaller k, the maximum operating frequency increases, but the error rate increases as well. With higher k, the error rate is reduced significantly, but the benefit of the approximate
circuit, i.e., clock period reduction, is small. In the table, throughput improvement over conventional design is calculated including
error recovery overhead. From the implementations, a maximum
throughput improvement is achieved when k = 3. If we correct
erroneous results with EDC as in Figure 4, then 17.2% additional
clock cycles are required for error correction. With this overhead,
ACA adder can improve data throughput by 24.6% over the conventional CLA adder.
3.4
[Figure 7 plots (two panels: ACCamp and ACCinf): accuracy vs. total power (W) under voltage scaling (1.0V~0.6V) for the ACA adder, Lu's adder, ETAIIM, CLA and ETAI.]
Figure 7: Accuracy (y-axis) vs. power consumption (x-axis) under fixed clock
period (0.25ns) and scaled voltage (from 1.0V to 0.6V).
Figure 7 shows a power vs. accuracy tradeoff in a voltage scaling
scenario: the x-axis shows total power consumption, and the y-axis
shows the accuracy (ACCamp , ACCinf ). The power consumption
and the accuracy are measured with different voltage libraries characterized using Cadence Library Characterizer [19]. The clock
period is fixed at 0.30ns during the simulations. In the results,
Lu’s adder does not show power benefits due to its design size.
ETAI shows low power consumption and high ACCamp accuracy,
but has low ACCinf accuracy, and cannot detect and correct errors. ETAIIM shows similar characteristics to ACA in the voltage
scaling case, but the adder cannot be used for a high-performance
(high-frequency) design, as shown in Table 4. The results in Figure
7 imply that our proposed adder can provide a significant power
Approximate Adder Comparison
We evaluate each approximate adder with respect to the pass
rate and the accuracy metrics which we have proposed. We use
gate-level simulation at each possible clock period to compare five
lected as N/4 for a two-stage pipelined implementation. In the
table, minimum clock period is measured at a fixed voltage (1.0V ),
and total power is measured at a fixed frequency (2.5GHz) with
voltage scaling. In the ACA adder case, timing and power overheads from power gating cells, output MUXes, and IR drop are
included. We can see that area, timing and power of both designs
are similar when the ACA adder operates in the accurate mode.
Total power of the approximate adder is comparable to that of the
conventional adder, even though ACA has additional EDC circuits.
This is because ACA has fewer registers between stage-1 and stage-2 than the conventional pipelined adder. (In Figure 5, the conventional adder requires registers for AH, BH, SUML and carry at
the first stage. For a 16-bit adder, 25 registers (8 + 8 + 8 + 1) are
required. On the other hand, ACA requires 18 registers (16 for
SUMapprox and 2 for error indication).)
reduction with small accuracy penalty. When the required accuracy is 0.970 (ACCamp ), the ACA adder shows 37.0%, 36.4% and
15.9% total power reduction over CLA, Lu’s adder and ETAIIM,
respectively.
We have tested our approximate adder on a real application – a
Gaussian smoothing filter used in [6]. Gaussian smoothing is performed on the input image by convolving with a matrix in the spatial domain. In the convolution, the addition operation is done with
approximate 16-bit adders. Other operations, such as multiplication
and division, are accurate computations. Figure 8 shows results for
various approximate adders when they consume 50% of the power
of accurate CLA. From the results, the ACA adder has PSNR of
24.5dB, and this suggests that image processing/filtering applications could employ our proposed adder with significant power savings and only small loss in image quality.
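The text does not spell out how PSNR is computed, so the following is only a sketch of the standard definition for 8-bit images, 10·log10(peak² / MSE). The function name `psnr`, the flat pixel lists and the default peak of 255 are assumptions made here; which image serves as the reference in the experiment is not specified.

```python
import math


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf            # identical images
    return 10 * math.log10(peak ** 2 / mse)


print(psnr([10, 20, 30, 40], [10, 22, 30, 38]))
```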
Figure 8: Image smoothing: (a) original image with noise; (b) accurate adder; (c)
ACA, PSNR: 24.5 dB; (d) ETAI, PSNR: 25.3 dB; (e) ETAIIM, PSNR: 16.2 dB; (f)
Lu’s adder, PSNR: 11.1 dB.
Table 5 values (k; conventional pipelined; approximate pipelined):
       conventional pipelined                              approximate pipelined
k      area (um2)  clock period (ns)  total power (mW)     area (um2)  clock period (ns)  total power (mW)
2      459         0.313              0.557                576         0.312              0.564
4      1082        0.357              1.558                1171        0.358              1.669
8      2252        0.404              2.860                2420        0.414              2.914
Table 6 values (config.; power gating; accuracy; total power):
config.   power gating     ACCamp (max.)   ACCinf (max.)   total power (mW)
mode-1    none             1.000           1.000           5.962
mode-2    stage-4          0.998           0.960           4.683
mode-3    stage-3, 4       0.991           0.925           3.691
mode-4    stage-2, 3, 4    0.983           0.900           2.588
[Figure 9 plots: total power consumption (W) vs. accuracy for the conventional pipelined adder and the ACA adder (modes 1 to 4); annotations: accurate result, voltage scaling, mode change.]
Figure 9: Accuracy metric ACCamp (above) and ACCinf (below) vs. power
consumption for conventional pipelined adder, ACA adder in accurate mode, and
ACA adder in approximate mode (4-stage, 32-bit adder).
3.5
In the pipelined architecture, the ACA adder can provide various
configurable modes according to the pipeline depth. To improve
the design performance, we increase the pipeline depth; the deeper
pipeline reduces the path depth of the design. In the conventional
pipelined adder, bit-width of the adder in each stage can be reduced
to N/#stage, where N is the entire bit-width and #stage is the
depth (number) of the pipeline stages. In the ACA adder, we can reduce the value of parameter k with deeper pipeline depth as shown
in Figure 6. To show the benefit of accuracy configuration, we have
implemented a 32-bit ACA adder (N = 32, k = 4) with 4-stage
pipeline, and compared it with a conventional pipelined adder with
an 8-bit CLA in each stage. Table 6 shows the implemented results
for the 32-bit ACA adder. For the accuracy estimation, one million
cycles of random patterns are used. The ACA adder can operate
in four different modes, based on the power gating of each stage.
We can see that the modes show different power consumptions and
different achievable accuracies. The ACA adder consumes 11.5%
more power than the conventional adder in accurate mode (mode-1)
due to the presence of recovery circuits. At the same time, it shows
a significant power reduction in the approximate modes: 12.4%,
31.0% and 51.6% in mode-2, mode-3 and mode-4, respectively.
Figure 9 shows detailed results for power consumption versus accuracy metrics in each configuration. From the results, we can see
that accuracy configuration with the mode change is much more effective than with voltage scaling, in terms of the tradeoff between
accuracy and power.
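How a system might exploit these modes at runtime can be sketched as a simple lookup over the Table 6 measurements. This is a hypothetical illustration, not part of the reported design: the names `TABLE6`, `select_mode` and `power_reduction` are introduced here, and choosing a mode by its maximum ACCamp value is an assumption about how an accuracy requirement would be expressed.

```python
# (mode, power-gated stages, ACCamp max, ACCinf max, total power in mW), from Table 6
TABLE6 = [
    ("mode-1", "none",          1.000, 1.000, 5.962),
    ("mode-2", "stage-4",       0.998, 0.960, 4.683),
    ("mode-3", "stage-3, 4",    0.991, 0.925, 3.691),
    ("mode-4", "stage-2, 3, 4", 0.983, 0.900, 2.588),
]

# mode-1 is reported as consuming 11.5% more power than the conventional adder
CONVENTIONAL_POWER_MW = TABLE6[0][4] / 1.115


def select_mode(required_acc_amp):
    """Lowest-power mode whose maximum ACCamp meets the requirement."""
    feasible = [row for row in TABLE6 if row[2] >= required_acc_amp]
    return min(feasible, key=lambda row: row[4])[0]


def power_reduction(mode):
    """Fractional power reduction of a mode relative to the conventional adder."""
    power = next(row[4] for row in TABLE6 if row[0] == mode)
    return 1 - power / CONVENTIONAL_POWER_MW


for requirement in (1.0, 0.99, 0.95):
    mode = select_mode(requirement)
    print(requirement, mode, round(100 * power_reduction(mode), 1))
```

The reductions recomputed this way match the percentages quoted in the text for mode-2, mode-3 and mode-4.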
Table 6: Implementation results of 32-bit ACA adder with 4-stage pipeline (power
consumption of each mode and power reduction over conventional pipelined
adder).
config.   reduction (%)
mode-1    -11.5
mode-2    12.4
mode-3    31.0
mode-4    51.6
Table 5: Comparison between conventional and approximate (2-stage) pipelined
adders at the accurate mode.
adder width (N): 8, 16, 32
Accuracy Configuration and Power Savings
When the architecture allows pipelining for addition, our proposed adder can be implemented as shown in Figure 5. We implement both the conventional pipelined adder and the approximate
pipelined adder to compare the designs in terms of area, timing and
power. In the implementation, registers (flip-flops) are included at
each pipeline stage (before stage-1, between stage-1 and stage-2,
and after stage-2).
Table 5 shows the implementation results for the conventional
and approximate pipelined adders. The parameter k has been se-
Table 7: Accuracy (ACCamp, ACCinf) results of 32-bit ACA adder for real benchmarks (SPEC 2006).
             ACCamp                                ACCinf
benchmark    mode-1  mode-2  mode-3  mode-4    mode-1  mode-2  mode-3  mode-4
astar        1.0000  0.9999  0.9993  0.9979    1.0000  0.9979  0.9949  0.9940
bzip2        1.0000  1.0000  0.9998  0.9970    1.0000  1.0000  0.9984  0.9931
calculix     1.0000  0.9999  0.9972  0.9958    1.0000  0.9978  0.9967  0.9910
gcc          1.0000  0.9992  0.9990  0.9951    1.0000  0.9881  0.9849  0.9617
h264ref      1.0000  0.9999  0.9990  0.9978    1.0000  0.9953  0.9897  0.9851
mcf          1.0000  0.9997  0.9997  0.9991    1.0000  0.9819  0.9809  0.9596
sjeng        1.0000  0.9998  0.9995  0.9981    1.0000  0.9897  0.9876  0.9787
soplex       1.0000  0.9999  0.9998  0.9953    1.0000  0.9985  0.9965  0.9925
Normalized power consumption
1
0.99
ACCamp
adder. The ACA adder can also be used in accuracy-configurable
applications with pipelining. We demonstrate that the ACA adder
can provide approximately 30% power reduction under a relaxed
accuracy requirement versus the conventional pipelined adder. Finally, we show that our ACA adder can improve the achievable
tradeoff between performance, power and quality for given accuracy requirements.
Our ongoing work seeks to implement accuracy-configurable designs for other arithmetic components such as multipliers and multi-input adders. More broadly, our research addresses additional
aspects of (runtime) accuracy-configurable systems and applications.
5. REFERENCES
[1] M. A. Breuer, “Intelligible Test Techniques to Support Error-Tolerance”, Proc.
Asian Test Symp., 2004, pp. 386–393.
[2] I. Chong, H. Y. Cheong and A. Ortega, “New Quality Metric for Multimedia
Compression Using Faulty Hardware”, Proc. International Workshop on Video
Processing and Quality Metrics for Consumer Electronics, 2006, pp. 267–272.
[3] J. George, B. Marr, B. E. S. Akgul and K. V. Palem, “Probabilistic Arithmetic
and Energy Efficient Embedded Signal Processing”, Proc. CASES, 2006, pp.
158–168.
[4] S. Ghosh and K. Roy, “Parameter Variation Tolerance and Error Resiliency:
New Design Paradigm for the Nanoscale Era”, Proceedings of the IEEE
98(10) (2010), pp. 1718–1751.
[5] P. Kulkarni, P. Gupta and M. Ercegovac, “Trading Accuracy for Power with an
Underdesigned Multiplier Architecture”, Proc. IEEE/ACM International
Conference on VLSI Design, 2011, pp. 346–351.
[6] M. S. Lau, K.-V. Ling and Y.-C. Chu, “Energy-Aware Probabilistic Multiplier:
Design and Analysis”, Proc. CASES, 2009, pp. 281–290.
[7] S.-L. Lu, “Speeding Up Processing with Approximation Circuits”, IEEE
Computer 37(3) (2004) pp. 67-73.
[8] H. D. Mohammed and L. Hemmert, “Fast Pipelined Adder/Subtractor using
Increment/Decrement Function with Reduced Register Utilization”, U.S.
Patent No. 7,007,059, 2006.
[9] B. J. Phillips, D. R. Kelly and B. W. Ng, “Estimating Adders for a Low
Density Parity Check Decoder”, Proc. SPIE, vol. 6313, 2006, pp. 1–9.
[10] D. Shin and S. K. Gupta, “A Re-Design Technique for Datapath Modules in
Error Tolerant Applications”, Proc. Asian Test Symp., 2008, pp. 431–437.
[11] D. Shin and S. K. Gupta, “Approximate Logic Synthesis for Error Tolerant
Applications”, Proc. DATE, 2010, pp. 957–960.
[12] A. K. Verma, P. Brisk and P. Ienne, “Variable Latency Speculative Addition: A
New Paradigm for Arithmetic Circuit Design”, Proc. DATE, 2008, pp.
1250–1255.
[13] N. Zhu, W. Goh and K. Yeo, “An Enhanced Low-Power High-Speed Adder
For Error-Tolerant Application” Proc. Intl. Symp. on Integrated Circuits, 2009,
pp. 69–72.
[14] N. Zhu, W. Goh, W. Zhang, K. Yeo and Z. Kong, “Design of Low-Power
High-Speed Truncation-Error-Tolerant Adder and Its Application in Digital
Signal Processing”, IEEE Trans. on VLSI Systems 18(8) (2010), pp.
1225–1229.
[15] M. Ziegler and M. Stan, “Optimal Logarithmic Adder Structures with a Fanout
of Two for Minimizing the Area-Delay Product”, Proc. ISCAS, 2001, pp.
657–660.
[16] International Technology Roadmap for Semiconductors, 2009,
http://www.itrs.net .
[17] Synopsys Design Compiler User’s Manual.
http://www.synopsys.com .
[18] NC-Sim User’s Manual. http://www.cadence.com .
[19] Cadence LC User’s Manual. http://www.cadence.com .
[20] Standard Performance Evaluation Corporation (SPEC) CPU2006.
http://www.spec.org/cpu2006 .
[Figure 10 plots: normalized power consumption (y-axis, 0 to 1) for each benchmark (astar, bzip2, calculix, gcc, h264ref, mcf, sjeng, soplex), with legend entries mode-1 to mode-4; one panel for ACCamp and one for ACCinf.]
Figure 10: Normalized power consumption versus conventional pipelined design when the accuracy requirement is varied uniformly over the interval 0.99 ≤ ACCamp ≤ 1.00 and 0.95 ≤ ACCinf ≤ 1.00.
We also measure the accuracy of each accuracy mode with real input
patterns extracted from SPEC 2006 benchmarks. Table 7 shows the
accuracy results of a 32-bit ACA adder with these patterns. The
accuracy differs across benchmarks; e.g., the measured accuracy for
bzip2 is higher than for gcc. Furthermore, the accuracy with real
patterns is greater than with random input patterns (Table 6), most
likely because the addition inputs of an MPU change infrequently
and/or systematically within an application. We evaluate power
reductions across accuracy requirements using the patterns from the
SPEC 2006 benchmarks. Figure 10 shows the power reduction achieved by
the ACA adder versus the conventional pipelined adder under these
accuracy requirements. We assume that the required accuracy ranges
from 0.99 to 1.00 for ACCamp and from 0.95 to 1.00 for ACCinf, and
that it varies uniformly over this range during the entire runtime.
Under this assumption, dynamic accuracy configuration achieves up to
44.5% (30.0% on average) and 47.1% (35.8% on average) power reduction
over the conventional pipelined design for the ACCamp and ACCinf
metrics, respectively.
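The averaging over a uniformly varying accuracy requirement can be sketched as follows. This is a minimal illustration, assuming that the adder always runs in the most approximate mode whose measured accuracy still meets the current requirement; the text does not spell out the selection policy. The accuracy values are the gcc ACCamp entries from Table 7, while the per-mode power values in `HYPOTHETICAL_MODE_POWER` are placeholders chosen for illustration, not measurements from the paper.

```python
def mode_time_fractions(mode_accuracy, req_lo, req_hi=1.0):
    """Fraction of runtime spent in each mode.

    mode_accuracy is ordered from mode-1 (accurate) to the most approximate
    mode and is assumed to be non-increasing. The required accuracy is
    assumed to be uniformly distributed over [req_lo, req_hi].
    """
    span = req_hi - req_lo
    fractions = [0.0] * len(mode_accuracy)
    covered = req_lo  # requirements up to this value are already assigned
    # Walk from the most approximate mode toward the accurate mode, giving
    # each mode the slice of requirements it can newly satisfy.
    for idx in range(len(mode_accuracy) - 1, -1, -1):
        reach = min(max(mode_accuracy[idx], covered), req_hi)
        fractions[idx] = (reach - covered) / span
        covered = reach
    return fractions


def expected_normalized_power(fractions, mode_power):
    """Runtime-weighted average of the per-mode normalized power."""
    return sum(f * p for f, p in zip(fractions, mode_power))


# gcc, ACCamp, mode-1 to mode-4 (Table 7).
GCC_ACC_AMP = [1.0000, 0.9992, 0.9990, 0.9951]

# Placeholder normalized power per mode (hypothetical, for illustration only).
HYPOTHETICAL_MODE_POWER = [1.10, 0.90, 0.70, 0.50]

if __name__ == "__main__":
    fractions = mode_time_fractions(GCC_ACC_AMP, req_lo=0.99)
    print("time fractions:", fractions)
    print("expected normalized power:",
          expected_normalized_power(fractions, HYPOTHETICAL_MODE_POWER))
```

With the gcc values, a requirement drawn uniformly from 0.99 to 1.00 falls at or below the mode-4 accuracy of 0.9951 about half of the time, so the most approximate mode dominates the average.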
4. CONCLUSIONS
In this paper, we propose an accuracy-configurable approximate
(ACA) adder for which the accuracy of results is configurable during runtime. Due to its configurability, the ACA adder can operate adaptively in both approximate (inaccurate) mode and accurate
mode. To quantify the accuracy in approximate computation, we
provide two metrics for amplitude data and information data. We
compare the ACA adder against previous approximate adders based
on the proposed metrics. The ACA adder shows high accuracy with
respect to the metrics, and can provide up to 24.6% throughput improvement and 37.0% power reduction over the conventional CLA