Area Efficient Floating-Point Adder and Multiplier
with IEEE-754 Compatible Semantics
Andreas Ehliar
Dept. of Electrical Engineering
Linköping University
Linköping, Sweden
Email: [email protected]
Abstract—In this paper we describe an open source floating-point adder and multiplier implemented using a 36-bit custom number format based on radix-16 and optimized for the 7-series FPGAs from Xilinx. Although this number format is not identical to the single-precision IEEE-754 format, the floating-point operators are designed in such a way that the numerical results for a given operation will be identical to the result from an IEEE-754 compliant operator with support for round-to-nearest-even, NaNs and Infs, and subnormal numbers. The drawback of this number format is that the rounding step is more involved than in a regular, radix-2 based operator. On the other hand, the use of a high radix means that the area cost associated with normalization and denormalization can be reduced, leading to a net area advantage for the custom number format, under the assumption that support for subnormal numbers is required. The area of the floating-point adder in a Kintex-7 FPGA is 261 slice LUTs and the area of the floating-point multiplier is 235 slice LUTs and 2 DSP48E blocks. The adder can operate at 319 MHz and the multiplier can operate at a frequency of 305 MHz.
I. INTRODUCTION
When porting floating-point heavy reference code to an
FPGA there are several different implementation strategies.
The first is to convert all (or most) of the reference code
into fixed-point. This option will most likely lead to the
most efficient implementation, as most FPGAs are optimized
for fixed-point arithmetic. However, the development cost of this strategy can be prohibitive, depending, for example, on whether techniques such as block floating-point can be used. In many cases it is also not possible to get exactly the same results as in the reference code.
Another strategy is to simply implement the floating-point
based algorithms directly using floating-point operators in
the FPGA. This probably means that IEEE-754 compatible
floating-point operators will be required, as the reference code
is probably using either single or double-precision IEEE-754
compatible floating-point arithmetic. The main advantage of
this strategy is that it is possible to create an application
which behaves in exactly the same way as the reference code.
The drawback is that the area overhead associated with fully
compliant IEEE-754 operators can be rather large. (It should also be noted that porting reference code from a language such as C or C++, where the compiler has relatively free rein in choosing whether single, double, or extended precision is used, is not trivial if bit-exact results are required; see [1] for more information.)
Fig. 1. Overview of the high-radix floating-point adder/multiplier. The adder pipeline consists of Compare, Swap, Align, Add, Normalize, and Round/Postprocess stages; the multiplier pipeline consists of Multiply, Normalize, and Round/Postprocess stages. Grey boxes in the original figure mark stages that are substantially different from a traditional operator with IEEE-754 support.
Yet another strategy that can be used is to analyze the reference code to determine if a custom floating-point format can
be used in order to reduce the area cost required. This strategy
exploits the fine-grained reconfigurability of FPGAs by not
using more hardware than strictly required. For example, if the application doesn't require the precision offered by the single-precision IEEE-754 format, the size of the mantissa can be reduced in order to reduce the area cost, which is especially useful if the reduced mantissa size corresponds to a reduction in the number of required DSP blocks. The main drawback of
this approach is that it is more difficult to compare the results
from the FPGA implementation with the reference code, as
it is no longer possible to get bit-exact compliance with the
reference code.
A. Custom Floating-Point Format With IEEE-754 Semantics
The main subject of this paper is a new floating-point format which is designed to cater to the strengths of FPGAs while retaining IEEE-754 semantics. This format is based on three main ideas:
• The area cost associated with normalization and denormalization can be reduced if a radix higher than two is used for the floating-point number.

• Subnormal numbers are costly to implement. However, the extra dynamic range offered by subnormal numbers can also be had by extending the exponent by one bit.

• It is possible to create a floating-point format based on a high radix and an extended exponent range which can represent all possible values of a single-precision IEEE-754 number. A special post-processing stage can be used to ensure that the result is identical to the result from an IEEE-754 compatible single-precision operator.
Under the assumption that the cost of converting between IEEE-754 and the custom format can be amortized over a few operators, this will lead to a net reduction in area while retaining the ability to produce the exact same values as an IEEE-754 compatible implementation would. The implementation details of such an operator are similar to those of a traditional IEEE-754 compatible operator, although some of the pipeline stages are substantially different, as shown in Fig. 1.
II. HIGH RADIX FLOATING-POINT FORMATS
Normally, a floating-point number is assumed to have radix
2. That is, the value of a floating-point number can be written
as follows:
(−1)^sign · mantissa · 2^exponent    (1)
TABLE I. THE FLOATING-POINT FORMAT USED IN THIS PAPER COMPARED TO SINGLE PRECISION IEEE-754

IEEE-754 (single precision):
  Bit 31      Sign bit
  Bits 30-23  Exponent
  Bits 22-0   Mantissa (with implicit one)

HRFP16 (this paper):
  Bit 35      Sign bit
  Bits 34-33  If 10: Inf. If 11: NaN. If 00 or 01: Normal value
  Bits 33-27  Exponent
  Bits 26-0   Mantissa (with explicit one)
  Restriction: Only values that can be represented in IEEE-754 are allowed.
TABLE II. HRFP16 EXAMPLE VALUES

Decimal        | Sign | Exponent | Mantissa  | Equivalent IEEE-754 encoding
0.0            |  0   |   0x00   | 0x0000000 | 0x00000000
1.0            |  0   |   0x60   | 0x0800000 | 0x3f800000
2.0            |  0   |   0x60   | 0x1000000 | 0x40000000
4.0            |  0   |   0x60   | 0x2000000 | 0x40800000
8.0            |  0   |   0x60   | 0x4000000 | 0x41000000
16.0           |  0   |   0x61   | 0x0800000 | 0x41800000
1.401 · 10^−45 |  0   |   0x3a   | 0x4000000 | 0x00000001
3.403 · 10^38  |  0   |   0x7f   | 0x7fffff8 | 0x7f7fffff
The most significant bit of the mantissa is often assumed
to have the value 1, although this is not a strict requirement.
In contrast, if a high radix is used, the following definition
is used instead, where base > 2:
(−1)^sign · mantissa · base^exponent    (2)
In order to make it easy to convert the exponent of a radix-2 floating-point number to a high-radix representation, it is also an advantage if the following holds:

base = 2^(2^k)    (3)

If the expression in Equation 3 holds, the exponent can be converted from radix-2 to the high radix merely by removing the k LSBs from the exponent, under the assumption that the bias of the exponent is chosen appropriately. (For radix 16, base = 2^(2^2), so k = 2 and the conversion simply drops the two LSBs of the radix-2 exponent.)
The advantage of using a higher radix is that the floating-point adder can be simplified. In radix 2, the shifters in the floating-point adder need to shift the mantissa in steps of 1. If, for example, radix 16 is used, these shifters only need to shift in steps of 4. Such number formats have been shown to have a clear advantage over radix-2 based floating-point numbers in FPGAs [2].
A. The HRFP16 Format

The high-radix floating-point format used in this paper (hereafter referred to as the HRFP16 format, to signify that radix 16 is used) has 7 exponent bits and 27 mantissa bits, as shown in Table I, for a total of 36 bits. This means that the mantissa is shifted in steps of four bits during normalization and denormalization, rather than one bit as is the case in IEEE-754 (and most other floating-point formats). It should also be noted that since the format is 36 bits wide, a value can be conveniently stored in a 36-bit wide BlockRAM as well.

If bit 34 is set, it signifies that the value is either a NaN or an Inf, depending on the value of bit 33. (From a developer's point of view, it is sometimes convenient if bit 34 is considered a part of the exponent, as this will automatically cause most overflows to be flushed to Inf.) Another special case is the value zero, which is stored as all zeroes in the exponent and mantissa fields (with an appropriate sign in the sign field).

The following equation is used to determine the value of a given number in the HRFP16 format:

(−1)^sign · mantissa · 2^(4·exponent − 407)    (4)

where sign, mantissa, and exponent should be interpreted as integers. The encoding of a few different values in the HRFP16 format can be seen in Table II.
B. On IEEE-754 Compatibility

The IEEE-754 standard only considers floating-point formats with a radix of 2 or 10 (although the decimal case will not be discussed further in this paper). This means that a straightforward implementation of a floating-point number format with another radix is not compatible with IEEE-754. However, the format described in Table I is capable of exactly representing all values possible in a single-precision IEEE-754 number. This means that it is possible to create floating-point operators based on the HRFP16 format that will produce the exact same value as specified by IEEE-754, as long as a post-processing step is employed which guarantees that the result is a value that can be represented in single-precision IEEE-754.

This was previously explored for floating-point adders in [3], where it was shown how rounding could be implemented in a high-radix floating-point adder in such a way that the numerical result is the same as that produced by an IEEE-754 compatible adder. In the HRFP16 format, this means that the guard, round, and sticky bits are located at different bit positions depending on where the explicit one is located, as shown in Table III.
C. Subnormal Handling

Subnormal numbers are typically not considered in FPGA implementations of floating-point arithmetic due to the additional cost of such support. Instead, subnormal numbers are treated as zeroes in, for example, the floating-point operators supplied by Xilinx [4]. The usual recommendation is instead to take advantage of the malleability of FPGAs and use a custom floating-point format with one additional exponent bit in such cases, as this covers a much larger dynamic range than the small extra range afforded by subnormal numbers.

While this is a good solution for most situations, it is not sufficient where strict adherence to IEEE-754 is required, for example when an accelerator has to return a bit-exact answer compared to a CPU-based reference implementation. Rather than implement support for subnormal numbers in the HRFP16 operators, IEEE-754 style subnormal values are instead handled by using an extra bit in the exponent, while at the same time implementing the rounding/post-processing step in such a way that the appropriate LSB bits are forced to zero while taking care to round at the correct bit position.
III. RADIX-16 FLOATING-POINT ADDER
This section contains the most interesting implementation details of the adder; readers who want even more implementation details are encouraged to download the source code. The adder is based on a typical floating-point adder pipeline with five pipeline registers, as shown in Fig. 2. The first stage compares the operands and the second stage swaps them if necessary, so that the smaller operand is right-shifted before the mantissas are added using an integer adder. After the addition the result is normalized and the exponent is adjusted. Finally, the result is post-processed and rounded to ensure that the output is a value which can be represented using a single-precision IEEE-754 floating-point number.
A. Comparison Stage
The comparison stage is fairly straightforward. It is possible to determine the operand with the largest magnitude merely by doing an integer comparison of the exponent and mantissa parts of the number. By using the LUT6_2 and CARRY4 primitives, two bits can be compared at the same time. Normally XST will be able to infer such a comparator automatically. However, in order to handle a corner case involving +0 and -0 correctly, it is convenient if this comparison can be configured as either "less than" or "less than or equal", depending on whether the sign bit of one of the operands is set or not. The author did not manage to write HDL in such a way that the synthesis tool would do this, hence this comparison contains hand-instantiated LUT6_2 and CARRY4 components.
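As an illustration, a behavioral sketch of this comparison is shown below. The port names are invented and this is only one plausible reading of the sign-dependent comparison described above; the actual design implements the function with hand-instantiated LUT6_2/CARRY4 pairs.

module compare_sketch(input  wire [35:0] op_a,
                      input  wire [35:0] op_b,
                      output wire        a_is_smaller);
  // The exponent and mantissa fields are compared as one unsigned integer.
  wire [33:0] mag_a = op_a[33:0];
  wire [33:0] mag_b = op_b[33:0];
  // Switching between "<" and "<=" based on a sign bit keeps the
  // comparison consistent for a +0/-0 operand pair.
  assign a_is_smaller = op_a[35] ? (mag_a <= mag_b) : (mag_a < mag_b);
endmodule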
Fig. 2. High Level Schematic of the Adder. The original figure distinguishes single-bit and multi-bit signals, marks modules that consist mostly of manually instantiated LUTs, and highlights characteristic differences compared to IEEE-754.

TABLE III. GUARD, ROUND, AND STICKY BIT HANDLING IN THE ADDER

Mantissa before post-processing and rounding:
  1xxx xxxx xxxx xxxx xxxx xxxx GRS. ..
  01xx xxxx xxxx xxxx xxxx xxxx xGRS ..
  001x xxxx xxxx xxxx xxxx xxxx xxGR S.
  0001 xxxx xxxx xxxx xxxx xxxx xxxG RS

Key: x: these bits are allowed to be either zero or one in the final rounded mantissa. G: Guard bit, R: Round bit, S: Sticky bit, .: bits that contribute to the sticky bit. G, R, S, and . are set to 0 in the final mantissa.
B. Swap Stage

The swap stage is responsible for swapping the operands to make sure that the smaller operand is sent to the shifter in the next stage. The multiplexers that swap the operands consist of inferred LUT6_2 primitives where the O5 and O6 outputs are used to output the floating-point values on the inputs either unchanged or swapped.

Besides swapping operands if necessary, this stage also calculates the difference between the exponents, so that the alignment stage knows how far to shift the right operand:

expdiff = MIN(7, |exponent_A − exponent_B|)    (5)

This calculation requires only one adder, as the previous stage has already performed a comparison between the operands. (However, some extra logic is still needed to determine whether to saturate the result to 7.) Finally, this stage also sets the NaN or Inf flags (not shown in Fig. 2).
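A sketch of this calculation is shown below; the signal names are assumptions, with a_larger coming from the comparison stage.

module expdiff_sketch(input  wire [6:0] exp_a,
                      input  wire [6:0] exp_b,
                      input  wire       a_larger,
                      output wire [2:0] expdiff);
  // One subtraction suffices since the compare stage already established
  // the ordering of the operands.
  wire [6:0] d = a_larger ? (exp_a - exp_b) : (exp_b - exp_a);
  // Saturate at 7: a shift of 7 radix-16 steps already moves the entire
  // 27-bit mantissa into the sticky logic.
  assign expdiff = (d > 7'd7) ? 3'd7 : d[2:0];
endmodule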
C. Alignment Stages

To align the mantissa of the smaller operand a right shifter is required. Unlike in a traditional IEEE-754 based adder, the shift distance is always a multiple of four bits, which reduces the area and latency of this stage.

In the RTL code, the shifter is implemented as an eight-to-one multiplexer which has been divided into two four-to-one multiplexers. This division makes it possible to merge the final two-to-one multiplexer into the integer adder in the next stage. Besides the multiplexers, this stage also contains the OR gates required to calculate the sticky bit, which is necessary for a correct implementation of round-to-nearest-even.
D. Mantissa Addition Stage

This stage contains the final step of the alignment operation as described above, as well as the integer adder/subtracter. The smaller operand is either added to or subtracted from the larger operand, depending on whether the signs of the operands are equal or not.
E. Normalization

It is well known that normalization is one of the hardest parts, if not the hardest part, of floating-point addition, and this is still the case for the HRFP16 format, even if the complexity is reduced by the fact that only shifts in multiples of four bits are required. Note that unlike in floating-point formats based on radix 2, it is not feasible to use an implicit one in radix 16. Instead, a normalized number in radix 16 has at least one of the four most significant bits set to one.

The normalization stage consists of two parts. The first part calculates how many steps to shift by using a leading zero detector (LZD). Unlike a traditional IEEE-754 operator, which operates on one bit at a time, the LZD in the HRFP16 adder works on four bits at a time. This is implemented by manually instantiating LUT6_2 instances configured as four-input AND gates. As an additional micro-optimization, some of these LUTs are reused to detect groups of ones in the mantissa that would lead to an overflow when rounding is performed in the next stage; this is done by configuring the other output of the LUT6_2 instances as four-input NOR gates. The manual instantiation of these primitives was mostly done so that these very latency-critical parts could be floorplanned using RLOC attributes.
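A behavioral equivalent of the nibble-granularity LZD is sketched below. The names are invented, the real design maps the nibble flags onto dual-output LUT6_2 primitives as described above, and bits 2:0 of the mantissa are lumped into the last shift step for simplicity.

module lzd16_sketch(input  wire [26:0] mant,
                    output reg  [2:0]  lzd);  // left shift in 4-bit steps
  // One flag per nibble: set if the nibble is all zeroes.
  wire z0 = ~|mant[26:23];
  wire z1 = ~|mant[22:19];
  wire z2 = ~|mant[18:15];
  wire z3 = ~|mant[14:11];
  wire z4 = ~|mant[10:7];
  wire z5 = ~|mant[6:3];

  always @* begin
    casez ({z0, z1, z2, z3, z4, z5})
      6'b0?????: lzd = 3'd0;
      6'b10????: lzd = 3'd1;
      6'b110???: lzd = 3'd2;
      6'b1110??: lzd = 3'd3;
      6'b11110?: lzd = 3'd4;
      6'b111110: lzd = 3'd5;
      default:   lzd = 3'd6;
    endcase
  end
endmodule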
The second part consists of a left shifter which is implemented in the RTL code as an eight-to-one multiplexer configured to shift the mantissa in steps of 4 bits. Finally, this stage also contains some logic to set the result to Inf if necessary.

TABLE IV. AREAS FOR DIFFERENT PARTS OF THE ADDER IN A KINTEX-7 (XC7K70T-1-FBG484)

Part           | Slice LUTs (LUT6) | Slice Registers
Compare        |        18         |       73
Swap           |        48         |       68
Align          |        35         |       70
Mantissa adder |        32         |       40
Normalization  |        78         |       73
Round          |        50         |        0
Total          |       261         |      324
F. Rounding and Post-Processing

This part is responsible for post-processing the result in such a way that it is numerically identical to what an IEEE-754 compliant unit would produce in round-to-nearest-even mode. This is also the part that differs the most when compared to the high-radix floating-point adder proposed in [2].

Unlike in the rounding module of a normal IEEE-754 based adder, rounding can only be done correctly if the positions of the guard, round, and sticky bits are dynamic, depending on the content of the four most significant bits, as shown in Table III. In addition, a post-processing stage is required to ensure that an appropriate number of the least significant bits are set to zero, depending on where the most significant bit is set, according to Table III. There are thus 24 consecutive bits that can be either one or zero for a given configuration of the MSB bits, regardless of where the most significant one is located, which corresponds to the number of mantissa bits present in a single-precision IEEE-754 formatted value. The remaining bits are set to zero.
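The following behavioral sketch illustrates the principle of the dynamic guard-bit position. The width of 30 bits (27 result bits plus three bits kept from the alignment shift) and all names are assumptions made for this example; the real RTL instead uses precomputed rounding vectors and masks as indicated in Fig. 2.

module round_rne_sketch(input  wire [29:0] m,        // normalized mantissa
                        output wire [29:0] result);
  // Guard-bit position from the leading nibble, per Table III.
  reg [2:0] gpos;
  always @* begin
    casez (m[29:26])
      4'b1???: gpos = 3'd5;
      4'b01??: gpos = 3'd4;
      4'b001?: gpos = 3'd3;
      default: gpos = 3'd2;  // 0001: m is normalized, so a one is here
    endcase
  end

  wire g   = m[gpos];                               // guard bit
  wire r   = m[gpos-1];                             // round bit
  wire s   = |(m & ((30'd1 << (gpos-1)) - 30'd1));  // sticky ('.') bits
  wire lsb = m[gpos+1];                             // ULP of the kept bits
  wire inc = g & (r | s | lsb);                     // round to nearest even

  // Force G, R, S, and '.' to zero, then add one ULP when rounding up.
  // (An all-ones mantissa overflows into a new leading nibble; the real
  // design detects this case with the reused LUTs mentioned in III-E.)
  wire [29:0] kept = m & ~((30'd1 << (gpos+1)) - 30'd1);
  assign result = kept + (inc ? (30'd1 << (gpos+1)) : 30'd0);
endmodule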
Since the floating-point adder has an extra exponent bit, all subnormal numbers that are possible in IEEE-754 can be represented as normalized numbers in the HRFP16 format. Note
that it is not possible to add or subtract two IEEE-754 style
values and get a result with a non-zero magnitude which is
smaller than the smallest subnormal number. Thus, as long
as only values that would be legal in IEEE-754 are sent into
the adder, no special hardware is needed to handle the values
that in IEEE-754 would be represented by using subnormal
numbers. (This also extends to rounding/post-processing, since
the guard, round, and sticky bit are guaranteed to be zero if a
result corresponds to an IEEE-754 subnormal number.)
G. Area and Frequency

The total area of the adder when synthesized to a Kintex-7 FPGA is 261 slice LUTs, of which 109 use both the O5 and O6 outputs. The areas of the different parts of the adder, as reported in the map report file, are shown in Table IV. No route-thrus were reported for this design. The adder can operate at a frequency of 319 MHz in the slowest Kintex-7 speed grade.
IV. RADIX-16 FLOATING-POINT MULTIPLIER
The multiplier contains six pipeline registers and is divided
into three different parts. The first part uses an integer multiplier to multiply the two mantissas. The second performs
normalization and the third part post-processes the number and rounds it to ensure that the result is a value which can be represented using the single-precision IEEE-754 format.

Fig. 3. High Level Schematic of the Multiplier. The 27 × 27 bit pipelined mantissa multiplier contains two cascaded DSP48E1 blocks and one LUT-based adder; as in Fig. 2, the original figure marks manually instantiated modules and characteristic differences compared to a traditional IEEE-754 compatible operator. Sign and NaN/Inf generation are not shown.
A. Mantissa Multiplier Stage
The HRFP16 format has a 27 bit mantissa. This means that
a 27 × 27 bit multiplier is required for the multiplier stage. In
older Xilinx FPGAs where the DSP blocks were based on
18 × 18 bit multipliers this was not a big issue as four DSP
blocks would be required for a 24 × 24 bit multiplier as well as for a 27 × 27 bit multiplier. Unfortunately, the DSP blocks in the 7-series FPGAs are optimized for single-precision IEEE-754 multiplication: two DSP48E1 blocks can be combined to form a 24 × 24 bit unsigned multiplier, but not a 27 × 27 bit multiplier.

TABLE V. THE IMPLEMENTATION OF THE 27 × 27 BIT MULTIPLIER

A[2:0] | Operation in cascaded DSP48E1 blocks | Operation in slices
 000   | A×B = 8 × A[26:3] × B       + (0  + 0)
 001   | A×B = 8 × A[26:3] × B       + (B  + 0)
 010   | A×B = 8 × A[26:3] × B       + (0  + 2B)
 011   | A×B = 8 × A[26:3] × B       + (B  + 2B)
 100   | A×B = 8 × (A[26:3] + 1) × B − (2B + 2B)
 101   | A×B = 8 × (A[26:3] + 1) × B − (B  + 2B)
 110   | A×B = 8 × (A[26:3] + 1) × B − (0  + 2B)
 111   | A×B = 8 × (A[26:3] + 1) × B − (B  + 0)
While further DSP blocks could be added to increase the width of the multiplication to 27 bits, this seems unnecessary for a bit-width increase of only 3 bits. Fortunately, it is possible to create a slightly restricted 27 × 27 bit multiplier suitable for use in HRFP16 by using two DSP48E1 blocks and only 31 LUT6 components.

This can be implemented by first cascading two DSP48E1 components in order to provide a 27 × 24 bit unsigned multiplier. The remaining 27 × 3 bit multiplication is implemented using a combination of slice-based logic and the pre-adder in the DSP48E1 blocks, as shown in the pseudo code in Table V. The key insight is that slice area can be saved by handling only five of the eight possible values of the smaller factor in the slice-based adder. The remaining three values are handled by using the pre-adder to increment the A operand by one and signalling the DSP48E1 blocks to subtract the value from the slice-based adder. In this way the 27 × 3 bit multiplier consumes a very small number of LUTs, making this an attractive alternative to instantiating twice as many DSP48E1 blocks.

It is important to note that the resulting multiplier is not a general purpose 27 × 27 bit multiplier, as the pre-adder will cause an overflow if all bits of A[26:3] are one. Fortunately this cannot occur when the HRFP16 format is used, as A[26] can never be one at the same time as A[2] (see Table III).
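The decomposition of Table V can be captured in a short behavioral sketch (names invented; p_dsp corresponds to the cascaded DSP48E1 blocks with the pre-adder, p_slice to the slice-based adder):

module mul27x27_sketch(input  wire [26:0] a,
                       input  wire [26:0] b,
                       output wire [53:0] p);
  // Pre-adder: A[26:3]+1 is used for the four subtractive rows of Table V.
  // As noted above, this cannot overflow for HRFP16 inputs.
  wire [23:0] a_hi  = a[26:3] + {23'd0, a[2]};
  wire [50:0] p_dsp = a_hi * b;          // 24x27: two cascaded DSP48E1s

  // Slice-based term: only 0, B, 2B, 3B, and 4B ever occur.
  reg [28:0] p_slice;
  always @* begin
    case (a[2:0])
      3'b000:  p_slice = 29'd0;
      3'b001:  p_slice = {2'b00, b};                    //  B
      3'b010:  p_slice = {1'b0, b, 1'b0};               // 2B
      3'b011:  p_slice = {2'b00, b} + {1'b0, b, 1'b0};  // 3B
      3'b100:  p_slice = {b, 2'b00};                    // 4B
      3'b101:  p_slice = {2'b00, b} + {1'b0, b, 1'b0};  // 3B
      3'b110:  p_slice = {1'b0, b, 1'b0};               // 2B
      default: p_slice = {2'b00, b};                    //  B (111)
    endcase
  end

  // A = 8*A[26:3] + A[2:0], so subtract whenever the pre-adder added one.
  assign p = a[2] ? ({p_dsp, 3'b000} - {25'd0, p_slice})
                  : ({p_dsp, 3'b000} + {25'd0, p_slice});
endmodule

A quick check of one subtractive row: for A[2:0] = 101, 8 × (A[26:3] + 1) × B − 3B = 8 × A[26:3] × B + 5B, as required.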
B. Normalization

As long as subnormal number support is not required, the normalization stage in a radix-2 based floating-point multiplier is trivial if implemented correctly: a two-to-one multiplexer shifts the mantissa at most one step, depending on whether the MSB of the result is one or zero. However, if subnormal numbers are supported at the inputs of the multiplier, the normalizer must be able to shift across the entire width of the result. Such a normalization unit is even more costly than the normalization unit found in a floating-point adder. A better alternative is to add support for normalizing one of the inputs to the multiplier before any multiplication is performed. (There is no need to normalize both inputs, as the product of two subnormal numbers will be zero.)
However, in the HRFP16 format, IEEE-754 style subnormal numbers are instead emulated by using an extended exponent range.
TABLE VI. GUARD, ROUND, AND STICKY BIT HANDLING IN THE MULTIPLIER

           Exponent | Mantissa before post-processing and rounding
Normal       ≥ 64   | 1xxx xxxx xxxx xxxx xxxx xxxx GRS. ..
numbers      ≥ 64   | 01xx xxxx xxxx xxxx xxxx xxxx xGRS ..
             > 64   | 001x xxxx xxxx xxxx xxxx xxxx xxGR S.
             > 64   | 0001 xxxx xxxx xxxx xxxx xxxx xxxG RS
Emulated       64   | 00xx xxxx xxxx xxxx xxxx xxxx UGRS ..
IEEE-754       63   | xxxx xxxx xxxx xxxx xxxx UGRS .... ..
style          62   | xxxx xxxx xxxx xxxx UGRS .... .... ..
subnormal      61   | xxxx xxxx xxxx UGRS .... .... .... ..
numbers        60   | xxxx xxxx UGRS .... .... .... .... ..
               59   | xxxx UGRS .... .... .... .... .... ..
               58   | UGRS .... .... .... .... .... .... ..
             ≤ 57   | Flushed to zero

Key: x: these bits are allowed to be either zero or one in the final rounded mantissa, U: this bit corresponds to the ULP of an IEEE-754 subnormal number, G: Guard bit, R: Round bit, S: Sticky bit, .: bits that contribute to the sticky bit. G, R, S, and . are set to 0 in the final mantissa.
The normalization stage of the multiplier can therefore be reduced to a single two-to-one multiplexer. The only difference compared to a normal floating-point multiplier is that the select signal of the multiplexer is controlled by ORing together the four MSBs of the result.
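In Verilog, this normalization then collapses to a sketch like the following (names assumed; the 54-bit input is the raw product of two 27-bit mantissas, which always has its leading one within the top two nibbles):

module norm16_sketch(input  wire [53:0] prod,
                     output wire [53:0] norm,
                     output wire        dec_exp);  // adjust exponent by one
  assign dec_exp = ~|prod[53:50];          // no one in the leading nibble
  assign norm    = dec_exp ? (prod << 4) : prod;
endmodule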
C. Rounding and Post-Processing
When no subnormal numbers are encountered, this stage works essentially in the same way as the rounding and post-processing stage of the floating-point adder described in Section III-F. In other words, the guard, round, and sticky bits can be located in one of four different places, as seen in the first part of Table III.

However, unlike in the adder, an IEEE-754 style subnormal result has to be handled as a special case here, as there is no guarantee that the guard, round, and sticky bits are zero when a subnormal result is encountered. This is taken into account by adjusting the rounding vector and rounding mask in such a way that the guard bit is located just to the right of the ULP bit of such a subnormal number. This is illustrated in Table VI for the cases where the exponent is between 63 and 58. There is also a special case when the exponent is 64, as subnormal emulation is only required if the two MSBs are 0 in that case.
D. Area and Frequency

The areas of the different parts of the multiplier can be seen in Table VII. The total area of the multiplier is 235 slice LUTs, of which 57 use both the O5 and O6 outputs. 6 LUTs are used as route-thrus. The total number of slice registers is 277. The maximum operating frequency of the multiplier is 305 MHz.

By removing support for subnormal emulation (i.e., the special rounding vectors/masks when the exponent is less than 64), the area can be reduced to 161 slice LUTs and 241 slice registers.

TABLE VII. AREAS FOR DIFFERENT PARTS OF THE MULTIPLIER IN A KINTEX-7 (XC7K70T-1-FBG484)

Part                                          | Slice LUTs | Slice Registers | DSP48E1 blocks
Mantissa multiplier                           |     31     |       57        |       2
Normalization, rounding, and post-processing |    204     |      220        |       0
Total                                         |    235     |      277        |       2

V. VERIFICATION

The correctness of the adder and multiplier has been tested using two different testbenches. Both testbenches read IEEE-754 formatted single-precision values from an external file (actually a named pipe) and convert them into the HRFP16 format. After the addition or multiplication is finished, sanity checks are performed to ensure that the least significant bits of the mantissa are masked out correctly (see Table VI) and the result is then converted back to IEEE-754 format.

The first testbench uses the test vectors in the IEEE test suite generated by FPGen [5] (available for download at https://www.research.ibm.com/haifa/projects/verification/fpgen/ieeets.html). These test vectors have proved very valuable as a quick way of verifying the correctness of small changes to the floating-point modules, as the coverage is nearly perfect.

The other testbench has been developed from scratch using more of a brute-force approach. This testbench tests all possible combinations of the sign and exponent for both operands while testing a number of different values of the mantissa. It is not feasible to loop through all possible mantissa combinations, however. Instead, the mantissae of the two operands are set to all possible combinations of the values in the following list:

• All but one bit set to zero
• All but one bit set to one
• All values from 0x7fff00 to 0x7fffff
• All values from 0x000000 to 0x000100
• One hundred random values

To verify the correct behavior of this testbench, the SoftFloat package [6] is used as a golden model.

In addition, a few other tests are run in this testbench, including testing for overflow during rounding, random small numbers (including subnormals), and random large numbers. However, as the run time of these tests is measured in hours rather than seconds, they are only run after major changes have been undertaken.
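The overall structure of such a testbench might look like the sketch below. The file name, vector format, and checking details are all invented for illustration; the released testbenches instead read from a named pipe fed by a SoftFloat-based generator.

module tb_sketch;
  integer f, n;
  reg [31:0] a_ieee, b_ieee, ref_ieee;
  // ... clock generation, IEEE-754 <-> HRFP16 converters, and the DUT
  // instantiation are omitted here ...
  initial begin
    f = $fopen("vectors.txt", "r");
    while (!$feof(f)) begin
      n = $fscanf(f, "%h %h %h\n", a_ieee, b_ieee, ref_ieee);
      // Drive the operands through the pipeline, convert the result back
      // to IEEE-754, and compare it bit-exactly against the reference:
      // if (result_ieee !== ref_ieee) $display("Mismatch for %h %h", ...);
    end
    $fclose(f);
    $finish;
  end
endmodule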
VI. RESULTS

A comparison with a floating-point adder and multiplier generated by Coregen with the same number of pipeline stages as the HRFP16 adder and multiplier is shown in Table VIII. To arrive at a fair value for the fmax column, a wrapper was created with pipeline registers before and after the device under test to ensure that I/O related delays did not have any impact on timing closure. Furthermore, a number of different timing constraints were tried in order to arrive at a near-optimal fmax value. Note that the slice count is not included in the table, since this value is largely dependent on whether the design is area constrained or not. (The author has seen this value differ by up to 50% for the same design when implemented with different area constraints.)
TABLE VIII. AREA/FREQUENCY COMPARISON IN A KINTEX-7 (XC7K70T-1-FBG484)

Unit              | Subnormal support | Slice LUTs | Slice Registers | DSP blocks | fmax
HRFP16 Adder      | Emulated          |    261     |      324        |     0      | 319 MHz
HRFP16 Multiplier | Emulated          |    235     |      277        |     2      | 305 MHz
Adder [4]         | Flush to zero     |    363     |      209        |     0      | 291 MHz
Multiplier [4]    | Flush to zero     |    119     |       76        |     2      | 464 MHz
It is important to note that the adder and multiplier provided by Coregen do not include any support for subnormal numbers, opting instead to flush such numbers to zero. (This seems to be the standard modus operandi for all recent floating-point modules optimized for FPGAs.) A straight comparison of the area and frequency numbers is therefore misleading, especially for the multiplier. However, it is easy to arrive at a rough estimate of the area cost of a normalization module with support for subnormal numbers suitable for use in a floating-point multiplier. What is needed is essentially a priority decoder and a shifter, as shown in the source code in Fig. 4. The area for this module is 80 slice LUTs, suggesting that at least 200 LUTs would be used by the Coregen multiplier if subnormal numbers were supported.

module normalizer(input  wire [23:0] x,
                  output wire [23:0] result,
                  output reg  [4:0]  s);
  always @* begin
    casez(x)
      24'b1???????????????????????: s = 0;
      24'b01??????????????????????: s = 1;
      24'b001?????????????????????: s = 2;
      24'b0001????????????????????: s = 3;
      // ....
      24'b000000000000000000000010: s = 22;
      24'b000000000000000000000001: s = 23;
      24'b000000000000000000000000: s = 24;
    endcase
  end

  assign result = x << s;
endmodule

Fig. 4. Verilog Source Code for a Priority Decoder and a Shifter suitable for normalizing a 24-bit mantissa
VII. FUTURE WORK

There are three obvious directions for future research in this area. First of all, it would be very interesting to implement a double-precision version of the HRFP16 format. This would be of particularly high interest for people who want to accelerate HPC algorithms with FPGAs. Similarly, a half-precision format could be of interest for computer graphics algorithms on FPGAs. Secondly, support for other operators would be interesting to look into as well. For signal processing purposes, the most important such module is either a floating-point accumulator or a floating-point multiply-and-accumulate unit. Thirdly, the code could be optimized further, especially in terms of adding support for a configurable number of pipeline registers. The code could also be optimized for FPGAs besides the Kintex-7.

Another interesting research direction is whether it would make sense to add full, or partial, support for the HRFP16 (or a similar) format into the existing DSP blocks. For example, if one (or possibly two cascaded) DSP blocks contained support for a high-radix floating-point multiplier supporting only the round-to-zero rounding mode and no subnormal numbers, this would probably be enough for most FPGA developers. However, if a developer actually requires full support for IEEE-754, a round and post-process stage similar to the ones described in this paper could be implemented using regular slice-based logic adjacent to the DSP block. This would allow such users to gain at least a partial benefit from the hardwired floating-point units while at the same time using a floating-point format more convenient for FPGA developers. This has been partially investigated in [7], but at the time that publication was written, the idea outlined in [3] had not occurred to us.
VIII. SOURCE CODE AVAILABILITY
Those who are interested in using, extending, or making comparisons with the work presented in this paper are welcome to download the source code of the floating-point adder, multiplier, and testbenches at http://users.isy.liu.se/da/ehliar/hrfp/, which is released under the terms of the X11 license. This license should enable convenient use in both commercial and non-commercial settings.
IX. RELATED WORK
Floating-point operators have been of interest to the FPGA community for a long time. However, only a few publications discuss the use of a high radix. One of the first such publications is [2], which shows the area advantage of floating-point adders and multipliers that use a high-radix floating-point format with no IEEE-754 compatibility. Another publication is [8], which shows that it is feasible to build a floating-point multiply-and-accumulate unit with single-cycle accumulation if a very high radix (2^32) is used for the floating-point format of the accumulator.
Regarding the traditional floating-point operators such as addition, subtraction, and multiplication, it would be remiss not to mention the work done by the FPGA vendors themselves, where [4] and [9] are two typical examples. The focus here seems to be on supplying robust floating-point operators with near-full IEEE-754 compliance; the main exception is that subnormal numbers are not implemented. Unfortunately the internal implementation details are not discussed in these publications, but the tables with area and frequency numbers are very useful for establishing a baseline which other reported results should improve upon.
Another well known resource for floating-point arithmetic in FPGAs is FloPoCo [10]. This project concentrates on floating-point operators that are typically not implemented in mainstream processors, such as transcendental functions and non-standard number formats. The basic philosophy of the project is that an FPGA based application should not limit itself to the floating-point operators that are common in microprocessors (e.g., adders/subtracters, multipliers, and to a lesser extent, division and square root).
One of the most interesting approaches to floating-point arithmetic in FPGAs is described in [11] and [12], where a tool has been developed that takes a dataflow graph of floating-point operations and generates optimized hardware for that particular datapath. In essence, a fused datapath is created where complete knowledge of the input and output dependencies of each floating-point operator is taken into account to reduce the hardware cost. For example, if an addition is followed by another addition, some parts of the normalization/denormalization steps may be omitted. In the case of a 48-element dot product, the datapath compiler showed an impressive reduction in area from 30144 ALUTs and 36000 registers to 12849 ALUTs and 20489 registers [11]. In another case a Cholesky decomposition was reduced from 9898 ALUTs and 9100 registers down to 5299 ALUTs and 6573 registers [12]. However, a drawback of this work is that the arithmetic operations are not fully compliant with IEEE-754 (although in practice, a slight increase in the accuracy of the result can be expected [13]).
A key advantage of floating-point units in FPGAs is that the precision of the unit can be adjusted to match the precision requirements of the application. For example, [14] shows that a double-precision multiplier can be reduced from nine DSP48 blocks down to six DSP48 blocks if a worst-case error of 1 ulp is acceptable instead of the 0.5 ulp required by IEEE-754.
Finally, it should also be noted that Altera recently announced that the Stratix-10 and Arria-10 FPGAs will have DSP blocks with built-in support for single-precision IEEE-754 floating-point addition and multiplication. These hard blocks are a welcome addition to any FPGA system with high floating-point computation requirements. To reduce the cost of these floating-point operators slightly, no support for subnormal numbers is present.
X. CONCLUSIONS
In this paper we have demonstrated that a custom floating-point format based on radix 16 can be beneficial in an FPGA when support for IEEE-754 style subnormal numbers is required. The combinational area of the high-radix floating-point adder is shown to be around 30% smaller than the combinational area of a floating-point adder generated by Coregen, even though subnormal numbers are handled by flushing to zero in the latter case.

The multiplier, on the other hand, has a combinational area which is almost twice as large as that of the multiplier generated by Coregen, although it is likely that the Coregen multiplier would be of comparable size if support for subnormals were added to it. If subnormal numbers are not required, however, the HRFP16 operators presented in this paper can still be useful in situations where relatively few floating-point multipliers are required compared to the number of floating-point adders.
ACKNOWLEDGMENTS
Thanks to the FPGen team at IBM for clearing up a misunderstanding on my side regarding exception handling in IEEE-754. Thanks also to Madeleine Englund for many interesting discussions regarding high-radix floating-point arithmetic in FPGAs.
R EFERENCES
[1] D. Monniaux, "The pitfalls of verifying floating-point computations," ACM Trans. Program. Lang. Syst., vol. 30, no. 3, pp. 12:1–12:41, May 2008. [Online]. Available: http://doi.acm.org/10.1145/1353445.1353446
[2] B. Catanzaro and B. Nelson, "Higher radix floating-point representations for FPGA-based arithmetic," in Field-Programmable Custom Computing Machines, 2005. FCCM 2005. 13th Annual IEEE Symposium on. IEEE, 2005, pp. 161–170. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1508536
[3] P.-M. Seidel, "High-radix implementation of IEEE floating-point addition," in Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposium on. IEEE, 2005, pp. 99–106. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1467628
[4] Xilinx, "PG060: LogiCORE IP floating-point operator v6.2," 2012. [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/floating_point/v6_2/pg060-floating-point.pdf
[5] M. Aharoni, S. Asaf, L. Fournier, A. Koifman, and R. Nagel, "FPgen - a test generation framework for datapath floating-point verification," in Proc. IEEE International High Level Design Validation and Test Workshop 2003 (HLDVT'03), 2003, pp. 17–22.
[6] J. R. Hauser, "SoftFloat release 2b." [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html
[7] M. Englund, "Hybrid floating-point units in FPGAs," Master's thesis, Linköping University, 2012.
[8] A. Paidimarri, A. Cevrero, P. Brisk, and P. Ienne, "FPGA implementation of a single-precision floating-point multiply-accumulator with single-cycle accumulation," in Field Programmable Custom Computing Machines, 2009. FCCM'09. 17th IEEE Symposium on. IEEE, 2009, pp. 267–270. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5290905
[9] Altera, "UG01058-6.0: Floating-point megafunctions," 2011. [Online]. Available: http://www.altera.com/literature/ug/ug_altfp_mfug.pdf
[10] F. de Dinechin and B. Pasca, "Custom arithmetic datapath design for FPGAs using the FloPoCo core generator," Design & Test of Computers, IEEE, vol. 28, no. 4, pp. 18–27, 2011.
[11] M. Langhammer, "Floating point datapath synthesis for FPGAs," in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on. IEEE, 2008, pp. 355–360. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4629963
[12] M. Langhammer and T. VanCourt, "FPGA floating point datapath compiler," in Field Programmable Custom Computing Machines, 2009. FCCM'09. 17th IEEE Symposium on. IEEE, 2009, pp. 259–262. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5290908
[13] Berkeley Design Technology, Inc., "An independent analysis of floating-point DSP design flow and performance on Altera 28-nm FPGAs," 2012. [Online]. Available: http://www.altera.com/literature/wp/wp-01187-bdti-altera-fp-dsp-design-flow.pdf
[14] M. K. Jaiswal and N. Chandrachoodan, "Efficient implementation of IEEE double precision floating-point multiplier on FPGA," in Industrial and Information Systems, 2008. ICIIS 2008. IEEE Region 10 and the Third International Conference on. IEEE, 2008, pp. 1–4. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4798393