Signal Processing on Intel® Architecture: Performance Analysis using Intel® Performance Primitives
White Paper
Intel® Advanced Vector Extensions (Intel® AVX)
Signal Processing
Engineers can quickly determine whether Intel® processor-based platforms with Intel® Advanced Vector Extensions (Intel® AVX) satisfy signal processing requirements
Embedded Computing
Signal processing functions have often required special-purpose hardware such as
DSPs and FPGAs. However, recent enhancements to Intel® architecture processors are
providing developers an alternative: execute signal processing workloads on an Intel®
processor.
Signal processing on the latest Intel processors is now a viable option due to continued
improvements in multi-core architectures. The increased parallelism from vector
instructions, along with other continuing performance improvements, enables the
efficient execution of data parallel workloads such as digital transforms and filters.
Additionally, by consolidating signal processing functions with other workloads on a
multi-core Intel processor, it is possible to save hardware cost, simplify the application
development environment and reduce time to market. This approach can be applied to
many applications in aerospace (radar, sonar), communications infrastructure (baseband
processing, transcoding) and healthcare (medical imaging).
Umberto Santoni, Platform Architect, Embedded Communications Group
Thomas Long, Software Engineer, Embedded Communications Group
This paper describes an easy process that allows developers to quickly determine
how fast the 2nd generation Intel® Core™ i7-2710QE processor will execute their signal
processing algorithms, based on performance data¹ that is relatively easy to obtain.
Developers can complete the process in a straightforward manner, as demonstrated
with two simple examples in this paper: fast convolution and amplitude demodulation.
The paper concludes by reviewing some of the development tools available to
developers to conduct their own evaluations.
Table of Contents
Why Intel® Architecture for Signal Processing
SIMD Instructions Enhanced By Intel® Advanced Vector Extensions (Intel® AVX)
The Process for Evaluating Signal Processing Performance
Signal Processing Performance Data
  Overview of benchmark data
  A) Forward and inverse Fast Fourier Transform (FFT)
  B) 2D Complex to Complex FFT Throughput (GFLOP/s and absolute time)
  C) Filter Execution Times
  D) Discrete Hilbert Transform
  E) Discrete Cosine Transform
Speedup with Intel® Advanced Vector Extensions (Intel® AVX)
Two Signal Processing Workload Examples
  Example 1: Fast Convolution using FFT
  Example 2: Discrete Envelope Detection / Amplitude Demodulation
Floating Point Speeds Development
Development Tools Overview
  Intel® C++ Compiler
  Intel® Math Kernel Library (Intel® MKL)
  Intel® Integrated Performance Primitives (Intel® IPP)
  Intel® VTune™ Performance Analyzer
  Intel® Application Debugger
  Eclipse*-based Integrated Development Environment
Consider Intel® Architecture Processors for Signal Processing
Appendix A: Test Configuration
Acronyms
Why Intel® Architecture for Signal Processing
There is a natural tendency to assume that just about any signal
processing application requires a DSP, FPGA or ASP. That’s because traditionally, it was
necessary to use specialized hardware in order to satisfy
performance objectives. However, for hybrid designs utilizing a mix of
specialized signal processing algorithms and a broader set of applications,
implementing two separate computing architectures may pose some
significant disadvantages, such as:
• Hardware and board space for two computing systems: higher product cost
• Multiple tool chains: additional technical training and project management complexity
• Multiple code bases: larger software management effort
• Power consumption for two computing systems: more expensive thermal design
• Intersystem communication: greater design complexity or possibility for bottlenecks
• Time to market: extra time needed to design, validate and integrate subsystems
• Two development teams: unique communication challenges (e.g., silos)
One alternative is to perform signal processing workloads on an existing
Intel architecture processor in the system. Workload consolidation is a
powerful concept that has been delivering significant payoffs in datacenters
with respect to reduced server cost, power consumption and
footprint. This is made possible by multi-core processors with scalable,
efficient performance, coupled with significant memory and I/O bandwidth.
This consolidation approach can be equally powerful in embedded
systems, addressing issues around cost, software complexity, power, time
and communication.
Often, performance efficiency is foremost on the minds of embedded
system developers when running signal processing workloads. This is
discussed in the next section, which presents performance data for key
signal processing kernels running on 2nd generation Intel® Core™ i7 processors.
Yet, for most embedded applications, raw performance isn’t the only
factor; it is also necessary to meet overall system cost goals, and the
highly scalable family of embedded Intel architecture processors helps to
do just that.
The embedded Intel processor roadmap gives developers a wide choice
with respect to the number of cores, cache and system memory size,
I/O and footprint. In addition, there are many other technologies available
for enhancing system capabilities, like virtualization technology,
remote management and various security features. Nevertheless, it is the
enhanced vector single-instruction, multiple-data (SIMD) instructions that
open the door to using Intel architecture processors for signal processing.
More specifically, the Intel® Advanced Vector Extensions (Intel® AVX),
available for the first time with 2nd generation Intel Core i7 processors,
provide significantly improved floating point performance (see sidebar).
Engineers who code floating point algorithms for Intel architecture
processors can leverage a mature software ecosystem that offers a
very wide breadth and depth of development tools. Also available are
Intel development tools and libraries that employ Intel AVX and Intel®
Streaming SIMD Extensions 4 (Intel® SSE4) instructions. Equipment
manufacturers can choose from many hardware vendors supplying
commercial off-the-shelf (COTS) embedded boards and systems that
support embedded lifecycles and benefit from the economics of the
PC/server supply chain.
SIMD Instructions Enhanced By Intel® Advanced Vector Extensions (Intel® AVX)
Many signal processing applications are highly parallel, performing
the same arithmetic operation on large sets of numbers. To speed
up these workloads, single-instruction, multiple-data (SIMD)
instructions were introduced in the mid-1990s. They
perform the same operation on multiple data elements simultaneously, as illustrated below.
SIMD
     2    3    5   11   20
+    9   11    2    1    5
=   11   14    7   12   25
The throughput of a SIMD instruction depends on register
size: a wider register holds more data elements, so each instruction processes more of them.
With the introduction of 2nd generation Intel® Core™ i7 processors,
the size of the 16 registers available for floating point operations doubles, increasing from 128 bits to 256 bits. Additionally,
new three- and four-operand instructions specify a separate destination
operand, which results in fewer register copies, better register
usage, faster execution and smaller code size. These are just
some of the recent architectural enhancements, called Intel®
Advanced Vector Extensions (Intel® AVX).
(Figure: register width doubles from 128 bits with Intel® SSE4 to 256 bits with Intel® AVX, shown for registers XMM0 through XMM15.)
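The element-wise addition illustrated above, and the effect of register width on the number of elements handled per instruction, can be sketched in a few lines of Python. This is only a model of the idea, not vector code: the function names `simd_add` and `lanes` are hypothetical, and the 32-bit and 64-bit element sizes are the usual single and double precision widths.

```python
def simd_add(a, b):
    """Model of one SIMD add: the same operation applied to every element pair."""
    return [x + y for x, y in zip(a, b)]


def lanes(register_bits, element_bits):
    """Number of data elements that fit in one vector register."""
    return register_bits // element_bits


# The five-element example from the illustration above.
print(simd_add([2, 3, 5, 11, 20], [9, 11, 2, 1, 5]))

# Single precision (32-bit) elements per register at each width.
print("128-bit register:", lanes(128, 32), "single precision elements")
print("256-bit register:", lanes(256, 32), "single precision elements")
```

Doubling the register width doubles the element count per instruction, which is the source of the throughput gain the sidebar describes.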
The Process for Evaluating Signal Processing Performance
There are a number of ways to evaluate the performance of Intel processors, each requiring
a different level of effort: using the Intel® C++ and Intel® Fortran Compilers, calling
optimized performance libraries, or hand-coding optimizations with assembly language and
compiler intrinsics.
The approach presented in this paper strikes a balance between effort
and optimization to quickly achieve good estimates for the performance
of signal processing algorithms. This is done in two ways. The first is to
utilize the Intel® Integrated Performance Primitives (Intel® IPP) library. It
provides a quick way to assess the performance of hundreds of algorithms
and math functions optimized for Intel architecture. Furthermore,
the signal processing portion of the Intel IPP library includes over 250
functions, each often supporting multiple data types and in-place versus
not in-place variants. There are also many image processing calls that are
useful for DSP applications.
The Intel IPP library distribution includes a performance tool (with documentation)
that allows developers to obtain performance metrics for all
the functions in the library. For example, Figure 1 shows a shell script
written to automatically collect performance
data on the number of clocks, execution time, and in some cases MFLOPs,
for a function. Figure 2 shows an .ini file used in conjunction with the
shell script to report 2D FFT performance for specific input sizes.
#!/bin/bash
# Load the Intel tools environment (the script relies on it for IPPROOT).
source /opt/intel/composerxe/bin/compilervars.sh intel64
DATE=$(date +%Y-%m-%d)
OUTPUT_DIR=results_avx_${DATE}
OUTPUT_PATH=${PWD}/${OUTPUT_DIR}
PERF_TOOL_DIR=${IPPROOT}/tools/intel64/perfsys
FILE_EXT_1=_lin_avx_1
FILE_EXT_2=_lin_avx_2
mkdir -p "${OUTPUT_DIR}"
# IPPS: two runs of the signal processing performance tool
${PERF_TOOL_DIR}/ps_ipps -r${OUTPUT_PATH}/ipps${FILE_EXT_1}.csv \
    -o${OUTPUT_PATH}/ipps${FILE_EXT_1}.txt -N1 -YHIGH -TAVX -B
${PERF_TOOL_DIR}/ps_ipps -r${OUTPUT_PATH}/ipps${FILE_EXT_2}.csv \
    -o${OUTPUT_PATH}/ipps${FILE_EXT_2}.txt -N1 -YHIGH -TAVX -B
# 2DFFT: two runs of the image processing tool, sizes taken from 2dfft.ini
${PERF_TOOL_DIR}/ps_ippi -r${OUTPUT_PATH}/2dfft${FILE_EXT_1}.csv \
    -o${OUTPUT_PATH}/2dfft${FILE_EXT_1}.txt -N1 -YHIGH -TAVX -B \
    -i${PWD}/2dfft.ini -fippiFFTFwd_CToC_32fc_C
${PERF_TOOL_DIR}/ps_ippi -r${OUTPUT_PATH}/2dfft${FILE_EXT_2}.csv \
    -o${OUTPUT_PATH}/2dfft${FILE_EXT_2}.txt -N1 -YHIGH -TAVX -B \
    -i${PWD}/2dfft.ini -fippiFFTFwd_CToC_32fc_C
Figure 1. Sample shell script running IPP performance tool
[Perf System]
FFT_OrderXY=4x4; 5x5; 6x5; 6x6; 7x4; 7x5; 7x7; 8x3; 8x4; 8x6;
8x7; 8x8; 9x3; 9x5; 9x6; 9x8; 9x9;10x4; 10x5; 10x7; 10x8;
10x10; 11x3; 11x4; 11x6; 11x7; 11x11; 11x12; 11x13; 11x14;
11x15;12x3; 12x5; 12x6; 12x12; 13x4; 13x5; 13x13; 14x3; 14x4;
15x3; 15x4; 16x4; 17x3; 17x4;
Figure 2. Sample .ini file generating 2D FFT performance data
Another way to estimate signal processing performance is to focus the
performance assessment on key kernels that are often used in signal
processing workloads. By selecting a subset of the signal processing
functions, developers can produce a manageable set of data that
contains the most relevant functions and provides a reference with
which to estimate the performance of other functions. For instance,
this can be done by choosing forward and inverse FFTs of various sizes
– both complex and real – along with FIR and IIR filters of varying
complexities, and other useful functions such as discrete cosine and
Hilbert transforms. Developers may certainly need data on functions
other than the ones covered in this paper; in those cases, the Intel IPP
performance tool can be used to gather the necessary data.
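When a size of interest falls between two measured sizes, one option is to interpolate between the neighbouring data points (the examples later in this paper use interpolation where measured data is not available). The sketch below assumes interpolation on a log-log scale, which is a choice made here and not something the paper specifies; the function name `loglog_interpolate` is hypothetical. The two data points are the 8 tap FIR times for 512 and 2K inputs from Table 2.

```python
import math


def loglog_interpolate(n, n0, t0, n1, t1):
    """Estimate the time at size n from measurements (n0, t0) and (n1, t1).

    Interpolates linearly in log(size) versus log(time), an assumption that
    suits kernels whose run time grows roughly as a power of the input size.
    """
    fraction = math.log(n / n0) / math.log(n1 / n0)
    return t0 * (t1 / t0) ** fraction


# 8 tap FIR, Table 2: 1.8 microseconds at 512 inputs, 7.0 microseconds at 2K inputs.
estimate_us = loglog_interpolate(1024, 512, 1.8, 2048, 7.0)
print(f"Estimated 8 tap FIR time for 1K inputs: {estimate_us:.2f} microseconds")
```

An estimate obtained this way is only a starting point; the Intel IPP performance tool can measure the missing size directly when more confidence is needed.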
In summary, this process for evaluating the signal processing performance
of Intel architecture utilizes Intel-collected performance data and gives
developers a straightforward method to quickly estimate performance
for their own workloads. Although the data provided is for the 2nd generation
Intel Core i7-2710QE processor, the methods described here are extensible
to the full range of Intel processors. With a manageable effort, this
process gives developers a quick readout of the signal processing performance of next generation Intel processors and provides an estimate of
how much general-purpose computing headroom is available for other
applications. The next section reviews the performance data collected
and demonstrates the process using two examples.
It is important to note that although using Intel IPP to assess the signal
processing performance of Intel processors provides a good starting point
that balances effort and optimization, it need not be the endpoint. Going
beyond Intel IPP, it may be possible to capture significant performance
improvements for specific algorithms through the use of compiler optimizations, primitives and assembly language programming. The flexibility of
Intel architecture and its supporting software infrastructure gives
developers all of these degrees of freedom, so they can ultimately identify the most appropriate tradeoff between performance and effort.
Signal Processing Performance Data
The following lists a sample of the signal processing performance data¹,²
collected by Intel on 2nd generation Intel Core i7-2710QE processors. The algorithms
were run on a single execution thread, on Linux* (Fedora* 13 distribution),
and repeated until the results of iterations were accurate to within 5 percent.
Developers can create results for their own algorithms and
functions of interest using an Intel® compiler and the Intel IPP package,
which includes sample code and the Intel IPP performance test tool. More
details about the test configuration are provided in Appendix A.
This single threaded execution environment provides an approximation of
the performance of a single core and an indication of available computing
headroom. Processor cores not used for signal processing can be targeted
at other algorithms or applications running on the system. Intel® development
tools and libraries also provide extensive support for development,
validation and performance tuning of multi-threaded applications.
Overview of benchmark data
A. Forward and inverse Fast Fourier Transform (FFT)
Format: Complex-to-Complex, Real-to-Complex Conjugate Symmetric
Input range: 64B to 1024KB
Data type: Single and double precision floating point
Purpose: Decompose a discrete sequence into a set of frequencies.
Complex-to-complex FFT performance of the IPP library is in the
range of 16.1 – 23.7 single precision GFLOP/sec for sizes between
64B and 4KB. Similarly, the performance of the real-to-complex conjugate
symmetric FFT is in the range of 15 – 17.3 single precision GFLOP/sec for
sizes between 64B and 4KB.
For larger sizes, FFT performance declines as the execution of the
algorithm becomes less compute-bound and more memory-bound.
Additionally, the FFT throughput of double precision floating point
is approximately half that of single precision, and it scales with input
size similarly to single precision.
Complex-To-Complex FFT
(Chart: throughput in GFLOP/sec and execution time in µSec on a logarithmic scale, plotted against input size. The plotted values are tabulated below.)
Input Size   SP Float (µSec)   DP Float (µSec)   SP Float (GFLOP/s)   DP Float (GFLOP/s)
64           0.08              0.15              23.7                 12.6
128          0.25              0.42              18.3                 10.6
256          0.53              0.94              19.3                 10.9
512          1.14              2.09              20.4                 11.1
1K           2.44              5.16              21.0                 10.0
2K           6.04              13.6              18.7                 8.3
4K           15.3              31.7              16.1                 7.8
8K           38.0              71.6              14.1                 7.5
16K          83.8              171               13.7                 6.7
32K          200               389               12.3                 6.3
64K          464               842               11.3                 6.2
128K         981               1970              11.4                 5.7
256K         2274              4966              10.4                 4.8
512K         5425              11450             9.2                  4.4
1024K        12800             25250             8.2                  4.2
Figure 3. Complex-to-Complex FFT and Inverse FFT
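The relationship between the execution times and the GFLOP/s figures can be checked with a short calculation. The paper does not state which operation count it uses; the sketch below assumes the common convention of 5·N·log2(N) floating point operations for a complex FFT of N points, which reproduces the 1K and 4K single precision entries to within rounding. The function name `fft_gflops` is hypothetical.

```python
import math


def fft_gflops(n_points, time_us):
    """Throughput of an N-point complex FFT, assuming 5*N*log2(N) operations."""
    operations = 5.0 * n_points * math.log2(n_points)
    seconds = time_us * 1e-6
    return operations / seconds / 1e9


# Single precision execution times from Figure 3.
for size, time_us in [(1024, 2.44), (4096, 15.3), (1024 * 1024, 12800.0)]:
    print(f"{size:>8} points: {fft_gflops(size, time_us):.1f} GFLOP/s")
```

The same function can turn a measured time for any other power-of-two size into a throughput figure that is comparable with the table.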
Real-To-CCS FFT
(Chart: throughput in GFLOP/sec and execution time in µSec on a logarithmic scale, plotted against input size. The plotted values are tabulated below.)
Input Size   SP Float (µSec)   DP Float (µSec)   SP Float (GFLOP/s)   DP Float (GFLOP/s)
128          0.14              0.23              16.4                 9.6
256          0.34              0.59              15.0                 8.7
512          0.73              1.24              15.7                 9.3
1K           1.51              2.72              17.0                 9.4
2K           3.26              6.39              17.3                 8.8
4K           7.56              16.3              16.3                 7.6
8K           18.4              37.5              14.5                 7.1
16K          44.7              85.1              12.9                 6.8
32K          98                197               12.6                 6.2
64K          229               447               11.5                 5.9
128K         528               965               10.6                 5.8
256K         1120              2295              10.6                 5.2
512K         2625              5525              9.5                  4.5
1024K        5989              12663             8.8                  4.1
Figure 4. Real-to-CCS (Complex Conjugate Symmetric) FFT
B. 2D Complex to Complex FFT Throughput (GFLOP/s and absolute time)
Format: Complex-to-Complex, 2 dimensional array data
Input range: Various sizes ranging from 64B x 64B to 128KB x 16B
Data type: Single precision floating point
Purpose: Decompose a two dimensional array of discrete
sequences into a set of frequencies.
Two dimensional complex-to-complex FFT performance of the Intel
IPP library is in the range of 12.9 – 17.2 GFLOP/s for sizes ranging from 64B x
64B to 4KB x 64B.
As with the one dimensional FFT case, 2D FFT performance declines
for larger sizes as the FFT becomes more memory-bound. Table 1
contains sample data points of interest, and it is possible to generate
2D FFT throughput data for other sizes using the Intel IPP library.
Input Size    GFLOP/s   Time (µSec)
64 x 64       17.2      14.3
128 x 128     13.6      84.5
256 x 64      14.9      76.9
256 x 256     12.9      407
512 x 64      14.2      173
1K x 256      13.6      1744
2K x 128      13.0      1823
4K x 64       12.9      1834
8K x 32       11.3      2088
16K x 16      10.9      2163
32K x 16      8.5       5893
128K x 16     7.4       29838
Table 1. 2D Complex to Complex FFT Throughput (GFLOP/s)
Speedup with Intel® Advanced Vector Extensions (Intel® AVX)
The floating point performance improvements for 2nd generation
Intel® Core™ i7 processors are the result of architectural enhancements, which enable the processor to:
- Retire one floating point instruction per CPU clock cycle
- Dispatch up to 4 floating point instructions per CPU clock cycle
The improved performance from Intel® Advanced Vector Extensions
(Intel® AVX) is illustrated in Figure 5, which shows the speedup
compared to the prior generation Intel® Streaming SIMD Extensions
4 (Intel® SSE4) instructions. The comparison is for Complex-to-Complex
FFT and Inverse FFT functions, which are averaged
together and charted for both single and double precision floating
point routines. For smaller input sizes, the speedup is over two
times, and for very large inputs, the speedup is around 20 percent.
Intel® AVX Speedup vs. Intel® SSE4 for Complex-to-Complex FFT and Inverse FFT (averaged)
(Chart: speedup factor on a scale from 1 to 2.8, higher is better, plotted against input sizes from 64 to 1024K for single precision (SP) and double precision (DP) floating point.)
Figure 5. Intel® Advanced Vector Extensions (Intel® AVX) Speedup over Intel® Streaming SIMD Extensions 4 (Intel® SSE4)
C. Filter Execution Times
Format: Single precision floating point complex data, complex coefficients.
Finite Impulse Response Filter: 8, 32, and 128 taps
Infinite Impulse Response Filter: Orders ranging from 2 to 12
Inputs: Complex data ranging from 32B – 32KB in size.
Purpose: Suppress unwanted components from a discrete-time series.
Input Size / Execution Time (Microseconds)
Filter         32      128     512     2K      8K      32K
8 Tap FIR      0.2     0.5     1.8     7.0     28.2    113.0
32 Tap FIR     0.6     2.0     4.2     15.3    59.4    244.0
128 Tap FIR    2.1     7.7     6.2     18.6    67.6    276.0
Order 2 IIR    0.2     0.7     2.4     9.7     38.5    156.0
Order 3 IIR    0.3     0.9     3.5     13.7    54.4    221.5
Order 4 IIR    0.4     1.1     3.8     14.9    59.4    239.0
Order 6 IIR    0.6     1.4     4.5     17.4    69.5    277.5
Order 7 IIR    0.7     1.6     5.2     20.3    81.3    330.0
Order 8 IIR    1.1     1.7     5.3     20.4    80.8    324.5
Order 10 IIR   1.1     2.1     6.3     24.2    97.0    387.0
Order 11 IIR   1.2     2.3     7.0     26.6    107.0   427.0
Order 12 IIR   1.2     2.5     7.3     27.5    112.0   445.0
Table 2. Filter Execution Time
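Execution times such as these translate directly into an estimate of computing headroom. The sketch below is a rough illustration under stated assumptions: the filter is assumed to process consecutive blocks of samples, "32K" is taken to mean 32768 samples, and the 2200 kHz sample rate is borrowed from Example 2 later in this paper. The function name `core_utilization` is hypothetical.

```python
def core_utilization(block_time_us, block_samples, sample_rate_hz):
    """Fraction of one core consumed when each block must finish in real time."""
    block_period_us = block_samples / sample_rate_hz * 1e6
    return block_time_us / block_period_us


# 128 tap FIR on a 32K block (276.0 microseconds, Table 2) at a 2200 kHz sample rate.
utilization = core_utilization(276.0, 32768, 2.2e6)
print(f"Core utilization: {utilization:.1%}")
print(f"Remaining headroom on that core: {1 - utilization:.1%}")
```

Under these assumptions the filter occupies under 2 percent of a single core, which gives a first indication of how much capacity remains for other stages and applications.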
D. Discrete Hilbert Transform
Format: Single precision floating point complex data.
Inputs: Complex data ranging from 128B – 32KB in size.
Purpose: Create the analytic representation of a real-valued
discrete signal.
Input Size / Execution Time (Microseconds)
Conversion                      128     512     2K      8K      32K
INT16 to Complex Short FP       0.7     2.7     13.0    74.2    385.3
INT16 to Complex SP FP          0.6     2.4     11.5    64.1    353.6
SP Float to Complex SP Float    0.6     2.3     11.0    62.8    341.0
Table 3. Hilbert Transform Execution Times
E. Discrete Cosine Transform
Format:
Inputs: 128B – 32KB
Purpose: Express a discrete signal as a series of cosine frequencies
that can be used for lossy signal compression.
Input Size / Execution Time (Microseconds)
Transform           128     512     2K      8K      32K
SP Float Forward    0.8     3.7     20.6    111.8   563.8
SP Float Inverse    0.8     3.7     20.6    111.0   567.3
DP Float Forward    0.6     2.3     11.1    60.0    300.5
DP Float Inverse    0.6     2.3     11.0    58.9    301.8
Table 4. Discrete Cosine Transform Execution Times
Two Signal Processing Workload Examples
In this section, two generic signal processing workload examples are
presented, and the performance of the 2nd generation Intel Core i7-2710QE processor
is estimated in two ways, according to the methods described
earlier. The first method is a simple manual approximation that adds
the performance data of underlying functions obtained from the Intel
IPP performance tool. Interpolation is used where measured data is
not available. Though this is a rough approximation, it produces a quick
performance estimate and a directional check of whether the system
performance and headroom are adequate. The second method estimates
the performance of the algorithms by coding them in C++ using Intel IPP
and measuring their performance directly with hardware counters. In the
following examples, the results from both methods are presented, which
provides an indication of how well the manual method approximates
actual measured results. The objective of this exercise is to produce a
reasonable estimate of performance in order to determine where effort
is best applied during optimization. Ultimately, detailed design work may
be needed to complete product development.
Example 1: Fast Convolution using FFT
The following example performs a fast convolution of two discrete
signals, x(n) and y(n), shown in Figure 6. The example is also a frequency
domain FIR filter when one of the input sequences represents the
transfer function of an FIR filter. A sample C-code snippet using Intel
IPP is provided in Figure 7.
Sample C-code snippet for Example 1 using Intel® Integrated Performance Primitives (Intel® IPP):
/* allocate and initialize specification structures */
ippsFFTInitAlloc_C_32fc(&FFTspec1_p, order, IPP_FFT_DIV_FWD_BY_N, ippAlgHintFast);
/* query the work buffer size in bytes and allocate that many bytes */
ippsFFTGetBufSize_C_32fc(FFTspec1_p, &BufSize);
Buf1_p = ippsMalloc_8u(BufSize);
…
/* compute in-place FFTs of input sequences */
ippsFFTFwd_CToC_32fc_I(x_p, FFTspec1_p, Buf1_p);
ippsFFTFwd_CToC_32fc_I(y_p, FFTspec1_p, Buf1_p);
/* perform complex multiplication and inverse FFT */
ippsMul_32fc(x_p, y_p, o_p, veclength);
ippsFFTInv_CToC_32fc_I(o_p, FFTspec1_p, Buf1_p);
…
/* free specification structures */
ippsFFTFree_C_32fc(FFTspec1_p);
ippsFree(Buf1_p);
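(Block diagram: x(n) and y(n) each pass through an FFT; the two spectra are multiplied (X), and an inverse FFT of the product yields o(n).)
Inputs: x(n), y(n): 16KB input size
Output: o(n): 16KB output size
Operations: Single precision floating point in-place FFT,
Complex Multiply, Inverse FFT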
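The sequence of operations listed above (two forward FFTs, a complex multiply and an inverse FFT) can be illustrated without Intel IPP. The following is a minimal sketch in standard-library Python, not the paper's implementation: it uses a simple recursive radix-2 FFT written for clarity, and the names `fft`, `fast_convolve` and `direct_circular_convolve` are hypothetical. Note that this procedure produces a circular convolution; a linear convolution requires zero-padding the inputs first.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fast_convolve(x, y):
    """Circular convolution: FFT both inputs, multiply, inverse FFT, scale by 1/N."""
    n = len(x)
    assert n == len(y) and (n & (n - 1)) == 0, "equal power-of-two lengths required"
    product = [a * b for a, b in zip(fft(x), fft(y))]
    return [value / n for value in fft(product, inverse=True)]


def direct_circular_convolve(x, y):
    """Reference O(N^2) circular convolution used to check the fast version."""
    n = len(x)
    return [sum(x[m] * y[(k - m) % n] for m in range(n)) for k in range(n)]


x = [1, 2, 3, 0]
y = [1, 1, 0, 0]
print([round(v.real, 6) for v in fast_convolve(x, y)])
print(direct_circular_convolve(x, y))
```

The direct version costs on the order of N² operations while the FFT version costs on the order of N·log2(N), which is why the frequency domain approach is attractive for the 16KB inputs considered here.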
          FFT                  iFFT                 Complex Mul          Fast Convolution     Fast Convolution
                                                                         Calculated           Measured
Size      µsec      clocks     µsec      clocks     µsec      clocks     µsec      clocks     µsec      clocks     Delta
64        7.96E-02  1.67E+02   8.02E-02  1.68E+02   5.25E-02  1.10E+02   2.92E-01  6.13E+02   2.99E-01  6.28E+02   -2%
128       2.30E-01  4.83E+02   2.30E-01  4.83E+02   1.05E-01  2.21E+02   7.95E-01  1.67E+03   8.06E-01  1.69E+03   -1%
256       5.04E-01  1.06E+03   4.92E-01  1.03E+03   1.84E-01  3.87E+02   1.68E+00  3.54E+03   1.69E+00  3.55E+03   0%
512       1.07E+00  2.25E+03   1.04E+00  2.18E+03   3.17E-01  6.66E+02   3.50E+00  7.34E+03   3.80E+00  7.99E+03   -8%
1024      2.32E+00  4.87E+03   2.27E+00  4.77E+03   6.12E-01  1.29E+03   7.52E+00  1.58E+04   8.33E+00  1.75E+04   -10%
2048      5.91E+00  1.24E+04   5.93E+00  1.25E+04   1.18E+00  2.48E+03   1.89E+01  3.98E+04   2.11E+01  4.42E+04   -10%
4096      1.49E+01  3.13E+04   1.49E+01  3.13E+04   2.55E+00  5.34E+03   4.72E+01  9.92E+04   5.22E+01  1.10E+05   -10%
8192      3.65E+01  7.67E+04   3.60E+01  7.56E+04   5.46E+00  1.15E+04   1.14E+02  2.40E+05   1.28E+02  2.70E+05   -11%
16384     8.13E+01  1.71E+05   8.19E+01  1.72E+05   1.19E+01  2.50E+04   2.56E+02  5.38E+05   2.96E+02  6.21E+05   -13%
32768     2.00E+02  4.20E+05   2.02E+02  4.24E+05   2.57E+01  5.40E+04   6.28E+02  1.32E+06   7.47E+02  1.57E+06   -16%
65536     4.54E+02  9.53E+05   4.53E+02  9.51E+05   5.14E+01  1.08E+05   1.41E+03  2.97E+06   1.63E+03  3.43E+06   -14%
Table 5. Fast Convolution Execution Times
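The calculated column in Table 5 can be reproduced by hand: the algorithm runs two forward FFTs, one complex multiply and one inverse FFT, so the estimate is the sum of those four component times, and the delta is the estimate's deviation from the measured time. The sketch below applies that arithmetic to three rows of the table; the function names `calculated_time` and `delta` are hypothetical.

```python
def calculated_time(fft_us, ifft_us, mul_us):
    """Manual estimate: two forward FFTs, one complex multiply, one inverse FFT."""
    return 2 * fft_us + mul_us + ifft_us


def delta(calculated_us, measured_us):
    """Relative difference of the estimate with respect to the measured time."""
    return (calculated_us - measured_us) / measured_us


# size: (FFT, iFFT, complex multiply, measured), all in microseconds, from Table 5.
ROWS = {
    64: (7.96e-2, 8.02e-2, 5.25e-2, 2.99e-1),
    1024: (2.32, 2.27, 0.612, 8.33),
    16384: (81.3, 81.9, 11.9, 296.0),
}

for size, (fft_us, ifft_us, mul_us, measured_us) in ROWS.items():
    estimate = calculated_time(fft_us, ifft_us, mul_us)
    print(f"{size:>6}: calculated {estimate:.3g} us, "
          f"measured {measured_us:.3g} us, delta {delta(estimate, measured_us):+.0%}")
```

In every row the estimate falls below the measured time, which suggests the sum of isolated kernel timings should be treated as a lower bound for the combined algorithm.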
Figure 6. Fast Convolution using FFT Example
Figure 7. Sample C-code Snippet for Example 1
Table 5 summarizes Fast Convolution execution times, calculated
and measured, for various data sizes. The calculated times sum the
execution times of individual functions, using times for Intel IPP signal
processing functions obtained from the Intel IPP performance test tool
running in batch mode. The measured times were generated by running
the entire algorithm (Figure 6), which was coded in C++ and used Intel
IPP, and by calculating the elapsed time based on the hardware clock
count. The runtimes were averaged across 10,000 runs.
The calculated times were within 16 percent of the
measured results. Producing the calculated times took a few hours of
unattended run time (no human effort aside from installing the Intel
IPP and running the aforementioned shell script) and less than an hour
of calculating results in a spreadsheet. The measured results took
an engineer familiar with the Intel IPP and C++ programming a couple
of days for coding, debugging and executing the runtime. From this
example, it is clear the simple manual approximation method (i.e., calculated
times) delivers a performance estimate at a fraction of the effort
required for the coding approach, and it provides an early read on where
to invest optimization effort. Still, coding the algorithm using Intel IPP
took significantly less effort than the alternative, which is to manually
optimize custom libraries for each generation of Intel processor.
One possible optimization is to parallelize the algorithm by going
beyond vector instructions and executing the two FFTs on separate threads.
To parallelize the execution further, the threads can be
dispatched to two processor cores and the results combined back to a
single thread for the complex multiply and inverse FFT.
Floating Point Speeds Development
Voice quality is still a major concern for countless wireless, cable
and internet service providers relying on Voice over IP (VoIP) to
deliver telephony services. To ensure quality is on par with the
Public Switched Telephone Network (PSTN), service providers require
an economical and automatic means to continuously
test calls in real time. A commonly used family of standards is
PESQ³ (Perceptual Evaluation of Speech Quality), which defines
a MOS voice quality score that closely correlates to human listening
experience, as shown in Figure 8. The algorithms perform
voice encoding and measurements related to jitter, packet loss,
time-clipping and channel errors. Many of the algorithms are
computationally intensive and use floating point fast Fourier
transforms.
For manufacturers of voice quality test equipment, passing
industry conformance tests demands 32-bit float-like behavior
throughout the application. This is exceptionally difficult to achieve
with integer CPUs or FPGAs without suffering dire performance
consequences. Likewise, fixed point math (add, multiply or divide)
is not an option because of the loss of accuracy every time an
instruction throws out a remainder. With a fixed point integer type,
intermediate results of a multiply or add can grow beyond the range of the fixed
point type, causing overflow and truncation errors. This can make
the result shrink below its fractional component and lead to
incorrect results, like a PESQ score of 3.5 instead of 4.2.
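The two failure modes described above, discarded remainders and overflow, are easy to demonstrate. The sketch below models 16-bit fixed point arithmetic in Python using the Q15 format (15 fractional bits), which is an assumption chosen for illustration and is not named in this paper; the helper names `to_q15`, `from_q15`, `q15_mul` and `wrap16` are hypothetical.

```python
Q15_ONE = 1 << 15  # 32768 represents 1.0 in Q15


def to_q15(value):
    """Convert a real number in [-1, 1) to Q15, truncating the remainder."""
    return int(value * Q15_ONE)


def from_q15(q):
    """Convert a Q15 integer back to a real number."""
    return q / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 values; the right shift throws away the low 15 bits."""
    return (a * b) >> 15


def wrap16(q):
    """Model 16-bit two's-complement wraparound when a result overflows."""
    return ((q + 32768) % 65536) - 32768


# Truncation: the fixed point product drifts away from the floating point one.
a = to_q15(0.3)
product = q15_mul(a, a)
print("float product:", 0.3 * 0.3, " Q15 product:", from_q15(product))

# Overflow: 0.75 + 0.75 does not fit in [-1, 1) and wraps to a negative value.
total = wrap16(to_q15(0.75) + to_q15(0.75))
print("0.75 + 0.75 in wrapped Q15:", from_q15(total))
```

Each individual truncation error is small, but a long chain of multiplies and adds, as in an FFT, accumulates them; floating point hardware avoids both problems without special handling in the application code.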
Example 2: Discrete Envelope Detection / Amplitude
Demodulation
The following example performs a fast convolution of two discrete
signals, x(n) and y(n) shown in Figure 6. The example is also a frequency
domain FIR filter when one of the input sequences represents the
transfer function of an FIR filter. A sample C-code snipet using Intel
IPPs is provided in Figure 7.
Ixia*, a leading supplier of test and measurement equipment,
decided to use multi-core Intel® processors with high performance
floating point units because they could run the PESQ code as-is,
which minimized the migration effort. The processors delivered accurate
PESQ results and proved to have floating point performance on
par with, and even superior to, many floating point DSPs. “Using
Intel, we were able to get near-final performance numbers in just
a few days, significantly lowering our project risk,” says Bryan
Rittmeyer, System Architect at Ixia.
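[Figure 8 relates user satisfaction to the R value and the corresponding MOS:
R 90 to 100 (MOS 4.3 to 5.0): Very Satisfied
R 80 to 90 (MOS 4.0 to 4.3): Satisfied
R 70 to 80 (MOS 3.6 to 4.0): Some Users Dissatisfied
R 60 to 70 (MOS 3.1 to 3.6): Many Users Dissatisfied
R 50 to 60 (MOS 2.6 to 3.1): Nearly All Users Dissatisfied
R 0 to 50 (MOS 1.0 to 2.6): Not Recommended]
Figure 8. MOS Diagram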
The second example is an envelope detector for a discrete time
sequence, shown in Figure 9. The Hilbert transform produces the
analytic representation of the signal, and the magnitude of that
analytic signal is the envelope, which is then downsampled. The
downsampling is done in two stages because the carrier frequency is
200x the message bandwidth; splitting the rate reduction keeps the
FIRs to reasonable sizes. Finally, the DC component is removed from
the discrete output sequence. Figure 10 contains a code snippet of the
MATLAB* model for the envelope detector and Figure 11 shows the results.
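[Figure 9 block diagram. Input: $x(n) = (A + M \cdot f(\omega_m n t_s)) \sin(\omega_c n t_s) + \text{noise}$. Envelope Formation: a $z^{-N/2}$ delay path and a Discrete Hilbert Transform path (multiplied by $j$) are summed, and the magnitude $|x_c|$ is taken. Low Pass Filtering, Downsampling, & DC removal: LPF1, D1, LPF2, D2 and DC removal produce the output o(m).]
Figure 9. Discrete Envelope Detection / Amplitude Demodulation Example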
Input: Amplitude modulated message. Message bandwidth = 5 kHz.
Carrier frequency: 1000 kHz. Input sampling frequency: 2200 kHz.
Output sampling frequency: 11 kHz.
LPF1: 128-tap FIR, 44 kHz cutoff frequency. D1: Downsampling by 25.
LPF2: 128-tap FIR, 5.5 kHz cutoff frequency. D2: Downsampling by 8.
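The rate arithmetic behind these parameters can be checked with a short sketch. The function names `decimated_rate_khz` and `nyquist_khz` are hypothetical helpers introduced here for illustration only.

```c
#include <assert.h>

/* Sample rate (kHz) after keeping one sample out of every `factor`. */
double decimated_rate_khz(double fs_khz, int factor)
{
    assert(factor > 0);
    return fs_khz / factor;
}

/* Highest frequency (kHz) representable at a given sample rate. */
double nyquist_khz(double fs_khz)
{
    return fs_khz / 2.0;
}
```

Starting from 2200 kHz, `decimated_rate_khz(2200.0, 25)` gives 88 kHz after D1 and `decimated_rate_khz(88.0, 8)` gives the 11 kHz output rate, for an overall factor of 25 x 8 = 200. The two LPF cutoffs listed above (44 kHz and 5.5 kHz) equal `nyquist_khz` of the rate that follows each downsampling stage.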
%Form analytic signal for envelope
inenv = abs(hilbert(in));
%Downsample, take out DC and LPF envelope
lpf1 = fir1(lpf1tap,cutoff1/(Fsi/2),'low',chebwin(lpf1tap+1));
out1 = fftfilt(lpf1,inenv);
out1 = downsample(out1,D1);
tout1 = downsample(tin,D1);
%Stage 2 Downsample & LPF envelope
lpf2 = fir1(lpf2tap,cutoff2/(Fs1/2),'low',chebwin(lpf2tap+1));
out = fftfilt(lpf2,out1);
out = downsample(out,D2);
out = out - mean(out);
out = out(17:length(out));
tout = downsample(tout1,D2);
tout = tout(17:length(tout));
Figure 10. Snippet of MATLAB* model
Figure 11 shows the MATLAB results for a simulated noisy AM signal
containing messages centered at 1 kHz, 3 kHz, and 4.75 kHz. The
figure also shows the single-sided spectrum magnitude of the input,
after LPF1, after LPF2 and downsampling D2, and at the final output.
Superimposed on the frequency spectrum is the frequency response of
the LPFs (thin blue line).
Figure 11. MATLAB* Results
Table 6 summarizes the execution times of the amplitude demodulation
example for various data sizes. As in Example 1, the calculated times are
the sums of the execution times of the individual functions, and the
measured times were generated from hardware clock count measurements
collected while the algorithm executed (see Figure 12 for the code
snippet). The runtimes were averaged across 10,000 runs.
The calculated times are within 11 percent of the measured results. Here
again, gathering the calculated times took a few hours of unattended run
time and less than an hour of spreadsheet calculation time. An engineer
familiar with Intel IPP, C++ and the algorithm generated the measured
results in three days, which included coding with Intel IPP functions,
debugging and executing the runs. As in Example 1, the manual
approximation method offers a good compromise between accuracy and
effort, and it provides a relatively quick indication of performance and
of focus areas for further optimization. As in the prior example,
going to the next step of coding the algorithm using Intel IPP took an
acceptable amount of effort, given the degree of optimization achieved and
compared to manually optimizing libraries for Intel processors.
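The calculated figures can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: the helper names are hypothetical, the 2.1 GHz clock comes from the test configuration in Appendix A, and the delta is assumed to be (calculated - measured) / measured, which is consistent with the values reported in Table 6.

```c
#include <assert.h>
#include <stddef.h>

#define CPU_MHZ 2100.0 /* fixed 2.1 GHz core clock (Appendix A) */

/* Sum the per-function execution times (in microseconds). */
double sum_usec(const double *t, size_t n)
{
    assert(t != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += t[i];
    return total;
}

/* Convert microseconds to clock cycles at the fixed core frequency. */
double usec_to_clocks(double usec)
{
    return usec * CPU_MHZ;
}

/* Relative difference of calculated versus measured time, in percent
   (assumed definition; see the lead-in). */
double delta_percent(double calculated, double measured)
{
    assert(measured > 0.0);
    return 100.0 * (calculated - measured) / measured;
}
```

For the 9000-sample case, summing the Table 6 entries for the Hilbert transform, magnitude, FIR #1, downsampling by 25, FIR #2 and downsampling by 8 (70.9, 9.91, 81.0, 2.31, 6.09 and 0.0368 µsec) gives about 170 µsec, or roughly 3.58E+05 clocks, and a delta of about -3 percent against the measured 175 µsec.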
First stage (input size in samples at the input sampling rate):
Input Size | Hilbert Transf (32f - 32fc) µsec | Hilbert Transf (32f - 32fc) clocks | Magnitude µsec | Magnitude clocks | 128 tap FIR #1 µsec | 128 tap FIR #1 clocks | DownSampling by 8 µsec | DownSampling by 8 clocks | DownSampling by 25 µsec | DownSampling by 25 clocks
32 | 1.41E-01 | 2.96E+02 | 4.05E-02 | 8.51E+01 | 4.49E-01 | 9.43E+02 | 2.46E-02 | 5.17E+01 | 7.69E-02 | 1.61E+02
128 | 5.82E-01 | 1.22E+03 | 1.40E-01 | 2.94E+02 | 1.63E+00 | 3.42E+03 | 2.25E-02 | 4.73E+01 | 7.03E-02 | 1.48E+02
512 | 2.33E+00 | 4.89E+03 | 5.52E-01 | 1.16E+03 | 6.39E+00 | 1.34E+04 | 3.78E-02 | 7.94E+01 | 1.18E-01 | 2.48E+02
2048 | 1.09E+01 | 2.29E+04 | 2.20E+00 | 4.62E+03 | 2.63E+01 | 5.52E+04 | 9.67E-02 | 2.03E+02 | 3.02E-01 | 6.35E+02
8192 | 6.17E+01 | 1.30E+05 | 8.90E+00 | 1.87E+04 | 7.61E+01 | 1.60E+05 | 6.34E-01 | 1.33E+03 | 1.98E+00 | 4.16E+03
9000 | 7.09E+01 | 1.49E+05 | 9.91E+00 | 2.08E+04 | 8.10E+01 | 1.70E+05 | 7.41E-01 | 1.56E+03 | 2.31E+00 | 4.86E+03
18000 | 1.74E+02 | 3.64E+05 | 2.12E+01 | 4.45E+04 | 1.36E+02 | 2.85E+05 | 1.93E+00 | 4.05E+03 | 6.03E+00 | 1.27E+04
27000 | 2.76E+02 | 5.80E+05 | 3.25E+01 | 6.82E+04 | 1.90E+02 | 3.99E+05 | 3.12E+00 | 6.55E+03 | 9.74E+00 | 2.05E+04
32768 | 3.42E+02 | 7.18E+05 | 3.97E+01 | 8.34E+04 | 2.25E+02 | 4.73E+05 | 3.88E+00 | 8.15E+03 | 1.21E+01 | 2.55E+04
36000 | 3.76E+02 | 7.89E+05 | 4.36E+01 | 9.16E+04 | 2.47E+02 | 5.19E+05 | 4.26E+00 | 8.95E+03 | 1.33E+01 | 2.80E+04

Second stage (input size in samples after downsampling by 25):
Input Size | 128 tap FIR #2 µsec | 128 tap FIR #2 clocks | DownSampling by 8 µsec | DownSampling by 8 clocks
327.68 | 5.69E+00 | 1.20E+04 | 3.56E-02 | 7.47E+01
360 | 6.09E+00 | 1.28E+04 | 3.68E-02 | 7.74E+01
720 | 1.57E+01 | 3.30E+04 | 6.54E-02 | 1.37E+02
1080 | 2.04E+01 | 4.28E+04 | 7.92E-02 | 1.66E+02
1310.72 | 2.34E+01 | 4.91E+04 | 8.81E-02 | 1.85E+02
1440 | 2.51E+01 | 5.26E+04 | 9.30E-02 | 1.95E+02

Complete demodulator:
Input Size | Demod Calculated µsec | Demod Calculated clocks | Demod Measured µsec | Demod Measured clocks | Delta
8192 | 1.54E+02 | 3.24E+05 | 1.59E+02 | 3.34E+05 | -3%
9000 | 1.70E+02 | 3.58E+05 | 1.75E+02 | 3.67E+05 | -3%
18000 | 3.52E+02 | 7.39E+05 | 3.68E+02 | 7.72E+05 | -4%
27000 | 5.29E+02 | 1.11E+06 | 5.97E+02 | 1.25E+06 | -11%
32768 | 6.42E+02 | 1.35E+06 | 7.18E+02 | 1.51E+06 | -11%
36000 | 7.05E+02 | 1.48E+06 | 7.86E+02 | 1.65E+06 | -10%

Legend: Measured Data; Interpolation From Measured Data; Calculated Data
Table 6. Amplitude Demodulation Execution Times
/* allocate and initialize specification structures */
ippsHilbertInitAlloc_32f32fc(&HilbertSpec_p, insmpl, ippAlgHintNone);
ippsFIRInitAlloc_32f(&LPF1FIRState_p, (Ipp32f *)lpf1taps, LPF1TAPCNT, NULL);
ippsFIRInitAlloc_32f(&LPF2FIRState_p, (Ipp32f *)lpf2taps, LPF2TAPCNT, NULL);
…
/* Form the analytic signal and its envelope */
ippsHilbert_32f32fc((Ipp32f *)in_p, inenv_p, HilbertSpec_p);
ippsMagnitude_32fc(inenv_p, inenvabs_p, insmpl);
/* First stage of LPF and downsampling */
ippsFIR_32f_I(inenvabs_p, insmpl, LPF1FIRState_p);
ippsSampleDown_32f((Ipp32f *)inenvabs_p, insmpl, outD1_p, &D1smpl, D1, &phm);
/* Second stage of LPF and downsampling */
ippsFIR_32f_I(outD1_p, D1smpl, LPF2FIRState_p);
ippsSampleDown_32f(outD1_p, D1smpl, outD2_p, &D2smpl, D2, &phm);
/* Remove DC */
ippsMean_32f(outD2_p, D2smpl, &dcval, ippAlgHintNone);
ippsSubC_32f_I((Ipp32f)dcval, outD2_p, D2smpl);
…
/* free states */
ippsHilbertFree_32f32fc(HilbertSpec_p);
ippsFIRFree_32f(LPF1FIRState_p);
ippsFIRFree_32f(LPF2FIRState_p);
Figure 12. Sample C-code Snippet for Example 2 using Intel® Integrated Performance Primitives (Intel® IPP)
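For readers without Intel IPP at hand, the intent of the downsampling and DC removal steps in Figure 12 can be sketched in plain C. This is an illustrative stand-in, not the Intel IPP implementation: `sample_down` and `remove_dc` are hypothetical helpers, the downsampler always starts at index 0 and does not model the phase argument of ippsSampleDown_32f, and no performance claim is implied.

```c
#include <assert.h>
#include <stddef.h>

/* Keep every `factor`-th sample of src, starting at index 0.
   dst must have room for (n + factor - 1) / factor samples.
   Returns the number of samples written. */
size_t sample_down(const float *src, size_t n, float *dst, size_t factor)
{
    assert(factor > 0);
    size_t m = 0;
    for (size_t i = 0; i < n; i += factor)
        dst[m++] = src[i];
    return m;
}

/* Remove the DC component by subtracting the mean in place. */
void remove_dc(float *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    float mean = (float)(sum / (double)n);
    for (size_t i = 0; i < n; ++i)
        x[i] -= mean;
}
```

Downsampling the sequence 1, 2, ..., 8 by 2 keeps 1, 3, 5, 7, and removing the mean (4) leaves -3, -1, 1, 3.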
The single-sided spectrum magnitude output of the Intel IPP implementation is shown in Figure 13 and indicates the resulting envelope
and message recovery at the 11 kHz sampling frequency.
Figure 13. Output Spectrum Magnitude for Intel® Integrated Performance Primitives (Intel® IPP) implementation.
Table 7 summarizes the measured execution time for various lengths of
input sample sequences. Of note is the column labeled Utilization Rate.
This is the algorithm’s execution time divided by the duration of the
input sample sequence, which provides a measure of core utilization over
the time interval available to the algorithm (i.e., before the next set of
input samples needs to be processed). It is an indication of the amount of
headroom the core has available for additional signal processing
functions, or perhaps, for other applications.
Input Size in Samples | AVX Speedup Over SSE | AVX Time (µsec) | Utilization Rate (%)
9000 | 1.32x | 174.70 | 4.30%
18000 | 1.30x | 367.64 | 4.50%
27000 | 1.27x | 597.04 | 4.90%
36000 | 1.29x | 785.95 | 4.80%
45000 | 1.26x | 1,048.31 | 5.10%
54000 | 1.27x | 1,223.71 | 5.00%
63000 | 1.25x | 1,485.68 | 5.20%
72000 | 1.26x | 1,659.40 | 5.10%
81000 | 1.25x | 1,986.38 | 5.40%
90000 | 1.26x | 2,155.75 | 5.30%
Table 7. Measured Execution Time for Envelope Detector
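The utilization rate can be recomputed from the execution time and the 2200 kHz input sampling frequency given with Figure 9. The following is a minimal sketch with hypothetical helper names, not code from the measurement setup.

```c
#include <assert.h>

/* Time span covered by a block of samples, in microseconds. */
double block_duration_usec(double samples, double fs_hz)
{
    assert(fs_hz > 0.0);
    return samples / fs_hz * 1e6;
}

/* Execution time as a percentage of the time the input block spans. */
double utilization_percent(double exec_usec, double samples, double fs_hz)
{
    return 100.0 * exec_usec / block_duration_usec(samples, fs_hz);
}
```

A block of 9000 samples at 2.2 MHz spans about 4091 µsec, so `utilization_percent(174.70, 9000.0, 2.2e6)` is roughly 4.3 percent, in line with the first row of Table 7; the remaining headroom is what the core has left for other work in that interval.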
Development Tools Overview
Developers of signal processing applications have a wide choice of development tools from Intel and the broad Intel ecosystem. The benefits of
using these comprehensive tool suites are many and impact every phase
of the software development process.
Intel® C++ Compiler
The Intel C++ Compilers for Linux and Microsoft* Windows* operating
systems are optimized to harness key properties of Intel architecture
processors and deliver optimal performance. They take advantage
of a complex set of heuristics to decide which assembly instructions
can best optimize the performance in various areas, including memory
access, branch prediction, vectorization and floating point operations.
Intel® Math Kernel Library (Intel® MKL)
Intel® Math Kernel Library (Intel® MKL) is a library of highly optimized,
extensively threaded math routines that rely heavily on floating point
computations for maximum performance. Core math functions include
BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms,
Vector Math and more.
Intel® Integrated Performance Primitives (Intel® IPP)
Intel IPP offers a rich set of library functions and codecs capable of
speeding up the development of highly optimized routines for the
handling of multimedia formats and data of any kind. The functions have
been hand-optimized at a low level to provide maximum performance and
ease of use with Intel architecture processor-based platforms.
Intel® VTune™ Performance Analyzer
Designed to help developers find bottlenecks in their applications, the
tool profiles how the application is using CPU time and computing platform resources throughout the code.
Intel® Application Debugger
A rich and user-friendly Eclipse* RCP-based graphical user interface,
combined with OS signal and thread awareness, enables developers to
cross-debug more easily by finding coding issues that affect application runtime behavior.
Eclipse*-based Integrated Development Environment
Intel® software development products can be used with the Eclipse
Integrated Development Environment (IDE).
Consider Intel® Architecture Processors for Signal Processing
Although today’s Intel architecture processors are already being used for
signal processing workloads, the release of 2nd generation Intel Core i7
processors with Intel AVX makes this approach much more compelling.
Intel AVX delivers over twice the performance¹ for some floating point-based workloads compared to prior generation Intel SSE instructions. It is
relatively straightforward for developers to evaluate the signal processing performance of next generation Intel architecture processors using
the data collected with Intel® tools and libraries.
Appendix A: Test Configuration
• Single thread execution
• Emerald Lake Platform (Fab A)
– BIOS – American Megatrends 4.6.3.2 (Project Version – ASNBCPT1.86C. 0054.P00)
– CPU: 2nd generation Intel® Core™ i7-2710QE processor (4 core, 2.1GHz, 6MB LLC, Intel® Hyper-Threading Technology off)
– PCH: Mobile Intel® QM67 Chipset, B0 stepping.
– 2 GB RAM (2x1GB Samsung DIMM DDR3 1333, dual rank, PN: M471B2874EH1-CH9)
– Western Digital 160GB HDD (WD1600AAJS)
• Fedora* 13 Linux* 2.6.33.3-85.fc13.x86_64 operating system
• Intel® Composer XE 2011
– Intel® C++ Compiler Pro, version 12.0.1, build 107.
– Intel® Integrated Performance Primitives (Intel® IPP) version 7.0, build 205.23, September 2, 2010 (libippse9.so.7.0)
– Intel IPP performance tool version 7.0 (part of the Intel IPP package)
• All individual Intel IPP measurements were taken using the Intel IPP performance test tool. Standard batch
mode (-B) input was used. The automatic timing mode with default accuracy was used. The tests were run with
high priority (Y=HIGH) and on one thread only (N=1). More information on the command line parameters can be
obtained by running the performance applications with the -hh switch.
• Frequency domain FIR was compiled in release mode (Release x64) with the Intel C++ Compiler. The cache is
warmed before the test. Optimizations are enabled using the /O3, -xHost, and -std=c99 compiler flags. FDFIR
data averaged among in place, fast, and no divide by N options.
• Other data averaged among in place and not in place, fast & accurate switches, divide by N, divide by sqrt(n),
and no divide by N, as applicable to each algorithm.
• Data is at fixed CPU clock frequency and may change with Intel® Turbo Boost Technology enabled.
• Software libraries, drivers, operating systems, and compilers used are not fully tuned for performance and
additional performance gains may be possible.
Acronyms
ASIC Application-specific integrated circuit
ASP Application-specific processor
DSP Digital signal processor
FFT Fast Fourier transform
FIR Finite impulse response
FPGA Field-programmable gate array
IIR Infinite impulse response
SIMD Single-instruction, multiple data
¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark,
are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult
other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
² For more information go to http://www.intel.com/performance
³ Source: PESQ website at http://www.pesq.org/
Copyright © 2011 Intel Corporation. All rights reserved. Intel, the Intel logo and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States
or other countries.
*Other names and brands may be claimed as the property of others.
Printed in USA
0111/S2D/BM/XX/PDF
Please Recycle
324910-001US