Application Report
SWRA564 – May 2017
Introduction to the DSP Subsystem in the IWR16xx
Sandeep Rao, Jasbir Nayyar, Mingjian Yan, Brian Johnson
This application report introduces the DSP subsystem (DSS), discusses its key blocks, provides benchmarks for
the digital signal processor (DSP) and enhanced direct memory access (EDMA), and presents typical
radar processing chains. It is intended for engineers who wish to implement signal-processing
algorithms on the IWR16xx.
Contents
Introduction
Overview of the DSP Subsystem (DSS)
Algorithm Chain and Benchmarks
Data Flow
Discussion on Cache Strategy
References
List of Figures
Programmer's View of the DSP Subsystem
Memory Hierarchy in the DSS
Multidimensional Transfer Capabilities of the EDMA
ADC Buffers
FMCW Frame Structure
Typical Radar Signal Processing Chain
Variant of a Radar Signal Processing Chain
Contiguous-Read Transpose-Write EDMA Access
Transpose-Read Contiguous-Write EDMA Access
Single-Chirp Use Case
Buffer Management for Data Flow
Single-Chirp Use Case Alternate Flow
Improving Efficiency of the Transpose Transfer
Timing for Multichirp Use Case
Multichirp Use Case
Interframe Processing Case 1
Interframe Processing Case 2
List of Tables
C674x Benchmarks
List of FFT Routines in DSPLIB
MIPS (Cycles) Performance of FFTs
SQNR (dB) Performance of FFTs
Copyright © 2017, Texas Instruments Incorporated
C6000 is a trademark of Texas Instruments.
ARM is a registered trademark of ARM Limited.
Cortex is a registered trademark of ARM.
All other trademarks are the property of their respective owners.
Introduction
The IWR16xx device is the industry’s first fully integrated, 77-GHz, radar-on-a-chip solution for industrial
radar applications [1]. The device comprises the entire mmWave RF and analog baseband signal chain for
two transmitters (TX) and four receivers (RX), as well as two customer-programmable processor cores in
the form of a C674x DSP and an ARM® Cortex®-R4F MCU. These two customer-programmable
processors are part of the DSP subsystem (DSS) and master subsystem (MSS), respectively.
Overview of the DSP Subsystem (DSS)
Figure 1 shows a programmer's view of the DSS. The center of the DSS is the C674x DSP core, which
customers can program to run radar signal-processing algorithms. Two levels of memory (L1 and L2) are
local to the DSP core. Additionally, the DSP has access to external memories. These external memories
include the ADC buffer (which presents the digitized radar IF signal to the DSP), the L3 memory (primarily
used to store radar data), and a handshake RAM (which can be used to share data between the MSS and
the DSS). The DSP has access to enhanced-DMA (EDMA) [5] engines, which can be used to efficiently
transport data between these external memories and the DSP core. Each of these components is
described in more detail in the following subsections.
Figure 1. Programmer's View of the DSP Subsystem
C674x DSP
The C674x DSP [2 to 7] belongs to the C6000™ family and combines the instruction sets of two prior
DSPs: C64x+ (a fixed-point DSP) and C67x (a floating-point DSP). The DSP in the IWR16xx runs at 600
MHz. The DSP has the following key features:
• Instruction parallelism: The CPU has eight functional units that can operate in parallel, which allows
for a maximum of eight instructions to be executed in a single cycle [2].
• Rich instruction set: In addition to instruction parallelism, the DSP supports a rich instruction set [2],
which further speeds up signal processing computations. These instructions include single instruction
multiple data (SIMD) instructions, which can operate on multiple sets of input data (for example, ADD2
can perform two 16-bit additions in one cycle and MAX2 can perform two 16-bit comparisons in one
cycle). There are also other specialized instructions for commonly used signal-processing math
operations (such as CMPRY, an instruction which is a single-cycle complex multiplication, specialized
instructions for dot products, and so on).
• Optimizing C-compiler: The C6000 family of DSPs comes with a very good optimizing C compiler [4],
[6], which lets engineers develop highly efficient signal-processing routines using linear C code. The
compiler ensures optimal scheduling of instructions, to ensure the best possible use of the DSP
parallelism. The compiler also supports the use of intrinsics [4], which lets the programmer directly
access specialized CPU instructions from within the C code.
Figure 2. Memory Hierarchy in the DSS
• Memory architecture: The DSP uses a three-level memory architecture (L1, L2, and L3) (see Figure 2).
– L1 memory is the closest to the CPU and runs at the CPU speed (600 MHz). L1 memory consists
of 32KB each of program and data memory commonly referred to as L1P and L1D, respectively.
– L2 memory consists of two 128KB-banks, which can store both program and data. Each bank is
accessible through its own memory port, which improves efficiency by allowing access to both L2
banks simultaneously (for example, data and program simultaneously, if kept on separate banks).
L2 runs at a speed of CPU/2 (300 MHz). L1 and L2 memories reside within the DSP core.
– In addition, the DSP can also access an external memory (L3). In the IWR16xx device, L3 memory
runs at 200 MHz and is accessible through a 128-bit interconnect. Both L1 and L2 memories can
be configured (partly or wholly) as cache for the next higher level memory [3]. A typical radar
application might configure L1P and L1D as cache, place the code and data in L2, and use L3
primarily for storing the radar data (though, as noted in Section 5, other variations are possible as
well).
• Collateral: Besides the optimizing C compiler, other collateral is available in the form of optimized
libraries. The TI C6000 DSPLIB [7] is an optimized DSP function library for C programmers. The library
includes many C-callable, optimized, general-purpose, signal-processing routines and contains
routines for fast Fourier transforms (FFTs), matrix and vector manipulation, filtering, and more. TI also
provides an optimized library called the mmwaveLib [8], that complements the DSPLIB and includes
signal processing routines that are commonly used in radar signal processing.
EDMA
The EDMA engine handles direct memory transfers between various memories. In the context of radar
signal processing, the amount of data to be stored per frame is typically much larger than what can be
accommodated in the higher-level memories (such as L1 and L2). Consequently, the bulk of the data is
usually stored in a lower-level memory (L3). The EDMA enables data to be brought in from the lower-level
memory to the higher-level memories for processing and subsequently shipped back to the lower-level
memory. Proper use of the EDMA features (such as multidimensional transfers, chaining, and linking,
which are explained next) can ensure that data transfers across memories occur with minimal overhead.
This subsection provides a concise overview of the EDMA and its capabilities. For a more detailed
description of the EDMA, see [1], [5].
An EDMA engine consists of a channel controller (CC) and one or more transfer controllers (TCs). The CC
schedules various data transfers, while the TC performs transfers dictated by requests submitted by the
CC. The IWR16xx device has two CCs, each with two TCs. Because multiple TCs can operate in parallel,
up to four transfers can be performed in parallel.
Parameters corresponding to memory transfers must be programmed into a parameter RAM (PaRAM)
table. The PaRAM table is segmented into PaRAM sets. Each PaRAM set includes fields for parameters
that define a DMA transfer, such as source and destination address, transfer counts, and configurations
for triggering, linking, chaining, and indexing. Each EDMA engine (each CC) supports 64 logical channels,
and the first 64 PaRAM sets are directly mapped to the 64 logical channels.
EDMA Features
The EDMA engine has a rich feature set, all of which can be programmed using the PaRAM sets. These
features include the following:
• Multidimensional transfers: An EDMA data transfer is defined in terms of three dimensions.
– The first dimension (A transfer) transfers a contiguous array of ACNT bytes from source to
destination.
– The second dimension (B transfer) transfers BCNT arrays (each of size ACNT bytes). Subsequent
arrays in the B transfer are separated by SRCBIDX (DSTBIDX) bytes at the source (destination).
– The third dimension transfers CCNT frames, with each frame consisting of BCNT arrays.
Subsequent frames are separated by SRCCIDX (DSTCIDX) bytes at the source (destination). The
transfer counts (ACNT, BCNT, and CCNT) and indices (SRCBIDX and so on) are programmed in
the PaRAM set.
These multidimensional transfers allow considerable flexibility in defining the data streaming from the input
to the output. For example, a matrix can be picked up from a source location and stored in a transpose
fashion at the destination. This is shown in Figure 3, where a 4 × 5 matrix (each element 4 bytes in size)
is transposed at the destination.
Figure 3. Multidimensional Transfer Capabilities of the EDMA
• Chaining: Chaining is a mechanism by which the completion of one EDMA transfer (corresponding to
one EDMA channel) automatically triggers another EDMA channel. The chaining mechanism helps to
orchestrate a sequence of DMA transfers.
• Linking: Once the transfers corresponding to a PaRAM set have been completed (and its counter
fields such as ACNT, BCNT, and CCNT have decremented to 0), the EDMA provides a mechanism to
load the next PaRAM set. The 16-bit parameter LINK (in each PaRAM set [5]) specifies the byte
address offset from which the next PaRAM set should be reloaded. Once reloaded, the PaRAM set is
ready to be triggered again. The linking feature eliminates the need to reprogram the PaRAM set for
subsequent transfers. It is also useful in programming the EDMA so that subsequent transfers
alternate between ping-pong buffers. The first 64 PaRAM sets are directly mapped to the
corresponding EDMA channels. The byte address offset programmed in LINK is, however, free to point
to a PaRAM set beyond these first 64 (the two CCs in the IWR16xx support 128 and 256 entries,
respectively, in their PaRAM tables).
• Triggering: Once an EDMA channel is programmed (using the PaRAM set), it can be triggered in
multiple ways. These ways include:
– Event-based triggering (for example, a chirp available interrupt from the ADC buffer, see
Section 2.3.1)
– Software triggering by the DSP CPU
– Chain triggering by a prior completed EDMA
The previously listed features of the EDMA are ideally suited for performing data transfers in the context of
radar signal processing. For example, radar processing involves multidimensional FFT processing, which
requires data to be rearranged along various dimensions (range bins, Doppler bins, and antennas). The
multidimensional transfers supported by the EDMA are very useful in this context. The PaRAM set-based
programming model allows multiple transfers to be programmed up front. For example, EDMA transfers
from the ADC buffer to the DSP core, and between the DSP core and the L3 memory can all be
programmed up front in various logical channels. Subsequently, these channels are triggered during the
data flow. Further, the linking feature allows automatic reloading of PaRAM sets (for example, for each
successive frame), thus eliminating the need for the CPU to periodically reprogram the EDMA channels.
External Interfaces
ADC Buffer
Digitized samples from the digital front end (DFE) are stored in the ADC buffer. The ADC buffer exists as
a ping-pong pair, each 32KB in size. So when the DFE is streaming data to the ping (respectively, pong)
buffer, the DSP has access to the pong (respectively, ping) buffer. The number of chirps that can be
stored in a buffer is programmable. When the DFE has streamed the programmed number of chirps into a
buffer, a chirp available interrupt is raised and the DFE switches buffers. This interrupt can serve as a cue
to the DSP to begin processing the stored data. Alternatively, the interrupt can also be used to trigger an
EDMA transfer of the ADC samples to the local memory of the DSP (such as L1 or L2), with the EDMA
subsequently interrupting the DSP on completion of the transfer. Figure 4 (left side) shows an ADC buffer
that has been programmed to store one chirp per ping-pong. The four colored rows refer to the ADC
samples corresponding to the four antennas (the four RX channels). In Figure 4 (right side), the ADC
buffer has been programmed to store four chirps per ping-pong. The data for each antenna (across the
four chirps) is stored contiguously.
Figure 4. ADC Buffers
L3 Memory
In the IWR16xx device, up to 768KB of the L3 memory is available for use by the DSP. The primary use of
the L3 memory is to store the radar data arising from the ADC samples corresponding to a frame (the
radar-cube). L3 memory is accessed through a 128-bit interconnect.
Handshake Memory
An additional 32KB of memory is available at the same hierarchy level as the L3 memory. The primary
intent of this separate memory is to provide another memory for sharing data with other masters, such as
the R4F or the DMA in the MSS, without interfering with L3 memory access efficiency. One prime use
is to share the final object list with the R4F through this RAM.
Algorithm Chain and Benchmarks
This section provides benchmarks for some common radar signal-processing routines. This section also
examines the DSP loading in the context of some typical processing chains.
These benchmarks assume the data and program to be in L1.
Table 1. C674x Benchmarks

Function                                             Cycles        Timing (µs) at 600 MHz   Source and Function Name
128-point FFT (16 bit)                               —             0.86 µs                  DSPLIB (DSP_fft16x16)
256-point FFT (16 bit)                               —             1.55 µs                  DSPLIB (DSP_fft16x16)
128-point FFT (32 bit)                               —             1.59 µs                  DSPLIB (DSP_fft32x32)
Windowing (16 bit)                                   0.595 N + 70  0.37 µs (for N = 256)    mmwavelib (mmwavelib_windowing16x16)
Windowing (32 bit)                                   N + 67        0.32 µs (for N = 128)    mmwavelib (mmwavelib_windowing16x32)
Log2abs (16 bit)                                     1.8 N + 75    0.89 µs (for N = 256)    mmwavelib (mmwavelib_log2Abs16)
Log2abs (32 bit)                                     3.5 N + 68    0.86 µs (for N = 128)    mmwavelib (mmwavelib_log2Abs32)
CFAR-CA detection                                    3 N + 161     0.91 µs                  mmwavelib (mmwavelib_cfarCadB)
Maximum of a vector of length 256                    —             0.12 µs                  DSPLIB (DSP_maxval)
Sum of complex vector of length 256 (16-bit I, Q)    —             0.28 µs                  —
Multiply two complex vectors of length 256 (16-bit)  265           0.44 µs                  —
Radar Signal Processing Chain
The fundamental transmission unit of an FMCW radar is a frame, which consists of a number (for
example, Nchirp) of equispaced chirps (see Figure 5). Central to FMCW radar signal processing is a series
of three FFTs commonly called the range-FFT, Doppler-FFT, and angle-FFT, which respectively resolve
objects in range, Doppler (relative velocity with regard to the radar), and angle of arrival. The range-FFT is
performed across ADC samples for each chirp (one range-FFT for each RX antenna). The range-FFT is
usually performed inline (during the intraframe period as the samples corresponding to each chirp become
available). The Doppler-FFT operates across chirps and can only be performed when all the range-FFTs
corresponding to all the chirps in a frame have become available. Lastly, the angle-FFTs are performed on
the range-Doppler processed data across RX antennas. For more in-depth information on FMCW signal
processing, see [9].
Figure 5. FMCW Frame Structure
Figure 6 shows a representative signal-processing chain for the FMCW radar. The processing is
performed on each frame of received data. The ADC samples corresponding to each chirp (across the
four RX antennas) are received in the ADC buffer. The DSP performs an FFT (one for each RX antenna),
which is then placed in L3 memory. The data in L3 memory is best visualized as a 3-dimensional matrix
indexed along range, chirps, and antennas (often called a radar-cube).
When all the range-FFTs have been computed for all the chirps in a frame, Doppler-FFTs are performed.
For each range-bin (and for each antenna) a Doppler-FFT is performed across the Nchirp chirps. In the
example of Figure 6, both the range-FFT and Doppler-FFT are 16-bit FFTs. Therefore, the results of the
Doppler-FFT can be stored in-place (overwriting the corresponding range-FFT samples) in the radar-cube
in L3 memory. At the end of the Doppler-FFT processing, the radar-cube is now indexed along range,
Doppler, and antennas.
Next the radar-cube is noncoherently summed along the antenna dimension to create a predetection
matrix that is indexed along range and Doppler. This predetection matrix is much smaller than the
radar-cube. In the previous example, the predetection matrix would be 1/8 the size of the radar-cube. The
reduction is due to collapsing along the antenna dimension (contributing a factor of 1/4) and the fact that the
predetection matrix is real while the radar-cube is complex (contributing a factor of 1/2). Consequently, the
predetection matrix can be stored in either L3 or L2. A detection algorithm then identifies peaks in this
predetection matrix that correspond to valid objects. For each valid object, an angle-FFT is then performed
on the corresponding range-Doppler bins in the radar-cube to identify the angle of arrival of that object.
Lastly, the estimated parameters for the detected objects such as range, Doppler, and angle of arrival can
be shared with the MSS using the handshake memory. There is also an option to perform further
processing such as clustering and tracking in the DSP. Alternatively, these higher-layer algorithms can be
performed in the MSS (R4F MCU).
Figure 6. Typical Radar Signal Processing Chain
In the chain described in Figure 6, the Doppler-FFT processing did not increase the size of the radar-cube
(because the Doppler-FFT output was written over the range-FFT). There can be cases where
performance requirements might dictate a higher dynamic range at the output of the Doppler-FFT (for
example, see the discussion in Section 3.3). Figure 7 depicts a possible processing chain in such a
scenario. While the range-FFT is still 16-bit, the Doppler FFT has an increased dynamic range and
outputs 32-bit complex samples. While these samples can be stored back in the radar-cube, that would
mean doubling the radar-cube size in L3. Therefore, the chain in Figure 7 follows a different approach.
When the Doppler-FFT is computed, it is noncoherently summed across antennas to create a predetection
matrix. This result is not stored back in L3. When the detection algorithm is run and objects are identified,
the Doppler-FFT (followed by the angle-FFT) is selectively recomputed only for the detected objects.
Figure 7. Variant of a Radar Signal Processing Chain
Consider the memory and computational requirements for a representative FMCW frame
structure using the processing flow indicated in Figure 7. A frame with 128 chirps (Nchirp = 128), with each
chirp having 256 samples (Nadc = 256) is considered. With four RX antennas, the radar-cube memory
requirement is 256 samples × 128 chirps × 4 antennas × 4 bytes / sample = 512KB. The processing
requirements are looked at during two distinct phases: intraframe processing and interframe processing.
• Intraframe processing: From Table 1, it can be seen that a 256-pt range-FFT with windowing takes
(1.55 + 0.37) = 1.92 µs. Therefore the computational requirements for processing one chirp (for four
RX antennas) are 1.92 × 4 = 7.7 µs. Typical chirp periodicities (TC in Figure 5) are in the tens of
microseconds. For example, with a sampling rate of 5 MHz, and Nadc = 256, TC would be at least 256 /
5 = 51 µs (excluding the interchirp idle time of a few µs). Thus range-FFT processing can be
comfortably performed in the time (TC) between the arrival of ADC samples from consecutive chirps.
• Interframe processing: The first step in interframe processing is the Doppler-FFT. A 128-pt Doppler-
FFT must be performed for each of the 256 range bins and across the four RX antennas. Again from
Table 1, a single Doppler-FFT (with windowing) takes 1.59 µs + 0.32 µs = 1.91 µs, and the entire
computation would take 1.91 µs × 256 range bins × 4 RX antennas = 1.95 ms. The biggest
computation related to noncoherent accumulation is computation of the log magnitude for all the 128 ×
256 × 4 samples in the radar-cube, which would take about 0.86 µs × 256 × 4 = 0.9 ms. This operation
would result in a predetection matrix of 256 × 128 samples, each sample being a 16-bit real (positive)
number. Subsequently, a detection algorithm is run. While the choice of detection algorithm is very
application specific, here we assume that a CFAR-CA detector is used. Running this detector along
each of the 256 × 4 Doppler lines would take 0.9 µs × 256 × 4 ≈ 1 ms. Therefore the core
computational blocks in interframe processing collectively take about 1.95 + 0.9 + 1 ≈ 4 ms. This
computation ignores further downstream processing that is performed on the detected objects such as
additional detection steps to reconfirm initial detections and angle-estimation performed only on the
detected objects.
In the processing chains discussed so far, the low-level processing chain (up to the angle-FFT) used
fixed-point arithmetic. (Subsequent higher-layer processing (such as clustering and tracking) is typically
done in floating point.) However, it is also possible to construct a MIPS-efficient chain where all the
processing starting from the Doppler-FFT is in floating point as explained below.
The C674x DSP provides a set of floating-point instructions that can accomplish addition, subtraction,
multiplication, and conversion between 32-bit fixed point and floating point in a single cycle for single-
precision floating point, and in one to two cycles for double-precision floating point. There are also fast
instructions that calculate the reciprocal and reciprocal square root in a single cycle with 8-bit precision. With one
or more iterations of Newton-Raphson interpolation, higher precision can be achieved in 10 to a few tens
of cycles. Table 3 indicates that a floating-point FFT is almost as efficient as a 32-bit fixed-point FFT.
Another advantage of using floating-point arithmetic is that users can maintain both the precision and the dynamic
range of the input and output signal without spending cycles rescaling intermediate computation results.
It also enables users to skip or reduce requalification of the fixed-point implementation of an algorithm, which
makes algorithm porting easier and faster.
FFT Performance
FFT processing forms a significant part of the lower-level radar signal processing in the FMCW radar. The
optimized DSP library (DSPLIB) has efficient FFT routines (both fixed and floating point variants) that have
been specifically optimized for the C674x DSP. The fastest routine is the DSP_fft16x16() which takes 0.86
µs (128-point FFT) and 1.55 µs (256-point FFT) on a C674x running at 600 MHz. It operates on a 16-bit
complex input to produce a 16-bit complex output. It accesses a precomputed twiddle factor table, which
is also stored as 16-bit complex numbers. This FFT uses multiple radix-4 butterfly stages, with the last
stage being either a radix-2 or radix-4 depending on whether the FFT length is an odd or even power of 2.
Every radix-4 stage (except the last stage) has a scaling by 2. There is no scaling in the last stage. So
there is a bit growth (worst case) of 1 in every (radix-4) stage preceding the last stage. The last stage has
a bit growth of 1 or 2 depending on whether it is a radix-2 or radix-4. Thus, for a 128-point FFT, this
routine has a bit growth of 4 (1 + 1 + 1 + 1) and can handle a 12-bit signed input with no clipping, resulting
(for a tone at the input) in a full-scale peak at the FFT output.
In most radar applications, the 16 × 16 FFT routine previously described suffices for the first dimension
(range) processing. However, subsequent FFT processing stages could involve bit growth that extends
beyond 16 bits. The use of 32-bit FFT routines available in DSPLIB allows these subsequent stages to
realize their full processing gain without being limited by either bit overflow or the quantization noise of the
FFT routine. As an example, DSP_fft32x32() works on 32-bit complex inputs producing 32-bit complex
outputs, and uses a precomputed twiddle factor table also stored as 32-bit complex numbers.
A list of some key FFT routines available with DSPLIB is provided in Table 2. Additionally, the mmwave
library [8] also has some useful associated routines such as windowing.
Table 2. List of FFT Routines in DSPLIB

Function Name   Description
DSP_fft16x16    Fixed-point FFT using 16-bit complex numbers for input and output (16-bit I and 16-bit Q). The
                twiddle factor table is also stored as 16-bit complex. There is a scaling by 2 after every radix-4
                stage (except the last stage, which can be either a radix 4 or a radix 2). The minimum length of
                the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works for input lengths
                which are powers of 2 or 4.
DSP_fft16x32    Fixed-point FFT using 16-bit complex numbers for input and output (16-bit I and 16-bit Q). The
                twiddle factor table is stored as 32-bit complex. There is a scaling by 2 after every radix-4
                stage (except the last stage, which can be either a radix 4 or a radix 2). The minimum length of
                the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works for input lengths
                which are powers of 2 or 4.
DSP_fft32x32    Fixed-point FFT using 32-bit complex numbers for input and output (32-bit I and 32-bit Q). The
                twiddle factor table is also stored as 32-bit complex. There is no scaling in this FFT. The
                minimum length of the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works
                for input lengths which are powers of 2 or 4.
DSP_fftSPxSP    This FFT uses complex floating-point input and output. The twiddle factors are also in floating
                point. There is no scaling in this FFT. The minimum length of the FFT is 16. All buffers are
                assumed to be 8-byte aligned. The FFT works for input lengths which are powers of 2 or 4.
Table 3 lists the cycle count for these FFT routines.
Table 3. MIPS (Cycles) Performance of FFTs (N = 32 to 2048)
Ideally, users would like to ensure that the quantization noise coming from the FFT processing does not
limit the output SNR. Table 4 lists the SQNR performance metrics (in dBFS) for various FFT routines. For
input data with an effective bit width of B bits, the SNR at the input is 6B + 1.76 dB. The ideal SNR after an
N-point first-dimension FFT is 6B + 1.76 + 10log10(N) dB. If B = 10 bits and N = 128, the ideal output SNR is
less than 83 dB. Consequently, the use of DSP_fft16x16 (which, from Table 4, has an SQNR of 85 dB)
will not limit the output SNR. Similarly, depending on the input SNR and the FFT length, other options
such as DSP_fft32x32() can be considered. For optimal performance, the input back-off should be
appropriately chosen to prevent clipping at the output.
Table 4. SQNR (dB) Performance of FFTs (N = 32 to 2048)
The 32-bit FFT routine provides additional space for bit-width growth. A useful technique when using this
routine is to left shift/scale the input by a few bits. The free lower bits thus obtained ensure that the
quantization noise of subsequent FFT stages does not limit the processing gain.
Data Flow
While Section 3 focused on algorithmic benchmarks, this section focuses on data flows. A typical radar
signal-processing chain requires data transfer from the ADC buffer to the DSP local memory (L1 or L2),
and back and forth between the DSP local memory and L3. These data transfers are primarily accomplished
using the EDMA. It is important to understand the various types of EDMA transfers and their associated
latencies. This understanding enables stitching together a data flow that has minimal overhead and is as
nonintrusive as possible.
Types of EDMA Transfers
Three kinds of EDMA transfers are relevant in the context of FMCW radar signal processing.
• Contiguous-Read Contiguous-Write (or contiguous): This is the simplest and fastest kind of data
transfer, which involves moving a portion of memory from a source to a destination with no data
rearrangement involved during the transfer. This transfer is accomplished as a single-dimensional
transfer (with ACNT specifying the number of bytes to be transferred). The speed of this transfer is 128
bits per cycle (at 200 MHz). In a typical radar signal-processing chain where each data element is a
32-bit complex number (16-bit I and 16-bit Q), this amounts to four samples being transferred in every
cycle. Thus a transfer of 256 complex samples would take about 0.32 µs. The transfer of data from the
ADC buffer to the L2 memory of the DSPs before range-FFT processing would be an example of a
contiguous transfer.
• Contiguous-Read Transpose-Write (or transpose-write): A vector that is contiguously placed at the
source must be placed in a transpose fashion at the destination. This is accomplished using a
2-dimensional transfer with ACNT representing the size of each sample and BCNT representing the
number of samples. Assuming that the sample size ACNT < 128 bits, the transfer takes four cycles (at
200 MHz) per sample. As a result, it would take about 5.12 µs to transfer 256 contiguous complex
samples using a transpose-write. Figure 8 is an example of such a transfer where the range-FFT
output (corresponding to the four RX channels of a chirp) is placed in a transpose fashion in L3
memory. Such a transpose-write ensures that the data on which the Doppler-FFT operates is available
in a contiguous fashion at the end of a frame (as shown in the blue inset).
NOTE: For all depictions of memory in this document, the memory address is assumed to be
contiguous along the rows of the array that represents the memory.
Figure 8. Contiguous-Read Transpose-Write EDMA Access
Techniques exist which can increase the effective speed of the transpose-write transfer. One such
technique is to parallelize the transfer by employing multiple TCs. In the example of Figure 8, range-FFTs
for channels 1 and 2 could be transferred using one TC, while range-FFTs for channels 3 and 4
could be transferred using another TC. This would speed up the transfer by a factor of 2. This technique
works because the latencies for such transfers arise primarily within the TC and not at the memory
interfaces of the source or destination (also see Section 4.2.1).
• Transpose-Read Contiguous-Write (or transpose-read): Here, data that is noncontiguously placed at
the source must be contiguously placed at the destination. As in the earlier case, this is accomplished
using a 2-dimensional transfer with ACNT representing the size of each sample and BCNT
representing the number of samples. Assuming that the sample size ACNT < 128 bits, the transfer
takes one cycle (at 200 MHz) per sample. As a result, it would take about 1.28 µs to transfer 256
complex samples. In the context of radar signal processing, such a transfer would be needed if
range-FFT data across chirps in a frame had been contiguously stored in L3, as shown in Figure 9. A
transpose-read access would be needed to transfer the data corresponding to a specific range bin
(and a specific antenna) to the local memory of the DSP to perform the Doppler-FFT.
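A first-order latency model for the three transfer types, using the per-cycle rates quoted above (a sketch; it captures only the steady-state rates and ignores EDMA setup overhead):

```c
/* Steady-state EDMA latency model (microseconds) at 200 MHz:
 *   - contiguous:      128 bits (16 bytes) per cycle
 *   - transpose-write: 4 cycles per sample (sample size <= 128 bits)
 *   - transpose-read:  1 cycle per sample (sample size <= 128 bits) */
#define EDMA_CLK_MHZ 200.0

static double contiguous_us(int num_bytes)
{
    return (num_bytes / 16.0) / EDMA_CLK_MHZ;
}

static double transpose_write_us(int num_samples)
{
    return (num_samples * 4.0) / EDMA_CLK_MHZ;
}

static double transpose_read_us(int num_samples)
{
    return (num_samples * 1.0) / EDMA_CLK_MHZ;
}
```

With 32-bit complex samples, 256 samples give 0.32 µs (contiguous), 5.12 µs (transpose-write), and 1.28 µs (transpose-read), matching the figures in the bullets above.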
Figure 9. Transpose-Read Contiguous-Write EDMA Access
Example Use Cases: Intraframe Processing (Range-FFT Processing)
We use a few illustrative examples to demonstrate the use of the EDMA during range-FFT processing (by
no means is this an exhaustive set of examples).
Single-Chirp Use Case
Single chirp refers to the use case where the ADC buffer has been programmed to issue an interrupt after
data for each chirp (across four RX channels) becomes available. The EDMA channel is programmed to
interrupt the DSP when the transfer is complete. This is a contiguous transfer and hence is fast. For
example, transferring 256 ADC samples across four RX channels would take 256 × 4 / 4 / 200 ≈ 1.28 µs.
The DSP performs range-FFT processing on the data received for the four RX antennas. The processed
data is then transferred to the radar-cube memory in L3 through an EDMA. In this example, the transfer from
L2 to L3 is a transpose-write and therefore is slower (by a factor of 16 compared to a contiguous transfer).
The transfer of the 256 × 4 range-FFT samples would thus take about 21 µs. However, the total latency
of both these DMA transfers is still likely to be less than the chirp period TC and would not be a bottleneck
(for a 5-MHz sampling rate with Nadc = 256, TC will be 51 µs). The advantage of this approach is that it
eases the data transfer latencies during the Doppler processing. Since the range-FFT output is written out
in a transpose manner, the data required for a Doppler-FFT is contiguously placed.
Figure 10. Single-Chirp Use Case
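The timing budget of this flow can be sanity-checked with the rates quoted earlier (a sketch; the chirp period is approximated by the ADC sampling time Nadc/fs, and EDMA setup overheads are ignored):

```c
/* Single-chirp flow budget: the ADC->L2 contiguous transfer (A) plus
 * the L2->L3 transpose-write (C) must fit within one chirp period Tc.
 * Rates follow the text: 16 bytes/cycle contiguous, 4 cycles/sample
 * transpose-write, 200-MHz EDMA clock, 32-bit complex samples. */
static int single_chirp_fits(int n_adc, int n_rx, double fs_mhz)
{
    int samples  = n_adc * n_rx;                 /* complex samples per chirp */
    double a_us  = (samples * 4 / 16.0) / 200.0; /* contiguous ADC -> L2      */
    double c_us  = (samples * 4.0) / 200.0;      /* transpose-write L2 -> L3  */
    double tc_us = n_adc / fs_mhz;               /* chirp period (approx.)    */
    return (a_us + c_us) < tc_us;
}
```

For the example above (256 samples, 4 RX channels, 5-MHz sampling), A ≈ 1.28 µs and C ≈ 20.48 µs, well under the 51-µs chirp period.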
In Figure 10, the buffering scheme in the DSP L2 consists of a triad of buffers. At any time, two of these
buffers serve as input and output buffers for range-FFT processing. The DSP picks up the samples for
the current chirp from the designated input buffer and places the processed range-FFT data in the
designated output buffer. The third buffer contains the range-FFT output from the previously processed
chirp. This data is transferred out (using EDMA transfer C) and samples for the next chirp are brought in
(using EDMA transfer A) to replace it. In this scheme, the DSP processing will not be a bottleneck as long
as the total latencies associated with A and C remain less than a chirp period (also see Section 4.2.1).
Figure 11 shows the buffer management using this triad of buffers.
Figure 11. Buffer Management for Data Flow
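The rotation of the buffer triad can be sketched as follows (the struct layout, buffer roles, and rotation order are illustrative, not taken from the TI software):

```c
#include <stdint.h>

/* Triad of L2 buffers for the single-chirp flow: one holds the current
 * chirp's input samples, one receives range-FFT results, and one holds
 * the previous chirp's results being drained to L3 by EDMA (and then
 * refilled with the next chirp's ADC samples). */
typedef struct {
    int16_t *in;     /* DSP reads the current chirp's samples here     */
    int16_t *out;    /* DSP writes range-FFT results here              */
    int16_t *drain;  /* EDMA C drains this to L3; EDMA A refills it    */
} buf_triad_t;

/* Rotate roles after each chirp: the refilled drain buffer becomes the
 * new input, the fresh results go out next, and the consumed input
 * buffer becomes the new output. */
static void triad_rotate(buf_triad_t *t)
{
    int16_t *prev_in = t->in;
    t->in    = t->drain;
    t->drain = t->out;
    t->out   = prev_in;
}
```

Three rotations return every buffer to its original role, so the scheme cycles cleanly chirp after chirp.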
Figure 12 shows another variant of the single-chirp use case. There is a ping-pong buffer at the input to
receive data from the ADC buffer into L2. Each of its constituent buffers need only store the ADC samples
corresponding to one antenna. This suffices because the contiguous transfer from the ADC buffer takes only
0.32 µs, while the corresponding time for the DSP to perform a range-FFT on this data is 1.92 µs. There is
also a ping-pong buffer at the output to transfer the processed range-FFT data to L3. The data from this
output buffer is transferred to L3 in a transpose fashion. Due to the slower speed of the transpose-write
transfer, this buffer cannot operate at a per-RX-antenna granularity, and each of its constituent buffers
must be provisioned to store data corresponding to all four RX antennas of a chirp. The DSP kicks off this
EDMA when the range-FFTs for all four RX antennas have been placed in the output buffer. This transfer
has a full chirp period (TC) to complete (though it takes only 20.48 µs).
Figure 12. Single-Chirp Use Case Alternate Flow
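A minimal sketch of the buffer provisioning and ping-pong bookkeeping for this flow (names and sizes are illustrative, for 256-sample chirps and four RX antennas):

```c
#include <stdint.h>

enum { NUM_ADC = 256, NUM_RX = 4 };

/* Input pair: one RX antenna's chirp each (the 0.32-us contiguous fill
 * is fast enough to refill per antenna). Output pair: a full chirp of
 * all four antennas, because the slower transpose-write drains a whole
 * chirp at a time. Each complex sample is two int16 values (I, Q). */
static int16_t ping_in [2][NUM_ADC * 2];
static int16_t ping_out[2][NUM_RX * NUM_ADC * 2];

static int in_idx;   /* buffer the EDMA fills; DSP works on the other  */
static int out_idx;  /* buffer the DSP fills; EDMA drains the other    */

static void toggle(int *idx) { *idx ^= 1; }
```

The asymmetric sizing (per-antenna input, per-chirp output) follows directly from the transfer-speed argument in the paragraph above.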
Speeding Up the Transpose Transfer
A transpose-write-based EDMA transfer is 16 times slower than a contiguous EDMA transfer.
Nevertheless, as was demonstrated in the previous use cases, it is possible to come up with data flows
which ensure that the transpose transfer does not become a bottleneck. However, should there be a
need, there are options available to speed up this transfer. One option, discussed in Section 4.1, involves
the use of two EDMA transfers operating in parallel using different TCs. Another option is described in the
following discussion.
A transpose-write EDMA transfer takes four cycles to move one sample. This latency remains the same
for any sample up to 128 bits in size (because the bus that accesses L3 memory is 128 bits wide).
Therefore, any sample size less than 128 bits (16 bytes) makes inefficient use of the available
bandwidth; a complex sample of size 32 bits (16-bit I and 16-bit Q) uses only ¼ of the bandwidth.
This can be remedied by performing a transpose on the range-FFT results prior to initiating a transpose
transfer to L3. As shown in Figure 13 (matrix A), the range-FFT output for each antenna is contiguously
placed. The DSP then performs a transpose operation on this matrix such that data for a specific
range-bin is interleaved across antennas (matrix B). A transpose-write EDMA is now initiated with an ACNT of
128 bits. With this enhancement, the EDMA transfer that originally took 20.48 μs (in Figure 10) now takes
¼ of the time (5.12 μs). At the end of the intraframe period, each row in L3 memory consists of data for a
specific range-bin, interleaved across antennas. Subsequently, a contiguous transfer of each row can be
used to move the data to DSP memory for Doppler-FFT processing. There will be some overhead
during the Doppler processing to deinterleave this data. This method results in a 4× improvement in the
latency of the transpose-write EDMA transfer. The additional overhead on the DSP (for interleaving the
data after each range-FFT and deinterleaving prior to the Doppler-FFT) is small and usually acceptable.
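The DSP-side interleaving step is a simple matrix transpose (a plain C sketch; an optimized on-target version would likely use the C674x's packed-data instructions, consistent with the 576-cycle figure noted below):

```c
#include <stdint.h>

enum { N_RX = 4, N_BINS = 256 };

/* Interleave the range-FFT outputs of the 4 RX antennas so that each
 * range bin's 4 complex samples (4 x 32 bits = 128 bits) are adjacent.
 * A transpose-write EDMA with ACNT = 16 bytes then moves one full
 * range bin per 4-cycle burst, using the full 128-bit bus width.
 * Each int32_t holds one packed complex sample (16-bit I, 16-bit Q). */
static void interleave_rx(int32_t in[N_RX][N_BINS],
                          int32_t out[N_BINS][N_RX])
{
    for (int bin = 0; bin < N_BINS; bin++)
        for (int rx = 0; rx < N_RX; rx++)
            out[bin][rx] = in[rx][bin];  /* matrix A -> matrix B */
}
```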
Figure 13. Improving Efficiency of the Transpose Transfer
NOTE: The DSP can transpose a 4 × 256 matrix in 576 cycles (approximately 0.96 µs).
Multichirp Use Case
Multichirp refers to the use case where the ADC buffer has been configured to interrupt the DSP after a
fixed number of chirps have been collected. Figure 14 shows the timing diagram of a multichirp use case
where the ADC buffer interrupts the DSP after every four chirps have been collected. In contrast to
Figure 5 (which depicts a single-chirp use case), note the long gaps between ADC interrupts where the
DSP is not performing range-FFTs. The DSP could enter a sleep mode during these gaps in processing,
which can be a useful power-saving feature. Alternatively, the DSP could perform higher-level processing
(such as tracking) as a background process in these gaps.
Figure 14. Timing for Multichirp Use Case
While it would be easy to adapt the techniques discussed in Section 4.2.1 to the multichirp use case, a
slightly different approach is highlighted here. In the implementation shown in Figure 15, each row in the
ADC buffer (a single chirp for a single antenna) is transferred to L2 at a time, which keeps the L2 buffering
requirements to a minimum. The processed data from L2 is transferred to L3 as a contiguous transfer.
Two ping-pong buffer pairs respectively manage the data that is input to the DSP (from the ADC buffer)
and output to L3. Because both EDMA transfers are contiguous, the transfer latencies do not become a
bottleneck. The latency for each of these EDMA transfers (represented by A and C in Figure 15) is
0.32 µs, while the corresponding DSP processing latency (B) is approximately 2 µs. Because B > A + C,
the DSP never has to wait for an EDMA completion. In this example, the range-FFT data has been
transferred to L3 without a transpose. Consequently, the access for the Doppler-FFT will now involve a
transpose-read EDMA access. This is an overhead that must be factored in during the Doppler-FFT
processing.
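The no-stall condition B > A + C can be checked numerically (a sketch using the same 16-bytes-per-cycle contiguous rate at 200 MHz; the 2-µs figure for B is the approximate per-row range-FFT time quoted above):

```c
/* Per-antenna-row contiguous transfer time (32-bit complex samples,
 * 16 bytes per cycle at 200 MHz). */
static double contiguous_row_us(int n_cplx)
{
    return (n_cplx * 4 / 16.0) / 200.0;
}

/* The multichirp pipeline never stalls when the DSP's per-row
 * processing time B exceeds the combined in (A) and out (C) transfer
 * latencies, both contiguous here. */
static int dsp_never_waits(int n_adc, double b_us)
{
    double a_us = contiguous_row_us(n_adc);  /* ADC buffer row -> L2 */
    double c_us = contiguous_row_us(n_adc);  /* L2 -> L3             */
    return b_us > a_us + c_us;
}
```

For 256-sample rows, A = C = 0.32 µs, so B ≈ 2 µs comfortably satisfies the condition.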
Figure 15. Multichirp Use Case
Example Use Cases: Interframe Processing
Interframe processing primarily involves Doppler-FFT computation, detection, and angle estimation. The
data flow is largely determined by the way the range-FFT output has been stored in L3 memory during
intraframe processing, and three such cases are discussed.
• Case 1
If the range-FFT output has been placed in L3 using a transpose-write (as in for example, Figure 10)
then the data required for performing a Doppler-FFT is contiguously available. This represents the
simplest scenario for interframe processing, because all L3 accesses are now through contiguous
EDMA transfers. As shown in Figure 16, each line from L3 memory is transferred to the DSP local
memory (for example, L2) using a ping-pong scheme. After the Doppler-FFT has been computed, the
result can be optionally transferred back to L3 (not shown in Figure 16). During interframe processing
all the required input data is available in L3. Therefore, to prevent any stalling in the processing, it is
important that the EDMA transfer latencies keep up with the DSP-processing latencies. In the case
shown in Figure 16 (with 128 chirps), an EDMA transfer takes 0.16 µs, while a 128-pt FFT takes 0.86
µs. Consequently, the EDMA transfer should not be a bottleneck even in the case where the Doppler-FFT
output is written back to the radar-cube.
Figure 16. Interframe Processing Case 1
• Case 2
Figure 17 shows the interframe processing for the case where the range-FFT output is placed in L3
using the scheme suggested in Figure 13. Each line in L3 consists of data for a specific range-bin
interleaved across antennas. The data can be transferred from L3 using a pair of ping-pong buffers.
Because the FFT routines expect input data to be contiguously placed, the DSP must first perform a
deinterleaving operation before the Doppler-FFT computation.
Figure 17. Interframe Processing Case 2
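The deinterleaving step can be sketched as follows (a plain C version; the dimensions of 128 chirps and 4 antennas are taken from the examples above, and each int32_t holds one packed 16-bit I/Q sample):

```c
#include <stdint.h>

enum { NRX = 4, NCHIRPS = 128 };

/* Case 2: an L3 row holds one range bin with samples interleaved
 * across antennas per chirp ([rx0, rx1, rx2, rx3, rx0, ...]). The FFT
 * routines want each antenna's chirp sequence contiguous, so
 * deinterleave into per-antenna vectors before the Doppler-FFT. */
static void deinterleave_rx(int32_t in[NCHIRPS][NRX],
                            int32_t out[NRX][NCHIRPS])
{
    for (int c = 0; c < NCHIRPS; c++)
        for (int rx = 0; rx < NRX; rx++)
            out[rx][c] = in[c][rx];
}
```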
• Case 3
If the range-FFT output is stored contiguously in L3 (for example, as in Figure 15), then Doppler-FFT
processing will entail an EDMA transfer with transpose-read access, as shown in Figure 9. Taking the
example of a 128-pt FFT (16-bit), the fastest FFT takes 0.86 μs, while the corresponding EDMA
transfer (transpose-read access for 128 samples) takes only 0.64 µs. In addition to the Doppler-FFT
computation, the interframe processing will include other computations (such as computing the
magnitude or log-magnitude of the FFT result). Therefore, it is likely that the EDMA transfer will not be a
bottleneck and will be able to keep up with the DSP processing. However, performing a write-back of the
Doppler-FFT results to the radar-cube would involve a significant penalty (because it would require an
EDMA transpose-write, which in this example takes 2.56 µs). Therefore, if such a write-back is
required, this data flow architecture might not be suitable.
Discussion on Cache Strategy
Recall from Section 2.1 that L1 and L2 can partly or wholly be configured as cache (for both program and
data). L1P is a direct-mapped cache, L1D is 2-way set associative, and L2 is a 4-way set associative
cache [3]. Choice of an appropriate cache strategy can help achieve the desired trade-off between
performance, code size, and development time.
The simplest strategy follows:
1. Configure both the L1P and the L1D as cache.
2. Place scratch pad data and program in L2, and use L3 only to store the radar-cube.
Most users should be able to achieve the desired performance using this approach (with the overhead due
to caching being on the order of 15%). In the data flow discussions in Section 4, we implicitly assumed
that the various ping-pong buffers and other temporary buffers resided in L2 with the L1D cache enabled. The
C674x DSP supports a snoop-read and snoop-write [3] feature, which ensures cache coherency between
L1D and L2 in the presence of EDMA reads and writes to L2.
However, given the highly deterministic nature of the radar signal-processing chain, additional
improvements are achievable by configuring a portion of L1P and L1D as RAM. Key routines which
constitute the bulk of the radar signal processing, such as FFT and windowing routines, can be placed in
L1P, while the associated scratch pad buffers (such as some of the ping-pong buffers depicted in the
earlier data flows) can be placed in L1D. In one experiment, we placed all real-time framework and
algorithm code (up to clustering) into 24KB of L1P SRAM (leaving only 8KB for L1P cache) and saw a
modest improvement in the cycle performance of the chain. We also configured 16KB of L1D as data
SRAM (leaving 16KB for L1D cache) to hold temporary processing buffers (reused between range
processing and Doppler processing). By doing so, we improved the cycle performance of range processing
by 5% to 10%. We also saw much less cycle fluctuation in range processing due to a decrease in L1 cache
activity. The reduced size of the L1 cache did not seem to degrade the cycle performance of other
algorithms which worked out of buffers in L2. This placement of code and data in L1P and L1D also saved
40KB of L2 memory, without negatively impacting the cycle performance of the whole chain.
Alternatives to EDMA-based L3 access: The data flows discussed in Section 4 relied on the EDMA engine
to access data in L3. However, in some scenarios, direct access of L3 memory by the CPU might also be
considered. Each EDMA transfer has an overhead in terms of triggering the EDMA and subsequently
verifying its completion (either through an interrupt or polling). These overheads could swamp the
benefits of using an EDMA for small-sized transfers. Such a scenario could occur, for example, in the
context of an FMCW frame with a small number of chirps (16 or 32), resulting in small Doppler-FFTs.
One option is to fetch data for multiple Doppler-FFTs in every EDMA transfer. This reduces the EDMA
overhead, though at the cost of proportionally larger buffer sizes. In such cases, direct access to L3 (with
caching in L1D) could be a viable alternative. In one experiment with size-32 Doppler-FFTs, 4 RX
antennas, and 1024 range bins, we observed that direct L3 access (with the L1 cache on) performed
slightly faster than an EDMA-based approach. It also reduced the code complexity (by eliminating
EDMA triggering/polling). We also experimented with placing the pre-detection matrix in L3 memory with
direct CPU accesses (via cache). While we observed that the detection algorithm (CFAR) had a cycle
degradation of about 2× (versus the pre-detection matrix residing in L2), the freeing up of L2 memory might
make it a worthwhile trade-off in certain cases.
Code in L3: Customers also have the option of storing program code in L3. This option is typically suited
for code that runs once per frame (in one go) rather than being executed multiple times interleaved with
other code (the intent is to reduce the loss due to repeated code eviction). Examples include Kalman
filtering or detection. While actual results will vary depending on the nature of the code, our experience
running a Kalman filter from L3 shows a hit of about 2× for the first invocation and almost no overhead for
subsequent invocations (due to the program being cached in L1P).
Whenever L3 is being directly accessed by the CPU (either for code or data), multiple caching options are
possible. One option is to cache L3 directly to L1 (L1P, L1D). Another option is to create a hierarchical
cache structure, with a small part of L2 (about 32KB) also configured as cache.
References
1. IWR14xx/16xx Technical Reference Manual
2. TMS320C674x DSP CPU and Instruction Set Reference Guide
3. TMS320C674x DSP Megamodule Reference Guide
4. TMS320C6000 Optimizing Compiler User's Guide
5. TMS320C6748 Technical Reference Manual
6. TMS320C6000 Programmer's Guide
7. DSPLIB: Required DSPLIB libraries that must be installed for the C674x:
   • The fixed-point library for the C64x+
   • The floating-point library for the C674x (documentation included with the installation provides APIs
     for all the routines and a cycle benchmarking report)
8. mmwave lib: Install the mmwave-SDK and then navigate to \packages\ti\alg\mmwavelib to view both
   the documentation and code for mmwavelib.
9. Introduction to mm-wave sensing: FMCW Radars
Introduction to the DSP Subsystem in the IWR16xx
Copyright © 2017, Texas Instruments Incorporated
SWRA564 – May 2017
Submit Documentation Feedback