Application Report
SWRA621 – June 2018
Introduction to the DSP Subsystem in the IWR6843
Anil Mani, Sandeep Rao, Jasbir Nayyar, Mingjian Yan, and Brian Johnson
ABSTRACT
This application report introduces the digital signal processing subsystem (called the DSS) of the
IWR6843, discusses the key blocks of the DSS, provides benchmarks for the digital signal processor
(DSP), the Radar Hardware Accelerator (HWA) and enhanced direct memory access (EDMA), and
presents typical radar processing chains. It is intended for engineers who wish to implement signal
processing algorithms on the IWR6843.
Contents
1 Introduction
2 Overview of the DSP Subsystem (DSS)
3 Algorithm Chain and Benchmarks
4 Data Flow
5 Discussion on Cache Strategy
6 References
List of Figures
1 DSP Subsystem Overview
2 Memory Hierarchy in the DSS
3 Hardware Accelerator Block Diagram
4 Multidimensional Transfer Capabilities of the EDMA
5 ADC Buffers
6 FMCW Frame Structure
7 16-Bit Processing Chain Data Flow
8 32-Bit Processing Chain Data Flow
9 FFT SFDR Performance With and Without Dithering
10 Contiguous-Read Transpose-Write EDMA Access
11 Transpose-Read Contiguous-Write EDMA Access
12 Single-Chirp Use Case
13 Buffer Management for Data Flow
14 Improving Efficiency of the Transpose Transfer
15 Timing for Multichirp Use Case
16 Multichirp Use Case (DSP to HWA)
17 Multichirp Use Case (HWA to L3)
18 Interframe Processing Case 1
19 Interframe Processing Case 2
List of Tables
1 Abbreviations
2 C674x Benchmarks
3 HWA Benchmarks
4 List of FFT Routines in DSPLIB
5 MIPS (Cycles) Performance of FFTs
6 SQNR (dB) Performance of FFTs
7 'Steady State' EDMA Transfer Costs (in cycles)
Trademarks
C6000 is a trademark of Texas Instruments.
ARM is a registered trademark of ARM Limited.
Cortex is a registered trademark of ARM.
All other trademarks are the property of their respective owners.
1 Introduction
The IWR6843 device is the industry's first fully-integrated 57-GHz to 64-GHz radar-on-a-chip solution for industrial radar applications [1]. The device comprises the entire mmWave RF and analog baseband signal chain for three transmitters (TX) and four receivers (RX), two customer-programmable processor cores in the form of a C674x DSP and an ARM® Cortex®-R4F MCU, as well as a hardware accelerator (HWA) meant for offloading common radar algorithms.
This application report provides an introduction to the computational capabilities of the device, with a focus on the DSP and the HWA. It is targeted towards engineers looking to implement radar signal processing algorithms on the IWR6843, or port existing implementations from the IWR16xx [2] to the IWR6843.
The IWR6843 device is organized into different subsystems, each with a specific role. The mmWave RF and analog baseband signal chain are part of the 'Radar Subsystem'. The Radar Subsystem is not customer programmable, and can be controlled only through pre-defined APIs (defined in [3]). The remaining two subsystems are customer programmable: the 'Master Subsystem' (or MSS) and the 'DSP Subsystem' (or DSS).
The MSS contains an R4F processor (clocked at 200 MHz) and is expected to be used to control communication between the device and external equipment. Spare MIPS on the MSS can be used to run late-stage algorithms like trackers (Kalman filters, fusion (from different sub-frames), fencing, and so forth). However, since these algorithms are not 'core signal processing' (MIPS-intensive) algorithms, the MSS is not considered in this document.
The DSS contains the C674x DSP, the HWA, and some peripherals, and is expected to do the heavy lifting as far as FMCW processing is concerned; it alone is the focus of this user's guide.
The guide is organized as follows: Section 2 provides an overview of the DSS, including the capabilities of the DSP, the HWA, and the EDMA. Section 3 describes a few typical processing chains that can be implemented using the DSS, as well as benchmarks for common algorithms. Section 4 discusses the data transfer associated with radar processing. Section 5 provides some hints on organizing the cache on the DSP. Finally, Section 6 provides links and references to further information.
1.1 Abbreviations
Table 1 shows the abbreviations used in this document:
Table 1. Abbreviations

No.  Abbreviation  Expansion
1    DSP           Digital signal processor
2    HWA           Hardware accelerator
3    Fencing       Counting the number of detected objects in a given volume
4    EDMA          Enhanced DMA
2 Overview of the DSP Subsystem (DSS)
Figure 1 shows a programmer's view of the DSS, with callouts showing the suggested division of labor between the different modules of the DSS. The heart of the DSS is the 600-MHz C674x DSP core, which customers can program to run radar signal-processing algorithms. Two levels of memory (L1 and L2) are local to the DSP core. The L1 (32KB L1D, 32KB L1P) runs at the DSP clock rate of 600 MHz. The L2 (256KB) runs at 300 MHz (half the DSP clock rate).
Assisting the DSP is a hardware accelerator (HWA) that can take on common FMCW algorithms (FFT, log computation, CFAR-CA, and so forth), thereby lightening the load on the DSP. The HWA has four local buffers (ACCEL_MEM0…MEM3), each 16KB in size, and is clocked at 200 MHz.
Both the DSP and the HWA have access to external memories. These include the ADC buffer (which presents the digitized radar IF signal to the DSS), the L3 memory (clocked at 200 MHz and primarily used to store radar data), and a handshake RAM (which can be used to share data between the MSS and the DSS).
Finally, there are Enhanced-DMA (or EDMA [4]) engines that can be programmed by the DSP (or R4F) to
efficiently transport data between these external memories, the DSP’s L1 or L2, and the HWA’s local
memories.
Each of these components is described in more detail in the sub-sections below.
[Figure 1 shows the DSS blocks with their suggested roles: the C674x DSP (L1P 32KB, L1D 32KB, L2 256KB) for signal processing on the ADC data (interference mitigation), advanced detection algorithms, and higher-layer algorithms; the HWA with its four 16KB local memories (MEM0-MEM3) for range, Doppler, and azimuth FFTs and detection (CFAR-CA); the ping-pong ADC buffers (32KB each) receiving ADC data from the digital front end; the L3 memory (1MB), primarily for storing the radar-cube; the handshake RAM (32KB) for sharing data (for example, detected objects) between the DSP and the R4F (MSS); and the EDMA engine for efficient transfer of data between memories (ADC buffer, L2, L3, HWA MEM, handshake RAM).]

Figure 1. DSP Subsystem Overview
2.1 C674x DSP
The C674x DSP ([4] through [9]) belongs to the C6000™ family and combines the instruction sets of two
prior DSPs: C64x+ (a fixed-point DSP) and C67x (a floating-point DSP). The DSP in the IWR6843 runs at
600 MHz. The DSP has the following key features:
• Instruction parallelism: The CPU has eight functional units that can operate in parallel, which allows
for a maximum of eight instructions to be executed in a single cycle [4].
• Rich instruction set: In addition to instruction parallelism, the DSP supports a rich instruction set [4], which further speeds up signal processing computations. These instructions include single instruction multiple data (SIMD) instructions, which can operate on multiple sets of input data (for example, ADD2 can perform two 16-bit additions in one cycle and MAX2 can perform two 16-bit comparisons in one cycle). There are also other specialized instructions for commonly used signal-processing math operations (such as CMPY, a single-cycle complex multiplication instruction, specialized instructions for dot products, and so on).
• Optimizing C compiler: The C6000 family of DSPs comes with a very good optimizing C compiler [7] [8], which lets engineers develop highly efficient signal-processing routines using linear C code. The compiler schedules instructions to make the best possible use of the DSP's parallelism. The compiler also supports the use of intrinsics [7], which lets the programmer directly access specialized CPU instructions from within the C code (see the sketch below).
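As an illustration of intrinsics usage, the following is a minimal sketch (assuming the TI C6000 toolchain and its c6x.h header; the function name and loop are illustrative, not from DSPLIB) that uses the _add2() intrinsic so that each iteration performs two 16-bit additions in a single instruction:

    #include <c6x.h>  /* TI C6000 compiler intrinsics */

    /* Add two arrays of packed 16-bit pairs. Each 32-bit word holds two
       16-bit values, so every _add2() performs two 16-bit additions. */
    void add16x2(const int *restrict a, const int *restrict b,
                 int *restrict out, int nWords)
    {
        int i;
        for (i = 0; i < nWords; i++) {
            out[i] = _add2(a[i], b[i]);
        }
    }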
[Figure 2 shows the DSS memory hierarchy: L1P (32KB, 256-bit path) and L1D (32KB, 64-bit paths) connect to the two 128KB L2 program/data banks, which in turn connect to the 1024KB L3. L1 runs at the CPU clock (600 MHz), L2 at CPU/2 (300 MHz), and L3 at CPU/3 (200 MHz).]

Figure 2. Memory Hierarchy in the DSS
• Memory architecture: The DSP uses a 3-level memory architecture (L1, L2, and L3) (see Figure 2) [6].
– L1 memory is the closest to the CPU and runs at the CPU speed (600 MHz). L1 memory consists
of 32KB each of program and data memory commonly referred to as L1P and L1D, respectively.
– L2 memory consists of two 128KB banks, which can store both program and data. Each bank is accessible through its own memory port, which improves efficiency by allowing access to both L2 banks simultaneously (for example, data and program simultaneously, if kept on separate banks). L2 runs at a speed of CPU/2 (300 MHz). L1 and L2 memories reside within the DSP core.
– In addition, the DSP can also access an external memory (L3). In the IWR6843 device, L3 memory runs at 200 MHz and is accessible through a 128-bit interconnect. Both L1 and L2 memories can be configured (partly or wholly) as cache for the next higher level memory [6]. A typical radar application might configure L1P and L1D as cache, place the code and data in L2, and use L3 primarily for storing the radar data (as noted in Section 5, other variations are possible as well).
• Collateral: Besides the documents referenced above ([4] through [8]), there is other collateral available in the form of optimized libraries. The TI C6000 DSPLIB [9] is an optimized DSP function library for C programmers. The library includes many C-callable, optimized, general-purpose signal-processing routines and contains routines for fast Fourier transforms (FFTs), matrix and vector manipulation, filtering, and more. TI also provides an optimized library called the mmwaveLib (as part of the mmWaveSDK [10]) that complements the DSPLIB and includes additional signal processing routines that are commonly used in radar signal processing. Finally, signal processing chains demonstrating functionality are available as part of the mmWaveSDK.
2.2 Hardware Accelerator (HWA)
Radar processing flows mainly consist of repeated FFTs performed in order to resolve the ADC data in range (the range-FFT or 1D-FFT), velocity (the velocity-FFT or 2D-FFT), and angle (the azimuth-FFT or 3D-FFT). As such, if there were custom hardware to perform the FFT repeatedly, a significant proportion of the compute load on the DSP could be relieved. Standardized algorithms like detection algorithms, log computation, and so forth, also constitute a significant proportion of the DSP's load.
The HWA, once programmed, can independently complete all of these procedures without needing processor supervision. It is clocked at 200 MHz (one-third the speed of the DSP) but, for the algorithms it implements, its speed is comparable to (or better than) that of the DSP. The following sub-sections provide a brief overview of the HWA. Additional information can be found in the HWA user's guide [11]; relevant subsections of that guide are referred to in the next few sub-sections.
As shown in Figure 3, the HWA consists of four local memories (called Accelerator Local Memories), and
an ‘Accelerator Engine’. These sub-modules are discussed in detail below.
[Figure 3 shows the HWA block diagram: the four 16KB Accelerator Local Memories (ACCEL_MEM0-3), reached from the DMA/processor over a 128-bit bus interconnect, feed the Accelerator Engine, which consists of an Input Formatter, a Core Computational Unit operating on 24-bit I / 24-bit Q samples, an Output Formatter writing back to the local memories, a 512-byte Parameter-Set Config Memory, static (common) registers, and a State Machine that exchanges triggers with the DMA, the processor, and the ping-pong buffers.]

Figure 3. Hardware Accelerator Block Diagram
2.2.1 Accelerator Local Memories
These local memories (ACCEL_MEM0…3) are essentially 'scratch-pads' for the HWA. Data can be brought in from external memories (like the L3, the DSP L2/L1, the handshake RAM, and so forth) into any one of these local memories, and taken out after processing is complete. Each local memory is 16KB in size, and can therefore store at most 4096 complex 16-bit samples (or 2048 complex 32-bit samples).
Having four such local memories allows for ping-pong operation of the HWA. That is, one pair of memories can serve as the input and output buffers for the HWA while the other two can be used to bring in new data and take out processed data. For example, while data is being filled into ACCEL_MEM1 (from L3 using the EDMA), the HWA can read data from ACCEL_MEM0 and place the computed output in ACCEL_MEM2. Processed data in ACCEL_MEM3 can (in the same period) be taken out to L3 (again using an EDMA). Note that the same local memory cannot be used for bringing in data and for processing (or for taking the data out) at the same time.
Another point to note is that the HWA can (optionally) access the ADC buffer directly. In such a case, two of the local memories are disabled and, in their place, the first half (16KB) of the ADC buffer (both ping and pong) is directly readable. In other words, using this mode removes the need to bring the data from the ADC buffer to the accelerator memories for performing the FFT. This comes in handy in range-FFT processing (see Section 4.2.1 for an example).
2.2.2 Accelerator Engine
The Accelerator Engine consists of a 'Core Computational Unit', an 'Input Formatter', an 'Output Formatter', and a State Machine. The State Machine controls the other modules based on user-programmed registers called 'Parameter-Sets' and 'static registers'.
The 'Core Computational Unit' is the computing heart of the HWA and can perform the following operations:
• FFT computation with programmable FFT size (powers of 2, radix-2 stages) up to a 1024-pt complex FFT. The internal bit width of each butterfly stage is 24 bits (24-bit I and 24-bit Q), guaranteeing good SQNR performance. Each stage also has programmable scaling by 2 after the butterfly computation ('Core Computational Unit - FFT' section in [11]).
– It also has built-in capabilities for pre-FFT processing: specifically, programmable windowing, basic interference zeroing-out based on a pre-programmed threshold, and BPM removal.
• Constant false alarm rate - cell averaging (CFAR-CA) detection, including variants like CFAR-CASO (SO: smaller of) and CFAR-CAGO (GO: greater of). The detection inputs can be linear or logarithmic ('Core Computational Unit - CFAR Engine' section in [11]).
• Complex vector multiplication and dot product capability for vectors up to 512 in size.
• Magnitude (absolute value) and log-magnitude computation capability ('Core Computational Unit - Magnitude and Log-Magnitude Post-Processing' section in [11]).
The Core Computational Unit is designed as a streaming accelerator, meaning that in steady state it can produce one output sample per cycle (at 200 MHz).
The Input Formatter and the Output Formatter have two different functions:
• They exist to interface between the input (or output) data format and the internal HWA data format. For example, the input can be real (not complex), and either 16 or 32 bit, whereas the internal HWA format is fixed to be 24-bit complex. In such cases, the Input Formatter can be programmed to scale the real 16-bit (or 32-bit) input to the 24-bit complex data format. Similarly, the output can be either scaled up to 32 bit or down to 16 bit in the Output Formatter.
• They can read data from (and write data to) the local memories in either linear or transpose fashion. Unlike with the DSP or the EDMA, the transpose operation is essentially free (in the sense of cycles). This feature is discussed in Section 4, and is important for efficiently using the HWA.
The State Machine allows a programmer to chain and loop a sequence of accelerator operations one after another with no intervention from the main processor (the main processor being the R4F or the DSP). It is programmed by writing to two sets of registers: the parameter-set memories and the static (common) registers.
• Parameters corresponding to a sequence of computations (say, Doppler-processing or range-processing) need to be programmed into a 'parameter-set memory', which is itself segmented into 16 'parameter-sets'. Each parameter-set is simply a collection of registers. The state machine reads a parameter-set and (from its contents) controls the accelerator engine. A single parameter-set can be configured to do the following operations:
– The selection of a core operation (say, a windowed FFT with interference mitigation, or CFAR-CASO, or complex vector multiplication, and so forth).
– The control of the input and output formatters, including the scaling and access patterns. For example, the input/output formatters can be used to extract (or place) constituent 1D vectors (one after another) of an arbitrary 2D matrix existing in (or copied to) the local memories. Each subsequent call to a parameter-set can automatically update the indices of the vector that is extracted from (or emplaced to) the local memory.
– The number of times a parameter-set is to be used before moving to the next parameter-set (also called the 'BCNT' parameter). For example, if a 256-pt FFT has to be performed 512 times, then the BCNT can be set to 511, and the same parameter-set will run 512 times.
– Each parameter-set can also be stalled until a certain condition (say, a trigger from the DSP) is met.
• Static registers (or common registers) instruct the State Machine on which is the 'starting' parameter-set (out of the 16 available), the number of parameter-sets to run, and how many times (the number of loops) that subset of the parameter-sets is to be called.
The state machine then essentially reads each parameter-set, programs the accelerator, triggers the computation, waits until the computation specified by the parameter-set is complete (including the BCNT repetitions), and then moves on to the next parameter-set as per the static registers' values, until it finishes the final parameter-set.
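The C sketch below models this behavior purely for illustration; it is not device code, and wait_for_trigger()/run_core_operation() are hypothetical stand-ins for the hardware's trigger logic and core engine:

    typedef struct {
        int op;      /* core operation: FFT, CFAR-CA, vector multiply, ... */
        int bcnt;    /* the set runs (bcnt + 1) times                      */
        int trigger; /* condition to wait on before running (0 = none)     */
    } ParamSet;

    static void wait_for_trigger(int t)               { (void)t; } /* stub */
    static void run_core_operation(const ParamSet *p) { (void)p; } /* stub */

    void hwa_state_machine(const ParamSet *sets,
                           int start, int numSets, int numLoops)
    {
        for (int loop = 0; loop < numLoops; loop++) {       /* static regs: loop count   */
            for (int i = start; i < start + numSets; i++) { /* static regs: start, count */
                const ParamSet *p = &sets[i];
                wait_for_trigger(p->trigger);               /* optional stall            */
                for (int rep = 0; rep <= p->bcnt; rep++) {  /* BCNT repetitions          */
                    run_core_operation(p);                  /* engine computation        */
                }
            }
        }
    }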
NOTE: Both the EDMA and the HWA are controlled by what are called parameter-set registers (in the case of the EDMA, they are also called PaRAM sets). However, the EDMA and the HWA are separate IPs, and there is no relation between the PaRAM set for the EDMA and the parameter-set for the HWA. The register names for certain parameters are similar between the HWA and the EDMA; however, their use internally in their respective modules can be subtly different. For instance, certain count registers (such as SRCBCNT) are zero-based in the EDMA and one-based in the HWA.
2.2.3 Collateral
The Hardware Accelerator user's guide [11] covers the high-level architecture and key features such as windowing, the FFT, log-magnitude, the CFAR-CA detector, and complex multiplication. Additional collateral exists as part of the mmWaveSDK, which abstracts out the programming of the HWA with APIs. The use of these APIs is demonstrated in the out-of-box demo that is part of the mmWaveSDK ([10]).
2.3 Enhanced-DMA (EDMA)
The EDMA engine handles direct memory transfers between various memories in the IWR6843. In the context of radar signal processing, the amount of data to be stored per frame is typically much larger than what can be accommodated in the higher-level memories (such as L1, L2, or the HWA local memories). Consequently, the bulk of the data is usually stored in a lower-level memory (L3). The EDMA enables data to be brought in from the lower-level memory to the higher-level memories for processing and subsequently shipped back to the lower-level memory. Proper use of the EDMA features (such as multidimensional transfers, chaining, and linking, which are explained next) can ensure that data transfers across memories occur with minimal overhead. This subsection provides a concise overview of the EDMA and its capabilities. For a more detailed description of the EDMA, see [4].
An EDMA engine consists of a channel controller (CC) and one or more transfer controllers (TCs). The CC
schedules various data transfers, while the TC performs transfers dictated by requests submitted by the
CC. The IWR6843 device has two CCs, each with two TCs. Because multiple TCs can operate in parallel,
up to four transfers can be performed in parallel.
Parameters corresponding to memory transfers must be programmed into a parameter RAM (PaRAM)
table. The PaRAM table is segmented into PaRAM sets. Each PaRAM set includes fields for parameters
that define a DMA transfer, such as source and destination address, transfer counts, and configurations
for triggering, linking, chaining, and indexing. Each EDMA engine (each CC) supports 64 logical channels,
and the first 64 PaRAM sets are directly mapped to the 64 logical channels.
In the context of radar, the EDMA is used to extract (or emplace) matrices from (or to) memories on the device. For example:
• From the ADC buffer into L3
• From (or to) the DSP L1/L2 to (or from) the L3 or the HWA local memories
• From (or to) the HWA local memories to (or from) the L3 or the DSP L1/L2 memories
In most cases, these matrices correspond to ADC input or FFT outputs. They typically come from either one chirp or one range-gate, and their dimensions are typically 'ADC data' × 'number of receivers' or 'range-gate' × 'number of receivers'.
NOTE: Many matrices need to be read (or written) in 'transpose'. In such cases, the samples of the matrices are not contiguous in memory. While a 'transpose read' using the EDMA is about 4x slower than a 'linear read/write', a 'transpose write' can be substantially slower (by approximately 16x). For maximum throughput and efficient use of the device, care should be taken to minimize the overhead of EDMA reads and writes. For more details, see Section 4.
2.3.1 EDMA Features
The EDMA engine has a rich feature set, all of which can be programmed using the PaRAM sets. These features include the following:
• Multidimensional transfers: An EDMA data transfer is defined in terms of three dimensions.
– The first dimension (A transfer) transfers a contiguous array of ACNT bytes from source to destination.
– The second dimension (B transfer) transfers BCNT arrays (each of size ACNT bytes). Subsequent arrays in the B transfer are separated by SRCBIDX (DSTBIDX) bytes at the source (destination).
– The third dimension (C transfer) transfers CCNT frames, with each frame consisting of BCNT arrays. Subsequent frames are separated by SRCCIDX (DSTCIDX) bytes at the source (destination). The transfer counts (ACNT, BCNT, and CCNT) and indices (SRCBIDX, and so on) are programmed in the PaRAM set.
NOTE: As stated in Section 2.2.2, the input/output formatters of the HWA are also capable of 2-dimensional transfers from and to the local memories of the HWA. The registers in the HWA's parameter-sets reuse some of the register names of the EDMA, including SRCACNT, BCNT, SRCBIDX, DSTBIDX, and so forth. They should not be mistaken for the EDMA registers.
These multidimensional transfers allow considerable flexibility in defining the data streaming from the input
to the output. For example, a matrix can be picked up from a source location and stored in a transpose
fashion at the destination. This is shown in Figure 4, where a 4 × 5 matrix (each element 4 bytes in size)
is transposed at the destination.
Figure 4. Multidimensional Transfer Capabilities of the EDMA
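As a worked example of these counts and indices, the sketch below computes the PaRAM fields for the transpose of Figure 4: a 4 × 5 matrix of 4-byte elements, read row-wise and written column-wise. The struct is illustrative (not the device register layout), and the field names simply follow the EDMA user's guide:

    #include <stdint.h>

    typedef struct {
        uint16_t acnt, bcnt, ccnt;   /* transfer counts                 */
        int16_t  srcBIdx, dstBIdx;   /* 2nd-dimension strides, in bytes */
        int16_t  srcCIdx, dstCIdx;   /* 3rd-dimension strides, in bytes */
    } ParamSketch;

    ParamSketch transpose4x5(void)
    {
        enum { ROWS = 4, COLS = 5, ELEM = 4 /* bytes per element */ };
        ParamSketch p;
        p.acnt    = ELEM;        /* one element per A transfer               */
        p.bcnt    = COLS;        /* one source row = COLS elements           */
        p.ccnt    = ROWS;        /* ROWS rows (frames) in total              */
        p.srcBIdx = ELEM;        /* source: step along the row               */
        p.dstBIdx = ROWS * ELEM; /* dest: consecutive elements land one      */
                                 /* column apart, which transposes the data  */
        p.srcCIdx = COLS * ELEM; /* source: start of the next row            */
        p.dstCIdx = ELEM;        /* dest: next column starts ELEM bytes on   */
        return p;
    }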
• Chaining: Chaining is a mechanism by which the completion of one EDMA transfer (corresponding to one EDMA channel) automatically triggers another EDMA channel. The chaining mechanism helps to orchestrate a sequence of DMA transfers.
• Linking: Once the transfers corresponding to a PaRAM set have been completed (and its counter fields such as ACNT, BCNT, and CCNT have decremented to 0), the EDMA provides a mechanism to load the next PaRAM set. The 16-bit parameter LINK (in each PaRAM set) specifies the byte address offset from which the next PaRAM set should be reloaded. Once reloaded, the PaRAM set is ready to be triggered again. The linking feature eliminates the need to reprogram the PaRAM set for subsequent transfers. It is also useful in programming the EDMA so that subsequent transfers alternate between ping-pong buffers. The first 64 PaRAM sets are directly mapped to the corresponding EDMA channels. The byte address offset programmed in LINK is, however, free to point to a PaRAM set beyond these first 64 (the two CCs in the device support 128 and 256 entries, respectively, in their PaRAM tables).
• Triggering: Once an EDMA channel is programmed (using the PaRAM set), it can be triggered in multiple ways. These ways include:
– Event-based triggering (for example, a chirp available interrupt from the ADC buffer, see
Section 2.4.1)
– Software triggering by the DSP CPU
– Chain triggering by a prior completed EDMA
The previously listed features of the EDMA are ideally suited for performing data transfers in the context of
radar signal processing. For example, radar processing involves multidimensional FFT processing, which
requires data to be rearranged along various dimensions (range bins, Doppler bins, and antennas). The
multidimensional transfers supported by the EDMA are very useful in this context. The PaRAM set-based
programming model allows multiple transfers to be programmed up front. For example, EDMA transfers
from the ADC buffer to the DSP core/HWA local memories, and between the DSP core/HWA local
memories and the L3 memory can all be programmed up front in various logical channels. Subsequently,
these channels are triggered during the data flow. Further, the linking feature allows automatic reloading
of PaRAM sets (for example, for each successive frame), thus eliminating the need for the CPU to
periodically reprogram the EDMA channels.
2.4 External Interfaces

2.4.1 ADC Buffer
Digitized samples from the digital front end (DFE) are stored in the ADC buffer. The ADC buffer exists as
a ping-pong pair, each 32KB in size. So when the DFE is streaming data to the ping (respectively, pong)
buffer, the DSP/HWA has access to the pong (respectively, ping) buffer. The number of chirps that can be
stored in a buffer is programmable. When the DFE has streamed the programmed number of chirps into a
buffer, a chirp available interrupt is raised and the DFE switches buffers. This interrupt can serve as a cue
to the DSP/HWA to begin processing the stored data. Alternatively, the interrupt can also be used to
trigger an EDMA transfer of the ADC samples to the local memory of the DSP/HWA (such as L1, L2, HWA
local memories), with the EDMA subsequently interrupting the DSP/HWA on completion of the transfer.
In most cases, the first 'signal processing' algorithm run on the ADC data is an FFT. The HWA can be configured to read the ADC buffer directly and perform FFTs on its contents. In such a mode, two of the HWA's local memories are replaced by the ADC buffer (ping and pong). The advantage of this configuration is that the latency associated with moving data from the ADC buffer to the local memory is removed. There is an associated disadvantage, though: the size of the ADC buffer is halved from 32KB to 16KB.
The HWA can perform a host of pre-processing algorithms prior to the FFT, including a single-threshold 'interference mitigation' algorithm. This algorithm zeroes out any sample in the ADC data that exceeds a pre-programmed threshold value. However, this simple algorithm may not be sufficient.
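For reference, here is a minimal sketch of that single-threshold zeroing as it might be written for the DSP when prototyping a more elaborate scheme (the function name and threshold handling are illustrative):

    #include <stdint.h>
    #include <stdlib.h> /* abs() */

    /* Zero out any ADC sample whose magnitude exceeds the threshold. */
    void zero_interference(int16_t *adc, int numSamples, int16_t thresh)
    {
        for (int i = 0; i < numSamples; i++) {
            if (abs(adc[i]) > thresh) {
                adc[i] = 0;
            }
        }
    }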
Figure 5 (left side) shows an ADC buffer that has been programmed to store one chirp per ping-pong. The
four colored rows refer to the ADC samples corresponding to the four antennas (the four RX channels). In
Figure 5 (right side), the ADC buffer has been programmed to store four chirps per ping-pong. The data
for each antenna (across the four chirps) is stored contiguously.
Figure 5. ADC Buffers
2.4.2 L3 Memory
In the IWR6843 device, up to 1MB of the L3 memory is available for use by the DSP or HWA. The primary use of the L3 memory is to store the radar data arising from the ADC samples corresponding to a frame (the radar-cube). The L3 memory can be accessed via a 128-bit interconnect at speeds of up to 200 MHz (or 16 contiguous bytes per cycle at 200 MHz in 'steady state').
2.4.3 Handshake Memory
An additional 32KB of memory is available at the same hierarchy level as the L3 memory. The primary intent of this separate memory is to provide a place for sharing data with other masters, such as the R4F or the DMA in the MSS, without interfering with L3 memory access efficiency. One prime use is to share the final object list with the R4F through this RAM.
3 Algorithm Chain and Benchmarks
This section provides benchmarks for some common radar signal-processing routines for both DSP and
HWA. This section also examines the DSP loading in the context of some typical processing chains.
3.1 DSP Benchmarks
These benchmarks assume the data and program to be in L1.
Table 2. C674x Benchmarks

Operation                                            Cycles        Timing (µs) at 600 MHz   Source and Function Name
128-point FFT (16 bit)                               516           0.86 µs                  DSPLIB (DSP_fft16x16)
256-point FFT (16 bit)                               932           1.55 µs                  DSPLIB (DSP_fft16x16)
128-point FFT (32 bit)                               956           1.59 µs                  DSPLIB (DSP_fft32x32)
Windowing (16 bit)                                   0.595 N + 70  0.37 µs (for N = 256)    mmwavelib (mmwavelib_windowing16x16)
Windowing (32 bit)                                   N + 67        0.32 µs (for N = 128)    mmwavelib (mmwavelib_windowing16x32)
Log2abs (16 bit)                                     1.8 N + 75    0.89 µs (for N = 256)    mmwavelib (mmwavelib_log2Abs16)
Log2abs (32 bit)                                     3.5 N + 68    0.86 µs (for N = 128)    mmwavelib (mmwavelib_log2Abs32)
CFAR-CA detection                                    3 N + 161     0.91 µs                  mmwavelib (mmwavelib_cfarCadB)
Maximum of a vector of length 256                    70            0.12 µs                  DSPLIB (DSP_maxval)
Sum of complex vector of length 256 (16-bit I, Q)    169           0.28 µs                  –
Multiply two complex vectors of length 256 (16-bit)  265           0.44 µs                  –

3.2 HWA Benchmarks
The HWA is a streaming accelerator clocked at 200 MHz. So, in steady state (once the initialization delays are accounted for), it can compute one output sample per cycle (one every 5 ns) for any algorithm that it runs. Since the initialization delays are algorithm-specific, a selection of benchmarks for common algorithms is provided.
Table 3. HWA Benchmarks

Operation                                          Approximate Cycles   Timing (µs) at 200 MHz                                          Notes
N-point windowed FFT repeated k times              N*(k+1) + 1          3.2 µs (for N = 128, k = 4); 6.4 µs (for N = 256, k = 4)        An N-pt FFT in steady state takes N cycles; windowing adds 1 extra cycle.
N-sample log-magnitude                             3 + N                1.25 µs (N = 256)                                               –
N-sample CFAR-CA (window size w, guard length g)   (w + g)*2 + 1 + N    1.4 µs (N = 256, w = 8, g = 4); 0.8 µs (N = 128, w = 8, g = 4)  –
In the above benchmarks, it is important to note that the time it takes for the data to be brought into the local memories of the HWA has not been taken into account. However, since the HWA has four local memories, it is possible to configure two of them as ping memories (an input buffer and an output buffer) and the remaining two as pong memories (similarly organized as input and output buffers). So, while the HWA processes data in the ping input buffer and stores the result into the ping output buffer, data can be brought in via the EDMA to the pong input buffer, and processed data can be taken out through the pong output buffer. In such a case, the EDMA delays would only occur on the first run of each algorithm (see the sketch below).
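The sketch below illustrates the buffer roles in this ping-pong scheme (the edma_*/hwa_* calls are hypothetical stand-ins; in hardware, the transfers and the HWA run concurrently, coordinated by triggers, rather than sequentially as written):

    static void edma_fill_input(int buf)   { (void)buf; } /* stub */
    static void hwa_process(int buf)       { (void)buf; } /* stub */
    static void edma_drain_output(int buf) { (void)buf; } /* stub */

    void process_blocks(int numBlocks)
    {
        int cur = 0;                         /* 0 = ping, 1 = pong     */
        edma_fill_input(cur);                /* prime the first buffer */
        for (int blk = 0; blk < numBlocks; blk++) {
            int next = cur ^ 1;
            if (blk + 1 < numBlocks) {
                edma_fill_input(next);       /* stage the next block   */
            }
            hwa_process(cur);                /* FFT/CFAR on current    */
            edma_drain_output(cur);          /* ship result out        */
            cur = next;                      /* swap ping and pong     */
        }
    }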
3.3 Radar Signal Processing Chain
The fundamental transmission unit of an FMCW radar is a frame, which consists of a number (say, Nchirp) of equispaced chirps (see Figure 6). Central to FMCW radar signal processing is a series of three FFTs, commonly called the range-FFT, Doppler-FFT, and angle-FFT, which respectively resolve objects in range, Doppler (relative velocity with regard to the radar), and angle of arrival. The range-FFT is performed across the ADC samples of each chirp (one range-FFT for each RX antenna). The range-FFT is usually performed inline (during the intraframe period, as the samples corresponding to each chirp become available). The Doppler-FFT operates across chirps and can only be performed when the range-FFTs corresponding to all the chirps in a frame have become available. Lastly, the angle-FFTs are performed on the range-Doppler processed data across RX antennas. For more in-depth information on FMCW signal processing, see [12].
Figure 6. FMCW Frame Structure
Three similar signal processing chains that can be implemented on the IWR6843 are described. The first maintains (by appropriately scaling the FFT) the bit width of the output at 16 bits even after the Doppler-FFT; the second allows the Doppler-FFT output to grow to more than 16 bits (32-bit output); and the third uses a 32-bit floating-point chain from the Doppler-FFT onwards to allow for increased dynamic range.
3.3.1 16-Bit Processing Chain
Figure 7 is a representative signal processing chain for FMCW radar. The processing is performed on each frame of received data. The ADC samples corresponding to each chirp (across the 4 RX antennas) are received in the ADC buffer. Colors in the figure indicate whether the processing should be performed on the DSP or the HWA (or either).
[Figure 7 shows the data flow: ADC buffer → interference mitigation (DSP L1) → 16-bit range-FFT (HWA) → radar-cube in L3 (indexed along range, chirps, and antennas) → 16-bit Doppler-FFT (HWA, in-place in L3) → non-coherent summation (HWA) → pre-detection matrix (L3/L2, indexed along range and Doppler) → detection and angle-FFT (DSP/HWA) → tracking/clustering (DSP) → handshake memory.]
Figure 7. 16-Bit Processing Chain Data Flow
The first step on the ADC data of a chirp is typically an 'interference mitigation' algorithm. A simple version (zeroing out based on a fixed threshold) can be performed on the HWA. However, more complicated algorithms may be necessary, and these need to be done on the DSP. Data is piped in (via an EDMA) from the ADC buffer into the DSP's L1/L2, where this algorithm is performed. The output of the algorithm is then sent (via another EDMA) to the HWA. The HWA performs a sequence of FFTs (one for each RX antenna), and the range-FFT output is then placed in L3 memory. The data in L3 memory is best visualized as a 3-dimensional matrix indexed along range, chirps, and antennas (the 'radar-cube').
Once all the range-FFTs have been computed for all the chirps in a frame, Doppler-FFTs are performed. For each range-bin (and for each antenna), a Doppler-FFT is performed across the Nchirp chirps. In the example of Figure 7, both the range- and Doppler-FFTs are 16-bit FFTs. Hence, the results of the Doppler-FFT can be stored in place (overwriting the corresponding range-FFT samples) in the radar-cube in L3 memory. At the end of the Doppler-FFT processing, the radar-cube is now indexed along range, Doppler, and antennas. Note that the FFT's processing gain can result in the output of the Doppler-FFT exceeding 16 bits (leading to saturation).
Care should be taken, either by configuring the HWA FFT hardware's internal scaling or by scaling after the FFT (in the output formatter), to keep the output within 16 bits. There is no automatic scaling in the HWA FFT.
Immediately after the Doppler-FFT for a particular range-gate (while the output is still in the local memory), the HWA does additional processing. It creates a 'pre-detection matrix' by non-coherently summing the Doppler-FFT output along the antenna dimension (a sketch of this step follows). This is then stored back in L3 in an array that is indexed along range and Doppler. The pre-detection matrix is much smaller than the radar-cube. In the example above, the pre-detection matrix would be 1/8th the size of the radar-cube: the collapsing along the antenna dimension contributes a factor of 1/4, and the fact that the pre-detection matrix is real, while the radar-cube is complex, contributes a factor of 1/2. Consequently, the pre-detection matrix can be stored either in L3 or L2.
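A minimal sketch of this non-coherent summation for one range-gate, assuming log-magnitude Doppler-FFT outputs (the names and types are illustrative):

    #include <stdint.h>

    #define NUM_RX 4

    /* logMag: NUM_RX x numDoppler log-magnitudes, row-major.
       preDetRow: one row (range-gate) of the pre-detection matrix. */
    void noncoherent_sum(const uint16_t *logMag,
                         uint16_t *preDetRow, int numDoppler)
    {
        for (int d = 0; d < numDoppler; d++) {
            uint32_t acc = 0;
            for (int rx = 0; rx < NUM_RX; rx++) {
                acc += logMag[rx * numDoppler + d]; /* sum across antennas */
            }
            preDetRow[d] = (uint16_t)acc; /* 16-bit real, one per Doppler bin */
        }
    }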
One more algorithm can be run on the 'pre-detection matrix' (again, while it is still in the HWA local memory): a detection algorithm along the Doppler direction. This algorithm identifies bins that correspond to valid objects. If the detection algorithm is one of the supported variants of CFAR-CA, then it can be performed by the HWA. The output of the algorithm is a list of detected objects per range-gate.
If the DSP has to perform the detection algorithm, it is better to let the HWA complete the Doppler-processing of all the range-gates and then interrupt the DSP on completion. The DSP can then pipe in the pre-detection matrix and perform detection on it.
Finally, once the Doppler-processing is complete, a number of other algorithms need to be run. A final detection algorithm (in the range dimension) is performed on the 'pre-detection matrix' at certain indices to create a final list of detected objects. These indices correspond to the detection output of the Doppler-processing. Such an algorithm is better performed on the DSP, since it only has to be done on a few indices, and not on the entire pre-detection matrix (see the sketch below).
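As a sketch of such per-index detection, the following is a basic cell-averaging CFAR evaluated at a single bin in the log domain, using the window size w and guard length g notation of Table 3 (illustrative only; the mmwavelib routine differs in its details):

    #include <stdint.h>

    /* Returns 1 if x[idx] exceeds the average of the training cells plus
       an offset. In the log-magnitude domain, the multiplicative CFAR
       threshold becomes an additive offset. */
    int cfar_ca_point(const uint16_t *x, int n, int idx,
                      int w, int g, uint16_t threshOffset)
    {
        uint32_t noise = 0;
        int cnt = 0;
        for (int k = g + 1; k <= g + w; k++) {  /* skip g guard cells,    */
            if (idx - k >= 0) { noise += x[idx - k]; cnt++; } /* average  */
            if (idx + k < n)  { noise += x[idx + k]; cnt++; } /* w cells  */
        }                                       /* on each side           */
        noise /= (uint32_t)cnt;
        return x[idx] > noise + threshOffset;
    }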
For each such object, an angle-FFT (on the HWA) is then performed on the corresponding range-Doppler
bins in the radar-cube to identify the angle of arrival of that object.
Finally, the estimated parameters for the detected objects, such as range, Doppler, and angle of arrival, can be shared with the MSS via the handshake memory. There is also an option to perform further processing, such as clustering, tracking, and classification, on the DSP itself.
Since the DSP is not likely to be used during the intraframe period and during much of the interframe processing period, it should have sufficient MIPS to perform these algorithms. In fact, it can continue to do tracking/classification in a lower-priority thread, while a higher-priority thread is used to control and finish the intraframe processing.
Alternatively, these higher-layer algorithms can be performed on the MSS (the R4F MCU).
3.3.1.1 Memory and Computational Requirements
The memory and computational requirements can be computed for a representative FMCW frame structure using the processing flow indicated in Figure 7. A frame with 128 chirps (Nchirp = 128) was considered, with each chirp having 256 samples (Nadc = 256). With 4 RX antennas, the radar-cube memory requirement is 256 samples × 128 chirps × 4 antennas × 4 bytes/sample = 512KB (a quick check of this arithmetic follows below). There are processing requirements during two distinct phases: intraframe processing and interframe processing. The intraframe period (or the active period) is the period during which the chirps of one frame are being generated, transmitted, and received. Once all chirps of a frame have been transmitted, the interframe period is entered, during which the RF is idling. The recommended ratio of the active period to the interframe period is 50%.
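A quick check of the radar-cube sizing arithmetic (complex 16-bit I/Q, so 4 bytes per sample):

    #include <stdio.h>

    int main(void)
    {
        const int nAdc = 256, nChirp = 128, nRx = 4;
        const int bytesPerSample = 4; /* 16-bit I + 16-bit Q */
        int cubeBytes = nAdc * nChirp * nRx * bytesPerSample;
        printf("radar-cube: %d KB\n", cubeBytes / 1024); /* prints 512 KB */
        return 0;
    }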
• Intraframe processing: Each chirp, once completed, will consist of up to 4 RXs' worth of ADC data. From Table 3, a sequence of four 256-pt range-FFTs with windowing takes 6.4 µs. Note that if the FFT were performed on the DSP, the same computation would take approximately 7.7 µs (each windowing-plus-FFT operation takes 1.55 + 0.37 = 1.92 µs from Table 2; for 4 RX antennas, 1.92 × 4 ≈ 7.7 µs). Since typical chirp periodicities (Tc in Figure 6) are on the order of tens of µs, range-FFT processing can be comfortably performed in the time (Tc) between the arrival of ADC samples from consecutive chirps. For example, with a sampling rate of 10 MHz and Nadc = 256, Tc would be at least 256/10 = 25.6 µs (excluding the interchirp idle time of a few µs).
The range-FFT output is then stored in L3 to be fetched back for performing the Doppler-FFT. Timing considerations for this processing are covered in Section 4.
• Interframe processing: The first step in interframe processing is the Doppler-FFT. A 128-pt Doppler-FFT needs to be performed for each of the 256 range-gates and across the 4 RX antennas. Again from Table 3, four 128-pt Doppler-FFTs take 3.2 µs. Each Doppler-FFT can be immediately followed by the log-magnitude computation (of the 4 FFT outputs), which would take approximately 2.6 µs; this can be followed by a computation of the log-magnitude sum (again 2.6 µs). Since the sum would be present in the local memory, CFAR-CA in the Doppler dimension can also be done, taking roughly 0.8 µs. The cumulative time taken for these operations is 9.2 µs. These four operations have to be repeated 256 times, taking approximately 2.4 ms (256 × 9.2 µs) for the full interframe processing.
The operation would result in a pre-detection matrix of 256 × 128 samples, each sample being a 16-bit real (positive) number. The first dimension corresponds to range, and the second dimension to Doppler.
NOTE: Note that this computation ignores further downstream processing that is performed on the detected objects, such as additional detection steps to reconfirm initial detections, angle estimation, SNR computation, clustering, tracking, and so forth.
Also, the timing computation assumes that there is no delay in the memory access. Concerns relating to memory access and data flow are discussed in Section 4.
3.3.2 32-Bit Processing Chain
In the chain described in Figure 7, the Doppler-FFT processing did not increase the size of the radar-cube
(allowing the Doppler-FFT output to be written over the range-FFT). There could be cases where
performance requirements might dictate a higher dynamic range at the output of the Doppler-FFT. For
more information, see the discussion in Section 3.4. Figure 8 depicts a possible processing chain in such
a scenario. While the range-FFT is still 16-bit, the Doppler-FFT has an increased dynamic range and
outputs 32-bit complex samples.
[Figure 8 shows the data flow: ADC buffer → interference mitigation (DSP L1) → 16-bit range-FFT (HWA) → radar-cube in L3 → 24-bit Doppler-FFT (HWA) → non-coherent summation (HWA) → pre-detection matrix (L3/L2, indexed along range and Doppler) → detection, angle-FFT, and tracking/clustering → handshake memory. The Doppler-FFT output is not written back to L3.]
Figure 8. 32-Bit Processing Chain Data Flow
While these samples could be stored back in the radar-cube, doing so would double the radar-cube size in L3. Hence, the chain in Figure 8 follows a different approach. Once the Doppler-FFT is computed, it is non-coherently summed across antennas to create a pre-detection matrix. This result is not stored back in L3. Once the detection algorithm is run, and objects have been identified, a Doppler-DFT (followed by the angle-FFT) is selectively recomputed only for the detected objects.
In most cases, using a 32-bit FFT is overkill, and the 24-bit HWA FFT (whose performance is shown in Section 3.4.1) would suffice, especially since the HWA FFT allows programmable scaling at the output of each butterfly, preventing overflow. However, if a 32-bit FFT is necessary, then it has to be performed on the DSP. A 32-bit windowed FFT on the DSP is more than 2x slower than the 24-bit FFT on the HWA. More specifically, four 128-pt FFTs would take (1.59 + 0.32) × 4, or 7.6 µs, on the DSP, whereas they would take only 3.2 µs on the HWA.
3.3.2.1 Memory and Computational Requirements
The memory requirements are exactly the same as for the 16-bit processing chain, since only the 16-bit range-FFT output is stored. Hence, for a frame with 128 chirps (Nchirp = 128), with each chirp having 256 samples (Nadc = 256) and with 4 RX antennas, the radar-cube memory requirement is 256 samples × 128 chirps × 4 antennas × 4 bytes/sample = 512KB.
• Intraframe processing: Intra-chirp processing would take the same number of cycles as the 16-bit processing chain (6.4 µs), since the HWA would be used for the 1D-FFT.
• Interframe processing: If the HWA is used, the cycle counts would remain unchanged from the 16-bit processing. If the DSP is solely used for the interframe processing, then you need to use the DSP benchmarks to compute the time taken. For a 128-pt 32-bit Doppler-FFT (performed for each of the 256 range-bins and across the 4 RX antennas), the time taken would be (1.59 µs + 0.32 µs + 0.86 µs + 0.28 µs) × 256 × 4, or 3.2 ms. (The times for the 32-bit FFT, 32-bit windowing, and 32-bit log magnitude are taken from Table 2.) The output of this collective set of functions would be the pre-detection matrix of 256 × 128 samples, each sample being a 16-bit real (positive) number. The first dimension corresponds to range, and the second dimension to Doppler. Subsequently, a detection algorithm (say, CFAR-CA) is run on the matrix. Running this detector along each of the 256 × 4 Doppler lines would take 0.9 µs × 256 × 4 ≈ 1 ms.
Hence, the core computation blocks in 32-bit interframe processing would collectively take about 4.2 ms.
3.3.3 Floating-Point Processing Chain
In the processing chains described so far, the low-level processing (up to the angle-FFT) uses fixed-point arithmetic. (Subsequent higher-layer processing, such as clustering and tracking, is typically done in floating point.) However, it is also possible to construct a MIPS-efficient chain where all the processing starting from the Doppler-FFT is in floating point, as explained below. In such a case, the HWA cannot be used, as it has no floating-point capability.
The C674x DSP provides a set of floating-point instructions that can accomplish addition, subtraction, multiplication, and conversion between 32-bit fixed point and floating point in a single cycle for single-precision floating point, and in one to two cycles for double-precision floating point. There are also fast instructions that calculate the reciprocal and reciprocal square root in a single cycle with 8-bit precision. With one or more iterations of Newton-Raphson interpolation, higher precision can be achieved in roughly 10 to a few tens of cycles (see the sketch below).
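For example, a single-precision reciprocal can be refined as follows (a sketch assuming the TI C6000 toolchain: _rcpsp() provides the roughly 8-bit seed, and each Newton-Raphson step roughly doubles the number of accurate bits):

    #include <c6x.h> /* TI C6000 compiler intrinsics */

    static inline float recip_sp(float a)
    {
        float x = _rcpsp(a);     /* ~8-bit accurate seed, single cycle   */
        x = x * (2.0f - a * x);  /* 1st Newton-Raphson step (~16 bits)   */
        x = x * (2.0f - a * x);  /* 2nd step: near full single precision */
        return x;
    }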
Another advantage of using floating-point arithmetic is that users can maintain both the precision and the dynamic range of the input and output signal without spending cycles rescaling intermediate computation results. It also enables them to skip (or reduce) the re-qualification of the fixed-point implementation of an algorithm, which makes algorithm porting easier and faster.
3.3.3.1 Memory and Computational Requirements
Floating-point FFTs on the DSP are almost as efficient as 32-bit fixed-point FFTs (see Table 2; both are more than twice as slow as the HWA FFT). Hence, the times computed for the 32-bit processing chain (interframe processing taking approximately 4.3 ms) remain roughly valid.
3.4 FFT Performance
FFT processing forms a significant part of the lower level radar signal processing in FMCW radar.
3.4.1 HWA FFT Performance
Details of the HWA's FFT can be found in [11]. In summary, the HWA's FFT takes a 24-bit complex input (24-bit real, 24-bit imaginary) and produces a 24-bit complex output. It is a pipelined FFT consisting of up to 10 radix-2 stages and is capable of a sustained throughput of 1 complex sample per cycle (5 ns).
Since the FFT is fixed point, and the pipeline bit width is a fixed 24 bits, the processing gain of each butterfly stage should be taken into account. At the output of each radix-2 stage of the FFT, bit growth by one bit can occur. To prevent overflow, the output of each butterfly can be divided by 2 (rounding off one LSB) or saturated to 24 bits. There is no auto-scaling, and the stages at which division/saturation is to occur have to be decided up front.
Applying a window to the input prior to the FFT happens inline, meaning that the cycle cost of windowing is only one cycle (5 ns), no matter the number of FFTs being performed per parameter set.
Figure 9. FFT SFDR Performance With and Without Dithering
The SFDR performance of the HWA’s FFT can be better than approximately 150 dBc.
3.4.2 DSP FFT Performance
It is recommended that all FFTs be performed using the HWA. However, in certain cases, it may be necessary for the DSP to perform FFTs. For example, it can happen that the HWA is fully loaded (performing the basic processing for the current frame), and the DSP is idle and can perform angle-FFTs for the previous frame. Efficient DSP-FFT routines for such cases are provided in the DSPLIB. The following paragraphs describe the different DSP-FFT implementations (both fixed- and floating-point variants) that have been specifically optimized for the C674x DSP.
The fastest routine is DSP_fft16x16(), which takes 0.86 µs (128-point FFT) and 1.55 µs (256-point FFT) on a C674x running at 600 MHz. It operates on a 16-bit complex input to produce a 16-bit complex output. It accesses a precomputed twiddle factor table, which is also stored as 16-bit complex numbers. This FFT uses multiple radix-4 butterfly stages, with the last stage being either a radix-2 or radix-4 depending on whether the FFT length is an odd or even power of 2. Note that every radix-4 stage (except the last stage) has a scaling by 2. There is no scaling in the last stage. So there is a bit growth (worst case) of 1 in every (radix-4) stage preceding the last stage. The last stage has a bit growth of 1 or 2, depending on whether it is a radix-2 or radix-4. Thus, for a 128-point FFT this routine has a bit growth of 4 (= 1+1+1+1) and can handle a 12-bit signed input with no clipping, resulting (for a tone at the input) in a full-scale peak at the FFT output.
In most radar applications, the 16 × 16 FFT routine previously described suffices for the first-dimension (range) processing. However, subsequent FFT processing stages could involve bit growth that extends beyond 16 bits. The use of the 32-bit FFT routines available in DSPLIB allows these subsequent stages to realize their full processing gain without being limited by either bit overflow or the quantization noise of the FFT routine. As an example, DSP_fft32x32() works on 32-bit complex inputs producing 32-bit complex outputs, and uses a precomputed twiddle factor table also stored as 32-bit complex numbers.
A list of some key FFT routines available with DSPLIB is provided in Table 4. Additionally, the mmwave library [10] also has some useful associated routines, such as windowing.
Table 4. List of FFT Routines in DSPLIB

DSP_fft16x16: Fixed-point FFT using 16-bit complex numbers for input and output (16-bit I and 16-bit Q). The twiddle factor table is also stored as 16-bit complex. There is a scaling by 2 after every radix-4 stage (except the last stage, which can be either a radix-4 or a radix-2). The minimum length of the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works for input lengths which are powers of 2 or 4.

DSP_fft16x32: Fixed-point FFT using 32-bit complex numbers for input and output (32-bit I and 32-bit Q). The twiddle factor table is stored as 16-bit complex. There is a scaling by 2 after every radix-4 stage (except the last stage, which can be either a radix-4 or a radix-2). The minimum length of the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works for input lengths which are powers of 2 or 4.

DSP_fft32x32: Fixed-point FFT using 32-bit complex numbers for input and output (32-bit I and 32-bit Q). The twiddle factor table is also stored as 32-bit complex. There is no scaling in this FFT. The minimum length of the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works for input lengths which are powers of 2 or 4.

DSPF_sp_fftSPxSP: This FFT uses complex floating-point input and output. The twiddle factors are also in floating point. There is no scaling in this FFT. The minimum length of the FFT is 16. All buffers are assumed to be 8-byte aligned. The FFT works for input lengths which are powers of 2 or 4.
Table 5. Cycle-Count Performance of FFTs

Function Name       N = 32   N = 64   N = 128   N = 256   N = 512   N = 1024   N = 2048
DSP_fft16x16           160      240       516       932      2168       4216       9868
DSP_fft16x32           241      369       816      1488      3503       6831      16078
DSP_fft32x32           261      413       956      1772      4267       8363      19914
DSPF_sp_fftSPxSP       305      473      1066      1962      4683       9163      21740
Ensure that the quantization noise from the FFT processing does not limit the output SNR. The SQNR performance (in dBFS) of the various FFT routines is listed in Table 6. For input data with an effective bit width of B bits, the SNR at the input is B × 6 + 1.76 dB. The ideal SNR after an N-point first-dimension FFT is B × 6 + 1.76 + 10log10(N) dB. If B = 10 bits and N = 128, the ideal output SNR is approximately 83 dB. Consequently, the use of DSP_fft16x16 (which, from Table 6, has an SQNR of about 85 dB) will not limit the output SNR. Similarly, depending on the input SNR and the FFT length, other options such as DSP_fft32x32() can be considered. For optimal performance, the input back-off should be chosen appropriately to prevent clipping at the output. For example, when using DSP_fft16x16Scaled, an input back-off of −12 dBFS is ideal. A quick numeric check of this SNR budget is sketched below.
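The budgeting in the previous paragraph can be captured in a few lines of host-side C (a sketch for illustration only; the constants mirror the formula above):

#include <math.h>
#include <stdio.h>

/* Ideal output SNR (dB) after an N-point FFT for an input with B
 * effective bits: B x 6 + 1.76 + 10*log10(N). */
static double ideal_output_snr_dB(int B, int N)
{
    return 6.0 * B + 1.76 + 10.0 * log10((double)N);
}

int main(void)
{
    /* B = 10, N = 128 gives approximately 83 dB, below the ~85 dB SQNR
     * of DSP_fft16x16 in Table 6, so that routine does not limit the
     * output SNR. */
    printf("ideal output SNR = %.1f dB\n", ideal_output_snr_dB(10, 128));
    return 0;
}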
Table 6. SQNR (dB) Performance of FFTs

Function Name       N = 32   N = 64   N = 128   N = 256   N = 512   N = 1024   N = 2048
DSP_fft16x16          89.0     87.5      88.3      85.3      87.0       84.2       86.3
DSP_fft16x32         101.3    105.4     104.4     108.7     108.7      112.7      113.4
DSP_fft32x32         175.7    173.0     169.9     167.0     164.1      161.2      158.2
Note that the 32-bit FFT routines provide plenty of room for bit-width growth. A useful technique when using these routines is to left-shift (scale) the input by a few bits. The "free" lower bits thus obtained ensure that the quantization noise of subsequent FFT stages does not limit the processing gain.
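As a sketch of this technique (the shift amount of 8 is illustrative and depends on the input bit width and the headroom needed), the 16-bit range-FFT output can be promoted to 32 bits with a left shift before calling DSP_fft32x32():

/* Sketch: create "free" lower bits before a 32-bit FFT stage. */
#define LSHIFT 8   /* illustrative; leaves headroom above, noise room below */

void promote16to32_with_scaling(const short *restrict in,
                                int *restrict out, int numComplex)
{
    int i;
    for (i = 0; i < 2 * numComplex; i++) {  /* interleaved I/Q */
        out[i] = ((int)in[i]) << LSHIFT;
    }
    /* The scaled buffer is then passed to DSP_fft32x32(), whose own
     * quantization noise now sits below the shifted-in LSBs. */
}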
3.5 Recommended Division of Labor Between the DSP and HWA
The DSP, being a general-purpose processor, can implement the algorithms of the HWA: it can perform efficient FFTs, CFAR-CA detection, complex vector multiplications, and so forth. This raises a common question: what should the division of labor between the DSP and the HWA be?
The answer depends on factors such as the MIPS necessary for the application (Is the DSP sufficient? If so, use it.) and the specifications of the algorithm (Does it need floating-point operations? Can it be implemented on the HWA?).
Some other factors to keep in mind when dividing algorithms between the HWA and DSP are:
• Repetitive, standardized algorithms like FFTs are best performed on the HWA, since the entire processing can be completed without any interruption from the processor. The DSP is then better used for other higher-level algorithms during this period, such as tracking and classification.
• Specialized (and proprietary) algorithms, such as detection schemes other than CFAR-CA, histogram computation, MUSIC, and interference mitigation using multiple thresholds, can only be performed on the DSP.
• The HWA is a fixed-point accelerator with a fixed 24-bit internal bit width. Computations that need 32-bit precision need to use the DSP. Likewise, all floating-point operations need the DSP.
• Certain arithmetic operations (like complex vector multiplication) may actually be faster on the DSP, since it runs at 3× the clock of the HWA and can perform one complex multiplication per cycle (a sketch of such a kernel follows this list).
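The sketch below shows the kind of complex vector multiply kernel referred to in the last bullet. Plain C is shown for clarity; when the compiler software-pipelines this loop on the C674x, it maps to the complex-multiply instructions, approaching one complex multiplication per cycle. The Q15 rounding choice is illustrative.

/* Sketch: y[i] = a[i] * b[i] for 16-bit I/Q vectors (Q15 arithmetic). */
void cplx_vec_mul_q15(const short *restrict a, const short *restrict b,
                      short *restrict y, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int ar = a[2 * i], ai = a[2 * i + 1];
        int br = b[2 * i], bi = b[2 * i + 1];
        y[2 * i]     = (short)((ar * br - ai * bi) >> 15);
        y[2 * i + 1] = (short)((ar * bi + ai * br) >> 15);
    }
}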
4 Data Flow
While Section 3 focused on algorithmic benchmarks, this section focuses on data flows. A typical radar signal-processing chain requires data transfers from the ADC buffer to the DSP local memory (L1 or L2), and back and forth between the DSP local memory and L3. These transfers are primarily accomplished using the EDMA. It is important to understand the various types of EDMA transfers and their associated latencies; this understanding enables stitching together a data flow with minimal overhead that is as nonintrusive as possible.
4.1 Types of EDMA Transfers
Three kinds of EDMA transfers are relevant in the context of FMCW radar signal processing.
• Contiguous-Read Contiguous-Write (or contiguous): This is the simplest and fastest kind of data
transfer, which involves moving a portion of memory from a source to a destination with no data
rearrangement involved during the transfer. This transfer is accomplished as a single-dimensional
transfer (with ACNT specifying the number of bytes to be transferred). The speed of this transfer is 128
bits per cycle (at 200 MHz). In a typical radar signal-processing chain where each data element is a
32-bit complex number (16-bit I and 16-bit Q), this amounts to four samples being transferred in every
cycle. Thus a transfer of 256 complex samples would take about 0.32 µs. The transfer of data from the
ADC buffer to the L2 memory of the DSPs before range-FFT processing would be an example of a
contiguous transfer. Another example would be the transfer of the output of interference-mitigation to
one of the HWA local memories.
• Contiguous-Read Transpose-Write (or transpose-write): A vector that is contiguously spaced at the source must be placed in a transpose fashion at the destination. This is accomplished using a 2-dimensional transfer, with ACNT representing the size of each sample and BCNT representing the number of samples. Assuming that the sample size ACNT < 128 bits, the transfer takes four cycles (at 200 MHz) per sample. As a result, it would take about 5.12 µs to transfer 256 contiguous complex samples using a transpose-write. Figure 10 is an example of such a transfer, where the range-FFT output (corresponding to the four RX channels of a chirp) is placed in a transpose fashion in L3 memory. Such a transpose-write ensures that the data on which the Doppler-FFT operates is available in a contiguous fashion at the end of a frame.
Techniques exist that can increase the effective speed of the transpose-write transfer. One such technique is to parallelize the transfer by employing multiple TCs. In the example of Figure 10, range-FFTs for channels 1 and 2 could be transferred using one TC, while range-FFTs for channels 3 and 4 could be transferred using another TC. This would speed up the transfer by a factor of 2. The technique works because the latencies for such transfers arise primarily within the TC, and not at the memory interfaces of the source or destination.
NOTE: For all depictions of memory in this document, the memory address is assumed to be
contiguous along the rows of the array that represents the memory.
[Figure 10 depicts the range-FFT outputs of four RX channels (256 samples per channel, 128 chirps) being transpose-written into L3 memory (256 × 4 rows). At 4 cycles (at 200 MHz) per sample, transferring 4 channels × 256 samples takes about 21 µs. After a frame is processed, the data is contiguously placed in L3, making the subsequent transfer to the DSP/HWA for Doppler processing fast.]
Figure 10. Contiguous-Read Transpose-Write EDMA Access
• Transpose-Read Contiguous-Write (or transpose-read): Data that is noncontiguously placed at the source must be contiguously placed at the destination. As in the earlier case, this is accomplished using a 2-dimensional transfer, with ACNT representing the size of each sample and BCNT representing the number of samples. Assuming that the sample size ACNT < 128 bits, the transfer takes 1 cycle (at 200 MHz) per sample. As a result, it would take about 1.28 µs to transfer 256 complex samples. In the context of radar signal processing, such a transfer would be needed if range-FFT data across chirps in a frame had been contiguously stored in L3, as shown in Figure 11. A transpose-read access would be needed to transfer the data corresponding to a specific range bin (and a specific antenna) to the local memory of the DSP to perform the Doppler-FFT.
[Figure 11 depicts range-FFT data stored contiguously in L3 (one row per chirp, across range bins) being transpose-read into DSP/HWA memory. At 1 cycle (at 200 MHz) per sample, one line of 256 samples takes 1.28 µs to transfer.]
Figure 11. Transpose-Read Contiguous-Write EDMA Access
Table 7. 'Steady State' EDMA Transfer Costs

Read Mode    Write Mode   ACNT (bytes)   Throughput          Time for 1024 complex samples
Contiguous   Contiguous   any            128 bits/cycle      1.28 µs
Contiguous   Transpose    4              32 bits/4 cycles    20.48 µs
Contiguous   Transpose    16             128 bits/4 cycles   5.12 µs
Transpose    Contiguous   4              32 bits/cycle       5.12 µs
Transpose    Contiguous   16             128 bits/cycle      1.28 µs
Finally, Table 7 summarizes the transfer costs when using the EDMA. Note that these are best-case numbers. When designing EDMA transfer sequences, be aware that arbitration (when multiple TPTCs vie for the same bus) can introduce a significant increase in the transfer costs. This application report will be updated with more accurate numbers at a later date. A sketch of the parameter-RAM settings behind the contiguous and transpose transfers follows.
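The following sketch illustrates how the two transfer types in Table 7 differ only in their parameter-RAM (PaRAM) settings. The field names are illustrative (actual programming goes through the device's EDMA3 driver); a 32-bit complex sample and a 128-chirp L3 matrix are assumed.

/* Illustrative PaRAM settings (not a driver API). */
typedef struct {
    unsigned short acnt;     /* bytes per array                      */
    unsigned short bcnt;     /* number of arrays                     */
    short          srcBIdx;  /* byte stride between arrays at source */
    short          dstBIdx;  /* byte stride between arrays at dest   */
} ParamSketch;

/* Contiguous-read contiguous-write: one 1-D burst of 256 samples. */
const ParamSketch contiguous = {
    .acnt = 256 * 4, .bcnt = 1, .srcBIdx = 0, .dstBIdx = 0
};

/* Contiguous-read transpose-write (Figure 10): read 4-byte samples back
 * to back, land each one a full L3 row (128 chirps x 4 bytes) apart. */
const ParamSketch transposeWrite = {
    .acnt = 4, .bcnt = 256, .srcBIdx = 4, .dstBIdx = 128 * 4
};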
4.2 Example Use Cases: Intraframe Processing (Range-FFT Processing)
The use of the EDMA during range-FFT processing is demonstrated using a few illustrative examples (by no means an exhaustive set).
4.2.1 Single-Chirp Use Case
Single chirp refers to the use case where the ADC buffer has been programmed to issue an interrupt after data for each chirp (across four RX channels) becomes available. In this use case, no pre-processing is performed in the DSP prior to the range-FFT, so the HWA can be configured to access the ADC buffer directly. There is no need to pipe data into the accelerator via an EDMA.
NOTE: If some pre-processing needs to be performed on the DSP, then an EDMA would have to be configured to transfer the ADC buffer contents to the DSP. The transfer to the DSP's L2 would be contiguous and hence fast. For example, transferring 256 ADC samples across 4 RX channels would take (256 × 4) / 4 / 200 MHz ≈ 1.28 µs. Once the pre-processing is complete, another EDMA can be triggered to transfer the ADC data to the local memories of the HWA for the FFT, which would take another 1.28 µs.
The FFT output is then transferred to the radar-cube memory in L3 via an EDMA. In this example, the transfer from the HWA's local memory to L3 is a transpose-write and hence is slower (by a factor of 16) than a contiguous transfer. The transfer of the 256 × 4 range-FFT samples would thus take about 21 µs.
NOTE: The total latency of this DMA transfer is in many cases less than the chirp period Tc and hence should not be a bottleneck (for a typical 5-MHz sampling rate with Nadc = 256, Tc is 51.2 µs). However, since the AWR1843 allows sampling rates as high as 12.5 MHz (which implies that for Nadc = 256, Tc = 20 µs), there may not be enough time to do a transpose-write (which would take 21 µs). Care should be taken to account for this bottleneck.
The advantage of this EDMA configuration is that it eases the data-transfer latencies during Doppler processing. Since the range-FFT output is written out in a transpose manner, the data required for a Doppler-FFT is contiguously placed. Therefore, when the Doppler-FFT outputs need to be written back (see Section 3.3.1), the write is contiguous and fast.
[Figure 12 depicts the single-chirp flow: the HWA processes each chirp directly from the ADC buffer (A: FFT + windowing, with a processing time of 6.5 µs for 4 RX channels), and the 256-sample range-FFT outputs are then transpose-written to L3 memory (B), taking about 21 µs using one DMA.]
Figure 12. Single-Chirp Use Case
Figure 13 shows the buffer-management scheme for the typical use case. While the ADC buffer's pong buffer is being filled with ADC data, the ping buffer's contents are being processed by the HWA and stored in one of the local memories (MEM2). During the same period, the previous chirp's output (stored in MEM3) is being transferred out to L3 via an EDMA. Since the HWA was configured such that the ADC buffers (ping and pong) can be directly accessed, there is no need to move ADC data to the HWA's internal buffers.
Figure 13. Buffer Management for Data Flow
4.2.1.1 Speeding Up the Transpose Transfer
A transpose-write-based EDMA transfer is 16 times slower than a contiguous EDMA transfer. Nevertheless, as demonstrated in the previous use cases, it is possible to devise data flows that ensure the transpose transfer does not become a bottleneck. However, should the need arise, there are options available to speed up this transfer. One option, discussed in Section 4.1, involves the use of two EDMA transfers operating in parallel using different TCs. Another option is described in the following discussion.
A transpose-write EDMA transfer takes four cycles to move one sample. This latency remains the same for any sample up to 128 bits in size (because the bus that accesses L3 memory is 128 bits wide). Therefore, any sample size less than 128 bits (16 bytes) makes inefficient use of the available bandwidth; a complex sample of size 32 bits (16-bit I and 16-bit Q) uses only 1/4 of the bandwidth. This can be remedied by performing a transpose on the range-FFT results prior to initiating the transpose transfer to L3. As shown in Figure 14 (matrix A), the range-FFT output for each antenna is contiguously placed. The DSP then performs a transpose operation on this matrix such that data for a specific range bin is interleaved across antennas (matrix B). A transpose-write EDMA is now initiated with an ACNT of 128 bits. With this enhancement, the EDMA transfer that originally took 20.48 µs (in Figure 12) now takes 1/4 the time (5.12 µs). At the end of the intrachirp period, each row in L3 memory consists of data for a specific range bin, interleaved across antennas. Subsequently, a contiguous transfer of each row can be used to transfer the data to DSP memory for Doppler-FFT processing, with some overhead during Doppler processing to deinterleave this data. This method results in a 4× improvement in the latency of the transpose-write EDMA transfer. The additional overhead on the DSP (for interleaving the data after each range-FFT and deinterleaving prior to the Doppler-FFT) is small and usually acceptable.
[Figure 14 depicts the enhanced flow: the per-antenna range-FFT outputs in the HWA buffer memory (MEM2, MEM3) are transposed so that each range bin's samples are interleaved across the four antennas, and the result is then transpose-written to L3 with ACNT = 128 bits. The HWA processing (FFT + windowing) takes 6.5 µs for 4 RX channels, and the transfer to L3 takes 5.12 µs using one DMA plus the transpose.]
Figure 14. Improving Efficiency of the Transpose Transfer
NOTE: The DSP can transpose a 4 × 256 matrix in 576 cycles (approximately 0.96 µs).
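A plain-C sketch of the interleaving step is shown below for reference (the optimized routine behind the 576-cycle figure uses packed loads and stores, but the data movement is the same):

#define NUM_RX   4
#define NUM_BINS 256

/* Interleave per-antenna range-FFT rows (matrix A) into per-range-bin
 * rows (matrix B) so that ACNT = 128 bits can be used for the
 * transpose-write. Each element is one 32-bit complex sample. */
void interleave_across_antennas(const unsigned int in[NUM_RX][NUM_BINS],
                                unsigned int out[NUM_BINS][NUM_RX])
{
    int bin, rx;
    for (bin = 0; bin < NUM_BINS; bin++) {
        for (rx = 0; rx < NUM_RX; rx++) {
            out[bin][rx] = in[rx][bin];
        }
    }
}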
4.2.2 Multichirp Use Case
Multichirp refers to the use case where the ADC buffer has been configured to interrupt the DSP after a fixed number of chirps have been collected. Figure 15 shows the timing diagram of a multichirp use case where the ADC buffer interrupts the HWA/DSP after every four chirps have been collected. Such a configuration results in fewer interrupts to the HWA/DSP and longer gaps between the processing associated with successive interrupts (contrast this with Figure 6, which depicts a single-chirp use case). The DSP/HWA could enter a sleep mode during these gaps in processing, potentially saving power. Alternatively, the DSP could perform higher-level processing (such as clustering and tracking for prior frames) as a background process in these gaps.
Figure 15. Timing for Multichirp Use Case
While it would be easy to adapt the techniques discussed in Section 4.2.1 to the multichirp use case, a slightly different approach is highlighted here. It is assumed that pre-processing is necessary, and that the output of the HWA's FFT is to be written to L3 contiguously, meaning that the HWA must perform the transpose (prior to the transfer to L3). Since the HWA's local memories are half the size of the ADC buffers, the entire content of the ADC buffer cannot be brought into one of the HWA's local memories. Instead, half of the processed chirps are brought into MEM0 and the other half into MEM1.
In the implementation shown in Figure 16 and Figure 17, each row in the ADC buffer (a single chirp for a single antenna) is transferred to the DSP's L2 one row at a time, which keeps the L2 buffering requirements to a minimum. The processed data from L2 is then transferred to the HWA's MEM0/MEM1 (ping or pong) in a contiguous transfer. Two ping-pong buffer pairs respectively manage the data input to the DSP (from the ADC buffer) and output to the HWA. Since both EDMA transfers are contiguous, the transfer latencies do not become a bottleneck. The latency of each of these EDMA transfers (represented by A and C in Figure 16) is 0.32 µs, while the corresponding DSP pre-processing latency (B) should be higher (say, 2 µs). Since B > A + C, the DSP never has to wait for an EDMA completion.
Once one HWA memory is filled, the HWA is triggered to perform the range-FFT processing with an additional transpose step, after which the range-FFT data can be transferred to L3 such that each contiguous burst is 128 bits (Figure 17). Consequently, the access for the Doppler-FFT will involve a contiguous-read EDMA access.
Once this process is complete, the DSP triggers the HWA a second time to process the contents of HWA MEM1, which by then holds the processed ADC samples of the remaining RX antennas. A sketch of the per-row ping-pong control flow is shown below.
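The function names in the following sketch are illustrative placeholders (the actual triggering and completion checks go through the EDMA driver); the point is the ordering that keeps the DSP from ever stalling, given B > A + C.

/* Illustrative prototypes; real implementations use the EDMA3 driver. */
extern void wait_for_input_edma(int pingPong);   /* completion of A      */
extern void trigger_input_edma(int pingPong, int row);
extern void trigger_output_edma(int pingPong);   /* start of C           */
extern void interference_mitigation(const short *in, short *out); /* B   */
extern short *l2In[2], *l2Out[2];                /* L2 ping-pong buffers */

void process_adc_rows(int numRows)
{
    int row;
    trigger_input_edma(0, 0);                    /* prime the pipeline   */
    for (row = 0; row < numRows; row++) {
        int pp = row & 1;
        wait_for_input_edma(pp);                 /* A: ADC -> L2, 0.32 us */
        if (row + 1 < numRows)
            trigger_input_edma(pp ^ 1, row + 1); /* prefetch next row     */
        interference_mitigation(l2In[pp], l2Out[pp]); /* B: ~2 us         */
        trigger_output_edma(pp);                 /* C: L2 -> HWA, 0.32 us */
    }
}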
[Figure 16 depicts the DSP-to-HWA flow: rows from the ADC buffer (ping/pong) are moved to L2 input ping-pong buffers via a contiguous EDMA transfer (A, 0.32 µs); the DSP performs pre-processing such as interference mitigation on each 256-sample row (B); and the processed rows are moved from the L2 output ping-pong buffers to HWA MEM0 (antennas 1 and 2) and MEM1 (antennas 3 and 4) via another contiguous transfer (C, 0.32 µs).]
Figure 16. Multichirp Use Case (DSP to HWA)
[Figure 17 depicts the HWA-to-L3 flow: the HWA processes the chirps in MEM0 (D: FFT + windowing + transpose, taking 13 µs for 2 RX channels and 4 chirps) into MEM2, and the result is then transpose-written to L3 (E), taking 11 µs using one DMA.]
Figure 17. Multichirp Use Case (HWA to L3)
4.3 Example Use Cases: Interframe Processing
Interframe processing primarily involves Doppler-FFT computation, detection, and angle estimation. The
data flow is largely determined by the way the range-FFT output has been stored in L3 memory during
intraframe processing, and three such cases are discussed.
• Case 1
If the range-FFT output has been placed in L3 using a transpose-write (as in, for example, Figure 12), then the data required for performing a Doppler-FFT is contiguously available. This represents the simplest scenario for interframe processing, because all L3 accesses are now through contiguous EDMA transfers. As shown in Figure 18, four lines from L3 memory are transferred to the HWA local memory (say, MEM0/MEM1) using a ping-pong scheme. After the Doppler-FFT has been computed, the result can optionally be transferred back to L3 (not shown in the figure).
During interframe processing, all the required input data is available in L3. Hence, to prevent any stalling in the processing, it is important that the EDMA transfer latencies keep up with the DSP/HWA processing latencies. In the case depicted in Figure 18 (with 128 chirps, 4 RX channels), an EDMA transfer would take 0.64 µs, while a 128-point FFT (for 4 RX channels) takes 3.2 µs. Consequently, the EDMA transfer should not be a bottleneck, even in the case where the Doppler-FFT output is written back to the radar cube.
[Figure 18 depicts Case 1: lines from L3 memory are moved to HWA MEM0/MEM1 by a contiguous EDMA transfer (A, 1.28 µs for 4 RX channels), and the HWA processing (B: FFT + windowing + log-magnitude sum + CFAR-CA) takes about 12 µs for 4 RX channels, with results placed in HWA MEM2/MEM3.]
Figure 18. Interframe Processing Case 1
• Case 2
Figure 19 depicts the interframe processing for the case where the range-FFT output has been placed in L3 using the scheme suggested in Figure 14. Each line in L3 consists of data for a specific range bin, interleaved across antennas. The data can be transferred from L3 to the HWA with a contiguous read. Since the FFT routines expect input data to be contiguously placed, the HWA is programmed to access the data in a transposed (deinterleaving) manner. There is no penalty in this case.
[Figure 19 depicts Case 2: each line in L3 holds one range bin (chirp 1, chirp 2, ..., chirp 128, interleaved across antennas); lines are moved to HWA MEM0/MEM1 by a contiguous transfer (A, 1.28 µs for 4 RX channels), and the HWA processing (B: FFT + windowing + log-magnitude sum + CFAR-CA) takes approximately 12 µs for 4 RX channels, with results placed in HWA MEM2/MEM3.]
Figure 19. Interframe Processing Case 2
• Case 3
If the range-FFT output is stored contiguously in L3, then Doppler-FFT processing entails an EDMA transfer with transpose-read access, as illustrated in Figure 11. Taking the example of 4 consecutive 128-point FFTs (performed on the HWA), the FFTs would take 3.2 µs, while the corresponding EDMA transfer (transpose-read access for 128 × 4 samples) would take only 2.56 µs.
In addition to the Doppler-FFT computation, the interframe processing will include other computations (such as computing the magnitude or log-magnitude of the FFT result). Hence, it is likely that the EDMA transfer will not be a bottleneck and will be able to keep up with the processing. However, performing a write-back of the Doppler-FFT results to the radar cube would involve a larger penalty, since it would require an EDMA transpose-write, which in this example (4 × 128 samples) takes about 11 µs. Hence, this data-flow architecture might not be optimal if such a write-back is required.
5 Discussion on Cache Strategy
Recall from Section 2.1 that L1 and L2 can be partly or wholly configured as cache (for both program and data). L1P is a direct-mapped cache, L1D is a 2-way set-associative cache, and L2 is a 4-way set-associative cache [10]. The choice of an appropriate cache strategy can help achieve the desired trade-off between performance, code size, and development time.
The simplest strategy follows:
• Configure both the L1P and the L1D as cache.
• Place scratch pad data and program in L2, and use L3 only to store the radar-cube.
You should be able to achieve the desired performance using this approach (with the overhead due to caching being on the order of 15%). In the data-flow discussions in Section 4, it was implicitly assumed that the various ping-pong buffers and other temporary buffers resided in L2 with the L1D cache enabled. The C674x DSP supports a snoop-read and snoop-write feature [10], which ensures cache coherency between L1D and L2 in the presence of EDMA reads and writes to L2.
However, given the highly deterministic nature of the radar signal-processing chain, additional improvements can be achieved by configuring a portion of L1P and L1D as RAM. Key routines that constitute the bulk of the radar signal processing, such as the FFT and windowing routines, can be placed in L1P, while the associated scratch-pad buffers (such as some of the ping-pong buffers depicted in the earlier data flows) can be placed in L1D. In one experiment, all real-time framework and algorithm code (up to clustering) was placed into 24KB of L1P SRAM (leaving only 8KB for L1P cache), which yielded a modest improvement in the cycle performance of the chain. 16KB of L1D was also configured as data SRAM (leaving 16KB for L1D cache) to hold temporary processing buffers (reused between range processing and Doppler processing). By doing so, the cycle performance of range processing improved by 5% to 10%. There was also much less cycle fluctuation in range processing, due to the decrease in L1 cache activity. The reduced size of the L1 cache did not seem to degrade the cycle performance of other algorithms that worked out of buffers in L2. This placement of code and data in L1P and L1D also saved 40KB of L2 memory, without negatively impacting the cycle performance of the whole chain.
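As a sketch of how such placement is expressed (the section names and memory-range names below are illustrative, and the actual L1 cache resizing is done through the cache-configuration API of the OS or CSL in use), TI compiler pragmas can direct hot code and scratch buffers into the L1 SRAM regions:

/* Place a hot per-chirp kernel in L1P SRAM (illustrative section name). */
#pragma CODE_SECTION(range_proc_kernel, ".text:l1pcode")
void range_proc_kernel(short *buf, int n)
{
    /* ... windowing/FFT glue executed every chirp ... */
}

/* Place a scratch buffer (reused between range and Doppler processing)
 * in L1D SRAM. */
#pragma DATA_SECTION(gScratch, ".bss:l1dscratch")
#pragma DATA_ALIGN(gScratch, 8)
short gScratch[2 * 1024];

/* Linker command file fragment (illustrative):
 *     .text:l1pcode   > L1PSRAM
 *     .bss:l1dscratch > L1DSRAM
 */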
Alternatives to EDMA-based L3 access: The data flows discussed in Section 4 relied on the EDMA engine to access data in L3. However, in some scenarios, direct access of L3 memory by the CPU might also be considered. Each EDMA transfer has an overhead in terms of triggering the EDMA and subsequently verifying its completion (either through an interrupt or by polling). These overheads can swamp the benefits of using an EDMA for small transfers. Such a scenario could occur, for example, in the context of an FMCW frame with a small number of chirps (16 or 32), resulting in small Doppler-FFTs. One option is to fetch data for multiple Doppler-FFTs in every EDMA transfer; this reduces the EDMA overhead, though at the cost of proportionally larger buffer sizes. In such cases, direct access to L3 (with caching in L1D) can be a viable alternative. In one experiment with size-32 Doppler-FFTs, 4 RX antennas, and 1024 range bins, direct L3 access (with the L1 cache on) performed slightly faster than an EDMA-based approach; it also reduced the code complexity (by eliminating the EDMA triggering and polling). We also experimented with keeping the pre-detection matrix in L3 memory with direct CPU accesses (via the cache). While the detection algorithm (CFAR) showed a cycle degradation of about 2× (versus the pre-detection matrix residing in L2), the freeing up of L2 memory might make this a worthwhile trade-off in certain cases.
Code in L3: Customers also have the option of storing program code in L3. This option is typically suited to code that is executed once per frame (in one go), rather than code that is executed multiple times interleaved with other code (the intent is to reduce the loss due to repeated code eviction); Kalman filtering and detection are examples. While actual results will vary depending on the nature of the code, our experience running a Kalman filter from L3 shows a hit of about 2× for the first invocation and almost no overhead for subsequent invocations (due to the program being cached in L1P). Whenever L3 is directly accessed by the CPU (either for code or data), multiple caching options are possible. One option is to cache L3 directly to L1 (L1P, L1D). Another option is to create a hierarchical cache structure, with a small part of L2 (about 32KB) also configured as cache.
6 References
1. Industrial Radar Device Family Technical Reference Manual
2. IWR1642 Single-Chip 76- to 81-GHz mmWave Sensor
3. mmWave Device Firmware Package (DFP)
4. TMS320C6748 DSP Technical Reference Manual
5. TMS320C674x DSP CPU and Instruction Set Reference Guide
6. TMS320C674x DSP Megamodule Reference Guide
7. TMS320C6000 Optimizing Compiler User's Guide
8. TMS320C6000 Programmer's Guide
9. TMS320C6000 DSP Library (DSPLIB): Required DSPLIB libraries that must be installed for the C674x:
   a. The fixed-point library for the C64x+
   b. The floating-point library for the C674x (documentation included with the installation provides APIs for all the routines and a cycle benchmarking report)
10. mmWave Software Development Kit (SDK): Install the mmWave SDK, then navigate to \packages\ti\alg\mmwavelib to view both the documentation and code for mmwavelib.
11. Radar Hardware Accelerator User's Guide
12. Introduction to mm-wave sensing: FMCW Radars
13. TMS320C6000 DSP Library (DSPLIB)