Scorpion MPF presentation

Qualcomm High Performance Processor Core and
Platform for Mobile Applications
Lou Mallia, Senior Staff Engineer, Qualcomm Inc.
ARM Developers Conference 2007
The objective
How to get from here
Features
• QVGA screens
• Fancy ring tones
• Limited graphics
• Snapshot-quality camera
• Single-mode modem
To here?
Features
• WVGA video
• Surround-sound audio
• MP3 player
• PC gaming-quality graphics
• Professional 12MP photos
• Multi-mode modem
• Bluetooth
• GPS w/ navigation
• Web browsing with DRM
• Secure financial transactions
• Live TV
• WiFi
• ...all running at the same time!!
[Figure: "QUALCOMM Application MIPS" chart, 0 to 2500 MIPS, 1998 to 2008]
Year   QUALCOMM Platform   Processor*   MIPS
1998   MSM2300             MCU          <20
2000   MSM3000             ARM7         23
2003   MSM6500             ARM9         160
2004   MSM6550             ARM9         250
2006   MSM7200             ARM11        480
2008   Snapdragon          Scorpion     2000+
*QUALCOMM implementations
In short:
How do we get 2000+ DMIPS and still keep the power below 500 mW?
The challenges
• Need more MIPS, MFLOPS, and data bandwidth...
ƒ Performance approaching the level of PCs
• ...but also want lower power and smaller form factor
ƒ Smaller batteries
ƒ Always on
• Other design team challenges
ƒ ARMv7 architecture new for design team
ƒ Architectural complexity
ƒ Energy-efficient data movement
ƒ New OS standards
The plan of attack
• Experienced design team
ƒ 5 generations of low-power RISC processors
• ARMv7 architecture license
ƒ New design, not a standard product
ƒ Partner with ARM on evolution
• Target cycle time of 20-25 FO4
ƒ Using low-power technology
• Aggressive energy management
• Design, model, and iterate
[Figure labels: Experienced design team, ARMv7 architecture, Mobile requirements, Low-power expertise, Scorpion]
Scorpion overview
• Highly integrated design
ƒ Superscalar CPU
ƒ Tightly-coupled multimedia engine
ƒ L1 and L2 caches
ƒ Built-in DMA channels
ƒ Debug, trace, and performance monitors
• All the latest ARMv7 architectural features, including:
ƒ Multimedia enhancements
– Neon 128-bit Advanced SIMD extensions (ASE)
– VFPv3 floating-point (32 double-precision registers)
– FP-16 half-precision floating-point format
ƒ TrustZone™ security extensions
[Figure: Scorpion Applications Processor block diagram. Labels: Bus Interface Unit, L2 Cache/TCM, DMA, L1 I-Cache, L1 D-Cache, Superscalar CPU, VeNum Multimedia Engine, Debug & Performance Monitors, Trace]
Scorpion details
• 1.0 GHz (65nm LP technology)
• 2,100 DMIPS and 8,000 MFLOPS
• Power as low as 0.14 mW/DMIP
• Dual-Issue
• Speculative out-of-order issue
• Deeply pipelined (24 FO4)
ƒ 13-stage load/store pipe
ƒ 10-/12-stage integer pipes
ƒ 23-stage floating-point pipe
• Dynamic branch prediction
ƒ Branch history table (BHT)
ƒ Branch target address cache (BTAC)
– 1-cycle penalty on taken branches
ƒ Global history register (GHR)
ƒ Subroutine return acceleration (link stack)
[Figure: Scorpion pipeline diagram. Labels: BTAC, BHT, IC1, IC2, ITLB (8-entry), I-cache (32KB), IDA, DCD, IQ, SRACC1, SRACC2, XRACC1, XRACC2, SRSV, XRSV, Comp Buffer, UTLB (64-entry), DTLB, D-cache (32KB), S1-S5, X1-X4, Y1-Y2, L2 Subsystem; VeNum stages VBUF, VREC, VEXP, VXDCD, VSDCD, VXQ, VSQ, VXRACC, VSRACC, VRF, VX1-VX9, VPERM, VSWB, VXBYP, VXWB]
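The slide names a branch history table and a global history register but does not say how Scorpion combines them. As a purely illustrative sketch, the block below uses the textbook "gshare" arrangement (history XORed with the branch address to index a table of 2-bit counters). The class name, table size, and indexing are assumptions, not Scorpion's actual design.

```python
class GsharePredictor:
    """Textbook gshare predictor; NOT taken from the Scorpion slides."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Branch history table of 2-bit saturating counters (0..3),
        # initialised to 1 = weakly not-taken.
        self.bht = [1] * (1 << index_bits)
        self.ghr = 0  # global history register: recent taken/not-taken bits

    def _index(self, pc):
        # Assumed indexing: word-aligned PC XOR global history.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        """Return True if the branch at `pc` is predicted taken."""
        return self.bht[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter with the real outcome, then shift the history."""
        i = self._index(pc)
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

Training the predictor on a branch that is always taken quickly saturates the counter it indexes, after which `predict` returns True for that branch.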
Scorpion VeNum multimedia engine (1)
• VeNum = “Vector Numerics”
ƒ VFPv3 floating-point
ƒ Neon advanced SIMD extensions (ASE)
ƒ 11-stage, 128-bit arithmetic and load/store pipelines
– VFP operations merged with low-order 64-bits of SIMD
– Unified multiplier (integer and floating-point)
– No “trundling” of data (early-out bypass and writeback)
ƒ Dual-issue, out-of-order execution
– 128-bit load/store plus arithmetic operation
ƒ No speculative execution (reduces wasted energy)
ƒ Separate clock domain from CPU (synchronous)
[Figure: VeNum pipeline diagram with in-order and out-of-order sections. Labels: VBUF (28 entries), VREC, VEXP, VXDCD, VSDCD, VXQ, VSQ, VXRACC, VSRACC, VRF; arithmetic pipe (VX1 stage through VXWB stage) and permute/load-store pipe, each marked 128-bit SIMD and 64-bit VFP]
Scorpion VeNum multimedia engine (2)
• VFPv3 floating-point
ƒ Register file expanded to 32x64-bit
ƒ Pipelined for both double- and single-precision
– Including subnormals, NaNs, and multiply-add
ƒ Divide and square root
ƒ Precise trapping w/ syndromes on IEEE exceptions
• Neon advanced SIMD extension (ASE)
ƒ Fully-pipelined 128-bit datapath
ƒ Shares VFP register file (accessed as 16x128-bit)
ƒ Integer SIMD (16x8-bit, 8x16-bit, 4x32-bit)
ƒ Floating-point SIMD (4x32-bit single-precision)
– 8000 MFLOPS
• FP-16 half-precision support
ƒ Doubles load/store bandwidth and saves energy
ƒ Conversion operations between half- and single-precision
– Supports both OpenGL ES 2.0 formats
[Figure: VeNum pipeline diagram. Labels: VBUF, VREC, VEXP, VXDCD, VSDCD, VXQ, VSQ, VXRACC, VSRACC, VRF, VX1-VX9, VPERM, VSWB, VXBYP, VXWB]
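The lane counts and the bandwidth claim on this slide follow from simple arithmetic, sketched below. The helper names are hypothetical. The reading of 8000 MFLOPS as 4 single-precision lanes times 2 operations per multiply-add at 1 GHz is an assumption that happens to match the quoted figure; the slides do not state how it is counted.

```python
import struct

REGISTER_BITS = 128  # width of one Neon quadword register

def lanes(element_bits):
    """Number of SIMD elements that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits

def peak_mflops(clock_mhz, fp_lanes=4, flops_per_lane=2):
    """Peak MFLOPS, ASSUMING one multiply-add (2 flops) per lane per cycle."""
    return clock_mhz * fp_lanes * flops_per_lane

def to_half(value):
    """Pack a float as IEEE 754 half precision (2 bytes, little-endian)."""
    return struct.pack("<e", value)

def from_half(raw):
    """Unpack 2 bytes of half precision back into a Python float."""
    return struct.unpack("<e", raw)[0]

def to_single(value):
    """Pack a float as IEEE 754 single precision (4 bytes, little-endian)."""
    return struct.pack("<f", value)
```

A half-precision value occupies 2 bytes against 4 for single precision, so the same load/store width moves twice as many elements, which is the bandwidth doubling the slide refers to.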
Scorpion memory subsystem
• 32KB/32KB L1 instruction/data caches
• 256KB unified L2 array
ƒ Cache or TCM
– Configurable by 64KB bank
ƒ Multi-port TCM access
– CPU
– DMA
– AXI slave port
ƒ TCM coherent with L1 data cache
• Internal four-channel DMA controller
• Enhanced AXI-based bus architecture
ƒ Out-of-order transactions
ƒ Barrier operations
ƒ Semaphore operation protocol
ƒ Three ports
– 64-bit CPU master port
– 64-bit DMA master port
– 64-bit slave port (TCM access)
ƒ Up to 4.8 GB/s throughput
[Figure: Scorpion memory subsystem. Labels: Scorpion CPU with 32KB I-Cache and 32KB D-Cache, 4-Channel DMA, AXI Slave Port, 4x4 128-bit Cross-bar Interconnect, 256KB L2 Cache/TCM (Banks 1-4, 64KB each), CPU AXI Master Port, DMA AXI Master Port]
QUALCOMM energy management (1)
• Technology
ƒ Low-leakage, multi-Vt CMOS process (65nm, 45nm)
ƒ Customized by QUALCOMM with multiple fab partners
• Logic design process
ƒ Low-power front-end design
– Optimal per-cycle local clock gate for every register in design
– Unused dataflow stabilized on per-cycle basis
ƒ VeNum 64-bit SIMD multiply mode
– Limits peak power of 128-bit operations
• Multiple clock domains (Clock-do-Mania™)
ƒ Dynamic regional clock gating
– CPU, SIMD/FPU, L2 cache, trace logic
ƒ Dynamic domain clock gating
– Processor core, system interfaces, debug
QUALCOMM energy management (2)
• Dynamic power reduction
ƒ Selective use of HVT, NVT, LVT devices
ƒ Low operational voltages
• Leakage power reduction
ƒ Head/foot switches
ƒ Ultra-low retention voltage
• Multiple voltage and frequency realms
ƒ Supported by level shifters and clamps
ƒ Configured by software
ƒ Adjusted by hardware
• System-in-package (SIP) stacked memory approach
ƒ Reduces SDRAM access power
Completion buffer overview
• ARM architecture defines 32 logical general-purpose registers (LGPRs)
ƒ 15 user-mode GPRs
ƒ 17 privileged GPRs
• Completion buffer (CB) supports 64 physical registers (PGPRs)
ƒ LGPRs “renamed” to PGPRs
• Higher performance
ƒ Allows pipelining of register hazards
ƒ Allows out-of-order writeback of results
• Lower power
ƒ Completion buffer IS the register file
– No need to move from completion buffer to register file as with reorder buffer
– Early pipeline results written directly to CB
» No trundling through later pipeline stages
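The renaming idea on this slide can be sketched in a few lines. This is a minimal model under stated assumptions: the class and method names are hypothetical, physical registers are handed out in simple circular order, and entry retirement is not modelled. It only shows that two writes to the same logical register get different physical registers, so their results can be written back in either order, directly into the buffer.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    lgpr: Optional[int] = None  # logical register this entry holds
    mr: bool = False            # most-recent mapping for that logical register
    rdy: bool = False           # result has been written back
    value: int = 0

class CompletionBuffer:
    """Toy model: the completion buffer doubles as the register file."""

    def __init__(self, size=64):
        self.entries = [Entry() for _ in range(size)]
        self.next_free = 0  # assumed circular allocation, no retirement model

    def rename_dest(self, lgpr):
        """Allocate a physical register (PGPR) for a write to `lgpr`."""
        for e in self.entries:
            if e.lgpr == lgpr and e.mr:
                e.mr = False  # older mapping is no longer the most recent
        pgpr = self.next_free
        self.next_free = (self.next_free + 1) % len(self.entries)
        self.entries[pgpr] = Entry(lgpr=lgpr, mr=True, rdy=False)
        return pgpr

    def lookup_source(self, lgpr):
        """Return the PGPR holding the most recent mapping of `lgpr`."""
        for pgpr, e in enumerate(self.entries):
            if e.lgpr == lgpr and e.mr:
                return pgpr
        return None

    def writeback(self, pgpr, value):
        """Write a result straight into the buffer, in any order."""
        self.entries[pgpr].value = value
        self.entries[pgpr].rdy = True
```

Because each result lands in its own entry, there is no later copy from a reorder buffer into a separate register file, which is the power saving the slide describes.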
Completion buffer power-saving features
• CAM search
ƒ Only search PRIV=0 entries in user mode
ƒ Only search most recent (MR=1) entries
ƒ Gate off comparator for PRIV=1 and/or MR=0
• RAM read
ƒ Only read matched entry if RDY=1
ƒ RDY=0 entry won’t fire RAM read wordlines
– Bitlines and outputs unswitched
• Common case for deep pipelines
ƒ Most source values forwarded from pipeline
– Not read from completion buffer
[Figure: Completion Buffer with 64 entries (PGPR 0-63). Columns: PGPR, LGPR, PRIV, MR, RDY, VALUE, with a COMPARE block per entry. Outputs: HIT, PGPR OUT (6 bits), RDY OUT (1 bit), DATA OUT (32 bits)]
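The gating rules on this slide can be expressed as a small model that counts how many comparators actually fire. The entry layout, function name, and return tuple are hypothetical; the model assumes PRIV=1 entries are skipped only in user mode, which is how the first bullet reads.

```python
from typing import NamedTuple

class CBEntry(NamedTuple):
    lgpr: int    # logical register number held in this entry
    priv: bool   # entry belongs to a privileged register
    mr: bool     # most-recent mapping for its logical register
    rdy: bool    # value has been written back
    value: int

def cam_search(entries, lgpr, user_mode=True):
    """Return (hit_pgpr, value, comparators_fired, ram_read)."""
    comparators_fired = 0
    hit = None
    for pgpr, e in enumerate(entries):
        # Comparator gated off: stale mapping, or privileged entry in user mode.
        if not e.mr or (user_mode and e.priv):
            continue
        comparators_fired += 1
        if e.lgpr == lgpr:
            hit = pgpr
    # The RAM wordline fires only for a matched entry whose result is ready.
    ram_read = hit is not None and entries[hit].rdy
    value = entries[hit].value if ram_read else None
    return hit, value, comparators_fired, ram_read
```

When the matched entry is not yet ready, the model reports a hit but no RAM read, matching the case where the source value is forwarded from the pipeline instead.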
Multimedia data processing (structure load/store)
• Multimedia data is a series of structures
• Each structure has 1-4 elements
ƒ 1-element structure (e.g., sampled values)
ƒ 2-element structure (e.g., coordinates)
ƒ 3-element structure (e.g., color space)
ƒ 4-element structure (e.g., 3D graphics)
• Dilemma
ƒ Registers normally filled in-order with sequential bytes
– First register gets filled first, then next register
ƒ But, processing algorithms require different order
– Put all “first elements” in register 1, all “second elements” in register 2, etc.
• Solution – Auto-permuting load/store operations
ƒ Elements auto-permuted into registers “on-the-fly”
ƒ Saves energy by avoiding read-permute-write operations
• Example
ƒ Four, 3-element structures (16-bits per element)
ƒ Loaded into three doubleword registers (Dn, Dn+1, Dn+2)
[Figure: VLD3.16 {Dn, Dn+1, Dn+2}, [RA]. Memory at [RA] holds R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3 in ascending addresses. After the load, Register Dn = R3 R2 R1 R0, Register Dn+1 = G3 G2 G1 G0, Register Dn+2 = B3 B2 B1 B0]
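The de-interleaving performed by the structure load in the example can be modelled in software as follows. This is a sketch of the data movement only, with hypothetical function names. Registers are represented as Python lists with lane 0 first, so the slide's "R3 R2 R1 R0" appears here as `["R0", "R1", "R2", "R3"]`.

```python
def vld3(memory, base=0, structures=4):
    """De-interleave 3-element structures into three registers.

    Returns (Dn, Dn+1, Dn+2): all first elements, all second elements,
    all third elements, each as a list with lane 0 first.
    """
    dn, dn1, dn2 = [], [], []
    for i in range(structures):
        offset = base + 3 * i
        dn.append(memory[offset])
        dn1.append(memory[offset + 1])
        dn2.append(memory[offset + 2])
    return dn, dn1, dn2

def vst3(dn, dn1, dn2):
    """Inverse operation: re-interleave three registers into memory order."""
    memory = []
    for first, second, third in zip(dn, dn1, dn2):
        memory.extend([first, second, third])
    return memory

# Four RGB pixels laid out sequentially in memory, as in the slide's example.
PIXELS = ["R0", "G0", "B0", "R1", "G1", "B1",
          "R2", "G2", "B2", "R3", "G3", "B3"]
```

Without the permuting load, software would have to load the twelve values in memory order and then shuffle them between registers, which is the read-permute-write work the slide says is avoided.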
Multimedia data processing (pipelined DMA + TCM)
• Leverage Scorpion DMA and TCM
ƒ Block n written from TCM back to memory by DMA channel “A”
ƒ Block n+1 stored from VeNum register file back to TCM
ƒ Block n+2 processed in VeNum pipeline
ƒ Block n+3 loaded from TCM into VeNum register file
ƒ Block n+4 read from memory into TCM by DMA channel “B”
• TCM and L1 D-cache kept coherent
[Figure: data blocks n through n+4 in memory, moved by DMA ‘B’ into tightly coupled memory, then through the VeNum register file and VeNum execution unit, and back to memory by DMA ‘A’]
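The five overlapping activities on this slide form a software pipeline: at any step, each stage works on a different block. The sketch below generates that schedule. The stage labels paraphrase the slide; the function name and the idea of discrete, equal-length steps are simplifying assumptions.

```python
# Stages in the order a single block passes through them.
STAGES = [
    "DMA 'B': memory -> TCM",
    "load: TCM -> VeNum register file",
    "process in VeNum pipeline",
    "store: VeNum register file -> TCM",
    "DMA 'A': TCM -> memory",
]

def schedule(num_blocks):
    """Return, for each time step, a dict mapping stage name -> block index."""
    steps = []
    for t in range(num_blocks + len(STAGES) - 1):
        active = {}
        for stage_index, name in enumerate(STAGES):
            block = t - stage_index  # later stages lag by one block each
            if 0 <= block < num_blocks:
                active[name] = block
        steps.append(active)
    return steps
```

Once the pipeline is full, every step matches the slide's snapshot: DMA 'B' fetches block n+4 while block n+3 is loaded, n+2 is processed, n+1 is stored, and DMA 'A' writes block n back.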
MPEG-4 encode/decode performance
• Leverages VeNum SIMD engine
ƒ Cosine transforms
ƒ Deblocking filters
ƒ Motion estimation
ƒ Color space conversion
• Full-duplex video encode/decode
ƒ Almost 800MHz of headroom
– Available for OS and other tasks
[Figure: bar chart "MPEG4 on CPU", CIF (352x288, 30fps), 133MHz memory. Y-axis: MHz, 0 to 1000. Bars for Scorpion and QUALCOMM ARM11, split into Encode, Decode, and Available MHz]
Graphics vertex processing performance
• Vertex processing gaming application
• Improvement from QUALCOMM ARM1136
ƒ 3.6x due to clock speed & micro-architecture
ƒ 6.2x more from VeNum SIMD engine
ƒ 1.6x more from DMA plus TCM
• Total Scorpion improvement: 35x !!
• Complements graphics processing unit (GPU)
ƒ Software chooses where to split the graphics processing chain
[Figure: bar chart "Graphics Vertex Processing (vertices/sec)", normalized to QUALCOMM 400MHz ARM1136. QUALCOMM ARM1136: 1x; Scorpion w/o VeNum: 3.6x; Scorpion w/ VeNum: 22.3x; Scorpion w/ VeNum, DMA, TCM: 35.1x]
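The per-step gains on this slide are multiplicative, and a short calculation shows how they compound into the chart values. The names below are hypothetical. Note that multiplying the rounded factors gives about 35.7x rather than the charted 35.1x, so the quoted 1.6x is presumably a rounded value; the chart implies roughly 1.57x for the last step.

```python
from itertools import accumulate
from operator import mul

# Step gains as quoted on the slide (each relative to the previous step).
STEP_GAINS = [
    ("clock speed & micro-architecture", 3.6),
    ("VeNum SIMD engine", 6.2),
    ("DMA plus TCM", 1.6),
]

def cumulative_speedups(step_gains):
    """Running product of the step gains, relative to the ARM1136 baseline."""
    return list(accumulate((gain for _, gain in step_gains), mul))

def implied_step(total, previous):
    """Gain of one step implied by two cumulative chart values."""
    return total / previous
```

For example, `implied_step(35.1, 22.3)` recovers the unrounded DMA plus TCM gain from the two rightmost chart bars.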
Scorpion Summary
• Scorpion processor core for mobile applications
ƒ A unique micro-architectural realization of the ARMv7 architecture
– Provides maximum energy efficiency at high performance levels
– Performance up to 2100 DMIPS and 8000 MFLOPS at 1GHz
» Using 65nm LP technology
– Power as low as 0.14mW/DMIP
ƒ The cornerstone of QUALCOMM’s Snapdragon technology platform
– QUALCOMM creating a variety of Scorpion-based Snapdragon
products for different applications
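The headline numbers can be checked against the opening question (2000+ DMIPS below 500 mW) with simple arithmetic. The sketch below assumes the best-case 0.14 mW/DMIP figure applies at the full 2100 DMIPS operating point, which the slides do not state explicitly; the function names are hypothetical.

```python
TARGET_DMIPS = 2000            # "2000+ DMIPS"
TARGET_POWER_MW = 500          # "below 500mW"
SCORPION_DMIPS = 2100          # quoted peak at 1 GHz
SCORPION_MW_PER_DMIPS = 0.14   # quoted "as low as" figure

def budget_mw_per_dmips(power_mw, dmips):
    """Energy-efficiency budget implied by a power cap and a DMIPS target."""
    return power_mw / dmips

def core_power_mw(dmips, mw_per_dmips):
    """Core power if the quoted efficiency held at the given DMIPS level."""
    return dmips * mw_per_dmips
```

Under that assumption the target works out to a budget of 0.25 mW/DMIPS, and 2100 DMIPS at 0.14 mW/DMIP comes to about 294 mW, comfortably inside the 500 mW goal.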
High Performance Mobile Platform Challenges
• Mobile platforms
ƒ Complex systems with a wide range of applications
ƒ Trade-offs between generic and customized processing solutions
ƒ Complex system modeling and architecture trade-offs
• Key challenges
ƒ Optimizing system bandwidth for multiple concurrent applications
ƒ Extremely power efficient designs for maximum battery life
ƒ Minimum footprint to enable small, lightweight mobile form factors
Key Mobile Platform Trends
• Multimedia and connected applications driving need for more MIPS
ƒ Performance approaching the level of desktop PCs
• More features in smaller form factors drive the need for lower power and higher integration
ƒ Smaller batteries
ƒ Always on
[Figure: "Computing Performance Trend" chart, MHz (0 to 1000) by year (1994 to 2006), comparing desktop PCs with high-end smart phones; annotated values ~500 MHz and 525 MHz. Source: Gartner Dataquest and 3G Today (www.3gtoday.com), 2006]
Snapdragon Highly Integrated Platform
• Always On
ƒ Low power consumption through custom CPU and DSP cores
ƒ All the performance of a laptop in your pocket and much more battery life
• Industry leading Performance
ƒ Superscalar CPU: Scorpion surpasses 2100 DMIPS at 1 GHz speeds
ƒ Next Generation DSP running at 600MHz
ƒ High resolution up to XGA support for uncompromised Video and Computing
• Ubiquitous Connectivity
ƒ CDMA, WCDMA, HSPA, GPS, Bluetooth, WiFi, Broadcast (MediaFLO, DVB-H, etc.)
Snapdragon Platform Connectivity
• Cross-bar interconnect to enable simultaneous traffic
• Balanced interconnect enabling any master to access any slave
• Tiered bus structure to off-load low bandwidth/latency tolerant traffic
[Figure: Snapdragon platform block diagram. Peripherals Subsystem (SSBI, GPIO, Touch, UART, TSIF, SD, and I2C controllers) on the AHB Peripheral Interconnect; EBI2 (Flash/NOR, UBM, Aux Display Interface); Display/Camera/USB I/F (LCD I/F, TV DAC, MDDI Client, MDDI Host, USB); Audio Interface (I2S, Aux PCM, Aux In, Aux Out); MODEM (ARM9, MDSP, GPS; CDMA, WCDMA, HSDPA & EGPRS); Application Subsystem (Scorpion CPU, VIC, Timers, L2 Cache/TCM); Application Co-processors (ADSP, Crypto, Graphics, Video, Display, Image, Audio); Memory Subsystem (On-chip LPDDR, IMEM, EBI1, SMI1); High Bandwidth AXI Cross-bar Interconnect]
Snapdragon High Bandwidth AXI Interconnect
• Enhanced AXI architecture
ƒ Memory barriers, memory type and attributes, ordering
• Configurable full cross-bar implementation
ƒ Arbiter for each slave interface, parallel R/W data paths
ƒ Configurable number of masters & slaves, queue depths, pipeline depths
ƒ Simultaneous access to all slaves
• Integrated performance monitor and bus trace
[Figure: cross-bar connecting Masters (MODEM: ARM9, MDSP; Application Subsystem: CPU, DMA; Application Co-processors: ADSP, Graphics, Video, Display, Image; Display / USB / Audio: MDDI Host, USB, Audio) to Slaves (SMI1, EBI1, CPU Slave Port, DSP Slave Port, IMEM)]
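The "arbiter for each slave interface" and "simultaneous access to all slaves" points can be illustrated with a one-cycle model of a full cross-bar: masters targeting different slaves all proceed, and only masters contending for the same slave are serialized. The function names and the fixed-order tie-break are assumptions made for illustration; the slides do not describe the arbitration policy.

```python
def crossbar_cycle(requests, master_order):
    """Resolve one cycle of a full cross-bar.

    requests: dict mapping master name -> slave name it wants this cycle.
    master_order: tie-break order (ASSUMED fixed priority, earliest wins).
    Returns a dict mapping slave name -> granted master.
    """
    grants = {}
    for master in master_order:
        slave = requests.get(master)
        # Each slave has its own arbiter and accepts one master per cycle.
        if slave is not None and slave not in grants:
            grants[slave] = master
    return grants

def stalled_masters(requests, grants):
    """Masters that requested a slave but were not granted this cycle."""
    granted = set(grants.values())
    return sorted(m for m in requests if m not in granted)
```

With the CPU and graphics both targeting EBI1 while DMA targets SMI1, two transfers proceed in the same cycle and only one master waits, whereas a single shared bus would serialize all three.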
Snapdragon Bandwidth Modeling Methodology
• Goal
ƒ Optimal interconnect for mobile applications
• Methodology
ƒ Model bus & memory application traffic
ƒ Characterize various bus and memory alternatives
ƒ Key criteria include bandwidth, latency, and utilization
• Conclusions
ƒ System bandwidth was limited by memory bandwidth
ƒ AXI Bus utilization was less than 50% of theoretical maximum
ƒ Memory controller enhancements delivered up to 40% system bandwidth improvement without increased AXI bus frequency
[Figure: modeling flow. 1. SystemC Model; 2. Application Traffic BW; 3. Memory / Bus Freq. Analysis; 4. Latency/BW Analysis; 5. Concurrent Application BW Analysis. Example charts: "Memory Bandwidth Analysis, Multi-player gaming + 2nd application" (EBI1 B/W, SMI B/W, SMI+EBI B/W, split into game and other app) and read latency / memory BW versus offered BW for Mem 1-3 and Bus 1-3]
Snapdragon Memory Overview
• Multiple memory interfaces
ƒ Stacked LP DDR SDRAM reduces power dissipation and board area
ƒ External LP DDR SDRAM provides additional bandwidth
• Highly power optimized external memory controller
ƒ Power-down, deep power down, clock stop
ƒ Self refresh, auto refresh, directed auto-refresh, temperature adjusted refresh rates
ƒ IO calibration to adjust IO impedance
• External bus interface controller (EBI2) supports multiple memory options
ƒ NAND, OneNAND/M-systems, burst NOR support
• Integrated on-chip memory
ƒ IMEM reduces off-chip memory accesses
ƒ CPU and DSP L2 caches can be configured as tightly coupled on-chip memories
[Figure: memory-related blocks. EBI2 (Flash/NOR, UBM, Aux Display Interface); Application Subsystem (Scorpion CPU, VIC, Timers, L2 Cache/TCM); Application Co-processors (Data Mover, Crypto, ADSP with TCM, Graphics, Video, Display, Image, Audio); Memory Subsystem (IMEM, On-chip LPDDR, EBI1, SMI1)]
Snapdragon Application CPU Subsystem
• Secure vector interrupt controller
ƒ Configurable up to 64 primary interrupts
ƒ 8-level Prioritized Interrupts to FIQ/IRQ
ƒ 32-bit IRQ/FIQ vector address
ƒ TrustZone™-compliant security mechanism
• Clock and power manager
ƒ High frequency, low jitter PLL
ƒ Clock source selection, gating and routing
ƒ Power collapse and voltage scaling
• RTOS, general purpose & secure timers
[Figure: Application Subsystem. Scorpion CPU with SIMD, 32KB Instr Cache, 32KB Data Cache, L2 Cache / TCM / DMA, Timers, Secure Vector Interrupt Controller, Clock & Power Manager]
Snapdragon Peripheral Subsystem
• Provides connectivity to peripheral devices
ƒ Direct access from processors
ƒ Off-loads peripheral traffic from memory bus
ƒ Round robin arbitration with 5 levels of priority
ƒ Memory protection for secure peripherals
[Figure: Peripherals Subsystem (SSBI Ctlr, Touch Ctlr, GPIO Ctlr, TSIF Ctlr, I2C Ctlr, UART, SD Ctlr) on the AHB Peripheral Interconnect, accessed by the Modem, Application Subsystem, and Application Co-processors (ADSP, Display Proc)]
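"Round robin arbitration with 5 levels of priority" can be read in more than one way. The sketch below shows one plausible interpretation: the highest priority level with a pending request wins, and masters within that level take turns. The class name, the choice of level 0 as highest, and the example master-to-level mapping are all assumptions, not details from the slides.

```python
from collections import deque

class PriorityRoundRobinArbiter:
    """One ASSUMED reading: strict priority across levels, round robin within."""

    LEVELS = 5  # "5 levels of priority" from the slide

    def __init__(self, priorities):
        # priorities: dict mapping master name -> level (0 = highest, assumed).
        self.queues = [deque() for _ in range(self.LEVELS)]
        for master, level in priorities.items():
            self.queues[level].append(master)

    def grant(self, requesting):
        """Return the master granted this cycle, or None if nobody requests."""
        for queue in self.queues:
            for _ in range(len(queue)):
                master = queue[0]
                queue.rotate(-1)  # this master moves to the back of its level
                if master in requesting:
                    return master
        return None
```

A short usage sketch: with `PriorityRoundRobinArbiter({"Scorpion": 0, "Modem": 0, "ADSP": 1})`, repeated calls to `grant({"Scorpion", "Modem", "ADSP"})` alternate between the two level-0 masters, and the ADSP is served only when neither of them is requesting.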
Snapdragon Display Support
• RGB LCD Controller
ƒ Supports direct-attach LCD panels up to XGA at 60Hz
ƒ 24 bit RGB outputs, programmable refresh rates and display sizes
• Mobile Display Digital Interface (MDDI)
ƒ High-speed serial communication for displays and sensors
ƒ Type II (1 Gbps) MDDI interface
ƒ Displays up to XGA
• TV Out
ƒ Composite and S-Video output supported
ƒ Integrated 10-bit DAC, NTSC or PAL
• Auxiliary LCD interface for sub displays
[Figure: Display Interfaces (EBI2, TV DAC, LCD I/F, MDDI Host) connected to the Display Processor in the Application Co-processors]
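To give a sense of scale for the RGB LCD controller figures, the raw pixel traffic of an XGA panel at 60 Hz with 24-bit color can be computed directly. This sketch counts active pixels only and ignores blanking intervals and any protocol overhead; the function name is hypothetical.

```python
XGA_WIDTH, XGA_HEIGHT = 1024, 768  # XGA resolution

def raw_display_rate(width, height, refresh_hz, bits_per_pixel):
    """Return (pixels per second, bits per second) for active pixels only."""
    pixels_per_second = width * height * refresh_hz
    return pixels_per_second, pixels_per_second * bits_per_pixel

# XGA at 60 Hz with 24-bit RGB, as quoted for the direct-attach LCD path.
PIXELS_PER_S, BITS_PER_S = raw_display_rate(XGA_WIDTH, XGA_HEIGHT, 60, 24)
```

This comes to roughly 47 million pixels and about 1.13 Gbit (142 MB) of pixel data per second that the display path must sustain while other subsystems share the memory bandwidth.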
Snapdragon Multimedia Co-Processors
• Integrated high-performance, power-optimized multimedia processing engines
ƒ Programmable to enable adaptation of emerging standards
• Multimedia DSP
ƒ Custom QUALCOMM designed 600MHz DSP
• Mobile Graphics Processor
ƒ Support for OpenGL ES 2.0
ƒ 133M pixel/sec or 21M triangles/sec
• Mobile Video Processor
ƒ Video encoding and decoding
ƒ Supporting H.263, H.264, and MPEG-4
• Mobile Display Processor
ƒ Integrated LCD controller
ƒ Image Processing (e.g. rotate, scale)
• Mobile Audio Processor
ƒ Wideband stereo CODEC, I2S (Inter-IC Sound), PCM, and Dual Microphone Support
• Mobile Image Processor
ƒ Camera sensor image processor
ƒ Viewfinder
ƒ Video and image capture
ƒ Snapshot processing
ƒ Encoding
ƒ Image display, and image processing
[Figure: Application Co-processors. ADSP (600MHz Multimedia DSP), Graphics (Mobile Graphics Processor), Video (Mobile Video Processor), Display (Mobile Display Processor), Audio (Mobile Audio Processor), Image (Mobile Image Processor)]
Network Gaming Example
• Application requires power efficient data movement
• Crossbar interconnect provides parallel data paths
• Tightly coupled memory is used to store intermediate results
• Separate read and write data paths allow concurrent data transfers
[Figure: network gaming data flow over the High Bandwidth AXI Cross-bar Interconnect. Labels: Memory Subsystem (EBI1, SMI); MODEM (ARM9, MDSP; CDMA, WCDMA, HSDPA & EGPRS) receiving Multi-player Data from RF; Software & Game Data; Textures & Geometries; Application Subsystem (Scorpion CPU, SIMD, TCM); Application Co-processors (Graphics, Display Proc); Completed Frames]
Snapdragon Summary
• Snapdragon platform is a highly integrated, high performance, power optimized mobile solution
ƒ High performance, power efficient applications processor, multimedia DSP, multimedia applications co-processors
ƒ Configurable, power efficient interconnect optimized to enable advanced, concurrent mobile applications
• Snapdragon platform enables more advanced applications in smaller, longer lasting, always connected mobile devices
Thank you!