STM32 F4 series

High-performance Cortex™-M4 MCU
Presentation highlights
The STM32 F4 series brings to the market the world’s highest-performance
Cortex™-M microcontrollers
168 MHz FCPU/210 DMIPS
363 Coremark score
The STM32 F4 series extends the STM32 portfolio
250+ compatible devices already in production, including the
F1 series, F2 series and ultra-low-power L1 series
The STM32 F4 series reinforces ST’s current leadership in Cortex-M
microcontrollers, with 45% world market share by units in (2010 or
cumulated 2007 to Q1/11) according to ARM reporting
STM32 F4 series
High-performance digital signal controller
FPU
Single precision
Ease of use
Better code efficiency
Faster time to market
Eliminate scaling and saturation
Easier support for meta-language tools
(Matlab…)
What is Cortex-M4?
MCU
Ease of use of C
programming
Interrupt handling
Ultra-low power
DSP
Cortex-M4
Harvard architecture
Single-cycle MAC
Barrel shifter
STM32 F4 Series highlights 1/4
• ST is introducing STM32 products based on the Cortex-M4 core.
Over 30 new part numbers, pin-to-pin and software compatible
with the existing STM32 F2 Series.
• The new DSP and FPU instructions combined with 168 MHz
performance open the door to a new level of Digital Signal
Controller applications and faster development time.
• STM32: Releasing your creativity
STM32 F4 Series highlights 2/4
Advanced technology and process from ST:
• Memory accelerator: ART Accelerator™
• Multi-AHB bus matrix
• 90 nm process
Outstanding results:
• 210 DMIPS at 168 MHz
• Execution from Flash equivalent to 0-wait state performance
up to 168 MHz thanks to ST ART Accelerator
STM32 F4 Series highlights 3/4
More Memory
• Up to 1 MB Flash
• 192 KB SRAM: 128 KB on bus matrix + 64 KB on data bus dedicated
to CPU usage
Advanced peripherals shared with STM32 F2 Series
• USB OTG High speed 480 Mbit/s
• Ethernet MAC 10/100 with IEEE 1588
• PWM high-speed timers: now 168 MHz max frequency!
• Crypto/hash processor, 32-bit random number generator (RNG)
• 32-bit RTC with calendar: now with sub-1-second accuracy and
<1 µA typ!
STM32 F4 Series highlights 4/4
Further improvements
• Low voltage: 1.8 V to 3.6 V VDD, down to 1.7* V on most
packages
• Full duplex I2S peripherals
• 12-bit ADC: 0.41 µs conversion/2.4 Msps (7.2 Msps in
interleaved mode)
• High speed USART up to 10.5 Mbit/s
• High speed SPI up to 37.5 Mbit/s
• Camera interface up to 54 Mbyte/s
*external reset circuitry required to support 1.7 V
STM32 F4 series – applications served
• Points of sale/inventory management
• Industrial automation and solar panels
• Building
• Security/fire/HVAC
• Test and measurement
• Transportation
• Consumer
• Medical
• Communication
STM32 F4 block diagram
Feature highlight
• 168 MHz Cortex-M4 CPU
• Floating point unit (FPU)
• ART Accelerator™
• Multi-level AHB bus matrix
• 1-Mbyte Flash, 192-Kbyte SRAM
• 1.7 to 3.6 V supply
• RTC: <1 µA typ, sub-second accuracy
• 2x full duplex I²S
• 3x 12-bit ADC, 0.41 µs/2.4 MSPS
• 168 MHz timers
STM32 F4 portfolio
STM32 product series
4 product series
STM32 – leading Cortex-M portfolio
The cheapest and quickest way to discover the STM32F4
• Everything included for a quick start with the STM32F4 series
• Order code: STM32F4DISCOVERY
• Available in ST stock from October 2011
• In-circuit ST-LINK/V2 debugger/programmer included to debug
Discovery kit applications or other target board applications
• Dedicated web site www.st.com/stm32F4discovery
• Large number of examples ready to run
• Schematics
• Forums and more
STM32F4 Discovery Board
• On-board ST-LINK/V2 with selection mode switch to use the kit as
stand-alone ST-LINK with SWD connector
• Designed to be powered by USB or by an external 5 V or 3.3 V supply
• Can supply the target application with 5 V or 3 V
• Two user LEDs (green and blue)
• Audio codec
• MEMS microphone (MP45DT02)
• One user push button
• Extension header for all QFP64 I/Os for quick connection to
prototyping board or easy probing
[Figure: board photo with callouts: ST-LINK/V2, SWD connector, STM32F407VGT6, user button, audio jack]
September: STM32F4 eval board
• Eval board: STM3240G-EVAL: 21st of September
• For any needs before then, contact your local ST support
Samples: 21st of September
Package | Part number
LQFP100 | STM32F407VGT6
LQFP144 | STM32F457ZGT6
LQFP176 | STM32F457IGT6
BGA176 | STM32F457IGH6
LQFP64 | STM32F455RGT6
Full production: November 2011
Advanced Information
STM32 F4 key features
Real time performance
STM32 F4 series: Cortex M4-based
STM32F4 versus competitors
STM32 F4: World’s #1 in performance
[Figure: Dhrystone benchmark comparison of the STM32F4 versus competitors]
It takes ART to be #1 in performance: it is a combination of core, embedded
Flash design, process and acceleration techniques.
ST’s ART Accelerator™
The adaptive real-time memory accelerator unleashes the Cortex-M4 core’s
maximum processing performance, equivalent to 0-wait state execution from
Flash up to 168 MHz
Real-time performance
32-bit multi-AHB bus matrix
[Figure: MP3 player example of concurrent transfers through the bus matrix. Labels: access to the MP3 decoder code for execution by the core; compressed audio stream (MP3); data for decompression; decompressed audio stream; 112-Kbyte SRAM block; 16-Kbyte SRAM block; DMA transfer to the audio output stage (I2S); user interface: DMA transfers of the graphical icons from Flash to display.]
Outstanding power efficiency
Typical values in VBAT mode
• 230 µA/MHz, 38.6 mA at 168 MHz executing the Coremark benchmark from
Flash memory (with peripherals off), made possible with:
  • ST’s 90 nm process allowing the CPU core to run at only 1.2 V
  • ART Accelerator™ reducing the number of accesses to Flash
  • Voltage scaling to optimize performance/power consumption
• VDD min down to 1.7 V
• Low-power modes with backup SRAM and RTC support
Low power and real life applications
• Low power in real-life applications is not just the low-power mode
• Need to consider the % of time spent in low-power mode and in Run mode
[Figure: consumption (µA/MHz) versus time for two profiles alternating Run mode and low-power mode, with different % of time in each mode and the resulting average consumption]
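The duty-cycle argument above can be sketched as a weighted average. This is a minimal illustration, not ST's measurement method: the function names are hypothetical, the 38.6 mA Run figure comes from the power-efficiency slide, and the low-power current passed in is a placeholder the reader must replace with a datasheet value.

```c
#include <stdio.h>

/* Average current (mA) of a duty-cycled application.
 * run_fraction: share of time spent in Run mode, from 0.0 to 1.0.
 * i_run_ma, i_lp_ma: current in Run mode and in low-power mode. */
double average_current_ma(double run_fraction, double i_run_ma, double i_lp_ma)
{
    return run_fraction * i_run_ma + (1.0 - run_fraction) * i_lp_ma;
}

/* Print the average for a few duty cycles (hypothetical helper). */
void print_duty_cycle_table(double i_run_ma, double i_lp_ma)
{
    const double fractions[] = { 1.0, 0.5, 0.1, 0.01 };
    for (int i = 0; i < 4; i++) {
        printf("run %5.1f%% -> %.3f mA\n", fractions[i] * 100.0,
               average_current_ma(fractions[i], i_run_ma, i_lp_ma));
    }
}
```

Calling `print_duty_cycle_table(38.6, 0.1)` shows how quickly the Run-mode figure stops dominating once the application sleeps most of the time; the 0.1 mA value is only an assumed example.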
Superior and innovative peripherals
• Audio architecture
• HW crypto/hash coprocessor
• PWMs @ 168 MHz
• Ethernet with IEEE 1588v2
• 2 USB OTG
• 2 full duplex I²S
• ADC 2.4 MSPS
• <1 µA RTC
Digital Camera Interface
• Digital camera interface, up to 54 Mbyte/s
• The camera interface is a universal 8- to 14-bit parallel
interface (no industry-standard name). It supports the following
data formats:
- 8-bit progressive video, monochrome or raw Bayer
- YCbCr 4:2:2 progressive video
- RGB 565 progressive video
- compressed data (like JPEG)
It also supports the following features:
- continuous mode or snapshot (a single frame) mode
- automatic image cropping
- 8-word FIFO
- AHB slave interface with capability to control the GP-DMA (request/acknowledge)
using 1 channel
- various interrupt flags such as End of Line, End of Frame, Vertical
Synchronization, Overrun or Error flags
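To put the 54 Mbyte/s figure in perspective, the bandwidth can be turned into an upper bound on frame rate. This is a rough sketch with assumptions: the function name is hypothetical, 54 Mbyte/s is read as 54,000,000 bytes per second, and blanking intervals and sensor limits are ignored.

```c
#include <stdint.h>

/* Upper bound on frames per second for a given interface bandwidth.
 * bytes_per_pixel is 2 for RGB 565 and for YCbCr 4:2:2. */
double max_frames_per_second(double bytes_per_second, uint32_t width,
                             uint32_t height, uint32_t bytes_per_pixel)
{
    double frame_bytes = (double)width * height * bytes_per_pixel;
    return bytes_per_second / frame_bytes;
}
```

For a 640 x 480 RGB 565 frame (614,400 bytes), `max_frames_per_second(54e6, 640, 480, 2)` gives roughly 88 frames per second as a ceiling; a real sensor and the memory system will usually set a lower limit.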
Crypto/Hash Processor and RNG
• Encryption/Decryption
  • DES/TDES (data encryption standard/triple data encryption
  standard): ECB (electronic codebook) and CBC (cipher block
  chaining) chaining algorithms, 64-, 128- or 192-bit key
  • AES (advanced encryption standard): ECB, CBC and CTR
  (counter mode) chaining algorithms, 128-, 192- or 256-bit key
• Universal hash
  • SHA-1 (secure hash algorithm)
  • MD5
• True random number generator (RNG) that delivers 32-bit random
numbers produced by an integrated analog circuit
Crypto/Hash Processor performance
Parameter | AES | DES | TDES
Key sizes | 128, 192 or 256 bits | 64* bits | 192***, 128** or 64* bits
Block sizes | 128 bits | 64 bits | 64 bits
Time to process one block | 14 HCLK cycles for key = 128 bits; 16 HCLK cycles for key = 192 bits; 18 HCLK cycles for key = 256 bits | 16 HCLK cycles | 48 HCLK cycles
Type | block cipher | block cipher | block cipher
Structure | Substitution-permutation network | Feistel network | Feistel network
First published | 1998 | 1977 (standardized on January 1979) | 1998 (ANS X9.52)
* 8 parity bits (TDES: keying option 1)
** 16 parity bits: keying option 2
*** 24 parity bits: keying option 3
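The cycles-per-block figures translate directly into a peak throughput. The sketch below assumes HCLK = 168 MHz and ignores FIFO, DMA and key-setup overhead, so the results are ceilings rather than measured numbers; the function name is hypothetical.

```c
#include <stdint.h>

/* Peak throughput in bytes per second for a block cipher engine that
 * needs cycles_per_block HCLK cycles to process one block. */
uint64_t peak_bytes_per_second(uint32_t hclk_hz, uint32_t block_bits,
                               uint32_t cycles_per_block)
{
    return ((uint64_t)hclk_hz * (block_bits / 8u)) / cycles_per_block;
}
```

With these assumptions, AES with a 128-bit key (128-bit block, 14 cycles) peaks at 192,000,000 bytes/s, DES (64-bit block, 16 cycles) at 84,000,000 bytes/s and TDES (64-bit block, 48 cycles) at 28,000,000 bytes/s.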
Crypto/Hash Processor
[Figure: CRYPTO processor block diagram. Input FIFO and output FIFO, each with data swapping; DMA request for incoming data transfer and DMA request for outgoing data transfer; AES (ECB, CBC, CTR; key: 128-, 192- and 256-bit); TDES (ECB, CBC; key: 64-, 128- and 192-bit); DES (ECB, CBC; key: 64-bit); flags: IFEM, IFNF, OFNE, OFFU, BUSY; interrupt signals: INRIS, INIM, INMIS, OUTRIS, OUTIM, OUTMIS; CRYPTO global interrupt (NVIC).]
USB OTG
• Up to 2 USB OTG peripherals (on STM32F4x7 devices), compliant
with the USB 2.0 specification and with the OTG 1.0 specification
• One is Full Speed (12 Mb/s) only
• One is Full Speed or High Speed (480 Mb/s). Both
embed an FS PHY
USB OTG HS
• The USB FS/HS peripheral supports both full-speed and
high-speed operations. It features a
UTMI low-pin interface (ULPI) to connect an
external HS PHY device. The OTG PHY is
connected to the microcontroller ULPI port
through 12 signals.
• Features a dedicated RAM of 4 Kbytes with
advanced FIFO control
• Dedicated DMA controller
Audio architecture
• Two PLLs are available for more flexibility of the
system:
  • The main PLL (PLL), clocked by HSI or HSE, is used to
  generate the system clock (up to 168 MHz) and the 48 MHz
  clock for USB OTG FS, SDIO and RNG.
  • A dedicated PLL (PLLI2S) is used to generate an accurate
  clock to achieve high-quality audio performance on the I2S
  interface.
• USB OTG peripherals facilitate audio synchronization:
  • each time a SOF event occurs, a pulse can be output on a
  pin
  • dynamic trimming capability of the SOF framing period in host
  mode
• 2x I2S full duplex peripherals with:
  • Less than 0.5% error on sampling frequency
  • Clock input in case an external high-quality audio PLL is
  needed
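The "less than 0.5% error" claim can be checked for a given clock setup with a short calculation. This sketch rests on assumptions that are not stated in the slides: the prescaler relation Fs = I2SCLK / (256 x (2 x DIV + ODD)) with master clock output enabled, and the example values (135.5 MHz PLLI2S output, DIV = 6, ODD = 0). Both should be checked against the reference manual. The function names are hypothetical.

```c
/* Sampling frequency produced by the I2S prescaler when the master clock
 * output is enabled (assumed relation: Fs = i2sclk / (256 * (2*div + odd))). */
double i2s_actual_fs(double i2sclk_hz, unsigned div, unsigned odd)
{
    return i2sclk_hz / (256.0 * (2.0 * div + odd));
}

/* Relative error in percent between the produced and the target rate. */
double i2s_fs_error_percent(double actual_hz, double target_hz)
{
    double diff = actual_hz - target_hz;
    if (diff < 0.0) {
        diff = -diff;
    }
    return 100.0 * diff / target_hz;
}
```

With the assumed values, `i2s_actual_fs(135.5e6, 6, 0)` is about 44108 Hz, which is roughly 0.018% away from a 44.1 kHz target and well inside the 0.5% bound.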
More peripherals improvements
• Flexible Static Memory Interface for external LCD,
SRAM, PSRAM, NOR and NAND Flash, CompactFlash,
running at up to 60 MHz to expand memory space or
support an external display
• 3 SPIs running at up to 37.5 Mbit/s
• 6 USARTs running at up to 10.5 Mbit/s
• Analog:
  • ADCs and DACs work down to VDD min
  • 3x 12-bit ADC, 2.4 MSPS, up to 7.2 MSPS in interleaved mode
• Fast GPIO (84 MHz toggling speed)
• RTC: sub-second accuracy, <1 µA typ
Maximum integration
• The 1-Mbyte Flash and 192-Kbyte SRAM
memories available in the product accommodate
advanced software stacks and user data, with no
need for external memories
• 4-Kbyte battery-backed SRAM: used as EEPROM
to save application state and calibration data
• In addition, 528 bytes of OTP memory make it
possible to store critical user data such as
Ethernet MAC addresses or cryptographic keys
Extensive tools and SW
Evaluation board for full product feature evaluation
• Hardware evaluation platform for all interfaces
• Possible connection to all I/Os and all peripherals
• STM3240G-EVAL: $349
Discovery kit for cost-effective evaluation and prototyping
• STM32F4DISCOVERY: $14.90
Starter kits from 3rd parties available soon
Large choice of development IDE solutions from the STM32 and ARM ecosystem
Software Libraries
• ST software libraries free at www.st.com/mcu
• C source code for easy implementation of all STM32 peripherals in any application
• ST engineered, tested, documented and free
• Standard library: source code for implementation of all standard
peripherals. Code implemented in demos for STM32 evaluation board
• Motor Control library: sensorless vector control for 3-phase
brushless motors
• DSP library: PID, IIR, FFT, FIR (free with license agreement)
• Audio library: MP3/WMA decoder, volume control, equalizer
(free with license agreement)
Key messages to remember
• STM32 F4 series
  • World’s highest performance
  • Extends the STM32 portfolio to 250+ compatible
  devices
• One in two Cortex-M MCUs shipped worldwide is
an STM32
Discovery kits available now
STM32F4DISCOVERY
Thank you
www.st.com/stm32f4
STM32F roadmap
STM32F series short-term roadmap
• STM32F4 series: Cortex-M4 @ 168 MHz
• STM32F2 series: Cortex-M3 @ 120 MHz
• STM32F1 series: Cortex-M3 @ 72 MHz
• STM32F0 series: Cortex-M0
STM32 Next 2 Major Launch
• STM32F4 series, Cortex-M4 @ 168 MHz
  • STM32F4 → Cortex-M4
  • Increasing ST leadership in the performance race
  • PR September 2011
• STM32F0 series, Cortex-M0
  • STM32F0 → Cortex-M0
  • Expanding market reach towards 8-16 bit
  • Early 2012
STM32 F4 Roadmap
[Figure: Flash size in bytes (256 K, 512 K, 1 MB, 2 MB) versus pin count (64 pins LQFP/WLCSP, 100 pins LQFP, 144 pins LQFP, 176 pins LQFP/UFBGA, 208 pins UFBGA)]
• STM32 F4 2MB Flash die: samples Q3 2012, production end of 2012
• STM32 F4 1MB Flash die: production now
Backup Slides
STM32 F4 Block diagram
• Cortex M4 w/FPU 168 MHz
• Pin-to-pin compatible with Mangusta
• More SRAM (192KB)
• Same IPs as Mangusta
• I2S: now full duplex
• New RTC sub-second precision
• Faster serial I/F
• Faster ADC
• 64 pins to 176 pins
• 1.7V-3.6V supply
[Figure: block diagram. Recoverable blocks:
Core: CORTEX M4 CPU + MPU + FPU, 168 MHz; JTAG/SW Debug; ETM; Nested vect IT Ctrl; 1 x SysTick timer; DMA, 16 channels
Memory: 1MB Flash memory; Flash I/F; 192KB SRAM; External Memory Interface; 4KB backup RAM
System: power supply (Reg 1.2V, POR/PDR/PVD); XTAL oscillators 32KHz + 8~25MHz; Int. RC oscillators 32KHz + 16MHz; PLL; clock control; RTC / AWU; 2x Watchdog (independent & window); 80/112/140 I/Os; up to 16 ext. ITs
Buses: ARM® 32-bit multi-AHB bus matrix; arbiter (max 120MHz); AHB1 (max 168MHz); AHB2 (max 168MHz); bridges; APB1 (max 42MHz); APB2 (max 84MHz)
Connectivity: Encryption; Camera Interface; USB 2.0 OTG FS; USB 2.0 OTG HS; Ethernet MAC 10/100, IEEE1588; 1x SDIO; 2x CAN 2.0B; 1 x SPI; 2 x SPI/I2S; 2 x USART/LIN; 4x USART/LIN; 3x I2C
Timers: 2x6x 16-bit PWM synchronized AC timer; 5x 16-bit timer; 3 x 16-bit timer; 2x 32-bit timer
Analog: 3x 12-bit ADC, 24 channels / 2Msps; temp sensor; 2x DAC + 2 timers]
STM32 F2 portfolio
STM32 F-2 Series portfolio
Flash size (bytes) | 64 pins LQFP/WLCSP | 100 pins LQFP | 144 pins LQFP | 176 pins LQFP/UFBGA
1 MB | STM32F205RG | STM32F205VG, STM32F207VG | STM32F205ZG, STM32F207ZG | STM32F207IG
768 K | STM32F205RF | STM32F205VF, STM32F207VF | STM32F205ZF, STM32F207ZF | STM32F207IF
512 K | STM32F205RE | STM32F205VE, STM32F207VE | STM32F205ZE, STM32F207ZE | STM32F207IE
256 K | STM32F205RC (96 KB RAM) | STM32F205VC (96 KB RAM), STM32F207VC | STM32F205ZC (96 KB RAM), STM32F207ZC | STM32F207IC
128 K | STM32F205RB (64 KB RAM) | STM32F205VB (64 KB RAM) | |
All devices have 128 KB RAM unless noted otherwise.
Legend:
• Ethernet, 2xUSB OTG, camera IF
• 1xUSB OTG FS/HS
• E*: encryption peripheral on STM32F217 and STM32F215
STM32 F2 and F4 Series coverage
[Figure: the STM32 F2 portfolio grid of Flash size (128 K to 1 MB) versus pin count (64 pins LQFP/WLCSP, 100 pins LQFP, 144 pins LQFP, 176 pins LQFP/UFBGA), listing the same STM32F205 and STM32F207 part numbers and E* markers, with an overlay labelled "Upgrade zone: STM32 F2 to F4"]
STM32 F4 Hardware tools
STM32 F4 Discovery kit
• Develop your applications easily with
everything required for beginners and
experienced users to get started quickly
• Based on STM32F407 in LQFP100 package
• Includes on-board ST-LINK/V2
Only $14.90*
*RRP
STM32 F4 Discovery kit
• STM32F407VGT6 MCU in LQFP100 package
• On-board ST-LINK/V2
• 2x ST MEMS motion sensor and microphone
• Audio DAC
• USB OTG with micro-AB connector
• Extension header for all LQFP100 I/Os
• Eight LEDs
STM32 F4 Eval Board from ST
• Evaluation board for full product feature
evaluation
• Hardware evaluation platform for all interfaces
• Possible connection to all I/Os and all peripherals
• Based on STM32F407 in UFBGA176 package
STM3240G-EVAL
$349*
*RRP
Starter kits from 3rd parties
• STM32F4 starter kits from IAR and Keil available
in Q4 2011
• Order codes:
  • IAR: STM3240G-SK/IAR
  • KEIL: STM3240G-SK/KEI
High Performance
How to benchmark micros
The core
• The first thing to consider is the core: no wait states shall be
introduced, as they decrease the result
• As a consequence, the maximum frequency achievable is the
maximum frequency of the Flash
• The compiler used to generate the benchmark code has a
significant influence on the result: for the same core, two
different compilers can give two different benchmark results
0-ws performance chart
[Figure: DMIPS (0 to 150) versus Fcpu (MHz), with the maximum flash frequency marked for Competitor A, Competitor B and STM32 F4. Annotations: "Good CPU, fast flash"; "Equivalent CPU, better flash for STM32F4".]
All the results are with the best compiler for each MCU
How to benchmark micros
The flash acceleration
• As the Flash access time limits the micro speed, wait states have
to be introduced to reach higher frequencies
• The influence of wait states is reduced by using a flash accelerator,
which combines a buffer and/or a cache system and takes advantage of a
wide access bus to the flash (e.g. 128-bit wide)
• The quality and efficiency of the flash accelerator can be
evaluated by looking at the loss of performance on a given benchmark
each time a wait state is added
  • An excellent flash accelerator results in no penalty each time a wait
  state is added
  • A poor flash accelerator results in a big penalty each time a wait
  state is added
• As a consequence, a fast flash or a powerful CPU does not
necessarily mean the best MCU performance
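The penalty-per-wait-state idea can be made concrete with a toy model. This is only an illustration under a stated assumption, not a description of the ART Accelerator: each fetch that the accelerator cannot serve (fraction `miss_rate`, a hypothetical parameter) stalls for the full number of wait states, and every other fetch costs one cycle.

```c
/* Fraction of zero-wait-state performance retained with `wait_states`
 * Flash wait states when a fraction `miss_rate` (0.0 to 1.0) of fetches
 * is not served by the accelerator and stalls for all wait states. */
double retained_performance(unsigned wait_states, double miss_rate)
{
    return 1.0 / (1.0 + (double)wait_states * miss_rate);
}
```

In this model an excellent accelerator (`miss_rate` close to 0) keeps the result at 1.0 whatever the number of wait states, while a missing or poor one (`miss_rate` close to 1) loses half the performance with a single wait state.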
FPU benefits and performance
FPU benefits in real life applications
High-level approach: matrix, mathematical equations
→ Meta-language tools: Matlab, Scilab, etc.
→ C code generation
→ Floating point numbers (float), with three possible implementations:
• FPU: direct mapping
  • No code modification
  • High performance
  • Optimal code efficiency
• No FPU: usage of SW lib
  • No code modification
  • Low performance
  • Medium code efficiency
• No FPU: usage of integer-based format
  • Code modification
  • Corner case behavior to be checked (saturation, scaling)
  • Medium/high performance
  • Medium code efficiency
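The "code modification" and "corner case" costs of the integer-based path can be seen on a single multiplication. The sketch below uses the common Q15 fixed-point convention as an assumed example format; the function names are hypothetical and the code is portable C, not generated tool output.

```c
#include <stdint.h>

/* Convert a float in [-1, 1) to Q15, saturating out-of-range inputs. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    return (int16_t)scaled;
}

/* Q15 multiply: the 32-bit product is in Q30 and must be rescaled
 * (division by 2^15, truncating toward zero), then saturated because
 * (-1) * (-1) = +1 is not representable in Q15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = ((int32_t)a * (int32_t)b) / 32768;
    if (product > INT16_MAX) {
        product = INT16_MAX;
    }
    return (int16_t)product;
}

/* The float version needs neither scaling nor saturation. */
float float_mul(float a, float b)
{
    return a * b;
}
```

For example, `q15_mul(16384, 16384)` (0.5 x 0.5) returns 8192 (0.25), and `q15_mul(INT16_MIN, INT16_MIN)` saturates to 32767, which is the kind of corner case the float path does not need to handle.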
FPU assembly code generation
float function1(float number1, float number2)
{
    float temp1, temp2;
    temp1 = number1 + number2;
    temp2 = number1/temp1;
    return temp2;
}
With FPU: 1 assembly instruction per operation
# float function1(float number1, float number2)
# {
#   float temp1, temp2;
#
#   temp1 = number1 + number2;
        VADD.F32 S1,S0,S1
#   temp2 = number1/temp1;
        VDIV.F32 S0,S0,S1
#
#   return temp2;
        BX       LR
# }
Without FPU: call Soft-FPU
# float function1(float number1, float number2)
# {
        PUSH     {R4,LR}
        MOVS     R4,R0
        MOVS     R0,R1
#   float temp1, temp2;
#
#   temp1 = number1 + number2;
        MOVS     R1,R4
        BL       __aeabi_fadd
        MOVS     R1,R0
#   temp2 = number1/temp1;
        MOVS     R0,R4
        BL       __aeabi_fdiv
#
#   return temp2;
        POP      {R4,PC}
# }
Floating point benchmark
• Execution time comparison for a 29-coefficient FIR on 32-bit floats, with
and without FPU (CMSIS library)
[Figure: bar chart of execution time, No FPU versus FPU]
• 10x improvement
• Best compromise: development time vs. performance
DSP benefits and performance
Single-cycle multiply-accumulate (MAC)
• The multiplier unit allows any MUL or MAC instruction to be
executed in a single cycle
  • Signed/Unsigned Multiply
  • Signed/Unsigned Multiply-Accumulate
  • Signed/Unsigned Multiply-Accumulate Long (64-bit)
• Benefits: speed improvement vs. Cortex-M3
  • 4x for 16-bit MAC (dual 16-bit MAC)
  • 2x for 32-bit MAC
  • up to 7x for 64-bit MAC
Saturated arithmetic
• Intrinsically prevents overflow of variables by clipping to min/max
boundaries and removes the CPU burden of software range checks
• Benefits
  • Audio applications
  [Figure: two waveform plots, without saturation and with saturation, amplitude axis from -1.5 to 1.5]
  • Control applications
    • The PID controller’s integral term is continuously accumulated over time.
    Saturation automatically limits its value and saves several CPU cycles per
    regulator
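The software range check that saturating instructions such as QADD make unnecessary looks like this in portable C. It is a sketch: `sat_add32` models a signed 32-bit saturating add, and `pid_integrate` with its `ki` gain is a hypothetical PID integral update, not code from an ST library.

```c
#include <stdint.h>

/* Signed 32-bit saturating add: the result clips to INT32_MIN/INT32_MAX
 * instead of wrapping around on overflow. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) {
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)sum;
}

/* PID integral update: accumulate ki * error with saturation so that a
 * long-lasting error cannot wrap the integral term. */
int32_t pid_integrate(int32_t integral, int32_t ki, int32_t error)
{
    int64_t term = (int64_t)ki * (int64_t)error;
    if (term > INT32_MAX) {
        term = INT32_MAX;
    }
    if (term < INT32_MIN) {
        term = INT32_MIN;
    }
    return sat_add32(integral, (int32_t)term);
}
```

On a core without saturating arithmetic, the compare-and-clip branches above run on every update; with a hardware saturating add they collapse into the addition itself.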
Single-cycle SIMD instructions
• Stands for Single Instruction Multiple Data
• Allows several operations to be performed simultaneously on 8-bit or
16-bit data
  • Ex: dual 16-bit MAC (Result = 16x16 + 16x16 + 32)
  • Ex: quad 8-bit SUB/ADD
• Benefits
  • Parallelizes operations (2x to 4x speed gain)
  • Minimizes the number of load/store instructions for exchanges between
  memory and register file (2 or 4 data transferred at once), if 32-bit is not
  necessary
  • Maximizes register file use (1 register holds 2 or 4 values)
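The dual 16-bit MAC example (Result = 16x16 + 16x16 + 32) can be modelled in portable C to show what one SIMD instruction computes. This is a behavioral sketch with hypothetical function names; it does not model accumulator overflow and is not an intrinsic.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (low half, high half). */
uint32_t pack16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable model of a dual 16-bit MAC:
 * result = lo(x) * lo(y) + hi(x) * hi(y) + acc.
 * The narrowing casts rely on two's complement wrap-around, as on ARM. */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(uint16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```

One 32-bit load brings in two samples and one instruction performs both multiplications and the accumulation, which is where the 2x gain for 16-bit data comes from.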
DSP performances for filtering applications
• FIR filter execution time (CMSIS library)
[Figure: bar chart of relative execution time (scale 0 to 100) for three implementations]
• 32-bit float, no FPU: reference
• 32-bit float, FPU: 10x improvement. Best compromise: development time vs.
performance
• 16-bit fixed-point, SIMD optimized: 17.9x improvement. Best performance.
Requires effort for proper data management
DSP performances for control application
• Example based on a complex formula used for sensorless motor drive
• Gain comes from load operations and SIMD instructions
• Total gain on this part is 25 to 35%
[Listing: side-by-side assembly for the formula, Cortex M3 (28-38 cycles) versus Cortex M4 (18-28 cycles). Annotations on the Cortex M4 column:]
• LDR R2,[R4, #+22]: 1 single 32-bit load replacing two 16-bit loads with
sign extension (this replacement appears twice). Gain: 2 cycles each
• SMLSD R5, R10, R6, R5: 1 SIMD instruction replacing two
multiply-accumulates. Gain: 3 cycles
• SMLSD R5, R0, R2: 1 SIMD instruction replacing two
multiply-accumulates. Gain: 3 cycles
ARM Cortex M4 in few words
Cortex-M processors
• Forget traditional 8/16/32-bit classifications
• Seamless architecture across all applications
• Every product optimised for ultra low power and ease of use
Cortex-M0: “8/16-bit” applications
Cortex-M3: “16/32-bit” applications
Cortex-M4: “32-bit/DSC” applications
Binary and tool compatible
Cortex-M processors binary compatible
ARM Cortex M4 Core
Cortex-M4 processor microarchitecture
• ARMv7ME Architecture
  • Thumb-2 Technology
  • DSP and SIMD extensions
  • Single cycle MAC (up to 32 x 32 + 64 -> 64)
  • Optional single precision FPU
  • Integrated configurable NVIC
  • Compatible with Cortex-M3
• Microarchitecture
  • 3-stage pipeline with branch speculation
  • 3x AHB-Lite bus interfaces
• Configurable for ultra low power
  • Deep Sleep mode, Wakeup Interrupt Controller
  • Power down features for Floating Point Unit
• Flexible configurations for wider applicability
  • Configurable interrupt controller (1-240 interrupts and priorities)
  • Optional Memory Protection Unit
  • Optional Debug & Trace
Cortex-M feature set comparison
Feature | Cortex-M0 | Cortex-M3 | Cortex-M4
Architecture version | v6M | v7M | v7ME
Instruction set architecture | Thumb, Thumb-2 system instructions | Thumb + Thumb-2 | Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz | 0.9 | 1.25 | 1.25
Bus interfaces | 1 | 3 | 3
Integrated NVIC | Yes | Yes | Yes
Number of interrupts | 1-32 + NMI | 1-240 + NMI | 1-240 + NMI
Interrupt priorities | 4 | 8-256 | 8-256
Breakpoints, watchpoints | 4/2/0, 2/1/0 | 8/4/0, 2/1/0 | 8/4/0, 2/1/0
Memory Protection Unit (MPU) | No | Yes (option) | Yes (option)
Integrated trace option (ETM) | No | Yes (option) | Yes (option)
Fault Robust Interface | No | Yes (option) | No
Single cycle multiply | Yes (option) | Yes | Yes
Hardware divide | No | Yes | Yes
WIC support | Yes | Yes | Yes
Bit banding support | No | Yes | Yes
Single cycle DSP/SIMD | No | No | Yes
Floating point hardware | No | No | Yes
Bus protocol | AHB Lite | AHB Lite, APB | AHB Lite, APB
CMSIS support | Yes | Yes | Yes
Cortex-M4 extended single cycle MAC
Operation | Instructions | CM3 (cycles) | CM4 (cycles)
16 x 16 = 32 | SMULBB, SMULBT, SMULTB, SMULTT | n/a | 1
16 x 16 + 32 = 32 | SMLABB, SMLABT, SMLATB, SMLATT | n/a | 1
16 x 16 + 64 = 64 | SMLALBB, SMLALBT, SMLALTB, SMLALTT | n/a | 1
16 x 32 = 32 | SMULWB, SMULWT | n/a | 1
(16 x 32) + 32 = 32 | SMLAWB, SMLAWT | n/a | 1
(16 x 16) ± (16 x 16) = 32 | SMUAD, SMUADX, SMUSD, SMUSDX | n/a | 1
(16 x 16) ± (16 x 16) + 32 = 32 | SMLAD, SMLADX, SMLSD, SMLSDX | n/a | 1
(16 x 16) ± (16 x 16) + 64 = 64 | SMLALD, SMLALDX, SMLSLD, SMLSLDX | n/a | 1
32 x 32 = 32 | MUL | 1 | 1
32 ± (32 x 32) = 32 | MLA, MLS | 2 | 1
32 x 32 = 64 | SMULL, UMULL | 5-7 | 1
(32 x 32) + 64 = 64 | SMLAL, UMLAL | 5-7 | 1
(32 x 32) + 32 + 32 = 64 | UMAAL | n/a | 1
32 ± (32 x 32) = 32 (upper) | SMMLA, SMMLAR, SMMLS, SMMLSR | n/a | 1
(32 x 32) = 32 (upper) | SMMUL, SMMULR | n/a | 1
All the above operations are single cycle on the Cortex-M4 processor
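Cortex-M4 DSP instructions compared
Cycle counts
Class | Instruction | Cortex-M3 | Cortex-M4
Arithmetic | ALU operation (not PC) | 1 | 1
Arithmetic | ALU operation to PC | 3 | 3
Arithmetic | CLZ | 1 | 1
Arithmetic | QADD, QDADD, QSUB, QDSUB | n/a | 1
Arithmetic | QADD8, QADD16, QSUB8, QSUB16 | n/a | 1
Arithmetic | QDADD, QDSUB | n/a | 1
Arithmetic | QASX, QSAX, SASX, SSAX | n/a | 1
Arithmetic | SHASX, SHSAX, UHASX, UHSAX | n/a | 1
Arithmetic | SADD8, SADD16, SSUB8, SSUB16 | n/a | 1
Arithmetic | SHADD8, SHADD16, SHSUB8, SHSUB16 | n/a | 1
Arithmetic | UQADD8, UQADD16, UQSUB8, UQSUB16 | n/a | 1
Arithmetic | UHADD8, UHADD16, UHSUB8, UHSUB16 | n/a | 1
Arithmetic | UADD8, UADD16, USUB8, USUB16 | n/a | 1
Arithmetic | UQASX, UQSAX, USAX, UASX | n/a | 1
Arithmetic | UXTAB, UXTAB16, UXTAH | n/a | 1
Arithmetic | USAD8, USADA8 | n/a | 1
Multiplication | MUL, MLA | 1-2 | 1
Multiplication | MULS, MLAS | 1-2 | 1
Multiplication | SMULL, UMULL, SMLAL, UMLAL | 5-7 | 1
Multiplication | SMULBB, SMULBT, SMULTB, SMULTT | n/a | 1
Multiplication | SMLABB, SMLABT, SMLATB, SMLATT | n/a | 1
Multiplication | SMULWB, SMULWT, SMLAWB, SMLAWT | n/a | 1
Multiplication | SMLALBB, SMLALBT, SMLALTB, SMLALTT | n/a | 1
Multiplication | SMLAD, SMLADX, SMLALD, SMLALDX | n/a | 1
Multiplication | SMLSD, SMLSDX | n/a | 1
Multiplication | SMLSLD, SMLSLDX | n/a | 1
Multiplication | SMMLA, SMMLAR, SMMLS, SMMLSR | n/a | 1
Multiplication | SMMUL, SMMULR | n/a | 1
Multiplication | SMUAD, SMUADX, SMUSD, SMUSDX | n/a | 1
Multiplication | UMAAL | n/a | 1
Division | SDIV, UDIV | 2-12 | 2-12
The multiplication rows are marked "Single cycle MAC" on the slide.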
Cortex-M4 non-DSP instructions
Cycle counts
Class | Instruction | Cortex-M3 | Cortex-M4
Load/Store | Load single byte to R0-R14 | 1-3 | 1-3
Load/Store | Load single halfword to R0-R14 | 1-3 | 1-3
Load/Store | Load single word to R0-R14 | 1-3 | 1-3
Load/Store | Load to PC | 5 | 5
Load/Store | Load double-word | 3 | 3
Load/Store | Store single word | 1-2 | 1-2
Load/Store | Store double word | 3 | 3
Load/Store | Load-multiple registers (not PC) | N+1 | N+1
Load/Store | Load-multiple registers plus PC | N+5 | N+5
Load/Store | Store-multiple registers | N+1 | N+1
Load/Store | Load/store exclusive | 2 | 2
Load/Store | SWP | n/a | n/a
Branch | B, BL, BX, BLX | 2-3 | 2-3
Branch | CBZ, CBNZ | 3 | 3
Branch | TBB, TBH | 5 | 5
Branch | IT | 0-1 | 0-1
Special | MRS | 1 | 1
Special | MSR | 1 | 1
Special | CPS | 1 | 1
Manipulation | BFI, BFC | 1 | 1
Manipulation | RBIT, REV, REV16, REVSH | 1 | 1
Manipulation | SBFX, UBFX | 1 | 1
Manipulation | UXTH, UXTB, SXTH, SXTB | 1 | 1
Manipulation | SSAT, USAT | 1 | 1
Manipulation | SEL | n/a | 1
Manipulation | SXTAB, SXTAB16, SXTAH | n/a | 1
Manipulation | UXTB16, SXTB16 | n/a | 1
Manipulation | SSAT16, USAT16 | n/a | 1
Manipulation | PKHTB, PKHBT | n/a | 1
16-bit DSP functions compared
Relative cycle counts for DSP tasks running on 16-bit data shown below
Smaller is better on the chart – Cortex-M4 is 30% to 70% better
32-bit DSP functions compared
Relative cycle counts for DSP tasks running on 32-bit data shown below
Smaller is better on the chart – Cortex-M4 is 25% to 60% better
DSP application example:
MP3 audio playback
MHz required for MP3 decode (smaller is better !)
DSP Concept
M4 Benefits
• 50% more performance than M3 for signal processing calculations
  • 25% better than ARM9E at equivalent frequency
• 50% better than M3 for audio (MP3 codec)
  • 5% better than ARM9E at equivalent frequency
• MMACS
  • 72 MHz → 72 MMACS (32 bits) or 144 MMACS (16 bits)
  • 150 MHz → 150 MMACS (32 bits) or 300 MMACS (16 bits)
• Floating point unit
  • Graphic acceleration: moves like rotations and so on...
  • Advanced algorithms: audio (voice recognition, pitch detection) or image
  processing
  • Direct Matlab interface: PC tools generate floating point code, directly
  portable to the FPU. A fixed-point device will require more care and
  adaptation.
DSP lib provided for free by ARM
• The benefits of software libraries for Cortex-M4
  • Enables end user to develop applications faster
  • Keeps end user abstracted from low-level programming
  • Benchmarking vehicle during system development
  • Clear competitive positioning against incumbent DSP/DSC offerings
  • Accelerate third party software development
• Keeping it easy to access for end user
  • Minimal entry barrier: very easy to access and use
  • One standard library: no duplicated efforts
  • ARM channels effort/resources with software partner
  • Value add through another level of software, e.g. filter config tools
DSP lib function list snapshot
• Basic math: vector mathematics
• Fast math: sin, cos, sqrt, etc.
• Interpolation: linear, bilinear
• Complex math
• Statistics: max, min, RMS, etc.
• Filtering: IIR, FIR, LMS, etc.
• Transforms: FFT (real and complex), cosine transform, etc.
• Matrix functions
• PID controller
• Support functions: copy/fill arrays, data type conversions, etc.
STM32 F4
vs.
STM32 F2
Differences in Core and System Architecture
Feature | STM32 F2 | STM32 F4
Core | ARM Cortex M3 (r2p0) | ARM Cortex M4F * (r0p1)
Floating point calculation | s/w | Single precision h/w
Performance with ART ON | “0ws like” performance thanks to ART Accelerator: 120 MHz: 1.65 V-3.6 V | “0ws like” performance thanks to ART Accelerator: 168 MHz: 2.1 V-3.6 V; 144 MHz: 1.8 V-2.1 V; 128 MHz: 1.7 V-1.8 V
SRAM internal capacity | 128KB of system memory | 192KB (128KB system memory + 64KB dedicated to CPU data)
Differences in Core and System Architecture
Feature | STM32 F2 | STM32 F4
Internal regulator bypass | Available only on WLCSP64 (IRR_OFF pin) and BGA176 (BYPASS_REG pin) packages. On WLCSP64 this functionality cannot be dissociated from BOR OFF | Available only on WLCSP64 and BGA176 (BYPASS_REG pin) packages. BOR OFF and internal regulator bypass are non-exclusive on the above packages
VDD min extension from 1.8 V down to 1.65 V on F2 or 1.7 V on F4 (requires BOR OFF) | Available only on WLCSP64 package (IRR_OFF pin). This functionality cannot be dissociated from regulator bypass | Available on all packages (PDR_ON pin) except the LQFP64 package. This functionality can be dissociated from regulator bypass
Voltage scaling (internal regulator output) | None | Performance optimization (150 MHz max); power optimization (120 MHz max)
Differences in Peripheral System Architecture
Feature | STM32 F2 | STM32 F4
FSMC (improvements) | Remap capability on bank1-NE1/NE2, but no capability to access other banks while remapped | Remap capability on bank1-NE1/NE2, with access to other FSMC banks while remapped
I2S | 2x I2S half duplex | 2x I2S full duplex
New RTC implementation
Feature | STM32 F2 | STM32 F4
Calendar sub-seconds access | NO | YES (resolution down to RTC clock)
Calendar resolution | From RTCCLK/2 to RTCCLK/2^20 | From RTCCLK/1 to RTCCLK/2^22
Calendar read and synchronization on the fly | NO | YES
Alarm on calendar | 2 alarms: Sec, Min, Hour, Date/day | 2 alarms: Sec, Min, Hour, Date/day, Sub seconds
New RTC implementation
Feature | STM32 F2 | STM32 F4
Calendar calibration | Calib window: 64 min. Calibration step: negative: -2 ppm; positive: +4 ppm. Range [-63 ppm, +126 ppm] | Calib window: 8 s/16 s/32 s. Calibration step, negative or positive: 3.81 ppm/1.91 ppm/0.95 ppm. Range [-480 ppm, +480 ppm]
Timestamp | YES: Sec, Min, Hour, Date | YES: Sec, Min, Hour, Date, Sub seconds
Tamper | YES (2 pins/1 event). Edge detection only | YES (2 pins/2 events). Level detection with configurable filtering
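The three STM32 F4 calibration steps follow from the window length if one assumes that the smallest correction is a single RTCCLK pulse per calibration window and that RTCCLK is the usual 32.768 kHz crystal. Both are assumptions made for this sketch, and the function name is hypothetical.

```c
/* Calibration step in ppm when one RTCCLK pulse is added or masked per
 * calibration window: 1e6 / (window_seconds * rtcclk_hz). */
double rtc_calib_step_ppm(unsigned window_s, double rtcclk_hz)
{
    return 1.0e6 / ((double)window_s * rtcclk_hz);
}
```

With a 32768 Hz clock this gives about 3.81 ppm for the 8 s window, 1.91 ppm for 16 s and 0.95 ppm for 32 s, which matches the steps listed in the table.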
Compatible board design for
LQFP100-144-176 and BGA 176 packages
• F2xx: RFU (reserved for future use) can be connected to
VDD/VSS/NC
• F4xx: PDR_ON can be connected to VDD or VSS (should be
connected to VDD to maintain compatibility with the STM32
family)
[Figure: RFU / PDR_ON pin with VDD and VSS connection options]
Compatible board design for
WLCSP64+2 package
• F2xx: IRR_OFF (Internal Reset and Regulator OFF pin) can be
connected to VDD/VSS. The BOR and the internal regulator are
switched OFF when IRR_OFF is set to VDD.
• F4xx: PDR_ON (BOR OFF pin). The BOR is switched OFF when
the PDR_ON pin is set to VSS. (The internal regulator is controlled
independently using the BYPASS_REG pin.)
[Figure: IRR_OFF / PDR_ON pin with VDD and VSS connection options]
Thank you
www.st.com/stm32f4
Glossary
• ART Accelerator™: ST’s adaptive real-time accelerator
• CMSIS: Cortex™ microcontroller software interface standard
• MCU: microcontroller unit
• DSC: digital signal controller
• DSP: digital signal processor
• FPU: floating point unit
• RTC: real-time clock
• MPU: memory protection unit
• FSMC: flexible static memory controller