5th FPGAworld CONFERENCE
Book
September 2008
EDITORS
Lennart Lindh, David Källberg, DTE - Santiago de Pablo and Vincent J. Mooney III
The FPGAworld Conference addresses aspects of digital and hardware/software system engineering
on FPGA technology. It is a discussion and network forum for students, researchers and engineers
working on industrial and research projects, state-of-the-art investigations, development and
applications.
The book contains all presentations; for more information see
www.fpgaworld.com/conference.
ISBN 978-91-976844-1-5
SPONSORS
Copying and reprinting for personal or classroom use are allowed with credit to
FPGAworld.com. For commercial or other for-profit/for-commercial-advantage uses, prior
permission must be obtained from FPGAworld.com.
Additional copies of 2008 or prior Proceedings may be found at www.FPGAworld.com or at
Jönköpings University library (www.jth.hj.se), ISBN 978-91-976844-1-5
2008 PROGRAM COMMITTEE
General Chair
Lennart Lindh, FPGAworld and Jönköping University, Sweden
Publicity Chair
David Källberg, FPGAworld, Sweden
Academic Programme Chair
Vincent J. Mooney III, Georgia Institute of Technology, USA
Academic Publicity Chair
Santiago de Pablo, University of Valladolid, Spain
Academic Programme Committee Members
Ketil Roed, Bergen University College, Norway
Lennart Lindh, Jönköping University, Sweden
Pramote Kuacharoen, National Institute of Development Administration, Thailand
Mohammed Yakoob Siyal, Nanyang Technological University, Singapore
Fumin Zhang, Georgia Institute of Technology, USA
Santiago de Pablo, University of Valladolid, Spain
Industrial Programme Committee Members
Solfrid Hasund, Bergen University College, Norway
Daniel Stackenäs, Altera, Sweden
Martin Olsson, Synective Labs, Sweden
Kim Petersén, HDC, Sweden
Mickael Unnebäck, ORSoC, Sweden
Stefan Sjöholm, Prevas, Sweden
Fredrik Lång, EBV, Sweden
Niclas Jansson, BitSim, Sweden
Ola Wall, Synplicity, Sweden
Torbjörn Söderlund, Xilinx, Sweden
Göran Bilski, Xilinx, Sweden
Anders Enggaard, Axcon, Denmark
Adam Edström, Elektroniktidningen, Sweden
Doug Amos, Synplicity, UK
Guido Schreiner, The MathWorks, Germany
Espen Tallaksen, Digitas, Norway
Göran Rosén, Actel, Sweden
Tommy Klevin, ÅF, Sweden
Stig Kalmo, Engineering College of Aarhus, Denmark
Tryggve Mathiesen, BitSim, Sweden
Hichem Belhadj, Actel, USA
Fredrik Kjellberg, Net Insight, Sweden
General Chairman’s Message
The FPGAworld program committee welcomes you to the 5th FPGAworld
conference. This year’s conference is held in Stockholm and Lund, Sweden. We
hope that the conference provides you with much more than you expected.
We will try to balance academic and industrial presentations, exhibits and
tutorials to provide a unique chance for our attendees to obtain knowledge from
different views. This year we have the strongest program in FPGAworld’s history.
Track A - Industrial
Track A features presentations with a focus on industrial applications. The
presenters were selected by the Industrial Programme Committee. Eight papers were
presented.
Track B - Academic
Track B features presentations with a focus on academic papers and industrial
applications. The presenters were selected by the Academic Programme
Committee. Due to the high quality bar, 5 out of the 17 papers submitted this year
were presented.
Track C - Product presentations
Track C features product presentations from our exhibitors and sponsors.
Track D - Altera Innovate Nordic
Track D is reserved for the Altera Innovate Nordic contest.
Three contestants are in the final.
Exhibitors FPGAworld'2008 Stockholm & Lund: 15 unique exhibitors.
The FPGAworld 2008 conference is bigger than the FPGAworld 2007 conference.
In total we are close to 300 participants (Stockholm and Lund).
All are welcome to submit industrial/academic papers, exhibits and tutorials to
the conference, from student, academic and industrial backgrounds alike.
Together we can make the FPGAworld conference exceed even our best
expectations!
Please check out the website (http://fpgaworld.com/conference/) for more
information about FPGAworld. In addition, you may contact David Källberg
([email protected]) for more information.
We would like to thank all of the authors for submitting their papers. We hope
that the attendees enjoyed the FPGAworld conference 2008, and we welcome you
to next year’s conference.
Lennart Lindh
Programme FPGAworld'2008 Lund (Ideon Beta in Lund, Sweden)
08:30 09:00
Registration
09:00 09:15
Conference opening
Lennart Lindh, FPGAworld
09:15 10:00
Challenges and Opportunities in the FPGA industry?
Gary Meyers, Synopsys
Key Note Session
10:00 10:30
Coffee Break
10:30 11:30
Exhibitors Presentation
11:30 12:30
Lunch Break
Sponsored by Actel
Session Chair
TBD
Session Chair
TBD
Session A1
Open Source within Hardware
Session A2
World's first mixed-signal
FPGA
12:30 14:30
Synplicity Business Group of Synopsys
Session C2
OVM introduction
Session A3
Verification - reducing costs
and increasing quality
Mentor Graphics
Session C3
Verification Management
Session A4
Mentor Graphics
Analog Netlist partitioning and
automatic generation of
schematic
Session C4
MAGIC - Next generation platform for Telecom
and Signal Processing
BitSim
14:30 15:00
Coffee Break
Session Chair
TBD
Session A5
Product Presentation
ORSoC
Session A6
15:00 16:30
Session C1
Prototyping Drives FPGA Tool
Flows
Drive on one chip
Session Chair
TBD
Session C5
Product Presentation
The Dini Group
Session A7
Session C6
Product Presentation
Actel
Standard architecture for
typical remote sensing micro
satellite payload
Session C7
Product Presentation
Nextreme: The industry's only zero mask-charge ASIC
Programme FPGAworld'2008 Stockholm (Electrum in Kista, Sweden)
08:30 09:00
Registration
09:00 09:15
Conference opening
Lennart Lindh, FPGAworld
09:15 10:00
Key Note Session
Making the FPGA Reconfigurable Platform Accessible to Domain Experts
Dr. Ivo Bolsens, CTO, Xilinx
10:00 10:30
Coffee Break
Sponsored by Synplicity
10:30 11:30
Exhibitors Presentations
11:30 12:30
Lunch Break
Sponsored by Mentor Graphics
Session Chair
Kim Petersén
HDC AB
Session A1
Open Source within Hardware
12:30 14:30
Session A2
Open and Flexible Network Hardware
Session A3
World's first mixed-signal FPGA
Session A4
Verification - reducing costs and
increasing quality
14:30 15:00
Session Chair
Tommy Klevin
ÅF
Session Chair
Johnny Öberg
Session C1
Product Presentation
Actel
Session B1
A Java-Based System for FPGA
Programming
Session C2
Product Presentation
The Dini Group
Session B2
Automated Design Approach for On-Chip
Multiprocessor Systems
Session B3
ASM++ Charts: an Intuitive Circuit
Representation Ranging from Low Level
RTL to SoC Design
Session C3
Product Presentation
ORSoC
Session D1
Altera
Innovate Nordic
Contest
Session C4
7Circuits - I/O Synthesis for FPGA Board
Design
Gateline
Coffee Break
Session Chair
TBD
Session Chair
Santiago de Pablo
Session Chair
TBD
Session C5
Prototyping Drives FPGA Tool Flows
Synplicity Business Group of
Synopsys
Session A5
Large scale real-time data acquisition and
signal processing in SARUS
15:00 16:30
Session A6
Drive on one chip
Session A7
Standard architecture for typical remote
sensing micro satellite payload
Session C6
OVM introduction
Mentor Graphics
Session B4
Space-Efficient FPGA-Implementations
of FFTs in High-Speed Applications
Session B5
The ABB NoC – a Deflective Routing
2x2 Mesh NoC targeted for Xilinx FPGAs
Session C7
Verification Management
Mentor Graphics
Session C8
MAGIC - Next generation
platform for Telecom and Signal
Processing
BitSim
16:30 17:00
Altera Innovate Nordic Prize draw
17:00 19:00
Exhibition & Snacks
Sponsored by Altera
Session D2
Altera
Innovate Nordic
Contest
FPGA World 2008
Exhibitors FPGAworld'2008 Stockholm & Lund
5-minute presentations with PowerPoint
Stockholm
Altera
BitSim
Arrow
Silica
Actel
Synplicity
The Mathworks
EBV Elektronik
ACAL Technology
The Dini Group
VSYSTEMS
Gateline
ORSoC
National Instruments
Lund
BitSim
Arrow
Silica
Actel
Synplicity
The Mathworks
EBV Elektronik
ACAL Technology
The Dini Group
National Instruments
NOTE Lund
Innovate Nordic 2008
Design Contest
Lena Engdahl
© 2008 Altera Corporation—Public
Altera, Stratix, Arria, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation
The finalists are:
GATEline Presentation
GATEline
Melek Mentes
GATEline Overview
Value-added reseller of eCAD and ePLM
products in the Nordic and Baltic markets
Established 1984
7 employees: 6 in Sweden and 1 in Norway
Offices in Stockholm and Oslo
EDAnova – representative in Finland
GATEline Represents
The Ultimate PCB Design Environment
FPGA I/O Synthesis
Schematic Design
Functional Simulation
PCB Design
Signal Integrity
Simulation
Component
Database
Design for Manufacturing
GATEline offers probably the best integrated PCB design flow on the market!
The DINI Group
Products and Roadmap
Mike Dini
[email protected]
September ‘08
Philosophy … Why are we here?
• We make big FPGA boards
• Fastest, biggest for the lowest cost
– Easy to use where important
– Less polish where not
• What you get:
– Working, easy to use, cutting edge, cost effective,
reference designs
– High performance in both speed and gate density
• What you don’t:
– Pretty GUI’s and other SW that drives up the cost
– The ‘soft-shoe’ on partitioning …
What We Provide vs. What You Need
• HW only
– Lots of reference stuff included
• Customer needs
– Simulation (verilog/VHDL)
– Partitioning (optional)
• Manual or third party solutions such as Auspy
– Synthesis
• Xilinx/Altera tools work fine
– Place/Route
• Comes from FPGA vendor: Xilinx/Altera
– Debug
• Chipscope, SignalTap, and other third party solutions
Overview of Product Line
• Goal: Provide customers a cost-effective
vehicle to use the biggest and fastest FPGA’s
– Xilinx
• Virtex-5
– Altera
• Stratix III
– Stratix IV when available
– We try to keep lead-times under 2 weeks.
• If not 2 weeks, issue is usually availability of FPGAs
Xilinx Virtex5
• Shipping with V5 since Jan ‘07:
– Standalone
• DN9000k10 – Bride of Monster 16 FPGA’s (LX330’s)
• DN7020k10 – Uncle of Monster 20 FPGA’s (S3/S4)
– PCI
• DN9000k10PCI – 6 FPGA’s (LX330’s)
• DN9200k10PCI – 2 FPGA’s (LX330’s)
• DN9002k10PCI – 2 FPGA’s (LX330/LX220/LX155/LX110)
– PCIe
• 8-lanes GEN1 with LX50T-2
• 4-lanes GEN1 with FX70T-2
• DN9000k10PCIe-8T – 6 FPGA’s (LX330’s)
• DN9200k10PCIe-8T – 2 FPGA’s (LX330’s)
• DN9002k10PCIe-8T – 2 FPGA’s (LX330/LX220/LX155/LX110)
Altera Stratix III (and IV) – 50M/100M gates and stackable
Daughter of Monster
• DN9000k10
– Bride of MonsterTM
– 16 Virtex-5 LX330’s
– Expected to start shipping in Dec ’07
– 32M ASIC gates (measured the real way …)
– 6 DDR2 SODIMM sockets (250MHz)
– 450MHz LVDS chip-to-chip interconnect
DN9000k10PCIe-8t
DN9000k10PCI
• 6 Virtex5 LX330
– Oversize PCI circuit board
• 66MHz/64-bit
• Stand-alone operation with ATX power supply
– ~12 million USABLE ASIC gates
• REAL ASIC gates! No exaggeration!
– Any subset of FPGA’s can be stuffed to reduce cost.
• 6, DDR2 SODIMM SDRAM sockets
• Custom DDR2-compatible cards for FLASH, SSRAM,
RLDRAM, mictors, and others
• FPGA to FPGA interconnect LVDS (or single-ended)
• LVDS: 450MHz
• 10x ISERDES/OSERDES tested and verified
• 160-pin Main bus connects all FPGA’s
Block Diagram – DN9000k10
DINI_selection_guide_v700.xls
[Table: FPGA selection guide comparing Xilinx families (Virtex-5 LX/LXT/SXT/FXT, Virtex-4 LX/FX, Virtex-II Pro) and Altera families (Stratix IV, Stratix III, Stratix II/II GX) by LUT count and LUT size (4- or 6-input), flip-flops, gate estimates at 100% and 60% utilization, memory blocks and total memory (kbits/kbytes), multipliers (18x18 / 25x18), speed grades (slowest to fastest), and maximum I/Os.]
© 2008 The MathWorks, Inc.
Overview of The MathWorks
Jan Hedman
Sr. Application Engineer
Signal Processing and Communications
The MathWorks at a Glance
Headquarters:
Natick, Massachusetts US
US:
California, Michigan,
Washington DC, Texas
Europe:
UK, France, Germany,
Switzerland, Italy, Spain,
the Netherlands, Sweden
Asia-Pacific:
China, Korea, Australia
Worldwide training
and consulting
Distributors in 25 countries
Earth’s topography on an equidistant cylindrical projection,
created with MATLAB® and Mapping Toolbox™.
Core MathWorks Products
The leading environment for technical
computing
Numeric computation
Data analysis and visualization
The de facto industry-standard,
high-level programming language
for algorithm development
Toolboxes for signal and image
processing, statistics, optimization,
symbolic math, and other areas
Foundation of MathWorks products
Core MathWorks Products
The leading environment for modeling,
simulating, and implementing
communications systems and
semiconductors
Foundation for Model-Based Design
Digital, analog, and mixed-signal systems, with
floating- and fixed-point support
Algorithm development, system-level design,
implementation, and test and verification
Optimized code generation for FPGAs and
DSPs
Blocksets for signal processing,
communications, video and image processing,
and RF
Open architecture with links to third-party
modeling tools, IDEs, and test systems
Model-Based Design Workflow
Requirements
Research
Design
Data Analysis
Environment
Algorithm
Development
Physical Components
Data Modeling
Algorithms
Embedded
Software
Digital
Electronics
C, C++
VHDL, Verilog
MCU
DSP
FPGA
ASIC
Test
Environments
Integration
Implement
Continuous
V&V
SILICA @ FPGA World 2008
- Lund
SILICA EMEA
Silica an Avnet Company
550 employees (450 in the sales and engineering team)
23 franchises
Local sales organisations with a centralized backbone for logistics
Excellent portfolio of value-added services and supply chain
solutions.
SILICA I The Engineers of Distribution.
KEY TECHNOLOGY
Power
Management
Microcontroller /
DSP
Commodities
Discrete I Logic
Analog I Memory
Analog
(Signal Chain)
Programmable
Logic
SILICA LINECARD
Avnet Design Services
Xilinx® Virtex™-5 FXT Evaluation Kit
• XC5VFX30T-1FFG665
• PowerPC™ 440
• 10/100/1000 Ethernet PHY
• $395
Xilinx® Spartan™-3A Evaluation Kit
• XC3S400A-4FTG256C
• General FPGA prototyping
• Cypress® PSoC evaluation (CapSense)
• $39
Actel Offering Overview
Hichem Belhadj
Actel Fellow
Actel Technology
Flash and Antifuse
Non-volatile Reprogrammable FPGAs
Flash (floating gate) technology
Non-Volatile OTP (One Time Programmable) FPGAs
ALU – Actel
ONO anti-fuse technology
M2M anti-fuse technology
© 2008 Actel Corporation. Confidential and Proprietary
Aug 19th 2008
Actel Technology Advantages
Power advantages
Inherently more secure than any other solution
Actel delivers a significant reliability advantage
All Actel devices function as soon as power is
applied to the board
Single-chip offerings provide total cost
advantage over competition
Actel’s Silicon
Value-based Low Power FPGA
Ultra-low power
Very high volume
Sub-$1.00 market
Power and System Management
System developers needing
integrated functionality on single
chip
System Critical
Where failure and tampering are
not options
Actel Solutions
Industry’s Most Comprehensive
Power Management Portfolio
Later Today You are Invited
Low Power Solutions 12:30
Mixed-Signal FPGA and *TCA Solutions 1:30
Mentor Graphics
Complete FPGA Design Flow
Håkan Pettersson
Sr Applications Engineer
[email protected]
Mentor FPGA Design Solutions
Concept to Manufacturing
Embedded
Development
System
Design
PCB
Design
C-Based
Synthesis
RTL Reuse
& Creation
Verification
FPGA
Synthesis
Copyright 2007 Mentor Graphics, All Rights Reserved
FPGA World November 2006
Copyright ©1999-2005, Mentor Graphics.
Mentor FPGA Design Solutions
Concept to Manufacturing
Nucleus OS
& EDGE Developer
HDL Designer
Vista &
Visual Elite
I/O Designer
& HyperLynx
CatapultC
Questa, Seamless & 0-In
Precision &
FormalPro
Mentor @ FPGA World
Open Verification Methodology – An
Overview
Verification Management
EBV Presentation FPGA
World 2008
September 2007
EBV Elektronik - The Full Solution Provider
EBV added values:
In-depth design expertise
Application know-how
Full logistics solutions
EBV Franchise Partner
EBV – The Technical Specialist
130 pan-European Field Application Engineers
– 13% of EBV’s total workforce! –
provide extensive application expertise and
design know-how.
2 weeks of internal FAE training per year
by the product specialists of EBV’s
manufacturers (FSEs also attend)
Technologies are chosen from EBV!
2 weeks of additional training at our suppliers
EBV FAE Team
Every FAE receives at least 20 workings days of training per year!
EBV Reference Designs
Different suppliers combined to one solution!
Latest products, technologies & software
Customer requirements drive the applications
"Ready-to-use" solutions
Saves customers costs & development time
Reduces time-to-market
FalconEye Development Board
Will be presented during the drive on one chip session
9/17/2008
Open Source - gives the
competitive edge
ORSoC makes SoC development
easy, accessible and cost-efficient
for all companies, regardless of size or financial strength.
Customer product
USB - Debugger
Development boards
9/17/2008
Floppy-disk
replacement
Designed and
developed by
ORSoC
Customer product
USB - Debugger
Owned and sold by
Swerob
Development boards
Open Source - gives the
competitive edge
ORSoC makes SoC development
easy, accessible and cost-efficient
for all companies, regardless of size or financial strength.
ORSoC makes it easy
9/17/2008
OpenCores
reach millions of engineers
www.opencores.org
OpenCores is owned and maintained by ORSoC.
OpenCores
Facts
OpenCores is the number one site in the world for open source hardware IPs
• ~540 projects (different IP blocks)
• ~1 000 000 page views every month
• ~70 000 visitors every month
• 6:48 (min:sec) average time on the website
• 14GB of data downloaded every month (IP source code)
9/17/2008
Welcome to Synopsys
May 20th, 2008
FPGAWorld 2008
Welcome to the
Synplicity Business Group
of Synopsys
Since May 15th, 2008
9/17/2008
The Message is . . .
“The acquisition by Synopsys allows us to scale
our FPGA and rapid prototyping business to help
more designers successfully solve increasingly
complex problems”
- Gary Meyers
General Manager, Synplicity Business Group
“The combination will support our strategy to
provide rapid prototyping capabilities and will
enhance Synplicity’s already strong offering in
the FPGA implementation market.”
- Aart de Geus
CEO and Founder, Synopsys
Synplicity Business Group Products
Confirma™
ASIC / ASSP
Verification
Platform
FPGA
Implementation
Solutions
ESL
Synthesis
Synplify Premier
Single-FPGA Prototyping Environment
Certify
Multi-FPGA Prototyping Environment
Identify Pro
Full Visibility Functional Verification
HAPS
High-performance ASIC Prototyping System
Synplify Premier
The Ultimate in FPGA Implementation
Synplify Pro
The Industry Leader in FPGA Synthesis
Identify
Powerful RTL Debug
Synplify DSP
DSP Synthesis for FPGA Designers
Synplify DSP
ASIC Edition
DSP Synthesis for ASIC Designers
9/17/2008
FPGA World September 2008
© 2008 Actel Corporation
September 2008
Key Market Segments
Value-based
FPGA
Ultra-low power
High volumes
Sub-$10 market
Power
and System
Management
Needs integrated
functionality on
single chip
System
Critical
Failure and tampering
are not options
9/17/2008
Power: Actel Technical Advantage
Competitive SRAM cell (bit line / word line / Vdd):
substantial leakage per cell, millions of configuration cells, high static current.
Actel’s flash cell (bit line):
negligible leakage per cell, millions of configuration cells, ultra-low static current.
Actel’s System Management Solutions
High-end,
standards-based system
management specifications
Fusion-based µTCA reference designs
Power Module and Advanced Mezzanine Card
Fusion-based ATCA reference designs
Low-cost
system management for typical
embedded design
Robust reference design leverages Fusion and
CoreABC
9/17/2008
We will show:
A presentation of Fusion as a solution for several functions in a
µTCA chassis, plus a demo.
At 13:00 in room A
A presentation and demo of Igloo, showing the difference in
power consumption between flash- and SRAM-based
FPGAs
At 15:30 in room C
See you there!
a leading Design and Service house in Sweden
Application specialists
Graphics, Imaging and Digital Video
Advanced Microelectronics
FPGA, Board, DSP, ASIC & System-on-Chip, Analog & SW
Offices
Head office in Stockholm with regional offices in Lund,
Uppsala, Växjö and Gothenburg.
~60 employees
On average 10+ years in electronic design
Digital Signal Processing
Wireless LAN, Image Processing, Mobile telecom & Radar
Digital Video & Graphics
Processing, 2D acceleration, Displays, MPEG/JPEG
ASIC/System on Chip & SOPC (large FPGAs)
Analysis, tools, IP blocks, DFT issues, advanced verification
High Speed Design (PCB & FPGA)
Layout issues, tools, protocols and Interfaces, >3Gbps
SPEED UP YOUR BUSINESS
NOTE Lab
NOTE
NOTE is one of the leading EMS companies with more than 1100 people all over
the world. Everything we do is designed to make your company more successful
by developing electronics from design to after-sales services in close
cooperation with you.
WE OFFER
• A site close to you
• Design and test resources
• Industrialisation
• NOTEfied for selection of the right components
• NOTE LAB for fast prototypes
• Competitive component sourcing
• Serial production including box build
• After-sales services
NOTE Lab
• Specialists in prototyping and other customized production
• Fast prototype production
– experienced component engineers and purchasing personnel
– prototype modifications while you wait
– advanced prototype delivery in days
– feedback based on customer needs
– seamless transfer to serial production
• Box build in small volumes
NOTEfied
NOTEfied for closer control of development and reliable production
It is essential to select the right component and NOTEfied supports with:
• NOTE unique component article numbers
• URL to data sheet
• Manufacturer’s part number
• Lead time
• Quality classification
• RoHS
• Life cycle status
• Symbols
• Footprint
• Production recommendations
Let us help you!
We can help you launch your product faster, and that can
be the difference between winning and losing.
If you want more information, please visit www.note.se or contact us
in Lund on 046 – 286 92 00. If you have your business somewhere
else in Sweden, you can find a NOTE site near you on our home page.
We look forward to hearing from you!
Track B - Academic
Track B features presentations with a focus on academic papers and industrial
applications. The presenters were selected by the Academic Programme
Committee. Due to the high quality bar, 5 out of the 17 papers submitted this year
were presented.
Papers
Session B1
A Java-Based System for FPGA Programming
Session B2
Automated Design Approach for On-Chip
Multiprocessor Systems
Session B3
ASM++ Charts: an Intuitive Circuit Representation
Ranging from Low Level RTL to SoC Design
Session B4
Space-Efficient FPGA-Implementations of FFTs in
High-Speed Applications
Session B5
The ABB NoC – a Deflective Routing 2x2 Mesh
NoC targeted for Xilinx FPGAs
A Java-Based System for FPGA Programming
Jacob A. Bower, James Huggett, Oliver Pell and Michael J. Flynn
Maxeler Technologies
{jacob, jhuggett, oliver, flynn}@maxeler.com
Abstract
Photon is a Java-based tool for programming FPGAs.
Our objective is to bridge the gap between the ever increasing sizes of FPGAs and the tools used to program
them. Photon’s primary goal is to allow rapid development of FPGA hardware. In this paper we present Photon
by discussing both Photon’s abstract programming model
which separates computation and data I/O, and by giving
an overview of the compiler’s internal operation, including a flexible plug-and-play optimization system. We show
that designs created with Photon always lead to deeply
pipelined hardware implementations, and present a case
study showing how a floating-point convolution filter design can be created and automatically optimized. Our final
design runs at 250MHz on a Xilinx Virtex-5 FPGA and has
a data processing rate of 1 gigabyte per second.
1. Introduction
Traditional HDLs such as VHDL or Verilog incur major
development overheads when implementing circuits, particularly for FPGAs, which should support fast design cycles
compared to ASIC development. While tools such as C-to-gates compilers can help, existing software often cannot
be automatically transformed into high-performance FPGA
designs without major re-factoring.
In order to bridge the FPGA programming gap we propose a tool called Photon. Our goal with Photon is to simplify programming FPGAs with high-performance data-centric designs.
Currently the main features of Photon can be summarized as follows:
• Development of designs using a high-level approach
combining Java and an integrated expression parser.
• Designs can include an arbitrary mix of fixed and
floating point arithmetic with varied precision.
• Plug-and-play optimizations enabling design tuning
without disturbing algorithmic code.
• VHDL generation to enable optimizations via conventional synthesis tools.
• Automation and management of bitstream generation
for FPGAs, such as invoking FPGA vendor synthesis
tools and simulators.
The remainder of this paper is divided up as follows:
In Section 2, we compare Photon and other tools for creating FPGA designs. In Section 3 we describe Photon’s programming model which ensures designs often lead to high-performing FPGA implementations. In Sections 4 and 5
we give an overview of how Photon works internally and
present a case study. Finally, in Section 6 we summarize
our work and present our conclusions on Photon so far.
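To make the case study concrete before the tool itself is described, the computation behind a convolution filter can be stated in a few lines of ordinary Java. This is plain software, not Photon code (Photon's actual API is described later in the paper); the inner multiply-accumulate loop is the part a tool like Photon would turn into a deep floating-point pipeline:

```java
public class Convolution {
    // 3-tap convolution (FIR) over a sample stream: each output is a
    // weighted sum of the current and two previous samples. In hardware
    // this maps to a shift register feeding multiply-accumulate stages.
    static float[] convolve3(float[] x, float[] taps) {
        float[] y = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            float acc = 0f;
            for (int t = 0; t < taps.length; t++) {
                int j = i - t;
                if (j >= 0) acc += taps[t] * x[j]; // skip samples before the stream start
            }
            y[i] = acc;
        }
        return y;
    }

    public static void main(String[] args) {
        float[] y = convolve3(new float[]{1, 2, 3, 4},
                              new float[]{0.5f, 0.25f, 0.25f});
        for (float v : y) System.out.println(v); // prints 0.5, 1.25, 2.25, 3.25
    }
}
```

Because each output needs the same fixed set of multiplies and adds, the loop body maps onto a fixed pipeline consuming one sample per cycle; one 4-byte sample per cycle at 250MHz works out to the 1 gigabyte per second quoted in the abstract.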
2. Comparisons to Other Work
In Table 1 we compare tools for creating FPGA designs
using the following metrics:
• Design input – Programming language used to create
designs.
• High level optimizations – Automatic inference and
optimizing computation hardware, simplification of
arithmetic expressions etc.
• Low level optimizations – Boolean expression minimization, state-machine optimizations, eliminating
unused hardware etc.
• Floating-point support – Whether the tool has intrinsic
support for floating-point and IEEE compliance.
• Meta-programmability – Ability to statically meta-program, with weaker features being conditional compilation and variable bit-widths, and stronger features
such as higher-order design generation.
VHDL and Verilog use a traditional combination of
structural constructs and RTL to specify designs. These
tools typically require a high development effort. Such conventional tools typically have no direct support for floating
Tool       Design input  Floating-point  High-level opt.  Low-level opt.  Meta-programmability  Build automation
Photon     Java          IEEE            Yes              Via VHDL        Strong                Yes
Impulse-C  C             IEEE            Yes              Via HDL         Weak                  Yes
Handel-C   C             No              Yes              Yes             Medium                Limited
Verilog    Verilog       No              No               Yes             Medium                No
VHDL       VHDL          No              No               Yes             Medium                No
PamDC      C++           No              No               No              Strong                No
JHDL       Java          No              No               No              Strong                No
YAHDL      Ruby          No              No               No              Strong                Yes
ASC        C++           Yes             No               No              Strong                No
Table 1. Comparison of tools for creating FPGA designs from software code.
point arithmetic and therefore require external IP. Meta-programmability (e.g. generics in VHDL) is fairly inflexible [1]. The advantage of VHDL and Verilog is that they
give the developer control over every aspect of the microarchitecture, providing the highest potential for an optimal
design. Additionally, synthesis technology is relatively mature and the low-level optimizations can be very effective.
Other tools often produce VHDL or Verilog to leverage the
low-level optimizers present in the conventional synthesis
tool-chain.
Impulse-C [2] and Handel-C [3] are examples of C-to-gates tools aiming to enable hardware designs using languages resembling C. The advantage of this approach is that
existing software code can form a basis for generating hardware, with features such as ROMs, RAMs and floating-point units automatically inferred. However, software code
will typically require modifications to support a particular
C-to-gates compiler’s programming model: for example,
explicitly specifying parallelism, guiding resource mapping, and eliminating features such as recursive function
calls. The disadvantage of C-to-gates compilers is that the
level of modification or guidance required of a developer
may be large, as in general it is not possible to infer a high-performance FPGA design from a C program. This arises
because C programs are generally designed without parallelism
in mind and are highly sequential in nature. Also, meta-programmability is often limited to the C pre-processor, as
there is no other way to distinguish between static and dynamic program control in C.
PamDC [4], JHDL [5], YAHDL [1] and ASC [6] are
examples of Domain Specific Embedded Languages [7]
(DSELs) in which regular software code is used to implement circuit designs. With this approach all functionality
to produce hardware is encapsulated in software libraries
with no need for a special compiler. These systems are a
purely meta-programmed approach to generating hardware
with the result of executing a program being a net-list or
HDL for synthesis. Of these systems, PamDC, JHDL and
YAHDL all provide similar functions for creating hardware
structurally in C++, Java and Ruby respectively. YAHDL
and PamDC both take advantage of operator overloading
to keep designs concise, whereas JHDL designs are often
more verbose. YAHDL also provides functions for automating build processes and integrating with existing IP
and external IP generating tools. ASC is a system built
on top of PamDC and uses operator overloading to specify arithmetic computation cores with floating-point operations.
Photon is also implemented as a DSEL in Java. Photon's underlying hardware generation and build system is based on YAHDL, rewritten in Java to improve robustness. Unlike JHDL, Photon minimizes verbosity by using an integrated expression parser which can be invoked from regular Java code. Photon also provides a pluggable optimization system, unlike the other DSELs presented, which generate hardware in a purely syntax-directed fashion.
3. Photon Programming Model
Our goal with Photon is to bridge the gap between growing FPGA sizes and the programming effort required to accelerate software applications. In this section we discuss the programming model employed by Photon, which enables high-performance FPGA designs.
FPGA designs with the highest performance are generally those which implement deep, hazard-free pipelines. However, software code written without parallelism in mind tends to have loops with dependencies which cannot directly be translated into hazard-free pipelines. As such, software algorithm implementations often need to be re-factored to be amenable to a high-performance FPGA implementation. Photon's programming model is built around making it easy to implement
suitably re-factored algorithms.
When developing our programming model for Photon,
we observe that dense computation often involves a single arithmetic kernel nested in one or more long-running loops. Typically, dense computation arises from repeating the main kernel, either because the loop passes over a very large set of data or because a small set of data is iterated over repeatedly. Examples of these two cases include
convolution, in which a DSP operation is performed on a
large set of data [8], and Monte-Carlo simulations repeatedly running random-walks in financial computations [9].
In the Photon programming model we implement applications following this loop-nested kernel pattern by dividing the FPGA implementation into two separate design
problems:
for i in 0 to N do
  if i > 1 and i < N-1 then
    dout[i] = (din[i] + din[i-1] + din[i+1]) / 3
  else
    dout[i] = din[i]
  end
end

Listing 1. Pseudo-code for a 1D averaging filter.
1. Creating an arithmetic data-path for the computation
kernel.
2. Orchestrating the data I/O for this kernel.
Thus, we turn organizing data I/O for the kernel into a problem that can be tackled separately from data-path compilation. This leaves us with an arithmetic kernel
which does not contain any loop structures and hence can
be implemented as a loop-dependency free pipeline.
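As an illustration of this split (plain Java with invented names; Photon itself builds a hardware DAG rather than executing the kernel in software), the kernel becomes a pure, loop-free function while a separate driver performs all data I/O:

```java
// Sketch of the kernel/data-I/O split for the 1D averaging filter of
// Listing 1. The kernel is pure and loop-free, so it corresponds to a
// hazard-free pipeline; the driver performs all data orchestration.
public class AveragingSplit {
    // Loop-free arithmetic kernel: one output per invocation.
    static double kernel(double prev, double cur, double next, boolean inner) {
        double avg = (prev + cur + next) / 3.0;
        return inner ? avg : cur; // mux between average and pass-through
    }

    // Data I/O orchestration: feeds the kernel a sliding window.
    static double[] run(double[] din) {
        int n = din.length;
        double[] dout = new double[n];
        for (int i = 0; i < n; i++) {
            boolean inner = i > 1 && i < n - 1;     // edge test from Listing 1
            double prev = inner ? din[i - 1] : 0.0; // streamShift(-1)
            double next = inner ? din[i + 1] : 0.0; // streamShift(+1)
            dout[i] = kernel(prev, din[i], next, inner);
        }
        return dout;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(new double[]{3, 3, 3, 6, 3, 3})));
    }
}
```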
In Photon we assume the Data I/O problem is solved by
Photon-external logic. Based on this assumption, Photon
designs are implemented as directed acyclic graphs (DAGs)
of computation. The acyclic nature of these graphs ensures
a design can always be compiled to a loop-dependency free
pipeline.
Within a Photon DAG there are broadly five classes of
node:
• I/O nodes – Through which data flows into and out of
the kernel under the control of external logic.
• Value nodes – Nodes which produce a constant value
during computation. Values may be hard-coded or set
via an I/O side-channel when computation is not running.
• Computation nodes – Operations including arithmetic (+, ÷, ...), bit-wise (&, or, ...), type-casts, etc.
• Control nodes – Flow-control and stateful elements,
e.g.: muxes, counters, accumulators etc.
• Stream shifts – Pseudo-operations used to infer buffering, simulating access to data ahead of or behind the current in-flow of data.
To illustrate Photon’s usage and graph elements, consider the pseudo-code program in Listing 1. This program
implements a simple 1D averaging filter passing over data
in an array din with output to array dout. The data I/O
for this example is trivial: data in the array din should be
passed linearly into a kernel implementing the average filter which outputs linearly into an array dout.
Figure 1. Photon DAG for 1D averaging.
Figure 1 shows a Photon DAG implementing the averaging kernel from Listing 1. Exploring this graph from the top down: data flows into the graph through the din input node,
from here data either goes into logic implementing an averaging computation or to a mux. The mux selects whether
the current input data point should skip the averaging operation and go straight to the output as should be the case
at the edges of the input data. The mux is controlled by
logic which determines whether we are at the edges of the
stream. The edge of the stream is detected using a combination of predicate operators (<, >, &) and a counter
which increases once for each item of data which enters
the stream. The constant input N − 1 to the < comparator
can be implemented as a simple constant value, meaning
the size of data which can be processed is fixed at compilation time. On the other hand, the constant input can
be implemented as a more advanced value-node that can
be modified via a side-channel before computation begins,
thus allowing data-streams of any size to be processed. The
logic that performs the averaging computation contains a
number of arithmetic operators, a constant and two streamshifts. The stream-shift operators cause data to be buffered
such that it arrives at the addition operator one data-point
behind (−1) or one data-point ahead (+1) of the unshifted
data which comes directly from din.

Figure 2. Scheduled DAG for 1D average filter.

To implement our 1D averaging Photon DAG in hardware, the design undergoes further processing. Figure 2 illustrates the result of Photon processing our original DAG. In this processed DAG, buffering implements the stream-shifting operators and ensures data input streams to DAG nodes are aligned. Clock-enable logic has also been added for data-alignment purposes.

With this newly processed DAG, data arriving at din produces a result at dout after a fixed latency. This is achieved by ensuring that data inputs to all nodes are aligned with respect to each other. For example, the mux before dout has three inputs: the select logic, din and the averaging logic. Without the buffering and clock-enable logic, data from din would arrive at the left input to the mux before the averaging logic has computed a result. To compensate, buffering is inserted on the left input to balance out the delay through the averaging logic. For the mux-select input a clock-enable is used to make sure the counter is started at the correct time.

After Photon processes a DAG by inserting buffering and clock-enable logic, the DAG can be turned into a structural hardware design. This process involves mapping all the nodes in the graph to pre-made fully-pipelined implementations of the represented operations and connecting the nodes together. As the design is composed of a series of fully-pipelined cores, the overall core is inherently also fully-pipelined. This means Photon cores typically offer a high degree of parallelism with good potential for achieving a high clock-speed in an FPGA implementation.

class AddMul extends PhotonDesign {
  AddMul(BuildManager bm) {
    super(bm, "AddMulExample");
    Var a = input("a", hwFloat(8, 24));
    Var b = input("b", hwFloat(8, 24));
    Var c = input("c", hwFloat(8, 24));
    Var d = output("d", hwFloat(8, 24));
    d.connect(mul(add(a, b), c));
  }
}

Listing 2. Photon floating-point add/mul design.

4. Implementation of Photon

In this section we give an overview of Photon's concrete implementation. Of particular interest in Photon is the mechanism by which designs are specified as Java programs, which is covered first in Section 4.1. We then discuss Photon's compilation and hardware generation process in Section 4.2.
4.1. Design Input
Photon is effectively a Java software library and as such,
Photon designs are created by writing Java programs. Executing a program using the Photon library results in either
the execution of simulation software for testing a design or
an FPGA configuration programming file being generated.
When using the Photon library a new design is created by extending the PhotonDesign class which acts
as the main library entry point. This class contains methods which wrap around the creation and inter-connection of
standard Photon nodes forming a DAG in memory which
Photon later uses to produce hardware. New nodes for custom hardware units, e.g. a fused multiply-accumulate unit,
can also be created by users of Photon.
Listing 2 shows an example Photon program. When executed this program creates a hardware design which takes
three floating-point numbers a, b and c as inputs, adds a
and b together and multiplies the result by c to produce a
single floating-point output d. Method calls in the code
specify a DAG which has six nodes: three inputs, an output, a multiplier and an adder. These nodes are created by
calls to the input, output, mul and add methods respectively. The input and output methods take a string parameter to specify names of I/Os for use by external logic
and for performing data I/O. Another parameter specifies
the I/O type. For the example in this paper, we use IEEE
single precision floating-point numbers. The floating point
type is declared using a call to hwFloat which makes a
floating-point type object with an 8-bit exponent and a 24-bit mantissa, following the IEEE specification. We can also create floating-point types with other precisions, as well as fixed-point and integer types. Types used at I/Os propagate through the DAG and hence define the types of operator nodes. Casting functions can be used to convert and constrain types further within the design.

// Create I/Os
input("din", hwFloat(8, 24));
output("dout", hwFloat(8, 24));
// Average computation
eval("prev_din = streamShift(-1, din)");
eval("next_din = streamShift(1, din)");
eval("avg = (prev_din + din + next_din) / 3");
// 8-bit counter
eval("count = simpleCounter(8, 255)");
// Select logic with N hard-coded to 10
eval("sel = (count > 1) & (count < 10)");
// Mux connected to output
eval("dout <- sel ? avg : din");

Listing 3. 1D averaging design implemented using Photon expressions.
One drawback of using Java method calls to create a
DAG is verbosity, making it hard to read the code or relate lines back to the original specification. To resolve
the function-call verbosity, the Photon library provides a mechanism for expressing computation using a simple expression-based language. Statements in this integrated expression language can be written as regular Java strings passed to an eval method. The eval method uses the statements in the string to call the appropriate methods to extend the DAG.
To demonstrate our eval expressions, Listing 3 shows
how our 1D averaging example from Figure 1 is implemented in Photon using eval calls.
4.2. Compilation and Hardware Generation
In addition to using Java for design specification, Photon
also implements the compilation and hardware generation
process entirely in Java. Photon’s design management features cover optimization of Photon designs, generation of
VHDL code, and calling external programs such as synthesis, simulation, IP generation, and place-and-route.
After a Photon design is fully specified, Photon turns the
specified DAG into a design which can be implemented in
hardware. Photon achieves this primarily by executing a
number of “graph-passes”. A graph-pass is a piece of Java
code which visits every node in the DAG in topological
order. Typically, Photon passes transform the graph by
adding and/or deleting nodes, for example to implement
optimizations. Photon has a default set of optimization
passes which are used for all designs but users may also
develop their own, for example to detect application specific combinations of nodes in a graph and mutate them to
improve the hardware implementation.
Of the default graph passes the most important are those
which align the data stream inputs to nodes, inserting
buffering or clock-enable logic as illustrated in the difference between Figure 1 and Figure 2. We refer to this process as ‘scheduling’ the graph.
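Conceptually, a graph-pass visits producers before consumers. The following minimal sketch (invented names, not Photon's actual API) shows the pattern:

```java
import java.util.*;

// Minimal sketch of a "graph-pass": user code visits every node of a
// DAG in topological order (all producers before their consumers) and
// may inspect or rewrite the graph. Names are illustrative only.
public class GraphPassSketch {
    static class Node {
        final String op;
        final List<Node> inputs = new ArrayList<>();
        Node(String op) { this.op = op; }
    }

    interface GraphPass { void visit(Node n); }

    // Depth-first post-order over inputs yields a topological order.
    static void run(Node n, Set<Node> seen, GraphPass pass) {
        if (!seen.add(n)) return;
        for (Node in : n.inputs) run(in, seen, pass);
        pass.visit(n); // every producer of n has already been visited
    }

    // Example DAG: output <- add <- {input, input}
    static List<String> demo() {
        Node a = new Node("input"), b = new Node("input");
        Node add = new Node("add");
        add.inputs.addAll(List.of(a, b));
        Node out = new Node("output");
        out.inputs.add(add);
        List<String> visited = new ArrayList<>();
        run(out, new HashSet<>(), node -> visited.add(node.op));
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [input, input, add, output]
    }
}
```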
We perform scheduling using two graph-passes. The
first scheduling pass traverses the graph passively, collecting data about the latency (pipeline-depth) of each node in
the graph. We then determine an offset in our core pipeline
at which each node should be placed in order to ensure that
data for all its inputs arrives in synchrony. After all the offsets in a schedule are generated a second pass applies these
offsets by inserting buffering to align node inputs.
Sub-optimal offsets cause unnecessary extra buffering
to be inserted into the graph, wasting precious BlockRAM
and shift-register resources. To combat this inefficiency
we calculate a schedule for the offsets using Integer Linear Programming (ILP). Our ILP formulation ensures all
nodes are aligned such that their input data arrives at the
same time while minimising the total number of bits used in
buffering. Thus, Photon’s scheduled designs always have
optimal buffering.
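The offsets themselves can be illustrated with a simplified ASAP (longest-path) calculation; this is a sketch only, since Photon's actual ILP formulation additionally minimises the total number of buffered bits:

```java
import java.util.*;

// Simplified view of scheduling: each node has a latency, and a node's
// pipeline offset is the maximum over its inputs of (input offset +
// input latency); buffering makes up the slack on the other inputs.
public class ScheduleSketch {
    // latency[i] = pipeline depth of node i; inputs[i] = producer indices.
    // Nodes are assumed to be numbered in topological order.
    static int[] schedule(int[] latency, int[][] inputs) {
        int n = latency.length;
        int[] offset = new int[n];
        for (int i = 0; i < n; i++)
            for (int in : inputs[i])
                offset[i] = Math.max(offset[i], offset[in] + latency[in]);
        return offset;
    }

    public static void main(String[] args) {
        // din(0) feeds averaging logic(1, latency 10) and a mux(2) directly.
        int[] latency = {0, 10, 1};
        int[][] inputs = {{}, {0}, {0, 1}};
        int[] off = schedule(latency, inputs);
        // The mux sits at offset 10, so the direct din->mux edge needs
        // off[2] - (off[0] + latency[0]) = 10 cycles of buffering.
        System.out.println(Arrays.toString(off)); // [0, 0, 10]
    }
}
```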
After all other graph-passes, a final graph-pass produces
a hardware design. By this stage in compilation every node
in the DAG has a direct mapping to an existing piece of
parameterisable IP. Thus, this final pass creates a hardware design by instantiating one IP component per node in the
graph. Hardware is created in Photon using further Java
classes to describe structural designs either directly using
Java or by including external pre-written HDL, or running
external processes to generate IP, e.g. CoreGen for floating-point units. After a design is fully described, external synthesis or simulation tools are invoked from Java to produce
the required output for this execution. The system used to
implement this low-level structural design and tool automation is based on the model described in [1].
5. Case Study
As a case study we consider a simple 2D convolution
filter. This kind of design is common in many digital image
processing applications.
The filter we implement is shown in Figure 3. The filter
is separable, using the equivalent of two 1D 5-point convolution operators, with a total operation count of 8 additions/subtractions and (after algebraic optimization to factor out common sub-expressions) 5 multiplications per input point.

Figure 3. Shape of 2D convolution filter implemented in Photon case-study.

              LUT/FF Pairs   DSPs   RAMB36s
No opts.      6192           10     18
BRAM opts.    6297           10     4
Mult. opts.   5851           6      18
All opts.     5925           6      4

Table 2. Resource usage for 2D convolution filter on a Virtex-5 with various optimizations.
5.1. Optimizations
The compilation of the convolution case study illustrates
two of the optimization graph-passes during the Photon
compilation process.
The Photon implementation of the filter makes use of
several large stream-shifts on the input data. These shifts
are necessary as each output data-point requires the 9 surrounding points to compute the convolved value. These
stream-shifts result in a large number of buffers being
added to the Photon design. Photon helps reduce this
buffering using a graph-pass that combines the multiple delay buffers into a single long chain of buffers. This ensures
each data item is only stored once, reducing buffering requirements.
Photon is able to use the precise value of the filter coefficient constants to optimize the floating-point multipliers.
Specifically, some of the coefficients are a power of two,
which can be highly optimized. To implement this, Photon includes another graph-pass which identifies floating-point multiplications by a power of two and replaces them with a dedicated node representing a hardware floating-point multiply-by-two IP core. This IP core uses
a small number of LUTs to implement the multiplication
rather than a DSP as in the conventional multipliers.
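The saving comes from the IEEE-754 format itself: multiplying a normalised single-precision value by a power of two only adjusts the 8-bit exponent field. A software model of the trick (illustrative only, not Photon's IP core; exponent overflow and denormals are not handled):

```java
// Software model of a floating-point multiply-by-2^k: for a normalised
// IEEE-754 single, only the 8-bit exponent field changes, which is why
// a few LUTs suffice in hardware instead of a DSP-based multiplier.
public class MulByPow2 {
    static float mulPow2(float x, int k) {
        int bits = Float.floatToRawIntBits(x);
        int exp = (bits >>> 23) & 0xFF;
        if (exp == 0 || exp == 0xFF)           // zero/denormal/NaN/infinity:
            return x * (float) Math.pow(2, k); // fall back to a full multiply
        return Float.intBitsToFloat(bits + (k << 23)); // adjust exponent only
    }

    public static void main(String[] args) {
        System.out.println(mulPow2(3.5f, 1));   // 7.0
        System.out.println(mulPow2(10.0f, -1)); // 5.0
    }
}
```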
5.2. Implementation Results
For synthesis we target our filter design to a Xilinx Virtex-5 LX110T FPGA, with a clock frequency of 250 MHz. At this speed, with one 4-byte single-precision value arriving at and exiting the circuit each cycle, we achieve a sustained computation rate of 250 MHz × 4 bytes = 1 GB/s.
Table 2 shows the area impact of the Photon optimization graph-passes on the filter hardware. The multiplier
power of two substitution pass reduces the number of DSP
blocks used from 10 to 6, and the delay merging pass reduces BRAM usage from 18 RAMB36s to 4. The number
of LUTs required for the original and optimized designs are
similar.
6. Conclusion
In this paper we introduce Photon, a Java-based FPGA
programming tool. We describe the programming model
for Photon in which data I/O is separated from computation
allowing designs to implicitly be easy to pipeline and hence
perform well in an FPGA. We give an overview of Photon’s
implementation as a library directed by user-created Java
programs. Finally, we present a case study demonstrating
that Photon’s pluggable optimization system can be used
to improve the resource utilisation of designs. Our current
and future work with Photon includes developing a system
for making it easier to create the data I/O logic external to
Photon designs, and creating more advanced optimization
passes.
References
[1] J. A. Bower, W. N. Cho, and W. Luk, “Unifying FPGA hardware development,” in International Conference on Field-Programmable Technology, December 2007, pp. 113–120.
[2] Impulse Accelerated Technologies Inc., “ImpulseC,”
http://www.impulsec.com/, 2008.
[3] Agility, “DK design suite,” http://www.agilityds.com/, 2008.
[4] O. Mencer, M. Morf, and M. J. Flynn, “PAM-Blox: High
performance FPGA design for adaptive computing,” in IEEE
Symposium on FPGAs for Custom Computing Machines, Los
Alamitos, CA, 1998, pp. 167–174.
[5] P. Bellows and B. Hutchings, “JHDL - An HDL for reconfigurable systems,” in IEEE Symposium on FPGAs for Custom
Computing Machines, Los Alamitos, CA, 1998, pp. 175–184.
[6] O. Mencer, “ASC: A stream compiler for computing with
FPGAs,” IEEE Transactions on CAD of ICs and Systems,
vol. 25, pp. 1603–1617, 2006.
[7] P. Hudak, “Modular domain specific languages and tools,”
Intl. Conf. on Software Reuse, vol. 00, p. 134, 1998.
[8] O. Pell and R. G. Clapp, “Accelerating subsurface offset gathers for 3D seismic applications using FPGAs,” SEG Tech.
Program Expanded Abstracts, vol. 26, no. 1, pp. 2383–2387,
2007.
[9] D. B. Thomas, J. A. Bower, and W. Luk, “Hardware architectures for Monte-Carlo based financial simulations,” in International Conference on Field-Programmable Technology,
December 2006, pp. 377–380.
Automated Design Approach for On-Chip Multiprocessor Systems
P. Mahr, H. Ishebabi, B. Andres, C. Loerchner, M. Metzner and C. Bobda
Department of Computer Science, University of Potsdam, Germany
{pmahr,ishebabi,andres,lorchner,metzner,bobda}@cs.uni-potsdam.de
Abstract
This paper presents a design approach for adaptive multiprocessor systems-on-chip on FPGAs. The goal in this
particular design approach is to ease the implementation
of an adaptive multiprocessor system by creating components, like processing nodes or memories, from a parallel
program. To this end, message-passing, a paradigm for parallel programming on multiprocessor systems, is used. The
analysis and simulation of the parallel application provides
data for the formulation of constraints of the multiprocessor
system. These constraints are used to solve an optimization
problem with Integer Linear Programming: the creation
of a suitable abstract multiprocessor hardware architecture
and the mapping of tasks onto processors. The abstract architecture is then mapped onto a concrete architecture of
components, like a specific PowerPC or soft-core processor, and is further processed using a vendor tool-chain for
the generation of a configuration file for an FPGA.
1. Introduction
As is apparent in current developments, the reduction of transistor size and the exploitation of instruction-level parallelism can no longer be relied upon to enhance the performance of processors [1]. Instead, multi-core processors
are a common way of enhancing performance by exploiting
parallelism of applications. However, designing and implementing multiple processors on a single chip leads to
new problems, which are absent in the design of single-core processors. For example, an optimal communication
infrastructure between the processors needs to be found.
Also, software developers have to parallelize their applications, so that the performance of the application is increased
through multiple processors. In the case of multiprocessor systems-on-chip (MPSoCs), which combine embedded
heterogeneous or homogeneous processing nodes, memory
systems, interconnection networks and peripheral components, even more problems arise, partly because of the variety of technologies available and partly because of their sophisticated functionality [2], [3]. To reduce design time, high-level design approaches can be employed. In [4], [5], [6] and [7], design methodologies and corresponding tool-support are described.
In principle, two communication paradigms for parallel computing with multiprocessor systems exist: communication through shared memory (SMP), i.e. caches or memory on a bus-based system, and the passing of messages (MPI) through a communication network. SMP architectures, like the Sun Niagara processor [8] or the IBM Cell BE processor [9], are the common multiprocessors today. MPI is typically used in computer clusters, where physically distributed processors communicate through a network.
This paper presents a survey of our developments in the area of adaptive MPSoC design with FPGAs (Field-Programmable Gate Arrays) as a flexible platform for chip multi-processors. In Section 2 an overview of the proposed design approach for MPSoCs is given: the steps for architectural synthesis, starting with the analysis and simulation of a parallel program and ending with the generation of a bitfile for the configuration of an FPGA, are described in general. In the following Section 3, an on-chip message-passing software library for communication between tasks of a parallel program, and a benchmark for the purpose of evaluation, are presented. Section 4 summarizes the formulation of architecture constraints for design space exploration with Integer Linear Programming; these constraints are derived from the results of the analysis and simulation of a parallel program. Section 5 gives an overview of the creation of MPSoCs using abstract components. Finally, this paper is concluded in Section 6 and a brief overview of future work is given in Section 7.
2. System design using architectural synthesis
To get an efficient multiprocessor system-on-chip from a
parallel program, several approaches are possible. Figure 1 shows our proposed synthesis flow using an analytical approach. The architectural synthesis flow starts with parallel applications that are modeled as a directed graph, where the nodes represent tasks and the edges represent communication channels [10].
3. On-Chip Message Passing
In this section the on-chip communication between processing nodes using a subset of the MPI standard is described [12]. Therefore a small-sized MPI-library was developed (see figure 2), which is similar to the approaches
described in [13], [14] and [15].
Figure 1. Architectural Synthesis Flow
In the first step of the design flow, information on data
traffic and on task precedence is extracted from functional
simulations of the parallel program. Information on the
number of cycles of a task when executed on a specific processor is determined from cycle accurate simulations. This
information is used to formulate an instance of an Integer
Linear Programming (ILP) problem.
In the following step, called Abstract Component creation, a combinatorial optimization is done by solving an
ILP problem. In addition to the information gathered in
the first step, platform constraints, e. g. area and speed of
the target platform, are needed as well. As a result of this
step an abstract system description including the (abstract)
hard- and software parts is generated.
The third step is called Component mapping. The abstract system description, which consists of abstract processors, memories, communication components or hardware accelerators and software tasks linked onto abstract
processors, is mapped onto a concrete architecture of components like PPC405, MicroBlaze or on-chip BRAMs. If needed, an operating system can be generated with scripts
and makefiles and can be mapped onto a processor as well.
This step can be done using the PinHaT software (Platformindependent Hardware generation Tool) [11].
In the final step a bitfile is generated from the concrete
components using the platform vendor tool-chain, performing logic synthesis and place & route. However, simulations for the validation of the system at higher and lower levels are needed as well and should be performed alongside the proposed steps.
Figure 2. SoC-MPI Library
The library consists of two layers: a network-independent layer (NInL) and a network-dependent layer (NDeL), which separate the hardware-dependent part of the library from the hardware-independent part. The advantage of this separation is easy migration of the library to other platforms. The NInL provides MPI functions, like MPI_Send, MPI_Receive, MPI_BSend or MPI_BCast. These functions are used to perform the communication between processes in the program. The NDeL is a collection of network-dependent functions for different network topologies. In this layer the ranks and addresses for concrete networks are determined, and the splitting and sending of messages depending on the chosen network is carried out. Currently the length of a message is limited to 64 bytes, due to the limited on-chip memory of FPGAs. Longer messages are therefore cut into several smaller messages and sent in series. The parameters of the MPI functions, like count, comm or dest (destination), are also used as signals and parameters for the hardware components of the network topology. That is, the parameters are used to build the header and the data packets for communication over a network; the MPI parameters are directly used for the control, data and address signals.
The following MPI functions are currently supported:
Init, Finalize, Initialized, Wtime, Wtick, Send, Recv,
SSend, BSend, RSend, SendRecv, Barrier, Gather, BCast,
Comm Size, Comm Rank.
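The message splitting performed by the NDeL can be sketched as follows (illustrative Java, not the library's actual code):

```java
import java.util.*;

// Sketch of the NDeL's message splitting: payloads over the 64-byte
// limit are cut into chunks that are sent in series over the network.
public class MessageSplit {
    static final int MAX_PAYLOAD = 64; // on-chip buffering limit in bytes

    static List<byte[]> split(byte[] message) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < message.length; off += MAX_PAYLOAD)
            chunks.add(Arrays.copyOfRange(message, off,
                    Math.min(off + MAX_PAYLOAD, message.length)));
        return chunks;
    }

    public static void main(String[] args) {
        List<byte[]> c = split(new byte[150]);
        // 150 bytes -> chunks of 64, 64 and 22 bytes
        System.out.println(c.size() + " chunks, last " + c.get(c.size() - 1).length + " bytes");
    }
}
```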
Figure 4. Benchmarks of the SOC-MPI Library
Figure 3. Configuration of processing nodes
In Figure 3 several processing nodes are connected via a star network. Additionally, nodes 0 and 1 are directly connected via FSL (Fast Simplex Link) [16]. Each processing node carries only a subset of the SoC-MPI library, with the dependent functions for its network topology.
3.1. Benchmarks
The MPI library is evaluated using Intel MPI Benchmarks 3.1, which is the successor of the well-known package PMB (Pallas MPI Benchmarks) [17]. The MPI implementation was benchmarked on a Xilinx ML-403 evaluation
platform [18], which includes a Virtex 4 FPGA running at
100 MHz. Three MicroBlaze soft-core processors [19] were
connected together via a star network. All programs were
stored in the on-chip memories.
In Figure 4 the results of the five micro benchmarks are
shown. Due to the limited on-chip memory not all benchmarks could be performed completely. Furthermore, a small
decay between 128 and 256 Bytes message size exists, because the maximum MPI message length is currently limited to 251 KBytes and a message larger than that must be
splitted into several messages. Further increase of the message size would lead to a bandwidth closer to the maximum
possible bandwidth, which is limited through the MicroBlaze and was measured with approximately 14 MBytes/s.
4. Abstract component creation using Integer Linear Programming

In this flow, Integer Linear Programming (ILP) is used for automated design space exploration with the goal of locating the best possible abstract architecture for a given parallel application under given constraints. The simultaneous optimization problem is to map parallel tasks to a set of processing elements and to generate a suitable communication architecture which meets the constraints of the target platform and minimizes the overall computation time of the parallel program. The input for this step is obtained by using a profiling tool for the mpich2 package. In the following two subsections, the area and time constraints of the processors and of the communication infrastructure are described separately.

4.1. Processors - sharing constraint, area constraint and costs

A few assumptions about processors and tasks need to be made, because it is possible to map several tasks onto one processor: (1) a task scheduler exists, so that scheduling is not involved in the optimization problem. (2) Task mapping is static. (3) The instruction sequence of a task is stored in the local program memory of the processor, e.g. the instruction cache, and hence the size of the local program memory limits the number of tasks which can be mapped onto a processor. (4) Finally, the cost of switching tasks in terms of processor cycles does not vary from task to task.

Let I_i ∈ {I_0, ..., I_n} be a task, J_j ∈ {J_0, ..., J_m} a processor and x_ij ∈ {0, 1} a binary decision variable, where x_ij = 1 means that task I_i is mapped onto processor J_j.

  Σ_{j=0..m} x_ij = 1, ∀ I_i    (1)
A constraint for task mapping (equation 2), called the address space constraint, and the cost of task switching (equation 3) can now be formulated, where s_ij is the size of task I_i on processor J_j, s_j is the program memory size of J_j and t_j is the cost (time) of task switching on J_j.

  Σ_{i=0..n} x_ij · s_ij ≤ s_j, ∀ J_j    (2)

  T_SWITCH = Σ_{j=0..m} Σ_{i=0..n} x_ij · t_j    (3)
For the calculation of the area of the processors, A_PE, the area a_j of a single processor is needed. Because x_ij only shows whether a task I_i is mapped onto a processor J_j, and does not show the number of processors in the system or the number of instantiations of a processor, an auxiliary variable v_j ∈ {0, 1} is needed. For each instance of a processor J_j there is a corresponding virtual processor v_j, and of all tasks mapped to a certain processor exactly one is mapped to the corresponding virtual processor. This leads to the following constraint (equation 4), so that the area of the processors can be calculated with equation 5.

  v_j ≤ Σ_{i=0..n} x_ij, ∀ J_j    (4)

  A_PE ≥ Σ_{j=0..m} v_j · a_j    (5)
4.2. Communication networks - Network capacity and area constraint

Several assumptions have to be made before constraints on the communication network can be formulated. The communication of two tasks mapped onto the same processor is done via intra-processor communication, which has negligible communication latency and overhead compared to memory access latency. All processors can use any of the available communication networks, and can use more than one network. A communication network has arbitration costs resulting from simultaneous accesses to the network. It is assumed that tasks are not prioritized, and that an upper bound on the arbitration time can be computed for each network topology depending on the number of processors. Finally, it is not predictable when two or more tasks will attempt to access the network, though a certain probability can be assumed.
λ_{i1,i2} is an auxiliary 0-1 decision variable that is 1 if two communicating tasks are mapped onto different processors. The sum of x_{i1,j1} and x_{i2,j2} equals two if the tasks are on different processors, as seen in equation 6.

    λ_{i1,i2} = (x_{i1,j1} + x_{i2,j2}) / 2        (6)

A communication topology C_k ∈ {C_0, ..., C_K} may have a maximum capacity of processors attached to it. This constraint is described in equation 7. y_k is a binary decision variable with value 1 if a communication topology C_k is used for communication between two tasks I_i1 and I_i2, and 0 otherwise. I_i1 ◁ I_i2 is a precedence operator: I_i1 is preceded by I_i2, which means that a data transfer is performed from I_i1 to I_i2. The maximum number of processes which can use a topology C_k is described by M_k.

    y_k + Σ_{I_i1, I_i2 | I_i1 ◁ I_i2} λ_{i1,i2} ≤ M_k,  ∀C_k        (7)

The total area cost of the communication network (resources for routing) can be calculated with equation 8, where A_k is the area cost of topology C_k.

    A_NET ≥ Σ_{k=0}^{K} A_k · y_k        (8)

The cost of the topology in terms of computation time is calculated in equation 10, where z_{k,i1,i2} is a binary decision variable which is 1 if network C_k is used by the two tasks I_i1 and I_i2, and 0 otherwise. D_{i1,i2} is the amount of data to be transferred between the two communicating tasks, p_k is the probability that network arbitration will be involved when a task wants to communicate, and τ_k is the upper bound on the arbitration time.

    z_{k,i1,i2} ≥ λ_{i1,i2} + y_k − 1        (9)

    T_NET = Σ_{k=0}^{K} Σ_{I_i1, I_i2 | I_i1 ◁ I_i2} (D_{i1,i2} + τ_k · p_k) · z_{k,i1,i2}        (10)

Finally, the total area cost A is calculated from the area of the processing elements A_PE (equation 5) and the area for the routing resources A_NET (equation 8).

    A ≥ A_PE + A_NET        (11)

The cost of computation time can be calculated with equation 12, where T_ij is the time required to process a task I_i on a processor J_j. The objective, in this case, is to minimize the computation time of a (terminating) parallel program; for non-terminating programs, like signal processing programs, the objectives are different.

    T = min ( Σ_{j=0}^{m} Σ_{i=0}^{n} x_ij · T_ij + T_NET + T_SWITCH )        (12)
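The whole optimization can be illustrated on a toy instance by exhaustive search; a real flow would hand equations 1-12 to an ILP solver. All numbers below are invented for illustration, and T_NET is omitted for brevity.

```python
# Sketch: exhaustive design space exploration for a tiny instance of the
# ILP above. All numeric values are illustrative assumptions.
from itertools import product

T_exec = [[4, 9], [5, 10], [3, 7]]   # T_ij: runtime of task Ii on processor Jj
a = [10, 4]                          # a_j: area of processor Jj
t_sw = [1, 1]                        # t_j: task switch cost on Jj
A_MAX = 15                           # assumed area budget

best = None
for mapping in product(range(2), repeat=3):   # mapping[i] = processor of Ii
    used = set(mapping)                       # instantiated processors (v_j = 1)
    A_PE = sum(a[j] for j in used)
    if A_PE > A_MAX:                          # area constraint (cf. eq. 11)
        continue
    # eq. 12 without the network term: execution plus switching cost
    T = sum(T_exec[i][mapping[i]] + t_sw[mapping[i]] for i in range(3))
    if best is None or T < best[0]:
        best = (T, mapping)

print(best)
```

Enumerating all 2^3 mappings is only feasible for toy sizes; the ILP formulation is what makes realistic task counts tractable.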
5. Component Mapping
In this section the mapping of abstract components onto
concrete ones is described. This component-based approach
is similar to the one described by Cesário et al. [20], where
a high-level component-based methodology and design environment for application-specific MPSoC architectures is
presented. For this task, a component-based design environment called PinHaT for the generation and configuration of
the system infrastructure was developed. This environment
offers a vendor-independent framework with which users can specify the requirements of their system and automatically produce the necessary system configuration and start-up code. The tool was developed as a Java application.

Generally, PinHaT follows a two-step approach. In the first step, an abstract specification of the system is described using abstract components such as CPUs, memories or hardware accelerators. In the following step, these abstract components are refined and mapped to concrete components, e.g. a specific CPU (PPC405, MicroBlaze) or a hardware divider. The software tasks are also mapped onto the concrete components. The structure of PinHaT is shown in figure 5. A detailed overview of the PinHaT tool is given in [11].
5.1. Generation of the System Infrastructure
The generation of the system infrastructure, that is, the mapping of an abstract system description onto a concrete hardware description, is done by PinHaT. PinHaT uses XML in conjunction with a document type definition (DTD) file as input for the abstract system description. An input consists of valid modules, which are CPUs, memories, communication modules, periphery and hardware accelerators. The hardware mapping onto concrete components is divided into three phases.

In the first phase, an internal XML tree is built by parsing the input file. For each node of this tree an adequate class is instantiated; these classes know how to process their own parameters. Such classes can easily be added to the framework to extend the IP core base. In a subsequent step, another parser creates the platform-specific hardware information file from the gathered information. In the second phase, individual mappers for all components and target platforms are created, followed by the last phase, where a mapper creates the platform-dependent hardware description files. These files are then passed to the vendor's tool chain, e.g. Xilinx EDK or Altera Quartus II.
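Phase one can be sketched as follows. Note that the XML element names and handler classes below are hypothetical, since the actual PinHaT schema and class hierarchy are not reproduced in the paper.

```python
# Sketch of phase one of the hardware mapping: parse an abstract system
# description into a tree and instantiate one handler class per node.
# Element names and Handler classes are hypothetical, not PinHaT's own.
import xml.etree.ElementTree as ET

DESCRIPTION = """
<system>
  <cpu name="cpu0" type="abstract"/>
  <memory name="ram0" size="65536"/>
  <accelerator name="div0"/>
</system>
"""

class Handler:
    """Base class: each subclass knows how to process its own parameters."""
    def __init__(self, node):
        self.params = dict(node.attrib)

class CpuHandler(Handler): pass
class MemoryHandler(Handler): pass
class AcceleratorHandler(Handler): pass

# Registry: extending the IP core base means adding one entry here.
REGISTRY = {"cpu": CpuHandler, "memory": MemoryHandler,
            "accelerator": AcceleratorHandler}

root = ET.fromstring(DESCRIPTION)
handlers = [REGISTRY[child.tag](child) for child in root]
print([h.params.get("name") for h in handlers])
```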
5.2. Configuration of the System Infrastructure - SW Mapping
In the case of software, a task is mapped onto a concrete
processor. This is in contrast to the mapping of abstract
components, e. g. processors or memories, to concrete ones
during the generation of the system infrastructure.
For the mapping step, parameters of the software must be specified for each processor. The parameters include information about the application and the operating system, like source code, libraries or the OS type. With this information, scripts and Makefiles for building the standalone applications and the operating systems are created. While standalone applications only need compiling and linking, building an operating system is more involved: depending on the operating system, different steps, like configuration of the file system or of the kernel parameters, are necessary. The result of the task mapping is an executable file for each processor in the system.
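As a rough illustration of this step, the sketch below turns per-processor software parameters into a Makefile for a standalone application. The parameter names, the mb-gcc compiler name and the Makefile layout are assumptions for illustration, not PinHaT's actual output.

```python
# Sketch: generate a per-processor Makefile from software parameters.
# All names below (keys, compiler, layout) are illustrative assumptions.
def makefile_for(proc):
    return "\n".join([
        f"# build rules for processor {proc['name']}",
        f"CC = {proc['cc']}",
        f"SRCS = {' '.join(proc['sources'])}",
        f"LIBS = {' '.join('-l' + lib for lib in proc['libs'])}",
        "",
        f"{proc['name']}.elf: $(SRCS)",
        "\t$(CC) -o $@ $(SRCS) $(LIBS)",   # link one executable per processor
    ])

proc = {"name": "cpu0", "cc": "mb-gcc",
        "sources": ["main.c", "mpi_task.c"], "libs": ["mpi"]}
mk = makefile_for(proc)
print(mk)
```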
Figure 5. Structure of PinHaT
The component mapping is divided into the generation
and the configuration of the system infrastructure, where
hardware is generated and software is configured. In this
flow, the input to PinHaT is obtained by high level synthesis, as described in section 4.
6. Conclusion
In this paper, a concept for the design automation of multiprocessor systems on FPGAs was presented. A small-sized MPI library was implemented to use message passing for the communication between tasks of a parallel program. Simulation and analysis of the parallel program are carried out to gather information about task precedence, task interaction and data traffic between tasks. This information is needed to formulate the constraints of an Integer Linear Programming problem. As a result, an abstract system description is created, consisting of hardware components, like processing nodes or hardware accelerators, and the tasks linked to them. With the help of the PinHaT software, the components of the abstract system description can be mapped onto concrete components, like specific processors. Finally, the configuration file for an FPGA can be created using the vendor tool chain.
7. Future Work
Currently, the PinHaT software is being extended into an easy-to-use software solution that includes the architectural synthesis of a parallel program. New abstract and corresponding concrete components, like new network topologies or processing nodes such as the OpenRISC processor, will also be included to enhance the flexibility of the architectural synthesis flow. Furthermore, concepts of adaptivity for multiprocessor systems are to be analysed, and a detailed evaluation of on-chip message passing with respect to communication overhead and latency needs to be carried out.
References
[1] Kunle Olukotun and Lance Hammond. The future of microprocessors. Queue, 3(7):26–29, 2005.
[2] Wayne Wolf. The future of multiprocessor systems-on-chips.
In DAC ’04: Proceedings of the 41st annual conference on
Design automation, pages 681–685, New York, NY, USA,
2004. ACM.
[3] Gilles Sassatelli, Nicolas Saint-Jean, Cristiane Woszezenki,
Ismael Grehs, and Fernando Moraes. Architectural issues in
homogeneous noc-based mpsoc. In RSP ’07: Proceedings of
the 18th IEEE/IFIP International Workshop on Rapid System
Prototyping, pages 139–142, Washington, DC, USA, 2007.
IEEE Computer Society.
[4] D.D. Gajski, Jianwen Zhu, R. Dömer, A. Gerstlauer, and Shuqing Zhao. SpecC: Specification Language and Methodology. Springer, 2000.
[5] Tero Kangas, Petri Kukkala, Heikki Orsila, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen, Jouni Riihimäki, and Kimmo Kuusilinna. Uml-based multiprocessor
soc design framework. Trans. on Embedded Computing Sys.,
5(2):281–320, 2006.
[6] Simon Polstra. A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE
Trans. Comput., 55(2):99–112, 2006. Member-Andy D. Pimentel and Student Member-Cagkan Erbas.
[7] Blanca Alicia Correa, Juan Fernando Eusse, Danny Munera,
Jose Edinson Aedo, and Juan Fernando Velez. High level
system-on-chip design using uml and systemc. In CERMA
’07: Proceedings of the Electronics, Robotics and Automotive Mechanics Conference (CERMA 2007), pages 740–745,
Washington, DC, USA, 2007. IEEE Computer Society.
[8] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle
Olukotun. Niagara: A 32-way multithreaded sparc processor. IEEE Micro, 25(2):21–29, 2005.
[9] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R.
Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev., 49(4/5):589–604, 2005.
[10] Concepcio Roig, Ana Ripoll, and Fernando Guirado. A new
task graph model for mapping message passing applications.
IEEE Trans. Parallel Distrib. Syst., 18(12):1740–1753, 2007.
[11] Christophe Bobda, Thomas Haller, Felix Mühlbauer, Dennis
Rech, and Simon Jung. Design of adaptive multiprocessor on
chip systems. In SBCCI ’07: Proceedings of the 20th annual
conference on Integrated circuits and systems design, pages
177–183, New York, NY, USA, 2007. ACM.
[12] MPI Forum. http://www.mpi-forum.org/. 01. April 2008.
[13] J. A. Williams, I. Syed, J. Wu, and N. W. Bergmann. A
reconfigurable cluster-on-chip architecture with mpi communication layer. In FCCM ’06: Proceedings of the 14th
Annual IEEE Symposium on Field-Programmable Custom
Computing Machines, pages 350–352, Washington, DC,
USA, 2006. IEEE Computer Society.
[14] T. P. McMahon and A. Skjellum. empi/empich: Embedding
mpi. In MPIDC ’96: Proceedings of the Second MPI Developers Conference, page 180, Washington, DC, USA, 1996.
IEEE Computer Society.
[15] Manuel Saldaña and Paul Chow. Tmd-mpi: An mpi implementation for multiple processors across multiple fpgas. In
FPL, pages 1–6, 2006.
[16] Xilinx Fast Simplex Link (FSL). http://www.xilinx.com/products/ipcenter/FSL.htm. 07. August 2008.
[17] Intel MPI Benchmarks. http://www.intel.com/cd/software/products/asmo-na/eng/219848.htm. 09. April 2008.
[18] Xilinx ML403 Evaluation Platform. http://www.xilinx.com/products/boards/ml403/reference_designs.htm. 09. April 2008.
[19] Xilinx MicroBlaze Processor. http://www.xilinx.com/products/design_resources/proc_central/microblaze.htm. 09. April 2008.
[20] W. Cesário, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A. A. Jerraya, and M. Diaz-Nava.
Component-based design approach for multicore socs. In
DAC ’02: Proceedings of the 39th conference on Design automation, pages 789–794, New York, NY, USA, 2002. ACM.
ASM++ charts: an intuitive circuit representation
ranging from low level RTL to SoC design
S. de Pablo, L.C. Herrero, F. Martínez
University of Valladolid
Valladolid (Spain)
[email protected]
M. Berrocal
eZono AG
Jena (Germany)
[email protected]
Abstract

This article presents a methodology to describe digital circuits from register transfer level to system level. When designing systems, it encapsulates the functionality of several modules and also encapsulates the connections between those modules. To achieve these results, the possibilities of Algorithmic State Machines (ASM charts) have been extended and a compiler has been developed. Using this approach, a System-on-a-Chip (SoC) design becomes a set of linked boxes, where several special boxes encapsulate the connections between modules. The compiler processes all required boxes and files, and then generates the corresponding HDL code, valid for simulation and synthesis. A small SoC example is shown.

1. Introduction

System-on-a-Chip (SoC) designs integrate processor cores, memories and custom logic into complete systems. The increased complexity requires more effort and more efficient tools, but also accurate knowledge of how to connect new computational modules to new peripheral devices using ever newer communication protocols and standards.

A hierarchical approach may encapsulate the functionality of several modules in black boxes. This technique effectively reduces the number of components, but system integration becomes more and more difficult as new components are added every day.

Thus, the key to a short design time, enabling "product on demand", is the use of a set of predesigned components which can be easily integrated through a set of also predesigned connections, in order to build a product.

For this reason, Xilinx and Altera have proposed their high-end tools named Embedded Development Kit [1] and SoPC Builder [2], respectively, which allow the automatic generation of systems. Using these tools, designers may build complete SoC designs based on their processors and peripheral modules in a few hours. At a lower scale, similar results may be found on the Hardware Highway (HwHw) web tool [3].

On the language side a parallel effort has been observed. In particular, SystemVerilog [4] now includes an 'interface' element that allows designers to join several inputs and outputs together in one named description, so textual designs may become easier to read and understand. At a different scale, pursuing a higher level of abstraction, the promising SpecC top-down methodology [5] first describes computations and communications at an abstract and untimed level, and then descends to an accurate and precise level where connections and delays are fully described.

The aim of this paper is to contribute to these efforts from a bottom-up point of view, mostly adequate for academic purposes. First of all, we present several extensions to the Algorithmic State Machine (ASM) methodology, which we have called "ASM++ charts", allowing the automatic generation of VHDL or Verilog code from these charts using a recently developed ASM++ compiler. Furthermore, these diagrams may describe hierarchical designs and define, through special boxes, how to connect different modules together.
2. ASM++ charts
The Algorithmic State Machine (ASM) method for
specifying digital designs was originally documented on
1973 by C.R. Clare [6], who worked at the Electronics
Research Laboratory of Hewlett Packard Labs, based on
previous developments made by T. Osborne at the
University of California at Berkeley [6]. Since then it has
been widely applied to assist designers in expressing
algorithms and to support their conversion into hardware
[7-10]. Many texts on digital logic design cover the
ASM method in conjunction with other methods for
specifying Finite State Machines (FSM) [11-12].
A FSM is a valid representation of the behavior of a
digital circuit when the number of transitions and the
complexity of operations is low. The example of fig. 1
shows a FSM for a 12x12 unsigned multiplier that
computes ‘outP = inA * inB’ through twelve conditional
additions. It is fired by a signal named ‘go’, it signals the
answer using ‘done’, and indicates through ‘ready’ that
new operands are welcome.
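The datapath controlled by this FSM performs twelve conditional additions, one per clock cycle. A behavioural model of that shift-and-add loop (a Python sketch for illustration only; the actual design is an HDL circuit) is:

```python
# Behavioural model of the 12x12 shift-and-add multiply performed by the
# FSM of fig. 1: twelve conditional additions, one per clock cycle.
def multiply_12x12(inA, inB):
    assert 0 <= inA < 2**12 and 0 <= inB < 2**12
    outP = 0
    for cycle in range(12):
        if inB & 1:               # conditional addition, one bit of inB per cycle
            outP += inA << cycle  # inA shifted left by the cycle count
        inB >>= 1
    return outP                   # 24-bit product

print(multiply_12x12(1234, 4095))
```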
Figure 1. An example of FSM for a multiplier.

However, in these situations traditional ASM charts may be more accurate and consistent. As shown in fig. 2, they use three different boxes to fully describe the behavior of cycle-driven RTL designs: a "state box" with rectangular shape defines the beginning of each clock cycle and may include unconditional operations that must be executed during (marked with '=') or at the end (using the delay operator '←') of that cycle; "decision boxes" –diamond ones– are used to test inputs or internal values to determine the execution flow; and finally "conditional output boxes" –with oval shape– indicate those operations that are executed during the same clock cycle, but only when the previous conditions are valid. Additionally, an "ASM block" includes all operations and decisions that are or can be executed simultaneously during each clock cycle.

Figure 2. Traditional ASM chart for a multiplier.

The advantages of an FSM for an overall description of a module are evident, but the ASM representation allows more complex designs through conditions that are introduced incrementally and detailed operations located where the designer specifies.

However, ASM notation has several drawbacks:
– It uses the same box, the rectangular one, both for new states and for unconditional operations executed at those states. Because of this property, ASM diagrams are compact, but they are also more rigid and difficult to read.
– Sometimes it is difficult to identify the frontier between different states. The complexity of some states requires the use of dashed boxes (named ASM blocks) or even different colors for different states.
– Due to the double meaning of rectangular boxes, conditional operations must be represented using a different shape, the oval boxes. But, actually, all operations are conditional, because all of them are state dependent.
– Additionally, designers must use lateral annotations for state names, for reset signals or even for links between different parts of a design (see fig. 2).
– Finally, the width of signals and ports cannot be specified when using the current notation.

The proposed ASM++ notation [13-14] tries to solve all these problems and extend far beyond the possibilities of this methodology. The first and main change introduced by this new notation, as seen in fig. 3, is the use of a specific box for states –we propose oval boxes, very similar to the circles used in bubble diagrams– so that now all operations may share the same box: a rectangle for synchronous assignments and a rectangle with bent sides for asynchronous assertions. Diamonds are kept for decision boxes because they are commonly recognized and accepted.
Figure 3. ASM++ chart ready for compilation.
Figure 3 shows additional features of ASM++ charts, included to allow their automatic compilation into HDL code. In addition to the algorithmic part, a declarative section may describe the design name, its implementation parameters, the external interface, and one or more internal signals. The synchronization signal and its reset sequence can be fully specified in a very intuitive way too. A box for 'defaults' has been added to easily describe the circuit behavior when a state leaves a signal free. Furthermore, all boxes use standard VHDL or Verilog expressions, but never both of them; the ASM++ compiler usually detects the HDL in use and then generates valid code using the same language.
3. Hierarchical design using ASM++ charts
As soon as the compiler generates the VHDL or Verilog code related to an ASM++ chart, the advanced features of modern HDL languages can easily be integrated in it. The requirements for hierarchical design have been included through the following elements:
– Each design begins with a ‘header’ box that
specifies the design name and, optionally, its
parameters or generics.
– Any design may use one or several pages of a MS Visio 2007 document¹, saved using its VDX format. Each VDX document may include several designs identified through their header boxes.
– Any design may instantiate other designs, giving them an instance name. As soon as a lower-level module is instantiated, a full set of signals named "instance_name.port_name" (see fig. 5) is created to ease the connections with other elements. Later on, any dot is replaced by an underscore because of HDL compatibility issues.
– When the description of an instantiated module is located in another file, a 'RequireFile' box must be used before the header box to allow a joint compilation. However, the ASM++ compiler identifies any previously compiled design, to avoid useless effort and invalid duplications.
– VHDL users may include libraries or packages using their 'library' and 'use' sentences, always before any header box.
– At present, the compiler does not support reading external HDL files in order to instantiate hand-written modules. A prototype of such modules, as shown in fig. 4, can be used instead.
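The instance-signal naming rule described above can be sketched in a few lines; the helper below is purely illustrative, not part of the ASM++ compiler.

```python
# Sketch of the naming rule: the compiler creates one signal per port of
# an instance as "instance_name.port_name", then legalises the dot into
# an underscore for HDL compatibility. Illustrative helper only.
def signal_name(instance, port):
    return f"{instance}.{port}".replace(".", "_")

print(signal_name("fifoA", "data_out"))
```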
Using these features, an example with a slightly improved multiplier can easily be designed. First of all, a prototype of a small FIFO memory is declared, as shown in fig. 4, so the compiler knows how to instantiate and connect this module, described elsewhere in a Verilog file. Then three FIFO memories are instantiated to handle the input and output data flows, as shown in fig. 5, so several processors may feed and retrieve data from this processing element.
Figure 4. A prototype of an external design.
¹ Actually, designers may also use MS Visio 2003 or ConceptDraw. However, the only supported file format is VDX.
Figure 5. An example of hierarchical design.
The ASM++ chart of fig. 5 can be compared with its arranged compilation result, shown below. The advantages of this methodology in flexibility, clarity and time saving are evident; a text-based tool is not always faster and more productive than a graphical one.
module hierarchical_design (clk, reset, inA, inB, outP,
                            readyA, readyB, readyP, pushA, pushB, popP);

  parameter width = 16;   // 16x16 => 32
  parameter depth = 6;    // 64-level buffers

  input                clk, reset;
  input  [width-1:0]   inA;
  output               readyA;
  input                pushA;
  input  [width-1:0]   inB;
  output               readyB;
  input                pushB;
  output [2*width-1:0] outP;
  output               readyP;
  input                popP;

  wire                 activate;

  wire                 fifoA_clk, fifoA_reset;
  wire [width-1:0]     fifoA_dataIn, fifoA_dataOut;
  wire                 fifoA_push, fifoA_pop;
  wire                 fifoA_empty, fifoA_full;

  fifo # (
    .width (width), .depth (depth)
  ) fifoA (
    .clk     (fifoA_clk),    .reset    (fifoA_reset),
    .data_in (fifoA_dataIn), .data_out (fifoA_dataOut),
    .push    (fifoA_push),   .pop      (fifoA_pop),
    .empty   (fifoA_empty),  .full     (fifoA_full)
  );

  wire                 fifoB_clk, fifoB_reset;
  wire [width-1:0]     fifoB_dataIn, fifoB_dataOut;
  wire                 fifoB_push, fifoB_pop;
  wire                 fifoB_empty, fifoB_full;

  fifo # (
    .width (width), .depth (depth)
  ) fifoB (
    .clk     (fifoB_clk),    .reset    (fifoB_reset),
    .data_in (fifoB_dataIn), .data_out (fifoB_dataOut),
    .push    (fifoB_push),   .pop      (fifoB_pop),
    .empty   (fifoB_empty),  .full     (fifoB_full)
  );

  wire                 AxB_clk, AxB_reset;
  wire                 AxB_go, AxB_ready, AxB_done;
  wire [width-1:0]     AxB_inA, AxB_inB;
  wire [2*width-1:0]   AxB_outP;

  multiplier # (
    .N (width)
  ) AxB (
    .clk (AxB_clk), .reset (AxB_reset),
    .go  (AxB_go),  .ready (AxB_ready), .done (AxB_done),
    .inA (AxB_inA), .inB   (AxB_inB),   .outP (AxB_outP)
  );

  wire                 fifoP_clk, fifoP_reset;
  wire [2*width-1:0]   fifoP_dataIn, fifoP_dataOut;
  wire                 fifoP_push, fifoP_pop;
  wire                 fifoP_empty, fifoP_full;

  fifo # (
    .width (2 * width), .depth (depth)
  ) fifoP (
    .clk     (fifoP_clk),    .reset    (fifoP_reset),
    .data_in (fifoP_dataIn), .data_out (fifoP_dataOut),
    .push    (fifoP_push),   .pop      (fifoP_pop),
    .empty   (fifoP_empty),  .full     (fifoP_full)
  );

  // Default connections
  assign fifoA_clk   = clk;
  assign fifoB_clk   = clk;
  assign AxB_clk     = clk;
  assign fifoP_clk   = clk;
  assign fifoA_reset = reset;
  assign fifoB_reset = reset;
  assign AxB_reset   = reset;
  assign fifoP_reset = reset;

  // User connections
  assign fifoA_push   = pushA;
  assign fifoA_dataIn = inA;
  assign fifoA_pop    = activate;

  assign fifoB_push   = pushB;
  assign fifoB_dataIn = inB;
  assign fifoB_pop    = activate;

  assign AxB_inA      = fifoA_dataOut;
  assign AxB_inB      = fifoB_dataOut;
  assign AxB_go       = activate;

  assign fifoP_push   = AxB_done;
  assign fifoP_dataIn = AxB_outP;
  assign fifoP_pop    = popP;

  assign activate     = AxB_ready & ~fifoA_empty
                      & ~fifoB_empty & ~fifoP_full;

  assign outP   = fifoP_dataOut;
  assign readyA = ~fifoA_full;
  assign readyB = ~fifoB_full;
  assign readyP = ~fifoP_empty;

endmodule /// hierarchical_design
4. Encapsulating connections using pipes
Following this bottom-up methodology, the next step is to use ASM++ charts to design full systems. As stated above, a chart can be used to instantiate several modules and connect them, with full, simple and easy access to all port signals.

However, system designers need to know how their available IP modules can or must be connected in order to build a system. Probably, they need to read thoroughly through several data sheets and try different combinations to finally match their requirements. Nonetheless, by the time they become experts on those modules, newer and better IP modules have been developed, so system designers must start again and again.
This paper presents an alternative to this situation, called "Easy-Reuse". During the following explanations, please refer to figures 6 to 9.
– First of all, a fully new concept must be introduced: an ASM++ chart may describe an entity/module that will be instantiated, like 'multiplier' in fig. 3, but additionally it may be used for a description that will be executed (see figs. 8 and 9). The former will just instantiate a reference to an outer description, while the latter will generate one or more sentences inside the modules that call them. To differentiate the modules that will be executed, their header boxes enclose one or more module names using '<' and '>' symbols. Later on, these descriptions are processed each time an instance or a 'pipe' (described below) calls them.
– Furthermore, the ASM++ compiler has been
enhanced with PHP-like variables [15]. They are
immediately evaluated during compilation, but they
are available only at compilation time, so no circuit
structures will be directly inferred from them. Their
names are preceded by a dollar sign (‘$’), they may
be assigned with no previous declaration and store
integer values, strings or lists of freely indexed
variables.
– In order to differentiate several connections that may use the same descriptor, variables are used instead of parameters or generics. The corresponding field of a header box, when it is used to start a connection description, defines default values for several variables (see fig. 8); these values can be changed by pipes on each instantiation (see fig. 6).
– Usual ASM boxes are connected in sequence using directed arrows; a new box called "pipe" can be placed outside the sequence and connects two instances through single lines, with no arrows.
– When the compiler finishes processing the main sequence, it searches all pipes, looks for their linked instances, and executes the ASM charts related to those connections. Before each operation, it defines two automatic variables to identify the connected instances. As said above, the pipe itself may define additional variables to personalize and differentiate each connection.
– Since several pipes may describe connections to the same signal, a resolution function must be defined to handle conflicts. A tristate function could be used, but HDL compilers usually refuse such connections if they suspect contentions; furthermore, modern FPGAs do not implement such resources any more because of their high consumption, so these descriptions are actually replaced by gate-safe logic. Consequently, a wired-OR, easier to understand than a wired-AND, has been implemented for the cases where several sources define different values from different pipe instantiations or, in general, from different design threads.
– The last element required by ASM++ charts to manage automatic connections is conditional compilation. A diamond-like box, with double lines on each side, tells the ASM++ compiler to follow one path and fully ignore the other one. Thus, different connections are created when, for example, a FIFO memory is accessed from a processor to write data, to read data or both.
Using these ideas, a SoC design may now encapsulate
not only the functionality of several components, but
also their connections.
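The wired-OR resolution described above can be sketched behaviourally: each design thread contributes a value, inactive threads contribute 0, and the final signal is the bitwise OR of all contributions. This is an illustration of the rule, not compiler code.

```python
# Sketch of the wired-OR resolution of conflicting pipe connections:
# every design thread drives either its value or 0, and the resolved
# signal is the bitwise OR of all contributions. Illustrative only.
def wired_or(contributions):
    result = 0
    for value in contributions:
        result |= value      # inactive threads contribute 0 and are harmless
    return result

# e.g. dsp_01_dataIn driven by two pipe threads; only one is active.
print(wired_or([0x0000, 0x00FF]))
```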
Figure 6 describes a small SoC that implements a
Harvard-like DSP processor (see [13]) connected to a
program memory, a 32-level FIFO and a register. First of
all, two C-like compiler directives are used to specify the
HDL language and a definition used later; a VDX file
that describes the DSP processor is also included before
giving a name to the SoC design. Then, all required
modules are instantiated and connected using pipes.
A small program memory has been designed for
testing purposes, as shown at fig. 7: the upper chart
describes a ROM memory with a short program that
emulates the behavior of a Xilinx Block RAM, and the
lower chart describes how this synchronous memory
must be connected to the DSP. This figure illustrates the
use of automatic variables (‘$ProgMem’ and
‘$DSPuva18’, whose values will be “mem_01” and
“dsp_01”, respectively) and the difference between
modules that can be instantiated or executed.
Figure 7. Charts may describe connections.
Figure 6. A small SoC design using pipes.
The pipe at figure 6 with text “RW250” describes the
connection of a FIFO memory (see fig. 4) to a
DSPuva18 processor [13], thus it executes the ASM++
chart shown at fig. 8. When executing this pipe, a ‘0’
value is firstly assigned to variables ‘$port’,
‘$write_side’ and ‘$read_side’, as stated by the header
box; then these values are changed as specified by the
pipe box (see the defined value of ‘RW250’); finally, the
chart of figure 8 generates the HDL code that fully
describes how “fifo_01” device is connected to
“dsp_01” processor for reading and writing using port
‘250’ for data and port ‘251’ for control (getting the state
through a read and forcing a reset through a write).
Several sentences of the HDL code generated by the ASM++ compiler when processing these diagrams are displayed below, revealing that ASM++ charts are fully capable of describing SoC designs using an intuitive, easy-to-use and consistent representation.
// I/O interface described by 'SoC_iface' instance and pipe (see figure 9):
input         clk, reset;
output [31:0] reg_01_LEDs;

// A connection described by "<SoC_iface> <Register>" pipe:
assign reg_01_LEDs = reg_01_dataOut;

// Connecting dsp_01 to mem_01, its program memory (see figure 6):
assign mem_01_rst      = dsp_01_progReset;
assign mem_01_addr     = dsp_01_progAddress;
assign dsp_01_progData = mem_01_data;

// Connecting reg_01 to dsp_01 (at port '0'):
assign reg_01_we     = dsp_01_portWrite & (dsp_01_portAddress == 0);
assign reg_01_dataIn = dsp_01_dataOut;

// Connecting fifo_01 to dsp_01 (at ports '250' and '251'):
always @ (posedge fifo_01_clk)
begin
  fifo_01_reset <= dsp_01_portWrite & (dsp_01_portAddress == 250 + 1);
end
assign fifo_01_dataIn = dsp_01_dataOut;
assign fifo_01_push   = dsp_01_portWrite & (dsp_01_portAddress == 250);
assign fifo_01_pop    = dsp_01_portRead  & (dsp_01_portAddress == 250);

// Connecting several sources to dsp_01 using a wired-OR:
assign asm_thread_1017_dsp_01_dataIn =
  (dsp_01_portRead & (dsp_01_portAddress == 0)) ? reg_01_dataOut : 0;
assign asm_thread_1021_dsp_01_dataIn =
  (fifo_01_pop) ? fifo_01_dataOut :
  (dsp_01_portRead & (dsp_01_portAddress == 250+1)) ?
    {fifo_01_full, fifo_01_almostFull, fifo_01_half,
     fifo_01_almostEmpty, fifo_01_empty} : 0;
assign dsp_01_dataIn =
  asm_thread_1017_dsp_01_dataIn | asm_thread_1021_dsp_01_dataIn;
Figure 8. An ASM++ chart that describes how a
FIFO must be connected to a DSP processor.

Two final ASM++ charts are described at figure
9, but other required charts have not been included for
brevity. The chart at left specifies how the instance
named ‘SoC_iface’ at figure 6 must be executed, not
instantiated, in order to generate two control inputs and
to connect them to all modules. The diagram at right
generates additional I/O signals and connects them to the
register controlled by the DSP through its port ‘0’.

5. Conclusions
This article has presented a powerful and intuitive
methodology for SoC design named Easy-Reuse. It is
based on a suitable extension of traditional Algorithmic
State Machines, named ASM++ charts, its compiler and
a key idea: charts may describe entities or modules, but
they also may describe connections between modules.
The ASM++ compiler developed to process these charts
in order to generate VHDL or Verilog code has been
enhanced further to understand a new box called pipe
that implements the required connections. The result is a
self-documented diagram that fully describes the system
for easy maintenance, supervision, simulation and
synthesis.
6. Acknowledgments
The authors would like to acknowledge the financial
support for these developments from eZono AG, Jena,
Germany, ISEND SA, Valladolid, Spain, and the
Spanish Government (MEC) under grant CICYT
ENE2007-67417/ALT with FEDER funds.
Figure 9. Charts may describe I/O interfaces too.
References

[1] Xilinx, “Platform Studio and the EDK”, on-line at http://www.xilinx.com/ise/embedded_design_prod/platform_studio.htm, last viewed on July 2008.
[2] Altera, “SoPC Builder”, on-line at http://www.altera.com/products/software/products/sopc/sop-index.html, last viewed on July 2008.
[3] epYme workgroup, “HwHw: The Hardware Highway web-tool for fast prototyping in digital system design”, on-line at http://www.epYme.uva.es/HwHw.php, 2007.
[4] “IEEE Std. 1800-2005: IEEE Standard for SystemVerilog – Unified Hardware Design, Specification, and Verification Language”, IEEE, 3 Park Avenue, NY, 2005.
[5] R. Dömer, D.D. Gajski and A. Gerstlauer, “SpecC Methodology for High-Level Modeling”, 9th IEEE/DATC Electronic Design Processes Workshop, 2002.
[6] C.R. Clare, Designing Logic Systems Using State Machines, McGraw-Hill, New York, 1973.
[7] D.W. Brown, “State-Machine Synthesizer – SMS”, Proc. of 18th Design Automation Conference, pp. 301-305, Nashville, Tennessee, USA, June 1981.
[8] J.P. David and E. Bergeron, “A Step towards Intelligent Translation from High-Level Design to RTL”, Proc. of 4th IEEE International Workshop on System-on-Chip for Real-Time Applications, pp. 183-188, Banff, Alberta, Canada, July 2004.
[9] E. Ogoubi and J.P. David, “Automatic synthesis from high level ASM to VHDL: a case study”, 2nd Annual IEEE Northeast Workshop on Circuits and Systems (NEWCAS 2004), pp. 81-84, June 2004.
[10] D. Ponta and G. Donzellini, “A Simulator to Train for Finite State Machine Design”, Proc. of 26th Annual Frontiers in Education Conference (FIE'96), vol. 2, pp. 725-729, Salt Lake City, Utah, USA, November 1996.
[11] D.D. Gajski, Principles of Digital Design, Prentice Hall, Upper Saddle River, NJ, 1997.
[12] C.H. Roth, Fundamentals of Logic Design, 5th edition, Thomson-Engineering, 2003.
[13] S. de Pablo, S. Cáceres, J.A. Cebrián and M. Berrocal, “Application of ASM++ methodology on the design of a DSP processor”, Proc. of 4th FPGAworld Conference, pp. 13-19, Stockholm, Sweden, September 2007.
[14] S. de Pablo, S. Cáceres, J.A. Cebrián, M. Berrocal and F. Sanz, “ASM++ diagrams used on teaching electronic design”, International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 2007), on-line conference, December 2007.
[15] The PHP Group, on-line at http://www.php.net, last release PHP 5.2.6, May 1st, 2008.
Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
Stefan Hochgürtel, Bernd Klein
Max Planck Institut für Radioastronomie,
Bonn, Germany
{shochgue, bklein}@mpifr-bonn.mpg.de
Abstract
Known and novel techniques are described to implement a Fast Fourier Transform (FFT) in hardware such
that parallelized data can be processed. Using both the
real and the imaginary FFT-input can help to save hardware. Based on the different techniques, flexible
FFT-implementations have been developed by combining standard FFT-components (partly IP) and are compared according to their hardware utilization. Finally, the applicability has been demonstrated in practice by an FFT-implementation with 8192 channels as part of an FPGA-spectrometer with a total bandwidth of 1.5 GHz.
1. Introduction
A number of radio-astronomical telescopes are now
in operation for observations within the mm and submm
wavelength atmospheric windows. Each of these windows
stretches over many 10s of GHz, and heterodyne receivers
have now been developed to cover a large fraction thereof.
The necessary spectrometers that allow a simultaneous coverage of such wide bandwidths (>1.0 GHz) at high spectral resolution (1 MHz or better) have only recently become
feasible through the use of FPGAs that can handle multiple gigabytes of data per second. Another great advantage
of FPGA-based spectrometers over the conventional filterbanks or auto-correlators is that their production is cheaper
and that they are more compact and consume less power.
Their stability furthermore allows the parallel stacking to
cover wider bandwidths than each individual ADC/FPGA
module can currently cover.
A common way to implement an FPGA spectrometer
is to feed the data stream of an Analog/Digital Converter
(ADC) to an FPGA that computes a Fast Fourier Transform (FFT). An N -point FFT takes N time-consecutive
data samples and transforms them to a frequency spectrum
with N spectral channels. According to the Shannon sampling theorem [4], the ADC’s sample-rate fS determines
the spectral bandwidth f = fS /2. The sensitivity specifies the spectrometer’s ability to detect weak signals. It
is in principle given by the ADC’s bit-resolution, but can
be increased by an integration (averaging) in time of each
spectral channel. Integration can also be performed in the time-domain, which improves sensitivity by using each input-sample multiple times. Since this averaging may cancel
a periodic input-signal, time-domain integration must be
combined with a window function. This is called weighted-overlap-add (WOLA).
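The weighted-overlap-add idea can be sketched as follows; this is an illustrative Python model assuming 4 overlapping blocks (as in the 4xWOLA unit used later), with a hypothetical `wola_frame` helper, not the hardware implementation.

```python
# Sketch of weighted overlap-add (WOLA) preprocessing, assuming 4 blocks:
# each FFT input frame is the windowed sum of 4 consecutive length-N
# blocks, so every input sample contributes to several frames.

def wola_frame(samples, window, N, R=4):
    # samples: at least R*N values; window: R*N weights programmed by the host
    assert len(window) == R * N
    return [sum(window[r * N + j] * samples[r * N + j] for r in range(R))
            for j in range(N)]

# With an all-ones window, frame[j] is simply the sum of 4 samples spaced N apart.
frame = wola_frame(list(range(32)), [1.0] * 32, N=8)
```

A real design would use a tapered window (e.g. flat-top) so that a periodic signal is not cancelled by the summation.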
2. Related Work
Efficient FFT-cores are commercially available from RF
Engines [3], who provide pipelined FFTs as well as combined FFTs (Section 4.1). Another source of pipelined
FFTs in multiple variations is the Xilinx IP-Core-Generator,
which is freely bundled with Xilinx-ISE [5]. Since Xilinx
IP-cores are freely available, we use them for our pipelined
FFTs.
A powerful commercially available spectrometer-card
was developed by Acqiris [1]. It has been in operation as
a spectrometer backend at the APEX telescope [2] since
2006. It is based on a Xilinx Virtex-II Pro 70 FPGA, fed by
two ADCs that deliver 2 Gigasamples per second (Gs/s) at 8
bit resolution. Its bandwidth is thereby limited to 1.0 GHz.
In collaboration with Acqiris, the ARGOS-project [7] has
implemented a FFT-spectrometer on this board. Starting
with 8 bit input-samples, the datawidth grows scaled to
some degree, from 9 bits after preprocessing to 18 bits after a 32k-pt FFT. Unscaled growth would require at least
9 + log(32768) = 24 bits after the FFT, since it sums up
32768 values of 9 bits. Although scaled broadening saves
hardware, which then becomes available to implement FFTs with a larger number of channels, it reduces precision and therefore potentially sensitivity. The preprocessing comprises windowing but no WOLA. At the output, 32 consecutive bits
can be chosen from each 36-bit-wide channel of the integrated
power-spectrum, in order to transmit them over the PCI-interface.
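The 24-bit figure above follows directly from the bit-growth rule for unscaled summation, which can be checked in one line:

```python
import math

# Unscaled growth: summing n values of b bits needs b + log2(n) bits.
# Here, summing 32768 values of 9 bits requires 9 + 15 = 24 bits,
# whereas the scaled ARGOS design keeps only 18 bits after its 32k-pt FFT.
def unscaled_bits(in_bits, n_summed):
    return in_bits + int(math.log2(n_summed))

needed = unscaled_bits(9, 32768)
```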
3. The Algorithm

We use an ADC that samples data at rates up to
3.0 GHz and a Virtex-4 FPGA with a maximum clock-rate
of 400 MHz [6]. The input-data stream must therefore be split up into
several parallel data-streams with reduced speed to allow the
FPGA to handle it. In this section, we will first recap how an
FFT can be split to handle those parallel data-streams. Then
it is shown how both the real and imaginary part of a complex FFT-input can be used for real-only input-data, in order
to reduce the hardware requirements [8].

3.1. Split FFTs

An $M \cdot N$-pt FFT calculates $M \cdot N$ spectral channels $\hat{a}_k$
out of $M \cdot N$ input-samples $a_j$, where $M$ and $N$ are powers
of two:

    $\hat{a}_k = \sum_{j=0}^{M \cdot N - 1} e^{-2\pi i \frac{j \cdot k}{M \cdot N}} \cdot a_j$    (1)

The input can be split up into $N$ groups $n$ of $M$ consecutive samples $m$, so that $j = M \cdot n + m$. In the same way
the output is split into $M$ groups $p$ of $N$ consecutive channels $q$, with $k = N \cdot p + q$. Applying this to (1) results in
a double sum that can be simplified, since $e^{-2\pi i \cdot n \cdot p} = 1$
for $n, p \in \mathbb{N}$:

    $\hat{a}_{N \cdot p + q} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} e^{-2\pi i \frac{(M \cdot n + m) \cdot (N \cdot p + q)}{M \cdot N}} \cdot a_{M \cdot n + m} = \sum_{m=0}^{M-1} e^{-2\pi i \frac{m \cdot (N \cdot p + q)}{M \cdot N}} \cdot \sum_{n=0}^{N-1} e^{-2\pi i \frac{n \cdot q}{N}} \cdot a_{M \cdot n + m}$    (2)

The inner sums can be computed as $M$ single $N$-pt
FFTs, fed with undersampled data:

    $a'_{m,q} = \sum_{n=0}^{N-1} e^{-2\pi i \frac{n \cdot q}{N}} \cdot a_{M \cdot n + m}$    (3)

Thereby $M$ spectra $a'_m$ are calculated, time-shifted against
each other by the original sampling-rate, each one with a
fraction $1/M$ of the original bandwidth. If their channels
$a'_{m,q}$ are multiplied by the correct power of $e^{-2\pi i}$, a set
of $N$ different $M$-pt FFTs remains to be computed. These
complex factors are called twiddle-factors, since a multiplication with $e^{-2\pi i \cdot x}$, $x \in \mathbb{R}$, equals a clockwise rotation (twiddle) by $x$ full turns. The correct twiddle here depends on the time-shift (FFT-number $0 \le m < M$) and the channel $0 \le q < N$:

    $a''_{m,q} = e^{-2\pi i \frac{m \cdot q}{M \cdot N}} \cdot a'_{m,q}$    (4)

    $\hat{a}_{N \cdot p + q} = \sum_{m=0}^{M-1} e^{-2\pi i \frac{m \cdot p}{M}} \cdot a''_{m,q}$    (5)

3.2. Use imaginary input

A complex FFT fed with real input-samples produces a
spectrum whose second half is complex conjugated to the
first half:

    $\hat{a}_{N-k} = \sum_{j=0}^{N-1} e^{-2\pi i \frac{j \cdot (N-k)}{N}} \cdot a_j = \hat{a}_k^{*}$    (6)

To prevent a waste of resources, real samples may alternately feed the real and imaginary inputs of a half-sized FFT:

    $\tilde{a}_k = \sum_{j=0}^{N/2-1} e^{-2\pi i \frac{j \cdot k}{N/2}} \cdot (a_{2j} + i \cdot a_{2j+1})$    (7)

All intermediate channels $\tilde{a}$ form corresponding pairs $\tilde{a}_k$
and $\tilde{a}_{N/2-k}$. By adding and subtracting them, two independent half-sized FFTs can be calculated:

    $a'_{0,k} = \frac{\tilde{a}_{N/2-k}^{*} + \tilde{a}_k}{2} = \sum_{j=0}^{N/2-1} e^{-2\pi i \frac{j \cdot k}{N/2}} \cdot a_{2j}$,
    $a'_{1,k} = i \cdot \frac{\tilde{a}_{N/2-k}^{*} - \tilde{a}_k}{2} = \sum_{j=0}^{N/2-1} e^{-2\pi i \frac{j \cdot k}{N/2}} \cdot a_{2j+1}$    (8)

A twiddle-multiplication and an addition yield the final
spectral channels:

    $\hat{a}_k = a'_{0,k} + e^{-2\pi i \frac{k}{N}} \cdot a'_{1,k}$    (9)

4. Implementation

Since ADC-data must be demultiplexed to reduce its rate
below a clock-rate realistic for the FPGA, we obtain $P$
parallel data-streams, each one containing undersampled
data with offsets of one sample to the adjoining streams.
If a $P$-pt FFT is insufficient for the required spectral resolution, input-samples have to be collected over $Q$ clock-cycles to calculate a $P \cdot Q$-pt FFT. The structure of an FFT demands $P$ and
$Q$ to be powers of 2.
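The decomposition of Section 3.1, eqs. (1)-(5), can be checked numerically for small sizes. The following is an illustrative Python sketch using plain DFTs, not the VHDL implementation; `dft` and `split_dft` are hypothetical helper names.

```python
import cmath

# Check of eqs. (1)-(5): an M*N-pt DFT equals M N-pt inner DFTs on
# undersampled streams (eq. 3), a twiddle multiplication (eq. 4), and
# N M-pt outer DFTs (eq. 5).

def dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_dft(a, M, N):
    # inner N-pt DFTs on undersampled streams a[M*n + m]  -> a'(m,q), eq. (3)
    a1 = [dft([a[M * n + m] for n in range(N)]) for m in range(M)]
    # twiddle-multiplication -> a''(m,q), eq. (4)
    a2 = [[cmath.exp(-2j * cmath.pi * m * q / (M * N)) * a1[m][q]
           for q in range(N)] for m in range(M)]
    # outer M-pt DFTs over m -> channel N*p + q, eq. (5)
    out = [0j] * (M * N)
    for q in range(N):
        col = dft([a2[m][q] for m in range(M)])
        for p in range(M):
            out[N * p + q] = col[p]
    return out

a = [complex(v) for v in [3, 1, 4, 1, 5, 9, 2, 6]]
ref, split = dft(a), split_dft(a, M=2, N=4)
assert all(abs(r - s) < 1e-9 for r, s in zip(ref, split))
```

In the hardware, the inner DFTs become pipelined FFTs and the outer DFTs a parallel FFT, as described next.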
Figure 1. 8 pipelined 128-pt FFTs followed by
a parallel 8-pt FFT form a combined 1024-pt
FFT (8x128).

Figure 2. Reordering the input allows the parallel FFT first and splitting the data-streams
afterward. A parallel 8-pt FFT followed by 8
pipelined 128-pt FFTs form a splitting 1024-pt
FFT (8x128).
4.1. Combining Parallel and Pipelined FFTs

The way an FFT is split up (Section 3.1) suggests feeding each parallel data-stream into a pipelined N-pt FFT
(N = Q) to calculate the inner sums a'(m,q). After twiddle-multiplication, a parallel M-pt FFT can be used to calculate
M = P spectral channels in any clock cycle. Altogether
this creates a combined FFT (Figure 1) that operates sequentially on parallel data-streams. This implementation is
simple and hardware-efficient as well as flexible in adjusting its width P and its length Q to values appropriate for
any clock-rate and any spectrum-size.

Keeping in mind that an FFT may exceed the capabilities
of a single FPGA, it is desirable to split it into independent parts. Since the pipelined FFTs are expected to be the
largest part of a combined FFT, separating them from each
other would have the highest impact on resource consumption. In a combined FFT, data from all parallel streams has
to be set against each other at the end. If the pipelined FFTs'
instances are separated onto different, parallel FPGAs, recombination of data from different FPGAs at high clock-rates
would be required. Therefore, independent data-streams at
the end would simplify parallelization over several FPGAs.

This is achieved by a parallel N-pt FFT (N = P) at the
beginning. Since each FFT in (3) needs undersampled input, the data has to be reordered first, dependent on the size
of the complete FFT: the parallel FFT at the front (Figure
2) requests M = Q subsequent samples at each input, skipping the next M · N − M samples, which are needed at its
other inputs in parallel. This is done by a memory module
that stores M · N samples, followed by a two-dimensional
shift-register. Behind the parallel FFT, the data-streams
become independent from each other. Each one is multiplied by twiddle-factors and the resulting a''(m,q) are fed into
pipelined M-pt FFTs to calculate the Nth part of the complete spectrum.

So we combined a reorder unit, a parallel FFT and multiple pipelined FFTs, such that data-streams become independent after the parallel FFT and can therefore be split into
multiple partitions at the end. In the following, this technique is called a splitting FFT.
4.2. Reduce FFT-width
The number of channels that are computed by a combined or splitting FFT can be increased in two ways with
different effects on hardware utilization: increase its width
P or increase its length Q.
Doubling P will double the number of pipelined FFTs
and twiddle-multipliers and therefore the hardware utilized
by them. The parallel FFT grows with P as O(P · log(P )).
Increasing the length Q on the other hand, does not lead
to changes in parallel FFT - it simply operates more cycles per FFT. Twiddle-multiplication needs no more logic
resources, but memory use increases linearly with Q, due
to the need to store more twiddles. Doubling the channels
calculated from a pipelined FFT requires storing all previous channels again and adding another pipeline-stage. Thus
the pipelined FFTs’ memory use grows linear with Q, while
their logic utilization grows logarithmically:
    Logic(P, Q) = O(P · log(P) · log(Q))    (10)

    Mem(P, Q) = O(P · Q)    (11)
A combined or splitting FFT’s use of FPGA-logic grows
stronger with its width than with its length.
Figure 3. A splitting 512-pt FFT (4x128) using
imaginary inputs. Channel-transformation results in the first half of a 1024-pt FFT. The
second half is obsolete.

To optimize the resource usage, P should be reduced,
which could be achieved by less demultiplexing of input-data. This either requires higher clock-rates, which is not
realistic in most cases, or lower sampling-rates and thereby
reduced spectrometer bandwidth. Using the imaginary inputs of a combined or splitting FFT for one half of the real
input-samples has the same effect: the width of a combined or splitting FFT can be reduced to P/2 (Figure 3).
Instead of doubling the FFT-length Q, pairs of intermediate channels ã from the FFT-output are transformed to the
final channels, as described in Section 3.2. This transformation is similar to one stage in a pipelined FFT, but slightly
more complex. According to eq. (10), this leads to a more
efficient use of the FPGA-logic. Finally note, that spectral resolution is preserved, although the number of computed channels is halved, because half of the channels are
complex-conjugated to the others if the FFT input is real
(eq. 6). The hardware that is saved through this trick can be
used to increase Q, allowing a better spectral resolution.
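The imaginary-input trick of Section 3.2 can also be verified numerically for a small real input. This is an illustrative Python sketch (hypothetical helper names, not the hardware channel-transformation unit): pack sample pairs into a complex half-sized DFT, separate the two half spectra per eq. (8), and rebuild the final channels with one twiddle stage per eq. (9).

```python
import cmath

# Check of Section 3.2: an N-pt DFT of real data computed from one
# complex N/2-pt DFT plus a channel-transformation stage.

def dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def real_dft_via_imag(a):
    N = len(a)
    # eq. (7): even samples feed the real input, odd samples the imaginary one
    t = dft([complex(a[2 * j], a[2 * j + 1]) for j in range(N // 2)])
    out = []
    for k in range(N // 2):
        pair = t[(N // 2 - k) % (N // 2)].conjugate()
        a0 = (pair + t[k]) / 2           # eq. (8): DFT of the even samples
        a1 = 1j * (pair - t[k]) / 2      # eq. (8): DFT of the odd samples
        out.append(a0 + cmath.exp(-2j * cmath.pi * k / N) * a1)  # eq. (9)
    return out  # first half only; the rest is conjugate-symmetric, eq. (6)

a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
ref = dft([complex(v) for v in a])
half = real_dft_via_imag(a)
assert all(abs(ref[k] - half[k]) < 1e-9 for k in range(len(a) // 2))
```

Only the first N/2 channels are produced, which by eq. (6) is exactly the non-redundant half of the spectrum.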
4.3. Application

A combined FFT and a splitting FFT were implemented
in synthesizable VHDL, together with a module to transform
intermediate channels when imaginary inputs are used.
To enable flexible use in different environments, these implementations are generic in FFT-width, FFT-length and
input-datawidth. The pipelined FFTs come from the Xilinx
Core-Generator.

These modules are embedded into a spectrometer-design
(Figure 4) including some pre- and postprocessing around
the FFT: in the time-domain, WOLA is performed with sets
of up to 4 input-samples. The weighting is given by an arbitrary window function, which can be programmed from the
host over Ethernet. This preprocessing unit increases the
sensitivity and decreases leakage-effects. Behind the FFT,
power-spectra are calculated by squaring each channel's absolute value, followed by an integration (average) over an
adjustable time-period. The design is unscaled to prevent
loss of sensitivity during calculation: the datawidth after
each unit equals the datawidth before plus the logarithm
of the number of summed values: log(4) bits in 4xWOLA
and log(P·Q) in the FFT and channel-transformation unit. Two
extra bits are spent to compensate window-multiplication
and to prevent overflow by twiddle-multiplication in the
FFT. Squaring doubles the datawidth, and a constant 64 bits are
finally reserved for each channel of the integrated power
spectrum. These are converted to 32-bit float and sent to a
host-PC by Ethernet. Concurrently, parameters can be received over Ethernet to adjust all modules in the design.

Currently this design runs on a custom-built board (Figure 5) with a Virtex-4-SX55 FPGA. Up to 3 Gs/s of data is
delivered by an A/D-Converter and received by a Spartan-3-1000 FPGA that controls all units on the board and connects
to the host.

Figure 4. Spectrometer-design with a combined 8192-pt FFT (8x1024), running on a
Virtex-4-SX55 FPGA. Datawidth grows by
log(4) + 1 bits in 4xWOLA, by log(8192) + 1
bits in 8192-pt FFT, by 1 bit during channel-transformation and is doubled by squaring.
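The datawidth bookkeeping stated in the Figure 4 caption can be checked with a few lines; the `widths` helper is an illustrative stand-in for the design's bit-growth rules.

```python
import math

# Unscaled datawidth growth through the spectrometer of Figure 4:
# 8-bit input, 4xWOLA (+log2(4)+1), 8192-pt FFT (+log2(8192)+1),
# channel-transformation (+1), then squaring (doubles the width).
def widths(in_bits=8, wola=4, fft_pts=8192):
    w = [in_bits]
    w.append(w[-1] + int(math.log2(wola)) + 1)     # after 4xWOLA
    w.append(w[-1] + int(math.log2(fft_pts)) + 1)  # after the FFT
    w.append(w[-1] + 1)                            # after channel-transformation
    w.append(2 * w[-1])                            # after squaring
    return w

stage_widths = widths()   # 52-bit squares then accumulate into 64-bit channels
```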
5. Results
5.1. Hardware usage
To quantify the impact on resource consumption, the
different FFT-types described in Section 4 are compared
with respect to the following Virtex-4 primitives: Internal
Memory in Block-RAM, DSP48-Slices used as multipliers, and Slices, each containing 2 Flip-Flops and 2 LookUp-Tables (LUT). Four FFT-implementations were instantiated with comparable sizes, inserted into the spectrometerdesign shown in Figure 4 and synthesized. We compared
a combined 256-pt FFT (16x16), a splitting 256-pt FFT
(16x16), a splitting 128-pt FFT (8x16), using its imaginary
inputs, and a combined 128-pt FFT (8x16) using its imaginary inputs. A length of Q = 16 is chosen, since a spectrometer with a splitting 512-pt FFT (16x32) would have
exceeded the size of the SX55 FPGA. All four variants were
programmed onto the FPGA on the board shown in Figure 5.

Figure 6. Hardware-usage of seven spectrometers with different FFTs. Percentage of utilized primitives (Slices, FlipFlops, LUTs, BlockRAM, DSP48) in a Virtex-4-SX55 FPGA. Compared variants: 256-pt combined and splitting FFTs using real input only; 128-pt and 2048-pt combined and splitting FFTs and an 8192-pt combined FFT using imaginary input.
Comparing combined FFT and splitting FFT (Figure 6),
mainly two effects are observed that affect the hardware
utilization: The number of consumed DSP-Slices only depends on datawidth in pipelined FFTs, parallel FFT and
twiddle-multipliers. Since the number of DSP-Slices grows
discontinuously with bitwidth, it is misleading to generalize
what FFT-type performs better. Utilization of Block-RAM
and Slices is also influenced by bitwidth, but the main reason for the difference is the initial reordering in a splitting
FFT, which consumes memory, registers, and logic. To conclude, a splitting FFT needs slightly more hardware than a
combined FFT, but it can be split more easily over multiple
FPGAs.
As predicted in Section 4.2, the hardware consumption
can be reduced significantly by using the FFT’s imaginary
inputs: About 40% of registers and logic is saved and about
30% of the multipliers (Figure 6), whereas the use of BlockRAMs hardly changes. A major advantage is the reduction
in Slice consumption, since their utilization is most critical
in this application.
The reduced hardware consumption allows improvement
of multiple properties of the spectrometer. Since bandwidth
and sensitivity are already optimized well, the option taken
here is to increase the spectral resolution by using a FFT
with more channels. When almost all Slices are used, as
with the 256-pt FFTs, the number of FlipFlops and LUTs
must be observed to keep Slice utilization comparable, since
unrelated logic is only mapped into one and the same Slice,
when almost no Slice is left completely unused. The hardware utilization of the 256-pt FFTs (16x16, not using their
imaginary inputs) is closest to that of the 2048-pt FFTs (8x256)
that do use their imaginary inputs (Figure 6). As eq. (10)
suggests, halving the FFT-width P = 16 allows squaring its
length Q = 16 with a comparable amount of used logic.

Since these FFTs are chosen for comparability, their
sizes do not define this application's maxima in any case.
The biggest FFT implemented in this application so far is
a combined 8192-pt FFT using its imaginary inputs. It has
32 times more independent channels than the biggest FFT
not using its imaginary inputs: a combined 512-pt FFT. The
spectrometer including the former utilizes almost 100%
of the 24576 Slices, 84% of the 49152 FlipFlops, 77% of the
49152 Look-Up-Tables, 70% of the 512 DSP48-Slices, and
59% of the 320 BlockRAMs in a Virtex-4-SX55 FPGA.

In conclusion, the improvement when using imaginary
inputs is either 40% of the chip-logic or a squared FFT-length Q: an 8x256-pt FFT replaced a 16x16-pt FFT and an
8x1024-pt FFT replaced a 16x32-pt FFT.

Figure 5. Custom-built spectrometer board
with ADC, Virtex-4 for FFT and Spartan-3 for
controlling.

5.2. Spectrometer

The clock-rate that Xilinx-ISE [5] achieves for a design degrades dramatically with a growing number of occupied
Slices. Although Xilinx-ISE calculates a maximum clock-rate of 1333/8 = 166.7 MHz, the spectrometer has been
successfully overclocked to 1800/8 = 225 MHz. Still, the
usual clock-speed for reliable operation is 1500/8 = 187.5 MHz. A spectrometer with 8192 channels and 1.5 GHz
bandwidth has now been implemented successfully, based
on a combined 8192-pt FFT with its imaginary inputs used.
Input-data is preprocessed by 4xWOLA, weighted by a
Flat-Top window, and about 180,000 single spectra are integrated over 1 second.

6. Conclusion and Further Work

The test spectrum shown in Figure 7 demonstrates that
the described technique and its FFT implementation work
well in practice on our prototype spectrometer card.

Figure 7. 1.5 GHz power-spectrum with 8192
channels of a 1.0 GHz sine combined with
noise from 0 to 1.2 GHz.

Our spectrometer is fed by an ADC with up to 3 Gs/s
and thereby produces a spectral bandwidth of 1.5 GHz. This
is significantly wider than other currently operational digital spectrometers produce. In our implementation, the
8-bit input-data samples grow completely unscaled through
the design to prevent any loss of precision and therefore sensitivity. Under this condition, the implementation of a combined 8192-pt FFT, using imaginary inputs, fits on a Virtex-4-SX55 FPGA and can produce 32 times more channels
than if only the real input was used. Time-domain data can
be preprocessed by up to 4xWOLA and any window, programmed from a host-PC. The integration over adjustable
time-periods leads to a final 64-bit channel size. Finally,
the channels are converted to 32-bit float values and the data
is transferred to the host-PC over Ethernet, a standard that
guarantees compatibility over long time-periods.

Although the performance of our fully functional spectrometer already exceeds that of others, we plan to further
improve both its bandwidth and its spectral resolution.
A recently announced ADC with 5 Gs/s, in combination
with a Virtex-5 FPGA, could provide a significantly higher
bandwidth. Both components shall be placed onto a slightly
modified version of our existing, custom-built board. Since
doubling the bandwidth requires double parallelization, the
design is expected to reach the limits of a Virtex-5-SX95T,
preventing an increase of the number of channels.

One way to increase the channel number would be to
split the task between several FPGAs. We have explored
this direction with another custom-built prototype board.
Our future work will concentrate on reducing the hardware consumption in an FPGA. A large hardware-savings
potential lies in reduced precision. We shall quantify how
bitwidth affects sensitivity for the data as well as the twiddle-factors in different parts of the design.

References

[1] Acqiris. http://www.acqiris.com.
[2] APEX. http://www.apex-telescope.org.
[3] RF Engines Limited. http://www.rfel.com.
[4] Wikipedia. http://www.wikipedia.org.
[5] Xilinx-ISE. http://www.xilinx.com/ise.
[6] Xilinx Virtex-4 Data Sheet: DC and Switching Characteristics. http://direct.xilinx.com/bvdocs/publications/ds302.pdf.
[7] A. Benz et al., "A broadband FFT spectrometer for radio and millimeter astronomy", Astronomy & Astrophysics, (3568), September 2005.
[8] E. O. Brigham, FFT: schnelle Fourier Transformation, R. Oldenbourg Verlag GmbH, Munich, 1982.
The ABB NoC – a Deflective Routing 2x2 Mesh NoC
targeted for Xilinx FPGAs
Johnny Öberg1, Roger Mellander2, Said Zahrai2
1
Royal Institute of Technology (KTH),
Dept. of Electronics, Computer, and Software Systems
[email protected]
2
ABB Corporate Research AB,
{roger.mellander, said.zahrai}@se.abb.com
Abstract
The ABB NoC is a 2 by 2 Mesh NoC targeted for
Xilinx FPGAs. It implements a deflective routing
policy and is used to connect four MicroBlaze
processors, which implement an area- and timing-critical multiprocessor embedded real-time system.
Each MicroBlaze is connected to the NoC via a
Network-Interface (NI) that communicates through a
Fast Simplex Link (FSL) interface together with
Block RAM (BRAM) memories that implement a
shared memory between the NoC and the resource.
Application programs use device drivers to
communicate with the Network-Interface of the NoC.
The NI sets up message transfer, with a maximum
length of 2040 bytes, and sends flits with the size of
32 bit data plus 11 bit headers through the network.
The design has been implemented on Xilinx
Virtex2P and Virtex4 FPGAs. The NoC design has a
throughput of 200 Mbps, is about 2600 slices large
and operates at 111 MHz in the Virtex2P technology
and 125 MHz in the Virtex4 technology.
1. Introduction
An industrial control system is often divided
into subsystems that interact and together provide the
required functionality. Dividing a system into
separately developed subsystems provides additional
flexibility, in particular when standard components
are used. As an example, one may consider a
machine controller with several layers of intelligent
components, with powerful PLCs at the highest level
and small embedded systems for sensing and
actuation at the lowest level.
In the competitive world of industrial markets,
reduced size, cost and power consumption are
important goals or constraints in development
projects. Integration of the closely related
components often gives positive effects but might
reduce the flexibility and scalability of the system.
The use of FPGAs can be a possible path into highly
integrated systems on a single chip which, from an
architectural point of view, can be considered as
large systems consisting of separate subsystems.
Each of these subsystems might contain one or a
number of IP-cores and CPUs.
An industrial control system used for control of
machines and processes is usually composed of
different nodes with variable computing capacity
responsible for executing different tasks in the
system. Communication between these nodes occurs
through different types of communication links and
protocols such as proprietary ones, Ethernet and field
buses. Similarly, a SoC implementation of such a
system using a number of processors on the same
chip relies on suitable choice of communication
interfaces, such as FIFOs, memory-mapped links or
buses, each of which with its strengths and
weaknesses.
Bus-based platforms suffer from limited
scalability and poor performance for large systems
[1]. This has led to proposals for building regular
packet switched networks on chip as suggested by
Dally, Sgroi, and Kumar [1, 2, 3]. These Networks-on-Chip (NoCs) are the network-based communication
solution for SoCs. They allow reuse of the
communication infrastructure across many products,
thus reducing design-and-test effort as well as time-to-market. In this respect, a NoC provides an attractive
alternative where a number of units are connected in
a standard way.
The aim of this paper is to present the
implementation and evaluation of a NoC used in a
SoC in an industrial application. The Ethernet
connections (100 Mbps) and shared memories of the
original design should be replaced with a NoC
system that handles data transfer through a DMA
technique. The resource software will typically send
messages over the NoC at a regular interval and the
maximum length of a message is 1500 bytes. For
compatibility to earlier systems, the design should be
implemented using Xilinx Virtex2P (or later)
technology platform. To be able to obtain sufficient
real-time performance, the processors need to run at
least at 100 MHz, with the support of hardware
accelerators, or as fast as possible, preferably at
higher speed than 150 MHz, to be able to run without
hardware accelerators. Thus, the application is both
area- and timing-critical.
Figure 1: Block diagram of the ABB NoC.
2. The ABB NoC
The ABB NoC is a 2 by 2 Manhattan style 2D Mesh NoC. It is based on the Nostrum NoC
Concept [4..9] developed by the Royal Institute of
Technology, KTH, in Stockholm. It is used to
connect four MicroBlaze [13] embedded processor
systems together.
The NoC consists of four three-port switches,
see Figure 1. The two ports A and B are connected to
neighbouring switches and the remaining port R is
connected to the Network-Interface of the resource.
Each port supports full duplex transmissions, i.e.,
they are built using two links, one in each direction,
to make it possible to transmit packets in both
directions simultaneously.
Each MicroBlaze is connected to the NoC via a
Network-Interface, consisting of a controller, two
Fast Simplex Links (FSL-links) [12] and two Block
RAM (BRAM) memories [11], situated on the Local
Memory Bus (LMB-bus) [10] of the MicroBlaze
processor. The FSL-link is used to set up the NoC-Frame transmission, while the two BRAM memories
function as a write and a read buffer, respectively.
To connect the ABB NoC to the four MicroBlaze
systems, eight BRAMs together with eight FSL
links are needed; two BRAMs and two FSL links per
resource. To be able to reuse the device drivers, the
FSL-link is connected to the same port (FSL port 0)
and the BRAMs have the same memory map on all
resources. A block diagram of the connection is
shown in Figure 2.
Switch Architecture
The Switch is implemented using a data path
and a controller, as shown in Figure 3 below. The
controller can be further split into a Decision control
algorithm and a four-state control FSM which
determines the advance of the deflection algorithm
and the signalling to/from the NI.
The input ports are each connected to an input
buffer. From the input buffers, the valid bit and the
UD, LR, and HC counters are extracted and
forwarded to the decision controller. The input
buffers are connected to three 4-to-1 muxes, one mux
Figure 2: NoC-NI-MicroBlaze Block Diagram.
Figure 3: Block diagram of the NoC Switch.
for each output port A, B, and R. The decision
controller investigates the priority of each message,
decides which packet should go to which output and
sets the select signal accordingly. A packet with both
UD and LR counter zero will be output on port R. A
packet with either UD or LR non-zero will be routed
to port A or B. A positive value means that the
packet should be sent downwards/rightwards in the
NoC. A negative value means that the packet should
be sent upwards/leftwards in the NoC. The UD and
LR counters are updated accordingly.
If no packet should be output, an empty packet
with the valid bit set to ‘0’ is inserted at the output
stream. If only one port has a valid packet, that
packet immediately gets its desired port. If two
packets are competing for a port, the one with highest
priority selects first. The default priority order is A,
B, and then R. In the case of port A and port B, the
port with the oldest packet selects first, i.e., the one
with the highest HC value is allowed to select an
output port first. If port A and port B have the same
HC, port A selects first.
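The routing and arbitration rules above can be sketched in C as follows. The type and function names are invented for this illustration, and the mapping of port A to vertical (UD) hops and port B to horizontal (LR) hops is an assumption; the paper only states that a packet with UD or LR non-zero is routed to port A or B.

```c
/* Sketch of the switch's routing decision and priority rule.
 * All names here are illustrative, not from the VHDL design. */
typedef enum { PORT_R, PORT_A, PORT_B } port_t;

typedef struct {
    int valid; /* valid bit                                        */
    int ud;    /* Up/Down counter: positive = down, negative = up  */
    int lr;    /* Left/Right counter: positive = right, negative = left */
    int hc;    /* Hop Counter, incremented at every switch visited */
} packet_t;

/* A packet with both counters at zero is delivered to the local
 * resource on port R; otherwise it continues through the mesh
 * (assumed here: A for vertical hops, B for horizontal hops). */
static port_t desired_port(const packet_t *p)
{
    if (p->ud == 0 && p->lr == 0)
        return PORT_R;
    return (p->ud != 0) ? PORT_A : PORT_B;
}

/* Arbitration between ports A and B for the same output: the older
 * packet (higher HC) selects first; on a tie, port A selects first. */
static int a_selects_first(const packet_t *a, const packet_t *b)
{
    return (a->hc != b->hc) ? (a->hc > b->hc) : 1;
}
```

The tie-break in favour of port A matches the default priority order A, B, R given above.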
Figure 4. Block Diagram of the NI
The switch controller is a four-clock-cycle FSM.
It sets five signals: the input and output buffers' load
enable signals, the Write_R and Read_R signals to
communicate with the NI, and the mux select signal.
The deflection algorithm calculates the switch decision
in states S0 and S1, so a new Select value is output in state S3
every four clock cycles. If a packet should be written
to the NI, the Write_R signal is set to 1 in states S2
and S3. If a packet has been read by the switch, the
Read_R signal is set to 1 during state S2, and the
Load_Enable signal to 1 during state S3.
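The per-state signal schedule described above can be summarised as a small C sketch; the struct and function names are invented here, and this is only a model of the described behaviour, not the VHDL implementation:

```c
#include <stdbool.h>

/* Model of the four-state switch controller's outputs (S0..S3,
 * one clock cycle each). Names are illustrative only. */
typedef struct {
    bool write_r;     /* a packet is being written to the NI     */
    bool read_r;      /* a packet has been read by the switch    */
    bool load_enable; /* input buffers load a new word           */
    bool select_out;  /* a new mux Select value is output        */
} sw_signals_t;

static sw_signals_t switch_signals(int state, bool pkt_to_ni, bool pkt_read)
{
    sw_signals_t s = { false, false, false, false };
    /* The decision is computed in S0 and S1; no signals are driven there. */
    s.write_r     = pkt_to_ni && (state == 2 || state == 3); /* S2 and S3 */
    s.read_r      = pkt_read  && (state == 2);               /* S2        */
    s.load_enable = pkt_read  && (state == 3);               /* S3        */
    s.select_out  = (state == 3);                            /* S3        */
    return s;
}
```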
The Network Interface
The Network Interface (NI) forms the link
between the switch and the resource. In the ABB
NoC, the NI translates FSL messages from the
MicroBlazes to the NoC and vice versa. It uses two
BRAMs together with their BRAM controllers, to
work as a read and write buffer for the NoC. The
read and write buffers are implemented as a two-ported shared memory, with one end connected to the
LMB bus of the MicroBlaze and the other to the NI.
The translation is done by two FSMs that work
independently of each other. One FSM translates
FSL messages from the MicroBlaze to NoC packets
(FSL2NoC) while the other stores incoming NoC
packets in the receiver BRAM (NoC2FSL). The
block diagram is shown in Figure 4.
The FSL2NoC FSM has five states and a delay
counter. The NoC2FSL FSM has four states to match
the four states of the switch. The two FSMs are
shown in Figure 5 and 6.
The NoC2FSL FSM waits in the idle state until
there is valid data on the NoC interface (indicated
by Write_R='1'). Upon valid data, the FSM goes to
the Setup Data state if the incoming message was a
data setup word; otherwise, it goes to the Write Data
state. In the Setup Data state, the Write Buffer
Address pointer for that sender is reset and an FSL
message for that buffer is prepared and stored. In the
Write Data state, the received data is stored at the
next free position in the BRAM. In case it was the
last data word for that message, an interrupt signal is
asserted, indicating that this write buffer has stored
its last data. Both the Setup Data state and the Write
Data state then go to the Wait_for_Write_R state,
where the FSM waits until the Write_R signal from
the NoC Switch goes low. When that happens,
Figure 5. NoC2FSL Interface FSM
Figure 6. FSL2NoC FSM
the Write Buffer Address is decremented by one and
the FSM then goes back to the idle state to wait for
the next data word.
The FSL2NoC FSM waits in the idle state until there
is data on the FSL Slave Interface. It then reads the
data and initiates a message transfer by sending a
Setup Word to the intended target. Sending involves
waiting in the Wait_for_Read_R state until the NoC
Switch sends a Read_R acknowledging that it has
read the data. The FSM then goes to the delay state to
wait twelve clock cycles. The delay is used to guarantee
that packets arrive in order at the receiving end. It
then proceeds to read another data word from the input
BRAM. When all packets have been sent, the sender
goes back to the idle state.
The NI initiates the transfer and sends one data
word every 16th clock cycle until all data has been
transferred. For a 2-by-2 switch, this means that all
packets will arrive in order, since a deflected packet
will have enough time to arrive at the destination
before the next packet arrives. Thus, Quality-of-Service (QoS) is maintained. For a larger network,
this property does not hold, and a re-ordering
mechanism would have to be implemented.
3. Protocol Stack
The protocol stack for the ABB NoC is outlined
in Figure 7. Application programs call device driver
functions that set up a communication message. The
Network Interface then sets up a communication link
with the receiving node and starts to transmit the
data. Data is transmitted over the NoC, one flit at a
time.
Figure 7. ABB NoC Protocol Stack (layers, top to bottom: Application Program, Device Drivers, Network protocol, Switch protocol)
Device Driver prototypes
To access the NoC services, four C-language
primitives have been developed that can be used as
device drivers to the NoC. They are named:
• int flush_buffer_to_noc(coord node, int buffer, int num_words)
• int get_noc_msg(void)
• int get_write_buffer(int val)
• int get_recv_buffer(int val)
The data type coord is the absolute coordinate of
the target node in the system, e.g., the value {0,0}
corresponds to node 0, {0,1} to node 1, etc. It is
defined as

typedef struct {
    int row;
    int col;
} coord;
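A hypothetical calling sequence for these primitives might look as follows. The driver bodies are not given in the paper, so the stubs below only mimic the documented behaviour; buffer sizes, return conventions, and the setup-word encoding in the stub are assumptions. The paper declares the buffer getters as returning int (a raw address on the 32-bit MicroBlaze); pointers are used here so the sketch compiles on a host machine.

```c
/* Stubbed driver sketch; everything below is illustrative. */
typedef struct { int row; int col; } coord;

static int wr_buf[2][512];  /* stand-ins for the two write buffers    */
static int rd_buf[4][512];  /* stand-ins for the four receive buffers */
static int last_setup;      /* stand-in for the word on the FSL link  */

int flush_buffer_to_noc(coord node, int buffer, int num_words)
{
    /* Real driver: blocking write of a setup word to FSL port 0.
     * This encoding loosely mirrors Figure 9 and is an assumption. */
    last_setup = (node.row << 12) | (node.col << 10)
               | (num_words << 1) | (buffer & 1);
    return num_words;
}

int get_noc_msg(void)  { return last_setup; } /* real driver: blocking FSL read */
int *get_write_buffer(int val) { return wr_buf[val & 1]; }
int *get_recv_buffer(int val)  { return rd_buf[val & 3]; }

/* Typical send: fill write buffer 0, then flush 16 words to node {0,1}. */
int send_example(void)
{
    coord target = { 0, 1 };
    int *buf = get_write_buffer(0);
    for (int i = 0; i < 16; i++)
        buf[i] = i;                     /* message payload */
    return flush_buffer_to_noc(target, 0, 16);
}
```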
The MicroBlaze processor in the resource uses
the Fast Simplex Link protocol [12] to communicate
with the NI. In addition, the MicroBlaze is connected
to two two-port Block RAMs (BRAMs) [11], whose
other ports are connected to the NI, and which are
used as a read and a write buffer, respectively. To
avoid confusion, an explanation of the naming
convention is in order. The BRAM that the NoC
reads from is written by the corresponding MicroBlaze
resource. Thus, from the software point of view it is a
Write Buffer, whereas from the NoC point of view it
is a Read Buffer. The same goes for the NoC Write
Buffer, which is viewed as a Read Buffer by the
software.
The function flush_buffer_to_noc() takes the
MicroBlaze write buffer number (0 or 1) and the
length of the message in number of 32-bit words as
arguments. It sets up an FSL message to the NoC and
initiates the transfer by issuing a blocking write to the
FSL.
The function get_noc_msg() performs a
blocking read from the FSL port. It returns a value
that the NoC NI has put on the FSL-link. The layout
of the received message is shown in Figure 8.
Bit index   Bit description
0-20        Not Used
21-22       Sender ID
23-31       Length of Message
Figure 8: Bit interpretation of NI-to-NoC data word
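Decoding this word can be sketched as below. MicroBlaze documentation numbers bits big-endian (bit 0 = MSB); assuming Figure 8 follows that convention, "bits 23-31" are the nine least-significant bits of the word. The function names are invented for this example.

```c
#include <stdint.h>

/* Field extraction for the FSL word of Figure 8, assuming
 * big-endian bit numbering (bit 0 = MSB). */
static inline int msg_sender_id(uint32_t w) { return (int)((w >> 9) & 0x3u); }
static inline int msg_length(uint32_t w)    { return (int)(w & 0x1FFu); }
```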
The function get_write_buffer() takes the write
buffer number (0 or 1) as a parameter and returns a
pointer to the base address in the memory of the
corresponding write buffer.
The function get_recv_buffer() takes the recv
buffer number (0, 1, 2, or 3) as a parameter and
returns a pointer to the base address in the memory of
the corresponding receive buffer.
Network protocol
A write to the NI from the MicroBlaze processor
is interpreted according to Figure 9 below. The NI
takes the target's absolute position, counted from the
zero reference – the Upper-Left (UL) node in the
upper left corner – and converts it to a relative address
that can be used by the switch by subtracting its own
position in the network from the target's address. The
Length of Message is extracted and stored in a
counter. The counter, together with the message
buffer number, forms the offset in the address space used
to retrieve data from the send buffer.
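The absolute-to-relative conversion performed by the NI amounts to a coordinate subtraction; in C, using the coord type from the device drivers (the function name is invented here):

```c
/* The NI subtracts its own coordinates from the target's to obtain
 * the relative address used by the switch. A positive component
 * means down/right, a negative one up/left. */
typedef struct { int row; int col; } coord;

static coord to_relative(coord target, coord own)
{
    coord rel = { target.row - own.row, target.col - own.col };
    return rel;
}
```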
Bit index   Bit description
0-17        Not Used
18-19       Target UpDown (UD) absolute position
20-21       Target LeftRight (LR) absolute position
22-30       Length of Message
31          Message Buffer Number
Figure 9: Bit interpretation of NoC message setup command word
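Building the setup command word of Figure 9 can be sketched as follows. The figure's bit positions are assumed to be big-endian (bit 0 = MSB), so bit 31, the Message Buffer Number, is the word's least-significant bit; the function name is invented for this example.

```c
#include <stdint.h>

/* Pack the Figure 9 setup command word, assuming big-endian bit
 * numbering (bit 0 = MSB, bit 31 = LSB). */
static inline uint32_t noc_setup_word(int ud, int lr, int len, int buf)
{
    return ((uint32_t)(ud  & 0x3)   << 12) |  /* bits 18-19: UD position */
           ((uint32_t)(lr  & 0x3)   << 10) |  /* bits 20-21: LR position */
           ((uint32_t)(len & 0x1FF) <<  1) |  /* bits 22-30: length      */
            (uint32_t)(buf & 0x1);            /* bit  31:    buffer no.  */
}
```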
The NI responds by sending a message frame of
data, outlined in Figure 10 below. The message
frame in the ABB NoC is composed of 32-bit words:
two setup words plus the data words.
Word                     Bit positions   Contents
Word 0                   0-14            Not used
                         15-22           Global Clock bits 40 to 32
                         23-31           Message length+1
Word 1                   0-31            Global Clock bits 31 to 0
Word 2                   0-31            First Data Word (stored at Msg Length-1)
…Word Message length+1   0-31            Last Data Word (stored at buffer position 0)
Figure 10: The NoC Message Frame
The first data word that is transmitted is a setup
message that initializes the receiving node with the
number of data words to expect plus the first nine
bits of the global clock. The second word is the
remaining 32 bits of the global clock. The Global
Clock is a 41-bit counter that counts microseconds
since the last reset. It counts 12.75 days before it
wraps around. It is appended to all messages that are
sent in the NoC. It is used to keep track of when
things happen, in case something goes wrong.
Since the global clock is added to all messages, a
local resource can always retrieve it by sending a
zero-length message to itself. After the global clock
has been transmitted, the message itself is
transmitted, MSWord first, LSWord last.
The NI uses the sender Source ID to determine
which write buffer should be used to save the
incoming message. Write buffer 0 (ID 00) has the
address 0x0000A000, Write buffer 1 (ID 01)
0x0000A800, Write buffer 2 (ID 10) 0x0000B000,
and Write buffer 3 (ID 11) 0x0000B800. The
maximum length of a message is 2040 bytes. The
remaining eight bytes (two words) are used for the
setup word and the global clock word that are
appended to every message.
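The buffer placement above is a simple stride: the four buffers are 0x800 bytes (2 KiB) apart, starting at 0x0000A000 and indexed by the 2-bit Source ID, and 2040 payload bytes plus the two reserved words fill one 2 KiB buffer exactly. A sketch (the function name is invented here):

```c
#include <stdint.h>

/* Base address of the write buffer assigned to a 2-bit Source ID:
 * 0x0000A000 + ID * 0x800 (2 KiB per buffer). */
static inline uint32_t write_buffer_base(int source_id)
{
    return 0x0000A000u + (uint32_t)(source_id & 0x3) * 0x800u;
}
```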
Switch protocol
For every data word, or flit, from the
MicroBlaze, the NI adds a header. The header is 11
bits long and the data is 32 bits long. The header is
interpreted by the switch to determine the packet's
destination and how it should be routed. The header
is shown in Figure 11 below. The receiving NI uses
only two fields from the header: the data type field
and the message source field. The rest of the header
is discarded upon arrival.
Bit position   Interpretation
10-9           Data type
8-7            Source ID
6-4            Hop Counter (HC)
3-2            Up/Down Counter (UD)
1-0            Left/Right Counter (LR)
Figure 11: NoC Header contents
The data type field is two bits wide. The first bit
is the data valid bit, which is used by the switch to
determine whether the data is a valid packet or not.
The NI uses the second bit in the data type field to
determine the type of data: a '1' indicates a message
setup data field, and a '0' indicates a normal data
field. During Word 0, the NI transmits Setup Data.
During the rest of the transmission, the NI transmits
Normal Data (the second word of the global clock is
handled as normal data).
The second field, the message source field, is
used by the NI. It indicates which data source the
message comes from and to which write buffer the
NI should save the data. This field is needed since
multiple transmissions from different sources can
occur simultaneously. To ensure that messages do
not get scrambled in the case of multiple simultaneous
incoming messages, each source gets its own input
queue.
The Hop Counter (HC) field is used by the
switch to determine the priority of the messages. It is
incremented by one for every switch that the data has
visited. Thus, the older a message is, the higher its
priority. This mechanism ensures that data does not
get stuck in the NoC in the case of contention.
The Up/Down (UD) and the Left/Right (LR)
counter fields are the relative address fields.
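Packing and unpacking the 11-bit header of Figure 11 can be sketched as below. Here bit 10 is taken as the header's most-significant bit, matching the order in which the figure lists the fields; the function names are invented for this example.

```c
#include <stdint.h>

/* Pack the 11-bit switch header of Figure 11 (bit 10 = MSB). */
static inline uint16_t noc_header(int dtype, int src, int hc,
                                  int ud, int lr)
{
    return (uint16_t)(((dtype & 0x3) << 9) |  /* bits 10-9: data type   */
                      ((src   & 0x3) << 7) |  /* bits  8-7: source ID   */
                      ((hc    & 0x7) << 4) |  /* bits  6-4: hop counter */
                      ((ud    & 0x3) << 2) |  /* bits  3-2: UD counter  */
                       (lr    & 0x3));        /* bits  1-0: LR counter  */
}

/* Field extraction, e.g. for the switch's priority comparison. */
static inline int hdr_hop_count(uint16_t h) { return (h >> 4) & 0x7; }
static inline int hdr_source(uint16_t h)    { return (h >> 7) & 0x3; }
```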
4. Experiments and Results
Simulation, Validation and Prototyping
In order to verify the functionality, two types of
simulations were performed: one functional
simulation of the NoC and its interfaces, and one
structural simulation of the whole design. An iterative
bottom-up verification methodology was used. First the
NoC itself was verified, then the Network-Interface
was added and simulated.
In the second phase, the NoC was verified
together with its resources in a structural VHDL
simulation of the whole system. The simulation ran
the first 40 µs of the system at start-up. The GPIO
signals were used to "print" software status messages
in the simulation window. This was necessary to
verify that the device drivers worked properly
together with the MicroBlazes.
After debugging, the system worked
satisfactorily within the given parameters. The system
was downloaded to a prototype board to verify that it
worked in accordance with the structural simulations.
Synthesis
From a synthesis point of view, several
experiments were conducted. The goal was to make
the design as fast as possible and achieve at least
100 MHz on a Xilinx FPGA. The NoC itself was
synthesized first for a Virtex2P with 1152 pins. It
gave an area of 2606 slices. The estimated frequency,
however, was only 94.85 MHz, even after several
rewritings of the code to make it run faster. Since the
number of I/Os for the NoC itself is also more than
an FPGA can accommodate, it was not possible to
get a proper timing estimate after place & route.
Instead, a simple prototype system, composed of
four MicroBlaze resources, each connected to its own
GPIO interface, was designed and synthesized. The
system was tested on a prototype board, running at
80 MHz, and was found to work satisfactorily.
However, speed was still a problem. After some
investigation, the cause of the problem turned out to
be the pinning of the prototype board. The prototype
board has fixed pinning for its I/Os, so if all
resources should have at least two pins connected to
the LEDs on the prototype board, it is impossible to
achieve a higher speed, since at least one of the
resources would be placed at the other end of the
chip compared to the location of the pin connected to
the LED.
With careful placement of I/Os, it was possible
to meet the timing constraints and achieve 100 MHz.
It was also found that this timing was the same as if
the design had been synthesized without any
placement constraints on the pinning at all (auto-placement of pins). The placement constraints were
therefore removed.
Table 1: Area and Speed for 100 MHz constraint

         Virtex2P (vp2xc30)                   Virtex4 (v4fx60)
         Area                 Speed           Area                 Speed
NoC 2x2  2606 Slices (19%)    10.543 ns /     2575 Slices (10%)    8.525 ns /
         4507 LUTs (16%)      94.85 MHz       4451 LUTs (8%)       117.3 MHz
System   6173 Slices (45%)    9.990 ns /      6389 Slices (25%)    9.765 ns /
         10037 LUTs (36%)     100.1 MHz       10554 LUTs (20%)     102.4 MHz
The system was then synthesised for the Virtex4
technology from Xilinx. The results of the synthesis
are shown in Table 1. The NoC shrank slightly, to
2575 slices, and the speed improved significantly, by
24%.
To find out the highest obtainable speed, the
system was synthesized with several different speed
constraints. The results are summarised in Table 2.
For a Virtex2P system, the maximum speed was
111 MHz and for a Virtex4 system 125 MHz.
Table 2: Max Speed

         Virtex2P (vp2xc30)     Virtex4 (v4fx60)
System   8.982 ns / 111.3 MHz   7.996 ns / 125.1 MHz
Since the NI emits a packet every 16th clock
cycle, and data is 32 bits wide, the maximum
communication speed is two bits per clock cycle, i.e.,
for a clock speed of 100 MHz, the transmission rate
is 200 Mbps. This is twice the speed of the Ethernet
connections in the original design. For the 111 MHz
and 125 MHz implementations, the communication
speed is 222 Mbps and 250 Mbps, respectively.
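The bandwidth figures above follow from a one-line calculation: one 32-bit word every 16 clock cycles is 2 bits per cycle, i.e. 2 × f_clk bit/s. As a sketch (the function name is invented here):

```c
/* Effective NoC bandwidth: one 32-bit word per 16 clock cycles. */
static long long noc_bandwidth_bps(long long clock_hz)
{
    const long long bits_per_word = 32, cycles_per_word = 16;
    return clock_hz * bits_per_word / cycles_per_word; /* 2 bits/cycle */
}
```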
5. Summary and Conclusion
The ABB NoC is a 2-by-2 mesh NoC. It uses
deflection routing as its switch protocol and is
targeted for implementation on Xilinx FPGAs. It
connects four MicroBlaze processors that together
implement an area- and timing-critical real-time
embedded system. Each MicroBlaze is connected to
the NoC via a two-way FSL interface together with
two two-ported BRAMs that implement a shared
memory between the NoC and the resource. The
shared memories function as a read and a write
buffer, respectively.
The NoC transmits messages that are at most
2040 bytes, i.e., 510 words, long. The remaining
eight bytes are reserved for the global clock. The
global clock counts microseconds since the last reset.
It makes it possible to trace messages in case
something should go wrong.
Data words are transmitted every 16th clock
cycle. Every switch in the NoC has an FSM
controller that switches a data word every 4th clock
cycle. The receiving Network-Interface is capable of
receiving message data in synchrony with the switch,
i.e., every 4th cycle. A transmitted data word is 32
bits long. Thus, the maximum communication speed
is two bits per clock cycle, i.e., for a 100 MHz clock
system the transmission rate is 200 Mbps.
The NoC occupies 2600 slices and runs at 111
MHz on the Virtex2P and at 125 MHz on the Virtex4
technology.
The NoC fulfils the basic needs of ABB, i.e., it
provides a standardized interface with a predictable
behaviour that a resource IP designer can take into
account when designing the IP. The size of the NoC
was somewhat large, not in terms of logic area, but in
terms of the memory needed to implement the
buffers. The way the buffers were implemented
limits the scalability of the system. For larger NoCs,
a buffering scheme that occupies less memory is
needed, perhaps the custom FIFO solution advocated
in [14], together with a sorting unit to avoid
re-ordering of packets. ABB is also interested in
seeing how the NoC structure can be used to
implement fault-tolerant designs. This will be done in
future work.
6. References
[1] W. J. Dally and B. Towles, Route Packets, Not
wires: On-chip Interconnection Networks, In Proc. of DAC
2001, June 2001.
[2] M. Sgroi et al., Addressing the System-on-a-Chip
Interconnect Woes Through Communication-Based
Design, In Proc. DAC 2001, June 2001.
[3] S. Kumar, A. Jantsch, J-P. Soininen, M. Forsell, M.
Millberg, J. Öberg, K. Tiensyrjä, and A. Hemani. A
network on chip architecture and design methodology. In
Proceedings of IEEE Computer Society Annual Symposium
on VLSI, April 2002.
[4] E. Nilsson. Design and implementation of a hot-potato switch in a network on chip. Master's thesis, Royal
Institute of Technology (KTH), Stockholm, Sweden, June
2002. IMIT/LECS/2002-11.
[5] M. Millberg. The nostrum protocol stack and
suggested services provided by the nostrum backbone.
Technical Report TRITA-IMIT-LECS R02:01, ISSN 1651-4661,
ISRN KTH/IMIT/LECS/R-02/01-SE, LECS, ECS,
Royal Institute of Technology, Sweden, 2002.
[6] E. Nilsson, M. Millberg, J. Öberg, and A. Jantsch.
Load distribution with the proximity congestion awareness
in a network on chip. In Design, Automation and Test in
Europe Conference (DATE 2003), Munich, Germany,
March 2003.
[7] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A.
Jantsch. The nostrum backbone - a communication protocol
stack for networks on chip - poster. In Proceedings of the
International Conference on VLSI Design, January 2004.
[8] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch.
Guaranteed bandwidth using looped containers in
temporally disjoint networks within the nostrum network
on chip. In Proceedings of DATE'04, February 2004.
[9] Z. Lu, R. Thid, M. Millberg, E. Nilsson, and A.
Jantsch. NNSE: Nostrum network-on-chip simulation
environment. In Proceedings of the Swedish System-on-Chip Conference (SSoCC'05), April 2005.
[10] LMB BRAM Interface Controller (v1.00.a),
DS452, Xilinx Inc., 2004.
[11] Block RAM (BRAM) Block (v1.00.a), DS444,
Xilinx Inc., 2004.
[12] Fast Simplex Link (FSL) Bus (v2.00.a), DS449,
Xilinx Inc., 2004.
[13] MicroBlaze Processor Reference Guide, UG081
(v5.0), Xilinx Inc., 2005.
[14] K. Goossens, J. Dielissen, and A. Radulescu,
"Æthereal Networks on Chip: Concepts, Architectures and
Implementations", IEEE Design and Test of Computers,
vol. 22, no. 5, pp. 414-421, September-October 2005.
Track A – Industrial
Track A features presentations with a focus on industrial applications. The
presenters were selected by the Industrial Programme Committee.
8 presentations; see the next three pages.
For changes of address or text, please contact Lennart Lindh.
Open source within hardware
Johan Rilegård, +46 70 824 80 30, Email: [email protected]
<mailto:[email protected]>
The use of open source hardware IP to develop SoCs has increased rapidly and is following the huge
growth within the software area. The use of IPs from OpenCores in big commercial designs, implemented
in FPGAs or ASICs, approximately doubles each year. The IPs have been quite popular from the start
(1999), but during the last 12 months the interest has exploded, and they are now being evaluated in all
kinds of applications (from low-volume products all the way up to the mobile product industries).
OpenCores is right now working to upgrade the site/community and adding quite a lot of features which
will lower the threshold for starting to use these IPs even further.
The basic aspects driving this growth are of course all the benefits a company gains by owning its
design to 100% (no license fees, fewer "end of life" problems, technology independence), in combination
with the fact that the technology is mature, very well verified, and used in many different applications.
This gives the needed "trust" in the technology to increase the ongoing avalanche effect. Looking into
the future, the growth of open source IPs will increase even further, and within a few years we truly
believe that technologies based on open source will be just as common as commercial technologies.
To speak about:
• History of OpenCores (HW open source IP)
• Advantages and disadvantages of using open source IPs
• What is still missing to make it the number one used technology?
• Open source HW development and communities like OpenCores
• What will happen in the next 12 months, and what will it look like in 36 months?
Open and Flexible Network Hardware
[email protected]; Florian Fainelli
<[email protected]>, Xavier Carcelle
<[email protected]>, Etienne Flesch
<[email protected]>
The latest developments of home gateways, DSL boxes, and wireless routers have brought a fair
amount of network hardware to the home. These developments have been followed by several initiatives
of "hacking the box" to be able to use open-source firmware. The next step is to provide a device
with completely open and flexible hardware.
OpenPattern is developing such hardware based on the inputs from the past initiatives, using an FPGA as
the central component (SoPC) to achieve the "open hardware" goal at the IP core level.
Verification – reducing costs and increasing quality
Jan Hedman <[email protected]>
Many industry professionals tell us about their increasing problems with the verification of large
FPGAs/ASICs: despite investing considerable resources, projects still go to market with errors, causing
delays and costs for recalling and correcting faulty products. Verification costs are increasing, but
existing techniques are just not finding all the errors.
In this presentation we will show how you can improve your existing verification processes by using
system-level verification techniques and a Model-Based Design approach. Models developed in MATLAB
and Simulink can be reused not only for simulation and code-generation, but also for verification. We will
show how these improvements can be integrated with an existing verification flow.
Large scale real-time data acquisition and signal
processing in SARUS
Hans Holten-Lund <[email protected]>
SARUS is a Synthetic Aperture Real-time Ultrasound Scanner, made in cooperation between CFU (Center
for Fast Ultrasound Imaging) at DTU Elektro and Prevas A/S. The system consists of 64 Digital Acquisition
and Ultrasound Processing (DAUP) boards, each with five Xilinx Virtex 4 FX FPGAs, for a total of 320
FPGAs. The FPGAs communicate using the LocalLink and Aurora protocols over 3.125 Gbps MGT links,
both for chip-to-chip and board-to-board communication, with network switching inside the FPGAs.
Each board features 16 DAC and 16 ADC channels at 12 bits and a 70 MHz sample rate, for a total of
1024 channels. For ultrasound processing, 8192-point FFTs are used for filtering, after which a tile-based
6-way parallel beamformer focuses ultrasound frames at 4000 Hz, resulting in high-resolution frames at
40 Hz.
The presentation will touch on many of the design challenges in such a large-scale system, e.g. DDR2
memory controllers, accessing register files from a PowerPC on a different FPGA, LVDS SERDES links,
Aurora MGT links, DSPs, clock distribution, etc.
Drive on one chip
Lang, Fredrik (EBV) [[email protected]]
EBV Elektronik has now completed its fifth reference design, called MercuryCode, which is based on an
FPGA from Altera®'s Cyclone® III family for universal industrial applications.
Thanks to the appropriate software and drivers, implementation of the user system is practically
pre-programmed.
The board is designed for use with the Nios® II processor and the Microsoft .NET Framework, while
supporting various industrial I/O standards such as CAN, USB, RS485, RS232 and 24 V I/O for direct
connection to the industrial automation world.
A total of ten partner companies currently provide support for all common field bus systems on the
market. This enables developers to try out all major field buses available on the market with
MercuryCode. This is not possible with traditional boards, since these are manufacturer-specific and no
manufacturer allows a competitor's IP code to be implemented on its own boards. As EBV Elektronik is
independent, none of the partners has a problem with porting its IP cores to MercuryCode. For the end
customer, there is another advantage: if the end customer has to change the field bus system, it is
possible to do so using the existing hardware.
The FalconEye reference design platform was developed to aid in the design process for controlling
and/or regulating brushless synchronous and induction motors.
With FalconEye + MercuryCode, EBV Elektronik is offering a complete reference design for motor control
on the basis of the Nios II core.
Standard architecture for typical remote sensing micro
satellite payload
AYMAN.M.M.AHMED, Egyptian Space Program, National Authority of Remote sensing and space
science (NARSS), Cairo, Egypt
Development of EGYPTSAT-1 was one of the most challenging projects, requiring solutions to many
technical problems arising from the implementation of relatively high-resolution imagers on a
micro-platform. One of these was the development of the payload computer, which has to carry out many
functions in a very harsh environment (temperature variation, mechanical vibration, and pressure) and
under many limitations. The payload computer was developed using an FPGA to carry out imager control,
imager synchronization, data compression, data packing, and data storage. The paper discusses the
payload computer architecture and design difficulties using the Virtex family of Xilinx FPGAs.
World's first mixed-signal FPGA
Hosseinalikhani, Rouzbeh [[email protected]]
The world's first mixed-signal FPGA, Fusion, saves you money and time to market. Join this session to
learn more about how we have used Fusion to create a highly integrated control for the Power Module
with over 50 analogue real-time measurements, a microprocessor, as well as several interfaces. We will
also show how Fusion solves critical integration issues on the Advanced Mezzanine Card (AMC) for the
TCA/uTCA standard.
Analog Netlist partitioning and automatic generation of
schematic
Ashish Agrawal <[email protected]>
An automatic schematic generator for a netlist (analog, gate level) plays a significant role in today's
scenario in making the user aware of the implementation of his RTL design. A netlist generated
post-RTL lacks any major designer intervention, as various automatic tools convert RTL to sign-off. The
automatic schematic generator needs to do a good job of showing a netlist, but there are various
challenges in visually showing such a netlist:
• The flat transistor-level netlist lacks the knowledge of how various transistors are grouped to form
logic.
• Showing the netlist in a human-comprehensible form; the visual diagram should be close to a circuit
hand-drawn by an electrical engineer.
• Placing these components symmetrically and optimally relative to other components to obtain optimal
routing.
The paper will focus on partitioning and grouping algorithms and on placement and routing techniques
to achieve a "good" schematic.
Organizer: Lena Engdahl,
[email protected]
The winner: FPGA Hero from Iha
9/17/2008
FPGA Hero
Guitar Hero in VHDL
Overview
Introduction
Project introduction
System introduction
SOPC
Graphics
Guitar
Music
Why?
Show the power of fast prototyping in SOPC
A project with a lot of technological possibilities
Use VHDL for a non-traditional purpose
Who?
Engineering College of Aarhus
Paul Martin Bendixen
Morten Stigaard Laursen
Henrik Hagen Rasmussen
Kim Arild Steen
How?
SOPC for the system layout (NIOS II, Avalon bus etc.)
Self-made VHDL components
IPs for the remaining components
What?
Decoding of MIDI files
Interfacing with PS2 guitar controller
Interfacing with VGA
Communication via the Avalon bus
Interfacing with audio codec and SD card
NIOS II
System Components
VGA
Audio
MIDI decoding
Controller interface
SD Card
Filesystem
Gameplay
Score update
Hit or miss a note
VGA and Gameplay
25 MHz pixel clock
Strict timing requirements
Few states
Ideal for HDL implementation
Audio
Timing requirements (I2S bus)
More complex initialization (I2C bus)
High complexity for init, low complexity
for run mode
Can be split between HDL and software
HDL part taken from the demonstration example
Controller interface
Short pulses
Low bandwidth
Few states
Chose HDL instead of IRQ
SD Card
Fast clock (up to 50 MHz)
Loose timing requirements
To get high throughput, HDL is a must
We chose software because of an existing library (demonstration system for the DE1 board)
Filesystem
Using an existing GPL library called DOSFS
SOPC Overview
[Block diagram] The NIOS II CPU (master) is connected through the Avalon bus to five slaves: the SDRAM interface, the SD card interface, the controller interface, the GUI interface and the audio codec interface.
Software Overview
[Flowchart] Initialize the audio, SD card and filesystem. When the 'Start' key is pressed on the guitar: parse the MIDI file to RAM, load the WAVE to RAM, reset the score and the song position. Main loop: if the audio FIFO is not full, send the next sample to the FIFO; on a strum, read the note buttons; if the note was hit, increment the score; send the next MIDI note to the GUI; repeat until the end of the song.
The real Guitar Hero Graphics
Our simplification
Simple graphics
Square notes
No background
No perspective
Hit and miss are
shown by changing
the color
White for hit
Magenta for miss
Output
Format: VGA
Resolution: 640 x 480
Frame rate: 60 Hz
Connected to a 4-bit DAC.
Properties of VGA
Pixel order
Blanking
Synchronization
Pixel frequency
VGA timing
Hit, miss and score
Hit and miss detection
Built into the graphical part
Score
IP: CharROM
Showing Hex score
VGA Interface
Output
4-bit DAC resistor network for R, G and B
Hsync
Vsync
Input
Avalon
Input interface
Avalon
Guitar Interface
SPI
Standard SPI
Extensions
ACKnowledge
Command communication
Polls via SPI
Receives actual button presses
Is able to send 8-bit data
Changing Modes
Guitar will not work in standard mode
Change to Analog mode
Sending several commands
Decoding MIDI files
Contains info about the music
Only guitar notes are used
Thanks to the ”Frets on Fire” project
Questions
Demo
2008-09-17
Nintendo Entertainment System on a chip
Patrik Eliardsson
Per Karlström
Richard Eklycke
Ulrika Uppman
Agenda
About NES
Hardware Overview
Video
Picture Processing Unit (PPU)
VGA Output
Sound
Audio Processing Unit (APU)
Sound Controller
CPU
Testing
Performance
Final words
Linköpings universitet
Background of the Project
Nintendo Entertainment System (NES)
A video game console
Entered the market on 15 July 1983
Widely popular
Over 700 game titles released
Project Goal
Creating a NES on an FPGA
Use original games and controllers
Develop our hardware development skills
System Overview
Difficulties Designing a NES
Different clocks for each subsystem
Communication between the subsystems
No official documentation
Most documents are for/by software emulator implementers, with no hardware ”thinking”
Video Subsystems
Picture Processing Unit (PPU)
Introduction of Video RAM (VRAM)
Overview of the PPU
Playfield / Background
Sprites
Interfacing with the PPU
VGA Output
Buffering of the video signal
Pixel sizing
Picture Processing Unit (PPU)
- VRAM Introduction
Pattern Tables (tiles for background and sprite)
16 bytes for each tile
256 tiles
Name Tables (index pointer to pattern table)
32x30 gives a resolution of 256x240
Attribute tables (1 for each name table; colour key for each 2x2 tile group)
Color Palettes (1 for background, 1 for sprites)
13 colours, keyed by 2 bits from the attribute table and 2 bits from the pattern table: the 4-bit colour index is formed by the attribute-table bits (upper pair) and the pattern-table bits (lower pair).
Picture Processing Unit (PPU)
- VRAM Introduction
Sprites
256 byte sprite memory
4 byte / sprite, gives 64 sprites
Y-coordinate
Pattern table index
Attribute bits (horizontal flip, vertical flip)
X-coordinate
PPU – System Overview
PPU – Scanline Details
PPU – Frame Details
PPU – Fetch Background Tiles
PPU - Registers
$2000 Control Register 1 (W)
$2001 Control Register 2 (Mask Register) (W)
$2002 Status Register (R)
$2003 SPR-RAM Address Register (W)
$2004 SPR-RAM Data Register (RW)
$2005 PPU Background Scrolling Offset (W2)
$2006 VRAM Address Register (W2)
$2007 VRAM Read/Write Data Register (RW)
VGA
Buffering of PPU output data
Using SRAM
Continuous reading and writing
Different read/write speed
SRAM controller
Output to monitor
from 256x240 to 640x480
Read line twice
5 cycles per pixel at 50 MHz
”Stretched” pixels
Sound
Audio Processing Unit
Background
Overview
Sample of subsystem – Square channel
Interface
Sound Controller
Background
Overview
Configuration
Audio Processing Unit (APU)
- Introduction
5 channels
2 square channels
1 triangle channel
1 noise channel
1 delta modulation channel
Status register
Audio Processing Unit (APU)
- System Overview
Audio Processing Unit (APU)
- Square Channel
Sweep
Timer/2
Sequencer
Envelope
Length
DAC
Audio Processing Unit (APU)
- Mixer
Some mathematical formulas involving division and addition
Audio Processing Unit (APU)
- Controlling the Channels
$4000 ddle nnnn
Waveform shape
$4001 eppp nsss
Frequency sweep
$4002 pppp pppp
Timer period
$4003 llll lppp
Length index and timer period
Sound Controller
[Block diagram] Inside the FPGA, the APU feeds 8-bit samples to a send unit, which streams 24-bit samples over the data interface to the CODEC on the audio board; a control unit configures the CODEC over I2C.
6502 CPU
Background
Implementation
Memory map of the system
Writing software
6502 CPU
- Background
Released in 1975
A modified 6502, called the 2A03, was used in the NES
8-bit data bus, 16-bit address bus
1 accumulator register (A), 2 index registers (X, Y)
2 interrupts
Non-maskable interrupt (NMI)
Interrupt Request (IRQ)
6502 CPU
- Implementation
For now, we use a free implementation of the 6502
Free6502
Thank you!
Timing problems with Free6502
Some instructions take too many cycles to execute
Own rewrite of the 6502
6502 CPU - Memory Map

Address Range    Size         Notes (page size is 256 bytes)
$0000 - $00FF    256 bytes    Zero page
$0100 - $01FF    256 bytes    Stack memory
$0200 - $07FF    1536 bytes   RAM
$0800 - $0FFF    2048 bytes   Mirror of $0000-$07FF
$1000 - $17FF    2048 bytes   Mirror of $0000-$07FF
$1800 - $1FFF    2048 bytes   Mirror of $0000-$07FF
$2000 - $2007    8 bytes      Input/output registers (PPU)
$2008 - $3FFF    8184 bytes   Mirror of $2000-$2007 (multiple times)
$4000 - $401F    32 bytes     Input/output registers (APU, controllers)
$4020 - $5FFF    8160 bytes   Expansion ROM
$6000 - $7FFF    8192 bytes   Save RAM
$8000 - $BFFF    16384 bytes  PRG-ROM lower bank - executable code
$C000 - $FFFF    16384 bytes  PRG-ROM upper bank - executable code
$FFFA - $FFFB    2 bytes      NMI handler address
$FFFC - $FFFD    2 bytes      Power-on reset handler routine address
$FFFE - $FFFF    2 bytes      Software interrupt handler routine address
Performance
A complete NES including PPU, APU, CPU
DAC interface
VGA interface
SRAM interface
Program and video pattern ROMs for demonstration
Memory usage: 207031 bits (86%) (... bits for demo ROMs)
Logic usage: 3681 LE (20%)
Final Words
Future for the project
Integrate cartridge
Support hand controllers
HQ2X graphic filter
Develop our own set of games!
Save screen or game state
We would like to thank:
Altera
The Department of Electrical Engineering, Linköping University
Questions?
www.liu.se
Altera Innovate Nordic Competition
Analog Modeling Synthesizer
FPGA SOPC implementation
Arnaud Taffanel, Peyman Pouyan
Teacher: Professor Lambert Spaanenburg
Lund Institute of Technology
Analog Modeling Synthesizer: Arnaud Taffanel, Peyman Pouyan
Motivation
• To produce an analog-style sound and to simulate the analog user experience by allowing any parameter of the sound generation path to be changed in real time.
Outline
• Introduction
  • Music synthesizers
  • History of the analog synthesizer
• System Implementation
  • NIOS II system
  • Sound stream
• Software Implementation
  • Project mapping
  • µC/OS-II
• Conclusion & results
Music Synthesizers
Analog synthesizer = subtractive synthesizer
• Subtractive synthesis starts with a harmonically rich waveform (such as a square or sawtooth wave) and filters out the unwanted spectral components.
Subtractive Synthesis
• Basic subtractive synthesizer parts:
  • Oscillators
  • Filters
  • Voltage-controlled amplifiers (VCAs)
  • Envelope generators, such as an ADSR
  • LFOs (low-frequency oscillators)
Modular Analog Synthesizer
• Highly configurable
  • Completely manual interconnection
  • One interconnection configuration is called a patch
• Heavy
• Very expensive
Next generation Analog Synthesizer
• Partially pre-cabled
  • Electronic elements have a fixed place in the circuit
• Less heavy and less expensive
Digital analog modeling Synthesizer
• Appeared in the 90s
• Very light
• Cheaper
Our Project
• Main target: an affordable, high-performance analog modeling synthesizer.
User view on schematic
System Implementation
• Introduction
• System Implementation
  • NIOS II system
  • Sound stream
• Software Implementation
  • Project mapping
  • µC/OS-II
• Conclusion & results
Board level Architecture
• The MIDI interface is implemented with the serial port.
• The CODEC interface is on the DE1 board.
Inside the FPGA
• The program inside the CPU responds to the MIDI commands.
Actual system implementation (1)
• Sound stream element configuration is mapped into the NIOS II memory space
• Everything is wired in SOPC Builder
Actual system implementation (2)
Sound Stream
• Based on the Avalon-ST bus
• Clocked by the CODEC
• Uses the Avalon-ST back-pressure capability
  • The samples are ‘pulled’ through the stream by the CODEC
  • ‘D’ elements implement a pipeline
Software Implementation
• Introduction
• System Implementation
• Software Implementation
  • Project mapping
  • µC/OS-II
• Conclusion & results
Project Mapping
Project Mapping (continued)
• Rules for mapping:
  • All blocks in the sound flow are implemented in hardware.
  • All slow or computation-heavy blocks are implemented in software.
  • The interconnection between the hardware blocks is simplified by the use of the Avalon bus.
  • The whole design is clocked by the same 50 MHz clock, which is also routed by SOPC Builder.
µC/OS II
• Simple and efficient RTOS
• Integrated into the NIOS II IDE
• Mainly 3 tasks implemented:
  • MIDI task: receives and executes the MIDI commands
  • EG task: envelope generator refresh
  • LFO task: low-frequency oscillator refresh
Software organisation
• Operating tasks
• Interconnection matrix system
  • Almost everything can be interconnected dynamically
  • Defines 2 connector types:
     Sink, to receive data (e.g. oscillators, filter)
     Source, to emit data (e.g. LFO/EG/MIDI)
  • Automatic refresh
  • Control system to modify non-dynamic data
Conclusion (1)
• The FPGA itself was well suited to the signal processing:
  • It contains a lot of internal RAM.
  • It has many multipliers, which make it possible to create many high-performance design blocks.
  • Its parallel architecture helps to achieve a better throughput.
  • FPGAs are cheaper than DSPs.
• The Avalon bus system is very efficient and simple to implement
  • Mostly thanks to SOPC Builder
Performance (2)
• Smooth configuration
• Polyphony
  • Easy to accomplish with an FPGA
  • Pipelined FPGA architecture
  • 1000 cycles/sample available
  • The actual implementation should achieve a polyphony of at least 100.
• Minimal response time
  • Sample processing vs. block processing
References
• www.altera.com
• www.wikipedia.org
• www.dspmusic.com
• www.micrium.com
Call for FPGAworld Conference 2009
Academic Paper Submission
Step 1) Send your submission to: [email protected] (max 6 pages or 4000 words)
Step 2) The Program Committee will decide whether papers are accepted for a long presentation (30 minutes) or a short presentation (15 minutes). For long and short presentations, 4-8 pages will be assigned in the published proceedings on the web. Instructions will be sent after acceptance of your paper.
Step 3) Send in the camera-ready paper to David Kallberg; it is also possible to add an attachment to the paper (no limit on the number of pages).
Step 4) The papers will be published on the FPGAworld website and in print at the conference.
Please note that the academic papers will be presented in Stockholm on September 11.
Industrial Abstract Submission
Industrial submissions may already have been published on the web or elsewhere (by you as the owner): design examples, white papers, student industrial projects, etc.
Step 1) Submit an abstract giving a short overview of the presentation. The abstract should not exceed 100 words. Please state whether you are submitting the presentation to Stockholm and/or Lund, and also add “industrial paper”. Send your submission to: [email protected]
Step 2) The Industrial Program Committee decides whether the papers are accepted for presentation.
Step 3) After acceptance of your abstract, you may send a full paper to be included in the web-based documentation.
Presentations on the FPGAworld
Each presentation is limited to 30 minutes in total, including any questions; the session chairs will be very strict. We require that all presentations be pre-loaded onto a single laptop (Windows XP, PowerPoint/Office 2003 or PDF) to avoid laptop-switching time and problems. Please send your presentation to the session chairman, or contact him before your presentation to load it onto the laptop.
Product presentations and demonstrations are not reviewed; see under Exhibitors and contact David Kallberg. We cannot guarantee a place at FPGAworld, so you have to book in time, and please state whether you are booking for Stockholm and/or Lund.
4th FPGAworld CONFERENCE, Lennart Lindh, Vincent John Mooney III (external), MRTC report ISSN 1404-3041 ISRN MDH-MRTC215/2007-1-SE, Mälardalen Real-Time Research Centre, Mälardalen University, September, 2007
3rd FPGAworld CONFERENCE, Lennart Lindh, Vincent John Mooney III (external), MRTC report ISSN 1404-3041 ISRN MDH-MRTC204/2006-1-SE, Mälardalen Real-Time Research Centre, Mälardalen University, November, 2006
2nd FPGAworld CONFERENCE, Lennart Lindh, Vincent John Mooney III (external), MRTC report ISSN 1404-3041 ISRN MDH-MRTC188/2005-1-SE, Mälardalen Real-Time Research Centre, Mälardalen University, September, 2005
Call for FPGAworld Conference 2009
Academic/Industrial Papers, Product Presentations, Exhibits and Tutorials
September 2009, Stockholm, Sweden
September 2009, Lund, Sweden (no academic)
The submissions should be in at least one of these areas
DESIGN METHODS - models and practices
o Project methodology.
o Design methods as Hardware/software co-design.
o Modeling at different abstraction levels.
o IP component designs.
o Interface design: supporting modularity.
o Integration - models and practices.
o Verification and validation.
o Board layout and verification.
o Etc.
TOOLS
o News
o Design, modeling, implementation, verification and validation.
o Instrumentation, monitoring, testing, debugging etc.
o Synthesis, Compilers and languages.
o Etc.
HW/SW IP COMPONENTS
o New IP components for platforms and applications.
o Real-time operating systems, file systems, internet communications.
o Etc.
PLATFORM ARCHITECTURES
o Single/multiprocessor Architecture.
o Memory architectures.
o Reconfigurable Architectures.
o HW/SW Architecture.
o Low power architectures.
o Etc.
APPLICATIONS
o Case studies from users in industry, academia and student projects.
o HW/SW Component presentation.
o Prototyping.
o Etc.
SURVEYS, TRENDS AND EDUCATION
o History and surveys of reconfigurable logic.
o Tutorials.
o Student work and projects.
o Etc.
www.fpgaworld.com/conference
ISBN 978-91-976844-1-5