5th FPGAworld CONFERENCE Book
2008 SEPTEMBER
EDITORS: Lennart Lindh, David Källberg, Santiago de Pablo and Vincent J. Mooney III

The FPGAworld Conference addresses aspects of digital and hardware/software system engineering on FPGA technology. It is a discussion and network forum for students, researchers and engineers working on industrial and research projects, state-of-the-art investigations, development and applications. The book contains all presentations; for more information see www.fpgaworld.com/conference.

ISBN 978-91-976844-1-5

SPONSORS

Copyright and Reprint Permission for personal or classroom use is allowed with credit to FPGAworld.com. For commercial or other for-profit/for-commercial-advantage uses, prior permission must be obtained from FPGAworld.com. Additional copies of the 2008 or prior Proceedings may be found at www.FPGAworld.com or at Jönköping University library (www.jth.hj.se), ISBN 978-91-976844-1-5.

2008 PROGRAM COMMITTEE
General Chair: Lennart Lindh, FPGAworld and Jönköping University, Sweden
Publicity Chair: David Källberg, FPGAworld, Sweden
Academic Programme Chair: Vincent J.
Mooney III, Georgia Institute of Technology, USA
Academic Publicity Chair: Santiago de Pablo, University of Valladolid, Spain

Academic Programme Committee Members
Ketil Roed, Bergen University College, Norway
Lennart Lindh, Jönköping University, Sweden
Pramote Kuacharoen, National Institute of Development Administration, Thailand
Mohammed Yakoob Siyal, Nanyang Technological University, Singapore
Fumin Zhang, Georgia Institute of Technology, USA
Santiago de Pablo, University of Valladolid, Spain

Industrial Programme Committee Members
Solfrid Hasund, Bergen University College, Norway
Daniel Stackenäs, Altera, Sweden
Martin Olsson, Synective Labs, Sweden
Kim Petersén, HDC, Sweden
Mickael Unnebäck, ORSoC, Sweden
Stefan Sjöholm, Prevas, Sweden
Fredrik Lång, EBV, Sweden
Niclas Jansson, BitSim, Sweden
Ola Wall, Synplicity, Sweden
Torbjorn Soderlund, Xilinx, Sweden
Göran Bilski, Xilinx, Sweden
Anders Enggaard, Axcon, Denmark
Adam Edström, Elektroniktidningen, Sweden
Doug Amos, Synplicity, UK
Guido Schreiner, The Mathworks, Germany
Espen Tallaksen, Digitas, Norway
Göran Rosén, Actel, Sweden
Tommy Klevin, ÅF, Sweden
Stig Kalmo, Engineering College of Aarhus, Denmark
Tryggve Mathiesen, BitSim, Sweden
Hichem Belhadj, Actel, USA
Fredrik Kjellberg, Net Insight, Sweden

General Chairman's Message

The FPGAworld program committee welcomes you to the 5th FPGAworld conference. This year's conference is held in Stockholm and Lund, Sweden. We hope that the conferences provide you with much more than you expected. We will try to balance academic and industrial presentations, exhibits and tutorials to provide a unique chance for our attendants to obtain knowledge from different views. This year we have the strongest program in FPGAworld's history.

Track A - Industrial
Track A features presentations with a focus on industrial applications. The presenters were selected by the Industrial Programme Committee. 8 papers were presented.
Track B - Academic
Track B features presentations with a focus on academic papers and industrial applications. The presenters were selected by the Academic Programme Committee. Due to the high quality required, 5 out of the 17 papers submitted this year were presented.

Track C - Product presentations
Track C features product presentations from our exhibitors and sponsors.

Track D - Altera Innovate Nordic
Track D is reserved for the Altera Innovate Nordic contest. Three contestants reached the final.

Exhibitors
FPGAworld'2008 Stockholm & Lund: 15 unique exhibitors. The FPGAworld 2008 conference is bigger than the FPGAworld 2007 conference. In total we are close to 300 participants (Stockholm and Lund).

All are welcome to submit industrial/academic papers, exhibits and tutorials to the conference, from student, academic and industrial backgrounds alike. Together we can make the FPGAworld conference exceed even our best expectations! Please check out the website (http://fpgaworld.com/conference/) for more information about FPGAworld. In addition, you may contact David Källberg ([email protected]) for more information.

We would like to thank all of the authors for submitting their papers. We hope that the attendees enjoyed the FPGAworld conference 2008, and we welcome you to next year's conference.

Lennart Lindh

Programme FPGAworld'2008 Lund (Ideon Beta in Lund, Sweden)
08:30-09:00 Registration
09:00-09:15 Conference opening - Lennart Lindh, FPGAworld
09:15-10:00 Challenges and Opportunities in the FPGA industry?
Gary Meyers, Synopsys (Key Note Session)
10:00-10:30 Coffee Break
10:30-11:30 Exhibitors Presentation
11:30-12:30 Lunch Break, sponsored by Actel
12:30-14:30 (Session Chairs TBD)
  Session A1: Open Source within Hardware
  Session A2: World's first mixed-signal FPGA
  Session A3: Verification - reducing costs and increasing quality
  Session A4: Analog Netlist partitioning and automatic generation of schematic
  Session C1: Prototyping Drives FPGA Tool Flows (Synplicity Business Group of Synopsys)
  Session C2: OVM introduction (Mentor Graphics)
  Session C3: Verification Management (Mentor Graphics)
  Session C4: MAGIC - Next generation platform for Telecom and Signal Processing (BitSim)
14:30-15:00 Coffee Break
15:00-16:30 (Session Chairs TBD)
  Session A5: Product Presentation (ORSoC)
  Session A6: Drive on one chip
  Session A7: Standard architecture for typical remote sensing micro satellite payload
  Session C5: Product Presentation (The Dini Group)
  Session C6: Product Presentation (Actel)
  Session C7: Product Presentation - Nextreme: the industry's only Zero Mask-charge NEW ASIC

Programme FPGAworld'2008 Stockholm (Electrum in Kista, Sweden)
08:30-09:00 Registration
09:00-09:15 Conference opening - Lennart Lindh, FPGAworld
09:15-10:00 Key Note Session: Making the FPGA Reconfigurable Platform Accessible to Domain Experts - Dr.
Ivo Bolsens, CTO, Xilinx
10:00-10:30 Coffee Break, sponsored by Synplicity
10:30-11:30 Exhibitors Presentations
11:30-12:30 Lunch Break, sponsored by Mentor Graphics
12:30-14:30 (Session Chairs: Kim Petersén, HDC AB; Tommy Klevin, ÅF; Johnny Öberg)
  Session A1: Open Source within Hardware
  Session A2: Open and Flexible Network Hardware
  Session A3: World's first mixed-signal FPGA
  Session A4: Verification - reducing costs and increasing quality
  Session B1: A Java-Based System for FPGA Programming
  Session B2: Automated Design Approach for On-Chip Multiprocessor Systems
  Session B3: ASM++ Charts: an Intuitive Circuit Representation Ranging from Low Level RTL to SoC Design
  Session C1: Product Presentation (Actel)
  Session C2: Product Presentation (The Dini Group)
  Session C3: Product Presentation (ORSoC)
  Session C4: 7Circuits - I/O Synthesis for FPGA Board Design (Gateline)
  Session D1: Altera Innovate Nordic Contest
14:30-15:00 Coffee Break
15:00-16:30 (Session Chairs: Santiago de Pablo; others TBD)
  Session A5: Large scale real-time data acquisition and signal processing in SARUS
  Session A6: Drive on one chip
  Session A7: Standard architecture for typical remote sensing micro satellite payload
  Session B4: Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
  Session B5: The ABB NoC - a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs
  Session C5: Prototyping Drives FPGA Tool Flows (Synplicity Business Group of Synopsys)
  Session C6: OVM introduction (Mentor Graphics)
  Session C7: Verification Management (Mentor Graphics)
  Session C8: MAGIC - Next generation platform for Telecom and Signal Processing (BitSim)
  Session D2: Altera Innovate Nordic Contest
16:30-17:00 Altera Innovate Nordic Prize draw
17:00-19:00 Exhibition & Snacks, sponsored by Altera

FPGA World 2008 Exhibitors
FPGAworld'2008 Stockholm & Lund - five-minute presentations with PowerPoint
Stockholm: Altera, BitSim, Arrow, Silica, Actel, Synplicity, The Mathworks, EBV Elektronik, ACAL Technology, The Dini
Group, VSYSTEMS, Gateline, ORSoC, National Instruments
Lund: BitSim, Arrow, Silica, Actel, Synplicity, The Mathworks, EBV Elektronik, ACAL Technology, The Dini Group, National Instruments, NOTE Lund

Innovate Nordic 2008 Design Contest
Lena Engdahl
© 2008 Altera Corporation - Public. Altera, Stratix, Arria, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation.
The finalists are:

GATEline Presentation
Melek Mentes

GATEline Overview
Value-added reseller of eCAD and ePLM products on the Nordic and Baltic markets. Established 1984. 7 employees, 6 in Sweden and 1 in Norway. Offices in Stockholm and Oslo; EDAnova is the representative in Finland.

GATEline Represents The Ultimate PCB Design Environment: FPGA I/O Synthesis, Schematic Design, Functional Simulation, PCB Design, Signal Integrity Simulation, Component Database, Design for Manufacturing. GATEline offers probably the best integrated PCB design flow on the market!

The DINI Group - Products and Roadmap
Mike Dini, [email protected], September '08

Philosophy ... Why are we here?
- We make big FPGA boards
- Fastest, biggest for the lowest cost
  - Easy to use where important
  - Less polish where not
- What you get:
  - Working, easy to use, cutting-edge, cost-effective reference designs
  - High performance in both speed and gate density
- What you don't:
  - Pretty GUIs and other SW that drives up the cost
  - The 'soft-shoe' on partitioning ...

What We Provide vs.
What You Need
- HW only, with lots of reference designs included
- Customer needs:
  - Simulation (Verilog/VHDL)
  - Partitioning (optional): manual or third-party solutions such as Auspy
  - Synthesis: Xilinx/Altera tools work fine
  - Place/Route: comes from the FPGA vendor (Xilinx/Altera)
  - Debug: ChipScope, SignalTap, and other third-party solutions

Overview of Product Line
- Goal: provide customers a cost-effective vehicle to use the biggest and fastest FPGAs
  - Xilinx: Virtex-5
  - Altera: Stratix III, and Stratix IV when available
- We try to keep lead times under 2 weeks. If not 2 weeks, the issue is usually availability of FPGAs.

Xilinx Virtex-5
Shipping with V5 since Jan '07:
- Standalone
  - DN9000k10 - Bride of Monster, 16 FPGAs (LX330s)
  - DN7020k10 - Uncle of Monster, 20 FPGAs (S3/S4)
- PCI
  - DN9000k10PCI - 6 FPGAs (LX330s)
  - DN9200k10PCI - 2 FPGAs (LX330s)
  - DN9002k10PCI - 2 FPGAs (LX330/LX220/LX155/LX110)
- PCIe
  - 8 lanes GEN1 with LX50T-2; 4 lanes GEN1 with FX70T-2
  - DN9000k10PCIe-8T - 6 FPGAs (LX330s)
  - DN9200k10PCIe-8T - 2 FPGAs (LX330s)
  - DN9002k10PCIe-8T - 2 FPGAs (LX330/LX220/LX155/LX110)

Altera Stratix III (and IV) - 50M/100M gates and stackable

Daughter of Monster
- DN9000k10 - Bride of Monster(TM)
- 16 Virtex-5 LX330s
- Expected to start shipping in Dec '07
- 32M ASIC gates (measured the real way ...)
- 6 DDR2 SODIMM sockets (250MHz)
- 450MHz LVDS chip-to-chip interconnect

DN9000k10PCI / DN9000k10PCIe-8T
- 6 Virtex-5 LX330s
  - Oversize PCI circuit board: 66MHz/64-bit
  - Stand-alone operation with ATX power supply
- ~12 million USABLE ASIC gates - REAL ASIC gates! No exaggeration!
- Any subset of FPGAs can be stuffed to reduce cost.
- 6 DDR2 SODIMM SDRAM sockets
- Custom DDR2-compatible cards for FLASH, SSRAM, RLDRAM, mictors, and others
- FPGA-to-FPGA interconnect: LVDS (or single-ended)
  - LVDS: 450MHz
  - 10x ISERDES/OSERDES tested and verified
- 160-pin Main bus connects all FPGAs

Block Diagram - DN9000k10

[DINI_selection_guide_v700.xls: device selection table comparing Xilinx Virtex-5 (LX/LXT/SXT/FXT), Virtex-4 (LX/FX/SX) and Virtex-II Pro families against Altera Stratix IV/III and Stratix II/II GX families by gate estimate, maximum and practical (60% utilization) LUT and flip-flop counts, LUT size (4-input or 6-input), block memory totals, 18x18 and 25x18 multipliers, maximum I/Os and available speed grades.]

Overview of The MathWorks
Jan Hedman, Sr. Application Engineer, Signal Processing and Communications
© 2008 The MathWorks, Inc.

The MathWorks at a Glance
- Headquarters: Natick, Massachusetts
- US: California, Michigan, Washington DC, Texas
- Europe: UK, France, Germany, Switzerland, Italy, Spain, the Netherlands, Sweden
- Asia-Pacific: China, Korea, Australia
- Worldwide training and consulting
- Distributors in 25 countries
Earth's topography on an equidistant cylindrical projection, created with MATLAB® and Mapping Toolbox™.
Core MathWorks Products
The leading environment for technical computing:
- Numeric computation
- Data analysis and visualization
- The de facto industry-standard, high-level programming language for algorithm development
- Toolboxes for signal and image processing, statistics, optimization, symbolic math, and other areas
- Foundation of MathWorks products

Core MathWorks Products
The leading environment for modeling, simulating, and implementing communications systems and semiconductors:
- Foundation for Model-Based Design
- Digital, analog, and mixed-signal systems, with floating- and fixed-point support
- Algorithm development, system-level design, implementation, and test and verification
- Optimized code generation for FPGAs and DSPs
- Blocksets for signal processing, communications, video and image processing, and RF
- Open architecture with links to third-party modeling tools, IDEs, and test systems

Model-Based Design Workflow
Requirements and research feed design (data analysis, environment and data modeling, algorithm development), which splits into physical components, algorithms, embedded software (C, C++ on MCU/DSP) and digital electronics (VHDL, Verilog on FPGA/ASIC); implementation and test environments follow, with continuous verification and validation throughout.

SILICA @ FPGA World 2008 - Lund

SILICA EMEA
- Silica, an Avnet Company
- 550 employees (450 in the sales and engineering team)
- 23 franchises
- Local sales organisations with a centralized backbone for logistics
- Excellent portfolio of value-added services and supply chain solutions
SILICA | The Engineers of Distribution.

KEY TECHNOLOGY
- Power Management
- Microcontroller / DSP
- Commodities: Discrete | Logic, Analog | Memory
- Analog (Signal Chain)
- Programmable Logic

SILICA LINECARD
Avnet Design Services

Xilinx® Virtex™-5 FXT Evaluation Kit
- XC5VFX30T-1FFG665
- PowerPC™ 440
- 10/100/1000 Ethernet PHY
- $395

Xilinx® Spartan™-3A Evaluation Kit
- XC3S400A-4FTG256C
- General FPGA prototyping
- Cypress® PSoC evaluation (CapSense)
- $39

Actel Offering Overview
Hichem Belhadj, Actel Fellow
© 2008 Actel Corporation. Confidential and Proprietary. Aug 19th 2008

Actel Technology: Flash and Antifuse
- Non-volatile reprogrammable FPGAs: Flash (floating gate) technology
- Non-volatile OTP (One Time Programmable) FPGAs: ONO anti-fuse technology, M2M anti-fuse technology

Actel Technology Advantages
- Power advantages
- Inherently more secure than any other solution
- Actel delivers a significant reliability advantage
- All Actel devices function as soon as power is applied to the board
- Single-chip offerings provide a total cost advantage over the competition

Actel's Silicon
- Value-based Low Power FPGA: ultra-low power, very high volume, sub-$1.00 market
- Power and System Management: system developers needing integrated functionality on a single chip
- System Critical: where failure and tampering are not options

Actel Solutions
Industry's Most Comprehensive Power Management Portfolio

Later Today You are Invited
- Low Power Solutions, 12:30
- Mixed-Signal FPGA and *TCA Solutions, 1:30
© 2008 Actel Corporation.
Confidential and Proprietary

Mentor Graphics Complete FPGA Design Flow
Håkan Pettersson, Sr Applications Engineer, [email protected]

Mentor FPGA Design Solutions - Concept to Manufacturing
Embedded Development, System Design, C-Based Synthesis, RTL Reuse & Creation, Verification, FPGA Synthesis, PCB Design
Copyright 2007 Mentor Graphics, All Rights Reserved

Mentor FPGA Design Solutions - Concept to Manufacturing
- Embedded Development: Nucleus OS & EDGE Developer
- System Design: Vista & Visual Elite
- C-Based Synthesis: Catapult C
- RTL Reuse & Creation: HDL Designer
- Verification: Questa & Seamless & 0-In
- FPGA Synthesis: Precision & FormalPro
- PCB Design: I/O Designer & HyperLynx

Mentor @ FPGA World
- Open Verification Methodology - An Overview
- Verification Management

EBV Presentation, FPGA World 2008

EBV Elektronik - The Full Solution Provider
EBV added values: in-depth design expertise, application know-how, full logistics solutions

EBV Franchise Partner

EBV - The Technical Specialist
- 130 pan-European Field Application Engineers (13% of EBV's total workforce) provide extensive application expertise and design know-how.
- 2 weeks of internal FAE training per year by the product specialists of EBV's manufacturers (FSEs also attend). Technologies are chosen from EBV!
- 2 weeks of additional training at our suppliers.
- Every FAE receives at least 20 working days of training per year!

EBV Reference Designs
- Different suppliers combined into one solution!
- Latest products, technologies & software
- Customer requirements drive the applications
- "Ready-to-use" solutions
- Saves customers costs & development time
- Reduces time-to-market

FalconEye Development Board
Will be presented during the Drive on one chip session.

Open Source - gives the competitive edge
ORSoC makes SoC development easy, accessible and cost-efficient for all companies, regardless of size or financial strength. ORSoC makes it easy.
- Customer product: USB debugger - designed and developed by ORSoC, owned and sold by Swerob
- Development boards
- Floppy-disk replacement

OpenCores reach millions of engineers
www.opencores.org - OpenCores is owned and maintained by ORSoC.

OpenCores Facts
OpenCores is the number one site in the world for open-source hardware IPs:
- ~540 projects (different IP blocks)
- ~1,000,000 page views every month
- ~70,000 visitors every month
- 6:48 (min:sec) average time on the website
- 14GB of data (IP source code) downloaded every month

Welcome to Synopsys
May 20th, 2008 - FPGAWorld 2008

Welcome to the Synplicity Business Group of Synopsys
Since May 15th, 2008

The Message is . . .
"The acquisition by Synopsys allows us to scale our FPGA and rapid prototyping business to help more designers successfully solve increasingly complex problems." - Gary Meyers, General Manager, Synplicity Business Group

"The combination will support our strategy to provide rapid prototyping capabilities and will enhance Synplicity's already strong offering in the FPGA implementation market." - Aart de Geus, CEO and Founder, Synopsys

Synplicity Business Group Products
- Confirma™ ASIC/ASSP Verification Platform
- FPGA Implementation Solutions
- ESL Synthesis
- Synplify Premier: Single-FPGA Prototyping Environment; the ultimate in FPGA implementation
- Certify: Multi-FPGA Prototyping Environment
- Identify Pro: Full Visibility Functional Verification
- HAPS: High-performance ASIC Prototyping System
- Synplify Pro: the industry leader in FPGA synthesis
- Identify: Powerful RTL Debug
- Synplify DSP: DSP Synthesis for FPGA Designers
- Synplify DSP ASIC Edition: DSP Synthesis for ASIC Designers

FPGA World September 2008
© 2008 Actel Corporation

Key Market Segments
- Value-based FPGA: ultra-low power, high volumes, sub-$10 market
- Power and System Management: needs integrated functionality on a single chip
- System Critical: failure and tampering are not options

Power: Actel Technical Advantage
- Competitive SRAM cell: substantial leakage per cell, millions of configuration cells, high static current
- Actel's Flash cell: negligible leakage per cell, millions of configuration cells, ultra-low static current

Actel's System Management Solutions
- High-end, standards-based system management specifications
- Fusion-based µTCA reference designs: Power Module and Advanced Mezzanine Card
- Fusion-based ATCA reference designs
- Low-cost system management for typical embedded design: robust reference design
leverages Fusion and CoreABC.
© 2008 Actel Corporation, September 2008

We will show:
- A presentation of Fusion as a solution to several functions in a µTCA chassis, plus a demo. At 13:00 in room A.
- A presentation and demo of Igloo, showing the difference in power consumption between Flash- and SRAM-based FPGAs. At 15:30 in room C.
See you there!

A leading design and service house in Sweden
- Application specialists: Graphics, Imaging and Digital Video; Advanced Microelectronics
- FPGA, board, DSP, ASIC & System-on-Chip, analog & SW
- Offices: head office in Stockholm, with regional offices in Lund, Uppsala, Växjö and Gothenburg
- ~60 employees, on average 10+ years in electronic design

- Digital Signal Processing: wireless LAN, image processing, mobile telecom & radar
- Digital Video & Graphics: processing, 2D acceleration, displays, MPEG/JPEG
- ASIC/System-on-Chip & SOPC (large FPGAs): analysis, tools, IP blocks, DFT issues, advanced verification
- High Speed Design (PCB & FPGA): layout issues, tools, protocols and interfaces, >3Gbps

SPEED UP YOUR BUSINESS - NOTE Lab

NOTE
NOTE is one of the leading EMS companies, with more than 1100 people all over the world. Everything we do is designed to make your company more successful by developing electronics from design to after-sales services in close cooperation with you.
WE OFFER
- A site close to you
- Design and test resources
- Industrialisation
- NOTEfied for selection of the right components
- NOTE LAB for fast prototypes
- Competitive component sourcing
- Serial production, including box build
- After-sales services

NOTE Lab
- Specialists in prototyping and other customized production
- Fast prototype production:
  - experienced component engineers and purchasing personnel
  - prototype modifications while you wait
  - advanced prototype delivery in days
  - feedback based on customer needs
  - seamless transfer to serial production
- Box build in small volumes

NOTEfied
NOTEfied for closer control of development and reliable production. It is essential to select the right component, and NOTEfied supports this with:
- NOTE unique component article numbers
- URL to data sheet
- Manufacturer's part number
- Lead time
- Quality classification
- RoHS
- Life cycle status
- Symbols
- Footprint
- Production recommendations

Let us help you!
We can help you launch your product faster, and that can be the difference between winning and losing. If you want more information, please visit www.note.se or contact us in Lund on 046 - 286 92 00. If you have your business somewhere else in Sweden, you can find a NOTE site near you on our home page. We look forward to hearing from you!

Track B - Academic
Track B features presentations with a focus on academic papers and industrial applications. The presenters were selected by the Academic Programme Committee. Due to the high quality required, 5 out of the 17 papers submitted this year were presented.
Papers
Session B1: A Java-Based System for FPGA Programming
Session B2: Automated Design Approach for On-Chip Multiprocessor Systems
Session B3: ASM++ Charts: an Intuitive Circuit Representation Ranging from Low Level RTL to SoC Design
Session B4: Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
Session B5: The ABB NoC - a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs

A Java-Based System for FPGA Programming
Jacob A. Bower, James Huggett, Oliver Pell and Michael J. Flynn
Maxeler Technologies
{jacob, jhuggett, oliver, flynn}@maxeler.com

Abstract
Photon is a Java-based tool for programming FPGAs. Our objective is to bridge the gap between the ever-increasing sizes of FPGAs and the tools used to program them. Photon's primary goal is to allow rapid development of FPGA hardware. In this paper we present Photon by discussing Photon's abstract programming model, which separates computation and data I/O, and by giving an overview of the compiler's internal operation, including a flexible plug-and-play optimization system. We show that designs created with Photon always lead to deeply-pipelined hardware implementations, and present a case study showing how a floating-point convolution filter design can be created and automatically optimized. Our final design runs at 250MHz on a Xilinx Virtex-5 FPGA and has a data processing rate of 1 gigabyte per second.

1. Introduction
Traditional HDLs such as VHDL or Verilog incur major development overheads when implementing circuits, particularly for FPGAs, which should support fast design cycles compared to ASIC development. While tools such as C-to-gates compilers can help, often existing software cannot be automatically transformed into high-performance FPGA designs without major re-factoring. In order to bridge the FPGA programming gap we propose a tool called Photon.
Our goal with Photon is to simplify programming FPGAs with high-performance data-centric designs. Currently the main features of Photon can be summarized as follows:
- Development of designs using a high-level approach combining Java and an integrated expression parser.
- Designs can include an arbitrary mix of fixed- and floating-point arithmetic with varied precision.
- Plug-and-play optimizations enabling design tuning without disturbing algorithmic code.
- VHDL generation to enable optimizations via conventional synthesis tools.
- Automation and management of bitstream generation for FPGAs, such as invoking FPGA vendor synthesis tools and simulators.

The remainder of this paper is divided up as follows: In Section 2, we compare Photon and other tools for creating FPGA designs. In Section 3 we describe Photon's programming model, which ensures designs often lead to high-performing FPGA implementations. In Sections 4 and 5 we give an overview of how Photon works internally and present a case study. Finally, in Section 6 we summarize our work and present our conclusions on Photon so far.

2. Comparisons to Other Work
In Table 1 we compare tools for creating FPGA designs using the following metrics:
- Design input - programming language used to create designs.
- High-level optimizations - automatic inference and optimization of computation hardware, simplification of arithmetic expressions, etc.
- Low-level optimizations - Boolean expression minimisation, state-machine optimizations, eliminating unused hardware, etc.
- Floating-point support - whether the tool has intrinsic support for floating-point and IEEE compliance.
- Meta-programmability - ability to statically meta-program, with weaker features being conditional compilation and variable bit-widths, and stronger features such as higher-order design generation.

VHDL and Verilog use a traditional combination of structural constructs and RTL to specify designs. These tools typically require a high development effort.
Such conventional tools typically have no direct support for floating-point arithmetic and therefore require external IP. Meta-programmability, e.g. generics in VHDL, is fairly inflexible [1]. The advantage of VHDL and Verilog is that they give the developer control over every aspect of the microarchitecture, providing the highest potential for an optimal design. Additionally, synthesis technology is relatively mature and the low-level optimizations can be very effective. Other tools often produce VHDL or Verilog to leverage the low-level optimizers present in the conventional synthesis tool-chain.

Table 1. Comparison of tools for creating FPGA designs from software code.

Tool       | Design input | Floating-point | High-level opt. | Low-level opt. | Meta-programmability | Build automation
Photon     | Java         | IEEE           | Yes             | Via VHDL       | Strong               | Yes
Impulse-C  | C            | IEEE           | Yes             | Via HDL        | Weak                 | Yes
Handel-C   | C            | No             | Yes             | Yes            | Medium               | Limited
Verilog    | Verilog      | No             | No              | Yes            | Medium               | No
VHDL       | VHDL         | No             | No              | Yes            | Medium               | No
PamDC      | C++          | No             | No              | No             | Strong               | No
JHDL       | Java         | No             | No              | No             | Strong               | No
YAHDL      | Ruby         | No             | No              | No             | Strong               | Yes
ASC        | C++          | Yes            | No              | No             | Strong               | No

Impulse-C [2] and Handel-C [3] are examples of C-to-gates tools aiming to enable hardware designs using languages resembling C. The advantage of this approach is that existing software code can form a basis for generating hardware, with features such as ROMs, RAMs and floating-point units automatically inferred. However, software code will typically require modifications to support a particular C-to-gates compiler's programming model, for example explicitly specifying parallelism, guiding resource mapping, and eliminating features such as recursive function calls. The disadvantage of C-to-gates compilers is that the level of modification or guidance required of a developer may be large, as in general it is not possible to infer a high-performance FPGA design from a C program. This arises because C programs are generally designed without parallelism in mind and are highly sequential in nature.
Also, meta-programmability is often limited to the C pre-processor, as there is no other way to distinguish between static and dynamic program control in C. PamDC [4], JHDL [5], YAHDL [1] and ASC [6] are examples of Domain Specific Embedded Languages (DSELs) [7] in which regular software code is used to implement circuit designs. With this approach all functionality to produce hardware is encapsulated in software libraries, with no need for a special compiler. These systems take a purely meta-programmed approach to generating hardware, the result of executing a program being a net-list or HDL for synthesis. Of these systems, PamDC, JHDL and YAHDL all provide similar functions for creating hardware structurally in C++, Java and Ruby respectively. YAHDL and PamDC both take advantage of operator overloading to keep designs concise, whereas JHDL designs are often more verbose. YAHDL also provides functions for automating build processes and integrating with existing IP and external IP generating tools. ASC is a system built on top of PamDC and uses operator overloading to specify arithmetic computation cores with floating-point operations. Photon is also implemented as a DSEL in Java. Photon's underlying hardware generation and build system is based on YAHDL, rewritten in Java to improve robustness. Unlike JHDL, Photon minimizes verbosity by using an integrated expression parser which can be invoked from regular Java code. Unlike the other DSELs presented, which generate hardware in a purely syntax-directed fashion, Photon also provides a pluggable optimization system.

3. Photon Programming Model

Our goal with Photon is to find a way to bridge the growing gap between FPGA size and programmability when accelerating software applications. In this section we discuss the programming model employed by Photon, which leads to high-performance FPGA designs. FPGA designs with the highest performance are generally those which implement deep, hazard-free pipelines.
However, software code written without parallelism in mind tends to have loops with dependencies which cannot directly be translated into hazard-free pipelines. As such, software algorithm implementations often need to be re-factored to be amenable to a high-performance FPGA implementation. Photon's programming model is built around making it easy to implement suitably re-factored algorithms. When developing our programming model for Photon, we observe that dense computation often involves a single arithmetic kernel nested in one or more long-running loops. Typically, dense computation arises from repeating the main kernel because the loop passes over a very large set of data, or because a small set of data is being iterated over repeatedly. Examples of these two cases include convolution, in which a DSP operation is performed on a large set of data [8], and Monte-Carlo simulations repeatedly running random-walks in financial computations [9]. In the Photon programming model we implement applications following this loop-nested kernel pattern by dividing the FPGA implementation into two separate design problems:

1. Creating an arithmetic data-path for the computation kernel.
2. Orchestrating the data I/O for this kernel.

    for i in 0 to N do
      if i > 1 and i < N-1 then
        dout[i] = (din[i] + din[i-1] + din[i+1]) / 3
      else
        dout[i] = din[i]
      end
    end

Listing 1. Pseudo-code 1D averaging filter.

Organizing data I/O for the kernel thus becomes a problem that can be tackled separately from the data-path compiler. This leaves us with an arithmetic kernel which does not contain any loop structures and hence can be implemented as a loop-dependency free pipeline. In Photon we assume the data I/O problem is solved by Photon-external logic. Based on this assumption, Photon designs are implemented as directed acyclic graphs (DAGs) of computation.
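Before committing a kernel to hardware, its behaviour can be checked with a plain software model. The following is a direct Java transliteration of Listing 1 (a hypothetical reference helper, not part of the Photon API); edge elements pass through unchanged, exactly as in the pseudo-code.

```java
// Plain-Java reference model of the Listing 1 kernel (hypothetical helper,
// not part of the Photon API) for checking results before hardware mapping.
public class Averaging1D {
    // Applies the 1D averaging filter; elements failing the i > 1 and
    // i < N-1 guard of Listing 1 pass through unchanged.
    public static double[] filter(double[] din) {
        int n = din.length;
        double[] dout = new double[n];
        for (int i = 0; i < n; i++) {
            if (i > 1 && i < n - 1) {
                dout[i] = (din[i] + din[i - 1] + din[i + 1]) / 3.0;
            } else {
                dout[i] = din[i];
            }
        }
        return dout;
    }
}
```

Such a software model gives a bit-level oracle against which the generated pipeline can later be simulated.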
The acyclic nature of these graphs ensures a design can always be compiled to a loop-dependency free pipeline. Within a Photon DAG there are broadly five classes of node:

• I/O nodes – Through which data flows into and out of the kernel under the control of external logic.
• Value nodes – Nodes which produce a constant value during computation. Values may be hard-coded or set via an I/O side-channel when computation is not running.
• Computation nodes – Operations including: arithmetic (+, ÷ . . . ), bit-wise (&, or, . . . ), type-casts etc.
• Control nodes – Flow-control and stateful elements, e.g.: muxes, counters, accumulators etc.
• Stream shifts – Pseudo operations used to infer buffering, simulating access to data ahead of or behind the current in-flow of data.

To illustrate Photon's usage and graph elements, consider the pseudo-code program in Listing 1. This program implements a simple 1D averaging filter passing over data in an array din with output to array dout. The data I/O for this example is trivial: data in the array din should be passed linearly into a kernel implementing the average filter, which outputs linearly into an array dout.

Figure 1. Photon DAG for 1D averaging.

Figure 1 shows a Photon DAG implementing the averaging kernel from Listing 1. Exploring this graph from the top down: data flows into the graph through the din input node; from here data either goes into logic implementing an averaging computation or to a mux. The mux selects whether the current input data point should skip the averaging operation and go straight to the output, as should be the case at the edges of the input data. The mux is controlled by logic which determines whether we are at the edges of the stream. The edge of the stream is detected using a combination of predicate operators (<, >, &) and a counter which increases once for each item of data which enters the stream.
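The DAG structure itself is easy to sketch in ordinary Java. The classes below are illustrative only (not Photon's actual node types); the check mirrors the guarantee above that an acyclic graph can always be compiled into a loop-dependency free pipeline, while a back-edge would be rejected.

```java
import java.util.*;

// Minimal sketch of a Photon-style computation DAG (class and field names
// are illustrative, not Photon's API). isAcyclic() mirrors the property
// that only a cycle-free graph maps to a hazard-free pipeline.
public class DagSketch {
    public static class Node {
        final String kind;                       // e.g. "input", "compute", "control"
        final List<Node> inputs = new ArrayList<>();
        Node(String kind) { this.kind = kind; }
        Node from(Node... srcs) { inputs.addAll(Arrays.asList(srcs)); return this; }
    }

    // Returns true if no node is reachable from itself (i.e. a true DAG).
    public static boolean isAcyclic(Collection<Node> nodes) {
        Map<Node, Integer> state = new HashMap<>(); // 0=unseen, 1=visiting, 2=done
        for (Node n : nodes) if (!visit(n, state)) return false;
        return true;
    }
    private static boolean visit(Node n, Map<Node, Integer> state) {
        Integer s = state.get(n);
        if (s != null) return s == 2;            // revisiting a "visiting" node = cycle
        state.put(n, 1);
        for (Node in : n.inputs) if (!visit(in, state)) return false;
        state.put(n, 2);
        return true;
    }

    // Skeleton of the 1D averaging graph: din feeds both the averaging
    // logic and the mux, with no back-edges.
    public static boolean averagingGraphIsAcyclic() {
        Node din = new Node("input");
        Node avg = new Node("compute").from(din);
        Node sel = new Node("control").from(din);
        Node mux = new Node("control").from(sel, avg, din);
        Node dout = new Node("output").from(mux);
        return isAcyclic(Arrays.asList(din, avg, sel, mux, dout));
    }

    // A deliberate back-edge, which a pipeline compiler must reject.
    public static boolean twoNodeCycleIsAcyclic() {
        Node a = new Node("compute");
        Node b = new Node("compute").from(a);
        a.inputs.add(b);                         // back-edge: a -> b -> a
        return isAcyclic(Arrays.asList(a, b));
    }
}
```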
The constant input N − 1 to the < comparator can be implemented as a simple constant value, meaning the size of data which can be processed is fixed at compilation time. On the other hand, the constant input can be implemented as a more advanced value-node that can be modified via a side-channel before computation begins, thus allowing data-streams of any size to be processed. The logic that performs the averaging computation contains a number of arithmetic operators, a constant and two stream-shifts. The stream-shift operators cause data to be buffered such that it arrives at the addition operator one data-point behind (−1) or one data-point ahead (+1) of the unshifted data which comes directly from din.

    class AddMul extends PhotonDesign {
      AddMul(BuildManager bm) {
        super(bm, "AddMulExample");
        Var a = input("a", hwFloat(8, 24));
        Var b = input("b", hwFloat(8, 24));
        Var c = input("c", hwFloat(8, 24));
        Var d = output("d", hwFloat(8, 24));
        d.connect(mul(add(a, b), c));
      }
    }

Listing 2. Photon floating-point add/mul design.

Figure 2. Scheduled DAG for 1D average filter.

To implement our 1D averaging Photon DAG in hardware, the design undergoes processing to arrive at a hardware implementation. Figure 2 illustrates the result of Photon processing our original DAG. In this processed DAG, buffering implements the stream-shifting operators and ensures data input streams to DAG nodes are aligned. Clock-enable logic has also been added for data alignment purposes. With this newly processed DAG, data arriving at din produces a result at dout after a fixed latency. This is achieved by ensuring that data inputs to all nodes are aligned with respect to each other. For example, the mux before dout has three inputs: the select logic, din and the averaging logic. Without the buffering and clock-enable logic, data from din would arrive at the left input to the mux before the averaging logic has computed a result. To compensate, buffering is inserted on the left input to balance out the delay through the averaging logic. For the mux-select input a clock-enable is used to make sure the counter is started at the correct time. After Photon processes a DAG by inserting buffering and clock-enable logic, the DAG can be turned into a structural hardware design. This process involves mapping all the nodes in the graph to pre-made fully-pipelined implementations of the represented operations and connecting the nodes together. As the design is composed of a series of fully-pipelined cores, the overall core is inherently also fully-pipelined. This means Photon cores typically offer a high degree of parallelism with good potential for achieving a high clock-speed in an FPGA implementation.

4. Implementation of Photon

In this section we give an overview of Photon's concrete implementation. Of particular interest in Photon is the mechanism by which designs are specified as Java programs, which is covered first in Section 4.1. We then discuss Photon's compilation and hardware generation process in Section 4.2.

4.1. Design Input

Photon is effectively a Java software library and as such, Photon designs are created by writing Java programs. Executing a program using the Photon library results in either the execution of simulation software for testing a design or an FPGA configuration programming file being generated. When using the Photon library a new design is created by extending the PhotonDesign class, which acts as the main library entry point. This class contains methods which wrap around the creation and inter-connection of standard Photon nodes, forming a DAG in memory which Photon later uses to produce hardware. New nodes for custom hardware units, e.g. a fused multiply-accumulate unit, can also be created by users of Photon.
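The input-alignment rule described above reduces to simple arithmetic: each input of a node is delayed by the difference between the slowest feeding path and its own path. The latency figures below are invented for illustration (e.g. din arriving directly with latency 0, the averaging logic taking 5 cycles and the select logic 2).

```java
// Sketch of the buffer-balancing idea behind Photon's scheduling (assumed
// arithmetic for illustration, not Photon's actual scheduler): every input
// to a node is delayed so all inputs arrive together after the slowest path.
public class BufferBalance {
    // pathLatencies[i] = pipeline depth of the logic feeding input i.
    // Returns the number of buffer stages to insert on each input.
    public static int[] bufferDepths(int[] pathLatencies) {
        int max = 0;
        for (int l : pathLatencies) max = Math.max(max, l);
        int[] depths = new int[pathLatencies.length];
        for (int i = 0; i < pathLatencies.length; i++) {
            depths[i] = max - pathLatencies[i];   // pad up to the slowest path
        }
        return depths;
    }
}
```

With the assumed latencies {0, 5, 2} for the din, averaging and select inputs of the mux, the direct din input would receive 5 buffer stages and the select input 3, matching the balancing behaviour described in the text.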
Listing 2 shows an example Photon program. When executed, this program creates a hardware design which takes three floating-point numbers a, b and c as inputs, adds a and b together and multiplies the result by c to produce a single floating-point output d. Method calls in the code specify a DAG which has six nodes: three inputs, an output, a multiplier and an adder. These nodes are created by calls to the input, output, mul and add methods respectively. The input and output methods take a string parameter to specify names of I/Os for use by external logic and for performing data I/O. Another parameter specifies the I/O type. For the example in this paper, we use IEEE single precision floating-point numbers. The floating-point type is declared using a call to hwFloat, which makes a floating-point type object with an 8-bit exponent and a 24-bit mantissa following the IEEE specification. We can also create floating-point numbers with other precisions, fixed-point and/or integer types. Types used at I/Os propagate through the DAG and hence define the types of operator nodes. Casting functions can be used to convert and constrain types further within the design.

    // Create I/Os
    input("din", hwFloat(8, 24));
    output("dout", hwFloat(8, 24));

    // Average computation
    eval("prev_din = streamShift(-1, din)");
    eval("next_din = streamShift(1, din)");
    eval("avg = (prev_din + din + next_din) / 3");

    // 8-bit counter
    eval("count = simpleCounter(8, 255)");

    // Select logic with N hard-coded to 10
    eval("sel = (count > 1) & (count < 10)");

    // Mux connected to output
    eval("dout <- sel ? avg : din");

Listing 3. 1D averaging design implemented using Photon expressions.

One drawback of using Java method calls to create a DAG is verbosity, making it hard to read the code or relate lines back to the original specification.
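As an aside, hwFloat(8, 24) corresponds exactly to Java's own 32-bit float: an 8-bit biased exponent and a 24-bit mantissa of which 23 bits are stored explicitly. This small sketch (not Photon code) pulls the fields apart to show the layout.

```java
// Decomposes a Java float into the IEEE single precision fields that
// hwFloat(8, 24) describes: 8-bit biased exponent, 23 stored mantissa bits
// (the 24th mantissa bit is the implicit leading one for normal numbers).
public class FloatFields {
    public static int exponentBits(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;  // biased exponent
    }
    public static int storedMantissaBits(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;       // 23 stored bits
    }
}
```

For example, 1.0f carries the bias value 127 in its exponent field and an all-zero stored mantissa.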
To resolve the function-call verbosity the Photon library provides a mechanism for expressing computation using a simple expression-based language. Statements in this integrated expression parser can be written as regular Java strings passed to an eval method. The eval method uses the statements in the string to call the appropriate methods to extend the DAG. To demonstrate our eval expressions, Listing 3 shows how our 1D averaging example from Figure 1 is implemented in Photon using eval calls.

4.2. Compilation and Hardware Generation

In addition to using Java for design specification, Photon also implements the compilation and hardware generation process entirely in Java. Photon's design management features cover optimization of Photon designs, generation of VHDL code, and calling external programs such as synthesis, simulation, IP generation, and place-and-route. After a Photon design is fully specified, Photon turns the specified DAG into a design which can be implemented in hardware. Photon achieves this primarily by executing a number of "graph-passes". A graph-pass is a piece of Java code which visits every node in the DAG in topological order. Typically, Photon passes transform the graph by adding and/or deleting nodes, for example to implement optimizations. Photon has a default set of optimization passes which are used for all designs, but users may also develop their own, for example to detect application-specific combinations of nodes in a graph and mutate them to improve the hardware implementation. Of the default graph-passes the most important are those which align the data stream inputs to nodes, inserting buffering or clock-enable logic as illustrated in the difference between Figure 1 and Figure 2. We refer to this process as 'scheduling' the graph. We perform scheduling using two graph-passes. The first scheduling pass traverses the graph passively, collecting data about the latency (pipeline-depth) of each node in the graph.
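A user-defined graph-pass of the kind described above can be illustrated with a stand-alone sketch. The node model below is hypothetical, not Photon's; the pass visits inputs before the nodes that consume them (a topological order for this tree-shaped graph) and fuses additions of two constants into a single value node.

```java
// Illustrative stand-alone "graph-pass" (hypothetical node model, not the
// Photon API): fold additions whose inputs are both constants into one
// value node, cascading bottom-up through the graph.
public class ConstantFoldPass {
    public static class Node {
        public String op;          // "const" or "add"
        public double value;       // meaningful when op == "const"
        public Node left, right;   // inputs when op == "add"
    }
    public static Node constant(double v) {
        Node n = new Node(); n.op = "const"; n.value = v; return n;
    }
    public static Node add(Node l, Node r) {
        Node n = new Node(); n.op = "add"; n.left = l; n.right = r; return n;
    }

    // Post-order traversal visits children first - a topological order for
    // a tree-shaped DAG - so folds propagate upward in a single pass.
    public static Node fold(Node n) {
        if (n.op.equals("add")) {
            n.left = fold(n.left);
            n.right = fold(n.right);
            if (n.left.op.equals("const") && n.right.op.equals("const")) {
                return constant(n.left.value + n.right.value);
            }
        }
        return n;
    }
}
```

Running the pass on (1 + 2) + 3 collapses the whole expression to the single constant 6, removing two adder nodes from the graph.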
We then determine an offset in our core pipeline at which each node should be placed in order to ensure that data for all its inputs arrives in synchrony. After all the offsets in a schedule are generated, a second pass applies these offsets by inserting buffering to align node inputs. Sub-optimal offsets cause unnecessary extra buffering to be inserted into the graph, wasting precious BlockRAM and shift-register resources. To combat this inefficiency we calculate a schedule for the offsets using Integer Linear Programming (ILP). Our ILP formulation ensures all nodes are aligned such that their input data arrives at the same time while minimising the total number of bits used in buffering. Thus, Photon's scheduled designs always have optimal buffering. After all other graph-passes, a final graph-pass produces a hardware design. By this stage in compilation every node in the DAG has a direct mapping to an existing piece of parameterisable IP. Thus, this final pass creates a hardware design by instantiating one IP component per node in the graph. Hardware is created in Photon using further Java classes to describe structural designs, either directly using Java, by including external pre-written HDL, or by running external processes to generate IP, e.g. CoreGen for floating-point units. After a design is fully described, external synthesis or simulation tools are invoked by Java to produce the required output for this execution. The system used to implement this low-level structural design and tool automation is based on the model described in [1].

5. Case Study

As a case study we consider a simple 2D convolution filter. This kind of design is common in many digital image processing applications. The filter we implement is shown in Figure 3. The filter is separable, using the equivalent of two 1D 5-point convolution operators with a total operation count of 8 additions/subtractions and (after algebraic optimization to factor out common sub-expressions) 5 multiplications per input point.

Figure 3. Shape of 2D convolution filter implemented in Photon case-study.

                  No opts.  BRAM opts.  Mult. opts.  All opts.
    LUT/FF Pairs  6192      6297        5851         5925
    DSPs          10        10          6            6
    RAMB36s       18        4           18           4

Table 2. Resource usage for 2D convolution filter on a Virtex 5 with various optimizations.

5.1. Optimizations

The compilation of the convolution case study illustrates two of the optimization graph-passes in the Photon compilation process. The Photon implementation of the filter makes use of several large stream-shifts on the input data. These shifts are necessary as each output data-point requires the 9 surrounding points to compute the convolved value. These stream-shifts result in a large number of buffers being added to the Photon design. Photon helps reduce this buffering using a graph-pass that combines the multiple delay buffers into a single long chain of buffers. This ensures each data item is only stored once, reducing buffering requirements. Photon is also able to use the precise value of the filter coefficient constants to optimize the floating-point multipliers. Specifically, some of the coefficients are a power of two, which can be highly optimized. To implement this, Photon includes another graph-pass which identifies floating-point multiplications by a power of two and replaces them with a dedicated node representing a dedicated hardware floating-point multiply-by-two IP core. This IP core uses a small number of LUTs to implement the multiplication rather than a DSP as in the conventional multipliers.

5.2. Implementation Results

For synthesis we target our filter design to a Xilinx Virtex-5 LX110T FPGA, with a clock frequency of 250MHz. At this speed, with data arriving and exiting the circuit once per cycle, we achieve a sustained computation rate of 1GB/s. Table 2 shows the area impact of the Photon optimization graph-passes on the filter hardware.
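The reason multiply-by-a-power-of-two is so cheap in floating point can be seen in software: for normal IEEE numbers it amounts to incrementing the 8-bit exponent field. The sketch below illustrates the principle only; Photon's actual IP core and its special-case handling are not described in the paper.

```java
// Why multiplying by two is cheap in IEEE floating point: for normal
// numbers it is a simple increment of the exponent field, needing a few
// LUTs in hardware instead of a DSP block. (Illustrative sketch only;
// not Photon's IP core.)
public class MulByTwo {
    public static float timesTwo(float f) {
        int bits = Float.floatToIntBits(f);
        int exp = (bits >>> 23) & 0xFF;
        if (exp == 0 || exp >= 0xFE) {
            // zero, subnormal, near-overflow, infinity or NaN:
            // fall back to an ordinary multiply for the special cases
            return f * 2.0f;
        }
        return Float.intBitsToFloat(bits + (1 << 23)); // exponent + 1
    }
}
```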
The multiplier power-of-two substitution pass reduces the number of DSP blocks used from 10 to 6, and the delay merging pass reduces BRAM usage from 18 RAMB36s to 4. The number of LUTs required for the original and optimized designs is similar.

6. Conclusion

In this paper we introduce Photon, a Java-based FPGA programming tool. We describe the programming model for Photon, in which data I/O is separated from computation, allowing designs to be implicitly easy to pipeline and hence perform well in an FPGA. We give an overview of Photon's implementation as a library directed by user-created Java programs. Finally, we present a case study demonstrating that Photon's pluggable optimization system can be used to improve the resource utilisation of designs. Our current and future work with Photon includes developing a system for making it easier to create the data I/O logic external to Photon designs, and creating more advanced optimization passes.

References

[1] J. A. Bower, W. N. Cho, and W. Luk, "Unifying FPGA hardware development," in International Conference on Field-Programmable Technology, December 2007, pp. 113–120.
[2] Impulse Accelerated Technologies Inc., "ImpulseC," http://www.impulsec.com/, 2008.
[3] Agility, "DK design suite," http://www.agilityds.com/, 2008.
[4] O. Mencer, M. Morf, and M. J. Flynn, "PAM-Blox: High performance FPGA design for adaptive computing," in IEEE Symposium on FPGAs for Custom Computing Machines, Los Alamitos, CA, 1998, pp. 167–174.
[5] P. Bellows and B. Hutchings, "JHDL - An HDL for reconfigurable systems," in IEEE Symposium on FPGAs for Custom Computing Machines, Los Alamitos, CA, 1998, pp. 175–184.
[6] O. Mencer, "ASC: A stream compiler for computing with FPGAs," IEEE Transactions on CAD of ICs and Systems, vol. 25, pp. 1603–1617, 2006.
[7] P. Hudak, "Modular domain specific languages and tools," Intl. Conf. on Software Reuse, vol. 00, p. 134, 1998.
[8] O. Pell and R. G.
Clapp, "Accelerating subsurface offset gathers for 3D seismic applications using FPGAs," SEG Tech. Program Expanded Abstracts, vol. 26, no. 1, pp. 2383–2387, 2007.
[9] D. B. Thomas, J. A. Bower, and W. Luk, "Hardware architectures for Monte-Carlo based financial simulations," in International Conference on Field-Programmable Technology, December 2006, pp. 377–380.

Automated Design Approach for On-Chip Multiprocessor Systems

P. Mahr, H. Ishebabi, B. Andres, C. Loerchner, M. Metzner and C. Bobda
Department of Computer Science, University of Potsdam, Germany
{pmahr,ishebabi,andres,lorchner,metzner,bobda}@cs.uni-potsdam.de

Abstract

This paper presents a design approach for adaptive multiprocessor systems-on-chip on FPGAs. The goal of this design approach is to ease the implementation of an adaptive multiprocessor system by creating components, such as processing nodes or memories, from a parallel program. For this purpose message-passing, a paradigm for parallel programming on multiprocessor systems, is used. The analysis and simulation of the parallel application provides data for the formulation of constraints on the multiprocessor system. These constraints are used to solve an optimization problem with Integer Linear Programming: the creation of a suitable abstract multiprocessor hardware architecture and the mapping of tasks onto processors. The abstract architecture is then mapped onto a concrete architecture of components, such as a specific PowerPC or soft-core processor, and is further processed using a vendor tool-chain to generate a configuration file for an FPGA.

1. Introduction

As apparent in current developments, the reduction of transistor size and the exploitation of instruction-level parallelism can no longer be relied upon to enhance the performance of processors [1]. Instead, multi-core processors are a common way of enhancing performance by exploiting the parallelism of applications.
However, designing and implementing multiple processors on a single chip leads to new problems which are absent in the design of single-core processors. For example, an optimal communication infrastructure between the processors needs to be found. Also, software developers have to parallelize their applications so that the performance of the application is increased through multiple processors. In the case of multiprocessor systems-on-chip (MPSoCs), which combine embedded heterogeneous or homogeneous processing nodes, memory systems, interconnection networks and peripheral components, even more problems arise, partly because of the variety of technologies available and partly because of their sophisticated functionality [2], [3]. To reduce design time, high-level design approaches can be employed. In [4], [5], [6] and [7] design methodologies and corresponding tool support are described. In principle, two communication paradigms for parallel computing with multiprocessor systems exist: communication through shared memory (SMP), i.e. cache or memory on a bus-based system, and the passing of messages (MPI) through a communication network. SMP architectures, like the Sun Niagara processor [8] or the IBM Cell BE processor [9], are the common multiprocessors today. MPI is typically used in computer clusters, where physically distributed processors communicate through a network. This paper presents a survey of our developments in the area of adaptive MPSoC design with FPGAs (Field-Programmable Gate Arrays) as a flexible platform for Chip Multi-Processors. In section 2 an overview of the proposed design approach for MPSoCs is given, describing in general the steps for architectural synthesis, starting with the analysis and simulation of a parallel program and ending with the generation of a bitfile for the configuration of an FPGA.
In the following section 3 an on-chip message-passing software library for communication between tasks of a parallel program, and a benchmark for the purpose of evaluation, are presented. Section 4 summarizes the formulation of architecture constraints for the design space exploration with Integer Linear Programming. These constraints are formulated from the results of the analysis and simulation of a parallel program. The following section 5 gives an overview of the creation of MPSoCs using abstract components. Finally, this paper is concluded in section 6 and a brief overview of future work is given in section 7.

2. System design using architectural synthesis

To get an efficient multiprocessor system-on-chip from a parallel program several approaches are possible. In figure 1 our proposed synthesis flow using an analytical approach is shown. The architectural synthesis flow starts with parallel applications that are modeled as a directed graph, where the nodes represent tasks and the edges represent communication channels [10].

Figure 1. Architectural Synthesis Flow

In the first step of the design flow, information on data traffic and on task precedence is extracted from functional simulations of the parallel program. Information on the number of cycles of a task when executed on a specific processor is determined from cycle-accurate simulations. This information is used to formulate an instance of an Integer Linear Programming (ILP) problem. In the following step, called Abstract Component creation, a combinatorial optimization is done by solving an ILP problem. In addition to the information gathered in the first step, platform constraints, e.g. area and speed of the target platform, are needed as well. As a result of this step an abstract system description including the (abstract) hard- and software parts is generated. The third step is called Component mapping. The abstract system description, which consists of abstract processors, memories, communication components or hardware accelerators and software tasks linked onto abstract processors, is mapped onto a concrete architecture of components like PPC405, MicroBlaze or on-chip BRAMs. If needed, an operating system can be generated with scripts and makefiles and can be mapped onto a processor as well. This step can be done using the PinHaT software (Platform-independent Hardware generation Tool) [11]. In the final step a bitfile is generated from the concrete components using the platform vendor tool-chain, performing logic synthesis and place & route. However, simulations for the validation of the system at higher and lower levels are needed as well and should be performed during the proposed steps.

3. On-Chip Message Passing

In this section the on-chip communication between processing nodes using a subset of the MPI standard is described [12]. For this purpose a small-sized MPI library was developed (see figure 2), which is similar to the approaches described in [13], [14] and [15].

Figure 2. SoC-MPI Library

The library consists of two layers: a network-independent layer (NInL) and a network-dependent layer (NDeL), separating the hardware-dependent part of the library from the hardware-independent part. The advantage of this separation is the easy migration of the library to other platforms. The NInL provides MPI functions, like MPI_Send, MPI_Receive, MPI_BSend or MPI_BCast. These functions are used to perform the communication between processes in the program. The NDeL is an accumulation of network-dependent functions for different network topologies. In this layer the ranks and addresses for concrete networks are determined, and the cutting and sending of messages depending on the chosen network is carried out. Currently the length of a message is limited to 64 Bytes, due to the limited on-chip memory of FPGAs. Longer messages are therefore cut into several smaller messages and sent in series.
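The message cutting performed by the NDeL can be sketched as follows (class and method names are illustrative, not the SoC-MPI API): a payload longer than the 64-byte limit is split into maximum-size fragments plus a remainder, to be sent in series.

```java
import java.util.*;

// Sketch of the NDeL message cutting described above (illustrative names,
// not the SoC-MPI API): payloads over the 64-byte on-chip limit are split
// into several smaller messages to be sent in series.
public class MessageCutter {
    static final int MAX_PAYLOAD = 64;   // bytes per on-chip message

    public static List<byte[]> cut(byte[] message) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < message.length; off += MAX_PAYLOAD) {
            int len = Math.min(MAX_PAYLOAD, message.length - off);
            chunks.add(Arrays.copyOfRange(message, off, off + len));
        }
        return chunks;
    }
}
```

A 200-byte message, for example, becomes three full 64-byte fragments followed by one 8-byte remainder.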
The parameters of the MPI functions, like count, comm or dest (destination), are also used as signals and parameters for the hardware components of the network topology. That is, the parameters are used to build the header and the data packets for the communication over a network. The MPI parameters are directly used for the control, data and address signals. The following MPI functions are currently supported: Init, Finalize, Initialized, Wtime, Wtick, Send, Recv, SSend, BSend, RSend, SendRecv, Barrier, Gather, BCast, Comm_Size, Comm_Rank.

Figure 3. Configuration of processing nodes

In figure 3 several processing nodes are connected together via a star network. Additionally, nodes 0 and 1 are directly connected together via FSL (Fast Simplex Link) [16]. Each processing node has only a subset of the SoC-MPI Library, with the dependent functions for its network topology.

3.1. Benchmarks

The MPI library is evaluated using Intel MPI Benchmarks 3.1, which is the successor of the well-known package PMB (Pallas MPI Benchmarks) [17]. The MPI implementation was benchmarked on a Xilinx ML-403 evaluation platform [18], which includes a Virtex 4 FPGA running at 100MHz. Three MicroBlaze soft-core processors [19] were connected together via a star network. All programs were stored in the on-chip memories.

Figure 4. Benchmarks of the SOC-MPI Library

In Figure 4 the results of the five micro benchmarks are shown. Due to the limited on-chip memory not all benchmarks could be performed completely. Furthermore, a small decay between 128 and 256 Bytes message size exists, because the maximum MPI message length is currently limited to 251 Bytes and a message larger than that must be split into several messages. Further increase of the message size would lead to a bandwidth closer to the maximum possible bandwidth, which is limited by the MicroBlaze and was measured at approximately 14 MBytes/s.
4. Abstract component creation using Integer Linear Programming

In this flow, Integer Linear Programming (ILP) is used for automated design space exploration with the goal of locating the best possible abstract architecture for a given parallel application under given constraints. The simultaneous optimization problem is to map parallel tasks to a set of processing elements and to generate a suitable communication architecture which meets the constraints of the target platform and minimizes the overall computation time of the parallel program. The input for this step is obtained by using a profiling tool for the mpich2 package. In the following two subsections, area and time constraints of processors and of the communication infrastructure are described separately.

4.1. Processors - sharing constraint, area constraint and costs

A few assumptions about processors and tasks need to be made, because it is possible to map several tasks onto one processor: (1) a task scheduler exists, so that scheduling is not involved in the optimization problem. (2) Task mapping is static. (3) The instruction sequence of a task is stored in the local program memory of the processor, e.g. instruction cache, and hence the size of the local program memory limits the number of tasks which can be mapped onto a processor. (4) Finally, the cost of switching tasks in terms of processor cycles does not vary from task to task. Let $I_i \in \{I_0, \dots, I_n\}$ be a task, $J_j \in \{J_0, \dots, J_m\}$ a processor and $x_{ij} \in \{0, 1\}$ a binary decision variable, where $x_{ij} = 1$ means that task $I_i$ is mapped onto processor $J_j$.

$$\sum_{j=0}^{m} x_{ij} = 1, \quad \forall I_i \qquad (1)$$

A constraint for task mapping (equation 2), called the address space constraint, and the cost of task switching (equation 3) can be formulated, where $s_{ij}$ is the size of a task $I_i$ on a processor $J_j$ with program memory size $s_j$, and $t_j$ is the cost (time) of task switching.
$$\sum_{i=0}^{n} x_{ij} \cdot s_{ij} \le s_j, \quad \forall J_j \qquad (2)$$

$$T_{SWITCH} = \sum_{j=0}^{m} \sum_{i=0}^{n} x_{ij} \cdot t_j \qquad (3)$$

For the calculation of the area of the processors $A_{PE}$, the area of a single processor $a_j$ is needed. Because $x_{ij}$ only shows whether a task $I_i$ is mapped onto a processor $J_j$ and does not show the number of processors in the system or the number of instantiations of a processor, an auxiliary variable $v_j \in \{0, 1\}$ is needed. For each instance of a processor $J_j$ there is a corresponding virtual processor $v_j$, and of all tasks mapped to a certain processor there is only one task which is mapped to the corresponding virtual processor. This leads to the following constraint (equation 4), so that the area of the processors can be calculated with equation 5.

$$v_j \le \sum_{i=0}^{n} x_{ij}, \quad \forall J_j \qquad (4)$$

$$A_{PE} \ge \sum_{j=0}^{m} v_j \cdot a_j \qquad (5)$$

4.2. Communication networks - Network capacity and area constraint

Several assumptions have to be made before constraints on the communication network can be formulated. The communication of two tasks mapped onto the same processor is done via intra-processor communication, which has a negligible communication latency and overhead compared to memory access latency. All processors can use any of the available communication networks and can use more than one network. A communication network has arbitration costs resulting from simultaneous accesses on the network. It is assumed that tasks are not prioritized and that an upper bound on arbitration time can be computed for each network topology depending on the number of processors. Finally, it is not predictable when two or more tasks will attempt to access the network, though a certain probability can be assumed. $\lambda_{i_1,i_2}$ is an auxiliary 0-1 decision variable that is 1 if two communicating tasks are mapped on different processors. The sum of $x_{i_1 j_1}$ and $x_{i_2 j_2}$ equals two if the tasks are on different processors, as seen in equation 6.
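In miniature, the mapping problem of equations 1, 2 and 12 (ignoring the network and switching terms) can be solved by exhaustive search; the ILP formulation exists precisely because such a search does not scale. All numbers below are invented for illustration.

```java
// Toy exhaustive version of the task-mapping problem the ILP solves at
// scale (all figures invented): each task goes on exactly one processor
// (eq. 1), task code must fit the processor's program memory (eq. 2), and
// we minimize the summed execution time term of eq. 12, without the
// network and switching costs.
public class ToyMapper {
    // time[i][j] = cycles for task i on processor j; size[i][j] = code
    // size of task i on processor j; mem[j] = program memory of processor j.
    public static int bestTime(int[][] time, int[][] size, int[] mem) {
        return search(0, new int[mem.length], time, size, mem);
    }
    private static int search(int task, int[] used,
                              int[][] time, int[][] size, int[] mem) {
        if (task == time.length) return 0;       // all tasks placed
        int best = Integer.MAX_VALUE;
        for (int j = 0; j < mem.length; j++) {
            if (used[j] + size[task][j] > mem[j]) continue;  // eq. 2 violated
            used[j] += size[task][j];
            int rest = search(task + 1, used, time, size, mem);
            if (rest != Integer.MAX_VALUE) {
                best = Math.min(best, time[task][j] + rest);
            }
            used[j] -= size[task][j];            // backtrack
        }
        return best;
    }
}
```

For two tasks and two processors where the small processor can hold only one task, the search correctly trades a slower placement of task 0 against keeping the fast processor free.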
\lambda_{i_1,i_2} = \frac{x_{i_1 j_1} + x_{i_2 j_2}}{2} \qquad (6)

A communication topology $C_k \in \{C_0, \ldots, C_K\}$ may have a maximum capacity of processors attached to it. This constraint is described in equation 7. $y_k$ is a binary decision variable with value 1 if a communication topology $C_k$ is used for the communication between two tasks $I_{i_1}$ and $I_{i_2}$, and 0 otherwise. $\lhd$ is a precedence operator: $I_{i_1} \lhd I_{i_2}$ means that a data transfer is performed from $I_{i_1}$ to $I_{i_2}$. The maximum number of processes which can use a topology $C_k$ is denoted by $M_k$.

y_k + \sum_{I_{i_1}, I_{i_2} \,|\, I_{i_1} \lhd I_{i_2}} \lambda_{i_1,i_2} \leq M_k, \quad \forall C_k \qquad (7)

The total area cost of the communication network (resources for routing) can be calculated with equation 8, where $A_k$ is the area cost of topology $C_k$.

A_{NET} \geq \sum_{k=0}^{K} A_k \cdot y_k \qquad (8)

The cost of the topology in terms of computation time is calculated in equation 10, where $z_{k i_1 i_2}$ is a binary decision variable which is 1 if network $C_k$ is used by the two tasks, and 0 otherwise. $D_{i_1 i_2}$ is the amount of data to be transferred between the two communicating tasks, and $p_k$ is the probability that network arbitration will be involved when a task wants to communicate. The upper bound on the arbitration time is $\tau_k$.

z_{k i_1 i_2} \geq \lambda_{i_1,i_2} + y_k - 1 \qquad (9)

T_{NET} = \sum_{I_{i_1}, I_{i_2} \,|\, I_{i_1} \lhd I_{i_2}} \; \sum_{k=0}^{K} (D_{i_1 i_2} + \tau_k \cdot p_k) \cdot z_{k i_1 i_2} \qquad (10)

Finally, the total area cost $A$ is calculated from the area of the processing elements $A_{PE}$ (equation 5) and the area of the routing resources $A_{NET}$ (equation 8).

A \geq A_{PE} + A_{NET} \qquad (11)

The cost in computation time can be calculated with equation 12, where $T_{ij}$ is the time required to process a task $I_i$ on a processor $J_j$. The objective, in this case, is to minimize the computation time of a (terminating) parallel program; for non-terminating programs, like signal processing programs, the objectives are different.

T = \min \left( \sum_{j=0}^{m} \sum_{i=0}^{n} x_{ij} \cdot T_{ij} + T_{NET} + T_{SWITCH} \right) \qquad (12)

5. Component Mapping

In this section the mapping of abstract components onto concrete ones is described.
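As a concrete illustration of the optimization problem formulated in section 4, the following sketch brute-forces the task-to-processor mapping under the address space constraint (2), minimizing a simplified form of objective (12) without the network term. All names and data here are hypothetical; the paper solves this with an ILP solver, not by enumeration.

```python
from itertools import product

# Brute-force sketch (illustrative only, not the authors' tool chain):
# enumerate all task-to-processor mappings, keep those satisfying the
# address space constraint (2), and minimize run time plus switching
# cost, i.e. objective (12) without the T_NET term.

def best_mapping(T, s, s_cap, t_switch):
    """T[i][j]: run time of task i on processor j; s[i][j]: program size;
    s_cap[j]: program memory of processor j; t_switch[j]: switch cost.
    Returns (total_time, mapping) with mapping[i] = processor of task i."""
    n, m = len(T), len(T[0])
    best = (float("inf"), None)
    for mapping in product(range(m), repeat=n):
        used = [0] * m
        for i, j in enumerate(mapping):
            used[j] += s[i][j]
        if any(used[j] > s_cap[j] for j in range(m)):
            continue  # violates address space constraint (2)
        cost = sum(T[i][j] for i, j in enumerate(mapping))   # x_ij * T_ij
        cost += sum(t_switch[j] for j in mapping)            # T_SWITCH
        best = min(best, (cost, mapping))
    return best
```

For realistic problem sizes this m^n enumeration is intractable, which is one reason the paper formulates the search as an ILP instead.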
This component-based approach is similar to the one described by Cesário et al. [20], where a high-level component-based methodology and design environment for application-specific MPSoC architectures is presented. For this task, a component-based design environment called PinHaT, for the generation and configuration of the system infrastructure, was developed. This environment offers a vendor-independent framework with which users can specify the requirements of their system and automatically produce the necessary system configuration and startup code. The tool was developed as a Java application. PinHaT generally follows a two-step approach. In the first step, an abstract specification of the system using abstract components like CPUs, memories or hardware accelerators is described. In the following step, these abstract components are refined and mapped to concrete components, e.g. a specific CPU (PPC405, MicroBlaze) or a hardware divider. The software tasks are also mapped onto the concrete components. The structure of PinHaT is shown in figure 5. A detailed overview of the PinHaT tool is given in [11].

5.1. Generation of the System Infrastructure

The generation of the system infrastructure, that is, the mapping of an abstract system description onto a concrete hardware description, is done by PinHaT. PinHaT uses XML in conjunction with a document type definition (DTD) file as the input format for the abstract system description. An input consists of valid modules, which are CPUs, memories, communication modules, periphery and hardware accelerators. The hardware mapping onto concrete components is divided into three phases. In the first phase, an internal XML tree is built by parsing the input file. For every node of this tree an adequate class is instantiated. These classes know how to process their own parameters. Such classes can easily be added to the framework to extend the IP-core base.
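The node-to-class dispatch of the first phase can be sketched as follows. Tag names, class names and the XML schema below are invented for illustration; PinHaT's actual schema is defined by its DTD and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Sketch of the first mapping phase: parse the abstract system description
# and instantiate one handler class per XML node (hypothetical schema).

class Module:
    def __init__(self, node):
        self.name = node.get("name")
        self.params = {p.get("name"): p.get("value")
                       for p in node.findall("param")}

class CPU(Module): pass
class Memory(Module): pass

# New handler classes can be registered here to extend the IP-core base.
HANDLERS = {"cpu": CPU, "memory": Memory}

def parse_system(xml_text):
    root = ET.fromstring(xml_text)
    return [HANDLERS[child.tag](child) for child in root]

system = parse_system(
    "<system>"
    "<cpu name='cpu0'><param name='type' value='MicroBlaze'/></cpu>"
    "<memory name='ram0'><param name='size' value='64k'/></memory>"
    "</system>")
```

Because each node type carries its own class, adding support for a new component only requires registering one more entry in the handler table.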
In a subsequent step, another parser creates the platform-specific hardware information file from the gathered information. In the second phase, individual mappers for all components and target platforms are created, followed by the last phase, where a mapper creates the platform-dependent hardware description files. These hardware description files are then passed to the vendor's tool chain, e.g. Xilinx EDK or Altera Quartus II.

5.2. Configuration of the System Infrastructure - SW Mapping

In the case of software, a task is mapped onto a concrete processor. This is in contrast to the mapping of abstract components, e.g. processors or memories, to concrete ones during the generation of the system infrastructure. For the mapping step, the parameters of the software must be specified for each processor. The parameters include information about the application or the operating system, like source code, libraries or the OS type. With this information, scripts and Makefiles for building the standalone applications and the operating systems are created. While standalone applications only need compiling and linking of the application, building an operating system is more involved: depending on the operating system, different steps, like configuration of the file system or of the kernel parameters, are necessary. The result of the task mapping is an executable file for each processor in the system.

Figure 5. Structure of PinHaT

The component mapping is divided into the generation and the configuration of the system infrastructure, where hardware is generated and software is configured. In this flow, the input to PinHaT is obtained by high-level synthesis, as described in section 4.

6. Conclusion

In this paper, a concept for the design automation of multiprocessor systems on FPGAs was presented. A small-sized MPI library was implemented to use message passing for the communication between the tasks of a parallel program.
Simulation and analysis of the parallel program are carried out to gather information about task precedence, task interaction and data traffic between tasks. This information is needed to formulate the constraints of an Integer Linear Programming problem. As a result, an abstract system description, consisting of hardware components, like processing nodes or hardware accelerators, and linked tasks, is created. With the help of the PinHaT software, the components of the abstract system description can be mapped onto concrete components, like specific processors. Finally, the configuration file for an FPGA can be created using the vendor tool chain.

7. Future Work

Currently the PinHaT software is being extended into an easy-to-use software solution including the architectural synthesis of a parallel program. Also, new abstract and corresponding concrete components, like new network topologies or processing nodes such as the OpenRISC processor, will be included to enhance the flexibility of the architectural synthesis flow. Furthermore, concepts of adaptivity for multiprocessor systems are to be analysed, and a detailed evaluation of on-chip message passing in the face of communication overhead and latency needs to be carried out.

References
[1] Kunle Olukotun and Lance Hammond. The future of microprocessors. Queue, 3(7):26–29, 2005.
[2] Wayne Wolf. The future of multiprocessor systems-on-chips. In DAC '04: Proceedings of the 41st annual conference on Design automation, pages 681–685, New York, NY, USA, 2004. ACM.
[3] Gilles Sassatelli, Nicolas Saint-Jean, Cristiane Woszezenki, Ismael Grehs, and Fernando Moraes. Architectural issues in homogeneous NoC-based MPSoC. In RSP '07: Proceedings of the 18th IEEE/IFIP International Workshop on Rapid System Prototyping, pages 139–142, Washington, DC, USA, 2007. IEEE Computer Society.
[4] D.D. Gajski, Jianwen Zhu, R. Dömer, A. Gerstlauer, and Shuqing Zhao. SpecC: Specification Language and Methodology. Springer, 2000.
[5] Tero Kangas, Petri Kukkala, Heikki Orsila, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen, Jouni Riihimäki, and Kimmo Kuusilinna. UML-based multiprocessor SoC design framework. Trans. on Embedded Computing Sys., 5(2):281–320, 2006.
[6] Andy D. Pimentel, Cagkan Erbas, and Simon Polstra. A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE Trans. Comput., 55(2):99–112, 2006.
[7] Blanca Alicia Correa, Juan Fernando Eusse, Danny Munera, Jose Edinson Aedo, and Juan Fernando Velez. High level system-on-chip design using UML and SystemC. In CERMA '07: Proceedings of the Electronics, Robotics and Automotive Mechanics Conference (CERMA 2007), pages 740–745, Washington, DC, USA, 2007. IEEE Computer Society.
[8] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way multithreaded SPARC processor. IEEE Micro, 25(2):21–29, 2005.
[9] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. Dev., 49(4/5):589–604, 2005.
[10] Concepcio Roig, Ana Ripoll, and Fernando Guirado. A new task graph model for mapping message passing applications. IEEE Trans. Parallel Distrib. Syst., 18(12):1740–1753, 2007.
[11] Christophe Bobda, Thomas Haller, Felix Mühlbauer, Dennis Rech, and Simon Jung. Design of adaptive multiprocessor on chip systems. In SBCCI '07: Proceedings of the 20th annual conference on Integrated circuits and systems design, pages 177–183, New York, NY, USA, 2007. ACM.
[12] MPI Forum. http://www.mpi-forum.org/, accessed 1 April 2008.
[13] J. A. Williams, I. Syed, J. Wu, and N. W. Bergmann. A reconfigurable cluster-on-chip architecture with MPI communication layer. In FCCM '06: Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 350–352, Washington, DC, USA, 2006. IEEE Computer Society.
[14] T. P. McMahon and A. Skjellum. eMPI/eMPICH: Embedding MPI.
In MPIDC '96: Proceedings of the Second MPI Developers Conference, page 180, Washington, DC, USA, 1996. IEEE Computer Society.
[15] Manuel Saldaña and Paul Chow. TMD-MPI: An MPI implementation for multiple processors across multiple FPGAs. In FPL, pages 1–6, 2006.
[16] Xilinx Fast Simplex Link (FSL). http://www.xilinx.com/products/ipcenter/FSL.htm, accessed 7 August 2008.
[17] Intel MPI Benchmarks. http://www.intel.com/cd/software/products/asmo-na/eng/219848.htm, accessed 9 April 2008.
[18] Xilinx ML403 Evaluation Platform. http://www.xilinx.com/products/boards/ml403/reference_designs.htm, accessed 9 April 2008.
[19] Xilinx MicroBlaze Processor. http://www.xilinx.com/products/design_resources/proc_central/microblaze.htm, accessed 9 April 2008.
[20] W. Cesário, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A. A. Jerraya, and M. Diaz-Nava. Component-based design approach for multicore SoCs. In DAC '02: Proceedings of the 39th conference on Design automation, pages 789–794, New York, NY, USA, 2002. ACM.

ASM++ charts: an intuitive circuit representation ranging from low level RTL to SoC design

S. de Pablo, L.C. Herrero, F. Martínez
University of Valladolid
Valladolid (Spain)
[email protected]

M. Berrocal
eZono AG
Jena (Germany)
[email protected]

Abstract

This article presents a methodology to describe digital circuits from register transfer level to system level. When designing systems, it encapsulates the functionality of several modules and also encapsulates the connections between those modules. To achieve these results, the possibilities of Algorithmic State Machines (ASM charts) have been extended, and a compiler has been developed. Using this approach, a System-on-a-Chip (SoC) design becomes a set of linked boxes, where several special boxes encapsulate the connections between modules. The compiler processes all required boxes and files, and then generates the corresponding HDL code, valid for simulation and synthesis. A small SoC example is shown.

1. Introduction

System-on-a-Chip (SoC) designs integrate processor cores, memories and custom logic into complete systems. The increased complexity requires more effort and more efficient tools, but also accurate knowledge on how to connect new computational modules to new peripheral devices using ever newer communication protocols and standards. A hierarchical approach may encapsulate the functionality of several modules in black boxes. This technique effectively reduces the number of components, but system integration becomes more and more difficult as new components are added every day. Thus, the key to a short design time, enabling "product on demand", is the use of a set of predesigned components which can be easily integrated through a set of also predesigned connections, in order to build a product.

For this reason, Xilinx and Altera have proposed their high-end tools, named Embedded Development Kit [1] and SoPC Builder [2] respectively, which allow the automatic generation of systems. Using these tools, designers may build complete SoC designs based on these vendors' processors and peripheral modules in a few hours. At a lower scale, similar results may be found on the Hardware Highway (HwHw) web tool [3].

On the language side a parallel effort has been observed. In particular, SystemVerilog [4] now includes an 'interface' element that allows designers to join several inputs and outputs together in one named description, so textual designs may become easier to read and understand. At a different scale, pursuing a higher level of abstraction, the promising SpecC top-down methodology [5] first describes computations and communications at an abstract and untimed level, and then descends to an accurate and precise level where connections and delays are fully described.

The aim of this paper is to contribute to these efforts from a bottom-up point of view, mostly adequate for academic purposes. First of all, we present several extensions to the Algorithmic State Machine (ASM) methodology, which we have called "ASM++ charts", allowing the automatic generation of VHDL or Verilog code from these charts using a recently developed ASM++ compiler. Furthermore, these diagrams may describe hierarchical designs and define, through special boxes, how to connect different modules together.

2. ASM++ charts

The Algorithmic State Machine (ASM) method for specifying digital designs was originally documented in 1973 by C.R. Clare [6], who worked at the Electronics Research Laboratory of Hewlett Packard Labs, based on previous developments made by T. Osborne at the University of California at Berkeley [6]. Since then it has been widely applied to assist designers in expressing algorithms and to support their conversion into hardware [7-10]. Many texts on digital logic design cover the ASM method in conjunction with other methods for specifying Finite State Machines (FSM) [11-12].

A FSM is a valid representation of the behavior of a digital circuit when the number of transitions and the complexity of operations are low. The example of fig. 1 shows an FSM for a 12x12 unsigned multiplier that computes 'outP = inA * inB' through twelve conditional additions. It is fired by a signal named 'go', it signals the answer using 'done', and it indicates through 'ready' that new operands are welcome.

Figure 1. An example of FSM for a multiplier.

However, in these situations traditional ASM charts may be more accurate and consistent. As shown in fig. 2, they use three different boxes to fully describe the behavior of cycle-driven RTL designs: a "state box" with rectangular shape defines the beginning of each clock cycle and may include unconditional operations that must be executed during (marked with '=') or at the end (using the delay operator '←') of that cycle; "decision boxes" (diamonds) are used to test inputs or internal values to determine the execution flow; and finally "conditional output boxes" (ovals) indicate those operations that are executed during the same clock cycle, but only when the previous conditions are valid. Additionally, an "ASM block" includes all operations and decisions that are or can be executed simultaneously during each clock cycle.

Figure 2. Traditional ASM chart for a multiplier.

The advantages of a FSM for an overall description of a module are evident, but the ASM representation allows more complex designs through conditions that are introduced incrementally and detailed operations located where the designer specifies. However, ASM notation has several drawbacks:
– It uses the same box, a rectangular one, both for new states and for the unconditional operations executed at those states. Because of this property, ASM diagrams are compact, but they are also more rigid and difficult to read. Due to the double meaning of rectangular boxes, conditional operations must be represented using a different shape, the oval boxes. But, actually, all operations are conditional, because all of them are state dependent.
– Sometimes it is difficult to discern the frontier between different states. The complexity of some states requires the use of dashed boxes (named ASM blocks) or even different colors for different states.
– Additionally, designers must use lateral annotations for state names, for reset signals or even for links between different parts of a design (see fig. 2).
– Finally, the width of signals and ports cannot be specified when using the current notation.

The proposed ASM++ notation [13-14] tries to solve all these problems and to extend far beyond the possibilities of this methodology. The first and main change introduced by this new notation, as seen in fig. 3, is the use of a specific box for states (we propose oval boxes, very similar to the circles used in bubble diagrams), so now all operations may share the same box: a rectangle for synchronous assignments and a rectangle with bent sides for asynchronous assertions. Diamonds are kept for decision boxes because they are commonly recognized and accepted.

Figure 3. ASM++ chart ready for compilation.

Figure 3 shows additional features of ASM++ charts, included to allow their automatic compilation into HDL code. In addition to the algorithmic part, a declarative section may describe the design name, its implementation parameters, the external interface and one or more internal signals. The synchronization signal and its reset sequence can be fully specified in a very intuitive way too. A box for 'defaults' has been added to easily describe the circuit behavior when a state leaves a signal free. Furthermore, all boxes use standard VHDL or Verilog expressions, but never both of them; the ASM++ compiler usually detects the HDL and then generates valid HDL code using the same language.

3. Hierarchical design using ASM++ charts

As soon as the compiler generates the VHDL or Verilog code related to an ASM++ chart, the advanced features of modern HDL languages can easily be integrated with it. The requirements for hierarchical design have been covered through the following elements:
– Each design begins with a 'header' box that specifies the design name and, optionally, its parameters or generics.
– Any design may use one or several pages of a MS Visio 2007 document¹, saved using its VDX format. Each VDX document may include several designs identified through their header boxes.
– Any design may instantiate other designs, giving them an instance name. As soon as a lower level module is instantiated, a full set of signals named "instance_name.port_name" (see fig. 5) is created to ease the connections with other elements. Later on, any 'dot' is replaced by an 'underline' because of HDL compatibility issues.
– When the description of an instantiated module is located in another file, a 'RequireFile' box must be used before the header box to allow a joint compilation. The ASM++ compiler, however, identifies any previously compiled design, to avoid useless effort and invalid duplications.
– VHDL users may include libraries or packages using the usual 'library' and 'use' sentences, also placed before any header box.
– At present, the compiler does not support reading external HDL files in order to instantiate hand-written modules. A prototype of them, as shown in fig. 4, can be used instead.

Using these features, an example with a slightly improved multiplier can easily be designed. First of all, a prototype of a small FIFO memory is declared, as shown in fig. 4, so the compiler knows how to instantiate and connect this module, described elsewhere in a Verilog file. Then three FIFO memories are instantiated to handle the input and output data flows, as shown in fig. 5, so several processors may feed and retrieve data from this processing element.

Figure 4. A prototype of an external design.

¹ Actually, designers may also use MS Visio 2003 or ConceptDraw. However, the only supported file format is VDX.

Figure 5. An example of hierarchical design.

The ASM++ chart of fig. 5 can be compared with its arranged compilation result, shown below. The advantages of this methodology in flexibility, clarity and time saving are evident; a text-based tool is not always faster and more productive than a graphical one.
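The flattening of "instance_name.port_name" signals into HDL-compatible names, described above, can be sketched as follows. This is an illustrative helper, not the actual ASM++ compiler code.

```python
# Sketch (hypothetical helper): derive flat HDL signal names from the
# "instance_name.port_name" pairs created at instantiation, replacing
# the dot by an underscore for HDL compatibility.

def instance_signals(instance, ports):
    """Map each port name to its flattened wire name."""
    return {p: f"{instance}.{p}".replace(".", "_") for p in ports}

fifo_a = instance_signals("fifoA", ["clk", "reset", "push", "pop"])
# fifo_a["push"] == "fifoA_push", the naming convention visible in the
# generated module listing that follows.
```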
module hierarchical_design (clk, reset, inA, inB, outP,
                            readyA, readyB, readyP, pushA, pushB, popP);
parameter width = 16;   // 16x16 => 32
parameter depth = 6;    // 64-level buffers

input  clk, reset;
input  [width-1:0] inA;
input  pushA;
output readyA;
input  [width-1:0] inB;
input  pushB;
output readyB;
output [2*width-1:0] outP;
input  popP;
output readyP;

wire activate;

wire fifoA_clk, fifoA_reset;
wire [width-1:0] fifoA_dataIn, fifoA_dataOut;
wire fifoA_push, fifoA_pop;
wire fifoA_empty, fifoA_full;

fifo # ( .width(width), .depth(depth) )
fifoA ( .clk(fifoA_clk), .reset(fifoA_reset),
        .data_in(fifoA_dataIn), .data_out(fifoA_dataOut),
        .push(fifoA_push), .pop(fifoA_pop),
        .empty(fifoA_empty), .full(fifoA_full) );

wire fifoB_clk, fifoB_reset;
wire [width-1:0] fifoB_dataIn, fifoB_dataOut;
wire fifoB_push, fifoB_pop;
wire fifoB_empty, fifoB_full;

fifo # ( .width(width), .depth(depth) )
fifoB ( .clk(fifoB_clk), .reset(fifoB_reset),
        .data_in(fifoB_dataIn), .data_out(fifoB_dataOut),
        .push(fifoB_push), .pop(fifoB_pop),
        .empty(fifoB_empty), .full(fifoB_full) );

wire AxB_clk, AxB_reset;
wire AxB_go, AxB_ready, AxB_done;
wire [width-1:0] AxB_inA, AxB_inB;
wire [2*width-1:0] AxB_outP;

multiplier # ( .N(width) )
AxB ( .clk(AxB_clk), .reset(AxB_reset),
      .go(AxB_go), .ready(AxB_ready), .done(AxB_done),
      .inA(AxB_inA), .inB(AxB_inB), .outP(AxB_outP) );

wire fifoP_clk, fifoP_reset;
wire [2*width-1:0] fifoP_dataIn, fifoP_dataOut;
wire fifoP_push, fifoP_pop;
wire fifoP_empty, fifoP_full;

fifo # ( .width(2 * width), .depth(depth) )
fifoP ( .clk(fifoP_clk), .reset(fifoP_reset),
        .data_in(fifoP_dataIn), .data_out(fifoP_dataOut),
        .push(fifoP_push), .pop(fifoP_pop),
        .empty(fifoP_empty), .full(fifoP_full) );

// Default connections
assign fifoA_clk   = clk;
assign fifoB_clk   = clk;
assign AxB_clk     = clk;
assign fifoP_clk   = clk;
assign fifoA_reset = reset;
assign fifoB_reset = reset;
assign AxB_reset   = reset;
assign fifoP_reset = reset;

// User connections
assign fifoA_push   = pushA;
assign fifoA_dataIn = inA;
assign fifoA_pop    = activate;

assign fifoB_push   = pushB;
assign fifoB_dataIn = inB;
assign fifoB_pop    = activate;

assign AxB_inA = fifoA_dataOut;
assign AxB_inB = fifoB_dataOut;
assign AxB_go  = activate;

assign fifoP_push   = AxB_done;
assign fifoP_dataIn = AxB_outP;
assign fifoP_pop    = popP;

assign activate = AxB_ready
                & ~fifoA_empty & ~fifoB_empty & ~fifoP_full;

assign outP   = fifoP_dataOut;
assign readyA = ~fifoA_full;
assign readyB = ~fifoB_full;
assign readyP = ~fifoP_empty;

endmodule /// hierarchical_design

4. Encapsulating connections using pipes

Following this bottom-up methodology, the next step is to use ASM++ charts to design full systems. As stated above, a chart can be used to instantiate several modules and connect them, with full, simple and easy access to all port signals. However, system designers need to know how their available IP modules can or must be connected in order to build a system. They probably need to read several data sheets thoroughly and try different combinations to finally match their requirements. Nonetheless, by the time they become experts on those modules, newer and better IP modules have been developed, so system designers must start again and again.

This paper presents an alternative to this situation, called "Easy-Reuse". During the following explanations, please refer to figures 6 to 9.
– First of all, a fully new concept must be introduced: an ASM++ chart may describe an entity/module that will be instantiated, like 'multiplier' in fig. 3, but additionally it may be used for a description that will be executed (see figs. 8 and 9). The former just instantiates a reference to an outer description, whereas the latter generates one or more sentences inside the modules that call it. To differentiate the modules that will be executed, their header boxes enclose one or more module names using '<' and '>' symbols.
Later on, these descriptions are processed each time an instance or a 'pipe' (described below) calls them.
– Furthermore, the ASM++ compiler has been enhanced with PHP-like variables [15]. They are immediately evaluated during compilation, but they are available only at compilation time, so no circuit structures are directly inferred from them. Their names are preceded by a dollar sign ('$'); they may be assigned with no previous declaration and store integer values, strings or lists of freely indexed variables.
– In order to differentiate several connections that may use the same descriptor, variables are used instead of parameters or generics. The corresponding field of a header box, when the chart is used to start a connection description, defines default values for several variables (see fig. 8); these values may be changed by pipes on each instantiation (see fig. 6).
– Usual ASM boxes are connected in sequence using directed arrows; a new box called "pipe" can be placed out of the sequence and connect two instances through single lines, with no arrows.
– When the compiler finishes processing the main sequence, it searches all pipes, looks for their linked instances, and executes the ASM charts related to those connections. Before each operation, it defines two automatic variables to identify the connected instances. As said above, the pipe itself may define additional variables to personalize and differentiate each connection.
– As several pipes may describe connections to the same signal, a resolution function must be defined to handle their conflicts. A tristate function could be used, but HDL compilers tend to refuse such connections if they suspect contentions; furthermore, modern FPGAs no longer implement such resources because of their high consumption, so these descriptions are actually replaced by gate-safe logic.
Subsequently, a wired-OR, easier to understand than a wired-AND, has been implemented for the cases where several sources define different values from different pipe instantiations or, in general, from different design threads.
– The last element required by ASM++ charts to manage automatic connections is conditional compilation. A diamond-like box, with double lines at each side, tells the ASM++ compiler to follow one path and fully ignore the other one. Thus, different connections are created when, for example, a FIFO memory is accessed from a processor to write data, to read data, or both.

Using these ideas, a SoC design may now encapsulate not only the functionality of several components, but also their connections. Figure 6 describes a small SoC that implements a Harvard-like DSP processor (see [13]) connected to a program memory, a 32-level FIFO and a register. First of all, two C-like compiler directives are used to specify the HDL language and a definition used later; a VDX file that describes the DSP processor is also included before giving a name to the SoC design. Then, all required modules are instantiated and connected using pipes.

Figure 6. A small SoC design using pipes.

A small program memory has been designed for testing purposes, as shown in fig. 7: the upper chart describes a ROM memory with a short program, emulating the behavior of a Xilinx Block RAM, and the lower chart describes how this synchronous memory must be connected to the DSP. This figure illustrates the use of automatic variables ('$ProgMem' and '$DSPuva18', whose values will be "mem_01" and "dsp_01", respectively) and the difference between modules that can be instantiated or executed.

Figure 7. Charts may describe connections.

The pipe in figure 6 with the text "RW250" describes the connection of a FIFO memory (see fig. 4) to a DSPuva18 processor [13], so it executes the ASM++ chart shown in fig. 8.
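The wired-OR resolution described above can be modelled in a few lines. This is a behavioral sketch for intuition only, not the compiler's implementation: each design thread contributes a value gated by its own condition, and the inactive threads contribute 0, so the OR of all contributions is safe.

```python
# Behavioral model of a wired-OR resolution (illustrative, hypothetical
# driver structure): every thread drives value when its condition holds,
# and 0 otherwise; the contributions are then OR-ed together.

def wired_or(drivers, state):
    """drivers: list of (condition_fn, value_fn) pairs evaluated on state."""
    result = 0
    for cond, value in drivers:
        result |= value(state) if cond(state) else 0
    return result

# Two hypothetical threads, mimicking the port-decoded reads of the SoC:
drivers = [
    (lambda s: s["addr"] == 0,   lambda s: s["reg_out"]),   # register read
    (lambda s: s["addr"] == 250, lambda s: s["fifo_out"]),  # FIFO read
]
```

When no driver is active the bus reads as 0, which is why each generated thread expression ends with a ": 0" default.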
When executing this pipe, a '0' value is first assigned to the variables '$port', '$write_side' and '$read_side', as stated by the header box; then these values are changed as specified by the pipe box (see the defined value of 'RW250'); finally, the chart of figure 8 generates the HDL code that fully describes how the "fifo_01" device is connected to the "dsp_01" processor for reading and writing, using port '250' for data and port '251' for control (getting the state through a read and forcing a reset through a write).

Figure 8. An ASM++ chart that describes how a FIFO must be connected to a DSP processor.

Several sentences of the HDL code generated by the ASM++ compiler when processing these diagrams are displayed below, revealing that ASM++ charts are fully capable of describing SoC designs using an intuitive, easy to use and consistent representation.

// I/O interface described by 'SoC_iface' instance and pipe (see figure 9):
input clk, reset;
output [31:0] reg_01_LEDs;

// A connection described by a "<SoC_iface> <Register>" pipe:
assign reg_01_LEDs = reg_01_dataOut;

// Connecting dsp_01 to mem_01, its program memory (see figure 6):
assign mem_01_rst      = dsp_01_progReset;
assign mem_01_addr     = dsp_01_progAddress;
assign dsp_01_progData = mem_01_data;

// Connecting reg_01 to dsp_01 (at port '0'):
assign reg_01_we     = dsp_01_portWrite & (dsp_01_portAddress == 0);
assign reg_01_dataIn = dsp_01_dataOut;

// Connecting fifo_01 to dsp_01 (at ports '250' and '251'):
always @ (posedge fifo_01_clk)
begin
  fifo_01_reset <= dsp_01_portWrite & (dsp_01_portAddress == 250 + 1);
end
assign fifo_01_dataIn = dsp_01_dataOut;
assign fifo_01_push   = dsp_01_portWrite & (dsp_01_portAddress == 250);
assign fifo_01_pop    = dsp_01_portRead  & (dsp_01_portAddress == 250);

// Connecting several sources to dsp_01 using a wired-OR:
assign asm_thread_1017_dsp_01_dataIn =
    (dsp_01_portRead & (dsp_01_portAddress == 0)) ? reg_01_dataOut : 0;
assign asm_thread_1021_dsp_01_dataIn =
    (fifo_01_pop) ? fifo_01_dataOut :
    (dsp_01_portRead & (dsp_01_portAddress == 250+1)) ?
      {fifo_01_full, fifo_01_almostFull, fifo_01_half,
       fifo_01_almostEmpty, fifo_01_empty} : 0;
assign dsp_01_dataIn = asm_thread_1017_dsp_01_dataIn
                     | asm_thread_1021_dsp_01_dataIn;

Two final ASM++ charts are described in figure 9; other required charts have not been included for brevity. The chart at the left specifies how the instance named 'SoC_iface' of figure 6 must be executed, not instantiated, in order to generate two control inputs and connect them to all modules. The diagram at the right generates additional I/O signals and connects them to the register controlled by the DSP through its port '0'.

Figure 9. Charts may describe I/O interface too.

5. Conclusions

This article has presented a powerful and intuitive methodology for SoC design named Easy-Reuse. It is based on a suitable extension of traditional Algorithmic State Machines, named ASM++ charts, its compiler, and a key idea: charts may describe entities or modules, but they may also describe connections between modules. The ASM++ compiler developed to process these charts and generate VHDL or Verilog code has been enhanced to understand a new box, called pipe, that implements the required connections. The result is a self-documented diagram that fully describes the system for easy maintenance, supervision, simulation and synthesis.

6. Acknowledgments

The authors would like to acknowledge the financial support for these developments of eZono AG, Jena, Germany, ISEND SA, Valladolid, Spain, and the Spanish Government (MEC) under grant CICYT ENE2007-67417/ALT with FEDER funds.

References
[1] Xilinx, "Platform Studio and the EDK", on-line at http://www.xilinx.com/ise/embedded_design_prod/platform_studio.htm, last viewed on July 2008.
[2] Altera, "SoPC Builder", on-line at http://www.altera.com/products/software/products/sopc/sop-index.html, last viewed on July 2008.
[3] epYme workgroup, “HwHw: The Hardware Highway web-tool for fast prototyping in digital system design”, on-line at http://www.epYme.uva.es/HwHw.php, 2007.
[4] SystemVerilog, “IEEE Std. 1800-2005: IEEE Standard for SystemVerilog – Unified Hardware Design, Specification, and Verification Language”, IEEE, 3 Park Avenue, NY, 2005.
[5] R. Dömer, D.D. Gajski and A. Gerstlauer, “SpecC Methodology for High-Level Modeling”, 9th IEEE/DATC Electronic Design Processes Workshop, 2002.
[6] C.R. Clare, Designing Logic Systems Using State Machines, McGraw-Hill, New York, 1973.
[7] D.W. Brown, “State-Machine Synthesizer – SMS”, Proc. of 18th Design Automation Conference, pp. 301-305, Nashville, Tennessee, USA, June 1981.
[8] J.P. David and E. Bergeron, “A Step towards Intelligent Translation from High-Level Design to RTL”, Proc. of 4th IEEE International Workshop on System-on-Chip for Real-Time Applications, pp. 183-188, Banff, Alberta, Canada, July 2004.
[9] E. Ogoubi and J.P. David, “Automatic synthesis from high level ASM to VHDL: a case study”, 2nd Annual IEEE Northeast Workshop on Circuits and Systems (NEWCAS 2004), pp. 81-84, June 2004.
[10] D. Ponta and G. Donzellini, “A Simulator to Train for Finite State Machine Design”, Proc. of 26th Annual Frontiers in Education Conference (FIE'96), vol. 2, pp. 725-729, Salt Lake City, Utah, USA, November 1996.
[11] D.D. Gajski, Principles of Digital Design, Prentice Hall, Upper Saddle River, NJ, 1997.
[12] Roth, Fundamentals of Logic Design, 5th edition, Thomson-Engineering, 2003.
[13] S. de Pablo, S. Cáceres, J.A. Cebrián and M. Berrocal, “Application of ASM++ methodology on the design of a DSP processor”, Proc. of 4th FPGAworld Conference, pp. 13-19, Stockholm, Sweden, September 2007.
[14] S. de Pablo, S. Cáceres, J.A. Cebrián, M. Berrocal and F.
Sanz, “ASM++ diagrams used on teaching electronic design”, International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 2007), on-line conference, December 2007.
[15] The PHP Group, on-line at http://www.php.net; the last release was PHP 5.2.6, May 1st, 2008.

Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications

Stefan Hochgürtel, Bernd Klein
Max Planck Institut für Radioastronomie, Bonn, Germany
{shochgue, bklein}@mpifr-bonn.mpg.de

Abstract

Known and novel techniques are described to implement a Fast Fourier Transform (FFT) in hardware such that parallelized data can be processed. Using both the real and the imaginary FFT input can help to save hardware. Based on the different techniques, flexible FFT implementations have been developed by combining standard FFT components (partly IP) and are compared according to their hardware utilization. Finally, applicability has been demonstrated in practice by an FFT implementation with 8192 channels as part of an FPGA spectrometer with a total bandwidth of 1.5 GHz.

1. Introduction

A number of radio-astronomical telescopes are now in operation for observations within the mm and submm wavelength atmospheric windows. Each of these windows stretches over many tens of GHz, and heterodyne receivers have now been developed to cover a large fraction thereof. The necessary spectrometers that allow a simultaneous coverage of such wide bandwidths (>1.0 GHz) at high spectral resolution (1 MHz or better) have only recently become feasible through the use of FPGAs that can handle multiple gigabytes of data per second. Another great advantage of FPGA-based spectrometers over conventional filterbanks or auto-correlators is that their production is cheaper and that they are more compact and consume less power. Their stability furthermore allows parallel stacking to cover wider bandwidths than each individual ADC/FPGA module can currently cover.
A common way to implement an FPGA spectrometer is to feed the data stream of an Analog/Digital Converter (ADC) to an FPGA that computes a Fast Fourier Transform (FFT). An N-point FFT takes N time-consecutive data samples and transforms them to a frequency spectrum with N spectral channels. According to the Shannon sampling theorem [4], the ADC’s sample-rate $f_S$ determines the spectral bandwidth $f = f_S/2$. The sensitivity specifies the spectrometer’s ability to detect weak signals. It is in principle given by the ADC’s bit-resolution, but can be increased by an integration (averaging) in time of each spectral channel. Integration can also be performed in the time-domain, which improves sensitivity by using each input-sample multiple times. Since this averaging may neutralize a periodic input-signal, integration in the time-domain must be combined with a window function. This is called weighted overlap-add (WOLA).

2. Related Work

Efficient FFT-cores are commercially available from RF Engines [3], who provide pipelined FFTs as well as combined FFTs (Section 4.1). Another source of pipelined FFTs in multiple variations is the Xilinx IP-Core-Generator, which is freely bundled with Xilinx-ISE [5]. Since Xilinx IP-cores are freely available, we use them for our pipelined FFTs. A powerful commercially available spectrometer-card was developed by Acqiris [1]. It has been in operation as a spectrometer backend at the APEX telescope [2] since 2006. It is based on a Xilinx Virtex-II Pro 70 FPGA, fed by two ADCs that deliver 2 Gigasamples per second (Gs/s) at 8 bit resolution. Its bandwidth is thereby limited to 1.0 GHz. In collaboration with Acqiris, the ARGOS project [7] has implemented an FFT-spectrometer on this board. Starting with 8 bit input-samples, the datawidth grows scaled to some degree, from 9 bits after preprocessing to 18 bits after a 32k-pt FFT. Unscaled growth would require at least 9 + log(32768) = 24 bits after the FFT, since it sums up 32768 values of 9 bits.
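The weighted overlap-add preprocessing mentioned above can be sketched in a few lines. This is an illustrative software model only (plain Python; the function name, segment count and window are assumptions, not the hardware implementation): M window-weighted segments of length N are summed sample-by-sample before the N-point FFT.

```python
# Weighted overlap-add (WOLA): an illustrative software model.
# M window-weighted segments of length N are summed sample-by-sample
# before the N-point FFT; names and the window choice are assumptions.

def wola(samples, window, n_fft, m_segments):
    """Reduce m_segments * n_fft samples to n_fft FFT-input samples."""
    assert len(samples) == n_fft * m_segments
    assert len(window) == len(samples)
    out = [0.0] * n_fft
    for seg in range(m_segments):
        for i in range(n_fft):
            j = seg * n_fft + i
            out[i] += window[j] * samples[j]
    return out

# A constant window of 1/M turns WOLA into a plain average of segments.
n, m = 4, 2
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [1.0 / m] * (n * m)
print(wola(x, w, n, m))  # [3.0, 4.0, 5.0, 6.0]
```

With a real (tapered) window, each input sample still contributes to exactly one pre-summed sample, which is why a periodic signal is no longer cancelled by the averaging.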
Although scaled broadening saves hardware that then becomes available to implement FFTs with a larger number of channels, it reduces precision and therefore potentially reduces sensitivity. The preprocessing comprises windowing but no WOLA. At the output, 32 consecutive bits can be chosen from each 36 bits wide channel of the integrated power-spectrum, in order to transmit them over the PCI-interface.

3. The Algorithm

We use an ADC that samples data at rates up to 3.0 GHz and a Virtex-4 FPGA with a maximum clock-rate of 400 MHz [6]. So, the input-data stream must be split up into several parallel data-streams with reduced speed to allow the FPGA to handle it. In this section, we will first recap how an FFT can be split to handle those parallel data-streams. Then it is shown how both the real and the imaginary part of a complex FFT-input can be used for real-only input-data, in order to reduce the hardware requirements [8].

3.1. Split FFTs

An $M \cdot N$-pt FFT calculates $M \cdot N$ spectral channels $\hat{a}_k$ out of $M \cdot N$ input-samples $a_j$, where $M$ and $N$ are powers of two:

$$\hat{a}_k = \sum_{j=0}^{M \cdot N - 1} e^{-2\pi i \frac{j \cdot k}{M \cdot N}} \cdot a_j \qquad (1)$$

The input can be split up into $N$ groups $n$ of $M$ consecutive samples $m$, so that $j = M \cdot n + m$. The same way the output is split into $M$ groups $p$ of $N$ consecutive channels $q$, with $k = N \cdot p + q$. Applying this to (1) results in a double-sum that can be simplified, since $e^{-2\pi i \cdot n \cdot p} = 1$ for $n, p \in \mathbb{N}$:

$$\hat{a}_{N \cdot p + q} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} e^{-2\pi i \frac{(M \cdot n + m)(N \cdot p + q)}{M \cdot N}} \cdot a_{M \cdot n + m} = \sum_{m=0}^{M-1} e^{-2\pi i \frac{m \cdot (N \cdot p + q)}{M \cdot N}} \cdot \sum_{n=0}^{N-1} e^{-2\pi i \frac{n \cdot q}{N}} \cdot a_{M \cdot n + m} \qquad (2)$$

The inner sums can be computed as $M$ single $N$-pt FFTs, fed with undersampled data:

$$a'_{m,q} = \sum_{n=0}^{N-1} e^{-2\pi i \frac{n \cdot q}{N}} \cdot a_{M \cdot n + m} \qquad (3)$$

Thereby $M$ spectra $a'_m$ are calculated, time-shifted to each other by the original sampling-rate, each one with a fraction $1/M$ of the original bandwidth. If their channels $a'_{m,q}$ are multiplied by the correct power of $e^{-2\pi i}$, a set of $N$ different $M$-pt FFTs remains to be computed. These complex factors are called twiddle-factors, since a multiplication with $e^{-2\pi i \cdot x}$, $x \in \mathbb{R}$, equals a clockwise rotation (twiddle) by $x$ full turns. The correct twiddle here depends on the time-shift (FFT-number $0 \le m < M$) and the channel $0 \le q < N$:

$$a''_{m,q} = e^{-2\pi i \frac{m \cdot q}{M \cdot N}} \cdot a'_{m,q} \qquad (4)$$

$$\hat{a}_{N \cdot p + q} = \sum_{m=0}^{M-1} e^{-2\pi i \frac{m \cdot p}{M}} \cdot a''_{m,q} \qquad (5)$$

3.2. Use imaginary input

A complex FFT fed with real input-samples produces a spectrum whose second half is complex conjugated to the first half:

$$\hat{a}_{N-k} = \sum_{j=0}^{N-1} e^{-2\pi i \frac{j \cdot (N-k)}{N}} \cdot a_j = \hat{a}_k^* \qquad (6)$$

To prevent a waste of resources, real samples may alternately feed the real and the imaginary input of a half-sized FFT:

$$\tilde{a}_k = \sum_{n=0}^{N/2-1} e^{-2\pi i \frac{n \cdot k}{N/2}} \cdot (a_{2n} + i \cdot a_{2n+1}) \qquad (7)$$

All intermediate channels $\tilde{a}$ form corresponding pairs $\tilde{a}_k$ and $\tilde{a}_{N/2-k}$. By adding and subtracting them, two independent half-sized FFTs can be calculated:

$$a'_{0,k} = \frac{\tilde{a}_{N/2-k}^* + \tilde{a}_k}{2} = \sum_{n=0}^{N/2-1} e^{-2\pi i \frac{n \cdot k}{N/2}} \cdot a_{2n}, \qquad a'_{1,k} = i \cdot \frac{\tilde{a}_{N/2-k}^* - \tilde{a}_k}{2} = \sum_{n=0}^{N/2-1} e^{-2\pi i \frac{n \cdot k}{N/2}} \cdot a_{2n+1} \qquad (8)$$

A twiddle-multiplication and an addition yield the final spectral channels:

$$\hat{a}_k = a'_{0,k} + e^{-2\pi i \frac{k}{N}} \cdot a'_{1,k} \qquad (9)$$

4. Implementation

Since ADC-data must be demultiplexed to reduce its rate below a clock-rate realistic for the FPGA, we obtain $P$ parallel data-streams, each one containing undersampled data with offsets of one sample to the adjoining streams. If a $P$-pt FFT is insufficient for the spectral resolution, input-samples have to be collected over $Q$ clock-cycles to calculate a $P \cdot Q$-pt FFT. The structure of an FFT demands $P$ and $Q$ to be powers of 2.
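The decomposition (1)-(5) can be checked numerically. The following sketch (plain Python with cmath; naive DFTs stand in for the pipelined and parallel FFT cores, and all names are illustrative) verifies that inner N-pt DFTs over undersampled streams, a twiddle-multiplication, and outer M-pt DFTs reproduce the direct M·N-pt DFT:

```python
import cmath

def dft(x):
    """Naive DFT, standing in for the FFT cores."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_fft(a, m_size, n_size):
    """M*N-pt DFT via M inner N-pt DFTs (3), twiddles (4), outer M-pt DFTs (5)."""
    a1 = [dft(a[m::m_size]) for m in range(m_size)]          # eq. (3)
    a2 = [[cmath.exp(-2j * cmath.pi * m * q / (m_size * n_size)) * a1[m][q]
           for q in range(n_size)] for m in range(m_size)]   # eq. (4)
    out = [0j] * (m_size * n_size)
    for q in range(n_size):                                  # eq. (5)
        col = dft([a2[m][q] for m in range(m_size)])
        for p in range(m_size):
            out[n_size * p + q] = col[p]
    return out

samples = [complex(j % 5, (3 * j) % 7) for j in range(32)]
err = max(abs(d - s) for d, s in zip(dft(samples), split_fft(samples, 4, 8)))
print(err < 1e-6)  # True
```

The slicing `a[m::m_size]` is exactly the undersampling of eq. (3): stream m sees every M-th sample, starting at offset m.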
Figure 1. 8 pipelined 128-pt FFTs followed by a parallel 8-pt FFT form a combined 1024-pt FFT (8x128).

Figure 2. Reordering the input allows the parallel FFT first and splitting the data-streams afterward. A parallel 8-pt FFT followed by 8 pipelined 128-pt FFTs forms a splitting 1024-pt FFT (8x128).

4.1. Combining Parallel and Pipelined FFTs

The way an FFT is split up (Section 3.1) suggests feeding each parallel data-stream into a pipelined $N$-pt FFT ($N = Q$) to calculate the inner sums $a'_{m,q}$. After twiddle-multiplication, a parallel $M$-pt FFT can be used to calculate $M = P$ spectral channels in any clock cycle. Altogether this creates a combined FFT (Figure 1) that operates sequentially on parallel data-streams. This implementation is simple and hardware-efficient as well as flexible in adjusting its width $P$ and its length $Q$ to values appropriate for any clock-rate and any spectrum-size. Keeping in mind that an FFT may exceed the capabilities of a single FPGA, it is desirable to split it into independent parts. Since the pipelined FFTs are expected to be the largest part of a combined FFT, separating them from each other would have the highest impact on resource consumption. In a combined FFT, data from all parallel streams has to be set against each other at the end. If the pipelined FFTs’ instances are separated to different, parallel FPGAs, recombination of data from different FPGAs at high clock-rates would be required. Therefore, independent data-streams at the end would simplify parallelization over several FPGAs. This is achieved by a parallel $N$-pt FFT ($N = P$) at the beginning. Since each FFT in (3) needs undersampled input, the data has to be reordered first, dependent on the size of the complete FFT: the parallel FFT at the front (Figure 2) requests $M = Q$ subsequent samples at each input, skipping the next $M \cdot N - M$ samples, which are needed at its other inputs in parallel. This is done by a memory module that stores $M \cdot N$ samples, followed by a two-dimensional shift-register. Behind the parallel FFT, the data-streams become independent from each other. Each one is multiplied by twiddle-factors and the resulting $a''_{m,q}$ are fed into pipelined $M$-pt FFTs to calculate the $N$th part of the complete spectrum. So we combined a reorder unit, a parallel FFT and multiple pipelined FFTs, such that data-streams become independent after the parallel FFT and can therefore be split into multiple partitions at the end. In the following this technique is called a splitting FFT.

4.2. Reduce FFT-width

The number of channels that are computed by a combined or splitting FFT can be increased in two ways with different effects on hardware utilization: increase its width $P$ or increase its length $Q$. Doubling $P$ will double the number of pipelined FFTs and twiddle-multipliers and therefore the hardware utilized by them. The parallel FFT grows with $P$ as $O(P \cdot \log(P))$. Increasing the length $Q$, on the other hand, does not lead to changes in the parallel FFT - it simply operates more cycles per FFT. Twiddle-multiplication needs no more logic resources, but memory use increases linearly with $Q$, due to the need to store more twiddles. Doubling the channels calculated from a pipelined FFT requires storing all previous channels again and adding another pipeline-stage.
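The reorder unit in front of a splitting FFT (Section 4.1) can be modeled as a pure index permutation. A minimal sketch (plain Python; an illustrative model of what the memory plus shift-register unit delivers, not its implementation): each of the P parallel-FFT inputs receives Q subsequent samples, skipping the samples needed at the other inputs in parallel.

```python
# Reorder for a splitting P*Q-pt FFT: input i of the parallel FFT
# receives Q subsequent samples (i*Q .. i*Q+Q-1), one per clock cycle,
# so every cycle presents P samples spaced Q apart. Illustrative model
# of the memory + shift-register unit; names are assumptions.

def reorder(samples, p, q):
    """Return frame[t][i]: the value at parallel-FFT input i in cycle t."""
    assert len(samples) == p * q
    return [[samples[i * q + t] for i in range(p)] for t in range(q)]

frame = reorder(list(range(8)), p=4, q=2)
print(frame)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Between two consecutive samples at one input of a frame boundary, P·Q − Q samples are skipped, matching the description above with M = Q and N = P.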
Thus the pipelined FFTs’ memory use grows linearly with $Q$, while their logic utilization grows logarithmically:

$$Logic(P, Q) = O(P \cdot \log(P) \cdot \log(Q)) \qquad (10)$$

$$Mem(P, Q) = O(P \cdot Q) \qquad (11)$$

A combined or splitting FFT’s use of FPGA-logic grows more strongly with its width than with its length. To optimize the resource usage, $P$ should be reduced, which could be achieved by less demultiplexing of input-data. This either requires higher clock-rates, which is not realistic in most cases, or lower sampling-rates and thereby reduced spectrometer bandwidth. Using the imaginary inputs of a combined or splitting FFT for one half of the real input-samples has the same effect: the width of a combined or splitting FFT can be reduced to $P/2$ (Figure 3). Instead of doubling the FFT-length $Q$, pairs of intermediate channels $\tilde{a}$ from the FFT-output are transformed to the final channels, as described in Section 3.2. This transformation is similar to one stage in a pipelined FFT, but slightly more complex. According to eq. (10), this leads to a more efficient use of the FPGA-logic. Finally note that spectral resolution is preserved, although the number of computed channels is halved, because half of the channels are complex-conjugated to the others if the FFT input is real (eq. 6).

Figure 3. A splitting 512-pt FFT (4x128) using imaginary inputs. Channel-transformation results in the first half of a 1024-pt FFT. The second half is obsolete.
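The channel-transformation of Section 3.2 can also be checked numerically. In this sketch (plain Python with cmath; a naive DFT stands in for the half-sized FFT core, and all names are illustrative), even samples feed the real input, odd samples the imaginary input, and eqs. (8)-(9) recover the spectrum of the full real-input FFT:

```python
import cmath

def dft(x):
    """Naive DFT, standing in for the FFT core."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def real_fft_via_half(a):
    """N real samples through one N/2-pt complex DFT plus eqs. (8)-(9)."""
    n = len(a)
    half = n // 2
    # eq. (7): even samples -> real input, odd samples -> imaginary input
    at = dft([complex(a[2 * j], a[2 * j + 1]) for j in range(half)])
    out = []
    for k in range(half):
        conj = at[(half - k) % half].conjugate()
        a0 = (conj + at[k]) / 2            # eq. (8): DFT of the even samples
        a1 = 1j * (conj - at[k]) / 2       # eq. (8): DFT of the odd samples
        out.append(a0 + cmath.exp(-2j * cmath.pi * k / n) * a1)  # eq. (9)
    return out  # first half only; the second half is conjugated (eq. 6)

a = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 0.0, 4.0]
full = dft([complex(v, 0.0) for v in a])
err = max(abs(full[k] - h) for k, h in enumerate(real_fft_via_half(a)))
print(err < 1e-9)  # True
```

Only N/2 channels are computed, but by eq. (6) no spectral information is lost, which is exactly why the FFT-width can be halved in hardware.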
The hardware that is saved through this trick can be used to increase $Q$, allowing a better spectral resolution.

4.3. Application

A combined FFT and a splitting FFT were implemented in synthesizable VHDL, together with a module to transform intermediate channels when imaginary inputs are used. To enable flexible use in different environments, these implementations are generic in FFT-width, FFT-length, and input-datawidth. The pipelined FFTs come from the Xilinx Core-Generator. These modules are embedded into a spectrometer-design (Figure 4) including some pre- and postprocessing around the FFT: in the time-domain, WOLA is performed with sets of up to 4 input-samples. The weighting is given by an arbitrary window function, which can be programmed from the host over Ethernet. This preprocessing unit increases the sensitivity and decreases leakage-effects. Behind the FFT, power-spectra are calculated by squaring each channel’s absolute value, followed by an integration (average) over an adjustable time-period. The design is unscaled to prevent loss of sensitivity during calculation: the datawidth after each unit is equal to the datawidth before, plus the logarithm of the number of summed values: log(4) bits in 4xWOLA and log(P·Q) in the FFT and channel-transformation unit. Two extra bits are spent to compensate the window-multiplication and to prevent overflow by twiddle-multiplication in the FFT. Squaring doubles the datawidth, and a constant 64 bits are finally reserved for each channel of the integrated power spectrum. These are converted to 32 bit float and sent to a host-PC by Ethernet. Concurrently, parameters can be received by Ethernet to adjust all modules in the design. Currently this design runs on a custom-built board (Figure 5) with a Virtex-4-SX55 FPGA. Up to 3 Gs/s of data is delivered by an A/D-Converter and received by a Spartan-3-1000 FPGA that controls all units on the board and connects to the host.

Figure 4. Spectrometer-design with a combined 8192-pt FFT (8x1024), running on a Virtex-4-SX55 FPGA. Datawidth grows by log(4) + 1 bits in 4xWOLA, by log(8192) + 1 bits in the 8192-pt FFT, by 1 bit during channel-transformation, and is doubled by squaring.

5. Results

5.1. Hardware usage

To quantify the impact on resource consumption, the different FFT-types described in Section 4 are compared with respect to the following Virtex-4 primitives: internal memory in Block-RAM, DSP48-Slices used as multipliers, and Slices, each containing 2 Flip-Flops and 2 LookUp-Tables (LUT). Four FFT-implementations were instantiated with comparable sizes, inserted into the spectrometer-design shown in Figure 4, and synthesized. We compared a combined 256-pt FFT (16x16), a splitting 256-pt FFT (16x16), a splitting 128-pt FFT (8x16) using its imaginary inputs, and a combined 128-pt FFT (8x16) using its imaginary inputs. A length of Q = 16 is chosen, since a spectrometer with a splitting 512-pt FFT (16x32) would have exceeded the size of the SX55 FPGA. All four variants are programmed onto the FPGA on the board shown in Figure 5. Comparing combined FFT and splitting FFT (Figure 6), mainly two effects are observed that affect the hardware utilization: the number of consumed DSP-Slices only depends on the datawidth in the pipelined FFTs, the parallel FFT and the twiddle-multipliers.

Figure 6. Hardware-usage (percentage of Slices, FlipFlops, LUTs, BlockRAMs and DSP48s utilized in a Virtex-4-SX55 FPGA) for seven spectrometers with different FFTs: 256-pt combined and splitting (real input only), 128-pt combined and splitting (imaginary input), 2048-pt combined and splitting (imaginary input), and 8192-pt combined (imaginary input).
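The unscaled datawidth growth through the chain of Section 4.3 (the figures are those given in the Figure 4 caption) can be tallied directly; a minimal sketch:

```python
from math import log2

# Unscaled bit-growth through the spectrometer chain (Figure 4):
# +log2(4)+1 bits in 4xWOLA, +log2(8192)+1 bits in the 8192-pt FFT,
# +1 bit in the channel-transformation, doubled by squaring.
width = 8                                  # ADC input samples
width += int(log2(4)) + 1                  # 4xWOLA (extra window bit)
assert width == 11
width += int(log2(8192)) + 1               # FFT (extra twiddle-overflow bit)
assert width == 25
width += 1                                 # channel-transformation
assert width == 26
width *= 2                                 # squaring for the power spectrum
assert width == 52                         # fits the 64-bit accumulator
print(width)  # 52
```

The 52-bit squared channels then accumulate into the constant 64-bit integration registers before conversion to 32-bit floats.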
Since the number of DSP-Slices grows discontinuously with bitwidth, it is misleading to generalize which FFT-type performs better. Utilization of Block-RAM and Slices is also influenced by bitwidth, but the main reason for the difference is the initial reordering in a splitting FFT, which consumes memory, registers, and logic. To conclude, a splitting FFT needs slightly more hardware than a combined FFT, but it can be split more easily over multiple FPGAs. As predicted in Section 4.2, the hardware consumption can be reduced significantly by using the FFT’s imaginary inputs: about 40% of the registers and logic is saved and about 30% of the multipliers (Figure 6), whereas the use of BlockRAMs hardly changes. A major advantage is the reduction in Slice consumption, since their utilization is most critical in this application. The reduced hardware consumption allows improvement of multiple properties of the spectrometer. Since bandwidth and sensitivity are already well optimized, the option taken here is to increase the spectral resolution by using an FFT with more channels. When almost all Slices are used, as with the 256-pt FFTs, the number of FlipFlops and LUTs must be observed to keep Slice utilization comparable, since unrelated logic is only mapped into one and the same Slice when almost no Slice is left completely unused. The hardware utilization of the 256-pt FFTs (16x16, not using their imaginary inputs) is closest to that of the 2048-pt FFTs (8x256) that do use their imaginary inputs (Figure 6). As eq. (10) suggests, halving the FFT-width P = 16 allows squaring its length Q = 16 with a comparable amount of used logic. Since these FFTs are chosen for comparability, their sizes do not define this application’s maxima in any case. The biggest FFT implemented in this application so far is a combined 8192-pt FFT using its imaginary inputs. It has 32 times more independent channels than the biggest FFT not using its imaginary inputs: a combined 512-pt FFT. The spectrometer including the former utilizes almost 100% of the 24576 Slices, 84% of the 49152 FlipFlops, 77% of the 49152 LookUp-Tables, 70% of the 512 DSP48-Slices, and 59% of the 320 BlockRAMs in a Virtex-4-SX55 FPGA. In conclusion, the improvement when using imaginary inputs is either 40% of the chip-logic or a squared FFT-length Q: an 8x256-pt FFT replaced a 16x16-pt FFT and an 8x1024-pt FFT replaced a 16x32-pt FFT.

Figure 5. Custom-built spectrometer board with ADC, Virtex-4 for the FFT and Spartan-3 for controlling.

5.2. Spectrometer

The clock-rate Xilinx-ISE [5] achieves for a design degrades dramatically with a growing number of occupied Slices. Although Xilinx-ISE calculates a maximum clock-rate of 1333/8 = 166.7 MHz, the spectrometer has been successfully overclocked to 1800/8 = 225 MHz. Still, the usual clock-speed is 1500/8 = 187.5 MHz for reliable operation. A spectrometer with 8192 channels and 1.5 GHz bandwidth has now been implemented successfully, based on a combined 8192-pt FFT with its imaginary inputs used. Input-data is preprocessed by 4xWOLA, weighted by a Flat-Top window, and about 180,000 single spectra are integrated over 1 second.

Figure 7. 1.5 GHz power-spectrum with 8192 channels of a 1.0 GHz sine combined with noise from 0 to 1.2 GHz.

6. Conclusion and Further Work

The test spectrum shown in Figure 7 demonstrates that the described technique and its FFT implementation work well in practice on our prototype spectrometer card. Our spectrometer is fed by an ADC with up to 3 Gs/s and thereby produces a spectral bandwidth of 1.5 GHz. This is significantly wider than other operational digital spectrometers currently produce. In our implementation, the 8-bit input data samples grow completely unscaled through the design to prevent any loss of precision and therefore sensitivity. Under this condition, the implementation of a combined 8192-pt FFT using imaginary inputs fits on a Virtex-4-SX55 FPGA and can produce 32 times more channels than if only the real input were used. Time-domain data can be preprocessed by up to 4xWOLA and any window, programmed from a host-PC. The integration over adjustable time-periods leads to a final 64-bit channel size. Finally, the channels are converted to 32 bit float-values and the data is transferred to the host-PC over Ethernet - a standard that guarantees compatibility over long periods of time. Although the performance of our fully functional spectrometer already exceeds that of others, we plan to further improve both its bandwidth and its spectral resolution. A recently announced ADC with 5 Gs/s, in combination with a Virtex-5 FPGA, could provide a significantly higher bandwidth. Both components shall be placed onto a slightly modified version of our existing, custom-built board. Since doubling the bandwidth requires doubled parallelization, the design is expected to reach the limits of a Virtex-5-SX95T, preventing an increase in the number of channels. One way to increase the channel number would be to split the task between several FPGAs. We have explored this direction with another custom-built prototype board. Our future work will concentrate on reducing the hardware consumption in an FPGA. A large hardware-savings potential lies in reduced precision. We shall quantify how bitwidth affects sensitivity for the data as well as the twiddle-factors in different parts of the design.

References

[1] Acqiris. http://www.acqiris.com.
[2] APEX. http://www.apex-telescope.org.
[3] RF Engines Limited. http://www.rfel.com.
[4] Wikipedia. http://www.wikipedia.org.
[5] Xilinx-ISE. http://www.xilinx.com/ise.
[6] Xilinx Virtex-4 Data Sheet: DC and Switching Characteristics. http://direct.xilinx.com/bvdocs/publications/ds302.pdf.
[7] A. Benz et al. A broadband FFT spectrometer for radio and millimeter astronomy.
Astronomy & Astrophysics, (3568), September 2005.
[8] E.O. Brigham. FFT: schnelle Fourier-Transformation. R. Oldenbourg Verlag GmbH, Munich, 1982.

The ABB NoC – a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs

Johnny Öberg1, Roger Mellander2, Said Zahrai2
1 Royal Institute of Technology (KTH), Dept. of Electronics, Computer, and Software Systems, [email protected]
2 ABB Corporate Research AB, {roger.mellander, said.zahrai}@se.abb.com

Abstract

The ABB NoC is a 2 by 2 Mesh NoC targeted for Xilinx FPGAs. It implements a deflective routing policy and is used to connect four MicroBlaze processors, which implement an area- and timing-critical multiprocessor embedded real-time system. Each MicroBlaze is connected to the NoC via a Network-Interface (NI) that communicates through a Fast Simplex Link (FSL) interface, together with Block RAM (BRAM) memories that implement a shared memory between the NoC and the resource. Application programs use device drivers to communicate with the Network-Interface of the NoC. The NI sets up the message transfer, with a maximum length of 2040 bytes, and sends flits with a size of 32 bit data plus an 11 bit header through the network. The design has been implemented on Xilinx Virtex2P and Virtex4 FPGAs. The NoC design has a throughput of 200 Mbps, is about 2600 slices large, and operates at 111 MHz in the Virtex2P technology and 125 MHz in the Virtex4 technology.

1. Introduction

An industrial control system is often divided into subsystems that interact and together provide the required functionality. Dividing a system into separately developed subsystems provides additional flexibility, in particular when standard components are used. As an example, one may consider a machine controller with several layers of intelligent components, with powerful PLCs at the highest level and small embedded systems for sensing and actuation at the lowest level.
In the competitive world of industrial markets, reduced size, cost and power consumption are important goals or constraints in development projects. Integration of closely related components often gives positive effects, but might reduce the flexibility and scalability of the system. The use of FPGAs can be a possible path towards highly integrated systems on a single chip which, from an architectural point of view, can be considered as large systems consisting of separate subsystems. Each of these subsystems might contain one or a number of IP-cores and CPUs. An industrial control system used for the control of machines and processes is usually composed of different nodes with variable computing capacity, responsible for executing different tasks in the system. Communication between these nodes occurs through different types of communication links and protocols, such as proprietary ones, Ethernet and field buses. Similarly, a SoC implementation of such a system using a number of processors on the same chip relies on a suitable choice of communication interfaces, such as FIFOs, memory-mapped links or buses, each of which has its strengths and weaknesses. Bus-based platforms suffer from limited scalability and poor performance for large systems [1]. This has led to proposals for building regular packet-switched networks on chip, as suggested by Dally, Sgroi, and Kumar [1, 2, 3]. These Networks-on-Chip (NoCs) are the network-based communication solution for SoCs. They allow reuse of the communication infrastructure across many products, thus reducing design-and-test effort as well as time-to-market. In this respect, a NoC provides an attractive alternative where a number of units are connected in a standard way. The aim of this paper is to present the implementation and evaluation of a NoC used in a SoC in an industrial application.
The Ethernet connections (100 Mbps) and shared memories of the original design should be replaced with a NoC system that handles data transfer through a DMA technique. The resource software will typically send messages over the NoC at a regular interval, and the maximum length of a message is 1500 bytes. For compatibility with earlier systems, the design should be implemented using the Xilinx Virtex2P (or later) technology platform. To be able to obtain sufficient real-time performance, the processors need to run at least at 100 MHz with the support of hardware accelerators, or as fast as possible, preferably at a higher speed than 150 MHz, to be able to run without hardware accelerators. Thus, the application is both area- and timing-critical.

Figure 1: Block diagram of the ABB NoC.

2. The ABB NoC

The ABB NoC is a 2 by 2 Manhattan-style 2D Mesh NoC. It is based on the Nostrum NoC concept [4..9] developed by the Royal Institute of Technology, KTH, in Stockholm. It is used to connect four MicroBlaze [13] embedded processor systems together. The NoC consists of four three-port switches, see Figure 1. The two ports A and B are connected to neighbouring switches and the remaining port R is connected to the Network-Interface of the resource. Each port supports full duplex transmission, i.e., the ports are built using two links, one in each direction, to make it possible to transmit packets in both directions simultaneously. Each MicroBlaze is connected to the NoC via a Network-Interface, consisting of a controller, two Fast Simplex Links (FSL-links) [12] and two Block RAM (BRAM) memories [11], situated on the Local Memory Bus (LMB-bus) [10] of the MicroBlaze processor. The FSL-link is used to set up the NoC-Frame transmission, while the two BRAM memories function as a write and a read buffer, respectively.
To connect the ABB NoC to the four MicroBlaze systems, eight BRAMs together with eight FSL links are needed; two BRAMs and two FSL links per resource. To be able to reuse the device drivers, the FSL-link is connected to the same port (FSL port 0) and the BRAMs have the same memory map on all resources. A block diagram of the connection is shown in Figure 2.

Figure 2: NoC-NI-MicroBlaze Block Diagram.

Switch Architecture

The Switch is implemented using a data path and a controller, as shown in Figure 3 below. The controller can be further split into a decision control algorithm and a four-state control FSM, which determines the advance of the deflection algorithm and the signalling to/from the NI. The input ports are connected to one input buffer each. From the input buffers, the valid bit and the UD, LR and HC counters are extracted and forwarded to the decision controller. The input buffers are connected to three 4-to-1 muxes, one mux for each output port A, B, and R.

Figure 3: Block diagram of the NoC Switch.

The decision controller investigates the priority of each message, decides which packet should go to which output and sets the select signal accordingly. A packet with both the UD and LR counters zero will be output on port R. A packet with either UD or LR non-zero will be routed to port A or B. A positive value means that the packet should be sent downwards/rightwards in the NoC. A negative value means that the packet should be sent upwards/leftwards in the NoC. The UD and LR counters are updated accordingly.
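The switch decisions operate on the header fields extracted from each flit: the valid bit and the UD, LR and HC counters, packed alongside the 32 bit data word. As an illustration only - the paper does not give the individual field widths, so the 1+3+3+4 split below is an assumption that merely totals the stated 11 header bits - a flit can be modeled as:

```python
# Illustrative flit model: 32 bit data plus an 11 bit header holding
# valid, UD, LR and HC. The field widths (1+3+3+4) are an assumption,
# not taken from the paper; UD and LR are signed hop offsets.

def pack_flit(valid, ud, lr, hc, data):
    header = (valid << 10) | ((ud & 0x7) << 7) | ((lr & 0x7) << 4) | (hc & 0xF)
    return (header << 32) | (data & 0xFFFFFFFF)

def unpack_flit(flit):
    data = flit & 0xFFFFFFFF
    header = flit >> 32
    hc = header & 0xF
    lr = (header >> 4) & 0x7
    ud = (header >> 7) & 0x7
    valid = (header >> 10) & 0x1
    # sign-extend the 3-bit UD/LR hop offsets
    if ud >= 4:
        ud -= 8
    if lr >= 4:
        lr -= 8
    return valid, ud, lr, hc, data

v, ud, lr, hc, d = unpack_flit(pack_flit(1, -1, 2, 5, 0xDEADBEEF))
print(v, ud, lr, hc, hex(d))  # 1 -1 2 5 0xdeadbeef
```

Whatever the real widths, the signed UD/LR offsets shrink toward zero as a packet approaches its destination, at which point it exits on port R.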
If no packet should be output, an empty packet with the valid bit set to '0' is inserted into the output stream. If only one port has a valid packet, that packet immediately gets its desired port. If two packets are competing for a port, the one with the highest priority selects first. The default priority order is A, B, and then R. In the case of port A and port B, the port with the oldest packet selects first, i.e., the one with the highest HC value is allowed to select an output port first. If port A and port B have the same HC, port A selects first.

Figure 4: Block diagram of the NI.

The switch controller is a four-clock-cycle FSM. It sets five signals: the input and output buffers' load enable signals, the Write_R and Read_R signals to communicate with the NI, and the mux select signal. The deflection algorithm calculates the switch decision in states S0 and S1, so a new select value is output in state S3 every four clock cycles. If a packet should be written to the NI, the Write_R signal is set to 1 in states S2 and S3. If a packet has been read by the switch, the Read_R signal is set to 1 during state S2, and the Load_Enable signal to 1 during state S3.

The Network Interface

The Network Interface (NI) forms the link between the switch and the resource. In the ABB NoC, the NI translates FSL messages from the MicroBlazes to the NoC and vice versa. It uses two BRAMs, together with their BRAM controllers, as a read and a write buffer for the NoC. The read and write buffers are implemented as two-ported shared memories, with one end connected to the LMB bus of the MicroBlaze and the other to the NI. The translation is done by two FSMs that work independently of each other.
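The routing and arbitration rules above can be summarised in a small software model. This is a minimal sketch, not the paper's implementation: the enum values, the assumed mapping of port A to the up/down direction and port B to left/right, and the helper names are illustrative.

```c
#include <assert.h>

/* Output ports of a three-port switch, as in Figure 1. */
enum port { PORT_R, PORT_A, PORT_B };

/* Deflection routing rule: both counters zero means the packet has
   arrived and goes to the resource port R; otherwise a non-zero UD
   or LR offset steers it towards a neighbouring switch.  Positive
   values mean downwards/rightwards, negative upwards/leftwards. */
static enum port pick_output(int ud, int lr)
{
    if (ud == 0 && lr == 0)
        return PORT_R;   /* packet is at its target node */
    if (ud != 0)
        return PORT_A;   /* assumed: port A carries up/down traffic */
    return PORT_B;       /* assumed: port B carries left/right traffic */
}

/* Arbitration between ports A and B: the oldest packet (highest hop
   count) selects first; on a tie, port A wins. */
static enum port first_to_select(int hc_a, int hc_b)
{
    return (hc_b > hc_a) ? PORT_B : PORT_A;
}
```

In hardware the same decision is computed by the decision controller from the {Valid, UD, LR, HC} fields extracted from the input buffers.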
One FSM translates FSL messages from the MicroBlaze into NoC packets (FSL2NoC), while the other stores incoming NoC packets in the receive BRAM (NoC2FSL). The block diagram is shown in Figure 4. The FSL2NoC FSM has five states and a delay counter. The NoC2FSL FSM has four states to match the four states of the switch. The two FSMs are shown in Figures 5 and 6.

The NoC2FSL FSM waits in the idle state until there is valid data on the NoC interface (indicated by Write_R='1'). When the data is valid, the FSM goes to the Setup Data state if the incoming message was a data setup word; otherwise, it goes to the Write Data state. In the Setup Data state, the write buffer address pointer for that sender is reset and an FSL message for that buffer is prepared and stored. In the Write Data state, the received data is stored at the next free position in the BRAM. If it was the last data word of the message, an interrupt signal is asserted, indicating that this write buffer has stored its last data. Both the Setup Data state and the Write Data state then go to the Wait_for_Write_R state, where the FSM waits until the Write_R signal from the NoC switch goes low. When that happens, the write buffer address is decremented by one and the FSM goes back to the idle state to wait for the next data word.

Figure 5: NoC2FSL Interface FSM.

Figure 6: FSL2NoC FSM.

The FSL2NoC FSM waits in the idle state until there is data on the FSL slave interface. It then reads the data and initiates a message transfer by sending a setup word to the intended target. Sending involves waiting in the Wait_for_Read_R state until the NoC switch asserts Read_R, acknowledging that it has read the data.
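The NoC2FSL receive path described above can be modelled in software roughly as follows. This is a sketch of the FSM in Figure 5 under stated assumptions: the step() helper and its arguments are invented for the model, the pointer reset value follows the MSWord-first storage order, and the exact clock cycle at which the address is decremented is simplified.

```c
#include <assert.h>

/* States of the NoC2FSL receive FSM (Figure 5). */
enum st { IDLE, SETUP_DATA, WRITE_DATA, WAIT_WRITE_R };

struct ni {
    enum st state;
    int wr_addr;   /* write-buffer address pointer */
    int irq;       /* asserted when the last word has been stored */
};

/* One evaluation of the FSM.  write_r/is_setup/is_last model the
   incoming NoC signals; msg_len is the message length in words. */
static void step(struct ni *n, int write_r, int is_setup, int is_last,
                 int msg_len)
{
    switch (n->state) {
    case IDLE:                         /* wait for valid NoC data */
        if (write_r)
            n->state = is_setup ? SETUP_DATA : WRITE_DATA;
        break;
    case SETUP_DATA:                   /* reset the buffer pointer; */
        n->wr_addr = msg_len - 1;      /* data arrives MSWord first  */
        n->state = WAIT_WRITE_R;
        break;
    case WRITE_DATA:                   /* store word, flag the last one */
        if (is_last)
            n->irq = 1;
        n->wr_addr--;                  /* decremented here in the model;
                                          in hardware after Write_R falls */
        n->state = WAIT_WRITE_R;
        break;
    case WAIT_WRITE_R:                 /* wait for Write_R to go low */
        if (!write_r)
            n->state = IDLE;
        break;
    }
}
```

One packet is thus handled per pass through the idle state, matching the four-cycle rhythm of the switch controller.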
It then goes to the delay state to wait twelve clock cycles. The delay is used to guarantee that packets arrive in order at the receiving end. It then proceeds to read another data word from the input BRAM. When all packets have been sent, the sender goes back to the idle state. The NI initiates the transfer and sends one data word every 16th clock cycle until all data has been transferred. For a 2 by 2 switch, this means that all packets will arrive in order, since a deflected packet will have enough time to arrive at the destination before the next packet arrives. Thus, Quality-of-Service (QoS) is maintained. For a larger network, this property does not hold; a re-ordering mechanism would then have to be implemented.

3. Protocol Stack

The protocol stack for the ABB NoC is outlined in Figure 7. Application programs call device driver functions that set up a communication message. The Network Interface then sets up a communication link with the receiving node and starts to transmit the data. Data is transmitted over the NoC, one flit at a time.

Figure 7: ABB NoC protocol stack (Application Program, Device Drivers, Network Protocol, Switch Protocol).

Device Driver Prototypes

To access the NoC services, four C-language primitives have been developed that can be used as device drivers for the NoC. They are named:

• int flush_buffer_to_noc(coord node, int buffer, int num_words)
• int get_noc_msg(void)
• int get_write_buffer(int val)
• int get_recv_buffer(int val)

The data type coord is the absolute coordinate of the target node in the system, e.g., the value {0,0} corresponds to node 0, {0,1} to node 1, etc. It is defined as

typedef struct { int row; int col; } coord;

The MicroBlaze processor in the resource uses the Fast Simplex Link protocol [12] to communicate with the NI. In addition, the MicroBlaze is connected to two two-port Block RAMs (BRAMs) [11], with the other port connected to the NI, that are used as a read and a write buffer, respectively.
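As an illustration of the coord type, a small helper can map a node number in the 2 by 2 mesh to its absolute coordinate, following the row-major numbering given above ({0,0} is node 0, {0,1} node 1, and so on). The node_to_coord() helper is hypothetical and not part of the driver API.

```c
#include <assert.h>

/* The coord type from the driver API. */
typedef struct { int row; int col; } coord;

/* Illustrative helper (not in the API): node number to absolute
   coordinate, assuming row-major numbering in the 2 by 2 mesh. */
static coord node_to_coord(int node)
{
    coord c = { node / 2, node % 2 };
    return c;
}
```

A 16-word message in write buffer 0 could then be sent to node 3 with, e.g., flush_buffer_to_noc(node_to_coord(3), 0, 16).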
In order to avoid confusion, an explanation of the naming convention is in order. The BRAM that the NoC reads from is written by the corresponding MicroBlaze resource. Thus, from the software point of view it is a write buffer, whereas from the NoC point of view it is a read buffer. The same goes for the NoC write buffer, which is viewed as a read buffer by the software.

The function flush_buffer_to_noc() takes the MicroBlaze write buffer number (0 or 1) and the length of the message in number of 32-bit words as arguments. It sets up an FSL message to the NoC and initiates the transfer by issuing a blocking write to the FSL. The function get_noc_msg() performs a blocking read from the FSL port. It returns a value that the NoC NI has put on the FSL link. The layout of the received message is shown in Figure 8.

Bit index | Bit description
0-20      | Not used
21-22     | Sender ID
23-31     | Length of message

Figure 8: Bit interpretation of NI-to-NoC data word.

The function get_write_buffer() takes the write buffer number (0 or 1) as a parameter and returns a pointer to the base address in memory of the corresponding write buffer. The function get_recv_buffer() takes the receive buffer number (0, 1, 2, or 3) as a parameter and returns a pointer to the base address in memory of the corresponding receive buffer.

Network Protocol

A write to the NI from the MicroBlaze processor is interpreted according to Figure 9 below. The NI takes the target's absolute position, counted from the zero reference (the Upper Left, UL, node in the upper left corner), and converts it to a relative address that can be used by the switch by subtracting its own position in the network from the target's address. The length of the message is extracted and stored in a counter. The counter, together with the message buffer number, forms the offset in the address space used to retrieve data from the send buffer.
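The word returned by get_noc_msg() can be decoded according to Figure 8. This sketch assumes the MicroBlaze convention of numbering bits MSB-first (bit 0 is the most significant), so the sender ID (bits 21-22) and length (bits 23-31) end up in the low 11 bits of the word; the helper names are illustrative, not part of the driver API.

```c
#include <assert.h>
#include <stdint.h>

/* Figure 8 fields, assuming MSB-first bit numbering:
   bits 21-22 (sender ID) land at shift 9, bits 23-31 (length)
   occupy the lowest nine bits. */
static int msg_sender(uint32_t w) { return (int)((w >> 9) & 0x3u); }
static int msg_length(uint32_t w) { return (int)(w & 0x1ffu); }
```

For example, a word carrying sender ID 2 and length 100 decodes back to those values with the two helpers.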
Bit index | Bit description
0-17      | Not used
18-19     | Target UpDown (UD) absolute position
20-21     | Target LeftRight (LR) absolute position
22-30     | Length of message
31        | Message buffer number

Figure 9: Bit interpretation of NoC message setup command word.

The NI responds by sending a message frame of data, outlined in Figure 10 below. The message frame in the ABB NoC is composed of 32-bit words: two setup words plus the data words.

Word                  | Bit positions | Contents
Word 0                | 0-14          | Not used
                      | 15-22         | Global clock, bits 40 to 32
                      | 23-31         | Message length+1
Word 1                | 0-31          | Global clock, bits 31 to 0
Word 2                | 0-31          | First data word (stored at Msg Length-1)
…
Word Message length+1 | 0-31          | Last data word (stored at buffer position 0)

Figure 10: The NoC message frame.

The first word that is transmitted is a setup message that initializes the receiving node with the number of data words to expect, plus the first nine bits of the global clock. The second word is the remaining 32 bits of the global clock. The global clock is a 41-bit counter that counts microseconds since the last reset. It counts 12.75 days before it wraps around. It is appended to all messages sent in the NoC and is used to keep track of when things happen, in case something goes wrong. Since the global clock is added to all messages, a local resource can always retrieve it by sending a zero-length message to itself. After the global clock has been transmitted, the message itself is transmitted, MSWord first and LSWord last.

The NI uses the sender's source ID to determine which write buffer should be used to save the incoming message. Write buffer 0 (ID 00) has the address 0x0000A000, write buffer 1 (ID 01) 0x0000A800, write buffer 2 (ID 10) 0x0000B000, and write buffer 3 (ID 11) 0x0000B800. The maximum length of a message is 2040 bytes; the remaining eight bytes (two words) are used for the setup word and the global clock word that are added to every message.
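The setup command word of Figure 9 and the absolute-to-relative address conversion can be sketched as follows. As with the earlier word layouts, this assumes MSB-first bit numbering (bit 31 of Figure 9 is the least significant bit of the C word); make_setup_word() and rel() are illustrative helpers, not the driver API.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the Figure 9 setup command word, assuming MSB-first numbering:
   bits 18-19 -> shift 12, bits 20-21 -> shift 10, bits 22-30 -> shift 1,
   bit 31 -> shift 0. */
static uint32_t make_setup_word(int ud_abs, int lr_abs,
                                int length, int buffer)
{
    return ((uint32_t)(ud_abs & 0x3)   << 12) |  /* target UD abs. pos. */
           ((uint32_t)(lr_abs & 0x3)   << 10) |  /* target LR abs. pos. */
           ((uint32_t)(length & 0x1ff) <<  1) |  /* message length      */
            (uint32_t)(buffer & 0x1);            /* message buffer no.  */
}

/* The NI subtracts its own position from the target's absolute
   position; positive offsets mean downwards/rightwards. */
static int rel(int target_abs, int own_abs)
{
    return target_abs - own_abs;
}
```

For example, node {0,0} addressing node {1,0} yields a relative UD offset of +1, i.e., one hop downwards.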
Switch Protocol

For every data word, or flit, from the MicroBlaze, the NI adds a header. The header is 11 bits long and the data is 32 bits long. The header is interpreted by the switch to determine where the packet's destination is and how it should be routed. The header is shown in Figure 11 below. The receiving NI uses only two fields from the header, the data type field and the message source field. The rest of the header is discarded upon arrival.

Bit position | Interpretation
10-9         | Data type
8-7          | Source ID
6-4          | Hop Counter (HC)
3-2          | Up/Down Counter (UD)
1-0          | Left/Right Counter (LR)

Figure 11: NoC header contents.

The data type field is two bits wide. The first bit is the data valid bit, used by the switch to determine whether the data is a valid packet or not. The NI uses the second bit of the data type field to determine the type of data: a '1' indicates a message setup data field, a '0' a normal data field. During Word 0, the NI transmits setup data. During the rest of the transmission, the NI transmits normal data (the second word of the global clock is handled as normal data). The second field, the message source field, is used by the NI. It indicates which source the message comes from and to which write buffer the NI should save the data. This field is needed since multiple transmissions from different sources can occur simultaneously. To ensure that messages do not get scrambled in the case of multiple simultaneous incoming messages, each source gets its own input queue. The Hop Counter (HC) field is used by the switch to determine the priority of the messages. It is incremented by one for every switch that the data has visited; thus, the older a message is, the higher its priority. This mechanism ensures that data does not get stuck in the NoC in the case of contention. The Up/Down (UD) and Left/Right (LR) counter fields are the relative address fields.
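The 11-bit header of Figure 11 can be packed and unpacked with plain shifts, since its bit positions are given LSB-first (LR at bits 1-0). The helper names are illustrative; in particular, bump_hc() models the per-hop increment of the hop counter described above.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the 11-bit switch header of Figure 11 into the low bits of a
   word.  UD and LR are 2-bit signed offsets; only their low two bits
   are stored. */
static uint32_t make_header(int type, int src, int hc, int ud, int lr)
{
    return ((uint32_t)(type & 0x3) << 9) |   /* bits 10-9: data type  */
           ((uint32_t)(src  & 0x3) << 7) |   /* bits  8-7: source ID  */
           ((uint32_t)(hc   & 0x7) << 4) |   /* bits  6-4: hop count  */
           ((uint32_t)(ud   & 0x3) << 2) |   /* bits  3-2: UD counter */
            (uint32_t)(lr   & 0x3);          /* bits  1-0: LR counter */
}

static int hdr_hc(uint32_t h) { return (int)((h >> 4) & 0x7u); }

/* Every switch hop increments HC, raising the packet's priority. */
static uint32_t bump_hc(uint32_t h)
{
    uint32_t hc = (uint32_t)((hdr_hc(h) + 1) & 0x7);
    return (h & ~(0x7u << 4)) | (hc << 4);
}
```

The receiving NI only inspects the data type and source ID fields; the HC, UD and LR fields exist solely for the switches.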
4. Experiments and Results

Simulation, Validation and Prototyping

In order to verify the functionality, two types of simulation were performed: one functional simulation of the NoC and its interfaces, and one structural simulation of the whole design. An iterative bottom-up verification methodology was used. First the NoC itself was verified to function as intended; then the Network Interface was added and simulated. In the second phase, the NoC was verified together with its resources in a structural VHDL simulation of the whole system. The simulation covered the first 40 µs of the system at start-up. The GPIO signals were used to "print" software status messages in the simulation window. This was necessary to verify that the device drivers worked properly together with the MicroBlazes. After debugging, the system worked satisfactorily within the given parameters. The system was downloaded to a prototype board to verify that it worked in accordance with the structural simulations.

Synthesis

From a synthesis point of view, several experiments were conducted. The goal was to make the design as fast as possible and achieve at least 100 MHz on a Xilinx FPGA. The NoC itself was synthesized first for a Virtex2P with 1152 pins. It gave an area of 2606 slices. The estimated frequency was, however, only 94.85 MHz, even after several rewritings of the code to make it run faster. Since the NoC by itself also requires more I/Os than the FPGA can accommodate, it was not possible to get a proper timing estimate after place & route. Instead, a simple prototype system was designed and synthesized, composed of four MicroBlaze resources, where each resource was connected to its own GPIO interface. The system was tested on a prototype board, running at 80 MHz, and was found to work satisfactorily. However, speed was still a problem. After some investigation, the cause of the problem turned out to be the pinning of the prototype board.
The prototype board has fixed pinning for its I/Os, so if all resources should have at least two pins connected to the LEDs on the prototype board, a higher speed is impossible to achieve, since at least one of the resources would be placed at the opposite end of the chip from the pin connected to its LED. With careful placement of I/Os, it was possible to meet the timing constraints and achieve 100 MHz. It was also found that this timing was the same as if the design had been synthesized without any placement constraints on the pinning at all (auto-placement of pins). The placement constraints were therefore removed.

Table 1: Area and speed for a 100 MHz constraint

         | Virtex2P (vp2xc30)                | Virtex4 (v4fx60)
         | Area              | Speed         | Area              | Speed
NoC 2x2  | 2606 slices (19%) | 10.543 ns     | 2575 slices (10%) | 8.525 ns
         | 4507 LUTs (16%)   | 94.85 MHz     | 4451 LUTs (8%)    | 117.3 MHz
System   | 6173 slices (45%) | 9.990 ns      | 6389 slices (25%) | 9.765 ns
         | 10037 LUTs (36%)  | 100.1 MHz     | 10554 LUTs (20%)  | 102.4 MHz

The system was then synthesised for the Virtex4 technology from Xilinx. The results of the synthesis are shown in Table 1. The NoC shrank slightly, to 2575 slices, and the speed improved significantly, by 24%. To find the highest obtainable speed, the system was synthesized with several different speed constraints. The results are summarised in Table 2. For a Virtex2P system the maximum speed was 111 MHz, and for a Virtex4 system 125 MHz.

Table 2: Maximum speed

       | Virtex2P (vp2xc30)   | Virtex4 (v4fx60)
System | 8.982 ns / 111.3 MHz | 7.996 ns / 125.1 MHz

Since the NI emits a packet every 16th clock cycle, and data is 32 bits wide, the maximum communication speed is two bits per clock cycle times the clock rate, i.e., for a clock speed of 100 MHz the transmission rate is 200 Mbps. This is twice the speed of the Ethernet connections in the original design. For the 111 MHz and 125 MHz implementations, the communication speeds are 222 Mbps and 250 Mbps, respectively.
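The throughput arithmetic above (a 32-bit word every 16 clock cycles, i.e., 2 bits per cycle) can be captured in a one-line helper; rate_mbps() is an illustrative name, not from the paper.

```c
#include <assert.h>

/* Transmission rate in Mbps: 32 data bits every 16 clock cycles
   gives 2 bits per cycle, so the rate is twice the clock in MHz. */
static int rate_mbps(int clk_mhz)
{
    return (32 / 16) * clk_mhz;
}
```

This reproduces the figures quoted in the text for the 100, 111 and 125 MHz implementations.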
5. Summary and Conclusion

The ABB NoC is a 2 by 2 mesh NoC. It uses deflective routing as its switch protocol and is targeted for implementation on Xilinx FPGAs. It connects four MicroBlaze processors that together implement an area- and timing-critical real-time embedded system. Each MicroBlaze is connected to the NoC via a two-way FSL interface together with two two-ported BRAMs that implement a shared memory between the NoC and the resource. The shared memories function as a read and a write buffer, respectively.

The NoC transmits messages that are at most 2040 bytes, i.e., 510 words, long. The remaining eight bytes are reserved for the global clock. The global clock counts microseconds since the last reset; it makes it possible to trace messages in case something should go wrong. Data words are transmitted every 16th clock cycle. Every switch in the NoC has an FSM controller that switches a data word every 4th clock cycle. The receiving Network Interface is capable of receiving message data in synchrony with the switch, i.e., every 4th cycle. A transmitted data word is 32 bits long. Thus, the maximum communication speed is two bits per clock cycle times the clock rate, i.e., for a 100 MHz clock the transmission rate is 200 Mbps. The NoC occupies 2600 slices and runs at 111 MHz on the Virtex2P and at 125 MHz on the Virtex4 technology.

The NoC fulfils the basic needs of ABB, i.e., it provides a standardized interface with a predictable behaviour that a resource IP designer can take into account when designing the IP. The size of the NoC was somewhat large, not in terms of logic area, but in terms of the memory needed to implement the buffers. The way the buffers were implemented limits the scalability of the system. For larger NoCs, a buffering scheme that occupies less memory is needed, perhaps the custom FIFO solution advocated in [14], together with a sorting unit to avoid re-ordering of packets.
ABB is also interested in seeing how the NoC structure can be used to implement fault-tolerant designs. This will be done in the future.

6. References

[1] W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proc. of DAC 2001, June 2001.
[2] M. Sgroi et al., "Addressing the System-on-a-Chip Interconnect Woes Through Communication-Based Design," in Proc. of DAC 2001, June 2001.
[3] S. Kumar, A. Jantsch, J-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä, and A. Hemani, "A network on chip architecture and design methodology," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, April 2002.
[4] E. Nilsson, "Design and implementation of a hot-potato switch in a network on chip," Master's thesis, Royal Institute of Technology (KTH), Stockholm, Sweden, June 2002. IMIT/LECS/2002-11.
[5] M. Millberg, "The Nostrum protocol stack and suggested services provided by the Nostrum backbone," Technical Report TRITA-IMIT-LECS R02:01, ISSN 1651-4661, ISRN KTH/IMIT/LECS/R-02/01-SE, LECS, ECS, Royal Institute of Technology, Sweden, 2002.
[6] E. Nilsson, M. Millberg, J. Öberg, and A. Jantsch, "Load distribution with the proximity congestion awareness in a network on chip," in Design, Automation and Test in Europe Conference (DATE 2003), Munich, Germany, March 2003.
[7] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A. Jantsch, "The Nostrum backbone - a communication protocol stack for networks on chip" (poster), in Proceedings of the International Conference on VLSI Design, January 2004.
[8] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, "Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip," in Proceedings of DATE'04, February 2004.
[9] Z. Lu, R. Thid, M. Millberg, E. Nilsson, and A. Jantsch, "NNSE: Nostrum network-on-chip simulation environment," in Proceedings of the Swedish System-on-Chip Conference (SSoCC'05), April 2005.
[10] LMB BRAM Interface Controller (v1.00.a), DS452, Xilinx Inc., 2004.
[11] Block RAM (BRAM) Block (v1.00.a), DS444, Xilinx Inc., 2004.
[12] Fast Simplex Link (FSL) Bus (v2.00.a), DS449, Xilinx Inc., 2004.
[13] MicroBlaze Processor Reference Guide, UG081 (v5.0), Xilinx Inc., 2005.
[14] K. Goossens, J. Dielissen, and A. Radulescu, "Æthereal Network on Chip: Concepts, Architectures, and Implementations," IEEE Design and Test of Computers, vol. 22, no. 5, pp. 414-421, September-October 2005.

Track A – Industrial

The eight presentations are listed on the following pages. For changes of address or text, please contact Lennart Lindh.

Open source within hardware
Johan Rilegård, +46 70 824 80 30, Email: [email protected]

The use of open source hardware IP to develop SoCs has increased rapidly and is following the huge growth within the software area. The use of IPs from OpenCores in big commercial designs, implemented in FPGAs or ASICs, approximately doubles each year. The IPs have been quite popular from the start (1999), but during the last 12 months the interest has exploded, and they are now being evaluated in all kinds of applications, from low-volume products all the way up to the mobile product industries. OpenCores is currently working to upgrade the site/community, adding quite a lot of features which will further lower the threshold for starting to use these IPs. The basic aspects driving this growth are of course all the benefits a company gains by owning its design to 100% (no license fees, minimized end-of-life problems, technology independence), in combination with the fact that the technology has matured, is very well verified and is used in many different applications. This gives the needed "trust" in the technology to sustain the ongoing avalanche effect.
Looking into the future, the growth of open source IPs will increase even further, and within a few years we truly believe that technologies based on open source will be just as common as commercial technologies.

Topics:
• History of OpenCores (HW open source IP)
• Advantages and disadvantages of using open source IPs
• What is still missing to make it the number one used technology?
• Open source HW development and communities like OpenCores – what will happen in the next 12 months, and what will it look like in 36 months?

Open and Flexible Network Hardware
[email protected]; Florian Fainelli <[email protected]>, Xavier Carcelle <[email protected]>, Etienne Flesch <[email protected]>

The latest developments of home gateways, DSL boxes, and wireless routers have brought a fair amount of network hardware into the home. These developments have been followed by several initiatives of "hacking the box" to be able to use open-source firmware. The next step is to provide a device with completely open and flexible hardware. OpenPattern is developing such hardware based on the inputs from the past initiatives, using an FPGA as the central component (SoPC) to achieve the "open hardware" goal at the IP core level.

Verification – reducing costs and increasing quality
Jan Hedman <[email protected]>

Many industry professionals tell us about their increasing problems with verification of large FPGAs/ASICs: despite investing considerable resources, projects still go to market with errors, causing delays and costs for recalling and correcting faulty products. Verification costs are increasing, but existing techniques are simply not finding all the errors. In this presentation we will show how you can improve your existing verification processes by using system-level verification techniques and a Model-Based Design approach. Models developed in MATLAB and Simulink can be reused not only for simulation and code generation, but also for verification.
We will show how these improvements can be integrated with an existing verification flow.

Large scale real-time data acquisition and signal processing in SARUS
Hans Holten-Lund <[email protected]>

SARUS is a Synthetic Aperture Real-time Ultrasound Scanner, made in cooperation between CFU (Center for Fast Ultrasound Imaging) at DTU Elektro and Prevas A/S. The system consists of 64 Digital Acquisition and Ultrasound Processing (DAUP) boards, each with five Xilinx Virtex 4 FX FPGAs, for a total of 320 FPGAs. The FPGAs communicate using the LocalLink and Aurora protocols over 3.125 Gbps MGT links, both for chip-to-chip and board-to-board communication, with network switching inside the FPGAs. Each board features 16 DAC and 16 ADC channels at 12 bits and a 70 MHz sample rate, for a total of 1024 channels. For ultrasound processing, 8192-point FFTs are used for filtering, after which a tile-based 6-way parallel beamformer focuses ultrasound frames at 4000 Hz, resulting in high-resolution frames at 40 Hz. The presentation will touch on many of the design challenges in such a large-scale system, e.g. DDR2 memory controllers, accessing register files from a PowerPC on a different FPGA, LVDS SERDES links, Aurora MGT links, DSPs, clock distribution, etc.

Drive on one chip
Lang, Fredrik (EBV) [[email protected]]

EBV Elektronik has now completed its fifth reference design, called MercuryCode, which is based on an FPGA from Altera®'s Cyclone® III family for universal industrial applications. Thanks to the appropriate software and drivers, implementation of the user system is practically pre-programmed. The board is designed for use with the Nios® II processor and the Microsoft .NET Framework, while supporting various industrial I/O standards such as CAN, USB, RS485, RS232 and 24 V I/O for direct connection to the industrial automation world. A total of ten partner companies currently provide support for all common field bus systems on the market.
This enables developers to try out all major field buses available on the market with MercuryCode. This is not possible with traditional boards, since those are manufacturer-specific and no manufacturer allows a competitor's IP code to be implemented on its own boards. As EBV Elektronik is independent, none of the partners has a problem with porting their IP cores to MercuryCode. For the end customer there is another advantage: if the end customer has to change the field bus system, it is possible to do so using the existing hardware. The FalconEye reference design platform was developed to aid the design process for controlling and/or regulating brushless synchronous and induction motors. With FalconEye + MercuryCode, EBV Elektronik offers a complete reference design for motor control on the basis of the Nios II core.

Standard architecture for a typical remote sensing micro satellite payload
Ayman M. M. Ahmed, Egyptian Space Program, National Authority of Remote Sensing and Space Science (NARSS), Cairo, Egypt

The development of EGYPTSAT-1 was one of the most instructive projects, yielding solutions to many technical problems caused by the implementation of relatively high-resolution imagers on a micro-platform. One of these was the development of the payload computer, which has to carry out many functions, under many limitations, in a very harsh environment (temperature variation, mechanical vibration, and pressure). The payload computer was developed using an FPGA to carry out imager control, imager synchronization, data compression, data packing, and data storage. The paper discusses the payload computer architecture and design difficulties using the Virtex family of Xilinx FPGAs.

World's first mixed-signal FPGA
Hosseinalikhani, Rouzbeh [[email protected]]

Fusion, the world's first mixed-signal FPGA, saves you money and time to market.
Join this session to learn more about how we have used Fusion to create a highly integrated control for the Power Module, with over 50 analogue real-time measurements, a microprocessor, as well as several interfaces. We will also show how Fusion solves critical integration issues on the Advanced Mezzanine Card (AMC) for the TCA/uTCA standard.

Analog netlist partitioning and automatic generation of schematics
Ashish Agrawal <[email protected]>

An automatic schematic generator for a netlist (analog, gate level) plays a significant role in today's scenario in making the user aware of the implementation of his RTL design. A netlist generated after the RTL level lacks any major designer intervention, as there exist various automatic tools which convert RTL to sign-off. The automatic schematic generator needs to do a good job of showing a netlist, but there are various challenges in showing such a netlist visually:

• The flat transistor-level netlist lacks the knowledge of how the various transistors are grouped to form logic.
• The netlist must be shown in a human-comprehensible form; the visual diagram should be close to a circuit hand-drawn by an electrical engineer.
• The components must be placed symmetrically and optimally relative to other components to achieve optimal routing.

The paper will focus on partitioning, grouping, placement and routing algorithms and techniques to achieve a "good" schematic.

Organizer: Lena Engdahl, [email protected]
The winner: FPGA Hero from Iha

FPGA Hero – Guitar Hero in VHDL (presentation slides, 9/17/2008)

Overview: Introduction; project introduction; system introduction; SOPC; graphics; guitar; music.

Why? Show the power of fast prototyping in SOPC. A project with a lot of technological possibilities. Use VHDL for a non-traditional purpose.

Who? Engineering College of Aarhus: Paul Martin Bendixen, Morten Stigaard Laursen, Henrik Hagen Rasmussen, Kim Arild Steen.

How? SOPC for the system layout (NIOS II, Avalon bus etc.); self-made VHDL components; IPs for the remaining components.

What?
Decoding of MIDI files; interfacing with the PS2 guitar controller; interfacing with VGA; communication via the Avalon bus; interfacing with the audio codec and SD card; NIOS II.

System components: VGA; audio; MIDI decoding; controller interface; SD card; file system; gameplay (score update, hit or miss a note).

VGA and gameplay: 25 MHz pixel clock; strict timing requirements; few states; ideal for HDL implementation.

Audio: timing requirements (I2S bus); more complex initialization (I2C bus); high complexity for init, low complexity in run mode; can be split between HDL and software. The HDL part was taken from a demonstration example.

Controller interface: short pulses; low bandwidth; few states. We chose HDL instead of IRQ.

SD card: fast clock (up to 50 MHz); loose timing requirements; to get high throughput, HDL is a must. We chose software because of an existing library (demonstration system for the DE1 board).

File system: uses an existing GPL library called DOSFS.

SOPC overview: NIOS II (master); SDRAM interface (slave); SD card interface (slave); Avalon bus; controller interface (slave); GUI interface (slave); audio codec interface (slave).

Software overview (flowchart): initialize audio, SD card and file system; parse MIDI to RAM; load WAVE to RAM; reset the score and song position; wait until the 'Start' key is pressed on the guitar; then, in the main loop: if the audio FIFO is not full, send the next sample to the FIFO; on strum, read the note buttons; if a note was hit, increment the score; send the next MIDI note to the GUI; repeat until the end of the song.

Graphics: compared to the real Guitar Hero, our simplification uses simple graphics: square notes, no background, no perspective. Hit and miss are shown by changing the colour: white for a hit, magenta for a miss.

Output: format VGA; resolution 640 x 480; frame rate 60 Hz; connected to a 4-bit DAC.
Properties of VGA: pixel order; blanking; synchronization; pixel frequency; VGA timing.

Hit, miss and score: hit and miss detection is built into the graphics part. Score IP: a CharROM showing the score in hex.

VGA interface: output via a 4-bit DAC resistor network for R, G and B, plus Hsync and Vsync; input via Avalon.

Input interface: Avalon guitar interface. SPI: standard SPI with extensions (acknowledge, command communication). The interface polls via SPI, receives the actual button presses, and is able to send 8-bit data.

Changing modes: the guitar will not work in standard mode, so it is switched to analog mode by sending several commands.

Decoding MIDI files: the MIDI file contains info about the music; only the guitar notes are used. Thanks to the "Frets on Fire" project.

Questions / Demo.

Nintendo Entertainment System on a chip (presentation slides, 2008-09-17)
Patrik Eliardsson, Per Karlström, Richard Eklycke, Ulrika Uppman, Linköpings universitet

Agenda: About the NES; hardware overview; video (Picture Processing Unit (PPU), VGA output); sound (Audio Processing Unit (APU), sound controller); CPU; testing; performance; final words.

Background of the project: the Nintendo Entertainment System (NES) is a video game console that entered the market on 15 July 1983. It was widely popular, with over 700 game titles released. Project goal: create a NES on an FPGA, use original games and controls, and develop our hardware development skills.

Difficulties designing a NES: different clocks for each subsystem; communication between the subsystems; no official documentation; most documents are for/by software emulator implementors, with no hardware "thinking".

Video subsystems: Picture Processing Unit (PPU): introduction to Video RAM (VRAM); overview of the PPU; playfield/background; sprites; interfacing with the PPU. VGA output: buffering of the video signal; pixel sizing.

Picture Processing Unit (PPU) - VRAM introduction: Pattern tables (tiles for
background and sprite)
- 16 bytes for each tile, 256 tiles
- Name Tables (index pointers into the pattern table): 32x30 entries give a resolution of 256x240
- Attribute Tables (one per name table, a color key for each 2x2 group of tiles)
- Color Palettes (1 for the background, 1 for sprites): 13 colours, keyed by 2 bits from the attribute table and 2 bits from the pattern table (palette index = attribute table bits, then pattern table bits)

PPU - VRAM Introduction: Sprites
- 256-byte sprite memory; 4 bytes per sprite gives 64 sprites
- Per sprite: Y-coordinate, pattern table index, attribute bits (horizontal flip, vertical flip), X-coordinate

PPU - System Overview
PPU - Scanline Details
PPU - Frame Details
PPU - Fetch Background Tiles

PPU - Registers
$2000 Control Register 1 (W)
$2001 Control Register 2 (Mask Register) (W)
$2002 Status Register (R)
$2003 SPR-RAM Address Register (W)
$2004 SPR-RAM Data Register (RW)
$2005 PPU Background Scrolling Offset (W2)
$2006 VRAM Address Register (W2)
$2007 VRAM Read/Write Data Register (RW)

VGA
- Buffering of PPU output data using SRAM
- Continuous reading and writing at different read/write speeds; SRAM controller
- Output to monitor: from 256x240 up to 640x480
- Each line is read twice; 5 cycles per pixel at 50 MHz; "stretched" pixels

Sound
- Audio Processing Unit: background, overview, sample subsystem (square channel), interface
- Sound controller: background, overview, configuration

Audio Processing Unit (APU) - Introduction
- 5 channels: 2 square channels, 1 triangle channel, 1 noise channel, 1 delta modulation channel
- Status register

APU - System Overview

APU - Square Channel
- Sweep, Timer/2, Sequencer, Envelope, Length, DAC

APU - Mixer
- Combines the channels using some mathematical formulas involving division and addition

APU - Controlling the Channels
$4000 ddle nnnn  Waveform shape
$4001 eppp nsss  Frequency sweep
$4002 pppp pppp  Timer period
$4003 llll lppp  Length index and timer period

Sound Controller
- Block diagram: the APU in the FPGA drives the CODEC on the audio board with an 8-bit control signal (CTRL) and 24-bit serial sample data (SEND/DATA); the CODEC is configured over I2C (CONF)

6502 CPU
- Background, implementation, memory map of the system, writing software

6502 CPU - Background
- Released in 1975; a modified 6502 called the 2A03 is used in the NES
- 8-bit data bus, 16-bit address bus
- 1 accumulator register (A), 2 index registers (X, Y)
- 2 interrupts: Non-Maskable Interrupt (NMI) and Interrupt Request (IRQ)

6502 CPU - Implementation
- For now, we use a free implementation of the 6502: Free6502 (thank you!)
- Timing problems with Free6502: some instructions take too many cycles to execute
- Own rewrite of the 6502

6502 CPU - Memory Map (page size is 256 bytes)
Address range    Size         Notes
$0000 - $00FF    256 bytes    Zero page
$0100 - $01FF    256 bytes    Stack memory
$0200 - $07FF    1536 bytes   RAM
$0800 - $0FFF    2048 bytes   Mirror of $0000-$07FF
$1000 - $17FF    2048 bytes   Mirror of $0000-$07FF
$1800 - $1FFF    2048 bytes   Mirror of $0000-$07FF
$2000 - $2007    8 bytes      Input/output registers (PPU)
$2008 - $3FFF    8184 bytes   Mirror of $2000-$2007 (multiple times)
$4000 - $401F    32 bytes     Input/output registers (APU, controllers)
$4020 - $5FFF    8160 bytes   Expansion ROM
$6000 - $7FFF    8192 bytes   Save RAM
$8000 - $BFFF    16384 bytes  PRG-ROM lower bank - executable code
$C000 - $FFFF    16384 bytes  PRG-ROM upper bank - executable code
$FFFA - $FFFB    2 bytes      NMI handler address
$FFFC - $FFFD    2 bytes      Power-on reset handler routine address
$FFFE - $FFFF    2 bytes      Software interrupt handler routine address

Performance
- A complete NES including PPU, APU, CPU, DAC interface, VGA interface, SRAM interface, and program and video
pattern ROMs for demonstration
- Memory usage: 207031 bits (86%) (... bits for demo ROMs)
- Logic usage: 3681 LE (20%)

Final Words
- Future for the project: integrate the cartridge, support hand controllers, HQ2X graphic filter, develop our own set of games, save screen or game state
- We would like to thank: Altera and the Department of Electrical Engineering, Linköping University

Questions? www.liu.se

Altera Innovate Nordic Competition
Analog Modeling Synthesizer: FPGA SOPC implementation
Arnaud Taffanel, Peyman Pouyan
Teacher: Professor Lambert Spaanenburg, Lund Institute of Technology

Motivation
- To produce an analog-style sound and to simulate the analog user experience by permitting any parameter of the sound generation path to be changed in real time.

Outline
- Introduction: music synthesizers, history of the analog synthesizer
- System Implementation: NIOS II system, sound stream
- Software Implementation: project mapping, µC/OS II
- Conclusion & results

Music Synthesizers
- Analog synthesizer = subtractive synthesizer
- Subtractive synthesis starts with a harmonically rich waveform (such as a square or sawtooth wave) and filters out unwanted spectral components.

Subtractive Synthesis
- Basic subtractive synthesizer parts:
  - Oscillators
  - Filters
  - Voltage Controlled Amplifiers (VCAs)
  - Envelope generators, such as an ADSR
  - LFOs (Low Frequency Oscillators)
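The subtractive-synthesis principle listed above (start with a harmonically rich waveform, then filter it) can be illustrated with a minimal software sketch. This is not the project's hardware implementation, only the idea: a naive sawtooth oscillator followed by a hypothetical one-pole low-pass filter.

```python
import math

def sawtooth(freq, sample_rate, n):
    """Naive sawtooth oscillator: a harmonically rich waveform in [-1, 1)."""
    phase = 0.0
    out = []
    for _ in range(n):
        out.append(2.0 * phase - 1.0)
        phase = (phase + freq / sample_rate) % 1.0
    return out

def one_pole_lowpass(samples, cutoff, sample_rate):
    """One-pole low-pass filter: removes the unwanted high partials."""
    a = math.exp(-2.0 * math.pi * cutoff / sample_rate)
    y, out = 0.0, []
    for x in samples:
        y = (1.0 - a) * x + a * y   # y follows x, smoothed by the pole
        out.append(y)
    return out

raw = sawtooth(440.0, 48000, 480)               # 10 ms burst of A4
filtered = one_pole_lowpass(raw, 1000.0, 48000)  # cut partials above ~1 kHz
```

In a real subtractive synthesizer the cutoff would itself be modulated by the envelope generator and LFOs listed above.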
Modular Analog Synthesizer
- Highly configurable
- Completely manual interconnection; one interconnection configuration is called a patch
- Heavy and very expensive

Next-generation Analog Synthesizer
- Partially pre-cabled: electronic elements have a fixed place in the circuit
- Less heavy and less expensive

Digital Analog Modeling Synthesizer
- Appeared in the 90s
- Very light, cheaper

Our Project
- Main target: an affordable, high-performance analog modeling synthesizer

User view on schematic

System Implementation
- NIOS II system
- Sound stream

Board-level Architecture
- The MIDI interface is implemented with the serial port
- The CODEC interface is on the DE1 board

Inside the FPGA
- The program inside the CPU responds to the MIDI commands.
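As an illustration of "responding to MIDI commands", a minimal note-on/note-off handler might look like the sketch below. The message layout follows the MIDI 1.0 convention (status byte plus two data bytes); the handler and its data structures are hypothetical, not taken from the project's software.

```python
def midi_note_to_hz(note):
    """Equal-tempered pitch: MIDI note 69 is A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def handle_midi(msg, active_notes):
    """Apply one 3-byte channel-voice message to a dict of sounding notes."""
    status, data1, data2 = msg
    kind = status & 0xF0                 # upper nibble selects the message type
    if kind == 0x90 and data2 > 0:       # note-on with non-zero velocity
        active_notes[data1] = midi_note_to_hz(data1)
    elif kind == 0x80 or (kind == 0x90 and data2 == 0):
        active_notes.pop(data1, None)    # note-off (or note-on, velocity 0)
    return active_notes

notes = {}
handle_midi((0x90, 69, 100), notes)  # A4 on -> oscillator at 440 Hz
handle_midi((0x80, 69, 0), notes)    # A4 off
```

In the synthesizer, the frequencies computed this way would be written into the oscillators' configuration registers over the Avalon bus.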
Actual system implementation (1)
- The sound stream element configuration is mapped into the NIOS II memory
- Everything is wired in SOPC Builder

Actual system implementation (2)

Sound Stream
- Based on the Avalon ST bus
- Clocked by the CODEC
- Uses the back-pressure capability of Avalon ST: the samples are 'pulled' through the stream by the CODEC
- 'D' elements implement a pipeline

Project Mapping
- Rules for mapping:
  - All blocks in the sound flow are implemented in hardware.
  - All slow or computational blocks are implemented in software.
  - The interconnection between all the hardware blocks is simplified by the use of the Avalon bus.
  - The whole design is clocked by the same 50 MHz clock, which is also routed by SOPC.

µC/OS II
- Simple and efficient RTOS, integrated into the NIOS II IDE
- Mainly 3 tasks implemented:
  - MIDI task: receive and execute the MIDI commands
  - EG task: envelope generator refresh
  - LFO task: low-frequency oscillator refresh

Software organisation
- Operating tasks
- Interconnection matrix system:
  - Almost everything can be interconnected dynamically
  - Defines 2 connectors: a sink to receive data (e.g. oscillators, filter) and a source to emit data (e.g.
LFO/EG/MIDI)
- Automatic refresh
- Control system to modify non-dynamic data sources

Conclusion (1)
- The FPGA itself was well suited to the signal processing:
  - It contains a lot of internal RAM.
  - It has many multipliers, which permit many high-performance design blocks.
  - Its parallel architecture helps achieve better throughput.
  - FPGAs are cheaper than DSPs.
- The Avalon bus system is very efficient and simple to implement, mostly thanks to SOPC Builder.

Performance (2)
- Smooth configuration: easy to accomplish with an FPGA
- Polyphony: pipelined FPGA architecture, 1000 cycles/sample available; the actual implementation should achieve a polyphony of at least 100
- Minimal response time: sample processing vs. block processing

References
- www.altera.com
- www.wikipedia.org
- www.dspmusic.com
- www.micrium.com

Call for FPGAworld Conference 2009: Academic Paper Submission
Step 1) Send your submission to: [email protected] (max 6 pages or 4000 words).
Step 2) The Program Committee will decide whether papers are accepted for a long presentation (30 minutes) or a short presentation (15 minutes). For long and short presentations, 4-8 pages will be assigned in the published proceedings on the web. Instructions will be sent after acceptance of your paper.
Step 3) Send the camera-ready paper to David Kallberg; it is also possible to add an attachment to the paper (no limit on the number of pages).
Step 4) The papers will be published on the FPGAworld website and in print at the conference. Please note that the academic papers will be presented in Stockholm on September 11.
Industrial Abstract Submission
Industrial submissions may already have been published on the web or elsewhere (with you as the owner): design examples, white papers, student industrial projects, etc.
Step 1) Submit an abstract giving a short overview of the presentation. The abstract should not exceed 100 words. Please state whether you are submitting the presentation to Stockholm and/or Lund, and also add "industrial paper". Send your submission to: [email protected]
Step 2) The Industrial Program Committee decides whether the papers are accepted for presentation.
Step 3) After acceptance of your abstract, you may send a full paper to be included in the web-based documentation.

Presentations at FPGAworld
Each presentation is limited to 30 minutes in total, including any questions; the session chairs will be very strict. We require that all presentations be pre-loaded onto a single laptop (Windows XP, PowerPoint/Office 2003 or PDF) to avoid laptop-switching time and problems. Please send your presentation to the session chairman, or contact him before your presentation to load it onto the laptop. Product presentations and demonstrations are not reviewed; see under Exhibitors and contact David Kallberg. We cannot guarantee a place at FPGAworld, so book in time, and please state whether you are booking for Stockholm and/or Lund.
4th FPGAworld CONFERENCE, Lennart Lindh, Vincent John Mooney III (external), MRTC report ISSN 1404-3041 ISRN MDH-MRTC215/2007-1-SE, Mälardalen Real-Time Research Centre, Mälardalen University, September 2007
3rd FPGAworld CONFERENCE, Lennart Lindh, Vincent John Mooney III (external), MRTC report ISSN 1404-3041 ISRN MDH-MRTC204/2006-1-SE, Mälardalen Real-Time Research Centre, Mälardalen University, November 2006
2nd FPGAworld CONFERENCE, Lennart Lindh, Vincent John Mooney III (external), MRTC report ISSN 1404-3041 ISRN MDH-MRTC188/2005-1-SE, Mälardalen Real-Time Research Centre, Mälardalen University, September 2005

Call for FPGAworld Conference 2009
Academic/Industrial Papers, Product Presentations, Exhibits and Tutorials
September 2009, Stockholm, Sweden
September 2009, Lund, Sweden (no academic)

The submissions should be in at least one of these areas:
• DESIGN METHODS - models and practices
  o Project methodology.
  o Design methods such as hardware/software co-design.
  o Modeling at different abstraction levels.
  o IP component designs.
  o Interface design: supporting modularity.
  o Integration - models and practices.
  o Verification and validation.
  o Board layout and verification.
  o Etc.
• TOOLS
  o News.
  o Design, modeling, implementation, verification and validation.
  o Instrumentation, monitoring, testing, debugging, etc.
  o Synthesis, compilers and languages.
  o Etc.
• HW/SW IP COMPONENTS
  o New IP components for platforms and applications.
  o Real-time operating systems, file systems, internet communications.
  o Etc.
• PLATFORM ARCHITECTURES
  o Single/multiprocessor architectures.
  o Memory architectures.
  o Reconfigurable architectures.
  o HW/SW architectures.
  o Low-power architectures.
  o Etc.
• APPLICATIONS
  o Case studies from users in industry and academia, and from students.
  o HW/SW component presentations.
  o Prototyping.
  o Etc.
• SURVEYS, TRENDS AND EDUCATION
  o History and surveys of reconfigurable logic.
  o Tutorials.
  o Student work and projects.
  o Etc.
www.fpgaworld.com/conference ISBN 978-91-976844-1-5