13. Power Optimization
May 2013 QII52016-13.0.0

The Quartus® II software offers power-driven compilation to fully optimize device power consumption. Power-driven compilation focuses on reducing your design’s total power consumption using power-driven synthesis and power-driven place-and-route.

This chapter describes the power-driven compilation feature and flow in detail, as well as low-power design techniques that can further reduce power consumption in your design. The techniques primarily target the Arria® GX, Stratix®, and Cyclone® series of devices. These devices use a low-k dielectric material that dramatically reduces dynamic power and improves performance. Arria series, Stratix II, Stratix III, Stratix IV, and Stratix V device families include efficient logic structures called adaptive logic modules (ALMs) that obtain maximum performance while minimizing power consumption. Cyclone device families offer the optimal blend of high performance and low power in a low-cost FPGA.

For more information about a device-specific architecture, refer to the device handbook, available from the Literature and Technical Documentation page on the Altera website.

Altera provides the Quartus II PowerPlay Power Analyzer to aid you during the design process by delivering fast and accurate estimations of power consumption. You can minimize power consumption, while taking advantage of the industry’s leading FPGA performance, by using the tools and techniques described in this chapter.

For more information about the PowerPlay Power Analyzer, refer to the PowerPlay Power Analysis chapter in volume 3 of the Quartus II Handbook.

Total FPGA power consumption consists of I/O power, core static power, and core dynamic power. This chapter focuses on design optimization options and techniques that help reduce core dynamic power and I/O power.
In addition to these techniques, there are additional power optimization techniques available for Stratix III and Stratix IV devices. These techniques include:
■ Selectable Core Voltage (available only for Stratix III devices)
■ Programmable Power Technology
■ Device Speed Grade Selection

For more information about power optimization techniques available for Stratix III devices, refer to AN 437: Power Optimization in Stratix III FPGAs. For more information about power optimization techniques available for Stratix IV devices, refer to AN 514: Power Optimization in Stratix IV FPGAs.

© 2013 Altera Corporation. All rights reserved. ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the U.S. Patent and Trademark Office and in other countries. All other words and logos identified as trademarks or service marks are the property of their respective holders as described at www.altera.com/common/legal.html. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. ISO 9001:2008 Registered

Quartus II Handbook Version 13.1 Volume 2: Design Implementation and Optimization, May 2013

Power Dissipation

This section describes the sources of power dissipation in Stratix III and Cyclone III devices.
You can refine techniques that reduce power consumption in your design by understanding the sources of power dissipation.

Figure 13–1 shows the power dissipation of Stratix III and Cyclone III devices in different designs. All designs were analyzed at a fixed clock rate of 100 MHz and exhibited varied logic resource utilization across available resources.

Figure 13–1. Average Core Dynamic Power Dissipation by Block Type at a 12.5% Toggle Rate

Block Type               Stratix III (1)    Cyclone III (2)
Global Clock Routing     14%                16%
Routing                  30%                29%
Memory                   21%                20%
Combinational Logic      16%                11%
Registered Logic         18%                23%
DSP Blocks (3)           1%                 -
Multipliers (3)          -                  1%

Notes to Figure 13–1:
(1) 103 different designs were used to obtain these results.
(2) 96 different designs were used to obtain these results.
(3) In designs using DSP blocks, DSPs consumed 5% of core dynamic power.

As shown in Figure 13–1, a significant amount of the total power is dissipated in routing for both Stratix III and Cyclone III devices, with the remaining power dissipated in logic, clock, and RAM blocks.

In Stratix and Cyclone device families, a series of column and row interconnect wires of varying lengths provide signal interconnections between logic array blocks (LABs), memory block structures, and digital signal processing (DSP) blocks or multiplier blocks. These interconnects dissipate the largest component of device power.

FPGA combinational logic is another source of power consumption. The basic building block of logic in the latest Stratix series devices is the ALM, and in Cyclone II, Cyclone III and Cyclone IV GX devices, it is the logic element (LE).
For more information about ALMs and LEs in Cyclone II, Cyclone III, Cyclone IV GX, Stratix II, Stratix III, Stratix IV, and Stratix V devices, refer to the respective device handbook.

Memory and clock resources are other major consumers of power in FPGAs. Stratix II devices feature the TriMatrix memory architecture. TriMatrix memory includes 512-bit M512 blocks, 4-Kbit M4K blocks, and 512-Kbit M-RAM blocks, which are configurable to support many features. Stratix IV and Stratix III TriMatrix on-chip memory is an enhancement based upon the Stratix II FPGA TriMatrix memory and includes three sizes of memory blocks: MLAB blocks, M9K blocks, and M144K blocks.

Stratix III, Stratix IV, and Stratix V devices feature Programmable Power Technology, an advanced architecture that enables a smooth trade-off between speed and power. The core of each Stratix III, Stratix IV, and Stratix V device is divided into tiles, each of which may be put into a high-speed or low-power mode. The primary benefit of Programmable Power Technology is to reduce static power, with a secondary benefit being a small reduction in dynamic power.

Cyclone II devices have 4-Kbit M4K memory blocks, and Cyclone III and Cyclone IV GX devices have 9-Kbit M9K memory blocks.

Design Space Explorer

Design Space Explorer (DSE) is a simple, easy-to-use design optimization utility that is included in the Quartus II software. DSE explores and reports optimal Quartus II software options for your design, targeting power optimization, design performance, or area utilization improvements. You can use DSE to implement the techniques described in this chapter. Figure 13–2 shows the DSE user interface. The Settings tab is divided into Project Settings and Exploration Settings.

Figure 13–2.
Design Space Explorer User Interface

The Search for Lowest Power option, under Exploration Settings, uses a predefined exploration space that targets overall design power improvements. This setting focuses on applying different options that specifically reduce total design thermal power.

By default, the Quartus II PowerPlay Power Analyzer is run for every exploration performed by the DSE when the Search for Lowest Power option is selected. This helps you debug your design and determine trade-offs between power requirements and performance optimization.

For more information about the DSE, refer to About Design Space Explorer in Quartus II Help.

Power-Driven Compilation

The standard Quartus II compilation flow consists of Analysis and Synthesis, placement and routing, Assembly, and Timing Analysis. Power-driven compilation takes place at the Analysis and Synthesis and Place-and-Route stages.

Quartus II software settings that control power-driven compilation are located in the PowerPlay power optimization list on the Analysis & Synthesis Settings page, and the PowerPlay power optimization list on the Fitter Settings page. The following sections describe these power optimization options at the Analysis and Synthesis and Fitter levels.

Power-Driven Synthesis

Synthesis netlist optimization occurs during the synthesis stage of the compilation flow. The optimization technique makes changes to the synthesis netlist to optimize your design according to the selection of area, speed, or power optimization. This section describes power optimization techniques at the synthesis level.
The Analysis & Synthesis Settings page allows you to specify logic synthesis options. The PowerPlay power optimization option is available for all devices supported by the Quartus II software except MAX® 3000 and MAX 7000 devices (Figure 13–3).

Figure 13–3. Analysis & Synthesis Settings Page

Table 13–1 shows the settings in the PowerPlay power optimization list. You can apply these settings on a project or entity level.

Table 13–1. Optimize Power During Synthesis Options

Setting                         Description
Off                             No netlist, placement, or routing optimizations are performed to minimize power.
Normal compilation (Default)    Low compute effort algorithms are applied to minimize power through netlist optimizations as long as they are not expected to reduce design performance.
Extra effort                    High compute effort algorithms are applied to minimize power through netlist optimizations. Max performance might be impacted.

The Normal compilation setting is turned on by default. This setting performs memory optimization and power-aware logic mapping during synthesis.

Memory blocks can represent a large fraction of total design dynamic power as described in “Reducing Memory Power Consumption” on page 13–14. Minimizing the number of memory blocks accessed during each clock cycle can significantly reduce memory power. Memory optimization involves effective movement of user-defined read/write enable signals to associated read-and-write clock enable signals for all memory types (Figure 13–4).

Figure 13–4.
Memory Transformation

Figure 13–4 shows a default implementation of a simple dual-port memory block in which write-clock enable signals and read-clock enable signals are connected to VCC, making both read and write memory ports active during each clock cycle. Memory transformation effectively moves the read-enable and write-enable signals to the respective read-clock enable and write-clock enable signals. By using this technique, memory ports are shut down when they are not accessed. This significantly reduces your design’s memory power consumption. For more information about clock enable signals, refer to “Reducing Memory Power Consumption” on page 13–14. For Stratix III, Stratix IV, and Stratix V devices, the memory transformation takes place at the Fitter level when you select the Normal compilation setting for the power optimization option.

In Stratix III, Cyclone III, and Cyclone IV GX devices, the specified read-during-write behavior can significantly impact the power of single-port and bidirectional dual-port RAMs. It is best to set the read-during-write parameter to “Don’t care” (at the HDL level), as it allows an optimization whereby the read-enable signal can be set to the inversion of the existing write-enable signal (if one exists). This allows the core of the RAM to shut down (that is, not toggle), which saves a significant amount of power.

The other type of power optimization that takes place with the Normal compilation setting is power-aware logic mapping. Power-aware logic mapping reduces power by rearranging the logic during synthesis to eliminate nets with high toggle rates.
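The memory transformation described above works on memories that have explicit read and write enables. As a minimal sketch of that coding style, the following entity describes a simple dual-port memory whose ports only act when their enables are asserted. The entity name sdp_ram_rden, its port names, and its default widths are illustrative assumptions and do not come from this handbook. Whether a given synthesis tool maps rden and wren onto the read and write clock enables of the memory block depends on the tool, the device, and the settings in use.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical simple dual-port RAM with explicit read and write enables.
entity sdp_ram_rden is
  generic (
    ADDR_WIDTH : natural := 10;  -- assumed depth: 2**10 words
    DATA_WIDTH : natural := 4    -- assumed word width
  );
  port (
    clk    : in  std_logic;
    wren   : in  std_logic;
    rden   : in  std_logic;
    wraddr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    rdaddr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    data   : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    q      : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity sdp_ram_rden;

architecture rtl of sdp_ram_rden is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal ram : ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The write port is idle unless wren is asserted.
      if wren = '1' then
        ram(to_integer(unsigned(wraddr))) <= data;
      end if;
      -- The read port holds its previous output unless rden is asserted.
      if rden = '1' then
        q <= ram(to_integer(unsigned(rdaddr)));
      end if;
    end if;
  end process;
end architecture rtl;
```

Because q is updated only when rden is high, the description leaves the tool free to keep the read port inactive on all other cycles, which is the behavior the transformation is meant to exploit.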
The Extra effort setting performs the functions of the Normal compilation setting and other memory optimizations to further reduce memory power by shutting down memory blocks that are not accessed. This level of memory optimization can require extra logic, which can reduce design performance.

The Extra effort setting also performs power-aware memory balancing. Power-aware memory balancing automatically chooses the best memory configuration for your memory implementation and provides optimal power saving by determining the number of memory blocks, decoder, and multiplexer circuits required. If you have not previously specified target-embedded memory blocks for your design’s memory functions, the power-aware balancer automatically selects them during memory implementation.

Figure 13–5 shows an example of a 4k × 4 (4k deep and 4 bits wide) memory implementation in two different configurations using M4K memory blocks available in Stratix II devices.

The minimum logic area implementation uses M4K blocks configured as 4k × 1. This implementation is the default in the Quartus II software because it has the minimum logic area (0 logic cells) and the highest speed. However, all four M4K blocks are active on each memory access in this implementation, which increases RAM power. The minimum RAM power implementation is created by selecting Extra effort in the PowerPlay power optimization list. This implementation automatically uses four M4K blocks configured as 1k × 4 for optimal power saving. An address decoder is implemented by the RAM megafunction to select which of the four M4K blocks should be activated on a given cycle, based on the state of the top two user address bits. The RAM megafunction automatically implements a multiplexer to feed the downstream logic by choosing the appropriate M4K output.
This implementation reduces RAM power because only one M4K block is active on any cycle, but it requires extra logic cells, costing logic area and potentially impacting design performance.

There is a trade-off between power saved by accessing fewer memories and power consumed by the extra decoder and multiplexer logic. The Quartus II software automatically balances the power savings against the costs to choose the lowest power configuration for each logical RAM. The benchmark data shows that power-driven synthesis can reduce memory power consumption by as much as 60% in Stratix devices.

Figure 13–5. 4K × 4 Memory Implementation Using Multiple M4K Blocks (4K words deep and 4 bits wide. Minimum logic area, power inefficient: 4K deep × 1 wide M4K RAMs addressed by Addr[0:11]. Minimum RAM power, power efficient: 1K deep × 4 wide M4K RAMs addressed by Addr[0:9], with an address decoder on Addr[10:11] and an output multiplexer driving Data[0:3].)

Memory optimization options can also be controlled by the Low_Power_Mode parameter in the Default Parameters page of the Settings dialog box. The settings for this parameter are None, Auto, and ALL. None corresponds to the Off setting in the PowerPlay power optimization list, Auto corresponds to the Normal compilation setting, and ALL corresponds to the Extra effort setting. You can apply PowerPlay power optimization either on a compiler basis or on individual entities. The Low_Power_Mode parameter always takes precedence over the Optimize Power for Synthesis option for power optimization on memory.

You can also set the MAXIMUM_DEPTH parameter manually to configure the memory for low power optimization. This technique is the same as the power-aware memory balancer, but it is manual rather than automatic like the Extra effort setting in the PowerPlay power optimization list.
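To make the depth-limited arrangement concrete, the following behavioral sketch models the 4k × 4 memory from the example above as four 1k × 4 banks, with the top two address bits selecting the single bank that is active on a given cycle. It is an illustration only: the entity name banked_ram_4kx4 and its ports are assumptions, and this is hand-written logic, not the structure that the RAM megafunction or the power-aware balancer actually generates.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 4k x 4 memory built from four 1k x 4 banks.
entity banked_ram_4kx4 is
  port (
    clk  : in  std_logic;
    wren : in  std_logic;
    rden : in  std_logic;
    addr : in  std_logic_vector(11 downto 0);
    data : in  std_logic_vector(3 downto 0);
    q    : out std_logic_vector(3 downto 0)
  );
end entity banked_ram_4kx4;

architecture rtl of banked_ram_4kx4 is
  type bank_t   is array (0 to 1023) of std_logic_vector(3 downto 0);
  type bank_q_t is array (0 to 3) of std_logic_vector(3 downto 0);
  signal bank_q  : bank_q_t;
  signal sel_reg : unsigned(1 downto 0) := "00";
begin
  banks : for i in 0 to 3 generate
    signal mem : bank_t;
    signal hit : std_logic;
  begin
    -- Address decoder: the top two address bits pick exactly one bank.
    hit <= '1' when to_integer(unsigned(addr(11 downto 10))) = i else '0';

    process (clk)
    begin
      if rising_edge(clk) then
        -- Only the addressed bank is accessed; the other three stay idle.
        if hit = '1' then
          if wren = '1' then
            mem(to_integer(unsigned(addr(9 downto 0)))) <= data;
          end if;
          if rden = '1' then
            bank_q(i) <= mem(to_integer(unsigned(addr(9 downto 0))));
          end if;
        end if;
      end if;
    end process;
  end generate banks;

  -- Remember which bank was read so the output multiplexer can select it.
  process (clk)
  begin
    if rising_edge(clk) then
      if rden = '1' then
        sel_reg <= unsigned(addr(11 downto 10));
      end if;
    end if;
  end process;

  -- Output multiplexer feeding the downstream logic.
  q <= bank_q(to_integer(sel_reg));
end architecture rtl;
```

The decoder and the output multiplexer are the extra logic that the text weighs against the power saved by leaving three of the four banks untouched on each access.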
You can set the MAXIMUM_DEPTH parameter for memory modules manually in the megafunction instantiation or in the MegaWizard™ Plug-In Manager for power optimization as described in “Reducing Memory Power Consumption” on page 13–14. The MAXIMUM_DEPTH parameter always takes precedence over the Optimize Power for Synthesis options for power optimization on memory.

For step-by-step instructions on how to perform power-driven synthesis, refer to Running a Power-Optimized Compilation in Quartus II Help.

Power-Driven Fitter

The Fitter Settings page enables you to specify options for fitting (Figure 13–6). The PowerPlay power optimization option is available for Arria GX, Arria II GX, Cyclone II, Cyclone III, Cyclone IV, Stratix II, Stratix II GX, Stratix III, Stratix IV, and Stratix V devices.

Figure 13–6. Fitter Settings Page

Table 13–2 lists the settings in the PowerPlay power optimization list. These settings can only be applied on a project-wide basis. The Extra effort setting for the Fitter requires extensive effort to optimize the design for power and can increase the compilation time.

Table 13–2. Power-Driven Fitter Option

Setting                         Description
Off                             No netlist, placement, or routing optimizations are performed to minimize power.
Normal compilation (Default)    Low compute effort algorithms are applied to minimize power through placement and routing optimizations as long as they are not expected to reduce design performance.
Extra effort                    High compute effort algorithms are applied to minimize power through placement and routing optimizations. Max performance might be impacted.
The Normal compilation setting is selected by default and performs DSP optimization by creating power-efficient DSP block configurations for your DSP functions. For Stratix III, Stratix IV, and Stratix V devices, this setting, which is based on timing constraints entered for the design, enables the Programmable Power Technology to configure tiles as high-speed mode or low-power mode. Programmable Power Technology is always turned ON even when the OFF setting is selected for the Fitter PowerPlay power optimization option. Tiles are the combination of LAB and MLAB pairs (including the adjacent routing associated with LAB and MLAB), which can be configured to operate in high-speed or low-power mode. This level of power optimization does not have any effect on the fitting, timing results, or compile time. Also, for Stratix III devices, this setting enables the memory transformation as described in “Power-Driven Synthesis” on page 13–4.

For more information about Stratix III power optimization, refer to AN 437: Power Optimization in Stratix III FPGAs. For more information about Stratix IV power optimization, refer to AN 514: Power Optimization in Stratix IV FPGAs.

The Extra effort setting performs the functions of the Normal compilation setting and other place-and-route optimizations during fitting to fully optimize the design for power. The Fitter applies extra effort to minimize power even after timing requirements have been met, by moving logic closer together during placement to localize high-toggling nets and by using routes with low capacitance. However, this effort can increase the compilation time.

The Extra effort setting uses a Value Change Dump File (.vcd) that guides the Fitter to fully optimize the design for power, based on the signal activity of the design.
The best power optimization during fitting results from using the most accurate signal activity information. Signal activities from full post-fit netlist (timing) simulation provide the highest accuracy because all node activities reflect the actual design behavior, provided that supplied input vectors are representative of typical design operation. If you do not have a .vcd file, the Quartus II software uses assignments, clock assignments, and vectorless estimation values (PowerPlay Power Analyzer Tool settings) to estimate the signal activities. This information is used to optimize your design for power during fitting. The benchmark data shows that the power-driven Fitter technique can reduce power consumption by as much as 19% in Stratix devices. On average, you can reduce core dynamic power by 16% with the Extra effort synthesis and Extra effort fitting settings, as compared to the Off settings in both synthesis and Fitter options for power-driven compilation.

Note: Only the Extra effort setting in the PowerPlay power optimization list for the Fitter option uses the signal activities (from .vcd files) during fitting. The settings made in the PowerPlay Power Analyzer Settings page in the Settings dialog box are used to calculate the signal activity of your design.

For more information about .vcd files and how to create them, refer to the PowerPlay Power Analysis chapter in volume 3 of the Quartus II Handbook.

For step-by-step instructions on how to perform power-driven fitting, refer to Running a Power-Optimized Compilation in Quartus II Help.

Area-Driven Synthesis

Using area optimization rather than timing or delay optimization during synthesis saves power because you use fewer logic blocks. Using less logic usually means less switching activity.
The Quartus II integrated synthesis tool provides Speed, Balanced, or Area for the Optimization Technique option. You can also specify this logic option for specific modules in your design with the Assignment Editor in cases where you want to reduce area using the Area setting (potentially at the expense of register-to-register timing performance) while leaving the default Optimization Technique setting at Balanced (for the best trade-off between area and speed for certain device families). The Speed Optimization Technique can increase the resource usage of your design if the constraints are too aggressive, and can also result in increased power consumption.

The benchmark data shows that the area-driven technique can reduce power consumption by as much as 31% in Stratix devices and as much as 15% in Cyclone devices.

Gate-Level Register Retiming

You can also use gate-level register retiming to reduce circuit switching activity. Retiming shuffles registers across combinational blocks without changing design functionality. The Perform gate-level register retiming option in the Quartus II software enables the movement of registers across combinational logic to balance timing, allowing the software to trade off the delay between timing-critical and non-critical paths.

Retiming uses fewer registers than pipelining. Figure 13–7 shows an example of gate-level register retiming, where the 10 ns critical delay is reduced by moving the register relative to the combinational logic, resulting in the reduction of data depth and switching activity.

Figure 13–7. Gate-Level Register Retiming (register stages before and after retiming; delay labels in the original figure: 10 ns, 5 ns, 8 ns, 7 ns)

Note: Gate-level register retiming makes changes at the gate level.
If you are using an atom netlist from a third-party synthesis tool, you must also select the Perform WYSIWYG primitive resynthesis option to undo the atom primitives to gates mapping (so that register retiming can be performed), and then to remap gates to Altera primitives. When using Quartus II integrated synthesis, retiming occurs during synthesis before the design is mapped to Altera primitives. The benchmark data shows that the combination of WYSIWYG remapping and gate-level register retiming techniques can reduce power consumption by as much as 6% in Stratix devices and as much as 21% in Cyclone devices.

For more information about register retiming, refer to the Netlist Optimizations and Physical Synthesis chapter in volume 2 of the Quartus II Handbook.

Design Guidelines

Several low-power design techniques can reduce power consumption when applied during FPGA design implementation. This section provides detailed design techniques for Cyclone II, Cyclone III, Cyclone IV GX, Stratix II, and Stratix III devices that affect overall design power. The results of these techniques might be different from design to design.

Clock Power Management

Clocks represent a significant portion of dynamic power consumption due to their high switching activity and long paths. Figure 13–1 on page 13–2 shows a 14% average contribution to power consumption for global clock routing in Stratix III devices and 16% in Cyclone III devices. Actual clock-related power consumption is higher than this because the power consumed by local clock distribution within logic, memory, and DSP or multiplier blocks is included in the power consumption for the respective blocks.

Clock routing power is automatically optimized by the Quartus II software, which enables only those portions of the clock network that are required to feed downstream registers. Power can be further reduced by gating clocks when they are not required.
It is possible to build clock-gating logic, but this approach is not recommended because it is difficult to generate a glitch-free clock in FPGAs using ALMs or LEs.

Arria GX, Arria II GX, Cyclone III, Cyclone IV, Stratix II, Stratix III, Stratix IV, and Stratix V devices use clock control blocks that include an enable signal. A clock control block is a clock buffer that lets you dynamically enable or disable the clock network and dynamically switch between multiple sources to drive the clock network. You can use the Quartus II MegaWizard Plug-In Manager to create this clock control block with the ALTCLKCTRL megafunction. Arria GX, Arria II GX, Cyclone III, Cyclone IV, Stratix II, Stratix III, Stratix IV, and Stratix V devices provide clock control blocks for global clock networks. In addition, Stratix II, Stratix III, Stratix IV, and Stratix V devices have clock control blocks for regional clock networks.

The dynamic clock enable feature lets internal logic control the clock network. When a clock network is powered down, none of the logic fed by that clock network toggles, thereby reducing the overall power consumption of the device. Figure 13–8 shows a 4-input clock control block diagram.

Figure 13–8. Clock Control Block Diagram (inputs inclk0x to inclk3x, clkselect[1..0], and ena; output outclk)

The enable signal is applied to the clock signal before being distributed to global routing. Therefore, the enable signal must either have significant timing slack (at least as large as the global routing delay) or it reduces the fMAX of the clock signal.

For more information about using clock control blocks, refer to the Clock Control Block Megafunction User Guide (ALTCLKCTRL).

Another contributor to clock power consumption is the LAB clock that distributes a clock to the registers within a LAB.
LAB clock power can be the dominant contributor to overall clock power. For example, in Cyclone III devices, each LAB can use two clocks and two clock enable signals, as shown in Figure 13–9. Each LAB’s clock signal and clock enable signal are linked. For example, an LE in a particular LAB using the labclk1 signal also uses the labclkena1 signal.

Figure 13–9. LAB-Wide Control Signals (dedicated LAB row clocks and local interconnect driving labclkena1, labclk1, labclkena2, labclk2, labclr1, labclr2, syncload, and synclr)

To reduce LAB-wide clock power consumption without disabling the entire clock tree, use the LAB-wide clock enable to gate the LAB-wide clock. The Quartus II software automatically promotes register-level clock enable signals to the LAB level. All registers within a LAB that share a common clock and clock enable are controlled by a shared gated clock. To take advantage of these clock enables, use a clock enable construct in the relevant HDL code for the registered logic.

LAB-Wide Clock Enable Example

The VHDL code in Example 13–1 makes use of a LAB-wide clock enable. This clock-gating logic is automatically turned into a LAB-level clock enable signal.

Example 13–1.

    PROCESS (clk)
    BEGIN
        IF clk'event AND clk = '1' THEN
            -- Update the register only while the enable is asserted.
            IF logic_is_enabled = '1' THEN
                reg <= value;
            ELSE
                reg <= reg;  -- hold the current value
            END IF;
        END IF;
    END PROCESS;

For more information about LAB-wide control signals, refer to the Stratix II Architecture, Cyclone III Device Family Overview, or Cyclone II Architecture chapters in the respective device handbook.

Reducing Memory Power Consumption

The memory blocks in FPGA devices can represent a large fraction of typical core dynamic power. Memory consumes approximately 20% of the core dynamic power in typical Cyclone III and Stratix III device designs.
Memory blocks are unlike most other blocks in the device because most of their power is tied to the clock rate, and is insensitive to the toggle rate on the data and address lines.

When a memory block is clocked, there is a sequence of timed events that occur within the block to execute a read or write. The circuitry controlled by the clock consumes the same amount of power regardless of whether or not the address or data has changed from one cycle to the next. Thus, the toggle rates of the input data and the address bus have no impact on memory power consumption.

The key to reducing memory power consumption is to reduce the number of memory clocking events. You can achieve this through clock network-wide gating described in “Clock Power Management” on page 13–12, or on a per-memory basis through use of the clock enable signals on the memory ports. Figure 13–10 shows the logical view of the internal clock of the memory block. Use the appropriate enable signals on the memory to make use of the clock enable signal instead of gating the clock.

Figure 13–10. Memory Clock Enable Signal (the Enable input selects whether Clk drives the internal memory clock)

Using the clock enable signal enables the memory only when necessary and shuts it down for the rest of the time, reducing the overall memory power consumption. You can use the MegaWizard Plug-In Manager to create these enable signals by selecting the Clock enable signal option for the appropriate port when generating the memory block function (Figure 13–11).

Figure 13–11. MegaWizard Plug-In Manager RAM 2-Port Clock Enable Signal Selectable Option

For example, consider a design that contains a 32-bit-wide M4K memory block in ROM mode that is running at 200 MHz.
Assuming that the downstream logic requires the output of this block only approximately every four cycles, this memory block consumes 8.45 mW of dynamic power when it is clocked on every cycle. By adding a small amount of control logic to generate a read clock enable signal for the memory block only on the relevant cycles, the power can be cut by approximately 75%, to 2.15 mW.

You can also use the MAXIMUM_DEPTH parameter in your memory megafunction to save power in Cyclone II, Cyclone III, Cyclone IV GX, Stratix II, Stratix III, Stratix IV, and Stratix V devices; however, this approach might increase the number of LEs required to implement the memory and affect design performance.

You can set the MAXIMUM_DEPTH parameter for memory modules manually in the megafunction instantiation or in the MegaWizard Plug-In Manager (Figure 13–12). The Quartus II software automatically chooses the best design memory configuration for optimal power, as described in “Power-Driven Compilation” on page 13–4.

Figure 13–12. MegaWizard Plug-In Manager RAM 2-Port Maximum Depth Selectable Option
[Figure: screenshot not reproduced.]

Memory Power Reduction Example

Table 13–3 shows power usage measurements for a 4K × 36 simple dual-port memory implemented using multiple M4K blocks in a Stratix II EP2S15 device. For each implementation, the M4K blocks are configured with a different memory depth.

Table 13–3. 4K × 36 Simple Dual-Port Memory Implemented Using Multiple M4K Blocks

M4K Configuration           Number of M4K Blocks    ALUTs
4K × 1 (Default setting)    36                      0
2K × 2                      36                      40
1K × 4                      36                      62
512 × 9                     32                      143
256 × 18                    32                      302
128 × 36                    32                      633

Figure 13–13 shows the amount of power saved using the MAXIMUM_DEPTH parameter.
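The block counts in Table 13–3 can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the function name and the tiling model are assumptions made for this example, not a description of how the Quartus II software maps memories. It also estimates how many M4K blocks are enabled on each access, assuming that only the blocks holding the addressed word are enabled.

```python
import math

# Logical memory from Table 13-3: 4K words x 36 bits.
TOTAL_DEPTH = 4096
TOTAL_WIDTH = 36

# (depth, width) of a single M4K block in each configuration of Table 13-3.
M4K_CONFIGS = [(4096, 1), (2048, 2), (1024, 4), (512, 9), (256, 18), (128, 36)]


def m4k_usage(block_depth, block_width):
    """Return (total_blocks, blocks_enabled_per_access) for one configuration.

    Assumption (hypothetical model, not the Quartus II algorithm): blocks tile
    the memory in a grid, and only the row of blocks that holds the addressed
    word is enabled on a given access.
    """
    depth_slices = math.ceil(TOTAL_DEPTH / block_depth)
    blocks_per_slice = math.ceil(TOTAL_WIDTH / block_width)
    return depth_slices * blocks_per_slice, blocks_per_slice


usage = [m4k_usage(d, w) for d, w in M4K_CONFIGS]
for (d, w), (total, active) in zip(M4K_CONFIGS, usage):
    print(f"{d:>4} x {w:<2}: {total} blocks, {active} enabled per access")
```

Under these assumptions, the total block counts agree with the Number of M4K Blocks column, while the number of blocks enabled per access falls as the configured depth decreases. The ALUT column is not modeled here.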
For all implementations, a user-provided read enable signal is present to indicate when read data is required. Using this power-saving technique can reduce power consumption by as much as 60%.

Figure 13–13. Power Savings Using the MAXIMUM_DEPTH Parameter
[Figure: bar chart of power savings (0% to 70%) versus M4K configuration (4K × 1, 2K × 2, 1K × 4, 512 × 9, 256 × 18, 128 × 36).]

As the memory depth becomes more shallow, memory dynamic power decreases because unaddressed M4K blocks can be shut off using a decoded combination of address bits and the read enable signal. For a 128-deep memory block, power used by the extra LEs starts to outweigh the power gain achieved by using a more shallow memory block depth. The power consumption of the memory blocks and associated LEs depends on the memory configuration.

Note: The SOPC Builder and Qsys system do not offer specific power savings control for on-chip memory blocks. There is no read enable, write enable, or clock enable that you can enable in the on-chip RAM megafunction to shut down the RAM block in the SOPC Builder and Qsys system.

Pipelining and Retiming

Designs with many glitches consume more power because of the additional switching activity. Glitches cause unnecessary and unpredictable temporary logic switches at the output of combinational logic. A glitch usually occurs when there is a mismatch in input signal timing leading to unequal propagation delay.

For example, consider an input change on one input of a 2-input XOR gate from 1 to 0, followed a few moments later by an input change from 0 to 1 on the other input. For a moment, both inputs are 0 (low) during the state transition, resulting in 0 (low) at the output of the XOR gate. Subsequently, when the second input transition takes place, the XOR gate output becomes 1 (high).
During signal transition, a glitch is produced before the output becomes stable, as shown in Figure 13–14. This glitch can propagate to subsequent logic and create unnecessary switching activity, increasing power consumption. Circuits with many XOR functions, such as arithmetic circuits or cyclic redundancy check (CRC) circuits, tend to have many glitches if there are several levels of combinational logic between registers.

Figure 13–14. XOR Gate Showing Glitch at the Output
[Figure: 2-input XOR (exclusive OR) gate with inputs A and B and output Q, and a timing diagram showing a glitch on Q.]

Pipelining can reduce design glitches by inserting flip-flops into long combinational paths. Flip-flops do not allow glitches to propagate through combinational paths. Therefore, a pipelined circuit tends to have less glitching. Pipelining has the additional benefit of generally allowing higher clock speed operations, although it does increase the latency of a circuit (in terms of the number of clock cycles to a first result). Figure 13–15 shows an example where pipelining is applied to break up a long combinational path.

Figure 13–15. Pipelining Example
[Figure: a non-pipelined path with long logic depth between two registers, and a pipelined path in which an added register splits the combinational logic into two stages of short logic depth.]

Pipelining is very effective for glitch-prone arithmetic systems because it reduces switching activity, resulting in reduced power dissipation in combinational logic. Additionally, pipelining allows higher-speed operation by reducing the number of logic levels between registers. The disadvantage of this technique is that if there are not many glitches in your design, pipelining can increase power consumption by adding unnecessary registers. Pipelining can also increase resource utilization.
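The glitch mechanism and the filtering effect of a flip-flop can be illustrated with a small discrete-time model. The following Python sketch is an illustration only; the waveforms, the two-step input skew, the clock period, and the helper names are hypothetical and do not model real device delays.

```python
def xor_waveform(a, b):
    """Element-wise XOR of two sampled 0/1 waveforms of equal length."""
    return [x ^ y for x, y in zip(a, b)]


def count_transitions(wave):
    """Number of 0->1 or 1->0 changes between consecutive samples."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev != cur)


def register_samples(wave, clock_period):
    """Model an ideal flip-flop: capture the input on each clock edge, then hold it."""
    out = []
    held = wave[0]
    for t, value in enumerate(wave):
        if t % clock_period == 0:
            held = value
        out.append(held)
    return out


# Hypothetical timing: A falls at t = 2, B rises two time steps later at t = 4.
a = [1, 1, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 1, 1, 1, 1]

q = xor_waveform(a, b)                        # combinational output, glitches low
q_reg = register_samples(q, clock_period=8)   # flip-flop clocked once per 8 steps

print("Q     :", q)
print("Q_reg :", q_reg)
print("transitions on Q     :", count_transitions(q))
print("transitions on Q_reg :", count_transitions(q_reg))
```

In this model the combinational output Q toggles twice because of the input skew, while the registered output does not toggle at all, so logic downstream of the register sees no extra switching. The model ignores the power of the added register itself.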
The benchmark data shows that pipelining can reduce dynamic power consumption by as much as 30% in Cyclone and Stratix devices.

Architectural Optimization

You can use design-level architectural optimization by taking advantage of specific device architecture features. These features include dedicated memory and DSP or multiplier blocks available in FPGA devices to perform memory or arithmetic-related functions. You can use these blocks in place of LUTs to reduce power consumption. For example, you can build large shift registers from RAM-based FIFO buffers instead of building the shift registers from the LE registers.

The Stratix device family allows you to efficiently target small, medium, and large memories with the TriMatrix memory architecture. Each TriMatrix memory block is optimized for a specific function. The M512 memory blocks available in Stratix II devices are useful for implementing small FIFO buffers, DSP, and clock domain transfer applications. M512 memory blocks are more power-efficient than the distributed memory structures in some competing FPGAs. The M4K memory blocks are used to implement buffers for a wide variety of applications, including processor code storage, large look-up table implementation, and large memory applications. The M-RAM blocks are useful in applications where a large volume of data must be stored on-chip. Effective utilization of these memory blocks can have a significant impact on power reduction in your design.

The latest Stratix and Cyclone device families have configurable M9K memory blocks that provide various memory functions such as RAM, FIFO buffers, and ROM.

For more information about using DSP and memory blocks efficiently, refer to the Area and Timing Optimization chapter in volume 2 of the Quartus II Handbook.
I/O Power Guidelines

Nonterminated I/O standards such as LVTTL and LVCMOS have a rail-to-rail output swing. The voltage difference between logic-high and logic-low signals at the output pin is equal to the VCCIO supply voltage. If the capacitive loading at the output pin is known, the dynamic power consumed in the I/O buffer can be calculated as shown in Equation 13–1:

Equation 13–1. Capacitive loading at the output pin

$$P = 0.5 \times F \times C \times V^{2}$$

In this equation, F is the output transition frequency and C is the total load capacitance being switched. V is equal to the VCCIO supply voltage. Because of the quadratic dependence on VCCIO, lower voltage standards consume significantly less dynamic power.

Transistor-to-transistor logic (TTL) I/O buffers consume very little static power. As a result, the total power consumed by an LVTTL or LVCMOS output is highly dependent on load and switching frequency.

When using resistively terminated I/O standards like SSTL and HSTL, the output load voltage swings by a small amount around some bias point. The same dynamic power equation is used, where V is the actual load voltage swing. Because this is much smaller than VCCIO, dynamic power is lower than for nonterminated I/O under similar conditions. These resistively terminated I/O standards dissipate significant static (frequency-independent) power, because the I/O buffer is constantly driving current into the resistive termination network. However, the lower dynamic power of these I/O standards means they often have lower total power than LVCMOS or LVTTL for high-frequency applications. Use the lowest drive strength I/O setting that meets your speed and waveform requirements to minimize I/O power when using resistively terminated standards.
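The quadratic dependence on voltage in Equation 13–1 can be made concrete with a short calculation. The following Python sketch is a worked example under assumed conditions: the 100 MHz transition frequency, the 10 pF load, and the function name are hypothetical choices for illustration, and the result covers only the capacitive dynamic term for a nonterminated output, where V equals VCCIO.

```python
def io_dynamic_power_mw(freq_hz, load_farads, swing_volts):
    """Equation 13-1: P = 0.5 * F * C * V^2, returned in milliwatts."""
    return 0.5 * freq_hz * load_farads * swing_volts ** 2 * 1e3


# Hypothetical operating point: 100 MHz output transition frequency, 10 pF load.
FREQ_HZ = 100e6
LOAD_F = 10e-12

for vccio in (3.3, 2.5, 1.8, 1.5):
    p = io_dynamic_power_mw(FREQ_HZ, LOAD_F, vccio)
    print(f"VCCIO = {vccio:.1f} V -> {p:.3f} mW")
```

At a fixed frequency and load, moving from a 3.3-V to a 1.5-V nonterminated standard reduces this term by a factor of (3.3/1.5)², or about 4.8. These figures are not comparable to measured or analyzer-reported totals, which include contributions that this equation does not model.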
You can save a small amount of static power by connecting unused I/O banks to the lowest possible VCCIO voltage of 1.2 V.

Table 13–4 shows the total supply and thermal power consumed by outputs using different I/O standards for Stratix II devices. The numbers are for an I/O pin transmitting random data clocked at 200 MHz with a 10 pF capacitive load. For this configuration, nonterminated standards generally use less power, but this is not always the case. If the frequency or the capacitive load is increased, the power consumed by nonterminated outputs increases faster than the power of terminated outputs.

Table 13–4. I/O Power for Different I/O Standards in Stratix II Devices

Standard            Total Supply Current Drawn       Total On-Chip Thermal
                    from VCCIO Supply (mA)           Power Dissipation (mW)
3.3-V LVTTL         2.42                             9.87
2.5-V LVCMOS        1.9                              6.69
1.8-V LVCMOS        1.34                             4.18
1.5-V LVCMOS        1.18                             3.58
3.3-V PCI           2.47                             10.23
SSTL-2 class I      6.07                             4.42
SSTL-2 class II     10.72                            5.1
SSTL-18 class I     5.33                             3.28
SSTL-18 class II    8.56                             4.06
HSTL-15 class I     6.06                             3.49
HSTL-15 class II    11.08                            4.87
HSTL-18 class I     6.87                             4.09
HSTL-18 class II    12.33                            5.82

For more information about I/O standards, refer to the Selectable I/O Standards in Stratix II Devices and Stratix II GX Devices chapter in volume 2 of the Stratix II Device Handbook, the Stratix III Device I/O Features chapter in volume 1 of the Stratix III Device Handbook, the I/O Features in Stratix IV Devices chapter in volume 1 of the Stratix IV Device Handbook, or the Selectable I/O Standards in Cyclone II Devices chapter in the Cyclone II Device Handbook, the Cyclone III Device Handbook, or the Cyclone IV GX Handbook.

When calculating I/O power, the PowerPlay Power Analyzer uses the default capacitive load set for the I/O standard in the Capacitive Loading page of the Device and Pin Options dialog box.
For Stratix II devices, if Enable Advanced I/O Timing is turned on, I/O power is measured using an equivalent load calculated as the sum of the near capacitance, the transmission line distributed capacitance, and the far-end capacitance as defined in the Board Trace Model page of the Device and Pin Options dialog box or the Board Trace Model view in the Pin Planner. Any other components defined in the board trace model are not taken into account for the power measurement.

For Cyclone III, Cyclone IV GX, Stratix III, Stratix IV, and Stratix V devices, Advanced I/O Timing, which uses the full board trace model, is always used.

For information about using Advanced I/O Timing and configuring a board trace model, refer to the I/O Management chapter in volume 2 of the Quartus II Handbook.

Dynamically Controlled On-Chip Terminations

Stratix V, Stratix IV, and Stratix III FPGAs offer dynamic on-chip termination (OCT). Dynamic OCT enables series termination (RS) and parallel termination (RT) to be turned on and off dynamically during the data transfer. This feature is especially useful when Stratix V, Stratix IV, and Stratix III FPGAs are used with external memory interfaces, such as interfacing with DDR memories.

Compared to conventional termination, dynamic OCT reduces power consumption significantly because it eliminates the constant DC power consumed by parallel termination when transmitting data. Parallel termination is extremely useful for applications that interface with external memories where I/O standards, such as HSTL and SSTL, are used. Parallel termination supports dynamic OCT, which is useful for bidirectional interfaces (see Figure 13–16).

Figure 13–16. Stratix III On-Chip Parallel Termination
[Figure: a transmitter drives a Zo = 50 Ω line into a Stratix III receiver with OCT, in which one 100 Ω resistor connects to VCCIO and one 100 Ω resistor connects to GND; the receiver is referenced to VREF.]

The following is an example of power saving for a DDR3 interface using on-chip parallel termination.
The static current consumed by parallel OCT is equal to the VCCIO voltage divided by 100 Ω. For DDR3 interfaces that use SSTL-15, the static current is 1.5 V / 100 Ω = 15 mA per pin. Therefore, the static power is 1.5 V × 15 mA = 22.5 mW per pin. For an interface with 72 DQ and 18 DQS pins, the static power is 90 pins × 22.5 mW = 2.025 W. Dynamic parallel OCT disables parallel termination during write operations, so if writing occurs 50% of the time, the power saved by dynamic parallel OCT is 50% × 2.025 W = 1.0125 W.

For more information about dynamic OCT in Stratix III and Stratix IV devices, refer to the Stratix III Device I/O Features chapter in the Stratix III Device Handbook and the Stratix IV Device I/O Features chapter in the Stratix IV Device Handbook, respectively.

Power Optimization Advisor

The Quartus II software includes the Power Optimization Advisor, which provides specific power optimization advice and recommendations based on the current design project settings and assignments. The advisor covers many of the suggestions listed in this chapter. The following example shows how to reduce your design power with the Power Optimization Advisor.

Power Optimization Advisor Example

After compiling your design, run the PowerPlay Power Analyzer to determine your design power and to see where power is dissipated in your design. Based on this information, you can run the Power Optimization Advisor to implement recommendations that can reduce design power. Figure 13–17 shows the Power Optimization Advisor after compiling a design that is not fully optimized for power.

Figure 13–17. Power Optimization Advisor
[Figure: screenshot not reproduced.]

The Power Optimization Advisor shows the recommendations that can reduce power in your design.
The recommendations are split into stages to show the order in which you should apply the recommended settings. The first stage shows mostly CAD setting options that are easy to implement and highly effective in reducing design power. An icon indicates whether each recommended setting is made in the current project. In Figure 13–17, the checkmark icons for Stage 1 show the recommendations that are already implemented. The warning icons indicate recommendations that are not followed for this compilation. The information icons show the general suggestions. Each recommendation includes the description, a summary of the effect of the recommendation, and the action required to make the appropriate setting.

There is a link from each recommendation to the appropriate location in the Quartus II user interface where you can change the setting. You can change the Power-Driven Synthesis setting by clicking Open Settings dialog box - Analysis & Synthesis Settings page. The Settings dialog box is shown with the Analysis & Synthesis Settings page selected, where you can change the PowerPlay power optimization settings.

After making the recommended changes, recompile your design. The Power Optimization Advisor indicates with green check marks that the recommendations were implemented successfully (Figure 13–18). You can use the PowerPlay Power Analyzer to verify your design power results.

Figure 13–18. Implementation of Power Optimization Advisor Recommendations
[Figure: screenshot not reproduced.]

The recommendations listed in Stage 2 generally involve design changes, rather than CAD settings changes as in Stage 1. You can use these recommendations to further reduce your design power consumption. Altera recommends that you implement Stage 1 recommendations first, then the Stage 2 recommendations.
Conclusion

The combination of a smaller process technology, the use of low-k dielectric material, and reduced supply voltage significantly reduces dynamic power consumption in the latest FPGAs. To further reduce your dynamic power, use the design recommendations presented in this chapter to optimize resource utilization and minimize power consumption.

Document Revision History

Table 13–5 shows the revision history for this chapter.

Table 13–5. Document Revision History

Date             Version    Changes
May 2013         13.0.0     Added a note to “Memory Power Reduction Example” on page 13–16 on Qsys and SOPC Builder power savings limitation for on-chip memory block.
June 2012        12.0.0     Removed survey link.
November 2011    10.0.2     Template update.
December 2010    10.0.1     Template update.
July 2010        10.0.0     ■ Was chapter 11 in the 9.1.0 release
                            ■ Updated Figures 14-2, 14-3, 14-6, 14-18, 14-19, and 14-20
                            ■ Updated device support
                            ■ Minor editorial updates
November 2009    9.1.0      ■ Updated Figure 11-1 and associated references
                            ■ Updated device support
                            ■ Minor editorial update
March 2009       9.0.0      ■ Was chapter 9 in the 8.1.0 release
                            ■ Updated for the Quartus II software release
                            ■ Added benchmark results
                            ■ Removed several sections
                            ■ Updated Figure 13–1, Figure 13–17, and Figure 13–18
November 2008    8.1.0      ■ Changed to 8½” × 11” page size
                            ■ Changed references to altsyncram to RAM
                            ■ Minor editorial updates
May 2008         8.0.0      ■ Added support for Stratix IV devices
                            ■ Updated Table 9–1 and 9–9
                            ■ Updated “Architectural Optimization” on page 9–22
                            ■ Added “Dynamically-Controlled On-Chip Terminations” on page 9–26
                            ■ Updated “Referenced Documents” on page 9–29
                            ■ Updated references

For previous versions of the Quartus II Handbook, refer to the Quartus II Handbook Archive.