NIAGARA
Presented by Linda Pescatore

NIAGARA: It's about throughput
§ Key performance metric: Sustained throughput of client requests
§ Both multicore and multithreaded
Less romantic name: UltraSPARC T1 with CoolThreads technology
Released Nov 14, 2005
SPARC = Scalable Processor ARChitecture

NIAGARA: Amdahl's at it again
Improving performance of a single thread using
§ Multiple instruction issue
§ Out of order processing, and
§ Aggressive branch prediction
mostly reduces compute time, not memory access time. (True?)

NIAGARA: We like throughput
§ Optimized for multithreaded performance
§ Commercial server apps: high TLP, low to medium ILP
§ 8 "Sparc pipe" thread groups of 4
§ 32 threads total (64 in Niagara 2)
§ "Parallel execution of many threads … hides memory latency."
§ No aggressive branch prediction
§ Speculative thread: low priority

NIAGARA: We got the power
§ Power efficiency:
  § Expected to dissipate 60 W
  § Resource sharing
  § Clock speed not pushed to limit
  § Sun's SWaP = Performance / (Space x Power Consumption)
"The performance per watt is four to 10 times better than any other chip." (Nathan Brookwood, Analyst, Insight64)
§ Conserve space
Hydroelectric power dam at the Robert Moses generating facility, fed by conduits under the city of Niagara Falls.

Block Diagrams
Kongetira
Credit: David Halko, Creative Commons license

NIAGARA: Ceci n'est pas une Sparc Pipe
• Adds Thread Select Logic
• Controls when to fetch, when to decode and execute.
• Thread selection policy:
  – Switch between available threads every cycle
  – Prioritize least recently used
Kongetira

Niagara makes a splash
§ The T1 processor is in:
  § Sun/Fujitsu/Fujitsu Siemens SPARC Enterprise T1000 and T2000 servers
  § Sun Fire T1000, T2000 servers
  § Sun Netra T2000 server
  § Sun Netra CP3060 Blade
  § Sun Blade T6300 server module
§ UltraSPARC T2 (N2, Victoria Falls): 8 cores x 8 threads
  § 2x threads = area-efficient, enhance cryptography, incorporate FGU
  § New "pick" pipe stage chooses 2 of 8 threads to execute each cycle
  § Double set associativity of L1-I to 8
  § Double fully associative DTLB to 128 entries
  § Double L2 banks to 8
§ UltraSPARC T2 Plus: 16 cores x 8 threads
§ UltraSPARC T3: 16 cores x 8 threads
§ UltraSPARC T4 (2011!): 8 cores, OOO
§ UltraSPARC T5: 16 cores, 28 nm process

Niagara 1 and 2 are open source!
§ First and only 64-bit chip multithreaded microprocessors ever open-sourced, according to OpenSparc.net. Find:
  § Processor design source code (Verilog)
  § Simulation tools
  § Design verification suites
  § Hypervisor source code
§ OpenSPARC can boot real off-the-shelf commercial operating systems (e.g., Solaris, Linux, FreeBSD).
Use a real design for your study or research!

Related work: Piranha
• Piranha: Compaq 2000; Niagara: Sun 2005
• Niagara paper refs Piranha*
• Almost identical rationales
• High BW, low latency
  – 1.6 GB/sec x 8 = 12.8 GB/sec
• 8 Alpha single-issue in-order cores (RISC), individual L1 data and instruction caches, Intra-Chip Switch, shared L2
• 8-stage pipeline:
  – Instruction fetch
  – Register read
  – ALU stages 1-5 (incl. FP & mult.)
  – Write back
* "Other studies have also indicated the significant performance gains possible using this approach on multithreaded workloads." (Kongetira)
A piranha at the Memphis zoo, by Alexdi, Creative Commons license

Remember Niagara
Photos of Niagara Falls courtesy of GoCanada

Circular registers
Kongetira paper
The SPARC Architecture Manual Version 9

Niagara 2 vs UltraSparc T1
Golla, slide 8
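
Backup sketch: thread selection policy
The policy on the "Sparc Pipe" slide (switch between available threads every cycle, prioritize least recently used) can be illustrated with a small simulation. This is a minimal sketch, not the T1's actual thread select logic: the names `select_thread` and `simulate`, and the idea of describing unavailability as a set of stalled cycles per thread, are assumptions made for illustration. Real hardware also weighs instruction type, misses, traps, and resource conflicts, none of which are modeled here.

```python
from typing import Dict, List, Optional, Set


def select_thread(ready: List[bool], last_issued: List[int]) -> Optional[int]:
    """Return the ready thread that issued least recently, or None if none is ready."""
    candidates = [t for t, is_ready in enumerate(ready) if is_ready]
    if not candidates:
        return None
    # min() keeps the first minimum, so ties go to the lowest thread id.
    return min(candidates, key=lambda t: last_issued[t])


def simulate(num_threads: int, cycles: int,
             stalled: Dict[int, Set[int]]) -> List[Optional[int]]:
    """Run the policy for `cycles` cycles.

    `stalled` maps a thread id to the cycles in which it is unavailable
    (a hypothetical stand-in for a long-latency event such as a cache miss).
    """
    last_issued = [-1] * num_threads  # -1 means "has never issued"
    schedule: List[Optional[int]] = []
    for cycle in range(cycles):
        ready = [cycle not in stalled.get(t, set()) for t in range(num_threads)]
        chosen = select_thread(ready, last_issued)
        if chosen is not None:
            last_issued[chosen] = cycle
        schedule.append(chosen)
    return schedule


if __name__ == "__main__":
    # Four threads per pipe, as on the slide; thread 1 is stalled in cycles 1-3.
    print(simulate(4, 6, {1: {1, 2, 3}}))
```

With no stalls the policy degenerates to round-robin. When a thread is stalled, the others fill its slots, and the stalled thread is picked first once it becomes available again because it is then the least recently used.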
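
Backup sketch: the Amdahl argument
The "Amdahl's at it again" slide claims that single-thread techniques mostly shrink compute time, not memory access time. Amdahl's law makes the consequence concrete: if a fraction of execution time is spent waiting on memory, speeding up only the compute part has a hard ceiling. The function name `overall_speedup` and the example fractions below are illustrative assumptions, not figures from the Niagara paper.

```python
def overall_speedup(memory_fraction: float, compute_speedup: float) -> float:
    """Amdahl's law where only the compute share of the time is accelerated.

    memory_fraction: share of baseline time spent on memory access (0..1),
                     assumed to be unaffected by the optimization.
    compute_speedup: factor by which the remaining compute time shrinks.
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    if compute_speedup <= 0.0:
        raise ValueError("compute_speedup must be positive")
    compute_fraction = 1.0 - memory_fraction
    return 1.0 / (memory_fraction + compute_fraction / compute_speedup)


if __name__ == "__main__":
    # Hypothetical workload: 75% of time stalled on memory.
    for s in (2.0, 4.0, 1e9):
        print(f"compute {s:g}x faster -> overall {overall_speedup(0.75, s):.3f}x")
```

Under the assumed 75% memory share, even an unboundedly fast core cannot exceed a 1/0.75 (about 1.33x) overall speedup, which is the motivation for hiding memory latency with many threads instead.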
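
Backup sketch: the SWaP metric
The SWaP formula on the power slide is a plain ratio, so comparing two systems is simple arithmetic. The sketch below assumes space is given in rack units and power in watts; the function name `swap_score` and all numbers are made up for illustration and are not measurements of any real server.

```python
def swap_score(performance: float, space_rack_units: float, power_watts: float) -> float:
    """SWaP = Performance / (Space x Power Consumption), as given on the slide.

    Units are an assumption here: space in rack units, power in watts.
    """
    if space_rack_units <= 0 or power_watts <= 0:
        raise ValueError("space and power must be positive")
    return performance / (space_rack_units * power_watts)


if __name__ == "__main__":
    # Hypothetical servers with the same benchmark score.
    compact = swap_score(performance=1000.0, space_rack_units=1.0, power_watts=200.0)
    bulky = swap_score(performance=1000.0, space_rack_units=4.0, power_watts=500.0)
    print(compact, bulky, compact / bulky)
```

Because space and power multiply in the denominator, a system that is both smaller and lower-power gains on both factors at once for the same performance.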