ASPLOS XI Tutorial
The Liberty Research Group
http://www.liberty-research.org
The Liberty Simulation Environment
Jason Blome, University of Michigan
Prof. Manish Vachharajani, University of Colorado
Neil Vachharajani and Prof. David August, The Liberty Research Group, Princeton University
The Liberty Simulation Environment (LSE)
LSE in One Slide
• LSE is not a simulator!
• LSE defines a HARDWARE DESCRIPTION LANGUAGE
• LSE is a collection of TOOLS:
  • Simulator Builder
  • Visualizer
  • Others…
• LSE supports collections of COMPONENTS and DOMAINS:
  • The Core Library (arbiter, queue…)
  • Architecture Libraries (branch predictors, cache components…)
  • Instruction Set Emulators (IA-64, PowerPC, DLX, MIPS…)
  • Third Party Integrators (BLiSS:SimpleScalar, Simics…)
  • Domains (Checkpointing, Sampling, Clock, Emulator…)
Tutorial Sequence
8:00 Welcome
8:05 LSE Introduction and Philosophy (David)
LSE Basics (Manish)
Your First Configuration (Neil)
10:00 Refreshment Break (30 Minutes)
10:30 Emulators (Manish and Neil)
Building Processor Model (Manish and Neil)
Running OS Code (Jason)
Putting It All Together (David)
The Future of Liberty (David)
12:30 Adjourn
The Hardware Design Process
[Figure: the hardware design process, from design criteria through microarchitecture, RTL, gates, and transistors to layout for fabrication (4-6 yrs); the cost of microarchitecture-level changes ranges from cheap to expensive]
• The microarchitecture is difficult to design
• Design decisions have major implications
• Shortcomings are expensive to correct later
Microarchitecture Design Process
[Figure: design flow with the stages Research Design Techniques, Construct Simulator, Possible Designs, Explore Minor Design Variations, Choose Best Design, and Design for RTL Level; timeline labels 6-12 mos., 9-18 mos., and 4-6 yrs]
• Only 1 major design concept is tried
• Only minor variations, such as cache size and branch prediction, are explored
• Simulator is inaccurate [Black ’98, Gibson ’00, Intel ’04]
  • Predictions off by 10%-30%
The Mapping Problem
[Figure: a hardware block diagram (blocks A, B, C, D and a Sequencer) beside the equivalent software call graph (C, C++, etc.), in which functions A, B, C, and D share a global variable]
• Equivalent functionality but cognitive mismatch
• No mapping discipline for this manual process
  • Error prone
  • Time consuming
  • Little reuse and interoperability
• C, C++ simulators suffer from this problem (Simics, SimpleScalar)
• Others have remnants of this (ASIM, most SystemC models)
The Mapping Problem in Practice
• Must remap to avoid pitfalls [LSE/MICRO35]
  • Locked into major architecture decisions without iteration
  • Hard to keep the simulator up to date (e.g., VLSI designer feedback)
• Reverse mapping is difficult: not transparent
  • Simulators are hard to validate (little trust in simulator results)
  • Simulators do not accurately model the hardware [Hennessy/Flash vs. Flash]
• Temptation to approximate is strong
  • Timing is often approximated without validating the approximation [Emer/Asim]
• Mapping accuracy is important
  • Design details must be modeled accurately to show the true performance of an architecture [Berger/sim-alpha]
A Natural Specification Language
[Figure: a pipeline (Fetch, Decode, Execute, Mem Access, Writeback) built from I$, BPred, I2, and D$ components]
• Modularize the model like hardware, no mapping!
  • Basic components
  • Concurrent computation
  • Communication through ports
  • Called structural modeling
Exam Results
[Chart: Correct Answers (0 to 12) per subject, S1 through S24, plus Average and Average without S8; series: Structural correct with missing wrong, Sequential correct with missing wrong, Structural correct with missing right, Sequential correct with missing right]
Exam Times
[Chart: Time in minutes (0 to 120) per subject for Sequential Questions, Structural Questions, and Control Questions, plus Average and Average without S8; some subjects did not provide timing data]
Managing Hardware Complexity During Design
[Figure: a coarse model (Fetch, Decode, I$, D$) alongside a detailed pipeline (Fetch, Decode, Execute, Mem Access, Writeback with I$, BPred, I2, and D$)]
• Designers reason about the design through decomposition
• Design is refined by adding and improving components
Accelerate Modeling with Reuse
[Figure: flexible, customizable components from a component library, customized and assembled into coarse and detailed pipeline models]
• Reuse components to amortize costs [Charest ’02, Emer ’02, Koegst ’98]
• Real reuse results from structural component use
The Hard Realities of Reuse
• Cannot reuse blocks that cannot be modularized
  • For example: timing control has global pipeline knowledge
  • LSE: Modularization strategy for timing control
• Reuse overhead too high
  • Reuse requires highly flexible components
  • Flexible components require too much specification [Radetzki ’98]
  • LSE: New techniques to infer component parameters
  • LSE: Statically analyzable model structure
  • LSE: Separation of concerns (instrumentation, domains)
• LSE: Don’t write a simulator, specify a model
Type Inference via Structure
[Figure: three delay modules (in:’a out:’a, in:’b out:’b, in:’c out:’c) connected to a diff module (in1:’d, in2:’d, out:bool)]
• Parametric polymorphism can be resolved via static structure
  • ’a = bool, ’b = bool, ’c = bool, ’d = bool
• Basic algorithm functions by solving constraints
  • Similar to type reconstruction for MinML [Harper ’03]
Type Inference is NP-complete [Liberty-04-02]
[Figure: Monotone 1-in-3 SAT reduces to type inference (NP-hard); type inference reduces to unification plus non-determinism (in NP)]
• Problem is NP-hard
  • Can map any monotone 1-in-3 SAT problem to a type inference problem
• Problem is in NP
  • Use non-determinism to decide the direction of ‘or’-types
  • Without or-types, the problem is unification, which is in P [Paterson ’76]
LSE: Natural Specification
[Figure: the µ-arch specification (blocks A, B, C, D) serves directly as the simulator specification; a description compiler uses it with a component library to produce an executable simulator]
• Eliminates mapping problem
  • No preconceived ideas
• Amortize costs
  • Reuse Components [Charest ’02, Emer ’02, Koegst ’98, Radetzki ’98]
  • Enabled by structural specification [MICRO-35, PLDI ’04]
Analyzability and Communication of Ideas
[Figure from H&P]
[Figures: automatically generated visualizations]
Reuse
LSE allows for user-defined modules, but…
LSE comes with some libraries:
• Core components, examples:
  • arbiter - Arbitrate anything with any policy
  • mqueue - Multiple in/out queue of any type
  • router - Route anything with any policy
  • sink - Think /dev/null of any type
  • source - Universal source of any pattern and type
• Architectural components, examples:
  • Cache Replacement Controller - Any replacement policy
  • Cache Request Module - Any request structure
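The “any policy” idea behind arbiter (and router) amounts to a policy-neutral component with a pluggable decision function. As a rough C analogy only, not LSE’s corelib code, the sketch below uses hypothetical names (arbiter, policy_fn, fixed_priority, round_robin) and passes the policy as a function pointer that picks a winner from a request bitmask:

```c
#include <stdint.h>

/* Hypothetical policy signature: bit i of `requests` set means requester i
   wants the resource. Returns the winner's index, or -1 if nobody asks. */
typedef int (*policy_fn)(uint32_t requests, int last_winner, int n);

/* Lowest-numbered requester always wins. */
static int fixed_priority(uint32_t requests, int last_winner, int n) {
    (void)last_winner;
    for (int i = 0; i < n; i++)
        if (requests & (1u << i)) return i;
    return -1;
}

/* Search starts just after the previous winner and wraps around. */
static int round_robin(uint32_t requests, int last_winner, int n) {
    for (int k = 1; k <= n; k++) {
        int i = (last_winner + k) % n;
        if (requests & (1u << i)) return i;
    }
    return -1;
}

/* The arbiter itself is policy-neutral: it only remembers the last winner. */
typedef struct {
    int n;            /* number of requesters (at most 32) */
    int last_winner;
    policy_fn policy;
} arbiter;

static int arbitrate(arbiter *a, uint32_t requests) {
    int w = a->policy(requests, a->last_winner, a->n);
    if (w >= 0) a->last_winner = w;
    return w;
}
```

With requesters 0, 1, and 3 asking every cycle (mask 0xB) and last_winner starting at 3, a round-robin instance grants 0, 1, 3, 0, …, while a fixed-priority instance grants 0 every time. Swapping the policy never touches arbitrate() itself.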
Reuse Across Models in LSE
Model | Instances | Hierarchical Modules | Leaf Modules | Instances/Module | % Instances from Library | Library Modules
------|-----------|----------------------|--------------|------------------|--------------------------|----------------
A | 277 | 46 (10) | 18 | 4.33 (8.61) | 73% | 13
B | 281 | 46 (11) | 18 | 4.39 (8.48) | 73% | 12
C | 62 | 1 | 18 | 3.37 | 73% | 10
D | 192 | 4 | 25 | 6.62 | 86% | 22
E | 329 | 4 | 26 | 10.97 | 89% | 22
F | 183 | 18 (3) | 19 | 4.35 (8.32) | 82% | 18
Total | 1324 | 69 (16) | 39 | 12.26 (22.83) | 80% | 22

A – Tomasulo-style machine that executes DLX instructions
B – Same as A but with a unified issue window
C – Model equivalent to SimpleScalar’s sim-outorder.c machine
D – Out-of-order IA64 core
E – Two of the OoO IA64 cores communicating with a shared memory
F – Validated Itanium 2 model
Flexibility
LSE can model anything with a clock (GALS with the addition of multiple clock domains)
Existing models from users, domestic and international:
• Single and multiple core processors (OoO/in-order)
• Heterogeneous multiprocessor systems
• Power models of interconnection networks (ORION)
• Toy configurations like LFSR (good for tutorials)
• Variety of DLX processors (LSE concept illustration)
• Multicore network interface controller (SPINACH)
• Tiled and novel architectures
• Novel VLIW with power models (Justice)
• Flexible component sets (MicroLib)
LSE in Practice: Rice University, SPINACH Project
[Figure: TIGON-2 NIC LSE model]
• Validated TIGON-2 NIC model by Rice University [Willmann, LCTES ’04]
• Model constructed by 2 students in 1.5 months (~12 person-weeks)!
LSE in Practice: An Itanium 2 Simulator with LSS
[Chart: Cycles per Instruction (0.0 to 8.0), Hardware CPI vs. Simulator CPI for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, and 256.bzip2]
• Itanium 2 simulator built using LSS by one person in 11 weeks!
• Average accuracy within 3% of actual hardware
Simulation Speed of LSE
LSS has an optimizing compiler:
• Emits an HSR (Heterogeneous Synchronous Reactive) simulator [Edwards ’97]
• Optimizes the system to create static schedules where possible [DAC ’03]
• Eliminates the reuse penalty:

Model | Cycles/sec | Speedup (relative to Custom LSE) | Build (sec)
------|------------|----------------------------------|------------
Custom SystemC | 53722 | 0.35 | 49.1
Custom LSE | 154104 | - | 15.4
Reusable LSE | 40649 | 0.26 | 33.9
Reusable LSE with optimization | 57046 | 0.37 | 34.4

[Figure: the Liberty Simulation Environment’s simulator builder takes a module library and an architectural description and produces a simulator]
10x SimpleScalar, but this is just a start to what can be done.
Does this even matter?
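The Speedup column is consistent with dividing each model’s cycles/sec by the Custom LSE figure (for example, 53722 / 154104 ≈ 0.35). A small C check of that arithmetic, using only the numbers in the table (speed_row, speedup, and ratio are illustrative names):

```c
#include <stddef.h>

/* Cycles/sec values copied from the table above. Index 1 (Custom LSE) is
   the baseline the Speedup column appears to be measured against. */
typedef struct {
    const char *model;
    double cycles_per_sec;
} speed_row;

static const speed_row rows[] = {
    {"Custom SystemC", 53722.0},
    {"Custom LSE", 154104.0},
    {"Reusable LSE", 40649.0},
    {"Reusable LSE with optimization", 57046.0},
};

#define BASELINE 1

/* Speedup of row i relative to the baseline (exactly 1.0 for the baseline). */
static double speedup(size_t i) {
    return rows[i].cycles_per_sec / rows[BASELINE].cycles_per_sec;
}

/* Ratio between any two rows, e.g. optimized vs. unoptimized reusable LSE. */
static double ratio(size_t a, size_t b) {
    return rows[a].cycles_per_sec / rows[b].cycles_per_sec;
}
```

By the same arithmetic, ratio(3, 2) shows the optimized reusable model running about 1.40 times as fast as the unoptimized one (57046 / 40649).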
Simulation Speed of LSE
Checkpointing and Sampling Domain
• LSE domain supports a variety of checkpointing and sampling methodologies
• TurboSMARTS [Wunderlich et al. ISCA ’03] in validated Itanium 2
  • Accurate simulation results: < 3% IPC error, 97% confidence
  • Less time than running code on real hardware!
• Other sampling and checkpointing methodologies supported
Planned:
• Additional simulator builder optimizations
• HW assist methods
LSE Key Points Review
• Models built using LSE are accurate
  • 10-30% error → 3% error
  • Independent efforts, same result (TIGON-2 and Itanium 2)
• Building LSE models can be inexpensive
  • Many person-years → few person-months (TIGON-2, Itanium 2)
  • Reuse features that actually get used
• Models are analyzable and meaningful
  • Communicate hardware ideas easily
  • Automatic visualization
• Faster simulation with optimization and sampling support
LSE Basics
A Simple Machine
[Figure: gen’s out port connected to hole’s in port]
• Model a high-level hardware system in which
  • gen outputs a different integer every cycle
  • hole consumes the integer data sent
• Three options:
  • Build a C simulator that models this hardware
  • Write gen and hole components for a modeling system
  • Customize reusable components to build gen and hole
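To make the first option concrete, here is a deliberately monolithic C sketch (all names hypothetical) in which gen and hole are just code inside one cycle loop. The connection between them exists only as a local variable, so adding a stall, a second consumer, or instrumentation means editing the loop itself:

```c
/* Option 1: a hand-written C simulator. gen and hole are not separate
   components, just code inside one cycle loop. */
typedef struct {
    long last_value;   /* what hole consumed most recently */
    long consumed;     /* how many values hole has consumed */
} hole_state;

/* gen: produce a different integer every cycle (here, the cycle number). */
static long gen_output(long cycle) { return cycle; }

/* hole: consume the data sent this cycle. */
static void hole_consume(hole_state *h, long value) {
    h->last_value = value;
    h->consumed++;
}

static void simulate(hole_state *h, long cycles) {
    for (long cycle = 0; cycle < cycles; cycle++) {
        long wire = gen_output(cycle);   /* the gen.out -> hole.in "wire" */
        hole_consume(h, wire);
    }
}
```

After simulate(&h, 5) on a zeroed hole_state, h.consumed is 5 and h.last_value is 4.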
Source Module
[Figure: the gen instance with output port out]
LSS specification:
using corelib;
instance gen:source;
…
• LSE module library – package corelib
• The source module
  • Generates one data item per cycle
  • Data type and actual data produced are customizable
• Instantiate the source module with the name gen
Customizing Modules
• Conventional parameters
  • Size of caches
  • History bits in a branch predictor
  • Number of functional units in a processor core
• The source module can use an array-of-data parameter
  • What about constant values?
  • What about infinite sequences?
[Figure: gen with output port out]
Algorithmic Parameters
[Figure: a cache with a pluggable replacement policy and a branch predictor with a pluggable predictor state machine]
• LSE supports algorithmic parameters
  • Called userpoints
  • A “hole” in module behavior
  • Users fill the hole to customize
  • Sequential code
  • Good for algorithms
Customizing Modules
Algorithmic Parameters
[Figure: the source module with output port out]
module source {
  …
  parameter create_data:
    userpoint(…);
};
• Source module
  • create_data userpoint - controls data output
Customize Source Behavior
using corelib;
instance gen:source;
gen.create_data = <<<
*data = LSE_time_get_cycle(LSE_time_now);
return LSE_signal_something | …
…
>>>;
…
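Conceptually, a userpoint is a hook the module calls at a fixed point in its behavior. The C sketch below mimics the slide with a function pointer; source_module, create_data_fn, SIG_SOMETHING, and the explicit cycle argument are hypothetical stand-ins, not LSE’s LSE_time_* or LSE_signal_* API:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the LSE names used in the userpoint above;
   they only mimic the idea, not LSE's real definitions. */
typedef uint32_t signal_t;
enum { SIG_NOTHING = 0x0, SIG_SOMETHING = 0x1 };   /* data status */

/* The userpoint: a "hole" the module calls, here a function pointer. */
typedef signal_t (*create_data_fn)(long cycle, int *data);

/* Mirrors the slide: output the current cycle number and report that
   data is present ("something") this cycle. */
static signal_t gen_create_data(long cycle, int *data) {
    *data = (int)cycle;
    return SIG_SOMETHING;
}

/* A generic source: it owns the per-cycle timing, and the user-supplied
   userpoint decides what data (if any) to produce. */
typedef struct {
    create_data_fn create_data;
    int out_value;
    signal_t out_status;
} source_module;

static void source_cycle(source_module *s, long cycle) {
    s->out_status = s->create_data(cycle, &s->out_value);
}
```

Calling source_cycle for cycles 0, 1, 2, … yields 0, 1, 2, …, the same sequence the collector prints later in this tutorial.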
Sink Module
[Figure: the hole instance with input port in]
LSS specification:
using corelib;
instance gen:source;
instance hole:sink;
…
• LSE module library – package corelib
• The sink module
  • Consumes one data item per cycle
• Instantiate the sink module with the name hole
Building Structure
[Figure: gen’s out port connected to hole’s in port]
LSS specification:
using corelib;
instance gen:source;
… /* create_data code */
instance hole:sink;
gen.out -> hole.in;
• Connect gen.out to hole.in
Clock Domains
• Current release supports a single synchronous clock
• Next release of LSE will support multiple clocks
  • Instantiate clocks:
    LSE_clock::create("clock1", 100, 0)
    LSE_clock::create("clock2", 35, 50)
  • Assign module instances a clock (or multiple clocks for boundary modules)
  • System automatically manages scheduling and time management
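A multi-clock scheduler essentially merges the edge times of its clocks. The C sketch below ASSUMES (the slide does not say) that the two numeric arguments of LSE_clock::create are a period and a phase offset; sim_clock, clock_init, and next_tick are hypothetical names:

```c
/* Hypothetical clock record: the LSE_clock::create arguments are ASSUMED
   to be (name, period, phase offset) in arbitrary time units. */
typedef struct {
    const char *name;
    long period;
    long offset;
    long next_edge;   /* time of this clock's next edge */
} sim_clock;

static void clock_init(sim_clock *c, const char *name, long period, long offset) {
    c->name = name;
    c->period = period;
    c->offset = offset;
    c->next_edge = offset;
}

/* Advance global time to the earliest pending edge among all clocks and
   return the index of the clock that ticks (lowest index wins ties).
   A scheduler would then run the module instances assigned to that clock. */
static int next_tick(sim_clock *clocks, int n, long *now) {
    int winner = 0;
    for (int i = 1; i < n; i++)
        if (clocks[i].next_edge < clocks[winner].next_edge) winner = i;
    *now = clocks[winner].next_edge;
    clocks[winner].next_edge += clocks[winner].period;
    return winner;
}
```

Under that assumption, clock1 (100, 0) and clock2 (35, 50) tick at 0 (clock1), 50 and 85 (clock2), 100 (clock1), then 120 and 155 (clock2).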
Source-Sink Configuration
using corelib;
instance gen:source;
instance hole:sink;
gen.out ->[int] hole.in;
• LSE visualizer
  • Can automatically visualize specifications
  • The model writer’s mental picture is generated automatically!
Simulator Instrumentation
• What is the output?
• In existing systems
  • Hack the source or sink module to add monitoring code
  • Breaks modularity and reuse
  • Intertwines functionality and instrumentation
Simulator Instrumentation (2)
[Figure: gen’s out port raises an event (sent data 0xbf13) that a collector handles with printf(…)]
collector out.resolved on "gen" {
  record = <<<
    …
    printf(LSE_time_print_args(LSE_time_now));
    printf(": %d\n", *datap);
    …
  >>>;
};
Simulator Instrumentation (3)
Running the model with the collector above prints:
0/0: 0
1/0: 1
2/0: 2
…
<CTRL-C>
%
Instrumentation Benefits
[Figure: a cache raises an event (miss on address 0x3214) that a collector handles with misses++;]
• Instrumentation is separate from behavior
• Different users can reuse the model
• Components can be reused
• Data collection code has global knowledge
• Orthogonality preserves encapsulation
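The same separation can be sketched in C: the cache model only announces a miss event, and a separately written collector decides what to do with it. All names, the 8-set direct-mapped organization, and the 16-byte block size are assumptions of this sketch:

```c
/* Hypothetical event/collector plumbing. */
typedef void (*collector_fn)(void *ctx, unsigned long addr);

#define MAX_COLLECTORS 4

typedef struct {
    collector_fn fn[MAX_COLLECTORS];
    void *ctx[MAX_COLLECTORS];
    int count;
} event;

static int event_attach(event *e, collector_fn fn, void *ctx) {
    if (e->count == MAX_COLLECTORS) return 0;
    e->fn[e->count] = fn;
    e->ctx[e->count] = ctx;
    e->count++;
    return 1;
}

static void event_emit(const event *e, unsigned long addr) {
    for (int i = 0; i < e->count; i++) e->fn[i](e->ctx[i], addr);
}

/* Cache behavior: a toy direct-mapped tag check that emits a miss event.
   It knows nothing about who is listening. */
typedef struct {
    unsigned long tags[8];
    int valid[8];
    event miss_event;
} toy_cache;

static void cache_access(toy_cache *c, unsigned long addr) {
    unsigned long block = addr >> 4;            /* 16-byte blocks */
    unsigned idx = (unsigned)(block % 8);
    if (!c->valid[idx] || c->tags[idx] != block) {
        event_emit(&c->miss_event, addr);       /* "miss on address ..." */
        c->valid[idx] = 1;
        c->tags[idx] = block;
    }
}

/* Collector, written separately: the slide's misses++. */
static void count_misses(void *ctx, unsigned long addr) {
    (void)addr;
    (*(long *)ctx)++;
}
```

Accesses to 0x3214, 0x3218, and 0x4214 produce two misses: the second address hits the block the first one filled, and the third maps to the same set with a different tag.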
More Complex Configuration
More Complex Configuration
Linear Feedback Shift Register (LFSR)
[Figure: delay modules bit2 → bit1 → bit0 in a chain; bit1 and bit0 drive the xor gate (in1, in2), whose out feeds back into bit2]
• Built using reusable modules in corelib
• bit0, bit1, bit2
  • unit delay modules, called delay
LFSR Specification
using corelib;
include "xor.lss";
instance bit0 : delay;
instance bit1 : delay;
instance bit2 : delay;
instance xor : xor_gate;
instance bit1_tee : tee;
bit2.out -> bit1.in;
bit1.out -> bit1_tee.in;
bit1_tee.out[0] -> xor.in0;
bit1_tee.out[1] -> bit0.in;
bit0.out -> xor.in1;
xor.out -> bit2.in;
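The LFSR’s behavior can be checked with a few lines of C that follow the connections in the specification. This is a hand-written model of the same wiring, not what LSS generates; the slides do not give initial register contents, so the seed is an assumption:

```c
/* Bit-level model of the LFSR wiring above:
     bit2.out -> bit1.in, bit1.out -> bit0.in (via bit1_tee),
     bit1 and bit0 -> xor, xor.out -> bit2.in.
   Each delay outputs last cycle's input, so all three bits update together
   at the end of every cycle. The state packs bit2:bit1:bit0 into 3 bits. */
static unsigned lfsr_step(unsigned state) {
    unsigned bit2 = (state >> 2) & 1u;
    unsigned bit1 = (state >> 1) & 1u;
    unsigned bit0 = state & 1u;
    unsigned next2 = bit1 ^ bit0;   /* xor.out -> bit2.in */
    unsigned next1 = bit2;          /* bit2.out -> bit1.in */
    unsigned next0 = bit1;          /* bit1_tee.out[1] -> bit0.in */
    return (next2 << 2) | (next1 << 1) | next0;
}

/* Number of steps until the state returns to the seed (the period). */
static int lfsr_period(unsigned seed) {
    unsigned s = seed;
    int steps = 0;
    do {
        s = lfsr_step(s);
        steps++;
    } while (s != seed && steps < 8);
    return steps;
}
```

Seeded with bit2 = 1, bit1 = 0, bit0 = 0 (state 4), the register visits 4, 2, 5, 6, 7, 3, 1 and returns to 4, a period of 7 that covers every nonzero 3-bit state; the all-zero state maps to itself.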
The tee Module
tee
[Figure: the tee module with input port in and output port out]
• Fans out its input to multiple outputs
  • The number of connections to out determines the fan-out degree
• Each port in LSE is an array of ports
  • out – port
  • out[i] – port instance
Ports and Port Instances
using corelib;
include "xor.lss";
instance bit0 : delay;
instance bit1 : delay;
instance bit2 : delay;
instance xor : xor_gate;
instance bit1_tee : tee;
bit2.out -> bit1.in;
bit1.out -> bit1_tee.in;
bit1_tee.out -> xor.in0;
bit1_tee.out -> bit0.in;
bit0.out -> xor.in1;
xor.out -> bit2.in;
Linear Feedback Shift Register
Live Demo
What About Data Types?
[Figure: the bit0 delay module with polymorphic ports in:’a and out:’a, alongside a version with in:bool and out:bool]
• LSE components can be polymorphic
  • Support more than one datatype
• Polymorphism allows type-neutral modules
  • Flexible, reusable queues, memories, crossbars, etc.
• Delay module
  • Can store data of any type
  • But only one type per instance at run-time
Polymorphism
Type Inference
[Figure: bit2 (in:’a out:’a), bit1 (in:’b out:’b), and bit0 (in:’c out:’c) connected to xor (in1:bool, in2:bool, out:bool)]
• Explicit instantiation is burdensome
• Polymorphism resolved through type inference
  • Reduces cumbersome type instantiation process
  • ’a = bool, ’b = bool, ’c = bool
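The inference step can be sketched as unification with a union-find structure: each connection equates the types at its two ends, and any type variable that ends up in the same set as bool is resolved to bool. This is a minimal sketch under assumptions (the tee’s type variable and all C names are hypothetical), not LSE’s implementation:

```c
/* Nodes 0..1 are concrete types; the rest are type variables. */
enum { T_BOOL, T_INT, NUM_CONCRETE };
enum {
    V_A = NUM_CONCRETE,   /* bit2 ports: 'a */
    V_B,                  /* bit1 ports: 'b */
    V_C,                  /* bit0 ports: 'c */
    V_TEE,                /* bit1_tee ports (hypothetical variable) */
    NUM_NODES
};

static int parent[NUM_NODES];

static void reset(void) {
    for (int i = 0; i < NUM_NODES; i++) parent[i] = i;
}

static int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path halving */
        x = parent[x];
    }
    return x;
}

/* A connection a.out -> b.in means "these two types are equal".
   Returns 0 if it would equate two different concrete types. */
static int unify(int x, int y) {
    int rx = find(x), ry = find(y);
    if (rx == ry) return 1;
    if (rx < NUM_CONCRETE && ry < NUM_CONCRETE) return 0;
    if (rx < NUM_CONCRETE) parent[ry] = rx;   /* keep concrete type as root */
    else parent[rx] = ry;
    return 1;
}

/* Concrete type a node resolves to, or -1 if it is still polymorphic. */
static int resolve(int x) {
    int r = find(x);
    return r < NUM_CONCRETE ? r : -1;
}

/* One constraint per connection in the LFSR specification. */
static int infer_lfsr(void) {
    reset();
    return unify(V_A, V_B)         /* bit2.out -> bit1.in */
        && unify(V_B, V_TEE)       /* bit1.out -> bit1_tee.in */
        && unify(V_TEE, T_BOOL)    /* bit1_tee.out[0] -> xor input */
        && unify(V_TEE, V_C)       /* bit1_tee.out[1] -> bit0.in */
        && unify(V_C, T_BOOL)      /* bit0.out -> xor input */
        && unify(T_BOOL, V_A);     /* xor.out -> bit2.in */
}
```

A type variable that is only equated with other variables, such as two polymorphic ports wired only to each other, keeps resolve() == -1, which is the unresolved case of the source and sink modules.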
Polymorphism
Type Inference
[Figure: gen (out:’a) connected to hole (in:’b)]
• Source and sink modules are polymorphic
• Type inference cannot resolve polymorphism!
Type Constraints
[Figure: gen (out:’a) connected to hole (in:’b)]
User can constrain types:
using corelib;
instance gen:source;
instance hole:sink;
gen.out ->[int] hole.in;
• ’a = int, ’b = int
Flexible Structure
Scalable Port-Widths
[Figure: a register_file with read, response, and write ports, configured with either 2 read ports or 1 read port]
• Flexible interfaces are needed for reuse
  • e.g., to control the request width and result width
• Examples
  • Number of read ports on register files and memories
  • Sizing of crossbars and arbiters
  • Branch prediction requests
Parametric Customization of Structure
[Figure: inside register_file, a data array whose read outputs feed MUXes, with read, write, and response ports; shown with 2 read ports and with 1 read port]
• Different levels of model refinement require different numbers of ports
• Variation in ports requires varying structure
  • One register read port requires one MUX
  • Two register read ports require two MUXes
Parametric Customization of Structure
num_reads = 2;  /* or 1 */
module register_file {
  parameter num_reads:int;
  inport read:read_req;
  inport write:write_req;
  instance data_array:…;
  instance muxes:mux[num_reads];
  for(i=0;i<num_reads;i++){
    data_array.out[i] -> muxes[i].in[0];
  }
  …
};
[Figure: register_file with its data array feeding one MUX per read port]
• Flexible interfaces add parameterization overhead
  • Our Itanium 2 model needs 523 interface sizing parameters
Lowering Overhead with Use-based Specialization
[Figure: the user makes 2 connections (or 1 connection) to register_file’s read port; the data array and MUXes inside are sized to match]
• Infer parameters from usage (i.e., use-based specialization)
• Other customizations based on usage
  • Types: if auxiliary data ports are connected, the output type is a struct
  • Semantics: if the branch target port is connected, a BTB is instantiated
• Eliminates 523 interface size parameters for the Itanium 2 model
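Use-based specialization can be pictured as a pass over the connection list before the model is built: count how many connections target a port and derive the internal structure from that count. A minimal C sketch with hypothetical names (connection, inferred_width, specialize_register_file); it is not how LSS implements the feature:

```c
#include <string.h>

/* One connection in the model: src -> dst_inst.dst_port. */
typedef struct {
    const char *src;        /* driving port, e.g. "decode.req" */
    const char *dst_inst;   /* receiving instance */
    const char *dst_port;   /* receiving port */
} connection;

/* Width of inst.port = number of connections that target it. */
static int inferred_width(const connection *conns, int n,
                          const char *inst, const char *port) {
    int width = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(conns[i].dst_inst, inst) == 0 &&
            strcmp(conns[i].dst_port, port) == 0)
            width++;
    return width;
}

typedef struct {
    int num_reads;    /* inferred from usage, not user-specified */
    int num_muxes;    /* structure derived from it: one MUX per read port */
} register_file_cfg;

static register_file_cfg specialize_register_file(const connection *conns, int n,
                                                  const char *inst) {
    register_file_cfg cfg;
    cfg.num_reads = inferred_width(conns, n, inst, "read");
    cfg.num_muxes = cfg.num_reads;
    return cfg;
}
```

Two connections to rf.read (plus one to rf.write) yield num_reads = 2 and two MUXes, with no num_reads parameter written by the user.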
Torus Live Demo
Control in LSE
Timing Control Example
[Figure: source → block 1 → block 2 → comp; a central controller asks “comp stall?” and answers stall (yes) or no stall (no) for the earlier blocks]
• Timing control handles stalling
• Timing control is logically centralized
  • Controller has global knowledge
More Complex Timing Control
[Figure: source → 3-slot buffer → block 2 → comp; the controller asks “comp stall?” to decide whether to stall block 2 and “buffer full?” to decide whether to stall source]
• Even simple datapath changes require updating the controller
• Controller cannot be reused!
Timing Control and Existing Modeling Systems
• Existing work treats timing control as a global entity
• Control neutrality
  • SystemC, Objective VHDL, UPFAST, HASE, etc.
  • Approaches focus on other problems, not timing control
• Specification of global timing controller
  • Generality versus complexity tradeoff
  • Template-based
    • LISA [Pees ’99], RADL [Siska ’98]
    • Very limited architecture class
  • Alternative representations
    • Expression [Mishra ’01], MADL [Qin ’02]
    • Generality vs. complexity
• Modularize control instead
[Figure: global timing control]
The Components of Timing Control
[Figure: Timing Control Tasks: Back-pressure Stall Distribution, Peer Stall Distribution, Semantic Stall Detection, Structural Stall Detection]
• Stall detection: when to stall
  • Structural stall conditions
  • Semantic stall conditions
• Stall distribution: what to stall
  • Backpressure stall distribution
  • Peer stall distribution
Stall Detection
[Figure: source → 3-slot buffer → block 2 → comp, with a structural stall at the buffer and a semantic stall at comp]
• Semantic stalls
  • Stall condition varies as semantics change
  • Data hazards, control hazards, etc.
• Structural stalls
  • Stall condition invariant across different semantics
  • No buffer slots, bus arbitration loss, etc.
  • Should be reusable!
Stall Distribution
[Figure: source, 3-slot buffer, block 2, comp, block 3, and comp2, with a stall signal propagating through the datapath]
• Stalls propagate along the datapath
• Back pressure
  • Stall earlier blocks in the pipeline
  • Follows the opposite direction of the datapath
• Coordination
  • Stall peers in the pipeline
  • Usually follows the datapath at fanout nodes
Modularizing Backpressure Stall Distribution
[Figure: source → block 1 → block 2 → comp; comp decides “comp stall?” and sends stall or no stall back toward the earlier blocks]
• A reverse control signal carries backpressure stalls
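In code, modular back-pressure means each stage computes its own ack from the ack of the stage after it, so the signal is evaluated from the last stage back toward the source. A minimal C sketch, assuming single-entry stages and hypothetical names (pipeline, pipeline_cycle):

```c
#include <stdbool.h>

#define STAGES 3

typedef struct {
    bool full[STAGES];
    int data[STAGES];
} pipeline;

/* One clock cycle. `sink_stall` is the final block's own (semantic) stall
   decision; `have_new`/`new_data` is what the source offers.
   Returns true if the source's item was accepted (the source saw an ack). */
static bool pipeline_cycle(pipeline *p, bool sink_stall, bool have_new, int new_data) {
    bool ack[STAGES + 1];            /* ack[i]: stage i accepts input this cycle */
    ack[STAGES] = !sink_stall;       /* consumer after the last stage */
    for (int i = STAGES - 1; i >= 0; i--)
        ack[i] = !p->full[i] || ack[i + 1];   /* empty, or able to drain */

    /* Move data downstream, last stage first, wherever acks allow. */
    for (int i = STAGES - 1; i >= 0; i--) {
        if (p->full[i] && ack[i + 1]) {
            if (i + 1 < STAGES) {
                p->data[i + 1] = p->data[i];
                p->full[i + 1] = true;
            }
            p->full[i] = false;      /* taken by the next stage or the sink */
        }
    }
    if (have_new && ack[0]) {
        p->data[0] = new_data;
        p->full[0] = true;
        return true;
    }
    return false;
}
```

With the final block stalled, the three stages fill and the fourth item is refused; once the stall clears, everything shifts forward by one in the same cycle. No stage looks beyond its immediate successor.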
Modularizing Structural Stall Detection
[Figure: the 3-slot buffer decides “buffer full & stall?” to stall source, and comp decides “comp stall?” to stall block 2]
• Structural stalls are encapsulated in reusable components
Modularizing Peer Stall Distribution and Coordination
[Figure: source → 3-slot buffer fanning out to block 2 → comp and block 3 → comp2, with coordination logic at the fanout]
• A forward control signal handles peer stalls
• Coordination is controlled at the fanout
  • Reasonable default semantics for almost all components
  • User-overridable
Timing Control Modularization Summary
[Figure: Timing Control Tasks: Back-pressure Stall Distribution, Peer Stall Distribution, Semantic Stall Detection, Structural Stall Detection]
• Control abstraction makes 3 of 4 portions reusable
• Other approaches are good at semantic stalls
  • Expression, MADL, etc. can be leveraged!
Control Signals
3 Signal Values
[Figure: Module A’s output port sends data and enable to Module B’s input port; Module B returns ack]
Signal | Values
-------|-------
DATA | something, nothing, unknown
ENABLE | enabled, disabled, unknown
ACK | ack, nack, unknown
• Each signal transitions from unknown to exactly one known value per cycle
Control Signals
3 Signal Semantics
[Figure: Module A and Module B, each with computation and internal state, connected by data and enable (A to B) and ack (B to A)]
• Module A sends data (unknown -> something | nothing)
• Module B receives data and acknowledges
  • unknown -> ack
• Module A enables based on acknowledge
  • unknown -> enabled
Control Signals
Delay Element
[Figure: a delay element with internal state, carrying data, enable, and ack on its in and out ports; the ack on in depends on whether space is available]
• If space is available
  • Send ack
  • If input enabled, update state at the end of cycle
• Otherwise
  • Send nack
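The three-valued signals and the delay element’s rule can be written down directly. The enum encoding, the struct names, and the choice to ignore “nothing” data are assumptions of this C sketch, not LSE definitions:

```c
#include <stdbool.h>

/* Hypothetical encoding of the three signals on one connection. Each
   starts the cycle as *_UNKNOWN and resolves to one known value. */
typedef enum { DATA_UNKNOWN, DATA_SOMETHING, DATA_NOTHING } data_sig;
typedef enum { EN_UNKNOWN, EN_ENABLED, EN_DISABLED } enable_sig;
typedef enum { ACK_UNKNOWN, ACK_ACK, ACK_NACK } ack_sig;

typedef struct {
    data_sig data;
    int value;            /* meaningful only when data == DATA_SOMETHING */
    enable_sig enable;    /* driven by the sender */
    ack_sig ack;          /* driven by the receiver */
} connection_signals;

static void signals_reset(connection_signals *c) {   /* start of a cycle */
    c->data = DATA_UNKNOWN;
    c->enable = EN_UNKNOWN;
    c->ack = ACK_UNKNOWN;
}

typedef struct {
    bool has_value;
    int stored;
} delay_element;

/* Delay element, input side: ack if space is available, otherwise nack. */
static ack_sig delay_input_ack(bool space_available) {
    return space_available ? ACK_ACK : ACK_NACK;
}

/* Sender (Module A): enable based on the acknowledge it received. */
static enable_sig sender_enable(ack_sig ack) {
    return ack == ACK_ACK ? EN_ENABLED : EN_DISABLED;
}

/* End of cycle: latch the input only if it was acked and enabled.
   Leaving the state unchanged on "nothing" is an assumption. */
static void delay_end_of_cycle(delay_element *d, const connection_signals *in) {
    if (in->ack == ACK_ACK && in->enable == EN_ENABLED &&
        in->data == DATA_SOMETHING) {
        d->stored = in->value;
        d->has_value = true;
    }
}
```

In one cycle the sender offers data, the delay element acks or nacks, the sender enables only on ack, and the state changes only at the end of the cycle.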
Combinational Control Signals
[Figure: combinational logic between in and out; data passes through the computation, while enable and ack pass straight through]
• Process input data and generate new data
• Pass enable and ack straight through
Combinational Control Signals
[Figure: an xor module with inputs in1 and in2 and output out]
• For multiple inputs
  • Combine enable signals
  • Fan out ack signals
Live Demo
Putting it all together - Pipeline Stall Control
Live Demo
Putting it all together – Stalls in the Torus
Customizing Control
Control Functions
[Figure: input and output control functions on a module’s ports, each mapping between local and global data, enable, and ack signals (Local Input and Global Input at the input port; Global Output and Local Output at the output port)]
Control Functions
LFSR Control
• Desired Control
  • Every bit updates every cycle
Control Functions
LFSR Custom Control
[Figure: an input control function on bit2’s in port, mapping the local input (data, enable, ack) to bit2’s global input]
bit2.in.control = <<<
  return LSE_signal_extract_data(istatus) |
         LSE_signal_enabled |
         LSE_signal_ack;
>>>;
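The control function above can be read as a pure function of the incoming status. The bit layout and the names below are hypothetical (LSE’s encoding of the LSE_signal_* values is not shown in these slides), but the logic mirrors the slide: keep the data, force enable and ack.

```c
#include <stdint.h>

/* Hypothetical packed status word for one port. */
typedef uint32_t status_t;

#define ST_DATA_SOMETHING (1u << 0)
#define ST_DATA_NOTHING   (1u << 1)
#define ST_ENABLED        (1u << 2)
#define ST_DISABLED       (1u << 3)
#define ST_ACK            (1u << 4)
#define ST_NACK           (1u << 5)
#define ST_DATA_MASK      (ST_DATA_SOMETHING | ST_DATA_NOTHING)

/* Analogue of extracting only the data part of a status word. */
static status_t extract_data(status_t s) { return s & ST_DATA_MASK; }

/* Analogue of bit2.in.control: keep whatever data arrived, but always
   report enabled and ack so the bit updates every cycle, regardless of
   what the incoming enable/ack said. */
static status_t lfsr_input_control(status_t istatus) {
    return extract_data(istatus) | ST_ENABLED | ST_ACK;
}
```

An incoming status of “something, disabled, nack” comes out as “something, enabled, ack”, which is why every bit of the LFSR updates every cycle.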
Structural Control
[Figure: an explicit control module placed in front of a module’s in port, relaying data, enable, and ack]
• A control function is like a small module
• Can make this explicit
  • Add extra inputs
  • Build using reusable, concurrently executing components
  • e.g., the gate module in corelib
Control Review
• Default Control
  • Corresponds to back-pressure
  • Makes reusing components easy in the common case
• Control Functions
  • Customize control locally
  • Simple, easy to specify
• Explicit Structural Control
  • When all else fails, use explicit control structure
End of Part 1
30 minute break
Building Processor Models
Part 2
DLX Config
Microarchitecture and ISA
[Figure: the Intel IA64 ISA implemented by the Itanium and Itanium II microarchitectures]
• Core ISA constant
• Microarchitecture varies
Bulk Execution
[Figure: Bulk Execution: the emulator provides instruction behavior at Fetch, while the Fetch, Decode, EX, M, WB pipeline with I-mem, RF, and D-mem contains the modeling code]
Bulk Execution w/ Rollback
[Figure: Bulk Execution w/ Rollback: the emulator provides instruction behavior at Fetch, and the pipeline modeling code (Fetch, Decode, EX, M, WB with I-mem and D-mem) adds multiple register file (RF) copies and resolution logic (Res. Logic)]
Full Callback
[Figure: Full Callback: the emulator’s instruction behavior is split into Fetch, Decode, EX, M, and WB steps that the corresponding pipeline stages call into; the modeling code keeps register file (RF) copies, D-mem, and resolution logic (Res. Logic)]
Microarchitecture Model Maintains ISA State
[Figure: the emulator performs only Decode and EX, while the microarchitecture model holds the ISA state (RF, M, D-mem) alongside Fetch, Decode, I-mem, WB, and resolution logic (Res. Logic)]
Emulation in LSE
[Figure: the Fetch stage calls the emulator through the emulator interface (LSE_emu_do…(…)); instruction behavior lives in the emulator, and the modeling code keeps register file (RF) copies and memory (M)]
• LSE supports abstraction of ISA functionality
  • via the emulator interface
• Microarchitecture model invokes functions in the ISA model to implement instruction behavior
• Emulator interface is flexible
  • As simple as bulk emulation at fetch
  • Detailed enough for all data to be managed in the µarch model
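The range from bulk emulation to fine-grained calls can be illustrated with a toy ISA. Everything here is hypothetical (a single “addi rd, rs, imm” instruction and invented function names, not LSE’s emulator API); the point is that one coarse call and a sequence of per-stage calls leave the ISA state identical:

```c
#include <stdint.h>

/* Toy ISA with one "addi rd, rs, imm" instruction. */
typedef struct { int rd, rs; int32_t imm; } insn;
typedef struct { int32_t regs[8]; } isa_state;

/* Fine-grained calls: the timing model invokes each step from the stage
   that models it and carries intermediate values in its own latches. */
static int32_t emu_read_operand(const isa_state *s, const insn *i) {
    return s->regs[i->rs];
}

static int32_t emu_execute(const insn *i, int32_t operand) {
    return operand + i->imm;
}

static void emu_writeback(isa_state *s, const insn *i, int32_t result) {
    s->regs[i->rd] = result;
}

/* Coarse call: bulk emulation of the whole instruction, e.g. at fetch. */
static void emu_do_instruction(isa_state *s, const insn *i) {
    emu_writeback(s, i, emu_execute(i, emu_read_operand(s, i)));
}
```

Running {r1 = r0 + 5; r2 = r1 - 2; r1 = r2 + 10} either way ends with r1 = 13 and r2 = 3; the fine-grained form simply lets the pipeline model hold the intermediate values, which is what rollback and full-callback styles need.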
Live Demo
Building a Simple IA64 Processor Model
Demo Details
[Figure: the IA64 Emulator behind a Fetch, Decode, EX, M, WB pipeline with register file (RF) copies and memory (M)]
• IA64 Emulator w/ reusable components for µarch
• Port of SimpleScalar Emulator Interface
• Others possible
  • Simics interface, ARM, etc.
Itanium 2 Model
Running OS-level Code
Full System Simulation/Emulation Issues
• Firmware
• Virtual Memory Management
• Interrupt Handling
• Device Support
Firmware Overview
• Processor Abstraction Layer (PAL)
  • Abstracts processor implementation
• System Abstraction Layer (SAL)
  • Abstracts platform implementation
• Extensible Firmware Interface (EFI)
  • Interface between the OS and the platform firmware
Firmware Overview
[Figure from the Intel IA64 Software Developer’s Manual, Volume 2]
IA64 Emulator Firmware
[Figure: IA64 emulator firmware, showing the platform description, the firmware C code, and the resulting firmware image]
Virtual Memory Management
• Hardware structures
• Translation Lookaside Buffer (TLB)
• Instruction/data translation registers and translation caches
• Region registers
• 8 region registers can identify up to 2^24 61-bit address spaces
• Protection key registers
• Permit domain-granular protection for page access
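On IA-64, the top three bits of a 64-bit virtual address (the virtual region number) select one of the 8 region registers, and the region ID held in that register qualifies the remaining 61-bit offset; 24-bit region IDs, as the slide assumes, give the 2^24 figure. A minimal Python sketch of that split follows. It models each region register as a bare region ID and omits the other fields the real registers carry; the function names are hypothetical.

```python
VRN_BITS = 3          # top 3 bits of a 64-bit VA select a region register
OFFSET_BITS = 61      # remaining bits: offset within the region
RID_BITS = 24         # region ID width assumed on the slide
OFFSET_MASK = (1 << OFFSET_BITS) - 1
RID_MASK = (1 << RID_BITS) - 1


def split_va(va):
    """Split a 64-bit virtual address into (virtual region number, 61-bit offset)."""
    if not 0 <= va < (1 << 64):
        raise ValueError("not a 64-bit address")
    return va >> OFFSET_BITS, va & OFFSET_MASK


def global_va(va, region_ids):
    """Prefix the 61-bit offset with the region ID from the selected register.

    region_ids models the 8 region registers as bare region IDs (simplification).
    """
    vrn, offset = split_va(va)
    rid = region_ids[vrn] & RID_MASK
    return (rid << OFFSET_BITS) | offset


# 2**24 distinct region IDs, each naming its own 2**61-byte address space.
NUM_ADDRESS_SPACES = 1 << RID_BITS
```

For example, address 0xE000_0000_0000_1234 has virtual region number 7 and offset 0x1234, so it is qualified by whatever region ID the OS loaded into region register 7.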
IA64 Emulator Virtual Memory Management
[Figure: a virtual memory address passes through a memory lookup that maps its virtual block to a memory block]
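The block-based lookup in the figure can be sketched as a sparse memory keyed by block number, so that an emulator only allocates the blocks a program actually touches. This is a minimal Python illustration, not the IA64 emulator's code: the class name, the 4096-byte block size, and little-endian byte order are all assumptions.

```python
class SparseMemory:
    """Hypothetical emulator memory: block number -> bytearray block.
    Blocks are allocated on first write; unwritten bytes read as zero."""

    def __init__(self, block_size=4096):  # block size is an assumption
        self.block_size = block_size
        self.blocks = {}

    def read8(self, addr):
        num, off = divmod(addr, self.block_size)
        block = self.blocks.get(num)
        return 0 if block is None else block[off]

    def write8(self, addr, value):
        num, off = divmod(addr, self.block_size)
        block = self.blocks.get(num)
        if block is None:
            block = self.blocks[num] = bytearray(self.block_size)
        block[off] = value & 0xFF

    def read(self, addr, size):
        """Little-endian read of `size` bytes; may span block boundaries."""
        data = bytes(self.read8(addr + i) for i in range(size))
        return int.from_bytes(data, "little")

    def write(self, addr, value, size):
        """Little-endian write of the low `size` bytes of `value`."""
        value &= (1 << (8 * size)) - 1
        for i, byte in enumerate(value.to_bytes(size, "little")):
            self.write8(addr + i, byte)
```

Reads of untouched addresses return zero without allocating anything, which keeps large, sparsely used address spaces cheap to emulate.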
IA64 Emulator Virtual Memory Management
[Figure: a virtual memory address is looked up in the instruction TLB or the data TLB, possibly raising a fault; otherwise the resulting physical memory address selects a memory block]
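The translation step in the figure can be sketched as two separate TLBs, with a miss surfacing as a fault. This minimal Python sketch assumes a single fixed 16 KB page size and treats every miss as a fault; on real IA-64 hardware a miss may first be handled by the VHPT walker, and TR/TC distinctions and protection keys are ignored here. All names are hypothetical.

```python
PAGE_SHIFT = 14                    # assumption: one fixed 16 KB page size
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TLBMiss(Exception):
    """Raised when no translation exists; stands in for a TLB fault."""

    def __init__(self, kind, va):
        super().__init__(f"{kind} TLB miss at {va:#x}")
        self.kind = kind
        self.va = va


class TLB:
    def __init__(self, kind):
        self.kind = kind
        self.entries = {}          # virtual page number -> physical page number

    def insert(self, vpn, ppn):
        self.entries[vpn] = ppn

    def translate(self, va):
        try:
            ppn = self.entries[va >> PAGE_SHIFT]
        except KeyError:
            raise TLBMiss(self.kind, va) from None
        return (ppn << PAGE_SHIFT) | (va & PAGE_MASK)


itlb = TLB("instruction")
dtlb = TLB("data")


def translate(va, is_fetch):
    """Instruction fetches use the instruction TLB; loads and stores use the data TLB."""
    return (itlb if is_fetch else dtlb).translate(va)
```

Keeping the two TLBs separate means a page mapped only for data still faults on instruction fetch, which is the distinction the figure draws.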
IA64 Interrupt Handling
• IVA-based: interruptions serviced by the OS, vectored through the interruption vector table located at the Interruption Vector Address (IVA)
• PAL-based: interruptions serviced by PAL firmware or system firmware, vectored through hardware entry points directly into PAL firmware
• Interruption types:
• Initialization
• Platform management
• External (non-maskable/external controller)
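For IVA-based interruptions, the handler address is the IVA plus the offset of the relevant vector table entry. The sketch below assumes the IA-64 layout of a 32 KB table with 68 entries: 20 entries of 64 bundles (0x400 bytes) followed by 48 entries of 16 bundles (0x100 bytes); check these constants against Volume 2 of the manual before relying on them. The routing table covers only the interruption types named above, and all names are hypothetical.

```python
# Assumed IA-64 interruption vector table (IVT) layout: 68 entries in 32 KB.
LARGE_COUNT, LARGE_SIZE = 20, 0x400   # entries 0-19: 64 bundles (1 KB) each
SMALL_COUNT, SMALL_SIZE = 48, 0x100   # entries 20-67: 16 bundles (256 B) each
IVT_SIZE = LARGE_COUNT * LARGE_SIZE + SMALL_COUNT * SMALL_SIZE  # 0x8000

# Which mechanism services the interruption types listed on the slide.
ROUTE = {"initialization": "PAL", "platform management": "PAL", "external": "IVA"}


def ivt_offset(index):
    """Byte offset of IVT entry `index` from the start of the table."""
    if not 0 <= index < LARGE_COUNT + SMALL_COUNT:
        raise ValueError(f"no IVT entry {index}")
    if index < LARGE_COUNT:
        return index * LARGE_SIZE
    return LARGE_COUNT * LARGE_SIZE + (index - LARGE_COUNT) * SMALL_SIZE


def vector_address(iva, index):
    """Handler address for an IVA-based interruption.
    The table is 32 KB aligned, so the low 15 bits of IVA are ignored."""
    return (iva & ~(IVT_SIZE - 1)) + ivt_offset(index)
```

For instance, the External Interrupt vector is entry 12 at offset 0x3000, so vector_address(iva, 12) gives the OS handler address for external interrupts, while initialization and platform management interruptions bypass the table and go to PAL entry points.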
Adding Device Support to LSE Timing Models
[Figure: a CPU and devices Dev 1, Dev 2, and Dev 3 connected by an interconnect; the CPU reaches its instruction behavior modeling code (the emulator) through the emulator interface, with memory (M) alongside]
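One way to picture the interconnect in the figure is as an address decoder that forwards each CPU load or store to whichever device claims that address. The following Python sketch assumes simple memory-mapped devices with non-overlapping address ranges; the Device and Interconnect classes and the example addresses are hypothetical, not LSE modules.

```python
class Device:
    """Hypothetical memory-mapped device with a small register file."""

    def __init__(self, name, base, size):
        self.name, self.base, self.size = name, base, size
        self.regs = {}

    def read(self, offset):
        return self.regs.get(offset, 0)

    def write(self, offset, value):
        self.regs[offset] = value


class Interconnect:
    """Routes CPU loads and stores to the device that claims the address."""

    def __init__(self):
        self.devices = []

    def attach(self, dev):
        for d in self.devices:
            if dev.base < d.base + d.size and d.base < dev.base + dev.size:
                raise ValueError(f"{dev.name} overlaps {d.name}")
        self.devices.append(dev)

    def route(self, addr):
        for d in self.devices:
            if d.base <= addr < d.base + d.size:
                return d, addr - d.base
        raise LookupError(f"no device at {addr:#x}")

    def load(self, addr):
        dev, off = self.route(addr)
        return dev.read(off)

    def store(self, addr, value):
        dev, off = self.route(addr)
        dev.write(off, value)
```

Because routing is by address range, adding Dev 3 is a single attach call and the CPU side needs no changes.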
Multiple Clock Domains
• The current release supports a single synchronous clock
• The next release of LSE will support multiple clocks:
• Instantiate clocks:
    LSE_clock::create("clock1", 100, 0)
    LSE_clock::create("clock2", 35, 50)
• Assign each module instance a clock (or multiple clocks for boundary modules)
• The system automatically manages scheduling and time
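The slide does not say what the two numeric arguments to LSE_clock::create mean. Assuming they are a period and a phase offset in a common time unit, a scheduler for several clock domains mainly has to merge every clock's edge times into one time-ordered event stream. A minimal Python sketch using a heap follows; Clock and edge_schedule are hypothetical names, not LSE code.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Clock:
    name: str
    period: int   # assumed meaning of the 2nd create() argument
    phase: int    # assumed meaning of the 3rd: time of the first edge

    def __post_init__(self):
        if self.period <= 0:
            raise ValueError("period must be positive")


def edge_schedule(clocks, until):
    """Return (time, clock name) for every edge at or before `until`,
    in time order; simultaneous edges go to the clock listed first."""
    heap = [(c.phase, i) for i, c in enumerate(clocks)]
    heapq.heapify(heap)
    events = []
    while heap and heap[0][0] <= until:
        t, i = heapq.heappop(heap)
        events.append((t, clocks[i].name))
        heapq.heappush(heap, (t + clocks[i].period, i))
    return events


clocks = [Clock("clock1", 100, 0), Clock("clock2", 35, 50)]
```

With these two clocks, the edges up to time 200 interleave as clock1 at 0, clock2 at 50, 85, 120, 155, and 190, and clock1 again at 100 and 200. The heap keeps each step logarithmic in the number of clocks rather than scanning every domain.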
End of Talk
Now It Is Your Turn
• We did lots of research to get structural modeling right
• Many have benefited greatly from the powerful reuse it enables
• LSE has become a standard framework for the exchange of architectural components and designs
• The power of this reuse grows with the number of users in the community, so…
Join In!
The Free Spirit
In this vein, we strive to grow this community…
LSE tools and libraries are non-copylefted free software:
• Share your ideas with companies
• Use LSE in your company
We respond to your feedback:
• Let us know what features you would like
• Tell us about your likes/dislikes
Stay in Touch
Mailing Lists
See the Liberty website to sign up.
liberty@lists.cs.princeton.edu:
• Liberty Research Group announcements (very low traffic)
liberty-lse-install@lists.cs.princeton.edu:
• Installation issues and questions
liberty-lse-users@lists.cs.princeton.edu:
• User community
• Share modules/configs with others (exploit reuse)
• Initiate collaborations with others (exploit reuse)
• Watch here for LSE updates and new tools
• Usage support
What to do Next?
Try LSE!
You have the CD and Getting Started with LSE
1. Install LSE
2. Read the documentation (we have lots!!)
• Getting Started with LSE
• The LSE Core Module Library Reference
• The LSE User's Manual
• The LSE Visualizer Manual
• The LSE API Reference Manual
• The LSE Developer's Manual
• The LSE Internals Manual
3. Play with sample configurations and emulators
• LFSR
• Tomasulo DLX
• IA-64
4. Use LSE in your research/development
5. Keep in touch (visit us!)
6. Check the website for updates
Motivation for Future Tools Work
The Future Liberty Tool Collection
[Figure: Application(s), Auto-Explorer, the VELOCITY Compiler, LSS+GLAD, LSE, Optimized Executables, and an Architectural Simulator Instance]
Shameless Research Plug
• Remove human intervention from the “critical path”
• Understand and design for interactions between compiler and architectural optimizations
• Redefine the hardware/software interface
• “How to do,” not just “what to do”
• Far ILP
• Question the artificial boundaries present in compilers
• Liberate us from inlining once and for all!
• Phase order/optimization issues
• Reliability guarantees
Thank you!
• We are very excited about the future for LSE
• Thank you for joining us today
END OF TUTORIAL
The Liberty Research Group
Princeton University
http://www.liberty-research.org