Linköping Studies in Science and Technology
Dissertation No. 983
Analysis and Optimisation of Real-Time Systems
with Stochastic Behaviour
by
Sorin Manolache
Department of Computer and Information Science
Linköpings universitet
SE-581 83 Linköping, Sweden
Linköping 2005
ISBN 91-85457-60-4
ISSN 0345-7524
Printed by UniTryck, Linköping 2005
Abstract
Embedded systems have become indispensable in our lives:
household appliances, cars, airplanes, power plant control systems, medical equipment, telecommunication systems, space
technology, they all contain digital computing systems with
dedicated functionality. Most of them, if not all, are real-time
systems, i.e. their responses to stimuli have timeliness constraints.
The timeliness requirement has to be met despite some unpredictable, stochastic behaviour of the system. In this thesis,
we address two causes of such stochastic behaviour: the application- and platform-dependent stochastic task execution times,
and the platform-dependent occurrence of transient faults on
network links in networks-on-chip.
We present three approaches to the analysis of the deadline
miss ratio of applications with stochastic task execution times.
Each of the three approaches fits best to a different context.
The first approach is an exact one and is efficiently applicable to
monoprocessor systems. The second approach is an approximate
one, which allows for designer-controlled trade-off between analysis accuracy and analysis speed. It is efficiently applicable to
multiprocessor systems. The third approach is less accurate but
sufficiently fast to be placed inside optimisation loops.
Based on the last approach, we propose a heuristic for task mapping and priority assignment for deadline miss ratio minimisation.
We make several contributions in the area of buffer- and time-constrained communication along unreliable on-chip links.
First, we introduce the concept of communication supports,
an intelligent combination of spatially and temporally redundant communication. We provide a method for constructing
a sufficiently varied pool of alternative communication supports
for each message. Second, we propose a heuristic for exploring
the space of communication support candidates such that the
task response times are minimised. The resulting time slack
can be exploited by means of voltage and/or frequency scaling
for communication energy reduction. Third, we introduce an algorithm for the worst-case analysis of the buffer space demand
of applications implemented on networks-on-chip. Last, we
propose an algorithm for communication mapping and packet
timing for buffer space demand minimisation.
All our contributions are supported by sets of experimental
results obtained from both synthetic and real-world applications
of industrial size.
This work has been supported by ARTES (A Network for RealTime Research and Graduate Education in Sweden) and STRINGENT (Strategic Integrated Electronic Systems Research at Linköpings Universitet).
Acknowledgements
This work would not have been possible without the contribution of many people, whom I now take the opportunity to thank.
Foremost, I thank Prof. Petru Eles, my considerate adviser.
His confidence in our work compelled me to persist, while his
personality commands my admiration. Prof. Zebo Peng has been
the ideal research group director during these years. I thank
him for his gentle and efficient leadership style.
The embedded systems laboratory, and in a broader context the whole department, provided an excellent environment for professional and personal development. I thank all those who contributed to it, foremost my colleagues in ESLAB, for their friendship, spirit, and time.
Some of the work in this thesis stems from the period in
which I visited the research department of Ericsson Radio Systems, led by Dr. Peter Olanders. I want to thank him, as well
as Dr. Béatrice Philibert and Erik Stoy for providing me that
opportunity.
New interests and enthusiasm, professional but also cultural,
were awakened in me during my visit to the group led by Prof. Radu Mărculescu at Carnegie Mellon University. I take this opportunity to thank him for inviting me and to salute the people
I met there.
Geographically more or less distant, some persons honoured
me with their friendship for already frighteningly many years.
It means very much to me.
I am grateful to my mother for the mindset and values she
has passed to me. In the same breath, I thank my father for the
smile and confidence he has always offered me.
Last, I thank Andrea for filling our lives with joy.
Sorin Manolache
Linköping, October 2005
Contents

I Preliminaries

1 Introduction
  1.1 Embedded System Design Flow
  1.2 Contribution
  1.3 Thesis Organisation

II Stochastic Schedulability Analysis and Optimisation

2 Motivation and Related Work
  2.1 Motivation
  2.2 Related Work

3 System Modelling
  3.1 Hardware Model
  3.2 Application Model
    3.2.1 Functionality
    3.2.2 Periodic Task Model
    3.2.3 Mapping
    3.2.4 Execution Times
    3.2.5 Real-Time Requirements
    3.2.6 Late Task Policy
    3.2.7 Scheduling Policy
  3.3 Illustrative Example

4 Analysis of Monoprocessor Systems
  4.1 Problem Formulation
    4.1.1 Input
    4.1.2 Output
    4.1.3 Limitations
  4.2 Analysis Algorithm
    4.2.1 The Underlying Stochastic Process
    4.2.2 Memory Efficient Analysis Method
    4.2.3 Multiple Simultaneously Active Instantiations of the Same Task Graph
    4.2.4 Construction and Analysis Algorithm
  4.3 Experimental Results
    4.3.1 Stochastic Process Size as a Function of the Number of Tasks
    4.3.2 Stochastic Process Size as a Function of the Application Period
    4.3.3 Stochastic Process Size as a Function of the Task Dependency Degree
    4.3.4 Stochastic Process Size as a Function of the Average Number of Concurrently Active Instantiations of the Same Task Graph
    4.3.5 Rejection versus Discarding
    4.3.6 Encoding of a GSM Dedicated Signalling Channel
  4.4 Limitations and Extensions

5 Analysis of Multiprocessor Systems
  5.1 Problem Formulation
    5.1.1 Input
    5.1.2 Output
    5.1.3 Limitations
  5.2 Approach Outline
  5.3 Intermediate Model Generation
    5.3.1 Modelling of Task Activation and Execution
    5.3.2 Modelling of Periodic Task Arrivals
    5.3.3 Modelling Deadline Misses
    5.3.4 Modelling of Task Graph Discarding
    5.3.5 Scheduling Policies
  5.4 Generation of the Marking Process
  5.5 Coxian Approximation
  5.6 Approximating Markov Chain Construction
  5.7 Extraction of Results
  5.8 Experimental Results
    5.8.1 Analysis Time as a Function of the Number of Tasks
    5.8.2 Analysis Time as a Function of the Number of Processors
    5.8.3 Memory Reduction as a Consequence of the On-the-Fly Construction of the Markov Chain Underlying the System
    5.8.4 Stochastic Process Size as a Function of the Number of Stages of the Coxian Distributions
    5.8.5 Accuracy of the Analysis as a Function of the Number of Stages of the Coxian Distributions
    5.8.6 Encoding of a GSM Dedicated Signalling Channel
  5.9 Extensions
    5.9.1 Individual Task Periods
    5.9.2 Task Rejection vs. Discarding
    5.9.3 Arbitrary Task Deadlines
  5.10 Conclusions

6 Deadline Miss Ratio Minimisation
  6.1 Problem Formulation
    6.1.1 Input
    6.1.2 Output
    6.1.3 Limitations
  6.2 Approach Outline
  6.3 The Inappropriateness of Fixed Execution Time Models
  6.4 Mapping and Priority Assignment Heuristic
    6.4.1 The Tabu Search Based Heuristic
    6.4.2 Candidate Move Selection
  6.5 Analysis
    6.5.1 Analysis Algorithm
    6.5.2 Approximations
  6.6 Experimental Results
    6.6.1 RNS and ENS: Quality of Results
    6.6.2 RNS and ENS: Exploration Time
    6.6.3 RNS and LO-AET: Quality of Results and Exploration Time
    6.6.4 Real-Life Example: GSM Voice Decoding

III Communication Synthesis for Networks-on-Chip

7 Motivation and Related Work
  7.1 Motivation
  7.2 Related Work
  7.3 Highlights of Our Approach

8 System Modelling
  8.1 Hardware Model
  8.2 Application Model
  8.3 Communication Model
  8.4 Fault Model
  8.5 Message Communication Support

9 Energy and Fault-Aware Time Constrained Communication Synthesis for NoC
  9.1 Problem Formulation
    9.1.1 Input
    9.1.2 Output
    9.1.3 Constraints
  9.2 Approach Outline
  9.3 Communication Support Candidates
  9.4 Response Time Calculation
  9.5 Selection of Communication Supports
  9.6 Experimental Results
    9.6.1 Latency as a Function of the Number of Tasks
    9.6.2 Latency as a Function of the Imposed Message Arrival Probability
    9.6.3 Latency as a Function of the Size of the NoC and Communication Load
    9.6.4 Optimisation Time
    9.6.5 Exploiting the Time Slack for Energy Reduction
    9.6.6 Real-Life Example: An Audio/Video Encoder
  9.7 Conclusions

10 Buffer Space Aware Communication Synthesis for NoC
  10.1 Problem Formulation
    10.1.1 Input
    10.1.2 Constraints
    10.1.3 Output
  10.2 Motivational Example
  10.3 Approach Outline
    10.3.1 Delimitation of the Design Space
    10.3.2 Exploration Strategy
    10.3.3 System Analysis Procedure
  10.4 Experimental Results
    10.4.1 Evaluation of the Solution to the CSBSDM Problem
    10.4.2 Evaluation of the Solution to the CSPBS Problem
    10.4.3 Real-Life Example: An Audio/Video Encoder
  10.5 Conclusions

IV Conclusions

11 Conclusions
  11.1 Applications with Stochastic Execution Times
    11.1.1 An Exact Approach for Deadline Miss Ratio Analysis
    11.1.2 An Approximate Approach for Deadline Miss Ratio Analysis
    11.1.3 Minimisation of Deadline Miss Ratios
  11.2 Transient Faults of Network-on-Chip Links
    11.2.1 Time-Constrained Energy-Efficient Communication Synthesis
    11.2.2 Communication Buffer Minimisation

A Abbreviations

Bibliography
Part I
Preliminaries
Chapter 1
Introduction
This chapter briefly presents the context of this thesis, namely the area of embedded real-time systems. It introduces the two aspects of the stochastic behaviour of real-time systems that we address in this thesis, namely the application- and platform-specific stochastic task execution times and the platform-specific transient faults of hardware. The chapter summarises the contributions and draws the outline of the thesis.
1.1 Embedded System Design Flow
Systems controlled by embedded computers have become indispensable in our lives and can be found in avionics, the automotive industry, home appliances, medicine, the telecommunication industry, mechatronics, the space industry, etc. [Ern98].
Very often, these embedded systems are reactive, i.e. they are in constant interaction with their environment, acting in a prescribed way in response to stimuli received from the environment. In most cases, this response has to arrive at a certain time moment, or within a prescribed time interval from the application of the stimulus. Such systems, in which
the correctness of their operation is defined not only in terms of
functionality but also in terms of timeliness, form the class of
real-time systems [But97, KS97, Kop97, BW94].
Timeliness requirements may be hard, meaning that the violation of any such requirement is not tolerated. In a hard real-time system, if not all deadlines are guaranteed to be met, the
system is said to be unschedulable. Typical hard real-time application domains are plant control, aircraft control, medical,
and automotive applications. Systems classified as soft real-time
may occasionally break a real-time requirement provided that
the service quality exceeds prescribed levels.
The nature of real-time embedded systems is typically heterogeneous along multiple dimensions. For example, an application may exhibit data, control and protocol processing characteristics. It may also consist of blocks exhibiting different categories
of timeliness requirements, such as hard and soft. Another dimension of heterogeneity is given by the environment the system
operates in. For example, the stimuli and responses may be of
both discrete and continuous nature.
The heterogeneity in the nature of the application itself on
one side and, on the other side, constraints such as cost, performance, power dissipation, legacy designs and implementations,
as well as requirements such as reliability, availability, security,
and safety, often lead to implementations consisting of heterogeneous multiprocessor platforms.
Designing such systems implies the deployment of different
techniques with roots in system engineering, software engineering, computer architectures, specification languages, formal
methods, real-time scheduling, simulation, programming languages, compilation, hardware synthesis, etc. Considering the
huge complexity of such a design task, there is an urgent need
for automatic tools for design, estimation, and synthesis in order to support and guide the designer. A rigorous, disciplined,
and systematic approach to real-time embedded system design
is the only way the designer can cope with the complexity of
current and future designs in the context of high time-to-market
pressure. A simplified view of such a design flow is depicted in
Figure 1.1 [Ele02].
The design process starts from an informal specification together with a set of constraints. This initial informal specification is then captured as a more rigorous model formulated in
one or possibly several modelling languages [JMEP00]. Next,
a system architecture is chosen as the hardware platform executing the application. This system architecture typically consists of programmable processors of various kinds (application
specific instruction processors (ASIPs), general purpose processors, digital signal processors (DSPs), protocol processors), and
dedicated hardware processors (application specific integrated
[Figure 1.1: Typical design flow. From an informal specification and constraints, through modelling, architecture selection, mapping, and scheduling at the system level, with estimation by simulation, analysis, and formal verification, to software generation, hardware synthesis, simulation, testing, prototyping, and fabrication at the lower levels.]
circuits (ASICs), field-programmable gate arrays (FPGAs)) interconnected by means of shared buses, point-to-point links or
networks of various types. Once a system architecture is chosen, the functionality, clustered into tasks (pieces of functionality with a certain conceptual unity), is mapped onto (or assigned to) the processors or circuits of the system architecture. The communication is mapped onto the buses, links, or networks. Next,
the tasks or messages that share the same processing element
or bus/link are scheduled. The resulting mapped and scheduled
system model is then estimated by means of analysis, simulation, formal verification, or combinations thereof. During the
system level design space exploration phase, different architecture, mapping, and scheduling alternatives are assessed in order
to meet the design requirements and possibly optimise certain
indicators. Once a design is found that satisfies the functional
and non-functional requirements, the system is synthesised in
the lower design phases.
At each design phase, the designers are confronted with various levels of uncertainty. For example, the execution times of
tasks are unknown before the functionality is mapped to a system architecture. Even after mapping, a degree of uncertainty
persists regarding the task execution times. For example, the
state of the various levels of the memory hierarchy, which affects the data access latency and execution time, depends on the
task scheduling. After a task scheduling has been decided upon,
the task execution times heavily depend on the input data of the
application, which in turn may be unpredictable. The task execution time is thus one of the factors that induce stochastic behaviour in a system.
Transient failures of various hardware components also induce a stochastic behaviour on the system. They may be caused
by environmental factors such as overheating of certain parts of the circuit, electromagnetic interference, or cross-talk between communication lines. Communication time and energy, and even
application correctness may be affected by such phenomena.
We consider transient failures of on-chip communication
links, while we assume that failures of other system components
are tolerated by techniques outside the scope of this thesis.
There exists a need for design tools that take into account
the stochastic behaviour of systems in order to support the designer. This thesis aims at providing such support. We mainly
address two sources that cause the system to have a stochastic
behaviour: task execution times and transient faults of the links
of on-chip networks.
1.2 Contribution
The shaded blocks in Figure 1.1 denote the phases of the design process to which this thesis contributes.
With respect to the analysis of a mapped and scheduled
model, we propose the following approaches:
1. Chapter 4 presents an approach that determines the exact deadline miss probability of tasks and task graphs of
soft real-time applications with stochastic task execution
times. The approach is efficient in the case of applications
implemented on monoprocessor systems.
2. Chapter 5 presents an approach that determines an approximation of the deadline miss probability of systems
under the same assumptions as those presented in Chapter 4. The approach allows a designer-controlled way to
trade analysis speed for analysis accuracy and is efficient
in the case of applications implemented on multiprocessor
systems.
3. Chapter 6 presents an approach to the fast approximation
of the deadline miss probability of soft real-time applications with stochastic task execution times. The analysis is
efficiently applicable inside design optimisation loops.
4. Chapter 10 presents an approach to the analysis of the
buffer space demand of applications implemented on networks-on-chip.
Our contribution to the functionality and communication
mapping problem is the following:
1. Chapter 6 presents an approach to the task and communication mapping under constraints on deadline miss ratios.
2. Chapter 9 presents an approach to the communication
mapping for applications implemented on networks-on-chip with unreliable on-chip communication links under
timeliness, energy, and reliability constraints.
3. Chapter 10 presents an approach to the communication
mapping for applications implemented on networks-on-chip with unreliable on-chip communication links under
timeliness, buffer space, and reliability constraints.
Our contribution to the scheduling problem consists of an approach to priority assignment for tasks with stochastic task execution times under deadline miss ratio constraints. The approach is presented in Chapter 6.
The work presented in Chapter 4 is published in [MEP01,
MEP04b]. Chapter 5 is based on [MEP02, MEP], and the contributions of Chapter 6 are published in [MEP04a]. The work presented in Chapter 9 is published in [MEP05] while Chapter 10
is based on [MEP06].
1.3 Thesis Organisation
The thesis is organised as follows.
Part II of this thesis deals with the stochastic behaviour
caused by the non-deterministic nature of task execution. First,
we present the motivation of our work in this area and we
survey the related work (Chapter 2). Next, Chapter 3 introduces the notation and system model to be used throughout
Part II. Then, Chapter 4 and Chapter 5 present two analytic
performance estimation approaches, an exact one, efficiently
applicable to monoprocessor systems, and an approximate analysis approach, efficiently applicable to multiprocessor systems.
Part II concludes with an approach to the mapping of tasks on a
multiprocessor platform and the priority assignment to tasks in
order to optimise the deadline miss ratios.
Part III deals with the stochastic behaviour generated by the
non-deterministic nature of transient faults occurring on links
of on-chip networks. Chapter 7 introduces the context of the
problem and surveys the related work. Chapter 8 describes the
system model that we use throughout Part III. Chapter 9 introduces an approach to communication energy optimisation under timeliness and communication reliability constraints, while
Chapter 10 presents an approach to buffer space demand minimisation of applications implemented on networks-on-chip.
Part IV concludes the thesis.
Part II
Stochastic Schedulability Analysis and Optimisation
Chapter 2
Motivation and Related Work
This chapter first motivates the work in the area of performance
analysis of systems with stochastic task execution times. Next,
related approaches are surveyed.
2.1 Motivation
Historically, real-time system research emerged from the need
to understand, design, predict, and analyse safety-critical applications such as plant control and aircraft control, to name a
few. Therefore, the community focused on hard real-time systems, where breaking timeliness requirements is not tolerated.
The analysis of such systems gives a yes/no answer to the question of whether the system fulfils the timeliness requirements. Hard real-time analysis relies on building worst-case scenarios. A scenario
typically consists of a combination of task execution times. It is
worst-case with respect to a timeliness requirement if either the requirement is broken in the given scenario, or the fact that the requirement is met in the given scenario implies that the system fulfils the requirement in all other possible scenarios. Hard real-time analysis has no choice but to assume that worst-case scenarios always happen and to provision for these cases.
This approach is the only one applicable to the class of safety-critical embedded systems, even if it very often leads to significant under-utilisation of resources.
For the class of soft real-time systems, however, such an approach misses the opportunity to create much cheaper products
with low or no perceived service quality reduction. For example,
multimedia applications like JPEG and MPEG encoding, sound
encoding, etc. exhibit this property. In these situations, the designer may trade cost for quality. Thus, it is no longer sufficient to build worst-case scenarios, but it is more important to
analyse the likelihood of each scenario. Instead of answering
whether a timeliness requirement is fulfilled or not, soft real-time analysis answers questions such as what the probability is that the requirement is fulfilled, or how often it is broken during
the lifetime of the system. While in hard real-time analysis the
tasks are assumed to execute for the amount of time that leads to
the worst-case scenario, in soft real-time analysis task execution
time probability distributions are preferred in order to be able to
determine execution time combinations and their likelihoods.
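To make the idea concrete, once per-task execution-time distributions are known, the likelihood of each execution-time combination can be computed directly. A minimal sketch, with invented discrete distributions and an invented deadline (none of these values come from the thesis):

```python
# A hypothetical example: probability that an end-to-end deadline is met
# when two tasks with stochastic execution times run back to back.
# The distributions and the deadline are invented illustration values.
from itertools import product

# Discrete execution-time distributions: {execution time: probability}
task_a = {2: 0.7, 4: 0.2, 6: 0.1}
task_b = {3: 0.6, 5: 0.4}

deadline = 9  # end-to-end deadline for executing A then B

# Enumerate all execution-time combinations (assuming independence)
# and accumulate the probabilities of those that meet the deadline.
p_meet = sum(pa * pb
             for (ta, pa), (tb, pb) in product(task_a.items(), task_b.items())
             if ta + tb <= deadline)

print(f"P(deadline met) = {p_meet:.2f}")  # P(deadline met) = 0.96
```

A hard real-time analysis would only ask whether the worst combination (6 + 5 = 11) fits; the stochastic view reveals that the deadline is in fact met 96% of the time.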
The execution time of a task is a function of application-dependent, platform-dependent, and environment-dependent factors. The amount of input data to be processed in each task instantiation, as well as its type (pattern, configuration), are application-dependent factors. The micro-architecture of the processing unit that executes a task is a platform-dependent factor
influencing the task execution time. If the time needed for communication with the environment (database lookups, for example) is to be considered as a part of the task execution time, then
network load is an example of an environmental factor influencing the task execution time.
Input data amount and type may vary, as for example is the
case for differently coded MPEG frames. Platform-dependent
characteristics, like cache memory behaviour, pipeline stalls,
write buffer queues, may also introduce a variation in the task
execution time. Thus, obviously, all of the enumerated factors
influencing the task execution time may vary. Therefore, a
model considering the variability of execution times would be
more realistic than one considering just worst-case execution times. In the most general model, task execution times
with arbitrary probability distribution functions are considered.
These distributions can be extracted from performance models
[van96] by means of analytic methods or simulation and profiling [van03, Gv00, Gau98]. Obviously, the worst-case task
execution time model is a particular case of such a stochastic
one.
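In the simplest case, such a distribution can be obtained empirically: profile the task repeatedly and normalise the observed frequencies. A minimal sketch with synthetic measurement values (invented purely for illustration):

```python
# A hypothetical example of extracting an empirical execution-time
# distribution by profiling; the measured values (milliseconds) are synthetic.
from collections import Counter

samples = [12, 14, 12, 13, 18, 12, 14, 25, 13, 12, 14, 13]

counts = Counter(samples)
n = len(samples)

# Empirical probability mass function: P(execution time == t)
pmf = {t: c / n for t, c in sorted(counts.items())}
print(pmf)

# The worst-case execution time model is the degenerate case that keeps
# only the largest observed value and discards all the rest.
wcet = max(samples)
print("observed WCET:", wcet)
```

The printed mass function shows that the bulk of the probability sits well below the observed worst case, which is exactly the information a worst-case-only model throws away.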
[Figure 2.1: Execution time probability density functions of the same task. (a) Fast and expensive processor. (b) Slower but inexpensive processor, with 15% of the probability mass beyond the deadline.]
Figure 2.1 shows two execution time probability density functions of the same task. The first corresponds to the case in which
the task is mapped on a fast processor (Figure 2.1(a)). In this
case, the worst-case execution time of the task is equal to its
deadline. An approach based on a worst-case execution time
model would implement the task on such an expensive processor
in order to guarantee the imposed deadline for the worst-case
situation. However, situations in which the task execution time
is close to the worst-case execution time occur with small probability. If the nature of the application is such that a certain percentage of deadline misses is affordable, a cheaper system, which
still fulfils the imposed quality of service, can be designed. For
example, on such a system the execution time probability density
function of the same task could look as depicted in Figure 2.1(b).
If it is acceptable for the task to miss 15% of its deadlines, such
a system would be a viable and much cheaper alternative.
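The trade-off can be sketched numerically. The discrete densities below are invented stand-ins for the curves plotted in Figure 2.1; the slower processor's values are simply chosen so that 15% of the probability mass falls beyond the deadline:

```python
# Hypothetical discrete execution-time densities standing in for Figure 2.1;
# the numbers are invented so that the slow processor misses 15% of deadlines.

def miss_ratio(dist, deadline):
    """Probability that the execution time exceeds the deadline."""
    return sum(p for t, p in dist.items() if t > deadline)

deadline = 10

# (a) fast, expensive processor: the WCET equals the deadline, so no misses
fast = {4: 0.5, 7: 0.4, 10: 0.1}

# (b) slower, cheaper processor: part of the density tail crosses the deadline
slow = {6: 0.45, 9: 0.40, 12: 0.10, 14: 0.05}

print(miss_ratio(fast, deadline))  # no misses on the fast processor
print(miss_ratio(slow, deadline))  # about 0.15 on the slower one
```

If the application tolerates a 15% deadline miss ratio, the analysis justifies choosing the cheaper processor, which is precisely the design decision a worst-case-only model cannot support.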
In the case of hard real-time systems, the question posed
to the performance analysis process is whether the system is
schedulable, i.e. whether all deadlines are guaranteed to be met. In the case of soft real-time systems, however, the
analysis provides fitness estimates, such as measures of the degree to which a system is schedulable, rather than binary classifications. One such measure is the expected deadline miss ratio
of each task or task graph, and it is the focus of this part of the thesis.
Performance estimation tools can be classified in simulation
and analysis tools. Simulation tools are flexible, but there is always the danger that unwanted and extremely rare glitches in
behaviour, possibly bringing the system to undesired states, are
never observed. The probability of not observing such an existing behaviour can be decreased at the expense of increasing the
simulation time. Analysis tools are more precise, but they usually rely on a mathematical formalisation, which is sometimes
difficult to come up with or to understand by the designer. A
further drawback of analysis tools is their often prohibitive running time due to the analysis complexity. A tool that trades, in a
designer-controlled way, analysis complexity (in terms of analysis time and memory, for example) with analysis accuracy or the
degree of insight that it provides, could be a viable solution to
the performance estimation problem.
We aim at providing analytical support for the design of systems with stochastic task execution times. Chapters 4 and 5
present two approaches for the analysis of the deadline miss ratio of tasks, while Chapter 6 presents an approach for the minimisation of deadline miss ratios by means of task mapping and
priority assignment.
2.2 Related Work
Before presenting our approach to the analysis and optimisation
of systems with stochastic task execution times, we survey some
of the related work in the area.
An impressive amount of work has been carried out in the
area of schedulability analysis of applications with worst-case
task execution times both for monoprocessor platforms [LL73,
BBB01, LW82, LSD89, ABD+ 91, Bla76, ABRW93, SGL97, SS94,
GKL91] and multiprocessor platforms [SL95, Sun97, Aud91,
ABR+ 93, TC94, PG98] under fairly general assumptions.
Far fewer publications address the analysis of applications with stochastic task execution times. Moreover, most of them consider relatively restricted application classes, limiting their focus to monoprocessor systems or to exponential task execution time probability distribution functions. Some approaches
address specific scheduling policies or assume high-load systems.
Burns et al. [BPSW99] address the problem of a system
breaking its timeliness requirements due to transient faults.
In their case, the execution time variability stems from task
re-executions. The shortest interval between two fault occurrences such that no task exceeds its deadline is determined by
sensitivity analysis. The probability that the system exceeds its
deadline is given by the probability that faults occur at a faster
rate than the tolerated one. Broster et al. [BBRN02] propose
a different approach to the same problem. They determine the
response time of a task given that it re-executes k ∈ N times
due to faults. Then, in order to obtain the probability distribution of the response time, they compute the probability of the
event that k faults occur. The fault occurrence process is assumed to be a Poisson process in both of the cited works. Burns
et al. [BBB03] extend Broster’s approach in order to take into
account statistical dependencies among execution times. While
their approaches are applicable to systems with sporadic tasks,
they are unsuited for the determination of task deadline miss
probabilities of tasks with generalised execution time probability distributions. Also their approaches are confined to sets of
independent tasks implemented on monoprocessor systems.
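Under the Poisson fault assumption of the cited works, the deadline miss probability can be obtained by summing the probabilities of fault counts that push the response time past the deadline. The linear response-time model r0 + k·c_re below is a simplification introduced for illustration only; Broster et al. compute the k-fault response time exactly.

```python
import math

def poisson_pmf(k, rate, interval):
    """Probability that exactly k faults occur in `interval`,
    for a Poisson fault process with the given rate."""
    lam = rate * interval
    return math.exp(-lam) * lam ** k / math.factorial(k)

def miss_probability(r0, c_re, rate, deadline, k_max=60):
    """P(response time > deadline), assuming (simplification) that the
    response time after k re-executions is r0 + k * c_re."""
    return sum(poisson_pmf(k, rate, deadline)
               for k in range(k_max + 1)
               if r0 + k * c_re > deadline)
```

Truncating the sum at `k_max` is safe because the Poisson tail decays faster than geometrically.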
Bernat et al. [BCP02] address a different problem. They determine the frequency with which a single task executes for a
particular amount of time, called execution time profile. This
is performed based on the execution time profiles of the basic
blocks of the task. The strength of this approach is that they
consider statistical dependencies among the execution time profiles of the basic blocks. However, their approach would be difficult to extend to the deadline miss ratio analysis of multi-task
systems because of the complex interleaving that characterises
the task executions in such environments. This would be even
more difficult in the case of multiprocessor systems.
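The basic composition step behind execution time profiles is the convolution of discrete profiles. The sketch below assumes the two blocks' execution times are independent, which is precisely the dependency information that Bernat et al. take care not to discard; the profiles themselves are hypothetical.

```python
def sequence_profile(p1, p2):
    """Execution time profile of two program parts run in sequence,
    as the convolution of their discrete profiles (maps time -> probability).
    Independence of the two profiles is assumed in this sketch."""
    out = {}
    for t1, q1 in p1.items():
        for t2, q2 in p2.items():
            out[t1 + t2] = out.get(t1 + t2, 0.0) + q1 * q2
    return out

block_a = {1: 0.5, 2: 0.5}   # hypothetical basic-block profiles
block_b = {3: 0.2, 4: 0.8}
profile = sequence_profile(block_a, block_b)
# mass at 4, 5, 6: 0.1, 0.5, 0.4 (up to floating-point rounding)
```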
Atlas and Bestavros [AB98] extend the classical rate monotonic scheduling policy [LL73] with an admittance controller
in order to handle tasks with stochastic execution times. They
analyse the quality of service of the resulting schedule and
its dependence on the admittance controller parameters. The
approach is limited to monoprocessor systems, rate monotonic
analysis and assumes the presence of an admission controller at
run-time.
Abeni and Buttazzo’s work [AB99] addresses both scheduling
and performance analysis of tasks with stochastic parameters.
Their focus is on how to schedule both hard and soft real-time
tasks on the same processor, in such a way that the hard ones
are not disturbed by ill-behaved soft tasks. The performance
analysis method is used to assess their proposed scheduling policy (constant bandwidth server), and is restricted to the scope of
their assumptions.
Tia et al. [TDS+ 95] assume a task model composed of independent tasks. Two methods for performance analysis are given.
One of them is just an estimate and is demonstrated to be overly
optimistic. In the second method, a soft task is transformed into
a deterministic task and a sporadic one. The latter is executed
only when the former exceeds the promised execution time. The
sporadic tasks are handled by a server policy. The analysis is
carried out on this particular model.
Gardner et al. [Gar99, GL99], in their stochastic time demand analysis, introduce worst-case scenarios with respect to
task release times in order to compute a lower bound for the
probability that a job meets its deadline. Their approach, however, considers neither data dependencies among tasks nor applications implemented on multiprocessors.
Zhou et al. [ZHS99] and Hu et al. [HZS01] root their work
in Tia’s. However, they do not intend to give per-task guarantees, but characterise the fitness of the entire task set. Because
they consider all possible combinations of execution times of all
requests up to a time moment, the analysis can be applied only
to small task sets due to complexity reasons.
De Veciana et al. [dJG00] address a different type of problem. Having a task graph and an imposed deadline, their goal is
to determine the path that has the highest probability to violate
the deadline. In this case, the problem is reduced to a non-linear
optimisation problem by using an approximation of the convolution of the probability densities.
A different school of thought [Leh96, Leh97] addresses the
problem under special assumptions, such that the system exhibits “heavy traffic”, i.e. the processor loads are close to 1.
The system is modelled as a continuous state Markov model, in
which the state comprises the task laxities, i.e. the time until
their deadlines. Under heavy traffic, such a stochastic process
converges to a Brownian motion with drift and provides a simple solution. The theory was further extended by Harrison and
Nguyen [HN93], Williams [Wil98] and others [PKH01, DLS01,
DW93], by modelling the application as a multi-class queueing
network and analysing it in heavy traffic.
As far as we are aware, there are two limitations that restrict
the applicability of heavy traffic theory to real-time systems.
Firstly, heavy traffic theory assumes Poisson task arrival processes and exponentially distributed execution times. Secondly, as heavy traffic leads to very long (infinite)
queues of ready-to-run tasks, the probability for a job to meet its
deadline is almost 0 unless the deadline is very large. Designing
a system such that it exhibits heavy traffic is thus undesirable.
Other researchers, such as Kleinberg et al. [KRT00] and Goel
and Indyk [GI99], apply approximate solutions to problems exhibiting stochastic behaviour but in the context of load balancing, bin packing and knapsack problems. Moreover, the probability distributions they consider are limited to a few very particular cases.
Díaz et al. [DGK+ 02] derive the expected deadline miss ratio
from the probability distribution function of the response time of
a task. The response time is computed based on the system-level
backlog at the beginning of each hyperperiod, i.e. the residual
execution times of the jobs at those time moments. The stochastic process of the system-level backlog is Markovian and its stationary solution can be computed. Díaz et al. consider only sets
of independent tasks and the task execution times may assume
values only over discrete sets. In their approach, complexity is
mastered by trimming the transition probability matrix of the
underlying Markov chain or by deploying iterative methods, both
at the expense of result accuracy. According to the published results, the method is exercised only on extremely small task sets.
Kalavade and Moghé [KM98] consider task graphs where the
task execution times are arbitrarily distributed over discrete
sets. Their analysis is based on Markovian stochastic processes
too. Each state in the process is characterised by the executed
time and lead-time. The analysis is performed by solving a system of linear equations. Because the execution time is allowed
to take only a finite (most likely small) number of values, such a
set of equations is small.
Kim and Shin [KS96] consider applications that are implemented on multiprocessors and model them as queueing networks. They restrict the task execution times to exponentially distributed ones, which reduces the complexity of the analysis. The tasks are considered to be scheduled according to a particular policy, namely first-come-first-served (FCFS). The underlying mathematical model is then the appealing continuous-time
Markov chain.
In the context of multiprocessor systems, our work significantly extends the one by Kim and Shin [KS96]. Thus, we
consider arbitrary execution time probability density functions
(Kim and Shin consider only exponential ones) and we address
a much larger class of scheduling policies (as opposed to FCFS
considered by them, or fixed priority scheduling considered by
most of the previous work). Moreover, our approach is applicable
in the case of arbitrary processor loads as opposed to the heavy
traffic school of thought.
Our work is mostly related to the ones of Zhou et al. [ZHS99], Hu et al. [HZS01], Kalavade and Moghé [KM98] and Díaz et al. [DGK+ 02]. It differs from the others mostly by considering less restricted application classes. As opposed to Kalavade and Moghé’s work and to Díaz et al.’s work, we consider continuous ETPDFs. In addition to Díaz et al.’s approach, we consider task sets with dependencies among tasks. Also, we accept a much
sets with dependencies among tasks. Also, we accept a much
larger class of scheduling policies than the fixed priority ones
considered by Zhou and Hu. Moreover, our original way of concurrently constructing and analysing the underlying stochastic
process, while keeping only the needed stochastic process states
in memory, allows us to consider larger applications.
Chapter 3
System Modelling
This chapter introduces the notations and application model
used throughout the thesis. The hardware model presented in
this chapter is used throughout Part II.
3.1 Hardware Model
The hardware model consists of a set of processing elements. These can be programmable processors of any kind (general purpose, controllers, DSPs, ASIPs, etc.). Let PE = {PE1 , PE2 , . . . , PEp } denote the set of processing elements. A bus may connect two or more processing elements in the set PE. Let B = {B1 , B2 , . . . , Bl } denote the set of buses. Data sent along a bus by a processing element connected to that bus may be read by all processing elements connected to that bus.

Unless explicitly stated, the two types of hardware resources, processing elements and buses, will not be differently treated in the scope of this part of the thesis, and therefore they will be denoted with the general term of processors. Let M = p + l denote the number of processors and let P = PE ∪ B = {P1 , P2 , . . . , PM } be the set of processors.

Figure 3.1 depicts a hardware platform consisting of three processing elements and two buses. Bus B1 connects all processing elements, while bus B2 is a point-to-point link connecting processing element PE1 and processing element PE2 .
Figure 3.1: Hardware model

Figure 3.2: Application model
3.2 Application Model
3.2.1 Functionality
The functionality of an application is modelled as a set of processing tasks, denoted with t1 , t2 , . . . , tn . A processing task is a
piece of work that has a conceptual unity and is assigned to a
processing element. Examples of processing tasks are performing a discrete cosine transform on a stream of data in a video
decoding application or the encryption of a stream of data in
the baseband processing of a mobile communication application.
Let PT denote the set of processing tasks. Processing tasks are graphically represented as large circles, as shown in Figure 3.2.
Processing tasks may pass messages to each other. The passing of a message is modelled as a communication task, denoted
with χ. Let CT denote the set of communication tasks. They are
graphically depicted as small circles, as shown in Figure 3.2.
Unless explicitly stated, the processing and the communication tasks will not be differently treated in the scope of Part II of
the thesis, and therefore they will be denoted with the general
term of tasks. Let N be the number of tasks and T = PT ∪ CT = {τ1 , τ2 , . . . , τN } denote the set of tasks.
The passing of a message between tasks τi and τj enforces
data dependencies between the two tasks. Data dependencies
are graphically depicted as arrows from the sender task to the
receiver task, as shown in Figure 3.2.
The task that sends the message is the predecessor of the
receiving task, while the receiving task is the successor of the
sender. The set of predecessors of task τ is denoted with ◦ τ ,
while the set of successors of task τ with τ ◦ . A communication
task has exactly one predecessor and one successor and both are
processing tasks. For illustration (Figure 3.2), task t3 has two
predecessors, namely the processing task t1 and the communication task χ1 , and it has two successors, namely tasks t4 and
χ2 .
Tasks with no predecessors are called root tasks, while tasks
with no successors are called leaf tasks. In Figure 3.2 tasks t1 ,
t2 , t6 , and t10 are root tasks, while tasks t4 , t5 , t8 , t9 , and t10 are
leaf tasks.
Let us consider a sequence of tasks (τ1 , τ2 , . . . , τk ), k > 1. If
there exists a data dependency between tasks τi and τi+1 , ∀1 ≤
i < k, then the sequence (τ1 , τ2 , . . . , τk ) forms a computation path
of length k. We say that the computation path leads from task
τ1 to task τk . Task τi is an ancestor task of task τj if there exists
a computation path from task τi to task τj . Complementarily,
we say that task τi is a descendant task of task τj if there exists
a computation path from task τj to task τi . We do not allow
circular dependencies, i.e. no task can be both the ancestor and
the descendant of another task. In Figure 3.2, (t2 , χ1 , t3 , χ2 , t5 ) is
an example of a computation path of length 5, and task χ1 is an
ancestor of tasks t3 , t4 , t5 , and χ2 .
We define the relation γ ⊂ T × T as follows:
• (τ, τ ) ∈ γ, ∀τ ∈ T ,
• (τi , τj ) ∈ γ, ∀τi , τj ∈ T, τi ≠ τj iff
– they have at least one common ancestor, or
– they have at least one common successor, or
– they are in a predecessor-successor relationship.
As γ is a reflexive, symmetric, and transitive relation, it is an
equivalence relation. Hence, it partitions the set of tasks T into
g subsets, denoted with Vi , 1 ≤ i ≤ g (V1 ∪ V2 ∪ · · · ∪ Vg = T ∧ Vi ∩ Vj = ∅, ∀1 ≤ i, j ≤ g, i ≠ j). Thus, an application consists of a set
Γ = {Γ1 , Γ2 , . . . , Γg } of g task graphs, Γi = (Vi , Ei ⊂ Vi × Vi ),
1 ≤ i ≤ g. A directed edge (τa , τb ) ∈ Ei , τa , τb ∈ Vi , represents the
data dependency between tasks τa and τb , denoted τa → τb .
The application example in Figure 3.2 consists of three task
graphs: Γ1 = ({t1 , t2 , t3 , t4 , t5 , χ1 , χ2 }, {(t1 , t3 ), (t2 , χ1 ), (χ1 , t3 ),
(t3 , t4 ), (t3 , χ2 ), (χ2 , t5 )}), Γ2 = ({t6 , t7 , t8 , t9 , χ3 , χ4 }, {(t6 , t7 ),
(t7 , χ4 ), (χ4 , t9 ), (t6 , χ3 ), (χ3 , t8 )}), and Γ3 = ({t10 }, ∅).
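Because γ is the (transitive) closure of the dependency relation, its equivalence classes are exactly the weakly connected components of the dependency graph, which a union-find pass recovers. The sketch below reproduces the three task graphs of Figure 3.2 (communication tasks χ written as `x`):

```python
def task_graphs(tasks, edges):
    """Partition the task set into task graphs: the equivalence classes
    of γ are the weakly connected components of the dependency edges."""
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two components

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())

tasks = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10",
         "x1", "x2", "x3", "x4"]
edges = [("t1", "t3"), ("t2", "x1"), ("x1", "t3"), ("t3", "t4"),
         ("t3", "x2"), ("x2", "t5"),                       # Γ1
         ("t6", "t7"), ("t7", "x4"), ("x4", "t9"),
         ("t6", "x3"), ("x3", "t8")]                       # Γ2
print(len(task_graphs(tasks, edges)))   # 3 task graphs, as in Figure 3.2
```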
3.2.2 Periodic Task Model
Task instantiations (also known as jobs) arrive periodically. The
ith job of task τ is denoted (τ, i), i ∈ N.
Let ΠT = {πi ∈ N : τi ∈ T } denote the set of task periods, or
job inter-arrival times, where πi is the period of task τi . Instantiation u ∈ N of task τi demands execution (the job is released
or the job arrives) at time moment u · πi . The period πi of any
task τi is assumed to be a common multiple of all periods of its
predecessor tasks (πj divides πi , where τj ∈ ◦ τi ). Let kij denote πi /πj , τj ∈ ◦ τi . Instantiation u ∈ N of task τi may start executing only if instantiations u · kij , u · kij + 1, . . . , u · kij + kij − 1 of tasks τj , ∀τj ∈ ◦ τi , have completed their execution.
Let ΠΓ = {πΓ1 , πΓ2 , . . . , πΓg } denote the set of task graph periods, where πΓj denotes the period of the task graph Γj . πΓj is equal to the least common multiple of all πi , where πi is the period of τi and τi ∈ Vj . Task τi ∈ Vj is instantiated Ji = πΓj /πi times during one instantiation of task graph Γj . The kth instantiation of task graph Γj , k ≥ 0, denoted (Γj , k), is composed of the jobs (τi , u), where τi ∈ Vj and u ∈ {k · Ji , k · Ji + 1, . . . , k · Ji + Ji − 1}.
In this case, we say that task instantiation (τi , u) belongs to task
graph instantiation (Γj , k) and we denote it with (τi , u) ∈ (Γj , k).
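The period and job bookkeeping above can be sketched directly. The periods below are those of task graph Γ2 in the example of Section 3.3 (t6, t7, t8, χ3, χ4 with period 4, t9 with period 8); `math.lcm` with multiple arguments requires Python 3.9 or later.

```python
from math import lcm

# Periods of the tasks of Γ2 in the example of Section 3.3 (χ written as x).
periods = {"t6": 4, "t7": 4, "t8": 4, "x3": 4, "x4": 4, "t9": 8}

graph_period = lcm(*periods.values())                     # π_Γ2 = lcm(4, 8) = 8
J = {t: graph_period // p for t, p in periods.items()}    # J_t9 = 1, all others 2

def jobs_of(k):
    """Jobs (τ_i, u) composing instantiation (Γ_j, k) of the task graph."""
    return [(t, u) for t, Ji in J.items()
                   for u in range(k * Ji, k * Ji + Ji)]
```

For instance, instantiation (Γ2, 1) contains jobs (t6, 2), (t6, 3) and (t9, 1).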
The model, where task periods are integer multiples of the
periods of predecessor tasks, is more general than the model assuming equal task periods for tasks in the same task graph. This
is appropriate, for instance, when modelling protocol stacks. For
example, let us consider a part of baseband processing on the
GSM radio interface [MP92]. A data frame is assembled out of 4
radio bursts. One task implements the decoding of radio bursts.
Each time a burst is decoded, the result is sent to the frame
assembling task. Once the frame assembling task gets all the
needed data, that is every 4 invocations of the burst decoding
task, the frame assembling task is invoked. This way of modelling is more modular and natural than a model assuming equal
task periods, which would have crammed the four invocations of
the radio burst decoding task in one task. We think that more
relaxed models than ours, with regard to relations between task
periods, are not necessary, as such applications would be more
costly to implement and are unlikely to appear in common engineering practice.
3.2.3 Mapping
Processing tasks are mapped on processing elements and communication tasks are mapped on buses. All instances of a processing task are executed by the same processing element on
which the processing task is mapped. Analogously, all instances
of a message are conveyed by the bus on which the corresponding
communication task is mapped.
Let MapP : PT → PE be a surjective function that maps processing tasks on the processing elements. MapP(ti ) = Pj indicates that processing task ti is executed on the processing element Pj . Let MapC : CT → B be a surjective function that maps communication tasks on buses. MapC(χi ) = Bj indicates that the communication task χi is performed on the bus Bj . For notation simplicity, Map : T → P is defined, where Map(τi ) = MapP(τi ) if τi ∈ PT and Map(τi ) = MapC(τi ) if τi ∈ CT. Conversely, let Tp = {τ ∈ T : Map(τ ) = p ∈ P } denote the set of tasks that are mapped on processor p. Let Tτ be a shorthand notation for TMap(τ ) .

The mapping is graphically indicated by the shading of the task. In Figure 3.2, tasks t1 , t3 , t4 , and t9 are mapped on processing element PE1 , tasks t2 , t6 , and t7 on processing element PE2 , and tasks t5 , t8 , and t10 on processing element PE3 . Communication task χ1 is mapped on bus B2 and communication tasks χ2 , χ3 , and χ4 on bus B1 . The corresponding system architecture is shown in Figure 3.1.
3.2.4 Execution Times
For a processing task ti , ∀1 ≤ i ≤ n, let Exti denote its execution time on processing element MapP(ti ). Let εti be the probability density of Exti .

First, we discuss the modelling of the communication time between two processing tasks that are mapped on the same processing element. Let ti and tj be any two processing tasks such that task ti is a predecessor of task tj (ti ∈ ◦ tj ) and tasks ti and tj are mapped on the same processing element (MapP(ti ) = MapP(tj )). In this case, the time of the communication between tasks ti and tj is considered to be part of the execution time of task ti . Thus, the execution time probability density εti accounts for this intra-processor communication time.

Next, we discuss the modelling of the communication time between two processing tasks that are mapped on different processing elements. Let ti and tj be two processing tasks, let χ be a communication task, let PEa and PEb be two distinct processing elements and let B be a bus such that all of the following statements are true:

• Processing tasks ti and tj are mapped on processing elements PEa and PEb respectively (MapP(ti ) = PEa and MapP(tj ) = PEb ).

• Communication task χ is mapped on bus B (MapC(χ) = B).

• Bus B connects processing elements PEa and PEb .

• Task χ is a successor of task ti and a predecessor of task tj (χ ∈ ti ◦ ∧ χ ∈ ◦ tj ).

The transmission time of the message that is passed between tasks ti and tj on the bus B is modelled by the execution time Exχ of the communication task χ. Let εχ denote the probability density of Exχ .

Without making any distinction between processing and communication tasks, we let Exi denote the execution (communication) time of an instantiation of task τi ∈ T and we let ET = {ε1 , ε2 , . . . , εN } denote the set of N execution time probability density functions (ETPDFs).
3.2.5 Real-Time Requirements
The real-time requirements are expressed in terms of deadlines.
Let ∆T = {δi ∈ N : τi ∈ T } denote the set of task deadlines.
δi is the deadline of task τi . If job (τi , u) has not completed its
execution at time u · πi + δi , then the job is said to have missed
its deadline.
Let ∆Γ = {δΓj ∈ N : 1 ≤ j ≤ g} denote the set of task graph
deadlines, where δΓj is the deadline of task graph Γj . If there
exists at least one task instantiation (τi , u) ∈ (Γj , k), such that
(τi , u) has not completed its execution at time moment k · πΓj +
δΓj , we say that task graph instantiation (Γj , k) has missed its
deadline.
If Di (t) denotes the number of jobs of task τi that have missed their deadline over a time span t and Ai (t) = ⌊t/πi ⌋ denotes the total number of jobs of task τi over the same time span, then limt→∞ Di (t)/Ai (t) denotes the expected deadline miss ratio of task τi .
Similarly, we define the expected deadline miss ratio of task
graph Γj as the long-term ratio between the number of instantiations of task graph Γj that have missed their deadlines and the
total number of instantiations of task graph Γj .
Let MissedT = {mτ1 , mτ2 , . . . , mτN } be the set of expected deadline miss ratios per task. Similarly, the set MissedΓ = {mΓ1 , mΓ2 , . . . , mΓg } is defined as the set of expected deadline miss ratios per task graph.
The designer may specify upper bounds for tolerated deadline miss ratios, both for tasks and for task graphs. Let ΘT =
{θτ1 , θτ2 , . . . , θτN } be the set of deadline miss thresholds for tasks
and let ΘΓ = {θΓ1 , θΓ2 , . . . , θΓg } be the set of deadline miss
thresholds for task graphs.
Some tasks or task graphs may be designated as being critical by the designer, which means that deadline miss thresholds
are not allowed to be violated. The deadline miss deviation of
task τ , denoted devτ , is defined as

devτ = ∞          if mτ > θτ and τ is critical,
devτ = mτ − θτ    if mτ > θτ and τ is not critical,      (3.1)
devτ = 0          if mτ ≤ θτ .
Analogously, we define the deadline miss deviation of a task
graph.
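Equation 3.1 translates directly into code; a minimal sketch:

```python
def deviation(miss_ratio, threshold, critical):
    """Deadline miss deviation dev_τ of Equation 3.1: infinite for a
    critical task that violates its threshold, the excess over the
    threshold for a non-critical one, and zero otherwise."""
    if miss_ratio <= threshold:
        return 0.0
    return float("inf") if critical else miss_ratio - threshold
```

The infinite value makes any solution that violates a critical threshold incomparably worse than every feasible one, which is convenient inside an optimisation loop.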
3.2.6
Late Task Policy
We say that a task graph instantiation (Γ, k), k ≥ 0, is active in
the system at time t if there exists at least one task instantiation
(τ, u) ∈ (Γ, k) such that job (τ, u) has not completed its execution
at time t. Let Insti (t) denote the number of active instantiations
of task graph Γi , 1 ≤ i ≤ g, at time t.
For each task graph Γi , ∀1 ≤ i ≤ g, we let the designer specify
bi ∈ N+ , the maximum number of simultaneously active instantiations of task graph Γi . Let Bounds = {bi ∈ N+ : 1 ≤ i ≤ g} be
their set.
We consider two different policies for ensuring that no more
than bi instantiations of task graph Γi , ∀1 ≤ i ≤ g, are active in
the system at the same time. We call these policies the discarding and the rejection policy.
We assume that a system applies the same policy (either discarding or rejection) to all task graphs, although our work can
be easily extended in order to accommodate task graph specific
late task policies.
The Discarding Policy
The discarding policy specifies that whenever a new instantiation of task graph Γi , ∀1 ≤ i ≤ g, arrives and bi instantiations
are already active in the system at the time of the arrival of the
new instantiation, the oldest active instantiation of task graph
Γi is discarded. We mean by “oldest” instantiation the instantiation whose arrival time is the minimum among the arrival times
of the active instantiations of the same task graph. “Discarding”
a task graph implies:
• The running jobs belonging to the task graph to be discarded are immediately removed from the processors they run on. These jobs are eliminated from the system, i.e.
their execution is never resumed and all resources that
they occupy (locks, memory, process control blocks, file control blocks, etc.) are freed.
• The ready-to-run and blocked-on-I/O jobs belonging to the
task graph to be discarded are immediately removed from
the ready-to-run and waiting-on-I/O queues of the scheduler. They are also eliminated from the system.
The Rejection Policy
The rejection policy specifies that whenever a new instantiation
of task graph Γi , ∀1 ≤ i ≤ g, arrives and bi instantiations are active in the system at the time of the arrival of the new instantiation, the new instantiation is not accepted in the system. Thus,
all execution requests by jobs belonging to the new instantiation
are ignored by the system.
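Both late task policies reduce to a single admission step executed at each task graph arrival. In this sketch (function and variable names are illustrative, not from the thesis), `active` holds the arrival times of the currently active instantiations of one task graph with bound `bound`:

```python
def on_arrival(active, arrival_time, bound, policy):
    """Keep at most `bound` simultaneously active instantiations of a
    task graph. Returns the new list of active arrival times."""
    if len(active) < bound:
        return active + [arrival_time]
    if policy == "discarding":
        # Discard the oldest active instantiation (minimal arrival time).
        survivors = list(active)
        survivors.remove(min(survivors))
        return survivors + [arrival_time]
    # "rejection": the new instantiation is not accepted into the system.
    return active

# Bound 2, as for Γ3 in the example of Section 3.3:
print(on_arrival([6, 9], 12, 2, "discarding"))   # [9, 12]
print(on_arrival([6, 9], 12, 2, "rejection"))    # [6, 9]
```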
3.2.7 Scheduling Policy
In the common case of more than one task mapped on the same
processor, the designer has to decide on a scheduling policy. Such
a scheduling policy has to be able to unambiguously determine
the running task at any time on that processor. The selection
of the next task to run is made by a run-time scheduler based
on the priority associated to the task. Priorities may be static
(the priority of a task does not change in time) or dynamic (the
priority of a task changes in time).
We limit the set of accepted scheduling policies to those
where the sorting of tasks according to their priority is unique
during those time intervals in which the queue of ready tasks
is unmodified. For practical purposes, this is not a limitation, as all practically used priority-based scheduling policies
[LL73, But97, Fid98, ABD+ 95], both with static priority assignment (rate monotonic, deadline monotonic) and with dynamic assignment (earliest deadline first (EDF)), fulfil this requirement.
The scheduling policy is nevertheless restricted to nonpreemptive scheduling. This limitation is briefly discussed in
Section 4.2.1.
Each processing element or bus may have a different scheduling policy associated to it.
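Under these restrictions, dispatching on one processor reduces to the sketch below: the running task is never interrupted (non-preemptive execution), and otherwise the unique highest-priority ready task is selected. The static priorities used in the usage example are those of Section 3.3 (t7 > t6 > t2 on PE2).

```python
def dispatch(running, ready, priority):
    """Non-preemptive priority scheduling on one processor: a running
    task is never interrupted; otherwise the highest-priority ready
    task is selected (the priority ordering is assumed unique)."""
    if running is not None:
        return running
    return max(ready, key=lambda t: priority[t]) if ready else None

prio = {"t2": 1, "t6": 2, "t7": 3}        # t7 > t6 > t2, as in the example
print(dispatch(None, ["t2", "t6", "t7"], prio))   # t7
print(dispatch("t6", ["t7"], prio))               # t6 (no preemption)
```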
Based on these assumptions, and considering the communication tasks, we are able to model any priority based bus arbitration protocol as, for instance, CAN [Bos91]. 1
3.3 Illustrative Example
This section illustrates the behaviour of the application shown
in Figure 3.2 by means of a Gantt diagram. Tasks are mapped
as indicated in Section 3.2.3.
1 Time division multiple access bus protocols, such as the TTP [TTT99], could
be modelled using dynamic priorities. For example, all communication tasks that
are not allowed to transmit on a bus at a certain moment of time have priority
−∞. However, in this case, the sorting of tasks according to their priorities is no longer unique between two consecutive events that change the set of ready
tasks. The reason is that between two such events, a time slot may arrive when
a certain communication is allowed and, thus, the priorities of communication
tasks are changed. Therefore, time-triggered communication protocols are not
supported by our analysis method.
Tasks t1 , t2 , t3 , t5 , χ1 , and χ2 have a period of 6 and task t4
has a period of 12. Consequently, task graph Γ1 has a period
πΓ1 = 12. Tasks t6 , t7 , t8 , χ3 , and χ4 have a period of 4 and task t9 has a period of 8. Thus, πΓ2 = 8. Task t10 has a period of 3.
For all tasks and task graphs of this example, their deadline is
equal to their period (δi = πi , 1 ≤ i ≤ 14 and δΓi = πΓi , 1 ≤ i ≤ 3).
The late task policy for this example is the discarding policy. The set of bounds on the number of simultaneously active
instantiations of the same task graph is Bounds = {1, 1, 2}.
The deployed scheduling policy is fixed priority for this example. As the task priorities do not change, this policy obviously
satisfies the restriction that the sorting of tasks according to
their priorities must be invariable during the intervals in which
the queue of ready tasks does not change. Task t7 has a higher
priority than task t6 , which in turn has a higher priority than
task t2 . Task t10 has a higher priority than task t8 , which in turn
has a higher priority than task t5 . Task t9 has a higher priority
than any of the tasks t1 , t3 , and t4 . Message χ4 has a higher priority than message χ3 , which in turn has a higher priority than
message χ2 .
A Gantt diagram illustrating a possible task execution over
a span of 20 time units is depicted in Figure 3.3. The Ox axis corresponds to time, while each processor, shown on the Oy axis,
has an associated Ox axis. The job executions are depicted as
rectangles stretching from the point on the Ox axis that corresponds to the start time of the job execution to the point on the
Ox axis that corresponds to its finishing time. The different task
graphs are depicted in different shades in this figure. Vertical
lines of different line patterns are used for better readability.
Job 2 of task t10 arrives at time 6 and is ready to run. However, processing element PE3 is busy executing task t8 . Therefore, job (t10 , 2) starts its execution later, at time 7.5. The execution of the job finishes at time 9.5. At time 9, job 3 of task t10
arrives and is ready to run. Thus, we observe that between times
9 and 9.5 two instantiations of task graph Γ3 are active. This is allowed as b3 = 2.
Instantiation 0 of task graph Γ1 illustrates a discarding situation. Job 1 of task t3 , which arrived at time 6, starts its execution
at time 8.875, when communication task χ1 has finished, and finishes at time 11. At this time, the message between jobs (t3 , 1)
and (t5 , 1) is ready to be sent on bus B1 (an instance of communication task χ2 is “ready-to-run”). Nevertheless, the message
Figure 3.3: Gantt diagram
cannot be sent at time 11 as the bus is occupied by an instance
of communication task χ4 . Message χ4 is still being sent at the
time 12, when a new instantiation of task graph Γ1 arrives. At
time 12, the following jobs belonging to instantiation 0 of task
graph Γ1 have not yet completed their execution: (t4 , 0), (t5 , 1),
and (χ2 , 1). They are in the states “running”, “waiting on I/O”,
and “ready-to-run” respectively. Because at most one instantiation of task graph Γ1 is allowed to be active at any time, instantiation 0 must be discarded at time 12. Hence, job (t4 , 0) is removed from processing element P E1 , job (t5 , 1) is removed from
the waiting-on-I/O queue of the processing element P E2 , and job
(χ2 , 1) is removed from the ready-to-run queue of bus B1 .
The deadline miss ratio of Γ3 over the interval [0, 18) is 1/6,
because there are 6 instantiations of task graph Γ3 in this interval and one of them, instantiation 2, which arrived at time 6,
missed its deadline. When analysing this system, the expected
deadline miss ratio of Γ3 (the ratio of the number of instantiations
that missed their deadline to the total number of instantiations over an infinite time interval) is 0.08. The expected deadline miss ratios of Γ1 and Γ2 are 0.4 and 0.15 respectively.
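The miss ratios above can also be checked empirically. The sketch below is a minimal Monte Carlo estimator, not the analytical method of the following chapters; it assumes a single periodic task with a hypothetical discrete execution-time distribution and the discarding policy (a job not finished by its deadline is discarded and counted as missed).

```python
import random

def simulate_miss_ratio(period, deadline, etpdf, n_jobs, seed=0):
    """Monte Carlo estimate of the deadline miss ratio of one periodic
    task under the discarding policy: a job that has not completed by
    its deadline is discarded and counted as a miss."""
    rng = random.Random(seed)
    values, weights = zip(*sorted(etpdf.items()))
    misses = 0
    for _ in range(n_jobs):
        # With discarding at the deadline (= period here), every job
        # starts on time, so it misses iff its execution time exceeds
        # the deadline.
        c = rng.choices(values, weights)[0]
        if c > deadline:
            misses += 1
    return misses / n_jobs

# Hypothetical ETPDF: execution time 2 with probability 0.75, 4 with 0.25;
# the exact expected miss ratio is therefore 0.25.
ratio = simulate_miss_ratio(period=3, deadline=3,
                            etpdf={2: 0.75, 4: 0.25}, n_jobs=20000)
```

For multiprocessor task graphs with data dependencies, such a simulator grows considerably; this is precisely the motivation for the analytical approaches of Chapters 4 and 5.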
Chapter 4
Analysis of
Monoprocessor
Systems
This chapter presents an exact approach for analytically determining the expected deadline miss ratios of task graphs with
stochastic task execution times in the case of monoprocessor systems.
First, we give the problem formulation (Section 4.1). Second,
we present the analysis procedure based on an example before
we give the precise algorithm (Section 4.2). Third, we evaluate the efficiency of the analysis procedure by means of experiments (Section 4.3). Section 4.4 presents some extensions of the
assumptions. Last, we discuss the limitations of the approach
presented in this chapter and we hint on the possible ways to
overcome them.
4.1 Problem Formulation
The formulation of the problem to be solved in this chapter is the
following:
4.1.1
Input
The input of the analysis problem to be solved in this chapter is
given as follows:
• The set of task graphs Γ,
• The set of task periods ΠT and the set of task graph periods
ΠΓ ,
• The set of task deadlines ∆T and the set of task graph
deadlines ∆Γ ,
• The set of execution time probability density functions ET ,
• The late task policy is the discarding policy,
• The set Bounds = {bi ∈ N\{0} : 1 ≤ i ≤ g}, where bi is the
maximum number of simultaneously active instantiations
of task graph Γi , and
• The scheduling policy.
4.1.2 Output
The result of the analysis is the set M issedT of expected deadline miss ratios for each task and the set M issedΓ of expected
deadline miss ratios for each task graph.
4.1.3 Limitations
We assume the discarding late task policy. A discussion on discarding versus rejection policy is presented in Section 4.3.5.
4.2 Analysis Algorithm
The goal of the analysis is to obtain the expected deadline miss
ratios of the tasks and task graphs. These can be derived from
the behaviour of the system. The behaviour is defined as the
evolution of the system through a state space in time. A state
of the system is given by the values of a set of variables that
characterise the system. Such variables may be the currently
running task, the set of ready tasks, the current time and the
start time of the current task.
Due to the considered periodic task model, the task arrival
times are deterministically known. However, because of the
stochastic task execution times, the completion times and implicitly the running task at an arbitrary time instant or the
state of the system at that instant cannot be deterministically
predicted. The mathematical abstraction best suited to describe
and analyse such a system with random character is the stochastic process.
In this section, we first sketch the stochastic process construction and analysis procedure based on a simplified example.
Then the memory efficient construction of the stochastic process
underlying the application is detailed. Third, the algorithm is
refined in order to handle multiple concurrently active instantiations of the same task graph. Finally, the complete algorithm
is presented.
4.2.1
The Underlying Stochastic Process
Let us define LCM as the least common multiple of the task periods. For simplicity of the exposition, we first assume that at
most one instantiation of each task graph is tolerated in the system at the same time (bi = 1, ∀1 ≤ i ≤ g). In this case, the set
of time moments when all late tasks are discarded includes the
sequence LCM, 2 · LCM, . . . , k · LCM, . . . because at these moments new instantiations of all tasks arrive. The system behaves
at these time moments as if it has just been started. The time
moments k · LCM , k ∈ N are called regeneration points. Regardless of the chosen definition of the state space of the system, the
system states at the regeneration points are equivalent to the initial
state, which is unique and deterministically known. Thus, the
behaviour of the system over the intervals [k·LCM, (k+1)·LCM ),
k ∈ N, is statistically equivalent to the behaviour over the time
interval [0, LCM ). Therefore, in the case when bi = 1, 1 ≤ i ≤ g,
it is sufficient to analyse the system solely over the time interval
[0, LCM ).
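As a small numeric illustration (periods 3 and 5 as in the example used later in this chapter), the hyperperiod length and the first regeneration points follow directly from the least common multiple:

```python
import math

periods = [3, 5]          # hypothetical task periods
lcm = math.lcm(*periods)  # length of the hyperperiod LCM

# With bi = 1 for all task graphs, every multiple of LCM is a
# regeneration point: all late tasks have been discarded and new
# instantiations of all tasks arrive, so the system state equals
# the initial state.
regeneration_points = [k * lcm for k in range(4)]
```

Consequently, the analysis only needs to cover the single interval [0, lcm).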
One could choose the following state space definition: S =
{(τ, W, t) : τ ∈ T, W ∈ set of all multisets of T, t ∈ R}, where τ
represents the currently running task, W stands for the multiset¹ of ready tasks at the start time of the running task, and t
represents the start time of the currently running task. A state
change occurs at the time moments when the scheduler has to
decide on the next task to run. This happens
• when a task completes its execution, or
• when a task arrives and the processor is idle, or
• when the running task graph has to be discarded.
The point we would like to make is that, by choosing this
state space, the information provided by a state si = (τi , Wi , ti ),
¹ If bi = 1, ∀1 ≤ i ≤ g, then W is a set.
Figure 4.1: ETPDFs of tasks τ1 (ε1) and τ2 (ε2)
together with the current time, is sufficient to determine the
next system state sj = (τj , Wj , tj ). The time moment when the
system entered state si , namely ti , is included in si . Because
of the deterministic arrival times of tasks, based on the time
moments tj and on ti , we can derive the multiset of tasks that
arrived in the interval (ti , tj ]. The multiset of ready tasks at
time moment ti , namely Wi , is also known. We also know that
τi is not preempted between ti and tj . Therefore, the multiset of
ready tasks at time moment tj , prior to choosing the new task to
run, is the union of Wi and the tasks arrived during the interval
(ti , tj ]. Based on this multiset and on the time tj , the scheduler
is able to predictably choose the new task to run. Hence, in general, knowing a current state s and the time moment t when a
transition out of state s occurs, the next state s0 is unambiguously determined.
The following example is used throughout this subsection in
order to discuss the construction of the stochastic process. The
system consists of one processor and the following application:
Γ = {({τ1 }, ∅), ({τ2 }, ∅)}, Π = {3, 5}, i.e. a set of two independent tasks with corresponding periods 3 and 5.

Figure 4.2: State encoding — (a) Individual task completion times; (b) Intervals containing task completion times

The tasks are
scheduled according to a non-preemptive EDF scheduling policy
[LL73]. LCM , the least common multiple of the task periods is
15. For simplicity, in this example it is assumed that the relative deadlines equal the corresponding periods (δi = πi ). The
ETPDFs of the two tasks are depicted in Figure 4.1. Note that
ε1 contains execution times larger than the deadline δ1 .
Let us assume a state representation like the one introduced
above: each process state contains the identity of the currently
running task, its start time and the multiset of ready tasks at the
start time of the currently running one. For our application example, the initial state is (τ1 , {τ2 }, 0), i.e. task τ1 is running, it
has started to run at time moment 0 and task τ2 is ready to run,
as shown in Figure 4.2(a). t1 , t2 , . . . , tq in the figure are possible
finishing times for the task τ1 and, implicitly, possible starting
times of the waiting instantiation of task τ2 . The number of next
states equals the number of possible execution times of the running task in the current state. In general, because the ETPDFs
are continuous, the set of state transition moments forms a dense
Figure 4.3: Priority monotonicity intervals
set in R leading to an underlying stochastic process theoretically
of uncountable state space. In practice, the stochastic process
is extremely large, depending on the discretisation resolution of
the ETPDFs. Even in the case when the task execution time
probabilities are distributed over a discrete set, the resulting
underlying process becomes prohibitively large and practically
impossible to solve.
In order to avoid the explosion of the underlying stochastic
process, in our approach, we have grouped time moments into
equivalence classes and, by doing so, we limited the process size
explosion. Thus, practically, a set of equivalent states is represented as a single state in the stochastic process.
As a first step to the analysis, the interval [0, LCM ) is partitioned in disjunct intervals, the so-called priority monotonicity
intervals (PMI). The concept of PMI (called in their paper “state”)
was introduced by Zhou et al. [ZHS99] in a different context,
unrelated to the construction of a stochastic process. A PMI is
delimited by task arrival times and task execution deadlines.
Figure 4.3 depicts the PMIs for the example above. The only
restriction imposed on the scheduling policies accepted by our
approach is that inside a PMI the ordering of tasks according to
their priorities is not allowed to change. This allows the scheduler to predictably choose the next task to run, regardless of the
completion time within a PMI of the previously running task. As
mentioned in Section 3.2.7, all the widely used scheduling policies we are aware of (rate monotonic (RM), EDF, first come first
served (FCFS), LLF, etc.) exhibit this property, as mentioned
before.
Consider a state s characterised by (τi , W, t): τi is the currently running task, it has been started at time t, and W is the
multiset of ready tasks. Let us consider two next states derived
from s: s1 characterised by (τj , W1 , t1 ) and s2 by (τk , W2 , t2 ). Let
t1 and t2 belong to the same PMI. This means that no task instantiation has arrived or been discarded in the time interval between t1 and t2 , and the relative priorities of the tasks inside the
set W have not changed between t1 and t2 . Thus, τj = τk = the
highest priority task in the multiset W , and W1 = W2 = W \{τj }.
It follows that all states derived from state s that have their time
t belonging to the same PMI have an identical currently running task and identical sets of ready tasks. Therefore, instead of
considering individual times we consider time intervals, and we
group together those states that have their associated start time
inside the same PMI. With such a representation, the number of
next states of a state s equals the number of PMIs the possible
execution time of the task that runs in state s is spanning over.
We propose a representation in which a stochastic process
state is a triplet (τ, W, pmi), where τ is the running task, W the
multiset of ready tasks at the start time of task τ , and pmi is the
PMI containing the start time of the running task. In our example, the execution time of task τ1 (which is in the interval [2, 3.5], as shown in Figure 4.1(a)) spans the PMIs pmi1 —[0, 3)—and pmi2 —[3, 5). Thus, there are only two possible states
emerging from the initial state, as shown in Figure 4.2(b).
Figure 4.4 depicts a part of the stochastic process constructed
for our example. The initial state is s1 : (τ1 , {τ2 }, pmi1 ). The first
field indicates that an instantiation of task τ1 is running. The
second field indicates that an instantiation of task τ2 is ready to
execute. The third field shows the current PMI (pmi1 —[0, 3)). If
the instantiation of task τ1 does not complete until time moment
3, then it will be discarded. The state s1 has two possible next
states. The first one is state s2 : (τ2 , ∅, pmi1 ) and corresponds to
the case when τ1 completes before time moment 3. The second one is state s3 : (τ2 , {τ1 }, pmi2 ) and corresponds to the case
when τ1 was discarded at time moment 3. State s2 indicates that
an instantiation of task τ2 is running (it is the instance that was
waiting in state s1 ), that the PMI is pmi1 —[0, 3)—and that no
task is waiting. Consider state s2 to be the new current state.
Then the next states could be state s4 : (−, ∅, pmi1 ) (task τ2
completes before time moment 3 and the processor is idle), state
s5 : (τ1 , ∅, pmi2) (task τ2 completes at a time moment sometime
between 3 and 5), or state s6 : (τ1 , {τ2 }, pmi3 ) (the execution of
task τ2 reaches over time moment 5 and, hence, it is discarded at
time moment 5). The construction procedure continues until all
possible states corresponding to the time interval [0, LCM ), i.e.
[0, 15), have been visited.
Let Pi denote the set of predecessor states of a state si , i.e.
the set of all states that have si as a next state.

Figure 4.4: Stochastic process example

The set of successor states of a state si consists of those states that can directly be reached from state si . Let Zi denote the time when
state si is entered. State si can be reached from any of its predecessor states sj ∈ Pi . Therefore, the probability P(Zi ≤ t)
that state si is entered before time t is a weighted sum over
j of probabilities that the transitions sj → si , sj ∈ Pi , occur
before time t. The weights are equal to the probability P(sj )
that the system is in state sj prior to the transition. Formally,
P(Zi ≤ t) = Σsj ∈Pi P(Zji ≤ t|sj ) · P(sj ), where Zji is the time of
transition sj → si . Let us focus on Zji , the time of transition
sj → si . If the state transition occurs because the processor is
idle and a new task arrives or because the running task graph
has to be discarded, the time of the transition is deterministically known as task arrivals and deadlines have fixed times. If,
however, the cause of the state transition is a task completion,
the time Zji is equal to Zj + Exτ , where task τ is the task that
runs in state sj and whose completion triggers the state transition. Because Zji is a sum involving the random variable Exτ ,
Zji too is a random variable. Its probability density function is computed as the convolution zj ∗ ετ = ∫0∞ zj (t − x) · ετ (x) dx of the probability density functions of the terms.
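A discretised version of this convolution can be sketched as follows; the two uniform densities are hypothetical stand-ins for zj and ετ , and only the sampled shapes matter for the check that the result is again a density:

```python
def convolve(a, b, dt):
    """Discrete approximation of (a * b)(t) = integral of a(t - x) b(x) dx
    for densities sampled on a uniform grid of step dt."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj * dt
    return out

dt = 0.05
# Hypothetical discretised densities:
z_j = [0.5] * int(round(2.0 / dt))      # start-time density, uniform over a 2-unit span
eps = [1 / 1.5] * int(round(1.5 / dt))  # ETPDF, uniform over a 1.5-unit span
z_ji = convolve(z_j, eps, dt)           # density of Z_ji = Z_j + Ex_tau
mass = sum(z_ji) * dt                   # a probability density must integrate to ~1
```

In an implementation, the discretisation step dt trades analysis accuracy against the cost of each convolution.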
Let us illustrate the above, based on the example depicted in
Figure 4.4. z2 , z3 , z4 , z5 , and z6 are the probability density functions of Z2 , Z3 , Z4 , Z5 , and Z6 respectively. They are shown in
Figure 4.4 to the left of their corresponding states s2 , s3 , . . . , s6 .
The transition from state s4 to state s5 occurs at a precisely
known time instant, time 3, at which a new instantiation of task
τ1 arrives. Therefore, z5 will contain a scaled Dirac impulse at
the beginning of the corresponding PMI. The scaling coefficient
equals the probability of being in state s4 (the integral of z4 , i.e.
the shaded surface below the z4 curve). The probability density
function z5 results from the superposition of z2 ∗ ε2 (because task
τ2 runs in state s2 ) with z3 ∗ ε2 (because task τ2 runs in state
s3 too) and with the aforementioned scaled Dirac impulse over
pmi2 , i.e. over the time interval [3, 5).
The probability of a task missing its deadline is easily computed from the transition probabilities of those transitions that
correspond to a deadline miss of a task instantiation (the thick
arrows in Figure 4.4, in our case). The probabilities of the transitions out of a state si are computed exclusively from the information stored in that state si . For example, let us consider
the transition s2 → s6 . The system enters state s2 at a time
whose probability density is given by z2 . The system takes the
transition s2 → s6 when the attempted completion time of τ2
(running in s2 ) exceeds 5. The completion time is the sum of
the starting time of τ2 (whose probability density is given by z2 )
and the execution time of τ2 (whose probability density is given
by ε2 ). Hence, the probability density of the completion time
of τ2 is given by the convolution z2 ∗ ε2 of the above mentioned
densities. Once this density is computed, the probability of the
completion time being larger than 5 is easily computed by integrating the result of the convolution over the interval (5, ∞). If
τ2 in s2 completes its execution at some time t ∈ [3, 5), then the
state transition s2 → s5 occurs (see Figure 4.4). The probability of this transition is computed by integrating z2 ∗ 2 over the
interval [3, 5).
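Numerically, these two integrals look as follows. The density below is a hypothetical uniform stand-in for the convolved completion-time density z2 ∗ ε2 , discretised on a grid of step dt:

```python
dt = 0.05
# Hypothetical stand-in for z2 * eps2: uniform on [2, 6), so it carries
# mass 0.5 in [3, 5) and mass 0.25 past 5.
conv = [0.25 if 2 <= i * dt < 6 else 0.0 for i in range(int(round(8 / dt)))]

def integrate(density, lo, hi, dt):
    """Approximate integral of the discretised density over [lo, hi)."""
    return sum(v for i, v in enumerate(density) if lo <= i * dt < hi) * dt

p_s2_s5 = integrate(conv, 3, 5, dt)             # completion in [3, 5): transition s2 -> s5
p_s2_s6 = integrate(conv, 5, float('inf'), dt)  # completion past 5: transition s2 -> s6 (deadline miss)
```

The second integral is exactly the contribution of state s2 to the expected deadline miss ratio of task τ2 .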
As can be seen, by using the PMI approach, some process
states have more than one incident arc, thus keeping the graph
“narrow”. This is because, as mentioned, one process state in our
representation captures several possible states of a representation considering individual times (see Figure 4.2(a)).
The non-preemption limitation could, in principle, be overcome if we extended the information stored in the state of the
underlying stochastic process. Namely, the residual run time
probability distribution function of a task instantiation, i.e. the
PDF of the time a preempted instantiation still has to run, has
to be stored in the stochastic process state. This would multiply
the memory requirements of the analysis several times. Additionally, preemption would increase the possible behaviours of the
system and, consequently, the number of states of its underlying
stochastic process.
Because the number of states grows rapidly even with our
state reduction approach and each state has to store its probability density function, the memory space required to store the
whole process can become prohibitively large. Our solution to
master memory complexity is to perform the stochastic process
construction and analysis simultaneously. As each arrow updates the time probability density z of the state it leads to, the
process has to be constructed in topological order. The result
of this procedure is that the process is never stored entirely in
memory but rather that a sliding window of states is used for
analysis. For the example in Figure 4.4, the construction starts
with state s1 . After its next states (s2 and s3 ) are created, their
corresponding transition probabilities determined and the pos-
Figure 4.5: State selection order
sible deadline miss probabilities accounted for, state s1 can be
removed from memory. Next, one of the states s2 and s3 is taken
as current state, let us consider state s2 . The procedure is repeated, states s4 , s5 and s6 are created and state s2 removed. At
this moment, one would think that any of the states s3 , s4 , s5 ,
and s6 can be selected for continuation of the analysis. However,
this is not the case, as not all the information needed in order
to handle states s5 and s6 has been computed. More exactly, the arcs
emerging from states s3 and s4 have not yet been created. Thus,
only states s3 and s4 are possible alternatives for the continuation of the analysis in topological order. The next section discusses the criteria for selection of the correct state to continue
with.
4.2.2
Memory Efficient Analysis Method
As shown in the example in Section 4.2.1, only a sliding window
of states is simultaneously kept in memory. All states belonging
to the sliding window are stored in a priority queue. Once a state
is extracted from this queue and its information processed, it is
eliminated from the memory. The key to the process construction in topological order lies in the order in which the states are
extracted from this queue. First, observe that it is impossible
for an arc to lead from a state with a PMI number u to a state
with a PMI number v such that v < u (there are no arcs back
in time). Hence, a first criterion for selecting a state from the
queue is to select the one with the smallest PMI number. A sec-
ond criterion determines which state has to be selected out of
those with the same PMI number. Note that inside a PMI no
new task instantiation can arrive, and that the task ordering according to their priorities is unchanged. Thus, it is impossible
that the next state sk of a current state sj would be one that contains waiting tasks of higher priority than those waiting in sj .
Hence, the second criterion reads: among states with the same
PMI, one should choose the one with the waiting task of highest priority. Figure 4.5 illustrates the algorithm on the example
given in Section 4.2.1 (Figure 4.4). The shades of the states denote their PMI number. The lighter the shade, the smaller the
PMI number. The numbers near the states denote the sequence
in which the states are extracted from the queue and processed.
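The two criteria translate directly into a lexicographic priority key. The sketch below (with hypothetical states and numeric task priorities, smaller number = higher priority) orders extraction with Python's heapq:

```python
import heapq

def state_key(state):
    """Extraction key: smallest PMI number first; ties broken by the
    highest-priority waiting task (smaller number = higher priority)."""
    pmi_no, waiting = state
    highest = min(waiting) if waiting else float('inf')
    return (pmi_no, highest)

# Hypothetical sliding-window states as (PMI number, waiting-task priorities)
states = [(2, [3]), (1, [2]), (1, [1, 4]), (3, [])]
queue = [(state_key(s), s) for s in states]
heapq.heapify(queue)
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

Extracting in this order guarantees that a state leaves the queue only after all transitions into it have been accumulated into its density z.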
4.2.3 Multiple Simultaneously Active Instantiations of the Same Task Graph
The examples considered so far dealt with applications where at
most one active instance of each task graph is allowed at any
moment of time (bi = 1, 1 ≤ i ≤ g).
In order to illustrate the construction of the stochastic process in the case bi > 1, when several instantiations of a task
graph Γi may exist at the same time in the system, let us consider an application consisting of two independent tasks, τ1 and
τ2 , with periods 2 and 4 respectively. LCM = 4 in this case. The
tasks are scheduled according to a rate monotonic (RM) policy
[LL73]. At most one active instantiation of τ1 is tolerated in the
system at a certain time (b1 = 1) and at most two concurrently
active instantiations of τ2 are tolerated in the system (b2 = 2).
Figure 4.6 depicts a part of the stochastic process underlying this example. It is constructed using the procedure sketched
in Sections 4.2.1 and 4.2.2. The state indexes show the order
in which the states were analysed (extracted from the priority
queue mentioned in Section 4.2.2).
Let us consider state s6 = (τ2 , ∅, [2, 4)), i.e. the instantiation
of τ2 that arrives at time moment 0 has been started at a moment inside the PMI [2, 4) and there have not been any ready
tasks at the start time of τ2 . Let us assume that the finishing
time of τ2 lies past the LCM = 4. At time moment 4, a new
instantiation of τ2 arrives and the running instantiation is not
discarded, as b2 = 2. On one hand, if the finishing time of the
running instantiation belongs to the interval [6, 8), the system
Figure 4.6: Part of the stochastic process underlying the example application
performs the transition s6 → s14 (Figure 4.6). If, on the other
hand, the running instantiation attempts to run past the time
moment 8, then at this time moment a third instantiation of τ2
would require service from the system and, therefore, the running task (the oldest instantiation of τ2 ) is eliminated from the
system. The transition s6 → s19 in the stochastic process in Figure 4.6 corresponds to this latter case. We observe that when a
task execution spans beyond the time moment LCM , the resulting state is not unique. The system does not behave as if it has
just been restarted at time moment LCM , and, therefore, the
intervals [k · LCM, (k + 1) · LCM ), k ∈ N, are not statistically
equivalent to the interval [0, LCM ). Hence, it is not sufficient
to analyse the system over the interval [0, LCM ) but rather over
several consecutive intervals of length LCM .
Let an interval of the form [k · LCM, (k + 1) · LCM ) be called
the hyperperiod k and denoted Hk . Hk0 is a lower hyperperiod
than Hk (Hk0 < Hk ) if k 0 < k. Consequently, Hk is a higher
hyperperiod than Hk0 (Hk > Hk0 ) if k > k 0 .
For brevity, we say that a state s belongs to a hyperperiod
k (denoted s ∈ Hk ) if its PMI field is a subinterval of the hyperperiod k. In our example, three hyperperiods are considered,
H0 = [0, 4), H1 = [4, 8), and H2 = [8, 12). In the stochastic process in Figure 4.6, s1 , s2 , . . . , s7 ∈ H0 , s8 , s9 , . . . , s18 ∈ H1 , and
s19 , s20 , s25 ∈ H2 (note that not all states have been depicted in
Figure 4.6).
In general, let us consider a state s and let Ps be the set of its
predecessor states. Let k denote the order of the state s defined
as the lowest hyperperiod of the states in Ps (k = min{j : s0 ∈
Hj , s0 ∈ Ps }). If s ∈ Hk and s is of order k 0 and k 0 < k, then s is a
back state. In our example, s8 , s9 , s14 , and s19 are back states of
order 0, while s20 , s25 and s30 are back states of order 1.
Obviously, there cannot be any transition from a state belonging to a hyperperiod H to a state belonging to a lower hyperperiod than H (s → s0 , s ∈ Hk , s0 ∈ Hk0 ⇒ Hk ≤ Hk0 ). Consequently,
the set S of all states belonging to hyperperiods greater or equal
to Hk can be constructed from the back states of an order smaller
than k. We say that S is generated by the aforementioned back
states. For example, the set of all states s8 , s9 , . . . , s18 ∈ H1 can
be derived from the back states s8 , s9 , and s14 of order 0. The
intuition behind this is that back states inherit all the needed
information across the border between hyperperiods.
Before continuing our discussion, we have to introduce the
notion of similarity between states. We say that two states si
and sj are similar (si ∼ sj ) if all the following conditions are
satisfied:
1. The task that is running in si and the one in sj are the
same,
2. The multiset of ready tasks in si and the one in sj are the
same,
3. The PMIs in the two states differ only by a multiple of
LCM , and
4. zi = zj (zi is the probability density function of the times
when the system takes a transition to si ).
Let us consider the construction and analysis of the stochastic process, as described in Sections 4.2.1 and 4.2.2. Let us consider the moment x, when the last state belonging to a certain
hyperperiod Hk has been eliminated from the sliding window.
Rk is the set of back states stored in the sliding window at the
moment x. Let the analysis proceed with the states of the hyperperiod Hk+1 and let us consider the moment y when the last state
belonging to Hk+1 has been eliminated from the sliding window.
Let Rk+1 be the set of back states stored in the sliding window
at moment y.
If the sets Rk and Rk+1 contain pairwise similar states, then
it is guaranteed that Rk and Rk+1 generate identical stochastic
processes during the rest of the analysis procedure (as stated,
at a certain moment the set of back states unambiguously determines the rest of the stochastic process). In our example,
R0 = {s8 , s9 , s14 , s19 } and R1 = {s19 , s20 , s25 , s30 }. If s8 ∼ s19 ,
s9 ∼ s20 , s14 ∼ s25 , and s19 ∼ s30 then the analysis process may
stop as it reached convergence.
Consequently, the analysis proceeds by considering states of
consecutive hyperperiods until the information captured by the
back states in the sliding window does not change any more.
Whenever the underlying stochastic process has a steady state,
this steady state is guaranteed to be found.
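Schematically, the stopping criterion is a fixed-point iteration over hyperperiods. In the sketch below, construct_and_analyse is a toy contraction standing in for one hyperperiod of real construction and analysis, and similarity reduces to a pairwise numerical comparison of the back-state densities (conditions 1-3 of the similarity definition are assumed to hold):

```python
def similar(r1, r2, tol=1e-9):
    """Back-state sets are 'similar' when their densities match pairwise."""
    return len(r1) == len(r2) and all(abs(a - b) <= tol for a, b in zip(r1, r2))

def construct_and_analyse(r):
    # Toy stand-in for one hyperperiod of construction and analysis:
    # a contraction whose fixed point plays the role of the steady state.
    return [0.5 * x + 0.1 for x in r]

r_new = construct_and_analyse([1.0, 0.4])
iterations = 1
while True:
    r_old, r_new = r_new, construct_and_analyse(r_new)
    iterations += 1
    if similar(r_old, r_new):
        break
# r_new now carries the steady-state back-state information (close to 0.2 here)
```

In the real analysis the compared objects are discretised densities rather than scalars, but the loop structure is the same as in Figure 4.7.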
4.2.4
Construction and Analysis Algorithm
The analysis is performed in two phases:
1. Divide the interval [0, LCM ) in PMIs,
2. Construct the stochastic process in topological order and
analyse it.
Let A denote the set of task arrivals in the interval [0, LCM ],
i.e. A = {x|0 ≤ x ≤ LCM, ∃1 ≤ i ≤ N, ∃k ∈ N : x = kπi }.
Let D denote the set of deadlines in the interval [0, LCM ], i.e.
D = {x|0 ≤ x ≤ LCM, ∃1 ≤ i ≤ N, ∃k ∈ N : x = kπi + δi }. The set
of PMIs of [0, LCM ) is {[a, b) | a, b ∈ A ∪ D ∧ ∄x ∈ (A ∪ D) ∩ (a, b)}.
If PMIs of a higher hyperperiod Hk , k > 0, are needed during the
analysis, they are of the form [a + k · LCM, b + k · LCM ), where
[a, b) is a PMI of [0, LCM ).
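This partitioning is easy to sketch. For the two-task example of Section 4.2.1 (periods 3 and 5, deadlines equal to the periods, integer time values assumed) it yields exactly the seven PMIs of Figure 4.3:

```python
import math

def pmis(periods, deadlines):
    """Split [0, LCM) at every task arrival (k * pi) and every task
    deadline (k * pi + di) falling in [0, LCM]; consecutive cut points
    delimit the priority monotonicity intervals."""
    lcm = math.lcm(*periods)
    cuts = set()
    for pi, di in zip(periods, deadlines):
        cuts.update(range(0, lcm + 1, pi))                      # arrivals
        cuts.update(k * pi + di for k in range(lcm // pi + 1)
                    if k * pi + di <= lcm)                      # deadlines
    points = sorted(cuts)
    return list(zip(points, points[1:]))

intervals = pmis([3, 5], [3, 5])
```

For non-integer periods or deadlines the same construction applies with a sorted set of rational cut points.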
The algorithm proceeds as discussed in Sections 4.2.1, 4.2.2
and 4.2.3. An essential point is the construction of the process
in topological order, which requires only parts of the states to be
stored in memory at any moment. The algorithm for the stochastic process construction and analysis is depicted in Figure 4.7.
A global priority queue stores the states in the sliding window. The state priorities are assigned as shown in Section 4.2.2.
The initial state of the stochastic process is put in the queue. The
explanation of the algorithm is focused on the construct and analyse procedure (lines 9–27). Each invocation of this procedure constructs and analyses the part of the underlying stochastic process that corresponds to one hyperperiod Hk . It starts
with hyperperiod H0 (k = 0). The procedure extracts one state
at a time from the queue. Let sj = (τi , Wi , pmii ) be such a state.
The probability density of the time when a transition occurs to
sj is given by the function zj . The priority scheme of the priority queue ensures that sj is extracted from the queue only after
all the possible transitions to sj have been considered, and thus
zj contains accurate information. In order to obtain the probability density of the time when task τi completes its execution,
the probability density of its starting time (zj ) and the ETPDF
of τi (i ) have to be convoluted. Let ξ be the probability density
resulting from the convolution.
Figure 4.8 presents an algorithmic description of the procedure next states. Based on ξ, the PDF of the finishing time of task τi if τi were never discarded, we compute the maximum execution time of task τi , max exec time. max time is the minimum
between max exec time and the time at which task τi would be
discarded. P M I will then denote the set of all PMIs included
in the interval between the start of the PMI in which task τi
started to run and max time. Task τi could, in principle, complete its execution during any of these PMIs. We consider each
PMI as being the one in which task τi finishes its execution. A
new underlying stochastic process state corresponds to each of
(1)  divide [0, LCM) in PMIs;
(2)  put first state in the priority queue pqueue;
(3)  Rold = ∅;                          // Rold is the set of densities z of
                                        // the back states after iteration k
(4)  (Rnew, Missed) = construct and analyse();
                                        // Missed is the set of expected
                                        // deadline miss ratios
(5)  do
(6)      Rold = Rnew;
(7)      (Rnew, Missed) = construct and analyse();
(8)  while Rnew ≠ Rold;

construct and analyse:
(9)  while ∃s ∈ pqueue such that s.pmi ≤ pmi no do
(10)     sj = extract state from pqueue;
(11)     τi = sj.running;               // first field of the state
(12)     ξ = convolute(εi, zj);
(13)     nextstatelist = next states(sj);   // consider task dependencies!
(14)     for each su ∈ nextstatelist do
(15)         compute the probability of the transition
             from sj to su using ξ;
(16)         update deadline miss probabilities Missed;
(17)         update zu;
(18)         if su ∉ pqueue then
(19)             put su in the pqueue;
(20)         end if;
(21)         if su is a back state and su ∉ Rnew then
(22)             Rnew = Rnew ∪ {su};
(23)         end if;
(24)     end for;
(25)     delete state sj;
(26) end while;
(27) return (Rnew, Missed);

Figure 4.7: Construction and analysis algorithm
next states(sj = (τi, Wi, ti)):
(1)  nextstatelist = ∅;
(2)  max exec time = sup{t : ξ(t) > 0};     // the largest finishing time of τi
(3)  max time = min{max exec time,          // the minimum between finishing
         discarding timei};                 // time and discarding time of τi
(4)  PMI = {[lop, hip) ∈ PMIs :             // the set of PMIs included in the
         lop ≥ ti ∧ hip ≤ max time};        // interval [ti, max time]
(5)  for each [lop, hip) ∈ PMI do
(6)      Arriv = {τ ∈ T : τ arrived in the interval [ti, hip)};
(7)      Discarded = {τ ∈ Wi : τ was discarded in the interval [ti, hip)};
(8)      Enabled = {τ ∈ T : τ becomes ready to execute
             as an effect of τi's completion};
(9)      W = (Wi \ Discarded) ∪ Enabled ∪ {τ ∈ Arriv : ◦τ = ∅};
                                            // add the newly arrived tasks with no
                                            // predecessors, as they are ready to
                                            // execute, and the newly enabled ones
(10)     select the new running task τu from W
             based on the scheduling policy;
(11)     Wu = W \ {τu};
(12)     add (τu, Wu, [lop, hip)) to nextstatelist;
(13) end for;
(14) return nextstatelist;

Figure 4.8: next states procedure
these possible finishing PMIs. For each PMI, we determine the
multiset Arriv of newly arrived tasks while task τi is executing.
Also, we determine the multiset Discarded of those tasks that are ready to execute when task τi starts, but are discarded in the meantime, as the execution of task τi spans beyond their deadlines. Once task τi completes its execution, some of its successor tasks may become ready to execute. The successor tasks that become ready to execute as a result of task τi's completion form the set Enabled. The new multiset of ready tasks, W, is the union of Wi \ Discarded (the old multiset of ready tasks without those discarded during the execution of task τi), the set Enabled, and those newly arrived tasks that have no predecessors and are therefore immediately ready to run. Once the new
set of ready tasks is determined, the new running task τu is selected from multiset W based on the scheduling policy of the application. A new stochastic process state (τu , W \{τu }, [lop , hip ))
is constructed and added to the list of next states.
The probability densities zu of the times a transition to
su ∈ nextstatelist is taken are updated based on ξ. The state
su is then added to the priority queue and sj removed from
memory. This procedure is repeated until there is no task instantiation that starts its execution in hyperperiod Hk (until
no more states in the queue have their PMI field in the range
k · pmi no, . . . , (k + 1) · pmi no, where pmi no is the number of
PMIs between 0 and LCM ). Once such a situation is reached,
partial results, corresponding to the hyperperiod Hk are available and the construct and analyse procedure returns. The
construct and analyse procedure is repeated until the set of
back states R does not change any more.
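The control structure described above (a sliding window of states kept in a priority queue, one construct and analyse call per hyperperiod, repeated until the back states stabilise) can be sketched with Python's heapq; the state encoding and the expand callback are illustrative placeholders for the real state tuples and the next states procedure:

```python
import heapq

def construct_and_analyse(pqueue, k, pmi_no, expand):
    """Process every state whose PMI index falls in hyperperiod H_k,
    i.e. in [k * pmi_no, (k + 1) * pmi_no). `expand` plays the role of
    the next states procedure: it maps a state to its successors."""
    while pqueue and pqueue[0][0] < (k + 1) * pmi_no:
        pmi, state = heapq.heappop(pqueue)      # lowest PMI first
        for succ_pmi, succ in expand(state):
            heapq.heappush(pqueue, (succ_pmi, succ))
        # the extracted state is dropped here; only the sliding
        # window of queued states stays in memory

# Toy expansion: each state spawns one successor in the next PMI, up to PMI 5
expand = lambda s: [(s + 1, s + 1)] if s < 5 else []
pqueue = [(0, 0)]
construct_and_analyse(pqueue, 0, 3, expand)
# PMIs 0..2 of H_0 were processed; the state with PMI 3 remains queued for H_1
```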
4.3 Experimental Results
The most computation-intensive part of the analysis is the computation of the convolutions zi ∗ εj. In our implementation we
used the FFTW library [FJ98] for performing convolutions based
on the Fast Fourier Transform. The number of convolutions to
be performed equals the number of states of the stochastic process. The memory required for analysis is determined by the
maximum number of states in the sliding window. The main factors on which the size of the stochastic process depends are LCM
(the least common multiple of the task periods), the number of
PMIs, the number of tasks N , the task dependencies, and the
maximum allowed number of concurrently active instantiations
of the same task graph.
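The convolution theorem behind the FFTW-based implementation can be illustrated with a self-contained textbook radix-2 FFT (this is not FFTW and not the thesis code; it only shows why convolution reduces to pointwise multiplication in the frequency domain):

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_convolve(x, y):
    """Linear convolution of x and y via the FFT."""
    n = 1
    while n < len(x) + len(y) - 1:
        n *= 2                                   # pad to a power of two
    X = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    Y = fft([complex(v) for v in y] + [0j] * (n - len(y)))
    Z = fft([X[i] * Y[i] for i in range(n)], invert=True)
    return [z.real / n for z in Z[:len(x) + len(y) - 1]]

result = fft_convolve([1, 2], [3, 4])            # ≈ [3, 10, 8]
```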
As the selection of the next running task is uniquely determined by the pending tasks and the current time moment, the particular scheduling policy has only a small impact on the process size. Hence, we use
the non-preemptive EDF scheduling policy in the experiments
below. On the other hand, the task dependencies play a significant role, as they strongly influence the set of ready tasks and,
by this, the process size.
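Under non-preemptive EDF, selecting the next running task from the multiset of ready tasks reduces to picking the earliest absolute deadline; a minimal sketch (the tuple layout is an assumption made for illustration):

```python
def pick_next_edf(ready):
    """ready: list of (task, absolute_deadline) pairs. Non-preemptive
    EDF picks the ready task with the earliest deadline; once started,
    the task runs until completion (or until it is discarded)."""
    return min(ready, key=lambda t: t[1]) if ready else None

ready = [("tau1", 12.0), ("tau2", 8.0), ("tau3", 20.0)]
assert pick_next_edf(ready) == ("tau2", 8.0)
```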
The ETPDFs are randomly generated. An interval [Emin,
Emax] is divided into smaller intervals. For each of the smaller
intervals, the ETPDF has a constant value, different from the
value over other intervals. The curve shape has of course an
influence on the final result of the analysis, but it has little or
no influence on the analysis time and memory consumed by the
analysis itself. The interval length Emax − Emin influences the
analysis time and memory, but only marginally.
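Such a piecewise-constant ETPDF can be generated as follows (an illustrative sketch; the number of sub-intervals and the random levels are assumptions, since the text does not specify them):

```python
import random

def random_etpdf(e_min, e_max, n_intervals, rng=random.Random(42)):
    """Return (boundaries, values): a density on [e_min, e_max) that is
    constant on each of n_intervals equal sub-intervals and normalised
    so that it integrates to 1."""
    width = (e_max - e_min) / n_intervals
    raw = [rng.random() for _ in range(n_intervals)]      # unnormalised levels
    area = sum(raw) * width
    values = [v / area for v in raw]                      # normalise
    bounds = [e_min + i * width for i in range(n_intervals + 1)]
    return bounds, values

bounds, values = random_etpdf(2.0, 10.0, 8)
width = (10.0 - 2.0) / 8
assert abs(sum(values) * width - 1.0) < 1e-9              # integrates to 1
```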
The periods are randomly picked from a pool of periods with
the restriction that the period of task τ has to be an integer multiple of the periods of the predecessors of task τ . The pool comprises periods in the range 2, 3, . . . , 24. Large prime numbers
have a lower probability to be picked, but it occurs nevertheless.
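One way to honour the divisibility restriction when drawing periods (a hypothetical sketch; the thesis does not give its exact sampling procedure, and `pick_period` is an invented name):

```python
import math
import random

POOL = list(range(2, 25))          # pool of candidate periods 2, 3, ..., 24

def pick_period(predecessor_periods, rng=random.Random(1)):
    """Pick a period that is an integer multiple of the periods of all
    predecessors (math.lcm requires Python 3.9+)."""
    if not predecessor_periods:
        return rng.choice(POOL)
    base = math.lcm(*predecessor_periods)
    candidates = [p for p in POOL if p % base == 0]
    return rng.choice(candidates) if candidates else base

p1 = pick_period([])               # a task with no predecessors
p2 = pick_period([p1])             # a successor: an integer multiple of p1
assert p2 % p1 == 0
```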
In the following, we report on six sets of experiments. The
first four investigate the impact of the enumerated factors
(LCM , the number N of tasks, the task dependencies, the maximum allowed number of concurrently active instantiations of
the same task graph) on the analysis complexity. The fifth
set of experiments considers the rejection late task policy and
investigates its impact on the analysis complexity. The sixth
experiment is based on a real-life example from the area of
telecommunication systems.
The aspects of interest were the stochastic process size, as it
determines the analysis execution time, and the maximum size
of the sliding window, as it determines the memory space required for the analysis. Both the stochastic process size and the
maximum size of the sliding window are expressed in number of
states. All experiments were performed on an UltraSPARC 10
at 450 MHz.
Figure 4.9: Stochastic process size [number of states] vs. number of tasks (one curve per average task period, a = 15.0, 10.9, 8.8, 4.8)
4.3.1 Stochastic Process Size as a Function of the Number of Tasks
In the first set of experiments we analysed the impact of the
number of tasks on the process size. We considered task sets of
10 to 19 independent tasks. LCM , the least common multiple
of the task periods, was 360 for all task sets. We repeated the
experiment four times for average values of the task periods a =
15.0, 10.9, 8.8, and 4.8 (keeping LCM = 360). The results are
shown in Figure 4.9. Figure 4.10 depicts the maximum size of
the sliding window for the same task sets. As it can be seen from
the diagram, the increase, both of the process size and of the
sliding window, is linear. The steepness of the curves depends
on the task periods (which influence the number of PMIs). It
is important to notice the big difference between the process size and the maximum number of states in the sliding window. In the case of 19 tasks, for example, the process size is between 64356 and 198356, while the dimension of the sliding window varies between 373 and 11883 (16 to 172 times smaller). The reduction factor of
the sliding window compared to the process size was between 15
and 1914, considering all our experiments.
Figure 4.10: Size of the sliding window of states [number of states] vs. number of tasks (one curve per average task period, a = 15.0, 10.9, 8.8, 4.8)
4.3.2 Stochastic Process Size as a Function of
the Application Period
In the second set of experiments we analysed the impact of the
application period LCM (the least common multiple of the task
periods) on the process size. We considered 784 sets, each of 20 independent tasks. The task periods were chosen such that LCM
takes values in the interval [1, 5040]. Figure 4.11 shows the variation of the average process size with the application period.
4.3.3 Stochastic Process Size as a Function of
the Task Dependency Degree
With the third set of experiments we analysed the impact of task
dependencies on the process size. A task set of 200 tasks with
strong dependencies (28000 arcs) among the tasks was initially
created. The application period LCM was 360. Then 9 new task
graphs were successively derived from the first one by uniformly
removing dependencies between the tasks until we finally got a
set of 200 independent tasks. The results are depicted in Figure 4.12 with a logarithmic scale for the y axis. The x axis represents the degree of dependencies among the tasks (0 for independent tasks, 9 for the initial task set with the highest amount of dependencies).

Figure 4.11: Stochastic process size [number of states] vs. application period LCM

Figure 4.12: Stochastic process size [number of states] vs. task dependency degree (0 - independent tasks, 9 - highest dependency degree; logarithmic y axis)

Figure 4.13: Stochastic process size [number of states] vs. average of the maximum number of concurrently active instantiations of the same task graph (one curve per task set size, 12 to 27 tasks; logarithmic y axis)
As mentioned, the execution time for the analysis algorithm
directly depends on the process size. Therefore, we showed all
the results in terms of this parameter. For the set of 200 independent tasks used in this experiment (process size 1126517) the
analysis time was 745 seconds. In the case of the same 200 tasks
with strong dependencies (process size 2178) the analysis took
1.4 seconds.
4.3.4 Stochastic Process Size as a Function
of the Average Number of Concurrently
Active Instantiations of the Same Task
Graph
In the fourth set of experiments, the impact of the average
number of concurrently active instantiations of the same task
graph on the stochastic process size was analysed. 18 sets of
task graphs containing between 12 and 27 tasks grouped in 2
to 9 task graphs were randomly generated. Each task set was
analysed between 9 and 16 times considering different upper
bounds for the maximum allowed number of concurrently active
task graph instantiations. These upper bounds ranged from 1
to 3. The results were averaged for the same number of tasks.
The dependency of the underlying stochastic process size as a
function of the average of the maximum allowed number of instantiations of the same task graph that are concurrently active
is plotted in Figure 4.13. Note that the y-axis is logarithmic.
Different curves correspond to different sizes of the considered
task sets. It can be observed that the stochastic process size is
approximately linear in the average of the maximum allowed
number of concurrently active instantiations of the same task
graph.
4.3.5 Rejection versus Discarding
As formulated in Section 3.2.6, the discarding policy specifies
that the oldest instantiation of task graph Γi is eliminated from
the system when there are bi concurrently active instantiations
of Γi in the system, and a new instantiation of Γi demands service. Sometimes, such a strategy is not desired, as the oldest
instantiation might have been very close to finishing, and by
discarding it, the invested resources (time, memory, bandwidth,
etc.) are wasted.
Therefore, our problem formulation has been extended to
support a late task policy in which, instead of discarding the oldest instantiation of Γi , the newly arrived instantiation is denied
service (rejected) by the system.
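The difference between the two late task policies can be sketched as follows (illustrative; instantiations are kept in arrival order and identified by their arrival times):

```python
def admit(active, new_arrival, b, policy):
    """active: arrival times of the active instantiations of a task
    graph, oldest first; b: bound on concurrently active instantiations.
    Returns the updated list after the arrival of new_arrival."""
    if len(active) < b:
        return active + [new_arrival]
    if policy == "discarding":
        return active[1:] + [new_arrival]   # eliminate the oldest
    if policy == "rejection":
        return active                       # deny service to the new one
    raise ValueError(policy)

# b = 2, instantiations arrived at times 0 and 4, a third arrives at 8:
assert admit([0, 4], 8, 2, "discarding") == [4, 8]
assert admit([0, 4], 8, 2, "rejection") == [0, 4]
```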
In principle, the rejection policy is easily supported by only
changing the next states procedure in the algorithm presented in Section 4.2.4. However, this has a strong impact on
the analysis complexity as shown in Table 4.1. The significant
increase in the stochastic process size (up to two orders of magnitude) can be explained considering the following example. Let
s be the stochastic process state under analysis, let τj belonging
to task graph Γi be the task running in s and let us consider
that there are bi concurrently active instantiations of Γi in the
system. The execution time of τj may be very large, spanning
over many PMIs. In the case of discarding, it was guaranteed
that τj will stop running after at most bi · πΓi time units, because
at that time moment it would be eliminated from the system.
Therefore, when considering the discarding policy, the number
of next states of a state s is upper bounded. When considering
the rejection policy, this is not the case any more.
Moreover, let us assume that bi instantiations of the task
graph Γi are active in the system at a certain time. In the case
of discarding, capturing this information in the system state is
sufficient to unambiguously identify those bi instantiations: they
are the last bi that arrived, because always the oldest one is discarded. For example, the two ready instantiations of τ2 in the
state s13 = (τ1 , {τ2 , τ2 }, [6, 8)) in Figure 4.6 are the ones that arrived at the time moments 0 and 4. However, when the rejection
policy is deployed, just specifying that bi instantiations are in the
system is not sufficient for identifying them. We will illustrate
this by means of the following example. Let bi = 2, and let the
current time be kπΓi . In a first scenario, the oldest instantiation
of Γi , which is still active, arrived at time moment (k − 5)πΓi and
it still runs. Therefore, the second oldest instantiation of Γi is
the one that arrived at time moment (k − 4)πΓi and all the subsequent instantiations were rejected. In a second scenario, the
instantiation that arrived at time moment (k − 5)πΓi completes
its execution shortly before time moment (k − 1)πΓi . In this case,
the instantiations arriving at (k − 3)πΓi and (k − 2)πΓi were rejected but the one arriving at (k − 1)πΓi was not. In both scenarios, the instantiation arriving at kπΓi is rejected, as there are two
concurrently active instantiations of Γi in the system, but these
two instantiations cannot be determined without extending the
definition of the stochastic process state space. Extending this
space with the task graph arrival times is partly responsible for
the increase in number of states of the underlying stochastic process.
The fifth set of experiments reports on the analysis complexity when the rejection policy is deployed. 101 task sets of 12 to
27 tasks grouped in 2 to 9 task graphs were randomly generated.
For each task set two analyses were performed, one considering the discarding policy and the other considering the rejection
policy. The results were averaged for task sets with the same
cardinality and shown in Table 4.1.
Tasks   Average stochastic process size        Relative
        [number of states]                     increase
        Discarding      Rejection
12      2223.52         95780.23               42.07
15      7541.00         924548.19              121.60
18      4864.60         364146.60              73.85
21      18425.43        1855073.00             99.68
24      14876.16        1207253.83             80.15
27      55609.54        5340827.45             95.04

Table 4.1: Discarding compared to rejection
4.3.6 Encoding of a GSM Dedicated Signalling Channel
Finally, we present an example from industry, in particular the
mobile communication area.
Mobile terminals access telecommunication networks via a
radio interface. An interface is composed of several channels. As
the electromagnetic signals on the radio interface are vulnerable
to distortion due to interference, fading, reflection, etc., sophisticated schemes for error detection and correction are deployed in
order to increase the reliability of the channels of the radio interface.
On the network side, the device responsible for radio transmission and reception, and also for all signal processing specific to the radio interface, is the base transceiver station (BTS).
The chosen demonstrator application is the baseband processing of the stand-alone dedicated control channel (SDCCH) of the
Global System for Mobile Communication [ETS]. It represents a
rather complex case, making use of all of the stages of baseband
processing.
The task graphs that model the downlink part (BTS to mobile station) of the GSM SDCCH are shown in Figure 4.14. Every four frame periods, i.e. every 240/13ms ≈ 18.46ms, a block
of 184 bits, specified in GSM 05.03, requires transmission. This
block is denoted block1 in Figure 4.14. The block is processed by
a so-called FIRE encoder that adds 40 parity bits to block1. The
FIRE encoder and its polynomial are specified in GSM 05.03,
section 4.1.2. The result of the FIRE encoding is block2, a 224
bits block. Four zero bits are appended to block2 by the tailer
Figure 4.14: Encoding of a GSM dedicated signalling channel (task graph of the SDCCH downlink: FIRE encoder, tailer, convolutional encoder, interleaver, A5 and ciphering units, assemblers, modulators, hopping units, and oversampling/ramping/frequency translation units)
as specified in the aforementioned GSM document. The result
is block3, a 228 bits block. The 228 bits of block3 are processed
by a convolutional encoder specified in GSM 05.03, section 4.1.3,
where the generating polynomials are given. The result of the
convolutional encoding is block4, a 456-bit block. Blocks 5₁, 5₂, 5₃, and 5₄, each 114 bits long, result from the interleaving of block4
as specified by GSM 05.03, section 4.1.4. Depending on how the
establishment of the channel was negotiated between the mobile station and the network, the communication may or may not
be encrypted. In case of encryption, blocks 6₁ to 6₄ result from blocks 5₁ to 5₄ respectively, as specified in GSM 03.20, annex 3, sections A3.1.2 and A3.1.3. Blocks 6₁, 6₂, 6₃, and 6₄ are then assembled as specified by GSM 05.03, section 4.1.5 and GSM 05.02, section 5.2.3. The assembling is done using a training sequence TS. TS is a 26-bit array and is one of the 8 training sequences of GSM, specified in GSM 05.02, section 5.2.3. The assembling of blocks 6₁ to 6₄ results in bursts 6₁ to 6₄, each of these bursts being 148 bits long. Bursts 7₁ to 7₄ result from the modulation of bursts 6₁ to 6₄ respectively. Bursts 8₁ to 8₄ are radio bursts modulated on frequencies freq₁ to freq₄ respectively. Freq₁, freq₂, freq₃, and freq₄ are integers, at most 6 bits long in GSM900,
that indicate the frequency to be used for sending a burst. They
are computed as specified by GSM 05.02, section 6.2.3. Their
computation makes use of the 6-bit integers MAIO (mobile allocation index offset) and HSN (hopping sequence number), and
of RNTABLE, a vector of 114 integers of 7 bits each, specified by
GSM 05.02, section 6.2.3. COUNT is the current frame number, while COUNT1 to COUNT4 are the numbers of the frames in which the four bursts will be sent on the radio interface. COUNT1 to COUNT4 are obtained by task “count” from the subtimeslot number of the SDCCH (subTS), from a clock tick synch, and from the current frame number COUNT.
The graph in Figure 4.14 contains many functional units of
fine granularity, which could induce high communication overhead. In order to reduce the overhead, some replicated units
could be collapsed, others could be merged together. The modified
graph is depicted in Figure 4.15. In this case, the A5, ciphering, assembling, modulating, hopping, and oversampling units
iterate four times in each activation of the graph. Merging the
interleaver and the assembler leads to the modification of the
algorithms of the interleaver and the ciphering unit. The ciphering unit does not receive 114 bits from the interleaver, but 148
Figure 4.15: Encoding of a GSM dedicated signalling channel, reduced architecture (FIRE encoder and tailer merged; interleaver and assembler merged)
4.4. LIMITATIONS AND EXTENSIONS
61
bits, structured in the form 3 + 57 + 1 + 26 + 1 + 57 + 3. The
ciphering unit then performs a XOR between the 114-bit ciphering stream and the two 57-bit fields of the received block, leaving
the remaining 3 + 1 + 26 + 1 + 3 bits untouched.
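This XOR on the 148-bit burst (structured 3 + 57 + 1 + 26 + 1 + 57 + 3) can be sketched over bit lists (illustrative; the field offsets follow directly from the stated structure):

```python
def cipher_burst(burst, stream):
    """burst: 148 bits laid out as 3 + 57 + 1 + 26 + 1 + 57 + 3;
    stream: 114 ciphering bits. XOR the stream with the two 57-bit
    data fields, leaving the remaining 3 + 1 + 26 + 1 + 3 bits untouched."""
    assert len(burst) == 148 and len(stream) == 114
    out = burst[:]
    for i in range(57):
        out[3 + i] ^= stream[i]           # first 57-bit field (bits 3..59)
        out[88 + i] ^= stream[57 + i]     # second 57-bit field (bits 88..144)
    return out

out = cipher_burst([0] * 148, [1] * 114)
assert sum(out) == 114                     # exactly the two data fields flipped
assert out[60] == 0 and out[87] == 0       # bits between the fields untouched
```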
The whole application runs on a single DSP processor and the
tasks are scheduled according to fixed priority scheduling. The
FIRE encoding task has a period of 240/13 ≈ 18.46 ms (we use a time quantum of 1/13 ms in this application such that all task periods may be specified as integers). FIRE
encoding, convolutional encoding and interleaving are activated
once every task graph instantiation, while the ciphering, A5,
modulator, hopping, and oversampling tasks are activated four
times every task graph instantiation. The end-to-end deadline
of the task graph is equal to its period, i.e. 240/13ms.
In this example, there are two sources of variation in execution times. The modulating task has both data- and control-intensive behaviour, which can cause pipeline hazards on the deeply
pipelined DSP it runs on. Its execution time probability density
is derived from the input data streams and measurements. The second source is the ciphering task. Due to the lack of knowledge about the A5 ciphering algorithm (its specification
is not publicly available), the ciphering task execution time is
considered to be uniformly distributed between an upper and a
lower bound.
When two channels are scheduled on the DSP, the ratio of
missed deadlines is 0 (all deadlines are met). Considering three
channels assigned to the same processor, the analysis produced
a ratio of missed deadlines that was below the one imposed by the required QoS. It is important to note that, using a hard real-time model with worst-case execution times (WCET), the system with three channels would be deemed unschedulable on the selected DSP. The underlying
stochastic process for the three channels had 130 nodes and its
analysis took 0.01 seconds. The small number of nodes is caused
by the strong harmonic relation among the task periods, imposed by the
GSM standard.
4.4 Limitations and Extensions
Although our proposed method is, as shown, efficiently applicable to the analysis of applications implemented on monoprocessor systems, it can handle only small-scale multiprocessor applications. This section identifies the causes of this limitation and sketches an alternative approach to handle multiprocessor applications.

Figure 4.16: Example of multiprocessor application (tasks τ1 to τ4, annotated with their priorities; τ2, τ3, τ4 mapped on P1, τ1 on P2)

Figure 4.17: Two execution scenarios ((a) Scenario 1, (b) Scenario 2: Gantt diagrams of the tasks on processors P1 and P2)
When analysing multiprocessor applications, one approach
could be to decompose the analysis problem into several subproblems, each of them analysing the tasks mapped on one of the processors. We could attempt to apply the present approach in order
to solve each of the subproblems. Unfortunately, in the case of
multiprocessors and with the assumption of data dependencies
among tasks, this approach cannot be applied. The reason is that
the set of ready tasks cannot be determined based solely on the
information regarding the tasks mapped on the processor under
consideration. To illustrate this, let us consider the example in
Figure 4.16. Tasks τ2 , τ3 , and τ4 are mapped on processor P1 and
task τ1 is mapped on processor P2 . The numbers near the tasks
indicate the task priorities. For simplicity, let us assume that
all tasks have the same period π, and hence there is only one
priority monotonicity interval [0, π). Let us examine two possible scenarios. The corresponding Gantt diagrams are depicted
in Figure 4.17. At time moment 0 task τ1 starts running on processor P2 and task τ2 starts running on processor P1 . Task τ1
completes its execution at time moment t1 ∈ [0, π). In the first
scenario, task τ2 completes its execution at time moment t′ > t1 and task τ3 starts executing on the processor P1 at time moment t′ because it has the highest priority among the two ready tasks τ3 and τ4 at that time. In the second scenario, task τ2 completes its execution at time moment t′′ < t1. Therefore, at time moment t′′, only task τ4 is ready to run and it will start its execution on
the processor P1 at that time. Thus, the choice of the next task
to run is not independent of the time when the running task
completes its execution inside a PMI. This makes the concept of
PMIs unusable when looking at the processors in isolation.
An alternative approach would be to consider all the tasks
and to construct the global state space of the underlying stochastic process accordingly. In principle, the approach presented in
the previous sections could be applied in this case. However, the
number of possible execution traces, and implicitly the stochastic
process, explodes due to the parallelism provided by the application platform. As shown, the analysis has to store the probability distributions zi for each process state in the sliding window of
states, leading to large amounts of needed memory and limiting
the appropriateness of this approach to very small multiprocessor applications. Moreover, the number of convolutions zi ∗ εj,
being equal to the number of states, would also explode, leading to prohibitive analysis times. The next chapter presents an
approach that overcomes these problems. However, as opposed
to the method presented in this chapter, which produces exact
values for the expected deadline miss ratios, the alternative approach generates approximations of the real ratios.
Chapter 5

Analysis of Multiprocessor Systems
The challenge taken up in this chapter is to analyse an application
running on a multiprocessor system with acceptable accuracy,
without the need to explicitly store and compute the memory
consuming distributions of the residual execution times of each
task in the states of the underlying stochastic process. Also, we
would like to avoid the calculation of the computation-intensive
convolutions.
We address this problem by using an approximation approach for the task execution time probability distribution functions. Approximating the generalised ETPDFs with weighted
sums of convoluted exponential functions leads to approximating the underlying generalised semi-Markov process with a
continuous time Markov chain. By doing so, we avoid both the
computation of convolutions and the storage of the zi functions.
However, as opposed to the method presented in the previous
chapter, which produces exact values for the expected deadline
miss ratios, the alternative approach generates approximations
of the real ratios.
The approximation of the generalised task execution time
probability distributions by weighted sums of convoluted exponential distributions leads to a large continuous time Markov
chain. Such a Markov chain is much larger than the stochastic
process underlying the system with the real, non-approximated
execution times, but, as the probability distributions of the state holding times are exponential, there is no need to store these distributions explicitly, leading to a much more efficient use of the
analysis memory. Moreover, by construction, the Markov chain
exhibits regularities in its structure. These regularities are exploited during the analysis such that the infinitesimal generator
of the chain is constructed on-the-fly, saving additional amounts
of memory. In addition, the solution of the continuous time
Markov chain does not imply any computation of convolutions.
As a result, multiprocessor applications of realistic size may be
analysed with sufficient accuracy. Moreover, by controlling the
precision of the approximation of the ETPDFs, the designer may
trade analysis resources for accuracy.
5.1 Problem Formulation
The multiprocessor system analysis problem that we solve in
this chapter is formulated as follows.
5.1.1 Input
The input of the problem consists of:
• The set of task graphs Γ,
• The set of processors P,
• The mapping Map,
• The set of task periods ΠT and the set of task graph periods ΠΓ,
• The set of task deadlines ∆T and the set of task graph deadlines ∆Γ,
• The set of execution time probability density functions ET,
• The late task policy (the discarding policy),
• The set Bounds = {bi ∈ N \ {0} : 1 ≤ i ≤ g}, where bi is the maximum number of simultaneously active instantiations of task graph Γi, and
• The scheduling policies on the processing elements and buses.
5.1.2 Output
The results of the analysis are the sets MissedT and MissedΓ
of expected deadline miss ratios for each task and task graph
respectively.
5.1.3 Limitations
For now, we restrict our assumptions on the system to the following:
• All tasks belonging to the same task graph have the same
period (πa = πb , ∀τa , τb ∈ Vi ⊂ T , where Γi = (Vi , Ei ) is a
task graph),
• The task deadlines (task graph deadlines) are equal to the
corresponding task periods (task graph periods) (πi = δi ,
∀1 ≤ i ≤ N , and πΓi = δΓi , ∀1 ≤ i ≤ g), and
• The late task policy is the discarding policy.
These restrictions are relaxed in Section 5.9 where we discuss
their impact on the analysis complexity.
5.2 Approach Outline
In order to extract the desired performance metrics, the underlying stochastic process corresponding to the application has to
be constructed and analysed. Events, such as the arrival of a
deadline, represent state transitions in the stochastic process.
In order to obtain the long-time average rate of such events,
the stationary state probabilities have to be calculated for the
stochastic process underlying the system.
The underlying stochastic process is regenerative, i.e. the
system behaves probabilistically equivalent in the time intervals between consecutive visits to a regenerative state [Ros70].
Thus, it would be sufficient to analyse the system in the interval between two consecutive regenerations. However, the subordinated stochastic process (the process between two regeneration points) is a continuous-state time-homogeneous generalised
semi-Markov process (GSMP) [She93, Gly89]. Hence, its stationary analysis implies the numerical solution of a system of
partial differential equations with complicated boundary conditions [GL94]. This makes the applicability of the GSMP-based
analysis limited to extremely small systems.
CH. 5. MULTIPROCESSOR SYSTEMS
Because of the limitations of the GSMP-based approach, we
proceed along a different path, namely the exact analysis of an
approximating system. We approximate the generalised probability distributions of task execution times with Coxian probability distributions [Cox55]. The stochastic process that underlies
a system with only Coxian probability distributions is a continuous time Markov chain (CTMC) [Lin98], whose steady state
analysis implies the solution of a system of linear equations.
Albeit theoretically simple, the applicability of this approach, if
used directly, is limited by the enormous increase of the number of states of the CTMC relative to the number of states of
the stochastic process underlying the application. In order to
cope with this increase, we exploit the specific structure of the
infinitesimal generator of the CTMC such that we reduce the
needed analysis memory by more than one order of magnitude.
The outline of our approach is depicted in Figure 5.1. At step
1, we generate a model of the application as a Concurrent Generalised Petri Net (CGPN) [PST98] (Section 5.3).
At step 2, we construct the tangible reachability graph (TRG)
of the CGPN. The TRG is also the marking process, i.e. the
stochastic process in which the states represent the tangible
markings of the CGPN. The marking process of a CGPN is a
generalised semi-Markov process (GSMP) [Lin98] (Section 5.4).
The third step implies the approximation of the arbitrary
real-world ETPDFs with Coxian distributions, i.e. weighted
sums of convoluted exponential distributions. Some details regarding Coxian distributions and the approximation process
follow in Section 5.5.
Directly analysing the GSMP obtained at step 2 is practically impossible (because of time and memory complexity) for
even small toy examples, if they are implemented on multiprocessor systems. Therefore, at step 4, the states of this process
are substituted by sets of states based on the approximations
obtained in the third step. The transitions of the GSMP are substituted by transitions with exponentially distributed firing interval probabilities from the Coxian distributions. What results
is a continuous time Markov chain (CTMC), much larger than the GSMP but easier to analyse. The explanation of this rather counter-intuitive fact is twofold:
• By exploiting regularities in the structure of the CTMC, the
elements of its generator matrix can be constructed on-thefly during the analysis, leading to memory savings.
[Figure 5.1: Approach outline — step 1: CGPN model generation from the application (a set of task graphs with arbitrary ETPDFs); step 2: TRG/GSMP construction from the CGPN model; step 3: approximation of the arbitrary ETPDFs with Coxian distributions; step 4: CTMC construction; step 5: analysis of the CTMC, yielding the results (percentage of missed deadlines).]
[Figure 5.2: Task graphs — graph Γ1 consists of tasks τ1 , τ2 and τ3 ; graph Γ2 consists of task τ4 .]
• Computationally expensive convolutions of probability
density functions are avoided by using exponentially distributed ETPDFs.
The construction procedure of the CTMC is detailed in Section 5.6.
As the last step, the obtained CTMC is solved and the performance metrics extracted (Section 5.7).
5.3 Intermediate Model Generation
As the first step, starting from the task graph model given by
the designer, an intermediate model based on Concurrent Generalised Petri Nets (CGPN) [PST98] is generated. Such a model
allows an efficient and elegant capturing of the characteristics of
the application and of the scheduling policy. It also constitutes
an appropriate starting point for the generation of the CTMC, to
be discussed in the following sections.
The rest of this section details the modelling of applications with CGPN. Note that the generation of a CGPN model
from the input data of our problem is automatic and its complexity is linear in the number of tasks, and hence negligible relative
to the solving of the marking process underlying the system.
5.3.1 Modelling of Task Activation and Execution
We illustrate the construction of the CGPN based on an example.
Let us consider the task graphs in Figure 5.2. Tasks τ1 , τ2 and τ3
form graph Γ1 while Γ2 consists of task τ4 . τ1 and τ2 are mapped
on processor P1 and τ3 and τ4 on processor P2 . The task priorities are 1, 2, 2, and 1 respectively. The task graph Γ1 has period πΓ1 and Γ2 has period πΓ2 . For simplicity, in this example, we ignore the communication tasks.

[Figure 5.3: CGPN example — the CGPN model of the task graphs in Figure 5.2: for each task τi there are places ri (running), aki (ready), di (completed) and ci (discarding), a timed transition ei and an immediate transition ji ; the transition Clock with deterministic delay Tick, the places fj and Bndj and the immediate transitions vj , wj and dscj model arrivals, instantiation bounds and discarding; the places Proc1 and Proc2 model the processors. Priorities and arc multiplicities annotate the immediate transitions and arcs.]
The CGPN corresponding to the example is depicted in Figure 5.3. Concurrent Generalised Petri Nets are extensions of
Generalised Stochastic Petri Nets (GSPN) introduced by Balbo
et al. [BCFR87]. CGPNs have two types of transitions, timed
(denoted as solid rectangles in Figure 5.3) and immediate (denoted as thin lines). Timed transitions have an associated firing delay, which can be deterministic or stochastic with a given
generalised probability distribution function. The firing policy
that we consider is race with enabling policy, i.e. the elapsed
time is kept if the transition stays enabled. Immediate transitions have zero firing delay, i.e. they fire as soon as they are enabled. Immediate transitions have associated priorities, shown
as the positive integers that annotate the transitions in the figure. The priorities are used to determine the immediate transition that fires among a set of competitively enabled immediate
transitions. Arcs have associated multiplicities, shown as the
positive integers that annotate the arcs. A necessary condition
for a transition to fire is that the number of tokens in each input
place of the transition is equal to or greater than the multiplicity of the arc that connects the corresponding input place to the
transition.
The execution of task τi , 1 ≤ i ≤ 4, is modelled by the place
ri and timed transition ei . If a timed transition ei is enabled,
it means that an instantiation of the task τi is running. The
probability distribution of the firing delay of transition ei is equal
to the ETPDF of task τi . The firing of ei means that the instantiation of τi has completed execution and leaves the system.
The task priorities are modelled by prioritising the immediate
transitions ji .
In our example, the mutual exclusion of the execution of
tasks mapped on the same processor is modelled by means of
the places Proc1 and Proc2 , which correspond to processors P1
and P2 respectively. The data dependencies among the tasks are
modelled by the arcs e2 → a23 , e1 → a13 and e1 → a12 .
5.3.2 Modelling of Periodic Task Arrivals
The periodic arrival of graph instantiations is modelled by means of the transition Clock with the deterministic delay Tick, as illustrated in Figure 5.3. Clock fires every Tick time units, where Tick is the greatest common divisor of the graph periods. As soon as Clock has fired πΓi /Tick times, the transition vi fires and a new instantiation of task graph Γi demands execution. (In our example, for the simplicity of the illustration, we considered πΓ1 /Tick = 1 and πΓ2 /Tick = 1. This is modelled by specifying an arc multiplicity of 1 and 1 for the arcs f1 → v1 and f2 → v2 respectively.)
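The Clock/vi arrival mechanism above can be sketched as a small simulation; the graph periods and the simulation horizon below are hypothetical values, not those of the example.

```python
from math import gcd
from functools import reduce

def arrivals(graph_periods, horizon):
    """Simulate the Clock/v_i mechanism: Clock fires every Tick time
    units (Tick = gcd of the task graph periods); v_i fires, i.e. a new
    instantiation of graph i arrives, every pi_i/Tick Clock firings."""
    tick = reduce(gcd, graph_periods)
    counters = [0] * len(graph_periods)        # tokens in place f_i
    events = []                                # (time, graph index)
    for k in range(1, horizon // tick + 1):    # one Clock firing per Tick
        for i, period in enumerate(graph_periods):
            counters[i] += 1
            if counters[i] == period // tick:  # multiplicity of arc f_i -> v_i
                events.append((k * tick, i))
                counters[i] = 0                # v_i consumes the tokens
    return tick, events

tick, ev = arrivals([6, 9], 18)   # hypothetical graph periods 6 and 9
# tick == 3; graph 0 arrives at t = 6, 12, 18 and graph 1 at t = 9, 18
```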
5.3.3 Modelling Deadline Misses
In the general case of arbitrary deadlines, the event of an instantiation of a task τi (task graph Γj ) missing its deadline is modelled by the firing of a certain transition (modelling the deadline
arrival) in a marking with a certain property (modelling the fact
that the considered task (task graph) instantiation has not yet
completed).
In the particular case of the task deadline being equal to the
task period and to the corresponding task graph period (δτi =
πτi = πΓj if τi ∈ Vj ), the arrival of the deadline of any task τi ∈ Vj
is modelled by the firing of vj , i.e. the arrival of a new instantiation of a task graph Γj . The fact that an instantiation of task τi
has not yet completed its execution is modelled by a marking in
which at least one of the places aki (ready-to-run) or ri (running)
is marked. Hence, the event of an instantiation of τi missing its
deadline is modelled by the firing of vj in any of the markings
with the above mentioned property.
As explained in Section 3.2.5, the event of an instantiation of task graph Γj missing its deadline coincides with the earliest event of an instantiation of a task τi ∈ Vj missing its deadline δτi .
In order to avoid the implied bookkeeping of events, in the case
when δτi = πτi = πΓj , τi ∈ Vj , task graph deadline misses have
been modelled in the following equivalent but simpler form.
The place Bndj is initially marked with bj tokens, meaning
that at most bj concurrent instantiations of Γj are allowed in
the system. Whenever a new instantiation of task graph Γj is
accepted in the system (transition wj fires), a token is removed
from place Bndj . Once a task graph instantiation leaves the system (all places di are marked, where τi ∈ Vj ), a token is added to
Bndj . Having less than bj tokens in place Bndj indicates that at
least one instantiation of task graph Γj is active in the system.
Hence, an instantiation of task graph Γj misses its deadline if
and only if transition vj fires (denoting the arrival of the deadlines of any task τi ∈ Vj , in the above mentioned particular case)
in a CGPN marking with the property that place Bndj contains
less than bj tokens.
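The token bookkeeping in place Bndj reduces to a counter; the following sketch (with a hypothetical bound bj = 2) illustrates the miss and discard rules described above.

```python
class GraphBound:
    """Counter view of place Bnd_j: initially b_j tokens; accepting an
    instantiation (w_j fires) removes one, a completed graph
    instantiation returns one.  A firing of v_j while fewer than b_j
    tokens are present signals a deadline miss; an empty Bnd_j forces
    the discarding transition dsc_j."""
    def __init__(self, b):
        self.b = b                      # at most b concurrent instantiations
        self.tokens = b
    def arrive(self):                   # transition v_j fires
        miss = self.tokens < self.b     # an older instantiation is still active
        discard = self.tokens == 0      # Bnd_j empty: dsc_j discards the oldest
        if not discard:
            self.tokens -= 1            # w_j accepts the new instantiation
        return miss, discard            # a new instantiation is always accepted
    def complete(self):                 # all places d_i of the graph are marked
        self.tokens += 1

g = GraphBound(b=2)
# first arrival: no miss; second: miss; third: miss and discard
# g.arrive() -> (False, False), then (True, False), then (True, True)
```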
The modelling of task and task graph deadline misses in the more general cases of δτi = πτi ≠ πΓj , and δτi ≤ πτi if τi ∈ Vj , is discussed in Section 5.9.1 and Section 5.9.3.
5.3.4 Modelling of Task Graph Discarding
If Bndj contains no tokens at all when vj fires, then the maximum number of instantiations of Γj is already present in the system and, therefore, the oldest one will be discarded. This is
modelled by firing the immediate transition dscj and marking
the places ci , where one such place ci corresponds to each task
in Γj . The token in ci will attempt to remove from the system an
already completed task (a token from di ), or a running task (a
token from ri ), or a ready task (a token from aki ), in this order.
The transitions wj have a higher priority than the transitions
dscj , in order to ensure that an instantiation of Γj is discarded
only when Bndj contains no tokens (there already are bj concurrently active instantiations of Γj in the system). The structure
of the CGPN is such that a newly arrived instantiation is always
accepted in the system.
5.3.5 Scheduling Policies
The scheduling policy determines which of the enabled transitions ji fires. In the case of static priorities, this is easily modelled by assigning the task priorities to the corresponding immediate transitions ji as is the case of the example in Figure 5.3. In
the case of dynamic task priorities, the choice is made based on
the global time, which can be deduced from the global marking
of the Petri Net. In general, the time left until the deadline of Γj is computed by subtracting from the multiplicity of the outgoing arc of fj (how many Tick units separate consecutive arrivals of Γj ) the number of tokens in fj (how many Tick units have passed since the last arrival of Γj ).
[Figure 5.4: Marking process of the CGPN in Figure 5.3 — states s1 –s7 ; the edges are labelled with the timed transitions e1 –e4 and Clock that trigger the marking changes.]
In the case of dynamic priorities of tasks, the marking process construction algorithm has to be instructed to choose which
transition to fire based on the net marking and not on the transition priorities depicted in the model.
5.4 Generation of the Marking Process
This section discusses step 2 of our approach (Figure 5.1), the
generation of the marking process of the Petri Net that models
an application.
A tangible marking of a CGPN is a marking in which no
immediate transitions are enabled. Such a marking can be directly reached from another tangible marking by firing exactly
one timed transition followed by a possibly empty sequence of
immediate transition firings, until no more immediate transitions are enabled. The tangible reachability graph (TRG) contains the tangible markings of the Petri net. Each marking in
the TRG corresponds to a state in the underlying stochastic process, also known as the marking process.
Balbo et al. [BCFR87] gave an algorithm for the generation of
the tangible reachability graph (TRG) for Generalised Stochastic Petri Nets (GSPN). Even if the Petri Net formalism that we
use is an extension to GSPN, the algorithm is applicable nevertheless. In the worst case, the number of nodes in the TRG is
exponential in the number of places and number of transitions
of the Petri Net.
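A generic sketch of TRG construction — fire one timed transition, then exhaust immediate transitions until a tangible marking is reached — can be written as a breadth-first search; the three callbacks stand in for a concrete CGPN, and this is only a sketch, not the actual algorithm of [BCFR87].

```python
from collections import deque

def tangible_reachability_graph(m0, immediate, timed, fire):
    """Breadth-first TRG construction.  `immediate(m)` returns the
    highest-priority enabled immediate transition in marking m (or
    None), `timed(m)` returns the enabled timed transitions, and
    `fire(m, t)` returns the successor marking; markings are hashable."""
    def settle(m):
        t = immediate(m)
        while t is not None:               # exhaust immediate firings
            m = fire(m, t)
            t = immediate(m)
        return m                           # tangible marking reached
    start = settle(m0)
    edges, work, seen = [], deque([start]), {start}
    while work:
        m = work.popleft()
        for t in timed(m):                 # one timed firing per TRG edge
            n = settle(fire(m, t))
            edges.append((m, t, n))        # edge labelled with the timed transition
            if n not in seen:
                seen.add(n)
                work.append(n)
    return edges
```

For instance, a two-marking toggle (a timed transition "a" from marking 0 to 1 and "b" back, no immediate transitions) yields the edges (0, "a", 1) and (1, "b", 0).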
The tangible reachability graph of the Petri Net in Figure 5.3
is shown in Figure 5.4. The graph in Figure 5.4 is also a graphical representation of the stochastic process of the marking of the
net. An edge in the TRG is labelled with the timed transition
that triggers the marking change. The thicker arcs correspond
to transition firings that model task deadline misses. If we compute the steady-state rates of the firings along the thick arcs, we
will be able to obtain the expected deadline miss rates for each
task and task graph.
If all timed transitions had exponentially distributed firing
delay probabilities, the marking process would be a continuous
time Markov chain (CTMC). The computation of its stationary
state probability vector would imply the solution of a system of
linear equations. As we assume that tasks may have execution
times with generalised probability distributions, the marking
process is not a CTMC.
Observing that our systems are regenerative¹, a possible
solution would be the one proposed by Choi et al. [CKT94].
They introduced Markov Regenerative Stochastic Petri Nets
(MRSPN), which allow timed transitions to have firing delays
with generalised probability distributions. However, MRSPN
have the limitation that at most one transition whose firing
delay probability has a generalised distribution is enabled in
every marking. In this case, the underlying stochastic process
is a Markov regenerative process (MRGP). Choi et al. [CKT94]
present a method for the transient analysis of the MRGP corresponding to their marking graph.
One important observation we can make from Figure 5.4 is that in all states except state s5 , more than one of the simultaneously enabled transitions have firing delay probabilities with generalised distributions. This situation occurs because
there are several tasks with non-exponential execution time
probability distribution functions that execute concurrently on
different processors. Therefore, the process is not a Markov
regenerative process with at most one non-exponentially distributed event in each state. Hence, the analysis method of Choi
et al. [CKT94] does not apply in our case.
¹ Let LCM denote the least common multiple of all task periods. The process regenerates itself when both of the following conditions are true: the time becomes k · LCM , k ∈ N (every LCM/Tick firings of transition Clock), and all processors were idle just before this time. In Figure 5.4, such a situation happens every time state s1 is entered.
The Concurrent Generalised Petri Net model (CGPN), introduced by Puliafito et al. [PST98], softens the restriction of
MRSPNs. Puliafito et al. address the analysis of the marking
processes of CGPN with an arbitrary finite number of simultaneously enabled events with generalised probability distribution.
Nevertheless, they restrict the system such that all simultaneously enabled transitions get enabled at the same instant.
Under this assumption, the marking process is still a MRGP.
However, we cannot assume that all tasks that are concurrently
executed on a multiprocessor platform start their execution at
the same time. Therefore, we remove this restriction placed on
CGPN and we will use the term CGPN in a wider sense in the
sequel. The marking process of such a CGPN in the wide sense
is not necessarily a MRGP.
In order to keep the Markovian property, we could expand
the state of the marking process to contain the residual firing
times of enabled transitions. In this case, the subordinated process, i.e. the process between two regeneration times, is a time-homogeneous generalised semi-Markov process (GSMP) [She93,
Gly89]. Hence, its stationary analysis implies the numerical solution of a system of partial differential equations with complicated boundary conditions [GL94]. This makes the applicability
of the GSMP-based analysis limited to extremely small systems.
All this leads us to a different approach for solving the marking process. We approximate the generalised probability distribution functions with Coxian distributions [Cox55], i.e. weighted
sums of convoluted exponential distribution functions. The
resulting process contains transitions with exponentially distributed firing delay probabilities. Hence, it is a continuous time
Markov chain (CTMC) that approximates the non-Markovian
marking process. The steady-state analysis of the CTMC implies the solving of a system of linear equations. If applied
directly, the approach is severely limited by the immense number of states of the approximating CTMC. A key to the efficient
analysis of such huge Markov chains is the fact that we observe
and exploit a particular structure of the chain, such that its
infinitesimal generator may be generated on-the-fly during the
analysis and does not need to be stored in memory.
Section 5.5 discusses the approximation of generalised probability distribution functions with Coxian distribution functions.
Section 5.6 details the aforementioned structural properties
of the CTMC that allow for its efficient solving.
[Figure 5.5: Coxian approximation with three stages — (a) a transition with arbitrarily distributed firing delay probability; (b) the subnet that replaces it: a chain of exponential stages whose exit transitions have rates α1 µ1 , α2 µ2 , α3 µ3 and whose continuation transitions have rates (1 − α1 )µ1 , (1 − α2 )µ2 .]
5.5 Coxian Approximation
Coxian distributions were introduced by Cox [Cox55] in the context of queueing theory. A Coxian distribution of r stages is
a weighted sum of convoluted exponential distributions. The
Laplace transform of the probability density of a Coxian distribution with r stages is given below:
X(s) = ∑_{i=1}^{r} αi · ∏_{k=1}^{i−1} (1 − αk ) · ∏_{k=1}^{i} µk /(s + µk )
X(s) is a strictly proper rational transform, implying that the
Coxian distribution may approximate a fairly large class of arbitrary distributions with an arbitrary accuracy, provided a sufficiently large r.
Figure 5.5 illustrates the way we use Coxian distributions in
our approach. Let us consider the timed transition with a certain probability distribution of its firing delay in Figure 5.5(a).
This transition can be replaced by the Petri Net in Figure 5.5(b),
where hollow rectangles represent timed transitions with exponential firing delay probability distribution. The annotations
near those transitions indicate their average firing rate. In this
example, three stages have been used for approximation.
Practically, the approximation problem can be formulated as
follows: given an arbitrary probability distribution and a certain
number of stages r, find µi , 1 ≤ i ≤ r, and αi , 1 ≤ i ≤ r − 1
(αr = 1), such that the quality of approximation of the given distribution by the Coxian distribution with r stages is maximised.
Malhotra and Reibman [MR93] describe a method for parameter fitting that combines moment-matching with least squares
fitting for phase approximations, of which Coxians are a subclass. In our approach, because the analytic expression of a Coxian distribution is quite complicated in the time domain, we perform the least squares fitting in the frequency domain. Hence,
we minimise the distance between the Fourier transform X(jω)
of the Coxian distribution and the computed Fourier transform
of the distribution to be approximated. The minimisation is a
typical interpolation problem and can be solved by various numerical methods [PTVF92]. We use a simulated annealing approach that minimises the difference of only a few of the most significant harmonics of the Fourier transforms, which is very fast if provided with a good initial solution. We choose the initial solution in such a way that the first moments of the real and approximated distributions coincide.
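The fitting objective can be sketched directly from the Laplace transform evaluated at s = jω; the Erlang initialisation below is one way to match the first moment, and the target distribution in the example (an exponential with rate 2) is hypothetical.

```python
def coxian_ft(alphas, mus, omega):
    """Fourier transform X(j*omega) of a Coxian density: the Laplace
    transform of Section 5.5 evaluated at s = j*omega."""
    total, reach, prod = 0.0 + 0.0j, 1.0, 1.0 + 0.0j
    for a, mu in zip(alphas, mus):
        prod *= mu / (1j * omega + mu)   # running product mu_k/(s + mu_k)
        total += a * reach * prod
        reach *= 1.0 - a                 # probability of passing this stage
    return total

def harmonic_distance(alphas, mus, target_ft, omegas):
    """Least squares objective over a few significant harmonics;
    target_ft maps omega to the complex FT of the distribution to fit."""
    return sum(abs(coxian_ft(alphas, mus, w) - target_ft(w)) ** 2
               for w in omegas)

def erlang_init(r, mean):
    """Moment-matched initial solution: an Erlang-r with the given mean
    (alpha_i = 0 except alpha_r = 1, all stage rates r/mean)."""
    return [0.0] * (r - 1) + [1.0], [r / mean] * r

# fitting an exponential of rate 2 with a single stage is exact:
a0, m0 = erlang_init(1, 0.5)
# harmonic_distance(a0, m0, lambda w: 2 / (2 + 1j * w), [0, 1, 2, 5]) == 0
```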
By replacing all transitions whose firing delays have generalised probability distributions (as shown in Figure 5.5(a)) with
subnets of the type depicted in Figure 5.5(b) we obtain a CTMC
that approximates the non-Markovian marking process of the
CGPN. It is obvious that the introduced additional places trigger
a potentially huge increase in the size of the TRG and implicitly
in the size of the resulted CTMC. The next section details how
to efficiently handle such an increase in the dimensions of the
underlying stochastic process.
5.6 Approximating Markov Chain Construction
The marking process of a Petri Net that models an application
under the assumptions stated in Section 5.1, such as the one depicted in Figure 5.3, is a generalised semi-Markov process. If
we replace the transitions that have generally distributed firing
delay probabilities with Coxian distributions, the resulting Petri
Net has a marking process that is a continuous time Markov
chain that approximates the generalised semi-Markov process.
In this section we show how to express the infinitesimal generator of the CTMC by means of generalised tensor sums and
products.
Plateau and Fourneau [PF91] have shown that the infinitesimal generator of a CTMC underlying a parallel system can
be expressed as a generalised tensor sum [Dav81] of the local
infinitesimal generators, i.e. the local generators of the parallel subsystems. Haddad et al. [HMC97] analysed marking
processes of Generalised Stochastic Petri Nets with Coxian and
phase-type distributions. They classify the transitions of the net
in local and non-local transitions and use the results of Plateau
and Fourneau. However, they place certain restrictions on the
usage of immediate transitions. Moreover, because under our
assumptions the firing of the Clock transition (see Figure 5.3)
may disable all the other timed transitions, there exists no partition of the set of places such that the results of Plateau and Fourneau, and of Haddad et al., might apply.
In our case, the infinitesimal generator is not a sum of tensor
products and sums. Rather, the infinitesimal generator matrix is
partitioned into submatrices, each of them being a sum of tensor
expressions.
Let S be the set of states of the GSMP underlying the Petri
Net before the replacement outlined in the previous section. This
GSMP corresponds to the TRG of the Petri Net model. Let M =
[mij ] be a square matrix of size |S| × |S| where mij = 1 if there
exists a transition from the state si to the state sj in the GSMP
and mij = 0 otherwise. We first partition the set of states S into clusters such that states in the same cluster have outgoing edges
labelled with the same set of transitions. A cluster is identified
by a binary combination that indicates the set of transitions that
are enabled in the particular cluster (which, implicitly, also indicates the set of tasks that are running in the states belonging to that particular cluster). The clusters are sorted according
to their corresponding binary combination and the states in the
same cluster are consecutively numbered.
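The clustering and numbering step can be sketched as follows; the representation of states by their sets of enabled timed transitions is an assumption of this sketch.

```python
def cluster_states(enabled_sets):
    """Group states into clusters keyed by the set of enabled timed
    transitions (equivalently, the set of running tasks), encode each
    cluster as a binary label, and number the states consecutively,
    cluster by cluster.  `enabled_sets` maps state -> set of indices of
    the timed transitions labelling its outgoing edges."""
    n = max((i for s in enabled_sets.values() for i in s), default=-1) + 1
    clusters = {}
    for state, trans in enabled_sets.items():
        label = sum(1 << (n - 1 - i) for i in trans)  # e.g. {0, 2} -> 0b101
        clusters.setdefault(label, []).append(state)
    numbering, nxt = {}, 0
    for label in sorted(clusters):        # clusters sorted by binary label
        for state in clusters[label]:
            numbering[state] = nxt        # consecutive numbers inside a cluster
            nxt += 1
    return clusters, numbering

# Figure 5.7 example with two tasks: X and Y run both tasks (label 11),
# Z runs only the second one (label 01)
clusters, numbering = cluster_states({"X": {0, 1}, "Y": {0, 1}, "Z": {1}})
# clusters == {3: ["X", "Y"], 1: ["Z"]}
# numbering == {"Z": 0, "X": 1, "Y": 2}
```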
Consider an application with three independent tasks, each
of them mapped on a different processor. In this case, 8 clusters can be formed, each corresponding to a possible combination of simultaneously running tasks.

[Figure 5.6: The matrix corresponding to a GSMP — rows and columns are labelled with the cluster labels 000 through 111; the magnified cell MU,V at the intersection of row U = 100 and column V = 011 contains the rows 1 0 0 0 and 0 1 0 0.]

Note that if the tasks were
not independent, the number of combinations of simultaneously
running tasks, and implicitly of clusters, would be smaller. The
cluster labelled with 101, for example, contains states in which
the tasks τ1 and τ3 are running.
Figure 5.6 depicts the matrix M corresponding to the GSMP
of the application described above. The rows and columns in the
figure do not correspond to individual rows and columns in M.
Each row and column in Figure 5.6 corresponds to one cluster
of states. The row labelled with 100, for example, as well as the
column labelled with the same binary number, indicate that the
task τ1 is running in the states belonging to the cluster labelled
with 100, while tasks τ2 and τ3 are not running in the states belonging to this cluster. Each cell in the figure does not correspond
to a matrix element but to a submatrix Mli ,lj , where Mli ,lj is
the incidence matrix corresponding to the clusters labelled with
li and lj (an element of Mli ,lj is 1 if there is a transition from
the state corresponding to its row to the state corresponding to
its column, and it is 0 otherwise). The submatrix MU,V at the
intersection of the row labelled with U = 100 and the column labelled with V = 011 is detailed in the figure. The cluster labelled
with U = 100 contains 2 states (corresponding to the two rows of
the magnified cell in Figure 5.6), while the cluster labelled with
V = 011 contains 4 states (corresponding to the four columns of
the magnified cell in Figure 5.6). As shown in the figure, when
a transition from the first state of the cluster labelled with U
occurs, the first state of the cluster labelled with V is reached
(corresponding to the 1 in the intersection of the first row and first column in the magnified cell). This corresponds to the case when τ1 completes execution (τ1 is the only running task in the states belonging to the cluster labelled with U ) and τ2 and τ3 are subsequently started (τ2 and τ3 are the running tasks in the states belonging to the cluster labelled with V ).

[Figure 5.7: Part of a GSMP — states X, Y and Z; an edge labelled v leads from X to Y and an edge labelled u leads from X to Z.]
Once we have the matrix M corresponding to the underlying GSMP, the next step is the generation of the CTMC using
the Coxian distribution for approximation of arbitrary probability distributions of transition delays. When using the Coxian
approximation, a set of new states is introduced for each state in
S (S is the set of states in the GSMP), resulting in an expanded state space S ′ , the state space of the approximating CTMC. We have to construct a matrix Q of size |S ′ | × |S ′ |, the so-called infinitesimal generator of the approximating CTMC. The construction of Q is done cell-wise: for each submatrix of M, a corresponding submatrix of Q will be generated. Furthermore, null
submatrices of M will result in null submatrices of Q. A cell
QU,V of Q will be of size G × H, where
G = |U | · ∏_{i∈EnU} ri ,    H = |V | · ∏_{i∈EnV} ri ,
and U and V are clusters of states, |U | and |V | denote the number
of states belonging to the respective clusters, EnU = {k : transition ek , corresponding to the execution of task τk , is enabled in
U }, EnV = {k : transition ek , corresponding to the execution of
task τk , is enabled in V }, and rk is the number of stages we use
in the Coxian approximation of the ETPDF of task τk .
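The cell dimensions G and H are simple products; a one-line sketch with hypothetical stage counts:

```python
def cell_rows(cluster_size, enabled, stages):
    """G (or H) of a cell: |U| times the product of the numbers of
    Coxian stages of the tasks running in the cluster."""
    size = cluster_size
    for k in enabled:
        size *= stages[k]
    return size

# a cluster with 2 states and two running tasks, approximated with 2
# and 3 Coxian stages respectively, contributes 2 * 2 * 3 = 12 rows
# cell_rows(2, [0, 1], {0: 2, 1: 3}) == 12
```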
We will illustrate the construction of a cell in Q from a cell in
M using an example. We consider a cell on the main diagonal,
as it is the most complex case. Let us consider three states in the
GSMP depicted in Figure 5.7. Two tasks, τu and τv , are running
in the states X and Y . These two states belong to the same cluster, labelled with 11. Only task τv is running in state Z. State Z belongs to the cluster labelled with 10.

[Figure 5.8: Coxian approximation with two stages — two exponential stages with rates λ1 and λ2 ; the exit transitions have rates β1 λ1 and β2 λ2 , and the continuation transition has rate (1 − β1 )λ1 .]

[Figure 5.9: Expanded Markov chain — the CTMC states Xij and Yij , 0 ≤ i ≤ 1, 0 ≤ j ≤ 2, and Z0 , Z1 , Z2 , connected by transitions labelled with the rates of the two Coxian distributions.]

If task τv finishes running in state X, a transition to state Y occurs in the GSMP. This
corresponds to the situation when a new instantiation of τv becomes active immediately after the completion of a previous one.
When task τu finishes running in state X, a transition to state
Z occurs in the GSMP. This corresponds to the situation when
a new instantiation of τu is not immediately activated after the
completion of a previous one. Consider that the probability distribution of the execution time of task τv is approximated with
the three stage Coxian distribution depicted in Figure 5.5(b) and
that of τu is approximated with the two stage Coxian distribution depicted in Figure 5.8. The resulting CTMC corresponding
to the part of the GSMP in Figure 5.7 is depicted in Figure 5.9.
The edges between the states are labelled with the average firing rates of the transitions of the Coxian distributions. Dashed
edges denote state changes in the CTMC caused by firing of transitions belonging to the Coxian approximation of the ETPDF of
τu . Solid edges denote state changes in the CTMC caused by firing of transitions belonging to the Coxian approximation of the
ETPDF of τv . The state Y12 , for example, corresponds to the situation when the GSMP in Figure 5.7 would be in state Y and the
first two of the three stages of the Coxian distribution approximating the ETPDF of τv (Figure 5.5(b)) and the first stage out of
the two of the Coxian distribution approximating the ETPDF of
τu (Figure 5.8) have fired.
Let us construct the cell Q(11),(11) on the main diagonal of the
infinitesimal generator Q of the CTMC in Figure 5.9. The cell is
situated at the intersection of the row and column corresponding to the cluster labelled with 11 and is depicted in Figure 5.10.
The matrix Q(11),(11) contains the average transition rates between the states Xij and Yij , 0 ≤ i ≤ 1, 0 ≤ j ≤ 2, of the CTMC
in Figure 5.9 (no state Z, only states X and Y belong to the cluster labelled with 11). The observed regularity in the structure of
the stochastic process in Figure 5.9 is reflected in the expression
of Q(11),(11) as shown in Figure 5.10. Because Q is a generator
matrix (sum of row elements equals 0), some negative elements
have to be introduced on the main diagonal that do not correspond to transitions in the chain depicted in Figure 5.9. The
expression of Q(11),(11) is given below:
Q(11),(11) = (Au ⊕ Av ) ⊗ I|11| + Iru ⊗ Bv ⊗ erv ⊗ Dv
[Figure 5.10: The cell Q(11),(11) corresponding to the example in Figure 5.9 — a 12 × 12 matrix over the states X00 , Y00 , X01 , Y01 , X02 , Y02 , X10 , Y10 , X11 , Y11 , X12 , Y12 ; its non-zero entries are the rates α1 µ1 , α2 µ2 , (1 − α1 )µ1 , (1 − α2 )µ2 , β1 λ1 , (1 − β1 )λ1 , and each main-diagonal entry is a number such that the sum of the row elements is 0.]
where

Au = [ −λ1   (1 − β1 )λ1
        0    −λ2        ]

Av = [ −µ1   (1 − α1 )µ1   0
        0    −µ2          (1 − α2 )µ2
        0     0           −µ3        ]

Bv = [ α1 µ1
       α2 µ2
       α3 µ3 ]

Dv = [ 0  1
       0  0 ]                               (5.1)

erv = [ 1  0  0 ]                           (5.2)
|11| denotes the size of the cluster labelled with 11. Ii is the
identity matrix of size i × i, ri indicates the number of stages
of the Coxian distribution that approximates the ETPDF of task
τi . ⊕ and ⊗ are the Kronecker sum and product of matrices,
respectively.
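As a concrete illustration of Eq.(5.1) and (5.2), the cell above can be assembled numerically from Kronecker operations. The sketch below uses hypothetical rates for the two Coxian approximations (2 stages for τu, 3 for τv) and checks only the structure of the result, not any real application data:

```python
import numpy as np

# Hypothetical Coxian parameters: tau_u has 2 stages (rates lambda, branching
# beta), tau_v has 3 stages (rates mu, branching alpha). Illustrative only.
lam, beta = np.array([3.0, 5.0]), np.array([0.4, 1.0])
mu, alpha = np.array([2.0, 4.0, 1.0]), np.array([0.3, 0.5, 1.0])

A_u = np.diag(-lam) + np.diag((1 - beta[:-1]) * lam[:-1], k=1)
A_v = np.diag(-mu) + np.diag((1 - alpha[:-1]) * mu[:-1], k=1)
B_v = (alpha * mu).reshape(-1, 1)          # r_v x 1
e_rv = np.array([[1.0, 0.0, 0.0]])         # 1 x r_v
D_v = np.array([[0.0, 1.0], [0.0, 0.0]])   # |11| x |11|: the v-labelled edge X -> Y

# Kronecker sum A_u (+) A_v = A_u (x) I + I (x) A_v
kron_sum = np.kron(A_u, np.eye(3)) + np.kron(np.eye(2), A_v)
Q_cell = np.kron(kron_sum, np.eye(2)) \
       + np.kron(np.kron(np.kron(np.eye(2), B_v), e_rv), D_v)

assert Q_cell.shape == (12, 12)   # r_u * r_v * |11| = 2 * 3 * 2 states
```

All off-diagonal entries of the assembled cell are non-negative transition rates, as expected of a piece of a generator matrix.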
In general, a matrix A_k = [a_{ij}] is an r_k × r_k matrix defined as follows:
$$a_{ij} = \begin{cases} (1-\alpha_{ki})\mu_{ki} & \text{if } j = i+1 \\ -\mu_{ki} & \text{if } j = i \\ 0 & \text{otherwise} \end{cases} \tag{5.3}$$
CH. 5. MULTIPROCESSOR SYSTEMS
where α_{ki} and µ_{ki} characterise the ith stage of the Coxian distribution approximating a transition t_k. A matrix B_k = [b_{ij}] is an r_k × 1 matrix with b_{i1} = α_{ki} · µ_{ki}. A matrix e_{r_k} = [e_{ij}] is a 1 × r_k matrix with e_{11} = 1 and e_{1i} = 0 for 1 < i ≤ r_k.
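Since the pair (A_k, B_k) is exactly the sub-generator and absorption vector of a phase-type (Coxian) distribution, its entries can be sanity-checked against the closed-form Coxian mean. A minimal numpy sketch, with made-up stage parameters (not taken from any application in this thesis):

```python
import numpy as np

# Hypothetical 3-stage Coxian: stage i is left with rate mu_i; the process
# absorbs with probability alpha_i or continues to stage i+1 otherwise.
alpha = np.array([0.3, 0.5, 1.0])   # the last stage always absorbs
mu = np.array([2.0, 4.0, 1.0])
r = len(mu)

A = np.diag(-mu) + np.diag((1 - alpha[:-1]) * mu[:-1], k=1)   # Eq (5.3)
e1 = np.zeros(r); e1[0] = 1.0

# Phase-type mean: e1 * (-A)^{-1} * 1 ...
mean_ph = e1 @ np.linalg.solve(-A, np.ones(r))
# ... must equal the direct Coxian mean: sum_i prod_{j<i}(1 - alpha_j) / mu_i
mean_direct = sum(np.prod(1.0 - alpha[:i]) / mu[i] for i in range(r))
assert np.isclose(mean_ph, mean_direct)
```

For these illustrative parameters both expressions give 1/2 + 0.7/4 + 0.35/1 = 1.025.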
A matrix D_k = [d_{ij}] corresponding to a cell (U, V) is a |U| × |V| matrix defined as follows:
$$d_{ij} = \begin{cases} 1 & \text{if an edge labelled with } k \text{ links the } i\text{th state of } U \text{ with the } j\text{th state of } V \\ 0 & \text{otherwise} \end{cases} \tag{5.4}$$
In general, considering a label U, the cell Q_{U,U} on the main diagonal of Q is obtained as follows:
$$Q_{U,U} = \left(\bigoplus_{i \in En_U} A_i\right) \otimes I_{|U|} + \sum_{i \in En_U} \left(\bigotimes_{\substack{j \in En_U \\ j < i}} I_{r_j}\right) \otimes B_i \otimes e_{r_i} \otimes \left(\bigotimes_{\substack{j \in En_U \\ j > i}} I_{r_j}\right) \otimes D_i \tag{5.5}$$
A cell situated at the intersection of the row corresponding to label U with the column corresponding to label V (U ≠ V) is obtained as follows:
$$Q_{U,V} = \sum_{i \in En_U} \left(\bigotimes_{j \in En_U \cup En_V} F_{ij}\right) \otimes D_i \tag{5.6}$$
The matrices F are given by the following expression:
$$F_{ij} = \begin{cases} v_{r_j} & \text{if } j \in En_U \wedge j \notin En_V \wedge j \neq i \\ I_{r_j} & \text{if } j \in En_U \wedge j \in En_V \wedge j \neq i \\ B_i & \text{if } j \notin En_V \wedge j = i \\ B_i \otimes e_{r_i} & \text{if } j \in En_V \wedge j = i \\ e_{r_j} & \text{if } j \notin En_U \end{cases} \tag{5.7}$$
where v_{r_k} = [v_{i1}] is an r_k × 1 matrix with v_{i1} = 1, 1 ≤ i ≤ r_k.
The solution of the CTMC implies solving for π in the following equation:
$$^t\pi \cdot Q = 0 \tag{5.8}$$
where π is the steady-state probability vector (and ^tπ is its transpose) and Q is the infinitesimal generator of the CTMC.
We conclude this section with a discussion on the size of Q and its implications on analysis time and memory. The submatrix Q_{U,U}, defined by Eq.(5.5), has $(|En_U| + 1 - \sum_{i \in En_U} \frac{1}{r_i}) \cdot \prod_{i \in En_U} r_i \cdot |U|$ non-zero elements. Similarly, for a given cluster label U, all the submatrices Q_{U,V}, U ≠ V, have approximately $|En_U| \cdot \prod_{i \in En_U} r_i \cdot |U|$ non-zero elements on aggregate (see Eq.(5.6)). Letting $Z_U = \prod_{i \in En_U} r_i \cdot |U|$, the entire generator Q has approximately
$$|Q| \approx \sum_U \left(2 \cdot |En_U| + 1 - \sum_{i \in En_U} \frac{1}{r_i}\right) \cdot Z_U \tag{5.9}$$
non-zero elements. The state probability vector π has
$$|\pi| = \sum_U Z_U = \sum_U |U| \cdot \prod_{i \in En_U} r_i \tag{5.10}$$
elements.
Let us suppose that we store the matrix Q in memory. Then, in order to solve the equation $^t\pi \cdot Q = 0$, we would need
$$|\pi| + \xi \cdot |Q| = \sum_U Z_U + \xi \cdot \sum_U \left(2 \cdot |En_U| + 1 - \sum_{i \in En_U} \frac{1}{r_i}\right) \cdot Z_U \tag{5.11}$$
memory locations, where ξ is the number of information items characterising each non-zero element of the matrix. ξ also reflects the overhead of sparse matrix element storage.
As can be seen from the expressions of Q_{U,U} and Q_{U,V}, the matrix Q is completely specified by means of the matrices A_i, B_i, and D_i (see Eq.(5.5) and (5.6)). Hence, it need not be stored explicitly in memory; instead, its elements can be generated on-the-fly during the numerical solution of the CTMC. In this case, we would need to store
$$|\pi| + \sum_i |A_i| + \sum_i |B_i| + \sum_i |D_i| = \sum_U Z_U + \sum_i (3 \cdot r_i - 1) + \sum_U \xi \cdot |En_U| \cdot |U| \tag{5.12}$$
values in order to solve $^t\pi \cdot Q = 0$. Even for large applications, the matrices A_i and B_i are of negligible size ($\sum_i (2 \cdot r_i - 1)$ and $\sum_i r_i$ elements, respectively). The ratio between the memory space needed by Q if stored and if generated on-the-fly is
$$\frac{\sum_U Z_U + \xi \cdot \sum_U \left(2 \cdot |En_U| + 1 - \sum_{i \in En_U} \frac{1}{r_i}\right) Z_U}{\sum_U Z_U + \sum_i (3 \cdot r_i - 1) + \xi \cdot \sum_U |En_U| \cdot |U|} \tag{5.13}$$
For the application example in Figure 5.2, the expression in Eq.(5.13) evaluates to 11.12 if Coxian distributions with 6 stages substitute the original distributions. The actual memory saving factor, as indicated by our analysis tool, is 9.70. The theoretical overestimation is due to the fact that possible overlaps of non-zero elements of the matrices Q_{U,V} were not taken into account in Eq.(5.11). Nevertheless, we see that even for small applications, memory savings of one order of magnitude can be achieved by exploiting the special structure of the infinitesimal generator of the approximating CTMC. Further evaluations will be presented in Section 5.8.
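The estimates of Eq.(5.9)-(5.13) are straightforward to evaluate. The following sketch does so for a hypothetical clustering; the cluster sizes, stage counts, and the value of ξ are invented for illustration and do not describe the application in Figure 5.2:

```python
from math import prod

# Hypothetical clustering: for each cluster label U, (|U|, [r_i for i in En_U]).
clusters = [(2, [6, 6]), (3, [6, 6, 6]), (1, [6])]
xi = 2                                  # items stored per non-zero element
transition_stages = [6, 6, 6]           # r_i for each transition of the net

Z = [size * prod(rs) for size, rs in clusters]
nnz_Q = sum((2 * len(rs) + 1 - sum(1.0 / r for r in rs)) * z
            for (size, rs), z in zip(clusters, Z))              # Eq (5.9)
pi_len = sum(Z)                                                 # Eq (5.10)

stored = pi_len + xi * nnz_Q                                    # Eq (5.11)
on_the_fly = pi_len + sum(3 * r - 1 for r in transition_stages) \
           + xi * sum(len(rs) * size for size, rs in clusters)  # Eq (5.12)
ratio = stored / on_the_fly                                     # Eq (5.13)
assert ratio > 1.0   # storing Q explicitly always costs more here
```

Even for this tiny invented configuration the ratio exceeds one by an order of magnitude, consistent with the 9.70-11.12 range reported above.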
The drawback of this approach is the time overhead of generating each non-zero element of Q on-the-fly. A naïve approach would need O(|En_U| · |En_V|) arithmetic operations in order to compute each non-zero element of the cell (U, V). In the worst case, the number of arithmetic operations is O(M²), where M is the number of processors of the system. However, a large body of research [BCKD00] provides intelligent numerical algorithms for matrix-vector computation that exploit factorisation, reuse of temporary values, and reordering of computation steps. Thus, the aforementioned overhead is significantly amortised.
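As a sketch of the kind of factorisation such algorithms rely on, a product x ↦ (A ⊗ B)x can be computed from the small factors alone, without ever forming A ⊗ B. The dense matrices below are hypothetical and serve only to check the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((3, 3))
x = rng.standard_normal(4 * 3)

# (A (x) B) x via two small multiplications instead of one 12x12 product:
# reshape x into a 4x3 matrix X, then (A (x) B) x = vec(A X B^T) (row-major).
y = (A @ x.reshape(4, 3) @ B.T).reshape(-1)

assert np.allclose(y, np.kron(A, B) @ x)
```

For factors of sizes p and q, this replaces an O(p²q²) multiplication by two multiplications of cost O(p²q + pq²), which is where the amortisation comes from.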
5.7 Extraction of Results
This section describes how the deadline miss ratios for each task
and task graph are calculated, once the CTMC approximating
the stochastic process underlying the system is solved.
As described in Section 5.3, the event of a task (task graph)
missing its deadline corresponds to identifiable edges in the tangible reachability graph of the CGPN modelling the application
(and implicitly to transitions in the underlying GSMP). The expected deadline miss ratio of a task (task graph) can be computed
as the sum of the expected transition rates of the corresponding
identified edges multiplied with the task (task graph) period.
Not only deadline miss events, but also more complex events, may be modelled as transitions along certain edges in the TRG.
Such events are, for example, the event that a task graph misses its deadline given that a certain task τi missed its deadline, or the event that task τj missed its deadline given that task τi started later than a given time moment. There is indeed a large number of events that may be represented as transitions along identifiable edges in the TRG. Inspecting such events can be extremely useful for diagnosis: finding performance bottlenecks and non-obvious correlations between deadline miss events, or detecting which task needs to be re-implemented or whose execution is so badly scheduled that other processors idle. This kind of information can be exploited for the optimisation of the application.
In conclusion, once the steady state probabilities of the
stochastic process are obtained, we are interested in the expected rate of certain edges in the TRG of the CGPN. We illustrate how to calculate this rate based on an example.
Let us consider the edge X → Z in the GSMP in Figure 5.7.
The edges X00 → Z0 , X10 → Z0 , X01 → Z1 , X11 → Z1 , X02 → Z2 ,
and X12 → Z2 in the CTMC in Figure 5.9, which approximates
the GSMP in Figure 5.7, correspond to the edge X → Z in the
GSMP. The expected transition rate of X → Z can be approximated by means of the expected transition rates of the corresponding edges in the CTMC and is given by the expression
$$(\pi_{X_{00}} + \pi_{X_{01}} + \pi_{X_{02}}) \cdot \beta_1\lambda_1 + (\pi_{X_{10}} + \pi_{X_{11}} + \pi_{X_{12}}) \cdot \beta_2\lambda_2 \tag{5.14}$$
where β1, β2, λ1, and λ2 characterise the Coxian distribution that approximates the probability distribution of the delay of the transition X → Z (in this case, the ETPDF of τv). $\pi_{X_{ij}}$ is the probability of the CTMC being in state X_{ij} after the steady state is reached. The probabilities $\pi_{X_{ij}}$ are obtained as the result of the numerical solution of the CTMC (Eq.(5.8)).
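The numerical step behind Eq.(5.8) and (5.14) can be sketched on a toy 3-state generator (the rates below are invented, not those of Figure 5.9): solve the steady-state equation with the normalisation Σπ = 1, then weight the rate of an edge of interest by the steady-state probability of its source state.

```python
import numpy as np

# Toy irreducible CTMC generator (rows sum to zero); rates are illustrative.
Q = np.array([[-3.0,  2.0,  1.0],
              [ 4.0, -5.0,  1.0],
              [ 0.0,  2.0, -2.0]])

A = Q.T.copy()
A[-1, :] = 1.0                 # replace one balance equation by sum(pi) = 1
b = np.zeros(3); b[-1] = 1.0
pi = np.linalg.solve(A, b)

assert np.allclose(pi @ Q, 0.0, atol=1e-12) and np.isclose(pi.sum(), 1.0)

# Expected rate of the edge 0 -> 2 (rate Q[0, 2]), in the style of Eq (5.14):
edge_rate = pi[0] * Q[0, 2]
```

For a set of edges corresponding to one GSMP transition, the individual `edge_rate` terms are summed exactly as in Eq.(5.14); multiplying the sum by the task period then yields the expected deadline miss ratio.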
5.8 Experimental Results
We performed five sets of experiments, as well as a case study on a real-life application. All were run on an AMD Athlon at 1533 MHz.
[Plot: analysis time [s] vs. number of tasks; individual experiments and averages]
Figure 5.11: Analysis time vs. number of tasks
[Plot: stochastic process size [number of states] vs. number of tasks; individual experiments and averages]
Figure 5.12: Stochastic process size vs. number of tasks
5.8.1 Analysis Time as a Function of the Number of Tasks
The first set of experiments investigates the dependency of the
analysis time on the number of tasks in the system. Sets of random task graphs were generated, with 9 to 60 tasks per set. Ten
different sets were generated and analysed for each number of
tasks per set. The underlying architecture consists of two processors. The task execution time probability distributions were
approximated with Coxian distributions with 2 to 6 stages. The
dependency between the needed analysis time and the number
of tasks is depicted in Figure 5.11. The crosses indicate the analysis times of the individual applications, while the boxes represent the average analysis time for the class of applications with the same number of tasks. The analysis time depends on the size of the stochastic process to be analysed (in terms of number of states) as well as on the convergence rate of the numerical solution of the CTMC. The variation of this convergence rate does not allow us to isolate the effect of the number of tasks on the analysis time. Therefore, in Figure 5.12 we depict the influence of the number of tasks on the size of the CTMC in terms of number of states. As seen from the figure, the number of states of the CTMC increases with the number of tasks, following a linear tendency. The observed slight non-monotonicity stems from other parameters that influence the stochastic process size, such as the task graph periods and the amount of task parallelism.
5.8.2 Analysis Time as a Function of the Number of Processors
In the second set of experiments, we investigated the dependency between the analysis time and the number of processors.
Ten different sets of random task graphs were generated. For
each of the ten sets, 5 experiments were performed, by allocating the 18 tasks of the task graphs to 2 to 6 processors. The results are plotted in Figure 5.13. It can be seen that the analysis
time is exponential in the number of processors. The exponential
character is induced by the possible task execution orderings.
The number of task execution orderings increases exponentially
with increased parallelism provided by the architecture. We see
from Eq.(5.10) that the number of states of the CTMC is exponentially dependent on |En_U|, the number of simultaneously enabled transitions in the states of the cluster labelled with U. |En_U| is at most M + 1, where M is the number of processors. Thus, the experiment confirms the theoretical result that the number of states of the CTMC is exponential in the number of processors.

[Plot: analysis time [s] vs. number of processors; individual experiments and averages]
Figure 5.13: Analysis time vs. number of processors
5.8.3 Memory Reduction as a Consequence of the On-the-Fly Construction of the Markov Chain Underlying the System
In the third set of experiments, we investigated the reduction in
the memory needed in order to perform the CTMC analysis when
using on-the-fly construction of the infinitesimal generator based
on equations (5.5) and (5.6). We constructed 450 sets of synthetic
applications with 20 to 40 tasks each. The execution time probability distributions were approximated with Coxian distributions with 6 stages. For each application, we ran our analysis
twice. In the first run, the entire infinitesimal generator of the
CTMC approximating the stochastic process underlying the application was stored in memory. In the second run, the elements
of the infinitesimal generator were computed on-demand during
[Histogram: percent of cases vs. relative reduction of the memory space needed by the analysis, for applications mapped on two, three, and four processors]
Figure 5.14: Histogram of memory reduction
the analysis and not stored. In both runs, we measured the memory needed for the analysis, and calculated the relative memory reduction in the case of the on-demand generation. The histogram of the relative memory reduction is shown in Figure 5.14. We observe a memory reduction ranging from 10 to 19 times. Additionally, we observe a strong correlation between the memory reduction factor and the number of processors on which the application is mapped: the memory reduction factor increases with the number of processors. Last, we note that only 48% of the applications mapped on three processors and only 13.3% of the applications mapped on four processors could be analysed in both runs, such that the comparison could be made. For 52% and 86.7% of the applications mapped on 3 and 4 processors respectively, the 512 MB memory of a desktop PC was insufficient for the analysis when the entire infinitesimal generator was stored in memory. The cases in which the comparison could not be made were not included in the histogram.
[Plot: |S′|/|S| vs. average number of stages of the Coxian distributions]
Figure 5.15: Increase in stochastic process size with the number of stages used for approximating the arbitrary ETPDFs
5.8.4 Stochastic Process Size as a Function of the Number of Stages of the Coxian Distributions
In the fourth set of experiments, we investigated the increase in the stochastic process size induced by using different numbers of stages for approximating the arbitrary ETPDFs. We constructed 98 sets of random task graphs ranging from 10 to 50 tasks mapped on 2 to 4 processors. The ETPDFs were approximated with Coxian distributions using 2 to 6 stages. The results for each type of approximation were averaged over the 98 sets of graphs and are plotted in Figure 5.15. Recall that |S| is the size of the GSMP and |S′| is the much larger size of the CTMC obtained after approximation. The more stages are used for the approximation, the larger the CTMC becomes compared to the original GSMP. As shown in Section 5.6, in the worst case, the growth factor is
$$\prod_{i \in E} r_i \tag{5.15}$$
Table 5.1: Accuracy vs. number of stages

                 2 stages   3 stages   4 stages   5 stages
Relative error    8.467%     3.518%     1.071%     0.4%
As can be seen from Figure 5.15, the real growth factor is smaller
than the theoretical upper bound. It is important to emphasise
that the matrix Q corresponding to the CTMC does not need to
be stored, but only a vector with the length corresponding to a
column of Q. The growth of the vector length with the number of
Coxian stages used for approximation can be easily derived from
Figure 5.15. The same is the case with the growth of analysis
time, which follows that of the CTMC.
5.8.5 Accuracy of the Analysis as a Function of the Number of Stages of the Coxian Distributions
The fifth set of experiments investigates the accuracy of the results as a function of the number of stages used for the approximation. This is an important aspect in deciding on a proper trade-off between the quality of the analysis and its cost in terms of time and memory.
For comparison, we used analysis results obtained with our
approach elaborated in the previous chapter. That approach is
an exact one based on solving the underlying GSMP. However,
because of complexity reasons, it can efficiently handle only
monoprocessor systems. Therefore, we applied the approach
presented in this chapter to a monoprocessor example, which
has been analysed in four variants using approximations with
2, 3, 4, and 5 stages. The relative error between the deadline miss ratios resulting from the analysis using the approximate CTMC and the ones obtained from the exact solution is presented in Table 5.1. The generalised ETPDFs used in this experiment were
created by drawing Bézier curves that interpolated randomly
generated control points. It can be observed that good quality
results can already be obtained with a relatively small number
of approximation stages.
5.8.6 Encoding of a GSM Dedicated Signalling Channel
Finally, we considered the telecommunication application described in Section 4.3.6, namely the baseband processing of a stand-alone dedicated control channel of the GSM. In Section 4.3.6, the application was mapped on a single processor. This implementation could be inefficient, as it combines, on one hand, the signal processing of the FIRE and convolutional encodings and the bit-operation-intensive interleaving and ciphering with, on the other hand, the control-dominated processing of the modulator. Moreover, the implementation of the publicly unavailable A5 algorithm could be provided as a separate circuit. Thus, in this experiment we consider a mapping as shown in Figure 5.16. In the implementation alternative depicted in the figure, the FIRE and convolutional encodings, the bit interleaving and ciphering, as well as the hopping and count tasks are mapped on a digital signal processor. The A5 task is executed by an ASIC, while the modulating and oversampling tasks are mapped on a different ASIC.
In the case of the 9 tasks depicted in Figure 5.16, the analysis reported an acceptable deadline miss ratio after an analysis time of 4.8 seconds. The ETPDFs were approximated by Coxian distributions with 6 stages. If we attempt to perform the baseband processing of an additional channel on the same DSP, three more tasks, namely additional FIRE encoding, convolutional encoding, and interleaving tasks, are added to the task graph. The analysis in this case took 5.6 seconds. As a result of the analysis, in the case of two channels (12 tasks in total), 10.05% of the deadlines are missed, which is unacceptable according to the application specification.
5.9 Extensions
Possible extensions that address the three restrictions that we
assumed on the system model (Section 5.1.3) are discussed in
this section.
5.9.1 Individual Task Periods
[Task graph of the GSM channel encoding: FIRE enc. + tailer, Conv. enc., Interleaver + assembler, Ciphering, A5 (ciphering stream), Hopping, count, Modulator, Oversampl. + ramping + freq. transl.; inputs Kc, MAIO, HSN, RNTABLE, COUNT]
Figure 5.16: Encoding and mapping of a GSM dedicated signalling channel

[Task graph: τ1 (period 2), τ2 (period 3), τ3 (period 12)]
Figure 5.17: Application example

As presented in Section 5.1.3, we considered that all the tasks belonging to a task graph have the same period. This assumption can be relaxed as follows. Each task τi ∈ Γj has its own
period πτi , with the restriction that πτi is a common multiple of
all periods of the tasks in ◦ τi (πτi is an integer multiple of πτk ,
where τk ∈ ◦ τi ). In this case, πΓj , the period of the task graph
Γj , is equal to the least common multiple of all πτi , where πτi is
the period of τi and τi ∈ Vj . The introduction of individual task
periods implies the existence of individual task deadlines: each
task τi has its own deadline δτi = πτi and the deadline δΓj of a
task graph Γj (the time by which all tasks τi ∈ Vj have to have
finished) is πΓj .
In order to illustrate how applications under such an extension are modelled, let us consider the application example depicted in Figure 5.17: one task graph consisting of the three
tasks τ1 , τ2 and τ3 , where τ1 is mapped on processor P2 and τ2
and τ3 are mapped on processor P1 . Task τ1 has period 2, task τ2
has period 3 and task τ3 has period 12 as indicated by the numbers near the circles in the figure. The task graph period is the
least common multiple of the periods of the tasks that belong to
it, i.e. πΓ1 = 12 for our example.
Figure 5.18 depicts the CGPN that models the application
described above. Whenever an instantiation of task τ1 completes
its execution (transition e1 fires), a token is added in place r3,1 .
Similarly, whenever an instantiation of task τ2 completes its execution, a token is added in place r3,2 . In order to be ready to run,
an instantiation of task τ3 needs πτ3 /πτ1 = 6 data items produced
by 6 instantiations of task τ1 and πτ3 /πτ2 = 4 data items produced by 4 instantiations of task τ2 . Therefore, the arcs r3,1 → r3
and r3,2 → r3 have multiplicity 6 and 4 respectively.
[CGPN with places task_i, r_i, r_{3,1}, r_{3,2}, run_i, done_i, proc_1, proc_2, graph_1, Bnd_1 and transitions f_1, f_2, f_Γ, e_1-e_3, j_1-j_3, a_1-a_3, b_1, replace_1, absorb_1, synch_1, and Clock with firing delay Tick]
Figure 5.18: CGPN modelling the application in Figure 5.17

The firing delay Tick of the Clock transition is no longer the greatest common divisor of the task graph periods, 12, but the greatest common divisor of the task periods, 1. Every two ticks, f1 fires and a new token is added to place task1, modelling the arrival of a new instantiation of τ1. Similarly, every three
ticks a new token is added to place task2 . In general, the fact that
an instantiation of task τi has not yet completed its execution is
modelled by a marking with the property that at least one of the
places ri,k , ai or runi is marked. Hence, an instantiation of task
τi misses its deadline if and only if fi fires in a marking with the
above mentioned property.
Every 12 ticks, fΓ1 fires and a new token is added to place graph1, modelling a new instantiation of task graph Γ1. In general, if Bndi contains fewer than bi tokens when a new instantiation of Γi arrives, then at least one instantiation is still active in the system. The event of a task graph Γi missing its deadline therefore corresponds to the firing of fΓi in a marking in which Bndi contains fewer than bi tokens.
If Bnd1 is not marked at that time, it means that there are
already b1 active instantiations of Γ1 in the system. In this
case, replace1 fires and the oldest instantiation is removed from
the system. Otherwise, absorb1 fires, consuming a token from
Bnd1 . This token is added back when synch1 fires modelling
the completion of the task graph Γ1 . An instantiation of Γ1
completes its execution when πΓ1 /πτ1 = 6 instantiations of τ1 ,
πΓ1 /πτ2 = 4 instantiations of τ2 and πΓ1 /πτ3 = 1 instantiation of
τ3 complete their execution. Therefore, the arcs done1 → synch1 ,
done2 → synch1 , and done3 → synch1 , have multiplicity 6, 4 and
1 respectively.
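All of the clock parameters and arc multiplicities above are derivable from the task periods alone. A small sketch reproducing the numbers of the example in Figure 5.17:

```python
from functools import reduce
from math import gcd, lcm

periods = {"tau1": 2, "tau2": 3, "tau3": 12}        # example in Figure 5.17

graph_period = reduce(lcm, periods.values())        # pi_Gamma_1
tick = reduce(gcd, periods.values())                # firing delay of Clock
assert (graph_period, tick) == (12, 1)

# r_{3,1} -> r_3 and r_{3,2} -> r_3 multiplicities (inputs needed by tau3):
assert periods["tau3"] // periods["tau1"] == 6
assert periods["tau3"] // periods["tau2"] == 4

# done_i -> synch_1 multiplicities (instantiations per graph period):
mult = {t: graph_period // p for t, p in periods.items()}
assert mult == {"tau1": 6, "tau2": 4, "tau3": 1}
```

(`math.lcm` requires Python 3.9 or later.)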
The following experiment has been carried out in order to assess the impact of individual task periods on the analysis complexity. Three sets of test data, Normal, High, and Low, have been created. The test data in set Normal contains 300 sets of random task graphs, each set comprising 12 to 27 tasks grouped in 3 to 9 task graphs. The tasks are mapped on 2 to 6 processors. Each task has its own period, as described in this section. For each task graph Γi, the least common multiple, LCMi, and the greatest common divisor, GCDi, of the periods of the tasks belonging to the task graph are computed. The test data High and Low are identical to the test data Normal with the exception of the task periods, which are equal for all tasks belonging to the same task graph: all tasks belonging to the task graph Γi in the test data High have period LCMi, while the same tasks have period GCDi in the test data Low.
Figure 5.19 plots the dependency of the average size of the
underlying generalised semi-Markov process (number of states)
on the number of tasks. The three curves correspond to the three
[Plot: average stochastic process size [number of states] vs. number of tasks, for the test sets Normal, High, and Low]
Figure 5.19: Individual task periods compared to uniform task periods
test data sets. The following conclusions can be drawn from the
figure:
1. The plots corresponding to High and Low are very close to
each other. This indicates that the number of states in the
GSMP is only weakly dependent on the particular period
value.
2. The number of states in the GSMP corresponding to test
set N ormal is larger than the ones corresponding to test
sets High and Low. This confirms that different periods for
tasks belonging to the same task graph lead to a relative
increase in the GSMP size as compared to the case when
all tasks in the same task graph have the same period.
3. The relative growth of the size of the GSMP in the case in which tasks belonging to the same task graph have different periods, compared to the case in which they have the same period, decreases with the number of tasks.
Table 5.2 contains the average increase in the size of the generalised semi-Markov processes corresponding to the task sets in test data Normal relative to the size of the generalised semi-Markov processes underlying the task sets in test data Low and High.

Table 5.2: Relative increase of the GSMP size in the case of individual periods

                      Increase relative to
Number of tasks   test set Low   test set High
      12              13.109          6.053
      15               2.247          1.260
      18               1.705          1.443
      21               1.204          1.546
5.9.2 Task Rejection vs. Discarding
As formulated in Section 5.1.3, when there are bi concurrently
active instantiations of task graph Γi in the system, and a new
instantiation of Γi demands service, the oldest instantiation of Γi
is eliminated from the system. Sometimes, this behaviour is not
desired, as the oldest instantiation might have been very close
to finishing, and by discarding it, the invested resources (time,
memory, bandwidth, etc.) are wasted, as discussed before.
Therefore, our approach has been extended to support a late
task policy in which, instead of discarding the oldest instantiation of Γi , the newly arrived instantiation is denied service (rejected) by the system. However, the analysis method supports
the rejection policy only in the context of fixed priority scheduling.
The CGPN modelling the application in Figure 5.2, when considering the rejection policy, is depicted in Figure 5.20.
If there are bi concurrently active instantiations of a task
graph Γi in the system, then the place Bndi contains no tokens.
If a new instantiation of Γi arrives in such a situation (vi fires),
then dsci will fire, “throwing away” the newly arrived instantiation.
Let us suppose that an application consists of two independent tasks, τ1 and τ2, mapped on different processors and having the same period π, with b1 = b2 = 1. An instantiation of τ2 always finishes before the arrival of the next instantiation. Suppose that both tasks are running at time moment
[CGPN with places Bnd_1, Bnd_2, Proc_1, Proc_2 and transitions v_1, v_2, dsc_1, dsc_2, w_1, w_2, b_1, b_2, j_1-j_4, e_1-e_4, and Clock with firing delay Tick]
Figure 5.20: CGPN modelling the task graphs in Figure 5.2 in the case of the rejection policy
t. The marking of the CGPN at time moment t is Mt. The instantiation of τ1 that is running at time moment t runs an extremely long time, beyond time moment (⌊t/π⌋ + 1)π + ε (ε > 0, but very small). Therefore, the instantiation of τ1 that arrives at time moment (⌊t/π⌋ + 1)π is rejected. The marking of the CGPN M(⌊t/π⌋+1)π+ε at time (⌊t/π⌋ + 1)π + ε is identical to the marking at time moment t, Mt. However, Mt corresponds to the situation when the freshest instantiations of τ1 and τ2 are running, while M(⌊t/π⌋+1)π+ε corresponds to the situation when an older instantiation of τ1 and the freshest instantiation of τ2 are running. Hence, if the task priorities are dynamic, it is impossible to
extract their priorities solely from the marking of the CGPN. In the case of discarding, as opposed to rejection, the freshest task instantiations are always the ones active in the system. Therefore, their latencies can be computed based on the current time (extracted from the current marking, as shown in Section 5.3.5). Consequently, the rejection policy is supported by our analysis method only in the context of fixed (static) priority scheduling, where task priorities are constant and explicit in the CGPN model, such that they do not have to be extracted from the net marking. Timestamping of tokens would be a solution for extending the support of the rejection policy to dynamic priority scheduling. This, however, is expected to lead to a significant increase in the size of the tangible reachability graph of the modelling Petri net and, implicitly, in the number of states of the underlying GSMP.
Although CGPNs like the one in Figure 5.20, modelling applications with the rejection policy, are simpler than the CGPN in Figure 5.3, modelling applications with the discarding policy, the resulting tangible reachability graphs (and implicitly the underlying generalised semi-Markov processes) are larger. In the case of discarding, the active task instantiations are always the freshest ones. In the case of rejection, however, an instantiation could be arbitrarily old, leading to many more combinations of possible active
task instantiations. In order to illustrate this, let us consider
the following example. The task set consists of two independent
tasks, τ1 and τ2 with the same period and mapped on the same
processor. Task τ1 has a higher priority than task τ2 . At most one
active instantiation of each task is allowed in the system at each
time moment. Figure 5.21(a) depicts the underlying generalised
semi-Markov process in the case of the discarding policy, while
Figure 5.21(b) depicts the underlying generalised semi-Markov
process in the case of the rejection policy.

[State diagrams of the GSMPs; states annotated (running task, ready set): (a) discarding: s0 (τ1,{τ2}), s1 (τ2,∅), s2 (−,∅); (b) rejection: additionally s3 (τ2,{τ1}) and s4 (τ1,∅)]
Figure 5.21: Stochastic process underlying the application

[Gantt diagrams: (a) discarding: s0, s1, s2 over two periods; (b) rejection: s0, s1, s3, s4, s2 over two periods]
Figure 5.22: Gantt diagrams for the highlighted paths in Figure 5.21

Table 5.3: Discarding compared to rejection

                Average GSMP size         Relative increase
Tasks     Discarding      Rejection      of the GSMP size
  12         8437.85       18291.23            1.16
  15        27815.28       90092.47            2.23
  18        24089.19      194300.66            7.06
  21       158859.21      816296.36            4.13
  24       163593.31      845778.31            4.17
  27       223088.90     1182925.81            4.30

The states are annotated by tuples of the form (a, W) where a is the running task
and W is the set of ready tasks. The labels τ1 and τ2 on arcs indicate the completion of the corresponding tasks, while Clock indicates the arrival of new task instantiations every period. Figures 5.22(a) and 5.22(b) depict the Gantt diagrams corresponding to the two highlighted paths in the stochastic processes depicted in Figure 5.21(a) and 5.21(b). The Gantt diagrams are
annotated with the states in the corresponding stochastic processes. As seen, the rejection policy introduces states, like the
priority inversion noted in state s3 , which are impossible when
applying discarding.
In order to assess the impact of the rejection policy on the
analysis complexity compared to the discarding policy, the following experiments were carried out. 109 task sets of 12 to 27
tasks grouped in 2 to 9 task graphs were randomly generated.
Each task set has been analysed twice, first considering the discarding policy and then considering the rejection policy. The results were averaged over the task sets with the same cardinality and are shown in Table 5.3.
5.9.3 Arbitrary Task Deadlines
As discussed in Section 5.3, a deadline miss is modelled by the
firing of a transition capturing the deadline arrival in a marking with certain properties. Therefore, when the task deadline
is equal to the corresponding task period, and, implicitly, it coincides with the arrival of a new instantiation, such a transition is
already available in the CGPN model. For example, such transitions are vi in Figure 5.3 and Figure 5.20 and fi in Figure 5.18.
In the case of arbitrary deadlines, such a transition has to be
explicitly added to the model, very much in the same way as the
modelling of new task instantiations is done. In this case, the firing delay Tick of Clock is the greatest common divisor of the task periods and of their relative deadlines. Because the deadlines may be arbitrary, very often the value of Tick will be 1.
This leads to an increase in the number of tokens circulating in
the CGPN model and implicitly to a potentially huge increase of
the number of states of the underlying generalised semi-Markov
process.
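The greatest-common-divisor computation of the firing delay can be sketched as follows (a small illustration with made-up periods and deadlines, not code from the thesis):

```python
from math import gcd
from functools import reduce

def clock_tick(periods, deadlines):
    """Firing delay Tick of Clock: the greatest common divisor of all
    task periods and all relative deadlines, as described above."""
    return reduce(gcd, periods + deadlines)

# Deadlines equal to the periods keep the tick large...
print(clock_tick([20, 30], [20, 30]))  # 10
# ...while arbitrary deadlines quickly drive it down to 1, inflating
# the number of tokens and hence the size of the underlying GSMP.
print(clock_tick([20, 30], [18, 7]))   # 1
```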
5.10 Conclusions
In the current and the previous chapter we have presented two
approaches to the performance analysis of applications with
stochastic task execution time. The first approach calculates
the exact deadline miss ratios of tasks and task graphs and
is efficient for monoprocessor systems. The second approach
approximates the deadline miss ratios and is conceived for the
complex case of multiprocessor systems.
While both approaches efficiently analyse one design alternative, they cannot be successfully applied to driving the optimisation phase of a system design process where a huge number
of alternatives has to be evaluated. The next chapter presents
a fast but less accurate analysis approach together with an approach for deadline miss ratio minimisation.
Chapter 6
Deadline Miss Ratio
Minimisation
The previous two chapters addressed the problem of analysing
the deadline miss ratios of applications with stochastic task execution times. In this chapter we address the complementary
problem: given a multiprocessor hardware architecture and a
functionality as a set of task graphs, find a task mapping and
priority assignment such that the deadline miss ratios are below
imposed upper bounds.
6.1 Problem Formulation
The problem addressed in this chapter is formulated as follows.
6.1.1 Input
The problem input consists of
• The set of processing elements PE, the set of buses B, and their connection to processors,
• The set of task graphs Γ,
• The set of task periods ΠT and the set of task graph periods ΠΓ,
• The set of task deadlines ∆T and of task graph deadlines ∆Γ,
• The set PEτ of allowed mappings of τ for all tasks τ ∈ T,
• The set of execution (communication) time probability density functions corresponding to each processing element p ∈ PEτ for each task τ,
• The late task policy, which is the discarding policy,
• The set Bounds = {bi ∈ N\{0} : 1 ≤ i ≤ g}, where bi = 1, ∀1 ≤ i ≤ g, i.e. there exists at most one active instantiation of any task graph in the system at any time,
• The set of task deadline miss thresholds ΘT and the set of task graph deadline miss thresholds ΘΓ, and
• The set of tasks and task graphs that are designated as being critical.
6.1.2 Output
The problem output consists of a mapping and priority assignment such that the cost function
Σdev = Σ_{i=1}^{N} devτi + Σ_{i=1}^{g} devΓi   (6.1)
giving the sum of miss deviations is minimised, where the deadline miss deviation is defined as in Section 3.2.5.
If a mapping and priority assignment is found such that Σdev is finite, it is guaranteed that the deadline miss ratios of all critical tasks and task graphs are below their imposed thresholds.
6.1.3 Limitations
We restrict our assumptions on the system to the following:
• The scheduling policy is restricted to fixed-priority non-preemptive scheduling.
• At most one instance of a task graph may be active in the
system at any time.
• The late task policy is the discarding policy.
6.2 Approach Outline
Because the defined problem is NP-hard (see the complexity
of the classical mapping problem [GJ79]), we have to rely on
heuristic techniques for solving the formulated problem. An
accurate estimation of the miss deviation, which is used as a
cost function for the optimisation process, is in itself a complex and time-consuming task, as shown in the previous two chapters. Therefore, a fast approximation of the cost function value is needed to guide the design space exploration. Hence, the following subproblems have to be solved:
• Find an efficient design space exploration strategy, and
• Develop a fast and sufficiently accurate analysis, providing the needed cost indicators.
Section 6.4 discusses the first subproblem, while Section 6.5 focuses on the system analysis we propose. First, however, we will present a motivation for our endeavour, showing how naïve approaches fail to successfully solve the formulated problem.

Figure 6.1: Motivational example — two mapping alternatives, (a) and (b), for a task graph with tasks A, B, C, D, and E
6.3 The Inappropriateness of Fixed Execution Time Models
A naïve approach to the formulated problem would be to use fixed execution time models (average, median, worst-case execution time, etc.) and to hope that the resulting designs would be
optimal or close to optimal also from the point of view of the percentage of missed deadlines. The following example illustrates
the pitfalls of such an approach and emphasises the need for
an optimisation technique that considers the stochastic execution times. Let us consider the application in Figure 6.1(a). All
the tasks have period 20 and the deadline of the task graph is
18. Tasks A, B, C, and D have constant execution times of 1,
6, 7, and 8 respectively. Task E has a variable execution time
whose probability is uniformly distributed between 0 and 12. Hence, the average (expected) execution time of task E is 6. The inter-processor communication takes 1 time unit per message.

Figure 6.2: Gantt diagrams of the two mapping alternatives in Figure 6.1 — under mapping (a) the task graph misses its deadline of 18 with ratio 25%, under mapping (b) with ratio 8.33%

Figure 6.3: Motivational example — execution time probability density functions of task τ on two processors; the probability mass beyond the deadline is 8% in (a) and 30% in (b), although the WCET in (b) is smaller
Let us consider the two mapping alternatives depicted in Figure 6.1(a) and 6.1(b). The two Gantt diagrams in Figure 6.2(a)
and 6.2(b) depict the execution scenarios corresponding to the
two considered mappings if the execution of task E took the expected amount of time, that is 6. The shaded rectangles depict
the probabilistic execution of E. A mapping strategy based on
the average execution times would select the mapping in Figure 6.1(a) as it leads to a shorter response time (15 compared
to 17). However, in this case, the worst-case execution time of
the task graph is 21. The deadline miss ratio of the task graph
is 3/12 = 25%. If we took into consideration the stochastic nature of the execution time of task E, we would prefer the second
mapping alternative, because of the better deadline miss ratio
of 1/12 = 8.33%. If we considered worst-case response times instead of average ones, then we would choose the second mapping
alternative, the same as the stochastic approach. However, approaches based on worst-case execution times can be dismissed
by means of very simple counter-examples.
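The two miss ratios can be checked with a closed-form calculation. The start times of task E (9 under mapping (a), 7 under mapping (b)) are assumptions chosen to reproduce the figures quoted above; only the uniform distribution of E on [0, 12] and the deadline of 18 are taken from the example:

```python
DEADLINE = 18.0
WCET_E = 12.0  # execution time of E is uniform on [0, WCET_E]

def miss_ratio(start_of_E):
    """P(start_of_E + exec(E) > DEADLINE) for a uniform exec(E)."""
    slack = DEADLINE - start_of_E
    return min(1.0, max(0.0, (WCET_E - slack) / WCET_E))

print(miss_ratio(9.0))  # mapping (a): 3/12 = 0.25
print(miss_ratio(7.0))  # mapping (b): 1/12 ≈ 0.0833
```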
Let us consider a task τ that can be mapped on processor
P1 or on processor P2 . P1 is a fast processor with a very deep
pipeline. Because of its pipeline depth, mispredictions of target addresses of conditional jumps, though rare, are severely
penalised. If τ is mapped on P1 , its ETPDF is shown in Figure 6.3(a). The long and flat density tail corresponds to the rare
but expensive jump target address misprediction. If τ is mapped
on processor P2 , its ETPDF is shown in Figure 6.3(b). Processor P2 is slower with a shorter pipeline. The WCET of task τ
on processor P2 is smaller than the WCET if τ ran on processor P1 . Therefore, a design space exploration tool based on the
WCET would map task τ on P2 . However, as Figure 6.3 shows,
the deadline miss ratio in this case is larger than if task τ was
mapped on processor P1 .
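A quick simulation illustrates the counter-example. The two distributions below are invented stand-ins for the ETPDFs of Figure 6.3 (the long-tail shape and the 8%/30% figures come from the text; the concrete numbers are assumptions):

```python
import random

random.seed(0)
DEADLINE = 10.0
N = 100_000

def exec_p1():
    """Fast, deeply pipelined P1: rare mispredictions create a long,
    flat tail with a large WCET of 20."""
    if random.random() < 0.08:
        return random.uniform(12.0, 20.0)  # rare but expensive
    return random.uniform(2.0, 6.0)

def exec_p2():
    """Slower P2 with a shorter pipeline: WCET only 13, but more
    probability mass beyond the deadline."""
    return random.uniform(3.0, 13.0)

miss1 = sum(exec_p1() > DEADLINE for _ in range(N)) / N
miss2 = sum(exec_p2() > DEADLINE for _ in range(N)) / N
# WCET-based mapping would pick P2 (13 < 20), yet P2 misses far more often.
print(round(miss1, 2), round(miss2, 2))  # about 0.08 and 0.30
```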
6.4 Mapping and Priority Assignment Heuristic
In this section, we propose a design space exploration strategy
that maps tasks to processors and assigns priorities to tasks in
order to minimise the cost function defined in Eq.(6.1). The exploration strategy is based on the Tabu Search (TS) heuristic
[Glo89].
6.4.1 The Tabu Search Based Heuristic
Tabu Search is a heuristic introduced by Glover [Glo89]. We use
an extended variant, which is described in this section. The variant is not specific to a particular problem. After explaining the
heuristic in general, we will become more specific at the end of
the section where we illustrate the heuristic in the context of
task mapping and priority assignment.
Typically, optimisation problems are formulated as follows:
Find a configuration, i.e. an assignment of values to parameters
that characterise a system, such that the configuration satisfies
a possibly empty set of imposed constraints and the value of a
cost function is minimal for that configuration.
We define the design space S as a set of points (also called
solutions), where each point represents a configuration that satisfies the imposed constraints. A move from one solution in the
design space to another solution is equivalent to assigning a new
value to one or more of the parameters that characterise the system. We say that we obtain solution s2 by applying the move
m on solution s1 , and we write s2 = m(s1 ). Solution s1 can be
obtained back from solution s2 by applying the negated move m̄, denoted m̄ (s1 = m̄(s2)).
(1) crt_sol = init_sol
(2) global_best_sol = crt_sol
(3) global_best_cost = cost(crt_sol)
(4) TM = ∅
(5) since_last_improvement = 0
(6) iteration_count = 1
(7) CM = set_of_candidate_moves(crt_sol)
(8) (chosen_move, next_sol_cost) = choose_move(CM)
(9) while iteration_count < max_iterations do
(10)    while since_last_improvement < W do
(11)        next_sol = move(crt_sol, chosen_move)
(12)        TM = TM ∪ {chosen_move}
(13)        since_last_improvement++
(14)        iteration_count++
(15)        crt_sol = next_sol
(16)        if next_sol_cost < global_best_cost then
(17)            global_best_cost = next_sol_cost
(18)            global_best_sol = crt_sol
(19)            since_last_improvement = 0
(20)        end if
(21)        CM = set_of_candidate_moves(TM, crt_sol)
(22)        (chosen_move, next_sol_cost) = choose_move(CM)
(23)    end while
(24)    since_last_improvement = 0
(25)    (chosen_move, next_sol_cost) = diversify(TM, crt_sol)
(26)    iteration_count++
(27) end while
(28) return global_best_sol

Figure 6.4: Design space exploration algorithm
Solution s′ is a neighbour of solution s if there exists a move m such that solution s′ can be obtained from solution s by applying move m. All neighbours of a solution s form the neighbourhood V(s) of that solution (V(s) = {q : ∃m such that q = m(s)}).
The exploration algorithm is shown in Figure 6.4. The exploration starts from an initial solution, labelled also as the current solution (line 1), which is considered the globally best solution so far (line 2). The cost function is evaluated for the current solution (line 3). We keep track of a list TM of moves that are marked as tabu. Initially the list is empty (line 4).
We construct CM, a subset of the set of all moves that are possible from the current solution point (line 7). Let N(CM) be the set of solutions that can be reached from the current solution by means of a move in CM (note that N(CM) = V(crt_sol) if CM is the set of all possible moves from crt_sol). The cost function is evaluated for each solution in N(CM). A move m ∈ CM is selected (line 8) if
• m is non-tabu and leads to the solution with the lowest cost among the solutions in N(CM \ TM) (m ∉ TM ∧ cost(m(crt_sol)) ≤ cost(q), ∀q ∈ N(CM \ TM)), or
• it is tabu but improves on the globally best solution so far (m ∈ TM ∧ cost(m(crt_sol)) ≤ global_best_cost), or
• all moves in CM are tabu and m leads to the solution with the lowest cost among those solutions in N(CM) (∀mv ∈ CM, mv ∈ TM ∧ cost(m(crt_sol)) ≤ cost(mv(crt_sol))).
The new solution is obtained by applying the chosen move m on
the current solution (line 11). The reverse of move m is marked
as tabu such that m will not be reversed in the next few iterations (line 12). The new solution becomes the current solution
(line 15). If its cost improves on the global best (line 16), the new solution also becomes the globally best solution reached so far (lines 17–18). However, it should be noted that the new solution could have a larger
cost than the current solution. This could happen if there are
no moves that would improve on the current solution or all such
moves would be tabu. The list T M of tabu moves ensures that
the heuristic does not get stuck in local minima. The procedure
of building the set of candidate moves and then choosing one
according to the criteria listed above is repeated. If no global
improvement has been noted for the past W iterations, the loop
(lines 10–23) is interrupted (line 10). In this case, a diversification phase follows (line 25) in which a rarely used move is performed in order to force the heuristic to explore different regions
in the design space. The whole procedure is repeated until the heuristic has iterated for a specified maximum number of iterations (line 9). The procedure returns the solution characterised by the
lowest cost function value that it found during the design space
exploration (line 28).
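The loop of Figure 6.4 can be condensed into the following generic sketch. The problem-specific callables (cost, candidate moves, diversification) are illustrative parameters, and bookkeeping details such as tabu tenure are simplified:

```python
import random

def tabu_search(init_sol, cost, candidate_moves, apply_move, negate,
                diversify, max_iterations=200, W=10):
    crt_sol = init_sol
    best_sol, best_cost = crt_sol, cost(crt_sol)
    tabu = set()                              # TM: tabu moves
    iteration = 0
    while iteration < max_iterations:
        since_improvement = 0
        while since_improvement < W and iteration < max_iterations:
            cm = candidate_moves(tabu, crt_sol)
            scored = sorted(cm, key=lambda m: cost(apply_move(crt_sol, m)))
            # best non-tabu move, or a tabu move that beats the global
            # best (aspiration), or the cheapest move if all are tabu
            chosen = scored[0]
            for m in scored:
                if m not in tabu or cost(apply_move(crt_sol, m)) < best_cost:
                    chosen = m
                    break
            tabu.add(negate(chosen))          # forbid undoing the move
            crt_sol = apply_move(crt_sol, chosen)
            iteration += 1
            since_improvement += 1
            c = cost(crt_sol)
            if c < best_cost:                 # new global best
                best_sol, best_cost = crt_sol, c
                since_improvement = 0
        crt_sol = diversify(tabu, crt_sol)    # jump to a new region
        iteration += 1
    return best_sol

# Toy usage: minimise (x - 7)^2 over the integers, moves are +1/-1.
random.seed(1)
best = tabu_search(init_sol=0,
                   cost=lambda x: (x - 7) ** 2,
                   candidate_moves=lambda tabu, x: [+1, -1],
                   apply_move=lambda x, m: x + m,
                   negate=lambda m: -m,
                   diversify=lambda tabu, x: x + random.choice([-3, 3]))
print(best)  # 7
```

Note that, as in the text, a move worsening the current solution is still taken when nothing better is available; only the globally best solution is remembered.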
Two issues are of utmost importance when tailoring the general tabu search based heuristic described above for particular
problems.
First, there is the definition of what is a legal move. On one
hand, the transformation of a solution must result in another solution, i.e. the resulting parameter assignment must satisfy the
set of constraints. On the other hand, because of complexity reasons, certain restrictions must be imposed on what constitutes
a legal move. For example, if any transformation were a legal
move, the neighbourhood of a solution would comprise the entire
solution space. In this case, it is sufficient to run the heuristic for
just one iteration (max iterations = 1) but that iteration would
require an unreasonably long time, as the whole solution space
would be probed. Conversely, if moves were too restricted, a solution could be reached from another solution only after applying a long sequence of moves, which makes far-away solutions unlikely to be reached. In this case, the heuristic would be inefficient, as it would circle in the same region of the solution space until a diversification step forced it out.
The second issue is the construction of the subset of candidate
moves. One solution would be to include all possible moves from
the current solution in the set of candidate moves. In this case,
the cost function, which sometimes can be computationally expensive, has to be calculated for all neighbours. Thus, we would
run the risk of rendering the exploration slow. If we could quickly assess which moves are promising, we could
include only those in the subset of candidate moves.
For our particular problem, namely the task mapping and
priority assignment, each task is characterised by two attributes: its mapping and its priority. In this context, a move in
the design space is equivalent to changing one or both attributes
of one single task.
In the following section we discuss the issue of constructing
the subset of candidate moves.
6.4.2 Candidate Move Selection
The cost function is evaluated |CM | times at each iteration,
where |CM | is the cardinality of the set of candidate moves.
Let us consider that task τ , mapped on processor Pj , is moved
to processor Pi and there are qi tasks on processor Pi . Task τ
can take one of qi + 1 priorities on processor Pi . If task τ is not
moved to a different processor, but only its priority is changed
on processor Pj , then there are qj − 1 possible new priorities. If
we consider all processors, there are M − 2 + N possible moves
for each task τ , as shown in the equation
qj − 1 + Σ_{i=1, i≠j}^{M} (qi + 1) = M − 2 + Σ_{i=1}^{M} qi = M − 2 + N,   (6.2)
where N is the number of tasks and M is the number of processors. Hence, if all possible moves are candidate moves,
N · (M − 2 + N )
(6.3)
moves are possible at each iteration. Therefore, a key to the
efficiency of the algorithm is the intelligent selection of the set
CM of candidate moves. If CM contained only those moves that
had a high chance to drive the search towards good solutions,
then fewer points would be probed, leading to a speed up of the
algorithm.
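Equations (6.2) and (6.3) are easy to verify numerically; the task-per-processor counts below are illustrative:

```python
def moves_per_task(q, j):
    """Moves available to a task on processor j, where q[i] is the
    number of tasks on processor i (left-hand side of Eq.(6.2))."""
    M = len(q)
    return (q[j] - 1) + sum(q[i] + 1 for i in range(M) if i != j)

q = [4, 3, 2]                  # N = 9 tasks on M = 3 processors
N, M = sum(q), len(q)
for j in range(M):
    assert moves_per_task(q, j) == M - 2 + N   # 10, independent of j
print(N * (M - 2 + N))  # Eq.(6.3): 90 candidate moves per iteration
```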
In our approach, the set CM of candidate moves is composed
of all moves that operate on a subset of tasks. Tasks are assigned
scores and the chosen subset of tasks is composed of the first K
tasks with respect to their score. Thus, if we included all possible
moves that modify the mapping and/or priority assignment of
only the K highest ranked tasks, we would reduce the number
of cost function evaluations N/K times.
We illustrate the way the scores are assigned to tasks based
on the example in Figure 6.1(a). As a first step, we identify the
critical paths and the non-critical paths of the application. In
general, we consider a path to be an ordered sequence of tasks
(τ1 , τ2 , . . . , τn ) such that τi+1 is data dependent on τi . The average
execution time of a path is given by the sum of the average execution times of the tasks belonging to the path. A path is critical
if its average execution time is the largest among the paths belonging to the same task graph. For the example in Figure 6.1(a),
the critical path is A → B → D, with an average execution time
of 1 + 6 + 8 = 15. In general, non-critical paths are those paths
starting with a root node or a task on a critical path, ending
with a leaf node or a task on a critical path and containing only
tasks that do not belong to any critical path. For the example in
Figure 6.1(a), non-critical paths are A → C and B → E.
For each critical or non-critical path, a path mapping vector
is computed. The mapping vector is a P -dimensional integer vector, where P is the number of processors. The modulus of its projection along dimension pi is equal to the number of tasks that
are mapped on processor pi and that belong to the considered
path. For the example in Figure 6.1(a), the vectors corresponding to the paths A → B → D, A → C and B → E are 3i + 0j,
1i + 1j, and 1i + 1j respectively, where i and j are the versors
along the two dimensions. Each task is characterised by its task
mapping vector, which has a modulus of 1 and is directed along
the dimension corresponding to the processor on which the task
is mapped. For example, the task mapping vectors of A, B, C, D,
and E are 1i, 1i, 1j, 1i, and 1j respectively.
Next, for each path and for each task belonging to that path, the angle between the path and the task mapping vectors is computed. For example, the task mapping vectors of tasks A, B, and D form an angle of 0° with the path mapping vector of the critical path A → B → D, and the task mapping vectors of tasks A and C form an angle of 45° with the path mapping vector of the non-critical path A → C. The score assigned to each task is a weighted sum of the angles between the task's mapping vector and the mapping vectors of the paths to which the task belongs. The weights are proportional to the relative criticality of the path. Intuitively, this approach attempts to map the tasks that belong to critical paths on the same processor. In order to avoid processor overload, the scores are penalised if the task would be moved to a highly loaded processor.
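The vectors and angles above can be reproduced directly (the score weighting and the load penalty are left out, since their exact form is not specified here):

```python
import math

def angle_deg(u, v):
    """Angle between two 2-D mapping vectors, in degrees."""
    dot = u[0] * v[0] + u[1] * v[1]
    return math.degrees(math.acos(dot / (math.hypot(*u) * math.hypot(*v))))

# task mapping vectors for Figure 6.1(a): (1,0) = first processor,
# (0,1) = second processor
task_vec = {'A': (1, 0), 'B': (1, 0), 'C': (0, 1), 'D': (1, 0), 'E': (0, 1)}
# path mapping vectors: per-processor counts of the path's tasks
path_vec = {('A', 'B', 'D'): (3, 0),   # critical path
            ('A', 'C'): (1, 1),        # non-critical paths
            ('B', 'E'): (1, 1)}

print(angle_deg(task_vec['A'], path_vec[('A', 'B', 'D')]))    # 0.0
print(round(angle_deg(task_vec['C'], path_vec[('A', 'C')])))  # 45
```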
Once scores have been assigned to tasks, the first K = N/c
tasks are selected according to their scores. In our experiments,
we use c = 2. In order to further reduce the search neighbourhood, not all possible moves that change the task mapping
and/or priority assignment of one task are chosen. Only 2 processors are considered as target processors for each task. The selection of those two processors is made based on scores assigned to
processors. These scores are a weighted sum of potential reduction of interprocessor communication and processor load. The
processor load is weighted with a negative weight, in order to
penalise overload. For example, if we moved task C from the
shaded processor to the white processor, we would reduce the interprocessor communication by 100%. However, as the white processor has to cope with an average workload of 15 units (the sum of the average execution times of tasks A, B, and D), the 100% reduction would be penalised with an amount proportional to 15.
On average, there will be N/M tasks on each processor.
Hence, if a task is moved to a different processor, it may take
N/M + 1 possible priorities on its new processor. By considering
only N/2 tasks and only 2 processors for each task, we restrict
the neighbourhood to
N/2 · 2 · (1 + N/M) = N · (1 + N/M)   (6.4)
candidate moves on average, i.e. the number of candidate moves is reduced approximately
N · (M − 2 + N)/(N · (1 + N/M)) ≈ M   (6.5)
times.
We will denote this method as the restricted neighbourhood
search. In Section 6.6 we will compare the restricted neighbourhood search with an exploration of the complete neighbourhood.
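The reduction factor of Eqs.(6.4)–(6.5) can be checked for illustrative values of N and M:

```python
def full_neighbourhood(N, M):
    return N * (M - 2 + N)                 # Eq.(6.3)

def restricted_neighbourhood(N, M):
    # N/2 tasks, 2 target processors each, about N/M + 1 priorities
    return (N // 2) * 2 * (1 + N // M)     # Eq.(6.4)

N, M = 100, 10
speedup = full_neighbourhood(N, M) / restricted_neighbourhood(N, M)
print(speedup)  # ≈ M (about 9.8 here), as predicted by Eq.(6.5)
```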
6.5 Analysis
This section presents our approximate analysis algorithm. The first part of the section discusses the derivation of the algorithm itself, while the second part presents some considerations on the approximations that were made and their impact on accuracy.
6.5.1 Analysis Algorithm
The cost function that is driving the design space exploration is Σdev, where dev is the miss deviation, as defined in Eq.(6.1).
The miss deviation for each task is obtained as the result of a
performance analysis of the system.
In the previous chapter, we presented a performance analysis method for multiprocessor applications with stochastic task
execution times. The method is based on the Markovian analysis of the underlying stochastic process. As the latter captures
all possible behaviours of the system, the method gives great insight regarding the system’s internals and bottlenecks. However,
its relatively large analysis time makes its use inappropriate inside an optimisation loop. Therefore, we propose an approximate
analysis method of polynomial complexity. The main challenge
is in finding those dependencies among random variables that
are weak and can be neglected such that the analysis becomes of
polynomial complexity and the introduced inaccuracy is within
reasonable bounds.
Before proceeding with the exposition of the approximate
analysis approach, we introduce the notation that we use in the
sequel.
The finishing time of the j-th job of task τ is the time moment
when (τ, j) finishes its execution. We denote it with Fτ,j . The
deadline miss ratio of a job is the probability that its finishing
time exceeds its deadline:
mτ,j = 1 − P(Fτ,j ≤ δτ,j )
(6.6)
The ready time of (τ, j) is the time moment when (τ, j) is ready
to execute, i.e. the maximum of the finishing times of jobs in
its predecessor set. We denote the ready time with Aτ,j and we
write
Aτ,j = max_{σ∈◦τ} Fσ,j   (6.7)
The starting time of (τ, j) is the time moment when (τ, j) starts
executing. We denote it with Sτ,j . Obviously, the relation
Fτ,j = Sτ,j + Exτ,j
(6.8)
holds between the starting and finishing times of (τ, j), where
Exτ,j denotes the execution time of (τ, j). The ready time and
starting times of a job may differ because the processor might be
busy at the time the job becomes ready for execution. The ready,
starting and finishing times are all random variables.
Let Lτ,j(t) be a function that takes value 1 if (τ, j) is running at time moment t and 0 otherwise. In other words, if Lτ,j(t) = 1, processing element Map(τ) is busy executing job j of task τ at time t. If (τ, j) starts executing at time t, Lτ,j(t) is considered to be 1. If (τ, j) finishes its execution at time t′, Lτ,j(t′) is considered to be 0. For simplicity, in the sequel, we will write Lτ,j(t) when we mean Lτ,j(t) = 1. Also, Lσ(t) is a shorthand notation for Σ_{j∈N} Lσ,j(t).
Let Iτ,j(t) be a function that takes value 1 if
• All tasks in the ready-to-run queue of the scheduler on processor Map(τ) at time t have a lower priority than task τ, and
• Σ_{σ∈Tτ\{τ}} Lσ(t) = 0,
and it takes value 0 otherwise, where Tτ = T_{Map(τ)} is the set of tasks mapped on the same processor as task τ. Intuitively, Iτ,j(t) = 1 implies that (τ, j) could start running on processing element Map(τ) at time t if (τ, j) becomes ready at or prior to
time t. Let Iτ,j(t, t′) be a shorthand notation for ∃ξ ∈ (t, t′] : Iτ,j(ξ) = 1, i.e. there exists a time moment ξ in the right semi-closed interval (t, t′] such that (τ, j) could start executing at ξ if it became ready at or prior to ξ.
In order to compute the deadline miss ratio of (τ, j) (Eq.(6.6)),
we need to compute the probability distribution of the finishing
time Fτ,j . This in turn can be precisely determined (Eq.(6.8))
from the probability distribution of the execution time Exτ,j ,
which is an input data, and the probability distribution of the
starting time of (τ, j), Sτ,j . Therefore, in the sequel, we focus on
determining P(Sτ,j ≤ t).
We start by observing that Iτ,j (t, t+h) is a necessary condition
for t < Sτ,j ≤ t + h. Thus,
P(t < Sτ,j ≤ t + h) = P(t < Sτ,j ≤ t + h ∩ Iτ,j (t, t + h)).
(6.9)
We can write
P(t < Sτ,j ≤ t + h ∩ Iτ,j (t, t + h)) =
= P(t < Sτ,j ∩ Iτ,j (t, t + h))−
(6.10)
− P(t + h < Sτ,j ∩ Iτ,j (t, t + h)).
Furthermore, we observe that the event
t + h < Sτ,j ∩ Iτ,j (t, t + h)
is equivalent to
(t + h < Aτ,j ∩ Iτ,j (t, t + h))∪
∪(sup{ξ ∈ (t, t + h] :Iτ,j (ξ) = 1} < Aτ,j ≤ t + h ∩ Iτ,j (t, t + h)).
In other words, (τ, j) starts executing after t + h when the processor was available sometime in the interval (t, t + h] if and
only if (τ, j) became ready to execute after the latest time in
(t, t + h] at which the processor was available. Thus, we can
rewrite Eq.(6.10) as follows:
P(t < Sτ,j ≤ t + h ∩ Iτ,j (t, t + h)) =
= P(t < Sτ,j ∩ Iτ,j (t, t + h))−
− P(t + h < Aτ,j ∩ Iτ,j (t, t + h))−
− P(sup{ξ ∈ (t, t + h] : Iτ,j (ξ) = 1} < Aτ,j ≤ t + h∩
∩ Iτ,j (t, t + h)).
(6.11)
After some manipulations involving negations of the events in
the above equation, and by using Eq.(6.9), we obtain
P(t < Sτ,j ≤ t + h) = P(Aτ,j ≤ t + h ∩ Iτ,j (t, t + h))−
− P(Sτ,j ≤ t ∩ Iτ,j (t, t + h))−
− P(sup{ξ ∈ (t, t + h] : Iτ,j (ξ) = 1} < Aτ,j ≤ t + h∩
∩ Iτ,j (t, t + h)).   (6.12)
When h becomes very small, the last term of the right-hand side
of the above equation becomes negligible relative to the other
two terms. Hence, we write the final expression of the distribution of Sτ,j as follows:
P(t < Sτ,j ≤ t + h) ≈ P(Aτ,j ≤ t + h ∩ Iτ,j (t, t + h))−
− P(Sτ,j ≤ t ∩ Iτ,j (t, t + h)).   (6.13)
We observe from Eq.(6.13) that the part between t and t + h
of the probability distribution of Sτ,j can be calculated from the
probability distribution of Sτ,j for time values less than t. Thus,
we have a method for an iterative calculation of P(Sτ,j ≤ t), in
which we compute
P(kh < Sτ,j ≤ (k + 1)h), k ∈ N,
at iteration k+1 from values obtained during previous iterations.
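The iteration can be sketched with deliberately simple inputs: a single job, ready at time 0, on a processor that is free in each interval with a fixed probability. (In the real analysis P(Iτ,j(t, t+h)) comes from Eq.(6.16) and changes over time; the constant here is an assumption for illustration.)

```python
h = 0.1          # discretisation resolution
p_free = 0.4     # assumed constant P(I(t, t+h)) for this sketch
steps = 200

# P_S[k] approximates P(S <= k*h); the job is ready at time 0,
# so P(A <= t) = 1 for all t >= 0.
P_S = [0.0] * (steps + 1)
for k in range(steps):
    P_A = 1.0
    P_S[k + 1] = P_S[k] + (P_A - P_S[k]) * p_free   # Eq.(6.14)

print(P_S[1])                # 0.4: starts in the first interval w.p. p_free
print(round(P_S[steps], 6))  # approaches 1 geometrically
```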
A difficulty arises in the computation of the two joint distributions in the right-hand side of Eq.(6.13). The event that a job
starts or becomes ready prior to time t and that the processor
may start executing it in a vicinity of time t is a very complex
event. It depends on many aspects, such as the particular order of execution of tasks on different (often all) processors, and
on the execution time of different tasks, quite far from task τ in
terms of distance in the computation tree. Particularly the dependence on the execution order of tasks on different processors
makes the exact computation of
P(Aτ,j ≤ t + h ∩ Iτ,j (t, t + h))
and
P(Sτ,j ≤ t ∩ Iτ,j (t, t + h))
of exponential complexity. Nevertheless, exactly this multitude
of dependencies of events Iτ,j (t, t + h), Aτ,j ≤ t + h, and Sτ,j ≤ t
on various events makes the dependency weak among the aforementioned three events. Thus, we approximate the right-hand
side of Eq.(6.13) by considering the joint events as if they were
conjunctions of independent events. Hence, we approximate
P(t < Sτ,j ≤ t + h) as follows:
P(t < Sτ,j ≤ t + h) ≈ (P(Aτ,j ≤ t + h) − P(Sτ,j ≤ t)) · P(Iτ,j (t, t + h)).   (6.14)
The impact of the introduced approximation on the accuracy of
the analysis is discussed in Section 6.5.2 based on a non-trivial
example.
In order to fully determine the probability distribution of Sτ,j
(and implicitly of Fτ,j and the deadline miss ratio), we need the
probability distribution of Aτ,j and the probability P(Iτ,j (t, t +
h)). Based on Eq.(6.7), if the finishing times of all tasks in the
predecessor set of task τ were statistically independent, we could
write
P(Aτ ≤ t) = Π_{σ∈◦τ} P(Fσ ≤ t).   (6.15)
In the majority of cases the finishing times of all tasks in
the predecessor set of task τ are not statistically independent.
For example, if there exists a task α and two computation paths
α → σ1 and α → σ2 , where tasks σ1 and σ2 are predecessors of
task τ , then the finishing times Fσ1 and Fσ2 are not statistically
independent. The dependency becomes weaker the longer these
computation paths are. Also, the dependency is weakened by the
other factors that influence the finishing times of tasks σ, for example the execution times and execution order of the tasks on
processors Map(σ1) and Map(σ2). Even if no common ancestor
task exists among any of the predecessor tasks σ, the finishing
times of tasks σ may be dependent because they or some of their
predecessors are mapped on the same processor. However, these
kinds of dependencies are extremely weak, as shown by Kleinrock
[Kle64] for computer networks and by Li [LA97] for multiprocessor applications. Therefore, in practice, Eq.(6.15) is a good
approximation.
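Eq.(6.15) is easy to validate on a toy example with genuinely independent predecessors (the uniform finishing times below are assumptions):

```python
import random

def uniform_cdf(t, lo, hi):
    return min(1.0, max(0.0, (t - lo) / (hi - lo)))

# two independent predecessor finishing times: U(0,10) and U(2,8)
t = 6.0
analytic = uniform_cdf(t, 0, 10) * uniform_cdf(t, 2, 8)   # Eq.(6.15)

random.seed(3)
N = 200_000
hits = sum(max(random.uniform(0, 10), random.uniform(2, 8)) <= t
           for _ in range(N))
print(analytic, round(hits / N, 2))  # both ≈ 0.4
```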
Last, we determine the probability P(Iτ,j(t, t + h)), i.e. the probability that processor Map(τ) may start executing (τ, j) sometime in the interval (t, t + h]. This probability is given by
(1) Sort all tasks in topological order of the task graph and put the sorted tasks in sequence T
(2) For all (τ, j), such that τ has no predecessors, determine P(Aτ,j ≤ t)
(3) For all (τ, j), let P(Iτ,j(0, h)) = 1
(4) for t := 0 to LCM step h do
(5)     for each τ ∈ T do
(6)         compute P(Aτ ≤ t)             {Eq.(6.15)}
(7)         compute P(t < Sτ,j ≤ t + h)   {Eq.(6.14)}
(8)         compute P(Fτ,j ≤ t + h)       {Eq.(6.8)}
(9)         compute P(Lτ,j(t + h))        {Eq.(6.17)}
(10)        compute P(Iτ,j(t, t + h))     {Eq.(6.16)}
(11)    end for
(12) end for
(13) compute the deadline miss ratios      {Eq.(6.6)}

Figure 6.5: Approximate analysis algorithm
the probability that no task is executing at time t, i.e.
P(Iτ,j(t, t + h)) = 1 − Σ_{σ∈Tτ\{τ}} P(Lσ(t) = 1).   (6.16)
The probability that (τ, j) is running at time t is given by
P(Lτ,j(t) = 1) = P(Sτ,j ≤ t) − P(Fτ,j ≤ t).   (6.17)
The analysis algorithm is shown in Figure 6.5. The analysis
is performed over the interval [0, LCM ), where LCM is the least
common multiple of the task periods. The algorithm computes
the probability distributions of the random variables of interest
traversing the set of tasks in topological order. Thus, we make
sure that ready times propagate correctly from predecessor tasks
to successors.
Line 7 of the algorithm computes the probability that job (τi, j) starts its execution sometime in the interval (t, t + h], according to Eq.(6.14). The finishing time of the job may lie within one of the intervals (t + BCETi, t + h + BCETi], (t + BCETi + h, t + 2h + BCETi], . . . , (t + WCETi, t + h + WCETi], where BCETi and WCETi are the best-case and worst-case execution times of task τi respectively. There are ⌈(WCETi − BCETi)/h⌉ such intervals. Thus, the computation of the probability distribution of the finishing time of the task (line 8) takes ⌈|ETPDFi|/h⌉ steps, where |ETPDFi| = WCETi − BCETi.
Figure 6.6: Application example (task graphs Γ1 = {A, B, C, D} and Γ2 = {E})
Let |ETPDF| = max_{1≤i≤N} |ETPDFi|. Then the complexity of the algorithm is O(N · LCM/h · ⌈|ETPDF|/h⌉), where N is the number of processing and communication tasks. The choice of the discretisation resolution h is done empirically, such that we obtain a fast analysis with reasonable accuracy for the purpose of task mapping.
6.5.2 Approximations
We have made several approximations in the algorithm described in the previous section. These are:
1. The discretisation approximation used throughout the approach, i.e. the fact that the probability distributions of
interest are all computed at the discrete times {0, h, 2h, ..., ⌊LCM/h⌋ · h},

2. P(Aτ ≤ t) ≈ ∏_{σ ∈ ◦τ} P(Fσ ≤ t),

3. P(Aτ,j ≤ t + h ∩ Iτ,j(t, t + h)) ≈ P(Aτ,j ≤ t + h) · P(Iτ,j(t, t + h))
and P(Sτ,j ≤ t ∩ Iτ,j(t, t + h)) ≈ P(Sτ,j ≤ t) · P(Iτ,j(t, t + h)).
The first approximation is inevitable when dealing with continuous functions. Moreover, its accuracy may be controlled by
choosing different discretisation resolutions h.
The second approximation is typically accurate as the dependencies between the finishing times Fσ are very weak [Kle64],
and we will not focus on its effects in this discussion.
In order to discuss the last approximation, we will introduce
the following example. Let us consider the application depicted
in Figure 6.6. It consists of 5 tasks, grouped into two task graphs
Γ1 = {A, B, C, D} and Γ2 = {E}. Tasks A and B are mapped
on the first processor, while tasks C, D, and E are mapped on
the second processor. The two black dots on the arrows between
tasks A and C, and tasks B and D represent the inter-processor
communication tasks. Tasks C, D, and E have fixed execution
times of 4, 5, and 6 time units respectively. Tasks A and B have
execution times with exponential probability distributions, with
average rates of 1/7 and 1/2 respectively.² Each of the two inter-processor communications takes 0.5 time units. Task A arrives
at time moment 0, while task E arrives at time moment 11. Task
E is the highest priority task. The deadline of both task graphs
is 35.
Because of the data dependencies between the tasks, task D
is the last to run among the tasks of task graph Γ1. The probability that
processor two is executing task D at time t is analytically determined and plotted in Figure 6.7(a) as a function of t. On the same
figure, we plotted the approximation of the same probability as
obtained by our approximate analysis method. The probability
that task E is running at time t and its approximation are shown
in Figure 6.7(b).
In Figure 6.7(a), we observe that large approximation errors
occur at times around the earliest possible start time of task D,
i.e. around time 4.5.³ We can write
P(AD ≤ t + h ∩ ID(t, t + h)) = P(ID(t, t + h) | AD ≤ t + h) · P(AD ≤ t + h).
P(ID (t, t + h)|AD ≤ t + h) is interpreted as the probability that
task D may start to run in the interval (t, t + h] knowing that it
became ready to execute prior to time t + h. If t + h < 4.5 and we
take into consideration the fact that AD ≤ t + h, then we know
for sure that task C cannot yet have finished its execution of 4
time units (see footnote 3). Therefore,
P(ID (t, t + h)|AD ≤ t + h) = 0, t + h < 4.5.
However, in our analysis, we approximate P(ID(t, t + h) | AD ≤
t + h) with P(ID(t, t + h)), i.e. we do not take into account that
AD ≤ t + h. Not taking into account that task D became ready prior
to time t + h opens the possibility that task A has not yet finished
its execution at time t. In this case, task C has not yet become
ready, and the processor on which tasks C and D are mapped
² We chose exponential execution time probability distributions only for the
scope of this illustrative example. Thus, we are able to easily deduce the exact
distributions in order to compare them to the approximated ones. Note that our
approach is not restricted to exponential distributions and we use generalised
distributions throughout the experimental results.
³ The time when task D becomes ready is always after the time when task C
becomes ready. Task C is ready the earliest at time 0.5, because the communication A → C takes 0.5 time units. The execution of task C takes 4 time units.
Therefore, the processor is available to task D the earliest at time 4.5.
[Two plots of the probability that the task is running versus time, each showing the exact probability and its approximation: (a) approximation of the probability that task D is running; (b) approximation of the probability that task E is running.]
Figure 6.7: Approximation accuracy
[Plot of the processor load versus time for the approximate analysis (AA) and the performance analysis (PA).]
Figure 6.8: Approximation accuracy
could be idle. Thus,

P(ID(t, t + h)) ≠ 0,

because the processor might be free if task C has not yet started.
This illustrates the kind of approximation errors introduced by
P(AD ≤ t + h ∩ ID(t, t + h)) ≈ P(ID(t, t + h)) · P(AD ≤ t + h).
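The effect of replacing a joint probability by the product of marginals can be seen on a toy example that is not from the thesis: when the two events are strongly (here, perfectly negatively) correlated, the product badly misestimates the joint probability.

```python
# Toy illustration (not from the thesis): for correlated events, P(A)*P(I)
# can differ greatly from P(A and I). Sample space: 4 equally likely outcomes;
# A and I are chosen disjoint, i.e. perfectly negatively correlated.
outcomes = [0, 1, 2, 3]
A = {0, 1}      # e.g. "job became ready by t + h"
I = {2, 3}      # e.g. "processor idle in (t, t + h]" -- excludes A here
p_A = len(A) / len(outcomes)            # 0.5
p_I = len(I) / len(outcomes)            # 0.5
p_joint = len(A & I) / len(outcomes)    # 0.0: the events exclude each other
print(p_joint, p_A * p_I)  # 0.0 vs 0.25 -- the independence approximation errs
```

This is the same mechanism behind the error around t = 4.5 in Figure 6.7(a), where conditioning on AD ≤ t + h forces the idleness probability to zero while the unconditional approximation does not.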
However, what we are interested in is a high-quality approximation towards the tail of the distribution, because that is
typically where the deadline lies. As we can see from the plots, the two curves almost overlap for t > 27. Thus, the approximation of the deadline
miss ratio of task D is very good. The same conclusion can be drawn
from Figure 6.7(b). In this case too, we see a perfect match between the curves for time values close to the deadline.
Finally, we assessed the quality of our approximate analysis on larger examples. We compare the processor load curves
obtained by our approximate analysis (AA) with processor load
curves obtained by our high-complexity performance analysis
(PA) presented in the previous chapter. The benchmark application consists of 20 processing tasks mapped on 2 processors
and 3 communication tasks mapped on a bus connecting the two
Task | Average error | Standard deviation of errors
-----|---------------|-----------------------------
 19  | 0.056351194   | 0.040168796
 13  | 0.001688039   | 0.102346107
  5  | 0.029250265   | 0.178292338
  9  | 0.016695770   | 0.008793487

Table 6.1: Approximation accuracy
processors. Figure 6.8 gives a qualitative measure of the approximation. It depicts the two processor load curves for a task in the
benchmark application. One of the curves was obtained with PA
and the other with AA. A quantitative measure of the approximation is given in Table 6.1. We present only the extreme values
for the average errors and standard deviations. Thus, row 1 in
the table, corresponding to task 19, shows the largest obtained
average error, while row 2, corresponding to task 13, shows the
smallest obtained average error. Row 3, corresponding to task 5,
shows the worst obtained standard deviation, while row 4, corresponding to task 9, shows the smallest obtained standard deviation. The average of standard deviations of errors over all tasks
is around 0.065. Thus, we can say with 95% confidence that AA
approximates the processor load curves with an error of ±0.13.
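The ±0.13 bound follows from the usual two-sigma rule of thumb applied to the average standard deviation of about 0.065; a quick arithmetic check under that assumption:

```python
# Quick check of the +/-0.13 claim: with an average standard deviation of
# about 0.065, a ~95% confidence band is roughly two sigmas on each side.
sigma = 0.065
band = 2 * sigma          # two-sigma rule of thumb (1.96 for exactly 95%)
print(round(band, 2))     # 0.13
```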
6.6 Experimental Results
The proposed heuristic for task mapping and priority assignment has been experimentally evaluated on randomly generated
benchmarks and on a real-life example. This section presents
the experimental setup and comments on the obtained results.
The experiments were run on a desktop PC with an AMD Athlon
processor clocked at 1533MHz.
The benchmark set consisted of 396 applications. The applications contained t tasks, clustered in g task graphs and mapped
on p processors, where t ∈ {20, 22, . . . , 40}, g ∈ {3, 4, 5}, and
p ∈ {3, 4, . . . , 8}. For each combination of t, g, and p, two applications were randomly generated. Three mapping and priority
assignment methods were run on each application. All three implement a Tabu Search algorithm with the same tabu tenure,
termination criterion, and number of iterations after which a diversification phase occurs. In each iteration, the first method selects the next point in the design space while considering the entire neighbourhood of design space points. Therefore, we denote
it ENS, exhaustive neighbourhood search. The second method
considers only a restricted neighbourhood of design space points
when selecting the next design transformation. The restricted
neighbourhood is defined as explained in Section 6.4. We call
the second method RNS, restricted neighbourhood search. Both
ENS and RNS use the same cost function, defined in Eq.(6.1)
and calculated according to the approximate analysis described
in Section 6.5. The third method considers only fixed task execution times, equal to the average task execution times. It uses
an exhaustive neighbourhood search and minimises the value of
the cost function Σ_τ laxτ, where laxτ is defined as follows:

laxτ = ∞,         if Fτ > δτ ∧ τ is critical
laxτ = Fτ − δτ,   otherwise.    (6.18)
The third method is abbreviated LO-AET, laxity optimisation
based on average execution times. Once LO-AET has produced
a solution, the cost function defined in Eq.(6.1) is calculated and
reported for the produced mapping and priority assignment.
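The laxity cost of Eq. (6.18) can be sketched in a few lines; the task records below (finishing time, deadline, criticality flag) are illustrative, not the thesis data structures.

```python
# Sketch of the LO-AET cost (Eq. 6.18). Each task is an illustrative tuple
# (F, delta, critical): finishing time under average execution times,
# deadline, and criticality flag. The cost is the sum of per-task laxities.
import math

def lax(F, delta, critical):
    if F > delta and critical:
        return math.inf          # a critical task misses its deadline
    return F - delta             # negative laxity = slack, positive = lateness

def lo_aet_cost(tasks):
    return sum(lax(F, d, c) for F, d, c in tasks)

# Example: two tasks with slack, one non-critical late task
print(lo_aet_cost([(8, 10, True), (5, 9, False), (12, 10, False)]))  # -4
```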
6.6.1 RNS and ENS: Quality of Results
The first issue we look at is the quality of results obtained with
RNS compared to those produced by ENS. The deviation of the
cost function obtained from RNS relative to the cost function obtained by ENS is defined as

(costRNS − costENS) / costENS.    (6.19)
Figure 6.9 depicts the histogram of the deviation over the 396
benchmark applications. The relative deviation of the cost function appears on the x-axis. The value on the y-axis corresponding to a value x on the x-axis indicates the percentage of the 396
benchmarks that have a cost function deviation equal to x. On
average, RNS is only 1.65% worse than ENS. In 19% of the cases,
the obtained deviation was between 0 and 0.1%. Note that RNS
can obtain better results than ENS (negative deviation). This is
due to the intrinsically heuristic nature of Tabu Search.
[Histogram of the deviation of the cost function obtained from RNS relative to ENS [%]; average deviation = 1.6512%.]
Figure 6.9: Cost obtained by RNS vs. ENS
[Plot of the average time per iteration [sec/iteration] versus the number of tasks, for ENS, RNS, and LO-AET.]
Figure 6.10: Run times of RNS vs. ENS
6.6.2 RNS and ENS: Exploration Time
As a second issue, we compared the run times of RNS, ENS, and
LO-AET. Figure 6.10 shows the average times needed to perform one iteration in RNS, ENS, and LO-AET respectively. It
can be seen that RNS runs on average 5.16–5.6 times faster than
ENS. This corresponds to the theoretical prediction, made in
Section 6.4.2, stating that the neighbourhood size of RNS is M
times smaller than the one of ENS when c = 2. In our benchmark suite, M is between 3 and 8 averaging to 5.5. We also observe that the analysis time is close to quadratic in the number
of tasks, which again corresponds to the theoretical result that
the size of the search neighbourhood is quadratic in N , the number of tasks.
We finish the Tabu Search when 40 · N iterations have executed, where N is the number of tasks. In order to obtain the
execution times of the three algorithms, one needs to multiply
the numbers on the ordinate in Figure 6.10 by 40 · N. For example, for 40 tasks, RNS takes circa 26 minutes while ENS takes
roughly 2 hours and 12 minutes.
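A back-of-the-envelope check of these run times, assuming per-iteration times read off Figure 6.10 for 40 tasks (about 1 s for RNS and 5 s for ENS; illustrative values):

```python
# Back-of-the-envelope check of the quoted run times. The per-iteration
# times (~1 s for RNS, ~5 s for ENS at 40 tasks) are read off Figure 6.10
# and are illustrative.
N = 40
iterations = 40 * N                      # termination criterion: 40*N iterations
rns_minutes = 1.0 * iterations / 60      # ~27 min, matching the quoted ~26 min
ens_hours = 5.0 * iterations / 3600      # ~2.2 h, matching the quoted ~2h12'
print(round(rns_minutes), round(ens_hours, 1))
```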
6.6.3 RNS and LO-AET: Quality of Results and Exploration Time
The LO-AET method is marginally faster than RNS. However, as
shown in Figure 6.11, the value of the cost function obtained by
LO-AET is on average almost an order of magnitude worse (9.09
times) than the one obtained by RNS. This supports one of the
main messages of this chapter, namely that considering a fixed
execution time model for optimisation of systems is completely
unsuitable if deadline miss ratios are to be improved. Although
LO-AET is able to find a good implementation in terms of average execution times, it turns out that this implementation is
very poor from the point of view of deadline miss ratios. What
is needed is a heuristic like RNS, which is explicitly driven by
deadline miss ratios during design space exploration.
6.6.4 Real-Life Example: GSM Voice Decoding
Last, we considered an industrial-scale real-life example from
the telecommunication area, namely a smart GSM cellular phone [Sch03], containing voice encoder and decoder, an MP3
decoder, as well as a JPEG encoder and decoder.

[Histogram of the deviation of the cost function obtained from LO-AET relative to RNS [%]; average deviation = 909.1818%.]
Figure 6.11: Cost obtained by LO-AET vs. RNS
decoder, as well as a JPEG encoder and decoder.
In GSM a second of human voice is sampled at 8kHz, and
each sample is encoded on 13 bits. The resulting stream of 13000
bytes per second is then encoded using so-called regular pulse
excitation long-term predictive transcoder (GSM 06.10 specification [ETS]). The encoded stream has a rate of 13000 bits per
second, i.e. a frame of 260 bits arrives every 20ms. Such a frame
is decoded by the application shown in Figure 6.12. It consists of
one task graph of 34 tasks mapped on two processors. The task
partitioning and profiling was done by M. Schmitz [Sch03]. The
period of every task is equal to the frame period, namely 20ms.
The tasks process an input block of 260 bits. The layout of a
260 bit frame is shown on the top of Figure 6.12, where also the
correspondence between the various fields in the frame and the
tasks processing them is depicted.
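The frame arithmetic above can be spelled out explicitly:

```python
# The GSM 06.10 frame arithmetic from the text, spelled out.
# Raw speech: 8 kHz sampling, 13 bits per sample.
raw_bps = 8000 * 13                     # 104000 bit/s = 13000 bytes per second
# Encoded stream: one 260-bit frame every 20 ms.
frame_bits, frame_period = 260, 0.020
encoded_bps = frame_bits / frame_period # 13000 bit/s
print(raw_bps // 8, int(encoded_bps))   # 13000 bytes/s raw, 13000 bit/s encoded
```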
For all tasks, the deadline is equal to the period. No tasks
are critical in this application but the deadline miss threshold of
every task is 0. Hence, the value of the cost function defined in
Eq.(6.1) is equal to the sum of the deadline miss ratios of all 34
tasks and the deadline miss ratio of the entire application.
[Task graph with period φ = 0.02 s, comprising buffer read, APCM inverse quantization, RPE grid positioning, GSM long term synthesis filtering, decoding of coded LAR, LARp to rp, short term synthesis filtering, postprocessing, and buffer write tasks. The 260-bit frame comprises the fields LARcr[1..8] (36 bit), Ncr (delay, 4×7 = 28 bit), bcr (gain, 4×2 = 8 bit), Mcr (grid, 4×2 = 8 bit), xmaxcr (max value, 4×6 = 24 bit), and xMcr (4×13×3 = 156 bit).]
Figure 6.12: Task graph modelling GSM voice decoding. From M. Schmitz's [Sch03] PhD thesis.
The restricted neighbourhood search found a task mapping
and priority assignment of cost 0.0255 after probing 729,662 potential solutions in 1h31' on an AMD Athlon clocked at 1533MHz.
This means that the deadline miss ratio of the voice decoding application, if the tasks are mapped and their priority is assigned
as found by the RNS, is less than 2.55%. This result is about 16
times better than the cost of an initial random solution.
Part III

Communication Synthesis for Networks-on-Chip
Chapter 7

Motivation and Related Work
Transient failures of on-chip network links are a source of
stochastic behaviour of applications implemented on networks-on-chip. In this chapter, we introduce the motivation of our work
in the area of providing reliable and low-energy communication
under timeliness constraints. Next, in Section 7.2, we survey the
related work in the area and underline our contributions.
7.1 Motivation
Shrinking feature sizes make possible the integration of millions
and soon billions of transistors on multi-core chips. At this integration level, effects such as capacitive cross-talk, power supply
noise, and neutron and alpha radiation [SN96, AKK+ 00] lead to
non-negligible rates of transient failures of interconnects and/or
devices, jeopardising the correctness of applications.
For example, new technologies such as Extreme Ultraviolet
Lithography promise to deliver feature sizes of 20nm [Die00].
This allows for single-chip implementations of extremely complex, computation-intensive applications, such as advanced signal processing in, for example, the military or medical domains,
high-quality multimedia processing, high-throughput network
routing, and high-traffic web services.
However, these technological capabilities do not come without unprecedented challenges to the design community. These
challenges include increased design and verification complexity,
high power density, and an increased rate of transient faults
of the components and/or communication links.
Several authors [BD02, DRGR03, KJS+ 02] have proposed
network-on-chip (NoC) architectures as replacements for bus-based designs in order to improve scalability, reduce design,
verification and test complexity, and to ease the power management problem.
With shrinking feature size, the on-chip interconnects have
become a performance bottleneck [Dal99]. Thus, a first concern,
which we address in this part of the thesis, is application latency.
The energy consumption of wires has been reported to account for about 40% of the total energy consumed by the chip
[Liu94]. Moreover, another significant source of energy consumption is the buffers distributed within the on-chip communication infrastructure. This is a strong incentive to reduce
the communication energy by means of efficient utilisation of the on-chip communication channels and by means of
reducing the buffering demand of applications. Thus, a second
concern, which we address in Chapters 9 and 10, is communication energy and buffer space demand minimisation.
A third problem arising from shrinking feature size is the increasing rate of transient failures of the communication lines.
The reliability of network nodes is guaranteed by specific methods, which are outside the scope of this work. In general, 100%
reliable communication cannot be achieved in the presence of
transient failures, except under assumptions such as no multiple
simultaneous faults or at most n bit flips, which are unrealistic
in the context of complex NoC. Hence, we are forced to tolerate
occasional errors, provided that they occur with a rate below an
imposed threshold. Thus, a third concern, addressed in Chapter 9, is to ensure an imposed communication reliability degree
under constraints on application latency, while keeping energy
consumption as low as possible.
We address the three identified problems, namely energy reduction and satisfaction of timeliness and communication reliability constraints, by means of communication synthesis. In this
context, synthesizing the communication means mapping data
packets to network links and determining the time moments
when the packets are released on the links. The selection of
message routes has a significant impact on the responsiveness
of applications implemented on the NoC. The communication reliability is ensured by deploying a combination of spatially and
temporally redundant communication. This however renders
the communication mapping problem particularly difficult.
The next section surveys related work and contrasts it
with ours. Chapter 9 presents our approach to communication mapping for energy-efficient reliable communication with
predictable latency. Chapter 10 presents a communication
synthesis approach for the minimisation of the buffer space
demands of applications.
7.2 Related Work
Communication synthesis greatly affects performance and energy consumption. Closest to our approach, which maps data
packets to network links in an off-line manner, is deterministic
routing [HM05, MD04]. Among its advantages, it may
guarantee deadlock-free communication, and the communication
latency and energy consumption are easier to predict. Nevertheless, deterministic routing can be efficiently applied only if
traffic patterns are known in more detail at design time. Under
the assumptions that we make in this thesis, the communication mapping (and the deterministic routing that results from it)
is complicated by the fact that we deploy redundant communication.
Wormhole routing [BIGA04] is a popular switching technique
among NoC designs. However, an analysis that would provide
bounds on its latency and/or energy consumption has yet to be
devised. Therefore, throughout this part of the thesis, we will
assume virtual cut-through switching [KK79], whose analysis
we present in Chapter 9.
As opposed to deterministic routing, Dumitraş and Mărculescu [DM03] have proposed stochastic communication as a way to
deal with permanent and transient faults of network links and
nodes. Their method has the advantages of simplicity, low implementation overhead, and high robustness w.r.t. faults. However,
their method suffers from the disadvantages of non-deterministic
routing. Thus, the selection of links and of the number of redundant copies to be sent on the links is done stochastically at
runtime by the network routers. Therefore, the transmission
latency is unpredictable and, hence, it cannot be guaranteed.
More importantly, stochastic communication is very wasteful in
terms of energy [Man04].
Pirretti et al. [PLB+ 04] report significant energy savings relative to Dumitraş’ and Mărculescu’s approach, while still keeping the low implementation overhead of non-deterministic routing. An incoming packet is forwarded to exactly one outgoing
link. This link is randomly chosen according to pre-assigned
probabilities that depend on the message source and destination. However, due to the stochastic character of transmission
paths and link congestion, neither Dumitraş and Mărculescu,
nor Pirretti et al. can provide guarantees on the transmission
latency.
As opposed to Dumitraş and Mărculescu and Pirretti et al.,
who address the problem of reliable communication at system level, Bertozzi et al. [BBD02] address the problem at on-chip
bus level. Bertozzi's approach is based on low-swing signals carrying data encoded with error-resilient codes. They analyse the
trade-off between consumed energy, transmission latency and error codes, while considering the energy and the chip area of the
encoders/decoders. While Bertozzi et al. address the problem
at link level, in this chapter we address the problem at application level, considering time-constrained multi-hop transmission
of messages sharing the links of an NoC.
Several researchers addressed the problem of dimensioning
of the buffers of the on-chip communication infrastructure. Saastamoinen et al. [SAN03] study the properties of on-chip buffers,
report gate-area estimates and analyse the buffer utilisation.
Chandra et al. [CXSP04] analyse the effect of increasing buffer
size on interconnect throughput. However, they use a single-source, single-sink scenario.
An approach for buffer allocation on NoC is given by Hu and
Mărculescu [HM04a]. They consider a design scenario in which
an NoC is custom designed for a particular application. Hu and
Mărculescu propose a method to distribute a given buffer space
budget over the network switches. The algorithm is based on a
buffer space demand analysis that relies on given Poisson traffic
patterns of the application. Therefore, their approach cannot
provide application latency guarantees.
7.3 Highlights of Our Approach
In Chapter 9, we address all of the three stringent problems
identified in Section 7.1: link reliability, latency, and energy
consumption. We propose a solution for the following problem:
Given an NoC architecture with a failure probability for its network links and given an application with required message arrival probabilities and imposed deadlines, find a mapping of messages to network links such that the imposed message arrival
probability and deadline constraints are satisfied at reduced energy costs.
Our approach differs from the approaches of Dumitraş and
Mărculescu [DM03] and of Pirretti et al. [PLB+ 04] in the sense
that we deterministically select at design time the links to be
used by each message and the number of copies to be sent on
each link. Thus, we are able to guarantee not only message arrival probabilities, but also worst-case message arrival times. In
order to cope with the unreliability of on-chip network links, we
propose a way to combine spatially and temporally redundant
message transmission. Our approach to communication energy
reduction is to minimise the application latency at almost no
energy overhead by intelligently mapping the redundant message copies to network links. The resulting time slack can be
exploited for energy minimisation by means of voltage reduction
on network nodes and links.
While the work presented in Chapter 9 tackles the problem of
energy-efficient communication, it leaves a potential for further
energy and cost savings unexploited. A key factor to the energy
and cost-efficiency of applications implemented on NoC is the
synthesis of the communication such that buffer needs are kept
low.
A poor synthesis of the communication may lead to a high
degree of destination contention at ingress buffers of network
switches. Undesirable consequences of this contention include
long latency and an increased energy consumption due to repeated reads from the buffers [YBD02]. Moreover, a high degree of destination contention runs the risk of buffer overflow
and consequently packet drop with significant impact on the
throughput [KJS+ 02]. Even in the presence of a back pressure
mechanism, which would prevent packet drops, the communication latency would be severely affected by the packet contention
[HM04a]. Thus, in Chapter 10, we concentrate on the buffer
space aware communication mapping and packet release timing
for applications implemented on NoC.
We focus on two design scenarios, namely the custom design
of application-specific NoCs and the implementation of applications on general-purpose NoCs. In the former, the size and distribution of communication buffers can be tailored to precisely
fit the application demands. Thus, synthesizing the communication in an intelligent manner could significantly reduce the total
need of buffering. In this scenario, the optimisation objective for
the communication synthesis approach that we propose is the
minimisation of the overall communication buffer space.
In the second design scenario, we assume that an application
has to be implemented on a given NoC, with fixed capacity for
each buffer. Thus, the challenge consists in mapping the data
packets such that no buffer overflow occurs. In both scenarios,
it has to be guaranteed that the worst-case task response times
are less than the given deadlines, and that the message arrival
probability is equal or above an imposed threshold.
Our approach relies on an analysis of both timing behaviour
and communication buffer space demand at each buffer in the
worst case. Thus, in both design scenarios, if a solution to the
communication synthesis problem is found, we are able to guarantee worst-case timing behaviour and worst-case buffer space
demand, which means that no buffer overflows/packet drops occur.
Our approach differs in several aspects from the approach of
Hu and Mărculescu [HM04a]. First, in addition to buffer allocation, we perform off-line packet routing under timeliness and
buffer capacity constraints. Second, we are able to guarantee the
application latency and that no packets are dropped due to buffer
overflows at the switches. Third, we propose a complementary
technique that can be independently deployed for the minimisation of the buffer space demand. This technique consists of
delaying the release of packets in order to minimise destination
contention at the buffers. The method is sometimes referred to
as traffic shaping [RE02].
The next chapter introduces the system model we use
throughout this part of the thesis, while Chapter 9 presents
our approach to communication mapping for low energy and
Chapter 10 describes our approach to buffer space demand
minimisation.
Chapter 8
System Modelling
This chapter presents the system model used throughout this
part of the thesis.
8.1 Hardware Model
We describe the system model and introduce the notations based
on the example in Figure 8.1. The hardware platform consists of
a 2D array of W × H cores, depicted as squares in the figure,
where W and H denote the number of columns and rows of the
array respectively. The cores are denoted with Px,y , where x is
the 0-based column index and y is the 0-based row index of the
core in the array.

[2D mesh of W × H cores Px,y, each connected to a switch Sx,y; the switches are interconnected by links Lx,y,d; tasks τ1–τ11 are mapped on the cores.]
Figure 8.1: Application example

The inter-core communication infrastructure
consists of a 2D mesh network. The small circles in Figure 8.1
depict the switches, denoted Sx,y , 0 ≤ x < W , 0 ≤ y < H. Core
Px,y is connected to switch Sx,y , ∀0 ≤ x < W , 0 ≤ y < H. The
thick lines connecting the switches denote the communication
links. Each switch, except those on the borders of the 2D mesh,
contains five input buffers: one for the link connecting the switch
to the core with the same index as the switch, and the rest corresponding to the links conveying traffic from the four neighbouring switches.
The link connecting switch Sx,y to switch Sx,y+1 is denoted
with Lx,y,N while the link connecting switch Sx,y+1 to switch
Sx,y is denoted with Lx,y+1,S . The link connecting switch Sx,y
to switch Sx+1,y is denoted with Lx,y,E while the link connecting
switch Sx+1,y to switch Sx,y is denoted with Lx+1,y,W . Each link
is characterised by the time and energy it needs to transmit a bit
of information.
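The link-naming convention above can be captured by a small helper (illustrative, not from the thesis) that returns the source and destination switches of a named link:

```python
# Sketch of the link-naming convention: L(x,y,d) is the directed link leaving
# switch S(x,y) towards direction d. Illustrative helper, not from the thesis.

def link_endpoints(x, y, d):
    """Endpoints of link L(x,y,d) in the 2D mesh, d in {'N','S','E','W'}."""
    dst = {'N': (x, y + 1), 'S': (x, y - 1), 'E': (x + 1, y), 'W': (x - 1, y)}[d]
    return (x, y), dst

# L(0,1,S) goes from S(0,1) to S(0,0); L(1,0,E) goes from S(1,0) to S(2,0)
print(link_endpoints(0, 1, 'S'))  # ((0, 1), (0, 0))
print(link_endpoints(1, 0, 'E'))  # ((1, 0), (2, 0))
```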
8.2 Application Model
The application model is similar to the one described in Section 3.2. It diverges from the application model introduced in
Section 3.2 in the following aspects:

• Tasks belonging to the same task graph have the same period (πa = πb if τa, τb ∈ Vi, where (Vi, Ei), Ei ⊆ Vi × Vi, is a task graph).

• The task execution time probability density functions are unknown. For each task τi ∈ T, we know only the upper and lower bounds on its execution time, WCETi and BCETi respectively.

• There are no limits on the maximum number of task graph instantiations that may be active in the system at the same time.

• Tasks are scheduled according to a fixed-priority preemptive scheduling policy.
8.3 Communication Model
Communication between pairs of tasks mapped on different
cores is performed by message passing. Their transmission on
network links is done packet-wise, i.e. the message is chopped
into packets, which are sent on links and reassembled at the
destination core. Messages are characterised by their priority,
length (number of bits), and the size of the packets they are
chopped into.
If an output link of a switch is busy sending a packet while
another packet arrives at the switch and demands forwarding
on the busy link, the newly arrived packet is stored in the input
buffer corresponding to the input link on which it arrived. When
the output link becomes available, the switch picks the highest
priority packet that demands forwarding on the output link. If
an output link of a switch is not busy while a packet arrives
at the switch and demands forwarding on the busy link, then
the packet is forwarded immediately, without buffering. This
scheme is called virtual cut-through routing [KK79].
Packet transmission on a link is modelled as a task, called
communication task. The worst-case execution time of a communication task is given by the packet length divided by the
link bandwidth. The execution of communication tasks is non-preemptible.
8.4 Fault Model
Communication links may temporarily malfunction, with a
given probability. If a data packet is sent on the link during the
time the link is in the failed state, the data is scrambled. We assume that the switches have the ability to detect if an incoming
packet is scrambled. Scrambled packets are dropped as soon as
they are detected and are not forwarded further. Several copies
of the same packet may be sent on the network links. In order
for a message to be successfully received, at the destination core,
at least one copy of every packet of the message has to reach the
destination core unscrambled. Otherwise, the message is said to
be lost.
We define the message arrival probability of the message τi → τj as the long-term ratio MAPi,j = lim_{t→∞} Si,j(t) / ⌈t/πi⌉, where Si,j(t) is the number of messages between tasks τi and τj that are successfully received at the destination in the time interval [0, t), and πi denotes the period of the sender task. For each pair of
communicating tasks τi → τj , the designer may require lower
bounds Bi,j on the ratio of messages that are received unscrambled at the destination.
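To make the limit definition concrete, the following small Monte Carlo sketch (our own illustration, not part of the thesis; the function name and the per-message success probability are assumptions) estimates the message arrival probability by sending one message per period of the sender task and counting successful receptions.

```python
import random

def estimate_map(p_success: float, num_periods: int, seed: int = 0) -> float:
    """Estimate MAP as the long-term ratio S(t) / ceil(t / pi): one message is
    sent per period of the sender task, and each message independently
    arrives unscrambled with probability p_success."""
    rng = random.Random(seed)
    received = sum(rng.random() < p_success for _ in range(num_periods))
    return received / num_periods

# With many periods, the ratio converges to the per-message success probability.
print(estimate_map(0.97, 100_000))
```

As the number of periods grows, the estimate converges to the probability that a single message is received unscrambled.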
Let us first assume that switches have the capability to detect erroneous (scrambled) packets, but not the capability to correct these errors. We let α denote the probability that a packet traverses a network link unscrambled. A strategy to satisfy the
constraints on the message arrival probability (M APi,j ≥ Bi,j )
is to make use of spatially and/or temporally redundant packet
transmission, i.e. several copies of the same packet are simultaneously transmitted on different paths and/or they are resent
several times on the same path. This strategy is discussed in
Chapter 9.
An alternative strategy to cope with transient faults on the
network links is to add redundant bits to the packets for error
correction. Additionally, extra circuitry has to be deployed at
the switches such that they can correct some of the erroneous
packets. In this case, let α denote the probability of a packet
successfully traversing a link, which means that
• the packet traverses the network link unscrambled, or
• the packet is scrambled during its transmission along the
link, but the error correction circuitry at the end of the link
is able to correct the scrambled bits.
Note that the two strategies are orthogonal, in the sense that
the redundant transmission can be deployed even when error
correction capability is present in the network. In this case, redundant transmission attempts to cope with the errors that are
detected but cannot be corrected by the correction circuitry. An
analysis of the trade-off between the two strategies is beyond the
scope of this thesis.
In the sequel, we will make use of α, the probability that a packet successfully traverses a link, abstracting away whether
the successful transmission is due to error correction or not.
8.5 Message Communication Support
In order to satisfy message arrival probabilities imposed by the
designer, temporally and/or spatially redundant communication
is deployed. We introduce the notion of communication supports
(CS) for defining the mapping of redundant messages to network
links. For this purpose, we use the example in Figure 8.1. A
Figure 8.2: Message mapping for the application in Figure 8.1
possible mapping of messages to network links is depicted in
Figure 8.2. The directed lines depicted parallel to a particular
link denote that the message represented by the directed line is
mapped on that link. Thus, message τ1 → τ2 is conveyed by link
L0,1,E , message τ1 → τ5 by link L0,1,S , message τ7 → τ8 by link
L2,1,S , message τ9 → τ10 by link L0,0,E , message τ5 → τ6 by links
L0,0,E and L1,0,E , message τ10 → τ11 by links L1,0,E and L2,0,E .
Of particular interest are messages τ3 → τ4 and τ2 → τ3 . Two
identical copies of the former are sent on the same link, namely
link L2,0,E , as indicated by the double arrow between task τ3 and
τ4 in the figure. Therefore, the transmission of message τ3 → τ4
is temporally redundant. A more complex case is exemplified
by message τ2 → τ3 . Identical copies of the message take different routes. Therefore, the transmission of message τ2 → τ3
is spatially redundant. One copy is conveyed by links L1,1,E
and L2,1,S , while the second copy is conveyed by links L1,1,S and
L1,0,E . Moreover, the copy travelling along the first route is in its
turn replicated once it reaches switch S2,1 and sent twice on link
L2,1,S , as shown by the double arrow in the figure.
In general, the mapping of the communication between two
tasks τi → τj can be formalised as a set of tuples CSi,j = {(L, n) :
L is a link, n ∈ N}, where n indicates the number of copies of
the same message that are conveyed by the corresponding link
L. We will call the set CSi,j the communication support (CS) of
τi → τ j .
Let M ⊂ T × T be the set of all pairs of communicating tasks that are mapped on different cores ((τa, τb) ∈ M iff ∃Γi = (Vi, Ei ⊂ Vi × Vi) such that (τa, τb) ∈ Ei and Map(τa) ≠ Map(τb)).
A communication mapping, denoted CM, is a function defined
on M that maps each pair of communicating tasks to one communication support.
In our example, the communication mapping is the following: τ1 → τ2 is mapped on CS1,2 = {(L0,1,E , 1)}, τ1 → τ5 on
CS1,5 = {(L0,1,S , 1)}, τ7 → τ8 on CS7,8 = {(L2,1,S , 1)}, τ9 → τ10
on CS9,10 = {(L0,0,E , 1)}, τ10 → τ11 on CS10,11 = {(L1,0,E , 1),
(L2,0,E , 1)}, τ5 → τ6 on CS5,6 = {(L0,0,E , 1), (L1,0,E , 1)}, τ3 → τ4
on CS3,4 = {(L2,0,E , 2)}, and τ2 → τ3 on CS2,3 = {(L1,1,E , 1),
(L2,1,S , 2), (L1,1,S , 1), (L1,0,E , 1)}.
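The communication supports and the communication mapping above can be captured with a simple data structure. The sketch below is our own encoding, not from the thesis: each CS is a mapping from link name to the number of copies conveyed on that link, and CM is keyed by the pair of communicating tasks.

```python
# The communication mapping CM of the example: each CS maps a link name to
# the number of copies of the message conveyed on that link.
CM = {
    (1, 2): {"L0,1,E": 1},
    (1, 5): {"L0,1,S": 1},
    (7, 8): {"L2,1,S": 1},
    (9, 10): {"L0,0,E": 1},
    (10, 11): {"L1,0,E": 1, "L2,0,E": 1},
    (5, 6): {"L0,0,E": 1, "L1,0,E": 1},
    (3, 4): {"L2,0,E": 2},  # temporally redundant: two copies on the same link
    (2, 3): {"L1,1,E": 1, "L2,1,S": 2, "L1,1,S": 1, "L1,0,E": 1},  # spatially redundant
}

def total_copies(cs: dict) -> int:
    """Total number of packet copies sent over all links of the CS."""
    return sum(cs.values())

print(total_copies(CM[(3, 4)]))  # 2
print(total_copies(CM[(2, 3)]))  # 5
```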
Two properties of a communication support are of interest:
• The arrival probability of a message that is mapped on that
communication support, called message arrival probability
of the CS, denoted MAP, and
• The expected energy consumed by the transmission of the
message on that support, called expected communication
energy of the CS, denoted ECE.
The values of MAP and ECE can be computed by means of simple
probability theory. We will illustrate their computation using
the CS supporting message τ2 → τ3 , CS2,3 . For simplicity, in this
example we assume that the message consists of a single packet
and the energy consumed by the transmission of the packet on
any link is 1.
The MAP of CS2,3 is given by P(V ∪ W), where V is the event that the copy sent along the path L1,1,S → L1,0,E successfully reaches core P2,0, and W is the event that the copy sent along the path L1,1,E → L2,1,S successfully reaches core P2,0. For the first path, P(V) = α². The probability that both temporally redundant copies that are sent on link L2,1,S get scrambled is (1 − α)². Thus, the probability that the packet successfully reaches core P2,0 if sent on path L1,1,E → L2,1,S is P(W) = α · (1 − (1 − α)²). Since the two paths share no links, the events V and W are independent, and the MAP of CS2,3 is

P(V ∪ W) = P(V) + P(W) − P(V)·P(W) = α² + α · (1 − (1 − α)²) − α³ · (1 − (1 − α)²).
The expected communication energy is the expected number
of sent bits multiplied by the average energy per bit. The energy per bit, denoted Ebit , can be computed as shown by Ye et al.
[YBD02].
The ECE of CS2,3 is proportional to
E[SentS1,1 ] + E[SentS2,1 ] + E[SentS1,0 ],
where SentS denotes the number of packets sent from switch S
and E[SentS ] denotes its expected value.
E[SentS ] = E[SentS |RS ] · P(RS ),
where RS is the event that at least one copy of the packet successfully reaches switch S, and E[SentS |RS ] is the number of
copies of the packet that are forwarded from switch S given that
at least one copy successfully reaches switch S. Hence,
ECE2,3 ∼ 2 + 2 · α + 1 · α = 2 + 3α.
The proportionality constant is Ebit · b, where Ebit is the energy
per bit and b is the number of bits of the packet.
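The closed-form expressions above are easy to check numerically. The following sketch is ours (the function names are illustrative); it evaluates the MAP of CS2,3 and its expected number of sent packets, the factor 2 + 3α that the ECE is proportional to.

```python
def map_cs23(alpha: float) -> float:
    """MAP of CS2,3: union of the two independent path-success events."""
    p_v = alpha ** 2                          # copy over L1,1,S -> L1,0,E
    p_w = alpha * (1 - (1 - alpha) ** 2)      # copy over L1,1,E, then 2 copies on L2,1,S
    # The two paths share no links, so V and W are independent.
    return p_v + p_w - p_v * p_w

def expected_packets_cs23(alpha: float) -> float:
    """Expected number of packets sent: 2 from S1,1 (always), 2 from S2,1
    (reached with probability alpha), 1 from S1,0 (reached with probability
    alpha): E = 2 + 2*alpha + alpha = 2 + 3*alpha."""
    return 2 + 3 * alpha

print(map_cs23(0.99))  # ≈ 0.9998
```

The ECE itself is this packet count multiplied by Ebit · b, as stated in the text.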
Chapter 9
Energy and Fault-Aware Time-Constrained Communication Synthesis for NoC
In this chapter we present an approach for determining the communication support for each message such that the communication energy is minimised, the deadlines are met and the message
arrival probability is higher than imposed lower bounds. The
approach is based on constructing a set of promising communication support candidates for each message. Then, the space of
communication support candidates is explored in order to find
a communication support for each message such that the task
response times are minimised. In the last step of our approach,
the resulting execution time slack can be exploited by means of
voltage and/or frequency scaling in order to reduce the communication energy.
9.1 Problem Formulation
This section gives the formulation of the problem that we solve
in this chapter.
9.1.1 Input
The input to the problem consists of:
• The hardware model, i.e. the size of the NoC, and, for each
link, the energy-per-bit, the bandwidth, and the probability
that a packet is successfully conveyed by the link;
• The application model, i.e. the set of task graphs Γ, the set
of task and task graph deadlines ∆T and ∆Γ respectively,
the mapping of tasks to cores, the set of task periods ΠT ,
the best-case and worst-case execution times of all tasks on
the cores on which they are mapped, the task priorities and
the amounts of data to be transmitted between communicating tasks;
• The communication model, i.e. the packet size and message priority for each message (alternatively, our approach
can automatically assign message priorities according to
the message criticality);
• The lower bounds Bi,j imposed on the message arrival
probability M APi,j , which is the expected fraction of successfully transmitted messages, for each pair of communicating tasks τi → τj .
9.1.2 Output
The output of the problem consists of the communication mapping CM, such that the total communication energy is minimised.
9.1.3 Constraints
The application has to satisfy the following constraints:
• For each pair of communicating tasks τi → τj , such that
tasks τi and τj are mapped on different cores, the message
arrival probability M APi,j is greater than or equal to the
imposed lower bound Bi,j (MAPi,j ≥ Bi,j, ∀τi, τj ∈ T : τi ∈ °τj ∧ Map(τi) ≠ Map(τj)).
• All deadlines are met.
(1)  for each pair of communicating tasks τi → τj
(2)      find a set of candidate CSs that satisfies MAPi,j ≥ Bi,j (Section 9.3)
(3)  end for
(4)  (sol, min cost) = explore the space of candidate CSs (Section 9.5)
(5)      using response time calculation (Section 9.4) for driving the exploration
(6)  if min cost = ∞ then
(7)      return "no solution"
(8)  else
(9)      sol′ = voltage freq selection(sol) (according to [ASE+04])
(10)     return sol′
(11) end if

Figure 9.1: Approach outline
9.2 Approach Outline
The outline of our approach to solve the problem is shown in Figure 9.1. First, for each pair of communicating tasks (message),
we find a set of candidate communication supports (line 2, see
Section 9.3), such that the lower bound constraint on the message arrival probability is satisfied. Second, the space of candidate communication supports is explored in order to find sol, the
selection of communication supports that result in the minimum
cost min cost (line 4).1 The worst-case response time of each explored solution is determined by the response time calculation
function that drives the design space exploration (line 5, see Section 9.4). If no solutions are found that satisfy the response time
constraints (min cost = ∞), the application is deemed impossible to implement with the given resources (line 7). Otherwise,
the solution with the minimum cost among the found solutions
is selected. Voltage selection is performed on the selected solution in order to decrease the overall system energy consumption
(line 9), and the modified solution is returned (line 10).
The next section discusses the construction of the set of candidate communication supports for an arbitrary pair of communicating tasks. Section 9.4 describes how the response time calculation is performed, while Section 9.5 outlines how the preferred communication supports representing the final solution are selected.

¹See Section 9.5 for a precise definition of the cost of a solution. Intuitively, a low cost corresponds to a solution characterised by large time slack (long intervals between the finishing time of a task and its deadline).
9.3 Communication Support Candidates
This section describes how to construct a set of candidate communication supports for a pair of communicating tasks. First we
introduce the notions of path, coverage, and spatial, temporal,
and general redundancy degree of a CS.
A path of length n connecting the switch corresponding to a
source core to a switch corresponding to a destination core is an
ordered sequence of n links, such that the end point of the ith
link in the sequence coincides with the start point of the (i + 1)th
link, ∀1 ≤ i < n, and the start point of the first link is the source
switch and the end point of the last link is the destination switch.
We consider only loop-free paths. A path belongs to a CS if all its
links belong to the CS. A link of a CS is covered by a path if it
belongs to the path.
The spatial redundancy degree (SRD) of a CS is given by
the minimum number of distinct paths belonging to the CS that
cover all the links of the CS. For example, the CSs depicted in
Figures 9.2(a) and 9.2(b) both have a SRD of 1, as they contain
only one path, namely path (L0,0,N , L0,1,E , L1,1,E , L2,1,N , L2,2,N ,
L2,3,E ). The CS shown in Figure 9.2(c) has spatial redundancy
degree 2, as at least two paths are necessary in order to cover
links L1,1,N and L1,1,E , for example paths (L0,0,N , L0,1,E , L1,1,E ,
L2,1,N , L2,2,N , L2,3,E ) and (L0,0,N , L0,1,E , L1,1,N , L1,2,E , L2,2,N ,
L2,3,E ).
The temporal redundancy degree (TRD) of a link is given by
the number of redundant copies to be sent on the link. The TRD
of a CS is given by the maximum TRD of its links. For example,
the TRD of the CS shown in Figure 9.2(b) is 2 as two redundant
copies are sent on links L1,1,E , L2,1,N , L2,2,N , and L2,3,E . The
TRD of the CSs shown in Figures 9.2(a) and 9.2(c) is 1.
The general redundancy degree (GRD) of a CS is given by
the sum of temporal redundancy degrees of all its links. For
example, the GRD of the CS shown in Figure 9.2(a) is 6, the
GRD of the CSs shown in Figures 9.2(b) and 9.2(c) is 10.
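Given the link–copies representation of a CS, the TRD and GRD are one-line computations; the SRD would additionally require a minimum path cover of the CS's links and is omitted here. A small sketch (our own encoding, not from the thesis):

```python
# A CS is a mapping from link name to the number of copies sent on that link.
def trd(cs: dict) -> int:
    """Temporal redundancy degree of a CS: the maximum TRD over its links."""
    return max(cs.values())

def grd(cs: dict) -> int:
    """General redundancy degree: the sum of the TRDs of all links."""
    return sum(cs.values())

# CS of Figure 9.2(a): one path, one copy per link -> TRD 1, GRD 6.
cs_a = {"L0,0,N": 1, "L0,1,E": 1, "L1,1,E": 1, "L2,1,N": 1, "L2,2,N": 1, "L2,3,E": 1}
# CS of Figure 9.2(b): same path, two copies on four of its links -> TRD 2, GRD 10.
cs_b = {"L0,0,N": 1, "L0,1,E": 1, "L1,1,E": 2, "L2,1,N": 2, "L2,2,N": 2, "L2,3,E": 2}

print(trd(cs_a), grd(cs_a))  # 1 6
print(trd(cs_b), grd(cs_b))  # 2 10
```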
[Figure: three communication supports on a 4 × 4 NoC, panels (a), (b), and (c).]
Figure 9.2: Communication supports
[Plot: expected number of transmitted bits (proportional to the ECE) versus message arrival probability, for CSs of SRD 1 with GRD 10 and CSs of SRD 2 with GRD 10, 11, 12, and 13.]
Figure 9.3: Energy-efficiency of CSs of SRD 1 and 2
It is important to use CSs of minimal GRD because the expected communication energy (ECE) of a message is strongly dependent on the GRD of the CS supporting it. To illustrate this,
we constructed all CSs of SRD 2 and GRD 10–13 for a message
sent from the lower-left core to the upper-right core of a 4 × 4
NoC. We also constructed all CSs of SRD 1 and GRD 10. For
each of the constructed CS, we computed their MAP and ECE.
In Figure 9.3, we plotted all resulting (M AP, ECE) pairs. Note
that several different CSs may have the same MAP and ECE
and therefore one dot in the figure may correspond to many CSs.
We observe that the ECEs of CSs with the same GRD do not differ significantly from one another, while the ECE difference may amount to more than 10% between CSs of different GRD.
The algorithm for the candidate set construction proceeds as
shown in Figure 9.4. Candidate CSs with SRD of only 1 and 2
are used. The justification for this choice is given later in the
section.
We illustrate how to find the minimal GRD for a message
based on the example depicted in Figure 9.2. We consider a 4 × 4
NoC, and a message sent from core P0,0 to core P3,3 . The message
consists of just one packet, the probability that the packet successfully traverses any of the links is α = 0.99, and the imposed lower bound on the MAP is B = 0.975.

(1) for each pair of communicating tasks τi → τj
(2)     Determine N1 and N2, the minimum general redundancy degrees of CSs of SRD 1 and 2 respectively, such that the MAP constraint on τi → τj is satisfied
(3)     Add all CSs with SRD 1 and with GRD N1 and all CSs with SRD 2 and with GRD N2 to the set of CS candidates of τi → τj
(4) end for

Figure 9.4: Construction of candidate CS set
We look first at CSs with SRD of 1, i.e. consisting of a single
path. We consider only shortest paths, that is, paths of length 6. Obviously, a lower bound on the GRD is 6. If we assign just one copy per link, the message arrival probability would be α⁶ ≈ 0.941 < 0.975 = B. We try with a GRD of 7, and regardless of which of the 6 links we assign the redundant copy to, we get a MAP of α⁵ · (1 − (1 − α)²) ≈ 0.95 < 0.975 = B. Hence, we are forced to increase the GRD once more. We observe that there are 5 links left with a TRD of 1. The probability of traversing all of them is α⁵ ≈ 0.95, less than the required lower bound. Therefore it is useless to assign one more redundant copy to the link that already has a TRD of 2, because the resulting MAP could not exceed α⁵. Thus, the new redundant copy has to be assigned to a different link, yielding a CS of GRD 8. In this case, we get a MAP of α⁴ · (1 − (1 − α)²)² ≈ 0.96, still less than the required bound. We continue the procedure of increasing the GRD and distributing the redundant copies to different links until we satisfy the MAP constraint. In our example, this happens after adding 4 redundant copies (MAP = α² · (1 − (1 − α)²)⁴ ≈ 0.9797). The resulting
CS of SRD 1 and GRD 10 is shown in Figure 9.2(b), where the
double lines represent links that convey two copies of the same
packet. Thus, the minimal GRD for CSs of SRD 1 is N1 = 10.
There are 20 distinct paths between core P0,0 and core P3,3 and
there are 15 ways of distributing the 4 redundant copies to each
path. Thus, 15 · 20 = 300 distinct candidate CSs of SRD 1 and
GRD 10 can be constructed for the considered message. They
all have the same message arrival probability, but different expected communication energies. The ECEs among them vary
by 1.61%.
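The search for N1 sketched above can be written down directly. The following code is our own sketch (it assumes, as in the example, that extra copies are spread over distinct links, so each link carries at most two copies) and reproduces the progression for α = 0.99, B = 0.975, and a path of length 6.

```python
def min_grd_srd1(alpha: float, path_len: int, bound: float):
    """Smallest GRD of a single-path (SRD 1) CS meeting the MAP bound.
    With k duplicated links on a path of n links, the MAP is
    alpha**(n-k) * (1 - (1-alpha)**2)**k."""
    for k in range(path_len + 1):
        map_k = alpha ** (path_len - k) * (1 - (1 - alpha) ** 2) ** k
        if map_k >= bound:
            return path_len + k, map_k
    raise ValueError("bound not reachable with at most one extra copy per link")

grd1, map1 = min_grd_srd1(alpha=0.99, path_len=6, bound=0.975)
print(grd1)            # 10
print(round(map1, 4))  # 0.9797
```

Four extra copies are needed, giving N1 = 6 + 4 = 10, in agreement with the example.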
Similarly, we obtain N2 , the minimal GRD for CSs of SRD 2.
In this case, it can be mathematically shown that larger message
arrival probabilities can be obtained with the same GRD if the
two paths of the CS intersect as often as possible and the distances between the intersection points are as short as possible
[Man04]. Intuitively, intersection points are important because
even if a copy is lost on one incoming path, the arrival of another copy will trigger a regeneration of two packets at the switch
where the two paths intersect. The closer to each other the intersection points are, the shorter the packet transmission time
between the two points is. Thus, the probability to lose a message between the two intersection points is lower. Therefore, in
order to obtain N2 , we will consider CSs with many intersection
points that are close to each other. For our example, the lowest
GRD that lets the CS satisfy the MAP constraint is N2 = 10
(MAP = α⁶ · (2 − α²)² ≈ 0.9793). This CS is shown in Figure 9.2(c). The minimum number of needed redundant copies
in order to satisfy the MAP constraint is strongly dependent on
α and the imposed lower bound on the MAP, and only weakly
dependent on the geometric configuration of the CS. Therefore,
typically N2 = N1 or it is very close to N1 .
In conclusion, N1 and N2 are obtained by progressively increasing the GRD until the CS satisfies the MAP constraint. The
redundant copies must be uniformly distributed over the links
of the CS. Additionally, in the case of CSs with SRD 2, when
increasing the GRD, links should be added to the CS such that
many path intersection points are obtained and that they are
close to each other.
The following reasoning lies behind the decision to use CSs
with SRD of only 1 and 2. First, we give the motivation for using
CSs with SRD larger than 1. While, given a GRD of N , it is possible to obtain the maximum achievable message arrival probability with CSs of SRD 1, concurrent transmission of redundant
message copies would be impossible if we used CSs with SRD
of only 1. This could severely affect the message latency or, even
worse, lead to link overload. CSs with SRD 2 are only marginally
more energy hungry, as can be seen from the cluster of points in
the lower-left corner of Figure 9.3. Usually, the same MAP can
be obtained by a CS of SRD 2 with only 1–2% more energy than
a CS of SRD 1.
While the previous consideration supports the use of CSs
with SRD greater than 1, there is no reason to go with the SRD
[Figure: (a) the application and its communication mapping on the NoC; (b) the derived task graph with communication tasks τ4–τ11.]
Figure 9.5: Application modelling for response time analysis
beyond 2. Because of the two-dimensional structure of the NoC,
there are at most 2 different links that belong to the shortest
paths between the source and the destination and whose start
points coincide with the source core. Thus, if a CS consisted
only of the shortest paths, the message transmission would be
vulnerable to a double fault of the two initial links. Therefore,
CSs with SRD greater than 2, while consuming more energy
for communication, would still be vulnerable to a double fault
on the initial links and hence can only marginally improve the
MAP. If we did not restrict the CS to the shortest paths, while
overcoming the limitation on the MAP, we would consume extra
energy because of the longer paths. At the same time, latency
would be negatively affected. Thus, for two-dimensional NoC,
we consider CSs of SRD of only 1 and 2.
9.4 Response Time Calculation
In order to guarantee that tasks meet their deadlines, provided that no message is lost, worst-case response times have to be determined.
Let us consider the example depicted in Figure 9.5(a). Solid
lines depict data dependencies among the tasks, while the
dotted lines show the actual communication mapping to the
on-chip links. The two CSs are CS1,2 = {(L0,0,E , 1)} and
CS1,3 = {(L0,0,E , 1), (L1,0,N , 1), (L0,0,N , 2), (L0,1,E , 2)}. Packet
sizes are such that message τ1 → τ2 is chopped into 2 packets,
while message τ1 → τ3 fits into a single packet.
Based on the application graph, its mapping and the communication supports, we construct a task graph as shown in Figure 9.5(b). Each link L is regarded as a processor PL , and each
packet transmission on link L is regarded as a non-preemptive
task executed on processor PL . The shadings of the circles denote
the processors (links) on which the tasks (packets) are mapped.
Tasks τ4 and τ5 represent the first and the second packet of the
message τ1 → τ2 . They are both dependent on task τ1 , as the
two packets are generated when task τ1 completes its execution,
while task τ2 is dependent on both task τ4 and τ5 as it can start
only after it has received the entire message, i.e. both packets,
from task τ1 . Both tasks τ4 and τ5 are mapped on the “processor”
corresponding to the link L0,0,E . Task τ6 represents the packet
of the message τ1 → τ3 that is sent on link L0,0,E and task τ7
represents the same packet once it reaches link L1,0,N . Tasks τ8
and τ9 are the two copies of the packet of the message τ1 → τ3
that are sent on link L0,0,N .
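The expanded task graph of this example can be encoded explicitly. The sketch below is our own encoding (task and link names follow the text): for each communication task it records the link "processor" it is mapped on and its predecessor set.

```python
# Expanded model: each packet transmission is a task mapped on the
# "processor" corresponding to its link.
mapping = {
    "t4": "L0,0,E", "t5": "L0,0,E",    # the two packets of message t1 -> t2
    "t6": "L0,0,E", "t7": "L1,0,N",    # the t1 -> t3 packet on its first route
    "t8": "L0,0,N", "t9": "L0,0,N",    # two copies sent on L0,0,N
    "t10": "L0,1,E", "t11": "L0,1,E",  # two copies sent on L0,1,E
}
predecessors = {
    "t4": {"t1"}, "t5": {"t1"}, "t6": {"t1"}, "t8": {"t1"}, "t9": {"t1"},
    "t7": {"t6"},                      # same packet, next link on the route
    "t2": {"t4", "t5"},                # t2 needs both packets of the message
    "t10": {"t8", "t9"},               # worst case: wait for the last arriving copy
    "t11": {"t8", "t9"},
    "t3": {"t7", "t10", "t11"},        # t3 waits for all copies on all routes
}
print(sorted(predecessors["t3"]))
```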
We are interested in the worst-case scenario w.r.t. response
times. In the worst case, all copies of a packet get scrambled except the one that arrives last. Therefore, the copies to be sent by a core
on its outgoing links have to wait until the last of the copies arriving on incoming links of the core has reached the core. For
example, tasks τ10 and τ11 , modelling the two copies of the message τ1 → τ3 that are sent on the link L0,1,E , depend on both τ8
and τ9 , the two copies on link L0,0,N . Also, task τ3 depends on all
three copies, τ7 , arriving on link L1,0,N , and τ10 and τ11 , arriving
on link L0,1,E .
The modified model, as shown in Figure 9.5(b), is analysed
using the dynamic offset based schedulability analysis proposed
by Palencia and Harbour [PG98]. The analysis calculates the
worst-case response times and jitters for all tasks.
9.5 Selection of Communication Supports
As shown in Section 9.3 (see also line 2 in Figure 9.1), we have
determined the most promising (low energy, low number of messages) set of CSs for each transmitted message in the applica-
9.5. COMMUNICATION SUPPORT SELECTION
163
tion. All those CSs guarantee the requested MAP. As the next
step of our approach (line 4 in Figure 9.1) we have to select one
particular CS for each message, such that the solution cost is
minimised, which corresponds to maximising the smallest time
slack. The response time for each candidate solution is calculated as outlined in Section 9.4 (line 5 in Figure 9.1).
The design space is explored with a Tabu Search based
heuristic. The basic principles of Tabu Search have been described in Section 6.4.1. The design space is the Cartesian product of the sets of CS candidates for each message (constructed as
shown in Section 9.3.) Because all CS candidates guarantee the
requested MAP, all points in the solution space satisfy the MAP
constraint of the problem (Section 9.1.3). A point in the design
space is an assignment of communication supports to messages
(see Section 8.5). A move means picking one pair of communicating tasks and selecting a new communication support for the
message sent between them. In order to select a move, classical
Tabu Search explores all solutions that can be reached by one
move from the current solution. For each candidate solution,
the application response time has to be calculated. Such an approach would be too time consuming for our problem. Therefore,
we only explore “promising” moves. Thus,
1. we look at messages with large jitters as they have a higher
chance to improve their transmission latency when assigned a new CS; and
2. for a certain message τi → τj , we consider only those candidate CSs that would decrease the amount of interference
of messages of higher priority than τi → τj . (By this we
remove messages from overloaded links.)
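The restricted-neighbourhood exploration can be sketched as follows. This is a heavily simplified stand-in of our own: the real exploration uses the jitter- and interference-based move selection described above and the response-time analysis of Section 9.4 as the cost oracle, whereas here the cost is an arbitrary callable and candidate moves are sampled at random.

```python
import random

def tabu_search(candidates, cost, iterations=100, tabu_len=7, seed=0):
    """candidates: dict message -> list of candidate CSs;
    cost: function(assignment) -> float (lower is better).
    A move picks one message and switches its CS; recently moved messages
    are tabu unless the move improves on the best solution found so far."""
    rng = random.Random(seed)
    current = {m: cs_list[0] for m, cs_list in candidates.items()}
    best, best_cost = dict(current), cost(current)
    tabu = []
    for _ in range(iterations):
        # Stand-in for "promising" move selection: sample a few messages
        # and consider all their alternative CSs.
        moves = []
        for m in rng.sample(list(candidates), min(3, len(candidates))):
            for cs in candidates[m]:
                if cs != current[m]:
                    moves.append((m, cs))
        if not moves:
            break
        scored = []
        for m, cs in moves:
            trial = dict(current)
            trial[m] = cs
            c = cost(trial)
            if m in tabu and c >= best_cost:
                continue  # tabu move with no aspiration
            scored.append((c, m, cs))
        if not scored:
            continue
        c, m, cs = min(scored, key=lambda x: x[0])
        current[m] = cs
        tabu = ([m] + tabu)[:tabu_len]
        if c < best_cost:
            best, best_cost = dict(current), c
    return best, best_cost
```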
The value of the cost function that drives the response-time
minimisation, evaluated for an assignment of communication
supports to messages CS, is:

cost(CS) = ∞ if ∃τ ∈ T : WCRTτ > δτ ∨ ∃Γi ∈ Γ : WCRTΓi > δΓi, and
cost(CS) = max_{τ∈T} WCRTτ / δτ otherwise,    (9.1)
where T is the set of tasks and W CRT and δ denote worst-case
response times and deadlines respectively. The worst-case response time of a task is obtained as shown in Section 9.4.
In the case of the cost function in Eq. (9.1), we make the conservative assumption that voltage and frequency are set system-wide for the whole NoC. This means that, at design time, an optimal voltage and/or frequency is determined for the whole NoC.
The determined values for voltage and frequency do not change
during the entire execution time of the application. For such a
scenario, if we decrease the system-wide voltage (or frequency),
the worst-case response times of all tasks would scale with the
same factor. Therefore, in the definition of the cost function
(Eq. (9.1)) we use the max operator, as the width of the interval in which the response time is allowed to increase is limited by the smallest slack (largest WCRTτi/δτi) among the tasks.
In a second scenario, we can assume that voltage and/or frequency may be set core-wise. This means that, at design time,
an optimal voltage and/or frequency is calculated for each core.
These voltages and frequencies do not change during the whole
execution time of the application. The cost function would become

cost(CS) = ∞ if ∃τ ∈ T : WCRTτ > δτ ∨ ∃Γi ∈ Γ : WCRTΓi > δΓi, and
cost(CS) = Σ_{p∈P} max_{τ∈Tp} WCRTτ / δτ otherwise,    (9.2)
where p is a core in P , the set of cores of the NoC, and Tp is the
set of tasks mapped on core p.
If we assume that voltage and/or frequency may change for
each core during operation, then they may be set task-wise and
the cost function becomes

cost(CS) = ∞ if ∃τ ∈ T : WCRTτ > δτ ∨ ∃Γi ∈ Γ : WCRTΓi > δΓi, and
cost(CS) = Σ_{τ∈T} WCRTτ / δτ otherwise.    (9.3)
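The three cost functions can be implemented directly. The sketch below is ours; for brevity it checks only task deadlines in the ∞ case (the thesis also checks task-graph deadlines) and represents worst-case response times and deadlines as dictionaries.

```python
INF = float("inf")

def cost_systemwide(wcrt, deadline, tasks):
    """Eq. (9.1): system-wide voltage/frequency scaling, so the smallest
    slack over all tasks limits the scaling; hence a single max."""
    if any(wcrt[t] > deadline[t] for t in tasks):
        return INF
    return max(wcrt[t] / deadline[t] for t in tasks)

def cost_corewise(wcrt, deadline, tasks_per_core):
    """Eq. (9.2): per-core scaling, so the cost sums the per-core maxima."""
    if any(wcrt[t] > deadline[t]
           for core in tasks_per_core for t in tasks_per_core[core]):
        return INF
    return sum(max(wcrt[t] / deadline[t] for t in tasks_per_core[core])
               for core in tasks_per_core)

def cost_taskwise(wcrt, deadline, tasks):
    """Eq. (9.3): per-task scaling, so the cost sums over all tasks."""
    if any(wcrt[t] > deadline[t] for t in tasks):
        return INF
    return sum(wcrt[t] / deadline[t] for t in tasks)
```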
9.6 Experimental Results
We report on three sets of experiments that we ran in order to
assess the quality of our approach.
[Plot: application response time versus number of tasks (16–80), for CSs with only temporal redundancy and CSs with both temporal and spatial redundancy.]
Figure 9.6: Application latency vs number of tasks
9.6.1 Latency as a Function of the Number of Tasks
The first set investigates the application latency as a function of
the number of tasks. 340 applications of 16 to 80 tasks were randomly generated. The applications are executed by a 4 × 4 NoC.
The probability that a link successfully conveys a data packet is
0.97, and the imposed lower bound on the message arrival probability is 0.99. For each application, we ran our communication
mapping tool twice. In the first run, we consider CSs of SRD
1, i.e. packets are retransmitted on the same, unique path. In
the second run, we consider CSs of SRD 1 and 2, as described
in Section 9.3. Figure 9.6 depicts the averaged results. The approach that uses both spatially and temporally redundant CSs
leads to shorter application latencies than the approach that just
re-sends on the same path.
[Plot: application response time versus imposed message arrival probability (0.94–1), for only temporal redundancy and for combined temporal and spatial redundancy.]
Figure 9.7: Application latency vs bound on MAP
9.6.2 Latency as a Function of the Imposed
Message Arrival Probability
The second experiment investigates the dependency of latency
on the imposed message arrival probability. 20 applications,
each of 40 tasks, were randomly generated. We considered the
same hardware platform as in the first experiment. For each
application, we considered 17 different lower bounds on MAP,
ranging from 0.94 to 0.9966. The averaged results are shown
in Figure 9.7. For low bounds on MAP, such as 0.94, almost no
transmission redundancy is required to satisfy the MAP constraint. Therefore, the approach combining spatially and temporally redundant communication fares only marginally better
than the approach that uses only temporal redundancy. However, for higher bounds on the MAP, the approach that combines
spatially and temporally redundant transmission has the edge.
In the case of bounds on the MAP larger than 0.9992, spatial redundancy cannot satisfy the constraint anymore, and therefore
the temporally redundant transmission becomes dominant and
the approach combining spatial and temporal redundancy does
not lead to significant latency reductions anymore.
[Plot: relative latency reduction [%] versus amount of communication per time unit [bits/abstract time unit], for 4 × 4, 5 × 5, and 6 × 6 NoCs.]
Figure 9.8: Application latency vs NoC size and communication
load
9.6.3 Latency as a Function of the Size of the NoC and Communication Load
The third experiment has a double purpose. First, it investigates
the dependency of latency reduction on the size of the NoC. Second, it investigates latency reduction as a function of the communication load (bits/time unit). 20 applications of 40, 62 and 90
tasks were randomly generated. The applications with 40 tasks
run on a 4 × 4 NoC, those with 62 tasks run on a 5 × 5 NoC and
those with 90 tasks run on a 6 × 6 NoC. For each application, we
considered communication loads of 1–4 bits/time unit. The averaged latency reductions when using the optimal combination
of spatial and temporal redundancy, compared to purely temporal redundancy, are depicted in Figure 9.8. We observe that for
low communication loads, the latency reduction is similar for all
three architectures, around 22%. However, at loads higher than
3.4 the relatively small number of links of the 4 × 4 NoC get congested and response times grow unboundedly. This, however, is
not the case with the larger NoCs. Latency reduction for a load
of 4 is 22% for a NoC of 6 × 6 and 12% for 5 × 5.
Figure 9.9: Histogram of the optimisation time (percentage of benchmarks vs optimisation time [s]), as measured on an AMD Athlon@1533MHz desktop PC
9.6.4 Optimisation Time
Figure 9.9 depicts the histogram of the optimisation time for all
benchmarks that are used in this section, as measured on a desktop PC with an AMD Athlon processor clocked at 1533 MHz. On
average, the optimisation time is 912 seconds. We note that the
optimisation time for a large majority of benchmarks is smaller
than 1000 seconds, while 5.6% of all benchmarks took between
1000 and 2000 seconds to optimise, and the optimisation of 8.7%
of benchmarks took longer than 2000 seconds.
9.6.5 Exploiting the Time Slack for Energy Reduction
The presented experiments have shown that, by using an optimal combination of temporal and spatial redundancy for message mapping, significant reduction of latency can be obtained
while guaranteeing message arrival probability at the same
time. It is important to notice that the latency reduction is obtained without energy penalty, as shown in Section 9.3. This
means that for a class of applications using the proposed approach it will be possible to meet the imposed deadlines, which otherwise would not be possible without changing the underlying NoC architecture.

Figure 9.10: Energy consumption [J] vs. number of tasks, for temporal-only redundancy and for combined temporal and spatial redundancy

However, the proposed approach gives
also the opportunity to further reduce the energy consumed by
the application. If the obtained application response time is
smaller than the imposed one, the resulting slack can be exploited by running the application at reduced voltage. In order
to illustrate this, we have performed another set of experiments.
Applications of 16 to 60 tasks running on a 4 × 4 NoC were
randomly generated. For each application we ran our message
mapping approach twice, once using CSs with SRD of only 1, and
the second time using CSs with SRD of 1 and 2. The slack that resulted
in the second case was exploited for energy reduction. We have
used the algorithm published in [ASE+ 04] for calculating the
voltage levels for which to run the application. For our energy
models, we considered a 70nm CMOS fabrication process. The
resulting energy consumption is depicted in Figure 9.10. The energy reduction ranges from 20% to 13%. For this experiment, we
considered the conservative scenario in which, at design time,
an optimal voltage and/or frequency is computed for the whole
NoC (see Eq. (9.1) in Section 9.5). We do not assume the availability of a dynamic voltage scaling capability in the NoC. If such capability existed, even larger energy savings could be achieved.

Figure 9.11: H.263 and MP3 encoding application (task graph; each arc is annotated with the communication amount of the corresponding message)
9.6.6 Real-Life Example: An Audio/Video Encoder
Finally, we applied our approach to a multimedia application,
namely an audio/video encoder implementing the H.263 recommendation [Int05] of the International Telecommunication
Union (ITU) for video encoding and the MPEG-1 Audio Layer 3
standard for audio encoding (ISO/IEC 11172-3 Layer 3 [Int93]).
Figure 9.11 depicts the task graph that models the application, while Figure 9.12 shows the application mapping to the
NoC cores. The task partitioning, mapping, and profiling was
done by Hu and Mărculescu [HM04b]. The video encoding part of
the application consists of 9 tasks: frame prediction (FP), motion
estimation (ME), discrete cosine transform (DCT), quantisation
Figure 9.12: NoC implementation of the H.263 and MP3 encoding application (mapping of the tasks to the DSP, CPU, ASIC, and memory cores of the 4 × 4 NoC)
(Q), inverse quantisation (IQ), inverse discrete cosine transform
(IDCT), motion compensation (MC), addition (ADD), and variable length encoding (VLE). Three memory regions are used for
frame stores FS0, FS1, and FS2. The audio encoding part consists of 7 tasks: frame prediction (FP), fast Fourier transform
(FFT), psycho-acoustic model (PAM), filter (Flt), modified discrete cosine transform (MDCT), and two iterative encoding tasks
(IE1 and IE2). The numbers that annotate arcs in Figure 9.11
denote the communication amount of the message represented
by the corresponding arc. The period of the task graph depends
on the imposed frame rate, which depends on the video clip. We
use periods of 41.6ms, corresponding to 24 frames per second.
The deadlines are equal to the periods.
The application is executed by an NoC with 6 DSPs, 2 CPUs,
4 ASICs, and 2 memory cores, organised as a 4 × 4 NoC with
two unused tiles, as shown in Figure 9.12. The probability that
a packet successfully traverses a network link is assumed to be
0.99. The approach combining spatially and temporally redundant message transmission obtained a 25% response time reduction relative to the approach deploying only temporal redundancy. The energy savings after voltage reduction amounted to
20%. Because of the relatively small design space of this example, the optimisation took only 3 seconds when combining spatially and temporally redundant communication supports.
9.7 Conclusions
In this chapter we addressed the problem of communication energy minimisation under task response time and message arrival probability constraints. The total communication energy
is reduced by means of two strategies. On one hand, we intelligently select the communication supports of messages such
that we reduce application response time with negligible energy
penalty while satisfying message arrival probability constraints.
On the other hand, the execution time slack can be exploited by
deploying voltage and/or frequency scaling on the cores and communication links. The approach is efficient as it results in energy reductions up to 20%. Nevertheless, a significant cost and
energy reduction potential has not been considered in this chapter, namely the reduction of buffer sizes at the network switches.
The next chapter presents an approach for communication mapping with the goal to reduce the buffering need of packets while
guaranteeing timeliness and lower bounds on message arrival
probability.
Chapter 10
Buffer Space Aware Communication Synthesis for NoC
In this chapter we address two problems related to the buffering
of packets at the switches of on-chip networks. First, we present
an approach to minimise the buffer space demand of applications implemented on networks-on-chip. This is particularly relevant when designing application-specific NoCs, as the amount
and distribution of on-chip memory can be tailored for the application. Second, we solve the problem of mapping the communication of an application implemented on an NoC with predefined buffers such that no buffer overflows occur during operation. Both problems are additionally constrained by timeliness
requirements and bounds on the message arrival probability. For
solving the described problems we introduce a buffer space demand analysis procedure, which we present in Section 10.3.3.
The buffer space demand minimisation is achieved by a combination of two techniques: an intelligent mapping of redundant
messages to network links and a technique for delaying the sending of packets on links, also known as traffic shaping [RE02].
Section 10.1 gives a precise formulation of the two problems that
we solve in this chapter. Section 10.2 discusses the two techniques that we propose. Section 10.3 presents our approach to
solving the formulated problems and the buffer demand analysis
CH. 10. BUFFER SPACE AWARE SYNTHESIS
procedure. Section 10.4 presents experimental results. Finally,
Section 10.5 draws the conclusions.
10.1 Problem Formulation
In this section we define the two problems that we solve in this
chapter.
10.1.1 Input
The input common to both problems consists of:
• The hardware model, i.e. the size of the NoC and, for each link, the energy-per-bit, the bandwidth, and the probability that a packet is successfully conveyed by the link;
• The application model, i.e. the set of task graphs Γ, the mapping of tasks to cores Map, the set of task periods ΠT , deadlines ∆T , worst-case execution times, priorities and the amounts of data to be transmitted between communicating tasks;
• The communication model, i.e. the packet size and message
priority for each message; and
• The lower bounds Bi,j imposed on the message arrival probability MAPi,j , for each message τi → τj .
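The input described above can be collected in a few container types (a minimal sketch; all type and field names are ours, not part of the thesis):

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    energy_per_bit: float   # energy consumed per transmitted bit
    bandwidth: float        # bits per time unit
    success_prob: float     # probability that a packet traverses the link

@dataclass
class Task:
    period: float
    deadline: float
    wcet: float             # worst-case execution time
    priority: int
    core: int               # Map: the core the task is mapped to

@dataclass
class Message:
    src: str                # sending task
    dst: str                # receiving task
    volume: float           # amount of data transmitted per activation
    packet_size: int
    priority: int
    map_bound: float        # imposed lower bound B_ij on MAP_ij

@dataclass
class Problem:
    noc_size: tuple                               # (rows, columns)
    links: dict = field(default_factory=dict)     # link id -> Link
    tasks: dict = field(default_factory=dict)     # task id -> Task
    messages: list = field(default_factory=list)  # list of Message
```

Both problem variants below consume exactly this input; the CSPBS variant additionally needs a per-buffer capacity map.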
10.1.2 Constraints
The constraints for both problems are:
• All message arrival probabilities satisfy MAPi,j ≥ Bi,j ;
• All tasks meet their deadlines.
10.1.3 Output
The communication synthesis problem with buffer space demand minimisation (CSBSDM) is formulated as follows:
Given the above input, for each message τi → τj find the communication support CSij , and determine the time each packet is
delayed at each switch, such that the imposed constraints are
satisfied and the total buffer space demand is minimised. Additionally, determine the needed buffer capacity of every input
buffer at every switch.
The communication synthesis problem with predefined buffer
space (CSPBS) is formulated as follows:
Given the above input, and additionally the capacity of every input buffer at every switch, for each message τi → τj find
the communication support CSij , and determine the time each
packet is delayed at each switch, such that the imposed constraints are satisfied and no buffer overflow occurs at any switch.
10.2 Motivational Example
In this section we motivate the importance of intelligently choosing communication supports and we demonstrate the power of
traffic shaping based on an example.
Let us consider the application shown in Figure 10.1(a). We
assume that each message consists of a single packet. Assuming
that messages are mapped on only shortest paths (paths traversing a minimum number of switches), for each message, except
the message τ2 → τ3 , there is only one mapping alternative,
namely the shortest path. For the message τ2 → τ3 , however,
there are two such shortest paths, namely L1,1,E → L2,1,S and
L1,1,S → L1,0,E .
One way to minimise buffer space demand is to intelligently
map the message τ2 → τ3 . Let us assume that the message is
mapped on path L1,1,E → L2,1,S . Such a situation is depicted
in Figure 10.1(b). The corresponding Gantt diagram is shown in
Figure 10.2(a). The rectangles represent task executions (respectively message transmissions) on the processing elements (respectively communication links) to which the tasks (messages)
are mapped.
Message τ2 → τ3 competes with message τ7 → τ8 for link
L2,1,S . Message τ7 → τ8 arrives at the switch connecting tile
P2,1 to the network while message τ2 → τ3 is conveyed on link
L2,1,S . Due to the unavailability of the link, message τ7 → τ8
has to be buffered. The situations in which buffering is necessary are highlighted by black ellipses. Messages that have been
buffered before being transmitted, due to momentary resource
unavailability, are depicted in hashed manner. The total needed
buffering space is proportional to the sum of hashed areas. One
more such situation occurs in Figure 10.2(a), caused by the conflict between messages τ5 → τ6 and τ9 → τ10 on link L0,0,E .
Figure 10.1: Application example ((a) the example application and its mapping on the NoC; (b) message τ2 → τ3 mapped on path L1,1,E → L2,1,S ; (c) message τ2 → τ3 mapped on path L1,1,S → L1,0,E )
Figure 10.2: Impact of communication mapping and traffic shaping (Gantt diagrams (a)–(c): rectangles represent task executions on processing elements and message transmissions on links; messages buffered due to momentary resource unavailability are shown hashed)
We observe that message τ7 → τ8 needs a relatively large
buffering space, which can be avoided by choosing a different
mapping alternative for message τ2 → τ3 . This mapping is depicted in Figure 10.1(c), while its corresponding Gantt diagram
is shown in Figure 10.2(b). However, while saving the buffering space required by message τ7 → τ8 , the new mapping introduces a conflict between messages τ2 → τ3 and τ5 → τ6 on link
L1,0,E . As a result, the packet from task τ5 to task τ6 has to
be buffered at the switch S10 in the input buffer corresponding
to link L0,0,E . Nevertheless, because message τ7 → τ8 does not
need to be buffered, we reduced the overall buffer space demand
relative to the alternative in Figure 10.1(b).
As there are no other mapping alternatives, we resort to the
second technique, namely traffic shaping, in order to further reduce the total amount of buffering space.
In Figure 10.2(b), we observe that message τ5 → τ6 is buffered
twice, the first time before being sent on L0,0,E , and the second
time before being sent on link L1,0,E . If we delayed the sending of message τ5 → τ6 , as shown in the Gantt diagram in Figure 10.2(c), we could avoid the need to buffer the message at
switch S10 . In the particular case of our example, this message
delaying comes with no task graph response time penalty. This is
because the task graph response time is given by the largest response time among the tasks of the graph (τ4 in our case), shown
as the dotted line in Figure 10.2, which is unaffected by the delaying of message τ5 → τ6 . In general, traffic shaping may increase the application latency. Therefore, we apply traffic shaping preferentially to messages on non-critical computation paths.
The above example demonstrates the efficiency of intelligent
communication mapping and traffic shaping when applied to the
problem of buffer need minimisation. Obviously, the techniques
are also effective in the case of the second problem formulated in
Section 10.1, the communication synthesis problem with predefined buffer space.
10.3 Approach Outline
The solution to both problems defined in Section 10.1 consists of
two components each: the set of message communication supports and the set of packet delays. Thus, each problem is divided into two subproblems, the communication mapping subproblem (CM), which determines the communication support for each message, and the traffic shaping subproblem (TS), which determines the possible delays applied to forwarding a particular packet. Depending on the actual problem, we will introduce CSBSDM-CM and CSBSDM-TS, and CSPBS-CM and CSPBS-TS, respectively.

Figure 10.3: Approach outline (both the CM and the TS subproblem are solved by design space delimitation (Section 10.3.1), design space exploration (Section 10.3.2), and system analysis (Section 10.3.3))
The outline of our approach is depicted in Figure 10.3. Solving the communication mapping as well as the traffic shaping
subproblem is itself decomposed into three subproblems:
1. Delimit the space of potential solutions (Section 10.3.1)
2. Deploy an efficient strategy for the exploration of the design space (Section 10.3.2), and
3. Find a fast and accurate system analysis procedure for
guiding the search (Section 10.3.3).
10.3.1 Delimitation of the Design Space
Concerning the CM problem, including all possible CSs for each
message in the set of potential solutions leads to a very large
design space, impossible to explore in reasonable time. Thus, in
Section 9.3 we established criteria for picking only promising CS
candidates, which we include in the space of potential solutions.
The solution space for the TS problem is constructed as follows. For each tuple (pi,j , S), where pi,j is a packet from task τi to task τj and S is a network switch on its route, we consider the set of delays {0, ∆, 2∆, . . . , Dj }, where ∆ is the minimum amount of time it takes for the packet to traverse a network link, and Dj = δj − WCETj − H · ∆, where δj is the deadline of task τj , WCETj is the worst-case execution time of task τj , and H is the Manhattan distance between the two cores on which tasks τi and τj are mapped. Delaying the packet pi,j longer than Dj would
(1)    sm = sort_messages;
(2)    for each msg in sm do
(3)        CS[msg] = select(msg, candidates[msg]);
(4)        if CS[msg] = NONE then
(5)            abort NO SOLUTION;
(6)    return CS;

       select(msg, cand_list):
(7)        cost = ∞; selected = NONE;
(8)        for each cnd in cand_list do
(9)            CS[msg] = cnd; crt_cost = cost_func;
(10)           if crt_cost < cost then
(11)               selected = cnd; cost = crt_cost;
(12)       return selected;

Figure 10.4: Heuristic for communication mapping
certainly cause task τj to break its deadline δj if it executed for its worst-case execution time WCETj .
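The candidate delay set of a packet can be enumerated directly from these quantities (a sketch; parameter names are ours):

```python
def candidate_delays(delta, deadline, wcet, hops):
    """Delays {0, Δ, 2Δ, ..., D_j} with D_j = δ_j − WCET_j − H·Δ.

    `hops` is the Manhattan distance H between the cores of the two
    communicating tasks; any longer delay would make the receiving
    task miss its deadline even when executing for its WCET.
    """
    d_max = deadline - wcet - hops * delta
    if d_max < 0:
        return []          # no feasible shaping delay at all
    return [k * delta for k in range(int(d_max // delta) + 1)]

# Example: δ_j = 100, WCET_j = 20, H = 4, Δ = 5  =>  D_j = 60,
# giving the 13 candidate delays 0, 5, 10, ..., 60.
```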
10.3.2 Exploration Strategy
Cost Function
The value of the cost function that drives the design space exploration is infinite for solutions in which there exists a task whose
response time exceeds its deadline.
The cost function for the CSBSDM-CM and CSBSDM-TS subproblems is Σb∈B db , where B is the set of all switch input buffers, b is a buffer in this set, and db is the maximum demand of buffer space of the application at buffer b.
The cost function for the CSPBS-CM and CSPBS-TS subproblems is maxb∈B (db − cb ), where cb is the capacity of buffer b. Solutions of the CSPBS problem with strictly positive cost function value do not satisfy the buffer space constraint and are thus
unfeasible. For the CSPBS problem, we stop the design space
exploration as soon as we find a solution whose cost is zero or
negative.
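Both cost functions can be written down compactly (a sketch; the demands d_b would come from the analysis procedure of Section 10.3.3, and all names are ours):

```python
INFEASIBLE = float("inf")   # a deadline is violated somewhere

def cost_csbsdm(demand, deadlines_met):
    """CSBSDM cost: total buffer space demand, sum of d_b over buffers."""
    if not deadlines_met:
        return INFEASIBLE
    return sum(demand.values())

def cost_cspbs(demand, capacity, deadlines_met):
    """CSPBS cost: max_b (d_b − c_b); a value <= 0 means no overflow."""
    if not deadlines_met:
        return INFEASIBLE
    return max(demand[b] - capacity[b] for b in demand)

d = {"b1": 4, "b2": 7}
print(cost_csbsdm(d, True))                     # -> 11
print(cost_cspbs(d, {"b1": 8, "b2": 8}, True))  # -> -1 (feasible)
```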
Communication Mapping
We propose a greedy heuristic for communication mapping. We
map messages to CSs stepwise. At each step, we map one message and we obtain a partial solution. When evaluating partial
solutions, the messages that have not yet been mapped are not
considered.
The heuristic proceeds as shown in Figure 10.4, lines 1–6. It
returns the list of communication supports for each message if
a feasible solution is found (line 6) or aborts otherwise (line 5).
Before proceeding, we sort all messages in increasing order of
their number of mapping alternatives (line 1). Then, we iterate
through the sorted list of messages sm. In each iteration, we
select a mapping alternative for the current message (line 3).
The selection of a mapping alternative out of the list of candidates (determined in the previous step, Section 10.3.1, and in
Section 9.3) is shown in Figure 10.4, lines 7–12. We iterate over
the list of mapping alternatives (line 8) and evaluate each of
them (line 9). We select the alternative that gives the minimum
cost (line 11).
The motivation for synthesizing the communication in the
particular order of increasing number of mapping alternatives of
messages is the following. We would like to minimise the chance
that the heuristic runs into the situation in which it does not find
any feasible solution, although at least one exists. If messages
enjoying a large number of mapping alternatives are mapped
first, we restrict the search space prematurely and gratuitously,
running the risk that no feasible mapping is found for other messages among their few mapping alternatives.
Traffic Shaping
The greedy heuristic, shown in Figure 10.5, determines the
amount of time each communication task has to be delayed
(a.k.a. shaping delay). As a first step, we sort the communication tasks according to a criterion to be explained later (line
1). Then, for all communication tasks in the sorted list we find
the appropriate shaping delay (line 2). The selection of a shaping delay of a communication task is performed by the function
shape (lines 3–9). We probe shaping delays ranging from 0 to
Dj = δj − WCETj − H · ∆ in increments of ∆, where δj and WCETj are the deadline and the worst-case execution time of the receiving task τj (see Section 10.3.1). For each probed
(1)    sct = sort_comm_tasks;
(2)    for each τ in sct do delay[τ ] = shape(τ );

       shape(τ ):
(3)        cost = ∞;
(4)        for delay[τ ] = 0; delay[τ ] < Dτ ; delay[τ ] := delay[τ ] + ∆ do
(5)            crt_cost = cost_func;
(6)            if crt_cost < cost then
(7)                best_delay = delay[τ ]; cost = crt_cost;
(8)        end for;
(9)        return best_delay;

Figure 10.5: Heuristic for traffic shaping
shaping delay, we evaluate the cost of the obtained partial solution (line 5). When calculating it, we assume that the shaping
delay of those tasks for which none has yet been chosen is 0. We
select the shaping delay that leads to the minimum cost solution
(lines 6–7).
Before closing this section, we will explain in which order to
perform the shaping delay selection. We observe that communication tasks on paths whose response times are closer to the
deadline have a smaller potential for delaying. Thus, delaying
such communication tasks runs a higher risk to break the timeliness constraints. In order to quantify this risk, we compute
the worst-case response time Rτ of each leaf task τ . Then, for
each task τi we determine L(τi ), the set of leaf tasks τj such that
there exists a computation path between task τi and τj . Then,
to each task τi we assign the value prti = minτ∈L(τi ) (δτ − Rτ ). Last, we sort the tasks in decreasing order of their prti .¹ In case of ties, tasks with smaller depths² in the task graph are placed
after tasks deeper in the graph. (If tasks with small depths were
delayed first, their delay would seriously restrict the range of
feasible delays of tasks with large depths.)
¹ The procedure can be easily generalised for the case in which not only leaf tasks have deadlines.
² The depth of a task τ is the length of the longest computation path from a root task to task τ .
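The shaping-delay selection of Figure 10.5, combined with the prt-based ordering described above, can be sketched as follows (names are ours; cost_func and the per-task quantities prt and depth are assumed to come from the analysis):

```python
import math

def shape_traffic(comm_tasks, d_max, delta, prt, depth, cost_func):
    """Greedy traffic shaping in the spirit of Figure 10.5.

    Tasks with large prt (much slack on all computation paths through
    them) are shaped first; on ties, deeper tasks come before
    shallower ones.  `cost_func(delay)` evaluates a partial solution;
    tasks not yet shaped keep the default delay 0.
    """
    order = sorted(comm_tasks, key=lambda t: (-prt[t], -depth[t]))
    delay = {t: 0 for t in comm_tasks}
    for t in order:
        best, best_cost = 0, math.inf
        d = 0
        while d <= d_max[t]:           # probe {0, Δ, 2Δ, ..., D}
            delay[t] = d
            crt_cost = cost_func(delay)
            if crt_cost < best_cost:
                best, best_cost = d, crt_cost
            d += delta
        delay[t] = best
    return delay

# Toy run with a separable cost: the optimum delays are 5 and 10.
res = shape_traffic(["u", "v"], {"u": 10, "v": 10}, 5,
                    {"u": 8, "v": 2}, {"u": 1, "v": 1},
                    lambda dl: (dl["u"] - 5) ** 2 + (dl["v"] - 10) ** 2)
print(res)  # -> {'u': 5, 'v': 10}
```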
(1)    Buf = 0; b = 0; t = 0; F = R0(t); F1 = F;
(2)    loop
(3)        t′ = next_t′;
(4)        F′ = R0(t′);
(5)        if t′ = F′ then
(6)            return Buf;
(7)        b′ = (F′ − F) · bw + b − ((t ≥ F1) ? 1 : 0);
(8)        if b′ > Buf then
(9)            Buf = b′;
(10)       if t′ > F1 then
(11)           b := b − (t′ − max(t, F1) − (F′ − F)) · bw;
(12)       t = t′; F = F′;
(13)   end loop;

Figure 10.6: Buffer space analysis algorithm
10.3.3 System Analysis Procedure
In order to be able to compute the cost function as defined in Section 10.3.2, we need to determine the worst-case response time
of each task as well as the buffering demand at each buffer in
the worst case. To do so, we extended the schedulability analysis
algorithm of Palencia and González [PG98].
At the core of the worst-case response time calculation of task
τi is a fix-point equation of type wi = Ri (wi ). Ri (t) gives the
worst-case response time of task τi when considering interference of tasks of higher priority than that of τi that arrive in the
interval [0, t). The time origin is considered the arrival time of
task τi . Thus, evaluating Ri at two time moments, t1 and t2 ,
allows us to determine the execution time demanded by higher
priority tasks arrived during the interval [t1 , t2 ). More details
regarding the calculation of the worst-case response time can
be found in cited work [PG98]. Here we will concentrate on
our approach to buffer demand analysis. For communication
tasks, their "execution times" on their "processors" are actually
the transmission times of packets on network links. This transmission time is proportional to the length of the packet. Thus,
by means of the analysis of Palencia and González, which can
determine the execution time demanded during a time interval,
we are able to determine the buffering demand arrived during
the interval.
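The fix-point iteration can be illustrated on the classic monoprocessor recurrence w = Ci + Σj∈hp(i) ⌈w/Tj⌉ · Cj (a simplified sketch in the spirit of, but much simpler than, the offset-based analysis of [PG98]; all names are ours):

```python
import math

def response_time(c_i, hp, horizon=10**6):
    """Iterate w = C_i + sum_j ceil(w / T_j) * C_j to its fix point.

    `hp` lists (C_j, T_j) pairs of the higher-priority tasks on the
    same resource.  Returns the converged worst-case response time,
    or infinity once the iterate exceeds `horizon` (unschedulable).
    """
    w = c_i
    while True:
        nxt = c_i + sum(math.ceil(w / t) * c for c, t in hp)
        if nxt == w:
            return w
        if nxt > horizon:
            return math.inf
        w = nxt

# Task with C = 3, preempted by tasks with (C, T) = (1, 4) and (2, 6):
print(response_time(3, [(1, 4), (2, 6)]))  # -> 10
```

For communication tasks, C plays the role of the packet transmission time on the link.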
Figure 10.7: Waiting time and buffer demand (occupied buffer space over time for the example: the demand estimated by the analysis is shown together with the real demand, with the arrivals of packets p2 –p5 marked)
The algorithm for the calculation of the buffer space demand
of an ingress buffer of an arbitrary network link is given in Figure 10.6. We explain the algorithm based on the following example.
Let us consider the following scenario. Prior to time moment
0, a 400MHz link is idle. The links convey the bits of a word in
parallel, with one word per cycle. At time moment 0, the first
word of a 6-word packet p1 arrives at the switch and is immediately conveyed on the link without buffering. The following
packets subsequently arrive at the switch and demand forwarding on the link: p2 , 5 words long, arrives at 5ns, p3 , 3 words
long, arrives at 10ns, p4 , 2 words long, arrives at 15.25ns, and
p5 , 1 word long, arrives at 17.5ns. Let us assume that a fictive
packet p0 of zero length and of very low priority arrived at time
0+ , i.e. immediately after time 0. We compute the worst-case
buffer space need based on the worst-case transmission time of
this fictive packet.
The scenario is shown in Figure 10.7. Time is shown on the
abscissa, while the sawtooth function shows the instantaneous
communication time backlog, and the solid step function shows
the instantaneous amount of occupied buffer space. The arrows
pointing from the steps in the backlog line to the shaded areas
show which message arrival causes the corresponding buffering.
The time interval during which the link is busy sending packets is called the busy period. In our example the busy period is
the interval [0, 42.5), as can be seen in the figure. The main part
of the algorithm in Figure 10.6 consists of a loop (lines 2–13). A
subinterval [t, t′) of the busy period is considered in each iteration of the loop. In the first iteration t = 0, while in iteration i, t takes the value of t′ of iteration i − 1, for all i > 1 (line 12). F and F′ are the times at which the link would be idle if it had to convey just the packets arrived sooner than or exactly at times t and t′ respectively (lines 1 and 4). t′, the upper limit of the interval under consideration in each iteration, is obtained as shown in line 3. For the moment, let us consider that next_t′ = F; we will discuss the rationale and other possible choices later in the section.
For our example, only packet p1 of 6 words is to be sent just after time 0. Hence, R0(0+) = 6 words / 0.4 words/ns = 15ns. The first iteration of the loop considers the interval [t = 0, t′ = F = R0(0+) = 15). We compute F′ = R0(t′ = 15) (line 4) and we get 35ns, i.e. the 15ns needed to convey the six words of packet p1 plus the 5 words / 0.4 words/ns = 12.5ns needed to convey packet p2 plus the 7.5ns needed to convey packet p3 (p2 and p3
having arrived in the interval [0, 15)). The time by which the
link would become idle if it had to convey just the packets arrived prior to t′ = 15ns is greater than t′. Hence, there are unexplored parts of the busy period left and the buffer space calculation is not yet over (lines 5–6). The packets that arrived between 0 and 15ns extended the busy period by F′ − F = 20ns, hence the number of newly arrived words is (F′ − F) × bw = 20ns × 0.4 words/ns = 8 words. The algorithm is unable to determine the exact
time moments when the 8 words arrived. Therefore, we assume
the worst possible moment from the perspective of the buffer
space demand. This moment is time t+ , i.e. immediately after
time t. The 8 words are latched at the next clock period after
time t+ = 0, i.e. at 2.5ns. b′, the amount of occupied buffer after latching, is b, the amount of occupied buffer at time t, plus the 8 words, minus possibly one word that could have been pumped out of the buffer between t and t + 2.5ns. During the time interval
[0, F1 = 15), where F1 is the time it takes to convey packet p1 ,
the words conveyed on the link belong to p1 , which is not stored.
Therefore, no parts of the buffer are freed in the interval [0, F1 )
(see line 7). If the required buffer space is larger than what has
been computed so far, the buffer space demand is updated (lines
8–9). Because no buffer space is freed during the interval [0, 15),
lines 10–11 are not executed in the first iteration of the loop.
The second iteration considers the interval [t = 15, t′ = 35). F = 35ns and F′ = 42.5ns in this case. Hence, (F′ − F) · bw = 7.5ns × 0.4 words/ns = 3 words arrived during the interval [15, 35). The three words are considered to have arrived at the
worst moment, i.e. at 15+ . They are latched at time 17.5ns when
b = 8 − 1, i.e. the 8 words that are stored in the buffer at 15ns
minus one word that is pumped out between 15 and 17.5ns. Thus
b′, the amount of occupied buffer at 17.5ns, is 8 − 1 + 3 = 10 (line
7). The value Buf is updated accordingly (lines 8–9). Between
15 and 35ns some words that were stored in the buffer are sent
on the link and therefore we have to account for the reduction
of the amount of occupied buffer. Thus, the amount of occupied
buffer at 35ns is equal to 8, the amount present at 15ns, plus
the 3 words that arrived between 15 and 35ns and minus the
(35 − 15)ns × 0.4 words/ns = 8 words that are conveyed on the link in the
interval [15, 35) (see lines 10–11).
The third iteration considers the interval [35, 42.5). As no new
packets arrive during this interval, t′ = R0(t′) = 42.5 and the algorithm has reached a fix-point and returns the value of Buf.
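The walkthrough above can be transcribed directly into a small program (a sketch of the numeric example rather than of the exact pseudocode of Figure 10.6: here R0 is computed from the packets known to arrive in the scenario, the occupancy update follows the arithmetic of the walkthrough, and exact rational arithmetic avoids rounding issues):

```python
from fractions import Fraction as Fr

BW = Fr(2, 5)  # 0.4 words/ns (400 MHz link, one word per cycle)

# (arrival time [ns], length [words]) of packets p1..p5
PACKETS = [(Fr(0), 6), (Fr(5), 5), (Fr(10), 3),
           (Fr(61, 4), 2), (Fr(35, 2), 1)]

def r0(t):
    """Time the link becomes idle if it conveys only the packets that
    arrived at or before t (the busy-period function of the example)."""
    return sum(Fr(words) / BW for arr, words in PACKETS if arr <= t)

def buffer_demand():
    buf = b = t = Fr(0)
    f = f1 = r0(t)                      # F and F1 of the algorithm
    while True:
        t_next = f                      # next_t' = F
        f_next = r0(t_next)
        if t_next == f_next:            # busy period fully explored
            return buf
        new = (f_next - f) * BW         # words arrived in [t, t_next)
        pumped = 1 if t >= f1 else 0    # one word drained before latching
        buf = max(buf, b + new - pumped)
        b = b + new - (t_next - max(t, f1)) * BW  # occupancy at t_next
        t, f = t_next, f_next

print(buffer_demand())  # -> 10
```

The run visits exactly the intervals of the walkthrough ([0, 15), [15, 35), then the fix point at 42.5ns) and returns the worst-case demand of 10 words.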
We will close the section with a discussion on next_t′, the complexity of the algorithm, and the trade-off between the algorithm execution speed and accuracy.
The actual amount of occupied buffer is shown as the thick solid line in Figure 10.7, while the amount estimated by the algorithm is shown as the thick dotted line. We observe that the analysis procedure produces a pessimistic result. This is because the analysis assumes that the new packets that arrive in the interval [t, t') always arrive at the worst possible moment, namely moment t+. If we partitioned the interval in which the link is busy sending packets into many shorter intervals, we could reduce the pessimism of the analysis, because fewer arrivals would be amassed at the same time moment. However, that would also imply invoking the function R' more often, which is computationally expensive. Thus, there exists a trade-off between analysis speed and pessimism, which is reflected in the choice of next_t' (line 3). A value closer to t leads to short intervals, i.e. less pessimism and slower analysis, while a value farther from t leads to longer intervals, i.e. a more pessimistic but possibly (not necessarily, as shown below) faster analysis.
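The effect of interval granularity can be illustrated numerically. The toy function below is ours, not the thesis algorithm (peak_occupancy and its arguments are hypothetical names): it lumps all arrivals of an analysis interval at the interval start, and splitting one interval into two visibly reduces the estimated peak.

```python
# Toy illustration of the pessimism/speed trade-off: the coarser the
# analysis intervals, the more arrivals are lumped at a single moment
# and the larger the estimated peak buffer occupancy becomes.

def peak_occupancy(arrivals, bw, boundaries):
    b = peak = 0.0
    for t, t_next in zip(boundaries, boundaries[1:]):
        b += sum(w for ta, w in arrivals if t <= ta < t_next)
        peak = max(peak, b)
        b = max(0.0, b - (t_next - t) * bw)   # words drained on the link
    return peak

arrivals = [(0, 4), (10, 4)]                  # four words at t=0 and t=10 ns
coarse = peak_occupancy(arrivals, 0.4, [0, 20])      # one interval: all lumped
fine = peak_occupancy(arrivals, 0.4, [0, 10, 20])    # two shorter intervals
```

Here the coarse partition estimates a peak of 8 words, while the finer one estimates only 4, at the cost of one extra iteration.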
In our experiments, we use next_t' = F, which is the finishing time of the busy period if no new packets arrive after time t. Choosing a value larger than F would risk overestimating the busy period. As a result, packets that arrive after the real finishing time of the busy period might wrongly be considered part of the current busy period. On the one hand, that leads to an overestimation of the buffer space; on the other hand, it increases the time until the loop in Figure 10.6 reaches its fix-point. In our experiments, choosing next_t' = 1.6 · F results in a 10.3% buffer overestimation and a 2.3× longer analysis time relative to the case when next_t' = F. Conversely, choosing smaller values for next_t' leads to reductions of at most 5.3% in the buffer space estimate, while the analysis time increases by up to 78.5%.
The algorithm has pseudo-polynomial complexity due to the calculation of the function R [PG98].
10.4 Experimental Results
We use a set of 225 synthetic applications in order to assess the efficiency of our approach to the CSBSDM problem. The applications consist of 27 to 79 tasks, mapped on a 4×4 NoC. The probability that a 110-bit packet traverses one network link unscrambled is 0.99, and the imposed lower bound on the message arrival probability is also 0.99. Because the implementation of the packet delay capability could excessively increase the complexity of the switches, we consider that traffic shaping is performed only at the source cores. This has the advantage of no hardware overhead.
10.4.1 Evaluation of the Solution to the CSBSDM Problem
For each application, we synthesized the communication using three approaches and determined the total buffer space demand obtained in each of the three cases. In the first case, we used the buffer space minimisation approach presented in this chapter. In the second case, we replaced the greedy heuristics described in Section 10.3.2 with Tabu Search based heuristics, which are assumed to generate close-to-optimal solutions provided that they are allowed to explore the design space for a very long time. In the third case, we deployed the communication synthesis approach presented in the previous chapter, in which we did not consider buffer space minimisation. The resulting total buffer space as a function of the number of tasks is shown in Figure 10.8 as the curves labelled "greedy", "tabu", and "no buffer minimisation", respectively.
Figure 10.8: Buffering space vs. number of tasks (y axis: total amount of needed buffer space; x axis: number of tasks, 30–80; curves: no buffer minimisation, Tabu search with and without traffic shaping, greedy with and without traffic shaping)
Figure 10.9: Run time comparison (y axis, logarithmic: design space exploration time [sec]; x axis: number of tasks, 30–80; curves: Tabu search, greedy)
Figure 10.10: Percentage of the number of feasible applications as a function of the NoC buffer capacity (y axis: percentage of applications that can be implemented with the given buffer space; x axis: total buffer space of the NoC, 10000–30000; curves: no buffer minimisation, Tabu search with and without traffic shaping, greedy with and without traffic shaping)
First, we observe that buffer space minimisation is worth pursuing, as it results in a 22.3% reduction of buffer space on average compared to the case when buffer space minimisation is neglected. Second, traffic shaping is an effective technique, reducing the buffer space demand by 14.2% on average relative to the approach based solely on communication mapping. Third, the greedy heuristic performs well, as it obtains results that are on average only 3.6% worse than the close-to-optimal Tabu Search. The running times of the Tabu Search based and the greedy heuristics, as measured on a 1533 MHz AMD Athlon processor, are shown in Figure 10.9. The greedy heuristic runs about two orders of magnitude faster (note the logarithmic scale of the y axis) than the Tabu Search based heuristic. Thus, we are able to synthesize the communication for applications of 79 tasks in 1'40'', while the Tabu Search based heuristic requires around 1h30' for applications of 59 tasks.
10.4.2 Evaluation of the Solution to the CSPBS Problem
We use 50 different 4×4 NoCs in order to assess the efficiency of our approach to the CSPBS problem. The total buffering capacities at the switches range between 9,000 and 30,000 bits, uniformly distributed among the switches. We map 200 applications, one at a time, each consisting of 40 tasks, on each of the 50 NoCs, and we attempt to synthesize the communication of the application such that no buffer overflows or deadline violations occur. For each NoC, we count the applications for which we succeeded in finding feasible solutions to the CSPBS problem. The percentage of applications for which feasible communication synthesis solutions were found is plotted as a function of the total buffer capacity of the NoC in Figure 10.10. The proposed heuristic soundly outperforms the approach that neglects the buffering aspect, as the percentage of found solutions is on average 53 percentage points higher in the former case than in the latter. Also, the deployment of traffic shaping raises the percentage of found solutions to the CSPBS problem by 18.5% compared to the case when no traffic shaping is deployed. The results of the greedy heuristic come within 9% of the results obtained by Tabu Search, while the greedy heuristic runs on average 25 times faster.
10.4.3 Real-Life Example: An Audio/Video Encoder
Finally, we applied our approach to the multimedia application described in Section 9.6.6 and depicted in Figures 9.11 and 9.12. The communication mapping heuristic reduced the total buffer space by 12.6% relative to the approach that synthesized the communication without attempting to reduce the total buffer space demand. Traffic shaping allowed for a further reduction of 31.8%, giving a total buffer space demand of 77.3kB.
10.5 Conclusions
In this chapter, we developed an approach to the worst-case buffer need analysis of time-constrained applications implemented on NoCs. Based on this analysis, we solved two related problems: (1) the total buffer space need minimisation for application-specific NoCs and (2) communication synthesis with imposed buffer space constraints. For both cases, we guarantee that the imposed deadlines and message arrival probability thresholds are satisfied. We have argued that traffic shaping is a powerful method for buffer space minimisation. We proposed two efficient greedy heuristics for the communication mapping and traffic shaping subproblems, and we presented experimental results that demonstrate the efficiency of the approach.
Part IV
Conclusions
Chapter 11
Conclusions
This thesis addresses several problems related to real-time systems with stochastic behaviour. Two sources of stochastic behaviour were considered. The first source, namely the stochastic task execution times, stems primarily from the application, although features of the hardware platform, such as cache replacement algorithms, may also influence it. The second source, namely the transient faults that may occur on the on-chip network links, is inherent in the hardware platform and the environment.
11.1 Applications with Stochastic Execution Times
In the area of real-time systems with stochastic task execution
times, we provide three different analysis approaches, each efficiently applicable in a different context. Additionally, we propose
a heuristic for deadline miss probability minimisation.
11.1.1 An Exact Approach for Deadline Miss Ratio Analysis
In Chapter 4, we proposed a method for the schedulability analysis of task sets with probabilistically distributed task execution times. Our method improves on the currently existing ones by providing exact solutions for larger and less restricted task sets. Specifically, we allow arbitrary continuous task execution time probability distributions, and we do not restrict our approach to one particular scheduling policy. Additionally, task dependencies are supported, as well as arbitrary deadlines.
The analysis of task sets under such generous assumptions
is made possible by three complexity management methods:
1. The exploitation of the PMI concept,
2. The concurrent construction and analysis of the stochastic
process, and
3. The usage of a sliding window of states, made possible by
the construction in topological order.
As the presented experiments demonstrate, the proposed
method can efficiently be applied to applications implemented
on monoprocessor systems.
11.1.2 An Approximate Approach for Deadline Miss Ratio Analysis
In Chapter 5, we presented an approach to the performance analysis of tasks with probabilistically distributed execution times, implemented on multiprocessor systems. The arbitrary probability distributions of the execution times are approximated with Coxian distributions, and the expanded underlying Markov chain is constructed in a memory-efficient manner exploiting the structural regularities of the chain. In this way, we have practically pushed the solution of an extremely complex problem to its limits. Our approach also allows a trade-off between time and memory complexity on one side and solution accuracy on the other. The efficiency of the approach has been investigated by means of experiments. The factors that influence the analysis complexity, and their quantitative impact on the analysis resource demands, have been discussed. Additional extensions of the problem formulation and their impact on complexity have also been illustrated.
11.1.3 Minimisation of Deadline Miss Ratios
In Chapter 6, we addressed the problem of design optimisation of soft real-time systems with stochastic task execution times under deadline miss ratio constraints. The contribution is threefold:
1. We have shown that methods considering fixed execution
time models are unsuited for this problem.
2. We presented a design space exploration strategy based on
tabu search for task mapping and priority assignment.
3. We introduced a fast and approximate analysis for guiding
the design space exploration.
Experiments demonstrated the efficiency of the proposed approach.
11.2 Transient Faults of Network-on-Chip Links
The contribution of this thesis in the area of network-on-chip communication in the presence of transient faults on the links is fourfold. First, we present a way to intelligently combine spatial and temporal redundancy in communication for response time reduction and energy minimisation. Second, we provide an analysis algorithm that determines the amount of buffers needed at the network switches. Third, we present a heuristic algorithm for minimising the buffer space demand of applications. Fourth, we propose a heuristic algorithm for communication mapping under buffer space constraints.
11.2.1 Time-Constrained Energy-Efficient Communication Synthesis
In Chapter 9, we presented an approach to reliable, low-energy on-chip communication for time-constrained applications implemented on NoC. The contribution is manifold:
1. We showed how to generate supports for message communication in order to meet the message arrival probability constraint and to minimise communication energy.
2. We gave a heuristic for selecting the most promising communication supports with respect to application responsiveness and energy.
3. We modelled the fault-tolerant application for response
time analysis.
4. We presented experiments demonstrating the proposed approach.
11.2.2 Communication Buffer Minimisation
In Chapter 10, we developed an approach to the worst-case buffer need analysis of time-constrained applications implemented on NoCs. Based on this analysis, we solved two related problems:
1. the total buffer space need minimisation for application-specific NoCs, and
2. the communication synthesis with imposed buffer space constraints.
For both cases, we guarantee that the imposed deadlines and message arrival probability thresholds are satisfied. We argued that traffic shaping is a powerful method for buffer space minimisation. We proposed two efficient greedy heuristics for the communication mapping and traffic shaping subproblems, and we presented experimental results that demonstrate the efficiency of the approach.
Appendix A
Abbreviations
AA — Approximate Analysis
BCET — Best-Case Execution Time
CGPN — Concurrent Generalised Petri Nets
CM — Communication Mapping
CS — Communication Support
CSBSDM — Communication Synthesis with Buffer Space Demand Minimisation
CSPBS — Communication Synthesis with Predefined Buffer Space
CTMC — Continuous Time Markov Chain
ECE — Expected Communication Energy
ENS — Exhaustive Neighbourhood Search
ETPDF — Execution Time Probability Density Function
GRD — General Redundancy Degree
GSMP — Generalised Semi-Markov Process
GSPN — Generalised Stochastic Petri Net
LCM — Least Common Multiple
LO-AET — Laxity Optimisation based on Average Execution Times
MAP — Message Arrival Probability
MRGP — Markov Regenerative Process
MRSPN — Markov Regenerative Stochastic Petri Net
PA — High-Complexity Performance Analysis
PMI — Priority Monotonicity Interval
RNS — Restricted Neighbourhood Search
SRD — Spatial Redundancy Degree
TRD — Temporal Redundancy Degree
TRG — Tangible Reachability Graph
TS — Traffic Shaping
WCET — Worst-Case Execution Time
WCRT — Worst-Case Response Time
Bibliography
[AB98] A. Atlas and A. Bestavros. Statistical rate monotonic scheduling. In Proceedings of the 19th IEEE Real-Time Systems Symposium, pages 123–132, 1998.

[AB99] L. Abeni and G. Buttazzo. QoS guarantee using probabilistic deadlines. In Proceedings of the 11th Euromicro Conference on Real-Time Systems, pages 242–249, 1999.

[ABD+91] N. C. Audsley, A. Burns, R. I. Davis, K. W. Tindell, and A. J. Wellings. Hard real-time scheduling: The deadline monotonic approach. In Proceedings of the 8th IEEE Workshop on Real-Time Operating Systems and Software, pages 133–137, 1991.

[ABD+95] N. C. Audsley, A. Burns, R. I. Davis, K. W. Tindell, and A. J. Wellings. Fixed priority pre-emptive scheduling: An historical perspective. Journal of Real-Time Systems, 8(2–3):173–198, March–May 1995.

[ABR+93] N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings. Applying new scheduling theory to static priority pre-emptive scheduling. Software Engineering Journal, 8(5):284–292, 1993.

[ABRW93] N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings. Incorporating unbounded algorithms into predictable real-time systems. Computer Systems Science and Engineering, 8(3):80–89, 1993.

[AKK+00] K. Aingaran, F. Klass, C. M. Kim, C. Amir, J. Mitra, E. You, J. Mohd, and S. K. Dong. Coupling noise analysis for VLSI and ULSI circuits. In Proceedings of IEEE ISQED, pages 485–489, 2000.

[ASE+04] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. Al-Hashimi. Simultaneous communication and processor voltage scaling for dynamic and leakage energy reduction in time-constrained systems. In Proc. of ICCAD, 2004.

[Aud91] N. C. Audsley. Optimal priority assignment and feasibility of static priority tasks with arbitrary start times. Technical Report YCS 164, Department of Computer Science, University of York, December 1991.

[BBB01] E. Bini, G. Buttazzo, and G. Buttazzo. A hyperbolic bound for the rate monotonic algorithm. In Proceedings of the 13th Euromicro Conference on Real-Time Systems, pages 59–66, 2001.

[BBB03] A. Burns, G. Bernat, and I. Broster. A probabilistic framework for schedulability analysis. In R. Alur and I. Lee, editors, Proceedings of the Third International Embedded Software Conference, EMSOFT, number LNCS 2855 in Lecture Notes in Computer Science, pages 1–15, 2003.

[BBD02] D. Bertozzi, L. Benini, and G. De Micheli. Low power error resilient encoding for on-chip data buses. In Proc. of DATE, pages 102–109, 2002.

[BBRN02] I. Broster, A. Burns, and G. Rodriguez-Navas. Probabilistic analysis of CAN with faults. In Proceedings of the 23rd Real-Time Systems Symposium, 2002.

[BCFR87] G. Balbo, G. Chiola, G. Franceschinis, and G. M. Roet. On the efficient construction of the tangible reachability graph of Generalized Stochastic Petri Nets. In Proceedings of the 2nd Workshop on Petri Nets and Performance Models, pages 85–92, 1987.

[BCKD00] P. Buchholtz, G. Ciardo, P. Kemper, and S. Donatelli. Complexity of memory-efficient Kronecker operations with applications to the solution of Markov models. INFORMS Journal on Computing, 13(3):203–222, 2000.

[BCP02] G. Bernat, A. Colin, and S. Petters. WCET analysis of probabilistic hard real-time systems. In Proceedings of the 23rd Real-Time Systems Symposium, pages 279–288, 2002.

[BD02] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35(1):70–78, 2002.

[BIGA04] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture and design process for networks-on-chip. Journal of Systems Architecture, 50:105–128, 2004.

[Bla76] J. Blazewicz. Scheduling dependent tasks with different arrival times to meet deadlines. In E. Gelenbe and H. Bellner, editors, Modeling and Performance Evaluation of Computer Systems. North-Holland, Amsterdam, 1976.

[Bos91] Robert Bosch GmbH, Postfach 50, D-7000 Stuttgart 1, Germany. CAN Specification, 1991.

[BPSW99] A. Burns, S. Punnekkat, L. Strigini, and D. R. Wright. Probabilistic scheduling guarantees for fault-tolerant real-time systems. In Proceedings of the 7th International Working Conference on Dependable Computing for Critical Applications, pages 339–356, 1999.

[But97] Giorgio C. Buttazzo. Hard Real-Time Computing Systems. Kluwer Academic, 1997.

[BW94] A. Burns and A. Wellings. Real-Time Systems and Their Programming Languages. Addison Wesley, 1994.

[CKT94] H. Choi, V. G. Kulkarni, and K. S. Trivedi. Markov Regenerative Stochastic Petri Nets. Performance Evaluation, 20(1–3):337–357, 1994.

[Cox55] D. R. Cox. A use of complex probabilities in the theory of stochastic processes. In Proceedings of the Cambridge Philosophical Society, pages 313–319, 1955.
[CXSP04] V. Chandra, A. Xu, H. Schmit, and L. Pileggi. An interconnect channel design methodology for high performance integrated circuits. In Proceedings of the Conference on Design Automation and Test in Europe, page 21138, 2004.

[Dal99] W. Dally. Interconnect-limited VLSI architecture. In IEEE Conference on Interconnect Technologies, pages 15–17, 1999.

[Dav81] M. Davio. Kronecker products and shuffle algebra. IEEE Transactions on Computing, C-30(2):1099–1109, 1981.

[DGK+02] J. L. Díaz, D. F. García, K. Kim, C.-G. Lee, L. Lo Bello, J. M. López, S. L. Min, and O. Mirabella. Stochastic analysis of periodic real-time systems. In Proceedings of the 23rd Real-Time Systems Symposium, 2002.

[Die00] K. Diefenderhoff. Extreme lithography. Microprocessor Report, 6(19), 2000.

[dJG00] G. de Veciana, M. Jacome, and J.-H. Guo. Assessing probabilistic timing constraints on system performance. Design Automation for Embedded Systems, 5(1):61–81, February 2000.

[DLS01] B. Doytchinov, J. P. Lehoczky, and S. Shreve. Real-time queues in heavy traffic with earliest-deadline-first queue discipline. Annals of Applied Probability, 11:332–378, 2001.

[DM03] T. Dumitraş and R. Mărculescu. On-chip stochastic communication. In Proc. of DATE, 2003.

[DRGR03] J. Dielissen, A. Rădulescu, K. Goossens, and E. Rijpkema. Concepts and implementation of the Philips network-on-chip. In IP-Based SoC Design, 2003.

[DW93] J. G. Dai and Y. Wang. Nonexistence of Brownian models for certain multiclass queueing networks. Queueing Systems, 13:41–46, 1993.

[Ele02] P. Eles. System design and methodology, 2002. http://www.ida.liu.se/~TDTS30/.

[Ern98] R. Ernst. Codesign of embedded systems: Status and trends. IEEE Design and Test of Computers, pages 45–54, April–June 1998.

[ETS] European Telecommunications Standards Institute. http://www.etsi.org/.

[Fid98] C. J. Fidge. Real-time schedulability tests for preemptive multitasking. Journal of Real-Time Systems, 14(1):61–93, 1998.

[FJ98] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1381–1384, 1998.

[Gar99] M. K. Gardner. Probabilistic Analysis and Scheduling of Critical Soft Real-Time Systems. PhD thesis, University of Illinois at Urbana-Champaign, 1999.

[Gau98] H. Gautama. A probabilistic approach to the analysis of program execution time. Technical Report 168340-44(1998)06, Faculty of Information Technology and Systems, Delft University of Technology, 1998.

[GI99] A. Goel and P. Indyk. Stochastic load balancing and related problems. In IEEE Symposium on Foundations of Computer Science, pages 579–586, 1999.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability. Freeman, 1979.

[GKL91] M. González Harbour, M. H. Klein, and J. P. Lehoczky. Fixed priority scheduling of periodic tasks with varying execution priority. In Proceedings of the IEEE Real Time Systems Symposium, pages 116–128, 1991.

[GL94] R. German and C. Lindemann. Analysis of stochastic Petri Nets by the method of supplementary variables. Performance Evaluation, 20(1–3):317–335, 1994.
[GL99] M. K. Gardner and J. W. S. Liu. Analyzing Stochastic Fixed-Priority Real-Time Systems, pages 44–58. Springer, 1999.

[Glo89] F. Glover. Tabu search—Part I. ORSA Journal on Computing, 1989.

[Gly89] P. W. Glynn. A GSMP formalism for discrete-event systems. In Proceedings of the IEEE, volume 77, pages 14–23, 1989.

[Gv00] H. Gautama and A. J. C. van Gemund. Static performance prediction of data-dependent programs. In Proceedings of the 2nd International Workshop on Software and Performance, pages 216–226, September 2000.

[HM04a] J. Hu and R. Mărculescu. Application-specific buffer space allocation for networks-on-chip router design. In Proc. of ICCAD, 2004.

[HM04b] J. Hu and R. Mărculescu. Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. In Proceedings of the Design Automation and Test in Europe Conference, page 10234, 2004.

[HM05] J. Hu and R. Mărculescu. Energy- and performance-aware mapping for regular NoC architectures. IEEE Transactions on CAD of Integrated Circuits and Systems, 24(4), 2005.

[HMC97] S. Haddad, P. Moreaux, and G. Chiola. Efficient handling of phase-type distributions in Generalized Stochastic Petri Nets. In 18th International Conference on Application and Theory of Petri Nets, 1997.

[HN93] J. M. Harrison and V. Nguyen. Brownian models of multiclass queueing networks: Current status and open problems. Queueing Systems, 13:5–40, 1993.

[HZS01] X. S. Hu, T. Zhou, and E. H.-M. Sha. Estimating probabilistic timing performance for real-time embedded systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9(6):833–844, December 2001.

[Int93] International Organization for Standardization (ISO). ISO/IEC 11172-3:1993 – Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s – Part 3: Audio, 1993. http://www.iso.org/.

[Int05] International Telecommunication Union (ITU). H.263 – Video coding for low bit rate communication, 2005. http://www.itu.int/publications/itu-t/.

[JMEP00] R. Jigorea, S. Manolache, P. Eles, and Z. Peng. Modelling of real-time embedded systems in an object-oriented design environment with UML. In Proceedings of the 3rd IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC00), pages 210–213, March 2000.

[KJS+02] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä, and A. Hemani. A network on chip architecture and design methodology. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, April 2002.

[KK79] P. Kermani and L. Kleinrock. Virtual Cut-Through: A new computer communication switching technique. Computer Networks, 3(4):267–286, 1979.

[Kle64] L. Kleinrock. Communication Nets: Stochastic Message Flow and Delay. McGraw-Hill, 1964.

[KM98] A. Kalavade and P. Moghé. A tool for performance estimation of networked embedded end-systems. In Proceedings of the 35th Design Automation Conference, pages 257–262, 1998.

[Kop97] H. Kopetz. Real-Time Systems. Kluwer Academic, 1997.

[KRT00] J. Kleinberg, Y. Rabani, and E. Tardos. Allocating bandwidth for bursty connections. SIAM Journal on Computing, 30(1):191–217, 2000.

[KS96] J. Kim and K. G. Shin. Execution time analysis of communicating tasks in distributed systems. IEEE Transactions on Computers, 45(5):572–579, May 1996.
[KS97] C. M. Krishna and K. G. Shin. Real-Time Systems. McGraw-Hill, 1997.

[LA97] Y. A. Li and J. K. Antonio. Estimating the execution time distribution for a task graph in a heterogeneous computing system. In Proceedings of the Heterogeneous Computing Workshop, 1997.

[Leh96] J. P. Lehoczky. Real-time queueing theory. In Proceedings of the 18th Real-Time Systems Symposium, pages 186–195, December 1996.

[Leh97] J. P. Lehoczky. Real-time queueing network theory. In Proceedings of the 19th Real-Time Systems Symposium, pages 58–67, December 1997.

[Lin98] Ch. Lindemann. Performance Modelling with Deterministic and Stochastic Petri Nets. John Wiley and Sons, 1998.

[Liu94] D. Liu et al. Power consumption estimation in CMOS VLSI chips. IEEE Journal of Solid-State Circuits, (29):663–670, 1994.

[LL73] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20(1):47–61, January 1973.

[LSD89] J. Lehoczky, L. Sha, and Y. Ding. The rate monotonic scheduling algorithm: Exact characterization and average case behaviour. In Proceedings of the 11th Real-Time Systems Symposium, pages 166–171, 1989.

[LW82] J. Y. T. Leung and J. Whitehead. On the complexity of fixed-priority scheduling of periodic, real-time tasks. Performance Evaluation, 2(4):237–250, 1982.

[Man04] S. Manolache. Fault-tolerant communication on network-on-chip. Technical report, Linköping University, 2004.

[MD04] S. Murali and G. De Micheli. Bandwidth-constrained mapping of cores onto NoC architectures. In Proceedings of the Conference on Design Automation and Test in Europe, 2004.

[MEP] S. Manolache, P. Eles, and Z. Peng. An approach to performance analysis of multiprocessor applications with stochastic task execution times. Submitted for publication.

[MEP01] S. Manolache, P. Eles, and Z. Peng. Memory and time-efficient schedulability analysis of task sets with stochastic execution time. In Proceedings of the 13th Euromicro Conference on Real Time Systems, pages 19–26, June 2001.

[MEP02] S. Manolache, P. Eles, and Z. Peng. Schedulability analysis of multiprocessor real-time applications with stochastic task execution times. In Proceedings of the 20th International Conference on Computer Aided Design, pages 699–706, November 2002.

[MEP04a] S. Manolache, P. Eles, and Z. Peng. Optimization of soft real-time systems with deadline miss ratio constraints. In Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium, pages 562–570, 2004.

[MEP04b] S. Manolache, P. Eles, and Z. Peng. Schedulability analysis of applications with stochastic task execution times. ACM Transactions on Embedded Computing Systems, 3(4):706–735, 2004.

[MEP05] S. Manolache, P. Eles, and Z. Peng. Fault- and energy-aware communication mapping with guaranteed latency for applications implemented on NoC. In Proc. of DAC, 2005.

[MEP06] S. Manolache, P. Eles, and Z. Peng. Buffer space optimisation with communication synthesis and traffic shaping for NoCs. In Proceedings of the Conference on Design Automation and Test in Europe, March 2006.

[MP92] M. Mouly and M.-B. Pautet. The GSM System for Mobile Communication. Palaiseau, 1992.

[MR93] M. Malhotra and A. Reibman. Selecting and implementing phase approximations for semi-Markov models. Stochastic Models, 9(4):473–506, 1993.
[PF91]
B. Plateau and J-M. Fourneau. A methodology for
solving Markov models of parallel systems. Journal
of Parallel and Distributed Computing, 12(4):370–
387, 1991.
[PG98]
J. C. Palencia Gutiérrez and M. González Harbour.
Schedulability analysis for tasks with static and dynamic offsets. In Proceedings of the 19th IEEE Real
Time Systems Symposium, pages 26–37, December
1998.
[PKH01]
E. L. Plambeck, S. Kumar, and J. M. Harrison. A
multiclass queue in heavy traffic with throughput
time constraints: Asymptotically optimal dynamic
controls. Queueing Systems, 39(1):23–54, September
2001.
[PLB+ 04] M. Pirretti, G. M. Link, R. R. Brooks, N. Vijaykrishnan, M. Kandemir, and Irwin M. J. Fault tolerant algorithms for network-on-chip interconnect. In Proc.
of the ISVLSI, 2004.
[PST98]
A. Puliafito, M. Scarpa, and K.S. Trivedi. Petri Nets
with k-simultaneously enabled generally distributed
timed transitions. Performance Evaluation, 32(1):1–
34, February 1998.
[PTVF92] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and
B. P. Flannery. Numerical Recipes in C. Cambridge
University Press, 1992.
[RE02]
K. Richter and R. Ernst. Event model interfaces for
heterogeneous system analysis. In Proc. of DATE,
2002.
[Ros70]
S.M. Ross. Applied Probability Models with Optimization Applications. Holden-Day, 1970.
[SAN03]
I. Saastamoinen, M. Alho, and J. Nurmi. Buffer implementation for proteo network-on-chip. In Proceedings of the 2003 International Symposium on
Circuits and Systems, volume 2, pages II 113–II 116,
2003.
BIBLIOGRAPHY
213
[Sch03] M. Schmitz. Energy Minimisation Techniques for Distributed Embedded Systems. PhD thesis, Dept. of Computer and Electrical Engineering, Univ. of Southampton, UK, 2003.
[SGL97] J. Sun, M. K. Gardner, and J. W. S. Liu. Bounding completion times of jobs with arbitrary release times, variable execution times, and resource sharing. IEEE Transactions on Software Engineering, 23(10):604–615, October 1997.
[She93] G. S. Shedler. Regenerative Stochastic Simulation. Academic Press, 1993.
[SL95] J. Sun and J. W. S. Liu. Bounding the end-to-end response time in multiprocessor real-time systems. In Proceedings of the Workshop on Parallel and Distributed Real-Time Systems, pages 91–98, April 1995.
[SN96] K. Shepard and V. Narayanan. Noise in deep submicron digital design. In Proc. of ICCAD, pages 524–531, 1996.
[SS94] M. Spuri and J. A. Stankovic. How to integrate precedence constraints and shared resources in real-time scheduling. IEEE Transactions on Computers, 43(12):1407–1412, December 1994.
[Sun97] J. Sun. Fixed-Priority End-to-End Scheduling in Distributed Real-Time Systems. PhD thesis, University of Illinois at Urbana-Champaign, 1997.
[TC94] K. Tindell and J. Clark. Holistic schedulability analysis for distributed real-time systems. Euromicro Journal on Microprocessing and Microprogramming (Special Issue on Parallel Embedded Real-Time Systems), 40:117–134, 1994.
[TDS+95] T.-S. Tia, Z. Deng, M. Shankar, M. Storch, J. Sun, L.-C. Wu, and J. W. S. Liu. Probabilistic performance guarantee for real-time tasks with varying computation times. In Proceedings of the IEEE Real-Time Technology and Applications Symposium, pages 164–173, May 1995.
[TTT99] TTTech Computertechnik AG, Schönbrunner Straße, A-1040 Vienna, Austria. TTP/C Protocol, 1999.
[van96] A. J. van Gemund. Performance Modelling of Parallel Systems. PhD thesis, Delft University of Technology, 1996.
[van03] A. J. C. van Gemund. Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems, 2003. To be published.
[Wil98] R. J. Williams. Diffusion approximations for open multiclass queueing networks: Sufficient conditions involving state space collapse. Queueing Systems, 30:27–88, 1998.
[YBD02] T. T. Ye, L. Benini, and G. De Micheli. Analysis of power consumption on switch fabrics in network routers. In Proc. of DAC, pages 524–529, 2002.
[ZHS99] T. Zhou, X. (S.) Hu, and E. H.-M. Sha. A probabilistic performance metric for real-time system design. In Proceedings of the 7th International Workshop on Hardware-Software Co-Design, pages 90–94, 1999.
Department of Computer and Information Science
Linköpings universitet
Dissertations
Linköping Studies in Science and Technology
No 983 Sorin Manolache: Analysis and Optimisation of Real-Time Systems with Stochastic Behaviour, 2005, ISBN 91-85457-60-4.