Linköpings universitet
Institutionen för datavetenskap
Department of Computer and Information Science
SE-581 83 Linköping, Sweden
Final thesis
Porting a Real-Time Operating System to a Multicore Platform
by
Sixten Sjöström Thames
LIU-IDA/LITH-EX-A--12/009--SE
2012-02-07
Supervisor: Sergiu Rafiliu
Examiner: Petru Ion Eles
Abstract
This thesis is part of the European MANY project. The goal of MANY is to provide developers with tools to develop software for multi- and many-core hardware platforms. This is the first thesis within MANY at Enea. It aims to build a knowledge base about software on many-core at the Enea student research group. Beyond building a knowledge base, part of the thesis is also to port Enea's operating system OSE to Tilera's many-core processor TILEpro64. The thesis shall also investigate the memory hierarchy and interconnection network of the Tilera processor.
The knowledge base about software on many-core was constrained to investigating the shared memory model and operating systems for many-core. This was achieved by studying prominent academic research about operating systems for many-core processors. The conclusion was that the shared memory model does not scale, and that operating systems for many-core shall be designed with scalability as one of the most important requirements.
This thesis has implemented the hardware abstraction layer required to execute a single-core version of OSE on the TILEpro architecture. This was done in three steps. First, the Tilera hardware and the OSE software platform were investigated. After that, an existing OSE target port was chosen as reference architecture. Finally, the hardware-dependent parts of the reference software were modified. A foundation has been made for future development.
Acknowledgments
My deepest gratitude goes to Patrik Strömblad for guiding me during the whole
project. Patrik has enlightened me about multi-core and has provided valuable
advice during the porting process.
I thank my supervisors Barbro and Detlef for giving me the chance to work on
this project. I really appreciate their moral support and guidance.
I would like to thank the employees at Enea who gave me guidance and good
company, especially Johan Wiezell who explained the details of porting OSE.
Finally, I thank my girlfriend Bibbi, who has supported and encouraged me during the thesis work.
Contents

1 Introduction
  1.1 Thesis Background
  1.2 Problem Statement
    1.2.1 Target Interface and Board Support Package
    1.2.2 Memory Hierarchy and Network-On-Chip
    1.2.3 Shared Memory Multi Processing
  1.3 Method
  1.4 Limitations

2 Background
  2.1 ITEA2 - MANY
  2.2 Multicore Architecture
    2.2.1 Heterogeneous Multi-Core
    2.2.2 Homogeneous Multi-Core
    2.2.3 Memory Architecture
  2.3 Software Parallelism
    2.3.1 Bit-Level Parallelism
    2.3.2 Instruction Level Parallelism
    2.3.3 Data parallelism
    2.3.4 Task Parallelism
  2.4 Software Models
    2.4.1 Symmetric Multiprocessing
    2.4.2 Asymmetric Multiprocessing

3 Enea's OSE
  3.1 Architecture Overview
  3.2 Load Modules, Domains and Processes
  3.3 OSE for Multi-Core
    3.3.1 Migration and Load Balancing
  3.4 Hardware Abstraction Layer
  3.5 Conclusion

4 Tilera's TILEpro64
  4.1 Architecture Overview
  4.2 Interconnection Network - iMesh
    4.2.1 Interconnection Hardware
    4.2.2 The Networks
    4.2.3 Protecting the Network
    4.2.4 Deadlocks
    4.2.5 iLib
    4.2.6 Conclusions about the Interconnection Network
  4.3 Memory Hierarchy
    4.3.1 Memory Homing
    4.3.2 Dynamic Distributed Cache
    4.3.3 Conclusions about the Memory Architecture
  4.4 Tools and Software Stack
  4.5 Tilera Application Binary Interface

5 Software on Many-Core
  5.1 Scalability Issues with SMP Operating Systems
    5.1.1 Locking the kernel
    5.1.2 Sharing Cache and TLBs between Application and OS
    5.1.3 Dependency on Effective Cache Coherency
    5.1.4 Scalable SMP Systems
  5.2 Operating Systems for Many-Core
    5.2.1 Design Principles: Factored Operating Systems
    5.2.2 Design Principles: Barrelfish
    5.2.3 Conclusions from Investigating fos
    5.2.4 Conclusions from Investigating Barrelfish
  5.3 Conclusions
    5.3.1 Distributed Architectures are Scalable
    5.3.2 One Thread - One Core
    5.3.3 IPC with Explicit Message Passing
    5.3.4 Example of a Many-Core OS
    5.3.5 Enea OSE and Many-Core

6 Porting Enea OSE to TILEpro64
  6.1 Milestones
    6.1.1 Milestone 1 - Build environment
    6.1.2 Milestone 2 - Launch OSE and write into a Ramlog
    6.1.3 Milestone 3 - Get OSE into a safe state
    6.1.4 Milestone 4 - Full featured single-core version of OSE on TILEpro64
    6.1.5 Milestone 5 - Full featured multi-core version of OSE on TILEpro64
  6.2 MS1 - Build Environment
    6.2.1 Omnimake
    6.2.2 Requirements and Demonstration
    6.2.3 Work Approach
  6.3 MS2 - Coresys
    6.3.1 Implemented Parts
    6.3.2 Design Decisions
    6.3.3 Requirements and Demonstration
    6.3.4 Work Approach
  6.4 MS3 - Get OSE into a safe state
    6.4.1 Design Decisions
    6.4.2 Implemented Parts
    6.4.3 Requirements and Demonstration
    6.4.4 Work Approach

7 Conclusions, Discussion and Future Work
  7.1 Conclusions from the Theoretical Study
  7.2 Results and Future Work
    7.2.1 Future Work - Theoretical
    7.2.2 Future Work - Implementation

Bibliography

8 Demonstration Application and Output
  8.1 Demonstration Application
  8.2 Demonstration Application Output
Chapter 1
Introduction
1.1 Thesis Background
This thesis work has been conducted at ENEA AB, a global software and services company with focus on solutions for communication-driven products. The thesis is part of the MANY project (Many-core programming and resource management for high-performance Embedded Systems), hosted by ITEA2 (Information Technology for European Advancement).

As predicted by Moore's law, the number of transistors per chip doubles approximately every 18 months. Because sequential single-core processors are unable to deliver performance gains proportional to the increased number of transistors, multi-core processors are now standard in basically all domains [1][2]. In the near future, hundreds of cores are to be expected in various embedded devices [1]. As the number of cores per chip grows, developing software becomes a more complex task [3]. To scale well when the number of cores increases, software has to be rewritten to execute in parallel. Many-core hardware systems (multi-core processors with at least dozens of cores) are soon expected to be mainstream in the embedded segment, and software development has to adapt to this [1]. Demands on shorter time-to-market and the complexity of parallel software make it necessary to provide developers with good tools. The MANY project addresses this issue and has the objective to provide developers with an efficient programming environment [4]. This master thesis focuses on porting OSE ME (Enea OSE Multicore Real-Time Operating System) to a many-core platform.
1.2 Problem Statement
This thesis required a pre-study of software on many-core, an investigation of the chosen many-core hardware platform (TILEpro64) and an investigation of the software platform (OSE) that was ported. To achieve this, a number of subjects had to be studied in detail. A good understanding of the OSE ME architecture and the target platform architecture was necessary.
1.2.1 Target Interface and Board Support Package
A new OSE target interface and a BSP (Board Support Package) had to be developed for the TILEpro64 architecture. This is a demanding task that requires a good understanding of both the TILEpro64 architecture and the OSE architecture. A build environment for the new architecture had to be implemented as well. A question asked in the project specification was: To what extent can code be reused?
1.2.2 Memory Hierarchy and Network-On-Chip
Each core on the TILEpro64 processor has an integrated L1 and L2 cache. The L3 cache is distributed among the tiles. On top of this there are four DDR2 controllers connected to the iMesh. This memory hierarchy had to be investigated. The following question was asked in the project specification: How does the memory hierarchy of TILE64 cope with the demands of an RTOS? The TILE64 processor has 64 cores (also referred to as tiles) connected by an iMesh (intelligent mesh) on-chip network. An important part of the pre-study was to investigate interfaces for communication between tiles. The following question was asked in the project specification: What implications does the iMesh network have on a Real-Time Operating System?
1.2.3 Shared Memory Multi Processing
As stated above, developers need to be provided with an efficient programming environment. The OSE programming model uses an asynchronous message passing model for IPC (Inter-Process Communication). There is also a shared memory model available with POSIX (Portable Operating System Interface) threads. This thesis had to investigate what options are available and most suitable among solutions based on a shared memory model. The following question was asked in the project specification: Is it desirable to develop a shared memory tool for parallel computing?
1.3 Method
The work was organized in two phases, consisting of theoretical research and implementation. During phase 1 (covering the first 10 weeks), software for many-core was investigated. Tilera and Enea documentation was also studied in detail. The result of phase 1 was a half-time presentation and a half-time report.
In phase 2 (the final 10 weeks), the porting took place. This meant implementing an OSE target interface and a BSP for the TILEpro64 processor. It was not expected that the complete operating system would be ported, but a prototype and a foundation for future thesis projects had to be implemented.
1.4 Limitations
The time limit for this study was 20 weeks, in which all the literature study, implementation, report and presentation had to be completed.
Chapter 2
Background
This chapter introduces some basic concepts that are necessary when describing
operating systems for many-core.
Multicore processors are common nowadays and have been shipped with desktop PCs for almost a decade. Processors such as Tilera's TILEpro64 have dozens of cores and are referred to as many-core processors. It is a common assumption that a single chip will contain as many as 1000 cores within the next decade [1][5]. The reason for this dramatic increase in core count is the demand for higher performance and lower power consumption. Before multicore processors appeared, performance was improved by increasing the frequency, utilizing instruction level parallelism and increasing cache sizes. This, however, came to an end [2]. The main reasons are listed below.
• When the frequency is increased, the power consumption also increases. This is not acceptable when power consumption is a main requirement. The following equation shows the relation between power and frequency.

  power = capacitance × voltage² × frequency
• Superscalar techniques do not scale with frequency. The increased frequency may demand that a pipeline has to be stalled or that an additional stage needs to be added. This reduces the benefits of the increased frequency.
• The off-chip memory and I/O subsystems work at a lower frequency and tend to stall the processor. This has been countered by increasing cache sizes. However, making the cache bigger requires more silicon, which implies higher power consumption.
2.1 ITEA2 - MANY
MANY [4] is the name of the European project hosted by ITEA2, which this
thesis is part of. The objective of MANY is to provide embedded system developers
with a programming environment for many-core.
2.2 Multicore Architecture
This subsection contains a brief background on multi-core architecture.
2.2.1 Heterogeneous Multi-Core
Heterogeneous multicore systems are SoCs (Systems on Chip) containing processors with different instruction sets. This is common in embedded systems where, for example, a general purpose processor provides a user interface and controls special purpose hardware. The special purpose hardware can be a digital signal processor or an FPGA. This architecture provides both challenges and advantages. One challenge is that the same OS image cannot be executed on all cores. A big advantage, however, is that performance can be improved by using special purpose hardware.
Figure 2.1 shows a heterogeneous system.
Figure 2.1. Heterogeneous Multicore
2.2.2 Homogeneous Multi-Core
This is the most common architecture in desktop systems. In a homogeneous multicore processor all cores have the same architecture. A homogeneous system containing only a few cores, together executing an SMP (Symmetric Multiprocessing) operating system, provides a convenient environment for the developer, since the operating system is able to provide a lot of abstraction. Figure 2.2 shows a homogeneous system.
Figure 2.2. Homogeneous Multicore
2.2.3 Memory Architecture
There are numerous memory architectures for multicore. This section describes
the two basic concepts.
Distributed Memory
Distributed memory is when all cores have their own private memory. Communication is done by messages or streams over high-speed interconnection hardware.
Shared Memory
Shared memory is when the cores share the main memory. Communication between cores is done through the memory. The cores typically use private caches, which means that some kind of hardware has to make sure that the memory is consistent throughout the system.
Cache Coherency

In a shared memory system where cores use a private cache, it is important to make sure that the shared resources are consistent throughout the caches. For CMPs (Chip Multiprocessors) this is guaranteed by a cache coherence protocol implemented in hardware. There are two main types of cache coherency protocols.
Bus snooping can be used on a shared-bus CMP. One solution is that the private caches are implemented as write-through. When one core writes to the main memory, the cache coherency hardware at each core monitors the bus and invalidates its own copy of the data if it is located in the cache. The bus as an interconnection network does not scale to many-core, which excludes snooping as a cache coherence protocol for many-core.
Directory based coherence is when a central directory keeps track of all data that is being shared between the cores. Communication with the main memory goes through the directory. This creates a bottleneck, and bottlenecks do not scale.
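To make the directory idea concrete, the sketch below shows a minimal directory entry for one cache line, with a bit mask of sharers and a helper that invalidates all other copies on a write. It is an illustration only; the names and the 64-core limit are assumptions made for this example and are not taken from any specific protocol or from the TILEpro64 hardware.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CORES 64  /* assumed core count for this illustration */

    /* One directory entry per cache line: which cores hold a copy. */
    struct dir_entry {
        uint64_t sharers;   /* bit i set => core i has a cached copy */
        int      owner;     /* core with write permission, or -1 if none */
    };

    /* A core reads the line: record it as a sharer. */
    static void dir_read(struct dir_entry *e, int core)
    {
        e->sharers |= (uint64_t)1 << core;
    }

    /* A core writes the line: invalidate all other copies, then make it owner. */
    static void dir_write(struct dir_entry *e, int core)
    {
        uint64_t others = e->sharers & ~((uint64_t)1 << core);
        for (int i = 0; i < NUM_CORES; i++) {
            if (others & ((uint64_t)1 << i))
                printf("send invalidate to core %d\n", i);  /* stands in for a coherence message */
        }
        e->sharers = (uint64_t)1 << core;
        e->owner = core;
    }

    int main(void)
    {
        struct dir_entry line = { 0, -1 };
        dir_read(&line, 3);
        dir_read(&line, 7);
        dir_write(&line, 7);  /* core 7 writes: core 3 receives an invalidate */
        return 0;
    }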
The chapter covering TILEpro64 describes the Distributed Shared Cache and
Dynamic Distributed Cache.
2.3 Software Parallelism
Unfortunately, an increased number of cores does not imply a proportional increase in performance [3]. Sequential code typically performs worse on a multi-core system. To efficiently utilize multi-core systems, the software has to be written with parallelism in mind. Amdahl's law shows how sequential code for a fixed-size problem affects the performance on parallel systems and that adding more cores does not imply the same amount of speedup [2].

  speedup = 1 / (S + (1 − S) / N)

Amdahl's law. S = sequential portion of code, N = number of processors
Amdahl's law only shows the speedup for algorithms with a fixed-size problem [2]. Gustafson's law shows the scaled speedup for a problem that is of variable size [2]. One example is packet processing: it is easy to understand that by adding more cores, the system will be able to process more packets.

  scaled speedup = N + (1 − N) × S

Gustafson's law. S = serial portion of code, N = number of processors
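As a small worked example (not part of the original text), the program below evaluates both laws for an assumed serial fraction S = 0.1 on N = 64 cores: Amdahl's law limits the speedup to roughly 8.8, while Gustafson's scaled speedup is about 57.7.

    #include <stdio.h>

    /* Speedup for a fixed-size problem (Amdahl). */
    static double amdahl(double s, double n)    { return 1.0 / (s + (1.0 - s) / n); }

    /* Scaled speedup for a problem that grows with the machine (Gustafson). */
    static double gustafson(double s, double n) { return n + (1.0 - n) * s; }

    int main(void)
    {
        double s = 0.1;   /* assumed serial fraction of the code */
        double n = 64.0;  /* number of processors, e.g. the tiles of a TILEpro64 */

        printf("Amdahl:    %.2f\n", amdahl(s, n));     /* ~8.77  */
        printf("Gustafson: %.2f\n", gustafson(s, n));  /* ~57.70 */
        return 0;
    }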
Parallelism is extracted and utilized at every level: in hardware, at compile time, by the system software and at the application software level. The four basic types of parallelism are described below.
2.3.1 Bit-Level Parallelism
Bit-level parallelism is the width of the processor. Increasing bit-level parallelism
is when a processor is redesigned to work with larger data sizes. One example is
extending a parallel bus from 32-bit to 64-bit.
2.3.2 Instruction Level Parallelism
Instruction level parallelism is the parallelism that can be achieved by executing instructions that do not depend on each other in parallel. This is exploited by VLIW (Very Long Instruction Word) and superscalar processors, using techniques such as out-of-order execution. Compilers can optimize code to achieve a higher degree of instruction level parallelism.
2.3.3 Data parallelism
Data parallelism is when there is no dependency between data items. One example is when working with arrays of data where the elements do not depend on each other. This has been utilized for a long time in CPUs that support SIMD (Single Instruction Multiple Data) instructions. This is also an area where multi-core systems make a big difference. A developer can, for example, use POSIX threads and divide the pixels of an image between different threads that run on different cores (this is connected with task parallelism).
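The sketch below illustrates this pattern with POSIX threads: an image, represented here simply as an array of pixel values, is split into equal slices, and each thread brightens its own slice independently. It is a generic illustration and not taken from the thesis code; the slice size and thread count are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    #define PIXELS   1024
    #define NTHREADS 4
    #define SLICE    (PIXELS / NTHREADS)

    static unsigned char image[PIXELS];  /* stands in for image data */

    struct slice { int first; int count; };

    /* Each thread brightens its own, independent slice of the image. */
    static void *brighten(void *arg)
    {
        struct slice *s = arg;
        for (int i = s->first; i < s->first + s->count; i++)
            image[i] = (unsigned char)(image[i] + 10);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct slice part[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            part[t].first = t * SLICE;
            part[t].count = SLICE;
            pthread_create(&tid[t], NULL, brighten, &part[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        printf("first pixel after processing: %d\n", image[0]);
        return 0;
    }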
2.3.4 Task Parallelism
Task parallelism is the ability to run different processes and threads in parallel. This is an area where multi-core systems can provide big performance boosts. Systems consisting of loosely coupled units can really benefit from this. The application developer has to find parallel parts of his or her sequential program that can be divided into different threads that can be executed in parallel.
2.4 Software Models
2.4.1 Symmetric Multiprocessing
The SMP model is when resources such as the OS image are shared between the cores [2][3]. Communication and synchronization are done through a shared memory. This is a suitable model for systems with a small number of cores. It provides an environment similar to the multitasking single-core system. The application developer can use many tools if he or she wants to use the shared memory model to utilize parallelism in the application. Examples of such tools are OpenMP [6], Wool [7], Cilk [8] and Cilk+ [9].
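As an illustration of the kind of shared memory tools mentioned above, the snippet below parallelizes a simple loop with OpenMP (compiled with, for example, gcc -fopenmp). It is a generic example and not taken from the thesis or from OSE.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* The iterations are independent, so OpenMP may split them across cores. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = i * 0.5;

        /* A reduction combines the per-thread partial sums into one result. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.1f (computed on up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }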
Affinity Scheduling
To better utilize the performance of multi-core processors, SMP operating systems usually provide affinity scheduling [2]. This means that the scheduler takes the physical location of threads into account. Threads that share memory are preferably located on the same core, or on cores close to each other if it is a NUMA (Non-Uniform Memory Access) system.
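On Linux, an application can also pin a thread to a particular core itself. The snippet below (glibc-specific, using pthread_setaffinity_np) is a generic illustration of that idea and is unrelated to the OSE scheduler.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        int err;

        /* Restrict the calling thread to core 0. */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err != 0) {
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
            return 1;
        }
        printf("thread pinned to core 0\n");
        return 0;
    }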
Scalability
The fact that SMP systems share resources among cores also implies bottlenecks. This is described in more detail in chapter 5.
2.4.2 Asymmetric Multiprocessing
This is the classical model for distributed systems. In AMP (Asymmetric Multiprocessing) the cores do not share resources [2]. This is common for heterogeneous distributed systems but can also be implemented on homogeneous systems. On homogeneous systems a hypervisor can be used to handle protection, provide communication channels and distribute resources among operating systems that execute on different cores. When using a hypervisor it is also possible to run multiple SMP systems in parallel on the same processor as one big AMP system.
Chapter 3
Enea’s OSE
OSE (Operating System Embedded) is a distributed operating system that supports hard and soft real-time applications. Being a distributed operating system, OSE can execute on heterogeneous distributed systems and clusters. Operating system core services and application-provided services can be accessed by applications in a location-transparent manner through a message based programming model. Each node in the distributed system runs an OSE micro-kernel that can be extended with modules. OSE Multicore Edition has extended the previous pure AMP model into a hybrid AMP/SMP model. The content of this chapter is mainly derived from the OSE documentation [10][11][12][13].
3.1 Architecture Overview
As mentioned above, OSE is based on a message based model. The micro-kernel provides real-time features such as a preemptive scheduler, prioritized interrupts and processes. The kernel provides memory management, and memory pools are used to provide deterministic access times. Figure 3.1 shows the layers of OSE.

Processes in OSE are similar to POSIX threads. A location-transparent message passing API for inter-process communication is provided by the kernel. All communication and synchronization in the OSE programming model is done with asynchronous message passing. This makes application code scalable, since it can be moved from a single-core system to a multi-processor cluster without modifying the code. Inter-process communication between cores is serviced by the medium-transparent Enea LINX.

Figure 3.1. OSE layers
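To make the asynchronous message passing model concrete, the sketch below implements a tiny mailbox with POSIX threads: a sender allocates a message, posts it and continues, while the receiver blocks until a message arrives. This is only an illustration of the pattern; it does not use the actual OSE signal API, and all names in it are invented for this example.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A minimal asynchronous mailbox: not the OSE API, just the same idea. */
    struct msg { int signo; int payload; struct msg *next; };

    struct mailbox {
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
        struct msg     *head;
    };

    static void mb_send(struct mailbox *mb, int signo, int payload)
    {
        struct msg *m = malloc(sizeof *m);
        m->signo = signo; m->payload = payload; m->next = NULL;
        pthread_mutex_lock(&mb->lock);
        m->next = mb->head;              /* LIFO for brevity; a real mailbox keeps order */
        mb->head = m;
        pthread_cond_signal(&mb->nonempty);
        pthread_mutex_unlock(&mb->lock); /* the sender returns immediately: asynchronous */
    }

    static struct msg *mb_receive(struct mailbox *mb)
    {
        pthread_mutex_lock(&mb->lock);
        while (mb->head == NULL)
            pthread_cond_wait(&mb->nonempty, &mb->lock);  /* block until a message arrives */
        struct msg *m = mb->head;
        mb->head = m->next;
        pthread_mutex_unlock(&mb->lock);
        return m;
    }

    static struct mailbox box = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL };

    static void *consumer(void *arg)
    {
        (void)arg;
        struct msg *m = mb_receive(&box);
        printf("received signal %d with payload %d\n", m->signo, m->payload);
        free(m);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, consumer, NULL);
        mb_send(&box, 42, 1234);   /* "send and forget", as in the OSE model */
        pthread_join(t, NULL);
        return 0;
    }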
3.2 Load Modules, Domains and Processes
A module is a collection of processes that make up an application. A program is an instantiated load module. A module that is linked with the kernel at compile time is referred to as a core module. Modules that are separately linked can be loaded dynamically at run-time and are referred to as load modules.
The term OSE process can introduce some confusion: processes in OSE are more similar to what is commonly known as threads. Processes may be grouped and may share memory pools and heaps.
A domain is a memory region shared by programs. If an MMU (Memory Management Unit) is used, OSE is able to provide full memory protection and virtual memory regions between domains. A domain usually contains code, data, heaps and pools. The pool is used for deterministic dynamic allocation of signal buffers and process stacks. The heap is also used for dynamic allocation but is not used for stacks and signal buffers. The heap may preferably be used by applications that do not completely use the OSE programming model. Software that depends on POSIX may use the heap.
A program can be configured as private or shared. In a program configured as private, signals, buffers and files are privately owned by specific processes. These are reclaimed by the kernel when the process terminates. A process may only modify data that is owned by that process. When the program is configured as shared, a heap is shared among processes and the POSIX shared programming model works. A program that uses POSIX-compatible threads shall be executed in shared memory mode.
A shared heap has to be used for multi-threaded parallel programming. As
always with shared resources, when sharing a heap between processes and cores,
locks, spinlocks and mutexes have to be used on critical sections.
3.3 OSE for Multi-Core
The OSE programming model has been adaptable to multi-processor systems for a long time through the AMP model. This model can be used today on multi-core systems when there is a hypervisor that manages memory, peripherals and inter-core communication. OSE for multi-core has extended this distributed model with certain properties of SMP operating systems, creating a hybrid. OSE MCE has startup code that loads the OS image onto several cores of a multi-core processor.
Figure 3.2. OSE Multi-core Edition
The multi-core edition is the same distributed operating system as before with
additional SMP features [14][2]. OSE still uses the distributed system AMP model,
having each core running its own scheduler with associated data structures. OS
services can still be distributed and accessed via the message passing model and
distributed CRT. The OSE MCE architecture tries to keep shared system resources
to a minimum (to maintain the AMP scalability). Figure 3.2 shows how OSE MCE
is distributed among the cores.
Synchronization between cores is done with so-called kernel events. A global shared data structure called the kernel event queue is used. When a core has stored its kernel event in the queue, it generates an IPI (Inter-Process Interrupt) to notify the receiving core(s). The IPI implementation is hardware specific. A high-speed interconnection mechanism is preferably used.
3.3.1 Migration and Load Balancing
OSE MCE provides functionality to migrate domains, programs, blocks and processes between cores. When a program or domain is moved between two cores, all the program's processes, blocks and the program heap daemon are also moved automatically. It is possible to lock programs to specific cores. Interrupt processes and timer interrupt processes cannot be moved, and processes that use non-SMP system calls will be locked to one core.
The OSE kernel does not provide any load balancing. This is expected to be
implemented and controlled by the application designer.
3.4 Hardware Abstraction Layer
The bottom layer of the OSE stack is called the HAL (Hardware Abstraction Layer). The HAL provides a target interface to the OSE kernel. This layer implements hardware-specific functionality like MMU and cache support. This is the layer that has to be modified during the porting process.
Part of the target layer is also the board support package. The BSP contains
system initialization code and device driver source code. An OSE distribution
does not include a BSP. The BSP is instead delivered as a separate component
and the user can freely choose which BSP to use when compiling the OSE system.
3.5 Conclusion
OSE Multicore Edition has extended the AMP model with a shared memory environment. It provides a single-chip AMP-like environment that does not rely on a hypervisor to work. It has extended the OSE message based environment with an inter-core shared memory environment. Heaps that can be accessed from multiple cores implement the shared memory context.
If a shared memory implementation for parallel programming such as Wool, Cilk or OpenMP is to be implemented, it is important that individual processes can easily be created and killed, and preferably that they can migrate between cores. OSE Multicore Edition supports these features. When implementing a load balancer it is obviously important that the kernel provides core load monitoring functionality. In OSE, the Program Manager or the Run-Time Monitor provides this functionality.
The following question is asked in the project specification: Is it desirable to implement a shared memory model for IPC? This question is faulty: OSE already supports a shared memory model and implements a subset of POSIX. The question should instead be: Is it desirable to implement a shared memory tool for parallel computing? This question can be answered. The answer: the optimal programming model when developing for OSE is the OSE message passing model; it enforces a scalable parallel design and has been used in clusters for a long time. A user may, however, have special reasons to use some other programming model based on shared memory. One example of such a situation is when legacy code has to run on an OSE system. OSE Multicore Edition provides the required functionality for implementing a shared memory tool for exploiting task parallelism in legacy applications. This is illustrated in figure 3.3.
Figure 3.3. Processes sharing an address space between cores
A second question from the project specification can also be answered: To what extent can code be reused? The answer: porting means that the hardware-specific parts, such as the hardware abstraction layer of the operating system, are the ones that have to be changed. This means that there might not be that much code that can be reused. Of course, it is possible to use code for another target as a reference, but direct copying might be difficult. When making the device drivers there are more possibilities for reuse, and there are also device driver templates that can be used [15].
Chapter 4
Tilera’s TILEpro64
The TILEpro64 [16] is given its own chapter in this thesis because it has an interesting many-core architecture. Studying it may lead to a better general understanding of many-core, especially of interconnection networks and memory hierarchies. The TILEpro64 is also the target architecture when porting OSE in the implementation part of this thesis, so this chapter can also be considered a pre-study directly linked to the implementation. The content of this chapter is based on the documentation located at the Tilera Open Source Documentation page http://www.tilera.com/scm/docs/index.html and the iMesh article [17].
4.1 Architecture Overview
Tilera’s TILEpro64 architecture is a homogeneous tiled multi-core processor
inspired by MIT’s RAW [18]. The cores (referred to as tiles) are organized in
an 8x8 mesh on-chip interconnection network called iMesh. Each tile contains a
VLIW core, cache and switch that connect the tile to the on-chip network[16].
The layout of the processor can be seen in figure 4.1.
Each core is a general-purpose 32-bit VLIW processor. Each core contains an independent program counter, interrupt hardware, different protection levels and virtual memory, and is capable of running an operating system. There are four protection levels: user, supervisor, hypervisor and hypervisor debug. This means that virtualization is supported in hardware. The processor uses a RISC ISA extended with instructions commonly used in DSP or packet processing applications.

Figure 4.1. The TILEpro64 processor is organized in an 8x8 mesh

On each tile there is also a cache engine containing an 8 KB L1 data cache, a 16 KB L1 instruction cache and a unified 64 KB L2 cache. There is a total of 5.5 MB of on-chip cache distributed among the cores. It is possible for any tile to access the L2 cache of any other tile over the interconnection network. This makes up the virtual L3 cache called the Distributed Shared Cache (DSC) [16]. The processor supports full cache coherency. The memory hierarchy and interconnection network are described below. The tile can be seen in figure 4.2.
Figure 4.2. The tile consisting of a VLIW core, cache engine and network switch
The cores are connected to the iMesh by a switch engine residing on each tile. The iMesh actually consists of six parallel mesh networks: five dynamic (UDN, TDN, MDN, CDN and IDN) and one static (STN) [16].
4.2 Interconnection Network - iMesh
The iMesh [17] consists of six physical 2D mesh networks. Each network has a dedicated purpose, such as communication between tiles and I/O controllers or communication between caches and memory. The UDN, IDN and STN are all accessible from software. The other networks are controlled by hardware and are used by the memory system to provide inter-tile shared memory, cache coherence and tile-to-memory communication. The hardware controlled networks are guaranteed to be deadlock free; however, care must be taken with the software-accessible dynamic networks.
The six networks are physically independent, 32-bit and full duplex. They are named: the Static Network (STN), the Tile Dynamic Network (TDN), the User Dynamic Network (UDN), the Memory Dynamic Network (MDN), the Coherence Dynamic Network (CDN) and the I/O Dynamic Network (IDN). The networks can be used simultaneously.
Each tile has a switch. All switches together make the iMesh network and provide control and data-path for connections on the network. They also implement
buffering and flow control.
4.2.1 Interconnection Hardware
The Switch

The switch is connected to all six networks and has five full duplex ports for each of them: one in each direction (north, east, south, west) and one connected to the local tile. The reason why the iMesh implements physical networks instead of logical ones is that logical networks would need the same amount of buffering, while the extra wire connections are relatively cheap [17].
Receiving Messages

It is possible to implement demultiplexing of received messages in software. This is done by triggering an interrupt when a message is received; the interrupt service routine then stores the message in a queue located in memory. On the IDN and UDN, demultiplexing of incoming messages is supported in hardware. On the UDN there are four queues that can be programmed to store different incoming messages depending on the message tag. On the IDN there are two such queues. Both networks also have a catch-all queue that catches messages that do not match any of the other queues.
4.2.2 The Networks
The dynamic networks use packet based communication [17]. The packet header contains information about the destination tile and the packet length. On the network the packet is wormhole routed, which means that much smaller buffers are needed at each switch because the packet buffering is distributed all along the connection path. A dimension-ordered routing policy is used, which means that a packet first travels in the x-direction and then in the y-direction. This also means that it is possible to deadlock these networks.
Figure 4.3. A wormhole routed packet travelling on the network.
Figure 4.3 shows how a packet travels from the upper left tile to its destination, first in the x-direction, then in the y-direction. It can also be seen how the packet occupies the channel where it is travelling, thus preventing other packets from using that channel at the same time. A situation where a transmission blocks a channel on the interconnection network can introduce deadlocks [16][17]. It is thus important that the system developer makes sure that a packet can be received and buffered at its destination.
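The sketch below shows the dimension-ordered (X-then-Y) routing decision described above for an 8x8 mesh. It is a generic illustration of the routing policy, not Tilera code, and the coordinates and direction names are invented for this example.

    #include <stdio.h>

    enum dir { HERE, EAST, WEST, NORTH, SOUTH };

    /* Dimension-ordered (XY) routing: move in x until aligned, then in y. */
    static enum dir next_hop(int x, int y, int dst_x, int dst_y)
    {
        if (x < dst_x) return EAST;
        if (x > dst_x) return WEST;
        if (y < dst_y) return SOUTH;   /* y grows downward in this example */
        if (y > dst_y) return NORTH;
        return HERE;
    }

    int main(void)
    {
        static const char *name[] = { "arrived", "east", "west", "north", "south" };
        int x = 0, y = 0;              /* source tile (0,0) */
        int dst_x = 5, dst_y = 3;      /* destination tile (5,3) on the 8x8 mesh */

        while (1) {
            enum dir d = next_hop(x, y, dst_x, dst_y);
            printf("(%d,%d): %s\n", x, y, name[d]);
            if (d == HERE) break;
            if (d == EAST) x++; else if (d == WEST) x--;
            else if (d == SOUTH) y++; else y--;
        }
        return 0;
    }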
UDN

The User Dynamic Network is a software-accessible packet-switched network. It can be used to implement high-speed inter-process communication between tiles. The developer has to be careful, as this network can be deadlocked [16][17].
IDN

Just as the User Dynamic Network, the I/O Dynamic Network is packet switched and accessible from software. The I/O Dynamic Network can be used to communicate between tiles and with I/O devices. The developer has to be careful, as this network can be deadlocked [16][17].
MDN
Only the cache engine has access to the memory dynamic network. It is used
for cache-cache and cache-memory communication[16][17].
TDN

The Tile Dynamic Network complements the Memory Dynamic Network and is used for cache-to-cache communication. If a tile wants to read from another tile's L2 cache, the request is sent over the TDN and the answer is sent over the MDN. The reason that two networks are used is to prevent deadlocks [16][17].
CDN
The coherence dynamic network is used by the cache coherence hardware to
carry cache invalidate messages[16].
STN

The Static Network is not routed dynamically like the other networks. Instead, switches are configured to provide a static path for point-to-point communication [16][17]. Like the UDN and IDN, the STN is also accessible from software.
4.2.3 Protecting the Network
It might not be desirable that user processes can communicate directly with I/O devices or with operating systems on other tiles. This is prevented in hardware by the so-called multi-core hardwall [16][17][19]. The hardwall is controlled by a couple of special purpose registers [19]. This makes it possible to control what kind of traffic passes through the switch. A hardwall protection violation may trigger an interrupt routine that takes appropriate actions, like tunneling the traffic to another destination.
4.2.4 Deadlocks
It is possible to trigger a deadlock when using the dynamic networks [17][19]. The developer must therefore be careful when designing the application. The wormhole routing protocol can overflow if the receiver does not take care of incoming packets. When designing a dynamic network protocol it is important to make sure that each tile always empties its own receive buffer and that there is no blocking send operation that stops the receive operation from executing [17][20]. It is also important not to send more packet data than the receiving demux buffer can handle.
The memory networks are controlled by hardware which guarantees that no
deadlocks will occur[19].
4.2.5 iLib
The iLib library provides a user API for inter-tile communication over the UDN. It provides socket-like streaming channels and an MPI (Message Passing Interface) [21] like message passing interface.
Streaming

iLib supports two types of streaming [17]: raw channels and buffered channels. Raw channels have little overhead and are suitable for software with high demands on latency. Buffered channels have more overhead but instead support large buffers residing in memory.
Messages

The messaging API is similar to MPI. Every tile can send messages to every other tile without manually setting up a communication channel. The API makes sure that messages are received in order and that they are buffered depending on the message tag.
4.2.6 Conclusions about the Interconnection Network
The project specification asked the following question: What implications does the iMesh network have on an RTOS? The answer to this question is that there is a risk that the software managed networks become congested and deadlocked. To prevent this, communication protocols have to make sure that the receiver does not overflow [17][20]. Receive buffers must always be emptied, and the sender must know that there are empty buffers before transmitting.

The application designer has to take into consideration that the interconnection network can become congested. An application with high demands on deterministic network latencies can be protected with the network protection mechanism described above. This will prevent unwanted communication on the parts of the network used by the critical application [17][16]. Another solution is to use the static network for critical applications.

As long as the software developer follows these guidelines, there should not be any problems using an RTOS on the iMesh.
4.3 Memory Hierarchy
The TILEpro64 has a 36-bit shared physical address space which is visible as a 32-bit virtual address space. The memory can be visible and shared among all tiles, or it can be grouped into protected domains. Each tile has a separate 8 KB L1 data cache, a 16 KB L1 instruction cache and a unified 2-way 64 KB L2 cache. The on-tile caches are complemented by the Distributed Shared Cache (DSC), a virtual non-uniform-access L3 cache distributed across all L2 caches. The cache coherence protocol, called Dynamic Distributed Cache (DDC), implements system-wide coherency and has a number of configurations.
4.3.1 Memory Homing
All physical memory on the TILEpro64 can be associated with a home tile. The home tile is responsible for cache consistency for its associated addresses. The memory homing system implements distributed directory based cache coherency. One use for home tiles is to dedicate the L2 cache to its associated physical memory and let all accesses from all tiles to those addresses go through the home tile's cache. This is how the L3 cache is implemented. The TLBs on each tile not only map virtual addresses to physical ones but also keep track of which home tile a cache line belongs to.

There exist a couple of strategies for configuring the home tiles. These strategies can be customized, and to achieve the best performance, software should be optimized with locality in mind [22][23].
Local Homing

This strategy does not use the L3 cache. On an L2 cache miss the DDR memory is accessed directly, and the complete page that the accessed data belongs to is cached locally at the accessing tile. This strategy is good when different cores do not share data, because accessing the off-chip memory directly on a cache miss is faster than first trying to read the L2 cache of another tile [24].
Remote Homing

This strategy implements the L3 cache. All physical pages get dedicated home tiles. When an L2 miss occurs, a request is sent to the home tile of the requested memory (which acts as the virtual L3 cache). If a second L2 miss occurs at the home tile, the data has to be fetched from memory. This strategy is good for producer-consumer applications, where the producer can write directly into the consumer's L2 cache [24].
Hashed Homing

This strategy resembles the remote homing strategy, but the difference is that the pages are distributed among tiles at cache line granularity, using a hash function. This makes it suitable for applications where instructions and data are shared among several cores. The hashed distribution of memory provides better load balancing on the iMesh and avoids bottleneck situations where many tiles access the same page [24].
Figure 4.4. Example of cache configurations
Figure 4.4 explains the different ways of configuring caching. In the example, Tile 1 and Tile 2 are accessing the same three pages (A, B and C). Page A is configured as local, page B as remote and page C as hashed. Note that the memory sizes in the figure are not drawn to scale (the L2 cache is not of the same size as the off-chip memory). Also note that the hashed page is not really hashed in the picture; the picture only tries to show that the page is divided between caches.
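To illustrate the idea behind hashed homing, the sketch below maps 64-byte cache lines to home tiles with a trivial hash. The line size, tile count and hash function are assumptions made for this example only; the real TILEpro64 hash function is not reproduced here.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 64   /* assumed cache line size in bytes */
    #define NUM_TILES 64   /* 8x8 mesh */

    /* Map a physical address to a home tile, one cache line at a time. */
    static unsigned home_tile(uint64_t paddr)
    {
        uint64_t line = paddr / LINE_SIZE;
        return (unsigned)(line % NUM_TILES);   /* toy hash: spread lines round-robin */
    }

    int main(void)
    {
        /* Consecutive cache lines of one page end up on different home tiles,
           which spreads coherence and memory traffic across the mesh. */
        for (uint64_t addr = 0; addr < 4 * LINE_SIZE; addr += LINE_SIZE)
            printf("address 0x%llx -> home tile %u\n",
                   (unsigned long long)addr, home_tile(addr));
        return 0;
    }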
4.3.2 Dynamic Distributed Cache
Dynamic Distributed Cache (DDC) is the name of the cache coherency system on the TILEpro64. It uses the homing concept to implement distributed directory based cache coherency [16]. Each home tile is responsible for keeping track of which tiles have a copy of homed data. The home tile is also responsible for invalidating all copies if a cache line is updated.
4.3.3 Conclusions about the Memory Architecture
The question from the problem statement can now be answered: How does the memory hierarchy of TILE64 cope with the demands of an RTOS? It depends on whether many applications access the same memory controller. A bad configuration, where many applications try to access one memory controller or a remote cache at the same time, can congest the memory networks. This will lead to bad performance. The TILEpro64 processor can use a configuration called memory striping. This configuration splits pages between the four memory controllers, which makes the traffic on the memory networks more evenly distributed. This, combined with a wisely chosen cache configuration, can increase performance. If there are hard requirements on deterministic access times, the application developer may consider letting the critical application have a dedicated memory controller.

The memory latencies are deterministic [16]. However, a congested memory network can change this. The developer has to take into account the memory accesses from all applications running in parallel on different tiles [24].
4.4 Tools and Software Stack
Tilera provides a number of host-side tools and a software stack that runs on the hardware. The host-side tool collection provides functionality such as build tools, a functional and cycle-accurate simulator, and debugging and profiling tools. The tile-side software stack is basically a complete software environment including a hypervisor, libraries and a custom-made Linux version.
Hypervisor

The main functionality provided by the hypervisor is booting, loading guest operating systems, managing resources and memory, providing an interface for inter-tile communication, and I/O device drivers [16][24].
Bare Metal Environment
For users with extra demands on performance there is support for a bare metal
environment that can be used instead of running on top of the hypervisor[16]. The
bare metal environment executes at the same protection level as the hypervisor
and provides full access to the hardware.
4.5 Tilera Application Binary Interface
It was necessary to study the Tile processor application binary interface [25] to make the assembly functions in the target interface work together with the C code. The application binary interface specifies things such as data representation and the function call convention.
The tile processor uses byte-aligned data and a little-endian data representation. This means that the least significant byte in a data item is stored at the lowest address. The compiler-supported data types are described in table 4.1.
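A quick way to see what little-endian storage means in practice is the generic C check below; it simply inspects the first byte of a multi-byte value and is not specific to the Tilera toolchain.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t word = 0x11223344;
        unsigned char *byte = (unsigned char *)&word;

        /* On a little-endian machine the least significant byte (0x44)
           is stored at the lowest address. */
        if (byte[0] == 0x44)
            printf("little endian: byte order %02x %02x %02x %02x\n",
                   byte[0], byte[1], byte[2], byte[3]);
        else
            printf("big endian\n");
        return 0;
    }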
The register usage convention is described in table 4.2. Caller-saved means that the caller has to save the register values if they are to survive a function call. The values in callee-saved registers have to be preserved by the called function and are thus required to contain the same value on function exit as on function entry.
C Type      Size (Bytes)   Byte Alignment   Machine Type
char        1              1                byte
short       2              2                halfword
int         4              4                word
pointer     4              4                word
double      8              8                doubleword
long long   8              8                doubleword

Table 4.1. Supported Data Types
Register   Assembler Name   Type           Purpose
0 - 9      r0 - r9          Caller-Saved   Parameter Passing / Return Values
10 - 29    r10 - r29        Caller-Saved
30 - 51    r30 - r51        Callee-Saved
52         r52              Callee-Saved   Optional Frame Pointer
53         tp               Dedicated      Thread-Local Data
54         sp               Dedicated      Stack Pointer
55         lr               Caller-Saved   Return Address
56         sn               Network        Static Network
57         idn0             Network        IO Dynamic Network 0
58         idn1             Network        IO Dynamic Network 1
59         udn0             Network        User Dynamic Network 0
60         udn1             Network        User Dynamic Network 1
61         udn2             Network        User Dynamic Network 2
62         udn3             Network        User Dynamic Network 3
63         sp                              Stack Pointer

Table 4.2. Register Usage Convention
The stack grows downward and is controlled completely by software. The stack
pointer has to be 8-byte aligned. Table 4.3 shows the stack usage convention. If a
function requires more arguments than there are dedicated registers, the arguments
left without a register are stored on the stack, starting at address SP+8.
Region           Purpose                                          Size
Locals           Local variables                                  Variable
Dynamic Space    Dynamically allocated stack space                Variable
Argument Space   If more than 10 arguments, then save them here   Variable
Frame Pointer    Incoming sp                                      One word
Callee lr        Incoming lr                                      One word

Table 4.3. Stack Usage Convention
Chapter 5
Software on Many-Core
Parallel software for multi-core has moved from being a restricted subject for scientific and high performance computing to being common in computer systems. Today, standard desktop PCs are shipped with 4 to 8 cores. It is believed that the number of cores per chip will continue to double every 18 months and that within ten years processors will contain as many as 1000 cores [1][5]. Many-core processors have been available for some years.

Extracting and managing parallelism in applications on multi-core is a hot subject. There are many tools available and research is active [3]. The system side is still in a somewhat novel state when it comes to many-core CMPs. There exist a couple of research operating systems that address the issues with many-core. The operating systems that have been investigated in this thesis in order to derive requirements for software on many-core are Wentzlaff et al.'s Factored Operating System (fos) [26][27][28] and the Barrelfish operating system [29][30]. These operating systems are designed with scalability as the main requirement, which means that common requirements in the embedded segment, such as real-time capabilities, are not of the highest priority.
This chapter covers some operating system scalability problems and how to counter them. The first section considers why the SMP model has problems scaling to many-core processors. The second section investigates the two many-core operating systems mentioned above. Finally, there is a discussion about what design principles and requirements need to be considered when developing operating systems for many-core.
5.1 Scalability Issues with SMP Operating Systems
It is a fairly accepted belief that SMP operating systems do not scale as the number of cores increases. Some studies claim that Linux only scales to about 8 cores [2][26][29]. There are, however, those who disagree [31][32]. Boyd-Wickizer et al. have been able to make Linux scale to 48 cores by making modifications to the Linux kernel.
5.1.1 Locking the kernel
A simple way to make a kernel SMP safe is to use a so-called big kernel lock. This means that only one thread can enter the kernel at a time. Using a big kernel lock makes it possible for multiple cores to share the same kernel. Since only one thread can execute in the kernel, threads on other cores have to wait if they want kernel access. Operating system designers have countered this problem by replacing the big kernel lock with fine grained locks [2][31]. Fine grained locks make the kernel more suitable for multi-core, as the probability that threads on different cores require access to the same resource decreases. Amdahl's law shows us that serial segments of code have a big negative impact on scalability. This implies that even if the locks have very fine granularity, the shared resources will still become a bottleneck when moving to many-core.
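The difference between a big kernel lock and fine grained locking can be illustrated with an ordinary user-space example: below, a table of counters is protected by one lock per bucket instead of a single global lock, so threads that touch different buckets never serialize. This is a generic sketch; the bucket count and the use of pthreads are choices made for this example only.

    #include <pthread.h>
    #include <stdio.h>

    #define NBUCKETS 16

    /* Fine grained locking: one lock per bucket instead of one lock for the
       whole table (a "big lock" would serialize every update). */
    static struct {
        pthread_mutex_t lock;
        long            count;
    } bucket[NBUCKETS];

    static void bucket_add(int key, long value)
    {
        int b = key % NBUCKETS;
        pthread_mutex_lock(&bucket[b].lock);    /* only contends with the same bucket */
        bucket[b].count += value;
        pthread_mutex_unlock(&bucket[b].lock);
    }

    int main(void)
    {
        for (int b = 0; b < NBUCKETS; b++) {
            pthread_mutex_init(&bucket[b].lock, NULL);
            bucket[b].count = 0;
        }
        bucket_add(3, 10);
        bucket_add(19, 5);   /* same bucket as key 3: these two serialize */
        bucket_add(4, 7);    /* different bucket: independent of the others */
        printf("bucket 3 holds %ld\n", bucket[3].count);
        return 0;
    }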
Wentzlaff et al. have, using a micro-benchmark, discovered that the physical page allocator in Linux 2.6.24.7 [26], even though it uses fine grained locks, only scales up to 8 cores. The lock used when balancing the free page lists between cores turned out to be the bottleneck.
Another related problem is that the lock granularity is optimized for a specific number of cores, meaning that the optimal lock granularity for a many-core system may create too much lock overhead on a system with fewer cores [2][31]. It is also possible that the cache coherence protocol congests the core interconnect while invalidating shared cache lines.

Making the kernel more scalable by making the locks more fine grained can be a difficult task and may introduce errors [2][31]. Forgetting to protect a shared resource or introducing deadlocks are risks when making locks finer grained. Patterns that trigger deadlocks can be difficult to detect.
5.1.2 Sharing Cache and TLBs between Application and OS
The trend is that cache sizes decrease as the core count increases [1]. An operating system where application and operating system services share the same cache and TLB can suffer from bad cache performance due to anti-locality collisions. By measuring cache misses in user and supervisor mode for an Apache2 server running on Linux 2.6.18.8, Wentzlaff et al. show that the cache interference between application and operating system caused by anti-locality collisions is sizable, and that the operating system is responsible for the majority of the cache misses [26]. This supports a more distributed operating system design where operating system services and applications do not compete for cache.
5.1.3 Dependency on Effective Cache Coherency
SMP design relies on efficient cache coherency protocols [2][31]. It is doubtful that system-wide cache coherency will scale to many-core. Wentzlaff et al. claim that directory based cache coherency protocols have difficulty scaling beyond 100 cores [26]. Baumann et al. believe that it is a distinct possibility that future operating systems will have to handle non-coherent memory [29].
5.1.4 Scalable SMP Systems
Boyd-Wickizer et al., however, claim that the Linux kernel only required relatively modest changes to become more scalable [31]. This was done by inserting locks with finer granularity and by introducing lock-free protocols. Their conclusion is that it is not yet necessary to give up traditional operating system design. They say that the bottlenecks are to be found at the application level and among shared hardware resources, not in the Linux kernel. They also state that making Linux scalable was dependent on effective cache coherency.
5.2 Operating Systems for Many-Core
As mentioned in the beginning of this chapter, two novel operating systems have been investigated to derive requirements for a many-core operating system. The common features of these operating systems are that they propose a distributed operating system design where shared resources are kept to a minimum and communication is done with message passing. Barrelfish also puts extra emphasis on hardware diversity, using a system knowledge base (SKB) that is supposed to provide extra support for optimization on heterogeneous systems [33].
5.2.1 Design Principles: Factored Operating Systems
The Factored Operating System (fos) [28] is designed by Wentzlaff et al. and has scalability as its main requirement. The main principle behind fos is that operating system services are dedicated to and distributed among cores, inspired by distributed web servers. fos distributes deep kernel services such as physical page allocation, scheduling, memory management and hardware multiplexing. fos is spatially aware and takes locality into account when distributing the servers.
Microkernel Providing Location Transparent Communication

fos consists of three main components: a micro-kernel, system service servers (referred to as the OS layer) and applications. Operating system services and applications never execute on the same core. Both operating system services and applications are referred to as clients, and the micro-kernel does not differentiate between them. The micro-kernel takes care of resource management, implements a machine dependent communication infrastructure, a name server cache and an API for spawning processes. The micro-kernel can allocate receive mailboxes which can be used by the clients to publish and access services. There is a local name server cache that contains mappings to services located on different cores. The distributed name server that resides in the OS layer provides load balancing between the servers of a fleet.
Fleets of Distributed Servers
Each function specific service belongs to a fleet of distributed, cooperating servers. A server is locked to one core, and communication between servers is done with message passing. A caller uses the microkernel name cache or the name server service to find the closest server providing a specific service. Applications can execute on one or several cores and may use shared memory for communication. Cache coherency is not expected to scale system-wide, but it may still be utilized effectively by an application implemented with a shared memory model that executes on a couple of cores.
All this makes fos resemble a distributed web server, borrowing techniques such as replication with data consistency and spatially aware distribution.
5.2.2 Design Principles: Barrelfish
Barrelfish [30] is designed on three main principles [29]. All inter-core communication shall be explicit, and the operating system structure shall be hardware neutral. The final principle is that state shall never be shared between cores; instead it shall be replicated and kept consistent with agreement protocols. One difference from fos is that Barrelfish does not only have scalability as a main requirement but also portability [33].
Main Components
Barrelfish consists of two main components, the CPU driver and the monitor. The CPU driver is private to each core and executes in kernel space. The monitor executes in user space and is responsible for coordination between cores. Together they provide the typical functionality of a microkernel: scheduling, communication and resource allocation. Other functionality provided by the operating system, such as device drivers, network stacks and memory allocators, executes in user space.
The CPU Driver
The CPU driver takes care of protection and time slicing of processes, and provides the target interface. It is private to each core and thus does not share any state. Hardware interrupts are demultiplexed by the CPU driver and delivered to the destination processes. It also provides a medium for asynchronous split-phase messages between local processes.
The Monitor
The monitor is a schedulable user space process responsible for coordinating the system-wide state. Resources are replicated, and the monitor is responsible for keeping the replicated data structures consistent with an agreement protocol. Processes that want to access the global state need to go through the local monitor to access a remote copy of the state. The monitor is also responsible for inter-process communication between different cores. All virtual memory management is done in user space, and the monitors are responsible for keeping global resources such as page tables consistent.
IPC with Message Passing
All inter-core communication is done with message passing. The message passing interface abstracts the communication medium and provides transparency to the communicating parties. The implementation is hardware specific and uses the hardware interconnect when possible.
Processes Represented by Dispatcher Objects
Processes are represented by special dispatcher objects that exist on each core on which the process will execute. The dispatchers are scheduled by the CPU driver. A dispatcher contains a thread scheduler and a virtual address space. All operating system software uses replication instead of shared resources. However, user-space processes are free to use a shared memory model for parallel applications.
Considering Hardware Diversity
Barrelfish puts extra emphasis on running efficiently on heterogeneous systems with few modifications [33]. It uses a system knowledge base that provides user applications with knowledge about the underlying hardware. This information is supposed to be used for run-time optimization.
5.2.3 Conclusions from Investigating fos
Wentzlaff et al. say that one advantage of moving to a distributed operating system is that by making communication inside the operating system explicit with message passing, it is no longer necessary to search for shared memory bottlenecks and locks [26]. Another advantage is that system services implemented as distributed servers (where the number of servers increases with the core count) scale in the same manner as distributed web servers. A third advantage of the distributed design is that system services do not have to share cores with applications. Applications can, instead, utilize the services with remote calls. The cost of dynamic messages is on the order of 15-45 cycles on the RAW and Tilera processors, whereas context switches on modern processors typically have a much higher latency in cycles [26]. As mentioned earlier, cores are expected to get smaller as the core count grows [1]. This, and the fact that embedded systems typically have very strict demands on software overhead, has led to the development of message passing APIs with small software overhead, such as MCAPI [34] and rMPI [35].
The authors of fos state that it is their belief that the number of running threads in a system will be of the same order as the number of cores. This means that load balancing will occur less often. More cores also mean that the need for time multiplexing of resources will decrease. To achieve good performance on many-core systems, placement of processes will instead be a bigger issue. Finally, Wentzlaff et al. think that by using an explicit message passing model, the operating system designer will encourage application developers to think more carefully about what data is being shared [26].
5.2.4 Conclusions from Investigating Barrelfish
Baumann et al. argue that operating system design has to be rethought using
ideas from distributed systems. They suggest minimal inter-core sharing and that
OS functionality should be distributed among cores, communicating with message
passing.
The Barrelfish design says that shared resources should be controlled by servers and that clients then have to perform some kind of RPC to access the shared resource. With a micro benchmark they show that a client-server message passing model scales better with the number of cores than a shared memory model when updating a data structure [29]. The reason is that with the shared memory model the cores get stalled because of invalidated cache lines, while with the message passing model the delay is proportional to the number of cores accessing the server. The Barrelfish design also considers that effective cache coherence will not scale with increased core count [29]. This favors the message passing model.
There are existing point solutions to the problems that come with the shared memory model [31][36][37]. Scalable shared memory software in high performance computing has tackled them by fine-tuning lock granularity and the memory layout of shared data structures [3]. This means that the developer has to be careful about how the data is laid out on the particular platform and how the cache-coherence protocol is implemented. Examples of such concerns are whether the underlying implementation stores arrays row-wise or column-wise, the size of the cache lines and what kind of cache coherency protocol is used. This is an argument supporting the idea that implicit communication through shared memory should be replaced with an explicit message passing model, encouraging the developer to create a parallel design that is less platform dependent.
5.3 Conclusions
This section contains the overall conclusions about operating system design for many-core. These conclusions are for a scalable operating system that appears as a single image system to the executing processes. There are other ways to utilize the performance provided by many-core that are not covered in this thesis (one example would be to run parallel AMP systems on top of a hypervisor [2]). Finally, there is a subsection discussing Enea OSE on many-core.
5.3.1 Distributed Architectures are Scalable
Distribute OS Services
To be able to scale on many-core processors, operating systems should aim at a distributed design. This means that the services provided by a standard monolithic operating system (such as a file system server, physical memory allocator, name server or an IP stack) should be distributed to dedicated cores. These services should also have an internal distributed design, where a spatially distributed fleet of servers provides the specific service to requesting processes. Server fleets should be able to grow or shrink depending on demand (maybe not dynamically, but at least during static configuration). These fleets should use replicated state to be scalable, and consistency should be maintained with a state-of-the-art agreement protocol.
A Microkernel Provides Inter-Core Communication
The distributed OS services should run on top of a microkernel that provides
location transparent communication. There should be a distributed name server
that keeps track of OS services and application services. The distributed name
server makes sure that a request is always serviced by the most appropriate server.
There should be a name cache in the microkernel to increase performance.
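A conceptual sketch of the lookup path described above is given below. Every type and function in it is a hypothetical illustration rather than an existing API; name_server_resolve() stands in for the request that would go to the distributed name server over the interconnect on a cache miss.

#include <string.h>
#include <stddef.h>

typedef unsigned int core_id_t;

struct name_entry {
    const char *service;
    core_id_t   server_core;
};

#define CACHE_SIZE 32
static struct name_entry name_cache[CACHE_SIZE];
static size_t cache_used;

/* Stand-in for the remote lookup handled by the distributed name server. */
static core_id_t name_server_resolve(const char *service)
{
    (void)service;
    return 0;                    /* pretend the service lives on core 0 */
}

core_id_t resolve_service(const char *service)
{
    for (size_t i = 0; i < cache_used; i++)          /* fast path: local cache */
        if (strcmp(name_cache[i].service, service) == 0)
            return name_cache[i].server_core;

    core_id_t core = name_server_resolve(service);   /* slow path: remote lookup */
    if (cache_used < CACHE_SIZE) {                    /* remember the mapping */
        name_cache[cache_used].service = service;
        name_cache[cache_used].server_core = core;
        cache_used++;
    }
    return core;
}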
5.3.2 One Thread - One Core
It is to be expected that cores become weaker as the core count grows [1], which means that caches will be smaller. This implies that executing operating system services and application processes on the same core will cause performance loss due to anti-locality cache collisions. The same can be said about multitasking.
The cores will be weak, but there will be lots of them. This makes it possible to let each thread have a dedicated core. Fast interconnects, better cache performance and no overhead from context switching make this a good choice.
Instead of time slicing threads on one core, the big task will be placement of threads. Finding a good placement will be necessary to get the desired performance. If threads have a good initial placement, there will be little need for migration, and one thread per core means there will be no need for load balancing.
5.3.3 IPC with Explicit Message Passing
Utilizing the performance of multi-core and many-core processors requires parallel software design [3][2]. Shared memory models for parallel programming do not scale well on many-core because of the lack of system-wide cache coherence. The papers behind Barrelfish and fos state that explicit message passing should be used for IPC [26][29]. Enforcing a programming model based on explicit message passing means that threads will not be designed around shared resources, which makes the applications scale better. The message passing should preferably be implemented using the fast interconnection networks of many-core processors [2].
The idea behind making the communication explicit is to make the software less platform dependent. Implicit communication is more platform dependent because the developer has to keep in mind how the underlying software and hardware handle data structures [26][29]. One example is a developer working with a shared matrix: if the communication mechanism is implemented with shared memory, the developer has to adapt the solution to how the matrix is stored by the underlying implementation and memory system.
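A minimal C sketch of the kind of platform dependence meant here (the function names are illustrative only): the same reduction over a shared matrix touches memory in completely different patterns depending on whether the traversal matches the row-major storage order, which in turn decides how many cache lines are filled and how much coherency traffic is generated.

#include <stddef.h>

#define N 1024

static double m[N][N];           /* C stores this row-major */

double sum_row_major(void)       /* walks consecutive addresses: one line
                                  * fill serves several elements */
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

double sum_column_major(void)    /* strides N doubles per step: in the worst
                                  * case every access fills a new cache line */
{
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}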
5.3.4 Example of a Many-Core OS
This section aims to clarify the conclusions with an example.
Figure 5.1 shows an example of what an operating system for many-core could look like. The cores are placed in an 8x8 mesh. The gray tiles are application threads, and the green and the red tiles are operating system services (note that the server placement is not optimal; this is just an example). Both applications and OS services are running on top of a microkernel that provides a location transparent communication interface.
If an application wants an OS service, it does a standard system call. The
microkernel realizes that the requested service is located on another core. The
microkernel forwards the remote call over the interconnection network to the most
appropriate server. At the destination tile, the call is delivered by the microkernel
to the OS process providing the service. The response is sent back to the requesting
application process in the same way, over the interconnection network.
This was a very simple example that explains the thoughts behind the design
of a distributed operating system executing on a many-core processor.
Figure 5.1. Example of Many-Core OS
5.3.5 Enea OSE and Many-Core
Enea OSE is described in chapter 3. OSE executes on top of a microkernel that can be extended with different core services. IPC is done with message passing and can be location transparent if used together with Enea LINX [11]. This fits well with the conclusions made about operating systems for many-core.
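The sketch below illustrates what this location transparency looks like at the application level, using plain OSE primitives. It is only a sketch: the signal number and the process name "myservice" are hypothetical, the hunt() call is written from the OSE API reference from memory, and error handling as well as the hunt-signal pattern for services that are not yet started are omitted.

#include "ose.h"

#define MY_SIGNAL 101            /* hypothetical signal number */

union SIGNAL
{
    SIGSELECT sig_no;
};

void call_service(void)
{
    static SIGSELECT any_sig[] = { 0 };
    PROCESS server;
    union SIGNAL *sig;

    /* Resolve the service by name; the caller does not know (or care)
     * on which core - or, with LINX, on which node - it executes. */
    if (!hunt("myservice", 0, &server, NULL))
        return;                  /* a real client would wait for a hunt signal */

    sig = alloc(sizeof(union SIGNAL), MY_SIGNAL);
    send(&sig, server);          /* request */
    sig = receive(any_sig);      /* reply */
    free_buf(&sig);
}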
The overall conclusion about OSE and many-core is that the OSE architecture fits well with the requirements stated in the fos and Barrelfish papers. Porting OSE to a many-core processor is thus very well motivated and will, with high probability, be an interesting research base for software on many-core at Enea.
Chapter 6
Porting Enea OSE to TILEpro64
The subject covered by this chapter is the porting of OSE to TILEpro64. Porting OSE to the Tilera processor was already stated in the project specification. This decision was strengthened when the pre-study showed that OSE might be very well suited for many-core. When writing the specification, it was difficult to estimate how far the porting would get within the time scope of this thesis. It was known that a good understanding of the OSE architecture, the Tilera hardware and the tools that had to be used was required. Therefore, the project specification stated: "It is not expected that the complete porting of OSE ME will be done during 10 weeks, however a foundation for further thesis projects shall be achieved." This was, without any doubt, achieved. Right now, there is a limited but working single-core version of OSE executing on the TILEpro64 processor.
The following sections cover early design decisions, the method used during implementation, a description of the implemented parts and, finally, a verification of the results. The work methodology used within each milestone is also described. These sections aim to be target generic and are meant to aid future thesis workers with similar projects.
6.1 Milestones
The porting process is an incremental task. The problem complexity is quite significant, so it was necessary to divide the problem into smaller tasks. The project was divided into a couple of milestones to ease the work of extracting the most important tasks.
6.1.1 Milestone 1 - Build environment
Since TILEpro64 is a completely new architecture with a new ISA, a new build environment has to be created. This is necessary in order to continue with the porting process. It shall be possible to edit OSE libraries and use Omnimake to build them with Tilera's GCC port.
6.1.2 Milestone 2 - Launch OSE and write into a Ramlog
Link the included libraries and make a final build with a Coresys. There should also be a configured simulator environment that can be used to test the final build. Arrive at the point where OSE is able to write into the ramlog.
6.1.3 Milestone 3 - Get OSE into a safe state
Get a basic single-core version of OSE up and running. This version is able to write into a log in RAM that can be accessed from GDB. There is no driver for serial communication or for the chip timer yet, which means that the only way to see output from the operating system is through the ramlog, and it is not possible to provide any input at run-time. No timer means that timer processes are not supported, and neither are system calls that rely on time. The scheduler still works in this configuration, as long as only event driven processes are used.
6.1.4 Milestone 4 - Full-featured single-core version of OSE on TILEpro64
This milestone requires a timer and a UART device driver. MMU support can be implemented, but this is optional.
6.1.5 Milestone 5 - Full-featured multi-core version of OSE on TILEpro64
Milestone 5 requires a multicore bootstrap and an IPI driver. Preferably, hardware MMU support should be available (this is needed if the features of the memory hierarchy are to be utilized). Milestone 5 could also be extended to include utilization of the interconnection network for IPC by adding support in LINX.
6.2 MS1 - Build Environment
To be able to start working with the actual porting, it was necessary to do some preparatory work. A new target called tilepro was added to the internal build environment. This section briefly describes what had to be done.
6.2.1 Omnimake
Omnimake [38] is the make and build system for the OSE source code. It is used by Enea to build their product components. Creating a new configuration for the Tilera architecture was the first step.
6.2.2 Requirements and Demonstration
Requirements
The milestone-specific requirement is listed in table 6.1.

Nr.  Priority  Description
1    1         It shall be possible to build OSE core libraries for TILEpro64.

Table 6.1. MS1 - Requirements
Demonstration
The requirement is verified with use cases.
Use Case 1
1. Make a change to a source file in the CRT component.
2. Build the CRT component.
3. Confirm that the component library was built by looking in OSEROOT/system/lib.
Expected Outcome
The library is compiled and can be found in OSEROOT/system/lib.
Result of Use Case 1
Status: PASSED
Use Case 2
1. Make a change to a source file in the CORE component.
2. Build the CORE component.
3. Confirm that the component library was built by looking in OSEROOT/system/lib.
Expected Outcome
The library is compiled and can be found in OSEROOT/system/lib.
Result of Use Case 2
Status: PASSED
6.2.3 Work Approach
The following steps describe how MS1 was achieved.
1. Create a configuration for the new compiler in Omnimake.
2. Choose a suitable reference architecture. Try to find a reference architecture with an ISA similar to Tilera's. With a similar ISA, mapping the reference architecture onto the target architecture will be a much easier task.
3. Add library specific build configurations for the new target.
4. Remove all target specific source code until it is possible to build all desired libraries.
6.3 MS2 - Coresys
The final system is called Coresys. The difference between a Coresys and a Refsys is that a Coresys is a minimal build for testing, while a Refsys is a full-featured OSE monolith. The Coresys only contains the OSE core functionality, plus a few optional core extensions like the Run-Time Loader for ELF and the Console library. Like a Refsys, the Coresys is responsible for linking all the desired libraries, setting the OSE configuration parameters and creating a final executable.
6.3.1 Implemented Parts
Milestone 2 was more about configuring than coding. The OSE entry code was the only source code artifact produced.
OSE Entry Code
The TILEpro64 start code has been developed. It is specified as the OSE ELF entry point and handles initialization of the read-write data and BSS segments. It also calls the main() function. The entry code can be considered an extension to the compiler.
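A minimal C sketch of what such entry code does is shown below. The linker-script symbol names (__data_load, __data_start, __data_end, __bss_start, __bss_end) and the function name are hypothetical placeholders; the real entry code is also responsible for setting up an initial stack before any C code can run.

extern char __data_start[], __data_end[], __data_load[];
extern char __bss_start[], __bss_end[];
extern int main(void);

void _start(void)
{
    char *dst;
    const char *src;

    /* Copy initialized read-write data from its load address in the image. */
    for (dst = __data_start, src = __data_load; dst < __data_end; )
        *dst++ = *src++;

    /* Zero the BSS segment. */
    for (dst = __bss_start; dst < __bss_end; )
        *dst++ = 0;

    main();                      /* hand over to the OSE initialization code */
    for (;;) ;                   /* main() is not expected to return */
}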
6.3.2 Design Decisions
The initial intention was to do a para-virtualization on top of Tilera's hypervisor. The reason for this was that it is easier to write device drivers that interface against the hypervisor than drivers that interface directly against the hardware. This, however, turned out to be a bad decision. Studying the hypervisor was not within the time scope of this thesis, and in the long term, running on top of the hypervisor was not desirable either.
Instead of running on top of the hypervisor, the decision was taken to run the OS as a bare metal application. This means that the OS executes at the same protection level as the hypervisor and has direct access to all hardware. The bare metal environment offers a run time that can be accessed through the bare metal API. However, I chose not to utilize this. It might have come in handy when developing a console driver, but not when developing the target interface. In this thesis the OS is only configured as a BME to get it up and running. With only small changes to the target interface, OSE is capable of installing interrupt vectors and setting up a default MMU map by itself, so there is no real need for the API.
6.3.3 Requirements and Demonstration
Requirements
A couple of milestone-specific requirements are listed in table 6.2.

Nr.  Priority  Description
2    1         The Coresys shall be able to do the final linking and create an ELF.
3    1         The OSE init code shall start to execute in the simulator.
4    1         The init code shall be able to write into the ramlog.

Table 6.2. MS2 - Requirements
Demonstration
The requirements are verified by use cases.
Use Case 3
1. Start to build the Coresys.
2. If there are no error messages, verify that tilepro.elf exists in obj/
Expected Outcome
OSE is linked and the tilepro.elf binary is generated in obj/.
Result of Use Case 3
Status: PASSED
Use Case 4
1. Modify the OSE init code: Add a ramlog print early in the init code, followed
by a breakpoint.
2. Build the libraries and the Coresys.
3. Start the simulator with a configuration to run OSE as a bare metal application by running the make script in simmake/.
4. When the breakpoint is reached and GDB has started, dump the ramlog to
a file and verify that your print was added.
Expected Outcome
The text was printed into the ramlog.
Result of Use Case 4
Status: PASSED
6.3.4 Work Approach
The following steps describe how MS2 was achieved.
1. Choose a reference Coresys.
2. Investigate how the target platform bootstrap handles ELF.
3. If required by the target, make the necessary changes in the linker script.
4. Write the entry code.
6.4 MS3 - Get OSE into a safe state
Milestones 1 and 2 were mostly about understanding the provided tools and the structure of OSE. Milestone 3 is more about the internal architecture of OSE and also contains most of the produced source code artifacts.
6.4.1 Design Decisions
The new hardware abstraction layer strictly follows the internal architecture of the reference architecture and the overall internal architecture of OSE. There was not much room for introducing new design; a lot of time had to be put into learning the software and hardware. There were, however, two decisions to be taken about possible constraints on the port.
One design decision was to implement a native C run time. The first thought was to use the C run-time library that the soft kernel uses; a configuration that only executes in supervisor mode would allow this solution. However, a target port was implemented instead. The reason behind this decision was that this has to be done if the OS shall support both user and supervisor mode. Another reason was that a native approach worked better together with the target interface of the reference architecture.
It was an early decision not to implement any MMU support. The reason for this was that OSE can be configured to run without an MMU, and there was no specified requirement about protection. Implementing MMU support can also be considered a significant task, especially in the TILEpro64 case.
6.4.2 Implemented Parts
The lowest layer in the OSE architecture is called the hardware abstraction layer. The hardware abstraction layer provides a target interface to the higher layers. Most of the implemented parts reside in the hardware abstraction layer, but there are also parts of the CRT that are hardware dependent.
Target Interface
The most important parts of the target interface are the functionality for creating, storing and restoring a process context. The interrupt vectors and the trap code are also implemented in the target interface. Some parts of the target interface are implemented in C and some parts are written in pure assembler. Some functionality can be implemented in either C or assembler, but the functionality that the target interface implements is called frequently by the higher layers of the OS, so it can be wise to implement it in assembler for performance reasons. Interrupt vectors also have high demands on performance and text size, which may leave assembler as the only choice.
The OSE SPI specifies some architecture dependent functionality, like atomic operations and CPU access functionality such as disabling interrupts or manipulating registers. These are all implemented in the target interface.
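As an illustration of the kind of primitive the SPI asks for, the sketch below shows a simple test-and-set lock. It assumes that Tilera's GCC port exposes the TILEpro test-and-set and memory-fence instructions through the __insn_tns() and __insn_mf() intrinsics; the type and function names are hypothetical, and the real target interface implements the corresponding primitives in assembler.

/* Hypothetical spinlock built on the TILEpro tns (test-and-set) instruction. */
typedef struct {
    volatile int locked;         /* 0 = free, 1 = taken */
} zz_spinlock_t;

static inline void zz_spin_lock(zz_spinlock_t *lock)
{
    /* tns atomically writes 1 to the word and returns its previous value;
     * keep trying until the previous value was 0, i.e. the lock was free. */
    while (__insn_tns((void *)&lock->locked) != 0)
        ;
}

static inline void zz_spin_unlock(zz_spinlock_t *lock)
{
    __insn_mf();                 /* make earlier stores visible before release */
    lock->locked = 0;
}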
Figure 6.1. Location of the target interface in the OSE architecture
Figure 6.2. Location of the C run time in the OSE architecture
CRT - C Run Time
The OSE C run time library is implemented in the kernel component called CRT. This library has architecture dependent functionality that had to be implemented, and this was all done in assembler. The code produced in this thesis implements C run time initialization, system calls and memory manipulating functionality.
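The memory manipulating routines in the actual port are written in TILEpro assembler; the portable C sketch below only illustrates the contract of one of them (the name my_memcpy is a placeholder): copy n bytes from source to destination and return the destination pointer.

#include <stddef.h>

void *my_memcpy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    while (n--)                  /* byte-by-byte copy; the assembler version
                                  * copies word-wise for performance */
        *d++ = *s++;
    return dst;
}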
BSP - Board Support Package
There was not enough time to implement any device drivers. The most important drivers to implement would be a console driver and a timer driver. A driver for IPI will also be necessary when implementing multi-core functionality. However, stubbed dummy drivers are provided in the BSP to ease further development. The BSP also contains some target specific initialization, such as setting up a static MMU map and enabling caches.
6.4.3 Requirements and Demonstration
Requirements
A couple of milestone-specific requirements are listed in table 6.3.

Nr.  Priority  Description
5    1         It shall be possible to spawn processes.
6    1         It shall be possible to switch processes.
7    1         There shall be working IPC.
8    1         It shall be possible to do system calls.

Table 6.3. MS3 - Requirements
Demonstration
The requirements are verified with a use case.
Use Case 5
1. Add two processes to the BSP. They should communicate with each other using send and blocking receive, and write into the ramlog what they are doing. Add a breakpoint at the end of one of the processes (make sure that the breakpoint is actually reached).
2. Make sure your processes are created and started in bspStartOseHook2.
3. Start OSE in the simulator.
4. When the breakpoint is reached and GDB is launched, dump the ramlog to a file.
5. Read the ramlog to verify that your processes are working.
Expected Outcome
The desired text has been printed into the ramlog, showing that OSE is in a safe state and that IPC and the dispatcher work.
Result of Use Case 5
Status: PASSED. See appendix A for the demonstration application source code and application output that shows that requirement nr. 4 has been met.
6.4.4 Work Approach
During milestone 3 it was necessary to map the reference architecture to the Tilera architecture and make the required changes. This meant comparing their ISAs and ABIs and then implementing the assembler code and the hardware dependent data structures.
1. Execute OSE in the simulator and implement functions as they are needed.
Chapter 7
Conclusions, Discussion and Future Work
The initial intention of this thesis was to create a project foundation for the MANY project: first, by creating a picture of current many-core research and deriving the main requirements for operating systems on many-core, and then by porting OSE to the TILEpro64 processor.
7.1 Conclusions from the Theoretical Study
The pre-study investigated two research operating systems: Factored Operating Systems and Barrelfish. Both looked at distributed operating systems and distributed web servers for inspiration, and both were aiming at a design where the services provided by the operating system are distributed and avoid sharing cores with user processes. Resources shared between cores shall be kept to a minimum, and communication, both on OS and user level, is preferably done with explicit message passing.
The architecture of OSE was investigated. Because of its distributed design and message passing programming model, OSE turned out to fulfill the requirements stated by the research operating system papers. This meant that it was a good idea to continue with the porting task instead of going deeper into the theory behind software on many-core.
7.2 Results and Future Work
Together with this report, a working copy of OSE has been delivered to Enea AB. This thesis has also delivered a complete build environment for Tilera's architectures. The OSE version that has been tested on the Tilera MDE functional simulator is a single-core system that is able to get into a safe state where processes can be scheduled and executed.
7.2.1 Future Work - Theoretical
On the application side there are many subjects that can be investigated, such as tools for parallel computing and programming models. This thesis has focused on how to make operating systems scalable, and the suggestions about future research areas are also related to the operating system aspects.
Virtualization on Many-Core
The legacy software of today will, of course, also exist in the future. Running legacy software on many-core may require virtualization. Wentzlaff et al. even believe that it will be a requirement on future architectures that they can execute the x86 architecture efficiently as an application [39]. Scalable dynamic virtual machines that execute on many-core are, in my opinion, a very interesting research area.
Another way to utilize the performance of many-core processors, without using a very scalable single image operating system, is to use a hypervisor and provide an AMP environment. Hypervisors on many-core, and especially Enea's hypervisor, are a very interesting subject.
7.2.2 Future Work - Implementation
Five milestones were stated for the porting process. Milestones 1-3 were completed. Milestone 2 took much longer than estimated. The reason for this was the wrong decision to do a para-virtualization: learning how to launch OSE on top of Tilera's hypervisor was very time consuming. When the decision was made to run OSE as a bare metal application, much time had already passed; thus, reaching and demonstrating milestone 3 became the final goal for the implementation part of this thesis.
Milestone 4 - Full-featured single-core version of OSE on TILEpro64
This milestone is described in the previous chapter. A working console driver and a timer device driver have to be implemented. Because of a lack of time, this was not completed.
Milestone 5 - Full-featured multi-core version of OSE on TILEpro64
Milestone 5 is also described in the previous chapter. This is the long-term goal of the project: a working, full-featured multicore SMP operating system.
Bibliography
[1] A. Agarwal and M. Levy, “The kill rule for multicore,” in Proceedings of
the 44th annual Design Automation Conference, DAC ’07, (New York, NY,
USA), pp. 750–753, ACM, 2007.
[2] J. Svennebring, J. Logan, J. Engblom, and P. Strömblad, Embedded Multicore: An Introduction.
[3] C. Kessler and J. Keller, "Models for parallel computing: Review and perspectives," Dec 2007.
[4] “Itea2 - many.” http://www.itea2.org/project/index/view/?project=10090,
2011.
[5] G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling,
and A. Agarwal, “Atac: a 1000-core cache-coherent processor with on-chip optical network,” in Proceedings of the 19th international conference on Parallel
architectures and compilation techniques, PACT ’10, (New York, NY, USA),
pp. 477–488, ACM, 2010.
[6] “Openmp.org.” http://openmp.org/wp/, 2012.
[7] “Wool.” http://www.sics.se/projects/wool, 2012.
[8] “The cilk project.” http://supertech.csail.mit.edu/cilk/, 2010.
[9] “Intel cilk plus.” http://software.intel.com/en-us/articles/intel-cilk-plus/,
2012.
[10] Enea, Enea OSE Core User's Guide. Rev. BL140702.
[11] Enea, Enea OSE Architecture User's Guide. Rev. BL140702.
[12] Enea, OSE Application Programming Interface Reference Manual. Rev. BL140702.
[13] Enea, OSE System Programming Interface Reference Manual. Rev. BL140702.
[14] P. Strömblad, “Enea multicore:high performance packet processing enabled
with a hybrid smp/amp os technology.” Enea White Paper, 2010.
[15] Enea, OSE Device Drivers User’s Guide. Rev. BL140702.
[16] Tilera, Tile Processor Architecture Overview for the TILEpro Series. UG120-Rel. 1.7 (28 May 2011), http://www.tilera.com/scm/docs/index.html.
[17] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey,
M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” IEEE Micro, vol. 27, pp. 15–31,
September 2007.
[18] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald,
H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski,
N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal,
“The raw microprocessor: A computational fabric for software circuits and
general-purpose programs,” IEEE Micro, vol. 22, pp. 25–35, March 2002.
[19] Tilera, Tile Processor User Architecture Manual. UG101-Rel. 2.4 (3 May
2011), http://www.tilera.com/scm/docs/index.html.
[20] Tilera, Tile Processor I/O Device Guide. UG104-Rel. 1.7 (29 Mar 2011),
http://www.tilera.com/scm/docs/index.html.
[21] "Message passing interface at the open directory project." http://www.dmoz.org/Computers/Parallel_Computing/Programming/Libraries/MPI/, 2012.
[22] D. Ungar and S. S. Adams, “Hosting an object heap on manycore hardware:
an exploration,” in Proceedings of the 5th symposium on Dynamic languages,
DLS ’09, (New York, NY, USA), pp. 99–110, ACM, 2009.
[23] I. Choi, M. Zhao, X. Yang, and D. Yeung, "Experience with improving distributed shared cache performance on Tilera's Tile processor," IEEE Computer Architecture Letters, 2011.
[24] Tilera, Multicore Development Environment Optimization Guide. UG105-Rel. 2.4 (6 Jun 2011), http://www.tilera.com/scm/docs/index.html.
[25] Tilera, Application Binary Interface. UG213-Rel. 3.0.1.125620 (9 Apr 2011),
http://www.tilera.com/scm/docs/index.html.
[26] D. Wentzlaff and A. Agarwal, “Factored operating systems (fos): the case
for a scalable operating system for multicores,” SIGOPS Oper. Syst. Rev.,
vol. 43, pp. 76–85, April 2009.
[27] D. Wentzlaff, C. Gruenwald, III, N. Beckmann, K. Modzelewski, A. Belay,
L. Youseff, J. Miller, and A. Agarwal, “An operating system for multicore
and clouds: mechanisms and implementation,” in Proceedings of the 1st ACM
symposium on Cloud computing, SoCC ’10, (New York, NY, USA), pp. 3–14,
ACM, 2010.
[28] “Carbon research group.” http://groups.csail.mit.edu/carbon/, 2012.
[29] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter,
T. Roscoe, A. Schüpbach, and A. Singhania, “The multikernel: a new os architecture for scalable multicore systems,” in Proceedings of the ACM SIGOPS
22nd symposium on Operating systems principles, SOSP ’09, (New York, NY,
USA), pp. 29–44, ACM, 2009.
[30] “The barrelfish operating system.” http://www.barrelfish.org, 2012.
[31] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek,
R. Morris, and N. Zeldovich, “An analysis of linux scalability to many cores,”
in Proceedings of the 9th USENIX conference on Operating systems design
and implementation, OSDI’10, (Berkeley, CA, USA), pp. 1–8, USENIX Association, 2010.
[32] G. Almaless and F. Wajsbürt, "ALMOS: Advanced locality management operating system for cc-NUMA many-cores," 2011.
[33] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs, "Embracing diversity in the Barrelfish manycore operating system," MMCS'08, Boston, Massachusetts, USA, 2008.
[34] J. Holt, A. Agarwal, S. Brehmer, M. Domeika, P. Griffin, and F. Schirrmeister,
“Software standards for the multicore era,” IEEE Micro, vol. 29, pp. 40–51,
May 2009.
[35] J. Psota and A. Agarwal, “rmpi: message passing on multicore processors
with on-chip interconnect,” in Proceedings of the 3rd international conference
on High performance embedded architectures and compilers, HiPEAC’08,
(Berlin, Heidelberg), pp. 22–37, Springer-Verlag, 2008.
[36] “Linux scalability effort.” http://lse.sourceforge.net/.
[37] “Mark russinovich: Inside windows 7.” http://channel9.msdn.com/shows/
Going+Deep/Mark-Russinovich-Inside-Windows-7/.
[38] Enea, OSE 5 Source Getting Started. Rev. BL140702B L150602.
[39] D. Wentzlaff and A. Agarwal, “Constructing virtual architectures on a tiled processor,” in Proceedings of the International Symposium on Code Generation and
Optimization, CGO ’06, (Washington, DC, USA), pp. 173–184, IEEE Computer
Society, 2006.
Chapter 8
Demonstration Application and Output
8.1 Demonstration Application
#define MY_SIGNAL (101)

struct my_signal
{
    SIGSELECT sig_no;
    char *message;
};

union SIGNAL
{
    SIGSELECT sig_no;
    struct my_signal my_signal;
};

static SIGSELECT any_sig[] = { 0 };
static char *message1 = "Howdy ho!";
static char *message2 = "Tally ho!";
PROCESS pid1, pid2;

OS_PROCESS(demo1)
{
    union SIGNAL *sig;

    ramlog_printf("Demo start! \n");
    ramlog_printf("Demo1: Thesis DEMO: My first process!\n");
    while (1)
    {
        ramlog_printf("Demo1: Waiting on Demo2.... \n");
        sig = receive(any_sig);
        ramlog_printf("Demo1: Demo2 says: %s \n", sig->my_signal.message);
        ramlog_printf("Received message from Demo2 :-) I'm not alone after all! \n");
        sig->my_signal.message = message1;
        ramlog_printf("Demo1: Sending %s to Demo2. \n", message1);
        send(&sig, pid2);
    }
}

OS_PROCESS(demo2)
{
    union SIGNAL *sig;
    int i;

    ramlog_printf("Demo2: Thesis DEMO: Almost my first process!\n");
    sig = alloc(sizeof(struct my_signal), MY_SIGNAL);
    sig->my_signal.message = message2;
    for (i = 0; i < 2; i++)
    {
        send(&sig, pid1);
        ramlog_printf("Demo2: Sending %s to Demo1\n", message2);
        ramlog_printf("Demo2: Waiting on Demo1.... \n");
        sig = receive(any_sig);
        ramlog_printf("Demo2: Demo1 says: %s \n", sig->my_signal.message);
        sig->my_signal.message = message2;
    }
    /* Write the value SIM_CONTROL_PANIC to the SPR_SIM_CONTROL special-purpose register. */
    ramlog_printf("Demo complete! \n");
    set_spr(0x4e0c, 27);
}

void
bspStartOseHook2(void)
{
    pid1 = create_process(OS_PRI_PROC,
                          "demo1",
                          demo1,
                          100,                          /* Stack size */
                          30,                           /* Priority */
                          (OSTIME) 0,                   /* Timeslice */
                          (PROCESS) 0,
                          (struct OS_redir_entry *) NULL,
                          (OSVECTOR) 0,
                          (OSUSER) 0);
    start(pid1);

    pid2 = create_process(OS_PRI_PROC,
                          "demo2",
                          demo2,
                          100,                          /* Stack size */
                          30,                           /* Priority */
                          (OSTIME) 0,                   /* Timeslice */
                          (PROCESS) 0,
                          (struct OS_redir_entry *) NULL,
                          (OSVECTOR) 0,
                          (OSUSER) 0);
    start(pid2);
}
8.2 Demonstration Application Output
The demo application prints the following output into the ramlog (startup output included).
__RAMLOG_SESSION_START__
[0] 0.000: ROFS: No embedded volume found. Use rofs_insert host tool to insert one.
[0] 0.000: Detected TILEpro64, PVR 0xffff, D-cache 8 KByte, I-cache 16 KByte
[0] 0.000: mm: mm_open_exception_area(cpu_descriptor=36c044, vector_base=0, vector_size=256)
[0] 0.000: CPU_HAL_TILEPRO: init_cpu.
[0] 0.000: mm: mm_open_exception_area MMU=0
[0] 0.000: mm: mm_install_exception_handlers: entry
[0] 0.000: mm: start parsing log_mem string: krn/log_mem/RAM
[0] 0.000: mm: max_domains: 255 @ 1206a80
[0] 0.000: Boot heap automatically configured. [0x01410000-0x081fffff]
[0] 0.000: init_boot_heap(0x01410000, 0x081fffff)
[0] 0.000: Initial range: [0x01410000-0x081fffff]
[0] 0.000: curr_base : 0x08200000
[0] 0.000: phys_frag :
[0] 0.000: MM: add_bank: name RAM [0x200000-0x8000000] bank_size 0x060034, frag_cnt 0x008000, sizeof *bank, 0x000040 (sizeof *frag) 0x00000c
[0] 0.000: mm: phys_mem [0x0000200000-0x00081fffff] SASE RAM
[0] 0.000: mm: start parsing log_mem string: krn/log_mem/RAM
[0] 0.000: mm: log_mem [0x00200000-0x081fffff] SASE RAM
[0] 0.000: mm: region "bss": [0x01200000-0x0140ffff], (0x00210000) su_rw_usr_ro copy_back speculative_access
[0] 0.000: krn/region/bss: [0x01200000-0x0140ffff]
[0] 0.000: mm: region "data": [0x0036c000-0x0036dfff], (0x00002000) su_rw_usr_rw copy_back speculative_access
[0] 0.000: krn/region/data: [0x0036c000-0x0036dfff]
[0] 0.000: mm: region "ramlog": [0x00100000-0x00107fff], (0x00008000) su_rw_usr_na write_through speculative_access
[0] 0.000: mm: log_mem [0x00100000-0x00107fff] SASE ramlog
[0] 0.000: krn/region/ramlog: [0x00100000-0x00107fff]
[0] 0.000: mm: region "text": [0x00200000-0x0036bfff], (0x0016c000) su_rwx_usr_rwx copy_back speculative_access
[0] 0.000: krn/region/text: [0x00200000-0x0036bfff]
[0] 0.000: Data cache not enabled in configuration. For NOMMU only.
[0] 0.000: Instruction cache not enabled in configuration. For NOMMU only.
[0] 0.000: mm: map_regions()
[0] 0.000: MM-meta-data: [0x0819d000-0x081fffff]
[0] 0.000: MM init completed
[0] 0.000: has_mmu= 0
[0] 0.000: mm: initialization completed.
[0] 0.000: Cache bios installed.
[0] 0.000: syscall ptr: 2f4fe0 bios ptr: 2f50c0
[0] 0.000: Starting syspool extender.
[0] 0.000: Starting mainpool extender.
[0] 0.000: kernel is up.
[0] 0.000: Starting RTC.
[0] 0.000: Starting system HEAP.
[0] 0.000: Starting FSS.
[0] 0.000: Starting PM
[0] 0.000: Starting SHELLD.
[0] 0.000: OSE5 core basic services started.
[0] 0.000: ROFS: /romfs: No volume found. Has ose_rofs_start_handler0() been called?
[0] 0.000: Starting DDA device manager
[0] 0.000: Installing static device drivers.
[0] 0.000: dda: ddamm_alloc_uncached from MM(16384) = 0x4b9000
[0] 0.000: devman: Started (log_mask=0x3)
[0] 0.000: Register driver pic_tilepro
[0] 0.000: Register driver ud16550
[0] 0.000: Register driver timer_tilepro
[0] 0.000: Activating devices.
[0] 0.000: Starting SERDD.
[0] 0.000: Starting CONFM.
[0] 0.000: Adding program type APP_RAM execution mode user
[0] 0.000: APP_RAM/text=phys_mem:RAM log_mem:RAM cache:0xc perm:0x707
[0] 0.000: APP_RAM/pool=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: APP_RAM/data=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: APP_RAM/heap=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: No conf for: pm/progtype/APP_RAM/heap, using default.
[0] 0.000: Adding program type SYS_RAM execution mode supervisor
[0] 0.000: SYS_RAM/text=phys_mem:RAM log_mem:RAM cache:0xc perm:0x707
[0] 0.000: SYS_RAM/pool=phys_mem:RAM log_mem:RAM_SASE cache:0xc perm:0x303
[0] 0.000: SYS_RAM/data=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: SYS_RAM/heap=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: No conf for: pm/progtype/SYS_RAM/heap, using default.
[0] 0.000: Starting RTL ELF.
[0] 0.000: Demo start!
[0] 0.000: Demo1: Thesis DEMO: My first process!
[0] 0.000: Demo1: Waiting on Demo2....
[0] 0.000: Demo2: Thesis DEMO: Almost my first process!
[0] 0.000: Demo2: Sending Tally ho! to Demo1
[0] 0.000: Demo2: Waiting on Demo1....
[0] 0.000: Demo1: Demo2 says: Tally ho!
[0] 0.000: Received message from Demo2 :-) I'm not alone after all!
[0] 0.000: Demo1: Sending Howdy ho! to Demo2.
[0] 0.000: Demo1: Waiting on Demo2....
[0] 0.000: Demo2: Demo1 says: Howdy ho!
[0] 0.000: Demo2: Sending Tally ho! to Demo1
[0] 0.000: Demo2: Waiting on Demo1....
[0] 0.000: Demo1: Demo2 says: Tally ho!
[0] 0.000: Received message from Demo2 :-) I'm not alone after all!
[0] 0.000: Demo1: Sending Howdy ho! to Demo2.
[0] 0.000: Demo1: Waiting on Demo2....
[0] 0.000: Demo2: Demo1 says: Howdy ho!
[0] 0.000: Demo complete!
In English
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page: http://www.ep.liu.se/
© Sixten Sjöström Thames