Institutionen för datavetenskap
Department of Computer and Information Science

Final thesis

Porting a Real-Time Operating System to a Multicore Platform

by Sixten Sjöström Thames

LIU-IDA/LITH-EX-A--12/009--SE
2012-02-07

Linköpings universitet, SE-581 83 Linköping, Sweden

Supervisor: Sergiu Rafiliu
Examiner: Petru Ion Eles

Abstract

This thesis is part of the European MANY project. The goal of MANY is to provide developers with tools to develop software for multi- and many-core hardware platforms. This is the first thesis that is part of MANY at Enea. The thesis aims to provide a knowledge base about software on many-core at the Enea student research group. More than just providing a knowledge base, a part of the thesis is also to port Enea's operating system OSE to Tilera's many-core processor TILEpro64. The thesis shall also investigate the memory hierarchy and interconnection network of the Tilera processor.

The knowledge base about software on many-core was constrained to investigating the shared memory model and operating systems for many-core. This was achieved by investigating prominent academic research about operating systems for many-core processors. The conclusion was that a shared memory model does not scale and, for the operating system case, that operating systems shall be designed with scalability as one of the most important requirements.

This thesis has implemented the hardware abstraction layer required to execute a single-core version of OSE on the TILEpro architecture. This was done in three steps. The Tilera hardware and the OSE software platform were investigated. After that, an OSE target port was chosen as reference architecture. Finally, the hardware dependent parts of the reference software were modified. A foundation has been made for future development.

Acknowledgments

My deepest gratitude goes to Patrik Strömblad for guiding me during the whole project. Patrik has enlightened me about multi-core and has provided valuable advice during the porting process. I thank my supervisors Barbro and Detlef for giving me the chance to work on this project. I really appreciate their moral support and guidance. I would like to thank the employees at Enea who gave me guidance and good company, especially Johan Wiezell who explained the details of porting OSE. Finally I thank my girlfriend Bibbi, who has supported and encouraged me during the thesis work.

Contents

1 Introduction
  1.1 Thesis Background
  1.2 Problem Statement
    1.2.1 Target Interface and Board Support Package
    1.2.2 Memory Hierarchy and Network-On-Chip
    1.2.3 Shared Memory Multi Processing
  1.3 Method
  1.4 Limitations

2 Background
  2.1 ITEA2 - MANY
  2.2 Multicore Architecture
    2.2.1 Heterogeneous Multi-Core
    2.2.2 Homogeneous Multi-Core
    2.2.3 Memory Architecture
  2.3 Software Parallelism
    2.3.1 Bit-Level Parallelism
    2.3.2 Instruction Level Parallelism
    2.3.3 Data Parallelism
    2.3.4 Task Parallelism
  2.4 Software Models
    2.4.1 Symmetric Multiprocessing
    2.4.2 Asymmetric Multiprocessing

3 Enea's OSE
  3.1 Architecture Overview
  3.2 Load Modules, Domains and Processes
  3.3 OSE for Multi-Core
    3.3.1 Migration and Load Balancing
  3.4 Hardware Abstraction Layer
  3.5 Conclusion

4 Tilera's TILEpro64
  4.1 Architecture Overview
  4.2 Interconnection Network - iMesh
    4.2.1 Interconnection Hardware
    4.2.2 The Networks
    4.2.3 Protecting the Network
    4.2.4 Deadlocks
    4.2.5 iLib
    4.2.6 Conclusions about the Interconnection Network
  4.3 Memory Hierarchy
    4.3.1 Memory Homing
    4.3.2 Dynamic Distributed Cache
    4.3.3 Conclusions about the Memory Architecture
  4.4 Tools and Software Stack
  4.5 Tilera Application Binary Interface

5 Software on Many-Core
  5.1 Scalability Issues with SMP Operating Systems
    5.1.1 Locking the kernel
    5.1.2 Sharing Cache and TLBs between Application and OS
    5.1.3 Dependency on Effective Cache Coherency
    5.1.4 Scalable SMP Systems
  5.2 Operating Systems for Many-Core
    5.2.1 Design Principles: Factored Operating Systems
    5.2.2 Design Principles: Barrelfish
    5.2.3 Conclusions from Investigating fos
    5.2.4 Conclusions from Investigating Barrelfish
  5.3 Conclusions
    5.3.1 Distributed Architectures are Scalable
    5.3.2 One Thread - One Core
    5.3.3 IPC with Explicit Message Passing
    5.3.4 Example of a Many-Core OS
    5.3.5 Enea OSE and Many-Core

6 Porting Enea OSE to TILEpro64
  6.1 Milestones
    6.1.1 Milestone 1 - Build environment
    6.1.2 Milestone 2 - Launch OSE and write into a Ramlog
    6.1.3 Milestone 3 - Get OSE into a safe state
    6.1.4 Milestone 4 - Full featured single-core version of OSE on TILEpro64
    6.1.5 Milestone 5 - Full featured multi-core version of OSE on TILEpro64
  6.2 MS1 - Build Environment
    6.2.1 Omnimake
    6.2.2 Requirements and Demonstration
    6.2.3 Work Approach
  6.3 MS2 - Coresys
    6.3.1 Implemented Parts
    6.3.2 Design Decisions
    6.3.3 Requirements and Demonstration
    6.3.4 Work Approach
  6.4 MS3 - Get OSE into a safe state
    6.4.1 Design Decisions
    6.4.2 Implemented Parts
    6.4.3 Requirements and Demonstration
    6.4.4 Work Approach

7 Conclusions, Discussion and Future Work
  7.1 Conclusions from the Theoretical Study
  7.2 Results and Future Work
    7.2.1 Future Work - Theoretical
    7.2.2 Future Work - Implementation

Bibliography

8 Demonstration Application and Output
  8.1 Demonstration Application
  8.2 Demonstration Application Output

Chapter 1 Introduction

1.1 Thesis Background

This thesis work has been conducted at ENEA AB, a global software and services company with focus on solutions for communication-driven products. The thesis is part of the MANY project (Many-core programming and resource management for high-performance embedded systems) hosted by ITEA2 (Information Technology for European Advancement).

As predicted by Moore's law, the number of transistors per chip doubles approximately every 18 months. Because sequential single-core processors are unable to deliver performance proportional to the increased number of transistors, multi-core processors are now standard in basically all domains [1][2]. In the near future, hundreds of cores are to be expected in various embedded devices [1]. As the number of cores per chip grows, developing software becomes a more complex task [3]. To scale well when the number of cores increases, software has to be rewritten to execute in parallel. Soon, many-core hardware systems (multi-core processors with at least dozens of cores) are expected to be mainstream in the embedded segment, and software development has to adapt to this [1]. Demands on shorter time-to-market and the complexity of parallel software make it necessary to provide developers with good tools. The MANY project addresses this issue and has an objective to provide developers with an efficient programming environment [4]. This master thesis focuses on porting OSE ME (Enea OSE Multicore Real-Time Operating System) to a many-core platform.

1.2 Problem Statement

This thesis required a pre-study of software on many-core, an investigation of the chosen many-core hardware platform (TILEpro64) and the software platform (OSE) that was ported.
To achieve this, a number of subjects had to be studied in detail. A good understanding of the OSE ME architecture and the target platform architecture was necessary.

1.2.1 Target Interface and Board Support Package

A new OSE target interface and a BSP (Board Support Package) had to be developed for the TILEpro64 architecture. This is a heavy task that requires a good understanding of the TILEpro64 architecture and the OSE architecture. A build environment for the new architecture had to be implemented as well. A question asked in the project specification was: To what extent can code be reused?

1.2.2 Memory Hierarchy and Network-On-Chip

Each core on the TILEpro64 processor has an integrated L1 and L2 cache. The L3 cache is distributed among the tiles. On top of this there are four DDR2 controllers connected to the iMesh. This memory hierarchy had to be investigated. The following question was asked in the project specification: How does the memory hierarchy of TILE64 cope with the demands of a RTOS?

The TILE64 processor has 64 cores (also referred to as tiles) connected in an iMesh (intelligent mesh) on-chip network. An important part of the pre-study was to investigate interfaces for communication between tiles. The following question was asked in the project specification: What implications does the iMesh network have on a Real-Time Operating System?

1.2.3 Shared Memory Multi Processing

As stated above, developers need to be provided with an efficient programming environment. The OSE programming model uses an asynchronous message passing model for IPC (Inter-Process Communication). There is also a shared memory model available with POSIX (Portable Operating System Interface) threads. This thesis had to investigate what options are available and most suitable among solutions based on a shared memory model. The following question was asked in the project specification: Is it desirable to develop a shared memory tool for parallel computing?

1.3 Method

The work was organized in two phases, consisting of theoretical research and implementation. During phase 1 (covering the first 10 weeks) software for many-core was investigated. Tilera and Enea documentation was also studied in detail. The result of phase 1 was a half-time presentation and a half-time report.

In phase 2 (the final 10 weeks) the porting took place. This meant that the implementation of an OSE target interface and a BSP for the TILEpro64 processor had to be done. It was not expected that the complete operating system would be ported, but a prototype and a foundation for future thesis projects had to be implemented.

1.4 Limitations

The time limit for this study was 20 weeks, in which all the literature study, implementation, report and presentation had to be completed.

Chapter 2 Background

This chapter introduces some basic concepts that are necessary when describing operating systems for many-core.

Multicore processors are common nowadays and have been shipped with desktop PCs for almost a decade. Processors such as Tilera's TILEpro64 have dozens of cores and are referred to as many-core processors. It is a common assumption that a single chip will contain as many as 1000 cores within the next decade [1][5].
The reason for this dramatic increase in core count is that it has been necessary in order to meet the demand for higher performance and decreased power consumption. Before multicore processors appeared, performance was improved by increasing the frequency, utilizing instruction level parallelism and increasing cache sizes. This however came to an end [2]. The main reasons are listed below.

• When the frequency is increased, the power consumption also increases. This is not acceptable when power consumption is a main requirement. The following equation shows the relation between power and frequency:

  power = capacitance × voltage² × frequency

• Superscalar techniques do not scale with frequency. The increased frequency may demand that a pipeline has to be stalled or that an additional stage needs to be added. This reduces the benefits of increased frequency.

• The off-chip memory and I/O subsystems work at a lower frequency and tend to stall the processor. This has been countered by increasing cache sizes. However, making the cache bigger requires more silicon, which implies more power consumption.

2.1 ITEA2 - MANY

MANY [4] is the name of the European project hosted by ITEA2, which this thesis is part of. The objective of MANY is to provide embedded system developers with a programming environment for many-core.

2.2 Multicore Architecture

This subsection contains a brief background on multi-core architecture.

2.2.1 Heterogeneous Multi-Core

Heterogeneous multicore systems are SoCs (Systems on Chip) containing processors with different instruction sets. This is common in embedded systems where, for example, a general purpose processor provides a user interface and controls special purpose hardware. The special purpose hardware can be a digital signal processor or an FPGA. This architecture provides both challenges and advantages. One challenge is that the same OS image cannot be executed on all cores. A big advantage, however, is that performance can be improved by using special purpose hardware. Figure 2.1 shows a heterogeneous system.

Figure 2.1. Heterogeneous Multicore

2.2.2 Homogeneous Multi-Core

This is the most common architecture in desktop systems. In the homogeneous multicore processor all cores have the same architecture. A homogeneous system containing only a few cores, together executing an SMP (Symmetric Multiprocessing) operating system, provides a pleasant environment for the developer, since the operating system is able to provide a lot of abstraction. Figure 2.2 shows a homogeneous system.

Figure 2.2. Homogeneous Multicore

2.2.3 Memory Architecture

There are numerous memory architectures for multicore. This section describes the two basic concepts.

Distributed Memory

Distributed memory is when all cores have their own private memory. Communication is done by messages or streams over high-speed interconnection hardware.

Shared Memory

Shared memory is when the cores share the main memory. Communication between cores is done through the memory. The cores typically use a private cache, which means that some kind of hardware has to make sure that the memory is consistent throughout the system.

Cache Coherency

In a shared memory system where cores use a private cache, it is important to make sure that the shared resources are consistent throughout the caches. For CMPs (Chip Multiprocessors) this is guaranteed by a cache coherence protocol implemented in hardware. There are two main types of cache coherency protocols.
Bus snooping can be used on a shared-bus CMP. One solution is that the private caches are implemented as write-through. When one core writes to the main memory, the cache coherency hardware at each core monitors the bus and invalidates its own copy of the data if it is located in the cache. The bus as an interconnection network does not scale to many-core, which excludes snooping as a cache coherence protocol for many-core.

Directory based is when a central directory keeps track of all data that is being shared between the cores. Communication with the main memory goes through the directory. This creates a bottleneck, and bottlenecks do not scale. The chapter covering TILEpro64 describes the Distributed Shared Cache and Dynamic Distributed Cache.

2.3 Software Parallelism

Unfortunately, an increased number of cores does not imply a proportional increase in performance [3]. Sequential code typically performs worse on a multi-core system. To efficiently utilize multi-core systems the software has to be written with parallelism in mind. Amdahl's law shows how sequential code for a fixed-size problem affects the performance on parallel systems, and that adding more cores does not yield a proportional speedup [2]:

  Speedup = 1 / (S + (1 − S)/N)

Amdahl's law. S = sequential portion of code, N = number of processors.

Amdahl's law only shows the speedup for algorithms with a fixed-size problem [2]. Gustafson's law shows the scaled speedup for a problem that is of variable size [2]. One example is packet processing. It is easy to understand that by adding more cores, the system will be able to process more packets.

  Scaled speedup = N + (1 − N) × S

Gustafson's law. S = serial portion of code, N = number of processors.

Parallelism is being extracted and utilized at every level: in hardware, at compile time, by the system software and at application software level. These are the four basic types of parallelism.

2.3.1 Bit-Level Parallelism

Bit-level parallelism is the width of the processor. Increasing bit-level parallelism is when a processor is redesigned to work with larger data sizes. One example is extending a parallel bus from 32-bit to 64-bit.

2.3.2 Instruction Level Parallelism

Instruction level parallelism is the parallelism that can be achieved by executing instructions that are not dependent on each other in parallel. This is taken advantage of by VLIW (Very Long Instruction Word) and superscalar processors by using techniques such as out-of-order execution. Compilers can optimize code to achieve a higher degree of instruction level parallelism.

2.3.3 Data Parallelism

Data parallelism is when there is no dependency between data. One example is when working with arrays of data where the elements do not depend on each other. This has been utilized for a long time in CPUs that support SIMD (Single Instruction Multiple Data) instructions. This is also an area where multi-core systems make a big difference. A developer can, for example, use POSIX threads and divide the pixels in an image between different threads that run on different cores (this is connected with task parallelism).

2.3.4 Task Parallelism

Task parallelism is the ability to run different processes and threads in parallel. This is an area where multi-core systems can provide big performance boosts. Systems consisting of loosely coupled units can really benefit from this. The application developer has to find parallel parts in his or her sequential program that can be divided into different threads that can be executed in parallel, as the sketch below illustrates.
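As a concrete illustration of the data and task parallelism described above, the following sketch divides the elements of an array (for instance, the pixels of an image) between a few POSIX threads. It is a minimal, hypothetical example and not code from the thesis port; the array size, thread count and the brighten operation are arbitrary choices.

#include <pthread.h>
#include <stdio.h>

#define N_PIXELS  1024
#define N_THREADS 4

static unsigned char image[N_PIXELS];    /* the shared input/output data  */

struct slice { int first; int last; };   /* half-open range [first, last) */

/* Each thread processes its own slice; no element is shared between
 * threads, so no locks are needed (pure data parallelism). */
static void *brighten(void *arg)
{
    struct slice *s = arg;
    for (int i = s->first; i < s->last; i++)
        if (image[i] < 245)
            image[i] += 10;
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    struct slice part[N_THREADS];
    int chunk = N_PIXELS / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        part[t].first = t * chunk;
        part[t].last  = (t == N_THREADS - 1) ? N_PIXELS : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, brighten, &part[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);

    printf("done\n");
    return 0;
}

The speedup such a program can reach is bounded by Amdahl's law above: the thread creation, the final join and any remaining sequential work form the serial fraction S.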
2.4 Software Models

2.4.1 Symmetric Multiprocessing

The SMP model is when resources such as the OS image are shared between the cores [2][3]. Communication and synchronization are done through a shared memory. This is a suitable model for systems with a small number of cores. It provides an environment similar to the multitasking single-core system. The application developer can use many tools if he or she wants to use the shared memory model to utilize parallelism in the application. Examples of such tools are: OpenMP [6], Wool [7], Cilk [8] and Cilk+ [9].

Affinity Scheduling

To better utilize the performance of multi-core processors, SMP operating systems usually provide affinity scheduling [2]. That means that the scheduler takes the physical location of threads into account. Threads that share memory are preferably located on the same core, or on cores close to each other if it is a NUMA (Non-Uniform Memory Access) system.

Scalability

The fact that SMP systems share resources among cores also implies bottlenecks. This is described in more detail in chapter 5.

2.4.2 Asymmetric Multiprocessing

This is the classical model for distributed systems. In AMP (Asymmetric Multiprocessing) the cores do not share resources [2]. This is common for heterogeneous distributed systems but can also be implemented on homogeneous systems. On homogeneous systems a hypervisor can be used to handle protection, provide communication channels and distribute resources among operating systems that execute on different cores. When using a hypervisor it is also possible to run multiple SMP systems in parallel on the same processor as a big AMP system.

Chapter 3 Enea's OSE

OSE (Operating System Embedded) is a distributed operating system that supports hard and soft real-time applications. Being a distributed operating system, OSE can execute on heterogeneous distributed systems and clusters. Operating system core services and application-provided services can be accessed location-transparently by applications through a message-based programming model. Each node in the distributed system runs an OSE micro-kernel that can be extended with modules. OSE Multicore Edition has extended the previous pure AMP model to a hybrid AMP/SMP model. The content of this chapter is mainly derived from OSE documentation [10][11][12][13].

3.1 Architecture Overview

As mentioned above, OSE is based on a message-based model. The micro-kernel provides real-time features such as a preemptive scheduler, prioritized interrupts and processes. The kernel provides memory management, and memory pools are used to provide deterministic access times. Figure 3.1 shows the layers of OSE.

Figure 3.1. OSE layers

Processes in OSE are similar to POSIX threads. A location-transparent message passing API for inter-process communication is provided by the kernel. All communication and synchronization in the OSE programming model is done with asynchronous message passing. This makes application code scalable, since it can be moved from a single-core system to a multi-processor cluster without modifying the code. Inter-process communication between cores is serviced by the medium-transparent Enea LINX. A sketch of what this programming model looks like in code is given below.
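The following sketch shows the asynchronous signal-passing style described above: a sender allocates a signal buffer from a pool, sends it to a peer found by name, and blocks waiting for a reply. The declarations follow the classic OSE signal interface as it is commonly documented (ose.h, alloc, send, receive, hunt, free_buf); exact types and headers may differ between OSE versions, so this should be read as an illustrative sketch rather than code from the port. The process names, signal number and payload are invented for the example.

#include "ose.h"       /* classic OSE kernel API: alloc, send, receive, hunt */

#define PING_SIG 1001  /* application-defined signal number (arbitrary here) */

struct ping_sig { SIGSELECT sig_no; int payload; };

union SIGNAL {
    SIGSELECT       sig_no;
    struct ping_sig ping;
};

/* Sender process: allocate a signal buffer from the pool, fill it in and
 * send it. Ownership of the buffer passes to the receiver on send().
 * A corresponding "ping_receiver" process (not shown) is assumed to exist
 * and to reply with some signal. */
OS_PROCESS(ping_sender)
{
    PROCESS receiver;
    static SIGSELECT any_sig[] = { 0 };         /* 0 = receive any signal */

    hunt("ping_receiver", 0, &receiver, NULL);  /* look up the peer by name */

    for (;;) {
        union SIGNAL *sig = alloc(sizeof(struct ping_sig), PING_SIG);
        sig->ping.payload = 42;
        send(&sig, receiver);                   /* sig pointer is consumed */

        sig = receive(any_sig);                 /* block for the reply */
        free_buf(&sig);
    }
}

Because the communication is a buffer handover rather than shared state, the same code works whether the receiver runs on the same core, another core or another node reached via LINX.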
3.2 Load Modules, Domains and Processes

A module is a collection of processes that make up an application. A program is an instantiated load module. A module that is linked with the kernel at compile time is referred to as a core module. Modules that are linked separately can be loaded dynamically at run-time and are referred to as load modules.

OSE processes do introduce some confusion: processes in OSE are more similar to what is commonly known as threads. Processes may be grouped and may share memory pools and heaps.

A domain is a memory region shared by programs. If an MMU (Memory Management Unit) is used, OSE is able to provide full memory protection and virtual memory regions between domains. A domain usually contains code, data, heaps and pools. The pool is used for deterministic dynamic allocation of signal buffers and process stacks. The heap is also used for dynamic allocation but is not used for stacks and signal buffers. The heap may preferably be used by applications that do not completely use the OSE programming model. Software that depends on POSIX may use the heap.

A program can be configured as private or shared. In a program configured as private, signals, buffers and files are privately owned by specific processes. These are reclaimed by the kernel when the process terminates. A process may only modify data that is owned by that process. When the program is configured as shared, a heap is shared among processes and the POSIX shared programming model works. A program that uses POSIX-compatible threads shall be executed in shared memory mode. A shared heap has to be used for multi-threaded parallel programming. As always with shared resources, when sharing a heap between processes and cores, locks such as spinlocks and mutexes have to be used to protect critical sections.

3.3 OSE for Multi-Core

The OSE programming model has been adaptable to multi-processor systems for a long time through the AMP model. This model can be used today on multi-core systems when there is a hypervisor that manages memory, peripherals and inter-core communication. OSE for multi-core has extended this distributed model with certain properties of SMP operating systems, creating a hybrid. OSE MCE has startup code that loads the OS image onto several cores on a multi-core processor.

The multi-core edition is the same distributed operating system as before with additional SMP features [14][2]. OSE still uses the distributed-system AMP model, having each core running its own scheduler with associated data structures. OS services can still be distributed and accessed via the message passing model and the distributed CRT. The OSE MCE architecture tries to keep shared system resources to a minimum (to maintain the AMP scalability). Figure 3.2 shows how OSE MCE is distributed among the cores.

Figure 3.2. OSE Multi-core Edition

Synchronization between cores is done with something called kernel events. A global shared data structure called the kernel event queue is used. When a core has stored its kernel event in the queue, it generates an IPI (inter-processor interrupt) to notify the receiving core(s). The IPI implementation is hardware specific; a high-speed interconnection mechanism is preferably used. A conceptual sketch of this pattern is shown below.
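The sketch below illustrates the kernel-event pattern just described: a core enqueues an event on a shared queue and then raises an inter-processor interrupt so the target core notices it. All names and the queue layout are hypothetical placeholders chosen for the example; this is not OSE MCE source code, and a real implementation must also handle a full queue and use the platform's actual IPI mechanism.

#include <stdint.h>

#define MAX_EVENTS 64

struct kernel_event { uint32_t type; uint32_t target_core; uint32_t arg; };

/* Single global kernel event queue, as described in the text above. */
struct event_queue {
    struct kernel_event slot[MAX_EVENTS];
    volatile uint32_t   head;     /* next free slot (written by senders)   */
    volatile uint32_t   tail;     /* next unread slot (written by readers) */
};

extern struct event_queue kernel_event_queue;
extern void hw_send_ipi(int core);                 /* hardware-specific hook */
extern void spin_lock(volatile uint32_t *lock);
extern void spin_unlock(volatile uint32_t *lock);

static volatile uint32_t queue_lock;

/* Called on the sending core: publish the event, then interrupt the target. */
void notify_core(int target_core, uint32_t type, uint32_t arg)
{
    struct event_queue *q = &kernel_event_queue;

    spin_lock(&queue_lock);
    q->slot[q->head % MAX_EVENTS] =
        (struct kernel_event){ .type        = type,
                               .target_core = (uint32_t)target_core,
                               .arg         = arg };
    q->head++;
    spin_unlock(&queue_lock);

    hw_send_ipi(target_core);   /* delivered over a high-speed interconnect */
}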
3.3.1 Migration and Load Balancing

OSE MCE provides functionality to migrate domains, programs, blocks and processes between cores. When a program or domain is moved between two cores, all of the program's processes, blocks and the program heap daemon are also moved automatically. It is possible to lock programs to specific cores. Interrupt processes and timer interrupt processes cannot be moved, and processes that use non-SMP system calls will be locked to one core. The OSE kernel does not provide any load balancing. This is expected to be implemented and controlled by the application designer.

3.4 Hardware Abstraction Layer

The bottom layer of the OSE stack is called the HAL (Hardware Abstraction Layer). The HAL provides a target interface to the OSE kernel. This layer implements hardware-specific functionality like MMU and cache support. This is the layer that has to be modified during the porting process. Part of the target layer is also the board support package. The BSP contains system initialization code and device driver source code. An OSE distribution does not include a BSP. The BSP is instead delivered as a separate component and the user can freely choose which BSP to use when compiling the OSE system.

3.5 Conclusion

OSE Multicore Edition has extended the AMP model with a shared memory environment. It provides a single-chip AMP-like environment that does not rely on a hypervisor to work. It has extended the OSE message-based environment with an inter-core shared memory environment. Heaps that can be accessed from multiple cores implement the shared memory context.

If a shared memory implementation for parallel programming such as Wool, Cilk or OpenMP is to be implemented, it is important that individual processes can easily be created, killed and preferably also migrated between cores. OSE Multicore Edition supports these features. When implementing a load balancer it is obviously important that the kernel provides core load monitoring functionality. In OSE the Program Manager or the Run-Time Monitor provides this functionality.

The following question is asked in the project specification: Is it desirable to implement a shared memory model for IPC? This question is faulty. OSE supports a shared memory model and implements a subset of POSIX. The question should instead be: Is it desirable to implement a shared memory tool for parallel computing? This question can be answered. The answer: the optimal programming model when developing for OSE is the OSE message passing model; it enforces a scalable parallel design and has been used in clusters for a long time. A user may, however, have special reasons to use some other programming model based on shared memory. One example of such a situation would be when legacy code has to run on an OSE system. OSE Multicore Edition provides the required functionality for implementing a shared memory tool for exploiting task parallelism in legacy applications. This is illustrated in figure 3.3.

Figure 3.3. Processes sharing an address space between cores

The other question from the project specification can also be answered: To what extent can code be reused? The answer: porting means that the hardware-specific parts, such as the hardware abstraction layer of the operating system, are the ones that have to be changed. This means that there might not be that much code that can be reused. Of course, it is possible to use code for another target as reference, but direct copying might be difficult. When making the device drivers, there are more possibilities for reuse and there are also device driver templates that can be used [15].

Chapter 4 Tilera's TILEpro64

TILEpro64 [16] is given its own chapter in this thesis. The reason for this is that it has an interesting many-core architecture.
Studying this may lead to a better general understanding of many-core, especially about interconnection networks and memory hierarchy. The TILEpro64 is also the target architecture when porting OSE in the implementation part of this thesis. This chapter can thus also be considered as a pre-study directly linked to the implementation. The content of this chapter is based on the documentation located at the Tilera Open Source Documentation page http://www.tilera.com/scm/docs/index.html and the iMesh article [17].

4.1 Architecture Overview

Tilera's TILEpro64 architecture is a homogeneous tiled multi-core processor inspired by MIT's RAW [18]. The cores (referred to as tiles) are organized in an 8x8 mesh on-chip interconnection network called iMesh. Each tile contains a VLIW core, a cache and a switch that connects the tile to the on-chip network [16]. The layout of the processor can be seen in figure 4.1.

Figure 4.1. The TILEpro64 processor is organized in an 8x8 mesh

Each core is a general-purpose 32-bit VLIW processor. Each core contains an independent program counter, interrupt hardware, different protection levels and virtual memory, and is capable of running an operating system. There are actually four protection levels: user, supervisor, hypervisor and hypervisor debug. This means that virtualization is supported in hardware. The processor uses a RISC ISA extended with instructions commonly used in DSP or packet processing applications.

On each tile there is also a cache engine containing an L1 data cache (8KB), an L1 instruction cache (16KB) and a unified L2 cache (64KB). There is a total of 5.5 MB of on-chip cache distributed among the cores. It is possible for any tile to access the L2 cache of any other tile on the interconnection network. This makes up the virtual L3 cache called the Distributed Shared Cache (DSC) [16]. The processor supports full cache coherency. The memory hierarchy and interconnection network are described below. The tile can be seen in figure 4.2.

Figure 4.2. The tile consisting of a VLIW core, cache engine and network switch

The cores are connected to the iMesh with a switch engine residing on each tile. The iMesh is actually six parallel mesh networks: five dynamic (UDN, TDN, MDN, CDN and IDN) and one static (STN) [16].

4.2 Interconnection Network - iMesh

iMesh [17] consists of six physical 2D-mesh networks. Each network has a dedicated purpose, such as communication between tiles and I/O controllers or communication between caches and memory. The UDN, IDN and STN are all accessible from software. The other networks are controlled by hardware and are used by the memory system to provide inter-tile shared memory, cache coherence and tile-to-memory communication. The hardware-controlled networks are guaranteed to be deadlock free; however, care must be taken for the software-accessible dynamic networks.

The six networks are physically independent 32-bit full-duplex networks. They are named: the Static Network (STN), the Tile Dynamic Network (TDN), the User Dynamic Network (UDN), the Memory Dynamic Network (MDN), the Coherence Dynamic Network (CDN) and the I/O Dynamic Network (IDN). The networks can be used simultaneously. Each tile has a switch. All switches together make up the iMesh network and provide the control and data paths for connections on the network. They also implement buffering and flow control.

4.2.1 Interconnection Hardware

The Switch

The switch is connected to all six networks and has five full-duplex ports for each of them: one in each direction (north, east, south, west) and one connected to the local tile.
The reason why the iMesh implements physical networks instead of logical ones is that logical networks would need the same amount of buffering, while the extra wire connections are relatively cheap [17].

Receiving Messages

It is possible to implement demultiplexing of received messages in software. This is done by triggering an interrupt when a message is received; the interrupt service routine then stores the message in a queue located in memory. On the IDN and UDN, demultiplexing of incoming messages is supported in hardware. On the UDN there are four queues that can be programmed to store different incoming messages depending on the message tag. On the IDN there are two such queues. Both networks also have a catch-all queue that catches messages that do not match any of the other queues.

4.2.2 The Networks

The dynamic networks use packet-based communication [17]. The packet header contains information about the destination tile and the packet length. On the network the packet is wormhole routed, which means that much smaller buffers are needed at each switch because the packet buffering is distributed all along the connection path. A dimension-ordered routing policy is used, which means that a packet first travels in the x-direction and then in the y-direction. This also means that it is possible to deadlock these networks.

Figure 4.3. A wormhole routed packet travelling on the network

Figure 4.3 shows how a packet travels from the upper left tile to its destination, first in the x-direction, then in the y-direction. It can also be seen how the packet occupies the channel where it is travelling, thus preventing other packets from using that channel at the same time. A situation where a transmission blocks a channel on the interconnection network can introduce deadlocks [16][17]. It is thus important that the system developer makes sure that a packet can be received and buffered at its destination.

UDN

The User Dynamic Network is a software-accessible packet-switched network. It can be used to implement high-speed inter-process communication between tiles. The developer has to be careful, as this network can be deadlocked [16][17].

IDN

Just as the User Dynamic Network, the I/O Dynamic Network is packet switched and accessible from software. The I/O Dynamic Network can be used to communicate between tiles and between I/O devices. The developer has to be careful, as this network can be deadlocked [16][17].

MDN

Only the cache engine has access to the Memory Dynamic Network. It is used for cache-to-cache and cache-to-memory communication [16][17].

TDN

The Tile Dynamic Network complements the Memory Dynamic Network and is used for cache-to-cache communication. If a tile wants to read from another tile's L2 cache, the request is sent over the TDN and the answer is sent over the MDN. The reason that two networks are used is to prevent deadlocks [16][17].

CDN

The Coherence Dynamic Network is used by the cache coherence hardware to carry cache invalidate messages [16].

STN

The Static Network is not routed dynamically like the other networks. Instead, the switches are configured to provide a static path for point-to-point communication [16][17]. Like the UDN and IDN, the STN is also accessible from software.

4.2.3 Protecting the Network

It might not be desirable that user processes can communicate directly with I/O devices or operating systems on other tiles.
This is prevented by hardware with the so-called Multicore Hardwall [16][17][19]. The hardwall is controlled by a couple of special purpose registers [19]. This makes it possible to control what kind of traffic passes through the switch. A hardwall protection violation may trigger an interrupt routine that takes appropriate action, such as tunneling the traffic to another destination.

4.2.4 Deadlocks

It is possible to trigger a deadlock when using the dynamic networks [17][19]. The developer must therefore be careful when designing the application. The wormhole-routed networks can overflow if the receiver does not take care of incoming packets. When designing a dynamic network protocol it is important to make sure that each tile always empties its own receive buffer and that there is no blocking send operation that stops the receive operation from executing [17][20]. This means that it is also important not to send more packet data than the receiving demux buffer can handle. The memory networks are controlled by hardware, which guarantees that no deadlocks will occur [19].

4.2.5 iLib

The iLib library provides a user API for inter-tile communication over the UDN. It provides socket-like streaming channels and an MPI-like (Message Passing Interface [21]) message passing interface.

Streaming

iLib supports two types of streaming [17]. One is called raw channels and the other is called buffered channels. Raw channels have little overhead and are suitable for software that has high demands on latency. Buffered channels have more overhead but instead support large buffers residing in memory.

Messages

The messaging API is similar to MPI. Any tile can send messages to any other tile without the need to manually set up a communication channel. The API makes sure that messages are received in order and that they are buffered depending on message tag.

4.2.6 Conclusions about the Interconnection Network

The project specification asked the following question: What implications does the iMesh network have on a RTOS? The answer to this question is that there is a risk that the software-managed networks become congested and deadlocked. To prevent this, communication protocols have to make sure that the receiver does not overflow [17][20]. Receive buffers must always be emptied and the sender must know that there are empty buffers before transmitting; a sketch of such a flow-control scheme is given below. The application designer has to take into consideration that the interconnection network can be congested. An application with high demands on determinism of network latencies can be protected with the network protection mechanism described above. This will prevent unwanted communication on parts of the network used by the critical application [17][16]. One other solution is to use the static network for critical applications. As long as the software developer follows these instructions, there should not be any problems using a RTOS on the iMesh.
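To make the flow-control requirement concrete, the sketch below shows a simple credit-based scheme of the kind hinted at above: the sender only transmits when it knows the receiver has buffer space, and the receiver always drains its queue before doing anything that may block. The functions net_send_word() and net_poll_word(), the slot count and the credit marker are hypothetical placeholders, not iLib or hypervisor calls.

/* Credit-based flow control over a dynamic on-chip network (conceptual).
 * The sender starts with RX_SLOTS credits worth of receiver buffer space;
 * every drained packet returns one credit, so the sender can never push
 * more data than the receiver can buffer - the property needed to avoid
 * backing up (and deadlocking) a wormhole-routed network. */

#include <stdint.h>
#include <stdbool.h>

#define RX_SLOTS     16           /* receiver-side buffer slots (assumed)  */
#define CREDIT_TOKEN 0xC0DECAFEu  /* credit-return marker (arbitrary)      */

extern void net_send_word(int dest_tile, uint32_t word);  /* hypothetical */
extern bool net_poll_word(uint32_t *word);                /* hypothetical */

static int credits = RX_SLOTS;    /* sender's view of free receiver slots  */

/* Sender side: transmit one word of payload only if a credit is available. */
bool send_with_credit(int dest_tile, uint32_t payload)
{
    uint32_t msg;

    /* First drain any credit-return messages so the count stays current. */
    while (net_poll_word(&msg))
        if (msg == CREDIT_TOKEN)
            credits++;

    if (credits == 0)
        return false;             /* caller retries later, never blocks    */

    credits--;
    net_send_word(dest_tile, payload);
    return true;
}

/* Receiver side: always empty the incoming queue, then return credits. */
void drain_and_ack(int sender_tile, void (*deliver)(uint32_t))
{
    uint32_t word;

    while (net_poll_word(&word)) {
        deliver(word);                         /* hand payload to the app  */
        net_send_word(sender_tile, CREDIT_TOKEN); /* give one credit back  */
    }
}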
4.3 Memory Hierarchy

The TILEpro64 has a 36-bit shared physical address space which is visible as a 32-bit virtual address space. The memory can be visible and shared among all tiles, or it can be grouped into protected domains. Each tile has a separate 8KB L1 data cache, a 16KB L1 instruction cache and a unified 2-way 64KB L2 cache. The on-tile caches are complemented with the Distributed Shared Cache (DSC), a virtual non-uniform-access L3 cache distributed between all L2 caches. The cache coherence protocol, called Dynamic Distributed Cache (DDC), implements system-wide coherency and has a number of configurations.

4.3.1 Memory Homing

All physical memory on the TILEpro64 can be associated with a home tile. The home tile is responsible for cache consistency for its associated addresses. The memory homing system implements distributed directory-based cache coherency. One use for home tiles is to dedicate the L2 cache to its associated physical memory and let all accesses from all tiles to those addresses go through the home tile's cache. This is how the L3 cache is implemented. The TLBs on each tile not only map virtual addresses to physical ones but also keep track of which home tile a cache line belongs to. There are a couple of strategies for how to configure the home tiles. These strategies can be customized, and to achieve the best performance, software should be optimized with locality in mind [22][23].

Local Homing

This strategy does not use the L3 cache. On an L2 cache miss the DDR memory is accessed directly and the complete page that the accessed data belongs to is cached locally at the accessing tile. This strategy is good when different cores do not share data, because accessing the off-chip memory directly on a cache miss is faster than first trying to read the L2 cache of another tile [24].

Remote Homing

This strategy implements the L3 cache. All physical pages get dedicated home tiles. When an L2 miss occurs, a request is sent to the home tile of the requested memory (which becomes the virtual L3 cache). If a second L2 miss occurs at the home tile, the data has to be fetched from memory. This strategy is good for producer-consumer applications, where the producer can write directly into the consumer's L2 cache [24].

Hashed Homing

This strategy is similar to remote homing, but the difference is that the pages are distributed among tiles at cache-line level, using a hash function. This makes it suitable for applications where instructions and data are shared among several cores. The hashed distribution of memory provides better load balancing on the iMesh and avoids bottleneck situations where many tiles access the same page [24].

Figure 4.4. Example of cache configurations

Figure 4.4 explains the different ways of configuring caching. In the example, Tile 1 and Tile 2 are accessing the same three pages (A, B and C). Page A is configured as local, page B as remote and page C as hashed. Note that the memory sizes in the image are not drawn to scale (the L2 cache is not of the same size as the off-chip memory). Also note that the hashed page is not really hashed in the picture; the picture only tries to show that the page is divided between caches.

4.3.2 Dynamic Distributed Cache

Dynamic Distributed Cache (DDC) is the name of the cache coherency system on TILEpro64. It uses the homing concept to implement distributed directory-based cache coherency [16]. Each home tile is responsible for keeping track of which tiles have a copy of its homed data. The home tile is also responsible for invalidating all copies if a cache line is updated. A small sketch of how an application might select between the homing strategies described above is shown below.
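As a concrete picture of how the three homing strategies differ from an application's point of view, the sketch below allocates three buffers with different homing policies. The allocation wrapper home_alloc() and the policy enum are hypothetical placeholders; Tilera's MDE exposes homing control through its own allocation API, so the real calls and flag names should be taken from the Tilera documentation rather than from this sketch.

/* Hypothetical illustration of choosing a homing policy per allocation.
 * home_alloc() and home_policy_t stand in for whatever the real platform
 * API provides; the point is which policy fits which access pattern. */

#include <stddef.h>

typedef enum {
    HOME_LOCAL,   /* no L3: page cached only at the accessing tile         */
    HOME_REMOTE,  /* one home tile acts as L3 for the whole page           */
    HOME_HASHED   /* cache lines spread over many home tiles by a hash     */
} home_policy_t;

extern void *home_alloc(size_t size, home_policy_t policy, int home_tile);

void configure_buffers(void)
{
    /* Private scratch data, never shared: local homing avoids remote hops. */
    void *scratch = home_alloc(64 * 1024, HOME_LOCAL, -1);

    /* Producer-consumer queue: home it at the consumer tile so the producer
     * effectively writes into the consumer's L2 cache (remote homing).     */
    void *queue = home_alloc(16 * 1024, HOME_REMOTE, /* consumer tile */ 5);

    /* Code and read-mostly tables shared by many tiles: hashed homing
     * spreads the load over the mesh and avoids a single hot home tile.    */
    void *shared_tbl = home_alloc(256 * 1024, HOME_HASHED, -1);

    (void)scratch; (void)queue; (void)shared_tbl;
}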
4.3.3 Conclusions about the Memory Architecture

The question from the problem statement can now be answered: How does the memory hierarchy of TILE64 cope with the demands of a RTOS? It depends on whether many applications access the same memory controller. A bad configuration, where many applications try to access a memory controller or a remote cache at the same time, can congest the memory networks. This will lead to bad performance.

The TILEpro64 processor can use a configuration called memory striping. This configuration splits pages between the four memory controllers, which makes the traffic on the memory networks more evenly distributed. This, combined with a wisely chosen cache configuration, can increase performance. If there are hard requirements on deterministic access times, the application developer may consider letting the critical application have a dedicated memory controller. The memory latencies are deterministic [16]; however, a congested memory network can change this. The developer has to take into account the memory accesses from all applications running in parallel on different tiles [24].

4.4 Tools and Software Stack

Tilera provides a number of host-side tools and a software stack that runs on the hardware. The host-side tool collection provides functionality such as build tools, a functional and cycle-accurate simulator, and debugging and profiling tools. The tile-side software stack is basically a complete software environment including a hypervisor, libraries and a custom-made Linux version.

Hypervisor

The main functionality provided by the hypervisor is booting, loading guest operating systems, managing resources and memory, providing an interface for inter-tile communication, and I/O device drivers [16][24].

Bare Metal Environment

For users with extra demands on performance there is support for a bare metal environment that can be used instead of running on top of the hypervisor [16]. The bare metal environment executes at the same protection level as the hypervisor and provides full access to the hardware.

4.5 Tilera Application Binary Interface

It was necessary to study the Tile processor application binary interface [25] to make the assembler functions in the target interface work together with the C code. The application binary interface specifies things such as data representation and function call convention.

The Tile processor uses byte-aligned data and has a little-endian data representation. This means that the least significant byte in a data item is stored at the lowest address. The compiler-supported data types are described in table 4.1. The register usage convention is described in table 4.2. Caller-saved means that the caller has to save the register values if they are to survive a function call. The values in callee-saved registers have to be preserved by the called function and are thus required to contain the same value on function exit as on function entry.

C type      Size (bytes)   Byte alignment   Machine type
char        1              1                byte
short       2              2                halfword
int         4              4                word
pointer     4              4                word
double      8              8                doubleword
long long   8              8                doubleword

Table 4.1. Supported Data Types

Register   Assembler name   Type           Purpose
0-9        r0 - r9          Caller-saved   Parameter passing / return values
10-29      r10 - r29        Caller-saved
30-51      r30 - r51        Callee-saved
52         r52              Callee-saved   Optional frame pointer
53         tp               Dedicated      Thread-local data
54         sp               Dedicated      Stack pointer
55         lr               Caller-saved   Return address
56         sn               Network        Static network
57         idn0             Network        I/O dynamic network 0
58         idn1             Network        I/O dynamic network 1
59         udn0             Network        User dynamic network 0
60         udn1             Network        User dynamic network 1
61         udn2             Network        User dynamic network 2
62         udn3             Network        User dynamic network 3
63         zero             Dedicated      Always reads as zero

Table 4.2. Register Usage Convention
The stack grows downward and is controlled completely by software. The stack pointer has to be 8-byte aligned. Table 4.3 shows the stack usage convention. If a function requires more arguments than there are dedicated registers, the arguments left without a register are stored on the stack, starting at address SP+8.

Region            Purpose                                                Size
Locals            Local variables                                        Variable
Dynamic space     Dynamically allocated stack space                      Variable
Argument space    If more than 10 arguments, the rest are saved here     Variable
Frame pointer     Incoming sp                                            One word
Callee lr         Incoming lr                                            One word

Table 4.3. Stack Usage Convention

Chapter 5 Software on Many-Core

Parallel software for multi-core has moved from being a subject restricted to scientific and high-performance computing to becoming common in computer systems. Today, standard desktop PCs are shipped with 4 to 8 cores. It is believed that the number of cores per chip will continue to double every 18 months and that within ten years processors will contain as many as 1000 cores [1][5].

Many-core processors have been available for some years. Extracting and managing parallelism in applications on multi-core is a hot subject. There are many tools available and research is active [3]. The system side is still in a somewhat novel state when it comes to many-core CMPs. There exist a couple of research operating systems that address the issues with many-core. The operating systems that have been investigated in this thesis in order to derive requirements for software on many-core are Wentzlaff et al.'s Factored Operating System (fos) [26][27][28] and the Barrelfish operating system [29][30]. These operating systems are designed with scalability as the main requirement, which means that common requirements in the embedded segment, such as real-time capabilities, are not of the highest priority.

This chapter covers some operating system scalability problems and how to counter these problems. The first section considers why the SMP model has problems scaling to many-core processors. The second section investigates the two many-core operating systems mentioned above. Finally, there is a discussion about what design principles and requirements need to be considered when developing operating systems for many-core.

5.1 Scalability Issues with SMP Operating Systems

It is a fairly accepted belief that SMP operating systems do not scale as the number of cores increases. Some studies claim that Linux only scales to about 8 cores [2][26][29]. There are however those who disagree [31][32]. Boyd-Wickizer et al. have been able to make Linux scale to 48 cores by making modifications to the Linux kernel.

5.1.1 Locking the kernel

A simple way to make a kernel SMP safe is to use a so-called big kernel lock. This means that only one thread can enter the kernel at a time. Using a big kernel lock makes it possible for multiple cores to share the same kernel. Since only one thread can execute in the kernel, threads on other cores have to wait if they want kernel access. Operating system designers have countered this problem by replacing the big kernel locks with fine-grained locks [2][31]. Fine-grained locks make the kernel more suitable for multi-core, as the probability that threads on different cores require access to the same resource decreases. Amdahl's law shows us that serial segments of code have a big negative impact on scalability.
This implies that even if the locks have very fine granularity, the shared resources will still become a bottleneck when moving to many-core. Wentzlaff et al. have, with the use of a micro-benchmark, discovered that the physical page allocator in Linux 2.6.24.7 [26], even though it uses fine-grained locks, only scales up to 8 cores. The lock used when balancing the free-page lists between cores turned out to be the bottleneck. One other related problem is that the lock granularity is optimized for a specific number of cores, meaning that the optimal lock granularity for a many-core system may create too much lock overhead on a system with fewer cores [2][31]. It is also possible that the cache coherence protocol congests the core interconnect while invalidating shared cache lines.

Making the kernel more scalable by making the locks more fine-grained can be a difficult task and may introduce errors [2][31]. Forgetting to protect a shared resource or introducing deadlocks are risks when making locks finer grained. Patterns that trigger deadlocks can be difficult to detect.

5.1.2 Sharing Cache and TLBs between Application and OS

The trend is that cache sizes decrease as the core count increases [1]. An operating system where application and operating system services share the same cache and TLB can suffer from bad cache performance due to anti-locality collisions. By measuring cache misses in user and supervisor mode when running an Apache2 server on Linux 2.6.18.8, Wentzlaff et al. show that application and operating system cache interference by anti-locality collisions is sizable and that the operating system is responsible for the majority of the cache misses [26]. This supports a more distributed operating system design, where operating system services and applications do not compete for cache.

5.1.3 Dependency on Effective Cache Coherency

SMP design relies on efficient cache coherency protocols [2][31]. It is doubtful that system-wide cache coherency will scale to many-core. Wentzlaff et al. claim that a difficulty with scaling appears with directory-based cache coherency protocols when using more than 100 cores [26]. Baumann et al. believe that it is a distinct possibility that future operating systems will have to handle non-coherent memory [29].

5.1.4 Scalable SMP Systems

Boyd-Wickizer et al., however, claim that the Linux kernel only required relatively modest changes to become more scalable [31]. This was done by inserting locks with finer granularity and by introducing lock-free protocols. Their conclusion is that it is not necessary to give up traditional operating system design, yet. They say that the bottlenecks are to be found at application level and among shared hardware resources, not in the Linux kernel. They also state that making Linux scalable was dependent on effective cache coherency.

5.2 Operating Systems for Many-Core

As mentioned in the beginning of this chapter, two novel operating systems have been investigated to derive requirements for a many-core operating system. The common feature of these operating systems is that they propose a distributed operating system design where shared resources are kept to a minimum and communication is done with message passing. Barrelfish has also put extra emphasis on hardware diversity, using a system knowledge database (SKB) that is supposed to provide extra support for optimization on heterogeneous systems [33]. Before describing the two systems, a small sketch below illustrates the kind of shared-state bottleneck discussed in section 5.1.
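The sketch contrasts a single lock-protected counter, where every core contends for the same lock and cache line, with per-core counters that are only combined when the value is read. It is an illustrative toy example, not code from Linux or from the systems studied here, and the 64-byte cache-line size is an assumption.

#include <pthread.h>
#include <stdint.h>

#define MAX_CORES 64

/* Variant 1: one shared counter behind one lock. Every increment from every
 * core serializes on the same lock and bounces the same cache line around,
 * which is exactly the shared-resource bottleneck described above.        */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t shared_count;

void count_shared(void)
{
    pthread_mutex_lock(&lock);
    shared_count++;
    pthread_mutex_unlock(&lock);
}

/* Variant 2: one counter per core, padded to its own cache line so cores
 * never write to the same line. Increments are independent; the cost is
 * that reading the total requires summing over all cores.                 */
struct percpu_counter {
    uint64_t value;
    char     pad[64 - sizeof(uint64_t)];   /* assume 64-byte cache lines */
};
static struct percpu_counter per_core[MAX_CORES];

void count_local(int core_id)
{
    per_core[core_id].value++;     /* no lock, no shared cache line */
}

uint64_t read_total(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < MAX_CORES; i++)
        sum += per_core[i].value;  /* an approximate snapshot is acceptable */
    return sum;
}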
5.2.1 Design Principles: Factored Operating Systems

The Factored Operating System (fos) [28] is designed by Wentzlaff et al. and has scalability as its main requirement. The main principle behind fos is that operating system services are dedicated and distributed among cores, inspired by distributed web servers. fos distributes deep kernel services such as physical page allocation, scheduling, memory management and hardware multiplexing. fos is spatially aware and takes locality into account when distributing the servers.

Microkernel Providing Location Transparent Communication

fos consists of three main components: a micro-kernel, system service servers (referred to as the OS layer) and applications. Operating system services and applications never execute on the same core. Both operating system services and applications are referred to as clients, and the micro-kernel does not differentiate between them. The micro-kernel takes care of resource management, implements a machine-dependent communication infrastructure and a name server cache, and provides an API for spawning processes. The micro-kernel can allocate receive mailboxes which can be used by the clients to publish and access services. There is a local name server cache that contains mappings to services located on different cores. The distributed name server that resides in the OS layer provides load balancing between service fleet servers.

Fleets of Distributed Servers

Each function-specific service belongs to a fleet of distributed cooperating servers. A server is locked to one core and communication between servers is done with message passing. A caller uses the micro-kernel name cache or the name server service to find the closest server providing a specific service. Applications can execute on one or several cores and may use shared memory for communication. Cache coherency is not believed to scale system-wide, but it may be utilized effectively by an application implemented with a shared memory model that executes on a couple of cores. All this makes fos resemble a distributed web server, drawing on techniques such as replication with data consistency and spatially aware distribution.

5.2.2 Design Principles: Barrelfish

Barrelfish is designed on three main principles [29]. All inter-core communication shall be explicit, and the operating system structure shall be hardware-neutral. The final principle is that state shall never be shared between cores; instead it shall be replicated and kept consistent with agreement protocols. One difference from fos is that Barrelfish does not only have scalability as a main requirement but also portability [33].

Main Components

Barrelfish consists of two main components, the CPU driver and the monitor. The CPU driver is private to each core and executes in kernel space. The monitor executes in user space and is responsible for coordination between cores. Together they provide the typical functionality of a micro-kernel: scheduling, communication and resource allocation. Other functionality provided by the operating system, such as device drivers, network stacks and memory allocators, executes in user space.

The CPU Driver

The CPU driver takes care of protection and time slicing of processes and provides the target interface. It is private to each core and thus does not share any state. Hardware interrupts are demultiplexed by the CPU driver and delivered to destination processes. It also provides a medium for asynchronous split-phase messages between local processes.
5.2.2 Design Principles: Barrelfish

Barrelfish[30] is designed on three main principles[29]. All inter-core communication shall be explicit and the operating system structure shall be hardware neutral. The final principle is that state shall never be shared between cores; instead it shall be replicated and kept consistent with agreement protocols. One difference from fos is that Barrelfish does not only have scalability as a main requirement but also portability[33].

Main Components

Barrelfish consists of two main components, the CPU driver and the monitor. The CPU driver is private to each core and executes in kernel space. The monitor executes in user space and is responsible for coordination between cores. Together they provide the typical functionality of a micro-kernel: scheduling, communication and resource allocation. Other functionality provided by the operating system, such as device drivers, network stacks and memory allocators, executes in user space.

The CPU Driver

The CPU driver takes care of protection and time slicing of processes, and provides the target interface. It is private to each core and thus does not share any state. Hardware interrupts are demultiplexed by the CPU driver and delivered to destination processes. It also provides a medium for asynchronous split-phase messages between local processes.

The Monitor

The monitor is a schedulable user space process responsible for coordinating the system-wide state. Resources are replicated and the monitor is responsible for keeping the replicated data structures consistent with an agreement protocol. Processes that want to access the global state need to go through the local monitor to access a remote copy of the state. The monitor is also responsible for inter-process communication between different cores. All virtual memory management is done in user space and the monitors are responsible for keeping global resources such as page tables consistent.

IPC with Message Passing

All inter-core communication is done with message passing. The message passing interface abstracts the communication medium and provides transparency to the communicators. The implementation is hardware specific and uses the hardware interconnection when possible.

Processes Represented by Dispatcher Objects

Processes are represented by special dispatcher objects that exist on each core on which the process will execute. The dispatchers are scheduled by the CPU driver. A dispatcher contains a thread scheduler and a virtual address space. All operating system software uses replication instead of shared resources. However, user-space processes are free to use a shared memory model for parallel applications.

Considering Hardware Diversity

Barrelfish puts extra emphasis on running efficiently on heterogeneous systems with few modifications[33]. It uses a system knowledge database that provides user applications with knowledge about the underlying hardware. This information is supposed to be used for run-time optimization.

5.2.3 Conclusions from Investigating fos

Wentzlaff et al. say that one advantage of moving to a distributed operating system is that, by making communication inside the operating system explicit with message passing, it is no longer necessary to search for shared memory bottlenecks and locks[26]. Another advantage is that system services implemented as distributed servers (where the number of servers increases with the core count) scale in the same manner as distributed web servers. A third advantage of the distributed design is that system services do not have to share cores with applications. Applications can, instead, utilize the services with remote calls.

The cost of dynamic messages is on the order of 15-45 cycles on the RAW and Tilera processors. Context switches on modern processors typically have a much higher latency in cycles [26]. As mentioned earlier, cores are expected to get smaller as the core count grows[1]. This, and the fact that embedded systems typically have very strict demands on software overhead, has led to the development of message passing APIs with small software overhead, such as MCAPI[34] and rMPI[35].

The authors of fos state that it is their belief that the number of running threads in a system will be of the same order as the number of cores. This means that load balancing will occur less often. More cores also means that the need for time multiplexing of resources will decrease. To achieve good performance on many-core systems, the placement of processes will instead be a bigger issue. Finally, Wentzlaff et al. think that by using an explicit message passing model, the operating system designer will encourage application developers to think more carefully about what data is being shared[26].

5.2.4 Conclusions from Investigating Barrelfish

Baumann et al.
argue that operating system design has to be rethought using ideas from distributed systems. They suggest minimal inter-core sharing and that OS functionality should be distributed among cores, communicating with message passing. The Barrelfish design says that shared resources should be controlled by servers and that clients then have to perform some kind of RPC to access the shared resource. With a micro-benchmark they show that a client-server message passing model scales better with the number of cores than a shared memory model when updating a data structure[29]. The reason for this is that when using the shared memory model, the cores get stalled because of invalidated cache lines. When using the message passing model, the delay is proportional to the number of cores accessing the server.

The Barrelfish design also considers the risk that effective cache coherence will not scale with increased core count[29]. This favors the message passing model.

There are existing point solutions to the problems that come with the shared memory model[31][36][37]. Scalable shared memory software in high performance computing has tackled this by fine-tuning lock granularity and the memory layout of shared data structures[3]. This means that the developer has to be careful about how the data is encoded on the particular platform and how the cache-coherence protocol is implemented. Examples of such concerns are whether a specific underlying implementation stores arrays of data row-wise or column-wise, the size of the cache lines and what kind of cache coherency protocol is used. This is an argument supporting the idea that implicit communication with shared memory should be replaced with an explicit message passing model, encouraging the developer to create a parallel design which is less platform dependent.

5.3 Conclusions

This section contains the overall conclusions about operating system design for many-core. These conclusions are for a scalable operating system that appears as a single-image system to the executing processes. There are other ways to utilize the performance provided by many-core that are not covered in this thesis (one example would be to run parallel AMP systems on top of a hypervisor[2]). Finally, there is a subsection discussing Enea OSE on many-core.

5.3.1 Distributed Architectures are Scalable

Distribute OS Services

To be able to scale on many-core processors, operating systems should aim at a distributed design. This means that the services provided by a standard monolithic operating system (such as a file system server, physical memory allocator, name server or an IP stack) should be distributed to dedicated cores. These services should also have an internal distributed design, where a spatially distributed fleet of servers provides the specific service to requesting processes. Server fleets should be able to grow or shrink depending on demand (maybe not dynamically, but at least during static configuration). These fleets should use a replicated state to be scalable. Consistency should be maintained with a state-of-the-art agreement protocol.

A Microkernel Provides Inter-Core Communication

The distributed OS services should run on top of a microkernel that provides location transparent communication. There should be a distributed name server that keeps track of OS services and application services. The distributed name server makes sure that a request is always serviced by the most appropriate server. There should be a name cache in the microkernel to increase performance.
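To illustrate the lookup path just described, the sketch below (plain C; the table contents and the lookup() helper are invented for this example and are not OSE, fos or Barrelfish code) resolves a service name against a per-core name cache first and only falls back to the name server table on a miss, caching the answer for the next request.

#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4

struct binding { char service[32]; int core; };

/* Name server table: which core hosts which service (toy data). */
static const struct binding name_server[] = {
    { "page_allocator", 12 },
    { "file_system",    27 },
    { "ip_stack",       40 },
};

/* Per-core name cache, filled on demand. */
static struct binding name_cache[CACHE_SLOTS];
static int cache_used;

static int lookup(const char *service)
{
    /* 1. Try the local name cache first. */
    for (int i = 0; i < cache_used; i++)
        if (strcmp(name_cache[i].service, service) == 0)
            return name_cache[i].core;

    /* 2. Miss: ask the (remote) name server and cache the answer. */
    for (size_t i = 0; i < sizeof name_server / sizeof name_server[0]; i++) {
        if (strcmp(name_server[i].service, service) == 0) {
            if (cache_used < CACHE_SLOTS)
                name_cache[cache_used++] = name_server[i];
            return name_server[i].core;
        }
    }
    return -1; /* unknown service */
}

int main(void)
{
    printf("page_allocator served by core %d\n", lookup("page_allocator"));
    printf("page_allocator served by core %d (cache hit)\n", lookup("page_allocator"));
    printf("ip_stack served by core %d\n", lookup("ip_stack"));
    return 0;
}

In a real system the fallback step would be a message to the name server fleet over the interconnect, and the cache would need an invalidation policy when servers move; both are left out here to keep the lookup path visible.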
5.3.2 One Thread - One Core

It is to be expected that cores become weaker as the core count grows[1]. That means that caches will be smaller. This implies that operating system services and application processes executing on the same core will suffer performance loss due to anti-locality cache collisions. The same could be said about multitasking. The cores will be weak but there will be lots of them. This means it will be possible to let each thread have a dedicated core. Fast interconnection, better cache performance and no overhead from context switching make this a good choice.

Instead of time slicing threads on one core, the big task will instead be the placement of threads. Finding the optimal placement will be necessary to get the desired performance. If threads have a good initial placement, there will be little need for migration, and one thread per core means there will be no need for load balancing.

5.3.3 IPC with Explicit Message Passing

Utilizing the performance of multi-core and many-core processors requires parallel software design[3][2]. Shared memory models for parallel programming do not scale well on many-core because of the lack of system-wide cache coherence. The papers behind Barrelfish and fos state that explicit message passing should be used for IPC[26][29]. Enforcing a programming model based on explicit message passing means that threads will not be designed around shared resources. This makes the applications scale better. The message passing should preferably be implemented using the fast interconnection networks of many-core processors[2].

The idea behind making the communication explicit is to make the software less platform dependent. Implicit communication is more platform dependent because the developer has to keep in mind how the underlying software and hardware handle data structures[26][29]. One example is when a developer is working with a shared matrix. If the communication mechanism is implemented with shared memory, the developer has to adapt the solution to how the matrix is stored by the underlying implementation and memory engine.

5.3.4 Example of a Many-Core OS

This section aims to clarify the conclusions with an example. Figure 5.1 shows an example of what an operating system for many-core could look like. The cores are placed in an 8x8 mesh. The gray tiles are application threads; the green and the red tiles are operating system services (observe that the servers have no optimal placement, this is just an example). Both applications and OS services are running on top of a microkernel that provides a location transparent communication interface.

If an application wants an OS service, it does a standard system call. The microkernel realizes that the requested service is located on another core. The microkernel forwards the remote call over the interconnection network to the most appropriate server. At the destination tile, the call is delivered by the microkernel to the OS process providing the service. The response is sent back to the requesting application process in the same way, over the interconnection network. This was a very simple example that explains the thoughts behind the design of a distributed operating system executing on a many-core processor.

Figure 5.1. Example of Many-Core OS
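As a concrete instance of the platform dependence discussed in section 5.3.3, the sketch below (plain C, not from the thesis) sums the same matrix twice: first following the row-major layout that C uses, then against it. On most cached machines the second loop is noticeably slower, and this is exactly the kind of layout detail a shared memory design forces the developer to know about, while an explicit message passing design hides it behind the messages.

#include <stdio.h>
#include <time.h>

#define N 2048

static double m[N][N];   /* C stores this row by row (row-major) */

int main(void)
{
    double sum = 0.0;
    clock_t t0, t1, t2;

    t0 = clock();
    for (int i = 0; i < N; i++)        /* row-major walk: sequential memory */
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    t1 = clock();
    for (int j = 0; j < N; j++)        /* column-major walk: strided memory */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    t2 = clock();

    printf("sum=%f row-major: %.2fs column-major: %.2fs\n", sum,
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}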
5.3.5 Enea OSE and Many-Core

Enea OSE is described in chapter 3. OSE executes on top of a microkernel that can be extended with different core services. IPC is done with message passing and can be location transparent if used together with Enea LINX[11]. This fits well with the conclusions made about operating systems for many-core. The overall conclusion about OSE and many-core is that the OSE architecture fits well with the requirements stated in the fos and Barrelfish papers. Porting OSE to a many-core processor is thus very well motivated and will, with high probability, be an interesting research base for software on many-core at Enea.

Chapter 6

Porting Enea OSE to TILEpro64

The subject covered by this chapter is the porting of OSE to TILEpro64. Porting OSE to the Tilera processor was already stated as a goal in the project specification. This decision was strengthened when the pre-study showed that OSE might be very well suited for many-core. When the specification was written, it was difficult to estimate how far the porting would get within the time scope of this thesis. It was known that a good understanding of the OSE architecture, the Tilera hardware and the tools that had to be used was required. Therefore, the project specification stated: "It is not expected that the complete porting of OSE ME will be done during 10 weeks, however a foundation for further thesis projects shall be achieved." This was, without any doubt, achieved. Right now, there is a limited but working single-core version of OSE executing on the TILEpro64 processor.

The following sections cover subjects like early design decisions, the method used during implementation, a description of the implemented parts and, finally, a verification of the results. The work methodology used within each milestone is described. These sections aim at being target generic and are meant to aid future thesis workers with similar projects.

6.1 Milestones

The porting process is an incremental task. The problem complexity is quite significant, so it was necessary to divide the problem into smaller tasks. The project was divided into a couple of milestones to ease the work of extracting the most important tasks.

6.1.1 Milestone 1 - Build environment

Since TILEpro64 is a completely new architecture with a new ISA, a new build environment has to be created. This is necessary in order to continue with the porting process. It shall be possible to edit OSE libraries and use Omnimake to build them with Tilera's GCC port.

6.1.2 Milestone 2 - Launch OSE and write into a Ramlog

Link the included libraries and make a final build with a Coresys. There should also be a configured simulator environment that can be used to test the final build. Arrive at the point where OSE is able to write into the ramlog.

6.1.3 Milestone 3 - Get OSE into a safe state

Get a basic single-core version of OSE up and running. This version is able to write into a log in RAM memory that can be accessed from GDB. There is no driver for serial communication or the chip timer yet, which means that the only way to see output from the operating system is via the ramlog, and it is not possible to provide any input at run-time. No timer means that timer processes are not supported, and no system calls that rely on time are supported either. The scheduler still works in this configuration, as long as only event-driven processes are used.

6.1.4 Milestone 4 - Full featured single-core version of OSE on TILEpro64

This milestone requires a timer and a UART device driver.
MMU support can be implemented but this is optional.

6.1.5 Milestone 5 - Full featured multi-core version of OSE on TILEpro64

Milestone 5 requires a multicore bootstrap and an IPI driver. Preferably, hardware MMU support should be available (this is needed if the features of the memory hierarchy are to be utilized). Milestone 5 could also be extended to include utilization of the interconnection network for IPC by adding support in LINX.

6.2 MS1 - Build Environment

To be able to start working with the actual porting, it was necessary to do some preparatory work. A new target called tilepro was added to the internal build environment. This section briefly describes what had to be done.

6.2.1 Omnimake

Omnimake[38] is the make and build system for OSE source code. It is used by Enea to build their product components. Creating a new configuration for the Tilera architecture was the first step.

6.2.2 Requirements and Demonstration

Requirements

The milestone-specific requirement is listed in table 6.1.

Table 6.1. MS1 - Requirements
Nr.  Priority  Description
1    1         It shall be possible to build OSE core libraries for TILEpro64.

Demonstration

The requirement is verified with use cases.

Use Case 1
1. Make a change to a source file in the CRT component.
2. Build the CRT component.
3. Confirm that the component library was built by looking in OSEROOT/system/lib.

Expected Outcome
The library is compiled and can be found in OSEROOT/system/lib.

Result of Use Case 1
Status: PASSED

Use Case 2
1. Make a change to a source file in the CORE component.
2. Build the CORE component.
3. Confirm that the component library was built by looking in OSEROOT/system/lib.

Expected Outcome
The library is compiled and can be found in OSEROOT/system/lib.

Result of Use Case 2
Status: PASSED

6.2.3 Work Approach

The following describes how MS1 was achieved.
1. Create a configuration for the new compiler in Omnimake.
2. Choose a suitable reference architecture. Try to find a reference architecture with an ISA similar to Tilera's. By using a similar ISA, mapping the reference architecture onto the target architecture will be a much easier task.
3. Add library-specific build configurations for the new target.
4. Remove all target-specific source code until it is possible to build all desired libraries.

6.3 MS2 - Coresys

The final system is called Coresys. The difference between a Coresys and a Refsys is that a Coresys is a minimal build for testing, whereas a Refsys is a full-featured OSE monolith. The Coresys only contains the OSE core functionality. It also includes a few optional core extensions like the Run-Time Loader for ELF or the Console library. Like a Refsys, the Coresys is responsible for linking all the desired libraries, setting the OSE configuration parameters and creating a final executable.

6.3.1 Implemented Parts

Milestone 2 was more about configuring than coding. The OSE entry code was the only produced source code artifact.

OSE Entry Code

The TILEpro64 start code has been developed. It is specified as the OSE ELF entry point and handles initialization of the read-write data and BSS segments. It also calls the main() function. The entry code can be considered as an extension to the compiler.
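For readers unfamiliar with entry code, the sketch below shows the general shape of such a start routine in C: copy the initialized data image to its run-time address, clear the BSS segment and hand over to main(). It is a generic bare-metal illustration, not the actual OSE entry code, and the linker symbols (__data_load, __bss_start and so on) are placeholder names assumed to be provided by a linker script. On a real target the stack pointer is typically set up in assembler before code like this can run.

/* Generic bare-metal entry sketch (illustration only, not the OSE code). */
#include <stdint.h>
#include <string.h>

/* Addresses assumed to come from the linker script (placeholder names). */
extern uint8_t __data_load[];   /* load address of initialized data in the image */
extern uint8_t __data_start[];  /* run-time address of initialized data          */
extern uint8_t __data_end[];
extern uint8_t __bss_start[];
extern uint8_t __bss_end[];

extern int main(void);

void _start(void)
{
    /* 1. Copy the read-write data segment to its run-time address. */
    memcpy(__data_start, __data_load, (size_t)(__data_end - __data_start));

    /* 2. Zero the BSS segment so uninitialized globals read as 0. */
    memset(__bss_start, 0, (size_t)(__bss_end - __bss_start));

    /* 3. Hand over to the operating system / application. */
    main();

    for (;;)
        ;   /* main() should not return on a bare-metal target */
}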
6.3.2 Design Decisions

The initial intention was to do a para-virtualization on top of Tilera's hypervisor. The reason for this was that it is easier to write device drivers that interface against the hypervisor than drivers that interface against the hardware. This, however, turned out to be a bad decision. Studying the hypervisor was not within the time scope of this thesis, and in the long term, running on top of the hypervisor was not desirable.

Instead of running on top of the hypervisor, the decision was taken to run the OS as a bare metal application. That means that the OS executes on the same protection level as the hypervisor and has direct access to all hardware. The bare metal environment offers a run time that can be accessed through the bare metal API. However, I chose not to utilize this. It might have come in handy when developing a console driver, but not when developing the target interface. In this thesis I only configure the OS as a BME to get it up and running. With only small changes to the target interface, OSE is capable of installing interrupt vectors and setting up a default MMU map by itself, so there is no real need for the API.

6.3.3 Requirements and Demonstration

A couple of milestone-specific requirements are listed in table 6.2.

Table 6.2. MS2 - Requirements
Nr.  Priority  Description
2    1         The Coresys shall be able to do the final linking and create an ELF.
3    1         The OSE init code shall start to execute in the simulator.
4    1         The init code shall be able to write into the ramlog.

Demonstration

The requirements are verified by use cases.

Use Case 3
1. Start to build the Coresys.
2. If there are no error messages, verify that tilepro.elf exists in obj/.

Expected Outcome
OSE is linked and the tilepro.elf binary is generated in obj/.

Result of Use Case 3
Status: PASSED

Use Case 4
1. Modify the OSE init code: add a ramlog print early in the init code, followed by a breakpoint.
2. Build the libraries and the Coresys.
3. Start the simulator with a configuration to run OSE as a bare metal application by running the make script in simmake/.
4. When the breakpoint is reached and GDB has started, dump the ramlog to a file and verify that your print was added.

Expected Outcome
The text was printed into the ramlog.

Result of Use Case 4
Status: PASSED

6.3.4 Work Approach

The following describes how MS2 was achieved.
1. Choose a reference Coresys.
2. Investigate how the target platform bootstrap handles ELF.
3. If required by the target, make the necessary changes in the linker script.
4. Write the entry code.

6.4 MS3 - Get OSE into a safe state

Milestones 1 and 2 were more about understanding the provided tools and the structure of OSE. Milestone 3 is more about the internal architecture of OSE and also contains most of the produced source code artifacts.

6.4.1 Design Decisions

The new hardware abstraction layer strictly follows the internal architecture of the reference architecture and the overall internal architecture of OSE. There was not much room for introducing new design; a lot of time had to be put into learning the software and hardware. There were, however, two decisions to be taken about possible constraints on the port.

One design decision that was made was to implement a native C run time. The first thought was to use the C run-time library that the soft kernel uses. A configuration that only executes in supervisor mode would allow this solution. However, a target port was implemented instead.
The reason behind this decision was that this has to be done if the OS shall support both user and supervisor mode. Another reason was that a native approach worked better together with the target interface of the reference architecture.

It was an early decision not to implement any MMU support. The reason for this was that OSE can be configured to run without an MMU and there was no specified requirement about protection. Implementing MMU support can also be considered a significant task, especially in the TILEpro64 case.

6.4.2 Implemented Parts

The lowest layer in the OSE architecture is called the hardware abstraction layer. The hardware abstraction layer provides a target interface to the higher layers. Most of the implemented parts reside in the hardware abstraction layer, but there are also parts of the CRT that are hardware dependent.

Target Interface

The most important parts of the target interface are the functionality for creating, storing and restoring a process context. The interrupt vectors and the trap code are implemented in the target interface. Some parts of the target interface are implemented in C and some parts are written in pure assembler. Some functionality can be implemented in either C or assembler, but the functionality that the target interface implements is called a lot by the higher layers of the OS, so it can be wise to implement it in assembler for performance reasons. Interrupt vectors also have high demands on performance and text size, which may leave assembler as the only choice. The OSE SPI specifies some architecture-dependent functionality, like atomic operations and CPU access functionality such as disabling interrupts or register-manipulating functions. These are all implemented in the target interface.

Figure 6.1. Location of the target interface in the OSE architecture
Figure 6.2. Location of the C run time in the OSE architecture

CRT - C Run Time

The OSE C run-time library is implemented in the kernel component called CRT. This library has architecture-dependent functionality that had to be implemented. This was all done in assembler. The code produced in this thesis implements C run-time initialization, system calls and memory-manipulating functionality.

BSP - Board Support Package

There was not enough time to implement any device drivers. The most important drivers that have to be implemented would be a console driver and a timer driver. A driver for IPI would also be necessary when implementing multi-core functionality. However, stubbed dummy drivers are provided in the BSP to ease further development. The BSP also contains some target-specific initialization, such as setting up a static MMU map and enabling caches.
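To give a feel for what "creating, storing and restoring a process context" involves, the sketch below uses the POSIX ucontext API on a host machine to create a second context and switch back and forth to it. It is only a conceptual illustration of the mechanism; the real target interface performs the corresponding register save and restore in TILEpro assembler according to the Tilera ABI.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, proc_ctx;
static char proc_stack[64 * 1024];      /* stack for the second context */

static void demo_process(void)
{
    printf("demo process: running\n");
    swapcontext(&proc_ctx, &main_ctx);  /* store own context, restore main */
    printf("demo process: resumed\n");
}

int main(void)
{
    /* Create a context: its own stack, entry point and return context. */
    getcontext(&proc_ctx);
    proc_ctx.uc_stack.ss_sp = proc_stack;
    proc_ctx.uc_stack.ss_size = sizeof proc_stack;
    proc_ctx.uc_link = &main_ctx;       /* where to continue when it returns */
    makecontext(&proc_ctx, demo_process, 0);

    printf("dispatcher: switching to demo process\n");
    swapcontext(&main_ctx, &proc_ctx);  /* store main, restore demo */
    printf("dispatcher: back, switching again\n");
    swapcontext(&main_ctx, &proc_ctx);  /* resume demo where it stopped */
    printf("dispatcher: demo process finished\n");
    return 0;
}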
6.4.3 Requirements and Demonstration

Requirements

A couple of milestone-specific requirements are listed in table 6.3.

Table 6.3. MS3 - Requirements
Nr.  Priority  Description
5    1         It shall be possible to spawn processes.
6    1         It shall be possible to switch processes.
7    1         There shall be working IPC.
8    1         It shall be possible to do system calls.

Demonstration

The requirements are verified with a use case.

Use Case 5
1. Add two processes to the BSP. They should communicate with each other using send and blocking receive, and write into the ramlog what they are doing. Add a breakpoint at the end of one of the processes (make sure that the breakpoint is actually reached).
2. Make sure your processes are created and started in bspStartOseHook2.
3. Start OSE in the simulator.
4. When the breakpoint is reached and GDB is launched, dump the ramlog to a file.
5. Read the ramlog to verify that your processes are working.

Expected Outcome
The desired text has been printed into the ramlog, showing that OSE is in a safe state and that IPC and the dispatcher work.

Result of Use Case 5
Status: PASSED. See appendix A for the demonstration application source code and the application output showing that requirement nr. 4 has been met.

6.4.4 Work Approach

During milestone 3 it was necessary to map the reference architecture to the Tilera architecture and make the required changes. This meant comparing their ISAs and ABIs and then implementing assembler and hardware-dependent data structures.
1. Execute OSE in the simulator and implement functions when they are needed.

Chapter 7

Conclusions, Discussion and Future Work

The initial intention with this thesis was to create a project foundation for the MANY project: first, by creating a picture of current many-core research and deriving the main requirements for operating systems on many-core, and then by porting OSE to the TILEpro64 processor.

7.1 Conclusions from the Theoretical Study

The pre-study covered two research operating systems: the Factored Operating System and Barrelfish. They both looked at distributed operating systems and distributed web servers for inspiration. They were both aiming at a design where the services provided by the operating system are distributed and avoid sharing cores with user processes. Resources shared between cores shall be kept to a minimum and communication, both on OS and user level, is preferably done with explicit message passing.

The architecture of OSE was investigated. Because of its distributed design and message passing programming model, OSE turned out to fulfill the requirements stated by the research operating system papers. This meant that it was a good idea to continue with the porting task instead of going deeper into the theory behind software on many-core.

7.2 Results and Future Work

Together with this report, a working copy of OSE has been delivered to Enea AB. This thesis has also delivered a complete build environment for Tilera's architectures. The OSE version that has been tested on the Tilera MDE functional simulator is a single-core system that is able to get into a safe state where processes can be scheduled and executed.

7.2.1 Future Work - Theoretical

On the application side there are many subjects that can be investigated, such as tools for parallel computing and programming models. This thesis has focused on how to make operating systems scalable, and the suggestions about future research areas are therefore also related to the operating system aspects.

Virtualization on Many-Core

The legacy software of today will, of course, also exist in the future. Running legacy software on many-core may require virtualization. Wentzlaff et al. even believe that it will be a requirement of future architectures that they shall be able to execute the x86 architecture efficiently as an application[39]. Scalable dynamic virtual machines that execute on many-core are a very interesting research area in my opinion. Another way to utilize the performance of many-core processors, without using a very scalable single-image operating system, is to use a hypervisor and provide an AMP environment. Hypervisors on many-core, and especially Enea's hypervisor, are a very interesting subject.
7.2.2 Future Work - Implementation

Five milestones were stated for the porting process. Milestones 1-3 were completed. Milestone 2 took much longer than estimated. The reason behind this was the wrong decision to do a para-virtualization. Learning how to launch OSE on top of Tilera's hypervisor was very time consuming. When the decision was made to run OSE as a bare metal application, much time had already passed; thus, reaching and demonstrating milestone 3 became the final goal for the implementation part of this thesis.

Milestone 4 - Full featured single-core version of OSE on TILEpro64

This milestone was described in the previous chapter. Working console and timer device drivers have to be implemented. Because of lack of time, this was not completed.

Milestone 5 - Full featured multi-core version of OSE on TILEpro64

Milestone 5 is also described in the previous chapter. This is the long-term goal of the project, with a working, full-featured multicore SMP operating system.

Bibliography

[1] A. Agarwal and M. Levy, “The kill rule for multicore,” in Proceedings of the 44th annual Design Automation Conference, DAC ’07, (New York, NY, USA), pp. 750–753, ACM, 2007.
[2] J. Svennebring, J. Logan, J. Engblom, and P. Strömblad, Embedded Multicore: An Introduction.
[3] C. Kessler and J. Keller, “Models for parallel computing: Review and perspectives,” Dec 2007.
[4] “Itea2 - many.” http://www.itea2.org/project/index/view/?project=10090, 2011.
[5] G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling, and A. Agarwal, “Atac: a 1000-core cache-coherent processor with on-chip optical network,” in Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT ’10, (New York, NY, USA), pp. 477–488, ACM, 2010.
[6] “Openmp.org.” http://openmp.org/wp/, 2012.
[7] “Wool.” http://www.sics.se/projects/wool, 2012.
[8] “The cilk project.” http://supertech.csail.mit.edu/cilk/, 2010.
[9] “Intel cilk plus.” http://software.intel.com/en-us/articles/intel-cilk-plus/, 2012.
[10] Enea, Enea OSE Core User’s Guide. Rev. BL140702.
[11] Enea, Enea OSE Architecture User’s Guide. Rev. BL140702.
[12] Enea, OSE Application Programming Interface Reference Manual. Rev. BL140702.
[13] Enea, OSE System Programming Interface Reference Manual. Rev. BL140702.
[14] P. Strömblad, “Enea multicore: high performance packet processing enabled with a hybrid smp/amp os technology.” Enea White Paper, 2010.
[15] Enea, OSE Device Drivers User’s Guide. Rev. BL140702.
[16] Tilera, Tile Processor Architecture Overview for the TILEpro Series. UG120-Rel. 1.7 (28 May 2011), http://www.tilera.com/scm/docs/index.html.
[17] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” IEEE Micro, vol. 27, pp. 15–31, September 2007.
[18] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “The raw microprocessor: A computational fabric for software circuits and general-purpose programs,” IEEE Micro, vol. 22, pp. 25–35, March 2002.
[19] Tilera, Tile Processor User Architecture Manual. UG101-Rel. 2.4 (3 May 2011), http://www.tilera.com/scm/docs/index.html.
[20] Tilera, Tile Processor I/O Device Guide. UG104-Rel. 1.7 (29 Mar 2011), http://www.tilera.com/scm/docs/index.html.
[21] “Message passing interface at the open directory project.” http://www.dmoz.org/Computers/Parallel_Computing/Programming/Libraries/MPI/, 2012.
[22] D. Ungar and S. S. Adams, “Hosting an object heap on manycore hardware: an exploration,” in Proceedings of the 5th symposium on Dynamic languages, DLS ’09, (New York, NY, USA), pp. 99–110, ACM, 2009.
[23] I. Choi, M. Zhao, X. Yang, and D. Yeung, “Experience with improving distributed shared cache performance on tilera’s tile processor,” IEEE Computer Architecture Letters, 2011.
[24] Tilera, Multicore Development Environment Optimization Guide. UG105-Rel. 2.4 (6 Jun 2011), http://www.tilera.com/scm/docs/index.html.
[25] Tilera, Application Binary Interface. UG213-Rel. 3.0.1.125620 (9 Apr 2011), http://www.tilera.com/scm/docs/index.html.
[26] D. Wentzlaff and A. Agarwal, “Factored operating systems (fos): the case for a scalable operating system for multicores,” SIGOPS Oper. Syst. Rev., vol. 43, pp. 76–85, April 2009.
[27] D. Wentzlaff, C. Gruenwald III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal, “An operating system for multicore and clouds: mechanisms and implementation,” in Proceedings of the 1st ACM symposium on Cloud computing, SoCC ’10, (New York, NY, USA), pp. 3–14, ACM, 2010.
[28] “Carbon research group.” http://groups.csail.mit.edu/carbon/, 2012.
[29] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, “The multikernel: a new os architecture for scalable multicore systems,” in Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, SOSP ’09, (New York, NY, USA), pp. 29–44, ACM, 2009.
[30] “The barrelfish operating system.” http://www.barrelfish.org, 2012.
[31] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich, “An analysis of linux scalability to many cores,” in Proceedings of the 9th USENIX conference on Operating systems design and implementation, OSDI’10, (Berkeley, CA, USA), pp. 1–8, USENIX Association, 2010.
[32] G. Almaless and F. Wajsbürt, “Almos: Advanced locality management operating system for cc-numa many-cores,” 2011.
[33] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs, “Embracing diversity in the barrelfish manycore operating system,” MMCS’08, Boston, Massachusetts, USA, 2008.
[34] J. Holt, A. Agarwal, S. Brehmer, M. Domeika, P. Griffin, and F. Schirrmeister, “Software standards for the multicore era,” IEEE Micro, vol. 29, pp. 40–51, May 2009.
[35] J. Psota and A. Agarwal, “rmpi: message passing on multicore processors with on-chip interconnect,” in Proceedings of the 3rd international conference on High performance embedded architectures and compilers, HiPEAC’08, (Berlin, Heidelberg), pp. 22–37, Springer-Verlag, 2008.
[36] “Linux scalability effort.” http://lse.sourceforge.net/.
[37] “Mark Russinovich: Inside Windows 7.” http://channel9.msdn.com/shows/Going+Deep/Mark-Russinovich-Inside-Windows-7/.
[38] Enea, OSE 5 Source Getting Started. Rev. BL140702B L150602.
[39] D. Wentzlaff and A. Agarwal, “Constructing virtual architectures on a tiled processor,” in Proceedings of the International Symposium on Code Generation and Optimization, CGO ’06, (Washington, DC, USA), pp. 173–184, IEEE Computer Society, 2006.
Chapter 8

Demonstration Application and Output

8.1 Demonstration Application

#define MY_SIGNAL (101)

struct my_signal
{
    SIGSELECT sig_no;
    char *message;
};

union SIGNAL
{
    SIGSELECT sig_no;
    struct my_signal my_signal;
};

static SIGSELECT any_sig[] = { 0 };
static char *message1 = "Howdy ho!";
static char *message2 = "Tally ho!";

PROCESS pid1, pid2;

OS_PROCESS(demo1)
{
    union SIGNAL *sig;

    ramlog_printf("Demo start! \n");
    ramlog_printf("Demo1: Thesis DEMO: My first process!\n");
    while (1)
    {
        ramlog_printf("Demo1: Waiting on Demo2.... \n");
        sig = receive(any_sig);
        ramlog_printf("Demo1: Demo2 says: %s \n", sig->my_signal.message);
        ramlog_printf("Received message from Demo2 :-) I'm not alone after all! \n");
        sig->my_signal.message = message1;
        ramlog_printf("Demo1: Sending %s to Demo2. \n", message1);
        send(&sig, pid2);
    }
}

OS_PROCESS(demo2)
{
    union SIGNAL *sig;
    int i;

    ramlog_printf("Demo2: Thesis DEMO: Almost my first process!\n");
    sig = alloc(sizeof(struct my_signal), MY_SIGNAL);
    sig->my_signal.message = message2;
    for (i = 0; i < 2; i++)
    {
        send(&sig, pid1);
        ramlog_printf("Demo2: Sending %s to Demo1\n", message2);
        ramlog_printf("Demo2: Waiting on Demo1.... \n");
        sig = receive(any_sig);
        ramlog_printf("Demo2: Demo1 says: %s \n", sig->my_signal.message);
        sig->my_signal.message = message2;
    }
    /* write the value SIM_CONTROL_PANIC to the SPR_SIM_CONTROL
       special-purpose register */
    ramlog_printf("Demo complete! \n");
    set_spr(0x4e0c, 27);
}

void bspStartOseHook2(void)
{
    pid1 = create_process(OS_PRI_PROC, "demo1", demo1,
                          100,                          /* Stack size */
                          30,                           /* Priority   */
                          (OSTIME) 0,                   /* Timeslice  */
                          (PROCESS) 0,
                          (struct OS_redir_entry *) NULL,
                          (OSVECTOR) 0,
                          (OSUSER) 0);
    start(pid1);

    pid2 = create_process(OS_PRI_PROC, "demo2", demo2,
                          100,                          /* Stack size */
                          30,                           /* Priority   */
                          (OSTIME) 0,                   /* Timeslice  */
                          (PROCESS) 0,
                          (struct OS_redir_entry *) NULL,
                          (OSVECTOR) 0,
                          (OSUSER) 0);
    start(pid2);
}

8.2 Demonstration Application Output

The demo application prints the following output into the Ramlog (startup output included).

__RAMLOG_SESSION_START__
[0] 0.000:ROFS: No embedded volume found. Use rofs_insert host tool to insert one.
[0] 0.000: Detected TILEpro64, PVR 0xffff, D-cache 8 KByte, I-cache 16 KByte
[0] 0.000:mm: mm_open_exception_area(cpu_descriptor=36c044, vector_base=0, vector_size=256)
[0] 0.000:CPU_HAL_TILEPRO: init_cpu.
[0] 0.000:mm: mm_open_exception_area MMU=0
[0] 0.000:mm: mm_install_exception_handlers: entry
[0] 0.000:mm: start parsing log_mem string: krn/log_mem/RAM
[0] 0.000:mm: max_domains: 255 @ 1206a80
[0] 0.000: Boot heap automatically configured. [0x01410000-0x081fffff]
[0] 0.000: init_boot_heap(0x01410000, 0x081fffff)
[0] 0.000: Initial range: [0x01410000-0x081fffff]
[0] 0.000: curr_base: 0x08200000
[0] 0.000: phys_frag:
[0] 0.000:MM: add_bank: name RAM [0x200000-0x8000000] bank_size 0x060034, frag_cnt 0x008000, sizeof *bank, 0x000040 (sizeof *frag) 0x00000c
[0] 0.000:mm: phys_mem [0x0000200000-0x00081fffff] SASE RAM
[0] 0.000:mm: start parsing log_mem string: krn/log_mem/RAM
[0] 0.000:mm: log_mem [0x00200000-0x081fffff] SASE RAM
[0] 0.000:mm: region "bss": [0x01200000-0x0140ffff], (0x00210000) su_rw_usr_ro copy_back speculative_access
[0] 0.000: krn/region/bss: [0x01200000-0x0140ffff]
[0] 0.000:mm: region "data": [0x0036c000-0x0036dfff], (0x00002000) su_rw_usr_rw copy_back speculative_access
[0] 0.000: krn/region/data: [0x0036c000-0x0036dfff]
[0] 0.000:mm: region "ramlog": [0x00100000-0x00107fff], (0x00008000) su_rw_usr_na write_through speculative_access
[0] 0.000:mm: log_mem [0x00100000-0x00107fff] SASE ramlog
[0] 0.000: krn/region/ramlog: [0x00100000-0x00107fff]
[0] 0.000:mm: region "text": [0x00200000-0x0036bfff], (0x0016c000) su_rwx_usr_rwx copy_back speculative_access
[0] 0.000: krn/region/text: [0x00200000-0x0036bfff]
[0] 0.000: Data cache not enabled in configuration. For NOMMU only.
[0] 0.000: Instruction cache not enabled in configuration. For NOMMU only.
[0] 0.000:mm: map_regions()
[0] 0.000:MM-meta-data: [0x0819d000-0x081fffff]
[0] 0.000:MM init completed
[0] 0.000: has_mmu= 0
[0] 0.000:mm: initialization completed.
[0] 0.000: Cache bios installed.
[0] 0.000: syscall ptr: 2f4fe0 bios ptr: 2f50c0
[0] 0.000: Starting syspool extender.
[0] 0.000: Starting mainpool extender.
[0] 0.000: kernel is up.
[0] 0.000: Starting RTC.
[0] 0.000: Starting system HEAP.
[0] 0.000: Starting FSS.
[0] 0.000: Starting PM
[0] 0.000: Starting SHELLD.
[0] 0.000: OSE5 core basic services started.
[0] 0.000:ROFS: /romfs: No volume found. Has ose_rofs_start_handler0() been called?
[0] 0.000: Starting DDA device manager
[0] 0.000: Installing static device drivers.
[0] 0.000: dda: ddamm_alloc_uncached from MM(16384) = 0x4b9000
[0] 0.000: devman: Started (log_mask=0x3)
[0] 0.000: Register driver pic_tilepro
[0] 0.000: Register driver ud16550
[0] 0.000: Register driver timer_tilepro
[0] 0.000: Activating devices.
[0] 0.000: Starting SERDD.
[0] 0.000: Starting CONFM.
[0] 0.000: Adding program type APP_RAM execution mode user
[0] 0.000:APP_RAM/text=phys_mem:RAM log_mem:RAM cache:0xc perm:0x707
[0] 0.000:APP_RAM/pool=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000:APP_RAM/data=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000:APP_RAM/heap=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: No conf for: pm/progtype/APP_RAM/heap, using default.
[0] 0.000: Adding program type SYS_RAM execution mode supervisor
[0] 0.000:SYS_RAM/text=phys_mem:RAM log_mem:RAM cache:0xc perm:0x707
[0] 0.000:SYS_RAM/pool=phys_mem:RAM log_mem:RAM_SASE cache:0xc perm:0x303
[0] 0.000:SYS_RAM/data=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000:SYS_RAM/heap=phys_mem:RAM log_mem:RAM cache:0xc perm:0x303
[0] 0.000: No conf for: pm/progtype/SYS_RAM/heap, using default.
[0] 0.000: Starting RTL ELF.
[0] 0.000: Demo start!
[0] 0.000: Demo1: Thesis DEMO: My first process!
[0] 0.000: Demo1: Waiting on Demo2....
[0] 0.000: Demo2: Thesis DEMO: Almost my first process!
[0] 0.000: Demo2: Sending Tally ho! to Demo1
[0] 0.000: Demo2: Waiting on Demo1....
[0] 0.000: Demo1: Demo2 says: Tally ho!
[0] 0.000: Received message from Demo2 :-) I'm not alone after all!
[0] 0.000: Demo1: Sending Howdy ho! to Demo2.
[0] 0.000: Demo1: Waiting on Demo2....
[0] 0.000: Demo2: Demo1 says: Howdy ho!
[0] 0.000: Demo2: Sending Tally ho! to Demo1
[0] 0.000: Demo2: Waiting on Demo1....
[0] 0.000: Demo1: Demo2 says: Tally ho!
[0] 0.000: Received message from Demo2 :-) I'm not alone after all!
[0] 0.000: Demo1: Sending Howdy ho! to Demo2.
[0] 0.000: Demo1: Waiting on Demo2....
[0] 0.000: Demo2: Demo1 says: Howdy ho!
[0] 0.000: Demo complete!
The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Sixten Sjöström Thames