Technische Universität Dresden
Fakultät Informatik
Institut für Systemarchitektur
Professur Betriebssysteme

Diploma Thesis

Software Structure and Portability of the Fiasco Microkernel

Alexander Warg

July 28, 2003

Supervising professor: Prof. Dr. Hermann Härtig
Supervising staff member: Dr. Michael Hohmuth

All trademarks are the property of their respective owners.

Acknowledgements

I would like to thank everybody who supported my work on the Fiasco microkernel. In particular, I would like to thank: my supervisor Michael Hohmuth, for introducing me to the depths of Fiasco’s source code; Udo Steinberg and Frank Mehnert, who worked out numerous implementation details on IA-32 and accepted my design principles; Prof. Dr. Hermann Härtig, who always found time for a discussion; Adam Lackorzynski, who built the ARM cross compilers and never disabled my login; as well as all other members of the OS groups at TU Dresden and the University of Karlsruhe.

Declaration

I hereby declare that this thesis is a work of my own, and that only cited sources have been used.

Dresden, July 28, 2003
Alexander Warg

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Approach
  1.3 Document Structure
    1.3.1 UML Class Diagrams
2 Fundamentals
  2.1 The ARM Processor Architecture
    2.1.1 Privileged Modes and Banked Registers
    2.1.2 Exceptions
    2.1.3 Memory Management Unit
    2.1.4 StrongARM and XScale
  2.2 Related Work
    2.2.1 How to Achieve Portability
    2.2.2 ARM Related Work
  2.3 State of Fiasco
3 Design
  3.1 Requirements
  3.2 Overall Design of Fiasco
  3.3 L4-Independent Hardware Abstractions
    3.3.1 Native Data Types
    3.3.2 Drivers
    3.3.3 Generic Page-Table Interface
    3.3.4 Standard C Library
  3.4 L4-Specific Components
    3.4.1 Basic L4 Abstractions
    3.4.2 Hardware Layer of L4’s Basic Abstractions
    3.4.3 Exception Handling
    3.4.4 L4-ABI Abstractions
    3.4.5 Kernel Memory Management
  3.5 New JDB Design
  3.6 StrongARM Specific Design
    3.6.1 Exception Handling
    3.6.2 ARM Kernel Address Space
4 Implementation
  4.1 Polymorphism
  4.2 Bootstrap
    4.2.1 Stage 1, Boot Subsystem
    4.2.2 Stage 2, In-Kernel Bootstrap
  4.3 The Build System
  4.4 StrongARM Implementation Details
5 Future Work
  5.1 General Issues
  5.2 StrongARM Topics
6 Summary
A Architecture-Specific Hooks
  A.1 Kernel-External Hooks
    A.1.1 Proc Class
    A.1.2 Atomic Operations
  A.2 Kernel-Internal Hooks
    A.2.1 Page table Class
    A.2.2 Kmem Class
    A.2.3 Context Class
    A.2.4 Thread Class
    A.2.5 Kernel thread Class
    A.2.6 In-Kernel System-Call Bindings
    A.2.7 Cpu Class
    A.2.8 Fpu Class
    A.2.9 Timer Class
    A.2.10 Pic Class
    A.2.11 Mapdb Class
    A.2.12 Startup Constructor
    A.2.13 Boot console Class
    A.2.14 kdb ke Module
Acronyms
Bibliography

List of Figures

2.1 Address flow on ARM with FCSE
3.1 Layering of Subsystems
3.2 Design of the Generic CPU Abstractions
3.3 Design of the Console-I/O Subsystem
3.4 Generic Page-Table Interface
3.5 Hierarchy of the Thread Abstraction
3.6 Hierarchy of the Space Abstraction
3.7 Platform Hooks of the Thread Abstraction
3.8 Design of the L4 ABI Types
3.9 Encapsulated System-Call Parameters
3.10 Kernel Binding for the Id-Nearest System Call
3.11 Model of the Memory-Management System
3.12 Class Structure of the New JDB
3.13 Exemplary Subclasses of Jdb module
3.14 StrongARM Address-Space Layout
4.1 Memory Aliasing Example

List of Programs

4.1 Explicit Compile-Time Polymorphism (Interface)
4.2 Explicit Compile-Time Polymorphism (Implementation)
4.3 Simple Static Constructors Example
4.4 Prioritized Static Constructors Example
4.5 Example Modules File

Chapter 1 Introduction

Portability and software engineering are two key terms that have become very significant for today’s software. For example, the whole philosophy of Java is based on the idea of “write once, run anywhere”, which is the most extreme form of portability. The software engineering community proposes the principle of object-oriented design (OOD), which manifests itself in programming languages that directly support object-oriented programming (OOP). OOD is claimed to be useful for distributed work on software projects and to result in maintainable code. It helps to define autonomous components with well-defined interfaces, which are the basis for good testability.

Nowadays, developers of application software commonly accept the principles of software design and the importance of portability. Many modern application programs run on various computer platforms and operating systems; they are often implemented in object-oriented programming languages or with object-oriented techniques.

The operating-systems world seems to be somewhat slower in accepting new principles; the importance of portability is widely recognized, whereas a common basic approach to the design of operating systems has not been established. The microkernel approach, which was devised in the early 1980s, has the potential to become the basis for operating-system design. Microkernels support a modern component-based and modularized design of the operating-system personalities built atop them. However, the two key issues, portability and software engineering, can also be applied to the microkernel itself.

The first microkernels suffered from poor performance. For instance, Mach (see [Accetta et al., 1986]) was not accepted because of its bad performance. These first-generation microkernels were not really small and had a lot of functionality built into the kernel. The microkernels of the second generation, such as L4, aimed to be highly efficient. Jochen Liedtke claimed in 1995 that efficient microkernels are per se non-portable (see [Liedtke, 1995]) and must be implemented in assembly language to be able to use specific features of the underlying hardware.

Today, the number of architectures on the market is growing extremely fast, and the field of embedded systems is becoming increasingly important. With the growing number of potential target platforms, the maintenance effort increases too quickly if the implementation is done from scratch for every platform; each new implementation is very prone to errors, and the various implementations do not benefit from bug fixes in one particular implementation.
Thus, the claim that portability is not an issue for microkernels no longer holds true. Hazelnut, from Liedtke’s operating-systems group at the University of Karlsruhe, proved that it is possible to implement a microkernel in a high-level language (HLL). The runtime overhead of Hazelnut, which is implemented in C, was very low (see [Hazelnut, 2000]).

The picture of microkernel implementations has already changed. Current microkernels, such as Fiasco, Hazelnut, and Pistachio, are implemented in HLLs. This development has made microkernels far more maintainable and less prone to errors. Hazelnut and Pistachio, from the University of Karlsruhe, already focus on portability and use a common code base for different target architectures, whereas Fiasco, from TU Dresden, currently supports only one target architecture in the main branch.

1.1 Problem Statement

The known portable L4 implementations do not use object-oriented principles to achieve portability, because polymorphism is thought to be a performance problem. These kernel implementations suffer either from an excess of conditionally compiled code (#ifdef constructs) or from inherent code duplication.

Another deficiency of Pistachio and Hazelnut is their restricted real-time capability. These kernels use a global interrupt lock for synchronization, which results in long and partly unbounded interrupt latencies. In contrast, the Fiasco microkernel was designed to provide good real-time capabilities. The focus was on high preemptability, as well as short and bounded interrupt latencies, but not on portability.

One way to overcome the deficits mentioned above is to improve the preemptability of Pistachio. The other solution is to make Fiasco more portable. The latter option has the advantage that the Fiasco microkernel was developed in Dresden, so the Operating Systems Group in Dresden has a high level of Fiasco knowledge.
Additionally, the ongoing VFiasco project could benefit from clear hardware abstractions. (VFiasco is the effort to formally prove the correctness of Fiasco. The formal proof is an important step toward using Fiasco in systems with very high security demands.)

1.2 Approach

This thesis presents a completely object-oriented design that is intended to improve the portability of the Fiasco source code. The design incorporates all features of object orientation, such as inheritance and polymorphism. My starting assumption was that using OOD and OOP to achieve better portability has no a priori impact on the performance of the compiled kernel; this does not hold automatically. In reality, one of the main challenges in applying modern software design principles, such as OOD, to microkernels is to keep the performance impact as low as possible (at best zero).

The very things that make OOP attractive, such as polymorphism, have a problematic influence on the efficiency of the resulting software. In the case of polymorphism, most C++ compilers generate an extra level of indirection and impede source-level inlining. The extra level of indirection results in extra instructions and extra memory accesses on each method invocation, and source-level inlining is the basis for an efficient realization of fine-grained interfaces.

The additional overhead introduced by OOP is commonly accepted in application development, because the advantage of better maintainability outweighs the minor performance impact. In the microkernel community, however, the most important design goal is to achieve maximum performance and to have as little influence on running applications as possible. Microkernels of the first generation, such as Mach, were not accepted because of their unsatisfactory performance.

1.3 Document Structure

The rest of this thesis is divided into five chapters.
The following chapter provides a foundation that should help you better understand this thesis. If you are familiar with the topics mentioned, you may skip Chapter 2 completely or read only selected parts. The Design chapter (Chapter 3) contains the object-oriented design of Fiasco and describes how portability is actually improved. Chapter 4 focuses on issues such as the efficient implementation of the polymorphic design, the build system, and the boot sequence. The Design and Implementation chapters are each split into two main parts: one that deals with portability in general, and another that covers the issues specific to the ARM port, which was part of the task for my thesis. Finally, an outlook on open topics is given in Future Work (Chapter 5), and the Summary (Chapter 6) briefly recapitulates the important issues of this thesis.

1.3.1 UML Class Diagrams

All class diagrams use a common style to communicate extra information. Generic classes (i.e., classes not dedicated to a certain architecture) are colored white. Gray-shaded classes are hardware-specific implementations or interfaces.

Further information is encoded in the font that is used for the class title and for the operations. The class title is set in italics for abstract classes, which cannot be instantiated. The operations of a class are set in bold italics if they are abstract or, in C++ terms, pure virtual. Final implementations of operations are set in normal font. Class-scope operations are underlined.

Chapter 2 Fundamentals

2.1 The ARM Processor Architecture

A main part of my work was porting the Fiasco microkernel to ARM, and especially to StrongARM. The ARM architecture is nowadays one of the most widely used architectures in embedded systems. Its large market share is due to the simple design, which leads to low-cost and low-power implementations.
This section introduces the ARM architecture from the operating-system point of view; the user-level view and the programming model are not covered in detail.

ARM is fundamentally a 32-bit RISC architecture, augmented with CISC strengths. In condensed form, the properties of ARM are as follows:

• 32-bit load/store architecture
• High performance at low cost and power consumption
• Conditional execution of all instructions
• Parallel shift and ALU operation
• Multiple register load/store instructions (slow on Intel XScale)
• Hardware-loaded TLB
• Multiple page size support
• Domains, which can be taken as a variation of address-space identifiers (ASIDs)

The following sections give only a short overview of the ARM architecture. For complete documentation see [ARM Ltd., 2000, StrongARM, 2001, Intel XScale, 2002, Intel PXA, 2002a, Intel PXA, 2002b].

2.1.1 Privileged Modes and Banked Registers

The ARM architecture supports seven operating modes, which are summarized in Table 2.1. One of these modes is the non-privileged user mode; the other six are privileged modes for operating-system execution. Modes can be switched either under the control of privileged software or via exceptions, which include the explicit software interrupt (SWI).

Mode        Privileged  Registers
User        no          R0–R15
FIQ         yes         R0–R7, R8_fiq–R14_fiq, R15
IRQ         yes         R0–R12, R13_irq–R14_irq, R15
Supervisor  yes         R0–R12, R13_svc–R14_svc, R15
Abort       yes         R0–R12, R13_abt–R14_abt, R15
Undefined   yes         R0–R12, R13_und–R14_und, R15
System      yes         R0–R15

Table 2.1: ARM Operating Modes.

ARM uses banked registers to preserve the minimal machine state on mode switches. Banked registers are general-purpose registers that have an extra instance for a dedicated mode of operation. The privileged operating modes that can be entered via exceptions have at least a banked stack pointer (SP), link register (LR), and saved program status register (SPSR).
The fast interrupt (FIQ) mode has five additional banked registers to facilitate extremely fast interrupt handling without the need to save registers to memory.

2.1.2 Exceptions

Exceptions result in a defined change of the flow of control. It is irrelevant whether the cause of the exception is internal or external to the processor; the CPU loads a fixed address in the exception vector into the instruction pointer and switches to the privileged mode (see Section 2.1.1) that corresponds to the exception that occurred. The ARM architecture supports the following exceptions:

Reset: The reset exception is raised when the processor’s reset input is asserted. Execution starts at address 0x00000000 in Supervisor mode.

Undefined instruction: This exception is raised if an attempt is made to execute an undefined instruction, or if a coprocessor instruction is to be executed and no coprocessor responds. The processor switches to the Undefined mode.

Software Interrupt (SWI): An SWI enters the Supervisor mode. It is raised explicitly by executing the swi instruction.

Prefetch Abort: If the memory system signals an abort on an instruction prefetch, the prefetch abort exception is raised. Execution continues in Abort mode.

Data Abort: This exception is raised if the memory system signals an abort on a data access. Again, the processor is put into Abort mode.

Interrupt Request (IRQ): The IRQ exception is raised when the IRQ signal of the CPU is asserted. IRQs can be masked by setting the appropriate bit in the current program status register (CPSR). This exception enters the IRQ mode.

Fast Interrupt Request (FIQ): The FIQ exception is raised when the FIQ signal of the CPU core is asserted. An FIQ exception results in a switch to the FIQ mode, which provides five additional banked registers. The FIQ signal should be used for a small number of interrupts that require very fast handling.
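
The mapping from exceptions to vector addresses and entered modes can be summarized in a small C++ sketch. The type and function names are hypothetical and chosen for illustration only; they are not taken from Fiasco. The vector offsets follow the standard ARM layout and assume that the vector table is located at address 0x00000000, as stated for the reset exception above.

```cpp
#include <cassert>
#include <cstdint>

// The seven ARM exceptions described in Section 2.1.2.
enum class Exception {
  Reset, Undefined_instruction, Software_interrupt,
  Prefetch_abort, Data_abort, Irq, Fiq
};

// The privileged modes that an exception can enter.
enum class Mode { Supervisor, Undefined, Abort, Irq, Fiq };

// Address loaded into the program counter when the exception is taken.
// Assumption: vector table at 0x00000000; the slot at 0x14 is reserved.
constexpr std::uint32_t vector_address(Exception e)
{
  switch (e) {
    case Exception::Reset:                 return 0x00;
    case Exception::Undefined_instruction: return 0x04;
    case Exception::Software_interrupt:    return 0x08;
    case Exception::Prefetch_abort:        return 0x0c;
    case Exception::Data_abort:            return 0x10;
    case Exception::Irq:                   return 0x18;
    case Exception::Fiq:                   return 0x1c;
  }
  return 0xffffffff; // not reached for valid enumerators
}

// Privileged mode the processor switches to for the given exception.
constexpr Mode entered_mode(Exception e)
{
  switch (e) {
    case Exception::Reset:
    case Exception::Software_interrupt:    return Mode::Supervisor;
    case Exception::Undefined_instruction: return Mode::Undefined;
    case Exception::Prefetch_abort:
    case Exception::Data_abort:            return Mode::Abort;
    case Exception::Irq:                   return Mode::Irq;
    case Exception::Fiq:                   return Mode::Fiq;
  }
  return Mode::Supervisor; // not reached for valid enumerators
}
```

Note that the FIQ vector is the last entry of the table, so an FIQ handler can start directly at that address without a further branch.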

2.1.3 Memory Management Unit

The ARM architecture defines virtual memory support with a hardware-walked page table. An implementation may provide a translation look-aside buffer (TLB), which can either be unified for code and data or split into an instruction-prefetch translation look-aside buffer (ITLB) and a data-access translation look-aside buffer (DTLB). In summary, the virtual memory system of ARM is based on hardware-loaded TLBs and provides multiple page sizes as well as a variation of address-space identifiers (ASIDs), the domains.

Page tables have two levels: the first level is referred to as the page directory (PD), and the second level as the leaf page table (LPT). The page directory contains 4096 entries, and a leaf page table consists of 256 or 1024 entries.

2.1.4 StrongARM and XScale

The StrongARM SA-1100 is a low-power, low-cost, and high-speed implementation of the ARM architecture. It specifies not only the CPU core, but also a set of peripheral controllers that are integrated within a single package. The StrongARM CPU core implements a modified Harvard architecture with separate caches and TLBs for the instruction and data streams.

The XScale application-processor family is the successor of StrongARM. The processors provide a set of extensions to the ARM architecture, as well as increased clock rates. Interesting enhancements, from the operating-system point of view, are better cache and TLB control (e.g., cache and TLB pinning).

[Figure 2.1: Address flow on ARM with FCSE (according to [ARM Ltd., 2000], Figure 6–1 on page B6–3). Elements shown: ARM core, virtual address, FCSE, modified virtual address, MMU, physical address, main memory, cache.]

Cache Architecture

The modified Harvard architecture makes it possible to achieve clock speeds of 200 MHz and more. The processor provides, besides its standard instruction and data caches, a write buffer, a mini-data cache, and a data-prefetching read buffer.
The caches are fully virtual, that is, indexed and tagged with virtual addresses; this results in problems that are described later, in Section 2.2.2.

The XScale processors provide almost the same cache features as the SA-1100; only the data-prefetching read buffer is missing. As compensation for the missing read buffer, XScale features larger caches and the ability to pin cache lines.

Fast Context Switch Extension (FCSE)

The fast-context-switch extension (FCSE) of the StrongARM provides a means to re-map the lowest 32 MByte of an address space. The FCSE is controlled with a PID register that contains the ID of the currently running process. The contents of the PID register are used to calculate the modified virtual address for accesses to the first 32 MByte of the address space; this modified virtual address is then fed to the fully virtual caches and the MMU for the usual translation process. Figure 2.1 shows the address flow in the ARM MMU system with FCSE.

The calculation is as follows, where P is the current PID, A_{access} is the accessed address, and A_{virtual} is the modified virtual address fed to the memory system.

\[
A_{virtual} =
\begin{cases}
P \cdot 32\,\text{MByte} + A_{access} & \text{if } A_{access} < 32\,\text{MByte} \\
A_{access} & \text{if } A_{access} \geq 32\,\text{MByte}
\end{cases}
\]

As is evident from the equation, only the lowest 32 MByte of the address space are affected by the FCSE; the rest of the address space is the normal virtual address space. The FCSE is said to be very useful for small address spaces on StrongARM; this claim is discussed later on.

2.2 Related Work

Much work has been done in the field of portable operating systems, especially in the open-source community. The most common open-source operating system is probably Linux, which runs in version 2.4 on at least 17 different platforms (see [Linux, 2003]).
Other classical operating systems that have a focus on portability are FreeBSD (6 platforms, [FreeBSD, 2003, McKusick et al., 1996]) and NetBSD (53 platforms, [NetBSD, 2003, Kesteloot, 1995]). What these systems have in common is that they are all based on a monolithic kernel, where most of the drivers as well as the file systems run in privileged mode.

Nevertheless, there is also growing interest in portability in the microkernel community. A recent microkernel that has a clear focus on portability is Pistachio, the implementation of the L4 version X.2 interface from the University of Karlsruhe (see [L4Ka, 2003]). Other projects in the field of portable microkernels are the Sartoris microkernel (see [Sartoris Developers Group, 2003]) and the KeyKOS microkernel (see [Bomberger et al., 1992]).

2.2.1 How to Achieve Portability

There are several different approaches to achieving the design goals of easy portability and maintainability. If only two different platforms have to be supported, the obvious way is to split the source code into two parts, an architecture-specific part and a generic part. This very simple approach proves not to be useful if more than two platforms are to be supported, because there may be code that can be shared among some platforms but is useless for others; for example, the VGA driver, which is suitable for IA-32 and IA-64 but not for ARM. With the two-part scheme, such drivers must go into the architecture-specific part and must therefore be duplicated.

The monolithic kernels mentioned at the beginning of Section 2.2 are structured in a more sophisticated manner. Their source code consists of: a completely generic part; an architecture-dependent part, which is often subdivided into subarchitectures; and the device drivers. The architecture-dependent part contains code that is specific to the target processor, as well as device drivers for devices that are dedicated to machines based on the target processor.
The device-driver part contains drivers for hardware devices that are available on at least two architectures. Drivers are mostly classified according to their purpose (e.g., Network, Video, and Sound).

The Pistachio microkernel has a quite similar structure. It is subdivided into five parts: an architecture-specific part (mostly CPU specific), a platform-specific part (e.g., PC99 or EFI), an API-specific part, a generic part, and a glue part. There is no part for device drivers, which seems unproblematic because a microkernel does not contain any device drivers. However, at least for debugging purposes, even a microkernel needs drivers for I/O hardware. Moreover, there are further devices that need a driver in the microkernel: first, the interrupt controller, for enabling and disabling interrupts; second, the system timer, which is needed to trigger events such as time-slice ends or IPC timeouts. Another problem of this hard partitioning is that two different API implementations, such as version 2 and version X.0, cannot share any code, even though they differ in only a few places.

The structure of Pistachio clearly leans toward aspect-oriented design (ASOD) (see [ASOD, 2003]). The crosscutting concerns (target architecture, target platform, and system-call API) are separated cleanly. A design goal was a modular microkernel construction kit, which features easy construction of new APIs and easy porting to new architectures and platforms. However, Pistachio is implemented in plain C++, without the use of inheritance and polymorphism. C-preprocessor macros and conditional compilation (#ifdefs) are used to compose the different components and to enable specific features that are scattered throughout several functions.

2.2.2 ARM Related Work

Address Space Switches

The frequent address-space switches usually performed in microkernel operating systems are one of the trickiest challenges with ARM.
The most problematic feature is the fully virtual caches (see Section 2.1.4), which do not support any form of ASID tagging, neither on StrongARM nor on XScale. Additionally, the StrongARM architecture suffers from its rudimentary cache-control mechanisms, which allow only complete invalidation of the instruction cache.

The fully virtual caches have the effect that cache coherency must be ensured by software. With a naive approach, the caches must be flushed (written back and invalidated) on each address-space switch. The direct and, even more, the indirect costs of this operation result in a major performance impact. The direct cost of flushing the cache is 1,000–18,000 cycles. The indirect costs, which result from the lost cache and TLB working set, are about 45 cycles per TLB miss and about 70 cycles per cache miss. In the worst case, the costs are up to 75,000 cycles (≈ 350 µs on a 200-MHz processor).

General Solution

The ARM architecture provides a means to mitigate the problem of missing ASIDs in the TLBs and even in the caches: the ARM domains. A detailed description of the ARM domains can be found in [ARM Ltd., 2000, Wiggins, 1999, Wiggins and Heiser, 2000].

In principle, the ARM domains provide efficient access-control changes for large and non-contiguous regions of virtual memory. This mechanism supports fast switches among different address spaces, with the restriction that concurrently active address spaces must have no overlap in their mapped memory regions. As long as the address spaces meet this restriction, switches among them can be made by reloading the domain access control register (DACR) with the appropriate access rights. The real profit results from the fact that the StrongARM architecture observes domain access rights even for cached data and not only for cache misses; thus the kernel can mostly avoid the extremely expensive cache flushes on address-space switches.
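
To make the DACR-based switch more concrete, the following C++ sketch computes DACR values. It relies on the architectural encoding of the DACR (16 domains with a two-bit access field each); the function names, and the assumption that the kernel owns a dedicated domain that stays accessible for every task, are hypothetical and for illustration only. Writing the value to the coprocessor register is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Two-bit access field per domain in the DACR.
enum class Domain_access : std::uint32_t {
  No_access = 0x0, // every access to the domain faults
  Client    = 0x1, // page-table access permissions are checked
  Manager   = 0x3  // page-table access permissions are not checked
};

constexpr unsigned Num_domains = 16;

// Returns 'dacr' with the field of 'domain' replaced by 'access'.
inline std::uint32_t dacr_set(std::uint32_t dacr, unsigned domain,
                              Domain_access access)
{
  assert(domain < Num_domains);
  return (dacr & ~(0x3u << (2 * domain)))
         | (static_cast<std::uint32_t>(access) << (2 * domain));
}

// DACR value that makes only the kernel domain and the domain of the
// task to be run accessible (hypothetical policy); all other domains
// get "no access", which isolates the remaining address spaces.
inline std::uint32_t dacr_for_task(unsigned kernel_domain,
                                   unsigned task_domain)
{
  std::uint32_t dacr = 0; // start with "no access" everywhere
  dacr = dacr_set(dacr, kernel_domain, Domain_access::Client);
  dacr = dacr_set(dacr, task_domain, Domain_access::Client);
  return dacr;
}
```

Under these assumptions, switching from one task to another amounts to loading a different precomputed 32-bit value into the DACR, with no cache or TLB maintenance as long as the mappings do not overlap.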

The conclusion is that it is possible to circumvent the high costs of address-space switching, as long as the kernel ensures that the mappings of active address spaces do not overlap. The general implementation idea is based on a caching page table, which contains non-overlapping regions of different address spaces at the same time. The page-table entries of the different address spaces are tagged with different ARM domains, and isolation of the address spaces is enforced with an appropriate mask in the DACR. More information about the principles can be found in Section 4.3 of [Wiggins, 1999]. The aforementioned concepts also hold true for the XScale architecture.

Non-overlapping Address Spaces

One reason for virtual address spaces is to support transparent multiprocessing with programs that make use of the same address ranges. Therefore, it is in the nature of processes running in different address spaces that their mappings may overlap. This directly contradicts the assumption of non-overlapping mappings that was made for fast address-space switches.

One approach to reduce or even remove the overlap is the single-address-space operating system (SASOS) approach (see footnote 2). SASOSs are a rather special class of operating systems, but the microkernel approach should not be restricted to such a niche. A more general way to reduce contention for address-space ranges is to use the FCSE of StrongARM, which allows a transparent re-mapping of the lowest 32 MByte of the virtual address space (see Section 2.1.4).

The drawback of the FCSE is the restriction to the lowest 32 MByte of the 4-GByte virtual address space. This means that only a variation of small address spaces benefits from the FCSE. However, with the target systems of ARM processors in mind, which are embedded systems, limiting processes to a 32-MByte address space does not seem unrealistic.
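
The FCSE re-mapping introduced in Section 2.1.4 can be written down as a short C++ sketch. The function and constant names are hypothetical; the code only mirrors the calculation of the modified virtual address and assumes a 7-bit PID, so that 128 windows of 32 MByte cover the 4-GByte address space.

```cpp
#include <cassert>
#include <cstdint>

// Size of the address-space window that the FCSE re-maps: 32 MByte.
constexpr std::uint32_t Fcse_window = 32u << 20;

// Computes the modified virtual address that the FCSE feeds to the
// caches and the MMU for an access to 'address' while 'pid' is the
// content of the PID register.
inline std::uint32_t modified_virtual_address(std::uint32_t pid,
                                              std::uint32_t address)
{
  assert(pid < 128); // assumption: 7-bit PID
  if (address < Fcse_window)
    return pid * Fcse_window + address; // relocated into the PID's window
  return address;                       // upper addresses pass unchanged
}
```

For example, with PID 1 an access to address 0x00001000 is relocated to 0x02001000, whereas an access to 0x02000000 or above is passed on unchanged regardless of the PID.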
In addition, tasks with larger address spaces may benefit from the fast address-space switch mechanism, as long as the static part of the address space is limited to the lowest 32 MByte and the memory-management component of the operating system avoids overlaps in dynamically allocated memory, such as stacks and heap.

Footnote 2: MUNGI, from the University of New South Wales (UNSW), is a SASOS based on L4 technology; see [Wilkinson et al., 1995].

Limited Number of Domains

A problematic issue is the limited number of available ARM domains. Only 16 domains are provided by the ARM architecture. Because every running task needs a domain assigned to it, more than 16 domains would potentially be necessary to execute more than 16 tasks. The solution to this problem is to assign domains to running tasks dynamically.

The dynamic allocation of a limited resource always raises the following key issues:

• Preemption, to withdraw an assigned resource
• Scheduling, for choosing a victim to be preempted and for selecting a candidate that gets a free resource
• Thrashing, which occurs if there are more candidates in the current working set than there are available resources

In terms of ARM domains, the keywords are domain preemption, domain thrashing, and domain preemption strategy. [Wiggins et al., 2002, Wiggins and Heiser, 2000, Wiggins, 1999] discuss these terms in more detail.

2.3 State of Fiasco

Michael Hohmuth, from TU Dresden, initially wrote the Fiasco microkernel. It was the first implementation of the L4 microkernel interface in an HLL (C++). At the very beginning of my work, the Fiasco microkernel was already modularized, and well-defined interfaces were used among the modules. The main reason for this structure was to achieve testability of separate modules. For such off-kernel testing of modules, the dependencies among the modules must form a directed acyclic graph.
If this property is not satisfied, all modules participating in a circular dependency can only be tested as a whole. The same applies to portability; more details on this follow in Section 3.1.

The source code of the Fiasco microkernel was already partitioned into subsystems, modules, and submodules. The different subsystems are almost completely self-contained parts of the microkernel, such as a library for simple memory management, the reduced C library, or the main kernel image itself. Each subsystem consists of one or more modules that encapsulate logical units. For example, the thread module contains the data structures and methods necessary to handle L4 threads. The interfaces of these modules are well defined and hide the implementation details. A module can in turn aggregate one or more submodules, which are used to further subdivide a module into smaller logical blocks (e.g., thread-syscall and thread-ipc are two submodules of thread).

When I initially started to work with Fiasco's source code, it appeared that the property of a directed acyclic graph was violated, the interfaces were not designed for portability among different architectures, and architecture- and platform-specific code was spread all over the kernel.

On closer inspection, the situation proved to be better than that. The circular dependencies were almost entirely introduced by logging features that use hooks into the Fiasco kernel debugger (JDB), while JDB in turn depends on nearly every part of the microkernel. Interfaces were well defined, even if mostly not machine independent and often coarse grained. The subsystem-module-submodule structure was not designed for portability; the initial partitioning was based only on logical blocks, and later refactoring aimed at eliminating certain circular dependencies to improve testability.

All in all, portability was a new goal that needed further partitioning, restructuring, and new, probably more fine-grained interfaces.
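The acyclicity requirement on the module graph can be checked mechanically. The following is a minimal sketch of such a check using a depth-first search. The Graph type, the function names, and the module names used in the usage note are hypothetical illustrations and are not taken from Fiasco's build system.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each module to the modules it depends on (hypothetical representation).
using Graph = std::map<std::string, std::vector<std::string>>;

// Depth-first search: returns true if a cycle is reachable from `node`.
static bool reaches_cycle(const Graph &g, const std::string &node,
                          std::set<std::string> &on_path,
                          std::set<std::string> &done)
{
    if (done.count(node))
        return false;                    // already fully explored
    if (!on_path.insert(node).second)
        return true;                     // node is on the current path: cycle
    auto it = g.find(node);
    if (it != g.end())
        for (const std::string &dep : it->second)
            if (reaches_cycle(g, dep, on_path, done))
                return true;
    on_path.erase(node);
    done.insert(node);
    return false;
}

// True if the dependency graph is not a directed acyclic graph.
bool has_circular_dependency(const Graph &g)
{
    std::set<std::string> on_path, done;
    for (const auto &entry : g)
        if (reaches_cycle(g, entry.first, on_path, done))
            return true;
    return false;
}
```

For example, a graph in which a module named thread depends on a logging module, logging depends on jdb, and jdb depends on thread again would be reported as circular, which mirrors the kind of cycle described above.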
Chapter 3 Design

The main goal is to achieve portability among different computer architectures and platforms, while reusing most of Fiasco's existing source code and keeping the performance impact as low as possible.

The performance aspect is very important in a microkernel operating system. Older microkernel operating systems often suffered from the bad performance of the underlying kernel. As the past has shown, the Mach microkernel (see [Accetta et al., 1986]) was not accepted due to the performance issue. The L4 microkernels were actually designed to be highly efficient. Thus L4Linux, a Linux running in user space, suffers from a performance loss of only five percent (see [Härtig et al., 1997]).

In condensed form, the design objectives are:

• Portability among various hardware platforms
• No performance impact at the L4 interface
• Low maintenance effort for the different supported platforms
• No #ifdef constructs to select specific implementations

3.1 Requirements

This section is about portability in general, but in terms of Fiasco's existing structure.

To achieve a certain degree of portability, all machine-specific code must be factored out into separate modules or submodules. Appropriate interfaces to the architecture-dependent parts have to be defined. These two things can be brought together under the term encapsulation of machine-specific code, which is clearly necessary. The trickier part is to determine the appropriate granularity for the encapsulation, which must be somewhere between treating the whole kernel as machine specific and hiding single assembler statements behind a general interface. The resulting granularity is a tradeoff between the porting effort and runtime efficiency (performance).

Not only the right granularity, but also the dependencies among the grains, in our terms modules and submodules, are of high importance.
To get an easily portable system, there should be no circular dependencies (see Section 2.3). If no circular dependencies exist, the porting work can start at the low-level modules, which do not depend on anything else, and proceed toward the top-level modules, which have no other modules that depend on them. The module structure and the avoidance of circular dependencies also improve the overall testability and the testability of a partially ported kernel.

Altogether, to create portable software, circular dependencies among different modules should be removed entirely and a suitable granularity for the encapsulation must be found.

3.2 Overall Design of Fiasco

When I started this work, I decided to use a two-part partitioning of every subsystem (generic and architecture-specific). This approach turned out to be infeasible; even for microkernel design, the weaknesses described in Section 2.2.1 are not acceptable.

The resulting design is now similar to Linux. The source code is formally split into a generic and an architecture-specific part. Nevertheless, the actual structure is far more fine-grained and based on the subsystem-module-submodule scheme of Fiasco, which is introduced in Chapter 2. Code that is not completely generic is fully factored out into separate modules or submodules, and #ifdefs are almost completely banned. Modules and submodules that cannot be confined to one specific architecture, but apply to two or more architectures, are located in the generic parts of the source code.

It turned out that there are several orthogonal concerns. For example, there are modules that are dedicated to: a certain hardware device, which is available on more than one architecture; a specific API version; or the word width of the target processor. These semi-specific submodules are labeled with special suffixes that point out their concern.

Figure 3.1 on the following page shows the composition of the final boot image.
The subsystems have the following tasks:

Kernel + JDB is the implementation of the L4 interface and the Fiasco kernel debugger (JDB).
Boot is the early bootstrap code, which loads the kernel into virtual memory and transfers control to it (see Section 4.2.1 for more information).
ABI contains L4-ABI-specific type definitions (described in Section 3.4.4).
JABI contains user-level interface specifications for the JDB.
LibLMM is a generic implementation of a list-based memory manager.
LibAMM provides functions to manage regions of virtual address spaces.
MiniLibC contains the C library used for the kernel and the Boot subsystem.
Drivers is a library of drivers for console I/O and general processor features.

Figure 3.1: Layering of Subsystems. This picture shows the composition of the Fiasco kernel image out of different subsystems (Kernel + JDB, ABI, JABI, LibLMM, Boot, LibAMM, MiniLibC, Drivers, and Types). The shaded areas give the estimated amount of architecture- and/or platform-dependent code in the pictured subsystems.

For a better understanding of the rest of this chapter, three main terms are used as follows:

kernel image is used whenever the complete bootable image of the kernel is meant; it consists at least of the subsystems you can see in Figure 3.1,
kernel always refers to the implementation of the L4 interface, which is composed of the Kernel, ABI, LibLMM, LibAMM, MiniLibC, and Drivers subsystems,
JDB refers to Fiasco's kernel debugger, which actually is an integrated part of the kernel, but virtually forms a stand-alone debugger and can be decoupled from Fiasco.

3.3 L4-Independent Hardware Abstractions

This section deals with hardware abstractions that are not dedicated to porting of L4 or Fiasco. The subsequently described parts are generally useful for portable software and especially portable operating systems.
3.3.1 Native Data Types

The most important prerequisite for the portability of low-level, machine-dependent code is to have clearly defined native data types. These native data types belong to the group of unstructured integral types and can be classified into fixed-width and fixed-meaning data types. The former have a fixed number of bits, independent of the target architecture. The latter have a special purpose and may vary in their width according to the target architecture.

The fixed-width types must be defined for each processor architecture and for each supported compiler, because the C++ standard specifies no data types with a concrete width. The compiler-specific definitions can be dropped, because Fiasco currently supports only the GNU compilers, GCC and G++. The architecture-specific definitions, on the other hand, cannot be omitted.

The second category, which may also differ from one architecture to another, is the fixed-meaning types. Members of this class are, for example, a type with the width of a general-purpose register or a virtual memory address. The C++ standard already defines some of these types, such as void* for addresses, but it lacks the definition of others.

Because of the importance of these native types, not only for the kernel implementation but for nearly every subsystem, they are defined in a dedicated subsystem; this subsystem is called the Types subsystem and forms the lowest level of architecture-dependent code.

3.3.2 Drivers

The Drivers subsystem provides the next level of hardware abstraction. This subsystem contains several hardware-specific device drivers, which can again be classified into different levels.

Processor Driver

The lowest-level driver provides a very small interface to the central processing unit (CPU). Figure 3.2 on the following page shows the actual interface of the CPU, the Proc class, which is reduced to the common base of all CPUs Fiasco should run on.
The Proc class contains the following functionality:

• IRQ control methods, which control and request the acceptance of hardware interrupts (IRQs); the strangest method in this group is probably irq_chance, which gives a pending IRQ the chance to come through
• A spin-loop support method (pause), which must be used in tight spin loops, and should protect the CPU from consuming too much energy and prevent blocking of hyper-threaded¹ CPUs
• Sleep-mode support (halt), which puts the processor into a sleep mode until the next IRQ
• Methods to access/manipulate the stack pointer of the local CPU

For the complete interface definition see Section A.1.1 on page A-1.

¹ duplication of the execution context but not the execution units on a single processor, to increase the utilization of the execution units (see [Intel, 2003])

Figure 3.2: Design of the Generic CPU Abstractions. This figure shows the design of the very generic CPU interface, which is used not only by the microkernel but also by other subsystems.

Atomic Operations

The atomic operations are an integral part of Fiasco. The lock-free and wait-free (see [Hohmuth and Härtig, 2001]) implementation requires the use of operations that read and manipulate data in the main memory atomically, such as compare and swap or test and set. These atomic operations are defined in an extra module and come in two different flavors: multi-processor safe and uniprocessor safe. The operations are hidden behind a type-safe interface, which is implemented with C++ function templates. The generic and type-safe wrapper functions make use of machine-specific low-level operations. Section A.1.2 of Appendix A contains the interface definition for these type-unsafe atomic operations, which must be implemented for a specific target processor.

Port-I/O Driver

Another low-level driver is the port-I/O driver. This driver is only useful on architectures that use IA-32-compatible devices.
Such devices use a special physical address space distinct from the normal memory address space, the I/O ports. For instance, IA-64, as a supported architecture, may use an IA-32-compatible VGA card. The port-I/O abstractions are a means to write processor-independent device drivers, no matter how the port address space is accessed on the underlying platform.

Console-I/O Driver

In addition to the aforementioned device drivers, a console-I/O abstraction exists. This abstraction provides a means for debugging input and output. Figure 3.3 on the following page illustrates the complete design of the console-I/O system.

No I/O facilities in the kernel are necessary if the microkernel is used in a production system, because all the device interaction is implemented in user-level device drivers. Nevertheless, the development of the microkernel and the applications atop it often needs debugging. To support a well-featured kernel debugger, the console-I/O system forms the abstraction layer between the various I/O hardware and the kernel debugger.

The basic console abstraction (class Console) is designed to cover any kind of character-based input and/or output device. Specialized abstractions for designated hardware classes are also defined. For instance, the serial UART device drivers for StrongARM and the 16550 UART, as used in most IA-32 computers, are based on the Uart class definition.

3.3.3 Generic Page-Table Interface

One of the fundamental principles of L4 microkernels is memory protection via address spaces. The underlying hardware platform must provide a mechanism for enforcing isolation of user-level programs. The processor must have the ability to allow or disallow access to a certain area of the main memory. However, the protection mechanism is often combined with an address-translation mechanism (virtual memory). Figure 3.4 on page 21 pictures a generic interface for the combination of address-translation and protection mechanisms.
A complete documentation of the page-table interface is given in Section A.2.1 on page A-3.

The original IA-32 Fiasco used the Space class as interface for the hardware page-table structure. However, I designed the page-table interface to be completely independent of L4's address-space abstraction. The main reason for the decoupling is the problematic address-space switching on ARM, which is described in Section 2.2.2. The solution outlined in that section is based on a caching page table (CPD), which caches mappings from multiple address spaces.

Figure 3.3: Design of the Console-I/O Subsystem. The pictured class hierarchy has the abstract interface for character-based I/O devices as root. This interface is on the one hand implemented directly, for instance by Mux_console or Vga_console. On the other hand, it is refined for specific classes of hardware, such as Uart for serial UART devices.

Figure 3.4: Generic Page-Table Interface.

3.3.4 Standard C Library

The standard C library is one of the most basic components that are necessary for creating complex software in C or C++. Although it is possible to implement software without using a standard C library, it is very convenient to use its well-known abstractions.

There are several different C libraries freely available, but they are mostly based on an underlying operating system (OS), and this is problematic if the OS kernel itself is the target. Moreover, only a small subset of the features that current C libraries support is really useful for kernel and in particular microkernel implementation. The following features were identified as necessary for Fiasco's implementation.
• String handling functions, such as memcpy and its relatives
• Character-type functions as usually provided by ctype.h (mostly used in JDB)
• Assertions, which are widely used to catch error conditions that are caused by disregarded interface constraints
• Static construction and destruction (important for static C++ objects)
• setjmp and longjmp, which are used for in-kernel page-fault recovery
• Basic input and output functionality, for example printf and getchar

Other features, such as I/O via file handles or file-system access, are useless for Fiasco, because L4 microkernels do not know anything about abstractions like files.

At the beginning of my work, the main tree of Fiasco was built against the C library from the OSKit v0.6 (see [Group, 1999]), which is available only for IA-32. During the work for my term paper [Warg, 2002], I already implemented a minimal C library for the IA-64 port of Fiasco, which is based on the diet libc (see [von Leitner, 2003]). As the goal of this thesis is generally improved portability, a single C library that shares as much code as possible among the supported architectures had to be developed.

The starting point for the new C library was the one used for IA-64. This library was extended to the IA-32 and the ARM architecture, which was a minor effort, because only type and limit definitions and the longjmp implementation are architecture-specific.

The minimal C library is almost completely self-contained. The only exceptions are the console-I/O functions, which need a kind of back-end driver to pass the output to or read the input from. The remaining part of the library does not rely on any external functionality. The design target for the standard-I/O component was to have a simple and well-defined back-end interface, which makes it possible to reuse the library for almost any kind of low-level software. The idea is to provide a slim glue layer to bring the C library and an I/O driver together.
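Such a glue layer could look roughly like the following sketch. It assumes a callback that receives a whole string per invocation. All names (Write_backend, register_backend, mini_puts, capture_backend) are hypothetical and do not reproduce the interface of Fiasco's minimal C library, and std::string is used only to keep the demonstration back end short; a kernel back end would write to a device instead.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical back-end hook: consumes `len` characters starting at `s`
// and returns the number of characters actually written.
using Write_backend = std::size_t (*)(const char *s, std::size_t len);

static Write_backend current_backend = nullptr;

// Connects the library front end to an I/O driver.
void register_backend(Write_backend backend) { current_backend = backend; }

// Minimal output front end: hands the complete string to the back end
// in a single call. Returns 0 on success and -1 on failure.
int mini_puts(const char *s)
{
    if (!current_backend)
        return -1;                        // no driver attached
    std::size_t len = std::strlen(s);
    return current_backend(s, len) == len ? 0 : -1;
}

// Demonstration back end that records its input and counts invocations.
static std::string captured;
static unsigned backend_calls = 0;

std::size_t capture_backend(const char *s, std::size_t len)
{
    captured.append(s, len);
    ++backend_calls;
    return len;
}
```

After register_backend(capture_backend), a call such as mini_puts("hello") reaches the back end exactly once, regardless of the length of the string.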
C-Library Back End

There are two possible flavors for an I/O back end: character oriented or string oriented. The OSKit v0.6 C library uses a character-oriented back end. In other words, an output or input callback function is invoked for every single character. I considered the overhead that is introduced by a character-oriented interface to be too high. In particular, devices with hardware buffers that have to be flushed before the driver returns control to the C library suffer from this kind of interface. I finally preferred the string-oriented back end, because the character-oriented interface is only a special case thereof and the calling and flushing overhead is reduced from once per character to once per string.

3.4 L4-Specific Components

Section 3.3 focused on the general components that are very important for the design of maintainable software. This section is equally important, because it describes the design of the L4-specific components.

3.4.1 Basic L4 Abstractions

The fundamental abstractions of the L4 interface, threads and address spaces (see [Liedtke, 1996]), form the base of Fiasco's kernel design. Additional functionality to manage the hierarchical flex-page mapping relations is necessary to implement the unmap system call; this functionality is provided by the mapping database, which is already fully hidden behind an abstract interface. Moreover, the mapping database design and implementation itself is very complex (see [Grützmacher, 1998]), which caused me to leave it almost completely untouched. The only caveats are the restriction to a 4-GByte virtual address space and the lack of support for multiple page sizes; but it is beyond the scope of this diploma thesis to redesign the mapping database.

The overall design of the basic abstractions (threads and address spaces) is the one described in [Hohmuth, 2003a]. To satisfy the portability requirements, modules and submodules that contained architecture-dependent code are subdivided into further fragments. Well-defined hooks, which are declared in the generic parts, provide the connectivity between these newly created fragments. The concrete definition of the hardware-abstraction hooks is given in Section 3.4.2.

Figure 3.5: Hierarchy of the Thread Abstraction. This UML class diagram shows the portability-relevant parts of Fiasco's L4-thread design. The leftmost hierarchy was the initial design and is described in [Hohmuth, 2003a]. The interfaces of these classes are extended with only a few abstract operations that form the hooks into the architecture-specific parts. These architecture-specific parts (the shaded classes) implement the previously defined hooks in a machine-specific manner. The quite complex class structure results from the fully preemptable design of Fiasco. See Section 3.4.1 for a more detailed explanation.

Figure 3.5 and Figure 3.6 on the following page illustrate the class hierarchies of the thread and the address-space abstractions. The completely preemptable design of Fiasco causes these complex class hierarchies for the basic L4 abstractions. In fully preemptable software, the access to shared resources has to be synchronized. The implementation of Fiasco is based on lock-free and wait-free synchronization primitives (see [Hohmuth and Härtig, 2001]). The wait-free locking scheme uses locking with helping, to avoid priority inversion, and thus depends on switching to the lock holder's execution context. Some thread and address-space operations depend on the locking primitives, to protect shared data structures. These dependencies would create a cyclic graph, which is avoided by splitting the abstractions of threads and address spaces into two parts. As stated in Section 3.1, circular dependencies should be avoided completely, to maintain testability and portability.
The one (low-level) part does not allow any manipulation of shared data structures, but provides the means to switch among different execution contexts (Context and Space_context). The other (high-level) part encapsulates the manipulation operations (Thread and Space).

Figure 3.6: Hierarchy of the Space Abstraction. Analogous to Figure 3.5 on the preceding page, this class diagram shows Fiasco's design of the L4 address-space abstraction. Again, the two-level concept (Space_context and Space) results from the preemptability of Fiasco (see Section 3.4.1). The gray shaded classes are architecture-specific specializations or implementations of the generic interfaces.

3.4.2 Hardware Layer of L4's Basic Abstractions

The previous section described the general design issues of the basic L4 abstractions. In addition, the two first-class abstractions of L4 are coupled tightly with a kind of hardware context.

Threads

The thread abstraction provides an execution context, which runs on a certain CPU and can be preempted transparently. From L4's point of view, a thread is subject to a scheduling strategy and can be addressed for IPC. In relation to the underlying hardware, a thread is the execution context of the CPU. For switching among threads, the CPU state of the current thread must be saved and the state of the target thread must be restored. The state of a CPU is obviously completely dependent on the processor architecture, and there are no proper means in today's HLLs to express context switches, so the actual switching of the execution context has to be done in architecture-specific functions. The execution context of a thread is composed of the following parts:

• CPU context, which is the content of the registers and the processor status
• FPU context, which consists of the FPU registers and state (provided that the target architecture features an FPU)
• The address space in which the thread is executing

Figure 3.7: Platform Hooks of the Thread Abstraction. The pictured class diagram shows the class hierarchy that represents the polymorphic hardware abstraction of an L4 thread. In comparison to Figure 3.5 on page 23, this figure shows the details about the hooks that must be implemented for the different target architectures. As you can see, the class Context-ia32-ux has no implementation for switch_fpu; this method is implemented in two further specializations, Context-ia32 and Context-ux, which are omitted for lack of space.

The last of these parts is coupled directly with L4's address-space abstraction; I will treat it later on. The first two parts are encapsulated in two hooks in the Context class (see Figure 3.7). The hook switch_fpu must save the FPU state or prepare lazy saving mechanisms. After dealing with the FPU, the actual CPU context must be switched, which is done in the hook switch_cpu. The CPU switching hook must also call the function to switch the address space (call_switchin_context) immediately after switching to the target stack. This is necessary because, on thread creation, a newly created thread does not leave the switch_cpu function as usual, but drops directly into user_invoke, which manages the transition to user mode. For a detailed description of the hooks into the architecture-specific part see Section A.2.3 on page A-5 and Section A.2.4 on page A-6.

The FPU context is treated separately because not all platforms feature an FPU, whereas others have a very large FPU state and/or support lazy state handling. On all platforms that feature an FPU, the state is stored into a specifically allocated buffer, private to the thread. It is not necessary to save FPU state on the kernel stack, because the kernel never uses the FPU and thus no nested state saving is needed.

The hooks that are declared in the class Thread (see Figure 3.7) are used for thread creation.
Thread::init_arch is called from the generic constructor of Thread, and has to initialize architecture-specific thread state (e.g., specific bits in the thread's processor state). The function Thread::initialize virtually performs the lthread_ex_regs system call. This hook has a generic implementation that fits most architectures. Nevertheless, IA-32 is an exception, because it uses varying principles for kernel entries (int/iret and sysenter/sysexit) that need special handling in the case of lthread_ex_regs.

Address Spaces

Address-space switching, which is initiated by the call to call_switchin_context, follows a scheme similar to thread switching. A generic switching function (switchin_context) uses a hook into the architecture-specific part (make_current). The function make_current is responsible for switching the MMU to another address space and for performing the operations required to keep caches and TLBs consistent.

The architecture-specific implementations of L4's address-space abstraction become redundant once the low-level page-table interface is implemented on all architectures, as proposed in Section 3.3.3.

3.4.3 Exception Handling

In principle, exceptions must be handled according to their origin. Exceptions that originate from user applications, called user exceptions, are mostly passed back to user space. Version 2 and version X.2 of L4 use two different concepts for doing this. In version 2, page faults are delivered via IPC to a dedicated pager thread and all other exceptions are passed via native exception emulation, whereas version X.2 delivers all exceptions via IPC. The latter concept is much easier to implement in a generic fashion and is therefore preferred for ARM.

The other type of exception is that caused by the kernel itself. The code of the kernel is assumed to be correct and thus raises only a very limited number of exceptions at well-known locations. The kernel raises virtually only page faults.
Page faults can be raised during the long-IPC transmission; such page faults occur on behalf of the user application and can be counted among the user exceptions. They are passed to the pager thread as usual.

All current ports of Fiasco use lazy mechanisms for TCB allocation. These mechanisms are based on page faults in a special TCB area in the kernel address space. Such kernel page faults must be completely hidden from the user applications. Platforms that make use of kernel page faults must provide the corresponding handlers in the architecture-dependent part of the thread abstractions.

3.4.4 L4-ABI Abstractions

L4's application binary interface (ABI) can be divided into two basic parts, the ABI types and the system-call conventions.

Figure 3.8: Design of the L4 ABI Types. This class diagram shows the L4 ABI types and their specific implementations; all interface details are hidden. The boxes with italic titles are abstract interface definitions and the boxes with normal titles represent implementations. As you can see, some types have the interface definition separated from the implementation and others do not. The separation is always done when more than one implementation exists on the currently supported architectures and ABI versions.

ABI Types

The L4 specification defines a number of data types as part of the microkernel ABI. You can find this specification in [Liedtke, 1996]. The ABI types transfer almost the same content on any architecture and on the supported ABI versions. What makes them relevant for portability is the differing layout among different machines. Varying word widths and the possibility of highly efficient implementations, dedicated to the target architecture, account for the variation in the binary layout of these data types.
For example, the L4-UID type that is used on IA-32 was initially optimized to calculate the address of the TCB of the corresponding thread with only one 32-bit AND and one 32-bit OR operation.

To be able to implement generic system-call logic, this diversity in the data layout must be hidden behind generally defined interfaces. Figure 3.8 gives an overview of these ABI types. As illustrated there, some of the data types do not have any specialized implementation yet and others do. For instance, L4_msgdope has just one generic implementation, whereas L4_fpage has specializations for 32-bit architectures and the IA-32-specific I/O flex pages.

Figure 3.9: Encapsulated System-Call Parameters. The implementation of every system call uses its dedicated sys_XXX_frame interface to access the parameters. These interfaces are based on the ABI-type definitions presented in Figure 3.8 on the page before and kept completely generic with respect to the underlying architecture. The parameter encapsulations are even aware of the currently supported API versions (v2 and X.0).

Problem with C Bit fields

At first glance, C bit fields seem to be a perfect means to represent L4's ABI data types. However, there is a serious limitation regarding C bit fields: neither the C nor the C++ standard defines a concrete ordering of bit-field members. The result of this freedom is that the GNU compilers for big-endian and little-endian machines generate differing binary representations for the same bit-field declaration.

The implementations of Hazelnut and Pistachio use subtle preprocessor macros for defining bit fields in an endian-independent manner. I considered this solution infeasible, because the C standard also lacks a concrete definition for bit fields wider than an unsigned int, and a separate macro for each number of bit-field members is necessary. The implementation of Fiasco therefore does not use C bit fields for representing ABI data types.
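A representation that avoids bit fields can be built from explicit mask and shift operations, whose result does not depend on how a compiler orders bit-field members. The following sketch illustrates the technique only: the class Example_fpage and its field positions are hypothetical and do not reproduce the layout of any real L4 ABI type.

```cpp
#include <cstdint>

// Hypothetical 32-bit ABI word with two fields. The positions are chosen
// for illustration and do not correspond to a real L4 data type.
class Example_fpage
{
public:
    explicit constexpr Example_fpage(std::uint32_t raw = 0) : _raw(raw) {}

    // Read accessors: isolate the field with its mask, then shift it down.
    constexpr std::uint32_t size() const { return (_raw & Size_mask) >> Size_shift; }
    constexpr std::uint32_t page() const { return (_raw & Page_mask) >> Page_shift; }

    // Write accessors: clear the field, then insert the shifted new value.
    void size(std::uint32_t s)
    { _raw = (_raw & ~Size_mask) | ((s << Size_shift) & Size_mask); }

    void page(std::uint32_t p)
    { _raw = (_raw & ~Page_mask) | ((p << Page_shift) & Page_mask); }

    // Binary representation as it would be passed across the ABI.
    constexpr std::uint32_t raw() const { return _raw; }

private:
    enum : std::uint32_t
    {
        Size_shift = 2,  Size_mask = 0x3fu << Size_shift,     // bits 2..7
        Page_shift = 12, Page_mask = 0xfffffu << Page_shift,  // bits 12..31
    };

    std::uint32_t _raw;
};
```

Because the binary layout is fixed by the constants rather than by the compiler, a second specialization with different shift values could sit behind the same accessor interface.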
Instead, Fiasco uses C++ classes with access functions that use mask and shift operations, which are independent of the endianness and the compiler.

System-Call Conventions

Another completely architecture-specific part of the L4 specification is the system-call ABI, which is the mapping of the system-call parameters to the processor registers or to memory. Again, the obvious way to keep the system-call logic generic is to define an abstract interface for every L4 system call, which provides a means to access the parameters independently of the underlying architecture. I will refer to this layer of abstraction as the in-kernel system-call bindings.

Figure 3.9 gives an overview of the complete set of interface definitions. Figure 3.10 on the facing page shows the interface and the existing implementations for the id_nearest system call. The class hierarchy of the other system calls is analogous to that of id_nearest. The complete description of the interfaces of all in-kernel system-call bindings is located in Section A.2.6 on page A-7 and the following.

As a nice side effect, the encapsulation of the ABI types and of the system-call conventions made it possible to easily integrate support for the version-X.0 ABI into Fiasco. Only the L4_uid data type, the in-kernel system-call bindings, and the task_new system call had to be re-implemented.

Figure 3.10: Kernel Binding for the Id-Nearest System Call. The pictured inheritance tree is representative of all in-kernel system-call bindings. As can be seen, each supported architecture and API version has to provide its specific implementation.

3.4.5 Kernel Memory Management

To implement the basic abstractions of L4, dynamic memory management is necessary.
The kernel needs to allocate memory for the following objects:

• Thread control blocks (TCBs)
• Page tables and page directories
• Mapping nodes, which are an integral part of the mapping database
• FPU contexts, if the target machine features an FPU

The design of the memory management has to meet various demands. First, single memory pages at arbitrary virtual addresses must be allocated (page-level allocation). Secondly, the kernel needs to allocate pages at specific locations in virtual memory (e.g., for TCBs); this is called virtual-page allocation. Finally, the object allocation is responsible for allocating various memory objects of arbitrary size. Figure 3.11 illustrates the design of the memory-management subsystem. It shows the three aforementioned parts and their interaction. The overall design of the in-kernel memory management is based on the design of the IA-64 port of Fiasco, which is described in my term paper [Warg, 2002].

The trichotomy of the memory management follows from the logical separation of the different parts and from the avoidance of circular dependencies among the address-space and the allocator components. In a few words, the page-table implementation depends on page-level allocation for allocating the memory for page tables. Furthermore, the virtual-page allocation needs the page-table implementation to map the allocated pages to the requested virtual addresses. Thus, page-level and virtual-page allocation cannot be provided by one component. The third allocator is separate, because a completely different strategy is necessary to manage the allocation of many small and highly dynamic objects; in Fiasco, such allocation is basically done with slab allocators.

Figure 3.11: Model of the Memory-Management System. This figure shows the class structure of Fiasco's memory management.
The three main components of the memory management are Mapped_allocator for page-level allocation, Vmem_alloc for virtual-page allocation, and Slab_cache_anon for object allocation. The depicted classes do not show the complete interface definitions.

Page-Level Allocator

The page-level allocator provides the lowest level of memory allocation. All other allocators make use of this allocator either directly or indirectly. The provided interface consists of methods for allocating and freeing blocks of physically adjacent memory, which are mapped at arbitrary locations in the kernel's address space. Moreover, there is a method to request the physical address of a previously allocated block and a method to request the virtual address of a physical frame. The latter two methods require that the physical memory lies within the responsibility of the allocator.

In view of portability, this interface allows the implementation of an allocator for any model of physical memory. In particular, the model that is currently used on IA-32, in which the physical memory is linearly mapped into the kernel's address space as a whole, is easily implementable. Nevertheless, models in which user-level software has to supply the kernel with memory, as proposed by Andreas Haeberlen in [Haeberlen, 2003], can also be supported behind this interface.

Virtual-Page Allocator

The virtual-page allocator is not really an allocator that manages memory resources. Vmem_alloc uses the page-level allocator to actually allocate a block of memory at an arbitrary address and then uses the page-table interface (on IA-32 the page table is manipulated directly) to map the page to a requested address. Fiasco uses this mechanism to allocate TCBs and to allocate larger blocks of memory (see footnote 3) that do not have to be physically contiguous.
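The layering described above can be sketched as follows. This is a minimal model under stated assumptions, not Fiasco's code: the class names Example_page_alloc, Example_page_table and Example_vmem_alloc, the fixed pool of four pages, and the map-based page table are all hypothetical and merely make the dependency direction visible.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

typedef std::uintptr_t Address;
enum { Example_page_size = 4096 };

// Page-level allocation (hypothetical): hands out whole pages that live at
// arbitrary addresses, here simply taken from a small fixed pool.
class Example_page_alloc
{
public:
  Example_page_alloc() : _next(0) {}
  void *alloc() { return _next < Num_pages ? _pool[_next++] : nullptr; }

private:
  enum { Num_pages = 4 };
  unsigned char _pool[Num_pages][Example_page_size];
  unsigned _next;
};

// Page-table interface (hypothetical): records which backing page is
// visible at which virtual address. A real implementation would itself
// obtain the memory for its tables from the page-level allocator.
class Example_page_table
{
public:
  void insert(Address virt, void *page) { _map[virt] = page; }
  void *lookup(Address virt) const
  {
    std::map<Address, void *>::const_iterator it = _map.find(virt);
    return it == _map.end() ? nullptr : it->second;
  }

private:
  std::map<Address, void *> _map;
};

// Virtual-page allocation (hypothetical): owns no memory of its own. It
// takes a page from the page-level allocator and maps it where requested.
class Example_vmem_alloc
{
public:
  Example_vmem_alloc(Example_page_alloc &pages, Example_page_table &pt)
  : _pages(pages), _pt(pt) {}

  bool page_alloc(Address virt)
  {
    void *page = _pages.alloc();
    if (!page)
      return false;
    _pt.insert(virt, page);
    return true;
  }

private:
  Example_page_alloc &_pages;
  Example_page_table &_pt;
};
```

The virtual-page allocator needs both other components, while the page-level allocator needs neither, which is why the two kinds of allocation cannot be merged into one component without creating a circular dependency.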
The extra implementation for IA-32, which manipulates the page-table structures directly, should be removed in favor of a generic implementation once the low-level page-table interface is implemented on IA-32.

3.5 New JDB Design

This section does not describe the design of a completely functional kernel debugger. The design is restricted to the basic debugger core and has the following goals:

• Portability (architecture independence)
• Modularity
• Easy extensibility
• Look and feel of the original JDB
• Coexistence with the original JDB

The last two points are very important. The look and feel should remain familiar to the people who are accustomed to the kernel debugger. However, the more important goal is a smooth transition from the old kernel debugger to the new modularized and portable debugger. This can only be achieved by a temporary coexistence, because the old kernel debugger provides features for kernel and user-level development that cannot be ported in one piece. The current IA-32 implementation accomplishes the coexistence by integrating the Jdb_core methods Jdb_core::has_cmd and Jdb_core::exec_cmd into the traditional event loop.

The class diagram in Figure 3.12 pictures the design of the basic JDB abstractions. The small and architecture-independent core (Jdb_core) provides the command-execution and input-parsing environment. The debugger modules provide the actual debugger functionality. These modules must be derived from the class Jdb_module and are not required to deal directly with console input or input-parameter parsing. All debugger modules register themselves with the debugger core by means of static constructors (see Section 4.2.2). A debugger module must provide at least one command (Jdb_module::Cmd) and is a member of exactly one category (Jdb_category). The category is used to structure the debugger help screen.

The actual improvement of portability is achieved through the modular design.
This design allows the programmer to port just the debugger functionality that he currently needs, and eases the implementation of machine-specific extensions without interfering with other architectures. Figure 3.13 shows exemplary subclasses of Jdb_module, which are actually implemented in Fiasco.

Footnote 3: i.e., larger than one page.

Figure 3.12: Class Structure of the New JDB.

Figure 3.13: Exemplary Subclasses of Jdb_module. The pictured class graph shows existing JDB modules.

3.6 StrongARM Specific Design

This section contains very specific design issues that can already be regarded as a kind of implementation of the generic design described before. However, it is a part of the design that was made in order to implement the ARM port of Fiasco and is therefore located here.

3.6.1 Exception Handling

This section builds upon the discussion in Section 3.1 of [Wiggins, 1999]. The discussion is concerned with the possibilities for handling exceptions and the usage of the privileged modes. In summary, the following two alternatives are evaluated:

• Handling of all exceptions in a single mode of operation (see Section 2.1.1)
• Handling of exceptions in the operating mode corresponding to the exception

The first alternative comes with additional overhead for saving the banked registers (see Section 2.1.1) and switching to the final operating mode. The latter alternative implies that an un-banked register is used as stack pointer, to avoid problems with in-kernel exceptions. Assume that the processor is already in a privileged mode (e.g., the Abort mode) and an event, for example an IRQ, causes a mode switch: the execution then continues with a different stack pointer and not, as intended, on the thread's kernel stack. A seemingly simple solution is to load the mode-private stack pointer, immediately after entering the different operating mode, with the value of the stack pointer of the previous mode.
The problem is that the banked registers of other modes are usually not accessible without an explicit switch to the corresponding mode.

I decided to use a single operating mode for all exceptions, because the compilers used do not support the use of an un-banked register as stack pointer and a reimplementation in assembly language is out of the question. The decision seems to be contrary to that of Adam Wiggins in [Wiggins, 1999], but Gauntlet was implemented completely in assembly language and therefore could use an un-banked register as stack pointer and execute in the operating mode corresponding to the raised exception (see Section 2.1.1).

Which operating mode is the best for kernel execution? System calls, which are triggered by well-defined prefetch aborts, as well as normal page faults switch automatically to the Abort mode. The Abort mode should be the choice, because no extra mode switches are necessary for these performance-critical kernel entries. The disadvantage of the Abort mode is that it introduces a bit of additional overhead for IRQs and FIQs, which could also be considered critical: an extra switch from the IRQ or the FIQ mode, respectively, to the Abort mode must be executed, and FIQ handlers cannot use the extra five registers (see Table 2.1).

    0xFFFFFFFF
                  OS Linear-Map Area
    0xF0000000
                  OS Virtual Memory Area
    0xE0000000
                  OS TCB Area
    0xD0000000
                  User-Mode Address Space
    0x00000000

Figure 3.14: StrongARM Address-Space Layout. The diagram represents the address-space layout as it is currently used on the StrongARM processors. In comparison to the IA-32 address space, the user-kernel boundary is moved up to 0xD0000000, because StrongARM has its physical RAM at 0xC0000000. This restriction is not problematic, because the ARM implementation uses version X.0 UIDs, which allow only 2048 tasks with 64 threads each. With a TCB size of 2 KByte, the TCBs fit into the 256 MByte OS TCB Area.
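The sizing claim in the caption of Figure 3.14 can be checked with a few lines of arithmetic. The constants below are taken from that caption; the function tcb_address and its indexing scheme (task number times threads per task, plus thread number) are an assumption for illustration and not necessarily how Fiasco derives a TCB address from a UID.

```cpp
#include <cassert>
#include <cstdint>

// Values from the caption of Figure 3.14.
constexpr std::uint32_t Tcb_area_start   = 0xD0000000u;
constexpr std::uint32_t Tcb_area_end     = 0xE0000000u; // start of the next area
constexpr std::uint32_t Max_tasks        = 2048;
constexpr std::uint32_t Threads_per_task = 64;
constexpr std::uint32_t Tcb_size         = 2048;        // 2 KByte

// 2^11 tasks * 2^6 threads * 2^11 bytes = 2^28 bytes = 256 MByte,
// which is exactly the size of the OS TCB area.
static_assert(Max_tasks * Threads_per_task * Tcb_size
              == Tcb_area_end - Tcb_area_start,
              "all TCBs must fit exactly into the TCB area");

// Hypothetical indexing scheme, assumed only for this example.
constexpr std::uint32_t tcb_address(std::uint32_t task, std::uint32_t thread)
{
  return Tcb_area_start + (task * Threads_per_task + thread) * Tcb_size;
}
```

Under this assumed scheme the first TCB lies at 0xD0000000 and the last one ends exactly at 0xE0000000, so the TCB area does not reach into the OS virtual-memory area.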
3.6.2 ARM Kernel Address Space

The layout of the kernel address space on StrongARM is slightly different from that of other 32-bit architectures. Usually the uppermost gigabyte of the address space is allocated for the kernel address space. The StrongARM architecture defines a physical memory map where DRAM is located in a partition above 0xC0000000 (3 GByte) in the physical address space. Hence, Sigma0, which is specified to run in one-to-one mapped memory, must execute above 3 GByte. To make this possible, I moved the user-kernel boundary up to 0xD0000000. If the device features more than 256 MByte of memory, or if the memory is scattered over the uppermost gigabyte, only the RAM from 0xC0000000 up to 0xD0000000 can be mapped one-to-one; the applications must map the remaining RAM to lower addresses in order to access it. Figure 3.14 shows the complete address-space layout that I have implemented on StrongARM.

Chapter 4 Implementation

In the design chapter, I largely focused on the software-technological point of view. Nevertheless, in the case of a microkernel such as Fiasco, it is not sufficient to have a nice object-oriented design (OOD). One of the perpetual goals is to achieve very high performance. Thus, this chapter does not only contain some war stories about the fiddly implementation process; it also describes the basic concepts of Fiasco's fast polymorphic implementation.

4.1 Polymorphism

Some members of the OS community staunchly defend their opinion that C++ results in code that is too slow and is therefore not feasible for OS programming. The same people often use C to implement object-oriented code and use structures with function pointers as a means to obtain polymorphic classes. Examples are the fops structure in the Linux kernel as well as the kdb_console_t structure in Pistachio's kernel debugger. Fiasco, as the first HLL (C++) implementation of the L4 microkernel interface, has proven the contrary.
Even the faster but not real-time-capable L4 implementations from the University of Karlsruhe (Hazelnut and Pistachio) are implemented in C++.

In spite of the fact that C++ is successfully used for the implementation of operating systems, polymorphism is a red rag for most microkernel people. This point of view originates from the additional level of indirection that all known C++ compilers introduce to implement virtual functions. However, polymorphism comes in two forms: runtime polymorphism and compile-time polymorphism.

Runtime polymorphism allows different implementations of a certain interface to be exchanged at runtime. This is useful when objects with specific implementations of the same interface coexist in a running system. The commonly used way to implement such runtime polymorphism is to use an extra level of indirection. This level can be created explicitly, by the use of function pointers, or implicitly, through late binding, by marking a C++ member function as virtual. If runtime polymorphism is wanted, the extra level of indirection is accepted.

On the other hand, the additional overhead is not and cannot be accepted if only one implementation of an interface is used at runtime. In this case, the polymorphism is only a software-technological means. In the design of Fiasco, the complete set of architecture-specific subclasses belongs to this category of polymorphism, which can be resolved at compile time. This compile-time polymorphism does not need an additional level of indirection for method invocation.
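The difference between the two forms can be made concrete with a small sketch. All names (Timer_if, Periodic_timer, Timer, EXAMPLE_ARCH_ARM) and the returned numbers are hypothetical, and the preprocessor conditional merely stands in for whatever build-time selection mechanism is actually used.

```cpp
#include <cassert>

// (1) Runtime polymorphism: the call is dispatched through a virtual
// function, so several implementations can coexist in one running system.
struct Timer_if
{
  virtual ~Timer_if() {}
  virtual unsigned resolution_us() const = 0;
};

struct Periodic_timer : Timer_if
{
  unsigned resolution_us() const override { return 1000; } // arbitrary value
};

unsigned query_runtime(Timer_if const &t)
{ return t.resolution_us(); }   // indirect call via the object's vtable

// (2) Compile-time polymorphism: one interface declaration, and exactly one
// definition is selected when the kernel is built. The call can be inlined.
class Timer
{
public:
  static unsigned resolution_us();
};

#if defined(EXAMPLE_ARCH_ARM)   // hypothetical configuration macro
inline unsigned Timer::resolution_us() { return 1; }
#else
inline unsigned Timer::resolution_us() { return 1000; }
#endif

unsigned query_static()
{ return Timer::resolution_us(); }   // direct call, no indirection
```

In the second variant the generic caller is written once against the Timer interface, yet the compiler sees a single concrete definition and is free to inline it, which is the property the fine-grained encapsulation in a microkernel depends on.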
The obvious possibilities to resolve the aforementioned problem are:

• Definition of interfaces in C++ include files and multiple implementations in different C++ source files
• Using a C++ compiler that recognizes compile-time polymorphism automatically (see footnote 1)
• Using the C++ template mechanism
• A language extension for explicitly specifying compile-time polymorphism

The first solution works perfectly as long as no inline functions are used and the derived classes do not extend the interface. The latter means that specific classes that are derived from a certain super class are not able to define extra operations or attributes. The problem with inlining is that C++ inline functions must in practice be implemented in the header file where they are declared, because the compiler needs the source code of an inline function at every invocation of it. Furthermore, it is not feasible to use non-inline functions, because the introduced overhead cannot be tolerated for the fine-grained encapsulation that is presently used in Fiasco.

The second solution is not possible with the available C++ compilers; I am not aware of any compiler that performs such an optimization. Furthermore, it is not trivial to detect compile-time polymorphism automatically. The compiler has to do a complex data-flow analysis on the complete source code. In the general case, it is not sufficient to work on a per-file basis, as is usually done. After all, developing such a compiler is clearly beyond the scope of this work.

The template solution is a possible way to implement compile-time polymorphism. Andrei Alexandrescu describes the principles of this solution in [Alexandrescu, 2001] under the term Policy-Based Class Design. With this approach, generic code is located in class templates and the machine-specific components are defined as template parameters. With respect to the existing source code of Fiasco, this alternative requires major restructuring.
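A minimal sketch of the template approach may help to see what such a restructuring would involve. The two frame types and the generic Example_syscall template are hypothetical; in particular, the register names and the trivial "logic" are assumptions chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical machine-specific policy A: parameters live in registers.
struct Example_frame_a
{
  std::uint32_t r0, r1;
  std::uint32_t dest() const { return r1; }
  void result(std::uint32_t v) { r0 = v; }
};

// Hypothetical machine-specific policy B: parameters live in a memory block.
struct Example_frame_b
{
  std::uint32_t words[2];
  std::uint32_t dest() const { return words[0]; }
  void result(std::uint32_t v) { words[1] = v; }
};

// Generic code, written once as a class template. The machine-specific
// part is supplied as the template parameter (the "policy").
template <typename Frame>
struct Example_syscall
{
  static void run(Frame &f)
  {
    // Placeholder for the architecture-independent system-call logic.
    f.result(f.dest() + 1);
  }
};
```

The generic logic never names a register, and the calls to dest() and result() are resolved at compile time. The price is that every piece of generic code has to become a template, which is the major restructuring mentioned above.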
Due to the difficulties just mentioned, the last solution remains as a viable option. Even this way requires an extra processing step on the source code, but expensive auto-recognition of compile-time polymorphism is not necessary. Another thing that led me to the explicit solution was Preprocess (see [Hohmuth, 2003b]), the special C++ preprocessor that is already used for the implementation of Fiasco. Only minor extensions to Preprocess were necessary to support explicit compile-time polymorphism. Program 4.1 and Program 4.2 show an example of how compile-time polymorphism can be implemented with the help of Preprocess.

Footnote 1: Uwe Dannowski from the University of Karlsruhe is currently working on that topic.

Program 4.1: Explicit Compile-Time Polymorphism (Interface). This piece of code looks like a normal C++ class declaration. The notable part is the Preprocess directive 'INTERFACE:', which instructs Preprocess to handle the subsequent code as an interface definition. Program 4.2 shows the counterpart to the interface section, the implementation section. The two sections can also be placed in a single file, but for the realization of compile-time polymorphism they have to be put into separate files.

    file: 'entry_frame.cpp'

    INTERFACE:
    ...
    class Sys_id_nearest_frame : public Syscall_frame
    {
    public:
      L4_uid dest() const;
      void type( Mword type );
      void nearest( L4_uid id );
    };

4.2 Bootstrap

The boot procedure of Fiasco is subdivided into two major stages. The first stage runs in an environment that is provided by the platform's boot loader. The second stage is the actual kernel bootstrap itself.

4.2.1 Stage 1, Boot Subsystem

This boot stage is bound very tightly to the underlying platform and the boot loader used. Thus, it is not really possible to provide a platform- and architecture-independent Boot subsystem. The purpose of the Boot subsystem is to set up the CPU and the MMU for the execution of the kernel.
The main steps that are necessary for initializing the kernel's execution environment are the following:

• Loading of the actual kernel
• Enabling the MMU and populating the appropriate mappings
• Transferring control to the kernel code

Depending on the underlying platform, the loading of the kernel can also be done after enabling virtual memory. The only part that is shared among the supported architectures is an integrated ELF interpreter, which can handle statically linked ELF images. The current versions of Fiasco embed the kernel's ELF image into the actual bootable image. The ARM boot image also contains additional images for user-level applications, such as Sigma0 and the root task.

Program 4.2: Explicit Compile-Time Polymorphism (Implementation). This code snippet presents one possible implementation of the abstract interface shown in Program 4.1. The preprocessor combines the two files at compile time; thus, another implementation can be selected by choosing another implementation file at the invocation of Preprocess. The key term 'IMPLEMENTATION[ext]:' starts the implementation section of the submodule with the given extension. The implementations of the methods are marked with the keyword 'IMPLEMENT', which implies that the respective interface is already declared elsewhere.

    file: 'entry_frame-ia32-ux-x0.cpp'

    IMPLEMENTATION[ia32-ux-x0]:
    ...
    IMPLEMENT inline
    L4_uid Sys_id_nearest_frame::dest() const
    { return L4_uid( esi ); }

    IMPLEMENT inline
    void Sys_id_nearest_frame::type( Mword type )
    { eax = type; }

    IMPLEMENT inline
    void Sys_id_nearest_frame::nearest( L4_uid id )
    { esi = id.raw(); }

4.2.2 Stage 2, In-Kernel Bootstrap

The in-kernel bootstrap of Fiasco is based on two mechanisms: static constructors and the actual main function. Static constructors are used because C++ initializes global objects automatically through static constructors.
With respect to portability, the mechanism of static constructors provides a perfect means for modularizing the kernel even though the modules need to be initialized at boot-up.

Static Constructors

A static constructor in my terminology is a trivial C function that initializes a statically defined data structure, such as a global object. The static constructors have to be executed before the actual program enters the main function, because statically defined objects must already be in an initialized state at this point of execution. The nature of statically defined objects is that their definitions may be distributed among different source-code files and therefore among the created object files (linker units), which implies that only the linker knows the complete list of all static constructors that have to be executed.

If the statically defined object is a C++ object, the compiler puts a pointer to the object's constructor into the appropriate list in the object file. The linker finally combines the constructor lists of the different object files. However, there are C data structures and hardware objects that are not directly supported by the C++ compiler. To have a means for initializing these objects without the knowledge of the main program, static_init.h defines four macros that allow the use of normal C functions as static constructors. These static-init macros put a pointer to the constructor function into the appropriate list in the object file. Program 4.3 and Program 4.4 show examples of the definition of static constructors as they are used in Fiasco.

There are actually two flavors of static constructors: constructors that have to be executed in a defined order due to dependencies among the initialized objects, and constructors of objects that have no dependencies among each other. The former kind will be called prioritized static constructors, and the latter kind simple static constructors.
The only difference in the declaration is the additional priority (Program 4.4).

Program 4.3: Simple Static Constructors Example. The three code snippets depict the principles that Fiasco uses to declare static constructors for certain objects. Snippet (a) is a statically defined instance of a normal C++ object. Snippet (b) shows the method of choice for initializing singleton objects. In Snippet (c) a normal C function is marked as static constructor. Each of these variants also supports an additional priority, as in Program 4.4.

    (a) Global C++ Object

    class C {
    public:
      C();
    };

    C::C() {
      /* do some init stuff */
    }
    ...
    C object_of_c;

    (b) Singleton Global C++ Object

    #include "static_init.h"

    class C {
    public:
      static void init();
    };

    void C::init() {
      /* do some init stuff */
    }
    ...
    /* mark C::init() as static constructor */
    STATIC_INITIALIZE(C);

    (c) C Function

    #include "static_init.h"

    void constructor_func() {
      /* do the work */
    }
    ...
    STATIC_INITIALIZER(constructor_func);

Startup Constructor

One of the integral parts of the boot sequence is the static constructor startup_system. This constructor is defined with the highest possible priority and therefore is the very first function that is executed during boot-up. The job of startup_system is to initialize all mandatory components of the microkernel, that is, those that cannot be configured out. For instance, the memory management, IRQs, and the interval timer are components that are initialized explicitly in startup_system. These mandatory components often have dependencies among each other and therefore depend on their initialization order.

Each of these components could also declare its own static constructor with the appropriate priority to satisfy the dependencies among them, but this would be much more complicated to understand, and it is not necessary, because the components are mandatory parts of the microkernel and well known at programming time.
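One way in which ordinary functions can be turned into prioritized static constructors is sketched below. This is not the content of static_init.h, which is not reproduced here. The macros EXAMPLE_STATIC_INITIALIZER and EXAMPLE_STATIC_INITIALIZER_P are hypothetical and rely on the GNU-specific constructor function attribute, an assumption that fits the GNU toolchain but is not portable C++.

```cpp
#include <cassert>

// Hypothetical macros (NOT the ones from static_init.h). They wrap a plain
// function in a GNU "constructor" function, which the runtime calls before
// main(). Lower priority numbers run first; the range 0..100 is reserved.
#define EXAMPLE_STATIC_INITIALIZER(func) \
  static void __attribute__((constructor)) example_ctor_##func() { func(); }

#define EXAMPLE_STATIC_INITIALIZER_P(func, prio) \
  static void __attribute__((constructor(prio))) example_ctor_##func() { func(); }

// Zero-initialized log of the order in which the initializers ran. Plain
// arrays are used so that the log itself needs no dynamic initialization.
static int init_log[4];
static int init_count;

static void init_memory()   { init_log[init_count++] = 1; }
static void init_timer()    { init_log[init_count++] = 2; }
static void init_debugger() { init_log[init_count++] = 3; }

// Deliberately listed out of order: the priorities decide, not the
// position in the source file.
EXAMPLE_STATIC_INITIALIZER_P(init_timer, 102)
EXAMPLE_STATIC_INITIALIZER_P(init_memory, 101)
EXAMPLE_STATIC_INITIALIZER(init_debugger)   // no priority: runs after the others
```

When main() is entered, the log holds 1, 2, 3: the memory initializer ran first, the timer second, and the unprioritized debugger initializer last. None of the three functions is referenced from main, which is what allows a module to be added or configured out without touching the startup code.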
Program 4.4: Prioritized Static Constructors Example. These examples work almost the same way as those in Program 4.3. The only difference is the priority X, which is specified additionally.

    (a) Global C++ Object

    #include "static_init.h"

    class C {
    public:
      C();
    };

    C::C() {
      /* do some init stuff */
    }
    ...
    C object_of_c INIT_PRIORITY(X);

    (b) Singleton Global C++ Object

    #include "static_init.h"

    class C {
    public:
      static void init();
    };

    void C::init() {
      /* do some init stuff */
    }
    ...
    /* mark C::init() as static constructor */
    STATIC_INITIALIZE_P(C,X);

    (c) C Function

    #include "static_init.h"

    void constructor_func() {
      /* do the work */
    }
    ...
    STATIC_INITIALIZER_P(constructor_func,X);

4.3 The Build System

This section gives an overview of the build system. The build system is based on GNU Make. A big part of the Makefile logic is actually necessary for creating the source-file dependencies; Make uses these dependencies to rebuild only the minimal set of files. Concerning portability, the part that is responsible for the architecture-dependent build process is of interest.

There are actually two parts that control the architecture-specific build. The first part is located in files named Makeconf.<arch> and defines tools, such as compilers, assembler, and linker. The second part, the Modules File (Modules.<arch>, see src/README in the Fiasco source directory), defines the subsystems and their exact composition. The two aforementioned files exist for each supported target architecture. Program 4.5 shows an example excerpt from the IA-32 Modules File.

The final linking of each subsystem is defined in files named Makerules.<subsystem>. If architecture-specific rules are necessary for the final linking, the Makerules.<subsystem> file must include a Makerules file from the directory of the appropriate architecture.
The target architecture, and hence the Modules File used, is determined by the configuration variable CONFIG_XARCH. All configuration variables are defined in the file globalconfig.out. This file can be modified either by hand or via the interactive configuration tool, which handles dependencies between the configuration options and keeps the file consistent.

Program 4.5: Example Modules File. This excerpt from the IA-32 Modules File shows the declaration of the subsystems that are built for IA-32 and two examples of the subsystem composition. The variable SUBSYSTEMS contains the list of subsystems. The build target for every subsystem must be assigned to a variable named <subsystem>. The modules that constitute a subsystem have to be assigned to INTERFACES_<subsystem>. If a specific module is composed of different submodules, the list of submodules must be assigned to <module>_IMPL.

    SUBSYSTEMS = JABI ABI DRIVERS KERNEL CRT0 BOOT LIBK LIBAMM \
                 LIBLMM CHECKSUM CXXLIB MINILIBC LIBKERN TCBOFFSET
    ...
    #
    # ABI Subsystem
    #
    ABI                := libabi.a
    VPATH              += abi/$(CONFIG_XARCH) abi
    INTERFACES_ABI     := l4_types kip
    l4_types_IMPL      := l4_types l4_types-$(CONFIG_ABI) \
                          l4_types-32bit l4_types-iofp
    kip_IMPL           := kip kip-ia32
    ...
    #
    # DRIVERS subsystem
    #
    DRIVERS            := libdrivers.a libgluedriverslibc.a
    VPATH              += drivers/$(CONFIG_XARCH) drivers
    PRIVATE_INCDIR     += drivers/$(CONFIG_XARCH) drivers
    INTERFACES_DRIVERS := mux_console console filter_console \
                          keyb io uart vga_console
    uart_IMPL          := uart uart-16550
    keyb_IMPL          := keyb keyb-pc
    io_IMPL            := io io-ia32
    CXXSRC_DRIVERS     := glue_libc.cc

[Figure 4.1 (diagram): a physical frame F is mapped at two virtual addresses, (1) and (2), one in the OS TCB and virtual-memory area and one in the OS linear-map area; each mapping occupies its own cache line, (a) and (b).]

Figure 4.1: Memory Aliasing Example. This figure shows how the contents of the physical frame F may reside in multiple cache lines at the same time, provided that the frame is aliased at different virtual addresses.
4.4 StrongARM Implementation Details

This section covers the details of the StrongARM implementation. The focus is on the memory-system architecture, which brought up tricky issues.

Page Tables and Fully Virtual Caches

Hardware-walked page tables in conjunction with fully virtual caches yield a major problem. Modifications of page tables must be forced out to main memory before any access to the modified virtual memory regions. In particular, this means that modifications of the currently active page table must be written back to memory instantly. Page tables of inactive tasks are written back implicitly; on StrongARM, caches must be written back and invalidated completely before switching over to another page table, because the caches do not support ASIDs and are tagged and indexed with virtual addresses.

Fiasco uses an ARM cache-control instruction in its page-table implementation to write back the single cache line that contains the modified page-table entry. This instruction avoids the unnecessary pollution of the data cache that occurs if the instruction sequence for a complete cache write-back is used.

Aliasing and Fully Virtual Caches

The ARM reference manual says that a physical page must not be mapped at more than one virtual address at the same time (aliased), unless the caches and the write buffer are disabled for this page. Figure 4.1 depicts how aliasing in combination with fully virtual caches results in inconsistent memory contents.

Fiasco and Aliasing

On the one hand, Fiasco uses lazy mechanisms for TCB allocation that are based on aliasing of memory pages. On the other hand, programmers did not care about aliasing, because it was not an issue on supported architectures other than StrongARM.
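The inconsistency that aliasing causes with a virtually tagged write-back cache can be reproduced in a toy model. The class Toy_virtual_cache below is purely illustrative: it models one physical word and one cache line per virtual address, and it is an assumption-laden simplification, not a description of the StrongARM cache hardware.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Toy model: one physical word, cached under virtual-address tags. Since
// the tag is the virtual address, two aliases get two independent lines.
class Toy_virtual_cache
{
  struct Line { int value; bool dirty; };

public:
  explicit Toy_virtual_cache(int &phys_word) : _mem(phys_word) {}

  // A write allocates or updates the line for this alias and marks it dirty.
  void write(unsigned va, int v)
  {
    Line l = { v, true };
    _lines[va] = l;
  }

  // A read hits the alias's own line, or fills it from memory on a miss.
  int read(unsigned va)
  {
    std::map<unsigned, Line>::iterator it = _lines.find(va);
    if (it == _lines.end())
      {
        Line l = { _mem, false };
        it = _lines.insert(std::make_pair(va, l)).first;
      }
    return it->second.value;
  }

  // Write the line back if it is dirty, then drop it from the cache.
  void write_back_and_invalidate(unsigned va)
  {
    std::map<unsigned, Line>::iterator it = _lines.find(va);
    if (it == _lines.end())
      return;
    if (it->second.dirty)
      _mem = it->second.value;
    _lines.erase(it);
  }

private:
  int &_mem;
  std::map<unsigned, Line> _lines;
};

// Two hypothetical virtual addresses that alias the same physical frame.
int lost_update_demo()
{
  int frame = 0;
  Toy_virtual_cache cache(frame);
  unsigned const va_a = 0x1000, va_b = 0x2000;

  cache.write(va_a, 1);                  // older data, written via alias a
  cache.write(va_b, 2);                  // newer data, written via alias b
  cache.write_back_and_invalidate(va_b); // memory now holds the newer value
  cache.write_back_and_invalidate(va_a); // the older value overwrites it
  return frame;
}
```

In this model, lost_update_demo returns the older value: whichever alias is written back last determines the final contents of the frame, regardless of which write happened last. Invalidating the stale line without writing it back, before the second alias is used, is what avoids the loss.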
The lazy memory allocation uses a read-only zero page, which is mapped into the TCB area when a read instruction accesses a TCB for the first time; the zero page may be mapped multiple times into the TCB area if multiple TCBs have been accessed by read instructions. Nevertheless, the aliasing of the zero page is harmless, because the zero page is always mapped read-only and hence cannot cause inconsistent views of the main memory. In terms of Figure 4.1, cache line a and cache line b always contain the same value and never become dirty, because write access is denied.

A major problem arose from aliases that exist because implementers did not pay attention to them. The virtual-page allocator uses Kmem_alloc to allocate a page at an arbitrary virtual address and maps it to another virtual address. Thus, Vmem_alloc creates an alias for the allocated page. The cache contents may become inconsistent if the first few bytes of the newly allocated page are modified. In terms of Figure 4.1, cache line a contains stale allocator data, and cache line b contains up-to-date TCB data. If first cache line b and then cache line a is written back, the next read operation will see the stale data, whereas the actual TCB data is lost.

The current solution is to flush the cache after the re-mapping operation in Vmem_alloc, which removes the stale cache contents before any access to the remapped page can occur. A better solution would be to use cache-control instructions to invalidate only the cache location with the stale allocator data. However, the latter solution requires certain changes in the implementation of the page-level allocator.

Chapter 5 Future Work

5.1 General Issues

There are still open topics that affect all target architectures.
An important issue is the implementation of the generic page-table interface for all architectures, because this would remove further duplicated code: the L4 address-space abstractions could be implemented in a generic manner, and the virtual-page allocator, Vmem_alloc (see Section 3.4.5 on page 29), could get rid of the special IA-32 implementation, which manipulates page-table structures directly.

Currently no clean and flexible protocol exists to hand the machine's physical memory layout into the kernel (at least in version 2 and X.0), which is the cause of extra code in Kmem_alloc. The main difference among the implementations of the page-level allocator is the initialization routine that claims the memory for all kernel allocators. The page-level allocator needs to know where free memory is located and must mark the claimed memory as used, in order to advise Sigma0 to keep the memory reserved for the kernel.

Another open issue is a mapping database that supports multiple page sizes (even more than two page sizes) as well as large and sparse address spaces, in particular address spaces larger than 4 GByte. The support of arbitrary page sizes could increase the TLB coverage on systems that feature multiple page sizes in their TLB. In particular, applications on IA-64 and ARM could gain performance.

An interesting topic is the system timer. One-shot timers, which are also known as aperiodic timers, can help to substantially decrease the frequency of timer interrupts while maintaining timer accuracy. On architectures that provide fast and efficient programming of timer events (e.g., StrongARM), this could improve overall performance. On architectures that suffer from inefficient timer programming, the abstraction should be easily implementable with periodic timers.

In addition to the aforementioned topics, which are mostly about functionality, there is an open issue with the structure of Fiasco's source code. The refactored source code is scattered over many small files, which often leads to confusion while navigating through the code. Many functions have different implementations corresponding to different concerns, and hence finding the relevant piece of code may be quite difficult. There are possible solutions for this problem that should be evaluated:

• A directory structure in which the architectures are the first level and the subsystems the second level
• Further refactoring of big classes, such as Thread, into smaller subclasses
• Symbolic links in the build directory that point to the source directories of the current architecture

These solutions are not mutually exclusive; they should rather be combined to achieve better source-code properties.

5.2 StrongARM Topics

There are several features of the StrongARM architecture that are currently unused. Fast address-space switches, as explained in Section 2.2.2 on page 10, should be implemented, which would improve the application performance substantially. The use of the minicache and the read buffer, which are available on StrongARM, should further reduce the influence of the microkernel on the cache-miss rate of the user-level applications. For instance, the TCBs and the page tables could be cached within the minicache.

StrongARM also provides an efficiently programmable timer circuit. The timer events are determined by the contents of a few memory-mapped registers. The use of one-shot timers for scheduling and timeout events should be evaluated with respect to the interrupt frequencies and the user-level application performance.

An open issue of the StrongARM port is the implementation of the exception-delivery mechanism (see Section 3.4.3). I would prefer a method similar to that used in version X.2, where all exceptions are delivered via IPC to a dedicated exception-handler thread.

A very big topic is the complete user-level environment. Currently there is no appropriate C library for ARM user-level applications.
Hence, at the moment there is only a Sigma0 and a simple test server running. Both are linked against the MiniLibC from the kernel and are built directly with the kernel.

The lack of time and a misbehaving C++ compiler prevented me from measuring IPC performance. The user-level test program crashed with strange memory accesses, which turned out to come from bogus register contents. The compiler did not save allocated registers before system calls, even though all registers are in the clobber list of the assembler statement for the system call.

Another field of open work is the unit and functional testing of the ARM components. The architecture-independent part of the kernel can be tested on any architecture; the unit tests that already exist are suitable for that. However, a set of low-level tests should be implemented to verify, for example, the implementation of the page-table interface or of the atomic operations.

Chapter 6 Summary

This chapter gives a brief summary of what was achieved during my work. First, I treat the software-technological changes. The second part deals with the ARM-related topics.

The most important step towards more portability was the refactoring of the source code into completely generic units and machine- or hardware-dependent units. The refactoring is based on class definitions with generic interfaces and implementations. The generic implementations make use of generic interfaces that are specialized by compile-time polymorphic inheritance. This kind of polymorphism can be completely resolved at compile time and introduces no runtime overhead.

It turned out that Preprocess is a lot more versatile than thought. It helped to modularize completely orthogonal aspects of Fiasco. The resulting design goes into the direction of ASOD, which may be worth considering in the future (see [Coady et al., 2001]).
An open issue with respect to ASOD is the tool support and the complexity of the description for intermixing the different aspects.

The build system, as used now, supports the composition of modules from different submodules and uses Preprocess (already used before my work) for resolving the compile-time polymorphism. A file per target architecture, which can internally depend on further configuration options, determines the assembly of the final image.

Another important property of the source code is the absence of circular dependencies among the different modules. This property is the basis for unit testing, which is very helpful for maintaining functionality while changing the implementation behind the interfaces, and it simplifies the porting work. In the absence of circular dependencies, the modules and their dependencies form a directed acyclic graph; porting can start at the low-level modules, which do not depend on anything else, and proceed towards the top-level modules, while each level can be tested independently of the higher levels.

The independence of the system-call logic from the underlying processor architecture and partially also from the provided ABI version is achieved through the complete encapsulation of L4's ABI types and the system-call parameters into C++ classes. The ABI types are based on mask and shift operations and not on C bit fields, which makes them work on both little-endian and big-endian architectures; the use of C++-source-level inlining should provide compiler results as efficient as the use of bit fields. The in-kernel system-call bindings provide the system-call logic with a completely generic mechanism to access the system-call parameters.

I provided a generic page-table interface, which should ease the transfer to new architectures or new page-table layouts (e.g., on IA-64) even without a complete understanding of the class hierarchies of Fiasco's address-space abstractions.
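The mask-and-shift encoding of ABI types mentioned above can be illustrated with a minimal sketch. The class Demo_fpage, its field layout, and the 32-bit Mword are assumptions made purely for illustration; they are not Fiasco's actual types and not a real L4 ABI layout.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t Mword;  // assumption: 32-bit machine word

// Hypothetical ABI word (NOT a real L4 layout):
//   bits  0..1   access rights
//   bits  2..7   size (log2)
//   bits 12..31  page number
// All fields are read and written with mask and shift operations, so the
// layout is defined by the numeric value of the word, independent of how
// a compiler would order C bit fields on a given architecture.
class Demo_fpage
{
public:
  explicit Demo_fpage(Mword raw = 0) : _raw(raw) {}

  Mword rights() const { return get(Rights_mask, Rights_shift); }
  Mword size() const   { return get(Size_mask, Size_shift); }
  Mword page() const   { return get(Page_mask, Page_shift); }

  void rights(Mword v) { set(Rights_mask, Rights_shift, v); }
  void size(Mword v)   { set(Size_mask, Size_shift, v); }
  void page(Mword v)   { set(Page_mask, Page_shift, v); }

  Mword raw() const { return _raw; }

private:
  static constexpr Mword Rights_shift = 0,  Rights_mask = 0x00000003u;
  static constexpr Mword Size_shift   = 2,  Size_mask   = 0x000000fcu;
  static constexpr Mword Page_shift   = 12, Page_mask   = 0xfffff000u;

  // Extract a field: mask first, then shift down.
  Mword get(Mword mask, Mword shift) const { return (_raw & mask) >> shift; }

  // Store a field: clear the old bits, then merge the shifted new value.
  void set(Mword mask, Mword shift, Mword v)
  { _raw = (_raw & ~mask) | ((v << shift) & mask); }

  Mword _raw;
};
```

Because all accessors are defined inline in the class, a compiler can typically reduce each call to a few AND, OR, and shift instructions, which is the efficiency argument made in the text.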
A problem when porting Fiasco to a new architecture was the kernel debugger (JDB). The original JDB was dedicated to IA-32 and had a monolithic design that complicated, if not impeded, a port to a new architecture. The Fiasco kernel debugger is now easily removable from the kernel. I provided generic dummy modules for disabling the logging features; these features depend on JDB and thereby made the kernel implementation dependent on the kernel debugger, which is problematic anyway. Additionally, I designed and implemented a basic modularized kernel debugger, which eases the porting work and supports the implementation of easily pluggable modules.

The ARM port, which I did during this work, is fully integrated in the main branch of Fiasco. It runs on the iPAQ H3800 with a StrongARM processor. It is not yet well tested because of missing user-level applications.

Appendix A Architecture-Specific Hooks

This appendix contains the description of all interfaces that need an architecture-specific implementation.

A.1 Kernel-External Hooks

A.1.1 Proc Class

Low-level (L4-independent) CPU abstraction.

void Proc::cli()
This function must disable external interrupts on the local CPU.

void Proc::sti()
This function must enable external interrupts on the local CPU.

Proc::Status Proc::interrupts()
This function must return the state (enabled/disabled) of the interrupts on the local CPU.

Proc::Status Proc::cli_save()
This function must disable the interrupts on the local CPU and return the previous state (enabled/disabled).

void Proc::sti_restore(Proc::Status state)
This method must restore the interrupt state from state.

void Proc::pause()
This method must be implemented to prevent the CPU from overheating and/or blocking concurrent hyper-threads in the case of tight spin loops.

void Proc::halt()
This method should put the CPU into sleep mode, which is left by a later IRQ.
void Proc::stack_pointer(Mword sp)
This method has to set the CPU's stack pointer to sp.

A.1.2 Atomic Operations

Operations for atomic read-write access to the main memory.

bool up_cas_unsafe(Mword *ptr, Mword oldval, Mword newval)
This function must be implemented to do a uniprocessor compare-and-swap operation on the machine word that is addressed by ptr. The function must return true on successful operation, and false otherwise.

bool smp_cas_unsafe(Mword *ptr, Mword oldval, Mword newval)
This function must be implemented to do a multiprocessor-safe compare-and-swap operation on the machine word that is addressed by ptr. The function must return true on successful operation, and false otherwise.

bool up_cas2_unsafe(Mword *ptr, Mword *oldval, Mword *newval)
This function must be implemented to do a uniprocessor compare-and-swap operation on the two adjacent machine words that are addressed by ptr. The function must return true on successful operation, and false otherwise.

bool smp_cas2_unsafe(Mword *ptr, Mword *oldval, Mword *newval)
This function must be implemented to do a multiprocessor-safe compare-and-swap operation on the two adjacent machine words that are addressed by ptr. The function must return true on successful operation, and false otherwise.

bool up_tas(Mword *lock)
This function must be implemented to do a uniprocessor atomic test-and-set operation on the machine word addressed by lock.

bool smp_tas(Mword *lock)
This function must be implemented to do a multiprocessor-safe atomic test-and-set operation on the machine word addressed by lock.

A.2 Kernel-Internal Hooks

A.2.1 Page_table Class

void* Page_table::operator new(size_t)
Allocate memory for a new page table.

void Page_table::operator delete(void *)
Free the memory of the given page table.

void Page_table::init()
Initialize the paging mechanisms for the kernel.

Page_table::Page_table()
Create a new (empty) page table.
Page_table::Status Page_table::insert(P_ptr<void> pa, void *va, size_t s, Page::Attribs a)
Insert a mapping from virtual address va to physical address pa with the size s and attributes a. If there is already a mapping for the given virtual address, E_EXISTS must be returned.

Page_table::Status Page_table::replace(P_ptr<void> pa, void *va, size_t s, Page::Attribs a)
Replace the mapping for virtual address va with the new values given in pa, s, and a.

Page_table::Status Page_table::change(void *va, Page::Attribs a)
Change the access rights of the given address va to a. If there is no mapping for va, E_INVALID must be returned.

Page_table::Status Page_table::remove(void *va)
Remove the mapping for the virtual address va. If there is no mapping for va, E_INVALID must be returned.

P_ptr<void> Page_table::lookup(void *va, size_t *s, Page::Attribs *a) const
Returns the mapping for the virtual address va. If s is not null, the size of the found page is returned in s. If a is not null, the page attributes of the found page are returned in a. The returned pointer is null if there is no valid mapping.

Page_table::Status Page_table::insert_invalid(void *va, size_t s, Mword val)
Insert the given value (val) as invalid mapping at address va and with size s into the page table. If there is already a valid mapping, E_EXISTS must be returned.

Mword Page_table::lookup_invalid(void *va) const
Returns the invalid mapping at virtual address va. If there is a valid mapping, (Mword)-1 must be returned.

void Page_table::copy_in(void *my_base, Page_table *o, void *base, size_t size)
Copy all mappings from o, starting with base and within the size size, to this page table, starting at my_base.

Page_table* Page_table::current()
Return a pointer to the currently active page table.

size_t const Page_table::num_page_sizes()
Returns the number of supported page sizes.
size_t const*const Page_table::page_sizes()
Returns a constant array with num_page_sizes entries that contains the supported page sizes in bytes (starting with the smallest).

size_t const*const Page_table::page_shifts()
Returns a constant array with num_page_sizes entries. Each entry must contain the number of page-offset bits according to the page size returned by page_sizes.

Page_table* Page_table::activate(P_ptr<Page_table> page_table)
Activate the given page table and return a pointer to its virtual address. This means the given page table must be activated in the MMU, and all operations that are necessary to ensure memory consistency must be executed.

A.2.2 Kmem Class

The Kmem class represents the static (read-only) abstraction of the kernel address space. The following constants must be defined:

Kmem::mem_tcbs           Start address of the TCB area
Kmem::mem_tcbs_end       End address of the TCB area
Kmem::mem_user_max       End address of the user address space
Kmem::mem_kernel_max     End address of the kernel address space
Kmem::ipc_window_start   Start address of the IPC window
Kmem::ipc_window_end     End address of the IPC window

Mword Kmem::is_kmem_page_fault(Mword pfa, Mword error)
Checks for kernel page fault.

Mword Kmem::is_tcb_page_fault(Mword pfa, Mword error)
Checks for page fault in TCB area.

Mword Kmem::is_ipc_page_fault(Mword pfa, Mword error)
Checks for user-mode page fault.

Mword Kmem::is_io_bitmap_page_fault(Mword pfa, Mword error)
Checks for page fault in the IA-32 I/O bitmap.

A.2.3 Context Class

Low-level part of Fiasco's thread abstraction.

void Context::switch_fpu(Context *target)
This hook must save the FPU context of the current thread and restore the FPU context of the target thread. On architectures with lazy FPU handling, this method has to prepare the FPU according to its current owner.
Mword Context::switch_cpu(Context *target)
switch_cpu has to switch the execution context of the CPU and call the function switchin_context, which handles the address-space switch.

void Context::switchin_context()
On most architectures, this function calls Space_context::switchin_context. IA-32 additionally requires that the new stack pointer is written into the TSS.

A.2.4 Thread Class

High-level part of Fiasco's thread abstraction.

bool Thread::initialize(Address ip, Address sp, Thread *pager, Thread *preempter, Address *o_ip, Address *o_sp, Thread **o_pager, Thread **o_preempter, Mword *o_eflags)
This method has to prepare the user-level state of the thread according to the given parameters. It is basically the implementation of the thread_ex_regs system call.

void Thread::user_invoke()
This method has to manage the transition to user mode for newly created threads.

bool Thread::handle_sigma0_page_fault(Address pfa)
This method becomes redundant with the implementation of a generic page-table interface. At the moment, the IA-32 version uses CPU feature flags to decide whether super pages shall be used.

A.2.5 Kernel_thread Class

Special kernel thread, derived from the Thread class. This thread is responsible for the kernel startup and finally implements the idle loop.

void Kernel_thread::free_init_call_section()
Should make the Initcall sections of the kernel inaccessible to the kernel. This means a complete un-mapping in Fiasco/UX, filling with undefined opcodes on IA-32, and doing nothing on ARM.

void Kernel_thread::bootstrap_arch()
Initializes architecture-specific kernel-thread state; may be empty.

void Kernel_thread::init_workload()
Is only specialized on native IA-32, because of the old way Sigma0 gets the address of the kernel info page (i.e., via an initial stack).
A.2.6 In-Kernel System-Call Bindings

The following sections describe the in-kernel system-call bindings for each system call. These bindings must be implemented to read the system-call parameters from the appropriate registers or memory locations.

Sys_ipc_frame

Binding for the version 2 and version X.0 IPC system call.

void Sys_ipc_frame::rcv_source(L4_uid id)
Set the IPC source for the recipient.

L4_uid Sys_ipc_frame::rcv_source()
Get the IPC source for the recipient.

L4_uid Sys_ipc_frame::snd_dest() const
Get the destination for the IPC.

Mword Sys_ipc_frame::has_snd_dest() const
Returns whether the IPC has a destination.

Mword Sys_ipc_frame::irq() const
Get the IRQ destination of the IPC.

void Sys_ipc_frame::snd_desc(Mword w)
Set the send descriptor.

L4_snd_desc Sys_ipc_frame::snd_desc() const
Get the send descriptor.

void Sys_ipc_frame::rcv_desc(L4_rcv_desc d)
Set the receive descriptor.

L4_rcv_desc Sys_ipc_frame::rcv_desc() const
Get the receive descriptor.

L4_timeout Sys_ipc_frame::timeout() const
Get the message timeout.

Mword Sys_ipc_frame::msg_word(unsigned index) const
Get the given register message word.

void Sys_ipc_frame::set_msg_word(unsigned index, Mword value)
Set the given message word to the given value.

void Sys_ipc_frame::copy_msg(Sys_ipc_frame *to) const
Copy this message to the given IPC data.

L4_msgdope Sys_ipc_frame::msg_dope() const
Get the message dope.

void Sys_ipc_frame::msg_dope_set_error(Mword err)
Set the error code.

void Sys_ipc_frame::msg_dope(L4_msgdope d)
Set the message dope.

void Sys_ipc_frame::msg_dope_combine(L4_msgdope d)
OR some extra bits into the message dope.

unsigned const Sys_ipc_frame::num_reg_words()
Number of words transmitted in registers.

Sys_id_nearest_frame

Binding for the version 2 and version X.0 ID nearest system call.

L4_uid Sys_id_nearest_frame::dest() const
Get the dest parameter of the syscall.
void Sys_id_nearest_frame::type(Mword type)
Set the return type of the syscall.

void Sys_id_nearest_frame::nearest(L4_uid id)
Set the result of the syscall.

Sys_ex_regs_frame

Binding for the version 2 and version X.0 ex-regs system call.

Mword Sys_ex_regs_frame::lthread() const
Get the lthread parameter of the syscall.

Mword Sys_ex_regs_frame::sp() const
Get the stack pointer parameter.

Mword Sys_ex_regs_frame::ip() const
Get the instruction pointer parameter.

L4_uid Sys_ex_regs_frame::preempter() const
Get the preempter ID.

L4_uid Sys_ex_regs_frame::pager() const
Get the pager ID.

void Sys_ex_regs_frame::old_eflags(Mword oefl)
Set the old eflags (x86) or processor status word (other CPUs).

void Sys_ex_regs_frame::old_sp(Mword osp)
Set the old stack pointer.

void Sys_ex_regs_frame::old_ip(Mword oip)
Set the old instruction pointer.

void Sys_ex_regs_frame::old_preempter(L4_uid id)
Set the old preempter ID.

void Sys_ex_regs_frame::old_pager(L4_uid id)
Set the old pager ID.

Sys_thread_switch_frame

Binding for the version 2 and version X.0 thread-switch system call.

L4_uid Sys_thread_switch_frame::dest() const
Get the dest ID of the switch.

Mword Sys_thread_switch_frame::has_dest() const
Returns true if dest is valid.

Sys_thread_schedule_frame

Binding for the version 2 and version X.0 thread-schedule system call.

L4_sched_param Sys_thread_schedule_frame::param() const
Get the scheduling parameters.

L4_uid Sys_thread_schedule_frame::preempter() const
Get the preempter ID.

L4_uid Sys_thread_schedule_frame::dest() const
Get the destination ID.

void Sys_thread_schedule_frame::old_param(L4_sched_param op)
Set the old scheduling parameters.

void Sys_thread_schedule_frame::time(Unsigned64 t)
Set the consumed time.

void Sys_thread_schedule_frame::old_preempter(L4_uid id)
Set the old preempter.

void Sys_thread_schedule_frame::partner(L4_uid id)
Set the partner of a pending IPC.
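To illustrate the kind of binding described in this section, the following minimal sketch shows how an architecture-specific frame class can hide the register assignment from generic system-call logic. The names Demo_entry_frame, Demo_ex_regs_frame, and demo_sys_ex_regs as well as the register assignment are hypothetical and chosen for illustration only; they do not reproduce Fiasco's actual bindings or any real L4 ABI.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uintptr_t Mword;

// Hypothetical saved user-level register state at kernel entry.
struct Demo_entry_frame
{
  Mword r[8];
};

// Architecture-specific binding in the style of Sys_ex_regs_frame.
// The register assignment below is invented for this sketch.
struct Demo_ex_regs_frame : Demo_entry_frame
{
  Mword lthread() const { return r[0]; }
  Mword ip() const      { return r[1]; }
  Mword sp() const      { return r[2]; }

  void old_ip(Mword oip) { r[1] = oip; }
  void old_sp(Mword osp) { r[2] = osp; }
};

// Generic logic: it only uses the accessor methods and never touches the
// registers directly, so it stays unchanged when the binding changes.
inline void demo_sys_ex_regs(Demo_ex_regs_frame *regs,
                             Mword &cur_ip, Mword &cur_sp)
{
  Mword new_ip = regs->ip();
  Mword new_sp = regs->sp();

  // Return the previous values to the caller via the frame.
  regs->old_ip(cur_ip);
  regs->old_sp(cur_sp);

  cur_ip = new_ip;
  cur_sp = new_sp;
}
```

Porting such logic to another register layout would then, under the assumptions of this sketch, only require a different implementation of the accessor methods.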
Sys_unmap_frame

Binding for the version 2 and version X.0 flexpage-unmap system call.

L4_fpage Sys_unmap_frame::fpage() const
Get the fpage to unmap.

Mword Sys_unmap_frame::map_mask() const
Get the mask, i.e., the rights, for the unmap.

bool Sys_unmap_frame::downgrade() const
Returns true if the operation is a downgrade.

bool Sys_unmap_frame::self_unmap() const
Returns true if the current space also flushes the fpage.

Sys_task_new_frame

Binding for the version 2 and version X.0 task-new system call.

Mword Sys_task_new_frame::mcp() const
Get the mcp of the new task (if created active).

L4_uid Sys_task_new_frame::new_chief() const
Get the new chief of the task (if created inactive).

Mword Sys_task_new_frame::sp() const
Get the stack pointer of thread 0 (active).

Mword Sys_task_new_frame::ip() const
Get the instruction pointer of thread 0 (active).

Mword Sys_task_new_frame::has_pager() const
Returns whether a pager is specified (if not, an inactive task is created).

L4_uid Sys_task_new_frame::pager() const
Get the pager ID (active).

L4_uid Sys_task_new_frame::dest() const
Get the task ID of the new task.

void Sys_task_new_frame::new_taskid(L4_uid id)
Set the new task's ID.

A.2.7 Cpu Class

Abstraction for kernel-internal CPU initialization and features.

void Cpu::init()
Initialize the CPU (e.g., map the ARM interrupt vector table).

A.2.8 Fpu Class

This class handles the state of the floating point unit (FPU). An implementation for machines without an FPU exists; in other cases the following methods need an implementation.

void Fpu::init()
Initialize the FPU.

void Fpu::save_state(Fpu_state *s)
Save the current FPU state into s.

void Fpu::restore_state(Fpu_state *s)
Restore the FPU state from s.

void Fpu::disable()
Disable the FPU; subsequent use may cause an exception, which can be used to implement lazy FPU handling.

void Fpu::enable()
Enable the FPU.
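The Fpu hooks above can be combined into lazy FPU handling roughly as in the following sketch. It simulates the hardware with plain variables; Demo_fpu, Demo_thread, demo_switch_fpu, and demo_fpu_fault are hypothetical names, and the code is an illustration under these assumptions rather than Fiasco's implementation.

```cpp
#include <cassert>

struct Demo_fpu_state { int regs; };       // stand-in for saved FPU registers
struct Demo_thread    { Demo_fpu_state fpu; };

// Simulated FPU with the same kind of hooks as described for the Fpu class.
class Demo_fpu
{
public:
  static void save_state(Demo_fpu_state *s)    { s->regs = hw_regs; }
  static void restore_state(Demo_fpu_state *s) { hw_regs = s->regs; }
  static void disable() { enabled = false; }
  static void enable()  { enabled = true; }

  static int hw_regs;          // simulated hardware register contents
  static bool enabled;         // simulated "FPU usable" flag
  static Demo_thread *owner;   // thread whose state is currently in the FPU
};

int Demo_fpu::hw_regs = 0;
bool Demo_fpu::enabled = false;
Demo_thread *Demo_fpu::owner = nullptr;

// Called on every thread switch: no state is copied here. The FPU stays
// enabled only if the target thread already owns the FPU contents.
inline void demo_switch_fpu(Demo_thread *target)
{
  if (Demo_fpu::owner == target)
    Demo_fpu::enable();
  else
    Demo_fpu::disable();
}

// Called when a thread uses the disabled FPU and triggers an exception:
// only now is the state of the old owner saved and the new one restored.
inline void demo_fpu_fault(Demo_thread *current)
{
  Demo_fpu::enable();
  if (Demo_fpu::owner)
    Demo_fpu::save_state(&Demo_fpu::owner->fpu);
  Demo_fpu::restore_state(&current->fpu);
  Demo_fpu::owner = current;
}
```

The point of the scheme is that threads that never touch the FPU cause no save or restore work at all; the cost is paid only on the first FPU use after a switch to a thread that is not the current owner.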
A.2.9 Timer Class

This class encapsulates access to the system timer. The following hooks must be implemented.

void Timer::init()
Initialize the system timer circuit.

void Timer::acknowledge()
Acknowledge a timer IRQ.

void Timer::enable()
Enable the system timer. Timer IRQs are subsequently generated.

void Timer::disable()
Disable the system timer. (No more timer IRQs.)

A.2.10 Pic Class

This class is the abstraction for the platform's interrupt controller.

void Pic::disable_locked(unsigned irq)
Disable (mask) the specified IRQ. (This operation is only used with locally disabled IRQs and thus does not need to care about locking.)

void Pic::enable_locked(unsigned irq)
Enable (unmask) the specified IRQ. (This operation is only used with locally disabled IRQs and thus does not need to care about locking.)

void Pic::acknowledge_locked(unsigned irq)
Acknowledge the specified IRQ. (This operation is only used with locally disabled IRQs and thus does not need to care about locking.)

Pic::Status Pic::disable_all_save()
Disable (mask) all specified IRQs and return the previous state.

void Pic::restore_all(Pic::Status state)
Restore the state of all IRQs according to the given state.

A.2.11 Mapdb Class

The Mapdb class represents the mapping database. The entries of the mapping database may have differing layouts and alignment constraints on different architectures, and hence have to be defined per target architecture. The following structure represents an entry of the mapping database. The declaration of this structure can differ in the sizes and in the order of the members.

    struct Mapping_entry
    {
      unsigned space:11;
      unsigned size:1;
      unsigned address:20;
      unsigned depth:8;
    };

A.2.12 Startup Constructor

A static constructor with the priority STARTUP_INIT_PRIO must initialize the mandatory parts of the kernel.
These parts are, for example:

• Boot_info::init()
• Config::init()
• Kip::init()
• Pic::init()
• Boot_console::init()
• ...

A.2.13 Boot_console Class

This class has to handle the in-kernel console drivers during the boot process.

void Boot_console::init()
This method must initialize the console for kernel messages.

A.2.14 kdb_ke Module

Abstraction for explicit kernel-debugger invocation.

bool kdb_ke(const char *msg)
This function must enter the kernel debugger and print out the given message.

Acronyms

ABI    application binary interface
API    application programming interface
ASID   address-space identifier
ASOD   aspect-oriented design
CISC   complex-instruction-set computing
CPD    caching page table
CPSR   current program status register
CPU    central processing unit
DACR   domain access control register
DTLB   data-access translation look-aside buffer
EFI    extensible firmware interface
ELF    executable and linker format
FCSE   fast-context-switch extension
FIQ    fast-interrupt request
FPU    floating point unit
HLL    high-level language
IPC    inter-process communication
IRQ    hardware interrupt
ITLB   instruction-prefetch translation look-aside buffer
I/O    input/output
JDB    Fiasco kernel debugger
LPT    leaf page table
MMU    memory management unit
OO     object oriented
OOD    object-oriented design
OOP    object-oriented programming
OS     operating system
PD     page directory
PID    process identifier
RAM    random-access memory
RISC   reduced-instruction-set computing
SASOS  single-address-space operating system
SPSR   saved program status register
SWI    software interrupt
TCB    thread control block
TLB    translation look-aside buffer
UART   universal asynchronous receiver/transmitter
UML    unified modeling language
UNSW   University of New South Wales
VGA    video graphics adapter

Bibliography

[Accetta et al., 1986] Accetta, M. J., Baron, R. V., Bolosky, W., Golub, D. B., Rashid, R. F., Tevanian, A., and Young, M. W. (1986). Mach: A new kernel foundation for unix development. In USENIX Summer Conference, pages 93–113, Atlanta, GA.

[Alexandrescu, 2001] Alexandrescu, A. (2001). Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley.

[ARM Ltd., 2000] ARM Ltd. (2000). ARM Architecture Reference Manual. ARM Limited.

[ASOD, 2003] ASOD (2003). Aspect-Oriented Development, Home Page. URL: http://www.asod.net.

[Bomberger et al., 1992] Bomberger, A. C., Frantz, A. P., Frantz, W. S., Hardy, A. C., Hardy, N., Landau, C. R., and Shapiro, J. S. (1992). The KeyKOS Nanokernel Architecture. In Proceedings of the USENIX Workshop on MicroKernels and Other Kernel Architectures, pages 95–112. USENIX Association.

[Coady et al., 2001] Coady, Y., Kiczales, G., Feely, M., Hutchinson, N., Suan, J. O., and Gudmudson, S. (2001). Position Summary: Aspect-Oriented System Structure. In The 8th Workshop on Hot Topics in Operating Systems (HotOS).

[FreeBSD, 2003] FreeBSD (2003). FreeBSD Home Page. URL: http://www.freebsd.org.

[Group, 1999] Group, T. F. R. (1999). The OSKit: The Flux Operating System Toolkit. University of Utah, Department of Computer Science.

[Grützmacher, 1998] Grützmacher, L. (1998). Entwurf und Implementierung einer Mapping-Datenbank für L4.

[Haeberlen, 2003] Haeberlen, A. (2003). Managing Kernel Memory Resources from User Level. Master's thesis, System Architecture Group, University of Karlsruhe.

[Härtig et al., 1997] Härtig, H., Hohmuth, M., Liedtke, J., Schönberg, S., and Wolter, J. (1997). The performance of µ-kernel-based systems. In 16th ACM Symposium on Operating System Principles (SOSP), pages 66–77, Saint-Malo, France.

[Hazelnut, 2000] Hazelnut (2000). Hazelnut - performance evaluation. Available from URL: http://www.l4ka.org/projects/hazelnut/eval.asp.

[Hohmuth, 2003a] Hohmuth, M. (2003a). The Fiasco kernel: System Architecture. Technical Report ISSN 1430-211X TUD-FI02-06, Dresden University of Technology. Unpublished.

[Hohmuth, 2003b] Hohmuth, M. (2003b). Preprocess - A preprocessor for C and C++ modules. URL: http://os.inf.tu-dresden.de/~hohmuth/prj/preprocess/.

[Hohmuth and Härtig, 2001] Hohmuth, M. and Härtig, H. (2001). Pragmatic nonblocking synchronization for real-time systems. In USENIX Annual Technical Conference, Boston, MA.

[Intel, 2003] Intel (2003). Prescott New Instructions Software Developer's Guide. Intel Corporation.

[Intel PXA, 2002a] Intel PXA (2002a). Intel PXA250 and PXA210 Application Processors, Developer's Manual.

[Intel PXA, 2002b] Intel PXA (2002b). Intel PXA250 and PXA210 Application Processors Operating System Developer's Guide.

[Intel XScale, 2002] Intel XScale (2002). Intel XScale Microarchitecture for the PXA250 and PXA210 Application Processors, User's Manual.

[Kesteloot, 1995] Kesteloot, L. (1995). Porting BSD UNIX to a New Platform. URL: http://www.teamten.com/lawrence/291.paper/291.paper.html.

[L4Ka, 2003] L4Ka (2003). L4Ka Website of the University of Karlsruhe. URL: http://www.l4ka.org.

[Liedtke, 1995] Liedtke, J. (1995). On µ-kernel construction. In 15th ACM Symposium on Operating System Principles (SOSP), pages 237–250, Copper Mountain Resort, CO.

[Liedtke, 1996] Liedtke, J. (1996). L4 reference manual (486, Pentium, PPro). Arbeitspapiere der GMD No. 1021, GMD - German National Research Center for Information Technology, Sankt Augustin. Also Research Report RC 20549, IBM T. J. Watson Research Center, Yorktown Heights, NY, September 1996.

[Linux, 2003] Linux (2003). Linux-Kernel Home Page. URL: http://www.kernel.org.

[McKusick et al., 1996] McKusick, M. K., Bostic, K., Karels, M. J., and Quarterman, J. S. (1996). The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley Longman, Inc.

[NetBSD, 2003] NetBSD (2003). NetBSD Home Page. URL: http://www.netbsd.org.

[Sartoris Developers Group, 2003] Sartoris Developers Group (2003). Sartoris Project Description. URL: http://sartoris.sourceforge.net.

[StrongARM, 2001] StrongARM (2001). Intel StrongARM SA-1110 Microprocessor Developer's Manual.

[von Leitner, 2003] von Leitner, F. (2003). diet libc Web Page. URL: http://www.fefe.de/dietlibc.

[Warg, 2002] Warg, A. (2002). Porting of Fiasco to IA-64. Dresden University of Technology.

[Wiggins, 1999] Wiggins, A. (1999). The Design and Implementation of the L4 Microkernel on the StrongARM SA-1100.

[Wiggins et al., 2002] Wiggins, A., Chapman, M., Uhlig, V., Sayle, A., and Heiser, G. (2002). The Benefits of Sharing TLB Entries. Technical report, School of Computer Science & Engineering, UNSW, Australia; University of Karlsruhe, Germany.

[Wiggins and Heiser, 2000] Wiggins, A. and Heiser, G. (2000). Fast Address-Space Switching on the StrongARM SA-1100 Processor. In Proceedings of the 5th Australian Computer Architecture Conference (ACAC), pages 97–104, Canberra, Australia. IEEE CS Press.

[Wilkinson et al., 1995] Wilkinson, T., Murray, K., Russel, S., Heiser, G., and Liedtke, J. (1995). Single address space operating systems. UNSW-CSE-TR 9504, Univ. of New South Wales, School of Computer Science, Sydney, Australia.