MASTER'S THESIS: NOMADIC OPERATING SYSTEMS

December 10, 2002

By Jacob Gorm Hansen & Asger Kahl Henriksen
Department of Computer Science, University of Copenhagen

Abstract

This thesis attempts to solve the configuration and reconfiguration difficulties encountered in utility computing by allowing problem instances to be submitted as running nomadic operating systems. An environment suitable for hosting multiple nomadic operating systems, and an example nomadic operating system based on Linux, are implemented. Performance of the nomadic operating system is shown not to suffer significantly compared to native Linux, and the nomadic operating system is shown to be able to migrate between two hosts in less than one tenth of a second.

Acknowledgements

We would like to thank the following people for their time and effort in making this thesis possible:

Niels Elgaard Larsen, our counsellor from DIKU.
Peter Andreasen, for proofreading and comments, and for supplying us with this nice TeX style.
Henrik Lund, for helping create the illustrations.
Enterprise Brandhouse, for letting us use their offices for our work.
The nice people on the L4-ka and L4-hackers mailing lists, for taking the time to explain in depth the inner workings of L4.

Contents

1 Introduction
   Work description
   Focus of this thesis
   Hypotheses
2 Motivation
   Mobile computer use
      Laptop and palmtop computers
      Remote display technologies
   Trends in computing
      Utility computing
      Network file systems
      Removable media
      Faster, unstable networks
   The need for operating system migration
3 Present state
   System partitioning
      IBM VM/370
      Microkernels
      Disco and VMWare
      Jails
      Java Virtual Machine
      Choosing a technology
   Process migration
      Sprite
      MOSIX
      The residual dependency problem
      Why process migration has failed to take off
   Persistent systems
4 Proposed solution
   Interface design considerations
   Choosing a microkernel
      Mach 3.0
      L4
      L4 abstractions
      L4Linux
   Different migration schemes
   Guest interface considerations
      Address space layout
      Page fault and page request handling
      Checkpointing behaviour
      Hardware abstractions
5 Working with the L4 kernel
   L4 basics
   Checkpointing
   Virtual memory
   Different L4 versions
6 Implementation
   Host environment
      L4Linux as a host environment
      The NomadBIOS host environment
      Choice of host environment
   Sharing host resources
      L4 task numbers
      Physical memory
      Timer interrupt
      Realtime clock
      Network packet filtering
   Adapting L4Linux as a guest operating system
      Hardware abstraction layers
      Disabling interrupts
      Suspending L4Linux and its user processes
      Resuming L4Linux and its user processes
      Pure demand paging model
      Task identifier migration issues
   Remote control interface
   Future work
      Checkpointing server
      Load balancing
      Block device access
7 Performance measurements
   Latency benchmark
   Throughput benchmark
   Migration benchmark
8 Discussion
   Security
   Migration across network boundaries
   Other usage scenarios
      Workstation hotel
      Laptop replacement
   Related work
      The Fluke microkernel
      Denali
      vMatrix
   Conclusion
A Guest interface specification
   Address space layout and page fault handling
   Checkpointing behaviour
   Hardware abstractions
   External thread identifiers
   Guest info page
B Changes to the L4Linux source
C Availability
D Bibliography

CHAPTER 1
Introduction

The first two sections below are a slightly altered version of the original work description submitted for this thesis before work began. Minor grammatical corrections and a few clarifications have been made.

The current trend towards distributed computing, in the form of clusters or grids, introduces a demand for dynamic allocation and re-allocation of computational resources across administrative realms. A number of languages and APIs have been proposed for creating a unified environment on which applications may rely. This may be impractical for a number of reasons.
- Existing applications need to be rewritten to utilise the new APIs or language features. This may be very costly or even impossible, due to lack of access to source code or qualified personnel, or lack of bindings between the new APIs and old programming languages like FORTRAN.
- Applications may require features not provided by the new APIs, such as shared memory, or access to various network file systems.
- In the case of new and interpreted languages, such as Java, the performance overhead often negates whatever speedup might be had from distributing the computation in the first place.
- To provide a common environment, the new APIs or languages will need to implement or abstract already existing technologies, resulting in duplication of effort and code bloat. It is reasonable to expect the new APIs to grow in complexity until they match the features already provided by modern operating systems.

From the above, the obvious yet naive way to achieve such a unified environment without rewriting everything would be to define a standard operating system configuration, to be adhered to by all participating organisations. Applications would then be able to rely on the existence of various operating system features, such as access to network file systems, a common set of installed shared libraries, and so on. However, the problem of agreeing on such a platform configuration, and of keeping it synchronised across all nodes, would grow exponentially with the number of participating organisations and the number of features, so this approach is not realistic. One has to realise that the reliance of the application upon certain operating system features makes the operating system and its configuration a part of the application.
To solve this configuration problem, we propose that instead of distributing applications in the form of plain user level programs, the entire operating system should be distributed along with the application, allowing users to customise the operating system configuration according to the demands of the application.

Most popular operating systems today serve two primary purposes. The first is to be a hardware abstraction layer, providing a unified interface to resources such as disk, network and memory. The second is to provide a range of services to client programs, such as protection, concurrency, persistence, naming etc. User level programs rely heavily on the particular semantics of the service layer, and changes to this layer may cause the programs to stop functioning.

These two layers are often tightly coupled, mainly for performance reasons. This tight coupling has some drawbacks. For example, it is non-trivial to migrate a running operating system from one physical computer to another without restarting it, because the operating system has committed itself to certain assumptions about the workings of the hardware, as determined at boot time. Since the hardware is already abstracted as seen from user level applications and the kernel service layer, it becomes possible to cleanly separate the two layers, leaving the operating system with only the service layer responsibilities. With this separation in place, migrating entire operating systems with all of their running applications becomes possible, as does running multiple (possibly different) operating systems concurrently on the same computer.

Work description

Building a foundation for migratable operating systems, and for running a number of these concurrently on the same processor, will be the scope of our work.
Our goal will be to create a small program, much like the BIOS or firmware installed in computers today, that encapsulates not only the specifics of the installed hardware, but also the processing units and memory. This can be achieved by using a microkernel to host and service one or more guest operating systems. We will do the following:

- Investigate the various microkernel designs and rate them with regard to our goals.
- Analyse the needs and challenges of hardware abstractions.
- Analyse problems of migration across network boundaries.
- Implement a migratable operating system based on Linux.
- Benchmark our implementation against alternative schemes.

Focus of this thesis

Today, 30 years after the invention of UNIX, operating systems are still the centre of much research attention. Most researchers in the field can be divided into two groups: the group aiming to replace UNIX with something entirely different, and the group aiming to improve on what is already popular. An example of a replacement project is Eros [Shapiro, 1999], a checkpointing and capability-based operating system, and an example of an improvement project is the MOSIX [Barak and La'adan, 1998] process migration system for Linux. There is no inherent conflict between these two schools of thought, and many researchers are working within both fields. Yet, when embarking on a research project on operating systems, one has to choose whether to try to topple the old with something radically new, or to take the path of least resistance by attempting to improve upon the existing. As relative novices in the field, this project's authors feel in no shape to stray too far from the beaten paths, so this project is one of evolution, not revolution. Therefore, the research and experiments described here focus more on finding new ways of exploiting classic operating systems technologies than on developing concepts for next-generation operating systems.
Due to the limited time frame of this project (the work of two persons full-time for six months), and the main goal of creating a real-life nomadic (migratable) operating system, it was decided to build upon existing and proven technologies where possible, instead of implementing base technologies from scratch. Notably, the project will build upon an existing microkernel foundation, rather than try to invent a new one.

Some previous microkernel-based approaches have gotten to the point where functionality is comparable to traditional monolithic systems, and stopped there. The real benefits of the microkernel approach are often not exploited fully, but left as an exercise for the reader. This project can be described as the undertaking of one such exercise: showing that the use of a microkernel allows multiple operating systems to run concurrently, and allows them to migrate quickly and easily between physical and somewhat heterogeneous machines.

With regard to the implementation of a nomadic operating system, the goals are restated in more technical detail as follows:

- Implement an environment suitable for hosting multiple concurrently running nomadic operating systems on commodity hardware, including shared and safe access to hardware devices such as network adapters.
- Implement a non-toy example nomadic operating system able to support standard applications with no recompilation or relinking necessary, and to transparently migrate between hosts.
- Implement a system of network protocols, clients and servers for migrating running operating systems over a network, including measures to limit the downtime experienced by users.

The purpose of the host environment is similar to the original purpose of the IBM PC basic input/output system (BIOS), in providing a standard foundation on which real operating systems can run. Therefore, we refer to it as the NomadBIOS.
The example nomadic operating system will be a version of Linux, referred to as NomadLinux.

Hypotheses

We aim to confirm the following hypotheses about nomadic operating systems:

Concurrent: By adapting an existing microkernel-based operating system, the concurrent running of multiple fully protected operating systems on the same Intel IA32-based computer, without a fully virtualised environment [1], will be possible.

Migratable: By being based on hardware abstractions, operating systems will be able to migrate between heterogeneous physical computers, and do so without becoming unresponsive during migration, and without experiencing the residual dependency problem [2].

Efficient: By sacrificing the ability to host unmodified operating systems, the implemented system will be able to outperform fully virtualised systems. Performance of the adapted operating system, running within the abstracted environment, will equal that of the original microkernel-based version of the operating system, which in turn will be almost on par with that of traditional monolithic operating systems.

Scalable: The performance of two program instances running under two concurrent operating system instances will be identical to the performance of the two instances running under one instance of the operating system.

[1] Such as the one present in VMWare, described on page 12.
[2] Residual dependencies are described in more detail on page 16.

CHAPTER 2
Motivation

This project deals with migration of running programs between physical hosts. This concept, while elegant and beneficial in theory, has found only limited use in practice. Our hypothesis is that this limited success is due to a wrong choice of migrational unit. Current schemes focus on the migration of processes rather than of complete systems.
Processes in operating systems such as UNIX have a large number of dependencies on the environment in which they are running, for example open files, memory shared with other processes, reliance on access to various host-specific system features, and so on. Operating systems, on the other hand, only require access to a certain amount of memory, a certain type of processor, and perhaps a few standard peripheral devices such as harddrives or network cards. Furthermore, while the interfaces exposed by operating systems to application processes are often complex, vaguely defined and subject to change, the hardware interfaces exposed by computers to operating systems are simple, well-defined, and only slowly evolving. Thus, the operating system, instead of just its processes, should be the unit of migration.

We make two initial assumptions about the migration of running operating systems. These assumptions will be backed by more thorough argumentation later, but for now let us assume that:

1. Open network connections to the outside world will always migrate seamlessly along with the operating system.
2. All file systems mounted by the migrating operating system will be network file systems, and thus remain accessible due to assumption 1.

Naturally, by changing the unit of migration, the range of useful applications of migration changes as well. Process migration schemes such as MOSIX have proved useful mainly for balancing CPU and memory load across tightly connected clusters running long, non-I/O-intensive calculations. Operating system-level migration, on the other hand, will probably be more useful when dealing with migration between clusters that are loosely connected via wide-area networks such as the Internet, as well as for administrative tasks, such as a server being taken down for hardware maintenance, where the operating system might be moved temporarily to another host without users noticing.
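The contrast between the two migration units can be made concrete with a toy model (hypothetical Python, not the system built in this thesis): when all of a system's state is held in one self-contained object, checkpointing it on one host and resuming it on another is mere serialisation, whereas a typical process drags along open files and shared memory that have no such self-contained representation.

```python
import pickle

class ToySystem:
    """A self-contained 'operating system': all state lives inside the object."""
    def __init__(self):
        self.ticks = 0
        self.memory = {}

    def run(self, steps):
        for _ in range(steps):
            self.ticks += 1
            self.memory[self.ticks] = self.ticks * 2

# Host A: run for a while, then checkpoint the whole system.
sys_a = ToySystem()
sys_a.run(100)
image = pickle.dumps(sys_a)      # the entire migration payload

# Host B: resume from the checkpoint and continue running.
sys_b = pickle.loads(image)
sys_b.run(50)
assert sys_b.ticks == 150        # execution continued where it left off
```

Had ToySystem held an open socket or file descriptor, pickle would refuse to serialise it; this is precisely the residual dependency problem that burdens process migration, and that assumptions 1 and 2 are intended to sidestep for whole-system migration.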
Running operating systems may also be checkpointed to stable storage on another host for backup purposes. Operating system migration may also change the way individual users work with computers. Because the new unit of migration corresponds to the complete working environment of a workstation user, mobility problems may be attacked from a new angle.

Mobile computer use

Computers aid us in our work, but also limit our work to specific locations such as the office or the home. Mobile computing is a broad term, embodying various attempts to regain the freedom to work where it is most practical (next to a knowledgeable coworker) or most pleasant (in the park). Today, mobile computing is generally solved in one of three different ways:

- Laptop computers - compress all of the components needed for a real computer system to a size where it becomes carryable.
- Palmtop computers - build a minimal computer system, and tell the user to limit his work to the options provided.
- Remote displays - free the user from his desk (but not from the network) by allowing him to promote his display to any terminal at hand in the company or on the campus.

We will look further into the advantages and disadvantages of these technologies below, and introduce nomadic operating systems as an alternative in some situations.

Laptop and palmtop computers

Laptop computers are generally just desktop workstations using smaller (and thus often either more expensive or less performant) components. Laptops cannot be made arbitrarily small, due to user demands of a large monitor and a usable keyboard, and tend to be rather heavy as well. Though they promise users the chance to work in the park or on the beach, it seems safe to assume that most laptops reside safely on desks for most of their existence. Palmtops trade functionality for size, and are small enough that they can be taken anywhere just in case they might be needed.
While good enough to handle a calendar and a to-do list, plus a few specialised applications, they are not yet able to replace a real computer for traditional computing uses.

Remote display technologies

Originally, remote displays were invented to allow multiple users to share one computer, but in recent years, as personal workstations have become cheap, their main strength from the individual user's perspective lies in allowing the user to work from different locations, with access to the same set of files and applications. Remote display technologies offer a solution to mobility by allowing the user to log in from terminals geographically removed from the server on which applications are running. Some technologies require the user to perform a new login when changing location, while others allow a login session to be resumed on a new terminal. Some technologies, such as VT100, support only a text display, while others, such as the X Window System, provide display of bitmapped graphics as well.

Common to all such technologies is the need for a reliable network connection. If the network is down, no work can be done. This dependence on network connectivity also means that while a remote display will be usable from everywhere inside a company, or on a university campus, it may become too slow for practical use from less connected sites, for example when working from home. For such uses, most people still prefer laptops, or just an extra workstation, with an extra set of applications, at home.

Operating system migration offers a solution to the mobility problem. By letting the user migrate his entire operating system along with him when he moves, the need for laptops, palmtops, and remote displays is reduced. Because the operating system only needs to migrate once, network connectivity becomes less important.
Trends in computing

Below, we will examine four current trends in computing, trends that we believe underline the practicality of migrating entire operating systems instead of just their processes.

Utility computing

The term utility computing, or computing grids, describes a trend towards sharing access to computation much as access to the power grid is shared today. If electricity is needed, it is not necessary to invest in a diesel generator, windmill, fuel cell, or nuclear reactor. Instead, it is possible to plug into the power grid via simple-to-use means, and retrieve electricity from there. Some people [Foster et al., 2002] believe that in the future, computing will become a utility just like electricity is today. Not only will consumers be able to purchase processing power or safe data storage from central entities, they will also be able to provide and sell these services themselves. Researchers in the natural sciences, who occasionally have great needs for computational power but not always the budgets for the necessary equipment, have been pushing grids as a way of leveraging computer investments across multiple institutions, and a number of research projects dealing with the creation of software for service discovery, security, and job scheduling are ongoing.

Network file systems

Since the late 1980s, the trend has been for storage to become detached from computation. Network filesystems, such as NFS, and distributed filesystems, such as AFS [Satyanarayanan, 1990], allow data to be placed on dedicated servers. These servers may be tuned for better I/O performance, security is easier to enforce, and data redundancy can be avoided. The main drawbacks are that network partitions may render client computers useless, and that client-side storage space may end up unused.
AFS utilises client harddrives as cache space, allowing near-local read performance when working on previously cached files, and even allows disconnected operation in certain cases where all needed files are in the cache. AFS also supports replication of read-only filesets across multiple fileservers, mitigating the effects of network partitions. Coda [Satyanarayanan et al., 1990] goes one step further, by allowing read-write replication of filesets across fileservers, and writes to client caches even when disconnected from fileservers, tracking modifications in a log which is replayed on the fileserver upon reconnection. Intermezzo [Braam et al., 1999] is a re-implementation of the central Coda ideas, attempting to reuse preexisting functionality where possible. For example, Intermezzo uses the journaling layer of the local filesystem for logging filesystem modifications, and the HTTP protocol for bulk network file transfers, instead of inventing its own protocol.

The emergence of intelligent and efficient network file systems increases the feasibility of having tasks run at remote locations without having to transfer the total filesystem along with the application, especially if it is not known up front which parts of the total filesystem will be accessed.

Removable media

The ubiquitous but technically outdated 3.5" floppy disk has for many years hindered the advent of new forms of removable media. However, with the introduction of the Universal Serial Bus (USB) and IEEE-1394 (aka FireWire) interfaces as standard features in most computers, the removable media market has been revitalised. Solid-state devices such as the M-Systems DiskOnKey or the IBM USBKey, and portable harddrives such as the Apple iPod music player, allow vast quantities of data to be carried in a pocket, or even on a key-ring.
The combination of removable media and migrating operating systems would allow a user to migrate a running desktop environment to passive removable media, and resume the session on a different machine. This would enable the user to bring his favourite working environment with him, without having to carry a heavy laptop around.

Faster, unstable networks

The main trends in networking over the last ten years have been a continuous growth in bandwidth, and an increasing number of computer systems and devices connected to the Internet. But even though networks are becoming faster and larger, network instabilities and partitions have not yet ceased to occur. With interconnected networks, the points of failure causing partitions may lie outside the administrative realms of those affected, and so may be hard to rectify. However, problems are rare enough that few sites employ preventive measures such as redundant wide-area connections.

For the Internet, it seems as if bandwidth and the number of connected hosts are rapidly increasing, while availability remains constant at well below 100%. When planning Internet-scale distributed systems, this should be kept in mind. Assuming almost infinite bandwidth for an application may well make sense, while relying on constant and 100% reliable connectivity between any two nodes in the network may be dangerous.

Operating systems typically run much longer than processes, and so will need to migrate less frequently. They will be able to cache large amounts of data, and thus become less reliant on constant network connectivity than process-migration systems.

The need for operating system migration

Storage is becoming easier to detach from computation. In a network file system, running software can be moved to a new location without losing contact with its file state.
With fast removable media, migration may take place by simply checkpointing the entire running operating system, along with its file state, to removable media, and carrying it to another machine where it may resume operations.

Networking bandwidth and stability trends are relevant when suggesting remote display technologies as a solution to the mobility problem. As long as constant connectivity cannot be guaranteed, users will be wary of betting their ability to work on the stability of the network. If instead the operating system is migrated along with the user when he moves, connectivity is only necessary during the period of migration.

It is yet too early to tell whether computing grids will ever appeal to individual users, but it appears evident that a system for migration of running calculations inside the grid will be a necessity for efficient scheduling of resources. With migratable operating systems (or nomadic operating systems, as we like to call them), a grid customer is able to submit a job as a set of running operating systems, for which the grid may then allocate the necessary resources.

Overall, the above-mentioned trends point to the feasibility of operating system migration, and hence of nomadic operating systems.

CHAPTER 3
Present state

This project is about running multiple operating systems on one machine simultaneously, about transparent migration of these operating systems to another machine, and, as a consequence of the latter, also about checkpointing running operating systems. A lot of research has been performed within all three of these fields, which are generally known as system partitioning, process migration, and persistent systems. In this chapter, we briefly survey the most prominent system partitioning and process migration technologies, and provide a short introduction to persistent systems.

System partitioning

Various ways of running multiple operating systems on the same machine exist.
In the 1970s, hardware-assisted virtual machines running on IBM mainframes were the centre of much attention. In the late 1980s, microkernels such as Mach seemed like the right answer, due to their ability to run different operating system personalities concurrently, and in the mid 1990s the focus moved to safe programming languages such as Java. Lately, the hugely successful VMWare has spurred renewed interest in the virtual machine concept. IBM VM/370 The most well-known system for running multiple operating systems on the same machine is the IBM VM/370 [Creasy, 1981] time-sharing system. It builds on the concept of pure virtual machines, emulating the complete instruction set of the physical machine for every operating system. Resource sharing is managed by the Control Program (CP), supporting various sharing policies, such as exclusive access to magnetic drives, or time-sharing of the CPU. In the original VM/370 system, the Conversational Monitor System (CMS) was the most frequently used guest operating system, but more recent versions run both UNIX and Linux as well, and are able to host several thousand virtual machines on a single host. VM/370 runs on specialised hardware, built to aid virtualisation. While interesting for large organisations able to afford such equipment, its design prohibits its use on cheap commodity hardware, such as Intel systems, that lack proper virtualisation support [Robin and Irvine, 2001]. Microkernels The microkernel concept was proposed in 1970 by Per Brinch Hansen [Hansen, 1970], for the RC4000 system. The fundamental idea is to leave only the absolute essentials inside the operating system kernel (called a nucleus by Brinch Hansen), keeping all other functionality in normal user level programs. The focus of the RC4000 nucleus was extensibility, as the hardware platform on which it was running did not support memory protection.
The nucleus supported buffered inter-process communication (IPC) as the primary means of communication between processes. Processes could be either internal, what we understand as a process today, or external, driving peripheral hardware. Since IPC could be performed towards both kinds of process, it became a unifying abstraction for process-to-process communication, as well as for access to hardware. Later, in the mid-1980s, this approach was revisited in the Mach kernel [Rashid et al., 1989]. Mach supports buffered IPC secured by a capability-like system of ports and port references, memory protection with user-implementable paging policies, and even virtual memory with disk-paging. Critics argued that Mach was too slow, as performance when running UNIX on top of Mach was lower than when running it as one monolithic program, and today Mach is considered by many a failed experiment and proof that the microkernel concept is fundamentally flawed. Later again, in 1995, the idea was revisited by Jochen Liedtke [Liedtke, 1995], who argued that the deficiencies of Mach were due mainly to bad design and the inclusion of too many features, rather than proof that the microkernel concept was bad. In contrast with Mach, Liedtke's L4 kernel implemented only the basic abstractions which could not be implemented with equal functionality at the user level. Only unbuffered IPC was supported, based on the argument that buffered IPC could be implemented by user level threads. L4 had user level pagers but no disk-based virtual memory. The original L4 implementation was written in assembly code and tailored to the Intel x86 processor family. Liedtke showed that, at the sacrifice of portability, microkernel based systems could be made to perform almost as well as monolithic systems. For example, an implementation of Linux on L4 ran only about 5-10% slower in practical scenarios than the traditional monolithic Linux [Härtig et al., 1997].
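The rendezvous semantics of unbuffered IPC, in which the kernel never buffers message contents and a sender simply blocks until the receiver is ready, can be illustrated with a small toy model. The following Python sketch is ours, not L4 code; the class and method names are invented for illustration, and real L4 IPC operates on registers and thread identifiers rather than Python objects.

```python
class RendezvousIPC:
    """Toy model of unbuffered IPC: the kernel never buffers message
    contents, it only records which threads are blocked on IPC."""

    def __init__(self):
        self.waiting_receiver = None   # thread currently blocked in receive
        self.blocked_senders = []      # (sender, message): the message stays
                                       # with the sender until the rendezvous

    def send(self, sender, message):
        if self.waiting_receiver is not None:
            receiver, self.waiting_receiver = self.waiting_receiver, None
            return ("delivered", receiver, message)   # control transfers too
        self.blocked_senders.append((sender, message))
        return ("sender blocked", sender, message)

    def receive(self, receiver):
        if self.blocked_senders:
            sender, message = self.blocked_senders.pop(0)
            return ("delivered", sender, message)
        self.waiting_receiver = receiver
        return ("receiver blocked", receiver, None)
```

A send issued before the receiver waits leaves the sender blocked; only when both parties have entered the IPC does the message (and control) transfer. This is also why so little IPC state needs to be held in the kernel at any given time.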
With microkernels, most of the features usually associated with an operating system are implemented as user level programs. Because of this, it becomes considerably simpler to run multiple operating systems concurrently and safely, though this is a feature which is not frequently exploited. Disco and VMWare The Stanford Disco [Bugnion et al., 1997] system, and its commercial successor VMWare, allow multiple commodity operating systems to run on standard hardware without special support for virtualisation. Disco focuses mainly on better utilisation of machines with many processors, by running many copies of the same operating system (SGI Irix) on the machine. Guest operating systems communicate via a machine-internal TCP/IP network, running atop a shared-memory layer for maximum performance. Running multiple copies of Irix on a multiprocessor machine under Disco is shown to yield better overall performance than running a single, multiprocessor-capable Irix on the same machine. VMWare is a derivative of Disco for the Intel Pentium family. Its main use is for letting different operating systems, usually Windows and Linux, run concurrently on the same machine. The basic idea of both systems is to let unprivileged code run directly on the CPU, at full speed, while interpreting privileged instructions, to trick the guest operating system into believing it is running directly on the real hardware. Guest operating systems experience a standard hardware configuration, accessible using normal means such as memory-mapped I/O and Direct Memory Access (DMA). The underlying host environment (known as the Virtual Machine Monitor (VMM)) simulates the semantics of real hardware devices. For instance, VMWare simulates the popular AMD Lance network chip, for which drivers are likely to exist for most operating systems.
As described in [Robin and Irvine, 2001], the Intel Pentium does not meet the requirements for full virtualisation, because not all instructions reading or writing privileged processor state cause processor traps. Therefore, VMWare has to audit and potentially modify all code before allowing it to run directly on the CPU. The technique most likely used is described in [Lawton, 1999], and is briefly outlined below: Initially, all memory pages corresponding to the pages containing guest code are filled with trap instructions. When the CPU attempts to execute one of these instructions, a trap occurs, and the trap handler fetches the original instruction and checks whether it manipulates privileged state or not. If not, the instruction is written to the code memory page and executed by the CPU. If it does, the effects of the manipulation are simulated and the state of the virtual CPU modified accordingly, and execution commences at the subsequent instruction. Only when all contents of a memory page have been verified in this manner can it run without any interpretation taking place, and thus at full speed. Naturally, the process is more complicated than described, but one should notice the drawbacks of this technique, namely that: It requires additional memory for partially-checked pages. It has a performance overhead when running unchecked code the first time. The interpretation of all CPU instructions requires precise knowledge of the underlying CPU architecture, and so is hard to implement and not at all portable to other hardware architectures. The VMM does not know whether the guest operating system is busy or not, and risks wasting CPU cycles executing the guest idle loop. Disco relies on the guest issuing a special power-save instruction present on the MIPS CPU, but this approach is not very generic or portable.
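The scan-before-execute technique outlined above can be illustrated with a toy model. The Python sketch below is our own simplification; the opcode names and the division into "patched", "emulated" and "direct" outcomes are invented for illustration, and a real virtual machine monitor operates on binary x86 instructions, not strings.

```python
TRAP = "trap"
PRIVILEGED = {"cli", "sti"}   # hypothetical privileged opcodes

class ShadowPage:
    """Toy model of a guest code page run under scan-before-execute."""

    def __init__(self, guest_code):
        self.guest = list(guest_code)            # original guest instructions
        self.shadow = [TRAP] * len(guest_code)   # executable page: all traps
        self.interrupts_enabled = True           # part of the virtual CPU state
        self.emulations = 0

    def execute(self, pc):
        if self.shadow[pc] != TRAP:
            return "direct"              # already verified: runs at full speed
        instr = self.guest[pc]
        if instr in PRIVILEGED:
            # simulate the effect on the virtual CPU instead of executing it
            self.emulations += 1
            self.interrupts_enabled = (instr == "sti")
            return "emulated"
        self.shadow[pc] = instr          # safe: patch into the executable page
        return "patched"
```

The first pass over a page pays the verification overhead; on later passes, unprivileged instructions run directly, while privileged ones must still be simulated against the virtual CPU state.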
The main advantage of the VMWare approach is that it allows unmodified guest operating systems to run, and thus it is a pure virtual machine in the sense of Goldberg. As described in Venkitachalam and Lim, this purity is sometimes sacrificed for performance, by modifying popular guest operating systems such as Linux specially for running under VMWare. Jails A simple way of partitioning a machine into several administrative realms, each with its own set of users and its own set of processes, is the creation of virtual realms by performing extra bookkeeping inside the same operating system. This is the approach taken in the FreeBSD "jail" mechanism [Kamp and Watson, 2000]. The advantages of this approach, compared to running multiple operating systems on the same machine, are, apart from its simplicity, that memory is saved by only running one operating system, and that there is no virtualisation overhead on hardware access. Migration of entire jails between hosts is still a rather complex issue though, because running processes are closely tied to the host operating system. It is also impossible for a virtual super user to tailor the operating system to his needs, for example by installing extra kernel modules, without affecting other users. Java Virtual Machine Interpreted languages running on a virtual machine, such as Java [Lindholm and Yellin, 1996], may be easily and safely run in parallel in multiple virtual machines (VMs) on any computer for which the VM is available. As long as interpretation takes place entirely in software, and no just-in-time compilation is applied, the process of suspending and moving virtual machines between physical hosts is simple. Once just-in-time compilation is introduced, the process becomes more difficult [von Laszewski et al., 2000]. Usually, Java virtual machines are able to access the local filesystem, making residual dependencies a problem as well.
The biggest problem is the performance of interpreted or just-in-time compiled code, which is still at least twice as slow as that of native code [Ertl et al., 2002]. Why distribute a computation across multiple computers, if the speedup gained is largely negated by the overhead of using an interpreted language? Another problem of using interpreted languages is the need to rewrite legacy software, for which source code or skilled personnel may not be available. Choosing a technology For the purposes of implementing nomadic operating systems, it would appear that microkernels have the largest potential. Because operating systems run as normal user level programs, they may be subject to the microkernel's scheduling policy, which solves the idle-loop problem described for Disco above, and may be protected from each other by normal memory protection methods. Unlike in the VMWare and Disco approaches, the ability to run unmodified commercial operating systems (such as Windows or Irix) is not an issue, due to the availability of source code for commercial-quality operating systems such as Linux, NetBSD and FreeBSD, which can be adapted with relative ease. Because the ability to run existing applications is a goal, language based systems will not be applicable. Neither will the pure virtual machine approach of VM/370, because it requires special processor features not present in commodity hardware. Process migration Migration of entire operating systems has much in common with migration of single processes, and lessons may be learned by looking at current process migration systems. Process migration is the mechanism of transferring a running process from one physical machine to another. This is beneficial in theory, because it allows standard applications to transparently utilise idle computing resources present on most networks, and frees human operators from having to manually balance system loads.
While not a new concept, process migration has failed to become mainstream, due to a number of inherent problems and limitations. To better understand the pros and cons of process migration, two of the most successful systems are described below. The systems described are Sprite and MOSIX, in both of which process migration is implemented as a part of the operating system, and is transparent to the end user and the processes involved. Finally, some of the problems hindering existing process migration systems from gaining wide acceptance are discussed. Sprite Sprite [Ousterhout et al., 1988] was developed at the University of California at Berkeley. More than just an operating system, Sprite creates a network-wide single system image, by providing specialised file services, network-unique process identifiers, built-in support for accessing remote files and services on the network, and transparent process migration [Douglis and Ousterhout, 1991]. A process can exist in one of two places: on the home node, or on a foreign node. Initially all processes start on the home node, but can then be migrated to another machine in the network for completion. Often this is done at process initiation time, when the environment of the process is limited, and little information has to be transferred. Sprite attempts to utilise unused computing resources available on the network due to workstation idle time. A workstation becomes eligible for receiving foreign processes when it has been idle for a predetermined amount of time. When the owner of the workstation reactivates it, all foreign processes are evicted to their home nodes, from where they can be migrated again. Sprite uses a specialised strategy for migrating virtual memory to the new node, utilising the semantics of the Sprite network file system. When a process is migrated, it is frozen at the home node, and all memory from its address space is flushed to a page file on disk.
The process state is then transferred to the foreign node, which can then page memory in from the page file as needed. The Sprite network file system ensures that in most cases the page file does not cause disk operations, because the fileservers use their memory as a cache for the page file, which will be referenced again soon after it is written, when the process is resumed on the foreign node. In Sprite, the home node of a process always contains residual dependencies of processes migrated onto other hosts in the network. Some kernel calls have to be forwarded to the home node for evaluation, while others, such as available physical memory, can be resolved directly on the foreign node. If a process has open files on the home node when migrating, any changed blocks are flushed to the fileserver prior to migration. If a process shares a file with another process on the same node, i.e. a child process that has been forked, any caching of the file is disabled when one of the processes migrates away from the node. Any access to the file will bypass the local cache, and communicate directly with the fileserver, preserving UNIX file semantics. Processes using memory-mapped I/O, e.g. for a frame buffer, are not eligible for migration in the Sprite system. MOSIX MOSIX [Barak and La'adan, 1998] is a system for transparent process migration, developed at the Hebrew University in Jerusalem, Israel. The aim of MOSIX is to provide load balancing of processes across a cluster of computers, without the need for a central scheduling server, and without rewriting or recompiling applications. As in Sprite, the migrational unit in MOSIX is the UNIX process. Like Sprite, MOSIX has the concept of the home node of a process. For process migration to remain transparent to the individual process, all communication with local hardware or the file system is performed through a deputy process, which stays at the originating node after the process itself has been migrated elsewhere.
This limits the types of processes that can be migrated without undue communications overhead, and MOSIX takes into account the amount of local I/O performed, before deciding whether or not to migrate a given process. MOSIX attempts to balance the processor and memory load of the nodes in the cluster, by probabilistically dispersing load information from each computer to a random subset of the cluster. When a node running MOSIX decides that it has become too loaded, it will use information previously received from other nodes to decide where to migrate one or more processes. When recipient nodes have been selected, deputy processes are created at the home node, and the processes migrated via the network. This probabilistic method of load balancing is suboptimal, since the information held by a node is incomplete and may not be entirely up to date, but it eliminates the need for a central scheduling server, and thus a bottleneck and single point of failure. The problem of traffic between home and foreign nodes has been addressed by the MOSIX team by the development of the MOSIX File System [Amar et al., 2000] (MFS) and of Direct File System Access (DFSA). MFS is a prototype network file system, with files arranged so that part of the path name reveals on which node the file is physically located. MFS achieves UNIX file semantics by removing all caching, so that it becomes safe to access files locally rather than via the home node. Because file locations are easy to determine from file names, MOSIX is able to migrate the process to the file, rather than the other way round. MFS and DFSA are still at the research stage, and suffer from the no-caching requirement needed to guarantee UNIX semantics on the file system level. The use of deputy processes on home nodes means that MOSIX processes have a higher risk of failure than single-node processes, since either the home or the foreign node failing will take the process down.
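The probabilistic load dissemination used by MOSIX can be illustrated with a small toy simulation. The Python sketch below is ours, with invented names; it only captures the idea that each node gossips its load to a random subset of the cluster, and later chooses a migration target from its possibly incomplete and stale view.

```python
import random

class Node:
    """Toy model of a MOSIX-style node with a partial, gossiped load view."""

    def __init__(self, name, load):
        self.name, self.load = name, load
        self.known = {}               # possibly stale view: node name -> load

    def gossip(self, cluster, fanout):
        # periodically disperse our load figure to a random subset of peers
        peers = [n for n in cluster if n is not self]
        for peer in random.sample(peers, min(fanout, len(peers))):
            peer.known[self.name] = self.load

    def pick_migration_target(self):
        # choose the least-loaded node we know of that is lighter than we are;
        # the view is incomplete, so the choice is only probabilistically good
        lighter = [(load, name) for name, load in self.known.items()
                   if load < self.load]
        return min(lighter)[1] if lighter else None
```

With a small fanout, no node holds the complete picture, yet loaded nodes still tend to find lighter peers, and no central scheduling server is needed.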
MOSIX does however provide commands for expelling all foreign processes on a node, which means nodes can be taken down without losing processes. The residual dependency problem Both Sprite and MOSIX suffer from the problem of residual dependencies. This problem occurs when a process leaves behind open files or other state in the operating system on the originating host. Both systems attempt to solve this problem by leaving a proxy process on the originating host, handling access to local resources on behalf of the migrated process. This solution is problematic for two reasons: performance and stability. If all access to a resource has to go via the network, execution may be slowed, and by relying on resources on the originating host, the vulnerability of the migrated process to a machine crash is increased, because the process will fail if just one of the two involved hosts crashes. Note that under the assumptions made on page 5, operating system migration does not suffer from the residual dependency problem. Why process migration has failed to take off In spite of the promise held by the technology of process migration, it has failed to become an integrated part of any widely used operating system. A number of non-technical explanations for this situation are listed in Milojicic et al., including: Lack of infrastructure Since no widely used operating system supports migration, a lot of work is needed to gain the benefits of such a system. Not a requirement Users have been content to work with remote invocation and remote data access. Sociological factors In a Network of Workstations, it becomes a social issue to make the end user accept that other users have access to his resources. Until recently, the typical workstation has had few resources to spare, but recent advances in CPU speed and memory prices have changed this. Another reason may be that the aforementioned systems are not as transparent as they would like to be.
In practice, the array of applications benefiting from migration is very limited, because of the residual dependency problem. In MOSIX for example, certain types of application (for example those using shared memory) cannot be migrated at all, and short running processes such as compilers usually finish before the system has had time to decide whether they should migrate or stay. These limitations force the application developer into redesigning his application to better fit the needs of the system, and hence transparency is lost. Add to this the complexity of implementing both schemes. The tight coupling to internal kernel data structures makes the schemes very vulnerable to changes elsewhere in the kernel. For Sprite, the process migration code was initially easily and often broken by changes elsewhere in the kernel, and often the person causing the breakage did not realise the implications of his or her changes to seemingly unrelated parts of the kernel. For MOSIX the problem is twofold, since MOSIX is not part of the official Linux source tree, requiring MOSIX to catch up on new features added, and on changed kernel structures and semantics. Also, the sheer number of system calls (over 200 for the Linux kernel), kernel interfaces and semantics to be managed, makes it very difficult to validate that every single case is handled correctly. This indicates to us that the single process as a migrational unit is not the correct choice, and that if migration as a concept is to succeed, it must be implemented against a simpler interface, and be oblivious to changes to the operating system in general. Persistent systems The term checkpointing refers to the act of freezing a running program, so that it may continue from the same stage at a later time. Saving a document from a word processing program may be thought of as manually checkpointing the state of the program to disk.
In automatically checkpointing, or persistent, systems, the individual program takes no special action to have its state checkpointed, but relies on an underlying system doing so automatically. A checkpoint is a snapshot of all program state at a certain instant, from which the program can continue. If the program runs independently of external input (i.e. all program variables are bound), the result of continuing from a checkpoint should be equal to the result of running the non-checkpointed program without interruption. Traditional operating systems, such as UNIX, support only explicit checkpointing, through filesystem operations. If a program wants its data to survive a system crash or restart, it has to manually serialise its internal data structures and copy the serialised version to stable storage. A number of experimental persistent systems with built-in automatic and transparent checkpointing have been suggested. If user level programs were able to rely on the operating system automatically checkpointing their state to stable storage, they would be simpler to implement, and the whole system would become more resilient to failure. Some persistent systems rely on all applications being written in a special language, which inserts checkpointing code at relevant places automatically at compile or interpretation time, while others perform checkpointing of the whole system at regular intervals. Napier [Morrison et al., 1990] is an example of the first kind, while the Eros single-level store [Shapiro and Adams, 2002] is an example of the latter. Interval-based checkpointing has also been suggested for L4 in Skoglund et al. Because nomadic operating systems need to freeze all of their state in order to copy it to a different machine, they will need a checkpointing mechanism, though not necessarily one that is as efficient as if it were running constantly at short intervals.
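The property that resuming from a checkpoint must yield the same result as an uninterrupted run can be demonstrated with a trivial example. In the Python sketch below (our own, for illustration only), the entire program state is an explicit dictionary, so that "checkpointing" is simply a deep copy of that state halfway through a deterministic computation.

```python
import copy

def step(state):
    # one step of a deterministic computation (here: a sum of squares)
    state["acc"] += state["i"] ** 2
    state["i"] += 1

def run(state, until):
    while state["i"] < until:
        step(state)
    return state["acc"]

# uninterrupted run
direct = run({"i": 0, "acc": 0}, 10)

# checkpointed run: freeze all state halfway, "move" it, and resume
halfway = {"i": 0, "acc": 0}
run(halfway, 5)
checkpoint = copy.deepcopy(halfway)   # snapshot of the entire program state
resumed = run(checkpoint, 10)

assert direct == resumed              # same result as the uninterrupted run
```

The assertion holds precisely because all program variables are bound, i.e. the computation depends on no external input; a program reading from the network between checkpoint and resumption would not enjoy this guarantee.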
Other than as an enabler of migration, the ability to checkpoint a running operating system is also practical for very long running calculations, where a checkpoint may be backed up to another host every few hours, so that months of work are not lost in case of a machine crash. CHAPTER 4 Proposed solution The configuration and reconfiguration problems of utility computing, and the problem of being mobile without having to carry a laptop, may be solved by making operating systems nomadic. By nomadic we mean able to migrate between physical hosts without loss of state and with only negligible downtime, and able to share a host with other nomadic operating systems. Because fully virtualised systems are inefficient and complex on commodity hardware, we propose to implement the nomadic operating system on top of a modern microkernel, running a custom host environment, providing nomadic operating systems with shared abstracted access to hardware resources. Interface design considerations If different operating systems are to run in the nomadic host environment, a standard way of interfacing to the underlying environment must be defined, so that implementors will be able to refer to this standard when creating or adapting new guest operating systems. As whatever code lies outside this interface definition can be said to be "just implementation", the specification of this interface will be the final result of this work on nomadic operating systems. Needless to say, its design must be well thought out. This thesis is not about software engineering, but still, a few thoughts on when to design and implement what will be in order. Defining the way different software systems interface to one another is often a good way of clarifying difficult design decisions about the internal workings, and of eliminating superfluous features and hacks.
As such, designing the interface before implementing the actual code behind it seems like a reasonable approach, since mistakes will then be eliminated by design. However, when researching new areas, not all aspects of a given problem will be known beforehand. In the book "Patterns of Software" [Gabriel, 1996], the author argues, basing his arguments on those of famed architect Christopher Alexander, that software should be grown much like classical farmhouses were grown for hundreds of years, by adding new features as they are needed, rather than the way grand office complexes are designed today, trying to anticipate any possible need up-front, which is often impossible in practice. So on one side there is the argument that a good interface design is a necessary precursor to a good implementation, and on the other the argument that things should only be designed as they are needed. However appealing the first argument may be, and however many times the authors of this thesis have been told this by authoritative figures, it was not possible to possess all the knowledge required to design this interface (simple as it may be) before having implemented prototypes of the algorithms that were ultimately going to hide behind it. The resulting interface specification, detailed in appendix A, is the result of both planning and experimentation, and has been revised all the way through the implementation process. Choosing a microkernel To achieve the stated goal of being able to run standard software on top of guest operating systems (where "standard" is considered to mean UNIX or UNIX-like), at least the following features need to be present in the chosen microkernel: User level paging Guest operating systems cannot be trusted with access to the Memory Management Unit (MMU1) registers, because they would then be able to allow themselves access to the address spaces of other guests.
Instead, the microkernel must act as a middleman for manipulating MMU registers, and user tasks must be able to handle page faults incurred by other user level tasks. This is necessary when running multi-address space operating systems such as UNIX. Interrupt multiplexing Interrupts are the basis of preemptive multitasking and of efficient event-driven hardware access. The microkernel either needs to be able to multiplex interrupts between multiple operating systems, or it needs to support the implementation of services for shared access to hardware by other means. Efficient means of inter-address space communication Because guest operating systems have to go through a hardware abstraction layer for any kind of I/O operation, a fast way of communicating between the guest and the address space housing the abstraction is necessary. Preexisting UNIX personality Because of limited time, it will not be possible to port an existing UNIX-like operating system to the chosen microkernel2, while also implementing features for concurrency and for migration. Therefore, a functioning port of a UNIX-like operating system for the candidate microkernel must exist. Because operating systems running on top of microkernels are not kernels, but rather normal user level programs with operating system-like behaviour, they are often referred to as personalities. Intel IA32 implementation While the Intel IA323 architecture may not be the most elegant or the most simple to program for, it is cheap, and plenty of machines are available, so the candidates need to have an up-to-date IA32 version available. 1 The purpose of the Memory Management Unit in a system is described in more detail on page 35. 2 For example, the port of Linux to the L4 microkernel reportedly took 14 man-months to complete, while the entire time frame for this thesis project is 12 man-months. 3 Meaning Intel 80386 and up. In reality, we are only really interested in the Pentium Pro and higher.
Source availability While not strictly a requirement, access to source code for the microkernel will make it easier to understand and, if needed, debug. These demands leave us with two prime candidates: Mach 3.0 and L4. Mach 3.0 The Mach kernel, originally from Carnegie Mellon [Rashid et al., 1989], aimed at supporting a number of different operating system personalities, by providing a uniform interface on which operating systems could be run. The Mach kernel supported multiple threads within a shared address space, secure IPC, virtual memory management and copy-on-write mechanisms for lazy copying between address spaces. Prior to version 3.0, the Mach kernel contained a full 4.2BSD UNIX implementation, but this was moved outside the kernel in version 3.0, making Mach 3.0 a microkernel. IPC in Mach is done through ports, where a port designates a receiver of the IPC, which can grant other threads access to send messages to the object represented by the port by handing out port capabilities. IPC under Mach is asynchronous, allowing tasks to simply deliver messages, and not have to wait for the recipient to accept the message before continuing. Under Mach it is possible for a parent task to insert shared libraries into the address space of any children it spawns. The parent task may then notify the Mach kernel that any system call traps in the child task are to be handled outside the kernel, by the shared libraries. These libraries can then execute the system call, or forward it to the Mach kernel if needed. In this way, it is possible for an operating system personality to handle syscalls from its own user processes. Mach 3.0 has a working Linux implementation called MkLinux, originally developed by the Open Software Foundation (OSF) in Grenoble, France in conjunction with Apple Computer, in an attempt to port Linux to the PowerPC platform. MkLinux is based on the OSF's own implementation of the Mach 3.0 specification.
Recently, Apple has launched the MacOS X operating system, a modified Mach 3.0 kernel running a BSD personality. L4 The purpose of the original L4 implementation was to prove that microkernels could be as efficient as monolithic kernels. At the time, Mach was regarded by many as a failed experiment, and Jochen Liedtke, the original creator of L4, set out to prove that the microkernel idea was not in itself wrong. L4 has the following relevant features: Fast unbuffered IPC Current versions are able to perform inter-address space IPC in as little as 183 clock cycles (on a Pentium II, 400 MHz). Fast IPC is a requirement when building hardware abstractions. Unbuffered IPC makes migration simpler, since less state has to be extracted from the kernel. Evolved and mature The L4 specification has evolved since 1996, with multiple implementations currently in existence. Work on improving it is ongoing, which is an important issue when choosing a technology base for a new project. Linux 2.2 port As mentioned, an existing port of a UNIX-like system, as a base for a nomadic operating system example implementation, is a necessity. L4 currently only has a port of Linux 2.2, but a port of version 2.4 is underway. Limited kernel state While L4 is not a stateless kernel, the amount of kernel state relevant to migration is very limited. Source availability Current versions of L4 are available with source code under the GNU Public License. Due to these advantages, L4 was chosen as the base for the nomadic operating system implementation. A more detailed overview of L4 is presented below: L4 abstractions The L4 kernel provides a small set of abstractions for controlling and multiplexing hardware resources, such as the central processing unit (CPU), the memory management unit (MMU), memory, and peripherals such as network interface cards and hard drive controllers. The CPU is abstracted via threads of control.
A thread represents the execution of a program, and when the thread is not running on the CPU, the L4 kernel stores its CPU state, namely the instruction and stack pointers, as well as other general purpose registers. If a thread is waiting (blocked) trying to perform IPC, information about the pending IPC message is stored in L4 as well.

Threads are able to communicate and synchronise with each other by means of IPC. When thread A wishes to communicate a message to thread B, it invokes the L4 IPC syscall. If B is already waiting to receive a message from A, control is transferred along with the message to B. If B is not already waiting to receive the message, A is blocked until B starts waiting for it. Besides inter-thread communication, IPC is also used in L4 for controlling the MMU, and for delivering interrupts to programs that wish to handle them.

Multiple threads share the same address space. The combination of an address space and a set of threads is known as a task in L4. A thread is able to manipulate the registers of other threads in its own address space/task. When a new task is created, its first thread is created automatically by L4. This thread may then create a number of other threads if it desires.

Figure 1: The page fault and page reply mechanisms. The bent arrow pointing from the faulting thread's pager to L4 indicates IPC interception of the page reply message from the pager to the thread.

When the MMU raises a page fault exception because a thread is trying to access memory for which no mapping exists, L4 looks up the pager of the running (and thus faulting) thread. The page fault is transformed into an IPC message describing the fault, and the message is forwarded to the pager. The pager is itself a thread, which may then decide how to react to the page fault. It encodes its answer as a page mapping IPC message, with which it replies to the faultee.
L4 sees that this is a mapping message, intercepts it, and programs the MMU to provide the corresponding mapping in the address space of the faultee. Figure 1 shows the two steps of the process.

Interrupts are abstracted via IPC as well. A thread can ask L4 (again, via IPC) to become the handler of a given type of interrupt, handed out by L4 on a first-come, first-served basis. When an interrupt of the given type occurs, it is transformed into an IPC message, which is then forwarded to the handling thread.

IPC is the central concept in L4, and the L4 creators have put much effort into making IPC as fast as possible. By restricting who may send IPC to whom in the system, various security policies may be implemented. Clans and Chiefs [Liedtke, 1992] is the security mechanism offered by L4. A clan is a set of tasks that can communicate freely. These tasks are owned, or controlled, by a chief, which is the only member of the clan that can send IPC to outside tasks. Whenever a task within the clan tries to send IPC to a task outside the clan, the chief task will automatically intercept it, and potentially modify it before either dropping the IPC or forwarding it to the intended recipient. A chief owns all tasks in its clan, meaning it is the only task able to create and delete them, though tasks may be donated to other chiefs. Initially, all tasks are unowned, but the first server to start will usually reserve them all for itself, to later hand them out when requested, according to some security policy.

L4Linux L4Linux is a Linux personality running as a single user level task on top of L4. We use the word personality instead of kernel, so as not to confuse the different entities in the system. Because L4Linux is an adaptation of a traditional monolithic operating system to a microkernel, the operating system itself runs as a single L4 task.
If one were to implement an operating system on L4 from the beginning, it might be a good idea to split the operating system into multiple server tasks, but this was deemed too complex a job by the L4Linux implementors. The L4Linux server runs in multiple threads, but in one common address space. Please see part 2 of figure 6 on page 38 for an overview of the structure of L4Linux.

After L4 and some L4-specific bookkeeping tasks, such as the backing pager for the entire system memory, have been started, L4Linux starts. L4Linux requests all interrupts and memory in the system. This allows it to access hardware as normal, including using its standard set of device drivers.

Figure 2: L4Linux signal handling mechanism

When the hardware resources have been initialised, L4Linux starts the first user level process [4]. The first process started in a UNIX-like system is traditionally called /sbin/init. This process may then start other processes, resulting in the end in a usable system. When a process is started, an L4 task is set up to contain it, so that it gets its own private address space. The new task contains two threads: one thread for the process' own code, and an additional thread for handling signals. The signal thread is necessary because only threads in the same task are allowed to manipulate one another. The signal thread runs an infinite loop waiting for IPC from L4Linux, telling it to deliver a signal to the main thread. A signal is delivered by forcing (by manipulation of its instruction pointer) the main thread into the appropriate signal handling code. See figure 2.

Linux syscall trap instructions are translated into IPCs by a user level trap handler [5], and then forwarded to a thread in the L4Linux task. This thread handles all syscall IPCs, acts as a pager for user processes, and is multiplexed internally by Linux' own scheduler.
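The signal delivery mechanism described above can be illustrated with a small simulation, sketched here in Python for brevity. The context structure and all names are our own invention for illustration, not the actual L4Linux code:

```python
# Simulation of L4Linux-style signal delivery: the signal thread
# "delivers" a signal by rewriting the main thread's instruction
# pointer so that it resumes inside the signal handler, which later
# jumps back to the saved location. All names here are illustrative.

class ThreadContext:
    def __init__(self, ip):
        self.ip = ip          # instruction pointer (here: a function)
        self.saved_ip = None  # where to resume after the handler

def deliver_signal(ctx, handler):
    """What the signal thread does via thread-ex-regs: save the main
    thread's instruction pointer and force it into the handler."""
    ctx.saved_ip = ctx.ip
    ctx.ip = handler

def run_one_step(ctx, log):
    """Run whatever the instruction pointer currently points at."""
    ctx.ip(ctx, log)

def main_code(ctx, log):
    log.append("main")

def sigint_handler(ctx, log):
    log.append("handler")
    ctx.ip = ctx.saved_ip  # "sigreturn": jump back to interrupted code

log = []
ctx = ThreadContext(main_code)
run_one_step(ctx, log)                # main thread runs normally
deliver_signal(ctx, sigint_handler)   # signal thread intervenes
run_one_step(ctx, log)                # forced into the handler
run_one_step(ctx, log)                # resumes the original code
print(log)                            # ['main', 'handler', 'main']
```

The key point the sketch captures is that the main thread never cooperates: its flow of control is redirected entirely from the outside, which is why the helper must live in the same task.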
User processes are currently not scheduled by L4Linux, but by the L4 strict priority scheduler instead, though this is likely to change in the future. See figure 3. The original L4Linux implementation on the hand-optimised L4 x86 kernel has been shown to run 5-10% slower than native Linux [Härtig et al., 1997].

[4] We distinguish here between a process, which is the Linux object representing a program and an address space, and a task, which is a set of L4 threads and an address space.
[5] As an optimisation, it is also possible to perform the syscall directly via IPC, without the need for the mediating user level trap handler, but requiring modifications to the original program or to shared libraries such as libc.

Figure 3: L4Linux syscall handling mechanism.

Different migration schemes If not considering the survival of long running computational tasks, a simpler solution to the problem of migrating a set of services from one machine to another would be to just boot the target machine from the same filesystem as the source. The main downside to this approach is downtime, which is the time taken to boot a new system. If nomadic operating systems are to compete in this scenario, downtime has to be smaller than what can be achieved simply by rebooting on a different machine.

One problem with migration is the trail of residual dependencies often left behind, and how the risk of failure of the migrated task becomes affected by it. While one of the points of migrating operating systems instead of just processes is to eliminate residual dependencies, some migration schemes may in fact reintroduce them. If residual dependencies exist, the risk of task failure as a result of a host machine crash grows with the length of the dependency trail. A number of different approaches for migration, with various effects on downtime and risk of failure, are described in Milojicic et al.
Simple copy The task is stopped, and memory as well as task state information is transferred to the new host, where the state is restored and the task resumed. This is the simplest scheme, and transfers all data only once. The task is unresponsive during the entire copy, but no residual dependencies exist.

Lazy copy The task is stopped and kernel and CPU task state is transferred to the new host, where the task is resumed immediately. When the task pagefaults, the faulting page is fetched from the old host before the task is allowed to continue. The task is only unresponsive for the short amount of time taken to transfer task state to the new host, but performance of the task running at the new host suffers initially, since every pagefault must be resolved across the network. Lazy copy leaves a residual dependency trail on all machines on which the task has ever run, but ensures that only memory actually accessed by a task is copied across the network. As noted in Zayas, typical behaviour for a task is to access only 25-50% of the total amount of memory it has allocated when running.

Figure 4: The three schemes compared. The left side of each block designates initiation of the migration procedure, and a marker designates the time at which control is transferred to the new host. The transfer block in "Lazy copy" immediately after this point signifies the resolving of the initial page fault for the running program.

Precopy The task is left running at the source machine and all memory owned by the task is mapped read-only and copied to the new host. When the still running task attempts to modify one of its pages, a page fault is raised, and the page marked as dirty. The page is then mapped read-write to the task, which continues running. After the initial transfer, a subset of the pages will be dirty, and these are again mapped read-only and copied to the new host.
This goes on until the dirty subset is sufficiently small, after which the task is suspended at the originating host, the remaining dirty pages copied across, and the task resumed on the new host. The downtime of the task is reduced to the time taken to copy the last set of dirty pages to the new host, but a number of pages will be copied more than once. The precopy approach leaves no residual dependencies on the originating host, and thus does not suffer from the increased risk of failure present in lazy copy. Figure 4 shows the comparable differences between the mentioned schemes.

Because of the residual dependency problems of lazy copy, only full copy and precopy were implemented in this project for migration via the network. Lazy copy may be interesting if migrating via removable storage, as described at the end of this section.

It is possible to improve upon the precopy scheme, ensuring that frequently used memory pages are sent as rarely as possible. This algorithm may be described as queued precopy, and can be implemented by using a queue Q and a set W. Q is a queue of read-only marked pages ready for transfer, and W is a set of writable pages which cannot be transferred. T is the total set of pages to be migrated.

1. Initially, W = T and Q is empty.

2. Each page in W is removed from W, marked read-only in the MMU page tables, and is queued on Q. This step is rerun at a certain interval, until migration completes, and does not need to be atomic.

3. If the task incurs a page fault by attempting to write to a page in Q, the page is removed from Q and is inserted into W. The page is marked writable in the MMU.

4. A background thread monitors Q. It continually dequeues the first page on Q and transfers it across to the target system. If Q becomes empty (other than initially), the size of W may be gauged, and the time needed for a transfer (and thus the downtime of the system) be estimated. If the downtime is acceptably small, the task is suspended, and the remaining pages in W are marked read-only (thereby queueing them on Q). The pages now in Q may be transferred across, and the migration completed.

The tendency should be for W to shrink in size rather than grow. If not, the task is doing a lot of sparse writes, and must be slowed down by delaying the mapping of writable pages in step 3, or by decreasing the update interval of step 2. The reason for Q being a queue rather than a set is that pages which are infrequently written to will stay in Q longer and be transferred first, reducing the number of double page transfers. In addition to support for queueing operations, Q must also allow for fast lookup and removal of arbitrary elements, as needed by step 3.

The queued precopy algorithm has not yet been implemented. Currently a simple precopy, performing a fixed number of iterations, is used instead.

When migrating via removable media, it will be desirable to be able to resume operation as soon after the media is inserted as possible. For this scenario a variant of lazy copy, or on-demand paging, with pages being loaded from the media as they are needed, becomes interesting. This is similar to how programs are loaded in UNIX-like systems. Though the system will run slowly at first, it should start responding almost immediately, provided the media is of decent speed. More challenging is the initial checkpointing of the operating system image onto the media. The user should not have to wait overly long for this to happen. Infrequently changing pages may be copied to the media ahead of time, hopefully leaving only a small percentage remaining when the user decides to detach the removable media from the host system. An algorithm similar to the one outlined for network migration above could be used, although it would be kept running at all times.
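The bookkeeping of such a queued precopy scheme can be sketched as a small simulation (Python; page numbers stand in for real pages, the "transfer" is a list append rather than a network send, and all names are our own):

```python
from collections import OrderedDict

# Sketch of queued-precopy bookkeeping. Q holds read-only pages in
# transfer order (an ordered dict gives queue order plus the fast
# removal that write faults require); W holds writable, dirty pages.

def queued_precopy(total_pages, write_trace, max_rounds=4):
    W = set(total_pages)            # initially W = T
    Q = OrderedDict()               # Q starts empty
    sent = []                       # pages copied to the target

    for _ in range(max_rounds):
        for page in sorted(W):      # re-protect dirty pages and queue them
            W.discard(page)
            Q[page] = True
        for page in write_trace:    # simulated write faults this round
            if page in Q:
                del Q[page]         # fast removal from the middle of Q
                W.add(page)         # page becomes writable again
        while Q:                    # background thread drains the queue
            page, _ = Q.popitem(last=False)
            sent.append(page)
        if not W:                   # nothing dirty: downtime is zero
            break
    # Suspend the task and flush whatever is still dirty.
    for page in sorted(W):
        W.discard(page)
        sent.append(page)
    return sent

# Pages 0-3; page 2 is written in every round, so it keeps being pulled
# back into W and only goes across in the final, suspended phase.
sent = queued_precopy([0, 1, 2, 3], write_trace=[2], max_rounds=2)
print(sent)   # [0, 1, 3, 2]
```

The simulation is idealised (faults are applied once per round, before the drain), but it shows the property the text argues for: rarely written pages cross the network early and exactly once, while hot pages are deferred to the short suspension window.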
Continuously copying pages to the removable media may seem a waste of CPU, but the utilisation of direct memory access (DMA) for copying pages should limit the overhead.

Guest interface considerations Since the nomadic operating system is just an application of the L4 microkernel architecture, most of the interface will have to adhere to its specifications. The choices left to us mainly have to do with address space layout, pagefault handling, checkpointing behaviour and hardware abstraction semantics. The different options available when designing the interface between host environment and guest operating system are outlined below:

Address space layout Since L4 allows arbitrarily nested address spaces, it is possible to define whatever layout appears suitable. Most operating systems will expect a flat, contiguous address space spanning the total amount of physical memory. Often, hardware architectures attach special meaning to various memory areas, using them for providing access to hardware control registers, or for information put there by the firmware at boot time, for example the total amount of RAM, hard drive geometries, and so forth.

Figure 5: The proposed address space layout of the NomadBIOS system. NomadBIOS controls all available memory and maps it to the L4Linux tasks, which in turn map it to their user level processes.

Since guest operating systems will not be able to directly access any hardware, but will have to go through various IPC based abstractions, theoretically no special memory areas will be needed for hardware access. In practice, a few special areas will have to exist. L4 provides a special kernel information page, from which values such as a global time tick counter may be read. A special page with guest-specific information will also be provided by the host environment.
Shared pages for zero copy hardware access, for instance for incoming or outgoing network datagrams, may also be necessary for performance. The locations of such special pages in the guest address space should be determined by the guest, so that different needs and traditions of guest operating systems may be catered to. A model of the NomadBIOS address space layout is seen in figure 5.

Page fault and page request handling If a thread in L4 tries to access a memory page for which no mapping exists, the resulting page fault is translated into an IPC message, destined for the thread acting as pager for the faulting thread. It is then up to the pager thread to decide whether a page should be provided, and to return it. The faulting thread is blocked until a mapping has been provided for it. The pager may also choose to take other action, such as to kill the faulting thread on illegal memory access.

In L4, memory page mappings are sent via IPC from the pager to the faulting or requesting thread. These special IPC messages are intercepted by the L4 kernel, which performs the actual maintenance of the MMU page tables. For L4 programs, an IPC protocol known as the sigma0 protocol exists, named after the sigma0 program which is often used as the root pager in an L4 system. The protocol allows a caller to request mappings for special pages, such as the kernel info page, and for special 4MB "superpages", available on the Intel Pentium family of CPUs. The use of superpages lessens the number of entries in the MMU page tables, so that address space switches become faster.

When designing a pager interface for guests, the question becomes whether to rely entirely upon page faults, lazily providing pages as they are needed, or to just provide all mappings up front. In favour of the lazy approach is the fact that not all the memory available to a guest may actually be needed, and so can be saved for other guests in the system [6].
Also, the code for resuming a suspended guest becomes simpler, because the normal page faulting mechanism will make sure to request pages as they are needed. Against the lazy strategy is the impossibility of requesting special pages by other means than the attachment of various semantics to page locations, for example agreeing that the kernel info page is always at address 1000 hexadecimal, and that all other page faults result in superpages. Imposing such special rules onto the address space layout may hurt future uses of the system, and should be limited as much as possible. The best solution is probably to allow the guest to provide hints as to its wishes or demands for special mappings, but not to provide any actual mappings before they are needed. In this manner, only the memory for recording the hinting information is wasted, as opposed to perhaps almost an entire address space.

Checkpointing behaviour Currently no stable implementation of L4 provides all the features needed for wholly transparent checkpointing of a running program. Even though user level pagers provide easy access to all in-memory program state, some information about it is only available to L4 itself. If an L4 thread is in the kernel doing IPC, there is no way for another thread to determine this, and no way of suspending it in a manner that allows it to be restarted safely later. A thread that is manipulated from the outside, for example as part of a suspend or other signal operation, cancels any pending IPC, returning an error code. This means that only the thread itself (plus possibly the thread at the other end of the IPC) will know that IPC was ongoing, and should possibly be restarted for things to continue correctly.

[6] Even when providing mappings lazily for a guest, the guest must be guaranteed access to the entire address space if it needs it, or alternatively the guest will have to be suspended and relocated to a machine with more memory available.
The amount of memory available to a traditional operating system does not change after the system has started.

Experimental versions of L4 have been implemented [Skoglund et al., 2001], wherein L4 itself uses an external pager thread to provide memory for kernel-internal thread control blocks (TCBs), allowing a trusted thread, whose identifier is specified at compile time, to access, and possibly checkpoint, L4 internals. This approach will only work if the kernel and the checkpointer have a common understanding of the internal structures, meaning that the kernel cannot be upgraded without also upgrading the external pager, and vice versa.

An alternative method is to handle checkpointing non-transparently, by requiring all threads in the program to carefully check the error codes returned by the L4 IPC syscall, thereby becoming able to safely restart cancelled IPC operations after the checkpointed program is revived. Even though this approach comes with the overhead of a few machine instructions after each relevant IPC in a guest operating system, it was deemed more elegant and better suited for this project than simply wrenching open the L4 internals. Checkpointing will not be entirely transparent to the guest, which will have to respond to a special call to wrap up all its running processes, so that all their states are entirely in main memory, before it can be checkpointed safely. A positive side effect of this non-transparency is the ability for the guest to flush unimportant caches, for example those of network file systems, or take any other measures to compact the guest memory image, before it is migrated over the network.

Hardware abstractions Although L4 itself does not impose any restrictions on accessing the hardware devices available, it makes no effort to aid such access either. A task can request access to specific hardware interrupts, and to special memory mapped I/O ports.
Other than this, the task has to communicate directly with the hardware itself. Access to hardware devices is handed out by L4 on a first-come, first-served basis, so if multiple tasks are to share the same devices, one task will have to act as a proxy for the others. For simplicity, only a shared Ethernet device will be implemented, because network abstractions for all important I/O types are present in Linux. Some considerations about how to best share a network interface are presented on page 43.

CHAPTER 5 Working with the L4 kernel

This chapter introduces the central L4 concepts, and gives an overview of the most important programming primitives, in order to provide the reader with a basic understanding of how one might implement a classic full scale operating system on top of L4, and a feeling for the work involved in implementing features such as operating system suspension and resumption.

L4 basics L4 provides a minimal set of primitives (syscalls) for manipulation of address spaces and threads, and for performing IPC between threads. A task is a set of threads sharing the same address space. When a new task is created, its first thread is started automatically. This thread may then start additional threads if necessary.

A system call (syscall) in L4, as in most other kernels with memory protection, is invoked by means of a trap instruction. Traps are special instructions causing the CPU to enter supervisory mode and jump to a predetermined address containing handling code for the specific trap number, as specified in the trap instruction. The handler receives the general purpose registers of the program triggering the trap, and will usually treat these as parameters for the syscall. When the syscall is complete, any return values may be placed in the registers, and control returned to the trapping program by means of a special return instruction, which also changes the CPU state back to user mode.
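The trap-and-dispatch cycle just described can be mimicked in a few lines (a Python toy model; the registers are a dict, and the trap number and handler are invented for illustration, not the real L4 vector assignments):

```python
# Toy model of trap-based syscall dispatch: the "kernel" keeps a table
# of handlers indexed by trap number; a trap hands the caller's general
# purpose registers to the handler, which writes results back into them.

TRAP_TABLE = {}

def handler(trap_no):
    """Register a handler function under a given trap number."""
    def register(fn):
        TRAP_TABLE[trap_no] = fn
        return fn
    return register

@handler(0x30)
def sys_add(regs):
    # Hypothetical syscall: treat eax/ebx as parameters, put the
    # return value back in eax, as the text describes.
    regs["eax"] = regs["eax"] + regs["ebx"]

def trap(trap_no, regs):
    """Enter 'supervisor mode', run the handler for this trap number,
    then 'return to user mode' with the (possibly updated) registers."""
    TRAP_TABLE[trap_no](regs)
    return regs

regs = trap(0x30, {"eax": 40, "ebx": 2})
print(regs["eax"])   # 42
```

The real mechanism differs, of course, in that the mode switch and register hand-over are done in hardware; the sketch only shows the calling convention: trap number selects the handler, registers carry parameters and results.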
The original L4 kernel implemented seven [1] syscalls, each with a separate trap handler invoked (on Intel IA32) with the int instruction, with trap numbers 0x30 to 0x36 [2]. The L4 syscalls generally deal with task creation and deletion, thread register manipulation, and IPC. Thread- and task identifiers are specified by means of a (task, thread) pair, stored in one or two CPU registers, depending on implementation. Below are listed the syscalls most regularly used:

task-new Creates or deletes a task with a specific number. Whether the call acts as creation or deletion depends on the parameters given. For a new task, initial instruction and stack pointers for its first thread are specified, as is an identifier for a thread acting as a pager for the first thread. If the thread incurs a page fault by addressing unpaged memory, this pager is contacted via IPC. When a task is initially created, it has an empty address space.

[1] In later specifications the number of syscalls has grown to 12, with new calls for optimised intra-address space IPC and improved scheduling and thread management.
[2] Note that this is a little different from, for example, Linux, which multiplexes all syscalls through a single trap vector.

ipc This syscall handles all communication and synchronisation between threads. Depending on its parameters it can act as send, wait, as well as combined send-and-wait. IPC is unbuffered and thus always blocking. IPC messages may be entirely register-based, or perform memory copying and memory mapping, and so no special virtual memory manipulation syscalls are required. IPCs can be given a timeout value, and a thread can be put to sleep simply by performing IPC with the destination parameter set to nil.

thread-ex-regs The thread-ex-regs syscall performs thread register manipulation.
It allows modification of the instruction and stack pointer registers of a running thread, and creation of threads by initial specification of these registers. The general purpose CPU registers of a thread cannot be read directly, but instead the thread-ex-regs call may be used for redirecting a thread to a function which stores the registers into a memory location. Thread-ex-regs acts only on threads in the local task, so signal handlers and similar functionality have to be implemented as extra threads in this task.

Checkpointing The L4 microkernel was not designed with checkpointing in mind, but provides a number of features which make implementing checkpointing possible. The main feature is the possibility of having user level threads handle page faults incurred by other user level threads. When a user level program is able to handle page faults, and to manipulate page mappings and page protection flags for other programs, it becomes possible for that program to copy (and thus checkpoint) the data contained in these pages, and to restart the program from the copied data. Other than the data which is kept in main memory, the CPU has a number of registers, all of which must also be saved and restored for a checkpoint to be complete.

L4 has no direct support for transparently checkpointing the state of an entire set of running tasks from another user level task. User level programs do not have access to kernel internals such as thread control blocks (TCBs), so there is no way to determine if a given thread is engaged in an IPC operation at a given time, and thus no way to safely suspend it. However, if a thread-ex-regs operation is performed on a thread performing IPC, the IPC syscall will return a special error code, alerting the thread to the fact. The thread is then able to handle this case correctly, for example by retrying the syscall. If all threads cooperate in this manner, suspension still becomes possible, although not entirely transparent.
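The cooperative retry convention amounts to a small loop around every IPC call, sketched here in Python. The ipc stub, the "canceled" code and the counting are all invented for illustration; the real error codes are defined by the L4 ABI:

```python
# Sketch of the cooperative IPC-retry convention: a suspend operation
# cancels any in-flight IPC with an error code, and the thread itself
# restarts the call once it is resumed. Names are illustrative only.

ECANCELED, OK = "canceled", "ok"

class FakeKernel:
    def __init__(self, cancellations):
        # Number of times a suspend will cut an IPC short mid-call.
        self.cancellations = cancellations

    def ipc(self, msg):
        if self.cancellations > 0:
            self.cancellations -= 1
            return ECANCELED   # thread-ex-regs cancelled the pending IPC
        return OK

def send_with_retry(kernel, msg):
    """What every thread in a checkpointable guest must do: check the
    IPC result and restart the syscall if it was cancelled."""
    attempts = 0
    while True:
        attempts += 1
        if kernel.ipc(msg) == OK:
            return attempts

# The thread is suspended (and its IPC cancelled) twice mid-call,
# so the message is retried until it finally goes through.
attempts = send_with_retry(FakeKernel(cancellations=2), "page-reply")
print(attempts)   # 3
```

The cost per successful IPC is a single comparison, which is the "few machine instructions" of overhead the text refers to; the price is that every thread in the guest must follow the convention.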
One might argue that malicious threads will be a problem, but threads that refuse to cooperate are only able to hurt themselves, and may be killed by the system.

In the case of Linux running atop L4, every Linux process is implemented as an L4 task, consisting of a thread running the actual process code, as well as a signal handling thread; see figure 2 on page 25. The signal handling thread can be extended to handle suspension of the main process thread, without changing the Linux semantics of the system. A Linux process is still able to stop or otherwise damage its L4 signal thread, either deliberately or by accident, but this is equivalent to the program crashing, and may be handled similarly. By using the signal thread to implement suspension, transparent checkpointing of Linux processes becomes possible. Similar techniques may be used for other guest operating system personalities.

Virtual memory All modern multi-user operating systems make use of the memory management unit (MMU), today an integrated part of most full scale CPUs. Originally, there was a one-to-one correspondence between the memory addresses used by programs and the location of the memory bits inside physical storage. Furthermore, any program would be able to access any location at will. This was deemed impractical for a number of reasons, mainly:

Malicious programs were able to access arbitrary addresses, leaving innocent programs open to attacks or as victims of other people's buggy programs.

Swapping or paging unused program data to disk automatically was not possible, because there was no way to track what memory was being accessed. Programs too large to fit inside physical memory would have to implement their own disk-paging code.

In systems using dynamic allocation, memory would become fragmented over time, leaving a reboot as the only means of defragmentation.
The MMU introduces a layer of indirection between the CPU and memory, allowing encapsulation of untrusted programs inside closed address spaces, demand-paging of memory from disk, and easy defragmentation of memory. The MMU uses a number of memory tables to dynamically translate memory accesses into physical addresses, with the most frequently used combinations being cached for performance. MMUs are implemented differently in different architectures, but they commonly handle memory in rather large chunks (pages), so that the translation tables do not have to take up the entire physical memory. On the Intel 80386 and 80486 CPUs the page size is 4kB, whereas the Intel Pentium supports 4MB pages as well. Large page sizes result in smaller page tables, at the cost of coarser granularity.

Programming the MMU registers is a privileged operation, and most traditional monolithic operating systems, such as Linux, only allow the user very limited control of the MMU. User-level tasks all reside in their own memory spaces with a fixed layout, and are unable to directly page other programs. Microkernels, such as Mach and L4, on the other hand, often allow user-level programs to manipulate the system-wide pagetables in some secure manner. This approach is more flexible in that it allows programs to implement their own paging policies, but has also been claimed by some to be less efficient.

For the purposes of this project, the need for user-level paging is clear. Guest operating systems are untrusted, and thus need to run at user level, but they also need to be able to implement arbitrary paging policies for their own client programs, so they need to be able to program the MMU themselves.

In L4, virtual memory is abstracted into the IPC mechanism. When a thread accesses memory for which no mapping has been provided, the MMU triggers a page fault exception, which goes to L4.
L4 determines which thread caused the page fault (which is easy, since it must have been the thread currently running) and looks up the corresponding pager thread. The page fault is translated into an IPC message describing the page fault (faulting instruction pointer, faulting address, read or write operation) which appears to come from the faultee, and is sent to the pager. Since IPC is blocking, the faultee blocks. The pager may now decide how to handle the page fault, for example by allocating some more memory for the faultee, and answer via another IPC message. The answer is intercepted by L4, which performs the actual page table and MMU manipulations. Figure 1 on page 23 illustrates the two steps of the process.

Since page faults look just like IPC, it is entirely possible for a program to contact a pager by means of IPC instead of actually pagefaulting. This is sometimes useful, for example when setting up a large initial address space. It is also possible to receive page mappings from multiple pager threads if desired.

Different L4 versions The L4 specification has undergone a number of changes since its initial release. The main traits of each version are outlined below. The focus of the descriptions is the Intel IA32 platform, since this is the platform initially chosen for this project. An overview of the plans for future L4 development is given in Liedtke et al.

Version X.0 This is the original specification for the L4 x86 kernel, described in [Liedtke, 1999]. It specifies a 32 bit thread identifier, part of which is a (task, thread) pair of 8 plus 6 bits for the identification of the task and the thread itself, resulting in a maximum of 256 different tasks with 64 threads each. Because the thread identifier parameter only takes up one 32 bit register, three registers are available for user data in IPC messages. Aside from the original L4, the Hazelnut C++ implementation also follows the X.0 specification.
Version 2

This version is a derivative of X.0, and is used by the Fiasco C++ kernel. Version 2 [Hohmuth, 1998] sacrifices one of the user data IPC registers in exchange for having 64 bit thread identifiers, leading to 2048 different tasks with 128 threads each. Due to its larger number of tasks, our work has been based mainly on Fiasco.

Version X.2

While no public version is available, the Pistachio L4 kernel is based on the X.2 specification [Dannowski et al., 2002]. X.2 does not employ the fixed size pair from the earlier versions, but rather a single 18 bit thread identifier, resulting in room for up to 2^18 threads. Because every thread can have its own address space in X.2, it becomes possible to have up to 2^18 tasks as well. Version X.2 is also the first version of L4 to specify multiprocessor support. Another new feature is a special fast-path IPC call for use between two threads in the same address space, described in Liedtke and Wenske.

Chapter 6: Implementation

This chapter describes two approaches to implementing a workable host environment for nomadic operating systems, and the changes made to L4Linux to turn it into an example nomadic operating system. A number of things may go wrong when making fundamental changes to the workings of a complex and evolved operating system like Linux, and the reader shall be spared many of the details and tales of dead ends encountered on the way. Suffice it to say that not everything was as simple to implement as it may seem from the text.

Host environment

The host environment is a user level program running on top of L4, functioning as a service layer for guest operating systems, and as a control panel for controlling operations from the outside. The host environment implements both an interface with which guest operating systems may interact to access machine resources, and a control interface by which the guests may be managed by a human operator.
Two different host environments were implemented, one based on the existing L4Linux, and one made from scratch using the Utah OSKit [Ford et al., 1997]. Using L4Linux as a host environment has the advantage that management tools may be easily implemented as user level processes, and that these tools can be remotely accessed by standard means such as Telnet or Secure Shell. L4Linux also includes the full Linux 2.2 set of device drivers. The downside is the overhead of running a full L4Linux kernel. The native host environment – called NomadBIOS – does away with this overhead, at the expense of having to implement network stacks and drivers for hardware, as well as a protocol for remote management.

L4Linux as a host environment

Initially this environment was used as a development platform, in order to get multiple L4Linux guests running at the same time.

Figure 6: 1) Native Linux interfaces hardware directly and runs in supervisor mode. 2) L4Linux running as a user level process communicates with L4, which in turn handles IRQ forwarding via IPC. 3) NomadBIOS as middle tier, forwarding IPC messages to several guest Linuxes.

When using L4Linux as the host environment, the guests may be run encapsulated in ordinary Linux processes, or alongside the host L4Linux, in a disjoint physical memory area. Running guests as ordinary Linux processes would allow the host environment to take advantage of the L4Linux built-in to-disk paging algorithms, thereby increasing the perceived amount of memory accessible. However, the original L4Linux was hard to fit inside the restrictions normally faced by a user level L4Linux process, and instead it was decided to run the guests alongside the host. The current version of nomadic L4Linux would be better suited for running as an L4Linux process though, and the avid reader should feel free to experiment.
When running the guests alongside the host environment, resources are allocated from the global pool of resources, rather than from the host environment's pool of resources. In this configuration, the host environment must be instructed not to grab all physical memory during boot, and not to allocate every L4 task, but rather just to allocate enough to get the host environment running. Because the host environment needs to be able to access unallocated resources directly, it was implemented as a loadable module which could be inserted into an unmodified L4Linux. The Ethernet abstraction is the only modification necessary to L4Linux for it to serve as a functioning host environment. Filter and bridging code is inserted into the Linux network packet handling loop, and packets destined for a guest are forwarded via L4 IPC.

The NomadBIOS host environment

The NomadBIOS native host environment is similar in nature to the L4Linux kernel module solution with regard to memory allocation and task allocation. The main difference is that the network interface card, rather than being interfaced by the host L4Linux, must be accessed by a native L4 server. The native host environment comprises a number of separate L4 server threads, each responsible for one distinct feature of the host environment. The servers are highly independent, and are rarely accessed directly by servers other than the control interface server, so that theoretically they could be made to run in separate address spaces, protected from each other. Each server has, in addition to an IPC guest interface, a number of back-end interfaces allowing, for instance, the control server to register a new client with the network server. These interfaces serve to tie the BIOS servers together, and cannot be accessed by guest operating systems. The network server needs direct access to the Ethernet network interface card (NIC), and should preferably support a wide range of hardware.
To this end, the Utah OSKit set of operating system components was chosen. Apart from supplying low-level NIC drivers, OSKit also provides a TCP/IP stack, on top of which a remote management protocol has been implemented. This protocol allows NomadBIOS to be controlled from the outside by a small command line tool called runclient. Because the native host environment acts more like firmware (or Basic Input Output System – BIOS) providing abstracted and uniform hardware access, than like an operating system, it is referred to as the NomadBIOS.

Figure 7: An overview of the services running in NomadBIOS.

Choice of host environment

While both environments were implemented during the course of this work, the L4Linux kernel module option was eventually abandoned, as it became clear that the benefits of running a full L4Linux as a host were limited, and maintaining two separate interfaces became a burden. No benchmarking was done using the L4Linux host environment. The L4Linux host environment primarily served as a convenient test-bench for experimentation, since the module could be unloaded and re-loaded quickly, rather than having to reboot the entire system. When the host environment interface was stable, focus shifted to the NomadBIOS native host environment solution, after which the L4Linux version became obsolete. This is not to say that the L4Linux host environment would not serve a purpose. If revived, it would make it possible to add nomadic Linux servers to an already running L4Linux on demand, making it an optional service of the host L4Linux, rather than the dedicated purpose of the host. One would also gain free access to the complete set of Linux device drivers. Since both implementations provide the same interface, the choice of host environment may ultimately be left to the end user.

Sharing host resources

As a rule, the hardware resources in a computer system do not support shared access.
If, for example, two programs write to a harddrive without some entity coordinating their access, the contents of the drive are likely to end up in a corrupted state. The operating system traditionally fills this coordinating role, arbitrating access to resources on behalf of multiple users. In the nomadic operating system setup, operating systems become applications, and the role of the sole arbitrator is filled instead by the host environment. Below we list the resources that the host environment needs to arbitrate access to. Apart from the physical hardware, kernel resources such as task numbers must also be shared.

L4 task numbers

Current L4 kernels have a fixed number of tasks which must be shared between the host environment and the guest operating systems. For the Hazelnut kernel this number is 256, and for the Fiasco kernel 2048. Each task represents a separate address space and a set of up to either 64 or 128 threads. Of the tasks, four are used for special L4 servers, and one is used for the NomadBIOS. The remaining tasks must be shared by the guest operating systems, limiting the number of guest kernels able to perform useful work. To make the best use of the scarce task resource, various allocation policies can be implemented. We considered two approaches to this issue:

Splitting the available tasks into equally sized consecutive ranges. If each guest were given 255 tasks, eight guests would be able to run on a Fiasco-based system at once, which would be satisfactory at the moment.

Alternatively, different guests could be allowed different numbers of tasks, similar to how they are allowed to use different amounts of memory. Tasks could be requested from the host environment on demand, so that guests needing few tasks would make more room for guests needing many.
Because the former approach is simpler to implement than the latter, and since upcoming versions of L4 allow as many as 2^18 tasks, which will probably solve the problem for all practical purposes, fixed size consecutive task ranges were chosen.

Physical memory

The Fiasco L4 kernel, in its current incarnation for the Intel IA32, supports no more than 256MB of physical memory, though this limitation is currently being addressed by the Fiasco team. As with the tasks, there is a choice of pre-allocating physical memory to a guest, or handing it out as required. If memory is only allocated when actually needed, more guests will be able to run in less memory, in the case where not all guests actually use their full allowance. However, most modern systems will quickly utilise all available memory for caching filesystem contents, so unless that behaviour is changed there is no point in allocating memory on demand. Memory for guests is always handled in 4MB superpage chunks which, on the Intel Pentium, is an efficient unit for the MMU, resulting in small page tables and few page faults. A simple bit vector is used for keeping track of reserved and free pages. The only exception is when a guest is migrating. In that case, 4kB pages are used because of their finer granularity.

Timer interrupt

Since guest access to the physical hardware on the host machine is not allowed, it is not possible for the guests to receive timer interrupts. The timer interrupt in Linux is used to allow preemptive scheduling to occur, to update process runtime information, and to update internal kernel timers. Since every L4Linux process is a native L4 task, the scheduling of processes is handled natively by L4's own strict priority scheduler. It is possible for a task in L4 to handle the scheduling of other L4 tasks, by donating the remainder of its own timeslice to another task. The current L4Linux does not attempt to control scheduling of user level tasks though.
Rather than having the host signal each guest every time a timer interrupt occurs, idle guests go to sleep for a short interval using the timeout feature of L4 IPC, polling for things to do (for example running the bottom half handler for incoming network packets) when they wake up, and then go back to sleep. Because preemption and scheduling are handled in L4, it should be possible for guest operating systems to become entirely event-driven, with no need for any external timer ticks. Unfortunately, L4Linux (which is derived from Linux 2.2) relies too much on the presence of timer ticks to be easily turned into an entirely event-driven system.

Realtime clock

Linux implements the gettimeofday system call, which returns the current time in seconds since 1/1-1970 00:00:00, with millisecond accuracy. This is much higher accuracy than what can be obtained from the motherboard real time clock. The IA32 architecture exposes a high precision time stamp counter (TSC), a 64-bit value which is incremented at every clock cycle. Linux uses this counter to calculate the current time, in order to get the high accuracy. Since the TSC counts clock cycles, the speed of the actual processor must be known in order to get a correct clock multiplier for converting between clock cycles and wall time. When an operating system migrates, a new value for the clock multiplier must be obtained in order to avoid clock skew due to migrating to a faster or slower CPU. NomadBIOS exposes the clock multiplier in the guest info page, and it is the guest's responsibility to re-read this field after migration.

Network packet filtering

The physical network interface has to be multiplexed between the guest operating systems, as well as the host environment's own network stack. If a network packet arrives at the host, its destination address must be examined, and a decision made as to whom the packet should be relayed, or whether it should simply be ignored.
This filtering can be done in several ways, depending on the tradeoffs one is willing to make in the areas of flexibility, security, and performance. The flexibility choice is whether guests should be able to dynamically decide which networking protocols are supported, or whether it is acceptable to hard-code a fixed set of protocols into the host environment. The security choice is whether guests can be trusted to examine all incoming packets, even packets not destined for them, or whether this should be considered a security breach. If guests can be trusted, all packets may simply be forwarded to all guests, deferring the filtering decision to them. The performance choice is related to the security choice above. Consulting every guest for every packet will lead to poor performance if naively implemented, but on the other hand operations may be sped up by allowing all guests to read packets from a memory buffer shared between all of them, as opposed to having to copy packets from host to guest address spaces (or grant or map memory pages, an operation which takes some time as well) in order to enforce access restrictions. If flexibility is important, it is not possible to implement a static filter in the host. Instead, either all clients must be consulted for each packet, or clients must be able to specify filtering rules in a general way. The DPF [Engler and Kaashoek, 1996] system, part of the Exokernel project, allows filters to be specified in a domain-specific language, and downloaded into the host. DPF employs dynamic code generation, aided by runtime knowledge of installed filters, and claims performance on par with, or superior to, hand-crafted packet filters. For the purposes of this project, flexibility is less important than ease of implementation and performance.
The only protocol currently supported is IP version 4, which means that a destination for an incoming packet can be determined solely by looking at the 32-bit destination IP address of the packet. Currently, the list of clients is searched linearly, as the number of clients is assumed to be small (less than ten), but the search could easily be implemented as a binary search instead, reducing the complexity to O(log n). If more flexibility were desired, an approach like the one taken by DPF would be necessary. Packets are copied from the host to the destination client, incurring some overhead. This is done for the security reasons described above, as well as to avoid garbage collection issues when sharing memory between host and clients. To increase performance, packets are queued up at both host and client, before being transferred via IPC in larger chunks, thereby amortising address space switching costs. Performance might be further improved by avoiding copying, but this is left for future work.

Adapting L4Linux as a guest operating system

In order to turn L4Linux into an example nomadic operating system, able to be seamlessly migrated between hosts on a network, a number of changes were necessary. These changes will be described in more detail shortly. The only hardware abstraction implemented by the current host environment is an Ethernet device. However, most Linux services like file systems (NFS), terminal access (SSH), graphical display (X), and so on, are already abstracted via the network in Linux, so it is possible to run a full-blown operating system on this abstraction alone. While this dramatically simplifies the abstractions needed to get several L4Linuxes running side by side, it also reduces the complexity of migration, because only data inside the guest address space has to be transferred. Network state in the abstracted layer does not necessarily have to be transferred, since applications designed for the Internet should tolerate a few lost packets.
For some applications, this setup is fine, while for others it is inadequate. In the future, other types of abstractions (mainly access to hard drives) will be implemented as well. The nomadic L4Linux was configured to boot from an NFS-exported file system on another server, and was accessed from the outside via Secure Shell (SSH). Both the X Window System and Virtual Network Computing (VNC) were tested and found to work for access to applications with graphical user interfaces. Because the actual guest operating system configuration is entirely customisable, other file systems, for example AFS [Satyanarayanan, 1990] or Intermezzo [Braam and Nelson, 1999], might be used instead of or as a supplement to NFS.

Hardware abstraction layers

As described, standard L4Linux requests all available interrupts from L4 at startup. From this point on, hardware is accessed as normal, using the standard Linux set of device drivers. If multiple Linuxes are to share the same physical hardware, they cannot be allowed to allocate any interrupt, so a hardware abstraction layer with multiplexing functionality must be created. The use of a uniform abstraction layer has the added advantage of allowing guest operating systems to migrate between systems with dissimilar peripheral hardware, without any trouble. Below the abstraction layer, code must exist to access the physical hardware, either by implementing its own drivers, or by being based on a host operating system which provides the drivers for it. The only abstraction actually implemented for this project is a shared Ethernet device. This device is implemented in NomadBIOS on top of the OSKit network drivers, and as a bridging plug-in in the network layer of the host version of L4Linux. The L4 IPC mechanism is used for passing network packets between guest and host address spaces.
L4Linux already implemented a driver for an IPC-based Ethernet adapter, and this was used as a base for the client side of the implementation, though over time it was heavily modified. To lower the number of context switches necessary when sending or receiving many packets in short succession, packets are not forwarded immediately, but gathered into a buffer of up to 16 packets, before they are sent as a single IPC message. The buffer is flushed either if full, or if a small timeout expires, to keep network latency low. This optimisation somewhat amortises the cost of the added context switches incurred by using an abstraction. It is used both for incoming and outgoing traffic, and is inspired by the work done in VMWare [Venkitachalam and Lim, 2001]. The network multiplexing and demultiplexing (filtering) implementation is described in more detail in its own section below. Hardware abstraction layers for sharing other devices such as hard drives have not yet been implemented, but it is reasonable to assume this to be a rather trivial task, and this should be addressed in the future.

    repeat
        error := ipc_send()
    until error not in {CANCELED, ABORTED};

Figure 8: Handling IPC send cancellation.

    repeat
        error := ipc_receive()
    until error not in {CANCELED, ABORTED};

Figure 9: Handling IPC receive cancellation.

Disabling interrupts

For critical sections, Linux (on Intel hardware) uses the cli and sti instructions to turn off and on all interrupts. While the original L4Linux was allowed the use of these instructions, this is clearly not acceptable when running multiple systems at once, as it would allow a guest to monopolise the CPU. Fortunately, the same restriction applies to realtime applications, and this has already been addressed in Härtig et al. Recent versions of L4Linux can be configured to emulate the effects of cli and sti using a queued lock instead. Because some sections of the Linux timer calibration code need a real cli context to be precise, these had to be relocated to the host environment.
Suspending L4Linux and its user processes

Before a guest L4Linux instance can be migrated to another host, it has to be suspended to a safe state. This means that all threads, both in the L4Linux task and in all user processes, have to save CPU and kernel state to main memory, from where it may later be retrieved. The CPU state of a thread consists of its instruction pointer, stack pointer, general purpose registers, and also the state of the floating point unit if used. The kernel state of a thread describes its possible involvement in IPC. A thread can be in one of the five following states:

Ready: Not doing IPC, ready to run when scheduled.

Waiting to receive: The thread is waiting for some other party to start a send operation.

Waiting to send: The thread has invoked a send operation, but the receiving party has not invoked a corresponding receive operation yet.

Receiving: A receive operation is in progress. IPC is able to perform large copying operations, and may be subject to scheduling and to interruption by other threads.

Sending: As above, in the opposite direction.

It is possible to perform two-phase IPC in L4, allowing both a send operation followed by a receive operation, in a single syscall.

    repeat
        error := ipc_call()    (two-phase: send, then receive)
    until error not in {SEND_CANCELED, SEND_ABORTED};
    while error in {RECEIVE_CANCELED, RECEIVE_ABORTED} do
        error := ipc_receive();

Figure 10: Handling two-phase IPC cancellation.

There is no way to determine either the CPU or the kernel state of a thread from outside the thread, so this has to be done by the thread itself. The CPU state is obvious to the thread because it knows the values of its own registers, and the IPC syscall returns error values corresponding to each of the last four states above when interrupted. In the implementation, the thread-ex-regs syscall is used to force a given thread into special suspension code which pushes all CPU state to a location in memory.
If the thread was interrupted doing IPC, the eax register will contain the appropriate error value corresponding to the states above. After resumption, each thread is responsible for handling this situation correctly. This means that every IPC syscall invocation in L4Linux had to be transformed in the following way: In the case of cancellation or abortion of single-phase IPC, the IPC is restarted. In the case of cancellation or abortion of two-phase IPC, the error code describes how far the operation got before interruption. If the operation did not get past the send phase, it may be restarted completely. Otherwise, a single-phase receive should be started instead. Figures 8, 9 and 10 show the methods used to handle cancellation for send, receive and two-phase IPC respectively. In the case of user processes, the existing L4Linux signal handling thread was expanded to manipulate the main program thread into a suspended state. In the L4Linux task, one thread takes care of suspending all the other threads in a similar fashion. This thread's only purpose is to wait for the host environment to ask it to suspend. Before suspending the various internal kernel threads, it IPCs the signal threads of all user processes to suspend. When the signal threads of all user processes have responded that suspension is complete, the corresponding tasks are deleted from L4 with the task-new syscall, and the kernel threads are suspended. The suspender thread then IPCs back to the host environment that suspension is complete. This last IPC message contains an instruction pointer for code which may be used to later revive the guest operating system from its suspension. The host environment may then delete the L4Linux task from L4 at will. At this point, all that is left of the once running guest is a contiguous chunk of memory, along with an instruction pointer, which is easily copied to a new host where it may be revived, as described below.
Resuming L4Linux and its user processes

From the viewpoint of the host environment, a suspended guest image is revived simply by creating a new L4 task, beginning execution at its specified resumption address. Inside the new task, the first thread that runs performs the following operations: First it recreates all of the internal kernel threads, starting with the internal kernel pager, which will handle page faults incurred by the rest of the kernel, usually by forwarding them to the host pager. The rest of the threads are restarted in code which makes them go to sleep until they receive a go signal. This is necessary because of the chicken-and-egg situation which occurs when both user processes and kernel threads expect to be able to communicate right away, and cannot handle the other party not existing yet. Each user process is then recreated as a new L4 task. The new tasks start at a special resumption address within the user signal handling code. This code starts a new signal handling thread for the process, restores the CPU state from the copy previously stored in memory, and jumps to the location at which the process was originally interrupted. At this point, the newly created user task will have incurred two page faults: the resumption code resides in a page shared by all user tasks, and the CPU state resides in a page specific to each task. If the process was involved in a syscall or page fault before suspension, special care must be taken, because the L4Linux syscall and page fault server thread cannot handle multiple syscalls or page faults at once from the same process (which makes sense, because this is usually not possible). To solve this problem, an extra pager thread used only by recovering processes was added to L4Linux. Once a process is recovered, it switches back to the normal L4Linux pager.
A specialised version of libc was made for L4Linux which performs syscalls via direct IPC to L4Linux, eliminating the need to go through a user level trap handler. This is not compatible with the above mentioned approach, since there is no simple way of knowing where the libc syscall IPC code is placed in the address space of the user level process prior to suspension. This could be addressed by adding additional functionality to the extra recovery pager, but was considered to be beyond the scope of this project. Once all user processes have been recreated, all server threads are given the go signal, and the suspender thread goes back to sleep. The guest is now running as before.

Pure demand paging model

Because page faults in L4 are turned into simple IPC messages, it is possible for a task to explicitly request memory by making an IPC to its pager thread directly. The L4 documentation defines a page request protocol, known as the σ0 protocol, to which pagers may adhere. The protocol, named after the σ0 L4 backing pager, allows extended paging requests, for example requests for 4MB superpages instead of the normal 4kB pages. The L4Linux startup code makes use of this protocol to explicitly request superpages for as much of its memory as possible, for performance reasons. After booting, standard L4Linux has requested all the memory available, and will make no further page faults to its backing pager. While this behaviour is beneficial in a traditional single-operating-system setup, because the use of superpages is faster than normal pages, it is problematic in a migration scenario, because L4Linux assumes that page mappings do not disappear once requested. When resuming a guest in an empty address space, this assumption does not hold. Furthermore, the guest should not be allowed to decide or even know in what kind of pages it is running, as this may change during the lifetime of the guest.
For example, the precopy algorithm, described on page 28, unmaps the entire guest address space and later temporarily maps back only the pages actually needed by the guest to continue running. While superpages are used for the entire address space during normal operation, the temporary precopy mappings are created as 4kB pages, because of their finer granularity. Because of this, the paging model in L4Linux was changed to be purely demand driven. No explicit requests for memory are made; instead, page faults incurred by user processes and kernel threads are implicitly forwarded to the host pager by the Linux pager touching the faulting addresses. The host pager may then choose whether to map back superpages or normal pages at its own discretion. This also has the advantage that disk paging and similar techniques may be applied at the host level, completely transparently to running guests.

Task identifier migration issues

Because of L4's flat task identifier space, the task number for the Linux server, as well as for the user level tasks, cannot be assumed constant across migrations, because another guest may already have been allocated in the same task number range. Guest operating systems will need to deal with situations in which there is only partial or no overlap between task numbers before and after migration. Part of the solution to this problem is to introduce a level of indirection, by using some type of naming service to convert virtual task identifiers into actual L4 task identifiers. In the case of L4Linux, this indirection already exists for user level processes, as Linux Process IDs (PIDs) may be mapped via their kernel-internal task structs into L4 task identifiers. Upon resumption after a migration, the L4Linux server will have to change these identifiers into newly allocated ones. Conversely, the signal handling code of user space tasks stores an identifier for the L4Linux server in a shared page, which will need to be updated as well.
Unfortunately, task identifiers may also be cached in temporary variables or CPU registers, and used in IPC operations after resumption. The only way of preventing the use of such stale identifiers across migration is to treat every identifier use as a critical section and protect it with a lock or semaphore, inside which migration cannot occur. This solution does not come without implementation and runtime costs though. The Clans and Chiefs (C&C) security mechanism of L4 (see page 24) allows the L4Linux server, which is chief of its clan of Linux user process tasks, to intercept messages going to out-of-clan tasks, and redirect them to their new and correct destinations. A thread trying to perform IPC using a stale identifier will then either reach the correct destination right away (because the recipient is still at its old location), or be intercepted by C&C (because the stale identifier points to a task which is not a member of the new clan). Intercepted IPC will be forwarded to the L4Linux server, which may then look up the correct destination, and redirect the message. For this to work, new tasks need to be allocated in such a manner that any newly obtained task number which overlaps the old set of task numbers is assigned to the same Linux PID as before. The same goes for the L4Linux server. IPC redirection has some overhead, but shortly after resumption all temporary identifiers should have gone out of scope, and execution will be back to normal speed. If the operating system is migrated again before all stale identifiers have gone out of scope, there is a risk of task identifiers rejoining the clan, but for different PIDs this time, resulting in invalid IPC going undetected by C&C. However, this may be solved for all practical purposes by being a little careful about the lifetime of cached identifiers.
If this is not acceptable, a lifelong history of previous identifiers will need to be kept for each task, ensuring that if a PID once mapped to a certain task identifier, and that identifier ever becomes available again, it will map to it again. Unfortunately, no existing L4 version for Intel currently implements C&C redirection, though forthcoming versions promise a new and improved security model to replace it. As a temporary solution, the Fiasco kernel was modified to make out-of-clan IPCs fail with an error code, which L4Linux reacts to by recalculating the destination identifier and retrying the IPC. Another problem is that two guest Linux servers will normally be direct children of the host server task, and thus members of the same clan, allowing them to IPC each other directly. From a security viewpoint this is not much of an issue, because L4Linux already takes care to discard incoming IPC from unknown sources, but it is a problem when migrating. Suppose a Linux server is running as task 256, and that its internal threads communicate by referring to task 256 and a specific local thread number. When the server is suspended and migrated to a host where task 256 is taken by another Linux server, it is assigned task 512 instead. Internal threads in task 512 still hold stale identifiers pointing to threads in task 256, and because the two server tasks are in the same clan, they are allowed to communicate directly with threads in task 256, with undefined behaviour as a result. This could be solved by encapsulating each server in its own clan, but that solution is inefficient because it deepens the C&C hierarchy, adding another layer through which IPC has to be manually forwarded. The root of the problem is the redundant way IPC destinations are identified as (task, thread) pairs in L4, even when communicating inside the same task. If it were possible in L4 to specify a (0, thread) tuple as the destination, designating a recipient thread within the current task, the problem would be solved.
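The task-local addressing just described, implemented in the modified Fiasco kernel by letting task identifier 0 designate the sender's own task, amounts to a one-line resolution rule. A minimal sketch, with illustrative names:

```python
# Task number 0 designates the sender's own task, so intra-task IPC never
# carries an absolute task number that could go stale across migration.

def resolve_destination(dest_task, dest_thread, current_task):
    """Return the (task, thread) pair an IPC should actually be sent to."""
    if dest_task == 0:                 # task-local IPC: stay within own task
        return (current_task, dest_thread)
    return (dest_task, dest_thread)
```

With this rule, the internal threads of the example above address each other as (0, thread) and keep working whether their server runs as task 256 or task 512.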
The upcoming Pistachio L4 kernel is supposed to have a special LIPC primitive for performing very fast intra-task IPC, which could be used. In the meantime, the Fiasco L4 source was modified to allow IPC with task identifier 0 designating task-local IPC.

Remote control interface

The control interface defines a number of actions used for managing guest operating systems running on the machine. The interface server accepts TCP connections, making it possible to remotely start and stop guests, as well as to migrate them between machines. The following commands are defined with which guest operating systems can be managed:

client-alloc: Given information about required physical memory and IP address, this call inserts the guest into the host environment. If the guest is transferred as an image for first-time boot, the image is parsed and set up ready for execution. It is possible to specify a guest-specific command line, made available to the guest via the guest info page.

suspend: Suspends a guest designated by its IP address. This command goes through the suspension process for the relevant guest but does not release the resources allocated to it.

resume: Resumes a previously suspended guest with the environment it had available when it was suspended.

migrate: Migrates a guest designated by its IP address onto a new host. The precopy algorithm is used to minimise downtime during migration. The task is initially copied to the new server, after which it is suspended locally, and changed pages are copied to the new server. Finally, the old server sends a resume signal to the new server to restart the process remotely. Should anything go wrong, the local task can be restarted with a resume command, making migration transactional by nature. When migration is complete, resources may be freed.
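The migrate sequence above can be sketched as a precopy loop. All helper callables are injected assumptions (none of these names come from the actual NomadBIOS interface), and the round limit and threshold are illustrative:

```python
# Precopy migration sketch: copy pages while the guest keeps running, then
# suspend and copy only the remaining dirty pages, minimising downtime.

def migrate(all_pages, dirty_pages, send, suspend, resume_local,
            resume_remote, max_rounds=10, threshold=32):
    """Returns True if the guest now runs remotely, False if it was
    restarted locally (the transactional fallback)."""
    send(all_pages())                   # initial full copy, guest still running
    for _ in range(max_rounds):
        dirty = dirty_pages()
        if len(dirty) <= threshold:     # writable working set small enough
            break
        send(dirty)                     # copy changed pages while running
    suspend()                           # downtime begins here
    send(dirty_pages())                 # final set of changed pages
    if resume_remote():
        return True                     # success: resources may now be freed
    resume_local()                      # anything went wrong: restart locally
    return False
```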
The client program for the control protocol is available both as a stand-alone command line program, called runclient, and integrated into the host environment. Apart from the initial migration command, the protocol steps involved in performing migration between host environments are handled in a peer-to-peer fashion.

Future work

The services currently provided by NomadBIOS suffice to demonstrate the feasibility of the system. A number of additional services have been identified which, if implemented, would extend the usability of NomadBIOS in a cluster environment.

Checkpointing server

A checkpointing server can be implemented as a special version of the control interface running at a passive host, possibly a file server. A guest can be checkpointed by migrating it to the checkpoint server, which stores the data in persistent storage and then reports failure to start the process. This makes the original host environment restart the guest locally at the migration point. Should a host environment fail, the last known checkpoint can be restored by having the checkpoint server migrate the stored version of the guest back to an active host environment.

Load balancing

It would be possible to add MOSIX-like load balancing of guest operating systems to the host environment. The service would be responsible for communicating load averages and guest specifications between a number of hosts, making decisions about which hosts should offload which guests onto other hosts. The load balancing server would use the existing control interface to migrate guests to other hosts. This would allow for MOSIX-style load balancing of a cluster, or across a set of clusters, without residual dependency problems, though at a coarser granularity.

Block device access

Similar to the network abstraction, it would be possible to add a block device abstraction to the host environment.
This would give the guests access to a section of a local physical disk, usable as swap space or for creating file systems for storing temporary files. The block device service would partition the local hard drive simply by recording an offset and length for each guest's physical volume, and perform block remapping and bounds checking to ensure separation of the volumes. During migration, all blocks owned by a guest would be transferred to the target machine, possibly using the precopy algorithm.

CHAPTER 7
Performance measurements

The primary objective of this project is the implementation of an environment for hosting nomadic operating systems, and a port of one non-toy operating system to the environment. The secondary objective, as described on page 4, is to show that the primary objective can be achieved without a significant loss of performance. To confirm this, a number of benchmarks were run. The strategy is twofold. One part is to demonstrate that running nomadic L4Linux (NomadLinux) under NomadBIOS is not significantly slower than running standard L4Linux (which has been shown in Härtig et al. to be 5-10% slower than native Linux for practical scenarios); the other is to compare the performance of NomadLinux under NomadBIOS against a similar setup using VMWare, the main alternative for hosting multiple operating systems on commodity hardware. Three types of benchmarks have been chosen:

Latencies: By measuring the latencies of a series of operating system related tasks, such as process creation and syscalls, it is possible to get indications of the performance overhead (if any) incurred by the overall design.

Throughput: Timing the task-completion time of a CPU intensive process gives a clearer picture of the impact felt by an end user in a system based on NomadBIOS.
Migration: While not a comparative benchmark, this measures the downtime incurred by migration as perceived by an external spectator, for example a remotely logged in user.

All benchmarks were run on the same machine, a single-CPU 750MHz AMD Athlon with 380 MB of RAM and a 100Mbit Intel EtherExpress Pro 100 network interface.

Latency benchmark

A series of micro benchmarks are run to determine the relative performance of various operating system elements. For this purpose the hbench-OS [Brown and Seltzer, 1997] benchmark was chosen. hbench is a derivative of lmbench [McVoy and Staelin, 1996], which was used for the original L4Linux benchmarks, but modified to provide more accurate results in a number of cases. hbench tests a number of operating system specific latencies and throughputs. Benchmarks regarding local file system performance have been disabled, since the guest is only able to access network file systems. The results of this benchmark will be compared to the measurements of standard L4Linux, to give an indication of the overhead introduced in the nomadic version. A subset of the total benchmark results was picked, akin to the subset used in the original L4Linux benchmarks, but with parameters adjusted to run on faster hardware. The latency benchmarks are described below. Results are specified in microseconds (µs); lower values are better.

getpid: Measures the time it takes for one getpid syscall.
write to /dev/null: Measures the time it takes to write one byte to the null device.
null process: Measures the time it takes to fork the current process.
simple process dynamic/static: Measures the time it takes to fork the current process and start an either dynamically or statically linked program using the execve system call.
sh process dynamic/static: Measures the time it takes to fork the current process and start an either dynamically or statically linked program using the shell.
pipe: Measures the time it takes to send one byte through a pipe between two processes.
ctx 0k 16: Measures the context switch latency, by sending a token through a series of pipes set up between 16 child processes.
ctx2 0k 16: As ctx, but provokes cache misses in the first and second level caches.

Usually it would have been customary to benchmark memory map latency together with the other latency tests, but unfortunately that benchmark tended to crash NomadLinux, so no results are available. Investigations are proceeding as to the cause of the crash. The bandwidth benchmarks measure their results in MB/s; higher values are better.

pipe-64k: Measures the bandwidth of a pipe, by transferring 4MB through a pipe in chunks of 64kB.
mem rd 2m: Measures the read bandwidth of system memory, by touching every byte in a 2MB data range.
mem write 2m: Measures the write bandwidth of system memory, by writing to every byte in a 2MB data range.
mem zero 2m: Measures the bandwidth of system memory when clearing memory with the bzero library call.
mem copy 2m unrolled aligned: Measures the copy bandwidth of system memory, by copying 2MB of data from one address to another. The copy code is loop unrolled for maximum performance, and the data is aligned on 4kB page boundaries.
tcp-128k: Measures the bandwidth of the network layer, by transferring at least 10MB to another host on the network, in chunks of 128kB.

Each benchmark is run only once per system, since hbench itself performs repetitions to average its results. For each run, the system benchmarked was allowed to use 64MB RAM and a root file system mounted via NFS. The file server was a dual 350MHz Pentium II serving files from a Quantum Atlas 10k3 ultrawide SCSI2 disk over a 100Mbit network, and was not doing other work.
The results are compared against identical runs of the same benchmark suite under the following conditions:

VMWare on Windows XP: VMWare running under Microsoft Windows XP, the guest being a native Linux 2.2-20 kernel.
VMWare on Linux 2.4.18: As above, except with VMWare running under Linux 2.4.18.
L4Linux: The unmodified L4Linux, running on top of the same L4 kernel as the NomadLinux benchmark.
Linux 2.2-20: The native Linux kernel running directly on the hardware.

The version of VMWare used was VMWare Workstation 3.2. The results of this benchmark are tabulated in tables 1 and 2, and it is seen that while VMWare has better syscall latencies than NomadLinux, it suffers greatly in the process invocation tests. This is consistent with Lawton's description of how VMWare works. When compared to native Linux, both NomadLinux and L4Linux are outperformed by quite a margin, especially regarding syscalls. In the original L4Linux performance benchmarks, L4Linux was benchmarked with two different syscall mechanisms, one of which involved a specialised libc using direct L4 IPC instead of the trap mechanism described on page 25. Use of the specialised libc almost halved the syscall overhead when compared to native Linux. While this would have benefited NomadLinux as it did L4Linux, the current version of NomadLinux is unable to safely resume processes involved in syscalls by means of direct IPC, and so this optimisation has not been applied. The problems leading to this decision are further described on page 48. From the results it is seen that NomadLinux equals L4Linux in most cases, while outperforming it in some. It is not evident what makes NomadLinux perform better than L4Linux in these cases, but for this benchmark it is satisfactory to see that no unnecessary overhead was introduced when adapting L4Linux to run under NomadBIOS.
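As background on what the latency figures mean, the general shape of a syscall-latency microbenchmark of this kind is to repeat a call many times and divide the elapsed wall time by the iteration count. A minimal sketch (the iteration count and use of os.getpid are illustrative, not hbench's actual implementation):

```python
# Microbenchmark sketch: average per-call latency over many repetitions,
# so that timer granularity does not dominate a single fast call.

import os
import time

def syscall_latency(call, iterations=100_000):
    start = time.perf_counter()
    for _ in range(iterations):
        call()
    return (time.perf_counter() - start) / iterations  # seconds per call

# e.g. syscall_latency(os.getpid) gives the average per-call overhead.
```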
When examining the bandwidth results, it seems that NomadLinux and L4Linux have the same throughput in most cases, while still being outperformed by native Linux. Linux under VMWare comes in at a 5-20% performance penalty compared to native Linux, and suffers a massive degradation in the pipe bandwidth test, although the reasons for this are not exactly clear. The degradation of the pipe bandwidth result, seen in conjunction with the high pipe latency result of VMWare, helps to explain the relatively poor results of the context switch latency benchmarks, since these pass a token through a series of pipes set up between a number of processes. One area in which NomadLinux did not perform well was network bandwidth, at just 50% of native Linux performance. VMWare performed almost as well as native Linux, indicating that it should be possible to achieve similar performance for NomadLinux on NomadBIOS.

Test                          NomadLinux   VMWare Windows   VMWare Linux    L4Linux     Linux
getpid                              3.80             0.80           0.83       3.94      0.36
write to /dev/null                  4.28             1.09           1.03       4.62      0.46
null process                      982.88          1485.70        1397.83    1060.44    178.42
simple process dynamic           2644.74          4149.80        5197.21    3094.99    906.86
simple process static            1706.60          7242.33        2973.43    1734.40    330.18
/bin/sh process dynamic         12907.32         30002.18       24193.52   14501.13   7707.26
/bin/sh process static          11898.97         28316.46       22546.51   12951.64   7084.71
ctx 0K 16                           2.46            27.93          24.50       3.41      3.39
ctx2 0K 16                          1.93            26.11          24.29       3.19      3.60
pipe                               19.21            40.67          40.56      24.31      5.30

Table 1: Selected latency results running hbench

Test                          NomadLinux   VMWare Windows   VMWare Linux    L4Linux     Linux
pipe-64k                          330.73            88.23          90.17     313.73    366.30
mem rd 2m                         393.18           332.00         371.16     394.48    408.06
mem write 2m                      290.94           278.61         281.06     292.61    304.69
mem zero 2m                       294.79           282.35         280.48     294.66    289.63
mem copy 2m unrolled aligned      149.76           168.76         156.78     168.62    165.16
tcp-128k                           5.443            8.843          9.379       9.48      9.83

Table 2: Selected bandwidth results running hbench

Throughput benchmark

In
order to measure system throughput, a CPU-intensive task was put together to simulate a scientific workload. The task consists of encoding a single uncompressed wave audio file into its ogg-encoded equivalent. To perform the encoding, the program oggenc was used. Each result measures one encoding from raw PCM wave file format to ogg format, and the result is the wall-clock completion time of the task. For each run, the file cache is warmed by running the benchmark once and discarding the result. The benchmark is run in three scenarios:

One encoder: The encoder is run alone, and the time measured is the time it takes to complete the encoding. This mode measures the throughput of a single task with no contention for resources.

Two encoders: The encoder is started twice. The time measured is the completion time of the last to finish.

One encoder on two guests: For the setups that support running multiple guests, this benchmark runs one encoder in each of two guests. It is relevant only to the VMWare and NomadLinux setups, and the time measured is that of the last to finish.

The results are measured in wall time rather than UNIX system time plus user time, due to the inability of L4Linux (and thus NomadLinux) to accurately report these numbers. The benchmarks were run under the same environments as the latency test, with the same guest configuration, and the results are available in table 3. It is interesting to note that running just one guest under NomadBIOS performs equally to L4Linux itself, and loses just a few seconds running two guests each doing one encoding, compared to running two encoders at once. The difference between the two measurements signifies the overhead of having a second operating system running, which, compared to the VMWare figures, is marginal. VMWare on Windows suffers an 8% slowdown when running two virtual machines, whereas NomadBIOS suffers just 1.5%.
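The "last to finish" wall-time measurement used above can be reproduced with a small driver: start all encodings, then take the elapsed time once every process has exited. The oggenc command line in the comment is an assumption about typical usage:

```python
# Wall-time measurement of concurrent jobs: the elapsed time from starting
# all processes until the last one has exited.

import subprocess
import time

def time_last_to_finish(commands):
    start = time.monotonic()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for p in procs:
        p.wait()                 # returns immediately for already-exited jobs
    return time.monotonic() - start

# e.g. time_last_to_finish([["oggenc", "in.wav", "-o", "a.ogg"],
#                           ["oggenc", "in.wav", "-o", "b.ogg"]])
```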
Please note that the VMWare on Linux result for multiple guests was obtained with an independent timer, since VMWare was not able to share the real time clock correctly between multiple virtual environments, making the clock drift 50 seconds during this three minute test. Generally VMWare came in last, by a margin of 15-30% compared to NomadBIOS, except in the syscall latency benchmark. As is seen, L4Linux and NomadBIOS both come within 5% of native Linux performance.

Migration benchmark

This benchmark is relevant only for NomadLinux, since it is the only one of the test systems with built-in support for migration. It measures the time taken to migrate a guest environment from one host environment to another. This period of time can be split into two separate phases: the downtime experienced by the guest, and the downtime experienced by peers on the network. To get an indication of the first, the number of 4kB pages sent when transferring control of a guest to a new host was measured. The guest was a full Linux system, running 14 user processes, including inetd, sshd, and apache, though all idle at the time. When using the simple copy algorithm, the downtime of a guest is the time taken to transfer its whole address space to a new host. The migrated test guest contained 64MB of memory, so its downtime should be 6-7 seconds on a 100Mbit network. Using the precopy algorithm, the downtime was reduced to the time taken to transfer 29 4kB pages (a total of 116kB), which is about 10 ms. Running a recursive find command from the root of the NFS-mounted filesystem while being logged in via SSH resulted in a final transfer of 244 4kB pages, or a downtime of about 100 ms. Since every user process in a NomadLinux system writes at least one page during suspension, the theoretical best case final transfer is n + 1 pages, where n is the number of user processes in the system; with the 14 user processes above, that is 15 pages.
The extra page is for the state of the operating system threads. In the worst case, an extremely busy system, the time for the final transfer, and thus the downtime, will be equal to that of the simple copy algorithm. The downtime experienced by other machines is not just the transfer time. Since the IP address of the guest is now associated with a different Ethernet address, machines connected to the guest need to be made aware of this. Until this happens, they will continue sending packets to the old host, where the packets will be discarded. This can take a relatively long time; times of up to 20 seconds were seen. To counter this, the new host of the guest broadcasts a series of gratuitous ARP messages (see page 61 for a detailed description), telling every other machine on the network to update its ARP table entry for the guest IP address. This technique brings the time for re-discovery of the guest down to virtually nil. This was confirmed by letting a machine on the network send ICMP echo packets to the guest before, during and after migration, at a rate of ten packets per second. Without gratuitous ARP, 24.9 seconds elapsed before the guest started responding to echo requests, while with gratuitous ARP not a single packet was lost. The results of these three benchmarks support our hypotheses as described on page 4.
Specifically, the throughput benchmark supports the Concurrent, Efficient and Scalable hypotheses, showing that NomadLinux suffers just 1.5% overhead when running two simultaneous guests and performs equally well with L4Linux, while the very low downtime shown in the migration benchmark supports the Migratable hypothesis.

Mode                        NomadLinux   L4Linux    VMWare/Windows   VMWare/Linux   Native Linux
One encoder                 1m 01.1s     1m 01.5s   1m 15.1s         1m 15.2s       0m 58.2s
Two encoders                2m 07.6s     2m 06.7s   2m 35.9s         2m 34.5s       1m 57.6s
One encoder on two guests   2m 09.0s     N/A        2m 48.7s         2m 40.0s       N/A

Table 3: Results for converting a 44MB audio file into ogg-vorbis format

CHAPTER 8
Discussion

The original motivation for this project was to create a better solution to the configuration management and load balancing needs of computing grids. However, most of the actual work has been performed at the micro level, and tested in practice on only a few machines. This chapter deals briefly with some of the challenges one would meet when attempting to implement a grid-scale system based on nomadic operating systems.

Security

In a system involving untrusted parties, which a large-scale computing grid will necessarily be, there is a need for security measures to prevent abuse. The current NomadBIOS implementation contains no security features; any client on the Internet is able to connect to a NomadBIOS machine and make use of its services. Augmenting NomadBIOS with support for Kerberos [Miller et al., 1987] or a similar network authentication protocol, as well as implementing an authorisation system on top of a directory access protocol such as the Lightweight Directory Access Protocol (LDAP), would be possible, and may be a topic for future work.
When a host environment receives a new guest to execute, the host needs to validate the authenticity of the guest, in order to match it against local access control and quota rules, and to make sure it was not altered while in transit, either by network failures or by a malicious third party. A guest transfer via the network happens in the form of a header structure describing the guest's external attributes, and a set of memory pages. The external attributes describe the IP address, memory quota, and other restrictions that the guest must obey and is unable to manipulate. The external attributes of the guest are determined by its creator, and should be kept constant once the guest is running. The creator may digitally sign these attributes, and the host environment can verify this signature before accepting any guest pages into memory. The number of distinctly addressed pages accepted into memory can then be bounded by the amount originally described by the creator. The rationale behind not letting a guest change its IP address is that hosting sites may employ various address-based security measures which the guest should not be allowed to circumvent. As described on page 61, Mobile IP could be used to aid in keeping the IP address constant across migrations. The originating host might calculate a secure checksum for each page before transfer, and conclude the transfer of pages with a signed list of all checksums, to make sure everything has come across correctly. By their nature, most memory pages will be mutable, and it will make no sense for their creator to sign them. By compromising a host system, an attacker will be able to alter the programs running in a guest, or to extract secrets such as keys or passwords from their memory. If the signature of the compromised host is trusted by other systems, they have no way of preventing the receipt of a maliciously altered guest.
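The per-page checksum scheme sketched in the text, a checksum for each page concluded by a signed list of all checksums, can be illustrated as follows. An HMAC over a shared secret stands in for the digital signature here (a real deployment would use public-key signatures), and all names are assumptions:

```python
# Integrity sketch: checksum every page, then authenticate the list of
# checksums so the receiver can detect both altered pages and an altered
# list. SHA-1/HMAC stand in for whatever the originating host would use.

import hashlib
import hmac

PAGE_SIZE = 4096

def make_manifest(pages, key):
    assert all(len(p) == PAGE_SIZE for p in pages)
    checksums = [hashlib.sha1(p).digest() for p in pages]
    tag = hmac.new(key, b"".join(checksums), hashlib.sha1).digest()
    return checksums, tag

def verify(pages, checksums, tag, key):
    expected = hmac.new(key, b"".join(checksums), hashlib.sha1).digest()
    if not hmac.compare_digest(expected, tag):
        return False                     # the checksum list itself was altered
    return (len(pages) == len(checksums) and
            all(hmac.compare_digest(hashlib.sha1(p).digest(), c)
                for p, c in zip(pages, checksums)))
```

This also gives the receiving host the page-count bound mentioned above: it accepts at most len(checksums) distinctly addressed pages.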
Even though the pages containing the static parts of the operating system code, or even of various user level programs, may be signed up front, there is still the risk of an attacker manipulating vital data structures, leaving them in an unpredictable state. It should be noted, though, that even though a malicious host environment poses a threat to innocent guests, a malicious guest should not pose any threat to a host environment, as the host environment should ensure adequate protection.

Migration across network boundaries

The example nomadic operating system and host environment implemented for this project employ Internet Protocol version 4 (IPv4) on top of Ethernet as the only channel for communication with the outside world. IPv6 support may be added in the future. Guest operating systems may be transparently migrated within IP subnets, where their IP addresses stay valid. When a guest arrives at a new host, the new host broadcasts an Address Resolution Protocol (ARP) reply message, also known as a gratuitous ARP message, to the local subnet, so that everyone on the subnet quickly becomes aware of the new Ethernet location of the guest's IP address. If this is not done, IP connections to the migrated guest usually hang for a few seconds, until its peers have resolved the new Ethernet address themselves. The problem of migrating a running operating system to a different IP subnet is equivalent to the problem of migrating a portable computer by hand. If losing all open network connections is acceptable, the operating system may simply be reconfigured to use a new IP address within the target subnet, which is what most laptop users do today. If connections are to survive migration between distinct subnets, a layer of indirection has to be introduced at the protocol level. Because of the problem of migrating portable computers by hand, this has already been addressed by the Mobile IP extension to IPv4 [Perkins and Myles, 1997].
In Mobile IP, a node on the home network (the home agent) forwards packets destined for the client (the mobile node) to a care-of address on the foreign network. The care-of address may be handled by a foreign agent on the foreign network, or by the mobile node itself. The home and foreign agent functionality will typically be embedded in routers on the networks. Outgoing packets from the mobile node may generally be sent directly to their destination, though not all firewalls will forward packets with non-local source addresses. If outgoing packets cannot be sent directly, they too have to be tunnelled via the home agent. The problem with Mobile IP is the triangular routing occurring because incoming packets have to go through the home agent. Some Mobile IP implementations take the type of traffic into account, and allow stateless protocols such as HTTP to be routed directly. Because of the existence of Mobile IP, there is no need to implement anything special in order to support migration across network boundaries; the problem may be tackled simply by employing Mobile IP in guest operating systems and on host networks. The Internet Protocol version 6 specification contains some additional features which are beneficial to Mobile IP, and hopefully the popularity of laptops and wireless networks will result in most routing devices supporting mobility in the near future. By using the gratuitous ARP message to minimise the time taken for peers to discover the migration of a guest to a new physical interface, and by pointing to the option of using Mobile IP when migrating between different subnets, we can now conclude that the assumption made on page 5, about being able to migrate network connections seamlessly, has been satisfied.

Other usage scenarios

Apart from the grid scenarios envisaged elsewhere, nomadic operating systems may prove useful in some end user centric scenarios as well.
Below, a few imaginable "blue sky" examples are described. Common to both of them is that they are not very useful in the current state of NomadBIOS, because abstracted access to human interface devices such as display and keyboard has not yet been implemented.

Workstation hotel

When leaving his place of work, a user may still wish to be able to access his workspace from home. Normally this is solved by leaving the workstation on all night, in case it might be needed. Alternatively, the workstation's operating system could migrate to a "hotel server" for the night. If the user needs to access it, it may still be reached from the hotel server (with the apparent location unchanged), while the workstation is powered down. Since only few users are expected to be logged in at night, one hotel server replaces a large number of workstations, saving huge amounts of energy. If the user prefers the speed and reliability of his workstation at home, instead of using a terminal connection to the hotel server, the operating system may instead be migrated to his home workstation when he leaves the office, racing the user home.

Laptop replacement

Spurred by the presence of Universal Serial Bus (USB) and IEEE-1394 (aka FireWire) ports in almost all modern personal computers, a new crop of portable storage devices, such as flash cards, the IBM USBKey, and the Apple iPod music player, has started to emerge. Both solid-state devices currently storing up to one gigabyte, and small hard drives storing 20 gigabytes or more, are available at consumer-level prices. One use of the nomadic operating system would be to periodically checkpoint the entire operating system to, for example, a USBKey. When the user removes the USBKey from the computer, the operating system is paused, since there would no longer be anywhere to place dirty pages. When the USBKey were inserted anew, operation would resume.
If the USBKey is inserted into another computer running the base operating system, the guest operating system would start operation from the checkpoint on the key. At first, no mappings would exist, leading to page faults. These would be served from the USBKey, and the system would soon be running at full speed, similar to the way normal disk-backed paging works. The benefit is that the user might carry his or her entire operating system environment around on a very small device, and be able to commence work at any location with a compatible computer present. Such a solution would need Mobile IP to allow migration of network connections to file servers. The root file system, as well as the various application binaries, would be served from the network, perhaps with part of the USBKey acting as a disk cache. Intermezzo [Braam and Nelson, 1999] would be a good candidate for such a network file system, as it would deal gracefully with disconnected operation. When not in interactive use, i.e. when the lid is closed, most laptops today function only as clunky storage devices. Even though the convenience of being able to work on your thesis in the park during summer should not be underestimated, it feels safe to assume that many laptops are actually running on desks, as desktop computer replacements, most of the time. Instead of carrying a bulky and expensive laptop between desks, just to ensure access to personal data and applications, simple removable media technologies, which will fit in a pocket or on a key ring, could be used instead. If used in conjunction with trusted, public terminals running the host environment, users would be able to migrate their operating system, data, and applications between desks, eliminating the need for a laptop in many cases. Such terminals could exist both in the home and at the office, or even in airports and on planes.
Related work
Other recent research projects deal with the subjects of partitioning and operating system migration on commodity hardware. This can be viewed as a natural result of personal computers becoming fast enough to host several independent systems at once, much like mainframes have for several years.

The Fluke microkernel
The project perhaps most related to NomadBIOS is the Fluke microkernel [Ford et al., 1996] from the University of Utah. Fluke is an attempt at building a system supporting recursive virtual machines, of arbitrary nesting depth, with exportable kernel state allowing easy checkpointing of single programs, entire guest operating systems, or even of systems distributed across multiple physical machines. The main advantage of Fluke compared to L4 is its lack of global identifiers, which simplifies the task of migrating complex guest operating systems across machines. Fluke is a synthesis of Mach and L4: IPC security is enforced via ports and port references as in Mach, but IPC is unbuffered as in L4. The memory management system supports recursive address spaces like L4, but with more detailed means of control. Where Fluke is a microkernel architecture, NomadBIOS is only a microkernel application. NomadBIOS gives up some transparency in exchange for less duplication of effort and better performance, due to its reliance on an established and quite mature kernel technology. Compared to Fluke, NomadBIOS on L4 aligns better with the microkernel ideal, in that only the bare-minimum features needed for protection reside within the kernel, whereas checkpointing support and indirection of global identifiers are kept in user space. The most current version of Fluke was released in early 1999, and was described by its developers as unfinished at the time. However, the related OSKit driver framework, on which NomadBIOS relies for networking, is still maintained and used in many current operating system research projects.
Denali
The Denali Fault Isolation Kernel project [Whitaker et al., 2002] shares several goals with NomadBIOS. Denali aims to run multiple (hundreds or even thousands of) isolated guest operating systems on the same machine, and, like NomadBIOS, sees transparency as a non-goal. Scalability is more important to Denali than to NomadBIOS, but comes at the sacrifice of features such as virtual memory, resulting in the loss of the ability to host traditional operating systems, such as Linux, within a single guest instance. Instead, Denali suggests the use of several cooperating guests to form a full operating system. This approach is similar to how L4Linux (and most other microkernel-based operating system implementations) runs multiple user-level tasks, served by one or more operating system tasks. While Denali currently lacks NomadBIOS' ability to host real operating systems, it has the ability to swap guests to disk, which NomadBIOS currently does not. This makes Denali better suited for hosting large numbers of rarely active guests.

vMatrix
vMatrix [Awadallah and Rosenblum, 2002] uses the proprietary VMWare Virtual Machine Monitor (VMM) to implement a system much akin to NomadBIOS, though mainly focused on better HTTP serving by migrating content closer to consumers. As described on page 11, the VMWare approach to system partitioning is quite resource-wasteful, due to its goal of transparency. vMatrix does not attempt to limit migration downtimes, and cites figures of 10 seconds or more for resumption alone. Contrasted with NomadBIOS' ability to perform a complete migration with less than one tenth of a second of downtime, this is rather slow.

Conclusion
As can be seen from the performance measurements, the hypotheses about nomadic operating systems have been verified.
NomadBIOS generally outperforms VMWare, performs on a level with standard L4Linux (except for network performance), and lags 5-10% behind monolithic Linux, similar to the results reported in [Härtig et al., 1997]. The OSKit-based network abstraction needs improvement. Its lagging performance compared to standard L4Linux and VMWare may be due to the use of unoptimised OSKit drivers, to the extra layers of copying introduced, or to the latency added by the queueing of packets to save on context switches. All of these issues are addressable without making fundamental changes to the overall design, and should be the focus of future work. The limitations of the current L4 Fiasco kernel, mainly the 256MB RAM limit, also have to be addressed. It is safe to assume that most nodes in clusters today contain much more memory than this; the Fiasco team is currently working on removing the limit. The limited number of tasks available to guest operating systems should not be a problem in most scientific scenarios, and in any case the upcoming L4 version 4 specification will remove it.

How close is the implemented system to being usable in a real cluster or grid setup? That depends on the type of application one wishes to run. Applications that iterate over very large datasets multiple times will need access to disks for temporary storage to perform adequately. As long as no block device abstraction has been implemented, these applications cannot be served efficiently. Applications that only iterate over their dataset once will run fine, because the input and output data have to be transferred via the network anyway, so a disk offers no real advantage. Many current grid efforts attempt to use Java or other interpreted or just-in-time compiled languages to provide the user with an architecture-independent programming environment.
Nomadic operating systems may be criticised for sacrificing architecture independence for performance and the ability to run legacy software. However, this choice is warranted by the very nature of the grid itself: if one assumes the set of nodes in the grid to be large enough, finding a suitable subset sporting one's desired processor type will always be possible. Nomadic operating systems solve the problem of agreeing on a common configuration for operating systems on grid nodes by deferring this choice to the end user. They also allow more efficient resource utilisation, by allowing load balancing to occur at cluster or even grid level, by means of operating system migration. Compared to traditional process migration schemes, they completely eliminate the problem of residual dependencies, though at the cost of a coarser granularity of the migrational unit. The authors believe nomadic operating systems to be an enabling technology of grid computing. While the current implementation still needs a little work, it is stable enough for many uses already, and the reader is encouraged to try it out.

Appendix A: Guest interface specification
The interface specification has been derived from the NomadBIOS implementation, and represents the minimum functionality needed for creating a migratable guest operating system. The specification assumes that the guest is already an L4 program, so only the restrictions specific to guests running under the NomadBIOS host environment are described. The reader is referred to Hohmuth [1998] and Liedtke [1999] for further details on programming for L4.

Address space layout and page fault handling
May: Access read-only the L4 kernel info page at address 0x1000.
May: Access read-write a special guest info page, at address 0x2000 (see the guest info page specification below).
Cannot: Access any memory below 0x400000 (the first 4MB superpage), except for the shared kernel info page and the guest info page.
May: Access a number of distinct 4MB pages above 0x400000, with the amount of 4MB pages specified in the guest info page.
May: Allocate memory from the NomadBIOS pager, either via IPC or by accessing it. Memory will be returned as either 4kB or 4MB pages, at the sole discretion of NomadBIOS.
Must: Be able to handle repeated page faults to the same address idempotently, when acting as a pager for its own tasks and threads.

Checkpointing behaviour
Must: Listen for and react to the suspend IPC signal sent by the BIOS, by storing all task state in memory, and returning an instruction address at which resumption may later occur.
Must: Be able to resume full operation from a previously suspended image.

Hardware abstractions
Cannot: Allocate any hardware interrupts.
Cannot: Access any hardware directly.
May: Use the network IPC protocol for accessing network resources.

External thread identifiers
Must: Obtain any identifiers needed for communication with external service threads via the guest info page at address 0x2000.

Guest info page
The guest info page is shared between the host environment and the guest, and is writable by both. Its main purpose is the provision of parameters, such as the boot parameter string, IP and Ethernet addresses, the maximum amount of memory available, the accessible range of L4 tasks, thread identifiers for the network services, the clock multiplier, and so forth. It is writable by the guest so that in the future, information may be passed back to the host environment as well. After suspension, values obtained from the guest info page may no longer be valid, and should be reread. Information written to the guest info page should be considered lost and be rewritten.

Appendix B: Changes to the L4Linux source
For those knowledgeable about the L4Linux implementation, the changes necessary to turn L4Linux into an example nomadic operating system are described below.
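Several of the changes below read parameters from the guest info page described in Appendix A. A hypothetical C rendering of that page is sketched here; the set of fields follows the description above, but their names, types, ordering and sizes are illustrative guesses, and the authoritative layout is the one defined in include/l4linux/guest_info.h.

```c
/* Hypothetical layout of the guest info page shared at address 0x2000.
 * Field names and types are illustrative, not the actual definition. */
#include <stdint.h>

#define GUEST_INFO_ADDR 0x2000
#define L4_PAGE_SIZE    4096

struct guest_info_page {
    char     cmdline[256];      /* boot parameter string for the kernel */
    uint32_t ip_addr;           /* IP address assigned to this guest */
    uint8_t  ether_addr[6];     /* Ethernet address for the virtual NIC */
    uint32_t mem_4mb_pages;     /* amount of accessible 4MB pages */
    uint32_t first_task;        /* first L4 task number usable by the guest */
    uint32_t num_tasks;         /* size of the accessible task range */
    uint32_t net_thread_id;     /* thread identifier of the network service */
    uint32_t clock_multiplier;  /* so time.c can skip its own calibration */
};

/* Everything must fit in the single shared 4kB page; after resumption the
 * guest must reread these values, since the host may have rewritten them. */
int guest_info_fits(void)
{
    return sizeof(struct guest_info_page) <= L4_PAGE_SIZE;
}
```

A guest would simply cast the fixed address to this structure type, e.g. `struct guest_info_page *gip = (void *)GUEST_INFO_ADDR;`, and reread the fields after every resumption.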
Some of the changes rely on two small modifications made to the Fiasco kernel: out-of-clan IPC now fails with an error code, and intra-task IPC is possible by specifying one's own task as the recipient.

arch/l4/x86/emulib/int_entry.S: All IPCs are modified according to figures 8, 9 and 10. The out-of-clan IPC error code is checked as well, to see if task numbers have changed due to migration. Functions are added for storing the CPU state of the main process thread in the shared page, and for recovering the process state from this page.

arch/l4/x86/emulib/user.c: Special handling code for signal 100 is added to the signal handling thread. Signal 100 is the signal chosen to signify a suspend request from the L4Linux server. When suspension is complete, the signal handler responds to L4Linux with the eip and esp of the frozen process.

arch/l4/x86/kernel/chead.c: Thread 0, the first thread of the L4Linux task, normally creates a new thread – thread 3 – which does most of the work, and then goes to sleep forever. In Nomadic L4Linux, thread 0 listens for suspend requests from the host environment instead of going back to sleep.

arch/l4/x86/kernel/irq.c: Rather than trying to obtain the timer interrupt from L4, the function timer_irq_thread waits for an IPC timeout every 10 ms, and now has the responsibility of incrementing the jiffies counter.

arch/l4/x86/kernel/l4_idle.S: A special page fault handler thread is installed to allow page faults in the emulib pages while resuming a user process. All IPCs are modified according to figures 8, 9 and 10. Note that only the assembler version of the idle thread has been modified, so the C version in dispatch.c is currently not supported.

arch/l4/x86/kernel/pagefault.S: All pages are touched before being mapped to the faultee, to provoke higher-level page faults should the mapping have been removed by the host pager. Pages are touched according to the original type of access, so that read requests do not generate a writable mapping further up.
arch/l4/x86/kernel/setup.c: The command line for the kernel is read from the guest info page. All memory is now allocated on demand, so no explicit pager calls are performed.

arch/l4/x86/kernel/time.c: Gets the clock multiplier from the guest info page, instead of performing its own calibration.

arch/l4/x86/lib/l4_pager.c: Modified to allow multiple page faults on the same address over time, by simply touching the memory page associated with the faulting address, instead of explicitly propagating the request to the host pager. The Ping-Pong task is not necessary when running under Fiasco, and has been disabled.

arch/l4/x86/lib/task.c: Modified to obtain tasks from the host environment instead of from RMGR.

init/suspend.c: New file implementing the suspension and recovery functions used prior to and after migration. First suspends all L4Linux server threads, then all processes, after which it signals the host environment, which will handle the actual migration. Upon recovery, first all L4Linux server threads are restarted but sleeping, then all user processes are recreated according to the Linux process table, and finally the L4Linux server threads are given a “go” signal and start serving again.

drivers/net/l4_ether.c: Modified to use the host network interface. The receiver runs as a separate thread, listening for incoming packets from the host. Multiple packets are gathered in a queue, which is later flushed by an added bottom-half handler, to amortise the cost of context switches.

include/l4linux/net.h: New file, defining constants for the NomadBIOS network interface.

include/l4linux/x86/config.h: Modified to run at lower priorities, to make room for the host environment. The kernel now runs at priority 10.
include/l4linux/x86/sched.h: IPCs modified according to figures 8, 9 and 10.

include/l4linux/x86/shared_data.h: The shared data structure used for syscall handling is extended to make room for storing CPU state upon suspension.

include/l4linux/guest_info.h: New file that defines the layout of the guest info page.

Appendix C: Availability
Source code and binaries of the NomadBIOS and nomadic L4Linux may be downloaded from the Internet at http://www.nomadbios.dk. More information about the Fiasco L4 kernel can be found at http://os.inf.tu-dresden.de/fiasco/, and more information about the Hazelnut and Pistachio L4 kernels, and about the upcoming version 4 L4 specification, can be found at http://www.l4ka.org.

Appendix D: Bibliography
Amar, L., A. Barak, A. Eizenberg and A. Shiloh. The MOSIX scalable cluster file systems for Linux, 2000.
Awadallah, Amr and Mendel Rosenblum. The vMatrix: A network of virtual machine monitors for dynamic content distribution. Technical report, Computer Systems Lab, Stanford University, 2002.
Barak, A. and O. La'adan. The MOSIX multicomputer operating system for high performance cluster computing, 1998.
Braam, P., M. Callahan and Phil Schwan. The InterMezzo filesystem, 1999.
Braam, Peter J. and Phillip A. Nelson. Removing bottlenecks in distributed filesystems: Coda & Intermezzo as examples. Proceedings of the 5th Annual Linux Expo, pages 131–139, 1999.
Brown, Aaron B. and Margo I. Seltzer. Operating system benchmarking in the wake of lmbench: A case study of the performance of NetBSD on the Intel x86 architecture. Technical report, Harvard University, 1997.
Bugnion, Edouard, Scott Devine, Kinshuk Govil and Mendel Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems, 15(4):412–447, 1997.
Creasy, R. J. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development, 25(5):483–490, 1981.
Dannowski, Uwe, Espen Skoglund and Volkmar Uhlig.
L4 eXperimental Kernel Reference Manual, Version X.2. Universität Karlsruhe, 2002.
Douglis, Fred and John K. Ousterhout. Transparent process migration: Design alternatives and the Sprite implementation. Software – Practice and Experience, 21(8):757–785, 1991.
Engler, Dawson R. and M. Frans Kaashoek. DPF: Fast, flexible message demultiplexing using dynamic code generation. In SIGCOMM, pages 53–59, 1996.
Ertl, M. Anton, David Gregg, Andreas Krall and Bernd Paysan. Vmgen – a generator of efficient virtual machine interpreters. Software Practice and Experience, 32(3):265–294, 2002.
Ford, Bryan, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin and Olin Shivers. The Flux OSKit: A substrate for kernel and language research. In Symposium on Operating Systems Principles, pages 38–51, 1997.
Ford, Bryan, Mike Hibler, Jay Lepreau, Patrick Tullmann, Godmar Back and Stephen Clawson. Microkernels meet recursive virtual machines. In Operating Systems Design and Implementation, pages 137–151, 1996.
Foster, Ian, Carl Kesselman, Jeffrey M. Nick and Steven Tuecke. The Physiology of the Grid: An open Grid services architecture for distributed systems integration. Draft, 2002.
Gabriel, Richard P. Patterns of Software: Tales from the Software Community. OUP, 1996.
Goldberg, Robert P. Survey of virtual machine research. IEEE Computer, 7(6):34–45, 1974.
Hansen, Per Brinch. The nucleus of a multiprogramming system. Communications of the ACM, 13(4):238–250, 1970.
Härtig, Hermann, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg and Jean Wolter. The performance of microkernel-based systems, 1997.
Härtig, Hermann, Michael Hohmuth and Jean Wolter. Taming Linux, 1998.
Hohmuth, Michael. The Fiasco Kernel: Requirements Definition. Technische Universität Dresden, 1998.
Kamp, Poul-Henning and Robert N. M. Watson. Jails: Confining the omnipotent root. In Proceedings, SANE 2000 Conference, 2000.
Lawton, Kevin.
Running multiple operating systems concurrently on the IA32 PC using virtualization techniques, 1999.
Liedtke, J. and H. Wenske. Lazy process switching, 2001.
Liedtke, Jochen. Clans & Chiefs. Technical report, German National Research Center for Computer Science, 1992.
Liedtke, Jochen. On micro-kernel construction. In Symposium on Operating Systems Principles, pages 237–250, 1995.
Liedtke, Jochen. L4 Nucleus Version X Reference Manual. Universität Karlsruhe, 1999.
Liedtke, Jochen, Uwe Dannowski, Kevin Elphinstone, Gerd Liefländer, Espen Skoglund, Volkmar Uhlig, Christian Ceelen, Marcus Haeberlen and Marcus Völp. The L4Ka vision. White paper, Universität Karlsruhe, 2001.
Lindholm, Tim and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, 1996.
McVoy, Larry and Carl Staelin. lmbench: Portable tools for performance analysis. Technical report, Silicon Graphics Inc. and Hewlett-Packard Laboratories, 1996.
Miller, S. P., B. C. Neuman, J. I. Schiller and J. H. Saltzer. Kerberos authentication and authorization system. Technical report, Massachusetts Institute of Technology, 1987.
Milojicic, D., F. Douglis, Y. Paindaveine, R. Wheeler and S. Zhou. Process migration. ACM Computing Surveys, 32(3):241–299, 2000.
Morrison, R., A. L. Brown, R. Carrick, R. C. H. Connor, A. Dearle and M. P. Atkinson. The Napier type system. In Rosenberg, J. and D. M. Koch, editors, Persistent Object Systems, pages 3–18. Springer-Verlag, 1990.
Ousterhout, J. K., A. R. Cherenson, F. Douglis, M. N. Nelson and B. B. Welch. The Sprite network operating system. IEEE Computer, 21(2), 1988.
Perkins, C. E. and A. Myles. Mobile IP. Proceedings of International Telecommunications Symposium, pages 415–419, 1997.
Rashid, Richard, Daniel Julin, Douglas Orr, Richard Sanzi, Robert Baron, Alessandro Forin, David Golub and Michael B. Jones. Mach: a system software kernel.
In Proceedings of the 1989 IEEE International Conference, COMPCON, pages 176–178, San Francisco, CA, USA, 1989. IEEE Computer Society Press.
Robin, John Scott and Cynthia E. Irvine. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor. In Proceedings of the 2001 USENIX Security Symposium, 2001.
Satyanarayanan, M. Scalable, secure, and highly available distributed file access. IEEE Computer, 23(5), 1990.
Satyanarayanan, M., J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel and D. C. Steere. Coda: A highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):447–459, 1990.
Shapiro, Jonathan S. EROS: A Capability System. PhD thesis, University of Pennsylvania, 1999.
Shapiro, Jonathan S. and Jonathan Adams. Design evolution of the EROS single-level store. In Proceedings of the 2002 USENIX Annual Technical Conference, 2002.
Skoglund, Espen, Christian Ceelen and Jochen Liedtke. Transparent orthogonal checkpointing through user-level pagers. Lecture Notes in Computer Science, 2135:201–??, 2001.
Venkitachalam, Ganesh and Beng-Hong Lim. Virtualizing I/O devices on VMWare Workstation's hosted virtual machine monitor. In Proceedings of the 2001 USENIX Technical Conference, 2001.
von Laszewski, Gregor, Kazuyuki Shudo and Yoichi Muraoka. Grid-based asynchronous migration of execution context in Java virtual machines. Lecture Notes in Computer Science, 1900:22–34, 2000.
Whitaker, Andrew, Marianne Shaw and Steven D. Gribble. Denali: A scalable isolation kernel. Technical report, University of Washington, 2002.
Zayas, Edward R. Attacking the process migration bottleneck, 1987.