Genode Operating System Framework

Norman Feske
January 23, 2015
Contents

1 Introduction
  1.1 Universal truths
  1.2 Operating-system framework
  1.3 Licensing and commercial support
  1.4 Document structure

2 Getting started

3 Architecture
  3.1 Capability-based security
    3.1.1 Capability spaces, object identities, and RPC objects
    3.1.2 Delegation of authority and ownership
    3.1.3 Capability invocation
    3.1.4 Capability delegation through capability invocation
  3.2 Recursive system structure
    3.2.1 Component ownership
    3.2.2 Tree of components
    3.2.3 Services and sessions
    3.2.4 Client-server relationship
  3.3 Resource trading
    3.3.1 Resource assignment
    3.3.2 Trading memory between clients and servers
    3.3.3 Component-local heap partitioning
    3.3.4 Dynamic resource balancing
  3.4 Core - the root of the component tree
    3.4.1 Dataspaces
    3.4.2 Physical memory allocation (RAM)
    3.4.3 Access to boot modules (ROM)
    3.4.4 Protection domains (PD)
    3.4.5 Address-space management (RM)
    3.4.6 Processing-time allocation (CPU)
    3.4.7 Object-identity allocation (CAP)
    3.4.8 Access to device resources (IO_MEM, IO_PORT, IRQ)
    3.4.9 Logging (LOG)
    3.4.10 Asynchronous notifications (SIGNAL)
    3.4.11 Event tracing (TRACE)
  3.5 Component creation
    3.5.1 Obtaining the child's ROM and RAM sessions
    3.5.2 Constructing the child's address space
    3.5.3 Creating the initial thread and the child's protection domain
  3.6 Inter-component communication
    3.6.1 Synchronous remote procedure calls (RPC)
    3.6.2 Asynchronous notifications
    3.6.3 Shared memory
    3.6.4 Asynchronous state propagation
    3.6.5 Synchronous bulk transfer
    3.6.6 Asynchronous bulk transfer - packet streams

4 Components
  4.1 Device drivers
    4.1.1 Platform driver
    4.1.2 Interrupt handling
    4.1.3 Direct memory access (DMA) transactions
  4.2 Protocol stacks
  4.3 Resource multiplexers
  4.4 Runtime environments and applications
  4.5 Common session interfaces
    4.5.1 Read-only memory (ROM)
    4.5.2 Report
    4.5.3 Terminal and UART
    4.5.4 Input
    4.5.5 Framebuffer
    4.5.6 Nitpicker GUI
    4.5.7 Platform
    4.5.8 Block
    4.5.9 Regulator
    4.5.10 Timer
    4.5.11 NIC
    4.5.12 Audio output
    4.5.13 File system
    4.5.14 Loader
  4.6 Component configuration
    4.6.1 Configuration format
    4.6.2 Server-side policy selection
    4.6.3 Dynamic component reconfiguration at runtime
  4.7 Component compositions
    4.7.1 Sandboxing
    4.7.2 Component-level and OS-level virtualization
    4.7.3 Interposing individual services
    4.7.4 Ceding the parenthood
    4.7.5 Publishing and subscribing
    4.7.6 Enslaving services

5 Development
  5.1 Work flow
  5.2 Tool chain
  5.3 Build system
  5.4 Ports of 3rd-party software
  5.5 Run tool
  5.6 Automated tests

6 System configuration
  6.1 Nested configuration concept
  6.2 The init component
    6.2.1 Session routing
    6.2.2 Resource quota saturation
    6.2.3 Handing out slack resources
    6.2.4 Multiple instantiation of a single ELF binary
    6.2.5 Nested configuration
    6.2.6 Assigning subsystems to CPUs
    6.2.7 Priority support
    6.2.8 Init verbosity
    6.2.9 Executing children in chroot environments on Linux

7 Functional specification
  7.1 Parent-child interaction
  7.2 Fundamental data structures
  7.3 XML processing
  7.4 Process execution environment
  7.5 Remote procedure calls
  7.6 Signals
  7.7 Multi-threading and synchronization
  7.8 Process management
  7.9 Common utilities
  7.10 Server API
  7.11 Support for user-level device drivers
  7.12 Tracing
  7.13 C runtime

8 Under the hood
  8.1 Component-local startup code and linker scripts
    8.1.1 Linker scripts
    8.1.2 Startup code
  8.2 C++ runtime
    8.2.1 Rationale behind using exceptions
    8.2.2 Bare-metal C++ runtime
  8.3 Interaction of core with the underlying kernel
    8.3.1 Bootstrapping and allocator setup
    8.3.2 Kernel-object creation
    8.3.3 Page-fault handling
  8.4 Dynamic linker
    8.4.1 Building dynamically-linked programs
    8.4.2 Startup of dynamically-linked programs
    8.4.3 Address-space management
  8.5 Execution on bare hardware (base-hw)
    8.5.1 Bootstrapping of base-hw
    8.5.2 Kernel entry and exit
    8.5.3 Interrupt handling and preemptive multi-threading
    8.5.4 Split kernel interface
    8.5.5 Public part of the kernel interface
    8.5.6 Core-private part of the kernel interface
    8.5.7 Scheduler of the base-hw kernel
    8.5.8 Sparsely populated core address space
    8.5.9 Multi-processor support of base-hw
    8.5.10 Limitations of the base-hw platform
  8.6 Execution on the NOVA microhypervisor (base-nova)
    8.6.1 Integration of NOVA with Genode
    8.6.2 Bootstrapping of a NOVA-based system
    8.6.3 Log output on modern PC hardware
    8.6.4 Relation of NOVA's kernel objects to Genode's core services
    8.6.5 Page-fault handling on NOVA
    8.6.6 IOMMU support
    8.6.7 Genode-specific modifications of the NOVA kernel
    8.6.8 Known limitations of NOVA
This work is licensed under the Creative Commons Attribution +
ShareAlike License (CC-BY-SA). To view a copy of the license, visit
http://creativecommons.org/licenses/by-sa/4.0/legalcode
1 Introduction
1.1 Universal truths
TODO
TODO define term “trusted computing base”
1.2 Operating-system framework
The Genode OS framework is a tool kit for building highly secure special-purpose operating systems. It scales from embedded systems with as little as 4 MB of memory to
highly dynamic general-purpose workloads.
Genode is based on a recursive system structure. Each program runs in a dedicated
sandbox and gets granted only those access rights and resources that are needed for
its specific purpose. Programs can create and manage sub-sandboxes out of their own
resources, thereby forming hierarchies where policies can be applied at each level. The
framework provides mechanisms to let programs communicate with each other and
trade their resources, but only in strictly-defined manners. Thanks to this rigid regime,
the attack surface of security-critical functions can be reduced by orders of magnitude
compared to contemporary operating systems.
The framework aligns the construction principles of L4 microkernels with Unix philosophy. In line with Unix philosophy, Genode is a collection of small building blocks,
out of which sophisticated systems can be composed. But unlike Unix, those building
blocks include not only applications but also all classical OS functionalities including
kernels, device drivers, file systems, and protocol stacks.
CPU architectures
Genode supports the x86 (32 and 64 bit) and ARM CPU architectures. On x86,
modern architectural features such as IOMMUs and hardware virtualization can
be utilized. On ARM, Genode is able to use TrustZone technology.
Kernels
Genode can be deployed on a variety of different kernels including most members of the L4 family (NOVA, Fiasco.OC, OKL4 v2.1, L4ka::Pistachio, Codezero,
L4/Fiasco). Furthermore, it can be used on top of the Linux kernel to attain rapid
development-test cycles during development. Additionally, the framework is accompanied with a custom kernel for ARM-based platforms that was specifically
developed for Genode and thereby further reduces the complexity of the trusted
computing base compared to other kernels.
Virtualization
Genode supports virtualization at different levels:
• On NOVA, faithful virtualization via VirtualBox allows the execution of unmodified guest operating systems as Genode subsystems. Alternatively, the
Seoul virtual machine monitor can be used to run unmodified Linux-based
guest OSes.
• On the Fiasco.OC kernel, L4Linux represents a paravirtualized Linux kernel
that allows the use of regular Linux distributions as well as Android.
• With Noux, there exists a runtime environment for Unix software such as
GNU coreutils, bash, GCC, binutils, and findutils.
• On ARM, Genode can be used as TrustZone monitor.
Building blocks
There exist more than 100 ready-to-use components such as
• Device drivers for most common PC peripherals including networking, storage, display, USB, PS/2, Intel wireless, and audio output,
• Device drivers for a variety of ARM-based SoCs such as Texas Instruments
OMAP4, Samsung Exynos5, and FreeScale i.MX,
• A GUI stack including a low-complexity GUI server, window management,
and widget toolkits such as Qt5.
• Networking components such as TCP/IP stacks and packet-level network
services
1.3 Licensing and commercial support
Genode is commercially supported by the German company Genode Labs GmbH,
which offers trainings, development work under contract, developer support, and
commercial licensing:
Genode Labs website
http://www.genode-labs.com
The framework is available under two flavors of licenses: an open-source license and
commercial licensing. The primary license used for the distribution of the Genode OS
framework is the GNU General Public License Version 2 (GNU GPL). In short, the GNU
GPL grants everybody the rights to
• Use the Genode OS framework without paying any license fee
• Freely distribute the software
• Modify the source code and distribute modified versions of the software
In return, the GNU GPL requires any modifications and derived work to be published
under the same or a GPL-compatible license. For the full license text, refer to
GNU General Public License Version 2
http://genode.org/about/LICENSE
For applications that require more permissive licensing conditions than granted by the
GNU GPL, Genode Labs offers the option to commercially license the technology upon
request. Please write to [email protected]
1.4 Document structure
TODO
2 Getting started
TODO
• Obtaining the source code
• Source-tree structure
• Downloading 3-rd party source code
• Installing the tool chain
• Creating a build directory
• Configuring the build directory
• Building a target
• Executing a simple system scenario
– Linux
– Microkernel
• Tutorial for building a hello-world program
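The hello-world tutorial is still to be written. As a rough orientation, the following minimal sketch shows what such a component might look like. It is illustrative only; the include path and the printf facility are assumptions based on the framework's base API and may differ from the final tutorial.

  /* main.cc - minimal hello-world component (illustrative sketch only) */
  #include <base/printf.h>

  int main()
  {
      Genode::printf("Hello world!\n");
      return 0;
  }

A matching build-description file (target.mk) would roughly declare the target name, the source file, and the base library, e.g., TARGET = hello, SRC_CC = main.cc, LIBS = base.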
3 Architecture
Contemporary operating systems are immensely complex to accommodate a large variety of applications on an ever diversifying spectrum of hardware platforms. Among
the functionalities provided by a commodity operating system are device drivers, protocol stacks such as file systems and network protocols, the management of hardware
resources, as well as the provisioning of security functions. The latter category is meant
for protecting the confidentiality and integrity of information and the lifelines of critical
functionality. For assessing the effectiveness of such a security function, two questions
must be considered. First, what is the potential attack surface on the function? The answer to this question yields an assessment about the likelihood of a breach. Naturally,
if there is a large number of potential attack vectors, the security function is at high risk.
The second question is: What is the reach of a defect? If the compromised function has
unlimited access to all information processed on the system, the privacy of all users
may be affected. If the function is able to permanently install software, the system may
become prone to back doors.
Today’s widely deployed operating systems do not isolate security-critical functions
from the rest of the operating system. On the contrary, they are co-located with most other
operating-system functionality in a single high-complexity kernel. Thereby, those functions are exposed to the other parts of the operating system. The likelihood of a
security breach is as high as the likelihood of bugs in the overly complex kernel. In
other words: It is certain. Moreover, once an in-kernel function has been compromised,
the defect has an unlimited reach throughout the system.
The Genode architecture was designed to give more assuring answers to the two
stated questions. Each piece of functionality should be exposed to only those parts of
the system, on which it ultimately depends. But it remains hidden from all unrelated
parts. This minimizes the attack surface on individual security functions and thereby
reduces the likelihood of a security breach. In the event that one part of the system
gets compromised, the reach of the defect is limited to the particular part and its dependent parts. But unrelated functionalities remain unaffected. To realize this idea,
Genode composes the system out of many components that interact with each other.
Each component serves a specific role and uses well-defined interfaces to interact with
its peers. For example, a network driver accesses a physical network card and provides
a bidirectional stream of network packets to another component, which, in turn, may
process the packets using a TCP/IP stack and a network application. Even though the
network driver and the TCP/IP stack cooperate when processing network packets, they
are living in separate protection domains. So a bug in one component cannot observe
or corrupt the internal state of another.
Such a component-based architecture, however, raises a number of questions, which
are addressed throughout this chapter. Section 3.1 explains how components can cooperate without inherently trusting each other. Section 3.2 answers the questions of
who defines the relationship between components and how components become acquainted with each other. An operating system ultimately acts on physical hardware
resources such as memory, CPUs, and peripheral devices. Section 3.4 describes how
such resources are made available to components. Section 3.5 answers the question of
how a new component comes to life. The variety of relationships between components
and their respective interfaces call for different communication primitives. Section 3.6
introduces Genode’s inter-component communication mechanisms in detail.
3.1 Capability-based security
This section introduces the nomenclature and the general model of Genode's capability-based security concept. The Genode OS framework is not tied to one kernel but supports a variety of kernels as base platforms. On each of those base platforms, Genode
uses different kernel mechanisms to implement the general model as closely as possible. Note however that not all kernels satisfy the requirements that are needed to
implement the model securely. For assessing the security of a Genode-based system,
the respective platform-specific implementation must be considered. Sections 8.5 and
8.6 provide details for selected kernels.
3.1.1 Capability spaces, object identities, and RPC objects
Each component lives inside a protection domain that provides an isolated execution
environment.
Genode provides an object-oriented way of letting components interact with each
other. Analogously to object-oriented programming languages, which have the notion
of objects and pointers to objects, Genode introduces the notion of RPC objects and
capabilities to RPC objects.
An RPC object provides a remote-procedure call (RPC) interface. Similar to a regular
object, an RPC object can be constructed and accessed from within the same program.
But in contrast to a regular object, it can also be called from the outside of the component. What a pointer is to a regular object, a capability is to an RPC object. It is a token
that unambiguously refers to an RPC object. In the following, we represent an RPC
object as follows.
[Figure: an RPC object inside its protection domain; the attached circle denotes the capability associated with the RPC object]
The circle represents the capability associated with the RPC object. Like a pointer
to an object, which can be used to call a function of the pointed-to object, a capability
can be used to call functions of its corresponding RPC object. However, there are two
important differences between a capability and a pointer. First, in contrast to a pointer
that can be created out of thin air (e. g., by casting an arbitrary number to a pointer),
a capability cannot be created without an RPC object. At the creation time of an RPC
object, Genode creates a so-called object identity that represents the RPC object in the
kernel. Figure 1 illustrates the relationship of an RPC object and its object identity.
For each protection domain, the kernel maintains a so-called capability space, which
is a name space that is local to the protection domain. At the creation time of an RPC
object, the kernel creates a corresponding object identity and lets a slot in the protection
domain's capability space refer to the RPC object's identity. From the component's
point of view, the RPC object A has the name 3. When interacting with the kernel, the
component can use this number to refer to the RPC object A.

Figure 1: Relationship between an RPC object and its corresponding object identity.
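To make the naming scheme concrete, the following sketch models a capability space as a table that maps protection-domain-local names (slot numbers) to object identities. It is a conceptual illustration only, neither kernel nor Genode code, and all names are made up for this example.

  #include <cstdio>
  #include <map>

  /* conceptual model of a per-protection-domain capability space */

  struct Object_identity { /* kernel-side representation of one RPC object */ };

  struct Capability_space
  {
      /* slot number (PD-local name) -> object identity */
      std::map<unsigned, Object_identity *> _slots;

      /* let a free slot refer to the given object identity, return its local name */
      unsigned insert(Object_identity *oi)
      {
          unsigned name = 0;
          while (_slots.count(name)) name++;   /* find the first free slot */
          _slots[name] = oi;
          return name;
      }

      Object_identity *lookup(unsigned name)
      {
          auto it = _slots.find(name);
          return (it != _slots.end()) ? it->second : nullptr;
      }
  };

  int main()
  {
      Object_identity identity_of_a;   /* created along with RPC object A */
      Capability_space pd;             /* capability space of one protection domain */

      unsigned name = pd.insert(&identity_of_a);

      /* the component refers to RPC object A by this purely local number */
      std::printf("RPC object A is known locally as %u\n", name);
  }

The same object identity may appear under entirely different slot numbers in the capability spaces of other protection domains, which is the situation depicted in Figure 2 below.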
3.1.2 Delegation of authority and ownership
The second difference between a pointer and a capability is that a capability can be
passed to different components without losing its meaning. The transfer of a capability from one protection domain to another delegates the authority to use the capability
to the receiving protection domain. This operation is called delegation and can be performed only by the kernel. Note that the originator of the delegation does not diminish
its authority by delegating a capability. It merely shares its authority with the receiving
protection domain. Figure 2 shows the delegation of the RPC object’s capability to a
second protection domain and a further delegation of the capability from the second
to a third protection domain. Whenever the kernel delegates a capability from one to
another protection domain, it inserts a reference to the RPC object’s identity into a free
slot of the target’s capability space. Within protection domain 2 shown in Figure 2,
the RPC object can be referred to by the number 5. Within protection domain 3, the
same RPC object is known as 2. Note that the capability delegation does not hand over
the ownership of the object identity to the target protection domain. The ownership is
always retained by the protection domain that created the RPC object.
Only the owner of an RPC object is able to destroy it along with the corresponding
object identity. Upon destruction of an object identity, the kernel removes all references
15
3.1
Capability-based security
Protection domain 1
RPC object A
Protection domain 2
Protection domain 3
5
3
2
delegate
delegate
kernel
Cap space 1
0
1
2
3
Cap space 2
4
5
...
0
1
2
3
Cap space 3
4
5
...
0
1
2
3
4
5
...
ow
n
s
Object
identity
A
Figure 2: The transitive delegation of a capability from one protection domain to others.
to the vanishing object identity from all capability spaces. This effectively renders the
RPC object inaccessible for all protection domains. Once the object identity for an RPC
object is gone, the owner can destruct the actual RPC object.
3.1.3 Capability invocation
Capabilities enable components to call methods of RPC objects provided by different
protection domains. A component that uses an RPC object plays the role of a client
whereas a component that owns the RPC object acts in the role of a server. The interplay
between client and server is very similar to a situation where a program calls a local
function. The caller puts the function arguments in a place where the callee will be
able to pick them up and then passes control to the callee. When the callee takes over
control, it obtains the function arguments, executes the function, puts the results in a
place where the caller can pick them up, and finally hands control back to the caller.
In contrast to a program-local function call, however, client and server are different
threads in their respective protection domains. The thread at the server side is called
entrypoint denoting the fact that it becomes active only when a call from a client enters
the protection domain. In order to be able to act as a server, a component has to have at
least one entrypoint.
[Figure: a protection domain containing an entrypoint (EP), drawn as a wiggly arrow to denote a thread]
Figure 3: The RPC objects A and B are associated with the server's entrypoint. A client has a
capability for A but not for B. For brevity, the kernel-protected object identities are not
depicted. Instead, the dashed line between the capabilities shows that both capabilities refer to the same object identity.
The wiggly arrow denotes that the entrypoint is a thread. Besides being a thread that
waits for incoming requests, the entrypoint is responsible for maintaining the association of RPC objects with their corresponding capabilities. The previous figures
illustrated this association with the link between the RPC object and its capability. In
order to become callable from the outside, an RPC object must be associated with a
concrete entrypoint. This operation results in the creation of the object’s identity and
the corresponding capability. During the lifetime of the object identity, the entrypoint
maintains the association between the RPC object and its capability in a data structure called object pool, which allows for looking up the matching RPC object for a given
capability. Figure 3 shows a scenario where two RPC objects are associated with one
entrypoint in the protection domain of a server. The capability for the RPC object A has
been delegated to a client.
If a protection domain is in possession of a capability, each thread executed within
this protection domain can issue a call to a member function of the RPC object that is
referred to by the capability. Because this is not a normal function call but the invocation
of an object located in a different protection domain, this operation has to be provided
by the kernel. Figure 4 illustrates the interaction of the client, the kernel, and the server.
The kernel operation takes the client-local name of the invoked capability, the opcode
of the called function, and the function arguments as parameters. Upon entering the
kernel, the client’s thread is blocked until it receives a response. The operation of the
kernel is represented by the dotted line. The kernel uses the supplied local name as
an index into the client’s capability space to look up the object identity, to which the
capability refers. Given the object identity, the kernel is able to determine the entrypoint
that is associated with the object identity and wakes up the entrypoint’s thread with
the information about the incoming request. This information includes the server-local name of the capability that was invoked. Note that the kernel has translated the
client-local name to the corresponding server-local name. The capability name spaces
of client and server are entirely different. The entrypoint uses this number as a key
into its object pool to find the locally implemented RPC object A that belongs to the
invoked capability. It then performs a method call of the so-called dispatch function
on the RPC object. The dispatch function maps the supplied function opcode to the
matching member function and calls this function with the request arguments.

Figure 4: Control flow between client and server when the client calls a method of an RPC
object.
The member function may produce function results. Once the RPC object’s member function returns, the entrypoint thread passes the function results to the kernel by
performing the kernel’s reply operation. At this point, the server’s entrypoint becomes
ready for the next request. The kernel, in turn, passes the function results as return
values of the original call operation to the client and wakes up the client thread.
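The control flow described above can be condensed into a small schematic sketch: the entrypoint keeps an object pool that maps server-local capability names to RPC objects, and each RPC object provides a dispatch function that maps an opcode to a member function. This is an illustration of the mechanism only, not the actual Genode implementation, and all names are invented for the example.

  #include <cstdio>
  #include <map>

  /* schematic illustration of entrypoint dispatching - not Genode code */

  struct Rpc_object                  /* server-local object that can be called remotely */
  {
      enum Opcode { INCREMENT, VALUE };

      int _value = 0;

      /* map the supplied function opcode to the matching member function */
      long dispatch(Opcode op, long arg)
      {
          switch (op) {
          case INCREMENT: _value += arg; return _value;
          case VALUE:     return _value;
          }
          return -1;                 /* unknown opcode */
      }
  };

  struct Entrypoint
  {
      /* object pool: server-local capability names -> RPC objects */
      std::map<unsigned, Rpc_object *> _pool;

      void manage(unsigned name, Rpc_object *obj) { _pool[name] = obj; }

      /* called when the kernel wakes up the entrypoint with a request */
      long handle_request(unsigned name, Rpc_object::Opcode op, long arg)
      {
          Rpc_object *obj = _pool[name];        /* look up the invoked RPC object */
          return obj ? obj->dispatch(op, arg)   /* execute the request */
                     : -1;                      /* no object for this capability */
      }
  };

  int main()
  {
      Rpc_object a;
      Entrypoint ep;
      ep.manage(5, &a);   /* the capability with server-local name 5 refers to object a */

      long result = ep.handle_request(5, Rpc_object::INCREMENT, 3);
      std::printf("result = %ld\n", result);    /* prints: result = 3 */
  }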
3.1.4 Capability delegation through capability invocation
Section 3.1.2 explained that capabilities can be delegated from one protection domain to
another via a kernel operation. But it left open the question of how this procedure works.
The answer is the use of capabilities as RPC message payload. Similar to how a caller
of a regular function can pass a pointer as an argument, a client can pass a capability as
an argument to an RPC call. In fact, passing capabilities as RPC arguments or results is
synonymous with delegating authority between components. If the kernel encounters a
capability as an argument of a call operation, it performs the steps illustrated in Figure
5. The local names are denoted as cap, e.g., cap_arg is the local name of the object identity
at the client side, and cap_translated is the local name of the same object identity at the
server side.
Figure 5: Procedure of delegating a capability specified as RPC argument from a client to a
server.

1. The kernel looks up the object identity in the capability space of the client. This
lookup may fail if the client specified a number of an empty slot of its capability
space. Only if the lookup succeeds is the kernel able to obtain the object identity
referred to by the argument. Note that under no circumstances can the client refer
to object identities for which it has no authority, because it can merely specify the
object identities reachable through its capability space. For all non-empty slots of
its capability space, the protection domain was authorized to use the referenced
object identities by means of prior delegations.
2. Given the object identity of the argument, the kernel searches the server’s capability space for a slot that refers to the object identity. Note that the term “search”
does not necessarily refer to an expensive linear search. The efficiency of the operation largely depends on the kernel implementation.
3. If the server already possesses a capability to the object identity, the kernel translates the argument to the corresponding local name when passing it as part of the
request to the server. If the server does not yet possess a capability to the argument, the kernel installs a new entry into the server’s capability space. The new
entry refers to the object identity of the argument. At this point, the authority
over the object identity has been delegated from the client to the server.
4. The kernel passes the translated or just-created local name of the argument as part
of the request to the server.
Even though the above description covered the delegation of a single capability specified as argument, it is possible to delegate more than one capability with a single RPC
call. Analogously to how capabilities can be delegated from a client to a server as arguments of an RPC call, capabilities can be delegated in the other direction as part of the
reply of an RPC call. The procedure in the kernel is the same in both cases.
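For illustration, the four steps can be written down as a small model of the kernel-side procedure, following the structure of Figure 5. It is not real kernel code, all names are chosen for the example, and real kernels avoid the naive linear search over the server's capability space.

  #include <map>
  #include <optional>

  /* illustrative model of the kernel-side translation procedure */

  struct Object_identity { };

  /* a capability space maps protection-domain-local names to object identities */
  using Local_name = unsigned;
  using Cap_space  = std::map<Local_name, Object_identity *>;

  std::optional<Local_name>
  translate(Cap_space &client, Cap_space &server, Local_name cap_arg)
  {
      /* 1. look up the object identity in the client's capability space */
      auto it = client.find(cap_arg);
      if (it == client.end())
          return std::nullopt;              /* empty slot specified: call fails */

      Object_identity *oi = it->second;

      /* 2. search the server's capability space for the same object identity */
      for (auto const &slot : server)
          if (slot.second == oi)
              return slot.first;            /* 3a. already present: translate the name */

      /* 3b. not present: insert a new entry, thereby delegating authority */
      Local_name cap_new = 0;
      while (server.count(cap_new)) cap_new++;   /* first free slot */
      server[cap_new] = oi;

      /* 4. the translated or just-created local name is passed to the server */
      return cap_new;
  }

  int main()
  {
      Object_identity oi;
      Cap_space client { {3, &oi} }, server;
      auto translated = translate(client, server, 3);
      return translated ? 0 : 1;   /* the server now knows the object under its own name */
  }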
Figure 6: Initial relationship between a parent and a newly created child.
3.2 Recursive system structure
The previous section introduced capability delegation as the fundamental mechanism
to pass authority over RPC objects between protection domains. But in the given examples, the client was already in possession of a capability to the server's RPC object.
This raises the question of how clients get acquainted with servers.
3.2.1 Component ownership
In a Genode system, each component (except for the very first component called core)
has a parent, which owns the component. The ownership relation between a parent and
a child is two-fold.
On the one hand, ownership stands for responsibility. Each component requires physical resources such as the memory used by the component or in-kernel data structures
that represent the component in the kernel. The parent is responsible for providing a budget of those physical resources to the child, not only at the child's creation time but also during
the child’s entire lifetime. As the parent has to assign a fraction of its own physical
resources to its children, it is in the parent's natural interest to maintain the balance of
the physical resources split between itself and each of its children. Besides being the
provider of resources, the parent defines all aspects of the child’s execution and serves
as the child’s primary point of contact for seeking acquaintances with other components.
On the other hand, ownership stands for control. Because the parent has created its
children out of its own resources, it is in the position to exercise ultimate power over
its children. This includes the decision to destruct a child at any time to regain the
resources that were assigned to the child. But it is also in control over the relationships
of the child with other components known to the parent.
Each new component is created as an empty protection domain. It is up to the parent to populate the protection domain with code and data, and to create a thread that
executes the code within the protection domain. At creation time, the parent installs
a single capability called parent capability into the new protection domain. The parent
capability enables the child to perform RPC calls to the parent. The child is unaware of
anything else that exists in the Genode system. It does not even know its own identity
nor the identity of its parent. All it can do is issue calls to its parent using the parent
capability. Figure 6 depicts the situation right after the creation of a child component.
A thread in the parent component created a new protection domain and a thread residing in the protection domain. It also installed the parent capability referring to an RPC
object provided by the parent. To provide the RPC object, the parent has to maintain an
entrypoint. For brevity, entrypoints are not depicted in this and the following figures.
Section 3.5 covers the procedure of creating a component in detail.
The ownership relation between parent and child implies that each component has
to inherently trust its parent. From a child’s perspective, its parent is as powerful as the
kernel. Whereas the child has to trust its parent, a parent does not necessarily need to
trust its children.
3.2.2 Tree of components
The parent-child relationship is not limited to a single level. Child components are free
to use their resources to create further children, thereby forming a tree of components.
Figure 7 shows an example scenario. The init component creates subsystems according to its configuration. In the example, it created two children, namely a GUI and a
launcher. The latter allows the user to interactively create further subsystems. In the
example, launcher was used to start an application.
At each position in the tree, the parent-child interface is the same. The position of a
component within the tree is just a matter of composition. For example, by a mere configuration change of init, the application could be started directly by the init component
and would thereby not be subjected to the launcher.
3.2.3 Services and sessions
The primary purpose of the parent interface is the establishment of communication
channels between components. Any component can inform its parent about a service
that it provides. In order to provide a service, a component needs to create an RPC
object implementing the so-called root interface. The root interface offers functions for
creating and destroying sessions of the service. Figure 8 shows a scenario where the
GUI component announces its service to the init component. The announce function
takes the service name and the capability for the service's root interface as arguments.
Thereby, the root capability is delegated from the GUI to init.

Figure 7: Example of a tree of components. The red arrow represents the ownership relation.
It is up to the parent what to do with the announced information. The parent may
ignore the announcement or remember that the child “GUI” provides a service “GUI”.
A component can announce any number of services by subsequent announce calls.
The counterpart of the service announcement is the creation of a session by a client by
issuing a session request to its parent. Figure 9 shows the scenario where the application
requests a “GUI” session. Along with the session call, the client specifies the type of the
service and a number of session arguments. The session arguments enable the client to
inform the server about various properties of the desired session. In the example, the
client informs the server that it is interested in reading user input and that the client’s
window should be labeled with the name “browser”. As a result of the session request,
the client expects to obtain a capability to an RPC object that implements the session
interface of the requested service. Such a capability is called session capability.
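In terms of the parent interface, the two sides of this handshake boil down to two RPC calls. The following schematic sketch (not the literal Genode API; all types and argument formats are assumptions made for illustration) shows a server announcing its service and a client requesting a session with its session arguments.

  /* schematic sketch of the parent interface - not the literal Genode API */

  struct Capability { unsigned local_name; };   /* opaque reference to an RPC object */

  struct Parent
  {
      /* used by a server: register a service by its name and root capability */
      virtual void announce(char const *service_name, Capability root) = 0;

      /* used by a client: ask for a session; returns the session capability */
      virtual Capability session(char const *service_name, char const *args) = 0;
  };

  /* in reality, the two calls are issued by different components,
     each talking to its own parent */

  void gui_server_side(Parent &parent, Capability gui_root)
  {
      /* the GUI server announces its "GUI" service to its parent */
      parent.announce("GUI", gui_root);
  }

  void application_side(Parent &parent)
  {
      /* the application states its wishes as session arguments; each parent
         along the route may adjust or reject them according to its policy */
      Capability gui_session = parent.session("GUI", "input:read, label:browser");

      (void)gui_session;   /* subsequently used to invoke the session interface */
  }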
When the parent receives a session request from a child, it is free to take a policy
decision of how to respond to the request. This decision is closely related to the management of resources described in Section 3.3.2. There are the following options.
Parent denies the service The parent may deny the request and thereby prevent the
child from using a particular service.
Parent provides the service The parent could decide to implement the requested service by itself by handing out a session capability for a locally implemented RPC
object to the child.
Server is another child If the parent has received an announcement of the service
from another child, it may decide to direct the session request to the other child.
Forward to grandparent The parent may decide to request a session in the name of
its child from its own parent.
Figure 9 illustrates the latter option where the launcher responds to the application’s
session request by issuing a session request to its parent, the init component. Note
that by requesting a session in the name of its child, the launcher is able to modify the
session arguments according to its policy. In the example, the launcher imposes the
use of a different label on the session. When init receives the session request from the
launcher, it is up to init to take a policy decision with the same principal options. In
fact, each component that sits in between the client and the server along the branches
of the ownership tree can impose its policy onto sessions. The routing of the session
request and the final session arguments as received by the server are the result of the
successive application of all policies along the route.

Figure 8: The GUI component announces its service to its parent using the parent interface.

Figure 9: The application requests a GUI session using the parent interface.

Figure 10: Session creation at the server.
In the example, init decides to override the “input” argument. Because the GUI announced its “GUI” service beforehand, init is in possession of the root capability, which
enables it to create and destroy GUI sessions. It decides to respond to the launcher’s
session request by triggering the GUI-session creation at the GUI component’s root
interface. The GUI component responds to this request with the creation of a new
GUI session and attaches the received session arguments to the new session. The accumulated session policy is thereby tied to the session’s RPC object. The RPC object
is accompanied with its corresponding session capability, which is delegated along the
entire call chain up to the originator of the session request (Section 3.1.2). Once the
application’s session request returns, the application can interact directly with the GUI
session using the session capability.
The differentiation between session creation and session use aligns two seemingly
conflicting goals with each other, namely efficiency and the application of the security
policies by potentially many components. All components on the route between client
and server are involved in the creation of the session and can thereby impose their
policies on the session. Once established, the direct communication channel between
client and server via the session capability allows for the efficient interaction between
the two components. For the actual use of the session, the intermediate components are
not on the performance-critical path.
3.2.4 Client-server relationship
Whereas the role of a component as a child is dictated by the strict ownership relation
that implies that the child has to ultimately trust its parent, the role of a component as
client or server is more diverse.
In its role as a client that obtained a session capability as the result of a session request
from its parent, a component is unaware of the real identity of the server. It is unable
to judge the trustworthiness of the server. However, it obtained the session from its
parent, which the client ultimately trusts. Whichever session capability was handed
out by the parent, the client is not in the position to question the parent’s decision.
Even though the integrity of the session capability can be taken for granted, the client
does not need to trust the server in the same way as it trusts its parent. By invoking the
capability, the client is in full control over the information it reveals to the server in the
form of RPC arguments. The confidentiality of its internal state is protected. Furthermore, the invocation of a capability cannot have side effects on the client’s protection
domain other than the retrieval of RPC results. So the integrity of the client’s internal
state is protected. However, when invoking a capability, the client hands over the flow
of execution to the server. The client is blocked until the server responds to the request. A misbehaving server may never respond and thereby block the client infinitely.
Therefore, with respect to the liveliness of the client, the client has to trust the server.
To empathize with the role of a component as a client, a capability invocation can be
compared to the call of a function of an opaque 3rd-party library. When calling such a
library function, the caller can never be certain to regain control. It just expects that a
function returns at some point. However, in contrast to a call of a library function, a capability invocation does not put the integrity and confidentiality of the client’s internal
state at risk.
In the role of a server, a component should generally not trust its clients. On the
contrary, from the server's perspective, clients should be expected to misbehave. This
has two practical implications. First, a server is responsible for validating the arguments
of incoming RPC requests. Second, a server should never make itself dependent on the
good will of its clients. For example, a server should generally not invoke a capability
obtained from one of its clients. A malicious client could have delegated a capability
to a non-responding RPC object, which may block the server forever when invoked
and thereby make the server unavailable for all other clients. As another example, the
server must always be in control over the physical memory resources used for a shared-memory interface between itself and its clients. Otherwise, if a client was in control over
the used memory, it could revoke the memory from the server at any time, possibly
triggering a fault at the server. The establishment of shared memory is described in
26
3.2
Recursive system structure
detail in Section 3.6.3. Similarly to the role as client, the internal state of a server is
protected from its clients with respect to integrity and confidentiality. In contrast to a
client, however, the liveliness of a server is protected as well. A server never needs
to wait for any response from a client. By responding to an RPC request, the server
immediately becomes ready to accept the next RPC request without any prior handshake
with the client of the first request.
Regarding the lifetime of a session, the client is not in the immediate position to dictate to the server when to close a session because it has no power over the server. Instead,
the procedure of closing a session follows the same chain of commands as involved in
the session creation. The common parent of both client and server plays the role of a
broker, which is trusted by both parties. From the client’s perspective, closing a session
is a request to its parent. The client has to accept that the response to such a request is
up to the policy of the parent. From the perspective of a server that is implemented by a
child, the request to close a session originates from its parent, which, as the owner of the
server, represents an authority that must be ultimately obeyed. Otherwise, the parent
of a server might take steps to enforce its will by destructing the server altogether.
3.3 Resource trading
As introduced in Section 3.2.1, child components are created out of the resources of
their respective parent components. This section describes the underlying mechanism.
It first introduces the concept of RAM sessions in Section 3.3.1. Section 3.3.2 explains
how RAM sessions are used to trade resources between components. The resource-trading mechanism ultimately allows servers to become resilient against client-driven
resource-exhaustion attacks. However, such servers need to take special precautions
that are explained in Section 3.3.3. Section 3.3.4 presents a mechanism for the dynamic
balancing of resources among cooperative components.
3.3.1 Resource assignment
In general, it is the operating system’s job to manage the physical resources of the machine in a way that enables multiple applications to utilize them in a safe and efficient
manner. The physical resources are foremost the physical memory, the processing time
of the CPUs, and devices.
The traditional approach to resource management Traditional operating systems
provide abstractions of those resources to applications running on top of the
operating system. For example, instead of exposing the real interface of a device to
an application, a Unix kernel provides a representation of the device as a pseudo file
in the virtual file system. An application interacts with the device indirectly by operating on the respective pseudo file via a device-class-specific API (ioctl operations).
As another example, a traditional OS kernel provides each application with an arbitrary amount of virtual memory, which may be much larger than the available physical
memory. The application's virtual memory is backed with physical memory only when
the application actually uses the memory. The pretension of unlimited memory by the
kernel relieves application developers from considering memory as a limited resource.
On the other hand, this convenient abstraction creates problems that are extremely hard
or even impossible to solve by the OS kernel.
• The amount of physical memory that is available for backing virtual memory is limited. Traditional OS kernels employ strategies to uphold the illusion of unlimited memory by swapping memory pages to disk. However, the
swap space on disk is ultimately limited, too. At one point, when the physical
resources are exhausted, the pretension of unlimited memory becomes a leaky
abstraction and forces the kernel to take extreme decisions such as killing arbitrary processes to free up physical memory.
• Multiple applications including critical applications as well as potentially misbehaving applications share one pool of physical resources. In the presence of a
misbehaving application that exhausts the physical memory, all applications are
equally put at risk.
• By granting each application the legitimate ability to consume as much
memory as the application desires, applications cannot be held accountable for
their consumption of physical memory. The kernel cannot distinguish a misbehaving from a well-behaving memory-demanding application.
There are several approaches to relieve those problems. For example, OS kernels that
are optimized for resource utilization may employ heuristics that take the application
behavior into account for parametrizing page-swapping strategies. Another example
is the provisioning of a facility for pinned memory to applications. Such memory is
guaranteed to be backed by physical memory. But such a facility bears the risk of allowing any application to exhaust physical memory directly. Hence, further heuristics are needed to limit the amount of pinned memory an application may use. Those
countermeasures and heuristics, while making the OS kernel more complex, are mere
attempts to fight symptoms but are unable to solve the actual problems caused by the lack
of accounting. The behavior of such systems remains largely indeterministic.
As a further consequence of the abstraction from physical resources, the kernel has
to contain functionality to support the abstraction. For example, for swapping memory
pages to disk, the kernel has to depend on an in-kernel disk driver. For each application,
whether or not it ever touches the disk, the in-kernel disk driver is part of its trusted
computing base.
RAM sessions and balances Genode hardly abstracts from physical resources.
Instead, it solely arbitrates the access to such resources and provides means to delegate
the authority over resources between components. Each low-level physical resource
is represented as a dedicated service provided by the core component at the root of
the component tree. The core component is described in detail in Section 3.4. The
following description focuses on memory as the most prominent low-level resource
managed by the operating system. Conceptually, the approach could be applied to
other physical resources such as processing time, network bandwidth, disk space, or
even power. However, at the time of writing, Genode employs the mechanism for
memory only.
Physical memory is represented by the RAM service of core. The best way to describe the idea behind the RAM service is to draw the analogy to a bank. Each RAM
session corresponds to a bank account. Initially, when opening a new account, there
is no balance. However, by having the authority over an existing bank account with
a balance, one can transfer funds from the existing account to the new account. Naturally, such a transaction will decrease the balance of the originating account. Internally
at the bank, the transfer does not involve any physical bank notes. The transaction is
merely a change of balances of both involved bank accounts. A bank customer with the
authority over a given bank account can use the value stored on the bank account to
purchase physical goods while withdrawing the costs from the account. Such a withdrawal will naturally decrease the balance on the account. If the account is depleted,
the bank denies the purchase attempt. Analogously to purchasing physical goods by
withdrawing balances from a bank account, physical memory can be allocated from a
RAM session. The balance of the RAM session is the RAM session's quota. A piece of
allocated physical memory is represented by a so-called dataspace (see Section 3.4.1 for
more details). A RAM dataspace is a container of physical memory that can be used for
storing data.

Figure 11: Init assigns a portion of its memory to a child. In addition to its own RAM session
(2), init has created a second RAM session (3) designated for its child.
Subdivision of budgets Similar to a person with a bank account, each component of
a Genode system has a session at core’s RAM service. At boot time, the core component
creates an initial RAM session with the balance set to the amount of available physical
memory. This RAM session is designated for the init component, which is the first and
only child of core. On request by init, core delegates the capability for this initial RAM
session to the init component.
For each child component spawned by the init component, init creates a new RAM
session at core. Figure 11 exemplifies this step for one child. As the result of the
session creation, it obtains the capability for the new RAM session. Because it has the
authority over both its own and the child’s designated RAM session, it can transfer a
certain amount of RAM quota from its own account to the child’s account by invoking
its own RAM-session capability and specifying the beneficiary’s RAM-session capability as argument. Core responds to the request by atomically adjusting the quotas of
both RAM sessions by the specified amount. In the case of init, the amount depends on
init’s configuration. Thereby, init explicitly splits its own RAM budget among its child
components. Each child created by init can obtain the capability for its own RAM session from init via the parent interface and thereby gains the authority over the memory
budget that was assigned to it. Note however, that no child has the authority over init’s
RAM session nor the RAM sessions of any siblings. The mechanism for distributing
a given budget among multiple children works recursively. The children of init can follow the same procedure to further subdivide their budgets for spawning grandchildren.

Figure 12: Memory-stealing attempt
Protection against resource stealing A parent that created a child subsystem out
of its own memory resources expects to regain the spent resources when destructing
the subsystem. For this reason, it must not be possible for a child to transfer funds
to another branch of the component tree without the consent of the parent. Figure 12
illustrates an example scenario that violates this expectation. The client and server components conspire to steal memory from the child. The client was created by the child
and received a portion of the child’s memory budget. The client requested a session
for a service that was eventually routed to the server. The client-server relationship
allows the client to delegate capabilities to the server. Therefore, it is able to delegate
its own RAM session capability to the server. The server, now in possession of the
client’s and its own RAM session capabilities, can transfer memory from the client’s to
its own RAM session. After this transaction, the child has no way to regain its memory
resources because it has no authority over the server’s RAM session.
To prevent such resource-stealing scenarios, Genode restricts quota transfers between arbitrary RAM sessions. Each RAM session must have a reference RAM session,
which can be defined only once. Transfers are permitted only between a RAM session
and its reference RAM session. When creating the RAM session of a child component,
the parent registers its own RAM session as the child’s reference RAM session. This
way, the parent becomes able to transfer budgets between its own and the child’s RAM
session.
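At the code level, such a budget assignment could be pictured roughly as in the following sketch. It assumes a RAM-session interface with ref_account and transfer_quota operations as described above; the quota value is a placeholder and error handling is omitted.

  #include <base/env.h>
  #include <ram_session/connection.h>

  /* placeholder budget assigned to the child */
  enum { CHILD_QUOTA = 512*1024 };

  void fund_child_ram_session()
  {
      using namespace Genode;

      /* RAM session designated for the child, initially with zero balance */
      static Ram_connection child_ram;

      /* the parent's own RAM session becomes the child's reference account */
      child_ram.ref_account(env()->ram_session_cap());

      /* transfer a part of the parent's quota to the child's account */
      env()->ram_session()->transfer_quota(child_ram.cap(), CHILD_QUOTA);
  }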
RAM session destruction When a RAM session is closed, core destroys all dataspaces that were allocated from the RAM session and transfers the RAM session’s final
budget to the corresponding reference RAM session.
3.3.2 Trading memory between clients and servers
An initial assignment of memory to a child is not always practical because the memory
demand of a given component may be unknown at its construction time. For example,
the memory needed by a GUI server over its lifetime is not known beforehand but depends on the number of its clients, the number of windows on screen, or the number of pixels that
must be held at the server. In many cases, the memory usage of a server depends on
the behavior of its clients. In traditional operating systems, system services like a GUI
server would allocate memory on behalf of its clients. Even though the allocation was
induced by a client, the server performs the allocation. The OS kernel remains unaware
of the fact that the server solely needs the allocated memory for serving its client. In the
presence of a misbehaving client that issues an infinite amount of requests to the server
where each request triggers a server-side allocation (for example the creation of a new
window), the kernel will observe the server as a resource hog. Under resource pressure,
it will likely select the server to be punished. Each server that performs allocations on
behalf of its clients is prone to this kind of attack. Genode solves this problem by letting
clients pay for server-side allocations. Client and server may be arbitrary nodes in the
component tree.
Session quotas As described in the previous section, at the creation time of a child,
the parent assigns a part of its own memory quota to the new child. Since the parent retains the RAM-session capabilities of all its children, it can issue further quota transfers
back and forth between the children’s RAM sessions and its own RAM session, which
represents the reference account for all children. When a child requests a session at the
parent interface, it can attach a fraction of its quota to the new session by specifying an
amount of memory to be donated to the server as a session argument. This amount is
called session quota. The session quota can be used by the server during the lifetime of
the session. It is returned to the client when the session is closed.
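For illustration, a client-side session request that donates a session quota might look like the sketch below. The service name, the quota value, and the label are placeholders; the sketch merely assumes that the donation is expressed as a ram_quota session argument.

  #include <base/env.h>
  #include <parent/parent.h>

  void open_session_with_quota()
  {
      using namespace Genode;

      /* request a session and donate 8 KiB of session quota to the server */
      Session_capability cap =
          env()->parent()->session("LOG", "ram_quota=8K, label=example");

      (void)cap;
  }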
When receiving a session request, the parent has to distinguish three different cases depending on its session-routing decision as described in Section 3.2.3.
Parent provides the service If the parent provides the requested service by itself, it
first checks whether the session quota meets its need for providing the service. If
so, it transfers the session quota from the requesting child’s RAM session to its
own RAM session. This step may fail if the child offered a session quota larger
than the available quota in the child’s RAM session.
Server is another child If the parent decides to route the session request to another
child, it transfers the session quota from the client’s RAM session to the server’s
RAM session. Because the two RAM sessions are not related to each other directly but both have the parent's RAM session as their reference account, the transfer from the client to the server consists of two steps. First, the parent transfers the session quota to its own RAM session. If this step succeeds, it transfers the session quota from its own RAM session to the server's RAM session. The parent keeps track of the session quota for each session so that the quota transfers can be reverted later when closing the session. Only after the transfer of the session quota to the server's RAM session has succeeded does the parent issue the actual session request
at the server’s root interface along with the information about the transferred
session quota.
Forward to grandparent The parent may decide to forward the session request to
its own parent. In this case, the parent will request a session on behalf of its
child. The grandparent neither knows nor cares about the actual origin of the
request and will simply decrease the memory quota of the parent. For this reason,
the parent transfers the session quota from the requesting child to its own RAM
session before issuing the session request at the grandparent.
Quota transfers may fail if there is not enough budget on the originating account. In
this case, the parent aborts the session creation and reflects the lack of resources as an
error to the originator of the session request.
This procedure works recursively. Once the server receives the session request along
with the information about the provided session quota, it can use this information to
decide whether or not to provide the session under these resource conditions. It can
also use the information to tailor the quality of the service according to the provided
session quota. For example, a larger session quota might enable the server to use larger
caches or communication buffers for the client’s session.
Session upgrades During the lifetime of a session, the initial session quota may turn
out to be too scarce. Usually, the server returns such a scarcity condition as an error of
operations that imply server-side allocations. The client may handle such a condition
by upgrading the session quota of an existing session by issuing an upgrade request to
its parent along with the targeted session capability and the additional session quota.
The upgrade works analogously to the session creation. The server will receive the
information about the upgrade via the root interface of the service.
Closing sessions If a child issues a session-close request to its parent, the parent
determines the corresponding server, which, depending on the route of the original
session request, may be locally implemented, provided by another child, or provided
by the grandparent. Once the server receives the session-close request, it is responsible for releasing all resources that were allocated from the session quota. The release of resources should revert all allocations the server has performed on behalf of its client.
Stressing the analogy with the bank account, the server has to sell the physical goods
(i. e., RAM dataspaces) it purchased from the client’s session quota to restore the balance on its RAM session. After the server has reverted all session-specific allocations,
the server’s RAM session is expected to have at least as much available budget as the
session quota of the to-be-closed session. So the session quota can be transferred back
to the client.
However, a misbehaving server may fail to release those resources, be it out of malice or because of a bug. For example, the server may be unable to free a dataspace because it
mistakenly used the dataspace for another client’s data. Another example would be a
memory leak in the server. Such misbehavior is detected on the attempt to withdraw
the session quota from the server’s RAM session. If the server’s available RAM quota
after closing a session remains lower than the session quota, the server apparently peculated memory. If the misbehaving server was locally provided by the parent, it has
the full authority to not hand back the session quota to its child. If the misbehaving
service was provided by the grandparent, the parent (and its whole subsystem) has to
subordinate. If, however, the server was provided by another child and the child refuses to release resources, the parent’s attempt to withdraw the session quota from the
server’s RAM session will fail. It is up to the policy of the parent to handle such a failure either by punishing the server (e. g., killing the component) or by granting more of
its own quota. Generally, misbehavior is against the server’s own interests. A server’s
best interest is to obey the parent’s close request to avoid intervention.
3.3.3 Component-local heap partitioning
Components that perform memory allocations on behalf of untrusted parties must
take special precautions for the component-local memory management. There are two
prominent examples of such components. As discussed in Section 3.3.2, a server may be used by multiple clients that must not interfere with each other. Therefore, server-side memory allocations on behalf of a particular client must strictly be accounted to the client's session quota. Second, a parent with multiple children may need to allocate memory to perform the bookkeeping for the individual children, for example, maintaining the information about their open sessions and their session quotas. The parent
should account those child-specific allocations to the respective children. In both cases,
it is not sufficient to merely keep track of the amount of memory consumed on behalf
of each untrusted party but the actual allocations must be performed on independent
backing stores.
Figure 13 shows a scenario where a server performs anonymous memory allocations
on behalf of two sessions. The memory is allocated from the server's heap. Whereas
allocations from the heap are of byte granularity, the heap’s backing store consists of
several dataspaces. Those dataspaces are allocated from the server’s RAM session as
needed but at a much larger granularity. As depicted in the figure, allocations from both
sessions end up in the same dataspaces. This becomes a problem once one session is
closed. As described in the previous section, the server’s parent expects the server to release all resources that were allocated from the corresponding session quota. However,
even if the server reverts all heap allocations that belong to the to-be-closed session, the
server could still not release the underlying backing store because all dataspaces are
still occupied with memory objects of another session. Therefore, the server becomes unable to comply with the parent's expectation.

Figure 13: A server allocates anonymous memory on behalf of multiple clients from a single heap.

Figure 14: A server performs memory allocations from session-specific heap partitions.
The solution to this problem is illustrated in Figure 14. For each session, the server
maintains a separate heap partition. Each memory allocation on behalf of a client is
performed from the session-specific heap partition rather than from a global heap. This
way, memory objects of different sessions populate disjoint dataspaces. When closing a session, the server reverts all memory allocations from the session’s heap. After
freeing the session’s memory objects, the heap partition becomes empty. So it can be
destroyed. By destroying the heap partition, the underlying dataspaces that were used
as the backing store can be properly released.
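A hypothetical session component with its own heap partition might be structured as sketched below. The sketch assumes that the heap allocator can be instantiated per session on top of the component's RAM and RM sessions; the accounting of the allocations against the client's session quota is left out for brevity.

  #include <base/env.h>
  #include <base/heap.h>

  struct Session_component
  {
      /* session-specific heap partition with its own backing-store dataspaces */
      Genode::Heap heap { Genode::env()->ram_session(),
                          Genode::env()->rm_session() };

      void *alloc(Genode::size_t size) { return heap.alloc(size); }

      /*
       * Destroying the session component destroys its heap partition,
       * which releases the underlying dataspaces.
       */
  };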
3.3.4 Dynamic resource balancing
As described in Section 3.3.1, parent components explicitly assign physical resource
budgets to their children. Once assigned, the budget is at the disposal of the respective
child subsystem until the subsystem gets destroyed by the parent.
However, not all components have well-defined resource demands. For example, a
block cache should utilize as much memory as possible unless the memory is needed
by another component. The assignment of a fixed amount of memory to such a block cache cannot accommodate changes of workloads over the potentially long lifetime of the component. If dimensioned too small, there may be a lot of slack memory remaining unutilized. If dimensioned too large, the block cache would prevent other and possibly more important components from using the memory. A better alternative is to enable a component to adapt its resource use to the resource constraints of its parent. The
parent interface supports this alternative with a protocol for the dynamic balancing of
resources.
The resource-balancing protocol uses a combination of synchronous remote procedure calls and asynchronous notifications. Both mechanisms are described in Section
3.6. The child uses remote procedure calls to talk to its parent whereas the parent uses
asynchronous notification to signal state changes to the child. The protocol consists of
two parts, which are complementary.
Resource requests By issuing a resource request to its parent, a child applies for an upgrade of its resources. The request takes the amount of desired resources as argument. A child would issue such a request if it detects a scarcity of resources. A
resource request returns immediately regardless of whether additional resources had
been granted or not. The child may proceed working under the low resource conditions
or it may block for a resource-available signal from its parent. The parent may respond
to this request in different ways. It may just ignore the request, possibly stalling the
child. Alternatively, it may immediately transfer additional quota to the child’s RAM
session. Or it may take further actions to free up resources to accommodate the child.
Those actions may involve long-running operations such as the destruction of subsystems or the further propagation of resource requests towards the root of the component
tree. Once the parent has freed up enough resources to accommodate the child’s request, it transfers the new resources to the child’s RAM session and notifies the child
by sending a resource-available signal.
Yield requests The second part of the protocol enables the parent to express its wish
for regaining resources. The parent notifies the child about this condition by sending
a yield signal to the child. On the reception of such a signal, the child picks up the
so-called yield request at the parent using a remote procedure call. The yield request
contains the amount of resources the parent wishes to regain. It is up to the child to
comply with a yield request or not. Some subsystems have meaningful ways to respond to yield requests. For example, an in-memory block cache could write back the
cached information and release the memory consumed by the cache. Once the child
has succeeded in freeing up resources, it reports to its parent by issuing a so-called yield
response via a remote procedure call to the parent. The parent may respond to a yield
response by withdrawing resources from the child’s RAM session.
3.4 Core - the root of the component tree
Core is the first user-level component, which is directly created by the kernel. It thereby
represents the root of the component tree. It has access to the raw physical resources
such as memory, CPUs, memory-mapped devices, interrupts, I/O ports, and boot modules. Core exposes those low-level resources as services so that they can be used by
other components. For each type of resource, there exists a service in core. For example, memory resources are represented by the RAM service, interrupts are represented
by the IRQ service, and CPUs are represented by the CPU service. In order to access a
resource, a component has to establish a session to the corresponding service. Thereby
the access to physical resources is subjected to the routing of session requests as explained in Section 3.2.3. Moreover, the resource-trading concept described in Section
3.3.2 applies to core services in the same way as for any other service.
In addition to making hardware resources available as services, core provides all
prerequisites to bootstrap the component tree. These prerequisites comprise services
for creating protection domains, for managing address-space layouts, and for creating
object identities.
Core is almost free from policy. There are no configuration options. The only policy of
core is the startup of the init process, to which core grants all available resources. Init, in
turn, uses those resources to spawn further components according to its configuration.
Section 3.4.1 introduces dataspaces as containers of memory or memory-like resources. Dataspaces form the foundation for most of the core services described in the
subsequent sections. The section is followed by the introduction of each individual
service provided by core. In the following, a component that has established a session
to such a service is called a client. E.g., a component that obtained a session to core's
RAM service is a RAM client.
3.4.1 Dataspaces
A dataspace is an RPC object that represents a contiguous physical address-space region of arbitrary size. Its base address and size are subject to the granularity of physical pages as dictated by the memory-management unit (MMU) hardware. Typically, the granularity is 4 KiB.
Dataspaces are created and managed via core’s services. Because each dataspace is
a distinct RPC object, the authority over the contained physical address range is represented by a capability and can thereby be delegated between components. Each component in possession of a dataspace capability can make the dataspace content visible
in its local address space (using core’s RM service described in Section 3.4.5). Hence,
by the means of delegating dataspace capabilities, components can establish shared
memory.
On Genode, only core deals with physical memory pages. All other components use
dataspaces as a uniform abstraction for memory, memory-mapped I/O regions, and
ROM modules.
3.4.2 Physical memory allocation (RAM)
A RAM session is a quota-bounded allocator of physical memory. At session-creation
time, its quota is zero. To make the RAM session functional, it must first receive quota
from another already existing RAM session, which is called the reference account. Once
the reference account is defined, quota can be transferred back and forth between the
reference account and the new RAM session.
Provided that the RAM session is equipped with sufficient quota, the RAM client can
allocate RAM dataspaces from the RAM session. The size of each RAM dataspace is
defined by the client at the allocation time. The location of the dataspace in physical
memory is defined by core.
Each RAM dataspace is physically contiguous and can thereby be used as a DMA
buffer by a user-level device driver. In order to set up DMA transactions, such a device
driver can request the physical address of a RAM dataspace by invoking the dataspace
capability.
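The following sketch shows the typical life cycle of a RAM dataspace from a client's point of view, including the physical-address query a DMA-capable driver would perform. The buffer size is a placeholder and error handling is omitted.

  #include <base/env.h>
  #include <dataspace/client.h>

  void allocate_dma_buffer()
  {
      using namespace Genode;

      /* allocate 16 KiB of physical memory from the component's RAM session */
      Ram_dataspace_capability ds = env()->ram_session()->alloc(16*1024);

      /* make the dataspace content visible in the local address space */
      void *local = env()->rm_session()->attach(ds);

      /* a device driver may query the physical address to set up DMA */
      addr_t phys = Dataspace_client(ds).phys_addr();

      /* release the buffer when it is no longer needed */
      env()->rm_session()->detach(local);
      env()->ram_session()->free(ds);

      (void)phys;
  }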
Closing a RAM session destroys all dataspaces allocated from the RAM session and
restores the original quota. This implies that these dataspaces disappear in all components. The quota of a closed RAM session is transferred to the reference account.
The bookkeeping of quotas and the creation of dataspace RPC objects consume
memory within core. Core’s RAM service allocates such memory from the session
quota supplied by its clients. Note that this session quota is unrelated to the quota
managed by the RAM session. For this reason, an allocation may fail for two different
reasons. The account represented by the RAM session may be depleted. So the allocation cannot be performed unless additional quota is transferred to the account. But
also, the RAM session’s session quota may be depleted so that core is not able to create a new dataspace RPC object for the allocated memory. The latter condition can be
resolved by upgrading the existing RAM session as detailed in Section 3.3.2.
3.4.3 Access to boot modules (ROM)
During the initial bootstrap phase of the machine, a boot loader loads the kernel’s binary and additional chunks of data called boot modules into the physical memory. After
those preparations, the boot loader passes control to the kernel. Examples for boot
modules are the ELF images of the core component, the init component, the components created by init, and the configuration of the init component. Core makes each
boot module available as a ROM session. Because boot modules are read-only memory,
they are generally called ROM modules. On session construction, the client specifies
the name of the ROM module as session argument. Once created, the ROM session
allows its client to obtain a ROM dataspace capability. Using this capability, the client
can make the ROM module visible within its local address space.
3.4.4 Protection domains (PD)
A protection domain (PD) corresponds to a unit of protection within the Genode system. Typically, there is a one-to-one relationship between a component and a PD.
At the hardware level, the CPU isolates different protection domains via a memory-management unit. Each domain is represented by a different page directory or an address-space ID (ASID). A PD session represents the used hardware-based protection
facility.
In addition to representing the unit of memory protection, a PD comprises a capability space as introduced in Section 3.1.1. Initially, the PD’s capability space is empty.
However, the PD client can populate the capability space with a single capability, which
is the parent capability of the component within the PD. The assignment of the parent
capability is done at the creation time of the component by its parent.
A PD on its own is not useful unless it becomes associated with an address-space
layout (RM session) and at least one thread of execution (CPU session). Section 3.5
explains how those sessions can be combined as basic building blocks for creating a
component.
3.4.5 Address-space management (RM)
A region-manager (RM) session represents the layout of a virtual address space. The
size of the virtual address space can be defined via session arguments at session-creation time.
Populating an address space The concept behind RM sessions is a generalization of
the MMU’s page-table mechanism. Analogously to how a page table is populated with
physical page frames, an RM session is populated with dataspaces. Under the hood,
core uses the MMU’s page-table mechanism as a cache for RM sessions. An RM client
in possession of a dataspace capability is able to attach the dataspace to the RM session.
Thereby the content of the dataspace becomes visible within the RM session’s virtual
address space. When attaching a dataspace to an RM session, core selects an appropriate virtual address range that is not yet populated with dataspaces. Alternatively,
the client can specify a designated virtual address. It also has the option to attach a
mere window of the dataspace to the RM session. Furthermore, the client can specify
whether the content of the dataspace should be executable or not.
The counterpart of the attach operation is the detach operation, which enables the RM client to remove dataspaces from its RM session by specifying a virtual address. Under the hood, this operation flushes the MMU mappings of the corresponding virtual
address range so that the dataspace content becomes invisible.
Note that a single dataspace may be attached to any number of RM sessions. A
dataspace may also be attached multiple times to one RM session. In this case, each
attach operation populates a distinct region of the virtual address space.
Assigning threads to an address space As for a PD session, an RM session is not
useful on its own. To enable the use of the RM-session’s address-space layout for a
component, it must first be associated with a thread of execution. An RM client can establish this association with the RM session’s add-client operation, which takes a thread
capability (obtained from a CPU session) as argument. Once associated, core uses the
address-space layout of the RM session to resolve page faults caused by the thread.
Realizing managed dataspaces The entirety of an RM session can be used as a
dataspace. Such a managed dataspace is not backed by a range of physical addresses
but by the range of virtual addresses of its underlying RM session. This makes RM
sessions a generalization of nested page tables. An RM client can obtain a dataspace
capability for a given RM session and use this dataspace capability in the same way as
any other dataspace capability, i. e., attaching it to its local address space, or delegating
it to other components.
Managed dataspaces are used in two ways. First, they allow for the manual management of portions of a component’s virtual address space. For example, the so-called
thread-context area is a dedicated virtual-address range preserved for stacks. Between
the stacks, the virtual address space must remain empty so that stack overflows won’t
silently corrupt data. This is achieved by creating an RM session that represents the
complete thread-context area. This RM session is attached as a dataspace to the component’s virtual address space. When creating a new thread with its corresponding
stack, the thread’s stack is not directly attached to the component’s RM session but to
the context area’s RM session. Another example is the virtual-address range managed
by a dynamic linker to load shared libraries into.
The second use of managed dataspaces is the provision of on-demand-paged dataspaces. A server may hand out dataspace capabilities that are backed by RM sessions to
its clients. Once the client has attached this dataspace to its address space and touches
the content, the client triggers a page fault. Core responds to this page fault by blocking
the client thread and delivering a notification to the RM client of the managed dataspace
(the server) along with the information about the fault address within the RM session.
The server can resolve this condition by attaching a dataspace with real backing store
at the fault address, which prompts core to resume the execution of the faulted thread.
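A managed dataspace could be obtained roughly as sketched below, assuming that an RM session hands out a dataspace capability for its virtual address range. The size of the range is a placeholder.

  #include <base/env.h>
  #include <rm_session/connection.h>

  void managed_dataspace_example()
  {
      using namespace Genode;

      /* RM session covering a 1 MiB virtual address range */
      Rm_connection rm(0, 1024*1024);

      /* the entirety of the RM session can be handed out as a dataspace */
      Dataspace_capability managed_ds = rm.dataspace();

      /* attaching it locally reserves the range within the own address space */
      void *local = env()->rm_session()->attach(managed_ds);

      /* dataspaces attached to 'rm' later appear within this reserved range */
      (void)local;
  }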
3.4.6 Processing-time allocation (CPU)
A CPU session is an allocator for processing time that allows for the creation, the control, and the destruction of threads of execution. At session-construction time, the affinity of a CPU session with CPU cores can be defined via session arguments.
Once created, the session can be used to create, control, and kill threads. Each thread
created via a CPU session is represented by a thread capability. The thread capability is
used for associating the thread with its address space layout (RM session) as well as for
subsequent thread-control operations. The most prominent thread-control operation
is the start of the thread, which takes the thread’s initial stack pointer and instruction
pointer as arguments.
During the lifetime of a thread, the CPU client can retrieve and manipulate the state
of the thread. This includes the register state as well as the execution state (whether the
thread is paused or running). Those operations are primarily designated for realizing
user-level debuggers.
To aid the graceful destruction of threads, the CPU client can issue a cancel-blocking
operation, which causes the specified thread to cancel a current blocking operation such
as waiting for an RPC response or the attempt to acquire a contended lock.
3.4.7 Object-identity allocation (CAP)
Genode regards object identities as a physical resource because each object identity is
represented as a kernel object. Core's CAP service allows for the creation and destruction of new object identities. For each RPC object associated with an RPC entrypoint,
the entrypoint requests the creation of a new object identity from a CAP session and
associates the RPC object with the capability allocated from the CAP service.
For more information about the relationship between RPC objects and object identities, refer to Section 3.1. The interplay between RPC objects, the RPC entrypoint, and
core’s CAP service is described in depth in Section 3.6.1.
3.4.8 Access to device resources (IO_MEM, IO_PORT, IRQ)
Core’s IO_MEM, IO_PORT, and IRQ services enable the realization of user-level device
drivers as Genode components.
Memory-mapped I/O (IO_MEM) An IO_MEM session provides a dataspace representation for a non-memory part of the physical address space such as memory-mapped I/O regions or BIOS areas. In contrast to a memory block that is used for storing information, whose physical location in memory does not matter, a non-memory object has special semantics attached to its location within the physical address space. Its location is either fixed (by standard) or can be determined at runtime, for example by
scanning the PCI bus for PCI resources. If the physical location of such a non-memory
object is known, an IO_MEM session can be created by specifying the physical base
address, the size, and the write-combining policy of the memory-mapped resource as
session arguments. Once an IO_MEM session is created, the IO_MEM client can request
a dataspace containing the specified physical address range.
Core hands out each physical address range only once. Session requests for ranges
that intersect with physical memory are denied. Even though the granularity of memory protection is limited by the MMU page size, the IO_MEM service accepts the specification of the physical base address and size at the granularity of bytes. The rationale behind this contradiction is the unfortunate existence of platforms that host memory-mapped resources of unrelated devices on the same physical page. When driving such devices from different components, each of those components requires access to its corresponding device. So the same physical page must be handed out to multiple components. Of course, those components must be trusted not to touch any portion of the page that is unrelated to their own devices.
Port I/O (IO_PORT) For platforms that rely on I/O ports for device access, core’s
IO_PORT service enables the fine-grained assignment of port ranges to individual components. Each IO_PORT session corresponds to the exclusive access right to a port
range specified as session arguments. Core creates the new IO_PORT session only if
the specified port range does not overlap with an already existing session. This ensures that each I/O port is driven by only one IO_PORT client at a time. The IO_PORT
session interface resembles the physical I/O port access instructions. Reading from an
I/O port can be performed via an 8-bit, 16-bit, or 32-bit access. Vice versa, there exist operations for writing to an I/O port via an 8-bit, 16-bit, or 32-bit access. The read
and write operations take absolute port addresses as arguments. Core performs the
I/O-port operation only if the specified port address lies within the port range of the
session.
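As an illustration, access to a small port range might be requested and used as follows. The chosen port range is an arbitrary example.

  #include <io_port_session/connection.h>

  void io_port_example()
  {
      /* gain exclusive access to the port range 0x60..0x64 */
      Genode::Io_port_connection ports(0x60, 5);

      /* 8-bit read and write at absolute port addresses within the range */
      unsigned char status = ports.inb(0x64);
      ports.outb(0x60, 0xff);

      (void)status;
  }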
Reception of device interrupts (IRQ) Core’s IRQ service enables device-driver components to respond to device interrupts. Each IRQ session corresponds to an interrupt.
The physical interrupt number is specified as session argument. Each physical interrupt number can be assigned to only one session. The IRQ session interface provides an operation to wait for the next interrupt. Only while the IRQ client is waiting for an interrupt does core unmask the interrupt at the interrupt controller. Once the interrupt occurs, core wakes up the IRQ client and masks the interrupt at the interrupt controller until the driver has completed the IRQ handling and waits for the next interrupt.
3.4.9 Logging (LOG)
The LOG service is used by the lowest-level system components such as the init component for printing diagnostic output. Each LOG session takes a label as session argument, which is used to prefix the output of this session. This enables developers to
distinguish the output of different components with each component having a unique
label. The LOG client transfers the to-be-printed characters as payload of plain RPC
messages, which represents the simplest possible communication mechanism between
the LOG client and core’s LOG service.
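For completeness, a minimal sketch of producing LOG output is given below. Components typically use the framework's printf facility, which transfers the characters via the component's LOG session.

  #include <base/printf.h>

  void log_example()
  {
      /* the characters travel as RPC payload over the LOG session */
      Genode::printf("%s\n", "diagnostic message");
  }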
3.4.10 Asynchronous notifications (SIGNAL)
Core’s SIGNAL service plays the role of a broker of asynchronous notifications on kernels that lack the semantics of Genode’s signalling API. The service is not used directly
by components at the framework’s API level but is merely an implementation artifact.
3.4.11 Event tracing (TRACE)
The TRACE service provides a light-weight event-tracing facility. It is not fundamental
to the architecture. However, as the service allows for the inspection and manipulation of arbitrary threads of a Genode system, TRACE sessions must not be granted to
untrusted components.
Figure 15: Starting point for creating a new component
3.5 Component creation
Each Genode component is made out of five basic ingredients:
RAM session for allocating the memory for the program’s BSS segment and heap
ROM session with the executable binary
CPU session for creating the initial thread of the component
RM session for managing the component’s address-space layout
PD session representing the component’s protection domain
It is the responsibility of the new component’s parent to obtain those sessions. The
initial situation of the parent is depicted in Figure 15. The parent’s memory budget is
represented by the parent’s RAM (Section 3.4.2) session. The parent’s virtual address
space is represented by the parent’s RM session (Section 3.4.5). Both sessions were originally created at the parent’s construction time. Along with the parent’s CPU session
and PD session, those sessions form the parent’s so-called environment. The parent’s RM
session is populated with the parent’s code (shown as red), the so-called thread-context
area that hosts the stacks (shown as blue), and presumably several RAM dataspaces for
the heap, the DATA segment, and the BSS segment. Those are shown as yellow.
3.5.1 Obtaining the child’s ROM and RAM sessions
The first step for creating a child component is obtaining the component’s executable
binary, e. g., by creating a session to a ROM service such as the one provided by core
(Section 3.4.3). With the ROM session created, the parent can make the dataspace with
the executable binary (i. e., an ELF binary) visible within its virtual address space by
attaching the dataspace to its RM session. After this step, the parent is able to inspect
the ELF header to determine the memory required for the binary's DATA and BSS segments.

Figure 16: The parent creates the RAM session of the new child and obtains the child's executable
The next step is the creation of the child’s designated RAM session, which represents
the memory budget the child will have at its disposal. The freshly created RAM session
has no budget though. In order to make the RAM session usable, the parent has to
transfer a portion of its own RAM quota to the child’s RAM session. As explained in
Section 3.3.1, the parent registers its own RAM session as the reference account for the
child’s RAM session in order to become able to transfer quota back and forth between
both RAM sessions. Figure 16 shows the situation.
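These first two steps could be sketched as follows, using the ROM and RAM session interfaces described in Sections 3.4.3 and 3.4.2. The binary name and the transferred quota are placeholders and error handling is omitted.

  #include <base/env.h>
  #include <rom_session/connection.h>
  #include <ram_session/connection.h>

  void obtain_child_rom_and_ram()
  {
      using namespace Genode;

      /* ROM session providing the child's executable binary */
      static Rom_connection rom("child_binary");

      /* make the ELF image inspectable within the parent's address space */
      void *elf_addr = env()->rm_session()->attach(rom.dataspace());

      /* RAM session designated for the child, funded from the parent's quota */
      static Ram_connection child_ram;
      child_ram.ref_account(env()->ram_session_cap());
      env()->ram_session()->transfer_quota(child_ram.cap(), 4*1024*1024);

      (void)elf_addr;
  }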
3.5.2 Constructing the child’s address space
With the child’s RAM session equipped with memory, the parent can construct the address space for the new child and populate it with memory allocated from the child’s
budget. The address-space layout is represented as a session to core’s RM service (Section 3.4.5). Hence, as illustrated in Figure 17, the parent has to create an RM session
designated for the child. When creating the session, the parent is able to constrain the
bounds of the virtual address space. By default, the first page is excluded such that any
attempt by the child to de-reference a null pointer will cause a fault instead of silently
corrupting memory. After its creation time, the child’s RM session is empty. It is up
to the parent to populate the virtual address space with meaningful information by attaching dataspaces to the RM session. The parent performs this procedure based on the
information found in the ELF executable’s header:
Read-only segments For each read-only segment of the ELF binary, the parent attaches the corresponding portion of the ELF dataspace to the child’s address space
by invoking the attach operation on the child's RM-session capability.

Figure 17: The parent creates and populates the virtual address space of the child using a new RM session (the parent's RM and RAM sessions are not depicted for brevity)

By attaching a portion of the existing ELF dataspace to the new child's RM session, no memory needs to be copied. If multiple instances of the same executable are created,
the read-only segments of all instances refer to the same physical memory pages.
If the segment contains the TEXT segment (the program code), the parent specifies a so-called executable flag to the attach operation. Core passes this flag to the
respective kernel such that the corresponding page-table entries for the new components will be configured accordingly (by setting or clearing the non-executable
bit in the page-table entries). Note that the propagation of this information (or
the lack thereof) depends on the used kernel. Also note that not all hardware
platforms distinguish executable from non-executable memory mappings.
Read-writable segments In contrast to read-only segments, read-writable segments
cannot be shared between components. Hence, each read-writable segment must
be backed with a distinct copy of the segment data. The parent allocates the backing store for the copy from the child’s RAM session and thereby accounts the
memory consumption on behalf of the child to the child’s budget. For each segment, the parent performs the following steps:
1. Allocation of a RAM dataspace from the child’s RAM session. The size of
the dataspace corresponds to the segment’s memory size. The memory size
may be higher than the size of the segment in the ELF binary (named file
size). In particular, if the segment contains a DATA section followed by a BSS
section, the file size corresponds to the size of the DATA section whereby the
memory size corresponds to the sum of both sections. Core's RAM service ensures that each freshly allocated RAM dataspace is guaranteed to contain
zeros. Core’s RAM service returns a RAM dataspace capability as the result
of the allocation operation.
2. Attachment of the RAM dataspace to the parent’s virtual address space by
invoking the attach operation on the parent’s RM session with the RAM
dataspace capability as argument.
3. Copying of the segment content from the ELF binary's dataspace to the freshly allocated RAM dataspace. If the memory size of the segment is larger than
the file size, no special precautions are needed as the remainder of the RAM
dataspace is known to be initialized with zeros.
4. After filling the content of the segment dataspace, the parent no longer needs
to access it. It can remove it from its virtual address space by invoking the
detach operation on its own RM session.
5. Based on the virtual segment address as found in the ELF header, the parent
attaches the RAM dataspace to the child’s virtual address space by invoking
the attach operation on the child’s RM session with the RAM dataspace as
argument.
This procedure is repeated for each segment. Note that although the above description refers to ELF executables, the underlying mechanisms used to load the executable
binary are file-format agnostic.
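The five steps above might be expressed as in the following sketch. The names child_ram, child_rm, and elf_addr as well as the attach_at convenience are assumptions made for illustration; in practice, the offsets and sizes are taken from the ELF program headers.

  #include <base/env.h>
  #include <ram_session/ram_session.h>
  #include <rm_session/rm_session.h>
  #include <util/string.h>   /* Genode::memcpy */

  void load_rw_segment(Genode::Ram_session &child_ram,
                       Genode::Rm_session  &child_rm,
                       void const *elf_addr,
                       Genode::off_t file_offset, Genode::size_t file_size,
                       Genode::addr_t vaddr,      Genode::size_t mem_size)
  {
      using namespace Genode;

      /* 1. allocate zero-initialized backing store from the child's budget */
      Ram_dataspace_capability ds = child_ram.alloc(mem_size);

      /* 2. attach the dataspace to the parent's own address space */
      void *dst = env()->rm_session()->attach(ds);

      /* 3. copy the segment content, the remainder stays zero (BSS) */
      memcpy(dst, (char const *)elf_addr + file_offset, file_size);

      /* 4. the parent no longer needs local access to the segment */
      env()->rm_session()->detach(dst);

      /* 5. attach the dataspace at the segment's virtual address in the child */
      child_rm.attach_at(ds, vaddr);
  }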
3.5.3 Creating the initial thread and the child’s protection domain
With the virtual address space of the child configured, it is time to create the component’s initial thread. Analogously to the child’s RAM and RM sessions, the parent creates a CPU session (Section 3.4.6) for the child. The parent may use session arguments
to constrain the scheduling parameters (i. e., the priority) and the CPU affinity of the
new child. Whichever session arguments are specified, the child’s abilities will never
exceed the parent’s abilities. I.e., the child’s priority is subjected to the parent’s priority
constrains. Once constructed, the CPU session can be used to create new threads by
invoking the session’s create-thread operation. The invocation of this operation results
in a thread capability, which can be used to control the execution of the thread. Immediately after its creation, the thread remains inactive. In order to be executable, it first
needs to be configured. In particular, the thread needs to be associated with its address
space. Otherwise, the kernel (respectively core) would not know how to respond to page
faults triggered by the thread. To associate the thread with the virtual address space
of the new child, the parent invokes the add-client operation at the child’s RM session
with the thread capability as argument.
The final step is the creation of the child’s protection domain and the assignment of
the child’s initial thread to the new protection domain. A protection domain is represented by a session to core’s PD service. As described in Section 3.2.1, each protection
domain initially has a single capability installed, which allows the child to communicate with its parent.

Figure 18: Creation of the child's protection domain and initial thread

Right after the creation of the PD for a new child, the parent can
register a capability to a locally implemented RPC object as parent capability for the
PD session. Furthermore, the parent binds the initial thread to the child’s protection
domain using the PD-session interface. Now that the initial thread has a defined virtual address space and a protection domain, it is the right time to kick off the execution
of the thread using the start operation of the CPU session with the thread capability and
the initial program counter as arguments. The latter argument corresponds to the program’s entry-point address as found in the ELF header of the child’s executable binary.
Figure 18 illustrates the relationship between the PD session, the CPU session, and the
parent capability. Note that neither the ROM dataspace containing the ELF binary nor
the RAM dataspaces allocated during the ELF loading are visible in the parent’s virtual address space any longer. After initially loading the ELF binary, the parent has
detached those dataspaces from its own RM session.
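Put together, the creation of the initial thread and the protection domain might look like the sketch below. The thread name, the stack pointer, and the entry point are placeholders; child_rm and parent_cap are assumed to be the child's RM session and the capability of the parent's locally implemented RPC object.

  #include <cpu_session/connection.h>
  #include <pd_session/connection.h>
  #include <rm_session/rm_session.h>
  #include <parent/capability.h>

  void start_child(Genode::Rm_session       &child_rm,
                   Genode::Parent_capability parent_cap,
                   Genode::addr_t            entry)
  {
      using namespace Genode;

      /* CPU session used to create the child's initial thread */
      static Cpu_connection cpu;
      Thread_capability thread = cpu.create_thread("child_main");

      /* associate the thread with the child's address-space layout */
      child_rm.add_client(thread);

      /* protection domain of the child */
      static Pd_connection pd;
      pd.assign_parent(parent_cap);
      pd.bind_thread(thread);

      /* start execution at the ELF entry point; the initial stack pointer
         is passed as 0 here merely as a placeholder */
      cpu.start(thread, entry, 0);
  }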
The child starts its execution at the virtual address defined by the ELF entrypoint.
For regular Genode components, it points to a short assembly routine that sets up the
initial stack and calls the low-level C++ startup code. This code, in turn, initializes the
C++ runtime (such as the exception handling) along with the component’s local Genode
environment. The environment is constructed by successively requesting the component’s RM, RAM, CPU, and PD sessions from its parent. With the Genode environment
in place, the startup code initializes the thread-context area, sets up the real stack for the
main thread within the thread-context area, and returns to the assembly startup code.
The assembly code, in turn, switches the stack from the initial stack to the real stack and
calls the program-specific C++ startup code. This code executes global constructors before calling the program’s main function. Section 8.1 describes the component-local
startup procedure in detail.
3.6 Inter-component communication
Genode provides three principal mechanisms for inter-component communication,
namely synchronous remote procedure calls (RPC), asynchronous notifications, and
shared memory. Section 3.6.1 describes synchronous RPC as the most prominent one.
In addition to transferring information across component boundaries, the RPC mechanism provides the means for delegating capabilities and thereby authority throughout
the system.
The RPC mechanism closely resembles the semantics of a function call where the control is transferred from the caller to the callee until the function returns. As discussed
in Section 3.2.4, there are situations where the provider of information does not wish to
depend on the recipient to return control. Such situations are addressed by the means
of an asynchronous notification mechanism explained in Section 3.6.2.
Neither synchronous RPC nor asynchronous notifications are suitable for transferring large bulks of information between components. RPC messages are strictly bound
to a small size and asynchronous notifications do not carry any payload at all. This
is where shared memory comes into play. By sharing memory between components,
large bulks of information can be propagated without the active participation of the
kernel. Section 3.6.3 explains the procedure of establishing shared memory between
components.
Each of the three basic mechanisms is rarely found in isolation. Most inter-component
interactions are a combination of the mechanisms. Section 3.6.4 introduces a pattern
for propagating state information by combining asynchronous notifications with RPC.
Section 3.6.5 shows how synchronous RPC can be combined with shared memory
to transfer large bulks of information in a synchronous way. Section 3.6.6 combines
asynchronous notifications with shared memory to largely decouple producers and
consumers of high-throughput data streams.
Figure 19: Layered architecture of the RPC mechanism
3.6.1 Synchronous remote procedure calls (RPC)
Section 3.1.3 introduced remote procedure calls (RPC) as Genode’s fundamental mechanism to delegate authority between components. It introduced the terminology for
RPC objects, capabilities, object identities, and entrypoints. It also outlined the flow of
control between a client, the kernel, and a server during an RPC call. This section complements Section 3.1.3 with the information of how the mechanism presents itself at the
C++ language level. It first introduces the layered structure of the RPC mechanism and
the notion of typed capabilities. After presenting the class structure of an RPC server, it
shows how those classes interact when RPC objects are created and called.
Typed capabilities Figure 19 depicts the software layers of the RPC mechanism.
Kernel inter-process-communication (IPC) mechanism At the lowest level, the
kernel’s IPC mechanism is used to transfer messages back and forth between
client and server. The actual mechanism largely differs between the various kernels supported by Genode. Chapter 8 gives insights into the functioning of the
IPC mechanism as used on specific kernels. Genode’s capability-based security
model is based on the presumption that the kernel protects object identities as
kernel objects, allows user-level components to refer to kernel objects via capabilities, and supports the delegation of capabilities between components using the
kernel’s IPC mechanism. At the kernel-interface level, the kernel is not aware of
language semantics like the C++ type system. From the kernel’s point of view, an
object identity merely exists and can be referred to, but it has no type.
IPC library The IPC library introduces a kernel-independent programming interface
that is needed to implement the principle semantics of clients and servers. For
each kernel supported by Genode, there exists a distinct IPC library that uses the
respective kernel mechanism. The IPC library introduces the notions of untyped
capabilities, message buffers, IPC clients, and IPC servers.
An untyped capability is the representation of a Genode capability at the C++ language level. It consists of the local name of the referred-to object identity as well
as a means to manage the lifetime of the capability, i. e., a reference counter. The
exact representation of an untyped capability depends on the used kernel.
A message buffer is a statically sized buffer that carries the payload of an IPC message. It distinguishes two types of payload, namely raw data and capabilities.
Payloads of both kinds can be simultaneously present. A message buffer can
carry up to 1 KiB of raw data and up to four capabilities. Prior to issuing the kernel
IPC operation, the IPC library translates the message-buffer content to the format
understood by the kernel’s IPC operation.
The IPC client represents the calling side of the communication channel to a given
destination capability. It uses two message buffers, a send buffer for the arguments to be sent to the server and a receive buffer for the results delivered as a
reply by the server. The user of an IPC client object can populate the send buffer
with raw data and capabilities using the C++ insertion operator, invoke the kernel’s call operation, and obtain the results by using the C++ extraction operator.
The kernel’s call operation blocks the execution of the IPC client until the server
replied to the call.
The IPC server represents the callee side of the communication with potentially
many IPC clients. Analogously to the IPC client, it uses two message buffers, a
receive buffer for incoming requests and a send buffer for delivering the reply of
the last request. Each IPC server has a corresponding untyped capability that can
be used to perform calls to the server using an IPC client object. An IPC server
object can be used to wait for incoming messages, read the arguments from the
receive buffer using the C++ extraction operator, populate the send buffer with
the reply message, and submit the reply to the client. The IPC server does not obtain any form of client identification along with an incoming message that could
be used to implement server-side access-control policies. Instead of performing
access control based on a client identification in the server, access control is solely
performed by the kernel on the invocation of capabilities. If a request was delivered to the server, the client has – by definition – a capability for communicating
with the server and thereby the authority to perform the request.
RPC stub code The RPC stub code complements the IPC library with the semantics
of RPC interfaces and RPC functions. An RPC interface is an abstract C++ class
with the declarations of the functions callable by RPC clients. Thereby each RPC
interface is represented as a C++ type. The declarations are accompanied with
annotations that allow the C++ compiler to generate the so-called RPC stub code
on both the client side and server side. Traditionally, the generation of communication stub code employs code-generating tools like IDL compilers. In
contrast, Genode uses C++ templates to generate the stub code, which avoids the
crossing of a language barrier when designing RPC interfaces and alleviates the
need for code-generating tools in addition to the compiler.
The client-side stub code translates C++ method calls to a sequence of operations
on an IPC client object. I.e., each RPC function of an RPC interface has an associated opcode (according to the order of RPC functions). This opcode along with
the method arguments are inserted into the IPC client’s send buffer. Vice versa,
the stub code translates the content of the IPC client’s receive buffer to return
values of the method invocation.
The server-side stub code implements the so-called dispatch function, which
takes the IPC server’s receive buffer, translates the message into a proper C++
method call, calls the corresponding server-side function of the RPC interface,
and translates the function results into the IPC server’s send buffer.
RPC object and client object Thanks to the RPC stub code, the server-side implementation of an RPC object comes down to the implementation of the abstract
interface of the corresponding RPC interface. When an RPC object is associated
with an entrypoint, the entrypoint creates a unique capability for the given RPC
object. RPC objects are typed with their corresponding RPC interface. This C++
type information is propagated to their corresponding capabilities. I.e., when
associating an RPC object that implements the LOG-session interface with an entrypoint, the resulting capability is a LOG-session capability.
This capability represents the authority to invoke the functions of the RPC object.
On the client side, the client object plays the role of a proxy of the RPC object
within the client’s component. Thereby, the client becomes able to interact with
the RPC object in a natural manner.
Sessions and connections Section 3.2.3 introduced sessions between client and
server components as the basic building blocks of system compositions. At the
server side each session is represented by an RPC object that implements the
session interface. At the client side, an open session is represented by a connection object. The connection object encapsulates the session arguments and also
represents a client object to interact with the session.
As depicted in Figure 19, capabilities are associated with types on all levels above the
IPC library. Because the IPC library is solely used by the RPC stub code but not at the
framework’s API level, capabilities appear as being C++ type safe, even across component boundaries. Each RPC interface implicitly defines a corresponding capability
type. Figure 20 shows the inheritance graph of Genode’s most fundamental capability
types.
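To give an impression of how such an annotated interface can look at the C++ level, the following sketch declares a hypothetical session interface with a single RPC function. The GENODE_RPC and GENODE_RPC_INTERFACE annotations are assumed to be the macros from which the stub code is generated; the interface itself is made up for illustration.

  #include <session/session.h>
  #include <base/rpc.h>

  namespace Demo {

      struct Session : Genode::Session
      {
          static const char *service_name() { return "Demo"; }

          /* RPC function callable by clients of the session */
          virtual int add(int a, int b) = 0;

          /* annotations from which client- and server-side stub code is generated */
          GENODE_RPC(Rpc_add, int, add, int, int);
          GENODE_RPC_INTERFACE(Rpc_add);
      };
  }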
Server-side class structure Figure 21 gives an overview of the C++ classes that are involved at the server side of the RPC mechanism. As described in Section 3.1.3, each entrypoint maintains a so-called object pool. The object pool contains references to RPC objects associated with the entrypoint. When receiving an RPC request along with the local name of the invoked object identity, the entrypoint uses the object pool to look up the corresponding RPC object. As seen in the figure, the RPC object is a class template parametrized with its RPC interface. When instantiated, the dispatch function is generated by the C++ compiler according to the RPC interface.
Figure 20: Fundamental capability types
Figure 21: Server-side structure of the RPC mechanism
Client
Core
Server
CAP
Object
pool
Entrypoint
create
manage a
RPC object
a
allocate capability
return capability capa
associate
a with capa
return
capa
delegate
capa
Figure 22: Creation of a new RPC object
RPC-object creation Figure 22 shows the procedure of creating a new RPC object.
The server component has already created an entrypoint, which, in turn, created its
corresponding object pool.
1. The server component creates an instance of an RPC object. “RPC object” denotes
an object that inherits the RPC object class template typed with the RPC interface
and that implements the virtual functions of this interface. By inheriting the RPC
object class template, it gets equipped with a dispatch function for the given RPC
interface.
Note that a single entrypoint can be used to manage any number of RPC objects
of arbitrary types.
2. The server component associates the RPC object with the entrypoint by calling the
entrypoint’s manage function with the RPC object as argument. The entrypoint
responds to this call by allocating a new object identity using a session to core’s
CAP service (Section 3.4.7). For allocating the new object identity, the entrypoint
specifies the untyped capability of its IPC server as argument. Core’s CAP service
returns the new object identity in the form of a new capability that is derived from
the specified capability. When invoked, the derived capability refers to the same
IPC server as the original capability. But it represents a distinct object identity.
The IPC server retrieves the local name of this object identity when called via the
derived capability. The entrypoint stores the association of the derived capability
with the RPC object in the object pool.

Figure 23: Invocation of an RPC object
3. The entrypoint hands out the derived capability as return value of the manage
function. At this step, the derived capability is converted into a typed capability
with its type corresponding to the type of the RPC object that was specified as
argument. This way, the link between the types of the RPC object and the corresponding capability is preserved at the C++ language level.
4. The server delegates the capability to another component, e. g., as payload of a
remote procedure call. At this point, the client received the authority to call the
RPC object.
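A simplified model of steps 2 and 3, expressed in plain C++, shows how the association with an entrypoint can yield a capability whose C++ type matches the RPC object. The Capability template and the manage function below are illustrative stand-ins for the framework's counterparts, not the actual API.

#include <cstdio>

/* untyped capability, essentially a local name (illustrative) */
struct Untyped_capability { unsigned long local_name = 0; };

/* typed capability: carries the RPC interface as a C++ type parameter */
template <typename RPC_INTERFACE>
struct Capability : Untyped_capability { };

/* an RPC interface is an ordinary abstract C++ class */
struct Log_session
{
    virtual ~Log_session() { }
    virtual void write(char const *msg) = 0;
};

/* server-side RPC object implementing the interface */
struct Log_component : Log_session
{
    void write(char const *msg) override { std::printf("[log] %s\n", msg); }
};

/* 'manage' yields a capability whose type matches the managed RPC object */
template <typename IF>
Capability<IF> manage(IF &, unsigned long name)
{
    Capability<IF> cap;
    cap.local_name = name;
    return cap;
}

int main()
{
    Log_component log;

    /* the returned capability is a LOG-session capability by type */
    Capability<Log_session> log_cap = manage<Log_session>(log, 42);

    std::printf("local name of LOG-session capability: %lu\n", log_cap.local_name);
}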
RPC-object invocation Figure 23 shows the flow of execution when a client calls an
RPC object by invoking a capability.
1. The client invokes the given capability using an instance of an RPC client object,
which uses the IPC library to invoke the kernel’s IPC mechanism. The kernel
delivers the request to the IPC server that belongs to the invoked capability
and wakes up the corresponding entrypoint. On reception of the request, the
entrypoint obtains the local name of the invoked object identity.
2. The entrypoint uses the local name of the invoked object identity as a key into its
object pool to look up the matching RPC object. If the lookup fails, the entrypoint
replies with an error.
3. If the matching RPC object could be found, the entrypoint calls the RPC object’s
dispatch function. This function is implemented by the server-side stub code. It
converts the content of the receive buffer of the IPC server to a method call. I.e.,
it obtains the opcode of the RPC function from the receive buffer to decide which
method to call, and supplies the arguments according to the definition in the RPC
interface.
4. On the return of the RPC function, the RPC stub code populates the send buffer
of the IPC server with the function results and invokes the kernel’s reply operation via the IPC library. Thereby, the entrypoint becomes ready to serve the next
request.
5. When delivering the reply to the client, the kernel resumes the execution of the
client, which can pick up the results of the RPC call.
3.6.2 Asynchronous notifications
The synchronous RPC mechanism described in the previous section is not sufficient to
cover all forms of inter-component interactions. It shows its limitations in the following
situations.
Waiting for multiple conditions
In principle, the RPC mechanism can be used by an RPC client to block for a condition at a server. For example, a timer server could provide a blocking sleep
function that, when called by a client, blocks the client for a certain amount of
time. However, if the client wanted to respond to multiple conditions such as a
timeout, incoming user input, and network activity, it would need to spawn one
thread for each condition where each thread would block for a different condition. If one condition triggers, the respective thread would resume its execution
and respond to the condition. However, because all threads could potentially be
woken up independently from each other – as their execution depends only on
their respective condition – they need to synchronize access to shared state. Consequently, components that need to respond to multiple conditions would not
only waste threads but also suffer from synchronization overheads.
At the server side, the approach of blocking RPC calls is equally bad in the presence of multiple clients. For example, a timer service with the above outlined
blocking interface would need to spawn one thread per client.
Signaling events to untrusted parties
With merely synchronous RPC, a server cannot deliver sporadic events to its
clients. If the server wanted to inform one of its clients about such an event, it
would need to act as a client itself by performing an RPC call to its own client.
However, by performing an RPC call, the caller passes the control of execution
to the callee. In the case of a server that serves multiple clients, it would put the
availability of the server at the discretion of all its clients, which is unacceptable.
A similar situation is the interplay between a parent and a child where the parent
does not trust its child but still wishes to propagate sporadic events to the child.
The solution to those problems is the use of asynchronous notifications, also named
signals. Figure 24 shows the interplay between two components. The component labeled as signal handler responds to potentially many external conditions propagated
as signals. The component labeled as signal producer triggers a condition. Note that
both can be arbitrary components.
Figure 24: Interplay between signal producer and signal handler
Signal-context creation and delegation The upper part of Figure 24 depicts the
steps needed by a signal handler to become able to receive asynchronous notifications.
1. Initially, the signal-handler component creates a so-called signal receiver. Similar
to an entrypoint that can respond to RPC requests for a number of RPC objects, a
signal receiver is able to receive signals for an arbitrary number of sources. Within
the signal-handler component, each source is represented as a so-called signal context. A component that needs to respond to multiple conditions creates one signal
context for each condition. In the figure, a signal context “c” is created.
2. The signal-handler component associates the signal context with the signal receiver via the manage function. Analogous to the way how RPC objects are associated with entrypoints, the manage function returns a capability for the signal
context. Under the hood, the signal receiver uses core’s SIGNAL service to create
this kind of capability.
3. As for regular capabilities, a signal-context capability can be delegated to other
components. Thereby, the authority to trigger signals for the associated context is
delegated.
Triggering signals The lower part of Figure 24 illustrates the use of a signal-context
capability by the signal producer.
1. Now in possession of the signal-context capability, the signal producer creates a
so-called signal transmitter for the capability. The signal transmitter can be used to
trigger a signal by calling the submit function. This function returns immediately.
In contrast to a remote procedure call, the submission of a signal is a fire-and-forget operation.
2. At the time when the signal producer submitted the first signals, the signal handler is not yet ready to handle them. It is still busy with other things. However,
the number of submitted signals is recorded. Once the signal handler finally calls
the wait-for-signal function at the signal receiver, the call immediately returns
the information about the context, to which the signals refer, and the number of
signals that were submitted for this context.
3. After handling the first batch of signals, the signal handler component blocks its
execution by calling the signal receiver’s wait-for-signal function again. This time,
no signals are immediately pending. After a while, however, the signal producer
submits another signal, which eventually wakes up the signal handler with the
information about the associated context.
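The following self-contained sketch models this interplay: signal contexts that merely count submitted signals, a receiver that manages contexts, and a fire-and-forget transmitter. The classes are simplified models of the concepts described above, not the framework's actual signal API.

#include <map>
#include <cstdio>

struct Signal_context { unsigned pending = 0; };   /* merely a counter, no payload */

class Signal_receiver
{
    std::map<unsigned long, Signal_context *> _contexts;
    unsigned long _next_cap = 1;

public:
    /* associate a context, return a signal-context capability (modelled as a number) */
    unsigned long manage(Signal_context &ctx)
    {
        _contexts[_next_cap] = &ctx;
        return _next_cap++;
    }

    /* return a context with pending signals along with their number, if any;
       a real receiver would block here until a signal arrives */
    Signal_context *pending_signal(unsigned &count)
    {
        for (auto &entry : _contexts)
            if (entry.second->pending) {
                count = entry.second->pending;
                entry.second->pending = 0;
                return entry.second;
            }
        count = 0;
        return nullptr;
    }
};

/* fire-and-forget submission, used by the signal producer */
struct Signal_transmitter
{
    Signal_context &context;   /* stands in for the delegated signal-context capability */

    void submit() { context.pending++; }   /* returns immediately */
};

int main()
{
    Signal_receiver receiver;
    Signal_context  timeout_context;

    receiver.manage(timeout_context);

    Signal_transmitter producer { timeout_context };
    producer.submit();
    producer.submit();   /* handler still busy, the signals are only counted */

    unsigned count = 0;
    if (receiver.pending_signal(count))
        std::printf("woke up: %u signals for the timeout context\n", count);
}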
In contrast to remote procedure calls, signals carry no payload. If signals carried any
payload, this payload would need to be buffered somewhere. Regardless of where this
information is buffered, the buffer could overrun if signals are submitted at a higher
rate than handled. There might be two approaches to deal with this situation. The
first option would be to drop payload once the buffer overruns, which would make the
mechanism indeterministic, which is hardly desirable. The second option would be to
sacrifice the fire-and-forget semantics at the producer side, blocking the producer when
the buffer is full. However, this approach would put the liveliness of the producer at the
whim of the signal handler. Consequently, signals are void of payload. However, the
number of signals is recorded, which does not require a buffer but merely a counter.
Note that this counter could overflow in the presence of a producer that constantly
submits signals.
Figure 25: Establishing shared memory between client and server
3.6.3 Shared memory
By sharing memory between components, large amounts of information can be propagated across protection-domain boundaries without the active help of the kernel.
Sharing memory between components raises a number of questions. First, Section 3.3
explained that physical memory resources must be explicitly assigned to components
either by their respective parents or by the means of resource trading. This raises the
question of which component is bound to pay for the memory shared between multiple
components. Second, unlike traditional operating systems where different programs
can refer to globally visible files and thereby establish shared memory by mapping a
prior-agreed file into their respective virtual memory spaces, Genode does not have a
global name space. How do components refer to the to-be-shared piece of memory?
Figure 25 answers these questions with the sequence of establishing shared memory
between a server and its client. The diagram depicts a client, core, and a server. The notion of a client-server relationship is intrinsic for the shared-memory mechanism. When
establishing shared memory between components, the component’s roles as client and
server must be clearly defined.
1. The server interacts with core’s RAM service to allocate a new RAM dataspace.
Because the server uses its own RAM session for that allocation, the dataspace is
paid for by the server. At first glance, this seemingly contradicts the
principle that clients should have to pay for using services as discussed in Section
3.3.2. However, this is not the case. By establishing the client-server relationship,
the client has transferred a budget of RAM to the server via the session-quota
mechanism. So the client already paid for the memory. Still, it is the server’s
responsibility to limit the size of the allocation to the client’s session quota.
Because the server allocates the dataspace, it is the owner of the dataspace. Hence,
the lifetime of the dataspace is controlled by the server.
Core’s RAM service returns a dataspace capability as the result of the allocation.
2. The server makes the content of the dataspace visible in its virtual address space
by attaching the dataspace within its RM session. The server refers to the dataspace via the dataspace capability as returned from the prior allocation. When
attaching the dataspace to the server’s RM session, core’s RM service maps the
dataspace content at a suitable virtual-address range that is not occupied with existing mappings and returns the base address of the occupied range to the server.
Using this base address and the known dataspace size, the server can safely access the dataspace content by reading or writing its virtual memory.
3. The server delegates the authority to use the dataspace to the client. This delegation can happen in different ways, e. g., the client could request the dataspace
capability via an RPC function at the server. But the delegation could also involve
further components that transitively delegate the dataspace capability. Therefore,
the delegation operation is depicted as a dashed line.
4. Once the client has obtained the dataspace capability, it can use its own RM session to make the dataspace content visible in its address space. Note that even
though both client and server use core’s RM service, each component uses a different session. Analogous to the server, the client receives a client-local address
within its virtual address space as the result of the attach operation.
5. After the client has attached the dataspace within its RM session, both client and
server can access the shared memory using their respective virtual addresses.
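Expressed in code, the server-side and client-side steps roughly take the following shape. The sketch assumes the classic Genode::env() accessors for the RAM session (Section 3.4.2) and RM session (Section 3.4.5); the exact types and signatures are assumptions and may differ between framework versions.

#include <base/env.h>   /* Genode::env() */

/* server side: allocate the dataspace (financed by the client's session
   quota) and make its content visible in the server's address space */
static Genode::Dataspace_capability _create_shared_buffer(Genode::size_t size)
{
    Genode::Ram_dataspace_capability ds_cap =
        Genode::env()->ram_session()->alloc(size);

    char *server_local = Genode::env()->rm_session()->attach(ds_cap);
    (void)server_local;   /* the server accesses the buffer via this pointer */

    return ds_cap;        /* delegated to the client, e.g., as RPC return value */
}

/* client side: attach the delegated dataspace within the client's own RM session */
static char *_map_shared_buffer(Genode::Dataspace_capability ds_cap)
{
    return Genode::env()->rm_session()->attach(ds_cap);
}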
In contrast to the server, the client is not in control over the lifetime of the dataspace.
In principle, the server as the owner of the dataspace could free the dataspace at its
RAM session at any time and thereby revoke the corresponding memory mappings
in all components that attached the dataspace. The client has to trust the server with
respect to its liveliness, which is consistent with the discussion in Section 3.2.4. A well-behaving server should tie the lifetime of a shared-memory dataspace to the lifetime
of the client session. When the server frees the dataspace at its RAM session, core
implicitly detaches the dataspace from all RM sessions. Thereby the dataspace will
become inaccessible by the client.
Figure 26: Transferring bulk data by combining synchronous RPC with shared memory
3.6.4 Asynchronous state propagation
In many cases, the mere information that a signal occurred is insufficient to handle
the signal in a meaningful manner. For example, a component that registers a timeout
handler at a timer server will eventually receive a timeout. But in order to handle the
timeout properly, it needs to know the actual time. The time could not be delivered
along with the timeout because signals cannot carry any payload. But the timeout
handler may issue a subsequent RPC call to the timer server for requesting the time.
Another example of this combination of asynchronous notifications and remote procedure calls is the resource-balancing protocol described in Section 3.3.4.
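As a concrete sketch of the timer example, the signal handler performs a follow-up RPC once it has been woken up. The Timer_session interface and the elapsed_ms function are named here for illustration only; the actual timer-session interface may differ.

#include <cstdio>

/* illustrative timer-session RPC interface */
struct Timer_session
{
    virtual ~Timer_session() { }
    virtual unsigned long elapsed_ms() = 0;   /* synchronous RPC */
};

/* invoked after the timeout signal was received; the signal carries no payload */
void handle_timeout(Timer_session &timer)
{
    /* the missing state - the current time - is fetched via a follow-up RPC */
    unsigned long const now = timer.elapsed_ms();
    std::printf("timeout occurred, current time is %lu ms\n", now);
}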
3.6.5 Synchronous bulk transfer
The synchronous RPC mechanism described in Section 3.6.1 enables components to exchange information via a kernel operation. In contrast to shared memory, the kernel
plays an active role by copying information (and delegating capabilities) between the
communication partners. Most kernels impose a restriction onto the maximum message size, i. e. the RPC message buffer limits messages to a maximum of 1 KiB. In
principle, larger payloads could be transferred as a sequence of RPCs. But since each
RPC implies the costs of two context switches, this approach is not suitable for transferring large bulks of data. But by combining synchronous RPC with shared memory,
these costs can be mitigated.
Figure 26 shows the procedure of transferring large bulk data using shared memory as a communication buffer while using synchronous RPCs for arbitrating the use
of the buffer. The upper half of the figure depicts the setup phase that needs to be per-
formed only once. The lower half exemplifies an operation where the client transfers a
large amount of data to the server, which processes the data before transferring a large
amount of data back to the client.
1. At session-creation time, the server allocates the dataspace, which represents the
designated communication buffer. The steps resemble those described in Section
3.6.3. The server uses session quota provided by the client for the allocation. This
way, the client is able to aid the dimensioning of the dataspace by supplying an
appropriate amount of session quota to the server. Since the server performed the
allocation, the server is in control over the lifetime of the dataspace.
2. After the client established a session to the server, it initially queries the dataspace
capability from the server using a synchronous RPC and attaches the dataspace
to its own address space. After this step, both client and server can read and write
the shared communication buffer.
3. Initially the client plays the role of the user of the dataspace. The client writes the
bulk data into the dataspace. Naturally, the maximum amount of data is limited
by the dataspace size.
4. The client performs an RPC call to the server. Thereby, it hands over the role of the
dataspace user to the server. Note that this handover is not enforced. The client’s
PD retains the right to access the dataspace, i. e., by another thread running in the
same PD.
5. On reception of the RPC, the server becomes active. It reads and processes the
bulk data, and writes its results to the dataspace. The server must not assume
to be the exclusive user of the dataspace. A misbehaving client may change the
buffer content at any time. Therefore, the server must take appropriate precautions. In particular, if the data must be validated at the server side, the server must
copy the data from the shared dataspace to a private buffer before validating and
using it.
6. Once the server finished the processing of the data and wrote the results to the
dataspace, it replies to the RPC. Thereby, it hands over the role as the user of the
dataspace back to the client.
7. The client resumes its execution with the return of the RPC call, and can read the
result of the server-side operation from the dataspace.
The RPC call may be used for carrying control information. For example, the client may
provide the amount of data to process, or the server may provide the amount of data
produced.
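A session interface following this pattern might combine one RPC for handing out the dataspace with one RPC for the actual operation, where the argument and return value carry the control information mentioned above. The interface below is purely illustrative and not an existing Genode session interface.

#include <cstddef>

/* placeholder for the framework's dataspace-capability type */
struct Dataspace_capability { };

struct Bulk_session
{
    virtual ~Bulk_session() { }

    /* queried once by the client after session creation (setup phase) */
    virtual Dataspace_capability dataspace() = 0;

    /* synchronous RPC that hands the role of the buffer user to the server;
       'request_bytes' is the amount of data the client wrote into the shared
       buffer, the return value is the amount of data the server produced */
    virtual std::size_t process(std::size_t request_bytes) = 0;
};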
Figure 27: Life cycle of a data packet transmitted over the packet-stream interface
3.6.6 Asynchronous bulk transfer - packet streams
The packet-stream interface complements the facilities for the synchronous data transfer described in Sections 3.6.1 and 3.6.5 with a mechanism that carries payload over a
shared memory block and employs an asynchronous data-flow protocol. It is designed
for large bulk payloads such as network traffic, block-device data, video frames, and
USB URB payloads.
As illustrated in Figure 27, the communication buffer consists of three parts, a submit
queue, an acknowledgement queue, and a bulk buffer. The submit queue contains
packets generated by the source to be processed by the sink. The acknowledgement
queue contains packets that are processed and acknowledged by the sink. The bulk
buffer contains the actual payload. The assignment of packets to bulk-buffer regions is
performed by the source.
A packet is represented by a packet descriptor that refers to a portion of the bulk
buffer and contains additional control information. Such control information may include an opcode and further arguments interpreted at the sink to perform an operation
on the supplied packet data. Either the source or the sink is in charge of handling a
given packet at a given time. At the points 1, 2, and 5, the packet is owned by the source.
At the points 3 and 4, the packet is owned by the sink. Putting a packet descriptor in
the submit queue or acknowledgement queue represents a handover of responsibility.
The life cycle of a single packet looks as follows:
1. The source allocates a region of the bulk buffer for storing the packet payload
(packet alloc). It then requests the local pointer to the payload (packet content) and
fills the packet with data.
2. The source submits the packet to the submit queue (submit packet).
3. The sink requests a packet from the submit queue (get packet), determines the local
pointer to the payload (packet content), and processes the contained data.
4. After having finished the processing of the packet, the sink acknowledges the
packet (acknowledge packet), placing the packet into the acknowledgement queue.
5. The source reads the packet from the acknowledgement queue and releases the
packet (release packet). Thereby, the region of the bulk buffer that was used by the
packet becomes marked as free.
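The following self-contained model walks a single packet through the five steps using two queues and a bulk buffer. It omits the signal-based corner-case handling discussed next and uses a trivial bump allocator in place of a real bulk-buffer allocator; the function names follow the operations named above, not a concrete session interface.

#include <queue>
#include <cstddef>
#include <cstring>
#include <cstdio>

struct Packet_descriptor { std::size_t offset; std::size_t size; };

struct Packet_stream
{
    char bulk_buffer[4096];                       /* payload area */
    std::queue<Packet_descriptor> submit_queue;
    std::queue<Packet_descriptor> ack_queue;
    std::size_t alloc_offset = 0;                 /* trivial bump allocator */
};

/* source-side operations */
Packet_descriptor alloc_packet(Packet_stream &ps, std::size_t size)    /* step 1 */
{
    Packet_descriptor const p { ps.alloc_offset, size };
    ps.alloc_offset += size;
    return p;
}
void submit_packet(Packet_stream &ps, Packet_descriptor p) { ps.submit_queue.push(p); }   /* step 2 */
Packet_descriptor get_acked_packet(Packet_stream &ps)                  /* step 5 */
{
    Packet_descriptor const p = ps.ack_queue.front();
    ps.ack_queue.pop();
    return p;   /* releasing would mark the bulk-buffer region as free again */
}

/* sink-side operations */
Packet_descriptor get_packet(Packet_stream &ps)                        /* step 3 */
{
    Packet_descriptor const p = ps.submit_queue.front();
    ps.submit_queue.pop();
    return p;
}
void acknowledge_packet(Packet_stream &ps, Packet_descriptor p) { ps.ack_queue.push(p); } /* step 4 */

int main()
{
    Packet_stream ps;

    /* source: allocate, fill, and submit a packet */
    Packet_descriptor p = alloc_packet(ps, 5);
    std::memcpy(ps.bulk_buffer + p.offset, "hello", 5);
    submit_packet(ps, p);

    /* sink: obtain, process, and acknowledge the packet */
    Packet_descriptor q = get_packet(ps);
    std::printf("sink received %zu bytes\n", q.size);
    acknowledge_packet(ps, q);

    /* source: reclaim the acknowledged packet */
    get_acked_packet(ps);
}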
This protocol has four corner cases that are handled by signals:
Submit queue is full when the source is trying to submit a new packet. In this case,
the source blocks and waits for the sink to remove packets from the submit queue.
If the sink observes such a condition (when it attempts to get a packet from a full
submit queue), it delivers a ready-to-submit signal to wake up the source.
Submit queue is empty when the sink tries to obtain a packet from an empty submit
queue, it may block. If the source places a packet into an empty submit queue, it
delivers a packet-avail signal to wake up the sink.
Acknowledgement queue is full when the sink tries to acknowledge a packet at a saturated acknowledgement queue, the sink needs to wait until the source removes
an acknowledged packet from the acknowledgement queue. The source notifies
the sink about this condition by delivering a ready-to-ack signal. On reception
of the signal, the sink wakes up and can proceed to submit packets into the acknowledgement queue.
Acknowledgement queue is empty when the source tries to obtain an acknowledged
packet (get acked packet) from an empty acknowledgement queue. In this case,
the source may block until the sink places another acknowledged packet into the
empty acknowledgement queue and delivers an ack-avail signal.
If bidirectional data exchange between a client and a server is desired, there are two
approaches:
One stream of operations If data transfers in either direction are triggered by the
client only, a single packet stream where the client acts as the source and the server
represents the sink can accommodate transfers in both directions. For example,
the block session interface (Section 4.5.8) represents read and write requests as
packet descriptors. The allocation of the operation’s read or write buffer within
the bulk buffer is performed by the client, being the source of the stream of operations. For write operations, the client populates the write buffer with the to-be-written information before submitting the packet. When the server processes the
incoming packets, it distinguishes the read and write operations using the control
information given in the packet descriptor. For a write operation, it processes the
information contained in the packet. For a read operation, it populates the packet
with new information before acknowledging the packet.
Two streams of data If data transfers in both directions can be triggered independently from client and server, two packet streams can be used. For example, the
NIC session interface (Section 4.5.11) uses one packet stream for ingoing and one
packet stream for outgoing network traffic. For outgoing traffic, the client plays
the role of the source. For incoming traffic, the server (such as a NIC driver) is the
source.
4 Components
The architecture introduced in Chapter 3 clears the way to compose sophisticated systems out of many building blocks. Each building block is represented by an individual
component that resides in a dedicated protection domain and interacts with other components in a well-defined manner. Those components do not merely represent applications but all typical operating-system functionalities.
Components can come in a large variety of shape and form. Compared to a monolithic operating-system kernel, a component-based operating system challenges the system designer by enlarging the design space with the decision of the functional scope
of each component and thereby the granularity of componentization. This decision
depends on several factors:
Security The smaller a component, the lower the risk for bugs and vulnerabilities.
The more rigid a component’s interfaces, the smaller its attack surface becomes.
Hence, the security of a complex system function can potentially be vastly improved by splitting it into a low-complexity component that encapsulates the
security-critical part and a high-complexity component that is uncritical for security.
Performance The split of functionality into multiple components introduces inter-component communication and thereby context-switch overhead. If a functionality is known to be critical for performance, such a split should be clearly motivated by a benefit for security.
Reusability Componentization can be pursued for improved reusability while sometimes disregarding performance considerations. However, reusability can also be
achieved by moving functionality into libraries that can be easily reused by linking them directly against library-using components. By using a dynamic linker,
the linking can even happen at run time, which yields the same flexibility as the
use of multiple distinct components. Therefore, the split of functionality into multiple components for the sole sake of modularization is to be questioned.
The Sections 4.1, 4.2, 4.3, and 4.4 aid the navigation within the componentization design
space by discussing the different roles a component can play within a Genode system.
Those can be the role of a device driver, protocol stack, resource multiplexer, runtime environment, and that of an application. By distinguishing those roles, it becomes
possible to assess the possible security implications of each individual component.
The versatility of a component-based system does not come from the existence of
many components alone. Even more important is the composability of components.
Components can be combined only if their interfaces match. To maximize composability, the number of interfaces throughout the system should be as low as possible, and
all interfaces should be largely orthogonal to each other. Section 4.5 reviews Genode’s
common session interfaces.
Components can be used in different ways depending on their configuration and
their position within the component tree. Section 4.6 explains how a component obtains
and processes its configuration. Section 4.7 discusses the most prominent options of
composing components.
Figure 28: A network device driver provides a NIC service to a single client and uses core’s
IO-MEM and IRQ services to interact with the physical network adaptor.
4.1 Device drivers
A device driver translates a device interface to a Genode session interface. Figure 28
illustrates the typical role of a device driver.
The device interface is defined by the device vendor and typically comprises the driving of state machines of the device, the notification of device-related events via interrupts, and a means to transfer data from and to the device. A device-driver component
accesses the device interface via sessions to the core services IO_MEM, IO_PORT, and
IRQ as described in Section 3.4.8.
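In code, a driver's use of these core services can be pictured as follows. The two session types in this sketch are illustrative placeholders for the IO_MEM and IRQ session interfaces, not the literal API.

/* schematic view of a driver's use of the core services */

struct Io_mem_session   /* stands in for an IO_MEM session */
{
    virtual ~Io_mem_session() { }
    virtual void *local_mmio_base() = 0;   /* device registers mapped locally */
};

struct Irq_session      /* stands in for an IRQ session */
{
    virtual ~Irq_session() { }
    virtual void wait_for_irq() = 0;       /* block until the device raises an interrupt */
};

void drive_device(Io_mem_session &io_mem, Irq_session &irq)
{
    volatile unsigned *regs =
        static_cast<volatile unsigned *>(io_mem.local_mmio_base());

    regs[0] = 1;   /* e.g., enable the device via a control register */

    for (;;) {
        irq.wait_for_irq();
        unsigned const status = regs[1];   /* read the device-specific status */
        (void)status;                      /* ...and handle the event */
    }
}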
In general, a physical device cannot safely be driven by multiple users at the same
time. If multiple users accessed one device concurrently, the device state would eventually become inconsistent. A device driver should not attempt to multiplex a device.
Instead, to keep its complexity low, it should act as a server that serves only a single
client per physical device. Whereas a device driver for a simple device usually accepts
only one client, a device driver for a complex device with multiple sub devices (such as
a USB driver) may hand out each sub device to a different client.
A device driver should be largely void of built-in policy. If it merely translates the
interface of a single device to a session interface, there is not much room for policy
anyway. If, however, a device driver hands out multiple sub devices to different clients,
the assignment of sub devices to clients must be subjected to a policy. In this case, the
device driver should obtain this policy information from its configuration as provided
by the driver’s parent.
4.1.1 Platform driver
There are three problems that are fundamentally important for running an operating
system on modern hardware but that lie outside the scope of an ordinary device driver
because they affect the platform as a whole rather than a single device. Those problems are the enumeration of devices, the discovery of interrupt routing, and the initial
setup of the platform.
Problem 1: Device enumeration Modern hardware platforms are rather complex
and vary a lot. For example, the devices attached to the PCI bus of a PC are usually
not known at the build time of the system but need to be discovered at run time. Technically, each individual device driver could probe its respective device at the PCI bus.
But in the presence of multiple drivers, this approach would hardly work. First, the
configuration interface of the PCI bus is a device itself. The concurrent access to the
PCI configuration interface by multiple drivers would ultimately yield undefined behaviour. Second, for being able to speak directly to the PCI configuration interface, each
driver would need to carry with it the functionality to interact with PCI.
Problem 2: Interrupt routing On PC platforms with multiple processors, the use
of legacy interrupts as provided by the Intel 8259 programmable interrupt controller
(PIC) is not suitable because there is no way to express the assignment of interrupts
to CPUs. To overcome the limitations of the PIC, Intel introduced the Advanced Programmable Interrupt Controller (APIC). The APIC, however, comes with a different
name space for interrupt numbers, which creates an inconsistency between the numbers provided by the PCI configuration (interrupt lines) and interrupt numbers as understood by the APIC. The assignment of legacy interrupts to APIC interrupts is provided by tables of the Advanced Configuration and Power Interface (ACPI). Consequently, in order to support multi-processor PC platforms, the operating system needs
to interpret those tables. Within a component-based system, we need to answer the
question of which component is responsible to interpret the ACPI tables and how this
information is applied to the individual device drivers.
Problem 3: Initial hardware setup In embedded systems, the interaction of the SoC
(system on chip) with its surrounding peripheral hardware is often not fixed in hardware but rather a configuration issue. For example, the power supply and clocks of
certain peripherals may be enabled by speaking an I2C protocol with a separate powermanagement chip. Also, the direction and polarity of the general-purpose I/O pins
depends largely on the way how the SoC is used. Naturally, such hardware setup steps
could be performed by the kernel. But this would require the kernel to become aware
of potentially complex platform intrinsics.
Central platform driver The natural solution to these problems is the introduction
of a so-called platform driver, which encapsulates the peculiarities outlined above. On
PC platforms, the role of the platform driver is played by the ACPI driver. The ACPI
driver provides an interface to the PCI bus in the form of a PCI service. Device drivers
obtain the information about PCI devices by creating a PCI session at the ACPI driver.
Furthermore, the ACPI driver provides an IRQ service that transparently applies the
interrupt routing based on the information provided by the ACPI tables. Furthermore,
the ACPI driver provides the means to allocate DMA buffers, which is further explained
in Section 4.1.3.
On ARM platforms, the corresponding component is named platform driver and
provides a so-called platform service. Because of the large variety of ARM-based SoCs,
the session interface for this service differs from platform to platform.
4.1.2 Interrupt handling
Most device drivers need to respond to sporadic events produced by the device and
propagated to the CPU as interrupts. In Genode, a device-driver component obtains
device interrupts via core’s IRQ service introduced in Section 3.4.8. On PC platforms,
device drivers usually do not use core’s IRQ service directly but rather use the IRQ
service provided by the platform driver (Section 4.1.1).
4.1.3 Direct memory access (DMA) transactions
Devices that need to transfer large amounts of data usually support a means to issue
data transfers from and to the system’s physical memory without the active participation of the CPU. Such transfers are called direct memory access (DMA) transactions. DMA
transactions relieve the CPU from actively copying data between device registers and
memory, optimize the throughput of the system bus by the effective use of burst transfers, and may even be used to establish direct data paths between devices. However,
the benefits of DMA come at the risk of corrupting the physical memory by misguided
DMA transactions. Because those DMA-capable devices can issue bus requests targeting the physical memory directly without involving the CPU at all, such requests
are naturally not subjected to the virtual-memory mechanism implemented in the CPU
in the form of a memory-management unit (MMU). Figure 29 illustrates the problem.
From the device’s point of view, there is just physical memory. Hence, if a driver sets
up a DMA transaction, e. g., if a disk driver reads a block from the disk, the driver
programs the memory-mapped registers of the device with the address and size of a
physical-memory buffer where it expects to receive the data. If the driver lives in a
user-level component, as is the case for a Genode-based system, it still needs to know
the physical address of the DMA buffer to program the device correctly. Unfortunately,
there is nothing to prevent the driver from specifying any physical address to the device. A malicious driver could misuse the device to read and manipulate all parts of
the physical memory, including the kernel. Consequently, device drivers and devices
should ideally be trustworthy. However, there are several scenarios where this is ultimately not the case.
Scenario 1: Direct device assignment to virtual machines When hosting virtual
machines as Genode components, the direct assignment of a physical device such as
a USB controller, a GPU, or a dedicated network card to the guest OS running in the
virtual machine can be useful in two ways. First, if the guest OS is the sole user of
the device, the direct assignment of the device maximizes the I/O performance of the guest OS using the device.

Figure 29: The MMU restricts the access of physical memory pages by different components according to their virtual address spaces. However, direct memory accesses issued by the disk controller are not subjected to the MMU. The disk controller can access the entirety of the physical memory present in the system.

Second, the guest OS may be equipped with a proprietary
device driver that is not present as a Genode component otherwise. In this case, the
guest OS may be used as a runtime executing the device driver and providing a driver
interface to the Genode world. In both cases the guest OS should not be considered as
trustworthy. On the contrary, it bears the risk of subverting the isolation between components.
A misbehaving guest OS could issue DMA requests referring to the physical memory
used by other components and even the kernel and thereby break out of its virtual
machine.
Scenario 2: Firmware-driven attacks Modern peripherals such as wireless LAN
adaptors, network cards, or GPUs employ firmware executed on the peripheral device. This firmware is executed on a microcontroller on the device, and is thereby not
subjected to the policy of the normal operating system. Such firmware may either be
built-in by the device vendor, or is loaded by the device driver at initialization time of
the device. In both cases, the firmware tends to be a black box that remains obscure
except for the device vendor. Hidden functions or vulnerabilities might be present in it.
By the means of DMA transactions, such firmware has unlimited access to the system.
For example, a back door implemented in the firmware of a network adaptor could
look for special network packets to activate and control arbitrary spyware functions.
Because malware embedded in the firmware of the device can neither be detected nor
controlled by the operating system, both monolithic and microkernel-based operating
systems are powerless against such attacks.
Scenario 3: Bus-level attacks The previous examples misused a DMA-capable device as a proxy to drive an attack. However, the system bus can be attacked directly
with no hardware tinkering needed. There are ready-to-exploit interfaces that are featured on most PC systems. For example, most laptops come with PCMCIA / ExpressCard slots, which allow expansion cards to access the system bus.

Figure 30: An IOMMU arbitrates and virtualizes DMA accesses issued by a device to the RAM. Only if a valid IOMMU mapping exists for a given DMA access, the memory access is performed.

Furthermore, serial
bus interfaces such as IEEE 1394 (Firewire) enable connected devices to indirectly access
the system bus via the peripheral bus controller. If the bus controller allows the device
to issue direct system bus requests by default, a connected device becomes able to gain
control over the whole system. This vulnerability has been present on a wide range of
commodity computers.
DMA transactions in component-based systems Direct memory access (DMA) of
devices looks like the Achilles heel of component-based operating systems. The most
compelling argument in favour of componentization is that by encapsulating each system component within a dedicated user-level address space, the system as a whole
becomes more robust and secure compared to a monolithic operating-system kernel.
In the event that one component fails due to a bug or an attack, other components remain unaffected. The prime example for such buggy components are, however, device
drivers. By empirical evidence, those remain the most prominent trouble makers in today’s operating systems, which suggests that the DMA loophole renders the approach
of component-based systems largely ineffective. However, there are three counter arguments to this observation.
First, by encapsulating each driver in a dedicated address space, classes of bugs
that are unrelated to DMA remain confined in the driver component. In practice
most driver-related problems stem from issues like memory leaks, synchronization
problems, deadlocks, flawed driver logic, wrong state machines, or incorrect device-initialization sequences. For those classes of problems, the benefit of isolating the
driver in a dedicated component still applies.
Second, executing a driver largely isolated from other operating-system code minimizes the attack surface onto the driver. If the driver interface is rigidly small and
well-defined, it is hard to compromise the driver by exploiting its interface.
Third, modern PC hardware has closed the DMA loophole by incorporating so-called
IOMMUs into the system. As depicted in Figure 30, the IOMMU sits between the physical memory and the system bus to which the devices are attached. So each DMA
request has to pass the IOMMU, which is not only able to arbitrate the access of DMA
requests to the RAM but is also able to virtualize the address space per device. Similar
to how an MMU confines each process running on the CPU within a distinct virtual
address space, the IOMMU is able to confine each device within a dedicated virtual
address space. To tell the different devices apart, the IOMMU uses the PCI device’s
bus-device-function triplet as unique identification.
With an IOMMU in place, the operating system can effectively limit the scope of
actions the given device can execute on the system. I.e., by restricting all accesses originating from a particular PCI device to the DMA buffers used for the communication,
the operating system becomes able to detect and prevent any unintended bus accesses
initiated by the device.
When executed on the NOVA kernel, Genode subjects all DMA transactions to the
IOMMU, if present. Section 8.6.6 discusses the use of IOMMUs in more depth.
Figure 31: Example of a protocol stack. The terminal provides the translation between the
terminal-session interface (on the right) and the driver interfaces (on the left).
4.2 Protocol stacks
A protocol stack translates one session interface to another (or the same) session interface. For example, a terminal component may provide a command-line application
with a service for obtaining textual user input and for printing text. To implement this
service, the terminal uses an input session and a framebuffer session. Figure 31 depicts
the relationship between the terminal, its client application, and the used drivers. For
realizing the output of a stream of characters on screen, it implements a parser for escape sequences, maintains a state machine for the virtual terminal, and renders the pixel
representation of characters onto the framebuffer. For the provisioning of textual user
input, it responds to key presses reported by the input session, maintains the state of
modifier keys, and applies a keyboard layout to the stream of incoming events. When
viewed from the outside of the component, the terminal translates a terminal session to
a framebuffer session and an input session.
Similar to a device driver, a protocol stack typically serves a single client. In contrast
to device drivers, however, protocol stacks are not bound to physical devices. Therefore, a protocol stack can be instantiated any number of times. For example, if multiple
terminals are needed, one terminal component could be instantiated per terminal. Because each terminal has an independent instance of the protocol stack, a bug in the
protocol stack of one terminal does not affect any other terminal. However complex the
implementation of the protocol stack may be, it is not prone to leaking information
to another terminal because it is connected to a single client only. The leakage of information is constrained to interfaces used by the individual instance. Hence, in cases
like this, the protocol-stack component is suitable for hosting highly complex untrusted
code if such code cannot be avoided.
Note that the example above cannot be generalized for all protocol stacks. There are
protocol stacks that are critical for the confidentiality of information. For example, an
in-band encryption component may translate plain-text network traffic to encrypted
network traffic designated to be transported over a public network. Even though the
component is a protocol stack, it may still be prone to leaking unencrypted information
to the public network.
Whereas protocol stacks are not necessarily critical for integrity and confidentiality,
they are almost universally critical for availability.
Figure 32: A GUI server multiplexes the physical framebuffer and input devices among multiple applications.
4.3 Resource multiplexers
A resource multiplexer transforms one resource into a number of virtual resources. A
resource is typically a session to a device driver. For example, a NIC-switch component may use one NIC session to a NIC driver as uplink and, in turn, provide a NIC
service where each session represents a virtual NIC. Another example is a GUI server
as depicted in Figure 32, which enables multiple applications to share the same physical framebuffer and input devices by presenting each client in a window or a virtual
console.
In contrast to a typical device driver or protocol stack that serves only a single client, a
resource multiplexer is shared by potentially many clients. In the presence of untrusted
clients besides security-critical clients, a resource multiplexer ultimately becomes a so-called multi-level component. This term denotes that the component is cross-cutting the
security levels of all its clients. This has the following ramifications.
Covert channels Because the component is a shared resource that is accessed by
clients of different security levels, it must maintain the strict isolation between
its clients unless explicitly configured otherwise. Hence, the component’s client
interface as well as the internal structure must be designed to prevent the leakage
of information across clients. I.e., two clients must never share the same namespace of server-side objects if such a namespace can be modified by the clients.
For example, a window server that hands out global window IDs to its clients
is prone to unintended information leakage because one client could observe the
allocation of window IDs by another client. The ID allocation could be misused
as a covert channel that circumvents security policies. In the same line, a resource
multiplexer is prone to timing channels if the operations provided via its client
interface depends on the behavior of other clients. For this reason, blocking RPC
calls should be avoided because the duration of a blocking operation may reveal
information about the internal state such as the presence of other clients of the
resource multiplexer.
Complexity is dangerous As a resource multiplexer is shared by clients of different
security levels, the same considerations apply as for the OS kernel: High complexity poses a high risk for bugs. Such bugs may, in turn, result in the unintended
flow of information between clients or spoil the quality of service for all clients.
Hence, resource multiplexers must be kept as low in complexity as possible.
Denial of service The exposure of a resource multiplexer to untrusted and even malicious clients makes it a potential target for denial-of-service attacks. Some operations provided by the resource multiplexer may require the allocation of memory.
For example, a GUI server may need memory for the bookkeeping of each window created by its clients. If the resource multiplexer performed such allocations
from its own memory budget, a malicious client could trigger the exhaustion of
server-side memory by creating new windows in an infinite loop. To mitigate this
category of problems, a resource multiplexer should perform memory allocations
exclusively from client-provided resources, i. e., using the session quota as provided by each client at the session-creation time. Section 3.3 describes Genode’s
resource-trading mechanism in detail. In particular, resource multiplexers should
employ heap partitioning as explained in Section 3.3.3.
Avoiding built-in policies A resource multiplexer can be understood as a microkernel for a higher-level resource. Whereas a microkernel multiplexes or arbitrates
the CPU and memory between multiple components, a resource multiplexer does
the same for sessions. Hence, the principles for constructing microkernels equally
apply for resource multiplexers. In line with those principles, a resource multiplexer should ideally implement solely mechanisms but should be void of built-in
policy.
Enforcement of policy Instead of providing a built-in policy, a resource multiplexer
obtains policy information from its configuration as supplied by its parent. The
resource multiplexer must enforce the given policy. Otherwise, the security policy
expressed in the configuration remains ineffective.
Figure 33: A runtime environment manages multiple child components.
4.4 Runtime environments and applications
The component types discussed in the previous sections have in common that they deliberately lack built-in policy but act according to a policy supplied by their respective
parents by the means of configuration. This raises the question where those policies
should come from. The answer comes in the form of runtime environments and applications.
A runtime environment as depicted in Figure 33 is a component that hosts child components. As explained in Sections 3.2 and 3.3, it is thereby able to exercise control
over its children but is also responsible for managing the children's resources. A runtime
environment controls its children in three ways:
Session routing It is up to the runtime environment to decide how to route session
requests originating from a child. The routing of sessions is discussed in Section
3.2.3.
Configuration Each child obtains its configuration from its parent in the form of a
ROM session as described in Section 4.6. Using this mechanism, the runtime environment is able to feed policy information to its children. Of course, in order to
make the policy be effective, the respective child has to interpret and enforce the
configuration accordingly.
Lifetime The lifetime of a child ultimately depends on its parent. Hence, a runtime
environment can destroy and possibly restart child components at any time.
With regard to the management of child resources, a runtime environment can employ
a large variety of policies using two principal approaches:
Quota management Using the resource-trading mechanisms introduced in Section
3.3, the runtime environment can assign resources to each child individually.
Moreover, if a child supports the dynamic rebalancing protocol described in Section 3.3.4, the runtime environment may even change those assignments over the
lifetime of its children.
Interposing services Because the runtime environment controls the session routing
of each child, it is principally able to interpose the child’s use of any service including those normally provided by core such as RAM (Section 3.4.2), RM (Section 3.4.5), and CPU (Section 3.4.6). The runtime environment may provide a
locally implemented version of those session interfaces instead of routing session
requests directly towards the core component. Internally, each session of such a
local service may create a session to the real core service, thereby effectively wrapping core’s sessions. This way, the runtime environment can not only observe the
interaction of its child with core services but also implement custom resourcemanagement strategies, for example, sharing one single budget among multiple
children.
Canonical examples of runtime environments are the init component that applies a
policy according to its configuration, the noux runtime that presents itself as a Unix
kernel to its children, a debugger that interposes all core services for the debugging
target, or a virtual machine monitor.
A typical application is a leaf node in the component tree that merely uses services. In
practice, however, the boundary between applications and runtime environments can
be blurry. As illustrated in Section 4.7, Genode fosters the internal split of applications
into several components, thereby forming multi-component applications. From the outside, such a multi-component application appears as leaf node of the component tree
but internally, it employs a further level of componentization by executing portions of
its functionality in separate child components. The primary incentive behind this approach is the sandboxing of untrusted application functionality. For example, a video
player may execute the video codec within a separate child component so that a bug in
the complex video codec will not compromise the entire video-player application.
4.5 Common session interfaces
The core services described in Section 3.4 principally enable the creation of a recursively structured system. However, their scope is limited to the few low-level resources
provided by core, namely processing time, memory, and low-level device resources.
Device drivers (Section 4.1) and protocol stacks (Section 4.2) transform those low-level
resources into higher-level resources. Analogous to how core's low-level resources are
represented by the session interfaces of core’s services, higher-level resources are represented by the session interfaces provided by device drivers and protocol stacks. In
principle, each device driver could introduce a custom session interface representing
the particular device. But as discussed in the introduction of Chapter 4, a low number
of orthogonal session interfaces is desirable to maximize the composability of components. This section introduces the common session interfaces that are used throughout
Genode.
4.5.1 Read-only memory (ROM)
The ROM session interface makes a piece of data in the form of a dataspace available
to the client.
Session creation At session-creation time, the client specifies the name of a ROM
module as session argument. One server may hand out different ROM modules depending on the specified name. Once a ROM session has been created, the client can
request the capability of the dataspace that contains the ROM module. Using this capability and the client’s RM session, the client can attach the ROM module to its local
address space and thereby access the information. The client is expected to merely read
the data, hence the name of the interface.
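As a sketch, a client might obtain and map a ROM module like this. The code assumes a Rom_connection convenience wrapper for creating the session and the classic env()->rm_session() accessor; both names and signatures are assumptions about the API rather than literal reproductions.

#include <rom_session/connection.h>   /* Genode::Rom_connection */
#include <base/env.h>

/* obtain the ROM module "config" and map its content into the local address space */
static char const *_config_content()
{
    static Genode::Rom_connection rom("config");

    /* attach the ROM dataspace via the component's RM session */
    return Genode::env()->rm_session()->attach(rom.dataspace());
}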
ROM module updates Contrary to the intuitive assumption that read-only data
is constant, ROM modules may mutate during the lifetime of the session. The server
may update the content of the ROM module with new versions. However, the server
does not do so without the consent of the client. The protocol between client and server
consists of the following steps.
1. The client registers a signal handler at the server to indicate that it is interested in
receiving updates of the ROM module.
2. If the server has a new version of the ROM module, it does not immediately
change the dataspace shared with the client. It rather maintains the new version separately and informs the client by submitting a signal to the client’s signal
handler.
3. The client continues working with the original version of the dataspace. Once
it receives the signal from the server, it may decide to update the dataspace by
calling the update function at the server.
4. The server responds to the update request. If the new version fits into the existing dataspace, the server copies the content of the new version into the existing
dataspace and returns this condition in the reply of the update call. Thereby, the
ROM session interface employs synchronous bulk transfers as described in Section 3.6.5.
5. The client evaluates the result of the update call. If the new version fitted into
the existing dataspace, the update is complete at this point. However, if the new
version is larger than the existing dataspace, the client requests a new dataspace
from the server.
6. On reception of the dataspace request, the server destroys the original dataspace
(thereby making it invisible at the client), and returns the new version of the ROM
module as a different dataspace.
7. The client attaches the new dataspace capability to its local address space to access
the new version.
The protocol is designed such that neither the client nor the server needs to support
updates. A server with no support for updating ROM modules, such as core’s ROM
service, simply ignores the registration of a signal handler by a client. A client that is
not able to cope with ROM-module updates never requests the dataspace twice.
However, if both client and server support the update protocol, the ROM session
interface provides a means to propagate large state changes from the server to the client
in a transactional way. In the common case where the new version of a ROM module
fits into the same dataspace as the old version, the update does not require any memory
mappings to be changed.
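The client-side part of this protocol fits in a few lines of C++. The following is a simplified sketch only: Rom_session, Region_map, and the member and function names used here are hypothetical stand-ins for the corresponding Genode interfaces, whose exact names and signatures may differ.

/*
 * Client-side sketch of the ROM-update protocol (steps 3 to 7 above).
 * 'Rom_session' and 'Region_map' are hypothetical stand-ins.
 */
struct Rom_consumer
{
    Rom_session &rom;   /* session at the ROM server          */
    Region_map  &rm;    /* manages the client's address space */
    void        *local; /* ROM content attached locally       */

    Rom_consumer(Rom_session &rom, Region_map &rm)
    : rom(rom), rm(rm), local(rm.attach(rom.dataspace())) { }

    /* called in response to the server's update signal */
    void handle_update()
    {
        /* ask the server to copy the new version into the existing dataspace */
        if (rom.update())
            return; /* the new version fitted, nothing else to do */

        /* the new version is larger: obtain and attach a fresh dataspace */
        rm.detach(local);
        local = rm.attach(rom.dataspace());
    }
};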
Use cases The ROM session interface is used wherever data shall be accessed in a
memory-mapped fashion.
• Boot time data comes in the form of the ROM sessions provided by core’s ROM
service. On some kernels, core exports kernel-specific information such as the
kernel version in the form of special ROM modules.
• If an executable binary is provided as a ROM module, the binary’s text segment
can be attached directly to the address space of a new process (Section 3.5). So
multiple instances of the same component effectively share the same text segment.
The same holds true for shared libraries. For this reason, executable binaries and
shared libraries are requested in the form of ROM sessions.
• Components obtain their configuration by requesting a ROM session for the ROM
module “config” at the parent (Section 4.6). This way, configuration information
can be propagated using a simple interface with no need for a file system. Furthermore, the update mechanism allows the parent to dynamically change the
configuration of a component during its lifetime.
• As described in Section 4.7.5, multi-component applications may obtain data
models in the form of ROM sessions. In such scenarios, the ROM session’s update mechanism is used to propagate model updates in a transactional way.
4.5.2 Report
The report session interface allows a client to report internal state to the outside using
synchronous bulk transfers (Section 3.6.5).
Session creation At session-creation time, the client specifies a label and a buffer
size. The label aids the routing of the session request but may also be used to select a
policy at the report server. The buffer size determines the size of the dataspace shared
between the report server and client.
Use cases
• Components may use report sessions to export their internal state for monitoring
purposes or for propagating exceptional events.
• Device drivers may report information about detected devices or other resources. For example, a bus driver may report a list of devices attached to the
bus, or a wireless driver may report the list of available networks.
• In multi-component applications, components that provide data models to other
components may use the report-session interface to propagate model updates.
4.5.3 Terminal and UART
The terminal session interface provides a bi-directional communication channel between client and server using synchronous bulk transfers (Section 3.6.5). It is primarily
meant for textual user interfaces but may also be used to transfer other serial streams of
data.
The interface uses the two RPC functions read and write to arbitrate the access to a
shared-memory communication buffer between client and server as described in Section 3.6.5. The read function never blocks. When called, it copies new input into
the communication buffer and returns the number of new characters. If there is no new
input, it returns 0. To avoid the need to poll for new input at the client side, the client
can register a signal handler that gets notified on the arrival of new input. The write
function takes a number of to-be-written characters as argument. The server responds
to this function by processing the specified number of characters from the communication buffer.
Besides the actual read and write operations, the terminal supports the querying of
the amount of newly available input (without reading it) and the terminal size in rows
and columns.
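As an illustration, a terminal client may combine the non-blocking read function with the input-available signal handler as sketched below. Terminal_session, Signal_receiver, Signal_context, and the function names are simplified stand-ins for the corresponding Genode interfaces; the real signatures may differ.

/* sketch of an echo loop on top of the terminal session interface */
void terminal_echo_loop(Terminal_session &terminal,
                        Signal_receiver &sig_rec, Signal_context &input_ctx)
{
    /* register interest in newly arriving input */
    terminal.read_avail_sigh(sig_rec.manage(&input_ctx));

    char buf[256];
    for (;;) {
        sig_rec.wait_for_signal();  /* block until input arrives */

        /* 'read' never blocks and returns 0 if no input is pending */
        while (unsigned long const n = terminal.read(buf, sizeof(buf)))
            terminal.write(buf, n); /* echo the characters back */
    }
}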
Session creation At session-creation time, the terminal session may not be ready to
use. For example, a TCP terminal session needs an established TCP connection first.
In such a situation, the use of the terminal session by a particular client must be deferred until the session becomes ready. Delaying the session creation at the server side
is not an option because this would render the server’s entry point unavailable
for all other clients until the TCP connection is ready. Instead, the server delivers a
connected signal to the client. This signal is emitted when the session becomes ready
to use. The client waits for this signal right after creating the session.
Use cases
• Device drivers that provide streams of characters in either direction
• Graphical terminal
• Transfer of streams of data over TCP (using TCP terminal)
• Writing streams of data to a file (using file terminal)
• User input and output of traditional command-line based software, i. e., programs
executed in the noux runtime environment
• Multiplexing of multiple textual user interfaces (using the terminal-mux component)
• Headless operation and management of subsystems (using CLI monitor)
UART The UART session interface complements the terminal session interface with
additional control functions, e. g., for setting the baud rate. Because UART sessions
are compatible with terminal sessions, a UART device driver can be used as both UART
server and terminal server.
4.5.4 Input
The input session interface is used to communicate low-level user-input events from
the server to the client using synchronous bulk transfers (Section 3.6.5). Such an event
can be one of the following types (a code sketch of client-side event dispatching follows the list):
press or release of a button or key. Each physical button (such as a mouse button) or
key (such as a key on a keyboard) is represented by a unique value. At the input-session level, key events are reported as raw hardware events. They are reported
without a keyboard layout applied and without any interpretation of meta keys
(like shift, alt, and control). This gives the client the flexibility to handle arbitrary
combinations of keys.
relative motion of pointer devices such as a mouse. Such events are generated by
device drivers.
absolute motion of pointer devices such as a touch screen or graphics tablet. Furthermore, absolute motion events are generated by virtual input devices such as the
input session provided by a GUI server.
wheel motion of scroll wheels in vertical and horizontal directions.
focus of the session. Focus events are artificially generated by servers to indicate a
gained or lost keyboard focus of the client. The client may respond to such an
event by changing its graphical representation accordingly.
leave of the pointer position. Leave events are artificially generated by servers to indicate a lost pointer focus.
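The sketch below illustrates how a client might dispatch these event types. It assumes a simplified Input_session stand-in that shares an array of event records with the client via a dataspace and reports the number of pending events through a flush RPC function; handle_press, handle_release, and handle_motion are hypothetical client-side callbacks, and the real Genode event representation differs in detail.

/* sketch of dispatching pending input events from the shared event buffer */
void dispatch_pending_events(Input_session &input, Event const *events)
{
    unsigned const num = input.flush(); /* number of events copied to the buffer */

    for (unsigned i = 0; i < num; i++) {
        Event const &ev = events[i];
        switch (ev.type) {
        case Event::PRESS:   handle_press(ev.code);       break;
        case Event::RELEASE: handle_release(ev.code);     break;
        case Event::MOTION:  handle_motion(ev.rx, ev.ry); break;
        default:             break; /* wheel, focus, leave, ... */
        }
    }
}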
Use cases
• Drivers for user-input devices play the role of input servers
• Providing user input from a GUI server to its clients, e. g., the nitpicker GUI
server provides an input session as part of its interface.
• Merging multiple streams of user input into one stream (using an input merger)
• Virtual input devices can be realized as input servers that generate artificial input
events.
4.5.5 Framebuffer
The framebuffer session interface allows a client to supply pixel data to a framebuffer
server such as a framebuffer driver or a virtual framebuffer provided by a GUI server.
The client obtains access to the framebuffer as a dataspace, which is shared between
client and server. The client may update the pixels within the dataspace at any time.
Once a part of the framebuffer has been updated, the client informs the server by calling a refresh RPC function. Thereby, the framebuffer session interface employs a synchronous bulk transfer mechanism (Section 3.6.5).
Session creation In general, the screen mode is defined by the framebuffer server,
not the client. The mode may be constrained by the physical capabilities of the hardware or depend on the driver configuration. Some framebuffer servers, however, may
take a suggestion by the client into account. At session-creation time, the client may
specify a preferred mode as session argument. Once the session is constructed, however, the client must request the actually used mode via the mode RPC function.
Screen-mode changes The session interface supports dynamic screen-mode changes
during the lifetime of the session using the following protocol (a client-side code sketch follows the enumeration):
1. The client may register a signal handler using the mode_sigh RPC function. This
handler gets notified in the event of server-side mode changes.
2. Similarly to the transactional protocol used for updating ROM modules (Section
4.5.1), the dataspace shared between client and server stays intact until the client
acknowledges the mode change by calling the mode RPC function.
3. The server responds to the mode function by applying the new mode and returning the corresponding mode information to the client. This step may destroy the old
framebuffer dataspace.
4. The client requests a new version of the framebuffer dataspace by calling the dataspace RPC function and attaches the dataspace to its local address space. Note that
each subsequent call of the dataspace RPC function may result in the replacement
of the existing dataspace by a new dataspace. Hence, calling dataspace twice may
invalidate the dataspace returned from the first call.
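A client-side sketch of this protocol is given below. Framebuffer_session, Mode, and Region_map are simplified stand-ins for the corresponding Genode interfaces; the real names and signatures may differ.

/* sketch of responding to a screen-mode-change signal (steps 2 to 4 above) */
void handle_mode_change(Framebuffer_session &fb, Region_map &rm,
                        void *&pixels, Mode &mode)
{
    /* acknowledge the mode change and learn the new mode */
    mode = fb.mode();

    /* the old dataspace may have been destroyed - attach the new one */
    rm.detach(pixels);
    pixels = rm.attach(fb.dataspace());
}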
Frame-rate synchronization To enable framebuffer clients to synchronize their operations with the display frequency, a client can register a handler for receiving display-synchronization events as asynchronous notifications (Section 3.6.2).
Use cases
• Framebuffer device drivers are represented as framebuffer servers.
• A virtual framebuffer may provide both the framebuffer and input session interfaces by presenting a window on screen. The resizing of the window may be
reflected to the client as screen-mode changes.
• A filter component requests a framebuffer session and, in turn, provides a framebuffer session to a client. This way, pixel transformations can be applied to pixels
produced by a client without extending the client.
4.5.6 Nitpicker GUI
The nitpicker session interface subsumes an input session and a framebuffer session as
a single session (Figure 34). Furthermore, it supplements the framebuffer session with
the notion of views, which allows the creation of flexible multi-window user interfaces.
It is generally suited for implementing resource multiplexers of framebuffer and input sessions.
A view is a rectangular area on screen that displays a portion of the client’s virtual
framebuffer. The position, size, and viewport of each view are defined by the client.
Views can overlap, thereby creating a view stack. The stacking order of the views of
one client can be freely defined by the client.
The size of the virtual framebuffer can be freely defined by the client but the required
backing store must be provided in the form of session quota. Clients may request the
screen mode of the physical framebuffer and are able to register a signal handler for
mode changes of the physical framebuffer. This way, nitpicker clients are able to adapt
themselves to changed screen resolutions.
Figure 34: A nitpicker session aggregates a framebuffer session, an input session, and a session-local view stack.
Use cases
• The nitpicker GUI server allows multiple GUI applications to share a pair of a
physical framebuffer session and an input session in a secure way.
• A window manager implementing the nitpicker session interface may represent
each view as a window with window decorations and a placement policy. The resizing of a window by the user is reflected to the client as a screen-mode change.
• A loader (Section 4.5.14) virtualizes the nitpicker session interface for the loaded
subsystem.
4.5.7 Platform
The platform session interface (on ARM-based devices) and the PCI session interface
(on x86-based machines) provide the client with access to the devices present on the
hardware platform. See Section 4.1.1 for more information on the role of platform drivers.
4.5.8 Block
The block session interface allows a client to access a storage server at the block level.
The interface is based on a packet stream (Section 3.6.6). Each packet represents a block-access command, which can be either read or write. Thanks to the use of the packet-stream mechanism, the client can issue multiple commands at once and thereby hide
access latencies by submitting batches of block requests. The server acknowledges each
packet after completing the corresponding block-command operation.
The packet-stream interface for submitting commands is complemented by the info
RPC function for querying the properties of the block device, i. e., the supported operations, the block size, and the block count. Furthermore, a client can call the sync RPC
function to flush caches at the block server.
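The following sketch issues a single read command via the packet stream. Block_session and Packet are simplified stand-ins that mirror the packet-stream operations of Section 3.6.6 (allocate, submit, await acknowledgement, release); the real block-session API differs in detail.

#include <cstring>

/* sketch of reading one block of 'block_size' bytes into 'dst' */
void read_block(Block_session &block, unsigned long block_number,
                unsigned long block_size, char *dst)
{
    /* reserve payload space in the shared communication buffer */
    Packet p = block.alloc_packet(block_size);
    p.operation    = Packet::READ;
    p.block_number = block_number;
    p.block_count  = 1;

    block.submit_packet(p);                /* issue the command            */
    Packet ack = block.get_acked_packet(); /* wait for its acknowledgement */

    if (ack.succeeded)
        std::memcpy(dst, block.packet_content(ack), block_size);

    block.release_packet(ack);             /* free the buffer space again  */
}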
Session creation At session-creation time, the client can dimension the size of the
communication buffer as session argument. The server allocates the shared communication buffer from the session quota.
Use cases
• Block-device drivers implement the block-session interface.
• The part-block component requests a single block session, parses a partition table,
and hands out each partition as a separate block session to its clients. There can
be one client for each partition.
• File-system servers use block sessions as their back end.
4.5.9 Regulator
The regulator session represents an adjustable value in the hardware platform. Examples are runtime-configurable frequencies and voltages. The interface is a plain RPC
interface.
4.5.10 Timer
The timer session interface provides a client with a session-local time source. A client
can use it to schedule timeouts that are delivered as signals to a previously registered signal
handler. Furthermore, the client can request the elapsed number of milliseconds since
the creation of the timer session.
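For example, a client may block for a given duration as sketched below, assuming a Timer_session stand-in with a trigger_once RPC function and the simplified signal-handling types used in the earlier sketches; the real interface may differ.

/* sketch of a blocking delay built from a one-shot timeout signal */
void block_for_ms(Timer_session &timer, Signal_receiver &sig_rec,
                  Signal_context &timeout_ctx, unsigned long ms)
{
    timer.sigh(sig_rec.manage(&timeout_ctx)); /* register the signal handler    */
    timer.trigger_once(ms * 1000);            /* duration given in microseconds */
    sig_rec.wait_for_signal();                /* block until the timeout fires  */
}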
4.5.11 NIC
A NIC session represents a network interface that operates at network-packet level.
Each session employs two independent packet streams (Section 3.6.6), one for receiving
network packets and one for transmitting network packets. Furthermore, the client can
query the MAC address of the network interface.
Session creation At session-creation time, the communication buffers of both packet
streams are dimensioned via session arguments. The communication buffers are allocated by the server using the session quota provided by the client.
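For illustration, transmitting a single network packet via the tx packet stream could look like the following sketch. Nic_session and Packet are simplified stand-ins that follow the packet-stream operations of Section 3.6.6; the real NIC-session API differs in detail.

#include <cstring>

/* sketch of submitting one Ethernet frame to the NIC server */
void send_frame(Nic_session &nic, char const *frame, unsigned long len)
{
    Packet p = nic.tx_alloc_packet(len);               /* space in the tx buffer */
    std::memcpy(nic.tx_packet_content(p), frame, len); /* copy the frame payload */
    nic.tx_submit_packet(p);                           /* hand it to the driver  */

    /* reclaim buffer space of frames the driver has already acknowledged */
    while (nic.tx_ack_avail())
        nic.tx_release_packet(nic.tx_get_acked_packet());
}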
Use cases
• Network drivers are represented as NIC servers.
• A NIC switch uses one NIC session connected to a NIC driver, and provides multiple virtual NIC interfaces to its clients by managing a custom name space of
virtual MAC addresses.
• A TCP/IP stack uses a NIC session as back end.
4.5.12 Audio output
The audio output interface allows for the transfer of audio data from the client to the
server. One session corresponds to one channel. That is, for stereo output, two audio-out
sessions are needed.
Session construction At session-construction time, the client specifies the type of
channel (e. g., front left) as session argument.
Interface design For the output of streamed audio data, a codec typically decodes
a relatively large portion of an audio stream and submits the sample data to a mixer.
The mixer, in turn, mixes the samples of multiple sources and forwards the result to
the audio driver. The codec, the mixer, and the audio driver are separate components.
By using large buffer sizes between them, there is only very
little context-switching overhead. Also, the driver can submit large buffers of sample
data to the sound device without any further intervention needed. In contrast, sporadic
sounds are used to inform the user about an immediate event. It is ultimately expected
that such sounds are played back without much latency. Otherwise the interactive experience (e. g., of games) would suffer. Hence, using large buffers between the audio
source, the mixer, and the driver is not an option. The audio-out session interface was
specifically designed to accommodate both corner cases of audio output.
Similarly to the packet-stream mechanism described in Section 3.6.6, the audio-out
session interface depicted in Figure 35 employs a combination of shared memory and
asynchronous notifications. However, in contrast to the packet-stream mechanism, it
has no notion of ownership of packets. When using the normal packet-stream protocol,
either the source or the sink is in charge of handling a given packet at a given time, not
both. The audio-out session interface weakens this notion of ownership by letting the
source update audio frames even after they have been submitted. If there are
solely continuous streams of audio arriving at the mixer, the mixer can mix those large
batches of audio samples at once and pass the result to the driver.
Now, if a sporadic sound comes in, the mixer checks the current output position
reported by the audio driver, and re-mixes those portions that haven’t been played
back yet by incorporating the sporadic sound. So the buffer consumed by the driver
gets updated with new data.
Figure 35: The time-driven audio-out session interface uses shared memory to transfer audio frames and propagate progress information.
Figure 36: The mixer processes batches of incoming audio frames from multiple sources.
Figure 37: A sporadically occurring sound prompts the mixer to remix packets that were already submitted in the output queue.
Besides the way packets are populated with data, the second major difference
to the packet-stream mechanism is its time-triggered mode of operation. The driver
produces periodic signals that indicate the completeness of a played-back audio packet.
This signal triggers the mixer to become active, which in turn serves as a time base for
its clients. The current playback position is denoted alongside the sample data as a field
in the memory buffer shared between source and sink.
Use cases
• The audio-out session interface is provided by audio drivers.
• An audio mixer combines incoming audio streams of multiple clients into one
audio stream transferred to an audio driver.
4.5.13 File system
The file-system session interface provides the client with a storage facility at the file
and directory level. Compared to the block session interface (Section 4.5.8), it operates
on a higher abstraction level that is suited for multiplexing the storage device among
multiple clients. Similar to the block session, the file-system session employs a single
packet stream interface (Section 3.6.6) for issuing read and write operations. This way,
read and write requests can be processed in batches and even out of order.
In contrast to read and write operations that carry potentially large amounts of payload, the directory functions provided by the file-system session interface are synchronous RPC functions. Those functions are used for opening, creating, renaming,
moving, deleting, and querying files, directories and symbolic links.
The directory functions are complemented with an interface for receiving notifications upon file or directory changes using asynchronous notifications.
Use cases
• A file-system server operates on a block session to provide file-system sessions to its
clients.
• A RAM file system keeps a directory structure and files in memory and provides
file-system sessions to multiple clients. Each session may be restricted in different
ways (such as the root directory as visible by the respective client, or the permission to write). Thereby the clients can communicate using the RAM file system as
a shared storage facility but are subjected to an information-flow policy.
• A file-system component may play the role of a filter that transparently encrypts
the content of the files of its client and stores the encrypted files at another file-system server.
• A pseudo file system may use the file-system interface as a hierarchic control
interface. For example, a trace file system provides a pseudo file system as a front
end to interact with core’s TRACE service.
4.5.14 Loader
The loader session interface allows clients to dynamically create Genode subsystems to
be hosted as children of a loader service. In contrast to a component that is spawning a
new subsystem as an immediate child, a loader client has very limited control over the
spawned subsystem. It can merely define the binaries and configuration to start, define
the position where the loaded subsystem will appear on screen, and kill the subsystem.
But it is not able to interfere with the operation of the subsystem during its lifetime.
Session creation At session-creation time, the client defines the amount of memory
to be used for the new subsystem as session quota. Once the session is established,
the client equips the loader session with ROM modules that will be presented to the
loaded subsystem. From the perspective of the subsystem, those ROM modules can be
requested in the form of ROM sessions from its parent.
Visual integration of the subsystem The loaded subsystem may implement a
graphical user interface by creating a nitpicker session (Section 4.5.6). The loader
responds to such a session request by providing a locally implemented session. The
loader subordinates the nitpicker session of the loaded subsystem to a nitpicker view
(called parent view) defined by the loader client. The loader client can use the loader
session interface to position the view relative to the parent-view position. Thereby, the
graphical user interface of the loaded subsystem can be seamlessly integrated with the
user interface of the loader client.
Use case The most illustrative use case is the execution of web-browser plugins
where neither the browser trusts the plugin nor the plugin trusts the browser (Section
4.7.4).
4.6 Component configuration
By convention, each component obtains its configuration in the form of a ROM module
named “config”. The ROM session for this ROM module is provided by the parent
of the component. For example, for the init component, which is the immediate child
of core, its “config” ROM module is provided by core’s ROM service. Init, in turn,
provides a different config ROM module to each of its children by providing a locally
implemented ROM service per child.
4.6.1 Configuration format
In principle, being a mere ROM module, a component configuration can come in an
arbitrary format. However, throughout Genode, there exists the convention to use XML
as syntax and wrap the configuration within a <config> node. The definition of sub
nodes of the configuration depends on the respective component.
4.6.2 Server-side policy selection
Servers that serve multiple clients may apply a different policy to each client. In general, the policy may be defined by the session arguments aggregated on the route of
the session request as explained in Section 3.2.3. However, in the usual case, the policy is dictated by the common parent of client and server. In this case, the parent may
propagate its policy as the server’s configuration and deliver a textual label as session
argument for each session requested at the server. The configuration contains a list of
policies, and the session label is used as a key to select the policy from the list. For
example, the following snippet configures a RAM file system with different policies.
<config>
  <!-- constrain sessions according to their labels -->
  <policy label="noux -> root" root="/" />
  <policy label="noux -> home" root="/home/user" />
  <policy label="noux -> tmp"  root="/tmp" writeable="yes" />
</config>
Each time a session is created, the server matches the supplied session label against
the configured policies. Only if a policy matches do the parameters of the matching policy
come into effect. How the session label is matched against the policies depends
on the implementation of the server. Usually, the server selects the policy where the
session label starts with the policy’s label. If multiple policies match, the one with the
longest (most specific) policy label is selected. If multiple policies have the same label,
the selection is undefined. This is a configuration error.
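The matching rule can be expressed in a few lines. The following self-contained sketch represents each <policy> node by a hypothetical Policy record and returns the most specific match, or a null pointer if no policy matches; real servers operate on the parsed XML configuration directly.

#include <string>
#include <vector>

/* hypothetical representation of one <policy> node of the server's <config> */
struct Policy { std::string label; /* ... policy-specific attributes ... */ };

Policy const *select_policy(std::vector<Policy> const &policies,
                            std::string const &session_label)
{
    Policy const *best = nullptr;

    for (Policy const &p : policies) {

        /* the session label must start with the policy label */
        if (session_label.compare(0, p.label.size(), p.label) != 0)
            continue;

        /* prefer the longest (most specific) matching policy label */
        if (!best || p.label.size() > best->label.size())
            best = &p;
    }
    return best; /* nullptr means that no policy matches */
}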
4.6.3 Dynamic component reconfiguration at runtime
As described in Section 4.5.1, a ROM module can be updated during the lifetime of
the ROM session. This principally enables a parent to dynamically reconfigure a child
component without the need to restart it. If a component supports dynamic reconfiguration, it installs a signal handler at its “config” ROM session. Each time the
configuration changes, the component will receive a signal. It responds to such a signal
by obtaining the new version of the ROM module using the steps described in Section
4.5.1 and applying the new configuration.
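Combined with the ROM-update sketch of Section 4.5.1, a dynamically reconfigurable component may structure its main loop as sketched below. The apply_config function is a hypothetical component-specific routine, and the signal types are again simplified stand-ins for the real Genode interfaces.

/* sketch of a main loop that re-applies the configuration on each update */
void config_update_loop(Rom_consumer &config_rom,
                        Signal_receiver &sig_rec, Signal_context &config_ctx)
{
    /* get notified whenever the parent updates the "config" ROM module */
    config_rom.rom.sigh(sig_rec.manage(&config_ctx));

    for (;;) {
        sig_rec.wait_for_signal();      /* the parent changed the config */
        config_rom.handle_update();     /* fetch the new ROM version     */
        apply_config(config_rom.local); /* re-evaluate the XML content   */
    }
}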
4.7 Component compositions
Genode provides a playground for combining components in many different ways. The
best composition of components often depends on the goal of the system integrator.
Among possible goals are the ease of use for the end user, the cost-efficient reuse of
existing software, and good application performance. However, the most prominent
goal is the mitigation of security risks. This section presents composition techniques
that leverage Genode’s architecture to dramatically reduce the trusted computing base
of applications and to solve rather complicated problems in surprisingly easy ways.
The figures presented throughout this section use a simpler nomenclature than the
previous sections. A component is depicted as a box. Parent-child relationships are represented as light-gray arrows. A session between a client and a server is illustrated by
a dashed arrow pointing to the server.
4.7.1 Sandboxing
The functionality of existing applications and libraries is often worth reusing or economically downright infeasible to reimplement. Examples are PDF rendering engines,
libraries that support commonly used video and audio codecs, or libraries that decode
hundreds of image formats.
However, code of such rich functionality is inherently complex and must be assumed
to contain security flaws. This is empirically evidenced by the never-ending stream
of security exploits targeting the decoders of data formats. But even in the absence of
bugs, the processing of data by third-party libraries may have unintended side effects.
For example, a PDF file may contain code that accesses the file system, which the user of
a PDF reader may not expect. By linking such a third-party library to a security-critical
application, the application’s security is seemingly traded against the functional value
that the library provides.
Fortunately, Genode’s architecture principally allows every component to encapsulate untrusted functionality in child components. So instead of directly linking a third-party library to an application, the application executes the library code in a dedicated
subcomponent. By imposing a strict session-routing policy onto the component, the
untrusted code is restricted to its sandbox. Figure 38 shows a video player as a practical
example of this approach.
The video player uses the nitpicker GUI server to present a user interface with the
graphical controls of the player. Furthermore, it has access to a media file containing
video and audio data. Instead of linking the media-codec library (libav) directly to
the video-player application, it executes the codec as a child component. Thereby the
application effectively restricts the execution environment of the codec to only those
resources that are needed by the codec. Those resources are the media file that is handed
out to the codec as a ROM module, a facility to output video frames in the form of a
framebuffer session, and a facility to output an audio stream in the form of an audio-out
session.
Figure 38: A video player executes the video and audio codecs inside a dedicated sandbox.
In order to reuse as much code as possible, the video player executes an existing example application called avplay that comes with the codec library as a child component.
The avplay example uses libSDL as backend for video and audio output and responds
to a few keyboard shortcuts for controlling the video playback such as pausing the
video. Because there exists a Genode version of libSDL, avplay can be executed as a
Genode component with no modifications. This version of libSDL requests a framebuffer session (Section 4.5.5) and an audio-out session (Section 4.5.12) to perform the
video and audio output. To handle user input, it opens an input session (Section 4.5.4).
Furthermore, it opens a ROM session for obtaining a configuration. This configuration
parametrizes the audio backend of libSDL. Because avplay is a child of the video-player
application, all those session requests are directed to the application. It is entirely up
to the application how to respond to those requests. For accommodating the request
for a framebuffer session, the application creates a second nitpicker session, configures
a virtual framebuffer, and embeds this virtual framebuffer into its GUI. It keeps the
nitpicker session capability for itself and merely hands out the virtual framebuffer’s
session capability to avplay. For accommodating the request for the input session, it
hands out a capability to a locally-implemented input session. Using this input session,
it becomes able to supply artificial input events to avplay. For example, when the user
clicks on the play button of the application’s GUI, the application would submit a sequence of press and release events to the input session, which appears to avplay as the
keyboard shortcut for starting the playback. To let the user adjust the audio parameters of libSDL during the replay, the video-player application dynamically changes the
avplay configuration using the mechanism described in Section 4.6.3. As a response
to a configuration update, libSDL’s audio backend picks up the changed configuration
parameters and adjusts the audio playback accordingly.
By sandboxing avplay as a child component of the video player, a bug in the video
or audio codecs can no longer compromise the application. The execution environment
of avplay is tailored to the needs of the codec. In particular, it does not allow the codec
to access any files or the network. In the worst case, if avplay becomes corrupted, the
possible damage is restricted to producing wrong video or audio frames, but the corrupted
codec can neither access any of the user’s data nor communicate with the outside
world.
4.7.2 Component-level and OS-level virtualization
The sandboxing technique presented in the previous section tailors the execution environment of untrusted third-party code by applying an application-specific policy to
all session requests originating from the untrusted code. However, the tailoring of the
execution environment by the parent can go even a step further by providing the all-encompassing virtualization of all services used by the child, including core’s services
such as RAM, RM, and CPU. This way, the parent can not just tailor the execution environment of a child but completely define all aspects of the child’s execution. This clears
the way for introducing custom operating-system interfaces at any position within the
component tree, or for monitoring the behavior of subsystems.
Introducing a custom OS interface By implementing all session interfaces normally
provided by core, a runtime environment becomes able to handle all low-level interactions of the child with core. This includes the allocation of memory using the RAM
service, the spawning and controlling of threads using the CPU service, and the management of the child’s address space using the RM service.
The noux runtime illustrated in Figure 39 is the canonical example of this approach.
It appears as a Unix kernel to its children and thereby enables the use of Unix software
on top of Genode. Normally, several aspects of Unix would contradict Genode’s
architecture:
• The Unix system-call interface supports files and sockets as first-class citizens.
• There is no global virtual file system in Genode.
• Any Unix process can allocate memory as needed. No explicit assignment of
memory resources to Unix processes is needed.
Figure 39: The Noux runtime provides a Unix-like interface to its children.
Figure 40: Each Genode component is created out of basic resources provided by core.
• Processes are created by forking existing processes. The new process inherits the
roles (in the form of open file descriptors) of the forking process.
Noux resolves those contradictions by providing the interfaces of core’s low-level services alongside a custom RPC interface. By providing a custom noux session interface
to its children, noux can accommodate all kinds of abstractions including the notion of
files and sockets. Noux maintains a virtual file system that appears to be global among
all the children of the noux instance. Since noux handles all the children’s interaction
with the RAM service, it can hand out memory allocations from a pool of memory
shared among all children. Finally, because noux observes all the interactions of each
child with the RM service, it is able to replay the address-space layout of an existing
process to a new process when the process forks.
Monitoring the behavior of subsystems Besides hosting arbitrary OS personalities
as a subsystem, the interception of core’s services allows for the all-encompassing monitoring of subsystems without the need for special support in the kernel. This is useful
for failsafe monitoring or for user-level debugging.
As described in Section 3.5, any Genode component is created out of low-level
resources in the form of sessions provided by core. Those sessions include at least a
RAM session, a CPU session, and an RM session as depicted in Figure 40. In addition to
those low-level sessions, the component may interact with sessions provided by other
components.
For debugging the component, a debugger would need a way to inspect the internal state of the component. As the complete internal state is usually known by the
OS kernel only, the traditional approach to user-level debugging is the introduction of
a debugging interface into the kernel. For example, Linux has the ptrace mechanism
and several microkernels of the L4 family come with built-in kernel debuggers. Such
a debugging interface, however, introduces security risks. Besides increasing the complexity of the kernel, access to the kernel’s debugging mechanisms needs to be strictly
subjected to a security policy. Otherwise any program could use those mechanisms
to inspect or manipulate other programs. Most L4 kernels tend to exclude debugging
features from production builds altogether.
Figure 41: By intercepting all sessions to core’s services, a debug monitor obtains insights into the internal state of its child component. The debug monitor, in turn, is controlled from a remote debugger.
In a Genode system, the component’s internal state is represented in the form of core
sessions. Hence, by intercepting those sessions for a child, a parent can monitor all interactions of the child with core and thereby record the child’s internal state. Figure 41
shows a scenario where a debug monitor executes a component (debugging target) as a
child while intercepting all sessions to core’s services. The interception is performed by
providing custom implementations of core’s session interfaces as locally implemented
services. Under the hood, the local services realize their functionality using actual core
sessions. But by sitting in the middle between the debugging target and core, the debug monitor can observe the target’s internal state including the memory content, the
virtual address-space layout, and the state of all threads running inside the component.
Furthermore, since the debug monitor is in possession of all the session capabilities of
the debugging target, it can manipulate it in arbitrary ways. For example, it can change
thread states (e. g., pausing the execution or enabling single-stepping) and modify the
memory content (e. g., inserting breakpoint instructions). The figure shows that those
debugging features can be remotely controlled over a terminal connection.
Using this form of component-level virtualization, a problem that used to require
special kernel additions in traditional operating systems can be solved via Genode’s
regular interfaces. Furthermore, Figure 42 shows that by combining the solution with
OS-level virtualization, the connection to a remote debugger can actually be routed to
an on-target instance of the debugger, thereby enabling on-target debugging.
4.7.3 Interposing individual services
The design of Genode’s fundamental services, in particular resource multiplexers, is
guided by the principle of minimalism. Because such components are critical for security, complexity must be avoided. Functionality is added to such components only if it
cannot be provided outside the component.
Figure 42: The GNU debugger is executed within a dedicated noux instance, thereby providing an on-target debugging facility.
However, components like the nitpicker GUI server are often confronted with feature requests. For example, users may want to move a window on screen by dragging
the window’s title bar. Because nitpicker has no notion of windows or title bars, such
functionality is not supported. Instead, nitpicker moves the burden to implement
window decorations to its clients. However, this approach sacrifices functionality that
is taken for granted on modern graphical user interfaces. For example, the user may
want to switch the application focus using a keyboard shortcut or perform window
operations and the interactions with virtual desktops in a consistent way. If each application implemented the functionality of virtual desktops individually, the result would
hardly be usable. For this reason, it is tempting to move window-management functionality into the GUI server and to accept the violation of the minimalism principle.
The nitpicker GUI server is not the only service challenged by feature requests. The
problem is present even at the lowest-level services provided by core. Core’s RM service is used to manage the virtual address spaces of components. When a dataspace is
attached to an RM session, the RM service picks a suitable virtual address range where
the dataspace will be made visible in the virtual address space. The allocation strategy
depends on several factors such as alignment constraints and the address range that fits
best. But eventually, it is deterministic. This contradicts the common wisdom that
address spaces shall be randomized. Hence core’s RM service is challenged with the
request for adding address-space randomization as a feature. Unfortunately, the addition of such a feature into core raises two issues. First, core would need to have a source
of good random numbers. But core does not contain any device drivers from which to
draw entropy. With weak entropy, the randomization might not be random enough.
In this case, the pretension of a security mechanism that is actually ineffective may be
worse than not having it in the first place. Second, the feature would certainly increase
the complexity of core. This is acceptable for components that potentially benefit from
the added feature, such as outward-facing network applications. But the complexity
eventually becomes part of the TCB of all components including those that do not benefit from the feature.
Figure 43: The nitpicker GUI accompanied with a window manager that interposes the nitpicker session interface for the applications on the right. The applications on the left are still able to use nitpicker directly and thereby avoid the complexity added by the window manager.
The solution to those kinds of problems is the enrichment of existing servers by interposing their sessions. Figure 43 shows a window manager implemented as a separate
component outside of nitpicker. Both the nitpicker GUI server and the window manager provide the nitpicker session interface. But the window manager enriches the
semantics of the interface by adding window decorations and a window-layout policy.
Under the hood, the window manager uses the real nitpicker GUI server to implement
its service. From the application’s point of view, the use of either service is transparent.
Security-critical applications can still be routed directly to the nitpicker GUI server. So
the complexity of the window manager comes into effect only for those applications
that use it.
The same approach can be applied to the address-space randomization problem. A
component with access to good random numbers may provide a randomized version
of core’s RM service. Outward-facing components can benefit from the security feature
by having their RM session requests routed to this component instead of core.
4.7.4 Ceding the parenthood
When using a shell to manage subsystems, the complexity of the shell naturally becomes a security risk. A shell can be a text-command interpreter, a graphical desktop
shell, a web browser that launches subsystems as plugins, or a web server that provides a remote administration interface. All those kinds of shells have in common that
they contain an enormous amount of complexity that is attributed to convenience. For
example, a textual shell usually depends on libreadline, ncurses, or similar libraries to
provide a command history and to deal with the peculiarities of virtual text terminals.
A graphical desktop shell is even worse because it usually depends on a highly complex widget toolkit, not to speak about using a web browser as a shell. Unfortunately,
the functionality provided by these programs cannot be dismissed as it is expected by
the user. But the high complexity of the convenience functions fundamentally contradicts
the security-critical role of the shell as the common parent of all spawned
subsystems. If the shell gets compromised, all the spawned subsystems suffer.
Figure 44: A web browser spawns a plugin by ceding the parenthood of the plugin to the trusted loader service.
The risk of such convoluted shells can be mitigated by moving the parent role for
the started subsystems to another component, namely a loader service. In contrast
to the shell, which should be regarded as untrusted due to its complexity, the loader
is a small component that is orders of magnitude less complex. Figure 44 shows a
scenario where a web browser is used as a shell to spawn a Genode subsystem. Instead
of spawning the subsystem as the child of the browser, the browser creates a loader
session. Using the loader-session interface described in Section 4.5.14, it can initially
import the to-be-executed subsystem into the loader session and kick off the execution
of the subsystem. However, once the subsystem is running, the browser can no longer
interfere with the subsystem’s operation. So security-sensitive information processed
within the loaded subsystem is no longer exposed to the browser. Still, the lifetime of
the loaded subsystem depends on the browser. If it decides to close the loader session,
the loader will destroy the corresponding subsystem.
By ceding the parenthood to a trusted component, the risks stemming from the complexity of various kinds of shells can be mitigated.
4.7.5 Publishing and subscribing
All the mechanisms for transferring data between components presented in Section
3.6 have in common that data is transferred in a peer-to-peer fashion. A client transfers
data to a server or vice versa. However, there are situations where such a close
coupling of both ends of communication is not desired. In multicast scenarios, the producer of information desires to propagate information without the need to interact (or
even depend on a handshake) with each individual recipient. Specifically, a component
might want to publish status information about itself that might be useful for other
components. For example, a wireless-networking driver may report the list of detected
wireless networks along with their respective SSIDs and reception qualities such that
a GUI component can pick up the information and present it to the user. Each time
the driver detects a change in the ether, it wants to publish an updated version of the
list. Such a scenario could principally be addressed by introducing a use-case-specific
session interface, i. e., a “wlan-list” session. But this approach has two disadvantages.
1. It forces the wireless driver to play an additional server role. Instead of pushing
information anytime at the discretion of the driver, the driver has to actively support the pulling of information from the wlan-list client. This is arguably more
complex.
2. The wlan-list session interface ultimately depends on the capabilities of the driver
implementation. If an alternative wireless driver is able to supplement the list
with further details, the wlan-list session interface of the alternative driver might
look different. As a consequence, the approach is likely to introduce many special-purpose session interfaces. This contradicts the goal of promoting the composability of components as stated at the beginning of Section 4.5.
As an alternative to introducing special-purpose session interfaces for addressing the
scenarios outlined above, two existing session interfaces can be combined, namely
ROM and report.
Report-ROM server The report-rom server is both a ROM service and a report service.
It acts as an information broker between information providers (clients of the report
service) and information consumers (clients of the ROM service).
To propagate its internal state to the outside, a component creates a report session.
From the client’s perspective, the posting of information via the report session’s submit
function is a fire-and-forget operation, similar to the submission of a signal. But in
contrast to a signal, which cannot carry any payload, a report is accompanied with
arbitrary data. For the example above, the wireless driver would create a report session.
Each time the list of networks changes, it would submit an updated list as a report to
the report-ROM server.
The report-ROM server stores incoming reports in a database using the client’s session label as key. So the wireless driver’s report will end up in the database under the
name of the driver component. If one component wishes to post reports of different
kinds, it can do so by extending the session label by a component-provided label suffix
supplied as session-construction argument (Section 4.5.2). The memory needed as the
backing store for the report at the report-ROM server is accounted to the report client
via the session-quota mechanism described in Section 3.3.2.
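Posting a report is a small operation, as the following sketch shows. Report_session is a simplified stand-in: the client fills the dataspace shared with the report server (dimensioned at session-creation time) and announces the length of the payload via a fire-and-forget submit RPC function; the real report-session API may differ in detail.

#include <algorithm>
#include <cstring>
#include <string>

/* sketch of posting a new report, e. g., the wireless driver's network list */
void post_report(Report_session &report, char *buffer /* attached dataspace */,
                 unsigned long buffer_size, std::string const &content)
{
    unsigned long const len =
        std::min<unsigned long>(content.size(), buffer_size);

    std::memcpy(buffer, content.data(), len); /* copy the report into the buffer */
    report.submit(len);                       /* fire and forget, no reply       */
}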
In its role as a ROM service, the report-ROM server hands out the reports stored in its
database as ROM modules. The association of reports with ROM sessions is based on
the session label of the ROM client. The configuration of the report-ROM server contains a list of policies as introduced in Section 4.6.2. Each policy entry is accompanied
with a corresponding key into the report database.
When a new report comes in, all ROM clients that are associated with the report
are informed via a ROM-update signal (Section 4.5.1). Each client can individually
respond to the signal by following the ROM-module update procedure and thereby
obtain the new version of the report. From the client’s perspective, the origin of the
information is opaque. It cannot decide whether the ROM module is provided by the
report-ROM server or an arbitrary other ROM service.
Coming back to the wireless-driver example, the use of the report-ROM server effectively decouples the GUI application from the wireless driver. This has the following
benefits:
• The application can be developed and tested with an arbitrary ROM server supplying an artificially created list of networks.
• There is no need for the introduction of a special-purpose session interface between both components.
• The wireless driver can post state updates in an intuitive fire-and-forget way
without playing an additional server role.
• The wireless driver could be restarted without affecting the application.
Poly-instantiation of the report-ROM mechanism The report-ROM server is a
canonical example of a protocol stack (Section 4.2). It performs a translation between
the report-session interface and the ROM-session interface. Being a protocol stack, it
can be instantiated any number of times. It is up to the system integrator whether
to use one instance for gathering the reports of many report clients, or to instantiate
multiple report-ROM servers. Taken to the extreme, one report-ROM server could be
instantiated per report client. The routing of ROM-session requests restricts the access
of the ROM clients to the different instances. Even in the event that the report-ROM
server is compromised, the policy for the information flows between the producers and
consumers of information stays in effect.
4.7.6 Enslaving services
In the scenarios described in the previous sections, the relationships between clients
and servers were one of the following:
• The client is a sibling of the server within the component tree, or
• The client is a child of a parent that provides a locally-implemented service to its
child.
However, the Genode architecture allows for a third option: The parent can be a client
of its own child. Given the discussion in Section 3.2.4, this arrangement looks counterintuitive at first because the discussion concluded that a client has to trust the server
with respect to the client’s liveliness. Here, a call to the server would be synonymous
to a call to the child. Even though the parent is the owner of the child, it would make
itself dependent on the child, which is generally against the interest of the parent.
That said, there is a plausible case where the parent’s trust in a child is justified: If
the parent uses an existing component like a 3rd-party library. When calling code of
a 3rd-party library, the caller implicitly agrees to yield control to the library and trusts
the called function to return at some point. The call of a service that is provided by a
child corresponds to such a library call.
By providing the option to host a server as a child component, Genode’s architecture
facilitates the use of arbitrary server components in a library-like fashion. Because the
server performs a useful function but is owned by its client, it is called a slave. An application may aggregate existing protocol-stack components as slaves without the need
to incorporate the code of the protocol stacks into the application. For example, by
enslaving the report-ROM server introduced in Section 4.7.5, an application becomes
able to use it as a local publisher-subscriber mechanism. Another example would be an
application that aggregates an instance of the nitpicker GUI server for the sole purpose
of composing an image out of several source images. When started, the nitpicker slave
requests a framebuffer and an input session. The application responds to these requests
by handing out locally-implemented sessions so that the output of the nitpicker slave
becomes visible to the application. To perform the image composition, the application
creates a nitpicker session for each source image and supplies the image data to the virtual framebuffer of the respective session. After configuring nitpicker views according
to the desired layout of the final image, the application obtains the composed image
from nitpicker’s framebuffer.
Note that by calling the slave, the parent does not need to trust the slave with respect
to the integrity and confidentiality of its internal state (see the discussion in Section
3.2.4). By performing the call, only the liveliness of the parent is potentially affected. If
not trusting the slave to return control once called, the parent may take special precautions: A watchdog thread inside the parent could monitor the progress of the slave and
cancel the call after the expiration of a timeout.
5 Development
TODO
5.1 Work flow
TODO
5.2 Tool chain
TODO
5.3 Build system
TODO
5.4 Ports of 3rd-party software
TODO
5.5 Run tool
TODO
5.6 Automated tests
TODO
6 System configuration
There are manifold principal approaches to configure different aspects of an operating
system and the applications running on top. At the lowest level, there exists the opportunity to pass configuration information to the boot loader. This information may
be evaluated directly by the boot loader or passed to the booted system. As an example for the former, some boot loaders allow for setting up a graphics mode depending
on their configuration. Hence, the graphics mode to be used by the OS could be defined
right at this early stage of booting. More prominent, however, is the mere passing of
configuration information to the booted OS, i. e., in the form of a kernel command line
or as command-line arguments to boot modules. The OS would interpret boot-loader-provided data structures (e. g., multiboot info structures) to obtain such information.
Most kernels interpret certain configuration arguments passed via this mechanism. At
the OS-initialization level, before any drivers are functioning, the OS behavior is typically steered by configuration information provided along with the kernel image, i. e.,
an initial file-system image (initrd). On Linux-based systems, this information comes
in the form of configuration files and init scripts located at well-known locations within
the initial file-system image. Higher up the software stack, configuration becomes
an even more diverse topic. For instance, the runtime behavior of a GNU/Linux-based system is defined by a conglomerate of configuration files, daemons and their respective
command-line arguments, environment variables, collections of symlinks, and plenty
of heuristics.
The diversity and complexity of configuration mechanisms, however, is problematic
for high-assurance computing. To attain a high level of assurance, Genode’s architecture must be complemented by a low-complexity yet scalable configuration concept.
The design of this concept takes the following considerations into account.
Uniformity across platforms To be applicable across a variety of kernels and hardware platforms, the configuration mechanism must not rely on a particular kernel
or boot loader. Even though boot loaders for x86-based machines usually support
the multiboot specification and thereby the ability to supplement boot modules
with additional command lines, boot loaders on ARM-based platforms generally
lack this ability. Furthermore, even if a multiboot-compliant boot loader is used,
the kernel - once started - must provide a way to reflect the boot information to
the system on top, which is not the case for most microkernels.
Low complexity The configuration mechanism is an intrinsic part of each component.
Hence, it affects the trusted computing base of every Genode-based system. For
this reason, the mechanism must be easy to understand and implementable without the need for complex underlying OS infrastructure. As a negative example,
the provision of configuration files via a file system would require each Genode-based system to support the notion of a file system and to define the naming of
configuration files.
Expressiveness Passing configuration information as command-line arguments to
components at their creation time seems like a natural way to avoid the complexity of a file-based configuration mechanism. However, whereas command-line
arguments are the tried and tested way for supplying program arguments in a
concise way, the expressiveness of the approach is limited. In particular, it is
ill-suited for expressing structured information as often found in configurations.
Being a component-based system, Genode requires a way to express relationships
between components, which lends itself to the use of a structural representation.
Common syntax The requirement of a low-complexity mechanism mandates a common syntax across components. Otherwise, each component would need to come
with a custom parser. Each of those parsers would eventually inflate the complexity of the trusted computing base. In contrast, a common syntax that is both
expressive and simple to parse helps to avoid such redundancies by using a single
parser implementation across all components.
Least privilege Being the guiding motive behind Genode’s architecture, the principle
of least privilege needs to be applied to the access of configuration information.
Each component needs to be able to access its own configuration but must not
observe configuration information concerning unrelated components. A system-global registry of configurations or even a global namespace of keys for such a
database would violate this principle.
Accommodation of dynamic workloads Supplying configuration information at the
construction time of a component is not sufficient for long-living components,
whose behavior might need to be adapted at runtime. For example, the assignment of resources to the clients of a resource multiplexer might change over the
lifetime of the resource multiplexer. Hence, the configuration concept should provide a means to update the configuration information of a component after its
construction time.
<config>
  <parent-provides> ... </parent-provides>
  <default-route> ... </default-route>
  ...
  <start name="nitpicker">
    ...
  </start>
  <start name="launchpad">
    ...
    <config>
      <launcher name="L4Linux">
        <binary name="init"/>
        <config>
          <parent-provides> ... </parent-provides>
          <default-route>
            <any-service> <any-child/> <parent/> </any-service>
          </default-route>
          <start name="nit_fb">
            <resource name="RAM" quantum="6M"/>
            <config xpos="400" ypos="270" width="300" height="200" />
            <provides>
              <service name="Input"/>
              <service name="Framebuffer"/>
            </provides>
          </start>
          <start name="virtualbox">
            <resource name="RAM" quantum="1G"/>
            <config vbox_file="test.vbox" vm_name="TestVM">
              ...
            </config>
          </start>
        </config>
      </launcher>
    </config>
  </start>
</config>

Figure 45: Nested system configuration
6.1 Nested configuration concept
Genode’s configuration concept is based on the ROM session interface described in Section 4.5.1. In contrast to a file-system interface, the ROM session interface is extremely
simple. The client of a ROM service specifies the requested ROM module by its name as
known by the client. There is neither a way to query a list of available ROM modules,
nor are ROM modules organized in a hierarchic name space.
The ROM session interface is implemented by core’s ROM service to make boot modules available to other components. Those boot modules comprise the executable binaries of the init component as well as those of the components created by init. Furthermore, a ROM module called “config” contains the configuration of the init process in
an XML format. To obtain its configuration, init requests a ROM session for the ROM
module “config” from its parent, which is core. Figure 45 shows an example of such a
config ROM module.
Figure 46: Successive interception of “config” ROM requests (each component - the nested init, the launcher, and init - requests the ROM module “config” from its parent, down to core, and each parent responds with the respective sub configuration)
The config ROM module uses XML as syntax, which supports the expression of arbitrary structural data while being simple to parse. For instance, Genode’s XML parser comes in the form of a single header file with less than 400 lines of code. Init’s configuration is
contained within a single <config> node.
Each component started by init obtains its configuration by requesting a ROM module named “config” from its parent, which is init. Init responds to this request by handing out a locally-provided ROM session. Instead of handing out the “config” ROM
module as obtained from core, it creates a new dataspace that solely contains the portion of init’s config ROM module that refers to the respective child. Analogously to
init’s configuration, each child’s configuration has the form of a single <config> node.
This works recursively. From each component’s perspective, including the init component, the mechanism for obtaining its configuration is identical – it obtains a ROM
session for a ROM module named “config” from its parent. The parent interposes the
ROM session request as described in Section 4.7.3. Figure 46 shows the successive interposing of “config” ROM requests according to the example configuration given in
Figure 45. At each level, the information structure within the <config> node can be
different. Besides following the convention that a configuration has the form of a single
<config> node, each component can introduce arbitrary custom tags and attributes.
Besides being simple, the use of the ROM session interface for supplying configuration information has the benefit of supporting dynamic configuration updates over
the lifetime of the config ROM session. Section 4.5.1 describes the update protocol be-
tween client and server of a ROM session. This way, the configuration of long-living
components can be dynamically changed.
6.2 The init component
The init component plays a special role within Genode’s component tree. It gets started
directly by core, gets assigned all physical resources, and controls the execution of all
further component nodes, which can be further instances of init. Init’s policy is driven
by an XML-based configuration, which declares a number of children, their relationships, and resource assignments.
6.2.1 Session routing
At the parent-child interface, there are two operations that are subject to policy decisions of the parent: the child announcing a service and the child requesting a service. If a child announces a service, it is up to the parent to decide if and how to make this service
accessible to its other children. When a child requests a service, the parent may deny the
session request, delegate the request to its own parent, implement the requested service
locally, or open a session at one of its other children. This decision may depend on the
requested service or the session-construction arguments provided by the child. Apart
from assigning resources to children, the central element of the policy implemented in
the parent is a set of rules to route session requests. Therefore, init’s configuration concept is laid out around child components and the routing of session requests originating
from those components. The concept is best illustrated by an example:
<config>
  <parent-provides>
    <service name="CAP"/>
    <service name="LOG"/>
    <service name="SIGNAL"/>
  </parent-provides>
  <start name="timer">
    <resource name="RAM" quantum="1M"/>
    <provides> <service name="Timer"/> </provides>
    <route>
      <service name="CAP">    <parent/> </service>
      <service name="SIGNAL"> <parent/> </service>
    </route>
  </start>
  <start name="test-timer">
    <resource name="RAM" quantum="1M"/>
    <route>
      <service name="Timer">  <child name="timer"/> </service>
      <service name="LOG">    <parent/> </service>
      <service name="SIGNAL"> <parent/> </service>
    </route>
  </start>
</config>
First, there is the declaration of services provided by the parent of the configured init
instance. In this case, we declare that the parent provides a CAP service, a LOG service,
and a SIGNAL service. For each child to start, there is a <start> node describing
resource assignments, declaring services provided by the child, and holding a routing
table for session requests originating from the child. The first child is called “timer” and
implements the “Timer” service. To implement this service, the timer requires a CAP
session. The routing table defines that CAP session requests are delegated to init’s parent. The second process called “test-timer” is a client of the timer service. In its routing
table, we see that requests for “Timer” sessions are routed to the “timer” child whereas
requests for “LOG” sessions are routed to init’s parent. Per-child service routing rules
provide a flexible way to express arbitrary client-server relationships. For example, service requests may be transparently mediated through special policy components acting
upon session-construction arguments. There might be multiple children implementing
the same service, each targeted by different routing tables. If there exists no valid route
to a requested service, the service is denied. In the example above, the routing tables
act effectively as a white list of services the child is allowed to use.
In practice, usage scenarios become more complex than the basic example, increasing
the size of routing tables. Furthermore, in many practical cases, multiple children may
use the same set of services and require duplicated routing tables within the configuration. In particular during development, the elaborative specification of routing tables
tends to become an inconvenience. To alleviate this problem, there are two mechanisms,
namely wildcards and a default route. Instead of specifying a list of individual service
routes targeting the same destination, the wildcard <any-service> comes in handy.
For example, instead of specifying
<route>
  <service name="ROM">    <parent/> </service>
  <service name="RAM">    <parent/> </service>
  <service name="RM">     <parent/> </service>
  <service name="PD">     <parent/> </service>
  <service name="CPU">    <parent/> </service>
  <service name="SIGNAL"> <parent/> </service>
</route>
the following shortcut can be used:
<route>
  <any-service> <parent/> </any-service>
</route>
The latter version is not as strict as the first one because it permits the child to create
sessions at the parent, which were not white listed in the elaborative version. Therefore,
the use of wildcards is discouraged for configuring untrusted components. Wildcards
and explicit routes may be combined as illustrated by the following example:
<route>
  <service name="LOG"> <child name="nitlog"/> </service>
  <any-service>        <parent/>              </any-service>
</route>
The routing table is processed starting with the first entry. If the route matches the
service request, it is taken; otherwise, the remaining routing-table entries are visited.
This way, the explicit service route of “LOG” sessions to the “nitlog” child shadows the
LOG service provided by the parent.
To allow a child to use services provided by arbitrary other children, there is a further
wildcard called <any-child>. Using this wildcard, such a policy can be expressed as
follows:
<route>
  <any-service> <parent/>    </any-service>
  <any-service> <any-child/> </any-service>
</route>
This rule would delegate all session requests referring to one of the parent’s services to
the parent. If no parent service matches the session request, the request is routed to any
child providing the service. The rule can be further abbreviated to:
<route>
  <any-service> <parent/> <any-child/> </any-service>
</route>
Init detects potential ambiguities caused by multiple children providing the same service. In this case, the ambiguity must be resolved using an explicit route preceding the
wildcards.
To reduce the need to specify the same routing table for many children in one configuration, there is a <default-route> mechanism. The default route is declared
within the <config> node and used for each <start> entry with no <route> node.
In particular during development, the default route comes in handy to keep the configuration tidy and neat.
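For example, a configuration in which no child carries an individual <route> node could declare a shared default route as follows (a minimal sketch reusing the timer scenario from above):

<config>
  <default-route>
    <any-service> <parent/> <any-child/> </any-service>
  </default-route>
  <start name="timer">
    <resource name="RAM" quantum="1M"/>
    <provides> <service name="Timer"/> </provides>
  </start>
  <start name="test-timer">
    <resource name="RAM" quantum="1M"/>
  </start>
</config>

Because neither <start> entry has a <route> node, both children use the default route: session requests are first offered to the parent and, if no parent service matches, routed to any child providing the service, which lets test-timer find the “Timer” service at its sibling.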
The combination of explicit routes and wildcards is designed to scale well from being convenient to use during development towards being highly secure at deployment
time. If only explicit rules are present in the configuration, the permitted relationships
between all processes are explicitly defined and can be easily verified.
6.2.2 Resource quota saturation
If a specified resource quota (e. g., RAM quota) exceeds the available resources, the available resources are assigned completely to the child. This makes it possible to assign all
remaining resources to the last child by simply specifying an overly large quantum.
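For example, the last child of a configuration may simply request an overly large quantum to soak up whatever RAM is left (a sketch with an illustrative child name):

<start name="launchpad">
  <resource name="RAM" quantum="1G"/>
  ...
</start>

If less than 1 GiB is available at this point, init assigns all remaining RAM to “launchpad” instead of failing.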
6.2.3 Handing out slack resources
Resources may remain unused after the creation of all children if the quantum of available resources is higher than the sum of the quotas assigned to the children. Init makes
such slack memory available to its children via the resource-request protocol described
in Section 3.3.4. Slack memory is handed out on a first-come first-served basis.
6.2.4 Multiple instantiation of a single ELF binary
Each <start> node requires a unique name attribute. By default, the value of this
attribute is used as the ROM module name for obtaining the ELF binary from the parent.
If multiple instances of a component with the same ELF binary are needed, the binary
name can be explicitly specified using a <binary> sub node of the <start> node:
<binary name="filename"/>
This way, a unique child name can be defined independently from the binary name.
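For example, two independent instances of the same binary could be declared as follows (child names chosen for illustration):

<start name="nitlog.1">
  <binary name="nitlog"/>
  <resource name="RAM" quantum="1M"/>
</start>
<start name="nitlog.2">
  <binary name="nitlog"/>
  <resource name="RAM" quantum="1M"/>
</start>

Both children are loaded from the ROM module “nitlog” but appear under unique names.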
6.2.5 Nested configuration
Each <start> node can host a <config> sub node. As described in Section 6.1, the
content of this sub node is provided to the child when a ROM session for the module
name “config” is requested. Thereby, arbitrary configuration parameters can be passed
to the child. For example, the following configuration starts test-timer within an
init instance within another init instance. To show the flexibility of init’s service routing
facility, the “Timer” session of the second-level test-timer child is routed to the timer
service started at the first-level init instance.
<config>
  <parent-provides>
    <service name="CAP"/>
    <service name="LOG"/>
    <service name="ROM"/>
    <service name="RAM"/>
    <service name="CPU"/>
    <service name="RM"/>
    <service name="PD"/>
    <service name="SIGNAL"/>
  </parent-provides>
  <start name="timer">
    <resource name="RAM" quantum="1M"/>
    <provides><service name="Timer"/></provides>
    <route>
      <any-service> <parent/> </any-service>
    </route>
  </start>
  <start name="init">
    <resource name="RAM" quantum="1M"/>
    <config>
      <parent-provides>
        <service name="Timer"/>
        <service name="LOG"/>
        <service name="SIGNAL"/>
      </parent-provides>
      <start name="test-timer">
        <resource name="RAM" quantum="1M"/>
        <route>
          <any-service> <parent/> </any-service>
        </route>
      </start>
    </config>
    <route>
      <service name="Timer"> <child name="timer"/> </service>
      <any-service>          <parent/>             </any-service>
    </route>
  </start>
</config>
The services ROM, RAM, CPU, RM, and PD are required by the second-level init instance to create the test-timer component. As illustrated by this example, the use of
nested configurations enables the construction of arbitrarily complex component trees
via a single configuration.
Alternatively to specifying all nested configurations in a single configuration, sub
configurations can be placed in separate ROM modules specified via the <configfile>
node. For example:
Figure 47: Successive virtualization of CPU affinity spaces by nested instances of init
<start name="nitpicker">
  <resource name="RAM" quantum="1M"/>
  <configfile name="nitpicker.config"/>
</start>
6.2.6 Assigning subsystems to CPUs
Most multi-processor (MP) systems have topologies that can be represented on a two-dimensional coordinate system. CPU nodes close to each other are expected to have a closer relationship than distant nodes. In a large MP system, it is natural to assign clusters of closely related nodes to a given workload. As described in Section 3.2, Genode’s
architecture is based on a strictly hierarchic organizational structure. Thereby, it lends
itself to the idea of applying this successive virtualization of resources to the problem of
clustering CPU nodes.
Each component within the component tree has a component-local view on a so-called affinity space, which is a two-dimensional coordinate space. If the component
creates a new subsystem, it can assign a portion of its own affinity space to the new
subsystem by imposing a rectangular affinity location onto the subsystem’s CPU session.
Figure 47 illustrates the idea.
Following from the expression of affinities as a rectangular location within a component-local affinity space, the assignment of subsystems to CPU nodes consists of two parts,
the definition of the affinity space dimensions as used for the init instance, and the
association of subsystems with affinity locations relative to the affinity space. The
affinity space is configured as a sub node of the <config> node. For example, the
following declaration describes an affinity space of 4x2:
<config>
...
<affinity-space width="4" height="2" />
...
</config>
Subsystems can be constrained to parts of the affinity space using the <affinity>
sub node of a <start> entry:
<config>
  ...
  <start name="loader">
    <affinity xpos="0" ypos="1" width="2" height="1" />
    ...
  </start>
  ...
</config>
As illustrated by this example, the numbers used in the declarations for this instance
of init are not directly related to physical CPUs. If the machine has merely two cores,
init’s affinity space would be mapped to the range 0,1 of physical CPUs. However, in a
machine with 16x16 CPUs, the loader would obtain 8x8 CPUs with the upper-left CPU
at position (4,0).
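As a concrete sketch (component names are illustrative), a configuration for such a dual-core machine could dedicate one core to the drivers and the other to an application by combining the two declarations shown above:

<config>
  <affinity-space width="2" height="1" />
  <start name="drivers">
    <affinity xpos="0" ypos="0" width="1" height="1" />
    ...
  </start>
  <start name="app">
    <affinity xpos="1" ypos="0" width="1" height="1" />
    ...
  </start>
</config>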
6.2.7 Priority support
The number of CPU priorities to be distinguished by init can be specified with the
prio_levels attribute of the <config> node. The value must be a power of two. By
default, no priorities are used. To assign a priority to a child process, a priority value
can be specified as priority attribute of the corresponding <start> node. Valid
priority values lie in the range of -prio_levels + 1 (maximum priority degradation) to 0
(no priority degradation).
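The following sketch distinguishes four priority levels and degrades the priority of one child by one level (names and quanta are illustrative):

<config prio_levels="4">
  ...
  <start name="launchpad" priority="-1">
    <resource name="RAM" quantum="32M"/>
    ...
  </start>
  ...
</config>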
6.2.8 Init verbosity
To ease debugging, init can be instructed to print various status information as LOG
output. To enable the verbose mode, assign the value “yes” to the verbose attribute
of the <config> node.
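For example, the following declaration enables init’s verbose mode:

<config verbose="yes">
  ...
</config>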
6.2.9 Executing children in chroot environments on Linux
On the Linux base platform, each process started by init can be assigned to a chroot
environment by specifying the new root location as the root attribute of the corresponding <start> node. Root environments can be nested. The root path of a nested init
instance will be appended to the root path of the outer instance.
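For example, the following <start> node (with an illustrative child name and path) confines a child to the directory /var/chroot/noux:

<start name="noux" root="/var/chroot/noux">
  <resource name="RAM" quantum="16M"/>
  ...
</start>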
When using the chroot mechanism, core will mirror the current working directory
within the chroot environment via a bind-mount operation. This step is needed to
enable the execve system call to obtain the ELF binary of the new process.
In order to use the chroot mechanism when starting Genode’s core as a non-root
user process, the core executable must be equipped with the CAP_SYS_ADMIN and
CAP_SYS_CHROOT capabilities. CAP_SYS_ADMIN is needed for bind mounting.
CAP_SYS_CHROOT is needed to perform the chroot syscall:
sudo setcap cap_sys_admin,cap_sys_chroot=ep core
For an example of using chroot, please refer to the os/run/chroot.run script.
7 Functional specification
TODO
7.1 Parent-child interaction
• Parent interface
• Root interface
7.2 Fundamental data structures
• Basic types, structured types, allocators
• Rationale behind using intrusive data structures
7.3 XML processing
• Parsing
• Generation
7.4 Process execution environment
7.5 Remote procedure calls
7.6 Signals
7.7 Multi-threading and synchronization
7.8 Process management
7.9 Common utilities
7.10 Server API
7.11 Support for user-level device drivers
7.12 Tracing
7.13 C runtime
8 Under the hood
This chapter gives insights into the inner functioning of the Genode OS framework. In
particular, it explains how the concepts explained in Chapter 3 are realized on different
kernels and hardware platforms.
8.1 Component-local startup code and linker scripts
All Genode components including core rely on the same startup code, which is roughly
outlined at the end of Section 3.5. This section revisits the steps in more detail and refers
to the corresponding points in the source code. Furthermore, it provides background
information about the linkage of components, which is closely related to the startup
code.
8.1.1 Linker scripts
Under the hood, the Genode build system uses three different linker scripts located at
repos/base/src/platform/:
genode.ld is used for statically linked components, including the core component,
genode_dyn.ld is used for dynamically linked components, i. e., components that are
linked against at least one shared library,
genode_rel.ld is used for shared libraries.
Additionally, there exists a special linker script for the dynamic linker (Section 8.4).
Each program image generated by the linker generally consists of three parts, which
appear consecutively in the component’s virtual memory.
1. A read-only “text” part contains sections for code, read-only data, and the list of
global constructors and destructors.
The startup code is placed in a dedicated section .text.crt0, which appears
right at the start of the segment. Thereby the link address of the component is
known to correspond to the ELF entrypoint (the first instruction of the assembly
startup code). This is useful when converting the ELF image of the base-hw version of core into a raw binary. Such a raw binary can be loaded directly into the
memory of the target platform without the need for an ELF loader.
The mechanisms for constructing the list of constructors and destructors differ
between CPU architectures and are defined by the architecture’s ABI. On x86,
the lists are represented by .ctors.* and .dtors.*. On ARM, the information
about global constructors is represented by .init_array and there is no visible
information about global destructors.
2. A read-writable “data” part that is pre-populated with data.
3. A read-writable “bss” part that is not physically present in the binary but known
to be zero-initialized when the ELF image is loaded.
The link address is not defined in the linker script but specified as a linker argument. The default link address is specified in a platform-specific spec file, e. g., repos/base-nova/mk/spec-nova.mk for the NOVA platform. Components that need to organize their
virtual address space in a special way (e. g., a virtual machine monitor that co-locates
the guest-physical address space with its virtual address space) may specify link addresses that differ from the default by overriding the LD_TEXT_ADDR value.
ELF entry point As defined at the start of the linker script via the ENTRY directive,
the ELF entrypoint is the function _start. This function is located at the very beginning of the .text.crt0 section. See Section 8.1.2 for more details.
Symbols defined by the linker script The following symbols are defined by the
linker script and used by the base framework.
_prog_img_beg, _prog_img_data, _prog_img_end Those symbols mark the
start of the “text” part, the start of the “data” part (the end of the “text” part), and
the end of the “bss” part. They are used by core to exclude those virtual memory
ranges from the core’s virtual-memory allocator (core-region allocator).
_parent_cap, _parent_cap_thread_id, _parent_cap_local_name Those
symbols are located at the beginning of the “data” part. During the ELF loading
of a new component, the parent writes information about the parent capability
to this location (the start of the first read-writable ELF segment). See the corresponding code in the _setup_elf function in base/src/base/process/process.cc. The
use of the information depends on the base platform. E.g., on a platform where a
capability is represented by a tuple of a global thread ID and an object ID such as
OKL4 and L4ka::Pistachio, the information is taken as verbatim values. On platforms that fully support capability-based security without the use of any form of a
global name to represent a capability, the information remains unused. Here, the
parent capability is represented by the same known local name in all components.
Even though the linker scripts are used across all base platforms, they contain a few
platform-specific supplements that are needed to support the respective kernel ABIs.
For example, the definition of the symbol __l4sys_invoke_indirect is needed
only on the Fiasco.OC platform and is unused on the other base platforms. Please
refer to the comments in the linker script for further explanations.
8.1.2 Startup code
The execution of the initial thread of a new component starts at the ELF entry point,
which corresponds to the _start function. This is an assembly function defined in
repos/base/platform/<arch>/crt0.s where <arch> is the CPU architecture (x86_32, x86_64,
or ARM).
Assembly startup code The assembly startup code is position-independent code
(PIC). Because the Genode base libraries are linked against both statically-linked and
dynamically linked executables, they have to be compiled as PIC code. To be consistent
with the base libraries, the startup code needs to be position-independent, too.
The code performs the following steps:
1. Saving the initial state of certain CPU registers. Depending on the used kernel,
these registers carry information from the kernel to the core component. More
details about this information are provided by Section 8.3.1. The initial register
values are saved in global variables named _initial_<register>. The global
variables are located in the BSS segment. Note that those variables are used solely
by core.
2. Setting up the initial stack. Before the assembly code can call any higher-level C
function, the stack pointer must be initialized to point to the top of a valid stack.
The initial stack is located in the BSS section and referred to by the symbol
_stack_high. However, having a stack located within the BSS section is dangerous. If it overflows (e. g., by declaring large local variables, or by recursive
function calls), the stack would silently overwrite parts of the BSS and DATA
sections located below the lower stack boundary. For code that is known beforehand, the stack
can be dimensioned to a reasonable size. But for arbitrary application code, no
assumption about the stack usage can be made. For this reason, the initial stack
cannot be used for the entire lifetime of the component. Before any component-specific code is called, the stack needs to be relocated to another area of the virtual
address space where the lower bound of the stack is guarded by empty pages.
When using such a “real” stack, a stack overflow will produce a page fault, which
can be handled or at least immediately detected. The initial stack is solely used
to perform the steps needed to set up the real stack. Because those steps are the
same for all components, the usage of the initial stack is bounded.
3. Because the startup code is used by statically linked components as well as the
dynamic linker, the startup code immediately calls the init_rtld hook function. For
regular components, the function does not do anything. The default implementation in repos/base/src/platform/init_main_thread.cc is a weak function. The dynamic
linker provides a non-weak implementation, which allows the linker to perform
initial relocations of itself very early at the dynamic linker’s startup.
4. By calling the init_main_thread function defined in repos/base/src/platform/init_main_thread.cc, the assembly code triggers the execution of all the steps
needed for the creation of the real stack. The function is implemented in C++,
uses the initial stack, and returns the address of the real stack.
5. With the new stack pointer returned by init_main_thread, the assembly
startup code is able to switch the stack pointer from the initial stack to the real
stack. From this point on, stack overflows cannot easily corrupt any data.
6. With the real stack in place, the assembly code finally passes the control over to
the C++ startup code provided by the _main function.
Initialization of the real stack along with the Genode environment As mentioned above, the assembly code calls the init_main_thread function (located in
repos/base/src/platform/init_main_thread.cc) for setting up the real stack for the program.
For placing a stack in a dedicated portion of the component’s virtual address space, the function needs to overcome two principal problems:
• It needs to obtain the backing store used for the stack, i. e., by allocating a dataspace from the component’s RAM session as initialized by the parent.
• It needs to preserve a portion of its virtual address space for placing the stack and
make the allocated memory visible within this portion.
In order to solve both problems, the function needs to obtain capabilities for its RAM
session and RM session from its parent. This comes down to the need for performing
RPC calls. First, for requesting the RM and RAM session capabilities from the parent,
and second, for invoking the session capabilities to perform the RAM allocation and
RM attach operations.
The RPC mechanism is based on C++. In particular, the mechanism supports the
propagation of C++ exceptions across RPC interfaces. Hence, before being able to perform RPC calls, the program must initialize the C++ runtime including the exception
support. The initialization of the C++ runtime, in turn, requires support for dynamically allocating memory. Hence, a heap must be available. This chain of dependencies
ultimately results in the need to construct the entire Genode environment as a side effect of initializing the real stack of the program.
During the construction of the Genode environment (by calling Genode::env()),
the program requests its own RM, RAM, CPU, and PD sessions from its parent, and
initializes its heap (env()→heap()).
With the environment constructed, the program is able to interact with its own RM
and RAM sessions and can, in principle, realize the initialization of the real stack. However, instead of merely allocating a new RAM dataspace and attaching the dataspace to
the RM session, a so-called thread-context area is constructed. The thread-context area
is a secondary RM session that is attached as a dataspace to the component’s actual RM
session (See the description of managed dataspaces in Section 3.4.5). This way, virtualmemory allocations within the thread-context area can be managed manually. I.e., the
spaces between the stacks of different threads are guaranteed to remain free from any
attached dataspaces. For constructing the thread-context area, a new RM session is
created (repos/base/src/base/context_area.cc).
Component-dependent startup code With the Genode environment constructed
and the initial stack switched to a proper stack located in the thread-context area, the
component-dependent startup code of the _main function in repos/base/src/platform/_main.cc can
be executed. This code is responsible for calling the global constructors of the program
before calling the program’s main function.
In accordance with the established signature of the main function, taking an argument
list and an environment as arguments, the startup code supplies these arguments but
uses dummy default values. However, since the values are taken from the global variables genode_argv, genode_argc, and genode_envp, a global constructor is able
to override the default values.
The startup code in _main.cc is accompanied by support for atexit handling. The
atexit mechanism allows for the registration of handlers to be called at the exit of the
program. It is provided in the form of a POSIX API by the C runtime. But it is also
used by the compiler to schedule the execution of the destructors of function-local static
objects. For the latter reason, the atexit mechanism cannot be merely provided by the
(optional) C runtime but must be supported by the base library.
8.2 C++ runtime
Genode is implemented in C++ and relies on all C++ features required to use the language in its idiomatic way. This includes the use of exceptions and runtime-type information.
8.2.1 Rationale behind using exceptions
Compared to return-based error handling as prominently used in C programs, the C++
exception mechanism is much more complex. In particular, it requires the use of a C++
runtime library that is called as a back-end by the exception handling code generated
by the compiler. This library contains the functionality needed to unwind the stack
and a mechanism for obtaining runtime type information (RTTI). The C++ runtime
libraries that come with common tool chains, in turn, rely on a C library for performing
dynamic memory allocations, string operations, and I/O operations. Consequently,
C++ programs that rely on exceptions and RTTI tend to depend on a C library. For this reason, the use of those C++ features is universally disregarded for low-level operating-system code that usually does not run in an environment where a complete C library
is available.
In principle, C++ can be used without exceptions and RTTI (by passing the arguments -fno-exceptions and -fno-rtti to GCC). However, without those features,
it is hardly possible to use the language as designed.
For example, when the operator new is used, it performs two steps: Allocating the
memory needed to hold the to-be-created object and calling the constructor of the object
with the return value of the allocation as this pointer. In the event that the memory
allocation fails, the only way for the allocator to propagate the out-of-memory condition
is throwing an exception. If such an exception is not thrown, the constructor would be
called with null as the this pointer.
Another example is the handling of errors during the construction of an object. The
object construction may consist of several consecutive steps such as the construction of
base classes and aggregated objects. If one of those steps fails, the construction of the
overall object remains incomplete. This condition must be propagated to the code that
issued the object construction. There are two principal approaches:
1. The error condition can be kept as an attribute in the object. After constructing
the object, the user of the object may detect the error condition by requesting the
attribute value. However, this approach is plagued by the following problems.
First, the failure of one step may cause subsequent steps to fail as well. In the
worst case, if the failed step initializes a pointer that is passed to subsequent steps,
the subsequent steps may use an uninitialized pointer. Consequently, the error
condition must eventually be propagated to subsequent steps, which, in turn,
need to be implemented in a defensive way.
Second, if the construction failed, the object exists but it is inconsistent. In the
worst case, if the user of the object fails to check for the successful construc-
tion, it will perform operations on an inconsistent object. But even in the good
case, where the user detects the incomplete construction and decides to immediately destruct the object, the destruction is error prone. The already performed
steps may have had side effects such as resource allocations. So it is important
to revert all the successful steps by invoking their respective destructors. However, when destructing the object, the destructors of the incomplete steps are also
called. Consequently, such destructors need to be implemented in a defensive
manner to accommodate this situation.
Third, objects cannot have references that depend on potentially failing construction steps. In contrast to a pointer that may be marked as uninitialized by being a
null pointer, a reference is, by definition, initialized once it exists. Consequently,
the result of such a step can never be passed as reference to subsequent steps.
Pointers must be used.
Fourth, the mere existence of incompletely constructed objects introduces many
variants of possible failures that need to be considered in the code. There may
be many different stages of incompleteness. Because of the third problem, every
time a construction step takes the result of a previous step as an argument, it explicitly
has to consider the error case. This, in turn, tremendously inflates the test space
of the code.
Furthermore, there needs to be a convention of how the completion of an object
is indicated. All programmers have to learn and follow the convention.
2. The error condition triggers an exception. Thereby, the object construction immediately stops at the erroneous step. Subsequent steps are not executed at all.
Furthermore, while unwinding the stack, the exception mechanism reverts all already completed steps by calling their respective destructors. Consequently, the
construction of an object can be considered as a transaction. If it succeeds, the
object is known to be completely constructed. If it fails, the object immediately
ceases to exist.
Thanks to the transactional semantics of the second variant, the state space for potential error conditions (and thereby the test space) remains small. Also, the second variant
facilitates the use of references as class members, which can be safely passed as arguments to subsequent constructors. When receiving such a reference as argument (as
opposed to a pointer), no validity checks are needed. Consequently, by using exceptions, the robustness of object-oriented code (i. e., code that relies on C++ constructors)
can be greatly improved over code that avoids exceptions.
8.2.2 Bare-metal C++ runtime
Acknowledging the rationale given in the previous section, there is still the problem
of the complexity added by the exception mechanism. For Genode, the complexity of
the trusted computing base is a fundamental metric. The C++ exception mechanism
with its dependency on the C library arguably adds significant complexity. The code
complexity of a C library exceeds the complexity of the fundamental components (such
as the kernel, core, and init) by an order of magnitude. Making the fundamental components depend on such a C library would jeopardize one of Genode’s most valuable
assets, which is its low complexity.
To enable the use of C++ exceptions and runtime type information but avoid the
incorporation of an entire C library into the trusted computing base, Genode comes
with a customized C++ runtime that does not depend on a C library. The C++ runtime
libraries are provided by the tool chain. To build those libraries without a C library, a
libc emulation header (tool/libgcc_libc_stub.h) is used instead of the interface of a real C
library. The emulation header contains only those definitions and declarations needed
by the C++ runtime. The resulting libraries contain references to (some of) the symbols
present in the emulation header. Those symbols are provided by Genode’s C++ support
code (repos/base/src/base/cxx). The implementation of those functions is specifically tied
to the usage patterns of the C++ runtime. Several of the functions are mere dummies.
Unfortunately, the interface used by the C++ runtime does not reside in a specific
namespace but is rather a subset of the POSIX API. When linking a real C library
to a Genode component, the symbols present in the C library would collide with the
symbols present in Genode’s C++ support code. For this reason, the C++ runtime
(of the compiler) and Genode’s C++ support code are wrapped in a single library (repos/base/lib/mk/cxx.mk) in a way that all POSIX functions remain hidden. All the references of the C++ runtime are resolved by the C++ support code, both wrapped in the
cxx library. To the outside, the cxx library solely exports the CXA ABI as required by
the compiler.
8.3 Interaction of core with the underlying kernel
Core is the root of the process tree. It is initialized and started directly by the underlying kernel and has two purposes. First, it makes the low-level physical resources of the
machine available to other components in the form of services. Those resources are the
physical memory, processing time, device resources, initial boot modules, and protection mechanisms (such as the MMU, IOMMU, and virtualization extensions). It thereby
hides the peculiarities of the used kernel behind an API that is uniform across all kernels supported by Genode. Core’s second purpose is the creation of the init component
by using its own services and following the steps described in Section 3.5.
Even though core is executed in user mode, its role as the root of the component tree
makes it as critical as the kernel. It just happens to be executed in a different processor
mode. Whereas regular components solely interact with the kernel when performing
inter-component communication, core interplays with the kernel more intensely. The
following subsections go into detail about this interplay.
The description tries to be general across the various kernels supported by Genode.
Note, however, that a particular kernel may deviate from the general description.
8.3.1 Bootstrapping and allocator setup
At boot time, the kernel passes information about the physical resources and the initial
system state to core. Even though the mechanism and format of this information varies
from kernel to kernel, it generally covers the following aspects:
• A list of free physical memory ranges
• A list of the physical memory locations of the boot modules along with their respective names
• The number of available CPUs
• All information needed to enable the initial thread to perform kernel operations
Core’s allocators Core’s kernel-specific platform initialization code (core/platform.cc)
uses this information to initialize the allocators used for keeping track of physical resources. Those allocators are:
RAM allocator contains the ranges of the available physical memory
I/O memory allocator contains the physical address ranges of unused memory-mapped
I/O resources. In general, all ranges not initially present in the RAM allocator are
considered as I/O memory.
I/O port allocator contains the I/O ports on x86-based platforms that are currently not
in use. This allocator is initialized with the entire I/O port range of 0 to 0xffff.
IRQ allocator contains the IRQs that are not yet associated with IRQ sessions. This allocator
is initialized with the entirety of the available IRQ numbers.
Core-region allocator contains the virtual memory regions of core that are not in use.
The RAM allocator and core-region allocator are subsumed in the so-called core-memory allocator. In addition to aggregating both allocators, the core-memory allocator allows for the allocation of core-local virtual-memory regions that can be used
for holding core-local objects. Each region allocated from the core-memory allocator
has to satisfy four conditions:
1. It must be backed by a physical memory range (as allocated from the RAM allocator)
2. It must have a core-local virtual memory range assigned (as allocated from the
core-region allocator)
3. The physical-memory range has the same size as the virtual-memory range
4. The virtual memory range is mapped to the physical memory range using the
MMU
Internally, the core-memory allocator maintains a so-called mapped-memory allocator
that contains ranges of ready-to-use core-local memory. If a new allocation exceeds the
available capacity, the core-memory allocator expands its capacity by allocating a new
physical memory region from the RAM allocator, allocating a new core-virtual memory
region from the core-region allocator, and installing a mapping from the virtual region
to the physical region.
All memory allocators mentioned above operate at the granularity of physical
pages, i. e., 4 KiB.
The core-memory allocator is expanded on demand but never shrunk. This makes it
unsuitable for allocating objects on behalf of core clients because allocations could not be reverted when closing the session. It is solely used for dynamic memory allocations at startup (e. g., the memory needed for keeping the information about the
boot modules), and for keeping meta data for the allocators themselves.
8.3.2 Kernel-object creation
Kernel objects are objects maintained within the kernel and used by the kernel. The
exact notion of what a kernel object represents depends on the actual kernel as the
various kernels differ with respect to the abstractions they provide. Typical kernel objects are threads and protection domains. Some kernels have kernel objects for memory
mappings whereas others provide page tables as kernel objects. Whereas some kernels
represent scheduling parameters as distinct kernel objects, others subsume scheduling
parameters to threads. What all kernel objects have in common, though, is that they
consume kernel memory. Most kernels of the L4 family preserve a fixed pool of memory for the allocation of kernel objects.
If an arbitrary component was able to perform a kernel operation that triggers the
creation of a kernel object, the memory consumption of the kernel would depend on the
good behavior of all components. A misbehaving component may exhaust the kernel
memory.
To counter this problem, on Genode, only core triggers the creation of kernel objects
and thereby guards the consumption of kernel memory. Note, however, that not all
kernels are able to prevent the creation of kernel objects outside of core.
8.3.3 Page-fault handling
Each time a thread within the Genode system triggers a page fault, the kernel reflects
the page fault along with the fault information as a message to the user-level page-fault
handler residing in core. The fault information comprises the identity and instruction
pointer of the faulted thread, the page-fault address, and the fault type (read, write,
execute). The page-fault handler represents each thread as a so-called pager object. For
handling the page fault, it first looks up the pager object that belongs to the faulting
thread’s identity, analogously to how an RPC entrypoint looks up the RPC object for an
incoming RPC request. Given the pager object, the fault is handled by calling the pager
function with the fault information as argument. This function is implemented by the
so-called Rm_client (repos/base/src/core/rm_session_component.cc), which represents the
association of the pager object with its virtual address space (RM session). Given the
context information about the RM session of the thread, the pager function looks up
the region within the RM session, on which the page fault occurred. The lookup results
in one of the following three cases:
Region is populated with a dataspace If a dataspace is attached at the fault address,
the backing store of the dataspace is determined. Depending on the kernel, the
backing store may be a physical page, a core-local page, or another reference to a
physical memory page. The pager function then installs a memory mapping from
the virtual page where the fault occurred to the corresponding part of the backing
store.
Region is populated with a managed dataspace If the fault occurred within a region where a managed dataspace is attached, the fault handling is forwarded to
the RM session that represents the managed dataspace.
Region is empty If no dataspace could be found at the fault address, the fault cannot
be resolved. In this case, core submits an RM-fault signal to the RM session where
the fault occurred. This way, the RM-session client has the chance to detect and
possibly respond to the fault. Once the signal handler receives a fault signal,
it is able to query the fault address from the RM session. As a response to the
fault, the RM-session client may attach a dataspace at this address. This attach
operation, in turn, will prompt core to wake up the thread (or multiple threads)
that faulted within the attached region. Unless a dataspace is attached at the
page-fault address, the faulting thread remains blocked. If no signal handler for
RM faults is registered for the RM session, core prints a diagnostic message and
blocks the faulting thread forever.
To optimize the TLB footprint and the use of kernel memory, the RM service does not
merely operate at the granularity of memory pages but on address ranges whose size
and alignment are arbitrary power-of-two values (at least as large as the size of the
smallest physical page). The source and destinations of memory mappings may span
many pages. This way, depending on the kernel and the architecture, multiple pages
may be mapped at once, or large page-table mappings can be used.
8.4 Dynamic linker
The dynamic linker is a mechanism for loading ELF binaries that are dynamically linked against shared libraries.
8.4.1 Building dynamically-linked programs
The build system automatically decides whether a program is linked statically or dynamically depending on the use of shared libraries. If the target is linked against at least
one shared library, the resulting ELF image is a dynamically-linked program. Each
dynamically-linked program is implicitly linked against the dynamic linker. Because
the dynamic linker contains all the base libraries and the C++ runtime, the program
itself must be void of those symbols. Otherwise two ambiguous versions of the base
libraries would be present when the program is started. This invariant is enforced by
the build system by stripping all base libraries from the linking step of dynamically-linked programs. The base libraries are defined in repos/base/mk/base-libs.mk.
The entrypoint of a dynamically-linked program is the main function.
8.4.2 Startup of dynamically-linked programs
When creating a new component, the parent first detects whether the to-be-loaded ELF
binary represents a statically-linked program or a dynamically-linked program by inspecting the ELF binary’s program-header information (see repos/base/src/base/elf/). If
the program is statically linked, the parent follows the procedure as described in Section 3.5. If the program is dynamically linked, the parent remembers the dataspace of
the program’s ELF image but starts the ELF image of the dynamic linker instead.
The dynamic linker is a regular Genode component that follows the startup procedure described in Section 8.1.2. However, because of its hybrid nature, it needs to take
special precautions before using any data that contains relocations. Because the dynamic linker is a shared library, it contains data relocations. Even though the linker’s
code is position independent and can principally be loaded to an arbitrary address,
global data objects may contain pointers to other global data objects or code. For example, vtable entries contain pointers to code. Those pointers must be relocated depending on the load address of the binary. This step is performed by the init_rtld hook
function, which was already mentioned in Section 8.1.2. Global data objects must not
be used before calling this function. For this reason, init_rtld is called at the earliest
possible time directly from the assembly startup code. Apart from the call of this hook
function, the startup of the dynamic linker is the same as for statically-linked programs.
The main function of the dynamic linker obtains the binary of the actual dynamically-linked program by requesting a ROM session for the module “binary”. The parent
responds to this request by handing out a locally-provided ROM session that contains
the dataspace with the actual program. Once the linker has obtained the dataspace containing the dynamically-linked program, it loads the program and all required shared
libraries. The dynamic linker requests each shared library as a ROM session from its
parent.
After completing the loading of all ELF objects, the dynamic linker calls the entrypoint of the loaded binary, which is the program’s main function. Note that the symbol
main remains ambiguous as both the dynamic linker and the loaded program have a
main function.
8.4.3 Address-space management
To load the binary and the associated shared libraries, the linker does not directly attach
dataspaces to its RM session. Instead, it manages a dedicated part of the component’s
virtual address space manually by attaching a large managed dataspace to its RM session. This way, the linker can precisely control the layout within the virtual-address
range covered by the managed dataspace. This control is needed because the loading
of an ELF object does not correspond to an atomic attachment of a single dataspace
but it involves consecutive attach operations for multiple dataspaces, one for each ELF
segment. When attaching one segment, the linker must make sure that there is enough
space beyond the segment to host the next segment. The use of a managed dataspace
allows the linker to manually allocate big-enough portions of virtual memory and populate them in multiple steps.
8.5 Execution on bare hardware (base-hw)
The code specific to the base-hw platform is located within the repos/base-hw/ directory.
In the following description, unless explicitly stated otherwise, all paths are relative to
this directory.
In contrast to classical L4 microkernels where Genode’s core process runs as user-level roottask on top of the kernel, base-hw executes Genode’s core directly on the
hardware with no distinct kernel underneath. Core and kernel are melted into one
hybrid kernel/userland program. Only a few code paths are executed in privileged
mode but most code runs in user mode. This design has several benefits. First, the
kernel part becomes much simpler. For example, there are no allocators needed in the
kernel part because allocators are managed by the user-level part of core. Second, base-hw side-steps long-standing hard kernel-level problems, in particular the management
of kernel resources. For the allocation of kernel objects, the hybrid core/kernel can employ Genode’s user-level resource trading concepts as described in Section 3.3. Finally
and most importantly, merging the kernel with roottask removes a lot of redundancies between both programs. Traditionally, both kernel and roottask perform the bookkeeping of physical-resource allocations and track the existence of kernel objects such
as address spaces and threads. In base-hw, those data structures exist only once. The
complexity of the combined kernel/core is significantly lower than the sum of the complexities of a traditional self-sufficient kernel and a distinct roottask on top. This way,
base-hw helps to make Genode’s TCB less complex.
The following subsections detail the problems that base-hw had to address to become
a self-sufficient base platform for Genode.
8.5.1 Bootstrapping of base-hw
A Genode-based system consists of potentially many boot modules. But boot loaders
on ARM platforms usually support only the loading of a single system image. Hence,
base-hw requires a concept for merging boot modules together with the core/kernel
into a single image.
System-image assembly In contrast to other base platforms where core is a self-sufficient program, on the base-hw platform, core is actually built as a library. The library
description file is specific for each platform and located at lib/mk/platform_<pf>/core.mk
where <pf> corresponds to the used hardware platform. It includes the platform-agnostic repos/base-hw/lib/mk/core.inc file. The library contains everything core needs
(including the C++ runtime, the kernel code, and the user-level core code) except for
the following symbols:
_boot_modules_headers_begin and _boot_modules_headers_end Between
those symbols, core expects an array of boot-module header structures. A boot-module header contains the name, core-local address, and size of a boot module.
This meta data is used by core’s initialization code in platform.cc to populate the
ROM service with modules.
_boot_modules_binaries_begin and _boot_modules_binaries_end
Between those symbols, core expects the actual module data. This range is outside the core image (beyond _prog_img_end). In contrast to the boot-module
headers, the modules reside in a separate section that remains unmapped within
core’s virtual address space. Only while access to a boot module is needed by
core (i. e., the ELF binary of init during the creation of the init component), core
makes the module visible within its virtual address space.
Making the boot modules invisible to core has two benefits. First, the integrity of the
boot modules does not depend on core. Even in the presence of a bug in core,
the boot modules cannot be accidentally overwritten. Second, there are no page-table entries needed to map the modules into the virtual address space of core.
This is particularly beneficial when using large boot modules such as a complete
disk image. If base-hw incorporated such a large module into the core image,
page-table entries for the entire disk image would need to be allocated at the
initialization time of core.
Those symbols are defined in an assembly file called boot_modules.s. If building
core stand-alone, the final linking stage combines the core library with the dummy
boot_modules.s file located at src/core/boot_modules.s. But when using the run tool (Section 5.5) to integrate a bootable system image, the run tool dynamically generates a
version of boot_modules.s depending on the boot modules listed in the run script and
repeats the final linking stage of core by combining the core library with the generated
boot_modules.s file. The generated file is placed at <build-dir>/var/run/<scenario>/ and
incorporates the boot modules using the assembler’s .incbin directive. The result of
the final linking stage is an executable ELF binary that contains both core and the boot
modules.
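To illustrate the role of the boot-module headers, the following stand-alone sketch iterates over an array of header structures and prints one line per module. The structure layout, the module names, and the addresses shown here are assumptions made for this sketch only; in core, the array boundaries would be given by the linker symbols _boot_modules_headers_begin and _boot_modules_headers_end instead of a plain array.

  /* Sketch with a hypothetical header layout and made-up example values */
  #include <cstdio>

  struct Bm_header
  {
      char const   *name;  /* module name                  */
      unsigned long base;  /* core-local address of module */
      unsigned long size;  /* module size in bytes         */
  };

  /* stand-in for the data emitted by a generated boot_modules.s */
  static Bm_header const headers[] = {
      { "init",   0x80100000UL, 0x40000UL },
      { "config", 0x80140000UL, 0x00400UL },
  };

  int main()
  {
      for (Bm_header const &h : headers)
          printf("ROM module '%s' at 0x%lx (%lu bytes)\n",
                 h.name, h.base, h.size);
      return 0;
  }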
Startup of the base-hw kernel Core on base-hw uses Genode’s regular linker script.
Like any regular Genode component, its execution starts at the _start symbol. But
unlike a regular component, core is started by the boot loader as a kernel in privileged mode. Instead of directly following the startup procedure described in Section
8.1.2, base-hw uses custom startup code that initializes the kernel part of core first.
The startup code is located at src/core/spec/arm/crt0.s. It eventually calls the kernel initialization code in src/core/kernel/kernel.cc. Core’s regular C++ startup code (the _main
function) is executed by the first user-level thread created by the kernel (see the thread
setup in the init_kernel_mp_primary function).
8.5.2 Kernel entry and exit
The execution model of the kernel can be roughly characterized as a single-stack kernel. In contrast to traditional L4 kernels that maintain one kernel thread per user thread,
the base-hw kernel is a mere state machine that never blocks in the kernel. State transitions are triggered by user-level threads that enter the kernel via a system call, by
device interrupts, or by a CPU exception. Once entered, the kernel applies the state
change depending on the event that caused the kernel entry, and then leaves the kernel, returning
to the user land. The transition between user and kernel mode depends on the revision of the ARM architecture. For ARMv7, the corresponding code is located at src/core/spec/arm_v7/mode_transition.s.
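The single-stack execution model can be pictured as one dispatch step per kernel entry: determine the cause of the entry, apply the corresponding state change, and return to the user land. The following stand-alone sketch models this idea with invented names; it is not the actual base-hw code.

  /* Stand-alone model of a kernel that never blocks: each entry applies
   * one state change and immediately returns to user mode. */
  #include <cstdio>

  enum class Entry_cause { SYSCALL, INTERRUPT, EXCEPTION };

  struct Cpu_state { unsigned long regs[16]; };   /* saved user context */

  static void handle_syscall  (Cpu_state &) { puts("syscall");   }
  static void handle_interrupt(Cpu_state &) { puts("interrupt"); }
  static void handle_exception(Cpu_state &) { puts("exception"); }

  /* one kernel entry - no blocking, no per-thread kernel stack */
  void kernel_entry(Entry_cause cause, Cpu_state &state)
  {
      switch (cause) {
      case Entry_cause::SYSCALL:   handle_syscall(state);   break;
      case Entry_cause::INTERRUPT: handle_interrupt(state); break;
      case Entry_cause::EXCEPTION: handle_exception(state); break;
      }
      /* here, control would return to the user-level thread picked
       * by the scheduler */
  }

  int main()
  {
      Cpu_state state { };
      kernel_entry(Entry_cause::SYSCALL,   state);
      kernel_entry(Entry_cause::INTERRUPT, state);
      return 0;
  }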
8.5.3 Interrupt handling and preemptive multi-threading
In order to respond to interrupts, base-hw has to contain a driver for the interrupt controller. ARM-based SoCs greatly differ with respect to the used interrupt
controllers. The interrupt-controller driver for a particular SoC can be found at
src/core/include/spec/<spec>/pic.h and the corresponding src/core/spec/<spec>/pic.cc where
<spec> refers to a particular platform (e. g., imx53) or an IP block that is used across
different platforms (e. g., arm_gic for ARM’s generic interrupt controller). Each of the
drivers implements the same interface. When building core, the build system uses the
build-spec mechanism explained in Section 5.3 to incorporate the single driver needed
for the targeted SoC.
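Because the driver is selected at build time, all drivers merely have to agree on a common set of operations. A hypothetical shape of such an interface is sketched below; the class and method names are invented and do not mirror the actual pic.h files.

  /* Hypothetical interrupt-controller interface with a dummy driver */
  #include <cstdio>

  class Pic
  {
      public:

          virtual ~Pic() { }

          /* return true and the IRQ number if an interrupt is pending */
          virtual bool take_request(unsigned &irq) = 0;

          /* signal the end of handling the given interrupt */
          virtual void finish_request(unsigned irq) = 0;

          /* mask or unmask a single interrupt line */
          virtual void mask(unsigned irq)   = 0;
          virtual void unmask(unsigned irq) = 0;
  };

  /* dummy stand-in for one SoC-specific implementation */
  class Dummy_pic : public Pic
  {
      public:

          bool take_request(unsigned &irq) override { irq = 5; return true; }
          void finish_request(unsigned)    override { }
          void mask(unsigned)              override { }
          void unmask(unsigned)            override { }
  };

  int main()
  {
      Dummy_pic pic;
      unsigned irq = 0;
      if (pic.take_request(irq))
          printf("pending interrupt %u\n", irq);
      pic.finish_request(irq);
      return 0;
  }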
To support preemptive multi-threading, base-hw requires a hardware timer. The
timer is programmed with the timeslice length of the currently executed thread. Once
the programmed timeout elapses, the timer device generates an interrupt that is handled
by the kernel. Similarly to interrupt controllers, there exist a variety of different timer
devices on ARM-based SoCs. Therefore, base-hw contains different timer drivers. The
timer drivers are located at src/core/include/spec/<spec>/timer.h where <spec> refers to the
timer variant.
The in-kernel handler of the timer interrupt invokes the thread scheduler (src/core/include/kernel/cpu_scheduler.h). The scheduler maintains a list of so-called scheduling contexts where each context refers to a thread. Each time the kernel is entered, the scheduler is updated with the passed duration. When updated, it takes a scheduling decision
by making the next to-be-executed thread the head of the list. At the kernel exit, the
control is passed to the user-level thread that corresponds to the head of the scheduler
list.
8.5.4 Split kernel interface
The system-call interface of the base-hw kernel is split in two parts. One part is usable
by all components and solely contains system calls for inter-component communication and thread synchronization. The definition of this interface is located at include/kernel/interface.h. The second part is exposed only to core. It supplements the public
interface with operations for the creation, the management, and the destruction of kernel objects.
The distinction between both parts of the kernel interface is enforced by the function
Thread::_call in src/core/kernel/thread.cc.
8.5.5 Public part of the kernel interface
Threads do not run independently but interact with each other via synchronous inter-component communication as detailed in Section 3.6. Within base-hw, this mechanism
is referred to as IPC (for inter-process communication). To allow threads to perform
calls to other threads or to receive RPC requests, the kernel interface is equipped with
system calls for performing IPC (send_request_msg, await_request_msg, send_reply_msg).
To keep the kernel as simple as possible, IPC is performed using so-called user-level
thread-control blocks (UTCB). Each thread has a corresponding memory page that is
always mapped in the kernel. This UTCB page is used to carry IPC payload. The
largely simplified procedure of transferring a message is as follows. (In reality, the state
space is more complex because the receiver may not be in a blocking state when the
sender issues the message.)
1. The user-level sender marshals its payload into its UTCB and invokes the kernel,
2. The kernel transfers the payload from the sender’s UTCB to the receiver’s UTCB
and schedules the receiver,
3. The receiver retrieves the incoming message from its UTCB.
Because all UTCBs are always mapped in the kernel, no page faults can occur during
the second step. This way, the flow of execution within the kernel becomes predictable
and always returns to the user land.
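The three steps can be modelled with two fixed-size buffers standing in for the UTCBs of the sender and the receiver. Because the kernel merely copies between two always-mapped pages, the copy itself cannot raise a page fault. The sketch below is a stand-alone model with invented names and sizes.

  /* Model of UTCB-based message transfer; names and sizes are invented */
  #include <cstring>
  #include <cstdio>

  enum { UTCB_SIZE = 4096 };

  struct Utcb { char buf[UTCB_SIZE]; };   /* one page per thread */

  /* step 1: the sender marshals its payload into its own UTCB */
  void marshal(Utcb &sender, char const *msg)
  {
      strncpy(sender.buf, msg, UTCB_SIZE - 1);
      sender.buf[UTCB_SIZE - 1] = 0;
  }

  /* step 2: the kernel copies between two always-mapped pages,
   *         so this copy can never produce a page fault */
  void kernel_transfer(Utcb const &sender, Utcb &receiver)
  {
      memcpy(receiver.buf, sender.buf, UTCB_SIZE);
  }

  /* step 3: the receiver retrieves the message from its UTCB */
  void unmarshal(Utcb const &receiver)
  {
      printf("received: %s\n", receiver.buf);
  }

  int main()
  {
      Utcb client { }, server { };
      marshal(client, "session request");
      kernel_transfer(client, server);
      unmarshal(server);
      return 0;
  }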
In addition to IPC, threads interact via the synchronization primitives provided
by the Genode API. To implement these portions of the API, the kernel provides
system calls for managing the execution control of threads (pause_current_thread, resume_local_thread, yield_thread).
To support asynchronous notifications as described in Section 3.6.2, the kernel
provides system calls for the submission and reception of signals (await_signal, signal_pending, submit_signal, and ack_signal) as well as the life-time management of signal
contexts (kill_signal_context). In contrast to other base platforms, Genode’s signal API is
directly supported by the kernel so that the propagation of signals does not require any
interaction with core’s SIGNAL service (Section 3.4.10). However, the creation of signal
contexts is arbitrated by the SIGNAL service. This way, the kernel objects needed for
the signalling mechanisms are accounted to the corresponding clients of the SIGNAL
service.
8.5.6 Core-private part of the kernel interface
The core-private part of the kernel interface allows the user-level part of core to perform
privileged operations. Note that even though the kernel and core are executed in different CPU modes (privileged mode and user mode), both parts share a single address
space and ultimately trust each other. The kernel is regarded as a mere support library
of core that executes those functions that can only be executed in the privileged CPU
mode. In particular, the kernel does not perform any allocation. Instead, the allocation
of kernel objects is performed as an interplay between core and the kernel.
1. Core allocates physical memory from its physical-memory allocator. Most kernel-object allocations are performed in the context of one of core’s services. Hence,
those allocations can be properly accounted to a session quota (Section 3.3). This
way, kernel objects allocated on behalf of core’s clients are “paid for” by those
clients.
2. Core allocates virtual memory to make the allocated physical memory visible
within core and the kernel.
3. Core invokes the kernel to construct the kernel object at the location specified by
core. This kernel invocation is actually a system call that enters the kernel via the
kernel-entry path.
4. The kernel initializes the kernel object at the virtual address specified by core and
returns to core via the kernel-exit path.
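The division of labour behind these four steps can be pictured as follows: core provides the memory and the virtual address whereas the kernel merely constructs the object in place. The sketch below models this interplay with placement new; all names are invented, and allocation, mapping, and quota accounting are reduced to a plain heap allocation.

  /* Model of the core/kernel interplay for kernel-object creation */
  #include <new>       /* placement new */
  #include <cstdio>
  #include <cstdlib>
  #include <cstddef>

  struct Kernel_thread                    /* some kernel object */
  {
      unsigned id;
      Kernel_thread(unsigned id) : id(id) { }
  };

  /* steps 1 and 2 (core): allocate and map memory for the object */
  void *core_alloc_object_memory(std::size_t size)
  {
      return malloc(size);   /* stands in for RAM + RM allocation */
  }

  /* steps 3 and 4 (kernel): construct the object at the given address */
  Kernel_thread *kernel_new_thread(void *at, unsigned id)
  {
      return new (at) Kernel_thread(id);
  }

  int main()
  {
      void *mem = core_alloc_object_memory(sizeof(Kernel_thread));
      Kernel_thread *t = kernel_new_thread(mem, 7);
      printf("kernel object constructed with id %u\n", t->id);
      t->~Kernel_thread();
      free(mem);
      return 0;
  }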
The core-private kernel interface consists of the following operations:
• The creation and destruction of protection domains (new_pd and bin_pd), invoked
by the PD service
• The creation, manipulation, and destruction of threads (new_thread, bin_thread,
start_thread, resume_thread, access_thread_regs, and route_thread_event), used by the
CPU service and the core-specific back end of the Genode::Thread API
• The creation and destruction of signal receivers and signal contexts (new_signal_receiver,
bin_signal_receiver, new_signal_context, and bin_signal_context), invoked by the
SIGNAL service
8.5.7 Scheduler of the base-hw kernel
CPU scheduling in traditional L4 microkernels is based on static priorities. The scheduler always picks the runnable thread with the highest priority for execution. If multiple
threads share one priority, the kernel schedules those threads in a round-robin fashion.
While being fast and easy to implement, this scheme has disadvantages: First,
there is no way to prevent high-prioritized threads from starving lower-prioritized
ones. Second, CPU time cannot be granted to threads and passed between them by
means of quota. To cope with these problems without much loss of performance,
base-hw employs a custom scheduler that deviates from the traditional approach.
The base-hw scheduler introduces the distinction between high-throughput-oriented
scheduling contexts - called fills - and low-latency-oriented scheduling contexts - called
claims. Examples for typical fills would be the processing of a compiler job or the rendering computations of a sophisticated graphics program. They shall obtain as much
CPU time as the system can spare but there is no demand for a high responsiveness.
In contrast, an example for the claim category would be a typical GUI-software stack
covering the control flow from user-input drivers through a chain of GUI components
to the drivers of the graphical output. Another example is a user-level device driver
that must quickly respond to sporadic interrupts but is otherwise untrusted. The low
latency of such components is a key factor for usability and quality of service. Besides
introducing the distinction between claim and fill scheduling contexts, base-hw introduces the notion of a so-called super period, which is a multiple of typical scheduling
time slices, e. g., one second. The entire super period corresponds to 100% of the CPU
time of one CPU. Portions of it can be assigned to scheduling contexts. A CPU quota
thereby corresponds to a percentage of the super period.
At the beginning of a super period, each claim has its full amount of assigned CPU
quota. The priority defines the absolute scheduling order within the super period
among those claims that are active and have quota left. As long as there exist such
claims, the scheduler stays in the claim mode and the quota of the scheduled claims
decreases. At the end of a super period, the quota of all claims is replenished to the
initial value. Every time the scheduler can’t find an active claim with CPU-quota left, it
switches to the fill mode. Fills are scheduled in a simple round-robin fashion with identical time slices. The progression of the super period does not affect the scheduling order
and time slices of this mode. The concept of quota and priority that is implemented
through the claim mode aligns nicely with Genode’s way of hierarchical resource management: Through CPU sessions, each process can assign portions of its
CPU time and subranges of its priority band to its children without knowing the global
meaning of CPU time or priority.
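The scheduling decision can be condensed into a small piece of logic: prefer the highest-prioritized active claim that still has quota within the current super period, and fall back to round-robin scheduling among the fills otherwise. The following stand-alone sketch models this decision only; it is not the actual cpu_scheduler.h implementation and omits the quota replenishment at the super-period boundary.

  /* Simplified model of the claim/fill decision; not the real scheduler */
  #include <vector>
  #include <cstdio>
  #include <cstddef>

  struct Context
  {
      char const *name;
      unsigned    prio;        /* meaningful for claims only     */
      unsigned    quota_left;  /* CPU quota left in super period */
      bool        active;      /* ready to run                   */
      bool        is_claim;    /* claim (latency) vs. fill       */
  };

  Context *schedule(std::vector<Context> &cs, unsigned &rr_pos)
  {
      /* claim mode: highest-priority active claim with quota left */
      Context *best = nullptr;
      for (Context &c : cs)
          if (c.is_claim && c.active && c.quota_left > 0)
              if (!best || c.prio > best->prio)
                  best = &c;
      if (best) return best;

      /* fill mode: round-robin among the active fills */
      for (std::size_t i = 0; i < cs.size(); i++) {
          Context &c = cs[(rr_pos + i) % cs.size()];
          if (!c.is_claim && c.active) {
              rr_pos = (unsigned)((rr_pos + i + 1) % cs.size());
              return &c;
          }
      }
      return nullptr;          /* idle */
  }

  int main()
  {
      std::vector<Context> cs = {
          { "gui",      5, 2, true, true  },   /* claim */
          { "driver",   8, 1, true, true  },   /* claim */
          { "compiler", 0, 0, true, false },   /* fill  */
      };
      unsigned rr = 0;
      for (int i = 0; i < 5; i++) {
          Context *c = schedule(cs, rr);
          printf("run %s\n", c ? c->name : "idle");
          if (c && c->is_claim) c->quota_left--;   /* consume quota */
      }
      return 0;
  }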
8.5.8 Sparsely populated core address space
Even though core has the authority over all physical memory, it has no immediate access to the physical pages. Whenever core requires access to a physical memory page,
it first has to explicitly map the physical page into its own virtual memory space. This
way, the virtual address space of core stays clean from any data of other components.
Even in the presence of a bug in core (e. g., a dangling pointer), information cannot
accidentally leak between different protection domains because the virtual memory of
other components is not visible to core.
8.5.9 Multi-processor support of base-hw
On uniprocessor systems, the base-hw kernel is single-threaded. Its execution model
corresponds to a mere state machine. On SMP systems, it maintains one kernel thread
and one scheduler per CPU core. Access to kernel objects gets fully serialized by one
global spin lock that is acquired when entering the kernel and released when leaving
the kernel. This keeps the use of multiple cores transparent to the kernel model, which
greatly simplifies the code compared to traditional L4 microkernels. Given that the kernel is a simple state machine providing lightweight non-blocking operations, there is
little contention for the global kernel lock. Even though this claim may not hold up
when scaling to a large number of cores, current ARM-based platforms can be accommodated well.
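The principle of one global lock that is taken on kernel entry and released on kernel exit can be expressed with a single atomic flag. The following sketch is a generic illustration of this technique, not the base-hw lock implementation.

  /* Generic illustration of one global kernel lock around short,
   * non-blocking kernel operations */
  #include <atomic>
  #include <thread>
  #include <cstdio>

  static std::atomic_flag kernel_lock = ATOMIC_FLAG_INIT;

  void kernel_enter() { while (kernel_lock.test_and_set(std::memory_order_acquire)) { } }
  void kernel_exit()  { kernel_lock.clear(std::memory_order_release); }

  static int shared_kernel_state = 0;

  void cpu(int id)
  {
      for (int i = 0; i < 1000; i++) {
          kernel_enter();            /* serialize all kernel code paths */
          shared_kernel_state++;     /* short, non-blocking operation   */
          kernel_exit();
      }
      printf("cpu %d done\n", id);
  }

  int main()
  {
      std::thread a(cpu, 0), b(cpu, 1);
      a.join(); b.join();
      printf("state %d\n", shared_kernel_state);
      return 0;
  }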
Cross-CPU inter-component communication Regarding synchronous and asynchronous inter-processor communication - thanks to the global kernel lock - there is no
semantic difference to the uniprocessor case. The only difference is that on a multiprocessor system, one processor may change the schedule of another processor by unblocking one of its threads (e. g., when an RPC call is received by a server that resides
on a different CPU than the client). This condition may rescind the current scheduling
choice of the other processor. To avoid lags in this case, the kernel sends an inter-processor
interrupt (IPI) to the unaware target processor. The targeted processor can
respond to the IPI by taking the decision to schedule the receiving thread. As the IPI
sender doesn’t have to wait for an answer, the sending and receiving CPUs remain
largely decoupled. There is no need for a complex IPI protocol between both.
TLB shootdown With respect to the synchronization of core-local hardware, there
are two different situations to deal with. Some hardware components like most ARM
caches and branch predictors implement their own coherence protocol and thus need
adaptation in terms of configuration only. Others, like the TLBs, lack this feature. When,
for instance, a page-table entry becomes invalid, the invalidation of the affected TLB entries
must be performed locally by each core. To signal the necessity of TLB maintenance
work, an IPI is sent to all other cores. Once all cores have completed the invalidation, the thread that
invoked the TLB invalidation resumes its execution.
8.5.10 Limitations of the base-hw platform
The base-hw kernel does not (yet) support the model of kernel-protected capabilities as
described in Section 3.1. All kernel objects are referred to via global IDs. On this platform, a capability is represented as a tuple of a thread ID and a global object ID. Capability delegation is realized as a plain copy of those values. Since any thread can specify
arbitrary global thread IDs and object IDs when performing RPC calls, capability-based
access control remains ineffective.
Note that the lack of kernel-protected capabilities is not an inherent limitation
of the design of base-hw but a temporary limitation owed to the current stage of
the kernel’s ongoing development.
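The consequence of representing capabilities as plain values can be seen directly: a pair of global IDs can be fabricated by anyone who is able to name them. The following snippet is a schematic illustration only, not the actual base-hw data type.

  /* Schematic illustration: a capability as a plain pair of global IDs
   * can be copied - or guessed - freely and therefore cannot serve as
   * an access-control token. Not the actual base-hw type. */
  #include <cstdio>

  struct Cap
  {
      unsigned thread_id;   /* global ID of the target thread */
      unsigned object_id;   /* global ID of the RPC object    */
  };

  int main()
  {
      Cap delegated { 42, 7 };   /* received via legitimate delegation  */
      Cap forged    { 42, 7 };   /* fabricated by simply naming the IDs */

      printf("delegated (%u,%u) and forged (%u,%u) are indistinguishable\n",
             delegated.thread_id, delegated.object_id,
             forged.thread_id,    forged.object_id);
      return 0;
  }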
8.6 Execution on the NOVA microhypervisor (base-nova)
NOVA is a so-called microhypervisor, denoting the combination of a microkernel and a
virtualization platform (hypervisor). It is a high-performance microkernel for the x86
architecture. In contrast to other microkernels, it has been designed for hardware-based virtualization via user-level virtual-machine monitors. In line with Genode’s
architecture, NOVA’s kernel interface is based on capability-based security. Hence, the
kernel fully supports the model of a Genode kernel as described in Section 3.1.
NOVA website
http://hypervisor.org
NOVA kernel-interface specification
https://github.com/udosteinberg/NOVA/raw/master/doc/specification.pdf
8.6.1 Integration of NOVA with Genode
The NOVA kernel is available via Genode’s ports mechanism described in Section 5.4.
The port description is located at repos/base-nova/ports/nova.port.
Building the NOVA kernel Even though NOVA is a third-party kernel with a custom
build system, the kernel is built directly from the Genode build system. NOVA’s build
system remains unused.
From within a Genode build directory configured for one of the nova_x86_32 or
nova_x86_64 platforms, the kernel can be built via
make kernel
The build description for the kernel is located at repos/base-nova/src/kernel/target.mk.
System-call bindings NOVA is not accompanied by bindings to its kernel interface. Only a description of the kernel interface is available in the form of the kernel specification. For this reason, Genode maintains the kernel bindings for NOVA
within the Genode source tree. The bindings are located at repos/base-nova/include/ in
the subdirectories nova/, 32bit/nova/, and 64bit/nova/.
8.6.2 Bootstrapping of a NOVA-based system
After finishing its initialization, the kernel starts the first boot module (after the kernel)
as root task. The root task is Genode’s core. The virtual address space of core contains
the text and data segments of core, the UTCB of the initial EC, and the hypervisor info
page (HIP). Details about the HIP are provided in Section 6 of the NOVA specification.
BSS section of core The kernel’s ELF loader does not support the concept of a BSS
segment. It simply maps the physical pages of core’s text and data segments into the
virtual memory of core but does not allocate any additional physical pages for backing the BSS. For this reason, the NOVA version of core does not use the genode.ld
linker script as described in Section 8.1.1 but the linker script located at repos/base-nova/src/platform/roottask.ld. This version hosts the BSS section within the data segment.
Thereby, the BSS is physically present in the core binary in the form of zero-initialized
data.
Initial information provided by NOVA to core The kernel passes a pointer to the HIP
to core as the initial value of the ESP register. Genode’s startup code saves this value in
the global variable _initial_sp (Section 8.1.2).
8.6.3 Log output on modern PC hardware
Because transmitting information over the legacy comports does not require complex
device drivers, serial output over comports is still the predominant way to output low-level system logs like kernel messages or the output of core’s LOG service.
Unfortunately, most modern PCs lack dedicated comports. This leaves two options
to obtain low-level system logs.
1. The use of vendor-specific platform-management features such as Intel VPro /
Intel Advanced Management Technology (AMT) or Intel Platform Management
Interface (IPMI). These platform features are able to emulate a legacy comport and
provide the serial output over the network. Unfortunately, those solutions are not
uniform across different vendors, difficult to use, and tend to be unreliable.
2. The use of a PCI card or an Express Card that provides a physical comport. When
using such a device, the added comport appears as a PCI I/O resource. Because
the device interface is compatible with the legacy comports, no special drivers are
needed.
The latter option allows the retrieval of low-level system logs on hardware that lacks
special management features. In contrast to the legacy comports, however, it has the
minor disadvantage that the location of the device’s I/O resources is not known beforehand.
The I/O port range of the comport depends on the device-enumeration procedure of
the BIOS. To enable the kernel to output information over this comport, the kernel must
be configured with the I/O port range as assigned by the BIOS on the specific machine.
One kernel binary cannot simply be used across different machines.
The Bender chain boot loader To alleviate the need to adapt the kernel configuration to the used comport hardware, the bender chain boot loader can be used.
Bender is part of the MORBO tools
https://github.com/TUD-OS/morbo
Instead of starting the NOVA hypervisor directly, the multi-boot-compliant boot loader
(such as GRUB) starts bender as the kernel. All remaining boot modules including the
real kernel have already been loaded into memory by the original boot loader. Bender
scans the PCI bus for a comport device. If such a device is found (e. g., an Express Card),
it writes the information about the device’s I/O port range to a known offset within the
BIOS data area (BDA).
After the comport-device probing is finished, bender passes control to the next boot
module, which is the real kernel. The comport device driver of the kernel does not use
a hard-coded I/O port range for the comport but looks up the comport location from
the BDA. The use of bender is optional. When not used, the BDA always contains the
I/O port range of the legacy comport 1.
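The comport-1 entry of the BDA resides at offset 0x400 in low physical memory, following the standard x86 convention, so such a lookup presumably amounts to reading a single 16-bit value. The following sketch illustrates this; for the sake of a self-contained example, a simulated low-memory area stands in for the real one.

  /* Sketch: look up the comport I/O base from the BIOS data area (BDA) */
  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  uint16_t comport_io_base(uint8_t const *low_mem)
  {
      /* 16-bit I/O-port base stored by the BIOS or overwritten by bender */
      uint16_t port;
      memcpy(&port, low_mem + 0x400, sizeof(port));
      return port;
  }

  int main()
  {
      static uint8_t low_mem[0x500] = { };            /* simulated low memory */
      low_mem[0x400] = 0xf8; low_mem[0x401] = 0x03;   /* 0x3f8 = comport 1    */

      printf("comport at I/O port 0x%x\n", comport_io_base(low_mem));
      return 0;
  }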
The Genode source tree contains a pre-compiled binary of bender at tool/boot/bender.
This binary is automatically incorporated into boot images for the NOVA base platform
when the run tool (Section 5.5) is used.
8.6.4 Relation of NOVA’s kernel objects to Genode’s core services
For the terminology of NOVA’s kernel objects, refer to the NOVA specification mentioned in the introduction of Section 8.6. A brief glossary for the terminology used in
the remainder of this section is given in Table 1.

  NOVA term   Meaning
  PD          Protection domain
  EC          Execution context (thread)
  SC          Scheduling context
  HIP         Hypervisor information page
  IDC         Inter-domain call (RPC call)
  portal      Communication endpoint

Table 1: Glossary of NOVA’s terminology
NOVA capabilities are not Genode capabilities Both NOVA and Genode use the
term “capability”. However, the term does not have the same meaning in both contexts. A
Genode capability refers to an RPC object or a signal context. In the context of NOVA,
a capability refers to a NOVA kernel object. To avoid confusing both meanings of the
term, Genode refers to NOVA’s term as “capability selector”, or simply “selector”.
PD service A PD session corresponds to a NOVA PD.
CPU service NOVA distinguishes so-called global ECs from local ECs. A global EC
can be equipped with CPU time by associating it with an SC. It can perform IDC calls
but it cannot receive IDC calls. In contrast to a global EC, a local EC is able to receive
IDC calls but it has no CPU time. A local EC is not executed until it is called by another
EC.
A regular Genode thread is a global EC. A Genode entrypoint is a local EC. Core
distinguishes both cases based on the instruction-pointer (IP) argument of the CPU
session’s start function. For a local EC, the IP is set to zero.
RAM and IO_MEM services Core’s RAM and IO_MEM allocators are initialized
based on the information found in NOVA’s HIP.
ROM service Core’s ROM service provides all boot modules as ROM modules. Additionally, NOVA’s HIP is provided as a ROM module named “hypervisor_info_page”.
CAP service A Genode capability corresponds to a NOVA portal. Each NOVA portal
has a defined IP and an associated local EC (the Genode entrypoint). The invocation
of a Genode capability is an IDC call to a portal. A Genode capability is delegated by
passing its corresponding portal selector as IDC argument.
IRQ service NOVA represents each interrupt as a semaphore. Within core, there is
one entrypoint per IRQ session. When wait_for_irq is called, the called IRQ entrypoint blocks on its corresponding IRQ semaphore. In the kernel, this semaphore-down
operation implicitly unmasks the interrupt at the CPU.
When the interrupt occurs, the kernel masks the interrupt at the CPU and performs
the semaphore-up operation on the IRQ’s semaphore. Thereby, it wakes up the IRQ
entrypoint, which replies to the wait_for_irq RPC call.
RM service The RM service is used for the page-fault handling as explained in Section 8.6.5. Each memory mapping installed in a component implicitly triggers the allocation of a node in the kernel’s mapping database.
8.6.5 Page-fault handling on NOVA
On NOVA, each EC has a defined range of portal selectors. For each type of exception,
the range has a dedicated portal that is entered in the event of an exception. The page-fault portal of a Genode thread is defined at the creation time of the thread and points
to a dedicated pager EC within core. Hence, for each Genode thread, there exist two
ECs. One in the PD where the thread executes and the pager EC in core.
The operation of pager ECs When an EC triggers a page fault, the faulting EC implicitly performs an IDC call to its pager. The IDC message contains the fault information. On NOVA, there is a one-to-one relationship between a pager EC and Genode’s
pager object. For resolving the page fault, core follows the procedure described in Section 8.3.3.
If the lookup for a dataspace within the faulter’s RM session succeeds, core establishes a
memory mapping into the EC’s PD by sending a so-called map item as reply to the page
fault message. In the case where the region lookup within the thread’s corresponding
RM session fails, the pager EC blocks on a semaphore. Because the page-fault message
remains unanswered, the faulting thread is effectively put on halt. In the event that the
RM fault is resolved by an RM client as described in the paragraph “Region is empty”
of Section 8.3.3, the blocking on the semaphore gets released and the pager EC is able to
reply to the original page-fault message. However, the reply does not immediately establish a memory mapping. Instead, the faulter will immediately trigger another fault
at the same address. This time, however, the region lookup succeeds.
Mapping database NOVA tracks memory mappings in a data structure called mapping database and has the notion of the delegation of memory mappings (rather than the
delegation of memory access). Memory access can be delegated only if the originator of
the delegation has a mapping. Core is the only exception because it can establish mappings originating from the physical memory space. Because mappings can be delegated
transitively between PDs, the mapping database is a tree where each node denotes the
delegation of a mapping. The tree is maintained in order to enable the kernel to revoke
the authority. When a mapping is revoked, the kernel implicitly revokes all transitive
mappings that originated from the revoked node.
Because of this design, core needs to maintain a core-local memory mapping for each
memory mapping established outside of core. This mapping is solely needed to revoke
the memory mapping later on, for example, when a dataspace is detached from an RM
session. The kernel’s revoke operation takes the core-local address as argument and
revokes all mappings originating from this mapping node.
8.6.6 IOMMU support
As discussed in Section 4.1.3, misbehaving device drivers may exploit DMA transactions to circumvent their component boundaries. When executing Genode on the
NOVA microhypervisor, however, bus-master DMA is subjected to the IOMMU.
The NOVA kernel applies a subset of the (MMU) address space of a protection domain to the (IOMMU) address space of a device. So the device’s address space can be
managed in the same way as one normally manages the address space of a PD. The
only missing link is the assignment of device address spaces to PDs. This link is provided by the dedicated system call assign_pci that takes a PD capability selector and a
device identifier as arguments. The PD capability selector represents the authorization
over the protection domain, which is going to be targeted by DMA transactions. The
device identifier is a virtual address where the extended PCI configuration space of the
device is mapped in the specified PD. Only if a user-level device driver has access to
the extended PCI configuration space of the device is it able to put the assignment in
place.
To make NOVA’s IOMMU support available to Genode components, the ACPI driver
has the ability to hand out the extended PCI configuration space of a device, and a
NOVA-specific extension (assign_pci) to the PD session interface can be used to associate
a PCI device with a protection domain.
Even though these mechanisms combined principally suffice to let drivers operate
with the IOMMU enabled, in practice, the situation is a bit more complicated. Because NOVA uses the same virtual-to-physical mappings for the device as it uses for
the process, the DMA addresses the driver needs to supply to the device must be virtual addresses rather than physical addresses. Consequently, to be able to make a device driver usable on systems without IOMMU as well as on systems with IOMMU, the
driver needs to become IOMMU-aware and distinguish both cases. This is an unfortunate consequence of the otherwise elegant mechanism provided by NOVA. To relieve
the device drivers from caring about both cases, Genode decouples the virtual address
space of the device from the virtual address space of the driver. The former address
space is represented by a Genode component called device PD. Its sole purpose is to
hold mappings of DMA buffers that are accessible by the associated device. By using
one-to-one physical-to-virtual mappings for those buffers within the device PD, each
device PD contains a subset of the physical address space. The ACPI driver performs
the assignment of device PDs to PCI devices. If a device driver intends to use DMA,
it allocates a new DMA buffer for a specific PCI device at the ACPI driver. The ACPI
driver responds to such a request by allocating a RAM dataspace at core, attaching it to
the device PD using the dataspace’s physical address as virtual address, and handing
out the dataspace capability to the client. If the driver requests the physical address of
the dataspace, the returned address will be a valid virtual address in the associated device PD. From this design, it follows that a device driver must allocate DMA buffers at the
ACPI server (specifying the PCI device the buffer is intended for) instead of using core’s
RAM service to allocate buffers anonymously. Note that the current implementation of
the ACPI server assigns all PCI devices to only one device PD.
8.6.7 Genode-specific modifications of the NOVA kernel
NOVA is not fit to be used as Genode base platform as is. This section compiles the
modifications that were needed to meet the functional requirements of the framework.
All modifications are maintained at the following repository:
Genode’s version of NOVA
https://github.com/alex-ab/NOVA.git
The repository contains a separate branch for each version of NOVA that had been used
for Genode. When preparing the NOVA port using the port description at repos/basenova/ports/nova.port, the NOVA branch that matches the used Genode version is checked
out automatically. The port description refers to a specific commit ID. The commit history of each branch within the NOVA repository corresponds to the history of the original NOVA kernel followed by a series of Genode-specific commits. Each time NOVA
is updated, a new branch is created and all Genode-specific commits are rebased on the
history of the new NOVA version. This way, the differences between the original NOVA
kernel and the Genode version remain clearly documented. The Genode-specific modifications solve the following problems:
Destruction of kernel objects
NOVA does not support the destruction of kernel objects. I.e., PDs and ECs can
be created but not destroyed. With Genode being a dynamic system, kernel-object
destruction is a mandatory feature.
Inter-processor IDC
On NOVA, only local ECs can receive IDC calls. Furthermore, each local EC
is bound to a particular CPU (hence the name “local EC”). Consequently, synchronous inter-component communication via IDC calls is possible only between
ECs that both reside on the same CPU but can never cross CPU boundaries. Unfortunately, IDC is the only mechanism for the delegation of capabilities. Consequently, authority cannot be delegated between subsystems that reside on different CPUs. For Genode, this scheme is too rigid.
Therefore, the Genode version of NOVA introduces inter-CPU IDC calls. When
calling an EC on another CPU, the kernel creates a temporary EC and SC on the
targeted CPU as a representative of the caller. The calling EC is blocked. The
temporary EC uses the same UTCB as the calling EC. Thereby, the original IDC
message is effectively transferred from one CPU to the other. The temporary EC
then performs a local IDC to the destination EC using NOVA’s existing IDC mechanism. Once the temporary EC receives the reply (with the reply message contained in the caller’s UTCB), the kernel destroys the temporary EC and SC and
unblocks the caller EC.
Support for priority-inheriting spinlocks
Genode’s lock mechanism relies on a yielding spinlock for protecting the lock
meta data. On most base platforms, there exists the invariant that all threads of
one component share the same CPU priority. So priority inversion within a component cannot occur. NOVA breaks this invariant because the scheduling parameters (SC) are passed along IDC call chains. Consequently, when a client calls a
server, the SCs of both client and server reside within the server. These SCs may
have different priorities. The use of a naive spinlock for synchronization will produce priority inversion problems. The kernel has been extended with the mechanisms needed to support the implementation of priority-inheriting spinlocks in
the userland.
Combination of capability delegation and translation
As described in Section 3.1.4, there are two cases when a capability is specified as
an RPC argument. The callee may already have a capability referring to the specified object identity. In this case, the callee expects to receive the corresponding
local name of the object identity. In the other case, when the callee does not yet
have a capability for the object identity, it obtains a new local name that refers to
the delegated capability.
NOVA does not support this mechanism per se. When specifying a capability
selector as map item for an IDC call, the caller has to specify whether a new mapping should be created or the translation of the local names should be performed
by the kernel. However, in the general case, this question is not decidable by the
caller. Hence, NOVA had to be changed to take the decision depending on the
existence of a valid translation for the specified capability selector.
8.6.8 Known limitations of NOVA
This section summarizes the known limitations of NOVA and the NOVA version of
core.
Fixed amount of kernel memory NOVA allocates kernel objects out of a memory
pool of a fixed size. The pool is dimensioned in the kernel’s linker script nova/src/hypervisor.ld (at the symbol _mempool_f). The existence of a fixed pool implies
that any component that is able to trigger allocations in the kernel is able to indirectly consume kernel resources. A misbehaving component in possession of
its own PD capability selector may even forcefully trigger the exhaustion of the
entire pool and thereby make the kernel unavailable. I.e., the kernel panics when
running out of memory. The kernel provides no mechanism to mitigate such a
resource-exhaustion-based denial-of-service attack.
On Genode, only core explicitly allocates kernel objects, which alleviates the problem but does not solve it. In order to create a kernel object, a PD capability selector
must be specified to the respective system call. Since PD capability selectors are
never delegated to the outside of core, kernel objects cannot be directly created by
arbitrary components. The creation of kernel objects is rather a side effect of the
use of core’s services. Thereby, core is principally in the position to restrict the use
of kernel memory per client. However, such an accounting for kernel memory is
not performed by the NOVA version of core.
In addition to the explicit creation of kernel objects, kernel memory is implicitly
allocated when mapping nodes are inserted into the kernel’s mapping database.
Thereby, kernel memory is consumed as a side effect of IDC calls that carry map
items. Since ECs of the same PD can perform IDC calls between one another,
the allocation of mapping nodes can be artificially stressed by delegating a large
number of mappings within the same PD via successive IDC calls.
Therefore, components are principally able to launch denial-of-service attacks on
the kernel. In the event of an exhaustion of kernel memory, the kernel stops the
system. Hence, even though the lack of proper management of kernel memory is
an availability risk, it cannot be exploited as a covert storage channel.
The maximum number of threads is limited by core’s thread-context area NOVA’s
page-fault handling protocol works completely synchronously. When a page fault
occurs, the faulting EC enters its page-fault portal and thereby activates the corresponding pager EC in core. If the pager’s lookup for a matching dataspace
within the faulter’s RM session succeeds, the page fault is resolved by delegating
a memory mapping as the reply to the page-fault IDC call. However, if a page
fault occurs on a managed dataspace, the pager cannot resolve it immediately.
The resolution must be delayed until the RM fault handler (outside of core) responds to the RM fault signal. In order to enable core to serve page faults of other
threads in the meantime, each thread has its dedicated pager EC in core.
Each pager EC consumes a thread context within core. Since core’s thread-context
area is limited, the maximum number of ECs within core is limited too. Because
one core EC is needed as pager for each thread outside of core, the available thread
contexts within core become a limited resource shared by all CPU-session clients.
Because each Genode component is a client of core’s CPU service, this bounded
resource is effectively shared among all components. Consequently, the allocation
of threads on NOVA’s version of core represents a possible covert storage channel.
Bounded number of object capabilities within core For each capability created via
core’s CAP service, core allocates the corresponding NOVA portal and maintains
the portal’s capability selector during the lifetime of the associated object identity.
Each allocation of a capability via core’s CAP service consumes one entry in core’s
capability space. Because the space is bounded, clients of the CAP service could
misuse core’s capability space as covert storage channel.
Core must retain mappings to all memory used throughout the system As mentioned in Section 8.6.5, core needs to own a mapping node before delegating the
mapping to another PD as a response to a page fault. Otherwise, core could not
revoke the mapping later on because the kernel expects core’s mapping node as a
proof for the authorization for the revocation of the mapping.
Consequently, even though core never touches memory handed out to other
components, it needs to have memory mappings with full access rights installed
within its virtual address space. Therefore, core on NOVA cannot benefit from a
sparsely populated address space as described in Section 8.5.8 for base-hw.
Non-executable (NX) bit on 32-bit machines
NOVA supports the NX bit on 64-bit machines. In the x86 32-bit architecture,
however, there exists no NX bit unless the x86 Physical Address Extension (PAE)
is enabled. Because NOVA does not support PAE, the NX bit remains unsupported when using NOVA on 32-bit machines.