Grant Agreement No.: 317814
IRATI: Investigating RINA as an Alternative to TCP/IP
Instrument: Collaborative Project
Thematic Priority: FP7-ICT-2011-8

D3.1 First phase integrated RINA prototype over Ethernet for a UNIX-like OS

Due date of the report: Month 11
Actual date: 18th December, 2013
Start date of project: January 1st, 2013 - Duration: 24 months
Version: v1.0

Project co-funded by the European Commission in the 7th Framework Programme (2007-2013)

Dissemination Level
PU  Public
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

FP7 Grant Agreement No.: 317814
Project Name: Investigating RINA as an Alternative to TCP/IP (IRATI)
Document Name: IRATI D3.1
Document Title: First phase integrated RINA prototype over Ethernet for a UNIX-like OS
Workpackage: WP3
Authors: Francesco Salvestrini (Nextworks), Nicola Ciulli (Nextworks), Eduard Grasa (i2CAT), Miquel Tarzan (i2CAT), Leonardo Bergesio (i2CAT), Sander Vrijders (iMinds), Dimitri Staessens (iMinds)
Editor: Francesco Salvestrini (Nextworks)
Reviewers: Eduard Grasa (i2CAT), Dimitri Staessens (iMinds)
Delivery Date: 18th December 2013
Version: V1.0

Abstract

This deliverable presents the first phase integrated prototype of the IRATI project. The prototype implements the core parts of a RINA stack over Ethernet for a Linux-based OS; it spans both kernel and user space and provides different software packages implementing the various functionalities.

The kernel-space components mainly lie on the fast path, implementing the forwarding functionality through the EFCP, RMT, PDU Forwarding Table, normal IPC Process and shim IPC Process (i.e. the shim IPC Process over Ethernet) components. These components are bound together by the KIPCM (the Kernel IPC Manager), the KFA (the Kernel Flow Allocation Manager) and the RNL (the RINA Netlink Layer), which also implement the kernel/user interface.

The kernel part of the IRATI stack works jointly with its counterpart in user space, which provides the remaining functionalities through a well-defined set of user-space libraries and OS processes. The libraries wrap the kernel-space APIs (syscalls and Netlink messages) and provide additional functionalities, namely to:

- Allow applications to use RINA natively, enabling them to allocate and deallocate flows, read and write SDUs to these flows, and register/unregister to one or more DIFs.
- Facilitate the IPC Manager in performing the tasks related to IPC Process creation, deletion and configuration.
- Allow the IPC Process to configure the PDU Forwarding Table, to create and delete EFCP instances, to request the allocation of kernel resources to support a flow, etc.

These C/C++ libraries allow IRATI adopters to develop native RINA applications. Language bindings to interpreted languages (e.g. Java) are also made available by wrapping the exported symbols with the target language's native interface (e.g. JNI). Upon these bindings, OS daemons such as the IPC Process and the IPC Manager have been developed and are ready to be used for testing and experimentation purposes.
The prototype provides frameworks for configuration, building and development tasks. These frameworks follow common and well-established practices, with particular emphasis on usability. The configuration framework automatically adapts to different Linux-based systems (e.g. Debian, Fedora, Ubuntu). The building framework allows building the whole stack unattended. The software development framework provides a RAD (rapid application development) environment.

However, the phase 1 prototype has known limitations which are currently being addressed and will be resolved in the next prototypes. They can be summarised as follows:

- Only flows between adjacent IPC Processes are supported, since the implementation of the link-state routing specification provided in D2.1 is planned for phase 2.
- Only unreliable flows are supported, since the Data Transfer Control Protocol (DTCP) is planned for phase 2.
- The prototype uses simple policies, i.e. only the basic functional configuration of most components.

This document presents the relevant choices, protocols and methodologies on software design, development, integration and testing agreed among the partners for the scope of the first phase prototype implementation. The software meets both the feature and stability requirements for experimentation and has been released to WP4 (MS7). The core functionalities available in the presented prototype will be further enhanced in the upcoming project prototypes, which will be made available in the next periods as part of deliverables D3.2 and D3.3.

TABLE OF CONTENTS

Abstract
Acronyms
1 Introduction
2 Software design
2.1 The kernel space
2.1.1 The object model
2.1.1.1 The Linux kobjects
2.1.1.2 The IRATI objects
2.1.1.2.1 Spinlocks, non-interruptible contexts and object creation
2.1.2 The framework
2.1.3 The personalities and the kernel/user interface
2.1.4 The stack core
2.1.4.1 KFA, the Kernel Flow Allocation Manager
2.1.4.2 KIPCM, the Kernel IPC Manager
2.1.4.3 RNL, the RINA Netlink Layer
2.1.4.4 The IPC Process factories
2.1.4.4.1 The IPC Process factories interface
2.1.5 The IPC Processes
2.1.5.1 The IPC Processes interfaces
2.1.5.1.1 The normal IPC Process interface
2.1.5.1.2 The shim IPC Process interfaces
2.1.5.2 EFCP, RMT and the PDU Forwarding Table (in the normal IPC Process)
2.1.5.2.1 The egress DUs workflow
2.1.5.2.2 The ingress DUs workflow
2.1.6 ARP826
2.1.6.1 ARP826 software design
2.1.6.1.1 The Generic Protocol Addresses and Generic Hardware Addresses
2.1.6.1.2 The ARP826 Core
2.1.6.1.3 The ARP826 Maps and Tables
2.1.6.1.4 The ARP826 Address Resolution Module
2.1.6.1.5 The ARP826 RX/TX
2.1.7 RINARP
2.1.7.1 The RINARP API
2.1.8 The Shim IPC Process over Ethernet
2.1.9 The Shim Dummy IPC Process
2.2 The user space
2.2.1 librina
2.2.1.1 Bindings to other programming languages
2.2.1.2 Detailed Librina software design
2.2.2 rinad
2.2.2.1 Package components and build framework
2.2.2.2 RINA Daemons software design
2.3 Package inter-dependencies
3 The development and release model
3.1 The git workflow
3.1.1 Repository branches
3.1.2 Internal software releases
3.1.3 Issues management
4 The environments
4.1 The build environments
4.1.1 The kernel-space configuration and building
4.1.2 The user-space configuration and building
4.2 The development and testing environments
5 Installing the software
5.1 Installing the kernel
5.2 Installing the librina package
5.2.1 Bootstrapping the package
5.2.2 Prerequisites for the configuration
5.2.3 Installation procedure
5.3 Installing the rinad package
5.3.1 Bootstrapping the package
5.3.2 Prerequisites for the configuration
5.3.3 Installation procedure
5.4 Helper for librina and rinad installation
5.5 Layout of folders and files in the installation directory
6 Loading and running the stack
7 Testing the stack
7.1 VirtualBox configuration
7.2 VMs configuration
7.3 IPC Manager configuration file
7.4 Running the test
8 Conclusions and future work
9 References

LIST OF FIGURES

Figure 1 IRATI high-level software architecture
Figure 2 The IRATI kernel-space stack software architecture
Figure 3 KFA Interactions (details)
Figure 4 KIPCM interactions (details)
Figure 5 RNL and Netlink layers in the IRATI prototype
Figure 6 RNL interaction (detailed)
Figure 7 IPCP Factory, IPCP instance and IPCP interactions
Figure 8 The components interactions during read/write operations
Figure 9 The egress EFCP / RMT workflow
Figure 10 The egress DUs workflow
Figure 11 The ingress EFCP / RMT workflow
Figure 12 The ingress DUs workflow
Figure 13 Interactions of the ARP826 parts
Figure 14 High-level architecture of the user-space parts
Figure 15 High-level librina design
Figure 16 SWIG wrapping example
Figure 17 The IRATI user-space stack software architecture
Figure 18 librina detailed software design
Figure 19 Alba RINA stack high level software design
Figure 20 IPC Manager Daemon detailed software design
Figure 21 IPC Process Daemon detailed software design
Figure 22 Packages runtime inter-dependencies
Figure 23 Packages build-time inter-dependencies
Figure 24 The module loading/unloading steps
Figure 25 Testing environment
Acronyms

ARP      Address Resolution Protocol
CDAP     Common Distributed Application Protocol
CEP      Connection End Point
CLI      Command Line Interface
COW      Copy On Write
DIF      Distributed IPC Facility
DTCP     Data Transfer Control Protocol
DTP      Data Transfer Protocol
DVCS     Distributed Version Control System
EFCP     Error and Flow Control Protocol
FIDM     Flow-IDs Manager
FIFO     First In First Out
FS       File System
GHA      Generic Hardware Address
GPA      Generic Protocol Address
GPB      Google Protocol Buffers
GUI      Graphical User Interface
HW       Hardware
IDD      Inter-DIF Directory
IPC      Inter-Process Communication
IPCP     IPC Process
JAR      Java Archive
JNI      Java Native Interface
JVM      Java Virtual Machine
KFA      Kernel Flow Allocation (Manager)
KIPCM    Kernel IPC Manager
LAN      Local Area Network
LOC      Lines Of Code
MS       Milestone
MTU      Maximum Transmission Unit
NAT      Network Address Translation
NI       Native Interface
NIC      Network Interface Card
NL       Netlink
OO       Object Oriented
OOD      Object Oriented Design
OOP      Object Oriented Programming
OS       Operating System
PDU      Protocol Data Unit
PIM      Port / IPC Process instance Mapping
RFC      Request For Comments
RIB      Resource Information Base
RINARP   RINA ARP (adaptation layer)
RMT      Relay and Multiplexing Task
RNL      RINA Netlink (abstraction) Layer
SCM      Software Configuration Management
SDU      Service Data Unit
SPR      Software Problem Report
SW       Software
SWIG     Simplified Wrapper and Interface Generator
UI       User Interface
VA       Virtual Appliance
VB       VirtualBox
VLAN     Virtual LAN
VM       Virtual Machine
VMI      Virtual Machine Image

1 Introduction

The software components required for the first phase IRATI project prototype, as described in the high-level software architecture deliverable (ref. D2.1), have been designed, developed, integrated and functionally tested in a virtualised environment. The resulting prototype implements the core components of a RINA stack over Ethernet for a Linux-based OS.

Although the design and development activities progressed without major problems or deviations, the release of the first phase software prototype was slightly delayed, mainly because the Linux ARP implementation proved unsuitable for the Shim IPC Process over Ethernet. This required unplanned development work, which delayed the prototype release date.

The software prototype meets both the feature and stability requirements for experimentation. Its core functionalities will be further enhanced during the next project phases and will be made available in the next periods as part of deliverables D3.2 and D3.3.

This document presents the relevant choices, protocols and methodologies on software design, development, integration and testing agreed among the partners. It also includes installation and operation instructions for the prototype.

The deliverable is structured as follows. Section 2 presents design and development details, as well as updates to the high-level software architecture introduced to overcome the problems encountered during development. Section 3 presents both the development model agreed among the partners and the release model agreed between WP3 and WP4.
Section 4 presents the environments for the development, building and testing of both kernel- and user-space components. Sections 5 and 6 provide instructions on installing and loading the IRATI stack, respectively. Section 7 provides instructions on performing tests using the whole stack, including the Shim IPC Process over Ethernet, in a dual-machine virtualised environment. Finally, section 8 concludes the document, presenting the work planned for the next prototypes.

2 Software design

This section presents the major updates to the high-level software design of the IRATI stack components with respect to the design presented in deliverable D2.1 [3], which describes the high-level software architecture.

Figure 1 presents the high-level layout of the IRATI stack components, which are detailed in the following sections. The Kernel Flow Allocation Manager (KFA), RINARP and ARP826 components depicted in the figure were introduced into the IRATI kernel-space software architecture in order to overcome problems discovered during the software integration phase. The KFA takes over part of the KIPCM functionalities described in D2.1 by managing the kernel-space flow-allocation mechanisms and the interactions with the IPC Process (IPCP) instances, mainly solving concurrency-related problems. The RINARP and ARP826 components, instead, have been introduced to overcome limitations of the current Linux ARP implementation.

Figure 1 IRATI high-level software architecture

The IRATI stack will continue to evolve during the whole project lifetime, and the software design presented in this document will be updated accordingly. The next prototypes may change the design presented in this document significantly. In that case, the planned WP3 deliverables D3.2 and D3.3 will report such updates.

2.1 The kernel space

The following subsections cover the design decisions which led to the present kernel-space software architecture, as well as the internals of the kernel components. Figure 2 provides a detailed picture of the kernel parts referenced in Figure 1.

Figure 2 The IRATI kernel-space stack software architecture

2.1.1 The object model

Kernel-space software must comply with more restrictions and different requirements compared to well-known user-space software. As an example, some of the requirements to take care of when designing kernel-space software can be summarised as follows:

- Only a constrained set of programming languages can be used: kernel-level software must be written either in assembly language or in C. This constraint implies that only a reduced set of Object Oriented (OO) techniques can be applied: there is no way to bind actions to language constructs to implement ad-hoc object constructors, operators cannot be overloaded, etc.
- Kernel contexts have timing constraints: interrupt handling procedures impose strict timings that constrain the handlers as well as all the derived code executing in the same context. Code in such contexts has to react in a timely fashion and cannot call functions that might sleep (such as deferred memory allocation routines).
- Different synchronisation and concurrency techniques: the Linux kernel is a cooperative environment where "live" entities can be implemented as tasklets, kthreads, workqueues, etc. The synchronisation semantics among these entities differ from those of user-space implementations in their finer granularity (e.g. spinlocks, futexes, mutexes, semaphores) and in their constraints (e.g. spinlocks cannot be locked recursively).

In order to overcome the possible limitations of the environment, reduce the problems that may be caused by incrementally introducing features during the entire project lifetime, and keep code refactoring to a minimum, an ad-hoc Object Oriented Design (OOD) approach was adopted throughout all the kernel-space components of the IRATI stack.

Object Oriented Programming (OOP) approaches applied to low-level languages such as C are not new, and a large body of literature describing different techniques and methodologies is available in the public domain [28]. These approaches, however, usually deal with OOP in user space and have to be suitably reworked in order to be applied in kernel space. Some of these techniques are already embraced in the Linux kernel (and were applied to the IRATI implementation accordingly), while the remaining ones were introduced during the development phase where needed.

The adopted OOD techniques can be summarised as follows (a sketch is given after this list):

- Information hiding: implemented by means of forward declarations of the object's type in its header file, while the complete definition is embedded into the corresponding compilation module (i.e. using opaque pointers [29]).
- Constructors and destructors: implemented by means of ad-hoc, per-object functions mimicking (C++) constructors and destructors. Statically allocated objects are initialised and finalised through _init() and _fini() functions, while dynamically allocated ones are created and destroyed through _create() and _destroy() functions.
- Classes, methods, polymorphism and inheritance: classes are implemented through the use of opaque data pointers, while function pointers stored in the corresponding object, coupled with an ad-hoc API, provide the object's (public) methods. Polymorphism and inheritance can be obtained by appropriately arranging these pointers and enforcing common naming constraints.

By applying such techniques, the internal implementation of a stack component can be hidden and its interface "exported" as a set of function pointers and an opaque data pointer (pointing to the component's internal state). That interface can be dynamically bound to other components by storing these function and data pointers into the targets, received through the use of an ancillary function.

With the general adoption of the aforementioned approaches throughout the stack, and by using the runtime dynamic linking features of the kernel (i.e. module loading and unloading), features can be dynamically added to and removed from the stack at runtime. This avoids the need to recompile or change the core of the stack each time the internal implementation of a component changes, which is especially important for the shim IPC Processes: since they strongly depend on the underlying technology (e.g. Ethernet, WiFi), their implementations vary widely. This way, new technologies can be easily integrated into the IRATI stack.
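As an illustration only (the identifiers below are hypothetical and not taken from the IRATI sources), the following sketch shows how the three techniques combine in plain C: an opaque type, _create()/_destroy() functions, and a function-pointer table enabling polymorphic dispatch:

    #include <stddef.h>

    /* Forward declaration only: the layout stays hidden in the .c file */
    struct ipcp_instance;

    /* "Class" methods, expressed as a table of function pointers */
    struct ipcp_instance_ops {
            int (*sdu_write)(void * data, const void * sdu, size_t size);
            int (*destroy)(void * data);
    };

    struct ipcp_instance {
            void *                           data; /* opaque internal state   */
            const struct ipcp_instance_ops * ops;  /* bound at creation time  */
    };

    /* Polymorphic call: the caller never knows the concrete IPCP type */
    static inline int ipcp_sdu_write(struct ipcp_instance * i,
                                     const void *           sdu,
                                     size_t                 size)
    { return i->ops->sdu_write(i->data, sdu, size); }

A shim over Ethernet and a normal IPC Process would each fill in their own ops table at _create() time; callers dispatch through the table and remain oblivious to the implementation behind it.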
The following section introduces the current Linux kernel object model and briefly discusses the downsides of its applicability within the IRATI prototype environment, justifying the introduction of a custom object model. The section after that details the IRATI object model implementation.

2.1.1.1 The Linux kobjects

The Linux kobjects are an OO abstraction used in the Linux kernel. They initially represented the glue holding the device model and its sysfs interface [32] together. Nowadays their use is widespread over the entire kernel, and they constitute its major OOD technique.

A kobject is a structure (i.e. struct kobject). Kobjects have a name, a reference count, a parent pointer (allowing kobjects to be arranged into hierarchies), a specific type and, optionally, a representation in the sysfs virtual filesystem. Their representation is the following (as of kernel version v3.10.0):

    struct kobject {
            const char          * name;
            struct list_head      entry;
            struct kobject      * parent;
            struct kset         * kset;
            struct kobj_type    * ktype;
            struct sysfs_dirent * sd;
            struct kref           kref;
            unsigned int          state_initialized        : 1;
            unsigned int          state_in_sysfs           : 1;
            unsigned int          state_add_uevent_sent    : 1;
            unsigned int          state_remove_uevent_sent : 1;
            unsigned int          uevent_suppress          : 1;
    };

Where, most noticeably:

- name: the symbolic object name, used both for naming (i.e. lookups) and for sysfs presentation.
- entry and parent: used for object (re-)parenting.
- sd: used for sysfs presentation (i.e. the sysfs directory).
- kref: used for reference counting.
- ktype: the type associated with a kobject. It controls what happens when the kobject is no longer referenced, and it also drives the kobject's default representation in sysfs.
- kset: the basic container type for collections of kobjects. It allows their homogeneous handling and eases their integration into sysfs (a kset can be represented almost automatically as a sysfs directory; generally, each entry in that directory corresponds to a kobject in the same kset).

Kobjects are generally not interesting on their own; they are usually embedded within some other structure which contains the information the code is really interested in. They can be seen as a top-level abstract class from which the rest of the classes are derived in an OO approach. Refer to [27] for further details.

Despite its wide use within the Linux kernel, the kobject abstraction has noticeable downsides when applied to the IRATI implementation: it makes implicit use of concurrency semantics (i.e. spinlocks), embeds reference counting (i.e. kref), implies object naming (i.e. the "name" field), forces loose typing, etc. The resulting model, initially tailored to hold the bindings of the kernel device tree, is heavyweight and cannot be efficiently adapted to the IRATI stack, since its components have different binding schemas, locking semantics and memory models. The object model presented in the next section removes the unnecessary details from kobjects while keeping the interesting ones, in order to obtain a lightweight, high-performance implementation tailored to the needs of the IRATI prototype.
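For completeness, the usual embedding pattern looks as follows: the surrounding structure is recovered from the kobject pointer with the kernel's container_of() macro (the struct and field names here are illustrative only):

    #include <linux/kobject.h>

    /* A hypothetical structure embedding a kobject */
    struct foo_device {
            struct kobject kobj;   /* embedded: name, refcount, sysfs entry */
            int            status; /* the data the code actually cares about */
    };

    /* Given the kobject (e.g. passed to a sysfs callback), recover the
     * containing foo_device via pointer arithmetic on the member offset */
    static struct foo_device * to_foo_device(struct kobject * kobj)
    {
            return container_of(kobj, struct foo_device, kobj);
    }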
2.1.1.2 The IRATI objects

As introduced in the high-level software architecture [3], all the kernel-space parts of the IRATI stack have been developed following these conventions, in order to keep the code clean and manageable:

- Use of opaque pointers (i.e. struct obj * in function arguments) to hide the object's internal representation from its users.
- The object interface exposes functions mimicking constructors (obj_init() and obj_create()) and destructors (obj_fini() and obj_destroy()), depending on whether the objects are statically or dynamically allocated.
- The object interface exposes methods that can be applied to objects of type struct obj *. When polymorphism is needed, these methods are implemented as function pointers.

The final representation, expressed for a generic object 'obj', can be summarised with the following interface:

    struct obj;

    int          obj_init(struct obj * o, <parms>);
    int          obj_fini(struct obj * o);

    struct obj * obj_create(<parms>);
    int          obj_destroy(struct obj * o);

    int          obj_method_1(struct obj * o, <parms>);
    ...
    int          obj_method_n(struct obj * o, <parms>);

2.1.1.2.1 Spinlocks, non-interruptible contexts and object creation

In order to fulfil the kernel constraints, the object-oriented approach summarised in the previous section was further enhanced to be suitable even for non-interruptible contexts. In general, functions that may sleep cannot be used in non-interruptible contexts, in order to avoid soft-lockups. The problem is most noticeable with interruptible memory allocations executed while holding spinlocks. The following code snippet reproduces a typical soft-lockup problem:

    a_type_t * t;
    spinlock_t s;

    spin_lock(&s);
    ...
    t = kmalloc(sizeof(*t), GFP_KERNEL); /* may sleep while the lock is held */
    ...
    spin_unlock(&s);
    ...

To resolve the aforementioned problem, the memory allocation has to be forced to be non-interruptible (i.e. the GFP_KERNEL flag must be replaced with GFP_ATOMIC). Therefore, the previously introduced object creation "method" (i.e. _create) has been differentiated at its very root and split as follows:

    struct obj * obj_create_gfp(gfp_t flags, <parms>);
    struct obj * obj_create(<parms>);
    struct obj * obj_create_ni(<parms>);

Where:

- obj_create_gfp(): provides the actual implementation of the object creation, taking as input the flags that control the priority of the memory allocation operations, that is, whether the memory allocation functions can sleep. This function is not (usually) exported in the object's API but is declared "static" and hidden in the software module (i.e. the ".c" file).
- obj_create(): has to be used in interruptible contexts. It calls obj_create_gfp() passing the GFP_KERNEL flag (allowing memory allocation operations to sleep).
- obj_create_ni(): must be used in non-interruptible contexts. It calls obj_create_gfp() passing the GFP_ATOMIC flag (preventing memory allocation operations from sleeping).
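A minimal sketch of how this pattern is typically realised (illustrative type, names and fields, not the actual IRATI code):

    #include <linux/slab.h>

    struct obj { int state; };

    /* Hidden in the .c file: the single allocation path, parameterised on
     * the GFP flags that decide whether the allocator may sleep */
    static struct obj * obj_create_gfp(gfp_t flags)
    {
            struct obj * o = kzalloc(sizeof(*o), flags);
            if (!o)
                    return NULL;
            o->state = 0;
            return o;
    }

    /* Interruptible contexts: the allocator may sleep */
    struct obj * obj_create(void)    { return obj_create_gfp(GFP_KERNEL); }

    /* Non-interruptible contexts (e.g. under a spinlock): never sleeps */
    struct obj * obj_create_ni(void) { return obj_create_gfp(GFP_ATOMIC); }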
2.1.2 The framework

All the components of the kernel-space IRATI stack rely on a base framework composed of the following parts:

- rmem: implements a common memory-management layer that provides additional features over the basic primitives available for dynamic memory allocation and de-allocation (i.e. kmalloc, kzalloc and kfree). These features provide additional debugging functionalities, such as memory tampering detection (adding pre-defined shims on top of and at the bottom of the memory area to detect out-of-range writes) and poisoning (initialising the object contents with a known value to detect uninitialised usage), specifically tailored to RINA objects. They allow developers to easily spot memory leaks as well as memory corruption problems.
- rref: provides reference-counting functionalities to RINA objects and allows implementing lightweight garbage-collection semantics on a per-object basis. Developers can opt in to reference counting for their objects only when needed.
- rwq, rbmp and rmap: implement façades for the Linux workqueues, bitmaps and hashmaps, respectively. These components provide additional functionalities such as easier dynamic allocation, simplified interaction and ad-hoc methods for non-interruptible contexts.
- rqueue: mainly provides dynamically resizable queues, which are unavailable as part of the default kernel libraries.

Some of the aforementioned features, such as memory tampering detection and poisoning, can be opted out of at compilation time: they are selectable items in the Kconfig configuration framework, and therefore their performance cost can be reduced to nothing.

The final software framework, used by all the components of the IRATI stack, can be imagined as built on the functionalities just presented and the approach presented in section 2.1.1.2. For example, the SDU, PDU, KFA and KIPCM internal data structures have been modelled upon this base framework by:

- Including an IRATI object in their representative structures.
- Using the aforementioned utilities for memory allocation (i.e. rmem) or for the creation of elements in their structures (rbmp, rmap, rqueue, rwq, etc.).

2.1.3 The personalities and the kernel/user interface

The "personality layer" aims to support different implementations of the RINA kernel stack coexisting in the same system. To achieve this, a unique kernel/user interface must be agreed and shared among the different implementations. The kernel/user interface is currently shaped by the set of system calls and the RINA Netlink Layer (RNL).

The personality layer acts as a kernel/user interface mux/demux that allows hot-plugging different implementations into the stack core. Each implementation is associated with a personality instance, which is identified by a system-wide unique identifier. In the user-space-to-kernel direction, the personality layer demuxes interface invocations to the specific personality instance (e.g. the one providing the service). In the opposite direction, it muxes the personality instance to the related user-space entity (e.g. the one requesting the service). This is possible because the components of the stack core hold bindings to their parents/siblings in a hierarchical way (with the personalities residing at the root of the hierarchy).
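The dispatch idea can be pictured as follows (a simplified sketch with hypothetical names; the real personality interface is considerably larger):

    #include <linux/types.h>

    #define MAX_PERSONALITIES 16 /* illustrative bound, not from the sources */

    /* A personality bundles the entry points serving the kernel/user interface */
    struct personality_ops {
            int (*flow_allocate)(void * data, int port_id);
            int (*sdu_write)(void * data, int port_id,
                             const void * sdu, size_t size);
    };

    struct personality {
            int                            id;   /* system-wide unique identifier */
            void *                         data; /* implementation-private state  */
            const struct personality_ops * ops;
    };

    static struct personality * personalities[MAX_PERSONALITIES];

    /* Demux: a syscall carrying a personality id is routed to that instance */
    long sys_sdu_write_demux(int pers_id, int port_id,
                             const void * sdu, size_t size)
    {
            struct personality * p = personalities[pers_id];
            return p ? p->ops->sdu_write(p->data, port_id, sdu, size) : -1;
    }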
2.1.4 The stack core

The components of the IRATI prototype located in kernel space are divided into two main categories: the core and the IPC Processes. The core is the set of components that provide the fundamental functionalities of the system, as well as the "glue" that holds together the different IPC Process instances. The components in the core orchestrate the interactions with the user-space components, the life-cycle of the IPC Processes and the data transfer operations (in which the IPC Processes are involved).

The core components have been updated since D2.1 [3], resulting in a final composition for the first prototype currently shaped by the KIPCM, the KFA, the RNL and the IPC Process factories, as depicted in Figure 2. With respect to [3], two main problems were addressed and solved, in addition to minor improvements:

- Concurrency and memory allocation problems.
- Making the IPC Process abstraction independent of its type (i.e. normal or shim).

During the testing and integration activities, a few concurrency-related (i.e. bad locking semantics) and memory-related (leaks and corruption) problems were detected. Although most of them were due to bugs, the coupling between the KIPCM and the IPC Processes was also found to be too tight. This coupling, together with the concurrency constraints imposed by the kernel environment (such as the absence of recursive locking semantics), caused deadlocks under particular read/write conditions. In order to properly solve them, some of the KIPCM functionalities have been moved into a new software component: the Kernel Flow Allocation Manager (KFA).

2.1.4.1 KFA, the Kernel Flow Allocation Manager

The KFA is in charge of flow-management-related functionalities such as flow creation, binding of flows to IPC Processes, de-allocation, etc. These functionalities are offered both to the KIPCM, via the KFA's northbound interface, and to the IPC Processes, via its southbound interface.

The management of port identifiers and their association with flow instances is the most important functionality provided by the KFA, through its Port-ID Manager (PIDM). It solves synchronisation issues in the flow creation process that were due to race conditions: destination applications could invoke system calls on port-ids before the kernel had completed the flow allocation process, so there was a short period during which read and write operations on such a port-id failed, since this situation was not properly handled by the KIPCM. With the introduction of the KFA, the internal kernel flow structures that support the port-ids are created, with status 'PENDING', before the application is notified of an incoming flow allocation request. When the application confirms that the flow is accepted, the status of the flow structure is updated to 'ALLOCATED'. If the application invokes a read/write system call on the port-id before the flow status is 'ALLOCATED', the KFA simply deschedules the calling process until the flow allocation is complete.

Complementary to the port-id management, the KFA binds the KIPCM and the IPC Processes by means of the flows it manages. For this purpose, the KFA creates a flow structure for each flow and binds together its port identifier and the identifier of the IPC Process supporting the flow. The flow structure also contains information such as the flow allocation state, as described in the previous paragraph, and the queue of SDUs ready to be consumed by user-space applications. By means of the allocation state, the KFA controls the life cycle of a flow: the possible states (pending, allocated, de-allocated) are set as a result of management actions and impose constraints on the actions that can be applied to the flow.
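The description above suggests a flow structure along the following lines (an illustrative sketch only; the field and type names are hypothetical):

    #include <linux/wait.h>

    struct ipcp_instance; /* the IPCP abstraction sketched in section 2.1.1 */
    struct rqueue;        /* the resizable queue from the base framework    */

    /* Possible life-cycle states of a KFA-managed flow */
    enum kfa_flow_state {
            PORT_STATE_PENDING,    /* structure created, allocation in progress */
            PORT_STATE_ALLOCATED,  /* application accepted the flow, I/O allowed */
            PORT_STATE_DEALLOCATED /* flow being torn down */
    };

    struct kfa_flow {
            int                    port_id;    /* handle used by the application */
            struct ipcp_instance * ipcp;       /* IPC Process serving the flow   */
            enum kfa_flow_state    state;
            struct rqueue *        ready_sdus; /* SDUs waiting to be read by
                                                * user space                     */
            wait_queue_head_t      waitq;      /* readers/writers sleep here
                                                * until state is ALLOCATED       */
    };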
The following figure shows the main interactions of the KFA.

Figure 3 KFA Interactions (details)

2.1.4.2 KIPCM, the Kernel IPC Manager

As described in [3, section 7.5.1], the KIPCM is the kernel-space counterpart of the IPC Manager in user space. Its main responsibilities are the life-cycle management of several components in the kernel, such as the IPC Processes and the KFA, and providing the main interface towards user space. The following figure shows the main interactions between the KIPCM and the other components of the system.

Figure 4 KIPCM interactions (details)

The OO approach introduced in section 2.1.1 shows its potential here: the KIPCM provides the same binding and usage APIs for both the normal and the shim IPC Processes, regardless of their "exact" type. This way, all the functionalities they provide are transparent (and homogeneous) to the upper layers, even though their inner workings may vary greatly. For instance, the "regular" IPC Processes have EFCP and RMT instances while the shims completely lack them, but the user-space components are unaware of such differences. To provide such a high-level abstraction, the KIPCM makes use of the IPC Process factories. These factories are registered with the KIPCM upon loading of the corresponding IPC Process kernel module, adding to the system the capability to create the specific type of IPC Process through an independent interface.

Nevertheless, the specific functionalities offered by the KIPCM have been updated since [3], as already introduced in section 2.1.4, mostly by moving flow-management tasks to the KFA. They can be summarised as follows:

- It is in charge of the creation of IPC Processes in the kernel.
- It is in charge of retrieving the corresponding IPC Process instance whenever it has to perform a task triggered by a call to the to-user-space interface (i.e. flow allocation, registering an application to an IPC Process, etc.).
- It abstracts the nature of the IPC Process from the applications by providing a unique API for reading and writing SDUs.
- It is the main hub to the RNL layer presented in section 2.1.4.3, correlating incoming/outgoing (reply/request) messages by means of their sequence numbers.
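The last point, request/reply correlation, can be sketched as follows (hypothetical names throughout; the actual bookkeeping lives inside the KIPCM and the rmap lookup shown here is an assumed façade call, not a documented API):

    #include <linux/types.h>
    #include <linux/completion.h>

    struct rmap; /* hashmap façade from the base framework */

    /* A pending transaction: a request sent to user space, awaiting its reply */
    struct kipcm_transaction {
            u32               seq_num; /* Netlink sequence number              */
            struct completion done;    /* signalled when the reply arrives     */
            int               result;  /* outcome carried by the reply message */
    };

    /* Hypothetical lookup on the rmap façade, keyed by sequence number */
    struct kipcm_transaction * rmap_find(struct rmap * map, u32 key);

    /* On egress the KIPCM stores the transaction keyed by seq_num; when RNL
     * delivers the reply, the handler looks it up and wakes up the waiter */
    void kipcm_reply_arrived(struct rmap * pending, u32 seq_num, int result)
    {
            struct kipcm_transaction * t = rmap_find(pending, seq_num);
            if (!t)
                    return; /* unknown or expired sequence number: drop */
            t->result = result;
            complete(&t->done);
    }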
2.1.4.3 RNL, the RINA Netlink Layer

Netlink sockets, together with a set of system calls and procfs/sysfs, are one of the three technologies that shape the communication framework between user-space processes and the kernel in the IRATI prototype.

Netlink is a flexible, socket-based communication channel typically used for kernel/user-space dialogues. It defines a limited set of families, each one dealing with a specific service. The family selects the kernel module, or Netlink group, to communicate with, e.g. NETLINK_ROUTE (receives routing and link updates and may be used to modify the routing tables), NETLINK_FIREWALL (transports IPv4 packets from the netfilter subsystem to user space), etc. To overcome this limited number of available families, the Generic Netlink extension was introduced: it is yet another Netlink family that works as a protocol-type multiplexer, increasing the number of families supported beyond the original Netlink specification. Please refer to [3, section 7.3.1.3] and [4] for further details.

The RINA Netlink Layer (RNL) is the solution implemented in the IRATI stack to integrate Netlink into the RINA software framework. It is an abstraction layer that deals with all the tasks related to the configuration, generation and destruction of Netlink sockets and messages, hiding their complexity from the RNL users.

Figure 5 RNL and Netlink layers in the IRATI prototype

On the one hand, RNL defines a Generic Netlink family called NETLINK_RINA for the exclusive use of the RINA software framework. This family describes the set of messages that can be sent or received by the components in the kernel. On the other hand, RNL acts as a multi-directional communication hub that demultiplexes messages upon reception from user space and provides message-generation facilities to the different components in the kernel.

Upon initialisation, the KIPCM registers a set of handlers with the RNL, one for each message defined in the stack's RNL API. The RNL handlers are objects that contain a callback function and a parameter that is opaquely passed by RNL to the receiver of the callback once it gets invoked. This approach allows each kernel component to register a specific handler whose callback function is in charge of performing the task associated with the type of message received.

The RNL behaviour in the ingress and egress directions can be summarised as follows:

- Upon reception of a new message, the RNL looks up the proper handler in the registered set and forwards the message by invoking the registered callback function, passing the responsibility of reacting to the message to the component in charge. This component relies on the second set of utilities provided by the RNL: parsing and formatting of NETLINK_RINA family messages. Upon reception, the component calls the RNL to parse the received message. RNL validates the message's correctness (i.e. that the message is well formed and that its type and parameters are correct) and returns a well-known internal structure with the translation of the parameters contained in the original message.
- In the output direction, when a kernel component sends a Netlink message to user space, the component calls RNL passing the message type and the values of the parameters to be sent. The RNL formats the corresponding Netlink message, attaching a sequence number to it. Finally, via a second call to RNL, the component sends the resulting message to its destination (identified by the Netlink socket number).

Figure 6 depicts the most important interactions between the RNL and the KIPCM. Each number represents an ordered step in the process: 1) at bootstrap, the KIPCM registers an RNL set; 2) the KIPCM registers the handlers required for this RNL set; 3) when a message of a certain type is received from user space, the corresponding handler is retrieved and the callback of the module in charge is invoked, triggering the task to be performed; 4) if a response message is needed, it is sent using RNL again.
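A handler registration along these lines could look as follows (a sketch with hypothetical identifiers; the real RNL API differs in detail):

    #include <net/genetlink.h>

    #define RNL_MSG_TYPE_MAX 64 /* hypothetical bound on NETLINK_RINA types */

    /* An RNL handler: a callback plus an opaque parameter handed back to it */
    typedef int (*rnl_msg_handler_t)(void * data, struct genl_info * info);

    struct rnl_handler {
            rnl_msg_handler_t cb;
            void *            data; /* opaquely passed back to the callback */
    };

    /* One slot per NETLINK_RINA message type */
    static struct rnl_handler handlers[RNL_MSG_TYPE_MAX];

    int rnl_handler_register(unsigned int msg_type,
                             rnl_msg_handler_t cb, void * data)
    {
            if (msg_type >= RNL_MSG_TYPE_MAX || handlers[msg_type].cb)
                    return -1; /* unknown type or already registered */
            handlers[msg_type].cb   = cb;
            handlers[msg_type].data = data;
            return 0;
    }

    /* Demux on reception: route the message to the component in charge */
    static int rnl_dispatch(unsigned int msg_type, struct genl_info * info)
    {
            struct rnl_handler * h = &handlers[msg_type];
            return h->cb ? h->cb(h->data, info) : -1;
    }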
Figure 6 RNL interaction (detailed)

2.1.4.4 The IPC Process factories

The IPC Process factory concept has been introduced into the IRATI kernel-space stack in order to abstract the real nature of the IPC Processes, normal or shim, from their users (e.g. the KIPCM, the user space). It is an abstract constructor of IPC Processes, like its OOD factory-pattern counterpart.

Each IPC Process type (e.g. the Shim IPC Process over Ethernet, the normal IPC Process) has its ad-hoc factory, which registers with the KIPCM during its initialisation (e.g. when the IPC Process module is loaded). The KIPCM, in turn, uses the provided factory whenever an IPC Process instance of the corresponding type has to be created or destroyed. All the factories in the system provide IPC Process instances with the same common interface (i.e. with the same "method" signatures). Nevertheless, each IPC Process instance created is bound underneath to its specific type. This is possible because, although the factory provides a common API for IPC Process creation/destruction to the core components, it knows the particularities of the specific type of IPC Process it constructs.

Figure 7 IPCP Factory, IPCP instance and IPCP interactions

2.1.4.4.1 The IPC Process factories interface

The IPC Process factories interface follows the rules of any object interface in the IRATI framework (ref. section 2.1.1.2). It is made up of four methods, two addressing the initialisation/finalisation of the factory and two the creation/destruction of IPC Processes:

- init: initialises the factory's internal structures (e.g. at module loading time).
- fini: destroys the factory's internal structures and frees the allocated memory (e.g. at module unloading time).
- create: invoked by the KIPCM so that the factory creates an abstract IPC Process instance whose API is bound to the particular API of an IPC Process of the factory-specific type (ref. section 2.1.5.1).
- destroy: invoked by the KIPCM to destroy an IPC Process instance.
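In code, such a factory reduces to a small operations table (sketched below with hypothetical types and names, reusing the ipcp_instance idea from section 2.1.1; the registration call in the comment is illustrative):

    #include <linux/types.h>

    struct ipcp_factory_data;               /* factory-private state (opaque)  */
    struct ipcp_instance;                   /* the common IPCP abstraction     */
    struct name;                            /* an IPC Process naming structure */
    typedef u16 ipc_process_id_t;           /* illustrative identifier type    */

    /* The four factory methods, shared by all IPC Process types */
    struct ipcp_factory_ops {
            int                    (*init)(struct ipcp_factory_data * data);
            int                    (*fini)(struct ipcp_factory_data * data);
            struct ipcp_instance * (*create)(struct ipcp_factory_data * data,
                                             const struct name * ipcp_name,
                                             ipc_process_id_t    id);
            int                    (*destroy)(struct ipcp_factory_data * data,
                                              struct ipcp_instance * instance);
    };

    /* A shim module would register its factory with the KIPCM at load time,
     * e.g. from its module_init() routine:
     *
     *   kipcm_factory_register(kipcm, "shim-eth-vlan",
     *                          &eth_data, &eth_factory_ops);
     */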
2.1.5 The IPC Processes

In deliverable D2.1 [3], the necessity to distribute the different components of the normal IPC Process between user and kernel space was identified. This distribution minimises the number of context switches during fast-path-related operations. The partitioning brought the EFCP, RMT, PDU Forwarding Table and SDU protection components into kernel space as part of a normal IPC Process instance. For the shim IPC Processes, however, none of their components can lie outside the kernel, since their internal components are different and specific to the underlying technology (i.e. Ethernet). Nevertheless, the interfaces and internal structures of normal IPC Processes have been slightly updated from what is stated in [3]. The following sections briefly describe such updates and introduce the inner workings of the IPC Processes in the prototype.

2.1.5.1 The IPC Processes interfaces

Shim and normal IPC Processes offer both a common and a type-dependent interface. The common interface allows the seamless binding of instances to other kernel components, regardless of their type. It consists of a truly common part, used to assign IPC Processes to DIFs, register applications, and read from and write to flows. The specific part of the interface addresses the type-related functionalities of the IPC Process. These particularities concern the way flows are created and managed in normal and shim IPC Processes, the differences between the internal structures of each, and their configuration requirements. The aggregation of the common and specific sets of calls forms the unique API provided by the IPC Process instance object.

2.1.5.1.1 The normal IPC Process interface

As explained in the previous section, the normal IPC Process API comprises a subset of specific calls, depending on the type of the IPC Process (normal in this case), and a subset of calls that are common to any IPC Process.

The common set of calls is:

- assign_to_dif: this operation is triggered by the IPC Manager in user space. The affected IPC Process receives all the necessary information on the DIF in order to be able to start operating as part of it.
- update_dif_config: passes a new configuration (after a change) of an associated DIF to the IPC Process.
- sdu_write: when an application has been granted a flow towards another application and wants to send an SDU, it makes use of the corresponding system call in the user-space/kernel interface. This call is processed by the KIPCM, which in turn calls the sdu_write function of the IPC Process that is supporting the flow.
- flow_binding_ipcp: binds the RMT structure of the (N) IPC Process to the (N-1) flow structure from/to which the (N) IPC Process will receive/send SDUs.

The specific set of calls is:

- connection_create: as part of the flow allocation process for a normal IPC Process, at least one connection must be created to support the requested flow. This call creates the EFCP instances required in the kernel part of the normal IPC Process.
- connection_destroy: for the opposite situation, this call destroys an EFCP instance.
- connection_update: during the connection set-up process, this call is used to update an EFCP instance with the connection identifier of the peering EFCP instance (the EFCP instance at the opposite end of the connection).
- connection_create_arrived: during the connection set-up process, this call is used by the KIPCM to notify a normal IPC Process that another normal IPC Process at the opposite end is requesting a connection to support a flow.
- management_sdu_write: invoked by the IPC Process Daemon in user space when it wants to send a layer-management SDU to a peer IPC Process through a given port-id. The kernel parts of the IPC Process add the required DTP header to the SDU and schedule the resulting PDU for transmission through the Relaying and Multiplexing Task (RMT).
- management_sdu_read: invoked by the IPC Process Daemon to retrieve the next layer-management SDU directed to it. When the kernel components of the IPC Process detect that an incoming PDU contains a layer-management SDU, they store the payload of the PDU, as well as the port-id it arrived from, in an internal queue. When the IPC Process Daemon invokes the management_sdu_read system call, the IPC Process kernel components return the first element of the queue or make the calling process sleep until an element is available.
- pdu_forwarding_table_modify: invoked by the PDU Forwarding Table computation entity in the IPC Process Daemon, in order to add or remove entries of the PDU Forwarding Table.

2.1.5.1.2 The shim IPC Process interfaces

As already stated, the common part of the shim IPC Process interface is the same as explained in the previous section. The specific set of calls offered by this type of IPC Process is:

- application_register: as a result of this operation, triggered by an application via the IPC Manager in user space, the application is registered as reachable through the IPC Process. The IPC Process may spread this information to the rest of the DIF members.
- application_unregister: complementary to application_register.
- flow_allocate_request: invoked by the KIPCM when an application process on top requests a new flow to a destination application via the IPC Manager in user space.
- flow_allocate_response: informs the IPC Process about the application's decision to accept or deny the flow allocation request. The IPC Process reacts by committing the flow in the former case, or deleting it in the latter.
- flow_deallocate: deallocates an existing flow, freeing the corresponding resources.

2.1.5.2 EFCP, RMT and the PDU Forwarding Table (in the normal IPC Process)

The EFCP instance implements the Error and Flow Control Protocol state machine. It holds the DTP and (optionally) DTCP instances associated with the flow. For the first prototype, only the DTP state machine has been implemented, in order to support flows among different IPC Processes in the same system; for DTCP, only basic placeholders have been allocated. The basic functionalities of EFCP are supported, and it provides its services through the following APIs:

- efcp_write: injects an SDU into the instance, to be processed and delivered to the RMT (outgoing direction).
- efcp_receive: takes a PDU as input, processes the DTP PCI and delivers any resulting complete SDUs to the N+1 IPC Process or to the queues of SDUs ready to be consumed by user space (incoming direction).

Although a flow may be supported by several connections, in the current prototype there is one active EFCP instance per flow. Nevertheless, the EFCP Container concept has been introduced to abstract the notion of multiple EFCP instances in the same IPC Process. The EFCP Container provides the efcp_container_write and efcp_container_receive APIs, which mux/demux incoming requests onto the particular EFCP instance that will finally perform the task. The EFCP Container holds ingress and egress workqueues that sustain the work of each EFCP protocol machine in the container. These workqueues bind data (e.g. the SDU) to code (e.g. the function processing the SDU), deferring execution out of the API calls; they are thus the means to decouple the ingress and egress workflows within the EFCP component. Finally, the EFCP Container manages the CEP identifiers (the Connection End Point identifiers of flows, unique within the scope of the IPC Process instance).

Conversely, there is only one RMT instance per IPC Process which, similarly to its EFCP Container counterpart, holds ingress and egress workqueues for the purposes presented above (i.e. decoupling and sustaining the DU workflows in the RMT). The EFCP Container and the RMT instance, coupled together, sustain the ingress and egress DU workflows within a normal IPC Process.
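The deferred-execution idea behind these workqueues can be sketched with the plain Linux workqueue API, which the IRATI rwq component wraps (the names below are illustrative, not the prototype's own):

    #include <linux/workqueue.h>
    #include <linux/slab.h>

    struct sdu;  /* opaque data unit types, as in the rest of the stack */
    struct efcp;

    /* A work item binding data (the SDU) to code (the processing function) */
    struct egress_item {
            struct work_struct work;
            struct sdu *       sdu;
            struct efcp *      efcp;
    };

    static void egress_worker(struct work_struct * w)
    {
            struct egress_item * item = container_of(w, struct egress_item, work);

            /* Runs later, in workqueue context: the DTP machine would build
             * the PDU here and hand it to the RMT, e.g.:
             *   dtp_write(item->efcp, item->sdu);   (illustrative call)     */
            kfree(item);
    }

    /* Called from efcp_container_write(): defer instead of processing inline */
    static int egress_post(struct workqueue_struct * wq,
                           struct efcp * e, struct sdu * s)
    {
            struct egress_item * item = kmalloc(sizeof(*item), GFP_ATOMIC);
            if (!item)
                    return -1;
            item->sdu  = s;
            item->efcp = e;
            INIT_WORK(&item->work, egress_worker);
            return queue_work(wq, &item->work) ? 0 : -1;
    }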
Figure 8 summarizes the interactions of these components with the rest of the stack during the "write" (green) and "read" (red) operations.

Figure 8 The component interactions during read/write operations

The following sections delve into the DU workflows by exposing them at two different levels: the intra IPC Process level, i.e. the part that takes place between the EFCP Container and the RMT, and the inter IPC Process level, i.e. how the different IPC Processes in the IRATI stack communicate with each other to serve a flow on the fast-path.

2.1.5.2.1 The egress DUs workflow

The intra IPC Process egress DU workflow can be summarized as follows:

1. An SDU arrives at the EFCP Container, via the efcp_container_write call.
2. The EFCP Container retrieves the associated EFCP instance and invokes the efcp_write call.
3. The EFCP instance posts the SDU into the DTP. This call creates a new workitem in the egress workqueue of the EFCP Container.
4. When the EFCP Container workitem is executed, the DTP is invoked:
   a. The DTP protocol machine creates a PCI and, with the SDU, builds a PDU.
   b. The PDU is sent to the RMT by invoking rmt_send which, in turn, creates a workitem in its egress workqueue.
5. When the RMT workitem is executed, the RMT core:
   a. Retrieves the port-id that must be used to send the PDU, given the destination address and the QoS id of this flow.
   b. Finally sends out the SDU crafted from the original PDU (this step is required because the N-1 DIF supporting the flow will treat the PDU as just an SDU).

Figure 9 The egress EFCP / RMT workflow

Figure 10 shows the inter IPC Process workflow for an outgoing SDU through the DIF stack, i.e. the path a DU would follow through a stack of DIFs to exit the system towards the destination application. In the example, 3 DIFs are stacked in a system, each one with its IPC Process representative. IPC Process 0 is a shim IPC Process, the representative of a shim DIF in the system. IPC Processes 1 and 2 are normal IPC Processes. The port identifiers in Figure 10 follow the convention of using the names or numbers of the underlying IPC Processes that are the origin and destination of the port id, e.g. "port id app2" means that it binds the application to IPC Process 2. The workflow begins when an application, which has already been granted a flow to another application, invokes the syscall sys_sdu_write.

1. This syscall takes as input parameters the SDU to be sent and the port id of the flow; in this example, app2.
2. The KIPCM, via its kipcm_sdu_write API, receives the SDU and relays it to the KFA by invoking kfa_flow_sdu_write.
3. The KFA retrieves the IPC Process instance from the flow structure associated to this port id and invokes the sdu_write API in the IPC Process interface. In this case, IPC Process 2 provides the normal_write function, which calls the efcp_container_write API.
4. The intra IPC Process path described before is followed from the call to efcp_container_write until the crafted SDU (sdu*) is sent to the KFA by means of kfa_flow_sdu_write. The port id retrieved by RMT 2 will be 21. (NOTE: the star in "sdu*" means that this SDU is different from the one originally received by the KIPCM.)
5. The intra IPC Process path is followed again for IPC Process 1, which is a normal IPC Process as well. A newly crafted SDU (NOTE: the double stars mean that it is different from the previous "sdu*") is sent to the KFA by RMT 1 with port id 10.
6. The KFA retrieves the flow structure and thus the IPC Process instance, invoking the sdu_write operation in the IPC Process interface. However, this time the IPC Process is the shim IPC Process and therefore the sdu_write operation results in a call to shim_write. Finally, the shim follows its own scheme to send sdu** (ref. section 2.1.8).

Figure 10 The egress DUs workflow

2.1.5.2.2 The ingress DUs workflow

The intra IPC Process ingress DU workflow can be summarized as follows:

1. An SDU arrives in the RMT by means of a call to rmt_receive and is transformed into an ingress workqueue (work)item.
2. When the workitem is executed, the RMT task crafts a PDU from the SDU and checks the destination address:
   a. If the destination address is not the address of the IPC Process it belongs to, it looks for an N-1 flow (using the PDU Forwarding Table) to send the SDU. In case there is no entry, the SDU is discarded.
   b. If the destination address is the address of the IPC Process, it calls efcp_container_receive to pass the PDU to the DTP. The task of processing the PDU is inserted into the ingress workqueue. When the task is run:
      i. The DTP retrieves the connection information, i.e. the port-id, by means of the PCI (which is then discarded).
      ii. It finally sends out the SDU by invoking kfa_sdu_post.

Figure 11 The ingress EFCP / RMT workflow

Figure 12 shows the inter IPC Process ingress DU workflow, i.e. an example of the path a DU would follow through a stack of DIFs to arrive at the destination application after entering the system. The setup is identical to the one described in section 2.1.5.2.1.

1. An SDU arrives in the system and, after being processed by the device layer, is received by the shim IPC Process (IPCP 0 in the figure). IPCP 0 performs the necessary steps (ref. section 2.1.8) to identify the port to which the SDU has to be forwarded (pid 10 in the figure) and calls the kfa_sdu_post operation to send the SDU (sdu**).
2. The KFA retrieves its flow structure and looks for an RMT instance associated to this flow (ref. section 2.1.5.1.1). It uses the rmt_receive call to post the SDU (sdu**) on the RMT instance belonging to IPC Process 1.
3. The intra IPC Process path described before is followed from the call to rmt_receive until the call to kfa_sdu_post, to finally send sdu* to the flow identified with port id 21.
4. Steps 2 and 3 are repeated until the KFA finds no corresponding RMT to post the SDU to, at which point it considers the flow to be directed to an application in user space.
5. The KFA pushes the SDU into the SDUs-ready queue of the flow structure.
6. The application invokes the sdu_read syscall to retrieve the SDU from the flow's ready queue.

Figure 12 The ingress DUs workflow
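The ingress decision of step 2 above (relay versus local delivery) is the heart of the RMT. The sketch below illustrates it with assumed type, field and helper names; it is not the actual IRATI code.

/* Sketch of the RMT ingress decision (steps 2a/2b above). */
struct pdu;
struct pft;                      /* stands for the PDU Forwarding Table */
struct efcp_container;
typedef unsigned int address_t;
typedef int          port_id_t;

struct rmt_ctx {
        address_t               own_address;
        struct pft *            pft;
        struct efcp_container * efcpc;
};

address_t pdu_dst_address(const struct pdu * pdu);
int       pft_lookup(struct pft * pft, const struct pdu * pdu,
                     port_id_t * out_port);
int       rmt_send_via_n1_flow(port_id_t port, struct pdu * pdu);
int       efcp_container_receive(struct efcp_container * c,
                                 struct pdu * pdu);
void      pdu_destroy(struct pdu * pdu);

int rmt_process_ingress(struct rmt_ctx * rmt, struct pdu * pdu)
{
        port_id_t out;

        if (pdu_dst_address(pdu) != rmt->own_address) {
                /* Not addressed to this IPC Process: relay through an
                 * N-1 flow, or discard if the table has no entry. */
                if (pft_lookup(rmt->pft, pdu, &out)) {
                        pdu_destroy(pdu);
                        return -1;
                }
                return rmt_send_via_n1_flow(out, pdu);
        }

        /* Addressed to this IPC Process: hand the PDU to the local
         * EFCP container (DTP), which ends with kfa_sdu_post. */
        return efcp_container_receive(rmt->efcpc, pdu);
}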
2.1.6 ARP826

The ARP826 software component is a lightweight, RFC 826 compliant ARP protocol implementation [26]. The main motivation for developing this additional software component is the unfeasibility of reusing the current Linux ARP implementation (as in Linux kernel v3.10.0), mainly for the following reasons:

- It is too interwoven with the TCP/IP layer:
  - While the original RFC was designed to translate any network protocol address to an Ethernet address, the Linux ARP implementation only allows translating an IPv4 address to any hardware address. In that case, the Shim Ethernet IPC Process would have to enforce naming restrictions on the IPC Processes using the shim; for instance, the IPC Process names would have to be constrained to IP addresses.
  - When a new mapping entry becomes available in the ARP cache, only the TCP/IP stack is notified. Reuse of the Linux ARP implementation would mean either heavily patching the code to overcome this problem or actively polling the cache to see if a mapping becomes available.
- It only allows one network protocol address per interface: Linux ARP gets the IP address from the device. This would require workarounds to assign the IPC Process name to an interface in the form of an IP address. The TCP/IP stack would also have to use this IP address from then on.
- ARP requests can only be sent to IP addresses in the same subnet: this would make the naming restrictions even stricter, the shim IPC Processes would be trickier to configure, and the workarounds to circumvent this problem would render the solution hard to maintain and to propose mainstream.

With the IRATI ARP826 implementation, none of the aforementioned restrictions hold. This implementation complies with RFC 826, which assumes the generality of the ARP protocol in resolving any Layer 3 network address. It therefore suits the Shim Ethernet IPC Process requirements and, at the same time, avoids patching areas of the Linux kernel (i.e. ARP, IPv4, Netfilter) whose modification would probably harm the adoption of the IRATI stack in the mainstream sources (i.e. the Linux kernel).

2.1.6.1 ARP826 software design

The ARP826 implementation is made up of different parts interacting with each other. These parts, depicted in Figure 13, can be summarized as: the core; the maps and tables (also referred to as the "cache"); the Address Resolution Module (ARM); and the RX/TX part. All the parts together implement the ARP826 component as a dynamically loadable module (i.e. arp826.ko). The following sections briefly describe the ARP826 address abstraction, its parts and their interactions.

Figure 13 Interactions of the ARP826 parts

2.1.6.1.1 The Generic Protocol Addresses and Generic Hardware Addresses

The ARP826 component can be summarised as a directory that maps a protocol address to a hardware address. The data structures that represent these addresses are called Generic Protocol Addresses (GPAs) and Generic Hardware Addresses (GHAs). Their representation in the stack can be summarized as follows:

struct gpa {
        uint8_t * address;
        size_t    length;
};

typedef enum {
        MAC_ADDR_802_3,
        ...
} gha_type_t;

struct gha {
        gha_type_t type;
        union {
                uint8_t mac_802_3[6];
                ...
        } data;
};

These representations are hidden from the user, following the kernel-space object-oriented approach described in section 2.1.1, while functions to handle such objects are available as part of their API.
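The hiding works through constructor/destructor pairs that own the underlying storage. The sketch below shows the kind of functions sitting behind the GPA abstraction; the exact IRATI names may differ, and user-space malloc/free stand in here for the kernel allocators.

/* Hedged sketch of a GPA constructor/destructor pair. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct gpa {
        uint8_t * address;
        size_t    length;
};

struct gpa * gpa_create(const uint8_t * address, size_t length)
{
        struct gpa * tmp;

        if (!address || !length)
                return NULL;

        tmp = malloc(sizeof(*tmp));
        if (!tmp)
                return NULL;

        tmp->address = malloc(length);
        if (!tmp->address) {
                free(tmp);
                return NULL;
        }

        /* Deep-copy the caller's buffer: the GPA owns its storage. */
        memcpy(tmp->address, address, length);
        tmp->length = length;

        return tmp;
}

void gpa_destroy(struct gpa * gpa)
{
        if (!gpa)
                return;
        free(gpa->address);
        free(gpa);
}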
The GHA/GPA abstraction provides several services to its users, most importantly:

- Address growing/shrinking: an ARP packet has a field that specifies the length of the network protocol addresses, and in some network protocols the length of the network address can vary. Methods are therefore provided to grow and shrink GPAs by adding and removing filler data; this way, the length is set on a per-packet basis.
- Per-technology GHA creation: the GHA component allows the creation of special GHAs for a specific technology, such as the "broadcast" and "unknown" GHAs (i.e. the address that must be inserted into ARP requests).

2.1.6.1.2 The ARP826 Core

The core is in charge of the ARP826 module initialization and of the registration of new network protocols in the ARP826 component. Upon component initialization, the arp_receive function, implemented in the ARP826 RX/TX part, is registered (system-wide) to the device-layer frame-reception event. This function is invoked to handle incoming Ethernet frames with the ARP ethertype, and handles ARP packets for registered network protocols.

2.1.6.1.3 The ARP826 Maps and Tables

ARP826 Maps and ARP826 Tables are the data structures supporting the ARP826 implementation:

- The ARP826 Tables are the main abstraction supporting address mapping, commonly referred to as the ARP cache in other implementations. ARP826 Tables are composed of table-entries maintaining the mappings between GPAs and GHAs. Table-entries can be freely added, removed and updated by network address, and retrieved by hardware address (GHA), network address (GPA) or both. Table lookup and management operations (e.g. insertion and removal of table entries) are mainly performed by the RX/TX and ARM components.
- The ARP826 Maps are the internal data structures used by the ARP826 component, effectively storing the mappings as hash-maps.

2.1.6.1.4 The ARP826 Address Resolution Module

The ARP826 Address Resolution Module (ARM) is in charge of handling address resolution requests. An application that is using the ARP826 component calls the ARP request API when it wants to obtain the mapping of a network address (GPA) to a hardware address (GHA). The API call contains the source and destination network and hardware addresses, as well as a callback that will be invoked when the mapping becomes available. The API can be summarized as follows:

typedef void (* arp826_notify_t)(void *             opaque,
                                 const struct gpa * tpa,
                                 const struct gha * tha);

int arp826_resolve_gpa(struct net_device * dev,
                       uint16_t            ptype,
                       const struct gpa *  spa,
                       const struct gha *  sha,
                       const struct gpa *  tpa,
                       arp826_notify_t     notify,
                       void *              opaque);

The combination of all the arp826_resolve_gpa parameters (i.e. dev, ptype, spa, sha, tpa, notify and opaque) is added to a list of on-going resolutions when the function is called. This list allows decoupling the application's request (an on-going resolution) from the ARP protocol machine. Upon receipt of an ARP reply, a new workitem is created and added to the workqueue of the ARM component. A workqueue is used here to decouple the ARM frame handler from the non-interruptible context that the reception of a new frame poses. When the workitem is finally executed, the list of on-going resolutions is browsed to look for an entry matching the request (dev, ptype, spa, sha and tpa are used for the match). In the case of a match, the notify handler function is called and the entry is removed from the on-going resolutions list; the request is discarded otherwise.
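From the caller's point of view the asynchronous pattern looks as sketched below. The context type, the helper and the callback body are hypothetical; the types are those declared above.

/* Hedged usage sketch of arp826_resolve_gpa: the caller registers a
 * callback and returns immediately; the callback later runs from the
 * ARM workqueue once the ARP reply has been matched. */
#include <stdint.h>

struct gpa;
struct gha;
struct net_device;

struct my_flow_ctx;                       /* hypothetical caller state */
void complete_flow_allocation(struct my_flow_ctx * ctx,
                              const struct gha *   resolved_ha);

static void my_resolution_done(void *             opaque,
                               const struct gpa * tpa,
                               const struct gha * tha)
{
        /* Invoked when the (tpa -> tha) mapping becomes available;
         * 'opaque' carries back the caller-supplied context. */
        struct my_flow_ctx * ctx = opaque;

        (void) tpa;
        complete_flow_allocation(ctx, tha);
}

static int start_resolution(struct net_device *  dev,
                            uint16_t             ptype,
                            const struct gpa *   spa,
                            const struct gha *   sha,
                            const struct gpa *   tpa,
                            struct my_flow_ctx * ctx)
{
        /* Returns as soon as the request is queued; resolution
         * completes asynchronously via my_resolution_done(). */
        return arp826_resolve_gpa(dev, ptype, spa, sha, tpa,
                                  my_resolution_done, ctx);
}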
2.1.6.1.5 The ARP826 RX/TX

The ARP826 RX/TX component is the very bottom part of the IRATI stack ARP implementation. It can be viewed as a passive component in the sense that it only responds to external events:

- The receiving (RX) part responds to an incoming new frame. It checks whether it can process it (i.e. it performs preliminary checks) and admits only ARP request or reply frames. If the frame is an ARP request, the ARP cache is checked to verify that the requested network address was registered on this system and, if this is the case, an ARP reply is sent back. If the frame is an ARP reply, it is handed to the ARP826 Address Resolution Module.
- The transmitting (TX) part is executed when the API is called, and simply creates a new ARP request and transmits it on the specified device.

2.1.7 RINARP

The RINARP component is an abstraction layer interposed between the Shim Ethernet IPC Process and the ARP826 components. RINARP decouples the Shim Ethernet IPC implementation from ARP826, thus allowing parallel development of the two components. The RINARP component is expected to be reused by other Ethernet-based shims (e.g. a Shim-WiFi IPC Process). With the RINARP/ARP826 approach, the current ARP826 can evolve, even with major API changes, without requiring significant updates to the shims already available (i.e. the Shim Ethernet IPC Process).

2.1.7.1 The RINARP API

The RINARP API is defined as follows:

struct rinarp_handle;

typedef void (* rinarp_notification_t)(void *             opaque,
                                       const struct gpa * tpa,
                                       const struct gha * tha);

struct rinarp_handle * rinarp_add(struct net_device * dev,
                                  const struct gpa *  pa,
                                  const struct gha *  ha);

int rinarp_remove(struct rinarp_handle * handle);

int rinarp_resolve_gpa(struct rinarp_handle * handle,
                       const struct gpa *     tpa,
                       rinarp_notification_t  notify,
                       void *                 opaque);

const struct gpa * rinarp_find_gpa(struct rinarp_handle * handle,
                                   const struct gha *     tha);

The RINARP user must first obtain a handle through a call to rinarp_add(), specifying its (PA, HA) mapping. This handle is used for any later call to the RINARP API. When the user has finished using RINARP, the handle must be disposed of by calling the rinarp_remove() function. The API allows for asynchronous resolutions and synchronous lookups:

- To resolve a GPA asynchronously, rinarp_resolve_gpa has to be used. The function initiates the task of issuing the ARP request (using the target network protocol address, tpa). Once the resolution becomes available, the user-provided callback function (i.e. notify) is invoked to report the resolution results.
- Since the Ethernet II standard has no flow allocation mechanism, the shim IPC Process over Ethernet creates a new flow upon the reception of a new Ethernet frame [3]. However, the Ethernet frame only contains the hardware address, which means the sending application might be unknown. To solve this problem, an explicit ARP826 table lookup functionality has been added and rinarp_find_gpa introduced. rinarp_find_gpa allows the reverse lookup operation (i.e. it gets the GPA from the ARP cache, based on the corresponding GHA), supplying the IRATI stack with the means to obtain the information it requires on the flow initiator.
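Put together, a RINARP user follows the call sequence sketched below, built on the API just listed; the helper names and the empty callback body are illustrative.

/* Sketch of the RINARP call sequence a shim-like user would follow. */
struct gpa;
struct gha;
struct net_device;

static void flow_notify(void *             opaque,
                        const struct gpa * tpa,
                        const struct gha * tha)
{
        /* Commit the pending flow using the resolved (tpa -> tha)
         * pair; 'opaque' identifies which flow was waiting. */
        (void) opaque; (void) tpa; (void) tha;
}

static struct rinarp_handle * shim_start(struct net_device * dev,
                                         const struct gpa *  our_pa,
                                         const struct gha *  our_ha)
{
        /* 1. Publish our own (PA, HA) mapping and obtain a handle;
         *    incoming ARP requests for our_pa can now be answered. */
        return rinarp_add(dev, our_pa, our_ha);
}

static int shim_flow_allocate(struct rinarp_handle * handle,
                              const struct gpa *     peer_pa,
                              void *                 flow_ctx)
{
        /* 2. Resolve the peer asynchronously; flow_notify() fires
         *    once the mapping is in the cache. */
        return rinarp_resolve_gpa(handle, peer_pa, flow_notify, flow_ctx);
}

static void shim_stop(struct rinarp_handle * handle)
{
        /* 3. Dispose of the handle when the shim is torn down. */
        rinarp_remove(handle);
}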
2.1.8 The Shim IPC Process over Ethernet

The Shim IPC Process over Ethernet, as specified in [3], has been implemented in the stack. It can be used by normal IPC Processes or, for testing purposes, by applications such as the echo test application. Since it must implement the IPC Process factories interface described in section 2.1.4.4, it provides the following operations:

- eth_vlan_create: Creates a new instance of the Shim IPC Process over Ethernet and allocates the necessary memory for its data structures.
- eth_vlan_destroy: Cleans up all memory and destroys the instance.
- eth_vlan_init: Initializes all data structures in the instance.
- eth_vlan_fini: Cleans up all the memory and destroys the factory data.

The IPC Process instance this shim returns upon eth_vlan_create strictly follows the rules described in section 2.1.5.1. However, a freshly created and initialized instance is not yet functional, since it first has to be assigned to a DIF. eth_vlan_assign_to_dif supplies the Shim IPC Process over Ethernet with the information it needs to start operating properly (i.e. the interface name and the VLAN id, which is the DIF name). Finally, a packet handler is added to the device layer (eth_vlan_rcv). An application or IPC Process that uses this shim can then call the following operations to start inter-process communication:

- eth_vlan_application_register: Registers an application in the shim DIF, and thus in the ARP cache. Only one application can use the shim IPC Process at a time. Because ARP does not differentiate between a client and a server application, every application has to call this operation before calling any other flow-related operation.
- eth_vlan_application_unregister: Unregisters the application from the shim DIF and removes its traces from the ARP cache. A different application can then use the shim IPC Process.
- eth_vlan_flow_allocate_request: When this operation is called, a new flow is created and an ARP request is sent out, as specified in [3]. The rinarp_resolve_handler function is handed to the ARP826 component as the function that should be called upon receipt of the corresponding ARP reply.
- eth_vlan_flow_allocate_response: When the destination application answers a flow allocation request, the shim is notified. This function is in charge of handling the application response as well as updating the flow status accordingly.
- eth_vlan_flow_deallocate: Deallocates the flow and cleans up all data structures related to it.
- eth_vlan_sdu_write: When a flow is allocated, SDUs can be written. This function encapsulates the SDU into an Ethernet frame and sends it out of the device.

The Shim IPC Process over Ethernet also has internal operations, which are needed for a working implementation:

- rinarp_resolve_handler: Called upon receipt of an ARP reply to an ARP request that was sent out when eth_vlan_flow_allocate_request was called. The flow is then allocated, and "write" operations are finally allowed.
- eth_vlan_rcv: Receives new Ethernet frames, adds them to a list and queues a workitem in the shim's workqueue (in order to decouple from the non-interruptible context).
- eth_vlan_rcv_worker: Called by the workqueue; processes at least one frame. For every frame in the list of frames, eth_vlan_recv_process_packet is called.
- eth_vlan_recv_process_packet: Processes the packet with the following logic (sketched in code after this list):
  - If the flow that corresponds to the sender's MAC address in this frame is already allocated, the frame is delivered to the corresponding flow.
  - If the flow has not been allocated yet, but already exists, the frame is queued.
  - If the flow has been denied, the packet is dropped.
  - If the flow does not exist, a new flow is created and the KIPCM is notified.
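The four-case logic above can be condensed into a short sketch. The flow states, the lookup and the helper functions are illustrative assumptions, not the actual shim code.

/* Sketch of the eth_vlan_recv_process_packet decision logic. */
struct sdu;
struct gha;
struct shim_flow;

enum flow_state {
        FLOW_NONE,        /* no flow for this sender yet        */
        FLOW_PENDING,     /* created, application not asked yet */
        FLOW_ALLOCATED,   /* application accepted               */
        FLOW_DENIED       /* application refused                */
};

struct shim_flow * find_flow_by_gha(const struct gha * src_ha);
enum flow_state    flow_state_of(const struct shim_flow * flow);
int  deliver_to_flow(struct shim_flow * flow, struct sdu * sdu);
int  enqueue_on_flow(struct shim_flow * flow, struct sdu * sdu);
int  create_flow_and_notify_kipcm(const struct gha * src_ha,
                                  struct sdu *       sdu);
void sdu_destroy(struct sdu * sdu);

int process_packet(const struct gha * src_ha, struct sdu * sdu)
{
        struct shim_flow * flow = find_flow_by_gha(src_ha);

        if (!flow)
                /* Unknown sender: create the flow, notify the KIPCM. */
                return create_flow_and_notify_kipcm(src_ha, sdu);

        switch (flow_state_of(flow)) {
        case FLOW_ALLOCATED:
                return deliver_to_flow(flow, sdu);  /* up the stack    */
        case FLOW_PENDING:
                return enqueue_on_flow(flow, sdu);  /* hold until the
                                                       app answers     */
        case FLOW_DENIED:
        default:
                sdu_destroy(sdu);                   /* drop            */
                return 0;
        }
}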
2.1.9 The Shim Dummy IPC Process

The Shim Dummy IPC Process can be thought of as a sort of loopback IPC Process. It is an IPC Process that does not traverse host boundaries and therefore cannot communicate with IPC Processes in other systems. It can be used by normal IPC Processes or, for testing purposes, by applications such as the echo test application. As any other IPC Process, the Shim Dummy registers its IPC Process factory with the system (KIPCM) upon initialization, either during module loading (Shim Dummy built as a kernel module) or during kernel initialization (Shim Dummy embedded into the kernel). Since it must implement the IPC Process factories interface described in section 2.1.4.4, it provides the following operations:

- dummy_create: Creates a new instance of the Shim Dummy IPC Process and allocates the necessary memory for its data structures.
- dummy_destroy: Cleans up all memory and destroys the instance.
- dummy_init: Initializes all data structures in the factory.
- dummy_fini: Cleans up all the memory and destroys the factory data.

The IPC Process instance the factory returns upon dummy_create strictly follows the rules described in section 2.1.5.1. However, a freshly created and initialized instance is not yet functional, since it first has to be assigned to a DIF. The dummy_assign_to_dif "method" supplies the Shim Dummy IPC Process with the information it needs to start operating properly (the DIF name in this case). The Shim Dummy IPC Process API that can be used by applications or other IPC Processes, via the IPC Process instance abstraction, is the following:

- dummy_application_register: Registers an application in the Shim Dummy.
- dummy_application_unregister: Unregisters the application from the Shim Dummy.
- dummy_flow_allocate_request: When this operation is called, a new flow connecting two applications in the same system is created. The Shim Dummy notifies the destination application, and two flow structures are created in the KFA: one for the source application, when the request is received by the KIPCM, and the other for the destination application at this point.
- dummy_flow_allocate_response: When the destination application answers a flow allocation request, the shim is notified. This function is in charge of handling the application response as well as updating the flow status accordingly.
- dummy_flow_deallocate: Deallocates the flow and cleans up all data structures related to it.
- dummy_sdu_write: Hook to the sdu_write syscall.
- dummy_sdu_read: Hook to the sdu_read syscall; an SDU from the sdu_ready queue is returned.
2.2 The user space

The user-space components are implemented in two different software packages: librina and rinad. The librina package contains all the IRATI stack libraries that have been introduced to abstract from the user all the kernel interactions (such as syscalls and Netlink details). Librina provides its functionalities to user-space RINA programs via scripting-language extensions or statically/dynamically linkable libraries (i.e. for C/C++ programs). Librina is more a framework/middleware than a library: it has its own memory model (explicit, no garbage collection), its execution model is event-driven, and it uses concurrency mechanisms (its own threads) to do part of its work.

Figure 14 High-level architecture of the user-space parts

Rinad, instead, contains the IPC Manager and IPC Process daemons as well as a testing application (RINABand). The IPC Manager is the core of IPC management in the system, acting both as the manager of IPC Processes and as a broker between applications and IPC Processes (enforcing access rights, mapping flow allocation or application registration requests to the right IPC Processes, etc.). IPC Process Daemons implement the layer management components of an IPC Process (enrollment, flow allocation, PDU forwarding table generation and distributed resource allocation functions). For more details on the rationale behind this high-level architecture, interested readers may refer to the relevant sections of D2.1 [3]. Rinad also provides a couple of example/utility applications that serve two purposes: i) provide an example of how an application uses librina, and ii) allow testing and experimentation with the IRATI stack by measuring some properties of the IPC service as perceived by the application (flow allocation time, goodput in terms of bytes read/written per second, or mean delay). The two software packages are described in the following sections.

2.2.1 librina

The IRATI implementation provides the following libraries to support the operation of user-space daemons and applications:

- librina-application: Provides the APIs that allow an application to use RINA natively, enabling it to allocate and deallocate flows, read and write SDUs to those flows, and register/unregister to one or more DIFs.
- librina-ipc-manager: Provides the APIs that facilitate the IPC Manager in performing the tasks related to IPC Process creation, deletion and configuration.
- librina-ipc-process: The APIs exposed by this library allow an IPC Process to configure the PDU forwarding table (through Netlink sockets), to create and delete EFCP instances (also through Netlink sockets), to request the allocation of kernel resources to support a flow (through system calls) and so on.
- librina-faux-sockets: Allows adapting a non-native RINA application (a traditional UNIX socket-based application) to run over the RINA stack.
- librina-cdap: Implementation of the CDAP protocol.
- librina-sdu-protection: APIs and implementation to use the SDU-protection module in user space to protect and unprotect SDUs (add CRCs, encryption, etc.).
- librina-common: Common interfaces and data structures.
Librina enables applications and the different user-space components of the IRATI stack to communicate among themselves or with the kernel via Netlink sockets, system calls or sysfs, hiding the complexity of dealing with these mechanisms. Librina also provides an object-oriented wrapper of the underlying threading facilities (e.g. libpthread), allowing librina users to take advantage of concurrency mechanisms without further external dependencies.

Figure 15 High-level librina design

Figure 15 shows the librina high-level design. Librina allows its users to invoke remote operations (handled by other OS processes) or system functions as if they were local calls to normal C++ objects. These proxy objects use librina internals to either send a Netlink message or invoke a system call:

- Operations that result in system call invocations are synchronous. When the librina user invokes an operation of this type, the proxy class calls the librina internal handlers, which invoke the relevant system call and return the result to the proxy class. The proxy class, if required, formats the result and presents it to the caller.
- Operations that result in the generation of a Netlink message are asynchronous; the caller gets the result of the operation as an event. When the librina user invokes an operation of this type, the proxy class invokes the NetlinkManager, which provides an object-oriented wrapper to the libnl/libgnl libraries. The NetlinkManager generates a Netlink message and uses libnl to send the message to the destination Netlink port-id. At this point the proxy class operation returns a handle (an integer) to the caller, so that the caller can identify the event reporting the result of the operation.

Librina has an internal thread that continuously waits for incoming Netlink messages. When a message arrives, the thread is woken up, requests the NetlinkManager to parse the message and adds the resulting event to an internal events queue. The librina user can get the available events by invoking the 'eventPoll' (non-blocking) or 'eventWait' (blocking) operations in the API, which retrieve the element at the head of the events queue.

2.2.1.1 Bindings to other programming languages

The librina build framework has been enhanced to automatically generate bindings for interpreted languages through the use of SWIG [18]. SWIG is a software development tool that simplifies the task of interfacing interpreted languages such as Perl, Python, Java and Ruby to C/C++ libraries. In simpler terms, SWIG is a compiler that takes C/C++ declarations and creates the wrappers required to access those declarations. The wrappers SWIG generates are layered: the C/C++ declarations are bound to a C/C++ Low Level Wrapper (LLW) which, in turn, is connected, using the Native Interface (NI) semantics of the target language, to a High Level Wrapper (HLW). Both the LLW and the HLW depend on the target language, since they have to interact using different NIs (such as the JNI [23], the Python API [22], etc.). Depending on the complexity of the library interface, SWIG has to be suitably driven in order to produce good HL wrappers (and therefore target-language modules or libraries usable by the end-user). These directions are supplied in an additional file (the SWIG interface file, also called the ".i" file), which can easily become an additional maintenance burden.
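To give a flavour of such a file, the sketch below follows SWIG's canonical factorial example; the real librina interface files are far larger and drive C++ wrapping for Java.

/* example.i: a minimal SWIG interface file (sketch). */
%module example
%{
/* Code in this block is copied verbatim into the generated
 * low-level wrapper (example_wrap.c). */
#include "example.h"
%}

/* Declarations below are parsed by SWIG, which generates the
 * target-language bindings for them. */
extern int fact(int n);

Running, for instance, 'swig -python example.i' produces both the low-level and the high-level wrappers; Figure 16, described next, depicts exactly this pipeline.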
Figure 16 shows a simple example of wrapping a C software module. In the example, the input software module (i.e. example.c and example.h) exports a function (i.e. the fact() function), and the SWIG directions driving the production of the bindings are described in the interface file (i.e. the example.i file). When the SWIG executable is run with the interface file as input, it produces the low-level wrappers (i.e. example_wrap.c) and the high-level wrappers (i.e. example.py). Finally, the low-level wrappers are compiled as a dynamic library (i.e. libexample.so). This dynamic library is loaded, on demand, by the high-level wrappers once they are imported in the Python interpreter.

Figure 16 SWIG wrapping example

The wrapping of librina focused on minimising the effort while reducing the maintenance cost of the bindings support to the bare minimum. To reach these two goals, a) only the minimum corrections to the SWIG directions have been applied, and b) the stabilization of the librina API has been prioritized over its internal mechanisms. As a matter of fact, the current Java SWIG interface files are ~500 Lines Of Code (LOCs) long, which corresponds to ~13500 LOCs of automatically generated code (a ratio of written to produced lines of 1/27).

This approach has been embraced for a two-fold purpose: it allows developing new applications in different programming languages, better suiting the needs IRATI adopters may have, and it allows reusing, within the stack framework, existing software written in different languages. The availability of the IPC Manager and IPC Process daemons (as well as the RINABand testing application) in the Alba stack [20] was the primary driver of this choice.

Bindings for other high-level interpreted languages, such as Python, are expected to be introduced in the future. Given the reduced amount of wrapping directions needed for the Java parts and the better support SWIG has for other languages, their cost is expected to be far less than the cost of the Java bindings already introduced. The following figure depicts the updated librina software architecture with respect to the architecture described in deliverable D2.1.

Figure 17 The IRATI user-space stack software architecture

2.2.1.2 Detailed librina software design

Figure 18 provides a detailed overview of the internal components of librina, grouped in different categories. At the API level, user applications can find five types of classes:

- Model classes: These classes model objects that abstract different concepts related to the services provided by librina, such as application names, flow specifications, RIB objects, neighbours and connections. Model classes contain information on the modelled objects, but do not provide operations to perform actions other than updating or reading the object's state.
- Proxy classes: These classes model 'active entities' within librina, meaning that they provide operations to perform actions on these entities. These actions result in the invocation of librina internals, either to send a Netlink message to another user-space process or the kernel, or to invoke a system call.
For instance, librina-application provides an 'IPCManager' proxy class that allows an application process to request the allocation or deallocation of flows from the IPC Manager Daemon. Another example can be found in the 'IPC Process' class available in librina-ipc-manager: this proxy class allows the IPC Manager Daemon to invoke operations on the user-space or kernel components of an IPC Process.

- Event classes: As briefly introduced in section 2.2.1, librina is event-based. Invocations of proxy class operations that cause the emission of a Netlink message return right away, without waiting for the Netlink message response. The response is obtained later as one of the events received through the EventConsumer class. Event classes are the ones that encapsulate the information of the different events, discriminated by event type. Examples of events include the results of flow allocation/deallocation operations or the results of application registration/unregistration operations, to name a few.
- EventProducer: This class allows librina users to access the events originated by the responses to the operations requested through the proxy classes. The event producer provides blocking, non-blocking and time-bounded blocking operations to retrieve pending events.
- Concurrency classes: These classes provide an object-oriented wrapper to the OS threading functionalities. They are internally used by librina, but also exposed to librina users in case they want to use them as a way of avoiding external dependencies or intermixing different threading libraries (as is the case for the IPC Manager and IPC Process daemons).

Figure 18 librina detailed software design

The librina core components process two types of inputs: operations invoked via proxy classes at the API level, and Netlink messages received via the Netlink socket bound to librina, created at initialization time. Operations invoked via proxy classes can follow two processing paths, resulting either in the invocation of a system call or in the generation of a Netlink message. The former case is very simple: invocations of proxy operations are mapped to system call wrappers that make the required system call to the kernel (such as readsdu, writesdu, createipcprocess or allocateportid). The latter case involves more processing, as explained in the following:

- Message classes: These classes provide an object-oriented model of the different Netlink messages that can be sent or received by librina. The base message class 'BaseNetlinkMessage' models all the information required to generate/parse the header of a Netlink message, including the Netlink header (source port-id, destination port-id and sequence number), the Generic Netlink family header (family and operation code) and the RINA family header (source and destination IPC Process ids). The different message classes extend the base class by modelling the information that is sent/received as Netlink message attributes in the different messages.
- NetlinkManager: This class provides an object-oriented wrapper of the functions available in the libnl/libgnl libraries (these libraries provide functions to generate, parse, send and receive Netlink messages). The wrapping is partial, since only the functionality required by librina has been wrapped.
In the 'output path', the NetlinkManager takes a message class, generates a buffer, adds the NL message header to the buffer, passes the message class and the buffer to the NL formatter classes (which add the NL attributes to the buffer) and finally passes the buffer to libnl to send the message. In the 'input path', upon calling the blocking 'getMessage' operation, the NetlinkManager blocks until libnl returns a buffer containing a NL message; it then parses the header, requests the NL parser classes to parse the NL attributes, and returns the appropriate message class.

- NetlinkMessage Parsers/Formatters: The goal of these classes is either to generate the attributes of a NL message based on the contents of a message class (formatting role) or to create and initialize a message class based on the attributes of a NL message (parsing role).

In order to ensure that all the NL messages are received in a timely fashion, librina-core has an internal thread that continuously calls the blocking NetlinkManager 'getMessage' operation. When the operation returns, the thread converts the resulting message class into an event class and puts the event in an internal events queue. When a librina user calls the EventConsumer to retrieve an event, the EventConsumer tries to retrieve an element from the events queue by invoking the eventPoll (non-blocking), eventWait (blocking) or eventTimedWait (blocking but time-bounded) operation.

All librina components use an internal lightweight logging framework instead of an external one, in order to minimize librina dependencies, since the goal is to facilitate its deployment on several OS/Linux systems.

2.2.2 rinad

Rinad contains the user-space daemons of the IRATI implementation, as well as two example applications that are also useful for testing the stack. Two types of daemons in user space are responsible for implementing the IPC Manager and the layer management IPC Process functionality:

- IPC Process Daemon: Implements the layer management parts of an IPC Process, which comprise the functions that are more complex but execute less often. These functions include the RIB and the RIB Daemon, Enrollment, Flow Allocation, Resource Allocation and the PDU Forwarding Table computation. There is one instance of the IPC Process Daemon per IPC Process in the system.
- IPC Manager Daemon: The central point of management in the system. It is responsible for the management and configuration of IPC Process daemons and kernel components, and hosts the local management agent and the IDD. There is one instance of the IPC Manager Daemon per system, but it could also be replicated in the future to increase availability.

The final goal of the IRATI prototype is to produce a C++ implementation of these daemons (for the 3rd prototype, end of 2014). This will maximize the portability of the IRATI stack to different POSIX-based systems, as well as minimize the code footprint and external dependencies. However, the aggressive timing of the 1st prototype, as well as the requirement for the first experimentation round to happen between months 7 and 11 of the project, made the consortium take a different approach for D3.1: reusing part of the code of the Alba RINA stack [20] to implement the IRATI user-space daemons.

Figure 19 Alba RINA stack high-level software design

Alba is a work-in-progress pure Java RINA prototype developed by i2CAT and the TSSG.
All the RINA functionalities are implemented as a single OS process running the Java Virtual Machine (JVM), as depicted in Figure 19. Java applications can use an Alba-provided Java library to invoke the RINA functionality. This library communicates with the RINA OS process by means of local TCP connections. The Alba RINA prototype can only operate over TCP/UDP, since it is user-space only and therefore limited to what the OS sockets API can provide. Section 2.2.2.2 provides further detail on how the Alba code has been integrated into IRATI's phase 1 IPC Process and IPC Manager Daemons.

2.2.2.1 Package components and build framework

The librina enhancements introduced in section 2.2.1.1 allowed the reuse of the Java code present in the Alba RINA stack. Such code, rearranged in order to use the librina bindings, has been introduced into the IRATI stack as the additional rinad package. This package provides:

- The IPC Process and the IPC Manager daemons, as described in D2.1.
- The RINABand testing application (for complex bandwidth-related tests).
- Echo client and server applications (for simple connectivity and configuration-related tests).

The configuration and building framework templates previously used in librina have been updated to cope with the different rinad requirements. The configuration framework, initially planned only for the C/C++ based code in librina, has been enhanced to look for JAR-building related tools (such as the Java compiler). Maven (http://maven.apache.org) has been selected as the building framework for the Java parts. The rinad package therefore relies on autotools for configuration on the one hand, and on Maven for building on the other.

2.2.2.2 RINA Daemons software design

The IPC Manager Daemon is the main entity responsible for managing the RINA stack in the system. It manages the IPC Process lifecycle, acts as the local management agent for the system and is the broker between applications and IPC Processes (filtering the IPC resources available to the different applications in the system). As introduced in section 2.2.2, the first phase prototype of the IPC Manager has been developed in Java, leveraging part of the Alba prototype codebase. Moreover, the current IPC Manager Daemon is not a complete implementation, since it does not implement the local management agent yet (therefore the RINA stack cannot be managed through a centralized DIF Management System).

Figure 20 shows a schema of the detailed IPC Manager Daemon software design. It is a Java OS process that leverages the operations provided by the librina API through the wrappers generated by SWIG and the Java Native Interface (JNI). Concretely, librina-ipc-manager provides the following proxy classes to the IPC Manager Daemon:

- IPC Process Factory: Enables the creation, destruction and enumeration of the different types of IPC Processes supported by the system.
- IPC Process: Allows the IPC Manager to request operations from IPC Processes, such as assignment to DIFs, configuration updates, enrolment, registration of applications or allocation/deallocation of flows.
- Application Manager: Provides operations to inform applications about the results of pending requests, such as allocations of flows or registrations of applications.
Figure 20 IPC Manager Daemon detailed software design

When the IPC Manager Daemon initializes, it reads a configuration file from a well-known location. This configuration file provides default values for system parameters, describes the configurations of well-known DIFs and controls the behaviour of the IPC Manager bootstrap process. The latter is achieved by specifying:

- The IPC Processes that have to be created at system start-up, including their name and type.
- For each IPC Process to be created, the names of the N-1 DIFs where the IPC Process has to be registered (if any).
- For each IPC Process to be created, the name of the DIF that the IPC Process is a member of (if any). If the IPC Process is assigned to a DIF, it will be initialized with an address and all the other information required to start operating as a member of that DIF (DIF-wide constants, policies, credentials, etc.).

When the bootstrapping phase is over, the IPC Manager main thread starts executing the event loop forever. The event loop continuously polls librina's EventProducer (in blocking mode) to get the events resulting from Netlink request messages sent by applications or IPC Processes. When an event happens, the event loop checks its type and delegates its processing to one of the specialized core classes: the Flow Manager (flow-related events), the Application Registration Manager (application-registration related events) or the IPC Process Manager (IPC Process lifecycle management related events). The processing performed by these core classes typically results in the invocation of one of the operations provided by the librina-ipc-manager proxy classes previously described in this section.

Local system administrators can interact with the IPC Manager through a Command Line Interface (CLI), accessible via telnet. This console provides a number of commands that allow system administrators to query the status of the RINA stack in the system, as well as to perform actions that modify its configuration (such as creating/destroying IPC Processes, assigning them to DIFs, etc.). The IPC Manager supports the CLI console through a dedicated thread that listens on the console port; only one console session at a time is supported at the moment.

The current IPC Manager has leveraged the following Alba components, adapting them to the environment of the IRATI stack:

- The configuration file format, parsing libraries and model classes (the configuration file uses JSON, the JavaScript Object Notation).
- The Command Line Interface server thread and related parsing classes.
- The bootstrapping process.
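To give a flavour of such a JSON bootstrap configuration, the fragment below sketches a system that creates one shim and one normal IPC Process at start-up. All key names and values are hypothetical illustrations, not the actual IRATI schema.

{
  "consolePort": 32766,
  "ipcProcessesToCreate": [
    {
      "type": "shim-eth-vlan",
      "applicationProcessName": "test-eth-vlan",
      "difName": "100"
    },
    {
      "type": "normal-ipcp",
      "applicationProcessName": "test1.IRATI",
      "difName": "normal.DIF",
      "difsToRegisterAt": [ "100" ]
    }
  ],
  "difConfigurations": []
}

Note how, for the shim, the DIF name doubles as the VLAN id ("100" here), matching the shim semantics described in section 2.1.8; the per-DIF configurations are elided.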
Figure 21 IPC Process Daemon detailed software design

The IPC Process Daemon performs the layer management functions of a single IPC Process. It is therefore "half" of the IPC Process application, while the other half, dealing with data-transfer and data-transfer-control related tasks, is located in the kernel. Layer management operations are more complex and do not have the stringent performance requirements of data transfer operations; locating them in user space is therefore a logical choice, as introduced in D2.1. Figure 21 depicts the detailed software design of the IPC Process Daemon.

The first phase prototype follows the same approach taken with the IPC Manager Daemon design and implementation: leveraging the Alba stack as much as possible in order to provide a simple but sufficiently complete implementation of the IPC Process Daemon. The IPC Process Daemon is therefore also a Java OS process that builds on the APIs exposed by librina through SWIG and the JNI. The librina proxy classes described below are the most relevant to the IPC Process Daemon operation:

- IPC Manager: Allows the IPC Process Daemon to communicate with the IPC Manager Daemon, mainly to inform the latter about the results of requested operations, but also to notify it about incoming flow requests or flows that have been deallocated.
- Kernel IPC Process: Provides operations that enable the IPC Process Daemon to communicate with the data-transfer/data-transfer-control related functions of the IPC Process in the kernel. The APIs allow the IPC Process Daemon to modify the kernel IPC Process configuration, to manage the setup and teardown of EFCP connections, and to modify the PDU forwarding table.

IPC Process Daemons are instantiated and destroyed by the IPC Manager Daemon. When the IPC Process Daemon has completed its initialization, the main thread starts executing the event loop. This loop is implemented by continuously polling the EventProducer for new events (in blocking mode) and processing them when they arrive. The event processing is delegated to the classes implementing the different layer management functions: the Enrollment Task, the Resource Allocator, the Registration Manager, the Flow Allocator and the PDU Forwarding Table Generator. The processing performed by these classes typically involves two types of actions:

- Local actions, resulting in communications with the Kernel IPC Process or the IPC Manager, achieved via the librina proxy classes.
- Remote actions, resulting in communications with peer IPC Process Daemons, achieved via the RIB Daemon.

The RIB Daemon is an internal component of the IPC Process that provides an abstract, object-oriented schema of all the IPC Process state information. This schema, known as the Resource Information Base or RIB, allows IPC Processes to modify the state of their peers by performing operations on one or more of the RIB objects. The Common Distributed Application Protocol (CDAP) is the application protocol used to exchange the remote RIB operation requests and responses between peer IPC Processes. This protocol allows six remote operations to be performed on RIB objects: create, delete, read, write, start and stop. The objects that are the target of an operation are identified by the following attributes:

- Object class: Uniquely identifies a certain type of object.
- Object name: Uniquely identifies the instance of an object of a certain class. The object class + object name tuple uniquely identifies an object within the RIB.
- Object instance: A shorthand for object class + object name, to uniquely identify an object within the RIB.
- Scope: Indicates the number of 'levels' of the RIB affected by the operation, starting at the specified object (object class + name, or instance). This allows a single operation to target multiple objects at once.
- Filter: Provides a predicate that evaluates to 'true' or 'false' based on the values of the object attributes.
This allows further discrimination of the objects to which the operation has to be applied. More information about the RIB, the RIB Daemon and CDAP can be found in D2.1 [3].

CDAP is implemented as a library that provides a CDAPSessionManager class managing one or more CDAP sessions. The CDAPSession class implements the logic of the CDAP protocol state machine as defined in the CDAP specification [33]. CDAP can be encoded in multiple ways, but the IRATI stack follows the approach adopted by the other current RINA prototypes and uses Google Protocol Buffers (GPB) [31]. This decision makes interoperability possible and also provides the benefits of GPB: efficient encoding and a proven, mature and scalable technology with good-quality open source parsers/generators available.

In addition to the information of the operation and the identity of the targeted objects, CDAP messages can also transport the actual values of such objects. Therefore, the object values also need to be encoded in binary format. Again, GPB is the initial encoding format chosen, although others are also possible (ASN.1, XML, JSON, etc.). Object encoding functionalities are implemented by the Encoding support library, which provides an encoding-format-neutral interface. It thus allows several encoding implementations to be plugged in and out, specifying which one to use at configuration time.

The RIB is implemented as a map of object managers, indexed by object name; current RINA implementations have adopted the convention of making object names unique within the RIB as a simplifying assumption. Each object manager wraps a piece of state information (for example flows, application registrations, QoS cubes, the PDU Forwarding Table, etc.) with the RIBObject interface. This interface abstracts the six operations provided by CDAP: create, delete, read, write, start and stop. When a remote CDAP message reaches the IPC Process Daemon, the message is handed to the RIB Daemon component. The RIB Daemon retrieves the object manager associated with the targeted object name from the RIB map and invokes the requested CDAP operation. The goal of the object manager is to translate each CDAP operation into the appropriate actions on the layer management function classes.

The layer management function classes use the RIB Daemon when they have to invoke a remote operation on a peer IPC Process. The RIB Daemon provides operations to send CDAP messages to a neighbour IPC Process based on its application process name. When such operations are called, the RIB Daemon internally fetches the port-id of the underlying N-1 flow that allows the IPC Process to communicate with the given neighbour, encodes the CDAP message and requests the kernel to write the encoded CDAP message as an SDU to that N-1 flow.

The current IPC Process Daemon has leveraged the following Alba components, adapting them to the environment of the IRATI stack:

- Supporting classes: the CDAP library and the Encoder library.
- The RIB Daemon and RIB implementation classes.
- Layer management function classes: the Enrollment Task, Resource Allocator, Flow Allocator and Registration Manager.

The PDU Forwarding Table Generator implementing the specification of D2.1 is not part of Alba. The implementation of this component is currently in progress and will be part of prototype 2 (D3.2).
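Although the phase 1 daemons are written in Java, the shape of the RIBObject interface and of the name-indexed RIB lookup can be sketched compactly. The C-style declarations below are purely illustrative, with assumed names, mirroring the six CDAP operations described above.

/* Illustrative sketch only (the actual code is Java). */
struct rib_object;              /* wraps one piece of state information */
struct cdap_message;

struct rib_object_ops {
        int (* create)(struct rib_object * o, const struct cdap_message * m);
        int (* delete_)(struct rib_object * o, const struct cdap_message * m);
        int (* read)(struct rib_object * o, struct cdap_message * reply);
        int (* write)(struct rib_object * o, const struct cdap_message * m);
        int (* start)(struct rib_object * o, const struct cdap_message * m);
        int (* stop)(struct rib_object * o, const struct cdap_message * m);
};

/* The RIB itself: object managers indexed by their (unique) object
 * name; the RIB Daemon resolves the target of an incoming CDAP
 * message with a lookup of this kind and then invokes the matching
 * operation from the ops table. */
struct rib_object * rib_lookup(const char * object_name);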
2.3 Package inter-dependencies

The IRATI stack is now split into three different software packages: the kernel, librina and rinad. This partitioning allows IRATI adopters to pick and import only the parts they require into their solutions, with less effort compared to monolithic package solutions. The following figure shows the inter-package run-time dependencies.

Figure 22 Packages runtime inter-dependencies

The building dependencies among the packages (e.g. rinad depends on the availability of librina, already installed in the system) are maintained and checked by each package's configuration framework (i.e. autotools). Some of the features the packages provide are optional and can be explicitly disabled at configuration time. Moreover, the configuration system automatically adapts to systems with reduced (optional) requirements. The following figure shows the packages' build-time first-level inter-dependencies. Refer to the documentation accompanying the sources (e.g. the README files) for further details.

Figure 23 Packages build-time inter-dependencies

The auto-configuration features described in sections 5.2 and 5.3 have been introduced into the user-space packages (i.e. librina and rinad) to allow compiling them on different OS/Linux systems. The packages can be built on reasonably updated systems other than the one selected by the IRATI project partners for the development and integration activities (i.e. Debian/Linux).

3 The development and release model

The whole development of the IRATI stack is managed with the git SCM [5]. Git is a free and open source distributed version control system (DVCS) designed to handle everything from small to very large projects with speed and efficiency. Please refer to [6] for further information. The IRATI project git repository is currently hosted as a private repository at GitHub [7]. The repository will not be opened up until prototype 2 (D3.2) is delivered, since the IRATI consortium wants to guarantee minimal standards of quality and a certain stability of the APIs in order to make the prototype usable by the interested stakeholders. To get a copy of the repository, the following command must be issued:

git clone git@github.com:dana-i2cat/irati.git foo

Upon successful completion, the foo directory will hold a local copy of the project source base as well as the complete commit history since the development's very beginning. The 'git log' command or Graphical User Interface (GUI) programs, such as qgit [10] or gitk [9], can be used to navigate the commit history, retrieve logs, tag information, etc.

3.1 The git workflow

The git workflow agreed among the partners varies depending on different conditions, such as deadline prioritization, the concomitant introduction of new features and stabilization fixes, temporary workarounds, etc. The following sections describe the usual git workflow.

3.1.1 Repository branches

The development, integration and testing procedure shared among the partners mostly pivots around the following repository branches:

- wip (work in progress): This branch holds the in-progress work. Such code can contain temporary workarounds, fixes, partial implementations, etc.
The wip branch is used to integrate all the partner contributions and, once declared sufficiently stable, gets merged into the irati branch.
irati: This branch holds the "stable" IRATI software. The master branch sources, as well as all the partner contributions, are merged into irati. After passing the integration tests, the irati branch software is finally released to WP4.
master: This branch is used to hold the third-party (mainstream) sources as they are, e.g. the vanilla Linux sources.
An example of the usual approach can be summarized with the following events:
The partners agree on importing a new mainstream release (e.g. for bug-fixes or stabilization): the mainstream source code is integrated into the master branch, which successively gets propagated (merged) into the irati branch and then into wip. For example, a notable mainstream update can be represented by the introduction of a new Linux kernel version.
A set of contributions in wip is declared stable: the wip branch is merged into irati.
Important fixes have been indicated by the WP4 testing activities: the fixes are applied, as point solutions, to the irati branch, which successively gets merged into wip.
3.1.2 Internal software releases
Due to the cooperative nature of the stack's components and the deployment constraints in the project's testbed, work-packages WP3 and WP4 agreed on exchanging software releases directly through tagged repository versions. Such tags have the following format:
v<MAJOR>.<MINOR>.<MICRO>
where:
MAJOR: is incremented when the software base reaches a mature level of stability.
MINOR: is incremented when all the planned functionalities have been integrated and tested, and the resulting prototype is sufficiently stable to be used without major faults or crashes.
MICRO: is incremented when a set of contributions (representing a newly introduced functionality) gets merged into the wip branch.
Each tag also reports a message briefly describing the changes involved in the release (which can be analyzed in detail through the use of the 'git log' command), e.g.:
git tag -l -n1 | sort -V
...
v0.1.9   Version bump (fullchain)
v0.2.0   Version bump (demo ready)
...
v0.2.5   Version bump (arp826, rinarp and shim-eth-vlan fixes)
...
v0.4.11  Version bump (user-space fixes)
v0.4.12  Version bump (IPCP-loop ready)
The whole source code base, referring to a particular tag, can be accessed with the following command:
git checkout <tag>
The tag-based release approach allows releasing software snapshots to WP4 often and with minimal expense in terms of time and effort.
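By way of illustration, the stabilization-and-release flow described above maps onto standard git commands such as the following (a representative sequence; the exact options used by the partners may differ, and the tag name and message are taken from the listing above):
git checkout irati
git merge wip
git tag -a v0.2.0 -m "Version bump (demo ready)"
git push origin irati v0.2.0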
3.1.3 Issues management
The github issue and repository tracking services [19] make it possible to bind Software Problem Reports (SPRs), such as feature requests or bug reports, to milestones, as well as to automatically refer to SPRs from commits. The available features and automations led the partners to embrace the github services for managing all the IRATI software.
4 The environments
In the following sections the kernel-space and user-space building, development and testing environments are described.
4.1 The build environments
4.1.1 The kernel-space configuration and building
The Linux kernel has two ad-hoc systems for configuration and building purposes: the Kconfig and Kbuild frameworks respectively. The Kconfig framework [2] allows selecting kernel compilation-time features with a User Interface (UI) that drives the selections and automatically enables/disables dependent configuration entries (e.g. the IPv4 support can be selected only if the network support has been previously enabled, the FS support is automatically enabled by default etc.). The Kbuild framework [1] is a make [17] wrapper specifically tailored for building the kernel image file and the dynamically loadable modules, handling all their peculiarities (e.g. embedded linking scripts, module symbols mangling, firmware embedding, dynamic modules generation, syscalls table generation, source files auto-generation, static stack checks).
The kernel-space parts of the IRATI stack are contained in the linux/net/rina repository path, which holds the source code, the Kconfig file(s) and the Kbuild file(s). The current build setup provides the final user with the following dynamically loadable kernel modules:
rina-personality-default: The IRATI personality. It holds all the base components of the stack, such as the framework libraries, RNL, the IPCP Factories, EFCP, RMT, PDU-FWD-T, KIPCM, KFA etc. Prerequisites: none.
normal-ipcp: The Normal IPC Process. Prerequisites: rina-personality-default.
shim-dummy: The Shim Dummy IPC Process. Prerequisites: rina-personality-default.
shim-eth-vlan: The Shim Ethernet IPC Process. Prerequisites: rina-personality-default, rinarp.
shim-tcp-udp: The Shim TCP/UDP IPC Process. Prerequisites: rina-personality-default.
rinarp: The RINARP component. Prerequisites: arp826.
arp826: The ARP826 component. Prerequisites: none.
The modular approach can be summarized in the following steps.
Upon start-up/loading:
1. The kernel image is loaded: the IRATI stack core, which is built into the kernel image, initializes the framework, the KIPCM, the KFA and the RNL layer. Please note that the stack core cannot be a dynamically loadable module since it has to publish the RINA system calls into the global syscalls table; the syscalls table is static, produced at compilation time.
2. Upon rina-personality-default module loading: the personality hooks get registered into the core.
3. Upon IPC Process module loading (i.e. normal-ipcp, shim-dummy or shim-eth-vlan): the module registers its hooks with the IPC Process Factory.
Upon shutdown/unloading:
4. Whenever an IPC Process module is explicitly unloaded, the module deregisters as an IPC Process Factory from the system.
5. Once all the IPC Process Factories are removed, the rina-personality-default module can be unloaded from the system.
The following figure depicts the aforementioned steps.
Figure 24 The module loading/unloading steps
4.1.2 The user-space configuration and building
All the IRATI stack packages in user-space share the same configuration and building frameworks. These frameworks are built with the GNU build system, also known as the "autotools" suite. The autotools suite is a set of development tools designed to assist developers and package maintainers in making source-code packages portable to many Unix-like systems.
The suite is designed to make it possible to build and install a software package by running the following shell commands:
./configure && make && make install
The autotools suite is composed of the following tools:
Autoconf: Autoconf [13] is an extensible package that produces POSIX shell scripts to automatically configure software source code packages. These scripts can adapt the packages to many kinds of UNIX-like systems without manual user intervention. Autoconf creates a configuration script (i.e. configure) from a template file that lists the operating system features that the package can use.
Automake: Automake [14] is a tool for automatically generating Makefiles. Each Automake input file is basically a series of make [17] variable definitions, with rules being thrown in occasionally. The goal of Automake is to remove the burden of Makefile maintenance.
Libtool: Libtool [15] simplifies the shared library generation procedure by encapsulating both the platform-specific dependencies and the library user interface in a single point. Libtool is designed so that the complete functionality of each host type is available via a generic interface, while the various systems' quirks are hidden from the developer.
pkg-config: Pkg-config [16] is a helper tool that provides a unified interface for querying installed libraries, mainly for the purpose of compiling software from its source code. It simplifies the development of the autoconf checks, which are performed at configuration time. The tool is language-agnostic, therefore it can also be used for defining the location of documentation, interpreted language modules, configuration files etc.
Please refer to each package's documentation [13-16] for further details.
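For example, once librina has been installed (see section 5.2), its compilation and linking flags can be queried through pkg-config. The module name librina below is an assumption about the name of the installed pkg-config data file (ref. section 5.5); PKG_CONFIG_PATH has to include the installation prefix when a non-default one is used:
PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig pkg-config --cflags --libs librina
This is presumably the kind of check that the rinad configure script performs at configuration time to locate librina (ref. section 5.3.3).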
4.2 The development and testing environments
The WP3 development and testing environment consists of a Debian-testing (currently codenamed "Jessie") based VM image (VMI). The VMI is updated upon the introduction of new development or testing requirements (i.e. external tools) and shared among the partners through the project FTP server [12]. All the VMIs contain the prerequisite tools required to build and test the stack. The partners agreed on using VirtualBox [21] as the virtualization solution for running the IRATI Virtual Appliances (VAs), since it allows easily setting up environments that suit the WP3 integration-testing scope. The IRATI stack requirements do not imply ad-hoc virtualization architectures, therefore other virtualization facilities can be applied.
5 Installing the software
As previously described, the IRATI stack is composed of the Linux kernel, the librina package and the rinad package. In the following sections, specific per-package configuration and building instructions are described.
5.1 Installing the kernel
The configuration and building system of the kernel-space parts of the IRATI stack strictly follows the Linux one. Therefore, the following steps – as well as almost all the documentation freely available on the World Wide Web – can be followed. From the linux root folder:
1. Either:
   a. copy the config-IRATI template file onto the default configuration file (i.e. 'cp config-IRATI .config'), or
   b. configure the kernel with the UI-based tool (i.e. 'make menuconfig').
2. Type 'make bzImage modules' in order to build the kernel and all the modules.
3. Install the kernel and all the modules in the target system by typing 'make modules_install install'.
Please note that OS/Linux distributions do not usually require additional bootloader setup and the 'make modules_install install' command is enough to a) install the new kernel and b) set up the bootloader in order to allow booting the kernel. However, the bootloader configuration procedure may vary and different procedures may have to be followed. Please refer to the distribution documentation for specific instructions and troubleshooting.
5.2 Installing the librina package
The following sections summarize the steps required for installing librina from a repository checkout.
5.2.1 Bootstrapping the package
The sources, as cloned from the repository, do not contain the necessary setup that would allow a successful package configuration. In order to set up the sources environment, run the 'bootstrap' script available in the package root as follows:
./bootstrap
5.2.2 Prerequisites for the configuration
The librina package requires the following external packages (already installed into the development VMI) in order to complete the configuration process successfully:
1. A C++ compiler
2. libnl-3
3. libnl-3-genl
4. swig >= 2.0 (excluding versions in the range [2.0.4, 2.0.8])
5.2.3 Installation procedure
Once the package has been "bootstrapped", as described in section 5.2.1, the usual "autotools" procedure must be followed to configure, build and install the package:
1. ./configure
2. make
3. make install
Please refer to the configure script help for further details on the available command line options. The configure help can be obtained with the following command:
./configure --help
The configure script assumes default parameters if no options are given. The package installation paths are set to the system defaults. If the paths have to be changed, in order to suit a different installation, the --prefix configure option can be used as follows:
./configure --prefix=<PREFIX>
5.3 Installing the rinad package
The following sections summarize the steps required for installing rinad from a repository checkout.
5.3.1 Bootstrapping the package
The sources, as cloned from the repository, do not contain the necessary setup that would allow a successful package configuration. In order to set up the sources environment, run the 'bootstrap' script available in the package root as follows:
./bootstrap
5.3.2 Prerequisites for the configuration
The rinad package requires the following external packages (already installed into the development VMI) in order to complete the configuration process successfully:
1. librina
2. The Java virtual machine (JRE >= 6)
3. The Java compiler (javac)
4. Maven
5.3.3 Installation procedure
Since rinad relies on the availability of the librina wrappers (i.e. the shared librina low-level wrapper libraries as well as the high-level Java JAR wrappers), the configure script has to be instructed on the paths where to look for the wrappers. With reference to the PREFIX variable value used during the librina configuration process (as described in section 5.2.3), the following commands allow configuring the rinad package:
1. PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PREFIX/lib ./configure --prefix=$PREFIX
2. make
3. make install
Please note that the rinad package relies on Maven to compile all the Java parts and, as a side effect, Maven will download all its prerequisites during the execution of the 'make' command (ref. section 5.2.3, step 2).
5.4 Helper scripts for librina and rinad installation
The repository source directory contains a couple of helper scripts summarizing the installation and removal procedures of the user-space IRATI stack packages:
install-from-scratch: this script installs both the librina and rinad packages. The script takes, as an optional parameter, the absolute path of the installation (i.e. the package PREFIX).
uninstall-and-clean: this script is the opposite of install-from-scratch; it uninstalls all the software previously installed and cleans up the source directories.
5.5 Layout of folders and files in the installation directory
With reference to the PREFIX variable used during the installation, the following paths are utilized during the user-space packages installation procedure:
bin: Holds the different binaries produced by the packages, e.g. echo-client, echo-server, ipcmanager, ipcprocess, rinaband-client and rinaband-server.
etc: Holds the configuration files used by the different programs (e.g. the IPC Manager configuration file ipcmanager.conf).
include/librina: Holds all the header files of librina.
lib: Contains the librina static/shared libraries.
lib/pkgconfig: Holds the pkg-config data files.
share/librina: Contains the JARs of the librina Java wrappers.
share/rinad: Contains the JARs required to run the various daemons (i.e. the IPC Manager and the IPC Process) and applications (i.e. the RINABand and Echo applications).
6 Loading and running the stack
At the end of the system bootstrap procedure, the following modules have to be loaded in order to get a fully functional IRATI stack:
1. modprobe rina-personality-default
2. modprobe shim-eth-vlan
3. modprobe normal-ipcp
The aforementioned steps automatically load all the necessary prerequisite modules, as described in section 4.1.1 (e.g. the loading of the shim-eth-vlan module triggers the loading of the rinarp module, which in turn causes the loading of the arp826 module). Please note that if the kernel is a non-modular one, the aforementioned commands are unnecessary: all the IRATI kernel components will be built into the kernel image and initialized during the system bootstrap phase.
In order to have a functional stack, the remaining IPC Manager and IPC Process daemons have to be executed. Since the IPC Manager automatically starts a new IPC Process when needed, the only remaining command that has to be executed is the following:
$PREFIX/bin/ipcmanager
Please note that the IPC Manager daemon relies on a configuration file, located in the $PREFIX/etc path, which may have to be changed in order to suit specific needs.
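As a quick sanity check (a suggested step, not part of the official procedure), the module loading can be verified with standard tools before starting the daemons:
lsmod | grep -e rina -e shim -e arp826
dmesg | tail
The module list should include shim-eth-vlan together with the automatically loaded rinarp and arp826 prerequisite modules.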
7 Testing the stack
With reference to a VirtualBox [11] based environment, the following instructions can be used to set up an environment suitable for testing the prototype between two Virtual Machines (VMs) intercommunicating through a VLAN, as depicted in the following figure.
Figure 25 Testing environment
Please note that the instructions are general and can also be used in a real hardware deployment, with just the minimal changes required to obtain the correct network setup.
7.1 VirtualBox configuration
The two VMs must have two NICs each. The first adapter will be used for accessing the outside world (e.g. for the Maven prerequisites retrieval during the rinad package installation) while the second will be used for testing the stack, using the Shim Ethernet IPC Process. Therefore, the VirtualBox configuration for each VM is assumed to be the following:
eth0: A NAT-ed interface, with access to the Internet through the physical host.
eth1: An interface directly connected to the VirtualBox internal network. The VLAN will be configured on this interface.
The VM images (VMIs) available for download from [12] and the instructions presented in sections 5 and 6 can be used to set up the testing environment. Please refer to [21] for instructions on the setup of the VMs' NICs.
7.2 VMs configuration
In order to properly configure the guest OS with the setup required for the test, the following steps can be followed in one VM:
1. Enable the 8021q module (i.e. modprobe 8021q).
2. Add VLAN 98 to interface eth1 (i.e. vconfig add eth1 98).
3. Enable and configure eth1.98:
   3.1 ifconfig eth1.98 up
   3.2 ifconfig eth1.98 mtu 1496 (only needed if the network card does not support a 1504-byte MTU)
The configuration of the other VM is the same.
7.3 IPC Manager configuration file
In order to facilitate the testing, the IPC Manager can be configured to automatically create a shim-Ethernet-over-VLAN IPC Process and assign it to a DIF. This way, applications can start using it as soon as the IPC Manager completes its initialization. The aforementioned feature can be enabled by editing the IPC Manager configuration file (ref. section 5.5) as follows:
{
  "localConfiguration" : {
    "installationPath" : "/usr/local/irati/share/rinad",
    "libraryPath" : "/usr/local/irati/lib",
    "consolePort" : 32766
  },
  "ipcProcessesToCreate" : [ {
    "type" : "shim-eth-vlan",
    "applicationProcessName" : "test",
    "applicationProcessInstance" : "1",
    "difName" : "98"
  } ],
  "difConfigurations" : [ {
    "difName" : "98",
    "configParameters" : [ {
      "name" : "interface-name",
      "value" : "eth1"
    } ]
  } ]
}
This configuration file instructs the IPC Manager to create a Shim Ethernet IPC Process with application process name "test" and application process instance "1", and to assign it to the DIF "98" (ref. ipcProcessesToCreate). The configuration of the DIF "98" is described below it (ref. difConfigurations), with a single configuration parameter that binds the shim IPC Process to the "eth1" network interface (ref. configParameters).
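On recent systems where the vconfig tool is unavailable (it has been deprecated in favour of the iproute2 suite), the same VLAN setup can be obtained with the following equivalent commands; this is a suggested alternative to steps 2 and 3 above, not part of the procedure agreed within the project:
ip link add link eth1 name eth1.98 type vlan id 98
ip link set eth1.98 up
ip link set eth1.98 mtu 1496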
7.4 Running the test
Depending on the PREFIX path used during the installation (ref. sections 5.2 and 5.3), the IRATI stack binaries will be located in a certain directory. In the following, the IRATI_BIN shell variable is assumed to be set to $PREFIX/bin.
1. On both VMs, start the IPC Manager: $IRATI_BIN/ipcmanager
2. On VM1, start the "echo application" in server mode: $IRATI_BIN/echo-server
3. On VM2, start the "echo application" in client mode: $IRATI_BIN/echo-client
Please note that the IPC Manager daemon automatically starts instances of the IPC Process daemon, thus there is no need to start the IPC Process daemon manually. The echo-client and echo-server will start exchanging SDUs over eth1.98. A traffic sniffer such as tcpdump [24] or Wireshark [25] can be used to analyze the RINA-related traffic flowing through eth1.98.
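For instance, the frames exchanged by the shim IPC Processes can be captured on the VLAN interface with standard tcpdump options (printing the link-level headers and the packet contents in hex and ASCII):
tcpdump -i eth1.98 -e -XX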
8 Conclusions and future work
The software components required for the first phase IRATI project prototype have been designed, developed, integrated and functionally tested. Although the design and development activities progressed without major problems and deviations, the unsuitability of the Linux ARP implementation for the scope of the Shim Ethernet IPC Process introduced unplanned developments, slightly delaying the prototype release date. These delays, coupled with the project's tight timings, induced a prioritization among the features to be delivered in the first phase.
The prototype now implements core parts of a RINA stack over Ethernet for a Linux-based OS, provides a solid backbone for the upcoming developments and supports the creation of DIFs over Ethernet with levels of service similar to UDP. Its DTCP implementation is under active development, placeholders within CDAP have been defined, and authentication mechanisms are going to be added shortly. Routing and multi-homing support has been postponed for inclusion in the phase 2 prototype, in order to stabilise the groundwork first.
The software prototype meets both the features and the stability requirements for experimentation and has been delivered to WP4. Results of the experimentation phase are described in deliverable D4.1 [30]. The core functionalities described in this document will be further enhanced and the current prototype will be incrementally updated with additional features towards fulfilling the project's goals. Apart from these planned features, the partners agreed on a set of improvements to be applied to the source code base. These enhancements can be summarized as follows:
Incrementally redesign the ingress/egress DU loop, in order to obtain a distributed architecture that does not incur the bottlenecks caused by the current centralized DU-loop approach (i.e. the KIPCM and the KFA sustain the DU ingress and egress loops, as described in section 2.1.5).
Reduce the kernel-space memory footprint, by embracing techniques such as flyweight objects, Copy-On-Write (COW) memory optimization strategies, object caching etc.
Provide an improved sysfs integration, in order to properly expose the stack objects' internals to the user. This would enable greater interaction, allowing either to read the objects' status or to change their exported parameters while the stack is running (i.e. dynamically profiling the running stack).
Improve the user-space performance by translating the IPC Process and IPC Manager daemons from Java to C++.
Improve the stack's POSIX compliance and further decouple the kernel-space components from the underlying kernel functionalities/libraries, in order to ease porting the stack to other POSIX-compliant systems (Debian GNU/kFreeBSD, FreeBSD etc.).
The updated prototypes will be made available in the next periods as part of the upcoming project deliverables D3.2 and D3.3.
9 References
1. The Linux Kbuild building framework – https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt
2. The Linux Kconfig configuration framework – https://www.kernel.org/doc/Documentation/kbuild/kconfig.txt
3. IRATI, Deliverable D2.1 – http://irati.eu/wp-content/uploads/2012/07/IRATI-D2.1.pdf
4. RFC 3549, Linux Netlink as an IP Services Protocol – http://www.ietf.org/rfc/rfc3549.txt
5. git – http://git-scm.com/
6. git documentation – http://git-scm.com/documentation
7. github – https://github.com
8. The IRATI software repository (access available on demand) – https://github.com/dana-i2cat/irati
9. gitk: The git repository browser – https://www.kernel.org/pub/software/scm/git/docs/gitk.html
10. qgit: A graphical interface to git repositories – http://sourceforge.net/projects/qgit
11. VirtualBox – https://www.virtualbox.org
12. IRATI FTP server (access available on demand) – ftp.i2cat.org
13. GNU autoconf – http://www.gnu.org/software/autoconf
14. GNU automake – http://www.gnu.org/software/automake
15. GNU libtool – http://www.gnu.org/software/libtool
16. pkg-config – http://www.freedesktop.org/wiki/Software/pkg-config
17. GNU Make – http://www.gnu.org/software/make
18. SWIG: The Software Wrapper and Interface Generator – http://swig.org
19. The IRATI issues tracking system (access available on demand) – https://github.com/dana-i2cat/irati/issues
20. The Alba RINA stack – https://github.com/dana-i2cat/alba
21. The VirtualBox user manual – https://www.virtualbox.org/manual/UserManual.html
22. The Python API – http://docs.python.org/2/extending/extending.html
23. The Java Native Interface – http://docs.oracle.com/javase/7/docs/technotes/guides/jni
24. tcpdump – http://www.tcpdump.org
25. Wireshark – http://www.wireshark.org
26. RFC 826, An Ethernet Address Resolution Protocol – http://tools.ietf.org/search/rfc826
27. The zen of kobjects – http://lwn.net/Articles/51437
28. A. T. Schreiner, "Object oriented programming with ANSI-C," 1993 – https://ritdml.rit.edu/handle/1850/8544
29. Opaque pointers – http://en.wikipedia.org/wiki/Opaque_pointer
30. IRATI, Deliverable D4.1 – http://irati.eu/wp-content/uploads/2012/07/IRATI-D4.1final.pdf
31. Google Protocol Buffers Developers Guide – https://developers.google.com/protocol-buffers/
32. Linux SysFS – https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt
33. CDAP – Common Distributed Application Protocol Reference (available on demand) – PNA Technical Draft D-Base-2010-xxy, draft 0.7.2, December 2010
END OF DOCUMENT