Computer Engineering 2009 Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ MSc THESIS Code Integrity Check targetting RISC Processors Andre Michel Abikhaled Abstract CE-MS-2009-13 Security is emerging as an important concern in embedded system design. The security of embedded systems is often compromised due to vulnerabilities in the software that they execute. Security attacks exploit such vulnerabilities to trigger unintended program behavior, e.g. the leakage of sensitive data or the execution of malicious code. Many computer security threats involve execution of unauthorized foreign code on the victim system. Viruses, network and email worms, Trojan horses, backdoor programs used in Denial of Service attacks are just a few examples. Program code in a computer system can be changed either by malicious security attacks or by various failures in microprocessors. In this work, we present an enhanced methodology for code integrity at run-time in Reduced Instruction Set Computer (RISC) processors. This is achieved by pre-computing checksums over parts of the binary code before program execution. These pre-computed checksums are usually generated by a Certified Trusted Authority. These values embedded on the binary code are verified at run-time during program execution. For this purpose the targeted processor is augmented with a cryptographic unit and dedicated control unit. The added cryptography unit increases the overhead delay about 30 percent and the area doubles in size. Faculty of Electrical Engineering, Mathematics and Computer Science Code Integrity Check targetting RISC Processors THESIS submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in COMPUTER ENGINEERING by Andre Michel Abikhaled born in Beirut, Lebanon Computer Engineering Department of Electrical Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology Code Integrity Check targetting RISC Processors by Andre Michel Abikhaled Abstract Laboratory Codenumber : : Committee Members : Computer Engineering CE-MS-2009-13 Advisor: dr.ir. Georgi Gaydadjiev, CE, TU Delft Chairperson: dr. Koen Bertels, CE, TU Delft Member: dr.ir. Stephan Wong, CE, TU Delft Member: dr. Marjan Popov, EE, TU Delft i ii Special Thanks to my advisor for his support in this Thesis work. I would like also to thank my family and all the friends for their support. iii iv Contents List of Figures viii List of Tables ix Acknowledgements xi 1 Overview 1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Thesis Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 1 2 2 Background 2.1 Security Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Secure Processor Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Secure Hash Algorithm(SHA) . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 SHA-1 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . 2.6 Proposed SHA-1 round with operations rescheduling and data expansion unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 6 12 13 15 3 Selection of the processor 3.1 Choices while selecting the processor . . . . . . . 3.2 Introduction to Leon Processor . . . . . . . . . . 3.2.1 Leon Integer Unit . . . . . . . . . . . . . 3.2.2 Instruction and Data Cache System . . . 3.2.3 Memory Access and AMBA on-chip buses 3.2.4 Bare-C Compiler . . . . . . . . . . . . . . 3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . 16 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 21 23 24 25 26 28 29 4 Implementation and Validation 4.1 Certified Authority Emulation . . . . . . . . . . . . . . . . . . . . . . 4.2 Implementation of the Code Integrity Check Unit . . . . . . . . . . . 4.2.1 Architecture and Organization of the non extended processor 4.2.2 Extension of The Leon3 processor . . . . . . . . . . . . . . . 4.2.3 Detection of valid Instructions from the Instruction Cache . . 4.2.4 Cryptographic Control Unit . . . . . . . . . . . . . . . . . . . 4.2.5 Timing Simulation Results . . . . . . . . . . . . . . . . . . . 4.2.6 Halting the Pipeline Unit . . . . . . . . . . . . . . . . . . . . 4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 31 33 33 34 36 36 39 40 43 v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Evaluation 45 5.1 Design Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6 Conclusions 49 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Bibliography 53 7 Appendix A: LEON3 VHDL simulation steps with Modelsim 57 8 Appendix B: Software Test program 59 vi List of Figures 1.1 Stack Based Buffer Overflow attack overwriting return address[26] . . . . 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 REM datapath[34] . . . . . . . . . . . . . . . . . . . . . . . Hardware Assisted Monitoring Architecture[24] . . . . . . . XOM overview and memory organization[34] . . . . . . . . Simple XOM Architecture[41] . . . . . . . . . . . . . . . . . Hardware Architecture of High-end Secure Coprocessor[42] TPM Architecture[44] . . . . . . . . . . . . . . . . . . . . . Secure System Overview[1] . . . . . . . . . . . . . . . . . . Code Integrity Check[1] . . . . . . . . . . . . . . . . . . . . Pseudo Code for SHA-1 function[4] . . . . . . . . . . . . . . Operation in a single step of SHA-1[17] . . . . . . . . . . . General SHA-1 Implementation[39] . . . . . . . . . . . . . . Top level design of SHA-1 core . . . . . . . . . . . . . . . . Structural Design of the SHA-1 core . . . . . . . . . . . . . SHA-1 round unit Implementation[4] . . . . . . . . . . . . . SHA-1 data expansion Implementation[4] . . . . . . . . . . 3.1 3.2 3.3 3.4 3.5 3.6 LEON block diagram[8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . LEON integer unit block diagram . . . . . . . . . . . . . . . . . . . . . . . Instruction cache tag layout[8] . . . . . . . . . . . . . . . . . . . . . . . . data cache tag layout[8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . Memory controller connected to AMBA bus and external memory devices[28] AMBA AHB/ASB and APB Bus [38] . . . . . . . . . . . . . . . . . . . . 24 25 26 26 27 28 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 Certified Authority . . . . . . . . . . . . . . . . . . . . . . Application Software . . . . . . . . . . . . . . . . . . . . . Organization of the Leon-3 template . . . . . . . . . . . . Organization of the LEON3S Unit . . . . . . . . . . . . . Architecture of Leon3 Instruction Cache[40] . . . . . . . . Extended Datapath . . . . . . . . . . . . . . . . . . . . . . Code Integrity Check Unit . . . . . . . . . . . . . . . . . . wave diagram of Code Integrity Check . . . . . . . . . . . generate valid unit . . . . . . . . . . . . . . . . . . . . . . wave diagram . . . . . . . . . . . . . . . . . . . . . . . . . Cryptographic Unit Interface . . . . . . . . . . . . . . . . Cache to Hash Unit . . . . . . . . . . . . . . . . . . . . . wave diagram of Cache To Hash(no jump instructions) . . wave diagram of Cache To Hash (with jump instructions) Normal Execution of the pipleine unit[8] . . . . . . . . . . Execution of the pipleine unit when cryptography occurs . 31 32 33 33 34 34 35 36 37 37 38 40 41 42 43 44 5.1 Overhead in Delay vs size of the cache . . . . . . . . . . . . . . . . . . . . 47 vii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 7 8 9 10 10 11 12 13 14 15 16 17 18 18 19 7.1 7.2 Leon processor Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 57 Configuration Inside processor . . . . . . . . . . . . . . . . . . . . . . . . . 57 8.1 8.2 8.3 8.4 Simple Test program Test program 2 . . . Test program 3 . . . Test program 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 59 60 61 List of Tables 2.1 SHA-1 Functions and Constants[4] . . . . . . . . . . . . . . . . . . . . . . 15 3.1 Address Space map [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.1 5.2 5.3 5.4 5.5 Design Comparisons . . . . . . . . . . . . . Architectural Parameters . . . . . . . . . . Delay measurements without modifications Delay measurements with modifications . . Delay overhead . . . . . . . . . . . . . . . . ix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 46 46 47 47 x Acknowledgements Andre Michel Abikhaled Delft, The Netherlands September 11, 2009 xi xii 1 Overview 1.1 Problem statement Let us assume a system of central authority and users. The authority wants to share secrets and data with users having portable computing devices. Our target is protecting the executed code from malicious modifications. The possible attacks can be classified as software and physical attacks on code and data. The authority can be a military organization with remote devices used by soldiers in the field. It could be a bank with devices used by customers as personal ATMs for dispensing electronic cash and accepting deposits into bank. It could even be a cell phone network provider where users are allowed to download trusted software[1]. The main problem of all the above systems is that most of the new security attacks result in violating the integrity of the software code of an application program. Such security attacks or threats try to change the instructions so that adversary try to gain control over the program execution flow[26]. Figure 1.1 illustrates an example of an attack using malicious code. In Figure 1.1(a), function g() is vulnerable because it consists of an operation - strcpy which copies the string str1 into buffer without considering the size of the string, str1. Since buffer is a local array, it will usually be stored in part of a stack frame belonging to local variables. A program that copies passing the end of this array, will overwrite anything stored after the array. Figure 1.1(b) depicts the program stack when function g() is executed. Function g() is called by function f() which has placed the arguments (arg0..argN) of function g() in the stack just after function f()s local variables before running a call instruction. Figure 1.1(c) illustrates how the return address is overwritten using buffer overflow, by making use of a vulnerable function (strcpy in this case). Since strcpy does not check for the length of the string copied across, the hacker is able to make the program copy data beyond the end of an array (buffer in this case). Apart from the content of the buffer, the attacker has also overwritten the return address of function g(). In most of the attacks of this type, the main concern of the attacker is to overwrite the return address and this is the easiest way to gain control of an application programs execution flow. Usually, when a function returns, it will return to the address pointed by the return address in the stack. As described here, if the adversary has changed the return address to point to the code segment injected probably by the same copy operation (as in Figure 1.1(c)), when function g() returns, it will execute the injected malicious code and therefore giving the control to the attackers code[26]. 1.2 Thesis Motivation In this master thesis we are aiming to improve the security of an existing soft- or hard-core processor by executing it with hardware cryptographic unit performing the Secure Hash 1 2 CHAPTER 1. OVERVIEW Figure 1.1: Stack Based Buffer Overflow attack overwriting return address[26] Algorithm (SHA-1). Whenever the instructions are loaded into the on-chip instruction cache from external off-chip caches or main memory, the secure processor performs the code integrity check for the loaded instructions[3]. The secure processor verifies the integrity of the instructions by comparing the hash embedded in the cache line which is calculated statically before starting the execution of the code with the one that is computed at run-time over the instructions in the cache line[2]. The code integrity check guarantees that the binary code loaded in the memory was not altered before execution. For this reason we compute for every cache line the checksum signatures before the program starts. A certified authority generates the checksums loaded into the binary file. At run-time when the program is executed, the real hash is calculated by the cryptographic hash unit in hardware. Different hashes imply that the binary code has been changed and an interrupt is raised[21]. If the hash comparison indicates that both checksums are the same, then the embedded hash in the cache line will be replaced by no-op instruction so as not to affect the execution of the instructions in the pipeline [1]. 1.3 Outline of the thesis The remainder of this thesis is organized as follows. Chapter 2 discusses the different classes of attacks into computer systems like the Hardware-Based Attacks , the architecture for a secure execution of programs on embedded processors is discussed. Moreover, the hash function which is a computationally efficient function converting binary strings of arbitrary length to binary strings of some fixed length called the checksums is going to be discussed. Chapter 3 discusses the 32-bit LEON processor that complies to the SPARC V8 architecture. Chapter 4 discusses the implementation of the SHA-1 unit, the code integrity unit. Chapter 5 describes the comparison of this design with other two designs e.g.The Trusted platform and IBM Secure Coprocessor. Some measurements of 1.3. OUTLINE OF THE THESIS 3 the delay are included in this chapter. Finally this thesis is concluded in chapter 6, where recommendations for future work can also be found. 4 CHAPTER 1. OVERVIEW 2 Background This chapter discusses the background information. Section 2.1 discusses the security issues and the different classes of attacks into computer systems like the Hardware-Based and the Software-based Attacks. The architecture of the design of the Secure Processor is discussed in section 2.2. Section 2.3 illustrates the related work. A hash function which is a function converting binary strings of arbitrary length to binary strings of some fixed length called the checksums is illustrated in section 2.4. Section 2.5 discusses the proposed SHA-1 architecture. Section 2.6 illustrates the implementation of the SHA-1 core in hardware. 2.1 Security Issues The definition of dependability in a computing system is the trustworthiness of a system which delivers a reliable and trusted system[26]. Dependability includes the following attributes of a computing system[26]: 1. Availability: The correct service of the system is ready or available. 2. Reliability: The correct service is always or continuously available. 3. Safety: The users and environment must have no damage. 4. Security: The security services is defined as the availability for authorized users only, the confidentiality protection of messages from unintended eavesdroppers, the authentication of data, and the integrity protection that shows that the message was not modified in transit. In this section, the different classes of attacks into computer systems are presented . 1. Hardware-Based Attacks: direct access of the adversary on the hard disk by using direct hardware block access. Special Cryptography is implemented in hardware that will protect the hardware from such attacks. Cryptography like encrypting all sensitive data is used, where the encrypted data can only be accessed after decrypting the data with the corresponding key where the key can be private or public. Another kind of attacks in hardware is the attacks over busses in hardware[27]. 2. Software-based Attacks: Special attacks where the hacker tries to execute his code to access protected data or to insert errors in system software to get administrator access. The adversary tries to insert software implementation errors[27]. Some of the common software based attacks were explained in details in [26] [27], these attacks will be again briefly discussed. 5 6 CHAPTER 2. BACKGROUND • Buffer Overflow: For instance, if the programmer forgets to check if the length of the input data is not more than the size of the buffer located on the stack where data has to be hold, then the adversary inserts errors. The attacker replaces the return address which is stored on the stack with an address pointing to the overflowed buffer containing its own malicious code, where this last can be used to copy passwords or other valuable data to an attacker. • Heap Overflow: The heap does not contain the return address, but it contains the actual buffer, the size of the buffer and pointers to next and previous buffers are stored in the same memory chunk. The adversary overwrites one of the pointers to point to the return address on the stack such that the return address is overwritten and the adversary inserts malicious software. • Double free vulnerability: The adversary overwrites the administrative data structure. • Format string attack: The attacker modifies the format string, to overwrite the contents of the stack to execute a malicious code. • Temporary file vulnerability: The attacker tries places a link of symbols to force the program to write to a file where he has access to. 2.2 Related Work Extensive previous work has been done in the area of code integrity check like the Runtime Execution Monitoring (REM) [34]. The main concern in REM is to verify the code integrity at the basic block(a sequence of instructions with a single entry point, and no internal flow control instructions, such as branch, call, return instructions [34]) level. This is done by pre-calculating the hashes for every basic bock before execution of program. During the execution of the program, these precomputed values are verified again by calculating the hash values with the aid of a hash unit implemented in hardware. The datapath of a pipeline processor has been modified to implement the code integrity check. The Secure Hash Unit (implemented in hardware to compute the hashes of the instructions in the basic block) reads instructions in 128-bit blocks and computes the corresponding hash values to the current basic block. This unit is connected to the pipeline control and processes a basic block only after all the instructions in the basic block are executed. At the same time with the hash computation, the stored hash of the current basic block is read from the memory and stored in the first-in-first-out (FIFO) hash read buffer. This buffer is trivial because the hash computation delay in hardware is longer than the latency of looking up the hash value in the buffer. Whenever the hash value computation is finished, it is compared to the stored value, and an interrupt is raised if the hash values mismatch [34]. The block diagram in Figure 2.1 depicts the REM datapath. Similar work to the REM is done in [22][24]. The embedded processor is modified by adding extra hardware unit that observes the dynamic execution of the processor, checks whether the execution of the allowed program behavior is not tampered [22]. The embedded processor is a five stage pipeline RISC processor. The inputs to the new hardware unit or monitor contains the program counter (PC) and instruction register 2.2. RELATED WORK 7 To I-Fetch Unit Load/Store Pipe L1-I Cache HMAC Compute Logic L1-D Cache Hash Read Buffer L2 Cache Exception = Chip Boundary Main Memory Figure 2.1: REM datapath[34] (IR) of the completing instruction, and the pipeline control signal from the pipeline control unit. The output of the monitor includes a stall signal and an invalid signal. When the monitor detects a deviation in the program behavior, it raises the invalid signal, which results in an interrupt to the processor [22]. The monitor is splitted into three sub blocks which check program properties at different level of abstractions. The top-most level is the application level, the second level describes the intra-procedural control flow by validating each branch/jump instruction within a function and finally, the lowermost level verifies the integrity of the instruction stream [24]. Some security attacks may not result in a control flow violation, for instance the alteration of a basic block in the program code segment during execution. Hash values of each basic block in the program are pre-computed before the execution of the program, loaded into the hardware monitor when the application is loaded for execution, and subsequently checked during program execution. To be able to check the instruction stream integrity, each row in the basic block lookup table, is augmented to contain another field that stores the statically-computed cryptographic hash of the instruction sequence in the basic block. During program execution, the monitor buffers the instruction stream corresponding to a basic block until a branch/jump instruction is encountered. At this point, the monitor switches to an empty buffer and enables the hardware hash unit to compute the hash of the buffered basic block. The computed hash value is then compared against the value stored in basic block lookup table. When the buffers are full, the processor is stalled in order to allow the instruction integrity checker to catch up[22][24]. The design in Figure 2.2 depicts the Hardware Assisted Monitoring Architecture. Similar work to the Hardware Assisted Monitoring Architecture is done in [20][21]. 8 CHAPTER 2. BACKGROUND Figure 2.2: Hardware Assisted Monitoring Architecture[24] The same idea of having a hardware monitoring unit, but the only difference with the work mentioned above is that the code is checked only at lower level of instructions. A generalized idea for monitoring code integrity at run-time in application-specific instruction set processors (ASIPs) is presented, where both the instruction set architecture (ISA) and the microarchitecture can be customized for a particular application domain. The monitoring microoperations are embedded in machine instructions, and the processor is modified with a hardware monitor. The monitor analyzes the processors execution trace of basic blocks at run-time, checks whether the execution trace aligns with the expected program behavior, and signals any mismatches [21]. Moreover, soft errors in microprocessors can also change program code and result in system malfunction. In [20], an approach for monitoring code integrity at run-time in ASIPs is presented similar to the one in [21] is presented where a compiler-assisted and application-controlled management design for the monitoring architecture is done. The XOM allows the execution of instructions that are stored in memory but do not allow the execution of malicious instructions. The idea behind the XOM is that the Execute-only code which is stored in memory as encrypted form has to be decrypted when a load instruction occurs[41]. This means that the data are loaded to the processor chip in decrypted form. On the other hand data written to the memory on a store instruction has to be encrypted before it is send to the memory. According to [34], the software is given by the vendor in encrypted form. During the execution of the software on the target processor, it is decrypted using a secret key. Therefore, there exists an encryption/decryption unit between the cache and the main memory. On a read miss from the cache, the data from the memory are first decrypted and loaded into the cache. On a write back to the memory the data are encrypted because the memory is not trusted, and the data verified for integrity. This is done by embedding in every data block a checksum [34]. On the other hand, the memory can be tampered from an adversary. To overcome this problem the memory hashing concept is used. The memory is considered as a structured tree where the program data are placed at the leaves of the tree. The nodes in the tree or the parent nodes contain the hash values of every data block at the leaves. When a cache miss occurs, the incoming data block is checked for integrity by checking its hash and all the parent hashes up to the root nodes. In case both values are 2.2. RELATED WORK 9 not the same, an exception is raised. When writing back the data into the memory, the hashes in the tree are updated including the root hash. Figure 2.3 explains the XOM overview and how the hashes are organized in the memory. Figure 2.3: XOM overview and memory organization[34] The limitations of XOM are the following [34]: • Nowadays the software is unencrypted and some of the software is open source where XOM only protects encrypted code. • XOM does not protect shared library code and does not fully protect the untrusted I/O channels. • XOM does not detect the vulnerability or the alteration that happen during program execution. The implementation of the XOM machine is described in details in [41]. We will discuss briefly the implementation of the XOM machine. The simple XOM machine is implemented by slightly modifying a CPU to add a special unit XOM Virtual Machine Monitor (XVMM), an on-chip private memory, microcode specially designed to store the private key[41]. The XVMM is implemented in software or in microcode. The XVMM code is a trusted, authorized, and privileged program. For this reason, the XVMM has access to the private key and the on-chip private memory [41]. The block diagram in Figure 2.4 depicts the architecture of the XOM. Another similar work to the XOM architecture is the IBM4758 Secure Coprocessor [42]. The secure coprocessor provides a secure platform for the secure distributed applications where an application program can be executed without being tampered or 10 CHAPTER 2. BACKGROUND Figure 2.4: Simple XOM Architecture[41] violated from an adversary with direct physical access to the device [42]. The secure coprocessor must offer high computational and cryptographic resources and must be easily programmable. These requirements are met by using good cryptographic accelerators, adding Dynamic RAM (DRAM), and using smaller amount of battery-blacked RAM (BBRAM) as the non-volatile, secure memory [42]. The block diagram in Figure 2.5 shows the Hardware architecture of the high-end secure coprocessor. Figure 2.5: Hardware Architecture of High-end Secure Coprocessor[42] The hardware architecture shown in Figure 2.5 must guarantee secure loading and execution of code. Physical attacks tamper the system by actually destroying the secret keys that are stored in the memory. This secure architecture detects the physical attacks that actually destroys the keys in the secure memory, it ensures that the secrets, when they are first loaded in the embedded system, are not known outside this 2.2. RELATED WORK 11 system, and despite the fact that many software programs are caused to attack, the secrets stored in the DRAM and BBRAM are not modified[42]. Another important example of Secure processors is the Trusted Computing Module architecture designed by the Trusted Computing Group (TCG)[44]. The Trusted Computing Group (TCG) is a non profit organization with the concern of improving the security of the computing environments in computer platforms by protecting the identity of the platform from being observed by an adversary or unauthorized entities [44]. The Trusted Platform provides at least three basic features like protected capabilities, integrity measurement and integrity reporting[44]. Shielded locations are locations where data can be accessed safely in memory and registers. Set of commands with exclusive permission that access the Shielded locations is called protected permissions. Shielded locations are used to protect and report integrity measurements [44]. According to [44], the TCG systems contain set of roots which have the concern of describing the embedded system characteristics that affect the trustworthiness of the system. Men distinguish between three types of roots of trust: a root of trust for measurement(RTM), root of trust for storage(RTS) and root of trust for reporting(RTR). The RTM is a computing engine which provides the reliable integrity measurements while the RTS is a computing engine which gives an accurate summary of values of integrity digests and the sequence of digests, on the other hand the RTR is a computing engine capable of reliably reporting information held by the RTS [44]. The block diagram in Figure 2.6 shows the architecture of the Trusted Platform Module(TPM). Figure 2.6: TPM Architecture[44] Let us explain briefly the functionality of the components inside the TPM [44]. • Input/Output(I/O): The I/O component controls the flow of information over the communication bus by routing messages to appropriate components. • Non-Volatile Storage: This component is used to store keys like the root key and the owner authorization data. • Program Code: is a firmware that controls the devices in the TPM platform. • Random Number Generator(RNG): The RNG generates the key. 12 CHAPTER 2. BACKGROUND • SHA-1 Engine: is a message digest engine, generates checksums. • RSA Engine: is used to generate the encryption/decryption keys. • Execution Engine: is used to execute the program code. 2.3 Secure Processor Architecture This section describes the different steps of how a secure system works. The authority in a secure processor skips normal execution, changes into the secure mode execution, and checks for the integrity of the binary code where the binary code is loaded in the ROM. The authority software determines the checksums over each cache line prior to the execution of the code and the hashes are stored in every cache line. The block diagram in Figure 2.7 gives an example of a Secure System. Sensitive Application Secure Binary Code User Application 1 User Application 2 Operating System Disk Processor Chip Code Integrity Check Main Memory User I/O Figure 2.7: Secure System Overview[1] According to the related work in section 2.2, we selected to implement the idead of Code Integrity Check in hardware like it is explained in the REM(Runtime Execution Monitoring). The reason is that the Code Integrity Check guarantees that the secure instructions remain unmodified throughout its execution. The secure instruction code is signed by computing a checksum over pieces of code, and embedding the hash into the binary code[1]. Figure 2.8 depicts the calculation of checksums statically and embedding them in the cache line. During execution, when the instructions are fetched from offchip memory into the instruction cache. The processor will verify the integrity of the instruction code by computing the checksum at run-time and comparing it with the stored hash already pre-computed [2]. If both hashes are the same, this implies that the instruction code is secure otherwise the system has been tampered. If the hash check passes, the embedded hashes in the cache line are discarded and replaced with no-op instruction so as not to affect the execution of the processor[3]. 2.4. SECURE HASH ALGORITHM(SHA) 13 Binary Code Code address Secure Hash Algorithm Checksum 4 bytes 1 byte Instructions Checksum Checksum 4 bytes 1 byte Instructions Checksum 5 bytes Cache Line Figure 2.8: Code Integrity Check[1] 2.4 Secure Hash Algorithm(SHA) A hash function is a function converting binary strings of arbitrary length to binary strings of some fixed length, called checksums[6]. The hash functions are a one-way functions that compute a small fixed length output value, the digest message. One of the trivial characteristics of the hash functions is that no information of the data input can be obtained. It is seldom to have two different data messages generating the same hash value[4]. The uses of hash functions are for digital signatures. The entity receiving the message then hashes the received message, and verifies that the received signature is correct for this hash value [6]. Hash functions can also be used for code integrity check. The hash value of a particular input is computed. To check that the input data has not been altered, the hash value is recomputed using the input message, and compared for equality with the original pre-calculated hash value [6]. According to [4], the Secure Hash Algorithm (SHA-1) computes a 160 bit message digest or output hash value from the input message. The input data stream is separated into multiple input bocks of 512 bits each. The input block is split into 80*32 bits words, one 32-bit word for each computational round of the SHA-1 algorithm. Each round comprises additions and logical operations, such as bitwise logical operations and bitwise rotations to left. Figure 2.9 illustrates the pseudo code of the SHA-1 algorithm. The Secure Hash Algorithm (SHA-1) contains two stages of calculations: preprocessing and hash computation[13]. The preprocessing stage converts a message M which is a sequence of bits of arbitrary length L into m-bit blocks[13][18]. At this stage, the message M will be padded and parsed into m-bit blocks. The purpose behind padding is to guarantee that the padded message is a multiple of 512 bits. This is done by concatenating the message M by one bit with value of 1, followed by k bits 0, (k being the least non-negative solution to the expression (L + 1 + k) mod 512 = 448). The length of the padded message will become a multiple of 512 bits. After the message has been padded it will be parsed into several m-bit block before starting with the calculation of 14 CHAPTER 2. BACKGROUND For each data_block do Wt = expand(data_block); A = DM0 ; B = DM1; C = DM2; D = DM3; E = DM4; For t = 0 ; t <= 79; t = t + 1 do Temp = Rotl(A) + f_t(B,C,D) + E + K_t + W_t ; E = D; D = C; C = Rotl(B); B = A; A = Temp; End for DM0 = A + DM0; DM1 = B + DM1; DM2 = C + DM2; DM3 = D + DM3; DM4 = E + DM4; End for; Figure 2.9: Pseudo Code for SHA-1 function[4] the checksum value. The last step in the preprocessing stage is to set the initial hash values. The second stage is the hash computation. The SHA-1 algorithm that is already described in Figure 2.9 performs 80 rounds. Every round uses a 32-bit word obtained from the current input data block. Since, each data block only has 16*32 bit words (512 bits), the rest of the bit words which are 64*32 bit words are obtained from data expansion. The following formula deduced from [4] shows the remaining 64*32 bits words how they are obtained. Wt = Rotli (Wt−3 xorWt−8 xorWt−14 xorWt−16 ) (2.1) The computation of the final checksum is an eighty iteration algorithm over each message block, where every message block is a sequence of 32 bit words. At the beginning of every computation, the five internal registers are initialized with the initial hash values and at the end of every round computation each of the five registers contain the current hash value [18]. Figure 2.10 illustrates the operation in a single step of the Secure Hash Algorithm: Let us explain what every rotational and logical operation does. The logical operations ft and bitwise rotations to the left RotL are involved in every round computation. The calculation of ft depends on the round t being executed as well as the value of the constant Kt [4]. Table 2.1 shows the logical functions and the values of the constants for every group of 20 rounds. 2.5. SHA-1 HARDWARE IMPLEMENTATION A B C 15 D E ft <<<5 Wt <<<30 Kt A B C D E Figure 2.10: Operation in a single step of SHA-1[17] Rounds 0 to 19 20 to 39 40 to 59 60 to 79 2.5 Table 2.1: SHA-1 Functions and Constants[4] Function Constants (B and C) xor ((/B) and D) 0X5A827999 B xor C xor D 0X6ED9EBA1 (B and C) xor (B and D) xor (C and D) 0X8F1BBCDC B xor C xor D 0XCA62C1D6 SHA-1 Hardware Implementation This section describes the efficient implementation of Secure Hash Algorithm (SHA-1) in hardware. First the design of the SHA-1 unit is described in general. The pre-processing unit performs the appending of the input message, padding it with zeros and forms message blocks of the fixed length 512 bits for an input message of variable length. The pre-processing unit forwards message blocks to the message scheduler unit. The Message scheduler unit computes message dependent words, W(t), each of the message word is 32 bits. The message digest unit calculates the checksums. At every step, the message digest unit computes the hash value of a new word generated by the message scheduler unit. The message digest unit is the most crucial part of the implementation, as it determines both the performance and area of the whole design[39]. The block diagram 16 CHAPTER 2. BACKGROUND in Figure 2.11 describes the general SHA-1 architecture. Message Pre-Processor Message Scheduler Hash Value Message Digest Control Logic Figure 2.11: General SHA-1 Implementation[39] On the other hand, the data dependencies of the algorithms SHA-1 do not allow for efficient pipelining. The computation time of the message digest unit is very high[4]. Some work has been done in [4] to improve the computational throughput by unrolling the calculation structure, but at the expense of more hardware resources [4]. Some methods are proposed in [4] to improve the hardware implementation of the SHA-1 algorithm. The most important ones are: 1. Parallel counters and balanced Carry Save Adders(CSA): are used to improve the partial additions . 2. Unrolling techniques: are applied to optimize the data dependency . 3. Memory based block expansion structures: the embedded memories are used to store the constant values described in section 2.4. 4. Operation rescheduling: is used to improve the speed of the circuit. The block diagram in Figure 2.12 shows the top level design of the SHA-1 core and the signals that interface this unit. The block diagram in Figure 2.13 shows the structural design of the SHA-1 core with the interface signals. The SHA-1 core consists of two units which are the SHA-1 round unit and the Data Expansion unit. 2.6 Proposed SHA-1 round with operations rescheduling and data expansion unit From Figure 2.9 in section 2.4, we can conclude that the SHA-1 round computation is based on the calculation of the A value. The remaining values do not require any computation, except the value of the rotation of B. The value of A depends on its previous value, no parallelism can be directly deduced, as depicted in the following formula[4]: At+1 = RotL5 (At ) + [f (Bt , Ct , Dt ) + Et + Kt + Wt ] (2.2) 2.6. PROPOSED SHA-1 ROUND WITH OPERATIONS RESCHEDULING AND DATA EXPANSION UNIT 17 IV Reset Start clk WriteIV LoadWi Wi SHA-1 Core finish OutHash Figure 2.12: Top level design of SHA-1 core The operation RotL(At) depends on the variable A(t), the remaining terms depend on variables that require no calculation and do not depend on the value of A(t), therefore some rescheduling in operations can be done. From equation (4.1), we can rearrange the terms that are not dependent of the variable A by producing the carry Carry(t) and save S(t). Thus, the equation becomes[4]: St + Carry t = [f (Bt , Ct , Dt ) + Et + Kt + Wt ] (2.3) If we combine equation (4.1) with equation (4.2), we get: At+1 = RotL5 (At )+(St−1 +Carry t−1 ), St +Carry t = [f (Bt , Ct , Dt )+Et +Kt +Wt ] (2.4) According to [4], the critical path of the SHA-1 round unit is optimized by splitting the computation of the value of A and by rescheduling as it is shown in formula (4.3). This shows that the function f (B, C, D) and the partial addition operations are no longer in the critical path. Therefore, the critical path of the SHA-1 round unit is reduced to a three input full adder. The final values of the internal variables (A to E) are added to the current values after completing 80 rounds, where the current values remains unmodified until the end of each data block computation. The final summation is computed by one adder for each 32 bits portion of the 160-bit hash value. Figure 2.14 depicts the implementation of the SHA-1 round unit in hardware. According to equation (2.1) in section 2.4, the 512 bits generated from the padded unit is expanded to obtain the 80 32-bit words as input to the SHA-1 round unit. This unit is implemented with registers and XOR operations. A select logic is also needed to select between the first 16 rounds and the remaining rounds. Figure 2.15 shows the design of the SHA-1 data expansion unit. 18 CHAPTER 2. BACKGROUND Figure 2.13: Structural Design of the SHA-1 core Figure 2.14: SHA-1 round unit Implementation[4] 2.7. CONCLUSION 19 Wi A/D Converter Vin A/D Converter B1 Vin GND A/D Converter B1 GND Vref B8 Sign Vin ENB A/D Converter B1 Vin GND Vref Vin B1 GND Vref B8 Sign Vref B8 Sign ENB A/D Converter B1 GND Vref B8 Sign ENB A/D Converter B1 GND Vref B8 Vin B8 Sign ENB Sign ENB ENB Wi-1 LoadWi Wi-8 Mux Wi-12 + Rotl_1 Wi-14 Mi A/D Converter Vin A/D Converter B1 GND Vin A/D Converter B1 GND B8 Vref A/D Converter B1 GND B8 Vref Sign ENB Vin A/D Converter B1 GND B8 Vref Sign ENB Vin B1 GND B8 Vref Sign ENB Vin B8 Vref Sign ENB Sign ENB Figure 2.15: SHA-1 data expansion Implementation[4] 2.7 Conclusion Early in this chapter, some security issues were discussed. Different classes of attacks into computer systems were described like the Hardware-Based. Attacks where an adversary has direct physical hardware access, the Software-based Attacks that are specially used remotely to break into computer and embedded systems. Some of the common software based attacks that were explained are Buffer Overflow, Heap Overflow, Double free vulnerability, format string attack and Temporary file vulnerability. The architecture of the design of the Secure Processor was discussed. A hash function which is a function converting binary strings of arbitrary length to binary strings of some fixed length called the checksums was discussed. The hash functions are one-way functions that computes a small fixed length output value, the digest message, that is highly correlated with the input data. One of the trivial characteristics of the hash functions is that no information of the data input can be obtained. The efficient implementation of Secure Hash Algorithm (SHA-1) in hardware was described. 20 CHAPTER 2. BACKGROUND Selection of the processor 3 This chapter discusses the 32-bit LEON processor that complies to the SPARC V8 architecture. Section 3.1 describes how the choice of the Leon processor was made. Section 3.2 gives an introduction for the Leon processor. Subsection 3.2.1 describes the Leon Integer Unit that consists of five pipeline stages. The on-chip memory in LEON3 that is implemented with separate instruction and data busses, and how to configure the cache direct map or set associative and the size of the cache are discussed in subsection 3.2.2. The memory controller of the Leon-3 processor that controls a memory bus holding external memory devices, asynchronous static ram (SRAM) and synchronous dynamic ram (SDRAM). The AMBA (Advanced Microcontroller Bus Architecture) is an on-chip bus specification for interconnection and organization of various functional modules that are a part of System-on-Chip are explained in subsection 3.2.3. Subsection 3.2.4 explains about the compiler of the Leon3 processor. 3.1 Choices while selecting the processor A choice has to be made between using a hard-core or a soft-core processor to implement the envisioned extensions. First of all, we have tried to implement the code integrity check targeting the Xilinx Virtex-II Pro that uses the PowerPC 405D5 processor core, which is a 32-bit high performance, low power, scalar RISC(Reduced Instruction Set Computer) architecture, using separate data and instruction caches. Since the PowerPC is embedded into the virtex-II Pro device this means that the hard IP cores are diffused at any place within the FPGA platform, connected with the different neighboring Configurable Logic Block(CLB)[35]. The choice of implementing cryptography in hardware where the Xilinx Virtex II Pro is used as the FPGA platform is going to be rejected. The reason is that none of the IP cores can be modified, all what can be done is to add an extra IP core on the OPB bus. Another reason might be the fact that the hard-core PowerPC has limited design flexibility. On the other hand, the soft-core platforms has better design flexibility than the hard-core processor. According to [32], a soft-core can be eliminated from the design when it is not needed. Therefore, a reduction in area is going to be less. On the other hand, a soft-core uses a lot of resources from the FPGA, while the hard-core has its own hard-wired hardware. Another positive point of soft-core is that it can be customized to the requirements of the designer and it can be reconfigured at run-time to meet the constraints. On the other hand, the hard-core is a lot more optimized for the silicon-technology than the soft-core. The next choice is to study three soft-core processors: the Leon-3, the Xilinx Microblaze and the OpenRISC 1200 RISC Core. The Leon-3 processor is a VHDL model that is fully synthesizable 32-bit processor implemented with SPARC V8 standard. 21 22 CHAPTER 3. SELECTION OF THE PROCESSOR 1. Integer Unit: The IU (Integer Unit) contains 32-bit RISC architecture five stage instruction pipeline, 8 global registers, 2-32 register windows of 16 registers each and 32-bit instructions [35]. 2. Cache: The cache is implemented with a harvard model (separate instruction/data cache). Instruction and data cache size are modified from 1 KB to 64 KB. The cache contains direct mapped or multi-set cache with set associativity of 2-4. The cache lines can be modified between 8 and 32 bytes of data [35]. 3. Memory: The memory controller has an interface with the PROM, SRAM, SDRAM and memory mapped I/O devices. The memory contains 2 Gbyte address space. the memory areas can be programmed to 8-16-32 bit data width[35]. 4. Design Tools: Simulation is done via a generic testbench and test program is available, including support files for Modelsim. The second platform that is going to be discussed is the Xilinx Microblaze Softcore. The Microblaze is a 32-bit soft processor designed by Xilinx. It contains a RISC architecture with Harvard Model separate data and instruction busses. The main concern of developing the Microblaze platform is to build complex systems for networking, telecommunication, data communication and embedded systems[35]. 1. Processor Unit: It contains 3-stage pipeline with 32-bit RISC architecture. Every instruction has 32 bits. The Instruction Set Architecture (ISA) has two types. The first type which is Type A contains two source and one destination operand while the second type contains one source and one immediate operand. It contains a RISC architecture with Harvard Model separate data and instruction busses but no cache[35]. The memory access can be done via the Local Memory Bus (LMB) and On-chip Memory Bus (OPB). 2. Cache: Separate instruction and data cache. The placement scheme is only directmapping. 3. Memory: The memory controller supports up to 8 memory (flash/SRAM) banks. It contains a separate control register for each bank. Moreover, it supports 8,16,32 and 64-bits bus interface [35]. 4. Design tools: The Xilinx Software integrated development environment, which creates software like Standard C [35]. It contains the GNU C compiler tools including compiler, assembler. 5. Performance: System frequency is about 150 MHz. The third platform is the OpenRISC 1000. This architecture is 32- and 64-bit RISC processors. It targets medium and high performance networking, portable, embedded, and automotive applications because it is designed to get a better performance, low power consumption[35]. 3.2. INTRODUCTION TO LEON PROCESSOR 23 1. Processor Unit: It is a scalar, single-issue 5 stage pipeline. It consists of a singlecycle instruction execution for most of the instructions. 2. Cache: It follows the Harvard model with split instruction and data cache. The instruction /data cache size can be configured from 1KB to 64 KB. 3. Design tools: It contains the GNU ANSI C, C++, Java and Fortran compilers[35]. 4. Performance: System frequency is about 250 MHz. The question remains which of the three soft-core is going to be used in this project. To answer this question, let us revise the differences between the three soft-cores. All three processors are 32-bit RISC processor big endian synthesizable pipelined processors. LEON-3 and OpenRISC 1200 contain 5-stage pipelines while the Microblaze contains 3 stages of pipeline. LEON and OpenRISC processors are available for free under LGPL license, on the other hand Microblaze is available by the company XILINX. According to the analysis done in [36], we have chosen the Leon-3 processor as soft-core to implement cryptography for the following reasons: 1. Cache: All three processors have Harvard caches with almost equal values for the cache size. The Microblaze and the OpenRISC 1200 implement in their placement scheme only the Direct mapping scheme while the Leon-3 provides support for a 2-4 way set associative cache configuration, in which three replacements strategies can be choosed. The choice of the cache is very trivial in this thesis because we need to have more flexibility in configuring the cache. Leon soft-core offers better flexibility than the remaining two. On the other hand the architecture of the cache of Leon-3 is more advanced and complicated than the cache of the remaining two[36]. 2. Documentation: The documentation for LEON-3 is acceptable while the documentation of the Microblaze is a bit more extensive. On the other hand, the documentation of the OpenRISC 1200 is not so extensive and well documented like the other two cores[36]. 3. Tools: The configuration tools of the Leon-3 and the Microblaze are very easy to change even the configuration tools of the Leon-3 is easier to use and to access than the Microblaze. On the other hand, the OpenRISC 1200 configuration is done manaully. HDL simulation and debugging is less complex for Leon-3 compared to the other two processors[36]. 3.2 Introduction to Leon Processor Leon-3 is a 32-bit processor that complies to the SPARC V8 architecture. The Leon-3 processor is a soft-core processor that can be configured and made suitable for embedded applications and System-on-Chip designs with the following features on chip: separate instruction and data caches, hardware multiplier and divider, interrupt controller, two 24 bit timers, two UARTs, power-down function, watchdog, 16 bit I/O port and flexible memory controller. Additional modules can easily be added using the on-chip AMBA 24 CHAPTER 3. SELECTION OF THE PROCESSOR Figure 3.1: LEON block diagram[8] AHB/APB buses[7]. All the features of Leon processor are illustrated in the block diagram of Figure 3.1. 3.2.1 Leon Integer Unit The LEON integer unit (IU) is implemented using the SPARC Version 8 integer instructions. The Leon integer unit has the following features: 1. 5-stage instruction pipeline. 2. Separate instruction and data caches (Harvard Architecture). 3. Support for 2 - 32 register windows. A block diagram of the LEON Integer Unit can be seen in figure 3.2 [8]. The Leon Integer Unit as defined in [8] consists of five pipeline stages and they are the following: 1. FE (Instruction Fetch): In this stage the next instruction to be executed is fetched from the memory. In case the instruction cache is enabled, then the instruction fetch is done directly from the instruction cache. 2. DE (Decode): In this stage the instruction is decoded and the operands are read. The operands may come from the register file or from internal data bypasses. The CALL and Branch target addresses are also determined. 3.2. INTRODUCTION TO LEON PROCESSOR FE DE EXE 25 ME WR Figure 3.2: LEON integer unit block diagram 3. EXE (Execute): In this stage, the arithmetic operations are performed in the ALU (Arithmetic Logical Unit). Logical and shift operations are also computed in the ALU. Moreover, the address of memory operations like load , store, jump and return will be determined. 4. ME (Memory) : In this stage, the data is valid at the end of the stage. For data write, the data will be written to the data cache during the execute stage. 5. WR (Write): In this stage, the outcome of any arithmetic, logical, shift, and cache read operations are written back to the register file. 3.2.2 Instruction and Data Cache System The on-chip memory in LEON3 is implemented with separate instruction and data buses. The LEON-3 processor uses a Harvard Architecture. Both instruction and data cache are connected to two independent cache controllers where these last can be configured to implement either a direct mapped or a multi-set cache with set associativity of 2-4. The set size is configured from 1-256 KBytes. The simplest one is where for instance the Leon instruction cache is configured from 1 - 64 Kbytes. It is a direct-mapped cache[28]. Instruction and data cache operations are controlled via a Cache Control Register(CCR). Every cache can be operating in one of the three modes: enable, disable and frozen. In case where the cache is in disable mode, no memory operations like load and store are performed. The frozen mode operation keeps the cache synchronized with the main memory as if it was in the enable mode, so the cache can be accessed but whenever there are misses the cache line will not be updated from the memory. In the enable mode, the instruction cache is divided into cache lines with 8 - 32 bytes of data assuming direct mapped cache. In this mode, the cache line is filled from main memory. The instructions are forwarded at the same time to the Integer Unit(IU) or processor pipeline. Sometimes, due to internal dependencies or multi-cycle instruction, then the processor pipeline is halted until the line fill is completed. In case where the processor pipeline executes a control transfer instruction like branch or call instruction during the line fill, the termination of the line fill will be on the next fetch. Every cache line has a cache tag associated for it. The cache tag consists of an address tag (ATAG) and valid (V) bits[8][28]. 26 CHAPTER 3. SELECTION OF THE PROCESSOR On the other hand, the Leon data cache is configured also from 1-64 Kbyte. It is a direct-mapped cache. The data cache is divided into cache lines with 8 - 32 bytes of data. Every cache line has a cache tag associated for it. The cache tag consists of an address tag (ATAG) and valid (V) bits. The following two figures show the instruction and data cache tag respectively: Figure 3.3: Instruction cache tag layout[8] 1. Address Tag (ATAG)[31:10]: tag address of the cache line. 2. Valid (V) [7:0]: When a sub-block of cache line is filled with data, then the valid bits are set. A cache fill which results in a memory error will leave the valid bit unset. A flush instruction will clear all valid bits. Figure 3.4: data cache tag layout[8] 1. Address Tag (ATAG) [31:10]: tag address of the cache line. 2. Valid (V) [7:0]: When a sub-block of cache line is filled with data, then the valid bits are set. A cache fill which results in a memory error will leave the valid bit unset. 3.2.3 Memory Access and AMBA on-chip buses The memory controller of the Leon-3 processor controls a memory bus holding external memory devices, asynchronous static ram (SRAM) and synchronous dynamic ram (SDRAM). The memory controller acts as a slave on the AHB (Advanced High Speed) bus. The memory bus supports four types of devices: prom,sram,sdram and local I/O. The memory bus can also be configured in 8-bit or 16-bit for applications with low memory and performance demands[28]. Figure 3.5 shows the interface between the AHB bus and the external memory. The controller decodes three address spaces (PROM, I/O and RAM) whose mapping is determined through VHDL generics. The controller decodes in total a 2 Gbyte address space. The following table shows that: AMBA (Advanced Microcontroller Bus Architecture) is an on-chip bus specification for interconnection and organization of various functional modules that are a part of System-on-Chip. The AMBA specification improves the reusable system on-chip platform by including a common standard for data communication in a System-on-Chip module[37]. Men distinguish three distinct AMBA buses and they are: 3.2. INTRODUCTION TO LEON PROCESSOR APB 27 AHB A D PROM I/O Memory Controller SRAM SDRAM Figure 3.5: Memory controller connected to AMBA bus and external memory devices[28] Table 3.1: Address Space map [8] Address range Size Mapping 0X00000000-0X1FFFFFFF 512 M Prom 0X20000000-0X3FFFFFFF 512 M I/O 0X40000000-0X7FFFFFFF 1G RAM 1. Advanced High-performance Bus (AHB): The AMBA AHB is used for system modules with high clock frequency. AHB supports the efficient connection of processors, on-chip memories and off-chip external memory interfaces. 2. Advanced System Bus (ASB): The AMBA ASB is for systems with high performance. AMBA ASB is an alternative system bus suitable for use where the high-performance features of AHB are not required. 3. Advanced Peripheral Bus (APB): The AMBA APB is used in systems where lowpower peripherals exists. AMBA APB is optimized for minimal power consumption. The AMBA AHB system consists of the following components[38]: 28 CHAPTER 3. SELECTION OF THE PROCESSOR 1. AHB master: The bus master initiates read and write operations by providing informations about the addresses of the data. One master is only allowed to actively use the bus. 2. AHB slave: A bus slave responds to a read or write operation within a given address-space range. The bus slave replies back to the active master the success, failure or waiting of the data transfer. 3. AHB arbiter: The bus arbiter guarantees that only one bus master at a time is allowed for data transfers. 4. AHB decoder: The AHB decoder is used to decode the address of each transfer of data and provides a select signal for the slave that is involved in the transfer. The AMBA ASB has the same components as the AMBA AHB. The following figure shows all three bus systems and the cores connected to the three busses. Figure 3.6: AMBA AHB/ASB and APB Bus [38] 3.2.4 Bare-C Compiler BCC(Bare C compiler) is a cross compiler for LEON-3 processors. It is based on the GNU compiler tools and the Newlib standalone C-library [9]. The cross compiler supports floating point operations, as well as SPARC V8 multiply and divide instructions. For further information check [9]. The next thing was to compile a very simple program like hello world and this is done with the following command : sparc-elf-gcc -msoft-float -g -O2 hello.c -o hello.exe. The compiler has some extra options like -msoft-float, it emulates floating point and it is used if no Floating Point Unit exists in the system. The other option is -O2 where it optimizes code maximum performance and minimal code size, for more informations about the compiler options check [9]. After the simple code of hello world has been compiled, the next step was to learn how to make Leon boot the PROM and this is done with the command sparc-elf-mkprom. Note that sparc-elf-mkprom creates ELF files. To 3.3. CONCLUSION 29 create an SRECORD file for a prom program, the command sparc-elf-obj is used. Let us summarized the steps from compilation, linking and copying from ELF to SRECORD: 1. sparc-elf-gcc -g -O2 hello.c -o hello -msoft-float 2. sparc-elf-mkprom hello -o hello.exe -msoft-float MKPROM boot-prom builder v1.0 section: .text at 0*4000000, size 31040 bytes section: .data at 0*4007940, size 1904 bytes 3. sparc-elf-objcopy -O srec hello.exe hello.srec 3.3 Conclusion In this chapter, three soft-cores was studied, the Leon-3 processor, the Xilinx Microblaze and the OpenRISC 1200. The 32-bit LEON processor that complies to the SPARC V8 architecture was discussed. The Leon Integer Unit as defined that consists of five pipeline stages, the on-chip memory in LEON3 is implemented with separate instruction and data buses, and how to configure the cache direct map or set associative and the size of the cache were illustrated. The memory controller of the Leon-3 processor controls a memory bus holding external memory devices, asynchronous static ram (SRAM) and synchronous dynamic ram (SDRAM) were also explained. AMBA (Advanced Microcontroller Bus Architecture) is an on-chip bus specification for interconnection and organization of various functional modules that are a part of System-on-Chip. The AMBA specification that improves the reusable system on-chip platform by including a common standard for data communication in a System-on-Chip module and the three distinct AMBA buses which are Advanced High-performance Bus (AHB), Advanced System Bus (ASB) and Advanced Peripheral Bus (APB). The compiler of the Leon3 processor is also explained. 30 CHAPTER 3. SELECTION OF THE PROCESSOR Implementation and Validation 4 This chapter describes the implementation of the Code Integrity Check (CIC) unit in hardware using the Leon3 Template. Section 4.1 discusses how the emulated certified authority reads the binary file of the compiled program, assigns the instructions into a certain number of bytes, calculates the checksums with the SHA-1 algorithm, embeds the checksums to the cache line that contains the instructions and loads them in the binary file. Section 4.2 describes the design of the hardware CIC (Code Integrity Check) unit that consists out of some key components called the controller of the cryptographic unit, the cryptographic unit (SHA-1 core), additional control units to detect whenever the data read from the cache is valid, selection logic components and a compare unit. 4.1 Certified Authority Emulation The first step in implementation was to add to the binary file generated from the sparcelf-gcc compiler the checksum calculated for every cache line in the binary file. A software program is implemented in C language. The binary file is generated after compiling with the sparc-elf-gcc compiler of the Leon-3 processor and loaded in the PROM of the LEON3 soft-core. The certified authority reads the binary file of the compiled program, assigns the instructions into a certain number of bytes, calculates the checksums with the SHA-1 algorithm, embeds the checksums to the cache line that contains the instructions and loads them in the binary file. Figure 4.1 illustrates the flow chart diagram that shows these steps. C Program Code Sparcelf-gcc Compiler Read Binary code Binary Code Segment Code in x bytes Authority Software Compute checksums and write checksums Modified Binary Code with embedded Checksums Figure 4.1: Certified Authority A small example illustrates the overal functionality. The software program which is running on the LEON-3 processor, is shown in Figure 4.2. After compiling the software program, the binary file is obtained. 000081D8200003000004821060E081884000 0010819000008198000081800000A1800000 002001000000030020408210600FC2A00040 31 32 CHAPTER 4. IMPLEMENTATION AND VALIDATION 003084100000010000000100000001000000 int sum (int a , int b) { int result; result = a + b; return result; } int mult (int a , int b) { int result; if (a > b) result = b * a; else result = a * b; return result; } main() { report_start(); base_test(); int result1,result2; int x = 2; int y = 3; result1 = sum(x,y); result2 = mult(x,y); printf("the sum is %d \n",result1); printf("the product is %d \n",result2); report_end(); } Figure 4.2: Application Software The first four characters correspond to 16-bit address. Every eight characters represent four bytes of instructions loaded in the memory. The authority software is going to read the binary file and computes the checksum for every line. The following checksums are generated. ee750e0f90421b951f0761e37f9e2889a861e3b1 28552f99df09a02a2f6fbe219251aeb0c494a6b3 9496dc19a05cf4c0aad3d6ce3a4064165cce252a 536538f090b1ce3fbb3d35aeb028afb836f19ab5 The idea originally was to embed the checksums in the existed binary file. So we need to expand the cache to embed all these checksums in it. The checksums are instead stored in an on chip memory (look up table) where all the checksums are read from memory. This look up table is necessary because the SHA-1 computation latency is longer than the latency of looking up the checksums in the Block RAM. This is why we thought this is a representative implementation because also the costs of the SHA-1 are much more than the costs of additional logic we did not consider. 4.2. IMPLEMENTATION OF THE CODE INTEGRITY CHECK UNIT 4.2 4.2.1 33 Implementation of the Code Integrity Check Unit Architecture and Organization of the non extended processor This subsection describes the standard implementation of the Leon3 processor. The general organization of the Leon-3 template consists of the memory controller which is the interface between the PROM of the Leon3 processor and the AMBA-AHB(Advanced High Speed) bus controller. On the other hand the Leon3 top level entity consists of the cache control unit which is the interface between the memory controller, the cache itself (Instruction and data cache) and the Integer Unit pipeline. The block diagram in Figure 4.2 shows the organization of the LEON-3 template. From Figure 4.2, the AHB controller is a combination of an arbiter, bus multiplexer and slave decoder. The LEON3S entity is a 32-bit processor core conforming to the SPARC architecture. The memory controller hosts a memory bus PROM. It acts as a slave on the AHB (Advanced High Speed) bus[28]. clk rstn ahbsi memi LEON3S top ahbmi Memory level entity ahbmo AHB Controller ahbso ahbso Controller memo leon ahbsi processor PROM Figure 4.3: Organization of the Leon-3 template The signals that interface the AMBA-AHB bus with the Leon3S top level entity and memory controller are ahbmi (AHB master input), ahbmo (AHB master output), ahbsi (AHB slave input), ahbso (AHB slave output). The interface signals between the memory controller and the PROM are memi (memory input) and memo (memory output). The block diagram in Figure 4.3 illustrates the architecture of the LEON3S unit. ahbmi ici mcii Integer Unit Pipeline Instruction Cache ahbmo MMU_Acache ahbso mcio ico Figure 4.4: Organization of the LEON3S Unit When a miss occurs in the instruction cache, it changes into the streaming mode, fetching one entire cache line, where the buffer waddr holds the next address to ask from memory [40]. Figure 4.4 illustrates the organization and behavior of the LEON Instruction cache. The next subsection explains the modification done to the datapath of the Leon3 processor. 34 CHAPTER 4. IMPLEMENTATION AND VALIDATION Figure 4.5: Architecture of Leon3 Instruction Cache[40] 4.2.2 Extension of The Leon3 processor This subsection describes the implementation of the Code Integrity Check (CIC) unit in hardware using the Leon3 processor. The top level entity of the leon3 processor is modified to implement the cryptography in hardware. The datapath is extended by adding key components like the controller of the cryptographic unit, the cryptographic unit (SHA-1 core), unit to detect if the instructions read from the cache are valid, a block RAM memory to dump the pre-computed cheksums and a compare unit. The block diagram in Figure 4.5 shows the extension of the datapath with the Code Integrity Check. ici Integer Unit pipeline (iu.vhd) Instruction cache (icache.vhd) ico halt interrupt mcii mcio ahbmi Cache ahbmo Controller (MMU_Acache. ahbso vhd) Code Integrity check (CIC.vhd) Figure 4.6: Extended Datapath At run time the checksum is computed by the cryptographic unit and compared to the pre-computed checksum computed statically prior to the execution of the program code. The pre-computed checksums are stored in an on-chip Block RAM module. The pre-computed hashes of the cache lines are stored in BRAM at the same address of the cache line. The two checksums are compared when the signal start compare from the control unit is enabled. In case the two checksums values are not the same, this means that the binary code has been tampered and an interrupt is sent to the pipeline unit so that the processor stops the execution of the code. The block diagram in Figure 4.6 shows the structural design of the Code Integrity Check. The Code Integrity Check unit is tested by a generic testbench. The generic test- 4.2. IMPLEMENTATION OF THE CODE INTEGRITY CHECK UNIT finish valid start LoadWi Cryptographic control unit and interface between the cryptographic unit and the Leon cache (cach_to_hash.vh d) 35 Cryptogra phic unit (SHA1_co re.vhd) WriteIVHash WriteIVValue Data_to_SHA1 Start_compare Read_mem halt Precomputed_hash ico outhash Register ici Instruction cache of leon3 (icache.vhd) Initializing hash in memory (dp_memory.v hd) A H Register Q1 A Q8 H ENB Q1 Q8 ENB Generate_valid. vhd Comparator.vhd Pipeline unit of Leon3 (iu.vhd) interrupt Figure 4.7: Code Integrity Check Unit bench includes external Flash PROM which is pre-loaded with a test program. The test program will execute on the LEON3 processor and test various functionality in the design. The generic testbench has been created to read from file. It reads from the file of the PROM where the binary code is loaded. The binary code has been created after compiling a software test program like the one shown in Figure 4.2. Other test programs have been generated with nested for loops, while loops. For further details about the test program refer to Appendix B. All the interface signals between the processor chip and the external memory components are defined as internal signals. The generic testbench emulates the LEON processor where all the components that are inside this processor chip can be tested. The generic testbench emulates the top level entity of the Leon3 processor and all the components are mapped to this entity. The wave diagram in Figure 4.7 shows the evaluation of the design (the relevant signals are circled). According to the wave diagram in Figure 4.7, we can observe the following: • Signal mem-re is enabled after one clock cycle, the pre-calculated mem-hash value is read from the offset address of the cache line ici.dpc. The offset of the address 36 CHAPTER 4. IMPLEMENTATION AND VALIDATION 0000000C is for instance 000 and this is the address of the pre-computed value. • After 80 clock cycles, the run time hash value is computed and signal start-compare is enabled. Figure 4.8: wave diagram of Code Integrity Check 4.2.3 Detection of valid Instructions from the Instruction Cache The main concern is in detecting whether the instructions in the cache are valid. The instruction address is valid at the beginning of the Fetch stage of the pipeline and is generated in the execution stage from incrementing the program counter(pc) or from a previous branch. In case a miss is detected in the fetch stage, the instruction cache is switched into the streaming state, sending a memory request and at the same time the pipeline is in halt mode. When the memory request is ready, the data is latched in the pipeline by using the signal ico.mds which is an output from the instruction cache going as an input to pipeline unit[40]. Figure 4.8 illustrates the design of the generate valid unit. Whenever the signal ico.mds is equal to zero, the instructions in ico.data(0) are ready in the instruction cache. The transition from state s0 to state s1 in the state machine in Figure 4.14 detects the valid signal, this means that in state s1, the instructions are valid in the cache. The simulation of the generate valid unit with all the interface signals are shown in wave diagram in Figure 4.9 where the relevant signals are circled in the waveform. 4.2.4 Cryptographic Control Unit The main concern is controlling the cryptographic unit (SHA-1 core) by reading the instructions from the instruction cache if a valid has been generated from the generate valid unit. The SHA-1 core can accept any message of any size and digest the message into an output of 160 bits. The pipeline unit handles four instructions from the instruction cache. In our case the input to the cryptographic unit is 4 instructions each 4.2. IMPLEMENTATION OF THE CODE INTEGRITY CHECK UNIT 37 Ico.mds Instruction cache Pipeline of the Of the leon processor leon processor (icache.vhd) (iu.vhd) Ico.data Ico.mds Generate_valid.vhd valid Ico.mds = 0 State0 Valid =0 Ico.mds Ico.mds =0 =1 State1 Valid = 1 Ico.mds = 1 Figure 4.9: generate valid unit Figure 4.10: wave diagram of 32 bits. On the other hand, the SHA-1 core performs pre-processing operations by appending the message by one and padding the message with zeros as it was described in section 2.4. The control unit makes sure that the cryptographic unit is reading the valid instructions from the cache, loads each of the four instructions in registers, enable the start signal for the SHA-1 core to begin computing the checksums, halts the pipeline of the processor during the computation of checksums, and makes sure that all the inputs to the cryptographic unit are appended with one, padded with zero and adding the size of the message at the end of the input. The block diagram in Figure 4.16 illustrates the design of the unit in hardware. The Finite State Machine in Figure 4.12 illustrates the design of the SHA-1 core controller the cache to hash unit. • The cache to hash unit describes the interface between the instruction cache of the Leon-3 processor and the cryptographic unit(SHA-1 core). The output of the 38 CHAPTER 4. IMPLEMENTATION AND VALIDATION clk Ico.data(0) clk rst rst finish valid start Cryptograp hic unit (Sha1_core .vhd) LoadWi WriteIVHash WriteIVvalue Register load1 A Q1 H Q8 data1 ENB State Machine halt Register load2 A Q1 H Q8 data2 ENB Data_to_SHA1 Register load3 A Q1 H Q8 data3 5 to1 Mux ENB Register load4 A Q1 H Q8 data4 ENB additional Select Pipeline of the leon3processor (iu.vhd) Figure 4.11: Cryptographic Unit Interface instruction cache ico.data(0) is assigned as an input to the control unit. Whenever the valid signal is detected from the generate valid unit, then the cryptographic unit is reading the valid instructions. • The transition from state S1, S2, S3 to state S4 in Figure 4.17 describes that the registers are enabled and the instructions data are loaded in the registers. The first four states take into consideration the branch instructions. In case of a branch instruction the special state jump handles those branch instructions, the remaining instructions and the embedded checksum in the cache line are neglected since there is a jump to another address in the code. This state jump checks whether there branch is still equal to 1 or the offset of the address of the cache line not equal to zero. In this case we stay hanging in this state until there is no more branch or the offset of the address is equal to zero. 4.2. IMPLEMENTATION OF THE CODE INTEGRITY CHECK UNIT 39 • In state S4 the cryptographic unit is enabled by setting the start signal to one and the pipeline of the processor is halted. From state 5 to state 20 the pipeline is halted and the cryptographic unit is doing computation work. • The signal LoadWi is enabled once the valid instructions are loaded in the registers. The select signal of the multiplexers select the relevant data as an input to the cryptographic unit. The select signals select also the additional data, this means that the value for the appending, padding with zeros and the length of the input message is also generated from the cache to hash unit. • At state 21 the cryptographic core is done with the calculations of checksums and the state transition goes back to state 0, to start the next computation round. 4.2.5 Timing Simulation Results This subsection shows the simulation results of the cryptographic control unit in case a binary code is running from the main function and a binary code where the code contains jumps instructions. The wave diagram in Figure 4.12 illustrates the time simulation of the cache to hash unit or the cryptographic control unit, the interface signals and the state transitions of a binary code running from main. The expected behavior is explained in subsection 4.2.4. (The relevant signals are circled). According to the wave diagram in Figure 4.12, we can observe the following: • The four instructions which are ico.data(0) in the simulation wave diagram are loaded in registers reg1, reg2, reg3 and reg4 when the signals load1, load2, load3 and load4 signals are enabled. • The Loadwi signal of the cryptographic unit is enabled, immediately after the start signal is enabled, the instructions are loaded to the cryptographic unit and the LoadWi signal is enabled for a certain period of time. • During the computation of the checksums, the halt signal is enabled until the computation is finished, so that the processor stops the execution of the instructions. • This simulation is indicated only for the case of basic blocks so it does not include any jumps instructions. The binary code runs only from the main function. The wave diagram in Figure 4.13 shows the time simulation of the cache to hash unit taking into account the jump instructions. The binary code includes jumps instructions. The expected behavior is explained in subsection 4.2.4. In case of jump instruction the transition from the current state(S0... S3) to state jump takes place. The state jump handles the branch instructions, the remaining instructions and the embedded checksum in the cache line are neglected since there is a jump to another address in the code. This state jump checks whether there branch signal is still equal to 1. According to the wave diagram in Figure 4.13, we can observe the following: • In the case of jump the signal branch is asserted whenever a jump instruction occurs. The branch is detected whenever the rbranch signal or the fbranch signal is asserted. 40 CHAPTER 4. IMPLEMENTATION AND VALIDATION Branch = 0 or ici.dpc = 0 S0 Reset =1 WIVH ash = 0 S7 Halt = 1 laodWi = 1 Mux_ctrl = “010” S14 Halt = 1 Loadwi =1 Mux_ctrl = “101” Branch =1 valid = 1 S8 Halt = 1 loadWi = 1 Mux_ctrl = “011” S1 Laod1 = 1 an Br S15 Halt = 1 Loadwi =1 Mux_ctrl = “101” ch = Valid = 1 1 State jump S2 Load2 = 1 WIVHash = 1 S9 Halt = 1 Loadwi = 1 Mux_ctrl = 100 Branch = 1 Valid = 1 S16 Halt = 1 Loadwi =1 Mux_ctrl = “101” Branch =1 S10 Halt = 1 Loadwi =1 Mux_ctrl = “101” S3 Laod3 = 1 S17 Halt = 1 Loadwi =1 Mux_ctrl = “101” Valid = 1 S4 Load4 =1 Read_mem =1 Start = 1 Halt = 1 Branch= 1 or Ici.dpc(3d own to 0) /= 0 S11 Halt = 1 Loadwi =1 Mux_ctrl = “101” S5 Start = 0 Halt = 1 loadWi = 1 Mux_ctrl = “000” S6 Halt = 1 loadWi = 1 Mux_ctrl =”001” S18 Halt = 1 Loadwi =1 Mux_ctrl = “101” S12 Halt = 1 Loadwi =1 Mux_ctrl = “101” S19 Halt = 1 Loadwi =1 Mux_ctrl = “101” S13 Halt = 1 Loadwi =1 Mux_ctrl = “101” S20 Halt = 1 Loadwi =1 Mux_ctrl = “110” Finish = 1 S21 Halt = 1 Figure 4.12: Cache to Hash Unit • The current state is hanging in the state jump and the next address is stored in ici.fpc. 4.2.6 Halting the Pipeline Unit Let us show how the pipeline unit looks like when the standard execution of the Leon3 processor is considered and the execution when the cryptography extension is added. The pipeline unit has been modified where the halt signal is assigned to the hold pc signal of the Leon processor. The hold pc signal controls the fpc (fetch program counter) register 4.2. IMPLEMENTATION OF THE CODE INTEGRITY CHECK UNIT 41 Figure 4.13: wave diagram of Cache To Hash(no jump instructions) in the pipleine unit of the Leon processor, if hold pc is enabled then the current program counter is held in this register and the pc is not increased. The data are not latched in the pipeline stages but nops are included. The halt signal controls a multiplexor to choose between the normal instruction data or the nops. When hold pc is disabled or when halt signal is disabled, this means that the cryptographic unit is done with the computation of the checksum, the program counter is increased and the new instruction is fetched. Figure 4.14 shows the normal execution of the pipeline unit and Figure 4.15 illustrates the pipeline unit when cryptography occurs. Figure 4.14 illustrates the standard 5 stages pipeline execution from Fetch, Decode, Execute, Memory and Write Back. In the Fetch stage the next instruction to be executed is fetched from memory. In the decode stage, the instruction is decoded and the operands 42 CHAPTER 4. IMPLEMENTATION AND VALIDATION Figure 4.14: wave diagram of Cache To Hash (with jump instructions) are read. All arithmetic operations are performed in the Arithmetic Logical Unit (ALU). The data are valid in the memory stage. In the write back stage all results from the ALU are written back to the register file. Figure 4.15 describes the modification of the pipeline unit whenever the halt signal is enabled, the fetch program counter register will hold the current value of pc and nops are included to the pipeline unit so that the unit does not execute anything. 4.3. CONCLUSION 43 Figure 4.15: Normal Execution of the pipleine unit[8] 4.3 Conclusion In chapter 4, The certified authority which is a software program reads the binary file of a compiled program, assigns the instructions into a certain number of bytes, calculate the checksums with the SHA-1 algorithm and embed the checksums in the binary file and the binary file is loaded in the the PROM. The implementation of the Code Integrity Check (CIC) unit in hardware using the Leon3 Template is described. The hardware CIC (Code Integrity Check)unit that was build out of the controller of the cryptographic unit, the cryptographic unit (SHA-1 core), the generate valid units to detect whenever the data read from the cache are valid, Block RAM to store the pre-computed checksums and a compare unit were illustrated. The simulation results of the behavior of the Code Integrity Check unit, the controller of the cryptography unit were discussed. Finally, the modification of the pipeline unit was discussed. 44 CHAPTER 4. IMPLEMENTATION AND VALIDATION nop halt Multiplexer halt S1 D S2 C ENB Figure 4.16: Execution of the pipleine unit when cryptography occurs 5 Evaluation Section 5.1 discusses the comparison of the three different secure processors that were discussed in Section 2.3 Related work. Section 5.2 shows some experimental results and takes into consideration the area and delay overhead when the design is scaled. 5.1 Design Comparisons The code integrity check controls whether the binary code has been modified by precomputing checksums prior the execution of the program. During execution of the program, the checksum is computed and compared to the pre-computed one (assuming the operating system is trusted). The Code Integrity Check Unit provides only the integrity of the binary code but does not provide memory integrity verification. If memory integrity verification is required, then a combination of our design and eXecute Only Memory (XOM)[41] can be used. Some of the available secure processors are already discussed in Section 2.2. We have discussed three different designs the XOM architecture, the Secure Coprocessor of IBM4758 and the Trusted Computing Group. As we mentioned in the beginning of this section that the Code Integrity Check Unit provides only the integrity of the binary code but does not provide memory integrity verification. If memory integrity verification is required, then a combination of our design and eXecute Only Memory (XOM)[41] can be used. The common characteristics of the discussed architecutres is that the hardware is assumed to be correct. Architectures like XOM provide a software tamper-resistant execution environment. The check is done inside the processor so no separate hardware is needed while the Trusted Computing Module of TCG group provides the same thing as XOM except that the check is done outside the processor, so a separate hardware is required. On the other hand, the IBM4758 secure Co-processor protects secret keys for security applications where the secret keys that are stored are never leaked. Table 5.1 explains the comparisons between our design and the XOM processor, the Secure Coprocessor of IBM4758 and the Trusted Computing Group(e.g Intel Lagrande). In table 5.1 the comparisons between different designs is made in terms of whether the design requires separate hardware to implement the cryptography, the measure of the trust of the user, this means that why a user should trust the system, security perimeter is taken into account, this means that where the security has been concentrated, is it inside the processor and not outside or is it in both, e.g in our design/XOM the security has been concentrated only inside the processor chip, where the memory is untrusted. 45 46 CHAPTER 5. EVALUATION Table 5.1: Design Comparisons Our Design /XOM Goal Copy and tamperresistant software distribution and execution (inside processor) No Separate Hardware Measure of User’s trust Integrity of Software and its computation result is checked Processor Chip Boundary Security perimeter Common Characteristics 5.2 Hardware is correct , Permanent device secret(Public-private key) Secure Coprocessors, e.g. IBM4758 Secret protection for Security applications Yes The Secrets that are stored are never leaked The private keys are securely stored inside the casing during manufacture Hardware is correct , Permanent device secret(Public-private key) Trusted Computing Group , e.g. Intel LaGrande Copy and tamperresistant software distribution and execution (outside processor) Yes Integrity of Software is checked at load time only Processor, DRAM, TPM chip and busses Hardware is correct, Permanent device secret(Public-private key) Experimental Results This section describes some of the test results that we have obtained from synthesizing our design. The Xilinx ISE tool is used to synthesize the design in hardware. The default values of the LEON-3 architectural parameters that we used and the latency of the SHA-1 core are depicted in table 5.2. Table 5.2: Architectural Parameters Architectural Parameter L1- Instruction Cache L1-latency Load cycles Store Cycles SHA-1 Latency Value 64 KB/set, 1-set, 32 Bytes/line 1 cycle 2 cycles 2 cycles 80 cycles First, we synthesize the original Leon3MP design without modifications and then we synthesize the Leon3MP after modification with adding the cryptographic unit. We check the overhead in delay by taking into account the associativity (number of sets) and the set size (Kbytes/set) of the Leon3 processor cache. We measure the delay by increasing the set size for different number of sets between one to four. We do that to check what the delay is when the design is scaled. The delay is considered to be inverse proportional to the frequency of the design. We start first with the measurements of the delay in ns without any modification. From the synthesis results, we discovered that the delay stays constant when the set size of the cache increases. The delay varies only when the number of sets varies between one to four. The results are illustrated in table 5.3. Table 5.3: Delay measurements without modifications number of sets Delay[ns] 1 8.127 ns 2 8.303 ns 3 8.305 ns 4 8.309 ns 5.2. EXPERIMENTAL RESULTS 47 The area usage is about 24 percent usage of the LUT(Look up table on the FPGA). On the other hand we measure the delay after modifying the Leon3 processor by adding the cryptographic unit. From the synthesis results, we noticed that the delay does not vary if the size of the cache increases. The delay increases with the variation of the number of sets. A possible explanation why the delay stays constants is that we are loading the binary code in the memory where the size of the code is staying constant. The results are illustrated in table 5.4. The area usage is 40 percent. The area overhead almost doubles. Table 5.4: Delay measurements with modifications number of sets Delay[ns] 1 8.393 ns 2 8.392 ns 3 8.392 ns 4 8.392 ns If we compare these results with the results of table 5.3, we conclude that the delay increases. This is due to the fact that the computation time of the cryptographic unit is high. The overhead in delay is presented in table 5.5 for different number of sets and set size. The graph in Figure 5.1 illustrates the delay overhead where the x axis represents the size of the cache in Kbytes/sets and the y axis the delay overhead in percentage for set associativity 1, 2, 3, 4. Table 5.5: Delay overhead number of sets Delay[ns] 1 26.6 % 2 28.9 % 3 38.7 % 4 48.3 % Delay Overhead percentage of delay overhead 60 % delay overhead, 48.3 50 % delay overhead, 38.7 40 30 % delay overhead, % delay overhead, 28.9 26.6 % delay overhead 20 10 0 1 2 3 4 number of sets Figure 5.1: Overhead in Delay vs size of the cache Let us discuss the reason behind the increase in delay overhead if the set associativity of the cache varies between 1 to 4. Generally speaking, in a direct mapped cache a memory block maps to exactly one cache block. At the other extreme, in a fully associative 48 CHAPTER 5. EVALUATION cache a memory block can be mapped to any cache block. On the other hand to reduce cache miss rate a compromise is to divide the cache into sets each of which consists of n ”ways” (n-way set associative). A memory block maps to a unique set specified by the index field and can be placed in any way of that set. The disadvantage of the n-way set associative cache is that it requires n comparators while 1 comparator in the direct map is needed. This will increase the delay overhead and area. The n-way set associative requires additional multiplexor delay for the data than the direct mapping scheme. In n-way set associative the data is available after set selection and Hit/Miss decision, while in a direct mapped cache, the cache block is available before the Hit/Miss decision. Other reasons that lead to performance degradation is the delay caused by the Code Integrity Check unit implemented in hardware. The delay is caused at the cache-memory boundary upon the L1 cache miss, where the miss penalty is already several hundred cycles for typical microprocessors. Hence, some tens of extra cycles for hardware hash computation is going to cause much of performance degradation. The code integrity checking inserts additional no-op instructions into the instruction stream for every cache line. These will cause degradation in the efficiency of the instruction fetch since a fraction of instructions fetched are useless. On the other hand, to reduce the performance overhead, the cryptographic unit has to be pipelined. Instead of waiting 80 clock cycles to compute the hash value for an input message, we can have 80 pipelined stages so that the throughput will be 160 bits per clock cycle. Another method to reduce performance overhead, is to modify the compiler, such that prior to the execution of the program, the hash values are loaded in a bursting fashion in the on-chip memory, this means that they are not loaded one hash value at a time. 5.3 Conclusion This chapter discusses the comparison between the available secure processors which are XOM architecture, the Secure Coprocessor of IBM4758 and the Trusted Computing Module of TCG group. The experiment results show that when the design is scaled the area will double. On the other hand the overhead in delay increases if the set associative increases, this is because more logic is needed to calculate where the block of data should be in the set. In case one set Direct Mapping there is only one set. In case of 2 ways you need to check whether the data is left or right. In case of 4 ways you have to check where the data will be between one of the four. 6 Conclusions 6.1 Summary In this thesis, a design of the code integrity check unit is implemented in hardware. The optimized cryptographic core SHA-1 unit is used to calculate the hash values at run time. The pre-computed values are computed statically before the program runs on the Leon processor. These values are usually generated by an authority software that reads the binary code, calculates the checksums for every cache line and embed the pre-calculated checksums in the original binary code. When the program starts executing the checksum values are generated in hardware and checked with the pre-calculated one. If they are not similar an interrupt is raised and the pipeline of the processor is stalled. In chapter 1, the problem statement has been discussed is that most of the new security attacks result in violating the integrity of the software code of an application program. Such security attacks or threats try to change the instructions so that adversary try to gain control over the program execution flow. An example of a malicious code has been given. In chapter 2, some security issues were discussed. Different classes of attacks into computer systems were described like the Hardware-Based Attacks where an adversary has direct physical hardware access, the Software-based Attacks that are specially used remotely to break into computer and embedded systems. Some of the common software based attacks that were explained are Buffer Overflow, Heap Overflow, Double free vulnerability, Format string attack and Temporary file vulnerability. Some related work has been discussed like a design of an architecture for a secure execution of programs on embedded processors. A hash function which is a computationally efficient function converting binary strings of arbitrary length to binary strings of fixed length called the checksums was discussed. The hash functions are one-way functions that computes a small fixed length output value, the digest message, that is highly correlated with the input data. One of the trivial characteristics of the hash functions is that no information of the data input can be obtained. It is very rarely to have two different data streams generating the same hash value. The Secure Hash Algorithm (SHA-1) computes a 160 bit message digest or output hash value from the input message. The input data stream is separated into multiple input bocks of 512 bits each. The input block is split into 80*32 bits words, one 32-bit word for each computational round of the SHA-1 algorithm. Each round comprises additions and logical operations, such as bitwise logical operations and bitwise rotations to left. The computation of the final checksum is an eighty iteration algorithm over each message block, where every message block is a sequence of 32 bit words. In chapter 3, the choices for the designs were described. A choice has to be made which of the three soft-core is going to be used in this thesis. The three processors 49 50 CHAPTER 6. CONCLUSIONS LEON-3 and OpenRISC 1200 and Microblaze are 32-bit RISC processor big endian synthesizable pipelined processors. LEON-3 and OpenRISC 1200 contains 5-stage pipelines while the Microblaze contains 3 stages of pipeline. Leon3 soft-core processor is the targeted platform chosen to modify because Leon-3 is the most configurable processor, while MicroBlaze and OpenRISC have less configuration options. Leon soft-core offers better design flexibility than the MicroBlaze and OpenRISC. The 32-bit LEON processor that complies to the SPARC V8 architecture was discussed. The Leon Integer Unit as defined that consists of five pipeline stages, the on-chip memory in LEON3 is implemented with separate instruction and data buses, and how to configure the cache direct map or set associative and the size of the cache were also illustrated. The memory controller of the Leon-3 processor controls a memory bus holding external memory devices, asynchronous static ram (SRAM) and synchronous dynamic ram (SDRAM) were also explained. AMBA (Advanced Microcontroller Bus Architecture) is an on-chip bus specification for interconnection and organization of various functional modules that are a part of System-on-Chip. The AMBA specification that improves the reusable system on-chip platform by including a common standard for data communication in a System-on-Chip module and the three distinct AMBA buses which are Advanced High-performance Bus (AHB), Advanced System Bus (ASB) and Advanced Peripheral Bus (APB). In chapter 4, The certified authority which is a software program reads the binary file of a compiled program, assigns the instructions into a certain number of bytes, calculate the checksums with the SHA-1 algorithm and embed the checksums to the cache line that contains the instructions and load them in the PROM. The efficient implementation of Secure Hash Algorithm (SHA-1) in hardware was described. This section describes the implementation of the Code Integrity Check (CIC) unit in hardware using the Leon3 ML403 Template. The hardware CIC (Code Integrity Check)unit that was build out of some key components called the controller of the cryptographic unit, the cryptographic unit (SHA-1 core), separate control units to detect whenever the data read from the cache are valid, some selection logic components and a compare unit were illustrated. In chapter 5, the comparison between the available secure processors which are XOM architecture, the Secure Coprocessor of IBM4758 and the Trusted Computing Module of TCG group has been discussed. The delay is measured by configuring the size of the cache for different number of sets. The experiment results show that when the design is scaled the area will double. The cryptography unit when it is added to the Leon3 template the delay in overhead increases about 30 percent and the area doubles in size. The reason why the overhead in delay increases if the set associative increases, because more logic is needed to calculate where the block of data should be in the set. In case one set Direct Mapping there is only one set. In case of 2 ways you need to check whether the data is left or right. In case of 4 ways you have to check where the data will be between one of the four. 6.2 Future Work The following points proposes the future research direction work: • Implementing the Concealed Execution Mode unit that checks for integrity of data 6.2. FUTURE WORK 51 during runtime. When the data are written to the memory. Before sending the data to the off-chip, the data has to be encrypted and hashed. When the data are read on chip, the pre-calculated checksums are checked with the one calculated during run-time(same as instruction hash). • Modifying the compiler of the LEON processor to add instructions like hash pointer to the Instruction Set Architecture (ISA). Every basic block begins with the hash pointer where it points to the pre-computed checksum value of the basic block. During execution of the code, the checksum is calculated on run-time and compared to the pre-computed value which is found easily by the hash pointer. 52 CHAPTER 6. CONCLUSIONS Bibliography [1] J.S. Dwoskin, R.B. Lee. ”Hardware-rooted trust for secure key management and transient trust,” In Proceedings of the 14th ACM conference on Computer and communications security, pp. 389 - 400, IEEE November. 2007. [2] R.B. Lee, P.C. Kwan, J.P. McGregor, J. Dwoskin, Z. Wang. ”Architecture for protecting critical secrets in microprocessors,” In Proceedings. 32nd International Symposiumon, pp. 2- 13, IEEE June. 2005. [3] J.P. McGregor, R.B. Lee. ”Protecting cryptographic keys and computations via virtual secure coprocessing”, pp. 16 - 26, IEEE March. 2005. [4] R. Chaves, ”Secure Computing on Reconfigurable Systems”, pp. 207, December 2007, PhD Thesis. [5] R.D. Stinson, ”Cryptography-Theory and practice”, CRC Press, 1995. [6] A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, ”Handbook of Applied Cryptography”, CRC Press, 2001. [7] www.gaisler.com [8] J. Gaisler, ”The Leon Processor User’s Manual”, Version 2.3.7, August 2001 [9] J. Gaisler, ”BCC-Bare-C Cross-Compiler User’s Manual”, Version 1.0.29, February 2007 [10] www.exforsys.com/tutorials/c-language [11] S.G.Kochan, ”Programming in ANSI C” [12] B.W. Kernighan, D.M. Ritchie, ”The C Programming Language” [13] Federal Information Processing Standards Publication, ”Specifications for the Secure Hash Standard”,August 2002. [14] Federal Information Processing Standards Publication, ”Specifications for the Keyed-Hash Message Authentication Code”,March 2002. [15] S. Pongyupinpanich, S. Choomchuay. ”An Architecture for a SHA-1 Applied for DSA”, 3rd Asian International Mobile Computing Conference May. 2004. [16] M. Kim, Y. Kim, J. Ryou, S. Jun. ”Efficient Implemntation of the keyed-Hash Message Authentication Code Based on SHA-1 Algorithm for Mobile Trusted Computing”, pp. 410-419, IEEE August. 2007. [17] D. Zibin, Z. Ning. ”FPGA Implementation of SHA-1 Algorithm”, IEEE August. 2003. 53 54 BIBLIOGRAPHY [18] D. Toma, A. Perez, D. Borrione, E. Bergeret. ”Design Of A Proven Correct SHA Circuit”, IEEE March. 2003. [19] J. Dwoskin, D. Xu, J. Huang, M. Chiang , R. Lee. ”Secure Key Managment Architecture Against Sensor-node Fabrication attacks”, IEEE Dec. 2007 [20] H. Lin, X. Guan, Y .Fei, Z.J. Shi. ”Compiler-assisted Architectural Support for Program Code Integrity Monitoring in Application-specific Instruction Set Processors”, pp 815 - 820,IEEE 2007. [21] Y .Fei, Z.J. Shi. ”Microarchitectural Support for Program Code Integrity Monitoring in Application-specific Instruction Set Processors”,IEEE 2007. [22] D. Arora, S. Ravi, A. Raghunathan, N. K. Jha. ”Secure Embedded Processing through Hardware-assisted Run-time Monitoring”, pp 178 - 183, IEEE 2005. [23] D. Arora, S. Ravi, A. Raghunathan, N. K. Jha. ”Enhancing security through hardware-assisted run-time validation of program data properties”,pp 190 - 195, IEEE 2005. [24] D. Arora, S. Ravi, A. Raghunathan, N. K. Jha. ”Architectural Enhacements for Secure Embedded Processing”,IEEE 2005. [25] K.D.Wilken, T. Kong. ”Concurrent Detection of Software and Hardware DataAccess Faults”, IEEE 1997. [26] R.G.Gael. ”Architectural Support for Security and Reliability in Embedded Processors”, August 2006, PhD Thesis. [27] J.Platte. ”A Security Architecture for Microprocessors”, Nov. 2006, PHD Thesis. [28] J.Gaisler. ”GRLIB IP Core User’s Manual”, Version 1.0.19, September 2008. [29] J. Gasiler. ”Leon 3 ML401 Template Design”, November 2007 [30] L. Buttelmann. ”Leon 3 VHDL simulation guide”, Version 0.1.18.10.2007 [31] ”The SPARC Architectural Manual”, Version 8. [32] P. Anemaet, T. Van As. ”Microprocessors Soft-Cores: An Evaluation of Design Methods and Concepts on FPGAs”, part of the Computer Architecture(Special Topics) course ET4078, Department of Computer Engineering. [33] F. Duarte, ”A Cache-Based Hardware Accelrator for Memory Data Movements”, Nov.2008, PhD Thesis. [34] A. Murat Fiskiran, R. B. Lee, ”Runtime Execution Monitoring (REM) to Detect and Prevent Malicious Code Execution”, pp 452 - 457 , IEEE 2004. [35] D.Driessens, T. Tierens, ”Embedded Systeemontwerp op basis van soft end Hardcore FPGA’s”. BIBLIOGRAPHY 55 [36] D. Mattsson, M.Christensson, ”Evaluation of Synthesizable CPU Cores ”. [37] R.R. Srivastava, ”System on-chip platform”, Master Thesis, August 2004. [38] ”AMBA Specification”, ARM, May 1999 [39] R. Lien, T. Grembowski, and K. Gaj, A 1 Gbit/s partially unrolled architecture of hash functions SHA-1 and SHA-512, in CT-RSA, pp. 324338, 2004. [40] K.Eisele, ”Design of a Memory Management Unit for System-on-a-chip Platform LEON”, Master thesis, University of Stuttgart. [41] D.L.C. Thekkath, M.M.P. Lincoln, D.B.J. Mitchell, M.Horowitz, ”Architectural Support for Copy and Tamper Resistant Software”. pp 168 - 177 , IEEE 2000. [42] S.W. Smith, S. Weingart, ”Building a High-Performance, Programmable Secure CoProcessor”. IEEE 1998. [43] Trusted Computed Group.LaGrande Technology Architectural Overview. [44] TCG Specification Architecture Overview, August 2007. [45] S. Bajikar, ”Trusted Platform Module (TPM) based security on Notebook PCs. June 2002. 56 BIBLIOGRAPHY Appendix A: LEON3 VHDL simulation steps with Modelsim 7 This appendix explains the steps of how to run an application on LEON3 processor. Most of the tutorials do not explain clearly the methods and steps that need to be taken in order to compile, simulate the design. We start first explaining how to configure the Leon3 processor with make xconfig GUI. • make xconfig : changes the leon3 configuration by creating a config.vhd Figure 7.1: Leon processor Configuration • Processor : by clicking on processor , a GUI filled with several options is launched. You have a flexibility of modifying the cores inside the Processor. Figure 7.2: Configuration Inside processor Next we describe the steps to compile an application. We give a very simple example of how to do it. 57 58 CHAPTER 7. APPENDIX A: LEON3 VHDL SIMULATION STEPS WITH MODELSIM • gcc hello.c :This step compiles the C code and creates and executable file. The command used is : sparc-elf-gcc -msoft-float-O2 hello.c -o hello.exe • mkprom : This step loads the binary file into the prom of the Leon processor. The command used is : sparc-elf-mkprom.exe -rmw -msoft-float -v -ramsize 1024 hello.exe • objcopy : this step creates a copy of the object file. The command used is : sparc-elf-objcopy -O srec prom.out prom.srec After the steps of compilation of the applications are fulfilled, we describe the steps used for simulation and synthesize with XILINX ISE. • make vsim : This step compiles all the IP cores of the LEON3 template. • vsim testbench : This step loads the testbench in modelsim and runs the simulation. • make scripts : This step creates a compile.xst file which contains commands for analyzing all GRLIB files and creates also .npl files for the ISE project. • ise leon3mp.ise : This step creates the .ise project file. • make ise : This step generates the netlist with XST. The final programming file is the ”LEON3mp.bit” is the file that can be used to be run on the FPGA board. Appendix B: Software Test program 8 This appendix explains the software programs that run on LEON3 processor. • Simple test program, testing on for basic block. main() { int result1,result2; int x = 2; int y = 3; result1 = x + y; } Figure 8.1: Simple Test program • Simple Test program but including some standard function like printf and basic test function for the LEON processor but the program is still running from main. main() { report_start(); base_test(); int result1,result2; int x = 2; int y = 3; result1 = sum(x,y); result2 = mult(x,y); printf("the sum is %d \n",result1); printf("the product is %d \n",result2); report_end(); } Figure 8.2: Test program 2 • Test program 3 including jump instructions and many if statements • Test progam 4 including while loop and for loops 59 60 CHAPTER 8. APPENDIX B: SOFTWARE TEST PROGRAM int sum (int a , int b) { int result; result = a + b; return result; } int mult (int a , int b) { int result; if (a > b) result = b * a; else result = a * b; return result; } main() { report_start(); base_test(); int result1,result2; int x = 2; int y = 3; result1 = sum(x,y); result2 = mult(x,y); printf("the sum is %d \n",result1); printf("the product is %d \n",result2); report_end(); } Figure 8.3: Test program 3 61 Void sum (int a , int b) { int *result, result = malloc(sizeof(int)*1000); int i; for(i= 0; i <1000;i++) result[i] = a + b; } int mult (int a , int b) { int result; while (a > b) result = b * a; else result = a * b; return result; } main() { report_start(); base_test(); int result1,result2; int x = 2; int y = 3; result1 = sum(x,y); result2 = mult(x,y); printf("the sum is %d \n",result1); printf("the product is %d \n",result2); report_end(); } Figure 8.4: Test program 4 62 CHAPTER 8. APPENDIX B: SOFTWARE TEST PROGRAM
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
advertisement