DEGREE PROJECT IN MEDIA TECHNOLOGY, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Ahead of Time Compilation of EcmaScript Code Using Type Inference
Förkompilering av EcmaScript programkod baserad på typhärledning

Jonas Lund
[email protected]

DM228X Degree Project in Media Technology, 30 credits
Degree Programme in Media Technology (Interactive Media Technology), 270 credits
Royal Institute of Technology, year 2015
Supervisor at CSC: Vasiliki Tsaknaki
Examiner: Haibo Li

Royal Institute of Technology
School of Computer Science and Communication (CSC)
KTH CSC, SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc

Abstract

To investigate the feasibility of improving the performance of EcmaScript code in environments that restrict the use of dynamic just-in-time compilers, an ahead-of-time EcmaScript-to-C compiler capable of compiling a substantial subset of the EcmaScript language has been constructed. The compiler recovers type information from source code without custom type annotations by using the Cartesian Product Algorithm. While the compiler is not complete enough to be used in production, it has been shown capable of producing code that matches contemporary optimizing just-in-time compilers in terms of performance, and that substantially outperforms the interpreters currently used in restricted environments. In addition to constructing and benchmarking the compiler, a survey was conducted to gauge whether the selected subset of the language is acceptable for use by developers.
Förkompilering av EcmaScript programkod baserad på typhärledning

Sammanfattning

To investigate the possibilities of improving performance for program code written in EcmaScript in restricted environments that prevent the use of dynamic so-called just-in-time compilers, a static EcmaScript-to-C compiler has been developed. The compiler is capable of compiling programs that use a large subset of the language as described in the standard. It derives type information from the program code, without explicitly inserted type annotations, using the Cartesian Product Algorithm. While the compiler is not developed to the degree that it is usable in a production environment, test results show that the code it produces matches contemporary just-in-time compilers and has large performance advantages over the interpreters used in restricted environments. In addition to the construction and evaluation of the compiler, a survey was conducted among developers to evaluate whether the selected subset of the language was acceptable to developers.

Contents

1 Introduction
1.1 Defining the Problem Space
1.2 Purpose
1.3 Delimitations
2 Related Research
2.1 Basic Type Inference Systems
2.2 CPS, Expansion Theory and Soft Typing
2.3 Modern Precision Improving Methods
2.4 Improvements in Interpretation and JIT Techniques
2.5 Other Static Compiler Implementations
3 Method
3.1 Compiler Construction
3.2 Compiler Performance Evaluation
3.3 Language Subset Evaluation
4 Compiler Design
4.1 System Design Considerations
4.2 Compiler System Design Overview
4.3 Parsing, Syntactic Analysis and Closure Conversion
4.4 Advanced Flow and Type Analysis in the Compiler
4.4.1 Limits of an Abstract Syntax Tree
4.4.2 Conversion to Continuation Passing Style (CPS)
4.4.3 Abstract Interpretation with the Cartesian Product Algorithm
4.5 Nominalization of Abstract Structural Types
4.5.1 Path Compression
4.5.2 Object Reification
4.6 Code Generation
5 Performance Evaluation
5.1 Benchmarking Setup
5.1.1 Benchmarked Factors
5.1.2 Used Benchmarks
5.1.3 Compared Systems
5.1.4 Compiler and Machine Variations
5.1.5 Benchmarking Variability
5.2 Benchmarking Results
6 Language Subset Evaluation
6.1 Feature Questionnaire
6.2 Questionnaire Results
7 Discussion
8 Conclusion
References
Appendix A: Glossary
Appendix B: Benchmarking Samples
001_fib.js
002_fibco.js
003_cplx.js
004_cplxo.js
005_cplxpo.js
006_array.js
007_arrayco.js
Appendix C: Benchmarking Results
Win32
Raspberry PI B1
Raspberry PI B2
C compilers
Appendix D: Questionnaire

1 Introduction

1.1 Defining the Problem Space

JavaScript is probably the most widespread computer language on the planet due to its inclusion in web browsers, and this remains its commonly used name even though the core parts of the language were later standardized as EcmaScript (Ecma 2011). The language started out as an extension language providing simple interactivity to web pages in Netscape browsers in the mid 90s, but it has proven itself an increasingly capable language as developers have put internet browsers to more and more computationally expensive tasks, enabled by the significant engineering effort that has gone into improving the performance of its implementations.

In the last few years there has been a push, in the form of HTML5 APIs, to improve the graphics (Canvas, WebGL, Fullscreen), audio (WebAudio) and networking (WebSockets, WebRTC) capabilities of browsers, enabling games and other interactive applications to be distributed portably to a variety of platforms via the browser. For game companies, in-browser games on the desktop are an enormously attractive proposition, as the barrier of entry for a player to start playing is very low. With the performance of JavaScript implementations ever increasing and these new APIs becoming commonplace, we are now at a tipping point where many advanced games that would previously have been desktop based will be developed for the web first. For example, the popular Unity game development environment is in the process of abandoning its custom plug-in in favor of HTML5 as the preferred target for web deployments.

On mobile devices, however, the proposition is somewhat different, as some of these APIs, or the user interface, are crippled due to security concerns that override the needs of game developers. In addition, the mobile ecosystem has a better functioning marketplace for developers wishing to monetize games as native applications.
While toolkits exist to port these web-based games to mobile devices as native applications, they suffer in terms of performance, as they rely on interpreters while modern JavaScript environments uniformly rely on just-in-time (JIT) compilation to speed up execution. A requirement for a JIT to function is that the engine is allowed to create and change native code during execution; this, however, conflicts with the premises of code signing and is thus disallowed when distributing applications on the various platforms that put user security before developer convenience.

Under the auspices of the author's company, a lexer, parser and the embryo of the type inference system used here had previously been produced; these showed enough promise to instigate further development of a prototype compiler capable of producing actual running code. The lexer and parser are designed after the EcmaScript standard specification. The embryonic type inference system has undergone major changes during this thesis work to make it useful in an actual compiler; much of the code has changed while the core concepts have been retained.

1.2 Purpose

This work investigates the possibility of attaining higher performance for game code written in the EcmaScript language with static ahead-of-time (AOT) compilation instead of interpretation, in places where just-in-time (JIT) compilers are unavailable. Due to the way dynamic typing in EcmaScript works, compared to the explicitly typed languages usually used with static ahead-of-time compilation, a major focus of the work on getting the static compiler to generate high performance code lies in type inference: making implicit type information explicit for the backend code generator.

1.3 Delimitations

The focus is on answering questions about the applicability of statically compiling game code in an EcmaScript language subset.
Many constructs and behaviors of EcmaScript and JavaScript are very hard to analyze and optimize, especially in a static context, which is why the studied dialect is a limited subset of the full standardized language. In fact, some of those constructs (such as the with statement) and semantics have been, or are in the process of being, deprecated in current and future EcmaScript and JavaScript revisions for both performance and safety reasons, so their omission should not be a major impediment for developers (Ecma-262 5th ed).

The runtime system, while mentioned in places, is not detailed in this text. For the benchmarks in this report, only a very small skeleton runtime library was made, mostly consisting of functions that expose the operating system's memory and timing functionality to the compiled EcmaScript code. A full runtime system would require functionality to emulate some of the dynamic behaviors mentioned, but in this work dynamic behavior was tested in other ways, as detailed in the benchmarking section.

Another big factor when evaluating high level languages for game development is garbage collection. In particular, garbage collection pauses can hurt immersion due to unpredictable delays in execution when they are triggered. While garbage collection would need to be taken into account when evaluating interactive performance, this evaluation of raw computation speed in relation to type inference can ignore the effects of garbage collection pauses.

While there is a multitude of algorithms for type inference, these will only be briefly mentioned and not exhaustively compared against the selected algorithm, as this work focuses on evaluating how well code behaves when compiled with a compiler based on type inference.
Without doubt a more powerful algorithm could potentially improve performance, but the Cartesian Product Algorithm has already proven useful and powerful in previous works, and part of the hypothesis this work is based on is that the algorithm is suitable for soft analysis.

This evaluation also does not touch upon very low level program optimization techniques such as register allocation, instruction scheduling and pipelining. As the compiler targets a C/C++ compiler, or a code generation system such as LLVM, to handle low level code generation, these tools handle low level optimizations very well while giving the compiler an appropriate abstraction, letting it focus on high level optimizations through type inference and other program analysis.

2 Related Research

2.1 Basic Type Inference Systems

Due to performance concerns, the only feasible solution for practical use in the early ages of computing was relatively low level languages with no or explicit typing, and with the limited hardware of the time, even simple optimizations on those languages were hard enough that most optimization efforts were focused on such low level problems. While languages without explicit typing existed, they were usually academic interpreted experiments and thus suffered from performance problems compared to the established systems of the day (Van Emden 2014).

Around 1970 Hindley-Milner type inference was defined and later implemented in ML, providing the first steps toward letting programmers use implicitly typed programs with high performance. While this approach works great for statically typed programs with declarative definitions designed for it, as in the ML language, the basic premise of constraining types in a graph has drawbacks for the dynamic types with imperative definitions that can be present in Lisp-like languages, as it can limit the dynamism of the language.
In addition, analyzing types in a Lisp-like language must, on top of the special cases needed to handle normal call/return pairs, also handle continuations: a construct powerful enough to simulate both the return statements and the exceptions common in imperative languages.

2.2 CPS, Expansion Theory and Soft Typing

A major enabler for further research on such dynamic languages was the formulation of Continuation Passing Style (Shivers 1988) for programs, with accompanying transforms from regular syntax, hereafter known as CPS. This new form gave the ability to model all control flow of a program as invocations of continuations that never need to return, rather than relying on implicit return semantics (Appel 1992). For the statement "x=y+z" a regular parse tree would be in the form "call set_x with the result of the computation y+z"; a CPS-transformed program, on the other hand, says, roughly, "compute y+z, then pass the result of this computation as an argument to set_x". While this seems to express the same thing in a longer and more convoluted way, it gives compiler developers a coherent way of analyzing all program flow, since they can focus on a simple chain of operations instead of a tree with contextual meaning.

Naive type analysis with only unexpanded basic types and singular code paths over nodes proved to provide limited accuracy, and thus limited performance, for fully inferred code in Lisp systems, since the functional style with very primitive data leads to user composition of data containers. As a remedy came the realization and formalization of N-expansion for data and code definitions when trying to achieve exact inference of polymorphic code.
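The "x=y+z" example above can be sketched in EcmaScript itself; the helper names (add, setX) are invented for illustration and do not reflect the compiler's actual internal form:

```javascript
// Direct style: the result of y + z implicitly returns to the caller,
// which then performs the assignment.
function setXDirect(state) {
  state.x = state.y + state.z;
  return state.x;
}

// CPS sketch: every step receives an explicit continuation 'k' and
// never returns; control flow is a visible chain of calls.
function add(a, b, k) { k(a + b); }
function setX(state, v, k) { state.x = v; k(state); }

function setXCps(state, k) {
  // "compute y+z, then pass the result on to set_x"
  add(state.y, state.z, function (sum) {
    setX(state, sum, k);
  });
}
```

In the CPS form every transfer of control is an explicit call to a continuation, which is what gives the analysis one uniform chain of operations to follow; it is this representation that the N-expansion formalization mentioned above operates on.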
This provided the definition of the k-CFA family: non-expanded object definitions as 0-CFA (no expansion), callsite expansion as 1-CFA (identify objects by calling site), 2-CFA (identify objects by callsite together with the allocation environment), and higher levels defined by allowing expansion in terms of the environment of an environment, formalizing k-CFA (Shivers 1991) (Might 2014).

The problem with expansion, however, is that it is impossible for a compiler to exactly decide the types of an arbitrary program without actually running it; doing so would require solving the halting problem, which is proven undecidable, so a compiler requires various heuristics to finish its analysis in bounded time (Shivers 1991). Even disregarding the halting problem, full expansion of the program at every expansion level is highly inefficient in terms of memory and processing power for most programs, as it is likely that only isolated code paths need expansion while most code benefits very little from it.

Also notable is that soft typing was defined by Cartwright (Cartwright 1991) as utilizing type inference improvements to speed up specific-case code in dynamic systems without trying to achieve perfect inference.

2.3 Modern Precision Improving Methods

Plevyak (1988) published a work proposing an iterative method that starts with 0-CFA and then expands only problematic cases before rebuilding the flow information, keeping expansion down to only the bad cases. This approach showed some improvements, but keeping track of the problematic cases can present problems for a compiler writer.
The CPA algorithm (Agesen 1995) was introduced in conjunction with the development of the seminal Self system. This algorithm sidestepped the expansion problem for code with a slightly different approach that mostly solved the N-expansion problems for function invocations: invocation nodes carry a growing set of argument type tuples, and incoming argument sets are matched against previous occurrences when possible instead of being expanded. Since this method doesn't blindly trace in data but relies on what actually flows into the node to decide expansion, it is easier to implement in a stable fashion; however, as it only manages code expansion and not data, it doesn't solve all expansion problems.

More recently the CFA2 algorithm was proposed (Vardoulakis 2011). It partly unifies the work of Agesen and Plevyak, but also relies on novel pushdown heuristics to codify and terminate expansions (in a sense, Agesen's CPA can be seen as a heuristic to terminate expansion). The CFA2 work has in turn spurred new research into using pushdown automata, but also other approaches such as in-analysis garbage collection, to limit expansion while retaining correct analysis semantics (Might 2006). Object sensitivity, while used in practice before that point in time, was recently codified as an excellent indicator of object identity selection to guide type inference systems, as a dual to closure environments (Smaragdakis 2011).

2.4 Improvements in Interpretation and JIT Techniques

The group working on the Self system pursued several tracks to improve the performance of their system, and they did a comparison of type feedback (as in a JIT) versus type inference (with static compilation) that showed a slight performance benefit for JIT compilation, but also many practical advantages for development (Agesen & Hölzle 1995).
One could argue that it is partly due to this paper (but also very much due to the very dynamic nature of JavaScript on the web) that we have since seen a great deal of work on improving JIT compilation compared to the work done on static compilation for dynamically typed systems. Apart from type dispatch, these JIT compilers have borrowed methods from the old Self system (Google 2008) and added other interesting data representation tricks (Wingolog 2011) to improve performance. But work has also started going towards more explicit type verification systems to extract even more performance, as exemplified by Asm.js (Herman 2013), or towards even more expensive type inference to achieve this implicitly (Egorov 2013). With the ever higher levels of analysis required in JIT systems, we are again starting to see a convergence between the JIT and static analysis fields, as more and more expensive analyses are being permitted in JIT systems to gain performance.

2.5 Other Static Compiler Implementations

While almost all work on JavaScript performance has focused on JIT compilation, there has been some work on static compilation of other dynamic languages used in more conventional settings. Cython (Cython 2014) and Nuitka (Hayen 2014) for the Python language, and similar compilers, are fairly plain compilers that differ relatively little from normal compilers for explicitly typed languages, mostly transforming syntax trees into the corresponding abstract operations as defined by the language. However, as Python is a dynamic language where much time is spent doing the right thing with the right type, the main performance advantage is more or less just in getting rid of the dispatch loop of an interpreter, plus some Python specific optimizations; much of the performance penalty of dynamic type dispatch remains.
Cython tries to augment performance with specific manual type annotations, while Nuitka is designed to evolve outwards by optimizing increasingly bigger snippets of code. While the performance improvements for generic code from these optimizations are small, they provide very high compatibility and an opportunity to fine tune specific code paths where needed; the downside of this approach is increased code duplication and maintenance if the code is to target other targets apart from the specific compiler used.

Implementations with higher performance targets, as practical applications of type inference theory, were Starkiller (Salib 2004) and Shedskin (Dufour 2006) for Python and Ecstatic (Madsen 2007) for Ruby. These all utilized Agesen's CPA for type inference, showing the strength of this algorithm and reporting promising performance results. Shedskin also used the work of Plevyak for data expansion, as suggested in Agesen's original paper. Of these systems only Shedskin has been made publicly available.

3 Method

3.1 Compiler Construction

Without any actualization of the algorithms and ideas under review, any research into the area would be based on speculation. Thus an optimizing prototype compiler has been constructed, based on a set of hypotheses about game code behavior, to test the possibility of generating fast code within these assumptions. First a set of hypotheses about the code to be compiled was defined, based on that a design for the compiler was worked out, and finally the design was implemented.

3.2 Compiler Performance Evaluation

The main goal of this evaluation was to answer how well a well defined subset of the EcmaScript language aimed at game development can be optimized with an ahead-of-time compiler, compared to contemporary interpreters and JIT compilers.
To test this, a set of increasingly complex benchmarks was implemented and run within a controlled test harness, giving answers about the performance characteristics of the code generated by the compiler compared to established solutions using other techniques. There is also a brief review and comparison of how different C compilers handle the output of the constructed compiler: since all low level optimizations are deferred to the C compilers, a brief review is needed to see how this impacts the results compared to the established solutions.

3.3 Language Subset Evaluation

Since the compiler, to attain performance, has a restricted set of capabilities amounting to a subset of the language, an important question to answer was how well such a subset conforms to the stated assumptions about real world usage of the language. The defined subset might also present developers with benefits in terms of better error checking than the standardized language.

4 Compiler Design

This chapter describes the design of the implemented compiler, first by presenting the main motivations and direction in chapter 4.1, then by a brief conceptual overview of the internal compiler workings in chapter 4.2, and finally by presenting details of the main components in the rest of chapter 4.

4.1 System Design Considerations

The main goal in the design of the system is to provide a practical compiler for porting games and similar applications written for the web platform to native applications. As such, the priorities are firstly general compatibility and secondly performance. Even if compatibility is the primary goal, it is impossible to attain perfect compatibility with an ahead-of-time compiler due to constructs such as eval; such border cases that are impossible to analyze and optimize are left out.
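As a minimal illustration of why eval is impossible to analyze ahead of time (the function name below is invented for illustration):

```javascript
// The string passed to eval is arbitrary runtime data: no ahead-of-time
// compiler can know which code, variables or types it will introduce.
function runUserCode(src) {
  return eval(src);
}

// runUserCode("1 + 1")     evaluates to the number 2
// runUserCode("'a' + 'b'") evaluates to the string "ab"
// ...and the source string could just as well define new functions
// or assign to variables the compiler has never seen.
```

Since no static analysis can anticipate every such string, a fully ahead-of-time compiler must exclude the construct rather than support it partially.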
The focus of most works listed in the related research chapter is to attain perfect type information, in order to generate optimally fast code in every possible case. For this compiler, however, perfect information is not necessary, for three reasons presented below.

The first reason is that this compiler is primarily aimed at developers writing regular application code, rather than at developers writing performance oriented libraries and container constructs exclusively for high performance computing. So while games often require high performance, it is acceptable to fall slightly short for complicated cases, as long as performance is close enough in those parts of the code that take up the majority of the running time.

The second reason is that while most game developers require their compiler to produce well performing code, they are usually not averse to manual performance tuning when needed to achieve performance goals. Thanks to this, slow and expensive border cases are not a critical problem as long as the tooling and documentation enable developers to avoid pitfalls or track down problems as they arise.

The third reason is that the target language is EcmaScript. This language, just like the implementations of Python and Ruby that used CPA before it, relies on objects being part of the language itself, which creates a suitable level of abstraction for the compiler analysis. Since objects in these languages usually carry more information in the form of their various properties, their contents can be easier to analyze than in Lisp and similar languages, where objects and containers are usually implemented using lower level constructs of the language itself, such as list cells, making it necessary to analyze these lower level constructs in order to provide an abstract view of objects.
Languages with first class objects, on the other hand, give us the opportunity to treat objects as whole entities in the analysis, reducing the need for higher data accuracy, since the programmer has already provided an implicit link, in the form of the objects themselves, between the points that need analysis. In addition to first class objects, the facts that EcmaScript Arrays function as the variable size vectors of other languages and that regular object literals are commonly used as hash tables give the analysis system further hints about object relationships.

To expand on this third reason we will look at a number of hypotheses about how game code in EcmaScript usually behaves. These hypotheses are based on the author's experience of writing EcmaScript code together with results reported in the research literature mentioned earlier. The hypotheses are:

• Runtime code loading is limited, or can be contained with well defined functionality that the compiler can analyze ahead of time.
• Most performance sensitive code paths in games are of a numeric nature and largely monomorphic.
• Dynamic code paths are usually explicitly placed, and their impact in a soft typed system is relatively small with modern interpretation techniques.
• Callsite identification of data allocations (formally 1-expansion, as in 1-CFA) is by default sufficient for successfully inferring most performance sensitive code in practice.
• In cases where 1-expansion is insufficient, optional programmer guided expansion should only be needed for corner cases such as container libraries, for example binary trees.
• Basic container libraries are rarely used in practice, as EcmaScript developers mostly use object literals as hash tables, and the regular Array type supports vector functionality.
• Undefined semantics can be modified, as most computations on the Undefined value are accidental and result in unwanted behavior anyhow.
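A hypothetical snippet of typical game code illustrating the central hypotheses above: a numeric, monomorphic hot path, with object literals doing double duty as records and hash tables (all names are invented for illustration):

```javascript
// Monomorphic numeric hot path: every call sees the same object shape,
// so callsite-based (1-CFA style) inference can type it exactly.
function integrate(p, dt) {
  p.x += p.vx * dt;
  p.y += p.vy * dt;
}

// An object literal used as a record...
var player = { x: 0, y: 0, vx: 3, vy: 4 };

// ...and another used as a hash table, in place of a container library.
var entitiesById = {};
entitiesById["player"] = player;

integrate(entitiesById["player"], 0.5);
```

After the call, player.x is 1.5 and player.y is 2; since only one object shape ever flows into integrate, the compiler can emit direct, untagged numeric code for it.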
Based on all of these reasons and code usage hypotheses, the design of the compiler and runtime system was set:

• The compiler has to be fully ahead of time; this means that optimizations have to be possible to calculate ahead of execution time.
• The compiler will use type inference based on the Cartesian Product Algorithm to infer as much as possible about how the final executable will behave.
• As the full language is impossible to optimize well, the compiler operates on a subset of the EcmaScript standard that relies on some of the hypotheses about code usage specified above.
• The type inference is not required to produce a perfect inference result, as actual dynamism is expected in the code.
• A developer should be able to guide the inference system in the specific cases where performance really matters.
• As dynamism is expected, the compiler and runtime system should be designed so that the penalty of dynamic code is minimal where it does occur.

4.2 Compiler System Design Overview

As shown in Figure 1, the compilation process is divided into a number of major stages. The first stage is lexing and parsing, which takes the source files of a program and turns them into abstract syntax trees (AST). The second stage rewrites the AST in order to simplify the code. The third stage transforms the AST into an internal form with continuation passing style (CPS) characteristics. The fourth stage is an abstract interpretation of this internal form, producing a control flow graph and abstract type information. The fifth stage rewrites the abstract code paths into code paths suitable for execution by a regular CPU. The sixth stage transforms object definitions, based on information in the abstract data, into real object definitions that can be used during execution. The seventh and final stage produces C code that can then be compiled by a regular C compiler.
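The seven stages can be sketched as a simple composition; the stage names follow the text, but the functions themselves are placeholders invented for illustration, not the compiler's actual code:

```javascript
// Each stage consumes the previous stage's output; here each placeholder
// just records that it ran, to make the ordering visible.
function makeStage(name) {
  return function (input) { return input.concat([name]); };
}

var stages = [
  makeStage("lex/parse"),       // source text → AST
  makeStage("simplify"),        // AST rewrites
  makeStage("cps-convert"),     // AST → CPS-style internal form
  makeStage("abstract-interp"), // CPS form → flow graph + abstract types
  makeStage("lower"),           // abstract code paths → executable paths
  makeStage("reify-objects"),   // abstract objects → concrete layouts
  makeStage("emit-c")           // → C source for a C compiler
];

function compile(source) {
  return stages.reduce(function (ir, stage) { return stage(ir); }, [source]);
}
```

The point of the strict pipeline shape is that each stage only depends on the representation produced by the stage before it, so each can be developed and tested in isolation.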
Figure 1: Compilation process overview

4.3 Parsing, Syntactic Analysis and Closure Conversion

The first stage of the compiler is relatively straightforward and doesn't deviate from how most other compilers operate. First, a lexer and parser combination designed to handle the standard EcmaScript syntax is used to parse the program's source files, producing an abstract syntax tree (AST) in memory that is used throughout the rest of the compiler. A simple example of this is shown in Figure 2.

Figure 2: Parsing of source code to abstract syntax tree

A lexer is based on applying lexical rules detailing facts such as that identifiers always start with an alphabetical character or _ and can thereafter be followed by alphanumeric characters. These rules make it possible to produce state machines or other functions that combine single source code characters into source code tokens. A token could be an identifier (such as player, name, name30), a number (0, 1, 30, 10e5, -0.2), a string ("joe") or an operator such as dot (.) or equals (=). After the source characters are converted into tokens by the lexer, a parser interprets these tokens in relation to each other to record facts in an abstract syntax tree. In the figure, for example, the root tree node specifies the assignment operation that was parsed out because the parser associated the equals (=) token with an assignment operation, which requires a target (the name property of the player object on the left side of the tree) and a value ("joe" on the right side of the tree). A quirk of parsing EcmaScript, due to regular expressions, is that lexers are required to be pull based, as the lexical rules depend on the current parsing context; this makes the implementation of the lexer and parser combination slightly more complex than for many other languages, where the lexer and parser can be fully decoupled.
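The identifier rule quoted above can be written as a regular expression (a simplification for illustration; real EcmaScript identifiers also allow $ and certain Unicode characters):

```javascript
// Identifiers: a letter or _, then any run of letters, digits or _.
var IDENT = /^[A-Za-z_][A-Za-z0-9_]*$/;

IDENT.test("player"); // true
IDENT.test("name30"); // true
IDENT.test("30name"); // false: may not start with a digit
```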
The compiler implements the lexer as a finite state machine, while the parser is based on the common recursive descent method.

Once parsing is completed, a couple of relatively simple syntax analysis and rewrite operations are executed on the AST. This means that these analysis and rewrite operations work on a semantic level very close to the source program as written by the developer.

Figure 3: Variable environment example

Most importantly, a simple environment analysis is done that resolves variable accesses between different functions and scopes. This connects the variables that the programmer uses to the function environments they belong to, as exemplified in Figure 3. In the figure, the usage of the variable "x" in the inner function is called a non-local variable access or, more formally, a free variable access. With environments created and variables attached to them, a transform known as closure conversion can be done (Appel 1992, p. 103). This is done by the compiler in 2 stages. The first stage uses the variable locations to identify variable accesses that are fully local to a function, like "y" in the figure (i.e. both the definition and all usages are within the same function), and to separate them from the free variable accesses that occur when there are references to variables defined in a parent function, as exemplified in the figure with the variable "x".

Figure 4: Closure conversion of function F2 (the AST of the function body before and after the rewrite)

With the identification done, the compiler executes a rewrite on the AST to remove the free variable accesses from the code, replacing them in the syntax tree with explicit object property accesses that work in the same way as regular object accesses (free variable accesses would otherwise need to implicitly load an environment object before operating on a variable).
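The rewrite described above can be sketched in EcmaScript itself (the names env, makeClosure and F2_conv are illustrative only; the compiler performs this transformation on its internal AST rather than on source text):

```javascript
// Before closure conversion: F2 reads the free variable x from its
// parent F, while y is fully local to F2.
function F(x) {
  return function F2(y) {
    return x + y; // x is free, y is local
  };
}

// Conceptually, after closure conversion the free access becomes an
// explicit property read on an environment object.
function makeClosure(code, env) {
  return function (y) { return code(env, y); };
}
function F2_conv(env, y) {
  return env.x + y; // free variable access rewritten to a property get
}
function F_conv(x) {
  var env = { x: x }; // explicit environment object
  return makeClosure(F2_conv, env);
}
```

Both forms compute the same results; the converted form simply makes the environment access explicit, which is what lets later compiler stages treat it like any other object access.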
This is exemplified in Figure 4, which shows the AST layout of the body of the F2 function from Figure 3 before and after the rewrite stage; the implicit variable accesses in the function are transformed into unambiguous and explicit get operations. With closure conversion done on the AST form, which yields internally functionally equivalent code, later stages of the compiler are simplified, since all the complexities of free variables in lexical scoping are removed and replaced with regular object accesses that the compiler must handle regardless.

4.4 Advanced Flow and Type Analysis in the Compiler

4.4.1 Limits of an Abstract Syntax Tree

Syntactical analysis stages can answer most questions related to type information in explicitly typed languages such as Java, C# and C++, with minor function local analysis needed in the worst cases. Such syntactical analysis works well by walking the hierarchical structure of an abstract syntax tree with a simple recursive function. For a language with dynamic types and first class functions, on the other hand, extensive global flow analysis is needed to even be able to determine which functions are invoked with what kind of arguments, as functions are passed along as values in the target language rather than as syntactically bound symbols (Shivers 1988). While it is not immediately obvious which flow analysis techniques are referred to in Shivers' paper, it does not matter, as the main point he made was that control flow in these kinds of languages depends on values that are computed during execution. To recover the control flow and type information from the source program, the compiler executes an abstract interpretation of the program.
First the compiler converts the program from the AST to an equivalent of continuation passing style, and secondly the compiler applies an abstract interpretation of the program to be compiled using the Cartesian Product Algorithm with abstract values, as described in the coming passages.

4.4.2 Conversion to Continuation Passing Style (CPS)

A continuation is a function defined as the rest of the program to be executed (Reynolds 1993). A tail call is a call within a function after which nothing else will happen in that function, so the call can safely replace the caller's stack frame with the callee's. With these two definitions, continuation passing style (CPS) is defined as programs where functions are constructed so that control flow is performed by tail-calling simple functions, passing along what to do next as yet another function (the continuation). While this definition might seem convoluted, the requirements for mechanically handling such code are in reality very simple, and this format is often used in compilers for many optimizations and transformations, up until the point where actual primitive machine specific instructions need to be generated (Steele 1978) (Appel 1992). Figure 5 shows a very simple example; in it, add and mul are used as functions instead of the + and * operators to emphasize the point that all execution is modeled as function calls. The original function in the example receives 2 arguments, multiplies x by 3, then adds y to the result of the multiplication and finally returns the result to an unknown caller. In contrast, the CPS converted function is composed of 3 functions that take their input argument(s) and send them off to an arithmetic function, which in turn calls the next function with the result. While longer overall, each separate function in itself is simpler, since the only thing it does is call another function with variables, constants and other functions as arguments.
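In EcmaScript terms, the transformation just described can be sketched as follows (a reconstruction of the running tripleAndAdd example; the exact code in Figure 5 may differ):

```javascript
// Direct style: implicit evaluation order and an implicit return.
function tripleAndAdd(x, y) {
  return x * 3 + y;
}

// CPS: arithmetic becomes functions, and control flow becomes tail
// calls that pass "what to do next" along as a continuation.
function mul(a, b, cont) { cont(a * b); }
function add(a, b, cont) { cont(a + b); }

function tripleAndAddCPS(x, y, cont) {
  mul(x, 3, function (t) { // continuation receiving the multiply result
    add(t, y, cont);       // tail call hands the caller's continuation on
  });
}
```

Each function in the CPS form does exactly one thing, call another function with its arguments, which is what makes the form easy to handle mechanically.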
Figure 5: Regular continuation passing style conversion illustrated

Just as with the closure conversion described previously, the main purpose of transforming the code into CPS is to remove complex implicit semantics from the compiler internals and replace them with explicit operations. In the example above, the main benefit is that there are no implicit returns; instead, a function closure is explicitly invoked with the result of each computation. Apart from making the return function explicit, continuations can also be used to model the non-local throw/catch semantics of exception handling, and in many ways these can be even harder to analyze than function calls, since a throw doesn't just return to the immediately calling function but can be caught by an exception handler in a function that indirectly called the function that throws the exception. The example of CPS conversion in the figure above introduces the new free variables y and cont, accessed in the inner functions, and while this is useful for the expository purpose of explaining CPS, it contrasts slightly with how the compiler works in practice, as the compiler never creates any free variables during the conversion. The compiler instead directly creates a form closer to what Appel (1992, p. 119) describes as callee-save continuation closures, but with an infinite register set. The internal structures within the compiler also depart enough from the source notation that examples will be shown in an abstract notation to better reflect implementation realities.

Figure 6: Output of conversion to abstract machine

Figure 6 illustrates a form very close to what the compiler actually produces from the tripleAndAdd function that was shown earlier in Figure 5. In the figure the register set is denoted as state; this register set is used to contain all local variables as well as the temporaries of an evaluation stack similar to that of a stack machine.
Within the state, RC and TC represent the virtual return and throw continuations (virtual because they only exist during abstract interpretation and are ignored in the final generated code), FR is the function closure reference that might be needed to access the environment variables extracted during the earlier closure conversion, TR is the JavaScript "this" reference, and finally x and y represent the regular function arguments. In the notation, each call and assignment corresponds to an independently executable instruction. The last instruction in this example executes RC to emulate a return statement; when executed, the result of this function is appended onto the local state at the call site that invoked the tripleAndAdd function, similar to how append, multiply and add worked within the function. Had TC been invoked instead, control would have transferred to the matching try/catch statement(s). The representation used in the compiler deviates in slight ways from what could be considered pure CPS conversion. In a sense the function local parts could be viewed as a regular control flow graph, yet those flow constructs are not used after abstract interpretation, and the compiler also has a relatively coherent handling of "regular" local operation nodes and continuations that follows the blueprints of CPS based compilers. The main benefit of the local and non-local distinction is that the compiler only needs to handle free variables during abstract interpretation for the virtual return and throw continuations. Thus the compiler needs no further transforms on the code between the abstract interpretation stage and code generation, leaving most of the results of the interpretation available to the code generator for optimizations such as generating fast monomorphic type accesses instead of slower generic code.
4.4.3 Abstract Interpretation with the Cartesian Product Algorithm

Abstract interpretation can be described as the compiler correctly running all possibly occurring functions and branches of a program with abstract values instead of real values. The difference from normal execution is due to the fact that the final compiled program will run with data and inputs that are unknown at compile time. So while the exact control flow cannot be known ahead of time for all programs, a compiler can extract a conservative approximation of the control flow by doing abstract interpretation, which can then be used for various purposes. Exactly how abstract interpretation is implemented depends on the requirements of the source language, the design of the system, and how much information is needed for the compiler to accomplish its goals.

Compared to normal execution as done directly by a CPU, which deals with basic types such as booleans, numbers, objects, functions and so on, abstract interpretation deals with descriptions and abstractions of those. For example, during normal execution exact numbers such as 1, 2, 3... might be passed to a function as the parameter X; a compiler doing abstract interpretation will instead use representations such as "anything", "a number", "an integer number", "a number greater than or equal to 1", "a finite set of the numbers 1, 2, 3, 4", or combinations of those. As a corollary, while normal execution of a statement such as "if (X<3) A else B" will run either A or B, since X is known exactly, an abstract interpreter holding abstract values as described will not always be able to decide whether A or B needs to be run and thus needs to run both. Similarly, instead of executing on actual objects and functions (functions being objects themselves in EcmaScript) as during normal execution, objects during abstract interpretation are represented as sets of abstract objects that themselves contain possible sets of properties rather than exact properties.
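As a small sketch of computing with such descriptions, an abstract version of an operator can map type descriptions to a type description (the string representation and the function name are assumed for illustration; the compiler's real abstract values are much richer):

```javascript
// Abstract result type of the EcmaScript + operator on basic type
// descriptions: String on either side yields String, two Numbers yield
// a Number, and anything involving "Anything" stays generic.
function abstractAddType(a, b) {
  if (a === "Anything" || b === "Anything") return "Anything";
  if (a === "String" || b === "String") return "String";
  if (a === "Number" && b === "Number") return "Number";
  return "Anything"; // other basic-type mixes are left generic in this sketch
}
```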
Figure 7 exemplifies abstract interpretation of a call to the CPS converted tripleAndAdd function presented earlier; the contents of state between each basic computation unit are shown as [Value1, Value2, …, ValueN]. It shows the equivalent of a call from CallSite1 to tripleAndAdd with postCont1 as a continuation representing the return position, a constant Number (2) sent as X and finally an unknown Number sent as Y. One can see how, with each instruction, the abstract interpretation propagates most values and changes some. By following the execution states it is observable how the constant is propagated: the compiler recognizes that the multiply function operated on constant values and produces the new constant 6; the following add instruction, however, also operates on an unknown quantity and thus produces a final unknown number.

Figure 7: Basic abstract interpretation

Even with a naive and basic abstract interpretation system as in the example above the compiler would generate optimal code. However, EcmaScript has a number of semantics that are troublesome to analyze, and one particular example is the semantics of the + operator, where the result type can be either a Number or a String depending on the input arguments. While it would be possible to represent this as (Number or String), we will use the notation Anything for sets containing multiple basic types, since a runtime type test for 2 basic types is usually as expensive as testing for all basic types. Figure 8 shows how this affects basic implementations by adding another callsite to our tripleAndAdd example. This second callsite is equivalent to the first except that the return continuation is postCont2, X is an arbitrary number and Y is a String. Since the executions of the two call sites merge at the entry of tripleAndAdd, the execution state from there on has to be merged.
First, the continuations are kept as sets of continuations (so that returns and throws can be executed on all relevant targets); secondly, in the X argument slot the constant Number{2} is merged with an unknown number to produce another unknown number; and finally, in the slot of the Y argument the passed Number and String are merged into Anything. Looking at the execution of the function, it can be seen that since the X argument is no longer constant, the multiplication will produce an unknown number. In the same fashion, since Y was Anything and the addition operator operating on Anything produces a Number or String (equivalent to Anything), the result of this operation, and of the entire function, will be Anything.

Figure 8: Problematic example with basic abstract interpretation

During normal execution, the cost of executing simple arithmetic operations such as additions with runtime type tests, needed because the exact types are unknown, usually more than halves performance compared to directly executing explicit CPU instructions when exact types are known; even if the exact penalty factor depends highly on actual implementation details and on the target CPU, there will invariably be a cost. So the just presented example, where Numbers and Strings merged to force Anything operations, would have made the tripleAndAdd function significantly slower. Even worse, since the output of the function became Anything, these values requiring type tests would propagate beyond the tripleAndAdd function itself, from the PostCall1 and PostCall2 positions to potentially unrelated locations in the program, slowing down not just this single function but possibly large swaths of the program. While finding an exact solution to type inference problems is subject to the halting problem, a variety of algorithms use different heuristics to find better solutions.
All of them will in various fashions duplicate code paths so that the individual generated code paths can run as quickly as possible, with fewer type tests required compared to the base scenario. Of these algorithms, this compiler is designed around the Cartesian Product Algorithm (CPA). The Cartesian Product Algorithm, originally defined by Agesen, is, simply put, defined as the Cartesian product between all possible types for every argument slot of every function. While that simple definition would use a very large amount of memory, the algorithm also specifies that building the product can be done lazily in a monotonically increasing fashion, since the algorithm only needs to consider those new combinations that are actually possible. In contrast with many other expansion based algorithms like n-CFA and IFA, the requirement that the set building is done in a monotonically increasing order is a key detail of the algorithm, since it makes it very stable in the sense that no decisions are reversed, and once the algorithm has filled in all reachable argument combinations it will terminate. The compiler's implementation of the CPA algorithm works on a per-operation basis (equivalent to functions in CPS terms); since in EcmaScript function calls, node fetches and even operators can produce multiple types, the analysis needs to work at a finer granularity than entire EcmaScript level functions. Also, objects and functions are given the same precedence as numbers in the compiler during abstract interpretation, so objects from different allocation sites cause disjoint analysis paths, since their contents can vary and the functions they target differ. In addition to this, the compiler modifies the monotonicity definition slightly to account for details in the EcmaScript language without breaking the stability; the modifications all involve the notion of merging specializations into supersets.
This is possible because merged values can never devolve back into a subset, and a superset will always accept everything defined in the subset it evolved from during the merge. There are three primary examples of how merging values is done in the compiler.

The first example is continuation values. Since once 2 paths have merged the same path will be followed, the compiler works with continuation sets rather than exact positions as with objects and functions. The result of 2 continuation sets meeting is always the union of those sets, and by that definition the union will always accept both of the subsets that were used to produce it.

The second example is numeric promotion. Most optimizing EcmaScript runtimes try to use 32-bit integers where possible, since they are usually faster than the IEEE 754 double precision floating point numbers that the EcmaScript specification defines as the Number type. The compiler also does this where possible, but if a Number and an Integer value cross paths the compiler must always merge the result to be a Number, since the possible values of double precision floats are a superset of 32-bit integers.

The third example is when the compiler has, during abstract interpretation, first detected a constant such as Number{0} for an operation (often found as the value of the first iteration of a loop) and subsequently encounters that same operation with another number; the compiler then merges the individual constant numbers into a generic number (the compiler could potentially compute bounded numbers, but that has not been implemented). From that point on, every time the compiler runs that operation again, any new number, constant or not, will match the existing generic Number definition. In this sense the merging can be seen as a generalization to infinite sets. This kind of infinite set generalization merge also applies to booleans and strings.
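The three merge rules above can be sketched as a single merge function over a simplified abstract value representation (the representation is assumed for illustration; incompatible basic types return null here to signal that CPA keeps those paths disjoint rather than merging them):

```javascript
function mergeAbstract(a, b) {
  // Rule 1: continuation sets always merge by set union.
  if (a.conts && b.conts)
    return { conts: new Set([...a.conts, ...b.conts]) };

  // Rule 2: Integer promotes to Number, since double precision floats
  // are a superset of 32-bit integers.
  var widen = function (t) { return t === "Integer" ? "Number" : t; };
  var ta = widen(a.type), tb = widen(b.type);
  if (ta !== tb) return null; // disjoint basic types: keep separate paths

  // Rule 3: equal constants survive promotion; differing constants of
  // one type widen to the generic ("infinite set") version of the type.
  if (a.const !== undefined && a.const === b.const)
    return { type: ta, const: a.const };
  return { type: ta };
}
```

With this sketch, merging Integer{2} with Number{2} yields a constant Number{2}, while Number{0} merged with Number{1} yields a generic Number, matching the walkthrough that follows.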
In Figure 9 we again look at the abstract interpretation of the tripleAndAdd function, this time using the Cartesian Product Algorithm and adding a third callsite that sends a constant Integer{2} as the X argument and another constant Number as the Y argument. During abstract interpretation, tripleAndAdd is first called from CallSite1, and this produces an initial path since no paths existed before it. After this, CallSite2 is encountered, and since Number and String are considered fully disjoint types during abstract interpretation the compiler produces a new path through the function (the compiler will try to merge paths if possible, but in this example the paths are disjoint at all positions). Lastly, the newly added CallSite3 is encountered; this time the compiler detects that, while not fully identical, the arguments passed here are compatible with the path produced by the earlier call from CallSite1. The compiler now begins to merge the new path with the old path to produce a new path that is a superset of the 2 previous paths. First, as per the first merge example, a union set is created of the 2 continuations, since both will follow this codepath. Secondly, as in the second merge example, the Integer constant passed from CallSite3 as argument X is promoted to a Number (double precision float); since both constants are now equal in type and value, the value remains a constant. Thirdly, the Y argument has 2 Numbers with different constant values, and as in the third merge example the superset is all Numbers. After the merge operation at the function entry point, the subsequent states produced by CallSite1 will match, apart from the continuation set that needs to be merged for all subsequent operations.
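The path creation part of this walkthrough can be sketched as CPA's lazily grown, monotonically increasing set of seen argument type combinations (representation and names assumed; the merging of compatible but non-identical calls such as CallSite3 is handled by the merge rules and is omitted here):

```javascript
// Each distinct tuple of argument types seen at a call becomes its own
// analysis path; a tuple is analyzed at most once, and the set only
// ever grows, which is what guarantees termination.
function TemplateSet() {
  this.analyzed = new Set();
}
TemplateSet.prototype.addCall = function (argTypes) {
  var key = argTypes.join("|");
  if (this.analyzed.has(key)) return false; // combination already covered
  this.analyzed.add(key);
  return true; // new path: run abstract interpretation for it
};

var paths = new TemplateSet();
paths.addCall(["Number", "Number"]); // CallSite1: true, first path created
paths.addCall(["Number", "String"]); // CallSite2: true, disjoint second path
paths.addCall(["Number", "Number"]); // repeat: false, nothing new to analyze
```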
Figure 9: Abstract interpretation with the Cartesian Product Algorithm

Compared to the problematic behavior shown in Figure 8, where the naive merging of Numbers and Strings would have produced slow code paths due to the runtime tests associated with the catch-all Anything type, the abstract interpreter using the CPA algorithm as shown in Figure 9 analyzed CallSite1 in a way equivalent to the simple case first presented in Figure 7, giving exact type information that could be used to produce optimal code. Also, as in the simple case, no slow generic type would escape the tripleAndAdd function to slow down other areas of the program, as happened in the problematic example with naive interpretation. As a side note, while it might be possible to get even faster code by not merging the Integer and Number paths, doing so would increase analysis time and would require a backend that can detect overflows of integer additions, something that a pure C backend is incapable of doing without performance penalties.

4.5 Nominalization of Abstract Structural Types

The output produced by the flow analysis is in a very abstract and verbose form. This means that while all the data is necessary to correctly analyze the program, the data about the program is orders of magnitude larger than the data actually needed to produce executable code. So to make the data from the analysis useful for actual code generation, two different processes are applied to the resulting data to make code generation feasible and efficient. The abstract interpretation handles objects in what is called a structural form, which only contains information about what properties exist in objects but holds no data about how the objects are actually laid out in memory after compilation.
While the structural form greatly simplifies abstract interpretation, information about how all objects are laid out in memory is required by the compiler to produce optimized code. Nominal types are named types as used in languages like C++, Java, C# and so on; usually these types have specific layouts that make it easy for the compiler to generate fast property accesses, since the layout information is available to the compiler and ends up as an offset for pointer accesses. Nominal types are also used in hierarchies, where the types have a specific layout and all child types have a memory layout identical to their ancestor types for all shared properties. So to generate code as efficient as that of the above mentioned languages, compilers of dynamic languages have to find ways of mapping property accesses to accesses on nominal types with explicit memory layouts. Other compilers using CFA for analysis build nominal types during the abstract interpretation stage. The Starkiller compiler uses something that could be called instance re-flowing, where allocation sites contain the object shape, and each time a new property is introduced or generalized the object at the allocation site is changed and reanalyzed. The Shedskin compiler on the other hand uses the Iterative Flow Analysis (IFA) method on objects. The IFA method works by first running a simple flow analysis with one live empty object type initially; after this, the algorithm checks for incompatible object usage by finding unknown properties or incompatible property types, like the Number and String usage described in the previous chapter. If any incompatible object usage is detected, the live objects in the analysis are split to remedy the incompatible uses, and the analysis is run again until no more incompatible uses are found.
While the Shedskin approach can give the highest potential performance, it will fail to analyze a number of programs due to the sacrifices made in terms of compatibility to achieve optimal performance. Starkiller is similar to the compiler described here, since both use allocation sites as the primary object identifier during analysis; however, in cases where polymorphic accesses can occur Starkiller seems to fall back to expensive runtime property searches.

4.5.1 Path Compression

The first stage of the nominalization process is to compress the control flow description into simpler, more generic terms better suited to actual code generation. To accurately analyze code, the abstract interpretation needs to keep separate paths for each function and object site that passes through a certain state variable, in order to accurately judge their usage or apply all possible destination functions, as illustrated on the left side of Figure 10. However, for real code all these cases would generate identical code, passing the value in a reference at the same location, as illustrated on the right side of the figure; thus all such simple cases would end up as identical code in the compiled binary, taking unnecessary space, if the compiler did not compact the abstract paths.

Figure 10: Compression of redundant paths for code generation

While compressing only this trivial case can reduce code size by orders of magnitude, the code size would still correspond more to the abstract analysis size than to the input code size, and produce output binaries with sizes fairly unpredictable to a developer, due to the primitive types such as numbers, booleans and references being disjoint in the analysis stage. Since a hypothesis set up during the construction of the compiler is that performance critical paths will end up with very little polymorphism when the developer hand tunes code for performance, the compiler can make some assumptions to better match the number of paths to the input code size.
The algorithm implemented to compress the paths iteratively collects all possible execution paths for each built output function and then produces a fixed number of optimistic optimal paths for each operation. The optimal paths are selected as those that will yield the highest performance, by favoring cases where operations use numbers and booleans over cases where operations use objects and generic types. After selecting the optimal paths, the algorithm defers all other abstract code paths to a final generic code path that must be capable of handling all possible cases found by the compiler. If the code always follows the optimal paths selected by the compiler, performance should be close to the theoretical optimum; the worst case scenario of this approach is that the generic path is active for all code execution, but even for the generic path the compiler will generate fairly fast code in many cases, due to how programs are usually written. One could theoretically engineer specific cases that slow the compiled code down to a level similar to interpreters, but such cases should mostly be synthetic in nature.

4.5.2 Object Reification

The second stage of the nominalization process builds real object descriptions in nominal form, to be able to capture multiple abstract objects that in reality are equal. Figure 11 first illustrates 3 objects in an array; the first 2 objects have only the property x written to them, with an integer value, and while they would be separate objects during abstract analysis, it is trivial for the compiler to identify them as identical and compact the first 2 objects into the same real object type during this reification phase.

Figure 11: Polymorphism with differing object shapes

The last object in this example, however, is not identical to the 2 first objects, yet all of them are allocated into the same array and will thus all flow into the same code paths.
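The Figure 11 scenario can be sketched as follows (a reconstruction; the figure's exact code may differ):

```javascript
// Two objects with only x, and a third with x and y, all flowing into
// the same array and the same loop body.
var arr = [{ x: 1 }, { x: 2 }, { x: 3, y: 4 }];
for (var i = 0; i < arr.length; i++) {
  arr[i].x += 1; // all three objects have x accessed at the same location
}
```

Because all three objects have their x property used at the same site, the reification stage can give them one shared base layout for x, with the third object's y living in a subtype, as described next.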
The situation where similar but not identical objects flowing into the same code paths is not entirely uncommon, and such objects are then very likely to be used at the same locations in many other places as well.

The algorithm used by the compiler to create nominal types analyzes the code and computes confluence scores for property groups, based on how many times particular properties are used by different objects at the same locations. The algorithm then picks the property groups with the highest confluence scores as the basis for creating new nominal types to represent the structural types when generating the actual nominal object types. The 3 objects in Figure 11 would produce a property group for their x properties with a high confluence score, due to their common increment inside the for loop, and the algorithm would therefore place them in the same nominal type. The last object, with the additional y property, would then introduce another property group for that y property, and the algorithm would subsequently create a subtype to hold this y property and move the mapping of the third abstract type to point to this new subtype. In practice this means that the compiler can generate a direct memory access to the field x at the same location in memory for all 3 objects, since they all belong to the same hierarchy, and thereby speed up execution by not having to query the object for the position of x, since all 3 objects have identical object layouts as far as the property x is concerned. A Java analogue of how the generated code would look is listed in Figure 12.

Figure 12: Java analogue of the reified object layout

4.6 Code Generation

Writing efficient low level code generators takes a nontrivial amount of time, and the main goal of the project was to investigate the feasibility of producing efficient code by removing the most expensive high level runtime tests with the help of type inference.
As such, C was selected as an intermediary target, since there exist very high quality C compilers that have had many man years of work put into producing efficient code with low level optimizations. Selecting C as the target also gives a degree of platform independence practically for free, without having to invest time into separate code generators for the different target architectures. Another big benefit of selecting C as the intermediary language is that writing operating system bindings for various functionality is simplified, since the C code snippets used to do the interfacing can be embedded directly into the JavaScript code. While C is the current back end code generator, most of the heavy lifting to enable optimizations is done in the previously described target independent parts, and as such writing other backends would be a much smaller job than starting from scratch.

The code generator in the compiler is a relatively straightforward system that generates C representations of the nominal types and builds functions based on the compressed control flow. The generator is also tuned to produce fast code over compact code: for example, all objects get the same relatively big virtual dispatch table (vtable), with all unknown properties and invalid function shapes stubbed out with fault checking code to catch compiler bugs. The big vtable makes generic cases requiring polymorphic property accesses and function calls faster, since the compiler can dispatch on all types without any runtime type detection other than using the vtable supplied by the object itself. However, in all cases where the compiler has inferred that, for example, a property access is monomorphic (i.e. has only one target object type), it will skip the vtable call and emit a direct C property access.

While targeting C as an intermediate language gives good results quickly with relatively little complexity, it also comes with a couple of drawbacks.
The first drawback is that integer overflow checks cannot be done quickly in a portable way. A portable implementation needs to produce int64 results from int32 values, and this can slow down 32 bit code in addition to the cost of the actual tests. Non portable C can rely on compiler intrinsics, but would need to retain the portable variants as a fallback. The second drawback is that C provides no standardized way to walk the data in the stack frames of a call chain. This means that to implement exact garbage collection the compiler must keep track of all object references with runtime code, instead of relying on a stack walking function to find object references. While a conservative garbage collector could be used, exact collectors often have benefits in terms of latency. The third drawback is that a C based implementation cannot return to different positions in the calling function; such a feature could potentially be used to target different code return positions depending on the computed type, instead of passing type information in runtime values.

5 Performance Evaluation

The performance evaluation is purely focused on benchmarking computational performance. Other metrics could be used to measure how efficiently the compiler and/or runtime is able to analyze and create runnable code, such as the number of successfully removed type tests, garbage collection latency and the like. Those factors are however either not relevant (the benchmarks are small enough that the compiler removed 100% of the type tests) or outside the defined scope (garbage collection is a field of its own, and all benchmarks are designed to avoid having memory pressure as an active factor).

5.1 Benchmarking Setup

Correctly benchmarking code performance is a large subject of its own and can sometimes focus on the wrong things (Ricards 2010) (Ratanaworabhan 2010).
However, since the produced compiler is far from finished, the benchmarking focuses on a couple of very small micro benchmarks designed to measure the general execution cost of particular features and optimizations present in the compiler. These benchmarks are executed in a variety of configurations and variations to give a measurement of the performance of the compiler relative to systems with similar characteristics. Where possible an effort has been made to minimize timing overhead to give as accurate results as possible, but for differences on the order of a magnitude this was sometimes ignored, since the timing overhead becomes a minor factor in comparison to the system at large. A mechanical script (using Node.JS) was used to run the benchmarks and collect the data without user intervention during the benchmarking phase. The benchmarks were run 5 times for most tested runtimes and compilers on each platform. An exception was made for the very slow interpreters on the ARM platforms, which only had 2 runs each; this does not significantly affect the outcome, as the interpreters with only 2 runs were more than a magnitude slower, and the conclusions drawn are thus unaffected by the slightly lower accuracy for the slow tests. After running the benchmarks, an average of the running times was taken and used for the presented results.

5.1.1 Benchmarked Factors

There are three major factors that play a part in determining the performance of an optimized EcmaScript environment compared to a basic interpreter. They are presented here in increasing order of compiler sophistication.

The first factor is the interpretation overhead itself: the cost of having code to determine the actual operation to be executed. An interpreter needs to find the next operation code, go to the code, find the operation parameters and finally execute the actual needed code.
Compiled code on the other hand only needs to execute the actual operation code, and in addition the code can be tailored to the parameters instead of having to be generic in nature. A subset of this problem is the overhead or limitations of the selected target language (such as C or JVM bytecode).

The second factor is dynamic typing, in the sense that varying basic types require operators to be executed in different ways. As an example, if 2 values are known to be numbers the compiler can generate a native machine add instruction. If the types are not known, a short additional instruction sequence is needed to determine the type, branch, and then execute the actual operation for the specific type.

The third and final factor is dynamic property accesses due to hard to predict object shapes. If object shapes are known, property accesses are just indexed loads and stores. Unknown property accesses on the other hand can be quite complicated, all depending on the memory model used. The compiler uses its ahead of time knowledge to optimize object shapes; JIT compilers, without ahead of time knowledge, usually rely on various inline caching schemes instead. While not implemented, the compiler could also be adapted to utilize inline caching for particular cases where the ahead of time analysis is unable to predict the object shape.

Giving an exact overhead for all these factors is impossible, as the overhead always depends on how the features are implemented. In general, since multiple instructions are often required to do the work of an individual instruction in the optimal case, the penalty can be expected to be at least 2-3x for each factor, and often worse.

5.1.2 Used Benchmarks

There are a total of 7 benchmarks, divided into three sets of functionally similar examples differing mostly in details, used to identify the impact of the individual performance factors.
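Before going into the individual benchmarks, the dynamic typing factor described above can be made concrete with a hedged JavaScript sketch of what a generic '+' conceptually has to do when the operand types are unknown (a simplification of the full EcmaScript semantics, written here as plain JavaScript rather than generated code):

```javascript
// Sketch of a generic dynamic add: with unknown types a runtime must test
// and branch before the operation, whereas a compiler that has inferred
// "number" for both operands can emit a single machine add instruction.
function addDynamic(a, b) {
  if (typeof a === "number" && typeof b === "number") {
    return a + b; // fast path: one native add once compiled
  }
  if (typeof a === "string" || typeof b === "string") {
    return String(a) + String(b); // string concatenation path
  }
  return Number(a) + Number(b); // generic numeric fallback (simplified)
}
```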
The first set of benchmarks (001_fib and 002_fibco) are variations of a regular recursive fibonacci calculation function. The fibonacci micro benchmarks stress function call overhead and the performance of arithmetic operations without involving objects or the garbage collector, and can be used to get a baseline of how fast code a compiler can produce in ideal cases. The fibonacci function is timed with a parameter of 35, leading to roughly 30 million function calls, 30 million comparisons, 30 million subtractions and 15 million additions. 001_fib is a straight application of the mathematical formulation, while 002_fibco is written with integer coercions; the straight application in 001_fib gives an EcmaScript environment no hints about how to handle the numbers, while the coercions in 002_fibco force all numbers in the computation to be integers, helping compilers and runtimes produce better code if tuned for this characteristic (the Asm.JS subset of EcmaScript relies on these coercions for the validity of its subset model).

The second set of benchmarks (003_cplx, 004_cplxo and 005_cplxpo) are variations of computations simulating complex unit number multiplications. These benchmarks have no function call overhead and all operate on floating point numbers. All of them simulate rotating a complex unit vector by multiplying it with another complex unit vector a set number of times (100 million times for 003_cplx, 50 million times for 004_cplxo and 10 million times for 005_cplxpo). The main difference between these benchmarks is what they operate on. The first in the set, 003_cplx, does all arithmetic on local variables and has no overhead apart from the arithmetic operations themselves; it is thus an even better indicator than the fibonacci samples of how much impact dynamic type tests have on arithmetic operation performance, since it lacks any function call overhead.
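A hedged reconstruction (not the thesis's actual listing) of the kind of inner loop a benchmark like 003_cplx times: a complex unit vector rotated by repeated multiplication, using local variables only.

```javascript
// Rotate the unit vector (re, im) by the unit vector (rotRe, rotIm) a
// fixed number of times; all state lives in local variables, so the only
// cost is the floating point arithmetic itself.
function rotate(iterations) {
  var angle = 2 * Math.PI / 360;
  var rotRe = Math.cos(angle), rotIm = Math.sin(angle);
  var re = 1, im = 0;
  for (var i = 0; i < iterations; i++) {
    var newRe = re * rotRe - im * rotIm; // complex multiplication
    im = re * rotIm + im * rotRe;
    re = newRe;
  }
  return re * re + im * im; // squared magnitude, stays ~1 for unit rotations
}
```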
The second in the set, 004_cplxo, is almost identical to the first one but stores the complex numbers in objects rather than in local variables, and thus indicates how much predictable object accesses impact performance in addition to the dynamic typing (or the removal thereof). The last one in the set, 005_cplxpo, is almost identical to 004_cplxo but puts more strain on the runtimes by sending in differently shaped accumulator objects, to test the performance impact of polymorphic objects.

The last set of benchmarks (006_array, 007_arrayco) contains a numeric summation function run as a loop. Like the complex set, this benchmark set has no function call overhead and instead tests the efficiency of array storage and array accesses. Arithmetic operations also play a large role in this benchmark, but array accesses take the majority of the performance impact if the array operations themselves are slow. As in the fibonacci benchmarks, the difference between 006_array and 007_arrayco is the presence of integer coercions, which in this case can play an even more significant role since the number being coerced is also used as the array index.

All the benchmarks were specifically designed to minimize the impact of JIT compilation pauses and garbage collection pauses by doing any needed setup and then running five warmup iterations of the benchmark with identical parameters before starting the clock and timing the functions.

5.1.2 Compared Systems

The compared systems are listed here, together with the identifying abbreviations used in the benchmark charts. The compiler described in this thesis is compared twice for each benchmark: once with property accesses inlined and once with inlining disabled. This is to gauge the general performance impact of fast property accesses compared to dynamic property accesses. In the charts these are noted as (cm_in) for the inlined variants and (cm_no) for the variants with no inlining.
The three major open source JavaScript runtimes with built in JIT compilers, V8, Spidermonkey and JavaScriptCore, were all benchmarked to give information both on theoretical high end performance and on the relative performance of interpreters.

The V8 runtime that drives Chrome and Node.JS was benchmarked with node 0.12, running with the baseline compiler only (v8no_b) and also with the optimizing Crankshaft compiler enabled (v8no_c). On one platform 2 different versions of V8 were compared instead, due to a performance regression when testing with a newer Node.JS environment; the tested versions were the V8 versions bundled with Node.JS v0.6 (v8n6) and Node.JS v0.12 (v8n12).

The Spidermonkey runtime used in Firefox is also included in the tests. Spidermonkey is divided into 3 tiers with increasing performance. First comes the interpreter, normally used to run code that is seldom executed, like initialization scripts (sp_in); for code that is executed a few more times, the second tier is a baseline compiler (sp_ba) similar to the baseline compiler found in V8; finally, the 3rd tier is called Ionmonkey and is the compiler level with most optimizations enabled (sp_fa).

The third major open source JavaScript runtime, JavaScriptCore, was not tested on all platforms due to difficulties building it. Like Spidermonkey it has 3 performance tiers: an interpreter (jsc_i), a baseline compiler (jsc_j) and an optimizing DFG compiler (jsc_d). A fourth tier called FTL was under development but was not tested due to the building problems mentioned.

Three other less known open source JavaScript runtimes were also tested. The old Java based Rhino (rhino) runtime and the new Java 8 based Nashorn (nasho) runtime were tested; the Java based runtimes are included since they generate JVM bytecode, which places restrictions on the generated code similar to those of a C backend, and thus provide some hints on how well dynamic code can perform under those kinds of restrictions.
And finally, the Duktape (dukt) interpreter was tested, since it has recently become a popular runtime for embedding.

In addition to the public runtimes, a couple of special tests were added. First, C variants of some of the tests were written (ccref). The separate C benchmarking gives a rough bound on how fast it is theoretically possible for a benchmark to run without any JavaScript overheads. Additionally, a secondary simple EcmaScript to C++ translator (cmref) was developed alongside the compiler that has been described. While the public JavaScript runtimes might have extra overhead from interpretation in addition to the dynamic type tests, this secondary EcmaScript to C++ translator gives a good isolated view of the overhead produced by dynamic runtime type tests on the two benchmarks it was able to compile.

5.1.3 Compiler and Machine Variations

The choice of C as an intermediate language means that the runtime performance of the generated code depends on the efficiency of the C compiler used to compile the code for the native machine. Beside C compiler differences, the characteristics of the target architecture and even the individual cpu can also impact the performance of the produced code. The compiler selection has varied a bit depending on the available platforms, but in general the relative performance has not varied too much, and the trends have been consistent enough that shallow conclusions can be drawn from the results.

The main part of the testing was done on an Intel laptop (Core i3-2330M CPU at 2.2 GHz) running Windows, with the programs compiled by Visual C++ 2013 running in 32 bit mode. While GCC can in some cases generate faster code, Visual C++ was used for the default comparison numbers due to more consistent numbers and the fact that Visual C++ is the default compiler for most Windows projects due to better availability of API headers.
Secondary testing machines were a Raspberry PI Model B (the first model) running Raspbian and a Raspberry PI Model B2, also running Raspbian. The older Raspberry PI B with a 700 MHz ARM11 cpu is comparable to the older and cheaper low end mobile phones of 2015, while the newer Raspberry PI B2 with a 900 MHz Cortex cpu should be more comparable to the mobile phones popular with consumers in 2015. The code tested on the Raspberry machines was compiled with the Raspbian bundled GCC compiler (version 4.6.3-14).

5.1.4 Benchmarking Variability

To get stable results, the Raspberry PI B2 had all but one core disabled and the Linux kernel “performance governor” enabled to avoid cpu throttling variations. The older Raspberry PI and the Intel laptop provided relatively stable results when most concurrent processes were closed down. While effort was made to get stable data, anomalies cannot be ruled out, as all benchmarks were run on multitasking operating systems. For example, in one instance 2 of the mechanically produced numbers were removed for one compiler (clang) since they were unreasonably high, indicating that a concurrent process had been active. The cmref numbers for the 003_cplx benchmark on the win32 platform were also added manually later from a secondary run, since the initial run was made without them; the numbers should be correct in terms of relative performance, as the same preparations were made when running both tests.

Another example of variability is how the C reference code ended up slower than the compiler output on 003_cplx in the win32 test. Closer inspection puts the most probable cause as the printf statements being inside the timing block rather than outside it in the C code: more or less consistent 10 ms differences were detected in all the C samples on multiple platforms, indicating that context switching latencies could be the culprit.
The EcmaScript samples do not share this timing flaw, and the differentials do not significantly affect the relative performance measurements apart from in the 003_cplx sample.

On inspection of the win32 results, a mean error of 1.6% was calculated. This was done by first dividing the mean difference of each benchmark target value set by the mean value of the same set, to get a benchmark target specific error, and then taking the mean of those errors. In general the faster runtimes had far smaller errors, but the mean is inflated by the slower runtimes having very variable running times.

5.2 Benchmarking Results

The full results of the benchmarking runs are available in the benchmark results appendix (Appendix C); what follows is a number of charts summarizing the results of the various benchmark runs. All chart numbers in this chapter refer to running time in milliseconds, and lower results are better since higher performance leads to reduced running time.

Figure 13: Win32: compiled code execution times (lower is better)

Figure 14: Raspberry Pi B2: compiled code execution times (lower is better)

Figure 15: Raspberry Pi B1: compiled code execution times (lower is better)

Figures 13, 14 and 15 show the performance of the code generated by the
compiler with (cm_in) and without (cm_no) inlining enabled on the different benchmarks. In addition, reference numbers for the clean C samples (ccref) and the secondary simpler compiler (cmref) are included to discern the overheads leading to lower performance.

003_cplx is designed to measure only numeric computation performance and has no function call or property access overhead. On this benchmark the main compiler produces code with a running time that is more or less identical to the C reference sample, as the compiler is able to infer the numeric operations. The simple reference compiler (cmref) however only does simple local expression type deductions and is thus only able to remove a few type tests where the output is guaranteed to be numeric, otherwise resorting to expensive runtime dynamic typing operations. The penalty for the dynamic type tests differs significantly between platforms: roughly 5 times longer execution time on the modern Intel laptop, slightly higher on the Raspberry Pi B2, and a staggering 17 times longer execution time on the older Raspberry Pi B1. The exact cause of the differences isn't clear, but in general it seems that more advanced cpu designs somehow handle dynamic type tests better.

The benchmarks 004_cplxo and 005_cplxpo are almost identical to 003_cplx but operate on values stored in objects instead of local variables. While 003_cplx correctly showed identical performance for the compiler regardless of the property inlining flag, the 004_cplxo and 005_cplxpo benchmarks show a dramatically longer running time with inlining disabled. While the 006_array and 007_arrayco benchmarks do a different computation, summing array values instead of complex rotations, the running time difference due to inlining can also be clearly seen in these benchmarks.
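A hedged reconstruction of the kind of loop a benchmark pair like 006_array/007_arrayco times (names and details are assumptions, not the thesis's actual code); the |0 coercions in the second variant pin the index, the loaded element and the accumulator to integers, Asm.JS style:

```javascript
// 006_array style: plain summation with no type hints.
function sumArray(arr) {
  var sum = 0;
  for (var i = 0; i < arr.length; i++) {
    sum = sum + arr[i];
  }
  return sum;
}

// 007_arrayco style: integer coercions guarantee that the index and the
// sum stay int32, which also makes the array access itself cheaper.
function sumArrayCoerced(arr) {
  var sum = 0;
  for (var i = 0; (i | 0) < (arr.length | 0); i = (i + 1) | 0) {
    sum = (sum + (arr[i | 0] | 0)) | 0;
  }
  return sum | 0;
}
```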
While the 001_fib and 002_fibco benchmarks do show some performance differences due to inlining and dynamic typing, those differences are much smaller than in the other benchmarks; instead these benchmarks show a not insignificant gap in running time to the pure C samples. On inspecting the code generated by the compiler compared to the clean C samples, two things stood out. The first is the shadow stack that exists to allow exact garbage collection. While this could be removed by opting for a conservative collector, doing so could also be viewed as cheating, as a full blown runtime benefits greatly, in terms of reduced runtime latencies, from having accurate garbage collection semantics. The second is that method dispatching is always done dynamically through object references in the syntactic closures. This dispatching is in turn visible in three ways in the code: first in the management of the closure objects, secondly in the property accessors and thirdly in the method dispatch itself. The property accessors are what make the inlined benchmarks faster than those without inlining. The dispatching and closure management could potentially be optimized away, but those optimizations are not implemented in the compiler.
Figure 16: Win32: execution time in comparison to interpreting runtimes (lower is better)

Figure 17: Raspberry Pi B1: execution time in comparison to interpreting runtimes (lower is better)

Figure 18: Raspberry Pi B2: execution time in comparison to interpreting runtimes (lower is better)

The compiler is targeted at compiling code for environments that for various reasons cannot use JIT compilers and are thus restricted to precompiled or interpreted code. Figures 16, 17 and 18 show the results of the code generated by the compiler compared to interpreters. The code generated by the compiler is roughly a factor of 10 to 500 times faster than the interpreters. If the call heavy fibonacci benchmarks are omitted, the compiler comes out even further ahead, at roughly 20x compared to the best interpreter, which was the assembly optimized JSC interpreter.
Figure 19: Win32: execution time compared to JIT compilers (lower is better)

Figure 20: Raspberry Pi B1: execution time in comparison to JIT compilers (lower is better)

Figure 21: Raspberry Pi B2: execution time in comparison to JIT compilers (lower is better)

Figures 19, 20 and 21 show the performance of the code generated by the compiler in comparison to the open source runtimes with baseline JIT compilers and optimizing JIT compilers enabled. In this high performance context it can be seen that the code produced by the compiler performs roughly on par with the results produced by the optimizing open source runtimes, worse in some cases but better in others. While the compiler usually fell behind in the benchmarks without coercions, the optimizing runtimes on the other hand had roughly identical results regardless of the presence of type coercions, since the JIT runtimes handle numbers as integers, with promotion to double precision floats only after overflow checking.
Figure 22: Win32: execution time compared to JVM runtimes (lower is better)

Figure 23: Raspberry Pi B2: execution time compared to JVM runtimes (lower is better)

Figures 22 and 23 show the performance of the compiler compared to the JVM based runtimes. While the JVM runtimes are at a disadvantage, since they are built for full compatibility, they share some limitations with the compiler when it targets C code. The limitations shared by the C and JVM backends show how much performance can be increased by a compiler inferring types ahead of time.

Figure 24: Win32: execution time differences with different C compiler backends (lower is better)

Figure 24 shows the relative performance of using different C compiler backends to compile the C code generated by the EcmaScript compiler into machine code. On win32, the Microsoft Visual C++ compilers from Visual Studio 2005 and 2013 (compiler versions 14 and 18), Clang 3.6.0 and GCC 4.9.2 were tested. The tests showed somewhat mixed results, with GCC in general edging out Clang and the 2013 Microsoft compiler, apart from one benchmark where some problem made GCC perform much worse than the other compilers.
That particular benchmark also showed problematic performance with the Microsoft compilers if one compiler switch was omitted (/QIfist). This compiler switch forces the rounding mode of double to integer conversions to a particular value, and exists because the Windows C ABI does not guarantee that the cpu's rounding flags are properly set at all times; it is not far fetched to believe that the anomaly in the GCC numbers is due to the same problem.

6 Language Subset Evaluation

6.1 Feature Questionnaire

Since the compiler makes certain deviations from the EcmaScript standard, a questionnaire was created and posted on various public forums to gather information on whether the assumptions made during compiler design hold up against how developers write EcmaScript in reality. The questions in the questionnaire are listed in Appendix D, but a brief summary is provided here. The questionnaire starts with a few basic questions about which environments the respondents have experience with. The second set of questions goes into detail about the respondents' usage of dynamic code with respect to eval and modules, to gauge whether the compromises and workarounds possible with the compiler are enough to satisfy developers' real world usage of JavaScript. The third set of questions concerns the details of how the compiler deviates from the standard when it comes to handling undefined variables. The fourth set is a single question regarding the optimizations of the compiler in terms of object shaping and their effect on property lifetimes. The fifth set is a question regarding the handling of numerical optimizations and how the convenience and compatibility concerns impact programmers. The sixth and final set is yet another single question, this one detailing how developers use regular EcmaScript objects the way hashmaps would be used in languages like C++ and Java; this question is highly relevant since that usage is believed to be common while also complicating optimizations.
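The object-as-hashmap pattern the final question asks about can be sketched as follows (an illustrative example, not taken from the questionnaire; Map, from ES2015, stands in for the "containers" alternative):

```javascript
// A plain object literal used as a string-keyed map: keys appear at
// runtime, so no fixed object shape can be inferred ahead of time.
var wordCount = {};
var words = ["to", "be", "or", "not", "to", "be"];
for (var i = 0; i < words.length; i++) {
  var w = words[i];
  wordCount[w] = (wordCount[w] || 0) + 1;
}

// The same mapping with a dedicated container keeps object shapes static
// and confines the dynamic behavior to the container itself.
var wordCount2 = new Map();
for (var j = 0; j < words.length; j++) {
  var k = words[j];
  wordCount2.set(k, (wordCount2.get(k) || 0) + 1);
}
```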
6.2 Questionnaire Results

20 responses were gathered by posting the questionnaire on various forums and social media. Some respondents reported having problems understanding the questions due to their technical nature. While the response set is relatively small and shouldn't be used to draw definite conclusions, especially as some of the questions might not have been correctly understood, the responses still give a preliminary indication of the usefulness of the subset defined for the compiler.

17 of the respondents mainly used normal JavaScript, 2 of the respondents used TypeScript and 1 response was given by a developer mainly using Haxe but reporting on his JavaScript experiences. On top of using the various browsers, 8 out of the 20 respondents had used some form of mobile porting toolkit in the past.

A clear majority, 16 out of 20 respondents, did not use eval in their code. Half of the respondents used other dynamic code loading mechanisms. An anomaly occurred in the question about developers being able to replace eval with other mechanisms, as more respondents answered it than had indicated that they used eval, possibly due to the order of the questions or poor phrasing. Nevertheless, of those that had answered the eval question positively, half indicated that they could replace eval with something else; counted together, that would mean 18 out of 20 respondents indicated that they are able to work easily in the absence of eval.

A little over half of the developers, 11 out of 20, answered that their code should be unaffected by changes to the behavior of undefined. When questioned on the details of the changed behavior, that answer did not fully correlate with them expecting their code to behave identically; the number of developers who did not seem to depend on undefined did however remain the same.
However, most of the developers, 14 out of 20, seemed to be in favor of making the compiler fault on undefined usages, due to the error-prone semantics introduced by the behavior of undefined that can create hard-to-find bugs. Half of the developers (10 out of 20) thought their code would behave differently if fields existed ahead of time. Most of the developers (13 of 20) claimed that their code would not be affected by integer overflows, i.e. letting integers overflow instead of being automatically promoted to double precision floating point values. However, when asked whether they would want the compiler to default to a similar mode of operation, only 1 out of 20 developers answered positively. 7 of the 20 respondents would have preferred a manually enabled mode that disabled integer overflows, while 10 of the 20 preferred to use Asm.JS-style type coercions as tested in the 002_fibco and 007_arrayco benchmarks. 2 of the respondents did not know or seemed to misunderstand the question. A majority, 16 out of 20 respondents, used object literals exclusively for dynamic mappings, 1 used containers and 2 used both object literals and containers. One response was "random", which probably meant both, but no clarification was given.

7 Discussion

The main motivation for constructing the compiler was to provide a usable high-performance environment for developers building games in JavaScript, enabling the production of more advanced games without being limited by the performance of interpreters. While the compiler is incomplete, the runtime performance for the initial well-typed benchmarks shows great promise, with performance exceeding what interpreters are capable of by more than an order of magnitude even in the worst cases.
In fact the performance of the compiler is roughly on par with contemporary leading open source optimizing JIT compilers in most cases, being better or worse depending on the particular benchmark. Compared to JIT compilers, the full program analysis employed in the compiler has both benefits and drawbacks. The main benefit, as shown by the performance wins in the 005_cplxpo benchmark, comes from the fact that the compiler, with ahead-of-time knowledge of the properties in an object, is able to produce a fixed object layout; JIT compilers on the other hand have to rely on polymorphic inline caches that perform increasingly badly as polymorphism increases. The compiler cannot handle an arbitrary level of polymorphism without also suffering, but the heuristics do seem to provide some mitigation unavailable to a JIT compiler, as shown in the 005_cplxpo benchmark. There is, however, no denying that there are drawbacks compared to a JIT compiler. Full program analysis is not free: the author of the Shedskin Python compiler reports a 2 minute compilation time for a 3000 line program (Dufour 2011). The Shedskin compiler does use a slightly more expensive algorithm than the compiler described in this text, but with ever increasing project sizes analysis time and complexity have to be kept in check. Furthermore, certain concessions exist in the performance of the generated code in order to keep down analysis complexity, the main one being that the compiler handles most numeric operations by automatic promotion to double precision numbers. This is because if the compiler were to handle integer overflows as promotions, as a JIT compiler would, every numeric operator instance would double the complexity of the analysis for the current method and spill some complexity increases over into other methods as well.
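To make the trade-off concrete (an illustrative node.js snippet, not one of the thesis benchmarks): in standard EcmaScript semantics an integer result that exceeds the int32 range silently becomes a double, whereas an Asm.JS-style |0 coercion pins the result to a wrapping 32-bit integer that a compiler can keep in a machine register.

```javascript
var max = 2147483647;       // largest 32-bit signed integer
console.log(max + 1);       // 2147483648 -- silently promoted to a double value
console.log((max + 1) | 0); // -2147483648 -- wraps around like a C int32_t
```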
To put this into context, 5 independent variables being operated on at the start of a method would create a 32-fold increase in analysis complexity for the rest of the method. In the benchmarks, the performance penalty of using doubles in most places instead of integers can be seen by comparing the running times of the 006_array and 007_arrayco benchmarks, where the index being converted from a double to an integer makes the compiler handle the first benchmark much worse, while being on par with the optimizing compilers on the second one. To some degree this effect can also be seen in the 001_fib and 002_fibco benchmarks, even if the performance penalties are masked to some degree by the function call overhead. These performance penalties reflect the results that Agesen and Hölzle (1995) had in their comparison of static and JIT compilation of the Self system. Future development of the compiler should first and foremost be directed at analyzing bigger programs to give a better understanding of how expensive the type inference algorithm is. Despite the compiler being designed to avoid pathological cases during analysis, a better investigation should be done once bigger programs are fed to the system, to see if there are common cases and patterns that would make code impossible for the compiler to analyze. Once a better analysis of the actual analysis complexity has been done, revisiting integer number optimizations without the presence of type coercions should be investigated more carefully in real-world settings. The usage of C as an intermediate target has worked very well when developing the compiler, by providing stable low level code generation facilities and thus letting the compiler focus on high level optimizations.
Additionally, the C code generated by the compiler has been compiled to native programs and executed on ARM, X86-64 and X86-32 with hardly any extra work outside of a few macros and #ifdefs to avoid deprecated header files (for the C alloca function) and operating system timekeeping functionality. Some of the potential drawbacks of C as an intermediate target were discussed in the code generation chapter. While the integer overflow promotion was partially avoided to keep down analysis complexity in the compiler, the lack of standardized support in C compilers for overflow detection during integer operations was another factor in the decision not to implement it. The lack of stack frame walking for C compilers manifested itself in a significant function call overhead; had this been available, the shadow stack maintenance in the generated code could have been omitted. Manual editing of the code generated for the fibonacci function showed that roughly 20% of the execution time could have been removed by not maintaining the shadow stack; while 20% is a significant number, one has to remember that fibonacci is a very call-heavy benchmark that does not necessarily reflect real-world code. As a potential replacement for the C backend, the LLVM documentation was investigated. The LLVM IR has extensions for handling the above-mentioned integer overflow code, and LLVM also contains some standardized garbage collection (GC) support routines (LLVM 2015). With these points resolved LLVM seems like a worthwhile target; in practice, however, these points might be moot. Firstly, the integer overflow extensions available in LLVM are also available to C programs with the Clang compiler (newer GCC versions also seem to support the same functionality through intrinsics). Secondly, earlier versions of the LLVM GC support also maintain a shadow stack, just like the compiler does manually now, so the performance benefits might not be as big as hoped.
Upcoming versions of LLVM do however seem to have a new stackmap system in development, but this is not yet finalized.

With the compiler hitting the performance targets by nearing the performance of the optimizing JIT compilers, and thereby improving upon interpreters, the most important question left is whether the subset used by the compiler is useful to programmers. While the compiler was not completed to such a degree as to answer all the hypotheses about actual code usage, the questionnaire did fill in most of the blanks that were not directly performance related but rather dependent on actual language usage. Comparing the hypothesis about eval usage and dynamic code loading with the questionnaire results proved mostly correct: while some developers did use eval, half of them could move to other methods to load dynamic code, leaving a smaller minority who felt that eval support was required. Richards (2010) did find that eval was used on real-world websites, but how much this applies to games, outside of setting up a web-specific environment, is unclear. The worst case scenario would be that the compiler runtime would include a separate interpreter that could be used to run code produced by eval at reduced speed; while requiring extra work and possibly negating numeric optimizations in many places, it would remove most compatibility problems. Developers were split on the issues surrounding undefined but seemed to welcome the idea of the compiler warning about potential undefined usages. Undefined behaviors create problems for the type inference system, yet the split opinions of developers warrant some extra investigation into how to proceed on the issue. As undefined can cause propagated errors in JavaScript systems, developers might have overestimated the impact of undefined, but this cannot be verified until the compiler is more mature and used for real-life projects.
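The kind of propagated error referred to above can be shown with a small node.js example (added for illustration, not taken from the thesis): reading a missing property silently yields undefined, and the resulting NaN only surfaces later, far from the actual mistake.

```javascript
var point = { x: 1 };        // note: no y property was ever defined
var sum = point.x + point.y; // point.y reads as undefined, no error raised
console.log(point.y);        // undefined -- silent at the point of the bug
console.log(sum);            // NaN -- the error surfaces away from its source
```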
Compiler reporting on undefined usage should be investigated to see whether the functionality would be more of an annoyance or an aid to developers. The remaining options would be to either have full emulation of undefined behavior (with the associated performance penalty) or a limited emulation system based on magic values (numerics with a special NaN pattern and a special object for other cases). The developers were also split on the issue of ahead-of-time fields. This issue is partially tied to the undefined issue, so semi-active emulation might be done either in the same way, with the undefined magic numbers, or by adding bitfields for the "existence" of particular fields. The bitfield updating method should provide most of the needed compatibility while putting a small extra performance strain on field updates. This performance penalty should not be too bad, since Richards (2010) showed a 6:1 ratio of field reads to writes, and the cost of the bitfield updates would be comparable to that of the garbage collection write barriers widely used in modern JIT runtimes. The developers were not very worried about integer overflows but almost unanimously preferred compatibility in terms of generic execution, and to either have the overflow behavior enabled explicitly or introduce it manually through hand-optimized Asm.JS-style type coercions where needed. Implementing this according to developer wishes is what is done today, and the results are not mutually exclusive, so the behavior in the current compiler is already acceptable. Object literals were used by the majority, with some small uptake of container types. With the widespread usage of this idiom it cannot be overlooked by a successful JavaScript compiler even if it complicates behavior; this was suspected beforehand, and the compiler is already able to analyze the usage of object literals as a functionality to map keys to values, but the exact runtime penalty of this has not yet been fully investigated.
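The idiom in question looks like the following node.js sketch (illustrative, not from the thesis): an object literal with no fixed shape is used purely as a key-to-value dictionary, which defeats the fixed object layouts the compiler otherwise relies on, since the set of keys is only known at runtime.

```javascript
// Object literal used as a hashmap: keys are data, not a fixed object shape.
var counts = {};
var words = ["foo", "bar", "foo"];
for (var i = 0; i < words.length; i++) {
  var w = words[i];
  counts[w] = (counts[w] || 0) + 1;
}
console.log(counts.foo, counts.bar); // 2 1
```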
The number of developers who seemed to favor hand optimizations when it came to integer numbers was a bit unexpected, but could be leveraged to attain even more performance or to find alternative solutions to the trickier problems faced in increasing the compatibility of the code produced by the compiler. This could also be looked at in connection with the recent usage of Asm.JS to bring over useful C and C++ libraries for handling physics and other CPU-heavy calculations in games. While the compiler is already capable of matching the optimizing JIT compilers in performance, it could actually gain a speed advantage here, since C and C++ libraries could be linked in natively, using the same abstractions already used with Asm.JS, without going through the Asm.JS translation steps that carry a bit of a performance penalty at runtime.

8 Conclusion

While not complete, the compiler subset implemented so far has shown the potential to improve the performance situation compared to interpreters, even matching the performance of contemporary optimizing JIT compilers. Fine tuning of the system, unexplored optimizations and native code bindings directly linked into the binary show that there is room to improve the performance figures further in real-world scenarios; on the other hand, the questionnaire showed a need to revisit some compatibility assumptions that could potentially decrease performance. Overall, the large performance improvement compared to interpreters speaks for itself, and implementing some of the compatibility fixes should not change the picture significantly relative to interpreter performance. The biggest unresolved technical issue remaining is to find out whether there are common occurrences of pathological cases for the type inference system, since this has not been properly investigated while working on making the system capable of compiling the relatively small benchmarking cases.
With the performance goals met, combined with the questionnaire responses, it seems that roughly half of the JavaScript developers should find the compiler useful as designed today, with roughly another quarter finding it suitable if the compatibility issues raised were sorted out. In general these combined results show enough promise to warrant future development of the system.

References

Agesen, O. (1995), The Cartesian Product Algorithm: Simple and Precise Type Inference of Parametric Polymorphism. In ECOOP '95, Ninth European Conference on Object-Oriented Programming, Århus, Denmark

Agesen, O. (1996), Concrete Type Inference: Delivering Object-Oriented Applications, Doctoral dissertation, Stanford University

Agesen, O., Hölzle, U. (1995), Type Feedback vs. Concrete Type Inference: A Comparison of Optimization Techniques for Object-Oriented Languages. In OOPSLA '95, Proceedings of the Tenth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 91-107

Appel, A. (1992), Compiling with Continuations, Cambridge University Press, ISBN 978-0-521-03311-4 (2006 paperback reprint)

Cartwright, R., Fagan, M. (1991), Soft Typing. In PLDI '91, Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, pages 278-292

Cython (2014), Cython: C-Extensions for Python, http://cython.org/ (Retrieved 2014-12-11)

Dufour, M. (2006), Shed Skin: An Optimizing Python-to-C++ Compiler, Master thesis, http://mark.dufour.googlepages.com/shedskin.pdf

Dufour, M. (2011), Shed Skin 0.9, blog post,
http://shed-skin.blogspot.se/2011/09/shed-skin-09.html (Retrieved 2014-12-11)

Ecma International (2011), Standard ECMA-262, ECMAScript Language Specification, Edition 5.1 (compatible with ISO/IEC 16262:2011), http://www.ecma-international.org/publications/standards/Ecma-262.htm (Retrieved 2014-12-11)

Egorov, V. (2013), Why asm.js bothers me, http://mrale.ph/blog/2013/03/28/why-asmjs-bothers-me.html (Retrieved 2014-12-11)

Google (2008), Chrome V8, Design Elements, https://developers.google.com/v8/design (Retrieved 2014-12-11)

Hayen, K. (2014), Nuitka Python compiler, http://nuitka.net/ (Retrieved 2014-12-11)

Herman, D., Wagner, L., Zakai, A. (2013), asm.js Specification, Working Draft, http://asmjs.org/spec/latest/ (Retrieved 2014-12-11)

LLVM Project (2015), Accurate Garbage Collection with LLVM, http://llvm.org/releases/3.6.0/docs/GarbageCollection.html (Retrieved 2015-05-20)

Madsen, M., Sørensen, P., Kristensen, K. (2007), Ecstatic – Type Inference for Ruby Using the Cartesian Product Algorithm, Master thesis, http://projekter.aau.dk/projekter/en/studentthesis/ecstatic--type-inference-for-ruby-usingthecartesian-product-algorithm%28e78517a5-e4cf-42d0-9caa-a80749e84c00%29.html (Retrieved 2014-12-11)

Might, M. (2014), k-CFA: Determining types and/or control-flow in languages like Python, Java and Scheme, http://matt.might.net/articles/implementation-of-kcfa-and-0cfa/ (Retrieved 2014-12-11)

Might, M., Shivers, O. (2006), Improving Flow Analyses via ΓCFA: Abstract Garbage Collection and Counting. In ICFP 2006, 11th ACM International Conference on Functional Programming, Portland, Oregon, September 2006, pages 13-25

Plevyak, J. (1996), Optimization of Object-Oriented and Concurrent Programs, PhD dissertation, University of Illinois at Urbana-Champaign

Reynolds, J. (1993), The Discoveries of Continuations, LISP and Symbolic Computation, Vol. 6(3), pages 233-247

Salib, M.
(2004), Starkiller: A Static Type Inferencer and Compiler for Python, Master thesis, http://dspace.mit.edu/handle/1721.1/16688 (Retrieved 2014-12-11)

Shivers, O. (1988), Control Flow Analysis in Scheme. In PLDI '88, Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 164-174

Shivers, O. (1991), Control-Flow Analysis of Higher-Order Languages or Taming Lambda, PhD dissertation, CMU-CS-91-145

Smaragdakis, Y., Bravenboer, M., Lhoták, O. (2011), Pick Your Contexts Well: Understanding Object-Sensitivity, ACM SIGPLAN Notices, Vol. 46(1), pp. 17

Steele, G. (1978), RABBIT: A Compiler for SCHEME, Master thesis, Massachusetts Institute of Technology

Ratanaworabhan, P., Livshits, B., Zorn, B. G. (2010), JSMeter: Comparing the Behavior of JavaScript Benchmarks with Real Web Applications. In Proceedings of the 2010 USENIX Conference on Web Application Development

Richards, G., Lebresne, S., Burg, B., Vitek, J. (2010), An Analysis of the Dynamic Behavior of JavaScript Programs, ACM SIGPLAN Notices, Vol. 45(6), pages 1-12

Van Emden, M. (2014), How Recursion Got into Programming: A Comedy of Errors, http://vanemden.wordpress.com/2014/06/18/how-recursion-got-into-programming-a-comedyof-errors-3/ (Retrieved 2014-12-11)

Vardoulakis, D., Shivers, O. (2011), CFA2: A Context-Free Approach to Control-Flow Analysis, http://arxiv.org/abs/1102.3676?frbrVersion=2 (Retrieved 2014-12-11)

Wingo, A. (2011), Value Representation in JavaScript Implementations, http://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations (Retrieved 2014-12-11)

Appendix A: Glossary

HTML5
The latest HTML document standard for the web, which also codifies approaches to incrementally adding new functionality to browsers beside the core document standard.

API
Application Programming Interface, a standardized way of interacting with a resource.
Canvas
A simple 2D drawing API.

WebGL
An adaptation of the OpenGL ES 2 standard for JavaScript, used to render 2D and 3D graphics with the help of the graphics chipset.

Fullscreen API
An API that provides a way for elements in a web page to be expanded to the entire screen, suitable for movie viewing and games.

Native code
Computer code that is executed directly by the machine silicon, without any other software in between to provide a translation.

Compilation
The process of turning the developer's sources into code runnable by a computer. Most often done by the developer prior to distribution and execution, and generally produces native code.

Interpretation
A usually slower method that runs code in a form not native to the current machine, by having the native machine interpret the non-native instruction stream.

Just in time compilation (JIT)
A generic description of modern and faster interpretation systems that speed up execution by translating the code to be run into native code. In contrast with a regular compiler, these kinds of systems defer the translation until the code needs to run. Benefits include faster testing and the ability of the compiler to make decisions about how to generate code based on the actual data that needs to be processed.

Ahead of time compilation (AOT)
An explicit moniker for the classic compilation model of developers compiling code before distribution to end users.

Garbage collection
Automatic reclaiming of memory and other resources when all references to a particular resource have disappeared.

Code signing
Applying a verifiable cryptographic signature to a piece of code to be run. By signing code, the signer indicates to man and/or machine that the code has been vetted by the signing party and is assumed to be free of problems.
Typing
The difference, for a computer system, between a boolean (true or false), an integer (-1, 0, 1, 2, …), a floating point number (0.5, 3.141592), a string ("hello world") and other kinds of object types.

Dynamic typing
Describes that code can have different kinds of data in the same position at different points in time.

Static typing
Contrary to the above, this describes that there can only be one type in each position.

Explicit typing
Almost exclusively used with static typing; signifies that the developer writes out the type to be used, e.g.: int A = 1+2; String B = "Hello";

Implicit typing
The opposite of explicit typing, but used with both dynamically and statically typed languages: var A = 1+2; var B = "Hello";

Soft typing
A kind of runtime system that outwardly, to the developer, works as a dynamically typed system but internally tries to constrain as much of the functionality as possible with static types.

Structural typing
A system where the description of what fields an object has, rather than an identity, is the important part; compare to nominal typing.

Nominal typing
A way of treating types where the identity of the type is the way to test equivalence.

Type inference
A process where a compiler infers static type information, if possible, from the source code.

Continuation
A continuation is a function defined as the rest of the program to be executed (Reynolds 1993).

Monomorphism
Cases where only a single type is used in a given position.

Polymorphism
Cases where multiple types are used in the same position.

Appendix B: Benchmarking Samples

The code samples in this chapter should be directly runnable with node.js.
To run the code samples with the default shells for Spidermonkey (JS), JavaScriptCore (jsc), Nashorn, Rhino and Duktape, a small console shim is needed; the simplest one is the following:

console={log:print}

001_fib.js

var fib=function(x) {
    if (x<2)
        return x;
    else
        return fib(x-2)+fib(x-1);
};
console.log("warmup");
var i=0.0;
while(i<5) {
    fib(33);
    i=i+1;
}
start=new Date().valueOf();
var res=fib(35);
stop=new Date().valueOf();
console.log("fib result ",0|res);
console.log("Time:",stop-start);

002_fibco.js

var fib=function(x) {
    if (x<2)
        return x;
    else
        return (fib((x-2)|0)+fib((x-1)|0))|0;
};
console.log("warmup");
var i=0.0;
while(i<5) {
    fib(33);
    i=i+1;
}
start=new Date().valueOf();
var res=fib(35);
stop=new Date().valueOf();
console.log("fib result ",0|res);
console.log("Time:",stop-start);

003_cplx.js

var iter=100000000;
var cplx=function(i,mx,my) {
    var t=0;
    var ax=1.0;
    var ay=0.0;
    while(0<i) {
        t=ax*mx-ay*my;
        ay=ax*my+ay*mx;
        ax=t;
        i=(i-1)|0;
    }
    return {x:ax,y:ay};
};
console.log("warmup");
var i=0.0;
while(i<5) {
    cplx(iter,0.9,0.4358898943540674);
    i=i+1;
}
start=new Date().valueOf();
var res=cplx(iter,0.9,0.4358898943540674);
stop=new Date().valueOf();
console.log("Time:",stop-start);
console.log("cplx result ",res.x,res.y);

004_cplxo.js

var iter=50000000;
var cplx=function(i,m) {
    var t=0;
    var a={x:1,y:0};
    while(0<i) {
        t=a.x*m.x-a.y*m.y;
        a.y=a.x*m.y+a.y*m.x;
        a.x=t;
        i=(i-1)|0;
    }
    return a;
};
console.log("warmup");
var i=0.0;
while(i<5) {
    cplx(iter,{x:0.9,y:0.4358898943540674});
    i=i+1;
}
start=new Date().valueOf();
var res=cplx(iter,{x:0.9,y:0.4358898943540674});
stop=new Date().valueOf();
console.log("Time:",stop-start);
console.log("cplx result ",res.x,res.y);

005_cplxpo.js

var iter=10000000;
var cplx=function(i,a,m) {
    var t=0;
    while(0<i) {
        t=a.x*m.x-a.y*m.y;
        a.y=a.x*m.y+a.y*m.x;
        a.x=t;
        i=(i-1)|0;
    }
    return a;
};
var pcplx=function(iter,m) {
    var obj=[];
    obj.push({x:1,y:0});
    obj.push({y:0,x:1});
    obj.push({z:3,y:0,x:1});
    var i=0;
    var out=obj[0];
    var cf=cplx;
    while(i<obj.length) {
        out=cf(iter,obj[i],m);
        i++;
    }
    return out;
};
console.log("warmup");
var i=0.0;
while(i<5) {
    pcplx(iter,{x:0.9,y:0.4358898943540674});
    i=i+1;
}
start=new Date().valueOf();
var res=pcplx(iter,{x:0.9,y:0.4358898943540674});
stop=new Date().valueOf();
console.log("Time:",stop-start);
console.log("cplx result ",res.x,res.y);

006_array.js

var i=0.0;
var j=0;
var a=[];
//var max=10000000;
var max=1000;
var iter=200000.0;
//var max=2;
while(i<max) {
    a.push(j);
    j=j^(j*374);
    j=(j+14513)&0xffff;
    i=i+1;
}
var test=function() {
    var i=0;
    var sum=0;
    while(i<max) {
        sum=sum+a[i];
        i=i+1;
    }
    return sum;
};
console.log("warmup");
i=0.0;
while(i<15) {
    test();
    i=i+1;
}
start=new Date().valueOf();
var res;
while(0<iter) {
    res=test();
    iter=iter-1;
}
stop=new Date().valueOf();
console.log("test result ",0|res);
console.log("Time:",stop-start);

007_arrayco.js

var i=0.0;
var j=0;
var a=[];
//var max=10000000;
var max=1000;
var iter=200000.0;
//var max=2;
while(i<max) {
    a.push(j);
    j=j^(j*374);
    j=(j+14513)&0xffff;
    i=i+1;
}
var test=function() {
    var i=0;
    var sum=0;
    while(i<max) {
        sum=sum+a[i];
        i=(i+1)|0;
    }
    return sum;
};
console.log("warmup");
i=0.0;
while(i<15) {
    test();
    i=i+1;
}
start=new Date().valueOf();
var res;
while(0<iter) {
    res=test();
    iter=iter-1;
}
stop=new Date().valueOf();
console.log("test result ",0|res);
console.log("Time:",stop-start);

Appendix C: Benchmarking results

The numbers here are the raw CSV data used to create the graphs.
Each number corresponds to one sample of running time for calculating each benchmark in milliseconds Win32 ,,Test0,Test1,Test2,Test3,Test4 001_fib.js,ccref,93,109,109,109,93 001_fib.js,cm_in,266,265,265,265,265 001_fib.js,cm_no,312,312,327,327,312 001_fib.js,cmref,514,514,530,514,514 001_fib.js,v8no_c,218,218,218,218,219 001_fib.js,v8no_b,452,452,468,468,468 001_fib.js,sp_fa,110,109,111,110,110 001_fib.js,sp_ba,1233,1233,1240,1238,1231 001_fib.js,sp_in,14376,14400,14997,14270,14894 001_fib.js,nasho,578,436,578,421,468 001_fib.js,rhino,1482,1513,1560,1498,1544 001_fib.js,dukt,22308,21825,21886,23634,26193 002_fibco.js,ccref,78,93,78,78,93 002_fibco.js,cm_in,265,249,250,265,265 002_fibco.js,cm_no,328,328,327,343,343 002_fibco.js,v8no_c,218,218,218,218,218 002_fibco.js,v8no_b,562,578,578,578,578 002_fibco.js,sp_fa,118,118,119,118,118 002_fibco.js,sp_ba,1396,1395,1397,1393,1398 002_fibco.js,sp_in,16404,16409,16688,16702,17088 002_fibco.js,nasho,827,842,826,858,827 002_fibco.js,rhino,2199,2169,2184,2200,2262 002_fibco.js,dukt,28735,27721,28798,28283,27253 003_cplx.js,ccref,468,468,452,452,468 003_cplx.js,cm_in,468,452,468,453,452 003_cplx.js,cm_no,452,453,468,468,452 003_cplx.js,cmref,2502,2496,2500,2501,2511 003_cplx.js,v8no_c,468,452,452,452,452 003_cplx.js,v8no_b,4851,4852,4836,4852,4836 003_cplx.js,sp_fa,481,481,480,479,480 003_cplx.js,sp_ba,4071,4071,4066,4073,4084 003_cplx.js,sp_in,84315,83296,83218,83614,84160 003_cplx.js,nasho,1498,1607,1513,1466,1435 003_cplx.js,rhino,5148,5085,5132,5054,5148 003_cplx.js,dukt,51277,55021,53321,52853,54273 004_cplxo.js,cm_in,343,343,343,359,358 004_cplxo.js,cm_no,1576,1560,1591,1592,1560 004_cplxo.js,v8no_c,374,359,374,359,374 004_cplxo.js,v8no_b,4695,4712,4712,4696,4695 004_cplxo.js,sp_fa,378,377,377,377,378 004_cplxo.js,sp_ba,4681,4673,4681,4681,4698 004_cplxo.js,sp_in,105721,106027,105820,105670,105600 004_cplxo.js,nasho,1981,1997,2013,1950,1950 004_cplxo.js,rhino,7847,7987,7426,7753,8237 
004_cplxo.js,dukt,87828,88280,87485,87626,87363 005_cplxpo.js,cm_in,219,218,218,219,218 005_cplxpo.js,cm_no,1030,1186,1092,1186,1030 005_cplxpo.js,v8no_c,374,374,374,374,374 005_cplxpo.js,v8no_b,2995,2995,2995,3011,2995 005_cplxpo.js,sp_fa,489,489,489,490,489 005_cplxpo.js,sp_ba,3170,3163,3170,3170,3168 005_cplxpo.js,sp_in,62709,62112,62092,62032,62089 005_cplxpo.js,nasho,1404,1404,1326,1420,1435 005_cplxpo.js,rhino,5133,5117,5132,5133,5132 005_cplxpo.js,dukt,52478,52432,52354,52369,52651 58 006_array.js,cm_in,702,718,718,702,718 006_array.js,cm_no,5741,7317,5772,5772,6256 006_array.js,v8no_c,452,468,452,468,452 006_array.js,v8no_b,1919,1935,1950,1950,1935 006_array.js,sp_fa,277,278,280,278,277 006_array.js,sp_ba,5719,5715,5711,5715,5740 006_array.js,sp_in,105456,105702,108196,105453,109719 006_array.js,nasho,5506,5757,5569,5584,5555 006_array.js,rhino,8346,8564,8424,8502,8673 006_array.js,dukt,179729,155969,174938,158808,151851 007_arrayco.js,cm_in,296,281,281,281,281 007_arrayco.js,cm_no,2075,2074,2074,2075,2075 007_arrayco.js,v8no_c,452,468,453,468,453 007_arrayco.js,v8no_b,2356,2324,2371,2325,2340 007_arrayco.js,sp_fa,371,372,372,372,373 007_arrayco.js,sp_ba,5886,5900,5884,5884,5901 007_arrayco.js,sp_in,122590,120525,116760,117357,117066 007_arrayco.js,nasho,6302,6209,6489,6349,6272 007_arrayco.js,rhino,11170,11419,11404,11435,11341 007_arrayco.js,dukt,197496,187154,202520,186826,190429 Raspberry PI B1 ,,Test0,Test1,Test2,Test3,Test4 001_fib.js,ccref,1510,1520,1520,1520,1520 001_fib.js,cm_in,2876,2899,2885,2871,2875 001_fib.js,cm_no,3784,3795,3786,3786,3813 001_fib.js,cmref,6540,6550,6410,6420,6520 001_fib.js,v8n6,6328,6334,6326,6332,6328 001_fib.js,v8n12,11174,11061,11035,11170,11522 001_fib.js,sp_fa,1919,1940,1938,1918,1918 002_fibco.js,ccref,550,560,550,550,560 002_fibco.js,cm_in,2001,1997,1998,1997,2037 002_fibco.js,cm_no,3079,3076,3079,3079,3079 002_fibco.js,v8n6,7694,7701,7695,7722,7720 002_fibco.js,v8n12,12920,13023,13238,13132,12822 
002_fibco.js,sp_fa,1890,1886,1885,1888,1886
003_cplx.js,ccref,4070,4070,4070,4070,4070
003_cplx.js,cm_in,2907,2910,2935,2907,2907
003_cplx.js,cm_no,2914,2914,2907,2911,2910
003_cplx.js,cmref,51920,48770,50450,50430,50920
003_cplx.js,v8n6,109432,109403,109460,109497,109689
003_cplx.js,v8n12,103421,103506,103334,103142,103266
003_cplx.js,sp_fa,5676,5672,5676,5700,5672
004_cplxo.js,cm_in,1600,1600,1600,1598,1600
004_cplxo.js,cm_no,20469,20563,20472,20485,20465
004_cplxo.js,v8n6,96322,96459,96264,96178,97580
004_cplxo.js,v8n12,122829,122868,123093,122843,122891
004_cplxo.js,sp_fa,6127,6143,6142,6126,6116
005_cplxpo.js,cm_in,969,960,963,963,961
005_cplxpo.js,cm_no,12375,12361,12381,12389,12355
005_cplxpo.js,v8n6,86042,82960,86269,85661,92117
005_cplxpo.js,v8n12,76343,76407,76407,76496,76310
005_cplxpo.js,sp_fa,5277,5274,5272,5281,5275
006_array.js,cm_in,16806,16671,16655,16653,16646
006_array.js,cm_no,42082,42069,42099,42077,42092
006_array.js,v8n6,30373,30378,30409,30498,30373
006_array.js,v8n12,31498,31435,31450,31438,31413
006_array.js,sp_fa,5575,5598,5570,5570,5599
007_arrayco.js,cm_in,6142,6140,6164,6178,6146
007_arrayco.js,cm_no,30131,30131,30164,30244,30132
007_arrayco.js,v8n6,32738,32758,32756,32756,32768
007_arrayco.js,v8n12,33767,33776,33609,33667,33595
007_arrayco.js,sp_fa,6223,6196,6155,6157,6181
001_fib.js,sp_ba,19206,19486
002_fibco.js,sp_ba,21315,21277
003_cplx.js,sp_ba,55214,55214
004_cplxo.js,sp_ba,70780,70681
005_cplxpo.js,sp_ba,49949,50003
006_array.js,sp_ba,103444,103158
007_arrayco.js,sp_ba,125522,125756
001_fib.js,sp_ba,23289,23413
001_fib.js,sp_in,204498,186132
001_fib.js,dukt,1131094,1064052
002_fibco.js,sp_ba,25690,26517
002_fibco.js,sp_in,221071,194745
002_fibco.js,dukt,1391052,1410797
003_cplx.js,sp_ba,66883,66645
003_cplx.js,sp_in,376200,375370
003_cplx.js,dukt,1428458,1504127
004_cplxo.js,sp_ba,85316,85619
004_cplxo.js,sp_in,1186910,1388429
004_cplxo.js,dukt,3306682,3292335
005_cplxpo.js,sp_ba,60601,60259
005_cplxpo.js,sp_in,801553,856546
005_cplxpo.js,dukt,2012607,1957465
006_array.js,sp_ba,125611,125497
006_array.js,sp_in,1704134,1668996
006_array.js,dukt,5892989,6649120
007_arrayco.js,sp_ba,195697,152099
007_arrayco.js,sp_in,2074499,1703201
007_arrayco.js,dukt,8058768,7355199

Raspberry PI B2,,Test0,Test1,Test2,Test3,Test4
001_fib.js,ccref,760,760,760,760,760
001_fib.js,cm_in,2091,2090,2090,2091,2090
001_fib.js,cm_no,2779,2778,2775,2776,2777
001_fib.js,cmref,4250,4250,4250,4250,4250
001_fib.js,v8no_c,2456,2458,2456,2455,2457
001_fib.js,v8no_b,6795,6835,6793,6794,6793
001_fib.js,sp_fa,1629,1628,1628,1629,1628
001_fib.js,sp_ba,20759,20873,20759,20754,20754
001_fib.js,jsc_d,1997,1997,1996,1996,1997
001_fib.js,jsc_j,12408,12720,12367,12411,12354
001_fib.js,jsc_i,26957,26956,26965,26954,26944
001_fib.js,nasho,11794,11386,11477,11335,11480
002_fibco.js,ccref,490,490,490,490,490
002_fibco.js,cm_in,1733,1734,1734,1734,1732
002_fibco.js,cm_no,2395,2392,2393,2392,2396
002_fibco.js,v8no_c,2454,2454,2457,2457,2456
002_fibco.js,v8no_b,8516,8511,8512,8512,8511
002_fibco.js,sp_fa,1536,1537,1537,1536,1537
002_fibco.js,sp_ba,22797,22903,22900,22899,22901
002_fibco.js,jsc_d,2042,2041,2040,2044,2041
002_fibco.js,jsc_j,13373,13374,13376,13371,13374
002_fibco.js,jsc_i,30794,30807,30795,31157,30798
002_fibco.js,nasho,15626,15730,15877,15731,15634
003_cplx.js,ccref,4610,4610,4610,4610,4610
003_cplx.js,cm_in,3929,3929,3929,3929,3932
003_cplx.js,cm_no,3928,3929,3929,3928,3929
003_cplx.js,cmref,24350,24350,24350,24350,24350
003_cplx.js,v8no_c,4613,4614,4612,4613,4613
003_cplx.js,v8no_b,64524,64064,64116,64132,64151
003_cplx.js,sp_fa,5640,5637,5637,5637,5639
003_cplx.js,sp_ba,65646,65285,65294,65289,65304
003_cplx.js,jsc_d,9056,9057,9059,9058,9057
003_cplx.js,jsc_j,24478,24100,24459,24114,24095
003_cplx.js,jsc_i,82918,82957,82908,82929,82920
003_cplx.js,nasho,50807,51190,50836,50825,50834
004_cplxo.js,cm_in,2089,2050,2052,2050,2050
004_cplxo.js,cm_no,19820,19822,19823,19824,19822
004_cplxo.js,v8no_c,4273,4272,4272,4270,4272
004_cplxo.js,v8no_b,81846,81832,81780,81858,81817
004_cplxo.js,sp_fa,4449,4446,4447,4450,4447
004_cplxo.js,sp_ba,86228,86131,86120,86131,85776
004_cplxo.js,jsc_d,5726,5726,5727,5727,5726
004_cplxo.js,jsc_j,33846,33898,33905,33902,33846
004_cplxo.js,jsc_i,81452,81444,81494,81815,81798
004_cplxo.js,nasho,62222,62219,61108,62196,61556
005_cplxpo.js,cm_in,1229,1230,1229,1231,1229
005_cplxpo.js,cm_no,11994,11994,11992,11993,11992
005_cplxpo.js,v8no_c,6219,6220,6219,6220,6220
005_cplxpo.js,v8no_b,50169,50198,50592,50186,50200
005_cplxpo.js,sp_fa,4723,4726,4728,4727,4726
005_cplxpo.js,sp_ba,57909,57505,57486,57862,57495
005_cplxpo.js,jsc_d,13452,13454,13450,13453,13456
005_cplxpo.js,jsc_j,25027,25100,24939,24931,25052
005_cplxpo.js,jsc_i,48868,49283,49237,48874,48878
005_cplxpo.js,nasho,41732,41752,41774,41718,42110
006_array.js,cm_in,7883,7881,7881,7886,7884
006_array.js,cm_no,24053,24015,24005,24010,24010
006_array.js,v8no_c,4469,4469,4468,4468,4468
006_array.js,v8no_b,22247,22243,22248,22254,22621
006_array.js,sp_fa,4810,4810,4813,4811,4813
006_array.js,sp_ba,110679,102806,103180,102819,103218
006_array.js,jsc_d,8250,8252,8247,8247,8246
006_array.js,jsc_j,42230,42226,42228,42227,42599
006_array.js,jsc_i,140438,140791,140471,140874,140456
006_array.js,nasho,144238,143528,144542,143920,143794
007_arrayco.js,cm_in,4802,4805,4802,4805,4804
007_arrayco.js,cm_no,18877,18880,18872,19247,18876
007_arrayco.js,v8no_c,4469,4468,4470,4469,4471
007_arrayco.js,v8no_b,24649,24641,24654,24654,24639
007_arrayco.js,sp_fa,5492,5491,5490,5493,5491
007_arrayco.js,sp_ba,114751,118252,114809,118312,114750
007_arrayco.js,jsc_d,8250,8247,8247,8248,8247
007_arrayco.js,jsc_j,44622,44611,44621,44993,44620
007_arrayco.js,jsc_i,161469,161829,161749,161383,164269
007_arrayco.js,nasho,170721,170952,171143,170867,171380
001_fib.js,sp_in,78937,79057
001_fib.js,rhino,26269,26253
001_fib.js,dukt,495224,461889
002_fibco.js,sp_in,84915,85141
002_fibco.js,rhino,35472,35778
002_fibco.js,dukt,626624,643804
003_cplx.js,sp_in,267937,267954
003_cplx.js,rhino,136147,135926
003_cplx.js,dukt,954121,953236
004_cplxo.js,sp_in,560811,559066
004_cplxo.js,rhino,170031,169823
004_cplxo.js,dukt,1818710,1831224
005_cplxpo.js,sp_in,338900,338950
005_cplxpo.js,rhino,114481,115004
005_cplxpo.js,dukt,1094543,1094539
006_array.js,sp_in,538550,520576
006_array.js,rhino,236915,237215
006_array.js,dukt,3377678,3773322
007_arrayco.js,sp_in,547954,535538
007_arrayco.js,rhino,299922,297781
007_arrayco.js,dukt,3816090,3823619

C compilers,,Test0,Test1,Test2,Test3,Test4
001_fib.js,msvc_05,240,240,230,250,240
001_fib.js,msvc_13,250,260,260,260,260
001_fib.js,clang,343,345,343
001_fib.js,gcc,190,190,190,190,190
002_fibco.js,msvc_05,240,270,240,240,240
002_fibco.js,msvc_13,270,260,260,272,260
002_fibco.js,clang,260,250,260,250,250
002_fibco.js,gcc,200,200,200,200,210
003_cplx.js,msvc_05,460,460,460,470,460
003_cplx.js,msvc_13,460,470,460,470,460
003_cplx.js,clang,511,510,520,510,520
003_cplx.js,gcc,511,520,508,511,500
004_cplxo.js,msvc_05,390,390,378,380,381
004_cplxo.js,msvc_13,360,355,360,352,360
004_cplxo.js,clang,250,260,260,260,250
004_cplxo.js,gcc,240,230,230,230,230
005_cplxpo.js,msvc_05,230,230,230,230,230
005_cplxpo.js,msvc_13,220,224,220,220,220
005_cplxpo.js,clang,220,220,220,220,220
005_cplxpo.js,gcc,140,140,139,138,140
006_array.js,msvc_05,930,920,930,928,940
006_array.js,msvc_13,715,725,720,722,720
006_array.js,clang,880,880,880,885,880
006_array.js,gcc,1770,1758,1762,1760,1770
007_arrayco.js,msvc_05,620,610,610,610,600
007_arrayco.js,msvc_13,290,290,290,290,288
007_arrayco.js,clang,370,370,370,370,370
007_arrayco.js,gcc,280,280,280,280,275

Appendix D: Questionnaire

JavaScript static compilation questions
*Required

Firstly, do you program in plain JavaScript or use some kind of transpiler? *
• JavaScript
• CoffeeScript
• TypeScript
• Haxe
• Other:

What JavaScript/EcmaScript environments have you used?
*
• Chrome/Node (V8)
• Firefox (Spidermonkey, etc.)
• Safari (JavaScriptCore)
• Internet Explorer
• Rhino (JVM)
• Nashorn (JVM)
• PhoneGap / Cordova
• Titanium Appcelerator
• CocoonJS
• GameClosure
• Other:

1. Dynamic code loading

Does your code use eval? *
• Yes
• No

Does your code use another mechanism for loading code in a dynamic fashion? *
Such as the CommonJS function require("level"+number) or the Rhino load("level"+number).
• Yes
• No

If you answered yes to the eval question above, could your code easily be changed to use load() or CommonJS require() as described above?
• Yes
• No

If you use dynamic code loading and would like to add anything to the above, do so here.

2. Undefined behaviour

Does your code depend on undefined for its functionality? *
(Excluding explicit "undefined" testing for browser functionality to retain compatibility.)
• Yes
• No

Would your code behave differently if a variable only used as a number upon initialization were set to 0 or NaN instead of undefined? *
• Yes
• No

Would your code behave differently if a variable only used for objects were null instead of undefined? *
• Yes
• No

Would you consider it a benefit if the compiler faulted on most usages of undefined? For example, any numeric operation with undefined would produce an error at compile time instead of a NaN at runtime.
• Yes
• No
• Other:

If you have additional thoughts about undefined usage or behaviour, write them here.

3. Object behaviour

Would your code behave differently if fields existed ahead of time? *
I.e.,
var obj={};
var beforeDefinition=obj.hasOwnProperty("a");
obj.a=true;
console.log(beforeDefinition);
// with beforeDefinition being true instead of false as with regular EcmaScript/JavaScript
• Yes
• No

4. Numeric behaviour

The EcmaScript standard specifies that all numbers are handled as IEEE-754 double precision floating point numbers. However, processors usually handle integers faster than double precision numbers, and in most cases programmers expect only integer behaviour when using numbers. JavaScript JIT compilers can easily handle overflowing integers by promoting them to doubles at runtime and recompiling the code, but an ahead of time compiler has no such luxury. A mode with separate int and double types and promotion rules similar to C is being considered.

Would your code be affected by integer overflows? *
Answer conservatively if you are unsure. For example, (1<<30)+(1<<30) becomes -2147483648 for 32 bit integers, and 65536*65536 would produce 0 for a 32 bit integer.
• Yes
• No

To get the benefits of integer operations, what would be your preference? *
• Have integer operations separated from doubles by default (fast, less compatibility, risk of subtle bugs not found on other implementations)
• Have doubles (compatible) as the default but enable a similar fast mode for specified functions and other blocks (similar to the "use strict" and "use asm" directives)
• Have hand optimized code sequences (no compiler option; the programmer specifies coercions manually, for example 0|(a+b), similar to how asm.js requires operations to be written)
• Other:

5. Code style

To store dynamic mappings between keys and values, do you use plain object literals or some kind of container class functionality? *
I.e., do you create an object var namedColors={}; and then dynamically load in values in the fashion of namedColors[colorID]=colorNumber; (where one instance might be colorID="red" and colorNumber=0xff0000;) to create a mapping, or does your code look more like var namedColors=new HashMap(); namedColors.put(colorID,colorNumber); ?
• Object literals ( {} )
• Container classes ( new HashMap, new TreeMap, etc.)
• Other:
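The 32 bit wraparound that the "Numeric behaviour" question above asks about can be sketched in plain JavaScript: the `|0` coercion (the same style the asm.js option refers to) truncates a Number to a signed 32 bit integer, mimicking what a C int would do after overflow.

```javascript
// EcmaScript computes both sums exactly as IEEE-754 doubles.
var a = (1 << 30) + (1 << 30); // 2147483648 as a double
var b = 65536 * 65536;         // 4294967296 as a double

// Coercing with |0 keeps only the low 32 bits, interpreted as signed.
console.log(a);     // 2147483648
console.log(a | 0); // -2147483648 (wrapped around)
console.log(b);     // 4294967296
console.log(b | 0); // 0 (the low 32 bits are all zero)
```

This is why the questionnaire warns about "subtle bugs not found on other implementations": code compiled with int-by-default semantics would silently produce the wrapped values, while every standard EcmaScript engine produces the exact double results.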