A Relational Approach to Interprocedural Shape Analysis

BERTRAND JEANNET, ALEXEY LOGINOV, THOMAS REPS, and MOOLY SAGIV

This paper addresses the verification of properties of imperative programs with recursive procedure calls, heap-allocated storage, and destructive updating of pointer-valued fields, i.e., interprocedural shape analysis. The paper makes three contributions:
— It introduces a new method for abstracting relations over memory configurations for use in abstract interpretation.
— It shows how this method furnishes the elements needed for a compositional approach to shape analysis. In particular, abstracted relations are used to represent the shape transformation performed by a sequence of operations, and an over-approximation to relational composition can be performed using the meet operation of the domain of abstracted relations.
— It applies these ideas in a new algorithm for context-sensitive interprocedural shape analysis. The algorithm creates procedure summaries using abstracted relations over memory configurations, and the meet-based composition operation provides a way to apply the summary transformer for a procedure P at each call site from which P is called.
The algorithm has been applied successfully to establish properties of both (i) recursive programs that manipulate lists, and (ii) recursive programs that manipulate binary trees.

Categories and Subject Descriptors: D.2.4 [Software Engineering]: Software/Program Verification—Assertion checkers; D.2.5 [Software Engineering]: Testing and Debugging—symbolic execution; D.3.3 [Programming Languages]: Language Constructs and Features—data types and structures; dynamic storage management; Procedures, functions and subroutines; Recursion; E.1 [Data]: Data Structures—Lists, stacks and queues; Trees; E.2 [Data]: Data Storage Representations—composite structures; linked representations; F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs—Assertions; Invariants

General Terms: Algorithms, Languages, Theory, Verification

Additional Key Words and Phrases: Abstract interpretation, context-sensitive analysis, interprocedural dataflow analysis, destructive updating, pointer analysis, shape analysis, static analysis, 3-valued logic

A preliminary version of this paper appeared in the proceedings of the 11th Int. Static Analysis Symposium (SAS), (Verona, Italy, August 26-28, 2004) [Jeannet et al. 2004].
This work was supported in part by ONR under grants N00014-01-1-0796 and N00014-01-10708, and by NSF under grants CCR-9986308, CCF-0540955, and CCF-0524051.
Affiliations: Bertrand Jeannet; INRIA; Bertrand.Jeannet@inrialpes.fr. Alexey Loginov; GrammaTech, Inc.; alexey@grammatech.com. Thomas Reps; Comp. Sci. Dept., University of Wisconsin, and GrammaTech, Inc.; reps@cs.wisc.edu. Mooly Sagiv; School of Comp. Sci., Tel Aviv University; msagiv@post.tau.ac.il.
When the research reported in the paper was carried out, Bertrand Jeannet was affiliated with INRIA or visiting the University of Wisconsin, and Alexey Loginov was affiliated with the University of Wisconsin.
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 20YY ACM 0000-0000/20YY/0000-0001 $5.00
ACM Journal Name, Vol. V, No. N, Month 20YY, Pages 1–0??.

1. INTRODUCTION

This paper concerns techniques for static analysis of recursive programs that manipulate heap-allocated storage and perform destructive updating of pointer-valued fields.
The goal is to recover shape descriptors that provide information about the characteristics of the data structures that a program’s pointer variables can point to. Such information can be used to help programmers understand certain aspects of the program’s behavior, to verify properties of the program, and to optimize or parallelize the program.

The work reported in the paper builds on past work by several of the authors on static analysis based on 3-valued logic [Sagiv et al. 2002; Reps et al. 2003] and its implementation in the TVLA system [Lev-Ami and Sagiv 2000]. In this setting, two related logics come into play: an ordinary 2-valued logic, as well as a related 3-valued logic. A memory configuration, or store, is modeled by what logicians call a logical structure, which consists of a predicate (i.e., a relation of appropriate arity) for each predicate symbol of a vocabulary $\mathcal{P}$. A store is modeled by a 2-valued logical structure; a set of stores is abstracted by a (finite) set of bounded-size 3-valued logical structures. An individual of a 3-valued structure’s universe either models a single memory cell or, in the case of a summary individual, a collection of memory cells.

The constraint of working with limited-size descriptors entails a loss of information about the store. Certain properties of concrete individuals are lost due to abstraction, which groups together multiple individuals into summary individuals: a property can be true for some concrete individuals of the group but false for other individuals. It is for this reason that 3-valued logic is used; uncertainty about a property’s value is captured by means of the third truth value, 1/2.

One of the opportunities for scaling up this approach is to exploit the compositional structure of programs. In interprocedural dataflow analysis, one avenue for accomplishing this is to create a summary transformer for each procedure P, and use the summary transformer at each call site at which P is called.
Each summary transformer must capture (an over-approximation of) the net effect of a call on P. To be able to create summary transformers, the abstract transformers for individual transitions must have a “composable representation”; that is, given the representations of two abstract transformers, it must be possible to represent their composition as an object of roughly the same size. One then carries out a fixpoint-finding procedure on a collection of equations in which each variable in the equation set has a transformer-valued value (i.e., a value drawn from the domain of transformers) rather than a dataflow value proper.

A number of approaches to interprocedural dataflow analysis based on summary transformers are known [Cousot and Cousot 1977; Sharir and Pnueli 1981; Knoop and Steffen 1992; Reps et al. 1995; Sagiv et al. 1996; Reps et al. 2005]. However, not all program-analysis problems have abstract transformers that have a composable representation. For some problems, it is possible to address this issue by working pointwise, tabulating composed transformers using either (i) sets of pairs that consist of an input abstract value and an output abstract value [Sharir and Pnueli 1981], or (ii) finer-granularity sets of pairs that capture how parts of an input abstract value influence parts of an output abstract value [Reps et al. 1995; Sagiv et al. 1996; Ball and Rajamani 2001]. In essence, these approaches start with the kinds of objects used in intraprocedural analysis and pair them together to create the objects that are used in interprocedural analysis.
However, for interprocedural shape analysis, tabulating pairs of 3-valued structures (the kinds of objects used in intraprocedural shape analysis) has significant drawbacks insofar as precision is concerned: in the 3-valued-logic approach to shape analysis, individuals, which model memory cells, do not have fixed identities; they are identified only up to their “distinguishing characteristics”, namely, their values for a specific set of unary predicates. Because these “distinguishing characteristics” can change during the course of a procedure call, there is no way to identify individuals in an input abstract structure with their corresponding individuals in the output abstract structure. In essence, a pair of input/output 3-valued structures loses track of the correlations between the input and output values of an individual’s unary predicates. Consequently, an approach based on tabulating composed transformers as sets of pairs of 3-valued structures provides only a weak characterization of a procedure’s net effect, and is fundamentally limited in the properties that it can express.

All is not lost, however: instead of “abstracting and then pairing” (as discussed above), the solution is to “pair and then abstract”.

Observation 1.1. By using a 3-valued structure over a doubled vocabulary $\mathcal{P} \uplus \mathcal{P}'$, where $\mathcal{P}' = \{p' \mid p \in \mathcal{P}\}$ and $\uplus$ denotes disjoint union, one can obtain a finite abstraction that relates the predicate values for an individual at the beginning of a transition to the predicate values for the individual at the end of the transition.

This approach provides a way to create much more accurate composable representations of transformers, and hence much more accurate summary transformers, for a broad class of problems. The advantages come from two effects:
— The addition of the second vocabulary changes the abstraction in use because individuals now have additional “distinguishing characteristics” [Sagiv et al. 2002].
— The second vocabulary helps track the changes in a predicate over a sequence of operations [Lev-Ami et al. 2000].
The benefit of these properties is that, in many cases, a relationship on the before and after values of a predicate can be tracked on individual locations or tuples of locations, over a sequence of operations, even when abstraction has been performed. The consequence is that two-vocabulary 3-valued structures provide more precise descriptors of relations between stores than an approach based on pairing abstract stores from an existing store abstraction. Moreover, by extending the abstract domain of 3-valued logical structures with some new operations, it is possible to perform abstract interpretation of call and return statements without losing too much precision (see §6 and §7). We have used these ideas to create a context-sensitive shape-analysis algorithm for recursive programs that manipulate heap-allocated storage and perform destructive updating.

The “pair and then abstract” principle of Observation 1.1 is related to several well-known concepts:

Pairing without abstraction. The use of a doubled vocabulary is standard in logic-based reasoning about execution behavior: the transition relations of a language’s concrete semantics are often expressed by means of formulas over present-state and next-state variables (e.g., [Gries 1981; Manna and Pnueli 1995; Clarke et al. 1999]). For instance, the semantics of a statement x := y+1 can be expressed as the formula $(x' = y + 1) \land (y' = y)$. Similarly, a procedure’s post-condition is often expressed using such a doubled vocabulary (i.e., the post-condition expresses a relation over input stores and output stores).

Pairing and then numeric abstraction.
For analyzing programs that manipulate numeric data, a composable abstract transformer for a statement such as x := y+1 can be created directly from the formula $(x' = y + 1) \land (y' = y)$ when using the polyhedral abstract domain [Cousot and Halbwachs 1978]. The number of dimensions in each polyhedron used by the analyzer is double the number $|V|$ of numeric variables $V$ about which the analyzer is trying to obtain information. Each program variable has a primed and an unprimed version, and a polyhedron captures linear relations among the $2|V|$ variables.

In this paper, we use Observation 1.1 to create composable abstract transformers for programs that manipulate non-numeric data. Our work provides a new approach to performing context-sensitive interprocedural shape analysis, and allows us to verify properties of imperative programs with recursive procedure calls, heap-allocated storage, and destructive updating of pointer-valued fields.

The contributions of our work include the following:
(1) We introduce a new method for abstracting relations over memory configurations for use in abstract interpretation.
(2) We show how this method furnishes the elements needed for a compositional approach to shape analysis. In particular, abstracted relations are used to represent the shape transformation performed by a sequence of operations, and an over-approximation to relational composition can be performed using the meet operation of the domain of abstracted relations.
(3) We apply these ideas in a new algorithm for context-sensitive interprocedural shape analysis. The algorithm creates procedure summaries using abstracted relations over memory configurations, and the meet-based composition operation provides a way to apply the summary transformer for a procedure P at each call site from which P is called.
We have been able to apply this approach successfully to establish properties of both (i) recursive programs that manipulate lists, and (ii) recursive programs that manipulate binary trees. While list-manipulation programs can often be implemented in tail-recursive fashion (and hence can be converted easily into loop programs), tree-manipulation programs are much less easily converted to non-recursive form. In particular, the shape properties that characterize sorted binary trees are complex and rely on global properties, whereas the shape properties that characterize sorted lists are mostly local properties, with cyclicity properties being the main exception.

    typedef struct node {
        struct node *n;
        int data;
    } *List;

    List res;

    void main(List l) {
        res = rev(l);
    }

    List rev(List x) {
        List y, z;
        z = x->n;
        x->n = NULL;
        if (z != NULL) {
            y = rev(z);
            z->n = x;
        } else
            y = x;
        return y;
    }

Fig. 1. Recursive list-reversal program. The recursive function rev destructively reverses a nonempty, acyclic, singly-linked list using recursion to traverse the list.

Organization. The remainder of the paper is organized as follows: §2 presents, at a semi-formal level, several of the principles that lie behind our approach. §3 presents some background on 2-valued and 3-valued logic. §4 defines the language to which our analysis applies, and gives a concrete semantics, based on the use of 2-valued logical structures for representing memory configurations. §5 describes the abstraction of 2-valued logical structures with bounded-size 3-valued logical structures [Sagiv et al. 2002]. Our interprocedural shape analysis is based on a relational semantics, which establishes at each control point a relation between the input state of the enclosing procedure and the state at the current point.
This semantics requires the ability to represent relations between memory configurations, which presents certain difficulties at the abstract level. §6 addresses this problem by abstracting relations between memory configurations using the same principles as those used to abstract sets of memory configurations in §5. §7 describes the interprocedural shape-analysis algorithm that we developed based on these ideas. §8 presents experimental results. §9 discusses related work.

2. OVERVIEW

In this section, we discuss at a semi-formal level the “pairing” aspect of Obs. 1.1 (“pair and then abstract”). Abstraction is the subject of §5. §7 applies the “pair and then abstract” principle in the context of interprocedural shape analysis.

Consider non-empty, acyclic, singly-linked lists constructed from nodes of the type List whose declaration is given in Fig. 1. One of the issues discussed below concerns how to create a summary transformer for a procedure that reverses a list, using destructive updating. The summary transformer that we give applies both to recursive and non-recursive destructive list-reversal procedures. Because summary transformers (also known as “procedure summaries”) are particularly useful for analyzing recursive programs, the running example used in later sections of the paper is the recursive list-reversal program shown in Fig. 1. That procedure destructively reverses a non-empty, acyclic, singly-linked list using recursion to traverse the list. In the remainder of this section, we discuss the two code fragments shown in Fig. 2.

    (a) [1] a = <a 4-element list>; b = NULL; p = NULL;
        [2] b = rev(a);
        [3] p = b->n;
        [4] . . .

    (b) [1] a = <a 4-element list>; b = NULL; c = NULL;
        [2] b = rev(a);
        [3] c = rev(b);
        [4] . . .

Fig. 2. Examples to illustrate one-vocabulary structures, two-vocabulary structures, transformer application, and procedure summaries.

Fig. 3. (a) The (one-vocabulary) structure $S$ that represents a four-element acyclic list that is pointed to by a; (b) the (one-vocabulary) structure $S'$ that represents the list from (a) after the operation “[2] b = rev(a);”; (c) the (one-vocabulary) structure $S''$ that represents the list from (b) after the operation “[3] p = b->n;”.

Fig. 3 depicts three four-element, singly-linked, acyclic lists. The nodes of each graph represent memory cells. An address-valued program variable (“pointer variable”) that points to a given memory cell is represented by an arrow from the variable name to the node for the cell. (A pointer variable whose value is NULL is not shown.) The other arrows in the graph, labeled with n, represent the values of cells’ n-fields. Fig. 3(a), (b), and (c) represent lists that arise just before lines [2], [3], and [4] of Fig. 2(a), respectively.

Two Kinds of Pairing. Figs. 4 and 5 illustrate two different kinds of pairing operations that can be performed on lists:
— Fig. 4(a) depicts a pair of one-vocabulary structures that represent the net transformation from just before line [2] of Fig. 2(a) to just before line [3]; Fig. 4(b) depicts a pair of one-vocabulary structures that represent the net transformation from just before line [2] of Fig. 2(a) to just before line [4].
— Fig. 5(a) depicts a two-vocabulary structure that represents the net transformation from just before line [2] of Fig. 2(a) to just before line [3]; Fig. 5(b) depicts a two-vocabulary structure that represents the net transformation from just before line [2] of Fig. 2(a) to just before line [4].

Fig. 4. Pairs of one-vocabulary structures, (a) $\langle S, S' \rangle$ and (b) $\langle S, S'' \rangle$, that represent (a) the net transformation from just before line [2] of Fig. 2(a) to just before line [3]; (b) the net transformation from just before line [2] of Fig. 2(a) to just before line [4].

Fig. 5. Two-vocabulary structures, (a) $S^{\langle\cdot,{}'\rangle}$ and (b) $S^{\langle\cdot,{}''\rangle}$, that represent (a) the net transformation from just before line [2] of Fig. 2(a) to just before line [3]; (b) the net transformation from just before line [2] of Fig. 2(a) to just before line [4]. (The superscript in each structure’s name indicates what vocabularies are present in the structure; “$\cdot$” stands for “unprimed”.)

A two-vocabulary structure has a single set of memory cells that are structured using two vocabularies. In Fig. 5(a), one vocabulary is $\{a, b, p, n\}$; the second vocabulary is $\{a', b', p', n'\}$. In Fig. 5(b), the two vocabularies are $\{a, b, p, n\}$ and $\{a'', b'', p'', n''\}$ (see footnote 1). (In Fig. 4(a) and (b), we have used single-primed and double-primed vocabularies in the respective second-component structures to emphasize how they correspond to the two-vocabulary structures of Fig. 5(a) and (b). Strictly speaking, these should have been unprimed vocabularies.)

Even though we have drawn the list in the second component of the pair shown in Fig. 4(a) so that each $n'$-edge appears to have been reversed from the $n$-edge in the first component, we have not given names to the nodes, and thus Fig. 4(a) does not contain sufficient information to ensure that each of the original edges has, in fact, been reversed (see footnote 2).

Footnote 1: Variables $b$, $p$, and $p'$ do not appear in Fig. 5(a) because they have the value NULL. Likewise, variables $b$ and $p$ do not appear in Fig. 5(b) because they have the value NULL.
Footnote 2: Although it would be easy to give indelible names to nodes in each concrete list, it will become apparent in §5 that this is not the case for nodes in abstract lists. The discussion in this section is intended to convey, using concrete lists, how we overcome the lack of indelible names for nodes in abstract lists.

In contrast, because there is a unique set of nodes in the two-vocabulary structure of Fig. 5(a), we know that for each $n$-edge there is a corresponding reversed $n'$-edge, and vice versa.

Transformer Application. Let $\tau$ denote the transformation produced by the statement “[3] p = b->n;” in line [3] of Fig. 2(a). Consider three ways of depicting the effect:
— In terms of one-vocabulary structures, the transformation amounts to passing from Fig. 3(b) to Fig. 3(c): $\tau(S') = S''$.
— In terms of pairs of one-vocabulary structures, the transformation amounts to passing from Fig. 4(a) to Fig. 4(b): $\tau(\langle S, S' \rangle) = \langle S, \tau(S') \rangle = \langle S, S'' \rangle$.
— In terms of two-vocabulary structures, the transformation amounts to passing from Fig. 5(a) to Fig. 5(b): $\tau(S^{\langle\cdot,{}'\rangle}) = S^{\langle\cdot,{}''\rangle}$, where the superscript indicates what vocabularies are included in the structure (“$\cdot$” stands for “unprimed”).

Two-Vocabulary Structures as Procedure Summaries. Both (i) a pair of one-vocabulary structures, and (ii) a two-vocabulary structure provide a way to represent the net transformation performed by an operation (or a sequence of operations). However, as illustrated above, in the absence of indelible names for nodes, a two-vocabulary structure can represent information more precisely than a pair of one-vocabulary structures, and thus a two-vocabulary structure can provide a more precise procedure summary than a pair of one-vocabulary structures.

In the remainder of this section, we discuss the code fragment shown in Fig. 2(b). Structure $S_2^{\langle\cdot,{}'\rangle}$ in Fig. 6(a) summarizes the transformation performed by “[2] b = rev(a);”, and structure $S_3^{\langle{}',{}''\rangle}$ in Fig. 6(b) summarizes the transformation performed by “[3] c = rev(b);”.

Fig. 6. The two-vocabulary structures that summarize (a) the transformation performed by “[2] b = rev(a);” (structure $S_2^{\langle\cdot,{}'\rangle}$), and (b) the transformation performed by “[3] c = rev(b);” (structure $S_3^{\langle{}',{}''\rangle}$).

Transformer Composition. The result of composing the transformations represented by two two-vocabulary structures can be expressed as another two-vocabulary structure. For instance, consider the two-vocabulary structure $S_{2;3}^{\langle\cdot,{}''\rangle}$ shown in Fig. 7, which represents the result of composing Fig. 6(b) with Fig. 6(a) to obtain a two-vocabulary structure for the sequence “[2] b = rev(a); [3] c = rev(b);”.

Fig. 7. The two-vocabulary structure $S_{2;3}^{\langle\cdot,{}''\rangle}$ represents the net transformation performed by the sequence “[2] b = rev(a); [3] c = rev(b)”. Note that for each (unprimed) $n$-edge there is a corresponding (double-primed) $n''$-edge, and vice versa.

The composition of the transformations represented by two two-vocabulary structures can be expressed in terms of a meet operation on three-vocabulary structures. To explain this, we introduce the graphical notation of dotted edges to represent unknown information (i.e., with truth value 1/2). For instance, Fig. 8(a) and Fig. 8(b) show two three-vocabulary structures $S_2^{\langle\cdot,{}',1/2{}''\rangle}$ and $S_3^{\langle 1/2,{}',{}''\rangle}$, respectively, where the symbol 1/2 in the superscript of a structure name indicates that the structure has only unknown information for a given vocabulary. Note that $S_2^{\langle\cdot,{}',1/2{}''\rangle}$ and $S_3^{\langle 1/2,{}',{}''\rangle}$ are three-vocabulary structures that correspond to the two-vocabulary structures $S_2^{\langle\cdot,{}'\rangle}$ and $S_3^{\langle{}',{}''\rangle}$ from Fig. 6, respectively.

Fig. 8. Three-vocabulary structures, (a) $S_2^{\langle\cdot,{}',1/2{}''\rangle}$ and (b) $S_3^{\langle 1/2,{}',{}''\rangle}$, for the two-vocabulary structures from Fig. 6. Dotted edges indicate predicate tuples that have the value 1/2 (and hence correspond to information that is unknown). In (a), the unprimed and single-primed vocabularies capture the transformation performed by [2] b = rev(a);, and the information in the double-primed vocabulary (predicates $a''$, $b''$, $c''$, and $n''$) is unknown. In (b), the single-primed and double-primed vocabularies capture the transformation performed by [3] c = rev(b);, and the information in the unprimed vocabulary (predicates $a$, $b$, $c$, and $n$) is unknown.

We introduce the meet operation ($\sqcap$), where “unknown” $\sqcap$ “definite information” yields “definite information” (see footnote 3). With this notation, the composition $S_3^{\langle{}',{}''\rangle} \circ S_2^{\langle\cdot,{}'\rangle}$ of the transformations represented by two two-vocabulary structures $S_2^{\langle\cdot,{}'\rangle}$ and $S_3^{\langle{}',{}''\rangle}$ can be expressed in terms of three-vocabulary structures as

\[
S_3^{\langle{}',{}''\rangle} \circ S_2^{\langle\cdot,{}'\rangle}
= \mathit{project}_{1,3}\bigl(S_3^{\langle 1/2,{}',{}''\rangle} \sqcap S_2^{\langle\cdot,{}',1/2{}''\rangle}\bigr)
= S_{2;3}^{\langle\cdot,{}''\rangle}.
\]

The three-vocabulary structure $S_{2;3}^{\langle\cdot,{}',{}''\rangle}$ obtained from $S_3^{\langle 1/2,{}',{}''\rangle} \sqcap S_2^{\langle\cdot,{}',1/2{}''\rangle}$ is shown in Fig. 9. Finally, by projecting away the “middle” (single-primed) vocabulary from $S_{2;3}^{\langle\cdot,{}',{}''\rangle}$, we obtain the two-vocabulary composition result $S_{2;3}^{\langle\cdot,{}''\rangle}$ shown in Fig. 7.

Footnote 3: “Definite information” means “definitely present” (true, denoted by 1) or “definitely absent” (false, denoted by 0). Thus, $1/2 \sqcap 1 = 1 = 1 \sqcap 1/2$ and $1/2 \sqcap 0 = 0 = 0 \sqcap 1/2$.

Fig. 9. The three-vocabulary structure $S_{2;3}^{\langle\cdot,{}',{}''\rangle}$ obtained from the meet ($\sqcap$) of the two structures from Fig. 8: $S_3^{\langle 1/2,{}',{}''\rangle} \sqcap S_2^{\langle\cdot,{}',1/2{}''\rangle}$. Note that for each (unprimed) $n$-edge there is a corresponding (double-primed) $n''$-edge, and vice versa.

How These Ideas are Used in Relational Shape Analysis. In §5, we introduce a way to use 3-valued structures as abstractions of sets of 2-valued structures. In §6, this is extended to using two-vocabulary 3-valued structures as abstractions of transformations on 2-valued structures. This provides what is needed for a compositional approach to shape analysis:
— the 3-valued analog of the two-vocabulary version of transformer application can be used for intraprocedural propagation;
— the 3-valued analog of transformer composition can be used for interprocedural propagation.
Two-vocabulary 3-valued structures are used as summary transformers for the shape transformations performed by the possible sequences of operations in each procedure, and an over-approximation of composition can be performed using the meet operation on three-vocabulary 3-valued structures. In particular, it is possible to perform an over-approximation of the composition of the transformations represented by two two-vocabulary 3-valued structures, $(S^\#)^{\langle\cdot,{}'\rangle}$ and $(T^\#)^{\langle{}',{}''\rangle}$, by (i) promoting them to three-vocabulary 3-valued structures $(S^\#)^{\langle\cdot,{}',1/2{}''\rangle}$ and $(T^\#)^{\langle 1/2,{}',{}''\rangle}$, (ii) taking their meet, and (iii) projecting away the middle vocabulary. (See §6.5.)

3. PRELIMINARIES

3.1 2-Valued First-Order Logic

We briefly discuss definitions related to first-order logic. We assume a vocabulary $\mathcal{P}$ of predicate symbols and a set of variables, usually denoted by $v, v_1, \ldots$. Formulas are defined by the syntax:

\[
\begin{array}{rcll}
\varphi & ::= & 1 & \text{logical literal} \\
 & \mid & p(v_1, \ldots, v_k) & \text{where $p$ is a predicate symbol of arity $k$} \\
 & \mid & \neg\varphi \;\mid\; \varphi \lor \varphi & \text{logical connectives} \\
 & \mid & \exists v : \varphi & \text{existential quantification}
\end{array}
\qquad (1)
\]

For reasons that will be made explicit in the next paragraph, we do not include the formula $v_1 = v_2$ in the grammar itself. Instead, we assume that the vocabulary $\mathcal{P}$ contains a special predicate symbol $eq$ of arity 2 that will have a special interpretation. We will write $v_1 = v_2$ and $v_1 \neq v_2$ for $eq(v_1, v_2)$ and $\neg eq(v_1, v_2)$. The literal 0, the connectives $\Rightarrow$ and $\land$, and the quantifier $\forall v$ are defined in the usual way, in terms of items in grammar (1). A conditional expression $\varphi_1 \,?\, \varphi_2 : \varphi_3$ is an abbreviation for $(\varphi_1 \land \varphi_2) \lor (\neg\varphi_1 \land \varphi_3)$. The notion of free variables is defined in the usual way.

The set $\{0, 1\}$ of (2-valued) truth values is denoted by $\mathbb{B}$. A 2-valued logical structure $S = (U, \iota)$ is a pair, where the universe $U$ is a set of individuals and the valuation $\iota : \mathcal{P} \to \bigcup_{k \geq 0} (U^k \to \mathbb{B})$ maps each predicate symbol of arity $k$ to a predicate (or truth-valued function). The set of 2-valued structures over a vocabulary $\mathcal{P}$ is denoted by 2-STRUCT[$\mathcal{P}$]. We assume that for any $(U, \iota) \in$ 2-STRUCT[$\mathcal{P}$], $\iota(eq)$ is defined by $\iota(eq)(u_1, u_2) = (u_1 = u_2)$.

An assignment $Z : \{v_1, \ldots, v_k\} \to U$ maps free variables (implicitly with respect to a formula) to individuals. Given a 2-valued logical structure $S = (U, \iota)$ and an assignment $Z$ of free variables, the (2-valued) meaning of a formula $\varphi$, denoted by $[\![\varphi]\!]^S(Z)$, is defined in Tab. I by induction on the syntax of $\varphi$. A logical structure satisfies a closed formula $\varphi$ (i.e., without free variables), denoted by $S \models \varphi$, iff $[\![\varphi]\!]^S = 1$. For open formulas, satisfaction with respect to assignment $Z$ is defined by $S, Z \models \varphi$, iff $[\![\varphi]\!]^S(Z) = 1$.

\[
\begin{array}{rcl}
[\![1]\!]^S(Z) & = & 1 \\
[\![p(v_1, \ldots, v_k)]\!]^S(Z) & = & \iota(p)(Z(v_1), \ldots, Z(v_k)) \\
[\![\neg\varphi_1]\!]^S(Z) & = & 1 - [\![\varphi_1]\!]^S(Z) \\
[\![\varphi_1 \lor \varphi_2]\!]^S(Z) & = & \max([\![\varphi_1]\!]^S(Z), [\![\varphi_2]\!]^S(Z)) \\
[\![\exists v_1 : \varphi_1]\!]^S(Z) & = & \max_{u \in U} [\![\varphi_1]\!]^S(Z[v_1 \mapsto u])
\end{array}
\]

Table I. Meaning of first-order formulas, given a logical structure $S = (U, \iota)$ and an assignment $Z$.
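On a finite universe, the semantics of Tab. I can be executed directly. The following Python sketch is only an illustration, not part of the paper's formal development: the nested-tuple encoding of formulas, the helper names `meaning`, `And`, `Implies`, and `Forall`, and the three-cell example store are all assumptions made here. Each branch of `meaning` mirrors one row of Tab. I, and the derived connectives are defined in terms of grammar (1), as the text describes.

```python
# Sketch of Tab. I: 2-valued meaning of a formula in S = (U, iota) under assignment Z.
# Formula encoding (an assumption of this sketch), as nested tuples:
#   ("true",) | ("pred", name, (v1, ..., vk)) | ("not", f) | ("or", f, g) | ("exists", v, f)

def meaning(phi, U, iota, Z):
    """Return the truth value (0 or 1) of phi in (U, iota) under assignment Z."""
    tag = phi[0]
    if tag == "true":
        return 1
    if tag == "pred":
        _, name, variables = phi
        return iota[name](*(Z[v] for v in variables))
    if tag == "not":
        return 1 - meaning(phi[1], U, iota, Z)
    if tag == "or":
        return max(meaning(phi[1], U, iota, Z), meaning(phi[2], U, iota, Z))
    if tag == "exists":
        _, v, body = phi
        # max over all individuals; an empty universe yields 0 in this sketch
        return max((meaning(body, U, iota, {**Z, v: u}) for u in U), default=0)
    raise ValueError(f"unknown formula tag: {tag}")

# Derived connectives, defined in terms of the items of grammar (1).
def And(f, g):
    return ("not", ("or", ("not", f), ("not", g)))

def Implies(f, g):
    return ("or", ("not", f), g)

def Forall(v, f):
    return ("not", ("exists", v, ("not", f)))

# Hypothetical example store: a 3-cell list u1 -> u2 -> u3 whose head is pointed to by x.
U = ["u1", "u2", "u3"]
edges = {("u1", "u2"), ("u2", "u3")}
iota = {
    "x": lambda u: int(u == "u1"),
    "n": lambda a, b: int((a, b) in edges),
    "eq": lambda a, b: int(a == b),   # eq is interpreted as identity of individuals
}

def n(a, b):
    return ("pred", "n", (a, b))

# Closed formula: the cell pointed to by x has an n-successor.
has_succ = ("exists", "v1", And(("pred", "x", ("v1",)), ("exists", "v2", n("v1", "v2"))))
# Closed formula: n is a partial function (at most one successor per cell).
functional = Forall("v", Forall("a", Forall("b",
    Implies(And(n("v", "a"), n("v", "b")), ("pred", "eq", ("a", "b"))))))
# Open formula with free variable v: some cell's n-field points to v.
has_pred = ("exists", "w", n("w", "v"))

print(meaning(has_succ, U, iota, {}), meaning(functional, U, iota, {}))
print(meaning(has_pred, U, iota, {"v": "u1"}), meaning(has_pred, U, iota, {"v": "u2"}))
```

Representing truth values as the integers 0 and 1 lets negation and disjunction be written as `1 - x` and `max`, exactly as in Tab. I.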
3.2 3-Valued First-Order Logic

We now extend the definitions from §3.1 to 3-valued logic, in which a third truth value, denoted by 1/2, represents uncertainty. The set $\mathbb{B} \cup \{1/2\}$ of 3-valued truth values is denoted by $\mathbb{T}$, and is partially ordered by the order $l \sqsubset 1/2$ for $l \in \mathbb{B}$. A 3-valued logical structure $S = (U, \iota)$ is almost identical to a 2-valued structure, except for the fact that $\iota : \mathcal{P} \to \bigcup_{k \geq 0} (U^k \to \mathbb{T})$ maps each predicate symbol of arity $k$ to a 3-valued truth-valued function.

The syntax of formulas defined in Eqn. (1) is extended with the logical literal 1/2, which is given the meaning $[\![1/2]\!]^S = 1/2$. The meaning of other syntactic constructs is still defined by Tab. I. Note that the operations “$-$” and “max” can accept the value 1/2 as an operand.

A 3-valued logical structure potentially satisfies a closed (3-valued) formula $\varphi$, denoted by $S \models \varphi$, iff $[\![\varphi]\!]^S \in \{1/2, 1\}$. For open formulas, we have $S, Z \models \varphi$, iff $[\![\varphi]\!]^S(Z) \in \{1/2, 1\}$.

We refer to [Sagiv et al. 2002] for the extension of first-order 2- and 3-valued logic with transitive closure, which we have omitted here for the sake of simplicity. The transitive closure of a formula with two free variables $\varphi(v_1, v_2)$ is denoted by $\varphi^*(v_1, v_2)$.

Embedding of 3-Valued Logical Structures. To abstract memory configurations represented by logical structures, we use the following notion of embedding:

Definition 3.1. Given two 3-valued structures $S = (U, \iota)$ and $S' = (U', \iota')$ over the same vocabulary $\mathcal{P}$, and a surjective function $f : U \to U'$, $f$ embeds $S$ in $S'$, denoted by $S \sqsubseteq^f S'$, if for all $p \in \mathcal{P}$ and $u_1, \ldots, u_k \in U$,

\[
\iota(p)(u_1, \ldots, u_k) \sqsubseteq \iota'(p)(f(u_1), \ldots, f(u_k)).
\]

If, in addition,

\[
\iota'(p)(u'_1, \ldots, u'_k) = \bigsqcup_{u_1 \in f^{-1}(u'_1), \ldots, u_k \in f^{-1}(u'_k)} \iota(p)(u_1, \ldots, u_k),
\]

then $S'$ is the tight embedding of $S$ with respect to $f$, denoted by $S' = f(S)$.
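Definition 3.1 can be exercised on a small example. The Python sketch below is an illustration under stated assumptions, not the paper's implementation: truth values are encoded as 0, 0.5, and 1, a valuation is a dictionary keyed by (predicate name, tuple of individuals), and the names `leq`, `join`, `tight_embedding`, and `embeds`, the four-cell list, and the merging function `f` are all hypothetical choices made here. It computes the tight embedding by joining predicate values over preimages, and then checks the pointwise condition $S \sqsubseteq^f S'$.

```python
from itertools import product

HALF = 0.5  # encoding of the third truth value 1/2 (an assumption of this sketch)

def leq(a, b):
    """Information order on truth values: a is below b iff a == b or b == 1/2."""
    return a == b or b == HALF

def join(values):
    """Least upper bound in the information order: agreeing values stay definite, else 1/2."""
    vals = set(values)
    return vals.pop() if len(vals) == 1 else HALF

def tight_embedding(U, iota, arity, f):
    """Compute f(S): merge individuals along f and join predicate values over preimages."""
    U2 = sorted(set(f.values()))
    pre = {u2: [u for u in U if f[u] == u2] for u2 in U2}
    iota2 = {}
    for p, k in arity.items():
        for tup2 in product(U2, repeat=k):
            iota2[p, tup2] = join(iota[p, tup]
                                  for tup in product(*(pre[u2] for u2 in tup2)))
    return U2, iota2

def embeds(U, iota, iota2, arity, f):
    """Check the first condition of Definition 3.1 pointwise."""
    return all(leq(iota[p, tup], iota2[p, tuple(f[u] for u in tup)])
               for p, k in arity.items() for tup in product(U, repeat=k))

# Hypothetical concrete store: a 4-cell list u1 -> u2 -> u3 -> u4 pointed to by x.
U = ["u1", "u2", "u3", "u4"]
arity = {"x": 1, "n": 2, "eq": 2}
edges = {("u1", "u2"), ("u2", "u3"), ("u3", "u4")}
iota = {}
for u in U:
    iota["x", (u,)] = int(u == "u1")
for a, b in product(U, repeat=2):
    iota["n", (a, b)] = int((a, b) in edges)
    iota["eq", (a, b)] = int(a == b)

# Merge every cell that x does not point to into a single individual "rest".
f = {"u1": "head", "u2": "rest", "u3": "rest", "u4": "rest"}
U2, iota2 = tight_embedding(U, iota, arity, f)

print(iota2["n", ("head", "rest")], iota2["n", ("rest", "head")])
print(iota2["eq", ("head", "head")], iota2["eq", ("rest", "rest")])
print(embeds(U, iota, iota2, arity, f))
```

In this example the merged individual `rest` stands for three cells, so predicate values that differ across those cells are joined to 1/2, while values on which all of them agree remain definite.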
Intuitively, $f(S)$ is obtained by merging individuals of $S$ and by defining the valuation of predicates accordingly (in the most precise way). Observe that $\sqsubseteq^{id}$, which will be denoted simply by $\sqsubseteq$, is the natural information order between structures that share the same universe. Note that one has $S \sqsubseteq^f S' \Leftrightarrow f(S) \sqsubseteq^{id} S'$.

We can now explain the usefulness of the $eq$ predicate. Let $S = (U, \iota) \in$ 2-STRUCT and $S' = (U', \iota') = f(S)$. We have

$$\iota'(eq)(u'_1, u'_2) = \begin{cases}
1 & \text{if } \forall u_1 \in f^{-1}(u'_1), \forall u_2 \in f^{-1}(u'_2) : \iota(eq)(u_1, u_2) = 1 \\
0 & \text{if } \forall u_1 \in f^{-1}(u'_1), \forall u_2 \in f^{-1}(u'_2) : \iota(eq)(u_1, u_2) = 0 \\
1/2 & \text{otherwise}
\end{cases}$$

which can be simplified to

$$\iota'(eq)(u'_1, u'_2) = \begin{cases}
1 & \text{if } u'_1 = u'_2 \wedge |f^{-1}(u'_1)| = 1 \\
0 & \text{if } u'_1 \neq u'_2 \\
1/2 & \text{if } u'_1 = u'_2 \wedge |f^{-1}(u'_1)| > 1
\end{cases}$$

Note that $u'_1 = u'_2$ in the simplified definition is not a shorthand for $eq(u'_1, u'_2)$; it evaluates to true whenever $u'_1$ and $u'_2$ are the same individual of $U'$. Similarly, $u'_1 \neq u'_2$ evaluates to true when $u'_1$ and $u'_2$ are distinct individuals of $U'$.

Hence, for any $S'' = (U'', \iota'') \sqsupseteq^f S$, if for some $u'' \in U''$ we have $\iota''(eq)(u'', u'') = 1$, then $|f^{-1}(u'')| = 1$; otherwise $|f^{-1}(u'')| \geq 1$. Consequently, the value of the formula $eq(v, v)$ evaluated in a 3-valued structure $S''$ indicates whether an individual of $S''$ represents exactly one individual in each of the structures $S$ that can be embedded into $S''$, or at least one individual.

[Fig. 10. Interprocedural CFG of the list-reversal program.]

The following preservation theorem about the interpretation of logical formulas allows logical formulas to be interpreted in embedded structures in a way that is conservative with respect to the original structure.
Theorem 3.2 (Embedding theorem [Sagiv et al. 2002]). Let $S = (U, \iota)$ and $S' = (U', \iota')$ be two 3-valued structures, such that there exists an embedding function $f$ with $S \sqsubseteq^f S'$. Then, for any formula $\varphi(v_1, \ldots, v_k)$ and assignment $Z : \{v_1, \ldots, v_k\} \to U$ of free variables of $\varphi$, we have

$$[\![\varphi]\!]_3^{S}(Z) \sqsubseteq [\![\varphi]\!]_3^{S'}(Z'),$$

where $Z' : \{v_1, \ldots, v_k\} \to U'$ is the abstract assignment defined by $Z'(v_i) = f(Z(v_i))$.

4. PROGRAMS AND MEMORY CONFIGURATIONS

We consider programs written in an imperative programming language in which
(1) it is forbidden to take the address of a local variable, a global variable, a parameter, or a function;
(2) parameters are passed by value;
(3) pointer arithmetic is forbidden.

These restrictions prevent direct aliasing among variables; thus, only nodes in heap-allocated structures can be aliased. The third feature makes memory configurations invariant under permutations of addresses. Note that both Java and ML follow these conventions.

4.1 Program Syntax

A program is defined by a set of procedures $P_i$, $0 \leq i \leq K$. Each procedure has local variables, formal input parameters, and output parameters. To simplify our notation, we will assume that each procedure has only one input parameter and one output parameter; the generalization to multiple parameters is straightforward. We also assume that an input parameter is not modified during the execution of the procedure. This assumption is made solely for convenience, and involves no loss of generality because it is always possible to copy input parameters to additional local variables.

Thus, a procedure $P_i = \langle fpi_i, fpo_i, L_i, G_i \rangle$ is defined by its input parameter $fpi_i$, its output parameter $fpo_i$, its set of local variables $L_i$ (containing $fpi_i$ and $fpo_i$), and $G_i$, its intraprocedural control flow graph (CFG). A program is represented by a directed graph $G^* = (N^*, E^*)$, called an interprocedural CFG. $G^*$ consists of a collection of intraprocedural CFGs $G_1, G_2, \ldots, G_K$, one of which, $G_{main}$, represents the program's main procedure. Each CFG $G_i$ contains exactly one start node $s_i$ and exactly one exit node $e_i$. The nodes of a CFG represent control points and its edges represent individual statements and branches of a procedure in the usual way. A procedure call statement relates a call node and a return-site node. For $n \in N^*$, $proc(n)$ denotes the (index of the) procedure that contains $n$. In addition to the ordinary intraprocedural edges that connect the nodes of the individual flowgraphs in $G^*$, each procedure call, represented by call-node $c$ and return-site node $r$, has two edges:
(1) a call-to-start edge from $c$ to the start node of the called procedure;
(2) an exit-to-return-site edge from the exit node of the called procedure to $r$.

The functions $call$ and $ret$ record matching call and return-site nodes: $call(r) = c$ and $ret(c) = r$. We assume that a start node has no incoming edges except call-to-start edges.

                    | Set of cells   | Pointer variable z                      | Pointer field n                                        | Data variable x                | Data field d
Set-theoretic view  | $Cell$         | $z \in Cell \cup \{NULL\}$              | $n \in Cell \to Cell \cup \{NULL\}$                    | $x \in D$                      | $d \in Cell \to D$
Logical view        | $U$ (universe) | $z : U \to \mathbb{B}$ (unary relation) | $n : U \times U \to \mathbb{B}$ (binary relation)      | $x : D$ (nullary function)     | $d : U \to D$ (unary function)

Table II. Two related models of a program state, where $D$ may be $\mathbb{B}$ or Int.

[Fig. 11. A possible store, consisting of a four-node linked list pointed to by x and y.]

4.2 Representing Memory Configurations

Consider a program that consists of several procedures, and, for the moment, ignore the stack of activation records in each state. At a given control point, a program state $s \in State$ is defined by the values of the local variables and the heap.
We describe two ways in which such a state $s$ can be modeled (see Tab. II):

- The set-theoretic model is perhaps more intuitive. We consider a fixed set $Cell$ of memory cells. The value of a pointer variable z is modeled by an element $z \in Cell \cup \{NULL\}$, where NULL denotes the null value. If cells have a pointer-valued field n, the values of n fields are modeled by a function $n : Cell \to Cell \cup \{NULL\}$ that associates with each memory cell the cell pointed to by the field. If cells have an Int-valued (or, more generally, a data-valued) field d, the values of d fields are modeled by a function $d : Cell \to$ Int that associates with each memory cell the value of the corresponding field.

- Sagiv et al. [2002] model states using the tools of logic (cf. §3.1). Each state is modeled as a 2-valued logical structure: the set of memory cells is replaced by a universe $U$ of individuals; the value of a program variable z is defined by a unary predicate on $U$; and the value of a field n is defined by a binary predicate on $U$. Integrity constraints are used to capture the fact that, for instance, a unary predicate $z$ that represents what program variable z points to can have the value "true" for at most one memory cell [Sagiv et al. 2002].

We use the term "predicate of arity $n$" for a Boolean function $U^n \to \mathbb{B}$. We use $\mathcal{P}_n$ to denote the set of predicate symbols of arity $n$, and $\mathcal{N}$ to denote the set of integer-valued function symbols. With such notation, the concrete state-space considered is (see footnote 4):

$$State = (U \to \mathbb{B})^{|\mathcal{P}_1|} \times (U^2 \to \mathbb{B})^{|\mathcal{P}_2|} \times (U \to \mathrm{Int})^{|\mathcal{N}|} \qquad (2)$$

where $|E|$ denotes the size of a finite set $E$. A concrete property in $\wp(State)$ is thus a set of tuples, each field of which is a function.

From now on, for the sake of simplicity, we will first perform the trivial abstraction of the concrete state space defined by Eqn.
(2) to the state-space

$$State = (U \to \mathbb{B})^{|\mathcal{P}_1|} \times (U^2 \to \mathbb{B})^{|\mathcal{P}_2|} \qquad (3)$$

In this case, a state $S \in State$ can be represented by a 2-valued logical structure $(U, \iota)$ (§3.1), where the valuation function $\iota : \mathcal{P} \to \bigcup_k (U^k \to \mathbb{B})$ associates each predicate symbol of arity $k$ with a $k$-ary relation over $U$. We thus have $State \simeq$ 2-STRUCT[$\mathcal{P}$].

In the sequel, we also assume that the universe $U$ is infinite. Because all infinite countable sets are isomorphic, we can omit the universe in declarations of 2-valued structures $S = (U, \iota) \in$ 2-STRUCT[$\mathcal{P}$], so that $S$ will denote both the 2-valued structure and its valuation function $\iota$.

Remark 4.1. Because we want shape properties to be invariant under permutations of memory cells, we implicitly quotient $State$ by the equivalence relation $S \approx S'$ if there is a permutation $f : U \to U$ such that

$$\forall p \in \mathcal{P} : S'(p)(u_1, \ldots, u_k) = S(p)(f(u_1), \ldots, f(u_k)) \qquad \Box$$

The predicates that are part of the underlying semantics of the language to be analyzed are called core predicates. They will be distinguished from additional predicates that will be introduced later when abstracting concrete heaps. The set of core predicates that are used is dictated by the semantics of the programming language to be analyzed. (The programming language can have a degree of abstraction already built into it by the analysis designer, as illustrated by Remarks 4.2 and 4.3 below.) For the programs that we consider, and the part of the state-space that we chose to analyze (Eqn. (3)), we need to introduce a core predicate for each program variable and data-structure field, following Tab. II. The set of core predicates is thus uniquely defined for a given program.

Footnote 4: Eqn. (2) is the concrete state-space that one has when the techniques of [Sagiv et al. 2002] are combined with those of [Gopan et al. 2004]. To simplify Eqn.
(2), we have omitted nullary predicates, which would be used to model Boolean-valued variables, and nullary functions, which would be used to model data-valued variables.

Remark 4.2. (Modeling dynamic memory allocation) The free memory pool required for dynamic memory allocation and deallocation is modeled using a core predicate $free(v)$, which has the value true for the unbounded number of nodes modeling free memory cells. $\Box$

Remark 4.3. (Modeling ordering among cells' data values) In some experiments of §8.2, we model lists and trees that are ordered with respect to integer keys. However, according to Eqn. (3), we abstract integer values and we cannot compare such keys directly. Instead, we introduce a special core predicate $leq(v_1, v_2)$, which (i) is a total order, and (ii) has the value true on $(v_1, v_2)$ whenever the key of cell $v_1$ is less than or equal to the key of cell $v_2$. This core predicate can be seen as an abstraction of the predicate cell1->key <= cell2->key when the state-space of Eqn. (2) is abstracted into the state-space of Eqn. (3). $\Box$

4.3 Semantics of Intraprocedural Operations

The usefulness of adopting the logical view for modeling memory becomes apparent when defining the semantics of instructions. This is because one can use the language of first-order logic for specifying how predicates (and hence logical structures and memory configurations) are transformed by the program's operations. In this section, we only discuss intraprocedural operations; the problem of defining the semantics of interprocedural operations is left to §7.1.

Generally speaking, the concrete operational semantics of a programming language is defined by specifying a state transformer for each kind of operation associated with intraprocedural edges of the control-flow graph.
Among the operations, we distinguish statements, which modify the program state, from conditions, which select the program states that satisfy them. The semantics of a statement stm is a transformer with signature $[\![stm]\!] : State \to State$; the semantics of a condition cond is a predicate $[\![cond]\!] : State \to \mathbb{B}$, which can be lifted to a transformer with signature $[\![cond]\!] : \wp(State) \to \wp(State)$ that filters out the states not satisfying the condition.

4.3.1 Statements. The transformer of a statement stm acts on states modeled as logical structures. It is defined using a collection of predicate-update formulas, $c(v_1, \ldots, v_k) = \varphi^c_{stm}(v_1, \ldots, v_k)$, one for each core predicate $c$ (see [Sagiv et al. 2002]). These formulas define how the core predicates of a logical structure $S$ are transformed by the statement stm to create a logical structure $S'$; they define the value of predicate $c$ in $S'$ as a function of $c$'s value in $S$. Formally,

$$[\![stm]\!] : State \to State, \quad S \mapsto S' \quad \text{where } \forall c \in \mathcal{P} : S'(c)(u_1, \ldots, u_k) = [\![\varphi^c_{stm}(v_1, \ldots, v_k)]\!]^S([v_1 \mapsto u_1, \ldots, v_k \mapsto u_k]) \qquad (4)$$

For instance, the semantics of the assignment statement z->n = NULL; is specified by the predicate-update formulas

$$\varphi^n_{stm}(v_1, v_2) = n(v_1, v_2) \wedge \neg z(v_1), \qquad \varphi^c_{stm}(v_1, \ldots, v_k) = c(v_1, \ldots, v_k) \text{ for } c \neq n$$

The predicate-update formula $\varphi^n_{stm}$ should be read as follows: "If the cell $v_1$ is not pointed to by the variable z, leave the n field of the cell $v_1$ unchanged, otherwise assign it the value NULL (represented by $n(v_1, v_2) = false$ for every cell $v_2$)."

We assume that the statements of the analyzed program are decomposed into the elementary statements listed in Tab. III (which is always possible for the class of languages considered in this paper). The elementary statements modify the value of at most one core predicate. We omit writing explicit predicate-update formulas for predicates that are unchanged by a statement. (The omitted formulas merely express the identity transformation.)

Statement                                      | Predicate-update formula
z = NULL                                       | $\iota^z(v) = 0$
z = y                                          | $\iota^z(v) = y(v)$
z = y->sel                                     | $\iota^z(v) = \exists v_1 : y(v_1) \wedge sel(v_1, v)$
z->sel = NULL                                  | $\iota^{sel}(v_1, v_2) = sel(v_1, v_2) \wedge \neg z(v_1)$
z->sel = y (assuming that z->sel = NULL)       | $\iota^{sel}(v_1, v_2) = sel(v_1, v_2) \vee (z(v_1) \wedge y(v_2))$

Table III. Predicate-update formulas for statements.

4.3.2 Conditions. The semantics of a condition cond is defined by a precondition formula $\varphi_{cond}$, which is a nullary formula that filters out structures that should not follow the transition along edges $e$ labeled by the condition. Formally,

$$[\![\varphi_{cond}]\!] : \wp(State) \to \wp(State), \quad X \mapsto X' \subseteq X \quad \text{where } X' = \{S \in X \mid S \models \varphi_{cond}\} \qquad (5)$$

For instance, the semantics of the condition z->n != NULL is given by the precondition formula $\exists v_1, v_2 : z(v_1) \wedge n(v_1, v_2)$, which evaluates to false on logical structures for which the n field of the cell pointed to by z (if any) is equal to NULL. Tab. IV gives the complete semantics of conditions. Program assumptions, such as z != NULL at the point of a dereference of z, are checked by the analysis using the "halt" instruction of the TVLA system [Lev-Ami and Sagiv 2000], which generates an alert when a program assumption is not satisfied.

Condition                                        | Precondition formula
z == NULL                                        | $\forall v : \neg z(v)$
z != NULL                                        | $\exists v : z(v)$
z1 == z2                                         | $\forall v : z1(v) \Leftrightarrow z2(v)$
z1 != z2                                         | $\exists v : \neg(z1(v) \Leftrightarrow z2(v))$
z->sel == NULL (assuming that z != NULL)         | $\forall v_1, v_2 : z(v_1) \Rightarrow \neg sel(v_1, v_2)$
z->sel != NULL                                   | $\exists v_1, v_2 : z(v_1) \wedge sel(v_1, v_2)$
z1->sel == z2 (assuming that z1 != NULL)         | $\forall v_1, v_2 : z1(v_1) \Rightarrow (sel(v_1, v_2) \Leftrightarrow z2(v_2))$
z1->sel != z2                                    | $\exists v_1, v_2 : z1(v_1) \wedge \neg(sel(v_1, v_2) \Leftrightarrow z2(v_2))$

Table IV. Precondition formulas for conditions.

4.3.3 Memory allocation and deallocation. Remark 4.2 introduced the predicate $free(v)$ for modeling the free memory pool.
The semantics of a memory deallocation instruction dealloc(z) is defined using the predicate-update formulas $\tau^z(v) = 0$ and $\tau^{free}(v) = free(v) \vee z(v)$. Intuitively, the semantics of a memory allocation instruction z = alloc() is to pick a node $v_0$ with $free(v_0) = 1$ at random, and then update $free(v)$ and $z(v)$ using the predicate-update formulas $\tau^{free}(v) = free(v) \wedge \neg eq(v, v_0)$ and $\tau^z(v) = eq(v, v_0)$ (see footnote 5).

5. ABSTRACTING MEMORY CONFIGURATIONS

In this section, we discuss the abstraction method developed by Sagiv et al. [2002], which maps 2-valued logical structures (of arbitrary size) to 3-valued logical structures of bounded size. The problem with representing and manipulating 2-valued structures is the unbounded universe $U$. Consequently, the starting point for abstracting a 2-valued structure is the abstraction of the universe $U$ to an abstract universe $U^\sharp$ of bounded size. Intuitively, the abstraction consists of (i) merging concrete individuals into a bounded number of abstract individuals $U^\sharp$, and (ii) replacing the concrete predicates by abstract versions in which the values of the tuples reflect how concrete individuals have been merged to create the abstract individuals.

5.1 The Abstraction Principle

Given a finite set $U^\sharp$ with a surjective function $f : U \to U^\sharp$, one can define the following Galois connection, using the tight embedding on logical structures induced by $f$ and the partial order defined on 3-valued structures (see Defn. 3.1):

$$\wp(\text{2-STRUCT}) \;\underset{\alpha_f}{\overset{\gamma_f}{\leftrightarrows}}\; \text{3-STRUCT}, \qquad \alpha_f(X) = \bigsqcup_{S \in X} f(S), \qquad \gamma_f(S^\sharp) = \{S \mid S \sqsubseteq^f S^\sharp\}$$

In this abstraction, sets of valuations for predicate symbols $\iota : \mathcal{P} \to \bigcup_k (U^k \to \mathbb{B})$ are abstracted with a single abstract valuation $\iota^\sharp : \mathcal{P} \to \bigcup_k ((U^\sharp)^k \to \mathbb{T})$.

5.2 The Abstract Domain of 3-Valued Structures

The abstraction principle described above is parameterized by a finite abstraction of the universe $U$ of 2-valued structures.
The idea behind canonical abstraction [Sagiv et al. 2002] is to choose a subset $\mathcal{A} \subseteq \mathcal{P}_1$ of abstraction predicates and to define an equivalence relation $\simeq^S_{\mathcal{A}}$ on $U$ that is parameterized by the logical structure $S \in$ 2-STRUCT to be abstracted:

$$u_1 \simeq^S_{\mathcal{A}} u_2 \;\Leftrightarrow\; \forall p \in \mathcal{A} : S(p)(u_1) = S(p)(u_2)$$

Footnote 5: Unfortunately, "picking a node randomly" cannot be easily expressed in 2-valued logic, so we define it directly in 3-valued logic using the special operator Focus that will be introduced in §5. (To conserve space, we do not give the precise definition here.) An alternative would have been to employ a concrete model of the free memory pool, e.g., using a singly-linked list, but this would have increased the complexity of the summaries of procedures that perform allocation and deallocation.

[Fig. 12. Graphical representation of logical structures that represent memory configurations. (a) A 2-valued structure $S$ that represents a singly-linked list; (b) the canonical abstraction of $S$ with $\mathcal{A} = \{x\}$. Unary predicates associated with pointer variables (e.g., x) are depicted with arrows. The other unary predicates (e.g., r[n, x]) are depicted inside nodes for which they evaluate to true. (The meaning of r[n, x] will be explained in §5.2.1; see also Tab. V.) Binary predicates (e.g., n) are depicted using arrows linking the two arguments. Solid arrows denote the value 1, dashed arrows denote the value 1/2. Summary nodes (for which eq = 1/2) are depicted using double ovals.]

This equivalence relation defines the surjective function $f^S_{\mathcal{A}} : U \to U/{\simeq^S_{\mathcal{A}}}$ that maps an individual to its equivalence class.
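The equivalence relation and the induced merging of individuals can be sketched for a finite structure as follows. This is a minimal illustration under assumed data conventions (unary predicates as dictionaries, binary predicates as sets of pairs, 0.5 standing for 1/2); the function name `canonical_abstraction` and the four-cell example are hypothetical and are not taken from the TVLA implementation.

```python
HALF = 0.5  # encoding of the indefinite truth value 1/2 (an assumption)


def join(values):
    """Least upper bound in the information order: 0 and 1 lie below 1/2."""
    distinct = set(values)
    return distinct.pop() if len(distinct) == 1 else HALF


def canonical_abstraction(universe, unary, binary, abstraction_preds):
    """Merge individuals that agree on every abstraction predicate."""
    # Canonical name of an individual: its vector of abstraction-predicate
    # values. Two individuals are equivalent iff their names coincide.
    classes = {}
    for u in universe:
        name = tuple(unary[p][u] for p in abstraction_preds)
        classes.setdefault(name, []).append(u)

    abs_unary = {
        p: {c: join(table[u] for u in members)
            for c, members in classes.items()}
        for p, table in unary.items()
    }
    abs_binary = {
        p: {(c1, c2): join(int((u1, u2) in pairs) for u1 in m1 for u2 in m2)
            for c1, m1 in classes.items() for c2, m2 in classes.items()}
        for p, pairs in binary.items()
    }
    # eq(c, c) = 1/2 exactly when c is a summary node (several individuals).
    abs_eq = {
        (c1, c2): join(int(u1 == u2) for u1 in m1 for u2 in m2)
        for c1, m1 in classes.items() for c2, m2 in classes.items()
    }
    return classes, abs_unary, abs_binary, abs_eq


# Hypothetical four-cell list u1 -> u2 -> u3 -> u4, with x pointing to u1.
universe = ["u1", "u2", "u3", "u4"]
unary = {"x": {"u1": 1, "u2": 0, "u3": 0, "u4": 0}}
binary = {"n": {("u1", "u2"), ("u2", "u3"), ("u3", "u4")}}

classes, abs_unary, abs_binary, abs_eq = canonical_abstraction(
    universe, unary, binary, abstraction_preds=["x"])

head, rest = (1,), (0,)
print(classes)                         # {(1,): ['u1'], (0,): ['u2', 'u3', 'u4']}
print(abs_binary["n"][(head, rest)])   # 0.5
print(abs_eq[(rest, rest)])            # 0.5, so `rest` is a summary node
```

With `x` as the only abstraction predicate, the universe collapses to two abstract individuals regardless of the length of the list, which is what bounds the size of the abstract structure.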
We thus have the Galois connection

$$\wp(State) = \wp(\text{2-STRUCT}[\mathcal{P}]) \;\underset{\alpha}{\overset{\gamma}{\leftrightarrows}}\; \wp(\text{3-STRUCT}[\mathcal{P}]) = A$$

$$\alpha(X) = \{f^S_{\mathcal{A}}(S) \mid S \in X\}, \qquad \gamma(Y) = \{S \mid S^\sharp \in Y \wedge S \sqsubseteq^f S^\sharp\}$$

where $f^S_{\mathcal{A}}$ is the tight embedding function for logical structures $S$ induced by $f^S_{\mathcal{A}} : U \to U/{\simeq^S_{\mathcal{A}}}$.

The abstraction function $\alpha$ is referred to as canonical abstraction. It defines the canonical 3-valued structures as those that are the image of canonical abstraction. Fig. 12 illustrates the abstraction of a singly-linked list using the predicate $x$ as the unique abstraction predicate. The ordering in $A$ extends the ordering between 3-valued structures as follows: $Y_1 \sqsubseteq Y_2$ iff $\forall S_1^\sharp \in Y_1 : \exists S_2^\sharp \in Y_2 : S_1^\sharp \sqsubseteq S_2^\sharp$.

Thanks to the Embedding Theorem (Thm. 3.2), one can evaluate a logical formula in a 3-valued structure to obtain a conservative result with respect to the structure's concretization as a set of 2-valued structures. Consequently, we can reuse the formulas that specify the concrete operational semantics of statements and conditions (see §4): when evaluated in a 3-valued structure, these formulas yield sound approximations (in the abstract lattice $A$) of the concrete transformers.

5.2.1 Instrumentation Predicates. As always with abstract interpretation, there is a danger that as the analysis proceeds, the indefinite value 1/2 will become pervasive. This can destroy the ability to recover interesting information (although soundness is maintained). A key role in improving the precision of the abstraction is played by instrumentation predicates, which record auxiliary information in a logical structure. An instrumentation predicate $p$ of arity $k$ is defined by a logical formula $\psi_p(v_1, \ldots, v_k)$ over the core predicate symbols, and captures a property that each $k$-tuple of nodes may or may not possess. Tab. V lists some instrumentation predicates that are important for the analysis of programs that use type List.
If the set of instrumentation predicates is denoted by $\mathcal{I} \subseteq \mathcal{P}$, the concretization function becomes:

$$\gamma(S^\sharp) = \left\{ S \in \gamma_A(S^\sharp) \;\middle|\; \forall p \in \mathcal{I} : [\![p(v_1, \ldots, v_k)]\!]^S_2 = [\![\psi_p(v_1, \ldots, v_k)]\!]^S_2 \right\} \qquad (6)$$

$p$                | Intended meaning                                          | $\psi_p$
$t[n](v_1, v_2)$   | Is $v_2$ reachable from $v_1$ along n fields?             | $n^*(v_1, v_2)$
$r[n, q](v)$       | Is $v$ reachable from pointer variable q along n fields?  | $\exists v_1 : q(v_1) \wedge t[n](v_1, v)$
$c[n](v)$          | Is $v$ on a directed cycle of n fields?                   | $\exists v_1 : n(v, v_1) \wedge t[n](v_1, v)$
$is[n](v)$         | Is $v$ pointed to by 2 or more n fields?                  | $\exists v_1, v_2 : \neg eq(v_1, v_2) \wedge n(v_1, v) \wedge n(v_2, v)$

Table V. Defining formulas of instrumentation predicates used to characterize singly-linked lists. Typically, there is a separate predicate symbol $r[n, q]$ for each pointer variable q.

The constraint in Eqn. (6) that the value of an instrumentation predicate $p$ must match its defining formula $\psi_p$ filters out many concrete structures from consideration, thereby increasing the precision of the abstraction. Moreover, the use of unary instrumentation predicates as abstraction predicates provides a way to control which concrete individuals are merged together into summary nodes, and thereby to control the amount of information lost by abstraction. For instance, in program-analysis applications, reachability properties from specific pointer variables have the effect of keeping disjoint sublists or subtrees summarized separately. This is particularly important when analyzing a program in which two pointers are advanced along disjoint sublists (see footnote 6).

When applying the abstract transformer $[\![stm]\!] :$ 3-STRUCT $\to$ 3-STRUCT for statement stm, one could first update the values of the core predicates, and then reevaluate each instrumentation predicate's defining formula in the resulting abstract store. However, this would not provide any additional information. To gain maximum benefit from instrumentation predicates, their value should be computed in some other way.
This problem, the instrumentation-predicate-maintenance problem, is solved by updating the instrumentation predicates of the post-state as a function of their values in the pre-state. [Reps et al. 2003] presents an algorithm to generate an appropriate predicate-maintenance formula for each instrumentation predicate $p$, using the (core) predicate-update formulas $\varphi^c_{stm}$ that define the semantics of stm, together with $p$'s defining formula $\psi_p(v_1, \ldots, v_k)$.

Footnote 6: A method for automatically identifying appropriate instrumentation predicates, using a process of abstraction refinement, is presented in [Loginov et al. 2005]. In that paper, the input required to specify a program analysis consists of (i) a program, (ii) a characterization of the inputs, and (iii) a query (i.e., a formula that characterizes the intended output). That work, along with [Reps et al. 2003], provides a framework for automating most of the issues related to instrumentation predicates that were explicit obligations of an analysis designer in the original formulation of the 3-valued-logic approach to shape analysis [Sagiv et al. 2002]. See also [Loginov 2006].

Given the importance of instrumentation predicates that express reachability properties (such as $t[n](v_1, v_2)$ and $r[n, q](v)$ shown in Tab. V) for maintaining precision under canonical abstraction, there is one limitation of the method from [Reps et al. 2003] that is worth mentioning: if $b$ is a core binary predicate, and $t[b]$ is the corresponding reachability predicate, the method from [Reps et al. 2003] works best when the modification to $b$ by each concrete transformer is a unit-size change, i.e., when the transformer changes the value of at most one $b$-tuple. This presents a problem for creating summary transformers for procedures, because the net action of a procedure will modify multiple $b$-tuples, in general.
Fortunately, the method for applying procedure summaries developed in this paper maintains the values of instrumentation predicates in a different way from the one presented in [Reps et al. 2003] (see §6.5).

5.2.2 Other Operations on Logical Structures. Several additional operations on logical structures help prevent an analysis from losing precision [Sagiv et al. 2002]:

- Focus is an operation that can be invoked to elaborate a 3-valued structure, allowing it to be replaced by a set of more precise 3-valued structures (not necessarily images of canonical abstraction) that represent the same set of concrete stores.

- Coerce is a clean-up operation that may "sharpen" a 3-valued structure by setting an indefinite value (1/2) to a definite value (0 or 1), or discard a structure entirely if the structure exhibits some fundamental inconsistency (e.g., it cannot represent any possible concrete store).

Because the Embedding Theorem applies to any pair of structures for which one can be embedded into the other, it is not necessary to perform canonical abstraction after the application of each abstract transformer. To ensure that abstract interpretation terminates, it is only necessary that canonical abstraction be applied as a widening operator somewhere in each loop, e.g., at the target of each backedge in the CFG.

6. REPRESENTING AND ABSTRACTING RELATIONS BETWEEN MEMORY CONFIGURATIONS

6.1 Motivation

As discussed more thoroughly in §7 and §9, there are two main approaches to interprocedural static analysis: the functional and operational approaches [Sharir and Pnueli 1981]. In this paper, we follow the functional approach (also known as the relational approach). A key aspect of the functional approach is that it computes procedure summaries. It computes a predicate transformer for each node of the program by finding the smallest fixpoint of a set of equations over predicate transformers.
During this process, the effect of a call to procedure $P$ at a call site $c$ is handled by composing the predicate transformer for $c$ with the predicate transformer for $P$. (The predicate transformer for $P$ is the predicate transformer for the exit node of $P$.) When the fixpoint solution is obtained, the predicate transformer for $P$ is the procedure summary for $P$. In this paper such predicate transformers will be viewed as relations.

The main point here is that the ability to represent and abstract relations between memory configurations is fundamental for capturing the input/output behavior of a procedure. This section shows how representations for relations between memory configurations that are represented as logical structures can be created. This representation is the basis of the interprocedural shape analysis described in the next section.

6.2 Principles of the Representation

We now return to the discussion from §2 about two ways to represent and abstract relations between concrete program states, when a program state is a 2-valued structure.

The first approach described in §2 involved representing relations between concrete program states as sets of pairs of 2-valued structures. This point of view leads to a simple abstraction, where abstract relations are (sets of) pairs of 3-valued structures obtained by canonical abstraction; see Fig. 13(b). However, this solution is unsatisfactory for the following reasons:

- There is a technical difficulty: as explained in Remark 4.1, logical structures are implicitly defined up to a permutation of individuals. As explained in §2, this leads to a loss of information compared with first pairing and then abstracting (see footnote 7). With this representation it is also difficult to implement the application of a predicate transformer (sets of pairs) to an input predicate (a set of logical structures).
- From an efficiency point of view, applying such a solution to a complex abstract domain like 3-valued structures would often lead to combinatorial explosion (see footnote 8).

Fortunately, another approach is possible. We will proceed by analogy with an approach used when abstracting sets of vectors $X \subseteq \mathbb{R}^n$ and sets of relations $R \subseteq \mathbb{R}^n \times \mathbb{R}^n$ between such vectors. Sets of vectors can be abstracted with convex polyhedra [Cousot and Halbwachs 1978]:

$$\wp(\mathbb{R}^n) \xleftarrow{\;\gamma\;} Pol[n]$$

It is well-known that a good approach to abstracting relations between vectors is not to consider pairs of polyhedra, but to view relations between $n$-dimensional vectors as sets of $2n$-dimensional vectors, and to consider polyhedra in $2n$ dimensions:

$$\wp(\mathbb{R}^n \times \mathbb{R}^n) \xleftarrow{\;\gamma\;} Pol[2n]$$

Indeed, a relation like $\vec{x} = \vec{x}'$ cannot be finitely represented with pairs of polyhedra, but is very easily represented with a $2n$-dimensional polyhedron. Composing two such relations $P_1, P_2 \in Pol[2n]$ is also easy: one computes the intersection $P_{12}(\vec{x}, \vec{x}', \vec{x}'') = P_1(\vec{x}, \vec{x}', -) \cap P_2(-, \vec{x}', \vec{x}'') \in Pol[3n]$, and then projects out the $\vec{x}'$ variables in $P_{12}$.

Coming back to 2-valued logical structures $(U, \iota : \mathcal{P} \to \bigcup_k (U^k \to \mathbb{B}))$, an analogy can be drawn with polyhedra by considering each predicate symbol in a logical structure over a vocabulary $\mathcal{P}$, where $|\mathcal{P}| = n$, to correspond to a dimension in an $n$-dimensional vector. Thus, we will use logical structures over the duplicated vocabulary $\mathcal{P} \uplus \mathcal{P}'$ to represent relations between logical structures over vocabulary $\mathcal{P}$. Observe that the representation of concrete and abstract relations is unified by the notion of 3-valued structures, as before. Taking the analogy further, the existential quantification of a dimension in a set of vectors $X \subseteq \mathbb{R}^n$ corresponds to assigning the value 1/2 to all tuples of a predicate. With the addition of a meet operation on 3-valued structures (described in §6.5.2), we will be able to implement relation composition on two-vocabulary structures, in a manner similar to convex polyhedra in $2n$ dimensions.

Footnote 7: In concrete structures, identity of individuals is preserved in any given run of a procedure. The problem with abstraction-and-pairing is that the identity of the abstract individual to which a given concrete individual is mapped is not necessarily the same when different concrete structures are abstracted. The canonical name for $u$ in $S_1^\sharp$ on entry to a procedure has no a priori fixed relationship to the canonical name in a structure $S_2^\sharp$ that arises at the exit of the procedure.

Footnote 8: Even with intraprocedural analysis using single structures, combinatorial explosion needs to be carefully controlled by choosing a suitable set of abstraction predicates.

[Fig. 13. Two abstractions of the relation between an input list and an output list in which a new cell pointed to by c has been inserted, using destructive updating, somewhere in the middle of the list. (a) Relational representation; (b) Tabulated representation. Predicates n[inp] and n[out] represent the valuations of the n predicate before and after the insertion, respectively.]

Example 6.1. Fig. 13(a) and (b) illustrate the relational and tabulated representations, respectively, of a relation between input lists pointed to by a pointer variable list and output lists obtained by the insertion of a cell pointed to by pointer c (see footnote 9). The meanings of the relational instrumentation predicates displayed inside the nodes in Fig. 13(a) are explained in §6.4. They allow the analysis to track whether the fields of some cells have been modified or not.
Observe that the relational representation provides more information, because each cell is tracked individually in the representation. For instance, in Fig. 13(b), the information that the output list contains exactly one more cell than the input list is lost. Furthermore, with the tabulated representation, there is no way to determine whether the cells in the output list have been permuted from their order in the input list. In contrast, with the relational representation and the use of the relational instrumentation predicates, it is possible to record the fact that the fields of some cells have not been mutated. □

9 To reduce clutter, we have omitted certain information from Fig. 13(a); in particular, values have been omitted for some of the standard list predicates given in Tab. V, and therefore the reason why certain non-summary nodes have been kept separate from the summary nodes may not be apparent. This is just to simplify the diagram; the actual system has additional information not shown in Fig. 13(a) that controls which collections of nodes are summarized.

6.3 Structure of the Vocabulary

In this section, we define the vocabularies that are used when two-vocabulary logical structures are used to represent relations between logical structures. Because our analysis method will use relation composition (see Eqn. (12) in §7), we actually need three vocabularies. For each original predicate p ∈ P, we will define three predicates p[inp], p[out] and p[tmp]. A logical structure that represents a relation will use only p[inp] and p[out] predicates. The p[tmp] predicates (which will be used for computing compositions as explained below) are irrelevant outside of composition. The "irrelevancy" of a predicate corresponds to "undefinedness", and will be modeled in a 3-valued structure using the value 1/2. We will refer to the labels inp, out and tmp as modes.
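As a rough illustration of the three modes, the following Python sketch builds the moded copies of each predicate name and marks the tmp copies as irrelevant by giving them the value 1/2. The names `MODES`, `moded`, `moded_vocabulary`, and `relation_structure`, as well as the dictionary encoding with one scalar value per predicate, are assumptions made for this sketch; they are not taken from the analysis system.

```python
from fractions import Fraction

HALF = Fraction(1, 2)          # the indefinite value 1/2
MODES = ("inp", "out", "tmp")  # the three modes of Section 6.3


def moded(pred, mode):
    """Return the name of the copy of `pred` in the given mode, e.g. n[inp]."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return f"{pred}[{mode}]"


def moded_vocabulary(preds):
    """Three copies p[inp], p[out], p[tmp] for each original predicate p."""
    return [moded(p, m) for p in preds for m in MODES]


def relation_structure(inp_values, out_values):
    """Hypothetical encoding: map each moded predicate name to a value.

    A structure that represents a relation only uses the inp and out
    copies; every tmp copy is irrelevant and is therefore set to 1/2.
    Values are simplified to one scalar per predicate for brevity.
    """
    structure = {}
    for p in inp_values:
        structure[moded(p, "inp")] = inp_values[p]
        structure[moded(p, "out")] = out_values[p]
        structure[moded(p, "tmp")] = HALF
    return structure


vocab = moded_vocabulary(["n"])
s = relation_structure({"n": 1}, {"n": 0})
```

In this encoding, a consumer that only looks at the inp and out entries sees the relation, while the tmp entries carry no information until a composition needs them.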
We have already distinguished, among predicates, core predicates from instrumentation predicates: $P = C \cup I$. Moreover, among core predicates, we have distinguished predicates related to the local state and those related to the global state: $C = L \cup G$. The vocabulary of core predicates will now contain:
— three sets of predicates corresponding to global core predicates in G: G[inp], G[out] and G[tmp];
— the set of local core predicates L.
We will assume that the formal input parameter of a procedure is not modified in the procedure, so as to obtain at the exit node of the procedure a relationship between the values of predicates in G[inp] ∪ {fpi} and predicates in G[out] ∪ {fpo}. The other local variables may be forgotten at the exit node.

The case of an instrumentation predicate p is a bit more complex, because it depends on the predicates involved in its defining formula $\psi_p$. If $\psi_p$ involves at least one global predicate, the vocabulary will include three copies of the instrumentation predicate p: p[inp], p[out] and p[tmp]. For instance, the vocabulary will include three copies of the reachability predicate r[n, q](v) defined in Tab. V, because we need to characterize a cell by its reachability properties from the pointer variable q through n links both at the entry of the procedure and at the current control point.

We can now give the precise definition of 3-valued structures $S^\sharp = (U^\sharp, \iota^\sharp) \in$ 3-STRUCT[P[inp] ∪ P[out]] in terms of a relation $R \subseteq (\text{2-STRUCT}[C])^2$:

$$\gamma_r(S^\sharp) = \left\{ ((U, \iota_1), (U, \iota_2)) \;\middle|\; \exists S = (U, \iota) \in \gamma(S^\sharp) : \begin{array}{l} \forall p \in G[inp] : \iota_1(p) = \iota(p[inp]) \\ \forall p \in G[out] : \iota_2(p) = \iota(p[out]) \\ \forall p \in L : \iota_1(p) = \iota_2(p) = \iota(p) \end{array} \right\}$$

where the concretization function γ is defined by Eqn. (6).

6.4 Relational Instrumentation Predicates

To prevent loss of essential information, we also need specific instrumentation predicates to capture properties that relate p[inp] predicates and p[out] predicates.
We call such multi-vocabulary instrumentation predicates relational instrumentation predicates. In particular, it will be essential to capture accurately the identity relationship (see §7.1, Eqn. (11)). As a consequence, we always use the unary predicates id_succ[n, m1, m2] and id_pred[n, m1, m2], where m1, m2 ∈ {inp, out} and $m_1 \neq m_2$, to record information about the values of different modes of predicate n, such as whether the value of predicate n[m1] implies n[m2]. These are defined by

$$\mathit{id\_succ}[n, m_1, m_2](v) = \forall v_1 : (n[m_1](v, v_1) \Rightarrow n[m_2](v, v_1))$$
$$\mathit{id\_pred}[n, m_1, m_2](v) = \forall v_1 : (n[m_1](v_1, v) \Rightarrow n[m_2](v_1, v)).$$

Example 6.2. In Fig. 13(a), the fact that id_succ[n, inp, out](v) and id_succ[n, out, inp](v) both hold for the two summary nodes captures the fact that the concrete memory cells represented by these summary nodes have not been reordered. More generally, the value of id_succ[n, m1, m2] on the different nodes makes it possible to capture precisely that the only transformation performed on the list is the addition of the new cell. (Looking ahead to Fig. 15(a), the fact that id_succ[n, inp, tmp](v) and id_succ[n, tmp, inp](v) hold globally captures the condition that the n[inp] and n[tmp] predicates are identical.) □

Generally speaking, relational instrumentation predicates are essential to preserving relational information that would otherwise be lost when concrete nodes are merged into summary nodes. Some additional constraint rules related to these relational instrumentation predicates are also needed for the relation composition operation defined in §6.5. These constraint rules express logical consequences between relational instrumentation predicates.
For instance, the rule

$$\mathit{id\_succ}[n, m_1, m_2](v) \wedge \mathit{id\_succ}[n, m_2, m_3](v) \Rightarrow \mathit{id\_succ}[n, m_1, m_3](v) \quad \text{for } m_1 \neq m_2 \neq m_3$$

is standard for capturing the fact that the composition of two identity relations is the identity relation. At present, such rules are provided manually. Depending on the procedures in the analyzed program and their semantics, one may need additional relational instrumentation predicates and constraint rules. For the list-reversal example of Fig. 1, §8.1 discusses the relational instrumentation predicates used to capture the fact that the list has been reversed.

6.5 Relation Composition

As mentioned in §6.2, relation composition can be defined in terms of meet and projection operations. In the notation from §2, the composition $S^{\langle \prime, \prime\prime \rangle} \circ S^{\langle \cdot, \prime \rangle}$ of the transformations represented by two two-vocabulary structures $S^{\langle \cdot, \prime \rangle}$ and $S^{\langle \prime, \prime\prime \rangle}$ is performed as follows:

$$S^{\langle \prime, \prime\prime \rangle} \circ S^{\langle \cdot, \prime \rangle} = \mathit{project}_{1,3}\bigl(S^{\langle 1/2, \prime, \prime\prime \rangle} \sqcap S^{\langle \cdot, \prime, 1/2 \rangle}\bigr) = S^{\langle \cdot, \prime\prime \rangle}. \qquad (7)$$

We define the projection and meet operations below, and discuss their interaction with instrumentation predicates.

6.5.1 The Projection Operation. The existential quantification of a (core) predicate symbol $p_0$ in a 2-valued logical structure S = (U, τ) is formally defined as the disjunction of all the possible values {1, 0} for all tuples of the predicate $p_0$ in S, leading to a set of 2-valued structures:

$$\exists p_0 : S = \{ S' = (U, \tau') \mid \forall p \in P \setminus \{p_0\} : \tau'(p) = \tau(p) \}.$$

Now consider existential quantification in a 3-valued logical structure $S^\sharp$. The goal is to create a 3-valued structure that over-approximates the result of existential quantification in all 2-valued structures that $S^\sharp$ represents.
When $S^\sharp$ contains no instrumentation predicates, existential quantification can be modeled exactly by assigning the value 1/2 to all tuples of the predicate $p_0$, as follows:

$$(\exists p_0 : S) = (U, \tau'), \text{ where } \tau' \text{ is defined by } \forall \vec{u} \in U^* : \tau'(p_0)(\vec{u}) = 1/2 \;\wedge\; \forall p \in P \setminus \{p_0\} : \tau'(p) = \tau(p).$$

This operation can be implemented with a predicate-update formula (§5.2). Applying the concretization operation γ : ℘(3-STRUCT) → ℘(2-STRUCT) gives back the disjunction of 2-valued structures.

Matters are slightly different when we consider a 3-valued logical structure equipped with instrumentation predicates. Consider $S^\sharp \in$ 3-STRUCT[P], where $P = C \cup I$ has core predicates C and instrumentation predicates I. Quantifying out a core predicate c alone may not be sufficient to drop all information about c: in particular, every instrumentation predicate whose defining formula involves c provides (a degree of) redundant information about c; hence, all instrumentation predicates whose defining formula involves c should also be quantified out.10

Projecting a logical structure in 3-STRUCT[P[inp] ∪ P[out] ∪ P[tmp]] onto the subspace 3-STRUCT[P[inp] ∪ P[out]] is thus equivalent to the existential quantification of all p[tmp] predicates, for p ∈ P, as well as all relational instrumentation predicates that involve a predicate in p[tmp]. This operation on 3-valued structures is extended in the standard way to our abstract domain ℘(3-STRUCT[P[inp] ∪ P[out] ∪ P[tmp]]) that manipulates sets of such structures.

6.5.2 The Meet Operation. The meet operation is first defined as the greatest-lower-bound operation induced by the approximation order in the lattice 3-STRUCT[P]. It is then extended to the abstract domain ℘(3-STRUCT[P]). [Arnold et al. 2006] shows that in general the first operation is NP-complete. However, [Arnold et al. 2006] provides an algorithm based on graph matching that performs rather well in practice. This is discussed in more detail in §8.3.
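The projection of §6.5.1 and the value-level part of the meet of §6.5.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a structure is encoded as a dictionary from predicate names to tables of truth values, `depends_on` is a hypothetical map from each instrumentation predicate to the core predicates mentioned in its defining formula, and `meet_value` only handles single truth values. The structure-level meet, which must also match individuals and is the NP-complete part, is not modeled.

```python
from itertools import product

HALF = 0.5  # the indefinite value 1/2


def forget_predicate(universe, interp, arity, pred):
    """Quantify out `pred` by assigning 1/2 to every tuple of individuals."""
    new_interp = dict(interp)
    new_interp[pred] = {t: HALF for t in product(universe, repeat=arity[pred])}
    return new_interp


def project_out(universe, interp, arity, core_preds, depends_on):
    """Quantify out the given core predicates and every instrumentation
    predicate whose defining formula mentions one of them.

    `depends_on` maps an instrumentation predicate to the set of core
    predicates occurring in its defining formula (an assumed encoding).
    """
    dropped = set(core_preds)
    dropped |= {p for p, deps in depends_on.items() if deps & set(core_preds)}
    for pred in sorted(dropped):
        interp = forget_predicate(universe, interp, arity, pred)
    return interp


def meet_value(a, b):
    """Meet of two truth values in the information order, where 0 and 1
    lie below 1/2; returns None when the values are contradictory."""
    if a == HALF:
        return b
    if b == HALF or a == b:
        return a
    return None


universe = ["u1", "u2"]
arity = {"c[tmp]": 1, "c[out]": 1, "p[tmp]": 0}
interp = {
    "c[tmp]": {("u1",): 1, ("u2",): 0},
    "c[out]": {("u1",): 1, ("u2",): HALF},
    "p[tmp]": {(): 1},
}
projected = project_out(universe, interp, arity, ["c[tmp]"], {"p[tmp]": {"c[tmp]"}})
```

Here the nullary predicate `p[tmp]` is forgotten together with `c[tmp]` because its (hypothetical) defining formula mentions `c[tmp]`, while `c[out]` is left untouched.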
The effect of the meet operation on instrumentation predicates deserves a further remark: In the context of abstract structures, it should be combined with the Coerce operation discussed in §5.2.2, which propagates logical consequences between (core and instrumentation) predicates. Indeed, the standard meet operation performs a logical meet without exploiting the defining formulas of instrumentation predicates: instrumentation predicates are just treated as independent core predicates.

10 Quantifying out c and all instrumentation predicates whose defining formula involves c might not be the best correct approximation of quantifying out c in all concrete structures represented by $S^\sharp$ if the defining formula $\psi_p$ of an instrumentation predicate p has a syntactic dependence on c without involving a true semantic dependence; for instance, if we have $\psi_p(\vec{v}) = \ldots \wedge (c(\vec{v}) \vee \neg c(\vec{v}))$.

Fig. 14. Applying the Coerce operation after the meet operation: (a) $S_1^\sharp$; (b) $S_2^\sharp$; (c) $S^\sharp = S_1^\sharp \sqcap S_2^\sharp$; (d) coerce($S^\sharp$). c is a core predicate, p a nullary instrumentation predicate defined by $p = \exists v : c(v) \wedge \forall v' : v \neq v' \Rightarrow \neg c(v')$.
Fig. 15. Applying the Coerce operation in relation composition: (a) $S_1^\sharp$; (b) $S_2^\sharp$; (c) $S_1^\sharp \sqcap S_2^\sharp$; (d) coerce($S_1^\sharp \sqcap S_2^\sharp$); (e) coerce(project$_{inp,out}$($S_1^\sharp \sqcap S_2^\sharp$)); (f) project$_{inp,out}$(coerce($S_1^\sharp \sqcap S_2^\sharp$)).

Consider the example of Fig. 14. The standard meet returns the structure depicted in Fig. 14(c), where p holds the indefinite value 1/2. However, performing a semantic reduction on $S^\sharp$ using Coerce leads to p obtaining the definite value 1, as shown in Fig. 14(d). In this case, Coerce used constraint rules derived from the defining formula of p to infer that p must have the value 1. (See [Sagiv et al. 2002, §6.4] for more details about the use of constraint propagation during Coerce.)

This aspect is even more important in the context of multi-vocabulary logical structures that are combined for relation composition; see Fig. 15. As discussed in §6.4, the multi-vocabulary logical structures that we work with are typically equipped with relational instrumentation predicates and related constraint rules.
To retain precision, it is necessary to make sure that logical consequences of the predicates in the vocabulary to be dropped have been incorporated into the predicates of the other vocabularies before projection. Fig. 15 illustrates this point when the id_succ[n, m1, m2] relational instrumentation predicates and their related constraint rules as defined in §6.4 are active. It shows that applying the Coerce operation before projection is the key to obtaining the fact that the resulting relation in Fig. 15(f) is the identity relation.

As a consequence, the abstract meet operation between 3-valued structures is defined as

$$S_1^\sharp \sqcap^\sharp S_2^\sharp = \mathit{coerce}(S_1^\sharp \sqcap S_2^\sharp),$$

where ⊓ is the standard meet on 3-valued structures, and the abstract relation-composition operation (Eqn. (7)) is redefined as

$$S^{\langle \prime, \prime\prime \rangle} \circ S^{\langle \cdot, \prime \rangle} = \mathit{project}_{1,3}\bigl(S^{\langle 1/2, \prime, \prime\prime \rangle} \sqcap^\sharp S^{\langle \cdot, \prime, 1/2 \rangle}\bigr) = S^{\langle \cdot, \prime\prime \rangle}. \qquad (8)$$

The use of the abstract meet operation in Eqn. (8) addresses a problem that was mentioned in §5.2.1: the instrumentation-predicate-maintenance formulas created by finite differencing [Reps et al. 2003] are able to maintain definite values for instrumentation predicates that express reachability properties only for unit-size changes to core predicates. However, procedure summaries can involve non-unit-size changes to core predicates. We side-step this problem by using abstract meet, rather than a method that involves finite differencing, to implement abstract relation composition.

7. INTERPROCEDURAL SHAPE ANALYSIS

Our interprocedural shape analysis is based on a variant of the functional approach to interprocedural analysis [Cousot and Cousot 1977; Sharir and Pnueli 1981; Knoop and Steffen 1992], in which the two computation steps referred to in §6.1 are merged into a single step. Jeannet and Serwe [Jeannet and Serwe 2004] show how the functional approach can be derived as an abstract interpretation of the standard operational semantics, modeled using a stack of activation records.
Once the interprocedural semantics is defined in this way, a second abstraction step may be used to abstract the data (in our case, the values of variables and linked memory cells).

In this section, we start directly from the derived forward relational semantics obtained by abstract interpretation of the standard operational semantics, as described in [Jeannet and Serwe 2004]. In §7.1, we first instantiate this forward relational semantics for the case where relations between memory configurations are represented as sets of pairs. In §7.2, we reformulate it for the case where relations are represented and abstracted with the two-vocabulary structures defined in §6, so as to obtain the effective dataflow equations used by our analysis. Finally, in §7.3, we discuss how these dataflow equations can be modified so that their solutions can be obtained more rapidly. (Experimental results with the latter technique are presented in §8.4.)

7.1 Forward Relational Semantics

In the forward relational semantics, each node of the program's CFG is associated with a relation between
— the states reachable at the entry node of the current procedure, and
— the states reachable at the current node of the procedure.
The relational semantics is defined as the least fixpoint of a system of equations over such relations. Each procedure is viewed as a pure function taking inputs and returning outputs, without performing any side effect on the global store. However, the programs that we consider do modify the global store, defined by the heap and the value of global variables. To account for this, at the semantic level we include the heap and the global variables as implicit input and output parameters of the functions, in addition to the explicit input and output parameters.

Notation.
For the time being, we represent relations between concrete memory configurations as sets of pairs of 2-valued structures. Thus, we define the function R : N → ℘(State × State), which maps each node of the CFG to a relation over states. States are represented as 2-valued logical structures over core predicates. Among the core predicates, some predicates represent information about the local state of a procedure (i.e., the values of local variables), while other predicates represent information about the global state of the program, i.e., the structure of the heap and the values of global variables. We thus decompose the set of core predicates into local and global predicates: $C = G \cup L$.

Intraprocedural Operations. An edge $n \to n'$ of the CFG labeled with a statement stm or a condition cond generates the following equations, respectively:

$$R(n') \supseteq \{ (S, S'') \mid (S, S') \in R(n) \wedge S'' = [\![\mathit{stm}]\!](S') \} \qquad (9)$$
$$R(n') \supseteq \{ (S, S'') \mid (S, S') \in R(n) \wedge S'' \in [\![\mathit{cond}]\!](\{S'\}) \} \qquad (10)$$

Intuitively, the current relation is composed with the relation induced by the semantics of the operation. We use inclusions in the equations because several edges may have $n'$ as their target.

Procedure Calls. In a procedure call, modeled by a call-to-start edge (c, s) labeled by an expression $\langle \mathit{call}\ apo = P_i(api) \rangle$, the current global state and the actual parameter are passed to the callee, while the other local variables become undefined. One generates the identity relation from the obtained reachable set of states:

$$R(s) = \left\{ (T, T) \;\middle|\; (S, S') \in R(c) \wedge \begin{array}{l} T(fpi) = S'(api) \;\wedge \\ \forall p \in G : T(p) = S'(p) \end{array} \right\} \qquad (11)$$

Note that an undefined predicate is modeled as: "any value is possible".

Procedure Returns. This is the most complex operation. We assume that an exit-to-return edge (e, r) is labeled by $\langle \mathit{ret}\ apo = P_i(api) \rangle$, and that (e, r)'s corresponding call-to-start edge is (c, s) (i.e., call(r) = c).
The processing of a procedure return consists of the following steps:
— composing the relation R(c) at the corresponding call node c with the relation R(e) at the exit node of the callee, to create the global state at r;
— taking the local state at the call node and modifying it with the assignment of the actual output parameter at the exit node, to create the local state at r.

$$R(r) = \left\{ (S, W) \;\middle|\; \begin{array}{l} (S, S') \in R(c) \wedge (T, T') \in R(e) \\ \wedge\; \forall p \in G : S'(p) = T(p) \wedge S'(api) = T(fpi) \\ \wedge\; \forall p \in L \setminus \{apo\} : W(p) = S'(p) \\ \wedge\; \forall p \in G : W(p) = T'(p) \wedge W(apo) = T'(fpo) \end{array} \right\} \qquad (12)$$

In the above equations, for (S, S′) and (T, T′) to be composable, the states S′ and T must agree on the input parameters (actual and formal) and the global state. In the new state W, the values of local variables except the actual output parameter are inherited from S′, while the global state and the value of the actual output parameter are taken from T′.

The Initial Set of Relations. Normally, the analysis starts in an initial state (here, a relation). Assuming that the set of possible memory configurations at the start node of the main procedure is X, we add the inclusion

$$R(s_{main}) \supseteq \{ (S, S) \mid S \in X \} \qquad (13)$$

Reachable States. The set that we want to compute is the least fixpoint of Eqns. (9), (10), (11), (12), and (13). This defines a framework for interprocedural dataflow analysis:
— A given analysis is obtained by instantiating these equations for a suitable abstract domain.
— At each control-flow graph node, the fixpoint solution captures the relation between the reachable states at the entry of the current procedure and the reachable states at the current node.
— The states reachable at each node n can thus be extracted by projecting the relation R(n) onto its second component.

Eqns. (9), (10), (11), (12), and (13) are a particular version of the equations given in [Jeannet and Serwe 2004], except that the global state is passed back and forth explicitly.
(Also, here we merge the two sets of activation records that were kept separate in [Jeannet and Serwe 2004] to support backward analysis.) The soundness of the semantics with respect to the standard operational semantics is proven in [Jeannet and Serwe 2004] by using abstract interpretation.

7.2 Dataflow Equations

In §6, we showed how to represent relations between logical structures more efficiently and to abstract them more precisely with two-vocabulary structures. We thus instantiate the equations of §7.1 with this better representation.

Intraprocedural Operations. Eqns. (9) and (10) are replaced by

$$R(n') \supseteq [\![\mathit{stm}]\!](R(n))$$
$$R(n') \supseteq [\![\mathit{cond}]\!](R(n))$$

except that predicate-update formulas and precondition formulas in functions ⟦stm⟧ and ⟦cond⟧ defined by Eqns. (4) and (5) are modified by replacing global predicates p ∈ G with predicates p[out] ∈ G[out]. For instance, in the case of the statement x->n := NULL, the predicate-update formula becomes $n'[out](v_1, v_2) = n[out](v_1, v_2) \wedge \neg x(v_1)$.

Procedure Calls. Eqn. (11) is replaced by

$$R(s) = \left\{ T \;\middle|\; S \in R(c) \wedge \begin{array}{l} T(fpi) = S(api) \\ \wedge\; \forall p \in L \setminus \{fpi\} : T(p) = 1/2 \\ \wedge\; \forall p \in G : T(p[inp]) = T(p[out]) = S(p[out]) \end{array} \right\}$$

Procedure Returns. We proceed in three steps to implement Eqn. (12). First we take the relation R(e) at the exit node of the callee and transform it by eliminating local variables that are not formal input or output parameters, and by setting the values of p[tmp] predicates to the values of p[inp] predicates:

$$R'(e) = \left\{ S' \;\middle|\; \exists S \in R(e) : \begin{array}{l} \forall p \in L \setminus \{fpi, fpo\} : S'(p) = 1/2 \\ \forall p \in G : S'(p[inp]) = 1/2 \\ \forall p \in G : S'(p[tmp]) = S(p[inp]) \end{array} \right\}$$

We also take the relation R(c) at the call node, set the values of p[tmp] predicates to the values of p[out] predicates, and equate formal and actual input parameters. (To simplify the presentation, we assume that there are no name conflicts.)
$$R'(c) = \left\{ S' \;\middle|\; S \in R(c) \wedge \begin{array}{l} S'(fpi) = S(api) \\ \wedge\; \forall p \in G : S'(p[tmp]) = S(p[out]) \\ \wedge\; \forall p \in G : S'(p[out]) = 1/2 \end{array} \right\}$$

The last step consists of combining R′(c) and R′(e) by taking their meet, assigning the formal output parameter to the actual parameter, and then forgetting p[tmp] predicates and the formal output parameter of the callee:

$$R(r) = \left\{ S' \;\middle|\; S \in R'(c) \sqcap^\sharp R'(e) \wedge \begin{array}{l} S'(apo) = S(fpo) \\ S'(fpi) = 1/2 \\ \wedge\; \forall p \in G : S'(p[tmp]) = 1/2 \end{array} \right\} \qquad (14)$$

The meet forces the relations R′(c) ∈ 3-STRUCT[P[inp] ∪ P[tmp]] and R′(e) ∈ 3-STRUCT[P[tmp] ∪ P[out]] to agree on global p[tmp] predicates, and on actual and formal parameters.

Fig. 16. Inequation systems and induced dependences between variables: (a) Phase I, bottom-up; (b) Phase II, top-down; (c) Combination with forward relational semantics. Solid and dashed lines are used to distinguish between the first and second calls to f() and g().

With the exception of the meet operation, all operations can be implemented using predicate-update formulas (cf. §4.3). We do not specify in the equations above how instrumentation predicates are updated; the implementation mainly uses the automatically generated predicate-maintenance formulas created by finite differencing [Reps et al. 2003], although for some simple cases and for instrumentation predicates that involve only one mode, they were provided manually. In particular, for the procedure-call operations, we provided manually the values of relational instrumentation predicates that model the identity relationship.
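The three steps of the procedure-return operation of §7.2 can be sketched in Python as follows. This is only a sketch under strong simplifying assumptions that are not part of the analysis itself: each relation is a single structure rather than a set, both structures share the same universe with individuals aligned by name (so the meet is pointwise), there are no instrumentation predicates and hence no Coerce step, and a missing table entry stands for 1/2. All function names are hypothetical. The sketch forgets both formal parameters of the callee after the meet.

```python
HALF = 0.5  # the indefinite value 1/2; a missing entry also means 1/2


def meet_value(a, b):
    """Meet in the information order; None signals a contradiction."""
    if a == HALF:
        return b
    if b == HALF or a == b:
        return a
    return None


def meet(s1, s2):
    """Pointwise meet of two structures over the same, aligned universe."""
    result = {}
    for pred in s1.keys() | s2.keys():
        t1, t2 = s1.get(pred, {}), s2.get(pred, {})
        result[pred] = {}
        for tup in t1.keys() | t2.keys():
            v = meet_value(t1.get(tup, HALF), t2.get(tup, HALF))
            if v is None:
                return None  # the two structures are incompatible
            result[pred][tup] = v
    return result


def prepare_exit(s, global_preds, fpi, fpo):
    """R'(e): keep fpi and fpo, copy p[inp] into p[tmp], forget p[inp]."""
    out = {fpi: s[fpi], fpo: s[fpo]}
    for p in global_preds:
        out[f"{p}[tmp]"] = s[f"{p}[inp]"]
        out[f"{p}[out]"] = s[f"{p}[out]"]
    return out


def prepare_call(s, global_preds, api, fpi):
    """R'(c): bind fpi to api, copy p[out] into p[tmp], forget p[out]."""
    out = {k: v for k, v in s.items() if not k.endswith("[out]")}
    out[fpi] = s[api]
    for p in global_preds:
        out[f"{p}[tmp]"] = s[f"{p}[out]"]
    return out


def combine(s_call, s_exit, global_preds, api, apo, fpi, fpo):
    """Meet R'(c) with R'(e), assign fpo to apo, then forget the formal
    parameters of the callee and all p[tmp] predicates."""
    m = meet(prepare_call(s_call, global_preds, api, fpi),
             prepare_exit(s_exit, global_preds, fpi, fpo))
    if m is None:
        return None
    m[apo] = m[fpo]
    for name in [fpi, fpo] + [f"{p}[tmp]" for p in global_preds]:
        m.pop(name, None)
    return m


# Two cells u1 -> u2; the callee summary reverses the single n link.
links = {("u1", "u1"): 0, ("u1", "u2"): 1, ("u2", "u1"): 0, ("u2", "u2"): 0}
reversed_links = {("u1", "u1"): 0, ("u1", "u2"): 0, ("u2", "u1"): 1, ("u2", "u2"): 0}
at_call = {"x": {("u1",): 1, ("u2",): 0}, "n[inp]": links, "n[out]": links}
at_exit = {"l": {("u1",): 1, ("u2",): 0}, "r": {("u1",): 0, ("u2",): 1},
           "n[inp]": links, "n[out]": reversed_links}
at_return = combine(at_call, at_exit, ["n"], api="x", apo="y", fpi="l", fpo="r")
```

The meet succeeds only when the caller's current heap (copied into `n[tmp]`) agrees with the heap on which the callee summary was computed and when the actual and formal input parameters agree; otherwise `combine` returns `None`, meaning that this pair of structures contributes nothing at the return node.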
7.3 Impact of the Form of the Dataflow Equations on Precision and Efficiency

We now compare our one-phase approach to interprocedural analysis to a two-phase approach that is in the spirit of [Sharir and Pnueli 1981; Knoop and Steffen 1992]. At the concrete level, the two approaches are semantically equivalent; however, at the abstract level the one-phase approach can yield more precise answers because the abstract operations are no longer exact. We exploited this difference to develop an optimization that, in practice, speeds up the convergence of our one-phase analysis while still retaining its precision advantages (see the experimental results presented in §8.4).

Our forward relational semantics can be sketched as follows:

$$\begin{array}{ll} R(n') \supseteq F_{\mathit{instr}}(R(n)) & \text{Intraprocedural statement/condition} \\ R(s_{main}) \supseteq \mathit{Id}_{\mathit{State}} & \text{Uninitialized state at start} \\ R(s) \supseteq \mathit{Id}_{\mathit{codom}(R(c))} & \text{Procedure call, with call-to-start edge } (c, s) \\ R(r) \supseteq F_{\mathit{Combine}}(R(c), R(e)) & \text{Procedure return, with call-site } c \text{ and exit-to-return edge } (e, r) \end{array}$$

where $\mathit{Id}_X$ denotes the identity relation restricted to the domain X and codom(R) denotes the projection of a relation R on its codomain.

In contrast, the two-phase method involves solving two equation systems in succession. The first system, which defines the so-called bottom-up phase, computes procedure summaries that are valid for any input, instead of being specialized to the reachable inputs of the callee. The second system, which defines the so-called top-down phase, computes reachability information using the procedure summaries R(e) computed by the first phase. The corresponding equations are given in Fig. 16.

The advantage of combining the two phases into a single phase (as done in §7.1) is that the one-phase approach can yield more precise answers (because it computes procedure summaries that are specialized to the reachable inputs of the callee).
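To make the comparison concrete, here is a small Python sketch of the equations of §7.3 over a finite concrete domain, with relations encoded as sets of pairs of integers. The toy program (main increments a global modulo 8 and then calls f, which doubles it), the node names, and the function `solve` are all hypothetical. Passing `all_states` replaces the identity on codom(R(c)) by the identity on all states at the start node of f, which mimics a bottom-up summary that is valid for any input; the separate top-down phase is not modeled.

```python
def identity_on(states):
    """Identity relation restricted to the given set of states."""
    return {(s, s) for s in states}


def codom(rel):
    """Projection of a relation on its second component."""
    return {t for (_, t) in rel}


def post(rel, f):
    """Compose a relation with the function f applied to the current state."""
    return {(s, f(t)) for (s, t) in rel}


def combine(r_call, r_exit):
    """Relational composition of the caller's relation with the summary."""
    return {(s, t2) for (s, s1) in r_call for (t, t2) in r_exit if s1 == t}


def solve(initial, all_states=None):
    """Least fixpoint of the inequations, by round-robin iteration.

    Nodes: s_main -> c (x := x + 1 mod 8), call edge (c, s_f),
    s_f -> e_f (x := 2x mod 8), return edge (e_f, r).
    """
    R = {n: set() for n in ("s_main", "c", "s_f", "e_f", "r")}
    changed = True
    while changed:
        entry = codom(R["c"]) if all_states is None else all_states
        new = {
            "s_main": identity_on(initial),
            "c": post(R["s_main"], lambda x: (x + 1) % 8),
            "s_f": identity_on(entry),
            "e_f": post(R["s_f"], lambda x: (2 * x) % 8),
            "r": combine(R["c"], R["e_f"]),
        }
        changed = any(not new[n] <= R[n] for n in R)
        for n in R:
            R[n] |= new[n]
    return R


one_phase = solve({1, 2})
any_input = solve({1, 2}, all_states=set(range(8)))
```

On this concrete toy domain, the relation at the return node is the same in both runs, in line with the remark that the two approaches are equivalent at the concrete level; the only visible difference is that the summary at `e_f` covers just the reachable inputs in the one-phase run and every state in the other.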
The one-phase approach may converge more slowly than the two-phase approach because the one-phase equation system is more intricate. However, it is possible to speed up the convergence of the one-phase analysis while retaining its precision advantage as follows: it is sound to replace the inequation $R(s) \supseteq \mathit{Id}_{\mathit{codom}(R(c))}$ associated with call-to-start edges by $R(s) \supseteq \mathit{Id}_{\mathit{codom}(R(c)) \cup X'(c)}$ for any $X'(c)$. The idea is to choose $X'(c)$ to be a set of states that is very likely to be reachable (e.g., for a list-manipulating procedure, the set of well-formed lists). Because it may take several iteration steps for the solver to obtain this information and to propagate it further, adding it right from the beginning may speed up convergence. From a semantic point of view, this is equivalent to starting the iterative fixpoint computation from a higher initial value for R(s), which becomes $\mathit{Id}_{X'(c)}$ instead of ⊥. Two cases can occur:
— $X'(c)$ contains only reachable inputs of the callee, and the initial value of R(s) is still smaller than the smallest solution of the original equation system; in this case, there is no impact on precision, and we gain a convergence speed-up.
— $X'(c)$ contains some unreachable inputs of the callee, which can have a negative impact on precision (with respect to the precision of the one-phase approach). However, in the limit (i.e., when $X'(c) = \mathit{State}$), one obtains the equations of Phase I of the two-phase approach: the propagation from call nodes to start nodes of only reachable abstract values is completely eliminated. In this case, the precision of the one-phase approach degenerates to that of Phase I of the two-phase approach (see Fig. 16).11

The results of our experiments with this optimization are presented in §8.4.

8.
IMPLEMENTATION AND EXPERIMENTS

To perform interprocedural shape analysis by the method that is described in §7, we created a modified version of TVLA [Lev-Ami and Sagiv 2000], an existing shape-analysis system, to allow it to support the following features:
— We replaced the built-in notion of an intraprocedural CFG by the more general notion of equation system, in which transfer functions may depend on more than one variable. This modification was needed for implementing the return operation (Eqn. (14)).
— We also designed a more general language in which to specify equation systems.
These modifications, originally performed in 2003 [Jeannet et al. 2004], were made to the version of the TVLA system as it existed in 2003 [Lev-Ami and Sagiv 2000]. Later, we extended the modified system to incorporate the algorithm for the meet operator described in [Arnold et al. 2006]. It is this version of TVLA that we used in the experiments reported here.

11 In the limit case, the approach described above still solves only one equation system, one that is equivalent to Phase I of the two-phase approach. Thus, although the results are as precise as the two-phase approach with respect to summary functions, because we do not perform Phase II afterwards, the results obtained are imprecise with respect to what Phase II discovers about the reachable inputs of callees. In such a case, it would be possible to obtain more accurate information by solving the equations of Phase II.

Fig. 17. List-reversal example: (a) $S_0$; (b) $S_1$; (c) $S_2$. The input structure $S_0$ represents all acyclic singly-linked lists of length two or more. The analysis produces the two output structures $S_1$ and $S_2$. (In each structure, unary predicates that have the same non-0 value for all individuals are displayed in the box labeled "Const. unary predicates". The values of the "irrelevant" predicates of the vocabulary are not shown. By convention, the in, tmp, or out qualifier for a predicate whose name includes square-bracket symbols is inserted inside the brackets, e.g., r[n, out, res].)

This section is organized as follows: §8.1 discusses the analysis of the recursive list-reversal procedure from Fig. 1; §8.2 describes our experiments on a variety of list-manipulation and tree-manipulation procedures. §8.3 discusses improvements (compared to our previous work [Jeannet et al. 2004]) brought about by the use of an improved meet operation [Arnold et al. 2006]. §8.4 discusses experiments to speed up the convergence of the analysis method by injecting likely reachable states at the start nodes of procedures. §8.5 compares our method and experimental results with those of Rinetzky et al. [2005]. All running times were obtained using a 2GHz Pentium M, equipped with 1 GB of memory, running Linux.
8.1 Analysis of the List-Reversal Example

Given that the input is an acyclic, singly-linked list, the goal of the analysis of the procedure from Fig. 1, which destructively reverses an acyclic, singly-linked list, using recursion to traverse the list, is to show that
(1) the output is an acyclic list;
(2) each link of the output list is the reversal of a link of the input list, and vice versa;
(3) the cells of the output list are exactly the cells of the input list.

Fig. 17 shows how the summary information that we obtain captures the behavior of the recursive list-reversal procedure of Figs. 1 and 10. The descriptor of the initial summary transformer at start node $s_{main}$ was the 3-valued structure $S_0$, shown in Fig. 17(a), which represents (the identity transformation on) all linked lists of length at least two that are pointed to by program variable list. The head of the answer list is pointed to by program variable res. At the program's exit node $e_{main}$, the summary transformers were the structures $S_1$ and $S_2$ of Fig. 17(b) and Fig. 17(c), which represent the transformations that reverse lists of length two, and all lists of length greater than two, respectively.

Note that in both $S_1$ and $S_2$ from Fig. 17, each node has the value 0 for the unary predicate c[n, out] and each node has the value 1 for r[n, out, res]. This means that no node lies on a directed cycle of n fields and all nodes are reachable from the new head of the list res, and hence establishes item 1.

As discussed in §6.4, relational instrumentation predicates need to be introduced to prevent the loss of essential information. Besides the identity instrumentation predicates defined in §6.4, the unary predicates reverse_n_succ[m1, m2], with m1, m2 ∈ {in, out} and $m_1 \neq m_2$, record whether n[m2] is the reverse of n[m1].
These are defined by

    reverse_n_succ[m1, m2](v) = ∀v1 : (n[m1](v, v1) ⇒ n[m2](v1, v)).    (15)

We also provided the following related constraint rules, which allow a relationship between n[in] and n[out] to be deduced:

    id_succ[n, in, tmp](v) ∧ reverse_n_succ[tmp, out](v) ⇒ reverse_n_succ[in, out](v)
    reverse_n_succ[in, tmp](v) ∧ id_pred[n, tmp, out](v) ⇒ reverse_n_succ[in, out](v)

Note that only the reverse_n_succ[m1, m2] predicates and the related constraint rules are specific to the list-reversal example. The other predicates that appear in Fig. 17 are shape properties that characterize singly-linked lists. (They have been used in previous papers about shape analysis of list-manipulation programs; e.g., see [Sagiv et al. 2002].) For instance, r[n, out, list](v) holds the value 1 for individuals that are reachable from variable list through a chain of n[out] links.

In structures S1 and S2, the values for the predicates reverse_n_succ[m1, m2], with m1, m2 ∈ {in, out} and m1 ≠ m2, show that for each n link n[in](v1, v2) at the entry node s_main, we have an n link n[out](v2, v1) at the exit node e_main. In other words, the procedure reverses all of the n links; this establishes item 2.

Finally, in both of the output structures S1 and S2, we find that r[n, in, list](v) and r[n, out, res](v) hold for each node. This means that no nodes are either lost or gained, and hence the cells of the output list are exactly the cells of the input list; this establishes item 3.

From the above discussion, it should be clear that the set of 3-valued structures {S1, S2} establishes the desired properties: the output list is the reversal of the input list, and no elements are either lost or gained.

We generalized this experiment by having procedure main call procedure rev twice, as in Fig. 2(b).
To achieve the same level of accuracy as we obtained for a single call on rev, we needed to introduce an additional family of unary instrumentation predicates, reverse_n_pred[m1, m2], whose definition is the same as that of reverse_n_succ[m1, m2] (Eqn. (15)), except with v and v1 exchanged. With these additional instrumentation predicates, we were able to establish that the second call to rev always restores the initial memory configuration.

Iterative             Recursive
# structs  Time (sec) # structs  Time (sec)  Program (other procedures analyzed): description

a. programs on unsorted lists
3/3        0.8        4/3        1           create: creates a list of any length
12/9       5.5        11/9       5.5         append (create): appends 2 lists
7/6        2.3        7/6        2.3         split (create): cuts a list into 2 lists
9/4        4.3        5/4        2.4         reverse (create): destructive list reversal
12/4       4.8        15/12      6.2         revappend (create): reverse-append (using an accumulator parameter)
9/7        4.5        10/7       4.2         insert (create): inserts a cell at a random place in a list
9/8        5.6        14/10      4.9         delete (create): removes a cell at a random place in a list
                      92/49      92          merge (create): merges randomly 2 lists
                      32/13      23          merge*
                      10/10      13          splice (create): splices 2 lists (specialized merge)
                      9/7        11          splice*

b. programs on sorted lists
4/3        1          4/3        1           create: creates a list of any length
12/9       6          11/9       6           append (create): appends 2 lists
7/6        3          7/6        3           split (create): cuts a list into 2 lists
9/4        4.8        5/4        3           reverse (create): destructive list reversal
12/4       5.5        15/12      7           revappend (create): reverse-append (using an accumulator parameter)
7/3        8.5        9/3        8.5         createo (inserto): creates a sorted list using inserto
9/7        10         11/8       10          inserto (createo): inserts a cell in the right place in a sorted list
188/16     114        52/23      35          deleteo (createo,inserto): removes a cell with a given key from a sorted list
                      120/64     161         mergeo (createo,inserto): merges 2 sorted lists in one sorted list
                      32/13      45          mergeo*
                      10/10      19          spliceo (createo,inserto): splices 2 sorted lists (interleaves their cells)
                      10/7       19          spliceo*
                      110/93     135         tailsort (create, inserto): sorts a list recursively using insert
                      23/3       15          tailsort*
                      –          –           insertionsort (create, inserto): insertion sort (using an accumulator parameter)
                      65/3       35          insertionsort*
                      –          –           mergesort (create,split,mergeo): mergesort
                      69/3       113         mergesort*

Table VI. Experimental results on unsorted and sorted lists. The names in parentheses indicate the other procedures that are analyzed in the example. The stars indicate the introduction of “blurring functions” in dataflow equations. The column “# of structs” indicates (i) the maximum number of logical structures at any control point of the main procedure, and (ii) the maximum number of logical structures at the summary point.

Recursive
# structs  Time (sec)  Program (other procedures analyzed): description

a. programs on unsorted trees
10/5       13          create: creates an unsorted tree of any size (possibly empty)
10/3       11          create*
21/11      47          spliceLeft (create): inserts a tree as the leftmost child of another tree
14/7       47          insert* (create): inserts a cell in a tree
74/18      100         find* (create): finds a cell in a tree
12/9       63          removeRoot (create,spliceLeft): removes the root of a tree
53/25      593         remove* (create,spliceLeft,removeRoot): removes a cell in a tree
11/5       64          rotate (create): exchanges left and right subtrees of all nodes

b. programs on sorted trees
26/5       61          create: creates an unsorted tree
10/3       14          create*
14/7       59          insertu* (create): inserts a cell in an (unsorted) tree
21/11      109         spliceLeft (create): inserts an (unsorted) tree as the leftmost child of another (unsorted) tree
16/7       654         createo* (insert): creates a sorted tree
35/12      676         insert* (createo): inserts a cell in a sorted tree
41/15      771         find* (createo,insert): finds a cell with a given key in a sorted tree
12/9       1160        removeRoot* (createo,insert,removeRoot,spliceLeft): removes a cell with a given key in a sorted tree
30/15      1888        remove* (createo,insert,removeRoot,spliceLeft): removes a cell with a given key in a sorted tree
51/18      1780        split*: splits a tree into two trees according to a key, one with cells less than the key, one with cells greater than the key
19/5       101         rotate (create): exchanges left and right subtrees of all nodes

Table VII. Experimental results on unsorted and sorted trees. The names in parentheses indicate the other procedures that are analyzed in the example. The stars indicate the introduction of “blurring functions” in dataflow equations. The column “# of structs” indicates (i) the maximum number of logical structures at any control point of the main procedure, and (ii) the maximum number of logical structures at the summary point.

8.2 Experimental Results on Lists and Trees

Tabs. VI and VII present our experimental results on lists and trees. In these analyses, memory allocation and deallocation are modeled using a pool of free cells [Reps et al. 2003]. The instrumentation predicates related to data structures (lists and trees) are given in Tab. V and Tab. VIII. For sorted lists and trees, we introduce the total-order core predicate leq(v1, v2) described in Remark 4.3. We also introduce the related predicates of Tab. IX.
All analyses start with a memory heap consisting of a summary node that represents the free-cell pool and another summary node that represents any context. The core predicate leq(v1, v2) evaluates globally to 1/2. The examples are named according to the main analyzed procedure, but for most of them the main procedure first calls one or more data-structure-creation procedures, and possibly subprocedures, which are also analyzed from scratch.

p                 Intended meaning and ψp
down(v1, v2)      At least one field of v1 points to v2:
                      left(v1, v2) ∨ right(v1, v2)
both(v1, v2)      Both fields of v1 point to v2:
                      left(v1, v2) ∧ right(v1, v2)
r_down[z](v)      Reachability by any field from a variable z:
                      ∃v1 : z(v1) ∧ down*(v1, v)
shared_down(v)    Shared property:
                      ∃v1, v2 : v1 ≠ v2 ∧ down(v1, v) ∧ down(v2, v)
cyc_down(v)       Cyclicity property:
                      ∃v1 : down(v, v1) ∧ down*(v1, v)

Table VIII. Defining formulas of instrumentation predicates related to binary trees.

p                 Intended meaning and ψp
Sorted lists
orda[n](v)        The n field of v points to a cell v2 with v ≤ v2:
                      ∃v2 : n(v, v2) ∧ leq(v, v2)
ordb[n](v)        The n field of v is null:
                      ∀v2 : ¬n(v, v2)
ord[n](v)         Property of all cells of a sorted list:
                      orda[n](v) ∨ ordb[n](v)
Sorted binary trees
orda_right(v)     The keys of the right subtree of v are greater than the key of v:
                      ∃v1 : right(v, v1) ∧ ∀v2 : down*(v1, v2) ⇒ leq(v, v2)
orda_left(v)      The keys of the left subtree of v are less than the key of v:
                      ∃v1 : left(v, v1) ∧ ∀v2 : down*(v1, v2) ⇒ leq(v2, v)
ordb[n](v)        The n field of v is null:
                      ∀v2 : ¬n(v, v2)
ord_tree(v)       The tree is sorted:
                      (ordb[right](v) ∨ orda_right(v)) ∧ (ordb[left](v) ∨ orda_left(v))

Table IX. Defining formulas of instrumentation predicates related to ordering of cells.

Analysis Goals.
The goal of each analysis run is to establish that a data-structure invariant is preserved (or re-established), and that the summary obtained for each procedure captures its effect with sufficient precision. For unsorted lists (resp. trees), the output should be a well-formed list (resp. tree), without cell sharing, cycles, or memory leaks. Additionally, for sorted lists (resp. trees), the output should satisfy shape properties that define the proper ordering of cells in the data structure.

The input/output invariant that the summary of a procedure should capture depends on the procedure. Fig. 17 shows the procedure summary computed for the list-reversal example, which shows that the output list is composed of exactly the same set of cells as the input list, and that for each cell, the incoming n link has become an outgoing n link towards the same cell. For the insert and delete examples, the summary pinpoints the inserted or deleted cell (see Fig. 13(a)). In general, we observed that when the analysis fails to capture a precise approximation of the summary of a procedure, the abstract memory configurations obtained at the return site of the procedure do not establish that the expected data-structure invariants hold.

Note that the shape property that characterizes an ordered tree is much more complex than the shape property that characterizes an ordered list (see Tab. IX). A list is sorted if and only if each of its cells satisfies a local shape property (namely, that it is in the right order with respect to its immediate successor), whereas a tree is sorted if and only if each of its cells satisfies a global shape property (namely, that it is in the right order with respect to all of its descendants).
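The contrast between the local property for lists and the global property for trees can be sketched on concrete (2-valued) data structures. The following C fragment is only an illustration under stated assumptions: the types ListCell and TreeCell, the integer keys standing in for leq, and all function names are hypothetical, and the analysis itself evaluates the formulas of Tab. IX on 3-valued logical structures rather than on concrete cells.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical concrete cells; integer keys play the role of leq(v1, v2). */
typedef struct ListCell { int key; struct ListCell *n; } ListCell;
typedef struct TreeCell { int key; struct TreeCell *left, *right; } TreeCell;

/* ord[n](v) = orda[n](v) or ordb[n](v): a local test on v and its successor. */
bool ord_n(const ListCell *v) {
    bool ordb = (v->n == NULL);                          /* n field is null */
    bool orda = (v->n != NULL) && (v->key <= v->n->key); /* leq(v, v2)      */
    return orda || ordb;
}

/* A list is sorted iff ord[n] holds for each of its cells. */
bool list_sorted(const ListCell *head) {
    for (const ListCell *v = head; v != NULL; v = v->n)
        if (!ord_n(v)) return false;
    return true;
}

/* lower == true : for all v2 with down*(root, v2), bound <= key(v2).
   lower == false: for all v2 with down*(root, v2), key(v2) <= bound. */
static bool all_keys_bounded(const TreeCell *root, int bound, bool lower) {
    if (root == NULL) return true;
    bool ok = lower ? (bound <= root->key) : (root->key <= bound);
    return ok && all_keys_bounded(root->left, bound, lower)
              && all_keys_bounded(root->right, bound, lower);
}

/* ord_tree(v): the test ranges over the whole left and right subtrees of v. */
bool ord_tree(const TreeCell *v) {
    bool right_ok = (v->right == NULL) || all_keys_bounded(v->right, v->key, true);
    bool left_ok  = (v->left == NULL)  || all_keys_bounded(v->left, v->key, false);
    return right_ok && left_ok;
}

/* A tree is sorted iff ord_tree holds for each of its cells. */
bool tree_sorted(const TreeCell *root) {
    if (root == NULL) return true;
    return ord_tree(root) && tree_sorted(root->left) && tree_sorted(root->right);
}
```

In this sketch, ord_n inspects a single link, whereas ord_tree must traverse every cell below v, which mirrors the down* reachability that appears in the tree formulas of Tab. IX but not in the list formulas.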
The resulting more complex instrumentation predicates for trees explain, for instance, the time difference between the spliceLeft example on unsorted trees and on sorted trees. In the latter case, the analyzer must propagate the values of instrumentation predicates that hold information about ordering properties. The analysis times are quite high for procedures on sorted trees. However, the ability to automatically infer correct summaries for procedures that manipulate sorted trees is a major success for our technique. Indeed, other approaches to interprocedural shape analysis have not yet tackled this challenge. For instance, the tree analyses presented in [Rinetzky et al. 2005] do not establish that orderedness is maintained.

Choice of Appropriate Instrumentation Predicates. The instrumentation predicates (§5.2.1) that characterize a data structure’s shape properties (such as those defined in Tabs. V, VIII, and IX) are needed for the analysis to infer interesting information. As soon as the data structures manipulated by the analyzed program are large enough to generate summary nodes in abstract structures, these data structures cannot be characterized accurately without these instrumentation predicates.

Relational instrumentation predicates (see §6.4) are needed for two purposes. The id_succ[n, m1, m2] predicate is needed to model the identity relationship that holds at the entry of procedures. Other relational predicates are needed in several procedure summaries to capture crucial information about the before and after states. §8.1 discussed the predicate reverse_n_succ[m1, m2] that models the reversal of n links, used in the analysis of reverse and revappend. For trees, rotate requires a similar relational predicate. The other examples in Tabs. VI and VII do not require specific relational instrumentation predicates.
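To make the role of such a relational predicate concrete, the following C sketch evaluates the reverse_n_succ predicate of Eqn. (15) on a concrete two-vocabulary structure, in which each cell carries both its n[in] and n[out] links. It is an illustration only: the adjacency-matrix encoding, the type and function names, and the bound MAX_CELLS are hypothetical, and the formula used for reverse_n_pred is an assumed reading of "v and v1 exchanged" (§8.1). The analysis itself works on 3-valued structures, not on concrete ones.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_CELLS 8

/* Hypothetical encoding: links[v][v1] is true iff the n field of cell v
   points to cell v1 in the given vocabulary (in or out). */
typedef bool Links[MAX_CELLS][MAX_CELLS];

typedef struct {
    int size;     /* number of cells                  */
    Links n_in;   /* n[in]:  links at procedure entry */
    Links n_out;  /* n[out]: links at procedure exit  */
} TwoVocab;

/* Eqn. (15): reverse_n_succ[m1,m2](v) = forall v1: n[m1](v,v1) => n[m2](v1,v) */
bool reverse_n_succ(int size, Links m1, Links m2, int v) {
    for (int v1 = 0; v1 < size; v1++)
        if (m1[v][v1] && !m2[v1][v])
            return false;
    return true;
}

/* Assumed reading of reverse_n_pred (v and v1 exchanged):
   forall v1: n[m1](v1,v) => n[m2](v,v1) */
bool reverse_n_pred(int size, Links m1, Links m2, int v) {
    for (int v1 = 0; v1 < size; v1++)
        if (m1[v1][v] && !m2[v][v1])
            return false;
    return true;
}

/* Concrete counterpart of item 2 of Sec. 8.1: every in-link is reversed in
   out, and every out-link is the reversal of an in-link. */
bool all_links_reversed(TwoVocab *s) {
    for (int v = 0; v < s->size; v++) {
        if (!reverse_n_succ(s->size, s->n_in, s->n_out, v)) return false;
        if (!reverse_n_succ(s->size, s->n_out, s->n_in, v)) return false;
    }
    return true;
}

/* Builds the list 0 -> 1 -> ... -> len-1 in n[in] and its reversal in n[out].
   Assumes len <= MAX_CELLS. */
void make_reversed_list(TwoVocab *s, int len) {
    memset(s, 0, sizeof *s);
    s->size = len;
    for (int v = 0; v + 1 < len; v++) {
        s->n_in[v][v + 1] = true;
        s->n_out[v + 1][v] = true;
    }
}
```

A structure built by make_reversed_list satisfies all_links_reversed; deleting a single n[out] link makes reverse_n_succ fail exactly at the cell whose n[in] link is no longer mirrored.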
The omission of necessary instrumentation predicates quickly leads to useless analysis results: an initial minor loss of precision generally leads to a major loss of precision. The methodology with respect to this issue consists of checking whether the provided instrumentation predicates allow capturing both (i) shape properties that characterize the data structure, and (ii) the effects of the procedures in the analyzed program. We needed some trial-and-error steps to define the appropriate instrumentation predicates for sorted trees.

An alternative approach to the problem of choosing appropriate instrumentation predicates would have been to use the method developed by Loginov et al. [Loginov et al. 2005; Loginov 2006] for performing automatic abstraction refinement, using inductive logic programming to identify candidate instrumentation predicates. We did not attempt to use that approach in this work.

[Figure 18: a sequence of two-vocabulary structures over the variables fp1, fp2, and fr1, with n[inp] and n[out] links]

Fig. 18. Combinatorial explosion with the summary of merge: illustration with 2 lists of size 2.

Introduction of “Blurring Functions” in the Analysis.
From the sorted-list examples, one can observe that the analysis time and complexity (in terms of the number of structures representing the summary function) become high for merging and sorting procedures. This is due to the fact that our abstraction is sometimes more precise than necessary, and this can cause combinatorial explosion. For instance, for the merge procedure, the abstraction remembers, for each cell of the resulting list, whether it belonged to the first argument list or the second one. The many possible interleavings of the first cells in the resulting lists cause a combinatorial explosion in the result (see Fig. 18). This is all the more frustrating because this information is rarely relevant: the properties that the summary function of procedure merge should capture accurately are:
(1) Each cell in the result list was a cell in one of the two input lists, and vice versa.
(2) The result is a list, and it is a sorted list.

One way to limit this combinatorial explosion is to apply an extra abstraction step at the end of the procedure. For merge,
(1) we introduce an instrumentation predicate

    r_n_fp1or2[m](v) = r[n, m, fp1](v) ∨ r[n, m, fp2](v)

indicating whether a cell is reachable from one of the two list arguments fp1 and fp2;
(2) we forget the value of n[inp](v1, v2) and related predicates r[n, inp, fp1](v), r[n, inp, fp2](v) and r[n, inp, fr1](v) on all cells reachable from the result, using the assignment:

    n[inp](v1, v2)    = (r[n, out, fr1](v1) ? 1/2 : n[inp](v1, v2))
    r[n, inp, fp1](v) = (r[n, out, fr1](v) ? 1/2 : r[n, inp, fp1](v))
    r[n, inp, fp2](v) = (r[n, out, fr1](v) ? 1/2 : r[n, inp, fp2](v))
    r[n, inp, fr1](v) = (r[n, out, fr1](v) ? 1/2 : r[n, inp, fr1](v))

The same phenomenon holds for sorting procedures, where the main information to be captured is that the resulting list is a sorted permutation of the input list, but where (partial) knowledge about the applied permutation is superfluous. The starred versions of the examples in Tabs. VI and VII refer to versions of the dataflow equations in which such “blurring” functions are introduced to forget information considered irrelevant.

Our methodology with respect to the introduction of blurring functions is related to the choice of instrumentation predicates, with the difference that it is guided only by performance issues. If the existing instrumentation predicates lead to combinatorial explosion with respect to the desired procedure summary, this motivates the application of a blurring function, together with the possible addition of an instrumentation predicate to preserve essential information. Our experience was that adding adequate blurring functions and related instrumentation predicates was quite easy to do, once the origin of the combinatorial explosion was identified (either theoretically or experimentally). The main issue is to blur enough predicates; otherwise some of them might again take on definite values after semantic reduction via Coerce. This is because instrumentation predicates are not independent of each other.

8.3 Improvement Brought About by the Meet Operator

Compared to [Jeannet et al. 2004], our implementation of interprocedural analysis has been improved by the use of a precise meet operation on the abstract domain of 3-valued structures, proposed by [Arnold et al. 2006] and based on graph matching, as mentioned in §7.2. In [Jeannet et al.
2004], we used an approximate implementation of the meet of two 3-valued structures, based on the conversion of one of the argument structures to a set of constraint rules, and the application of these additional constraints to the other argument structure using the Coerce and Focus operations, which were briefly described in §5.2.2. The approximations came both from the conversion to constraints and the restricted use of the Focus operation. To be exact would require the analysis to focus (temporarily) on all predicates common to the two 3-valued structures; for efficiency reasons we decided to focus only on predicates that represent pointer variables. The method was still rather inefficient:
— The conversion of a 3-valued structure to a 3-valued logical formula and then to constraint rules generates many rules, in particular due to the restricted syntax allowed for such rules in TVLA. Given a 3-valued structure of size n on a vocabulary of size p1 + p2, where pi is the number of predicates of arity i, the number of generated constraint rules is in O(n · p1 + n² · p2).
— The Coerce operation is the most expensive operation in the TVLA implementation.12

12 It should be noted that the inefficiency of the Coerce operation was a general problem in past versions of the TVLA system. It motivated the work reported in [Arnold 2006], which obtained substantial speedups by replacing pairs of Focus/Coerce operations by the meet operation whenever possible. It also motivated the work of Bogudlov et al. [2007b; 2007a] who developed techniques that allowed Coerce to run over an order of magnitude faster. These techniques were not incorporated into the version of TVLA that implements the methods described in the present paper. The methods of Bogudlov et al. are essentially orthogonal to the ones that we developed, and thus the speed-ups that would be obtained by incorporating their techniques into our implementation should be comparable to the speed-ups reported in [Bogudlov et al. 2007b; 2007a].

                                                        Old meet                New meet
Program (recursive version, unsorted lists)             # structs  Time (sec)   # structs  Time (sec)
reverse: destructive list reversal                      7          4.7          5/4        2.4
insert: inserts a cell at a random place in a list      23         82           10/7       4.2
delete: removes a cell at a random place in a list      32         84           14/10      4.9

Analysis with the old meet does not include the creation of input lists. It also requires 2 additional instrumentation predicates for the insert and delete examples, due to the approximation induced by the meet.

Table X. Comparison of the use of resources with the old and new meet operators.

The gains obtained by the use of the precise meet operation of [Arnold et al. 2006] are illustrated in Tab. X for a few simple examples. The gain in efficiency is impressive, but the gain in precision is also important: for the insert and delete examples, we did not need to introduce specific instrumentation predicates to capture the effect of these procedures (the triangular pattern shown in Fig. 13(a)). This precision issue prevented us from experimenting with the old meet implementation on our full set of examples.

8.4 Speeding up the Analysis by Modifying the Equations

In §7.3, we discussed the possibility of speeding up the convergence of the analysis by injecting (a subset of) likely reachable states at the start nodes of procedures, which may reduce the number of iteration steps needed for reaching a fixpoint. We experimented with this technique on programs that consist of several recursive procedures: we injected the set of all well-formed (and possibly ordered) lists at the entry of all list-manipulating procedures. The results are given in Tab. XI, and show that in such cases this technique is very effective.

                                     Normal                  Accelerated
Program                              # structs  Time (sec)   # structs  Time (sec)
a. programs on sorted lists
tailsort* (create,insert)            23/3       15           18/3       8.8
insertionsort* (create,insert)       65/3       35           65/3       27
mergesort* (create,split,merge)      69/3       113          48/3       40
b. programs on sorted trees
createo* (insert)                    16/7       654          14/8       250

Table XI. Interprocedural analysis method.

Note that in the examples from Tab. XI, all procedures are recursive, which leads to more complex dependences than in the example shown in Fig. 16. For the insertionsort example depicted in Fig. 19, with the standard technique, insertionsort, and then insert, will first be called with the acc=NULL argument. insert will be analyzed for this base case, then insertionsort will be called with a one-element list in acc, which will be propagated later in the body of insert. So it takes several steps to infer that insert might be called with any sorted list. Injecting the set of all sorted lists at the entry of insert allows the complete summary of insert to be computed more quickly, which also induces a faster propagation of the call to insert in insertionsort.

    void main(){
      List list = create();
      List acc = NULL;
      List res = insertionsort(list,acc);
    }

    /* Moves the cells of list, one by one, into the sorted accumulator acc. */
    List insertionsort(List list, List acc){
      List res,t,tt;
      if (list==NULL)
        res = acc;
      else {
        t = list->n; list->n = NULL;   /* detach the head cell */
        tt = insert(acc,list);
        res = insertionsort(t,tt);
      }
      return res;
    }

    /* Inserts cell at the right place in the sorted list. */
    List insert(List list, List cell){
      List res,t;
      if (list!=NULL && cell->key > list->key){
        t = list->n; list->n = NULL;
        res = insert(t,cell);
        list->n = res;
        res = list;
      }
      else {
        if (list==NULL) cell->n = NULL;
        else cell->n = list;
        res = cell;
      }
      return res;
    }

Fig. 19. Insertionsort example

8.5 Comparison with Cutpoint Semantics and Tabulated Representation

In this section, we compare our method and experimental results with those of Rinetzky et al. [2005].
The two methods are built on the same abstract domain of 3-valued logical structures, and both implement a context- and flow-sensitive interprocedural shape analysis based on procedure summarization. The effective reuse of procedure summaries in different calling contexts motivated the development of mechanisms to allow parts of the heap that are not relevant to the procedure’s actions to be ignored [Rinetzky et al. 2005]. Rinetzky et al. [2005] use a tabulated representation (i.e., using pairs of abstracted structures, rather than abstractions of paired structures) to capture summaries of procedures, and the notion of cutpoints is used to eliminate details of the heap that are inessential to the callee, thereby permitting procedure summaries to be used in different calling contexts.13 In §9, we describe the similarities and differences between the methods more thoroughly, and discuss how a similar effect of eliminating details is obtained essentially for free with our approach.

13 When control is passed from the caller to the callee, the cutpoints represent the frontier of vertices in the part of the heap visible to the callee that are reachable from the caller’s pointer variables (or from pointer variables of other procedures further back in the stack). During the execution of the callee, it is necessary to track all nodes that are reachable from the local variables, the global variables, and the cutpoints, but other parts of the heap structure can be removed.

                   Cutpoint-    Relational
Program            based        std.      *
a. (Sorted) list-manip. programs
create             8            8         8
insert             46           10        10
delete             46           35        35
reverse            32           3         3
revappend          47           7         7
merge              83           161       45
insertionsort      265          >1000     35
tailsort           65           135       15
mergesort          576          >1000     113
b. (Unsorted) tree-manip. programs
create             10           61        14
insert             25           –         47
find               67           –         100
removeRoot         49           –         63
remove             114          –         593
spliceLeft         26           –         47
rotate             43           –         64

Table XII. Times for cutpoint-based analysis vs. relational analysis. All times are in seconds. (The columns labeled * report the times for analysis runs in which blurring functions are applied. A dash (–) means that the run was not attempted.)

Tab. XII compares the two analyses on a set of examples. We followed [Rinetzky et al. 2005] by not analyzing procedures in isolation, but instead analyzing a full program from scratch. This means that the analysis time for the mergesort example includes the analysis time for the creation of a list (create), as well as the auxiliary procedures (split and merge). Both analyses were executed on the same computer, using the same version of the TVLA system (with the exception of the additions mentioned at the very beginning of this section).

Sorted-List Examples. For this set of examples, the relational method is generally as efficient as, or more efficient than, the cutpoint-based method. It should be noted, however, that the relational method sometimes requires the application of blurring functions to obtain reasonable performance, but in such cases the gain in performance is significant, even with respect to the cutpoint-based analysis. The latter is somewhat surprising because the cutpoint-based analysis tabulates pairs of structures, and, as discussed in §6.2 and illustrated in Fig. 13, the information computed by the relational analysis is much more precise than the information that is computed when a tabulated representation is used.

Unsorted-Tree Examples. Because [Rinetzky et al. 2005] did not try to analyze examples with sorted trees, the experiments dealt only with unsorted trees: the ordering relation between tree cells is abstracted away. The execution times are better for the cutpoint-based method, but remain of the same order of magnitude, with the exception of the remove example.
The latter example demonstrates that the advantages of the relational analysis in terms of precision can have a cost in terms of efficiency, even when blurring functions (§8.2) are applied. In the case of the remove procedure, the extra precision of the relational analysis causes the number of cases that the analyzer has to consider to increase: in particular, the set of output two-vocabulary structures at the exit node of remove (i.e., the procedure summary for remove) relates an output tree (in which a cell has been removed) to an input tree (which contains the cell). Consequently, for essentially the same output tree, the analyzer ends up enumerating a number of different two-vocabulary structures according to the different possible positions in the input tree of the cell to be removed: the cell to be removed is the root; the cell to be removed is a left or right child of the root; or the cell to be removed is one that lies deeper in the left or right subtree of the root cell.

9. RELATED WORK

General approaches to interprocedural analysis

One can distinguish two main approaches to interprocedural static analysis. The first approach, called the functional approach (after the name used in [Sharir and Pnueli 1981]), uses a denotational semantics of the analyzed program and consists of two steps. The first step computes predicate transformers associated with the procedures of the program by finding a fixpoint of a set of equations over predicate transformers. The operations used in these equations are (primarily) transformer composition and transformer join. The second step (repeatedly) applies a composed predicate transformer for a program path to some predicate that characterizes the possible input states, to obtain a predicate that holds at the end of the path.
[Cousot and Cousot 1977; Sharir and Pnueli 1981; Knoop and Steffen 1992] apply this approach to different classes of programs. The second approach, which we call the operational approach, adopts an operational semantics for programs. Here, as in many intraprocedural verification techniques, the predicates are propagated along the edges of the program’s control-flow graph, using the predicate transformers associated with program statements and conditions, until a fixpoint is reached. The analysis can be viewed as a symbolic execution of the program in which values are replaced by properties. In contrast with the functional approach, there is no computation of (composed) predicate transformers associated with blocks of instructions or procedures. However, to simulate the execution of the program, one needs to take into account the program’s call stack: when a procedure returns to its caller, the call site should be popped from the stack and the local state of the caller should be restored to the state that it had before the call. The “call-strings” approach of [Sharir and Pnueli 1981] provides one way to address this issue, by maintaining additional information in the abstract domain to over-approximate the state of the call stack. Techniques based on pushdown systems [Bouajjani et al. 1997; Finkel et al. 1997] and weighted pushdown systems [Bouajjani et al. 2003; Reps et al. 2005] contain elements of both the functional and operational approaches. Jeannet and Serwe [Jeannet and Serwe 2004] show how the functional and operational approaches can be derived as an abstract interpretation of the standard operational semantics, modeled using a stack of activation records. Once the interprocedural semantics is defined in this way, a second abstraction step may be used to abstract the data (in our case, the values of variables and linked memory cells). This is the approach we followed in §7.1, with the variations described in §7.3. 
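The two steps of the functional approach can be sketched on a toy example. The C fragment below is an illustration under stated assumptions, not the algorithm of this paper: states are the values 0..5 of a single hypothetical variable x, a predicate transformer is represented concretely as a Boolean relation over these states, and the procedure being summarized is the hypothetical recursive procedure P() { if (x != 0) { x = x - 1; P(); } }. Its summary is the least solution of the equation S = base ∪ (dec ; S), computed with transformer composition and transformer join.

```c
#include <stdbool.h>
#include <string.h>

#define N_STATES 6   /* hypothetical finite state space: x in 0..5 */

/* r[i][k] is true iff the transformer may map state i to state k. */
typedef bool Rel[N_STATES][N_STATES];

/* Transformer composition: out = a ; b (out may alias a or b). */
void rel_compose(Rel a, Rel b, Rel out) {
    Rel tmp;
    memset(tmp, 0, sizeof(Rel));
    for (int i = 0; i < N_STATES; i++)
        for (int j = 0; j < N_STATES; j++)
            if (a[i][j])
                for (int k = 0; k < N_STATES; k++)
                    if (b[j][k]) tmp[i][k] = true;
    memcpy(out, tmp, sizeof(Rel));
}

/* Transformer join: dst = dst U src; returns true iff dst grew. */
bool rel_join_into(Rel dst, Rel src) {
    bool changed = false;
    for (int i = 0; i < N_STATES; i++)
        for (int k = 0; k < N_STATES; k++)
            if (src[i][k] && !dst[i][k]) { dst[i][k] = true; changed = true; }
    return changed;
}

/* Step 1: summary of P as the least fixpoint of S = base U (dec ; S). */
void summarize_P(Rel summary) {
    Rel base, dec, step;
    memset(base, 0, sizeof(Rel));
    memset(dec, 0, sizeof(Rel));
    memset(summary, 0, sizeof(Rel));
    base[0][0] = true;                        /* x == 0: return unchanged */
    for (int x = 1; x < N_STATES; x++)
        dec[x][x - 1] = true;                 /* x != 0: x = x - 1        */
    bool changed = true;
    while (changed) {
        rel_compose(dec, summary, step);      /* recursive call uses S    */
        changed = rel_join_into(summary, base);
        changed = rel_join_into(summary, step) || changed;
    }
}

/* Step 2: apply a transformer to a predicate (a set of input states). */
void rel_apply(Rel r, bool in[N_STATES], bool out[N_STATES]) {
    for (int k = 0; k < N_STATES; k++) {
        out[k] = false;
        for (int i = 0; i < N_STATES; i++)
            if (in[i] && r[i][k]) out[k] = true;
    }
}
```

On this example the iteration adds one pair (x, 0) per round and stabilizes on the relation that maps every x to 0; applying it to any nonempty input predicate yields the predicate x = 0. In the setting of this paper, the concrete relations are replaced by abstracted relations (sets of two-vocabulary 3-valued structures).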
Interprocedural shape analysis

Several other papers have studied interprocedural shape analysis using canonical abstraction. In [Rinetzky and Sagiv 2001], the store is augmented to include the runtime stack as an explicit data structure. The storage abstraction used in [Rinetzky and Sagiv 2001] is an abstraction of the store augmented in this fashion. In essence, the collection of activation records that form the stack are abstracted using an abstraction for linked lists. This “stack-materialization” approach causes certain technical complications; they are not insurmountable, but do cause the designer of an abstract interpretation to have to identify certain shape properties that relate the state of the stack and the state of the heap during the execution of the program (in particular, how the heap cells reachable from the visible and invisible instances of local variables are related). This approach is reminiscent of the “call-strings” approach; in contrast, the approach used in the present paper was inspired by the functional approach, in which the stack is not materialized as an explicit data structure; instead it is an implicit part of the programming-language semantics. Thus, the designer of an abstract interpretation does not need to be concerned with the “shape” of the runtime stack nor with such things as visible and invisible instances of local variables. Because of the different nature of the information obtained, [Rinetzky and Sagiv 2001] can only show that a list reversed twice yields a list with the same head and the same set of memory cells (in some order) as the initial list, while our method shows that it yields the same initial list.

As mentioned in §8.5, [Rinetzky et al. 2005] implements a context- and flow-sensitive analysis that is also inspired by the functional approach, but which uses tabulation to represent the summaries of procedures.
The effective reuse of procedure summaries in different calling contexts is made possible by using the notion of cutpoints and by considering cutpoint-free programs.¹⁴ As with [Rinetzky and Sagiv 2001], the resulting analysis is less precise than ours because its tabulated representation is less expressive than our relational representation, in which one can track (an approximation of) the evolution of individual objects. It may perform more efficiently, particularly on trees, but it has not yet been applied to ordered trees, where the ingredients of invariants satisfied by ordered trees need to be tracked. Our approach has the benefit of generality (it is not restricted to cutpoint-free programs) and conceptual simplicity: it reuses the same algorithms as the intraprocedural analysis, and relational composition is performed using the standard notions of intersection and elimination.

A method for performing interprocedural shape analysis using procedure specifications and assume-guarantee reasoning is presented in [Yorsh et al. 2004]. There, it is assumed that a specification for each procedure (a pre- and post-condition) is already known; the technique presented in [Yorsh et al. 2004] can be used to interpret a procedure’s pre- and post-condition in the most precise way (for a given abstraction). For every procedure invocation, one checks whether the current abstract value potentially violates the pre-condition; if it does, a warning is produced. At the point immediately after the call, one can assume that the post-condition holds. Similarly, when a procedure is analyzed, the pre-condition is assumed to hold on entry, and at the end of the procedure the post-condition is checked. The work described in the present paper is complementary to [Yorsh et al. 2004]: our work provides a way to identify procedure specifications (in the form of sets of 2-vocabulary 3-valued structures) that can be used with the method from [Yorsh et al. 2004].
¹⁴ In cutpoint-free programs [Rinetzky et al. 2005], the nodes pointed to by a caller’s parameters always dominate the nodes that are reachable from the caller’s pointer variables (or from pointer variables of other procedures further back in the stack).

Several techniques have been suggested for automatically checking the partial correctness of programs annotated with loop invariants and pre- and post-conditions [Møller and Schwartzbach 2001; Berdine et al. 2005; Lahiri and Qadeer 2008]. Compared to our approach to shape analysis, those techniques can be faster; in particular, annotations can drastically reduce the cost of interprocedural shape analysis because they allow the correctness of a set of procedures to be checked modularly, using a linear pass over each procedure’s body. However, the burden of requiring programmers to express loop invariants and the required pre- and post-conditions is much higher than the effort required for providing adequate instrumentation predicates in our method.

A recent approach to interprocedural shape analysis is based on separation logic, which has been designed for performing context-independent reasoning about memory shapes [Gotsman et al. 2006]. It must handle cutpoints, and it must abstract them if too many cutpoints appear in the course of the analysis. In our method, pointer variables to cutpoints in the caller are forgotten in the callee, but are recovered upon return from the callee thanks to the meet operation in Eqn. (14). Cutpoints were also used to develop interprocedural shape-analysis algorithms that are not based on canonical abstractions [Marron et al. 2008].
We believe that the principles that underlie our relational analysis (i.e., the use of abstractions of two-vocabulary structures) are also applicable for other abstractions as long as they support the right interface operations (e.g., projection and meet).

“Heap Modularity.” Both [Rinetzky et al. 2005] and [Gotsman et al. 2006] state that their techniques are fully “heap modular” in the sense that the procedure summaries computed by the analyses deal only with the reachable parts of the heap and ignore the (unreachable) context of the caller in the callee, which cannot be modified by the callee. This effect is obtained naturally with our approach. Because most core and instrumentation predicates are related to reachability from visible variables, the part of the heap that is not reachable from the local variables in a callee is summarized with a single (or a few) “context” summary nodes. When the callee returns to its caller, this context summary node is materialized again by the meet of the summary relation at the return site of the callee and the relation at the call site. In fact, because some predicates are independent of reachability properties (predicates related to cyclicity or sharing), there may be several context summary nodes. In such cases, at the entry of the callee, a predicate-update formula (cf. §4.3) may be used to assign the value 1/2 to those predicates for non-reachable cells, as follows:

$$p'(v) = \mathit{reachable\_from\_input\_parameters}(v) \;?\; p(v) \;:\; 1/2$$

This induces a more effective merging of abstract cells not reachable in the callee (hence not modifiable by the callee). The information is recovered during the processing at the procedure return site.

Abstract transformers. The analysis described in this paper uses 3-valued structures over a doubled vocabulary. A similar approach is standard when concrete transition relations are expressed by means of formulas. For instance, the semantics of a statement x := y+1 can be expressed as $(x' = y + 1) \land (y' = y)$.
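As a worked illustration that is not taken from the paper, consider composing the relation for x := y+1 with the relation for a hypothetical second statement y := x. The double-primed symbols below are an assumed intermediate vocabulary introduced only for this example; composition conjoins the two formulas and then eliminates the intermediate vocabulary by existential quantification.

```latex
\begin{align*}
R_1(x, y, x'', y'')  &\equiv (x'' = y + 1) \land (y'' = y)
  && \text{(x := y+1, output renamed to the intermediate vocabulary)} \\
R_2(x'', y'', x', y') &\equiv (x' = x'') \land (y' = x'')
  && \text{(y := x, input renamed to the intermediate vocabulary)} \\
(R_1 ; R_2)(x, y, x', y') &\equiv \exists x'', y'' .\; R_1 \land R_2 \\
  &\equiv (x' = y + 1) \land (y' = y + 1)
\end{align*}
```

The conjunction step corresponds to intersection and the quantifier step to elimination; in this concrete setting the result is exact, whereas in an abstract domain the corresponding steps generally yield an over-approximation.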
Statements such as x := y+1 can be transformed into composable abstract transformers for programs that manipulate numeric data, using several numeric lattices (e.g., polyhedra [Cousot and Halbwachs 1978], octagons [Miné 2006], etc.). A key feature of the approach described in the present paper is that relational instrumentation predicates can refer to both the P[inp] and P[out] vocabularies. For instance, the family of unary predicates $\mathit{reverse\_n\_succ}[m_1, m_2]$ discussed in §8 (with $m_1, m_2 \in \{\mathit{inp}, \mathit{out}\}$ and $m_1 \neq m_2$) records whether $n[m_2]$ is an inverse of $n[m_1]$.

The classic functional approach of Sharir and Pnueli [Sharir and Pnueli 1981] uses function composition for all operations. As is typically done in analyses based on numerical abstract domains [Cousot and Halbwachs 1978; Miné 2006], the approach taken in this paper might be more properly described as a hybrid approach:

(1) Intraprocedural propagation is based on a form of transformer application, rather than transformer composition. That is, for an intraprocedural propagation with respect to transformer τ, the actions of τ are applied to the second vocabulary, with the first vocabulary kept constant.
(2) Interprocedural propagation is based on the composition of two-vocabulary structures (using three-vocabulary structures, structure meet, and vocabulary projection).

For shape analysis, the advantage of the hybrid approach has to do with the maintenance of instrumentation predicates that express reachability properties. The application step used in item (1) is satisfactory when there are unit-size changes to core relations: the instrumentation-predicate-maintenance formulas created by finite differencing [Reps et al. 2003] are generally able to maintain definite values for instrumentation predicates that express reachability properties for unit-size changes to core predicates.
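The two propagation modes of items (1) and (2) can be sketched on a toy concrete model. This is only an illustration under stated assumptions, not the paper's implementation: plain sets of (input state, output state) pairs stand in for two-vocabulary 3-valued structures, set intersection stands in for structure meet, and all names (`STATES`, `apply_statement`, `compose`, and the three statement functions) are hypothetical.

```python
from itertools import product

# Toy concrete model (an assumption for illustration): a state is a pair (x, y),
# and a two-vocabulary relation is a set of (input_state, output_state) pairs.
STATES = set(product(range(3), range(3)))

def identity(states):
    """Relation at procedure entry: every state is related to itself."""
    return {(s, s) for s in states}

def apply_statement(rel, tau):
    """Item (1): apply tau to the output vocabulary; the input vocabulary is kept constant."""
    return {(inp, tau(out)) for inp, out in rel}

def compose(rel1, rel2):
    """Item (2): compose via three-vocabulary tuples, meet, and projection."""
    inputs = {i for i, _ in rel1}
    outputs = {o for _, o in rel2}
    # Lift each relation to (inp, mid, out) triples. The vocabulary a relation does
    # not mention is left unconstrained (restricted here to values that occur,
    # which keeps the sets finite without changing the result).
    lifted1 = {(i, m, o) for i, m in rel1 for o in outputs}
    lifted2 = {(i, m, o) for m, o in rel2 for i in inputs}
    meet = lifted1 & lifted2               # set intersection plays the role of meet
    return {(i, o) for i, _, o in meet}    # project away the middle vocabulary

def x_gets_y_plus_1(state):
    x, y = state
    return (y + 1, y)

def y_gets_x(state):
    x, y = state
    return (x, x)

def y_gets_0(state):
    x, y = state
    return (x, 0)

# Summary of a hypothetical callee body "x := y+1; y := x", built by application.
summary = apply_statement(apply_statement(identity(STATES), x_gets_y_plus_1), y_gets_x)
# Relation from the caller's entry to the call site, after a hypothetical "y := 0".
caller = apply_statement(identity(STATES), y_gets_0)
# Relation from the caller's entry to the return site, obtained by composition.
at_return = compose(caller, summary)
```

In this concrete model, composing with the summary gives the same relation as applying the callee's statements one by one to the caller's relation. The point made in the surrounding text concerns the abstract setting, where the two modes can differ in how well they preserve definite values of instrumentation predicates.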
The (approximate) composition step used in item (2) generally allows definite values to be retained under the non-unit-size changes to core predicates that occur when applying a procedure summary.

Acknowledgments. We are grateful to V. Kuncak for several discussions about the use of two-vocabulary structures in shape analysis; to N. Rinetzky for many discussions about interprocedural shape-analysis methods, as well as for his help with the experiments that compare our methods with his; and to G. Arnold for his help incorporating his work on the meet operation into our implementation.

REFERENCES

Arnold, G. 2006. Specialized 3-valued logic shape analysis using structure-based refinement and loose embedding. In Static Analysis Symposium, SAS’06. LNCS, vol. 4134.
Arnold, G., Manevich, R., Sagiv, M., and Shaham, R. 2006. Combining shape analyses by intersecting abstractions. In Int. Conf. on Verification, Model Checking and Abstract Interpretation, VMCAI’06. LNCS, vol. 3855.
Ball, T. and Rajamani, S. 2001. Bebop: A path-sensitive interprocedural dataflow engine. In Prog. Analysis for Softw. Tools and Eng. 97–103.
Berdine, J., Calcagno, C., and O’Hearn, P. W. 2005. Smallfoot: Modular automatic assertion checking with separation logic. In FMCO’05. LNCS, vol. 4111. Springer, 115–137.
Bogudlov, I., Lev-Ami, T., Reps, T., and Sagiv, M. 2007a. Revamping TVLA: Making parametric shape analysis competitive. Tech. Rep. TR-2007-01-01, Tel-Aviv Univ., Tel-Aviv, Israel.
Bogudlov, I., Lev-Ami, T., Reps, T., and Sagiv, M. 2007b. Revamping TVLA: Making parametric shape analysis competitive (tool paper). In Int. Conf. on Computer Aided Verif. LNCS, vol. 4590.
Bouajjani, A., Esparza, J., and Maler, O. 1997. Reachability analysis of pushdown automata: Application to model checking. In Proc. CONCUR. LNCS, vol. 1243. Springer-Verlag, 135–150.
Bouajjani, A., Esparza, J., and Touili, T. 2003. A generic approach to the static analysis of concurrent programs with procedures. In Princ. of Prog. Lang. 62–73.
Clarke, Jr., E., Grumberg, O., and Peled, D. 1999. Model Checking. The M.I.T. Press.
Cousot, P. and Cousot, R. 1977. Static determination of dynamic properties of recursive procedures. In Formal Descriptions of Programming Concepts, E. Neuhold, Ed. North-Holland, 237–277.
Cousot, P. and Halbwachs, N. 1978. Automatic discovery of linear constraints among variables of a program. In Princ. of Prog. Lang. 84–96.
Finkel, A., Willems, B., and Wolper, P. 1997. A direct symbolic approach to model checking pushdown systems. Elec. Notes in Theor. Comp. Sci. 9.
Gopan, D., DiMaio, F., Dor, N., Reps, T., and Sagiv, M. 2004. Numeric domains with summarized dimensions. In Tools and Algs. for the Construct. and Anal. of Syst. LNCS, vol. 2988. 512–529.
Gotsman, A., Berdine, J., and Cook, B. 2006. Interprocedural shape analysis with separated heap abstractions. In Static Analysis Symp. LNCS, vol. 4134. 240–260.
Gries, D. 1981. The Science of Programming. Springer-Verlag.
Jeannet, B., Loginov, A., Reps, T., and Sagiv, M. 2004. A relational approach to interprocedural shape analysis. In Static Analysis Symp. LNCS, vol. 3148.
Jeannet, B. and Serwe, W. 2004. Abstracting call-stacks for interprocedural verification of imperative programs. In Algebraic Methodology and Software Technology, AMAST’04. LNCS, vol. 3116.
Knoop, J. and Steffen, B. 1992. The interprocedural coincidence theorem. In Comp. Construct. LNCS, vol. 641. 125–140.
Lahiri, S. K. and Qadeer, S. 2008. Back to the future: Revisiting precise program verification using SMT solvers. In Princ. of Prog. Lang.
Lev-Ami, T., Reps, T., Sagiv, M., and Wilhelm, R. 2000. Putting static analysis to work for verification: A case study. In Int. Symp. on Softw. Testing and Analysis. 26–38.
Lev-Ami, T. and Sagiv, M. 2000. TVLA: A system for implementing static analyses. In Static Analysis Symp. LNCS, vol. 1824. 280–301.
Loginov, A. 2006. Refinement-based program verification via three-valued-logic analysis. Ph.D. thesis, Comp. Sci. Dept., Univ. of Wisconsin, Madison, WI. Tech. Rep. 1574.
Loginov, A., Reps, T., and Sagiv, M. 2005. Abstraction refinement via inductive learning. In Int. Conf. on Computer Aided Verif. LNCS, vol. 3576.
Manna, Z. and Pnueli, A. 1995. Temporal Verification of Reactive Systems: Safety. Springer-Verlag.
Marron, M., Hermenegildo, M. V., Kapur, D., and Stefanovic, D. 2008. Efficient context-sensitive shape analysis with graph based heap models. In Comp. Construct. LNCS, vol. 4959. 245–259.
Miné, A. 2006. The octagon abstract domain. Higher-Order and Symbolic Computation 19, 1, 31–100.
Møller, A. and Schwartzbach, M. I. 2001. The pointer assertion logic engine. In Prog. Lang. Design and Impl. 221–231.
Reps, T., Horwitz, S., and Sagiv, M. 1995. Precise interprocedural dataflow analysis via graph reachability. In Princ. of Prog. Lang. ACM Press, New York, NY, 49–61.
Reps, T., Sagiv, M., and Loginov, A. 2003. Finite differencing of logical formulas for static analysis. In European Symp. on Programming. LNCS, vol. 2618. 380–398.
Reps, T., Schwoon, S., Jha, S., and Melski, D. 2005. Weighted pushdown systems and their application to interprocedural dataflow analysis. Sci. of Comp. Prog. 58, 1–2 (Oct.), 206–263.
Rinetzky, N., Bauer, J., Reps, T., Sagiv, M., and Wilhelm, R. 2005. A semantics for procedure local heaps and its abstraction. In Proc. of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’05).
Rinetzky, N. and Sagiv, M. 2001. Interprocedural shape analysis for recursive programs. In Comp. Construct. LNCS, vol. 2027. 133–149.
Rinetzky, N., Sagiv, M., and Yahav, E. 2005. Interprocedural shape analysis for cutpoint-free programs. In Static Analysis Symposium, SAS’05. LNCS, vol. 3672.
Sagiv, M., Reps, T., and Horwitz, S. 1996. Precise interprocedural dataflow analysis with applications to constant propagation. Theor. Comp. Sci. 167, 131–170.
Sagiv, M., Reps, T., and Wilhelm, R. 2002. Parametric shape analysis via 3-valued logic. Trans. on Prog. Lang. and Syst. 24, 3, 217–298.
Sharir, M. and Pnueli, A. 1981. Two approaches to interprocedural data flow analysis. In Program Flow Analysis: Theory and Applications, S. Muchnick and N. Jones, Eds. Prentice-Hall, Englewood Cliffs, NJ, Chapter 7, 189–234.
Yorsh, G., Reps, T., and Sagiv, M. 2004. Symbolically computing most-precise abstract operations for shape analysis. In Tools and Algs. for the Construct. and Anal. of Syst. LNCS, vol. 2988. 530–545.
