Taming the Parallel Effect Zoo: Extensible Deterministic Parallelism with LVish

Lindsey Kuper, Aaron Todd, Sam Tobin-Hochstadt, Ryan R. Newton
Indiana University
{lkuper, toddaaro, samth, rrnewton}@cs.indiana.edu

Abstract

A fundamental challenge of parallel programming is to ensure that the observable outcome of a program remains deterministic in spite of parallel execution. Language-level enforcement of determinism is possible, but existing deterministic-by-construction parallel programming models tend to lack features that would make them applicable to a broad range of problems. Moreover, they lack extensibility: it is difficult to add or change language features without breaking the determinism guarantee.

The recently proposed LVars programming model, and the accompanying LVish Haskell library, took a step toward broadly applicable guaranteed-deterministic parallel programming. The LVars model allows communication through shared monotonic data structures to which information can only be added, never removed, and for which the order in which information is added is not observable. LVish provides a Par monad for parallel computation that encapsulates determinism-preserving effects while allowing a more flexible form of communication between parallel tasks than previous guaranteed-deterministic models provided.

While applying LVar-based programming to real problems using LVish, we have identified and implemented three capabilities that extend its reach: inflationary updates other than least-upper-bound writes; transitive task cancellation; and parallel mutation of non-overlapping memory locations. The unifying abstraction we use to add these capabilities to LVish, without suffering added complexity or cost in the core LVish implementation or compromising determinism, is a form of monad transformer, extended to handle the Par monad.

With our extensions, LVish provides the most broadly applicable guaranteed-deterministic parallel programming interface available to date. We demonstrate the viability of our approach both with traditional parallel benchmarks and with results from a real-world case study: a bioinformatics application that we parallelized using our extended version of LVish.

Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Concurrent programming structures; D.1.3 [Concurrent Programming]: Parallel programming; D.3.2 [Language Classifications]: Concurrent, distributed, and parallel languages

Keywords Deterministic parallelism

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
PLDI '14, June 9-11, 2014, Edinburgh, United Kingdom.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2784-8/14/06...$15.00.
http://dx.doi.org/10.1145/2594291.2594312

1. Expressive Deterministic Parallelism

For parallelism to become the norm it must become easier. One great stride in this direction would be to provide deterministic-by-default¹ parallel languages that are broadly applicable, available, and practical, enabling more software to avoid heisenbugs by construction. Historically, language-level enforcement of determinism can be found in languages based on synchronous dataflow [8], data-parallel languages [5, 20], and languages with advanced permissions systems that prevent data races [3]. However, virtually all practical parallel programs are written in traditional languages (e.g., C++, Java, Fortran), rather than in these more restricted languages that could guarantee determinism. It must be that the benefits of determinism do not yet in practice outweigh the limitations of guaranteed-deterministic programming models.

One practical issue is that many applications do not fit into a single, restrictive paradigm (such as synchronous dataflow parallelism or functional task parallelism), which is what most guaranteed-deterministic parallel programming models offer. We find that would-be guaranteed-deterministic parallel programs either have multiple components spanning different paradigms, or they depend on deterministic algorithms not yet expressible in a guaranteed-deterministic fashion. For deterministic languages to make an impact, therefore, they require the breadth to span multiple paradigms and accommodate a wide variety of algorithms.

Determinism and its breadth dilemma. Deterministic parallel languages necessarily restrict effects. If they allowed arbitrary access to shared memory locations, nondeterminism would directly follow. The particular restrictions on effects vary by language: in a stream-processing language (e.g., StreamIt [8]), the only effects possible within a stream filter are push() and pop() operations on a linear stream data structure. Deterministic data-parallel languages [5, 20], on the other hand, typically do not allow effects in parallel regions at all, encapsulating parallelism in aggregate operations such as map and fold that apply pure element-wise functions.
Clearly, many deterministic parallel programs cannot be expressed with these abstractions: for example, an asynchronous quorum voting program, where a vote succeeds (and causes an effect) only when the number of “aye”s exceeds a threshold.

Sharply restricted communication and synchronization capabilities have consequences not only for the immediate usability of guaranteed-deterministic languages, but for those languages’ extensibility as well. In an unrestricted parallel language, such as Java, new synchronization or communication constructs can always be implemented as needed, without changing the language, but no deterministic parallel language has offered anything comparable.

¹ We refer here to external determinism, also called determinacy. Of course, many parallel applications depend critically on observably nondeterministic behavior, for example, hardware designs and GUIs. These are not candidates for deterministic execution, but that still leaves many that are.

The difficulty of extensibility means that having a fully fleshed-out set of built-in parallel primitives is more important in a deterministic parallel language than it is in a traditional, unrestricted language. If a guaranteed-deterministic parallel language must limit the user to a certain set of idioms, then those idioms should encompass as much functionality as possible.

Choosing such a set of broadly applicable built-in parallel idioms is not easy. They must preserve determinism even under arbitrary composition and even against an adversarial programmer. Determinism is a global property of the language that can be difficult to verify, as the proofs of determinism for such languages testify [3, 4, 9, 11], and composition of language features may not preserve determinism, even if those features behave deterministically in isolation. Finally, adding features to a parallel runtime system can become an increasingly delicate engineering challenge as the feature list grows [7].
LVars: a step forward. In this paper we build on our previous work on the LVars programming model [11, 12] and the accompanying LVish library for deterministic parallel programming in Haskell. LVars (which we review in more detail in Section 2) are shared monotonic data structures to which information can only be added, never removed, and for which the order in which information is added is not observable. The key insight behind LVars is that the states a shared data structure can take on have an ordering, and updates preserve that ordering because writes take the least upper bound (lub) of the previous value and the new value. Because the lub operation is idempotent, multiple writes of the same value to a single location can be allowed deterministically.

LVars are already far more expressive than the write-once IVars [1] of prior work on deterministic parallelism, and are a step in the direction of broadly applicable deterministic parallelism. Unfortunately, they fall short in a wide variety of domains where deterministic parallel algorithms should be expressible.

Our contributions. In this paper we describe the design and implementation of extensions to LVish that enable the following capabilities, while leaving the basic model intact:

• Commutative (but non-idempotent) read-modify-write operations such as fetch-and-add (Section 3). We make use of this capability to parallelize PhyBin [18], a bioinformatics application for comparing genealogical histories (phylogenetic trees) that relies heavily on a parallel tree-edit distance computation.

• Full access to mutable state with enforced disjoint access for parallel threads (Section 5). The addition of this feature to LVish is, to our knowledge, the first integration of parallel updates to mutable memory (à la Deterministic Parallel Java [3]) with blocking dataflow communication in a guaranteed-deterministic programming model.
• Deterministic speculation and cancellation, which have a particular synergy with a new data structure we add to LVish for memoization (Section 6).

The result of our work is a significantly extended LVish library, now well suited for parallelizing a far broader variety of pre-existing Haskell programs. LVish is implemented purely as a Haskell library, even though it provides features that usually require language extensions, e.g., enforced alias-free mutable data to support disjoint parallel update. Further, we show in Section 7 that our new library is effective, providing parallel speedup on benchmarks old and new, and a 3.35× parallel speedup on eight cores for the parallelized PhyBin application.

In order to implement our extensions to LVish, we introduce Par-monad transformers (Section 4), an extension of traditional monad transformers. Using transformers to add new capabilities only where they are required has two benefits: first, the cost of added functionality is paid only when it is needed, and, second, it minimizes the impact of our changes on the core LVish scheduler. Moreover, the infrastructure we have created for Par-monad transformers paves the way for future extensibility and provides a modular way to think about determinism guarantees.

2. Background: LVars and LVish

The LVars programming model [11, 12] offers a principled approach to guaranteed-deterministic parallel programming with shared state. In this section we review previous work on LVars and LVish, a Haskell library that implements the LVars programming model; Sections 3-6 then describe our extensions to LVish.

An LVar is a mutable data structure that can be shared among multiple threads.
Unlike an ordinary shared mutable data structure, though, LVars come with a determinism guarantee: a program in which all communication among threads takes place through LVars (and in which there are no other observable side effects) is guaranteed to evaluate to the same value on every run, regardless of thread scheduling. This determinism property holds because for every LVar, the set of states that the LVar can take on forms a lattice² specific to that LVar, and the semantics of reading from and writing to the LVar is defined in terms of this lattice of states.

The two fundamental LVar operations are put, for writing, and get, for reading. At a high level:

• A put operation can only change an LVar’s state in a way that is inflationary with respect to its lattice. Informally, the contents of an LVar must stay the same or “grow bigger” with each write. This is guaranteed to be the case because put takes the least upper bound (lub) of the current state and the new state with respect to the lattice.

• A get operation allows limited observations of the state of an LVar. The key idea is that reads from an LVar are threshold reads: they only return a value when the LVar’s state meets a certain (monotonic) criterion, or “threshold”, and the value returned is the same regardless of how high above the threshold the LVar’s state goes. We give a concrete example later in this section.

Together, least-upper-bound puts and threshold gets guarantee that programs behave in an observably deterministic way, despite schedule nondeterminism and concurrent access to shared memory.
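The put and get behavior described above can be illustrated with a minimal sketch. It does not use the LVish API: MaxLVar, putMax, and tryGetMax are hypothetical names, the lattice is assumed to be the natural numbers ordered by ≤ (so the lub is max), and the threshold read is modeled as a non-blocking check rather than a blocking get.

```haskell
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)

-- A toy "LVar" over the naturals ordered by (<=), whose lub is max.
-- Hypothetical type for illustration only; not part of LVish.
newtype MaxLVar = MaxLVar (IORef Int)

newMaxLVar :: IO MaxLVar
newMaxLVar = MaxLVar <$> newIORef 0      -- 0 is the least element

-- put takes the lub (here: max) of the current state and the new value,
-- so the state can only stay the same or grow.
putMax :: MaxLVar -> Int -> IO ()
putMax (MaxLVar ref) new =
  atomicModifyIORef' ref (\cur -> (max cur new, ()))

-- Non-blocking model of a threshold read: it answers only once the state
-- has reached the threshold, and then reports the threshold itself rather
-- than the exact state. A real get would block instead of returning Nothing.
tryGetMax :: MaxLVar -> Int -> IO (Maybe Int)
tryGetMax (MaxLVar ref) threshold = do
  cur <- readIORef ref
  return (if cur >= threshold then Just threshold else Nothing)

main :: IO ()
main = do
  lv <- newMaxLVar
  before <- tryGetMax lv 3
  mapM_ (putMax lv) [2, 5, 4]            -- any order of these puts yields state 5
  after <- tryGetMax lv 3
  print (before, after)                  -- prints (Nothing,Just 3)
```

The read reports 3 even though the state has reached 5, which is what makes the observation independent of how far the writes have progressed.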
Furthermore, since the LVars model is lattice-generic, it guarantees the safety of arbitrary compositions of programs mixing and matching concurrent data structures, so long as the state spaces of those data structures can be viewed as lattices and the operations their APIs expose are expressible in terms of puts and gets.³

² Formally, the lattice of states is given as a 4-tuple (D, ⊑, ⊥, ⊤) where D is a set, ⊑ is a partial order on D, ⊥ is D’s least element according to ⊑, and ⊤ is D’s greatest element. We do not require that every pair of elements in D have a greatest lower bound, only a least upper bound; hence (D, ⊑, ⊥, ⊤) is really a bounded join-semilattice with a designated greatest element (⊤). For brevity, we use the term “lattice” as a shorthand.

³ In practice, it is also important to be able to register latent event handlers that run when puts that change the state of an LVar occur, but these are equivalent to an implicit set of functions blocked on gets.

The LVish library. LVish is an implementation of the LVars programming model as a library in Haskell. Like the monad-par library that preceded it [16], the LVish library provides a Par monad for encapsulating parallel computation. Par computations run in lightweight, library-level threads that are scheduled by a custom work-stealing scheduler provided by LVish.⁴ LVish also provides a variety of LVar data structures (e.g., sets, maps, graphs) that support concurrent insertion, but not deletion, during Par computations. In addition to the data structures the LVish library provides, users may implement their own LVar data structures (although note the proof obligations for data structure implementors, below).
LVars can be quite sophisticated and correspond to many physical memory locations (e.g., implemented as a concurrent skip list or bag), but the simplest way to implement an LVar data structure (and the easiest way to satisfy said proof obligations) is to represent it as a single, pure value in a mutable box. LVish provides a PureLVar type constructor to facilitate the definition of such “pure” LVars.

An example: parallel “and”. Consider an LVar that stores the result of a parallel logical “and” operation between two inputs. At any point in time, each input will have written true, false, or nothing yet. In Haskell, we encode this:

    data Inp = Bot | T | F
      deriving (Eq)   -- equality is needed by the join function

The state of a complete parallel-and LVar, then, would capture the state of each of its inputs, plus the possibility of an error (the top state in Figure 1). In Haskell we model this using the following type for lattice states:

    type State = Maybe (Inp,Inp)

    top :: State
    top = Nothing

Maybe is commonly used for computations that might fail. Indeed, here top (Nothing) represents a failure, whereas the Just (Bot,Bot) state is the least element of the lattice and represents the state of the LVar before any writes have taken place. Then, when one of the two inputs becomes available, the state moves to the second tier of states above (Bot,Bot), and then to the third if a second input arrives.
The join (lub) function for States combines their information, and is as simple as writing down the lattice from Figure 1 as a total function in Haskell:

    instance JoinSemiLattice State where
      -- Use the Maybe monad to keep this code simple
      join a b = do
        (x1,y1) ← a
        (x2,y2) ← b
        x3 ← joinInp x1 x2
        y3 ← joinInp y1 y2
        return (x3,y3)

    joinInp :: Inp → Inp → Maybe Inp
    joinInp x y | x == y = Just x
    joinInp Bot x = Just x
    joinInp T F = Nothing   -- conflicting writes: the enclosing join yields top
    joinInp x y = joinInp y x

    instance BoundedJoinSemiLattice State where
      bottom = Just (Bot,Bot)

Next, we can use the PureLVar type constructor provided by LVish to define an LVar type called AndLV, whose states are of the State type we just defined:

    type AndLV = PureLVar State

[Figure 1. Lattice of states that a parallel-and LVar (that is, an AndLV) can take on. From top to bottom, the tiers are: top; (T,T), (F,T), (T,F), (F,F); (T,Bot), (Bot,T), (F,Bot), (Bot,F); (Bot,Bot). The five red states in the lattice correspond to a false result, and the one green to a true one.]

Threshold reads from an AndLV. Under what circumstances can we deterministically read from an AndLV? It is not safe, for example, to test whether one or both inputs have been written at a point in time. Rather, we must describe a monotonic threshold function for when the read may return. One handy way to do this is to create a threshold set of (pairwise incompatible) “trigger values”. When the state of the LVar moves above a trigger, the trigger is returned as the result of the get operation.

⁴ LVish is available at http://hackage.haskell.org/package/lvish. It generalizes the original Par monad exposed by the monad-par library (http://hackage.haskell.org/package/monad-par), which allowed determinism-preserving communication between threads using IVars: single-assignment variables with blocking read semantics. IVars are a special case of LVars, corresponding to a lattice with one “empty” and multiple “full” states, where ∀i. empty < full_i.
For example:

    getAndLV :: AndLV → Par Bool
    getAndLV lv = do
      let bothtrue = [Just (T,T)]
          anyfalse = map Just [(F,Bot),(Bot,F),
                               (F,T),(T,F),(F,F)]
      res ← getPureLVar lv [bothtrue, anyfalse]
      return (res == bothtrue)

Here, bothtrue and anyfalse are the threshold triggers.⁵ These two sets are pairwise incompatible (that is, the lub of Just (T,T) and each element of anyfalse is top), and thus no more than one of them can be activated by monotonic change in lv’s state. When getPureLVar returns, res holds whichever of the triggers was activated. Note that getPureLVar may unblock after only one input is written, if that input is false; otherwise, it must wait for the second input.

⁵ The original LVars programming model [12] only allows each trigger to contain a single lattice state, but in practice, we allow ourselves to use more general monotonic threshold functions. Using sets of states as triggers (while still satisfying pairwise incompatibility) is a relatively minor generalization.

Adding parallelism. An AndLV variable can be shared between threads, but does not itself add parallelism. The next step is to build a combinator for launching two boolean computations in parallel, and returning the result of their logical and:

    asyncAnd :: Par Bool → Par Bool → Par Bool
    asyncAnd m1 m2 = do
      res ← newPureLVar bottom
      fork (do b1 ← m1
               putPureLVar res (Just (toInp b1,Bot)))
      fork (do b2 ← m2
               putPureLVar res (Just (Bot,toInp b2)))
      getAndLV res

    toInp True = T
    toInp False = F

Finally, to run an asyncAnd computation, we can use the runPar operation provided by LVish, which converts Par a to a, initializing the library’s scheduler and running the parallel computation. Thus runPar provides the means by which parallel LVish code can be embedded inside ordinary, pure Haskell programs.
For example, to fold asyncAnd over the results of 100 trivial boolean computations launched in parallel, we could write:

    main = do
      print (runPar $
        foldr asyncAnd (return True)
          (concat (replicate 100 [return True, return False])))

Proof obligations for LVish data structures. When implementing a data structure with LVish, it is the data structure author’s obligation to ensure that the states of their data structure correspond to elements of a lattice, and that the operations in the API they expose would be expressible using the aforementioned put and get operations. To put it another way, operations on a data structure exposed as an LVar must have the semantic effect of a lub for writes or a threshold for reads, but none of this need be visible to clients. Any data structure API that provides such a semantics is guaranteed to provide deterministic communication.

Because AndLV has a finite lattice, its join function can be trivially and exhaustively verified to compute a lub. In fact, the following list comprehension generates every possible input:

    [ join x y | x ← [ bottom .. top ]
               , y ← [ bottom .. top ] ]

Likewise, it is trivial to verify this join’s associativity, commutativity, and idempotence. In general, however, definitions of LVar-based data structures, their join functions, or their get operations should occur only in trusted code. Fortunately, most LVish applications need not define any new LVar-based data structures and instead make use of those that the LVish library provides.

A missing feature: task cancellation. The AndLV LVar just described makes it possible to do a “short-circuit” computation because getAndLV unblocks and returns a result as soon as any F is written. However, it is still the case that the other thread writing to the AndLV runs to completion. Although this cannot affect the deterministic outcome of the computation, it needlessly uses up cycles.
This motivates the desire to be able to cancel an in-flight thread that cannot affect the deterministic outcome of a Par computation. We discuss our extension that enables cancellation in Section 6.

Features not covered here. In this section we overviewed LVars and their use, but have not covered all features in detail. In particular, it is possible to register handlers that are invoked whenever an LVar changes, and also to freeze an LVar to read its contents exactly. For example, freezing enables iteration over the full contents of collection LVars, and can be done unsafely during parallel computation, or safely upon exiting the parallel computation (runParThenFreeze). Both these features are covered in detail in previous work [12].

3. Warm-up: Read-modify-write extension

In the original LVars model [11, 12], the only way for the state of an LVar to evolve over time is through a series of least-upper-bound (lub) updates resulting from calls to put operations. Unfortunately, this way of updating an LVar provides no efficient way to model an atomically incremented counter that occupies one memory location. Yet, atomic increments to such a counter are efficient, commutative, and ultimately fit well inside the LVish framework.

Hence the most basic extension we make to LVish is to (optionally) relax its reliance on idempotence of operations. Thereafter we can safely add a restricted set of atomic read-modify-write operations that are inflationary with respect to a lattice, but do not compute a lub. We will see one example of an application that critically requires this functionality in Section 7.1.

We consider a new family of LVar update operations that are required to commute and be inflationary with respect to the lattice in question, but are not limited to the lub semantics of put. Specifically, for a lattice (D, ⊑, ⊥, ⊤), a data structure author may define a set of bump operations bump_i : D → D, which must meet the following two conditions:

• ∀a, i. a ⊑ bump_i(a)

• ∀a, i, j. bump_i(bump_j(a)) = bump_j(bump_i(a))

As a simple example, consider an LVar whose states are natural numbers, with 0 as the least element and with the usual ≤ operation on natural numbers as the ordering on states. The ordering induces a lub operation equivalent to the max operation on natural numbers. We can implement a family of bump operations that increment the LVar by various amounts: {(+1), (+2), (+3), ...}.

A critical point to note, however, is that it is not safe to update the same LVar with both put and bump. For example, a put of 4 and a bump(+1) do not commute! If we start with an initial state of 0 and the put occurs first, then the state of the LVar changes to 4 since max(0, 4) = 4, and the subsequent bump(+1) updates it to 5. But if the bump(+1) happens first, then the final state of the LVar will be max(1, 4) = 4. Furthermore, multiple distinct families of bump functions only commute among themselves and cannot be combined.

In practice, this distinction is enforced by the type system. For example, the LVish-provided set data structure Data.LVar.Set supports only put, whereas Data.LVar.Counter supports only bump. Composing these data structures, however, is fine. For example, an LVar could represent a monotonically growing collection (which supports put) of counter LVars, where each counter is itself monotonically increasing and supports only bump. Indeed, the PhyBin application described in Section 7.1 uses just such a collection of counters.

Determinism guarantee. The determinism of LVish [11] relies on the fact that the states of all LVars evolve monotonically with respect to their lattices, and that the lub operation is commutative and therefore the order in which puts occur does not matter. Together, these properties suffice to ensure that the threshold reads made by get operations are deterministic.
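The two bump conditions, and the conflict between put and bump on the natural-number lattice, can be checked with a small pure model. This is only a sketch under stated assumptions: put and bump here are hypothetical pure functions on Int states rather than LVish operations, and the laws are tested on a finite sample instead of being proved.

```haskell
-- States are naturals ordered by (<=); the lub is max.
type Nat = Int

-- Hypothetical pure model of put: take the lub with the current state.
put :: Nat -> Nat -> Nat
put v s = max v s

-- Hypothetical pure model of the bump family {(+1), (+2), ...}.
bump :: Nat -> Nat -> Nat
bump k s = s + k

-- Inflation: a is below bump_i(a), checked on a finite sample.
bumpsInflate :: Bool
bumpsInflate = and [ a <= bump i a | a <- [0 .. 20], i <- [1 .. 5] ]

-- Commutativity: bump_i(bump_j(a)) = bump_j(bump_i(a)), on the same sample.
bumpsCommute :: Bool
bumpsCommute =
  and [ bump i (bump j a) == bump j (bump i a)
      | a <- [0 .. 20], i <- [1 .. 5], j <- [1 .. 5] ]

main :: IO ()
main = do
  print (bumpsInflate, bumpsCommute)            -- prints (True,True)
  -- put 4 then bump (+1), versus bump (+1) then put 4, starting from 0:
  print (bump 1 (put 4 0), put 4 (bump 1 0))    -- prints (5,4)
```

The second line of output reproduces the 5-versus-4 discrepancy from the text, which is why the two kinds of update must not be mixed on one LVar.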
The rest of an LVish program is purely functional, and its behavior is, in fact, a pure function of these get observations. Since bump operations are also commutative and inflationary with respect to the lattice of the LVar they operate on, we see that they preserve determinism, as long as programs do not use bump and put on the same LVar.

We can go further and generalize this argument, and in fact will need to for the other extensions described in this paper. Consider LVish with put, bump, and get effects. By commutativity, we can reduce the effects in a program execution to three unordered sets: (P, B, G). By slicing the system at this interface, we can decompose determinism proof obligations into two parts:

• The LVish (Haskell) code must guarantee that it implements a monotone function 𝒫(G) → 𝒫(P) × 𝒫(B), where 𝒫 denotes the powerset. That is, adding more get results on the left results only in more put/bump effects on the right (as more gets unblock and run, more put and bump operations can run); furthermore, those effects are a function of nothing other than get.

• The mutable (but monotonically growing) heap of LVars must likewise guarantee that the set of get results is a pure function of the set of puts and bumps. This is straightforward to show, given the lattice-based semantics of put and bump.

This communicating agents formulation of determinism for LVish puts us in a position to replace the monotonically growing heap with other deterministic agents that fulfill the same contract, as we will see in Section 5. At that point, we will also discuss temporal contracts on the order of operations, going beyond the simple case of completely unordered sets at the interfaces.

Deleveraging idempotency. Since lub is an idempotent operation, the previously existing LVish implementation assumed idempotence of all writes, which in turn enabled the scheduler to relax synchronization requirements at the cost of low-probability duplication of work [12].
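To make the role of idempotence concrete, the following sketch models a write that is accidentally executed twice, as could happen when a scheduler duplicates work. The names twice, lubWrite, and bumpBy are hypothetical and the state is a plain Int; this is a pure model, not the LVish scheduler.

```haskell
-- Model a write that runs twice because its task was duplicated.
twice :: (s -> s) -> s -> s
twice f = f . f

-- A lub write on the naturals (lub = max): idempotent.
lubWrite :: Int -> Int -> Int
lubWrite v s = max v s

-- A bump on the naturals: inflationary and commutative, but not idempotent.
bumpBy :: Int -> Int -> Int
bumpBy k s = s + k

main :: IO ()
main = do
  -- Running a lub write twice is indistinguishable from running it once.
  print (twice (lubWrite 4) 0 == lubWrite 4 0)   -- prints True
  -- Running a bump twice changes the final state.
  print (twice (bumpBy 1) 0 == bumpBy 1 0)       -- prints False
```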
Adding support for operations like bump makes this assumption untenable. Therefore, we re-engineered the LVish runtime system to (optionally) include additional synchronization.⁶

Fine-grained effect tracking. Naturally, it is best to pay the aforementioned synchronization overhead only when required. This requires static information about whether a given program uses bump. With that as one of our goals, we extend LVish to allow for static fine-grained effect tracking. The idea is to guarantee that only certain LVar effects can occur within a given Par computation. In Haskell, we can do so at the type level by indexing Par computations with a phantom type e that indicates their effect level. That is, the Par type becomes, instead, Par e, where e is a type-level encoding of booleans indicating whether or not writes, reads, non-idempotent (bump), or non-deterministic (IO) operations are allowed to run inside it.

Moreover, in real LVish programs, the Par type constructor has a second type parameter, s, making Par e s a the complete type of a computation that returns a result of type a.⁷ The s parameter ensures that it is not possible to reuse an LVar from one runPar session to the next, just as the ST monad in Haskell prevents an STRef from escaping runST; likewise the types of individual LVars must be parameterized by s as well. For simplicity of presentation, we elided the e and s type parameters in Section 2, instead following the simpler Par a format of the earlier monad-par library [16], but we include them from this point onward.

To enable future additions of effect “switches” encoded in e, we follow the precedent of recent work by Kiselyov et al. on extensible effects in Haskell [10]: we abstract away the specific structure of e into type class constraints, which allow a Par computation to be annotated with the interface that its e type parameter is expected to satisfy.
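The idea of gating operations by a phantom effect index can be sketched in a few lines. This is not the LVish encoding: the sketch omits the s parameter, wraps IO directly, and uses hypothetical effect levels ReadOnly and WithBump with a single HasBump class, purely to show how a constraint on e rules out an operation at compile time.

```haskell
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)

-- Hypothetical effect levels; the real LVish encoding is richer.
data ReadOnly
data WithBump

-- e is a phantom parameter: it appears in the type but not in the value.
newtype Par e a = Par { unPar :: IO a }

instance Functor (Par e) where
  fmap f (Par m) = Par (fmap f m)

instance Applicative (Par e) where
  pure = Par . pure
  Par f <*> Par x = Par (f <*> x)

instance Monad (Par e) where
  Par m >>= k = Par (m >>= unPar . k)

-- Only effect levels with a HasBump instance may run bump operations.
class HasBump e
instance HasBump WithBump

newtype Counter = Counter (IORef Int)

newCounter :: Par e Counter
newCounter = Par (Counter <$> newIORef 0)

incrCounter :: HasBump e => Counter -> Par e ()
incrCounter (Counter r) = Par (atomicModifyIORef' r (\n -> (n + 1, ())))

-- Simplified runner for this sketch (no scheduler, no s parameter).
runParIO :: Par e a -> IO a
runParIO = unPar

twoBumps :: HasBump e => Par e Counter
twoBumps = do
  c <- newCounter
  incrCounter c
  incrCounter c
  return c

main :: IO ()
main = do
  Counter ref <- runParIO (twoBumps :: Par WithBump Counter)
  readIORef ref >>= print                 -- prints 2
  -- Instantiating twoBumps at Par ReadOnly Counter is rejected by the
  -- type checker, because there is no HasBump ReadOnly instance.
```

Because the capability lives in a class constraint rather than in the concrete shape of e, new effect switches can be added later without changing existing signatures.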
For example, a Par computation annotated with the effect level constraint HasPut can perform puts. Thus the signature for the put operation on IVars becomes:

    put :: HasPut e ⇒ IVar s a → a → Par e s ()

while the signature for an incrCounter operation uses the HasBump constraint:

    incrCounter :: HasBump e ⇒ Counter s → Par e s ()

These constraints can also be negative. For example, the runPar function for executing Par computations in a purely functional context requires the absence of explicit freeze or IO operations:

    runPar :: (NoFreeze e, NoIO e) ⇒ (∀ s . Par e s a) → a

⁶ Space constraints preclude full description here, but the key challenge is resolving a race between puts and attempts to register new handlers (callbacks) on an LVar. Our solution is a specialized variant of a reader-writer lock that requires zero writes to shared addresses if no handlers are currently being registered.

⁷ To be precise, in the earlier 1.x releases of LVish, the e type parameter for effect level was instead d, for “determinism level”, and was a simple type-level boolean switch distinguishing deterministic from quasi-deterministic Par computations [12]. The effect signatures in this paper generalize determinism levels and correspond to the newer LVish 2.x API.

4. Par-monad transformers

The effect-tracking system of the previous section gives us a way to toggle on and off a fixed set of basic capabilities using the type system, that is, with the switches embedded in the e parameterizing the Par type. These type-level distinctions are needed for defining restricted but safe idioms, but they do not address extensibility. For that, we turn to multiple monads rather than a single parameterized Par monad.

Working Haskell programmers use a variety of different monads: Reader for threading parameters, State for in-place update, Cont for continuations, and so on. All monads support the same core operations (bind and return from the Monad type class) and satisfy the three monad laws.
However, each monad must also provide other operations that make it worth using. Most famously, the IO monad provides various input-output operations. A monad transformer, on the other hand, is a type constructor that adds “plug-in” capabilities to an underlying monad. For example, the StateT monad transformer adds an extra piece of implicit, modifiable state to an underlying monad. Adding a monad transformer to a type always returns another monad (preserving the Monad instance).

In the same way, we can define a Par-monad transformer as a type constructor T, where, for all Par monads m, T m is another Par monad with additional capabilities, and a value of type T m a, for instance, T (Par e s) a, is a computation in that monad. Indeed, Par-monad transformers are valid monad transformers (in the sense of providing a standard MonadTrans instance).

Just as Monad is a type class (interface) with associated laws, the semantics of a Par monad is captured by a series of type classes, all of which are closed under Par-monad transformer application. At minimum, a Par monad must have a fork operation, satisfying this type class:

    class (Monad m) ⇒ ParMonad m where
      fork :: m () → m ()

Programs with fork create a binary tree of monadic actions with () (unit) return values. Whereas the original LVish library⁸ provided a single, concrete Par type, here we allow any instantiation of the ParMonad type class. Additional type classes capture the interfaces to basic parallel data structures and control constructs such as futures (ParFuture), IVars (ParIVar), and more general LVars (ParLVar). For example, the class ParIVar provides new and put methods with the signatures below.⁹

    class (ParMonad m) ⇒ ParFuture m where ...
    class (ParMonad m) ⇒ ParIVar m where
      type IVar m :: ∗ → ∗
      new :: m (IVar m a)
      put :: IVar m a → a → m ()
      get :: IVar m a → m a

The ParFuture, ParIVar, and ParLVar type classes form a hierarchy: any implementation that can support LVars can support IVars, and any that can support IVars can support futures. Taken together, this framework for generic Par programming makes it possible for LVish programs to be reusable across a variety of schedulers. This can be quite useful; for example, we provide a ParFuture instance for the native GHC work-stealing scheduler [15].

⁸ By “the original”, we refer to 1.x releases of LVish, e.g., http://hackage.haskell.org/package/lvish-1.1.2.

⁹ Although it may appear that generic treatment of Par monads as type variables m removes the additional metadata in a type such as Par e s a, note that it is possible to recover this information with type-level functions.

• Composability: While a user only wants one copy of RngT (and thus it could be hard-coded into the scheduler if desired), other transformers make it useful to have more than one copy in the stack. For example, a program with two implicit states might stack two StateT transformers. This is not possible for capabilities baked into the core scheduler.

Example: threading state in parallel. Perhaps the simplest example of a Par-monad transformer is the standard StateT monad transformer (provided by Haskell’s Control.Monad.State module). However, even if m is a Par monad, for StateT s m to also be a Par monad, the state s must be splittable; that is, it must be specified what is to be done with the state at fork points in the control flow. For example, the state may be duplicated, split, or otherwise updated to note the fork. The code below promotes StateT to be a Par-monad transformer:
    class SplittableState a where
      splitState :: a → (a,a)

    instance (SplittableState s, ParMonad m) ⇒ ParMonad (StateT s m) where
      fork task = do
        s ← State.get
        let (s1,s2) = splitState s
        State.put s2
        lift (fork (do runStateT task s1; return ()))

Note that here, put and get are not LVar operations, but the standard procedures for setting and retrieving the state in a StateT. Here are two immediately useful applications of threaded, splittable state:

• PedigreeT keeps the index in the binary control-flow tree as implicit state, e.g., “LRRLL”. This is sometimes called the pedigree of the parallel computation [13]. In this case the split action is to add “L” or “R” for each branch of the fork, respectively. Pedigrees can then be augmented with counters that increase with certain sequential actions, thus providing a form of parallel “program counter”. Also, examining pedigrees at runtime can answer “happens before” or “happens in parallel” questions.

• RngT is an application of pedigrees to the problem of deterministic pseudo-random number generation. The idea is simple: either use the pedigree itself as a seed, or keep the random generator state itself with StateT. The interface to the user is a simple rand nullary function that can be called on any thread. In fact, parallel deterministic random number generation was considered important enough for Intel to significantly modify the Cilk runtime system to support it directly [13]. In LVish, no such runtime system modification is necessary: instead, we add the StateT transformer to Par to track pedigree only for applications that need it. (Section 7.2 discusses the overhead of Par-monad transformers.)
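As a rough illustration of the splittable-state idea, the sketch below replays the StateT instance above over a stand-in base monad and records pedigrees. It is a minimal model under stated assumptions, not LVish code: the ParMonad instance for IO simply runs the child inline (so there is no real parallelism), and the names Pedigree, record, and pedigreeDemo are hypothetical. It uses the transformers package that ships with GHC.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict (StateT, evalStateT, get, put, runStateT)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

class Monad m => ParMonad m where
  fork :: m () -> m ()

-- Assumption: a sequential stand-in for a real scheduler.
-- The child is run to completion inline before the parent continues.
instance ParMonad IO where
  fork child = child

class SplittableState a where
  splitState :: a -> (a, a)

-- Same shape as the instance in the text: the child gets one half of
-- the split state, the parent continues with the other half.
instance (SplittableState s, ParMonad m) => ParMonad (StateT s m) where
  fork task = do
    s <- get
    let (s1, s2) = splitState s
    put s2
    lift (fork (runStateT task s1 >> return ()))

-- Hypothetical pedigree state: the path of L/R choices taken so far.
newtype Pedigree = Pedigree String

instance SplittableState Pedigree where
  splitState (Pedigree p) = (Pedigree (p ++ "L"), Pedigree (p ++ "R"))

-- Append "tag:pedigree" for the current task to a shared log.
record :: IORef [String] -> String -> StateT Pedigree IO ()
record logRef tag = do
  Pedigree p <- get
  lift (modifyIORef' logRef ((tag ++ ":" ++ p) :))

pedigreeDemo :: IO [String]
pedigreeDemo = do
  logRef <- newIORef []
  evalStateT
    (do fork (record logRef "child1")
        fork (do fork (record logRef "grandchild")
                 record logRef "child2")
        record logRef "parent")
    (Pedigree "")
  reverse <$> readIORef logRef
```

Each task observes a distinct pedigree that depends only on its position in the fork tree, which is the property that lets a pedigree serve as a seed or as a parallel "program counter".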
Further, given the above instances, we can declare every random number generator to be a splittable state, and thus define a very simple interface for random number generation, e.g.:

    instance RandomGen g ⇒ SplittableState g where
      splitState = System.Random.split

    randInt :: (ParMonad m, RandomGen g) ⇒ StateT g m Int

Determinism guarantee. The StateT transformer preserves determinism because it is effectively syntactic sugar. That is, StateT does not allow one to write any program that could not already be written using the underlying Par monad, simply by passing around an extra argument. This is because StateT only provides a functional state (an implicit argument and return value), not actual mutable heap locations. Genuine mutable locations in pure computations, on the other hand, require Haskell’s ST monad, the safer sister monad to IO. We return to ST in Section 5.

The case for pluggability. Why should parallel effects be plug-in, rather than baked-in? In summary, there are three reasons:

• Modularity: Runtime systems for parallel schedulers like Cilk and language runtimes like GHC’s grow into enormously complicated low-level concurrent codebases. Isolating parallel capabilities in transformers makes them modular and maintainable.

• Runtime cost: The transformers introduced in this paper introduce book-keeping and synchronization overheads (Section 7.2), which should be paid only by computations that use them. Expensive features should pay their own way.

Engineering note: independent extensibility. Because extensibility is an explicit goal, we must ask what functionality can be added by separate packages, deployed independently from LVish. The framework presented thus far enables new (trusted) packages to add transformers that preserve the ability to use core data structures; that is, they provide instances for the classes ParMonad, ParFuture, and ParIVar. In the other direction, separate packages can provide new data structures that work with the base Par monad.
But how can new transformers provide instances for new data types they do not know about? This is simply the problem of “independent extensibility” in a new guise. Fortunately, there is a good solution. The interactions between a concurrent data-structure implementation (such as Data.LVar.Map or Data.LVar.Counter) and the scheduler are limited and have a common structure. Thus, rather than splitting out fine-grained classes for each conceivable data structure (ParMap, ParCounter, and so on), we make ParLVar into a general data-structure/scheduler interface.¹⁰ For intuition, a small portion of the ParLVar type class interface is shown below. As described in previous work [12], the implementation distinguishes between the type of complete LVar states, for which we use the type parameter a, and state changes, or “deltas”, for which we use d:

    class (Monad m, ...) ⇒ ParLVar m where
      -- The type of raw LVars
      type LVar m :: ∗ → ∗ → ∗
      newLV :: IO a → m (LVar m a d)
      -- 2nd arg does the update, reports any change:
      putLV :: LVar m a d → (a → IO (Maybe d)) → m ()
      ...

With this approach, a generic monotonic Map package can work with any monad satisfying ParLVar, including those produced by stacking transformers that were written with no knowledge of the data structure. Likewise, transformers that preserve ParLVar instances work with past and future data structures.

5. Disjoint parallel update with ParST

LVish is based on the notion that it is fine for multiple threads to access and update shared memory, so long as updates commute and “build on” one another, only adding information rather than destroying it. Yet it should be possible for threads to update memory destructively, so long as the memory updated by different threads is disjoint.
This is the approach to deterministic parallelism taken by, for example, Deterministic Parallel Java (DPJ) [3], which uses a region-based type and effect system to ensure that each mutable region of the heap is passed linearly to a thread that then gains exclusive permission to update that region. In order to add this capability to LVish, though, we need destructive updates to interoperate with other LVish put/get/bump effects. Moreover, we wish to do so at the library level, without requiring language extensions.

Our solution is to provide a ParST transformer, a variant of the StateT transformer of Section 4. ParST allows arbitrarily complex mutable state, such as tuples of vectors (arrays). However, ParST enforces the restriction that every memory location in the state is reachable by only one pointer: alias freedom.

¹⁰ We retain interfaces like ParIVar as well because they provide a means to interoperate with legacy Par monads that provide only IVars or futures and have no notion of LVars, or with the built-in GHC work-stealing runtime itself, which provides only a ParFuture instance.

Previous approaches to integrating mutable memory with pure functional code (i.e., the ST monad) work with LVish, but only allow thread-private memory. There is no way to operate on the same structure (for instance, on two halves of an array) from different threads. ParST exploits the fact that it is perfectly safe to do so as long as the different threads are accessing disjoint parts of the data structure. Below we demonstrate the idea using a simplified convenience module provided alongside the general (ParST) library, which handles the specific case of a single vector as the mutable state being shared.

    runParVecT 10
      (do -- Fill all 10 slots with "a":
          set "a"
          -- Get a pointer to the state:
          ptr ← reify
          -- Call pre-existing ST code:
          new ← pickLetter ptr
          forkSTSplit (SplitAt 5)
            (write 0 new)
            (write 0 "c")
          -- ptr is again accessible here
          ...)
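In the example above, each child of forkSTSplit writes to index 0 of its own slice of the vector. As a rough sequential model of that indexing discipline, the sketch below splits a mutable array into two disjoint windows and runs two children against them. It is an illustration under stated assumptions, not the LVish implementation: the names View, newView, writeV, splitView, viewElems, and forkSplitSeq are hypothetical, the children run one after the other rather than in parallel, and the type-level protection of the parent pointer is omitted. It uses the array package that ships with GHC.

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STArray, getElems, newArray, writeArray)

-- | A window onto a mutable array: backing store, offset, and length.
data View s a = View (STArray s Int a) Int Int

-- | Allocate an n-element array filled with x and return a view of all of it.
newView :: Int -> a -> ST s (View s a)
newView n x = do
  arr <- newArray (0, n - 1) x
  return (View arr 0 n)

-- | Write through a view, using an index local to that view.
writeV :: View s a -> Int -> a -> ST s ()
writeV (View arr off len) i x
  | i < 0 || i >= len = error "writeV: index outside this view"
  | otherwise         = writeArray arr (off + i) x

-- | Split a view into two disjoint views at local index k.
splitView :: Int -> View s a -> (View s a, View s a)
splitView k (View arr off len)
  | k < 0 || k > len = error "splitView: bad split point"
  | otherwise        = (View arr off k, View arr (off + k) (len - k))

-- | Read back the elements covered by a view.
viewElems :: View s a -> ST s [a]
viewElems (View arr off len) = take len . drop off <$> getElems arr

-- | Sequential stand-in for a fork-join split: each child receives only
-- its own half, and both have finished when this returns.
forkSplitSeq :: Int -> View s a
             -> (View s a -> ST s ()) -> (View s a -> ST s ()) -> ST s ()
forkSplitSeq k v left right = do
  let (l, r) = splitView k v
  left l
  right r

splitDemo :: [String]
splitDemo = runST (do
  v <- newView 10 "a"
  forkSplitSeq 5 v (\l -> writeV l 0 "b") (\r -> writeV r 0 "c")
  viewElems v)
```

Because the two windows cover disjoint index ranges, the order in which the children run cannot affect the final contents; a write to local index 0 in the second child lands at global index 5.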
This program demonstrates running a parallel, stateful session within a Par computation. The shared mutable vector is implicit and global within the monadic do block. We fork the control flow of the program with forkSTSplit, where (write 0 new) and (write 0 "c") are the two forked child computations. The SplitAt value describes how to partition the state into disjoint pieces: (SplitAt 5) indicates that the element at index 5 in the vector is the “split point”, and hence the first child computation passed to forkSTSplit may access only the first half of the vector, while the other may access only the second half. (We will see shortly how this generalizes.) Each child computation sees only a local view of the vector, so writing "c" to index 0 in the second child computation is really writing to index 5 of the global vector. This is exactly the splitting method in our parallel sort (Section 7.3).

Ensuring the safety of ParST hinges on two requirements:

• Disjointness: Any thread can get a direct pointer to its state. In the above example, ptr is an STVector that can be passed to any standard library procedures in the ST monad. However, it must not be possible to access ptr from forkSTSplit’s child computations. We accomplish this using Haskell’s support for higher-rank types,¹¹ ensuring that accessing ptr from a child computation causes a type error. Finally, forkSTSplit is a fork-join construct; after it completes the parent thread again has full access to ptr.

• Alias freedom: Imagine that we expanded the example above to have as its state a tuple of two vectors: (v1, v2). (In fact, this is the state we need for the merge phase in Section 7.3.) If we allowed the user to supply an arbitrary initial state to their ParST computation, then they might provide the state (v1, v1), i.e., two copies of the same pointer.
This breaks the abstraction, enabling them to reach the same mutable location from multiple threads (by splitting the supposedly-disjoint vectors at a different index). Thus, in LVish, users do not populate the state directly, but only describe a recipe for its creation. Each type used as a ParST state has an associated type for descriptions of (1) how to create an initial structure, and (2) how to split it into disjoint pieces. We provide a trusted library of instances for commonly used types.

¹¹ That is, the type of a child computation begins with (∀ s . ParST ...).

State transformation. In comparison to the region-typing approach of DPJ, it can be painful to keep the state inside a single structure reachable from one variable. However, it is possible to define combinator libraries that make this much easier (in the spirit of the lens library for Haskell). For example, we provide ways to either “zoom in”, that is, run a computation whose state is a sub-component of the current state, or “zoom out”, by placing the current state inside a newly constructed one. We use this ability inside our code for parallel merge sort (Section 7.3) to shift from a single vector state to having a second temporary buffer for the merge phase.

Inter-thread communication. Disjoint state update does not solve the problem of communication between threads. Hence systems built around this idea often include other means for performing reductions, or require “commutativity annotations” for operations such as adding to a set. For instance, DPJ provides a commuteswith form for asserting that operations commute with one another to enable concurrent mutation. In LVish, however, such annotations are unnecessary, because LVish already provides a language-level guarantee that all effects commute!
Thus, a programmer using LVish with ParST can use any of the rich library of LVar-based data structures to communicate results between threads performing disjoint updates, without requiring trusted code or annotations. Furthermore, to our knowledge, LVish now provides the first example of a deterministic parallel programming model allowing both DPJ-style, disjoint destructive parallel updates and blocking, dataflow-style communication between threads (through LVars).

Determinism guarantee. The ParST transformer relies on the fact that the disjoint updates made by a forkSTSplit call are equivalent to a single sequential state update. This means that if ParST were a base monad instead of a transformer, its determinism would be a straightforward consequence of this disjointness property, which prevents data races. Indeed, ParST would be equivalent to a proper subset of DPJ, which is provably deterministic [3].¹²

The complication is that a ParST computation may spawn arbitrary, asynchronous computations that use the underlying effects provided by the monads under it in the transformer stack, e.g., put and get on LVars. To convince ourselves that this is safe, we return to the “communicating agents” formulation of Section 3 to enable modular reasoning. The mutable heap of ST objects (STRef, STVector, and so on) becomes a third agent alongside the purely functional component of the LVish computation and the monotonically-growing heap. The purely functional agent exchanges put/get messages with the monotonically-growing heap and read/write messages with the mutable heap. In this case, however, there is a protocol that must be followed, and we cannot ignore ordering and control flow to reason only about the sets of messages exchanged. In the basic LVish programming model there are two sources of ordering constraints: monadic bind, and data dependencies from put to get.
Intuitively, we can think of the LVish agent emitting Before(a, b) messages, meaning that (a, b) is in a happens-before relation for a pair of events a and b that it previously emitted. Normally, such a relation would be unnecessary; most Par monads are so order-insensitive that all their effects satisfy the following reordering-tolerance property:

    (do m1; m2) == (do fork m1; m2)

But for a destructively mutable heap, ordering is important and ParST effects clearly cannot support the above property: write operations do not commute! In fact, ParST does not even expose a one-armed fork operation that allows ST effects in the child computation.¹³ Rather, it supports fork-join parallelism with forkSTSplit, which requires that both child computations complete before returning. Furthermore, the forkSTSplit control construct can be thought of as generating additional Before(a, b) messages to express these barriers.

On the mutable-heap side, the contract for determinism is the standard one: all read/write and write/write pairs must be ordered according to the Before relation. Of course, we do not track Before at runtime, so this must be guaranteed by construction. How can we guarantee this if there is a stack of monads composed with ParST? The key here is that get effects can only add more Before constraints, not take them away. Additional blocking operations can therefore never break the requirements for determinism in the mutable heap (race-freedom). Given that, the same argument as in Section 3 applies: all results returned to the LVish agent from the heap are deterministic, therefore its final value is.

¹² There is a minor caveat here. DPJ requires that the type system statically determine disjointness of state updates, whereas in LVish, we can also allow complicated partitioning strategies that are only checkable at runtime. Nevertheless, DPJ could be extended with this functionality, and it does not affect the determinism argument.
ParST composition. The reader may legitimately be wondering: how can there be a ParST transformer, if there is no ST transformer in Haskell? The answer is that ParST is not a transformer supporting unfettered composition. Instead, a given Par monad can either have the ST feature, or not. It is not safe to combine two copies of ParST, nor to apply ParST on top of certain other transformers that LVish might eventually be extended with (e.g., ListT). To implement this, each Par monad tracks one bit of information in its type: whether the ST switch has been turned on for this monad.¹⁴ Once this bit is turned on, new copies of ParST cannot be applied on top, but other transformers, such as RngT, can be added. Thus it is possible to compose reordering-tolerant transformers such as StateT and RngT freely on either side of the ParST transformation, without violating the invariants of the underlying state implementation.

6. Control-related Effects

In the previous section, we saw a Par transformer that restricts the control flow of an LVish program to retain determinism: ParST requires that child computations that modify the state are created in a fork-join, rather than asynchronous fork, fashion. In this section, we will instead look at transformers that add additional control-flow behaviors to a program: for example, the ability for one thread to cancel another. Every Par monad provides continuation capture under the hood¹⁵ to be able to support work-stealing scheduling and blocking gets. This provides significant power for implementing new control constructs, but it does not change the fact that we must carefully identify limited idioms that retain determinism, and expose only those determinism-preserving constructs from the library of Par transformers.

6.1 Cancellation

It is common to speculatively create a parallel computation whose result may not be needed; for example, in search problems.
In the parallel “and” example from Section 2, we saw that LVish programs written using the previously existing LVish library could create trees of parallel boolean operations and even allow them to make their results available before all branches completed execution. However, it was not possible to actually cancel the unneeded branches to avoid wasted CPU time.

¹³ To be precise, ParST does provide a ParMonad instance, but any attempt to reference the state in a forked computation results in a runtime error.

¹⁴ We enforce this restriction through an extra superclass constraint upon a class that users are prevented from instantiating.

¹⁵ That is, a ParIVar is always also a Cont monad.

Fundamentally, cancellation is a challenge for guaranteed-deterministic parallel programming because a cancelled thread might have side effects and the cancellation could race with those effects. With fine-grained effect tracking, though, we are able to provide a CancelT transformer providing operations such as forkCancelable, which takes a computation as argument and runs it in parallel, returning a cancellable future; and cancel, which takes a CFuture and cancels the thread associated with it and all of that thread’s subthreads, transitively. It is an error to both cancel and read such a future, even if the read happens first. In our generic framework, the signatures for forkCancelable and cancel include:

    forkCancelable :: (ParLVar m, ReadOnly m, ...) ⇒ CancelT m a → CancelT m (CFuture m a)
    cancel :: (HasPut m2, ...) ⇒ CFuture m1 a → CancelT m2 ()

Note that forkCancelable, which requires that the forked computation must be ReadOnly,¹⁶ uses the same monad, m, for the child and parent computations. This is only because lifting a read-only computation into one that includes writes (explicit subtype coercion) is done separately.
In fact, because cancel may cause another thread to throw an exception, it counts as a put effect; thus a program with cancellation must have a non-read-only “trunk” that it connects read-only branches to. Finally, if the user wants cancellation of child computations with arbitrary effects, a variant, forkCancelableND, allows them but requires nondeterminism (that is, IO) in its own effect signature.

Using forkCancelableND we can write a version of the asyncAnd function from Section 2 which, when getAndLV returns a False, calls cancel to terminate any remaining (now useless) forked computations. Using forkCancelable directly is not possible because of the putPureLVar calls in the code. Because we have verified manually that this use of cancellable writes is safe, however, we could add a blessed version of asyncAnd to the library that works with ReadOnly computations.

Implementation. The CancelT transformer allocates one mutable location whenever a new CFuture is created by forkCancelable (regular forks continue to share the state of the parent). This location stores a tuple (live, children), which tracks whether the computation is still alive, and a list of the child CFutures, which must be cancelled if the current thread is cancelled. Thus the implementation is driven by polling a thread’s liveness every time a scheduler action (get, fork, put, and so on) is performed. Because scheduler actions are frequent, this is sufficient. Moreover, alternatives that support more direct preemption (e.g., using Haskell asynchronous exceptions to kill the underlying worker threads) require much more bookkeeping, as well as invasive modifications to the LVish scheduler itself.

6.2 Memoization

Even though cancellation allows us to write a more efficient version of asyncAnd (canceling wasted work), it remains the case that

¹⁶ In fact, these subcomputations have an additional stipulation relating to exception semantics, which also applies to ReadOnly computations used in memoization.
Briefly, normal LVish threads eagerly push exceptions up to the scheduler, which is necessary when threads perform side effects like put that may throw exceptions that appear deterministically. A cancellable future, on the other hand, must have no visible effect but its result, and thus we require that exceptions not be propagated to the parent until/unless the future is read.

ReadOnly cancelled computations are completely wasted: they cannot do externally useful work. Canceling a computation that could do externally useful work would necessarily break determinism. Or would it? In this section, we show how a cancellable, ReadOnly computation can help other threads along without interfering with determinism. The idea is that a cancellable ReadOnly computation can contribute work to a shared memo table LVar. Since the only observable effect of writing to the memo table is that calls to memoized functions run faster, determinism is preserved, regardless of whether the computation is cancelled.

A basic memo table has a direct encoding using only the public interface of Set and Map LVars. Specifically, we use one LVar for requests and a second for results:

    type Memo e s k v = (ISet s k, IMap k s v)

The set of requests is connected to a handler that launches a compute job for each unique request of type k. When a job completes, it stores the (k,v) pair into the IMap. Thus doing a lookup on the memo table consists of simply inserting into the set, and then performing a blocking get on the map. This provides an efficient way to memoize functions, even functions that have side effects within the Par monad (i.e., makeMemo takes a function (a → Par e s b)). It is a great application of existing LVar data structures.

But a further synergy with CancelT is possible. The Memo type above has an e parameter that tracks the effect signature of the memoized function. However, making a memo table request means writing an element into the ISet: a put effect.
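The request-set/result-map encoding can be modeled sequentially in a few lines. The sketch below is an illustration under stated assumptions, not the LVish memo table: it uses IORefs holding ordinary Set and Map values (from the containers package) in place of ISet and IMap LVars, it runs the "handler" inline on the first request for a key instead of launching a parallel job, and the Memo record, makeMemo, getMemo, and memoDemo defined here are hypothetical simplifications with different types from the library functions of similar names.

```haskell
import Control.Monad (unless)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | Toy memo table: a grow-only request set, a grow-only result map,
-- and the memoized function, which plays the role of the handler.
data Memo k v = Memo
  { requests :: IORef (Set.Set k)
  , results  :: IORef (Map.Map k v)
  , compute  :: k -> IO v
  }

makeMemo :: (k -> IO v) -> IO (Memo k v)
makeMemo f = do
  reqs <- newIORef Set.empty
  res  <- newIORef Map.empty
  return (Memo reqs res f)

getMemo :: Ord k => Memo k v -> k -> IO v
getMemo memo key = do
  seen <- Set.member key <$> readIORef (requests memo)
  -- Every lookup first records its request: this is the write
  -- hiding inside what looks like a read.
  modifyIORef' (requests memo) (Set.insert key)
  -- Handler: only the first request for a key runs the function.
  unless seen $ do
    val <- compute memo key
    modifyIORef' (results memo) (Map.insert key val)
  -- Stand-in for the blocking get on the result map. A function that
  -- requests its own key would block forever in a real table; this
  -- sequential model reports that case as an error instead.
  found <- Map.lookup key <$> readIORef (results memo)
  maybe (error "getMemo: result missing (cyclic request?)") return found

-- | Look up a few keys and count how often the function actually ran.
memoDemo :: IO ([Int], Int)
memoDemo = do
  calls <- newIORef (0 :: Int)
  memo  <- makeMemo (\n -> do modifyIORef' calls (+ 1); return (n * n))
  vals  <- mapM (getMemo memo) [3, 4, 3, 3, 4]
  n     <- readIORef calls
  return (vals, n)
```

In this model the only difference between a first and a repeated lookup is how much work is done, not which value is returned, which mirrors the argument that memo-table writes are unobservable.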
Thus, reading from a memo table has a put effect! This in turn means that it cannot be cancelled. Fortunately, this is a place where we can identify a specific combination of parallel effects that compose well. It is safe for an alternate version of the memo-table get function to require a ReadOnly memoized function, and in return hide (bless as safe/unobservable) the put effect in the result signature:

    getMemoRO :: (ReadOnly e) ⇒ Memo e s k v → k → Par e s v

Then, with getMemoRO, we can safely use ReadOnly memo tables inside cancelled computations! Hence we retain a full determinism guarantee, while canceling unneeded work, and retaining partial solutions discovered in the cancelled threads. The result is that read-only computations that use memoized functions can allow one to learn something from a computation that never happened, deterministically!

The rest of the Zoo. While we do not have space to cover all of them here, there are other interesting examples of transformers that deal with parallel control flow. One example is DeadlockT, which returns when all computations underneath a forked child have either returned or blocked indefinitely. This transformer is useful for detecting and responding to cycles in graphs of computations. Deadlock-detecting computations have the opposite effect requirement from cancellation: rather than requiring read-only computations for determinism, they require “blind” computations which may only write to the world outside the subcomputation. (If they could read, they could block on data outside of their control, which creates ambiguity between genuine deadlock and temporary blocking.)

Another example is BulkRetryT, which improves the ability of a Par monad to support the deterministic reservations [2] idiom efficiently, and is described in a workshop paper [17].
In brief, ParIVar monads already support blocking reads, but to efficiently execute a parallel for loop with a large iteration space, it is often better to cheaply mark the iterations that fail and retry them in bulk. However, the approach of aborting and retrying rather than blocking requires that each iteration of computation have only idempotent effects. In this example and others, we see that fine-grained effect tracking is essential to how our zoo of additional capabilities interoperate.

    global: biptable, distmat
    (1) for t ∈ alltrees:
          for bip ∈ t:
            insert(biptable, (t, bip))
    (2) for (_, trset) ∈ biptable:
          for t1 ∈ alltrees:
            for t2 ∈ alltrees:
              if t1 ∈ trset `xor` t2 ∈ trset
              then increment(distmat[t1,t2])

Figure 3. Pseudocode of the HashRF algorithm for computing a tree-edit-distance matrix.

7. Evaluation

In this section, we evaluate the performance of our extended LVish library. We begin with a case study describing our experience using LVish to parallelize PhyBin, a bioinformatics application, and compare the performance of our parallelized PhyBin with its competitors. Next, we benchmark to measure the runtime overhead incurred by our use of Par transformers. Finally, to measure the effectiveness of our ParST transformer for disjoint parallel update, we evaluate its performance on a parallel merge sort benchmark. All measurements come from a dual-socket (12-core) Intel Xeon X5660 system, running RHEL Linux 6.4.

7.1 Case Study: PhyBin: all-to-all tree edit distance

A phylogenetic tree represents a possible ancestry for a set of N species. Leaf nodes in the tree are labeled with species’ names, and the structure of the tree represents a hypothesis about common ancestors. For a variety of reasons, biologists often end up with many alternative trees, whose relationships they need to then analyze. PhyBin¹⁷ is a medium-sized (3500-line) bioinformatics program for this purpose, initially released in 2010.
The primary output of the software is a hierarchical clustering of the input tree set (a tree of trees), but most of its computational effort is spent computing an N × N distance matrix, which records the pairwise edit distance between trees. It is this distance computation that we parallelize in our case study.

The distance metric itself is called Robinson-Foulds (RF) distance, and the fastest algorithm for all-to-all RF distance computation is the HashRF algorithm [19], introduced by a software package of the same name.¹⁸ HashRF is about 2-3× as fast as PhyBin. Both packages are dozens or hundreds of times faster than the more widely-used software that computes RF distance matrices (e.g., Phylip¹⁹, DendroPy²⁰). These slower packages use (N² − N)/2 full applications of the distance metric, which has poor locality in that it reads all trees in from memory (N² − N)/2 times.

Before describing how the HashRF algorithm improves on this, we must observe that edit distance between trees (number of modifications to transform one to the other) can be reduced to symmetric set difference between sets of bipartitions. That is, each intermediate node of a tree can be seen as partitioning the set of leaves into those below and above the node, respectively. For example, with leaves A, B, C, D, and E, one bipartition would be “AB|CDE”, while another would be “ABC|DE”. Identical trees, of course, convert to the same set of bipartitions. Furthermore, after converting trees to sets of bipartitions, set difference may be computed using standard set data structures.

The HashRF algorithm makes use of this fact and adds a clever trick that greatly improves locality. Before computing the actual distances between trees, it populates a table mapping each observed bipartition to the set of trees that contain it. In the original PhyBin source:

    type BipTable = Map DenseLabelSet (Set TreeID)

Above, a DenseLabelSet encodes an individual bipartition as a bit vector. PhyBin uses purely functional data structures for the Map and Set types, whereas HashRF uses a mutable hash table. Yet in both cases, these structures grow monotonically during execution. The full algorithm for computing the distance matrix is shown in Figure 3. The second phase of the algorithm is still O(N²), but it only needs to read from the much smaller trset during this phase. All loops in Figure 3 are potentially parallel.

Parallelization. The LVish methodology applies directly to this application:

• The biptable in the first phase is a map of sets, which are directly replaced by their LVar counterparts.

• The distmat in the second phase is a vector of monotonic bump counters.

¹⁷ http://hackage.haskell.org/package/phybin

¹⁸ https://code.google.com/p/hashrf/

¹⁹ http://evolution.genetics.washington.edu/phylip.html

²⁰ http://pythonhosted.org/DendroPy/

    Trees  Species  PhyBin  DendroPy  Phylip
    100    150      0.269   22.1      12.8

    Trees  Species  PhyBin (1, 2, 4, 8 cores)  HashRF
    1000   150      4.7, 3, 1.9, 1.4           1.7

Table 1. PhyBin performance comparison with DendroPy, Phylip, and HashRF. All times in seconds.

7.2 Benchmark 1: overhead of transformers

Figure 2. The overhead of adding one StateT transformer (left) or ParST transformer (right). The Y axis is the speedup/slowdown factor (higher better), and the X axis is the count of benchmarks. Each color represents one of the benchmarks drawn from Figure 4. For each benchmark, there is a different bubble per thread setting, with the area proportional to the number of threads. We do not see a trend with more or less overhead at larger numbers of threads. All times are the median of five runs.

Monad transformers have both direct and secondary costs. The direct cost is to pay for what they do; the secondary cost is that complicated monad-transformer stacks result in extremely complicated code that the (GHC) compiler must unravel to optimize effectively.
Our Par transformer approach only requires paying these overheads when a specific capability is needed, but we must still account for what that cost is and whether it is prohibitive.

LVish's primary focus is on non-traditional parallel applications such as k-CFA program analysis [12] or PhyBin. Nevertheless, here we also include a benchmark suite of traditional parallel kernels, shown in Figure 4. Figure 2 uses the same suite and summarizes the overhead incurred when it is rerun with additional, unneeded transformers. We measure overheads for adding a StateT or ParST transformer. (Note that a CancelT is just such a StateT.) These result in a 4% geomean slowdown and a 2% geomean speedup, respectively. Indeed, the interactions of these transformers with the GHC compiler's optimizer are difficult to predict, but overall, the overhead is not prohibitive.

7.3 Benchmark 2: non-copying parallel sorting
To measure the effectiveness of our ParST transformer, we ported a well-known parallel merge sort implementation originally written in Cilk and later reimplemented in DPJ [3]. We omit the details of the algorithm and comment only on its formulation with ParST. This is a destructive mergeSort function, which assumes a vector state and leaves the sorted result occupying the same memory locations as the input:

    mergeSort :: (ParMonad parM) => ParST (MVector s2 elt) s parM ()

This function works over any underlying monad parM, extended with the ParST effect. Internally, the algorithm must add a second buffer to have extra space for merging, shifting the state to (v1, v2). At the fork points, both of these buffers are split at the same locations.
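To make the two-buffer idea concrete, the following is a minimal sequential sketch, not the ParST implementation from this paper. It assumes only the array package that ships with GHC, and the names sortRange, merge, and sortList are hypothetical. Both buffers are divided at the same index mid, so the two recursive calls touch disjoint index ranges of each buffer; that disjointness is what a splitting fork would rely on to run them in parallel. Unlike the ParST version, the sketch copies each merged range back from the scratch buffer instead of alternating buffer roles.

```haskell
import Control.Monad (forM_, when)
import Control.Monad.ST (ST, runST)
import Data.Array.ST
  (STUArray, getElems, newArray, newListArray, readArray, writeArray)

-- Sort the half-open index range [lo, hi) of buf in place, using the
-- same index range of tmp as scratch space.
sortRange :: STUArray s Int Int -> STUArray s Int Int -> Int -> Int -> ST s ()
sortRange buf tmp lo hi = when (hi - lo > 1) $ do
  let mid = lo + (hi - lo) `div` 2
  -- Both buffers are divided at the same index; the two recursive calls
  -- only touch disjoint ranges, so they do not interfere with each other.
  sortRange buf tmp lo mid
  sortRange buf tmp mid hi
  merge buf tmp lo mid hi
  -- Copy the merged range back into the primary buffer.
  forM_ [lo .. hi - 1] $ \i -> readArray tmp i >>= writeArray buf i

-- Merge the sorted runs buf[lo, mid) and buf[mid, hi) into tmp[lo, hi).
merge :: STUArray s Int Int -> STUArray s Int Int -> Int -> Int -> Int -> ST s ()
merge buf tmp lo mid hi = go lo mid lo
  where
    go i j k
      | k >= hi = return ()
      | i < mid && j < hi = do
          x <- readArray buf i
          y <- readArray buf j
          if x <= y
            then writeArray tmp k x >> go (i + 1) j (k + 1)
            else writeArray tmp k y >> go i (j + 1) (k + 1)
      | i < mid = readArray buf i >>= writeArray tmp k >> go (i + 1) j (k + 1)
      | otherwise = readArray buf j >>= writeArray tmp k >> go i (j + 1) (k + 1)

-- Pure wrapper: the mutation is confined to runST, so callers see no effects.
sortList :: [Int] -> [Int]
sortList xs = runST $ do
  let n = length xs
  buf <- newListArray (0, n - 1) xs
  tmp <- newArray (0, n - 1) 0
  sortRange buf tmp 0 n
  getElems buf
```

For example, sortList [5, 2, 9, 1, 5, 6] evaluates to [1, 2, 5, 5, 6, 9]. The pure wrapper mirrors, in a much weaker sequential form, the property that a destructive sort can still be called from purely functional code.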
The code for the heart of the parallel sort is:

    forkSTSplit (sz1,sz1)
      (do forkSTSplit (sz2,sz2) mergeSort mergeSort
          mergeL2R)
      (do forkSTSplit (sz2,sz2) mergeSort mergeSort
          mergeL2R)
    mergeR2L

As in both the DPJ and Cilk implementations, we need to unroll the recursive sorting process, splitting twice. This ensures that after each round the output ends up back in the original buffer. The type-checking of s parameters ensures that the nested splits can access only exactly the data they have permission to.

In summary, the ParST-based Haskell implementation offers exactly the same determinism guarantee offered by DPJ. Our version has the disadvantage of being written with a more restrictive (single implicit state object) mechanism, but it has the advantage of being callable from a purely functional context (e.g., from within a function of type Int -> Int) with a guarantee that no visible side effects occur.

Figure 4. Benchmark suite of traditional parallel kernels (blackscholes, mergesortFP, matmult, sumeuler, nbody) in the LVish Par monad, plotted as parallel speedup against number of CPU threads. These are runnable either with monad-par or LVish.

Performance. The reason performance suffered in previous parallel sort implementations in Haskell [7] is that each recursive call to mergeSort had to perform an append (copy) to combine the halves together. The two independent recursions had to return fresh values, because no mechanism for (deterministic) mutation in parallel was available. The performance of such a copying merge sort is shown in Figure 4 (mergesortFP). The benchmark suite in Figure 4 uses only a base Par monad, not the ParST transformer, and mergesortFP is the only one of these benchmarks that completely stops scaling before twelve cores. Indeed, when sorting arrays larger than the last-level cache, mergesortFP reads the entire input memory at least $\log_2(N)$ times, greatly increasing memory traffic.²²

Eliminating the copying by using ParST causes scaling to continue to twelve cores. We look at two variations of this in Figure 5. Naturally, all these implementations of merge sort bottom out to sequential sorts below a granularity threshold. The two variants we examine bottom out to different sequential sorts: either (1) a pure Haskell sequential sort, or (2) a library call to a C sort (namely, the same sequential sort used by the Cilk implementation). The table in Figure 5 contains the times for the all-Haskell sort. It achieves a 10.7× parallel speedup on 12 cores. The line graph above it shows the other variant, alongside the DPJ and Cilk benchmarks. Our parallel version does add overhead relative to Cilk, with a best time of 0.42 instead of 0.29 seconds.

²² Performing a multi-way merge sort could reduce the impact of this problem.

[Line graph: parallel speedup, normalized to Cilk, against number of CPU threads, for ParST/C, DPJ, and Cilk.]

    Threads       1     2     4    6    8    10   12
    ParST/HSonly  36.5  18.0  9.2  6.3  4.8  4.6  3.4

Figure 5. Non-copying merge sort. Parallel speedups shown relative to the Cilk single-thread execution time of 3.48 seconds.

For the PhyBin case study of Section 7.1, the parallel port using LVish was so straightforward that, after reading the code, parallelizing the first phase took only 29 minutes.²¹ Once the second phase was ported, the distance computation sped up by a factor of 3.35× on 8 cores (Table 1). This is exactly where we would like to use LVish: to achieve modest speedups for modest effort, in programs with complicated data structures (and high allocation rates), and without changing the determinism guarantee of the original functional code.

²¹ Git commit range: https://github.com/rrnewton/PhyBin/compare/5cbf7d26c07a...6a05cfab490a7a

8. Related Work
Work on deterministic parallel programming models is longstanding.
As we discussed in Section 1, deterministic parallel languages must restrict effects so that schedule nondeterminism cannot be observed, whether that means avoiding shared mutable state entirely, as in data-parallel languages [5, 20]; allowing sharing only by a limited form of message passing, as in dataflow-based or stream processing languages [4, 8, 9]; or ensuring that concurrent accesses to shared state are disjoint [3]. In addition to the models already discussed, here we contrast our work on extending LVish with non-language-based approaches, in particular, those that attempt to run arbitrary threaded programs deterministically.

The narrowest form of deterministic parallelism is repeatability: the property that, on a specific machine, whatever happens the first time a program is run will also happen on subsequent runs, given the same inputs. For example, the TERN system [6] uses a schedule memoization approach to improve debuggability by repeating the same thread interleavings as previous runs.

Also of interest is consistent scheduling on a particular input. The recent work on DThreads [14] transparently converts multi-threaded programs into multi-process ones, enforcing a deterministic resolution of conflicting updates to memory. DThreads intercepts the pthreads API to hook into arbitrary programs. While supporting legacy software makes this line of research very important, there are major differences between the approach taken by systems like DThreads and that taken by LVish:
• LVish requires no reasoning about interleavings. Deterministic threading packages make thread interleavings a consistent behavior, but the programmer still needs to think about concurrency, given that they will not generally be able to predict the exact schedule chosen by the deterministic scheduling package. In LVish, all lattice-based actions commute, so interleavings are not relevant.
• Deterministic threading packages typically support lock-based, multi-threaded programs, but cannot handle other forms of synchronization based on user-space atomic memory operations, in particular lock-free data structures such as those that underlie modern work-stealing runtime systems. By contrast, LVish is specifically focused on enabling the programmer to use fine-grained concurrent data structures.
• A language-based approach can ensure determinism by statically limiting what features can be combined (effects, transformers), rather than by runtime enforcement that carries a runtime overhead.

9. Conclusion
We present an extended version of the LVish library for deterministic parallelism, augmented with the ability to manage a wide variety of effects previously not seen in combination in any guaranteed-deterministic parallel programming system. Our extended library offers the well-known benefits of language-level enforcement of determinism, but without being limited to a single shared data structure or a single programming paradigm as previous deterministic-by-construction programming models have been. Furthermore, our case study and empirical results demonstrate that deterministic parallelism can be effective, while also retaining the ease of use that is the hallmark of deterministic parallel models.

Acknowledgments
Thanks to Aaron Turon for many illuminating conversations that helped develop the ideas in this paper, and to the anonymous PLDI reviewers for their insightful and helpful comments. This research was funded in part by NSF grant CCF-1218375.

References
[1] Arvind, R. S. Nikhil, and K. K. Pingali. I-structures: data structures for parallel computing. ACM Trans. Program. Lang. Syst., 11(4), Oct. 1989.
[2] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and J. Shun. Internally deterministic parallel algorithms can be fast. In PPoPP, 2012.
[3] R. L. Bocchino, Jr., V. S. Adve, D. Dig, S. V. Adve, S. Heumann, R. Komuravelli, J. Overbey, P. Simmons, H. Sung, and M. Vakilian. A type and effect system for deterministic parallel Java. In OOPSLA, 2009.
[4] Z. Budimlić, M. Burke, V. Cavé, K. Knobe, G. Lowney, R. Newton, J. Palsberg, D. Peixotto, V. Sarkar, F. Schlimbach, and S. Taşirlar. Concurrent Collections. Sci. Program., 18(3-4), Aug. 2010.
[5] M. M. Chakravarty, G. Keller, S. Lee, T. L. McDonell, and V. Grover. Accelerating Haskell array codes with multicore GPUs. In DAMP, 2011.
[6] H. Cui, J. Wu, C.-C. Tsai, and J. Yang. Stable deterministic multithreading through schedule memoization. In OSDI, 2010.
[7] A. Foltzer, A. Kulkarni, R. Swords, S. Sasidharan, E. Jiang, and R. R. Newton. A meta-scheduler for the par-monad: Composable scheduling for the heterogeneous cloud. In ICFP: International Conference on Functional Programming. ACM, 2012.
[8] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In ASPLOS, 2002.
[9] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Information processing. North Holland, Amsterdam, Aug. 1974.
[10] O. Kiselyov, A. Sabry, and C. Swords. Extensible effects: an alternative to monad transformers. In Haskell, 2013.
[11] L. Kuper and R. R. Newton. LVars: lattice-based data structures for deterministic parallelism. In FHPC, 2013.
[12] L. Kuper, A. Turon, N. R. Krishnaswami, and R. R. Newton. Freeze after writing: Quasi-deterministic parallel programming with LVars. In POPL, 2014.
[13] C. E. Leiserson, T. B. Schardl, and J. Sukha. Deterministic parallel random-number generation for dynamic-multithreading platforms. In PPoPP, 2012.
[14] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. In SOSP, 2011.
[15] S. Marlow, S. Peyton Jones, and S. Singh. Runtime support for multicore Haskell. In ICFP, 2009.
[16] S. Marlow, R. Newton, and S. Peyton Jones. A monad for deterministic parallelism. In Haskell, 2011.
[17] P. Narayanan and R. R. Newton. Graph algorithms in a guaranteed-deterministic language. In Workshop on Determinism and Correctness in Parallel Programming (WoDet'14), 2014.
[18] R. R. Newton and I. L. Newton. PhyBin: binning trees by topology. PeerJ, 1:e187, Oct. 2013.
[19] S.-J. Sul and T. L. Williams. A randomized algorithm for comparing sets of phylogenetic trees. In APBC, 2007.
[20] V. Weinberg. Data-parallel programming with Intel Array Building Blocks (ArBB). CoRR, abs/1211.1581, 2012.

A. Using LVish: two brief examples
The LVish library that we have described in this paper is available on Hackage, the Haskell package repository.²³ To install it, run:

    $ cabal install 'lvish >= 2.0'

Once LVish is installed, you can compile and run the LVish programs from this paper, and many more.²⁴

The following simple example demonstrates the threshold read semantics of LVars. In this example, cart is an LVar representing a shopping cart to which Items, such as a Book or Shoes, can be added.

    import Control.LVish
    import Data.LVar.PureMap

    data Item = Book | Shoes deriving (Ord, Eq)

    p :: (HasPut e, HasGet e) => Par e s Int
    p = do cart <- newEmptyMap
           fork (insert Book 2 cart)
           fork (insert Shoes 1 cart)
           getKey Book cart

    main = print (runPar p)

Running this program deterministically prints 2. The two forked operations run asynchronously and in arbitrary order; the call getKey Book cart is a blocking threshold read, and will block until the operation insert Book 2 cart has occurred.

This example also demonstrates a number of other features of LVish. First, p is a Par computation parameterized by an effect level with the constraints HasPut and HasGet, indicating that p may perform LVar writes and reads. Second, running a Par computation with runPar produces a pure result. LVish also provides a runParIO function for running Par computations that return results in the IO monad.
Finally, this example demonstrates one of the many built-in data structures provided by LVish: a key-value Map. All of these data structures work with the rest of the LVish infrastructure without any additional effort on the programmer's part.

The following example demonstrates two more features of LVish: handlers, which are callbacks run every time the contents of an LVar change, and the runParThenFreeze operation, which freezes an LVar on the way out of a Par computation, allowing the exact contents of the LVar to be read in a deterministic fashion. Here, traverse is a function that performs a breadth-first traversal of a graph g starting from a given node startNode and finds all the nodes reachable from startNode.

    traverse :: HasPut e => G.Graph -> Int -> Par e s (ISet s Int)
    traverse g startNode = do
      seen <- newEmptySet
      h <- newHandler seen
             -- Callback to be run whenever a
             -- new node appears in the `seen` set.
             (\node -> do
                mapM (\v -> insert v seen) (neighbors g node)
                return ())
      insert startNode seen -- Kick things off
      return seen

    main = print (runParThenFreeze (traverse myGraph (0 :: G.Vertex)))

B. Repeating our results
In addition to the LVish library, we also provide the means for others to re-run our experiments. Infrastructure for the benchmarks in this paper, in addition to the source code of our library and further instructions, is available in the following GitHub repository:

    https://github.com/iu-parfunc/pldi2014-artifact

With a checkout of that repository, and assuming GHC 7.6.3 and Cabal 1.18, the command

    $ make everything

will compile the library and benchmarks and run them in a slightly reduced configuration from the paper. Doing so will produce three primary outputs, one for each of the three benchmarks from our paper.
• For PhyBin, presented in Section 7.1, the results are available in the phybin_results.txt file. These results can be regenerated with make phybin_bench. Note that this benchmarks PhyBin, but not the other systems we compare with.
• For the evaluation of transformer overhead, presented in Section 7.2, the results are found in the transformer_results.txt file. To regenerate just these results, run make transformer_bench.
• For parallel merge sort, presented in Section 7.3, the results are available in two text files, hs_mergesort_results.txt and c_mergesort_results.txt. The first of these shows the performance of merge sort with the leaf sequential sort implemented in Haskell; the latter with the leaf sequential sort implemented in C. These results can be regenerated with make mergesort_bench. Again, note that this does not benchmark the other systems we compare with.

For all of these benchmarks, the Makefile automatically runs up to four-core versions. The mergesort_bench_large target will run the full versions for merge sort; for the others, slight modifications to the Makefile will be needed.

Finally, note that ongoing benchmarking of the LVish development repository uses a different mechanism: the HSBencher package and run_benchmarks.hs files, which upload data to a Google Fusion Table.²⁵

Running our benchmarks in a pre-built environment
Our primary tool for making it easy to re-run our code and benchmarks is the Docker container tool, which provides lightweight virtualization on Linux systems.²⁶ With Docker, you can automatically run our full suite of benchmarks with a pre-built version of GHC by running:

    $ docker pull iuparfunc/pldi2014-artifact
    $ docker run -e USER=pldi -i -t \
        iuparfunc/pldi2014-artifact:build /bin/bash

Then follow the instructions above; all of the files are in the pldi2014-artifact directory inside the Docker container.

You can even automatically compile and run the benchmarks in the Docker environment with a single command.

    $ docker build -t pldi2014-artifact \
        github.com/iu-parfunc/pldi2014-artifact

Then you can see the results by running the following command and looking at the generated files described above.

    $ docker run -i -t pldi2014-artifact /bin/bash

²³ https://hackage.haskell.org/package/lvish
²⁴ See, for instance, https://github.com/lkuper/lvar-examples for more example LVish programs.
²⁵ https://www.google.com/fusiontables/DataSource?docid=1YxEmNpeUoGCBptDK0ddtomC_oK2IVH1f2M89IIA is an example.
²⁶ Available at http://docker.io.
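
As a supplementary illustration of the reduction described in Section 7.1, the following sketch computes all-pairs RF distances from a bipartition table using only the containers package that ships with GHC. It is an assumption-laden sketch rather than a transcription of Figure 3 or of the PhyBin source: bipartitions are represented as plain strings instead of a bit-vector DenseLabelSet, and the names Bipartition, buildTable, and distances are hypothetical. The first phase maps each bipartition to the set of trees containing it; the second phase adds one to the distance of every pair in which exactly one tree contains a given bipartition, which yields the size of the symmetric set difference.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type TreeID = Int
-- Hypothetical stand-in for a bit-vector encoding of a bipartition.
type Bipartition = String
type BipTable = Map.Map Bipartition (Set.Set TreeID)

-- Phase 1: map each observed bipartition to the set of trees containing it.
-- The table only grows as trees are added.
buildTable :: [(TreeID, Set.Set Bipartition)] -> BipTable
buildTable trees =
  Map.fromListWith Set.union
    [ (b, Set.singleton t) | (t, bips) <- trees, b <- Set.toList bips ]

-- Phase 2: for each bipartition, every pair made of one tree that has it
-- and one tree that lacks it gains one unit of distance. Pairs are keyed
-- as (smaller id, larger id) and start from zero.
distances :: [TreeID] -> BipTable -> Map.Map (TreeID, TreeID) Int
distances ids table =
  Map.fromListWith (+) $
    [ ((i, j), 0) | i <- ids, j <- ids, i < j ]
      ++ [ ((min i j, max i j), 1)
         | have <- Map.elems table
         , i <- Set.toList have
         , j <- ids
         , not (Set.member j have)
         ]
```

With three trees whose bipartition sets are {AB|CDE, ABC|DE}, {AB|CDE, ABD|CE}, and {AB|CDE, ABC|DE} (the bipartition ABD|CE is invented for this example), the sketch gives distance 2 between the first and second trees, 2 between the second and third, and 0 between the identical first and third trees. Every update in both phases is an insertion or an increment, which is the monotonic growth that makes LVar maps, sets, and bump counters a natural fit.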
