Programming Language Pragmatics, Second Edition

Programming Language Pragmatics is a very well-written textbook that captures the interest and
focus of the reader. Each of the topics is very well introduced, developed, illustrated, and integrated with the preceding and following topics. The author employs up-to-date information and
illustrates each concept by using examples from various programming languages. The level of presentation is appropriate for students, and the pedagogical features help make the chapters very easy
to follow and refer back to.
—Kamal Dahbur, DePaul University
Programming Language Pragmatics strikes a good balance between depth and breadth in its
coverage of both classic and updated languages.
—Jingke Li, Portland State University
Programming Language Pragmatics is the most comprehensive book to date on the theory and
implementation of programming languages. Prof. Scott writes well, conveying both unifying fundamental principles and the differing design choices found in today’s major languages. Several
improvements give this new second edition a more user-friendly format.
—William Calhoun, Bloomsburg University
Prof. Scott has met his goal of improving Programming Language Pragmatics by bringing the
text up-to-date and making the material more accessible for students. The addition of the chapter
on scripting languages and the use of XML to illustrate the use of scripting languages is unique in
programming languages texts and is an important addition.
—Eileen Head, Binghamton University
This new edition of Programming Language Pragmatics does an excellent job of balancing the
three critical qualities needed in a textbook: breadth, depth, and clarity. Prof. Scott manages to
cover the full gamut of programming languages, from the oldest to the newest with sufficient depth
to give students a good understanding of the important features of each, but without getting bogged
down in arcane and idiosyncratic details. The new chapter on scripting languages is a most valuable addition as this class of languages continues to emerge as a major mainstream technology.
This book is sure to become the gold standard of the field.
—Christopher Vickery, Queens College of CUNY
Programming Language Pragmatics not only explains language concepts and implementation
details with admirable clarity, but also shows how computer architecture and compilers influence language design and implementation. . . This book shows that programming languages are
the true center of computer science—the bridges spanning the chasm between programmer and
machine.
—From the Foreword by Jim Larus, Microsoft Research
About the Author
Michael L. Scott is a professor and past chair of the Department of Computer Science at the University of Rochester. He received his Ph.D. in computer sciences in
1985 from the University of Wisconsin–Madison. His research interests lie at the
intersection of programming languages, operating systems, and high-level computer architecture, with an emphasis on parallel and distributed computing. He
is the designer of the Lynx distributed programming language and a codesigner
of the Charlotte and Psyche parallel operating systems, the Bridge parallel file system, and the Cashmere and InterWeave shared memory systems. His MCS mutual exclusion lock, codesigned with John Mellor-Crummey, is used in a variety
of commercial and academic systems. Several other algorithms, codesigned with
Maged Michael and Bill Scherer, appear in the java.util.concurrent standard
library.
Dr. Scott is a member of the Association for Computing Machinery, the Institute of Electrical and Electronics Engineers, the Union of Concerned Scientists, and Computer Professionals for Social Responsibility. He has served on a
wide variety of program committees and grant review panels, and has been a
principal or coinvestigator on grants from the NSF, ONR, DARPA, NASA, the
Departments of Energy and Defense, the Ford Foundation, Digital Equipment
Corporation (now HP), Sun Microsystems, Intel, and IBM. He has contributed
to the GRE advanced exam in computer science, and is the author of some 95
refereed publications. In 2003 he chaired the ACM Symposium on Operating
Systems Principles. He received a Bell Labs Doctoral Scholarship in 1983 and an
IBM Faculty Development Award in 1986. In 2001 he received the University of
Rochester’s Robert and Pamela Goergen Award for Distinguished Achievement
and Artistry in Undergraduate Teaching.
Programming Language Pragmatics
SECOND EDITION
Michael L. Scott
Department of Computer Science
University of Rochester
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann Publishers is an imprint of Elsevier
Publishing Director: Michael Forster
Publisher: Denise Penrose
Publishing Services Manager: Andre Cuello
Assistant Publishing Services Manager
Project Manager: Carl M. Soares
Developmental Editor: Nate McFadden
Editorial Assistant: Valerie Witte
Cover Design: Ross Carron Designs
Cover Image: © Brand X Pictures/Corbin Images
Text Design: Julio Esperas
Composition: VTEX
Technical Illustration: Dartmouth Publishing Inc.
Copyeditor: Debbie Prato
Proofreader: Phyllis Coyne et al. Proofreading Service
Indexer: Ferreira Indexing Inc.
Interior printer: Maple-Vail
Cover printer: Phoenix Color
Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
© 2006 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered
trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names
appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies
for more complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written
permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford,
UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected] You may also
complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting “Customer Support” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Application Submitted
ISBN 13: 978-0-12-633951-2
ISBN 10: 0-12-633951-1
For information on all Morgan Kaufmann publications,
visit our Web site at www.mkp.com or www.books.elsevier.com
Printed in the United States of America
05 06 07 08 09
5 4 3 2 1
To the roses now in full bloom.
Foreword
Computer science excels at layering abstraction on abstraction. Our field’s facility
for hiding details behind a simplified interface is both a virtue and a necessity.
Operating systems, databases, and compilers are very complex programs shaped
by forty years of theory and development. For the most part, programmers need
little or no understanding of the internal logic or structure of a piece of software
to use it productively. Most of the time, ignorance is bliss.
Opaque abstraction, however, can become a brick wall, preventing forward
progress, instead of a sound foundation for new artifacts. Consider the subject
of this book, programs and programming languages. What happens when a program runs too slowly, and profiling cannot identify any obvious bottleneck or the
bottleneck does not have an algorithmic explanation? The problem may lie in the way language constructs are translated into machine instructions, or in how the generated code interacts with a processor’s architecture. Correcting these problems requires an understanding that bridges levels of abstraction.
Abstraction can also stand in the path of learning. Simple questions—how
programs written in a small, stilted subset of English can control machines that
speak binary or why programming languages, despite their ever growing variety
and quantity, all seem fairly similar—cannot be answered except by diving into
the details and understanding computers, compilers, and languages.
A computer science education, taken as a whole, can answer these questions.
Most undergraduate programs offer courses about computer architecture, operating systems, programming language design, and compilers. These are all fascinating courses that are well worth taking—but difficult to fit into most study
plans along with the many other rich offerings of an undergraduate computer
science curriculum. Moreover, courses are often taught as self-contained subjects
and do not explain a subject’s connections to other disciplines.
This book also answers these questions, by looking beyond the abstractions
that divide these subjects. Michael Scott is a talented researcher who has made
major contributions in language implementation, run-time systems, and computer architecture. He is exceptionally well qualified to draw on all of these fields
to provide a coherent understanding of modern programming languages. This
book not only explains language concepts and implementation details with admirable clarity, but also shows how computer architecture and compilers influence language design and implementation. Moreover, it neatly illustrates how
different languages are actually used, with realistic examples to clearly show how
problem domains shape languages as well.
In the interest of full disclosure, I must confess this book worried me when I first
read it. At the time, I thought Michael’s approach de-emphasized programming
languages and compilers in the curriculum and would leave students with a superficial understanding of the field. But now, having reread the book, I have come
to realize that in fact the opposite is true. By presenting them in their proper context, this book shows that programming languages are the true center of computer science—the bridges spanning the chasm between programmer and machine.
James Larus, Microsoft Research
Contents

Foreword
Preface

I FOUNDATIONS

1 Introduction
1.1 The Art of Language Design
1.2 The Programming Language Spectrum
1.3 Why Study Programming Languages?
1.4 Compilation and Interpretation
1.5 Programming Environments
1.6 An Overview of Compilation
    1.6.1 Lexical and Syntax Analysis
    1.6.2 Semantic Analysis and Intermediate Code Generation
    1.6.3 Target Code Generation
    1.6.4 Code Improvement
1.7 Summary and Concluding Remarks
1.8 Exercises
1.9 Explorations
1.10 Bibliographic Notes

2 Programming Language Syntax
2.1 Specifying Syntax
    2.1.1 Tokens and Regular Expressions
    2.1.2 Context-Free Grammars
    2.1.3 Derivations and Parse Trees
2.2 Scanning
    2.2.1 Generating a Finite Automaton
    2.2.2 Scanner Code
    2.2.3 Table-Driven Scanning
    2.2.4 Lexical Errors
    2.2.5 Pragmas
2.3 Parsing
    2.3.1 Recursive Descent
    2.3.2 Table-Driven Top-Down Parsing
    2.3.3 Bottom-Up Parsing
    2.3.4 Syntax Errors (CD)
2.4 Theoretical Foundations (CD)
    2.4.1 Finite Automata
    2.4.2 Push-Down Automata
    2.4.3 Grammar and Language Classes
2.5 Summary and Concluding Remarks
2.6 Exercises
2.7 Explorations
2.8 Bibliographic Notes

3 Names, Scopes, and Bindings
3.1 The Notion of Binding Time
3.2 Object Lifetime and Storage Management
    3.2.1 Static Allocation
    3.2.2 Stack-Based Allocation
    3.2.3 Heap-Based Allocation
    3.2.4 Garbage Collection
3.3 Scope Rules
    3.3.1 Static Scope
    3.3.2 Nested Subroutines
    3.3.3 Declaration Order
    3.3.4 Modules
    3.3.5 Module Types and Classes
    3.3.6 Dynamic Scope
3.4 Implementing Scope (CD)
    3.4.1 Symbol Tables
    3.4.2 Association Lists and Central Reference Tables
3.5 The Binding of Referencing Environments
    3.5.1 Subroutine Closures
    3.5.2 First- and Second-Class Subroutines
3.6 Binding Within a Scope
    3.6.1 Aliases
    3.6.2 Overloading
    3.6.3 Polymorphism and Related Concepts
3.7 Separate Compilation (CD)
    3.7.1 Separate Compilation in C
    3.7.2 Packages and Automatic Header Inference
    3.7.3 Module Hierarchies
3.8 Summary and Concluding Remarks
3.9 Exercises
3.10 Explorations
3.11 Bibliographic Notes

4 Semantic Analysis
4.1 The Role of the Semantic Analyzer
4.2 Attribute Grammars
4.3 Evaluating Attributes
4.4 Action Routines
4.5 Space Management for Attributes (CD)
    4.5.1 Bottom-Up Evaluation
    4.5.2 Top-Down Evaluation
4.6 Decorating a Syntax Tree
4.7 Summary and Concluding Remarks
4.8 Exercises
4.9 Explorations
4.10 Bibliographic Notes

5 Target Machine Architecture
5.1 The Memory Hierarchy
5.2 Data Representation
    5.2.1 Computer Arithmetic (CD)
5.3 Instruction Set Architecture
    5.3.1 Addressing Modes
    5.3.2 Conditions and Branches
5.4 Architecture and Implementation
    5.4.1 Microprogramming
    5.4.2 Microprocessors
    5.4.3 RISC
    5.4.4 Two Example Architectures: The x86 and MIPS (CD)
    5.4.5 Pseudo-Assembly Notation
5.5 Compiling for Modern Processors
    5.5.1 Keeping the Pipeline Full
    5.5.2 Register Allocation
5.6 Summary and Concluding Remarks
5.7 Exercises
5.8 Explorations
5.9 Bibliographic Notes

II CORE ISSUES IN LANGUAGE DESIGN

6 Control Flow
6.1 Expression Evaluation
    6.1.1 Precedence and Associativity
    6.1.2 Assignments
    6.1.3 Initialization
    6.1.4 Ordering Within Expressions
    6.1.5 Short-Circuit Evaluation
6.2 Structured and Unstructured Flow
    6.2.1 Structured Alternatives to goto
    6.2.2 Continuations
6.3 Sequencing
6.4 Selection
    6.4.1 Short-Circuited Conditions
    6.4.2 Case / Switch Statements
6.5 Iteration
    6.5.1 Enumeration-Controlled Loops
    6.5.2 Combination Loops
    6.5.3 Iterators
    6.5.4 Generators in Icon (CD)
    6.5.5 Logically Controlled Loops
6.6 Recursion
    6.6.1 Iteration and Recursion
    6.6.2 Applicative- and Normal-Order Evaluation
6.7 Nondeterminacy (CD)
6.8 Summary and Concluding Remarks
6.9 Exercises
6.10 Explorations
6.11 Bibliographic Notes

7 Data Types
7.1 Type Systems
    7.1.1 Type Checking
    7.1.2 Polymorphism
    7.1.3 The Definition of Types
    7.1.4 The Classification of Types
    7.1.5 Orthogonality
7.2 Type Checking
    7.2.1 Type Equivalence
    7.2.2 Type Compatibility
    7.2.3 Type Inference
    7.2.4 The ML Type System (CD)
7.3 Records (Structures) and Variants (Unions)
    7.3.1 Syntax and Operations
    7.3.2 Memory Layout and Its Impact
    7.3.3 With Statements (CD)
    7.3.4 Variant Records
7.4 Arrays
    7.4.1 Syntax and Operations
    7.4.2 Dimensions, Bounds, and Allocation
    7.4.3 Memory Layout
7.5 Strings
7.6 Sets
7.7 Pointers and Recursive Types
    7.7.1 Syntax and Operations
    7.7.2 Dangling References
    7.7.3 Garbage Collection
7.8 Lists
7.9 Files and Input/Output (CD)
    7.9.1 Interactive I/O
    7.9.2 File-Based I/O
    7.9.3 Text I/O
7.10 Equality Testing and Assignment
7.11 Summary and Concluding Remarks
7.12 Exercises
7.13 Explorations
7.14 Bibliographic Notes

8 Subroutines and Control Abstraction
8.1 Review of Stack Layout
8.2 Calling Sequences
    8.2.1 Displays (CD)
    8.2.2 Case Studies: C on the MIPS; Pascal on the x86 (CD)
    8.2.3 Register Windows (CD)
    8.2.4 In-Line Expansion
8.3 Parameter Passing
    8.3.1 Parameter Modes
    8.3.2 Call by Name (CD)
    8.3.3 Special Purpose Parameters
    8.3.4 Function Returns
8.4 Generic Subroutines and Modules
    8.4.1 Implementation Options
    8.4.2 Generic Parameter Constraints
    8.4.3 Implicit Instantiation
    8.4.4 Generics in C++, Java, and C# (CD)
8.5 Exception Handling
    8.5.1 Defining Exceptions
    8.5.2 Exception Propagation
    8.5.3 Example: Phrase-Level Recovery in a Recursive Descent Parser
    8.5.4 Implementation of Exceptions
8.6 Coroutines
    8.6.1 Stack Allocation
    8.6.2 Transfer
    8.6.3 Implementation of Iterators (CD)
    8.6.4 Discrete Event Simulation (CD)
8.7 Summary and Concluding Remarks
8.8 Exercises
8.9 Explorations
8.10 Bibliographic Notes

9 Data Abstraction and Object Orientation
9.1 Object-Oriented Programming
9.2 Encapsulation and Inheritance
    9.2.1 Modules
    9.2.2 Classes
    9.2.3 Type Extensions
9.3 Initialization and Finalization
    9.3.1 Choosing a Constructor
    9.3.2 References and Values
    9.3.3 Execution Order
    9.3.4 Garbage Collection
9.4 Dynamic Method Binding
    9.4.1 Virtual and Nonvirtual Methods
    9.4.2 Abstract Classes
    9.4.3 Member Lookup
    9.4.4 Polymorphism
    9.4.5 Closures
9.5 Multiple Inheritance (CD)
    9.5.1 Semantic Ambiguities
    9.5.2 Replicated Inheritance
    9.5.3 Shared Inheritance
    9.5.4 Mix-In Inheritance
9.6 Object-Oriented Programming Revisited
    9.6.1 The Object Model of Smalltalk (CD)
9.7 Summary and Concluding Remarks
9.8 Exercises
9.9 Explorations
9.10 Bibliographic Notes

III ALTERNATIVE PROGRAMMING MODELS

10 Functional Languages
10.1 Historical Origins
10.2 Functional Programming Concepts
10.3 A Review/Overview of Scheme
    10.3.1 Bindings
    10.3.2 Lists and Numbers
    10.3.3 Equality Testing and Searching
    10.3.4 Control Flow and Assignment
    10.3.5 Programs as Lists
    10.3.6 Extended Example: DFA Simulation
10.4 Evaluation Order Revisited
    10.4.1 Strictness and Lazy Evaluation
    10.4.2 I/O: Streams and Monads
10.5 Higher-Order Functions
10.6 Theoretical Foundations (CD)
    10.6.1 Lambda Calculus
    10.6.2 Control Flow
    10.6.3 Structures
10.7 Functional Programming in Perspective
10.8 Summary and Concluding Remarks
10.9 Exercises
10.10 Explorations
10.11 Bibliographic Notes

11 Logic Languages
11.1 Logic Programming Concepts
11.2 Prolog
    11.2.1 Resolution and Unification
    11.2.2 Lists
    11.2.3 Arithmetic
    11.2.4 Search/Execution Order
    11.2.5 Extended Example: Tic-Tac-Toe
    11.2.6 Imperative Control Flow
    11.2.7 Database Manipulation
11.3 Theoretical Foundations (CD)
    11.3.1 Clausal Form
    11.3.2 Limitations
    11.3.3 Skolemization
11.4 Logic Programming in Perspective
    11.4.1 Parts of Logic Not Covered
    11.4.2 Execution Order
    11.4.3 Negation and the “Closed World” Assumption
11.5 Summary and Concluding Remarks
11.6 Exercises
11.7 Explorations
11.8 Bibliographic Notes

12 Concurrency
12.1 Background and Motivation
    12.1.1 A Little History
    12.1.2 The Case for Multithreaded Programs
    12.1.3 Multiprocessor Architecture
12.2 Concurrent Programming Fundamentals
    12.2.1 Communication and Synchronization
    12.2.2 Languages and Libraries
    12.2.3 Thread Creation Syntax
    12.2.4 Implementation of Threads
12.3 Shared Memory
    12.3.1 Busy-Wait Synchronization
    12.3.2 Scheduler Implementation
    12.3.3 Semaphores
    12.3.4 Monitors
    12.3.5 Conditional Critical Regions
    12.3.6 Implicit Synchronization
12.4 Message Passing
    12.4.1 Naming Communication Partners
    12.4.2 Sending
    12.4.3 Receiving
    12.4.4 Remote Procedure Call
12.5 Summary and Concluding Remarks
12.6 Exercises
12.7 Explorations
12.8 Bibliographic Notes

13 Scripting Languages
13.1 What Is a Scripting Language?
    13.1.1 Common Characteristics
13.2 Problem Domains
    13.2.1 Shell (Command) Languages
    13.2.2 Text Processing and Report Generation
    13.2.3 Mathematics and Statistics
    13.2.4 “Glue” Languages and General Purpose Scripting
    13.2.5 Extension Languages
13.3 Scripting the World Wide Web
    13.3.1 CGI Scripts
    13.3.2 Embedded Server-Side Scripts
    13.3.3 Client-Side Scripts
    13.3.4 Java Applets
    13.3.5 XSLT
13.4 Innovative Features
    13.4.1 Names and Scopes
    13.4.2 String and Pattern Manipulation
    13.4.3 Data Types
    13.4.4 Object Orientation
13.5 Summary and Concluding Remarks
13.6 Exercises
13.7 Explorations
13.8 Bibliographic Notes

IV A CLOSER LOOK AT IMPLEMENTATION

14 Building a Runnable Program
14.1 Back-End Compiler Structure
    14.1.1 A Plausible Set of Phases
    14.1.2 Phases and Passes
14.2 Intermediate Forms (CD)
    14.2.1 Diana
    14.2.2 GNU RTL
14.3 Code Generation
    14.3.1 An Attribute Grammar Example
    14.3.2 Register Allocation
14.4 Address Space Organization
14.5 Assembly
    14.5.1 Emitting Instructions
    14.5.2 Assigning Addresses to Names
14.6 Linking
    14.6.1 Relocation and Name Resolution
    14.6.2 Type Checking
14.7 Dynamic Linking (CD)
    14.7.1 Position-Independent Code
    14.7.2 Fully Dynamic (Lazy) Linking
14.8 Summary and Concluding Remarks
14.9 Exercises
14.10 Explorations
14.11 Bibliographic Notes

15 Code Improvement (CD)
15.1 Phases of Code Improvement
15.2 Peephole Optimization
15.3 Redundancy Elimination in Basic Blocks
    15.3.1 A Running Example
    15.3.2 Value Numbering
15.4 Global Redundancy and Data Flow Analysis
    15.4.1 SSA Form and Global Value Numbering
    15.4.2 Global Common Subexpression Elimination
15.5 Loop Improvement I
    15.5.1 Loop Invariants
    15.5.2 Induction Variables
15.6 Instruction Scheduling
15.7 Loop Improvement II
    15.7.1 Loop Unrolling and Software Pipelining
    15.7.2 Loop Reordering
15.8 Register Allocation
15.9 Summary and Concluding Remarks
15.10 Exercises
15.11 Explorations
15.12 Bibliographic Notes

A Programming Languages Mentioned
B Language Design and Language Implementation
C Numbered Examples
Bibliography
Index
Preface
A course in computer programming provides the typical student’s first exposure to the field of computer science. Most students in such a course will have
used computers all their lives, for e-mail, games, web browsing, word processing,
instant messaging, and a host of other tasks, but it is not until they write their
first programs that they begin to appreciate how applications work. After gaining
a certain level of facility as programmers (presumably with the help of a good
course in data structures and algorithms), the natural next step is to wonder how
programming languages work. This book provides an explanation. It aims, quite
simply, to be the most comprehensive and accurate languages text available, in
a style that is engaging and accessible to the typical undergraduate. This aim reflects my conviction that students will understand more, and enjoy the material
more, if we explain what is really going on.
In the conventional “systems” curriculum, the material beyond data structures (and possibly computer organization) tends to be compartmentalized into
a host of separate subjects, including programming languages, compiler construction, computer architecture, operating systems, networks, parallel and distributed computing, database management systems, and possibly software engineering, object-oriented design, graphics, or user interface systems. One problem
with this compartmentalization is that the list of subjects keeps growing, but the
number of semesters in a bachelor’s program does not. More important, perhaps,
many of the most interesting discoveries in computer science occur at the boundaries between subjects. The RISC revolution, for example, forged an alliance between computer architecture and compiler construction that has endured for 20
years. More recently, renewed interest in virtual machines has blurred the boundary between the operating system kernel and the language run-time system. The
spread of Java and .NET has similarly blurred the boundary between the compiler
and the run-time system. Programs are now routinely embedded in web pages,
spreadsheets, and user interfaces.
Increasingly, both educators and practitioners are recognizing the need to emphasize these sorts of interactions. Within higher education in particular there is
a growing trend toward integration in the core curriculum. Rather than give the
typical student an in-depth look at two or three narrow subjects, leaving holes in
all the others, many schools have revised the programming languages and operating systems courses to cover a wider range of topics, with follow-on electives
in various specializations. This trend is very much in keeping with the findings
of the ACM/IEEE-CS Computing Curricula 2001 task force, which emphasize the
growth of the field, the increasing need for breadth, the importance of flexibility
in curricular design, and the overriding goal of graduating students who “have
a system-level perspective, appreciate the interplay between theory and practice,
are familiar with common themes, and can adapt over time as the field evolves”
[CR01, Sec. 11.1, adapted].
The first edition of Programming Language Pragmatics (PLP-1e) had the
good fortune of riding this curricular trend. The second edition continues and
strengthens the emphasis on integrated learning while retaining a central focus
on programming language design.
At its core, PLP is a book about how programming languages work. Rather than
enumerate the details of many different languages, it focuses on concepts that
underlie all the languages the student is likely to encounter, illustrating those
concepts with a variety of concrete examples, and exploring the tradeoffs that
explain why different languages were designed in different ways. Similarly, rather
than explain how to build a compiler or interpreter (a task few programmers will
undertake in its entirety), PLP focuses on what a compiler does to an input program, and why. Language design and implementation are thus explored together,
with an emphasis on the ways in which they interact.
Changes in the Second Edition
There were four main goals for the second edition:
1. Introduce new material, most notably scripting languages.
2. Bring the book up to date with respect to everything else that has happened
in the last six years.
3. Resist the pressure toward rising textbook prices.
4. Strengthen the book from a pedagogical point of view, to make it more useful
and accessible.
Item (1) is the most significant change in content. With the explosion of the
World Wide Web, languages like Perl, PHP, Tcl/Tk, Python, Ruby, JavaScript, and
XSLT have seen an enormous upsurge not only in commercial significance, but
also in design innovation. Many of today’s graduates will spend more of their
time working with scripting languages than with C++, Java, or C#. The new chapter on scripting languages (Chapter 13) is organized first by application domain
(shell languages, text processing and report generation, mathematics and statistics, “glue” languages and general purpose scripting, extension languages, scripting the World Wide Web) and then by innovative features (names and scopes,
string and pattern manipulation, high level data types, object orientation). References to scripting languages have also been added wherever appropriate throughout the rest of the text.
Item (2) reflects such key developments as the finalized C99 standard and the
appearance of Java 5 and C# (version 2.0). Chapter 6 (Control Flow) now covers boxing, unboxing, and the latest iterator constructs. Chapter 8 (Subroutines)
covers Java and C# generics. Chapter 12 (Concurrency) covers the Java 5 concurrency library (JSR 166). References to C# have been added where appropriate
throughout. In keeping with changes in the microprocessor market, the ubiquitous Intel/AMD x86 has replaced the Motorola 68000 in the case studies of
Chapters 5 (Architecture) and 8 (Subroutines). The MIPS case study in Chapter 8 has been updated to 64-bit mode. References to technological constants and
trends have also been updated. In several places I have rewritten examples to use
languages with which students are more likely to be familiar; this process will
undoubtedly continue in future editions.
Many sections have been heavily rewritten to make them clearer or more accurate. These include coverage of finite automaton creation (2.2.1); declaration
order (3.3.3); modules (3.3.4); aliases and overloading (3.6.1 and 3.6.2); polymorphism and generics (3.6.3, 7.1.2, 8.4, and 9.4.4); separate compilation (3.7);
continuations, exceptions, and multilevel returns (6.2.1, 6.2.2, and 8.5); calling
sequences (8.2); and most of Chapter 5.
Item (3) reflects Morgan Kaufmann’s commitment to making definitive texts
available at student-friendly prices. PLP-1e was larger and more comprehensive
than competing texts, but sold for less. This second edition keeps a handle on
price (and also reduces bulk) with high-quality paperback construction.
Finally, item (4) encompasses a large number of presentational changes. Some
of these are relatively small. There are more frequent section headings, for example, and more historical anecdotes. More significantly, the book has been organized into four major parts:
Part I covers foundational material: (1) Introduction to Language Design and
Implementation; (2) Programming Language Syntax; (3) Names, Scopes, and
Bindings; (4) Semantic Analysis; and (5) Target Machine Architecture. The
second and fifth of these have a fairly heavy focus on implementation issues.
The first and fourth are mixed. The third introduces core issues in language
design.
Part II continues the coverage of core issues: (6) Control Flow; (7) Data Types;
(8) Subroutines and Control Abstraction; and (9) Data Abstraction and Object Orientation. The last of these has moved forward from its position in PLP-1e, reflecting the centrality of object-oriented programming to much of modern computing.
Part III turns to alternative programming models: (10) Functional Languages;
(11) Logic Languages; (12) Concurrency; and (13) Scripting Languages. Functional and logic languages shared a single chapter in PLP-1e.
Part IV returns to language implementation: (14) Building a Runnable Program (code generation, assembly, and linking); and (15) Code Improvement
(optimization).
The PLP CD
To minimize the physical size of the text, make way for new material, and allow
students to focus on the fundamentals when browsing, approximately 250 pages
of more advanced or peripheral material has been moved to a companion CD.
For the most part (though not exclusively), this material comprises the sections
that were identified as advanced or optional in PLP-1e.
The most significant single move is the entire chapter on code improvement
(15). The rest of the moved material consists of scattered, shorter sections. Each
such section is represented in the text by a brief introduction to the subject and
an “In More Depth” paragraph that summarizes the elided material.
Note that the placement of material on the CD does not constitute a judgment
about its technical importance. It simply reflects the fact that there is more material worth covering than will fit in a single volume or a single course. My intent is
to retain in the printed text the material that is likely to be covered in the largest
number of courses.
Design & Implementation Sidebars
PLP-1e placed a heavy emphasis on the ways in which language design constrains
implementation options, and the ways in which anticipated implementations
have influenced language design. PLP-2e uses more than 120 sidebars to make
these connections more explicit. A more detailed introduction to these sidebars
appears on page 7 (Chapter 1). A numbered list appears in Appendix B.
Numbered and Titled Examples
Examples in PLP-2e are intimately woven into the flow of the presentation. To
make it easier to find specific examples, to remember their content, and to refer
to them in other contexts, a number and a title for each is now displayed in a
marginal note. There are nearly 900 such examples across the main text and the
CD. A detailed list appears in Appendix C.
Exercise Plan
PLP-1e contained a total of 385 review questions and 312 exercises, located at the
ends of chapters. Review questions in the second edition have been moved to the
ends of sections, closer to the material they cover, to make it easier to tell when
one has grasped the central concepts. The total number of such questions has
nearly doubled.
The problems remaining at the ends of chapters have now been divided
into Exercises and Explorations. The former are intended to be more or less
straightforward, though more challenging than the per-section review questions; they should be suitable for homework or brief projects. The exploration
questions are more open-ended, requiring web or library research, substantial
time commitment, or the development of subjective opinion. The total number of questions has increased from a little over 300 in PLP-1e to over 500
in the current edition. Solutions to the exercises (but not the explorations)
are available to registered instructors from a password-protected web site: visit
www.mkp.com/companions/0126339511/.
How to Use the Book
Programming Language Pragmatics covers almost all of the material in the PL
“knowledge units” of the Computing Curricula 2001 report [CR01]. The book is
an ideal fit for the CS 341 model course (Programming Language Design), and
can also be used for CS 340 (Compiler Construction) or CS 343 (Programming
Paradigms). It contains a significant fraction of the content of CS 344 (Functional
Programming) and CS 346 (Scripting Languages). Figure 1 illustrates several possible paths through the text.
For self-study, or for a full-year course (track F in Figure 1), I recommend
working through the book from start to finish, turning to the PLP CD as each “In
More Depth” section is encountered. The one-semester course at the University
of Rochester (track R), for which the text was originally developed, also covers
most of the book but leaves out most of the CD sections, as well as bottom-up
parsing (2.3.3), message passing (12.4), web scripting (13.3), and most of Chapter 14 (Building a Runnable Program).
Some chapters (2, 4, 5, 14, 15) have a heavier emphasis than others on implementation issues. These can be reordered to a certain extent with respect to the
more design-oriented chapters, but it is important that Chapter 5 or its equivalent be covered before Chapters 6 through 9. Many students will already be familiar with some of the material in Chapter 5, most likely from a course on computer
organization. In this case the chapter may simply be skimmed for review. Some
students may also be familiar with some of the material in Chapter 2, perhaps
from a course on automata theory. Much of this chapter can then be read quickly
as well, pausing perhaps to dwell on such practical issues as recovery from syntax
errors, or the ways in which a scanner differs from a classical finite automaton.
A traditional programming languages course (track P in Figure 1) might leave
out all of scanning and parsing, plus all of Chapters 4 and 5. It would also
deemphasize the more implementation-oriented material throughout.

Figure 1 Paths through the text. Darker shaded regions indicate supplemental “In More Depth” sections on the PLP CD. Section numbers are shown for breaks that do not correspond to supplemental material.

In place of these it could add such design-oriented CD sections as the ML type system (7.2.4), multiple inheritance (9.5), Smalltalk (9.6.1), lambda calculus (10.6),
and predicate calculus (11.3).
PLP has also been used at some schools for an introductory compiler course
(track C in Figure 1). The typical syllabus leaves out most of Part III (Chapters 10
through 13), and deemphasizes the more design-oriented material throughout.
In place of these it includes all of scanning and parsing, Chapters 14 and 15, and
a slightly different mix of other CD sections.
For a school on the quarter system, an appealing option is to offer an introductory one-quarter course and two optional follow-on courses (track Q in Figure 1). The introductory quarter might cover the main (non-CD) sections of
Chapters 1, 3, 6, and 7, plus the first halves of Chapters 2 and 8. A language-oriented follow-on quarter might cover the rest of Chapter 8, all of Part III, CD sections from Chapters 6 through 8, and possibly supplemental material on formal semantics, type systems, or other related topics. A compiler-oriented follow-on quarter might cover the rest of Chapter 2; Chapters 4–5 and 14–15, CD sections from Chapters 3 and 8–9, and possibly supplemental material on automatic
code generation, aggressive code improvement, programming tools, and so on.
Whatever the path through the text, I assume that the typical reader has already acquired significant experience with at least one imperative language. Exactly which language it is shouldn’t matter. Examples are drawn from a wide
variety of languages, but always with enough comments and other discussion
that readers without prior experience should be able to understand easily. Single-paragraph introductions to some 50 different languages appear in Appendix A.
Algorithms, when needed, are presented in an informal pseudocode that should
be self-explanatory. Real programming language code is set in "typewriter"
font. Pseudocode is set in a sans-serif font.
Supplemental Materials
In addition to supplemental sections of the text, the PLP CD contains a variety
of other resources:
Links to language reference manuals and tutorials on the Web
Links to Open Source compilers and interpreters
Complete source code for all nontrivial examples in the book (more than 300
source files)
Search engine for both the main text and the CD-only content
Additional resources are available at www.mkp.com/companions/0126339511/
(you may wish to check back from time to time). For instructors who have
adopted the text, a password-protected page provides access to
Editable PDF source for all the figures in the book
Editable PowerPoint slides
Solutions to most of the exercises
Suggestions for larger projects
Acknowledgments for the Second Edition
In preparing the second edition I have been blessed with the generous assistance
of a very large number of people. Many provided errata or other feedback on
the first edition, among them Manuel E. Bermudez, John Boyland, Brian Cumming, Stephen A. Edward, Michael J. Eulenstein, Tayssir John Gabbour, Tommaso Galleri, Eileen Head, David Hoffman, Paul Ilardi, Lucian Ilie, Rahul Jain,
Eric Joanis, Alan Kaplan, Les Lander, Jim Larus, Hui Li, Jingke Li, Evangelos Milios, Eduardo Pinheiro, Barbara Ryder, Nick Stuifbergen, Raymond Toal, Andrew
Tolmach, Jens Troeger, and Robbert van Renesse. Zongyan Qiu prepared the Chinese translation, and found several bugs in the process. Simon Fillat maintained
the Morgan Kaufmann web site. I also remain indebted to the many other people, acknowledged in the first edition, who helped in that earlier endeavor, and to
the reviewers, adopters, and readers who made it a success. Their contributions
continue to be reflected in the current edition.
Work on the second edition began in earnest with a “focus group” at
SIGCSE ’02; my thanks to Denise Penrose, Emilia Thiuri, and the rest of the
team at Morgan Kaufmann for organizing that event, to the approximately two
dozen attendees who shared their thoughts on content and pedagogy, and to the
many other individuals who reviewed two subsequent revision plans.
A draft of the second edition was class tested in the fall of 2004 at eight different universities. I am grateful to Gerald Baumgartner (Louisiana State University), William Calhoun (Bloomsburg University), Betty Cheng (Michigan State
University), Jingke Li (Portland State University), Beverly Sanders (University of
Florida), Darko Stefanovic (University of New Mexico), Raymond Toal (Loyola
Marymount University), Robert van Engelen (Florida State University), and all
their students for a mountain of suggestions, reactions, bug fixes, and other feedback. Professor van Engelen provided several excellent end-of-chapter exercises.
External reviewers for the second edition also provided a wealth of useful suggestions. My thanks to Richard J. Botting (California State University,
San Bernardino), Kamal Dahbur (DePaul University), Stephen A. Edwards
(Columbia University), Eileen Head (Binghamton University), Li Liao (University of Delaware), Christopher Vickery (Queens College, City University of New
York), Garrett Wollman (MIT), Neng-Fa Zhou (Brooklyn College, City University of New York), and Cynthia Brown Zickos (University of Mississippi). Garrett Wollman’s technical review of Chapter 13 was particularly helpful, as were
his earlier comments on a variety of topics in the first edition. Sadly, time has
not permitted me to do justice to everyone’s suggestions. I have incorporated
as much as I could, and have carefully saved the rest for guidance on the third
edition. Problems that remain in the current edition are entirely my own.
PLP-2e was also class tested at the University of Rochester in the fall of 2004.
I am grateful to all my students, and to John Heidkamp, David Lu, and Dan Mullowney in particular, for their enthusiasm and suggestions. Mike Spear provided
several helpful pointers on web technology for Chapter 13. Over the previous
several years, my colleagues Chen Ding and Sandhya Dwarkadas taught from the
first edition several times and had many helpful suggestions. Chen’s feedback on
Chapter 15 (assisted by Yutao Zhong) was particularly valuable. My thanks as
well to the rest of my colleagues, to department chair Mitsunori Ogihara, and
to the department’s administrative, secretarial, and technical staff for providing
such a supportive and productive work environment.
As they were on the first edition, the staff at Morgan Kaufmann have been a
genuine pleasure to work with, on both a professional and a personal level. My
thanks in particular to Denise Penrose, publisher; Nate McFadden, editor; Carl
Soares, production editor; Peter Ashenden, CD designer; Brian Grimm, marketing manager; and Valerie Witte, editorial assistant.
Most important, I am indebted to my wife, Kelly, and our daughters, Erin and
Shannon, for their patience and support through endless months of writing and
revising. Computing is a fine profession, but family is what really matters.
Michael L. Scott
Rochester, NY
April 2005
I Foundations
A central premise of Programming Language Pragmatics is that language design and implementation
are intimately connected; it’s hard to study one without the other.
The bulk of the text—Parts II and III—is organized around topics in language design, but with
detailed coverage throughout of the many ways in which design decisions have been shaped by
implementation concerns.
The first five chapters—Part I—set the stage by covering foundational material in both design
and implementation. Chapter 1 motivates the study of programming languages, introduces the major language families, and provides an overview of the compilation process. Chapter 3 covers the
high-level structure of programs, with an emphasis on names, the binding of names to objects, and
the scope rules that govern which bindings are active at any given time. In the process it touches on
storage management; subroutines, modules, and classes; polymorphism; and separate compilation.
Chapters 2, 4, and 5 are more implementation-oriented. They provide the background needed to
understand the implementation issues mentioned in Parts II and III. Chapter 2 discusses the syntax,
or textual structure, of programs. It introduces regular expressions and context-free grammars, which
designers use to describe program syntax, together with the scanning and parsing algorithms that a
compiler or interpreter uses to recognize that syntax. Given an understanding of syntax, Chapter 4
explains how a compiler (or interpreter) determines the semantics, or meaning of a program. The
discussion is organized around the notion of attribute grammars, which serve to map a program
onto something else that has meaning, like mathematics or some other existing language. Finally,
Chapter 5 provides an overview of assembly-level computer architecture, focusing on the features of
modern microprocessors most relevant to compilers. Programmers who understand these features
have a better chance not only of understanding why the languages they use were designed the way
they were, but also of using those languages as fully and effectively as possible.
1 Introduction
EXAMPLE 1.1  GCD program in MIPS machine language
The first electronic computers were monstrous contraptions, filling
several rooms, consuming as much electricity as a good-size factory, and costing
millions of 1940s dollars (but with the computing power of a modern hand-held
calculator). The programmers who used these machines believed that the computer’s time was more valuable than theirs. They programmed in machine language. Machine language is the sequence of bits that directly controls a processor,
causing it to add, compare, move data from one place to another, and so forth at
appropriate times. Specifying programs at this level of detail is an enormously tedious task. The following program calculates the greatest common divisor (GCD)
of two integers, using Euclid’s algorithm. It is written in machine language, expressed here as hexadecimal (base 16) numbers, for the MIPS R4000 processor.
27bdffd0  afbf0014  0c1002a8  00000000  0c1002a8  afa2001c  8fa4001c
00401825  10820008  0064082a  10200003  00000000  10000002  00832023
00641823  1483fffa  0064082a  0c1002b2  00000000  8fbf0014  27bd0020
03e00008  00001025

EXAMPLE 1.2  GCD program in MIPS assembler
As people began to write larger programs, it quickly became apparent that
a less error-prone notation was required. Assembly languages were invented to
allow operations to be expressed with mnemonic abbreviations. Our GCD program looks like this in MIPS assembly language:
    addiu   sp,sp,-32
    sw      ra,20(sp)
    jal     getint
    nop
    jal     getint
    sw      v0,28(sp)
    lw      a0,28(sp)
    move    v1,v0
    beq     a0,v0,D
    slt     at,v1,a0
A:  beq     at,zero,B
    nop
    b       C
    subu    a0,a0,v1
B:  subu    v1,v1,a0
C:  bne     a0,v1,A
    slt     at,v1,a0
D:  jal     putint
    nop
    lw      ra,20(sp)
    addiu   sp,sp,32
    jr      ra
    move    v0,zero
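A rough high-level rendering of the same subtraction-based algorithm, in C, may help in reading the two listings above. This is a sketch, not an example from the book: it assumes getint and putint correspond to ordinary integer read and print calls, and that the inputs are positive, as the repeated-subtraction form of Euclid's algorithm requires.

```c
#include <stdio.h>

/* Greatest common divisor by repeated subtraction (Euclid's algorithm),
   mirroring the structure of the assembly listing above.
   Assumes the two inputs are positive integers. */
int main(void) {
    int a, b;
    if (scanf("%d%d", &a, &b) != 2) return 1;   /* read the two integers */
    while (a != b) {                            /* loop until the values meet */
        if (a > b) a -= b;
        else       b -= a;
    }
    printf("%d\n", a);                          /* print the GCD */
    return 0;
}
```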
Assembly languages were originally designed with a one-to-one correspondence between mnemonics and machine language instructions, as shown in this
example.1 Translating from mnemonics to machine language became the job
of a systems program known as an assembler. Assemblers were eventually augmented with elaborate “macro expansion” facilities to permit programmers to
define parameterized abbreviations for common sequences of instructions. The
correspondence between assembly language and machine language remained obvious and explicit, however. Programming continued to be a machine-centered
enterprise: each different kind of computer had to be programmed in its own assembly language, and programmers thought in terms of the instructions that the
machine would actually execute.
As computers evolved, and as competing designs developed, it became increasingly frustrating to have to rewrite programs for every new machine. It also
became increasingly difficult for human beings to keep track of the wealth of
detail in large assembly language programs. People began to wish for a machineindependent language, particularly one in which numerical computations (the
most common type of program in those days) could be expressed in something
more closely resembling mathematical formulae. These wishes led in the mid-1950s to the development of the original dialect of Fortran, the first arguably
high-level programming language. Other high-level languages soon followed,
notably Lisp and Algol.
Translating from a high-level language to assembly or machine language is the
job of a systems program known as a compiler. Compilers are substantially more
complicated than assemblers because the one-to-one correspondence between
source and target operations no longer exists when the source is a high-level
language. Fortran was slow to catch on at first, because human programmers,
with some effort, could almost always write assembly language programs that
would run faster than what a compiler could produce. Over time, however, the
performance gap has narrowed and eventually reversed. Increases in hardware
complexity (due to pipelining, multiple functional units, etc.) and continuing
improvements in compiler technology have led to a situation in which a state-of-the-art compiler will usually generate better code than a human being will. Even
in cases in which human beings can do better, increases in computer speed and
program size have made it increasingly important to economize on programmer effort, not only in the original construction of programs, but in subsequent
program maintenance—enhancement and correction. Labor costs now heavily
outweigh the cost of computing hardware.
1 Each of the 23 lines of assembly code in the example is encoded in the corresponding 32 bits of
the machine language. Note for example that the two sw (store word) instructions begin with
the same 11 bits (afa or afb). Those bits encode the operation (sw) and the base register (sp).
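To see the encoding the footnote describes, the fields of the two sw words can be unpacked with a few shifts and masks. This is a sketch, not code from the book; it assumes the standard MIPS I-type layout of a 6-bit opcode, 5-bit rs, 5-bit rt, and 16-bit offset.

```c
#include <stdio.h>
#include <stdint.h>

/* Unpack the I-type fields of the two sw words from Example 1.1.
   Both share opcode 0x2b (sw) and base register rs = 29 ($sp);
   they differ only in rt (the register being stored) and the offset. */
int main(void) {
    uint32_t words[] = { 0xafbf0014u, 0xafa2001cu };  /* sw ra,20(sp); sw v0,28(sp) */
    for (int i = 0; i < 2; i++) {
        uint32_t w = words[i];
        printf("%08x: opcode=0x%02x rs=%u rt=%u offset=%u\n",
               w, w >> 26, (w >> 21) & 0x1fu, (w >> 16) & 0x1fu, w & 0xffffu);
    }
    return 0;
}
```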
1.1 The Art of Language Design
Today there are thousands of high-level programming languages, and new ones
continue to emerge. Human beings use assembly language only for special purpose applications. In a typical undergraduate class, it is not uncommon to find
users of scores of different languages. Why are there so many? There are several
possible answers:
Evolution. Computer science is a young discipline; we’re constantly finding better ways to do things. The late 1960s and early 1970s saw a revolution in “structured programming,” in which the goto-based control flow of languages like Fortran, Cobol, and Basic2 gave way to while loops, case statements, and
similar higher-level constructs. In the late 1980s the nested block structure of
languages like Algol, Pascal, and Ada began to give way to the object-oriented
structure of Smalltalk, C++, Eiffel, and the like.
Special Purposes. Many languages were designed for a specific problem domain.
The various Lisp dialects are good for manipulating symbolic data and complex data structures. Snobol and Icon are good for manipulating character
strings. C is good for low-level systems programming. Prolog is good for reasoning about logical relationships among data. Each of these languages can be
used successfully for a wider range of tasks, but the emphasis is clearly on the
specialty.
Personal Preference. Different people like different things. Much of the parochialism of programming is simply a matter of taste. Some people love the terseness of C; some hate it. Some people find it natural to think recursively; others
prefer iteration. Some people like to work with pointers; others prefer the implicit dereferencing of Lisp, Clu, Java, and ML. The strength and variety of
personal preference make it unlikely that anyone will ever develop a universally acceptable programming language.
Of course, some languages are more successful than others. Of the many that
have been designed, only a few dozen are widely used. What makes a language
successful? Again there are several answers:
Expressive Power. One commonly hears arguments that one language is more
“powerful” than another, though in a formal mathematical sense they are all
Turing equivalent—each can be used, if awkwardly, to implement arbitrary algorithms. Still, language features clearly have a huge impact on the programmer’s ability to write clear, concise, and maintainable code, especially for very large systems. There is no comparison, for example, between early versions of Basic on the one hand and Common Lisp or Ada on the other. The factors that contribute to expressive power—abstraction facilities in particular—are a major focus of this book.

2 The name of each of these languages is sometimes written entirely in uppercase letters and sometimes in mixed case. For consistency’s sake, I adopt the convention in this book of using mixed case for languages whose names are pronounced as words (e.g., Fortran, Cobol, Basic) and uppercase for those pronounced as a series of letters (e.g., APL, PL/I, ML).
Ease of Use for the Novice. While it is easy to pick on Basic, one cannot deny its
success. Part of that success is due to its very low “learning curve.” Logo is popular among elementary-level educators for a similar reason: even a 5-year-old
can learn it. Pascal was taught for many years in introductory programming
language courses because, at least in comparison to other “serious” languages,
it is compact and easy to learn. In recent years Java has come to play a similar
role. Though substantially more complex than Pascal, it is much simpler than,
say, C++.
Ease of Implementation. In addition to its low learning curve, Basic is successful because it could be implemented easily on tiny machines, with limited
resources. Forth has a small but dedicated following for similar reasons. Arguably the single most important factor in the success of Pascal was that its
designer, Niklaus Wirth, developed a simple, portable implementation of the
language, and shipped it free to universities all over the world (see Example 1.12).3 The Java designers have taken similar steps to make their language
available for free to almost anyone who wants it.
Open Source. Most programming languages today have at least one open source
compiler or interpreter, but some languages—C in particular—are much
more closely associated than others with freely distributed, peer reviewed,
community supported computing. C was originally developed in the early
1970s by Dennis Ritchie and Ken Thompson at Bell Labs,4 in conjunction
with the design of the original Unix operating system. Over the years Unix
evolved into the world’s most portable operating system—the OS of choice
for academic computer science—and C was closely associated with it. With
the standardization of C, the language has become available on an enormous
variety of additional platforms. Linux, the leading open source operating system, is written in C. As of March 2005, C and its descendants account for 60%
of the projects hosted at sourceforge.net.
Excellent Compilers. Fortran owes much of its success to extremely good compilers. In part this is a matter of historical accident. Fortran has been around
longer than anything else, and companies have invested huge amounts of time
3 Niklaus Wirth (1934–), Professor Emeritus of Informatics at ETH in Zürich, Switzerland, is
responsible for a long line of influential languages, including Euler, Algol-W, Pascal, Modula,
Modula-2, and Oberon. Among other things, his languages introduced the notions of enumeration, subrange, and set types, and unified the concepts of records (structs) and variants (unions).
He received the annual ACM Turing Award, computing’s highest honor, in 1984.
4 Ken Thompson (1943–) led the team that developed Unix. He also designed the B programming language, a child of BCPL and the parent of C. Dennis Ritchie (1941–) was the principal
force behind the development of C itself. Thompson and Ritchie together formed the core of an
incredibly productive and influential group. They shared the ACM Turing Award in 1983.
1.2 The Programming Language Spectrum
declarative
functional
dataflow
logic, constraint-based
template-based
imperative
von Neumann
scripting
object-oriented
9
Lisp/Scheme, ML, Haskell
Id, Val
Prolog, spreadsheets
XSLT
C, Ada, Fortran, . . .
Perl, Python, PHP, . . .
Smalltalk, Eiffel, C++, Java, . . .
Figure 1.1
Classification of programming languages. Note that the categories are fuzzy and
open to debate. In particular, it is possible for a functional language to be object-oriented, and
many authors do not consider functional programming to be declarative.
It is not yet clear to what extent, and in what problem domains, we can expect
compilers to discover good algorithms for problems stated at a very high level. In
any domain in which the compiler cannot find a good algorithm, the programmer needs to be able to specify one explicitly.
Within the declarative and imperative families, there are several important
subclasses.
Functional languages employ a computational model based on the recursive
definition of functions. They take their inspiration from the lambda calculus,
a formal computational model developed by Alonzo Church in the 1930s. In
essence, a program is considered a function from inputs to outputs, defined
in terms of simpler functions through a process of refinement. Languages in
this category include Lisp, ML, and Haskell.
Dataflow languages model computation as the flow of information (tokens)
among primitive functional nodes. They provide an inherently parallel model:
nodes are triggered by the arrival of input tokens, and can operate concurrently. Id and Val are examples of dataflow languages. Sisal, a descendant of
Val, is more often described as a functional language.
Logic or constraint-based languages take their inspiration from predicate logic.
They model computation as an attempt to find values that satisfy certain specified relationships, using a goal-directed search through a list of logical rules.
Prolog is the best-known logic language. The term can also be applied to the
programmable aspects of spreadsheet systems such as Excel, VisiCalc, or Lotus 1-2-3.
The von Neumann languages are the most familiar and successful. They include Fortran, Ada 83, C, and all of the others in which the basic means of
computation is the modification of variables.6
6 John von Neumann (1903–1957) was a mathematician and computer pioneer who helped to
develop the concept of stored program computing, which underlies most computer hardware. In
a stored program computer, both programs and data are represented as bits in memory, which
the processor repeatedly fetches, interprets, and updates.
Fortran has been around longer than anything else, and companies have invested huge amounts of time and money in making compilers that generate very fast code. It is also a matter
of language design, however: Fortran dialects prior to Fortran 90 lack recursion and pointers, features that greatly complicate the task of generating fast
code (at least for programs that can be written in a reasonable fashion without
them!). In a similar vein, some languages (e.g., Common Lisp) are successful
in part because they have compilers and supporting tools that do an unusually
good job of helping the programmer manage very large projects.
Economics, Patronage, and Inertia. Finally, there are factors other than technical
merit that greatly influence success. The backing of a powerful sponsor is one.
Cobol and PL/I, at least to first approximation, owe their life to IBM. Ada
owes its life to the United States Department of Defense: it contains a wealth
of excellent features and ideas, but the sheer complexity of implementation
would likely have killed it if not for the DoD backing. Similarly, C#, despite its
technical merits, would probably not have received the attention it has without
the backing of Microsoft. At the other end of the life cycle, some languages
remain widely used long after “better” alternatives are available because of a
huge base of installed software and programmer expertise, which would cost
too much to replace.
DESIGN & IMPLEMENTATION
Introduction
Throughout the book, sidebars like this one will highlight the interplay of language design and language implementation. Among other things, we will consider the following.
– Cases (such as those mentioned in this section) in which ease or difficulty of implementation significantly affected the success of a language
– Language features that many designers now believe were mistakes, at least in part because of implementation difficulties
– Potentially useful features omitted from some languages because of concern that they might be too difficult or slow to implement
– Language limitations adopted at least in part out of concern for implementation complexity or cost
– Language features introduced at least in part to facilitate efficient or elegant implementations
– Cases in which a machine architecture makes reasonable features unreasonably expensive
– Various other tradeoffs in which implementation plays a significant role
A complete list of sidebars appears in Appendix B.
Clearly no one factor determines whether a language is “good.” As we study
programming languages, we shall need to consider issues from several points of
view. In particular, we shall need to consider the viewpoints of both the programmer and the language implementor. Sometimes these points of view will be
in harmony, as in the desire for execution speed. Often, however, there will be
conflicts and tradeoffs, as the conceptual appeal of a feature is balanced against
the cost of its implementation. The tradeoff becomes particularly thorny when
the implementation imposes costs not only on programs that use the feature, but
also on programs that do not.
In the early days of computing the implementor’s viewpoint was predominant.
Programming languages evolved as a means of telling a computer what to do. For
programmers, however, a language is more aptly defined as a means of expressing algorithms. Just as natural languages constrain exposition and discourse, so
programming languages constrain what can and cannot be expressed, and have
both profound and subtle influence over what the programmer can think. Donald
Knuth has suggested that programming be regarded as the art of telling another
human being what one wants the computer to do [Knu84].5 This definition perhaps strikes the best sort of compromise. It acknowledges that both conceptual
clarity and implementation efficiency are fundamental concerns. This book attempts to capture this spirit of compromise by simultaneously considering the
conceptual and implementation aspects of each of the topics it covers.
1.2 The Programming Language Spectrum
EXAMPLE 1.3 Classification of programming languages
The many existing languages can be classified into families based on their model
of computation. Figure 1.1 shows a common set of families. The top-level division distinguishes between the declarative languages, in which the focus is on
what the computer is to do, and the imperative languages, in which the focus is
on how the computer should do it.
Declarative languages are in some sense “higher level”; they are more in tune
with the programmer’s point of view, and less with the implementor’s point of
view. Imperative languages predominate, however, mainly for performance reasons. There is a tension in the design of declarative languages between the desire
to get away from “irrelevant” implementation details and the need to remain
close enough to the details to at least control the outline of an algorithm. The design of efficient algorithms, after all, is what much of computer science is about.
5 Donald E. Knuth (1938–), Professor Emeritus at Stanford University and one of the foremost
figures in the design and analysis of algorithms, is also widely known as the inventor of the
TEX typesetting system (with which this book was produced) and of the literate programming
methodology with which TEX was constructed. His multivolume The Art of Computer Programming has an honored place on the shelf of most professional computer scientists. He received the
ACM Turing Award in 1974.
Whereas functional languages are based on expressions that have values, von Neumann languages are based
on statements (assignments in particular) that influence subsequent computation via the side effect of changing the value of memory.
Scripting languages are a subset of the von Neumann languages. They are distinguished by their emphasis on “gluing together” components that were originally developed as independent programs. Several scripting languages were
originally developed for specific purposes: csh and bash , for example, are
the input languages of job control (shell) programs; Awk was intended for
text manipulation; PHP and JavaScript are primarily intended for the generation of web pages with dynamic content (with execution on the server and
the client, respectively). Other languages, including Perl, Python, Ruby, and
Tcl, are more deliberately general purpose. Most place an emphasis on rapid
prototyping, with a bias toward ease of expression over speed of execution.
Object-oriented languages are comparatively recent, though their roots can be
traced to Simula 67. Most are closely related to the von Neumann languages
but have a much more structured and distributed model of both memory and
computation. Rather than picture computation as the operation of a monolithic processor on a monolithic memory, object-oriented languages picture
it as interactions among semi-independent objects, each of which has both its
own internal state and subroutines to manage that state. Smalltalk is the purest
of the object-oriented languages; C++ and Java are the most widely used. It is
also possible to devise object-oriented functional languages (the best known
of these is the CLOS [Kee89] extension to Common Lisp), but they tend to
have a strong imperative flavor.
One might suspect that concurrent languages also form a separate class (and
indeed this book devotes a chapter to the subject), but the distinction between
concurrent and sequential execution is mostly orthogonal to the classifications
above. Most concurrent programs are currently written using special library
packages or compilers in conjunction with a sequential language such as Fortran or C. A few widely used languages, including Java, C#, Ada, and Modula-3,
have explicitly concurrent features. Researchers are investigating concurrency in
each of the language classes mentioned here.
It should be emphasized that the distinctions among language classes are
not clear-cut. The division between the von Neumann and object-oriented languages, for example, is often very fuzzy, and most of the functional and logic languages include some imperative features. The preceding descriptions are meant
to capture the general flavor of the classes, without providing formal definitions.
Imperative languages—von Neumann and object-oriented—receive the bulk
of the attention in this book. Many issues cut across family lines, however, and
the interested reader will discover much that is applicable to alternative computational models in most of the chapters of the book. Chapters 10 through 13
contain additional material on functional, logic, concurrent, and scripting languages.
1.3 Why Study Programming Languages?
Programming languages are central to computer science and to the typical computer science curriculum. Like most car owners, students who have become familiar with one or more high-level languages are generally curious to learn about
other languages, and to know what is going on “under the hood.” Learning about
languages is interesting. It’s also practical.
For one thing, a good understanding of language design and implementation
can help one choose the most appropriate language for any given task. Most
languages are better for some things than for others. No one would be likely
to use APL for symbolic computing or string processing, but other choices are
not nearly so clear-cut. Should one choose C, C++, or Modula-3 for systems
programming? Fortran or Ada for scientific computations? Ada or Modula-2 for
embedded systems? Visual Basic or Java for a graphical user interface? This book
should help equip you to make such decisions.
Similarly, this book should make it easier to learn new languages. Many languages are closely related. Java and C# are easier to learn if you already know C++.
Common Lisp is easier to learn if you already know Scheme. More important,
there are basic concepts that underlie all programming languages. Most of these
concepts are the subject of chapters in this book: types, control (iteration, selection, recursion, nondeterminacy, concurrency), abstraction, and naming. Thinking in terms of these concepts makes it easier to assimilate the syntax (form)
and semantics (meaning) of new languages, compared to picking them up in
a vacuum. The situation is analogous to what happens in natural languages: a
good knowledge of grammatical forms makes it easier to learn a foreign language.
Whatever language you learn, understanding the decisions that went into its
design and implementation will help you use it better. This book should help you
Understand obscure features. The typical C++ programmer rarely uses unions,
multiple inheritance, variable numbers of arguments, or the .* operator. (If
you don’t know what these are, don’t worry!) Just as it simplifies the assimilation of new languages, an understanding of basic concepts makes it easier to understand these features when you look up the details in the manual.
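For readers who have not yet met these features, the following C sketch (an illustration of mine, not an excerpt from any particular program) shows two of them in the form C shares with C++: a union, which overlays its members in the same storage, and a function that accepts a variable number of arguments. Multiple inheritance and the .* operator have no C counterpart.

#include <stdarg.h>
#include <stdio.h>

/* A union overlays its members in the same storage. */
union number {
    int    i;
    double d;
};

/* A variadic function accepts a variable number of arguments. */
int sum(int count, ...) {
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int k = 0; k < count; k++)
        total += va_arg(args, int);
    va_end(args);
    return total;
}

int main(void) {
    union number n;
    n.i = 42;                              /* n currently holds an int */
    printf("%d %d\n", n.i, sum(3, 1, 2, 3));   /* prints 42 6 */
    return 0;
}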
Choose among alternative ways to express things, based on a knowledge of implementation costs. In C++, for example, programmers may need to avoid unnecessary temporary variables, and use copy constructors whenever possible,
to minimize the cost of initialization. In Java they may wish to use Executor
objects rather than explicit thread creation. With certain (poor) compilers,
they may need to adopt special programming idioms to get the fastest code:
pointers for array traversal in C; with statements to factor out common address calculations in Pascal or Modula-3; x*x instead of x**2 in Basic. In any
language, they need to be able to evaluate the tradeoffs among alternative implementations of abstractions—for example between computation and table
lookup for functions like bit set cardinality, which can be implemented either
way.
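As a concrete instance of that tradeoff, consider counting the bits set in a byte. The C sketch below (mine, for illustration only) shows both strategies; which one wins depends on how often the function is called and on the relative cost of arithmetic and memory references on the target machine.

#include <stdio.h>

/* Version 1: compute the population count on every call. */
int popcount_compute(unsigned char b) {
    int n = 0;
    while (b) {
        n += b & 1;
        b >>= 1;
    }
    return n;
}

/* Version 2: precompute a 256-entry table once, then answer each
   query with a single memory reference. */
static int table[256];

void popcount_init(void) {
    for (int i = 0; i < 256; i++)
        table[i] = popcount_compute((unsigned char) i);
}

int popcount_lookup(unsigned char b) {
    return table[b];
}

int main(void) {
    popcount_init();
    printf("%d %d\n", popcount_compute(0xF0), popcount_lookup(0xF0));  /* 4 4 */
    return 0;
}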
Make good use of debuggers, assemblers, linkers, and related tools. In general, the
high-level language programmer should not need to bother with implementation details. There are times, however, when an understanding of those details
proves extremely useful. The tenacious bug or unusual system-building problem is sometimes a lot easier to handle if one is willing to peek at the bits.
Simulate useful features in languages that lack them. Certain very useful features
are missing in older languages but can be emulated by following a deliberate
(if unenforced) programming style. In older dialects of Fortran, for example, programmers familiar with modern control constructs can use comments
and self-discipline to write well-structured code. Similarly, in languages with
poor abstraction facilities, comments and naming conventions can help imitate modular structure, and the extremely useful iterators of Clu, Icon, and C#
(which we will study in Section 6.5.3) can be imitated with subroutines and
static variables. In Fortran 77 and other languages that lack recursion, an iterative program can be derived via mechanical hand transformations, starting
with recursive pseudocode. In languages without named constants or enumeration types, variables that are initialized once and never changed thereafter can
make code much more readable and easy to maintain.
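The following C sketch (an imitation of mine, far less general than a true Clu, Icon, or C# iterator) shows the flavor of the technique: a subroutine with static variables that hands back one element of a sequence per call.

#include <stdio.h>

/* Returns successive values from low to high, then -1 to signal
   exhaustion.  The static variables preserve state between calls. */
int next_in_range(int low, int high) {
    static int current;
    static int active = 0;
    if (!active) {             /* first call of a new traversal */
        current = low;
        active = 1;
    }
    if (current > high) {      /* exhausted: reset for the next traversal */
        active = 0;
        return -1;
    }
    return current++;
}

int main(void) {
    int i;
    while ((i = next_in_range(1, 5)) != -1)
        printf("%d\n", i);     /* prints 1 2 3 4 5 */
    return 0;
}

The imitation is fragile: only one such loop can be active at a time, which is precisely the sort of limitation that real iterators remove.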
Make better use of language technology wherever it appears. Most programmers
will never design or implement a conventional programming language, but
most will need language technology for other programming tasks. The typical
personal computer contains files in dozens of structured formats, encompassing web content, word processing, spreadsheets, presentations, raster and vector graphics, music, video, databases, and a wide variety of other application
domains. Each of these structured formats has formal syntax and semantics,
which tools must understand. Code to parse, analyze, generate, optimize, and
otherwise manipulate structured data can thus be found in almost any sophisticated program, and all of this code is based on language technology. Programmers with a strong grasp of this technology will be in a better position to
write well-structured, maintainable tools.
In a similar vein, most tools themselves can be customized, via start-up
configuration files, command-line arguments, input commands, or built-in
extension languages (considered in more detail in Chapter 13). My home directory holds more than 250 separate configuration (“preference”) files. My
personal configuration files for the emacs text editor comprise more than
1200 lines of Lisp code. The user of almost any sophisticated program today
will need to make good use of configuration or extension languages. The designers of such a program will need either to adopt (and adapt) some existing
extension language, or to invent new notation of their own. Programmers with
a strong grasp of language theory will be in a better position to design elegant,
well-structured notation that meets the needs of current users and facilitates
future development.
Finally, this book should help prepare you for further study in language design or implementation, should you be so inclined. It will also equip you to understand the interactions of languages with operating systems and architectures,
should those areas draw your interest.
CHECK YOUR UNDERSTANDING
1. What is the difference between machine language and assembly language?
2. In what way(s) are high-level languages an improvement on assembly language? In what circumstances does it still make sense to program in assembler?
3. Why are there so many programming languages?
4. What makes a programming language successful?
5. Name three languages in each of the following categories: von Neumann,
functional, object-oriented. Name two logic languages. Name two widely
used concurrent languages.
6. What distinguishes declarative languages from imperative languages?
7. What organization spearheaded the development of Ada?
8. What is generally considered the first high-level programming language?
9. What was the first functional language?
1.4 Compilation and Interpretation
EXAMPLE 1.4 Pure compilation
At the highest level of abstraction, the compilation and execution of a program
in a high-level language look something like this:
EXAMPLE 1.5 Pure interpretation
EXAMPLE 1.6 Mixing compilation and interpretation
The compiler translates the high-level source program into an equivalent target
program (typically in machine language) and then goes away. At some arbitrary
later time, the user tells the operating system to run the target program. The compiler is the locus of control during compilation; the target program is the locus of
control during its own execution. The compiler is itself a machine language program, presumably created by compiling some other high-level program. When
written to a file in a format understood by the operating system, machine language is commonly known as object code.
An alternative style of implementation for high-level languages is known as
interpretation.
Unlike a compiler, an interpreter stays around for the execution of the application. In fact, the interpreter is the locus of control during that execution. In
effect, the interpreter implements a virtual machine whose “machine language”
is the high-level programming language. The interpreter reads statements in that
language more or less one at a time, executing them as it goes along.
In general, interpretation leads to greater flexibility and better diagnostics (error messages) than does compilation. Because the source code is being executed
directly, the interpreter can include an excellent source-level debugger. It can also
cope with languages in which fundamental characteristics of the program, such
as the sizes and types of variables, or even which names refer to which variables,
can depend on the input data. Some language features are almost impossible to
implement without interpretation: in Lisp and Prolog, for example, a program
can write new pieces of itself and execute them on the fly. (Several scripting languages, including Perl, Tcl, Python, and Ruby, also provide this capability.) Delaying decisions about program implementation until run time is known as late
binding; we will discuss it at greater length in Section 3.1.
Compilation, by contrast, generally leads to better performance. In general, a
decision made at compile time is a decision that does not need to be made at run
time. For example, if the compiler can guarantee that variable x will always lie at
location 49378 , it can generate machine language instructions that access this location whenever the source program refers to x . By contrast, an interpreter may
need to look x up in a table every time it is accessed, in order to find its location.
Since the (final version of a) program is compiled only once, but generally executed many times, the savings can be substantial, particularly if the interpreter is
doing unnecessary work in every iteration of a loop.
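To make that cost concrete, here is a small C sketch (simplified, and of my own devising) of the table search an interpreter may perform on every reference to a variable; compiled code would instead use a location fixed at compile time.

#include <stdio.h>
#include <string.h>

struct binding { const char *name; int value; };

static struct binding env[] = { { "x", 10 }, { "y", 20 } };
static const int env_size = 2;

/* What an interpreter may do for every occurrence of a variable. */
int lookup(const char *name) {
    for (int i = 0; i < env_size; i++)
        if (strcmp(env[i].name, name) == 0)
            return env[i].value;
    return 0;   /* undefined variable; a real interpreter would report an error */
}

int main(void) {
    /* Interpreting "x + y" requires two searches of the table: */
    printf("%d\n", lookup("x") + lookup("y"));   /* 30 */
    /* A compiler would instead have fixed the locations of x and y at
       compile time, so the generated code does no searching at all. */
    return 0;
}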
While the conceptual difference between compilation and interpretation is
clear, most language implementations include a mixture of both. They typically
look like this:
We generally say that a language is “interpreted” when the initial translator is
simple. If the translator is complicated, we say that the language is “compiled.”
The distinction can be confusing because “simple” and “complicated” are subjective terms, and because it is possible for a compiler (complicated translator)
to produce code that is then executed by a complicated virtual machine (interpreter); this is in fact precisely what happens by default in Java. We still say
that a language is compiled if the translator analyzes it thoroughly (rather than
effecting some “mechanical” transformation) and if the intermediate program
does not bear a strong resemblance to the source. These two characteristics—
thorough analysis and nontrivial transformation—are the hallmarks of compilation.
In practice one sees a broad spectrum of implementation strategies. For example:
EXAMPLE 1.7 Preprocessing
Most interpreted languages employ an initial translator (a preprocessor) that
removes comments and white space, and groups characters together into tokens, such as keywords, identifiers, numbers, and symbols. The translator may
also expand abbreviations in the style of a macro assembler. Finally, it may
identify higher-level syntactic structures, such as loops and subroutines. The
goal is to produce an intermediate form that mirrors the structure of the
source but can be interpreted more efficiently.
DESIGN & IMPLEMENTATION
Compiled and interpreted languages
Certain languages (APL and Smalltalk, for example) are sometimes referred
to as “interpreted languages” because most of their semantic error checking
must be performed at run time. Certain other languages (Fortran and C, for
example) are sometimes referred to as “compiled languages” because almost
all of their semantic error checking can be performed statically. This terminology isn’t strictly correct: interpreters for C and Fortran can be built easily,
and a compiler can generate code to perform even the most extensive dynamic
semantic checks. That said, language design has a profound effect on “compilability.”
EXAMPLE 1.8 Library routines and linking
EXAMPLE 1.9 Post-compilation assembly
In some very early implementations of Basic, the manual actually suggested
removing comments from a program in order to improve its performance.
These implementations were pure interpreters; they would reread (and then
ignore) the comments every time they executed a given part of the program.
They had no initial translator.
The typical Fortran implementation comes close to pure compilation. The
compiler translates Fortran source into machine language. Usually, however,
it counts on the existence of a library of subroutines that are not part of the
original program. Examples include mathematical functions ( sin , cos , log ,
etc.) and I/O. The compiler relies on a separate program, known as a linker, to
merge the appropriate library routines into the final program:
In some sense, one may think of the library routines as extensions to the hardware instruction set. The compiler can then be thought of as generating code
for a virtual machine that includes the capabilities of both the hardware and
the library.
In a more literal sense, one can find interpretation in the Fortran routines
for formatted output. Fortran permits the use of format statements that control the alignment of output in columns, the number of significant digits and
type of scientific notation for floating-point numbers, inclusion/suppression
of leading zeros, and so on. Programs can compute their own formats on the
fly. The output library routines include a format interpreter. A similar interpreter can be found in the printf routine of C and its descendants.
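As a hedged illustration (my own C, not the Fortran library's actual code), here is a toy format interpreter in the spirit of printf: it walks the format string at run time and decides, character by character, how to render each argument.

#include <stdarg.h>
#include <stdio.h>

/* Interprets a tiny format language: %d prints an int, %c a character,
   and anything else is copied to the output verbatim. */
void tiny_printf(const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    for (const char *p = fmt; *p; p++) {
        if (*p == '%' && *(p + 1) == 'd') {
            printf("%d", va_arg(args, int));
            p++;
        } else if (*p == '%' && *(p + 1) == 'c') {
            putchar(va_arg(args, int));   /* a char argument is promoted to int */
            p++;
        } else {
            putchar(*p);
        }
    }
    va_end(args);
}

int main(void) {
    tiny_printf("gcd(%d, %d) = %d%c", 36, 24, 12, '\n');
    return 0;
}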
Many compilers generate assembly language instead of machine language.
This convention facilitates debugging, since assembly language is easier for
people to read, and isolates the compiler from changes in the format of machine language files that may be mandated by new releases of the operating
system (only the assembler must be changed, and it is shared by many compilers).
EXAMPLE 1.10 The C preprocessor
Compilers for C (and for many other languages running under Unix) begin
with a preprocessor that removes comments and expands macros. The preprocessor can also be instructed to delete portions of the code itself, providing
a conditional compilation facility that allows several versions of a program to
be built from the same source.
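As a simple illustration of conditional compilation (a standard C idiom rather than anything specific to this chapter's examples), the same source file can yield either a tracing version or a silent version of a program, depending on a preprocessor symbol:

#include <stdio.h>

/* Compile with "cc -DDEBUG prog.c" to get the tracing version,
   or plain "cc prog.c" to omit the tracing code entirely. */
#ifdef DEBUG
#define TRACE(msg) fprintf(stderr, "trace: %s\n", msg)
#else
#define TRACE(msg) /* nothing */
#endif

int main(void) {
    TRACE("starting up");
    printf("hello\n");
    return 0;
}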
EXAMPLE 1.11 Source-to-source translation (C++)
C++ implementations based on the early AT&T compiler actually generated
an intermediate program in C, instead of in assembly language. This C++
compiler was indeed a true compiler: it performed a complete analysis of the
syntax and semantics of the C++ source program, and with very few exceptions generated all of the error messages that a programmer would see prior
to running the program. In fact, programmers were generally unaware that
the C compiler was being used behind the scenes. The C++ compiler did
not invoke the C compiler unless it had generated C code that would pass
through the second round of compilation without producing any error messages.
EXAMPLE 1.12 Bootstrapping
Occasionally one would hear the C++ compiler referred to as a preprocessor,
presumably because it generated high-level output that was in turn compiled.
I consider this a misuse of the term: compilers attempt to “understand” their
source; preprocessors do not. Preprocessors perform transformations based
on simple pattern matching, and may well produce output that will generate
error messages when run through a subsequent stage of translation.
Many early Pascal compilers were built around a set of tools distributed by
Niklaus Wirth. These included the following.
– A Pascal compiler, written in Pascal, that would generate output in P-code,
a simple stack-based language
– The same compiler, already translated into P-code
– A P-code interpreter, written in Pascal
To get Pascal up and running on a local machine, the user of the tool set
needed only to translate the P-code interpreter (by hand) into some locally
available language. This translation was not a difficult task; the interpreter
was small. By running the P-code version of the compiler on top of the P-code
interpreter, one could then compile arbitrary Pascal programs into P-code,
which could in turn be run on the interpreter. To get a faster implementation,
one could modify the Pascal version of the Pascal compiler to generate a locally available variety of assembly or machine language, instead of generating
P-code (a somewhat more difficult task). This compiler could then be “run
through itself ” in a process known as bootstrapping, a term derived from the
intentionally ridiculous notion of lifting oneself off the ground by pulling on
one’s bootstraps.
At this point, the P-code interpreter and the P-code version of the Pascal compiler could simply be thrown away. More often, however, programmers would
choose to keep these tools around. The P-code version of a program tends
to be significantly smaller than its machine language counterpart. On a circa
1970 machine, the savings in memory and disk requirements could really be
important. Moreover, as noted near the beginning of this section, an interpreter will often provide better run-time diagnostics than will the output of
a compiler. Finally, an interpreter allows a program to be rerun immediately
after modification, without waiting for recompilation—a feature that can be
particularly valuable during program development. Some of the best programming environments for imperative languages include both a compiler
and an interpreter.
DESIGN & IMPLEMENTATION
The early success of Pascal
The P-code based implementation of Pascal is largely responsible for the language’s remarkable success in academic circles in the 1970s. No single hardware platform or operating system of that era dominated the computer landscape the way the x86, Linux, and Windows do today.7 Wirth’s toolkit made
it possible to get an implementation of Pascal up and running on almost any
platform in a week or so. It was one of the first great successes in system portability.
7 Throughout this book we will use the term “x86” to refer to the instruction set architecture of the
Intel 8086 and its descendants, including the various Pentium processors. Intel calls this architecture the IA-32, but x86 is a more generic term that encompasses the offerings of competitors
such as AMD as well.
EXAMPLE 1.13 Compiling interpreted languages
EXAMPLE 1.14 Dynamic and just-in-time compilation
EXAMPLE 1.15 Microcode (firmware)
One will sometimes find compilers for languages (e.g., Lisp, Prolog, Smalltalk,
etc.) that permit a lot of late binding and are traditionally interpreted. These
compilers must be prepared, in the general case, to generate code that performs much of the work of an interpreter, or that makes calls into a library
that does that work instead. In important special cases, however, the compiler
can generate code that makes reasonable assumptions about decisions that
won’t be finalized until run time. If these assumptions prove to be valid the
code will run very fast. If the assumptions are not correct, a dynamic check
will discover the inconsistency, and revert to the interpreter.
In some cases a programming system may deliberately delay compilation until
the last possible moment. One example occurs in implementations of Lisp or
Prolog that invoke the compiler on the fly, to translate newly created source
into machine language, or to optimize the code for a particular input set. Another example occurs in implementations of Java. The Java language definition defines a machine-independent intermediate form known as byte code.
Byte code is the standard format for distribution of Java programs; it allows
programs to be transferred easily over the Internet and then run on any platform. The first Java implementations were based on byte-code interpreters,
but more recent (faster) implementations employ a just-in-time compiler that
translates byte code into machine language immediately before each execution
of the program. C#, similarly, is intended for just-in-time translation. The
main C# compiler produces .NET Common Intermediate Language (CIL),
which is then translated into machine code immediately prior to execution.
CIL is deliberately language independent, so it can be used for code produced
by a variety of front-end compilers.
On some machines (particularly those designed before the mid-1980s), the
assembly-level instruction set is not actually implemented in hardware but in
fact runs on an interpreter. The interpreter is written in low-level instructions
called microcode (or firmware), which is stored in read-only memory and executed by the hardware. Microcode and microprogramming are considered
further in Section 5.4.1.
As some of these examples make clear, a compiler does not necessarily translate from a high-level language into machine language. It is not uncommon
for compilers, especially prototypes, to generate C as output. A little farther
afield, text formatters like TEX and troff are actually compilers, translating high-level document descriptions into commands for a laser printer or phototypesetter. (Many laser printers themselves incorporate interpreters for the Postscript
page-description language.) Query language processors for database systems are
also compilers, translating languages like SQL into primitive operations on files.
There are even compilers that translate logic-level circuit specifications into photographic masks for computer chips. Though the focus in this book is on imperative programming languages, the term “compilation” applies whenever we
translate automatically from one nontrivial language to another, with full analysis of the meaning of the input.
1.5 Programming Environments
Compilers and interpreters do not exist in isolation. Programmers are assisted
in their work by a host of other tools. Assemblers, debuggers, preprocessors, and
linkers were mentioned earlier. Editors are familiar to every programmer. They
may be assisted by cross-referencing facilities that allow the programmer to find
the point at which an object is defined, given a point at which it is used. Pretty
printers help enforce formatting conventions. Style checkers enforce syntactic or
semantic conventions that may be tighter than those enforced by the compiler
(see Exploration 1.11). Configuration management tools help keep track of dependences among the (many versions of) separately compiled modules in a large
software system. Perusal tools exist not only for text but also for intermediate
languages that may be stored in binary. Profilers and other performance analysis
tools often work in conjunction with debuggers to help identify the pieces of a
program that consume the bulk of its computation time.
In older programming environments, tools may be executed individually, at
the explicit request of the user. If a running program terminates abnormally with
a “bus error” (invalid address) message, for example, the user may choose to
invoke a debugger to examine the “core” file dumped by the operating system.
He or she may then attempt to identify the program bug by setting breakpoints,
enabling tracing, and so on, and running the program again under the control of
the debugger. Once the bug is found, the user will invoke the editor to make
an appropriate change. He or she will then recompile the modified program,
possibly with the help of a configuration manager.
More recent programming environments provide much more integrated
tools. When an invalid address error occurs in an integrated environment, a new
window is likely to appear on the user’s screen, with the line of source code at
which the error occurred highlighted. Breakpoints and tracing can then be set in
this window without explicitly invoking a debugger. Changes to the source can
be made without explicitly invoking an editor. The editor may also incorporate
knowledge of the language syntax, providing templates for all the standard control structures, and checking syntax as it is typed in. If the user asks to rerun
the program after making changes, a new version may be built without explicitly
invoking the compiler or configuration manager.
DESIGN & IMPLEMENTATION
Powerful development environments
Sophisticated development environments can be a two-edged sword. The
quality of the Common Lisp environment has arguably contributed to its
widespread acceptance. On the other hand, the particularity of the graphical
environment for Smalltalk (with its insistence on specific fonts, window styles,
etc.) has made it difficult to port the language to systems accessed through a
textual interface, or to graphical systems with a different “look and feel.”
Integrated environments have been developed for a variety of languages and
systems. They are fundamental to Smalltalk—it is nearly impossible to separate
the language from its graphical environment—and are widely used with Common Lisp. They are common on personal computers; examples include the Visual Studio environment from Microsoft and the Project Builder environment
from Apple. Several similar commercial and open source environments are available for Unix, and much of the appearance of integration can be achieved within
sophisticated editors such as emacs.
CHECK YOUR UNDERSTANDING
10. Explain the distinction between interpretation and compilation. What are the
comparative advantages and disadvantages of the two approaches?
11. Is Java compiled or interpreted (or both)? How do you know?
12. What is the difference between a compiler and a preprocessor?
13. What was the intermediate form employed by the original AT&T C++ compiler?
14. What is P-code?
15. What is bootstrapping?
16. What is a just-in-time compiler?
17. Name two languages in which a program can write new pieces of itself “on-the-fly.”
18. Briefly describe three “unconventional” compilers—compilers whose purpose is not to prepare a high-level program for execution on a microprocessor.
19. Describe six kinds of tools that commonly support the work of a compiler
within a larger programming environment.
1.6 An Overview of Compilation
EXAMPLE 1.16 Phases of compilation
Compilers are among the most well-studied types of computer programs. In a
typical compiler, compilation proceeds through a series of well-defined phases,
shown in Figure 1.2. Each phase discovers information of use to later phases,
or transforms the program into a form that is more useful to the subsequent
phase.
The first few phases (up through semantic analysis) serve to figure out the
meaning of the source program. They are sometimes called the front end of the
compiler. The last few phases serve to construct an equivalent target program.
They are sometimes called the back end of the compiler. Many compiler phases can be created automatically from a formal description of the source and/or target languages.
Figure 1.2 Phases of compilation. Phases are listed on the right and the forms in which information is passed between phases are listed on the left. The symbol table serves throughout compilation as a repository for information about identifiers.
One will sometimes hear compilation described as a series of passes. A pass
is a phase or set of phases that is serialized with respect to the rest of compilation: it does not start until previous phases have completed, and it finishes before
any subsequent phases start. If desired, a pass may be written as a separate program, reading its input from a file and writing its output to a file. Compilers are
commonly divided into passes so that the front end may be shared by compilers
for more than one machine (target language), and so that the back end may be
shared by compilers for more than one source language. Prior to the dramatic increases in memory sizes of the mid- to late 1980s, compilers were also sometimes
divided into passes to minimize memory usage: as each pass completed, the next
could reuse its code space.
1.6.1 Lexical and Syntax Analysis
EXAMPLE 1.17 GCD program in Pascal
Consider the greatest common divisor (GCD) program introduced at the beginning of this chapter. Written in Pascal, the program might look like this:8
8 We use Pascal for this example because its lexical and syntactic structure is significantly simpler
than that of most modern imperative languages.
program gcd(input, output);
var i, j : integer;
begin
read(i, j);
while i <> j do
if i > j then i := i - j
else j := j - i;
writeln(i)
end.
EXAMPLE 1.18 GCD program tokens
Scanning and parsing serve to recognize the structure of the program, without
regard to its meaning. The scanner reads characters (‘p’, ‘r’, ‘o’, ‘g’, ‘r’, ‘a’, ‘m’, ‘ ’,
‘g’, ‘c’, ‘d’, etc.) and groups them into tokens, which are the smallest meaningful
units of the program. In our example, the tokens are
program   gcd   (   input   ,   output   )   ;   var   i   ,   j   :   integer   ;
begin   read   (   i   ,   j   )   ;   while   i   <>   j   do
if   i   >   j   then   i   :=   i   -   j   else   j   :=   j   -   i   ;
writeln   (   i   )   end   .
EXAMPLE 1.19 Context-free grammar and parsing
Scanning is also known as lexical analysis. The principal purpose of the scanner is to simplify the task of the parser by reducing the size of the input (there
are many more characters than tokens) and by removing extraneous characters.
The scanner also typically removes comments, produces a listing if desired, and
tags tokens with line and column numbers to make it easier to generate good diagnostics in later phases. One could design a parser to take characters instead of
tokens as input—dispensing with the scanner—but the result would be awkward
and slow.
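By way of illustration, here is a greatly simplified scanner sketch in C (mine, not the pedagogical compiler described later in this section): it skips white space and groups characters into identifiers, unsigned integers, and one- or two-character symbols such as := and <>. A production scanner would also classify keywords, handle comments, guard against overlong tokens, and record line and column numbers.

#include <ctype.h>
#include <stdio.h>

/* Reads one token from standard input into buf; returns 0 at end of file.
   (No bounds checking on buf, for brevity.) */
int next_token(char *buf) {
    int c = getchar();
    while (isspace(c)) c = getchar();          /* skip white space */
    if (c == EOF) return 0;
    int n = 0;
    if (isalpha(c)) {                          /* identifier or keyword */
        while (isalnum(c)) { buf[n++] = c; c = getchar(); }
        ungetc(c, stdin);
    } else if (isdigit(c)) {                   /* unsigned integer literal */
        while (isdigit(c)) { buf[n++] = c; c = getchar(); }
        ungetc(c, stdin);
    } else {                                   /* symbol, possibly two characters */
        buf[n++] = c;
        int d = getchar();
        if ((c == ':' && d == '=') || (c == '<' && d == '>'))
            buf[n++] = d;
        else
            ungetc(d, stdin);
    }
    buf[n] = '\0';
    return 1;
}

int main(void) {
    char tok[128];
    while (next_token(tok))
        printf("token: %s\n", tok);
    return 0;
}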
Parsing organizes tokens into a parse tree that represents higher-level constructs in terms of their constituents. The ways in which these constituents combine are defined by a set of potentially recursive rules known as a context-free
grammar. For example, we know that a Pascal program consists of the keyword
program , followed by an identifier (the program name), a parenthesized list of
files, a semicolon, a series of definitions, and the main begin . . . end block, terminated by a period:
program −→ PROGRAM id ( id more ids ) ; block .
where
block −→ labels constants types variables subroutines BEGIN stmt more stmts END
and
more ids −→ , id more ids
or
more ids −→ ε
Here ε represents the empty string; it indicates that more ids can simply be deleted. Many more grammar rules are needed, of course, to explain the full structure of a program.
EXAMPLE 1.20 GCD program parse tree
A context-free grammar is said to define the syntax of the language; parsing is
therefore known as syntactic analysis. There are many possible grammars for Pascal (an infinite number, in fact); the fragment shown above is based loosely on the
“circles-and-arrows” syntax diagrams found in the original Pascal text [JW91]. A
full parse tree for our GCD program (based on a full grammar not shown here)
appears in Figure 1.3. Much of the complexity of this figure stems from (1) the
use of such artificial “constructs” as more stmts and more exprs to represent lists
of arbitrary length and (2) the use of the equally artificial nonterminals term, factor, and so on,
to capture precedence and associativity in arithmetic expressions. Grammars and
parse trees will be covered in more detail in Chapter 2.
In the process of scanning and parsing, the compiler checks to see that all of the
program’s tokens are well formed and that the sequence of tokens conforms to the
syntax defined by the context-free grammar. Any malformed tokens (e.g., 123abc
in Pascal) should cause the scanner to produce an error message. Any
syntactically invalid token sequence (e.g., A := B C D in Pascal) should lead to
an error message from the parser.
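To make the productions a little more concrete, the following C sketch (an illustration only; the token array and helper routines are my own, not part of any real compiler) shows how a recursive-descent parser might handle the id more ids portion of the program production, with the empty alternative implemented simply by returning:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy token stream for the fragment "input , output )". */
static const char *tokens[] = { "input", ",", "output", ")", NULL };
static int pos = 0;

static const char *peek(void) { return tokens[pos]; }

static void match(const char *expected) {
    if (peek() && strcmp(peek(), expected) == 0) {
        pos++;
    } else {
        fprintf(stderr, "syntax error: expected %s\n", expected);
        exit(1);
    }
}

static void parse_id(void) {
    if (peek() == NULL) {
        fprintf(stderr, "syntax error: identifier expected\n");
        exit(1);
    }
    /* a real parser would check the token class; here any string will do */
    printf("identifier: %s\n", peek());
    pos++;
}

/* more ids −→ , id more ids  |  ε */
static void parse_more_ids(void) {
    if (peek() && strcmp(peek(), ",") == 0) {
        match(",");
        parse_id();
        parse_more_ids();
    }
    /* otherwise the empty alternative: consume nothing */
}

int main(void) {
    parse_id();                                 /* the id before more ids */
    parse_more_ids();
    printf("remaining token: %s\n", peek());    /* the ")" that follows   */
    return 0;
}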
1.6.2 Semantic Analysis and Intermediate Code Generation
Semantic analysis is the discovery of meaning in a program. The semantic analysis phase of compilation recognizes when multiple occurrences of the same
identifier are meant to refer to the same program entity, and ensures that the
uses are consistent. In most languages the semantic analyzer tracks the types of
both identifiers and expressions, both to verify consistent usage and to guide the
generation of code in later phases.
To assist in its work, the semantic analyzer typically builds and maintains a
symbol table data structure that maps each identifier to the information known
about it. Among other things, this information includes the identifier’s type, internal structure (if any), and scope (the portion of the program in which it is
valid).
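One plausible shape for such a structure, sketched here in C (this is illustrative only; real symbol tables are usually hash tables with scope stacks and far richer attribute records):

#include <stdio.h>
#include <string.h>

enum type_kind { TYPE_INTEGER, TYPE_REAL, TYPE_BOOLEAN };

struct symbol {
    char            name[32];     /* the identifier itself             */
    enum type_kind  type;         /* its type                          */
    int             scope_level;  /* where in the program it is valid  */
    int             offset;       /* location assigned by the back end */
};

static struct symbol table[1024];
static int table_size = 0;

/* Record a new identifier. */
void declare(const char *name, enum type_kind type, int scope_level) {
    struct symbol *s = &table[table_size++];
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->type = type;
    s->scope_level = scope_level;
    s->offset = 0;
}

/* Find an identifier, or return NULL if it was never declared. */
struct symbol *lookup(const char *name) {
    for (int i = table_size - 1; i >= 0; i--)    /* innermost scope first */
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

int main(void) {
    declare("i", TYPE_INTEGER, 0);
    declare("j", TYPE_INTEGER, 0);
    printf("%s\n", lookup("k") ? "k declared" : "k undeclared");   /* undeclared */
    return 0;
}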
Using the symbol table, the semantic analyzer enforces a large variety of rules
that are not captured by the hierarchical structure of the context-free grammar
and the parse tree. For example, it checks to make sure that
Every identifier is declared before it is used.
No identifier is used in an inappropriate context (calling an integer as a subroutine, adding a string to an integer, referencing a field of the wrong type of
record, etc.).
Subroutine calls provide the correct number and types of arguments.
Figure 1.3 Parse tree for the GCD program. The symbol ε represents the empty string. The remarkable level of complexity
in this figure is an artifact of having to fit the (much simpler) source code into the hierarchical structure of a context-free
grammar.
Labels on the arms of a case statement are distinct constants.
Every function contains at least one statement that specifies a return value.
In many compilers, the work of the semantic analyzer takes the form of semantic action routines, invoked by the parser when it realizes that it has reached a
particular point within a production.
Of course, not all semantic rules can be checked at compile time. Those that
can are referred to as the static semantics of the language. Those that must be
checked at run time are referred to as the dynamic semantics of the language.
Examples of rules that must often be checked at run time include
EXAMPLE 1.21 GCD program abstract syntax tree
Variables are never used in an expression unless they have been given a value.9
Pointers are never dereferenced unless they refer to a valid object.
Array subscript expressions lie within the bounds of the array.
Arithmetic operations do not overflow.
When it cannot enforce rules statically, a compiler will often produce code
to perform appropriate checks at run time, aborting the program or generating an exception if one of the checks then fails. (Exceptions will be discussed in
Section 8.5.) Some rules, unfortunately, may be unacceptably expensive or impossible to enforce, and the language implementation may simply fail to check
them. In Ada, a program that breaks such a rule is said to be erroneous; in C its
behavior is said to be undefined.
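As an illustration (in C, and simplified by me), here is the kind of code a compiler might emit for an array assignment whose subscript it cannot validate statically; the test is exactly the one the programmer would otherwise have had to write by hand.

#include <stdio.h>
#include <stdlib.h>

#define N 10
static int a[N];

/* What "a[i] := 0" might become when the compiler cannot prove at
   compile time that i lies within bounds. */
void checked_store(int i, int value) {
    if (i < 0 || i >= N) {
        fprintf(stderr, "array subscript out of bounds\n");
        abort();               /* a language with exceptions would raise one instead */
    }
    a[i] = value;
}

int main(void) {
    checked_store(3, 0);       /* fine                              */
    checked_store(42, 0);      /* aborts with a run-time error      */
    return 0;
}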
A parse tree is sometimes known as a concrete syntax tree, because it demonstrates, completely and concretely, how a particular sequence of tokens can be
derived under the rules of the context-free grammar. Once we know that a token
sequence is valid, however, much of the information in the parse tree is irrelevant to further phases of compilation. In the process of checking static semantic
rules, the semantic analyzer typically transforms the parse tree into an abstract
syntax tree (otherwise known as an AST, or simply a syntax tree) by removing
most of the “artificial” nodes in the tree’s interior. The semantic analyzer also
annotates the remaining nodes with useful information, such as pointers from
identifiers to their symbol table entries. The annotations attached to a particular
node are known as its attributes. A syntax tree for our GCD program is shown in
Figure 1.4.
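A plausible C rendering of an annotated syntax tree node (a sketch of mine, not the representation used by any particular compiler) might look like this, with the semantic analyzer filling in the attribute fields:

#include <stdio.h>
#include <stdlib.h>

struct symbol;                        /* symbol table entry, as before */

enum node_kind { NODE_IDENT, NODE_CONST, NODE_ASSIGN, NODE_OP, NODE_CALL };

struct ast_node {
    enum node_kind   kind;
    struct ast_node *child[3];        /* up to three children            */
    int              num_children;
    /* attributes added by the semantic analyzer: */
    struct symbol   *sym;             /* for identifiers                 */
    int              const_value;     /* for constants                   */
    int              line;            /* for diagnostics                 */
};

struct ast_node *new_node(enum node_kind kind) {
    struct ast_node *n = calloc(1, sizeof *n);
    if (n == NULL) exit(1);           /* out of memory */
    n->kind = kind;
    return n;
}

int main(void) {
    /* build the root of the tree for "i := i - j" (annotations omitted) */
    struct ast_node *assign = new_node(NODE_ASSIGN);
    assign->child[0] = new_node(NODE_IDENT);
    assign->child[1] = new_node(NODE_OP);
    assign->num_children = 2;
    printf("root has %d children\n", assign->num_children);
    free(assign->child[0]); free(assign->child[1]); free(assign);
    return 0;
}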
In many compilers, the annotated syntax tree constitutes the intermediate
form that is passed from the front end to the back end. In other compilers, semantic analysis ends with a traversal of the tree that generates some other intermediate form. Often this alternative form resembles assembly language for an
extremely simple idealized machine.
9 As we shall see in Section 6.1.3, Java and C# actually do enforce initialization at compile time, but
only by adopting a conservative set of rules for “definite assignment,” which outlaw programs for
which correctness is difficult or impossible to verify at compile time.
Figure 1.4 Syntax tree and symbol table for the GCD program. Unlike Figure 1.3, the syntax
tree retains just the essential structure of the program, omitting detail that was needed only to
drive the parsing algorithm.
In a suite of related compilers, the front ends for several languages and the back ends for several machines would share a common intermediate form.
1.6.3 Target Code Generation
EXAMPLE 1.22 GCD program assembly code
The code generation phase of a compiler translates the intermediate form into
the target language. Given the information contained in the syntax tree, generating correct code is usually not a difficult task (generating good code is
harder, as we shall see in Section 1.6.4). To generate assembly or machine language, the code generator traverses the symbol table to assign locations to variables, and then traverses the syntax tree, generating loads and stores for variable references, interspersed with appropriate arithmetic operations, tests, and
branches. Naive code for our GCD example appears in Figure 1.5, in MIPS assembly language. It was generated automatically by a simple pedagogical compiler.
The assembly language mnemonics may appear a bit cryptic, but the comments on each line (not generated by the compiler!) should make the correspondence between Figures 1.4 and 1.5 generally apparent. A few hints: sp , ra , at , a0 ,
v0 , and t0 – t9 are registers (special storage locations, limited in number, that can
be accessed very quickly). 28(sp) refers to the memory location 28 bytes beyond
the location whose address is in register sp . Jal is a subroutine call (“jump and link”); the first argument is passed in register a0 , and the return value comes back in register v0 . Nop is a “no-op”; it does no useful work but delays the program for one time cycle, allowing a two-cycle load or branch instruction to complete (branch and load delays were a common feature in early RISC machines; we will consider them in Section 5.5.1). Arithmetic operations generally operate on the second and third arguments, and put their result in the first.

    addiu   sp,sp,-32       # reserve room for local variables
    sw      ra,20(sp)       # save return address
    jal     getint          # read
    nop
    sw      v0,28(sp)       # store i
    jal     getint          # read
    nop
    sw      v0,24(sp)       # store j
    lw      t6,28(sp)       # load i
    lw      t7,24(sp)       # load j
    nop
    beq     t6,t7,D         # branch if i = j
    nop
A:  lw      t8,28(sp)       # load i
    lw      t9,24(sp)       # load j
    nop
    slt     at,t9,t8        # determine whether j < i
    beq     at,zero,B       # branch if not
    nop
    lw      t0,28(sp)       # load i
    lw      t1,24(sp)       # load j
    nop
    subu    t2,t0,t1        # t2 := i - j
    sw      t2,28(sp)       # store i
    b       C
    nop
B:  lw      t3,24(sp)       # load j
    lw      t4,28(sp)       # load i
    nop
    subu    t5,t3,t4        # t5 := j - i
    sw      t5,24(sp)       # store j
C:  lw      t6,28(sp)       # load i
    lw      t7,24(sp)       # load j
    nop
    bne     t6,t7,A         # branch if i <> j
    nop
D:  lw      a0,28(sp)       # load i
    jal     putint          # writeln
    nop
    move    v0,zero         # exit status for program
    b       E               # branch to E
    nop
    b       E               # branch to E
    nop
E:  lw      ra,20(sp)       # retrieve return address
    addiu   sp,sp,32        # deallocate space for local variables
    jr      ra              # return to operating system
    nop

Figure 1.5 Naive MIPS assembly language for the GCD program.

Often a code generator will save the symbol table for later use by a symbolic debugger—for example, by including it as comments or some other nonexecutable part of the target code.
1.6.4 Code Improvement
EXAMPLE 1.23 GCD program optimization
Code improvement is often referred to as optimization, though it seldom makes
anything optimal in any absolute sense. It is an optional phase of compilation
whose goal is to transform a program into a new version that computes the same
result more efficiently—more quickly or using less memory, or both.
Some improvements are machine independent. These can be performed as
transformations on the intermediate form. Other improvements require an understanding of the target machine (or of whatever will execute the program in
the target language). These must be performed as transformations on the target program. Thus code improvement often appears as two additional phases
of compilation, one immediately after semantic analysis and intermediate code
generation, the other immediately after target code generation.
Applying a good code improver to the code in Figure 1.5 produces the code
shown in Example 1.2 (page 3). Comparing the two programs, we can see that
the improved version is quite a lot shorter. Conspicuously absent are most of the
loads and stores. The machine-independent code improver is able to verify that i
and j can be kept in registers throughout the execution of the main loop (this
would not have been the case if, for example, the loop contained a call to a subroutine that might reuse those registers, or that might try to modify i or j ). The
machine-specific code improver is then able to assign i and j to actual registers
of the target machine. In our example the machine-specific improver is also able
to schedule (reorder) instructions to eliminate several of the no-ops. Careful examination of the instructions following the loads and branches will reveal that
they can be executed safely even when the load or branch has not yet completed.
For modern microprocessor architectures, particularly those with so-called superscalar RISC instruction sets (ones in which separate functional units can execute multiple instructions simultaneously), compilers can usually generate better
code than can human assembly language programmers.
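To see at the source level what the improver is buying, consider the following C analogy (mine, not the actual output of any code improver): the naive version reloads i and j from memory at every reference, while the improved version keeps both values in locals that the compiler can hold in registers for the whole loop.

#include <stdio.h>

/* Analogy only: mem_i and mem_j stand in for the stack slots 28(sp) and
   24(sp) of Figure 1.5.  The volatile qualifier forces gcd_naive to reload
   and store them at every reference, mimicking the unimproved code. */
static volatile int mem_i, mem_j;

int gcd_naive(void) {
    while (mem_i != mem_j) {            /* load i, load j every iteration */
        if (mem_i > mem_j)
            mem_i = mem_i - mem_j;      /* more loads, plus a store of i  */
        else
            mem_j = mem_j - mem_i;      /* more loads, plus a store of j  */
    }
    return mem_i;
}

int gcd_improved(void) {
    int i = mem_i, j = mem_j;           /* one load of each               */
    while (i != j) {                    /* register-resident from here on */
        if (i > j) i = i - j;
        else       j = j - i;
    }
    return i;
}

int main(void) {
    mem_i = 36; mem_j = 24;
    printf("%d\n", gcd_improved());     /* 12; leaves mem_i and mem_j alone */
    printf("%d\n", gcd_naive());        /* 12 */
    return 0;
}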
CHECK YOUR UNDERSTANDING
20. List the principal phases of compilation, and describe the work performed by
each.
21. Describe the form in which a program is passed from the scanner to the
parser; from the parser to the semantic analyzer; from the semantic analyzer
to the intermediate code generator.
22. What distinguishes the front end of a compiler from the back end?
23. What is the difference between a phase and a pass of compilation? Under what
circumstances does it make sense for a compiler to have multiple passes?
24. What is the purpose of the compiler’s symbol table?
25. What is the difference between static and dynamic semantics?
26. On modern machines, do assembly language programmers still tend to write
better code than a good compiler can? Why or why not?
1.7 Summary and Concluding Remarks
In this chapter we introduced the study of programming language design and
implementation. We considered why there are so many languages, what makes
them successful or unsuccessful, how they may be categorized for study, and what
benefits the reader is likely to gain from that study. We noted that language design
and language implementation are intimately related to one another. Obviously an
implementation must conform to the rules of the language. At the same time, a
language designer must consider how easy or difficult it will be to implement
various features, and what sort of performance is likely to result for programs
that use those features.
Language implementations are commonly differentiated into those based on
interpretation and those based on compilation. We noted, however, that the difference between these approaches is fuzzy, and that most implementations include a bit of each. As a general rule, we say that a language is compiled if execution is preceded by a translation step that (1) fully analyzes both the structure
(syntax) and meaning (semantics) of the program and (2) produces an equivalent program in a significantly different form. The bulk of the implementation
material in this book pertains to compilation.
Compilers are generally structured as a series of phases. The first few phases—
scanning, parsing, and semantic analysis—serve to analyze the source program. Collectively these phases are known as the compiler’s front end. The
final few phases—intermediate code generation, code improvement, and target code generation—are known as the back end. They serve to build a target program—preferably a fast one—whose semantics match those of the
source.
Chapters 3, 6, 7, 8, and 9 form the core of the rest of this book. They cover fundamental issues of language design, both from the point of view of the programmer and from the point of view of the language implementor. To support the
discussion of implementations, Chapters 2 and 4 describe compiler front ends
in more detail than has been possible in this introduction. Chapter 5 provides
an overview of assembly-level architecture. Chapters 14 and 15 discuss compiler
back ends, including assemblers and linkers. Additional language paradigms are
covered in Chapters 10 through 13. Appendix A lists the principal programming
languages mentioned in the text, together with a genealogical chart and bibliographic references. Appendix B contains a list of “Design and Implementation”
sidebars. Appendix C contains a list of numbered examples.
1.8 Exercises
1.1 Errors in a computer program can be classified according to when they are
detected and, if they are detected at compile time, what part of the compiler
detects them. Using your favorite imperative language, give an example of
each of the following.
(a) A lexical error, detected by the scanner
(b) A syntax error, detected by the parser
(c) A static semantic error, detected by semantic analysis
(d) A dynamic semantic error, detected by code generated by the compiler
(e) An error that the compiler can neither catch nor easily generate code to catch (this should be a violation of the language definition, not just a program bug)
1.2 Algol family languages are typically compiled, while Lisp family languages, in
which many issues cannot be settled until run time, are typically interpreted.
Is interpretation simply what one “has to do” when compilation is infeasible,
or are there actually some advantages to interpreting a language, even when
a compiler is available?
1.3 The gcd program of Example 1.17 might also be written
program gcd(input, output);
var i, j : integer;
begin
read(i, j);
while i <> j do
if i > j then i := i mod j
else j := j mod i;
writeln(i)
end.
Does this program compute the same result? If not, can you fix it? Under
what circumstances would you expect one or the other to be faster?
1.4 In your local implementation of C, what is the limit on the size of integers?
What happens in the event of arithmetic overflow? What are the implications
of size limits on the portability of programs from one machine/compiler to
another? How do the answers to these questions differ for Java? For Ada? For
Pascal? For Scheme? (You may need to find a manual.)
1.5 The Unix make utility allows the programmer to specify dependences among
the separately compiled pieces of a program. If file A depends on file B and
file B is modified, make deduces that A must be recompiled, in case any of
the changes to B would affect the code produced for A. How accurate is this
sort of dependence management? Under what circumstances will it lead to
unnecessary work? Under what circumstances will it fail to recompile something that needs to be recompiled?
1.6 Why is it difficult to tell whether a program is correct? How do you go about
finding bugs in your code? What kinds of bugs are revealed by testing? What
kinds of bugs are not? (For more formal notions of program correctness, see
the bibliographic notes at the end of Chapter 4.)
1.9
Explorations
1.7 (a) What was the first programming language you learned? If you chose it,
why did you do so? If it was chosen for you by others, why do you think
they chose it? What parts of the language did you find the most difficult
to learn?
(b) For the language with which you are most familiar (this may or may
not be the first one you learned), list three things you wish had been
differently designed. Why do you think they were designed the way they
were? How would you fix them if you had the chance to do it over? Would
there be any negative consequences—for example, in terms of compiler
complexity or program execution speed?
1.8 Get together with a classmate whose principal programming experience is
with a language in a different category of Figure 1.1. (If your experience is
mostly in C, for example, you might search out someone with experience in
Lisp.) Compare notes. What are the easiest and most difficult aspects of programming, in each of your experiences? Pick some simple problem (e.g., sorting, or identification of connected components in a graph) and solve it using
each of your favorite languages. Which solution is more elegant (do the two
of you agree)? Which is faster? Why?
1.9 (a) If you have access to a Unix system, compile a simple program with
the -S command-line flag. Add comments to the resulting assembly
language file to explain the purpose of each instruction.
(b) Now use the -o command-line flag to generate a relocatable object file.
Using appropriate local tools (look in particular for nm , objdump , or
a symbolic debugger like gdb or dbx ), identify the machine language
corresponding to each line of assembler.
(c) Using
nm , objdump , or a similar tool, identify the undefined external
symbols in your object file. Now run the compiler to completion, to
produce an executable file. Finally, run nm or objdump again to see what
has happened to the symbols in part (b). Where did they come from,
and how did the linker resolve them?
(d) Run the compiler to completion one more time, using the
-v command-line flag. You should see messages describing the various subprograms invoked during the compilation process (some compilers use
a different letter for this option; check the man page). The subprograms
may include a preprocessor, separate passes of the compiler itself (often two), probably an assembler, and the linker. If possible, run these
subprograms yourself, individually. Which of them produce the files
described in the previous subquestions? Explain the purpose of the various command-line flags with which the subprograms were invoked.
1.10 Write a program that commits a dynamic semantic error (e.g., division by
zero, access off the end of an array, dereference of a nil pointer). What
happens when you run this program? Does the compiler give you options
to control what happens? Devise an experiment to evaluate the cost of runtime semantic checks. If possible, try this exercise with more than one language or compiler.
1.11 C has a reputation for being a relatively “unsafe” high-level language. In
particular, it allows the programmer to mix operands of different sizes and
types in many more ways than do its “safer” cousins. The Unix lint utility
can be used to search for potentially unsafe constructs in C programs. In effect, many of the rules that are enforced by the compiler in other languages
are optional in C and are enforced (if desired) by a separate program. What
do you think of this approach? Is it a good idea? Why or why not?
1.12 Using an Internet search engine or magazine indexing service, read up on
the history of Java and C#, including the conflict between Sun and Microsoft over Java standardization. Some have claimed that C# is, at least
in part, Microsoft’s attempt to kill Java. Defend or refute this claim.
1.10
Bibliographic Notes
The compiler-oriented chapters of this book attempt to convey a sense of what
the compiler does, rather than explaining how to build one. A much greater level
of detail can be found in other texts. Leading options include the work of Cooper
and Torczon [CT04], Grune et al. [GBJL01], and Appel [App97]. The older texts
by Aho, Sethi, and Ullman [ASU86] and Fischer and LeBlanc [FL88] were for
many years the standards in the field, but have grown somewhat dated. High-quality texts on programming language design include those of Louden [Lou03],
Sebesta [Seb04], and Sethi [Set96].
Some of the best information on the history of programming languages can be
found in the proceedings of conferences sponsored by the Association for Computing Machinery in 1978 and 1993 [Wex78, Ass93]. Another excellent reference
is Horowitz’s 1987 text [Hor87]. A broader range of historical material can be
found in the quarterly IEEE Annals of the History of Computing. Given the importance of personal taste in programming language design, it is inevitable that some
language comparisons should be marked by strongly worded opinions. Examples
include the writings of Dijkstra [Dij82], Hoare [Hoa81], Kernighan [Ker81], and
Wirth [Wir85a].
Most personal computer software development now takes place in integrated
programming environments. Influential precursors to these environments include the Genera Common Lisp environment from Symbolics Corp. [WMWM87]
and the Smalltalk [Gol84], Interlisp [TM81], and Cedar [SZBH86] environments
at the Xerox Palo Alto Research Center.
2
Programming Language Syntax
EXAMPLE 2.1 Syntax of Arabic numerals
Unlike natural languages such as English or Chinese, computer languages
must be precise. Both their form (syntax) and meaning (semantics) must be specified without ambiguity so that both programmers and computers can tell what a
program is supposed to do. To provide the needed degree of precision, language
designers and implementors use formal syntactic and semantic notation. To facilitate the discussion of language features in later chapters, we will cover this
notation first: syntax in the current chapter and semantics in Chapter 4.
As a motivating example, consider the Arabic numerals with which we represent numbers. These numerals are composed of digits, which we can enumerate
as follows (‘|’ means “or”):

digit −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digits are the syntactic building blocks for numbers. In the usual notation, we say
that a natural number is represented by an arbitrary-length (nonempty) string of
digits, beginning with a nonzero digit:
non zero digit −→ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural number −→ non zero digit digit *
Here the “Kleene1 star” metasymbol (*) is used to indicate zero or more repetitions of the symbol to its left.
Of course, digits are only symbols: ink blobs on paper or pixels on a screen.
They carry no meaning in and of themselves. We add semantics to digits when
we say that they represent the natural numbers from zero to nine, as defined
by mathematicians. Alternatively, we could say that they represent colors, or the
days of the week in a decimal calendar. These would constitute alternative semantics for the same syntax. In a similar fashion, we define the semantics of natural
numbers by associating a base-10, place-value interpretation with each string of
1 Stephen Kleene (1909–1994), a mathematician at the University of Wisconsin, was responsible
for much of the early development of the theory of computation, including much of the material
in Section 2.4.
digits. Similar syntax rules and semantic interpretations can be devised for rational numbers, (limited-precision) real numbers, arithmetic, assignments, control
flow, declarations, and indeed all of programming languages.
Distinguishing between syntax and semantics is useful for at least two reasons.
First, different programming languages often provide features with very similar
semantics but very different syntax. It is generally much easier to learn a new
language if one is able to identify the common (and presumably familiar) ideas
beneath the unfamiliar syntax. Second, there are some very efficient and elegant
algorithms that a compiler or interpreter can use to discover the syntactic structure (but not the semantics!) of a computer program, and these algorithms can
be used to drive the rest of the compilation or interpretation process.
In the current chapter we focus on syntax: how we specify the structural rules
of a programming language, and how a compiler identifies the structure of a
given input program. These two tasks—specifying syntax rules and figuring out
how (and whether) a given program was built according to those rules—are distinct. The first is of interest mainly to programmers, who want to write valid
programs. The second is of interest mainly to compilers, which need to analyze
those programs. The first task relies on regular expressions and context-free grammars, which specify how to generate valid programs. The second task relies on
scanners and parsers, which recognize program structure. We address the first of
these tasks in Section 2.1, the second in Sections 2.2 and 2.3.
In Section 2.4 (largely on the PLP CD) we take a deeper look at the formal theory underlying scanning and parsing. In theoretical parlance, a scanner is a deterministic finite automaton (DFA) that recognizes the tokens of a programming
language. A parser is a deterministic push-down automaton (PDA) that recognizes
the language’s context-free syntax. It turns out that one can generate scanners and
parsers automatically from regular expressions and context-free grammars. This
task is performed by tools like Unix’s lex and yacc.2 Possibly nowhere else in
computer science is the connection between theory and practice so clear and so
compelling.
2.1
Specifying Syntax: Regular Expressions and
Context-Free Grammars
Formal specification of syntax requires a set of rules. How complicated (expressive) the syntax can be depends on the kinds of rules we are allowed to use. It
turns out that what we intuitively think of as tokens can be constructed from
2 At many sites, lex and yacc have been superseded by the GNU flex and bison tools. These
independently developed, noncommercial alternatives are available without charge from the Free
Software Foundation at www.gnu.org/software. They provide a superset of the functionality of
lex and yacc.
individual characters using just three kinds of formal rules: concatenation, alternation (choice among a finite set of alternatives), and so-called “Kleene closure”
(repetition an arbitrary number of times). Specifying most of the rest of what
we intuitively think of as syntax requires one additional kind of rule: recursion
(creation of a construct from simpler instances of the same construct). Any set of
strings that can be defined in terms of the first three rules is called a regular set,
or sometimes a regular language. Regular sets are generated by regular expressions
and recognized by scanners. Any set of strings that can be defined if we add recursion is called a context-free language (CFL). Context-free languages are generated
by context-free grammars (CFGs) and recognized by parsers. (Terminology can
be confusing here. The meaning of the word language varies greatly, depending
on whether we’re talking about “formal” languages [e.g., regular or context-free]
or programming languages. A formal language is just a set of strings, with no
accompanying semantics.)
2.1.1
Tokens and Regular Expressions
Tokens are the basic building blocks of programs. They include keywords, identifiers, numbers, and various kinds of symbols. Pascal, which is a fairly simple
language, has 64 kinds of tokens, including 21 symbols ( + , - , ; , := , .. , etc.),
35 keywords ( begin , end , div , record , while , etc.), integer literals (e.g., 137 ),
real (floating-point) literals (e.g., 6.022e23 ), quoted character/string literals
(e.g., ’snerk’ ), identifiers ( MyVariable , YourType , maxint , readln , etc., 39
of which are predefined), and two different kinds of comments.
Upper- and lowercase letters in identifiers and keywords are considered distinct in some languages (e.g., Modula-2/3 and C and its descendants), and identical in others (e.g., Ada, Common Lisp, Fortran 90, and Pascal). Thus foo , Foo ,
and FOO all represent the same identifier in Ada but different identifiers in C.
Modula-2 and Modula-3 require keywords and predefined (built-in) identifiers
to be written in uppercase; C and its descendants require them to be written in
lowercase. A few languages (notably Modula-3 and Standard Pascal) allow only
letters and digits in identifiers. Most (including many actual implementations of
Pascal) allow underscores. A few (notably Lisp) allow a variety of additional characters. Some languages (e.g., Java, C#, and Modula-3) have standard conventions
on the use of upper- and lowercase letters in names.3
With the globalization of computing, non-Latin character sets have become
increasingly important. Many modern languages, including C99, C++, Ada 95,
Java, C#, and Fortran 2003, have explicit support for multibyte character sets,
generally based on the Unicode and ISO/IEC 10646 international standards. Most
modern programming languages allow non-Latin characters to appear within
3 For the sake of consistency we do not always obey such conventions in this book. Most examples
follow the common practice of C programmers, in which underscores, rather than capital letters,
separate the “subwords” of names.
comments and character strings; an increasing number allow them in identifiers as well. Conventions for portability across character sets and for localization
to a given character set can be surprisingly complex, particularly when various
forms of backward compatibility are required (the C99 Rationale devotes five full
pages to this subject [Int99, pp. 19–23]); for the most part we ignore such issues
here.
Some language implementations impose limits on the maximum length of
identifiers, but most avoid such unnecessary restrictions. Most modern languages
are also more-or-less free format, meaning that a program is simply a sequence
of tokens: what matters is their order with respect to one another, not their physical position within a printed line or page. “White space” (blanks, tabs, carriage
returns, and line and page feed characters) between tokens is usually ignored, except to the extent that it is needed to separate one token from the next. There are
a few exceptions to these rules. Some language implementations limit the maximum length of a line, to allow the compiler to store the current line in a fixed-length buffer. Dialects of Fortran prior to Fortran 90 use a fixed format, with 72
characters per line (the width of a paper punch card, on which programs were
once stored) and with different columns within the line reserved for different
purposes. Line breaks serve to separate statements in several other languages, including Haskell, Occam, SR, Tcl, and Python. Haskell, Occam, and Python also
give special significance to indentation. The body of a loop, for example, consists
of precisely those subsequent lines that are indented farther than the header of
the loop.
To specify tokens, we use the notation of regular expressions. A regular expression is one of the following.
1. A character
2. The empty string, denoted ε
3. Two regular expressions next to each other, meaning any string generated by
the first one followed by (concatenated with) any string generated by the second one
4. Two regular expressions separated by a vertical bar ( | ), meaning any string
generated by the first one or any string generated by the second one
D E S I G N & I M P L E M E N TAT I O N
Formatting restrictions
Formatting limitations inspired by implementation concerns—as in the
punch-card-oriented rules of Fortran 77 and its predecessors—have a tendency to become unwanted anachronisms as implementation techniques improve. Given the tendency of certain word processors to “fill” or auto-format
text, the line break and indentation rules of languages like Haskell, Occam, and
Python are somewhat controversial.
5. A regular expression followed by a Kleene star, meaning the concatenation of
zero or more strings generated by the expression in front of the star
EXAMPLE 2.2 Syntax of numbers in Pascal
Parentheses are used to avoid ambiguity about where the various subexpressions start and end.4
Returning to the example of Pascal, numeric literals can be generated by the
following regular expressions.5
digit −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
unsigned integer −→ digit digit *
unsigned number −→ unsigned integer ( ( . unsigned integer ) | ε ) ( ( ( e | E ) ( + | - | ε ) unsigned integer ) | ε )
To generate a valid string, we scan the regular expression from left to right,
choosing among alternatives at each vertical bar, and choosing a number of repetitions at each Kleene star. Within each repetition we may make different choices
at vertical bars, generating different substrings. Note that while we have allowed
later definitions to build on earlier ones, nothing is ever defined in terms of itself. Such recursive definitions are the distinguishing characteristic of context-free grammars, described in Section 2.1.2.
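Patterns like these transliterate almost directly into the regular-expression notation of modern libraries. The following sketch is not from the text: it recasts the unsigned number rules above using Python's re module, in which ? ("zero or one occurrence") and the character class [0-9] are conveniences that add no expressive power.

import re

# A sketch (not from the text) of the Pascal unsigned_number pattern above,
# rewritten in Python's re notation.
unsigned_integer = r"[0-9][0-9]*"                      # digit digit*
unsigned_number = re.compile(
    r"{d}(\.{d})?([eE][+-]?{d})?$".format(d=unsigned_integer))

for literal in ["137", "6.022e23", "3.14", "4e-2", "3.", ".5"]:
    print(literal, bool(unsigned_number.match(literal)))
# The first four match; "3." and ".5" do not, just as in Pascal.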
Many readers will be familiar with regular expressions from the grep family
of tools in Unix, the search facilities of various text editors (notably emacs ), or
such scripting languages and tools as Perl, Python, Ruby, awk , and sed . Most
of these provide a rich set of extensions to the notation of regular expressions.
Some extensions, such as shorthand for “zero or one occurrences” or “anything
other than white space” do not change the power of the notation. Others, such
as the ability to require a second occurrence later in the input string of the same
character sequence that matched an earlier part of the expression, increase the
power of the notation, so it is no longer restricted to generating regular sets. Still
other extensions are designed not to increase the expressiveness of the notation
but rather to tie it to other language facilities. In many tools, for example, one
can bracket portions of a regular expression in such a way that when a string
is matched against it the contents of the corresponding substrings are assigned
into named local variables. We will return to these issues in Section 13.4.2, in the
context of scripting languages.
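Two quick illustrations, not from the text, using Python's re module as a stand-in for these tools: a named capture group binds part of the match to a name without changing what can be matched, while a backreference demands a repeated copy of earlier matched text and thereby escapes the regular sets.

import re

# Named capture groups: a convenience only; the set of strings matched is
# still a regular set.
m = re.match(r"(?P<whole>[0-9]+)\.(?P<frac>[0-9]+)", "3.14159")
print(m.group("whole"), m.group("frac"))               # 3 14159

# A backreference: \1 must repeat exactly what the first group matched.
# Recognizing doubled words this way is beyond the power of true regular
# expressions.
doubled = re.compile(r"\b(\w+)\s+\1\b")
print(bool(doubled.search("it was the the best of times")))   # True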
4 Some authors use λ to represent the empty string. Some use a period (.), rather than juxtaposition, to indicate concatenation. Some use a plus sign (+), rather than a vertical bar, to indicate
alternation.
5 Numeric literals in many languages are significantly more complex. Java, for example, supports
both 32 and 64-bit integer constants, in decimal, octal, and hexadecimal.
2.1.2
Context-Free Grammars

EXAMPLE 2.3 Syntactic nesting in expressions
Regular expressions work well for defining tokens. They are unable, however, to
specify nested constructs, which are central to programming languages. Consider
for example the structure of an arithmetic expression:
expr −→ id | number | - expr | ( expr ) | expr op expr
op −→ + | - | * | /

EXAMPLE 2.4 Extended BNF (EBNF)
Here the ability to define a construct in terms of itself is crucial. Among other
things, it allows us to ensure that left and right parentheses are matched, something that cannot be accomplished with regular expressions (see Section 2.4.3
for more details).
Each of the rules in a context-free grammar is known as a production. The
symbols on the left-hand sides of the productions are known as variables, or nonterminals. There may be any number of productions with the same left-hand side.
Symbols that are to make up the strings derived from the grammar are known as
terminals (shown here in typewriter font). They cannot appear on the left-hand
side of any production. In a programming language, the terminals of the contextfree grammar are the language’s tokens. One of the nonterminals, usually the one
on the left-hand side of the first production, is called the start symbol. It names
the construct defined by the overall grammar.
The notation for context-free grammars is sometimes called Backus-Naur
Form (BNF), in honor of John Backus and Peter Naur, who devised it for the
definition of the Algol 60 programming language [NBB+ 63].6 Strictly speaking,
the Kleene star and meta-level parentheses of regular expressions are not allowed
in BNF, but they do not change the expressive power of the notation and are commonly included for convenience. Sometimes one sees a “Kleene plus” (+) as well;
it indicates one or more instances of the symbol or group of symbols in front
of it.7 When augmented with these extra operators, the notation is often called
extended BNF (EBNF). The construct
id list −→ id ( , id )*
is shorthand for
id list −→ id
id list −→ id list , id
6 John Backus (1924–) is also the inventor of Fortran. He spent most of his professional career at
IBM Corporation, and was named an IBM Fellow in 1987. He received the ACM Turing Award
in 1977.
7 Some authors use curly braces ({ }) to indicate zero or more instances of the symbols inside.
Some use square brackets ([ ]) to indicate zero or one instance of the symbols inside—that is, to
indicate that those symbols are optional.
“Kleene plus” is analogous. The vertical bar is also in some sense superfluous,
though it was provided in the original BNF. The construct
op −→ + | - | * | /
can be considered shorthand for
op −→ +
op −→ -
op −→ *
op −→ /
which is also sometimes written
op −→ +
   −→ -
   −→ *
   −→ /
Many tokens, such as id and number above, have many possible spellings (i.e.,
may be represented by many possible strings of characters). The parser is oblivious to these; it does not distinguish one identifier from another. The semantic
analyzer does distinguish them, however, so the scanner must save the spelling of
each “interesting” token for later use.
2.1.3
Derivations and Parse Trees

EXAMPLE 2.5 Derivation of slope * x + intercept
A context-free grammar shows us how to generate a syntactically valid string
of terminals: begin with the start symbol. Choose a production with the start
symbol on the left-hand side; replace the start symbol with the right-hand side
of that production. Now choose a nonterminal A in the resulting string, choose a
production P with A on its left-hand side, and replace A with the right-hand side
of P. Repeat this process until no nonterminals remain.
As an example, we can use our grammar for expressions to generate the string
“ slope * x + intercept ”:
expr ⇒ expr op expr
     ⇒ expr op id
     ⇒ expr + id
     ⇒ expr op expr + id
     ⇒ expr op id + id
     ⇒ expr * id + id
     ⇒ id ( slope ) * id ( x ) + id ( intercept )
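The replacement process just traced is mechanical, and a few lines of code can reproduce it. The sketch below is not from the text: it stores the grammar of Example 2.3 as a Python dictionary and repeatedly expands the right-most nonterminal, choosing a production at random and printing each sentential form along the way.

import random

# A sketch (not from the text): the expression grammar of Example 2.3 as data,
# expanded by repeatedly replacing the right-most nonterminal.
grammar = {
    "expr": [["id"], ["number"], ["-", "expr"], ["(", "expr", ")"],
             ["expr", "op", "expr"]],
    "op":   [["+"], ["-"], ["*"], ["/"]],
}

def derive(start="expr", max_steps=20):
    sentential_form = [start]
    for _ in range(max_steps):
        nonterminals = [i for i, s in enumerate(sentential_form) if s in grammar]
        if not nonterminals:
            break                        # all terminals: this is the yield
        i = nonterminals[-1]             # right-most nonterminal
        sentential_form[i:i + 1] = random.choice(grammar[sentential_form[i]])
        print(" ".join(sentential_form))
    # If nonterminals remain after max_steps, the derivation is simply
    # left unfinished.
    return sentential_form

derive()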
Figure 2.1 Parse tree for slope * x + intercept (grammar in Example 2.3).
Figure 2.2 Alternative (less desirable) parse tree for slope * x + intercept (grammar in
Example 2.3). The fact that more than one tree exists implies that our grammar is ambiguous.
EXAMPLE 2.6 Parse trees for slope * x + intercept
The ⇒ metasymbol indicates that the right-hand side was obtained by using
a production to replace some nonterminal in the left-hand side. At each line we
have underlined the symbol A that is replaced in the following line.
A series of replacement operations that shows how to derive a string of terminals from the start symbol is called a derivation. Each string of symbols along
the way is called a sentential form. The final sentential form, consisting of only
terminals, is called the yield of the derivation. We sometimes elide the intermediate steps and write expr ⇒* slope * x + intercept , where the metasymbol ⇒* means “yields after zero or more replacements.” In this particular
derivation, we have chosen at each step to replace the right-most nonterminal
with the right-hand side of some production. This replacement strategy leads to
a right-most derivation, also called a canonical derivation. There are many other
possible derivations, including left-most and options in-between. Most parsers
are designed to find a particular derivation (usually the left-most or right-most).
We saw in Chapter 1 that we can represent a derivation graphically as a parse
tree. The root of the parse tree is the start symbol of the grammar. The leaves of
the tree are its yield. Each internal node, together with its children, represents the
use of a production.
A parse tree for our example expression appears in Figure 2.1. This tree is not
unique. At the second level of the tree, we could have chosen to turn the operator
into a * instead of a + , and to further expand the expression on the right, rather
than the one on the left (see Figure 2.2). The fact that some strings are the yield
of more than one parse tree tells us that our grammar is ambiguous. Ambiguity
turns out to be a problem when trying to build a parser: it requires some extra
mechanism to drive a choice between equally acceptable alternatives.

Figure 2.3 Parse tree for 3 + 4 * 5 , with precedence (grammar in Example 2.7).

EXAMPLE 2.7 Expression grammar with precedence and associativity
A moment’s reflection will reveal that there are infinitely many context-free
grammars for any given context-free language. Some of these grammars are much
more useful than others. In this text we will avoid the use of ambiguous grammars
(though most parser generators allow them, by means of disambiguating rules).
We will also avoid the use of so-called useless symbols: nonterminals that cannot
generate any string of terminals, or terminals that cannot appear in the yield of
any derivation.
When designing the grammar for a programming language, we generally try
to find one that reflects the internal structure of programs in a way that is useful
to the rest of the compiler. (We shall see in Section 2.3.2 that we also try to find
one that can be parsed efficiently, which can be a bit of a challenge.) One place
in which structure is particularly important is in arithmetic expressions, where
we can use productions to capture the associativity and precedence of the various operators. Associativity tells us that the operators in most languages group
left-to-right, so 10 - 4 - 3 means (10 - 4) - 3 rather than 10 - (4 - 3) .
Precedence tells us that multiplication and division in most languages group
more tightly than addition and subtraction, so 3 + 4 * 5 means 3 + (4 * 5)
rather than (3 + 4) * 5 . (These rules are not universal; we will consider them
again in Section 6.1.1.)
Here is a better version of our expression grammar.
1. expr −→ term | expr add op term
2. term −→ factor | term mult op factor
3. factor −→ id | number | - factor | ( expr )
4. add op −→ + | -
5. mult op −→ * | /
This grammar is unambiguous. It captures precedence in the way factor, term,
and expr build on one another, with different operators appearing at each level. It
captures associativity in the second halves of lines 1 and 2, which build subexprs
and subterms to the left of the operator, rather than to the right. In Figure 2.3, we
can see how building the notion of precedence into the grammar makes it clear
that multiplication groups more tightly than addition in 3 + 4 * 5 , even without parentheses. In Figure 2.4, we can see that subtraction groups more tightly to
the left, so 10 - 4 - 3 would evaluate to 3 rather than to 9 .

Figure 2.4 Parse tree for 10 - 4 - 3 , with left associativity (grammar in Example 2.7).
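One way to convince oneself that the grammar of Example 2.7 really does capture precedence and associativity is to shape an evaluator after it. The sketch below is not from the text; it is written in Python, realizes the left-recursive productions of lines 1 and 2 as loops (which is what makes the operators group to the left), and omits identifiers for brevity.

# A sketch (not from the text) of an evaluator shaped like the grammar of
# Example 2.7.  The expr/term/factor layering yields precedence; the loops
# that stand in for the left-recursive productions yield left associativity.
def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():                        # factor -> ( expr ) | - factor | number
        tok = advance()
        if tok == "(":
            val = expr()
            assert advance() == ")"
            return val
        if tok == "-":
            return -factor()
        return float(tok)

    def term():                          # term -> factor { mult_op factor }
        val = factor()
        while peek() in ("*", "/"):
            op, rhs = advance(), factor()
            val = val * rhs if op == "*" else val / rhs
        return val

    def expr():                          # expr -> term { add_op term }
        val = term()
        while peek() in ("+", "-"):
            op, rhs = advance(), term()
            val = val + rhs if op == "+" else val - rhs
        return val

    return expr()

print(evaluate("3 + 4 * 5".split()))     # 23.0: * groups more tightly than +
print(evaluate("10 - 4 - 3".split()))    # 3.0: subtraction groups to the left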
C H E C K YO U R U N D E R S TA N D I N G
1. What is the difference between syntax and semantics?
2. What are the three basic operations that can be used to build complex regular
expressions from simpler regular expressions?
3. What additional operation (beyond the three of regular expressions) is provided in context-free grammars?
4. What is Backus-Naur form? When and why was it devised?
5. Name a language in which indentation affects program syntax.
6. When discussing context-free languages, what is a derivation? What is a sentential form?
7. What is the difference between a right-most derivation and a left-most derivation? Which one of them is also called canonical?
8. What does it mean for a context-free grammar to be ambiguous?
9. What are associativity and precedence? Why are they significant in parse trees?
2.2
Scanning
Together, the scanner and parser for a programming language are responsible
for discovering the syntactic structure of a program. This process of discovery,
or syntax analysis, is a necessary first step toward translating the program into
EXAMPLE 2.8 Outline of a scanner for Pascal
an equivalent program in the target language. (It’s also the first step toward interpreting the program directly. In general, we will focus on compilation, rather
than interpretation, for the remainder of the book. Most of what we shall discuss either has an obvious application to interpretation or is obviously irrelevant
to it.)
By grouping input characters into tokens, the scanner dramatically reduces the
number of individual items that must be inspected by the more computationally
intensive parser. In addition, the scanner typically removes comments (so the
parser doesn’t have to worry about them appearing throughout the context-free
grammar); saves the text of “interesting” tokens like identifiers, strings, and numeric literals; and tags tokens with line and column numbers to make it easier to
generate high-quality error messages in later phases.
Suppose for a moment that we are writing a scanner for Pascal.8 We might
sketch the process as shown in Figure 2.5. The structure of the code is entirely up to the programmer, but it seems reasonable to check the simpler
and more common cases first, to peek ahead when we need to, and to embed loops for comments and for long tokens such as identifiers, numbers, and
strings.
After announcing a token the scanner returns to the parser. When invoked
again it repeats the algorithm from the beginning, using the next available characters of input (including any look-ahead that was peeked at but not consumed
the last time).
As a rule, we accept the longest possible token in each invocation of the scanner. Thus foobar is always foobar and never f or foo or foob . More to the
point, 3.14159 is a real number and never 3 , . , and 14159 . White space (blanks,
D E S I G N & I M P L E M E N TAT I O N
Nested comments
Nested comments can be handy for the programmer (e.g., for temporarily
“commenting out” large blocks of code). Scanners normally deal only with
nonrecursive constructs, however, so nested comments require special treatment. Some languages disallow them. Others require the language implementor to augment the scanner with special purpose comment-handling code.
C++ and C99 strike a compromise: /* ... */ style comments are not allowed
to nest, but /* ... */ and //... style comments can appear inside each other.
The programmer can thus use one style for “normal” comments and the other
for “commenting out.” (The C99 designers note, however, that conditional
compilation ( #if ) is preferable [Int03a, p. 58].)
8 As in Example 1.17, we use Pascal for this example because its lexical structure is significantly
simpler than that of most modern imperative languages.
we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
if that is a * we have a comment;
we skip forward through the terminating *)
otherwise we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
we return that token
if it is a . we look at the next character
if that is a . we return .. †
otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
if that is a = we return <=
otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
and maybe underscores until we can’t anymore;
then we check to see if it is a keyword
if so we return the keyword
otherwise we return an identifier
in either case we reuse the character beyond the end of the token
if it is a digit we keep reading until we find a nondigit
if that is not a . we return an integer and reuse the nondigit
otherwise we keep looking for a real number
if the character after the . is not a digit we return an integer
and reuse the . and the look-ahead
etc.
Figure 2.5
Outline of an ad hoc Pascal scanner. Only a fraction of the code is shown.
†The double-dot .. token is used to specify ranges in Pascal (e.g., type day = 1..31).
EXAMPLE 2.9 Finite automaton for part of a Pascal scanner
tabs, carriage returns, comments) is generally ignored, except to the extent that
it separates tokens (e.g., foo bar is different from foobar ).
It is not difficult to flesh out Figure 2.5 by hand to produce code in some
programming language. This ad hoc style of scanner is often used in production
compilers; the code is fast and compact. In some cases, however, it makes sense
to build a scanner in a more structured way, as an explicit representation of a
finite automaton. An example of such an automaton, for part of a Pascal scanner,
appears in Figure 2.6. The automaton starts in a distinguished initial state. It then
moves from state to state based on the next available character of input. When it
reaches one of a designated set of final states it recognizes the token associated
with that state. The “longest possible token” rule means that the scanner returns
to the parser only when the next character cannot be used to continue the current
token.
Figure 2.6
Pictorial representation of (part of) a Pascal scanner as a finite automaton. Scanning for each token begins in the state marked “Start.” The final states, in which a token is
recognized, are indicated by double circles.
2.2.1
Generating a Finite Automaton
While a finite automaton can in principle be written by hand, it is more common to build one automatically from a set of regular expressions, using a scanner
generator tool. Because regular expressions are significantly easier to write and
modify than an ad hoc scanner is, automatically generated scanners are often
used during language or compiler development, or when ease of implementation is more important than the last little bit of run-time performance. In effect,
regular expressions constitute a declarative programming language for a limited
problem domain: namely, that of scanning.
The example automaton of Figure 2.6 is deterministic: there is never any ambiguity about what it ought to do, because in a given state with a given input character there is never more than one possible outgoing transition (arrow)
labeled by that character. As it turns out, however, there is no obvious one-step
algorithm to convert a set of regular expressions into an equivalent deterministic
finite automaton (DFA). The typical scanner generator implements the conversion as a series of three separate steps.
The first step converts the regular expressions into a nondeterministic finite
automaton (NFA). An NFA is like a DFA except that (a) there may be more than
one transition out of a given state labeled by a given character, and (b) there may
be so-called epsilon transitions: arrows labeled by the empty string symbol, ε. The
NFA is said to accept an input string (token) if there exists a path from the start
state to a final state whose non-epsilon transitions are labeled, in order, by the
characters of the token.
To avoid the need to search all possible paths for one that “works,” the second step of a scanner generator translates the NFA into an equivalent DFA: an
automaton that accepts the same language, but in which there are no epsilon
transitions and no states with more than one outgoing transition labeled by the
same character. The third step is a space optimization that generates a final DFA
with the minimum possible number of states.
From a Regular Expression to an NFA
EXAMPLE 2.10 Constructing an NFA for a given regular expression
EXAMPLE 2.11 NFA for ( 1 * 0 1 * 0 )* 1 *
A trivial regular expression consisting of a single character a is equivalent to
a simple two-state NFA (in fact, a DFA), illustrated in part (a) of Figure 2.7.
Similarly, the regular expression ε is equivalent to a two-state NFA whose arc is
labeled by ε. Starting with this base we can use three subconstructions, illustrated
in parts (b)–(d) of the same figure, to build larger NFAs to represent the concatenation, alternation, or Kleene closure of the regular expressions represented by
smaller NFAs. Each step preserves three invariants: there are no transitions into
the initial state, there is a single final state, and there are no transitions out of the
final state. These invariants allow smaller machines to be joined into larger machines without any ambiguity about where to create the connections, and without creating any unexpected paths.
To make these constructions concrete, we consider a small but nontrivial example. Suppose we wish to generate all strings of zeros and ones in which the
number of zeros is even. To generate exactly two zeros we could use the expression 00 . We must allow these to be preceded, followed, or separated by an arbitrary number of ones: 1 * 0 1 * 0 1 *. This whole construct can then be repeated
an arbitrary number of times: ( 1 * 0 1 * 0 1 * ) *. Finally, we observe that there is
no point in beginning and ending the parenthesized expression with 1 *. If we
move one of the occurrences outside the parentheses we get an arguably simpler
expression: ( 1 * 0 1 * 0 ) * 1 *.
Starting with this regular expression and using the constructions of Figure 2.7,
we illustrate the construction of an equivalent NFA in Figure 2.8. In this particular example alternation is not required.
Figure 2.7 Construction of an NFA equivalent to a given regular expression. Part (a) shows
the base case: the automaton for the single letter a . Parts (b), (c), and (d), respectively, show
the constructions for concatenation, alternation, and Kleene closure. Each construction retains a
unique start state and a single final state. Internal detail is hidden in the diamond-shaped center
regions.
From an NFA to a DFA
EXAMPLE 2.12 DFA for ( 1 * 0 1 * 0 )* 1 *
With no way to “guess” the right transition to take from any given state, any practical implementation of an NFA would need to explore all possible transitions,
concurrently or via backtracking. To avoid such a complex and time-consuming
strategy, we can use a “set of subsets” construction to transform the NFA into
an equivalent DFA. The key idea is for the state of the DFA after reading a given
input to represent the set of states that the NFA might have reached on the same
input. We illustrate the construction in Figure 2.9 using the NFA from Figure 2.8.
Initially, before it consumes any input, the NFA may be in State 1, or it may make
epsilon transitions to States 2, 3, 5, 11, 12, or 14. We thus create an initial State
A for our DFA to represent this set. On an input of 1 , our NFA may move from
Figure 2.8 Construction of an NFA equivalent to the regular expression ( 1 * 0 1 * 0 )* 1 *. In the top line are the primitive
automata for 1 and 0 , and the Kleene closure construction for 1 *. In the second and third rows we have used the concatenation
construction to build 1 * 0 and 1 * 0 1 *. The fourth row uses Kleene closure again to construct ( 1 * 0 1 * 0 )* ; the final line uses
concatenation to complete the NFA. We have labeled the states in the final automaton for reference in subsequent figures.
Figure 2.9 A DFA equivalent to the NFA at the bottom of Figure 2.8. Each state of the DFA
represents the set of states that the NFA could be in after seeing the same input.
State 3 to State 4, or from State 12 to State 13. It has no other transitions on this
input from any of the states in A. From States 4 and 13, however, the NFA may
make epsilon transitions to any of States 3, 5, 12, or 14. We therefore create DFA
State B as shown. On a 0 , our NFA may move from State 5 to State 6, from which
it may reach States 7 and 9 by epsilon transitions. We therefore create DFA State C
as shown, with a transition from A to C on 0 . Careful inspection reveals that a 1
will leave the DFA in State B, while a 0 will move it from B to C. Continuing in
this fashion, we end up creating three additional states. Each state that “contains”
the final state (State 14) of the NFA is marked as a final state of the DFA.
In our example, the DFA ends up being smaller than the NFA, but this is only
because our regular language is so simple. In theory, the number of states in the
DFA may be exponential in the number of states in the NFA, but this extreme
is also uncommon in practice. For a programming language scanner, the DFA
tends to be larger than the NFA, but not outlandishly so.
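The "set of subsets" construction itself is only a few lines of code. The sketch below is not from the text: it takes an arbitrary NFA given as Python dictionaries, represents each DFA state as a frozenset of NFA states, and is exercised on a small made-up NFA for a* rather than on the machine of Figure 2.8.

from collections import deque

def epsilon_closure(states, eps):
    # All NFA states reachable from `states` using epsilon transitions alone.
    closure, work = set(states), deque(states)
    while work:
        s = work.popleft()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)

def nfa_to_dfa(start, finals, delta, eps, alphabet):
    # Subset construction: each DFA state is the set of NFA states that the
    # NFA might be in after seeing the same input.
    dfa_start = epsilon_closure({start}, eps)
    seen, dfa_delta, work = {dfa_start}, {}, deque([dfa_start])
    while work:
        S = work.popleft()
        for a in alphabet:
            targets = {t for s in S for t in delta.get((s, a), ())}
            T = epsilon_closure(targets, eps)
            dfa_delta[S, a] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa_start, dfa_finals, dfa_delta

# A made-up toy NFA (not Figure 2.8) accepting a*, built Thompson-style:
# state 0 is the start, state 3 is final, states 1 and 2 surround the 'a' arc.
delta = {(1, "a"): {2}}
eps = {0: {1, 3}, 2: {1, 3}}
dfa_start, dfa_finals, dfa_delta = nfa_to_dfa(0, {3}, delta, eps, alphabet={"a"})

def accepts(s):
    state = dfa_start
    for ch in s:
        state = dfa_delta[state, ch]
    return state in dfa_finals

print(accepts(""), accepts("aaa"))       # True True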
Minimizing the DFA
EXAMPLE 2.13 Minimal DFA for ( 1 * 0 1 * 0 )* 1 *
Starting from a regular expression we have now constructed an equivalent DFA.
Though this DFA has five states, a bit of thought suggests that it should be possible to build an automaton with only two states: one that will be reached after
consuming input containing an odd number of zeros and one that will be reached
after consuming input containing an even number of zeros. We can obtain this
machine by performing the following inductive construction. Initially we place
the states of the (not necessarily minimal) DFA into two equivalence classes: final
states and nonfinal states. We then repeatedly search for an equivalence class C
and an input symbol a such that when given a as input, the states in C make
transitions to states in k > 1 different equivalence classes. We then partition C
into k classes in such a way that all states in a given new class would move to a
member of the same old class on a . When we are unable to find a class to partition in this fashion we are done. In our example, the original placement puts
Figure 2.10
Minimal DFA for the language consisting of all strings of zeros and ones in which
the number of zeros is even. State q1 represents the merger of states qA , qB , and qE in Figure 2.9;
state q2 represents the merger of states qC and qD .
States A, B, and E in one class (final states) and C and D in another. In all cases,
a 1 leaves us in the current class, while a 0 takes us to the other class. Consequently, no class requires partitioning, and we are left with the two-state DFA of
Figure 2.10.
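The refinement loop just described can be written down almost verbatim. The sketch below is not from the text; the five-state DFA it minimizes accepts the same even-number-of-zeros language, but it is a made-up stand-in, not a copy of Figure 2.9.

# A sketch (not from the text) of partition refinement.  We split any class
# whose members disagree, for some input symbol, on which class that symbol
# leads to, and stop when no class can be split.
def minimize(states, finals, delta, alphabet):
    classes = [set(finals), set(states) - set(finals)]   # initial partition

    def class_of(s):
        return next(i for i, group in enumerate(classes) if s in group)

    changed = True
    while changed:
        changed = False
        for i, cls in enumerate(classes):
            for a in alphabet:
                groups = {}                               # split cls by target class
                for s in cls:
                    groups.setdefault(class_of(delta[s, a]), set()).add(s)
                if len(groups) > 1:
                    parts = list(groups.values())
                    classes[i] = parts[0]
                    classes.extend(parts[1:])
                    changed = True
                    break
            if changed:
                break
    return classes

# A five-state DFA for "even number of zeros" (illustrative, not exactly Figure 2.9).
states = "ABCDE"
finals = {"A", "B", "E"}
delta = {("A", "1"): "B", ("A", "0"): "C", ("B", "1"): "B", ("B", "0"): "C",
         ("C", "1"): "D", ("C", "0"): "E", ("D", "1"): "D", ("D", "0"): "E",
         ("E", "1"): "E", ("E", "0"): "C"}
print(minimize(states, finals, delta, alphabet="01"))
# Two classes remain: {A, B, E} and {C, D}, matching the two states of Figure 2.10.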
2.2.2
Scanner Code

EXAMPLE 2.14 Nested case statement automaton
We can implement a scanner that explicitly captures the “circles-and-arrows”
structure of a DFA in either of two main ways. One embeds the automaton in the
control flow of the program using goto s or nested case ( switch ) statements; the
other, described in the following subsection, uses a table and a driver. As a general rule, handwritten scanners tend to use nested case statements, while most
(but not all [BC93]) automatically generated scanners use tables. Tables are hard
to create by hand but easier than code to create from within a program. Unix’s
lex/flex tool produces C language output containing tables and a customized
driver. Some other scanner generators produce tables for use with a handwritten
driver, which can be written in any language.
The nested case statement style of automaton is illustrated in Figure 2.11.
The outer case statement covers the states of the finite automaton. The inner case statements cover the transitions out of each state. Most of the inner
clauses simply set a new state. Some return from the scanner with the current
token.
Two aspects of the code do not strictly follow the form of a finite automaton.
One is the handling of keywords. The other is the need to peek ahead in order to
distinguish between the dot in the middle of a real number and a double dot that
follows an integer.
Keywords in most languages (including Pascal) look just like identifiers, but
they are reserved for a special purpose (some authors use the term reserved word
instead of keyword9 ). It is possible to write a finite automaton that distinguishes
9 Keywords (reserved words) are not the same as predefined identifiers. Predefined identifiers can
be redefined to have a different meaning; keywords cannot. The scanner does not distinguish between predefined and other identifiers. It does distinguish between identifiers and keywords.
In Pascal, keywords include begin , div , record , and while . Predefined identifiers include
integer , writeln , true , and ord .
state := start
loop
case state of
start :
erase text of current token
case input char of
‘ ’, ‘\t’, ‘\n’, ‘\r’ : no op
‘[’ : state := got lbrac
‘]’ : state := got rbrac
‘,’ : state := got comma
...
‘(’ : state := saw lparen
‘.’ : state := saw dot
‘<’ : state := saw lthan
...
‘a’..‘z’, ‘A’..‘Z’ :
state := in ident
‘0’..‘9’ : state := in int
...
else error
...
saw lparen: case input char of
‘*’ : state := in comment
else return lparen
in comment: case input char of
‘*’ : state := leaving comment
else no op
leaving comment: case input char of
‘)’ : state := start
else state := in comment
...
saw dot : case input char of
‘.’ : state := got dotdot
else return dot
...
saw lthan : case input char of
‘=’ : state := got le
else return lt
...
Figure 2.11 Outline of a Pascal scanner written as an explicit finite automaton, in the form
of nested case statements in a loop. (continued)
in ident : case input char of
‘a’..‘z’, ‘A’..‘Z’, ‘0’..‘9’, ‘ ’ : no op
else
look up accumulated token in keyword table
if found, return keyword
else return id
...
in int : case input char of
‘0’..‘9’ : no op
‘.’ :
peek at character beyond input char;
if ‘0’..‘9’, state := saw real dot
else
unread peeked-at character
return intconst
‘a’..‘z’, ‘A’..‘Z’, ‘ ’ : error
else return intconst
...
saw real dot : . . .
...
got lbrac : return lbrac
got rbrac : return rbrac
got comma : return comma
got dotdot : return dotdot
got le : return le
...
append input char to text of current token
read new input char
Figure 2.11
(continued)
between keywords and identifiers, but it requires a lot of states. To begin with,
there must be a separate state, reachable from the initial state, for each letter that
might begin a keyword. For each of these, there must then be a state for each possible second character of a keyword (e.g., to distinguish between file , for , and
from ). It is a nuisance (and a likely source of errors) to enumerate these states by
hand. Likewise, while it is easy to write a regular expression that represents a keyword ( begin | end | while | . . . ), it is not at all easy to write an
expression that represents a (non-keyword) identifier (Exercise 2.3). Most scanners, both handwritten and automatically generated, therefore treat keywords as
“exceptions” to the rule for identifiers. Before returning an identifier to the parser,
the scanner looks it up in a hash table or trie (a tree of branching paths) to make
sure it isn’t really a keyword. This convention is reflected in the in ident arm of
Figure 2.11.
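In code, the "keywords as exceptions" convention is little more than a table lookup performed once an identifier has been scanned. A sketch, not from the text; the keyword set and token names are illustrative only.

# A sketch (not from the text) of keyword recognition as an exception to the
# identifier rule.  Pascal ignores case, so the spelling is normalized first.
KEYWORDS = {"begin", "end", "div", "record", "while"}    # a few Pascal keywords

def classify_identifier(spelling, case_sensitive=False):
    key = spelling if case_sensitive else spelling.lower()
    if key in KEYWORDS:
        return ("keyword", key)
    return ("identifier", spelling)      # spelling kept for semantic analysis

print(classify_identifier("While"))      # ('keyword', 'while')
print(classify_identifier("MyVariable")) # ('identifier', 'MyVariable')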
Whenever one legitimate token is a prefix of another, the “longest possible
token” rule says that we should continue scanning. If some of the intermediate
EXAMPLE 2.15 The “dot-dot problem” in Pascal
EXAMPLE 2.16 Look-ahead in Fortran scanning
strings are not valid tokens, however, we can’t tell whether a longer token is possible without looking more than one character ahead. This problem arises in Pascal
in only one case, sometimes known as the “dot-dot problem.” If the scanner has
seen a 3 and has a dot coming up in the input, it needs to peek at the character
beyond the dot in order to distinguish between 3.14 (a single token designating a
real number), 3 .. 5 (three tokens designating a range), and 3 . foo (three tokens that the scanner should accept, even though the parser will object to seeing
them in that order).
In messier languages, a scanner may need to look an arbitrary distance ahead.
In Fortran IV, for example, DO 5 I = 1,25 is the header of a loop (it executes
the statements up to the one labeled 5 for values of I from 1 to 25), while DO 5
I = 1.25 is an assignment statement that places the value 1.25 into the variable DO5I . Spaces are ignored in (pre-’90) Fortran input, even in the middle of
variable names. Moreover, variables need not be declared, and the terminator
for a DO loop is simply a label, which the parser can ignore. After seeing DO , the
scanner cannot tell whether the 5 is part of the current token until it reaches
the comma or dot. It has been widely (but apparently incorrectly) claimed that
NASA’s Mariner 1 space probe was lost due to accidental replacement of a comma
with a dot in a case similar to this one in flight control software.10 Dialects of
Fortran starting with Fortran 77 allow (in fact encourage) the use of alternative
syntax for loop headers, in which an extra comma makes misinterpretation less
likely: DO 5,I = 1,25 .
In Pascal, the dot-dot problem can be handled as a special case, as shown in
the in int arm of Figure 2.11. In languages requiring larger amounts of lookahead, the scanner can take a more general approach. In any case of ambiguity, it
assumes that a longer token will be possible but remembers that a shorter token
could have been recognized at some point in the past. It also buffers all characters
read beyond the end of the shorter token. If the optimistic assumption leads the
D E S I G N & I M P L E M E N TAT I O N
Longest possible tokens
A little care in syntax design—avoiding tokens that are nontrivial prefixes of
other tokens—can dramatically simplify scanning. In straightforward cases of
prefix ambiguity the scanner can enforce the “longest possible token” rule automatically. In Fortran, however, the rules are sufficiently complex that no
purely lexical solution suffices. Some of the problems, and a possible solution,
are discussed in an article by Dyadkin [Dya95].
10 In actuality, the faulty software for Mariner 1 appears to have stemmed from a missing “bar”
punctuation mark (indicating an average) in handwritten notes from which the software was
derived [Cer89, pp. 202–203]. The Fortran DO loop error does appear to have occurred in at least
one piece of NASA software, but no serious harm resulted [Web89].
scanner into an error state, it “unreads” the buffered characters so that they will
be seen again later, and returns the shorter token.
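The peek-and-unread machinery that such cases require can be confined to a small input buffer. The sketch below is not from the text: it wraps a string with an unread operation and uses it to separate 3..5 from 3.14 in the manner just described; the token names are made up.

# A sketch (not from the text) of buffered look-ahead for the dot-dot problem.
class CharBuffer:
    def __init__(self, text):
        self.text, self.pos = text, 0
    def read(self):
        ch = self.text[self.pos] if self.pos < len(self.text) else None
        self.pos += 1
        return ch
    def unread(self, n=1):               # push the last n characters back
        self.pos -= n

def scan_number(buf):
    # Assumes the next character is a digit.  Returns ('intconst', text) or
    # ('realconst', text), peeking one character beyond a dot when necessary.
    text = ""
    ch = buf.read()
    while ch is not None and ch.isdigit():
        text += ch
        ch = buf.read()
    if ch == ".":
        nxt = buf.read()                 # look at the character beyond the dot
        if nxt is not None and nxt.isdigit():
            text += "."                  # 3.14: continue scanning a real number
            while nxt is not None and nxt.isdigit():
                text += nxt
                nxt = buf.read()
            if nxt is not None:
                buf.unread()
            return ("realconst", text)
        buf.unread(2)                    # 3..5 or 3.foo: give back the dot and what follows
        return ("intconst", text)
    if ch is not None:
        buf.unread()
    return ("intconst", text)

print(scan_number(CharBuffer("3..5")))   # ('intconst', '3'); the dots remain unread
print(scan_number(CharBuffer("3.14")))   # ('realconst', '3.14')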
2.2.3
Table-Driven Scanning

EXAMPLE 2.17 Table-driven scanning
Figure 2.11 uses control flow—a loop and nested case statements—to represent
a finite automaton. An alternative approach represents the automaton as a data
structure: a two-dimensional transition table. A driver program uses the current
state and input character to index into the table (Figure 2.12). Each entry in the
table specifies whether to move to a new state (and if so, which one), return a
token, or announce an error. A second table indicates, for each state, whether we
might be at the end of a token (and if so, which one). Separating this second table
from the first allows us to notice when we pass a state that might have been the
end of a token, so we can back up if we hit an error state.
Like a handwritten scanner, the table-driven code of Figure 2.12 looks tokens
up in a table of keywords immediately before returning. An outer loop serves to
filter out comments and “white space”—spaces, tabs, and newlines. These character sequences are not meaningful to the parser, and would in fact be very difficult to represent in a grammar (Exercise 2.15).
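A table-driven driver itself is short; all of the language-specific knowledge lives in the tables. The sketch below is not the book's Figure 2.12: it drives a toy Python transition table that recognizes only identifiers and integer literals, with made-up token names, simply to show the shape of the loop.

# A sketch (not Figure 2.12): states are 0 = start, 1 = in identifier,
# 2 = in integer.  scan_tab maps (state, character class) to a new state; a
# missing entry means "no move," and token_tab says what, if anything, has
# been recognized in the state where we stopped.
def char_class(ch):
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return ch

scan_tab = {(0, "letter"): 1, (1, "letter"): 1, (1, "digit"): 1,
            (0, "digit"): 2, (2, "digit"): 2}
token_tab = {1: "identifier", 2: "intconst"}       # states that end a token

def scan(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():                    # outer filtering of white space
            pos += 1
            continue
        state, image = 0, ""
        while pos < len(text) and (state, char_class(text[pos])) in scan_tab:
            state = scan_tab[state, char_class(text[pos])]
            image += text[pos]
            pos += 1
        if state in token_tab:
            tokens.append((token_tab[state], image))
        else:                                      # no valid token here
            raise ValueError("lexical error at position %d" % pos)
        # (single-character tokens, keywords, and backup for prefix
        # ambiguities are omitted from this sketch)
    return tokens

print(scan("foo 42 bar7"))
# [('identifier', 'foo'), ('intconst', '42'), ('identifier', 'bar7')]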
2.2.4
Lexical Errors
The code in Figure 2.12 explicitly recognizes the possibility of lexical errors. In
some cases the next character of input may be neither an acceptable continuation
of the current token nor the start of another token. In such cases the scanner must
print an error message and perform some sort of recovery so that compilation can
continue, if only to look for additional errors. Fortunately, lexical errors are relatively rare—most character sequences do correspond to token sequences—and
relatively easy to handle. The most common approach is simply to (1) throw away
the current, invalid token, (2) skip forward until a character is found that can legitimately begin a new token, (3) restart the scanning algorithm, and (4) count
on the error-recovery mechanism of the parser to cope with any cases in which
the resulting sequence of tokens is not syntactically valid. Of course the need for
error recovery is not unique to table-driven scanners; any scanner must cope with
errors. We did not show the code in Figures 2.5 and 2.11, but it would have to be
there in practice.
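In code, this throw-away-and-resynchronize strategy is a small loop around the point of failure. A sketch, not from the text, in which can_start_token is an illustrative stand-in for a real test:

# A sketch (not from the text) of the recovery steps listed above: report the
# error and discard the invalid text, skip forward to a character that could
# begin a token, and let the normal scanning loop resume from there.  Step (4),
# coping with the resulting token sequence, is the parser's job.
def can_start_token(ch):                 # illustrative stand-in
    return ch.isalnum() or ch in "+-*/()<>=.,;:[]'"

def recover(text, pos):
    print("lexical error at position", pos)        # (1) discard current token
    while pos < len(text) and not can_start_token(text[pos]):
        pos += 1                                    # (2) skip forward
    return pos                                      # (3) restart scanning here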
The code in Figure 2.12 also shows that the scanner must return both the
kind of token found and its character-string image (spelling); again this requirement applies to all types of scanners. For some tokens the character-string image
is redundant: all semicolons look the same, after all, as do all while keywords.
For other tokens, however (e.g., identifiers, character strings, and numeric constants), the image is needed for semantic analysis. It is also useful for error messages: “undeclared identifier” is not as nice as “ foo has not been declared.”
state = 0 . . number of states
token = 0 . . number of tokens
scan tab : array [char, state] of record
action : (move, recognize, error)
new state : state
token tab : array [state] of token    –– what to recognize
keyword tab : set of record
k image : string
k token : token
–– these three tables are created by a scanner generator tool
tok : token
cur char : char
remembered chars : list of char
repeat
cur state : state := start state
image : string := null
remembered state : state := 0    –– none
loop
read cur char
case scan tab[cur char, cur state].action
move:
if token tab[cur state] ≠ 0    –– this could be a final state
remembered state := cur state
remembered chars := ε
add cur char to remembered chars
cur state := scan tab[cur char, cur state].new state
recognize:
tok := token tab[cur state]
unread cur char
–– push back into input stream
exit inner loop
error:
if remembered state ≠ 0
tok := token tab[remembered state]
unread remembered chars
exit inner loop
–– else print error message and recover; probably start over
append cur char to image
–– end inner loop
until tok ∈ {white space, comment}
look image up in keyword tab and replace tok with appropriate keyword if found
return tok, image
Figure 2.12 Driver for a table-driven scanner, with code to handle the ambiguous case in
which one valid token is a prefix of another, but some intermediate string is not.
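To make the driver concrete, here is a small Python sketch in the same spirit as Figure 2.12. The character classes, table contents, token names, and keyword set below are illustrative assumptions for a toy language (identifiers, integer literals, +, and white space); they are not the output of any real scanner generator.

# A Python sketch of the table-driven driver of Figure 2.12, for a toy scanner.
def char_class(c):
    if c.isalpha(): return "letter"
    if c.isdigit(): return "digit"
    if c == "+": return "plus"
    if c in " \t\n": return "space"
    return "other"

# scan_tab[state][character class] = new state; a missing entry means "error"
scan_tab = {
    0: {"letter": 1, "digit": 2, "plus": 3, "space": 4},
    1: {"letter": 1, "digit": 1},        # inside an identifier
    2: {"digit": 2},                     # inside an integer literal
    3: {},                               # '+' is already complete
    4: {"space": 4},                     # white space
}
token_tab = {0: None, 1: "id", 2: "number", 3: "plus", 4: "white_space"}
keywords = {"read", "write"}             # recognized by a final table lookup

def scan(text, pos):
    """Return (token, image, new_pos); raises SyntaxError on a lexical error."""
    while True:                          # outer loop: filter out white space
        start, state, remembered = pos, 0, None
        while pos < len(text):
            new_state = scan_tab[state].get(char_class(text[pos]))
            if new_state is None:
                break                    # no move possible: try to recognize
            state, pos = new_state, pos + 1
            if token_tab[state] is not None:
                remembered = (state, pos)    # longest token seen so far
        if remembered is None:
            if start >= len(text):
                return ("eof", "", pos)
            raise SyntaxError("lexical error at position %d" % start)
        state, pos = remembered          # back up to the remembered final state
        tok, image = token_tab[state], text[start:pos]
        if tok == "white_space":
            continue
        if tok == "id" and image in keywords:
            tok = image                  # keyword lookup, as in Figure 2.12
        return (tok, image, pos)

# scan("read x1 + 42", 0) returns ("read", "read", 4); subsequent calls yield
# the identifier x1, the '+', and the number 42.

The remembered state plays the same role as remembered state in Figure 2.12: if the automaton later hits a dead end, the scanner backs up to the most recent final state, implementing the "longest possible token" rule and the simple error response described in Section 2.2.4.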
2.2.5 Pragmas
Some languages and language implementations allow a program to contain constructs called pragmas that provide directives or hints to the compiler. Pragmas
are sometimes called significant comments because, in most cases, they do not
affect the meaning (semantics) of the program—only the compilation process.
In many languages the name is also appropriate because, like comments, pragmas can appear anywhere in the source program. In this case they are usually
handled by the scanner: allowing them anywhere in the grammar would greatly
complicate the parser. In other languages (Ada, for example), pragmas are permitted only at certain well-defined places in the grammar. In this case they are
best handled by the parser or semantic analyzer.
Examples of directives include the following.
- Turn various kinds of run-time checks (e.g., pointer or subscript checking) on or off.
- Turn certain code improvements on or off (e.g., on in inner loops to improve performance; off otherwise to improve compilation speed).
- Turn performance profiling on or off.
Some directives “cross the line” and change program semantics. In Ada, for example, the unchecked pragma can be used to disable type checking.
Hints provide the compiler with information about the source program that may allow it to do a better job:
- Variable x is very heavily used (it may be a good idea to keep it in a register).
- Subroutine F is a pure function: its only effect on the rest of the program is the value it returns.
- Subroutine S is not (indirectly) recursive (its storage may be statically allocated).
- 32 bits of precision (instead of 64) suffice for floating-point variable x.
The compiler may ignore these in the interest of simplicity, or in the face of contradictory information.
CHECK YOUR UNDERSTANDING
10. List the tasks performed by the typical scanner.
11. What are the advantages of an automatically generated scanner, in comparison to a handwritten one? Why do many commercial compilers use a handwritten scanner anyway?
12. Explain the difference between deterministic and nondeterministic finite automata. Why do we prefer the deterministic variety for scanning?
13. Outline the constructions used to turn a set of regular expressions into a minimal DFA.
14. What is the “longest possible token” rule?
15. Why must a scanner sometimes “peek” at upcoming characters?
16. What is the difference between a keyword and an identifier?
17. Why must a scanner save the text of tokens?
18. How does a scanner identify lexical errors? How does it respond?
19. What is a pragma?
2.3 Parsing
The parser is the heart of a typical compiler. It calls the scanner to obtain the tokens of the input program, assembles the tokens together into a syntax tree, and
passes the tree (perhaps one subroutine at a time) to the later phases of the compiler, which perform semantic analysis and code generation and improvement.
In effect, the parser is “in charge” of the entire compilation process; this style of
compilation is sometimes referred to as syntax-directed translation.
As noted in the introduction to this chapter, a context-free grammar (CFG) is
a generator for a CF language. A parser is a language recognizer. It can be shown
that for any CFG we can create a parser that runs in O(n³) time, where n is the
length of the input program.11 There are two well-known parsing algorithms that
achieve this bound: Earley’s algorithm [Ear70] and the Cocke-Younger-Kasami
(CYK) algorithm [Kas65, You67]. Cubic time is much too slow for parsing sizable
programs, but fortunately not all grammars require such a general and slow parsing algorithm. There are large classes of grammars for which we can build parsers
that run in linear time. The two most important of these classes are called LL
and LR.
LL stands for "Left-to-right, Left-most derivation." LR stands for "Left-to-right, Right-most derivation." In both classes the input is read left-to-right. An
LL parser discovers a left-most derivation; an LR parser discovers a right-most
derivation. We will cover LL parsers first. They are generally considered to be
simpler and easier to understand. They can be written by hand or generated automatically from an appropriate grammar by a parser-generating tool. The class
of LR grammars is larger, and some people find the structure of the grammars
more intuitive, especially in the part of the grammar that deals with arithmetic expressions. LR parsers are almost always constructed by a parser-generating tool. Both classes of parsers are used in production compilers, though LR parsers are more common.

11 In general, an algorithm is said to run in time O(f(n)), where n is the length of the input, if its running time t(n) is proportional to f(n) in the worst case. More precisely, we say t(n) = O(f(n)) ⇐⇒ ∃ c, m [n > m −→ t(n) < c f(n)].
LL parsers are also called “top-down” or “predictive” parsers. They construct
a parse tree from the root down, predicting at each step which production will be
used to expand the current node, based on the next available token of input. LR
parsers are also called “bottom-up” parsers. They construct a parse tree from the
leaves up, recognizing when a collection of leaves or other nodes can be joined
together as the children of a single parent.
EXAMPLE 2.18  Top-down and bottom-up parsing
We can illustrate the difference between top-down and bottom-up parsing by means of a simple example. Consider the following grammar for a comma-separated list of identifiers, terminated by a semicolon.

    id list −→ id id list tail
    id list tail −→ , id id list tail
    id list tail −→ ;
These are the productions that would normally be used for an identifier list in
a top-down parser. They can also be parsed bottom-up (most top-down grammars can be). In practice they would not be used in a bottom-up parser, for reasons that will become clear in a moment, but the ability to handle them either
way makes them good for this example.
Progressive stages in the top-down and bottom-up construction of a parse
tree for the string A, B, C; appear in Figure 2.13. The top-down parser begins
by predicting that the root of the tree (id list) will be replaced by id id list tail.
It then matches the id against a token obtained from the scanner. (If the scanner produced something different, the parser would announce a syntax error.)
The parser then moves down into the first (in this case only) nonterminal child
and predicts that id list tail will be replaced by , id id list tail. To make this
prediction it needs to peek at the upcoming token (a comma), which allows it to
choose between the two possible expansions for id list tail. It then matches the
comma and the id and moves down into the next id list tail. In a similar, recursive fashion, the top-down parser works down the tree, left-to-right, predicting
and expanding nodes and tracing out a left-most derivation of the fringe of the
tree.
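As a concrete illustration of the top-down half of this example, the following Python sketch parses the id list grammar predictively, peeking at the next token to choose between the two id list tail productions. The token representation (a simple list of strings) is an assumption made for illustration.

def match(tokens, pos, expected):
    if pos >= len(tokens) or tokens[pos] != expected:
        raise SyntaxError("expected %r at token %d" % (expected, pos))
    return pos + 1

def id_list(tokens, pos=0):                        # id_list -> id id_list_tail
    pos = match(tokens, pos, "id")
    return id_list_tail(tokens, pos)

def id_list_tail(tokens, pos):
    if pos < len(tokens) and tokens[pos] == ",":   # predict ", id id_list_tail"
        pos = match(tokens, pos, ",")
        pos = match(tokens, pos, "id")
        return id_list_tail(tokens, pos)
    return match(tokens, pos, ";")                 # predict ";"

id_list(["id", ",", "id", ",", "id", ";"])         # A, B, C;  -- parses successfully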
The bottom-up parser, by contrast, begins by noting that the left-most leaf of
the tree is an id . The next leaf is a comma and the one after that is another id .
The parser continues in this fashion, shifting new leaves from the scanner into
a forest of partially completed parse tree fragments, until it realizes that some
of those fragments constitute a complete right-hand side. In this grammar, that
doesn’t occur until the parser has seen the semicolon—the right-hand side of
id list tail −→ ; . With this right-hand side in hand, the parser reduces the semicolon to an id list tail. It then reduces , id id list tail into another id list tail.
After doing this one more time it is able to reduce id id list tail into the root of
the parse tree, id list.
Figure 2.13 Top-down (left) and bottom-up parsing (right) of the input string A, B, C; .
Grammar appears at lower left.
At no point does the bottom-up parser predict what it will see next. Rather,
it shifts tokens into its forest until it recognizes a right-hand side, which it then
reduces to a left-hand side. Because of this behavior, bottom-up parsers are sometimes called shift-reduce parsers. Looking up the figure, from bottom to top, we
can see that the shift-reduce parser traces out a right-most (canonical) derivation, in reverse.
There are several important subclasses of LR parsers, including SLR, LALR,
and “full LR.” SLR and LALR are important for their ease of implementation,
full LR for its generality. LL parsers can also be grouped into SLL and “full LL”
subclasses. We will cover the differences among them only briefly here; for further information see any of the standard compiler-construction or parsing theory
textbooks [App97, ASU86, AU72, CT04, FL88].
One commonly sees LL or LR (or whatever) written with a number in parentheses after it: LL(2) or LALR(1), for example. This number indicates how many
tokens of look-ahead are required in order to parse. Most real compilers use just
one token of look-ahead, though more can sometimes be helpful. Terrence Parr’s
open-source ANTLR tool, in particular, uses multi-token look-ahead to enlarge
the class of languages amenable to top-down parsing [PQ95]. In Section 2.3.1
we will look at LL(1) grammars and handwritten parsers in more detail. In Sections 2.3.2 and 2.3.3 we will consider automatically generated LL(1) and LR(1)
(actually SLR(1)) parsers.
EXAMPLE 2.19  Bounding space with a bottom-up grammar
The problem with our example grammar, for the purposes of bottom-up
parsing, is that it forces the compiler to shift all the tokens of an id list into its
forest before it can reduce any of them. In a very large program we might run out
of space. Sometimes there is nothing that can be done to avoid a lot of shifting.
In this case, however, we can use an alternative grammar that allows the parser to
reduce prefixes of the id list into nonterminals as it goes along:
    id list −→ id list prefix ;
    id list prefix −→ id list prefix , id
    id list prefix −→ id
This grammar cannot be parsed top-down, because when we see an id on the
input and we’re expecting an id list prefix, we have no way to tell which of the two
possible productions we should predict (more on this dilemma in Section 2.3.2).
As shown in Figure 2.14, however, the grammar works well bottom-up.
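For contrast, the following Python sketch mimics a shift-reduce parse with this second grammar. Matching right-hand sides directly against the top of an explicit stack is a deliberate simplification of the table-driven machinery described in Section 2.3.3, but it shows how the left-recursive productions let the parser collapse the list as it goes, keeping the stack short.

def parse_id_list(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)                                    # shift
        while True:                                          # reduce while a handle is on top
            if stack[-3:] == ["id_list_prefix", ",", "id"]:
                del stack[-3:]
                stack.append("id_list_prefix")
            elif stack[-1:] == ["id"]:
                del stack[-1:]
                stack.append("id_list_prefix")
            elif stack[-2:] == ["id_list_prefix", ";"]:
                del stack[-2:]
                stack.append("id_list")
            else:
                break
    return stack == ["id_list"]                              # True for a valid list

parse_id_list(["id", ",", "id", ",", "id", ";"])             # the stack never holds
                                                             # more than three symbols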
2.3.1 Recursive Descent

EXAMPLE 2.20  Top-down grammar for a calculator language
To illustrate top-down (predictive) parsing, let us consider the grammar for a
simple “calculator” language, shown in Figure 2.15. The calculator allows values
to be read into (numeric) variables, which may then be used in expressions. Expressions in turn can be written to the output. Control flow is strictly linear (no
loops, if statements, or other jumps). The end-marker ( $$ ) pseudo-token is
produced by the scanner at the end of the input. This token allows the parser to
terminate cleanly once it has seen the entire program. As in regular expressions,
we use the symbol to denote the empty string. A production with on the
right-hand side is sometimes called an epsilon production.
It may be helpful to compare the expr portion of Figure 2.15 to the expression grammar of Example 2.7 (page 45). Most people find that previous, LR
grammar to be significantly more intuitive. It suffers, however, from a problem
Figure 2.14 Bottom-up parse of A, B, C; using a grammar (lower left) that allows lists to be
collapsed incrementally.
similar to that of the id list grammar of Example 2.19: if we see an id on the
input when expecting an expr, we have no way to tell which of the two possible productions to predict. The grammar of Figure 2.15 avoids this problem
by merging the common prefixes of right-hand sides into a single production,
and by using new symbols (term tail and factor tail) to generate additional operators and operands as required. The transformation has the unfortunate side
effect of placing the operands of a given operator in separate right-hand sides.
In effect, we have sacrificed grammatical elegance in order to be able to parse
predictively.
So how do we parse a string with our calculator grammar? We saw the basic
idea in Figure 2.13. We start at the top of the tree and predict needed productions
on the basis of the current left-most nonterminal in the tree and the current input token.
    program −→ stmt list $$
    stmt list −→ stmt stmt list | ε
    stmt −→ id := expr | read id | write expr
    expr −→ term term tail
    term tail −→ add op term term tail | ε
    term −→ factor factor tail
    factor tail −→ mult op factor factor tail | ε
    factor −→ ( expr ) | id | number
    add op −→ + | -
    mult op −→ * | /

Figure 2.15  LL(1) grammar for a simple calculator language.
We can formalize this process in one of two ways. The first, described
in the remainder of this subsection, is to build a recursive descent parser whose
subroutines correspond, one-to-one, to the nonterminals of the grammar. Recursive descent parsers are typically constructed by hand, though the ANTLR
parser generator constructs them automatically from an input grammar. The
second approach, described in Section 2.3.2, is to build an LL parse table, which
is then read by a driver program. Table-driven parsers are almost always constructed automatically by a parser generator. These two options—recursive descent and table-driven—are reminiscent of the nested case statements and table-driven approaches to building a scanner that we saw in Sections 2.2.2 and 2.2.3.
Handwritten recursive descent parsers are most often used when the language
to be parsed is relatively simple, or when a parser-generator tool is not available.
EXAMPLE 2.21  Recursive descent parser for the calculator language
Pseudocode for a recursive descent parser for our calculator language appears
in Figure 2.16. It has a subroutine for every nonterminal in the grammar. It also
has a mechanism input token to inspect the next token available from the scanner
and a subroutine ( match ) to consume this token and in the process verify that it
is the one that was expected (as specified by an argument). If match or any of the
other subroutines sees an unexpected token, then a syntax error has occurred.
For the time being let us assume that the parse error subroutine simply prints
a message and terminates the parse. In Section 2.3.4 we will consider how to
recover from such errors and continue to parse the remainder of the input.

EXAMPLE 2.22  Recursive descent parse of a "sum and average" program
Suppose now that we are to parse a simple program to read two numbers and
print their sum and average:
read A
read B
sum := A + B
write sum
write sum / 2
procedure match(expected)
    if input token = expected
        consume input token
    else parse error

–– this is the start routine:
procedure program
    case input token of
        id , read , write , $$ :
            stmt list
            match( $$ )
        otherwise parse error

procedure stmt list
    case input token of
        id , read , write : stmt; stmt list
        $$ : skip                                       –– epsilon production
        otherwise parse error

procedure stmt
    case input token of
        id : match( id ); match( := ); expr
        read : match( read ); match( id )
        write : match( write ); expr
        otherwise parse error

procedure expr
    case input token of
        id , number , ( : term; term tail
        otherwise parse error

procedure term tail
    case input token of
        + , - : add op; term; term tail
        ) , id , read , write , $$ : skip               –– epsilon production
        otherwise parse error

procedure term
    case input token of
        id , number , ( : factor; factor tail
        otherwise parse error

procedure factor tail
    case input token of
        * , / : mult op; factor; factor tail
        + , - , ) , id , read , write , $$ : skip       –– epsilon production
        otherwise parse error

procedure factor
    case input token of
        id : match( id )
        number : match( number )
        ( : match( ( ); expr; match( ) )
        otherwise parse error

procedure add op
    case input token of
        + : match( + )
        - : match( - )
        otherwise parse error

procedure mult op
    case input token of
        * : match( * )
        / : match( / )
        otherwise parse error

Figure 2.16  Recursive descent parser for the calculator language. Execution begins in procedure program. The recursive calls trace out a traversal of the parse tree. Not shown is code to save this tree (or some similar structure) for use by later phases of the compiler.
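The same parser can be transcribed almost mechanically into a real programming language. The Python sketch below assumes the scanner delivers a list of token kinds ending with "$$", and that a parse error simply raises an exception; these representation choices are assumptions made for illustration, not part of the figure.

# Python transcription of the recursive descent parser of Figure 2.16.
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def input_token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.input_token() != expected:
            self.parse_error()
        self.pos += 1                               # consume the token

    def parse_error(self):
        raise SyntaxError("unexpected token %r" % self.input_token())

    def program(self):                              # start routine
        if self.input_token() in ("id", "read", "write", "$$"):
            self.stmt_list(); self.match("$$")
        else: self.parse_error()

    def stmt_list(self):
        if self.input_token() in ("id", "read", "write"):
            self.stmt(); self.stmt_list()
        elif self.input_token() == "$$": pass       # epsilon production
        else: self.parse_error()

    def stmt(self):
        tok = self.input_token()
        if tok == "id":    self.match("id"); self.match(":="); self.expr()
        elif tok == "read":  self.match("read"); self.match("id")
        elif tok == "write": self.match("write"); self.expr()
        else: self.parse_error()

    def expr(self):
        if self.input_token() in ("id", "number", "("):
            self.term(); self.term_tail()
        else: self.parse_error()

    def term_tail(self):
        tok = self.input_token()
        if tok in ("+", "-"): self.add_op(); self.term(); self.term_tail()
        elif tok in (")", "id", "read", "write", "$$"): pass    # epsilon
        else: self.parse_error()

    def term(self):
        if self.input_token() in ("id", "number", "("):
            self.factor(); self.factor_tail()
        else: self.parse_error()

    def factor_tail(self):
        tok = self.input_token()
        if tok in ("*", "/"): self.mult_op(); self.factor(); self.factor_tail()
        elif tok in ("+", "-", ")", "id", "read", "write", "$$"): pass  # epsilon
        else: self.parse_error()

    def factor(self):
        tok = self.input_token()
        if tok == "id": self.match("id")
        elif tok == "number": self.match("number")
        elif tok == "(": self.match("("); self.expr(); self.match(")")
        else: self.parse_error()

    def add_op(self):
        if self.input_token() in ("+", "-"): self.match(self.input_token())
        else: self.parse_error()

    def mult_op(self):
        if self.input_token() in ("*", "/"): self.match(self.input_token())
        else: self.parse_error()

# Token kinds for "read A  read B  sum := A + B  write sum  write sum / 2":
tokens = ["read", "id", "read", "id", "id", ":=", "id", "+", "id",
          "write", "id", "write", "id", "/", "number", "$$"]
Parser(tokens).program()        # returns silently: the program is syntactically valid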
The parse tree for this program appears in Figure 2.17. The parser begins by
calling the subroutine program . After noting that the initial token is a read ,
program calls stmt list and then attempts to match the end-of-file pseudo-token.
(In the parse tree, the root, program, has two children, stmt list and $$ .) Procedure stmt list again notes that the upcoming token is a read . This observation allows it to determine that the current node (stmt list) generates stmt
stmt list (rather than ε). It therefore calls stmt and stmt list before returning.
Continuing in this fashion, the execution path of the parser traces out a left-to-right depth-first traversal of the parse tree. This correspondence between the
dynamic execution trace and the structure of the parse tree is the distinguishing
characteristic of recursive descent parsing. Note that because the stmt list nonterminal appears in the right-hand side of a stmt list production, the stmt list
subroutine must call itself. This recursion accounts for the name of the parsing
technique.
Without additional code (not shown in Figure 2.16), the parser merely verifies that the program is syntactically correct (i.e., that none of the otherwise
parse error clauses in the case statements are executed and that match always
sees what it expects to see). To be of use to the rest of the compiler—which must
produce an equivalent target program in some other language—the parser must
Figure 2.17  Parse tree for the sum-and-average program of Example 2.22, using the grammar of Figure 2.15.
save the parse tree or some other representation of program fragments as an explicit data structure. To save the parse tree itself, we can allocate and link together
records to represent the children of a node immediately before executing the recursive subroutines and match invocations that represent those children. We shall
need to pass each recursive routine an argument that points to the record that is
to be expanded (i.e., whose children are to be discovered). Procedure match will
also need to save information about certain tokens (e.g., character-string representations of identifiers and literals) in the leaves of the tree.
As we saw in Chapter 1, the parse tree contains a great deal of irrelevant detail
that need not be saved for the rest of the compiler. It is therefore rare for a parser
to construct a full parse tree explicitly. More often it produces an abstract syntax
tree or some other more terse representation. In a recursive descent compiler, a
syntax tree can be created by allocating and linking together records in only a
subset of the recursive calls.
Perhaps the trickiest part of writing a recursive descent parser is figuring out
which tokens should label the arms of the case statements. Each arm represents
one production: one possible expansion of the symbol for which the subroutine
was named. The tokens that label a given arm are those that predict the production. A token X may predict a production for either of two reasons: (1) the
right-hand side of the production, when recursively expanded, may yield a string
beginning with X , or (2) the right-hand side may yield nothing (i.e., it is ε, or a
string of nonterminals that may recursively yield ε), and X may begin the yield
of what comes next. In the following subsection we will formalize this notion of
prediction using sets called FIRST and FOLLOW, and show how to derive them
automatically from an LL(1) CFG.
CHECK YOUR UNDERSTANDING
20. What is the inherent “big-O” complexity of parsing? What is the complexity
of parsers used in real compilers?
21. Summarize the difference between LL and LR parsing. Which one of them is
also called “bottom-up”? “Top-down”? Which one is also called “predictive”?
“Shift-reduce”? What do “LL” and “LR” stand for?
22. What kind of parser (top-down or bottom-up) is most common in production compilers?
23. What is the significance of the “1” in LR(1)?
24. Why might we want (or need) different grammars for different parsing algorithms?
25. What is an epsilon production?
26. What are recursive descent parsers? Why are they used mostly for small languages?
27. How might a parser construct an explicit parse tree or syntax tree?
2.3.2 Table-Driven Top-Down Parsing

EXAMPLE 2.23  Driver and table for top-down parsing
In a recursive descent parser, each arm of a case statement corresponds to a
production, and contains parsing routine and match calls corresponding to the
symbols on the right-hand side of that production. At any given point in the
parse, if we consider the calls beyond the program counter (the ones that have
yet to occur) in the parsing routine invocations currently in the call stack, we
obtain a list of the symbols that the parser expects to see between here and the
end of the program. A table-driven top-down parser maintains an explicit stack
containing this same list of symbols.
Pseudocode for such a parser appears in Figure 2.18. The code is language
independent. It requires a language dependent parsing table, generally produced
terminal = 1 . . number of terminals
non terminal = number of terminals + 1 . . number of symbols
symbol = 1 . . number of symbols
production = 1 . . number of productions
parse tab : array [non terminal, terminal] of
    action : (predict, error)
    prod : production
prod tab : array [production] of list of symbol
–– these two tables are created by a parser generator tool

parse stack : stack of symbol
parse stack.push(start symbol)
loop
    expected sym : symbol := parse stack.pop
    if expected sym ∈ terminal
        match(expected sym)                       –– as in Figure 2.16
        if expected sym = $$ return               –– success!
    else
        if parse tab[expected sym, input token].action = error
            parse error
        else
            prediction : production := parse tab[expected sym, input token].prod
            foreach sym : symbol in reverse prod tab[prediction]
                parse stack.push(sym)

Figure 2.18  Driver for a table-driven LL(1) parser.
by an automatic tool. For the calculator grammar of Figure 2.15, the table appears
as shown in Figure 2.19.
EXAMPLE 2.24  Table-driven parse of the "sum and average" program
To illustrate the algorithm, Figure 2.20 shows a trace of the stack and the input over time for the sum-and-average program of Example 2.22. The parser
iterates around a loop in which it pops the top symbol off the stack and performs
the following actions. If the popped symbol is a terminal, the parser attempts
to match it against an incoming token from the scanner. If the match fails, the
parser announces a syntax error and initiates some sort of error recovery (see Section 2.3.4). If the popped symbol is a nonterminal, the parser uses that nonterminal together with the next available input token to index into a two-dimensional
table that tells it which production to predict (or whether to announce a syntax
error and initiate recovery).
Initially, the parse stack contains the start symbol of the grammar (in our case,
program). When it predicts a production, the parser pushes the right-hand-side
symbols onto the parse stack in reverse order, so the first of those symbols ends up
at top-of-stack. The parse completes successfully when we match the end token,
$$ . Assuming that $$ appears only once in the grammar, at the end of the first
production, and that the scanner returns this token only at end-of-file, any syntax
Top-of-stack                        Current input token
nonterminal      id  number  read  write  :=   (    )    +    -    *    /   $$
program           1    –       1     1     –    –    –    –    –    –    –    1
stmt list         2    –       2     2     –    –    –    –    –    –    –    3
stmt              4    –       5     6     –    –    –    –    –    –    –    –
expr              7    7       –     –     –    7    –    –    –    –    –    –
term tail         9    –       9     9     –    –    9    8    8    –    –    9
term             10   10       –     –     –   10    –    –    –    –    –    –
factor tail      12    –      12    12     –    –   12   12   12   11   11   12
factor           14   15       –     –     –   13    –    –    –    –    –    –
add op            –    –       –     –     –    –    –   16   17    –    –    –
mult op           –    –       –     –     –    –    –    –    –   18   19    –

Figure 2.19  LL(1) parse table for the calculator language. Table entries indicate the production to predict (as numbered in Figure 2.22). A dash indicates an error. When the top-of-stack symbol is a terminal, the appropriate action is always to match it against an incoming token from the scanner. An auxiliary table, not shown here, gives the right-hand side symbols for each production.
error is guaranteed to manifest itself either as a failed match or as an error entry
in the table.
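Here is a Python sketch of the same driver loop. Instead of the numeric tables of Figures 2.18 and 2.19, it uses a nested dictionary that maps a nonterminal and an input token directly to the predicted right-hand side; this encoding is an illustrative assumption, but its entries carry the same information as the parse table above, and missing entries play the role of the dashes.

# Table-driven LL(1) driver for the calculator grammar of Figure 2.15.
grammar = {                        # (nonterminal, token) -> predicted right-hand side
    "program":     {t: ["stmt_list", "$$"] for t in ("id", "read", "write", "$$")},
    "stmt_list":   {**{t: ["stmt", "stmt_list"] for t in ("id", "read", "write")},
                    "$$": []},
    "stmt":        {"id": ["id", ":=", "expr"], "read": ["read", "id"],
                    "write": ["write", "expr"]},
    "expr":        {t: ["term", "term_tail"] for t in ("id", "number", "(")},
    "term_tail":   {**{t: ["add_op", "term", "term_tail"] for t in ("+", "-")},
                    **{t: [] for t in (")", "id", "read", "write", "$$")}},
    "term":        {t: ["factor", "factor_tail"] for t in ("id", "number", "(")},
    "factor_tail": {**{t: ["mult_op", "factor", "factor_tail"] for t in ("*", "/")},
                    **{t: [] for t in ("+", "-", ")", "id", "read", "write", "$$")}},
    "factor":      {"id": ["id"], "number": ["number"], "(": ["(", "expr", ")"]},
    "add_op":      {"+": ["+"], "-": ["-"]},
    "mult_op":     {"*": ["*"], "/": ["/"]},
}

def parse(tokens):
    stack, pos = ["program"], 0
    while stack:
        expected = stack.pop()
        if expected not in grammar:                      # terminal: match it
            if tokens[pos] != expected:
                raise SyntaxError("expected %r, saw %r" % (expected, tokens[pos]))
            pos += 1
            if expected == "$$":
                return                                   # success
        else:                                            # nonterminal: predict
            try:
                rhs = grammar[expected][tokens[pos]]
            except KeyError:
                raise SyntaxError("no prediction for %s on %r" % (expected, tokens[pos]))
            stack.extend(reversed(rhs))                  # push right-hand side in reverse

parse(["read", "id", "id", ":=", "id", "+", "number", "write", "id", "$$"])

Pushing the right-hand side in reverse keeps its first symbol at the top of the stack, exactly as in Figure 2.18.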
Predict Sets
As we hinted at the end of Section 2.3.1, predict sets are defined in terms of simpler sets called FIRST and FOLLOW, where FIRST(A) is the set of all tokens that could be the start of an A, plus ε if A ⇒* ε, and FOLLOW(A) is the set of all tokens that could come after an A in some valid program, plus ε if A can be the final token in the program. If we extend the domain of FIRST in the obvious way to include strings of symbols, we then say that the predict set of a production A −→ β is FIRST(β) (except for ε), plus FOLLOW(A) if β ⇒* ε.12
EXAMPLE 2.25  Predict sets for the calculator language
We can illustrate the algorithm to construct these sets using our calculator
grammar (Figure 2.15). We begin with “obvious” facts about the grammar and
build on them inductively. If we recast the grammar in plain BNF (no EBNF ‘ ’
constructs), then it has 19 productions. The “obvious” facts arise from adjacent
pairs of symbols in right-hand sides. In the first production, we can see that $$ ∈ FOLLOW(stmt list).
12 Following conventional notation, we use uppercase Roman letters near the beginning of the
alphabet to represent nonterminals, uppercase Roman letters near the end of the alphabet to
represent arbitrary grammar symbols (terminals or nonterminals), lowercase Roman letters near
the beginning of the alphabet to represent terminals (tokens), lowercase Roman letters near the
end of the alphabet to represent token strings, and lowercase Greek letters to represent strings of
arbitrary symbols.
Parse stack | Input stream | Comment
program | read A read B . . . | initial stack contents
stmt list $$ | read A read B . . . | predict program −→ stmt list $$
stmt stmt list $$ | read A read B . . . | predict stmt list −→ stmt stmt list
read id stmt list $$ | read A read B . . . | predict stmt −→ read id
id stmt list $$ | A read B . . . | match read
stmt list $$ | read B sum := . . . | match id
stmt stmt list $$ | read B sum := . . . | predict stmt list −→ stmt stmt list
read id stmt list $$ | read B sum := . . . | predict stmt −→ read id
id stmt list $$ | B sum := . . . | match read
stmt list $$ | sum := A + B . . . | match id
stmt stmt list $$ | sum := A + B . . . | predict stmt list −→ stmt stmt list
id := expr stmt list $$ | sum := A + B . . . | predict stmt −→ id := expr
:= expr stmt list $$ | := A + B . . . | match id
expr stmt list $$ | A + B . . . | match :=
term term tail stmt list $$ | A + B . . . | predict expr −→ term term tail
factor factor tail term tail stmt list $$ | A + B . . . | predict term −→ factor factor tail
id factor tail term tail stmt list $$ | A + B . . . | predict factor −→ id
factor tail term tail stmt list $$ | + B write sum . . . | match id
term tail stmt list $$ | + B write sum . . . | predict factor tail −→ ε
add op term term tail stmt list $$ | + B write sum . . . | predict term tail −→ add op term term tail
+ term term tail stmt list $$ | + B write sum . . . | predict add op −→ +
term term tail stmt list $$ | B write sum . . . | match +
factor factor tail term tail stmt list $$ | B write sum . . . | predict term −→ factor factor tail
id factor tail term tail stmt list $$ | B write sum . . . | predict factor −→ id
factor tail term tail stmt list $$ | write sum . . . | match id
term tail stmt list $$ | write sum write . . . | predict factor tail −→ ε
stmt list $$ | write sum write . . . | predict term tail −→ ε
stmt stmt list $$ | write sum write . . . | predict stmt list −→ stmt stmt list
write expr stmt list $$ | write sum write . . . | predict stmt −→ write expr
expr stmt list $$ | sum write sum / 2 | match write
term term tail stmt list $$ | sum write sum / 2 | predict expr −→ term term tail
factor factor tail term tail stmt list $$ | sum write sum / 2 | predict term −→ factor factor tail
id factor tail term tail stmt list $$ | sum write sum / 2 | predict factor −→ id
factor tail term tail stmt list $$ | write sum / 2 | match id
term tail stmt list $$ | write sum / 2 | predict factor tail −→ ε
stmt list $$ | write sum / 2 | predict term tail −→ ε
stmt stmt list $$ | write sum / 2 | predict stmt list −→ stmt stmt list
write expr stmt list $$ | write sum / 2 | predict stmt −→ write expr
expr stmt list $$ | sum / 2 | match write
term term tail stmt list $$ | sum / 2 | predict expr −→ term term tail
factor factor tail term tail stmt list $$ | sum / 2 | predict term −→ factor factor tail
id factor tail term tail stmt list $$ | sum / 2 | predict factor −→ id
factor tail term tail stmt list $$ | / 2 | match id
mult op factor factor tail term tail stmt list $$ | / 2 | predict factor tail −→ mult op factor factor tail
/ factor factor tail term tail stmt list $$ | / 2 | predict mult op −→ /
factor factor tail term tail stmt list $$ | 2 | match /
number factor tail term tail stmt list $$ | 2 | predict factor −→ number
factor tail term tail stmt list $$ |  | match number
term tail stmt list $$ |  | predict factor tail −→ ε
stmt list $$ |  | predict term tail −→ ε
$$ |  | predict stmt list −→ ε

Figure 2.20  Trace of a table-driven LL(1) parse of the sum-and-average program of Example 2.22.
program −→ stmt list $$                        $$ ∈ FOLLOW(stmt list), ε ∈ FOLLOW($$), and ε ∈ FOLLOW(program)
stmt list −→ stmt stmt list
stmt list −→ ε                                 ε ∈ FIRST(stmt list)
stmt −→ id := expr                             id ∈ FIRST(stmt) and := ∈ FOLLOW(id)
stmt −→ read id                                read ∈ FIRST(stmt) and id ∈ FOLLOW(read)
stmt −→ write expr                             write ∈ FIRST(stmt)
expr −→ term term tail
term tail −→ add op term term tail
term tail −→ ε                                 ε ∈ FIRST(term tail)
term −→ factor factor tail
factor tail −→ mult op factor factor tail
factor tail −→ ε                               ε ∈ FIRST(factor tail)
factor −→ ( expr )                             ( ∈ FIRST(factor) and ) ∈ FOLLOW(expr)
factor −→ id                                   id ∈ FIRST(factor)
factor −→ number                               number ∈ FIRST(factor)
add op −→ +                                    + ∈ FIRST(add op)
add op −→ -                                    - ∈ FIRST(add op)
mult op −→ *                                   * ∈ FIRST(mult op)
mult op −→ /                                   / ∈ FIRST(mult op)

Figure 2.21  "Obvious" facts about the LL(1) calculator grammar.
In the fourth (stmt −→ id := expr), id ∈ FIRST(stmt),
and := ∈ FOLLOW( id ). In the fifth and sixth productions (stmt −→ read id and stmt −→ write expr), { read , write } ⊂ FIRST(stmt), and id ∈ FOLLOW( read ). The
complete set of “obvious” facts appears in Figure 2.21.
From the “obvious” facts we can deduce a larger set of facts during a second
pass over the grammar. For example, in the second production (stmt list −→
stmt stmt list) we can deduce that { id , read , write } ⊂ FIRST(stmt list), because we already know that { id , read , write } ⊂ FIRST(stmt), and a stmt list
can begin with a stmt. Similarly, in the first production, we can deduce that $$ ∈
FIRST(program), because we already know that ε ∈ FIRST(stmt list).
In the eleventh production (factor tail −→ mult op factor factor tail), we
can deduce that { ( , id , number } ⊂ FOLLOW(mult op), because we already know
that { ( , id , number } ⊂ FIRST(factor), and factor follows mult op in the righthand side. In the seventh production (expr −→ term term tail), we can deduce
that ) ∈ FOLLOW(term tail), because we already know that ) ∈ FOLLOW(expr),
and a term tail can be the last part of an expr. In this same production, we can
also deduce that ) ∈ FOLLOW(term), because the term tail can generate ε (ε ∈ FIRST(term tail)), allowing a term to be the last part of an expr.
There is more that we can learn from our second pass through the grammar,
but these examples cover all the different kinds of cases. To complete our calculation, we continue with additional passes over the grammar until we don’t learn
any more (i.e., we don’t add anything to any of the FIRST and FOLLOW sets). We
FIRST
    program      { id , read , write , $$ }
    stmt list    { id , read , write , ε }
    stmt         { id , read , write }
    expr         { ( , id , number }
    term tail    { + , - , ε }
    term         { ( , id , number }
    factor tail  { * , / , ε }
    factor       { ( , id , number }
    add op       { + , - }
    mult op      { * , / }
    Also note that FIRST(a) = { a } ∀ tokens a.

FOLLOW
    program      { ε }
    stmt list    { $$ }
    stmt         { id , read , write , $$ }
    expr         { ) , id , read , write , $$ }
    term tail    { ) , id , read , write , $$ }
    term         { + , - , ) , id , read , write , $$ }
    factor tail  { + , - , ) , id , read , write , $$ }
    factor       { + , - , * , / , ) , id , read , write , $$ }
    add op       { ( , id , number }
    mult op      { ( , id , number }
    id           { + , - , * , / , ) , := , id , read , write , $$ }
    number       { + , - , * , / , ) , id , read , write , $$ }
    read         { id }
    write        { ( , id , number }
    (            { ( , id , number }
    )            { + , - , * , / , ) , id , read , write , $$ }
    :=           { ( , id , number }
    +            { ( , id , number }
    -            { ( , id , number }
    *            { ( , id , number }
    /            { ( , id , number }
    $$           { ε }

PREDICT
    1. program −→ stmt list $$                      { id , read , write , $$ }
    2. stmt list −→ stmt stmt list                  { id , read , write }
    3. stmt list −→ ε                               { $$ }
    4. stmt −→ id := expr                           { id }
    5. stmt −→ read id                              { read }
    6. stmt −→ write expr                           { write }
    7. expr −→ term term tail                       { ( , id , number }
    8. term tail −→ add op term term tail           { + , - }
    9. term tail −→ ε                               { ) , id , read , write , $$ }
    10. term −→ factor factor tail                  { ( , id , number }
    11. factor tail −→ mult op factor factor tail   { * , / }
    12. factor tail −→ ε                            { + , - , ) , id , read , write , $$ }
    13. factor −→ ( expr )                          { ( }
    14. factor −→ id                                { id }
    15. factor −→ number                            { number }
    16. add op −→ +                                 { + }
    17. add op −→ -                                 { - }
    18. mult op −→ *                                { * }
    19. mult op −→ /                                { / }

Figure 2.22  FIRST, FOLLOW, and PREDICT sets for the calculator language.
then construct the PREDICT sets. Final versions of all three sets appear in Figure 2.22. The parse table of Figure 2.19 follows directly from PREDICT.
The algorithm to compute FIRST, FOLLOW, and PREDICT sets appears, a bit
more formally, in Figure 2.23. It relies on the following definitions.
    FIRST(α) ≡ {a : α ⇒* a β} ∪ (if α ⇒* ε then {ε} else ∅)
    FOLLOW(A) ≡ {a : S ⇒+ α A a β} ∪ (if S ⇒* α A then {ε} else ∅)
    PREDICT(A −→ α) ≡ (FIRST(α) − {ε}) ∪ (if α ⇒* ε then FOLLOW(A) else ∅)

Note that FIRST sets for strings of length greater than one are calculated on demand; they are not stored explicitly.
First sets for all symbols:
    for all terminals a, FIRST(a) := { a }
    for all nonterminals X, FIRST(X) := ∅
    for all productions X −→ ε, add ε to FIRST(X)
    repeat
        ⟨outer⟩ for all productions X −→ Y1 Y2 . . . Yk,
            ⟨inner⟩ for i in 1 . . k
                add (FIRST(Yi) − {ε}) to FIRST(X)
                if ε ∉ FIRST(Yi) (yet)
                    continue outer loop
            add ε to FIRST(X)
    until no further progress

First set subroutine for string X1 X2 . . . Xn, similar to inner loop above:
    return value := ∅
    for i in 1 . . n
        add (FIRST(Xi) − {ε}) to return value
        if ε ∉ FIRST(Xi)
            return
    add ε to return value

Follow sets for all symbols:
    FOLLOW(S) := { ε }, where S is the start symbol
    for all other symbols X, FOLLOW(X) := ∅
    repeat
        for all productions A −→ α B β,
            add (FIRST(β) − {ε}) to FOLLOW(B)
        for all productions A −→ α B
                or A −→ α B β, where ε ∈ FIRST(β),
            add FOLLOW(A) to FOLLOW(B)
    until no further progress

Predict sets for all productions:
    for all productions A −→ α
        PREDICT(A −→ α) := (FIRST(α) − {ε}) ∪ (if ε ∈ FIRST(α) then FOLLOW(A) else ∅)

Figure 2.23  Algorithm to calculate FIRST, FOLLOW, and PREDICT sets. The grammar is LL(1) if and only if the PREDICT sets are disjoint.
The algorithm is guaranteed to terminate (i.e., converge on a solution), because the sizes of the sets are bounded by the number of terminals in the grammar.
If in the process of calculating PREDICT sets we find that some token belongs
to the PREDICT set of more than one production with the same left-hand side,
then the grammar is not LL(1), because we will not be able to choose which
of the productions to employ when the left-hand side is at the top of the parse
stack (or we are in the left-hand side’s subroutine in a recursive descent parser)
and we see the token coming up in the input. This sort of ambiguity is known
as a predict-predict conflict; it can arise either because the same token can begin
more than one right-hand side, or because it can begin one right-hand side and
can also appear after the left-hand side in some valid program, and one possible
right-hand side can generate ε.
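The fixed-point construction of Figure 2.23 and the predict-predict conflict check just described fit in a few dozen lines of Python. The grammar encoding (a list of (left-hand side, right-hand side) pairs, with an empty tuple for ε) is an assumption made for illustration.

EPS = "ε"

def ll1_sets(grammar, terminals, start):
    """grammar: list of (lhs, rhs) pairs; rhs is a tuple of symbols, () for ε."""
    nonterminals = {lhs for lhs, _ in grammar}
    FIRST = {X: ({X} if X in terminals else set())
             for X in terminals | nonterminals}
    FOLLOW = {A: set() for A in nonterminals}
    FOLLOW[start] = {EPS}                          # ε marks "can end the program"

    def first_of(seq):                             # FIRST of a string of symbols
        out = set()
        for X in seq:
            out |= FIRST[X] - {EPS}
            if EPS not in FIRST[X]:
                return out
        return out | {EPS}

    while True:                                    # repeat until no further progress
        before = (sum(len(s) for s in FIRST.values()),
                  sum(len(s) for s in FOLLOW.values()))
        for A, rhs in grammar:
            FIRST[A] |= first_of(rhs)
            for i, B in enumerate(rhs):            # FOLLOW rules, as in Figure 2.23
                if B in nonterminals:
                    rest = first_of(rhs[i + 1:])
                    FOLLOW[B] |= rest - {EPS}
                    if EPS in rest:
                        FOLLOW[B] |= FOLLOW[A]
        after = (sum(len(s) for s in FIRST.values()),
                 sum(len(s) for s in FOLLOW.values()))
        if after == before:
            break

    PREDICT = {}
    for A, rhs in grammar:
        f = first_of(rhs)
        PREDICT[(A, rhs)] = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
    return FIRST, FOLLOW, PREDICT

def is_ll1(grammar, PREDICT):
    """A grammar is LL(1) iff no token predicts two productions with the same LHS."""
    for A in {lhs for lhs, _ in grammar}:
        seen = set()
        for (lhs, rhs), p in PREDICT.items():
            if lhs == A:
                if seen & p:
                    return False                   # predict-predict conflict
                seen |= p
    return True

# The id_list grammar of Example 2.19 is not LL(1):
g = [("id_list",        ("id_list_prefix", ";")),
     ("id_list_prefix", ("id_list_prefix", ",", "id")),
     ("id_list_prefix", ("id",))]
_, _, P = ll1_sets(g, {"id", ",", ";"}, "id_list")
print(is_ll1(g, P))         # False: 'id' predicts both id_list_prefix productions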
Writing an LL(1) Grammar
When working with a top-down parser generator, one has to acquire a certain
facility in writing and modifying LL(1) grammars. The two most common obstacles to “LL(1)-ness” are left recursion and common prefixes.
EXAMPLE 2.26  Left recursion
Left recursion occurs when the first symbol on the right-hand side of a production is the same as the symbol on the left-hand side. Here again is the grammar from Example 2.19, which cannot be parsed top-down:

    id list −→ id list prefix ;
    id list prefix −→ id list prefix , id
    id list prefix −→ id
The problem is in the second and third productions; with id list prefix at top-of-stack and an id on the input, a predictive parser cannot tell which of the productions it should use. (Recall that left recursion is desirable in bottom-up grammars, because it allows recursive constructs to be discovered incrementally, as in Figure 2.14.)

EXAMPLE 2.27  Common prefixes
Common prefixes occur when two different productions with the same left-hand side begin with the same symbol or symbols. Here is an example that commonly appears in Algol-family languages:

    stmt −→ id := expr
    stmt −→ id ( argument list )          –– procedure call
Clearly id is in the FIRST set of both right-hand sides, and therefore in the
PREDICT set of both productions.
Both left recursion and common prefixes can be removed from a grammar mechanically. The general case is a little tricky (Exercise 2.17), because the prediction problem may be an indirect one (e.g., S −→ A α and A −→ S β, or S −→ A α, S −→ B β, A ⇒* a γ, and B ⇒* a δ). We can see the general idea in the examples above, however.

EXAMPLE 2.28  Eliminating left recursion
Our left-recursive definition of id list can be replaced by the right-recursive variant we saw in Example 2.18:
id list −→ id id list tail
id list tail −→ , id id list tail
id list tail −→ ;
EXAMPLE 2.29  Left factoring
Our common-prefix definition of stmt can be made LL(1) by a technique called left factoring:

    stmt −→ id stmt list tail
    stmt list tail −→ := expr
    stmt list tail −→ ( argument list )
Of course, simply eliminating left recursion and common prefixes is not guaranteed to make a grammar LL(1). There are infinitely many non-LL languages—
languages for which no LL grammar exists—and the mechanical transformations
to eliminate left recursion and common prefixes work on their grammars just
fine. Fortunately, the few non-LL languages that arise in practice can generally be
handled by augmenting the parsing algorithm with one or two simple heuristics.
EXAMPLE 2.30  Parsing a "dangling else"
The best known example of a "not quite LL" construct arises in languages
like Pascal, in which the else part of an if statement is optional. The natural
grammar fragment
    stmt −→ if condition then clause else clause
    stmt −→ other stmt
    then clause −→ then stmt
    else clause −→ else stmt
    else clause −→ ε
is ambiguous (and thus neither LL nor LR); it allows the else in if C1 then if
C2 then S1 else S2 to be paired with either then . The less natural grammar
fragment
    stmt −→ balanced stmt
    stmt −→ unbalanced stmt
    balanced stmt −→ if condition then balanced stmt else balanced stmt
    balanced stmt −→ other stmt
    unbalanced stmt −→ if condition then stmt
    unbalanced stmt −→ if condition then balanced stmt else unbalanced stmt
can be parsed bottom-up but not top-down (there is no pure top-down grammar
for Pascal else statements). A balanced stmt is one with the same number of
then s and else s. An unbalanced stmt has more then s.
The usual approach, whether parsing top-down or bottom-up, is to use the
ambiguous grammar together with a “disambiguating rule,” which says that in
the case of a conflict between two possible productions, the one to use is the one
that occurs first, textually, in the grammar. In the ambiguous fragment above,
the fact that else clause −→ else stmt comes before else clause −→ ends up
pairing the else with the nearest then , as desired.
EXAMPLE 2.31  "Dangling else" program bug
Better yet, a language designer can avoid this sort of problem by choosing
different syntax. The ambiguity of the dangling else problem in Pascal leads to
problems not only in parsing but in writing and maintaining correct programs.
Most Pascal programmers have at one time or another written a program like this
one:
    if P <> nil then
        if P^.val = goal then
            foundIt := true
    else
        endOfList := true
Indentation notwithstanding, the Pascal manual states that an else clause
matches the closest unmatched then —in this case the inner one—which is
clearly not what the programmer intended. To get the desired effect, the Pascal
programmer must write
    if P <> nil then begin
        if P^.val = goal then
            foundIt := true
    end
    else
        endOfList := true
EXAMPLE 2.32  End markers for structured statements
Many other Algol-family languages (including Modula, Modula-2, and Oberon,
all more recent inventions of Pascal’s designer, Niklaus Wirth) require explicit end
markers on all structured statements. The grammar fragment for if statements
in Modula-2 looks something like this:
    stmt −→ IF condition then clause else clause END
    stmt −→ other stmt
    then clause −→ THEN stmt list
    else clause −→ ELSE stmt list
    else clause −→ ε

The addition of the END eliminates the ambiguity.
Modula-2 uses END to terminate all its structured statements. Ada and Fortran 77 end an if with end if (and a while with end while , etc.). Algol 68 creates its terminators by spelling the initial keyword backward ( if . . . fi ,
case . . . esac , do . . . od , etc.).
EXAMPLE 2.33  The need for elsif
One problem with end markers is that they tend to bunch up. In Pascal one
can write
    if A = B then ...
    else if A = C then ...
    else if A = D then ...
    else if A = E then ...
    else ...
With end markers this becomes
if A = B then ...
else if A = C then ...
else if A = D then ...
else if A = E then ...
else ...
end end end end
DESIGN & IMPLEMENTATION
The dangling else
A simple change in language syntax—eliminating the dangling else —not
only reduces the chance of programming errors but also significantly simplifies parsing. For more on the dangling else problem, see Exercise 2.23 and
Section 6.4.
To avoid this awkwardness, languages with end markers generally provide an
elsif keyword (sometimes spelled elif ):
if A = B then ...
elsif A = C then ...
elsif A = D then ...
elsif A = E then ...
else ...
end
With elsif clauses added, the Modula-2 grammar fragment for if statements
looks like this:
    stmt −→ IF condition then clause elsif clauses else clause END
    stmt −→ other stmt
    then clause −→ THEN stmt list
    elsif clauses −→ ELSIF condition then clause elsif clauses
    elsif clauses −→ ε
    else clause −→ ELSE stmt list
    else clause −→ ε
CHECK YOUR UNDERSTANDING

28. Discuss the similarities and differences between recursive descent and table-driven top-down parsing.
29. What are FIRST and FOLLOW sets? What are they used for?
30. Under what circumstances does a top-down parser predict the production
A −→ α ?
31. What sorts of "obvious" facts form the basis of FIRST set and FOLLOW set construction?
32. Outline the algorithm used to complete the construction of FIRST and FOLLOW sets. How do we know when we are done?
33. How do we know when a grammar is not LL(1)?
34. Describe two common idioms in context-free grammars that cannot be
parsed top-down.
35. What is the “dangling else ” problem? How is it avoided in modern languages?
2.3.3 Bottom-Up Parsing
Conceptually, as we saw at the beginning of Section 2.3, a bottom-up parser
works by maintaining a forest of partially completed subtrees of the parse tree,
which it joins together whenever it recognizes the symbols on the right-hand side
of some production used in the right-most derivation of the input string. It creates a new internal node and makes the roots of the joined-together trees the
children of that node.
In practice, a bottom-up parser is almost always table-driven. It keeps the roots
of its partially completed subtrees on a stack. When it accepts a new token from
the scanner, it shifts the token into the stack. When it recognizes that the top
few symbols on the stack constitute a right-hand side, it reduces those symbols
to their left-hand side by popping them off the stack and pushing the left-hand
side in their place. The role of the stack is the first important difference between
top-down and bottom-up parsing: a top-down parser’s stack contains a list of
what the parser expects to see in the future; a bottom-up parser’s stack contains
a record of what the parser has already seen in the past.
Canonical Derivations

EXAMPLE 2.34  Derivation of an id list
We also noted earlier that the actions of a bottom-up parser trace out a right-most (canonical) derivation in reverse. The roots of the partial subtrees, left-to-right, together with the remaining input, constitute a sentential form of the right-most derivation. On the right-hand side of Figure 2.13, for example, we have the following series of steps.
stack contents (roots of partial trees)          remaining input
                                                 A, B, C;
id (A)                                           , B, C;
id (A) ,                                         B, C;
id (A) , id (B)                                  , C;
id (A) , id (B) ,                                C;
id (A) , id (B) , id (C)                         ;
id (A) , id (B) , id (C) ;
id (A) , id (B) , id (C) id list tail
id (A) , id (B) id list tail
id (A) id list tail
id list
The last four lines (the ones that don’t just shift tokens into the forest) correspond
to the right-most derivation:
    id list ⇒ id id list tail
            ⇒ id , id id list tail
            ⇒ id , id , id id list tail
            ⇒ id , id , id ;
The symbols that need to be joined together at each step of the parse to represent
the next step of the backward derivation are called the handle of the sentential
form. In the preceding parse trace, the handles are underlined.
EXAMPLE 2.35  Bottom-up grammar for the calculator language
In our id list example, no handles were found until the entire input had been
shifted onto the stack. In general this will not be the case. We can obtain a more
realistic example by examining an LR version of our calculator language, shown
1. program −→ stmt list $$
2. stmt list −→ stmt list stmt
3. stmt list −→ stmt
4. stmt −→ id := expr
5. stmt −→ read id
6. stmt −→ write expr
7. expr −→ term
8. expr −→ expr add op term
9. term −→ factor
10. term −→ term mult op factor
11. factor −→ ( expr )
12. factor −→ id
13. factor −→ number
14. add op −→ +
15. add op −→ -
16. mult op −→ *
17. mult op −→ /
Figure 2.24 LR(1) grammar for the calculator language. Productions have been numbered for
reference in future figures.
in Figure 2.24. While the LL grammar of Figure 2.15 can be parsed bottom-up, the version in Figure 2.24 is preferable for two reasons. First, it uses a left-recursive production for stmt list. Left recursion allows the parser to collapse
long statement lists as it goes along, rather than waiting until the entire list is on
the stack and then collapsing it from the end. Second, it uses left-recursive productions for expr and term. These productions capture left associativity while
still keeping an operator and its operands together in the same right-hand side,
something we were unable to do in a top-down grammar.
Modeling a Parse with LR Items

EXAMPLE 2.36  Bottom-up parse of the "sum and average" program
Suppose we are to parse the sum-and-average program from Example 2.22:
read A
read B
sum := A + B
write sum
write sum / 2
The key to success will be to figure out when we have reached the end of a right-hand side—that is, when we have a handle at the top of the parse stack. The trick
is to keep track of the set of productions we might be “in the middle of ” at any
particular time, together with an indication of where in those productions we
might be.
When we begin execution, the parse stack is empty and we are at the beginning of the production for program. (In general, we can assume that there is only
one production with the start symbol on the left-hand side; it is easy to modify any grammar to make this the case.) We can represent our location—more
specifically, the location represented by the top of the parse stack—with a in
the right-hand side of the production:
.
program −→
.
stmt list $$
.
.
When augmented with a , a production is called an LR item. Since the in this
item is immediately in front of a nonterminal—namely stmt list—we may be
about to see the yield of that nonterminal coming up on the input. This possibility implies that we may be at the beginning of some production with stmt list on
the left-hand side:
program −→
stmt list −→
stmt list −→
.
.
.
stmt list $$
stmt list stmt
stmt
And, since stmt is a nonterminal, we may also be at the beginning of any production whose left-hand side is stmt:
    program −→ • stmt list $$                 (State 0)
    stmt list −→ • stmt list stmt
    stmt list −→ • stmt
    stmt −→ • id := expr
    stmt −→ • read id
    stmt −→ • write expr
Since all of these last productions begin with a terminal, no additional items need
to be added to our list. The original item (program −→ • stmt list $$ ) is called the basis of the list. The additional items are its closure. The list represents the initial state of the parser. As we shift and reduce, the set of items will change, always indicating which productions may be the right one to use next in the derivation of the input string. If we reach a state in which some item has the • at the end of the right-hand side, we can reduce by that production. Otherwise, as in the current situation, we must shift. Note that if we need to shift, but the incoming token cannot follow the • in any item of the current state, then a syntax error has
occurred. We will consider error recovery in more detail in Section 2.3.4.
Our upcoming token is a read . Once we shift it onto the stack, we know we
are in the following state:
    stmt −→ read • id                         (State 1)

This state has a single basis item and an empty closure—the • precedes a terminal. After shifting the A , we have
    stmt −→ read id •                         (State 1′)
We now know that read id is the handle, and we must reduce. The reduction
pops two symbols off the parse stack and pushes a stmt in their place, but what
should the new state be? We can see the answer if we imagine moving back in time
to the point at which we shifted the read —the first symbol of the right-hand
side. At that time we were in the state labeled “State 0” above, and the upcoming
tokens on the input (though we didn’t look at them at the time) were read id .
We have now consumed these tokens, and we know that they constituted a stmt.
By pushing a stmt onto the stack, we have in essence replaced read id with stmt
on the input stream, and have then “shifted” the nonterminal, rather than its
yield, into the stack. Since one of the items in State 0 was
    stmt list −→ • stmt
we now have
    stmt list −→ stmt •                       (State 0′)
Again we must reduce. We remove the stmt from the stack and push a stmt list in
its place. Again we can see this as “shifting” a stmt list when in State 0. Since two
of the items in State 0 have a stmt list after the , we don’t know (without looking
ahead) which of the productions will be the next to be used in the derivation, but
we don’t have to know. The key advantage of bottom-up parsing over top-down
parsing is that we don’t need to predict ahead of time which production we shall
be expanding.
Our new state is as follows:
    program −→ stmt list • $$                 (State 2)
    stmt list −→ stmt list • stmt
    stmt −→ • id := expr
    stmt −→ • read id
    stmt −→ • write expr
The first two productions are the basis; the others are the closure. Since no item has a • at the end, we shift the next token, which happens again to be a read , taking us back to State 1. Shifting the B takes us to State 1′ again, at which point we reduce. This time, however, we go back to State 2 rather than State 0 before shifting the left-hand side stmt. Why? Because we were in State 2 when we began to read the right-hand side.
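The closure computation and the "shifting" of a symbol across the dot can be written directly from these definitions. The Python sketch below rebuilds the first few states of this example for the grammar of Figure 2.24; representing an item as a (production index, dot position) pair, with productions numbered from zero, is an implementation convenience, not the book's notation.

# LR(0) items for the grammar of Figure 2.24.
PRODS = [
    ("program",   ("stmt_list", "$$")),
    ("stmt_list", ("stmt_list", "stmt")),
    ("stmt_list", ("stmt",)),
    ("stmt",      ("id", ":=", "expr")),
    ("stmt",      ("read", "id")),
    ("stmt",      ("write", "expr")),
    ("expr",      ("term",)),
    ("expr",      ("expr", "add_op", "term")),
    ("term",      ("factor",)),
    ("term",      ("term", "mult_op", "factor")),
    ("factor",    ("(", "expr", ")")),
    ("factor",    ("id",)),
    ("factor",    ("number",)),
    ("add_op",    ("+",)),
    ("add_op",    ("-",)),
    ("mult_op",   ("*",)),
    ("mult_op",   ("/",)),
]
NONTERMS = {lhs for lhs, _ in PRODS}

def closure(items):
    """Add A -> . alpha for every nonterminal A that appears right after a dot."""
    items = set(items)
    while True:
        new = set()
        for prod, dot in items:
            rhs = PRODS[prod][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                new |= {(p, 0) for p, (lhs, _) in enumerate(PRODS)
                        if lhs == rhs[dot]}
        if new <= items:
            return items
        items |= new

def goto(items, X):
    """The state reached by shifting symbol X: move the dot across X, then close."""
    moved = {(p, d + 1) for p, d in items
             if d < len(PRODS[p][1]) and PRODS[p][1][d] == X}
    return closure(moved)

state0 = closure({(0, 0)})          # program -> . stmt_list $$, plus its closure
state1 = goto(state0, "read")       # { stmt -> read . id }
state2 = goto(state0, "stmt_list")  # basis: program -> stmt_list . $$,
                                    #        stmt_list -> stmt_list . stmt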
The Characteristic Finite State Machine and LR Parsing Variants
An LR-family parser keeps track of the states it has traversed by pushing them into
the parse stack along with the grammar symbols. It is in fact the states (rather
than the symbols) that drive the parsing algorithm: they tell us what state we
were in at the beginning of a right-hand side. Specifically, when the combination of state and input tells us we need to reduce using production A −→ α , we
pop length(α) symbols off the stack, together with the record of states we moved
through while shifting those symbols. These pops expose the state we were in immediately prior to the shifts, allowing us to return to that state and proceed as if
we had seen A in the first place.
We can think of the shift rules of an LR-family parser as the transition function
of a finite automaton, much like the automata we used to model scanners. Each
state of the automaton corresponds to a list of items that indicate where the parser
might be at some specific point in the parse. The transition for input symbol X
(which may be either a terminal or a nonterminal) moves to a state whose basis
consists of items in which the • has been moved across an X in the right-hand
side, plus whatever items need to be added as closure. The lists are constructed by
a bottom-up parser generator in order to build the automaton but are not needed
during parsing.
It turns out that the simpler members of the LR family of parsers—LR(0),
SLR(1), and LALR(1)—all use the same automaton, called the characteristic
finite-state machine, or CFSM. Full LR parsers use a machine with (for most
grammars) a much larger number of states. The differences between the algorithms lie in how they deal with states that contain a shift-reduce conflict—one
item with the • in the middle (suggesting the need for a shift) and another with the • at the end (suggesting the need for a reduction). An LR(0) parser works
only when there are no such states. It can be proven that with the addition of an
end-marker (i.e., $$ ), any language that can be deterministically parsed bottom-up has an LR(0) grammar. Unfortunately, the LR(0) grammars for real programming languages tend to be prohibitively large and unintuitive.
SLR (simple LR) parsers peek at upcoming input and use FOLLOW sets to resolve conflicts. An SLR parser will call for a reduction via A −→ α only if the
upcoming token(s) are in FOLLOW(A). It will still see a conflict, however, if the tokens are also in the FIRST set of any of the symbols that follow a • in other
items of the state. As it turns out, there are important cases in which a token may
follow a given nonterminal somewhere in a valid program, but never in a context
described by the current state. For these cases global FOLLOW sets are too crude.
LALR (look-ahead LR) parsers improve on SLR by using local (state-specific)
look-ahead instead.
Conflicts can still arise in an LALR parser when the same set of items can
occur on two different paths through the CFSM. Both paths will end up in the
same state, at which point state-specific look-ahead can no longer distinguish
between them. A full LR parser duplicates states in order to keep paths disjoint
when their local look-aheads are different.
LALR parsers are the most common bottom-up parsers in practice. They are
the same size and speed as SLR parsers, but are able to resolve more conflicts.
Full LR parsers for real programming languages tend to be very large. Several
researchers have developed techniques to reduce the size of full-LR tables, but
LALR works sufficiently well in practice that the extra complexity of full LR is
usually not required. Yacc/bison produces C code for an LALR parser.
Bottom-Up Parsing Tables
Like a table-driven LL(1) parser, an SLR(1), LALR(1), or LR(1) parser executes
a loop in which it repeatedly inspects a two-dimensional table to find out what
action to take. However, instead of using the current input token and top-of-stack nonterminal to index into the table, an LR-family parser uses the current
input token and the current parser state (which can be found at the top of the
stack). “Shift” table entries indicate the state that should be pushed. “Reduce”
table entries indicate the number of states that should be popped and the nonterminal that should be pushed back onto the input stream, to be shifted by the
state uncovered by the pops. There is always one popped state for every symbol
on the right-hand side of the reducing production. The state to be pushed next
can be found by indexing into the table using the uncovered state and the newly
recognized nonterminal.
The CFSM for our bottom-up version of the calculator grammar appears in
Figure 2.25. States 6, 7, 9, and 13 contain potential shift-reduce conflicts, but all
of these can be resolved with global FOLLOW sets. SLR parsing therefore suffices.
In State 6, for example, FIRST(add op) ∩ FOLLOW(stmt) = ∅. In addition to shift
and reduce rules, we allow the parse table as an optimization to contain rules of
the form “shift and then reduce.” This optimization serves to eliminate trivial
states such as 1 and 0 in Example 2.36, which had only a single item, with the
• at the end.
A pictorial representation of the CFSM appears in Figure 2.26. A tabular
representation, suitable for use in a table-driven parser, appears in Figure 2.27.
Pseudocode for the (language independent) parser driver appears in Figure 2.28.
A trace of the parser’s actions on the sum-and-average program appears in Figure 2.29.
Handling Epsilon Productions
EXAMPLE 2.38   Epsilon productions in the bottom-up calculator grammar
The careful reader may have noticed that the grammar of Figure 2.24, in addition
to using left-recursive rules for stmt list, expr, and term, differs from the grammar of Figure 2.15 in one other way: it defines a stmt list to be a sequence of one
or more stmts, rather than zero or more. (This means, of course, that it defines a
different language.) To capture the same language as Figure 2.15, the productions
program −→ stmt list $$
stmt list −→ stmt list stmt
−→ stmt
in Figure 2.24 would need to be replaced with
program −→ stmt list $$
stmt list −→ stmt list stmt
−→ ε
Figure 2.25   CFSM for the calculator grammar (Figure 2.24). Basis and closure items in each state are separated by a horizontal rule. Trivial reduce-only states have been eliminated by use of “shift and reduce” transitions. (The full state-by-state listing is not reproduced here.)
Figure 2.26 Pictorial representation of the CFSM of Figure 2.25. Symbol names have been abbreviated for clarity. Reduce
actions are not shown.
Figure 2.27   SLR(1) parse table for the calculator language. Table entries indicate whether to shift (s), reduce (r), or shift and then reduce (b). The accompanying number is the new state when shifting, or the production that has been recognized when (shifting and) reducing. Production numbers are given in Figure 2.24. Symbol names have been abbreviated for the sake of formatting. A dash indicates an error. An auxiliary table, not shown here, gives the left-hand side symbol and right-hand side length for each production. (The table itself is not reproduced here.)
Note that it does in general make sense to have an empty statement list. In the
calculator language it simply permits an empty program, which is admittedly
silly. In real languages, however, it allows the body of a structured statement to
be empty, which can be very useful. One frequently wants one arm of a case or
multiway if . . . then . . . else statement to be empty, and an empty while loop
allows a parallel program (or the operating system) to wait for a signal from
another process or an I/O device.
state = 1 . . number of states
symbol = 1 . . number of symbols
production = 1 . . number of productions
action rec = record
    action : (shift, reduce, shift reduce, error)
    new state : state
    prod : production
parse tab : array [symbol, state] of action rec
prod tab : array [production] of record
    lhs : symbol
    rhs len : integer
–– these two tables are created by a parser generator tool

parse stack : stack of record
    sym : symbol
    st : state

parse stack.push(null, start state)
cur sym : symbol := scan                          –– get new token from scanner
loop
    cur state : state := parse stack.top.st       –– peek at state at top of stack
    if cur state = start state and cur sym = start symbol
        return                                    –– success!
    ar : action rec := parse tab[cur state, cur sym]
    case ar.action
        shift:
            parse stack.push(cur sym, ar.new state)
            cur sym := scan                       –– get new token from scanner
        reduce:
            cur sym := prod tab[ar.prod].lhs
            parse stack.pop(prod tab[ar.prod].rhs len)
        shift reduce:
            cur sym := prod tab[ar.prod].lhs
            parse stack.pop(prod tab[ar.prod].rhs len − 1)
        error:
            parse error
Figure 2.28 Driver for a table-driven SLR(1) parser. We call the scanner directly, rather than
using the global input token of Figures 2.16 and 2.18, so that we can set cur sym to be an
arbitrary symbol.
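As a concrete (and deliberately tiny) companion to Figure 2.28, the following C sketch hand-codes the same kind of driver for a hypothetical three-production grammar. The tables are written out by hand purely for illustration—ordinarily a parser generator would emit them—and, as in Figure 2.29, only states (not symbols) are kept on the parse stack.

/* A hand-worked instance of a table-driven bottom-up parser, for the toy
   grammar  (0) P -> E $$   (1) E -> E + id   (2) E -> id.
   The tables below were constructed by hand for illustration only. */
#include <stdio.h>

enum sym { P, E, PLUS, ID, EOI, NSYMS };          /* EOI plays the role of $$ */
enum act { ERROR, SHIFT, REDUCE, SHIFT_REDUCE };

struct action_rec { enum act action; int new_state; int prod; };

/* left-hand side and right-hand-side length of each production */
static const struct { enum sym lhs; int rhs_len; } prod_tab[3] = {
    { P, 2 },   /* P -> E $$   */
    { E, 3 },   /* E -> E + id */
    { E, 1 },   /* E -> id     */
};

/* parse_tab[state][symbol]; unspecified entries are ERROR */
static const struct action_rec parse_tab[3][NSYMS] = {
    /* state 0 */ { [E]    = { SHIFT,        1, 0 },
                    [ID]   = { SHIFT_REDUCE, 0, 2 } },
    /* state 1 */ { [PLUS] = { SHIFT,        2, 0 },
                    [EOI]  = { SHIFT_REDUCE, 0, 0 } },
    /* state 2 */ { [ID]   = { SHIFT_REDUCE, 0, 1 } },
};

static enum sym input[] = { ID, PLUS, ID, PLUS, ID, EOI };   /* id + id + id $$ */
static int next_tok = 0;
static enum sym scan(void) { return input[next_tok++]; }

int main(void) {
    int state_stack[100], top = 0;
    state_stack[0] = 0;                            /* start state */
    enum sym cur_sym = scan();
    for (;;) {
        int cur_state = state_stack[top];
        if (cur_state == 0 && cur_sym == P) {      /* back in the start state,   */
            puts("accepted");                      /* with the start symbol next */
            return 0;
        }
        struct action_rec ar = parse_tab[cur_state][cur_sym];
        switch (ar.action) {
        case SHIFT:
            state_stack[++top] = ar.new_state;
            cur_sym = scan();
            break;
        case SHIFT_REDUCE:                         /* shift, then immediately reduce: */
            top -= prod_tab[ar.prod].rhs_len - 1;  /* net pop of rhs_len - 1 states   */
            cur_sym = prod_tab[ar.prod].lhs;       /* push lhs back onto the input    */
            break;
        case REDUCE:                               /* not needed by this toy table */
            top -= prod_tab[ar.prod].rhs_len;
            cur_sym = prod_tab[ar.prod].lhs;
            break;
        default:
            puts("syntax error");
            return 1;
        }
    }
}

For the input id + id + id $$ the driver alternates ordinary shifts with “shift and reduce” actions, and announces acceptance when it uncovers the start state with the start symbol as the current input symbol.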
EXAMPLE 2.39   CFSM with epsilon productions
If we look at the CFSM for the calculator language, we discover that State 0 is
the only state that needs to be changed in order to allow empty statement lists.
Figure 2.29   Trace of a table-driven SLR(1) parse of the sum-and-average program. States in the parse stack are shown in boldface type. Symbols in the parse stack are for clarity only; they are not needed by the parsing algorithm. Parsing begins with the initial state of the CFSM (State 0) in the stack. It ends when we reduce by program −→ stmt list $$ , uncovering State 0 again and pushing program onto the input stream. (The full trace is not reproduced here.)

The item
stmt list −→ • stmt
becomes
stmt list −→ • ε
which is equivalent to
stmt list −→ ε •
or simply
stmt list −→ •
The entire state is then
program −→ • stmt list $$            on stmt list shift and goto 2
stmt list −→ • stmt list stmt
stmt list −→ •                       on $$ reduce (pop 0 states, push stmt list on input)
stmt −→ • id := expr                 on id shift and goto 3
stmt −→ • read id                    on read shift and goto 1
stmt −→ • write expr                 on write shift and goto 4
The look-ahead for item
stmt list −→ •
is FOLLOW(stmt list), which is the end-marker, $$ . Since $$ does not appear in
the look-aheads for any other item in this state, our grammar is still SLR(1). It is
worth noting that epsilon productions prevent a grammar from being LR(0),
since one can never tell whether to “recognize” ε without peeking ahead. An
LR(0) grammar never has epsilon productions.
CHECK YOUR UNDERSTANDING
36. What is the handle of a right sentential form?
37. Explain the significance of the characteristic finite state machine in LR
parsing.
38. What is the significance of the dot ( • ) in an LR item?
39. What distinguishes the basis from the closure of an LR state?
40. What is a shift-reduce conflict? How is it resolved in the various kinds of LR-family parsers?
41. Outline the steps performed by the driver of a bottom-up parser.
42. What kind of parser is produced by yacc/bison? By ANTLR?
43. Why are there never any epsilon productions in an LR(0) grammar?
2.3.4   Syntax Errors

EXAMPLE 2.40   A syntax error in C
Suppose we are parsing a C program and see the following code fragment in a
context where a statement is expected.
A = B : C + D;
We will detect a syntax error immediately after the B , when the colon appears
from the scanner. At this point the simplest thing to do is just to print an error
message and halt. This naive approach is generally not acceptable, however: it
would mean that every run of the compiler reveals no more than one syntax
error. Since most programs, at least at first, contain numerous such errors, we
really need to find as many as possible now (we’d also like to continue looking
for semantic errors). To do so, we must modify the state of the parser and/or the
input stream so that the upcoming token(s) are acceptable. We shall probably
want to turn off code generation, disabling the back end of the compiler: since
the input is not a valid program, the code will not be of use, and there’s no point
in spending time creating it.
In general, the term syntax error recovery is applied to any technique that
allows the compiler, in the face of a syntax error, to continue looking for other
errors later in the program. High-quality syntax error recovery is essential in any
production-quality compiler. The better the recovery technique, the more likely
the compiler will be to recognize additional errors (especially nearby errors) correctly, and the less likely it will be to become confused and announce spurious
cascading errors later in the program.
IN MORE DEPTH
There are many possible approaches to syntax error recovery. In panic mode, the
compiler writer defines a small set of “safe symbols” that delimit clean points in
the input. When an error occurs, the compiler deletes input tokens until it finds a
safe symbol, and then “backs the parser out” (e.g., returns from recursive descent
subroutines) until it finds a context in which that symbol might appear. Phraselevel recovery improves on this technique by employing different sets of “safe”
symbols in different productions of the grammar. Context-sensitive look-ahead
obtains additional improvements by differentiating among the various contexts
in which a given production might appear in a syntax tree. To respond gracefully
to certain common programming errors, the compiler writer may augment the
grammar with error productions that capture language-specific idioms that are
incorrect but are often written by mistake.
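As a rough illustration of the first of these techniques, the following C sketch (not from the text) shows panic-mode recovery with a stub scanner; the token names and the particular set of “safe symbols” are invented for illustration, and the canned token stream mirrors the erroneous fragment of Example 2.40.

/* A sketch of panic-mode error recovery for a recursive descent parser,
   with a stub scanner so the fragment is self-contained. */
#include <stddef.h>
#include <stdio.h>

typedef enum { T_ID, T_ASSIGN, T_COLON, T_PLUS, T_SEMI, T_END, T_EOF } token;

/* stub scanner: a canned token stream standing in for a real lexer,
   corresponding roughly to  A := B : C + D ;                        */
static const token stream[] = { T_ID, T_ASSIGN, T_ID, T_COLON,
                                T_ID, T_PLUS, T_ID, T_SEMI, T_EOF };
static size_t pos = 0;
static token input_token;
static void get_next_token(void) { input_token = stream[pos++]; }

/* delete input tokens until one of the "safe symbols" appears; the caller
   then backs out of recursive descent routines until that symbol is
   acceptable, and resumes parsing there */
static void panic_mode_recover(const char *msg) {
    static const token safe[] = { T_SEMI, T_END, T_EOF };
    fprintf(stderr, "syntax error: %s -- skipping to a safe symbol\n", msg);
    for (;;) {
        for (size_t i = 0; i < sizeof safe / sizeof safe[0]; i++)
            if (input_token == safe[i]) return;
        get_next_token();
    }
}

int main(void) {
    get_next_token();                       /* A  */
    get_next_token();                       /* := */
    get_next_token();                       /* B  */
    get_next_token();                       /* ':' -- not acceptable here */
    panic_mode_recover("unexpected ':' after expression");
    printf("resumed at token %d\n", (int) input_token);   /* the ';' */
    return 0;
}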
Niklaus Wirth published an elegant implementation of phrase-level and
context-sensitive recovery for recursive descent parsers in 1976 [Wir76, Sec. 5.9].
Exceptions (to be discussed further in Section 8.5.3) provide a simpler alternative
if supported by the language in which the compiler is written. For table-driven
top-down parsers, Fischer, Milton, and Quiring published an algorithm in 1980
that automatically implements a well-defined notion of locally least-cost syntax
repair. Locally least-cost repair is also possible in bottom-up parsers, but it is significantly more difficult. Most bottom-up parsers rely on more straightforward
phrase-level recovery; a typical example can be found in yacc/bison.
2.4   Theoretical Foundations
Our understanding of the relative roles and computational power of scanners,
parsers, regular expressions, and context-free grammars is based on the formalisms of automata theory. In automata theory, a formal language is a set of
strings of symbols drawn from a finite alphabet. A formal language can be specified either by a set of rules (such as regular expressions or a context-free grammar) that generate the language or by a formal machine that accepts (recognizes)
the language. A formal machine takes strings of symbols as input and outputs
either “yes” or “no.” A machine is said to accept a language if it says “yes” to all
and only those strings that are in the language. Alternatively, a language can be
defined as the set of strings for which a particular machine says “yes.”
Formal languages can be grouped into a series of successively larger classes
known as the Chomsky hierarchy.13 Most of the classes can be characterized in
two ways: by the types of rules that can be used to generate the set of strings or
by the type of formal machine that is capable of recognizing the language. As
we have seen, regular languages are defined by using concatenation, alternation,
and Kleene closure, and are recognized by a scanner. Context-free languages are
a proper superset of the regular languages. They are defined by using concatenation, alternation, and recursion (which subsumes Kleene closure), and are recognized by a parser. A scanner is a concrete realization of a finite automaton, a type
of formal machine. A parser is a concrete realization of a push-down automaton.
Just as context-free grammars add recursion to regular expressions, push-down
automata add a stack to the memory of a finite automaton. There are additional
levels in the Chomsky hierarchy, but they are less directly applicable to compiler
construction, and are not covered here.
It can be proven, constructively, that regular expressions and finite automata
are equivalent: one can construct a finite automaton that accepts the language
defined by a given regular expression, and vice versa. Similarly, it is possible to
construct a push-down automaton that accepts the language defined by a given
context-free grammar, and vice versa. The grammar-to-automaton constructions
are in fact performed by scanner and parser generators such as lex and yacc . Of
course, a real scanner does not accept just one token; it is called in a loop so that
it keeps accepting tokens repeatedly. This detail is accommodated by having the
scanner accept the alternation of all the tokens in the language, and by having it
continue to consume characters until no longer token can be constructed.

13 Noam Chomsky (1928–), a linguist and social philosopher at the Massachusetts Institute of Technology, developed much of the early theory of formal languages.
IN MORE DEPTH
On the PLP CD we consider finite and pushdown automata in more detail. We
give an algorithm to convert a DFA into an equivalent regular expression. Combined with the constructions in Section 2.2.1, this algorithm demonstrates the
equivalence of regular expressions and finite automata. We also consider the sets
of grammars and languages that can and cannot be parsed by the various linear-time parsing algorithms.
2.5   Summary and Concluding Remarks
In this chapter we have introduced the formalisms of regular expressions and
context-free grammars, and the algorithms that underlie scanning and parsing
in practical compilers. We also mentioned syntax error recovery, and presented a
quick overview of relevant parts of automata theory. Regular expressions and
context-free grammars are language generators: they specify how to construct
valid strings of characters or tokens. Scanners and parsers are language recognizers: they indicate whether a given string is valid. The principal job of the scanner
is to reduce the quantity of information that must be processed by the parser, by
grouping characters together into tokens, and by removing comments and white
space. Scanner and parser generators automatically translate regular expressions
and context-free grammars into scanners and parsers.
Practical parsers for programming languages (parsers that run in linear time)
fall into two principal groups: top-down (also called LL or predictive) and
bottom-up (also called LR or shift-reduce). A top-down parser constructs a parse
tree starting from the root and proceeding in a left-to-right depth-first traversal.
A bottom-up parser constructs a parse tree starting from the leaves, again working left-to-right, and combining partial trees together when it recognizes the children of an internal node. The stack of a top-down parser contains a prediction of
what will be seen in the future; the stack of a bottom-up parser contains a record
of what has been seen in the past.
Top-down parsers tend to be simple, both in the parsing of valid strings and in
the recovery from errors in invalid strings. Bottom-up parsers are more powerful, and in some cases lend themselves to more intuitively structured grammars,
though they suffer from the inability to embed action routines at arbitrary points
in a right-hand side (we discuss this point in more detail in Section 4.5.1).
Both varieties of parser are used in real compilers, though bottom-up parsers are
more common. Top-down parsers tend to be smaller in terms of code and data
size, but modern machines provide ample memory for either.
Both scanners and parsers can be built by hand if an automatic tool is
not available. Hand-built scanners are simple enough to be relatively common.
Hand-built parsers are generally limited to top-down recursive descent, and are
generally used only for comparatively simple languages (e.g., Pascal but not
Ada). Automatic generation of the scanner and parser has the advantage of increased reliability, reduced development time, and easy modification and enhancement.
Various features of language design can have a major impact on the complexity of syntax analysis. In many cases, features that make it difficult for a compiler
to scan or parse also make it difficult for a human being to write correct, maintainable code. Examples include the lexical structure of Fortran and the if . . .
then . . . else statement of languages like Pascal. This interplay among language
design, implementation, and use will be a recurring theme throughout the remainder of the book.
2.6   Exercises
2.1 Write regular expressions to capture
(a) Strings in C. These are delimited by double quotes ( " ), and may not
contain newline characters. They may contain double quote or backslash
characters if and only if those characters are “escaped” by a preceding
backslash. You may find it helpful to introduce shorthand notation to
represent any character that is not a member of a small specified set.
(b) Comments in Pascal. These are delimited by (* and *) , as shown in Figure 2.6, or by { and }.
(c) Floating-point constants in Ada. These are the same as in Pascal (see
the definition of unsigned number in Example 2.2 [page 41]), except that
(1) an underscore is permitted between digits, and (2) an alternative
numeric base may be specified by surrounding the non-exponent part
of the number with pound signs, preceded by a base in decimal (e.g.,
16#6.a7#e+2 ). In this latter case, the letters a . . f (both upper- and lowercase) are permitted as digits. Use of these letters in an inappropriate
(e.g., decimal) number is an error but need not be caught by the scanner.
(d) Inexact constants in Scheme. Scheme allows real numbers to be explicitly
inexact (imprecise). A programmer who wants to express all constants
using the same number of characters can use sharp signs (#) in place
of any lower-significance digits whose values are not known. A base-ten
constant without exponent consists of one or more digits followed by
zero of more sharp signs. An optional decimal point can be placed at the
beginning, the end, or anywhere in between. (For the record, numbers
in Scheme are actually a good bit more complicated than this. For the
purposes of this exercise, please ignore anything you may know about
sign, exponent, radix, exactness and length specifiers, and complex or
rational values.)
(e) Financial quantities in American notation. These have a leading dollar
sign ( $ ), an optional string of asterisks ( * —used on checks to discourage
fraud), a string of decimal digits, and an optional fractional part consisting of a decimal point ( . ) and two decimal digits. The string of digits to
the left of the decimal point may consist of a single zero ( 0 ). Otherwise
it must not start with a zero. If there are more than three digits to the
left of the decimal point, groups of three (counting from the right) must
be separated by commas ( , ). Example: $**2,345.67 . (Feel free to use
“productions” to define abbreviations, so long as the language remains
regular.)
2.2 Show (as “circles-and-arrows” diagrams) the finite automata for parts (a)
and (c) of Exercise 2.1.
2.3 Build a regular expression that captures all nonempty sequences of letters
other than file , for , and from . For notational convenience, you may
assume the existence of a not operator that takes a set of letters as argument
and matches any other letter. Comment on the practicality of constructing
a regular expression for all sequences of letters other than the keywords of
a large programming language.
2.4 (a) Show the NFA that results from applying the construction of Figure 2.8
to the regular expression letter ( letter | digit )* .
(b) Apply the transformation illustrated by Example 2.12 to create an equivalent DFA.
(c) Apply the transformation illustrated by Example 2.13 to minimize the
DFA.
2.5 Build an ad hoc scanner for the calculator language. As output, have it print
a list, in order, of the input tokens. For simplicity, feel free to simply halt in
the event of a lexical error.
2.6 Build a nested- case -statements finite automaton that converts all letters
in its input to lowercase, except within Pascal-style comments and strings.
A Pascal comment is delimited by { and }, or by (* and *) . Comments do not nest. A Pascal string is delimited by single quotes ( ’ . . . ’ ).
A quote character can be placed in a string by doubling it ( ’Madam, I’’m
Adam.’ ). This upper-to-lower mapping can be useful if feeding a program
written in standard Pascal (which ignores case) to a compiler that considers
upper- and lowercase letters to be distinct.
2.7 Give an example of a grammar that captures right associativity for an ex
ponentiation operator (e.g., ** in Fortran).
2.8 Prove that the following grammar is LL(1).
decl −→ ID decl tail
decl tail −→ , decl
−→ : ID ;
(The final ID is meant to be a type name.)
2.9 Consider the following grammar.
G −→ S $$
S −→ A M
M −→ S
−→ ε
A −→ a E
−→ b A A
E −→ a B
−→ b A
−→ ε
B −→ b E
−→ a B B
(a) Describe in English the language that the grammar generates.
(b) Show a parse tree for the string a b a a .
(c) Is the grammar LL(1)? If so, show the parse table; if not, identify a prediction conflict.
2.10 Consider the language consisting of all strings of properly balanced parentheses and brackets.
(a) Give LL(1) and SLR(1) grammars for this language.
(b) Give the corresponding LL(1) and SLR(1) parsing tables.
(c) For each grammar, show the parse tree for ([]([]))[](()) .
(d) Give a trace of the actions of the parsers on this input.
2.11 Give an example of a grammar that captures all the levels of precedence
for arithmetic expressions in C. (Hint: This exercise is somewhat tedious.
You probably want to attack it with a text editor rather than a pencil, so
you can cut, paste, and replace. You can find a summary of C precedence
in Figure 6.1 [page 237]; you may want to consult a manual for further
details.)
2.12 Extend the grammar of Figure 2.24 to include
if statements and while
loops, along the lines suggested by the following examples.
abs := n
if n < 0 then abs := 0 - abs fi
sum := 0
read count
while count > 0 do
read n
sum := sum + n
count := count - 1
od
write sum
Your grammar should support the six standard comparison operations in
conditions, with arbitrary expressions as operands. It should allow an arbitrary number of statements in the body of an if or while statement.
2.13 Consider the following LL(1) grammar for a simplified subset of Lisp.
P −→ E $$
E −→ atom
−→ ’ E
−→ ( E Es )
Es −→ E Es
−→ ε
(a) What is FIRST(Es)? FOLLOW(E)? PREDICT(Es −→ ε)?
(b) Give a parse tree for the string (cdr ’(a b c)) $$.
(c) Show the left-most derivation of (cdr ’(a b c)) $$.
(d) Show a trace, in the style of Figure 2.20, of a table-driven top-down parse of this same input.
(e) Now consider a recursive descent parser running on the same input.
At the point where the quote token ( ’ ) is matched, which recursive
descent routines will be active (i.e., what routines will have a frame on
the parser’s run-time stack)?
2.14 Write top-down and bottom-up grammars for the language consisting of
all well-formed regular expressions. Arrange for all operators to be left-associative. Give Kleene closure the highest precedence and alternation the
lowest precedence.
2.15 Suppose that the expression grammar in Example 2.7 were to be used in
conjunction with a scanner that did not remove comments from the input
but rather returned them as tokens. How would the grammar need to be
modified to allow comments to appear at arbitrary places in the input?
2.16 Build a complete recursive descent parser for the calculator language. As
output, have it print a trace of its matches and predictions.
2.17 Flesh out the details of an algorithm to eliminate left recursion and common prefixes in an arbitrary context-free grammar.
2.18 In some languages an assignment can appear in any context in which an
expression is expected: the value of the expression is the right-hand side
of the assignment, which is placed into the left-hand side as a side effect.
Consider the following grammar fragment for such a language. Explain why
it is not LL(1), and discuss what might be done to make it so.
expr −→ id := expr
−→ term term tail
term tail −→ + term term tail
−→ ε
term −→ factor factor tail
factor tail −→ * factor factor tail
−→ ε
factor −→ ( expr )
−→ id
2.19 Construct a trace over time of the forest of partial parse trees manipulated
by a bottom-up parser for the string A, B, C; , using the grammar in Example 2.19 (the one that is able to collapse prefixes of the id list as it goes
along).
2.20 Construct the CFSM for the id list grammar in Example 2.18 (page 62) and
verify that it can be parsed bottom-up with zero tokens of look-ahead.
2.21 Modify the grammar in Exercise 2.20 to allow an id list to be empty. Is the
grammar still LR(0)?
2.22 Consider the following grammar for a declaration list.
decl list −→ decl list decl ;
−→ decl ;
decl −→ id : type
type −→ int
−→ real
−→ char
−→ array const .. const of type
−→ record decl list end
Construct the CFSM for this grammar. Use it to trace out a parse (as in
Figure 2.29) for the following input program.
foo : record
a : char;
b : array 1..2 of real;
end;
2.23 The dangling else problem of Pascal is not shared by Algol 60. To avoid
ambiguity regarding which then is matched by an else , Algol 60 prohibits
ambiguity regarding which then is matched by an else , Algol 60 prohibits
if statements immediately inside a then clause. The Pascal fragment
if C1 then if C2 then S1 else S2
must be written as either
if C1 then begin if C2 then S1 end else S2
or
if C1 then begin if C2 then S1 else S2 end
in Algol 60. Show how to write a grammar for conditional statements that
enforces this rule. (Hint: You will want to distinguish in your grammar between conditional statements and nonconditional statements; some contexts will accept either, some only the latter.)
2.24–2.28 In More Depth.
2.7   Explorations
2.29 Some languages (e.g., C) distinguish between upper- and lowercase letters
in identifiers. Others (e.g., Ada) do not. Which convention do you prefer?
Why?
2.30 The syntax for type casts in C and its descendants introduces potential ambiguity: is (x)-y a subtraction, or the unary negation of y , cast to type x ?
Find out how C, C++, Java, and C# answer this question. Discuss how you
would implement the answer(s).
2.31 What do you think of Haskell, Occam, and Python’s use of indentation
to delimit control constructs (Section 2.1.1)? Would you expect this convention to make program construction and maintenance easier or harder?
Why?
2.32 Skip ahead to Section 13.4.2 and learn about the “regular expressions” used
in scripting languages, editors, search tools, and so on. Are these really regular? What can they express that cannot be expressed in the notation introduced in Section 2.1.1?
2.33 Rebuild the automaton of Exercise 2.6 using lex/flex.
2.34 Find a manual for yacc/bison, or consult a compiler textbook [ASU86]
to learn about operator precedence parsing. Explain how it could be used to
simplify the grammar of Exercise 2.11.
2.35 Use lex/flex and yacc/bison to construct a parser for the calculator language. Have it output a trace of its shifts and reductions.
2.36 Repeat the previous exercise using ANTLR.
2.37–2.38 In More Depth.
2.8   Bibliographic Notes
Our coverage of scanning and parsing in this chapter has of necessity been brief.
Considerably more detail can be found in texts on parsing theory [AU72] and
compiler construction [App97, ASU86, CT04, FL88, GBJL01]. Many compilers
of the early 1960s employed recursive descent parsers. Lewis and Stearns [LS68]
and Rosenkrantz and Stearns [RS70] published early formal studies of LL
grammars and parsing. The original formulation of LR parsing is due to
Knuth [Knu65]. Bottom-up parsing became practical with DeRemer’s discovery
of the SLR and LALR algorithms [DeR69, DeR71]. W. L. Johnson et al. [JPAR68]
describe an early scanner generator. The Unix lex tool is due to Lesk [Les75].
Yacc is due to S. C. Johnson [Joh75].
Further details on formal language theory can be found in a variety of
textbooks, including those of Hopcroft, Motwani, and Ullman [HMU01] and
Sipser [Sip97]. Kleene [Kle56] and Rabin and Scott [RS59] proved the equivalence of regular expressions and finite automata.14 The proof that finite automata
are unable to recognize nested constructs is based on a theorem known as the
pumping lemma, due to Bar-Hillel, Perles, and Shamir [BHPS61]. Context-free
grammars were first explored by Chomsky [Cho56] in the context of natural language. Independently, Backus and Naur developed BNF for the syntactic description of Algol 60 [NBB+ 63]. Ginsburg and Rice [GR62] recognized the equivalence of the two notations. Chomsky [Cho62] and Evey [Eve63] demonstrated
the equivalence of context-free grammars and push-down automata.
Fischer and LeBlanc’s text [FL88] contains an excellent survey of error recovery and repair techniques, with references to other work. The phrase-level recovery mechanism for recursive descent parsers described in Section 2.3.4 is due
to Wirth [Wir76, Sec. 5.9]. The locally least-cost recovery mechanism for table-driven LL parsers described in Section 2.3.4 is due to Fischer, Milton, and
Quiring [FMQ80]. Dion published a locally least-cost bottom-up repair algorithm in 1978 [Dio78]. It is quite complex, and requires very large precomputed
tables. More recently, McKenzie, Yeatman, and De Vere have shown how to effect
the same repairs without the precomputed tables, at a higher but still acceptable
cost in time [MYD95].
14 Dana Scott (1932–), Professor Emeritus at Carnegie Mellon University, is known principally
for inventing domain theory and launching the field of denotational semantics, which provides
a mathematically rigorous way to formalize the meaning of programming languages. Michael
Rabin (1931–), of Harvard University, has made seminal contributions to the concepts of nondeterminism and randomization in computer science. Scott and Rabin shared the ACM Turing
Award in 1976.
3   Names, Scopes, and Bindings
“High-level” programming languages take their name from the relatively high level, or degree of abstraction, of the features they provide, relative
to those of the assembly languages that they were originally designed to replace.
The adjective abstract, in this context, refers to the degree to which language features are separated from the details of any particular computer architecture. The
early development of languages like Fortran, Algol, and Lisp was driven by a pair
of complementary goals: machine independence and ease of programming. By
abstracting the language away from the hardware, designers not only made it
possible to write programs that would run well on a wide variety of machines,
but also made the programs easier for human beings to understand.
Machine independence is a fairly simple concept. Basically it says that a programming language should not rely on the features of any particular instruction
set for its efficient implementation. Machine dependences still become a problem
from time to time (standards committees for C, for example, have only recently
agreed on how to accommodate machines with 64-bit arithmetic), but with a few
noteworthy exceptions (Java comes to mind) it has probably been 30 years since
the desire for greater machine independence has really driven language design.
Ease of programming, on the other hand, is a much more elusive and compelling
goal. It affects every aspect of language design, and has historically been less a
matter of science than of aesthetics and trial and error.
This chapter is the first of five to address core issues in language design. (The
others are Chapters 6–9.) In Chapter 6 we will look at control-flow constructs,
which allow the programmer to specify the order in which operations are to occur. In contrast to the jump-based control flow of assembly languages, high-level
control flow relies heavily on the lexical nesting of constructs. In Chapter 7 we
will look at types, which allow the programmer to organize program data and
the operations on them. In Chapters 8 and 9 we will look at subroutines and
classes. In this current chapter we look at names.
A name is a mnemonic character string used to represent something else.
Names in most languages are identifiers (alpha-numeric tokens), though certain
other symbols, such as + or := , can also be names. Names allow us to refer to variables, constants, operations, types, and so on using symbolic identifiers rather
than low-level concepts like addresses. Names are also essential in the context of
a second meaning of the word abstraction. In this second meaning, abstraction is
a process by which the programmer associates a name with a potentially complicated program fragment, which can then be thought of in terms of its purpose or
function, rather than in terms of how that function is achieved. By hiding irrelevant details, abstraction reduces conceptual complexity, making it possible for
the programmer to focus on a manageable subset of the program text at any particular time. Subroutines are control abstractions: they allow the programmer to
hide arbitrarily complicated code behind a simple interface. Classes are data abstractions: they allow the programmer to hide data representation details behind
a (comparatively) simple set of operations.
We will look at several major issues related to names. Section 3.1 introduces
the notion of binding time, which refers not only to the binding of a name to
the thing it represents, but also in general to the notion of resolving any design
decision in a language implementation. Section 3.2 outlines the various mechanisms used to allocate and deallocate storage space for objects, and distinguishes
between the lifetime of an object and the lifetime of a binding of a name to that
object.1 Most name-to-object bindings are usable only within a limited region of
a given high-level program. Section 3.3 explores the scope rules that define this
region; Section 3.4 (mostly on the PLP CD) considers their implementation.
The complete set of bindings in effect at a given point in a program is known as
the current referencing environment. Section 3.5 expands on the notion of scope
rules by considering the ways in which a referencing environment may be bound
to a subroutine that is passed as a parameter, returned from a function, or stored
in a variable. Section 3.6 discusses aliasing, in which more than one name may
refer to a given object in a given scope; overloading, in which a name may refer to
more than one object in a given scope, depending on the context of the reference;
and polymorphism, in which a single object may have more than one type, depending on context or execution history. Finally, Section 3.7 (mostly on the PLP
CD) discusses separate compilation.
3.1   The Notion of Binding Time
A binding is an association between two things, such as a name and the thing it
names. Binding time is the time at which a binding is created or, more generally,
1 For want of a better term, we will use the term object throughout Chapters 3–8 to refer to anything that might have a name: variables, constants, types, subroutines, modules, and others. In
many modern languages object has a more formal meaning, which we will consider in Chapter 9.
the time at which any implementation decision is made (we can think of this
as binding an answer to a question). There are many different times at which
decisions may be bound:
Language design time: In most languages, the control flow constructs, the set of
fundamental (primitive) types, the available constructors for creating complex
types, and many other aspects of language semantics are chosen when the language is designed.
Language implementation time: Most language manuals leave a variety of issues
to the discretion of the language implementor. Typical (though by no means
universal) examples include the precision (number of bits) of the fundamental
types, the coupling of I/O to the operating system’s notion of files, the organization and maximum sizes of stack and heap, and the handling of run-time
exceptions such as arithmetic overflow.
Program writing time: Programmers, of course, choose algorithms, data structures, and names.
Compile time: Compilers choose the mapping of high-level constructs to machine code, including the layout of statically defined data in memory.
Link time: Since most compilers support separate compilation—compiling different modules of a program at different times—and depend on the availability of a library of standard subroutines, a program is usually not complete
until the various modules are joined together by a linker. The linker chooses
the overall layout of the modules with respect to one another. It also resolves
intermodule references. When a name in one module refers to an object in another module, the binding between the two was not finalized until link time.
Load time: Load time refers to the point at which the operating system loads the
program into memory so that it can run. In primitive operating systems, the
choice of machine addresses for objects within the program was not finalized
until load time. Most modern operating systems distinguish between virtual
and physical addresses. Virtual addresses are chosen at link time; physical addresses can actually change at run time. The processor’s memory management
hardware translates virtual addresses into physical addresses during each individual instruction at run time.
DESIGN & IMPLEMENTATION: Binding time
It is difficult to overemphasize the importance of binding times in the design and implementation of programming languages. In general, early binding times are associated with greater efficiency, while later binding times are
associated with greater flexibility. The tension between the goals provides a
recurring theme for later chapters of this book.
Run time: Run time is actually a very broad term that covers the entire span
from the beginning to the end of execution. Bindings of values to variables
occur at run time, as do a host of other decisions that vary from language
to language. Run time subsumes program start-up time, module entry time,
elaboration time (the point at which a declaration is first “seen”), subroutine
call time, block entry time, and statement execution time.
The terms static and dynamic are generally used to refer to things bound before
run time and at run time, respectively. Clearly static is a coarse term. So is dynamic.
Compiler-based language implementations tend to be more efficient than
interpreter-based implementations because they make earlier decisions. For example, a compiler analyzes the syntax and semantics of global variable declarations once, before the program ever runs. It decides on a layout for those
variables in memory, and generates efficient code to access them wherever they
appear in the program. A pure interpreter, by contrast, must analyze the declarations every time the program begins execution. In the worst case, an interpreter
may reanalyze the local declarations within a subroutine each time that subroutine is called. If a call appears in a deeply nested loop, the savings achieved by a
compiler that is able to analyze the declarations only once may be very large. As
we shall see in the following section, a compiler will not usually be able to predict the address of a local variable at compile time, since space for the variable
will be allocated dynamically on a stack, but it can arrange for the variable to
appear at a fixed offset from the location pointed to by a certain register at run
time.
Some languages are difficult to compile because their definitions require certain fundamental decisions to be postponed until run time, generally in order to
increase the flexibility or expressiveness of the language. Smalltalk, for example,
delays all type checking until run time. All operations in Smalltalk are cast in the
form of “messages” to “objects.” A message is acceptable if and only if the object
provides a handler for it. References to objects of arbitrary types (classes) can
then be assigned into arbitrary named variables, as long as the program never
ends up sending a message to an object that is not prepared to handle it. This
form of polymorphism—allowing a variable name to refer to objects of multiple types—allows the Smalltalk programmer to write very general purpose code,
which will correctly manipulate objects whose types had yet to be fully defined
at the time the code was written. We will mention polymorphism again in Section 3.6.3, and discuss it further in Chapters 7 and 9.
3.2   Object Lifetime and Storage Management
In any discussion of names and bindings, it is important to distinguish between
names and the objects to which they refer, and to identify several key events:
The creation of objects
The creation of bindings
References to variables, subroutines, types, and so on, all of which use bindings
The deactivation and reactivation of bindings that may be temporarily unusable
The destruction of bindings
The destruction of objects
The period of time between the creation and the destruction of a name-toobject binding is called the binding’s lifetime. Similarly, the time between the
creation and destruction of an object is the object’s lifetime. These lifetimes need
not necessarily coincide. In particular, an object may retain its value and the potential to be accessed even when a given name can no longer be used to access it.
When a variable is passed to a subroutine by reference, for example (as it typically
is in Fortran or with var parameters in Pascal or “ & ” parameters in C++), the
binding between the parameter name and the variable that was passed has a lifetime shorter than that of the variable itself. It is also possible, though generally a
sign of a program bug, for a name-to-object binding to have a lifetime longer than
that of the object. This can happen, for example, if an object created via the C++
new operator is passed as a & parameter and then deallocated ( delete -ed) before the subroutine returns. A binding to an object that is no longer live is called
a dangling reference. Dangling references will be discussed further in Sections 3.5
and 7.7.2.
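The same hazard can be sketched in C, using malloc and free in place of the C++ new and delete operators described above; all of the names below are illustrative.

/* A dangling reference, sketched in C.  The object's lifetime ends inside the
   callee, but the caller's binding to it survives. */
#include <stdio.h>
#include <stdlib.h>

static void use_after_callee_frees(int *p) {
    free(p);                       /* the object's lifetime ends here ...            */
    /* ... but the caller's binding (and ours, via p) still exists: it is now a
       dangling reference, and any further use of *p is undefined behavior.          */
}

int main(void) {
    int *count = malloc(sizeof *count);
    if (count == NULL) return 1;
    *count = 42;
    use_after_callee_frees(count);
    /* printf("%d\n", *count);        WRONG: would dereference a dangling reference */
    count = NULL;                      /* common defensive practice */
    return 0;
}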
Object lifetimes generally correspond to one of three principal storage allocation mechanisms, used to manage the object’s space:
1. Static objects are given an absolute address that is retained throughout the
program’s execution.
2. Stack objects are allocated and deallocated in last-in, first-out order, usually
in conjunction with subroutine calls and returns.
3. Heap objects may be allocated and deallocated at arbitrary times. They require
a more general (and expensive) storage management algorithm.
3.2.1   Static Allocation
Global variables are the obvious example of static objects, but not the only one.
The instructions that constitute a program’s machine-language translation can
also be thought of as statically allocated objects. In addition, we shall see examples in Section 3.3.1 of variables that are local to a single subroutine but retain
their values from one invocation to the next; their space is statically allocated.
Numeric and string-valued constant literals are also statically allocated, for statements such as A = B/14.7 or printf("hello, world\n") . (Small constants
Figure 3.1   Static allocation of space for subroutines in a language or program without recursion.
EXAMPLE 3.1   Static allocation of local variables
are often stored within the instruction itself; larger ones are assigned a separate
location.) Finally, most compilers produce a variety of tables that are used by runtime support routines for debugging, dynamic type checking, garbage collection,
exception handling, and other purposes; these are also statically allocated. Statically allocated objects whose value should not change during program execution
(e.g., instructions, constants, and certain run-time tables) are often allocated in
protected, read-only memory so that any inadvertent attempt to write to them
will cause a processor interrupt, allowing the operating system to announce a
run-time error.
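C makes the idea of statically allocated subroutine-local variables directly visible through its static storage class, as in the following small sketch (not from the text):

/* A local variable whose space is statically allocated: it retains its value
   from one invocation to the next. */
#include <stdio.h>

static int next_serial_number(void) {
    static int count = 0;              /* statically allocated; initialized once */
    return ++count;
}

int main(void) {
    printf("%d\n", next_serial_number());   /* 1 */
    printf("%d\n", next_serial_number());   /* 2 */
    printf("%d\n", next_serial_number());   /* 3 */
    return 0;
}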
Logically speaking, local variables are created when their subroutine is called
and destroyed when it returns. If the subroutine is called repeatedly, each invocation is said to create and destroy a separate instance of each local variable. It is
not always the case, however, that a language implementation must perform work
at run time corresponding to these create and destroy operations. Recursion was
not originally supported in Fortran (it was added in Fortran 90). As a result, there
can never be more than one invocation of a subroutine active at any given time,
and a compiler may choose to use static allocation for local variables, effectively
arranging for the variables of different invocations to share the same locations,
and thereby avoiding any run-time overhead for creation and destruction (Figure 3.1).
In many languages a constant is required to have a value that can be determined at compile time. Usually the expression that specifies the constant’s value
is permitted to include only literal (manifest) constants and built-in functions
and arithmetic operators. These sorts of compile-time constants can always be
allocated statically, even if they are local to a recursive subroutine: multiple instances can share the same location. In other languages (e.g., C and Ada), constants are simply variables that cannot be changed after elaboration time. Their
values, though unchanging, can depend on other values that are not known until
run time. These elaboration-time constants, when local to a recursive subroutine,
must be allocated on the stack. C# provides both options, explicitly, with the
const and readonly keywords.
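A const-qualified local variable in C behaves like an elaboration-time constant in this sense; the following sketch (not from the text) shows one whose value is not known until run time and therefore occupies space in each recursive activation’s frame:

/* Elaboration-time constants, sketched in C.  frame_id is fixed once its
   declaration is elaborated, but its value depends on the argument, so each
   recursive activation gets its own stack-allocated instance. */
#include <stdio.h>

static void report(int depth) {
    const int frame_id = depth * 10;   /* value not known until run time */
    printf("frame_id = %d\n", frame_id);
    if (depth > 1)
        report(depth - 1);
}

int main(void) {
    report(3);                         /* prints 30, 20, 10 */
    return 0;
}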
Along with local variables and elaboration-time constants, the compiler typically stores a variety of other information associated with the subroutine, including the following.
Arguments and return values. Modern compilers tend to keep these in registers
when possible, but sometimes space in memory is needed.
Temporaries. These are usually intermediate values produced in complex calculations. Again, a good compiler will keep them in registers whenever possible.
Bookkeeping information. This may include the subroutine’s return address, a
reference to the stack frame of the caller (also called the dynamic link), additional saved registers, debugging information, and various other values that
we will study later.
3.2.2   Stack-Based Allocation

EXAMPLE 3.2   Layout of the run-time stack
If a language permits recursion, static allocation of local variables is no longer an
option, since the number of instances of a variable that may need to exist at the
same time is conceptually unbounded. Fortunately, the natural nesting of subroutine calls makes it easy to allocate space for locals on a stack. A simplified
picture of a typical stack appears in Figure 3.2. Each instance of a subroutine at
run time has its own frame (also called an activation record) on the stack, containing arguments and return values, local variables, temporaries, and bookkeeping
DESIGN & IMPLEMENTATION: Recursion in Fortran
The lack of recursion in (pre-Fortran 90) Fortran is generally attributed to the expense of stack manipulation on the IBM 704, on which the language was first implemented. Many (perhaps most) Fortran implementations choose to use a stack for local variables, but because the language definition permits the use of static allocation instead, Fortran programmers were denied the benefits of language-supported recursion for over 30 years.

Figure 3.2   Stack-based allocation of space for subroutines. We assume here that subroutine A has been called by the main program and that it then calls subroutine B. Subroutine B subsequently calls C, which in turn calls D. At any given time, the stack pointer ( sp ) register points to the first unused location on the stack (or the last used location on some machines), and the frame pointer ( fp ) register points to a known location within the frame (activation record) of the current subroutine. The relative order of fields within a frame may vary from machine to machine and compiler to compiler.
information. Arguments to be passed to subsequent routines lie at the top of the
frame, where the callee can easily find them. The organization of the remaining information is implementation-dependent: it varies from one language and
compiler to another.
Maintenance of the stack is the responsibility of the subroutine calling sequence—the code executed by the caller immediately before and after the call—
and of the prologue (code executed at the beginning) and epilogue (code executed
at the end) of the subroutine itself. Sometimes the term “calling sequence” is used
to refer to the combined operations of the caller, the prologue, and the epilogue.
We will study calling sequences in more detail in Section 8.2.
While the location of a stack frame cannot be predicted at compile time (the
compiler cannot in general tell what other frames may already be on the stack),
the offsets of objects within a frame usually can be statically determined. Moreover, the compiler can arrange (in the calling sequence or prologue) for a particular register, known as the frame pointer, to always point to a known location
within the frame of the current subroutine. Code that needs to access a local variable within the current frame, or an argument near the top of the calling frame,
can do so by adding a predetermined offset to the value in the frame pointer. As
we shall see in Section 5.3.1, almost every processor provides an addressing mode
that allows this addition to be specified implicitly as part of an ordinary load
or store instruction. The stack grows “downward” toward lower addresses in
most language implementations. Some machines provide special push and pop
instructions that assume this direction of growth. Arguments and returns typically have positive offsets from the frame pointer; local variables, temporaries,
and bookkeeping information typically have negative offsets.
Even in a language without recursion, it can be advantageous to use a stack for
local variables, rather than allocating them statically. In most programs the pattern of potential calls among subroutines does not permit all of those subroutines
to be active at the same time. As a result, the total space needed for local variables
of currently active subroutines is seldom as large as the total space across all subroutines, active or not. A stack may therefore require substantially less memory
at run time than would be required for static allocation.
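A small C routine (ours, not the book's) makes the point about recursion concrete: every active call needs its own instance of the local variable, so no fixed set of statically allocated locations can suffice.

/* Each active call of depth_sum has its own frame, and hence its own copy
   of "local"; with static allocation, all activations would have to share
   a single location, and the computation would be wrong.                  */
int depth_sum(int n)
{
    int local = n;                 /* a distinct instance per activation */
    if (n == 0)
        return 0;
    return local + depth_sum(n - 1);
}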
3.2.3 Heap-Based Allocation

EXAMPLE 3.3 External fragmentation in the heap
A heap is a region of storage in which subblocks can be allocated and deallocated
at arbitrary times.2 Heaps are required for the dynamically allocated pieces of
linked data structures and for dynamically resized objects, such as fully general
character strings, lists, and sets, whose size may change as a result of an assignment statement or other update operation.
There are many possible strategies to manage space in a heap. We review the
major alternatives here; details can be found in any data-structures textbook. The
principal concerns are speed and space, and as usual there are tradeoffs between
them. Space concerns can be further subdivided into issues of internal and external fragmentation. Internal fragmentation occurs when a storage-management
algorithm allocates a block that is larger than required to hold a given object; the
extra space is then unused. External fragmentation occurs when the blocks that
have been assigned to active objects are scattered through the heap in such a way
that the remaining, unused space is composed of multiple blocks: there may be
quite a lot of free space, but no one piece of it may be large enough to satisfy some
future request (see Figure 3.3).
Many storage-management algorithms maintain a single linked list—the free
list—of heap blocks not currently in use. Initially the list consists of a single block
comprising the entire heap. At each allocation request the algorithm searches
the list for a block of appropriate size.

2 Unfortunately, the term heap is also used for a common tree-based implementation of a priority queue. These two uses of the term have nothing to do with one another.

Figure 3.3   External fragmentation. The shaded blocks are in use; the clear blocks are free. While there is more than enough total free space remaining to satisfy an allocation request of the illustrated size, no single remaining block is large enough.

With a first fit algorithm we select the
first block on the list that is large enough to satisfy the request. With a best fit
algorithm we search the entire list to find the smallest block that is large enough
to satisfy the request. In either case, if the chosen block is significantly larger than
required, then we divide it in two and return the unneeded portion to the free list
as a smaller block. (If the unneeded portion is below some minimum threshold
in size, we may leave it in the allocated block as internal fragmentation.) When a
block is deallocated and returned to the free list, we check to see whether either
or both of the physically adjacent blocks are free; if so, we coalesce them.
Intuitively, one would expect a best fit algorithm to do a better job of reserving
large blocks for large requests. At the same time, it has a higher allocation cost
than a first fit algorithm, because it must always search the entire list, and it tends
to result in a larger number of very small “leftover” blocks. Which approach—
first fit or best fit—results in lower external fragmentation depends on the distribution of size requests.
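To make the free-list mechanics concrete, here is a deliberately simplified first fit allocator in C. It is a sketch only: the arena size, header layout, splitting threshold, and the names arena_init and arena_alloc are our own choices, and a real allocator would also have to handle alignment, coalescing on deallocation, and concurrency.

#include <stddef.h>

typedef struct block {
    size_t size;                     /* usable bytes in this free block */
    struct block *next;              /* next block on the free list     */
} block;

#define ARENA_SIZE 4096
#define MIN_SPLIT  16                /* leftovers smaller than this remain as
                                        internal fragmentation */

static unsigned char arena[ARENA_SIZE];
static block *free_list = NULL;

void arena_init(void)
{
    free_list = (block *) arena;     /* initially one block spans the whole heap */
    free_list->size = ARENA_SIZE - sizeof(block);
    free_list->next = NULL;
}

void *arena_alloc(size_t n)          /* first fit */
{
    block **prev = &free_list;
    for (block *b = free_list; b != NULL; prev = &b->next, b = b->next) {
        if (b->size < n)
            continue;                            /* too small; keep searching */
        if (b->size >= n + sizeof(block) + MIN_SPLIT) {
            /* split: the far end of the block becomes a new, smaller free block */
            block *rest = (block *) ((unsigned char *) (b + 1) + n);
            rest->size = b->size - n - sizeof(block);
            rest->next = b->next;
            b->size = n;
            *prev = rest;
        } else {
            *prev = b->next;                     /* allocate the whole block */
        }
        return b + 1;                            /* payload follows the header */
    }
    return NULL;                                 /* no single block is large enough */
}

A best fit version would scan the entire list, remembering the smallest block that is large enough, rather than returning at the first adequate one.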
In any algorithm that maintains a single free list, the cost of allocation is linear in the number of free blocks. To reduce this cost to a constant, some storage
management algorithms maintain separate free lists for blocks of different sizes.
Each request is rounded up to the next standard size (at the cost of internal fragmentation) and allocated from the appropriate list. In effect, the heap is divided
into “pools,” one for each standard size. The division may be static or dynamic.
Two common mechanisms for dynamic pool adjustment are known as the buddy
system and the Fibonacci heap. In the buddy system, the standard block sizes are
powers of two. If a block of size 2^k is needed, but none is available, a block of size 2^(k+1) is split in two. One of the halves is used to satisfy the request; the other is placed on the kth free list. When a block is deallocated, it is coalesced with its “buddy”—the other half of the split that created it—if that buddy is free. Fibonacci heaps are similar, but they use Fibonacci numbers for the standard sizes, instead of powers of two. The algorithm is slightly more complex but leads to slightly lower internal fragmentation because the Fibonacci sequence grows more slowly than 2^k.
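One attraction of the power-of-two buddy system is that a block's buddy can be computed rather than searched for: if every block of size 2^k is aligned on a multiple of its size, the buddy of the block at heap offset o is the block whose offset has bit k flipped. A minimal C sketch (the helper buddy_of and the sample values are ours):

#include <stdint.h>
#include <stdio.h>

/* Address of a block's buddy in a binary buddy allocator: flip the bit
   corresponding to the block size 2^k within the block's heap offset.  */
uintptr_t buddy_of(uintptr_t heap_base, uintptr_t block_addr, unsigned k)
{
    uintptr_t offset = block_addr - heap_base;
    return heap_base + (offset ^ ((uintptr_t) 1 << k));
}

int main(void)
{
    uintptr_t base = 0x10000;
    /* the 64-byte (2^6) block at offset 192 was split off from the block at
       offset 128, so its buddy is the block at offset 128                   */
    printf("buddy at offset %lu\n",
           (unsigned long) (buddy_of(base, base + 192, 6) - base));
    return 0;
}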
The problem with external fragmentation is that the ability of the heap to satisfy requests may degrade over time. Multiple free lists may help, by clustering
small blocks in relatively close physical proximity, but they do not eliminate the
problem. It is always possible to devise a sequence of requests that cannot be satisfied, even though the total space required is less than the size of the heap. If size
pools are statically allocated, one need only exceed the maximum number of requests of a given size. If pools are dynamically readjusted, one can “checkerboard”
the heap by allocating a large number of small blocks and then deallocating every
other one, in order of physical address, leaving an alternating pattern of small free
and allocated blocks. To eliminate external fragmentation, we must be prepared
to compact the heap, by moving already-allocated blocks. This task is complicated
by the need to find and update all outstanding references to a block that is being
moved. We will discuss compaction further in Sections 7.7.2 and 7.7.3.
3.2.4 Garbage Collection
Allocation of heap-based objects is always triggered by some specific operation
in a program: instantiating an object, appending to the end of a list, assigning a
long value into a previously short string, and so on. Deallocation is also explicit in
some languages (e.g., C, C++, and Pascal). As we shall see in Section 7.7, however,
many languages specify that objects are to be deallocated implicitly when it is no
longer possible to reach them from any program variable. The run-time library
for such a language must then provide a garbage collection mechanism to identify
and reclaim unreachable objects. Most functional languages require garbage collection, as do many more recent imperative languages, including Modula-3, Java,
C#, and all the major scripting languages.
The traditional arguments in favor of explicit deallocation are implementation simplicity and execution speed. Even naive implementations of automatic
garbage collection add significant complexity to the implementation of a language with a rich type system, and even the most sophisticated garbage collector
can consume nontrivial amounts of time in certain programs. If the programmer
can correctly identify the end of an object’s lifetime, without too much run-time
bookkeeping, the result is likely to be faster execution.
The argument in favor of automatic garbage collection, however, is compelling: manual deallocation errors are among the most common and costly
bugs in real-world programs. If an object is deallocated too soon, the program
may follow a dangling reference, accessing memory now used by another object.
If an object is not deallocated at the end of its lifetime, then the program may
“leak memory,” eventually running out of heap space. Deallocation errors are
notoriously difficult to identify and fix. Over time, both language designers and
programmers have increasingly come to consider automatic garbage collection
an essential language feature. Garbage-collection algorithms have improved, reducing their run-time overhead; language implementations have become more
complex in general, reducing the marginal complexity of automatic collection;
and leading-edge applications have become larger and more complex, making
the benefits of automatic collection ever more appealing.
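Both failure modes are easy to reproduce in C. The fragment below (our illustration, not the book's) sketches each of them; the dangling use is commented out because its behavior is undefined.

#include <stdlib.h>
#include <string.h>

void deallocation_errors(void)
{
    /* Dangling reference: the object is deallocated too soon. */
    char *p = malloc(16);
    if (p == NULL) return;
    strcpy(p, "hello");
    free(p);
    /* p[0] = 'H';     undefined behavior: the space may already belong
                       to some other object                              */

    /* Memory leak: the only pointer to the first block is overwritten,
       so that block can never be reclaimed.                             */
    char *q = malloc(1 << 20);
    q = malloc(1 << 20);
    free(q);                         /* frees only the second block */
}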
CHECK YOUR UNDERSTANDING
1. What is binding time?
2. Explain the distinction between decisions that are bound statically and those
that are bound dynamically.
3. What is the advantage of binding things as early as possible? What is the advantage of delaying bindings?
4. Explain the distinction between the lifetime of a name-to-object binding and
its visibility.
5. What determines whether an object is allocated statically, on the stack, or in
the heap?
6. List the objects and information commonly found in a stack frame.
7. What is a frame pointer? What is it used for?
8. What is a calling sequence?
9. What are internal and external fragmentation?
10. What is garbage collection?
11. What is a dangling reference?
3.3 Scope Rules
The textual region of the program in which a binding is active is its scope. In
most modern languages, the scope of a binding is determined statically—that
is, at compile time. In C, for example, we introduce a new scope upon entry
to a subroutine. We create bindings for local objects and deactivate bindings for
global objects that are “hidden” by local objects of the same name. On subroutine
exit, we destroy bindings for local variables and reactivate bindings for any global
objects that were hidden. These manipulations of bindings may at first glance appear to be run-time operations, but they do not require the execution of any code:
the portions of the program in which a binding is active are completely determined at compile time. We can look at a C program and know which names refer
to which objects at which points in the program based on purely textual rules. For
this reason, C is said to be statically scoped (some authors say lexically scoped 3 ).
3 Lexical scope is actually a better term than static scope, because scope rules based on nesting can
be enforced at run time instead of compile time if desired. In fact, in Common Lisp and Scheme
it is possible to pass the unevaluated text of a subroutine declaration into some other subroutine
as a parameter, and then use the text to create a lexically nested declaration at run time.
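To make this concrete for C, here is a small example of our own: which declaration each use of x denotes is fixed by the program text alone, and requires no run-time bookkeeping.

#include <stdio.h>

int x = 10;                 /* global declaration of x */

void sub(void)
{
    int x = 20;             /* local x hides the global throughout sub */
    printf("%d\n", x);      /* prints 20 */
}

int main(void)
{
    sub();
    printf("%d\n", x);      /* prints 10: the global binding is visible again */
    return 0;
}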
Other languages, including APL, Snobol, and early dialects of Lisp, are dynamically scoped: their bindings depend on the flow of execution at run time. We will
examine static and dynamic scope in more detail in Sections 3.3.1 and 3.3.6.
In addition to talking about the “scope of a binding,” we sometimes use the
word scope as a noun all by itself, without a specific binding in mind. Informally,
a scope is a program region of maximal size in which no bindings change (or
at least none are destroyed—more on this in Section 3.3.3). Typically, a scope
is the body of a module, class, subroutine, or structured control flow statement,
sometimes called a block. In C family languages it would be delimited with {...}
braces.
Algol 68 and Ada use the term elaboration to refer to the process by which
declarations become active when control first enters a scope. Elaboration entails
the creation of bindings. In many languages, it also entails the allocation of stack
space for local objects, and possibly the assignment of initial values. In Ada it
can entail a host of other things, including the execution of error-checking or
heap-space-allocating code, the propagation of exceptions, and the creation of
concurrently executing tasks (to be discussed in Chapter 12).
At any given point in a program’s execution, the set of active bindings is called
the current referencing environment. The set is principally determined by static
or dynamic scope rules. We shall see that a referencing environment generally
corresponds to a sequence of scopes that can be examined (in order) to find the
current binding for a given name.
In some cases, referencing environments also depend on what are (in a confusing use of terminology) called binding rules. Specifically, when a reference to a
subroutine S is stored in a variable, passed as a parameter to another subroutine,
or returned as a function value, one needs to determine when the referencing
environment for S is chosen—that is, when the binding between the reference to
S and the referencing environment of S is made. The two principal options are
deep binding, in which the choice is made when the reference is first created, and
shallow binding, in which the choice is made when the reference is finally used.
We will examine these options in more detail in Section 3.5.
3.3.1 Static Scope
In a language with static (lexical) scoping, the bindings between names and objects can be determined at compile time by examining the text of the program,
without consideration of the flow of control at run time. Typically, the “current”
binding for a given name is found in the matching declaration whose block most
closely surrounds a given point in the program, though as we shall see there are
many variants on this basic theme.
The simplest static scope rule is probably that of early versions of Basic, in
which there was only a single, global scope. In fact, there were only a few hundred
possible names, each of which consisted of a letter optionally followed by a digit.
There were no explicit declarations; variables were declared implicitly by virtue
of being used.
Scope rules are somewhat more complex in Fortran, though not much more.4
Fortran distinguishes between global and local variables. The scope of a local
variable is limited to the subroutine in which it appears; it is not visible elsewhere.
Variable declarations are optional. If a variable is not declared, it is assumed to be
local to the current subroutine and to be of type integer if its name begins with
the letters I–N, or real otherwise. (Different conventions for implicit declarations can be specified by the programmer. In Fortran 90, the programmer can
also turn off implicit declarations, so that use of an undeclared variable becomes
a compile-time error.)
Global variables in Fortran may be partitioned into common blocks, which are
then “imported” by subroutines. Common blocks are designed to support separate
compilation: they allow a subroutine to import only a subset of the global environment. Unfortunately, Fortran requires each subroutine to declare the names
and types of the variables in each of the common blocks it uses, and there is
no standard mechanism to ensure that the declarations in different subroutines
are the same. In fact, Fortran explicitly allows the declarations to be different.
A programmer who knows the data layout rules employed by the compiler can
use a completely different set of names and types in one subroutine to refer to
the data defined in another subroutine. The underlying bits will be shared, but
the effect of this sharing is highly implementation-dependent. A similar effect
can be achieved through the (mis)use of equivalence statements, which allow the programmer to specify that a set of variables share the same location(s).
Equivalence statements are a precursor of the variant records and unions of
languages like Pascal and C. Their intended purpose is to save space in programs
in which only one of the equivalence -ed variables is in use at any one time.
Semantically, the lifetime of a local Fortran variable (both the object itself
and the name-to-object binding) encompasses a single execution of the variable’s
subroutine. Programmers can override this rule by using an explicit save statement. A save -ed variable has a lifetime that encompasses the entire execution
of the program. Instead of a logically separate object for every invocation of the
subroutine, the save statement creates a single object that retains its value from
one invocation of the subroutine to the next. (The name-to-variable binding, of
course, is inactive when the subroutine is not executing, because the name is out
of scope.)
In early implementations of Fortran, it was common for all local variables to
behave as if they were save -ed, because language implementations employed the
static allocation strategy described in Section 3.2. It is a dangerous practice to
4 Fortran and C have evolved considerably over the years. Unless otherwise noted, comments
in this text apply to the Fortran 77 dialect [Ame78a] (still more widely used than the newer
Fortran 90). Comments on C refer to all versions of the language (including the C99 standard [Int99]) unless otherwise noted. Comments on Ada, likewise, refer to both Ada 83 [Ame83]
and Ada 95 [Int95b] unless otherwise noted.
depend on this implementation artifact, however, because it is not guaranteed
by the language definition. In a Fortran compiler that uses a stack to save space,
or that exploits knowledge of the patterns of calls among subroutines to overlap
statically allocated space (Exercise 3.10), non- save -ed variables may not retain
their values from one invocation to the next.
3.3.2 Nested Subroutines

EXAMPLE 3.4 Nested scopes
The ability to nest subroutines inside each other, introduced in Algol 60, is a feature of many modern languages, including Pascal, Ada, ML, Scheme, and Common Lisp. Other languages, including C and its descendants, allow classes or
other scopes to nest. Just as the local variables of a Fortran subroutine are not
visible to other subroutines, any constants, types, variables, or subroutines declared within a block are not visible outside that block in Algol-family languages.
More formally, Algol-style nesting gives rise to the closest nested scope rule for
resolving bindings from names to objects: a name that is introduced in a declaration is known in the scope in which it is declared, and in each internally nested
scope, unless it is hidden by another declaration of the same name in one or more
nested scopes. To find the object referenced by a given use of a name, we look for
a declaration with that name in the current, innermost scope. If there is one, it
defines the active binding for the name. Otherwise, we look for a declaration in
the immediately surrounding scope. We continue outward, examining successively surrounding scopes, until we reach the outer nesting level of the program,
where global objects are declared. If no declaration is found at any level, then the
program is in error.
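In C, which nests blocks (though not subroutines), the outward search looks like this; the example is ours.

#include <stdio.h>

int g = 1;                          /* outermost (global) scope */

int main(void)
{
    int m = 2;                      /* scope of main's body */
    {
        int i = 3;                  /* innermost block */
        /* i is found in the innermost scope, m one level out, and g only
           at the outermost (global) level                                */
        printf("%d %d %d\n", i, m, g);      /* prints 3 2 1 */
    }
    return 0;
}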
Many languages provide a collection of built-in, or predefined, objects, such as
I/O routines, trigonometric functions, and in some cases types such as integer
and char . It is common to consider these to be declared in an extra, invisible,
outermost scope, which surrounds the scope in which global objects are declared.
The search for bindings described in the previous paragraph terminates at this extra, outermost scope, if it exists, rather than at the scope in which global objects
are declared. This outermost scope convention makes it possible for a programmer to define a global object whose name is the same as that of some predefined
object (whose “declaration” is thereby hidden, making it unusable).
An example of nested scopes appears in Figure 3.4.5 In this example, procedure
P2 is called only by P1 , and need not be visible outside. It is therefore declared
inside P1 , limiting its scope (its region of visibility) to the portion of the program
shown here. In a similar fashion, P4 is visible only within P1 , P3 is visible only
within P2 , and F1 is visible only within P4 . Under the standard rules for nested
scopes, F1 could call P2 , and P4 could call F1 , but P2 could not call F1 .
5 This code is not contrived; it was extracted from an implementation of the FMQ error repair algorithm described in Section 2.3.4.
procedure P1(A1 : T1);
var X : real;
    ...
    procedure P2(A2 : T2);
        ...
        procedure P3(A3 : T3);
            ...
        begin
            ...
            (* body of P3 *)
        end;
        ...
    begin
        ...
        (* body of P2 *)
    end;
    ...
    procedure P4(A4 : T4);
        ...
        function F1(A5 : T5) : T6;
        var X : integer;
            ...
        begin
            ...
            (* body of F1 *)
        end;
        ...
    begin
        ...
        (* body of P4 *)
    end;
    ...
begin
    ...
    (* body of P1 *)
end

Figure 3.4   Example of nested subroutines in Pascal.
Though they are hidden from the rest of the program, nested subroutines are
able to access the parameters and local variables (and other local objects) of the
surrounding scope(s). In our example, P3 can name (and modify) A1 , X , and A2 ,
in addition to A3 . Because P1 and F1 both declare local variables named X , the
inner declaration hides the outer one within a portion of its scope. Uses of X in
F1 refer to the inner X ; uses of X in other regions of the code shown here refer to
the outer X .
A name-to-object binding that is hidden by a nested declaration of the same
name is said to have a hole in its scope. In most languages the object whose name
is hidden is inaccessible in the nested scope (unless it has more than one name).
Some languages allow the programmer to access the outer meaning of a name by
applying a qualifier or scope resolution operator. In Ada, for example, a name may
be prefixed by the name of the scope in which it is declared, using syntax that
resembles the specification of fields in a record. My_proc.X , for example, refers
to the declaration of X in subroutine My_proc , regardless of whether some other
X has been declared in a lexically closer scope. In C++, which does not allow
subroutines to nest, ::X refers to a global declaration of X , regardless of whether
the current subroutine also has an X .6
Access to Nonlocal Objects

EXAMPLE 3.5 Static chains
We have already seen that the compiler can arrange for a frame pointer register to
point to the frame of the currently executing subroutine at run time. Target code
can use this register to access local objects, as well as any objects in surrounding
scopes that are still within the same subroutine. But what about objects in lexically surrounding subroutines? To find these we need a way to find the frames
corresponding to those scopes at run time. Since a deeply nested subroutine may
call a routine in an outer scope, it is not the case that the lexically surrounding
scope corresponds to the caller’s scope at run time. At the same time, we can be
sure that there is some frame for the surrounding scope somewhere below in the
stack, since the current subroutine could not have been called unless it was visible, and it could not have been visible unless the surrounding scope was active.
(It is actually possible in some languages to save a reference to a nested subroutine and then call it when the surrounding scope is no longer active. We defer this
possibility to Section 3.5.2.)
The simplest way in which to find the frames of surrounding scopes is to maintain a static link in each frame that points to the “parent” frame: the frame of the
most recent invocation of the lexically surrounding subroutine. If a subroutine is
declared at the outermost nesting level of the program, then its frame will have a
null static link at run time. If a subroutine is nested k levels deep, then its frame’s
static link, and those of its parent, grandparent, and so on, will form a static chain
of length k at run time. To find a variable or parameter declared j subroutine
scopes outward, target code at run time can dereference the static chain j times,
and then add the appropriate offset. Static chains are illustrated in Figure 3.5. We
will discuss the code required to maintain them in Section 8.2.
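The effect can be mimicked in ordinary C by representing frames explicitly, as in the sketch below. The structure is purely illustrative (C has no nested subroutines, and the names frame, static_link, and nonlocal_fetch are ours); a compiler would generate the equivalent loads inline.

#include <stdio.h>

typedef struct frame {
    struct frame *static_link;   /* frame of the lexically surrounding routine */
    int locals[4];               /* local variables, at fixed offsets           */
} frame;

/* Fetch the local at offset "off", declared j subroutine scopes outward:
   dereference the static chain j times, then apply the offset.            */
int nonlocal_fetch(frame *fp, int j, int off)
{
    while (j-- > 0)
        fp = fp->static_link;
    return fp->locals[off];
}

int main(void)
{
    frame outer  = { NULL,    { 11, 12, 13, 14 } };
    frame middle = { &outer,  { 21, 22, 23, 24 } };
    frame inner  = { &middle, { 31, 32, 33, 34 } };
    printf("%d\n", nonlocal_fetch(&inner, 2, 1));   /* prints 12: two levels out */
    return 0;
}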
Figure 3.5   Static chains. Subroutines A , B , C , D , and E are nested as shown on the left. If the sequence of nested calls at run time is A , E , B , D , and C , then the static links in the stack will look as shown on the right. The code for subroutine C can find local objects at known offsets from the frame pointer. It can find local objects of the surrounding scope, B , by dereferencing its static chain once and then applying an offset. It can find local objects in B 's surrounding scope, A , by dereferencing its static chain twice and then applying an offset.

6 The C++ :: operator is also used to name members (fields or methods) of a base class that are hidden by members of a derived class; we will consider this use in Section 9.2.2.

3.3.3 Declaration Order

In our discussion so far we have glossed over an important subtlety: suppose an object x is declared somewhere within block B . Does the scope of x include the portion of B before the declaration, and if so, can x actually be used in that portion of the code? Put another way, can an expression E refer to any name
declared in the current scope, or only to names that are declared before E in the
scope?
Several early languages, including Algol 60 and Lisp, required that all declarations appear at the beginning of their scope. One might at first think that this rule
would avoid the questions in the preceding paragraph, but it does not, because
declarations may refer to one another.7
DESIGN & IMPLEMENTATION
Mutual recursion
Some Algol 60 compilers were known to process the declarations of a scope in
program order. This strategy had the unfortunate effect of implicitly outlawing
mutually recursive subroutines and types, something the language designers
clearly did not intend [Atk73].
7 We saw an example of mutually recursive subroutines in the recursive descent parsing of Section 2.3.1. Mutually recursive types frequently arise in linked data structures, where nodes of
two types may need to point to each other.
EXAMPLE 3.6 A “gotcha” in declare-before-use
In an apparent attempt at simplification, Pascal modified the requirement to
say that names must be declared before they are used (with special-case mechanisms to accommodate recursive types and subroutines). At the same time, however, Pascal retained the notion that the scope of a declaration is the entire surrounding block. These two rules can interact in surprising ways:
1. const N = 10;
2. ...
3. procedure foo;
4. const
5.     M = N;      (* static semantic error! *)
6.     ...
7.     N = 20;     (* additional constant declaration; hides the outer N *)
Pascal says that the second declaration of N covers all of foo , so the semantic
analyzer should complain on line 5 that N is being used before its declaration.
The error has the potential to be highly confusing, particularly if the programmer
meant to use the outer N :
const N = 10;
...
procedure foo;
const
    M = N;                           (* static semantic error! *)
var
    A : array [1..M] of integer;
    N : real;                        (* hiding declaration *)
EXAMPLE 3.7 Whole-block scope in C#
Here the pair of messages “ N used before declaration” and “ N is not a constant”
are almost certainly not helpful.
In order to determine the validity of any declaration that appears to use a
name from a surrounding scope, a Pascal compiler must scan the remainder of
the scope’s declarations to see if the name is hidden. To avoid this complication,
most Pascal successors (and some dialects of Pascal itself) specify that the scope
of an identifier is not the entire block in which it is declared (excluding holes), but
rather the portion of that block from the declaration to the end (again excluding
holes). If our program fragment had been written in Ada, for example, or in C,
C++, or Java, no semantic errors would be reported. The declaration of M would
refer to the first (outer) declaration of N .
C++ and Java further relax the rules by dispensing with the define-before-use
requirement in many cases. In both languages, members of a class (including
those that are not defined until later in the program text) are visible inside all
of the class’s methods. In Java, classes themselves can be declared in any order.
Interestingly, while C# echoes Java in requiring declaration before use for local
variables (but not for classes and members), it returns to the Pascal notion of
whole-block scope. Thus the following is invalid in C#.
class A {
    const int N = 10;
    void foo() {
        const int M = N;     // uses inner N before it is declared
        const int N = 20;
    }
}

EXAMPLE 3.8 "Local if written" in Python
EXAMPLE 3.9 Declaration order in Scheme
Perhaps the simplest approach to declaration order, from a conceptual point
of view, is that of Modula-3, which says that the scope of a declaration is the
entire block in which it appears (minus any holes created by nested declarations)
and that the order of declarations doesn’t matter. The principal objection to this
approach is that programmers may find it counterintuitive to use a local variable
before it is declared. Python takes the “whole block” scope rule one step further
by dispensing with variable declarations altogether. In their place it adopts the
unusual convention that the local variables of subroutine S are precisely those
variables that are written by some statement in the (static) body of S . If S is
nested inside of T , and the name x appears on the left-hand side of assignment
statements in both S and T , then the x ’s are distinct: there is one in S and one
in T . Nonlocal variables are read-only unless explicitly imported (using Python’s
global statement).
In the interest of flexibility, modern Lisp dialects tend to provide several options for declaration order. In Scheme, for example, the letrec and let* constructs define scopes with, respectively, whole-block and declaration-to-end-of-block semantics. The most frequently used construct, let , provides yet another
option:
(let ((A 1))      ; outer scope, with A defined to be 1
  (let ((A 2)     ; inner scope, with A defined to be 2
        (B A))    ;    and B defined to be A
    B))           ; return the value of B
Here the nested declarations of A and B don’t take effect until after the end of
the declaration list. Thus B is defined to be the outer A , and the code as a whole
returns 1.
Declarations and Definitions

EXAMPLE 3.10 Declarations v. definitions in C
Given the requirement that names be declared before they can be used, languages
like Pascal, C, and C++ require special mechanisms for recursive types and subroutines. Pascal handles the former by making pointers an exception to the rules
and the latter by introducing so-called forward declarations. C and C++ handle
both cases uniformly, by distinguishing between the declaration of an object and
its definition. Informally, a declaration introduces a name and indicates its scope.
A definition describes the thing to which the name is bound. If a declaration is
not complete enough to be a definition, then a separate definition must appear
elsewhere in the scope. In C we can write
struct manager;                      /* declaration only */

struct employee {
    struct manager *boss;
    struct employee *next_employee;
    ...
};

struct manager {                     /* definition */
    struct employee *first_employee;
    ...
};
and
void list_tail(follow_set fs);       /* declaration only */

void list(follow_set fs)
{
    switch (input_token) {
        case id : match(id); list_tail(fs);
        ...
    }
}

void list_tail(follow_set fs)        /* definition */
{
    switch (input_token) {
        case comma : match(comma); list(fs);
        ...
    }
}
Nested Blocks
In many languages, including Algol 60, C89, and Ada, local variables can be declared not only at the beginning of any subroutine, but also at the top of any
begin ... end ( {...} ) block. Other languages, including Algol 68, C99, and all of C's descendants, are even more flexible, allowing declarations wherever a statement may appear. In most languages a nested declaration hides any outer declaration with the same name (Java and C# make it a static semantic error if the outer declaration is local to the current subroutine).

DESIGN & IMPLEMENTATION
Redeclarations
Some languages, particularly those that are intended for interactive use, permit the programmer to redeclare an object: to create a new binding for a given name in a given scope. Interactive programmers commonly use redeclarations to fix bugs. In most interactive languages, the new meaning of the name replaces the old in all contexts. In ML, however, the old meaning of the name may remain accessible to functions that were elaborated before the name was redeclared. This design choice in ML can sometimes be counterintuitive. It probably reflects the fact that ML is usually compiled, bit by bit on the fly, rather than interpreted. A language like Scheme, which is lexically scoped but usually interpreted, stores the binding for a name in a known location. A program accesses the meaning of the name indirectly through that location: if the meaning of the name changes, all accesses to the name will use the new meaning. In ML, previously elaborated functions have already been compiled into a form (often machine code) that accesses the meaning of the name directly.

EXAMPLE 3.11 Inner declarations in C
Variables declared in nested blocks can be very useful, as for example in the
following C code.
{
    int temp = a;
    a = b;
    b = temp;
}
Keeping the declaration of temp lexically adjacent to the code that uses it makes
the program easier to read, and eliminates any possibility that this code will interfere with another variable named temp .
No run-time work is needed to allocate or deallocate space for variables declared in nested blocks; their space can be included in the total space for local
variables allocated in the subroutine prologue and deallocated in the epilogue.
Exercise 3.9 considers how to minimize the total space required.
CHECK YOUR UNDERSTANDING
12. What do we mean by the scope of a name-to-object binding?
13. Describe the difference between static and dynamic scope.
14. What is elaboration?
15. What is a referencing environment?
16. Explain the closest nested scope rule.
17. What is the purpose of a scope resolution operator?
18. What is a static chain? What is it used for?
19. What are forward references? Why are they prohibited or restricted in many
programming languages?
20. Explain the difference between a declaration and a definition. Why is the distinction important?
3.3.4 Modules
A major challenge in the construction of any large body of software is how to
divide the effort among programmers in such a way that work can proceed on
multiple fronts simultaneously. This modularization of effort depends critically
on the notion of information hiding, which makes objects and algorithms invisible, whenever possible, to portions of the system that do not need them. Properly modularized code reduces the “cognitive load” on the programmer by minimizing the amount of information required to understand any given portion of the system. In a well-designed program the interfaces between modules are as “narrow” (i.e., simple) as possible, and any design decision that is likely to change is hidden inside a single module. This latter point is crucial, since maintenance (bug fixes and enhancement) consumes many more programmer years than does initial construction for most commercial software.

/*
    Place into *s a new name beginning with the letter l and
    continuing with the ascii representation of an integer guaranteed
    to be distinct in each separate call.  s is assumed to point to
    space large enough to hold any such name; for the short ints used
    here, seven characters suffice.  l is assumed to be an upper or
    lower-case letter.  sprintf 'prints' formatted output to a string.
*/
void gen_new_name(char *s, char l) {
    static short int name_nums[52];
        /* C guarantees that static local variables without explicit
           initial values are initialized as if explicitly set to zero. */
    int index = (l >= 'a' && l <= 'z') ? l-'a' : 26 + l-'A';
    name_nums[index]++;
    sprintf(s, "%c%d\0", l, name_nums[index]);
}

Figure 3.6   C code to illustrate the use of static variables.

EXAMPLE 3.12 Static variables in C
In addition to reducing cognitive load, information hiding has several more
pedestrian benefits. First, it reduces the risk of name conflicts: with fewer visible
names, there is less chance that a newly introduced name will be the same as
one already in use. Second, it safeguards the integrity of data abstractions: any
attempt to access objects outside of the subroutine(s) to which they belong will
cause the compiler to issue an “undefined symbol” error message. Third, it helps
to compartmentalize run-time errors: if a variable takes on an unexpected value,
we can generally be sure that the code that modified it is in the variable’s scope.
Unfortunately, the information hiding provided by nested subroutines is limited to objects whose lifetime is the same as that of the subroutine in which they
are hidden. When control returns from a subroutine, its local variables will no
longer be live: their values will be discarded. We have seen a partial solution to
this problem in the form of the save statement in Fortran. A similar directive
exists in several other languages: the own variables of Algol and the static variables of C, for example, retain their values from one invocation of a subroutine
to the next.
As an example of the use of static variables, consider the code in Figure 3.6.
The subroutine gen_new_name can be used to generate a series of distinct
character-string names. A compiler could use these in its assembly language output. Labels, for example, might be named L1 , L2 , L3 , and so on; subroutines could be named S1 , S2 , S3 , and so on.

EXAMPLE 3.13 Stack module in Modula-2
Static variables allow a subroutine like gen_new_name to have “memory”—
to retain information from one invocation to the next—while protecting that
memory from accidental access or modification by other parts of the program.
Put another way, static variables allow programmers to build single-subroutine
abstractions. Unfortunately, they do not allow the construction of abstractions
whose interface needs to consist of more than one subroutine. Suppose, for example, that we wish to construct a stack abstraction. We should like to hide the
representation of the stack—its internal structure—from the rest of the program,
so that it can be accessed only through its push and pop routines. We can achieve
this goal in many languages through use of a module construct.
A module allows a collection of objects—subroutines, variables, types, and
so on—to be encapsulated in such a way that (1) objects inside are visible to
each other, but (2) objects on the inside are not visible on the outside unless
explicitly exported, and (3) (in many languages) objects outside are not visible on
the inside unless explicitly imported. Modules can be found in Clu (which calls
them clusters), Modula (1, 2, and 3), Turing, Ada (which calls them packages),
C++ (which calls them namespaces), and many other modern languages. They
can also be emulated to some degree through use of the separate compilation
facilities of C; we discuss this possibility in Section 3.7.
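As a preview of that discussion, the sketch below shows how a C programmer might approximate the stack module of Figure 3.7 by placing it in a translation unit of its own and using file-scope static variables; element, STACK_SIZE, and the 0-based indexing are our own choices, and C provides nothing comparable to explicit import or export lists.

/* stack.c -- one translation unit playing the role of a module */

#define STACK_SIZE 100
typedef int element;

static element s[STACK_SIZE];        /* hidden: visible only in this file */
static int top = 0;                  /* first unused slot                 */

static void error(void) { /* report overflow or underflow */ }

void push(element elem)              /* "exported": callable from other files */
{
    if (top == STACK_SIZE) error();
    else s[top++] = elem;
}

element pop(void)
{
    if (top == 0) { error(); return (element) 0; }
    return s[--top];
}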
As an example of the use of modules, consider the stack abstraction shown
in Figure 3.7. This stack can be embedded anywhere a subroutine might appear
in a Modula-2 program. Bindings to variables declared in a module are inactive
outside the module, not destroyed. In our stack example, s and top have the
same lifetime they would have had if not enclosed in the module. If stack is
declared at the program’s outermost nesting level, then s and top retain their
values throughout the execution of the program, though they are visible only to
the code inside push and pop . If stack is declared inside some subroutine sub ,
then s and top have the same lifetime as the local variables of sub . If stack is
declared inside some other module mod , then s and top have the same lifetime as
they would have had if not enclosed in either module. Type stack_index , which
is also declared inside stack , is likewise visible only inside push and pop . The
issue of lifetime is not relevant for types or constants, since they have no mutable
state.
Our stack abstraction has two imports: the type ( element ) and maximum
number ( stack_size ) of elements to be placed in the stack. Element and
stack_size must be declared in a surrounding scope; the compiler will complain if they are not. With one exception, element and stack_size are the
only names from surrounding scopes that will be visible inside stack . The exception is that predefined (pervasive) names, such as integer and arctan , are
visible without being imported. Our stack also has two exports: push and pop .
These are the only names inside of stack that will be visible in the surrounding
scope.
CONST stack_size = ...
TYPE element = ...
...
MODULE stack;
IMPORT element, stack_size;
EXPORT push, pop;
TYPE
    stack_index = [1..stack_size];
VAR
    s   : ARRAY stack_index OF element;
    top : stack_index;               (* first unused slot *)

PROCEDURE error; ...

PROCEDURE push(elem : element);
BEGIN
    IF top = stack_size THEN
        error;
    ELSE
        s[top] := elem;
        top := top + 1;
    END;
END push;

PROCEDURE pop() : element;           (* A Modula-2 function is just a *)
BEGIN                                (* procedure with a return type. *)
    IF top = 1 THEN
        error;
    ELSE
        top := top - 1;
        RETURN s[top];
    END;
END pop;

BEGIN
    top := 1;
END stack;

VAR x, y : element;
...
push(x);
...
y := pop;

Figure 3.7   Stack abstraction in Modula-2.
Most module-based languages allow the programmer to specify that certain
exported names are usable only in restricted ways. Variables may be exported
read-only, for example, or types may be exported opaquely, meaning that variables of that type may be declared, passed as arguments to the module’s subroutines, and possibly compared or assigned to one another, but not manipulated in
any other way. To facilitate separate compilation, many module-based languages
(Modula-2 among them) also allow a module to be divided into a declaration
part (or header) and an implementation part (or body). Code that uses the ex-
ports of a given module can then be compiled as soon as the header exists; it is
not dependent on the body.
Modules into which names must be explicitly imported are said to be closed
scopes. Modules are closed in Modula (1, 2, and 3). By extension, modules that
do not require imports are said to be open scopes. An increasingly common option, found in the modules of Ada, Java, C#, and Python, among others, might
be called selectively open scopes. In these languages a name foo exported from
module A is automatically visible in peer module B as A.foo . It becomes visible
as merely foo if B explicitly imports it.
Nested subroutines are open scopes in most Algol family languages. Important
exceptions are Euclid, in which both module and subroutine scopes are closed,
Turing, Modula (1), and Perl, in which subroutines are optionally closed, and
Clu, which outlaws the use of nonlocal variables entirely. A subroutine in Euclid
must explicitly import any nonpervasive name that it uses from a surrounding
scope. A subroutine in Turing or Modula can also import names explicitly; if it
does so then no other nonlocal names are visible. Import lists serve to document
the program: the use of names from surrounding scopes is really part of the interface between a subroutine and the rest of the program. Requiring explicit imports
forces the programmer to document this interface more precisely than is required
in other languages. Outlawing nonlocal variables serves a similar purpose in Clu,
though nonlocal constants and subroutines can still be named, without explicit
import.
In addition to making programs easier to understand and maintain, import
lists help a Euclid or Turing compiler to enforce language rules that prohibit the
creation of aliases—multiple names that refer to the same object in a given scope.
Modula has no similar prohibition; its import lists are simply for documentation
and information hiding. We will return to the subject of aliases in Section 3.6.1.
3.3.5 Module Types and Classes

EXAMPLE 3.14 Module as "manager" for a type
Modules facilitate the construction of abstractions by allowing data to be made
private to the subroutines that use them. As defined in Modula-2, Turing, or
Ada 83, however, modules are most naturally suited to creating only a single instance of a given abstraction. The code in Figure 3.7, for example, does not lend
itself to applications that require several stacks. For such an application, the programmer must either replicate the code (giving the new copy another name) or
adopt an alternative organization in which the module becomes a “manager” for
instances of a stack type, which is then exported (see Figure 3.8). This latter organization requires additional subroutines to create/initialize and possibly destroy
stack instances, and it requires that every subroutine ( push , pop , create ) take
an extra parameter, to specify the stack in question. Clu addresses this problem
by automatically making every module (“cluster”) the manager for a type. In fact,
the only variables that may appear in a cluster (other than static variables in subroutines) are the representation of that type.
CONST stack_size = ...
TYPE element = ...
...
MODULE stack_manager;
IMPORT element, stack_size;
EXPORT stack, init_stack, push, pop;
TYPE
    stack_index = [1..stack_size];
    stack = RECORD
        s   : ARRAY stack_index OF element;
        top : stack_index;           (* first unused slot *)
    END;

PROCEDURE init_stack(VAR stk : stack);
BEGIN
    stk.top := 1;
END init_stack;

PROCEDURE push(VAR stk : stack; elem : element);
BEGIN
    IF stk.top = stack_size THEN
        error;
    ELSE
        stk.s[stk.top] := elem;
        stk.top := stk.top + 1;
    END;
END push;

PROCEDURE pop(VAR stk : stack) : element;
BEGIN
    IF stk.top = 1 THEN
        error;
    ELSE
        stk.top := stk.top - 1;
        RETURN stk.s[stk.top];
    END;
END pop;

END stack_manager;
var A, B : stack;
var x, y : element;
...
init_stack(A);
init_stack(B);
...
push(A, x);
...
y := pop(B);

Figure 3.8   Manager module for stacks in Modula-2.

EXAMPLE 3.15 Module types in Euclid
An alternative solution to the multiple instance problem can be found in Simula, Euclid, and (in a slightly different sense) ML, which treat modules as types,
rather than simple encapsulation constructs. Given a module type, the programmer can declare an arbitrary number of similar module objects. The skeleton
of a Euclid stack appears in Figure 3.9. As in the (single) Modula-2 stack of
Figure 3.7, Euclid allows the programmer to provide initialization code that is
executed whenever a new stack is created. Euclid also allows the programmer to
const stack_size := ...
type element : ...
...
type stack = module
    imports (element, stack_size)
    exports (push, pop)
    type
        stack_index = 1..stack_size
    var
        s   : array stack_index of element
        top : stack_index
    procedure push(elem : element) = ...
    function pop returns element = ...
    ...
    initially
        top := 1
end stack

var A, B : stack
var x, y : element
...
A.push(x)
...
y := B.pop
Figure 3.9 Module type for stacks in Euclid. Unlike the code in Figure 3.7, the code here can
be used to create an arbitrary number of stacks.
specify finalization code that will be executed at the end of a module’s lifetime.
This feature is not needed for an array-based stack, but it would be useful if elements were allocated from a heap and needed to be reclaimed.
The difference between the module-as-manager and module-as-type approaches to abstraction is reflected in the lower right of Figures 3.8 and 3.9. With
module types, the programmer can think of the module’s subroutines as “belonging” to the stack in question ( A.push(x) ), rather than as outside entities
to which the stack can be passed as an argument ( push(A, x) ). Conceptually,
there is a separate pair of push and pop operations for every stack. In practice,
of course, it would be highly wasteful to create multiple copies of the code. As we
shall see in Chapter 9, all stacks share a single pair of push and pop operations,
and the compiler arranges for a pointer to the relevant stack to be passed to the
operation as an extra, hidden parameter. The implementation turns out to be
very similar to the implementation of Figure 3.8, but the programmer need not
think of it that way.8
As an extension of the module-as-type approach to data abstraction, many
languages now provide a class construct for object-oriented programming. To first
approximation, classes can be thought of as module types that have been augmented with an inheritance mechanism. Inheritance allows new classes to be defined as extensions or refinements of existing classes. Inheritance facilitates a programming style in which all or most operations are thought of as belonging to objects, and in which new objects can inherit most of their operations from existing objects, without the need to rewrite code. Classes have their roots in Simula-67, and are the central innovation of object-oriented languages such as Smalltalk, Eiffel, C++, Java, and C#. Inheritance mechanisms can also be found in several languages that are not usually considered object-oriented, including Modula-3, Ada 95, and Oberon. We will examine inheritance and its impact on scope rules in Chapter 9.

8 It is interesting to note that Turing, which was derived from Euclid, reverts to Modula-2 style modules, in order to avoid implementation complexity [HMRC88, p. 9].

EXAMPLE 3.16 N-ary methods in C++
Module types and classes (ignoring issues related to inheritance) require only
simple changes to the scope rules defined for modules in the previous subsection.
Every instance A of a module type or class (e.g., every stack) has a separate copy
of the module or class’s variables. These variables are then visible when executing one of A ’s operations. They may also be indirectly visible to the operations
of some other instance B if A is passed as a parameter to one of those operations. This rule makes it possible in most object-oriented languages to construct
binary (or more-ary) operations that can manipulate the variables of more than
one instance of a class. In C++, for example, we could create an operation that
determines which of two stacks contains a larger number of elements:
class stack {
    ...
    bool deeper(stack other) {       // function declaration
        return (top > other.top);
    }
    ...
};
...
if (A.deeper(B)) ...
Within the deeper operation of stack A , top refers to A.top . Because deeper
is an operation of class stack , however, it is able to refer not only to the variables of A (which it can access directly by name), but also to the variables of any
other stack that is passed to it as an argument. Because these variables belong
to a different stack, deeper must name that stack explicitly—for example, as in
other.top . In a module-as-manager style program, of course, module subroutines would access all instance variables via parameters.
3.3.6 Dynamic Scope
In a language with dynamic scoping, the bindings between names and objects
depend on the flow of control at run time and, in particular, on the order in which
subroutines are called. In comparison to the static scope rules discussed in the
previous section, dynamic scope rules are generally quite simple: the “current”
binding for a given name is the one encountered most recently during execution,
and not yet destroyed by returning from its scope.
1. a : integer                 –– global declaration
2. procedure first
3.     a := 1
4. procedure second
5.     a : integer             –– local declaration
6.     first()
7. a := 2
8. if read integer() > 0
9.     second()
10. else
11.     first()
12. write integer(a)
Figure 3.10 Static versus dynamic scope. Program output depends on both scope rules and,
in the case of dynamic scope, a value read at run time.
EXAMPLE 3.17 Static v. dynamic scope
Languages with dynamic scoping include APL [Ive62], Snobol [GPP71], and
early dialects of Lisp [MAE+ 65, Moo78, TM81] and Perl.9 Because the flow of
control cannot in general be predicted in advance, the bindings between names
and objects in a language with dynamic scope cannot in general be determined
by a compiler. As a result, many semantic rules in a language with dynamic scope
become a matter of dynamic semantics rather than static semantics. Type checking in expressions and argument checking in subroutine calls, for example, must
in general be deferred until run time. To accommodate all these checks, languages
with dynamic scoping tend to be interpreted rather than compiled.
As an example of dynamic scope, consider the program in Figure 3.10. If static
scoping is in effect, this program prints a 1. If dynamic scoping is in effect, the
program prints either a 1 or a 2, depending on the value read at line 8 at run time.
Why the difference? At issue is whether the assignment to the variable a at line 3
refers to the global variable declared at line 1 or to the local variable declared at
line 5. Static scope rules require that the reference resolve to the closest lexically
enclosing declaration—namely the global a . Procedure first changes a to 1, and
line 12 prints this value.
Dynamic scope rules, on the other hand, require that we choose the most recent, active binding for a at run time. We create a binding for a when we enter
the main program. We create another when and if we enter procedure second .
When we execute the assignment statement at line 3, the a to which we are referring will depend on whether we entered first through second or directly from
9 Scheme and Common Lisp are statically scoped, though the latter allows the programmer to
specify dynamic scoping for individual variables. Static scoping was added to Perl in version 5.
The programmer now chooses static or dynamic scoping explicitly in each variable declaration.
max score : integer                     –– maximum possible score

function scaled score(raw score : integer) : real
    return raw score / max score * 100
...
procedure foo
    max score : real := 0               –– highest percentage seen so far
    ...
    foreach student in class
        student.percent := scaled score(student.points)
        if student.percent > max score
            max score := student.percent
Figure 3.11 The problem with dynamic scoping. Procedure scaled score probably does not
do what the programmer intended when dynamic scope rules allow procedure foo to change
the meaning of max score .
the main program. If we entered through second , we will assign the value 1 to
second ’s local a . If we entered from the main program, we will assign the value 1
to the global a . In either case, the write at line 12 will refer to the global a , since
second ’s local a will be destroyed, along with its binding, when control returns
to the main program.
With dynamic scoping in effect, no program fragment that makes use of nonlocal names is guaranteed a predictable referencing environment. In Figure 3.11,
for example, the declaration of a local variable in procedure foo accidentally
redefines a global variable used by function scaled score , which is then called
from foo . Since the global max score is an integer, while the local max score
is a floating-point number, dynamic semantic checks in at least some languages
will result in a type clash message at run time. If the local max score had been
an integer, no error would have been detected, but the program would almost
certainly have produced incorrect results. This sort of error can be very hard to
find.
DESIGN & IMPLEMENTATION
Dynamic scoping
It is not entirely clear whether the use of dynamic scoping in Lisp and other
early interpreted languages was deliberate or accidental. One reason to think
that it may have been deliberate is that it makes it very easy for an interpreter to
look up the meaning of a name: all that is required is a stack of declarations (we
examine this stack more closely in Section 3.4.2). Unfortunately, this simple
implementation has a very high run-time cost, and experience indicates that
dynamic scoping makes programs harder to understand. The modern consensus seems to be that dynamic scoping is usually a bad idea (see Exercise 3.15
and Exploration 3.29 for two exceptions).
EXAMPLE 3.18  Customization via dynamic scope
The principal argument in favor of dynamic scoping is that it facilitates the
customization of subroutines. Suppose, for example, that we have a library routine print integer that is capable of printing its argument in any of several bases
(decimal, binary, hexadecimal, etc.). Suppose further that we want the routine to
use decimal notation most of the time, and to use other bases only in a few special
cases; we do not want to have to specify a base explicitly on each individual call.
We can achieve this result with dynamic scoping by having print integer obtain
its base from a nonlocal variable print base . We can establish the default behavior
by declaring a variable print base and setting its value to 10 in a scope encountered early in execution. Then, any time we want to change the base temporarily,
we can write
begin                              –– nested block
    print base : integer := 16     –– use hexadecimal
    print integer(n)

EXAMPLE 3.19  Multiple interface alternative
EXAMPLE 3.20  Static variable alternative
The problem with this argument is that there are usually other ways to achieve
the same effect, without dynamic scoping. One option would be to have print
integer use decimal notation in all cases, and create another routine, print
integer with base , that takes a second argument. In a language like Ada or C++,
one could make the base an optional (default) parameter of a single print integer
routine, or use overloading to give the same name to both routines. (We will
consider default parameters in Section 8.3.3; overloading is discussed in Section 3.6.2.)
Unfortunately, using two different routines for printing (or one routine with
two calling sequences) requires that the caller know what is going on. In our
example, alternative routines work fine if the calls are all made in the scope in
which the local print base variable would have been declared. If that scope calls
subroutines that in turn call print integer , however, we cannot in general arrange
for the called routines to use the alternative interface. A second alternative to
dynamic scoping solves this problem: we can create a static variable, either global
or encapsulated with print integer inside an appropriate module, that controls
the base. To change the print base temporarily, we can then write
begin                                      –– nested block
    print base save : integer := print base
    print base := 16                       –– use hexadecimal
    print integer(n)
    print base := print base save
The possibility that we may forget to restore the original value, of course, is a
potential source of bugs. With dynamic scoping the value is restored automatically.
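In a language with exception handling and first-class functions, the save-and-restore idiom can at least be packaged so that restoration happens automatically, even if the protected code terminates abruptly. The following sketch is one way to do this in Python; print_base and print_integer are hypothetical stand-ins for the variable and library routine discussed above, not part of any real library.

# A sketch only: print_base and print_integer are hypothetical stand-ins
# for the variable and routine described in the text.
from contextlib import contextmanager

print_base = 10                    # module-level variable controlling the base

def print_integer(n):
    # Format n in the current print_base (2..16 supported here).
    digits = "0123456789abcdef"
    if n == 0:
        print("0")
        return
    s, m = "", abs(n)
    while m:
        s = digits[m % print_base] + s
        m //= print_base
    print(("-" if n < 0 else "") + s)

@contextmanager
def base(b):
    # Temporarily rebind print_base; the finally clause restores the old
    # value even on an exception, mimicking the automatic restoration
    # that dynamic scoping would provide.
    global print_base
    saved = print_base
    print_base = b
    try:
        yield
    finally:
        print_base = saved

with base(16):                     # use hexadecimal
    print_integer(255)             # prints ff
print_integer(255)                 # prints 255 again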
3.4   Implementing Scope
To keep track of the names in a statically scoped program, a compiler relies on a
data abstraction called a symbol table. In essence, the symbol table is a dictionary:
it maps names to the information the compiler knows about them. The most basic operations serve to place a new mapping (a name-to-object binding) into the
table and to retrieve (nondestructively) the information held in the mapping for
a given name. Static scope rules in most languages impose additional complexity
by requiring that the referencing environment be different in different parts of
the program.
In a language with dynamic scoping, an interpreter (or the output of a compiler) must perform operations at run time that correspond to the insert , lookup ,
enter scope , and leave scope symbol table operations in the implementation of
a statically scoped language. In principle, any organization used for a symbol
table in a compiler could be used to track name-to-object bindings in an interpreter, and vice versa. In practice, implementations of dynamic scoping tend to
adopt one of two specific organizations: an association list or a central reference
table.
IN MORE DEPTH
Most variations on static scoping can be handled by augmenting a basic
dictionary-style symbol table with enter scope and leave scope operations to
keep track of visibility. Nothing is ever deleted from the table; the entire structure
is retained throughout compilation, and then saved for the debugger. A symbol
table with visibility support can be implemented in several different ways. One
appealing approach, due to LeBlanc and Cook [CL83], is described on the PLP
CD.
An association list (or A-list for short) is simply a list of name/value pairs.
When used to implement dynamic scope it functions as a stack: new declarations are pushed as they are encountered, and popped at the end of the scope
in which they appeared. Bindings are found by searching down the list from the
top. A central reference table avoids the need for linear-time search by maintaining an explicit mapping from names to their current meanings. Lookup is faster,
but scope entry and exit are somewhat more complex, and it becomes substantially more difficult to save a referencing environment for future use (we discuss
this issue further in Section 3.5.1).
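As a concrete (and much simplified) illustration, the following Python sketch shows how an interpreter for a dynamically scoped language might provide these operations with an association list; the names a_list, enter_scope, and so on are invented for this sketch, and the trailing lines mimic the dynamic-scope behavior of Figure 3.10.

# A sketch only: a_list, enter_scope, leave_scope, and lookup are invented names.
a_list = []                               # list of [name, value] pairs; top of stack at the end

def push(name, value):
    a_list.append([name, value])          # a new declaration pushes a binding

def enter_scope():
    return len(a_list)                    # remember the current depth

def leave_scope(mark):
    del a_list[mark:]                     # pop every binding made since entry

def lookup(name):
    for n, v in reversed(a_list):         # search from the most recent binding down
        if n == name:
            return v
    raise NameError(name)

# Mimicking Figure 3.10 under dynamic scoping, when the value read is positive:
push("a", 2)                              # global a, after the assignment at line 7
mark = enter_scope()                      # enter procedure second
push("a", None)                           # second's local a hides the global one
a_list[-1][1] = 1                         # the nested call to first assigns to it
leave_scope(mark)                         # second returns; its binding disappears
print(lookup("a"))                        # prints 2: the global a again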
CHECK YOUR UNDERSTANDING
21. Explain the importance of information hiding.
22. What is an opaque export?
23. Why might it be useful to distinguish between the header and the body of a
module?
24. What does it mean for a scope to be closed?
25. Explain the distinction between “modules as managers” and “modules as
types.”
26. How do classes differ from modules?
27. Why does the use of dynamic scoping imply the need for run-time type
checking?
28. Give an argument in favor of dynamic scoping. Describe how similar benefits
can be achieved in a language without dynamic scoping.
29. Explain the purpose of a compiler’s symbol table.
3.5   The Binding of Referencing Environments

EXAMPLE 3.21  Deep and shallow binding
We have seen in the previous section how scope rules determine the referencing
environment of a given statement in a program. Static scope rules specify that
the referencing environment depends on the lexical nesting of program blocks
in which names are declared. Dynamic scope rules specify that the referencing
environment depends on the order in which declarations are encountered at run
time. An additional issue that we have not yet considered arises in languages that
allow one to create a reference to a subroutine—for example, by passing it as
a parameter. When should scope rules be applied to such a subroutine: when
the reference is first created, or when the routine is finally called? The answer is
particularly important for languages with dynamic scoping, though we shall see
that it matters even in languages with static scoping. As an example of the former,
consider the program fragment shown in Figure 3.12. (As in Figure 3.10, we use
an Algol-like syntax, even though Algol-family languages are usually statically
scoped.)
Procedure print selected records in our example is assumed to be a general
purpose routine that knows how to traverse the records in a database, regardless
of whether they represent people, sprockets, or salads. It takes as parameters a
database, a predicate to make print/don’t print decisions, and a subroutine that
knows how to format the data in the records of this particular database. In Section 3.3.6 we hypothesized a print integer library routine that would print in
any of several bases, depending on the value of a nonlocal variable print base .
Here we have hypothesized in a similar fashion that print person uses the value
of nonlocal variable line length to calculate the number and width of columns
in its output. In a language with dynamic scope, it is natural for procedure print
selected records to declare and initialize this variable locally, knowing that code
inside print routine will pick it up if needed. For this coding technique to work,
type person = record
    ...
    age : integer
    ...
threshold : integer
people : database

function older than(p : person) : boolean
    return p.age ≥ threshold

procedure print person(p : person)
    –– Call appropriate I/O routines to print record on standard output.
    –– Make use of nonlocal variable line length to format data in columns.
    ...

procedure print selected records(db : database;
                                 predicate, print routine : procedure)
    line length : integer
    if device type(stdout) = terminal
        line length := 80
    else
        –– Standard output is a file or printer.
        line length := 132
    foreach record r in db
        –– Iterating over these may actually be
        –– a lot more complicated than a ‘for’ loop.
        if predicate(r)
            print routine(r)

–– main program
...
threshold := 35
print selected records(people, older than, print person)
Figure 3.12  Program to illustrate the importance of binding rules. One might argue that deep
binding is appropriate for the environment of function older than (for access to threshold ),
while shallow binding is appropriate for the environment of procedure print person (for access
to line length ).
the referencing environment of print routine must not be created until the routine is actually called by print selected records . This late binding of the referencing environment of a subroutine that has been passed as a parameter is
known as shallow binding. It is usually the default in languages with dynamic
scoping.
For function older than , by contrast, shallow binding may not work well. If,
for example, procedure print selected records happens to have a local variable
named threshold , then the variable set by the main program to influence the behavior of older than will not be visible when the function is finally called, and
the predicate will be unlikely to work correctly. In such a situation, the code that
originally passes the function as a parameter has a particular referencing environment (the current one) in mind; it does not want the routine to be called in
any other environment. It therefore makes sense to bind the environment at the
time the routine is first passed as a parameter, and then restore that environment
when the routine is finally called. This early binding of the referencing environment is known as deep binding. The need for deep binding is sometimes referred
to as the funarg problem in Lisp.
3.5.1   Subroutine Closures
Deep binding is implemented by creating an explicit representation of a referencing environment (generally the one in which the subroutine would execute if
called at the present time) and bundling it together with a reference to the subroutine. The bundle as a whole is referred to as a closure. Usually the subroutine
itself can be represented in the closure by a pointer to its code. If an association
list is used to represent the referencing environment of a program with dynamic
scoping, then the referencing environment in a closure can be represented by a
top-of-stack (beginning of A-list) pointer. When a subroutine is called through
a closure, the main pointer to the referencing environment A-list is temporarily
replaced by the saved pointer, making any bindings created since the closure was
created temporarily invisible. New bindings created within the subroutine are
pushed using the temporary pointer. Because the A-list is represented by pointers (rather than an array), the effect is to have two lists—one representing the
temporary referencing environment resulting from use of the closure and the
other the main referencing environment that will be restored when the subroutine returns—that share their older entries.
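The sharing is easiest to see if the A-list is drawn as a chain of cells. In the following Python sketch (all names invented for the sketch), each binding is a tuple whose last field points to the rest of the list, so a closure need only remember the head of the chain that was current when it was created.

# A sketch only; env, bind, lookup, and make_closure are invented names.
env = None                                # head of the main A-list (a chain of tuples)

def bind(name, value, rest):
    return (name, value, rest)            # push without copying the older entries

def lookup(name, head):
    while head is not None:
        n, v, rest = head
        if n == name:
            return v
        head = rest
    raise NameError(name)

def make_closure(code):
    saved = env                           # deep binding: remember the current head
    def call():
        global env
        outer, env = env, saved           # temporarily reinstate the saved environment
        try:
            code()                        # bindings made here would extend 'saved',
        finally:                          #   sharing its older entries with 'outer'
            env = outer                   # restore the caller's environment
    return call

def show_threshold():
    print(lookup("threshold", env))

env = bind("threshold", 35, env)          # binding in effect when the closure is made
f = make_closure(show_threshold)
env = bind("threshold", 0, env)           # a more recent binding elsewhere
f()                                       # prints 35: the captured binding wins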
If a central reference table is used to represent the referencing environment of
a program with dynamic scoping, then the creation of a closure is more complicated. In the general case, it may be necessary to copy the entire main array
of the central table and the first entry on each of its lists. Space and time overhead may be reduced if the compiler or interpreter is able to determine that only
some of the program’s names will be used by the subroutine in the closure (or by
things that the subroutine may call). In this case, the environment can be saved
by copying the first entries of the lists for only the “interesting” names. When the
subroutine is called through the closure, these entries can then be pushed onto
the beginnings of the appropriate lists in the central reference table.
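A corresponding sketch for the central reference table, again in Python with invented names, keeps a stack of meanings for each name; a closure copies the topmost binding of each “interesting” name and pushes those copies back for the duration of the call.

# A sketch only: table, declare, lookup, and make_closure are invented names.
table = {}                                     # each name maps to a stack of meanings

def declare(name, value):
    table.setdefault(name, []).append(value)

def undeclare(name):
    table[name].pop()

def lookup(name):
    return table[name][-1]                     # constant time; no linear search

def make_closure(code, interesting):
    saved = {n: table[n][-1] for n in interesting}   # copy only the top entries
    def call():
        for n, v in saved.items():
            table[n].append(v)                 # reinstate the saved bindings
        try:
            code()
        finally:
            for n in saved:
                table[n].pop()                 # and remove them again on return
    return call

declare("threshold", 35)
f = make_closure(lambda: print(lookup("threshold")), ["threshold"])
declare("threshold", 0)                        # a newer binding elsewhere
f()                                            # prints 35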
Deep binding is often available as an option in languages with dynamic scope.
In early dialects of Lisp, for example, the built-in primitive named function takes a
function as its argument and returns a closure whose referencing environment is
the one in which the function would execute if called at the present time. This
closure can then be passed as a parameter to another function. If and when it is
eventually called, it will execute in the saved environment. (Closures work slightly
differently from “bare” functions in most Lisp dialects: they must be called by
passing them to the built-in primitives funcall or apply .)

program binding_example(input, output);
procedure A(I : integer; procedure P);
    procedure B;
    begin
        writeln(I);
    end;
begin (* A *)
    if I > 1 then
        P
    else
        A(2, B);
end;
procedure C; begin end;
begin (* main *)
    A(1, C);
end.

Figure 3.13 Deep binding in Pascal. When B is called via formal parameter P , two instances
of I exist. Because the closure for P was created in the initial invocation of A , it uses that
invocation’s instance of I , and prints a 1 .

EXAMPLE 3.22  Binding rules with static scoping
Deep binding is generally the default in languages with static (lexical) scoping.
At first glance, one might be tempted to think that the binding time of referencing environments would not matter in languages with static scoping. After all,
the meaning of a statically scoped name depends on its lexical nesting, not on
the flow of execution, and this nesting is the same whether it is captured at the
time a subroutine is passed as a parameter or at the time the subroutine is called.
The catch is that a running program may have more than one instance of an object that is declared within a recursive subroutine. A closure in a language with
static scoping captures the current instance of every object, at the time the closure is created. When the closure’s subroutine is called, it will find these captured
instances, even if newer instances have subsequently been created by recursive
calls.
One could imagine combining static scoping with shallow binding [VF82],
but the combination does not seem to make much sense, and it does not appear
to have been adopted in any language. Figure 3.13 contains a Pascal program
that illustrates the impact of binding rules in the presence of static scoping. This
program prints a 1. With shallow binding it would print a 2.
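For comparison, here is a rough transcription of Figure 3.13 into Python, which is statically scoped and uses deep binding for closures; it prints a 1 for the same reason the Pascal program does.

# A sketch of Figure 3.13 in Python (statically scoped, deep binding).
def A(I, P):
    def B():
        print(I)            # the I of the invocation of A in which B was created
    if I > 1:
        P()                 # calls the B passed in from the outer invocation
    else:
        A(2, B)             # recursive call; B's closure has already captured I == 1

def C():
    pass

A(1, C)                     # prints 1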
It should be noted that binding rules matter with static scoping only when
accessing objects that are neither local nor global. If an object is local to the currently executing subroutine, then it does not matter whether the subroutine was
called directly or through a closure; in either case local objects will have been created when the subroutine started running. If an object is global, there will never
be more than one instance, since the main body of the program is not recursive.
Binding rules are therefore irrelevant in languages like C, which has no nested
subroutines, or Modula-2, which allows only outermost subroutines to be passed
as parameters. (They are also irrelevant in languages like PL/I and Ada 83, which
do not permit subroutines to be passed as parameters at all.)
Suppose then that we have a language with static scoping in which nested subroutines can be passed as parameters, with deep binding. To represent a closure
for subroutine S, we can simply save a pointer to S’s code together with the static link that S would use if it were called right now, in the current environment.
When S is finally called, we temporarily restore the saved static link, rather than
creating a new one. When S follows its static chain to access a nonlocal object,
it will find the object instance that was current at the time the closure was created.
3.5.2   First- and Second-Class Subroutines
In general, a value in a programming language is said to have first-class status
if it can be passed as a parameter, returned from a subroutine, or assigned into
a variable. Simple types such as integers and characters are first-class values in
most programming languages. By contrast, a “second-class” value can be passed
as a parameter, but not returned from a subroutine or assigned into a variable,
and a “third-class” value cannot even be passed as a parameter. As we shall see
in Section 8.3.2, labels are third-class values in most programming languages but
second-class values in Algol. Subroutines are second-class values in most imperative languages but third-class values in Ada 83. They are first-class values in all
functional programming languages, in C#, Perl, and Python, and, with certain
restrictions, in several other imperative languages, including Fortran, Modula-2
and -3, Ada 95, C, and C++.10
So far in this subsection we have considered the ramifications of second-class
subroutines. First-class subroutines in a language with nested scopes introduce
an additional level of complexity: they raise the possibility that a reference to
a subroutine may outlive the execution of the scope in which that routine was
declared. Consider the following example in Scheme.
10 Some authors would say that first-class status requires the ability to create new functions at run
time. C#, Perl, Python, and all functional languages meet this requirement, but most imperative
languages do not.

EXAMPLE 3.23  Returning a first-class subroutine in Scheme
1. (define plus_x (lambda (x)
2.     (lambda (y) (+ x y))))
3. ...
4. (let ((f (plus_x 2)))
5.     (f 3))                  ; returns 5
Here the let construct on line 4 declares a new function, f , which is the result
of calling plus_x with argument 2 . (Like all Lisp dialects, Scheme puts the function name inside the parentheses, right in front of the arguments. The lambda
keyword introduces the parameter list and body of a function.) When f is called
at line 5, it must use the 2 that was passed to plus_x , despite the fact that plus_x
has already returned.
If local objects were destroyed (and their space reclaimed) at the end of each
scope’s execution, then the referencing environment captured in a long-lived closure might become full of dangling references. To avoid this problem, most functional languages specify that local objects have unlimited extent: their lifetimes
continue indefinitely. Their space can be reclaimed only when the garbage collection system is able to prove that they will never be used again. Local objects
(other than own / static variables) in Algol-family languages generally have limited extent: they are destroyed at the end of their scope’s execution. Space for local
objects with limited extent can be allocated on a stack. Space for local objects with
unlimited extent must generally be allocated on a heap.
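Python behaves like the functional languages in this respect: a local captured by a returned closure has unlimited extent, as the following transcription of the Scheme example suggests.

# A sketch only; Python keeps captured locals alive for as long as some
# closure still refers to them.
def plus_x(x):
    return lambda y: x + y

f = plus_x(2)
print(f(3))                 # prints 5, even though plus_x has already returned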
Given the desire to maintain stack-based allocation for the local variables
of subroutines, imperative languages with first-class subroutines must generally
adopt alternative mechanisms to avoid the dangling reference problem for closures. C, C++, and Fortran, of course, do not have nested subroutines. Modula-2
allows references to be created only to outermost subroutines (outermost routines are first-class values; nested routines are third-class values). Modula-3 allows nested subroutines to be passed as parameters, but only outermost routines
to be returned or stored in variables (outermost routines are first-class values;
nested routines are second-class values). Ada 95 allows a nested routine to be returned, but only if the scope in which it was declared is at least as wide as that
of the declared return type. This containment rule, while more conservative than
strictly necessary (it forbids the Ada equivalent of Figure 3.13), makes it impossible
to propagate a subroutine reference to a portion of the program in which the
routine’s referencing environment is not active.

DESIGN & IMPLEMENTATION
Binding rules and extent
Binding mechanisms and the notion of extent are closely tied to implementation issues. A-lists make it easy to build closures, but so do the non-nested
subroutines of C and the rule against passing non-global subroutines as parameters in Modula-2. In a similar vein, the lack of first-class subroutines in
most imperative languages reflects in large part the desire to avoid heap allocation, which would be needed for local variables with unlimited extent.
3.6   Binding Within a Scope
So far in our discussion of naming and scopes we have assumed that every name
must refer to a distinct object in every scope. This is not necessarily the case.
Two or more names that refer to a single object in a given scope are said to be
aliases. A name that can refer to more than one object in a given scope is said to
be overloaded.
3.6.1   Aliases

EXAMPLE 3.24  Aliasing with parameters
Simple examples of aliases occur in the common blocks and equivalence statements of Fortran (Section 3.3.1) and in the variant records and unions of languages like Pascal and C. They also arise naturally in programs that make use of
pointer-based data structures. A more subtle way to create aliases in many languages is to pass a variable by reference to a subroutine that also accesses that variable directly (consider variable sum in Figure 3.14). As we noted in Section 3.3.4,
Euclid and Turing use explicit and implicit subroutine import lists to catch and
prohibit precisely this case.
double sum, sum_of_squares;
...
void accumulate(double& x)    // x passed by reference
{
    sum += x;
    sum_of_squares += x * x;
}
...
accumulate(sum);

Figure 3.14 Example of a potentially problematic alias in C++. Procedure accumulate probably does not do what the programmer intended when sum is passed as a parameter.

DESIGN & IMPLEMENTATION
Pointers in C and Fortran
The tendency of pointers to introduce aliases is one of the reasons why Fortran
compilers have tended, historically, to produce faster code than C compilers:
pointers are heavily used in C but missing from Fortran 77 and its predecessors. It is only in recent years that sophisticated alias analysis algorithms have
allowed C compilers to rival their Fortran counterparts in speed of generated
code. Pointer analysis is sufficiently important that the designers of the C99
standard decided to add a new keyword to the language. The restrict qualifier, when attached to a pointer declaration, is an assertion on the part of the
programmer that the object to which the pointer refers has no alias in the current scope. It is the programmer’s responsibility to ensure that the assertion is
correct; the compiler need not attempt to check it.

As a general rule, aliases tend to make programs more confusing than they
otherwise would be. They also make it much more difficult for a compiler to
perform certain important code improvements. Consider the following C code.

EXAMPLE 3.25  Aliases and code improvement

int a, b, *p, *q;
...
a = *p;     /* read from the variable referred to by p */
*q = 3;     /* assign to the variable referred to by q */
b = *p;     /* read from the variable referred to by p */
The initial assignment to a will, on most machines, require that *p be loaded into
a register. Since accessing memory is expensive, the compiler will want to hang
onto the loaded value and reuse it in the assignment to b . It will be unable to
do so, however, unless it can verify that p and q cannot refer to the same object.
While verification of this sort is possible in many common cases, in general it’s
uncomputable.
3.6.2   Overloading

EXAMPLE 3.26  Overloaded enumeration constants in Ada
EXAMPLE 3.27  Resolving ambiguous overloads
Most programming languages provide at least a limited form of overloading. In
C, for example, the plus sign ( + ) is used to name two different functions: integer
and floating-point addition. Most programmers don’t worry about the distinction between these two functions—both are based on the same mathematical
concept, after all—but they take arguments of different types and perform very
different operations on the underlying bits. A slightly more sophisticated form
of overloading appears in the enumeration constants of Ada. In Figure 3.15, the
constants oct and dec refer either to months or to numeric bases, depending on
the context in which they appear.
Within the symbol table of a compiler, overloading must be handled by arranging for the lookup routine to return a list of possible meanings for the requested name. The semantic analyzer must then choose from among the elements of the list based on context. When the context is not sufficient to decide,
as in the call to print in Figure 3.15, then the semantic analyzer must announce
an error. Most languages that allow overloaded enumeration constants allow the
programmer to provide appropriate context explicitly. In Ada, for example, one
can say
print(month’(oct));

declare
    type month is (jan, feb, mar, apr, may, jun,
                   jul, aug, sep, oct, nov, dec);
    type print_base is (dec, bin, oct, hex);
    mo : month;
    pb : print_base;
begin
    mo := dec;        -- the month dec
    pb := oct;        -- the print_base oct
    print(oct);       -- error! insufficient context to decide

Figure 3.15  Overloading of enumeration constants in Ada.
In Modula-3 and C#, every use of an enumeration constant must be prefixed
with a type name, even when there is no chance of ambiguity:
mo := month.dec;
pb := print_base.oct;
EXAMPLE 3.28  Overloading in Ada and C++
EXAMPLE 3.29  Overloading built-in operators
In C, C++, and standard Pascal, one cannot overload enumeration constants at
all; every constant visible in a given scope must be distinct.
Both Ada and C++ have elaborate facilities for overloading subroutine names.
(Most of the C++ facilities carry over to Java and C#.) A given name may refer
to an arbitrary number of subroutines in the same scope, so long as the subroutines differ in the number or types of their arguments. C++ examples appear in
Figure 3.16.11
Ada, C++, C#, and Fortran 90 also allow the built-in arithmetic operators ( + ,
- , * , etc.) to be overloaded with user-defined functions. Ada, C++, and C# do
this by defining alternative prefix forms of each operator, and defining the usual
infix forms to be abbreviations (or “syntactic sugar”) for the prefix forms. In
Ada, A + B is short for "+"(A, B) . If "+" is overloaded, it must be possible to
determine the intended meaning from the types of A and B . In C++ and C#, A +
B is short for A.operator+(B) , where A is an instance of a class (module type)
that defines an operator+ function. The class-based style of abbreviation in C++
and C# resembles a similar facility in Clu. Since the abbreviation expands to an
unambiguous name (i.e., A ’s operator+ ; not any other), one might be tempted
to say that no “real” overloading is involved, and this is in fact the case in Clu. In
C++ and C#, however, there may be more than one definition of A.operator+ ,
allowing the second argument to be of several types. Fortran 90 provides a special
interface construct that can be used to associate an operator with some named
binary function.
11 C++ actually provides more elegant ways to handle both I/O and user-defined types such as
complex . We examine these in Section 7.9 and Chapter 9.
struct complex {
    double real, imaginary;
};
enum base {dec, bin, oct, hex};

int i;
complex x;

void print_num(int n) ...
void print_num(int n, base b) ...
void print_num(complex c) ...

print_num(i);          // uses the first function above
print_num(i, hex);     // uses the second function above
print_num(x);          // uses the third function above
Figure 3.16 Simple example of overloading in C++. In each case the compiler can tell which
function is intended by the number and types of arguments.
3.6.3   Polymorphism and Related Concepts

EXAMPLE 3.30  Overloading v. coercion
In the case of subroutine names, it is worth distinguishing overloading from the
closely related concepts of coercion and polymorphism. All three can be used, in
certain circumstances, to pass arguments of multiple types to (or return values
of multiple types from) a given named routine. The syntactic similarity, however,
hides significant differences in semantics and pragmatics.
Suppose, for example, that we wish to be able to compute the minimum of
two values of either integer or floating-point type. In Ada we might obtain this
capability using overloaded functions:
function min(a, b : integer) return integer is ...
function min(x, y : real) return real is ...
In Fortran, however, we could get by with a single function:
real function min(x, y)
real x, y
...
If the Fortran function is called in a context that expects an integer (e.g.,
i = min(j, k) ), the compiler will automatically convert the integer arguments
( j and k ) to floating-point numbers, call min , and then convert the result back
to an integer (via truncation). So long as real variables have at least as many significant bits as integer s (which they do in the case of 32-bit integers and 64-bit
double-precision floating-point), the result will be numerically correct.
Coercion is the process by which a compiler automatically converts a value of
one type into a value of another type when that second type is required by the
surrounding context. As we shall see in Section 7.2.2, coercion is somewhat controversial. Pascal provides a limited number of coercions. Fortran and C provide
more. C++ provides an extremely rich set, and allows the programmer to define more. Ada as a matter of principle coerces nothing but explicit constants,
subranges, and in certain cases arrays with the same type of elements.
In our example, overloading allows the Ada compiler to choose between two
different versions of min , depending on the types of the arguments. Coercion
allows the Fortran compiler to modify the arguments to fit a single subroutine.
Polymorphism provides yet another option: it allows a single subroutine to accept
unconverted arguments of multiple types.
The term polymorphic is from the Greek, meaning “having multiple forms.”
It is applied to code—both data structures and subroutines—that can work with
values of multiple types. For this concept to make sense, the types must generally have certain characteristics in common, and the code must not depend
on any other characteristics. The commonality is usually captured in one of two
main ways. In parametric polymorphism the code takes a type (or set of types) as
a parameter, either explicitly or implicitly. In subtype polymorphism the code is
designed to work with values of some specific type T, but the programmer can
define additional types to be extensions or refinements of T, and the polymorphic
code will work with these subtypes as well.
Explicit parametric polymorphism is also known as genericity. Generic facilities appear in Ada, C++, Clu, Eiffel, Modula-3, and recent versions of Java and
C#, among others. Readers familiar with C++ will know them by the name of
templates. We will consider them further in Sections 8.4 and 9.4.4. Implicit parametric polymorphism appears in the Lisp and ML families of languages, and
in various scripting languages; we will consider it further in Sections 7.2.4
and 10.3. Subtype polymorphism is fundamental to object-oriented languages,
in which subtypes (classes) are said to inherit the methods of their parent types.
We will consider inheritance further in Section 9.4.
Generics (explicit parametric polymorphism) are usually, though not always,
implemented by creating multiple copies of the polymorphic code, one specialized for each needed concrete type. Inheritance (subtype polymorphism) is almost always implemented by creating a single copy of the code, and by inserting sufficient “metadata” in the representation of objects that the code can tell
when to treat them differently. Implicit parametric polymorphism can be implemented
either way. Most Lisp implementations use a single copy of the code, and
delay all semantic checks until run time. ML and its descendants perform all type
checking at compile time. They typically generate a single copy of the code where
possible (e.g., when all the types in question are records that share a similar representation) and generate multiple copies when necessary (e.g., when polymorphic
arithmetic must operate on both integer and floating-point numbers). Objectoriented languages that perform type checking at compile time, including C++,
Eiffel, Java, and C#, generally provide both generics and inheritance. Smalltalk
(Section 9.6.1), Objective-C, Python, and Ruby use a single mechanism (with
run-time checking) to provide both parametric and subtype polymorphism.
DESIGN & IMPLEMENTATION
Coercion and overloading
In addition to their semantic differences, coercion and overloading can have
very different costs. Calling an integer-specific version of min would be much
more efficient than calling the floating-point version with integer arguments:
it would use integer arithmetic for the comparison (which is cheaper in and
of itself) and would avoid four conversion operations. One of the arguments
against supporting coercion in a language is that it tends to impose hidden
costs.

EXAMPLE 3.31  Generic min function in Ada

generic
    type T is private;
    with function "<"(x, y : T) return Boolean;
function min(x, y : T) return T;

function min(x, y : T) return T is
begin
    if x < y then return x;
    else return y;
    end if;
end min;

function string_min is new min(string, "<");
function date_min is new min(date, date_precedes);

Figure 3.17  Use of a generic subroutine in Ada.

As a concrete example of generics, consider the overloaded min functions of
Example 3.30. The code for the integer and floating-point versions is likely to be
very similar. We can exploit this similarity to define a single version that works
not only for integers and reals, but for any type whose values are totally ordered.
This code appears in Figure 3.17. The initial (bodyless) declaration of min is
preceded by a generic clause specifying that two things are required in order to
create a concrete instance of a minimum function: a type, T , and a corresponding
comparison routine. This declaration is followed by the actual code for min .
Given appropriate declarations of string and date types (not shown), we can
create functions to return the lesser of pairs of objects of these types as shown in
the last two lines. (The "<" operation mentioned in the definition of string_min
is presumably overloaded; the compiler resolves the overloading by finding the
version of "<" that takes arguments of type T , where T is already known to be
string .)

DESIGN & IMPLEMENTATION
Generics as macros
In some sense, the local stack module of Figure 3.7 (page 127) is a primitive
sort of generic module. Because it imports the element type and stack_size
constant, it can be inserted (with a text editor) into any context in which these
names are declared, and will produce a “customized” stack for that context
when compiled. Early versions of C++ formalized this mechanism by using
macros to implement templates. Later versions of C++ have made templates
(generics) a fully supported language feature.

EXAMPLE 3.32  Implicit polymorphism in Scheme
With the implicit parametric polymorphism of Lisp, ML, and their descendants, the programmer need not specify a type parameter. The Scheme definition
of min looks like this:
(define min (lambda (a b) (if (< a b) a b)))
It makes no mention of types. The typical Scheme implementation employs an
interpreter that examines the arguments to min and determines, at run time,
whether they support a < operator. Given the preceding definition, the expression (min 123 456) evaluates to 123 ; (min 3.14159 2.71828) evaluates to
2.71828 . The expression (min "abc" "def") produces a run-time error when
evaluated, because the string comparison operator is named string<? , not < .

EXAMPLE 3.33  Implicit polymorphism in Haskell

The Haskell version of min is even simpler and more general:
min a b = if a < b then a else b
This version works for values of any totally ordered type, including strings. It is
type-checked at compile time, using a sophisticated system of type inference (to
be described in Section 7.2.4).
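The same definition in a scripting language behaves much like the Scheme version. The Python sketch below (which shadows the built-in min purely for the sake of the parallel) accepts arguments of any type that supports <, with the check made at run time.

# A sketch only; this shadows Python's built-in min for the sake of parallelism.
def min(a, b):
    return a if a < b else b

print(min(123, 456))            # 123
print(min(3.14159, 2.71828))    # 2.71828
print(min("abc", "def"))        # abc -- unlike Scheme, < is defined on strings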
So what exactly is the difference between the overloaded min functions of Example 3.30 and the generic version of Figure 3.17? The answer lies in the generality of the code. With overloading the programmer must write a separate copy
of the code, by hand, for every type with a min operation. Generics allow the
compiler (in the typical implementation) to create a copy automatically for every
needed type. The similarity of the calling syntax and of the generated code has
led some authors to refer to overloading as ad hoc (special case) polymorphism.
There is no particular reason, however, for the programmer to think of generics
in terms of multiple copies: from a semantic (conceptual) point of view, overloaded subroutines use a single name for more than one thing; a polymorphic
subroutine is a single thing.
CHECK YOUR UNDERSTANDING
30. Describe the difference between deep and shallow binding of referencing environments.
31. Why are binding rules particularly important for languages with dynamic
scoping?
32. What is a closure? What is it used for? How is it implemented?
33. What are first-class subroutines? What languages support them?
34. Explain the distinction between limited and unlimited extent of objects in a
local scope.
35. What are aliases? Why are they considered a problem in language design and
implementation?
36. Explain the value of the restrict qualifier in C99.
37. Explain the differences between overloading, coercion, and polymorphism.
38. Define parametric and subtype polymorphism. Explain the distinction between explicit and implicit parametric polymorphism. Which is also known
as genericity?
39. Why is overloading sometimes referred to as ad hoc polymorphism?
3.7   Separate Compilation
Since most large programs are constructed and tested incrementally, and since
the compilation of a very large program can be a multihour operation, any language designed to support large programs must provide a separate compilation
facility.
IN MORE DEPTH
Because they are designed for encapsulation and provide a narrow interface,
modules are the natural choice for the “compilation units” of many programming languages. The separate module headers and bodies of Modula-3 and Ada,
for example, are explicitly intended for separate compilation, and reflect experience gained with more primitive facilities in other languages. C and C++, by
contrast, must maintain backward compatibility with mechanisms designed in
the early 1970s. C++ includes a namespace mechanism that provides modulelike data hiding, but names must still be declared before they are used in every
compilation unit, and the mechanisms used to accommodate this rule are purely
a matter of convention. Java and C# break with the C tradition by requiring the
compiler to infer header information automatically from separately compiled
class definitions; no header files are required.
3.8   Summary and Concluding Remarks
This chapter has addressed the subject of names, and the binding of names to
objects (in a broad sense of the word). We began with a general discussion of the
notion of binding time: the time at which a name is associated with a particular
object or, more generally, the time at which an answer is associated with any open
question in language or program design or implementation. We defined the notion of lifetime for both objects and name-to-object bindings, and noted that they
need not be the same. We then introduced the three principal storage allocation
mechanisms—static, stack, and heap—used to manage space for objects.
In Section 3.3 we described how the binding of names to objects is governed by
scope rules. In some languages, scope rules are dynamic: the meaning of a name is
found in the most recently entered scope that contains a declaration and that has
not yet been exited. In most modern languages, however, scope rules are static, or
lexical: the meaning of a name is found in the closest lexically surrounding scope
that contains a declaration. We found that lexical scope rules vary in important
but sometimes subtle ways from one language to another. We considered what
sorts of scopes are allowed to nest, whether scopes are open or closed, whether the
scope of a name encompasses the entire block in which it is declared, and whether
a name must be declared before it is used. We explored the implementation of
scope rules in Section 3.4. In Section 3.5 we considered the question of when to
bind a referencing environment to a subroutine that is passed as a parameter,
returned from a function, or stored in a variable.
Some of the more complicated aspects of lexical scoping illustrate the evolution of language support for data abstraction, a subject to which we will return
in Chapter 9. We began by describing the own or static variables of languages
like Fortran, Algol 60, and C, which allow a variable that is local to a subroutine
to retain its value from one invocation to the next. We then noted that simple
modules can be seen as a way to make long-lived objects local to a group of subroutines, in such a way that they are not visible to other parts of the program.
At the next level of complexity, we noted that some languages treat modules as
types, allowing the programmer to create an arbitrary number of instances of the
abstraction defined by a module. We contrasted this module-as-abstraction style
of programming with the module-as-manager approach. Finally, we noted that
object-oriented languages extend the module-as-abstraction approach by providing an inheritance mechanism that allows new abstractions (classes) to be defined as extensions or refinements of existing classes.
In Section 3.6 we examined several ways in which bindings relate to one another. Aliases arise when two or more names in a given scope are bound to the
same object. Overloading arises when one name is bound to multiple objects.
Polymorphism allows a single body of code to operate on objects of more than
one type, depending on context or execution history. We noted that while similar
effects can sometimes be achieved through overloading, coercion, and polymorphism, the underlying mechanisms are really very different. In Section 3.7 we
considered rules for separate compilation.
Among the topics considered in this chapter, we saw several examples of useful features (recursion, static scoping, forward references, first-class subroutines,
unlimited extent) that have been omitted from certain languages because of
concern for their implementation complexity or run-time cost. We also saw an
example of a feature (the private part of a module specification) introduced expressly to facilitate a language’s implementation, and another (separate compila-
tion in C) whose design was clearly intended to mirror a particular implementation. In several additional aspects of language design (late versus early binding,
static versus dynamic scope, support for coercions and conversions, toleration of
pointers and other aliases), we saw that implementation issues play a major role.
In a similar vein, apparently simple language rules can have surprising implications. In Section 3.3.3, for example, we considered the interaction of wholeblock scope with the requirement that names be declared before they can be used.
Like the do loop syntax and white space rules of Fortran (Section 2.2.2) or the
if . . . then . . . else syntax of Pascal (Section 2.3.2), poorly chosen scoping rules
can make program analysis difficult not only for the compiler, but for human
beings as well. In future chapters we shall see several additional examples of features that are both confusing and hard to compile. Of course, semantic utility and
ease of implementation do not always go together. Many easy-to-compile features
( goto statements, for example) are of questionable value at best. We will also
see several examples of highly useful and (conceptually) simple features, such as
garbage collection (Section 7.7.3) and unification (Sections 7.2.4 and 11.2.1),
whose implementations are quite complex.
3.9   Exercises
3.1 Indicate the binding time (e.g., when the language is designed, when the
program is linked, when the program begins execution, etc.) for each of the
following decisions in your favorite programming language and implementation. Explain any answers you think are open to interpretation.
The number of built-in functions (math, type queries, etc.)
The variable declaration that corresponds to a particular variable reference (use)
The maximum length allowed for a constant (literal) character string
The referencing environment for a subroutine that is passed as a parameter
The address of a particular library routine
The total amount of space occupied by program code and data
3.2 In Fortran 77, local variables are typically allocated statically. In Algol and its
descendants (e.g., Pascal and Ada), they are typically allocated in the stack.
In Lisp they are typically allocated at least partially in the heap. What accounts for these differences? Give an example of a program in Pascal or Ada
that would not work correctly if local variables were allocated statically. Give
an example of a program in Scheme or Common Lisp that would not work
correctly if local variables were allocated on the stack.
3.3 Give two examples in which it might make sense to delay the binding of an
implementation decision, even though sufficient information exists to bind
it early.
3.4 Give three concrete examples drawn from programming languages with
which you are familiar in which a variable is live but not in scope.
3.5 Consider the following pseudocode, assuming nested subroutines and static
scope.
procedure main
    g : integer
    procedure B(a : integer)
        x : integer
        procedure A(n : integer)
            g := n
        procedure R(m : integer)
            write integer(x)
            x /:= 2                –– integer division
            if x > 1
                R(m + 1)
            else
                A(m)
        –– body of B
        x := a × a
        R(1)
    –– body of main
    B(3)
    write integer(g)
(a) What does this program print?
(b) Show the frames on the stack when
A has just been called. For each
frame, show the static and dynamic links.
(c) Explain how A finds g .
3.6 As part of the development team at MumbleTech.com, Janet has written a
list manipulation library for C that contains, among other things, the code
in Figure 3.18.
(a) Accustomed to Java, new team member Brad includes the following
code in the main loop of his program.
list_node *L = 0;
while (more_widgets()) {
insert(next_widget(), L);
}
L = reverse(L);
typedef struct list_node {
void *data;
struct list_node *next;
} list_node;
list_node *insert(void *d, list_node *L) {
list_node *t = (list_node *) malloc(sizeof(list_node));
t->data = d;
t->next = L;
return t;
}
list_node *reverse(list_node *L) {
list_node *rtn = 0;
while (L) {
rtn = insert(L->data, rtn);
L = L->next;
}
return rtn;
}
void delete_list(list_node *L) {
while (L) {
list_node *t = L;
L = L->next;
free(t->data);
free(t);
}
}
Figure 3.18  List management routines for Exercise 3.6.
Sadly, after running for a while, Brad’s program always runs out of
memory and crashes. Explain what’s going wrong.
(b) After Janet patiently explains the problem to him, Brad gives it another
try:
list_node *L = 0;
while (more_widgets()) {
insert(next_widget(), L);
}
list_node *T = reverse(L);
delete_list(L);
L = T;
This seems to solve the insufficient memory problem, but where the
program used to produce correct results (before running out of memory), now its output is strangely corrupted, and Brad goes back to Janet
for advice. What will she tell him this time?
3.7 Rewrite Figures 3.7 and 3.8 in C.
3.8 Modula-2 provides no way to divide the header of a module into a public
part and a private part: everything in the header is visible to the users of
the module. Is this a major shortcoming? Are there disadvantages to the
public/private division (e.g., as in Ada)? (For hints, see Section 9.2.)
3.9 Consider the following fragment of code in C.
{
int a, b, c;
...
{
int d, e;
...
{
int f;
...
}
...
}
...
{
int g, h, i;
...
}
...
}
Assume that each integer variable occupies four bytes. How much total space
is required for the variables in this code? Describe an algorithm that a compiler could use to assign stack frame offsets to the variables of arbitrary
nested blocks, in a way that minimizes the total space required.
3.10 Consider the design of a Fortran 77 compiler that uses static allocation for
the local variables of subroutines. Expanding on the solution to the previous question, describe an algorithm to minimize the total space required
for these variables. You may find it helpful to construct a call graph data
structure in which each node represents a subroutine and each directed arc
indicates that the subroutine at the tail may sometimes call the subroutine
at the head.
3.11 Consider the following pseudocode.
procedure P(A, B : real)
X : real
procedure Q(B, C : real)
Y : real
...
procedure R(A, C : real)
Z : real
...
...
–– (*)
Assuming static scope, what is the referencing environment at the location
marked by (*) ?
3.12 Write a simple program in Scheme that displays three different behaviors,
depending on whether we use let , let* , or letrec to declare a given set
of names. (Hint: To make good use of letrec , you will probably want your
names to be functions [ lambda expressions].)
3.13 Consider the following pseudocode.
x : integer                –– global
procedure set x(n : integer)
    x := n
procedure print x
    write integer(x)
procedure first
    set x(1)
    print x
procedure second
    x : integer
    set x(2)
    print x

set x(0)
first()
print x
second()
print x
What does this program print if the language uses static scoping? What does
it print with dynamic scoping? Why?
3.14 Consider the programming idiom illustrated in Example 3.20. One of the
reviewers for this book suggests that we think of this idiom as a way to implement a central reference table for dynamic scope. Explain what is meant
by this suggestion.
3.15 If you are familiar with structured exception-handling, as provided in Ada,
Modula-3, C++, Java, C#, ML, Python, or Ruby, consider how this mechanism relates to the issue of scoping. Conventionally, a raise or throw statement is thought of as referring to an exception, which it passes as a parameter to a handler-finding library routine. In each of the languages mentioned,
the exception itself must be declared in some surrounding scope, and is subject to the usual static scope rules. Describe an alternative point of view, in
which the raise or throw is actually a reference to a handler, to which it
transfers control directly. Assuming this point of view, what are the scope
rules for handlers? Are these rules consistent with the rest of the language?
Explain. (For further information on exceptions, see Section 8.5.)
3.16 Consider the following pseudocode.
x : integer                –– global
procedure set x(n : integer)
    x := n
procedure print x
    write integer(x)
procedure foo(S, P : function; n : integer)
    x : integer := 5
    if n in {1, 3}
        set x(n)
    else
        S(n)
    if n in {1, 2}
        print x
    else
        P

set x(0); foo(set x, print x, 1); print x
set x(0); foo(set x, print x, 2); print x
set x(0); foo(set x, print x, 3); print x
set x(0); foo(set x, print x, 4); print x
Assume that the language uses dynamic scoping. What does the program
print if the language uses shallow binding? What does it print with deep
binding? Why?
3.17 Consider the following pseudocode.
x : integer := 1
y : integer := 2
procedure add
    x := x + y
procedure second(P : procedure)
    x : integer := 2
    P()
procedure first
    y : integer := 3
    second(add)

first()
write integer(x)
(a) What does this program print if the language uses static scoping?
(b) What does it print if the language uses dynamic scoping with deep binding?
(c) What does it print if the language uses dynamic scoping with shallow
binding?
3.18 In Section 3.6.3 we noted that while a single min function in Fortran would
work for both integer and floating-point numbers, overloading would be
more efficient because it would avoid the cost of type conversions. Give an
example in which overloading does not seem advantageous—one in which it
makes more sense to have a single function with floating-point parameters,
and perform coercion when integers are supplied.
3.19 (a) Write a polymorphic sorting routine in Scheme.
(b) Write a generic sorting routine in C++, Java, or C#. (For hints, see Section 8.4.)
(c) Write a nongeneric sorting routine using subtype polymorphism in
your favorite object-oriented language. Assume that the elements to be
sorted are members of some class derived from class ordered , which
has a method precedes such that a.precedes(b) is true if and only
if a comes before b in some canonical total order. (For hints, see Section 9.4.)
3.20–3.25 In More Depth.
3.10   Explorations
3.26 Experiment with naming rules in your favorite programming language.
Read the manual, and write and compile some test programs. Does the
language use lexical or dynamic scope? Can scopes nest? Are they open or
closed? Does the scope of a name encompass the entire block in which it is
declared, or only the portion after the declaration? How does one declare
mutually recursive types or subroutines? Can subroutines be passed as parameters, returned from functions, or stored in variables? If so, when are
referencing environments bound?
3.27 List the keywords (reserved words) of one or more programming languages.
List the predefined identifiers. (Recall that every keyword is a separate token. An identifier cannot have the same spelling as a keyword.) What criteria do you think were used to decide which names should be keywords
and which should be predefined identifiers? Do you agree with the choices?
Why or why not?
3.28 If you have experience with a language like C, C++, or Pascal, in which dynamically allocated space must be manually reclaimed, describe your experience with dangling references or memory leaks. How often do these bugs
arise? How do you find them? How much effort does it take? Learn about
open source or commercial tools for finding storage bugs (IBM’s Purify
is a popular example). Do such tools weaken the argument for automatic
garbage collection?
3.29 We learned in Section 3.3.6 that modern languages have generally abandoned dynamic scoping. One place it can still be found is in the so-called
environment variables of the Unix programming environment. If you are
not familiar with these, read the manual page for your favorite shell (command interpreter— csh / tcsh , ksh / bash , etc.) to learn how these behave.
Explain why the usual alternatives to dynamic scoping (default parameters
and static variables) are not appropriate in this case.
3.30 Compare the mechanisms for overloading of enumeration names in Ada
and Modula-3 (Section 3.6.2). One might argue that the (historically more
recent) Modula-3 approach moves responsibility from the compiler to the
programmer: it requires even an unambiguous use of an enumeration constant to be annotated with its type. Why do you think this approach was
chosen by the language designers? Do you agree with the choice? Why or
why not?
3.31 Write a program in C++ or Ada that creates at least two concrete types or
subroutines from the same template/generic. Compile your code to assembly language and look at the result. Describe the mapping from source to
target code.
3.32 Do you think coercion is a good idea? Why or why not?
3.33 Give three examples of features that are not provided in some language with
which you are familiar, but that are common in other languages. Why do
you think these features are missing? Would they complicate the implementation of the language? If so, would the complication (in your judgment) be
justified?
3.34–3.38 In More Depth.
3.11
Bibliographic Notes
This chapter has traced the evolution of naming and scoping mechanisms
through many different languages, including Fortran (several versions), Basic,
Algol 60 and 68, Pascal, Simula, C and C++, Euclid, Turing, Modula (1, 2, and 3),
Ada (83 and 95), Oberon, Eiffel, Java, and C#. Bibliographic references for all of
these can be found in Appendix A.
Both modules and objects trace their roots to Simula, which was developed
by Dahl, Nygaard, Myhrhaug, and others at the Norwegian Computing Centre
in the mid-1960s. (Simula I was implemented in 1964; descriptions in this book
pertain to Simula 67.) The encapsulation mechanisms of Simula were refined in
the 1970s by the developers of Clu, Modula, Euclid, and related languages. Other
Simula innovations—inheritance and dynamic method binding in particular—
provided the inspiration for Smalltalk, the original and arguably purest of the
object-oriented languages. Modern object-oriented languages, including Eiffel,
C++, Java, and C#, represent to a large extent a reintegration of the evolutionary
lines of encapsulation on the one hand and inheritance and dynamic method
binding on the other.
The notion of information hiding originates in Parnas’s classic paper “On the
Criteria to Be Used in Decomposing Systems into Modules” [Par72]. Comparative discussions of naming, scoping, and abstraction mechanisms can be found,
among other places, in Liskov et al.’s discussion of Clu [LSAS77], Liskov and Guttag’s text [LG86, Chap. 4], the Ada Rationale [IBFW91, Chaps. 9–12], Harbison’s
text on Modula-3 [Har92, Chaps. 8–9], Wirth’s early work on modules [Wir80],
and his later discussion of Modula and Oberon [Wir88a]. Further information
on object-oriented languages can be found in Chapter 9.
For a detailed discussion of overloading and polymorphism, see the survey by
Cardelli and Wegner [CW85]. Cailliau [Cai82] provides a lighthearted discussion of many of the scoping pitfalls noted in Section 3.3.3. Abelson and Sussman [AS96, p. 11n] attribute the term “syntactic sugar” to Peter Landin.
4
Semantic Analysis
In Chapter 2 we considered the topic of programming language syntax.
In the current chapter we turn to the topic of semantics. Informally, syntax concerns the form of a valid program, while semantics concerns its meaning. Meaning
is important for at least two reasons: it allows us to enforce rules (e.g., type consistency) that go beyond mere form, and it provides the information we need in
order to generate an equivalent output program.
It is conventional to say that the syntax of a language is precisely that portion
of the language definition that can be described conveniently by a context-free
grammar, while the semantics is that portion of the definition that cannot. This
convention is useful in practice, though it does not always agree with intuition.
When we require, for example, that the number of arguments contained in a call
to a subroutine match the number of formal parameters in the subroutine definition, it is tempting to say that this requirement is a matter of syntax. After
all, we can count arguments without knowing what they mean. Unfortunately,
we cannot count them with context-free rules. Similarly, while it is possible to
write a context-free grammar in which every function must contain at least one
return statement, the required complexity makes this strategy very unattractive.
In general, any rule that requires the compiler to compare things that are separated by long distances, or to count things that are not properly nested, ends up
being a matter of semantics.
Semantic rules are further divided into static and dynamic semantics, though
again the line between the two is somewhat fuzzy. The compiler enforces static
semantic rules at compile time. It generates code to enforce dynamic semantic
rules at run time (or to call library routines that do so). Certain errors, such as
division by zero, or attempting to index into an array with an out-of-bounds
subscript, cannot in general be caught at compile time, since they may occur
only for certain input values, or certain behaviors of arbitrarily complex code.
In special cases, a compiler may be able to tell that a certain error will always
or never occur, regardless of run-time input. In these cases, the compiler can
generate an error message at compile time, or refrain from generating code to
perform the check at run time, as appropriate. Basic results from computability
theory, however, tell us that no algorithm can make these predictions correctly for
arbitrary programs. There will inevitably be cases in which an error will always
occur, but the compiler cannot tell, and must delay the error message until run
time. There will also be cases in which an error can never occur, but the compiler
cannot tell, and must incur the cost of unnecessary run-time checks.
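To make the idea of a generated dynamic check concrete, the hand-written C++ fragment below (a sketch added for illustration; it is not drawn from any particular compiler) does what a compiler might arrange for an array reference whose subscript cannot be validated statically:

// A sketch of a dynamic semantic check: the subscript is tested at run time
// before the array is accessed, because no compile-time analysis could prove
// it in range.
#include <cstdio>
#include <cstdlib>

int a[10];

int checked_load(int i) {
    if (i < 0 || i >= 10) {                        // the generated dynamic check
        std::fprintf(stderr, "subscript %d out of bounds\n", i);
        std::exit(1);
    }
    return a[i];                                   // safe only after the check
}

int main(int argc, char **argv) {
    // The index comes from the command line, so the compiler cannot know it.
    return checked_load(argc > 1 ? std::atoi(argv[1]) : 0);
}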
Both semantic analysis and intermediate code generation can be described in
terms of annotation, or decoration, of a parse tree or syntax tree. The annotations
themselves are known as attributes. Numerous examples of static and dynamic
semantic rules will appear in subsequent chapters. In this current chapter we
focus primarily on the mechanisms a compiler uses to enforce the static rules.
We will consider intermediate code generation in Chapter 14.
In Section 4.1 we consider the role of the semantic analyzer in more detail,
considering both the rules it needs to enforce and its relationship to other phases
of compilation. Most of the rest of the chapter is then devoted to the subject
of attribute grammars. Attribute grammars provide a formal framework for the
decoration of a tree. This framework is a useful conceptual tool even in compilers
that do not build a parse tree or syntax tree as an explicit data structure. We
introduce the notion of an attribute grammar in Section 4.2. We then consider
various ways in which such grammars can be applied in practice. Section 4.3
discusses the issue of attribute flow, which constrains the order(s) in which nodes
of a tree can be decorated. In practice, most compilers require decoration of the
parse tree (or the evaluation of attributes that would reside in a parse tree if there
were one) to occur in the process of an LL or LR parse. Section 4.4 presents action
routines as an ad hoc mechanism for such on-the-fly evaluation. In Section 4.5
(mostly on the PLP CD) we consider the management of space for parse tree
attributes.
One particularly common compiler organization uses action routines during
parsing solely for the purpose of constructing a syntax tree. The syntax tree is
then decorated during a separate traversal, which can be formalized, if desired,
with a separate attribute grammar. We consider the decoration of syntax trees in
Section 4.6.
4.1
The Role of the Semantic Analyzer
Programming languages vary dramatically in their choice of semantic rules. In
Section 3.6.3, for example, we saw a range of approaches to coercion, from languages like Fortran and C, which allow operands of many types to be intermixed
in expressions, to languages like Ada, which do not. Languages also vary in the
extent to which they require their implementations to perform dynamic checks.
At one extreme, C requires no checks at all, beyond those that come “free” with
the hardware (e.g., division by zero or attempted access to memory outside the
bounds of the program). At the other extreme, Java takes great pains to check as
many rules as possible, in part to ensure that an untrusted program cannot do
anything to damage the memory or files of the machine on which it runs.
In the typical compiler, the interface between semantic analysis and intermediate code generation defines the boundary between the front end and the back end.
The exact division of labor varies a bit from compiler to compiler: it can be hard
to say exactly where analysis (figuring out what the program means) ends and
synthesis (expressing that meaning in some new form) begins. Many compilers
actually carry a program through more than one intermediate form. In one common organization, described in more detail in Chapter 14, the semantic analyzer
creates an annotated syntax tree, which the intermediate code generator then
translates into a linear form reminiscent of the assembly language for some idealized machine. After machine-independent code improvement, this linear form
is then translated into yet another form, patterned more closely on the assembly
language of the target machine. That form may then undergo machine-specific
code improvement.
Compilers also vary in the extent to which semantic analysis and intermediate code generation are interleaved with parsing. With fully separated phases, the
parser passes a full parse tree on to the semantic analyzer, which converts it to a
syntax tree, fills in the symbol table, performs semantic checks, and passes it on to
the code generator. With fully interleaved phases, there may be no need to build
either the parse tree or the syntax tree in its entirety: the parser can call semantic check and code generation routines “on-the-fly” as it parses each expression,
statement, or subroutine of the source. We will focus on an organization in which
construction of the syntax tree is interleaved with parsing (and the parse tree is
not built), but semantic analysis occurs during a separate traversal of the syntax
tree.
Many compilers that implement dynamic checks provide the option of disabling them if desired. It is customary in some organizations to enable dynamic
checks during program development and testing, and then disable them for production use, to increase execution speed. The wisdom of this practice is questionable: Tony Hoare, one of the key figures in programming language design,1
has likened the programmer who disables semantic checks to a sailing enthusiast who wears a life jacket when training on dry land but removes it when
going to sea [Hoa89, p. 198]. Errors may be less likely in production use than
they are in testing, but the consequences of an undetected error are significantly
worse. Moreover, with the increasing use of multi-issue, superscalar processors
(described in Section 5.4.3), it is often possible for dynamic checks to execute in
instruction slots that would otherwise go unused, making them virtually free. On
1 Among other things, C. A. R. Hoare (1934–) invented the quicksort algorithm and the case
statement, contributed to the design of Algol W, and was one of the leaders in the development
of axiomatic semantics. In the area of concurrent programming, he refined and formalized the
monitor construct (to be described in Section 12.3.4), and designed the CSP programming model
and notation. He received the ACM Turing Award in 1980.
the other hand, some dynamic checks (e.g., for use of uninitialized variables) are
sufficiently expensive that they are rarely implemented.
Assertions
EXAMPLE 4.1   Assertions in Euclid
A few programming languages (e.g., Euclid and Eiffel) allow the programmer to
specify logical assertions, invariants, preconditions, and postconditions that must
be verified by dynamic semantic checks. An assertion is a statement that a specified condition is expected to be true when execution reaches a certain point in
the code. In Euclid one can write
assert denominator not= 0
EXAMPLE 4.2   Assertions in C
An invariant is a condition that is expected to be true at all “clean points” of a
given body of code. In Eiffel the programmer can specify an invariant on the data
inside a class: the invariant is expected to be true at the beginning and end of all
of the class’s methods (subroutines). Similar invariants for loops are expected to
be true before and after every iteration. Pre- and postconditions are expected to
be true at the beginning and end of subroutines, respectively.
Invariants, preconditions, and postconditions are essentially structured assertions. A postcondition, specified once in the header of a Euclid subroutine, will
be checked not only at the end of the subroutine’s text, but at every return statement as well, automatically.
Many languages support assertions via standard library routines or macros. In
C, for example, one can write
assert(denominator != 0);
If the assertion fails, the program will terminate abruptly with the message
myprog.c:42: failed assertion ‘denominator != 0’
The C manual requires assert to be implemented as a macro (or built into the
compiler) so that it has access to the textual representation of its argument, and
to the file name and line number on which the call appears.
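As a rough sketch of how such a macro can gain that access, one might write something like the following. (The name my_assert and the exact message format are illustrative assumptions; the standard assert also honors the NDEBUG macro and leaves its message format up to the implementation.)

// A hand-rolled assertion macro: the # operator captures the text of the
// condition, and __FILE__ and __LINE__ capture the location of the call.
#include <cstdio>
#include <cstdlib>

#define my_assert(cond)                                                 \
    ((cond) ? (void)0                                                   \
            : (std::fprintf(stderr, "%s:%d: failed assertion '%s'\n",   \
                            __FILE__, __LINE__, #cond),                 \
               std::abort()))

int main() {
    int denominator = 0;
    my_assert(denominator != 0);   // aborts, printing file, line, and the text of the test
}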
DESIGN & IMPLEMENTATION
Dynamic semantic checks
In the past, language theorists and researchers in programming methodology
and software engineering tended to argue for more extensive semantic checks,
while “real world” programmers “voted with their feet” for languages like C
and Fortran, which omitted those checks in the interest of execution speed.
As computers have become more powerful, and as companies have come to
appreciate the enormous costs of software maintenance, the “real world” camp
has become much more sympathetic to checking. Languages like Ada and Java
have been designed from the outset with safety in mind, and languages like
C and C++ have evolved (to the extent possible) toward increasingly strict
definitions.
Assertions, of course, could be used to cover the other three sorts of checks,
but not as clearly or succinctly. Invariants, preconditions, and postconditions are
a prominent part of the header of the code to which they apply, and can cover
a potentially large number of places where an assertion would otherwise be required. Euclid and Eiffel implementations allow the programmer to disable assertions and related constructs when desired, to eliminate their run-time cost.
Static Analysis
In general, compile-time algorithms that predict run-time behavior are known
as static analysis. Such analysis is said to be precise if it allows the compiler to determine whether a given program will always follow the rules. Type checking, for
example, is static and precise in languages like Ada, C, and ML: the compiler ensures that no variable will ever be used at run time in a way that is inappropriate
for its type. By contrast, languages like Lisp and Smalltalk obtain greater flexibility, while remaining completely type-safe, by accepting the run-time overhead of
dynamic type checks. (We will cover type checking in more detail in Chapter 7.)
Static analysis can also be useful when it isn’t precise. Compilers will often
check what they can at compile time and then generate code to check the rest
dynamically. In Java, for example, type checking is mostly static, but dynamically
loaded classes and type casts may require run-time checks. In a similar vein, many
compilers perform extensive static analysis in an attempt to eliminate the need for
dynamic checks on array subscripts, variant record tags, or potentially dangling
pointers (again, to be discussed in Chapter 7).
If we think of the omission of unnecessary dynamic checks as a performance
optimization, it is natural to look for other ways in which static analysis may
enable code improvement. We will consider this topic in more detail in Chapter 15. Examples include alias analysis, which determines when values can be
safely cached in registers, computed “out of order,” or accessed by concurrent
threads; escape analysis, which determines when all references to a value will be
confined to a given context, allowing it to be allocated on the stack instead of
the heap, or to be accessed without locks; and subtype analysis, which determines
when a variable in an object-oriented language is guaranteed to have a certain
subtype, so that its methods can be called without dynamic dispatch.
An optimization is said to be unsafe if it may lead to incorrect code in certain
programs. It is said to be speculative if it usually improves performance but may
degrade it in certain cases. A compiler is said to be conservative if it applies optimizations only when it can guarantee that they will be both safe and effective.
By contrast, an optimistic compiler may make liberal use of speculative optimizations. It may also pursue unsafe optimizations by generating two versions of the
code, with a dynamic check that chooses between them based on information not
available at compile time. Examples of speculative optimization include nonbinding prefetches, which try to bring data into the cache before they are needed, and
trace scheduling, which rearranges code in hopes of improving the performance
of the processor pipeline and the instruction cache.
To eliminate dynamic checks, language designers may choose to tighten semantic rules, banning programs for which conservative analysis fails. The ML type system (Section 7.2.4), for example, avoids the dynamic type checks of Lisp but disallows certain useful programming idioms that Lisp supports. Similarly, the definite assignment rules of Java and C# (Section 6.1.3) allow the compiler to ensure that a variable is always given a value before it is used in an expression, but disallow certain programs that are legal (and correct) in C.
4.2
Attribute Grammars

EXAMPLE 4.3   Bottom-up CFG for constant expressions
In Chapter 2 we learned how to use a context-free grammar to specify the syntax of a programming language. Here, for example, is an LR (bottom-up) grammar for arithmetic expressions composed of constants, with precedence and associativity:

E −→ E + T
E −→ E - T
E −→ T
T −→ T * F
T −→ T / F
T −→ F
F −→ - F
F −→ ( E )
F −→ const

EXAMPLE 4.4   Bottom-up AG for constant expressions
This grammar will generate all properly formed constant expressions over the
basic arithmetic operators, but it says nothing about their meaning. To tie these
expressions to mathematical concepts (as opposed to, say, floor tile patterns or
dance steps), we need additional notation. The most common is based on attributes. In our expression grammar, we can associate a val attribute with each E,
T, F, and const in the grammar. The intent is that for any symbol S, S.val will
be the meaning, as an arithmetic value, of the token string derived from S. We
assume that the val of a const is provided to us by the scanner. We must then invent a set of rules for each production to specify how the vals of different symbols
are related. The resulting attribute grammar is shown in Figure 4.1.
In this simple grammar, every production has a single rule. We shall see more
complicated grammars later in which productions can have several rules. The
rules come in two forms. Those in productions 3, 6, 8, and 9 are known as copy
rules; they specify that one attribute should be a copy of another. The other rules
invoke semantic functions ( sum , quotient , additive inverse , etc.). In this example,
the semantic functions are all familiar arithmetic operations. In general, they can
be arbitrarily complex functions specified by the language designer. Each seman-
1. E1 −→ E2 + T
E1 .val := sum(E2 .val, T.val)
2. E1 −→ E2 - T
E1 .val := difference(E2 .val, T.val)
3. E −→ T
E.val := T.val
4. T1 −→ T2 * F
T1 .val := product(T2 .val, F.val)
5. T1 −→ T2 / F
T1 .val := quotient(T2 .val, F.val)
6. T −→ F
T.val := F.val
7. F1 −→ - F2
F1 .val := additive inverse(F2 .val)
8. F −→ ( E )
F.val := E.val
9. F −→ const
F.val := const.val
Figure 4.1 A simple attribute grammar for constant expressions, using the standard arithmetic operations.
tic function takes an arbitrary number of arguments (each of which must be an
attribute of a symbol in the current production: no constants, global variables,
etc.), and each computes a single result, which must likewise be assigned into an
attribute of a symbol in the current production. When more than one symbol of
a production has the same name, subscripts are used to distinguish them. These
subscripts are solely for the benefit of the semantic functions; they are not part
of the context-free grammar itself.
In a strict definition of attribute grammars, copy rules and semantic function
calls are the only two kinds of permissible rules. In practice, it is common to
allow rules to consist of small fragments of code in some well-defined notation
(e.g., the language in which a compiler is being written) so that simple semantic
functions can be written out “in-line.” These code fragments are not allowed to
refer to any variables or attributes outside the current production (we will relax
this restriction when we discuss action routines in Section 4.4). In our examples
we use a symbol to introduce each code fragment corresponding to a single
semantic function.
Semantic functions must be written in some already-existing notation, because attribute grammars do not really specify the meaning of a program; rather,
they provide a way to associate a program with something else that presumably
has meaning. Neither the notation for semantic functions nor the types of the
attributes themselves (i.e., the domain of values passed to and returned from semantic functions) is intrinsic to the attribute grammar notion. In the preceding example, we have used an attribute grammar to associate numeric values
with the symbols in our grammar, using semantic functions drawn from ordinary arithmetic. In the code generation phase of a compiler, we might associate
fragments of target machine code with our symbols, using semantic functions
written in some existing programming language. If we were interested in defining the meaning of a programming language in a machine-independent way, our
attributes might be domain theory denotations (these are the basis of denotational
semantics). If we were interested in proving theorems about the behavior of programs in our language, our attributes might be logical formulas (this is the basis
of axiomatic semantics).2 These more formal concepts are beyond the scope of
this text (but see the Bibliographic Notes at the end of the chapter). We will use
attribute grammars primarily as a framework for building a syntax tree, checking
semantic rules, and (in Chapter 14) generating code.
4.3
Evaluating Attributes

EXAMPLE 4.5   Decoration of a parse tree
The process of evaluating attributes is called annotation or decoration of the parse
tree. Figure 4.2 shows how to decorate the parse tree for the expression (1 + 3)
* 2 , using the attribute grammar of Figure 4.1. Once decoration is complete, the
value of the overall expression can be found in the val attribute of the root of the
tree.
Synthesized Attributes
The attribute grammar of Figure 4.1 is very simple. Each symbol has at most one
attribute (the punctuation marks have none). Moreover, they are all so-called
synthesized attributes: their values are calculated (synthesized) only in productions in which their symbol appears on the left-hand side. For annotated parse
trees like the one in Figure 4.2, this means that the attribute flow—the pattern in
which information moves from node to node—is entirely bottom-up.
An attribute grammar in which all attributes are synthesized is said to be
S-attributed. The arguments to semantic functions in an S-attributed grammar
are always attributes of symbols on the right-hand side of the current production, and the return value is always placed into an attribute of the left-hand
side of the production. Tokens (terminals) often have intrinsic properties (e.g.,
the character-string representation of an identifier or the value of a numeric
2 It’s actually stretching things a bit to discuss axiomatic semantics in the context of attribute
grammars. Axiomatic semantics is intended not so much to define the meaning of programs as
to permit one to prove that a given program satisfies some desired property (e.g., computes some
desired function).
Figure 4.2 Decoration of a parse tree for (1 + 3) * 2 . The val attributes of symbols are
shown in boxes. Curved arrows represent the attribute flow, which is strictly upward in this
case.
constant); in a compiler these are synthesized attributes initialized by the scanner.
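For concreteness, here is a small hand-written sketch (the Node layout is an assumption made for this example, and the tree is a simplified expression tree rather than a full parse tree) that decorates (1 + 3) * 2 with a synthesized val attribute in a single post-order traversal, in the spirit of the rules in Figure 4.1:

// Purely bottom-up evaluation of a synthesized 'val' attribute: each node's
// val is computed only from the vals of its children, so one post-order walk
// suffices.
#include <iostream>
#include <memory>

struct Node {
    char op;                           // '+', '-', '*', '/', or 'k' for a constant leaf
    double val = 0;                    // the synthesized attribute
    std::unique_ptr<Node> left, right;
    Node(double k) : op('k'), val(k) {}
    Node(char o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
};

double decorate(Node &n) {             // children first, then the parent
    if (n.op == 'k') return n.val;
    double l = decorate(*n.left), r = decorate(*n.right);
    switch (n.op) {
        case '+': n.val = l + r; break;
        case '-': n.val = l - r; break;
        case '*': n.val = l * r; break;
        case '/': n.val = l / r; break;
    }
    return n.val;
}

int main() {                           // the tree for (1 + 3) * 2
    auto t = std::make_unique<Node>('*',
        std::make_unique<Node>('+', std::make_unique<Node>(1.0), std::make_unique<Node>(3.0)),
        std::make_unique<Node>(2.0));
    std::cout << decorate(*t) << "\n"; // prints 8
}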
Inherited Attributes
EXAMPLE 4.6   Top-down CFG and parse tree for subtraction
In general, we can imagine (and will in fact have need of) attributes whose values
are calculated when their symbol is on the right-hand side of the current production. Such attributes are said to be inherited. They allow contextual information to flow into a symbol from above or from the side, so that the rules of that
production can be enforced in different ways (or generate different values) depending on surrounding context. Symbol table information is commonly passed
from symbol to symbol by means of inherited attributes. Inherited attributes of
the root of the parse tree can also be used to represent the external environment
(characteristics of the target machine, command-line arguments to the compiler,
etc.).
As a simple example of inherited attributes, consider the following simplified
fragment of an LL(1) expression grammar (here covering only subtraction):
expr −→ const expr tail
expr tail −→ - const expr tail
expr tail −→ ε
For the expression 9 - 4 - 3 , we obtain the following parse tree:
EXAMPLE 4.7   Decoration with left-to-right attribute flow
If we want to create an attribute grammar that accumulates the value of the overall expression into the root of the tree, we have a problem: because subtraction is
left-associative, we cannot summarize the right subtree of the root with a single
numeric value. If we want to decorate the tree bottom-up, with an S-attributed
grammar, we must be prepared to describe an arbitrary number of right operands
in the attributes of the top-most expr tail node (see Exercise 4.4). This is indeed
possible, but it defeats the purpose of the formalism: in effect, it requires us to
embed the entire tree into the attributes of a single node, and do all the real work
inside a single semantic function.
If, however, we are allowed to pass attribute values not only bottom-up but
also left-to-right in the tree, then we can pass the 9 into the top-most expr tail
node, where it can be combined (in proper left-associative fashion) with the 4 .
The resulting 5 can then be passed into the middle expr tail node, combined with
the 3 to make 2, and then passed upward to the root:
1. E −→ T TT
TT.st := T.val
E.val := TT.val
2. TT1 −→ + T TT2
TT2 .st := TT1 .st + T.val
TT1 .val := TT2 .val
3. TT1 −→ - T TT2
TT2 .st := TT1 .st − T.val
TT1 .val := TT2 .val
4. TT −→ ε
TT.val := TT.st
5. T −→ F FT
FT.st := F.val
T.val := FT.val
6. FT1 −→ * F FT2
FT2 .st := FT1 .st × F.val
FT1 .val := FT2 .val
7. FT1 −→ / F FT2
FT2 .st := FT1 .st ÷ F.val
FT1 .val := FT2 .val
8. FT −→ ε
FT.val := FT.st
9. F1 −→ - F2
F1 .val := − F2 .val
10. F −→ ( E )
F.val := E.val
11. F −→ const
F.val := const.val
Figure 4.3   An attribute grammar for constant expressions based on an LL(1) CFG.

EXAMPLE 4.8   Top-down AG for subtraction
To effect this style of decoration, we need the following attribute rules:
expr −→ const expr tail
expr tail.st := const.val
expr.val := expr tail.val
expr tail1 −→ - const expr tail2
expr tail2 .st := expr tail1 .st − const.val
expr tail1 .val := expr tail2 .val
expr tail −→ ε
expr tail.val := expr tail.st
EXAMPLE 4.9   Top-down AG for constant expressions
In each of the first two productions, the first rule serves to copy the left context
(value of the expression so far) into a “subtotal” ( st ) attribute; the second rule
copies the final value from the right-most leaf back up to the root.
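This style of decoration maps directly onto hand-written recursive descent: the inherited st attribute becomes a parameter passed down, and the synthesized val attribute becomes the return value passed back up. The sketch below (with pre-lexed tokens and other simplifications assumed for brevity) evaluates 9 - 4 - 3 exactly as the rules above prescribe:

// The inherited attribute 'st' is a parameter; the synthesized 'val' is the
// return value.  Tokens are simplified to a list of constants; reaching the
// end of the list plays the role of the epsilon production.
#include <iostream>
#include <vector>

struct Parser {
    std::vector<int> consts;       // e.g. {9, 4, 3} for 9 - 4 - 3
    size_t next = 0;

    int expr() {                   // expr -> const expr_tail
        int st = consts[next++];   //   expr_tail.st := const.val
        return expr_tail(st);      //   expr.val := expr_tail.val
    }
    int expr_tail(int st) {        // expr_tail -> - const expr_tail | epsilon
        if (next == consts.size())
            return st;             //   epsilon: expr_tail.val := expr_tail.st
        int c = consts[next++];
        return expr_tail(st - c);  //   expr_tail2.st := expr_tail1.st - const.val
    }
};

int main() {
    Parser p;
    p.consts = {9, 4, 3};
    std::cout << p.expr() << "\n"; // prints 2: left-associative (9 - 4) - 3
}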
We can flesh out the grammar fragment of Example 4.6 to produce a more
complete expression grammar, as shown in Figure 4.3. The underlying CFG for
this grammar accepts the same language as the one in Figure 4.1, but where that
one was SLR(1), this one is LL(1). Attribute flow for a parse of (1 + 3) * 2 ,
Figure 4.4 Decoration of a top-down parse tree for (1 + 3) * 2 , using the attribute grammar of Figure 4.3. Curved
arrows again represent attribute flow, which is no longer bottom-up, but is still left-to-right.
using the LL(1) grammar, appears in Figure 4.4. As in the grammar fragment of
Example 4.6, the value of the left operand of each operator is carried into the
TT and FT productions by the st (subtotal) attribute. The relative complexity of
the attribute flow arises from the fact that operators are left associative, but the
grammar cannot be left recursive: the left and right operands of a given operator
are thus found in separate productions. Grammars to perform semantic analysis
for practical languages generally require some non-S-attributed flow.
Attribute Flow
Just as a context-free grammar does not specify how it should be parsed, an attribute grammar does not specify the order in which attribute rules should be
invoked. Put another way, both notations are declarative: they define a set of valid
trees, but they don’t say how to build or decorate them. Among other things, this
means that the order in which attribute rules are listed for a given production is
immaterial; attribute flow may require them to execute in any order. If in Figure 4.3 we were to reverse the order in which the rules appear in productions
1, 2, 3, 5, 6, and/or 7 (listing the rule for symbol.val first), it would be a purely
cosmetic change; the grammar would not be altered.
We say an attribute grammar is well defined if its rules determine a unique set
of values for the attributes of every possible parse tree. An attribute grammar is
noncircular if it never leads to a parse tree in which there are cycles in the attribute
flow graph—that is, if no attribute, in any parse tree, ever depends (transitively)
on itself. (A grammar can be circular and still be well defined if attributes are
guaranteed to converge to a unique value.) As a general rule, practical attribute
grammars tend to be noncircular.
An algorithm that decorates parse trees by invoking the rules of an attribute
grammar in an order that respects the tree’s attribute flow is called a translation
scheme. Perhaps the simplest scheme is one that makes repeated passes over a
tree, invoking any semantic function whose arguments have all been defined, and
stopping when it completes a pass in which no values change. Such a scheme is
said to be oblivious, in the sense that it exploits no special knowledge of either the
parse tree or the grammar. It will halt only if the grammar is well defined. Better
performance, at least for noncircular grammars, may be achieved by a dynamic
scheme that tailors the evaluation order to the structure of a given parse tree—for
example, by constructing a topological sort of the attribute flow graph and then
invoking rules in an order consistent with the sort.
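The oblivious scheme is easy to prototype. In the toy sketch below, the attribute instances and rules for (1 + 3) * 2 are written out by hand in a flat table (names such as sum.val are invented for the example); the driver simply sweeps over the rules until a pass computes nothing new:

// An "oblivious" evaluator: repeatedly fire any rule whose inputs are all
// defined, and stop when a full pass changes nothing.  The rule for the
// product is listed first on purpose, so that it cannot fire until a later
// pass, after the sum has been computed.
#include <functional>
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct Rule {
    std::string output;
    std::vector<std::string> inputs;
    std::function<double(const std::vector<double>&)> fn;
};

int main() {
    std::map<std::string, std::optional<double>> attr = {
        {"one.val", 1}, {"three.val", 3}, {"two.val", 2},   // from the scanner
        {"sum.val", std::nullopt}, {"product.val", std::nullopt}};
    std::vector<Rule> rules = {
        {"product.val", {"sum.val", "two.val"},  [](auto &a){ return a[0] * a[1]; }},
        {"sum.val",     {"one.val", "three.val"}, [](auto &a){ return a[0] + a[1]; }}};

    bool changed = true;
    while (changed) {                       // repeated passes over the rules
        changed = false;
        for (auto &r : rules) {
            if (attr[r.output]) continue;   // already computed
            std::vector<double> args;
            bool ready = true;
            for (auto &in : r.inputs) {
                if (attr[in]) args.push_back(*attr[in]);
                else { ready = false; break; }
            }
            if (ready) { attr[r.output] = r.fn(args); changed = true; }
        }
    }
    std::cout << *attr["product.val"] << "\n";   // prints 8
}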
The fastest translation schemes, however, tend to be static—based on an analysis of the structure of the attribute grammar itself, and then applied mechanically
to any tree arising from the grammar. Like LL and LR parsers, linear-time static
translation schemes can be devised only for certain restricted classes of grammars. S-attributed grammars, such as the one in Figure 4.1, form the simplest
such class. Because attribute flow in an S-attributed grammar is strictly bottom-up, attributes can be evaluated by visiting the nodes of the parse tree in exactly the
same order that those nodes were generated by the parser. In fact, the attributes
can be evaluated on-the-fly during a bottom-up parse, thereby interleaving parsing and semantic analysis (attribute evaluation).
The attribute grammar of Figure 4.3 is a good bit messier than that of Figure 4.1, but it is still L-attributed: its attributes can be evaluated by visiting the
nodes of the parse tree in a single left-to-right, depth-first traversal (the same order in which they are visited during a top-down parse). If we say that an attribute
A.s depends on an attribute B.t if B.t is ever passed to a semantic function that
returns a value for A.s, then we can define L-attributed grammars more formally
with the following two rules: (1) each synthesized attribute of a left-hand side
symbol depends only on that symbol’s own inherited attributes or on attributes
(synthesized or inherited) of the production’s right-hand side symbols; and (2)
each inherited attribute of a right-hand side symbol depends only on inherited
attributes of the left-hand side symbol or on attributes (synthesized or inherited)
of symbols to its left in the right-hand side.
S-attributed grammars are the most general class of attribute grammars
for which evaluation can be implemented on-the-fly during an LR parse.
L-attributed grammars are a proper superset of S-attributed grammars. They
are the most general class of attribute grammars for which evaluation can be implemented on-the-fly during an LL parse. If we interleave semantic analysis (and
possibly intermediate code generation) with parsing, then a bottom-up parser
must in general be paired with an S-attributed translation scheme; a top-down
parser must be paired with an L-attributed translation scheme. (Depending on
the structure of the grammar, it is often possible for a bottom-up parser to accommodate some non-S-attributed attribute flow; we consider this possibility
in Section 4.5.1.) If we choose to separate parsing and semantic analysis into
separate passes, then the code that builds the parse tree or syntax tree must still
use an S-attributed or L-attributed translation scheme (as appropriate), but the
semantic analyzer can use a more powerful scheme if desired. There are certain
tasks, such as the generation of code for “short-circuit” Boolean expressions (to
be discussed in Sections 6.1.5 and 6.4.1), that are easiest to accomplish with a
non-L-attributed scheme.
One-Pass Compilers
A compiler that interleaves semantic analysis and code generation with parsing
is said to be a one-pass compiler.3 It is unclear whether interleaving semantic
analysis with parsing makes a compiler simpler or more complex; it’s mainly a
matter of taste. If intermediate code generation is interleaved with parsing, one
need not build a syntax tree at all (unless of course the syntax tree is the intermediate code). Moreover, it is often possible to write the intermediate code
to an output file on-the-fly, rather than accumulating it in the attributes of the
root of the parse tree. The resulting space savings were important for previous generations of computers, which had very small main memories. On the
other hand, semantic analysis is easier to perform during a separate traversal of
DESIGN & IMPLEMENTATION
Forward references
In Sections 3.3.3 and 3.4.1 we noted that the scope rules of many languages
require names to be declared before they are used, and provide special mechanisms to introduce the forward references needed for recursive definitions.
While these rules may help promote the creation of clear, maintainable code,
an equally important motivation, at least historically, was to facilitate the construction of one-pass compilers. With increases in memory size, processing
speed, and programmer expectations regarding the quality of code improvement, multipass compilers have become ubiquitous, and language designers
have felt free (as, for example, in the class declarations of C++, Java, and C#)
to abandon the requirement that declarations precede uses.
3 Most authors use the term one-pass only for compilers that translate all the way from source to
target code in a single pass. Some authors insist only that intermediate code be generated in a
single pass, and permit additional pass(es) to translate intermediate code to target code.
E1 −→ E2 + T
E1 .ptr := make bin op(“+”, E2 .ptr, T.ptr)
E1 −→ E2 - T
E1 .ptr := make bin op(“−”, E2 .ptr, T.ptr)
E −→ T
E.ptr := T.ptr
T1 −→ T2 * F
T1 .ptr := make bin op(“×”, T2 .ptr, F.ptr)
T1 −→ T2 / F
T1 .ptr := make bin op(“÷”, T2 .ptr, F.ptr)
T −→ F
T.ptr := F.ptr
F1 −→ - F2
F1 .ptr := make un op(“+/− ”, F2 .ptr)
F −→ ( E )
F.ptr := E.ptr
F −→ const
F.ptr := make leaf(const.val)
Figure 4.5 Bottom-up attribute grammar to construct a syntax tree. The symbol +/− is used
(as it is on calculators) to indicate change of sign.
a syntax tree, because that tree reflects the program’s semantic structure better
than the parse tree does, especially with a top-down parser, and because one
has the option of traversing the tree in an order other than that chosen by the
parser.
Building a Syntax Tree
EXAMPLE 4.10   Bottom-up and top-down AGs to build a syntax tree
If we choose not to interleave parsing and semantic analysis, we still need to add
attribute rules to the context-free grammar, but they serve only to create the syntax tree—not to enforce semantic rules or generate code. Figures 4.5 and 4.6 contain bottom-up and top-down attribute grammars, respectively, to build a syntax
tree for constant expressions. The attributes in these grammars hold neither numeric values nor target code fragments; instead they point to nodes of the syntax tree. Function make leaf returns a pointer to a newly allocated syntax tree
node containing the value of a constant. Functions make un op and make bin
op return pointers to newly allocated syntax tree nodes containing a unary or binary operator, respectively, and pointers to the supplied operand(s). Figures 4.7
and 4.8 show stages in the decoration of parse trees for (1 + 3) * 2 , using the
grammars of Figures 4.5 and 4.6, respectively.
E −→ T TT
TT.st := T.ptr
E.ptr := TT.ptr
TT1 −→ + T TT2
TT2 .st := make bin op(“+”, TT1 .st, T.ptr)
TT1 .ptr := TT2 .ptr
TT1 −→ - T TT2
TT2 .st := make bin op(“−”, TT1 .st, T.ptr)
TT1 .ptr := TT2 .ptr
TT −→ ε
TT.ptr := TT.st
T −→ F FT
FT.st := F.ptr
T.ptr := FT.ptr
FT1 −→ * F FT2
FT2 .st := make bin op(“×”, FT1 .st, F.ptr)
FT1 .ptr := FT2 .ptr
FT1 −→ / F FT2
FT2 .st := make bin op(“÷”, FT1 .st, F.ptr)
FT1 .ptr := FT2 .ptr
FT −→ ε
FT.ptr := FT.st
F1 −→ - F2
F1 .ptr := make un op(“+/− ”, F2 .ptr)
F −→ ( E )
F.ptr := E.ptr
F −→ const
F.ptr := make leaf(const.val)
Figure 4.6 Top-down attribute grammar to construct a syntax tree. Here the st attribute, like
the ptr attribute (and unlike the st attribute of Figure 4.3), is a pointer to a syntax tree node.
CHECK YOUR UNDERSTANDING
1. What determines whether a language rule is a matter of syntax or of static
semantics?
2. Why is it impossible to detect certain program errors at compile time, even
though they can be detected at run time?
3. What is an attribute grammar?
4. What are programming assertions? What is their purpose?
5. What is the difference between synthesized and inherited attributes?
Figure 4.7 Construction of a syntax tree via decoration of a bottom-up parse tree, using the
grammar of Figure 4.5. In diagram (a), the values of the constants 1 and 3 have been placed
in new syntax tree leaves. Pointers to these leaves propagate up into the attributes of E and
T. In (b), the pointers to these leaves become child pointers of a new internal + node. In (c)
the pointer to this node propagates up into the attributes of T, and a new leaf is created for 2 .
Finally, in (d), the pointers from T and F become child pointers of a new internal × node, and
a pointer to this node propagates up into the attributes of E.
Figure 4.8 Construction of a syntax tree via decoration of a top-down parse tree, using the grammar of Figure 4.6. In the
top diagram, (a), the value of the constant 1 has been placed in a new syntax tree leaf. A pointer to this leaf then propagates to
the st attribute of TT. In (b), a second leaf has been created to hold the constant 3 . Pointers to the two leaves then become
child pointers of a new internal + node, a pointer to which propagates from the st attribute of the bottom-most TT, where
it was created, all the way up and over to the st attribute of the top-most FT. In (c), a third leaf has been created for the
constant 2 . Pointers to this leaf and to the + node then become the children of a new × node, a pointer to which propagates
from the st of the lower FT, where it was created, all the way to the root of the tree.
6. Give two examples of information that is typically passed through inherited
attributes.
7. What is attribute flow?
8. What is a one-pass compiler?
9. What does it mean for an attribute grammar to be S-attributed? L-attributed?
Noncircular? What is the significance of these grammar classes?
4.4
Action Routines
Just as there are automatic tools that will construct a parser for a given context-free grammar, there are automatic tools that will construct a semantic analyzer
(attribute evaluator) for a given attribute grammar. Attribute evaluator generators are heavily used in syntax-based editors [RT88], incremental compilers [SDB84], and programming language research. Most production compilers,
however, use an ad hoc, handwritten translation scheme, interleaving parsing
with at least the initial construction of a syntax tree, and possibly all of semantic
analysis and intermediate code generation. Because they are able to evaluate the
attributes of each production as it is parsed, they do not need to build the full
parse tree.
An ad hoc translation scheme that is interleaved with parsing takes the form
of a set of action routines. An action routine is a semantic function that the programmer (grammar writer) instructs the compiler to execute at a particular point
in the parse. Most parser generators allow the programmer to specify action routines. In an LL parser generator, an action routine can appear anywhere within
a right-hand side. A routine at the beginning of a right-hand side will be called
as soon as the parser predicts the production. A routine embedded in the middle of a right-hand side will be called as soon as the parser has matched (the
yield of) the symbol to the left. The implementation mechanism is simple: when
DESIGN & IMPLEMENTATION
Attribute evaluators
Automatic evaluators based on formal attribute grammars are popular in language research projects because they save developer time when the language
definition changes. They are popular in syntax-based editors and incremental
compilers because they save execution time: when a small change is made to
a program, the evaluator may be able to “patch up” tree decorations significantly faster than it could rebuild them from scratch. For the typical compiler,
however, semantic analysis based on a formal attribute grammar is overkill: it
has higher overhead than action routines, and doesn’t really save the compiler
writer that much work.
E −→ T { TT.st := T.ptr } TT { E.ptr := TT.ptr }
TT1 −→ + T { TT2 .st := make bin op(“+”, TT1 .st, T.ptr) } TT2 { TT1 .ptr := TT2 .ptr }
TT1 −→ - T { TT2 .st := make bin op(“−”, TT1 .st, T.ptr) } TT2 { TT1 .ptr := TT2 .ptr }
TT −→ { TT.ptr := TT.st }
T −→ F { FT.st := F.ptr } FT { T.ptr := FT.ptr }
FT1 −→ * F { FT2 .st := make bin op(“×”, FT1 .st, F.ptr) } FT2 { FT1 .ptr := FT2 .ptr }
FT1 −→ / F { FT2 .st := make bin op(“÷”, FT1 .st, F.ptr) } FT2 { FT1 .ptr := FT2 .ptr }
FT −→ { FT.ptr := FT.st }
F1 −→ - F2 { F1 .ptr := make un op(“+/− ”, F2 .ptr) }
F −→ ( E ) { F.ptr := E.ptr }
F −→ const { F.ptr := make leaf(const.val) }
Figure 4.9   LL(1) grammar with action routines to build a syntax tree.

EXAMPLE 4.11   Top-down action routines to build a syntax tree
it predicts a production, the parser pushes all of the right-hand side onto the
stack—terminals (to be matched), nonterminals (to drive future predictions),
and pointers to action routines. When it finds a pointer to an action routine at
the top of the parse stack, the parser simply calls it.
To make this process more concrete, consider again our LL(1) grammar for
constant expressions. Action routines to build a syntax tree while parsing this
grammar appear in Figure 4.9. The only difference between this grammar and
the one in Figure 4.6 is that the action routines (delimited here with curly braces)
are embedded among the symbols of the right-hand sides; the work performed
is the same. The ease with which the attribute grammar can be transformed into
the grammar with action routines is due to the fact that the attribute grammar is
L-attributed. If it required more complicated flow, we would not be able to cast
it in the form of action routines.
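For readers who find code easier to follow than grammars, the sketch below renders most of Figure 4.9 as a hand-written recursive-descent parser; unary minus and error checking are omitted, and the token and syntax tree types are simplified assumptions. Each action routine from the figure appears at the corresponding point in a function body.

// Recursive descent with embedded actions.  Node construction mirrors the
// make_leaf / make_bin_op calls of Figure 4.9.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string op;                // "+", "-", "*", "/", or "" for a constant leaf
    double val = 0;
    std::shared_ptr<Node> left, right;
};
using P = std::shared_ptr<Node>;
P make_leaf(double v)                    { auto n = std::make_shared<Node>(); n->val = v; return n; }
P make_bin_op(const std::string &o, P l, P r) {
    auto n = std::make_shared<Node>(); n->op = o; n->left = l; n->right = r; return n;
}

struct Tok { char kind; double val; };   // kind: 'n' for const, else the operator or paren

struct Parser {
    std::vector<Tok> toks; size_t i = 0;
    char peek() { return i < toks.size() ? toks[i].kind : '$'; }

    P expr() { P t = term(); return term_tail(t); }          // E -> T TT
    P term_tail(P st) {                                       // TT -> + T TT | - T TT | eps
        if (peek() == '+' || peek() == '-') {
            std::string op(1, toks[i++].kind);
            P t = term();
            return term_tail(make_bin_op(op, st, t));         // TT2.st := make_bin_op(op, TT1.st, T.ptr)
        }
        return st;                                            // TT -> eps: TT.ptr := TT.st
    }
    P term() { P f = factor(); return factor_tail(f); }       // T -> F FT
    P factor_tail(P st) {                                     // FT -> * F FT | / F FT | eps
        if (peek() == '*' || peek() == '/') {
            std::string op(1, toks[i++].kind);
            P f = factor();
            return factor_tail(make_bin_op(op, st, f));
        }
        return st;
    }
    P factor() {                                              // F -> ( E ) | const
        if (peek() == '(') { ++i; P e = expr(); ++i; return e; }   // skip ')'
        return make_leaf(toks[i++].val);                      // F.ptr := make_leaf(const.val)
    }
};

double eval(const P &n) {            // quick check that the tree has the right shape
    if (n->op.empty()) return n->val;
    double l = eval(n->left), r = eval(n->right);
    return n->op == "+" ? l + r : n->op == "-" ? l - r : n->op == "*" ? l * r : l / r;
}

int main() {                         // tokens for (1 + 3) * 2
    Parser p;
    p.toks = {{'(',0},{'n',1},{'+',0},{'n',3},{')',0},{'*',0},{'n',2}};
    std::cout << eval(p.expr()) << "\n";    // prints 8
}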
Bottom-Up Evaluation
In an LR parser generator, one cannot in general embed action routines at arbitrary places in a right-hand side, since the parser does not in general know what
production it is in until it has seen all or most of the yield. LR parser generators
therefore permit action routines only after the point at which the production being parsed can be identified unambiguously (this is known as the trailing part of
the right-hand side; the ambiguous part is the left corner). If the attribute flow
of the action routines is strictly bottom-up (as it is in an S-attributed attribute
grammar), then execution at the end of right-hand sides is all that is needed.
The attribute grammars of Figures 4.1 and 4.5, in fact, are essentially identical
to the action routine versions. If the action routines are responsible for a significant part of semantic analysis, however (as opposed to simply building a syntax
tree), then they will often need contextual information in order to do their job.
To obtain and use this information in an LR parse, they will need some (necessarily limited) access to inherited attributes or to information outside the current
production. We consider this issue further in Section 4.5.1.
4.5
Space Management for Attributes
Any attribute evaluation method requires space to hold the attributes of the
grammar symbols. If we are building an explicit parse tree, then the obvious approach is to store attributes in the nodes of the tree themselves. If we are not
building a parse tree, then we need to find a way to keep track of the attributes
for the symbols we have seen (or predicted) but not yet finished parsing. The
details differ in bottom-up and top-down parsers.
For a bottom-up parser with an S-attributed grammar, the obvious approach
is to maintain an attribute stack that directly mirrors the parse stack: next to
every state number on the parse stack is an attribute record for the symbol we
shifted when we entered that state. Entries in the attribute stack are pushed and
popped automatically by the parser driver; space management is not an issue for
the writer of action routines. Complications arise if we try to achieve the effect of
inherited attributes, but these can be accommodated within the basic attribute-stack framework.
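A toy rendering of this attribute-stack discipline appears below. The shift and reduce steps for (1 + 3) * 2 are written out by hand so the example stands alone, and copy-rule reductions such as E −→ T, which leave the top attribute record unchanged, are elided.

// Next to every parse-stack entry the driver keeps an attribute record, here
// just a double (operators and parentheses get a placeholder value).
#include <iostream>
#include <vector>

std::vector<double> attr_stack;                // mirrors the parse stack

void shift(double v = 0) { attr_stack.push_back(v); }   // token's attribute (0 if none)

void reduce_bin(char op) {                     // E -> E + T, T -> T * F: pop 3, push 1
    double rhs = attr_stack.back(); attr_stack.pop_back();
    attr_stack.pop_back();                     // discard the operator's record
    double lhs = attr_stack.back(); attr_stack.pop_back();
    attr_stack.push_back(op == '+' ? lhs + rhs : lhs * rhs);
}

void reduce_paren() {                          // F -> ( E ): keep E's value
    attr_stack.pop_back();                     // ')'
    double e = attr_stack.back(); attr_stack.pop_back();
    attr_stack.pop_back();                     // '('
    attr_stack.push_back(e);
}

int main() {
    shift();            // (
    shift(1);           // const 1; F -> const, T -> F, E -> T are copy rules
    shift();            // +
    shift(3);           // const 3; F -> const, T -> F
    reduce_bin('+');    // E -> E + T                 attribute stack: 0 4
    shift();            // )
    reduce_paren();     // F -> ( E ), then T -> F    attribute stack: 4
    shift();            // *
    shift(2);           // const 2; F -> const
    reduce_bin('*');    // T -> T * F, then E -> T    attribute stack: 8
    std::cout << attr_stack.back() << "\n";    // prints 8
}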
For a top-down parser with an L-attributed grammar, we have two principal
options. The first option is automatic, but more complex than for bottom-up
grammars. It still uses an attribute stack, but one that does not mirror the parse
stack. The second option has lower space overhead, and saves time by “shortcutting” copy rules, but requires action routines to allocate and deallocate space
for attributes explicitly.
In both families of parsers, it is common for some of the contextual information for action routines to be kept in global variables. The symbol table in
particular is usually global. We can be sure that the table will always represent
the current referencing environment because we control the order in which action routines (including those that modify the environment at the beginnings
and ends of scopes) are executed. In a pure attribute grammar we should need
to pass symbol table information into and out of productions through inherited
and synthesized attributes.
IN MORE DEPTH
We consider attribute space management in more detail on the PLP CD. Using bottom-up and top-down grammars for arithmetic expressions, we illustrate
automatic management for both bottom-up and top-down parsers, as well as the
ad hoc option for top-down parsers.
program −→ stmt list $$
stmt list −→ stmt list decl
stmt list −→ stmt list stmt
stmt list −→ ε
decl −→ int id
decl −→ real id
stmt −→ id := expr
stmt −→ read id
stmt −→ write expr
expr −→ term
expr −→ expr add op term
term −→ factor
term −→ term mult op factor
factor −→ ( expr )
factor −→ id
factor −→ int_const
factor −→ real_const
factor −→ float ( expr )
factor −→ trunc ( expr )
add op −→ +
add op −→ -
mult op −→ *
mult op −→ /
Figure 4.10 Context-free grammar for a calculator language with types and declarations.
The intent is that every identifier be declared before use, and that types not be mixed in
computations.
4.6
Decorating a Syntax Tree

EXAMPLE 4.12   Bottom-up CFG for calculator language with types
EXAMPLE 4.13   Syntax tree to average an integer and a real
EXAMPLE 4.14   Tree grammar for the calculator language with types
In our discussion so far we have used attribute grammars solely to decorate parse
trees. As we mentioned in the chapter introduction, attribute grammars can also
be used to decorate syntax trees. If our compiler uses action routines simply to
build a syntax tree, then the bulk of semantic analysis and intermediate code
generation will use the syntax tree as base.
Figure 4.10 contains a bottom-up CFG for a calculator language with types
and declarations. The grammar differs from that of Example 2.35 (page 81) in
three ways: (1) we allow declarations to be intermixed with statements, (2) we
differentiate between integer and real constants (presumably the latter contain a
decimal point), and (3) we require explicit conversions between integer and real
operands. The intended semantics of our language requires that every identifier
be declared before it is used, and that types not be mixed in computations.
Extrapolating from the example in Figure 4.5, it is easy to add semantic functions or action routines to the grammar of Figure 4.10 to construct a syntax tree
for the calculator language (Exercise 4.19). The obvious structure for such a tree
would represent expressions as we did in Figure 4.7, and would represent a program as a linked list of declarations and statements. As a concrete example, Figure 4.11 contains the syntax tree for a simple program to print the average of an
integer and a real.
Much as a context-free grammar describes the possible structure of parse trees
for a given programming language, we can use a tree grammar to represent the
possible structure of syntax trees. As in a CFG, each production of a tree grammar
represents a possible relationship between a parent and its children in the tree.
The parent is the symbol on the left-hand side of the production; the children are
Figure 4.11   Syntax tree for a simple calculator program.
the symbols on the right-hand side. The productions used in Figure 4.11 might
look something like this:
program −→ item
int decl : item −→ id item
read : item −→ id item
real decl : item −→ id item
write : item −→ expr item
null : item −→ ε
‘÷’ : expr −→ expr expr
‘+’ : expr −→ expr expr
float : expr −→ expr
id : expr −→ ε
real const : expr −→ ε

The notation A : B on the left-hand side of a production means that A is one kind of B, and may appear anywhere a B is expected on a right-hand side.
Tree grammars and context-free grammars differ in important ways. A context-free grammar is meant to define (generate) a language composed of strings of tokens, where each string is the fringe (yield) of a parse tree. Parsing is the process
of finding a tree that has a given yield. A tree grammar, as we use it here, is meant
to define (or generate) the trees themselves. We have no need for a notion of
parsing: we can easily inspect a tree and determine whether (and how) it can
EXAMPLE 4.15   Tree AG for the calculator language with types
be generated by the grammar. Our purpose in introducing tree grammars is to
provide a framework for the decoration of syntax trees. Semantic rules attached
to the productions of a tree grammar can be used to define the attribute flow of
a syntax tree in exactly the same way that semantic rules attached to the productions of a context-free grammar are used to define the attribute flow of a parse
tree. We will use a tree grammar in the remainder of this section to perform static semantic checking. In Chapter 14 we will show how additional semantic rules
can be used to generate intermediate code.
Figure 4.12 contains a complete tree attribute grammar for our calculator language with types. Once decorated, the program node at the root of the syntax
tree will contain a list, in a synthesized attribute, of all static semantic errors in
the program. (The list will be empty if the program is free of such errors.) Each
item or expr node has an inherited attribute symtab that contains a list, with
types, of all identifiers declared to the left in the tree. Each item node also has
an inherited attribute errors in that lists all static semantic errors found to its left
in the tree, and a synthesized attribute errors out to propagate the final error list
back to the root. Each expr node has one synthesized attribute that indicates its
type and another that contains a list of any static semantic errors found inside.
Our handling of semantic errors illustrates a common technique. In order to
continue looking for other errors we must provide values for any attributes that
would have been set in the absence of an error. To avoid cascading error messages,
we choose values for those attributes that will pass quietly through subsequent
checks. In our specific example we employ a pseudo-type called error , which we
associate with any symbol table entry or expression for which we have already
generated a message.
In our example grammar we accumulate error messages into a synthesized
attribute of the root of the syntax tree. In an ad hoc attribute evaluator we might
be tempted to print these messages on the fly as the errors are discovered. In
practice, however, particularly in a multipass compiler, it makes sense to buffer
the messages so they can be interleaved with messages produced by other phases
of the compiler and printed in program order at the end of compilation.
Though it takes a bit of checking to verify the fact, our attribute grammar is
noncircular and well defined. No attribute is ever assigned a value more than
once. (The helper routines in Figure 4.12 should be thought of as macros rather
than semantic functions. For the sake of brevity we have passed them entire tree
nodes as arguments. Each macro calculates the values of two different attributes.
Under a strict formulation of attribute grammars each macro would be replaced
by two separate semantic functions, one per calculated attribute.)
One could convert our attribute grammar into executable code using an automatic attribute evaluator generator. Alternatively, one could create an ad hoc
evaluator in the form of mutually recursive subroutines (Exercise 4.18). In the
latter case attribute flow would be explicit in the calling sequence of the routines.
We could then choose if desired to keep the symbol table in global variables,
program −→ item
item.symtab := nil
program.errors := item.errors out
item.errors in := nil
int decl : item1 −→ id item2
declare name(id, item1 , item2 , int)
item1 .errors out := item2 .errors out
real decl : item1 −→ id item2
declare name(id, item1 , item2 , real)
item1 .errors out := item2 .errors out
read : item1 −→ id item2
item2 .symtab := item1 .symtab
if id.name, ? ∈ item1 .symtab
item2 .errors in := item1 .errors in
else
item2 .errors in := item1 .errors in + [id.name “undefined at” id.location]
item1 .errors out := item2 .errors out
write : item1 −→ expr item2
expr.symtab := item1 .symtab
item2 .symtab := item1 .symtab
item2 .errors in := item1 .errors in + expr.errors
item1 .errors out := item2 .errors out
‘ := ’ : item1 −→ id expr item2
expr.symtab := item1 .symtab
item2 .symtab := item1 .symtab
if id.name, A ∈ item1 .symtab
–– for some type A
if A = error and expr.type = error and A = expr.type
item2 .errors in := item1 .errors in + [“type clash at” item1 .location]
else
item2 .errors in := item1 .errors in
else
item2 .errors in := item1 .errors in + [id.name “undefined at” id.location]
item1 .errors out := item2 .errors out
null : item −→ item.errors out := item.errors in
Figure 4.12 Attribute grammar to decorate an abstract syntax tree for the calculator language with types. We use square brackets to delimit error messages and pointed brackets to
delimit symbol table entries. Juxtaposition indicates concatenation within error messages; the
‘+’ and ‘−’ operators indicate insertion and removal in lists. We assume that every node has
been initialized by the scanner or by action routines in the parser to contain an indication of
the location (line and column) at which the corresponding construct appears in the source (see
Exercise 4.20). The ‘ ? ’ symbol is used as a “wild card”; it matches any type. (continued)
id : expr −→ if ⟨id.name, A⟩ ∈ expr.symtab    –– for some type A
                 expr.errors := nil
                 expr.type := A
             else
                 expr.errors := [id.name “undefined at” id.location]
                 expr.type := error

int const : expr −→ expr.type := int
real const : expr −→ expr.type := real

‘+’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check types(expr1, expr2, expr3)

‘−’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check types(expr1, expr2, expr3)

‘×’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check types(expr1, expr2, expr3)

‘÷’ : expr1 −→ expr2 expr3
    expr2.symtab := expr1.symtab
    expr3.symtab := expr1.symtab
    check types(expr1, expr2, expr3)

float : expr1 −→ expr2
    expr2.symtab := expr1.symtab
    convert type(expr2, expr1, int, real, “float of non-int”)

trunc : expr1 −→ expr2
    expr2.symtab := expr1.symtab
    convert type(expr2, expr1, real, int, “trunc of non-real”)
Figure 4.12   (continued on next page)
rather than passing it from node to node through attributes. Most compilers employ the ad hoc approach.
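To give a rough sense of what that ad hoc approach looks like, the following C sketch recasts the checks performed by the read, write, and ‘:=’ productions of Figure 4.12 as part of a recursive tree walk. The node layout and the helper routines st_lookup, error_at, and check_expr are our own inventions for illustration (they are not part of the calculator compiler described in this chapter); attribute flow is implicit in the calling sequence, and the symbol table and error list live in globals behind the helpers, as in Exercise 4.18.

    #include <stddef.h>

    /* Hypothetical node layout and helpers; a real front end would flesh these out. */
    typedef enum { T_INT, T_REAL, T_ERROR } type_t;

    typedef struct node {
        enum { N_READ, N_WRITE, N_ASSIGN } kind;
        const char *name;          /* identifier name, where relevant */
        struct node *expr;         /* right-hand side or output expression */
        struct node *next;         /* the following item in the program */
        int line, col;             /* source coordinates, set by the scanner */
    } node;

    /* Global symbol table and error list, in the spirit of Exercise 4.18. */
    extern int    st_lookup(const char *name, type_t *t);   /* 0 if undefined */
    extern void   error_at(int line, int col, const char *msg);
    extern type_t check_expr(node *e);       /* recursive walk over expressions */

    /* One routine per nonterminal; errors_in/errors_out become the global error
       list, and symtab becomes the global table. */
    void check_item(node *n)
    {
        type_t lhs_type, rhs_type;
        if (n == NULL)                       /* the null item */
            return;
        switch (n->kind) {
        case N_READ:
            if (!st_lookup(n->name, &lhs_type))
                error_at(n->line, n->col, "undefined variable");
            break;
        case N_WRITE:
            (void) check_expr(n->expr);      /* checked for its type; value unused */
            break;
        case N_ASSIGN:
            rhs_type = check_expr(n->expr);
            if (!st_lookup(n->name, &lhs_type))
                error_at(n->line, n->col, "undefined variable");
            else if (lhs_type != T_ERROR && rhs_type != T_ERROR
                     && lhs_type != rhs_type)
                error_at(n->line, n->col, "type clash");
            break;
        }
        check_item(n->next);                 /* continue with the next item */
    }

Adding a new kind of item to the language then means adding a new case to check_item, rather than a new production and set of rules to the attribute grammar.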
CHECK YOUR UNDERSTANDING
10. What is the difference between a semantic function and an action routine?
11. Why can’t action routines be placed at arbitrary locations within the right-hand side of productions in an LR CFG?
12. What patterns of attribute flow can be captured easily with action routines?
macro declare name(id, cur item, next item : syntax tree node; t : type)
    if ⟨id.name, ?⟩ ∈ cur item.symtab
        next item.errors in := cur item.errors in + [“redefinition of” id.name “at” cur item.location]
        next item.symtab := cur item.symtab − ⟨id.name, ?⟩ + ⟨id.name, error⟩
    else
        next item.errors in := cur item.errors in
        next item.symtab := cur item.symtab + ⟨id.name, t⟩

macro check types(result, operand1, operand2)
    if operand1.type = error or operand2.type = error
        result.type := error
        result.errors := operand1.errors + operand2.errors
    else if operand1.type ≠ operand2.type
        result.type := error
        result.errors := operand1.errors + operand2.errors + [“type clash at” result.location]
    else
        result.type := operand1.type
        result.errors := operand1.errors + operand2.errors

macro convert type(old expr, new expr : syntax tree node; from t, to t : type; msg : string)
    if old expr.type = from t or old expr.type = error
        new expr.errors := old expr.errors
        new expr.type := to t
    else
        new expr.errors := old expr.errors + [msg “at” old expr.location]
        new expr.type := error
Figure 4.12   (continued)
13. Some compilers perform all semantic checks and intermediate code generation in action routines. Others use action routines to build a syntax tree and
then perform semantic checks and intermediate code generation in separate
traversals of the syntax tree. Discuss the tradeoffs between these two strategies.
14. What sort of information do action routines typically keep in global variables,
rather than in attributes?
15. Describe the similarities and differences between context-free grammars and
tree grammars.
16. How can a semantic analyzer avoid the generation of cascading error messages?
4.7 Summary and Concluding Remarks
This chapter has discussed the task of semantic analysis. We reviewed the sorts of
language rules that can be classified as syntax, static semantics, and dynamic semantics, and discussed the issue of whether to generate code to perform dynamic
semantic checks. We also considered the role that the semantic analyzer plays in
a typical compiler. We noted that both the enforcement of static semantic rules
and the generation of intermediate code can be cast in terms of annotation, or
decoration, of a parse tree or syntax tree. We then presented attribute grammars
as a formal framework for this decoration process.
An attribute grammar associates attributes with each symbol in a context-free
grammar or tree grammar, and attribute rules with each production. Synthesized
attributes are calculated only in productions in which their symbol appears on
the left-hand side. The synthesized attributes of tokens are initialized by the scanner. Inherited attributes are calculated in productions in which their symbol appears within the right-hand side; they allow calculations internal to a symbol
to depend on the context in which the symbol appears. Inherited attributes of
the start symbol (goal) can represent the external environment of the compiler.
Strictly speaking, attribute grammars allow only copy rules (assignments of one
attribute to another) and simple calls to semantic functions, but we usually relax
this restriction to allow more or less arbitrary code fragments in some existing
programming language.
Just as context-free grammars can be categorized according to the parsing algorithm(s) that can use them, attribute grammars can be categorized according
to the complexity of their pattern of attribute flow. S-attributed grammars, in
which all attributes are synthesized, can naturally be evaluated in a single bottom-up pass over a parse tree, in precisely the order the tree is discovered by an LR-family parser. L-attributed grammars, in which all attribute flow is depth-first
left-to-right, can be evaluated in precisely the order that the parse tree is predicted
and matched by an LL-family parser. Attribute grammars with more complex
patterns of attribute flow are not commonly used in production compilers but
are valuable for syntax-based editors, incremental compilers, and various other
tools.
While it is possible to construct automatic tools to analyze attribute flow and
decorate parse trees, most compilers rely on action routines, which the compiler
writer embeds in the right-hand sides of productions to evaluate attribute rules at
specific points in a parse. In an LL-family parser, action routines can be embedded at arbitrary points in a production’s right-hand side. In an LR-family parser,
action routines must follow the production’s left corner. Space for attributes in a
bottom-up compiler is naturally allocated in parallel with the parse stack. Inherited attributes must be “faked” by accessing the synthesized attributes of symbols
known to lie below the current production in the stack. Space for attributes in
a top-down compiler can be allocated automatically, or managed explicitly by
the writer of action routines. The automatic approach has the advantage of regularity, and is easier to maintain; the ad hoc approach is slightly faster and more
flexible.
In a one-pass compiler, which interleaves scanning, parsing, semantic analysis,
and code generation in a single traversal of its input, semantic functions or action
routines are responsible for all of semantic analysis and code generation. More
commonly, action routines simply build a syntax tree, which is then decorated
during separate traversal(s) in subsequent pass(es).
In subsequent chapters (6–9 in particular) we will consider a wide variety
of programming language constructs. Rather than present the actual attribute
grammars required to implement these constructs, we will describe their semantics informally, and give examples of the target code. We will return to attribute
grammars in Chapter 14, when we consider the generation of intermediate code
in more detail.
4.8 Exercises
4.1 Basic results from automata theory tell us that the language L = {a^n b^n c^n} = {ε, abc , aabbcc , aaabbbccc , . . . } is not context free. It can be captured,
however, using an attribute grammar. Give an underlying CFG and a set of
attribute rules that associate a Boolean attribute ok with the root R of each
parse tree, such that R.ok = true if and only if the string corresponding to
the fringe of the tree is in L.
4.2 Modify the grammar of Figure 2.24 so that it accepts only programs that
contain at least one write statement. Make the same change in the solution
to Exercise 2.12. Based on your experience, what do you think of the idea of
using the CFG to enforce the rule that every function in C must contain at
least one return statement?
4.3 Give two examples of reasonable semantic rules that cannot be checked at
reasonable cost, either statically or by compiler-generated code at run time.
4.4 Write an S-attributed attribute grammar, based on the CFG of Example 4.6,
that accumulates the value of the overall expression into the root of the
tree. You will need to use dynamic memory allocation so that individual
attributes can hold an arbitrary amount of information.
4.5 As we shall learn in Chapter 10, Lisp programs take the form of parenthesized lists. The natural syntax tree for a Lisp program is thus a tree of binary
cells (known in Lisp as cons cells), where the first child represents the first
element of the list and the second child represents the rest of the list. The
syntax tree for (cdr ’(a b c)) appears in Figure 4.13. (The notation ’L is
syntactic sugar for (quote L) .)
Extend the CFG of Exercise 2.13 to create an attribute grammar that will
build such trees. When a parse tree has been fully decorated, the root should
have an attribute v that refers to the syntax tree. You may assume that each
atom has a synthesized attribute v that refers to a syntax tree node that holds
information from the scanner. In your semantic functions, you may assume
the availability of a cons function that takes two references as arguments
and returns a reference to a new cons cell containing those references.
Figure 4.13   Natural syntax tree for the Lisp expression (cdr ’(a b c)).
4.6 Suppose that we want to translate constant expressions into the postfix or
“reverse Polish” notation of logician Jan Łukasiewicz. Postfix notation does
not require parentheses. It appears in stack-based languages such as Postscript, Forth, and the P-code and Java byte code intermediate forms mentioned in Section 1.4. It also serves as the input language of certain Hewlett-Packard (HP) brand calculators. When given a number, an HP calculator
pushes it onto an internal stack. When given an operator, it pops the top
two numbers, applies the operator, and pushes the result. The display shows
the value at the top of the stack. To compute 2 × (5 − 3)/4 one would enter
2 5 3 - * 4 /.
Using the underlying CFG of Figure 4.1, write an attribute grammar that
will associate with the root of the parse tree a sequence of calculator button
pushes, seq, that will compute the arithmetic value of the tokens derived
from that symbol. You may assume the existence of a function buttons (c)
that returns a sequence of button pushes (ending with ENTER on an HP
calculator) for the constant c. You may also assume the existence of a concatenation function for sequences of button pushes.
4.7 Repeat the previous exercise using the underlying CFG of Figure 4.3.
4.8 Consider the following grammar for reverse Polish arithmetic expressions:
E −→ E E op | id
op −→ + | * | - | /
Assuming that each id has a synthesized attribute name of type string, and
that each E and op has an attribute val of type string, write an attribute
grammar that arranges for the val attribute of the root of the parse tree to
contain a translation of the expression into conventional infix notation. For
example, if the leaves of the tree, left to right, were “ A A B - * C / ”, then
the val field of the root would be “ ( ( A * ( A - B ) ) / C ) ”. As an
extra challenge, write a version of your attribute grammar that exploits the
usual arithmetic precedence and associativity rules to use as few parentheses
as possible.
4.9 To reduce the likelihood of typographic errors, the digits comprising most
credit card numbers are designed to satisfy the so-called Luhn formula, standardized by ANSI in the 1960s and named for IBM mathematician Hans
Peter Luhn. Starting at the right, we double every other digit (the second-to-last, fourth-to-last, etc.). If the doubled value is 10 or more, we add the
resulting digits. We then sum together all the digits. In any valid number the
result will be a multiple of 10. For example, 1234 5678 9012 3456 becomes
2264 1658 9022 6416, which sums to 64, so this is not a valid number. If the
last digit had been 2, however, the sum would have been 60, so the number
would potentially be valid.
Give an attribute grammar for strings of digits that accumulates into the
root of the parse tree a Boolean value indicating whether the string is valid
according to Luhn’s formula. Your grammar should accommodate strings of
arbitrary length.
4.10 Consider the following CFG for floating-point constants, without exponential notation. (Note that this exercise is somewhat artificial: the language in
question is regular, and would be handled by the scanner of a typical compiler.)
C −→ digits . digits
digits −→ digit more digits
more digits −→ digits | ε
digit −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Augment this grammar with attribute rules that will accumulate the value
of the constant into a val attribute of the root of the parse tree. Your answer
should be S-attributed.
4.11 One potential criticism of the obvious solution to the previous problem is
that the values in internal nodes of the parse tree do not reflect the value,
in context, of the fringe below them. Create an alternative solution that
addresses this criticism. More specifically, create your grammar in such a
way that the val of an internal node is the sum of the val s of its children. Illustrate your solution by drawing the parse tree and attribute flow
for 12.34 . (Hint: You will probably want a different underlying CFG, and
non-L-attributed flow.)
4.12 Consider the following attribute grammar for type declarations, based on
the CFG of Exercise 2.8.
decl −→ ID decl tail
    decl.t := decl tail.t
    decl tail.in tab := insert (decl.in tab, ID.n, decl tail.t)
    decl.out tab := decl tail.out tab
decl tail −→ , decl
    decl tail.t := decl.t
    decl.in tab := decl tail.in tab
    decl tail.out tab := decl.out tab
decl tail −→ : ID ;
    decl tail.t := ID.n
    decl tail.out tab := decl tail.in tab
Show a parse tree for the string A, B : C; . Then, using arrows and textual
description, specify the attribute flow required to fully decorate the tree.
(Hint: Note that the grammar is not L-attributed.)
4.13 A CFG-based attribute evaluator capable of handling non-L-attributed attribute flow needs to take a parse tree as input. Explain how to build a parse
tree automatically during a top-down or bottom-up parse (i.e., without explicit action routines).
4.14 Write an LL(1) grammar with action routines and automatic attribute space
management that generates the reverse Polish translation described in Exercise 4.6.
4.15 (a) Write a context-free grammar for polynomials in x. Add semantic functions to produce an attribute grammar that will accumulate the polynomial’s derivative (as a string) in a synthesized attribute of the root of the
parse tree.
(b) Replace your semantic functions with action routines that can be evaluated during parsing.
4.16 (a) Write a context-free grammar for
case or switch statements in the
style of Pascal or C. Add semantic functions to ensure that the same
label does not appear on two different arms of the construct.
(b) Replace your semantic functions with action routines that can be evaluated during parsing.
4.17 Write an algorithm to determine whether the rules of an arbitrary attribute
grammar are noncircular. (Your algorithm will require exponential time in
the worst case [JOR75].)
4.18 Rewrite the attribute grammar of Figure 4.12 in the form of an ad hoc tree
traversal consisting of mutually recursive subroutines in your favorite programming language. Keep the symbol table in a global variable, rather than
passing it through arguments.
4.19 Write an attribute grammar based on the CFG of Figure 4.10 that will build
a syntax tree with the structure described in Figure 4.12.
4.20 Augment the attribute grammar of Figure 4.5, Figure 4.6, or Exercise 4.19 to
initialize a synthesized attribute in every syntax tree node that indicates the
location (line and column) at which the corresponding construct appears in
the source program. You may assume that the scanner initializes the location
of every token.
4.21 Modify the CFG and attribute grammar of Figures 4.10 and 4.12 to permit
mixed integer and real expressions, without the need for float and trunc .
You will want to add an annotation to any node that must be coerced to the
opposite type, so that the code generator will know to generate code to do
so. Be sure to think carefully about your coercion rules. In the expression
my_int + my_real , for example, how will you know whether to coerce the
integer to be a real or to coerce the real to be an integer?
4.22 Explain the need for the A : B notation on the left-hand sides of productions in a tree grammar. Why isn’t similar notation required for context-free
grammars?
4.23–4.27 In More Depth.
4.9 Explorations
4.28 One of the most influential applications of attribute grammars was the
Cornell Synthesizer Generator [Rep84, RT88], now available commercially
from grammatech.com.
Learn how the Generator uses attribute grammars not only for incremental update of semantic information in a program under edit, but also
for automatic creation of language-based editors from formal language
specifications. How general is this technique? What applications might it
have beyond syntax-directed editing of computer programs?
4.29 The attribute grammars used in this chapter are all quite simple. Most are
S- or L-attributed. All are noncircular. Are there any practical uses for more
complex attribute grammars? How about automatic attribute evaluators?
Using the Bibliographic Notes as a starting point, conduct a survey of attribute evaluation techniques. Where is the line between practical techniques and intellectual curiosities?
4.30 The first validated Ada implementation was the Ada/Ed interpreter from
New York University [DGAFS+ 80]. The interpreter was written in the set-based language SETL [SDDS86] using a denotational semantics definition
of Ada. Learn about the Ada/Ed project, SETL, and denotational semantics.
Discuss how the use of a formal definition aided the development process.
Also discuss the limitations of Ada/Ed, and expand on the potential role of
formal semantics in language design, development, and prototype implementation.
4.31 The Scheme language manual [ADH+ 98] includes a formal definition of
Scheme in denotational semantics. How long is this definition compared
to the more conventional definition in English? How readable is it? What
do the length and the level of readability say about Scheme? About denotational semantics? (For more on denotational semantics, see the texts of
Stoy [Sto77] or Gordon [Gor79].)
4.32–4.33 In More Depth.
4.10 Bibliographic Notes
Much of the early theory of attribute grammars was developed by Knuth [Knu68].
Lewis, Rosenkrantz, and Stearns [LRS74] introduced the notion of an
L-attributed grammar. Watt [Wat77] showed how to use marker symbols to emulate inherited attributes in a bottom-up parser. Jazayeri, Ogden, and Rounds
[JOR75] showed that exponential time may be required in the worst case to decorate a parse tree with arbitrary attribute flow. Articles by Courcelle [Cou84] and
Engelfriet [Eng84] survey the theory and practice of attribute evaluation. The
best-known attribute grammar system for language-based editing is the Synthesizer Generator [RT88] (a follow-on to the language-specific Cornell Program
Synthesizer [TR81]) of Reps and Teitelbaum. Magpie [SDB84] is an incremental
compiler. Action routines to implement many language features can be found
in the texts of Fischer and LeBlanc [FL88] or Appel [App97]. Further notes on
attribute grammars can be found in the texts of Cooper and Torczon [CT04,
pp. 171–188] or Aho, Sethi, and Ullman [ASU86, pp. 340–342].
Marcotty, Ledgard, and Bochmann [MLB76] provide a survey of formal notations for programming language semantics. The seminal paper on axiomatic
semantics is by Hoare [Hoa69]. An excellent book on the subject is Gries’s The
Science of Programming [Gri81]. The seminal paper on denotational semantics is
by Scott and Strachey [SS71]. Texts on the subject include those of Stoy [Sto77]
and Gordon [Gor79].
5
Target Machine Architecture
As described in Chapter 1, a compiler is simply a translator. It translates
programs written in one language into programs written in another language.
This second language can be almost anything—some other high-level language,
phototypesetting commands, VLSI (chip) layouts—but most of the time it’s the
machine language for some available computer.
Just as there are many different programming languages, there are many different machine languages, though the latter tend to display considerably less diversity than the former. Each machine language corresponds to a different processor
architecture. Formally, an architecture is the interface between the hardware and
the software: the language generated by a compiler, or by a programmer writing for the bare machine. The implementation of the processor is a concrete realization of the architecture, generally in hardware. This chapter provides a brief
overview of those aspects of processor architecture and implementation of particular importance to compiler writers, and may be worth reviewing even by readers
who have seen the material before.
To generate correct code, it suffices for a compiler writer to understand the
target architecture. To generate fast code, it is generally necessary to understand
the implementation as well, because it is the implementation that determines the
relative speeds of alternative translations of a given language construct.
Processor implementations change over time, as people invent better ways of
doing things, and as technological advances (e.g., increases in the number of
transistors that will fit on one chip) make things feasible that were not feasible before. Processor architectures also change, for at least two reasons. Some
technological advances can be exploited only by changing the hardware/software
interface—for example, by increasing the number of bits that can be added or
multiplied in a single instruction. In addition, experience with compilers and
applications often suggests that certain new instructions would make programs
simpler or faster. Occasionally, technological and intellectual trends converge to
produce a revolutionary change in both architecture and implementation. We
will discuss three such changes in Section 5.4: the development of microprogramming in the early 1960s, the development of the microprocessor in the early to
mid-1970s, and the development of RISC machines in the early 1980s. As this
book goes to press it appears we may be on the cusp of a fourth revolution, as
vendors turn to multithreaded and multiprocessor chips in an attempt to increase
computational power per watt of heat output.
Most of the discussion in this chapter, and indeed in the rest of the book,
will assume that we are compiling for a modern RISC (reduced instruction set
computer) architecture. Roughly speaking, a RISC machine is one that sacrifices
richness in the instruction set in order to increase the number of instructions that
can be executed per second. Where appropriate, we will devote a limited amount
of attention to earlier, CISC (complex instruction set computer) architectures.
The most popular desktop processor in the world—the x86 —is a legacy CISC
design, but RISC dominates among newer designs. Modern implementations of
the x86 generally run fastest if compilers restrict themselves to a relatively simple subset of the instruction set. Within the processor, a hardware “front end”
translates these instructions, on the fly, into a RISC-like internal format.
In the first three sections that follow, we consider the hierarchical organization
of memory, the types (formats) of data found in memory, and the instructions
used to manipulate those data. The coverage is necessarily somewhat cursory and
high-level; much more detail can be found in books on computer architecture
(e.g., in Chapter 2 of Hennessy and Patterson’s outstanding text [HP03]).
We consider the interplay between architecture and implementation in Section 5.4. In a supplemental subsection on the PLP CD, we illustrate the differences between CISC and RISC machines using the x86 and MIPS instruction sets
as examples. Finally, in Section 5.5, we consider some of the issues that make
compiling for modern processors a challenging task.
5.1 The Memory Hierarchy
Memory on most machines consists of a numbered sequence of eight-bit bytes.
It is not uncommon for modern workstations to contain several gigabytes of
memory—much too much to fit on the same chip as the processor. Because
memory is off-chip (typically on the other side of a bus), getting at it is much
slower than getting at things on-chip. Most computers therefore employ a memory hierarchy, in which things that are used more often are kept close at hand.
A typical memory hierarchy, with access times and capacities, is shown in Figure 5.1.
Only three of the levels of the memory hierarchy—registers, memory, and
devices—are a visible part of the hardware/software interface. Compilers manage
registers explicitly, loading them from memory when needed and storing them
back to memory when done, or when the registers are needed for something else.
                                 typical access time    typical capacity
registers                        0.2–0.5ns              256–1024 bytes
primary (L1) cache               0.4–1ns                32K–256K bytes
secondary (L2) cache             4–10ns                 512K–2M bytes
tertiary (off-chip, L3) cache    10–50ns                4–64M bytes
main memory                      50–500ns               256M–16G bytes
disk                             5–15ms                 80G bytes and up
tape                             1–50s                  effectively unlimited

Figure 5.1   The memory hierarchy of a workstation-class computer. Access times and capacities are approximate, based on 2005 technology. Registers must be accessed within a single clock
cycle. Primary cache typically responds in 1–2 cycles; off-chip cache in more like 20 cycles. Main
memory on a supercomputer can be as fast as off-chip cache; on a workstation it is typically
much slower. Disk and tape times are constrained by the movement of physical parts.
Caches are managed by the hardware. Devices are generally accessed only by the
operating system.
Registers hold small amounts of data that can be accessed very quickly. A typical RISC machine has two sets of registers, to hold integer and floating-point
operands. It also has several special purpose registers, including the program
counter (PC) and the processor status register. The program counter holds the
address of the next instruction to be executed. It is incremented automatically
when fetching most instructions; branches work by changing it explicitly. The
processor status register contains a variety of bits of importance to the operating
system (privilege level, interrupt priority level, trap enable bits) and, on some
machines, a few bits of importance to the compiler writer. Principal among these
are condition codes, which indicate whether the most recent arithmetic or logical
operation resulted in a zero, a negative value, and/or arithmetic overflow. (We
will consider condition codes in more detail in Section 5.3.2.)
Because registers can be accessed every cycle, whereas memory, generally, cannot, good compilers expend a great deal of effort trying to make sure that the
data they need most often are in registers, and trying to minimize the amount of
time spent moving data back and forth between registers and memory. We will
consider algorithms for register management in Section 5.5.2.
Caches are generally smaller but faster than main memory. They are designed
to exploit locality: the tendency of most computer programs to access the same
or nearby locations in memory repeatedly. By automatically moving the contents
of these locations into cache, a hierarchical memory system can dramatically improve performance. The idea makes intuitive sense: loops tend to access the same
local variables in every iteration, and to walk sequentially through arrays. Instructions, likewise, tend to be loaded from consecutive locations, and code that
accesses one element of a structure (or member of a class) is likely to access another.
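To make the notion of locality concrete, the following C sketch (the array size is an arbitrary choice of ours) computes the same sum twice. The first loop nest visits the array in the order its elements are laid out in memory, so consecutive accesses usually fall in the same cache line; the second jumps a full row's worth of bytes between accesses and can be expected to miss in the cache far more often, even though it performs exactly the same arithmetic.

    #include <stdio.h>

    #define N 1024
    static double a[N][N];     /* 8 MB of data, much larger than the caches of Figure 5.1 */

    /* Row-major traversal: adjacent iterations touch adjacent memory locations. */
    static double sum_row_major(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major traversal: adjacent iterations are N * sizeof(double) bytes apart. */
    static double sum_col_major(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }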
Primary caches, also known as level-1 (L1) caches, are typically located on the
same chip as the processor, and usually come in pairs—one for instructions (the
L1 I-cache) and another for data (the L1 D-cache), both of which can be accessed every cycle. Secondary caches are larger and slower, but still faster than
main memory. In a modern desktop or laptop system they are typically also
on the same chip as the processor. High-end desktop or server-class machines
may have an off-chip tertiary (L3) cache as well. Small embedded processors may
have a single level of on-chip cache, with or without any off-chip cache. Caches
are managed entirely in hardware on most machines, but compilers can increase
their effectiveness by generating code with a high degree of locality.
A memory access that finds its data in the cache is said to be a cache hit. An
access that does not find its data in the cache is said to be a cache miss. On a
miss, the hardware automatically loads a line of the cache with a contiguous block
of data containing the requested location, obtained from the next lower level of
cache or main memory. (Cache lines vary from as few as 8 to as many as 512 bytes
in length.) Assuming that the cache was already full, the load will displace some
other line, which is written back to memory if it has been modified.
A final characteristic of memory that is important to the compiler is known as
data alignment. Most machines are able to manipulate operands of several sizes,
typically one, two, four, and eight bytes. Most modern instruction sets refer to
these as byte, half-word, word, and double-word operands, respectively; on the
x86 they are byte, word, double-word, and quad-word operands. Most recent architectures require n-byte operands to appear in memory at addresses that are
evenly divisible by n. Integers, for example, which typically occupy four bytes,
must appear at a location whose address is evenly divisible by four. This restriction occurs for two reasons. First, buses are designed in such a way that data are
delivered to the processor over bit-parallel, aligned communication paths. Loading an integer from an odd address would require that the bits be shifted, adding
logic (and time) to the load path. The x86, which for reasons of backward compatibility allows operands to appear at arbitrary addresses, runs faster if those
operands are properly aligned. Second, on RISC machines, there are generally
not enough bits in an instruction to specify both an operation (e.g., load ) and a
full address. As we shall see in Section 5.3.1, it is typical to specify an address in
terms of an offset from some base location specified by a register. Requiring that
integers be word-aligned allows the offset to be specified in words, rather than
in bytes, quadrupling the amount of memory that can be accessed using offsets
from a given base register.

DESIGN & IMPLEMENTATION
The processor/memory gap
Historically processor speed has increased much faster than memory speed,
so the number of processor cycles required to access memory has continued
to grow. As a result of this trend, caches have become increasingly critical
to performance. To improve the effectiveness of caching, programmers need
to choose algorithms whose data access patterns have a high degree of locality. High-quality compilers, likewise, need to consider locality of access when
choosing among the many possible translations of a given program.
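Returning to data alignment: its consequences are easy to observe in C, where the compiler inserts padding so that every field of a structure satisfies its alignment requirement. The structure below is our own example, and the exact numbers depend on the target and compiler, but on a typical 32- or 64-bit machine the 6 bytes of declared data occupy 12 bytes of memory.

    #include <stdio.h>
    #include <stdalign.h>                  /* C11: alignof */

    struct mixed {
        char c;   /* 1 byte; the int below must typically be 4-byte aligned, */
        int  i;   /* so the compiler usually inserts 3 bytes of padding first */
        char d;   /* 1 byte, plus trailing padding so arrays of the struct stay aligned */
    };

    int main(void) {
        printf("sizeof(struct mixed) = %zu\n", sizeof(struct mixed));  /* commonly 12, not 6 */
        printf("alignof(int)    = %zu\n", alignof(int));               /* typically 4 */
        printf("alignof(double) = %zu\n", alignof(double));            /* typically 8 */
        return 0;
    }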
5.2 Data Representation
Data in the memory of most computers are untyped: bits are simply bits. Operations are typed, in the sense that different operations interpret the bits in memory
in different ways. Typical data formats include instructions, addresses, binary integers of various lengths, floating-point (real) numbers of various lengths, and
characters.
Integers typically come in half-word, word, and (recently) double-word
lengths. Floating-point numbers typically come in word and double-word
lengths, commonly referred to as single and double precision. Some machines
store the least-significant byte of a multi-word datum at the address of the datum itself, with bytes of increasing numeric significance at higher-numbered addresses. Other machines store the bytes in the opposite order. The first option
is called little-endian; the second is called big-endian. In either case, an n-byte
datum stored at address t occupies bytes t through t + n − 1. The advantage
of a little-endian organization is that it is tolerant of variations in operand size.
If the value 37 is stored as a word and then a byte is read from the same location, the value 37 will be returned. On a big-endian machine, the value 0 will be
returned (the upper eight bits of the number 37, when stored in 32 bits). The
problem with the little-endian approach is that it seems to scramble the bytes of
integers, when read from left to right (see Figure 5.2a). Little-endian-ness makes
a bit more sense if one thinks of memory as a (byte-addressable) array of words
(Figure 5.2b). Among CISC machines, the x86 is little-endian, as was the Digital VAX. The IBM 360/370 and the Motorola 680x0 are big-endian. Most of the
first-generation RISC machines were also big-endian; most of the current RISC
machines can run in either mode.
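A program can discover the byte order of the machine it is running on by storing a known value and then examining its bytes individually, much as in the example above. The following C sketch (the test value and output format are our own) prints the first byte of a four-byte quantity.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        uint32_t word = 0x12345678;
        unsigned char bytes[4];
        memcpy(bytes, &word, sizeof bytes);   /* copy the word's in-memory representation */
        /* On a little-endian machine the lowest-addressed byte is 0x78;
           on a big-endian machine it is 0x12. */
        printf("first byte: 0x%02x => %s-endian\n",
               bytes[0], bytes[0] == 0x78 ? "little" : "big");
        return 0;
    }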
Support for characters varies widely. Most CISC machines will perform arbitrary arithmetic and logical operations on one-byte quantities. Many CISC machines also provide instructions that perform operations on strings of characters,
such as copying, comparing, or searching. Most RISC machines will load and
store bytes from or to memory, but operate only on longer quantities in registers.
5.2.1 Computer Arithmetic
Binary integers are almost universally represented in two related formats:
straightforward binary place-value for unsigned numbers, and two’s complement for signed numbers. An n-bit unsigned integer has a value in the range
0 . . 2^n − 1, inclusive. An n-bit two's complement integer has a value in the range
−2^(n−1) . . 2^(n−1) − 1, inclusive. Most instruction sets provide two forms of most of
the arithmetic operators: one for unsigned numbers and one for signed numbers. Even for languages in which integers are always signed, unsigned arithmetic
is important for the manipulation of addresses (e.g., pointers).

Figure 5.2   Big-endian and little-endian byte orderings. (a) Two four-byte quantities, the numbers 37₁₆ and 12 34 56 78₁₆, stored at addresses 432 and 436, respectively. (b) The same situation
with memory visualized as a byte-addressable array of words.
Floating-point numbers are the computer equivalent of scientific notation:
they consist of a mantissa or significand, sig, an exponent, exp, and (usually) a
sign bit, s. The value of a floating-point number is then (−1)^s × sig × 2^exp. Prior
to the mid-1980s, floating-point formats and semantics tended to vary greatly
across brands and even models of computers. Different manufacturers made different choices regarding the number of bits in each field, their order, and their
internal representation. They also made different choices regarding the behavior
of arithmetic operators with respect to rounding, underflow, overflow, invalid
operations, and the representation of extremely small quantities. With the completion in 1985 of IEEE standard number 754, however, the situation changed
dramatically. Most processors developed in subsequent years conform to the formats and semantics of this standard.
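As a concrete illustration (a sketch that assumes the target uses IEEE single precision, as essentially all modern processors do), the following C fragment pulls apart the sign, exponent, and significand fields of a 32-bit float. The field widths of 1, 8, and 23 bits and the exponent bias of 127 are those of the standard.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = -6.25f;                      /* -1.1001 (binary) x 2^2 */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);        /* reinterpret the float's bits */

        uint32_t sign = bits >> 31;                          /* 1 bit */
        int      exp  = (int)((bits >> 23) & 0xFF) - 127;    /* 8 bits, biased by 127 */
        uint32_t frac = bits & 0x7FFFFF;                     /* 23 bits of fraction */

        /* For normalized numbers, value = (-1)^sign x 1.fraction x 2^exp. */
        printf("sign=%u  exponent=%d  fraction=0x%06x\n",
               (unsigned)sign, exp, (unsigned)frac);
        return 0;
    }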
IN MORE DEPTH
We consider two’s complement and IEEE floating-point arithmetic in more detail
on the PLP CD.
5.3 Instruction Set Architecture
On a RISC machine, computational instructions operate on values held in registers: a load instruction must be used to bring a value from memory into
a register before it can be used as an operand. CISC machines usually allow
all or most computational instructions to access operands directly in memory. RISC machines are therefore said to provide a load-store or register-register
architecture; CISC machines are said to provide a register-memory architecture.
For binary operations, instructions on RISC machines generally specify three
registers: two sources and a destination. Some CISC machines (e.g., the VAX) also
provide three-address instructions. Others (e.g., the x86 and the 680x0) provide
only two-address instructions; one of the operands is always overwritten by the
result. Two-address instructions are more compact, but three-address instructions allow both operands to be reused in subsequent operations. This reuse is
crucial on RISC machines: it minimizes the number of artificial restrictions on
the ordering of instructions, affording the compiler considerably more freedom
in choosing an order that performs well.
5.3.1 Addressing Modes
One can imagine many different ways in which a computational instruction
might specify the location of its operands. A given operand might be in a register, in memory, or, in the case of read-only constants, in the instruction itself.
If the operand is in memory, its address might be found in a register, in memory,
or in the instruction, or it might be derived from some combination of values
in various locations. Instruction sets differ greatly in the addressing modes they
provide to capture these various options.
As noted above, most RISC machines require that the operands of computational instructions reside in registers or the instruction. For load and store instructions, which are allowed to access memory, they typically support the displacement addressing mode, in which the operand’s address is found by adding
some small constant (the displacement) to the value found in a specified register (the base). The displacement is contained in the instruction. Displacement
addressing with respect to the frame pointer provides an easy way to access local variables. Displacement addressing with a displacement of zero is sometimes
called register indirect addressing.
Some RISC machines, including the PowerPC and Sparc, also allow load and
store instructions to use an indexed addressing mode, in which the operand’s address is found by adding the values in two registers. Indexed addressing is useful
for arrays: one register (the base) contains the address of the array; the second
(the index) contains the offset of the desired element.
CISC machines typically provide a richer set of addressing modes, and allow
them to be used in computational instructions, as well as in load s and store s.
On the x86, for example, the address of an operand can be calculated by multiplying the value in one register by a small constant, adding the value found
in a second register, and then adding another small constant, all in one instruction.
5.3.2 Conditions and Branches
All instruction sets provide a branching mechanism to update the program
counter under program control. Branches allow compilers to implement conditional statements, subroutines, and loops. Conditional branches are generally
controlled in one of two ways. On most CISC machines they use condition codes.
As mentioned in Section 5.1, condition codes are usually implemented as a set
of bits in a special processor status register. All or most of the arithmetic, logical,
and data-movement instructions update the condition codes as a side effect. The
exact number of bits varies from machine to machine, but three and four are
common: one bit each to indicate whether the instruction produced a zero value,
a negative value, and/or an overflow or carry. To implement the following test,
for example,
    A := B + C
    if A = 0 then
        body

a compiler for the x86¹ might generate

    movl   C, %eax       ; move longword C into register eax
    addl   B, %eax       ; add
    movl   %eax, A       ; and store
    jne    L1            ; branch (jump) if result not equal to zero
        body
L1:

¹ Readers familiar with the x86 should be warned that this example uses the assembler syntax of
the Gnu gcc compiler and its assembler, gas. This syntax differs in several ways from Microsoft
and Intel assembler. Most notably, it specifies operands in the opposite order. The instruction
addl B, %eax, for example, adds the value in B to the value in register %eax and leaves the
result in %eax: in Gnu assembler the destination operand is listed second. In Intel and Microsoft
assembler it's the other way around: addl B, %eax would add the value in register %eax to the
value in B and leave the result in B.

For cases in which the outcome of a branch depends on a value that has not
just been computed or moved, most machines provide compare and test instructions. Again on the x86:
    if A ≤ B then
        body

    movl   A, %eax       ; move long-word A into register eax
    cmpl   B, %eax       ; compare to B
    jg     L1            ; branch (jump) if greater
        body
L1:

    if A > 0 then
        body

    testl  %eax, %eax    ; compare %eax (A) to 0
    jle    L2            ; branch if less than or equal
        body
L2:
The x86 cmpl instruction subtracts its source operand from its destination
operand and sets the condition codes according to the result, but it does not
overwrite the destination operand. The testl instruction ands its two operands
together and compares the result to zero. Most often, as shown here, the two
operands are the same. When they are different, one is typically a mask value that
allows the programmer or compiler to test individual bits or bit fields in the
other operand.
Unfortunately, traditional condition codes make it difficult to implement
some important performance enhancements. In particular, the fact that they are
set by almost every instruction tends to preclude implementations in which logically unrelated instructions might be executed in between (or in parallel with)
the instruction that tests a condition and the branch that relies on the outcome
of the test. There are several possible ways to address this problem; the handling
of conditional branches is one of the areas in which extant RISC machines vary
most from one another. The ARM and Sparc architectures make setting of the
condition codes optional on an instruction-by-instruction basis. The PowerPC
provides eight separate sets of condition codes; compare and branch instructions
can specify the set to use. The MIPS has no condition codes (at least not for integer operations); it uses Boolean values in registers instead.
More precisely, where the x86 has 16 different branch instructions based on
arithmetic comparisons, the MIPS has only six. Four of these branch if the value
in a register is <, ≤, >, or ≥ zero. The other two branch if the values in two registers are = or ≠. In a convention shared by most RISC machines, register zero is
defined to always contain the value zero, so the latter two instructions cover both
the remaining comparisons to zero and direct comparisons of registers for equality. More general register-register comparisons (signed and unsigned) require a
separate instruction to place a Boolean value in a register that is then named by
the branch instruction. Repeating the preceding examples on the MIPS, we get
    if A ≤ B then
        body

    lw     $3, A         ; load word: register 3 := A
    lw     $2, B         ; register 2 := B
    slt    $2, $2, $3    ; register 2 := (B < A)
    bne    $2, $0, L1    ; branch if Boolean true (≠ 0)
        body
L1:

    if A > 0 then
        body

    blez   $3, L2        ; branch if A ≤ 0
        body
L2:
By convention, destination registers are listed first in MIPS assembler (as they
are in assignment statements). The slt instruction stands for “set less than”;
bne and blez stand for “branch if not equal” and “branch if less than or equal
to zero,” respectively. Note that the compiler has used bne to compare register 2
to the constant register 0.
CHECK YOUR UNDERSTANDING
1. What is the world’s most popular instruction set architecture (for desktop
machines)?
2. What is the difference between big-endian and little-endian addressing?
3. What is the purpose of a cache?
4. Why do many machines have more than one level of cache?
5. How many processor cycles does it typically take to access primary (on-chip)
cache? How many cycles does it typically take to access main memory?
6. What is data alignment? Why do many processors insist upon it?
7. List four common formats (interpretations) for bits in memory.
8. What is IEEE standard number 754? Why is it important?
9. What are the tradeoffs between two-address and three-address instruction
formats?
10. Describe at least five different addressing modes. Which of these are commonly supported on RISC machines?
11. What are condition codes? Why do some architectures not provide them?
What do they provide instead?
5.4 Architecture and Implementation
The typical processor implementation consists of a collection of functional units,
one (or more) for each logically separable facet of processor activity: instruction
fetch, instruction decode, operand fetch from registers, arithmetic computation,
memory access, write-back of results to registers, and so on. One could imagine an implementation in which all of the work for a particular instruction is
completed before work on the next instruction begins, and in fact this is how
many computers used to be constructed. The problem with this organization is
that most of the functional units are idle most of the time. Using ideas originally
developed for supercomputers of the 1960s, processor implementations have increasingly moved toward a pipelined organization, in which the functional units
work like the stations on an assembly line, with different instructions passing
through different pipeline stages concurrently. Pipelining is used in even the most
inexpensive personal computers today, and in all but the simplest processors for
the embedded market. A simple processor may have five or six pipeline stages.
The IBM PowerPC G5 has 21; the Intel Pentium 4E has 31.
By allowing (parts of) multiple instructions to execute in parallel, pipelining
can dramatically increase the number of instructions that can be completed per
second, but it is not a panacea. In particular, a pipeline will stall if the same functional unit is needed in two different instructions simultaneously, or if an earlier
instruction has not yet produced a result by the time it is needed in a later instruction, or if the outcome of a conditional branch is not known (or guessed)
by the time the next instruction needs to be fetched.
We shall see in Section 5.5 that many stalls can be avoided by adding a little extra hardware and then choosing carefully among the various ways of translating
a given construct into target code. An important example occurs in the case of
floating-point arithmetic, which is typically much slower than integer arithmetic.
Rather than stall the entire pipeline while executing a floating-point instruction,
we can build a separate functional unit for floating-point math, and arrange for
it to operate on a separate set of floating-point registers. In effect, this strategy
leads to a pair of pipelines—one for integers and one for floating-point—that
share their first few stages. The integer branch of the pipeline can continue to execute while the floating-point unit is busy, as long as subsequent instructions do
not require the floating-point result. The need to reorder, or schedule, instructions so that those that conflict with or depend on one another are separated
in time is one of the principal reasons why compiling for modern processors is
hard.
5.4.1 Microprogramming
As technology advances, there are occasionally times when it becomes feasible to
design machines in a very different way. During the 1950s and the early 1960s, the
instruction set of a typical computer was implemented by soldering together large
numbers of discrete components (transistors, capacitors, etc.) that performed
the required operations. To build a faster computer, one generally designed new,
more powerful instructions, which required extra hardware. This strategy had
the unfortunate effect of requiring assembly language programmers (or compiler
writers, though there weren’t many of them back then) to learn a new language
every time a new and better computer came along.
A fundamental breakthrough occurred in the early 1960s, when IBM hit
upon the idea of microprogramming. Microprogramming allowed a company
to provide the same instruction set across a whole line of computers, from
inexpensive slow machines to expensive fast machines. The basic idea was to
build a “microengine” in hardware that executed an interpreter program in
“firmware.” The interpreter in turn implemented the “machine language” of
the computer—in this case, the IBM 360 instruction set. More expensive machines had fancier microengines, with more direct support for the instructions
seen by the assembly-level programmer. The top-of-the-line machines had everything in hardware. In effect, the architecture of the machine became an abstract interface behind which hardware designers could hide implementation
details, much as the interfaces of modules in modern programming languages
allow software designers to limit the information available to users of an abstraction.
In addition to allowing the introduction of computer families, microprogramming made it comparatively easy for architects to extend the instruction
set. Numerous studies were published in which researchers identified some
sequence of instructions that commonly occurred together (e.g., the instructions that jump to a subroutine and update bookkeeping information in the
stack) and then introduced a new instruction to perform the same function as
the sequence. The new instruction was usually faster than the sequence it replaced, and almost always shorter (and code size was more important then than
now).
5.4.2 Microprocessors
A second architectural breakthrough occurred in the mid-1970s, when very large-scale integration (VLSI) chip technology reached the point at which a simple
microprogrammed processor could be implemented entirely on one inexpensive chip. The chip boundary is important because it takes much more time and
power to drive signals across macroscopic output pins than it does across intrachip connections, and because the number of pins on a chip is limited by packaging issues. With an entire processor on one chip, it became feasible to build
a commercially viable personal computer. Processor architectures of this era include the MOS Technology 6502, used in the Apple II and the Commodore 64,
and the Intel 8080 and Zilog Z80, used in the Radio Shack TRS-80 and various
CP/M machines. Continued improvements in VLSI technology led, by the mid-1980s, to 32-bit microprogrammed microprocessors such as the Motorola 68000,
used in the original Apple Macintosh, and the Intel 80386, used in the first 32-bit
IBM PCs.
From an architectural standpoint, the principal impact of the microprocessor
revolution was to constrain, temporarily, the number of registers and the size of
operands. Where the IBM 360 (not a single-chip processor) operated on 32-bit
data, with 16 general purpose 32-bit registers, the Intel 8080 operated on 8-bit
data, with only seven 8-bit registers and a 16-bit stack pointer. Over time, as
VLSI density increased, registers and instruction sets expanded as well. Intel’s
32-bit 80386 was introduced in 1985.
5.4.3 RISC
By the early 1980s, several factors converged to make possible a third architectural
breakthrough. First, VLSI technology reached the point at which a pipelined 32-bit processor with a sufficiently simple instruction set could be implemented on
a single chip, without microprogramming. Second, improvements in processor
speed were beginning to outstrip improvements in memory speed, increasing
the relative penalty for accessing memory, and thereby increasing the pressure to
keep things in registers. Third, compiler technology had advanced to the point at
which compilers could often match (and sometimes exceed) the quality of code
produced by the best assembly language programmers. Taken together, these factors suggested a reduced instruction set computer (RISC) architecture with a fast,
all-hardware implementation, a comparatively low-level instruction set, a large
number of registers, and an optimizing compiler.
The advent of RISC machines ran counter to the ever-more-powerful-instructions trend in processor design but was to a large extent consistent with
established trends for supercomputers. Supercomputer instruction sets had always been relatively simple and low-level, in order to facilitate pipelining. Among
other things, effective pipelining depends on having most instructions take the
same, constant number of cycles to execute, and on minimizing dependences
that would prevent a later instruction from starting execution before its predecessors have finished. A major problem with the trend toward more complex
instruction sets was that it made it difficult to design high-performance implementations. Instructions on the VAX, for example, could vary in length from
one to more than 50 bytes, and in execution time from one to thousands of
cycles. Both of these factors tend to lead to pipeline stalls. Variable-length instructions make it difficult to even find the next instruction until the current one
has been studied extensively. Variable execution time makes it difficult to keep all
the pipeline stages busy. The original VAX (the 11/780) was shipped in 1978, but
it wasn’t until 1985 that Digital was able to ship a successfully pipelined version,
the 8600.2
The most basic rule of processor performance holds that total execution time
on any machine equals the number of instructions executed times the average
number of cycles per instruction times the length in time of a cycle. What we
might call the “CISC design philosophy” is to minimize execution time by reducing the number of instructions, letting each instruction do more work. The
“RISC philosophy,” by contrast, is to minimize execution time by reducing the
length of the cycle and the number of (nonoverlapped) cycles per instruction
(CPI).
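As a small, hypothetical worked example: a program that executes 10^9 instructions on a machine averaging 1.5 cycles per instruction, with a 1 ns cycle time (a 1 GHz clock), runs in 10^9 × 1.5 × 1 ns = 1.5 seconds. Halving any one of the three factors (instruction count, CPI, or cycle time) halves the running time; the two philosophies simply differ in which factor they attack most aggressively.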
Recent RISC machines (and RISC-like implementations of the x86) attempt
to minimize CPI by executing as many instructions as possible in parallel. The
2 An alternative approach—to maintain microprogramming but pipeline the microengine—was
adopted by the 8800 and, more recently, by Intel’s Pentium Pro and its successors.
PowerPC G5, for example, can have over 200 instructions simultaneously “in
flight.” Some processors have very deep pipelines, allowing the work of an instruction to be divided into very short cycles. Many are superscalar: they have
multiple parallel pipelines, and start more than one instruction each cycle. (This
requires, of course, that the compiler and/or hardware identify instructions that
do not depend on one another, so that parallel execution is semantically indistinguishable from sequential execution.) To minimize artificial dependences between instructions (as, for instance, when one instruction must finish using a
register as an operand before another instruction overwrites that register with
a new value), many machines perform register renaming, dynamically assigning
logically independent uses of the same architectural register to different locations in a larger set of physical (implementation) registers. High performance
processor implementations may actually execute mutually independent instructions out of order when they can increase instruction-level parallelism by doing
so. These techniques dramatically increase implementation complexity but not
architectural complexity; in fact, it is architectural simplicity that makes them
possible.
5.4.4 Two Example Architectures: The x86 and MIPS

EXAMPLE 5.6: The x86 ISA
EXAMPLE 5.7: The MIPS ISA
We can illustrate the differences between CISC and RISC machines by examining a representative pair of architectures. The x86 is the most widely used CISC
design—in fact, the most widely used processor architecture of any kind (outside
the embedded market). The original model, the 8086, was announced in 1978.
Major changes were introduced by the 8087, 80286, 80386, Pentium Pro, Pentium/MMX, Pentium III, and Pentium 4. While technically backward compatible, these changes were often out of keeping with the philosophy of the earlier
generations. The result is a machine with an enormous number of stylistic inconsistencies and special cases. AMD’s 64-bit extension to the x86, saddled as it
was with the need for backward compatibility, is even more complex. Early generations of the x86 were extensively microprogrammed. More recent generations
still use microprogramming for the more complex portions of the instruction set,
but simpler instructions are translated directly (in hardware) into between one
and four microinstructions that are in turn fed to a heavily pipelined, RISC-like
computational core.
The MIPS architecture, begun as a commercial spin-off of research at Stanford University, is arguably the simplest of the commercial RISC machines. It
too has evolved, through five generations as of 2005, but with one exception—
a jump to 64-bit integer operands and addresses in 1991—the changes have been
relatively minor. MIPS processors were used by Digital Equipment Corp. for a
few years prior to the development of the (now defunct) Alpha architecture, and
by Silicon Graphics, Inc. throughout the 1990s. They are now used primarily in
embedded applications. MIPS-based tools are also widely used in academia. All
        f1 := 0
        goto L2
    L1: f2 := *r1              –– load
        f1 := f1 + f2
        r1 := r1 + 8           –– floating-point numbers are 8 bytes long
        r2 := r2 − 1
    L2: if r2 > 0 goto L1
Figure 5.3 Example of pseudo-assembly notation. The code shown sums the elements of a
floating-point vector of length n. At the beginning, integer register r1 is assumed to point to the
vector and register r2 is assumed to contain n. At the end, floating-point register f1 contains
the sum.
models of the MIPS are implemented entirely in hardware; they are not microprogrammed.
IN MORE DEPTH
Among the most significant differences between the x86 and MIPS are their
memory access mechanisms, their register sets, and the variety of instructions
they provide. Like all RISC machines, the MIPS allows only load and store instructions to access memory; all computation is done with values in registers.
Like most CISC machines, the x86 allows computational instructions to operate
on values in either registers or memory. It also provides a richer set of addressing modes. Like most RISC machines, the MIPS has 32 integer registers and 32
floating-point registers. The x86, by contrast, has only 8 of each, and most of the
floating-point instructions treat the floating-point registers as a tiny stack, rather
than naming them directly. The MIPS provides many fewer distinct instructions
than does the x86, and its instruction set is much more internally consistent; the
x86 has a huge number of special cases. All MIPS instructions are exactly 4 bytes
long. Instructions on the x86 vary from 1 to 17 bytes.
5.4.5 Pseudo-Assembly Notation

EXAMPLE 5.8: Pseudo-assembler
At various times throughout the remainder of this book, we will need to consider
sequences of machine instructions corresponding to some high-level language
construct. Rather than present these sequences in the assembly language of some
particular processor architecture, we will (in most cases) rely on a simple notation designed to represent a generic RISC machine. A brief example appears in
Figure 5.3.
The notation should in most cases be self-explanatory. It uses “assignment
statements” and operators reminiscent of high-level languages, but each line of
code corresponds to a single machine instruction, and registers are named explicitly. Control flow is based entirely on goto s and subroutine calls. Conditional
tests assume that the hardware can perform a comparison and branch in a single
instruction, where the comparison tests the contents of a register against a small
constant or the contents of another register.
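For readers who prefer to see the correspondence with source code, the following C function is one plausible high-level rendering of the loop in Figure 5.3. The function name and parameter types are our own assumptions; the pseudo-assembly notation itself is language neutral.

    /* Sum the elements of a floating-point vector of length n, mirroring
       Figure 5.3: parameter a plays the role of r1, n the role of r2, and
       the local variable sum the role of f1. */
    double vector_sum(const double *a, int n) {
        double sum = 0.0;
        while (n > 0) {
            double x = *a;     /* the load (f2 := *r1)                      */
            sum = sum + x;
            a = a + 1;         /* C scales by sizeof(double), i.e., 8 bytes */
            n = n - 1;
        }
        return sum;
    }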
CHECK YOUR UNDERSTANDING
12. What is microprogramming? What breakthroughs did its invention make
possible?
13. What technological threshold was crossed in the mid-1970s, enabling the introduction of microprocessors? What subsequent threshold, crossed in the
early 1980s, made RISC machines possible?
14. What is pipelining?
15. Summarize the difference between the CISC and RISC philosophies in instruction set design.
16. Why do RISC machines allow only load and store instructions to access memory?
17. Name three CISC architectures. Name three RISC architectures. (If you’re
stumped, see the Summary and Concluding Remarks [Section 5.6].)
18. What three research groups share the credit for inventing RISC? (For this
you’ll probably need to peek at the Bibliographic Notes [Section 5.9].)
19. How can the designer of a pipelined machine cope with instructions (e.g.,
floating-point arithmetic) that take much longer than others to compute?
5.5 Compiling for Modern Processors
Programming a RISC machine by hand, in assembly language, is a tedious undertaking. Only loads and stores can access memory, and then only with limited
addressing modes. Moreover the limited space available in fixed-size instructions
means that a nonintuitive two-instruction sequence is required to load a 32-bit
constant or to jump to an absolute address. In some sense, complexity that used
to be hidden in the microcode of CISC machines has been exported to the compiler.
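As a rough illustration of the constant-loading problem (a sketch only: the 16-bit immediate width and the particular constant are assumptions, and details vary from one instruction set to another), the work performed by the two-instruction sequence amounts to the following:

    #include <stdint.h>

    /* With only (say) 16 bits of immediate field per instruction, a 32-bit
       constant must be assembled in two steps, much as a compiler would
       emit a "load upper half" instruction followed by an
       "or in lower half" instruction. */
    uint32_t build_constant(void) {
        uint32_t upper = (uint32_t)0x1234 << 16;  /* first instruction's work  */
        return upper | 0x5678;                    /* second instruction's work */
    }                                             /* result: 0x12345678        */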
Fortunately, most of the code for modern processors is generated by compilers, which don’t get bored or make careless mistakes, and can easily deal with
comparatively primitive instructions. In fact, when compiling for recent implementations of the x86, compilers generally limit themselves to a small, RISC-like
subset of the instruction set, which the processor can pipeline effectively. Old
programs that make use of more complex instructions still run, but not as fast;
they don’t take full advantage of the hardware.
EXAMPLE 5.9: Performance ≠ clock rate
The real difficulty in compiling for modern processors lies not in the need to
use primitive instructions, but in the need to keep the pipeline full and to make
effective use of registers. A user who trades in a Pentium III PC for one with
a Pentium 4 will typically find that while old programs run faster on the new
machine, the speed improvement is nowhere near as dramatic as the difference
in clock rates would lead one to expect. Improvements will generally be better
if one is able to obtain new program versions that have been compiled with the
newer processor in mind.
5.5.1 Keeping the Pipeline Full
Four main problems may cause a pipelined processor to stall:
Cache misses. A load instruction or an instruction fetch may miss in the cache.
Resource hazards. Two concurrently executing instructions may need to use the
same functional unit at the same time.
Data hazards. An instruction may need an operand that has not yet been produced by an earlier but still executing instruction.
Control hazards. Until the outcome (and target) of a branch instruction is determined, the processor does not know the location from which to fetch subsequent instructions.
All of these problems are amenable, at least in part, to both hardware and
software solutions. On the hardware side, misses can generally be reduced by
building larger or more highly associative caches.3 Resource hazards, likewise, can
be addressed by building multiple copies of the various functional units (though
most processors don’t provide enough to avoid all possible conflicts). Misses,
resource hazards, and data hazards can all be addressed by out-of-order execution,
which allows a processor (at the cost of significant design complexity, chip area,
and power consumption) to consider a lengthy “window” of instructions, and
make progress on any of them for which operands and hardware resources are
available.
Of course, even out-of-order execution works only if the processor is able to
fetch instructions, and thus it is control hazards that have the largest potential
negative impact on performance. Branches constitute something like 10% of all
instructions in typical programs,4 so even a one-cycle stall on every branch could
3 The degree of associativity of a cache is the number of distinct lines in the cache in which the
contents of a given memory location might be found. In a one-way associative (direct-mapped)
cache, each memory location maps to only one possible line in the cache. If the program uses two
locations that map to the same line, the contents of these two locations will keep evicting each
other, and many misses will result. More highly associative caches are slower but suffer fewer
such conflicts.
4 This is a very rough number. For the SPEC2000 benchmarks, Hennessy and Patterson report
percentages varying from 1 to 25 [HP03, pp. 138–139].
be expected to slow down execution by 9% on average. On a deeply pipelined
machine one might naively expect to stall for more like five or even ten cycles
while waiting for a new program counter to be computed. To avoid such intolerable delays, most workstation-class processors incorporate hardware to predict
the outcome of each branch, based on past behavior, and to execute speculatively
down the predicted path. Assuming that it takes care to avoid any irreversible
operations, the processor will suffer stalls only in the case of an incorrect prediction.
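As an informal illustration (our own, not an example from the text), the C loop below contains two conditional branches with very different behavior. The loop-closing branch is taken on every iteration but the last, so a history-based predictor does extremely well on it. The branch that tests a[i] > 0 depends entirely on the data; if positive and negative elements are intermixed unpredictably, a substantial fraction of its predictions will be wrong, and each misprediction squashes speculatively executed instructions.

    /* Count the positive elements of an integer vector. */
    int count_positive(const int *a, int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {   /* loop-back branch: highly predictable  */
            if (a[i] > 0)               /* data-dependent branch: may mispredict */
                count++;
        }
        return count;
    }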
On the software side, the compiler has a major role to play in keeping the
pipeline full. For any given source program, there is an unbounded number
of possible translations into machine code. In general we should prefer shorter
translations over longer ones, but we must also consider the extent to which various translations will utilize the pipeline. On an in-order processor (one that
always executes instructions in the order they appear in the machine language
program), a stall will inevitably occur whenever a load is followed immediately by an instruction that needs the loaded value, because even a first-level
cache requires at least one extra cycle to respond. A stall may also occur when
the result of a slow-to-complete floating-point operation is needed too soon
by another instruction, when two concurrently executing instructions need the
same functional unit in the same cycle, or, on a superscalar processor, when
an instruction that uses a value is executed concurrently with the instruction
that produces it. In all these cases performance may improve significantly if the
compiler chooses a translation in which instructions appear in a different order.
The general technique of reordering instructions at compile time so as to
maximize processor performance is known as instruction scheduling. On an in-order processor the goal is to identify a valid order that will minimize pipeline
stalls at run time. To achieve this goal the compiler requires a detailed model
of the pipeline. On an out-of-order processor the goal is simply to maximize
instruction-level parallelism (ILP): the degree to which unrelated instructions lie
near one another in the instruction stream (and thus are likely to fall within the
processor’s instruction window). A compiler for such an out-of-order machine
may be able to make do with a less detailed processor model. At the same time, it
may need to ensure a higher degree of ILP, since out-of-order execution tends to
be found on machines with several pipelines.
Instruction scheduling can have a major impact on resource and data hazards. On machines with so-called delayed branches it can also help with control
hazards. We will consider the topic of instruction scheduling in some detail in
Section 15.6. In the remainder of the current subsection we focus on the two
cases—loads and branches—where issues of instruction scheduling may actually
be embedded in the processor’s instruction set. Software techniques to reduce
the incidence of cache misses typically require large-scale restructuring of control flow or data layout. Though the better commercial compilers may reorganize
loops for better cache locality in scientific programs (a topic we will consider in
Section 15.7.2), most simply assume that every memory access will hit in the
primary cache. The assumption is generally a good one: most programs on most
machines achieve a cache hit rate of well over 90% (often over 99%). The important goal is to make sure that the pipeline can continue to operate during the
time that it takes the cache to respond.
Loads
EXAMPLE 5.10: Filling a load delay slot
Consider a load instruction that hits in the primary cache. The number of cycles
that must elapse before a subsequent instruction can use the result is known as
the load delay. Most current machines have a one-cycle load delay. If the instruction immediately after a load attempts to use the loaded value, a one-cycle load
penalty (a pipeline stall) will occur. Longer pipelines can have load delays of two
or even three cycles.
To avoid load penalties (in the absence of out-of-order execution), the compiler may schedule one or more unrelated instructions into the delay slot(s) between a load and a subsequent use. In the following code, for example, a simple
in-order pipeline will incur a one-cycle penalty between the second and third
instructions.
    r2 := r1 + r2
    r3 := A                –– load
    r3 := r3 + r2

If we swap the first two instructions, the penalty goes away:

    r3 := A                –– load
    r2 := r1 + r2
    r3 := r3 + r2
The second instruction gives the first instruction time enough to retrieve A before it is needed in the third instruction.
To maintain program correctness, an instruction-scheduling algorithm must
respect all dependences among instructions. These dependences come in three
varieties:
Flow dependence (also called true or read-after-write dependence): a later instruction uses a value produced by an earlier instruction.
Antidependence (also called write-after-read dependence): a later instruction
overwrites a value read by an earlier instruction.
Output dependence (also called write-after-write dependence): a later instruction overwrites a value written by a previous instruction.
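To make the three varieties concrete, the following C fragment (our own illustration, with ordinary variables standing in for registers; the function name is hypothetical) exhibits all of them:

    void dependence_demo(int a, int b) {
        int x, y;
        x = a + b;     /* (1) writes x                                          */
        y = x * 2;     /* (2) flow dependence on (1): reads the x it wrote      */
        x = a - b;     /* (3) antidependence on (2), which read the old x, and  */
                       /*     output dependence on (1), which also wrote x      */
        (void)x; (void)y;   /* suppress unused-value warnings */
    }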
EXAMPLE 5.11: Renaming registers for scheduling
A compiler can often eliminate anti- and output dependences by renaming
registers. In the following, for example, antidependences prevent us from moving either the instruction before the load or the one after the add into the delay
slot of the load.
    r3 := r1 + 3       ×   –– immovable ↓
    r1 := A                –– load
    r2 := r1 + r2
    r1 := 3            ×   –– immovable ↑
If we use a different register as the target of the load, however, then either instruction can be moved:
    r3 := r1 + 3           –– movable ↓
    r5 := A                –– load
    r2 := r5 + r2
    r1 := 3                –– movable ↑
The need to rename registers in order to move instructions can increase the number of registers needed by a given stretch of code. To maximize opportunities
for concurrent execution, out-of-order processor implementations may perform
register renaming dynamically in hardware, as noted in Section 5.4.3. These implementations possess more physical registers than are visible in the instruction
set. As instructions are considered for execution, any that use the same architectural register for independent purposes are given separate physical copies on
which to do their work. If a processor does not perform hardware register renaming, then the compiler must balance the desire to eliminate pipeline stalls
against the desire to minimize the demand for registers (so that they can be
used to hold loop indices, local variables, and other comparatively long-lived
values).
In order to enforce the flow dependence between a load of a register and its
subsequent use, a processor must include so-called interlock hardware. To minimize chip area, several of the very early RISC processors provided this hardware
only in the case of cache misses. The result was an architecturally visible delayed
load instruction, in which the value of the loaded register was undefined in the
immediately subsequent instruction slot. Filling the delay slot of a delayed load
with an unrelated instruction was a matter of correctness, not just of performance. If a compiler was unable to find a suitable “real” instruction, it had to fill
the delay slot with a no-op ( nop )—an instruction that has no effect. More recent
RISC machines have abandoned delayed loads; their implementations are fully
interlocked. Within processor families old binaries continue to work correctly;
the nop instructions are simply redundant.
Branches
Successful pipelining depends on knowing the address of the next instruction
before the current instruction has completed or has even been fully decoded.
With fixed-size instructions a processor can infer this address for straight-line
code but not for the code that follows a branch.5 In an attempt to minimize the
5 In this context, branches include not only the control flow for conditionals and loops, but also
subroutine calls and returns.
EXAMPLE 5.12: Filling a branch delay slot
impact of branch delays, several early RISC machines defined delayed branch
instructions similar to the delayed loads just described. In these machines the
instruction immediately after the branch is executed regardless of the outcome
of the branch. If the branch is not taken, all occurs as one would normally expect. If the branch is taken, however, the order of instructions is the branch itself,
the instruction after the branch, and then the instruction at the target of the
branch.
Because control may go either of two directions at a branch, finding an instruction to fill a delayed branch slot is slightly trickier than finding one to fill
a delayed load slot. The few instructions immediately before the branch are the
most obvious candidates to move, provided that they do not contribute to the
calculation that controls the branch, and that we don’t have to move them past
the target of some other branch:
    B := r2                –– movable ↓
    r1 := r2 × r3      ×   –– immovable ↓
    if r1 > 0 goto L1
    nop
(This code sequence assumes that branches are delayed. Unless otherwise
noted, we will assume throughout the remainder of the book that they are
not.)
To address the problem of unfillable branch delay slots, some more recent
RISC machines provide nullifying conditional branch instructions. A nullifying
branch includes a bit that indicates the direction that the compiler “expects” the
branch to go. The hardware executes the instruction in the delay slot only if the
branch goes the expected direction. While the branch instruction is making its
way down the pipeline, the hardware begins to execute the next instruction. Ideally, by the time it must begin the instruction after that, it will know the outcome
of the branch. If the outcome matches the prediction, then the pipeline will proceed without stalling. If the outcome does not match the prediction, then the
(not yet completed) instruction in the delay slot will be abandoned, along with
any instructions fetched from the target of the branch.
Unfortunately, as architects have moved to more aggressive, deeply pipelined
processor implementations, multicycle branch delays have become the norm, and
architecturally visible delay slots no longer suffice to hide them. A few processors
have been designed with an architecturally visible branch delay of more than one
cycle, but this is not generally considered a viable strategy: it is simply too difficult for the compiler to find enough instructions to schedule into the slots.
Several processors retain one-slot delayed branches (sometimes with optional
nullification) for the sake of backward compatibility and as a means of reducing, but not eliminating, the number of pipeline stalls (the penalty) associated
with a branch. With or without delayed branches, many processors also employ
elaborate hardware mechanisms to predict the outcome and targets of branches
early, so that the pipeline can continue anyway. When a prediction turns out to
be incorrect, of course, the hardware must ensure that none of the incorrectly
fetched instructions have visible effects. Even when hardware is able to predict
the outcome of branches, it can be useful for the compiler to do so also, in order
to schedule instructions to minimize load delays in the most likely cross-branch
code paths.
5.5.2 Register Allocation

EXAMPLE 5.13: Register allocation for a simple loop
The load/store architecture of RISC machines explicitly acknowledges that moving data between registers and memory is expensive. A store instruction costs a
minimum of one cycle—more if several stores are executed in succession and the
memory system can’t keep up. A load instruction costs a minimum of one or two
cycles (depending on whether the delay slot can be filled) and can cost scores or
even hundreds of cycles in the event of a cache miss. These same costs are present
on CISC machines as well, even if they don’t stand out as prominently in a casual perusal of assembly code. In order to minimize the use of loads and stores,
a good compiler must keep things in registers whenever possible. We saw an example in Chapter 1: the most striking difference between the “optimized” code
of Example 1.2 (page 3) and the naive code of Figure 1.5 (page 29) is the absence
in the former of most of the loads and stores. As improvements in processor
speed continue to outstrip improvements in memory speed, the cost in cycles
of a cache miss continues to increase, making good register usage increasingly
important.
Register allocation is typically a two-stage process. In the first stage the compiler identifies the portions of the abstract syntax tree that represent basic blocks:
straight-line sequences of code with no branches in or out. Within each basic
block it assigns a “virtual register” to each loaded or computed value. In effect,
this assignment amounts to generating code under the assumption that the target machine has an unbounded number of registers. In the second stage, the
compiler maps the virtual registers of an entire subroutine onto the architectural (hardware) registers of the machine, using the same architectural register
when possible to hold different virtual registers at different times, and spilling
virtual registers to memory when there aren’t enough architectural registers to
go around.
We will examine this two-stage process in more detail in Section 15.8. For
now, we illustrate the ideas with a simple example. Suppose we are compiling a
function that computes the variance σ² of the contents of an n-element vector.
Mathematically,

    σ² = (1/n) Σᵢ (xᵢ − x̄)²  =  (1/n) Σᵢ xᵢ²  −  x̄²

where x₀ . . . xₙ₋₁ are the elements of the vector, and x̄ = (1/n) Σᵢ xᵢ is their average.
     1.     v1 := &A             –– pointer to A[1]
     2.     v2 := n              –– count of elements yet to go
     3.     w1 := 0.0            –– sum
     4.     w2 := 0.0            –– squares
     5.     goto L2
     6. L1: w3 := *v1            –– A[i] (floating-point)
     7.     w1 := w1 + w3        –– accumulate sum
     8.     w4 := w3 × w3
     9.     w2 := w2 + w4        –– accumulate squares
    10.     v1 := v1 + 8         –– 8 bytes per double-word
    11.     v2 := v2 − 1         –– decrement count
    12. L2: if v2 > 0 goto L1
    13.     w5 := w1 / n         –– average
    14.     w6 := w2 / n         –– average of squares
    15.     w7 := w5 × w5        –– square of average
    16.     w8 := w6 − w7        –– return value in w8
    17.     ...

Figure 5.4 RISC assembly code for a vector variance computation.

In pseudocode,
    double sum := 0
    double squares := 0
    for int i in 0 . . n−1
        sum +:= A[i]
        squares +:= A[i] × A[i]
    double average := sum / n
    return (squares / n) − (average × average)
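For reference, a C rendering of the same function might look as follows. This is our own sketch; the vector is assumed to be passed as a pointer plus a length.

    double variance(const double *A, int n) {
        double sum = 0.0;
        double squares = 0.0;
        for (int i = 0; i < n; i++) {
            sum += A[i];
            squares += A[i] * A[i];
        }
        double average = sum / n;
        return squares / n - average * average;
    }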
After some simple code improvements and the assignment of virtual registers,
the assembly language for this function on a RISC machine is likely to look something like Figure 5.4. This code uses two integer virtual registers ( v1 and v2 ) and
eight floating-point virtual registers ( w1 – w8 ). For each of these we can compute
the range over which the value in the register is useful, or live. This range extends
from the point at which the value is defined to the last point at which the value is
used. For register w4 , for example, the range is only one instruction long, from
the assignment at line 8 to the use at line 9. For register v1 , the range is the union
of two subranges, one that extends from the assignment at line 1 to the use (and
redefinition) at line 10 and another that extends from this redefinition around
the loop to the same spot again.
Once we have calculated live ranges for all virtual registers, we can create a
mapping onto the architectural registers of the machine. We can use a single architectural register for two virtual registers only if their live ranges do not overlap.
If the number of architectural registers required is larger than the number available on the machine (after reserving a few for such special values as the stack
pointer), then at various points in the code we shall have to write (spill) some of
the virtual registers to memory in order to make room for the others.
     1.     r1 := &A
     2.     r2 := n
     3.     f1 := 0.0
     4.     f2 := 0.0
     5.     goto L2
     6. L1: f3 := *r1            –– no delay
     7.     f1 := f1 + f3        –– 1-cycle wait for f3
     8.     f3 := f3 × f3        –– no delay
     9.     f2 := f2 + f3        –– 4-cycle wait for f3
    10.     r1 := r1 + 8         –– no delay
    11.     r2 := r2 − 1         –– no delay
    12. L2: if r2 > 0 goto L1    –– no delay
    13.     f1 := f1 / n
    14.     f2 := f2 / n
    15.     f1 := f1 × f1
    16.     f1 := f2 − f1        –– return value in f1
    17.     ...

Figure 5.5 The vector variance example with physical registers assigned. Also shown in the body of the loop are the number of stalled cycles that can be expected on a simple in-order pipelined machine, assuming a one-cycle penalty for loads, a two-cycle penalty for floating-point adds, and a four-cycle penalty for floating-point multiplies.
In our example program, the live ranges for the two integer registers overlap, so they will have to be assigned to separate physical registers. Among the
floating-point registers, w1 overlaps with w2 – w4 , w2 overlaps with w3 – w5 ,
w5 overlaps with w6 , and w6 overlaps with w7 . There are several possible
mappings onto three physical floating-point registers, one of which is shown in
Figure 5.5.
Interaction with Instruction Scheduling
EXAMPLE 5.14: Register allocation and instruction scheduling
From the point of view of execution speed, the code in Figure 5.5 has at least
two problems. First, of the seven instructions in the loop, nearly half are devoted
to bookkeeping: updating the pointer, decrementing the loop count, and testing
the terminating condition. Second, when run on a pipelined machine, the code
is likely to experience a very high number of stalls. Exercise 5.15 explores a first
step toward addressing the bookkeeping overhead. We consider the stalls below,
and will return to both problems in more detail in Chapter 15.
We noted in Section 5.5.1 that floating-point instructions commonly employ a
separate, longer pipeline. Because they take more cycles to complete, there can be
a significant delay before their results are available for use in other instructions.
Suppose that floating-point add and multiply instructions must be followed by
two and four cycles, respectively, of unrelated computation (these are modest
figures; real machines often have longer delays). Also suppose that the result of
a load is not available for the usual one-cycle delay. In the context of our vector
variance example, these delays imply a total of five stalled cycles in every iteration
     1.     r1 := &A
     2.     r2 := n
     3.     f1 := 0.0
     4.     f2 := 0.0
     5.     goto L2
     6. L1: f3 := *r1
     7.     r1 := r1 + 8         –– no delay
     8.     f4 := f3 × f3        –– no delay
     9.     f1 := f1 + f3        –– no delay
    10.     r2 := r2 − 1         –– no delay
    11.     f2 := f2 + f4        –– 1-cycle wait for f4
    12. L2: if r2 > 0 goto L1    –– no delay
    13.     f1 := f1 / n
    14.     f2 := f2 / n
    15.     f1 := f1 × f1
    16.     f1 := f2 − f1        –– return value in f1
    17.     ...

Figure 5.6 The vector variance example after instruction scheduling. All but one cycle of delay has been eliminated. Because we have hoisted the multiply above the first floating-point add, however, we need an extra physical floating-point register.
of the loop, even if the hardware successfully predicts the outcome and target
of the branch at the bottom. Added to the seven instructions themselves, this
implies a total of 12 cycles per loop iteration (i.e., per vector element).
By rescheduling the instructions in the loop (Figure 5.6) we can eliminate all
but one cycle of stall. This brings the total number of cycles per iteration down
to only eight, a reduction of 33%. The savings comes at a cost, however: we now
execute the multiply instruction before the first floating-point add, and we must
use an extra physical register to hold onto the add’s second argument. This effect
is not unusual: instruction scheduling has a tendency to overlap the live ranges
of virtual registers whose ranges were previously disjoint, leading to an increase
in the number of architectural registers required.
The Impact of Subroutine Calls
The register allocation scheme outlined above depends implicitly on the compiler
being able to see all of the code that will be executed over a given span of time
(e.g., an invocation of a subroutine). But what if that code includes calls to other
subroutines? If a subroutine were called from only one place in the program,
we could allocate registers (and schedule instructions) across both the caller and
the callee, effectively treating them as a single unit. Most of the time, however,
a subroutine is called from many different places in a program, and the code
improvements that we should like to make in the context of one caller will be
different from the ones that we should like to make in the context of a different
caller. For small, simple subroutines, the compiler may actually choose to expand
a copy of the code at each call site, despite the resulting increase in code size.
This inlining of subroutines can be an important form of code improvement,
particularly for object-oriented languages, which tend to have very large numbers
of very small subroutines.
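A minimal sketch of the idea, with hypothetical names: given a tiny routine such as square below, the compiler may simply substitute its body at each call site.

    static double square(double x) { return x * x; }

    double sum_of_squares(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += square(a[i]);    /* after inlining: s += a[i] * a[i]; */
        return s;
    }

Once the call disappears, the compiler is free to schedule instructions and allocate registers across what used to be the call boundary.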
When inlining is not an option, most compilers treat each subroutine as an
independent unit. When a body of code for which we are attempting to perform
register allocation makes a call to a subroutine, there are several issues to consider:
Parameters must generally be passed. Ideally, we should like to pass them in
registers.
Any registers that the callee will use internally but that contain useful values in
the caller must be spilled to memory and then reread when the callee returns.
Any variables that the callee might load from memory but that have been kept
in a register in the caller must be written back to memory before the call, so
that the callee will see the current value.
Any variables to which the callee might store a value in memory but that have
been kept in a register in the caller must be reread from memory when the
callee returns, so that the caller will see the current value.
If the caller does not know exactly what the callee might do (this is often
the case—the callee might not have been compiled yet), then the compiler must
make conservative assumptions. In particular, it must assume that the callee reads
and writes every variable visible in its scope. The caller must write any such variable back to memory prior to the call if its current value is (only) in a register.
If it needs the value of such a variable after the call, it must reread it from memory.
With perfect knowledge of both the caller and the callee, the compiler could
arrange across subroutine calls to save and restore precisely those registers that
are both in use in the caller and needed (for internal purposes) in the callee.
Without this knowledge, we can choose either for the caller to save and restore
the registers it is using, before and after the call, or for the callee to save and
restore the registers it needs internally, at the top and bottom of the subroutine.
In practice it is conventional to choose the latter alternative for at least some static
DESIGN & IMPLEMENTATION
In-line subroutines
Subroutine inlining presents, to a large extent, a classic time-space tradeoff. Inlining one instance of a subroutine replaces a relatively short calling sequence
with a subroutine body that is typically significantly longer. In return, it avoids
the execution overhead of the calling sequence, enables the compiler to perform code improvement across the call without performing interprocedural
analysis, and typically improves locality, especially in the L1 instruction cache.
subset of the register set, for two reasons. First, while a subroutine may be called
from many locations, there is only one copy of the subroutine itself. Saving and
restoring registers in the callee, rather than the caller, can save substantially on
code size. Second, because many subroutines (particularly those that are called
most frequently) are very small and simple, the set of registers used in the callee
tends, on average, to be smaller than the set in use in the caller. We will look at
subroutine calling sequences in more detail in Chapter 8.
CHECK YOUR UNDERSTANDING
20. What is a delayed load instruction?
21. What is a nullifying branch instruction?
22. List the four principal causes of pipeline stalls.
23. What is a pipeline interlock?
24. What is instruction scheduling? Why is it important on modern machines?
25. What is branch prediction? Why is it important?
26. Describe the interaction between instruction scheduling and register allocation.
27. What is the live range of a register?
28. What is subroutine inlining? What benefits does it provide? When is it possible? What is its cost?
29. Summarize the impact of subroutine calls on register allocation.
5.6 Summary and Concluding Remarks
Computer architecture has a major impact on the sort of code that a compiler
must generate and the sorts of code improvements it must effect in order to obtain acceptable performance. Since the early 1980s, the trend in processor design
has been to equip the compiler with more and more knowledge of the low-level
details of processor implementation, so that the generated code can use the implementation to its fullest. This trend has blurred the traditional dividing line between processor architecture and implementation: while a compiler can generate
correct code based on an understanding of the architecture alone, it cannot generate fast code unless it understands the implementation as well. In effect, timing
issues that were once hidden in the microcode of microprogrammed processors
(and that made microprogramming an extremely difficult and arcane craft) have
been exported into the compiler.
In the first several sections of this chapter we surveyed the organization of
memory and the representation of data (including integer and floating-point
arithmetic), the variety of typical assembly language instructions, and the evolution of modern RISC machines. As examples we compared the x86 and the
MIPS. We also introduced a simple notation to be used for assembly language
examples in later chapters. In the final section we discussed why compiling for
modern machines is hard. The principal tasks include instruction scheduling, for
load and branch delays and for multiple functional units, and register allocation,
to minimize memory traffic. We noted that there is often a tension between these
tasks, and that both are made more difficult by frequent subroutine calls.
As of 2005 there are four principal commercial RISC architectures: ARM
(Intel, Texas Instruments, Motorola, and dozens of others), MIPS (SGI, NEC),
Power/PowerPC (IBM, Motorola, Apple), and Sparc (Sun, Texas Instruments,
Fujitsu). ARM is the property of ARM Holdings, PLC, an intellectual property
firm that relies on licensees for actual fabrication. Though ARM processors are
not generally employed in desktop or laptop computers, they power roughly
three-quarters of the world’s embedded systems, in everything from cell phones
and PDAs to remote controls and the dozens of devices in a modern automobile. MIPS processors, likewise, are now principally employed in the embedded
market, though they were once common in desktop and high-end machines.
Despite the handicap of a CISC instruction set and the need for backward
compatibility, the x86 overwhelmingly dominates the desktop and laptop market, largely due to the marketing prowess of IBM, Intel, and Microsoft, and to the
success of Intel and AMD in decoupling the architecture from the implementation. Modern implementations of the x86 incorporate a hardware front-end that
translates x86 code, on the fly, into a RISC-like internal format amenable to heavily pipelined execution. Recent processors from Intel and AMD are competitive
with the fastest RISC alternatives.
With growing demand for a 64-bit address space, however, a major battle
ensued in the x86 world. Intel’s IA-64/Itanium processors provide an x86 compatibility mode, but it is implemented in a separate portion of the processor—
essentially a Pentium subprocessor embedded in the corner of the chip. Application writers who wanted speed and address space enhancements were expected
to migrate to the (very different) IA-64 instruction set. AMD, by contrast, developed a backward-compatible 64-bit extension to the x86 instruction set; its
Opteron processors provide a much smoother upward migration path. In response to market demand, Intel has licensed the Opteron architecture (which it
calls EM64T) for use in its 64-bit Pentium processors.
As processor and compiler technology continue to evolve, it is likely that
processor implementations will continue to become more complex, and that
compilers will take on additional tasks in order to harness that complexity. What
is not clear at this point is the form that processor complexity will take. While
traditional CISC machines remain popular almost entirely due to the need for
backward compatibility, both the CISC and RISC “design philosophies” are still
very much alive [SW94]. The “CISC-ish” philosophy says that newly available resources (e.g., increases in chip area) should be used to implement functions that
must currently be performed in software, such as vector or graphics operations,
decimal arithmetic, or new addressing modes; the “RISC-ish” philosophy says
that resources should be used to improve the speed of existing functions—for
example, by increasing cache size, employing faster but larger functional units,
or deepening the pipeline and decreasing the cycle time.
Where the first-generation RISC machines from different vendors differed
from one another only in minor details, the more recent generations are beginning to diverge, with the ARM and MIPS taking the more RISC-ish approach,
the Power/PowerPC family taking the more CISC-ish approach, and the Sparc
somewhere in the middle. It is not yet clear which approach will ultimately prove
most effective, nor is it even clear that this is the interesting question anymore.
Communication latency and heat dissipation are increasingly the limiting factors on both clock speed and the exploitation of instruction-level parallelism.
To address these concerns, vendors are increasingly turning to chip-level multiprocessors and other novel architectures, which will almost certainly require new
compiler techniques. At perhaps no time in the past 20 years has the future of
microarchitecture been in so much flux. However it all turns out, it is clear that
processor and compiler technology will continue to evolve together.
5.7 Exercises
5.1 Modern compilers often find they don’t have enough registers to hold all the
things they’d like to hold. At the same time, VLSI technology has reached the
point at which there is room on a chip to hold many more registers than are
found in the typical ISA. Why are we still using instruction sets with only 32
integer registers? Why don’t we make, say, 64 or 128 of them visible to the
programmer?
5.2 Some early RISC machines (e.g., the SPARC) provided a “multiply step” instruction that performed one iteration of the standard shift-and-add algorithm for binary integer multiplication. Speculate as to the rationale for this
instruction.
5.3 Consider sending a message containing a string of integers over the Internet. What problems may occur if the sending and receiving machines have
different “endian-ness”? How might you solve these problems?
5.4 Why do you think RISC machines standardized on 32-bit instructions? Why
not some smaller or larger length? Why not variable lengths?
5.5 Consider a machine with three condition codes, N, Z, and O. N indicates
whether the most recent arithmetic operation produced a negative result. Z
indicates whether it produced a zero result. O indicates whether it produced
a result that cannot be represented in the available precision for the numbers being manipulated (i.e., outside the range 0 . . 2^n for unsigned arithmetic, −2^(n−1) . . 2^(n−1) − 1 for signed arithmetic). Suppose we wish to branch
on condition A op B , where A and B are unsigned binary numbers, for
op ∈ {<, ≤, =, ≠, >, ≥}. Suppose we subtract B from A, using two's complement arithmetic. For each of the six conditions, indicate the logical combination of condition-code bits that should be used to trigger the branch.
Repeat the exercise on the assumption that A and B are signed, two’s complement numbers.
5.6 We implied in Section 5.4.1 that if one adds a new instruction to a nonpipelined, microcoded machine, the time required to execute that instruction is (to first approximation) independent of the time required to execute
all other instructions. Why is it not strictly independent? What factors could
cause overall execution to become slower when a new instruction is introduced?
5.7 Suppose that loads constitute 25% of the typical instruction mix on a certain machine. Suppose further that 15% of these loads miss in the on-chip
(primary) cache, with a penalty of 40 cycles to reach main memory. What
is the contribution of cache misses to the average number of cycles per instruction? You may assume that instruction fetches always hit in the cache.
Now suppose that we add an off-chip (secondary) cache that can satisfy 90%
of the misses from the primary cache, at a penalty of only 10 cycles. What is
the effect on cycles per instruction?
5.8 Many recent processors provide a conditional move instruction that copies
one register into another if and only if the value in a third register is (or is
not) equal to zero. Give an example in which the use of conditional moves
leads to a shorter program.
5.9 The 64-bit AMD Opteron architecture is backward compatible with the x86
instruction set, just as the x86 is backward compatible with the 16-bit 8086
instruction set. Less transparently, the IA-64 Itanium is capable of running
legacy x86 applications in “compatibility mode.” But recent members of the
ARM and MIPS processor families support new 16-bit instructions as an
extension to the architecture. Why might designers have chosen to introduce
these new, less powerful modes of execution?
5.10 Consider the following code fragment in pseudo-assembler notation.
     1.  r1 := K
     2.  r4 := &A
     3.  r6 := &B
     4.  r2 := r1 × 4
     5.  r3 := r4 + r2
     6.  r3 := *r3            –– load (register indirect)
     7.  r5 := *(r3 + 12)     –– load (displacement)
     8.  r3 := r6 + r2
     9.  r3 := *r3            –– load (register indirect)
    10.  r7 := *(r3 + 12)     –– load (displacement)
    11.  r3 := r5 + r7
    12.  S := r3              –– store
(a) Give a plausible explanation for this code (what might the corresponding source code be doing?).
(b) Identify all flow, anti-, and output dependences.
(c) Schedule the code to minimize load delays on a single-pipeline, in-order
processor.
(d) Can you do better if you rename registers?
5.11 With the development of deeper, more complex pipelines, delayed loads and
branches have become significantly less appealing as features of a RISC instruction set. Why is it that designers have been able to eliminate delayed
loads in more recent machines, but have had to retain delayed branches?
5.12 Some processors, including the PowerPC and recent members of the
x86 family, require one or more cycles to elapse between a conditiondetermining instruction and a branch instruction that uses that condition.
What options does a scheduler have for filling such delays?
5.13 Branch prediction can be performed statically (in the compiler) or dynamically (in hardware). In the static approach, the compiler guesses which way
the branch will usually go, encodes this guess in the instruction, and schedules instructions for the expected path. In the dynamic approach, the hardware keeps track of the outcome of recent branches, notices branches or
patterns of branches that recur, and predicts that the patterns will continue
in the future. Discuss the tradeoffs between these two approaches. What are
their comparative advantages and disadvantages?
5.14 Consider a machine with a three-cycle penalty for incorrectly predicted
branches and a zero-cycle penalty for correctly predicted branches. Suppose that in a typical program 20% of the instructions are conditional
branches, which the compiler or hardware manages to predict correctly 75%
of the time. What is the impact of incorrect predictions on the average
number of cycles per instruction? Suppose the accuracy of branch prediction can be increased to 90%. What is the impact on cycles per instruction?
Suppose that the number of cycles per instruction would be 1.5 with
perfect branch prediction. What is the percentage slowdown caused by mispredicted branches? Now suppose that we have a superscalar processor on
which the number of cycles per instruction would be 0.6 with perfect branch
prediction. Now what is the percentage slowdown caused by mispredicted
branches? What do your answers tell you about the importance of branch
prediction on superscalar machines?
5.15 Consider the code in Figure 5.6. In an attempt to eliminate the remaining
delay and reduce the overhead of the bookkeeping (loop control) instructions, one might consider unrolling the loop—that is, creating a new loop in
which each iteration performs the work of k iterations of the original loop.
Show the code for k = 2. You may assume that n is even and that your target
machine supports displacement addressing. Schedule instructions as tightly
as you can. How many cycles does your loop consume per vector element?
5.16–5.23 In More Depth.
5.8 Explorations
5.24 Skip ahead to the sidebar on decimal types on page 314. Write algorithms
to convert BCD numbers to binary, and vice versa. Try writing the routines
in assembly language for your favorite machine (if your machine has special
instructions for this purpose, pretend you’re not allowed to use them). How
many cycles are required for the conversion?
5.25 Is microprogramming an idea that has outlived its usefulness, or are there
application domains for which it still makes sense to build a microprogrammed machine? Defend your answer.
5.26 If you have access to both CISC and RISC machines, compile a few programs for both machines and compare the size of the target code. Can you
generalize about the “space penalty” of RISC code?
5.27 Several computers have provided more general versions of the conditional
move instructions described in Exercise 5.8. Examples include the c. 1965
IBM ACS, the Cray 1, the HP PA-RISC, the ARM, and the Intel IA-64 (Itanium). General purpose conditional execution is sometimes known as predication.
Learn how predication works in ARM or IA-64. Explain how it can
sometimes improve performance even when it causes the processor to execute more instructions.
5.28 If you have access to computers of more than one type, compile a few programs on each machine and time their execution. (If possible, use the same
compiler [e.g., gcc ] and options on each machine.) Discuss the factors that
may contribute to different run times. How closely do the ratios of run
times mirror the ratios of clock rates? Why don’t they mirror them exactly?
5.29 Branch prediction can be characterized as control speculation: it makes a
guess about the future control flow of the program that saves enough time
when it’s right to outweigh the cost of cleanup when it’s wrong. Some researchers have proposed the complementary notion of value speculation, in
which the processor would predict the value to be returned by a cache miss,
and proceed on the basis of that guess. What do you think of this idea? How
might you evaluate its potential?
5.30 Can speculation be useful in software? How might you (or a compiler or
other tool) be able to improve performance by making guesses that are
subject to future verification, with (software) rollback when wrong? (Hint:
Think about operations that require communication over slow Internet
links.)
5.31 Translate the high-level pseudocode for vector variance (Example 5.13) into
your favorite programming language, and run it through your favorite compiler. Examine the resulting assembly language. Experiment with different
levels of optimization (code improvement). Discuss the quality of the code
produced.
5.32 Try to write a code fragment in your favorite programming language that
requires so many registers that your favorite compiler is forced to spill some
registers to memory (compile with a high level of optimization). How complex does your code have to be?
5.33 If you have access to a compiler that generates code for a machine with architecturally visible load delays, run some programs through it and evaluate
the degree of success it has in filling delay slots (an unfilled slot will contain
a nop instruction). What percentage of slots is filled? Suppose the machine
had interlocked loads. How much space could be saved in typical executable
programs if the nop s were eliminated?
5.34 Experiment with small subroutines in C++ to see how much time can be
saved by expanding them inline.
5.35–5.37 In More Depth.
5.9 Bibliographic Notes
The standard reference in computer architecture is the graduate-level text by Hennessy and Patterson [HP03]. More introductory material can be found in the
undergraduate computer organization text by the same authors [PH05]. Students
without previous assembly language experience may be particularly interested in
the text of Bryant and O’Hallaron [BO03], which surveys computer organization
from the point of view of the systems programmer, focusing in particular on the
correspondence between source-level programs in C and their equivalents in x86
assembler.
The “RISC revolution” of the early 1980s was spearheaded by three separate
research groups. The first to start (though last to publish [Rad82]) was the 801
group at IBM’s T. J. Watson Research Center, led by John Cocke. IBM’s Power
and PowerPC architectures, though not direct descendants of the 801, take significant inspiration from it. The second group (and the one that coined the term
“RISC”) was led by David Patterson [PD80, Pat85] at UC Berkeley. The commercial Sparc architecture is a direct descendant of the Berkeley RISC II design. The
third group was led by John Hennessy at Stanford [HJBG81]. The commercial
MIPS architecture is a direct descendant of the Stanford design.
Much of the history of pre-1980 processor design can be found in the text
by Siewiorek, Bell, and Newell [SBN82]. This classic work contains verbatim
reprints of many important original papers. In the context of RISC processor
design, Smith and Weiss [SW94] contrast the more “RISCy” and “CISCy” design
philosophies in their comparison of implementations of the PowerPC and Alpha
architectures. Appendix C of Hennessy and Patterson’s architecture text (available online at www.mkp.com/CA3/) summarizes the similarities and differences
among nine different RISC instruction sets. Appendix D describes the x86. Current manuals for all the popular commercial processors are available from their
manufacturers.
An excellent treatment of computer arithmetic can be found in Goldberg’s
appendix to the Hennessy and Patterson architecture text [Gol03] (available online at www.mkp.com/CA3/). The IEEE 754 floating-point standard was printed
in ACM SIGPLAN Notices in 1985 [IEE87]. The texts of Muchnick [Muc97] and
of Cooper and Torczon [CT04] are excellent sources of information on instruction scheduling, register allocation, subroutine optimization, and other aspects
of compiling for modern machines.
II Core Issues in Language Design
Having laid the foundation in Part I, we now turn to issues that lie at the core of most programming
languages: control flow, data types, and abstractions of both control and data.
Chapter 6 considers control flow, including expression evaluation, sequencing, selection, iteration, and recursion. In many cases we will see design decisions that reflect the sometimes complementary but often competing goals of conceptual clarity and efficient implementation. Several
issues, including the distinction between references and values and between applicative (eager) and
lazy evaluation, will recur in later chapters.
Chapter 7, the longest in the book, considers the subject of types. It begins with type systems
and type checking, including the notions of equivalence, compatibility, and inference of types. It
then presents a survey of high-level type constructors, including records and variants, arrays, strings,
sets, pointers, lists, and files. The section on pointers includes an introduction to garbage collection
techniques.
Both control and data are amenable to abstraction, the process whereby complexity is hidden behind a simple and well-defined interface. Control abstraction is the subject of Chapter 8. Subroutines
are the most common control abstraction, but we also consider exceptions and coroutines, and return briefly to the subjects of continuations and iterators, introduced in Chapter 6. The coverage of
subroutines includes calling sequences, parameter passing mechanisms, and generics, which support
parameterization over types.
Chapter 9 returns to the subject of data abstraction, introduced in Chapter 3. In many modern languages this subject takes the form of object orientation, characterized by an encapsulation
mechanism, inheritance, and dynamic method dispatch (subtype polymorphism). Our coverage of
object-oriented languages will also touch on constructors, access control, polymorphism, closures,
and multiple and mix-in inheritance.
6
Control Flow
Having considered the mechanisms that a compiler uses to enforce semantic rules (Chapter 4) and the characteristics of the target machines for which
compilers must generate code (Chapter 5), we now return to core issues in language design. Specifically, we turn in this chapter to the issue of control flow or
ordering in program execution. Ordering is fundamental to most (though not all)
models of computing. It determines what should be done first, what second, and
so forth, to accomplish some desired task. We can organize the language mechanisms used to specify ordering into seven principal categories.
1. sequencing: Statements are to be executed (or expressions evaluated) in a certain specified order—usually the order in which they appear in the program
text.
2. selection: Depending on some run-time condition, a choice is to be made
among two or more statements or expressions. The most common selection
constructs are if and case ( switch ) statements. Selection is also sometimes
referred to as alternation.
3. iteration: A given fragment of code is to be executed repeatedly, either a certain number of times or until a certain run-time condition is true. Iteration
constructs include while , do , and repeat loops.
4. procedural abstraction: A potentially complex collection of control constructs
(a subroutine) is encapsulated in a way that allows it to be treated as a single
unit, often subject to parameterization.
5. recursion: An expression is defined in terms of (simpler versions of) itself, either directly or indirectly; the computational model requires a stack on which
to save information about partially evaluated instances of the expression. Recursion is usually defined by means of self-referential subroutines.
6. concurrency: Two or more program fragments are to be executed/evaluated
“at the same time,” either in parallel on separate processors or interleaved on
a single processor in a way that achieves the same effect.
7. nondeterminacy: The ordering or choice among statements or expressions is
deliberately left unspecified, implying that any alternative will lead to correct
results. Some languages require the choice to be random, or fair, in some formal sense of the word.
Though the syntactic and semantic details vary from language to language, these
seven principal categories cover all of the control-flow constructs and mechanisms found in most programming languages. A programmer who thinks in
terms of these categories, rather than the syntax of some particular language,
will find it easy to learn new languages, evaluate the tradeoffs among languages,
and design and reason about algorithms in a language-independent way.
Subroutines are the subject of Chapter 8. Concurrency is the subject of Chapter 12. The bulk of this chapter (Sections 6.3 through 6.7) is devoted to a study
of the five remaining categories. We begin in Section 6.1 by examining expression evaluation. We consider the syntactic form of expressions, the precedence
and associativity of operators, the order of evaluation of operands, and the semantics of the assignment statement. We focus in particular on the distinction
between variables that hold a value and variables that hold a reference to a value;
this distinction will play an important role many times in future chapters. In Section 6.2 we consider the difference between structured and unstructured ( goto based) control flow.
The relative importance of different categories of control flow varies significantly among the different classes of programming languages. Sequencing, for
example, is central to imperative (von Neumann and object-oriented) languages,
but plays a relatively minor role in functional languages, which emphasize the
evaluation of expressions, deemphasizing or eliminating statements (e.g., assignments) that affect program output in any way other than through the return
of a value. Similarly, functional languages make heavy use of recursion, whereas
imperative languages tend to emphasize iteration. Logic languages tend to deemphasize or hide the issue of control flow entirely: the programmer simply specifies
a set of inference rules; the language implementation must find an order in which
to apply those rules that will allow it to deduce values that satisfy some desired
property.
6.1 Expression Evaluation
EXAMPLE 6.1   A typical function call
An expression generally consists of either a simple object (e.g., a literal constant,
or a named variable or constant) or an operator or function applied to a collection
of operands or arguments, each of which in turn is an expression. It is conventional to use the term operator for built-in functions that use special, simple syntax, and to use the term operand for the argument of an operator. In Algol-family
languages, function calls consist of a function name followed by a parenthesized,
comma-separated list of arguments, as in
my_func(A, B, C)
EXAMPLE 6.2   Typical operators
Algol-family operators are simpler: they typically take only one or two arguments, and dispense with the parentheses and commas:
a + b
- c
EXAMPLE 6.3   Cambridge Polish (prefix) notation
As we saw in Section 3.6.2, some languages define the operators as syntactic sugar
for more “normal”-looking functions. In Ada, for example, a + b is short for
"+"(a, b) ; in C++, a + b is short for a.operator+(b) .
In general, a language may specify that function calls (operator invocations)
employ prefix, infix, or postfix notation. These terms indicate, respectively,
whether the function name appears before, among, or after its several arguments.
Most imperative languages use infix notation for binary operators and prefix notation for unary operators and other functions (with parentheses around the arguments). Lisp uses prefix notation for all functions but places the function name
inside the parentheses, in what is known as Cambridge Polish1 notation:
(* (+ 1 3) 2)            ; that would be (1 + 3) * 2 in infix
(append a b c my_list)
EXAMPLE 6.4   Mixfix notation in Smalltalk
A few languages, notably the R scripting language, allow the user to create
new infix operators. Smalltalk uses infix notation for all functions (which it calls
messages), both built-in and user-defined. The following Smalltalk statement
sends a “ displayOn: at: ” message to graphical object myBox , with arguments
myScreen and 100@50 (a pixel location). It corresponds to what other languages
would call the invocation of the “ displayOn: at: ” function with arguments
myBox , myScreen , and 100@50 .
myBox displayOn: myScreen at: 100@50
EXAMPLE 6.5   Conditional expressions
This sort of multiword infix notation occurs occasionally in Algol-family languages as well.2 In Algol one can say
a := if b <> 0 then a/b else 0;
Here “ if . . . then . . . else ” is a three-operand infix operator. The equivalent operator in C is written “. . . ? . . . : . . . ”:
a = b != 0 ? a/b : 0;
Postfix notation is used for most functions in Postscript, Forth, the input language of certain hand-held calculators, and the intermediate code of some compilers. Postfix appears in a few places in other languages as well. Examples include the pointer dereferencing operator ( ^ ) of Pascal and the post-increment and -decrement operators ( ++ and -- ) of C and its descendants.
1 Prefix notation was popularized by Polish logicians of the early 20th century; Lisp-like parenthesized syntax was first employed (for noncomputational purposes) by philosopher W. V. Quine of Harvard University (Cambridge, MA).
2 Most authors use the term “infix” only for binary operators. Multiword operators may be called “mixfix” or left unnamed.
6.1.1 Precedence and Associativity
EXAMPLE 6.6   A complicated Fortran expression
Most languages provide a rich set of built-in arithmetic and logical operators.
When written in infix notation, without parentheses, these operators lead to ambiguity as to what is an operand of what. In Fortran, for example, which uses
** for exponentiation, how should we parse a + b * c**d**e/f ? Should this
group as
((((a + b) * c)**d)**e)/f
or
a + (((b * c)**d)**(e/f))
or
a + ((b * (c**(d**e)))/f)
or yet some other option? (In Fortran, the answer is the last of the options shown.)
EXAMPLE 6.7   Precedence in four influential languages
EXAMPLE 6.8   A “gotcha” in Pascal precedence
In any given language, the choice among alternative evaluation orders depends
on the precedence and associativity of operators, concepts we introduced in Section 2.1.3. Issues of precedence and associativity do not arise in prefix or postfix
notation.
Precedence rules specify that certain operators, in the absence of parentheses,
group “more tightly” than other operators. Associativity rules specify that sequences of operators of equal precedence group to the right or to the left. In most
languages multiplication and division group more tightly than addition and subtraction. Other levels of precedence vary widely from one language to another.
Figure 6.1 shows the levels of precedence for several well-known languages. The precedence structure of C (and, with minor variations, of its descendants,
C++, Java, and C#) is substantially richer than that of most other languages. It
is, in fact, richer than shown in Figure 6.1, because several additional constructs,
including type casts, function calls, array subscripting, and record field selection,
are classified as operators in C. It is probably fair to say that most C programmers
do not remember all of their language’s precedence levels. The intent of the language designers was presumably to ensure that “the right thing” will usually happen when parentheses are not used to force a particular evaluation order. Rather
than count on this, however, the wise programmer will consult the manual or
add parentheses.
It is also probably fair to say that the relatively flat precedence hierarchy of
Pascal is a mistake. In particular, novice Pascal programmers frequently write
conditions like
Fortran (tightest grouping first):
    **
    * , /
    + , - (unary and binary)
    .eq. , .ne. , .lt. , .le. , .gt. , .ge. (comparisons)
    .not.
    .and.
    .or.
    .eqv. , .neqv. (logical comparisons)
Pascal (tightest grouping first):
    not
    * , / , div , mod , and
    + , - (unary and binary), or
    < , <= , > , >= , = , <> , IN (comparisons)
C (tightest grouping first):
    ++ , -- (post-inc., dec.)
    ++ , -- (pre-inc., dec.), + , - (unary), & , * (address, contents of), ! , ~ (logical, bit-wise not)
    * (binary), / , % (modulo division)
    + , - (binary)
    << , >> (left and right bit shift)
    < , <= , > , >= (inequality tests)
    == , != (equality tests)
    & (bit-wise and)
    ^ (bit-wise exclusive or)
    | (bit-wise inclusive or)
    && (logical and)
    || (logical or)
    ?: (if . . . then . . . else)
    = , += , -= , *= , /= , %= , >>= , <<= , &= , ^= , |= (assignment)
    , (sequencing)
Ada (tightest grouping first):
    abs (absolute value), not , **
    * , / , mod , rem
    + , - (unary)
    + , - (binary), & (concatenation)
    = , /= , < , <= , > , >= (comparisons)
    and , or , xor (logical operators)
Figure 6.1   Operator precedence levels in Fortran, Pascal, C, and Ada. Within each language, the operators at the top of the list group most tightly.
if A < B and C < D then (* ouch *)
Unless A , B , C , and D are all of type Boolean, which is unlikely, this code will
result in a static semantic error, since the rules of precedence cause it to group
as A < (B and C) < D . (And even if all four operands are of type Boolean, the
result is almost sure to be something other than what the programmer intended.)
Most languages avoid this problem by giving arithmetic operators higher precedence than relational (comparison) operators, which in turn have higher precedence than the logical operators. Notable exceptions include APL and Smalltalk, in which all operators are of equal precedence; parentheses must be used to specify grouping.
EXAMPLE 6.9   Common rules for associativity
Associativity rules are somewhat more uniform across languages, but still display some variety. The basic arithmetic operators almost always associate left-toright, so 9 - 3 - 2 is 4 and not 8 . In Fortran, as noted above, the exponentiation operator ( ** ) follows standard mathematical convention and associates
right-to-left, so 4**3**2 is 262144 and not 4096 . In Ada, exponentiation does
not associate: one must write either (4**3)**2 or 4**(3**2); the language syntax does not allow the unparenthesized form. In languages that allow assignments
inside expressions (an option we will consider more in Section 6.1.2), assignment
associates right-to-left. Thus in C, a = b = a + c assigns a + c into b and then
assigns the same value into a .
Because the rules for precedence and associativity vary so much from one language to another, a programmer who works in several languages is wise to make
liberal use of parentheses.
6.1.2 Assignments
In a purely functional language, expressions are the building blocks of programs,
and computation consists entirely of expression evaluation. The effect of any individual expression on the overall computation is limited to the value that expression provides to its surrounding context. Complex computations employ recursion to generate a potentially unbounded number of values, expressions, and
contexts.
In an imperative language, by contrast, computation typically consists of an
ordered series of changes to the values of variables in memory. Assignments provide the principal means by which to make the changes. Each assignment takes
a pair of arguments: a value and a reference to a variable into which the value
should be placed.
In general, a programming language construct is said to have a side effect if
it influences subsequent computation (and ultimately program output) in any
way other than by returning a value for use in the surrounding context. Purely
functional languages have no side effects. As a result, the value of an expression
in such a language depends only on the referencing environment in which the
expression is evaluated, not on the time at which the evaluation occurs. If an
expression yields a certain value at one point in time, it is guaranteed to yield
the same value at any point in time. In fancier terms, expressions in a purely
functional language are said to be referentially transparent.
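To make the contrast concrete, the following C sketch (our own illustration; the function and variable names are invented for the purpose) places a function whose result depends only on its argument next to one whose result also depends on mutable global state, and is therefore not referentially transparent:
    #include <stdio.h>

    int bias = 0;                       /* mutable global state (illustrative)           */

    int pure_scale(int x) {             /* referentially transparent: depends only on x  */
        return 2 * x;
    }

    int impure_scale(int x) {           /* result also depends on when it is called      */
        return 2 * x + bias;
    }

    int main(void) {
        printf("%d\n", pure_scale(21));     /* 42, at any point in the program           */
        printf("%d\n", impure_scale(21));   /* 42 here ...                               */
        bias = 1;                           /* side effect on later evaluations          */
        printf("%d\n", impure_scale(21));   /* ... but 43 now                            */
        return 0;
    }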
By contrast, imperative programming is sometimes described as “computing
by means of side effects.” While the evaluation of an assignment may sometimes
yield a value, what we really care about is the fact that it changes the value of a
variable, thereby affecting the result of any later computation in which the variable appears.
Many (though not all) imperative languages distinguish between expressions,
which always produce a value, and may or may not have side effects, and statements, which are executed solely for their side effects, and return no useful value.
References and Values
EXAMPLE 6.10   L-values and r-values
EXAMPLE 6.11   L-values in C
On the surface, assignment appears to be a very straightforward operation. Below the surface, however, there are some subtle but important differences in the
semantics of assignment in different imperative languages. These differences are
often invisible, because they do not affect the behavior of simple programs. They
have a major impact, however, on programs that use pointers, and will be explored in further detail in Section 7.7. We provide an introduction to the issues
here.
Consider the following assignments in C:
d = a;
a = b + c;
In the first statement, the right-hand side of the assignment refers to the value of
a , which we wish to place into d . In the second statement, the left-hand side
refers to the location of a , where we want to put the sum of b and c . Both
interpretations—value and location—are possible because a variable in C (and
in Pascal, Ada, and many other languages) is a named container for a value. We
sometimes say that languages like C use a value model of variables. Because of
their use on the left-hand side of assignment statements, expressions that denote
locations are referred to as l-values. Expressions that denote values (possibly the
value stored in a location) are referred to as r-values. Under a value model of variables, a given expression can be either an l-value or an r-value, depending on the
context in which it appears.
Of course, not all expressions can be l-values, because not all values have a
location, and not all names are variables. In most languages it makes no sense
to say 2 + 3 = a , or even a = 2 + 3 , if a is the name of a constant. By the
same token, not all l-values are simple names; both l-values and r-values can be
complicated expressions. In C one may write
(f(a)+3)->b[c] = 2;
EXAMPLE 6.12   L-values in C++
In this expression f(a) returns a pointer to some element of an array of structures (records). The assignment places the value 2 into the c -th element of field
b of the third structure after the one to which f ’s return value points.
In C++ it is even possible for a function to return a “reference” to a structure,
rather than a pointer to it, allowing one to write
g(a).b[c] = 2;
We will consider references further in Section 8.3.1.
Several languages make the distinction between l-values and r-values more explicit by employing a reference model of variables. In Clu, for example, a variable
Figure 6.2 The value (left) and reference (right) models of variables. Under the reference
model, it becomes important to distinguish between variables that refer to the same object and
variables that refer to different objects whose values happen (at the moment) to be equal.
EXAMPLE 6.13   Variables as values and references
is not a named container for a value; rather, it is a named reference to a value. The
following fragment of code is syntactically valid in both Pascal and Clu.
b := 2;
c := b;
a := b + c;
A Pascal programmer might describe this code by saying: “We put the value 2
in b and then copy it into c . We then read these values, add them together, and
place the resulting 4 in a .” The Clu programmer would say: “We let b refer to 2
and then let c refer to it also. We then pass these references to the + operator, and
let a refer to the result, namely 4.”
These two ways of thinking are illustrated in Figure 6.2. With a value model
of variables, as in Pascal, any integer variable can contain the value 2. With a
reference model of variables, as in Clu, there is (at least conceptually) only one
2 —a sort of Platonic Ideal—to which any variable can refer. The practical effect
is the same in this example, because integers are immutable: the value of 2 never
changes, so we can’t tell the difference between two copies of the number 2 and
two references to “the” number 2.
In a language that uses the reference model, every variable is an l-value. When
it appears in a context that expects an r-value, it must be dereferenced to obtain
the value to which it refers. In most languages with a reference model (including
Clu), the dereference is implicit and automatic. In ML, the programmer must
use an explicit dereference operator, denoted with a prefix exclamation point. We will revisit ML pointers in Section 7.7.1.
DESIGN & IMPLEMENTATION: Implementing the reference model
It is tempting to assume that the reference model of variables is inherently more expensive than the value model, since a naive implementation would require a level of indirection on every access. As we shall see in Section 7.7.1, however, most compilers for languages with a reference model use multiple copies of immutable objects for the sake of efficiency, achieving exactly the same performance for simple types that they would with a value model.
The difference between the value and reference models of variables becomes
particularly important (specifically, it can affect program output and behavior) if
the values to which variables refer can change “in place,” as they do in many programs with linked data structures, or if it is possible for variables to refer to different objects that happen to have the “same” value. In this latter case it becomes
important to distinguish between variables that refer to the same object and variables that refer to different objects whose values happen (at the moment) to be
equal. (Lisp, as we shall see in Sections 7.10 and 10.3.3, provides more than one
notion of equality, to accommodate this distinction.) We will discuss the value
and reference models of variables further in Section 7.7. Languages that employ
(some variant of) the reference model include Algol 68, Clu, Lisp/Scheme, ML,
Haskell, and Smalltalk.
Java uses a value model for built-in types and a reference model for userdefined types (classes). C# and Eiffel allow the programmer to choose between
the value and reference models for each individual user-defined type. A C# class
is a reference type; a struct is a value type.
Boxing
EXAMPLE 6.14   Wrapper objects in Java 2
A drawback of using a value model for built-in types is that they can’t be passed
uniformly to methods that expect class-typed parameters. Early versions of Java,
for example, required the programmer to “wrap” objects of built-in types inside
corresponding predefined class types in order to insert them in standard container (collection) classes:
import java.util.Hashtable;
...
Hashtable ht = new Hashtable();
...
Integer N = new Integer(13);          // Integer is a "wrapper" class
ht.put(N, new Integer(31));
Integer M = (Integer) ht.get(N);
int m = M.intValue();
EXAMPLE 6.15   Boxing in Java 5
More recent versions of Java perform automatic boxing and unboxing operations that avoid the need for wrappers in many cases:
ht.put(13, 31);
int m = (Integer) ht.get(13);
Here the compiler creates hidden Integer objects to hold the values 13 and 31 , so they may be passed to put as references. The Integer cast on the return value is still needed, to make sure that the hash table entry for 13 is really an integer and not, say, a floating-point number or string.
EXAMPLE 6.16   Boxing in C#
C# “boxes” not only the arguments, but the cast as well, eliminating the need
for the Integer class entirely. C# also provides so-called indexers (Section 9.1,
page 474), which can be used to overload the subscripting ( [ ] ) operator, giving
the hash table array-like syntax:
ht[13] = 31;
int m = (int) ht[13];
Orthogonality
EXAMPLE 6.17   Expression orientation in Algol 68
One of the principal design goals of Algol 68 was to make the various features
of the language as orthogonal as possible. Orthogonality means that features can
be used in any combination, the combinations all make sense, and the meaning
of a given feature is consistent, regardless of the other features with which it is
combined. The name is meant to draw an explicit analogy to orthogonal vectors
in linear algebra: none of the vectors in an orthogonal set depends on (or can
be expressed in terms of) the others, and all are needed in order to describe the
vector space as a whole.
Algol 68 was one of the first languages to make orthogonality a principal design goal, and in fact few languages since have given the goal such weight. Among
other things, Algol 68 is said to be expression-oriented: it has no separate notion
of statement. Arbitrary expressions can appear in contexts that would call for
a statement in a language like Pascal, and constructs that are considered to be
statements in other languages can appear within expressions. The following, for
example, is valid in Algol 68:
begin
a := if b < c then d else e;
a := begin f(b); g(c) end;
g(d);
2 + 3
end
Here the value of the if . . . then . . . else construct is either the value of its then
part or the value of its else part, depending on the value of the condition. The
value of the “statement list” on the right-hand side of the second assignment
is the value of its final “statement,” namely the return value of g(c) . There is
no need to distinguish between procedures and functions, because every subroutine call returns a value. The value returned by g(d) is discarded in this
example. Finally, the value of the code fragment as a whole is 5, the sum of 2
and 3.
C takes an approach intermediate between Pascal and Algol 68. It distinguishes
between statements and expressions, but one of the classes of statement is an “expression statement,” which computes the value of an expression and then throws
it away. In effect, this allows an expression to appear in any context that would
require a statement in most other languages. C also provides special expression
forms for selection and sequencing. Algol 60 defines if . . . then . . . else as both
a statement and an expression.
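By way of illustration, the following C fragment (a sketch of ours, loosely mirroring the Algol 68 example above; the function names are invented) uses the conditional operator for selection, the comma operator for sequencing, and a bare call as an expression statement:
    #include <stdio.h>

    int f(int x) { return x + 1; }      /* illustrative stand-ins */
    int g(int x) { return x * 2; }

    int main(void) {
        int a, b = 0, c = 3, d = 4, e = 5;

        a = (b < c) ? d : e;    /* selection within an expression            */
        a = (f(b), g(c));       /* sequencing: the value is that of g(c)     */
        g(d);                   /* expression statement; value is discarded  */
        printf("%d\n", a);      /* prints 6                                  */
        return 0;
    }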
Both Algol 68 and C allow assignments within expressions. The value of an
assignment is simply the value of its right-hand side. Unfortunately, where most of the descendants of Algol 60 use the := token to represent assignment, C follows Fortran in simply using = . It uses == to represent a test for equality (Fortran uses .eq. ). Moreover, C lacks a separate Boolean type. (C99 has a new _Bool type, but it's really just a one-bit integer.) In any context that would require a Boolean value in other languages, C accepts an integer (or anything that can be coerced to be an integer). It interprets zero as false; any other value is true. As a result, both of the following constructs are valid (and common) in C.
EXAMPLE 6.18   A “gotcha” in C conditions
if (a == b) {
/* do the following if a equals b */
if (a = b) {
/* assign b into a and then do
the following if the result is nonzero */
Programmers who are accustomed to Ada or some other language in which = is
the equality test frequently write the second form above when the first is what is
intended. This sort of bug can be very hard to find.
Though it provides a true Boolean type ( bool ), C++ shares the problem of C,
because it provides automatic coercions from numeric, pointer, and enumeration
types. Java and C# eliminate the problem by disallowing integers in Boolean contexts. The assignment operator is still = , and the equality test is still == , but the
statement if (a = b) ... will generate a compile-time type clash error unless
a and b are both boolean (Java) or bool (C#), which is generally unlikely.
Combination Assignment Operators
EXAMPLE 6.19   Updating assignments
Because they rely so heavily on side effects, imperative programs must frequently
update a variable. It is thus common in many languages to see statements like
a = a + 1;
or worse,
b.c[3].d = b.c[3].d * e;
EXAMPLE 6.20   Side effects and updates
Such statements are not only cumbersome to write and to read (we must examine
both sides of the assignment carefully to see if they really are the same), they also
result in redundant address calculations (or at least extra work to eliminate the
redundancy in the code improvement phase of compilation).
If the address calculation has a side effect, then we may need to write a pair of
statements instead. Consider the following code in C:
void update(int A[], int index_fn(int n)) {
int i, j;
/* calculate i */
...
j = index_fn(i);
A[j] = A[j] + 1;
}
Here we cannot safely write
A[index_fn(i)] = A[index_fn(i)] + 1;
EXAMPLE 6.21   Assignment operators
We have to introduce the temporary variable j because we don’t know whether
index_fn has a side effect or not. If it is being used, for example, to keep a log of
elements that have been updated, then we shall want to make sure that update
calls it only once.
To eliminate the clutter and compile- or run-time cost of redundant address
calculations, and to avoid the issue of repeated side effects, many languages, beginning with Algol 68 and including C and its descendants, provide so-called
assignment operators to update a variable. Using assignment operators, the statements in Example 6.19 can be written as follows.
a += 1;
b.c[3].d *= e;
Similarly, the two assignments in the update function can be replaced with
A[index_fn(i)] += 1;
EXAMPLE 6.22   Prefix and postfix inc/dec
In addition to being aesthetically cleaner, the assignment operator form guarantees that the address calculation is performed only once.
As shown in Figure 6.1, C provides 10 different assignment operators, one for
each of its binary arithmetic and bit-wise operators. C also provides prefix and
postfix increment and decrement operations. These allow even simpler code in
update :
A[index_fn(i)]++;
or
++A[index_fn(i)];
More significantly, increment and decrement operators provide elegant syntax
for code that uses an index or a pointer to traverse an array:
A[--i] = b;
*p++ = *q++;
EXAMPLE 6.23   Advantages of postfix inc/dec
When prefixed to an expression, the ++ or -- operator increments or decrements
its operand before providing a value to the surrounding context. In the postfix
form, ++ or -- updates its operand after providing a value. If i is 3 and p and q
point to the initial elements of a pair of arrays, then b will be assigned into A[2]
(not A[3] ), and the second assignment will copy the initial elements of the arrays
(not the second elements).
The prefix forms of ++ and -- are syntactic sugar for += and -= . We could
have written
A[i -= 1] = b;
above. The postfix forms are not syntactic sugar. To obtain an effect similar to
the second statement above we would need an auxiliary variable and a lot of
extra notation:
*(t = p, p += 1, t) = *(t = q, q += 1, t);
Both the assignment operators ( += , -= ) and the increment and decrement
operators ( ++ , -- ) do “the right thing” when applied to pointers in C. If p points
to an object that occupies n bytes in memory (including any bytes required for
alignment, as discussed in Section 5.1), then p += 3 points 3n bytes higher in
memory.
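The scaling is easy to observe. In the sketch below (array name and element type chosen arbitrarily for illustration), advancing a pointer by 3 moves it by 3 * sizeof(double) bytes, and pointer subtraction is scaled in the same way:
    #include <stdio.h>

    int main(void) {
        double A[10];                                 /* illustrative array          */
        double *p = &A[0];

        p += 3;                                       /* 3 elements, not 3 bytes     */
        printf("%td\n", (char *) p - (char *) &A[0]); /* 3 * sizeof(double) bytes    */
        printf("%td\n", p - &A[0]);                   /* 3: subtraction is scaled    */
        return 0;
    }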
Multiway Assignment
EXAMPLE 6.24   Simple multiway assignment
We have already seen that the right associativity of assignment (in languages that
allow assignment in expressions) allows one to write things like a = b = c . In
several languages, including Clu, ML, Perl, Python, and Ruby, it is also possible
to write
a, b := c, d;
EXAMPLE 6.25   Advantages of multiway assignment
Here the comma in the right-hand side is not the sequencing operator of C.
Rather, it serves to define an expression, or tuple, consisting of multiple r-values.
The comma operator on the left-hand side produces a tuple of l-values. The effect
of the assignment is to copy c into a and d into b .3
While we could just as easily have written
a := c; b := d;
the multiway (tuple) assignment allows us to write things like
a, b := b, a;
which would otherwise require auxiliary variables. Moreover, multiway assignment allows functions to return tuples, as well as single values:
a, b, c := foo(d, e, f);
This notation eliminates the asymmetry (nonorthogonality) of functions in most
programming languages, which allow an arbitrary number of arguments but only
a single return.
ML generalizes the idea of multiway assignment into a powerful pattern-matching mechanism; we will examine this mechanism in more detail in Section 7.2.4.
CHECK YOUR UNDERSTANDING
1. Name seven major categories of control-flow mechanisms.
2. What distinguishes operators from other sorts of functions?
3 The syntax shown here is for Clu. Perl, Python, and Ruby follow C in using = for assignment.
ML requires parentheses around each tuple.
3. Explain the difference between prefix, infix, and postfix notation. What is
Cambridge Polish notation? Name two programming languages that use postfix notation.
4. Why don’t issues of associativity and precedence arise in Postscript or Forth?
5. What does it mean for an expression to be referentially transparent?
6. What is the difference between a value model of variables and a reference
model of variables? Why is the distinction important?
7. What is an l-value? An r-value?
8. Why is the distinction between mutable and immutable values important in
the implementation of a language with a reference model of variables?
9. Define orthogonality in the context of programming language design.
10. What does it mean for a language to be expression-oriented?
11. What are the advantages of updating a variable with an assignment operator,
rather than with a regular assignment in which the variable appears on both
the left- and right-hand sides?
6.1.3 Initialization
Because they already provide a construct (the assignment statement) to set the
value of a variable, imperative languages do not always provide a means of specifying an initial value for a variable in its declaration. There are at least two reasons, however, why such initial values may be useful:
1. In the case of statically allocated variables (as discussed in Section 3.2), an
initial value that is specified in the context of the declaration can be placed into
memory by the compiler. If the initial value is set by an assignment statement
instead, it will generally incur execution cost at run time.
2. One of the most common programming errors is to use a variable in an expression before giving it a value. One of the easiest ways to prevent such errors
(or at least ensure that erroneous behavior is repeatable) is to give every variable a value when it is first declared.
Some languages (e.g., Pascal) have no initialization facility at all; all variables
must be given values by explicit assignment statements. To avoid the expense of
run-time initialization of statically allocated variables, many Pascal implementations provide initialization as a language extension, generally in the form of a
:= expr immediately after the name in the declaration. Unfortunately, the extension is usually nonorthogonal, in the sense that it only works for variables of
simple, built-in types. A more complete and orthogonal approach to initialization requires a notation for aggregates: built-up structured values of user-defined
composite types. Aggregates can be found in several languages, including C, Ada,
Fortran 90, and ML; we will discuss them further in Section 7.1.5. It should be
emphasized that initialization saves time only for variables that are statically allocated. Variables allocated in the stack or heap at run time must be initialized at
run time.4 It is also worth noting that the problem of using an uninitialized variable occurs not only after elaboration, but also as a result of any operation that
destroys a variable’s value without providing a new one. Two of the most common such operations are explicit deallocation of an object referenced through a
pointer and modification of the tag of a variant record. We will consider these
operations further in Sections 7.7 and 7.3.4, respectively.
If a variable is not given an initial value explicitly in its declaration, the language may specify a default value. In C, for example, statically allocated variables
for which the programmer does not provide an initial value are guaranteed to
be represented in memory as if they had been initialized to zero. For most types
on most machines, this is a string of zero bits, allowing the language implementation to exploit the fact that most operating systems (for security reasons) fill
newly allocated memory with zeros. Zero-initialization applies recursively to the
subcomponents of variables of user-defined composite types. The designers of
C chose not to incur the run-time cost of automatically zero-filling uninitialized
variables that are allocated in the stack or heap. The programmer can specify an
initial value if desired; the effect is the same as if an assignment had been placed
at the beginning of the code for the variable’s scope.
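The following C sketch (identifiers are ours) illustrates the difference: the statically allocated variables below are guaranteed to start out as zero, while the stack-allocated local has no defined initial value unless the programmer supplies one:
    #include <stdio.h>

    int global_count;             /* statically allocated: starts at zero              */
    static double table[100];     /* also zero-filled, recursively                     */

    void f(void) {
        int local_count;          /* stack-allocated: initial value is garbage         */
        int safe_count = 0;       /* explicit initializer; compiled as an assignment   */

        local_count = 1;          /* must be assigned before any use                   */
        printf("%d %d %d %.1f\n", global_count, safe_count, local_count, table[0]);
    }

    int main(void) {
        f();                      /* prints 0 0 1 0.0 */
        return 0;
    }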
Constructors
Many object-oriented languages allow the programmer to define types for which
initialization of dynamically allocated variables occurs automatically, even when
no initial value is specified in the declaration. C++ also distinguishes carefully
between initialization and assignment. Initialization is interpreted as a call to
a constructor function for the variable’s type, with the initial value as an argument. In the absence of coercion, assignment is interpreted as a call to the type’s
assignment operator or, if none has been defined, as a simple bit-wise copy of
the value on the assignment’s right-hand side. The distinction between initialization and assignment is particularly important for user-defined abstract data
types that perform their own storage management. A typical example occurs in
variable-length character strings. An assignment to such a string must generally
deallocate the space consumed by the old value of the string before allocating
space for the new value. An initialization of the string must simply allocate space.
Initialization with a nontrivial value is generally cheaper than default initialization followed by assignment because it avoids deallocation of the space allocated
for the default value. We will return to this issue in Section 9.3.2.
4 For variables that are accessed indirectly (e.g., in languages that employ a reference model of
variables), a compiler can often reduce the cost of initializing a stack or heap variable by placing
the initial value in static memory, and only creating the pointer to it at elaboration time.
Neither Java nor C# distinguishes between initialization and assignment, or
between declaration and definition. Java uses a reference model for all variables
of user-defined object types, and provides for automatic storage reclamation, so
assignment never copies values. C# allows the programmer to specify a value
model when desired (in which case assignment does copy values), but otherwise
it mirrors Java. We will return to these issues again in Chapter 9 when we consider
object-oriented features in more detail.
Definite Assignment
EXAMPLE 6.26   Programs outlawed by definite assignment
Java and C# require that a value be “definitely assigned” to a variable before that
variable is used in any expression. Both languages provide a precise definition of
“definitely assigned,” based on the control flow of the program. Roughly speaking, every possible control path to an expression must assign a value to every
variable in that expression. This is a conservative rule; it can sometimes prohibit
programs that would never actually use an uninitialized variable. In Java:
int i;
final static int j = 3;
...
if (j > 0) {
i = 2;
}
...
if (j > 0) {
System.out.println(i);
// error: "i might not have been initialized"
}
DESIGN & IMPLEMENTATION: Safety v. performance
A recurring theme in any comparison between C++ and Java is the latter’s
willingness to accept additional run-time cost in order to obtain cleaner semantics or increased reliability. Definite assignment is one example: it may
force the programmer to perform “unnecessary” initializations on certain code
paths, but in so doing it avoids the many subtle errors that can arise from missing initialization in other languages. Similarly, the Java specification mandates
automatic garbage collection, and its reference model of user-defined types
forces most objects to be allocated in the heap. As we shall see in Chapters 7
and 9, Java also requires both dynamic binding of all method invocations and
run-time checks for out-of-bounds array references, type clashes, and other
dynamic semantic errors. Clever compilers can reduce or eliminate the cost of
these requirements in certain common cases, but for the most part the Java
design reflects an evolutionary shift away from performance as the overriding
design goal.
While a human being might reason that i will only be used when it has previously
been given a value, it is uncomputable to make such determinations in the general
case, and the compiler does not attempt it.
Dynamic Checks
Instead of giving every uninitialized variable a default value, a language or implementation can choose to define the use of an uninitialized variable as a dynamic
semantic error, and can catch these errors at run time. The advantage of the semantic checks is that they will often identify a program bug that is masked or
made more subtle by the presence of a default value. With appropriate hardware
support, uninitialized variable checks can even be as cheap as default values, at
least for certain types. In particular, a compiler that relies on the IEEE standard for floating-point arithmetic can fill uninitialized floating-point numbers with a signaling NaN value, as discussed in Section 5.2.1. Any attempt to use such a value in a computation will result in a hardware interrupt, which the language
implementation may catch (with a little help from the operating system), and use
to trigger a semantic error message.
For most types on most machines, unfortunately, the costs of catching all uses
of an uninitialized variable at run time are considerably higher. If every possible
bit pattern of the variable’s representation in memory designates some legitimate
value (and this is often the case), then extra space must be allocated somewhere
to hold an initialized/uninitialized flag. This flag must be set to “uninitialized” at
elaboration time and to “initialized” at assignment time. It must also be checked
(by extra code) at every use—or at least at every use that the code improver is unable to prove is redundant. Dynamic semantic checks for uninitialized variables
are common in interpreted languages, which already incur significant overhead
on every variable access. Because of their cost, however, the checks are usually
not performed in languages that are compiled.
6.1.4 Ordering Within Expressions
EXAMPLE 6.27   Indeterminate ordering
While precedence and associativity rules define the order in which binary infix
operators are applied within an expression, they do not specify the order in which
the operands of a given operator are evaluated. For example, in the expression
a - f(b) - c * d
we know from associativity that f(b) will be subtracted from a before performing the second subtraction, and we know from precedence that the right operand
of that second subtraction will be the result of c * d , rather than merely c , but
without additional information we do not know whether a - f(b) will be evaluated before or after c * d . Similarly, in a subroutine call with multiple arguments
f(a, g(b), c)
we do not know the order in which the arguments will be evaluated.
There are two main reasons why the order can be important:
EXAMPLE 6.28   A value that depends on ordering
EXAMPLE 6.29   An optimization that depends on ordering
1. Side effects: If f(b) may modify d , then the value of a - f(b) - c * d
will depend on whether the first subtraction or the multiplication is performed first. Similarly, if g(b) may modify a and/or c , then the values passed
to f(a, g(b), c) will depend on the order in which the arguments are
evaluated.
2. Code improvement: The order of evaluation of subexpressions has an impact
on both register allocation and instruction scheduling. In the expression a *
b + f(c) , it is probably desirable to call f before evaluating a * b , because
the product, if calculated first, would need to be saved during the call to f ,
and f might want to use all the registers in which it might easily be saved. In
a similar vein, consider the sequence
a := B[i];
c := a * 2 + d * 3;
Here it is probably desirable to evaluate d * 3 before evaluating a * 2 , because the previous statement, a := B[i] , will need to load a value from
memory. Because loads are slow, if the processor attempts to use the value of a
in the next instruction (or even the next few instructions on many machines),
it will have to wait. If it does something unrelated instead (i.e., evaluate d *
3 ), then the load can proceed in parallel with other computation.
Because of the importance of code improvement, most language manuals say
that the order of evaluation of operands and arguments is undefined. (Java and
C# are unusual in this regard: they require left-to-right evaluation.) In the absence of an enforced order, the compiler can choose whatever order results in
faster code.
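The practical consequence is easy to demonstrate. In the C sketch below (the tracing scheme and function names are contrived for the example), each argument records the order in which it was evaluated; a conforming compiler may legitimately report that order as 123, 321, or any other permutation:
    #include <stdio.h>

    int trace = 0;    /* digits record the order of evaluation (illustrative) */

    int get_a(void) { trace = trace * 10 + 1; return 1; }
    int get_b(void) { trace = trace * 10 + 2; return 2; }
    int get_c(void) { trace = trace * 10 + 3; return 3; }

    void f(int x, int y, int z) {
        printf("sum = %d, order = %d\n", x + y + z, trace);
    }

    int main(void) {
        f(get_a(), get_b(), get_c());   /* order of the three calls is unspecified */
        return 0;
    }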
Applying Mathematical Identities
EXAMPLE 6.30   Optimization and mathematical “laws”
Some language implementations (e.g., for dialects of Fortran) allow the compiler
to rearrange expressions involving operators whose mathematical abstractions
are commutative, associative, and/or distributive, in order to generate faster code.
Consider the following Fortran fragment.
a = b + c
d = c + e + b
Some compilers will rearrange this as
a = b + c
d = b + c + e
They can then recognize the common subexpression in the first and second statements, and generate code equivalent to
a = b + c
d = a + e
Similarly,
a = b/c/d
e = f/d/c
may be rearranged as
t = c * d
a = b/t
e = f/t
Unfortunately, while mathematical arithmetic obeys a variety of commutative, associative, and distributive laws, computer arithmetic is not as orderly. The
problem is that numbers in a computer are of limited precision. With 32-bit
arithmetic, the expression b - c + d can be evaluated safely left to right if b , c , and d are all integers between two billion and three billion (2³² is a little less than 4.3 billion). If the compiler attempts to reorganize this expression as b + d - c ,
however (e.g., in order to delay its use of c ), then arithmetic overflow will occur.
Many languages, including Pascal and most of its descendants, provide dynamic semantic checks to detect arithmetic overflow. In some implementations
these checks can be disabled to eliminate their run-time overhead. In C and C++,
the effect of arithmetic overflow is implementation-dependent. In Java, it is well
defined: the language definition specifies the size of all numeric types, and requires two’s complement integer and IEEE floating-point arithmetic. In C#, the
programmer can explicitly request the presence or absence of checks by tagging
an expression or statement with the checked or unchecked keyword. In a completely different vein, Scheme, Common Lisp, and several scripting languages
place no a priori limit on the size of numbers; space is allocated to hold extra-large values on demand.
Even in the absence of overflow, the limited precision of floating-point arithmetic can cause different arrangements of the “same” expression to produce significantly different results, invisibly.
DESIGN & IMPLEMENTATION: Evaluation order
Expression evaluation represents a difficult tradeoff between semantics and implementation. To limit surprises, most language definitions require the compiler, if it ever reorders expressions, to respect any ordering imposed by parentheses. The programmer can therefore use parentheses to prevent the application of arithmetic “identities” when desired. No similar guarantee exists with respect to the order of evaluation of operands and arguments. It is therefore unwise to write expressions in which a side effect of evaluating one operand or argument can affect the value of another. As we shall see in Section 6.3, some languages, notably Euclid and Turing, outlaw such side effects.
EXAMPLE 6.31   Reordering and numerical stability
Single-precision IEEE floating-point numbers devote 1 bit to the sign, 8 bits to the exponent (power of 2), and 23 bits to
the mantissa. Under this representation, a + b is guaranteed to result in a loss of
information if |log₂(a/b)| > 23. Thus if b = -c , then a + b + c may appear to be zero, instead of a , if the magnitude of a is small, while the magnitude of b and c is large. In a similar vein, a number like 0.1 cannot be represented precisely, because its binary representation is a “repeating decimal”: 0.000110011001100. . . . For
certain values of x , (0.1 + x) * 10.0 and 1.0 + (x * 10.0) can differ by as
much as 25%, even when 0.1 and x are of the same magnitude.
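A tiny C experiment makes the danger concrete (the values here are chosen to force the effect, and double precision is used for convenience); the two parenthesizations of the “same” sum give different answers:
    #include <stdio.h>

    int main(void) {
        double a = 1.0e-20, b = 1.0e20, c = -1.0e20;   /* contrived magnitudes */

        printf("%g\n", (a + b) + c);   /* a is absorbed into b; prints 0       */
        printf("%g\n", a + (b + c));   /* b and c cancel first; prints 1e-20   */
        return 0;
    }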
6.1.5 Short-Circuit Evaluation
EXAMPLE 6.32   Short-circuited expressions
EXAMPLE 6.33   Saving time with short-circuiting
EXAMPLE 6.34   Short-circuit pointer chasing
Boolean expressions provide a special and important opportunity for code improvement and increased readability. Consider the expression (a < b) and
(b < c) . If a is greater than b , there is really no point in checking to see whether
b is less than c ; we know the overall expression must be false. Similarly, in the expression (a > b) or (b > c), if a is indeed greater than b there is no point in
checking to see whether b is greater than c ; we know the overall expression must
be true. A compiler that performs short-circuit evaluation of Boolean expressions
will generate code that skips the second half of both of these computations when
the overall value can be determined from the first half.
Short-circuit evaluation can save significant amounts of time in certain situations:
if (very_unlikely_condition && very_expensive_function()) ...
But time is not the only consideration, or even the most important one. Short-circuiting changes the semantics of Boolean expressions. In C, for example, one can use the following code to search for an element in a list.
p = my_list;
while (p && p->key != val)
    p = p->next;
C short-circuits its && and || operators, and uses zero for both nil and false, so
p->key will be accessed if and only if p is non-nil. The syntactically similar code
in Pascal does not work, because Pascal does not short-circuit and and or :
p := my_list;
while (p <> nil) and (p^.key <> val) do
p := p^.next;
(* ouch! *)
Here both of the <> relations will be evaluated before and-ing their results together. At the end of an unsuccessful search, p will be nil , and the attempt to
access p^.key will be a run-time (dynamic semantic) error, which the compiler
may or may not have generated code to catch. To avoid this situation, the Pascal
programmer must introduce an auxiliary Boolean variable and an extra level of
nesting:
p := my_list;
still_searching := true;
while still_searching do
    if p = nil then
        still_searching := false
    else if p^.key = val then
        still_searching := false
    else
        p := p^.next;
1.   function tally(word : string) : integer;
2.   (* Look up word in hash table. If found, increment tally; If not
        found, enter with a tally of 1. In either case, return tally. *)
3.   ...
4.   function misspelled(word : string) : Boolean;
5.   (* Check to see if word is mis-spelled and return appropriate
        indication. If yes, increment global count of mis-spellings. *)
6.   ...
7.   while not eof(doc_file) do begin
8.       w := get_word(doc_file);
9.       if (tally(w) = 10) and misspelled(w) then
10.          writeln(w)
11.  end;
12.  writeln(total_misspellings);
Figure 6.3   Pascal code that counts on the evaluation of Boolean operands.
EXAMPLE 6.35   Short-circuiting and other errors
Short-circuit evaluation can also be used to avoid out-of-bound subscripts:
const MAX = 10;
int A[MAX];
/* indices from 0 to 9 */
...
if (i >= 0 && i < MAX && A[i] > foo) ...
division by zero:
if (d != 0 && n/d > threshold) ...
EXAMPLE 6.36   When not to use short-circuiting
and various other errors.
Short-circuiting is not necessarily as attractive for situations in which a
Boolean subexpression can cause a side effect. Suppose we wish to count occurrences of words in a document, and print a list of all misspelled words that appear
ten or more times, together with a count of the total number of misspellings. Pascal code for this task appears in Figure 6.3. Here the if statement at line 9 tests
the conjunction of two subexpressions, both of which have important side effects.
If short-circuit evaluation is used, the program will not compute the right result.
The code can be rewritten to eliminate the need for non-short-circuit evaluation,
but one might argue that the result is more awkward than the version shown.
EXAMPLE 6.37   Optional short-circuiting
So now we have seen situations in which short-circuiting is highly desirable,
and others in which at least some programmers would find it undesirable. A few
languages, among them Clu, Ada, and C, provide both regular and short-circuit
Boolean operators. (Similar flexibility can be achieved with if . . . then . . . else
in an expression-oriented language such as Algol 68; see Exercise 6.10.) In Clu,
the regular Boolean operators are and and or ; the short-circuit operators are
cand and cor (for conditional and and or ):
if d ~= 0 cand n/d > threshold then ...
In Ada, the regular operators are also and and or ; the short-circuit operators are
the two-word operators and then and or else :
found_it := p /= null and then p.key = val;
(Clu and Ada use ~= and /= , respectively, for “not equal.”) C's logical && and || operators short-circuit; the bit-wise & and | operators can be used as non-short-circuiting alternatives when their arguments are logical (zero or one) values.
When used to determine the flow of control in a selection or iteration construct, short-circuit Boolean expressions do not really have to calculate a Boolean
value; they simply have to ensure that control takes the proper path in any given
situation. We will look more closely at the generation of code for short-circuit
expressions in Section 6.4.1.
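To give a feel for what such code looks like, the following C sketch (the list type and helper names are invented here) places the short-circuit search loop from the earlier example next to a hand-written, jump-oriented equivalent in which no Boolean value is ever materialized:
    #include <stdio.h>

    struct node { int key; struct node *next; };   /* hypothetical list type */

    /* short-circuit form, as in the earlier example */
    struct node *find1(struct node *p, int val) {
        while (p && p->key != val)
            p = p->next;
        return p;
    }

    /* equivalent control flow, written out by hand */
    struct node *find2(struct node *p, int val) {
        for (;;) {
            if (p == NULL) break;        /* first operand is false: leave the loop   */
            if (p->key == val) break;    /* whole condition is false: leave the loop */
            p = p->next;                 /* both tests passed: execute the body      */
        }
        return p;
    }

    int main(void) {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        printf("%d %d\n", find1(&a, 3)->key, find2(&a, 3)->key);
        return 0;
    }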
CHECK YOUR UNDERSTANDING
12. Given the ability to assign a value into a variable, why is it useful to be able to
specify an initial value?
13. What are aggregates? Why are they useful?
14. Explain the notion of definite assignment in Java and C#.
15. Why is it generally expensive to catch all uses of uninitialized variables at run
time?
16. Why is it impossible to catch all uses of uninitialized variables at compile
time?
17. Why do most languages leave unspecified the order in which the arguments
of an operator or function are evaluated?
18. What is short-circuit Boolean evaluation? Why is it useful?
6.2 Structured and Unstructured Flow
EXAMPLE 6.38   Control flow with gotos in Fortran
Control flow in assembly languages is achieved by means of conditional and unconditional jumps (branches). Early versions of Fortran mimicked the low-level
approach by relying heavily on goto statements for most nonprocedural control
flow:
      if (A .lt. B) goto 10     ! ".lt." means "<"
      ...
10
The 10 on the bottom line is a statement label.
Goto statements also feature prominently in other early imperative languages.
In Cobol and PL/I they provide the only means of writing logically controlled
( while -style) loops. Algol 60 and its successors provide a wealth of non- goto based constructs, but until recently most Algol-family languages still provided
goto as an option.
Throughout the late 1960s and much of the 1970s, language designers debated
hotly the merits and evils of goto s. It seems fair to say the detractors won. Ada
and C# allow goto s only in limited contexts. Modula (1, 2, and 3), Clu, Eiffel,
and Java do not allow them at all. Fortran 90 and C++ allow them primarily for
compatibility with their predecessor languages. (Java reserves the token goto as
a keyword, to make it easier for a Java compiler to produce good error messages
when a programmer uses a C++ goto by mistake.)
The abandonment of goto s was part of a larger “revolution” in software engineering known as structured programming. Structured programming was the
“hot trend” of the 1970s, in much the same way that object-oriented programming was the trend of the 1990s. Structured programming emphasizes top-down
design (i.e., progressive refinement), modularization of code, structured types
(records, sets, pointers, multidimensional arrays), descriptive variable and constant names, and extensive commenting conventions. The developers of structured programming were able to demonstrate that within a subroutine, almost
any well-designed imperative algorithm can be elegantly expressed with only sequencing, selection, and iteration. Instead of labels, structured languages rely on
the boundaries of lexically nested constructs as the targets of branching control.
Many of the structured control-flow constructs familiar to modern programmers were pioneered by Algol 60. These include the if . . . then . . . else construct
and both enumeration ( for ) and logically ( while ) controlled loops. The case
statement was introduced by Wirth and Hoare in Algol W [WH66] as an alternative to the more unstructured computed goto and switch constructs of Fortran
and Algol 60, respectively. Case statements were adopted in limited form by Algol 68, and more completely by Pascal, Modula, C, Ada, and a host of modern
languages.
6.2.1 Structured Alternatives to goto
Once the principal structured constructs had been defined, most of the controversy surrounding goto s revolved around a small number of special cases, each
of which was eventually addressed in structured ways.
EXAMPLE 6.39 Leaving the middle of a loop
Mid-loop exit and continue: A common use of goto s in Pascal was to break out of the middle of a loop:
while not eof do begin
readln(line);
if all_blanks(line) then goto 100;
consume_line(line)
end;
100:
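In a language with structured mid-loop exits, the same loop needs no label at all. The following C sketch of the same idea uses break; the helpers all_blanks and consume_line are hypothetical stand-ins for those assumed in the Pascal fragment above.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* hypothetical helpers, mirroring those assumed in the Pascal fragment */
static bool all_blanks(const char *line) {
    return strspn(line, " \t\n") == strlen(line);
}
static void consume_line(const char *line) { fputs(line, stdout); }

int main(void) {
    char line[256];
    while (fgets(line, sizeof line, stdin) != NULL) {
        if (all_blanks(line))
            break;              /* structured replacement for "goto 100" */
        consume_line(line);
    }
    return 0;
}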
Less commonly, one would also see a label inside the end of a loop, to serve as the target of a goto that would terminate a given iteration early. As we shall see in Section 6.5.5, mid-loop exits are supported by special “one-and-a-half” loop constructs in languages like Modula, C, and Ada. Some languages also provide a statement to skip the remainder of the current loop iteration: continue in C; cycle in Fortran 90; next in Perl.

EXAMPLE 6.40 Returning from the middle of a subroutine

Early returns from subroutines: Goto s were used fairly often in Pascal to terminate the current subroutine:
procedure consume_line(var line: string);
...
begin
    ...
    if line[i] = '%' then goto 100;    (* rest of line is a comment *)
    ...
100:
end;
EXAMPLE 6.41 Escaping a nested subroutine
At a minimum, this goto statement avoids putting the remainder of the procedure in an else clause. If the terminating condition is discovered within a
deeply nested if . . . then . . . else , it may avoid introducing an auxiliary variable that must be tested repeatedly in the remainder of the procedure ( if not
comment_line then ... ).
The obvious alternative to this use of goto is an explicit return statement.
Algol 60 does not have one, and neither does Pascal, but Fortran always has,
and most modern Algol descendants have adopted it.
Multilevel returns: Returns and (local) goto s allow control to return from the
current subroutine. On occasion it may make sense to return from a surrounding routine. Imagine, for example, that we are searching for an item matching
some desired pattern within a collection of files. The search routine might invoke several nested routines, or a single routine multiple times, once for each
place in which to search. In such a situation certain historic languages, including Algol 60, PL/I, and Pascal, permit a goto to branch to a lexically visible
label outside the current subroutine:
function search(key : string) : string;
var rtn : string;
...
    procedure search_file(fname : string);
    ...
    begin
        ...
        for ... (* iterate over lines *)
            ...
            if found(key, line) then begin
                rtn := line;
                goto 100;
            end;
        ...
    end;
...
begin (* search *)
    ...
    for ... (* iterate over files *)
        ...
        search_file(fname);
    ...
100:
    return rtn;
end;
EXAMPLE 6.42 Structured nonlocal transfers
In the event of a nonlocal goto , the language implementation must guarantee to repair the run-time stack of subroutine call information. This repair
operation is known as unwinding. It requires not only that the implementation
deallocate the stack frames of any subroutines from which we have escaped,
but also that it perform any bookkeeping operations, such as restoration of
register contents, that would have been performed when returning from those
routines.
As a more structured alternative to the nonlocal goto , Common Lisp provides a return-from statement that names the lexically surrounding function
or block from which to return, and also supplies a return value (eliminating
the need for the artificial rtn variable in Example 6.41).
But what if search_file were not nested inside of search ? We might, for
example, wish to call it from routines that search files in different orders. In
this case the goto of Pascal does not suffice. Algol 60 and PL/I allow labels
to be passed as parameters, so a dynamically nested subroutine can perform a
goto to a caller-defined location. PL/I also allows labels to be stored in variables. If a nested routine needs to return a value it can assign it to some variable in a scope that surrounds all calls. Alternatively, we can pass a reference
parameter into every call, into which the result should be written.
Common Lisp again provides a more structured alternative, also available
in Ruby. In either language an expression can be surrounded with a catch
block, whose value can be provided by any dynamically nested routine that
executes a matching throw . In Ruby we might write
def searchFile(fname, pattern)
    file = File.open(fname)
    file.each {|line|
        throw :found, line if line =~ /#{pattern}/
    }
end

match = catch :found do
    searchFile("f1", key)
    searchFile("f2", key)
    searchFile("f3", key)
    "not found\n"            # default value for catch,
                             # if control gets this far
end
print match

Here the throw expression specifies a tag, which must appear in a matching catch , together with a value ( line ) to be returned as the value of the catch . (The if clause attached to the throw performs a regular-expression pattern match, looking for pattern within line . We will consider pattern matching in more detail in Section 13.4.2.)

EXAMPLE 6.43 Error-checking with status codes
Errors and other exceptions: The notion of a multilevel return assumes that the
callee knows what the caller expects, and can return an appropriate value. In
a related and arguably more common situation, a deeply nested block or subroutine may discover that it is unable to proceed with its usual function and,
moreover, lacks the contextual information it would need to recover in any
graceful way. The only recourse in such a situation is to “back out” of the
nested context to some point in the program that is able to recover. Conditions that require a program to “back out” are usually called exceptions. We
saw an example in Section 2.3.4, where we considered phrase-level recovery from syntax errors in a recursive-descent parser.
The most straightforward but generally least satisfactory way to cope with
exceptions is to use auxiliary Boolean variables within a subroutine ( if
still_ok then ... ) and to return status codes from calls:
status := my_proc(args);
if status = ok then ...
The auxiliary Booleans can be eliminated by using a nonlocal goto or multilevel return, but the caller to which we return must still inspect status codes
explicitly. As a structured alternative, many modern languages provide an exception handling mechanism for convenient, nonlocal recovery from exceptions. We will discuss exception handling in more detail in Section 8.5. Typically the programmer appends a block of code called a handler to any computation in which an exception may arise. The job of the handler is to take
whatever remedial action is required to recover from the exception. If the protected computation completes in the normal fashion, execution of the handler
is skipped.
Multilevel returns and structured exceptions have strong similarities. Both
involve a control transfer from some inner, nested context back to an outer
context, unwinding the stack on the way. The distinction lies in where the
computing occurs. In a multilevel return the inner context has all the information it needs. It completes its computation, generating a return value if
appropriate, and transfers to the outer context in a way that requires no postprocessing. At an exception, by contrast, the inner context cannot complete its
work. It performs an “abnormal” return, triggering execution of the handler.
Common Lisp and Ruby provide mechanisms for both multilevel returns
and exceptions, but this dual support is relatively rare. Most languages support
only exceptions; programmers implement multilevel returns by writing a trivial handler. In an unfortunate overloading of terminology, the names catch
and throw , which Common Lisp and Ruby use for multilevel returns, are used
for exceptions in several other languages.
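For comparison, C programmers often approximate a multilevel return with the standard library routines setjmp and longjmp, which record a point on the stack and later unwind back to it (with none of the bookkeeping guarantees of a true exception mechanism). The following sketch of the file-search idiom is ours; the file-scanning details are illustrative only.

#include <setjmp.h>
#include <stdio.h>
#include <string.h>

static jmp_buf found_ctx;      /* context captured in search() */
static char match[256];        /* value communicated out of the nested call */

/* hypothetical helper: on a hit, unwind directly back to search() */
static void search_file(const char *fname, const char *key) {
    char line[256];
    FILE *f = fopen(fname, "r");
    if (f == NULL) return;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strstr(line, key) != NULL) {
            strncpy(match, line, sizeof match - 1);
            fclose(f);
            longjmp(found_ctx, 1);   /* "multilevel return" to the setjmp below */
        }
    }
    fclose(f);
}

const char *search(const char *key) {
    if (setjmp(found_ctx) != 0)      /* longjmp lands here with the result ready */
        return match;
    search_file("f1", key);
    search_file("f2", key);
    search_file("f3", key);
    return "not found";
}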
6.2.2 Continuations
The notion of nonlocal goto s that unwind the stack can be generalized by defining what are known as continuations. In low-level terms, a continuation consists
of a code address and a referencing environment to be restored when jumping to
that address. In higher-level terms, a continuation is an abstraction that captures
a context in which execution might continue. Continuations are fundamental to
denotational semantics. They also appear as first-class values in certain languages
(notably Scheme and Ruby), allowing the programmer to define new control-flow constructs.
Continuation support in Scheme takes the form of a general purpose function
called call-with-current-continuation , sometimes abbreviated call/cc .
DESIGN & IMPLEMENTATION
Cleaning up continuations
The implementation of continuations in Scheme and Ruby is surprisingly
straightforward. Because local variables have unlimited extent in both languages, activation records must in general be allocated on the heap. As a result, explicit deallocation is neither required nor appropriate when jumping
through a continuation; frames that are no longer accessible will eventually
be reclaimed by a general purpose garbage collector (to be discussed in Section 7.7.3). Restoration of state (e.g., saved registers) from escaped routines
is not required either: the continuation closure holds everything required to
resume the captured context.
This function takes a single argument, f , which is itself a function. It calls f ,
passing as argument a continuation c that captures the current program counter
and referencing environment. The continuation is represented by a closure, indistinguishable from the closures used to represent subroutines passed as parameters. At any point in the future, f can call c to reestablish the captured context.
If nested calls have been made, control pops out of them, as it does with exceptions. More generally, however, c can be saved in variables, returned explicitly by
subroutines, or called repeatedly, even after control has returned from f (recall
that closures in Scheme have unlimited extent; see Section 3.5). Call/cc suffices
to build a wide variety of control abstractions, including goto s, mid-loop exit s,
multilevel returns, exceptions, iterators (Section 6.5.3), call-by-name parameters
(Section 8.3.1), and coroutines (Section 8.6). It even subsumes the notion of returning from a subroutine, though it seldom replaces it in practice.
First-class continuations are an extremely powerful facility. They can be very
useful if applied in well-structured ways (i.e., to define new control-flow constructs). Unfortunately, they also allow the undisciplined programmer to construct completely inscrutable programs.
6.3 Sequencing
Like assignment, sequencing is central to imperative programming. It is the principal means of controlling the order in which side effects (e.g., assignments) occur: when one statement follows another in the program text, the first statement
executes before the second. In most imperative languages, lists of statements can
be enclosed with begin . . . end or {. . . } delimiters and then used in any context
in which a single statement is expected. Such a delimited list is usually called a
compound statement. A compound statement preceded by a set of declarations is
sometimes called a block.
In languages like Algol 68 and C, which blur or eliminate the distinction between statements and expressions, the value of a statement (expression) list is the
value of its final element. In Common Lisp, the programmer can choose to return
the value of the first element, the second, or the last. Of course, sequencing is a
useless operation unless the subexpressions that do not play a part in the return
value have side effects. The various sequencing constructs in Lisp are used only
in program fragments that do not conform to a purely functional programming
model.
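C’s comma operator is the expression-level counterpart of such a construct: the earlier operands are evaluated only for their side effects, and the value of the whole expression is that of the final operand. A small sketch:

#include <stdio.h>

int main(void) {
    int i = 0;
    /* the value of a comma expression is the value of its final element;
       the assignment to i matters only for its side effect */
    int j = (i = 3, i + 1);
    printf("%d %d\n", i, j);   /* prints "3 4" */
    return 0;
}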
Even in imperative languages, there is debate as to the value of certain kinds of
side effects. In Euclid and Turing, for example, functions (that is, subroutines that
return values, and that therefore can appear within expressions) are not permitted to have side effects. Among other things, side-effect freedom ensures that a
Euclid or Turing function, like its counterpart in mathematics, is always idempotent: if called repeatedly with the same set of arguments, it will always return the
same value, and the number of consecutive calls (after the first) will not affect
the results of subsequent execution. In addition, side-effect freedom for functions means that the value of a subexpression will never depend on whether that subexpression is evaluated before or after calling a function in some other subexpression. These properties make it easier for a programmer or theorem-proving system to reason about program behavior. They also simplify code improvement, for example by permitting the safe rearrangement of expressions.

EXAMPLE 6.44 Side effects in a random number generator

Unfortunately, there are some situations in which side effects in functions are highly desirable. We saw one example in the gen_new_name function of Figure 3.6 (page 125). Another arises in the typical interface to a pseudo-random number generator.
procedure srand(seed : integer)
–– Initialize internal tables.
–– The pseudo-random generator will return a different
–– sequence of values for each different value of seed.
function rand() : integer
–– No arguments; returns a new “random” number.
Obviously rand needs to have a side effect, so that it will return a different value
each time it is called. One could always recast it as a procedure with a reference
parameter:
procedure rand(var n : integer)
but most programmers would find this less appealing. Ada strikes a compromise:
it allows side effects in functions in the form of changes to static or global variables, but does not allow a function to modify its parameters.
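The need for the side effect becomes obvious if one sketches an implementation: the generator has to remember something between calls. The following C sketch is ours (the my_ prefix simply avoids colliding with the library’s own rand); the constants are those of a classic Lehmer-style multiplicative generator.

#include <stdint.h>

static uint64_t my_seed = 1;             /* hidden state: the side effect */

void my_srand(uint64_t s) {
    my_seed = s % 2147483647u;
    if (my_seed == 0) my_seed = 1;       /* 0 is a fixed point; avoid it */
}

int my_rand(void) {
    /* each call updates the hidden seed, so successive calls
       return different "random" values */
    my_seed = (my_seed * 48271u) % 2147483647u;
    return (int)my_seed;
}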
6.4 Selection

EXAMPLE 6.45 Selection in Algol 60
Selection statements in most imperative languages employ some variant of the
if . . . then . . . else notation introduced in Algol 60:
if condition then statement
else if condition then statement
else if condition then statement
...
else statement
As we saw in Section 2.3.2, languages differ in the details of the syntax. In Algol
60 and Pascal both the then clause and the else clause are defined to contain
a single statement (this can of course be a begin . . . end compound statement).
To avoid grammatical ambiguity, Algol 60 requires that the statement after the
then begin with something other than if ( begin is fine). Pascal eliminates this
restriction in favor of a “disambiguating rule” that associates an else with the
closest unmatched then . Algol 68, Fortran 77, and more modern languages avoid
the ambiguity by allowing a statement list to follow either then or else , with a terminating keyword at the end of the construct.

EXAMPLE 6.46 Elsif / elif
To keep terminators from piling up at the end of nested if statements,
most languages with terminators provide a special elsif or elif keyword. In
Modula-2, one writes
IF a = b THEN ...
ELSIF a = c THEN ...
ELSIF a = d THEN ...
ELSE ...
END
EXAMPLE 6.47 Cond in Lisp
In Lisp, the equivalent construct is
(cond
((= A B)
(...))
((= A C)
(...))
((= A D)
(...))
(T
(...)))
Here cond takes as arguments a sequence of pairs. In each pair the first element
is a condition; the second is an expression to be returned as the value of the
overall construct if the condition evaluates to T ( T means “true” in most Lisp
dialects).
6.4.1 Short-Circuited Conditions
While the condition in an if . . . then . . . else statement is a Boolean expression,
there is usually no need for evaluation of that expression to result in a Boolean
value in a register. Most machines provide conditional branch instructions that
capture simple comparisons. Put another way, the purpose of the Boolean expression in a selection statement is not to compute a value to be stored, but to cause
control to branch to various locations. This observation allows us to generate
particularly efficient code (called jump code) for expressions that are amenable to
the short-circuit evaluation of Section 6.1.5. Jump code is applicable not only to
selection statements such as if . . . then . . . else , but to logically controlled loops
as well; we will consider the latter in Section 6.5.5.
In the usual process of code generation, either via an attribute grammar or via
ad hoc syntax tree decoration, a synthesized attribute of the root of an expression
subtree acquires the name of a register into which the value of the expression will
be computed at run time. The surrounding context then uses this register name
when generating code that uses the expression. In jump code, inherited attributes
of the root inform it of the addresses to which control should branch if the ex-
pression is true or false respectively. Jump code can be generated quite elegantly by an attribute grammar, particularly one that is not L-attributed (Exercise 6.9).

EXAMPLE 6.48 Code generation for a Boolean condition
Suppose, for example, that we are generating code for the following source.
if ((A > B) and (C > D)) or (E = F) then
then clause
else
else clause
In Pascal, which does not use short-circuit evaluation, the output code would
look something like this.
    r1 := A              –– load
    r2 := B
    r1 := r1 > r2
    r2 := C
    r3 := D
    r2 := r2 > r3
    r1 := r1 & r2
    r2 := E
    r3 := F
    r2 := r2 = r3
    r1 := r1 | r2
    if r1 = 0 goto L2
L1: then clause          –– (label not actually used)
    goto L3
L2: else clause
L3:
The root of the subtree for ((A > B) and (C > D)) or (E = F) would name r1 as the register containing the expression value.

EXAMPLE 6.49 Code generation for short-circuiting
In jump code, by contrast, the inherited attributes of the condition’s root
would indicate that control should “fall through” to L1 if the condition is true,
or branch to L2 if the condition is false. Output code would then look something
like this:
r1 := A
r2 := B
if r1 <= r2 goto L4
r1 := C
r2 := D
if r1 > r2 goto L1
L4: r1 := E
r2 := F
if r1 = r2 goto L2
L1: then clause
goto L3
L2: else clause
L3:
Here the value of the Boolean condition is never explicitly placed into a register. Rather it is implicit in the flow of control. Moreover for most values of A , B , C , D , and E , the execution path through the jump code is shorter and therefore faster (assuming good branch prediction) than the straight-line code that calculates the value of every subexpression.

EXAMPLE 6.50 Short-circuit creation of a Boolean value
If the value of a short-circuited expression is needed explicitly, it can of course
be generated, while still using jump code for efficiency. The Ada fragment
found_it := p /= null and then p.key = val;
is equivalent to
if p /= null and then p.key = val then
found_it := true;
else
found_it := false;
end if;
and can be translated as
r1 := p
if r1 = 0 goto L1
r2 := r1→key
if r2 = val goto L1
r1 := 1
goto L2
L1: r1 := 0
L2: found it := r1
The astute reader will notice that the first goto L1 can be replaced by goto L2 ,
since r1 already contains a zero in this case. The code improvement phase of the
compiler will notice this also, and make the change. It is easier to fix this sort of
thing in the code improver than it is to generate the better version of the code in
the first place. The code improver has to be able to recognize jumps to redundant
instructions for other reasons anyway; there is no point in building special cases
into the short-circuit evaluation routines.
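The same idiom is equally natural in C, whose && and || operators always short-circuit. A small sketch, with a hypothetical list-node type:

#include <stdbool.h>
#include <stddef.h>

struct node { int key; struct node *next; };

/* && evaluates p->key only when p != NULL, so the dereference is safe */
bool found_it(const struct node *p, int val) {
    return p != NULL && p->key == val;
}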
DESIGN & IMPLEMENTATION
Short-circuit evaluation
Short-circuit evaluation is one of those happy cases in programming language
design where a clever language feature yields both more useful semantics and a
faster implementation than existing alternatives. Other at least arguable examples include case statements, local scopes for for loop indices (Section 6.5.1),
with statements in Pascal (Section 7.3.3), and parameter modes in Ada (Section 8.3.1).
6.4.2 Case/Switch Statements

EXAMPLE 6.51 Case statements and nested if s
The case statements of Algol W and its descendants provide alternative syntax
for a special case of nested if . . . then . . . else . When each condition compares
the same integer expression to a different compile-time constant, then the following code (written here in Modula-2)
i := ... (* potentially complicated expression *)
IF i = 1 THEN
clause A
ELSIF i IN 2, 7 THEN
clause B
ELSIF i IN 3..5 THEN
clause C
ELSIF (i = 10) THEN
clause D
ELSE
clause E
END
can be rewritten as
CASE ... (* potentially complicated expression *) OF
    1:    clause A
|   2, 7: clause B
|   3..5: clause C
|   10:   clause D
ELSE      clause E
END

The elided code fragments (clause A, clause B, etc.) after the colons and the ELSE are called the arms of the CASE statement. The lists of constants in front of the colons are CASE statement labels. The constants in the label lists must be disjoint, and must be of a type compatible with the tested expression. Most languages allow this type to be anything whose values are discrete: integers, characters, enumerations, and subranges of the same. C# allows strings as well.

The CASE statement version of the code above is certainly less verbose than the IF . . . THEN . . . ELSE version, but syntactic elegance is not the principal motivation for providing a CASE statement in a programming language. The principal motivation is to facilitate the generation of efficient target code. The IF . . . THEN . . . ELSE statement is most naturally translated as follows.

EXAMPLE 6.52 Translation of nested if s
    r1 := . . .             –– calculate tested expression
    if r1 ≠ 1 goto L1
    clause A
    goto L6
L1: if r1 = 2 goto L2
    if r1 ≠ 7 goto L3
L2: clause B
    goto L6
L3: if r1 < 3 goto L4
    if r1 > 5 goto L4
    clause C
    goto L6
L4: if r1 ≠ 10 goto L5
    clause D
    goto L6
L5: clause E
L6:

    goto L6                 –– jump to code to compute address
L1: clause A
    goto L7
L2: clause B
    goto L7
L3: clause C
    goto L7
    ...
L4: clause D
    goto L7
L5: clause E
    goto L7
L6: r1 := . . .             –– computed target of branch
    goto *r1
L7:

Figure 6.4 General form of target code generated for a five-arm case statement. One could eliminate the initial goto L6 and the final goto L7 by computing the target of the branch at the top of the generated code, but it may be cumbersome to do so, particularly in a one-pass compiler. The form shown adds only a single jump to the control flow in most cases, and allows the code for all of the arms of the case statement to be generated as encountered, before the code to determine the target of the branch can be deduced.
EXAMPLE 6.53 Jump tables
Rather than test its expression sequentially against a series of possible values,
the case statement is meant to compute an address to which it jumps in a single
instruction. The general form of the target code generated from a case statement
appears in Figure 6.4. The code at label L6 can take any of several forms. The most
common of these simply indexes into an array:
T:  &L1      –– tested expression = 1
    &L2
    &L3
    &L3
    &L3
    &L5
    &L2
    &L5
    &L5
    &L4      –– tested expression = 10
L6: r1 := . . .             –– calculate tested expression
    if r1 < 1 goto L5       –– L5 is the “else” arm
    if r1 > 10 goto L5
    r1 −:= 1                –– subtract off lower bound
    r2 := T[r1]
    goto *r2
L7:
Here the “code” at label T is actually a table of addresses, known as a jump table.
It contains one entry for each integer between the lowest and highest values, inclusive, found among the case statement labels. The code at L6 checks to make
sure that the tested expression is within the bounds of the array (if not, we should
execute the else arm of the case statement). It then fetches the corresponding
entry from the table and branches to it.
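The effect of a jump table can also be mimicked at the source level. The following C sketch is ours, not compiler output; it uses an array of function pointers in place of the table of label addresses T, with the same layout as above, for the case statement of Example 6.51.

#include <stdio.h>

/* one handler per arm of the case statement */
static void arm_a(void) { puts("clause A"); }
static void arm_b(void) { puts("clause B"); }
static void arm_c(void) { puts("clause C"); }
static void arm_d(void) { puts("clause D"); }
static void arm_e(void) { puts("clause E"); }   /* the "else" arm */

/* indexed by (tested value - 1): 1 -> A, 2 and 7 -> B, 3..5 -> C, 10 -> D */
static void (*const table[10])(void) = {
    arm_a, arm_b, arm_c, arm_c, arm_c, arm_e, arm_b, arm_e, arm_e, arm_d
};

void dispatch(int i) {
    if (i < 1 || i > 10) {   /* out of bounds: take the else arm */
        arm_e();
        return;
    }
    table[i - 1]();          /* single indexed jump, as in the table T above */
}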
Alternative Implementations
A linear jump table is fast. It is also space-efficient when the overall set of case
statement labels is dense and does not contain large ranges. It can consume an
extraordinarily large amount of space, however, if the set of labels is nondense or
includes large value ranges. Alternative methods to compute the address to which
to branch include sequential testing, hashing, and binary search. Sequential testing (as in an if . . . then . . . else statement) is the method of choice if the total
number of case statement labels is small. It runs in time O(n), where n is the
number of labels. A hash table is attractive if the range of label values is large but
has many missing values and no large ranges. With an appropriate hash function
it will run in time O(1). Unfortunately, a hash table requires a separate entry for
each possible value of the tested expression, making it unsuitable for statements
with large value ranges. Binary search can accommodate ranges easily. It runs in
time O(log n), with a relatively low constant factor.
To generate good code for all possible case statements, a compiler needs to be
prepared to use a variety of strategies. During compilation it can generate code
for the various arms of the case statement as it finds them, while simultaneously
building up an internal data structure to describe the label set. Once it has seen
all the arms, it can decide which form of target code to generate. For the sake of
simplicity, most compilers employ only some of the possible implementations.
Many use binary search in lieu of hashing. Some generate only indexed jump tables; others only that plus sequential testing. Users of less sophisticated compilers
may need to restructure their case statements if the generated code turns out to
be unexpectedly large or slow.
Syntax and Label Semantics
As with if . . . then . . . else statements, the syntactic details of case statements
vary from language to language. In keeping with the style of its other structured
statements, Pascal defines each arm of a case statement to contain a single statement; begin . . . end delimiters are required to bracket statement lists. Modula,
Ada, Fortran 90, and many other languages expect arms to contain statement
lists by default. Modula uses | to separate an arm from the following label. Ada
brackets labels with when and => .
Standard Pascal does not include a default clause: all values on which to take
action must appear explicitly in label lists. It is a dynamic semantic error for the
expression to evaluate to a value that does not appear. Most Pascal compilers permit the programmer to add a default clause, labeled either else or otherwise ,
as a language extension. Modula allows an optional else clause. If one does not
appear in a given case statement, then it is a dynamic semantic error for the
tested expression to evaluate to a missing value. Ada requires arm labels to cover
all possible values in the domain of the type of the tested expression. If the type
of the tested expression has a very large number of values, then this coverage must
be accomplished using ranges or an others clause. In some languages, notably
C and Fortran 90, it is not an error for the tested expression to evaluate to a
missing value. Rather, the entire construct has no effect when the value is missing.
The C switch Statement
C’s syntax for case ( switch ) statements (retained by C++ and Java) is unusual
in other respects.
switch (... /* tested expression */) {
case 1: clause A
break;
case 2:
case 7: clause B
break;
case 3:
case 4:
case 5:
clause C
break;
case 10: clause D
break;
default: clause E
break;
}
DESIGN & IMPLEMENTATION
Case statements
Case statements are one of the clearest examples of language design driven by
implementation. Their primary reason for existence is to facilitate the generation of jump tables. Ranges in label lists (not permitted in Pascal or C) may
reduce efficiency slightly, but binary search is still dramatically faster than the
equivalent series of if s.
EXAMPLE 6.54 Fall-through in C switch statements
Here each possible value for the tested expression must have its own label
within the switch ; ranges are not allowed. In fact, lists of labels are not allowed,
but the effect of lists can be achieved by allowing a label (such as 2 , 3 , and 4
above) to have an empty arm that simply “falls through” into the code for the
subsequent label. Because of the provision for fall-through, an explicit break
statement must be used to get out of the switch at the end of an arm, rather
than falling through into the next. There are rare circumstances in which the
ability to fall through is convenient:
letter_case = lower;
switch (c) {
    ...
    case 'A' :
        letter_case = upper;
        /* FALL THROUGH! */
    case 'a' :
        ...
        break;
    ...
}
Most of the time, however, the need to insert a break at the end of each arm—
and the compiler’s willingness to accept arms without breaks, silently—is a recipe
for unexpected and difficult-to-diagnose bugs. C# retains the familiar C syntax,
including multiple consecutive labels, but requires every nonempty arm to end
with a break , goto , continue , or return .
Historical Origins
EXAMPLE 6.55 Fortran computed goto
Modern case statements are a descendant of the computed goto statement of
Fortran and the switch construct of Algol 60. In early versions of Fortran, one
could specify multiway branching based on an integer value as follows.
goto (15, 100, 150, 200), I
If I is one, control jumps to the statement labeled 15 . If I is two, control jumps to the statement labeled 100 . If I is outside the range 1 . . . 4, the statement has no effect. Any integer-valued expression could be used in place of I . Computed goto s are still allowed in Fortran 90 but are identified by the language manual as a deprecated feature, retained to facilitate compilation of old programs.

EXAMPLE 6.56 Algol 60 switch
In Algol 60, a switch is essentially an array of labels:
switch S := L15, L100, L150, L200;
...
goto S[I];
Algol 68 eliminates the goto s by, in essence, indexing into an array of statements,
but the syntax is rather cumbersome.
CHECK YOUR UNDERSTANDING
19. List the principal uses of goto , and the structured alternatives to each.
20. Explain the distinction between exceptions and multilevel returns.
21. What are continuations? What other language features do they subsume?
22. Why is sequencing a comparatively unimportant form of control flow in Lisp?
23. Explain why it may sometimes be useful for a function to have side effects.
24. Describe the jump code implementation of short-circuit Boolean evaluation.
25. Why do imperative languages commonly provide a case statement in addition to if . . . then . . . else ?
26. Describe three different search strategies that might be employed in the implementation of a case statement, and the circumstances in which each
would be desirable.
6.5 Iteration
Iteration and recursion are the two mechanisms that allow a computer to perform
similar operations repeatedly. Without at least one of these mechanisms, the running time of a program (and hence the amount of work it can do and the amount
of space it can use) is a linear function of the size of the program text, and the
computational power of the language is no greater than that of a finite automaton. In a very real sense, it is iteration and recursion that make computers useful.
In this section we focus on iteration. Recursion is the subject of Section 6.6.
Programmers in imperative languages tend to use iteration more than they
use recursion (recursion is more common in functional languages). In most languages, iteration takes the form of loops. Like the statements in a sequence, the iterations of a loop are generally executed for their side effects: their modifications
of variables. Loops come in two principal varieties; these differ in the mechanisms
used to determine how many times they iterate. An enumeration-controlled loop
is executed once for every value in a given finite set. The number of iterations
is therefore known before the first iteration begins. A logically controlled loop is
executed until some Boolean condition (which must necessarily depend on values altered in the loop) changes value. The two forms of loops share a single
construct in Algol 60. They are distinct in most later languages, with the notable
exception of Common Lisp, whose loop macro provides an astonishing array of
options for initialization, index modification, termination detection, conditional
execution, and value accumulation.
6.5.1 Enumeration-Controlled Loops

EXAMPLE 6.57 Early Fortran do loop
Enumeration-controlled loops are as old as Fortran. The Fortran syntax and semantics have evolved considerably over time. In Fortran I, II, and IV a loop looks
something like this:
    do 10 i = 1, 10, 2
        ...
10  continue

EXAMPLE 6.58 Meaning of a do loop
The number after the do is a label that must appear on some statement later in
the current subroutine; the statement it labels is the last one in the body of the
loop: the code that is to be executed multiple times. Continue is a “no-op”: a
statement that has no effect. Using a continue for the final statement of the loop
makes it easier to modify code later: additional “real” statements can be added to
the bottom of the loop without moving the label.5
The variable name after the label is the index of the loop. The comma-separated values after the equals sign indicate the initial value of the index, the
maximum value it is permitted to take, and the amount by which it is to increase
in each iteration (this is called the step size). A bit more precisely, the loop above
is equivalent to
    i = 1
10  ...
    i = i + 2
    if i <= 10 goto 10
Index variable i in this example will take on the values 1, 3, 5, 7, and 9 in successive loop iterations. Compilers can translate this loop into very simple, fast code
for most machines.
In practice, unfortunately, this early form of loop proved to have several problems. Some of these problems were comparatively minor. The loop bounds and
step size (1, 10, and 2 in our example) were required to be positive integer constants or variables: no expressions were allowed. Fortran 77 removed this restriction, allowing arbitrary positive and negative integer and real expressions. Also,
as we saw in Section 2.16 (page 57), trivial lexical errors can cause a Fortran IV
compiler to misinterpret the code as an ordinary sequence of statements beginning with an assignment. Fortran 77 makes such misinterpretation less likely by
allowing an extra comma after the label in the do loop header. Fortran 90 takes
back (makes “obsolescent”) the ability to use real numbers for loop bounds and
step sizes. The problem with reals is that limited precision can cause comparisons (e.g., between the index and the upper bound) to produce unexpected or
even implementation-dependent results when the values are close to one another.
5 The continue statement of C probably takes its name from this typical use of the no-op in
Fortran, but its semantics are very different: the C continue starts the next iteration of the loop
even when the current one has not finished.
The more serious problems with the Fortran IV do loop are a bit more subtle:
If statements in the body of the loop (or in subroutines called from the body of
the loop) change the value of i , then the loop may execute a different number
of times than one would assume based on the bounds in its header. If the effect
is accidental, the bug is hard to find. If the effect is intentional, the code is hard
to read.
Goto statements may jump into or out of the loop. Code that jumps out and
(optionally) back in again is expressly allowed (if difficult to understand). On
the other hand, code that simply jumps in, without properly initializing i ,
almost certainly represents a programming error, but will not be caught by
the compiler.
If control leaves a do loop via a goto , the value of i is the one most recently assigned. If the loop terminates normally, however, the value of i is
implementation-dependent. Based on Example 6.58, one might expect the final value to be the first one outside the loop bounds: L + ((U − L)/S + 1) × S,
where L, U, and S are the lower and upper bounds of the loop and the step
size, respectively. Unfortunately, if the upper bound is close to the largest value
that can be represented given the precision of integers on the target machine,
then the increment at the bottom of the final iteration of the loop may cause
arithmetic overflow. On most machines this overflow will result in an apparently negative value, which will prevent the loop from terminating correctly.
On some it will cause a run-time exception that requires the intervention of
the operating system in order to continue execution. To ensure correct termination and/or avoid the cost of an exception, a compiler must generate more
complex (and slower) code when it is unable to rule out overflow at compile
time. In this event, the index may contain its final value (not the “next” value)
after normal termination of the loop.
Because the test against the upper bound appears at the bottom of the loop,
the body will always be executed at least once, even if the “low” bound is larger
than the “high” bound.
DESIGN & IMPLEMENTATION
Numerical imprecision
The writers of numerical software know that the results of arithmetic computations are often approximations. A comparison between values that are approximately equal “may go either way.” The Fortran 90 designers appear to
have decided that such comparisons should be explicit. Fortran 90 do loops,
like the for loops of most other languages, reflect the precision of discrete
types. The programmer who wants to control iteration with floating-point
values must use an explicit comparison in a pre-test or post-test loop (Section 6.5.5).
These problems arise in a larger context than merely Fortran IV. They must be addressed in the design of enumeration-controlled loops in any language. Consider the arguably more friendly syntax of Modula-2:

EXAMPLE 6.59 Modula-2 for loop
FOR i := first TO last BY step DO
...
END
where first , last , and step can be arbitrarily complex expressions of an integer, enumeration, or subrange type. Based on the preceding discussion, one
might ask several questions.
1. Can i , first , and/or last be modified in the loop? If so, what is the effect
on control?
2. What happens if first is larger than last (or smaller, in the case of a negative step )?
3. What is the value of i when the loop is finished?
4. Can control jump into the loop from outside?
We address these questions in the paragraphs below.
Changes to Loop Indices or Bounds
Most languages, including Algol 68, Pascal, Ada, Fortran 77 and 90, and
Modula-3, prohibit changes to the loop index within the body of an enumeration-controlled loop. They also guarantee to evaluate the bounds of the loop
exactly once, before the first iteration, so any changes to variables on which those
bounds depend will not have any effect on the number of iterations executed.
Modula-2 is vague; the manual says that the index “should not be changed” by
the body of the loop [Wir85b, Sec. 9.8]. ISO Pascal goes to considerable lengths to
prohibit modification. Paraphrasing slightly, it says [Int90, Sec. 6.8.3.9] that the
index variable must be declared in the closest enclosing block, and that neither
the body of the for statement itself nor any statement contained in a subroutine local to the block can “threaten” the index variable. A statement is said to
threaten a variable if it
Assigns to it
Passes it to a subroutine by reference
Reads it from a file
Is a structured statement containing a simpler statement that threatens it
The prohibition against threats in local subroutines is made because a local variable will be accessible to those subroutines, and one of them, if called from within
the loop, might change the value of the variable even if it is not passed to it by
reference.
Empty Bounds
EXAMPLE 6.60 Obvious translation of a for loop
Modern languages refrain from executing an enumeration-controlled loop if the
bounds are empty. In other words, they test the terminating condition before the
first iteration. The initial test requires a few extra instructions but leads to much
more intuitive behavior. The loop
FOR i := first TO last BY step DO
...
END
can be translated as
    r1 := first
    r2 := step
    r3 := last
L1: if r1 > r3 goto L2
    . . .                   –– loop body; use r1 for i
    r1 := r1 + r2
    goto L1
L2:

EXAMPLE 6.61 For loop translation with test at the bottom

A slightly better if less straightforward translation is
    r1 := first
    r2 := step
    r3 := last
    goto L2
L1: . . .                   –– loop body; use r1 for i
    r1 := r1 + r2
L2: if r1 ≤ r3 goto L1
The advantage of this second version is that each iteration of the loop contains
a single conditional branch, rather than a conditional branch at the top and an
unconditional branch at the bottom. (We will consider yet another version in
Exercise 15.4.)
The translations shown above work only if first + (( last − first )/ step + 1) × step does not exceed the largest representable integer. If the compiler
cannot verify this property at compile time, then it will have to generate more
cautious code (to be discussed in Example 6.63).
Loop Direction

EXAMPLE 6.62 Reverse direction for loop

The astute reader may also have noticed that the code shown here implicitly assumes that step is positive. If step is negative, the test for termination must “go the other direction.” If step is not a compile-time constant, then the compiler cannot tell which form of test to use. Some languages, including Pascal and Ada, require the programmer to predict the sign of the step. In Pascal, one must say
for i := 10 downto 1 do ...
In Ada, one must say
for i in reverse 1..10 loop ...
Modula-2 and Modula-3 do not require special syntax for “backward” loops, but insist that step be a compile-time constant so the compiler can tell the difference (Modula (1) has no for loop).

EXAMPLE 6.63 For loop translation with iteration count

In Fortran 77 and Fortran 90, which have neither a special “backward” syntax nor a requirement for compile-time constant steps, the compiler can use an “iteration count” variable to control the loop:
    r1 := first
    r2 := step
    r3 := max(( last − first + step ) / step , 0)    –– iteration count
        –– NB: this calculation may require several instructions.
        –– It is guaranteed to result in a value within the precision of the machine,
        –– but we have to be careful to avoid overflow during its calculation.
    if r3 ≤ 0 goto L2
L1: . . .                                            –– loop body; use r1 for i
    r1 := r1 + r2
    r3 := r3 − 1
    if r3 > 0 goto L1
    i := r1
L2:
The use of the iteration count avoids the need to test the sign of step within the
loop. It also avoids problems with overflow when testing the terminating condition (assuming that we have been suitably careful in calculating the iteration
count). Some processors, including the PowerPC, PA-RISC, and most CISC machines, can decrement the iteration count, test it against zero, and conditionally
branch, all in a single instruction. In simple cases, the code improvement phase
of the compiler may be able to use a technique known as induction variable elimination to eliminate the need to maintain both r1 and r3 .
Access to the Index Outside the Loop
EXAMPLE 6.64 Index value after loop
Several languages, including Fortran IV and Pascal, leave the value of the loop
index undefined after termination of the loop. Others, such as Fortran 77 and
Algol 60, guarantee that the value is the one “most recently assigned.” For “normal” termination of the loop, this is the first value that exceeds the upper bound.
It is not clear what happens if this value exceeds the largest value representable on
the machine (or the smallest value in the case of a negative step size). A similar
question arises in Pascal, in which the type of an index can be a subrange or enumeration. In this case the first value “after” the upper bound can often be invalid.
var c : 'a'..'z';
...
for c := 'a' to 'z' do begin    (* what comes after 'z'? *)
    ...
end;
EXAMPLE 6.65 Preserving the final index value
Examples like this illustrate the rationale for leaving the final value of the index
undefined in Pascal. The alternative—defining the value to be the last one that
was valid—would force the compiler to generate slower code for every loop, with
two branches in each iteration instead of one:
    r1 := 'a'
    r2 := 'z'
    if r1 > r2 goto L3      –– Code improver may remove this test,
                            –– since 'a' and 'z' are constants.
L1: . . .                   –– loop body; use r1 for i
    if r1 = r2 goto L2
    r1 := r1 + 1            –– NB: Pascal step size is always 1 (or −1 if downto )
    goto L1
L2: i := r1
L3:
Note that the compiler must generate this sort of code in any event (or use an
iteration count) if arithmetic overflow may interfere with testing the terminating
condition.
Several languages, including Algol W, Algol 68, Ada, Modula-3, and C++,
avoid the issue of the value held by the index outside the loop by making the
index a local variable of the loop. The header of the loop is considered to contain
a declaration of the index. Its type is inferred from the bounds of the loop, and
its scope is the loop’s body. Because the index is not visible outside the loop, its
value is not an issue. Since it is not visible even to local subroutines, much of
the concept of “threatening” in Pascal becomes unnecessary. Finally, there is no
chance that a value held in the index variable before the loop, and needed after,
will inadvertently be destroyed. (Of course, the programmer must not give the
index the same name as any variable that must be accessed within the loop, but
this is a strictly local issue: it has no ramifications outside the loop.)
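C99 and C++ allow the programmer to obtain this behavior by declaring the index in the loop header, which makes it local to the loop. A trivial sketch:

void count_up(int first, int last, int step) {
    for (int i = first; i <= last; i += step) {
        /* i is visible only within the loop */
    }
    /* i is out of scope here, so its final value is never an issue */
}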
DESIGN & IMPLEMENTATION
For loops
Modern for loops reflect the impact of both semantic and implementation
challenges. As suggested by the subheadings of Section 6.5.1, the semantic
challenges include changes to loop indices or bounds from within the loop,
the scope of the index variable (and its value, if any, outside the loop), and
goto s that enter or leave the loop. Implementation challenges include the imprecision of floating-point values (discussed in the sidebar on page 272), the
direction of the bottom-of-loop test, and overflow at the end of the iteration
range. The “combination loops” of C (to be discussed in Section 6.5.2) move
responsibility for these challenges out of the compiler and into the application
program.
Jumps
Algol 60, Fortran 77, and most of their successors place restrictions on the use of
the goto statement that prevent it from entering a loop from outside. Goto s can
be used to exit a loop prematurely, but this is a comparatively clean operation;
questions of uninitialized indices and bounds do not arise. As we shall see in
Section 6.5.5, many languages provide an exit statement as a semistructured
alternative to a loop-escaping goto .
6.5.2 Combination Loops

EXAMPLE 6.66 Algol 60 for loop
Algol 60, as mentioned above, provides a single loop construct that subsumes
the properties of more modern enumeration- and logically controlled loops. The
general form is given by
for stmt −→ for id := for list do stmt
for list −→ enumerator ( , enumerator )*
enumerator −→ expr
−→ expr step expr until expr
−→ expr while condition
Here the index variable takes on values specified by a sequence of enumerators,
each of which can be a single value, a range of values similar to that of modern
enumeration-controlled loops, or an expression with a terminating condition.
Each expression in the current enumerator is reevaluated at the top of the loop.
This reevaluation is what makes the while form of enumerator useful: its condition typically depends on the current value of the index variable. All of the
following are equivalent.
for i := 1, 3, 5, 7, 9 do ...
for i := 1 step 2 until 10 do ...
for i := 1, i + 2 while i < 10 do ...
EXAMPLE 6.67 Combination ( for ) loop in C
In practice the generality of the Algol 60 for loop turns out to be overkill.
The repeated reevaluation of bounds, in particular, can lead to loops that are
very hard to understand. Some of the power of the Algol 60 loop is retained in
a cleaner form in the for loop of C. A substantially more powerful version (not
described here) is found in Common Lisp.
C’s for loop is, strictly speaking, logically controlled. Any enumeration-controlled loop, however, can be rewritten in a logically controlled form (this is
of course what the compiler does when it translates into assembler), and C’s for
loop is deliberately designed to facilitate writing the logically controlled equivalent of a Pascal or Algol-style for loop. Our Modula-2 example
FOR i := first TO last BY step DO
...
END
would usually be written in C as
for (i = first; i <= last; i += step) {
...
}
C defines this to be roughly equivalent to
i = first;
while (i <= last) {
...
i += step;
}
This definition means that it is the programmer’s responsibility to worry about
the effect of overflow on testing of the terminating condition. It also means that
both the index and any variables contained in the terminating condition can be
modified by the body of the loop, or by subroutines it calls, and these changes
will affect the loop control. This, too, is the programmer’s responsibility.
Any of the three substatements in the for loop header can be null (the condition is considered true if missing). Alternatively, a substatement can consist of a
sequence of comma-separated expressions. The advantage of the C for loop over
its while loop equivalent is compactness and clarity. In particular, all of the code
affecting the flow of control is localized within the header. In the while loop, one
must read both the top and the bottom of the loop to know what is going on.
6.5.3 Iterators
In all of the examples we have seen so far (with the possible exception of the
combination loops of Algol 60, Common Lisp, or C), a for loop iterates over the
elements of an arithmetic sequence. In general, however, we may wish to iterate
over the elements of any well-defined set (what are often called containers or collections in object-oriented code). Clu introduced an elegant iterator mechanism
(also found in Python, Ruby, and C#) to do precisely that. Euclid and several
more recent languages, notably C++ and Java, define a standard interface for iterator objects (sometimes called enumerators) that are equally easy to use but not
as easy to write. Icon, conversely, provides a generalization of iterators, known as
generators, that combines enumeration with backtracking search.6
True Iterators
Clu, Python, Ruby, and C# allow any container abstraction to provide an iterator
that enumerates its items. The iterator resembles a subroutine that is permitted to
6 Unfortunately, terminology is not consistent across languages. Euclid uses the term “generator”
for what are called “iterator objects” here. Python uses it for what are called “true iterators” here.
contain yield statements, each of which produces a loop index value. For loops are then designed to incorporate a call to an iterator.

EXAMPLE 6.68 Simple iterator in Clu

The Modula-2 fragment
FOR i := first TO last BY step DO
...
END
would be written as follows in Clu.
for i in int$from_to_by(first, last, step) do
...
end
EXAMPLE 6.69 Clu iterator for tree enumeration
Here from_to_by is a built-in iterator that yields the integers from first to
first + ( last − first )/ step × step in increments of step .
When called, the iterator calculates the first index value of the loop, which it
returns to the main program by executing a yield statement. The yield behaves like return , except that when control transfers back to the iterator after
completion of the first iteration of the loop, the iterator continues where it last
left off—not at the beginning of its code. When the iterator has no more elements
to yield it simply returns (without a value), thereby terminating the loop.
In effect, an iterator is a separate thread of control, with its own program
counter, whose execution is interleaved with that of the for loop to which it supplies index values.7 The iteration mechanism serves to “decouple” the algorithm
required to enumerate elements from the code that uses those elements.
As an illustrative example, consider the pre-order enumeration of nodes from
a binary tree. A Clu iterator for this task appears in Figure 6.5. Invoked from the
header of a for loop, it takes the root of a tree as argument. It yields the root
node for the first iteration and then calls itself recursively, twice, to enumerate
the nodes of the left and right subtrees.
Iterator Objects
EXAMPLE 6.70 Java iterator for tree enumeration
As realized in most imperative languages, iteration involves both a special form of
for loop and a mechanism to enumerate values for the loop. These concepts can
be separated. Euclid, C++, and Java all provide enumeration-controlled loops
reminiscent of those of Clu. They have no yield statement, however, and no
separate thread-like context to enumerate values; rather, an iterator is an ordinary object (in the object-oriented sense of the word) that provides methods for
initialization, generation of the next index value, and testing for completion. Between calls, the state of the iterator must be kept in the object’s data members.
Figure 6.6 contains the Java equivalent of the code in Figure 6.5. The for loop
at the bottom is syntactic sugar for
7 Because iterators are interleaved with loops in a very regular way, they can be implemented more easily (and cheaply) than fully general threads. We will consider implementation options further in Section 8.6.3.
bin_tree = cluster is ..., pre_order, ...          % export list
    node = record [left, right: bin_tree, val: int]
    rep = variant [some: node, empty: null]
    ...
    pre_order = iter(t: cvt) yields(int)
        tagcase t
            tag empty: return
            tag some(n: node):
                yield(n.val)
                for i: int in pre_order(n.left) do
                    yield(i)
                end
                for i: int in pre_order(n.right) do
                    yield(i)
                end
        end
    end pre_order
    ...
end bin_tree
...
for i: int in bin_tree$pre_order(e) do
    stream$putl(output, int$unparse(i))
end
Figure 6.5
Clu iterator for pre-order enumeration of the nodes of a binary tree. In this
(simplistic) example we have assumed that the datum in a tree node is simply an int . Within
the bin_tree cluster, the rep (representation) declaration indicates that a binary tree is either
a node or empty. The cvt (convert) in the header of pre_order indicates that parameter t is
a bin_tree whose internal structure ( rep ) should be visible to the code of pre_order itself
but not to the caller. In the for loop at the bottom, int$unparse produces the character string
equivalent of a given int , and stream$putl prints a line to the specified stream.
for (Iterator<Integer> it = myTree.iterator(); it.hasNext();) {
Integer i = it.next();
System.out.println(i);
}
DESIGN & IMPLEMENTATION
“True” iterators and iterator objects
While the iterator library mechanisms of C++ and Java are highly useful,
it is worth emphasizing that they are not the functional equivalents of “true”
iterators, as found in Clu, Python, Ruby, and C#. Their key limitation is the
need to maintain all intermediate state in the form of explicit data structures,
rather than in the program counter and local variables of a resumable execution context.
class TreeNode<T> implements Iterable<T> {
    TreeNode<T> left;
    TreeNode<T> right;
    T val;
    ...
    public Iterator<T> iterator() {
        return new TreeIterator(this);
    }
    private class TreeIterator implements Iterator<T> {
        private Stack<TreeNode<T>> s = new Stack<TreeNode<T>>();
        TreeIterator(TreeNode<T> n) {
            s.push(n);
        }
        public boolean hasNext() {
            return !s.empty();
        }
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            TreeNode<T> n = s.pop();
            if (n.right != null) {
                s.push(n.right);
            }
            if (n.left != null) {
                s.push(n.left);
            }
            return n.val;
        }
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }
    ...
}
...
TreeNode<Integer> myTree = ...
...
for (Integer i : myTree) {
    System.out.println(i);
}
Figure 6.6
Java code for pre-order enumeration of the nodes of a binary tree. The nested
TreeIterator class uses an explicit Stack object (borrowed from the standard library) to
keep track of subtrees whose nodes have yet to be enumerated. Java generics, specified as
<T> type arguments for TreeNode , Stack , Iterator , and Iterable , allow next to return an
object of the appropriate type (here Integer ), rather than the undifferentiated Object . The
remove method is part of the Iterator interface and must therefore be provided, if only as a
placeholder.
The expression following the colon in the concise version of the loop header must support the standard Iterable interface, which includes an iterator() method that returns an Iterator object.

EXAMPLE 6.71 Iterator objects in C++
C++ takes a different tack. Rather than propose a special version of the for
loop that would interface with iterator objects, the designers of the C++ standard
library used the language’s unusually flexible overloading and reference mechanisms (Sections 3.6.2 and 8.3.1) to redefine comparison ( != ), increment ( ++ ),
dereference ( * ), and so on, in a way that makes iterating over the elements of
a set look very much like using pointer arithmetic (Section 7.7.1) to traverse a
conventional array:
tree_node<int> *my_tree = ...
...
for (tree_node<int>::iterator n = my_tree->begin();
n != my_tree->end(); ++n) {
cout << *n << "\n";
}
C++ encourages programmers to think of iterators as if they were pointers. Iterator n in this example encapsulates all the state encapsulated by iterator it
in the (no syntactic sugar) Java code of Example 6.70. To obtain the next element of the set, however, the C++ programmer “dereferences” n , using the *
or -> operators. To advance to the following element, the programmer uses the
increment ( ++ ) operator. The end method returns a reference to a special iterator that “points beyond the end” of the set. The increment ( ++ ) operator must
return a reference that tests equal to this special iterator when the set has been
exhausted.
We leave the code of the C++ tree iterator to Exercise 6.15. The details are
somewhat messier than Figure 6.6, due to operator overloading, the value model
of variables (which requires explicit references and pointers), and the lack of
garbage collection. Also, because C++ lacks a common Object base class, its
container classes are always type-specific. Where generics can minimize the need
for type casts in Java and C#, they serve a more fundamental role in C++: without
them one cannot write safe, general purpose container code.
Iterating with First-Class Functions
EXAMPLE 6.72 Passing the "loop body" to an iterator in Scheme
In functional languages, the ability to specify a function “inline” facilitates a programming idiom in which the body of a loop is written as a function, with the
loop index as an argument. This function is then passed as the final argument to
an iterator. In Scheme we might write
(define uptoby
  (lambda (low high step f)
    (if (<= low high)
        (begin
          (f low)
          (uptoby (+ low step) high step f))
        '())))
We could then sum the first 50 odd numbers as follows.
(let ((sum 0))
  (uptoby 1 100 2
          (lambda (i)
            (set! sum (+ sum i))))
  sum)
⇒ 2500

Here the body of the loop, (set! sum (+ sum i)), is an assignment. The ⇒ symbol (not a part of Scheme) is used here to mean "evaluates to."
EXAMPLE 6.73 Iteration with blocks in Smalltalk
Smalltalk, which we consider in Section 9.6.1, provides mechanisms that support a similar idiom:
sum <- 0.
1 to: 100 by: 2 do:
[:i | sum <- sum + i]
Like a lambda expression in Scheme, a square-bracketed block in Smalltalk creates a first-class function, which we then pass as argument to the to: by: do:
iterator. The iterator calls the function repeatedly, passing successive values of
the index variable i as argument. Iterators in Ruby employ a similar but somewhat less general mechanism: where a Smalltalk method can take an arbitrary
number of blocks as argument, a Ruby method can take only one. Continuations
(Section 6.2.2) and lazy evaluation (Section 6.6.2) also allow the Scheme/Lisp
programmer to create iterator objects and more traditional style true iterators;
we consider these options in Exercises 6.30 and 6.31.
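The same idiom carries over directly to any language with first-class functions. A rough Python rendering (ours; uptoby is not a library routine) passes the loop body to the iterator as an ordinary function argument:

def uptoby(low, high, step, f):
    # Call the "loop body" f on each index value in turn.
    i = low
    while i <= high:
        f(i)
        i += step

total = 0
def body(i):
    global total
    total += i

uptoby(1, 100, 2, body)    # sum the first 50 odd numbers
print(total)               # 2500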
Iterating without Iterators
EXAMPLE 6.74 Imitating iterators in C
In a language with neither true iterators nor iterator objects, one can still decouple set enumeration from element use through programming conventions. In C,
for example, one might define a tree_iter type and associated functions that
could be used in a loop as follows.
tree_node *my_tree;
tree_iter ti;
...
for (ti_create(my_tree, &ti); !ti_done(ti); ti_next(&ti)) {
tree_node *n = ti_val(ti);
...
}
ti_delete(&ti);
There are two principal differences between this code and the more structured
alternatives: (1) the syntax of the loop is a good bit less elegant (and arguably
more prone to accidental errors), and (2) the code for the iterator is simply a
type and some associated functions; C provides no abstraction mechanism to
group them together as a module or a class. By providing a standard interface
for iterator abstractions, object-oriented languages like C++, Python, Ruby, Java,
and C# facilitate the design of higher-order mechanisms that manipulate whole
containers: sorting them, merging them, finding their intersection or difference,
and so on. We leave the C code for tree_iter and the various ti_ functions to
Exercise 6.16.
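As a small illustration of the payoff (a sketch of ours, not the text's code), a single Python routine written against the standard iterator protocol works unchanged for lists, ranges, generators, open files, and any other container:

def count_matching(iterable, predicate):
    # Works for any object that supports the standard iterator protocol.
    total = 0
    for x in iterable:
        if predicate(x):
            total += 1
    return total

print(count_matching([3, 1, 4, 1, 5], lambda x: x > 2))      # 3
print(count_matching(range(10), lambda x: x % 2 == 0))       # 5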
6.5.4 Generators in Icon
Icon generalizes the concept of iterators, providing a generator mechanism that
causes any expression in which it is embedded to enumerate multiple values on
demand.
IN MORE DEPTH
Icon’s enumeration-controlled loop, the every loop, can contain not only a generator, but any expression that contains a generator. Generators can also be used
in constructs like if statements, which will execute their nested code if any generated value makes the condition true, automatically searching through all the
possibilities. When generators are nested, Icon explores all possible combinations
of generated values, and will even backtrack where necessary to undo unsuccessful control-flow branches or assignments.
6.5.5 Logically Controlled Loops
EXAMPLE 6.75 While loop in Pascal
In comparison to enumeration-controlled loops, logically controlled loops have
many fewer semantic subtleties. The only real question to be answered is where
within the body of the loop the terminating condition is tested. By far the most
common approach is to test the condition before each iteration. The familiar
while loop syntax to do this was introduced in Algol-W and retained in Pascal:
while condition do statement
As with selection statements, most Pascal successors use an explicit terminating keyword, so that the body of the loop can be a statement list.
EXAMPLE 6.76 Imitating while loops in Fortran 77
Neither (pre-90) Fortran nor Algol 60 really provides a while loop construct;
their loops were designed to be controlled by enumeration. To obtain the effect
of a while loop in Fortran 77, one must resort to goto s:
10    if negated condition goto 20
      ...
      goto 10
20
Post-test Loops
EXAMPLE 6.77 Post-test loop in Pascal and Modula
Occasionally it is handy to be able to test the terminating condition at the bottom
of a loop. Pascal introduced special syntax for this case, which was retained in
Modula but dropped in Ada. A post-test loop allows us, for example, to write
repeat
    readln(line)
until line[1] = '$';

instead of

readln(line);
while line[1] <> '$' do
    readln(line);
The difference between these constructs is particularly important when the body of the loop is longer. Note that the body of a post-test loop is always executed at least once.
EXAMPLE 6.78 Post-test loop in C
C provides a post-test loop whose condition works “the other direction” (i.e.,
“while” instead of “until”):
do {
    line = read_line(stdin);
} while (line[0] != '$');
Midtest Loops
EXAMPLE 6.79 Midtest loop in Modula
Finally, as we saw in Section 6.2, it is sometimes appropriate to test the terminating condition in the middle of a loop. This “midtest” can be accomplished
with an if and a goto in most languages, but a more structured alternative is
preferable. Modula (1) introduced a midtest, or one-and-a-half loop that allows a
terminating condition to be tested as many times as desired within the loop:
loop
statement list
when condition exit
statement list
when condition exit
...
end
Using this notation, the Pascal construct
while true do begin
readln(line);
if all_blanks(line) then goto 100;
consume_line(line)
end;
100:
can be written as follows in Modula (1).
loop
line := ReadLine;
when AllBlanks(line) exit;
ConsumeLine(line)
end;
The when clause here is syntactically part of the loop construct. The syntax ensures that an exit can occur only within a loop, but it has the unfortunate side effect of preventing an exit from within a nested construct.
EXAMPLE 6.80 Exit as a separate statement
Modula-2 abandoned the when clause in favor of a simpler EXIT statement,
which is typically placed inside an IF statement:
LOOP
line := ReadLine;
IF AllBlanks(line) THEN EXIT END;
ConsumeLine(line)
END;
Because EXIT is no longer part of the LOOP construct syntax, the semantic analysis phase of compilation must ensure that EXITs appear only inside LOOPs. There may still be an arbitrary number of them inside a given LOOP. Modula-3 allows an EXIT to leave a WHILE, REPEAT, or FOR loop, as well as a plain LOOP.
EXAMPLE 6.81 Break statement in C
The C break statement, which we have already seen in the context of switch statements, can be used in a similar manner:
for (;;) {
line = read_line(stdin);
if (all_blanks(line)) break;
consume_line(line);
}
Here the missing condition in the for loop header is assumed to always be true; for some reason, C programmers have traditionally considered this syntax to be stylistically preferable to the equivalent while (1).
EXAMPLE 6.82 Exiting a nested loop
In Ada an exit statement takes an optional loop-name argument that allows control to escape a nested loop:
outer: loop
    get_line(line, length);
    for i in 1..length loop
        exit outer when line(i) = '$';
        consume_char(line(i));
    end loop;
end loop outer;
Java extends the C/C++ break statement in a similar fashion: Java loops can
be labeled as in Ada, and the break statement takes an optional loop name as
parameter.
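For comparison, a Python programmer (who has neither a dedicated midtest construct nor labeled loops) typically uses while True with break for the loop-and-a-half, and wraps nested loops in a function so that return can serve as a multilevel exit. The sketch below is ours; the helper names are illustrative stubs.

def consume_line(line):          # stub standing in for real processing
    print("got:", line)

lines = ["first", "second", "   ", "never reached"]

k = 0
while True:                      # midtest ("loop-and-a-half")
    line = lines[k]; k += 1
    if line.strip() == "":       # stand-in for AllBlanks
        break
    consume_line(line)

def first_dollar_position(rows):
    # return acts as a multilevel exit from the nested loops
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == '$':
                return (r, c)
    return None

print(first_dollar_position(["abc", "x$y"]))    # (1, 1)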
C H E C K YO U R U N D E R S TA N D I N G
27. Describe three subtleties in the implementation of enumeration-controlled
loops.
28. Why do most languages not allow the bounds or increment of an enumeration-controlled loop to be floating-point numbers?
29. Why do many languages require the step size of an enumeration-controlled
loop to be a compile-time constant?
30. Describe the “iteration count” loop implementation. What problem(s) does
it solve?
31. What are the advantages of making an index variable local to the loop it controls?
32. What is a container (a collection)?
33. Explain the difference between true iterators and iterator objects.
34. Cite two advantages of iterator objects over the use of programming conventions in a language like C.
35. Describe the approach to iteration typically employed in languages with firstclass functions.
36. Give an example in which a midtest loop results in more elegant code than
does a pretest or post-test loop.
37. Does C have enumeration-controlled loops? Explain.
6.6 Recursion
Unlike the control-flow mechanisms discussed so far, recursion requires no special syntax. In any language that provides subroutines (particularly functions), all
that is required is to permit functions to call themselves, or to call other functions
that then call them back in turn. Most programmers learn in a data structures
class that recursion and (logically controlled) iteration provide equally powerful
means of computing functions: any iterative algorithm can be rewritten, automatically, as a recursive algorithm, and vice versa. We will compare iteration and
recursion in more detail in the first subsection below. In the subsection after that
we will consider the possibility of passing unevaluated expressions into a function. While usually inadvisable, due to implementation cost, this technique will
sometimes allow us to write elegant code for functions that are only defined on a
subset of the possible inputs, or that explore logically infinite data structures.
6.6.1 Iteration and Recursion
As we noted in Section 3.2, Fortran 77 and certain other languages do not permit
recursion. A few functional languages do not permit iteration. Most modern languages, however, provide both mechanisms. Iteration is in some sense the more
“natural” of the two in imperative languages, because it is based on the repeated
modification of variables. Recursion is the more natural of the two in functional
languages, because it does not change variables. In the final analysis, which to use in which circumstance is mainly a matter of taste.
EXAMPLE 6.83 A "naturally iterative" problem
To compute a sum,

\sum_{1 \le i \le 10} f(i)

it seems natural to use iteration. In C one would say
typedef int (*int_func) (int);
int summation(int_func f, int low, int high) {
/* assume low <= high */
int total = 0;
int i;
for (i = low; i <= high; i++) {
total += f(i);
}
return total;
}
EXAMPLE 6.84 A "naturally recursive" problem
To compute a value defined by a recurrence,

\gcd(a, b) \equiv \begin{cases} a & \text{if } a = b \\ \gcd(a - b, b) & \text{if } a > b \\ \gcd(a, b - a) & \text{if } b > a \end{cases} \qquad \text{(positive integers } a, b\text{)}

recursion may seem more natural:
int gcd(int a, int b) {
/* assume a, b > 0 */
if (a == b) return a;
else if (a > b) return gcd(a-b, b);
else return gcd(a, b-a);
}
EXAMPLE 6.85 Implementing problems "the other way"
In both these cases, the choice could go the other way:
typedef int (*int_func) (int);
int summation(int_func f, int low, int high) {
/* assume low <= high */
if (low == high) return f(low);
else return f(low) + summation(f, low+1, high);
}
int gcd(int a, int b) {
/* assume a, b > 0 */
while (a != b) {
if (a > b) a = a-b;
else b = b-a;
}
return a;
}
Tail Recursion
EXAMPLE 6.86 Implementation of tail recursion
It is often argued that iteration is more efficient than recursion. It is more accurate to say that naive implementation of iteration is usually more efficient than
naive implementation of recursion. In the preceding examples, the iterative implementations of summation and greatest common divisors will be more efficient than
the recursive implementations if the latter make real subroutine calls that allocate space on a run-time stack for local variables and bookkeeping information.
An “optimizing” compiler, however, particularly one designed for a functional
language, will often be able to generate excellent code for recursive functions.
It is particularly likely to do so for tail-recursive functions such as gcd above.
A tail-recursive function is one in which additional computation never follows a
recursive call: the return value is simply whatever the recursive call returns. For
such functions, dynamically allocated stack space is unnecessary: the compiler
can reuse the space belonging to the current iteration when it makes the recursive
call. In effect, a good compiler will recast our recursive gcd function as
int gcd(int a, int b) {
/* assume a, b > 0 */
start:
if (a == b) return a;
else if (a > b) {
a = a-b; goto start;
} else {
b = b-a; goto start;
}
}
Even for functions that are not tail-recursive, automatic, often simple transformations can produce tail-recursive code. The general case of the transformation employs conversion to what is known as continuation-passing style [FWH01,
Chaps. 7–8]. In effect, a recursive function can always avoid doing any work after
returning from a recursive call by passing that work into the recursive call, in the
form of a continuation.
EXAMPLE 6.87 By-hand creation of tail-recursive code
Some specific transformations (not based on continuation-passing) are often employed by skilled users of functional languages. Consider, for example, the recursive summation function of Example 6.85, written here in Scheme:
(define summation (lambda (f low high)
  (if (= low high)
      (f low)                                        ; then part
      (+ (f low) (summation f (+ low 1) high)))))    ; else part
Recall that Scheme, like all Lisp dialects, uses Cambridge Polish notation for expressions. The lambda keyword is used to introduce a function. As recursive calls
return, our code calculates the sum from “right to left”: from high down to low .
If the programmer (or compiler) recognizes that addition is associative, we can
rewrite the code in a tail-recursive form:
(define summation (lambda (f low high subtotal)
  (if (= low high)
      (+ subtotal (f low))
      (summation f (+ low 1) high (+ subtotal (f low))))))
Here the subtotal parameter accumulates the sum from left to right, passing it
into the recursive calls. Because it is tail-recursive, this function can be translated
into machine code that does not allocate stack space for recursive calls. Of course,
the programmer won’t want to pass an explicit subtotal parameter to the initial
call, so we hide it (the parameter) in an auxiliary, “helper” function:
(define summation (lambda (f low high)
  (letrec ((sum-helper (lambda (low subtotal)
             (let ((new_subtotal (+ subtotal (f low))))
               (if (= low high)
                   new_subtotal
                   (sum-helper (+ low 1) new_subtotal))))))
    (sum-helper low 0))))
The let construct in Scheme serves to introduce a nested scope in which local
names (e.g., new_subtotal ) can be defined. The letrec construct permits the
definition of recursive functions (e.g., sum-helper ).
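The same accumulator idiom can be transcribed into Python (our sketch, not the text's code). Note, though, that CPython does not eliminate tail calls, so the recursive form can still exhaust the run-time stack for very large ranges; the point here is only the shape of the transformation.

def summation(f, low, high):
    def helper(i, subtotal):
        new_subtotal = subtotal + f(i)
        if i == high:
            return new_subtotal
        return helper(i + 1, new_subtotal)    # tail call (not optimized by CPython)
    return helper(low, 0)

print(summation(lambda i: i * i, 1, 10))      # 385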
Thinking Recursively
EXAMPLE 6.88 Naive recursive Fibonacci function
Detractors of functional programming sometimes argue, incorrectly, that recursion leads to algorithmically inferior programs. Fibonacci numbers, for example,
are defined by the mathematical recurrence
F_n \equiv \begin{cases} 1 & \text{if } n = 0 \text{ or } n = 1 \\ F_{n-1} + F_{n-2} & \text{otherwise} \end{cases} \qquad \text{(nonnegative integer } n\text{)}
The naive way to implement this recurrence in Scheme is
(define fib (lambda (n)
  (cond ((= n 0) 1)
        ((= n 1) 1)
        (#t (+ (fib (- n 1)) (fib (- n 2)))))))    ; #t means 'true' in Scheme
EXAMPLE 6.89 Efficient iterative Fibonacci function
Unfortunately, this algorithm takes exponential time, when linear time is possible. In C, one might write
int fib(int n) {
int f1 = 1; int f2 = 1;
int i;
for (i = 2; i <= n; i++) {
int temp = f1 + f2;
f1 = f2; f2 = temp;
}
return f2;
}
EXAMPLE 6.90 Efficient tail-recursive Fibonacci function
One can write this iterative algorithm in Scheme: Scheme includes (nonfunctional) iterative features. It is probably better, however, to draw inspiration from
the tail-recursive summation function of Example 6.87 and write the following
O(n) recursive function.
(define fib (lambda (n)
  (letrec ((fib-helper (lambda (f1 f2 i)
             (if (= i n)
                 f2
                 (fib-helper f2 (+ f1 f2) (+ i 1))))))
    (fib-helper 0 1 0))))
EXAMPLE 6.91 Tail-recursive Fibonacci function in Sisal
For a programmer accustomed to writing in a functional style, this code is perfectly natural. One might argue that it isn’t “really” recursive; it simply casts an
iterative algorithm in a tail-recursive form, and this argument has some merit.
Despite the algorithmic similarity, however, there is an important difference between the iterative algorithm in C and the tail-recursive algorithm in Scheme:
the latter has no side effects. Each recursive call of the fib-helper function creates a new scope, containing new variables. The language implementation may
be able to reuse the space occupied by previous instances of the same scope, but
it guarantees that this optimization will never introduce bugs.
We have already noted that many primarily functional languages, including
Common Lisp, Scheme, and ML, provide certain nonfunctional features, including iterative constructs that are executed for their side effects. It is also possible to
define an iterative construct as syntactic sugar for tail recursion, by arranging for
successive iterations of a loop to introduce new scopes. The only tricky part is to
make values from a previous iteration available in the next, when all local names
have been reused for different variables. The dataflow language Val [McG82] and
its successor, Sisal, provide this capability through a special keyword, old . The
newer pH language, a parallel dialect of Haskell, provides the inverse keyword,
next. Figure 6.7 contains side-effect-free iterative code for our Fibonacci function in Sisal. We will mention Sisal and pH again in Sections 10.7 and 12.3.6.
6.6.2 Applicative- and Normal-Order Evaluation
Throughout the discussion so far we have assumed implicitly that arguments are
evaluated before passing them to a subroutine. This need not be the case. It is
possible to pass a representation of the unevaluated arguments to the subroutine
instead, and to evaluate them only when (if) the value is actually needed. The former option (evaluating before the call) is known as applicative-order evaluation;
the latter (evaluating only when the value is actually needed) is known as normal-order evaluation. Normal-order evaluation is what naturally occurs in macros. It
also occurs in short-circuit Boolean evaluation, call-by-name parameters (to be
discussed in Section 8.3.1), and certain functional languages (to be discussed in
Section 10.4).
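One way to visualize the difference in a language with applicative-order semantics is to delay an argument by hand, wrapping it in a parameterless lambda (a thunk). The following Python fragment is our own illustration, not an example from the text:

def noisy(x):
    print("evaluating", x)
    return x

def pick_applicative(cond, a, b):
    return a if cond else b                    # both arguments already evaluated

def pick_normal(cond, a_thunk, b_thunk):
    return a_thunk() if cond else b_thunk()    # only the chosen thunk is evaluated

pick_applicative(True, noisy(1), noisy(2))     # prints "evaluating 1" and "evaluating 2"
pick_normal(True, lambda: noisy(1), lambda: noisy(2))    # prints only "evaluating 1"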
function fib(n : integer returns integer)
for initial
f1 := 0;
f2 := 1;
i := 0;
while i < n repeat
i := old i + 1;
f1 := old f2;
f2 := old f1 + old f2;
returns value of f2
end for
end function
Figure 6.7 Fibonacci function in Sisal. Each iteration of the while loop defines a new scope,
with new variables named i , f1 , and f2 . The previous instances of these variables are available
in each iteration as old i , old f1 , and old f2 . The entire for construct is an expression; it
can appear in any context in which a value is expected.
EXAMPLE 6.92 Divisibility macro in C
Historically, C has relied heavily on macros for small, nonrecursive “functions” that need to execute quickly. To determine whether one integer divides
another evenly, the C programmer might write
#define DIVIDES(a,n) (!((n) % (a)))
/* true iff n has zero remainder modulo a */
In every location in which the programmer uses DIVIDES , the compiler (actually
a preprocessor that runs before the compiler) will substitute the right-hand side
of the macro definition, textually, with parameters substituted as appropriate:
DIVIDES(y + z, x) becomes (!((x) % (y+z))) .
D E S I G N & I M P L E M E N TAT I O N
Inline as a hint
Formally, the inline keyword is a hint in C++ and C99, rather than a directive:
it suggests but does not require that the compiler actually expand the subroutine inline. The compiler is free to use a conventional implementation when
inline has been specified, or to use an in-line implementation when inline has not been specified, if it has reason to believe that this will result in better code.
In effect, the inclusion of the inline keyword in the language is an acknowledgment on the part of the language designers that compiler technology is not
(yet) at the point where it can always make a better decision with respect to inlining than can an expert programmer. The choice to make inline a hint is an
acknowledgment that compilers sometimes are able to make a better decision,
and that their ability to do so is likely to improve over time.
EXAMPLE 6.93 "Gotchas" in C macros
Macros suffer from several limitations. In the code above, for example, the
parentheses around a and n in the right-hand side of the definition are essential.
Without them, DIVIDES(y + z, x) would be replaced by (!(x % y + z)) ,
which is the same as (!((x % y) + z)) , according to the rules of precedence.
More importantly, in a definition like
#define MAX(a,b) ((a) > (b) ? (a) : (b))
the expression MAX(x++, y++) may behave unexpectedly, since the increment
side effects will happen more than once. In general, normal-order evaluation
is safe only if arguments cause no side effects when evaluated. Finally, because
macros are purely textual abbreviations, they cannot be incorporated naturally
into high-level naming and scope rules. Given the following definition, for example,
#define SWAP(a,b) {int t = (a); (a) = (b); (b) = t;}
problems will arise if the programmer writes SWAP(x, t) . In C, a macro that
“returns” a value must be an expression. Since C is not a completely expression-oriented language like Algol 68, many constructs (e.g., loops) cannot occur
within an expression (see Exercise 6.28).
All of these problems can be avoided in C by using real functions instead of
macros. In most C implementations, however, the macros are much more efficient. They avoid the overhead of the subroutine call mechanism (including register saves and restores), and the code they generate can be integrated into any
code improvements that the compiler is able to effect in the code surrounding
the call. In C++ and C99, the programmer can obtain the best of both worlds by
prefacing a function definition with a special inline keyword. This keyword instructs the compiler to expand the definition of the function at the point of call,
if possible. The resulting code is then generally as efficient as a macro, but has the
semantics of a function call.
Algol 60 uses normal-order evaluation by default (applicative order is also
available). This choice was presumably made to mimic the behavior of macros.
Most programmers in 1960 wrote mainly in assembler, and were accustomed
to macro facilities. Because the parameter-passing mechanisms of Algol 60 are
part of the language, rather than textual abbreviations, problems like misinterpreted precedence or naming conflicts do not arise. Side effects, however, are still
very much an issue. We will discuss Algol 60 parameters in more detail in Section 8.3.1.
Lazy Evaluation
From the points of view of clarity and efficiency, applicative-order evaluation
is generally preferable to normal-order evaluation. It is therefore natural for it
to be employed in most languages. In some circumstances, however, normal-order evaluation can actually lead to faster code, or to code that works when
applicative-order evaluation would lead to a run-time error. In both cases, what matters is that normal-order evaluation will sometimes not evaluate an argument at all, if its value is never actually needed.
EXAMPLE 6.94 Lazy evaluation of an infinite data structure
Scheme provides for optional normal-order evaluation in the form of built-in functions called delay
and force .8 These functions provide an implementation of lazy evaluation. In
the absence of side effects, lazy evaluation has the same semantics as normal-order evaluation, but the implementation keeps track of which expressions
have already been evaluated, so it can reuse their values if they are needed
more than once in a given referencing environment. A delay ed expression
is sometimes called a promise. The mechanism used to keep track of which
promises have already been evaluated is sometimes called memoization.9 Because
applicative-order evaluation is the default in Scheme, the programmer must use
special syntax not only to pass an unevaluated argument, but also to use it.
In Algol 60, subroutine headers indicate which arguments are to be passed which
way; the point of call and the uses of parameters within subroutines look the
same in either case.
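A minimal sketch of the mechanism in Python (ours; Scheme's delay and force are built in, but here we spell them out) wraps a thunk in a promise object that caches, or memoizes, its value the first time it is forced:

class Promise:
    def __init__(self, thunk):
        self._thunk = thunk
        self._forced = False
        self._value = None
    def force(self):
        if not self._forced:          # memoization: evaluate at most once
            self._value = self._thunk()
            self._forced = True
        return self._value

def delay(thunk):
    return Promise(thunk)

p = delay(lambda: sum(range(1000000)))
print(p.force())    # computes the sum on first use
print(p.force())    # returns the cached value without recomputation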
A common use of lazy evaluation is to create so-called infinite or lazy data
structures that are “fleshed out” on demand. The following example, adapted
from the Scheme manual [ADH+ 98, p. 28], creates a “list” of all the natural numbers.
(define naturals
  (letrec ((next (lambda (n) (cons n (delay (next (+ n 1)))))))
    (next 1)))
(define head car)
(define tail (lambda (stream) (force (cdr stream))))
D E S I G N & I M P L E M E N TAT I O N
Normal-order evaluation
Normal-order evaluation is one of many examples we have seen where arguably desirable semantics have been dismissed by language designers because
of fear of implementation cost. Other examples in this chapter include side-effect freedom (which allows normal order to be implemented via lazy evaluation), iterators (Section 6.5.3), and nondeterminacy (Section 6.7). As
noted in the sidebar on page 248, however, there has been a tendency over time
to trade a bit of speed for cleaner semantics and increased reliability. Within
the functional programming community, Miranda and its successor Haskell
are entirely side-effect free, and use normal-order (lazy) evaluation for all parameters.
8 More precisely, delay is a special form, rather than a function. Its argument is passed to it unevaluated.
9 Within the functional programming community, the term lazy evaluation is often used for any
implementation that declines to evaluate unneeded function parameters; this includes both naive
implementations of normal-order evaluation and the memoizing mechanism described here.
Here cons can be thought of, roughly, as a concatenation operator. Car returns
the head of a list; cdr returns everything but the head. Given these definitions,
we can access as many natural numbers as we want:
(head naturals)                    ⇒ 1
(head (tail naturals))             ⇒ 2
(head (tail (tail naturals)))      ⇒ 3
The list will occupy only as much space as we have actually explored. More elaborate lazy data structures (e.g., trees) can be valuable in combinatorial search
problems, in which a clever algorithm may explore only the “interesting” parts of
a potentially enormous search space.
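In Python the same effect is usually obtained with a generator, which plays the role of the forced cdr: only the portion of the "list" actually demanded is ever materialized. A brief sketch of ours:

def naturals_from(n):
    while True:        # conceptually infinite; elements are produced on demand
        yield n
        n += 1

nats = naturals_from(1)
print(next(nats), next(nats), next(nats))    # 1 2 3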
6.7 Nondeterminacy
Our final category of control flow is nondeterminacy. A nondeterministic construct is one in which the choice between alternatives (i.e., between control paths)
is deliberately unspecified. We have already seen examples of nondeterminacy
in the evaluation of expressions (Section 6.1.4): in most languages, operator or
subroutine arguments may be evaluated in any order. Some languages, notably
Algol 68 and various concurrent languages, provide more extensive nondeterministic mechanisms, which cover statements as well.
IN MORE DEPTH
Absent a nondeterministic construct, the author of a code fragment in which order does not matter must choose some arbitrary (artificial) order. Such a choice
can make it more difficult to construct a formal correctness proof. Some language designers have also argued that it is inelegant. The most compelling uses
for nondeterminacy arise in concurrent programs, where imposing an arbitrary
choice on the order in which a thread interacts with its peers may cause the system as a whole to deadlock. For such programs one may need to ensure that the
choice among nondeterministic alternatives is fair in some formal sense.
C H E C K YO U R U N D E R S TA N D I N G
38. What is a tail-recursive function? Why is tail recursion important?
39. Explain the difference between applicative and normal-order evaluation of expressions. Under what circumstances is each desirable?
40. Describe three common pitfalls associated with the use of macros.
41. What is lazy evaluation? What are promises? What is memoization?
42. Give two reasons why lazy evaluation may be desirable.
43. Name a language in which parameters are always evaluated lazily.
44. Give two reasons why a programmer might sometimes want control flow to
be nondeterministic.
6.8 Summary and Concluding Remarks
In this chapter we introduced the principal forms of control flow found in programming languages: sequencing, selection, iteration, procedural abstraction, recursion, concurrency, and nondeterminacy. Sequencing specifies that certain operations are to occur in order, one after the other. Selection expresses a choice
among two or more control-flow alternatives. Iteration and recursion are the two
ways to execute operations repeatedly. Recursion defines an operation in terms of
simpler instances of itself; it depends on procedural abstraction. Iteration repeats
an operation for its side effect(s). Sequencing and iteration are fundamental to
imperative (especially von Neumann) programming. Recursion is fundamental
to functional programming. Nondeterminacy allows the programmer to leave
certain aspects of control flow deliberately unspecified. We touched on concurrency only briefly; it will be the subject of Chapter 12. Procedural abstractions
(subroutines) are the subject of Chapter 8.
Our survey of control-flow mechanisms was preceded by a discussion of expression evaluation. We considered the distinction between l-values and r-values,
and between the value model of variables, in which a variable is a named container for data, and the reference model of variables, in which a variable is a
reference to a data object. We considered issues of precedence, associativity, and
ordering within expressions. We examined short-circuit Boolean evaluation and
its implementation via jump code, both as a semantic issue that affects the correctness of expressions whose subparts are not always well defined, and as an
implementation issue that affects the time required to evaluate complex Boolean
expressions.
In our survey we encountered many examples of control-flow constructs
whose syntax and semantics have evolved considerably over time. Particularly
noteworthy has been the phasing out of goto -based control flow and the emergence of a consensus on structured alternatives. While convenience and readability are difficult to quantify, most programmers would agree that the control-flow constructs of a language like Ada are a dramatic improvement over those
of, say, Fortran IV. Examples of features in Ada that are specifically designed to
rectify control-flow problems in earlier languages include explicit terminators
( end if , end loop , etc.) for structured constructs; elsif clauses; label ranges
and others clauses in case statements; implicit declaration of for loop indices
as read-only local variables; explicit return statements; multi-level loop exit
statements; and exceptions.
The evolution of constructs has been driven by many goals, including ease
of programming, semantic elegance, ease of implementation, and run-time efficiency. In some cases these goals have proven complementary. We have seen
for example that short-circuit evaluation leads both to faster code and (in many
cases) to cleaner semantics. In a similar vein, the introduction of a new local
scope for the index variable of an enumeration-controlled loop avoids both the
semantic problem of the value of the index after the loop and (to some extent)
the implementation problem of potential overflow.
In other cases improvements in language semantics have been considered
worth a small cost in run-time efficiency. We saw this in the addition of a pretest
to the Fortran do loop and in the introduction of midtest loops (which almost always require at least two branch instructions). Iterators provide another example:
like many forms of abstraction, they add a modest amount of run-time cost in
many cases (e.g., in comparison to explicitly embedding the implementation of
the enumerated set in the control flow of the loop), but with a large pay-back in
modularity, clarity, and opportunities for code reuse. Sisal’s developers would argue that even if Fortran does enjoy a performance edge in some cases, functional
programming provides a more important benefit: facilitating the construction of
correct, maintainable code. The developers of Java would argue that for many
applications the portability and safety provided by extensive semantic checking,
standard-format numeric types, and so on are far more important than speed.
The ability of Sisal to compete with Fortran (it does very well with numeric
code) is due to advances in compiler technology, and to advances in automatic
code improvement in particular. We have seen several other examples of cases in
which advances in compiler technology or in the simple willingness of designers to build more complex compilers have made it possible to incorporate features once considered too expensive. Label ranges in Ada case statements require
that the compiler be prepared to generate code employing binary search. In-line
functions in C++ eliminate the need to choose between the inefficiency of tiny
functions and the messy semantics of macros. Exceptions (as we shall see in Section 8.5.4) can be implemented in such a way that they incur no cost in the common case (when they do not occur), but the implementation is quite tricky. Iterators, boxing, generics (Section 8.4), and first-class functions are likewise rather
tricky, but are increasingly found in mainstream imperative languages.
Some implementation techniques (e.g., rearranging expressions to uncover
common subexpressions, or avoiding the evaluation of guards in a nondeterministic construct once an acceptable choice has been found) are sufficiently
important to justify a modest burden on the programmer (e.g., adding parentheses where necessary to avoid overflow or ensure numeric stability, or ensuring
that expressions in guards are side-effect-free). Other semantically useful mechanisms (e.g., lazy evaluation, continuations, or truly random nondeterminacy) are
usually considered complex or expensive enough to be worthwhile only in special
circumstances (if at all).
In comparatively primitive languages, we can often obtain some of the benefits of missing features through programming conventions. In early dialects of
Fortran, for example, we can limit the use of goto s to patterns that mimic the
control flow of more modern languages. In languages without short-circuit evaluation, we can write nested selection statements. In languages without iterators,
we can write sets of subroutines that provide equivalent functionality.
6.9 Exercises
6.1 We noted in Section 6.1.1 that most binary arithmetic operators are left-associative in most programming languages. In Section 6.1.4, however, we
also noted that most compilers are free to evaluate the operands of a binary
operator in either order. Are these statements contradictory? Why or why
not?
6.2 As noted in Figure 6.1, Fortran and Pascal give unary and binary minus the
same level of precedence. Is this likely to lead to nonintuitive evaluations of
certain expressions? Why or why not?
6.3 Translate the following expression into postfix and prefix notation:
[−b + sqrt(b × b − 4 × a × c)]/(2 × a)
Do you need a special symbol for unary negation?
6.4 In Lisp, most of the arithmetic operators are defined to take two or more arguments, rather than strictly two. Thus (* 2 3 4 5) evaluates to 120, and
(- 16 9 4) evaluates to 3. Show that parentheses are necessary to disambiguate arithmetic expressions in Lisp (in other words, give an example of
an expression whose meaning is unclear when parentheses are removed).
In Section 6.1.1 we claimed that issues of precedence and associativity do
not arise with prefix or postfix notation. Reword this claim to make explicit
the hidden assumption.
6.5 Example 6.31 claims that "For certain values of x, (0.1 + x) * 10.0 and 1.0 + (x * 10.0) can differ by as much as 25%, even when 0.1 and x are
of the same magnitude.” Verify this claim. (Warning: If you’re using an x86
processor, be aware that floating-point calculations [even on single precision variables] are performed internally with 80 bits of precision. Roundoff
errors will appear only when intermediate results are stored out to memory
[with limited precision] and read back in again.)
6.6 Languages that employ a reference model of variables also tend to employ
automatic garbage collection. Is this more than a coincidence? Explain.
6.7 In Section 6.1.2 we noted that C uses = for assignment and == for equality
testing. The language designers state “Since assignment is about twice as
frequent as equality testing in typical C programs, it’s appropriate that the
operator be half as long” [KR88, p. 17]. What do you think of this rationale?
6.8 Consider a language implementation in which we wish to catch every use of
an uninitialized variable. In Section 6.1.3 we noted that for types in which
every possible bit pattern represents a valid value, extra space must be used
to hold an initialized/uninitialized flag. Dynamic checks in such a system
can be expensive, largely because of the address calculations needed to access
the flags. We can reduce the cost in the common case by having the compiler
generate code to automatically initialize every variable with a distinguished
sentinel value. If at some point we find that a variable’s value is different
from the sentinel, then that variable must have been initialized. If its value is
the sentinel, we must double-check the flag. Describe a plausible allocation
strategy for initialization flags, and show the assembly language sequences
that would be required for dynamic checks, with and without the use of
sentinels.
6.9 Write an attribute grammar, based on the following context-free grammar,
that accumulates jump code for Boolean expressions (with short-circuiting)
into a synthesized attribute of condition, and then uses this attribute to generate code for if statements.
stmt −→ if condition then stmt else stmt
     −→ other stmt
condition −→ c term
          −→ condition or c term
c term −→ relation
       −→ c term and relation
relation −→ c fact
         −→ c fact comparator c fact
c fact −→ identifier
       −→ not c fact
       −→ ( condition )
comparator −→ <
           −→ <=
           −→ =
           −→ <>
           −→ >
           −→ >=
(Hint: Your task will be easier if you do not attempt to make the grammar L-attributed. For further details see Fischer and LeBlanc’s compiler
book [FL88, Sec. 14.1.4].)
6.10 Neither Algol 60 nor Algol 68 employs short-circuit evaluation for Boolean
expressions. In both languages, however, an if . . . then . . . else construct
can be used as an expression. Show how to use if . . . then . . . else to achieve
the effect of short-circuit evaluation.
6.11 Consider the following expression in C: a/b > 0 && b/a > 0 . What will
be the result of evaluating this expression when a is zero? What will be the
result when b is zero? Would it make sense to try to design a language in
which this expression is guaranteed to evaluate to false when either a or b
(but not both) is zero? Explain your answer.
6.12 As noted in Section 6.4.2, languages vary in how they handle the situation
in which the tested expression in a case statement does not appear among
the labels on the arms. C and Fortran 90 say the statement has no effect.
Pascal and Modula say it results in a dynamic semantic error. Ada says that
the labels must cover all possible values for the type of the expression, so
the question of a missing value can never arise at run time. What are the
tradeoffs among these alternatives? Which do you prefer? Why?
6.13 Write the equivalent of Figure 6.5 in C# 2.0, Python, or Ruby. Write a second
version that performs an in-order enumeration rather than pre-order.
6.14 Revise the algorithm of Figure 6.6 so that it performs an in-order enumeration, rather than pre-order.
6.15 Write a C++ pre-order iterator to supply tree nodes to the loop in Example 6.71. You will need to know (or learn) how to use pointers, references,
inner classes, and operator overloading in C++. For the sake of (relative)
simplicity, you may assume that the datum in a tree node is always an int ;
this will save you the need to use generics. You may want to use the stack
abstraction from the C++ standard library.
6.16 Write code for the tree_iter type ( struct ) and the ti_create , ti_done ,
ti_next , ti_val , and ti_delete functions employed in Example 6.74.
6.17 Write, in C#, Python, or Ruby, an iterator that yields
(a) all permutations of the integers 1 . . n
(b) all combinations of k integers from the range 1 . . n (0 ≤ k ≤ n).
You may represent your permutations and combinations using either a list
or an array.
6.18 Use iterators to construct a program that outputs (in some order) all structurally distinct binary trees of n nodes. Two trees are considered structurally
distinct if they have different numbers of nodes or if their left or right subtrees are structurally distinct. There are, for example, 5 structurally distinct
trees of 3 nodes:
These are most easily output in “dotted parenthesized form”:
(((.).).)
((.(.)).)
((.).(.))
(.((.).))
(.(.(.)))
(Hint: Think recursively! If you need help, see Section 2.2 of the text by
Finkel [Fin96].)
6.19 Build true iterators in Java using threads. (This requires knowledge of material in Chapter 12.) Make your solution as clean and as general as possible.
In particular, you should provide the standard Iterator or IEnumerable
interface for use with extended for or foreach loops, but the programmer should not have to write these. Instead, he or she should write a class
with an Iterate method, which should in turn be able to call a Yield
method, which you should also provide. Evaluate the cost of your solution.
How much more expensive is it than standard Java iterator objects?
6.20 In an expression-oriented language such as Algol 68 or Lisp, a while loop
(a do loop in Lisp) has a value as an expression. How do you think this
value should be determined? (How is it determined in Algol 68 and Lisp?) Is
the value a useless artifact of expression orientation, or are there reasonable
programs in which it might actually be used? What do you think should
happen if the condition on the loop is such that the body is never executed?
6.21 Recall the “blank line” loop of Example 6.80, here written in Modula-2.
LOOP
line := ReadLine;
IF AllBlanks(line) THEN EXIT END;
ConsumeLine(line)
END;
Show how you might accomplish the same task using a while or repeat
loop, if midtest loops were not available. (Hint: One alternative duplicates
part of the code; another introduces a Boolean flag variable.) How do these
alternatives compare to the midtest version?
6.22 Rubin [Rub87] used the following example (rewritten here in C) to argue in
favor of a goto statement.
int first_zero_row = -1;    /* none */
int i, j;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (A[i][j]) goto next;
    }
    first_zero_row = i;
    break;
  next: ;
}
The intent of the code is to find the first all-zero row, if any, of an n × n
matrix. Do you find the example convincing? Is there a good structured alternative in C? In any language?
6.23 Bentley [Ben86, Chap. 4] provides the following informal description of binary search.
We are to determine whether the sorted array X[1..N] contains the element T . . . . Binary search solves the problem by keeping track of a range
within the array in which T must be if it is anywhere in the array. Initially,
the range is the entire array. The range is shrunk by comparing its middle
element to T and discarding half the range. The process continues until T
is discovered in the array or until the range in which it must lie is known to
be empty.
Write code for binary search in your favorite imperative programming language. What loop construct(s) did you find to be most useful? (NB: When
he asked more than a hundred professional programmers to solve this problem, Bentley found that only about 10% got it right the first time, without
testing.)
6.24 A loop invariant is a condition that is guaranteed to be true at a given point
within the body of a loop on every iteration. Loop invariants play a major
role in axiomatic semantics, a formal reasoning system used to prove properties of programs. In a less formal way, programmers who identify (and
write down!) the invariants for their loops are more likely to write correct
code. Show the loop invariant(s) for your solution to the preceding exercise.
(Hint: You will find the distinction between < and ≤ [or between > and ≥]
to be crucial.)
6.25 If you have taken a course in automata theory or recursive function theory,
explain why while loops are strictly more powerful than for loops. (If you
haven’t had such a course, skip this question!) Note that we’re referring here
to Pascal-style for loops, not C-style.
6.26 Show how to calculate the number of iterations of a general Fortran 90-style do loop. Your code should be written in an assembler-like notation,
and should be guaranteed to work for all valid bounds and step sizes. Be
careful of overflow! (Hint: While the bounds and step size of the loop can
be either positive or negative, you can safely use an unsigned integer for the
iteration count.)
6.27 Write a tail-recursive function in Scheme or ML to compute n factorial
($n! = \prod_{1 \le i \le n} i = 1 \times 2 \times \cdots \times n$). (Hint: You will probably want to define
a “helper” function, as discussed in Section 6.6.1.)
6.28 Can you write a macro in standard C that “returns” the greatest common
divisor of a pair of arguments, without calling a subroutine? Why or why
not?
6.29 Give an example in C in which an in-line subroutine may be significantly faster than a functionally equivalent macro. Give another example in
which the macro is likely to be faster. (Hint: Think about applicative versus
normal-order evaluation of arguments.)
6.30 Use lazy evaluation ( delay and
force ) to implement iterator objects in
Scheme. More specifically, let an iterator be either the null list or a pair consisting of an element and a promise that when force d will return an iterator.
Give code for an uptoby function that returns an iterator, and a for-iter
function that accepts as arguments a one-argument function and an iterator.
These should allow you to evaluate such expressions as
(for-iter (lambda (e) (display e) (newline)) (uptoby 10 50 3))
Note that unlike the standard Scheme for-each , for-iter should not require the existence of a list containing the elements over which to iterate;
the intrinsic space required for (for-iter f (uptoby 1 n 1)) should be
only O(1), rather than O(n).
6.31 (Difficult) Use
call-with-current-continuation ( call/cc ) to implement the following structured nonlocal control transfers in Scheme. (This
requires knowledge of material in Chapter 10.) You will probably want to
consult a Scheme manual for documentation not only on call/cc , but on
define-syntax and dynamic-wind as well.
(a) Multilevel returns. Model your syntax after the catch and throw of
Common Lisp.
(b) True iterators. In a style reminiscent of Exercise 6.30, let an iterator be a
function which when call/cc -ed will return either a null list or a pair
consisting of an element and an iterator. As in that previous exercise,
your implementation should support expressions like
(for-iter (lambda (e) (display e) (newline)) (uptoby 10 50 3))
Where the implementation of uptoby in Exercise 6.30 required the use
of delay and force , however, you should provide an iterator macro
(a Scheme special form) and a yield function that allows uptoby to
look like an ordinary tail-recursive function with an embedded yield :
(define uptoby
  (iterator (low high step)
    (letrec ((helper
               (lambda (next)
                 (if (> next high) '()
                     (begin                    ; else clause
                       (yield next)
                       (helper (+ next step)))))))
      (helper low))))
6.32 Explain why the following guarded commands in SR are not equivalent.

    if a < b -> c := a          if a < b -> c := a
    [] b < c -> c := b          [] b < c -> c := b
    [] else -> c := d           [] true -> c := d
    fi                          fi

6.33–6.35 In More Depth.
6.10 Explorations
6.36 Consider again the idea of loop unrolling, introduced in Exercise 5.15. Loop
unrolling is traditionally implemented by the code improvement phase of a
compiler. It can be implemented at source level, however, if we are faced
with the prospect of “hand optimizing” time-critical code on a system
whose compiler is not up to the task. Unfortunately, if we replicate the body
of a loop k times, we must deal with the possibility that the original number
of loop iterations, n, may not be a multiple of k. Writing in C, and letting
k = 4, we might transform the main loop of Exercise 5.15 from
i = 0;
do {
sum += A[i]; squares += A[i] * A[i]; i++;
} while (i < N);
to
i = 0; j = N/4;
do {
    sum += A[i]; squares += A[i] * A[i]; i++;
    sum += A[i]; squares += A[i] * A[i]; i++;
    sum += A[i]; squares += A[i] * A[i]; i++;
    sum += A[i]; squares += A[i] * A[i]; i++;
} while (--j > 0);
do {
    sum += A[i]; squares += A[i] * A[i]; i++;
} while (i < N);
In 1983, Tom Duff of Lucasfilm realized that code of this sort can be
“simplified” in C by interleaving a switch statement and a loop. The result
is rather startling, but perfectly valid C. It’s known in programming folklore
as “Duff ’s device.”
i = 0; j = (N+3)/4;
switch (N%4) {
    case 0: do{ sum += A[i]; squares += A[i] * A[i]; i++;
    case 3:     sum += A[i]; squares += A[i] * A[i]; i++;
    case 2:     sum += A[i]; squares += A[i] * A[i]; i++;
    case 1:     sum += A[i]; squares += A[i] * A[i]; i++;
            } while (--j > 0);
}
Duff announced his discovery with “a combination of pride and revulsion.”
He noted that “Many people . . . have said that the worst feature of C is
that switch es don’t break automatically before each case label. This code
forms some sort of argument in that debate, but I’m not sure whether it’s
for or against.” What do you think? Is it reasonable to interleave a loop
and a switch in this way? Should a programming language permit it? Is
automatic fall-through ever a good idea?
6.37 Using your favorite language and compiler, investigate the order of evaluation of subroutine parameters. Are they usually evaluated left to right or
right to left? Are they ever evaluated in the other order? (Can you be sure?)
Write a program in which the order makes a difference in the results of the
computation.
6.38 Consider the different approaches to arithmetic overflow adopted by Pascal,
C, Java, C#, and Common Lisp, as described in Section 6.1.4. Speculate
as to the differences in language design goals that might have caused the
designers to adopt the approaches they did.
6.39 Learn more about container classes and the design patterns (structured programming idioms) they support. Explore the similarities and differences
among the standard container libraries of C++, Java, and C#. Which of
these libraries do you find the most appealing? Why?
6.40–6.43 In More Depth.
6.11 Bibliographic Notes
Many of the issues discussed in this chapter feature prominently in papers on
the history of programming languages. Pointers to several such papers can be
found in the Bibliographic Notes for Chapter 1. Fifteen papers comparing Ada,
C, and Pascal can be found in the collection edited by Feuer and Gehani [FG84].
References for individual languages can be found in Appendix A.
Niklaus Wirth has been responsible for a series of influential languages over a
30-year period, including Pascal [Wir71], its predecessor Algol W [WH66], and
the successors Modula [Wir77b], Modula-2 [Wir85b], and Oberon [Wir88b].
The case statement of Algol W is due to Hoare [Hoa81]. Bernstein [Ber85]
considers a variety of alternative implementations for case, including multilevel
versions appropriate for label sets consisting of several dense “clusters” of values. Guarded commands are due to Dijkstra [Dij75]. Duff ’s device was originally
posted to netnews, the predecessor of Usenet news, in May 1984. The original
posting appears to have been lost, but Duff ’s commentary on it can be found at
many Internet sites, including www.lysator.liu.se/c/duffs-device.html.
Debate over the supposed merits or evils of the goto statement dates from
at least the early 1960s, but became a good bit more heated in the wake of a
1968 article by Dijkstra (“Go To Statement Considered Harmful” [Dij68b]). The
structured programming movement of the 1970s took its name from the text
of Dahl, Dijkstra, and Hoare [DDH72]. A dissenting letter by Rubin in 1987
(“ ‘GOTO Considered Harmful’ Considered Harmful” [Rub87]; Exercise 6.22)
elicited a flurry of responses.
What has been called the “reference model of variables” in this chapter is called
the “object model” in Clu; Liskov and Guttag describe it in Sections 2.3 and 2.4.2
of their text on abstraction and specification [LG86]. Clu iterators are described
in an article by Liskov et al. [LSAS77] and in Chapter 6 of the Liskov and Guttag
text. Icon generators are discussed in Chapters 11 and 14 of the text by Griswold and Griswold [GG96]. The tree-enumeration algorithm of Exercise 6.18
was originally presented (without iterators) by Solomon and Finkel [SF80].
Several texts discuss the use of invariants (Exercise 6.24) as a tool for writing
correct programs. Particularly noteworthy are the works of Dijkstra [Dij76] and
Gries [Gri81]. Kernighan and Plauger provide a more informal discussion of the
art of writing good programs [KP78].
The Blizzard [SFL+ 94] and Shasta [SG96] systems for software distributed
shared memory (S-DSM) make use of sentinels (Exercise 6.8). We will discuss
S-DSM in Section 12.2.1.
Michaelson [Mic89, Chap. 8] provides an accessible formal treatment of
applicative-order, normal-order, and lazy evaluation. Friedman, Wand, and
Haynes provide an excellent discussion of continuation-passing style [FWH01,
Chaps. 7–8].
7 Data Types
Most programming languages include a notion of type for expressions
and/or objects.1 Types serve two principal purposes:
EXAMPLE 7.1 Operations that leverage type information
1. Types provide implicit context for many operations, so the programmer does
not have to specify that context explicitly. In Pascal, for instance, the expression a + b will use integer addition if a and b are of integer type; it will use
floating-point addition if a and b are of real type. Similarly, the operation
new p , where p is a pointer, will allocate a block of storage from the heap that
is the right size to hold an object of the type pointed to by p ; the programmer does not have to specify (or even know) this size. In C++, Java, and C#,
the operation new my_type() not only allocates (and returns a pointer to) a
block of storage sized for an object of type my_type , it also automatically calls
any user-defined initialization (constructor) function that has been associated
with that type.
2. Types limit the set of operations that may be performed in a semantically
valid program. They prevent the programmer from adding a character and a
record, for example, or from taking the arctangent of a set, or passing a file
as a parameter to a subroutine that expects an integer. While no type system
can promise to catch every nonsensical operation that a programmer might
put into a program by mistake, good type systems catch enough mistakes to
be highly valuable in practice.
Section 7.1 looks more closely at the meaning and purpose of types, and
presents some basic definitions. Section 7.2 addresses questions of type equivalence and type compatibility: when can we say that two types are the same, and
when can we use a value of a given type in a given context? Sections 7.3–7.9 con-
sider syntactic, semantic, and pragmatic issues for some of the most important
composite types: records, arrays, strings, sets, pointers, lists, and files. The section
on pointers includes a more detailed discussion of the value and reference models of variables introduced in Section 6.1.2, and of the heap management issues
introduced in Section 3.2. The section on files (mostly on the PLP CD) includes a
discussion of input and output. Section 7.10 considers what it means to compare
two complex objects for equality, or to assign one into the other.

1 Recall that unless otherwise noted we are using the term object informally to refer to anything that might have a name. Object-oriented languages, which we will study in Chapter 9, assign a different, more formal meaning to the term.
7.1 Type Systems
In Section 5.2 we noted that computer hardware is capable of interpreting bits
in memory in several different ways. The various functional units of a processor may interpret bits as, among other things, instructions, addresses, characters,
and integer and floating-point numbers of various lengths. The bits themselves,
however, are untyped; the hardware on most machines makes no attempt to keep
track of which interpretations correspond to which locations in memory. Assembly languages reflect this lack of typing: operations of any kind can be applied to
values in arbitrary locations. High-level languages, on the other hand, almost
always associate types with values, to provide the contextual information and
error-checking just mentioned.
Informally, a type system consists of (1) a mechanism to define types and associate them with certain language constructs and (2) a set of rules for type equivalence, type compatibility, and type inference. The constructs that must have types
are precisely those that have values, or that can refer to objects that have values. These constructs include named constants, variables, record fields, parameters, and sometimes subroutines; explicit (manifest) constants (e.g., 17 , 3.14 ,
"foo" ); and more complicated expressions containing these. Type equivalence
rules determine when the types of two values are the same. Type compatibility
rules determine when a value of a given type can be used in a given context.
Type inference rules define the type of an expression based on the types of its
constituent parts or (sometimes) the surrounding context.
The distinction between the type of an expression (e.g., a name) and the type
of the object to which it refers is important in a language with polymorphic variables or parameters, since a given name may refer to objects of different types
at different times. In a language without polymorphism, the distinction doesn’t
matter.
Subroutines are considered to have types in some languages, but not in others. Subroutines need to have types if they are first- or second-class values (i.e., if
they can be passed as parameters, returned by functions, or stored in variables).
In each of these cases there is a construct in the language whose value is a dynamically determined subroutine; type information allows the language to limit the set
of acceptable values to those that provide a particular subroutine interface (i.e.,
particular numbers and types of parameters). In a statically scoped language that
never creates references to subroutines dynamically (one in which subroutines
are always third-class values), the compiler can always identify the subroutine to
which a name refers, and can ensure that the routine is called correctly without
necessarily employing a formal notion of subroutine types.
7.1.1 Type Checking
Type checking is the process of ensuring that a program obeys the language’s type
compatibility rules. A violation of the rules is known as a type clash. A language is
said to be strongly typed if it prohibits, in a way that the language implementation
can enforce, the application of any operation to any object that is not intended
to support that operation. A language is said to be statically typed if it is strongly
typed and type checking can be performed at compile time. In the strictest sense
of the term, few languages are statically typed. In practice, the term is often applied to languages in which most type checking can be performed at compile
time, and the rest can be performed at run time.
A few examples: Ada is strongly typed and, for the most part, statically typed
(certain type constraints must be checked at run time). A Pascal implementation
can also do most of its type checking at compile time, though the language is not
quite strongly typed: untagged variant records (to be discussed in Section 7.3)
are its only loophole. C89 is significantly more strongly typed than its predecessor dialects, but still significantly less strongly typed than Pascal. Its loopholes
include unions, subroutines with variable numbers of parameters, and the interoperability of pointers and arrays (to be discussed in Section 7.7.1). Implementations of C rarely check anything at run time. A few high-level languages (e.g.,
Bliss [WRH71]) are completely untyped, like assembly languages.
Dynamic (run-time) type checking is a form of late binding and tends to be
found in languages that delay other issues until run time as well. Lisp, Smalltalk,
and most scripting languages are dynamically (though strongly) typed. Languages with dynamic scoping are generally dynamically typed (or not typed at
all): if the compiler can’t identify the object to which a name refers, it usually
can’t determine the type of the object either.
7.1.2 Polymorphism
Polymorphism (Section 3.6.3) allows a single body of code to work with objects
of multiple types. It may or may not imply the need for run-time type checking.
As implemented in Lisp, Smalltalk, and the various scripting languages, fully dynamic typing allows the programmer to apply arbitrary operations to arbitrary
objects. Only at run time does the language implementation check to see that the
objects actually implement the requested operations. Because the types of objects
can be thought of as implied (unspecified) parameters, dynamic typing is said to
support implicit parametric polymorphism.
Unfortunately, while powerful and straightforward, dynamic typing incurs
significant run-time cost. It also delays the reporting of errors. ML and its descendants employ a sophisticated system of type inference to support implicit
parametric polymorphism in conjunction with static typing. The ML compiler
infers for every object and expression a (possibly unique) type that captures precisely those properties that the object or expression must have to be used in the
context(s) in which it appears. With rare exceptions, the programmer need not
specify the types of objects explicitly. The task of the compiler is to determine
whether there exists a consistent assignment of types to expressions that guarantees, statically, that no operation will ever be applied to a value of an inappropriate type at run time. This job can be formalized as the problem of unification; we
discuss it further in Section 7.2.4.
In object-oriented languages, subtype polymorphism allows a variable X of type
T to refer to an object of any type derived from T. Since derived types are required
to support all of the operations of the base type, the compiler can be sure that
any operation acceptable for an object of type T will be acceptable for any object
referred to by X. Given a straightforward model of inheritance, type checking
for subtype polymorphism can be implemented entirely at compile time. Most
languages that envision such an implementation, including C++, Eiffel, Java, and
C#, also provide explicit parametric polymorphism (generics), which allow the programmer to define classes with type parameters. Generics are particularly useful
for container (collection) classes: “list of T ” ( List<T> ), “stack of T ” ( Stack<T> ),
and so on, where T is left unspecified. Like subtype polymorphism, generics can
usually be type-checked at compile time, though Java sometimes performs redundant checks at run time for the sake of interoperability with preexisting nongeneric code.
DESIGN & IMPLEMENTATION
Dynamic typing
The growing popularity of scripting languages has led a number of prominent software developers to publicly question the value of static typing. They
ask: given that we can’t check everything at compile time, how much pain is it
worth to check the things we can? As a general rule, it is easier to write typecorrect code than to prove that we have done so, and static typing requires
such proofs. As type systems become more complex (due to object orientation, generics, etc.), the complexity of static typing increases correspondingly.
Anyone who has written extensively in Ada or C++ on the one hand, and in
Python or Scheme on the other, cannot help but be struck at how much easier
it is to write code without complex type declarations. Dynamic checking incurs some run-time cost, of course, and delays the reporting of errors, but this
is increasingly seen as insignificant in comparison to the potential increase in
human productivity. The choice between static and dynamic typing promises
to provide one of the most interesting language debates of the coming decade.
Smalltalk, Objective-C, Python, and Ruby use a single mechanism for both parametric and subtype polymorphism, with checking delayed until run
time. We will consider generics further in Section 8.4, and derived types in Chapter 9.
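To make explicit parametric polymorphism concrete, here is a minimal C++ sketch of a generic stack; the class and member names are ours, invented for illustration, and each instantiation (Stack<int>, Stack<std::string>, and so on) is type-checked statically.

#include <cstddef>
#include <stdexcept>
#include <vector>

// A generic (explicitly parametric) container: the element type T is a
// compile-time parameter, so Stack<int> and Stack<double> are distinct,
// statically checked types.
template <typename T>
class Stack {
public:
    void push(const T& v) { elements.push_back(v); }
    T pop() {
        if (elements.empty()) throw std::out_of_range("pop from empty stack");
        T top = elements.back();
        elements.pop_back();
        return top;
    }
    std::size_t size() const { return elements.size(); }
private:
    std::vector<T> elements;
};

int main() {
    Stack<int> s;
    s.push(3);
    s.push(4);
    int x = s.pop();     // fine: the element type and x agree
    // s.push("hello");  // rejected at compile time: const char* is not int
    return x;
}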
7.1.3 The Definition of Types
Some early high-level languages (e.g., Fortran 77, Algol 60, and Basic) provide a
small, built-in, and nonextensible set of types. As we saw in Section 3.3.1, Fortran does not require variables to be declared; it incorporates default rules to
determine the type of undeclared variables based on the spelling of their names
(Basic has similar rules). As noted in the previous subsection, a few languages
(e.g., Bliss) dispense with types, while others keep track of them automatically at
compile time (as in ML, Miranda, or Haskell) or at run time (as in Lisp/Scheme
or Smalltalk). In most languages, however, users must explicitly declare the type
of every object, together with the characteristics of every type that is not built in.
There are at least three ways to think about types, which we may call the denotational, constructive, and abstraction-based points of view. From the denotational
point of view, a type is simply a set of values. A value has a given type if it belongs
to the set; an object has a given type if its value is guaranteed to be in the set. From
the constructive point of view, a type is either one of a small collection of builtin types (integer, character, Boolean, real, etc.; also called primitive or predefined
types) or a composite type created by applying a type constructor ( record , array ,
set , etc.) to one or more simpler types. (This use of the term constructor is unrelated to the initialization functions of object-oriented languages. It also differs
in a more subtle way from the use of the term in ML.) From the abstraction-based point of view, a type is an interface consisting of a set of operations with
well-defined and mutually consistent semantics. For most programmers (and
language designers), types usually reflect a mixture of these viewpoints.
In denotational semantics (one of the leading ways to formalize the meaning of programs), a set of values is known as a domain. Types are domains. The
meaning of an expression in denotational semantics is a value from the domain
that represents the expression’s type. (Domains are in some sense a generalization of types. The meaning of any language construct is a value from a domain.
The meaning of an assignment statement, for example, is a value from a domain whose elements are functions. Each function maps a store—a mapping
from names to values that represents the current contents of memory—to another store, which represents the contents of memory after the assignment.) One
of the nice things about the denotational view of types is that it allows us in many
cases to describe user-defined composite types (records, arrays, etc.) in terms of
mathematical operations on sets. We will allude to these operations again in Section 7.1.4.
Because it is based on mathematical objects, the denotational view of types
usually ignores such implementation issues as limited precision and word length.
This limitation is less serious than it might at first appear: checks for such errors
as arithmetic overflow are usually implemented outside of the type system of a
language anyway: they result in a run-time error, but this error is not called a
type clash.
When a programmer defines an enumerated type (e.g., enum hue {red,
green, blue} in C), he or she certainly thinks of this type as a set of values. For
most other varieties of user-defined type, however, one typically does not think
in terms of sets of values. Rather, one usually thinks in terms of the way the type
is built from simpler types, or in terms of its meaning or purpose. These ways
of thinking reflect the constructive and abstraction-based points of view. The
constructive point of view was pioneered by Algol W and Algol 68, and is characteristic of most languages designed in the 1970s and 1980s. The abstraction-based point of view was pioneered by Simula-67 and Smalltalk, and is characteristic of modern object-oriented languages. It can also be adopted as a matter
of programming discipline in non-object-oriented languages. We will discuss the
abstraction-based point of view in more detail in Chapter 9. The remainder of
this chapter focuses on the constructive point of view.
7.1.4 The Classification of Types
The terminology for types varies some from one language to another. This subsection presents definitions for the most common terms. Most languages provide
built-in types similar to those supported in hardware by most processors: integers, characters, Booleans, and real (floating-point) numbers.
Booleans (sometimes called logicals) are typically implemented as one-byte
quantities, with 1 representing true and 0 representing false . As noted in Section 6.1.2, C is unusual in its lack of a Boolean type: where most languages would
expect a Boolean value, C expects an integer; zero means false, and anything else means true. As noted in Section 6.5.4, Icon replaces Booleans with a more general notion of success and failure.
Characters have traditionally been implemented as one-byte quantities as well,
typically (but not always) using the ASCII encoding. More recent languages (e.g.,
Java and C#) use a two-byte representation designed to accommodate the Unicode character set. Unicode is an international standard designed to capture the
characters of a wide variety of languages (see the Design & Implementation sidebar below). The first 128
characters of Unicode ( \u0000 through \u007f ) are identical to ASCII. C++
provides both regular and “wide” characters, though for wide characters both
the encoding and the actual width are implementation-dependent. Fortran 2003
supports four-byte Unicode characters.
Numeric Types
A few languages (e.g., C and Fortran) distinguish between different lengths of
integers and real numbers; most do not, and leave the choice of precision to the
implementation. Unfortunately, differences in precision across language implementations lead to a lack of portability: programs that run correctly on one system may produce run-time errors or erroneous results on another. Java and C#
are unusual in providing several lengths of numeric types, with a specified precision for each.
A few languages, including C, C++, C#, and Modula-2, provide both signed
and unsigned integers (Modula-2 calls unsigned integers cardinals). A few
languages (e.g., Fortran, C99, Common Lisp, and Scheme) provide a built-in
complex type, usually implemented as a pair of floating-point numbers that represent the real and imaginary Cartesian coordinates.
DESIGN & IMPLEMENTATION
Multilingual character sets
The ISO 10646 international standard defines a Universal Character Set (UCS)
intended to include all characters of all known human languages. (It also sets
aside a “private use area” for such artificial [constructed] languages as Klingon,
Tengwar, and Cirth [Tolkien Elvish]. Allocation of this private space is coordinated by a volunteer organization known as the ConScript Unicode Registry.)
All natural languages currently employ codes in the 16-bit Basic Multilingual
Plane (BMP): 0x0000 through 0xfffd .
Unicode is an expanded version of ISO 10646, maintained by an international consortium of software manufacturers. In addition to mapping tables,
it covers such topics as rendering algorithms, directionality of text, and sorting
and comparison conventions.
While recent languages have moved toward 16- or 32-bit internal character representations, these cannot be used for external storage—text files—
without causing severe problems with backward compatibility. To accommodate Unicode without breaking existing tools, Ken Thompson in 1992 proposed a multibyte “expanding” code known as UTF-8 (UCS/Unicode Transformation Format, 8-bit) and codified as a formal annex (appendix) to ISO
10646. UTF-8 characters occupy a maximum of 6 bytes—3 if they lie in the
BMP, and only 1 if they are ordinary ASCII. The trick is to observe that ASCII
is a 7-bit code; in any legacy text file the most significant bit of every byte is 0.
In UTF-8 a most significant bit of 1 indicates a multibyte character. Two-byte
codes begin with the bits 110 . Three-byte codes begin with 1110 . Second and
subsequent bytes of multibyte characters always begin with 10 .
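The bit patterns just described are easy to capture in code. The following sketch (the function name is ours) is an illustration in C++ that handles only code points in the BMP; code points outside the BMP need additional bytes and are omitted here.

#include <cstdint>
#include <string>

// Encode a BMP code point as UTF-8: 0xxxxxxx, 110xxxxx 10xxxxxx, or
// 1110xxxx 10xxxxxx 10xxxxxx, as described above.
std::string utf8_encode_bmp(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                      // ordinary ASCII: one byte, MSB 0
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// utf8_encode_bmp(0x00A9) yields the two bytes 0xC2 0xA9 (the copyright sign).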
On some systems one also finds files encoded in one of ten variants of the
older 8-bit ISO 8859 standard, but these are inconsistently rendered across
platforms. On the web, non-ASCII characters are typically encoded with numeric character references, which bracket a Unicode value, written in decimal
or hex, with an ampersand and a semicolon. The copyright symbol (©), for
example, is &#169;. Many characters also have symbolic entity names (e.g.,
&copy; ) but not all browsers support these.
Other languages (e.g., C++) support complex numbers in a standard library. A few languages (e.g., Scheme
and Common Lisp) provide a built-in rational type, usually implemented as a
pair of integers that represents the numerator and denominator. Common Lisp
and most dialects of Scheme support integers (and rationals) of arbitrary precision; the implementation uses multiple words of memory where appropriate.
Ada supports fixed-point types, which are represented internally by integers
but have an implied decimal point at a programmer-specified position among
the digits. Fixed-point numbers provide a compact representation of nonintegral values (e.g., dollars and cents) within a restricted range. For example, 32-bit
hardware integers can represent fixed-point numbers with two (decimal) digits to
the right of the decimal point in the range of roughly negative 20 million to positive 20 million. Double-precision (64-bit) numbers would be required to capture the same range in floating-point, since single-precision IEEE floating-point
numbers have only 23 bits of significand (Section 5.2.1). Addition and subtraction of fixed-point numbers (with the same number of decimal places) can use
ordinary integer operations. Multiplication and division are slightly more complicated, as are operations on values with different numbers of digits to the right
of the decimal point (Exercise 7.4).
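As a rough sketch of that arithmetic (the names and helper functions are ours), the following C++ fragment stores a two-decimal-digit fixed-point value as an integer count of hundredths: addition is a plain integer add, while multiplication must divide out the extra scale factor of 100.

#include <cstdint>

using fixed2 = std::int32_t;                  // value scaled by 100 (two decimal digits)

fixed2 from_parts(std::int32_t whole, std::int32_t hundredths) {
    return whole * 100 + hundredths;
}
fixed2 add(fixed2 a, fixed2 b) { return a + b; }   // ordinary integer addition
fixed2 sub(fixed2 a, fixed2 b) { return a - b; }   // ordinary integer subtraction
fixed2 mul(fixed2 a, fixed2 b) {
    // widen first so the intermediate product cannot overflow, then divide
    // out the extra factor of 100 introduced by multiplying two scaled values
    return static_cast<fixed2>((static_cast<std::int64_t>(a) * b) / 100);
}

// Example: 19.99 * 3.00
//   from_parts(19, 99) == 1999,  from_parts(3, 0) == 300
//   mul(1999, 300) == 5997, i.e., 59.97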
Integers, Booleans, and characters are all examples of discrete types (also called
ordinal types): the domains to which they correspond are countable, and have a
well-defined notion of predecessor and successor for each element other than the
first and the last. (In most implementations the number of possible integers is
finite, but this is usually not reflected in the type system.)

DESIGN & IMPLEMENTATION
Decimal types
A few languages, notably Cobol and PL/I, provide a decimal type for fixedpoint representation of integers in binary-coded decimal (BCD) format. BCD
devotes one nibble (four bits—half a byte) to each decimal digit. Machines that
support BCD in hardware can perform arithmetic directly on the BCD representation of a number, without converting it to and from binary form. This
capability is particularly useful in business and financial applications, which
treat their data as both numbers and character strings: converting a string of
ASCII digits to or from BCD is significantly cheaper than converting it to or
from binary. BCD format can be found on many (though by no means all)
CISC machines, and on at least one RISC machine: the HP PA-RISC.
C# also provides a decimal type, but its representation is closer to that of
Ada’s fixed point types than to the decimal types of Cobol and PL/I. Specifically, a C# decimal variable is a 128-bit datum that includes 96 binary bits of
precision, a sign, and a decimal scaling factor that can vary between 10^-28 and 10^28. Values of decimal type have greater precision but smaller range than
double-precision floating-point values. Within their range they are ideal for
financial calculations, because they represent decimal fractions precisely.
Two varieties of user-defined types, enumerations and subranges, are also discrete. Discrete, rational,
real, and complex types together constitute the scalar types. Scalar types are also
sometimes called simple types.
Enumeration Types
EXAMPLE 7.3: Enumerations in Pascal
Enumerations were introduced by Wirth in the design of Pascal. They facilitate
the creation of readable programs, and allow the compiler to catch certain kinds
of programming errors. An enumeration type consists of a set of named elements. In Pascal, one can write
type weekday = (sun, mon, tue, wed, thu, fri, sat);
The values of an enumeration type are ordered, so comparisons are generally valid ( mon < tue ), and there is usually a mechanism to determine the
predecessor or successor of an enumeration value (in Pascal, tomorrow :=
succ(today) ). The ordered nature of enumerations facilitates the writing of
enumeration-controlled loops:
for today := mon to fri do begin ...
It also allows enumerations to be used to index arrays:
var daily_attendance : array [weekday] of integer;
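A rough C++ analog of the Pascal fragment above is sketched below; because an unscoped C++ enum converts implicitly to an integer, its values can index an array and control a loop, though without Pascal's bounds checking.

#include <cstdio>

enum weekday { sun, mon, tue, wed, thu, fri, sat };

int main() {
    int daily_attendance[sat + 1] = {0};   // one slot per weekday
    daily_attendance[mon] = 37;
    for (int d = mon; d <= fri; ++d)       // enumeration-controlled loop
        std::printf("day %d: %d\n", d, daily_attendance[d]);
    return 0;
}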
EXAMPLE 7.4: Enumerations as constants
An alternative to enumerations, of course, is simply to declare a collection of
constants:
const sun = 0; mon = 1; tue = 2; wed = 3; thu = 4; fri = 5; sat = 6;
In C, the difference between the two approaches is purely syntactic. The declaration
enum weekday {sun, mon, tue, wed, thu, fri, sat};
is essentially equivalent to
typedef int weekday;
const weekday sun = 0, mon = 1, tue = 2,
wed = 3, thu = 4, fri = 5, sat = 6;
EXAMPLE 7.5: Converting to and from enumeration type
In Pascal and most of its descendants, however, the difference between an enumeration and a set of integer constants is much more significant: the enumeration is a full-fledged type, incompatible with integers. Using an integer or an
enumeration value in a context expecting the other will result in a type clash
error at compile time.
Values of an enumeration type are typically represented by small integers, usually a consecutive range of small integers starting at zero. In many languages these
ordinal values are semantically significant, because built-in functions can be used
to convert an enumeration value to its ordinal value, and sometimes vice versa.
In Pascal, the built-in function ord takes an argument of any enumeration type
(including char and Boolean , which are considered built-in enumerations) and
returns the argument’s ordinal value. The built-in function chr takes an argu-
ment i of type integer and returns the character whose ordinal value is i (or generates a run-time error if there is no such character). In Ada, weekday'pos(mon) = 1 and weekday'val(1) = mon.

EXAMPLE 7.6: Distinguished values for enums

Several languages allow the programmer to specify the ordinal values of enumeration types if the default assignment is undesirable. In C, C++, and C#, one
could write
enum mips_special_regs {gp = 28, fp = 30, sp = 29, ra = 31};
(The intuition behind these values is explained in Section 5.4.4.) In Ada this declaration would be written
type mips_special_regs is (gp, sp, fp, ra);    -- must be sorted
for mips_special_regs use (gp => 28, sp => 29, fp => 30, ra => 31);
EXAMPLE 7.7: Emulating distinguished enum values in Java 5
In recent versions of Java one can obtain a similar effect by giving values an
extra field (here named register ):
enum mips_special_regs { gp(28), fp(30), sp(29), ra(31);
    private final int register;
    mips_special_regs(int r) { register = r; }
    public int reg() { return register; }
}
...
int n = mips_special_regs.fp.reg();
As noted in Section 3.6.2, Pascal and C do not allow the same element name
to be used in more than one enumeration type in the same scope. Java and
C# do, but the programmer must identify elements using fully qualified names:
mips_special_regs.fp . Ada relaxes this requirement by saying that element
names are overloaded; the type prefix can be omitted whenever the compiler can
infer it from context.
Subrange Types
EXAMPLE 7.8: Subranges in Pascal
Like enumerations, subranges were first introduced in Pascal, and are found in
many subsequent Algol-family languages. A subrange is a type whose values compose a contiguous subset of the values of some discrete base type (also called the
parent type). In Pascal and most of its descendants, one can declare subranges
of integers, characters, enumerations, and even other subranges. In Pascal, subranges look like this:
type test_score = 0..100;
workday = mon..fri;
EXAMPLE 7.9: Subranges in Ada
In Ada one would write
type test_score is new integer range 0..100;
subtype workday is weekday range mon..fri;
The range... portion of the definition in Ada is called a type constraint. In this
example test_score is a derived type, incompatible with integers. The workday
type, on the other hand, is a constrained subtype; workdays and weekdays can be more or less freely intermixed. The distinction between derived types and subtypes is a valuable feature of Ada; we will discuss it further in Section 7.2.1.

EXAMPLE 7.10: Space requirements of subrange type

One could of course use integers to represent test scores, or a weekday to represent a workday. Using an explicit subrange has several advantages. For one
thing, it helps to document the program. A comment could also serve as documentation, but comments have a bad habit of growing out of date as programs
change, or of being omitted in the first place. Because the compiler analyzes a
subrange declaration, it knows the expected range of subrange values, and can
generate code to perform dynamic semantic checks to ensure that no subrange
variable is ever assigned an invalid value. These checks can be valuable debugging
tools. In addition, since the compiler knows the number of values in the subrange, it can sometimes use fewer bits to represent subrange values than it would
need to use to represent arbitrary integers. In the example above, test_score
values can be stored in a single byte.
Most implementations employ the same bit patterns for integers and subranges, so subranges whose values are large require large storage locations, even
if the number of distinct values is small. The following type, for example,
type water_temperature = 273..373;    (* degrees Kelvin *)
would be stored in at least two bytes. While there are only 101 distinct values in
the type, the largest (373) is too large to fit in a single byte in its natural encoding.
(An unsigned byte can hold values in the range 0 . . 255; a signed byte can hold
values in the range −128 . . 127.)
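C++ has no subrange types, but the dynamic semantic check a Pascal or Ada compiler would generate can be sketched by hand. The wrapper template below (ours, for illustration) range-checks every assignment; it makes no attempt at the space savings discussed above, since the value is stored in a full int either way.

#include <stdexcept>

template <int Lo, int Hi>
class subrange {
public:
    subrange(int v = Lo) { assign(v); }
    subrange& operator=(int v) { assign(v); return *this; }
    operator int() const { return value; }          // usable wherever an int is expected
private:
    void assign(int v) {
        if (v < Lo || v > Hi)                        // the dynamic semantic check
            throw std::out_of_range("value outside subrange");
        value = v;
    }
    int value;
};

using test_score        = subrange<0, 100>;
using water_temperature = subrange<273, 373>;

int main() {
    test_score t = 93;           // ok
    // test_score bad = 101;     // would throw at run time
    water_temperature w = 300;   // still occupies a full int
    return t + w - 393;          // 0
}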
Composite Types
Nonscalar types are usually called composite, or constructed types. They are generally created by applying a type constructor to one or more simpler types.

DESIGN & IMPLEMENTATION
Multiple sizes of integers
The space savings possible with (small-valued) subrange types in Pascal and
Ada is achieved in several other languages by providing more than one size of
built-in integer type. C and C++, for example, support integer arithmetic on
signed and unsigned variants of char , short , int , long , and (in C99) long
long types, with monotonically nondecreasing sizes.2
2 More specifically, the C99 standard requires ranges for these types corresponding to lengths of
at least 1, 2, 2, 4, and 8 bytes, respectively. In practice, one finds implementations in which plain
int s are 2, 4, or 8 bytes long, including some in which they are the same size as short s but
shorter than long s, and some in which they are the same size as long s, but longer than short s.
Common composite types include records (structures), variant records (unions), arrays, sets, pointers, lists, and files. All but pointers and lists are easily described in
terms of mathematical set operations (pointers and lists can be described mathematically as well, but the description is less intuitive).
Records were introduced by Cobol, and have been supported by most languages
since the 1960s. A record consists of a collection of fields, each of which belongs
to a (potentially different) simpler type. Records are akin to mathematical tuples; a record type corresponds to the Cartesian product of the types of the
fields.
Variant records differ from “normal” records in that only one of a variant
record’s fields (or collections of fields) is valid at any given time. A variant
record type is the union of its field types, rather than their Cartesian product.
Arrays are the most commonly used composite types. An array can be thought
of as a function that maps members of an index type to members of a component type. Arrays of characters are often referred to as strings, and are often
supported by special purpose operations not available for other arrays.
Sets, like enumerations and subranges, were introduced by Pascal. A set type is
the mathematical powerset of its base type, which must usually be discrete. A
variable of a set type contains a collection of distinct elements of the base type.
Pointers are l-values. A pointer value is a reference to an object of the pointer’s
base type. Pointers are often but not always implemented as addresses. They
are most often used to implement recursive data types. A type T is recursive
if an object of type T may contain one or more references to other objects of
type T.
Lists, like arrays, contain a sequence of elements, but there is no notion of mapping or indexing. Rather, a list is defined recursively as either an empty list
or a pair consisting of a head element and a reference to a sublist. While the
length of an array must be specified at elaboration time in most (though not
all) languages, lists are always of variable length. To find a given element of a
list, a program must examine all previous elements, recursively or iteratively,
starting at the head. Because of their recursive definition, lists are fundamental
to programming in most functional languages.
Files are intended to represent data on mass storage devices, outside the memory
in which other program objects reside. Like arrays, most files can be conceptualized as a function that maps members of an index type (generally integer)
to members of a component type. Unlike arrays, files usually have a notion of
current position, which allows the index to be implicit in consecutive operations. Files often display idiosyncrasies inherited from physical input/output devices. In particular, the elements of some files must be accessed
in sequential order.
We will examine composite types in more detail in Sections 7.3 through 7.9.
7.1.5 Orthogonality

EXAMPLE 7.11: Void (empty) type
In Section 6.1.2 we discussed the importance of orthogonality in the design of
expressions, statements, and control-flow constructs. Orthogonality is equally
important in the design of type systems. Languages vary greatly in the degree
of orthogonality they display. A language with a high degree of orthogonality
tends to be easier to understand, to use, and to reason about in a formal way. We
have noted that languages like Algol 68 and C enhance orthogonality by eliminating (or at least blurring) the distinction between statements and expressions.
To characterize a statement that is executed for its side effect(s) and has no useful
value, some languages provide an “empty” type. In C and Algol, for example, a
subroutine that is meant to be used as a procedure is generally declared with a
“return” type of void . In ML, the empty type is called unit . If the programmer
wishes to call a subroutine that does return a value, but the value is not needed
in this particular case (all that matters is the side effect[s]), then the return value
in C can be cast to void (casts will be discussed in Section 7.2.1):
foo_index = insert_in_symbol_table(foo);
...
(void) insert_in_symbol_table(bar);     /* don't care where it went */
                                        /* cast is optional; implied if omitted */
EXAMPLE 7.12: Making do without void
In a language (e.g., Pascal) without an empty type, the latter of these two calls
would need to use a dummy variable:
var dummy : symbol_table_index;
...
dummy := insert_in_symbol_table(bar);
The type system of Pascal is more orthogonal than that of (pre-Fortran 90)
Fortran. Among other things, it allows arrays to be constructed from any discrete
index type and any component type; pre-Fortran 90 arrays are always indexed
by integers and have scalar components. At the same time, Pascal displays several
nonorthogonal wrinkles. As we shall see in Section 7.3, it requires that variant
fields of a record follow all other fields. It limits function return values to scalar
and pointer types. It requires the bounds of each array to be specified at compile
time except when the array is a formal parameter of a subroutine. Perhaps most
important, while it allows subroutines to be passed as parameters, it does not
give them first-class status: a subroutine cannot be returned by a function or
stored in a variable. By contrast, the type system of ML, which we examine in
Section 7.2.4, is almost completely orthogonal.
One particularly useful aspect of type orthogonality is the ability to specify literal values of arbitrary composite types. Several languages provide this capability,
but many others do not. Pascal and Modula provide notation for literal character strings and sets, but not for arrays, records, or recursive data structures. The
lack of notation for most user-defined composite types means that many Pascal
and Modula programs must devote time in every program run to initializing data
structures full of compile-time constants.
EXAMPLE 7.13: Aggregates in Ada
Composite values in Ada are specified using aggregates:
type person is record
name : string (1..10);
age : integer;
end record;
p, q : person;
A, B : array (1..10) of integer;
...
p := ("Jane Doe ", 37);
q := (age => 36, name => "John Doe ");
A := (1, 0, 3, 0, 3, 0, 3, 0, 0, 0);
B := (1 => 1, 3 | 5 | 7 => 3, others => 0);
Here the aggregates assigned into p and A are positional; the aggregates assigned
into q and B name their elements explicitly. The aggregate for B uses a shorthand notation to assign the same value ( 3 ) into array elements 3 , 5 , and 7 ,
and to assign a 0 into all unnamed fields. Several languages, including C, Fortran 90, and Lisp, provide similar capabilities. ML provides a very general facility
for composite expressions, based on the use of constructors (discussed in Section 7.2.4).
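For comparison, C and C++ provide a partial analog of these aggregates. The sketch below (C++; the struct and values are ours) uses positional aggregate initialization and, assuming a C++20 compiler, designated initializers, which resemble Ada's named form but must follow declaration order.

#include <string>

struct person {
    std::string name;
    int age;
};

int main() {
    // Positional aggregate, roughly like the first Ada assignment above.
    person p{"Jane Doe", 37};

    // C++20 designated initializers resemble Ada's named aggregates, though
    // the designators must appear in declaration order.
    person q{.name = "John Doe", .age = 36};

    // Array aggregate; trailing elements are value-initialized to 0, which
    // loosely corresponds to Ada's "others => 0".
    int A[10] = {1, 0, 3, 0, 3, 0, 3};

    return p.age + q.age + A[0];
}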
CHECK YOUR UNDERSTANDING
1. What purpose(s) do types serve in a programming language?
2. What does it mean for a language to be strongly typed? Statically typed? What
prevents, say, C from being strongly typed?
3. Name two important programming languages that are strongly but dynamically typed.
4. What is a type clash?
5. Discuss the differences between the denotational, constructive, and abstraction-based views of types.
6. What is the difference between discrete and scalar types?
7. Give two examples of languages that lack a Boolean type. What do they use
instead?
8. In what ways may an enumeration type be preferable to a collection of named
constants? In what ways may a subrange type be preferable to its base type? In what ways may a string be preferable to an array of characters?
9. What does it mean for a set of language features (e.g., a type system) to be
orthogonal?
10. What are aggregates?
7.2 Type Checking
In most statically typed languages, every definition of an object (constant, variable, subroutine, etc.) must specify the object’s type. Moreover, many of the contexts in which an object might appear are also typed, in the sense that the rules of
the language constrain the types that an object in that context may validly possess. In the following subsections we will consider the topics of type equivalence,
type compatibility, and type inference. Of the three, type compatibility is the one
of most concern to programmers. It determines when an object of a certain type
can be used in a certain context. At a minimum, the object can be used if its type
and the type expected by the context are equivalent (i.e., the same). In many languages, however, compatibility is a looser relationship than equivalence: objects
and contexts are often compatible even when their types are different. Our discussion of type compatibility will touch on the subjects of type conversion (also
called casting), which changes a value of one type into a value of another; type
coercion, which performs a conversion automatically in certain contexts; and nonconverting type casts, which are sometimes used in systems programming to interpret the bits of a value of one type as if they represented a value of some other
type.
Whenever an expression is constructed from simpler subexpressions, the question arises: given the types of the subexpressions (and possibly the type expected
by the surrounding context), what is the type of the expression as a whole? This
question is answered by type inference. Type inference is often trivial: the sum of
two integers is still an integer, for example. In other cases (e.g., when dealing with
sets) it is a good bit trickier. Type inference plays a particularly important role in
ML, Miranda, and Haskell, in which all type information is inferred.
7.2.1 Type Equivalence

EXAMPLE 7.14: Trivial differences in type
In a language in which the user can define new types, there are two principal
ways of defining type equivalence. Structural equivalence is based on the content
of type definitions: roughly speaking, two types are the same if they consist of the
same components, put together in the same way. Name equivalence is based on
the lexical occurrence of type definitions: roughly speaking, each definition introduces a new type. Structural equivalence is used in Algol-68, Modula-3, and (with
various wrinkles) C and ML. It was also used in many early implementations of
Pascal. Name equivalence is the more popular approach in recent languages. It is
used in Java, C#, standard Pascal, and most Pascal descendants, including Ada.
The exact definition of structural equivalence varies from one language to another. It requires that one decide which potential differences between types are
important, and which may be considered unimportant. Most people would probably agree that the format of a declaration should not matter: in a Pascal-like
language with structural equivalence,
type foo = record a, b : integer end;
should be considered the same as
type foo = record
a, b : integer
end;
These definitions should probably also be considered the same as
type foo = record
a : integer;
b : integer
end;
But what about
type foo = record
b : integer;
a : integer
end;
Should the reversal of the order of the fields change the type? Here the answer is not as clear: ML says no; most languages say yes.

EXAMPLE 7.15: Other minor differences in type

In a similar vein, the definition of structural equivalence should probably "factor out" different representations of constants: again in a Pascal-like notation,
type str = array [1..10] of char;
should be considered the same as
type str = array [1..2*5] of char;
On the other hand, these should probably be considered different from
type str = array [0..9] of char;
Here the length of the array has not changed, but the index values are different.
To determine if two types are structurally equivalent, a compiler can expand
their definitions by replacing any embedded type names with their respective
definitions, recursively, until nothing is left but a long string of type constructors, field names, and built-in types. If these expanded strings are the same, then
the types are equivalent, and conversely. Recursive and pointer-based types complicate matters, since their expansion does not terminate, but the problem is not
insurmountable; we consider a solution in Exercise 7.23.
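The expansion-and-comparison idea can be sketched as a recursive walk over type descriptors. The C++ code below is ours, not any particular compiler's: it compares constructors and components in order, ignores field names for brevity, and remembers which pairs of types are already under comparison, which is one common way to make the comparison terminate for recursive, pointer-based types.

#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A toy type representation: either a built-in type, identified by its
// constructor name alone, or a constructor ("record", "pointer", "array", ...)
// applied to component types.
struct Type {
    std::string constructor;
    std::vector<std::shared_ptr<Type>> parts;
};

// Structural equivalence: same constructor, same number of components,
// components pairwise equivalent.  Pairs already under comparison are
// provisionally assumed equal, so recursive types do not cause infinite regress.
bool structurally_equivalent(
        const Type* a, const Type* b,
        std::set<std::pair<const Type*, const Type*>>& assumed) {
    if (a == b) return true;
    if (assumed.count({a, b})) return true;
    if (a->constructor != b->constructor) return false;
    if (a->parts.size() != b->parts.size()) return false;
    assumed.insert({a, b});
    for (std::size_t i = 0; i < a->parts.size(); ++i)
        if (!structurally_equivalent(a->parts[i].get(), b->parts[i].get(), assumed))
            return false;
    return true;
}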
EXAMPLE 7.16: The problem with structural equivalence

Structural equivalence is a straightforward but somewhat low-level, implementation-oriented way to think about types. Its principal problem is an inability
to distinguish between types that the programmer may think of as distinct, but
which happen by coincidence to have the same internal structure:
1.  type student = record
2.      name, address : string
3.      age : integer
4.  type school = record
5.      name, address : string
6.      age : integer
7.  x : student;
8.  y : school;
9.  ...
10. x := y;                 -- is this an error?
Most programmers would probably want to be informed if they accidentally assigned a value of type school into a variable of type student , but a compiler
whose type checking is based on structural equivalence will blithely accept such
an assignment.
Name equivalence is based on the assumption that if the programmer takes the
effort to write two type definitions, then those definitions are probably meant to
represent different types. In our example code, variables x and y will be considered to have different types under name equivalence: x uses the type declared at
line 1; y uses the type declared at line 4.
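C++ makes the same choice here: class and struct types follow name equivalence, so the structurally identical declarations in the sketch below (ours) remain distinct types, and the questionable assignment is rejected at compile time.

#include <string>

struct student { std::string name, address; int age; };
struct school  { std::string name, address; int age; };   // same structure, different type

int main() {
    student x;
    school  y;
    // x = y;             // error: no conversion from 'school' to 'student'
    x.age = y.age = 0;    // fields of like type can still be copied individually
    return x.age;
}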
Variants of Name Equivalence
EXAMPLE 7.17: Semantically equivalent alias types
One subtlety in the use of name equivalence arises with an alias type, whose definition simply specifies the name of some other type. In Modula-2 we might
say
TYPE new_type = old_type;
Should new_type and old_type be considered the same or different? The answer
may depend on how the types are used. One possible use is the following.
TYPE stack_element = INTEGER;    (* or whatever type the user prefers *)
MODULE stack;
IMPORT stack_element;
EXPORT push, pop;
...
PROCEDURE push(elem : stack_element);
...
PROCEDURE pop() : stack_element;
...
Here the stack module is meant to serve as an abstraction that allows the programmer, via textual inclusion, to create a stack of any desired type (in this case integer). If aliased types are not considered equivalent, then the stack is no longer reusable; it cannot be used for objects whose type has a name of the programmer's choosing.

EXAMPLE 7.18: Semantically distinct alias types

Unfortunately, there are other times, even in Modula-2, when aliased types should probably not be the same.
TYPE celsius_temp = REAL;
fahrenheit_temp = REAL;
VAR c : celsius_temp;
f : fahrenheit_temp;
...
f := c;    (* this should probably be an error *)
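A similar contrast can be sketched in C++ (the type names are ours): a typedef or using alias creates no new type, reproducing the problem above, whereas wrapping the representation in a distinct struct gives roughly the effect of an Ada derived type.

// Aliases: freely interchangeable, so the mistake goes undetected.
using celsius_t    = double;
using fahrenheit_t = double;

// Distinct wrapper types: accidental mixing becomes a compile-time error.
struct Celsius    { double degrees; };
struct Fahrenheit { double degrees; };

int main() {
    celsius_t c1 = 100.0;
    fahrenheit_t f1 = c1;        // silently accepted: same type
    (void) f1;

    Celsius c2{100.0};
    Fahrenheit f2{212.0};
    // c2 = f2;                  // rejected: distinct types
    c2.degrees = f2.degrees;     // any mixing must be explicit
    return 0;
}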
EXAMPLE 7.19: Derived types and subtypes in Ada
A language in which aliased types are considered distinct is said to have strict
name equivalence. A language in which aliased types are considered equivalent
is said to have loose name equivalence. Most Pascal-family languages (including
Modula-2) use loose name equivalence. Ada achieves the best of both worlds by
allowing the programmer to indicate whether an alias represents a derived type
or a subtype. A subtype is compatible with its base (parent) type; a derived type
is incompatible. (Subtypes of the same base type are also compatible with each
other.) The types of Examples 7.17 and 7.18 would be written as follows.
subtype stack_element is integer;
...
type celsius_temp is new integer;
type fahrenheit_temp is new integer;
Modula-3, which relies on structural type equivalence, achieves some of the
effect of derived types through use of a branding mechanism. A BRANDED type
is distinct from all other types, regardless of structure. Branding is permitted
only for pointers and abstract objects (in the object-oriented sense of the word).
Its principal purpose is not to distinguish among types like celsius_temp and
fahrenheit_temp above, but rather to prevent the programmer from using
structural equivalence, deliberately or accidentally, to look inside an abstraction
that is supposed to be opaque.
One way to think about the difference between strict and loose name equivalence is to remember the distinction between declarations and definitions (Section 3.3.3). Under strict name equivalence, a declaration type A = B is considered a definition. Under loose name equivalence it is merely a declaration;
A shares the definition of B .
EXAMPLE 7.20: Name vs. structural equivalence

Consider the following example.
1. type cell  = ...              -- whatever
2. type alink = pointer to cell
3. type blink = alink
4. p, q : pointer to cell
5. r    : alink
6. s    : blink
7. t    : pointer to cell
8. u    : alink
Here the declaration at line 3 is an alias; it defines blink to be “the same as” alink .
Under strict name equivalence, line 3 is both a declaration and a definition, and
blink is a new type, distinct from alink . Under loose name equivalence, line 3 is
just a declaration; it uses the definition at line 2.
Under strict name equivalence, p and q have the same type, because they both
use the anonymous (unnamed) type definition on the right-hand side of line 4,
and r and u have the same type, because they both use the definition at line 2.
Under loose name equivalence, r , s , and u all have the same type, as do p and q .
Under structural equivalence, all six of the variables shown have the same type,
namely pointer to whatever cell is.
Both structural and name equivalence can be tricky to implement in the presence of separate compilation. We will return to this issue in Section 14.6.
Type Conversion and Casts
EXAMPLE 7.21: Contexts that expect a given type
In a language with static typing, there are many contexts in which values of a
specific type are expected. In the statement
a := expression
we expect the right-hand side to have the same type as a . In the expression
a+b
the overloaded + symbol designates either integer or floating-point addition; we
therefore expect either that a and b will both be integers or that they will both be
reals. In a call to a subroutine,
foo(arg1, arg2, . . . , argN)
we expect the types of the arguments to match those of the formal parameters, as
declared in the subroutine’s header.
Suppose for the moment that we require in each of these cases that the types
(expected and provided) be exactly the same. Then if the programmer wishes to
use a value of one type in a context that expects another, he or she will need to
specify an explicit type conversion (also sometimes called a type cast). Depending
on the types involved, the conversion may or may not require code to be executed
at run time. There are three principal cases:
1. The types would be considered structurally equivalent, but the language uses
name equivalence. In this case the types employ the same low-level representation, and have the same set of values. The conversion is therefore a purely
conceptual operation; no code will need to be executed at run time.
2. The types have different sets of values, but the intersecting values are represented in the same way. One type may be a subrange of the other, for example,
or one may consist of two’s complement signed integers, while the other is
unsigned. If the provided type has some values that the expected type does
not, then code must be executed at run time to ensure that the current value
is among those that are valid in the expected type. If the check fails, then a dynamic semantic error results. If the check succeeds, then the underlying representation of the value can be used, unchanged. Some language implementations may allow the check to be disabled, resulting in faster but potentially
unsafe code.
3. The types have different low-level representations, but we can nonetheless
define some sort of correspondence among their values. A 32-bit integer, for
example, can be converted to a double-precision IEEE floating-point number
with no loss of precision. Most processors provide a machine instruction to
effect this conversion. A floating-point number can be converted to an integer
by rounding or truncating, but fractional digits will be lost, and the conversion will overflow for many exponent values. Again, most processors provide
a machine instruction to effect this conversion. Conversions between different
lengths of integers can be effected by discarding or sign-extending high-order
bytes.
EXAMPLE 7.22: Type conversions in Ada
We can illustrate these options with the following examples of type conversions
in Ada.
n : integer;            -- assume 32 bits
r : real;               -- assume IEEE double-precision
t : test_score;         -- as in Example 7.9
c : celsius_temp;       -- as in Example 7.19
...
t := test_score(n);     -- run-time semantic check required
n := integer(t);        -- no check req.; every test_score is an int
r := real(n);           -- requires run-time conversion
n := integer(r);        -- requires run-time conversion and check
n := integer(c);        -- no run-time code required
c := celsius_temp(n);   -- no run-time code required
In each of these last six lines, the name of a type is used as a pseudo-function
that performs a type conversion. The first conversion requires a run-time check
to ensure that the value of n is within the bounds of a test_score . The second conversion requires no code, since every possible value of t is acceptable
for n . The third and fourth conversions require code to change the low-level representation of values. The fourth conversion also requires a semantic check. It is
generally understood that converting from a floating-point value to an integer
results in the loss of fractional digits; this loss is not an error. If the conversion
results in integer overflow, however, an error needs to result. The final two conversions require no run-time code; the integer and celsius_temp types (at
least as we have defined them) have the same sets of values and the same underlying representation. A purist might say that celsius_temp should be defined as
new integer range -273..integer’last , in which case a run-time semantic
check would be required on the final conversion.
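In a language without Ada's checked conversions, the run-time semantic check can be written out by hand. The C++ sketch below (the function name is ours) mimics two of the conversions above: a range check that leaves the representation unchanged, and a representation-changing conversion between integer and floating point.

#include <stdexcept>

// Analog of Ada's t := test_score(n): check the range at run time;
// the underlying representation is unchanged.
int to_test_score(int n) {
    if (n < 0 || n > 100)
        throw std::range_error("value out of range for test_score");
    return n;
}

int main() {
    int n = 87;
    int t = to_test_score(n);             // run-time semantic check
    double r = static_cast<double>(n);    // representation changes: int to IEEE double
    n = static_cast<int>(r);              // truncates; no overflow check by default
    return t - n;                         // 0
}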
Nonconverting Type Casts

Occasionally, particularly in systems programs,
one needs to change the type of a value without changing the underlying
implementation—in other words, to interpret the bits of a value of one type as if
they were another type. One common example occurs in memory allocation algorithms, which use a large array of characters or integers to represent a heap, but
then reinterpret portions of that array as pointers and integers (for bookkeeping
purposes), or as various user-allocated data structures. Another common example occurs in high-performance numeric software, which may need to reinterpret
a floating-point number as an integer or a record in order to extract the exponent, significand, and sign fields. These fields can be used to implement special
purpose algorithms for square root, trigonometric functions, and so on.

EXAMPLE 7.23: Unchecked conversions in Ada

A change of type that does not alter the underlying bits is called a nonconverting type cast. It should not be confused with use of the term cast for conversions
in languages like C. In Ada, nonconverting casts can be effected using instances
of a built-in generic subroutine called unchecked_conversion :
-- assume ’float’ has been declared to match IEEE single-precision
function cast_float_to_int is
new unchecked_conversion(float, integer);
function cast_int_to_float is
new unchecked_conversion(integer, float);
...
f := cast_int_to_float(n);
n := cast_float_to_int(f);
EXAMPLE 7.24: Type conversions in C
A type conversion in C (i.e., what C calls a type cast) is specified by using the
name of the desired type, in parentheses, as a prefix operator:
r = (float) n;          /* generates code for run-time conversion */
n = (int) r;            /* also run-time conversion, with no overflow check */
C and its descendants do not by default perform run-time checks for arithmetic overflow on any operation, though such checks can be enabled if desired
in C#.
C++ inherits the casting mechanism of C but also provides a family of semantically cleaner alternatives. Specifically, static_cast performs a type conversion, reinterpret_cast performs a nonconverting type cast, and dynamic_cast
allows programs that manipulate pointers of polymorphic types to perform assignments whose validity cannot be guaranteed statically, but can be checked
at run time (more on this in Chapter 9). There is also a const_cast that can
be used to add or remove read-only qualification. C-style type casts in C++ are
defined in terms of const_cast , static_cast , and reinterpret_cast ; the
precise behavior depends on the source and target types.
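A brief sketch of these four casts follows; the classes and values are ours, chosen only to make each cast legal, and the fragment is illustrative rather than a recommended idiom.

#include <cstdint>
#include <iostream>

struct Base    { virtual ~Base() = default; };
struct Derived : Base { int extra = 42; };

int main() {
    double d = 3.7;
    int i = static_cast<int>(d);                        // type conversion: i becomes 3

    auto bits = reinterpret_cast<std::uintptr_t>(&d);   // nonconverting cast: the
                                                        // pointer's bits as an integer

    const int c = 10;
    const int* cp = &c;
    int* p = const_cast<int*>(cp);                      // removes const qualification
                                                        // (writing through p would be undefined)

    Base* b = new Derived;
    Derived* dp = dynamic_cast<Derived*>(b);            // checked at run time; null on failure

    std::cout << i << ' ' << (bits != 0) << ' ' << *p << ' '
              << (dp ? dp->extra : -1) << '\n';
    delete b;
    return 0;
}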
Any nonconverting type cast constitutes a dangerous subversion of the language’s type system. In a language with a weak type system such subversions can
be difficult to find. In a language with a strong type system, the use of explicit
nonconverting type casts at least labels the dangerous points in the code, facilitating debugging if problems arise.
7.2.2 Type Compatibility
Most languages do not require equivalence of types in every context. Instead,
they merely say that a value’s type must be compatible with that of the context
in which it appears. In an assignment statement, the type of the right-hand side
must be compatible with that of the left-hand side. The types of the operands
of + must either both be compatible with the built-in integer type, or both be
compatible with the built-in floating-point type. In a subroutine call, the types
of any arguments passed into the subroutine must be compatible with the types
of the corresponding formal parameters, and the types of any formal parameters
passed back to the caller must be compatible with the types of the corresponding
arguments.
The definition of type compatibility varies greatly from language to language.
Ada takes a relatively restrictive approach: an Ada type S is compatible with an
expected type T if and only if (1) S and T are equivalent, (2) one is a subtype of
the other (or both are subtypes of the same base type), or (3) both are arrays, with
the same numbers and types of elements in each dimension. Pascal is only slightly
more lenient: in addition to allowing the intermixing of base and subrange types,
it allows an integer to be used in a context where a real is expected.
D E S I G N & I M P L E M E N TAT I O N
Nonconverting casts
C programmers sometimes attempt a nonconverting type cast by taking the
address of an object, converting the type of the resulting pointer, and then
dereferencing:

    r = *((float *) &n);

This arcane bit of hackery usually works, because most (but not all!) implementations
use the same representation for pointers to integers and pointers to floating-point
values—namely, an address. The ampersand operator ( & ) means “address of,” or
“pointer to.” The parenthesized (float *) is the type name for “pointer to float”
( float is a built-in floating-point type). The prefix * operator is a pointer
dereference. The cast produces no run-time code; it merely causes the compiler to
interpret the bits of n as if it were a float . The reinterpretation will fail if n
is not an l-value (has no address), or if int s and float s have different sizes
(again, most but not all implementations give them the same size). If n does not
have an address then the compiler will announce a static semantic error. If int and
float do not occupy the same number of bytes, then the effect of the cast may depend
on a variety of factors, including the relative size of the objects, the alignment
and “endian-ness” of memory (Section 5.2), and the choices the compiler has made
regarding what to place in adjacent locations in memory. Safer and more portable
nonconverting casts can be achieved in C by means of union s (variant records); we
consider this option in Exercise 7.9.

Coercion

EXAMPLE 7.25   Coercion in Ada
Whenever a language allows a value of one type to be used in a context that expects
another, the language implementation must perform an automatic, implicit conversion
to the expected type. This conversion is called a type coercion. Like an explicit
conversion, a coercion may require run-time code to perform a dynamic semantic check
or to convert between low-level representations. Ada coercions sometimes need the
former, though never the latter:
d : weekday;              -- as in Example 7.3
k : workday;              -- as in Example 7.9
type calendar_column is new weekday;
c : calendar_column;
...
k := d;      -- run-time check required
d := k;      -- no check required; every workday is a weekday
c := d;      -- static semantic error;
             -- weekdays and calendar_columns are not compatible
To perform this third assignment in Ada we would have to use an explicit
conversion:
c := calendar_column(d);
EXAMPLE 7.26   Coercion in C
Coercions are a controversial subject in language design. Because they allow
types to be mixed without an explicit indication of intent on the part of the programmer, they represent a significant weakening of type security. Fortran and C,
which have relatively weak type systems, perform quite a bit of coercion. They allow values of most numeric types to be intermixed in expressions, and will coerce
types back and forth “as necessary.” Here are some examples in C.
short int s;
unsigned long int l;
char c;          /* may be signed or unsigned -- implementation-dependent */
float f;         /* usually IEEE single-precision */
double d;        /* usually IEEE double-precision */
...
s = l; /* l’s low-order bits are interpreted as a signed number. */
l = s; /* s is sign-extended to the longer length, then
its bits are interpreted as an unsigned number. */
s = c; /* c is either sign-extended or zero-extended to s’s length;
the result is then interpreted as a signed number. */
f = l; /* l is converted to floating-point. Since f has fewer
significant bits, some precision may be lost. */
d = f;  /* f is converted to the longer format; no precision lost. */
f = d;  /* d is converted to the shorter format; precision may be lost.
           If d’s value cannot be represented in single-precision, the
           result is undefined, but NOT a dynamic semantic error. */
Fortran 90 allows arrays and records to be intermixed if their types have the
same shape. Two arrays have the same shape if they have the same number of
dimensions, each dimension has the same size, and the individual elements have
the same shape. These rules are roughly equivalent to the compatibility rules for
arrays in Ada, but Fortran 90 allows arrays to be used in many more contexts. In
particular, it allows its full set of arithmetic operations to be applied, element-by-element, to array-valued operands.
Two Fortran 90 records have the same shape if they have the same number of
fields, and corresponding fields, in order, have the same shape. Field names do
not matter, nor do the actual high and low bounds of array dimensions. C does
not allow records (structures) to be intermixed unless they are structurally equivalent, with identical field names. C provides no operations that take an entire array as an operand. C does, however, allow arrays and pointers to be intermixed
in many cases; we will discuss this unusual form of type compatibility further in
Section 7.7.1.
Most modern languages reflect a trend toward static typing and away from
type coercion. Some language designers have argued, however, that coercions
are a natural way in which to support abstraction and program extensibility, by
making it easier to use new types in conjunction with existing ones. C++ in particular provides an extremely rich, programmer-extensible set of coercion rules.
When defining a new type (a class in C++), the programmer can define coercion
operations to convert values of the new type to and from existing types. These
rules interact in complicated ways with the rules for resolving overloading (Section 3.6.2); they add significant flexibility to the language, but are one of the most
difficult C++ features to understand and use correctly.
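The flavor of these programmer-defined coercions can be seen in a short sketch (ours; the decibel class and attenuate function are hypothetical): a converting constructor coerces double to decibel, and a conversion operator coerces decibel back to double.

class decibel {
    double val;
public:
    decibel(double v) : val(v) { }               // coercion from double to decibel
    operator double() const { return val; }      // coercion from decibel to double
};

double attenuate(double signal, decibel loss) {
    return signal - static_cast<double>(loss);   // conversion written explicitly here
}

void demo() {
    decibel d = 3.0;                     // implicit double -> decibel
    double twice = 2 * d;                // implicit decibel -> double, then double *
    double out = attenuate(twice, 6.0);  // 6.0 coerced to decibel at the call
    (void)out;
}

If decibel also overloaded its own arithmetic operators, these coercions would compete with the overloads during resolution, which is precisely the interaction with Section 3.6.2 noted above.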
Overloading and Coercion
We have noted (in Section 3.6.3) that overloading and coercion (as well as various forms of polymorphism) can sometimes be used to similar effect. It is worth
repeating some of the distinctions here. An overloaded name can refer to more
than one object; the ambiguity must be resolved by context. In the expression
a + b , for example, + may refer to either the integer or the floating-point addition operation. In a language without coercion, a and b must either both be
integer or both be real; the compiler chooses the appropriate interpretation of +
depending on their type. In a language with coercion, + refers to the floatingpoint addition operation if either a or b is real; otherwise it refers to the integer
addition operation. If only one of a and b is real, the other is coerced to match.
One could imagine a language in which + was not overloaded, but rather referred
to floating-point addition in all cases. Coercion could still allow + to take integer
arguments, but they would always be converted to real. The problem with this
approach is that conversions from integer to floating-point format take a nonnegligible amount of time, especially on machines without hardware conversion
instructions, and floating-point addition is significantly more expensive than integer addition.
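A small C++ sketch (ours) makes the contrast concrete:

// Version 1: the operation is overloaded on both types, as a language
// without coercion would require.
int    sum(int a, int b)       { return a + b; }   // integer add chosen for ints
double sum(double a, double b) { return a + b; }   // floating-point add for doubles

// Version 2: only a floating-point version exists; integer arguments
// must be coerced.
double fsum(double a, double b) { return a + b; }

void demo(int i, int j) {
    int    x = sum(i, j);      // no conversions; integer addition
    double y = fsum(i, j);     // both arguments converted to double, then a
                               //   (more expensive) floating-point addition
    (void)x; (void)y;
}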
In most languages literal (manifest) constants (e.g., numbers, character
strings, the empty set [ [ ] ] or the null pointer [ nil ]) can be intermixed in
expressions with values of many types. One might say that constants are over-
loaded: nil for example might be thought of as referring to the null pointer
value for whatever type is needed in the surrounding context. More commonly,
however, constants are simply treated as a special case in the language’s typechecking rules. Internally, the compiler considers a constant to have one of a
small number of built-in “constant types” (int const, real const, string, nil),
which it then coerces to some more appropriate type as necessary, even if coercions are not supported elsewhere in the language. Ada formalizes this notion
of “constant type” for numeric quantities: an integer constant (one without a
decimal point) is said to have type universal_integer ; a floating-point constant (one with an embedded decimal point and/or an exponent) is said to have
type universal_real . The universal_integer type is compatible with any
type derived from integer ; universal_real is compatible with any type derived from real .
Generic Reference Types
For systems programming, or to facilitate the writing of general purpose container (collection) objects (lists, stacks, queues, sets, etc.) that hold references to
other objects, several languages provide a “generic reference” type. In C and C++,
this type is called void * . In Clu it is called any ; in Modula-2, address ; in
Modula-3, refany ; in Java, Object ; in C#, object . Arbitrary l-values can be
assigned into an object of generic reference type, with no concern about type
safety: because the type of the object referred to by a generic reference is unknown, the compiler will not allow any operations to be performed on that object. Assignments back into objects of a particular reference type (e.g., a pointer
to a programmer-specified record type) are a bit trickier, if type safety is to be
maintained. We would not want a generic reference to a floating-point number,
for example, to be assigned into a variable that is supposed to hold a reference to
an integer, because subsequent operations on the “integer” would interpret the
bits of the object incorrectly. In object-oriented languages, the question of how to
ensure the validity of a generic to specific assignment generalizes to the question
of how to ensure the validity of any assignment in which the type of the object
on the left-hand side supports operations that the object on the right-hand side may
not.
One way to ensure the safety of generic to specific assignments (or, in general,
less specific to more specific assignments) is to make objects self-descriptive—
that is, to include in the representation of each object an indication of its type.
This approach is common in object-oriented languages: it is taken in Java, C#,
Eiffel, Modula-3, and C++. (Smalltalk objects are self-descriptive, but Smalltalk
variables are not typed.) Type tags in objects can consume a nontrivial amount
of space, but allow the implementation to prevent the assignment of an object
of one type into a variable of another. In Java and C#, a generic to specific assignment requires a type cast, but will generate an exception if the generic reference does not refer to an object of the casted type. In Eiffel, the equivalent
operation uses a special assignment operator ( ?= instead of := ); in C++ it uses a
dynamic_cast operation.
EXAMPLE 7.27   Java container of Object
Java and C# programmers frequently create container classes that hold objects
of the generic reference class ( Object or object , respectively). When an object
is removed from a container, it must be assigned (with a type cast) into a variable
of an appropriate class before anything interesting can be done with it:3
import java.util.*;          // library containing Stack container class
...
Stack myStack = new Stack();
String s = "Hi, Mom";
Foo f = new Foo();           // f is of user-defined class type Foo
...
myStack.push(s);
myStack.push(f);             // we can push any kind of object on a stack
...
s = (String) myStack.pop();
    // type cast is required, and will generate an exception at run
    // time if element at top-of-stack is not a string
In a language without type tags, the assignment of a generic reference into
an object of a specific reference type cannot be checked, because objects are not
self-descriptive: there is no way to identify their type at run time. The programmer must therefore resort to an (unchecked) type conversion. C++ minimizes
the overhead of type tags by permitting dynamic_cast operations only on objects of polymorphic types. A thorough explanation of this restriction requires
an understanding of virtual methods and their implementation, something we
defer to Sections 9.4.1 and 9.4.2.
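A rough C++ analogue of the Java idiom above (our sketch; Object, MyString, and Foo are hypothetical stand-ins) shows why the restriction matters: the checked cast is available only when objects carry run-time type information, as they do for polymorphic classes but not for void*.

#include <string>
#include <vector>

struct Object   { virtual ~Object() = default; };   // polymorphic root class
struct MyString : Object { std::string val; };
struct Foo      : Object { };

void demo() {
    MyString s_obj;
    Foo      f_obj;

    std::vector<Object*> stack;        // container of generic references
    stack.push_back(&s_obj);
    stack.push_back(&f_obj);

    Object *top = stack.back();
    if (MyString *s = dynamic_cast<MyString*>(top))   // checked: consults the object's
        s->val = "Hi, Mom";                           //   run-time type information

    std::vector<void*> raw;            // void* carries no type information, so a
    raw.push_back(&f_obj);             //   cast back from it cannot be checked
    // Foo *f = static_cast<Foo*>(raw.back());        // unchecked; programmer must be right
}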
7.2.3   Type Inference
We have seen how type checking ensures that the components of an expression
(e.g., the arguments of a binary operator) have appropriate types. But what determines the type of the overall expression? In most cases, the answer is easy. The
result of an arithmetic operator usually has the same type as the operands. The
result of a comparison is usually Boolean. The result of a function call has the
type declared in the function’s header. The result of an assignment (in languages
in which assignments are expressions) has the same type as the left-hand side.
In a few cases, however, the answer is not obvious. In particular, operations on
subranges and on composite objects do not necessarily preserve the types of the
operands. We examine these cases in the remainder of this subsection. We then
consider (on the PLP CD) a more elaborate form of type inference found in ML,
Miranda, and Haskell.
3 If the programmer knows that a container will be used to hold objects of only one type, then
it may be possible to eliminate the type cast and, ideally, its run-time cost by using generics
(Section 8.4).
Subranges

EXAMPLE 7.28   Inference of subrange types
EXAMPLE 7.29   Using inference to avoid run-time checks
For simple arithmetic operators, the principal type system subtlety arises when
one or more operands have subrange types (what Ada calls subtypes with range
constraints). Given the following Pascal definitions, for example,
type Atype = 0..20;
Btype = 10..20;
var a : Atype;
b : Btype;
what is the type of a + b ? Certainly it is neither Atype nor Btype , since the possible values range from 10 to 40. One could imagine it being a new anonymous
subrange type with 10 and 40 as bounds. The usual answer in Pascal and its descendants is to say that the result of any arithmetic operation on a subrange has
the subrange’s base type, in this case integer.
In Ada, the type of an arithmetic expression assumes special significance in the
header of a for loop (Section 6.5.1) because it determines the type of the index
variable. For the sake of uniformity, Ada says that the index of a for loop always
has the base type of the loop bounds, whether they are built-up expressions or
simple variables or constants.
If the result of an arithmetic operation is assigned into a variable of a subrange type, then a dynamic semantic check may be required. To avoid the expense of some unnecessary checks, a compiler may keep track at compile time of
the largest and smallest possible values of each expression, in essence computing
the anonymous 10 . . . 40 type. Appropriate bounds for the result of an arithmetic
operator can always be calculated from the values for the operands. For addition,
for example,
result.min := operand1.min + operand2.min
result.max := operand1.max + operand2.max
For subtraction,
result.min := operand1.min − operand2.max
result.max := operand1.max − operand2.min
The rules for other operators are analogous.
When an expression is assigned to a subrange variable or passed as a subrange
parameter, the compiler can decide on the need for checks based on the bounds
of the expected type and on the minimum and maximum values maintained for
the expression. If the minimum possible value of the expression is smaller than
the lower bound of the expected type, or if the maximum possible value of the
expression is larger than the upper bound of the expected type, a run-time check
is required. At the same time, if the minimum possible value of the expression
is larger than the upper bound of the expected type, or the maximum possible
value of the expression is smaller than the lower bound of the expected type, then
the compiler can issue a semantic error message at compile time.
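As an illustration (ours; the names are invented), a compiler's bounds-tracking analysis and its three-way decision might be sketched as follows:

// Bounds a compiler might track for an integer-valued expression.
struct Range { long min, max; };

Range add(Range a, Range b) { return { a.min + b.min, a.max + b.max }; }
Range sub(Range a, Range b) { return { a.min - b.max, a.max - b.min }; }

enum class Check { none_needed, static_error, runtime_check };

// What happens when an expression with bounds e is assigned to a variable
// whose subrange type has bounds target.
Check classify(Range e, Range target) {
    if (e.min >= target.min && e.max <= target.max)
        return Check::none_needed;        // always within the target subrange
    if (e.min > target.max || e.max < target.min)
        return Check::static_error;       // can never be within the subrange
    return Check::runtime_check;          // sometimes within: test at run time
}

// Example 7.28 revisited: a : 0..20 and b : 10..20 give a + b the bounds
// 10..40, so classify({10, 40}, {0, 20}) yields Check::runtime_check.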
EXAMPLE 7.30   Heuristic nature of subrange inference
It should be noted that this bounds-tracking technique will not eliminate all
unnecessary checks. In the following Ada code, for example, a compiler that
aimed to do a perfect job of predicting the need for dynamic semantic checks
would need to predict the possible return values of a programmer-specified function.
a : integer range 0..20;
b : integer range 10..20;
function foo(i : integer) return integer is ...
...
a := b - foo(10);      -- does this require a dynamic semantic check?
If foo(10) is guaranteed to lie between 0 and 10, then no dynamic check is required; the assignment is sure to be ok. If foo(10) is guaranteed to be greater
than 20 or less than −10, then again no check is required; an error can be announced at compile time. Unfortunately, the value of foo may depend on values
read at run time. Even if it does not, basic results in complexity theory imply that
no compiler will be able to predict the behavior of all user-specified functions.
Because of these limitations, the compiler must inevitably generate some unnecessary run-time checks; straightforward tracking of the minimum and maximum
values for expressions is only a heuristic that allows us to eliminate some unnecessary checks in practice. More sophisticated techniques can be used to eliminate
many checks in loops; we will consider these in Section 15.5.2.
Composite Types
Most built-in operators in most languages take operands of built-in types. Some
operators, however, can be applied to values of composite types, including
aggregates. Type inference becomes an issue when an operation on composites yields
a result of a different type than the operands.

EXAMPLE 7.31   Type inference on string operations
Character strings provide a simple example. In Pascal, the literal string 'abc'
has type array [1..3] of char . In Ada, the analogous string (denoted "abc" )
is considered to have an incompletely specified type that is compatible with
any three-element array of characters. In the Ada expression "abc" & "defg" ,
"abc" is a three-character array, "defg" is a four-character array, and the result is a seven-character array formed by concatenating the two. For all three, the
size of the array is known, but the bounds and the index type are not; they must
be inferred from context. The seven-character result of the concatenation could
be assigned into an array of type array (1..7) of character or into an array of type array (weekday) of character , or into any other seven-element
character array.
EXAMPLE 7.32   Type inference for sets
Operations on composite values also occur when manipulating sets in Pascal
and Modula. As with string concatenation, operations on sets do not necessarily
produce a result of the same type as the operands. Consider the following example in Pascal.
var A : set of 1..10;
    B : set of 10..20;
    C : set of 1..15;
    i : 1..30;
...
C := A + B * [1..5, i];
Pascal provides three operations on sets: union ( + ), intersection ( * ), and difference ( - ). Set operands are said to have compatible types if their elements have
the same base type T . The result of a set operation is then of type set of T . In
the example above, A , B , and the constructed set [1..5, i] all have the same
base type—namely integer. The type of the right-hand side of the assignment is
therefore set of integer . When an expression is assigned to a set variable or
passed as a set parameter, a dynamic semantic check may be required. In the example, the assignment will require a check to ensure that none of the possible
values between 16 and 20 actually occur in the set.
As with subranges, a compiler can avoid the need for checks in certain cases
by keeping track of the minimum and maximum possible members of the set
expression. Because a set may have many members, some of which may be known
at compile time, it can be useful to track not only the largest and smallest values
that may be in a set, but also the values that are known to be in the set (see
Exercise 7.7).
In Section 7.2.2 we noted that Fortran 90 allows all of its built-in arithmetic
operations to be applied to arrays. The result of an array operation has the same
shape as the operands. Each of its elements is the result of applying the operation
to the corresponding elements of the operand arrays. Since shape is preserved,
type inference is not an issue.
7.2.4   The ML Type System
The most sophisticated form of type inference occurs in certain functional
languages—notably ML, Miranda, and Haskell. Programmers have the option
of declaring the types of objects in these languages, in which case the compiler
behaves much like that of a more traditional statically typed language. As we
noted near the beginning of Section 7.1, however, programmers may also choose
not to declare certain types, in which case the compiler will infer them, based on
the known types of manifest constants, the explicitly declared types of any objects that have them, and the syntactic structure of the program. ML-style type
inference is the invention of the language’s creator, Robin Milner.4
4 Robin Milner (1934–), of Cambridge University’s Computer Laboratory, is responsible not only
for the development of ML and its type system, but for the Logic of Computable Functions,
which provides a formal basis for machine-assisted proof construction, and the Calculus of
Communicating Systems, which provides a general theory of concurrency. He received the ACM
Turing Award in 1991.
IN MORE DEPTH
The key to type inference in ML and its descendants is to unify the (partial)
type information available for two expressions whenever the rules of the type
system say that their types must be the same. Information known about each is
then known about the other as well. Any discovered inconsistencies are identified
as static semantic errors. Any expression whose type remains incompletely specified after inference is automatically polymorphic; this is the implicit parametric
polymorphism referred to in Section 3.6.3. ML family languages also incorporate
a powerful run-time pattern-matching facility and several unconventional structured types, including ordered tuples, (unordered) records, lists, and a datatype
mechanism that subsumes unions and recursive types.
C H E C K YO U R U N D E R S TA N D I N G
11. What is the difference between type equivalence and type compatibility?
12. Discuss the comparative advantages of structural and name equivalence for
types. Name three languages that use each approach.
13. Explain the difference between strict and loose name equivalence.
14. Explain the distinction between derived types and subtypes in Ada.
15. Explain the difference between type conversion, type coercion, and nonconverting type casts.
16. Summarize the arguments for and against coercion.
17. Under what circumstances does a type conversion require a run-time check?
18. What purpose is served by “generic reference” types?
19. What is type inference? Describe three contexts in which it occurs.
7.3   Records (Structures) and Variants (Unions)
As we have seen, record types allow related data of heterogeneous types to be
stored and manipulated together. Some languages (notably Algol 68, C, C++, and
Common Lisp) use the term structure (declared with the keyword struct ) instead of record. Fortran 90 simply calls its records “types”: they are the only form
of programmer-defined type other than arrays, which have their own special syntax. Structures in C++ are defined as a special form of class (one in which members are globally visible by default). Java has no distinguished notion of struct ;
its programmers use classes in all cases. C# uses a reference model for variables
of class types, and a value model for variables of struct types. C# struct s do
not support inheritance.
7.3.1   Syntax and Operations

EXAMPLE 7.33   A Pascal record
In Pascal, a simple record might be defined as follows.
type two_chars = packed array [1..2] of char;
(* Packed arrays will be explained in Example 7.39.
Packed arrays of char are compatible with quoted strings. *)
type element = record
name : two_chars;
atomic_number : integer;
atomic_weight : real;
metallic : Boolean
end;
EXAMPLE 7.34   A C struct
In C, the corresponding declaration would be
struct element {
char name[2];
int atomic_number;
double atomic_weight;
_Bool metallic;
};
EXAMPLE 7.35   Accessing record fields
Each of the record components is known as a field. To refer to a given field of
a record, most languages use “dot” notation. In Pascal:
var copper : element;
const AN = 6.022e23;      (* Avogadro's number *)
...
copper.name := 'Cu';
atoms := mass / copper.atomic_weight * AN;
The C notation is similar to that of Pascal; in Fortran 90 one would say
copper%name and copper%atomic_weight . Cobol and Algol 68 reverse the order of the field and record names: name of copper and atomic_weight of
copper . ML’s notation is also “reversed,” but uses a prefix # : #name copper
and #atomic_weight copper . (Fields of an ML record can also be extracted
using patterns.) In Common Lisp, one would say (element-name copper) and
(element-atomic_weight copper) .
EXAMPLE 7.36   Nested records
Most languages allow record definitions to be nested. Again in Pascal:
type short_string = packed array [1..30] of char;
type ore = record
name : short_string;
element_yielded : record
name : two_chars;
atomic_number : integer;
atomic_weight : real;
metallic : Boolean
end
end;
Alternatively, one could say

type ore = record
    name : short_string;
    element_yielded : element
end;

Figure 7.1   Likely layout in memory for objects of type element on a 32-bit machine.
Alignment restrictions lead to the shaded “holes.”
In Fortran 90 and Common Lisp, only the second alternative is permitted: record
fields can have record types, but the declarations cannot be lexically nested.
Naming for nested records is straightforward: malachite.element_yielded.atomic_number
in Pascal or C; atomic_number of element_yielded of malachite in Cobol;
#atomic_number (#element_yielded malachite) in ML; (element-atomic_number
(ore-element_yielded malachite)) in Common Lisp.

EXAMPLE 7.37   ML records and tuples
As noted in Example 7.14, ML differs from most languages in specifying that
the order of record fields is insignificant. The ML record value {name = "Cu",
atomic_number = 29, atomic_weight = 63.546, metallic = true} is the
same as the value {atomic_number = 29, name = "Cu", atomic_weight =
63.546, metallic = true} (they will test true for equality). ML tuples are
defined as abbreviations for records whose field names are small integers. The
values ("Cu", 29) , {1 = "Cu", 2 = 29} , and {2 = 29, 1 = "Cu"} will all
test true for equality.
7.3.2   Memory Layout and Its Impact
The fields of a record are usually stored in adjacent locations in memory. In its
symbol table, the compiler keeps track of the offset of each field within each
record type. When it needs to access a field, the compiler typically generates a
load or store instruction with displacement addressing. For a local object, the
base register is the frame pointer; for a global object, the base register is the globals pointer. In either case, the displacement is the sum of the record’s offset from
the register and the field’s offset within the record.
EXAMPLE 7.38   Memory layout for a record type
A likely layout for our element type on a 32-bit machine appears in Figure 7.1.
Because the name field is only two characters long, it occupies two bytes
in memory. Since atomic_number is an integer, and must (on most machines) be
longword-aligned, there is a two-byte “hole” between the end of name and the
beginning of atomic_number . Similarly, since Boolean variables (in most language
implementations) occupy a single byte, there are three bytes of empty space between
the end of the metallic field and the next aligned location. In an array of
element s, most compilers would devote 20 bytes to every member of the array.
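One way to observe these holes directly (a sketch under the assumption of a typical machine; exact numbers vary with target and compiler) is to print the field offsets and total size of the C declaration from Example 7.34:

#include <cstddef>     // offsetof
#include <cstdio>

struct element {       // the C declaration of Example 7.34
    char   name[2];
    int    atomic_number;
    double atomic_weight;
    bool   metallic;
};

int main() {
    // On a common 64-bit target (4-byte int alignment, 8-byte double alignment)
    // this prints offsets 0, 4, 8, 16 and size 24: a two-byte hole after name
    // and seven bytes of padding after metallic.  The 32-bit layout of
    // Figure 7.1, with 4-byte alignment for double, gives 20 bytes instead.
    std::printf("name=%zu number=%zu weight=%zu metallic=%zu size=%zu\n",
                offsetof(element, name), offsetof(element, atomic_number),
                offsetof(element, atomic_weight), offsetof(element, metallic),
                sizeof(element));
}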
EXAMPLE 7.39   Layout of packed types
Pascal allows the programmer to specify that a record type (or an array, set, or
file type) should be packed:
type element = packed record
name : two_chars;
atomic_number : integer;
atomic_weight : real;
metallic : Boolean
end;
Figure 7.2   Likely memory layout for packed element records. The atomic_number and
atomic_weight fields are nonaligned, and can only be read or written (on most machines)
via multi-instruction sequences.
The keyword packed indicates that the compiler should optimize for space instead of speed. In most implementations a compiler will implement a packed
record without holes, by simply “pushing the fields together.” To access a nonaligned field, however, it will have to issue a multi-instruction sequence that
retrieves the pieces of the field from memory and then reassembles them in a
register. A likely packed layout for our element type (again for a 32-bit machine) appears in Figure 7.2. It is 15 bytes in length. An array of packed element
records would probably devote 16 bytes to each member of the array—that is,
it would align each element. A packed array of packed records would probably devote only 15 bytes to each; only every fourth element would be aligned.
Ada, Modula-3, and C provide more elaborate packing mechanisms, which allow the programmer to specify precisely how many bits are to be devoted to each
field.
EXAMPLE 7.40   Assignment and comparison of records
Most languages allow a value to be assigned to an entire record in a single
operation:
my_element := copper;
Ada also allows records to be compared for equality ( if my_element = copper
then ... ), but most other languages (including Pascal, Modula, C, and C++) do
not, though C++ allows the programmer to define equality tests for individual
record types.
Figure 7.3   Rearranging record fields to minimize holes. By sorting fields according
to the size of their alignment constraint, a compiler can minimize the space devoted
to holes, while keeping the fields aligned.
For small records, both copies and comparisons can be performed in-line on
a field-by-field basis. For longer records, we can save significantly on code space
by deferring to a library routine. A block_copy routine can take source address,
destination address, and length as arguments, but the analogous block_compare
routine would fail on records with different (garbage) data in the holes. One solution is to arrange for all holes to contain some predictable value (e.g., zero),
but this requires code at every elaboration point. Another is to have the compiler
generate a customized field-by-field comparison routine for every record type.
Different routines would be called to compare records of different types. Languages like Pascal and C avoid the whole issue by simply outlawing full-record
comparisons.
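The hazard of byte-wise comparison can be sketched in C++ (ours): a memcmp-style block compare also examines the padding bytes, which may hold garbage, while a field-by-field comparison does not.

#include <cstring>

struct element {                       // same fields as before; contains holes
    char   name[2];
    int    atomic_number;
    double atomic_weight;
    bool   metallic;
};

// The comparison a compiler (or programmer) would generate field by field.
bool equal_fields(const element &a, const element &b) {
    return a.name[0] == b.name[0] && a.name[1] == b.name[1] &&
           a.atomic_number == b.atomic_number &&
           a.atomic_weight == b.atomic_weight &&
           a.metallic == b.metallic;
}

// A block comparison also examines the padding bytes, which need not hold
// any particular value, so it may report logically equal records as unequal.
bool equal_bytes(const element &a, const element &b) {
    return std::memcmp(&a, &b, sizeof(element)) == 0;
}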
EXAMPLE 7.41   Minimizing holes by sorting fields
In addition to complicating comparisons, holes in records waste space. Packing
eliminates holes, but at potentially heavy cost in access time. A compromise,
adopted by some compilers, is to sort a record’s fields according to the size of their
alignment constraints. All byte-aligned fields come first, followed by any
halfword-aligned fields, word-aligned fields, and (if the hardware requires)
doubleword-aligned fields. For our element type, the resulting rearrangement is shown
in Figure 7.3.
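In C or C++ a programmer (or a compiler, internally) can apply the same idea simply by ordering declarations appropriately; a hedged sketch:

// Fields ordered as Figure 7.3 suggests: byte-aligned fields first, then
// word-aligned, then doubleword-aligned.  With 4-byte ints and 8-byte doubles
// this typically occupies 16 bytes, with a single hole before atomic_number,
// versus 24 bytes (or 20 in the 32-bit layout of Figure 7.1) for the original
// declaration order.
struct element_sorted {
    char   name[2];          // byte-aligned
    bool   metallic;         // byte-aligned
    int    atomic_number;    // word-aligned: one byte of padding precedes it
    double atomic_weight;    // doubleword-aligned
};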
In most cases, reordering of fields is purely an implementation issue: the programmer need not be aware of it, as long as all instances of a record type are
reordered in the same way. The exception occurs in systems programs, which
sometimes “look inside” the implementation of a data type with the expectation
that it will be mapped to memory in a particular way. A kernel programmer, for
example, may count on a particular layout strategy in order to define a record
that mimics the organization of memory-mapped control registers for a particular Ethernet device. C and C++, which are designed in large part for systems
programs, guarantee that the fields of a struct will be allocated in the order declared. The first field is guaranteed to have the coarsest alignment required by the
hardware for any type (generally a four- or eight-byte boundary). Subsequent
fields have the natural alignment for their type. To accommodate systems programs, Ada and C++ allow the programmer to specify nonstandard alignment
for the fields of specific record types.

D E S I G N & I M P L E M E N TAT I O N
The order of record fields
Issues of record field order are intimately tied to implementation tradeoffs:
Holes in records waste space, but alignment makes for faster access. If holes
contain garbage, we can’t compare records by looping over words or bytes, but
zeroing out the holes would incur costs in time and code space. Predictable
layout is important for mirroring hardware structures in “systems” languages,
but reorganization may be advantageous in large records if we can group
frequently accessed fields together, so they lie in the same cache line.
7.3.3   With Statements

EXAMPLE 7.42   Pascal with statement
In programs with complicated data structures, manipulating the fields of a deeply
nested record can be awkward:
ruby.chemical_composition.elements[1].name := 'Al';
ruby.chemical_composition.elements[1].atomic_number := 13;
ruby.chemical_composition.elements[1].atomic_weight := 26.98154;
ruby.chemical_composition.elements[1].metallic := true;
Pascal provides a with statement to simplify such constructions:
with ruby.chemical_composition.elements[1] do begin
name := 'Al';
atomic_number := 13;
atomic_weight := 26.98154;
metallic := true
end;
IN MORE DEPTH
Pascal with statements are generally considered an improvement on the earlier
elliptical references of Cobol and PL/I. They still suffer from several limitations,
however, most of which are addressed in Modula-3. Similar functionality can be
achieved with nested scopes in languages like Lisp and ML, which use a reference
model of variables, and in languages like C and C++, which allow the programmer to create pointers or references to arbitrary objects.
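For instance, the C++ variant of that idea can be sketched as follows (ours; the type names echo Example 7.42 but are hypothetical):

#include <array>

// Minimal stand-ins for the Pascal records of Example 7.42 (names are ours).
struct element {
    std::array<char, 2> name;
    int    atomic_number;
    double atomic_weight;
    bool   metallic;
};
struct composition { std::array<element, 10> elements; };
struct gem         { composition chemical_composition; };

void update(gem &ruby) {
    // A local reference plays the role of Pascal's 'with': a short alias for
    // a deeply nested component (C++ arrays are 0-based, hence elements[0]).
    element &e = ruby.chemical_composition.elements[0];
    e.name = { 'A', 'l' };
    e.atomic_number = 13;
    e.atomic_weight = 26.98154;
    e.metallic = true;
}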
7.3.4   Variant Records

EXAMPLE 7.43   Variant record in Pascal
A variant record provides two or more alternative fields or collections of fields,
only one of which is valid at any given time. In Pascal, we might augment our
element type as follows.
type long_string = packed array [1..200] of char;
type string_ptr = ^long_string;
type element = record
name : two_chars;
atomic_number : integer;
atomic_weight : real;
metallic : Boolean;
case naturally_occurring : Boolean of
true : (
source : string_ptr;
(* textual description of principal commercial source *)
prevalence : real;
(* fraction, by weight, of Earth’s crust *)
);
false : (
lifetime : real;
(* half-life in seconds of the most stable known isotope *)
)
end;
Here the naturally_occurring field of the record is known as its tag, or
discriminant. A true tag indicates that the element has at least one naturally
occurring stable isotope; in this case the record contains two additional fields—
source and prevalence —that describe how the element may be obtained and
how commonly it occurs. A false tag indicates that the element results only
from atomic collisions or the decay of heavier elements; in this case, the record
contains an additional field— lifetime —that indicates how long atoms so
created tend to survive before undergoing radioactive decay. Each of the parenthesized field lists (one containing source and prevalence , the other containing lifetime ) is known as a variant. Either the first or the second variant
may be useful, but never both at once. From an implementation point of view,
these nonoverlapping uses mean that the variants may share space (see Figure 7.4).
Figure 7.4   Likely memory layouts for element variants. The value of the
naturally_occurring field (shown here with a double border) determines which of the
interpretations of the remaining space is valid. Type string_ptr is assumed to be
represented by a (four-byte) pointer to dynamically allocated storage.

EXAMPLE 7.44   Fortran equivalence statement
Variant records have their roots in the equivalence statement of Fortran I
and in the union types of Algol 68. The Fortran syntax looks like this:
integer i
real r
logical b
equivalence (i, r, b)
The equivalence statement informs the compiler that i , r , and b will never be
used at the same time, and should share the same space in memory.
EXAMPLE 7.45   Mixing structs and unions in C
Pascal’s principal contribution to union types (retained by Modula and Ada)
was to integrate them with records. This was an important contribution, because
the need for alternative types seldom arises anywhere else. In our running example, we use the same field-name syntax to access both the atomic_weight
and lifetime fields of an element , despite the fact that the former is present
in every element , while the latter is present only in those that are not naturally
occurring. Without the integration of records and unions, the notation is less
convenient. Here’s what it looks like in C:
struct element {
    char name[2];
    int atomic_number;
    double atomic_weight;
    _Bool metallic;
    _Bool naturally_occurring;
    union {
        struct {
            char *source;
            double prevalence;
        } natural_info;
        double lifetime;
    } extra_fields;
} copper;
Because the union is a separate construct, not an integral part of the struct , we have to introduce two extra
levels of naming. The third field is still copper.atomic_weight , but the source
field must be accessed as copper.extra_fields.natural_info.source . A similar situation occurs in ML, in which datatype s can be used for unions, but the
notation is not integrated with records (Exercise 7.33).
Safety
EXAMPLE 7.46   Breaking type safety with equivalence
One of the principal problems with equivalence statements is that they provide
no built-in means of determining which of the equivalence -ed objects is currently valid: the program must keep track. Mistakes in which the programmer
writes to one object and then reads from the other are relatively common:
r = 3.0
...
print '(I10)', i
Here the print statement, which attempts to output i as a 10-digit integer, will
(in most implementations) take its bits from the floating-point representation
of 3.0. This is almost certainly a mistake, but one that the language implementation will not catch.
EXAMPLE 7.47   Union conformity in Algol 68
Fortran equivalence statements introduce an extreme case of aliases: not
only are there two names for the “same thing” (in this case the same block of
storage), but the types associated with those names are different. To address this
potential source of bugs, the Algol 68 designers required that the language implementation track union -ed types at run time:
union (int, real, bool) uirb
# uirb can be an integer, a floating-point number, or a Boolean #
...
uirb := 1
# uirb is now an integer #
...
uirb := 3.14
# uirb is now a floating-point number #
To use the value stored inside a union, the programmer must employ a special
form of case statement (called a conformity clause in Algol 68) that determines
which type is currently valid:
case uirb in
(int i) : print(i),
(real r) : print(r),
(bool b) : print(b)
esac
The labels on the arms of the case statement provide names for the “deunified”
values. A similar tagcase construct can be found in Clu.
To enforce correct usage of union types in Algol 68, the language implementation must maintain a hidden variable for every union object that indicates which
type is currently valid. When an object of a union type is assigned a value, the
hidden variable is also set, to indicate the type of the value just assigned. When
execution encounters a conformity clause, the hidden field is inspected to determine which arm to execute.
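As a rough illustration (ours, not from the text), the following C++ sketch mimics what an Algol 68 implementation must do: it pairs a hidden tag with the shared storage and consults the tag wherever a conformity clause would.

#include <cstdio>

// A hand-built discriminated union: the 'which' field plays the role of the
// hidden variable that an Algol 68 implementation maintains for every union.
struct uirb_t {
    enum { is_int, is_real, is_bool } which;
    union { int i; double r; bool b; } val;
};

void assign_int(uirb_t &u, int i)     { u.which = uirb_t::is_int;  u.val.i = i; }
void assign_real(uirb_t &u, double r) { u.which = uirb_t::is_real; u.val.r = r; }

// The analogue of a conformity clause: inspect the tag, then use the
// corresponding interpretation of the shared storage.
void print(const uirb_t &u) {
    switch (u.which) {
    case uirb_t::is_int:  std::printf("%d\n", u.val.i);                    break;
    case uirb_t::is_real: std::printf("%g\n", u.val.r);                    break;
    case uirb_t::is_bool: std::printf("%s\n", u.val.b ? "true" : "false"); break;
    }
}

Unlike Algol 68, of course, nothing in this sketch prevents a client from reading u.val.i directly after assign_real, which is precisely the loophole in Pascal variant records discussed below.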
EXAMPLE 7.48   Tagged variant record in Pascal
In effect, the tag field of a Pascal variant record is an explicit representation
of the hidden variable required in an Algol 68 union. Our integer/floating-point/
Boolean example could be written as follows in Pascal.
type tag = (is_int, is_real, is_bool);
var uirb : record
case which : tag of
is_int : (i : integer);
is_real : (r : real);
is_bool : (b : Boolean)
end;
EXAMPLE 7.49   Breaking type safety with variant records
Unfortunately, while the hidden tag of an Algol 68 union can only be changed
implicitly, by assigning a value of a different type to the union as a whole, the tag
of a Pascal variant record can be changed by an ordinary assignment statement.
The compiler can generate code to verify that a field in variant v is never accessed
unless the value of the tag indicates that v is currently valid, but this is not enough
to guarantee type safety. It can catch errors of the form
uirb.which := is_real;
uirb.r := 3.0;
...
writeln(uirb.i);       (* dynamic semantic error *)
but it cannot catch the following.
uirb.which := is_real;
uirb.r := 3.0;
uirb.which := is_int;
...                    (* no intervening assignment to i *)
writeln(uirb.i);       (* ouch! *)
Any Pascal implementation will accept this code, but the output is likely to be
erroneous, just as it was in Fortran.
Semantically speaking, changing the tag of a Pascal variant record should make
the remaining fields of the variant uninitialized. It is possible, by adding hidden
fields, to flag them as such and generate a semantic error message on any subsequent access, but the code to do so is expensive [FL80], and outlaws programs
that, while arguably erroneous, are permitted by the language definition (Exercise 7.12).
EXAMPLE 7.50   Untagged variants in Pascal
The situation in Pascal is actually worse than our example so far might imply.
Additional insecurity stems from the fact that Pascal’s tag fields are optional. We
could eliminate the which field of our uirb record:
var uirb : record
case tag of
is_int : (i : integer);
is_real : (r : real);
is_bool : (b : Boolean)
end;
...
uirb.r := 3.0;
...                    (* no intervening assignment to i *)
writeln(uirb.i);       (* ouch! *)
Now the language implementation is not required to devote any space to either
an explicit or hidden tag, but even the limited form of checking (make sure the
tag has an appropriate value when a field of a variant is accessed) is no longer
possible (but see Exercise 7.13). Variant records with tags (explicit or hidden)
are known as discriminated unions. Variant records without tags are known as
nondiscriminated unions.
The degree of type safety provided is arguably the most important dimension
of variation among the variant records and union types of modern languages.
Though designed after Algol 68 (and borrowing its union terminology), the
union types of C are semantically closer to Fortran’s equivalence statements.
Their fields share space, but nothing prevents the programmer from using them
in inappropriate ways. By contrast, the variant records of Ada are syntactically
similar to those of Pascal, but are as type-safe as the unions of Algol 68. Concerned at the lack of type safety in Pascal and Modula-2, and reluctant to introduce the complexity of Ada’s rules, the designers of Modula-3 chose to eliminate
variant records from the language entirely. They note [Har92, p. 110] that much
of the same effect can be obtained via object types and subtypes. The designers
of Java and C#, likewise, dropped the union s of C and C++.
Variants in Ada
EXAMPLE 7.51   Ada variants and tags (discriminants)
Ada variant records must always have a tag (called the discriminant in Ada).
Moreover, the tag can never be changed without simultaneously assigning values
to all of the fields of the corresponding variant. The assignment can occur either
via whole-record assignment (e.g., A := B , where A and B are variant records) or
via assignment of an aggregate (e.g., A := (which => is_real, r => pi); ).
In addition to appearing as a field within the record, the discriminant of a variant
record in Ada must also appear in the header of the record’s declaration:
type element (naturally_occurring : Boolean := true) is record
name : string (1..2);
atomic_number : integer;
atomic_weight : real;
metallic : Boolean;
case naturally_occurring is
when true =>
source : string_ptr;
prevalence : real;
when false =>
lifetime : real;
end case;
end record;
Here we have not only declared the discriminant of the record in its header,
we have also specified a default value for it. A declaration of a variable of type
element has the option of accepting this default value:
copper : element;
or overriding it:
plutonium : element (false);
neptunium : element (naturally_occurring => false);     -- alternative syntax
If the type declaration for element did not specify a default value for
naturally_occurring , then all variables of type element would have to provide
a value. These rules guarantee that the tag field of a variant record is never
uninitialized.
An Ada record variable whose declaration specifies a value for the discriminant is said to be constrained. Its tag field can never be changed by a subsequent
assignment. This immutability means that the compiler can allocate just enough
space to hold the specified variant; this space may in some cases be significantly
smaller than would be required for other variants. A variable whose declaration
does not provide an initial value for the discriminant is said to be unconstrained.
Its tag will be initialized to the value in the type declaration, but may be changed
by later (whole-record) assignments, so the space that the record occupies must
be large enough to hold any possible variant.
EXAMPLE 7.52   A discriminated subtype in Ada
An Ada subtype definition can also constrain the discriminant(s) of its parent
type:
subtype natural_element is element (true);
Variables of type natural_element will all be constrained; their naturally_
occurring field cannot be changed. Because natural_element is a subtype,
rather than a derived type, values of type element and natural_element are
compatible with each other, though a run-time semantic check will usually be
required to assign the former into the latter.
EXAMPLE 7.53   Discriminated array in Ada
Ada uses record discriminants not only for variant tags, but in general for any
value that affects the size of a record.
D E S I G N & I M P L E M E N TAT I O N
The placement of variant fields
To facilitate space-saving in constrained variant records, Ada requires that all
variant parts of a record appear at the end. This rule ensures that every field
has a constant offset from the beginning of the record, with no holes (in any
variant) other than those required for alignment. When a constrained variant record is elaborated, the Ada run-time system need only allocate sufficient
space to hold the specified variant, which is never allowed to change. Pascal has
a similar rule, designed for a similar purpose. When a variant record is allocated from the heap in Pascal (via the built-in new operator), the programmer
has the option of specifying case labels for the variant portions of the record.
A record so allocated is never allowed to change to a different variant, so the
implementation can allocate precisely the right amount of space.
Modula-2, which does not provide new as a built-in operation, eliminates
the ordering restriction on variants. All variables of a variant record type must
be large enough to hold any variant. The usual implementation assigns a fixed
offset to every field, with holes following small internal variants as necessary
(see Figure 7.5 and Exercise 7.14).
TYPE element = RECORD
name : ARRAY [1..2] OF CHAR;
metallic : BOOLEAN;
CASE naturally_occurring : BOOLEAN OF
TRUE :
source : string_ptr;
prevalence : REAL;
| FALSE :
lifetime : REAL;
END;
atomic_number : INTEGER;
atomic_weight : REAL;
END;
Figure 7.5   Likely memory layout for a variant record in Modula-2. Here the variant portion
of the record is not required to lie at the end. Every field has a fixed offset from the beginning
of the record, with internal holes as necessary following small-size variants.
Here is an example that uses a discriminant to specify the length of an array:

type element_array is array (integer range <>) of element;
type alloy (num_components : integer) is record
name : string (1..30);
components : element_array (1..num_components);
tensile_strength : real;
end record;
The <> notation in the initial definition of element_array indicates that the
bounds are not statically known. We will have more to say about dynamic arrays
in Section 7.4.2. As with discriminants used for variant tags, the programmer
must either specify a default value for the discriminant in the type declaration
(we did not do so above) or else every declaration of a variable of the type must
specify a value for the discriminant (in which case the variable is constrained,
and the discriminant cannot be changed).
C H E C K YO U R U N D E R S TA N D I N G
20. Discuss the significance of “holes” in records. Why do they arise? What problems do they cause?
21. What is packing? What are its advantages and disadvantages?
22. Why might a compiler reorder the fields of a record? What problems might
this cause?
23. Why is it useful to integrate variants (unions) with records (structs)? Why
not leave them as separate mechanisms, as they are in Algol 68 and C?
24. Discuss the type safety problems that arise with variant records. How can
these problems be addressed?
25. What is a tag (discriminant)? How does it differ from an ordinary field?
26. Summarize the rules that prevent access to inappropriate fields of a variant
record in Ada.
27. Why might one wish to constrain a variable, so that it can hold only one
variant of a type?
7.4   Arrays
Arrays are the most common and important composite data types. They have
been a fundamental part of almost every high-level language, beginning with
Fortran I. Unlike records, which group related fields of disparate types, arrays are
usually homogeneous. Semantically, they can be thought of as a mapping from an
index type to a component or element type. Some languages (e.g., Fortran) require
that the index type be integer ; many languages allow it to be any discrete type.
Some languages (e.g., Fortran 77) require that the element type of an array be
scalar. Most (including Fortran 90) allow any element type.
Some languages (notably scripting languages) allow nondiscrete index types.
The resulting associative arrays must generally be implemented with hash tables,
rather than with the more efficient contiguous allocation to be described in Section 7.4.3. Associative arrays in C++ are known as map s; they are supported by a
standard library template. Java and C# have similar library classes. For the purposes of this chapter, we will assume that array indices are discrete (but see Section 13.4.3).
7.4.1   Syntax and Operations
Most languages refer to an element of an array by appending a subscript—
delimited by parentheses or square brackets—to the name of the array. In Fortran and Ada, one says A(3) ; in Pascal and C, one says A[3] . Since parentheses
are generally used to delimit the arguments to a subroutine call, square bracket
subscript notation has the advantage of distinguishing between the two. The difference in notation makes a program easier to compile and, arguably, easier to
read. Fortran’s use of parentheses for arrays stems from the absence of square
bracket characters on IBM keypunch machines, which at one time were widely
used to enter Fortran programs. Ada’s use of parentheses represents a deliberate
decision on the part of the language designers to embrace notational ambiguity
for functions and arrays. If we think of an array as a mapping from the index
type to the element type, it makes perfectly good sense to use the same notation used for functions. In some cases, a programmer may even choose to change
from an array to a function-based implementation of a mapping, or vice versa
(Exercise 7.15).
Declarations
EXAMPLE 7.54   Array declarations
In some languages one declares an array by appending subscript notation to the
syntax that would be used to declare a scalar. In C:
char upper[26];
In Fortran:
character, dimension (1:26) :: upper
character (26) upper                  ! shorthand notation
In C, the lower bound of an index range is always zero: the indices of an n-element
array are 0 . . n − 1. In Fortran, the lower bound of the index range is one by
default. Fortran 90 allows a different lower bound to be specified if desired, using
the notation shown in the first of the two declarations above.
In other languages, arrays are declared with an array constructor. In Pascal:
var upper : array ['a'..'z'] of char;
In Ada:
upper : array (character range 'a'..'z') of character;
EXAMPLE 7.55   Multidimensional arrays
Most languages make it easy to declare multidimensional arrays:
matrix : array (1..10, 1..10) of real;       -- Ada
real, dimension (10,10) :: matrix            ! Fortran
In some languages (e.g., Pascal, Ada, and Modula-3), one can also declare a multidimensional array by using the array constructor more than once in the same
declaration. In Modula-3,
VAR matrix : ARRAY [1..10], [1..10] OF REAL;
is syntactic sugar for
VAR matrix : ARRAY [1..10] OF ARRAY [1..10] OF REAL;
and matrix[3, 4] is syntactic sugar for matrix[3][4] . Similar equivalences
hold in Pascal.
EXAMPLE 7.56   Multidimensional v. built-up arrays
In Ada, by contrast,
matrix : array (1..10, 1..10) of real;
is not the same as
matrix : array (1..10) of array (1..10) of real;
The former is a two-dimensional array, while the latter is an array of one-dimensional
arrays. With the former declaration, we can access individual real numbers as
matrix(3, 4) ; with the latter we must say matrix(3)(4) . The two-dimensional array
is arguably more elegant, but the array of arrays supports additional operations: it
allows us to name the rows of matrix individually ( matrix(3) is a 10-element,
single-dimensional array), and it allows us to take slices, as discussed below.
D E S I G N & I M P L E M E N TAT I O N
Is [ ] an operator?
The definition of associative arrays in C++ leverages the ability to overload
square brackets ( [ ] ), which C++ treats as an operator. C#, like C++, provides
extensive facilities for operator overloading, but it does not use these facilities
to support associative arrays. Instead, the language provides a special indexer
mechanism, with its own unique syntax:
class directory {
    Hashtable table;                       // from standard library
    ...
    public directory() {                   // constructor
        table = new Hashtable();
    }
    ...
    public string this[string name] {      // indexer method
        get {
            return (string) table.get_Item(name);
        }
        set {
            table.Add(name, value);        // value is implicitly
        }                                  //   a parameter of set
    }
}
...
directory d = new directory();
...
d["Jane Doe"] = "234-5678";
Console.WriteLine(d["Jane Doe"]);
Why the difference? In C++, operator[] can return a reference (an explicit
lvalue—see Section 8.3.1), which can be used on either side of an assignment.
C# has no comparable notion of reference, so it needs separate methods to get
and set the value of d["Jane Doe"] .
EXAMPLE 7.57   Arrays of arrays in C
In C, one must also declare an array of arrays, and use two-subscript notation,
but C’s integration of pointers and arrays (to be discussed in Section 7.7.1) means
that slices are not supported.
double matrix[10][10];
Given this definition, matrix[3][4] denotes an individual element of the array,
but matrix[3] denotes a reference, either to the third row of the array or to the
first element of that row, depending on context.
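A brief sketch (ours; row_sum is a hypothetical helper) of how these two views of matrix[3] appear in C or C++ code:

double matrix[10][10];

// A row can be passed where a pointer to its first element is expected...
double row_sum(const double *row, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += row[i];
    return s;
}

void demo() {
    matrix[3][4] = 1.0;                    // element access: row 3, column 4
    double s = row_sum(matrix[3], 10);     // matrix[3] decays to double* here
    double (&row)[10] = matrix[3];         // ...or a reference can name the whole row
    row[5] = s;
}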
Slices and Array Operations
EXAMPLE 7.58   Array slice operations
A slice or section is a rectangular portion of an array. Fortran 90 provides extensive
facilities for slicing, as do many scripting languages, including Perl, Python, Ruby,
and R. Figure 7.6 illustrates some of the possibilities in Fortran 90, using the
declaration of matrix shown above. Ada provides more limited support: a slice
is simply a contiguous range of elements in a one-dimensional array.

Figure 7.6   Array slices (sections) in Fortran 90. Much like the values in the header
of an enumeration-controlled loop (Section 6.5.1), a : b : c in a subscript indicates
positions a, a + c, a + 2c, . . . through b. If a or b is omitted, the corresponding
bound of the array is assumed. If c is omitted, 1 is assumed. It is even possible to
use negative values of c in order to select positions in reverse order. The slashes in
the second subscript of the lower-right example delimit an explicit list of positions.
In most languages, the only operations permitted on an array are selection
of an element (which can then be used for whatever operations are valid on its
type), and assignment. A few languages (e.g., Ada and Fortran 90) allow arrays
to be compared for equality. Ada allows one-dimensional arrays whose elements
are discrete to be compared for lexicographic ordering: A < B if the first element
of A that is not equal to the corresponding element of B is less than that corresponding element. Ada also allows the built-in logical operators ( or , and , xor )
to be applied to Boolean arrays.
Fortran 90 has a very rich set of array operations: built-in operations that take
entire arrays as arguments. Because Fortran uses structural type equivalence, the
operands of an array operator need only have the same element type and shape.
In particular, slices of the same shape can be intermixed in array operations, even
if the arrays from which they were sliced have very different shapes. Any of the
built-in arithmetic operators will take arrays as operands; the result is an array,
of the same shape as the operands, whose elements are the result of applying
the operator to corresponding elements. As a simple example, A + B is an array
each of whose elements is the sum of the corresponding elements of A and B .
Fortran 90 also provides a huge collection of intrinsic, or built-in functions. More
than 60 of these (including logic and bit manipulation, trigonometry, logs and
exponents, type conversion, and string manipulation) are defined on scalars but
will also perform their operation element-wise if passed arrays as arguments. The
function tan(A) , for example, returns an array consisting of the tangents of the
elements of A . Many additional intrinsic functions are defined solely on arrays.
These include searching and summarization, transposition, and reshaping and
subscript permutation.
An equally rich set of array operations can be found in APL, an array manipulation language developed by Iverson and others in the early to mid-1960s.5 APL
was designed primarily as a terse mathematical notation for array manipulations.
It employs an enormous character set that makes it difficult to use with conventional keyboards. Its variables are all arrays, and many of the special characters
denote array operations. APL implementations are designed for interpreted, interactive use. They are best suited to “quick and dirty” solutions of mathematical
problems. The combination of very powerful operators with very terse notation
makes APL programs notoriously difficult to read and understand. The J notation, a successor to APL, uses a conventional character set.
7.4.2 Dimensions, Bounds, and Allocation
In all of the examples in the previous subsection, the number of dimensions and
bounds of each array (what Fortran calls its shape) were specified in the declaration. This need not be the case. And even when the shape of an array is specified,
it may depend in some languages on values that are not known at compile time.
5 Kenneth Iverson (1920–2004), a Canadian mathematician, joined the faculty at Harvard University in 1954, where he conceived APL as a notation for describing mathematical algorithms.
He moved to IBM in 1960, where he helped develop the notation into a practical programming
language. He was named an IBM Fellow in 1970, and received the ACM Turing Award in 1979.
Figure 7.7 Allocation in Ada of local arrays whose shape is bound at elaboration time. Here
M is a square two-dimensional array whose width is determined by a parameter passed to foo
at run time. The compiler arranges for a pointer to M to reside at a static offset from the
frame pointer. M cannot be placed among the other local variables because it would prevent
those higher in the frame from having static offsets. Additional variable-size arrays are easily
accommodated. The purpose of the dope vector field is explained in Section 7.4.3.
The time at which the shape of an array is bound has a major impact on how
storage for the array is managed. At least five cases arise.
EXAMPLE 7.59: Stack allocation of elaborated arrays
global lifetime, static shape: If the shape of an array is known at compile time,
and if the array can exist throughout the execution of the program, then the
compiler can allocate space for the array in static global memory.
local lifetime, static shape: If the shape of the array is known at compile time,
but the array should not exist throughout the execution of the program (generally because it is a local variable of a potentially recursive subroutine), then
space can be allocated in the subroutine’s stack frame at run time.
local lifetime, shape bound at elaboration time: In some languages (e.g., Ada and
C99), the shape of an array may not be known until elaboration time. In this
case it is still possible to place the space for the array in the stack frame of its
subroutine, but an extra level of indirection is required (see Figure 7.7).
In order to ensure that every local object can be found using a known offset
from the frame pointer, we divide the stack frame into a fixed-size part and a
variable-size part. An object whose size is statically known goes in the fixed-size part. An object whose size is not known until elaboration time goes in the
variable-size part, and a pointer to it goes in the fixed-size part. (We shall see
in Section 7.4.3 that the pointer must be augmented with a descriptor, or dope
vector, that specifies any bounds that were not known at compile time.) If the
elaboration of the array is buried in a nested block, the compiler delays allocating space (i.e., changing the stack pointer) until the block is entered. It still
allocates space for the pointer among the local variables when the subroutine
itself is entered.
arbitrary lifetime, shape bound at elaboration time: In Java and C#, every array
variable is a reference to an object in the object-oriented sense of the word.
The declaration int[ ] A does not allocate space; it simply creates a reference. To make the reference refer to something, the programmer must either
explicitly allocate a new object from the heap ( A = new int[size] ) or assign
a reference from another array ( A = B ), which already holds a reference to an
object in the heap. In either case, the size of an array, once allocated, never
changes.
arbitrary lifetime, dynamic shape: If the size of an array can change as the result
of executable statements, then allocation in the stack frame will not suffice,
because the space at both ends of an array might be in use for something else
when the array needs to grow. To allow the size to change, an array must generally be allocated from the heap. (A pointer to the array still resides in the
fixed-size portion of the stack frame.) In most cases, increasing the size will
require that we allocate a larger block, copy any data that is to be retained
from the old block to the new, and then deallocate the old.
Arrays of static shape are heavily used by many kinds of programs. Arrays
whose shape is not known until elaboration time are also very common, particularly in numerical software. Many scientific programs rely on numerical libraries
for linear algebra and the manipulation of systems of equations. Since different
programs use arrays of different shapes, the subroutines in these libraries need to
be able to take arguments whose size is not known at compile time.
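A minimal C99 sketch of the elaboration-time case (the function name and the work it does are our own invention): the local array scratch lives in foo's stack frame, but its shape is not known until the routine is called.

#include <stdio.h>

void foo(int n) {
    double scratch[n][n];        /* C99 variable-length array: shape bound at elaboration time */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            scratch[i][j] = i * n + j;
    printf("%g\n", scratch[n-1][n-1]);
}                                /* space is reclaimed when the frame is popped */

int main(void) {
    foo(4);
    foo(7);                      /* a different shape on each call */
    return 0;
}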
Conformant Arrays
EXAMPLE 7.61: Conformant array parameters
Early versions of Pascal required the shape of all arrays to be specified statically.
Standard Pascal relaxes this requirement by allowing array parameters to have
bounds that are symbolic names rather than constants. It calls these parameters
conformant arrays:
function DotProduct(A, B : array [lower..upper : integer] of real) : real;
var i : integer;
    rtn : real;
begin
    rtn := 0;
    for i := lower to upper do rtn := rtn + A[i] * B[i];
    DotProduct := rtn
end;
Here lower and upper are initialized at the time of call, providing DotProduct
with the information it needs to understand the shape of A and B . In effect,
lower and upper are extra parameters of DotProduct . Conformant arrays can
be passed either by value or by reference. C also supports dynamic-size array
parameters, as a natural consequence of its merger of arrays and pointers (to be
discussed in Section 7.7.1). C arrays are always passed by reference.
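For comparison, a rough C99 analogue of DotProduct (our own sketch, using a variably modified parameter type) passes the length explicitly; the arrays themselves are passed, as always in C, as pointers to their first elements.

/* n plays the role of Pascal's lower..upper bounds */
double dot_product(int n, const double a[n], const double b[n]) {
    double rtn = 0.0;
    for (int i = 0; i < n; i++)
        rtn += a[i] * b[i];
    return rtn;
}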
EXAMPLE 7.62: Local arrays of dynamic shape
Pascal does not allow a local variable to be an array of dynamic shape. Ada
and C99 do. Among other things, local arrays can be declared to match the shape
of dynamic-size array parameters, facilitating the implementation of algorithms
that require “scratch space.” Figure 7.8 contains an Ada example. The program
shown plays Conway’s game of Life [Gar70]. (Life is a cellular automaton meant to
model biological populations. The patterns it produces can be amazingly complex and beautiful. Type “Conway Game of Life” into any search engine for a
wealth of online examples.) The main routine allocates a local array the same size
as the game board, which it uses to calculate successive generations. Note that
much more efficient algorithms exist; we present this one because it is brief and
clear.
The <> notation in the definition of lifeboard indicates that the bounds of
the array are not statically known. Ada actually defines an array type to have no
bounds. The type of any array with bounds is a constrained subtype of an array type with the same number of dimensions but unknown bounds. Bounds
of a dynamic array can be obtained at run-time through use of the array attributes ’first and ’last . A’first(1) is the low bound of A ’s first dimension;
A’last(2) is the upper bound of its second dimension. The expression A’range
is short for A’first..A’last .
Dynamic Arrays
EXAMPLE 7.63: Dynamic strings in Java and C#
Several languages, including Snobol, Icon, and Perl, allow strings—arrays of
characters—to change size after elaboration time. Java and C# provide a similar capability (with a similar implementation), but describe the semantics differently: string variables in Java and C# are references to immutable string objects:
String s = "short";
...
s = s + " but sweet";      // + is the concatenation operator
Here the declaration String s introduces a string variable, which we initialize
with a reference to the constant string "short" . In the subsequent assignment, +
creates a new string containing the concatenation of the old s and the constant
" but sweet" ; s is then set to refer to this new string, rather than the old. Java
and C# strings, by the way, are not the same as arrays of characters: strings are
immutable, but elements of an array can be changed in place.
Dynamically resizable arrays (other than strings) appear in APL, Perl, and
Common Lisp. They are also supported by the vector , Vector , and ArrayList
classes of the C++, Java, and C# libraries, respectively.
type presence is new integer range 0..1;
type lifeboard is array (integer range <>, integer range <>) of presence;
    -- cell is 1 if occupied; 0 otherwise
    -- border row around the edge is permanently empty
unexpected : exception;
procedure life(B : in out lifeboard; generations : in integer) is
    T : lifeboard(B'range(1), B'range(2));    -- mimic the bounds of B
begin
    for i in 1..generations loop
        T := B;                               -- copy board, including empty borders
        for i in B'first(1)+1..B'last(1)-1 loop
            for j in B'first(2)+1..B'last(2)-1 loop
                case T(i-1, j-1) + T(i-1, j) + T(i-1, j+1)
                        + T(i, j-1) + T(i, j+1)
                        + T(i+1, j-1) + T(i+1, j) + T(i+1, j+1) is
                    when 0 | 1 => B(i, j) := 0;        -- die of loneliness
                    when 2 => B(i, j) := T(i, j);      -- no-op; survive if present
                    when 3 => B(i, j) := 1;            -- reproduce
                    when 4..8 => B(i, j) := 0;         -- die of overcrowding
                    when others => raise unexpected;
                end case;
            end loop;
        end loop;
    end loop;
end life;
Figure 7.8 Dynamic local arrays in Ada.
EXAMPLE 7.64: Elaborated arrays in Fortran 90
Fortran 90 allows specification of the bounds of an array to be delayed until after elaboration, but it does not allow those bounds to change once they have been defined:
real, dimension (:,:), allocatable :: mat
! mat is two-dimensional, but with unspecified bounds
...
allocate (mat (a:b, 0:m-1))
! first dimension has bounds a..b; second has bounds 0..m-1
...
deallocate (mat)
! implementation is now free to reclaim mat’s space
A similar effect can be obtained in some languages through the use of pointers
(see Exercise 7.18).
7.4.3 Memory Layout
Arrays in most language implementations are stored in contiguous locations in
memory. In a one-dimensional array, the second element of the array is stored
immediately after the first (subject to alignment constraints); the third is stored
immediately after the second, and so forth. For arrays of records, it is common
for each subsequent element to be aligned at an address appropriate for any type;
small holes between consecutive records may result. On some machines, an implementation may even place holes between elements of built-in types. Some languages (e.g., Pascal) allow the programmer to specify that an array be packed .
A packed array generally has no holes between elements, but access to its elements may be slow. A packed array of records may have holes within the records,
unless they too are packed.
EXAMPLE 7.65: Row-major v. column-major array layout
For multidimensional arrays, it still makes sense to put the first element of the array in the array's first memory location. But which element comes next? There are two reasonable answers, called row-major and column-major order. In row-major order, consecutive locations in memory hold elements that differ by one
in the final subscript (except at the ends of rows). A[2, 4] , for example, is followed by A[2, 5] . In column-major order, consecutive locations hold elements
that differ by one in the initial subscript: A[2, 4] is followed by A[3, 4] . These
options are illustrated for two-dimensional arrays in Figure 7.9. The layouts for
three or more dimensions are analogous. Fortran uses column-major order; most
other languages use row-major order.6 The advantage of row-major order is that
it makes it easy to define a multidimensional array as an array of subarrays, as described in Section 7.4.1. With column-major order, the elements of the subarray
would not be contiguous in memory.
EXAMPLE 7.66: Array layout and cache performance
The difference between row- and column-major layout can be important for programs that use nested loops to access all the elements of a large, multidimensional array. On modern machines the speed of such loops is often limited by memory system performance, which depends heavily on the effectiveness of caching (Section 5.1). Figure 7.9 shows the orientation of cache lines for
row- and column-major layout of arrays. If a small array is accessed frequently,
all or most of its elements are likely to remain in the cache, and the orientation
of cache lines will not matter. For a large array, however, many of the accesses
that occur during a full-array traversal are likely to result in cache misses, because the corresponding lines have been evicted from the cache (to make room
for other things) since the last traversal. If array elements are accessed in order
of consecutive addresses, then each miss will bring into the cache not only the
desired element, but the next several elements as well. If elements are accessed
across cache lines instead (i.e., along the rows of a Fortran array, or the columns
6 Correspondence with Frances Allen, an IBM Fellow and Fortran pioneer, suggests that column-major order was originally adopted in order to accommodate idiosyncrasies of the console debugger and instruction set of the IBM model 704 computer, on which the language was first
implemented.
Figure 7.9 Row- and column-major memory layout for two-dimensional arrays. In row-major
order, the elements of a row are contiguous in memory; in column-major order, the elements
of a column are contiguous. The second cache line of each array is shaded, on the assumption
that each element is an eight-byte floating-point number, that cache lines are 32 bytes long (a
common size), and that the array begins at a cache line boundary. If the array is indexed from
A[0,0] to A[9,9] , then in the row-major case elements A[0,4] through A[0,7] share a cache line;
in the column-major case elements A[4,0] through A[7,0] share a cache line.
of an array in most other languages), then there is a good chance that almost
every access will result in a cache miss, dramatically reducing the performance of
the code.
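The practical consequence for a row-major language such as C is that the final subscript should vary in the innermost loop. A small sketch (the array size and function names are ours) contrasts the two traversal orders:

#define N 2048

/* cache-friendly in C: the inner loop walks consecutive addresses, so each
   miss brings in a line whose remaining elements are used immediately */
double sum_by_rows(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* cache-hostile in C: consecutive iterations touch addresses N * sizeof(double)
   bytes apart, so for large N nearly every access can miss */
double sum_by_columns(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}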
Row-Pointer Layout
Some languages employ an alternative to contiguous allocation for some arrays.
Rather than require the rows of an array to be adjacent, they allow them to lie
anywhere in memory, and create an auxiliary array of pointers to the rows. If the
array has more than two dimensions, it may be allocated as an array of pointers to arrays of pointers to. . . . This row-pointer memory layout requires more
space in most cases but has three potential advantages. First, it sometimes allows individual elements of the array to be accessed more quickly, especially on
CISC machines with slow multiplication instructions (see the discussion of address calculations below). Second, it allows the rows to have different lengths,
without devoting space to holes at the ends of the rows; the lack of holes may
sometimes offset the increased space for pointers. Third, it allows a program to
construct an array from preexisting rows (possibly scattered throughout memory) without copying. C, C++, and C# provide both contiguous and row-pointer
organizations for multidimensional arrays. Technically speaking, the contiguous
layout is a true multidimensional array, while the row-pointer layout is an array
of pointers to arrays. Java uses the row-pointer layout for all arrays.
/* contiguous layout (Figure 7.10, left): a true two-dimensional array */
char days[][10] = {
    "Sunday", "Monday", "Tuesday",
    "Wednesday", "Thursday",
    "Friday", "Saturday"
};
...
days[2][3] == 's';    /* in Tuesday */

/* row-pointer layout (Figure 7.10, right): an array of pointers to arrays of characters */
char *days[] = {
    "Sunday", "Monday", "Tuesday",
    "Wednesday", "Thursday",
    "Friday", "Saturday"
};
...
days[2][3] == 's';    /* in Tuesday */
Figure 7.10 Contiguous array allocation v. row pointers in C. The declaration on the left is a true two-dimensional array.
The slashed boxes are NUL bytes; the shaded areas are holes. The declaration on the right is an array of pointers to arrays
of characters. In both cases, we have omitted bounds in the declaration that can be deduced from the size of the initializer
(aggregate). Both data structures permit individual characters to be accessed using double subscripts, but the memory layout
(and corresponding address arithmetic) is quite different.
EXAMPLE 7.67: Contiguous v. row-pointer array layout
By far the most common use of the row-pointer layout in C is to represent arrays of strings. A typical example appears in Figure 7.10. In this example (representing the days of the week), the row-pointer memory layout consumes 57 bytes
for the characters themselves (including a NUL byte at the end of each string),
plus 28 bytes for pointers (assuming a 32-bit architecture), for a total of 85 bytes.
The contiguous layout alternative devotes ten bytes to each day (room enough for
Wednesday and its NUL byte), for a total of 70 bytes. The additional space required
DESIGN & IMPLEMENTATION
Array layout
The layout of arrays in memory, like the ordering of record fields, is intimately
tied to tradeoffs in design and implementation. While column-major layout
appears to offer no advantages on modern machines, its continued use in Fortran means that programmers must be aware of the underlying implementation in order to achieve good locality in nested loops. Row-pointer layout,
likewise, has no performance advantage on modern machines (and a likely
performance penalty, at least for numeric code), but it is a more natural fit for
the “reference to object” data organization of languages like Java. Its impacts
on space consumption and locality may be positive or negative, depending on
the details of individual applications.
for the row-pointer organization comes to 21%. In other cases, row pointers may
actually save space. A Java compiler written in C, for example, would probably
use row pointers to store the character-string representations of the 51 Java keywords and wordlike literals. This data structure would use 51 × 4 = 204 bytes for
the pointers, plus 343 bytes for the keywords, for a total of 547 bytes (548 when
aligned). Since the longest keyword ( synchronized ) requires 13 bytes (including space for the terminating NUL ), a contiguous two-dimensional array would
consume 51 × 13 = 663 bytes (664 when aligned). In this case, row pointers save
a little over 21%.
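A small C sketch (our own, with error checking omitted) shows the third advantage in action: the rows are allocated separately, may have different lengths, and are gathered into a structure simply by storing their addresses in an auxiliary array of pointers.

#include <stdlib.h>

/* build an n-row "triangular" structure whose i-th row has i+1 elements */
int **make_triangle(int n) {
    int **rows = malloc(n * sizeof(int *));   /* the auxiliary array of pointers */
    for (int i = 0; i < n; i++) {
        rows[i] = malloc((i + 1) * sizeof(int));
        for (int j = 0; j <= i; j++)
            rows[i][j] = 0;                   /* indexed exactly like a 2-D array */
    }
    return rows;
}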
Address Calculations
EXAMPLE 7.68: Indexing a contiguous array
For the usual contiguous layout of arrays, calculating the address of a particular
element is somewhat complicated, but straightforward. Suppose a compiler is
given the following declaration for a three-dimensional array.
A : array [L1..U1] of array [L2..U2] of array [L3..U3] of elem_type;
Let us define constants for the sizes of the three dimensions:
S3 = size of elem_type
S2 = (U3 − L3 + 1) × S3
S1 = (U2 − L2 + 1) × S2
Here the size of a row (S2 ) is the size of an individual element (S3 ) times the
number of elements in a row (assuming row-major layout). The size of a plane
(S1 ) is the size of a row (S2 ) times the number of rows in a plane. The address of
A[i, j, k] is then
address of A
+ (i − L1 ) × S1
+ (j − L2 ) × S2
+ (k − L3 ) × S3
As written, this computation involves three multiplications and six additions/
subtractions. We could compute the entire expression at run time, but in most
cases a little rearrangement reveals that much of the computation can be performed at compile time. In particular, if the bounds of the array are known at
compile time, then S1 , S2 , and S3 are compile-time constants, and the subtractions of lower bounds can be distributed out of the parentheses:
(i × S1) + (j × S2) + (k × S3) + address of A
− [(L1 × S1) + (L2 × S2) + (L3 × S3)]
The bracketed expression in this formula is a compile-time constant (assuming
the bounds of A are statically known). If A is a global variable, then the address of
A is statically known as well, and can be incorporated in the bracketed expression.
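A rough C rendering of this split (the bounds, element type, and function name are our own): the bracketed expression becomes a constant bias that can be folded into a "virtual origin" once, leaving only the per-subscript multiplications for run time.

/* A : array [2..11] of array [1..5] of array [0..9] of int, row-major */
enum { L1 = 2, U1 = 11, L2 = 1, U2 = 5, L3 = 0, U3 = 9 };
enum { S3 = sizeof(int),
       S2 = (U3 - L3 + 1) * S3,
       S1 = (U2 - L2 + 1) * S2 };

int *element(char *A, int i, int j, int k) {
    /* the bracketed expression: a compile-time constant */
    static const long bias = (long) L1 * S1 + (long) L2 * S2 + (long) L3 * S3;
    /* run-time part: one multiplication per subscript, plus additions */
    return (int *) (A + (long) i * S1 + (long) j * S2 + (long) k * S3 - bias);
}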
Figure 7.11 Virtual location of an array with nonzero lower bounds. By computing the constant portions of an array index at compile time, we effectively index into an array whose starting
address is offset in memory but whose lower bounds are all zero.
If A is a local variable of a subroutine (with static shape), then the address of A can be decomposed into a static offset (included in the bracketed expression) plus the contents of the frame pointer at run time. We can think of the address of A plus the bracketed expression as calculating the location of an imaginary array whose [i, j, k]th element coincides with that of A, but whose lower bound in each dimension is zero. This imaginary array is illustrated in Figure 7.11.
EXAMPLE 7.69: Pseudo-assembler for contiguous array indexing
If A's elements are integers, and are allocated contiguously in memory, then the instruction sequence to load A[i, j, k] into a register looks something like this:
    -- assume i is in r1, j is in r2, and k is in r3
1.  r4 := r1 × S1
2.  r5 := r2 × S2
3.  r6 := &A − L1 × S1 − L2 × S2 − L3 × 4     -- one or two instructions
4.  r6 := r6 + r4
5.  r6 := r6 + r5
6.  r7 := *r6[r3]                             -- load
We have assumed that the hardware provides an indexed addressing mode, and
that it scales its indexing by the size of the quantity loaded (in this case a four-byte
integer).
EXAMPLE 7.70: Static and dynamic portions of an array index
If i, j, and/or k is known at compile time, then additional portions of the
calculation of the address of A[i, j, k] will move from the dynamic to the static
part of the formula shown above. If all of the subscripts are known, then the
entire address can be calculated statically. Conversely, if any of the bounds of
the array are not known at compile time, then portions of the calculation will
move from the static to the dynamic part of the formula. For example, if L1 is
not known until run time, but k is known to be 3 at compile time, then the
calculation becomes
(i × S1) + (j × S2) − (L1 × S1) + address of A − [(L2 × S2) + (L3 × S3) − (3 × S3)]
Again, the bracketed part can be computed at compile time. If lower bounds are always restricted to zero, as they are in C, then they never contribute to run-time cost.
In all our examples, we have ignored the issue of dynamic semantic checks for out-of-bound subscripts. We explore the code for these in Exercise 7.22. In Section 15.5.2 we will consider code improvement techniques that can be used to eliminate many checks statically, particularly in enumeration-controlled loops.
EXAMPLE 7.71: Indexing complex structures
The notion of "static part" and "dynamic part" of an address computation
generalizes to more than just arrays. Suppose, for example, that V is a messy
local array of records containing a nested, two-dimensional array in field M . The
address of V[i].M[3, j] could be calculated as
    run time (left):              compile time (right):
    i × S1^V                      − L1^V × S1^V
    + j × S2^M                    + M's offset as a field
    + fp                          + (3 − L1^M) × S1^M
                                  − L2^M × S2^M
                                  + offset of V in frame
DESIGN & IMPLEMENTATION
Lower bounds on array indices
In C, the lower bound of every array dimension is always zero. It is often assumed that the language designers adopted this convention in order to avoid
subtracting lower bounds from indices at run time, thereby avoiding a potential source of inefficiency. As our discussion has shown, however, the compiler
can avoid any run-time cost by translating to a virtual starting location. (The
one exception to this statement occurs when the lower bound has a very large
absolute value: if any index (scaled by element size) exceeds the maximum offset available with displacement mode addressing [typically 2^15 bytes on RISC
machines], then subtraction may still be required at run time.)
A more likely explanation lies in the interoperability of arrays and pointers
in C (Section 7.7.1): C’s conventions allow the compiler to generate code for an
index operation on a pointer without worrying about the lower bound of the
array into which the pointer points. Interestingly, Fortran array dimensions
have a default lower bound of 1; unless the programmer explicitly specifies
a lower bound of 0, the compiler must always translate to a virtual starting
location.
Here the calculations on the left must be performed at run time; the calculations on the right can be performed at compile time. (The notation for bounds and size places the name of the variable in a superscript and the dimension in a subscript: L2^M is the lower bound of the second dimension of M.)
EXAMPLE 7.72: Pseudo-assembler for row-pointer array indexing
Address calculation for arrays that use row pointers is comparatively
straightforward. Using our three-dimensional array A as an example, the expression A[ i, j, k ] is equivalent to (*(*A[ i ]) [ j ]) [ k ] or, in more Pascal-like notation, A[ i ]^[ j ]^[ k ] . The instruction sequence to load A[ i, j, k ] into a register looks
something like this:
    -- assume i is in r1, j is in r2, and k is in r3
1.  r4 := &A                  -- one or two instructions
2.  r4 := *r4[r1]
3.  r4 := *r4[r2]
4.  r7 := *r4[r3]
Assuming that the loads at lines 2 and 3 hit in the cache, this code will be comparable in cost to the instruction sequence for contiguous allocation shown above
(given load delays). If the intermediate loads miss in the cache, it will be slower.
On a 1970s CISC machine, the balance would probably tip in favor of the row-pointer code: multiplies would be slower, and memory accesses would be faster.
In any event (contiguous or row-pointer allocation, old or new machine), important code improvements will often be possible when several array references
use the same subscript expression, or when array references are embedded in
loops.
Dope Vectors
For every array, a compiler maintains dimension, bounds, and size information
in the symbol table. For every record, it maintains the offset of every field. When
the bounds and size of array dimensions are statically known, the compiler can
look them up in the symbol table in order to compute the address of elements
of the array. When the bounds and size are not statically known, the compiler
must arrange for them to be available when the compiled program needs to
compute an address at run time. The usual mechanism employs a run-time descriptor, or dope vector for the array.7 Typically, a dope vector for an array of
dynamic shape will contain the lower bound of each dimension and the size of
7 The name “dope vector” presumably derives from the notion of “having the dope on (something),” a colloquial expression that originated in horse racing: advance knowledge that a horse
has been drugged (“doped”) is of significant, if unethical, use in placing bets.
every dimension except the last (which will always be statically known). If the
language implementation performs dynamic semantic checks for out-of-bounds
subscripts in array references, then the dope vector will need to contain upper
bounds as well. Given upper and lower bounds, the size information is redundant, but it is usually included anyway, to avoid computing it repeatedly at run
time.
If some of the dimension bounds or sizes for an array are known at compile
time, then they may be omitted from the dope vector. One might imagine, then,
that the size of a dope vector would depend on the number of statically unknown
quantities. More commonly, it depends only on the number of dimensions: the
modest loss in space is offset by the comparative simplicity of always being able
to find a given bound or size at the same offset within the dope vector for any
array of the appropriate number of dimensions.
The dope vector for an array of dynamic shape is generally placed next to the
pointer to the array in the fixed-size part of the stack frame. The contents of the
dope vector are initialized at elaboration time, or whenever the array changes
shape. If one fully dynamic array is assigned into a second and the two are of
different shapes and sizes, then run-time code will not only need to deallocate
the old heap space of the target array, allocate new space, and copy the data into
it, but it will also need to copy information from the dope vector of the source
array into the dope vector of the target.
In some languages a record may contain an array of dynamic shape. In order
to arrange for every field to have a static offset from the beginning of the record, a
compiler can treat the record much like the fixed-size portion of the stack frame,
with a pointer to the array at a fixed offset in the record, and the data in the
variable-size part of the current stack frame. The problem with this approach is
that it abandons contiguous allocation for records. Among other things, a block
copy routine can no longer be used to assign one record into another. An arguably
preferable approach is to abandon fixed field offsets and create dope vectors for
dynamic-size records, just as we do for dynamic-size arrays. The dope vector for
a record lists the offsets of the record’s fields. All of the actual data then go in the
variable-size part of the stack frame or the heap (depending on whether sizes are
known at elaboration time).
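As a rough C sketch of the idea (the struct layout and field names are our own, not those of any particular compiler), a dope vector for a two-dimensional array of doubles with dynamically known shape might carry a pointer to the data, the bounds, and the precomputed row size:

typedef struct {
    double *data;        /* pointer to the elements, in the stack or the heap */
    int lower1, upper1;  /* bounds of dimension 1 (upper bounds kept for checks) */
    int lower2, upper2;  /* bounds of dimension 2 */
    int row_size;        /* elements per step of the first subscript
                            (redundant given the bounds, but avoids recomputation) */
} dope_vector;

/* the sort of code a compiler might generate for A[i, j], bounds checks omitted */
double *element(const dope_vector *dv, int i, int j) {
    return dv->data + (i - dv->lower1) * dv->row_size + (j - dv->lower2);
}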
CHECK YOUR UNDERSTANDING
28. What is an array slice? For what purposes are slices useful?
29. Is there any significant difference between a two-dimensional array and an
array of one-dimensional arrays?
30. What is the shape of an array?
31. Under what circumstances can an array declared within a subroutine be allocated in the stack? Under what circumstances must it be allocated in the
heap?
32. What is a conformant array?
33. Discuss the comparative advantages of contiguous and row-pointer layout for
arrays.
34. Explain the difference between row-major and column-major layout for contiguously allocated arrays. Why does a programmer need to know which layout the compiler uses? Why do most language designers consider row-major
layout to be better?
35. How much of the work of computing the address of an element of an array
can be performed at compile time? How much must be performed at run
time?
36. What is a dope vector? What purpose does it serve?
7.5 Strings
In many languages, a string is simply an array of characters. In other languages,
strings have special status, with operations that are not available for arrays of
other sorts. Particularly powerful string facilities are found in Snobol, Icon, and
the various scripting languages.
As we saw in Section 6.5.3, mechanisms to search for patterns within strings
are a key part of Icon’s distinctive generator-based control flow. Icon has dozens
of built-in string operators, functions, and generators, including sophisticated
pattern-matching facilities based on regular expressions. Perl, Python, Ruby, and
other scripting languages provide similar functionality, though none includes the
full power of Icon's backtracking search. We will consider the string and pattern-matching facilities of scripting languages in more detail in Section 13.4.2. In the
remainder of this section we focus on the role of strings in more traditional languages.
Almost all programming languages allow literal strings to be specified as a sequence of characters, usually enclosed in single or double quote marks. Many
languages, including C and its descendants, distinguish between literal characters (usually delimited with single quotes) and literal strings (usually delimited with double quotes). Other languages (e.g., Pascal) make no distinction: a
character is just a string of length one. Most languages also provide escape sequences that allow nonprinting characters and quote marks to appear inside of
strings. In Pascal, for example, a quote mark is included in a string by doubling it: ’ab’’cde’ is a six-character string whose third character is a quote
mark.
EXAMPLE 7.73: Character escapes in C and C++
C99 and C++ provide a very rich set of escape sequences. An arbitrary character can be represented by a backslash followed by (a) 1–3 octal (base 8) digits,
(b) an x and one or more hexadecimal (base 16) digits, (c) a u and exactly four
hexadecimal digits, or (d) a U and exactly eight hexadecimal digits. The \u notation is meant to capture the two-byte (16-bit) Unicode character set. The \U
notation is for 32-bit “extended” characters. Many of the most common control
characters also have single-character escape sequences, many of which have been
adopted by other languages as well. For example, \n is a line feed; \t is a tab; \r
is a carriage return; \\ is a backslash. C# omits the octal sequences of C99 and
C++; Java also omits the 32-bit extended sequences.
The set of operations provided for strings is strongly tied to the implementation envisioned by the language designer(s). Several languages that do not in general allow arrays to change size dynamically do provide this flexibility for strings.
The rationale is twofold. First, manipulation of variable-length strings is fundamental to a huge number of computer applications, and in some sense “deserves”
special treatment. Second, the fact that strings are one-dimensional, have one-byte elements, and never contain references to anything else makes dynamic-size
strings easier to implement than general dynamic arrays.
EXAMPLE 7.74: Char* assignment in C
Some languages require that the length of a string-valued variable be bound
no later than elaboration time, allowing the variable to be implemented as a contiguous array of characters in the current stack frame. Languages in this category
include C, Pascal, and Ada. Pascal and Ada support a few string operations, including assignment and comparison for lexicographic ordering. C, on the other
hand, provides only the ability to create a pointer to a string literal. Because of
C’s unification of arrays and pointers, even assignment is not supported. Given
the declaration char *s , the statement s = "abc" makes s point to the constant "abc" in static storage. If s is declared as an array, rather than a pointer
( char s[4] ), then the statement will trigger an error message from the compiler. To assign one array into another in C, the program must copy the elements
individually.
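A minimal C sketch of the difference (the function name is ours; strcpy is the standard library's copy routine):

#include <string.h>

void example(void) {
    char *s;
    char buf[4];

    s = "abc";             /* legal: s now points to the literal in static storage */
    /* buf = "abc"; */     /* illegal: an array name is not an assignable l-value */
    strcpy(buf, "abc");    /* instead, copy the characters (including the NUL) */
}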
Other languages allow the length of a string-valued variable to change over its
lifetime, requiring that the variable be implemented as a block or chain of blocks
in the heap. Languages in this category include Lisp, Icon, ML, Java, and C#.
ML and Lisp provide strings as a built-in type. C++, Java, and C# provide them
as predefined classes of object, in the formal, object-oriented sense. In all these
languages a string variable is a reference to a string. Assigning a new value to such
a variable makes it refer to a different object. Concatenation and other string
operators implicitly create new objects. The space used by objects that are no
longer reachable from any variable is reclaimed automatically.
7.6 Sets
EXAMPLE 7.75: Set types
A programming language set is an unordered collection of an arbitrary number of distinct values of a common type. Sets were introduced by Pascal, and are
found in many more recent languages as well. They are a useful form of composite type for many applications. Pascal supports sets of any discrete type, and
provides union, intersection, and difference operations:
var A, B, C : set of char;
    D, E : set of weekday;
...
A := B + C;    (* union; A := {x | x is in B or x is in C} *)
A := B * C;    (* intersection; A := {x | x is in B and x is in C} *)
A := B - C;    (* difference; A := {x | x is in B and x is not in C} *)
The type from which elements of a set are drawn is known as the base or universe type. Icon supports sets of characters (called csets) but not sets of any other base type. As illustrated in Section 6.5.4, csets play an important role in Icon's search facilities. Ada does not provide a set constructor for types, but its generic facility can be used to define a set package (module) with functionality comparable to the sets of Pascal [IBFW91, pp. 242–244].
There are many ways to implement sets, including arrays, hash tables, and
various forms of trees. The most common implementation employs a bit vector
whose length (in bits) is the number of distinct values of the base type. A set of
characters, for example (in a language that uses ASCII) would be 128 bits—16
bytes—in length. A one in the kth position in the bit vector indicates that the kth
element of the base type is a member of the set; a zero indicates that it is not.
Operations on bit-vector sets can make use of fast logical instructions on most
machines. Union is bit-wise or ; intersection is bit-wise and ; difference is bit-wise
not , followed by bit-wise and .
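A minimal C sketch of a bit-vector character set (our own layout: 128 bits packed into four 32-bit words, assuming 7-bit ASCII values 0–127):

#include <stdint.h>

typedef struct { uint32_t bits[4]; } charset;          /* 4 x 32 = 128 bits */

void cs_add(charset *s, int c)          { s->bits[c >> 5] |= 1u << (c & 31); }
int  cs_member(const charset *s, int c) { return (s->bits[c >> 5] >> (c & 31)) & 1; }

charset cs_union(charset a, charset b) {               /* bit-wise or */
    for (int i = 0; i < 4; i++) a.bits[i] |= b.bits[i];
    return a;
}
charset cs_intersection(charset a, charset b) {        /* bit-wise and */
    for (int i = 0; i < 4; i++) a.bits[i] &= b.bits[i];
    return a;
}
charset cs_difference(charset a, charset b) {          /* bit-wise and-not */
    for (int i = 0; i < 4; i++) a.bits[i] &= ~b.bits[i];
    return a;
}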
DESIGN & IMPLEMENTATION
Representing sets
Unfortunately, bit vectors do not work well for large base types: a set of integers, represented as a bit vector, would consume some 500 megabytes on a
32-bit machine. With 64-bit integers, a bit-vector set would consume more
memory than is currently contained on all the computers in the world. Because of this problem, many languages (including early versions of Pascal, but
not the ISO standard) limit sets to base types of fewer than some fixed number of members. Both 128 and 256 are common limits; they suffice to cover
ASCII characters. A few languages (e.g., early versions of Modula-2) limit base
types to the number of elements that can be represented by a one-word bit
vector, but there is really no excuse for such a severe restriction. A language
that permits sets with very large base types must employ an alternative implementation (e.g., a hash table). It will still be expensive to represent sets with
enormous numbers of elements, but reasonably easy to represent sets with a
modest number of elements drawn from a very large universe.
7.7 Pointers and Recursive Types
A recursive type is one whose objects may contain one or more references to other
objects of the type. Most recursive types are records, since they need to contain
something in addition to the reference, implying the existence of heterogeneous
fields. Recursive types are used to build a wide variety of “linked” data structures,
including lists and trees.
In languages like Lisp, ML, Clu, or Java, which use a reference model of variables, it is easy for a record of type foo to include a reference to another record
of type foo : every variable (and hence every record field) is a reference anyway.
In languages like C, Pascal, or Ada, which use a value model of variables, recursive types require the notion of a pointer: a variable (or field) whose value is a
reference to some object. Pointers were first introduced in PL/I.
In some languages (e.g., Pascal, Ada 83, and Modula-3), pointers are restricted
to point only to objects in the heap. The only way to create a new pointer value
(without using variant records or casts to bypass the type system) is to call a
built-in function that allocates a new object in the heap and returns a pointer to
it. In other languages (e.g., PL/I, Algol 68, C, C++, and Ada 95), one can create a
pointer to a nonheap object by using an “address of ” operator. We will examine
pointer operations and the ramifications of the reference and value models in
more detail in the following subsection.
In any language that permits new objects to be allocated from the heap, the
question arises: how and when is storage reclaimed for objects that are no longer
needed? In short-lived programs it may be acceptable simply to leave the storage
unused, but in most cases unused space must be reclaimed, to make room for
other things. A program that fails to reclaim the space for objects that are no
longer needed is said to “leak memory.” If such a program runs for an extended
period of time, it may run out of space and crash.
Many languages, including C, C++, Pascal, and Modula-2, require the programmer to reclaim space explicitly. Other languages, including Modula-3, Java,
C#, and all the functional and scripting languages, require the language
DESIGN & IMPLEMENTATION
Implementation of pointers
It is common for programmers (and even textbook writers) to equate pointers with addresses, but this is a mistake. A pointer is a high-level concept: a reference to an object. An address is a low-level concept: the location of a word
in memory. Pointers are often implemented as addresses, but not always. On
a machine with a segmented memory architecture, a pointer may consist of a
segment id and an offset within the segment. In a language that attempts to
catch uses of dangling references, a pointer may contain both an address and
an access key.
implementation to reclaim unused objects automatically. Explicit storage reclamation
simplifies the language implementation, but raises the possibility that the programmer will forget to reclaim objects that are no longer live (thereby leaking
memory) or will accidentally reclaim objects that are still in use (thereby creating
dangling references). Automatic storage reclamation (otherwise known as garbage
collection) dramatically simplifies the programmer’s task, but raises the question
of how the language implementation is to distinguish garbage from active objects. We will discuss these issues further in Sections 7.7.2 and 7.7.3.
7.7.1 Syntax and Operations
Operations on pointers include allocation and deallocation of objects in the heap,
dereferencing of pointers to access the objects to which they point, and assignment of one pointer into another. The behavior of these operations depends
heavily on whether the language is functional or imperative, and on whether it
employs a reference or value model for variables/names.
Functional languages generally employ a reference model for names (a purely
functional language has no variables or assignments). Objects in a functional language tend to be allocated automatically as needed, with a structure determined
by the language implementation. Most implementations of Lisp, for example,
build lists out of two-pointer blocks called cons cells. Lisp’s imperative features
allow the programmer to modify cons cells explicitly, but this ability must be
used with care: because of the reference model, a cons cell is commonly part of
the object to which more than one variable refers; a change made through one
variable will often change other variables as well.
Variables in an imperative language may use either a value or a reference
model, or some combination of the two. In C, Pascal, or Ada, which employ a
value model, the assignment A := B puts the value of B into A . If we want B to
refer to an object, and we want A := B to make A refer to the object to which
B refers, then A and B must be pointers. In Clu and Smalltalk, which employ a
reference model, the assignment A := B always makes A refer to the same object
to which B refers. A straightforward implementation would represent every variable as an address, but this would lead to very inefficient code for built-in types.
A better and more common approach is to use addresses for variables that refer
to mutable objects such as tree nodes, whose value can change, but to use actual
values for variables that refer to immutable objects such as integers, real numbers,
and characters. In other words, while every variable is semantically a reference,
it does not matter whether a reference to the number 3 is implemented as the
address of a 3 in memory or as the value 3 itself: since the value of “the 3” never
changes, the two are indistinguishable.
Java charts an intermediate course, in which the usual implementation of the
reference model is made explicit in the language semantics. Variables of built-in
Java types (integers, floating-point numbers, characters, and Booleans) employ a
value model; variables of user-defined types (strings, arrays, and other objects in
Figure 7.12 Implementation of a tree in ML. The abstract (conceptual) tree is shown at the lower left.
the object-oriented sense of the word) employ a reference model. The assignment
A := B in Java places the value of B into A if A and B are of built-in type; it
makes A refer to the object to which B refers if A and B are of user-defined type.
C# mirrors Java by default, but additional language features, explicitly labeled
“ unsafe ,” allow systems programmers to use pointers when desired.
Reference Model
EXAMPLE 7.76: Tree type in ML
Section 7.2.4 explains how ML datatypes can be used to declare recursive types:
datatype chr_tree = empty | node of char * chr_tree * chr_tree;
The node constructor of a chr_tree builds tuples containing a reference to a character and two references to chr_trees.
It is natural in ML to include a chr_tree within a chr_tree because every
variable is a reference. The tree node (#"R" , node (#"X" , empty , empty) , node
(#"Y" , node (#"Z" , empty , empty) , node (#"W" , empty , empty))) would
most likely be represented in memory as shown in Figure 7.12. Each individual
rectangle in the right-hand portion of this figure represents a block of storage allocated from the heap. In effect, the tree is a tuple (record) tagged to indicate that
it is a node . This tuple in turn refers to two other tuples that are also tagged as
node s. At the fringe of the tree are tuples that are tagged as empty ; these contain
no further references. Because all empty tuples are the same, the implementation
is free to use just one, and to have every reference point to it.
EXAMPLE 7.77: Tree type in Lisp
In Lisp, which uses a reference model of variables but is not statically typed, our tree could be specified textually as '(#\R (#\X ()()) (#\Y (#\Z ()()) (#\W ()()))), and would be represented in memory as shown in Figure 7.13.
The parentheses denote a list, which in Lisp consists of two references: one to
Figure 7.13 Implementation of a tree in Lisp. A diagonal slash through a box indicates a nil pointer. The C and A tags serve
to distinguish the two kinds of memory blocks: cons cells and blocks containing atoms.
the head of the list and one to the remainder of the list (which is itself a list).
The prefix #\ notation serves the same purpose as surrounding quotes in other
languages. As we noted in Section 7.7.1, a Lisp list is almost always represented in
memory by a cons cell containing two pointers. A binary tree can be represented
as a three-element (three cons cell) list. The first cell represents the root; the
second and third cells represent the left and right subtrees. Each heap block is
tagged to indicate whether it is a cons cell or an atom. An atom is anything other
than a cons cell—that is, an object of a built-in type (integer, real, character,
string, etc.), or a user-defined structure (record) or array. The uniformity of Lisp
lists (everything is a cons cell or an atom) makes it easy to write polymorphic
functions, though without the static type checking of ML.
If one programs in a purely functional style in ML or in Lisp, the data structures created with recursive types turn out to be acyclic. New objects refer to
old ones, but old ones never change, and thus never point to new ones. Circular
structures can be defined only by using the imperative features of the languages.
In ML, these features include an explicit notion of pointer, discussed briefly under “Value Model” below.
EXAMPLE 7.78: Mutually recursive types in ML
Even when writing in a functional style, one often finds a need for types that
are mutually recursive. In a compiler, for example, it is likely that symbol table
records and syntax tree nodes will need to refer to each other. A syntax tree node
that represents a subroutine call will need to refer to the symbol table record that
represents the subroutine. The symbol table record, for its part, will need to refer
to the syntax tree node at the root of the subtree that represents the subroutine’s
code. If types are declared one at a time, and if names must be declared before
they can be used, then whichever mutually recursive type is declared first will be
unable to refer to the other. ML addresses this problem by allowing types to be
declared together in a group:
datatype sym_tab_rec = variable of ...
| type of ...
| ...
| subroutine of {code : syn_tree_node, ...}
and syn_tree_node = expression of ...
| loop of ...
| ...
| subr_call of {subr : sym_tab_rec, ...};
Mutually recursive types of this sort are trivial in Lisp, since it is dynamically
typed. (Common Lisp includes a notion of structures, but field types are not
declared. In simpler Lisp dialects programmers use nested lists in which fields
are merely positional conventions.)
Value Model
EXAMPLE 7.79: Tree types in Pascal, Ada, and C
In Pascal, our tree data type would be declared as follows.
type chr_tree_ptr = ^chr_tree;
     chr_tree = record
         left, right : chr_tree_ptr;
         val : char
     end;
The Ada declaration is similar:
type chr_tree;
type chr_tree_ptr is access chr_tree;
type chr_tree is record
left, right : chr_tree_ptr;
val : character;
end record;
In C, the equivalent declaration8 is as follows.
struct chr_tree {
struct chr_tree *left, *right;
char val;
};
As mentioned in Section 3.3.3, Pascal permits forward references in the declaration of pointer types, to support recursive types. Ada and C use incomplete type
declarations instead.
8 One of the peculiarities of the C type system is that struct tags are not exactly type names.
In this example, the name of the type is the two-word phrase struct chr_tree . To obtain a
one-word name, one can say typedef struct chr_tree chr_tree_type , or even typedef
struct chr_tree chr_tree : struct tags and typedef names have separate name spaces, so
the same name can be used in each.
Figure 7.14 Typical implementation of a tree in a language with explicit pointers. As in Figure 7.13, a diagonal slash through a box indicates a nil pointer.
EXAMPLE 7.80: Allocating heap nodes
No aggregate syntax is available for linked data structures in Pascal, Ada, or C;
a tree must be constructed node by node. To allocate a new node from the heap,
the programmer calls a built-in function. In Pascal:
new(my_ptr);
In Ada:
my_ptr := new chr_tree;
In C:
my_ptr = (struct chr_tree *) malloc(sizeof(struct chr_tree));
C's malloc is defined as a library function, not a built-in part of the language (though some compilers recognize and optimize it as a special case); hence the need to specify the size of the allocated object, and to cast the return value to the appropriate type.
EXAMPLE 7.81: Object-oriented allocation of heap nodes
C++, Java, and C# replace malloc with a built-in new:
my_ptr = new chr_tree( arg list );
In addition to "knowing" the size of the requested type, the C++/Java/C# new will automatically call any user-specified constructor (initialization) function, passing the specified argument list. In a similar but less flexible vein, Ada's new may specify an initial value for the allocated object:
my_ptr := new chr_tree'(null, null, 'X');
EXAMPLE 7.82: Pointer-based tree
After we have allocated and linked together appropriate nodes in C, Pascal, or Ada, our tree example is likely to be implemented as shown in Figure 7.14. As in Lisp, a leaf is distinguished from an internal node simply by the fact that its two pointer fields are nil.
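As a concrete sketch in C (the helper make_node is our own; error checking is omitted), the tree of Figure 7.14 can be built leaf-first:

#include <stdlib.h>

struct chr_tree {                     /* as declared above */
    struct chr_tree *left, *right;
    char val;
};

struct chr_tree *make_node(char v, struct chr_tree *l, struct chr_tree *r) {
    struct chr_tree *p = (struct chr_tree *) malloc(sizeof(struct chr_tree));
    p->val = v;  p->left = l;  p->right = r;
    return p;
}

struct chr_tree *build_example(void) {
    struct chr_tree *x = make_node('X', NULL, NULL);
    struct chr_tree *z = make_node('Z', NULL, NULL);
    struct chr_tree *w = make_node('W', NULL, NULL);
    return make_node('R', x, make_node('Y', z, w));
}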
EXAMPLE 7.83: Pointer dereferencing
To access the object referred to by a pointer, most languages use an explicit dereferencing operator. In Pascal and Modula this operator takes the form of a postfix "up-arrow":
my_ptr^.val := 'X';
In C it is a prefix star:
(*my_ptr).val = 'X';
Because pointers so often refer to records ( struct s), for which the prefix notation is awkward, C also provides a postfix “right-arrow” operator that plays the
role of the “up-arrow dot” combination in Pascal:
my_ptr->val = 'X';
EXAMPLE 7.84: Implicit dereferencing in Ada
On the assumption that pointers almost always refer to records, Ada dispenses
with dereferencing altogether. The same dot-based syntax can be used to access
either a field of the record foo or a field of the record pointed to by foo , depending on the type of foo :
T : chr_tree;
P : chr_tree_ptr;
...
T.val := 'X';
P.val := 'Y';
In those cases in which one actually wants to name the entire object referred to
by a pointer, Ada provides a special “pseudofield” called all :
T := P.all;
In essence, pointers in Ada are automatically dereferenced when needed. A more ambitious (and unfortunately rather confusing) form of automatic dereferencing can be found in Algol 68.
EXAMPLE 7.85: Pointer dereferencing in ML
The imperative features of ML include an assignment statement, but this statement requires that the left-hand side be a pointer: its effect is to make the pointer
refer to the object on the right-hand side. To access the object referred to by a
pointer, one uses an exclamation point as a prefix dereferencing operator:
val p = ref 2;            (* p is a pointer to 2 *)
...
p := 3;                   (* p now points to 3 *)
...
let val n = !p in ...     (* n is simply 3 *)
ML thus makes the distinction between l-values and r-values very explicit. Most Algol-family languages blur the distinction by implicitly dereferencing variables on the right-hand side of every assignment statement. Algol 68 and Ada blur the distinction further by dereferencing pointers automatically in certain circumstances.
EXAMPLE 7.86: Assignment in Lisp
The imperative features of Lisp do not include a dereferencing operator. Since
every object has a self-evident type, and assignment is performed using a small
set of built-in operators, there is never any ambiguity as to what is intended.
Assignment in Common Lisp employs the setf operator (Scheme uses set! ),
rather than the more common := . For example, if foo refers to a list, then (cdr
foo) is the right-hand (“rest of list”) pointer of foo ’s cons cell, and the assignment (setf (cdr foo) foo) makes this pointer refer back to foo , creating a
one- cons -cell circular list.
Pointers and Arrays in C
EXAMPLE 7.87: Array names and pointers in C
Pointers and arrays are closely linked in C. Consider the following declarations.
int n;
int *a;       /* pointer to integer */
int b[10];    /* array of 10 integers */
Now all of the following are valid.
1. a = b;         /* make a point to the initial element of b */
2. n = a[3];
3. n = *(a+3);    /* equivalent to previous line */
4. n = b[3];
5. n = *(b+3);    /* equivalent to previous line */
In most contexts, an unsubscripted array name in C is automatically converted
to a pointer to the array’s first element (the one with index zero), as shown here
in line 1. (Line 5 embodies the same conversion.) Lines 3 and 5 illustrate pointer
arithmetic: given a pointer to an element of an array, the addition of an integer
k produces a pointer to the element k positions later in the array (earlier if k is
negative). The prefix * is a pointer dereference operator. Pointer arithmetic is
valid only within the bounds of a single array, but C compilers are not required
to check this.
Remarkably, the subscript operator [] in C is actually defined in terms of pointer arithmetic: lines 2 and 4 are syntactic sugar for lines 3 and 5, respectively. More precisely, E1[E2], for any expressions E1 and E2, is defined to be (*((E1)+(E2))), which is of course the same as (*((E2)+(E1))). (Extra parentheses have been used in this definition to avoid any questions of precedence if E1 and E2 are complicated expressions.) Correctness requires only that one operand of [] have an array or pointer type and the other have an integral type. Thus A[3] is equivalent to 3[A], something that comes as a surprise to most programmers.
EXAMPLE 7.88 (Pointer comparison and subtraction in C)
In addition to allowing an integer to be added to a pointer, C allows pointers
to be subtracted from one another or compared for ordering, provided that they
refer to elements of the same array. The comparison p < q , for example, tests to
see if p refers to an element closer to the beginning of the array than the one referred to by q . The expression p - q returns the number of array positions that
separate the elements to which p and q refer. All arithmetic operations on pointers “scale” their results as appropriate, based on the size of the referenced objects.
For multidimensional arrays with row-pointer layout, a[i][j] is equivalent to
(*(a+i))[j] or *(a[i]+j) or *(*(a+i)+j) .
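By way of illustration, here is a minimal, self-contained sketch (the array and variable names are purely illustrative); the assertions simply state the expected outcomes of the rules just described.

#include <assert.h>
#include <stddef.h>

int main(void) {
    int a[10];
    int *p = &a[2];
    int *q = &a[7];

    assert(p < q);               /* p refers to an earlier element of a */
    assert(q - p == 5);          /* subtraction is scaled by sizeof(int) */

    a[3] = 44;
    assert(a[3] == *(a + 3));    /* subscripting is defined as pointer arithmetic */
    assert(a[3] == 3[a]);        /* ... so the operands of [] can be swapped */
    return 0;
}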
EXAMPLE 7.89 (Pointer and array declarations in C)
Despite the interoperability of pointers and arrays in C, programmers need
to be aware that the two are not the same, particularly in the context of variable
declarations, which need to allocate space when elaborated. The declaration of
a pointer variable allocates space to hold a pointer, while the declaration of an
array variable allocates space to hold the whole array. In the case of an array the
declaration must specify a size for each dimension. Thus int *a[n], when elaborated, will allocate space for n row pointers; int a[n][m] will allocate space for a two-dimensional array with contiguous layout.9

EXAMPLE 7.90 (Arrays as parameters in C)
When an array is included in the argument list of a function call, C passes a
pointer to the first element of the array, not the array itself. For a one-dimensional
array of integers, the corresponding formal parameter may be declared as int a[] or int *a. For a two-dimensional array of integers with row-pointer layout, the formal parameter may be declared as int *a[] or int **a. For a two-dimensional array with contiguous layout, the formal parameter may be declared as int a[][m] or int (*a)[m]. The size of the first dimension is irrelevant; all
that is passed is a pointer, and C performs no dynamic checks to ensure that
references are within the bounds of the array.
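As a concrete sketch (the function names and the fixed bound M are illustrative, not from the text), the following two routines accept a contiguously laid out matrix and a row-pointer matrix, respectively:

#define M 10

/* contiguous layout: the column count M is needed to compute the address of a[i][j] */
long sum_contiguous(int a[][M], int rows) {      /* equivalently: int (*a)[M] */
    long sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < M; j++)
            sum += a[i][j];
    return sum;
}

/* row-pointer layout: each row is reached through its own pointer */
long sum_row_pointers(int *a[], int rows, int cols) {   /* equivalently: int **a */
    long sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += a[i][j];
    return sum;
}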
In all cases, a declaration must allow the compiler (or human reader) to determine the size of the elements of an array or, equivalently, the size of the objects
referred to by a pointer. Thus neither int a[][] nor int (*a)[] is a valid
declaration: neither provides the compiler with the size information it needs to
generate code for a + i or a[i] . (An exception: a variable declaration that includes initialization to an aggregate can omit size information if that information
can be inferred from the contents of the aggregate.)
EXAMPLE 7.91 (Sizeof in C)
The built-in sizeof operator returns the size in bytes of an object or type.
When given an array as argument it returns the size of the entire array. When
given a pointer as argument it returns the size of the pointer itself.
DESIGN & IMPLEMENTATION
Pointers and arrays
Many C programs use pointers instead of subscripts to iterate over the elements of arrays. Before the development of modern optimizing compilers,
pointer-based array traversal often served to eliminate redundant address calculations, thereby leading to faster code. With modern compilers, however,
the opposite may be true: redundant address calculations can be identified as
common subexpressions, and certain other code improvements are easier for
indices than they are for pointers. In particular, as we shall see in Chapter 15,
pointers make it significantly more difficult for the code improver to determine when two l-values may be aliases for one another.
Today the use of pointer arithmetic is mainly a matter of personal taste:
some C programmers consider pointer-based algorithms to be more elegant
than their array-based counterparts. Certainly the fact that arrays are passed
as pointers makes it natural to write subroutines in the pointer style.
9 To read declarations in C, it is helpful to follow this rule: start at the name of the
variable and work right as far as possible, subject to parentheses; then work left as far as possible;
then jump out a level of parentheses and repeat. Thus int *a[n] means that a is an n-element
array of pointers to integers, while int (*a)[n] means that a is a pointer to an n-element array
of integers.
If a is an array, sizeof(a) / sizeof(a[0]) returns the number of elements in the array.
Similarly, if pointers occupy 4 bytes and double-precision floating point numbers
occupy 8 bytes, then given
double *a;          /* pointer to double */
double (*b)[10];    /* pointer to array of 10 doubles */

we have sizeof(a) = sizeof(b) = 4, sizeof(*a) = sizeof(*b[0]) = 8, and
sizeof(*b) = 80. In most cases, sizeof can be evaluated at compile time; the
principal exception occurs for variable-length arrays, whose shape is not known
until elaboration time.

EXAMPLE 7.92 (Multidimensional array parameters in C)
Variable-length arrays are particularly useful in numeric code, where we
can write general purpose library routines that manipulate arrays of arbitrary
size:
double determinant(int rows, int cols, double M[rows][cols]) {
    ...
    val = M[i][j];    /* normal syntax */
It is possible but awkward to write functionally equivalent code in earlier versions
of C:
double determinant(int rows, int cols, double *M) {
    ...
    val = *(M + (i * cols) + j);    /* M[i][j] */
CHECK YOUR UNDERSTANDING
37. Name three languages that provide particularly extensive support for character strings.
38. Why might a language permit operations on strings that it does not provide
for arrays?
39. What are the strengths and weaknesses of the bit-vector representation for
sets? How else might sets be implemented?
40. Discuss the tradeoffs between pointers and the recursive types that arise naturally in a language with a reference model of variables.
41. Summarize the ways in which one dereferences a pointer in various programming languages.
42. What is the difference between a pointer and an address?
43. Discuss the advantages and disadvantages of the interoperability of pointers
and arrays in C.
44. Under what circumstances must the bounds of a C array be specified in its
declaration?
7.7.2 Dangling References
In Section 3.2 we described three storage classes for objects: static, stack, and heap.
Static objects remain live for the duration of the program. Stack objects are live
for the duration of the subroutine in which they are declared. Heap objects have
a less well-defined lifetime.
EXAMPLE 7.93 (Explicit storage reclamation)
When an object is no longer live, a long-running program needs to reclaim the
object’s space. Stack objects are reclaimed automatically as part of the subroutine
calling sequence. How are heap objects reclaimed? There are two alternatives.
Languages like Pascal, C, and C++ require the programmer to reclaim an object
explicitly. In Pascal:
dispose(my_ptr);
In C:
free(my_ptr);
In C++:
delete my_ptr;
C++ provides additional functionality: prior to reclaiming the space, it automatically calls any user-provided destructor function for the object. A destructor can
reclaim space for subsidiary objects, remove the object from indices or tables,
print messages, or perform any other operation appropriate at the end of the
object’s lifetime.
A dangling reference is a live pointer that no longer points to a valid object.
In languages like Algol 68 or C, which allow the programmer to create pointers
to stack objects, a dangling reference may be created when a subroutine returns
while some pointer in a wider scope still refers to a local object of that subroutine.
In a language with explicit reclamation of heap objects, a dangling reference is
created whenever the programmer reclaims an object to which pointers still refer.
(Note that even if the dispose and delete operations of Pascal and C++ were to set their pointer argument to nil, this would not solve the problem, because other
pointers may still refer to the same object.) Because a language implementation
may reuse the space of reclaimed stack and heap objects, a program that uses a
dangling reference may read or write bits in memory that are now part of some
other object. It may even modify bits that are now part of the implementation’s
bookkeeping information, corrupting the structure of the stack or heap.
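In C, both kinds of dangling reference are easy to create. The following sketch (the names are illustrative) shows one of each; any use of the dangling pointers is undefined behavior.

#include <stdlib.h>

int *stack_dangle(void) {
    int local = 42;
    return &local;           /* pointer to a stack object about to be reclaimed */
}

int main(void) {
    int *p = stack_dangle(); /* p is dangling as soon as the call returns */

    int *q = malloc(sizeof(int));
    int *alias = q;
    free(q);                 /* alias now dangles, even if q itself is cleared */
    q = NULL;

    /* *p = 1; or *alias = 1; may silently corrupt unrelated data (undefined behavior) */
    (void) p; (void) alias;
    return 0;
}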
Algol 68 addresses the problem of dangling references to stack objects by forbidding a pointer from pointing to any object whose lifetime is briefer than
that of the pointer itself. Unfortunately, this rule is difficult to enforce. Among
other things, since both pointers and objects to which pointers might refer can
be passed as arguments to subroutines, dynamic semantic checks are possible
only if reference parameters are accompanied by a hidden indication of lifetime.
Ada 95 has a more restrictive rule that is easier to enforce: it forbids a pointer from pointing to any object whose lifetime is briefer than that of the pointer's type.

Figure 7.15  Tombstones. A valid pointer refers to a tombstone that in turn refers to an object. A dangling reference refers to an "expired" tombstone.
Tombstones

EXAMPLE 7.94 (Dangling reference detection with tombstones)
Tombstones [Lom75, Lom85] are a mechanism by which a language implementation can catch all dangling references, to objects in both the stack and the heap.
The idea is simple: rather than have a pointer refer to an object directly, we introduce an extra level of indirection (Figure 7.15). When an object is allocated
in the heap (or when a pointer is created to an object in the stack), the language
run-time system allocates a tombstone. The pointer contains the address of the
tombstone; the tombstone contains the address of the object. When the object is
reclaimed, the tombstone is modified to contain a value (typically zero) that cannot be a valid address. To avoid special cases in the generated code, tombstones
are also created for pointers to static objects.
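A minimal sketch of the idea in C follows; the helper names are hypothetical, and a real implementation would generate both the indirection and the check automatically.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    void *object;                 /* address of the object, or NULL once expired */
} tombstone;

typedef struct {
    tombstone *ts;                /* every checked pointer goes through a tombstone */
} checked_ptr;

checked_ptr checked_alloc(size_t size) {
    tombstone *t = malloc(sizeof(tombstone));
    t->object = malloc(size);
    return (checked_ptr){ t };
}

void *deref(checked_ptr p) {      /* every access pays for a check and an extra hop */
    if (p.ts->object == NULL) {
        fprintf(stderr, "dangling reference\n");
        exit(1);
    }
    return p.ts->object;
}

void checked_free(checked_ptr p) {
    free(p.ts->object);
    p.ts->object = NULL;          /* expire the tombstone; all other copies now fail */
}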
For heap objects, it is easy to invalidate a tombstone when the program calls
the deallocation operation. For stack objects, the language implementation must
be able to find all tombstones associated with objects in the current stack frame
when returning from a subroutine. One possible solution is to link all stack-object tombstones together in a list, sorted by the address of the stack frame in
which the object lies. When a pointer is created to a local object, the tombstone
can simply be added to the beginning of the list. When a pointer is created to a
parameter, the run-time system must scan down the list and insert in the middle,
to keep it sorted. When a subroutine returns, the epilogue portion of the calling sequence invalidates the tombstones at the head of the list, and removes them from the list.
Tombstones may be allocated from the heap itself or, more commonly, from
a separate pool. The latter option avoids fragmentation problems, and makes
allocation relatively fast, since the first tombstone on the free list is always the
right size.
Tombstones can be expensive, both in time and in space. The time overhead
includes (1) creation of tombstones when allocating heap objects or using a "pointer to" operator, (2) checking for validity on every access, and (3) double indirection. Fortunately, checking for validity can be made essentially free on
most machines by arranging for the address in an “invalid” tombstone to lie outside the program’s address space. Any attempt to use such an address will result
in a hardware interrupt, which the operating system can reflect up into the language run-time system. We can also use our invalid address, in the pointer itself,
to represent the constant nil . If the compiler arranges to set every pointer to
nil at elaboration time, then the hardware will catch any use of an uninitialized
pointer. (This technique works without tombstones, as well.)
The space overhead for tombstones can be significant. The simplest approach
is never to reclaim them. Since a tombstone is usually significantly smaller than
the object to which it refers, a program will waste less space by leaving a tombstone around forever than it would waste by never reclaiming the associated object. Even so, any long-running program that continually creates and reclaims objects will eventually run out of space for tombstones. A potential solution, which
we will consider in Section 7.7.3, is to augment every tombstone with a reference
count, and reclaim tombstones themselves when the reference count goes to zero.
Tombstones have a valuable side effect. Because of double-indirection, it is
easy to change the location of an object in the heap. The run-time system need
not locate every pointer that refers to the object; all that is required is to change
the address in the tombstone. The principal reason to change heap locations is
for storage compaction, in which all dynamically allocated blocks are “scooted
together” at one end of the heap in order to eliminate external fragmentation.
Tombstones are not widely used in language implementations, but the Macintosh
operating system (versions 9 and below) uses them internally, for references to
system objects such as file and window descriptors.
Locks and Keys

EXAMPLE 7.95 (Dangling reference detection with locks and keys)
Locks and keys [FL80] are an alternative to tombstones. Their disadvantages are
that they work only for objects in the heap, and they provide only probabilistic
protection from dangling pointers. Their advantage is that they avoid the need to
keep tombstones around forever (or to figure out when to reclaim them). Again
the idea is simple: Every pointer is a tuple consisting of an address and a key.
Every object in the heap begins with a lock. A pointer to an object in the heap is
valid only if the key in the pointer matches the lock in the object (Figure 7.16).
When the run-time system allocates a new heap object, it generates a new key
value. These can be as simple as serial numbers, but should avoid “common”
values such as zero and one. When an object is reclaimed, its lock is changed to
some arbitrary value (e.g., zero) so that the keys in any remaining pointers will
not match. If the block is subsequently reused for another purpose, we expect it
to be very unlikely that the location that used to contain the lock will be restored
to its former value by coincidence.

Figure 7.16  Locks and Keys. A valid pointer contains a key that matches the lock on an object in the heap. A dangling reference is unlikely to match.
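A sketch of the scheme in C (the structure layout and helper names are illustrative only; a real allocator would recycle blocks from its own pool rather than returning them to the operating system):

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned lock;                /* every heap block begins with a lock word */
    /* user data follows */
} block_header;

typedef struct {
    block_header *addr;
    unsigned key;                 /* copied into every pointer at allocation time */
} checked_ptr;

static unsigned next_key = 2;     /* avoid "common" values such as 0 and 1 */

checked_ptr lk_alloc(size_t size) {
    block_header *b = malloc(sizeof(block_header) + size);
    b->lock = next_key++;
    return (checked_ptr){ b, b->lock };
}

void *lk_deref(checked_ptr p) {   /* probabilistic check on every access */
    if (p.addr->lock != p.key) {
        fprintf(stderr, "probable dangling reference\n");
        exit(1);
    }
    return (char *) p.addr + sizeof(block_header);
}

void lk_free(checked_ptr p) {
    p.addr->lock = 0;             /* keys in any surviving pointers no longer match */
    free(p.addr);
}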
Like tombstones, locks and keys incur significant overhead. They add an extra
word of storage to every pointer and to every block in the heap. They increase the
cost of copying one pointer into another. Most significantly, they incur the cost
of comparing locks and keys on every access (or every provably nonredundant
access). It is unclear whether the lock and key check is cheaper or more expensive
than the tombstone check. A tombstone check may result in two cache misses
(one for the tombstone and one for the object); a lock and key check is unlikely
to cause more than one. On the other hand, the lock and key check requires a
significantly longer instruction sequence on most machines.
To minimize time and space overhead, most compilers do not by default generate code to check for dangling references. Most Pascal compilers allow the programmer to request dynamic checks, which are usually implemented with locks
and keys. In most implementations of C, even optional checks are unavailable.
7.7.3 Garbage Collection
Explicit reclamation of heap objects is a serious burden on the programmer and a
major source of bugs (memory leaks and dangling references). The code required
to keep track of object lifetimes makes programs more difficult to design, implement, and maintain. An attractive alternative is to have the language implementation notice when objects are no longer useful and reclaim them automatically.
Automatic reclamation (otherwise known as garbage collection) is more or less
essential for functional languages: delete is a very imperative sort of operation,
and the ability to construct and return arbitrary objects from functions means
that many objects that would be allocated on the stack in an imperative language
must be allocated from the heap in a functional language, to give them unlimited
extent.
Over time, automatic garbage collection has become popular for imperative
languages as well. It can be found in, among others, Clu, Cedar, Modula-3, Java,
C#, and all the major scripting languages. Automatic collection is difficult to implement, but the difficulty pales in comparison to the convenience enjoyed by
programmers once the implementation exists. Automatic collection also tends to
be slower than manual reclamation, though it eliminates any need to check for
dangling references.
DESIGN & IMPLEMENTATION
Garbage collection
Garbage collection presents a classic tradeoff between convenience and safety
on the one hand and performance on the other. Manual storage reclamation,
implemented correctly by the application program, is almost invariably faster
than any automatic garbage collector. It is also more predictable: automatic
collection is notorious for its tendency to introduce intermittent “hiccups” in
the execution of real-time or interactive programs.
Ada takes the unusual position of refusing to take a stand: the language
design makes automatic garbage collection possible, but implementations are
not required to provide it (most don’t), and programmers can request manual reclamation with a built-in routine called Unchecked_Deallocation . The
Ada 95 version of the language provides extensive facilities whereby programmers can implement their own storage managers (garbage collected or not),
with different types of pointers corresponding to different storage “pools.”
In a similar vein, the Real Time Specification for Java allows the programmer to create so-called scoped memory areas that are accessible to only a subset of the currently running threads. When all threads with access to a given
area terminate, the area is reclaimed in its entirety. Objects allocated in a
scoped memory area are never examined by the garbage collector; performance anomalies due to garbage collection can therefore be avoided by providing scoped memory to every real-time thread.
Reference Counts
When is an object no longer useful? One possible answer is: when no pointers to it
exist.10 The simplest garbage collection technique simply places a counter in each
object that keeps track of the number of pointers that refer to the object. When
the object is created, this reference count is set to one, to represent the pointer returned by the new operation. When one pointer is assigned into another, the run-time system decrements the reference count of the object formerly referred to by
the assignment’s left-hand side and increments the count of the object referred to
by the right-hand side. On subroutine return, the calling sequence epilogue must
decrement the reference count of any object referred to by a local pointer that
is about to be destroyed. When a reference count reaches zero, its object can be
reclaimed. Recursively, the run-time system must decrement counts for any objects referred to by pointers within the object being reclaimed, and reclaim those
objects if their counts reach zero. To prevent the collector from following garbage
addresses, each pointer must be set to nil at elaboration time.
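A sketch of the generated bookkeeping in C (the names are hypothetical; a real implementation would use the compiler-generated type descriptor, described below, to find an object's internal pointers):

#include <stdlib.h>

typedef struct rc_obj {
    unsigned count;               /* number of pointers that refer to this object */
    /* payload and descriptor-identified internal pointers follow */
} rc_obj;

void rc_release(rc_obj *o);

/* hypothetical stand-in: walk the type descriptor and release each internal pointer */
void release_children(rc_obj *o) { (void) o; }

rc_obj *rc_new(size_t size) {
    rc_obj *o = calloc(1, size);
    o->count = 1;                 /* the pointer returned by the "new" operation */
    return o;
}

void rc_release(rc_obj *o) {
    if (o != NULL && --o->count == 0) {
        release_children(o);      /* recursively decrement referenced objects */
        free(o);
    }
}

/* code generated for the pointer assignment lhs := rhs */
void rc_assign(rc_obj **lhs, rc_obj *rhs) {
    if (rhs != NULL) rhs->count++;   /* increment first, in case *lhs == rhs */
    rc_release(*lhs);
    *lhs = rhs;
}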
In order for reference counts to work, the language implementation must be
able to identify the location of every pointer. When a subroutine returns, it must
be able to tell which words in the stack frame represent pointers; when an object
in the heap is reclaimed, it must be able to tell which words within the object represent pointers. The standard technique to track this information relies on type
descriptors generated by the compiler. There is one descriptor for every distinct
type in the program, plus one for the stack frame of each subroutine and one for
the set of global variables. Most descriptors are simply a table that lists the offsets
within the type at which pointers can be found, together with the addresses of
descriptors for the types of the objects referred to by those pointers. For a tagged
variant record (discriminated union) type, the descriptor is a bit more complicated: it must contain a list of values (or ranges) for the tag, together with a table
for the corresponding variant. For untagged variant records, there is no acceptable solution: reference counts work only if the language is strongly typed (but
see the discussion of “Conservative Collection” on page 389).
EXAMPLE 7.96 (Reference counts and circular structures)
The most important problem with reference counts stems from their definition of a "useful object." While it is definitely true that an object is useless if no
references to it exist, it may also be useless when references do exist. As shown
in Figure 7.17, reference counts may fail to collect circular structures. They work
well only for structures that are guaranteed to be noncircular. Many language
implementations use reference counts for variable-length strings; strings never
contain references to anything else. Perl uses reference counts for all dynamically
allocated data; the manual warns the programmer to break cycles manually when
data aren’t needed anymore. Some purely functional languages may also be able
to use reference counts safely in all cases, if the lack of an assignment statement
prevents them from introducing circularity. Finally, reference counts can be used
to reclaim tombstones. While it is certainly possible to create a circular structure
with tombstones, the fact that the programmer is responsible for explicit deallocation of heap objects implies that reference counts will fail to reclaim tombstones only when the programmer has failed to reclaim the objects to which they
refer.

10 Throughout the following discussion we will use the pointer-based terminology of languages with a value model of variables. The techniques apply equally well, however, to languages with a reference model of variables.

Figure 7.17  Reference counts and circular lists. The list shown here cannot be found via any program variable, but because the list is circular, every cell contains a nonzero count.
Tracing Collection
A better definition of a “useful” object is one that can be reached by following a
chain of valid pointers starting from something that has a name (i.e., something
outside the heap). According to this definition, the blocks in the bottom half of
Figure 7.17 are useless, even though their reference counts are nonzero. Tracing
collectors work by recursively exploring the heap, starting from external pointers,
to determine what is useful.
Mark-and-Sweep The classic mechanism to identify useless blocks, under this
more accurate definition, is known as mark-and-sweep. It proceeds in three main
steps, executed by the garbage collector when the amount of free space remaining
in the heap falls below some minimum threshold.
1. The collector walks through the heap, tentatively marking every block as “useless.”
2. Beginning with all pointers outside the heap, the collector recursively explores all linked data structures in the program, marking each newly discovered block as “useful.” (When it encounters a block that is already marked as
“useful,” the collector knows it has reached the block over some previous path,
and returns without recursing.)
3. The collector again walks through the heap, moving every block that is still
marked “useless” to the free list.
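The three steps might look as follows in C. This is only a sketch: it assumes a heap of contiguous variable-size blocks, each beginning with a header that records its size, a free flag, a mark flag, and a type descriptor listing pointer offsets; heap_start, heap_end, the root set, and add_to_free_list are assumed to be supplied by the run-time system.

#include <stdbool.h>
#include <stddef.h>

typedef struct type_desc {
    size_t n_pointers;
    const size_t *offsets;        /* byte offsets of pointers within the block */
} type_desc;

typedef struct block {
    size_t size;                  /* total size, including this header */
    bool free, marked;
    const type_desc *desc;
} block;

extern char *heap_start, *heap_end;
extern block **roots;             /* all pointers outside the heap */
extern size_t n_roots;
extern void add_to_free_list(block *b);

static block *next_block(block *b) { return (block *)((char *) b + b->size); }

static void mark(block *b) {
    if (b == NULL || b->marked) return;       /* already reached via some other path */
    b->marked = true;
    for (size_t i = 0; i < b->desc->n_pointers; i++)
        mark(*(block **)((char *) b + b->desc->offsets[i]));
}

void mark_and_sweep(void) {
    for (block *b = (block *) heap_start; (char *) b < heap_end; b = next_block(b))
        b->marked = false;                    /* Step 1: tentatively "useless" */
    for (size_t i = 0; i < n_roots; i++)
        mark(roots[i]);                       /* Step 2: explore from the roots */
    for (block *b = (block *) heap_start; (char *) b < heap_end; b = next_block(b))
        if (!b->free && !b->marked)
            add_to_free_list(b);              /* Step 3: reclaim what is left */
}

The recursion in mark is precisely the space problem that the pointer-reversal technique described below is designed to eliminate.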
Several potential problems with this algorithm are immediately apparent.
First, both the initial and final walks through the heap require that the collector be able to tell where every “in-use” block begins and ends. In a language with
variable-size heap blocks, every block must begin with an indication of its size,
and of whether it is currently free. Second, the collector must be able in Step 2 to
find the pointers contained within each block. The standard solution is to place
a pointer to a type descriptor near the beginning of each block.
The space overhead for bookkeeping information in heap blocks is not as large
as it might at first appear. If every type descriptor contains an indication of size,
then a heap block that includes the address of its type descriptor need not include
its size as a separate field (though the extra indirection required to find the size in
the descriptor makes walking the heap more expensive). Moreover, since a type
descriptor must be word-aligned on most machines, the two low-order bits of its
address are guaranteed to be zero. If we are willing to mask these bits out before
using the address, we can use them to store the “free” and “useful” flags.
EXAMPLE 7.97 (Heap tracing with pointer reversal)
Pointer Reversal  The exploration step (Step 2) of mark-and-sweep collection
is naturally recursive. The obvious implementation needs a stack whose maximum depth is proportional to the longest chain through the heap. In practice,
the space for this stack may not be available: after all, we run garbage collection
when we’re about to run out of space!11 An alternative implementation of the
exploration step uses a technique first suggested by Schorr and Waite [SW67]
to embed the equivalent of the stack in already-existing fields in heap blocks.
More specifically, as the collector explores the path to a given block, it reverses
the pointers it follows, so that each points back to the previous block instead of
forward to the next. This pointer-reversal technique is illustrated in Figure 7.18.
As it explores, the collector keeps track of the current block and the block from
whence it came (the two gray arrows in the figure).
When it returns from block W to block Y , the collector uses the reversed
pointer in Y to restore its notion of previous block ( R in our example). It then
flips the reversed pointer back to W and updates its notion of current block to
Y . If the block to which it has returned contains additional pointers, the collector
proceeds forward again; otherwise it returns across the previous reversed pointer
and tries again. At most one pointer in every block will be reversed at any given
time. This pointer must be marked, probably by means of another bookkeeping field at the beginning of each block. (We could mark the pointer by setting one of its low-order bits, but the cost in time would probably be prohibitive: we'd have to search the block on every visit.)

11 In many language implementations, the stack and heap grow toward each other from opposite ends of memory; if the heap is full, the stack can't grow. In a system with virtual memory the distance between the two may theoretically be enormous, but the space that backs them up on disk is still limited, and shared between them.

Figure 7.18  Heap exploration via pointer reversal. The block currently under examination is indicated by the large gray arrow. The previous block is indicated by the small gray arrow. As the garbage collector moves from one block to the next, it changes the pointer it follows to refer back to the previous block. When it returns to a block it restores the pointer. Each reversed pointer must be marked, to distinguish it from other, forward pointers in the same block. We assume in this figure that the root node R is outside the heap, so none of its pointers are reversed.
Stop-and-Copy  In a language with variable-size heap blocks, the garbage collector can reduce external fragmentation by performing storage compaction, as
noted in the preceding discussion of tombstones. Compaction with tombstones
is easier because there is only a single pointer to each object. Many garbage collectors employ a technique known as stop-and-copy that achieves compaction
while simultaneously eliminating Steps 1 and 3 in the standard mark and sweep
algorithm. Specifically, they divide the heap into two regions of equal size. All
allocation happens in the first half. When this half is (nearly) full, the collector
begins its exploration of reachable data structures. Each reachable block is copied
into the second half of the heap, with no external fragmentation. The old version
of the block, in the first half of the heap, is overwritten with a “useful” flag and a
pointer to the new location. Any other pointer that refers to the same block (and
is found later in the exploration) is set to point to the new location. When the
collector finishes its exploration, all useful objects have been moved (and compacted) into the second half of the heap, and nothing in the first half is needed
anymore. The collector can therefore swap its notion of first and second halves,
and the program can continue. Obviously, this algorithm suffers from the fact
that only half of the heap can be used at any given time, but in a system with virtual memory it is only the virtual space that is underutilized; each “half ” of the
heap can occupy most of physical memory as needed. Moreover, by eliminating
Steps 1 and 3 of standard mark and sweep, stop and copy incurs overhead proportional to the number of nongarbage blocks, rather than the total number of
blocks.
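A sketch of the copying step in C, in the style of Cheney's algorithm (a named technique not discussed further in the text): a scan pointer and a free pointer sweep through the new half-space, so no explicit stack or recursion is needed. For simplicity the sketch assumes every object carries a forwarding field, a field count, and pointer-only fields; to_space and the root set are assumed to come from the run-time system.

#include <stddef.h>
#include <string.h>

typedef struct obj {
    struct obj *forward;          /* new location, or NULL if not yet copied */
    size_t n_fields;
    struct obj *fields[];         /* for simplicity, all fields are pointers */
} obj;

extern char *to_space;
extern obj **roots;
extern size_t n_roots;

static size_t obj_size(obj *o) { return sizeof(obj) + o->n_fields * sizeof(obj *); }

static obj *copy(obj *o, char **free_ptr) {
    if (o == NULL) return NULL;
    if (o->forward != NULL) return o->forward;    /* already moved; reuse new address */
    size_t sz = obj_size(o);
    obj *new_o = (obj *) *free_ptr;
    memcpy(new_o, o, sz);                         /* move the block, compacting as we go */
    *free_ptr += sz;
    new_o->forward = NULL;
    o->forward = new_o;                           /* leave a forwarding address behind */
    return new_o;
}

void collect(void) {
    char *free_ptr = to_space, *scan_ptr = to_space;
    for (size_t i = 0; i < n_roots; i++)
        roots[i] = copy(roots[i], &free_ptr);     /* everything directly reachable */
    while (scan_ptr < free_ptr) {                 /* then everything reachable from that */
        obj *o = (obj *) scan_ptr;
        for (size_t i = 0; i < o->n_fields; i++)
            o->fields[i] = copy(o->fields[i], &free_ptr);
        scan_ptr += obj_size(o);
    }
    /* finally, the roles of the two half-spaces are swapped (not shown) */
}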
Generational Collection To further reduce the cost of collection, some
garbage collectors employ a “generational” technique, exploiting the observation
that most dynamically allocated objects are short lived. The heap is divided into
multiple regions (often two). When space runs low the collector first examines
the youngest region (the “nursery”), which it assumes is likely to have the highest proportion of garbage. Only if it is unable to reclaim sufficient space in this
region does the collector examine the next-older region. Any object that survives
some small number of collections (often one) in its current region is promoted
(moved) to the next older region, in a manner reminiscent of stop-and-copy.
Promotion requires, of course, that pointers from old objects to new objects be updated to reflect the new locations. While such old-to-new pointers tend to be
rare, a generational collector must track them in an explicit data structure (updated at pointer assignment time) in order to avoid scanning the older portions
of the heap in order to find them. A collector for a long-running system, which
cannot afford to leak storage, must be able in the general case to examine the
entire heap, but in most cases the overhead of collection will be proportional to
the size of the youngest region only.

DESIGN & IMPLEMENTATION
Reference counts v. tracing
Reference counts require a counter field in every heap object. For small objects such as cons cells, this space overhead may be significant. The ongoing expense of updating reference counts when pointers are changed can also be significant in a program with large amounts of pointer manipulation. Other garbage collection techniques, however, have similar overheads. Tracing generally requires a reversed pointer indicator in every heap block, which reference counting does not, and generational collectors must generally incur overhead on every pointer assignment in order to keep track of pointers into the newest section of the heap.
The two principal tradeoffs between reference counting and tracing are the inability of the former to handle cycles and the tendency of the latter to "stop the world" periodically in order to reclaim space. On the whole, implementors tend to favor reference counting for applications in which circularity is not an issue, and tracing collectors in the general case. The "stop the world" problem can be addressed with incremental or parallel collectors, which execute concurrently with the rest of the program, but these tend to have higher total overhead. Efficient, effective garbage collection techniques remain an active area of research.
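The explicit data structure that records old-to-new pointers is commonly called a remembered set, and the few instructions inserted at every pointer assignment to maintain it are called a write barrier. A minimal sketch in C (the region tests and the set itself are hypothetical helpers):

#include <stddef.h>
#include <stdbool.h>

extern bool in_nursery(void *p);      /* does p point into the youngest region? */
extern bool in_old_space(void *p);
extern void remember(void **field);   /* add this slot to the remembered set */

/* code generated for the assignment *field = value */
void write_barrier(void **field, void *value) {
    if (value != NULL && in_old_space(field) && in_nursery(value))
        remember(field);              /* this slot must be treated as a root when
                                         only the nursery is collected */
    *field = value;
}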
Conservative Collection Language implementors have traditionally assumed
that automatic storage reclamation is possible only in languages that are strongly
typed: both reference counts and tracing collection require that we be able to
find the pointers within an object. If we are willing to admit the possibility that
some garbage will go unreclaimed, it turns out that we can implement mark-and-sweep collection without being able to find pointers [BW88]. The key is to
observe that the number of blocks in the heap is much smaller than the number
of possible bit patterns in an address. The probability that a word in memory that
is not a pointer into the heap will happen to contain a bit pattern that looks like
such a pointer is relatively small. If we assume, conservatively, that everything
that seems to point to a heap block is in fact a valid pointer, then we can proceed
with mark-and-sweep collection. When space runs low, the collector (as usual)
tentatively marks all blocks in the heap as useless. It then scans all word-aligned
quantities in the stack and in global storage. If any of these “pointers” contains
the address of a block in the heap, the collector marks that block as useful. Recursively, the collector then scans all word-aligned quantities in the block and marks
as useful any other blocks whose addresses are found therein. Finally (as usual),
the collector reclaims any blocks that are still marked useless. The algorithm is
completely safe (in the sense that it never reclaims useful blocks) as long as the
programmer never “hides” a pointer. In C, for example, the collector is unlikely
to function correctly if the programmer casts a pointer to int and then xors it
with a constant, with the expectation of restoring and using the pointer at a later
time. In addition to sometimes leaving garbage unreclaimed, conservative collection suffers from the inability to perform compaction: the collector can never be
sure which “pointers” should be changed.
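A sketch of the root-scanning loop in C (the stack bounds, the heap-membership test, and the marking routine are assumed to be supplied by the collector):

#include <stdint.h>

extern uintptr_t *stack_low, *stack_high;     /* current extent of the stack */
extern int looks_like_heap_block(void *p);    /* is p the address of an allocated block? */
extern void mark_conservatively(void *p);     /* mark the block, then scan its words too */

void scan_roots(void) {
    for (uintptr_t *w = stack_low; w < stack_high; w++) {
        void *candidate = (void *) *w;        /* any word might be a pointer */
        if (looks_like_heap_block(candidate))
            mark_conservatively(candidate);   /* conservatively assume that it is one */
    }
    /* global (static) data would be scanned the same way */
}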
7.8 Lists
A list is defined recursively as either the empty list or a pair consisting of an
object (which may be either a list or an atom) and another (shorter) list. Lists
are ideally suited to programming in functional and logic languages, which do
most of their work via recursion and higher-order functions (to be described in
Section 10.5). In Lisp, in fact, a program is a list, and can extend itself at run time
by constructing a list and executing it (this capability will be examined further in
Section 10.3.5; it depends heavily on the fact that Lisp delays almost all semantic
checking until run time).
Lists can also be used in imperative programs. Clu provides a built-in type
constructor for lists, and a list class is easy to write in most object-oriented languages. Several scripting languages, notably Perl and Python, provide extensive
list support. In any language with records and pointers, the programmer can
build lists by hand. Since many of the standard list operations tend to generate
garbage, lists work best in a language with automatic garbage collection.
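For instance, a Lisp-style cons cell is easy to code by hand in C (a sketch; deallocation is left to the caller, which is exactly the burden that garbage collection removes):

#include <stdlib.h>

typedef struct cons_cell {
    void *car;                    /* the element (or a nested list) */
    struct cons_cell *cdr;        /* the rest of the list; NULL plays the role of nil */
} cons_cell;

cons_cell *cons(void *head, cons_cell *tail) {
    cons_cell *c = malloc(sizeof(cons_cell));
    c->car = head;
    c->cdr = tail;
    return c;
}

/* the list (a b c) is built inside-out:  cons(a, cons(b, cons(c, NULL))) */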
EXAMPLE 7.98 (Lists in ML and Lisp)
We have already discussed certain aspects of lists in ML (Section 7.2.4) and
Lisp (Section 7.7.1). As we noted in those sections, lists in ML are homogeneous:
every element of the list must have the same type. Lisp lists, by contrast, are heterogeneous: any object may be placed in a list, as long as it is never used in an
inconsistent fashion.12 The different approaches to type in ML and in Lisp lead to
different implementations. An ML list is usually a chain of blocks, each of which
contains an element and a pointer to the next block. A Lisp list is a chain of cons
cells, each of which contains two pointers, one to the element and one to the next
cons cell (see Figures 7.12 and 7.13, pages 371 and 372). For historical reasons,
the two pointers in a cons cell are known as the car and the cdr ; they represent
the head of the list and the remaining elements, respectively. In both semantics
(homogeneity versus heterogeneity) and implementation (chained blocks versus
cons cells), Clu resembles ML, while Python and Prolog (to be discussed in Section 11.2) resemble Lisp.
EXAMPLE 7.99 (List notation)
Both ML and Lisp provide convenient notation for lists. An ML list is enclosed in square brackets, with elements separated by commas: [a, b, c, d].
A Lisp list is enclosed in parentheses, with elements separated by white space:
(a b c d) . In both cases, the notation represents a proper list: one whose innermost pair consists of the final element and the empty list. In Lisp, it is also possible to construct an improper list, whose final pair contains two elements. (Strictly
speaking, such a list does not conform to the standard recursive definition.) Lisp
systems provide a more general but cumbersome dotted list notation that captures both proper and improper lists. A dotted list is either an atom (possibly
nil ) or a pair consisting of two dotted lists separated by a period and enclosed
in parentheses. The dotted list (a . (b . (c . (d . nil)))) is the same as (a b c
d) . The list (a . (b . (c . d))) is improper; its final cons cell contains a pointer
to d in the second position, where a pointer to a list is normally required.
Both ML and Lisp provide a wealth of built-in polymorphic functions to manipulate arbitrary lists. Because programs are lists in Lisp, Lisp must distinguish
between lists that are to be evaluated and lists that are to be left “as is” as structures. To prevent a literal list from being evaluated, the Lisp programmer may
quote it: (quote (a b c d)) , abbreviated ’(a b c d) . To evaluate an internal
list (e.g., one returned by a function), the programmer may pass it to the built-in
function eval . In ML, programs are not lists, so a literal list is always a structural
aggregate.
12 Recall that objects are self-descriptive in Lisp. The only type checking occurs when a function
“deliberately” inspects an argument to see whether it is a list or an atom of some particular type.
EXAMPLE 7.100 (Basic list operations in Lisp)
The most fundamental operations on lists are those that construct them from their components or extract their components from them. In Lisp:

(cons 'a '(b))           ⇒  (a b)
(car '(a b))             ⇒  a
(car nil)                ⇒  ??
(cdr '(a b c))           ⇒  (b c)
(cdr '(a))               ⇒  nil
(cdr nil)                ⇒  ??
(append '(a b) '(c d))   ⇒  (a b c d)
As in Chapter 6, we have used ⇒ to mean “evaluates to.” The car and cdr
of the empty list ( nil ) are defined to be nil in Common Lisp; in Scheme they
result in a dynamic semantic error.
EXAMPLE 7.101 (Basic list operations in ML)
In ML the equivalent operations are written as follows.

a :: [b]          ⇒  [a, b]
hd [a, b]         ⇒  a
hd [ ]            ⇒  run-time exception
tl [a, b, c]      ⇒  [b, c]
tl [a]            ⇒  nil
tl [ ]            ⇒  run-time exception
[a, b] @ [c, d]   ⇒  [a, b, c, d]
Run-time exceptions may be caught by the program if desired; further details will
appear in Section 8.5.
Both ML and Lisp provide many additional list functions, including ones that
test a list to see if it is empty; return the length of a list; return the nth element
of a list, or a list consisting of all but the first n elements; reverse the order of the
elements of a list; search a list for elements matching some predicate; or apply a
function to every element of a list, returning the results as a list.

DESIGN & IMPLEMENTATION
Car and cdr
The names of the functions car and cdr are historical accidents: they derive from the original (1959) implementation of Lisp on the IBM 704 at MIT. The machine architecture included 15-bit "address" and "decrement" fields in some of the (36-bit) loop-control instructions, together with additional instructions to load an index register from, or store it to, one of these fields within a 36-bit memory word. The designers of the Lisp interpreter decided to make cons cells mimic the internal format of instructions, so they could exploit these special instructions. In now archaic usage, memory words were also known as "registers." What might appropriately have been called "first" and "rest" pointers thus came to be known as the CAR (contents of address of register) and CDR (contents of decrement of register). The 704, incidentally, was also the machine on which Fortran was first developed, and the first commercial machine to include hardware floating point and magnetic core memory.
EXAMPLE 7.102 (List comprehensions)
Miranda, Haskell, and Python provide lists that resemble those of ML, but
with an important additional mechanism, known as list comprehensions. A common form of list comprehension comprises an expression, an enumerator, and
one or more filters. In Miranda and Haskell, the following denotes a list of the
squares of all odd numbers less than 100.
[i*i | i <- [1..100], i `mod` 2 == 1]
Here the vertical bar means “such that”; the left arrow is roughly equivalent to
“is a member of.” (Python syntax is slightly different.) We could of course write
an equivalent expression with a collection of appropriate functions. The brevity
of the list comprehension syntax, however, can sometimes lead to remarkably
elegant programs (see, for example, Exercise 7.32).
7.9 Files and Input/Output
Input/output (I/O) facilities allow a program to communicate with the outside
world. In discussing this communication, it is customary to distinguish between
interactive I/O and I/O with files. Interactive I/O generally implies communication with human users or physical devices, which work in parallel with the running program, and whose input to the program may depend on earlier output
from the program (e.g., prompts). Files generally correspond to storage outside
the program’s address space, implemented by the operating system. Files may be
further categorized into those that are temporary and those that are persistent.
Temporary files exist for the duration of a single program run; their purpose is to
store information that is too large to fit in the memory available to the program.
Persistent files allow a program to read data that existed before the program began running, and to write data that will continue to exist after the program has
ended.
I/O is one of the most difficult aspects of a language to design, and one that
displays the least commonality from one language to the next. Some languages
provide built-in file data types and special syntactic constructs for I/O. Others relegate I/O entirely to library packages, which export a (usually opaque)
file type and a variety of input and output subroutines. The principal advantage of language integration is the ability to employ non-subroutine-call syntax
and to perform operations (e.g., type checking on subroutine calls with varying
numbers of parameters) that may not otherwise be available to library routines.
A purely library-based approach to I/O, on the other hand, may keep a substantial amount of "clutter" out of the language definition.

DESIGN & IMPLEMENTATION
I/O
Regardless of the level of language integration, the design of I/O facilities is complicated by the tension between "power" and portability: designers generally want to take advantage of (and provide access to) all the features supported by the underlying operating system. At the same time, they want to minimize the amount of work required to move a program from one system to another.
IN MORE DEPTH
After a brief introduction to interactive and file-based I/O, we focus mainly on the
common case of text files. The data in a text file are stored in character form but
may be converted to and from internal types during read and write operations.
As examples, we consider the text I/O facilities of Fortran, Ada, C, and C++.
7.10 Equality Testing and Assignment
For simple, primitive data types such as integers, floating-point numbers, or
characters, equality testing and assignment are relatively straightforward operations, with obvious semantics and obvious implementations (bit-wise comparison or copy). For more complicated or abstract data types, however, both semantic and implementation subtleties arise.
Consider for example the problem of comparing two character strings. Should
the expression s = t determine whether s and t
are aliases for one another?
occupy storage that is bit-wise identical over its full length?
contain the same sequence of characters?
would appear the same if printed?
The second of these tests is probably too low level to be of interest in most programs; it suggests the possibility that a comparison might fail because of garbage
in currently unused portions of the space reserved for a string. The other three
alternatives may all be of interest in certain circumstances, and may generate different results.
In many cases the definition of equality boils down to the distinction between
l-values and r-values: in the presence of references, should expressions be considered equal only if they refer to the same object, or also if the objects to which
they refer are in some sense equal? The first option (refer to the same object) is
known as a shallow comparison. The second (refer to equal objects) is called a
deep comparison. For complicated data structures (e.g., lists or graphs) a deep
comparison may require recursive traversal.
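The distinction is easy to see in C for a simple linked list (a sketch; the type and function names are illustrative):

#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    int val;
    struct node *next;
} node;

/* shallow: are a and b the very same object? */
bool shallow_equal(node *a, node *b) {
    return a == b;
}

/* deep: do a and b refer to structures with the same contents? */
bool deep_equal(node *a, node *b) {
    while (a != NULL && b != NULL) {
        if (a->val != b->val) return false;
        a = a->next;
        b = b->next;
    }
    return a == NULL && b == NULL;    /* both lists must end at the same point */
}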
In imperative programming languages assignment operations may also be
deep or shallow. Under a reference model of variables, a shallow assignment
a := b will make a refer to the object to which b refers. A deep assignment
will create a copy of the object to which b refers, and make a refer to the copy.
Under a value model of variables, a shallow assignment will copy the value of b
into a , but if that value is a pointer (or a record containing pointers), then the
objects to which the pointer(s) refer will not be copied.
EXAMPLE 7.103 (Equality testing in Scheme)
Most programming languages employ both shallow comparisons and shallow
assignment. A few (notably Python and the various dialects of Lisp) provide more
than one option for comparison. Scheme, for example, has three equality-testing
functions:
(eq? a b)      ; do a and b refer to the same object?
(eqv? a b)     ; are a and b provably semantically equivalent?
(equal? a b)   ; do a and b have the same recursive structure?
The intent behind the eq? predicate is to make the implementation as fast
as possible while still producing useful results for many types of operands. The
intent behind eqv? is to provide as intuitively appealing a result as possible for
as wide a range of types as possible.
The eq? predicate behaves as one would expect for Booleans, symbols
(names), and pairs (things built by cons ), but can have implementation-defined
behavior on numbers, characters, and strings.
(eq? #t #t)              ⇒  #t (true)
(eq? 'foo 'foo)          ⇒  #t
(eq? '(a b) '(a b))      ⇒  #f (false); created by separate cons-es
(let ((p '(a b)))
  (eq? p p))             ⇒  #t; created by the same cons
(eq? 2 2)                ⇒  unspecified
(eq? "foo" "foo")        ⇒  unspecified
In any particular implementation, numeric, character, and string tests will always
work the same way; if (eq? 2 2) returns true , then (eq? 37 37) will return
true also. Implementations are free to choose whichever behavior results in the
fastest code.
The exact rules that govern the situations in which eqv? is guaranteed to return true or false are quite involved. Among other things, they specify that
eqv? should behave as one might expect for numbers, characters, and nonempty
strings, and that two objects will never test true for eqv? if there are any circumstances under which they would behave differently. (Conversely, however,
eqv? is allowed to return false for certain objects—functions, for example—
that would behave identically in all circumstances.) The eqv? predicate is “less
discriminating” than eq? , in the sense that eqv? will never return false when
eq? returns true .
For structures (lists), eqv? returns false if its arguments refer to different
root cons cells. In many programs this is not the desired behavior. The equal?
predicate recursively traverses two lists to see if their internal structure is the same
and their leaves are eqv? . The equal? predicate may lead to an infinite loop if
the programmer has used the imperative features of Scheme to create a circular
list.
Deep assignments are relatively rare. They are used primarily in distributed
computing, and in particular for parameter passing in remote procedure call
(RPC) systems. These will be discussed in Section 12.4.4.
For user-defined abstractions, no single language-specified mechanism for
equality testing or assignment is likely to produce the desired results in all cases.
Languages with sophisticated data abstraction mechanisms usually allow the programmer to define the comparison and assignment operators for each new data
type—or to specify that equality testing and/or assignment is not allowed.
CHECK YOUR UNDERSTANDING
45. What are dangling references? How are they created, and why are they a problem? Discuss the comparative advantages of tombstones and locks and keys as
a means of solving the problem.
46. What is garbage? How is it created, and why is it a problem? Discuss the comparative advantages of reference counts and tracing collection as a means of
solving the problem.
47. Summarize the differences among mark-and-sweep, stop-and-copy, and generational garbage collection.
48. What is pointer reversal? What problem does it address?
49. What is “conservative” garbage collection? How does it work?
50. Do dangling references and garbage ever arise in the same programming language? Why or why not?
51. Why was automatic garbage collection so slow to be adopted by imperative
programming languages?
52. What are the advantages and disadvantages of allowing pointers to refer to
objects that do not lie in the heap?
53. Why are lists so heavily used in functional programming languages?
54. Why is equality testing more subtle than it first appears?
7.11 Summary and Concluding Remarks
This section concludes the third of our five core chapters on language design
(names [from Part I], control flow, types, subroutines, and classes). In the first
two sections we looked at the general issues of type systems and type checking.
In the remaining sections we examined the most important composite types:
records and variants, arrays and strings, sets, pointers and recursive types, lists,
and files. We noted that types serve two principal purposes: they provide implicit
context for many operations, freeing the programmer from the need to specify
that context explicitly, and they allow the compiler to catch a wide variety of
common programming errors. A type system consists of a set of built-in types;
a mechanism to define new types; and rules for type equivalence, type compatibility, and type inference. Type equivalence determines when two names or values
have the same type. Type compatibility determines when a value of one type may
be used in a context that “expects” another type. Type inference determines the
type of an expression based on the types of its components or (sometimes) the
surrounding context. A language is said to be strongly typed if it never allows an
operation to be applied to an object that does not support it; a language is said to
be statically typed if it enforces strong typing at compile time.
In our general discussion of types we distinguished between the denotational,
constructive, and abstraction-based points of view, which regard types, respectively, in terms of their values, their substructure, and the operations they support. We introduced terminology for the common built-in types and for enumerations, subranges, and the common type constructors. We discussed several
different approaches to type equivalence, compatibility, and inference, including (on the PLP CD) a detailed examination of the inference rules of ML. We
also examined type conversion, coercion, and nonconverting casts. In the area of
type equivalence, we contrasted the structural and name-based approaches, noting that while name equivalence appears to have gained in popularity, structural
equivalence retains its advocates.
In our survey of composite types, we spent the most time on records, arrays,
and recursive types. Key issues for records include the syntax and semantics of
variant records, whole-record operations, type safety, and the interaction of each
of these with memory layout. Memory layout is also important for arrays, in
which it interacts with binding time for shape; static, stack, and heap-based allocation strategies; efficient array traversal in numeric applications; the interoperability of pointers and arrays in C; and the available set of whole-array and
slice-based operations.
For recursive data types, much depends on the choice between the value and
reference models of variables/names. Recursive types are a natural fall-out of the
reference model; with the value model they require the notion of a pointer: a variable whose value is a reference. The distinction between values and references is
important from an implementation point of view: it would be wasteful to implement built-in types as references, so languages with a reference model generally
implement built-in and user-defined types differently. Java reflects this distinction in the language semantics, calling for a value model of built-in types and a
reference model for objects of user-defined type classes.
Recursive types are generally used to create linked data structures. In most
cases these structures must be allocated from a heap. In some languages, the pro-
7.11 Summary and Concluding Remarks
397
grammer is responsible for deallocating heap objects that are no longer needed.
In other languages, the language run-time system identifies and reclaims such
garbage automatically. Explicit deallocation is a burden on the programmer,
and leads to the problems of memory leaks and dangling references. While language implementations almost never attempt to catch memory leaks (see Exploration 3.28 and Exercise 7.30, however, for some ideas on this subject) tombstones
or locks and keys are sometimes used to catch dangling references. Automatic
garbage collection can be expensive but has proven increasingly popular. Most
garbage-collection techniques rely either on reference counts or on some form of
recursive exploration (tracing) of currently accessible structures. Techniques in
this latter category include mark-and-sweep, stop-and-copy, and generational collection.
Few areas of language design display as much variation as I/O. Our discussion
(largely on the PLP CD) distinguished between interactive I/O, which tends to be
very platform specific, and file-based I/O, which subdivides into temporary files,
used for voluminous data within a single program run, and persistent files, used
for off-line storage. Files also subdivide into those that represent their information in a binary form that mimics layout in memory and those that convert to
and from character-based text. In comparison to binary files, text files generally
incur both time and space overhead, but they have the important advantages of
portability and human readability.
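The trade-off can be seen in a few lines of C (file names here are invented for the example): the binary file simply copies the in-memory representation, while the text file pays for a conversion to characters but can be read by a human and moved between machines.

#include <stdio.h>

int main(void) {
    double x = 3.14159;                 /* error checking omitted for brevity */

    FILE *bin = fopen("x.bin", "wb");
    fwrite(&x, sizeof x, 1, bin);       /* raw memory image: sizeof(double) bytes */
    fclose(bin);

    FILE *txt = fopen("x.txt", "w");
    fprintf(txt, "%.17g\n", x);         /* character-based (portable, readable) text */
    fclose(txt);
    return 0;
}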
In our examination of types, we saw many examples of language innovations
that have served to improve the clarity and maintainability of programs, often
with little or no performance overhead. Examples include the original idea of
user-defined types (Algol 68), enumeration and subrange types (Pascal), the integration of records and variants (Pascal), and the distinction between subtypes
and derived types in Ada. In Chapter 9 we will examine what many consider the
most important innovation of the past thirty years, namely object orientation.
In some cases, the distinctions between languages are less a matter of evolution than of fundamental differences in philosophy. We have already mentioned
the choice between the value and reference models of variables/names. In a similar vein, most languages have adopted static typing, but Smalltalk, Lisp, and the
many scripting languages work well with dynamic types. Most statically typed
languages have adopted name equivalence, but ML and Modula-3 work well with
structural equivalence. Most languages have moved away from type coercions,
but C++ embraces them: together with operator overloading, they make it possible to define terse, type-safe I/O routines outside the language proper.
As in the previous chapter, we saw several cases in which a language’s convenience, orthogonality, or type safety appears to have been compromised in order
to simplify the compiler, or to make compiled programs smaller or faster. Examples include the lack of an equality test for records in most languages, the
requirement in Pascal and Ada that the variant portion of a record lie at the end,
the limitations in many languages on the maximum size of sets, the lack of type
checking for I/O in C, and the general lack of dynamic semantic checks in many
language implementations. We also saw several examples of language features in-
troduced at least in part for the sake of efficient implementation. These include
packed types, multilength numeric types, with statements, decimal arithmetic,
and C-style pointer arithmetic.
At the same time, one can identify a growing willingness on the part of language designers and users to tolerate complexity and cost in language implementation in order to improve semantics. Examples here include the type-safe variant
records of Ada; the standard-length numeric types of Java and C#; the variable-length strings and string operators of Icon, Java, and C#; the late binding of array
bounds in Ada; and the wealth of whole-array and slice-based array operations in
Fortran 90. One might also include the polymorphic type inference of ML. Certainly one should include the trend toward automatic garbage collection. Once
considered too expensive for production-quality imperative languages, garbage
collection is now standard not only in such experimental languages as Clu and
Cedar, but in Ada, Modula-3, Java, and C# as well. Many of these features, including variable-length strings, slices, and garbage collection, have been embraced by
scripting languages.
7.12 Exercises
7.1 Most modern Algol-family languages use some form of name equivalence
for types. Is structural equivalence a bad idea? Why or why not?
7.2 In the following code, which of the variables will a compiler consider to have
compatible types under structural equivalence? Under strict name equivalence? Under loose name equivalence?
type T = array [1..10] of integer
S = T
A : T
B : T
C : S
D : array [1..10] of integer
7.3 Consider the following declarations.
1. type cell              –– a forward declaration
2. type cell ptr = pointer to cell
3. x : cell
4. type cell = record
5.     val : integer
6.     next : cell ptr
7. y : cell
Should the declaration at line 4 be said to introduce an alias type? Under
strict name equivalence, should x and y have the same type? Explain.
7.4 Suppose you are implementing an Ada compiler, and must support arithmetic on 32-bit fixed-point binary numbers with a programmer-specified
number of fractional bits. Describe the code you would need to generate to
add, subtract, multiply, or divide two fixed-point numbers. You should assume that the hardware provides arithmetic instructions only for integers
and IEEE floating point. You may assume that the integer instructions preserve full precision; in particular, integer multiplication produces a 64-bit
result. Your description should be general enough to deal with operands
and results that have different numbers of fractional bits.
7.5 When Sun Microsystems ported Berkeley Unix from the Digital VAX to the
Motorola 680x0 in the early 1980s, many C programs stopped working and
had to be repaired. In effect, the 680x0 revealed certain classes of program
bugs that one could “get away with” on the VAX. One of these classes of bugs
occurred in programs that use more than one size of integer (e.g., short
and long ) and arose from the fact that the VAX is a little-endian machine,
while the 680x0 is big-endian (Section 5.2). Another class of bugs occurred
in programs that manipulate both null and empty strings. It arose from the
fact that location zero in a process’s address space on the VAX always contained a zero, while the same location on the 680x0 is not in the address
space, and will generate a protection error if used. For both of these classes
of bugs, give examples of program fragments that would work on a VAX but
not on a 680x0.
7.6 Ada provides two “remainder” operators, rem and mod for integer types,
defined as follows [Ame83, Sec. 4.5.5]:
Integer division and remainder are defined by the relation A = (A/B)*B +
(A rem B) , where (A rem B) has the sign of A and an absolute value less
than the absolute value of B . Integer division satisfies the identity (-A)/B
= -(A/B) = A/(-B) .
The result of the modulus operation is such that (A mod B) has the sign
of B and an absolute value less than the absolute value of B ; in addition,
for some integer value N , this result must satisfy the relation A = B*N +
(A mod B) .
Give values of A and B for which A rem B and A mod B differ. For what
purposes would one operation be more useful than the other? Does it make
sense to provide both, or is it overkill?
Consider also the % operator of C and the mod operator of Pascal. The
designers of these languages could have picked semantics resembling those
of either Ada’s rem or its mod . Which did they pick? Do you think they made
the right choice?
7.7 Consider the problem of performing range checks on set expressions in Pascal. Given that a set may contain many elements, some of which may be
known at compile time, describe the information that a compiler might
maintain in order to track both the elements known to belong to the set
and the possible range of unknown elements. Then explain how to update
this information for the following set operations: union, intersection, and
difference. The goal is to determine (1) when subrange checks can be eliminated at run time and (2) when subrange errors can be reported at compile
time. Bear in mind that the compiler cannot do a perfect job: some unnecessary run-time checks will inevitably be performed, and some operations
that must always result in errors will not be caught at compile time. The goal
is to do as good a job as possible at reasonable cost.
7.8 Suppose we are compiling for a machine with 1-byte characters, 2-byte
shorts, 4-byte integers, and 8-byte reals, and with alignment rules that require the address of every primitive data element to be an even multiple
of the element’s size. Suppose further that the compiler is not permitted to
reorder fields. How much space will be consumed by the following array?
A : array [0..9] of record
    s : short
    c : char
    t : short
    d : char
    r : real
    i : integer
7.9 Show how variant records in Pascal or unions in C can be used to interpret
the bits of a value of one type as if they represented a value of some other
type. Explain why the same technique does not work in Ada. If you have
access to an Ada manual, describe how an unchecked pragma can be used
to get around the Ada rules.
7.10 Are variant records a form of polymorphism? Why or why not?
7.11 Pascal does not permit the tag field of a variant record to be passed to a
subroutine by reference (i.e., as a var parameter). Why not?
7.12 Explain how to implement dynamic semantic checks to catch references to
uninitialized fields of a tagged variant record in Pascal. Changing the value
of the tag field should cause all fields of the variant part of the record to
become uninitialized. Suppose you want to avoid adding flag fields within
the record itself (e.g., to avoid changing the offsets of fields in a systems
program). How much harder is your task?
7.13 Explain how to implement dynamic semantic checks to catch references to
uninitialized fields of an untagged variant record in Pascal. Any assignment
to a field of a variant should cause all fields of other variants to become
uninitialized. Any assignment that changes the record from one variant to
another should also cause all other fields of the new variant to be uninitialized. Again, suppose you want to avoid adding flag fields within the untagged record itself. How much harder is your task?
7.14 We noted in Section 7.3.4 that Pascal and Ada require the variant portions
of a record to occur at the end, to save space when a particular record is
constrained to have a comparatively small variant part. Could a compiler
rearrange fields to achieve the same effect, without the restriction on the
declaration order of fields? Why or why not?
7.15 Give Ada code to map from lowercase to uppercase letters, using
(a) an array
(b) a function
Note the similarity of syntax: in both cases upper(’a’) is ’A’ .
7.16 In Section 7.4 we discussed how to differentiate between the constant and
variable portions of an array reference, in order to efficiently access the subparts of array and record objects. An alternative approach is to generate
naive code and count on the compiler’s code improver to find the constant
portions, group them together, and calculate them at compile time. Discuss
the advantages and disadvantages of each approach.
7.17 Explain how to extend Figure 7.7 to accommodate subroutine arguments
that are passed by value, but whose shape is not known until the subroutine
is called at run time.
7.18 Explain how to obtain the effect of Fortran 90’s
allocate statement for
one-dimensional arrays using pointers in C. You will probably find that your
solution does not generalize to multidimensional arrays. Why not? If you are
familiar with C++, show how to use its class facilities to solve the problem.
7.19 Consider the following C declaration, compiled on a 32-bit Pentium machine.
struct {
int n;
char c;
} A[10][10];
If the address of A[0][0] is 1000 (decimal), what is the address of A[3][7]?
7.20 Consider the following Pascal variable declarations.
var A : array [1..10, 10..100] of real;
i : integer;
x : real;
Assume that a real number occupies eight bytes and that A , i , and x are
global variables. In something resembling assembly language for a RISC machine, show the code that a reasonable compiler would generate for the following assignment: x := A[3,i] . Explain how you arrived at your answer.
7.21 Suppose A is a 10 × 10 array of (4-byte) integers, indexed from [0][0]
through [9][9]. Suppose further that the address of A is currently in register r1 , the value of integer i is currently in register r2 , and the value of
integer j is currently in register r3 .
Give pseudo-assembly language for a code sequence that will load the
value of A[i][j] into register r1 (a) assuming that A is implemented using (row-major) contiguous allocation; (b) assuming that A is implemented
using row pointers. Each line of your pseudocode should correspond to a
single instruction on a typical modern machine. You may use as many registers as you need. You need not preserve the values in r1 , r2 , and r3 . You
may assume that i and j are in bounds, and that addresses are 4 bytes long.
Which code sequence is likely to be faster? Why?
7.22 In Examples 7.69 and 7.70, show the code that would be required to access
A[i, j, k] if subscript bounds checking were required.
7.23 Pointers and recursive type definitions complicate the algorithm for determining structural equivalence of types. Consider, for example, the following
definitions.
type A = record
x : pointer to B
y : real
type B = record
x : pointer to A
y : real
The simple definition of structural equivalence given in Section 7.2.1 (expand the subparts recursively until all you have is a string of built-in types
and type constructors; then compare them) does not work: we get an infinite expansion ( type A = record x : pointer to record x : pointer to record x :
pointer to record . . . ). The obvious reinterpretation is to say two types A and
B are equivalent if any sequence of field selections, array subscripts, pointer
dereferences, and other operations that takes one down into the structure
of A , and that ends at a built-in type, always ends at the same built-in type
when used to dive into the structure of B (and encounters the same field
names along the way). Under this reinterpretation, A and B above have the
same type. Give an algorithm based on this reinterpretation that could be
used in a compiler to determine structural equivalence. (Hint: The fastest
approach is due to J. Král [Krá73]. It is based on the algorithm used to find
the smallest deterministic finite automaton that accepts a given regular language. This algorithm was outlined in Example 2.13 [page 53]; details can
be found in any automata theory textbook [e.g., [HMU01]].)
7.24 Explain the meaning of the following C declarations.
double *a[n];
double (*b)[n];
double (*c[n])();
double (*d())[n];
7.25 In Ada 83, as in Pascal, pointers ( access variables) can point only to objects
in the heap. Ada 95 allows a new kind of pointer, the access all type, to
point to other objects as well, provided that those objects have been declared
to be aliased :
type int_ptr is access all Integer;
foo : aliased Integer;
ip : int_ptr;
...
ip := foo’Access;
The ’Access attribute is roughly equivalent to C’s “address of ” ( & ) operator. How would you implement access all types and aliased objects?
How would your implementation interact with automatic garbage collection (assuming it exists) for objects in the heap?
7.26 As noted in Section 7.7.2, Ada 95 forbids an access
all pointer from referring to any object whose lifetime is briefer than that of the pointer’s type.
Can this rule be enforced completely at compile time? Why or why not?
7.27 In the discussion of pointers in Section 7.7, we assumed implicitly that every
pointer into the heap points to the beginning of a dynamically allocated
block of storage. In some languages, including Algol 68 and C, pointers may
also point to data inside a block in the heap. If you were trying to implement
dynamic semantic checks for dangling references or, alternatively, automatic
garbage collection, how would your task be complicated by the existence of
such “internal pointers”?
7.28 (a) A tracing garbage collector in a typesafe language can find and reclaim all unreachable objects. It will not necessarily reclaim all useless
objects—those that will never be used again. Explain.
(b) With future technology, might it be possible to design a garbage collector that will reclaim all useless objects? Again, explain.
7.29 (a) Occasionally one encounters the suggestion that a garbage-collected
language should provide a delete operation as an optimization: by explicitly delete -ing objects that will never be used again, the programmer might save the garbage collector the trouble of finding and reclaiming those objects automatically, thereby improving performance. What
do you think of this suggestion? Explain.
(b) Alternatively, one might allow the programmer to “tenure” an object,
so that it will never be a candidate for reclamation. Is this a good idea?
7.30 In Example 7.96 we noted that reference counts can be used to reclaim
tombstones, failing only when the programmer neglects to manually delete
the object to which a tombstone refers. Explain how to leverage this observation to catch memory leaks at run time. Does your solution work in all
cases? Explain.
7.31 In Example 7.96 we also noted that functional languages can safely use reference counts if the lack of an assignment statement prevents them from introducing circularity. This isn’t strictly true: constructs like the Lisp letrec
can also be used to make cycles. How might you address this problem?
7.32 Here is a skeleton for the standard quicksort algorithm in Haskell:
quicksort [] = []
quicksort (a : l) = quicksort [...] ++ [a] ++ quicksort [...]
The ++ operator denotes list concatenation (similar to @ in ML). The : operator is equivalent to ML’s :: or Lisp’s cons . Show how to express the two
elided expressions as list comprehensions.
7.33–7.37 In More Depth.
7.13 Explorations
7.38 Some language definitions specify a particular representation for data types
in memory, while others specify only the semantic behavior of those types.
For languages in the latter class, some implementations guarantee a particular representation, while others reserve the right to choose different representations in different circumstances. Which approach do you prefer? Why?
7.39 If you have access to a compiler that provides optional dynamic semantic
checks for out-of-bounds array subscripts, use of an inappropriate record
variant, and/or dangling or uninitialized pointers, experiment with the cost
of these checks. How much do they add to the execution time of programs
that make a significant number of checked accesses? Experiment with different levels of optimization (code improvement) to see what effect it has
on the overhead of checks.
7.40 Investigate the typestate mechanism employed by Strom et al. in the Hermes
programming language [SBG+91]. Discuss its relationship to the notion of
definite assignment in Java and C# (Section 6.1.3).
7.41 Investigate the notion of type conformance, employed by Black et al. in the
Emerald programming language [BHJ+87]. Discuss how conformance relates to the type inference of ML and to the class-based typing of object-oriented languages.
7.42 Write a library package that might be used by a language implementation to
manage sets of elements drawn from a very large base type (e.g., integer ).
You should support membership tests, union, intersection, and difference.
Does your package allocate memory from the heap? If so, what would a
compiler that assumed the use of your package need to do to make sure
that space was reclaimed when no longer needed?
7.43 Learn about SETL [SDDS86], a programming language based on sets, designed by Jack Schwartz of New York University. List the mechanisms provided as built-in set operations. Compare this list with the set facilities of
other programming languages. What data structure(s) might a SETL implementation use to represent sets in a program?
7.44 Implement your favorite garbage collection algorithm in Ada 95. Alternatively, implement a special pointer class in C++ for which storage is
garbage-collected. You’ll want to use templates (generics) so that your class
can be instantiated for arbitrary pointed-to types.
7.45 Experiment with the cost of garbage collection in your favorite language
implementation. What kind of collector does it use? Can you create artificial
programs for which it performs particularly well or poorly? (Hint: Check
to see if your machine and operating system allow user-level programs to
access a low-cost, high-resolution clock register.)
7.46–7.48 In More Depth.
7.14 Bibliographic Notes
References to general information on the various programming languages mentioned in this chapter can be found in Appendix A, and in the Bibliographic
Notes for Chapters 1 and 6. Welsh, Sneeringer, and Hoare [WSH77] provide a
critique of the original Pascal definition, with a particular emphasis on its type
system. Tanenbaum’s comparison of Pascal and Algol 68 also focuses largely on
types [Tan78]. Cleaveland [Cle86] provides a book-length study of many of the
issues in this chapter. Pierce [Pie02] provides a formal and detailed modern coverage of the subject. The ACM Special Interest Group on Programming Languages launched a biennial workshop on Types in Language Design and Implementation in 2003.
What we have referred to as the denotational model of types originates with
Hoare [DDH72]. Denotational formulations of the overall semantics of programming languages are discussed in the Bibliographic Notes for Chapter 4.
A related but distinct body of work uses algebraic techniques to formalize data
abstraction; key references include Guttag [Gut77] and Goguen et al. [GTW78].
Milner’s original paper [Mil78] is the seminal reference on type inference in ML.
Mairson [Mai90] proves that the cost of unifying ML types is O(2^n), where n is
the length of the program. Fortunately, the cost is linear in the size of the program’s type expressions, so the worst case arises only in programs whose semantics are too complex for a human being to understand anyway.
Hoare [Hoa75] discusses the definition of recursive types under a reference
model of variables. Cardelli and Wegner survey issues related to polymorphism,
overloading, and abstraction [CW85]. The new Character Model standard for
the World Wide Web provides a remarkably readable introduction to the subtleties and complexities of international character sets [Wor05]. Conway’s game
of Life, which appeared in Figure 7.8, was first described by Martin Gardner in
his “Mathematical Games” column in Scientific American [Gar70].
Tombstones are due to Lomet [Lom75, Lom85]. Locks and keys are due to
Fischer and LeBlanc [FL80]. The latter also discuss how to check for various
other dynamic semantic errors in Pascal, including those that arise with variant records. Constant-space (pointer-reversing) mark-and-sweep garbage collection is due to Schorr and Waite [SW67]. Stop-and-copy collection was developed
by Fenichel and Yochelson [FY69], based on ideas due to Minsky. Deutsch and
Bobrow [DB76] describe an incremental garbage collector that avoids the “stop-the-world” phenomenon. Wilson and Johnstone [WJ93] describe a more recent
incremental collector. The conservative collector described at the end of Section 7.7.3 is due to Boehm and Weiser [BW88]. Cohen [Coh81] surveys garbage-collection techniques as of 1981; Wilson [Wil92b] and Jones and Lins [JL96]
provide more recent views.
8 Subroutines and Control Abstraction
In the introduction to Chapter 3, we defined abstraction as a process by
which the programmer can associate a name with a potentially complicated program fragment, which can then be thought of in terms of its purpose or function, rather than in terms of its implementation. We sometimes distinguish between control abstraction, in which the principal purpose of the abstraction is to
perform a well-defined operation, and data abstraction, in which the principal
purpose of the abstraction is to represent information.1 We will consider data
abstraction in more detail in Chapter 9.
Subroutines are the principal mechanism for control abstraction in most programming languages. A subroutine performs its operation on behalf of a caller,
who waits for the subroutine to finish before continuing execution. Most subroutines are parameterized: the caller passes arguments that influence the subroutine’s behavior, or provide it with data on which to operate. Arguments are also
called actual parameters. They are mapped to the subroutine’s formal parameters
at the time a call occurs. A subroutine that returns a value is usually called a function. A subroutine that does not return a value is usually called a procedure. Most
languages require subroutines to be declared before they are used, though a few
(including Fortran, C, and Lisp) do not. Declarations allow the compiler to verify
that every call to a subroutine is consistent with the declaration—that is, that it
passes the right number and types of arguments.
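A small C example (function names are ours) may help fix the terminology: average returns a value and is therefore a function, report is executed only for its effect and plays the role of a procedure, and the declared parameter lists let the compiler check each call.

#include <stdio.h>

int average(int x, int y) {        /* x and y are formal parameters */
    return (x + y) / 2;
}

void report(int value) {           /* a "procedure": no return value */
    printf("average = %d\n", value);
}

int main(void) {
    int a = 4, b = 10;
    report(average(a, b));         /* a and b are actual parameters (arguments) */
    return 0;
}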
As noted in Section 3.2.2, the storage consumed by parameters and local variables can in most languages be allocated on a stack. We therefore begin this chapter, in Section 8.1, by reviewing the layout of the stack. We then turn in Section 8.2
to the calling sequences that serve to maintain this layout. In the process, we revisit
the use of static chains to access nonlocal variables in nested subroutines and consider (on the PLP CD) an alternative mechanism, known as a display, that serves
1 The distinction between control and data abstraction is somewhat fuzzy, because the latter usually encapsulates not only information, but also the operations that access and modify that information. Put another way, most data abstractions include control abstraction.
a similar purpose. We also consider subroutine inlining and the representation
of closures. To illustrate some of the possible implementation alternatives, we
present (again on the PLP CD) a pair of case studies: the SGI MIPSpro C compiler for the MIPS instruction set, and the GNU gpc Pascal compiler for the x86
instruction set, as well as the register window mechanism of the Sparc instruction
set.
In Section 8.3 we look more closely at subroutine parameters. We consider
parameter-passing modes, which determine the operations that a subroutine can
apply to its formal parameters and the effects of those operations on the corresponding actual parameters. We also consider conformant arrays, named and
default parameters, variable numbers of arguments, and function return mechanisms. In Section 8.4 we turn to generic subroutines and modules (classes), which
support explicit parametric polymorphism, as defined in Section 3.6.3. Where
conventional parameters allow a subroutine to operate on many different values,
generic parameters allow it to operate on data of many different types.
In Section 8.5, we consider the handling of exceptional conditions. While exceptions can sometimes be confined to the current subroutine, in the general case
they require a mechanism to “pop out of ” a nested context without returning, so
that recovery can occur in the calling context. Finally, in Section 8.6, we consider
coroutines, which allow a program to maintain two or more execution contexts,
and to switch back and forth among them. Coroutines can be used to implement
iterators (Section 6.5.3), but they have other uses as well, particularly in simulation and in server programs. In Chapter 12 we will use them as the basis for
concurrent (“quasiparallel”) threads.
8.1 Review of Stack Layout
In Section 3.2.2 we discussed the allocation of space on a subroutine call stack
(Figure 3.2, page 110). Each routine, as it is called, is given a new stack frame,
or activation record, at the top of the stack. This frame may contain arguments
and/or return values, bookkeeping information (including the return address
and saved registers), local variables, and/or temporaries. When a subroutine returns, its frame is popped from the stack.
At any given time, the stack pointer register contains the address of either the
last used location at the top of the stack or the first unused location, depending
on convention. The frame pointer register contains an address within the frame.
Objects in the frame are accessed via displacement addressing with respect to the
frame pointer. If the size of an object (e.g., a local array) is not known at compile
time, then the object is placed in a variable-size area at the top of the frame; its
address and dope vector are stored in the fixed-size portion of the frame, at a
statically known offset from the frame pointer (Figure 7.7, page 354). If there
are no variable-size objects, then every object within the frame has a statically
known offset from the stack pointer, and the implementation may dispense with
Figure 8.1 Example of subroutine nesting, taken from Figure 3.5. Within B , C , and D , all five
routines are visible. Within A and E , routines A , B , and E are visible, but C and D are not.
Given the calling sequence A , E , B , D , C , in that order, frames will be allocated on the stack as
shown at right, with the indicated static and dynamic links.
the frame pointer, freeing up a register for other use. If the size of an argument is
not known at compile time, then the argument may be placed in a variable-size
portion of the frame below the other arguments, with its address and dope vector
at known offsets from the frame pointer. Alternatively, the caller may simply pass
a temporary address and dope vector, counting on the called routine to copy the
argument into the variable-size area at the top of the frame.
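The C sketch below (field names are ours) models the dope vector arrangement described above: the vector itself has a fixed size and can live at a statically known offset in the frame, while its data pointer refers to the element storage in the variable-size area.

#include <stddef.h>

/* Dope vector for a one-dimensional array whose shape is not known until
   run time. The vector has a fixed size and a fixed offset in the frame;
   'data' points into the variable-size area. */
struct dope_vector {
    int lower;           /* lower bound of the index range */
    int upper;           /* upper bound of the index range */
    size_t elem_size;    /* size of one element, in bytes  */
    char *data;          /* start of the element storage   */
};

/* Address computation for a[i], as compiler-generated code would do it. */
void *elem_addr(const struct dope_vector *dv, int i) {
    return dv->data + (size_t)(i - dv->lower) * dv->elem_size;
}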
In a language with nested subroutines and static scoping (e.g., Pascal, Ada,
ML, Common Lisp, or Scheme), objects that lie in surrounding subroutines and
are thus neither local nor global can be found by maintaining a static chain (Figure 8.1). Each stack frame contains a reference to the frame of the lexically surrounding subroutine. This reference is called the static link. By analogy, the saved
value of the frame pointer, which will be restored on subroutine return, is called
the dynamic link. The static and dynamic links may or may not be the same, depending on whether the current routine was called by its lexically surrounding
routine, or by some other routine nested in that surrounding routine.
Whether or not a subroutine is called directly by the lexically surrounding
routine, we can be sure that the surrounding routine is active; there is no other
way that the current routine could have been visible, allowing it to be called.
Consider for example, the subroutine nesting shown in Figure 8.1. If subroutine
D is called directly from B , then clearly B ’s frame will already be on the stack.
How else could D be called? It is not visible in A or E , because it is nested inside of
B . A moment’s thought makes clear that it is only when control enters B (placing
B ’s frame on the stack) that D comes into view. It can therefore be called by C ,
or by any other routine (not shown) that is nested inside of C or D , but only
because these are also within B .
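To make the mechanism concrete, the C sketch below (a simulation; a real compiler emits the equivalent loads inline, and the 64-byte local area is an arbitrary choice) represents each frame with explicit static and dynamic links and walks the static chain to reach a variable declared k lexical levels out, at a compile-time-known offset within that frame.

struct frame {
    struct frame *static_link;    /* frame of the lexically surrounding routine */
    struct frame *dynamic_link;   /* saved frame pointer of the caller          */
    char locals[64];              /* local variables, at fixed offsets          */
};

/* Address of a nonlocal variable: follow 'levels' static links from the
   current frame, then add the variable's statically known offset. */
void *nonlocal_addr(struct frame *fp, int levels, int offset) {
    while (levels-- > 0)
        fp = fp->static_link;
    return fp->locals + offset;
}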
8.2 Calling Sequences
In Section 3.2.2 we also mentioned that maintenance of the subroutine call stack
is the responsibility of the calling sequence—the code executed by the caller immediately before and after a subroutine call—and of the prologue (code executed
at the beginning) and epilogue (code executed at the end) of the subroutine itself.
Sometimes the term “calling sequence” is used to refer to the combined operations of the caller, the prologue, and the epilogue.
Tasks that must be accomplished on the way into a subroutine include passing
parameters, saving the return address, changing the program counter, changing
the stack pointer to allocate space, saving registers (including the frame pointer)
that contain important values and that may be overwritten by the callee, changing
the frame pointer to refer to the new frame, and executing initialization code for
any objects in the new frame that require it. Tasks that must be accomplished
on the way out include passing return parameters or function values, executing
finalization code for any local objects that require it, deallocating the stack frame
(restoring the stack pointer), restoring other saved re