Core Python Applications Programming

“The simplified yet deep level of detail, comprehensive coverage of material,
and informative historical references make this book perfect for the classroom... An easy read, with complex examples presented simply, and great
historical references rarely found in such books. Awesome!”
—Gloria W.
Praise for the Previous Edition
“The long-awaited second edition of Wesley Chun’s Core Python Programming
proves to be well worth the wait—its deep and broad coverage and useful
exercises will help readers learn and practice good Python.”
—Alex Martelli, author of Python in a Nutshell and editor of Python Cookbook
“There has been lot of good buzz around Wesley Chun’s Core Python
Programming. It turns out that all the buzz is well earned. I think this is the
best book currently available for learning Python. I would recommend Chun’s
book over Learning Python (O’Reilly), Programming Python (O’Reilly), or The
Quick Python Book (Manning).”
—David Mertz, Ph.D., IBM DeveloperWorks
“I have been doing a lot of research [on] Python for the past year and have
seen a number of positive reviews of your book. The sentiment expressed
confirms the opinion that Core Python Programming is now considered the
standard introductory text.”
—Richard Ozaki, Lockheed Martin
“Finally, a book good enough to be both a textbook and a reference on the
Python language now exists.”
—Michael Baxter, Linux Journal
“Very well written. It is the clearest, friendliest book I have come across
yet for explaining Python, and putting it in a wider context. It does not
presume a large amount of other experience. It does go into some important Python topics carefully and in depth. Unlike too many beginner
books, it never condescends or tortures the reader with childish hide-and-seek prose games. [It] sticks to gaining a solid grasp of Python syntax and
structure.”
—http://python.org bookstore Web site
“[If] I could only own one Python book, it would be Core Python Programming
by Wesley Chun. This book manages to cover more topics in more depth
than Learning Python but includes it all in one book that also more than
adequately covers the core language. [If] you are in the market for just one
book about Python, I recommend this book. You will enjoy reading it,
including its wry programmer’s wit. More importantly, you will learn
Python. Even more importantly, you will find it invaluable in helping
you in your day-to-day Python programming life. Well done, Mr. Chun!”
—Ron Stephens, Python Learning Foundation
“I think the best language for beginners is Python, without a doubt. My
favorite book is Core Python Programming.”
—s003apr, MP3Car.com Forums
“Personally, I really like Python. It’s simple to learn, completely intuitive,
amazingly flexible, and pretty darned fast. Python has only just started to
claim mindshare in the Windows world, but look for it to start gaining lots
of support as people discover it. To learn Python, I’d start with Core Python
Programming by Wesley Chun.”
—Bill Boswell, MCSE, Microsoft Certified Professional Magazine Online
“If you learn well from books, I suggest Core Python Programming. It is by
far the best I’ve found. I’m a Python newbie as well and in three months’
time I’ve been able to implement Python in projects at work (automating
MSOffice, SQL DB stuff, etc.).”
—ptonman, Dev Shed Forums
“Python is simply a beautiful language. It’s easy to learn, it’s cross-platform, and it works. It has achieved many of the technical goals that Java
strives for. A one-sentence description of Python would be: ‘All other languages appear to have evolved over time—but Python was designed.’ And
it was designed well. Unfortunately, there aren’t a large number of books for
Python. The best one I’ve run across so far is Core Python Programming.”
—Chris Timmons, C. R. Timmons Consulting
“If you like the Prentice Hall Core series, another good full-blown treatment to consider would be Core Python Programming. It addresses in elaborate concrete detail many practical topics that get little, if any, coverage in
other books.”
—Mitchell L. Model, MLM Consulting
The Core Series
Visit informit.com/coreseries for a complete list of available publications.
The Core Series is designed to provide you – the experienced programmer –
with the essential information you need to quickly learn and apply the latest,
most important technologies.
Authors in The Core Series are seasoned professionals who have pioneered
the use of these technologies to achieve tangible results in real-world settings.
These experts:
• Share their practical experiences
• Support their instruction with real-world examples
• Provide an accelerated, highly effective path to learning the subject at hand
The resulting book is a no-nonsense tutorial and thorough reference that allows
you to quickly produce robust, production-quality code.
Make sure to connect with us!
informit.com/socialconnect
Core Python Applications Programming
Third Edition
Wesley J. Chun
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City
Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial
capital letters or in all capitals.
The author and publisher have taken care in the preparation of this book, but make no
expressed or implied warranty of any kind and assume no responsibility for errors or
omissions. No liability is assumed for incidental or consequential damages in connection
with or arising out of the use of the information or programs contained herein.
The publisher offers excellent discounts on this book when ordered in quantity for bulk
purchases or special sales, which may include electronic versions and/or custom covers
and content particular to your business, training goals, marketing focus, and branding
interests. For more information, please contact:
U.S. Corporate and Government Sales
(800) 382-3419
[email protected]
For sales outside the United States please contact:
International Sales
[email protected]
Visit us on the Web: informit.com/ph
Library of Congress Cataloging-in-Publication Data
Chun, Wesley.
Core python applications programming / Wesley J. Chun. — 3rd ed.
p. cm.
Rev. ed. of: Core Python programming / Wesley J. Chun. c2007.
Includes index.
ISBN 0-13-267820-9 (pbk. : alk. paper)
1. Python (Computer program language) I. Chun, Wesley. Core Python
programming. II. Title.
QA76.73.P98C48 2012
005.1'17—dc23
2011052903
Copyright © 2012 Pearson Education, Inc.
All rights reserved. Printed in the United States of America. This publication is protected
by copyright, and permission must be obtained from the publisher prior to any prohibited
reproduction, storage in a retrieval system, or transmission in any form or by any means,
electronic, mechanical, photocopying, recording, or likewise. To obtain permission to
use material from this work, please submit a written request to Pearson Education, Inc.,
Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you
may fax your request to (201) 236-3290.
ISBN-13: 978-0-13-267820-9
ISBN-10: 0-13-267820-9
Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor,
Michigan.
First printing, March 2012
To my parents,
who taught me that everybody is different.
And to my wife,
who lives with someone who is different.
CONTENTS
Preface
Acknowledgments
About the Author
Part I General Application Topics
Chapter 1 Regular Expressions
1.1 Introduction/Motivation
1.2 Special Symbols and Characters
1.3 Regexes and Python
1.4 Some Regex Examples
1.5 A Longer Regex Example
1.6 Exercises
Chapter 2 Network Programming
2.1 Introduction
2.2 What Is Client/Server Architecture?
2.3 Sockets: Communication Endpoints
2.4 Network Programming in Python
2.5 *The SocketServer Module
2.6 *Introduction to the Twisted Framework
2.7 Related Modules
2.8 Exercises
Chapter 3 Internet Client Programming
3.1 What Are Internet Clients?
3.2 Transferring Files
3.3 Network News
3.4 E-Mail
3.5 Related Modules
3.6 Exercises
Chapter 4 Multithreaded Programming
4.1 Introduction/Motivation
4.2 Threads and Processes
4.3 Threads and Python
4.4 The thread Module
4.5 The threading Module
4.6 Comparing Single vs. Multithreaded Execution
4.7 Multithreading in Practice
4.8 Producer-Consumer Problem and the Queue/queue Module
4.9 Alternative Considerations to Threads
4.10 Related Modules
4.11 Exercises
Chapter 5 GUI Programming
5.1 Introduction
5.2 Tkinter and Python Programming
5.3 Tkinter Examples
5.4 A Brief Tour of Other GUIs
5.5 Related Modules and Other GUIs
5.6 Exercises
Chapter 6 Database Programming
6.1 Introduction
6.2 The Python DB-API
6.3 ORMs
6.4 Non-Relational Databases
6.5 Related References
6.6 Exercises
Chapter 7 *Programming Microsoft Office
7.1 Introduction
7.2 COM Client Programming with Python
7.3 Introductory Examples
7.4 Intermediate Examples
7.5 Related Modules/Packages
7.6 Exercises
Chapter 8 Extending Python
8.1 Introduction/Motivation
8.2 Extending Python by Writing Extensions
8.3 Related Topics
8.4 Exercises
Part II Web Development
Chapter 9 Web Clients and Servers
9.1 Introduction
9.2 Python Web Client Tools
9.3 Web Clients
9.4 Web (HTTP) Servers
9.5 Related Modules
9.6 Exercises
Chapter 10 Web Programming: CGI and WSGI
10.1 Introduction
10.2 Helping Web Servers Process Client Data
10.3 Building CGI Applications
10.4 Using Unicode with CGI
10.5 Advanced CGI
10.6 Introduction to WSGI
10.7 Real-World Web Development
10.8 Related Modules
10.9 Exercises
Chapter 11 Web Frameworks: Django
11.1 Introduction
11.2 Web Frameworks
11.3 Introduction to Django
11.4 Projects and Apps
11.5 Your “Hello World” Application (A Blog)
11.6 Creating a Model to Add Database Service
11.7 The Python Application Shell
11.8 The Django Administration App
11.9 Creating the Blog’s User Interface
11.10 Improving the Output
11.11 Working with User Input
11.12 Forms and Model Forms
11.13 More About Views
11.14 *Look-and-Feel Improvements
11.15 *Unit Testing
11.16 *An Intermediate Django App: The TweetApprover
11.17 Resources
11.18 Conclusion
11.19 Exercises
Chapter 12 Cloud Computing: Google App Engine
12.1 Introduction
12.2 What Is Cloud Computing?
12.3 The Sandbox and the App Engine SDK
12.4 Choosing an App Engine Framework
12.5 Python 2.7 Support
12.6 Comparisons to Django
12.7 Morphing “Hello World” into a Simple Blog
12.8 Adding Memcache Service
12.9 Static Files
12.10 Adding Users Service
12.11 Remote API Shell
12.12 Lightning Round (with Python Code)
12.13 Sending Instant Messages by Using XMPP
12.14 Processing Images
12.15 Task Queues (Unscheduled Tasks)
12.16 Profiling with Appstats
12.17 The URLfetch Service
12.18 Lightning Round (without Python Code)
12.19 Vendor Lock-In
12.20 Resources
12.21 Conclusion
12.22 Exercises
Chapter 13 Web Services
13.1 Introduction
13.2 The Yahoo! Finance Stock Quote Server
13.3 Microblogging with Twitter
13.4 Exercises
Part III Supplemental/Experimental
Chapter 14 Text Processing
14.1 Comma-Separated Values
14.2 JavaScript Object Notation
14.3 Extensible Markup Language
14.4 References
14.5 Related Modules
14.6 Exercises
Chapter 15 Miscellaneous
15.1 Jython
15.2 Google+
15.3 Exercises
Appendix A Answers to Selected Exercises
Appendix B Reference Tables
Appendix C Python 3: The Evolution of a Programming Language
C.1 Why Is Python Changing?
C.2 What Has Changed?
C.3 Migration Tools
C.4 Conclusion
C.5 References
Appendix D Python 3 Migration with 2.6+
D.1 Python 3: The Next Generation
D.2 Integers
D.3 Built-In Functions
D.4 Object-Oriented Programming: Two Different Class Objects
D.5 Strings
D.6 Exceptions
D.7 Other Transition Tools and Tips
D.8 Writing Code That is Compatible in Both Versions 2.x and 3.x
D.9 Conclusion
Index
PREFACE
Welcome to the Third Edition of Core Python
Applications Programming!
We are delighted that you have engaged us to help you learn Python as
quickly and as deeply as possible. The goal of the Core Python series of
books is not to just teach developers the Python language; we want you
to develop enough of a personal knowledge base to be able to develop
software in any application area.
In our other Core Python offerings, Core Python Programming and Core
Python Language Fundamentals, we not only teach you the syntax of the
Python language, but we also strive to give you in-depth knowledge of
how Python works under the hood. We believe that armed with this
knowledge, you will write more effective Python applications, whether
you’re a beginner to the language or a journeyman (or journeywoman!).
Upon completion of either or any other introductory Python books, you
might be satisfied that you have learned Python and learned it well. By
completing many of the exercises, you’re probably even fairly confident in
your newfound Python coding skills. Still, you might be left wondering,
“Now what? What kinds of applications can I build with Python?” Perhaps you learned Python for a work project that’s constrained to a very
narrow focus. “What else can I build with Python?”
About this Book
In Core Python Applications Programming, you will take all the Python
knowledge gained elsewhere and develop new skills, building up a toolset
with which you’ll be able to use Python for a variety of general applications. These advanced topics chapters are meant as intros or “quick dives”
into a variety of distinct subjects. If you’re moving toward the specific
areas of application development covered by any of these chapters, you’ll
likely discover that they contain more than enough information to get you
pointed in the right direction. Do not expect an in-depth treatment because
that will detract from the breadth-oriented treatment that this book is
designed to convey.
Like all other Core Python books, throughout this one, you will find
many examples that you can try right in front of your computer. To hammer the concepts home, you will also find fun and challenging exercises at
the end of every chapter. These easy and intermediate exercises are meant
to test your learning and push your Python skills. There simply is no substitute for hands-on experience. We believe you should not only pick up
Python programming skills but also be able to master them in as short a
time period as possible.
Because the best way for you to extend your Python skills is through
practice, you will find these exercises to be one of the greatest strengths of
this book. They will test your knowledge of chapter topics and definitions
as well as motivate you to code as much as possible. There is no more effective
way to improve your skills than by building applications.
You will find easy, intermediate, and difficult problems to solve. It is also
here that you might need to write one of those “large” applications that
many readers wanted to see in the book, but rather than scripting
them—which frankly doesn’t do you all that much good—you gain by
jumping right in and doing it yourself. Appendix A, “Answers to Selected
Exercises,” features answers to selected problems from each chapter. As
with the second edition, you’ll find useful reference tables collated in
Appendix B, “Reference Tables.”
I’d like to personally thank all readers for your feedback and encouragement. You’re the reason why I go through the effort of writing these books.
I encourage you to keep sending your feedback and help us make a fourth
edition possible, and even better than its predecessors!
Who Should Read This Book?
This book is meant for anyone who already knows some Python but wants
to know more and expand their application development skillset.
Python is used in many fields, including engineering, information technology, science, business, entertainment, and so on. This means that the list
of Python users (and readers of this book) includes but is not limited to
• Software engineers
• Hardware design/CAD engineers
• QA/testing and automation framework developers
• IS/IT/system and network administrators
• Scientists and mathematicians
• Technical or project management staff
• Multimedia or audio/visual engineers
• SCM or release engineers
• Web masters and content management staff
• Customer/technical support engineers
• Database engineers and administrators
• Research and development engineers
• Software integration and professional services staff
• Collegiate and secondary educators
• Web service engineers
• Financial software engineers
• And many others!
Some of the most famous companies that use Python include Google,
Yahoo!, NASA, Lucasfilm/Industrial Light and Magic, Red Hat, Zope, Disney,
Pixar, and Dreamworks.
The Author and Python
I discovered Python over a decade ago at a company called Four11. At the
time, the company had one major product, the Four11.com White Page
directory service. Python was being used to design its next product: the
Rocketmail Web-based e-mail service that would eventually evolve into
what today is Yahoo! Mail.
It was fun learning Python and being on the original Yahoo! Mail engineering team. I helped re-design the address book and spell checker. At
the time, Python also became part of a number of other Yahoo! sites,
including People Search, Yellow Pages, and Maps and Driving Directions,
just to name a few. In fact, I was the lead engineer for People Search.
Although Python was new to me then, it was fairly easy to pick
up—much simpler than other languages I had learned in the past. The
scarcity of textbooks at the time led me to use the Library Reference and
Quick Reference Guide as my primary learning tools; it was also a driving
motivation for the book you are reading right now.
Since my days at Yahoo!, I have been able to use Python in all sorts of
interesting ways at the jobs that followed. In each case, I was able to harness the power of Python to solve the problems at hand, in a timely manner. I have also developed several Python courses and have used this book
to teach those classes—truly eating my own dogfood.
Not only are the Core Python books great learning devices, but they’re
also among the best tools with which to teach Python. As an engineer, I
know what it takes to learn, understand, and apply a new technology. As a
professional instructor, I also know what is needed to deliver the most effective
sessions for clients. These books provide the experience necessary to be able
to give you real-world analogies and tips that you cannot get from someone who is “just a trainer” or “just a book author.”
What to Expect of the Writing Style:
Technical, Yet Easy Reading
My instructional experience has taught me that, rather than a strictly “beginners” book or a pure, hard-core computer science reference book,
an easy-to-read yet technically oriented book serves the purpose
best, which is to get you up to speed on Python as quickly as possible so
that you can apply it to your tasks posthaste. We will introduce concepts
coupled with appropriate examples to expedite the learning process. At the
end of each chapter you will find numerous exercises to reinforce some of
the concepts and ideas acquired in your reading.
We are thrilled and humbled to be compared with Bruce Eckel’s writing
style (see the reviews to the first edition at the book’s Web site,
http://corepython.com). This is not a dry college textbook. Our goal is to have a
conversation with you, as if you were attending one of my well-received
Python training courses. As a lifelong student, I constantly put myself in
my students’ shoes and tell you what you need to hear in order to learn
the concepts as quickly and as thoroughly as possible. You will find reading this book fast and easy, without losing sight of the technical details.
As an engineer, I know what I need to tell you in order to teach you a
concept in Python. As a teacher, I can take technical details and boil them
down into language that is easy to understand and grasp right away. You
are getting the best of both worlds with my writing and teaching styles,
but you will enjoy programming in Python even more.
Thus, you’ll notice that even though I’m the sole author, I use the “first-person plural” writing structure; that is to say, I use verbiage such as “we”
and “us” and “our,” because in the grand scheme of this book, we’re all in
this together, working toward the goal of expanding the Python programming universe.
About This Third Edition
At the time the first edition of this book was published, Python was entering its second era with the release of version 2.0. Since then, the language
has undergone significant improvements that have contributed to the
overall continued success, acceptance, and growth in the use of the language. Deficiencies have been removed and new features added that bring
a new level of power and sophistication to Python developers worldwide.
The second edition of the book came out in 2006, at the height of Python’s
ascendance, during the time of its most popular release to date, 2.5.
The second edition was released to rave reviews and ended up outselling the first edition. Python itself has won numerous accolades since that
time as well, including the following:
• Tiobe (www.tiobe.com)
– Language of the Year (2007, 2010)
• LinuxJournal (linuxjournal.com)
– Favorite Programming Language (2009–2011)
– Favorite Scripting Language (2006–2008, 2010, 2011)
• LinuxQuestions.org Members Choice Awards
– Language of the Year (2007–2010)
These awards and honors have helped propel Python even further.
Now it’s on its next generation with Python 3. Likewise, Core Python Programming is moving towards its “third generation,” too, as I’m exceedingly
pleased that Prentice Hall has asked me to develop this third edition.
Because version 3.x is backward-incompatible with Python 1 and 2, it will
take some time before it is universally adopted and integrated into industry. We are happy to guide you through this transition. The code in this
edition will be presented in both Python 2 and 3 (as appropriate—not
everything has been ported yet). We’ll also discuss various tools and practices when porting.
The changes brought about in version 3.x continue the trend of iterating
and improving the language, taking a larger step toward removing some
of its last major flaws, and representing a bigger jump in the continuing
evolution of the language. Similarly, the structure of the book is also making a rather significant transition. Due to its size and scope, Core Python
Programming as it has existed wouldn’t be able to handle all the new material introduced in this third edition.
Therefore, Prentice Hall and I have decided the best way of moving forward is to take that logical division represented by Parts I and II of the previous editions, representing the core language and advanced applications
topics, respectively, and divide the book into two volumes at this juncture.
You are holding in your hands (perhaps in eBook form) the second half of
the third edition of Core Python Programming. The good news is that the
first half is not required in order to make use of the rich amount of content
in this volume. We only recommend that you have intermediate Python
experience. If you’ve learned Python recently and are fairly comfortable
with using it, or have existing Python skills and want to take it to the next
level, then you’ve come to the right place!
As existing Core Python Programming readers already know, my primary
focus is teaching you the core of the Python language in a comprehensive manner, much more than just its syntax (which you don’t really need
a book to learn, right?). Knowing more about how Python works under
the hood—including the relationship between data objects and memory
management—will make you a much more effective Python programmer
right out of the gate. This is what Part I, and now Core Python Language
Fundamentals, is all about.
As with all editions of this book, I will continue to update the book’s
Web site and my blog with updates, downloads, and other related articles
to keep this publication as contemporary as possible, regardless of which
new release of Python you have migrated to.
For existing readers, the new topics we have added to this edition include:
• Web-based e-mail examples (Chapter 3)
• Using Tile/Ttk (Chapter 5)
• Using MongoDB (Chapter 6)
• More significant Outlook and PowerPoint examples (Chapter 7)
• Web server gateway interface (WSGI) (Chapter 10)
• Using Twitter (Chapter 13)
• Using Google+ (Chapter 15)
In addition, we are proud to introduce three brand new chapters to the
book: Chapter 11, “Web Frameworks: Django,” Chapter 12, “Cloud Computing: Google App Engine,” and Chapter 14, “Text Processing.” These represent new or ongoing areas of application development for which Python
is used quite often. All existing chapters have been refreshed and updated
to the latest versions of Python, possibly including new material. Take a
look at the chapter guide that follows for more details on what to expect
from every part of this volume.
Chapter Guide
This book is divided into three parts. The first part, which takes up about
two-thirds of the text, gives you treatment of the “core” members of any
application development toolset (with Python being the focus, of course).
The second part concentrates on a variety of topics, all tied to Web programming. The book concludes with the supplemental section which provides experimental chapters that are under development and hopefully
will grow into independent chapters in future editions.
All three parts provide a set of various advanced topics to show what
you can build by using Python. We are certainly glad that we were at least
able to provide you with a good introduction to many of the key areas of
Python development including some of the topics mentioned previously.
Following is a more in-depth, chapter-by-chapter guide.
Part I: General Application Topics
Chapter 1—Regular Expressions
Regular expressions are a powerful tool that you can use for pattern
matching, extracting, and search-and-replace functionality.
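As a quick taste of those three uses, here is a minimal sketch that relies only on the standard re module. The sample text, the deliberately simplified address pattern, and the variable names are our own illustrative assumptions, not examples taken from Chapter 1.

```python
import re

# Illustrative sample text (an assumption for this sketch).
text = "Contact alice@example.com or bob@example.org for details."

# A deliberately simple pattern: a user name, an "@", and a two-part domain.
pattern = re.compile(r"(\w+)@(\w+\.\w+)")

# Pattern matching: find the first substring shaped like an address.
first = pattern.search(text)

# Extracting: collect every (user, domain) pair found in the text.
pairs = pattern.findall(text)

# Search-and-replace: mask the user portion, keeping the domain (group 2).
masked = pattern.sub(r"***@\2", text)

print(first.group(0))
print(pairs)
print(masked)
```

Real e-mail addresses are far more varied than this pattern allows; the point is only to show search(), findall(), and sub() side by side.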
Chapter 2—Network Programming
So many applications today need to be network oriented. In this chapter, you
learn to create clients and servers using TCP/IP and UDP/IP as well as get an
introduction to SocketServer and Twisted.
Chapter 3—Internet Client Programming
Most Internet protocols in use today were developed using sockets. In
Chapter 3, we explore some of those higher-level libraries that are used to
build clients of these Internet protocols. In particular, we focus on file
transfer (FTP), the Usenet news protocol (NNTP), and a variety of e-mail
protocols (SMTP, POP3, IMAP4).
Chapter 4—Multithreaded Programming
Multithreaded programming is one way to improve the execution performance of many types of applications by introducing concurrency. This
chapter ends the drought of written documentation on how to implement
threads in Python by explaining the concepts and showing you how to
correctly build a Python multithreaded application and what the best use
cases are.
Chapter 5—GUI Programming
Based on the Tk graphical toolkit, Tkinter (renamed to tkinter in Python 3)
is Python’s default GUI development library. We introduce Tkinter to you
by showing you how to build simple GUI applications. One of the best
ways to learn is to copy, and by building on top of some of these applications, you will be on your way in no time. We conclude the chapter by taking a brief look at other graphical libraries, such as Tix, Pmw, wxPython,
PyGTK, and Ttk/Tile.
Chapter 6—Database Programming
Python helps simplify database programming, as well. We first review
basic concepts and then introduce you to the Python database application
programmer’s interface (DB-API). We then show you how you can connect
to a relational database and perform queries and operations by using
Python. If you prefer a hands-off approach that avoids writing Structured Query
Language (SQL) yourself and want to just work with objects without having to
worry about the underlying database layer, we have object-relational mappers (ORMs) just for that purpose. Finally, we introduce you to the world
of non-relational databases, experimenting with MongoDB as our NoSQL
example.
Chapter 7—Programming Microsoft Office
Like it or not, we live in a world where we will likely have to interact with
Microsoft Windows-based PCs. It might be intermittent or something we
have to deal with on a daily basis, but regardless of how much exposure
we face, the power of Python can be used to make our lives easier. In this
chapter, we explore COM Client programming by using Python to control
and communicate with Office applications, such as Word, Excel, PowerPoint, and Outlook. Although experimental in the previous edition, we’re
glad we were able to add enough material to turn this into a standalone
chapter.
Chapter 8—Extending Python
We mentioned earlier how powerful it is to be able to reuse code and
extend the language. In pure Python, these extensions are modules and
packages, but you can also develop lower-level code in C/C++, C#, or Java.
Those extensions then can interface with Python in a seamless fashion.
Writing your extensions in a lower-level programming language gives you
added performance and some security (because the source code does not
have to be revealed). This chapter walks you step-by-step through the
extension building process using C.
Part II: Web Development
Chapter 9—Web Clients and Servers
Extending our discussion of client-server architecture in Chapter 2, we apply
this concept to the Web. In this chapter, we not only look at clients, but also
explore a variety of Web client tools, parsing Web content, and finally, we
introduce you to customizing your own Web servers in Python.
Chapter 10—Web Programming: CGI and WSGI
The main job of Web servers is to take client requests and return results.
But how do servers get that data? Because they’re really only good at
returning results, they generally do not have the capabilities or logic necessary to do so; the heavy lifting is done elsewhere. CGI gives servers the
ability to spawn another program to do this processing and has historically been the solution, but it doesn’t scale and is thus not really used in
practice; however, its concepts still apply, regardless of what framework(s)
you use, so we’ll spend most of the chapter learning CGI. You will also
learn how WSGI helps application developers by providing them a common programming interface. In addition, you’ll see how WSGI helps
framework developers who have to connect to Web servers on one side
and application code on the other so that application developers can write
code without having to worry about the execution platform.
Chapter 11—Web Frameworks: Django
Python features a host of Web frameworks with Django being one of the
most popular. In this chapter, you get an introduction to this framework
and learn how to write simple Web applications. With this knowledge,
you can then explore other Web frameworks as you wish.
Chapter 12—Cloud Computing: Google App Engine
Cloud computing is taking the industry by storm. While the world is most
familiar with infrastructure services like Amazon’s AWS and online applications such as Gmail and Yahoo! Mail, platforms present a powerful alternative that takes advantage of infrastructure without user involvement but
give more flexibility than cloud software because you control the application
and its code. In this chapter, you get a comprehensive introduction to the first
platform service using Python, Google App Engine. With the knowledge
gained here, you can then explore similar services in the same space.
Chapter 13—Web Services
In this chapter, we explore higher-level services on the Web (using HTTP).
We look at an older service (Yahoo! Finance) and a newer one (Twitter).
You learn how to interact with both of these services by using Python as
well as knowledge you’ve gained from earlier chapters.
Part III: Supplemental/Experimental
Chapter 14—Text Processing
Our first supplemental chapter introduces you to text processing using
Python. We first explore CSV, then JSON, and finally XML. In the last part
of this chapter, we take our client/server knowledge from earlier in the
book and combine it with XML to look at how you can create online remote
procedure call (RPC) services by using XML-RPC.
Chapter 15—Miscellaneous
This chapter consists of bonus material that we will likely develop into
full, individual chapters in the next edition. Topics covered here include
Java/Jython and Google+.
Conventions
All program output and source code are in monospaced font. Python keywords appear in Bold-monospaced font. Lines of output with three leading
greater than signs (>>>) represent the Python interpreter prompt. A leading asterisk (*) in front of a chapter, section, or exercise, indicates that this
is advanced and/or optional material.
Represents Core Notes
Represents Core Module
Represents Core Tips
2.5
New features to Python are highlighted with this icon, with the number representing version(s) of Python in which the features first
appeared.
Book Resources
We welcome any and all feedback—the good, the bad, and the ugly. If you
have any comments, suggestions, kudos, complaints, bugs, questions, or
anything at all, feel free to contact me at [email protected]
You will find errata, source code, updates, upcoming talks, Python training, downloads, and other information at the book’s Web site located at:
http://corepython.com. You can also participate in the community discussion
around the “Core Python” books at their Google+ page, which is
located at: http://plus.ly/corepython.
ACKNOWLEDGMENTS
Acknowledgments for the Third Edition
Reviewers and Contributors
Gloria Willadsen (lead reviewer)
Martin Omander (reviewer and also coauthor of Chapter 11, “Web
Frameworks: Django,” creator of the TweetApprover application, and
coauthor of Section 15.2, “Google+,” in Chapter 15, “Miscellaneous”).
Darlene Wong
Bryce Verdier
Eric Walstad
Paul Bissex (coauthor of Python Web Development with Django)
Johan “proppy” Euphrosine
Anthony Vallone
Inspiration
My wife Faye, who has continued to amaze me by being able to run the
household, take care of the kids and their schedule, feed us all, handle the
finances, and be able to do this while I’m off on the road driving cloud
adoption or under foot at home, writing books.
Editorial
Mark Taub (Editor-in-Chief)
Debra Williams Cauley (Acquisitions Editor)
John Fuller (Managing Editor)
Elizabeth Ryan (Project Editor)
Bob Russell, Octal Publishing, Inc. (Copy Editor)
Dianne Russell, Octal Publishing, Inc. (Production and Management Services)
Acknowledgments for the Second Edition
Reviewers and Contributors
Shannon -jj Behrens (lead reviewer)
Michael Santos (lead reviewer)
Rick Kwan
Lindell Aldermann (coauthor of the Unicode section in Chapter 6)
Wai-Yip Tung (coauthor of the Unicode example in Chapter 20)
Eric Foster-Johnson (coauthor of Beginning Python)
Alex Martelli (editor of Python Cookbook and author of Python in a Nutshell)
Larry Rosenstein
Jim Orosz
Krishna Srinivasan
Chuck Kung
Inspiration
My wonderful children and pet hamster.
Acknowledgments for the First Edition
Reviewers and Contributors
Guido van Rossum (creator of the Python language)
Dowson Tong
James C. Ahlstrom (coauthor of Internet Programming with Python)
S. Candelaria de Ram
Cay S. Horstmann (coauthor of Core Java and Core JavaServer Faces)
Michael Santos
Greg Ward (creator of distutils package and its documentation)
Vincent C. Rubino
Martijn Faassen
Emile van Sebille
Raymond Tsai
Albert L. Anders (coauthor of MT Programming chapter)
Fredrik Lundh (author of Python Standard Library)
Cameron Laird
Fred L. Drake, Jr. (coauthor of Python & XML and editor of the official
Python documentation)
Jeremy Hylton
Steve Yoshimoto
Aahz Maruch (author of Python for Dummies)
Jeffrey E. F. Friedl (author of Mastering Regular Expressions)
Pieter Claerhout
Catriona (Kate) Johnston
David Ascher (coauthor of Learning Python and editor of Python Cookbook)
Reg Charney
Christian Tismer (creator of Stackless Python)
Jason Stillwell
and my students at UC Santa Cruz Extension
Inspiration
I would like to extend my great appreciation to James P. Prior, my high
school programming teacher.
To Louise Moser and P. Michael Melliar-Smith (my graduate thesis advisors at The University of California, Santa Barbara), you have my deepest
gratitude.
Thanks to Alan Parsons, Eric Woolfson, Andrew Powell, Ian Bairnson, Stuart
Elliott, David Paton, all other Project participants, and fellow Projectologists
and Roadkillers (for all the music, support, and good times).
I would like to thank my family, friends, and the Lord above, who have kept
me safe and sane during this crazy period of late nights and abandonment,
on the road and off. I want to also give big thanks to all those who
believed in me for the past two decades (you know who you are!)—I
couldn’t have done it without you.
Finally, I would like to thank you, my readers, and the Python community
at large. I am excited at the prospect of teaching you Python and hope that
you enjoy your travels with me on this, our third journey.
Wesley J. Chun
Silicon Valley, CA
(It’s not so much a place as it is a state of sanity.)
October 2001; updated July 2006,
March 2009, March 2012
ABOUT THE AUTHOR
Wesley Chun was initiated into the world of computing during high
school, using BASIC and 6502 assembly on Commodore systems. This was
followed by Pascal on the Apple IIe, and then ForTran on punch cards. It
was the last of these that made him a careful/cautious developer, because
sending the deck out to the school district’s mainframe and getting the
results was a one-week round-trip process. Wesley also converted the
journalism class from typewriters to Osborne 1 CP/M computers. He got
his first paying job as a student-instructor teaching BASIC programming to
fourth, fifth, and sixth graders and their parents.
After high school, Wesley went to University of California at Berkeley
as a California Alumni Scholar. He graduated with an AB in applied math
(computer science) and a minor in music (classical piano). While at Cal, he
coded in Pascal, Logo, and C. He also took a tutoring course that featured
videotape training and psychological counseling. One of his summer
internships involved coding in a 4GL and writing a “Getting Started” user
manual. He then continued his studies several years later at University of
California, Santa Barbara, receiving an MS in computer science (distributed
systems). While there, he also taught C programming. A paper based on his
master’s thesis was nominated for Best Paper at the 29th HICSS conference,
and a later version appeared in the University of Singapore’s Journal of High
Performance Computing.
Wesley has been in the software industry since graduating and has continued to teach and write, publishing several books and delivering hundreds of conference talks and tutorials, plus Python courses, both to the
public as well as private corporate training. Wesley’s Python experience
began with version 1.4 at a startup where he designed the Yahoo! Mail
spellchecker and address book. He then became the lead engineer for
Yahoo! People Search. After leaving Yahoo!, he wrote the first edition of
this book and then traveled around the world. Since returning, he has
used Python in a variety of ways, from local product search, anti-spam
and antivirus e-mail appliances, and Facebook games/applications to
something completely different: software for doctors to perform spinal
fracture analysis.
In his spare time, Wesley enjoys piano, bowling, basketball, bicycling,
ultimate frisbee, poker, traveling, and spending time with his family. He
volunteers for Python users groups, the Tutor mailing list, and PyCon.
He also maintains the Alan Parsons Project Monster Discography. If you
think you’re a fan but don’t have “Freudiana,” you had better find it! At
the time of this writing, Wesley was a Developer Advocate at Google, representing its cloud products. He is based in Silicon Valley, and you can follow him at @wescpy or plus.ly/wescpy.
PART I: General Application Topics
CHAPTER 1: Regular Expressions
Some people, when confronted with a problem, think, “I know, I’ll
use regular expressions.” Now they have two problems.
—Jamie “jwz” Zawinski, August 1997
In this chapter...
• Introduction/Motivation
• Special Symbols and Characters
• Regexes and Python
• Some Regex Examples
• A Longer Regex Example
1.1 Introduction/Motivation
Manipulating text or data is a big thing. If you don’t believe me, look very
carefully at what computers primarily do today. Word processing, “fill-out-form” Web pages, streams of information coming from a database
dump, stock quote information, news feeds—the list goes on and on.
Because we might not know the exact text or data that we have programmed our machines to process, it becomes advantageous to be able to
express it in patterns that a machine can recognize and take action upon.
If I were running an e-mail archiving company, and you, as one of my
customers, requested all of the e-mail that you sent and received last February, for example, it would be nice if I could set a computer program to
collate and forward that information to you, rather than having a human
being read through your e-mail and process your request manually. You
would be horrified (and infuriated) that someone would be rummaging
through your messages, even if that person were supposed to be looking
only at time stamps. Another example request might be to look for a subject
line like “ILOVEYOU,” indicating a virus-infected message, and remove
those e-mail messages from your personal archive. So this raises the question of how we can program machines with the ability to look for patterns
in text.
Regular expressions provide such an infrastructure for advanced text pattern matching, extraction, and/or search-and-replace functionality. To put
it simply, a regular expression (a.k.a. a “regex” for short) is a string that uses
special symbols and characters to indicate pattern repetition or to represent multiple characters so that they can “match” a set of strings with similar characteristics described by the pattern (Figure 1-1). In other words,
they enable matching of multiple strings—a regex pattern that matched
only one string would be rather boring and ineffective, wouldn’t you say?
Python supports regexes through the standard library re module. In
this introductory subsection, we will give you a brief introduction. Due to its brevity, only the most common aspects of regexes used
in everyday Python programming will be covered. Your experience will,
of course, vary. We highly recommend reading any of the official supporting documentation as well as external texts on this interesting subject. You
will never look at strings in the same way again!
Chapter 1 • Regular Expressions
[Figure 1-1: Regular Expression Engine]
Figure 1-1 You can use regular expressions, such as the one here, which recognizes valid Python
identifiers. [A-Za-z]\w+ means the first character should be alphabetic, that is, either A–Z or a–z,
followed by at least one (+) alphanumeric character (\w). In our filter, notice how many strings go
into the filter, but the only ones to come out are the ones we asked for via the regex. One
example that did not make it was “4xZ” because it starts with a number.
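The filtering idea in Figure 1-1 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: Python 3.4 or newer is assumed (for fullmatch()), and every candidate string other than “4xZ” is invented for the example.

```python
import re

# The pattern from Figure 1-1: one letter, then at least one \w character.
identifier_like = re.compile(r"[A-Za-z]\w+")

# Hypothetical inputs; only "4xZ" comes from the figure caption.
candidates = ["foo", "4xZ", "bar_9", "x", "_private"]

# fullmatch() succeeds only when the whole string fits the pattern.
accepted = [s for s in candidates if identifier_like.fullmatch(s)]
print(accepted)  # ['foo', 'bar_9']
```

Notice that “x” and “_private” are filtered out even though Python accepts both as identifiers, so this pattern is best read as a simplification rather than a complete identifier test.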
CORE NOTE: Searching vs. matching
Throughout this chapter, you will find references to searching and matching.
When we are strictly discussing regular expressions with respect to patterns in
strings, we will say “matching,” referring to the term pattern-matching. In Python
terminology, there are two main ways to accomplish pattern-matching:
searching, that is, looking for a pattern match in any part of a string; and matching,
that is, attempting to match a pattern starting at the beginning of a string. Searches are accomplished by using the search() function or method, and
matching is done with the match() function or method. In summary, we keep
the term “matching” universal when referencing patterns, and we differentiate
between “searching” and “matching” in terms of how Python accomplishes
pattern-matching.
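The distinction drawn in this note can be seen with a small sketch. The sample string “seafood” is our own choice, and Python 3 is assumed for the print() calls.

```python
import re

text = "seafood"

# match() succeeds only if the pattern is found at the very start of the string.
at_start = re.match("foo", text)

# search() scans forward and succeeds at the first position where the pattern fits.
anywhere = re.search("foo", text)

print(at_start)          # None, because "seafood" does not begin with "foo"
print(anywhere.group())  # foo
print(anywhere.span())   # (3, 6)
```

Both calls return None on failure and a match object on success, so the result can be tested directly in an if statement.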
1.1.1 Your First Regular Expression
As we mentioned earlier, regexes are strings containing text and special
characters that describe a pattern with which to recognize multiple strings.
We also briefly discussed a regular expression alphabet. For general text, the
alphabet used for regular expressions is the set of all uppercase and lowercase letters plus numeric digits. Specialized alphabets are also possible; for
instance, you can have one consisting of only the characters “0” and “1.”
The set of all strings over this alphabet describes all binary strings, that is,
“0,” “1,” “00,” “01,” “10,” “11,” “100,” etc.
Let’s look at the most basic of regular expressions now to show you that
although regexes are sometimes considered an advanced topic, they can
also be rather simple. Using the standard alphabet for general text, we
present some simple regexes and the strings that their patterns describe.
The following regular expressions are the most basic, “true vanilla,” as it
were. They simply consist of a string pattern that matches only one string:
the string defined by the regular expression. We now present the regexes
followed by the strings that match them:
Regex Pattern     String(s) Matched
foo               foo
Python            Python
abc123            abc123
The first regular expression pattern from the above chart is “foo.” This
pattern contains no special symbols, so it matches only the characters it
spells out; the only string that matches this pattern is the string “foo.”
The same thing applies to “Python” and “abc123.” The power of regular
expressions comes in when special characters are used to define character
sets, subgroup matching, and pattern repetition. It is these special symbols
that allow a regex to match a set of strings rather than a single one.
1.2 Special Symbols and Characters
We will now introduce the most popular of the special characters and symbols, known as metacharacters, which give regular expressions their power
and flexibility. You will find the most common of these symbols and characters in Table 1-1.
Table 1-1 Common Regular Expression Symbols and Special Characters
Notation          Description                                           Example Regex
Symbols
literal           Match literal string value literal                    foo
re1|re2           Match regular expressions re1 or re2                  foo|bar
.                 Match any character (except \n)                       b.b
^                 Match start of string                                 ^Dear
$                 Match end of string                                   /bin/*sh$
*                 Match 0 or more occurrences of preceding regex        [A-Za-z0-9]*
+                 Match 1 or more occurrences of preceding regex        [a-z]+\.com
?                 Match 0 or 1 occurrence(s) of preceding regex         goo?
{N}               Match N occurrences of preceding regex                [0-9]{3}
{M,N}             Match from M to N occurrences of preceding regex      [0-9]{5,9}
[...]             Match any single character from character class       [aeiou]
[..x-y..]         Match any single character in the range from x to y   [0-9],[A-Za-z]
[^...]            Do not match any character from character class,      [^aeiou], [^A-Za-z0-9_]
                  including any ranges, if present
(*|+|?|{})?       Apply “non-greedy” versions of above                  .*?[a-z]
                  occurrence/repetition symbols (*, +, ?, {})
(...)             Match enclosed regex and save as subgroup             ([0-9]{3})?, f(oo|u)bar
Special Characters
\d                Match any decimal digit, same as [0-9] (\D is         data\d+.txt
                  inverse of \d: do not match any numeric digit)
\w                Match any alphanumeric character, same as             [A-Za-z_]\w+
                  [A-Za-z0-9_] (\W is inverse of \w)
\s                Match any whitespace character, same as               of\sthe
                  [ \n\t\r\v\f] (\S is inverse of \s)
\b                Match any word boundary (\B is inverse of \b)         \bThe\b
\N                Match saved subgroup N (see (...) above)              price: \16
\c                Match any special character c verbatim (i.e.,         \., \\, \*
                  without its special meaning, literal)
\A (\Z)           Match start (end) of string (also see ^ and $         \ADear
                  above)
Extension Notation
(?iLmsux)         Embed one or more special “flags” parameters          (?x), (?im)
                  within the regex itself (vs. via function/method)
(?:...)           Signifies a group whose match is not saved            (?:\w+\.)*
(?P<name>...)     Like a regular group match only identified with       (?P<data>)
                  name rather than a numeric ID
(?P=name)         Matches text previously grouped by (?P<name>) in      (?P=data)
                  the same string
(?#...)           Specifies a comment, all contents within ignored      (?#comment)
(?=...)           Matches if ... comes next without consuming input     (?=.com)
                  string; called positive lookahead assertion
(?!...)           Matches if ... doesn’t come next without consuming    (?!.net)
                  input; called negative lookahead assertion
(?<=...)          Matches if ... comes prior without consuming input    (?<=800-)
                  string; called positive lookbehind assertion
(?<!...)          Matches if ... doesn’t come prior without consuming   (?<!192\.168\.)
                  input; called negative lookbehind assertion
(?(id/name)Y|N)   Conditional match of regex Y if group with given      (?(1)y|x)
                  id or name exists else N; |N is optional
1.2.1 Matching More Than One Regex Pattern with Alternation (|)
The pipe symbol (|), a vertical bar on your keyboard, indicates an
alternation operation. It is used to separate different regular expressions.
For example, the following are some patterns that employ alternation,
along with the strings they match:
Regex Pattern     Strings Matched
at|home           at, home
r2d2|c3po         r2d2, c3po
bat|bet|bit       bat, bet, bit
With this one symbol, we have just increased the flexibility of our regular
expressions, enabling the matching of more than just one string. Alternation is also sometimes called union or logical OR.
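As a minimal sketch of alternation in practice, the following filters a list of words with the pattern bat|bet|bit. The word list is hypothetical, and fullmatch() assumes Python 3.4 or newer.

```python
import re

pattern = "bat|bet|bit"

# Hypothetical candidates; "but" and "bot" are not among the alternatives.
candidates = ["bat", "bet", "bit", "but", "bot"]

# Keep only the candidates that one of the alternatives matches in full.
matched = [s for s in candidates if re.fullmatch(pattern, s)]
print(matched)  # ['bat', 'bet', 'bit']
```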
1.2.2 Matching Any Single Character (.)
The dot or period (.) symbol matches any single character except for \n.
(Python regexes have a compilation flag [S or DOTALL], which can override
this to include \ns.) Whether letter, number, whitespace (not including
“\n”), printable, non-printable, or a symbol, the dot can match them all.
Regex Pattern     Strings Matched
f.o               Any character between “f” and “o”; for example, fao, f9o, f#o, etc.
..                Any pair of characters
.end              Any character before the string end
Q: What if I want to match the dot or period character?
A: To specify a dot character explicitly, you must escape its functionality
with a backslash, as in “\.”.
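The behavior of the dot, the DOTALL flag, and the escaped dot can be checked with a short sketch. The test strings here are our own, chosen only for illustration.

```python
import re

# The dot matches any single character except a newline.
digit_between = bool(re.match("f.o", "f9o"))      # True
newline_between = bool(re.match("f.o", "f\no"))   # False

# With the DOTALL (S) flag, the dot also matches a newline.
newline_dotall = bool(re.match("f.o", "f\no", re.DOTALL))  # True

# An escaped dot matches only a literal period.
literal_dot = bool(re.match(r"3\.14", "3.14"))    # True
not_a_dot = bool(re.match(r"3\.14", "3x14"))      # False

# Left unescaped, the dot acts as a wildcard and accepts the "x".
wildcard_dot = bool(re.match("3.14", "3x14"))     # True

print(digit_between, newline_between, newline_dotall)
print(literal_dot, not_a_dot, wildcard_dot)
```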
1.2.3 Matching from the Beginning or End of Strings or Word Boundaries (^, $, \b, \B)
There are also symbols and related special characters to specify searching
for patterns at the beginning and end of strings. To match a pattern starting from the beginning, you must use the caret symbol (^) or the special
character \A (backslash-capital “A”). The latter is primarily for keyboards
that do not have the caret symbol (for instance, an international keyboard). Similarly, the dollar sign ($) or \Z will match a pattern from the
end of a string.
Patterns that use these symbols differ from most of the others we
describe in this chapter because they dictate location or position. In the
previous Core Note, we noted that a distinction is made between matching
(attempting matches starting at the beginning of a string) and searching (attempting matches from anywhere within a string). With that said,
here are some examples of “edge-bound” regex search patterns:
Regex Pattern     Strings Matched
^From             Any string that starts with From
/bin/tcsh$        Any string that ends with /bin/tcsh
^Subject: hi$     Any string consisting solely of the string Subject: hi
Again, if you want to match either (or both) of these characters verbatim, you must use an escaping backslash. For example, if you wanted to
match any string that ended with a dollar sign, one possible regex solution
would be the pattern .*\$$.
The special characters \b and \B pertain to word boundary matches. The
difference between them is that \b will match a pattern to a word boundary, meaning that a pattern must be at the beginning of a word, whether
there are any characters in front of it (word in the middle of a string) or not
(word at the beginning of a line). And likewise, \B will match a pattern
only if it appears starting in the middle of a word (i.e., not at a word
boundary). Here are some examples:
Regex Pattern     Strings Matched
the               Any string containing the
\bthe             Any word that starts with the
\bthe\b           Matches only the word the
\Bthe             Any string that contains but does not begin with the
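The edge-bound and word-boundary patterns above can be tried out as follows. This is a sketch with sample strings of our own; the patterns containing backslashes are written as raw strings (r"...") so that Python does not interpret \b as a backspace character.

```python
import re

# ^ anchors a pattern to the start of the string, $ to the end.
starts_with_from = bool(re.search("^From", "From: someone"))
from_in_middle = bool(re.search("^From", "Not From: someone"))
ends_with_tcsh = bool(re.search("/bin/tcsh$", "shell is /bin/tcsh"))
ends_with_dollar = bool(re.search(r".*\$$", "total: 5$"))

sample = "the theory bathe other"

# \b requires a word boundary at that position; \B requires that there is none.
words_starting_with_the = re.findall(r"\bthe\w*", sample)
whole_word_the = re.findall(r"\bthe\b", sample)
the_inside_a_word = re.findall(r"\w*\Bthe\w*", sample)

print(starts_with_from, from_in_middle, ends_with_tcsh, ends_with_dollar)
print(words_starting_with_the)  # ['the', 'theory']
print(whole_word_the)           # ['the']
print(the_inside_a_word)        # ['bathe', 'other']
```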
1.2.4 Creating Character Classes ([])
Whereas the dot is good for allowing matches of any symbols, there might
be occasions for which there are specific characters that you want to
match. For this reason, the bracket symbols ([]) were invented. The regular expression will match any of the enclosed characters. Here are some
examples:
Regex Pattern     Strings Matched
b[aeiu]t          bat, bet, bit, but
[cr][23][dp][o2]  A string of four characters: first is “c” or “r,” then “2” or “3,”
                  followed by “d” or “p,” and finally, either “o” or “2.” For example,
                  c2do, r3p2, r2d2, c3po, etc.
One side note regarding the regex [cr][23][dp][o2]—a more restrictive version of this regex would be required to allow only “r2d2” or
“c3po” as valid strings. Because brackets merely imply logical OR functionality, it is not possible to use brackets to enforce such a requirement.
The only solution is to use the pipe, as in r2d2|c3po.
For single-character regexes, though, the pipe and brackets are equivalent. For example, let’s start with the regular expression “ab,” which
matches only the string with an “a” followed by a “b.” If we wanted either
a one-letter string, for instance, either “a” or a “b,” we could use the regex
[ab]. Because “a” and “b” are individual strings, we can also choose the
regex a|b. However, if we wanted to match either the string “ab” or the
string “cd,” we cannot use the brackets because they work
only for single characters. In this case, the only solution is ab|cd, similar to
the r2d2/c3po problem just mentioned.
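To make the r2d2/c3po point concrete, here is a small comparison of the bracket-based pattern with the alternation-based one. The list of names is hypothetical, and fullmatch() assumes Python 3.4 or newer.

```python
import re

droid_like = re.compile("[cr][23][dp][o2]")  # each position chosen independently
exact_droids = re.compile("r2d2|c3po")       # only these two whole strings

# Hypothetical inputs; "c3p0" ends in a zero, not the letter "o".
names = ["r2d2", "c3po", "c2do", "r3p2", "c3p0"]

loose = [n for n in names if droid_like.fullmatch(n)]
strict = [n for n in names if exact_droids.fullmatch(n)]

print(loose)   # ['r2d2', 'c3po', 'c2do', 'r3p2']
print(strict)  # ['r2d2', 'c3po']
```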
1.2.5 Denoting Ranges (-) and Negation (^)
In addition to single characters, the brackets also support ranges of characters. A hyphen between a pair of symbols enclosed in brackets is used to
indicate a range of characters; for example, A-Z, a-z, or 0-9 for uppercase
letters, lowercase letters, and numeric digits, respectively. This is a lexicographic range, so you are not restricted to using just alphanumeric characters. Additionally, if a caret (^) is the first character immediately inside the
open left bracket, this symbolizes a directive not to match any of the characters in the given character set.
Regex Pattern       Strings Matched
z.[0-9]             “z” followed by any character then followed by a single digit
[r-u][env-y][us]    “r,” “s,” “t,” or “u” followed by “e,” “n,” “v,” “w,” “x,” or “y”
                    followed by “u” or “s”
[^aeiou]            A non-vowel character (Exercise: why do we say “non-vowels”
                    rather than “consonants”?)
[^\t\n]             Not a TAB or \n
["-a]               In an ASCII system, all characters that fall between ‘"’ and “a,”
                    that is, between ordinals 34 and 97
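A few of these range and negation patterns can be exercised directly. The input strings below are our own choices and are not taken from the table.

```python
import re

# "z", then any character, then a single digit.
z_any_digit = bool(re.match("z.[0-9]", "z#7"))

# Every character of the sample that is not a lowercase vowel.
non_vowels = re.findall("[^aeiou]", "a1b E?")

# ["-a] covers character codes 34 ('"') through 97 ('a').
in_range = [ch for ch in '!"5Za b' if re.match('["-a]', ch)]

print(z_any_digit)  # True
print(non_vowels)   # ['1', 'b', ' ', 'E', '?']
print(in_range)     # ['"', '5', 'Z', 'a']
print(ord('"'), ord("a"))  # 34 97
```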
1.2.6 Multiple Occurrence/Repetition Using Closure Operators (*, +, ?, {})
We will now introduce the most common regex notations, namely, the special symbols *, +, and ?, all of which can be used to match single, multiple,
or no occurrences of string patterns. The asterisk or star operator (*) will
match zero or more occurrences of the regex immediately to its left (in language and compiler theory, this operation is known as the Kleene Closure).
The plus operator (+) will match one or more occurrences of a regex
(known as Positive Closure), and the question mark operator (?) will match
exactly 0 or 1 occurrences of a regex.
There are also brace operators ({ }) with either a single value or a
comma-separated pair of values. These indicate a match of exactly N occurrences (for {N}) or a range of occurrences; for example, {M,N} will match
from M to N occurrences. These symbols can also be escaped by using the
backslash character; \* matches the asterisk, etc.
In the previous table, we notice the question mark is used more than
once (overloaded), meaning either matching 0 or 1 occurrences, or its
other meaning: if it follows any matching using the closure operators, it will
direct the regular expression engine to match as few repetitions as possible.
What does “as few repetitions as possible” mean? When pattern-matching is employed using the closure operators, the regular expression engine will try to “absorb” as many characters as possible that match
the pattern. This is known as being greedy. The question mark tells the
engine to lay off and, if possible, take as few characters as possible in the
current match, leaving the rest to match as many succeeding characters of
the next pattern (if applicable). Toward the end of the chapter, we will
show you a great example where non-greediness is required. For now, let’s
continue to look at the closure operators:
Regex Pattern                   Strings Matched
[dn]ot?                         “d” or “n,” followed by an “o” and, at most, one “t” after that;
                                thus, do, no, dot, not.
0?[1-9]                         Any nonzero numeric digit, possibly prepended with a “0.” For
                                example, the set of numeric representations of the months
                                January to September, whether single or double-digits.
[0-9]{15,16}                    Fifteen or sixteen digits (for example, credit card numbers).
</?[^>]+>                       Strings that match all valid (and invalid) HTML tags.
[KQRBNP][a-h][1-8]-[a-h][1-8]   Legal chess move in “long algebraic” notation (move only, no
                                capture, check, etc.); that is, strings that start with any of
                                “K,” “Q,” “R,” “B,” “N,” or “P” followed by a hyphenated pair
                                of chess board grid locations from “a1” to “h8” (and everything
                                in between), with the first coordinate indicating the former
                                position, and the second being the new position.
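The closure operators, and the difference between greedy and non-greedy matching, can be sketched as follows. The word list and the HTML fragment are made up for illustration, and fullmatch() assumes Python 3.4 or newer.

```python
import re

# ? makes the trailing "t" optional.
words = ["do", "no", "dot", "not", "dott", "dt"]
optional_t = [w for w in words if re.fullmatch("[dn]ot?", w)]

# {15,16} accepts fifteen or sixteen repetitions, nothing shorter.
fifteen_or_sixteen = re.compile("[0-9]{15,16}")
sixteen_ok = bool(fifteen_or_sixteen.fullmatch("4" * 16))
fourteen_ok = bool(fifteen_or_sixteen.fullmatch("4" * 14))

# Greedy .+ runs to the last ">"; non-greedy .+? stops at the first one.
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall("<.+>", html)
non_greedy = re.findall("<.+?>", html)

print(optional_t)   # ['do', 'no', 'dot', 'not']
print(sixteen_ok, fourteen_ok)  # True False
print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)   # ['<b>', '</b>', '<i>', '</i>']
```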
1.2.7 Special Characters Representing Character Sets
We also mentioned that there are special characters that can represent
character sets. Rather than using a range of “0–9,” you can simply use \d to
indicate the match of any decimal digit. Another special character, \w, can
be used to denote the entire alphanumeric character class, serving as a
shortcut for A-Za-z0-9_, and \s can be used for whitespace characters.
Uppercase versions of these strings symbolize non-matches; for example,
\D matches any character that is not a decimal digit (same as [^0-9]), etc.
Using these shortcuts, we will present a few more complex examples:
Regex Pattern               Strings Matched
\w+-\d+                     Alphanumeric string and number separated by a hyphen
[A-Za-z]\w*                 Alphabetic first character; additional characters (if present)
                            can be alphanumeric (almost equivalent to the set of valid
                            Python identifiers [see exercises])
\d{3}-\d{3}-\d{4}           American-format telephone numbers with an area code prefix,
                            as in 800-555-1212
\[email protected]\w+\.com  Simple e-mail addresses of the form [email protected]
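Here is a brief sketch of the \d, \w, \s, and \D shortcuts at work. The input strings are invented. One caveat: with ordinary (Unicode) strings in Python 3, these shortcuts also match non-ASCII digits, letters, and whitespace unless the re.ASCII flag is given, so the ASCII equivalences stated above hold exactly only under that flag.

```python
import re

digits = re.findall(r"\d+", "Room 101, floor 3")
words = re.findall(r"\w+", "snake_case, 2 words")
fields = re.split(r"\s+", "tabs\tand  spaces\nhere")
non_digits = re.findall(r"\D+", "a1b22c")
tagged = bool(re.fullmatch(r"\w+-\d+", "abc-123"))

print(digits)      # ['101', '3']
print(words)       # ['snake_case', '2', 'words']
print(fields)      # ['tabs', 'and', 'spaces', 'here']
print(non_digits)  # ['a', 'b', 'c']
print(tagged)      # True
```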
1.2.8 Designating Groups with Parentheses (())
Now, we have achieved the goal of matching a string and discarding non-matches, but in some cases, we might also be more interested in the data
that we did match. Not only do we want to know whether the entire string
matched our criteria, but also whether we can extract any specific
strings or substrings that were part of a successful match. The answer is
yes. To accomplish this, surround any regex with a pair of parentheses.
A pair of parentheses (( )) can accomplish either (or both) of the following when used with regular expressions:
• Grouping regular expressions
• Matching subgroups
One good example of why you would want to group regular expressions is when you have two different regexes with which you want to
compare a string. Another reason is to group a regex in order to use a repetition operator on the entire regex (as opposed to an individual character
or character class).
One side effect of using parentheses is that the substring that matched
the pattern is saved for future use. These subgroups can be recalled for the
same match or search, or extracted for post-processing. You will see some
examples of pulling out subgroups at the end of Section 1.3.9.
Why are matches of subgroups important? The main reason is that there
are times when you want to extract the patterns you match, in addition to
making a match. For example, what if we decided to match the pattern
\w+-\d+ but wanted to save the alphabetic first part and the numeric second
part individually? We might want to do this because with any successful
match, we might want to see just what those strings were that matched
our regex patterns.
If we add parentheses to both subpatterns such as (\w+)-(\d+), then we
can access each of the matched subgroups individually. Subgrouping is
preferred because the alternative is to write code to determine we have a
match, then execute another separate routine (which we also had to create)
to parse the entire match just to extract both parts. Why not let Python do
it; it’s a supported feature of the re module, so why reinvent the wheel?
Regex Pattern                       Strings Matched
\d+(\.\d*)?                         Strings representing simple floating-point numbers; that is,
                                    any number of digits followed optionally by a single decimal
                                    point and zero or more numeric digits, as in “0.004,” “2,”
                                    “75.,” etc.
(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+  First name and last name, with a restricted first name (must
                                    start with uppercase; lowercase only for remaining letters,
                                    if any), the full name, prepended by an optional title of
                                    “Mr.,” “Mrs.,” “Ms.,” or “M.,” and a flexible last name,
                                    allowing for multiple words, dashes, and uppercase letters
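The subgroup extraction described in this section might look like the following. The sample strings “abc-123”, “0.004”, and “2” are used only for illustration, and fullmatch() assumes Python 3.4 or newer.

```python
import re

# Two subgroups: the part before the hyphen and the digits after it.
m = re.match(r"(\w+)-(\d+)", "abc-123")
whole = m.group()    # the entire match
first = m.group(1)   # subgroup 1
second = m.group(2)  # subgroup 2
both = m.groups()    # all subgroups as a tuple

# An optional subgroup that does not take part in the match yields None.
with_fraction = re.fullmatch(r"\d+(\.\d*)?", "0.004")
without_fraction = re.fullmatch(r"\d+(\.\d*)?", "2")

print(whole, first, second, both)  # abc-123 abc 123 ('abc', '123')
print(with_fraction.group(1))      # .004
print(without_fraction.group(1))   # None
```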
1.2.9 Extension Notations
One final aspect of regular expressions we have not touched upon yet
is the extension notations that begin with the question mark symbol
(?...). We are not going to spend a lot of time on these as they are generally used more to provide flags, perform look-ahead (or look-behind), or
check conditionally before determining a match. Also, although parentheses are used with these notations, only (?P<name>) represents a grouping
for matches. All others do not create a group. However, you should still
know what they are because they might be “the right tool for the job.”
Regex Pattern
    Notation Definition

(?:\w+\.)*
    Strings that end with a dot, like “google.”, “twitter.”, “facebook.”, but such matches are neither saved for use nor retrieval later.

(?#comment)
    No matching here, just a comment.

(?=.com)
    Only do a match if “.com” follows; do not consume any of the target string.

(?!.net)
    Only do a match if “.net” does not follow.

(?<=800-)
    Only do a match if string is preceded by “800-”, presumably for phone numbers; again, do not consume the input string.

(?<!192\.168\.)
    Only do a match if string is not preceded by “192.168.”, presumably to filter out a group of Class C IP addresses.

(?(1)y|x)
    If a matched group 1 (\1) exists, match against y; otherwise, match against x.
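The two look-behind notations in the table can be sketched as follows. The sample phone numbers and IP addresses are invented for illustration, and the trailing parts of both patterns (the digits after the assertion) are assumptions added to make the examples complete.

```python
import re

# Positive look-behind: keep a number only if "800-" comes right before it.
phones = 'call 800-555-1212 or 650-555-1212'
toll_free = re.findall(r'(?<=800-)\d{3}-\d{4}', phones)

# Negative look-behind: accept the last two octets of an address only if
# they are NOT preceded by "192.168.".
hosts = ['192.168.1.10', '10.0.1.10']
non_private = [h for h in hosts
               if re.search(r'(?<!192\.168\.)\b\d{1,3}\.\d{1,3}$', h)]

print(toll_free)
print(non_private)
```

In both cases the text checked by the assertion is not part of the match itself, which is why "800-" does not appear in the first result.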
1.3 Regexes and Python
Now that we know all about regular expressions, we can examine how
Python currently supports regular expressions through the re module,
which was introduced way back in ancient history (Python 1.5), replacing the deprecated regex and regsub modules—both modules were
removed from Python in version 2.5, and importing either module from
that release on triggers an ImportError exception.
The re module supports the more powerful and regular Perl-style (Perl 5)
regexes, allows multiple threads to share the same compiled regex objects,
and supports named subgroups.
1.3.1 The re Module: Core Functions and Methods
The chart in Table 1-2 lists the more popular functions and methods from
the re module. Many of these functions are also available as methods of
compiled regular expression objects (regex objects) and regex match objects.
In this subsection, we will look at the two main functions/methods, match()
and search(), as well as the compile() function. We will introduce several
more in the next section, but for more information on all these and the others
that we do not cover, we refer you to the Python documentation.
Table 1-2 Common Regular Expression Attributes

Function/Method
    Description

re Module Function Only

compile(pattern, flags=0)
    Compile regex pattern with any optional flags and return a regex object

re Module Functions and Regex Object Methods

match(pattern, string, flags=0)
    Attempt to match pattern to string with optional flags; return match object on success, None on failure

search(pattern, string, flags=0)
    Search for first occurrence of pattern within string with optional flags; return match object on success, None on failure

findall(pattern, string[, flags]) (note a)
    Look for all (non-overlapping) occurrences of pattern in string; return a list of matches

finditer(pattern, string[, flags]) (note b)
    Same as findall(), except returns an iterator instead of a list; for each match, the iterator returns a match object

split(pattern, string, max=0) (note c)
    Split string into a list according to regex pattern delimiter and return list of successful matches, splitting at most max times (split all occurrences is the default)

sub(pattern, repl, string, count=0) (note c)
    Replace all occurrences of the regex pattern in string with repl, substituting all occurrences unless count provided (see also subn(), which, in addition, returns the number of substitutions made)

purge()
    Purge cache of implicitly compiled regex patterns

Common Match Object Methods (see documentation for others)

group(num=0)
    Return entire match (or specific subgroup num)

groups(default=None)
    Return all matching subgroups in a tuple (empty if there aren’t any)

groupdict(default=None)
    Return dict containing all matching named subgroups with the names as the keys (empty if there weren’t any)

Common Module Attributes (flags for most regex functions)

re.I, re.IGNORECASE
    Case-insensitive matching

re.L, re.LOCALE
    Matches via \w, \W, \b, \B, \s, \S depends on locale

re.M, re.MULTILINE
    Respectively causes ^ and $ to match the beginning and end of each line in target string rather than strictly the beginning and end of the entire string itself

re.S, re.DOTALL
    The . normally matches any single character except \n; this flag says . should match them, too

re.X, re.VERBOSE
    All whitespace plus # (and all text after it on a single line) are ignored unless in a character class or backslash-escaped, allowing comments and improving readability

a. New in Python 1.5.2; flags parameter added in 2.4.
b. New in Python 2.2; flags parameter added in 2.4.
c. flags parameter added in version 2.7 and 3.1.
CORE NOTE: Regex compilation (to compile or not to compile?)
In the Execution Environment chapter of Core Python Programming or the forthcoming Core Python Language Fundamentals, we describe how Python code is
eventually compiled into bytecode, which is then executed by the interpreter. In
particular, we specified that calling eval() or exec (in version 2.x or exec()
in version 3.x) with a code object rather than a string provides a performance
improvement due to the fact that the compilation process does not have to be
performed repeatedly. In other words, using precompiled code objects is faster
than using strings because the interpreter has to compile a string into a code object
(anyway) each time before executing it.
The same concept applies to regexes—regular expression patterns must be
compiled into regex objects before any pattern matching can occur. For regexes,
which are compared many times during the course of execution, we highly
recommend using precompilation because, again, regexes have to be compiled
anyway, so doing it ahead of time is prudent for performance reasons.
re.compile() provides this functionality.
The module functions do cache the compiled objects, though, so it’s not as if
every search() and match() with the same regex pattern requires compilation. Still, you save the cache lookups and do not have to make function calls
with the same string, over and over. The number of compiled regex objects that
are cached might vary between releases, and is undocumented. The purge()
function can be used to clear this cache.
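As a rough illustration of the note above, the following sketch compiles a pattern once and reuses the resulting regex object inside a loop. The variable names and sample lines are assumptions made up for this example.

```python
import re

# Compile once, then reuse the regex object's methods.
word_dash_num = re.compile(r'(\w+)-(\d+)')

lines = ['abc-123', 'no match here', 'xyz-42 trailing text']
parsed = []
for line in lines:
    m = word_dash_num.match(line)   # method form: no pattern argument
    if m is not None:
        parsed.append(m.groups())

# The module-level function gives the same answer; it looks up (or builds)
# the compiled pattern behind the scenes on each call.
same = re.match(r'(\w+)-(\d+)', 'abc-123').groups()

print(parsed)
print(same)
```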
1.3.2 Compiling Regexes with compile()
Almost all of the re module functions we will be describing shortly are
available as methods for regex objects. Remember, even though we recommend it, precompilation is not required. If you compile, you will use
methods; if you don’t, you will just use functions. The good news is that
either way, the names are the same, whether a function or a method. (This
is the reason why there are module functions and methods that are identical; for example, search(), match(), etc., in case you were wondering.)
Because it saves one small step for most of our examples, we will use
strings, instead. We will throw in a few with compilation, though, just so
you know how it is done.
Optional flags may be given as arguments for specialized compilation.
These flags allow for case-insensitive matching, using system locale settings for matching alphanumeric characters, etc. Please see the entries in
Table 1-2 and the official documentation for more information on these
flags (re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE, etc.). They can
be combined by using the bitwise OR operator (|).
These flags are also available as a parameter to most re module functions.
If you want to use these flags with the methods, they must already be integrated into the compiled regex objects, or you need to use the (?F) notation directly embedded in the regex itself, where F is one or more of i (for
re.I/IGNORECASE), m (for re.M/MULTILINE), s (for re.S/DOTALL), etc. If more
than one is desired, you place them together rather than using the bitwise OR
operation; for example, (?im) for both re.IGNORECASE plus re.MULTILINE.
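The two ways of supplying flags described above might look like this in practice. The sample text is invented; both spellings are expected to behave identically.

```python
import re

text = 'first line\nSECOND line\nsecond thoughts'

# Flags combined with bitwise OR at compile time.
compiled = re.compile(r'^second', re.IGNORECASE | re.MULTILINE)
via_flags = compiled.findall(text)

# The same two flags embedded at the start of the pattern itself.
via_inline = re.findall(r'(?im)^second', text)

# Without any flags, ^ only matches at the very start of the string.
no_flags = re.findall(r'^second', text)

print(via_flags, via_inline, no_flags)
```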
1.3.3 Match Objects and the group() and groups() Methods
When dealing with regular expressions, there is another object type in
addition to the regex object: the match object. These are the objects returned
on successful calls to match() or search(). Match objects have two primary
methods, group() and groups().
group() either returns the entire match, or a specific subgroup, if
requested. groups() simply returns a tuple consisting of only/all the subgroups. If there are no subgroups requested, then groups() returns an
empty tuple while group() still returns the entire match.
Python regexes also allow for named matches, which are beyond the
scope of this introductory section. We refer you to the complete re module
documentation for a complete listing of the more advanced details we
have omitted here.
1.3.4 Matching Strings with match()
match() is the first re module function and regex object
method we will look at. The match() function attempts to match the pattern to the string, starting at the beginning. If the match is successful, a
match object is returned; if it is unsuccessful, None is returned. The group()
method of a match object can be used to show the successful match. Here
is an example of how to use match() [and group()]:
>>> m = re.match('foo', 'foo')    # pattern matches string
>>> if m is not None:             # show match if successful
...     m.group()
...
'foo'
The pattern “foo” matches exactly the string “foo.” We can also confirm
that m is an example of a match object from within the interactive interpreter:
>>> m                             # confirm match object returned
<re.MatchObject instance at 80ebf48>
Here is an example of a failed match for which None is returned:
>>> m = re.match('foo', 'bar')# pattern does not match string
>>> if m is not None: m.group() # (1-line version of if clause)
...
>>>
The preceding match fails, thus None is assigned to m, and no action is
taken due to the way we constructed our if statement. For the remaining
examples, we will try to leave out the if check for brevity, if possible, but
in practice, it is a good idea to have it there to prevent AttributeError
exceptions. (None is returned on failures, which does not have a group()
attribute [method].)
A match will still succeed even if the string is longer than the pattern, as
long as the pattern matches from the beginning of the string. For example,
the pattern “foo” will find a match in the string “food on the table”
because it matches the pattern from the beginning:
>>> m = re.match('foo', 'food on the table') # match succeeds
>>> m.group()
'foo'
As you can see, although the string is longer than the pattern, a successful match was made from the beginning of the string. The substring “foo”
represents the match, which was extracted from the larger string.
We can even sometimes bypass saving the result altogether, taking
advantage of Python’s object-oriented nature:
>>> re.match('foo', 'food on the table').group()
'foo'
Note from a few paragraphs above that an AttributeError will be generated on a non-match.
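The following sketch shows both sides of that warning: a guarded lookup that returns None on failure, and the AttributeError raised by the unguarded one-liner. The helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: matched text, or None when match() fails."""
    m = re.match(pattern, text)
    return m.group() if m is not None else None

found = first_match('foo', 'food on the table')
missing = first_match('foo', 'seafood')

# The chained call fails because match() returned None.
try:
    re.match('foo', 'seafood').group()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(found, missing, error_name)
```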
1.3.5 Looking for a Pattern within a String with search() (Searching versus Matching)
The chances are greater that the pattern you seek is somewhere in the middle of a string, rather than at the beginning. This is where search() comes
in handy. It works exactly in the same way as match(), except that it searches
for the first occurrence of the given regex pattern anywhere within its string
argument. Again, a match object is returned on success; None is returned
otherwise.
We will now illustrate the difference between match() and search().
Let’s try a longer string match attempt. This time, let’s try to match our
string “foo” to “seafood”:
>>> m = re.match('foo', 'seafood')    # no match
>>> if m is not None: m.group()
...
>>>
As you can see, there is no match here. match() attempts to match the
pattern to the string from the beginning; that is, the “f” in the pattern is
matched against the “s” in the string, which fails immediately. However,
the string “foo” does appear (elsewhere) in “seafood,” so how do we get
Python to say “yes”? The answer is by using the search() function. Rather
than attempting a match, search() looks for the first occurrence of the pattern within the string. search() evaluates a string strictly from left to right.
>>> m = re.search('foo', 'seafood')   # use search() instead
>>> if m is not None: m.group()
...
'foo'                                 # search succeeds where match failed
>>>
Furthermore, both match() and search() take the optional flags parameter described earlier in Section 1.3.2. Lastly, we want to note that the equivalent regex object methods optionally take pos and endpos arguments to
specify the search boundaries of the target string.
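A small sketch of the pos and endpos arguments follows. They are only available on the methods of a compiled regex object, so the pattern is compiled first; the variable names are assumptions for this example.

```python
import re

foo = re.compile('foo')
text = 'seafood'

at_three = foo.match(text, 3)       # match() now starts at index 3
after_four = foo.search(text, 4)    # only 'ood' is examined
clipped = foo.search(text, 0, 5)    # only 'seafo' is examined

print(at_three.group(), after_four, clipped)
```

Note that match() with a pos argument anchors at that position rather than at the true beginning of the string.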
We will be using the match() and search() regex object methods and
the group() and groups() match object methods for the remainder of this
subsection, exhibiting a broad range of examples of how to use regular
expressions with Python. We will be using almost all of the special characters and symbols that are part of the regular expression syntax.
1.3.6 Matching More than One String (|)
In Section 1.2, we used the pipe character in the regex bat|bet|bit. Here
is how we would use that regex with Python:
>>> bt = 'bat|bet|bit'                  # regex pattern: bat, bet, bit
>>> m = re.match(bt, 'bat')             # 'bat' is a match
>>> if m is not None: m.group()
...
'bat'
>>> m = re.match(bt, 'blt')             # no match for 'blt'
>>> if m is not None: m.group()
...
>>> m = re.match(bt, 'He bit me!') # does not match string
>>> if m is not None: m.group()
...
>>> m = re.search(bt, 'He bit me!') # found 'bit' via search
>>> if m is not None: m.group()
...
'bit'
1.3.7 Matching Any Single Character (.)
In the following examples, we show that a dot cannot match a \n or a noncharacter; that is, the empty string:
>>> anyend = '.end'
>>> m = re.match(anyend, 'bend')        # dot matches 'b'
>>> if m is not None: m.group()
...
'bend'
>>> m = re.match(anyend, 'end')         # no char to match
>>> if m is not None: m.group()
...
>>> m = re.match(anyend, '\nend')       # any char except \n
>>> if m is not None: m.group()
...
>>> m = re.search('.end', 'The end.')   # matches ' ' in search
>>> if m is not None: m.group()
...
' end'
The following is an example of searching for a real dot (decimal point)
in a regular expression, wherein we escape its functionality by using a
backslash:
>>> patt314 = '3.14'                    # regex dot
>>> pi_patt = '3\.14'                   # literal dot (dec. point)
>>> m = re.match(pi_patt, '3.14')       # exact match
>>> if m is not None: m.group()
...
'3.14'
>>> m = re.match(patt314, '3014')       # dot matches '0'
>>> if m is not None: m.group()
...
'3014'
>>> m = re.match(patt314, '3.14')       # dot matches '.'
>>> if m is not None: m.group()
...
'3.14'
1.3.8 Creating Character Classes ([])
Earlier, we had a long discussion about [cr][23][dp][o2] and how it differs from r2d2|c3po. In the following examples, we will show that
r2d2|c3po is more restrictive than [cr][23][dp][o2]:
>>> m = re.match('[cr][23][dp][o2]', 'c3po')  # matches 'c3po'
>>> if m is not None: m.group()
...
'c3po'
>>> m = re.match('[cr][23][dp][o2]', 'c2do')  # matches 'c2do'
>>> if m is not None: m.group()
...
'c2do'
>>> m = re.match('r2d2|c3po', 'c2do')         # does not match 'c2do'
>>> if m is not None: m.group()
...
>>> m = re.match('r2d2|c3po', 'r2d2')         # matches 'r2d2'
>>> if m is not None: m.group()
...
'r2d2'
1.3.9 Repetition, Special Characters, and Grouping
The most common aspects of regexes involve the use of special characters,
multiple occurrences of regex patterns, and using parentheses to group
and extract submatch patterns. One particular regex we looked at related
to simple e-mail addresses (\w+@\w+\.com). Perhaps we want to match more
e-mail addresses than this regex allows. To support an additional hostname that precedes the domain, for example, www.xxx.com as opposed to
accepting only xxx.com as the entire domain, we have to modify our
existing regex. To indicate that the hostname is optional, we create a
pattern that matches the hostname (followed by a dot), use the ? operator, indicating zero or one copy of this pattern, and insert the optional
regex into our previous regex as follows: \w+@(\w+\.)?\w+\.com. As you
can see from the following examples, either one or two names are now
accepted before the .com:
>>> patt = '\w+@(\w+\.)?\w+\.com'
>>> re.match(patt, '[email protected]').group()
'[email protected]'
>>> re.match(patt, '[email protected]').group()
'[email protected]'
Furthermore, we can even extend our example to allow any number of
intermediate subdomain names with the following pattern. Take special
note of our slight change from using ? to *: \w+@(\w+\.)*\w+\.com
>>> patt = '\w+@(\w+\.)*\w+\.com'
>>> re.match(patt, '[email protected]').group()
'[email protected]'
However, we must add the disclaimer that using solely alphanumeric
characters does not match all the possible characters that might make up
e-mail addresses. The preceding regex patterns would not match a domain
such as xxx-yyy.com or other domains with \W characters.
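To make the limitation concrete, here is a sketch comparing the word-character-only pattern with one possible relaxation that also accepts hyphens in the domain. The address is made up, and the relaxed pattern is just an illustration; it is still far from a complete e-mail validator.

```python
import re

word_only = r'\w+@(\w+\.)*\w+\.com'
with_hyphens = r'\w+@([\w-]+\.)*[\w-]+\.com'   # assumed variant, not from the text

address = 'someone@xxx-yyy.com'   # made-up address for illustration

strict = re.match(word_only, address)
relaxed = re.match(with_hyphens, address)

print(strict)
print(relaxed.group())
```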
Earlier, we discussed the merits of using parentheses to match and save
subgroups for further processing rather than coding a separate routine to
manually parse a string after a regex match had been determined. In particular, we discussed a simple regex pattern of an alphanumeric string and
a number separated by a hyphen, \w+-\d+, and how adding subgrouping
to form a new regex, (\w+)-(\d+), would do the job. Here is how the
original regex works:
>>> m = re.match('\w\w\w-\d\d\d', 'abc-123')
>>> if m is not None: m.group()
...
'abc-123'
>>> m = re.match('\w\w\w-\d\d\d', 'abc-xyz')
>>> if m is not None: m.group()
...
>>>
In the preceding code, we created a regex to recognize three alphanumeric characters followed by a hyphen and three digits. Testing this regex on abc-123,
we obtained positive results, whereas abc-xyz fails. We will now modify
our regex as discussed before to be able to extract the alphanumeric string
and number. Note how we can now use the group() method to access individual subgroups or the groups() method to obtain a tuple of all the subgroups matched:
>>> m = re.match('(\w\w\w)-(\d\d\d)', 'abc-123')
>>> m.group()                       # entire match
'abc-123'
>>> m.group(1)                      # subgroup 1
'abc'
>>> m.group(2)                      # subgroup 2
'123'
>>> m.groups()                      # all subgroups
('abc', '123')
As you can see, group() is used in the normal way to show the entire
match, but it can also be used to grab individual subgroup matches. We
can also use the groups() method to obtain a tuple of all the substring
matches.
Here is a simpler example that shows different group permutations,
which will hopefully make things even more clear:
>>> m = re.match('ab', 'ab')        # no subgroups
>>> m.group()                       # entire match
'ab'
>>> m.groups()                      # all subgroups
()
>>>
>>> m = re.match('(ab)', 'ab')      # one subgroup
>>> m.group()                       # entire match
'ab'
>>> m.group(1)                      # subgroup 1
'ab'
>>> m.groups()                      # all subgroups
('ab',)
>>>
>>> m = re.match('(a)(b)', 'ab')    # two subgroups
>>> m.group()                       # entire match
'ab'
>>> m.group(1)                      # subgroup 1
'a'
>>> m.group(2)                      # subgroup 2
'b'
>>> m.groups()                      # all subgroups
('a', 'b')
>>>
>>> m = re.match('(a(b))', 'ab')    # two subgroups
>>> m.group()                       # entire match
'ab'
>>> m.group(1)                      # subgroup 1
'ab'
>>> m.group(2)                      # subgroup 2
'b'
>>> m.groups()                      # all subgroups
('ab', 'b')
1.3.10 Matching from the Beginning and End of Strings and on Word Boundaries
The following examples highlight the positional regex operators. These
apply more for searching than matching because match() always starts at
the beginning of a string.
>>> m = re.search('^The', 'The end.')         # match
>>> if m is not None: m.group()
...
'The'
>>> m = re.search('^The', 'end. The')         # not at beginning
>>> if m is not None: m.group()
...
>>> m = re.search(r'\bthe', 'bite the dog')   # at a boundary
>>> if m is not None: m.group()
...
'the'
>>> m = re.search(r'\bthe', 'bitethe dog')    # no boundary
>>> if m is not None: m.group()
...
>>> m = re.search(r'\Bthe', 'bitethe dog')    # no boundary
>>> if m is not None: m.group()
...
'the'
You will notice the appearance of raw strings here. You might want to
take a look at the Core Note, “Using Python raw strings,” toward the end
of this chapter for clarification on why they are here. In general, it is a
good idea to use raw strings with regular expressions.
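A quick way to see why the raw strings matter for \b is to compare the two spellings directly. In an ordinary string literal, Python itself converts \b to a backspace character before the regex engine ever sees it; the variable names below are made up for this sketch.

```python
import re

plain = '\bthe'     # Python turns \b into a backspace character (ASCII 8)
raw = r'\bthe'      # backslash and 'b' reach the regex engine intact

lengths = (len(plain), len(raw))

plain_result = re.search(plain, 'bite the dog')   # looks for backspace + 'the'
raw_result = re.search(raw, 'bite the dog')       # looks for a word boundary

print(lengths, plain_result, raw_result.group())
```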
There are four other re module functions and regex object methods that
we think you should be aware of: findall(), sub(), subn(), and split().
1.3.11 Finding Every Occurrence with findall() and finditer()
findall() looks for all non-overlapping occurrences of a regex pattern in a
string. It is similar to search() in that it performs a string search, but it differs from match() and search() in that findall() always returns a list. The
list will be empty if no occurrences are found, but if successful, the list will
consist of all matches found (grouped in left-to-right order of occurrence).
>>> re.findall('car', 'car')
['car']
>>> re.findall('car', 'scary')
['car']
>>> re.findall('car', 'carry the barcardi to the car')
['car', 'car', 'car']
Subgroup searches result in a more complex list returned, and that makes
sense, because subgroups are a mechanism with which you can extract
specific patterns from within your single regular expression, such as
matching an area code that is part of a complete telephone number, or a
login name that is part of an entire e-mail address.
The shape of the list returned by findall() depends on how many subgroups
the pattern contains. With a single subgroup, each element of the list is the
string matched by that subgroup, one element per successful match. With two or
more subgroups, each successful match contributes one tuple holding its subgroup
matches, and such tuples are the elements of the resulting list. This part
might sound confusing at first, but if you try different examples, it will
help to clarify things.
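One way to try different examples is to run the same target string through patterns with zero, one, and two subgroups and compare the shapes of the results. The sample sentence is invented for this sketch.

```python
import re

s = 'This and that, these and those.'

# No subgroups: a list of whole matches.
no_groups = re.findall(r'th\w+', s, re.I)

# One subgroup: a list of that subgroup's text, one string per match.
one_group = re.findall(r'(th)\w+', s, re.I)

# Two subgroups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(th\w+) and (th\w+)', s, re.I)

print(no_groups)
print(one_group)
print(two_groups)
```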
The finditer() function, which was added back in Python 2.2, is a similar, more memory-friendly alternative to findall(). The main difference
between it and its cousin, other than the return of an iterator versus a list
(obviously), is that rather than returning matching strings, finditer()
iterates over match objects. The following are the differences between the
two with different groups in a single string:
>>> s = 'This and that.'
>>> re.findall(r'(th\w+) and (th\w+)', s, re.I)
[('This', 'that')]
>>> re.finditer(r'(th\w+) and (th\w+)', s,
...     re.I).next().groups()
('This', 'that')
>>> re.finditer(r'(th\w+) and (th\w+)', s,
...     re.I).next().group(1)
'This'
>>> re.finditer(r'(th\w+) and (th\w+)', s,
...     re.I).next().group(2)
'that'
>>> [g.groups() for g in re.finditer(r'(th\w+) and (th\w+)',
...     s, re.I)]
[('This', 'that')]
In the example that follows, we have multiple matches of a single group
in a single string:
>>> re.findall(r'(th\w+)', s, re.I)
['This', 'that']
>>> it = re.finditer(r'(th\w+)', s, re.I)
>>> g = it.next()
>>> g.groups()
('This',)
>>> g.group(1)
'This'
>>> g = it.next()
>>> g.groups()
('that',)
>>> g.group(1)
'that'
>>> [g.group(1) for g in re.finditer(r'(th\w+)', s, re.I)]
['This', 'that']
Note all the additional work that we had to do using finditer() to get
its output to match that of findall().
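The sessions above call the iterator's next() method, which is the Python 2 spelling. Under Python 3, iterators no longer have that method, so a port of the same steps would use the built-in next() function instead, roughly as in this sketch:

```python
import re

s = 'This and that.'
it = re.finditer(r'(th\w+)', s, re.I)

first = next(it)            # Python 3 spelling of it.next()
second = next(it)
words = [first.group(1), second.group(1)]

# Supplying a default avoids StopIteration once the iterator is exhausted.
third = next(it, None)

print(words, third)
```

The built-in next() is also available in Python 2.6 and later, so this form works on both lines of the language.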
Finally, like match() and search(), the method versions of findall()
and finditer() support the optional pos and endpos parameters that control the search boundaries of the target string, as described earlier in this
chapter.
1.3.12 Searching and Replacing with sub() and subn()
There are two functions/methods for search-and-replace functionality: sub()
and subn(). They are almost identical and replace all matched occurrences of the regex pattern in a string with some sort of replacement. The
replacement is usually a string, but it can also be a function that returns a
replacement string. subn() is exactly the same as sub(), but it also returns
the total number of substitutions made—both the newly substituted string
and the substitution count are returned as a 2-tuple.
>>> re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
'attn: Mr. Smith\n\nDear Mr. Smith,\n'
>>>
>>> re.subn('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
('attn: Mr. Smith\n\nDear Mr. Smith,\n', 2)
>>>
>>> print re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
attn: Mr. Smith

Dear Mr. Smith,

>>> re.sub('[ae]', 'X', 'abcdef')
'XbcdXf'
>>> re.subn('[ae]', 'X', 'abcdef')
('XbcdXf', 2)
As we saw in an earlier section, in addition to being able to pull out the
matching group number using the match object’s group() method, you can
use \N, where N is the group number to use in the replacement string.
Below, we’re just converting the American style of date presentation,
MM/DD/YY{,YY}, to the day-first format used in many other countries, DD/MM/YY{,YY}:
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',
...     r'\2/\1/\3', '2/20/91') # Yes, Python is...
'20/2/91'
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',
...     r'\2/\1/\3', '2/20/1991') # ... 20+ years old!
'20/2/1991'
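This section mentions that the replacement can also be a function, which the sessions above do not show. A minimal sketch: sub() calls the function once per match, passing the match object, and uses the returned string as the replacement. The function name to_iso, the output format, and the sample sentence are assumptions made up for this example.

```python
import re

def to_iso(match):
    """Hypothetical helper: turn a MM/DD/YYYY match into YYYY-MM-DD."""
    month, day, year = match.groups()
    return '%s-%02d-%02d' % (year, int(month), int(day))

date_re = r'(\d{1,2})/(\d{1,2})/(\d{4})'
text = 'Released on 2/20/1991.'

result = re.sub(date_re, to_iso, text)
result_n = re.subn(date_re, to_iso, text)

print(result)
print(result_n)
```

A function is handy here because zero-padding the month and day cannot be expressed with \N references alone.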
1.3.13 Splitting (on Delimiting Pattern) with split()
The re module function and regex object method split() work similarly to their
string counterpart, but rather than splitting on a fixed string, they split a
string based on a regex pattern, adding some significant power to string
splitting capabilities. If you do not want the string split for every occurrence of the pattern, you can specify the maximum number of splits by setting a value (other than zero) to the max argument (spelled maxsplit in the re module’s actual signature).
If the delimiter given is not a regular expression that uses special symbols to match multiple patterns, then re.split() works in exactly the
same manner as str.split(), as illustrated in the example that follows
(which splits on a single colon):
>>> re.split(':', 'str1:str2:str3')
['str1', 'str2', 'str3']
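Two quick variations on the colon example may help: one limits the number of splits, and the other uses a character class so that several different delimiters are handled at once. The second input string is invented for illustration.

```python
import re

# Stop after the first split; the remainder is returned as one piece.
limited = re.split(':', 'str1:str2:str3', maxsplit=1)

# Split on a colon, semicolon, or comma, plus any whitespace after it.
multi = re.split(r'[:;,]\s*', 'a: b;c,  d')

print(limited)
print(multi)
```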
That’s a simple example. What if we have a more complex example,
such as a simple parser for a Web site like Google or Yahoo! Maps? Users
can enter city and state, or city plus ZIP code, or all three. This requires
more powerful processing than just a plain ol’ string split:
>>> import re
>>> DATA = (
...     'Mountain View, CA 94040',
...     'Sunnyvale, CA',
...     'Los Altos, 94023',
...     'Cupertino 95014',
...     'Palo Alto CA',
... )
>>> for datum in DATA:
...     print re.split(', |(?= (?:\d{5}|[A-Z]{2})) ', datum)
...
['Mountain View', 'CA', '94040']
['Sunnyvale', 'CA']
['Los Altos', '94023']
['Cupertino', '95014']
['Palo Alto', 'CA']
The preceding regex has a simple component, split on comma-space
(“, ”). The harder part is the second alternative, which previews some of the extension notations that you’ll learn in the next subsection. In plain English, this
is what it says: also split on a single space if that space is immediately followed by five digits (ZIP code) or two capital letters (US state abbreviation). This allows us to keep together city names that have spaces in them.
Naturally, this is just a simplistic regex that could be a starting point for
an application that parses location information. It doesn’t handle (or fails on)
lowercase states or their full spellings, street addresses, country codes,
ZIP+4 (nine-digit ZIP codes), latitude-longitude, multiple spaces, etc. It’s
just meant as a simple demonstration of re.split() doing something
str.split() can’t do.
As we just demonstrated, you benefit from much more power with a
regular expression split; however, remember to always use the best tool
for the job. If a string split is good enough, there’s no need to bring in the
additional complexity and performance impact of regexes.
1.3.14 Extension Notations (?...)
There are a variety of extension notations supported by Python regular
expressions. Let’s take a look at some of them now and provide some
usage examples.
With the (?iLmsux) set of options, users can specify one or more flags
directly into a regular expression rather than via compile() or other re
module functions. Below are several examples that use re.I/IGNORECASE,
with the last mixing in re.M/MULTILINE:
>>> re.findall(r'(?i)yes', 'yes? Yes. YES!!')
['yes', 'Yes', 'YES']
>>> re.findall(r'(?i)th\w+', 'The quickest way is through this tunnel.')
['The', 'through', 'this']
>>> re.findall(r'(?im)(^th[\w ]+)', """
... This line is the first,
... another line,
... that line, it's the best
... """)
['This line is the first', 'that line']
For the previous examples, the case-insensitivity should be fairly
straightforward. In the last example, by using “multiline” we can perform
the search across multiple lines of the target string rather than treating the
entire string as a single entity. Notice that the instances of “the” are
skipped because they do not appear at the beginning of their respective
lines.
The next pair demonstrates the use of re.S/DOTALL. This flag indicates
that the dot (.) can be used to represent \n characters (whereas normally it
represents all characters except \n):
>>> re.findall(r'th.+', '''
... The first line
... the second line
... the third line
... ''')
['the second line', 'the third line']
>>> re.findall(r'(?s)th.+', '''
... The first line
... the second line
... the third line
... ''')
['the second line\nthe third line\n']
The re.X/VERBOSE flag is quite interesting; it lets users create more
human-readable regular expressions by ignoring whitespace characters within regexes (except those in character classes or those that are
backslash-escaped). Furthermore, hash/comment/octothorpe symbols (#)
can also be used to start a comment, again as long as they’re not within a
character class or backslash-escaped:
>>> re.search(r'''(?x)
...     \((\d{3})\) # area code
...     [ ]         # space
...     (\d{3})     # prefix
...     -           # dash
...     (\d{4})     # endpoint number
... ''', '(800) 555-1212').groups()
('800', '555', '1212')
The (?:...) notation should be fairly popular; with it, you can group
parts of a regex, but it does not save them for future retrieval or use. This
comes in handy when you don’t want superfluous matches that are saved
and never used:
>>> re.findall(r'http://(?:\w+\.)*(\w+\.com)',
...     'http://google.com http://www.google.com http://code.google.com')
['google.com', 'google.com', 'google.com']
>>> re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
...     '(800) 555-1212').groupdict()
{'areacode': '800', 'prefix': '555'}
You can use the (?P<name>) and (?P=name) notations together. The former saves matches by using a name identifier rather than using increasing
numbers, starting at one and going through N, which are then retrieved
later by using \1, \2, ... \N. You can retrieve them in a similar manner
using \g<name>:
>>> re.sub(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
...     '(\g<areacode>) \g<prefix>-xxxx', '(800) 555-1212')
'(800) 555-xxxx'
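Named groups can also be read back from the match object itself. The sketch below names all three parts of the phone number (the third name, number, is an assumption added for this example) and shows that a named group remains reachable by its position as well.

```python
import re

m = re.search(
    r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?P<number>\d{4})',
    '(800) 555-1212')

by_name = m.group('areacode')   # look up a group by its name
by_number = m.group(1)          # the same group, by position
everything = m.groupdict()      # all named groups as a dict

print(by_name, by_number, everything)
```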
Using the latter, you can reuse patterns in the same regex without specifying the same pattern again later on in the (same) regex, such as in this
example, which presumably lets you validate normalization of phone
numbers. Here are the ugly and compressed versions followed by a good
use of (?x) to make things (slightly) more readable:
>>> bool(re.match(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?P<number>\d{4}) (?P=areacode)-(?P=prefix)-(?P=number) 1(?P=areacode)(?P=prefix)(?P=number)',
...     '(800) 555-1212 800-555-1212 18005551212'))
True
>>> bool(re.match(r'''(?x)
...
...     # match (800) 555-1212, save areacode, prefix, no.
...     \((?P<areacode>\d{3})\)[ ](?P<prefix>\d{3})-(?P<number>\d{4})
...
...     # space
...     [ ]
...
...     # match 800-555-1212
...     (?P=areacode)-(?P=prefix)-(?P=number)
...
...     # space
...     [ ]
...
...     # match 18005551212
...     1(?P=areacode)(?P=prefix)(?P=number)
...
... ''', '(800) 555-1212 800-555-1212 18005551212'))
True
You use the (?=...) and (?!...) notations to perform a lookahead in
the target string without actually consuming those characters. The first is
the positive lookahead assertion, while the latter is the negative. In the
examples that follow, we are only interested in the first names of the persons who have a last name of “van Rossum,” and the next example lets us
ignore e-mail addresses that begin with “noreply” or “postmaster.”
The third snippet is another demonstration of the difference between
findall() and finditer(); we use the latter to build a list of e-mail
addresses (in a more memory-friendly way by skipping the creation of the
intermediary list that would be thrown away) using the same login names
but on a different domain.
>>> re.findall(r'\w+(?= van Rossum)',
... '''
...     Guido van Rossum
...     Tim Peters
...     Alex Martelli
...     Just van Rossum
...     Raymond Hettinger
... ''')
['Guido', 'Just']
>>> re.findall(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...     [email protected]
...     [email protected]
...     [email protected]
...     [email protected]
...     [email protected]
... ''')
['sales', 'eng', 'admin']
>>> ['%[email protected]' % e.group(1) for e in \
... re.finditer(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...     [email protected]
...     [email protected]
...     [email protected]
...     [email protected]
...     [email protected]
... ''')]
['[email protected]', '[email protected]', '[email protected]']
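Because the addresses in the listings above are not legible, here is a self-contained sketch of the same negative lookahead. The logins match the results shown, but the example.com and example.org domains are placeholders chosen for illustration, not the addresses used in the original session.

```python
import re

# Hypothetical sample data; example.com is a reserved documentation domain.
addresses = '''
    sales@example.com
    postmaster@example.com
    eng@example.com
    noreply@example.com
    admin@example.com
'''

# (?m) lets ^ match at the start of every line; the negative lookahead
# rejects logins that begin with "noreply" or "postmaster".
patt = r'(?m)^\s+(?!noreply|postmaster)(\w+)'

# findall() builds the whole list of logins at once.
logins = re.findall(patt, addresses)

# finditer() yields match objects one at a time, so no intermediate
# list of logins is created before the new addresses are built.
rewritten = ['%s@example.org' % m.group(1)
             for m in re.finditer(patt, addresses)]

print(logins)
print(rewritten)
```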
The last examples demonstrate the use of conditional regular expression matching. Suppose that we have another specialized alphabet consisting only of the characters ‘x’ and ‘y,’ and that we want to restrict
two-letter strings so that they must consist of one character
followed by the other. In other words, you can’t have both letters be the
same; either it’s an ‘x’ followed by a ‘y’ or vice versa:
>>> bool(re.search(r'(?:(x)|y)(?(1)y|x)', 'xy'))
True
>>> bool(re.search(r'(?:(x)|y)(?(1)y|x)', 'xx'))
False
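To see the conditional at work on all four two-letter strings, the sketch below uses fullmatch() (available in Python 3.4 and later) rather than search(). That choice is ours, made so that longer strings containing a valid pair are not accepted by accident.

```python
import re

# Group 1 is set only when the first letter is "x". The conditional
# (?(1)y|x) then requires "y" if group 1 matched and "x" otherwise.
patt = re.compile(r'(?:(x)|y)(?(1)y|x)')

# fullmatch() must consume the entire string.
results = {s: bool(patt.fullmatch(s)) for s in ('xy', 'yx', 'xx', 'yy')}
print(results)

# search() only needs a valid pair somewhere inside the string.
found_inside = bool(patt.search('xxy'))
```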
1.3.15 Miscellaneous
There can be confusion between regular expression special characters and
special ASCII symbols. We can use \n to represent a NEWLINE character,
whereas \d is a regular expression special symbol that matches a single numeric
digit.
Problems can occur if there is a symbol used by both ASCII and regular
expressions, so in the following Core Note, we recommend the use of
Python raw strings to prevent any problems. One more caution: the \w and
\W alphanumeric character sets are affected by the re.L/LOCALE and Unicode
(re.U/UNICODE) flags.
CORE NOTE: Using Python raw strings
You might have seen the use of raw strings in some of the previous examples.
Regular expressions were a strong motivation for the advent of raw strings. The
reason lies in the conflicts between ASCII characters and regular expression special characters. As a special symbol, \b represents the ASCII character for backspace, but \b is also a regular expression special symbol, meaning “match” on a
word boundary. For the regex compiler to see the two characters \b as your string
and not a (single) backspace, you need to escape the backslash in the string by
using another backslash, resulting in \\b.
This can get messy, especially if you have a lot of special characters in your
string, adding to the confusion. We were introduced to raw strings in the
Sequences chapter of Core Python Programming or Core Python Language
Fundamentals, and they can be (and are often) used to help keep regexes looking
somewhat manageable. In fact, many Python programmers swear by these and
only use raw strings when defining regular expressions.
Here are some examples of differentiating between the backspace \b and the
regular expression \b, with and without raw strings:
>>> m = re.match('\bblow', 'blow') # backspace, no match
>>> if m: m.group()
...
>>> m = re.match('\\bblow', 'blow') # escaped \, now it works
>>> if m: m.group()
...
'blow'
>>> m = re.match(r'\bblow', 'blow') # use raw string instead
>>> if m: m.group()
...
'blow'
You might recall that we had no trouble using \d in our regular expressions without using raw strings. That is because there is no ASCII equivalent
special character, so the regular expression compiler knew that you meant a
decimal digit.
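One way to convince yourself of the difference is to look at the lengths of the three pattern strings. The following is a small sketch, assuming Python 3; the variable names are ours.

```python
import re

plain = '\bblow'      # Python turns \b into one backspace character
escaped = '\\bblow'   # a backslash and a "b" reach the regex compiler
raw = r'\bblow'       # raw string: the same two characters, no doubling

# The backspace version is one character shorter than the other two.
lengths = (len(plain), len(escaped), len(raw))

# Only the patterns that still contain a real \b match "blow".
matches = [bool(re.match(p, 'blow')) for p in (plain, escaped, raw)]

print(lengths, matches)
```

Recent Python 3 releases also warn about unrecognized escapes such as \d inside ordinary (non-raw) strings, which is one more reason to prefer raw strings for regexes.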
1.4 Some Regex Examples
Let’s look at a few examples of some Python regex code that takes us a step
closer to something that you would actually use in practice. Take, for
example, the output from the POSIX (Unix-flavored systems like Linux,
Mac OS X, etc.) who command, which lists all the users logged in to a system:
$ who
wesley    console    Jun 20 20:33
wesley    pts/9      Jun 22 01:38    (192.168.0.6)
wesley    pts/1      Jun 20 20:33    (:0.0)
wesley    pts/2      Jun 20 20:33    (:0.0)
wesley    pts/4      Jun 20 20:33    (:0.0)
wesley    pts/3      Jun 20 20:33    (:0.0)
wesley    pts/5      Jun 20 20:33    (:0.0)
wesley    pts/6      Jun 20 20:33    (:0.0)
wesley    pts/7      Jun 20 20:33    (:0.0)
wesley    pts/8      Jun 20 20:33    (:0.0)
Perhaps we want to save some user login information such as login
name, the teletype at which the user logged in, when the user logged in,
and from where. Using str.split() on the preceding example would not
be effective because the spacing is erratic and inconsistent. The other problem is that there is a space between the month, day, and time for the login
timestamps. We would probably want to keep these fields together.
You need some way to describe a pattern such as “split on two or more
spaces.” This is easily done with regular expressions. In no time, we whip up
the regex pattern \s\s+, which means at least two whitespace characters.
Let’s create a program called rewho.py that reads the output of the who
command, presumably saved into a file called whodata.txt. Our rewho.py
script initially looks something like this:
import re
f = open('whodata.txt', 'r')
for eachLine in f:
    print re.split(r'\s\s+', eachLine)
f.close()
The preceding code also uses raw strings (leading “r” or “R” in front of
the opening quotes). The main idea is to avoid translating special string
characters like \n, which is not a special regex pattern. For regex patterns
that do have backslashes, you want them treated verbatim; otherwise,
you’d have to double-backslash them to keep them safe.
We will now execute the who command, saving the output into whodata.txt,
and then call rewho.py to take a look at the results:
$ who > whodata.txt
$ rewho.py
['wesley', 'console', 'Jun 20 20:33\012']
['wesley', 'pts/9', 'Jun 22 01:38\011(192.168.0.6)\012']
['wesley', 'pts/1', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/2', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/4', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/3', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/5', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/6', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/7', 'Jun 20 20:33\011(:0.0)\012']
['wesley', 'pts/8', 'Jun 20 20:33\011(:0.0)\012']
It was a good first try, but not quite correct. For one thing, we did not
anticipate a single TAB (ASCII \011) as part of the output (which looked
like at least two spaces, right?), and perhaps we aren’t really keen on saving
the \n (ASCII \012), which terminates each line. We are now going to fix
those problems as well as improve the overall quality of our application by
making a few more changes.
First, we would rather run the who command from within the script
instead of doing it externally and saving the output to a whodata.txt
file, because doing this repeatedly gets tiring rather quickly. To invoke another program from within ours, we call the os.popen() function. Although os.popen() has now been made obsolete by the subprocess
module, it’s still simpler to use, and the main point is to illustrate the functionality of re.split().
We get rid of the trailing \ns (with str.rstrip()) and add the detection of a
single TAB as an additional, alternative re.split() delimiter. Example 1-1
presents the final Python 2 version of our rewho.py script:
Example 1-1
Split Output of the POSIX who Command (rewho.py)
This script calls the who command and parses the input by splitting up its data
along various types of whitespace characters.
#!/usr/bin/env python

import os
import re

f = os.popen('who', 'r')
for eachLine in f:
    print re.split(r'\s\s+|\t', eachLine.rstrip())
f.close()
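Because os.popen() is described above as superseded by the subprocess module, here is a rough sketch of the same idea using subprocess.run(). It assumes Python 3.7 or later; split_who_line() and who_fields() are hypothetical helper names, and the script returns nothing when who is not installed.

```python
import re
import shutil
import subprocess

def split_who_line(line):
    """Split one line of who output on a TAB or on 2+ whitespace characters."""
    return re.split(r'\s\s+|\t', line.rstrip())

def who_fields():
    """Run who, if it is available, and return the fields of each line."""
    if shutil.which('who') is None:       # e.g., Windows without Cygwin
        return []
    result = subprocess.run(['who'], capture_output=True, text=True)
    return [split_who_line(line) for line in result.stdout.splitlines()]

if __name__ == '__main__':
    for fields in who_fields():
        print(fields)
```

Keeping the split in its own function makes it possible to try the regex on a saved line without running who at all.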
Example 1-2 presents rewho3.py, which is the Python 3 version with an
additional twist. The main difference from the Python 2 version is the
print() function (vs. a statement). This entire line is italicized to indicate
critical Python 2 versus 3 differences. The with statement, available as
experimental in version 2.5, and official in version 2.6, works with objects
built to support it.
Example 1-2
Python 3 Version of rewho.py Script (rewho3.py)
This Python 3 equivalent of rewho.py simply replaces the print statement with
the print() function. When using the with statement (available starting in
Python 2.5), keep in mind that the file (Python 2) or io (Python 3) object’s
context manager will automatically call f.close() for you.
#!/usr/bin/env python

import os
import re

with os.popen('who', 'r') as f:
    for eachLine in f:
        print(re.split(r'\s\s+|\t', eachLine.strip()))
Objects that have context managers implemented for them are
eligible to be used with with. For more on the with statement and context
management, please review the “Errors and Exceptions” chapter of Core
Python Programming or Core Python Language Fundamentals. Don’t forget for
either version (rewho.py or rewho3.py) that the who command is only available on POSIX systems unless you’re using Cygwin on a Windows-based
computer. For PCs running Microsoft Windows, try tasklist instead, but
there’s an additional tweak you need to do. Keep reading to see a sample
execution using that command.
Example 1-3 merges together both rewho.py and rewho3.py into
rewhoU.py, with the name meaning “rewho universal.” It runs under both
Python 2 and 3 interpreters. We cheat and avoid the use of print or
print() by using a less than fully-featured function that exists in both version 2.x and version 3.x: distutils.log.warn(). It’s a one-string output
function, so if your display is more complex than that, you’ll need to
merge it all into a single string, and then make the call. To indicate its use
within our script, we’ll name it printf().
We also roll in the with statement here, too. This means that you need at
least version 2.6 to run this. Well, that’s not quite true. We mentioned earlier that it’s experimental in version 2.5. This means that you need to
include this additional statement if you wish to use it: from __future__
import with_statement. If you’re still using version 2.4 or older, you have
no access to this import and must run code such as that in Example 1-1.
Example 1-3
Universal Version of rewho.py Script (rewhoU.py)
This script runs under both Python 2 and 3 by proxying out the print statement
and the print() function with a cheap substitute. It also includes the with
statement available starting in Python 2.5.
#!/usr/bin/env python

import os
from distutils.log import warn as printf
import re

with os.popen('who', 'r') as f:
    for eachLine in f:
        printf(re.split(r'\s\s+|\t', eachLine.strip()))
The creation of rewhoU.py is one example of how you can create a universal script that helps avoid the need to maintain two versions of the
same script for both Python 2 and 3.
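Note that distutils was removed from the standard library in Python 3.12, so the import in Example 1-3 fails on newer interpreters. A possible substitute, sketched here under that assumption, is a tiny printf() built on sys.stdout.write(), which exists in both Python 2 and 3. The helper name render() and the sample line are ours.

```python
import re
import sys

def render(obj):
    """Return str(obj) followed by a newline."""
    # The one-element tuple keeps % from unpacking lists or tuples.
    return '%s\n' % (obj,)

def printf(obj):
    """One-argument stand-in for print that runs on Python 2 and 3."""
    sys.stdout.write(render(obj))

# Hypothetical line of who output, split as in the rewho scripts.
fields = re.split(r'\s\s+|\t', 'wesley   console  Feb 22 14:12')
printf(fields)
```

Another option on Python 2.6 and later is from __future__ import print_function, which makes print() available under both major versions.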
Executing any of these scripts with the appropriate interpreter yields
the corrected, cleaner output:
$ rewho.py
['wesley', 'console', 'Feb 22 14:12']
['wesley', 'ttys000', 'Feb 22 14:18']
['wesley', 'ttys001', 'Feb 22 14:49']
['wesley', 'ttys002', 'Feb 25 00:13', '(192.168.0.20)']
['wesley', 'ttys003', 'Feb 24 23:49', '(192.168.0.20)']
Also, don’t forget that the re.split() function takes the optional
flags parameter described earlier in this chapter.
A similar exercise can be achieved on Windows-based computers by
using the tasklist command in place of who. Let’s take a look at its output:
C:\WINDOWS\system32>tasklist

Image Name                   PID Session Name     Session#    Mem Usage
========================= ====== ================ ======== ============
System Idle Process            0 Console                 0         28 K
System                         4 Console                 0        240 K
smss.exe                     708 Console                 0        420 K
csrss.exe                    764 Console                 0      4,876 K
winlogon.exe                 788 Console                 0      3,268 K
services.exe                 836 Console                 0      3,932 K
. . .
As you can see, the output contains different information than that of
who, but the format is similar, so we can consider our previous solution by
performing an re.split() on one or more spaces (no TAB issue here).
The problem is that the command name might have a space, and we
(should) prefer to keep the entire command name together. The same is
true of the memory usage, which is given by “NNN K,” where NNN is the
amount of memory and K designates kilobytes. We want to keep this together,
too, so we’d better split on at least two spaces, right?
Nope, no can do. Notice that the process ID (PID) and Session Name
columns are delimited only by a single space. This means that if we split
on at least two spaces, the PID and Session Name would be kept together
as a single result. If we copy one of the preceding scripts and call it
retasklist.py, change the command from who to tasklist /nh (the /nh
option suppresses the column headers), and use a regex of \s\s+, we get
output that looks like this:
Z:\corepython\ch1>python retasklist.py
['']
['System Idle Process', '0 Console', '0', '28 K']
['System', '4 Console', '0', '240 K']
['smss.exe', '708 Console', '0', '420 K']
['csrss.exe', '764 Console', '0', '5,028 K']
['winlogon.exe', '788 Console', '0', '3,284 K']
['services.exe', '836 Console', '0', '3,924 K']
. . .
We have confirmed that although we’ve kept the command name and
memory usage strings together, we’ve inadvertently put the PID and Session Name together. We have to discard our use of split and just do a regular
expression match. Let’s do that and filter out both the Session Name and
Number because neither adds value to our output. Example 1-4 shows the
final version of our Python 2 retasklist.py:
Example 1-4
Processing the DOS tasklist Command Output
(retasklist.py)
This script uses a regex and findall() to parse the output of the DOS tasklist
command, displaying only the data that’s interesting to us. Porting this script to
Python 3 merely requires a switch to the print() function.
#!/usr/bin/env python

import os
import re

f = os.popen('tasklist /nh', 'r')
for eachLine in f:
    print re.findall(
        r'([\w.]+(?: [\w.]+)*)\s\s+(\d+) \w+\s\s+\d+\s\s+([\d,]+ K)',
        eachLine.rstrip())
f.close()
If we run this script, we get our desired (truncated) output:
Z:\corepython\ch1>python retasklist.py
[]
[('System Idle Process', '0', '28 K')]
[('System', '4', '240 K')]
[('smss.exe', '708', '420 K')]
[('csrss.exe', '764', '5,016 K')]
[('winlogon.exe', '788', '3,284 K')]
[('services.exe', '836', '3,932 K')]
. . .
The meticulous regex used goes through all five columns of the output
string, grouping together only those values that matter to us: the command name, its PID, and how much memory it takes. It uses many regex
features that we’ve already read about in this chapter.
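If you are not on Windows, you can still try the regex from Example 1-4 against saved lines. The sketch below breaks the pattern into commented pieces (adjacent raw strings are concatenated by Python) and applies it to two sample lines modeled on the tasklist output shown earlier; the name TASK is ours.

```python
import re

TASK = re.compile(
    r'([\w.]+(?: [\w.]+)*)'   # image name, possibly several words
    r'\s\s+(\d+)'             # two or more spaces, then the PID
    r' \w+'                   # one space and the session name (not saved)
    r'\s\s+\d+'               # session number (not saved)
    r'\s\s+([\d,]+ K)'        # memory usage such as "3,932 K"
)

samples = [
    'System Idle Process            0 Console                 0         28 K',
    'services.exe                 836 Console                 0      3,932 K',
]

parsed = [TASK.findall(line) for line in samples]
for row in parsed:
    print(row)
```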
Naturally, all of the scripts we’ve done in this subsection merely display
output to the user. In practice, you’re likely to be processing this data,
instead, saving it to a database, using the output to generate reports to
management, etc.
1.5 A Longer Regex Example
We will now run through an in-depth example of the different ways to use
regular expressions for string manipulation. The first step is to come up
with some code that actually generates random (but not too random) data
on which to operate. In Example 1-5, we present gendata.py, a script that
generates a data set. Although this program simply displays the generated
set of strings to standard output, this output could very well be redirected
to a test file.
Example 1-5
Data Generator for Regex Exercises (gendata.py)
This script creates random data for regular expressions practice and outputs
the generated data to the screen. To port this to Python 3, just convert print to
a function, switch from xrange() back to range(), and change from using
sys.maxint to sys.maxsize.
1   #!/usr/bin/env python
2
3   from random import randrange, choice
4   from string import ascii_lowercase as lc
5   from sys import maxint
6   from time import ctime
7
8   tlds = ('com', 'edu', 'net', 'org', 'gov')
9
10  for i in xrange(randrange(5, 11)):
11      dtint = randrange(maxint)       # pick date
12      dtstr = ctime(dtint)            # date string
13      llen = randrange(4, 8)          # login is shorter
14      login = ''.join(choice(lc) for j in range(llen))
15      dlen = randrange(llen, 13)      # domain is longer
16      dom = ''.join(choice(lc) for j in xrange(dlen))
17      print '%s::%s@%s.%s::%d-%d-%d' % (dtstr, login,
18          dom, choice(tlds), dtint, llen, dlen)
This script generates strings with three fields, delimited by a pair of
colons, or a double-colon. The first field is a random (32-bit) integer, which
is converted to a date. The next field is a randomly generated e-mail
address, and the final field is a set of integers separated by a single dash (-).
Running this code, we get the following output (your mileage will definitely vary) and store it locally as the file redata.txt:
Thu Jul 22 19:21:19 2004::[email protected]::1090549279-4-11
Sun Jul 13 22:42:11 2008::[email protected]::1216014131-4-11
Sat May  5 16:36:23 1990::[email protected]::641950583-6-10
Thu Feb 15 17:46:04 2007::[email protected]::1171590364-6-8
Thu Jun 26 19:08:59 2036::[email protected]::2098145339-7-7
Tue Apr 10 01:04:45 2012::[email protected]::1334045085-5-10
You might or might not be able to tell, but the output from this program
is ripe for regular expression processing. Following our line-by-line explanation, we will implement several regexes to operate on this data as well
as leave plenty for the end-of-chapter exercises.
Line-by-Line Explanation
Lines 1–6
In our example script, we require the use of multiple modules. Although
we caution against the use of from-import because of various reasons (e.g.,
it’s easier to determine where a function comes from, possible local module conflict, etc.), we choose to import only specific attributes from these
modules to help you focus on those attributes only as well as shortening
each line of code.
Line 8
tlds is simply a set of higher-level domain names from which we will randomly pick for each randomly generated e-mail address.
Lines 10–12
Each time gendata.py executes, between 5 and 10 lines of output are generated. (Our script uses the random.randrange() function for all cases for
which we desire a random integer.) For each line, we choose a random
integer from the entire possible range (0 to 2³¹ – 1 [sys.maxint]), and then
convert that integer to a date by using time.ctime(). System time in
Python and most POSIX-based computers is based on the number of seconds that have elapsed since the “epoch,” which is midnight UTC/GMT
on January 1, 1970. If we choose a 32-bit integer, that represents one
moment in time from the epoch to the maximum possible time, roughly 2³¹ seconds
after the epoch.
Lines 13–16
The login name for the fake e-mail address should be between 4 and 7
characters in length (thus randrange(4, 8)). To put it together, we randomly
choose between 4 and 7 random lowercase letters, concatenating each letter
to our string, one at a time. The functionality of the random.choice() function is to accept a sequence, and then return a random element of that
sequence. In our case, the sequence is the set of all 26 lowercase letters of
the alphabet, string.ascii_lowercase.
We decided that the main domain name for the fake e-mail address
should be no more than 12 characters in length, but at least as long as the
login name. Again, we use random lowercase letters to put this name
together, letter by letter.
Lines 17–18
The key component of our script puts together all of the random data into
the output line. The date string comes first, followed by the delimiter. We
then put together the random e-mail address by concatenating the login
name, the “@” symbol, the domain name, and a randomly chosen high-level domain. After the final double-colon, we put together a random integer
string using the original time chosen (for the date string), followed by the
lengths of the login and domain names, all separated by a single hyphen.
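The walkthrough above can be sketched for Python 3 as follows. One deliberate departure from the porting advice, and an assumption on our part: on 64-bit builds sys.maxsize is typically far too large for time.ctime(), so this version caps the timestamp at 2**31 - 1 instead. The names TLDS, MAX_SECONDS, and make_line() are ours.

```python
from random import randrange, choice
from string import ascii_lowercase as lc
from time import ctime

TLDS = ('com', 'edu', 'net', 'org', 'gov')
MAX_SECONDS = 2**31 - 1   # assumption: 32-bit limit rather than sys.maxsize

def make_line():
    """Build one record: date string, fake e-mail address, integer triple."""
    dtint = randrange(MAX_SECONDS)                       # pick date
    dtstr = ctime(dtint)                                 # date string
    llen = randrange(4, 8)                               # login is shorter
    login = ''.join(choice(lc) for _ in range(llen))
    dlen = randrange(llen, 13)                           # domain is longer
    dom = ''.join(choice(lc) for _ in range(dlen))
    return '%s::%s@%s.%s::%d-%d-%d' % (
        dtstr, login, dom, choice(TLDS), dtint, llen, dlen)

# Between 5 and 10 lines, as in the original script.
lines = [make_line() for _ in range(randrange(5, 11))]
for line in lines:
    print(line)
```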
1.5.1 Matching a String
For the following exercises, create both permissive and restrictive versions
of your regexes. We recommend that you test these regexes in a short
application that utilizes our sample redata.txt, presented earlier (or use
your own generated data from running gendata.py). You will need to use
it again when you do the exercises.
To test the regex before putting it into our little application, we will import
the re module and assign one sample line from redata.txt to a string variable
data. These statements are constant across both illustrated examples.
>>> import re
>>> data = 'Thu Feb 15 17:46:04 2007::[email protected]::1171590364-6-8'
In our first example, we will create a regular expression to extract (only)
the days of the week from the timestamps from each line of the data file
redata.txt. We will use the following regex:
“^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun”
This example requires that the string start with (“^” regex operator) any
of the seven strings listed. If we were to “translate” the above regex to
English, it would read something like, “the string should start with
‘Mon,’ ‘Tue,’ . . . , ‘Sat,’ or ‘Sun.’”
Alternatively, we can bypass all the caret operators with a single caret if
we group the day strings like this:
“^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)”
The parentheses around the set of strings mean that one of these strings
must be encountered for a match to succeed. This is a “friendlier” version
of the original regex we came up with, which did not have the parentheses. Using our modified regex, we can take advantage of the fact that we
can access the matched string as a subgroup:
>>> patt = '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
>>> m = re.match(patt, data)
>>> m.group()                  # entire match
'Thu'
>>> m.group(1)                 # subgroup 1
'Thu'
>>> m.groups()                 # all subgroups
('Thu',)
This feature might not seem as revolutionary as we have made it out to
be for this example, but it is definitely advantageous in the next example
or anywhere you provide extra data as part of the regex to help in the
string matching process, even though those characters might not be part of
the string you are interested in.
Both of the above regexes are the most restrictive, specifically requiring
a set number of strings. This might not work well in an internationalization environment, where localized days and abbreviations are used. A
looser regex would be: ^\w{3}. This one requires only that a string begin
with three consecutive alphanumeric characters. Again, to translate the
regex into English, the caret indicates “begins with,” the \w means any
single alphanumeric character, and the {3} means that there should be 3
consecutive copies of the regex which the {3} embellishes. Again, if you
want grouping, parentheses should be used, such as ^(\w{3}):
>>> patt = '^(\w{3})'
>>> m = re.match(patt, data)
>>> if m is not None: m.group()
...
'Thu'
>>> m.group(1)
'Thu'
Note that a regex of ^(\w){3} is not correct. When the {3} was inside the
parentheses, the match for three consecutive alphanumeric characters was
made first, and then represented as a group. But by moving the {3} outside,
it is now equivalent to three consecutive single alphanumeric characters:
>>> patt = '^(\w){3}'
>>> m = re.match(patt, data)
>>> if m is not None: m.group()
...
'Thu'
>>> m.group(1)
'u'
The reason why only the “u” shows up when accessing subgroup 1 is
that subgroup 1 was being continually replaced by the next character. In
other words, m.group(1) started out as “T,” then changed to “h,” and then
finally was replaced by “u.” These are three individual (and overlapping)
groups of a single alphanumeric character, as opposed to a single group
consisting of three consecutive alphanumeric characters.
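A side-by-side comparison makes the difference easier to see. This is a short sketch using only the timestamp portion of the sample line; the variable names are ours.

```python
import re

data = 'Thu Feb 15 17:46:04 2007'   # timestamp portion of the sample line

inside = re.match(r'^(\w{3})', data)    # quantifier inside the group
outside = re.match(r'^(\w){3}', data)   # quantifier applied to the group

# Both patterns match the same three characters overall...
whole = (inside.group(), outside.group())

# ...but a repeated group keeps only its last repetition.
saved = (inside.group(1), outside.group(1))

print(whole, saved)
```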
In our next (and final) example, we will create a regular expression to
extract the numeric fields found at the end of each line of redata.txt.
1.5.2 Search versus Match... and Greediness, too
Before we create any regexes, however, we realize that these integer data
items are at the end of the data strings. This means that we have a choice
of using either search or match. Initiating a search makes more sense
because we know exactly what we are looking for (a set of three integers),
that what we seek is not at the beginning of the string, and that it does
not make up the entire string. If we were to perform a match, we would
have to create a regex to match the entire line and use subgroups to save
the data we are interested in. To illustrate the differences, we will perform
a search first, and then do a match to show you that searching is more
appropriate.
Because we are looking for three integers delimited by hyphens, we create our regex to indicate as such: \d+-\d+-\d+. This regular expression
means, “any number of digits (at least one, though) followed by a hyphen,
then more digits, another hyphen, and finally, a final set of digits.” We test
our regex now by using search():
>>> patt = '\d+-\d+-\d+'
>>> re.search(patt, data).group()      # entire match
'1171590364-6-8'
A match attempt, however, would fail. Why? Because matches start at
the beginning of the string, and the numeric strings are at the end. We would
have to create another regex to match the entire string. We can be lazy,
though, by using .+ to indicate just an arbitrary set of characters followed
by what we are really interested in:
>>> patt = '.+\d+-\d+-\d+'
>>> re.match(patt, data).group()       # entire match
'Thu Feb 15 17:46:04 2007::[email protected]::1171590364-6-8'
This works great, but we really want the number fields at the end, not
the entire string, so we have to use parentheses to group what we want:
>>> patt = '.+(\d+-\d+-\d+)'
>>> re.match(patt, data).group(1)      # subgroup 1
'4-6-8'
What happened? We should have extracted 1171590364-6-8, not just
4-6-8. Where is the rest of the first integer? The problem is that regular
expressions are inherently greedy. This means that with wildcard patterns,
regular expressions are evaluated in left-to-right order and try to “grab” as
many characters as possible that match the pattern. In the preceding case,
the .+ grabbed every single character from the beginning of the string,
including most of the first integer field that we wanted. The \d+ needed only
a single digit, so it got “4,” whereas the .+ matched everything from the
beginning of the string up to that first digit: “Thu Feb 15 17:46:04
2007::[email protected]::117159036,” as indicated in Figure 1–2.
Figure 1-2 Why our match went awry: + is a greedy operator. (The .+ spans “Thu Feb 15 17:46:04 2007::[email protected]::117159036”; \d+-\d+-\d+ is left with “4-6-8”.)
One solution is to use the “don’t be greedy” operator: ?. You can use this
operator after *, +, or ?. It directs the regular expression engine to match as
few characters as possible. So if we place a ? after the .+, we obtain the
desired result, as illustrated in Figure 1–3.
>>> patt = '.+?(\d+-\d+-\d+)'
>>> re.match(patt, data).group(1)      # subgroup 1
'1171590364-6-8'
Figure 1-3 Solving the greedy problem: ? requests non-greediness. (The .+? stops just before the integer field, so \d+-\d+-\d+ matches all of “1171590364-6-8”.)
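The two behaviors can be compared directly. In this sketch the e-mail address in the sample line is a placeholder (user@example.com), because the original address is not legible here; the rest of the line matches the sample used above.

```python
import re

# Hypothetical sample line; only the e-mail address is a placeholder.
data = 'Thu Feb 15 17:46:04 2007::user@example.com::1171590364-6-8'

# Greedy: .+ takes as much as it can and gives back only what is needed.
greedy = re.match(r'.+(\d+-\d+-\d+)', data).group(1)

# Non-greedy: .+? takes as little as it can before the group matches.
lazy = re.match(r'.+?(\d+-\d+-\d+)', data).group(1)

print(greedy)
print(lazy)
```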
Another solution, which is actually easier, is to recognize that “::” is our
field separator. You can then just use the regular string split('::')
method to get all the parts, and then take another split on the dash with
split('-') to obtain the three integers you were originally seeking. Now,
we did not choose this solution first because this is how we put the strings
together to begin with using gendata.py!
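The string-method approach just described might look like the following sketch. As before, the e-mail address is a placeholder, so its lengths do not correspond to the 6 and 8 in the integer field.

```python
# Hypothetical sample line; only the e-mail address is a placeholder.
data = 'Thu Feb 15 17:46:04 2007::user@example.com::1171590364-6-8'

# First split on the double-colon field separator...
timestamp, email, numbers = data.split('::')

# ...then split the last field on the dash and convert to integers.
seconds, llen, dlen = (int(n) for n in numbers.split('-'))

print(timestamp, email, seconds, llen, dlen)
```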
One final example: suppose that we want to pull out only the middle
integer of the three-integer field. Here is how we would do it (using a
search so that we don’t have to match the entire string): -(\d+)-. Trying
out this pattern, we get:
>>> patt = '-(\d+)-'
>>> m = re.search(patt, data)
>>> m.group()                  # entire match
'-6-'
>>> m.group(1)                 # subgroup 1
'6'
We barely touched upon the power of regular expressions, and in this
limited space we have not been able to do them justice. However, we hope
that we have given an informative introduction so that you can add this
powerful tool to your programming skills. We suggest that you refer to the
documentation for more details on how to use regexes with Python. For a
more complete immersion into the world of regular expressions, we recommend Mastering Regular Expressions by Jeffrey E. F. Friedl.
1.6 Exercises
Regular Expressions. Create regular expressions in Exercises 1-1 to 1-12 that:
1-1. Recognize the following strings: “bat,” “bit,” “but,” “hat,”
“hit,” or “hut.”
1-2. Match any pair of words separated by a single space, that is,
first and last names.
1-3. Match any word and single letter separated by a comma and
single space, as in last name, first initial.
1-4. Match the set of all valid Python identifiers.
1-5. Match a street address according to your local format (keep
your regex general enough to match any number of street
words, including the type designation). For example, American
street addresses use the format: 1180 Bordeaux Drive. Make
your regex flexible enough to support multi-word street
names such as: 3120 De la Cruz Boulevard.
1-6. Match simple Web domain names that begin with “www.”
and end with a “.com” suffix; for example, www.yahoo.com.
Extra Credit: If your regex also supports other high-level
domain names, such as .edu, .net, etc. (for example,
www.foothill.edu).
1-7. Match the set of the string representations of all Python
integers.
1-8. Match the set of the string representations of all Python longs.
1-9. Match the set of the string representations of all Python floats.
1-10. Match the set of the string representations of all Python complex numbers.
1-11. Match the set of all valid e-mail addresses (start with a loose
regex, and then try to tighten it as much as you can, yet
maintain correct functionality).
1-12. Match the set of all valid Web site addresses (URLs) (start
with a loose regex, and then try to tighten it as much as you
can, yet maintain correct functionality).
1-13. type(). The type() built-in function returns a type object,
which is displayed as the following Pythonic-looking string:
>>> type(0)
<type 'int'>
>>> type(.34)
<type 'float'>
>>> type(dir)
<type 'builtin_function_or_method'>
Create a regex that would extract the actual type name from
the string. Your function should take a string like this <type
'int'> and return int. (Ditto for all other types, such as
‘float’, ‘builtin_function_or_method’, etc.) Note: You
are implementing the value that is stored in the __name__
attribute for classes and some built-in types.
1-14. Processing Dates. In Section 1.2, we gave you the regex pattern
that matched the single or double-digit string representations of
the months January to September (0?[1-9]). Create the regex
that represents the remaining three months in the standard
calendar.
1-15. Processing Credit Card Numbers. Also in Section 1.2, we gave
you the regex pattern that matched credit card (CC) numbers
([0-9]{15,16}). However, this pattern does not allow for
hyphens separating blocks of numbers. Create the regex that
allows hyphens, but only in the correct locations. For example, 15-digit CC numbers have a pattern of 4-6-5, indicating
four digits-hyphen-six digits-hyphen-five digits; and 16-digit
CC numbers have a 4-4-4-4 pattern. Remember to “balloon”
the size of the entire string correctly. Extra Credit: There is a
standard algorithm for determining whether a CC number is
valid. Write some code that not only recognizes a correctly
formatted CC number, but also a valid one.
Playing with gendata.py. The next set of Exercises (1-16 through 1-27) deals
specifically with the data that is generated by gendata.py. Before approaching Exercises 1-17 and 1-18, you might want to do 1-16 and all the regular
expressions first.
1-16. Update the code for gendata.py so that the data is written
directly to redata.txt rather than output to the screen.
1-17. Determine how many times each day of the week shows up
for any incarnation of redata.txt. (Alternatively, you can
also count how many times each month of the year was
chosen.)
1-18. Ensure that there is no data corruption in redata.txt by confirming that the first integer of the integer field matches the
timestamp given at the beginning of each output line.
Create Regular Expressions That:
1-19. Extract the complete timestamps from each line.
1-20. Extract the complete e-mail address from each line.
1-21. Extract only the months from the timestamps.
1-22. Extract only the years from the timestamps.
1-23. Extract only the time (HH:MM:SS) from the timestamps.
1-24. Extract only the login and domain names (both the main
domain name and the high-level domain together) from the
e-mail address.
1-25. Extract only the login and domain names (both the main
domain name and the high-level domain) from the e-mail
address.
1-26. Replace the e-mail address from each line of data with your
e-mail address.
1-27. Extract the months, days, and years from the timestamps and
output them in “Mon, Day, Year” format, iterating over each
line only once.
Processing Telephone Numbers. For Exercises 1-28 and 1-29, recall the regular
expression introduced in Section 1.2, which matched telephone numbers
but allowed for an optional area code prefix: \d{3}-\d{3}-\d{4}. Update
this regular expression so that:
1-28. Area codes (the first set of three-digits and the accompanying hyphen) are optional, that is, your regex should match
both 800-555-1212 as well as just 555-1212.
1-29. Either parenthesized or hyphenated area codes are supported, not to mention optional; make your regex match
800-555-1212, 555-1212, and also (800) 555-1212.
Regex Utilities. The final set of exercises make useful utility scripts when
processing online data:
1-30. HTML Generation. Given a list of links (and optional short
description), whether user-provided on command-line, via
input from another script, or from a database, generate a
Web page (.html) that includes all links as hypertext anchors,
which upon viewing in a Web browser, allows users to click
those links and visit the corresponding site. If the short
description is provided, use that as the hypertext instead of
the URL.
1-31. Tweet Scrub. Sometimes all you want to see is the plain text of
a tweet as posted to the Twitter service by users. Create a
function that takes a tweet and an optional “meta” flag
defaulted False, and then returns a string of the scrubbed
tweet, removing all the extraneous information, such as an
“RT” notation for “retweet”, a leading ., and all “#hashtags”.
If the meta flag is True, then also return a dict containing the
metadata. This can include a key “RT,” whose value is a
tuple of strings of users who retweeted the message, and/or
a key “hashtags” with a tuple of the hashtags. If the values
don’t exist (empty tuples), then don’t even bother creating a
key-value entry for them.
1-32. Amazon Screenscraper. Create a script that helps you to keep
track of your favorite books and how they’re doing on Amazon
(or any other online bookseller that tracks book rankings).
For example, the Amazon link for any book is of the format,
http://amazon.com/dp/ISBN (for example, http://amazon.com/
dp/0132678209). You can then change the domain name to
check out the equivalent rankings on Amazon sites in other
countries, such as Germany (.de), France (.fr), Japan (.jp),
China (.cn), and the UK (.co.uk). Use regular expressions or a
markup parser, such as BeautifulSoup, lxml, or html5lib to
parse the ranking, and then let the user pass in a command-line argument that specifies whether the output should be in
plain text, perhaps for inclusion in an e-mail body, or formatted in HTML for Web consumption.
CHAPTER 2
Network Programming
So, IPv6. You all know that we are almost out of IPv4 address space. I
am a little embarrassed about that because I was the guy who decided
that 32-bit was enough for the Internet experiment. My only defense
is that that choice was made in 1977, and I thought it was an
experiment. The problem is the experiment didn't end, so here we are.
—Vint Cerf, January 2011 [1]
(verbally at linux.conf.au conference)
In this chapter...
• Introduction
• What Is Client/Server Architecture?
• Sockets: Communication Endpoints
• Network Programming in Python
• *The SocketServer Module
• *Introduction to the Twisted Framework
• Related Modules
1. Dates back to 2004 via http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume39/MusingsontheInternetPart2/157899
2.1 Introduction
In this section, we will take a brief look at network programming using
sockets. But before we delve into that, we will present some background
information on network programming, how sockets apply to Python, and
then show you how to use some of Python’s modules to build networked
applications.
2.2 What Is Client/Server Architecture?
What is client/server architecture? It means different things to different people, depending on whom you ask as well as whether you are describing a
software or a hardware system. In either case, the premise is simple: the
server—a piece of hardware or software—provides a “service” that is
needed by one or more clients (users of the service). Its sole purpose is
to wait for (client) requests, respond to those clients (provide
the service), and then wait for more requests.
Clients, on the other hand, contact a server for a particular request, send
over any necessary data, and then wait for the server to reply, either completing the request or indicating the cause of failure. The server runs indefinitely, continually processing requests; clients make a one-time request for
service, receive that service, and thus conclude their transaction. A client
might make additional requests at some later time, but these are considered separate transactions.
The most common notion of the client/server architecture today is illustrated
in Figure 2-1, which depicts a user or client computer retrieving information
from a server across the Internet. Although such a system is indeed an example
of a client/server architecture, it isn’t the only one. Furthermore, client/server
architecture can be applied to computer hardware as well as software.
Figure 2-1 Typical conception of a client/server system on the Internet. (Diagram: a Client and a Server connected across the Internet.)
2.2.1 Hardware Client/Server Architecture
Print(er) servers are examples of hardware servers. They process incoming
print jobs and send them to a printer (or some other printing device)
attached to such a system. Such a computer is generally network-accessible
and client computers would send it print requests.
Another example of a hardware server is a file server. These are typically
computers with large, generalized storage capacity, which is remotely
accessible to clients. Client computers mount the disks from the server
computer as if the disk itself were on the local computer. One of the most
popular network file systems used by file servers is Sun Microsystems’ Network File System (NFS). If you are accessing a networked disk
drive and cannot tell whether it is local or on the network, then the client/
server system has done its job. The goal is for the user experience to be
exactly the same as that of a local disk—the abstraction is normal disk
access. It is up to the programmed implementation to make it behave in
such a manner.
2.2.2 Software Client/Server Architecture
Software servers also run on a piece of hardware but do not have dedicated peripheral devices as hardware servers do (e.g., printers and disk
drives). The primary services provided by software servers include program
execution, data transfer, retrieval, aggregation, update, or other types of
programmed or data manipulation.
One of the more common software servers today is the Web server. Individuals or companies desiring to run their own Web server will get one or
more computers, install the Web pages and/or Web applications they wish
to provide to users, and then start the Web server. The job of such a server
is to accept client requests, send back Web pages to (Web) clients, that is,
browsers on users’ computers, and then wait for the next client request.
These servers are started with the expectation of running forever.
Although they do not achieve that goal, they go for as long as possible
unless stopped by some external force such as being shut down, either
explicitly or catastrophically (due to hardware failure).
Database servers are another kind of software server. They take client
requests for either storage or retrieval, act upon that request, and then wait
for more business. They are also designed to run forever.
The last type of software server we will discuss is the windows server.
These servers can almost be considered hardware servers. They run on a
computer with an attached display, such as a monitor of some sort. Windows
clients are actually programs that require a windowing environment in
which to execute. These are generally considered graphical user interface
(GUI) applications. If they are executed without a window server, meaning, in
a text-based environment such as a DOS window or a Unix shell, they are
unable to start. Once a windows server is accessible, then things are fine.
Such an environment becomes even more interesting when networking
comes into play. The usual display for a windows client is the server on
the local computer, but it is possible in some networked windowing environments, such as the X Window system, to choose another computer’s
window server as a display. In such situations, you can be running a GUI
program on one computer, but have it displayed at another!
2.2.3 Bank Tellers as Servers?
One way to imagine how client/server architecture works is to create in your
mind the image of a bank teller who neither eats, sleeps, nor rests, serving
one customer after another in a line that never seems to end (see Figure 2-2).
The line might be long or it might be empty on occasion, but at any given
moment, a customer might show up. Of course, such a teller was fantasy
years ago, but automated teller machines (ATMs) seem to come close to
such a model now.
The teller is, of course, the server that runs in an infinite loop. Each customer is a client with a need that must be addressed. Customers arrive
and are handled by the teller in a first-come-first-served manner. Once a
transaction has been completed, the client goes away while the server
either serves the next customer or sits and waits until one comes along.
Why is all this important? The reason is that this style of execution is
how client/server architecture works in a general sense. Now that you
have the basic idea, let’s adapt it to network programming, which follows
the software client/server architecture model.
2.2.4 Client/Server Network Programming
Before a server can respond to client requests, some preliminary setup
procedures must be performed to prepare it for the work that lies ahead. A
communication endpoint is created which allows a server to listen for
requests. One can liken our server to a company receptionist or switchboard operator who answers calls on the main corporate line. Once the
phone number and equipment are installed and the operator arrives,
the service can begin.
Figure 2-2 The bank teller in this diagram works “forever” serving client requests. The teller
runs in an infinite loop receiving requests, servicing them, and then going back to serve or wait
for another client. There might be a long line of clients, or there might be none at all, but in either
case, a server’s work is never done.
This process is the same in the networked world—once a communication endpoint has been established, our listening server can now enter its
infinite loop, waiting for clients to connect, and responding to requests. Of
course, to keep our corporate phone receptionist busy, we must not forget
to put that phone number on company letterhead, in advertisements, or
some sort of press release; otherwise, no one will ever call!
Similarly, potential clients must be made aware that this server exists to
handle their needs—otherwise, the server will never get a single request.
Imagine creating a brand new Web site. It might be the most super-duper,
awesome, amazing, useful, and coolest Web site of all, but if the Web
address or URL is never broadcast or advertised in any way, no one will
ever know about it, and it will never see any visitors.
Now you have a good idea as to how the server works. You have made
it past the difficult part. The client-side stuff is much simpler than
that on the server side. All the client has to do is to create its single communication endpoint, and then establish a connection to the server. The
client can now make a request, which includes any necessary exchange of
data. Once the request has been processed and the client has received the
result or some sort of acknowledgement, communication is terminated.
2.3 Sockets: Communication Endpoints
In this subsection, you’ll be introduced to sockets, get some background
on their origins, learn about the various types of sockets, and finally, see how
they’re used to allow processes running on different (or the same) computers to communicate with each other.
2.3.1 What Are Sockets?
Sockets are computer networking data structures that embody the concept
of the “communication endpoint,” described in the previous section. Networked applications must create sockets before any type of communication can commence. They can be likened to telephone jacks, without
which, engaging in communication is impossible.
Sockets can trace their origins to the 1970s as part of the University of
California, Berkeley version of Unix, known as BSD Unix. Therefore, you
will sometimes hear these sockets referred to as Berkeley sockets or BSD
sockets. Sockets were originally created for same-host applications where
they would enable one running program (a.k.a. a process) to communicate
with another running program. This is known as interprocess communication,
or IPC. There are two types of sockets: file-based and network-oriented.
Unix sockets are the first family of sockets we are looking at and have a
“family name” of AF_UNIX (a.k.a. AF_LOCAL, as specified in the
POSIX.1g standard), which stands for address family: UNIX. Most popular
platforms, including Python, use the term address families and the abbreviation AF; other perhaps older systems might refer to address families as
domains or protocol families and use PF rather than AF. Similarly,
AF_LOCAL (standardized in 2000–2001) is supposed to replace AF_UNIX;
however, for backward-compatibility, many systems use both and just
make them aliases to the same constant. Python itself still uses AF_UNIX.
Because both processes run on the same computer, these sockets are
file-based, meaning that their underlying infrastructure is supported by
the file system. This makes sense, because the file system is a shared constant between processes running on the same host.
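To see the "file-based" idea in action, here is a minimal sketch (Python 3 syntax, POSIX systems only) in which two AF_UNIX sockets in the same process talk through a socket file. The temporary directory, the file name demo.sock, and the message are our own illustrative choices, not anything required by the socket module.

```python
import os
import socket
import tempfile

# A throwaway location for the socket file (names here are illustrative).
tmp_dir = tempfile.mkdtemp()
sock_path = os.path.join(tmp_dir, 'demo.sock')

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)                 # binding creates the file at sock_path
server.listen(1)
path_exists = os.path.exists(sock_path)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(sock_path)              # the "address" is a file system path
conn, _ = server.accept()

client.sendall(b'ping')
received = conn.recv(1024)

for s in (conn, client, server):
    s.close()
os.unlink(sock_path)                   # the socket file is not removed automatically
os.rmdir(tmp_dir)
```

Notice that no hostname or port number appears anywhere: the path in the shared file system is the only address both sides need.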
The second type of socket is network-based and has its own family name,
AF_INET, or address family: Internet. Another address family, AF_INET6, is
used for Internet Protocol version 6 (IPv6) addressing. There are other
address families, all of which are either specialized, antiquated, seldom
used, or remain unimplemented. Of all address families, AF_INET is now
the most widely used.
Support for a special type of Linux socket was introduced in Python 2.5.
The AF_NETLINK family of (connectionless [see Section 2.3.3]) sockets
allows for IPC between user and kernel-level code using the standard BSD
socket interface. It is seen as a more elegant and less risky solution than previous,
more cumbersome ones, such as adding new system calls, /proc
support, or “IOCTL”s to an operating system.
Another feature (new in version 2.6) for Linux is support for the Transparent Interprocess Communication (TIPC) protocol. TIPC is used to
allow clusters of computers to “talk” to each other without using IP-based
addressing. The Python support for TIPC comes in the form of the
AF_TIPC family.
Overall, Python supports only the AF_UNIX, AF_NETLINK, AF_TIPC,
and AF_INET{,6} families. Because of our focus on network programming,
we will be using AF_INET for most of the remainder of this chapter.
2.3.2 Socket Addresses: Host-Port Pairs
If a socket is like a telephone jack—a piece of infrastructure that enables
communication—then a hostname and port number are like an area code
and telephone number combination. Having the hardware and ability to
communicate doesn’t do any good unless you know to whom and how
to “dial.” An Internet address consists of a hostname and port number pair, which is required for networked communication. It goes without
saying that there should also be someone listening at the other end; otherwise, you get the familiar tones, followed by “I’m sorry, that number is
no longer in service. Please check the number and try your call again.” You
have probably seen one networking analogy during Web surfing, for
example, “Unable to contact server. Server is not responding or is unreachable.”
Valid port numbers range from 0–65535, although those less than 1024
are reserved for the system. If you are using a POSIX-compliant system
(e.g., Linux, Mac OS X, etc.), the list of reserved port numbers (along with
servers/protocols and socket types) is found in the /etc/services file. A
list of well-known port numbers is accessible at this Web site:
http://www.iana.org/assignments/port-numbers
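As a small illustration of a host-port pair, the following sketch (Python 3 syntax) binds a socket to the loopback address and asks the operating system to pick a free port by requesting port 0. The choice of 127.0.0.1 and the variable names are ours, for demonstration only.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # Port 0 asks the operating system to choose any free port for us.
    sock.bind(('127.0.0.1', 0))
    host, port = sock.getsockname()    # the (host, port) pair now in use

print('bound to %s:%d' % (host, port))
```

The port chosen this way will normally be well above 1024, because the low-numbered ports are the reserved ones described above.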
2.3.3 Connection-Oriented Sockets vs. Connectionless
Connection-Oriented Sockets
Regardless of which address family you are using, there are two different
styles of socket connections. The first type is connection-oriented. What
this means is that a connection must be established before communication
can occur, such as calling a friend using the telephone system. This type of
communication is also referred to as a virtual circuit or stream socket.
Connection-oriented communication offers sequenced, reliable, and
unduplicated delivery of data, without record boundaries. That basically
means that each message may be broken up into multiple pieces, which
are all guaranteed to arrive at their destination, put back together and in
order, and delivered to the waiting application.
The primary protocol that implements such connection types is the
Transmission Control Protocol (better known by its acronym, TCP). To create
TCP sockets, one must use SOCK_STREAM as the socket type. The
SOCK_STREAM name for a TCP socket is based on one of its denotations
as stream socket. Because the networked version of these sockets
(AF_INET) uses the Internet Protocol (IP) to find hosts in the network, the
entire system generally goes by the combined names of both protocols
(TCP and IP), or TCP/IP. (Of course, you can also create SOCK_STREAM sockets in the local [non-networked AF_LOCAL/AF_UNIX] family, but there is no IP
usage there.)
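The phrase "without record boundaries" is easier to appreciate with a short experiment. In this sketch (Python 3 syntax, everything on the loopback interface), the sender makes two separate calls, yet the receiver simply sees ten bytes of stream. The helper recv_exactly() is a hypothetical name of our own, not a socket module function.

```python
import socket

def recv_exactly(sock, nbytes):
    """Keep calling recv() until exactly nbytes have arrived."""
    chunks = []
    while nbytes > 0:
        chunk = sock.recv(nbytes)
        if not chunk:                      # peer closed the connection early
            raise EOFError('connection closed before all data arrived')
        chunks.append(chunk)
        nbytes -= len(chunk)
    return b''.join(chunks)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))            # port 0: let the OS pick a free port
listener.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())     # the connection must exist first
conn, _ = listener.accept()

client.sendall(b'hello')                   # two separate sends ...
client.sendall(b'world')
stream = recv_exactly(conn, 10)            # ... read back as one ordered byte stream

for s in (conn, client, listener):
    s.close()
```

A single recv() call might return both messages together or only part of one, which is why stream-based applications usually loop like this or define their own message framing.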
Connectionless Sockets
In stark contrast to virtual circuits is the datagram type of socket, which is connectionless. This means that no connection is necessary before communication can begin. Here, there are no guarantees of sequencing, reliability, or nonduplication in the process of data delivery. Datagrams do preserve record
boundaries, however, meaning that entire messages are sent rather than being
broken into pieces first, such as with connection-oriented protocols.
Message delivery using datagrams can be compared to the postal service. Letters and packages might not arrive in the order they were sent. In
fact, they might not arrive at all! To add to the complication, in the land of
networking, duplication of messages is even possible.
So with all this negativity, why use datagrams at all? (There must be
some advantage over using stream sockets.) Because of the guarantees
provided by connection-oriented sockets, a good amount of overhead is
required for their setup as well as in maintaining the virtual circuit connection. Datagrams do not have this overhead and thus are “less expensive.” They usually provide better performance and might be suitable for
some types of applications.
The primary protocol that implements such connection types is the User
Datagram Protocol (better known by its acronym, UDP). To create UDP
sockets, we must use SOCK_DGRAM as the socket type. The SOCK_DGRAM
name for a UDP socket, as you can probably tell, comes from the
word “datagram.” Because these sockets also use the Internet Protocol to
find hosts in the network, this system also has a more general name, going
by the combined names of both of these protocols (UDP and IP), or UDP/IP.
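For comparison with stream sockets, here is a minimal datagram sketch (Python 3 syntax, loopback only). No connection is set up, and each recvfrom() call hands back one whole message. The two-second timeout and the variable names are our own assumptions; on a real network either datagram could be lost or reordered, which is why the receiver refuses to wait forever.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(('127.0.0.1', 0))            # port 0: let the OS pick a free port
receiver.settimeout(2.0)                   # do not wait forever for a lost datagram

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b'hello', receiver.getsockname())   # no connect(), no accept()
sender.sendto(b'world', receiver.getsockname())

# Each recvfrom() returns one whole datagram plus the sender's address.
first, sender_addr = receiver.recvfrom(1024)
second, _ = receiver.recvfrom(1024)

sender.close()
receiver.close()
```

Unlike the stream case, the two messages are never merged: record boundaries survive, even though ordering and delivery are not guaranteed.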
2.4 Network Programming in Python
Now that you know all about client/server architecture, sockets, and networking, let’s try to bring these concepts to Python. The primary module
we will be using in this section is the socket module. Found within this
module is the socket() function, which is used to create socket objects.
Sockets also have their own set of methods, which enable socket-based
network communication.
2.4.1 socket() Module Function
To create a socket, you must use the socket.socket() function, which has
the general syntax:
socket(socket_family, socket_type, protocol=0)
The socket_family is either AF_UNIX or AF_INET, as explained earlier, and the socket_type is either SOCK_STREAM or SOCK_DGRAM,
also explained earlier. The protocol is usually left out, defaulting to 0.
So to create a TCP/IP socket, you call socket.socket() like this:
tcpSock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Likewise, to create a UDP/IP socket you perform:
udpSock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
Because there are numerous socket module attributes, this is one of the
exceptions where using from module import * is somewhat acceptable. If
we applied from socket import *, we bring the socket attributes into our
namespace, but our code is shortened considerably, as demonstrated in
the following:
tcpSock = socket(AF_INET, SOCK_STREAM)
Once we have a socket object, all further interaction will occur using
that socket object’s methods.
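If you would rather not use from socket import *, the fully qualified form costs only a few extra characters. The sketch below (Python 3 syntax; the variable names are ours) creates one socket of each type and reads back the data attributes that record how each was created.

```python
import socket

# The with statement closes each socket automatically when the block ends.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp_sock, \
     socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp_sock:
    tcp_info = (tcp_sock.family, tcp_sock.type)
    udp_info = (udp_sock.family, udp_sock.type)

print(tcp_info)
print(udp_info)
```

Which import style to use is a matter of taste: the qualified names make it obvious where AF_INET and SOCK_STREAM come from, while the star import keeps short scripts compact.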
2.4.2 Socket Object (Built-In) Methods
In Table 2-1, we present a list of the most common socket methods. In the
next subsections, we will create both TCP and UDP clients and servers,
using some of these methods. Although we focus on Internet sockets,
these methods have similar meanings when using local/non-networked
sockets.
Table 2-1 Common Socket Object Methods and Attributes

Name                   Description

Server Socket Methods
s.bind()               Bind address (hostname, port number pair) to socket
s.listen()             Set up and start TCP listener
s.accept()             Passively accept TCP client connection, waiting until connection arrives (blocking)

Client Socket Methods
s.connect()            Actively initiate TCP server connection
s.connect_ex()         Extended version of connect(), where problems are returned as error codes rather than an exception being thrown

General Socket Methods
s.recv()               Receive TCP message
s.recv_into() [a]      Receive TCP message into specified buffer
s.send()               Transmit TCP message
s.sendall()            Transmit TCP message completely
s.recvfrom()           Receive UDP message
s.recvfrom_into() [a]  Receive UDP message into specified buffer
s.sendto()             Transmit UDP message
s.getpeername()        Remote address connected to socket (TCP)
s.getsockname()        Address of current socket
s.getsockopt()         Return value of given socket option
s.setsockopt()         Set value for given socket option
s.shutdown()           Shut down the connection
s.close()              Close socket
s.detach() [b]         Close socket without closing file descriptor, return the latter
s.ioctl() [c]          Control the mode of a socket (Windows only)

Blocking-Oriented Socket Methods
s.setblocking()        Set blocking or non-blocking mode of socket
s.settimeout() [d]     Set timeout for blocking socket operations
s.gettimeout() [d]     Get timeout for blocking socket operations

File-Oriented Socket Methods
s.fileno()             File descriptor of socket
s.makefile()           Create a file object associated with socket

Data Attributes
s.family [a]           The socket family
s.type [a]             The socket type
s.proto [a]            The socket protocol

a. New in Python 2.5.
b. New in Python 3.2.
c. New in Python 2.6; Windows platform only. POSIX systems can use fcntl module
functions.
d. New in Python 2.3.
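To make a couple of the entries in Table 2-1 more concrete, the following sketch (Python 3 syntax) exercises settimeout(), gettimeout(), and connect_ex(). It deliberately tries to connect to a loopback port on which nothing should be listening. The trick of binding and then releasing a port to find such an address, and the two-second timeout, are our own assumptions for the demonstration.

```python
import errno
import socket

# Find a port that is very likely closed: bind to a free one, note it, release it.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(('127.0.0.1', 0))
closed_addr = probe.getsockname()
probe.close()

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2.0)                   # blocking calls now give up after 2 seconds
timeout = sock.gettimeout()

# connect_ex() reports failure as an errno value instead of raising an exception.
code = sock.connect_ex(closed_addr)
reason = errno.errorcode.get(code, 'unknown')   # symbolic name of the error code
sock.close()

print(timeout, code, reason)
```

With plain connect(), the same failure would surface as an exception that you would need to catch; connect_ex() is convenient when a failed connection is an expected outcome, such as when probing whether a service is up.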
CORE TIP: Install clients and servers on different computers to run
networked applications
In our multitude of examples in this chapter, you will often see code and output referring to host “localhost” or see an IP address of 127.0.0.1. Our examples
are running the client(s) and server(s) on the same computer. We encourage the
reader to change the hostnames and copy the code to different computers as it
is much more fun developing and playing around with code that lets computers talk to one another across the network, and to see network programs that
really do work!
2.4.3 Creating a TCP Server
We will first present some general pseudocode needed to create a generic
TCP server, followed by a general description of what is going on. Keep in
mind that this is only one way of designing your server. Once you become
comfortable with server design, you will be able to modify the following
pseudocode to operate the way you want it to:
ss = socket()                   # create server socket
ss.bind()                       # bind socket to address
ss.listen()                     # listen for connections
inf_loop:                       # server infinite loop
    cs = ss.accept()            # accept client connection
    comm_loop:                  # communication loop
        cs.recv()/cs.send()     # dialog (receive/send)
    cs.close()                  # close client socket
ss.close()                      # close server socket (opt)
All sockets are created by using the socket.socket() function. Servers
need to “sit on a port” and wait for requests, so they all must bind to a local
address. Because TCP is a connection-oriented communication system,
some infrastructure must be set up before a TCP server can begin operation. In particular, TCP servers must “listen” for (incoming) connections.
Once this setup process is complete, a server can start its infinite loop.
A simple (single-threaded) server will then sit on an accept() call, waiting for a connection. By default, accept() is blocking, meaning that execution is suspended until a connection arrives. Sockets do support a nonblocking mode; refer to the documentation or operating systems textbooks
for more details on why and how you would use non-blocking sockets.
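As a brief taste of non-blocking mode, this sketch (Python 3 syntax; the loopback address and names are ours) puts a listening socket into non-blocking mode and calls accept() while no client is waiting. Instead of suspending execution, the call fails immediately with BlockingIOError.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))        # port 0: let the OS pick a free port
listener.listen(5)
listener.setblocking(False)            # accept() now returns immediately

try:
    listener.accept()                  # nobody has connected yet
    outcome = 'accepted'
except BlockingIOError:
    outcome = 'would block'            # free to do other work and retry later

listener.close()
print(outcome)
```

A real program would not spin in a retry loop; it would typically ask the operating system which sockets are ready, for example through the select module, before calling accept() again.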
Once a connection is accepted, a separate client socket is returned (by
accept()) for the upcoming message interchange. Using the new client
socket is similar to handing off a customer call to a service representative.
When a client eventually does come in, the main switchboard operator
takes the incoming call and patches it through, using another line to connect to the appropriate person to handle the client’s needs.
This frees up the main line (the original server socket) so that the operator can resume waiting for new calls (client requests) while the customer
and the service representative they are connected to carry on their own conversation. Likewise, when an incoming request arrives, a new socket is created to converse directly with that client, again, leaving the
main socket free to accept new client connections.
CORE TIP: Spawning threads to handle client requests
We do not implement this in our examples, but it is also fairly common to hand
off a client request to a new thread or process to complete the client processing.
The SocketServer module, a high-level socket communication module written
on top of socket, supports both threaded and spawned process handling of
client requests. Refer to the documentation to obtain more information about the
SocketServer module as well as the exercises in Chapter 4, “Multithreaded
Programming.”
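For the curious, here is one possible shape of the thread-per-client idea, written directly on top of socket rather than SocketServer. It is only a sketch under our own assumptions (Python 3.5 or newer, loopback only, a simple echo service, and the hypothetical helper names handle_client() and accept_loop()); it stops after a fixed number of clients so that the demonstration terminates.

```python
import socket
import threading

def handle_client(client_sock):
    """Echo data back until the client closes its end of the connection."""
    with client_sock:
        while True:
            data = client_sock.recv(1024)
            if not data:
                break
            client_sock.sendall(data)

def accept_loop(server_sock, max_clients):
    """Accept max_clients connections, handing each to its own thread."""
    workers = []
    for _ in range(max_clients):
        client_sock, _addr = server_sock.accept()
        worker = threading.Thread(target=handle_client, args=(client_sock,))
        worker.start()                 # the loop goes straight back to accept()
        workers.append(worker)
    for worker in workers:
        worker.join()

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(('127.0.0.1', 0))
server_sock.listen(5)
server_addr = server_sock.getsockname()
acceptor = threading.Thread(target=accept_loop, args=(server_sock, 2))
acceptor.start()

# Two clients hold their connections open at the same time.
clients = [socket.create_connection(server_addr, timeout=5) for _ in range(2)]
replies = []
for number, client in enumerate(clients):
    client.sendall(b'client %d' % number)
    replies.append(client.recv(1024))
for client in clients:
    client.close()

acceptor.join()
server_sock.close()
```

A single-threaded server would leave the second client waiting until the first one disconnected; here each conversation proceeds independently of the others.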
Once the temporary socket is created, communication can commence,
and both client and server proceed to engage in a dialog of sending and
receiving, using this new socket until the connection is terminated. This
usually happens when one of the parties either closes its connection or
sends an empty string to its counterpart.
In our code, after a client connection is closed, the server goes back to
wait for another client connection. The final line of code, in which we close
the server socket, is optional. It is never encountered because the server is
supposed to run in an infinite loop. We leave this code in our example as a
reminder to the reader that calling the close() method is recommended
when implementing an intelligent exit scheme for the server—for example,
when a handler detects some external condition whereby the server
should be shut down. In those cases, a close() method call is warranted.
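One way such an exit scheme might look is sketched below in Python 3 syntax. The function serve_until_stopped(), the threading.Event used as the "external condition," and the polling interval are all hypothetical choices of ours and are not part of the examples in this chapter. The point is simply that the finally clause guarantees the close() call once the loop ends.

```python
import socket
import threading

def serve_until_stopped(server_sock, stop_event, poll_seconds=0.2):
    """Echo one message per client until stop_event is set, then close up."""
    server_sock.settimeout(poll_seconds)   # wake up regularly to check the flag
    try:
        while not stop_event.is_set():
            try:
                client_sock, _addr = server_sock.accept()
            except socket.timeout:
                continue                   # no client yet; re-check the flag
            with client_sock:
                data = client_sock.recv(1024)
                if data:
                    client_sock.sendall(data)
    finally:
        server_sock.close()                # reached however the loop ends

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(('127.0.0.1', 0))
server_sock.listen(5)
server_addr = server_sock.getsockname()

stop_event = threading.Event()
server_thread = threading.Thread(target=serve_until_stopped,
                                 args=(server_sock, stop_event))
server_thread.start()

with socket.create_connection(server_addr, timeout=5) as client:
    client.sendall(b'hello')
    reply = client.recv(1024)

stop_event.set()                           # the "external condition"
server_thread.join()
server_closed = server_sock.fileno() == -1
```

The same structure works if the stop signal comes from somewhere else, such as a KeyboardInterrupt handler or an administrative command sent by a client.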
In Example 2-1, we present tsTserv.py, a TCP server program that takes
the data string sent from a client and returns it timestamped (format:
[timestamp]data) back to the client. (“tsTserv” stands for timestamp TCP
server. The other files are named in a similar manner.)
Example 2-1 TCP Timestamp Server (tsTserv.py)
This script creates a TCP server that accepts messages from clients and returns
them with a timestamp prefix.
 1    #!/usr/bin/env python
 2
 3    from socket import *
 4    from time import ctime
 5
 6    HOST = ''
 7    PORT = 21567
 8    BUFSIZ = 1024
 9    ADDR = (HOST, PORT)
10
11    tcpSerSock = socket(AF_INET, SOCK_STREAM)
12    tcpSerSock.bind(ADDR)
13    tcpSerSock.listen(5)
14
15    while True:
16        print 'waiting for connection...'
17        tcpCliSock, addr = tcpSerSock.accept()
18        print '...connected from:', addr
19
20        while True:
21            data = tcpCliSock.recv(BUFSIZ)
22            if not data:
23                break
24            tcpCliSock.send('[%s] %s' % (
25                ctime(), data))
26
27        tcpCliSock.close()
28    tcpSerSock.close()
Line-by-Line Explanation
Lines 1–4
After the Unix start-up line, we import time.ctime() and all the attributes
from the socket module.
Lines 6–13
The HOST variable is blank, which is an indication to the bind() method
that it can use any available address. We also choose a random port number, which does not appear to be used or reserved by the system. For our
application, we set the buffer size to 1K. You can vary this size based on
your networking capability and application needs. The argument for the
listen() method is simply the maximum number of incoming connection
requests to queue before further connections are turned away or refused.
The TCP server socket (tcpSerSock) is allocated on line 11, followed by the
calls to bind the socket to the server’s address and to start the TCP listener.
Lines 15–28
Once we are inside the server’s infinite loop, we (passively) wait for a connection. When one comes in, we enter the dialog loop where we wait for
the client to send its message. If the message is blank, that means that the
client has quit, so we would break from the dialog loop, close the client
connection, and then go back to wait for another client. If we did get a
message from the client, we format and return the same data but prepend
it with the current timestamp. The final line is never executed; it is there as
a reminder to the reader that a close() call should be made if a handler is
written to allow for a more graceful exit, as we discussed earlier.
Now let’s take a look at the Python 3 version (tsTserv3.py), as shown in
Example 2-2:
Example 2-2
Python 3 TCP Timestamp Server (tsTserv3.py)
This script creates a TCP server that accepts messages from clients and returns
them with a timestamp prefix.

    #!/usr/bin/env python

    from socket import *
    from time import ctime

    HOST = ''
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    tcpSerSock = socket(AF_INET, SOCK_STREAM)
    tcpSerSock.bind(ADDR)
    tcpSerSock.listen(5)

    while True:
        print('waiting for connection...')
        tcpCliSock, addr = tcpSerSock.accept()
        print('...connected from:', addr)

        while True:
            data = tcpCliSock.recv(BUFSIZ)
            if not data:
                break
            tcpCliSock.send(b'[%s] %s' % (
                bytes(ctime(), 'utf-8'), data))

        tcpCliSock.close()
    tcpSerSock.close()

The relevant changes are in lines 16, 18, 24, and 25, wherein print
becomes a function, and we also transmit the strings as an ASCII
bytes “string” rather than in Unicode. Note that the format string on
line 24 must itself be a bytes literal, because send() accepts only bytes
in Python 3; formatting bytes with % requires Python 3.5 or newer. Later
in this book, we'll discuss
Python 2-to-Python 3 migration and how it’s also possible to write code
that runs unmodified by either version 2.x or 3.x interpreters.
Another pair of variations to support IPv6, tsTservV6.py and
tsTserv3V6.py, are not shown here, but you would only need to change
the address family from AF_INET (IPv4) to AF_INET6 (IPv6) when creating
the socket. (In case you’re not familiar with these terms, IPv4 describes the
current Internet Protocol. The next generation is version 6, hence “IPv6.”)
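To make that one-line difference concrete, here is a minimal Python 3 sketch in which the address family is simply a parameter. The helper name make_listener() and the use of port 0 in the demonstration (which asks the operating system to pick a free port) are our own illustrative assumptions and do not come from the tsTserv scripts.

```python
from socket import socket, AF_INET, AF_INET6, SOCK_STREAM

def make_listener(family=AF_INET, host='', port=21567, backlog=5):
    """Create, bind, and start a TCP listener for the given address family.

    Hypothetical helper: `family` is the only thing that differs between
    the IPv4 (AF_INET) and IPv6 (AF_INET6) variants of the server.
    """
    sock = socket(family, SOCK_STREAM)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock

# IPv4 listener on the loopback address; port 0 lets the OS choose a port.
ipv4 = make_listener(AF_INET, '127.0.0.1', 0)
print('IPv4 listener on', ipv4.getsockname())
ipv4.close()

# The IPv6 variant changes only the family (and the loopback address).
try:
    ipv6 = make_listener(AF_INET6, '::1', 0)
    print('IPv6 listener on', ipv6.getsockname())
    ipv6.close()
except OSError as exc:
    # Some hosts have no IPv6 loopback configured.
    print('IPv6 not available here:', exc)
```

The rest of the server loop (accept, recv, send) would be the same for both families.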
2.4.4
Creating a TCP Client
Creating a client is much simpler than a server. Similar to our description
of the TCP server, we will present the pseudocode with explanations first,
then show you the real thing.
    cs = socket()                # create client socket
    cs.connect()                 # attempt server connection
    comm_loop:                   # communication loop
        cs.send()/cs.recv()      # dialog (send/receive)
    cs.close()                   # close client socket
As we noted earlier, all sockets are created by using socket.socket().
Once a client has a socket, however, it can immediately make a connection
to a server by using the socket’s connect() method. When the connection
has been established, it can participate in a dialog with the server. Once
the client has completed its transaction, it can close its socket, terminating
the connection.
We present the code for tsTclnt.py in Example 2-3. This script connects to the server and prompts the user for line after line of data. The
server returns this data timestamped, which is presented to the user by
the client code.
Example 2-3
TCP Timestamp Client (tsTclnt.py)
This script creates a TCP client that prompts the user for messages to send to the
server, receives them back from the server with a timestamp prefix, and then
displays the results to the user.

    #!/usr/bin/env python

    from socket import *

    HOST = 'localhost'
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    tcpCliSock = socket(AF_INET, SOCK_STREAM)
    tcpCliSock.connect(ADDR)

    while True:
        data = raw_input('> ')
        if not data:
            break
        tcpCliSock.send(data)
        data = tcpCliSock.recv(BUFSIZ)
        if not data:
            break
        print data

    tcpCliSock.close()

Line-by-Line Explanation
Lines 1–3
After the Unix startup line, we import all the attributes from the socket
module.
Lines 5–11
The HOST and PORT variables refer to the server’s hostname and port number. Because we are running our test (in this case) on the same computer,
HOST contains the local hostname (change it accordingly if you are running
your server on a different host). The port number PORT should be exactly
the same as what you set for your server (otherwise, there won’t be much
communication). We also choose the same 1K buffer size.
The TCP client socket (tcpCliSock) is allocated in line 10, followed by
(an active) call to connect to the server.
Lines 13–23
The client also has an infinite loop, but it is not meant to run forever like
the server’s loop. The client loop will exit on either of two conditions: the
user enters no input (lines 14–16), or the server somehow quit and our call
to the recv() method fails (lines 18–20). Otherwise, in a normal situation,
the user enters in some string data, which is sent to the server for processing. The newly timestamped input string is then received and displayed to
the screen.
Similar to what we did for the server, let’s take a look at the Python 3
and IPv6 versions of the client (tsTclnt3.py), starting with the former as
shown in Example 2-4:
Example 2-4
Python 3 TCP Timestamp Client (tsTclnt3.py)
This is the Python 3 equivalent to tsTclnt.py.

    #!/usr/bin/env python

    from socket import *

    HOST = '127.0.0.1' # or 'localhost'
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    tcpCliSock = socket(AF_INET, SOCK_STREAM)
    tcpCliSock.connect(ADDR)

    while True:
        data = input('> ')
        if not data:
            break
        tcpCliSock.send(data.encode('utf-8'))
        data = tcpCliSock.recv(BUFSIZ)
        if not data:
            break
        print(data.decode('utf-8'))

    tcpCliSock.close()

In addition to changing print to a function and raw_input() to input(), we
also have to encode the string we send to the server as bytes (line 17) and
decode the bytes that come back from it (line 21). (With the help of
distutils.log.warn(), it
would be simple to convert the original script to run under both Python 2
and 3, just like rewhoU.py from Chapter 1, “Regular Expressions.”) Finally,
let’s take a look at the (Python 2) IPv6 version (tsTclntV6.py), as shown in
Example 2-5.
Example 2-5
IPv6 TCP Timestamp Client (tsTclntV6.py)
This is the IPv6 version of the TCP client from the previous two examples.

    #!/usr/bin/env python

    from socket import *

    HOST = '::1'
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    tcpCliSock = socket(AF_INET6, SOCK_STREAM)
    tcpCliSock.connect(ADDR)

    while True:
        data = raw_input('> ')
        if not data:
            break
        tcpCliSock.send(data)
        data = tcpCliSock.recv(BUFSIZ)
        if not data:
            break
        print data

    tcpCliSock.close()

In this snippet, we needed to change the localhost to its IPv6 address of
“::1” as well as request the AF_INET6 family of sockets. If you combine the
changes from tsTclnt3.py and tsTclntV6.py, you should also be able to
arrive at an IPv6 Python 3 version of the TCP client.
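As a rough illustration of combining the two sets of changes, the following Python 3 sketch wraps one request/reply exchange in a function that takes the address family as an argument. The function name ask_timestamp() and its parameters are hypothetical; they are not part of the tsTclnt scripts, and the sketch sends a single message instead of prompting in a loop.

```python
from socket import socket, AF_INET, AF_INET6, SOCK_STREAM

BUFSIZ = 1024

def ask_timestamp(message, host='localhost', port=21567, family=AF_INET):
    """Send one message to the timestamp server and return its reply as str.

    Hypothetical helper: pass family=AF_INET6 and host='::1' for IPv6.
    """
    sock = socket(family, SOCK_STREAM)
    try:
        sock.connect((host, port))
        # Python 3 sockets carry bytes, so encode on the way out...
        sock.sendall(message.encode('utf-8'))
        # ...and decode on the way back in.
        return sock.recv(BUFSIZ).decode('utf-8')
    finally:
        sock.close()
```

With an IPv6 timestamp server running locally, a call such as ask_timestamp('hi', '::1', 21567, AF_INET6) would be the IPv6 Python 3 counterpart of one pass through the client loop.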
2.4.5
Executing Our TCP Server and Client(s)
Now let’s run the server and client programs to see how they work.
Should we run the server first or the client? Naturally, if we ran the client
first, no connection would be possible because there is no server waiting to
accept the request. The server is considered a passive partner because it
has to establish itself first and passively wait for a connection. A client, on
the other hand, is an active partner because it actively initiates a connection. In other words:
Start the server first (before any clients try to connect).
In our example, we use the same computer, but there is nothing to stop
us from using another host for the server. If this is the case, just change the
hostname. (It is rather exciting when you get your first networked application running the server and client from different machines!)
We now present the corresponding input and output from the client
program, which exits with a simple Return (or Enter) keystroke with no
data entered:
$ tsTclnt.py
> hi
[Sat Jun 17 17:27:21 2006] hi
> spanish inquisition
[Sat Jun 17 17:27:37 2006] spanish inquisition
>
$
The server’s output is mainly diagnostic:
$ tsTserv.py
waiting for connection...
...connected from: ('127.0.0.1', 1040)
waiting for connection...
The “. . . connected from . . .” message was received when our client
made its connection. The server went back to wait for new clients while we
continued receiving “service.” When we exited from the server, we had to
break out of it, resulting in an exception. The best way to avoid such an
error is to create a more graceful exit, as we have been discussing.
CORE TIP: Exit gracefully and call the server close() method
One way to create this “friendly” exit in development is to put the server’s
while loop inside the try clause of a try-except statement and monitor for
EOFError or KeyboardInterrupt exceptions so that you can close the server’s
socket in the except or finally clauses. In production, you’ll want to be able
to start up and shut down servers in a more automated fashion. In these cases,
you’ll want to set a flag to shut down service by using a thread or creating a
special file or database entry.
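A minimal Python 3 sketch of such a graceful exit follows. It is only one possible arrangement: the function name serve(), the stop_flag argument (assumed to be a threading.Event set from elsewhere), and the 0.2-second polling interval are our own assumptions for illustration.

```python
from socket import socket, timeout, AF_INET, SOCK_STREAM
from time import ctime

BUFSIZ = 1024

def serve(ser_sock, stop_flag):
    """Serve timestamp requests until stop_flag is set or Ctrl-C/EOF occurs.

    ser_sock must already be bound and listening; it is always closed on exit.
    """
    ser_sock.settimeout(0.2)           # wake up regularly to check the flag
    try:
        while not stop_flag.is_set():
            try:
                cli_sock, addr = ser_sock.accept()
            except timeout:
                continue               # no client yet; re-check the flag
            with cli_sock:             # closes the client socket for us
                while True:
                    data = cli_sock.recv(BUFSIZ)
                    if not data:
                        break          # client has quit
                    stamp = ('[%s] ' % ctime()).encode('utf-8')
                    cli_sock.sendall(stamp + data)
    except (EOFError, KeyboardInterrupt):
        pass                           # fall through to the cleanup below
    finally:
        ser_sock.close()               # the close() call is now reachable
```

During development, pressing Ctrl-C in the terminal raises KeyboardInterrupt and the finally clause closes the server socket. In a more automated setup, another thread can call stop_flag.set(); note that in this simple version the flag is only checked between clients.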
The interesting thing about this simple networked application is that we
are not only showing how our data takes a round trip from the client to
the server and back to the client, but we also use the server as a sort of
“time server,” because the timestamp we receive is purely from the server.
2.4.6
Creating a UDP Server
UDP servers do not require as much setup as TCP servers because they are
not connection-oriented. There is virtually no work that needs to be done
other than just waiting for incoming messages (datagrams).
    ss = socket()                         # create server socket
    ss.bind()                             # bind server socket
    inf_loop:                             # server infinite loop
        cs = ss.recvfrom()/ss.sendto()    # dialog (receive/send)
    ss.close()                            # close server socket
As you can see from the pseudocode, there is nothing extra other than
the usual create-the-socket and bind it to the local address (host/port pair).
The infinite loop consists of receiving a message from a client, timestamping
and returning the message, and then going back to wait for another message. Again, the close() call is optional and will not be reached due to the
infinite loop, but it serves as a reminder that it should be part of the graceful or intelligent exit scheme we’ve been mentioning.
One other significant difference between UDP and TCP servers is that
because datagram sockets are connectionless, there is no “handing off” of
a client connection to a separate socket for succeeding communication.
These servers just accept messages and perhaps reply.
You will find the code to tsUserv.py in Example 2-6, which is a UDP
version of the TCP server presented earlier. It accepts a client message and
returns it to the client with a timestamp.
Example 2-6
UDP Timestamp Server (tsUserv.py)
This script creates a UDP server that accepts messages from clients and
returns them with a timestamp prefix.

    #!/usr/bin/env python

    from socket import *
    from time import ctime

    HOST = ''
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    udpSerSock = socket(AF_INET, SOCK_DGRAM)
    udpSerSock.bind(ADDR)

    while True:
        print 'waiting for message...'
        data, addr = udpSerSock.recvfrom(BUFSIZ)
        udpSerSock.sendto('[%s] %s' % (
            ctime(), data), addr)
        print '...received from and returned to:', addr

    udpSerSock.close()

Line-by-Line Explanation
Lines 1–4
After the Unix startup line, we import time.ctime() and all the attributes
from the socket module, just like the TCP server setup.
Lines 6–12
The HOST and PORT variables are the same as before, and for all the same
reasons. The call to socket() differs only in that we are now requesting a
datagram/UDP socket type, but bind() is invoked in the same way as in
the TCP server version. Again, because UDP is connectionless, no call to
“listen for incoming connections” is made here.
Lines 14–21
Once we are inside the server’s infinite loop, we (passively) wait for a message (a datagram). When one comes in, we process it (by adding a timestamp to it), then send it right back and go back to wait for another message.
The socket close() method is there for show only, as indicated before.
2.4.7
Creating a UDP Client
Of the four clients highlighted here in this section, the UDP client is the
shortest bit of code that we will look at. The pseudocode looks like this:
    cs = socket()                      # create client socket
    comm_loop:                         # communication loop
        cs.sendto()/cs.recvfrom()      # dialog (send/receive)
    cs.close()                         # close client socket
Once a socket object is created, we enter the dialog loop, wherein we
exchange messages with the server. When communication is complete, the
socket is closed.
The real client code, tsUclnt.py, is presented in Example 2-7.
Example 2-7
UDP Timestamp Client (tsUclnt.py)
This script creates a UDP client that prompts the user for messages to send to
the server, receives them back with a timestamp prefix, and then displays them
back to the user.

    #!/usr/bin/env python

    from socket import *

    HOST = 'localhost'
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    udpCliSock = socket(AF_INET, SOCK_DGRAM)

    while True:
        data = raw_input('> ')
        if not data:
            break
        udpCliSock.sendto(data, ADDR)
        data, ADDR = udpCliSock.recvfrom(BUFSIZ)
        if not data:
            break
        print data

    udpCliSock.close()

Line-by-Line Explanation
Lines 1–3
After the Unix startup line, we import all the attributes from the socket
module, again, just like in the TCP version of the client.
Lines 5–10
Because we are running the server on our local computer again, we use
“localhost” and the same port number on the client side, not to mention
the same 1K buffer. We allocate our socket object in the same way as the
UDP server.
Lines 12–22
Our UDP client loop works in almost the exact manner as the TCP client.
The only difference is that we do not have to establish a connection to the
UDP server first; we simply send a message to it and await the reply. After
the timestamped string is returned, we display it to the screen and go back
for more. When the input is complete, we break out of the loop and close
the socket.
Based on the TCP client/server examples, it should be pretty straightforward to create Python 3 and IPv6 equivalents for UDP.
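For readers who want a starting point, here is one possible Python 3 sketch of the UDP exchange. The function names serve_one_datagram() and ask_timestamp_udp(), the family parameter, and the five-second timeout are our own assumptions for illustration; the server function handles a single datagram, so a real server would call it inside its infinite loop.

```python
from socket import socket, AF_INET, SOCK_DGRAM
from time import ctime

BUFSIZ = 1024

def serve_one_datagram(udp_sock):
    """Receive one datagram and send it back with a timestamp prefix."""
    data, addr = udp_sock.recvfrom(BUFSIZ)     # data is bytes in Python 3
    reply = '[%s] %s' % (ctime(), data.decode('utf-8'))
    udp_sock.sendto(reply.encode('utf-8'), addr)
    return addr

def ask_timestamp_udp(message, addr, family=AF_INET):
    """Send one message to a UDP timestamp server and return the reply.

    Pass family=AF_INET6 (and an IPv6 address) for the IPv6 variant.
    """
    cli = socket(family, SOCK_DGRAM)
    try:
        cli.settimeout(5)      # assumed safety net: UDP replies can be lost
        cli.sendto(message.encode('utf-8'), addr)
        data, _ = cli.recvfrom(BUFSIZ)
        return data.decode('utf-8')
    finally:
        cli.close()
```

As with TCP, the Python 3 changes amount to encoding before sending and decoding after receiving, and the IPv6 change amounts to a different address family.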
2.4.8
Executing Our UDP Server and Client(s)
The UDP client behaves the same as the TCP client:
$ tsUclnt.py
> hi
[Sat Jun 17 19:55:36 2006] hi
> spam! spam! spam!
[Sat Jun 17 19:55:40 2006] spam! spam! spam!
>
$
Likewise for the server:
$ tsUserv.py
waiting for message...
...received from and returned to: ('127.0.0.1', 1025)
waiting for message...
In fact, we output the client’s information because we can be receiving
messages from multiple clients and sending replies, and such output helps
by indicating where messages came from. With the TCP server, we know
where messages come from because each client makes a connection. Note
how the message says “waiting for message,” as opposed to “waiting for
connection.”
2.4.9
socket Module Attributes
In addition to the socket.socket() function that we are now familiar with,
the socket module features many more attributes that are used in network application development. Some of the most popular ones are shown
in Table 2-2.
Table 2-2 socket Module Attributes

Attribute Name                    Description

Data Attributes
AF_UNIX, AF_INET, AF_INET6 [a],   Socket address families supported by Python
AF_NETLINK [b], AF_TIPC [c]
SOCK_STREAM, SOCK_DGRAM           Socket types (TCP = stream, UDP = datagram)
has_ipv6 [d]                      Boolean flag indicating whether IPv6 is
                                  supported

Exceptions
error                             Socket-related error
herror [a]                        Host and address-related error
gaierror [a]                      Address-related error
timeout                           Timeout expiration

Functions
socket()                          Create a socket object from the given address
                                  family, socket type, and protocol type (optional)
socketpair() [e]                  Create a pair of socket objects from the given
                                  address family, socket type, and protocol type
                                  (optional)
create_connection()               Convenience function that takes an address
                                  (host, port) pair and returns the socket object
fromfd()                          Create a socket object from an open file descriptor
ssl()                             Initiates a Secure Socket Layer connection over
                                  socket; does not perform certificate validation
getaddrinfo() [a]                 Gets address information as a sequence of
                                  5-tuples
getnameinfo()                     Given a socket address, returns (host, port)
                                  2-tuple
getfqdn() [f]                     Returns fully-qualified domain name
gethostname()                     Returns current hostname
gethostbyname()                   Maps a hostname to its IP address
gethostbyname_ex()                Extended version of gethostbyname() returning
                                  hostname, set of alias hostnames, and list of IP
                                  addresses
gethostbyaddr()                   Maps an IP address to DNS information; returns
                                  same 3-tuple as gethostbyname_ex()
getprotobyname()                  Maps a protocol name (e.g., 'tcp') to a number
getservbyname()/                  Maps a service name to a port number or vice
getservbyport()                   versa; a protocol name is optional for either
                                  function
ntohl()/ntohs()                   Converts integers from network to host byte order
htonl()/htons()                   Converts integers from host to network byte order
inet_aton()/inet_ntoa()           Convert IP address octet string to 32-bit packed
                                  format or vice versa (for IPv4 addresses only)
inet_pton()/inet_ntop()           Convert IP address string to packed binary
                                  format or vice versa (for both IPv4 and IPv6
                                  addresses)
getdefaulttimeout()/              Return default socket timeout in seconds (float);
setdefaulttimeout()               set default socket timeout in seconds (float)

a. New in Python 2.2.
b. New in Python 2.5.
c. New in Python 2.6.
d. New in Python 2.3.
e. New in Python 2.4.
f. New in Python 2.0.
For more information, refer to the socket module documentation in the
Python Library Reference.
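To get a feel for a few of these attributes, the short Python 3 sketch below exercises the byte-order and address-conversion functions. The helper name pack_unpack() is our own invention for this illustration; everything it calls comes from the socket module.

```python
import socket

def pack_unpack(family, text):
    """Convert an address string to packed binary form and back again.

    Hypothetical helper; returns (length of packed form, round-tripped text).
    """
    packed = socket.inet_pton(family, text)
    return len(packed), socket.inet_ntop(family, packed)

# Host <-> network byte order: a round trip returns the original value.
port = 21567
print(socket.ntohs(socket.htons(port)) == port)

# IPv4 addresses pack into 4 bytes, IPv6 addresses into 16 bytes.
print(pack_unpack(socket.AF_INET, '127.0.0.1'))
print(pack_unpack(socket.AF_INET6, '::1'))

# inet_aton()/inet_ntoa() do the same job, but for IPv4 only.
print(socket.inet_ntoa(socket.inet_aton('127.0.0.1')))

# These two depend on the local machine, so no particular output is implied.
print('IPv6 supported by this build:', socket.has_ipv6)
print('hostname:', socket.gethostname())
```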
2.5
*The SocketServer Module
SocketServer is a higher-level module in the standard library (renamed
as socketserver in Python 3.x). Its goal is to simplify a lot of the
boilerplate code that is necessary to create networked clients and servers. In this
module there are various classes created on your behalf, as shown in
Table 2-3 below.
Table 2-3 SocketServer Module Classes

Class                       Description

BaseServer                  Contains core server functionality and hooks for
                            mix-in classes; used only for derivation so you will
                            not create instances of this class; use TCPServer or
                            UDPServer instead
TCPServer/                  Basic networked synchronous TCP/UDP server
UDPServer
UnixStreamServer/           Basic file-based synchronous TCP/UDP server
UnixDatagramServer
ForkingMixIn/               Core forking or threading functionality; used only
ThreadingMixIn              as mix-in classes with one of the server classes to
                            achieve some asynchronicity; you will not
                            instantiate these classes directly
ForkingTCPServer/           Combination of ForkingMixIn and TCPServer/
ForkingUDPServer            UDPServer
ThreadingTCPServer/         Combination of ThreadingMixIn and TCPServer/
ThreadingUDPServer          UDPServer
BaseRequestHandler          Contains core functionality for handling service
                            requests; used only for derivation so you will
                            not create instances of this class; use
                            StreamRequestHandler or DatagramRequestHandler
                            instead
StreamRequestHandler/       Implement service handler for TCP/UDP servers
DatagramRequestHandler

We will create a TCP client and server that duplicates the base TCP
example shown earlier. You will notice the immediate similarities but
should recognize how some of the dirty work is now taken care of so that
you do not have to worry about that boilerplate code. These represent the
simplest synchronous servers you can write. (To configure your server to
run asynchronously, go to the exercises at the end of the chapter.)
In addition to hiding implementation details from you, another difference is that we are now writing our applications using classes. Doing
things in an object-oriented way helps us organize our data and logically
direct functionality to the right places. You will also notice that our applications are now event-driven, meaning that they only work when reacting
to an occurrence of an event in our system.
Events include the sending and receiving of messages. In fact, you will
see that our class definition only consists of an event handler for receiving
a client message. All other functionality is taken from the SocketServer
classes we use. GUI programming (see Chapter 5, "GUI Programming,") is
also event-driven. You will notice the similarity immediately as the final
line of our code is usually a server’s infinite loop waiting for and responding to client service requests. It works almost the same as our infinite while
loop in the original base TCP server earlier in this chapter.
In our original server loop, we block waiting for a request, service it when
something comes in, and then go back to waiting. In the server loop here,
instead of building your code in the server, you define a handler so that the
server can just call your function when it receives an incoming request.
2.5.1
Creating a SocketServer TCP Server
In Example 2-8, we first import our server classes, and then define the
same host constants as before. That is followed by our request handler class,
and then startup. More details follow our code snippet.
Example 2-8
SocketServer Timestamp TCP Server (tsTservSS.py)
This script creates a timestamp TCP server by using SocketServer classes,
TCPServer and StreamRequestHandler.

    #!/usr/bin/env python

    from SocketServer import (TCPServer as TCP,
        StreamRequestHandler as SRH)
    from time import ctime

    HOST = ''
    PORT = 21567
    ADDR = (HOST, PORT)

    class MyRequestHandler(SRH):
        def handle(self):
            print '...connected from:', self.client_address
            self.wfile.write('[%s] %s' % (ctime(),
                self.rfile.readline()))

    tcpServ = TCP(ADDR, MyRequestHandler)
    print 'waiting for connection...'
    tcpServ.serve_forever()

Line-by-Line Explanation
Lines 1–9
The initial stuff consists of importing the right classes from SocketServer.
Note that we are using the multiline import feature introduced in Python 2.4.
If you are using an earlier version of Python, then you will have to use the
fully-qualified module.attribute names or put both attribute imports on
the same line:
from SocketServer import TCPServer as TCP, StreamRequestHandler as SRH
Lines 11–15
The bulk of the work happens here. We derive our request handler
MyRequestHandler as a subclass of SocketServer’s StreamRequestHandler and override
its handle() method, which is stubbed out in the BaseRequestHandler class with no
default action as:

    def handle(self):
        pass

The handle() method is called when an incoming message is received
from a client. The StreamRequestHandler class treats input and output
sockets as file-like objects, so we will use readline() to get the client message and write() to send a string back to the client.
Accordingly, we need additional carriage return and NEWLINE characters in both the client and server code. Actually, you will not see it in the
code because we are just reusing those which come from the client. Other
than these minor differences, it should look just like our earlier server.
Lines 17–19
The final bits of code create the TCP server with the given host information and request handler class. We then have our entire infinite loop waiting for and servicing client requests.
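For comparison, a Python 3 rendering of the same idea might look like the sketch below, which uses the renamed socketserver module. The names TimestampHandler and make_server() are hypothetical, and this is only one way to handle the bytes/str distinction, not the book's own port of tsTservSS.py.

```python
from socketserver import TCPServer, StreamRequestHandler
from time import ctime

class TimestampHandler(StreamRequestHandler):
    """Hypothetical Python 3 counterpart of MyRequestHandler."""

    def handle(self):
        print('...connected from:', self.client_address)
        line = self.rfile.readline()     # bytes, including the client's \r\n
        stamp = ('[%s] ' % ctime()).encode('utf-8')
        self.wfile.write(stamp + line)   # reuse the client's line terminator

def make_server(addr=('', 21567)):
    """Create (but do not start) a TCP server that uses the handler above."""
    return TCPServer(addr, TimestampHandler)
```

Calling make_server().serve_forever() would then start the infinite loop, just as the final line of the Python 2 listing does.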
2.5.2
Creating a SocketServer TCP Client
Our client, shown in Example 2-9, will naturally resemble our original
client, much more so than the server, but it has to be tweaked a bit to work
well with our new server.
Example 2-9
SocketServer Timestamp TCP Client (tsTclntSS.py)
This is a timestamp TCP client that knows how to speak to the file-like
SocketServer class StreamRequestHandler objects.

    #!/usr/bin/env python

    from socket import *

    HOST = 'localhost'
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    while True:
        tcpCliSock = socket(AF_INET, SOCK_STREAM)
        tcpCliSock.connect(ADDR)
        data = raw_input('> ')
        if not data:
            break
        tcpCliSock.send('%s\r\n' % data)
        data = tcpCliSock.recv(BUFSIZ)
        if not data:
            break
        print data.strip()
        tcpCliSock.close()

Line-by-Line Explanation
Lines 1–8
Nothing special here; this is an exact replica of our original client code.
Lines 10–21
The default behavior of the SocketServer request handlers is to accept a
connection, get the request, and then close the connection. This makes
it so that we cannot keep our connection throughout the execution of our
application, so we need to create a new socket each time we send a message to the server.
This behavior makes the TCP server act more like a UDP server; however, this can be changed by overriding the appropriate methods in our
request handler classes. We leave this as an exercise at the end of this
chapter.
Other than the fact that our client is somewhat “inside-out” now
(because we have to create a connection each time), the only other minor
difference was previewed in the line-by-line explanation for the server
code: the handler class we are using treats socket communication like a
file, so we have to send line-termination characters (carriage return and
NEWLINE) each way. The server just retains and reuses the ones we send
here. When we get a message back from the server, we strip() them
and just use the NEWLINE that is automatically provided by the print
statement.
2.5.3
Executing our TCP Server and Client(s)
Here is the output of our SocketServer TCP client:
$ tsTclntSS.py
> 'Tis but a scratch.
[Tue Apr 18 20:55:49 2006] 'Tis but a scratch.
> Just a flesh wound.
[Tue Apr 18 20:55:56 2006] Just a flesh wound.
>
$
And here is the server’s output:
$ tsTservSS.py
waiting for connection...
...connected from: ('127.0.0.1', 53476)
...connected from: ('127.0.0.1', 53477)
The output is similar to that of our original TCP client and servers; however, you will notice that we connected to the server twice.
2.6
*Introduction to the Twisted Framework
Twisted is a complete event-driven networking framework with which you
can both use and develop complete asynchronous networked applications
and protocols. It is not part of the Python Standard Library as of this writing and must be downloaded and installed separately (you can use the
link at the end of the chapter). It provides a significant amount of support
for you to build complete systems, including network protocols, threading,
security and authentication, chat/IM, DBM and RDBMS database integration,
Web/Internet, e-mail, command-line arguments, GUI toolkit integration, etc.
Using Twisted to implement our tiny simplistic example is like using a sledgehammer to pound a thumbtack, but you have to get started somehow, and our
application is the equivalent to the “hello world” of networked applications.
Like SocketServer, most of the functionality of Twisted lies in its
classes. In particular for our examples, we will be using the classes found
in the reactor and protocol subpackages of Twisted’s Internet component.
2.6.1
Creating a Twisted Reactor TCP Server
You will find the code in Example 2-10 similar to that of the SocketServer
example. Instead of a handler class, however, we create a protocol class
and override several methods in the same manner as installing callbacks.
Also, this example is asynchronous. Let’s take a look at the server now.
Example 2-10
Twisted Reactor Timestamp TCP Server (tsTservTW.py)
This is a timestamp TCP server that uses Twisted Internet classes.

    #!/usr/bin/env python

    from twisted.internet import protocol, reactor
    from time import ctime

    PORT = 21567

    class TSServProtocol(protocol.Protocol):
        def connectionMade(self):
            clnt = self.clnt = self.transport.getPeer().host
            print '...connected from:', clnt
        def dataReceived(self, data):
            self.transport.write('[%s] %s' % (
                ctime(), data))

    factory = protocol.Factory()
    factory.protocol = TSServProtocol
    print 'waiting for connection...'
    reactor.listenTCP(PORT, factory)
    reactor.run()

Line-by-Line Explanation
Lines 1–6
The setup lines of code include the usual module imports, most notably
the protocol and reactor subpackages of twisted.internet and our constant port number.
Lines 8–14
We derive the Protocol class and call ours TSServProtocol for our timestamp server. We then override connectionMade(), a method that is executed when a client connects to us, and dataReceived(), called when a
client sends a piece of data across the network. The reactor passes in the
data as an argument to this method so that we can get access to it right
away without having to extract it ourselves.
The transport instance object is how we can communicate with the client. You can see how we use it in connectionMade() to get the host information about who is connecting to us as well as in dataReceived() to
return data back to the client.
Lines 16–20
In the final part of our server, we create a protocol Factory. It is called a
factory because an instance of our protocol is “manufactured” every time
we get an incoming connection. We then install a TCP listener in our reactor to check for service requests; when it receives a request, it creates a
TSServProtocol instance to take care of that client.
2.6.2
Creating a Twisted Reactor TCP Client
Unlike the SocketServer TCP client, Example 2-11 will not look like all
the other clients—this one is distinctly Twisted.
Example 2-11
Twisted Reactor Timestamp TCP Client (tsTclntTW.py)
Our familiar timestamp TCP client, written from a Twisted point of view.

    #!/usr/bin/env python

    from twisted.internet import protocol, reactor

    HOST = 'localhost'
    PORT = 21567

    class TSClntProtocol(protocol.Protocol):
        def sendData(self):
            data = raw_input('> ')
            if data:
                print '...sending %s...' % data
                self.transport.write(data)
            else:
                self.transport.loseConnection()

        def connectionMade(self):
            self.sendData()

        def dataReceived(self, data):
            print data
            self.sendData()

    class TSClntFactory(protocol.ClientFactory):
        protocol = TSClntProtocol
        clientConnectionLost = clientConnectionFailed = \
            lambda self, connector, reason: reactor.stop()

    reactor.connectTCP(HOST, PORT, TSClntFactory())
    reactor.run()

Line-by-Line Explanation
Lines 1–6
Again, nothing really new here apart from the import of Twisted components. It is very similar to all of our other clients.
Lines 8–22
Like the server, we extend Protocol by overriding the connectionMade()
and dataReceived() methods. Both execute for the same reason as the
server. We also add our own method for when data needs to be sent and
call it sendData().
Because this time we are the client, we are the ones initiating a conversation with the server. Once that connection has been established, we take
the first step and send a message. The server replies, and we handle it by
displaying it to the screen and sending another message to the server.
This continues in a loop until we terminate the connection by giving no
input when prompted. Instead of calling the write() method of the transport
object to send another message to the server, loseConnection() is executed,
closing the socket. When this occurs, the factory’s clientConnectionLost()
method will be called and our reactor is stopped, completing execution of our
script. We also stop the reactor if clientConnectionFailed() is called
because the connection could not be established in the first place.
The final part of the script is where we create a client factory and make a
connection to the server and run the reactor. Note that we instantiate the client factory here instead of passing it in to the reactor, as we did in the server.
This is because we are not the server waiting for clients to talk to us, and its
factory makes a new protocol object for each connection. We are one client, so
we make a single protocol object that connects to the server, whose factory
makes one to talk to ours.
2.6.3 Executing Our TCP Server and Client(s)
The Twisted client displays output similar to all of our other clients:
$ tsTclntTW.py
> Where is hope
...sending Where is hope...
[Tue Apr 18 23:53:09 2006] Where is hope
> When words fail
...sending When words fail...
[Tue Apr 18 23:53:14 2006] When words fail
>
$
The server is back to a single connection. Twisted maintains the connection and does not close the transport after every message:
$ tsTservTW.py
waiting for connection...
...connected from: 127.0.0.1
The “connected from” output does not have the other information
because we only asked for the host/address from the getPeer() method of
the server’s transport object.
Keep in mind that most applications based on Twisted are much more
complex than the examples built in this subsection. It is a feature-rich
library, but it does come with a level of complexity for which you need to
be prepared.
2.7 Related Modules
Table 2-4 lists some of the other Python modules that are related to network and socket programming. The select module is usually used in
conjunction with the socket module when developing lower-level socket
applications. It provides the select() function, which manages sets of
socket objects. One of the most useful things it does is to take a set of sockets and listen for active connections on them. The select() function will
block until at least one socket is ready for communication, and when that
happens, it provides you with a set of those that are ready for reading. (It
can also determine which sockets are ready for writing, although that is
not as common as the former operation.)
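To make the blocking behavior concrete, here is a minimal sketch, written for Python 3 and not taken from the chapter's scripts. It assumes a connected pair from socket.socketpair() as a stand-in for a real client and server so that it runs on a single machine; the variable names and timeouts are our own choices.

```python
import select
import socket

# Two connected sockets stand in for a server-side socket and its client.
a, b = socket.socketpair()

# Nothing has been sent yet, so with a zero timeout select() returns
# immediately and reports no socket as ready for reading.
readable, _, _ = select.select([a], [], [], 0)
idle_count = len(readable)

# Once the peer writes, select() reports the socket as ready.
b.sendall(b'hello')
readable, _, _ = select.select([a], [], [], 1.0)
data = readable[0].recv(1024) if readable else b''

a.close()
b.close()
```

In a real server, the list passed to select() would typically hold the listening socket plus every connected client socket.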
Table 2-4 Network/Socket Programming Related Modules
Module              Description
socket              Lower-level networking interface, as discussed in this chapter
asyncore/asynchat   Provide infrastructure to create networked applications that process clients asynchronously
select              Manages multiple socket connections in a single-threaded network server application
SocketServer        High-level module that provides server classes for networked applications, complete with forking or threading varieties
The async* and SocketServer modules both provide higher-level functionality as far as creating servers is concerned. Written on top of the
socket and/or select modules, they enable more rapid development of
client/server systems because all the lower-level code is handled for you.
All you have to do is to create or subclass the appropriate base classes, and
you are on your way. As we mentioned earlier, SocketServer even provides the capability of integrating threading or new processes into the
server, which affords a more parallel-like processing of client requests.
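As a rough sketch of that threading support, the following Python 3 code uses socketserver.ThreadingTCPServer (the module is named SocketServer in Python 2). The handler name TimestampHandler, the use of port 0 to let the operating system choose a free port, and the throwaway client are assumptions made for illustration only.

```python
import socket
import socketserver
import threading
from time import ctime

class TimestampHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one line from the client and send it back with a timestamp.
        line = self.rfile.readline().strip()
        self.wfile.write(b'[' + ctime().encode() + b'] ' + line + b'\n')

# Port 0 asks the OS for any free port; ThreadingTCPServer handles
# each incoming connection in its own thread.
server = socketserver.ThreadingTCPServer(('127.0.0.1', 0), TimestampHandler)
server.daemon_threads = True
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# A throwaway client to exercise the server once.
with socket.create_connection((host, port)) as sock:
    sock.sendall(b'hello\n')
    reply = sock.makefile('rb').readline()

server.shutdown()
server.server_close()
```

Swapping ThreadingTCPServer for ForkingTCPServer (on platforms that support fork) would give the process-based variety instead.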
Although async* provides the only asynchronous development support
in the standard library, in the previous section, you were introduced to
Twisted, a third-party package that is more powerful than those older
modules. Although the example code we have seen in this chapter is
slightly longer than the barebones scripts, Twisted provides a much more
powerful and flexible framework and has implemented many protocols
for you already. You can find out more about Twisted at its Web site:
http://twistedmatrix.com
A more modern networking framework is Concurrence, which is the
engine behind the Dutch social network, Hyves. Concurrence is a high-performance I/O system paired with libevent, the lower-level event callback dispatching system. Concurrence follows an asynchronous model,
using lightweight threads (executing callbacks) in an event-driven way to
do the work and message-passing for interthread communication. You can
find out more info about Concurrence at:
http://opensource.hyves.org/concurrence
Modern networking frameworks follow one of many asynchronous
models (greenlets, generators, etc.) to provide high-performance asynchronous servers. One of the goals of these frameworks is to hide the complexity of asynchronous programming so as to allow users to code in a
more familiar, synchronous manner.
The topics we have covered in this chapter deal with network programming with sockets in Python and how to create custom applications using
lower-level protocol suites such as TCP/IP and UDP/IP. If you want to
develop higher-level Web and Internet applications, we strongly encourage you to move ahead to Chapter 3, “Internet Client Programming,” or
perhaps skip to Part II of the book.
2.8 Exercises
2-1. Sockets. What is the difference between connection-oriented
and connectionless sockets?
2-2. Client/Server Architecture. Describe in your own words what
this term means and give several examples.
2-3. Sockets. Between TCP and UDP, which type of server accepts
connections and hands them off to separate sockets for client
communication?
2-4. Clients. Update the TCP (tsTclnt.py) and UDP (tsUclnt.py)
clients so that the server name is not hardcoded into the
application. Allow the user to specify a hostname and port
number, and only use the default values if either or both
parameters are missing.
2-5. Internetworking and Sockets. Implement the sample TCP client/
server programs found in the Python Library Reference documentation on the socket module and get them to work. Set
up the server and then the client. An online version of the
source is also available here:
http://docs.python.org/library/socket#example
You decide the server is too boring. Update the server so that
it can do much more, recognizing the following commands:
date  Server will return its current date/timestamp, that is, time.ctime().
os    Get OS information (os.name).
ls    Give a listing of the current directory. (Hints: os.listdir() lists a directory, os.curdir is the current directory.) Extra Credit: Accept ls dir and return dir's file listing.
You do not need a network to do this assignment—your
computer can communicate with itself. Be aware that after
the server exits, the binding must be cleared before you can
run it again. You might experience “port already bound”
errors. The operating system usually clears the binding
within 5 minutes, so be patient.
2-6. Daytime Service. Use socket.getservbyname() to determine the port number for the “daytime” service under the
UDP protocol. Check the documentation for getservbyname()
to get the exact usage syntax (i.e., socket.getservbyname.__doc__). Now write an application that sends a dummy
message over and waits for the reply. Once you have received
a reply from the server, display it to the screen.
2-7. Half-Duplex Chat. Create a simple, half-duplex chat program.
By half-duplex, we mean that when a connection is made
and the service starts, only one person can type. The other
participant must wait to get a message before being
prompted to enter a message. Once a message is sent, the
sender must wait for a reply before being allowed to send
another message. One participant will be on the server side;
the other will be on the client side.
2-8. Full-Duplex Chat. Update your solution to the previous exercise so that your chat service is now full-duplex, meaning that
both parties can send and receive, independent of each other.
2-9. Multi-User Full Duplex Chat. Further update your solution so
that your chat service is multi-user.
2-10. Multi-User, Multiroom, Full Duplex Chat. Now make your chat
service multi-user and multiroom.
2-11. Web Client. Write a TCP client that connects to port 80 of your
favorite Web site (remove the “http://” and any trailing information; use only the hostname). Once a connection has been
established, send the HTTP command string GET /\n and
write all the data that the server returns to a file. (The GET
command retrieves a Web page, the / file indicates the file to
get, and the \n sends the command to the server.) Examine
the contents of the retrieved file. What is it? How can you
check to make sure the data you received is correct? (Note:
You might have to insert one or two NEWLINEs after the
command string. One usually works.)
2-12. Sleep Server. Create a sleep server. A client will request to be
“put to sleep” for a number of seconds. The server will issue
the command on behalf of the client then return a message to
the client indicating success. The client should have slept or
should have been idle for the exact time requested. This is
a simple implementation of a remote procedure call, where a
client’s request invokes commands on another computer
across the network.
2-13. Name Server. Design and implement a name server. Such
a server is responsible for maintaining a database of hostname-port number pairs, perhaps along with the string
description of the service that the corresponding servers
provide. Take one or more existing servers and have them
register their service with your name server. (Note that these
servers are, in this case, clients of the name server.)
Every client that starts up has no idea where the server is that
it is looking for. Also as clients of the name server, these
clients should send a request to the name server indicating
what type of service they are seeking. The name server, in
reply, returns a hostname-port number pair to this client,
which then connects to the appropriate server to process its
request.
Extra Credit:
1) Add caching to your name server for popular requests.
2) Add logging capability to your name server, keeping
track of which servers have registered and which services clients are requesting.
3) Your name server should periodically ping the registered hosts at their respective port numbers to ensure
that the service is indeed up. Repeated failures will
cause a server to be delisted from the list of services.
You can implement real services for the servers that register for your name service, or just use dummy servers (which
merely acknowledge a request).
2-14. Error Checking and Graceful Shutdown. All of the sample
client/server code presented in this chapter is poor in terms
of error-checking. We do not handle scenarios such as when
users press Ctrl+C to exit out of a server or Ctrl+D to terminate client input, nor do we check other improper input to
raw_input() or handle network errors. Because of this weakness, quite often we terminate an application without closing
our sockets, potentially losing data. Choose a client/server
pair of one of our examples, and add enough error-checking
so that each application properly shuts down, that is, closes
network connections.
2-15. Asynchronicity and SocketServer/socketserver. Take the
example TCP server and use either mix-in class to support
an asynchronous server. To test your server, create and run
multiple clients simultaneously and show output that your
server is serving requests from both, interleaved.
2-16. *Extending SocketServer Classes. In the SocketServer TCP
server code, we had to change our client from the original
base TCP client because the SocketServer class does not
maintain the connection between requests.
a) Subclass the TCPServer and StreamRequestHandler
classes and re-design the server so that it maintains
and uses a single connection for each client (not one per
request).
b) Integrate your solution for the previous exercise with
your solution to part (a), such that multiple clients are
being serviced in parallel.
2-17. *Asynchronous Systems. Research at least five different
Python-based asynchronous systems—choose from Twisted,
Greenlets, Tornado, Diesel, Concurrence, Eventlet, Gevent,
etc. Describe what they are, categorize them, find similarities
and differences, and then create some demonstration code
samples.
CHAPTER 3
Internet Client Programming
You can’t take something off the Internet, that’s like trying to take
pee out of a swimming pool. Once it’s in there, it’s in there.
—Joe Garrelli, March 1996
(a character from NewsRadio [television program],
portrayed by Joe Rogan)
In this chapter...
• What Are Internet Clients?
• Transferring Files
• Network News
• E-Mail
• Related Modules
In Chapter 2, “Network Programming,” we took a look at low-level
networking communication protocols using sockets. This type of networking is at the heart of most of the client/server protocols that exist
on the Internet today. These protocols include those for transferring files
(FTP, etc.), reading Usenet newsgroups (Network News Transfer Protocol),
sending e-mail (SMTP), and downloading e-mail from a server (POP3,
IMAP), etc. These protocols work in a way much like the client/server
examples in Chapter 2. The only difference is that now we have taken
lower-level protocols such as TCP/IP and created newer, more specific
protocols on top of them to implement these higher-level services.
3.1 What Are Internet Clients?
Before we take a look at these protocols, we first must ask, “What is an Internet client?” To answer this question, we simplify the Internet to a place
where data is exchanged, and this interchange is made up of someone offering a service and a user of such services. You will hear the term producer-consumer in some circles (although this phrase is generally reserved for
conversations on operating systems). Servers are the producers, providing the services, and clients consume the offered services. For any one particular service, there is usually only one server (process, host, etc.) and more
than one consumer. We previously examined the client/server model, and
although we do not need to create Internet clients with the low-level socket
operations seen earlier, the model is an accurate match.
In this chapter, we’ll explore a variety of these Internet protocols and
create clients for each. When finished, you should be able to recognize
how similar the application programming interfaces (APIs) of all of these
protocols are (this is by design, as keeping interfaces consistent is a
worthy cause) and, most importantly, be able to create real clients of
these and other Internet protocols. And even though we are only highlighting three specific kinds of protocols (file transfer, network news, and e-mail), at the end of this chapter, you
should feel confident enough to write clients for just about any Internet
protocol.
Chapter 3 • Internet Client Programming
3.2 Transferring Files
3.2.1 File Transfer Internet Protocols
One of the most popular Internet activities is file exchange. It happens all the
time. There have been many protocols to transfer files over the Internet, with
some of the most popular including the File Transfer Protocol, the Unix-to-Unix Copy Protocol (UUCP), and of course, the Web’s Hypertext Transfer
Protocol (HTTP). We should also include the remote (Unix) file copy command, rcp (and now its more secure and flexible cousins, scp and rsync).
HTTP, FTP, and scp/rsync are still quite popular today. HTTP is primarily used for Web-based file download and accessing Web services. It generally doesn’t require clients to have a login and/or password on the server
host to obtain documents or service. The majority of all HTTP file transfer
requests are for Web page retrieval (file downloads).
On the other hand, scp and rsync require a user login on the server
host. Clients must be authenticated before file transfers can occur, and
files can be sent (upload) or retrieved (download). Finally, we have FTP.
Like scp/rsync, FTP can be used for file upload or download; and like
scp/rsync, it employs the Unix multi-user concepts of usernames and
passwords. FTP clients must use the login/password of existing users;
however, FTP also allows anonymous logins. Let’s now take a closer look
at FTP.
3.2.2 File Transfer Protocol
The File Transfer Protocol (FTP) was specified by the late Jon Postel and
Joyce Reynolds in Internet Request for Comments (RFC) 959,
published in October 1985. It is primarily used to download publicly
accessible files in an anonymous fashion. It can also be used to transfer
files between two computers, especially when you’re using a Unix-based
system for file storage or archiving and a desktop or laptop PC for work.
Before the Web became popular, FTP was one of the primary methods of
transferring files on the Internet, and one of the only ways to download software and/or source code.
As mentioned previously, you must have a login/password to access the
remote host running the FTP server. The exception is anonymous logins,
which are designed for guest downloads. These permit clients who do not
have accounts to download files. The server’s administrator must set up an
FTP server with anonymous logins to enable this. In these cases, the login
of an unregistered user is called anonymous, and the password is generally
the e-mail address of the client. This is akin to a public login and access to
directories that were designed for general consumption as opposed to logging in and transferring files as a particular user. The list of available commands via the FTP protocol is also generally more restrictive than that for
real users.
The protocol is diagrammed in Figure 3-1 and works as follows:
1. Client contacts the FTP server on the remote host
2. Client logs in with username and password (or anonymous
and e-mail address)
3. Client performs various file transfers or information requests
4. Client completes the transaction by logging out of the remote
host and FTP server
Of course, this is generally how it works. Sometimes there are circumstances whereby the entire transaction is terminated before it’s completed.
These include being disconnected from the network, one of the two hosts
crashing, or some other network connectivity issue. For inactive
clients, FTP connections will generally time out after 15 minutes (900 seconds)
of inactivity.
Under the hood, it is good to know that FTP uses only TCP (see Chapter 2);
it does not use UDP in any way. Also, FTP can be seen as a more
unusual example of client/server programming because both the clients
and the servers use a pair of sockets for communication: one is the control
or command port (port 21), and the other is the data port (sometimes
port 20).
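[Figure 3-1 diagram: an FTP client and an FTP server connected across the Internet. The client uses port M (> 1023) for control/command and port M+1 for data; the server uses port 21 for control/command and port 20 (Active) or N (> 1023) (Passive) for data.]
Figure 3-1 FTP Clients and Servers on the Internet. The client and server communicate using the FTP protocol on the command or control port; data is transferred using the data port.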
We say sometimes because there are two FTP modes: Active and Passive, and the server’s data port is only 20 for Active mode. After the server
sets up 20 as its data port, it “actively” initiates the connection to the client’s
data port. For Passive mode, the server is only responsible for letting the
client know where its random data port is; the client must initiate the data
connection. As you can see in this mode, the FTP server is taking a more
passive role in setting up the data connection. Finally, there is now support
for a new Extended Passive Mode to support version 6 Internet Protocol
(IPv6) addresses—see RFC 2428.
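To see what the server tells the client in Passive mode, consider the reply to the PASV command, which per RFC 959 carries six comma-separated numbers: four for the address and two for the port. The helper below is a hypothetical illustration (parse_pasv_reply is our own name, and the reply string is made up); ftplib does the equivalent work for you internally.

```python
import re

def parse_pasv_reply(reply):
    """Return (host, port) from a '227 Entering Passive Mode (...)' reply."""
    match = re.search(r'(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)', reply)
    if match is None:
        raise ValueError('not a PASV reply: %r' % reply)
    numbers = [int(n) for n in match.groups()]
    host = '.'.join(str(n) for n in numbers[:4])
    # The port arrives as two bytes: high byte first, then low byte.
    port = numbers[4] * 256 + numbers[5]
    return host, port

# Made-up reply; 192.0.2.10 is in a range reserved for documentation.
host, port = parse_pasv_reply('227 Entering Passive Mode (192,0,2,10,195,80)')
```

The client would then open its data connection to that host and port, which is why the server's role is the "passive" one.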
Python supports most Internet protocols, including FTP. Other supported
client libraries can be found at http://docs.python.org/library/internet. Now
let’s take a look at just how easy it is to create an Internet client with Python.
3.2.3 Python and FTP
So, how do we write an FTP client by using Python? What we just
described in the previous section covers it pretty well. The only additional
work required is to import the appropriate Python module and make the
appropriate calls in Python. So let’s review the protocol briefly:
1. Connect to server
2. Log in
3. Make service request(s) (and hopefully get response[s])
4. Quit
When using Python’s FTP support, all you do is import the ftplib module and instantiate the ftplib.FTP class. All FTP activity—logging in,
transferring files, and logging out—will be accomplished using your
object.
Here is some Python pseudocode:
from ftplib import FTP
f = FTP('some.ftp.server')
f.login('anonymous', '[email protected]')
# ... make service requests here ...
f.quit()
Soon we will look at a real example, but for now, let’s familiarize ourselves with methods from the ftplib.FTP class, which you will likely use
in your code.
3.2.4 ftplib.FTP Class Methods
We outline the most popular methods in Table 3-1. The list is not comprehensive—see the source code for the class itself for all methods—but the ones
presented here are those that make up the API for FTP client programming
in Python. In other words, you don’t really need to use the others because
they are either utility or administrative functions or are used by the API
methods later.
Table 3-1 Methods for FTP Objects
Method                                        Description
login(user='anonymous', passwd='', acct='')   Log in to FTP server; all arguments are optional
pwd()                                         Current working directory
cwd(path)                                     Change current working directory to path
dir([path[,...[,cb]]])                        Displays directory listing of path; optional callback cb passed to retrlines()
nlst([path[,...]])                            Like dir() but returns a list of filenames instead of displaying
retrlines(cmd [, cb])                         Download text file given FTP cmd, for example, RETR filename; optional callback cb for processing each line of file
retrbinary(cmd, cb[, bs=8192[, ra]])          Similar to retrlines() except for binary file; required callback cb processes each downloaded block (block size bs defaults to 8K)
storlines(cmd, f)                             Upload text file given FTP cmd, for example, STOR filename; open file object f required
storbinary(cmd, f[, bs=8192])                 Similar to storlines() but for binary file; open file object f required, upload blocksize bs defaults to 8K
rename(old, new)                              Rename remote file from old to new
delete(path)                                  Delete remote file located at path
mkd(directory)                                Create remote directory
rmd(directory)                                Remove remote directory
quit()                                        Close connection and quit
The methods you will most likely use in a normal FTP transaction
include login(), cwd(), dir(), pwd(), stor*(), retr*(), and quit(). There
are more FTP object methods not listed in the table that you might find
useful. For more detailed information about FTP objects, read the Python
documentation available at http://docs.python.org/library/ftplib#ftp-objects.
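As a small illustration of the callback arguments in Table 3-1, the sketch below collects the lines of a remote text file into a list instead of printing them. fetch_text_lines is a hypothetical helper name, and ftp is assumed to be an FTP object that is already connected and logged in.

```python
def fetch_text_lines(ftp, filename):
    """Return the lines of a remote text file as a list of strings."""
    lines = []
    # retrlines() calls the callback once per line, with the line ending
    # already stripped; without a callback it prints each line instead.
    ftp.retrlines('RETR ' + filename, lines.append)
    return lines

# Typical use, assuming a reachable server (not executed here):
#     f = FTP('some.ftp.server')
#     f.login()
#     motd = fetch_text_lines(f, 'motd')
```

The same pattern works for dir(), which accepts a callback in the same way.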
3.2.5 An Interactive FTP Example
Using FTP with Python is so simple that you do not
even have to write a script. You can just do it all from the interactive interpreter and see the action and output in real time. Here is a sample session
from a few years ago when there was still an FTP server running at
python.org, but it will not work today, so this is just an example of what
you might experience with a running FTP server:
>>> from ftplib import FTP
>>> f = FTP('ftp.python.org')
>>> f.login('anonymous', '[email protected]')
'230 Guest login ok, access restrictions apply.'
>>> f.dir()
total 38
drwxrwxr-x  10 1075     4127         512 May 17  2000 .
drwxrwxr-x  10 1075     4127         512 May 17  2000 ..
drwxr-xr-x   3 root     wheel        512 May 19  1998 bin
drwxr-sr-x   3 root     1400         512 Jun  9  1997 dev
drwxr-xr-x   3 root     wheel        512 May 19  1998 etc
lrwxrwxrwx   1 root     bin            7 Jun 29  1999 lib -> usr/lib
-r--r--r--   1 guido    4127          52 Mar 24  2000 motd
drwxrwsr-x   8 1122     4127         512 May 17  2000 pub
drwxr-xr-x   5 root     wheel        512 May 19  1998 usr
>>> f.retrlines('RETR motd')
Sun Microsystems Inc.   SunOS 5.6       Generic August 1997
'226 Transfer complete.'
>>> f.quit()
'221 Goodbye.'
3.2.6 A Client Program FTP Example
We mentioned previously that an example script is not even necessary
because you can run one interactively and not get lost in any code. We will
try anyway. For example, suppose that you want a piece of code that goes to
download the latest copy of Bugzilla from the Mozilla Web site. Example 3-1
is what we came up with. We are attempting an application here, but even
so, you can probably run this one interactively, too. Our application uses the
FTP library to download the file and includes some error-checking.
Example 3-1 FTP Download Example (getLatestFTP.py)
This program is used to download the latest version of a file from a Web site.
You can tweak it to download your favorite application.
1   #!/usr/bin/env python
2
3   import ftplib
4   import os
5   import socket
6
7   HOST = 'ftp.mozilla.org'
8   DIRN = 'pub/mozilla.org/webtools'
9   FILE = 'bugzilla-LATEST.tar.gz'
10
11  def main():
12      try:
13          f = ftplib.FTP(HOST)
14      except (socket.error, socket.gaierror) as e:
15          print 'ERROR: cannot reach "%s"' % HOST
16          return
17      print '*** Connected to host "%s"' % HOST
18
19      try:
20          f.login()
21      except ftplib.error_perm:
22          print 'ERROR: cannot login anonymously'
23          f.quit()
24          return
25      print '*** Logged in as "anonymous"'
26
27      try:
28          f.cwd(DIRN)
29      except ftplib.error_perm:
30          print 'ERROR: cannot CD to "%s"' % DIRN
31          f.quit()
32          return
33      print '*** Changed to "%s" folder' % DIRN
34
35      try:
36          f.retrbinary('RETR %s' % FILE,
37              open(FILE, 'wb').write)
38      except ftplib.error_perm:
39          print 'ERROR: cannot read file "%s"' % FILE
40          os.unlink(FILE)
41      else:
42          print '*** Downloaded "%s" to CWD' % FILE
43      f.quit()
44      return
45
46  if __name__ == '__main__':
47      main()
Be aware that this script is not automated, so it is up to you to run it
whenever you want to perform the download, or if you are on a Unix-based
system, you can set up a cron job to automate it for you. Another issue is
that it will break if either the file or directory names change.
If no errors occur when we run our script, we get the following output:
$ getLatestFTP.py
*** Connected to host "ftp.mozilla.org"
*** Logged in as "anonymous"
*** Changed to "pub/mozilla.org/webtools" folder
*** Downloaded "bugzilla-LATEST.tar.gz" to CWD
$
Line-by-Line Explanation
Lines 1–9
The initial lines of code import the necessary modules (mainly to grab
exception objects) and set a few constants.
Lines 11–44
The main() function consists of various steps of operation: create an FTP
object and attempt to connect to the FTP server (lines 12–17), returning
on any failure. We attempt to log in as anonymous and abort if
unsuccessful (lines 19–25). The next step is to change to the distribution
directory (lines 27–33), and finally, we try to download the file (lines 35–44).
For line 14 and all other exception handlers in this book where you’re
saving the exception instance—in this case e—if you’re using Python 2.5
and older, you need to change the as to a comma, because this new syntax
was introduced (but not required) in version 2.6 to help with 3.x migration. Python 3 only understands the new syntax shown in line 14.
On lines 35–36, we pass a callback to retrbinary() that should be executed for every block of binary data downloaded. This is the write()
method of a file object we create to write out the local version of the file. We
are depending on the Python interpreter to adequately close our file after
the transfer is done and to not lose any of our data. Although this style is more convenient, I usually try to avoid it, because the programmer should
be responsible for freeing resources directly allocated rather than depending on other code. In this case, we should save the open file object to a variable,
say loc, and then pass loc.write in the call to f.retrbinary().
After the transfer has completed, we would call loc.close(). If for some
reason we are not able to save the file, we remove the empty file to avoid
cluttering up the file system (line 40). We should put some error-checking
around that call to os.unlink(FILE) in case the file does not exist. Finally, to
avoid another pair of lines (lines 43–44) that close the FTP connection and
return, we use an else clause (lines 35–42).
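The alternative described above might look like the following sketch. It goes one step further than an explicit loc.close() by using a with statement, so the file is closed even if the transfer fails, and it guards the call to os.unlink(). The function name download_binary and its return convention are assumptions for illustration, and it catches ftplib.all_errors rather than only error_perm.

```python
import ftplib
import os

def download_binary(ftp, remote_name, local_path):
    """Download remote_name to local_path; return True on success."""
    try:
        # The with statement closes the local file when the block exits,
        # whether or not the transfer succeeded.
        with open(local_path, 'wb') as loc:
            ftp.retrbinary('RETR ' + remote_name, loc.write)
    except ftplib.all_errors:
        # Remove the partial or empty file, tolerating its absence.
        if os.path.exists(local_path):
            os.unlink(local_path)
        return False
    return True
```

Here ftp is assumed to be an FTP object that has already logged in and changed to the right directory, as in lines 12–33 of the example.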
Lines 46–47
This is the usual idiom for running a stand-alone script.
3.2.7 Miscellaneous FTP
Python supports both Active and Passive modes. Note, however, that
in Python 2.0 and older releases, Passive mode was off by default; in
Python 2.1 and all successive releases, it is on by default.
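If you need to switch modes, FTP objects provide the set_pasv() method. A minimal sketch, assuming no particular server (creating the object without a host does not open a connection):

```python
from ftplib import FTP

# No host is given, so nothing is contacted yet.
f = FTP()
f.set_pasv(False)   # request Active mode for later data transfers
f.set_pasv(True)    # back to Passive mode, the default
```

The setting only takes effect when a data connection is opened, for example by dir() or one of the retr*() methods.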
Here is a list of typical FTP clients:
• Command-line client program: This is where you execute
FTP transfers by running an FTP client program such as /bin/
ftp, or NcFTP, which allows users to interactively participate
in an FTP transaction via the command line.
• GUI client program: Similar to a command-line client
program, except that it is a GUI application like WS_FTP,
Filezilla, CuteFTP, Fetch, or SmartFTP.
• Web browser: In addition to using HTTP, most Web browsers
(also referred to as a client) can also speak FTP. The first
directive in a URL/URI is the protocol, that is, “http://
blahblah.” This tells the browser to use HTTP as a means of
transferring data from the given Web site. By changing the
protocol, one can make a request using FTP, as in “ftp://
blahblah.” It looks pretty much exactly the same as a URL,
which uses HTTP. (Of course, the “blahblah” can expand
to the expected “host/path?attributes” after the protocol
directive “ftp://”.) Because of the login requirement, users can
add their logins and passwords (in clear text) into their URL,
for example, “ftp://user:passwd@host/path?attr1=val1&attr2=val2. . .”.
• Custom application: A program you write that uses FTP to
transfer files. It generally does not allow the user to interact
with the server as the application was created for specific
purposes.
All four types of clients can be created by using Python. We used ftplib
above to create our custom application, but you can just as well create an
interactive command-line application. On top of that, you can even bring a
GUI toolkit such as Tk, wxWidgets, GTK+, Qt, MFC, and even Swing into
the mix (by importing their respective Python [or Jython] interface modules) and build a full GUI application on top of your command-line client
code. Finally, you can use Python’s urllib module to parse and perform
FTP transfers using FTP URLs. At its heart, urllib imports and uses
ftplib, making urllib another client of ftplib.
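To illustrate the pieces of an FTP URL, the following Python 3 sketch splits one with urllib.parse.urlsplit() (in Python 2, the equivalent function lives in the urlparse module). The URL is hypothetical; example.com is a reserved documentation domain. The code only parses the URL and performs no transfer.

```python
from urllib.parse import urlsplit

# Hypothetical URL with a login and password embedded in clear text.
parts = urlsplit('ftp://user:passwd@ftp.example.com/pub/README')

scheme = parts.scheme      # 'ftp'
user = parts.username      # 'user'
password = parts.password  # 'passwd'
host = parts.hostname      # 'ftp.example.com'
path = parts.path          # '/pub/README'
```

These are the same values you would otherwise pass by hand to ftplib.FTP(), login(), and a RETR command.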
FTP is not only useful for downloading client applications to build and/
or use, but it can also be helpful in your everyday job for moving files
between systems. For example, suppose that you are an engineer or a system administrator needing to transfer files. It is an obvious choice to use
the scp or rsync commands when crossing the Internet boundary or
pushing files to an externally visible server. However, there is a penalty
when moving extremely large logs or database files between internal computers on a secure network in that manner: security, encryption, compression/decompression, etc. If what you want to do is just build a simple FTP
application that moves files for you quickly after hours, using
Python is a great way to do it!
You can read more about FTP in the FTP Protocol Definition/Specification
(RFC 959) at http://tools.ietf.org/html/rfc959 as well as on the
www.networksorcery.com/enp/protocol/ftp.htm Web page. Other related RFCs include
2228, 2389, 2428, 2577, 2640, and 4217. To find out more about Python’s FTP
support, you can start at http://docs.python.org/library/ftplib.
3.3 Network News
3.3.1 Usenet and Newsgroups
The Usenet News System is a global archival bulletin board. There are
newsgroups for just about any topic, from poems to politics, linguistics to
computer languages, software to hardware, planting to cooking, finding
or announcing employment opportunities, music and magic, breaking up
or finding love. Newsgroups can be general and worldwide or targeted
toward a specific geographic region.
The entire system is a large global network of computers that participate
in sharing Usenet postings. Once a user uploads a message to his local
Usenet computer, it will then be propagated to other adjoining Usenet
computers, and then to the neighbors of those systems, until it’s gone
around the world and everyone has received the posting. Postings will live
on Usenet for a finite period of time, either dictated by a Usenet system
administrator or the posting itself via an expiration date/time.
Each system has a list of newsgroups that it subscribes to and only
accepts postings of interest—not all newsgroups may be archived on a
server. Usenet news service is dependent on which provider you use.
Many are open to the public; others only allow access to specific users,
such as paying subscribers, or students of a particular university, etc. A
login and password are optional, configurable by the Usenet system
administrator. Whether users may post or only download is another parameter
configurable by the administrator.
Usenet has lost its place as the global bulletin board, superseded in
large part by online forums. Still it’s worthwhile looking at Usenet here
specifically for its network protocol.
While older incarnations of Usenet used UUCP as their network transport mechanism, another protocol arose in the mid-1980s when most network traffic began to migrate to TCP/IP. We’ll look at this newer protocol
next.
3.3.2 Network News Transfer Protocol
The method by which users download newsgroup postings (articles),
or perhaps post new articles, is called the Network News Transfer Protocol (NNTP). It was authored by Brian Kantor (University of California, San
Diego) and Phil Lapsley (University of California, Berkeley) in RFC 977,
published in February 1986. The protocol has since been updated in RFC
2980, published in October 2000.
As another example of client/server architecture, NNTP operates in a
fashion similar to FTP; however, it is much simpler. Rather than having a
whole set of different port numbers for logging in, data, and control,
NNTP uses only one standard port for communication, 119. You give the
server a request, and it responds appropriately, as shown in Figure 3-2.
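To make the request/response exchange a little more concrete, here is a minimal sketch of how a client might interpret the status line that begins every NNTP reply. The three-digit code and the meaning of its first digit follow RFC 977; the function name parse_status and the sample replies are our own illustrations and are not part of any library.

```python
# Meaning of the first digit of an NNTP status code (per RFC 977).
CATEGORIES = {
    '1': 'informative message',
    '2': 'command ok',
    '3': 'command ok so far, send the rest of it',
    '4': 'command was correct, but could not be performed',
    '5': 'command unimplemented, incorrect, or serious error',
}

def parse_status(line):
    """Split a reply such as '200 server ready' into (code, category, text)."""
    line = line.rstrip('\r\n')
    code, _, text = line.partition(' ')
    if len(code) != 3 or not code.isdigit() or code[0] not in CATEGORIES:
        raise ValueError('not an NNTP status line: %r' % line)
    return int(code), CATEGORIES[code[0]], text

if __name__ == '__main__':
    # Hypothetical replies, for illustration only.
    for reply in ('200 news.example.com ready\r\n',
                  '411 no such news group\r\n'):
        print(parse_status(reply))
```

A real client rarely needs to do this by hand, because a library such as nntplib reads the status line for you and raises an exception for error codes. The sketch only shows what "it responds appropriately" looks like on the wire.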
3.3.3 Python and NNTP
Based on your experience with Python and FTP in the previous section,
you can probably guess that there is an nntplib and an nntplib.NNTP
class that you need to instantiate, and you would be right. As with FTP,
all we need to do is to import that Python module and make the appropriate calls in Python. So let’s review the protocol briefly:
1. Connect to server
2. Log in (if applicable)
3. Make service request(s)
4. Quit
[Figure 3-2 diagram: Usenet on the Internet. NNTP clients (newsreaders) read from and post to NNTP servers; the servers update one another over NNTP.]
Figure 3-2 NNTP Clients and Servers on the Internet. Clients mostly read news but can also post.
Articles are then distributed as servers update each other.
Look somewhat familiar? It should, because it’s practically a carbon
copy of using the FTP protocol. The only change is that the login step is
optional, depending on how an NNTP server is configured.
Here is some Python pseudocode to get started:
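from nntplib import NNTP
n = NNTP('your.nntp.server')    # connect (placeholder host name)
rsp, ct, fst, lst, grp = n.group('comp.lang.python')    # select a newsgroup
...    # make further service requests here
n.quit()    # close the connection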
Typically, once you log in, you will choose a newsgroup of interest and
call the group() method. It returns the server reply, a count of the number
of articles, the ID of the first and last articles, and superfluously, the group
name again. Once you have this information, you will then perform some
sort of action, such as scroll through and browse articles, download entire
postings (headers and body of article), or perhaps post an article.
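To illustrate what group() unpacks, here is a small sketch that parses the kind of reply line a server sends after a newsgroup is selected. The "211 count first last name" layout comes from RFC 977; the GroupInfo tuple, the parse_group_reply name, and the sample numbers are hypothetical, and converting the fields to integers is our own choice rather than something the library does for you here.

```python
from collections import namedtuple

# Field names mirror the values described in the text.
GroupInfo = namedtuple('GroupInfo', 'response count first last name')

def parse_group_reply(line):
    """Parse a '211 count first last name' reply into a GroupInfo tuple."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != '211':
        raise ValueError('unexpected GROUP reply: %r' % line)
    count, first, last = (int(f) for f in fields[1:4])
    return GroupInfo(line, count, first, last, fields[4])

if __name__ == '__main__':
    # Hypothetical server reply; real numbers depend on your server.
    info = parse_group_reply('211 1048 470479 471526 comp.lang.python')
    print('%s: %d articles, #%d through #%d' % (
        info.name, info.count, info.first, info.last))
```

Note that the count is only the server's estimate, so a client should not assume that every number between the first and last article IDs is actually available.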
Before we take a look at a real example, let’s introduce some of the more
popular methods of the nntplib.NNTP class.
3.3.4 nntplib.NNTP Class Methods
As in the previous section outlining the ftplib.FTP class methods, we will
not show you all methods of nntplib.NNTP, just the ones you need in
order to create an NNTP client application.
As with the FTP objects in Table 3-1, there are more NNTP object methods than are described in Table 3-2. To avoid clutter, we list only the ones
we think you would most likely use. For the rest, we again refer you to the
Python Library Reference.
Table 3-2 Methods for NNTP Objects
Method                      Description
group(name)                 Select newsgroup name and return a tuple (rsp, ct, fst, lst,
                            group): server response, number of articles, first and last
                            article numbers, and group name, all of which are strings
                            (name == group)
xhdr(hdr, artrg[, ofile])   Return list of hdr headers for article range artrg
                            (“first-last” format) or output data to file ofile
body(id[, ofile])           Get article body given its id, which is either a message ID
                            (enclosed in <...>) or an article number (as a string);
                            returns tuple (rsp, anum, mid, data): server response,
                            article number (as a string), message ID (enclosed in
                            <...>), and list of article lines, or outputs data to file
                            ofile
head(id)                    Similar to body(); same tuple returned except lines only
                            contain article headers
article(id)                 Also similar to body(); same tuple returned except lines
                            contain both headers and article body
stat(id)                    Set article “pointer” to id (message ID or article number as
                            above); returns tuple similar to body() (rsp, anum, mid) but
                            contains no data from article
next()                      Used with stat(), moves article pointer to next article and
                            returns similar tuple
last()                      Also used with stat(), moves article pointer to the previous
                            (“last”) article and returns similar tuple
post(ufile)                 Upload data from ufile file object (using ufile.readline())
                            and post to current newsgroup
quit()                      Close connection and quit
3.3.5 An Interactive NNTP Example
Here is an interactive example of how to use Python’s NNTP library. It
should look similar to the interactive FTP example. (The e-mail addresses
have been changed for privacy reasons.)
When connecting to a group, you get a 5-tuple back from the group()
method, as described in Table 3-2.
>>> from nntplib import NNTP
>>> n = NNTP('your.nntp.server')
>>> rsp, ct, fst, lst, grp = n.group('comp.lang.python')
>>> rsp, anum, mid, data = n.article('110457')
>>> for eachLine in data:
...     print eachLine
...
From: "Alex Martelli" <[email protected]>
Subject: Re: Rounding Question
Date: Wed, 21 Feb 2001 17:05:36 +0100
"Remco Gerlich" <[email protected]> wrote:
> Jacob Kaplan-Moss <[email protected]> wrote in comp.lang.python:
>> So I've got a number between 40 and 130 that I want to round up to
>> the nearest 10. That is:
>>
>>   40 --> 40, 41 --> 50, ..., 49 --> 50, 50 --> 50, 51 --> 60
> Rounding like this is the same as adding 5 to the number and then
> rounding down. Rounding down is substracting the remainder if you were
> to divide by 10, for which we use the % operator in Python.
This will work if you use +9 in each case rather than +5 (note that he
doesn't really want rounding -- he wants 41 to 'round' to 50, for ex).
Alex
>>> n.quit()
'205 closing connection - goodbye!'
>>>
3.3.6 Client Program NNTP Example
For our NNTP client in Example 3-2, we are going to try to be more adventurous. It will be similar to the FTP client example in that we are going to
download the latest of something—this time it will be the latest article
available in the Python language newsgroup, comp.lang.python.
Once we have it, we will display (up to) the first 20 lines in the article,
and on top of that, (up to) the first 20 meaningful lines of the article. By that,
we mean lines of real data, not quoted text (which begin with “>” or “|”)
or even quoted text introductions like “In article <. . .>, [email protected]
wrote:”.
Finally, we are going to process blank lines intelligently. We will display
one blank line when we see one in the article, but if there is more than one
consecutive blank line, we show only the first blank line of the set. Only
lines with real data are counted toward the first 20 lines, so it is possible
to display a maximum of 39 lines of output, 20 real lines of data interleaved
with 19 blank lines.
Example 3-2 NNTP Download Example (getFirstNNTP.py)
This script downloads and displays the first meaningful (up to 20) lines of the
most recently available article in comp.lang.python, the Python newsgroup.
1   #!/usr/bin/env python
2
3   import nntplib
4   import socket
5
6   HOST = 'your.nntp.server'
7   GRNM = 'comp.lang.python'
8   USER = 'wesley'
9   PASS = 'youllNeverGuess'
10
11  def main():
12
13      try:
14          n = nntplib.NNTP(HOST)
15          #, user=USER, password=PASS)
16      except socket.gaierror as e:
17          print 'ERROR: cannot reach host "%s"' % HOST
18          print '    ("%s")' % e.args[1]
19          return
20      except nntplib.NNTPPermanentError as e:
21          print 'ERROR: access denied on "%s"' % HOST
22          print '    ("%s")' % str(e)
23          return
24      print '*** Connected to host "%s"' % HOST
25
26      try:
27          rsp, ct, fst, lst, grp = n.group(GRNM)
28      except nntplib.NNTPTemporaryError as e:
29          print 'ERROR: cannot load group "%s"' % GRNM
30          print '    ("%s")' % str(e)
31          print '    Server may require authentication'
32          print '    Uncomment/edit login line above'
33          n.quit()
34          return
35      except nntplib.NNTPError as e:    # any other NNTP failure
36          print 'ERROR: group "%s" unavailable' % GRNM
37          print '    ("%s")' % str(e)
38          n.quit()
39          return
40      print '*** Found newsgroup "%s"' % GRNM
41
42      rng = '%s-%s' % (lst, lst)
43      rsp, frm = n.xhdr('from', rng)
44      rsp, sub = n.xhdr('subject', rng)
45      rsp, dat = n.xhdr('date', rng)
46      print '''*** Found last article (#%s):
47
48      From: %s
49      Subject: %s
50      Date: %s
51  '''% (lst, frm[0][1], sub[0][1], dat[0][1])
52
53      rsp, anum, mid, data = n.body(lst)
54      displayFirst20(data)
55      n.quit()
56
57  def displayFirst20(data):
58      print '*** First (<= 20) meaningful lines:\n'
59      count = 0
60      lines = (line.rstrip() for line in data)
61      lastBlank = True
62      for line in lines:
63          if line:
64              lower = line.lower()
65              if (lower.startswith('>') and not \
66                  lower.startswith('>>>')) or \
67                  lower.startswith('|') or \
68                  lower.startswith('in article') or \
69                  lower.endswith('writes:') or \
70                  lower.endswith('wrote:'):
71                  continue
72          if not lastBlank or (lastBlank and line):
73              print '    %s' % line
74              if line:
75                  count += 1
76                  lastBlank = False
77              else:
78                  lastBlank = True
79          if count == 20:
80              break
81
82  if __name__ == '__main__':
83      main()
If no errors occur when we run our script, we might see something like
this:
$ getFirstNNTP.py
*** Connected to host "your.nntp.server"
*** Found newsgroup "comp.lang.python"
*** Found last article (#471526):
From: "Gerard Flanagan" <[email protected]>
Subject: Re: Generate a sequence of random numbers that sum up to 1?
Date: Sat Apr 22 10:48:20 CEST 2006
*** First (<= 20) meaningful lines:
    def partition(N=5):
        vals = sorted( random.random() for _ in range(2*N) )
        vals = [0] + vals + [1]
        for j in range(2*N+1):
            yield vals[j:j+2]
    deltas = [ x[1]-x[0] for x in partition() ]
    print deltas
    print sum(deltas)
    [0.10271966686994982, 0.13826576491042208, 0.064146913555132801,
    0.11906452454467387, 0.10501198456091299, 0.011732423830768779,
    0.11785369256442912, 0.065927165520102249, 0.098351305878176198,
    0.077786747076205365, 0.099139810689226726]
    1.0
$
This output was produced from the original newsgroup posting, which looks like this:
From: "Gerard Flanagan" <[email protected]>
Subject: Re: Generate a sequence of random numbers that sum up to 1?
Date: Sat Apr 22 10:48:20 CEST 2006
Groups: comp.lang.python
Gerard Flanagan wrote:
> Anthony Liu wrote:
> > I am at my wit's end.
> > I want to generate a certain number of random numbers.
> > This is easy, I can repeatedly do uniform(0, 1) for
> > example.
> > But, I want the random numbers just generated sum up
> > to 1 .
> > I am not sure how to do this. Any idea? Thanks.
> -------------------------------------------------------------
> import random
> def partition(start=0,stop=1,eps=5):
>     d = stop - start
>     vals = [ start + d * random.random() for _ in range(2*eps) ]
>     vals = [start] + vals + [stop]
>     vals.sort()
>     return vals
> P = partition()
> intervals = [ P[i:i+2] for i in range(len(P)-1) ]
> deltas = [ x[1] - x[0] for x in intervals ]
> print deltas
> print sum(deltas)
> --------------------------------------------------------------
def partition(N=5):
    vals = sorted( random.random() for _ in range(2*N) )
    vals = [0] + vals + [1]
    for j in range(2*N+1):
        yield vals[j:j+2]
deltas = [ x[1]-x[0] for x in partition() ]
print deltas
print sum(deltas)
[0.10271966686994982, 0.13826576491042208, 0.064146913555132801,
0.11906452454467387, 0.10501198456091299, 0.011732423830768779,
0.11785369256442912, 0.065927165520102249, 0.098351305878176198,
0.077786747076205365, 0.099139810689226726]
1.0
Of course, the output will always be different, because articles are
always being posted. No two executions will result in the same output
unless your news server has not been updated with another article since
you last ran the script.
Line-by-Line Explanation
Lines 1–9
This application starts with a few import statements and some constants,
much like the FTP client example.
Lines 11–40
In the first section, we attempt to connect to the NNTP host server and
abort if unsuccessful (lines 13–24). Line 15 is commented out deliberately
in case your server requires authentication (with login and password)—if
so, uncomment this line and edit it in line 14. This is followed by trying to
load up the specific newsgroup. Again, it will quit if that newsgroup does
not exist, is not archived by this server, or if authentication is required
(lines 26–40).
Lines 42–55
In the next part, we get some headers to display (lines 42–51). The ones
that have the most meaning are the author, subject, and date. This data is
retrieved and displayed to the user. Each call to the xhdr() method
requires us to give the range of articles from which to extract the headers.
We are only interested in a single message, so the range is “X-X,” where X
is the last message number.
xhdr() returns a 2-tuple consisting of a server response (rsp) and a list
of the headers in the range we specify. Because we are only requesting
this information for one message (the last one), we just take the first element of the list (hdr[0]). That data item is a 2-tuple consisting of the article number and the data string. Because we already know the article
number (we give it in our range request), we are only interested in the second item, the data string (hdr[0][1]).
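The indexing described above can be sketched without a live server. In the snippet below, the helper names last_article_range and first_header are hypothetical, and the tuple named fake merely stands in for what a call such as n.xhdr('subject', rng) might return; its contents are made up for illustration.

```python
def last_article_range(last):
    """Build the 'X-X' range string that selects a single article."""
    return '%s-%s' % (last, last)

def first_header(xhdr_result):
    """Return the header text of the first (artnum, text) pair in the list."""
    rsp, headers = xhdr_result
    if not headers:
        return None          # the server returned no headers for the range
    artnum, text = headers[0]
    return text

if __name__ == '__main__':
    # A stand-in for an xhdr() result; the values are invented.
    fake = ('221 subject fields follow',
            [('471526', 'Re: Rounding Question')])
    print(last_article_range('471526'))
    print(first_header(fake))
```

Checking for an empty list before indexing is a small safety net that the example script omits: if the last article has expired or been cancelled, the list could be empty and hdr[0][1] would raise an IndexError.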
The last part is to download the body of the article itself (lines 53–55). It
consists of a call to the body() method, a display of the first 20 or fewer meaningful lines (as defined at the beginning of this section), and a logout from the
server, which completes execution.
Lines 57–80
The core piece of processing is done by the displayFirst20() function
(lines 57–80). It takes the set of lines that make up the article body and does
some preprocessing, such as setting our counter to 0, creating a generator
expression that lazily iterates through our (possibly large) set of lines making up the body, and “pretends” that we have just seen and displayed a
blank line (more on this later; lines 59–61). “Genexps” were added in
Python 2.4, so if you’re still using version 2.0–2.3, change this to a list comprehension, instead. (Really, you shouldn’t be using anything older than
version 2.4.) When we strip the line of data, we only remove the trailing
whitespace (rstrip()) because leading spaces might be intended lines of
Python code.
One criterion we have is that we should not show any quoted text or
quoted text introductions. That is the purpose of the big if statement on lines
65–71 (along with line 64). We perform this check only if the line is not blank
(line 63). We lowercase the line so that our comparisons are case-insensitive
(line 64).
If a line begins with “>” or “|,” it means it is usually a quote. We make
an exception for lines that start with “>>>” because it might be an interactive interpreter line, although this does introduce a flaw that a triply-old
message (one quoted three times for the fourth responder) is displayed.
(One of the exercises at the end of the chapter is to remove this flaw.) Lines
that begin with “in article. . .”, and/or end with “writes:” or “wrote:”, both
with trailing colons (:), are also quoted text introductions. We skip all these
with the continue statement.
Now to address the blank lines. We want our application to be smart. It
should show blank lines as seen in the article, but it should be smart about
it. If there is more than one blank line consecutively, only show the first
one so that the user does not see unnecessary lines that scroll useful information off the screen. We should also not count any blank lines in our
set of 20 meaningful lines. All of these requirements are taken care of in
lines 72–78.
The if statement on line 72 only displays the line if the last line was not
blank, or if the last line was blank but now we have a non-blank line. In
other words, if we fall through and we print the current line, it is because
it is either a line with data or a blank line, as long as the previous line was
not blank. Now the other tricky part: if we have a non-blank line, count it
and set the lastBlank flag to False because this line was not empty (lines
74–76). Otherwise, we have just seen a blank line, so set the flag to True.
Now back to the business on line 61. We set the lastBlank flag to True,
because if the first real (non-introductory or quoted) line of the body is a
blank, we do not want to display it; we want to show the first real data line!
Finally, if we have seen 20 non-blank lines, then we quit and discard the
remaining lines (lines 79–80). Otherwise, we would have exhausted all the
lines and the for loop terminates normally.
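As a cross-check of the logic just described, here is one possible rearrangement of the same filtering rules as a generator that yields lines instead of printing them, which makes the behavior easy to test. It is a sketch rather than a replacement for the example script: the name meaningful_lines and the limit parameter are our own, and we assume that data is an iterable of text strings.

```python
def meaningful_lines(data, limit=20):
    """Yield up to `limit` non-quoted lines, collapsing runs of blank lines."""
    count = 0
    last_blank = True             # pretend a blank line was just shown
    for line in data:
        line = line.rstrip()      # keep leading spaces (may be code)
        if line:
            lower = line.lower()
            quoted = ((lower.startswith('>') and
                       not lower.startswith('>>>')) or
                      lower.startswith('|'))
            intro = (lower.startswith('in article') or
                     lower.endswith(('writes:', 'wrote:')))
            if quoted or intro:
                continue          # skip quotes and their introductions
            count += 1
            last_blank = False
            yield line
        elif not last_blank:
            last_blank = True     # show only the first blank of a run
            yield line
        if count == limit:
            break

if __name__ == '__main__':
    # A made-up article body for illustration.
    sample = ['> quoted text', '', '', 'Joe wrote:', 'real line', '', '',
              '    indented code  ', '>>> 1 + 1']
    for line in meaningful_lines(sample):
        print('    %s' % line)
```

Separating the filtering from the printing also means the same function could feed a pager, a file, or a test, and the caller decides how many lines to ask for.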
3.3.7 Miscellaneous NNTP
You can read more about NNTP in the NNTP Protocol Definition/
Specification (RFC 977) at http://tools.ietf.org/html/rfc977 as well as on
the http://www.networksorcery.com/enp/protocol/nntp.htm Web page. Other
related RFCs include 1036 and 2980. To find out more about Python’s NNTP
support, you can start at http://docs.python.org/library/nntplib.
3.4 E-Mail
E-mail is both archaic and modern at the same time. For those of us who
have been using the Internet since the early days, e-mail seems so “old,”
especially compared to newer and more immediate communication mechanisms, such as Web-based online chat, instant messaging (IM), and digital telephony such as Voice over Internet Protocol (VoIP) applications. The
next section gives a high-level overview of how e-mail works. If you are
already familiar with this and just want to move on to developing e-mail-related clients in Python, skip to the succeeding sections.
Before we take a look at the e-mail infrastructure, have you ever asked
yourself what is the exact definition of an e-mail message? Well, according
to RFC 2822, “[a] message consists of header fields (collectively called ‘the
header of the message’) followed, optionally, by a body.” When we think
of e-mail as users, we immediately think of its contents, whether it be a
real message or an unsolicited commercial advertisement (a.k.a. spam).
However, the RFC states that the body itself is optional and that only the
headers are required. Imagine that!
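To see the "headers only" case in practice, the following sketch feeds a message with no body to the parser in Python's standard email package. The address and date are placeholders that we made up for illustration.

```python
from email import message_from_string

# A hypothetical message with header fields only; the blank line that
# normally separates headers from the body is followed by nothing.
raw = (
    'Date: Wed, 21 Feb 2001 17:05:36 +0100\r\n'
    'From: sender@example.com\r\n'
    '\r\n'
)

msg = message_from_string(raw)
print(msg['Date'])                  # origination date field
print(msg['From'])                  # originator address field
print(repr(msg.get_payload()))      # the body is empty
```

The parser accepts the message without complaint, which is consistent with the RFC's definition: the header fields make up the message, and the body is an optional extra.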
3.4.1 E-Mail System Components and Protocols
Despite what you might think, e-mail actually existed before the modern
Internet came around. It actually started as a simple message exchange
between mainframe users; there wasn’t even any networking involved as
senders and receivers all used the same computer. Then when networking
became a reality, it was possible for users on different hosts to exchange
messages. This, of course, was a complicated concept because people used
different computers, which more than likely also used different networking protocols. It was not until the early 1980s that message exchange settled on a single de facto standard for moving e-mail around the Internet.
Before we get into the details, let’s first ask ourselves, how does e-mail
work? How does a message get from sender to recipient across the vastness of all the computers accessible on the Internet? To put it simply, there
is the originating computer (the sender’s message departs from here) and
the destination computer (recipient’s mail server). The optimal solution is
if the sending computer knows exactly how to reach the receiving host,
because then it can make a direct connection to deliver the message. However, this is usually not the case.
The sending computer queries to find another intermediate host that
can pass the message along its way to the final recipient host. Then that
host searches for the next host that is another step closer to the destination. So in between the originating and final destination hosts are any
number of computers. These are called hops. If you look carefully at the
full e-mail headers of any message you receive, you will see a “passport”
stamped with all the places your message bounced to before it finally
reached you.
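The "passport stamps" are the Received: header fields that each mail server adds to the top of a message as it passes through. As a rough sketch, the snippet below uses the standard email package to list them in travel order; the host names and the whole sample message are invented, and the function name hops is our own.

```python
from email import message_from_string

# Hypothetical headers for illustration; host names are made up.
raw = (
    'Received: from mx.example.net by mail.example.org; '
    'Wed, 21 Feb 2001 17:05:40 +0100\n'
    'Received: from smtp.example.com by mx.example.net; '
    'Wed, 21 Feb 2001 17:05:38 +0100\n'
    'From: sender@example.com\n'
    'To: recipient@example.org\n'
    'Subject: hops\n'
    '\n'
    'body\n'
)

def hops(message_text):
    """Return the Received: header values, oldest hop first."""
    msg = message_from_string(message_text)
    stamps = msg.get_all('Received', [])
    return list(reversed(stamps))   # each server prepends its own stamp

if __name__ == '__main__':
    for number, stamp in enumerate(hops(raw), 1):
        print('%d. %s' % (number, stamp))
```

Because each server prepends its stamp, the list as stored in the message reads newest first; reversing it gives the route in the order the message actually traveled.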
To get a clearer picture, let’s take a look at the components of the e-mail
system. The foremost component is the message transport agent (MTA). This
is a server process running on a mail exchange host that is responsible for
the routing, queuing, and sending of e-mail. These represent all the hosts
that an e-mail message bounces from, beginning at the source host all the
way to the final destination host and all hops in between. Thus, they are
“agents” of “message transport.”
For all this to work, MTAs need to know two things: 1) how to determine
the next MTA to forward a message to, and 2) how to talk to another MTA.
The first is solved by using a domain name service (DNS) lookup to find the MX
(Mail eXchange) of the destination domain. This is not necessarily the final
recipient; it might simply be the next recipient who can eventually get the
message to its final destination. Next, how do MTAs forward messages to
other MTAs?
3.4.2 Sending E-Mail
To send e-mail, your mail client must connect to an MTA, and the only
language they understand is a communication protocol. The way MTAs
communicate with one another is by using a message transport system
(MTS). This protocol must be recognized by a pair of MTAs before they
can communicate with one another. As we described at the beginning
of this section, such communication was dicey and unpredictable in the
early days because there were so many different types of computer
systems, each running different networking software. In addition, computers were using both networked transmission and dial-up
modems, so delivery times were unpredictable. In fact, this author
remembers a message not showing up for almost nine months after the
message was originally sent! How is that for Internet speed? Out of this
complexity rose the Simple Mail Transfer Protocol (SMTP), one of the
foundations of modern e-mail.
SMTP, ESMTP, LMTP
SMTP was originally authored by the late Jonathan Postel (ISI) in RFC 821,
published in August 1982 and has gone through a few revisions since
then. In November 1995, via RFC 1869, SMTP received a set of service
extensions (ESMTP), and both SMTP and ESMTP were rolled into the current RFC 5321, published in October 2008. We’ll just use the term “SMTP”
to refer to both SMTP and ESMTP. For general applications, you really
only need to be able to log in to a server, send a message, and quit. Everything else is supplemental.
There is also one other alternative known as LMTP (Local Mail Transfer
Protocol) based on SMTP and ESMTP, defined in October 1996 as RFC
2033. One requirement for SMTP is having mail queues, but this requires
additional storage and management, so LMTP provides for a more lightweight system that avoids the necessity of mail queues but does require
messages to be delivered immediately (and not queued). LMTP servers
aren’t exposed externally and work directly with a mail gateway that is
Internet-facing to indicate whether messages are accepted or rejected. The
gateway serves as the queue for LMTP.
MTAs
Some well-known MTAs that have implemented SMTP include:
Open Source MTAs
• Sendmail
• Postfix
• Exim
• qmail
Commercial MTAs
• Microsoft Exchange
• Lotus Notes Domino Mail Server
Note that although they have all implemented the minimum SMTP protocol requirements, most of them, especially the commercial MTAs, have
added even more features to their servers, going above and beyond the
protocol definition.
SMTP is the MTS that is used by most of the MTAs on the Internet for
message exchange. It is the protocol used by MTAs to transfer e-mail from
(MTA) host to (MTA) host. When you send e-mail, you must connect to an
outgoing SMTP server, with which your mail application acts as an SMTP
client. Your SMTP server, therefore, is the first hop for your message.
3.4.3 Python and SMTP
Yes, there is an smtplib and an smtplib.SMTP class to instantiate. Let’s
review this familiar story:
1. Connect to server
2. Log in (if applicable)
3. Make service request(s)
4. Quit
As with NNTP, the login step is optional and only required if the server
has SMTP authentication (SMTP-AUTH) enabled. SMTP-AUTH is defined
in RFC 2554. Also similar to NNTP, speaking SMTP only requires communicating with one port on the server; this time, it’s port 25.
Here is some Python pseudocode to get started:
from smtplib import SMTP
n = SMTP('smtp.yourdomain.com')
...
n.quit()
Before we take a look at a real example, let’s introduce some of the more
popular methods of the smtplib.SMTP class.
3.4.4 smtplib.SMTP Class Methods
In addition to the smtplib.SMTP class, Python 2.6 introduced another pair:
SMTP_SSL and LMTP. The latter implements LMTP, as described earlier in
Section 3.4.2, whereas the former works just like SMTP, except that it communicates over an encrypted socket and is an alternative to SMTP using
TLS. If omitted, the default port for SMTP_SSL is 465.
As in the previous sections, we won't show you all methods which
belong to the class, just the ones you need in order to create an SMTP client
application. For most e-mail sending applications, only two are required:
sendmail() and quit().
All arguments to sendmail() should conform to RFC 2822; that is, e-mail
addresses must be properly formatted, and the message body should have
appropriate leading headers and contain lines that must be delimited by
carriage-return and NEWLINE pairs (\r\n).
Note that an actual message body is not required. According to RFC
2822, “[the] only required header fields are the origination date field and
the originator address field(s),” for example, “Date:” and “From:” (MAIL
FROM, RCPT TO, DATA).
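The formatting requirements above can be sketched as a small helper that assembles message text suitable for passing to sendmail(). The function name build_message and the addresses are hypothetical; formatdate() comes from the standard email.utils module and produces a date string in the style the RFC expects.

```python
from email.utils import formatdate

def build_message(sender, recipients, subject, body_lines):
    """Return message text with headers, a blank line, and CRLF endings."""
    headers = [
        'Date: %s' % formatdate(localtime=True),   # required origination date
        'From: %s' % sender,                       # required originator
        'To: %s' % ', '.join(recipients),
        'Subject: %s' % subject,
    ]
    # An empty string separates the headers from the (optional) body.
    return '\r\n'.join(headers + [''] + list(body_lines)) + '\r\n'

if __name__ == '__main__':
    # Addresses are placeholders for illustration.
    text = build_message('sender@example.com', ['recipient@example.org'],
                         'test msg', ['xxx'])
    print(repr(text))
```

For anything beyond a quick test message, the classes in the standard email package are probably a better choice than string formatting, because they also take care of details such as character encodings and long header lines.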
Table 3-3 presents some common SMTP object methods. There are a few
more methods not described here, but they are not normally required to
send an e-mail message. For more information about all the SMTP object
methods, refer to the Python documentation.
Table 3-3 Common Methods for SMTP Objects
Method                                     Description
sendmail(from, to, msg[, mopts, ropts])    Sends msg from from to to (list/tuple) with optional
                                           ESMTP mail (mopts) and recipient (ropts) options.
ehlo() or helo()                           Initiates a session with an ESMTP or SMTP server using
                                           EHLO or HELO, respectively. Should be optional because
                                           sendmail() will call these as necessary.
starttls(keyfile=None, certfile=None)      Directs server to begin Transport Layer Security (TLS)
                                           mode. If either keyfile or certfile is given, it is
                                           used in the creation of the secure socket.
set_debuglevel(level)                      Sets the debug level for server communication.
quit()                                     Closes connection and quits.
login(user, passwd) a                      Logs in to SMTP server with user name and passwd.
a. SMTP-AUTH only.
3.4.5 Interactive SMTP Example
Once again, we present an interactive example:
>>> from smtplib import SMTP as smtp
>>> s = smtp('smtp.python.is.cool')
>>> s.set_debuglevel(1)
>>> s.sendmail('[email protected]', ('[email protected]',
'[email protected]'), '''From: [email protected]\r\nTo:
[email protected], [email protected]\r\nSubject: test
msg\r\n\r\nxxx\r\n.''')
send: 'ehlo myMac.local\r\n'
reply: '250-python.is.cool\r\n'
reply: '250-7BIT\r\n'
reply: '250-8BITMIME\r\n'
reply: '250-AUTH CRAM-MD5 LOGIN PLAIN\r\n'
reply: '250-DSN\r\n'
reply: '250-EXPN\r\n'
reply: '250-HELP\r\n'
reply: '250-NOOP\r\n'
reply: '250-PIPELINING\r\n'
reply: '250-SIZE 15728640\r\n'
reply: '250-STARTTLS\r\n'
reply: '250-VERS V05.00c++\r\n'
reply: '250 XMVP 2\r\n'
reply: retcode (250); Msg: python.is.cool
7BIT
8BITMIME
AUTH CRAM-MD5 LOGIN PLAIN
DSN
EXPN
HELP
NOOP
PIPELINING
SIZE 15728640
STARTTLS
VERS V05.00c++
XMVP 2
send: 'mail FROM:<[email protected]> size=108\r\n'
reply: '250 ok\r\n'
reply: retcode (250); Msg: ok
send: 'rcpt TO:<[email protected]>\r\n'
reply: '250 ok\r\n'
reply: retcode (250); Msg: ok
send: 'data\r\n'
reply: '354 ok\r\n'
reply: retcode (354); Msg: ok
data: (354, 'ok')
send: 'From: [email protected]\r\nTo:
[email protected]\r\nSubject: test msg\r\n\r\nxxx\r\n..\r\n.\r\n'
reply: '250 ok ; id=2005122623583701300or7hhe\r\n'
reply: retcode (250); Msg: ok ; id=2005122623583701300or7hhe
data: (250, 'ok ; id=2005122623583701300or7hhe')
{}
>>> s.quit()
send: 'quit\r\n'
reply: '221 python.is.cool\r\n'
reply: retcode (221); Msg: python.is.cool
3.4.6 Miscellaneous SMTP
You can read more about SMTP in the SMTP Protocol Definition/Specification,
RFC 5321, at http://tools.ietf.org/html/rfc5321. To find out more about Python’s
SMTP support, go to http://docs.python.org/library/smtplib.
One of the more important aspects of e-mail which we have not discussed yet is how to properly format Internet addresses as well as e-mail
messages themselves. This information is detailed in the latest Internet
Message Format specification, RFC 5322, which is accessible at
http://tools.ietf.org/html/rfc5322.
3.4.7 Receiving E-Mail
Back in the day, communicating by e-mail on the Internet was relegated to
university students, researchers, and employees of private industry and
commercial corporations. Desktop computers were predominantly still
Unix-based workstations. Home users focused mainly on dial-up Web
access on PCs and really didn’t use e-mail. When the Internet began to
explode in the mid-1990s, e-mail came home to everyone.
Because it was not feasible for home users to have workstations in their
dens running SMTP, a new type of system had to be devised to leave e-mail
on an incoming mail host while periodically downloading mail for offline
reading. Such a system had to consist of both a new application and a new
protocol to communicate with the mail server.
The application, which runs on a home computer, is called a mail user
agent (MUA). An MUA will download mail from a server, perhaps automatically deleting it from the server in the process (or leaving the mail on
the server to be deleted manually by the user). However, an MUA must
also be able to send mail; in other words, it should also be able to speak
SMTP to communicate directly to an MTA when sending mail. We have
already seen this type of client in the previous section when we looked at
SMTP. How about downloading mail then?
3.4.8 POP and IMAP
The first protocol developed for downloading was the Post Office Protocol.
As stated in the original RFC document, RFC 918 published in October
1984, “The intent of the Post Office Protocol (POP) is to allow a user’s
workstation to access mail from a mailbox server. It is expected that mail
will be posted from the workstation to the mailbox server via the Simple
Mail Transfer Protocol (SMTP).” The most recent version of POP is version 3,
otherwise known as POP3. POP3, defined in RFC 1939, is still widely used
today.
A competing protocol came a few years after POP, known as the Internet
Message Access Protocol, or IMAP. (IMAP has also been known by various
other names: “Internet Mail Access Protocol,” “Interactive Mail Access
Protocol,” and “Interim Mail Access Protocol.”) The first version was
experimental, and it was not until version 2 that its RFC was published
(RFC 1064 in July 1988). It is stated in RFC 1064 that IMAP2 was inspired
by the second version of POP, POP2.
The intent of IMAP is to provide a more complete solution than POP;
however, it is also more complex than POP. For example, IMAP is well
suited to today’s needs because users interact with their e-mail messages from more than a single device, such as desktop/laptop/tablet computers, mobile phones, video game systems, etc. POP does not work well
with multiple mail clients and, although still widely used, is mostly obsolete. Note that many ISPs currently only provide POP for receiving (and
SMTP for sending) e-mail. We anticipate more adoption of IMAP as we
move forward.
The current version of IMAP in use today is IMAP4rev1, and it, too, is
widely used. In fact, Microsoft Exchange, one of the predominant mail
servers in the world today, supports IMAP as one of its download mechanisms. At the
time of this writing, the latest draft of the IMAP4rev1 protocol definition is
spelled out in RFC 3501, published in March 2003. We use the term
“IMAP4” to refer to both the IMAP4 and IMAP4rev1 protocols, collectively.
For further reading, we suggest that you take a look at the aforementioned RFC documents. The diagram in Figure 3-3 illustrates this complex
system we know simply as e-mail.
Now let’s take a closer look at POP3 and IMAP4 support in Python.
3.4.9
Python and POP3
No surprises here: import poplib and instantiate the poplib.POP3 class; the
standard conversation is as expected:
1. Connect to server
2. Log in
3. Make service request(s)
4. Quit
3.4 E-Mail
[Figure 3-3 diagram: sender and recipient mail clients (MUAs) exchange mail with their mail servers (MTAs) across the Internet, using SMTP to send and POP3/IMAP4 to receive; a spam and virus filtering device is also shown alongside a mail server.]
Figure 3-3 E-Mail Senders and Recipients on the Internet. Clients download and send mail via
their MUAs, which talk to their corresponding MTAs. E-mail “hops” from MTA to MTA until it
reaches the correct destination.
And the expected Python pseudocode:
from poplib import POP3
p = POP3('pop.python.is.cool')
p.user(...)
p.pass_(...)
...
p.quit()
Before we take a look at a real example, we should mention that there is
also a poplib.POP3_SSL class (added in version 2.4) which performs mail
transfer over an encrypted connection, provided the appropriate credentials are supplied. Let’s take a look at an interactive example as well as
introduce the basic methods of the poplib.POP3 class.
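The four-step conversation can also be wrapped in a small helper. What follows is a minimal sketch written for Python 3 (the listings in this chapter are Python 2); the function name mailbox_status() and its parameters are our own illustrative choices, not part of poplib. It accepts any already-connected poplib.POP3 or poplib.POP3_SSL object, so the same code serves plain and encrypted connections alike.

```python
import poplib


def mailbox_status(conn, user, password):
    """Log in on an open POP3 connection and return (msg_count, mbox_size).

    'conn' is assumed to be a poplib.POP3 or poplib.POP3_SSL instance, or
    any object that offers the same user/pass_/stat/quit methods.
    """
    try:
        conn.user(user)        # step 2: log in, name first...
        conn.pass_(password)   # ...then the password
        return conn.stat()     # step 3: a single service request
    finally:
        conn.quit()            # step 4: always quit, even after an error


# Step 1 (connecting) is left to the caller. Host and credentials below
# are placeholders, so the lines are commented out:
# server = poplib.POP3_SSL('pop.python.is.cool')   # default port is 995
# print(mailbox_status(server, 'wesley', 'youllNeverGuess'))
```

Keeping the connection step outside the helper is a design choice that makes it easy to try the function against a stand-in object before pointing it at a real server.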
3.4.10
An Interactive POP3 Example
Below is an interactive example that uses Python’s poplib. The exception
you see comes from deliberately entering an incorrect password just to
demonstrate what you’ll get back from the server in practice. Here is the
interactive output:
>>> from poplib import POP3
>>> p = POP3('pop.python.is.cool')
>>> p.user('wesley')
'+OK'
>>> p.pass_("you'llNeverGuess")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/local/lib/python2.4/poplib.py", line 202,
in pass_
return self._shortcmd('PASS %s' % pswd)
File "/usr/local/lib/python2.4/poplib.py", line 165,
in _shortcmd
return self._getresp()
File "/usr/local/lib/python2.4/poplib.py", line 141,
in _getresp
raise error_proto(resp)
poplib.error_proto: -ERR directory status: BAD PASSWORD
>>> p.user('wesley')
'+OK'
>>> p.pass_('youllNeverGuess')
'+OK ready'
>>> p.stat()
(102, 2023455)
>>> rsp, msg, siz = p.retr(102)
>>> rsp, siz
('+OK', 480)
>>> for eachLine in msg:
...     print eachLine
...
Date: Mon, 26 Dec 2005 23:58:38 +0000 (GMT)
Received: from c-42-32-25-43.smtp.python.is.cool
by python.is.cool (scmrch31) with ESMTP
id <2005122623583701300or7hhe>; Mon, 26 Dec 2005 23:58:37
+0000
From: wesley@python.is.cool
To: wesley@python.is.cool
Subject: test msg
xxx
.
>>> p.quit()
'+OK python.is.cool'
3.4.11
poplib.POP3 Class Methods
The POP3 class provides numerous methods to help you download and manage your inbox offline. Those most widely used are included in Table 3-4.
Table 3-4 Common Methods for POP3 Objects

Method          Description
--------------  ----------------------------------------------------------
user(login)     Sends the login name to the server; awaits reply
                indicating the server is waiting for user’s password
pass_(passwd)   Sends passwd (after user logs in with user()); an
                exception occurs on login/passwd failure
stat()          Returns mailbox status, a 2-tuple (msg_ct, mbox_siz):
                the total message count and total mailbox size in octets
list([msgnum])  Superset of stat(); returns entire message list from
                server as a 3-tuple (rsp, msg_list, rsp_siz): server
                response, message list, response message size; if
                msgnum given, return data for that message only
retr(msgnum)    Retrieves message msgnum from server and sets its
                “seen” flag; returns a 3-tuple (rsp, msglines, msgsiz):
                server response, all lines of message msgnum, and
                message size in bytes/octets
dele(msgnum)    Tags message number msgnum for deletion; most servers
                process deletes upon quit()
quit()          Logs out, commits changes (e.g., process “seen,”
                “delete” flags, etc.), unlocks mailbox, terminates
                connection, and then quits
When logging in, the user() method not only sends the login name to
the server, but it also awaits the reply that indicates the server is waiting
for the user’s password. If pass_() fails due to authentication issues, the
exception raised is poplib.error_proto. If it is successful, it gets back a
positive reply, for example, “+OK ready,” and the mailbox on the server
is locked until quit() is called.
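The failure case described above can be handled explicitly. Here is a minimal Python 3 sketch, assuming an already-connected POP3 object; try_login() is a hypothetical helper name of ours, while poplib.error_proto is the real exception class.

```python
import poplib


def try_login(conn, user, password):
    """Return True if the POP3 login succeeds, False if the server says no."""
    try:
        conn.user(user)         # server replies that it awaits a password
        conn.pass_(password)    # raises error_proto on bad credentials
    except poplib.error_proto as exc:
        # The exception carries the server's "-ERR ..." response line.
        print('login failed:', exc)
        return False
    return True
```

As the interactive session showed, a rejected password does not close the connection, so the caller is free to send user() and pass_() again on the same object.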
For the list() method, the msg_list is of the form [‘msgnum msgsiz’,…]
where msgnum and msgsiz are the message number and message sizes,
respectively, of each message.
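To make the 'msgnum msgsiz' format concrete, here is a small Python 3 sketch that turns msg_list into a dictionary. The helper name parse_pop3_list() is our own invention; the only assumption about poplib is that, under Python 3, it hands back each entry as bytes rather than str.

```python
def parse_pop3_list(msg_list):
    """Map message numbers to sizes, given the msg_list from POP3.list()."""
    sizes = {}
    for entry in msg_list:
        if isinstance(entry, bytes):     # Python 3 poplib returns bytes
            entry = entry.decode('ascii')
        msgnum, msgsiz = entry.split()[:2]
        sizes[int(msgnum)] = int(msgsiz)
    return sizes


# Typical use with a logged-in connection 'p' (not run here):
# rsp, msg_list, rsp_siz = p.list()
# print(parse_pop3_list(msg_list))
```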
There are a few other methods that are not listed here. For the full details,
check out the documentation for poplib in the Python Library Reference.
3.4.12
SMTP and POP3 Example
Example 3-3 shows how to use both SMTP and POP3 to create a client that
both sends and receives e-mail. What we are going to do is send an e-mail
message to ourselves (or some test account) via SMTP, wait for a bit (we
arbitrarily chose ten seconds), and then use POP3 to download our message
and assert that the messages are identical. Our operation will be a success
if the program completes silently, with no output and no errors.
Example 3-3
SMTP and POP3 Example (myMail.py)
This script sends a test e-mail message to the destination address (via the outgoing/
SMTP mail server), waits briefly, and then retrieves it from the (incoming mail/POP)
server. You must change the server names and e-mail addresses to make it work
properly.
1   #!/usr/bin/env python
2
3   from smtplib import SMTP
4   from poplib import POP3
5   from time import sleep
6
7   SMTPSVR = 'smtp.python.is.cool'
8   POP3SVR = 'pop.python.is.cool'
9
10  who = 'wesley@python.is.cool'
11  body = '''\
12  From: %(who)s
13  To: %(who)s
14  Subject: test msg
15
16  Hello World!
17  ''' % {'who': who}
18
19  sendSvr = SMTP(SMTPSVR)
20  errs = sendSvr.sendmail(who, [who], body)
21  sendSvr.quit()
22  assert len(errs) == 0, errs
23  sleep(10)    # wait for mail to be delivered
24
25  recvSvr = POP3(POP3SVR)
26  recvSvr.user('wesley')
27  recvSvr.pass_('youllNeverGuess')
28  rsp, msg, siz = recvSvr.retr(recvSvr.stat()[0])
29  # strip headers and compare to orig msg
30  origBody = body.split('\n\n', 1)[1].splitlines()
31  recvBody = msg[msg.index('') + 1:]
32  assert origBody == recvBody # assert identical
Line-by-Line Explanation
Lines 1–8
This application starts with a few import statements and some constants,
much like the other examples in this chapter. The constants here are the
outgoing (SMTP) and incoming (POP3) mail servers.
Lines 10–17
These lines represent the preparation of the message contents. For this test
message, the sender and the recipient will be the same user. Don’t forget
the RFC 5322-required line delimiters with a blank line separating the two
sections.
Lines 19–23
We connect to the outgoing (SMTP) server and send our message. There is
another pair of From and To addresses here. These are the “real” e-mail
addresses, or the envelope sender and recipient(s). The recipient field
should be an iterable. If a string is passed in, it will be transformed
into a list of one element. For unsolicited spam e-mail, there is usually
a discrepancy between the message headers and the envelope headers.
The third argument to sendmail() is the e-mail message itself. Once it
has returned, we log out of the SMTP server and check that no errors
have occurred. Then we give the servers some time to send and receive the
message.
Lines 25–32
The final part of our application downloads the just-sent message and
asserts that the sent and received messages are identical. A connection is
made to the POP3 server with a username and password. After successfully logging in, a stat() call is made to get the mailbox status. Its first
element ([0]) is the total message count, which normally doubles as the
number of the most recently delivered message, and retr() is instructed to
download that message.
We look for the blank line separating the headers and message, discard
the headers, and compare the original message body with the incoming
message body. If they are identical, nothing is displayed and the program
ends successfully. Otherwise, an AssertionError is raised.
Because so many things can go wrong, we left out all the error-checking
for this script to make it a bit easier on the eyes. (One of the exercises
at the end of the chapter is to add the error-checking.)
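The header-stripping step is worth isolating, because the same logic applies to any message that retr() returns. The following is a minimal Python 3 sketch; split_message() is a hypothetical helper of ours, and the decoding step is there only because Python 3's poplib returns each line as bytes.

```python
def split_message(lines):
    """Split the lines returned by POP3.retr() into (headers, body)."""
    # Decode bytes so that str and bytes input are handled the same way.
    text = [ln.decode('utf-8', 'replace') if isinstance(ln, bytes) else ln
            for ln in lines]
    try:
        sep = text.index('')      # the first blank line ends the headers
    except ValueError:
        return text, []           # no blank line: headers only, no body
    return text[:sep], text[sep + 1:]


headers, body = split_message(
    [b'From: someone', b'Subject: test msg', b'', b'Hello World!'])
print(body)
```

Returning an empty body when no blank line is present is one reasonable policy among several; raising an exception would be equally defensible.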
Now you have a very good idea of how sending and receiving e-mail
works in today’s environment. If you wish to continue exploring this realm
of programming expertise, see the next section for other e-mail-related
Python modules, which will prove valuable in application development.
3.4.13
Python and IMAP4
Python supports IMAP4 via the imaplib module. Its use is quite similar to
that of other Internet protocols described in this chapter. To begin, import
imaplib and instantiate one of the imaplib.IMAP4* classes; the standard
conversation is as expected:
1. Connect to server
2. Log in
3. Make service request(s)
4. Quit
The Python pseudocode is also similar to what we’ve seen before:
from imaplib import IMAP4
s = IMAP4('imap.python.is.cool')
s.login(...)
...
s.close()
s.logout()
This module defines three classes, IMAP4, IMAP4_SSL, and IMAP4_stream,
which you can use to connect to an IMAP4-compatible server. Like
POP3_SSL for POP, IMAP4_SSL lets you connect to an IMAP4 server by using
an SSL-encrypted socket. Another subclass of IMAP4 is IMAP4_stream, which
gives you a file-like object interface to an IMAP4 server. The latter pair of
classes was added in Python 2.3.
Now let’s take a look at an interactive example as well as introduce the
basic methods of the imaplib.IMAP4 class.
3.4.14
An Interactive IMAP4 Example
Here is an interactive example that uses Python’s imaplib:
>>> from imaplib import IMAP4
>>> s = IMAP4('imap.python.is.cool') # default port: 143
>>> s.login('wesley', 'youllneverguess')
('OK', ['LOGIN completed'])
>>> rsp, msgs = s.select('INBOX', True)
>>> rsp
'OK'
>>> msgs
['98']
>>> rsp, data = s.fetch(msgs[0], '(RFC822)')
>>> rsp
'OK'
>>> for line in data[0][1].splitlines()[:5]:
...     print line
...
Received: from mail.google.com
by mx.python.is.cool (Internet Inbound) with ESMTP id
316ED380000ED
for <[email protected]>; Fri, 11 Mar 2011 10:49:06 -0500 (EST)
Received: by gyb11 with SMTP id 11so125539gyb.10
for <[email protected]>; Fri, 11 Mar 2011 07:49:03 -0800
(PST)
>>> s.close()
('OK', ['CLOSE completed'])
>>> s.logout()
('BYE', ['IMAP4rev1 Server logging out'])
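The session above can be condensed into one function. This is a minimal Python 3 sketch, not a definitive implementation: fetch_latest_message() is a hypothetical name, and the function assumes it is handed an already-connected imaplib.IMAP4 or imaplib.IMAP4_SSL object. As in the session, the message count reported by select() is used as the number of the message to fetch.

```python
import imaplib


def fetch_latest_message(conn, user, password, mailbox='INBOX'):
    """Return the raw RFC822 bytes of the newest message, or None."""
    conn.login(user, password)
    try:
        rsp, msgs = conn.select(mailbox, readonly=True)
        if rsp != 'OK':
            return None
        try:
            count = int(msgs[0])      # select() reports the message count
            if count == 0:
                return None
            rsp, data = conn.fetch(str(count), '(RFC822)')
            return data[0][1] if rsp == 'OK' else None
        finally:
            conn.close()              # only valid once a mailbox is selected
    finally:
        conn.logout()


# Connecting is left to the caller; the host name is a placeholder:
# conn = imaplib.IMAP4_SSL('imap.python.is.cool')   # default port is 993
# raw = fetch_latest_message(conn, 'wesley', 'youllneverguess')
```

The nested try/finally blocks mirror the protocol's states: close() applies only after a successful select(), whereas logout() applies as soon as we are logged in.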
3.4.15
Common imaplib.IMAP4 Class Methods
As we mentioned earlier, the IMAP protocol is more complex than POP, so
there are many more methods that we’re not documenting here. Table 3-5 lists
just the basic ones you are most likely to use for a simple e-mail application.
Table 3-5 Common Methods for IMAP4 Objects

Method                      Description
--------------------------  ----------------------------------------------
close()                     Closes the current mailbox. If access is not
                            set to read-only, any deleted messages will be
                            discarded.
fetch(message_set,          Retrieves e-mail messages (or requested parts
  message_parts)            via message_parts) stated by message_set.
login(user, password)       Logs in user by using given password.
logout()                    Logs out from the server.
noop()                      Pings the server but takes no action (“no
                            operation”).
search(charset,             Searches mailbox for messages matching the
  *criteria)                given criteria (at least one criterion is
                            required). If charset is None, it defaults to
                            US-ASCII.
select(mailbox='INBOX',     Selects a mailbox (default is INBOX); user not
  readonly=False)           allowed to modify contents if readonly is set.
Below are some examples of using some of these methods.
• NOP, NOOP, or “no operation.” This is meant as a keepalive
to the server:
>>> s.noop()
('OK', ['NOOP completed'])
• Get information about a specific message:
>>> rsp, data = s.fetch('98', '(BODY)')
>>> data[0]
'98 (BODY ("TEXT" "PLAIN" ("CHARSET" "ISO-8859-1" "FORMAT" "flowed"
"DELSP" "yes") NIL NIL "7BIT" 1267 33))'
• Get just the headers of a message:
>>> rsp, data = s.fetch('98', '(BODY[HEADER])')
>>> data[0][1][:45]
'Received: from mail-gy.google.com (mail-gy.go'
• Get the IDs of the messages that have been viewed (try also
using 'ALL', 'NEW', etc.):
>>> s.search(None, 'SEEN')
('OK', ['1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 59 60 61 62
63 64 97'])
• Get more than one message (use a colon (:) as the delimiter;
note ‘)’ is used to delimit results):
>>> rsp, data = s.fetch('98:100', '(BODY[TEXT])')
>>> data[0][1][:45]
'Welcome to Google Accounts. To activate your'
>>> data[2][1][:45]
'\r\n-b1_aeb1ac91493d87ea4f2aa7209f56f909\r\nCont'
>>> data[4][1][:45]
'This is a multi-part message in MIME format.'
>>> data[1], data[3], data[5]
(')', ')', ')')
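The return values shown in these examples take a little unpacking before they are useful. Below is a minimal Python 3 sketch of two helpers; message_ids() and fetched_payloads() are names we made up for illustration. They rely only on the shapes seen above: search() yields one space-separated string of IDs, and fetch() interleaves (envelope, payload) tuples with ')' delimiters.

```python
def message_ids(search_data):
    """Turn the data half of IMAP4.search() into a list of ID strings."""
    ids = search_data[0]
    if isinstance(ids, bytes):          # Python 3 imaplib returns bytes
        ids = ids.decode('ascii')
    return ids.split()


def fetched_payloads(fetch_data):
    """Keep the payload of each (envelope, payload) tuple, skipping ')'."""
    return [item[1] for item in fetch_data if isinstance(item, tuple)]


print(message_ids([b'1 2 3 97']))
print(fetched_payloads([(b'98 (BODY[TEXT] {5}', b'hello'), b')',
                        (b'99 (BODY[TEXT] {5}', b'world'), b')']))
```

Filtering on tuple type, rather than on even list positions as the session does, is our own choice; it avoids depending on the exact position of each delimiter.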
3.4.16
In Practice
E-Mail Composition
So far, we’ve taken a pretty in-depth look at the various ways Python
helps you download e-mail messages. We’ve even discussed how to create
simple text e-mail messages and then connect to SMTP servers to send
them. However, what has been missing is guidance on how to construct
slightly more complex messages in Python. As you can guess, I’m speaking about e-mail messages that are more than plain text, with attachments,
alternative formats, etc. Now is the right time to briefly visit this topic.
These longer messages are normally composed of multiple parts, say a
plain text portion for the message, optionally an HTML equivalent for
those with Web browsers as their mail clients, and one or more attachments. The global standard for identifying and differentiating each of
these parts is known as Multipurpose Internet Mail Extensions, or
MIME for short.
Python’s email package is perfectly suited to handle and manage MIME
parts of entire e-mail messages, and we’ll be using it for this entire subsection along with smtplib, of course. The email package has separate components that parse as well as construct e-mail. We will start with the latter
then conclude with a quick look at parsing and message walkthrough.
In Example 3-4, you’ll see two examples of creating e-mail messages,
make_mpa_msg() and make_img_msg(), each of which makes a single e-mail
message. The former creates a multipart alternative message, and the latter creates an e-mail message containing one image; the main part of the
script then sends each of them. Following the example is the line-by-line
explanation.
Example 3-4
Composing E-Mail (email-examples.py)
This Python 2 script creates and sends two different e-mail message types.
1   #!/usr/bin/env python
2   'email-examples.py - demo creation of email messages'
3
4   from email.mime.image import MIMEImage
5   from email.mime.multipart import MIMEMultipart
6   from email.mime.text import MIMEText
7   from smtplib import SMTP
8
9   # multipart alternative: text and html
10  def make_mpa_msg():
11      email = MIMEMultipart('alternative')
12      text = MIMEText('Hello World!\r\n', 'plain')
13      email.attach(text)
14      html = MIMEText(
15          '<html><body><h4>Hello World!</h4>'
16          '</body></html>', 'html')
17      email.attach(html)
18      return email
19
20  # multipart: images
21  def make_img_msg(fn):
22      f = open(fn, 'rb')    # image data is binary
23      data = f.read()
24      f.close()
25      email = MIMEImage(data, name=fn)
26      email.add_header('Content-Disposition',
27          'attachment; filename="%s"' % fn)
28      return email
29
30  def sendMsg(fr, to, msg):
31      s = SMTP('localhost')
32      errs = s.sendmail(fr, to, msg)
33      s.quit()
34
35  if __name__ == '__main__':
36      print 'Sending multipart alternative msg...'
37      msg = make_mpa_msg()
38      msg['From'] = SENDER
39      msg['To'] = ', '.join(RECIPS)
40      msg['Subject'] = 'multipart alternative test'
41      sendMsg(SENDER, RECIPS, msg.as_string())
42
43      print 'Sending image msg...'
44      msg = make_img_msg(SOME_IMG_FILE)
45      msg['From'] = SENDER
46      msg['To'] = ', '.join(RECIPS)
47      msg['Subject'] = 'image file test'
48      sendMsg(SENDER, RECIPS, msg.as_string())
Line-by-Line Explanation
Lines 1–7
In addition to the standard startup line and docstring, we see the import of
MIMEImage, MIMEMultipart, MIMEText, and SMTP classes.
Lines 9–18
Multipart alternative messages usually consist of the following two parts:
the body of an e-mail message in plain text and its equivalent in
HTML. It is up to the mail client to determine which gets shown.
For example, a Web-based e-mail system would generally show the HTML
version, whereas a command-line mail reader would only show the plain
text version.
To create this type of message, you need to use the email.mime.multipart.
MIMEMultipart class and instantiate it by passing in 'alternative' as its
only argument. If you don’t pass this value in, each of the two parts will be
a separate attachment; thus, some e-mail systems might show both parts.
The email.mime.text.MIMEText class is used for both parts (because
they really are both bodies of plain text). Each part is attached to the
enclosing e-mail message as soon as it is created, and the assembled message
is then returned.
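The effect of the 'alternative' argument is easy to check without sending anything. Here is a minimal Python 3 sketch; make_alternative() is an illustrative name of ours, while the classes and methods are the standard email package ones.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def make_alternative(plain, html):
    """Build a multipart/alternative message from two renderings."""
    msg = MIMEMultipart('alternative')
    msg.attach(MIMEText(plain, 'plain'))   # simplest rendering first
    msg.attach(MIMEText(html, 'html'))     # richer rendering last
    return msg


msg = make_alternative('Hello World!\r\n',
                       '<html><body><h4>Hello World!</h4></body></html>')
print(msg.get_content_type())                                   # multipart/alternative
print([part.get_content_type() for part in msg.get_payload()])  # ['text/plain', 'text/html']
print(MIMEMultipart().get_content_type())                       # multipart/mixed
```

The last line shows what happens when the subtype is omitted: the container defaults to multipart/mixed, which is why the two parts would then be treated as separate attachments.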
Lines 20–28
The make_img_msg() function takes a single parameter, a filename. The file’s
contents are read and then fed directly to a new instance of email.mime.image.MIMEImage.
A Content-Disposition header is added, and then the message is returned to
the caller.
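For comparison, here is how the same idea might look in Python 3. This is a sketch under stated assumptions: make_image_part() and its subtype parameter are our own names, and passing an explicit subtype is a choice we made so that the example does not depend on MIMEImage recognizing the image format from the data.

```python
import os
from email.mime.image import MIMEImage


def make_image_part(path, subtype=None):
    """Wrap an image file in a MIMEImage marked as an attachment."""
    with open(path, 'rb') as f:       # binary mode: image data is bytes
        data = f.read()
    name = os.path.basename(path)
    # With subtype=None, MIMEImage tries to detect the format itself.
    msg = MIMEImage(data, _subtype=subtype, name=name)
    msg.add_header('Content-Disposition', 'attachment', filename=name)
    return msg


# Example call (the file name is a placeholder, so it is not run here):
# part = make_image_part('logo.png', subtype='png')
# print(part.get_content_type(), part.get_filename())
```

Letting add_header() build the filename parameter from a keyword argument avoids having to quote the name by hand.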
Lines 30–33
The sole purpose of sendMsg() is to take the basic e-mail-sending criteria
(sender, recipient[s], message body), transmit the message, and then return
to the caller.
Looking for more verbose output? Try this extension: s.set_debuglevel(True), where “s” is the smtplib.SMTP server. Finally, we’ll repeat what we
said earlier: many SMTP servers require logins, so you’d do that here
(just after connecting but before sending an e-mail message).
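Where the debugging switch and the login call belong can be shown in a few lines. The following Python 3 sketch assumes an already-connected smtplib.SMTP (or SMTP_SSL) object is passed in; send_with_login() and its parameters are hypothetical names, not part of Example 3-4.

```python
def send_with_login(server, user, password, fr, to, msg, debug=False):
    """Log in on a connected SMTP object, send one message, then quit."""
    server.set_debuglevel(1 if debug else 0)   # echo the SMTP dialogue
    try:
        server.login(user, password)           # before any sendmail()
        return server.sendmail(fr, to, msg)    # dict of refused recipients
    finally:
        server.quit()


# Usage with placeholder values (not run here):
# from smtplib import SMTP
# errs = send_with_login(SMTP('localhost'), 'wesley', 'youllNeverGuess',
#                        SENDER, RECIPS, msg.as_string(), debug=True)
```

Unlike sendMsg(), this version returns what sendmail() returns, so the caller can check that no recipients were refused.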
Lines 35–48
The “main” part of this script just tests each of these two functions. Each
function creates a message; the main code then adds the From, To, and Subject
fields and transmits the message to those recipients. Naturally, you need to fill in
all of the following for your application to work: SENDER, RECIPS, and
SOME_IMG_FILE.
E-Mail Parsing
Parsing is somewhat easier than constructing a message from scratch.
You would typically use several tools from the email package: the
email.message_from_string() function as well as the message.walk() and
message.get_payload() methods. Here is the typical pattern:
import email

def processMsg(entire_msg):
    body = ''
    msg = email.message_from_string(entire_msg)
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                body = part.get_payload()
                break
        else:
            # no text/plain part was found
            body = msg.get_payload(decode=True)
    else:
        body = msg.get_payload(decode=True)
    return body
This snippet should be fairly simple to figure out. Here are the major
players:
• email.message_from_string(), used to parse the message.
• msg.walk(): “Walks down” the message’s part hierarchy, visiting the
message itself and then each of its subparts (attachments) in turn.
• part.get_content_type(): Returns the MIME content type of the part, such as text/plain.
• msg.get_payload(): Pull out the specific part from the message
body. Typically the decode flag is set to True so as to decode
the body part as per the Content-Transfer-Encoding header.
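These pieces can be exercised end to end without a mail server by building a message and parsing it straight back. Here is a minimal Python 3 sketch; first_plain_text() is our own name, and decoding the payload with the part’s declared character set is an addition that Python 3 needs because get_payload(decode=True) returns bytes.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def first_plain_text(raw):
    """Return the first text/plain body found in a raw message string."""
    msg = email.message_from_string(raw)
    for part in msg.walk():       # walk() also works for single-part messages
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)    # bytes
            charset = part.get_content_charset() or 'us-ascii'
            return payload.decode(charset, 'replace')
    return ''


# Build a small multipart/alternative message, then parse it back.
outer = MIMEMultipart('alternative')
outer.attach(MIMEText('Hello World!', 'plain'))
outer.attach(MIMEText('<h4>Hello World!</h4>', 'html'))
print(first_plain_text(outer.as_string()))    # Hello World!
```

Because walk() yields the top-level message first, the same loop covers both the multipart and the single-part case, so no separate is_multipart() branch is needed in this variant.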
Web-Based Cloud E-Mail Services
The use of the protocols that we’ve covered so far in this chapter has, for the
most part, been idealized: there hasn’t been much of a focus on security
or the messiness that comes with it. Of course, we did mention that some
servers require logins.
However, when coding in real life, we need to come back down to earth
and recognize that servers that are actively maintained really don’t want
to be the focus or target of hackers who want a free spam and/or phishing
e-mail relay or other nefarious activity. Such systems, predominantly e-mail
systems, are locked down appropriately. The e-mail examples given earlier in the chapter are for generic e-mail services that come with your ISP.
Because you’re paying a monthly fee for your Internet service, you generally get e-mail uploading/sending and downloading/receiving for “free.”
Let’s take a look at some public Web-based e-mail services such as
Yahoo! Mail and Google’s Gmail service. Because such software as a service (SaaS) cloud services don’t require you to pay a monthly fee up front,
it seems completely free to you. However, users generally “pay” by being
exposed to advertising. The better the ad relevance, the more likely the
service provider is able to recoup some of the costs of offering such services free of charge.
Gmail features algorithms that scan e-mail messages to get a sense of their
context and, with good machine learning algorithms, hopefully present
ads that are more likely to be clicked by users than generic ad inventory.
The ads are generally in plain text and along the right side of the e-mail
message panel. Because of the efficacy of their algorithms, Google not only
offers Web access to their Gmail service for free, they even allow outbound
transfer of messages through a client service via POP3 and IMAP4 as well
as the ability to send e-mail using SMTP.
Yahoo!, on the other hand, shows more general ads in image format
embedded in parts of their Web application. Because their ads don’t target
as well, they likely don’t derive as much revenue, which might be a contributing factor for why they require a paid subscription service (called
Yahoo! Mail Plus) in order to download your e-mail. Another reason could
be that they don’t want users to easily be able to move their mail elsewhere. Yahoo! currently does not charge for sending e-mail via SMTP at
the time of this writing. We will look at some code examples of both in the
remainder of this subsection.
Best Practices: Security, Refactoring
We need to take a moment to also discuss best practices, including security
and refactoring. Sometimes, the best laid plans are thwarted because of the
reality that different releases of a programming language will have
improvements and bugfixes that aren’t found in older releases, so in practice, you might have to do a little bit more work than you had originally
planned.
Before we look at the two e-mail services from Google and Yahoo!, let’s
look at some boilerplate code that we’ll use for each set of examples:
from imaplib import IMAP4_SSL
from poplib import POP3_SSL
from smtplib import SMTP_SSL
from secret import * # where MAILBOX, PASSWD come from
who = . . . # xxx@yahoo.com/gmail.com where MAILBOX = xxx
from_ = who
to = [who]
headers = [
'From: %s' % from_,
'To: %s' % ', '.join(to),
'Subject: test SMTP send via 465/SSL',
]
body = [
'Hello',
'World!',
]
msg = '\r\n\r\n'.join(('\r\n'.join(headers), '\r\n'.join(body)))
The first thing you’ll notice is that we’re no longer in utopia; the realities
of living and working, nay even existing, on the Web require that we use
secure connections, so we’re using the SSL equivalents of all three protocols; hence, the “_SSL” appended to the end of each of the original class
names.
Secondly, we can’t use our mailboxes (login names) and passwords in
plain text as we did in the code examples in previous sections. In practice,
putting account names and passwords in plain text and embedding them
in source code is... well, horrific to say the least. Instead, they should be
fetched from a secure database, imported from a bytecode-compiled
.pyc or .pyo file, or retrieved from some live server or broker found somewhere on your company’s intranet. For our example, we’ll assume they’re
in a secret.pyc file that contains MAILBOX and PASSWD attributes associated
with the equivalent privileged information.
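If none of those sources is at hand while you experiment, one common stand-in is to read the two values from environment variables. To be clear, this is a substitute technique of ours and not the secret.pyc approach assumed in this section; load_credentials() is a hypothetical helper, and the variable names simply reuse MAILBOX and PASSWD. The sketch is written for Python 3.

```python
import os


def load_credentials(env=None):
    """Return (MAILBOX, PASSWD) from a mapping, os.environ by default."""
    env = os.environ if env is None else env
    try:
        return env['MAILBOX'], env['PASSWD']
    except KeyError as exc:
        # Fail loudly instead of attempting a login with empty values.
        raise RuntimeError('missing credential: %s' % exc.args[0])


# Typical use, once both variables have been set in the shell:
# MAILBOX, PASSWD = load_credentials()
```

Accepting the mapping as a parameter keeps the function easy to try with an ordinary dictionary.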
The last set of variables just represents the actual e-mail message plus
sender and receiver (both the same person to make it easy). The way we’ve
structured the e-mail message itself is slightly more complex than what we did
in the earlier example, in which the body was a single string that required
us to fill in the necessary field data:
body = '''\
From: %(who)s
To: %(who)s
Subject: test msg

Hello World!
''' % {'who': who}
However, we chose to use lists instead, because in practice, the body of
the e-mail message is more likely to be generated or somehow controlled
by the application instead of being a hardcoded string. The same may be
true of the e-mail headers. By making them lists, you can easily add (or
even remove) lines to (from) an e-mail message. Then when ready for
transmission, the process only requires a couple of str.join() calls with
\r\n pairs. (Recall from an earlier subsection in this chapter that this is the
official delimiter accepted by RFC5322-compliant SMTP servers—some
servers won’t accept only NEWLINEs.)
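The pair of str.join() calls is compact enough to wrap in a function. Here is a minimal Python 3 sketch; build_message() is a hypothetical name, and the example.com addresses are placeholders of ours.

```python
CRLF = '\r\n'


def build_message(headers, body):
    """Join header lines and body lines, separated by one blank line."""
    return (CRLF * 2).join((CRLF.join(headers), CRLF.join(body)))


headers = ['From: me@example.com', 'To: me@example.com']
body = ['Hello', 'World!']
headers.append('Subject: test msg')     # lists make late additions easy
print(repr(build_message(headers, body)))
```

Joining with two CRLF pairs between the header block and the body block produces exactly the blank line that separates the two sections of a message.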
We’ve also made another minor tweak to the message body data: there
might be more than one receiver, so the to variable has also been changed
to a list. We then have to str.join() them together when creating the final
set of e-mail headers. Finally, let’s look at a specific utility function we’re
going to use for our upcoming Yahoo! Mail and Gmail examples; it’s a
short snippet that just goes and grabs the Subject line from inbound e-mail
messages.
def getSubject(msg, default='(no Subject line)'):
    '''\
    getSubject(msg) - 'msg' is an iterable, not a
    delimited single string; this function iterates
    over 'msg' looking for Subject: line and returns
    if found, else the default is returned if one isn't
    found in the headers
    '''
    for line in msg:
        if line.startswith('Subject:'):
            return line.rstrip()
        if not line.strip():    # blank line (with or without EOL) ends headers
            return default
The getSubject() function is fairly simplistic; it looks for the Subject
line only within the headers. As soon as one is found, the function returns
immediately. The headers have completed when a blank line is reached, so
if one hasn’t been found at this point, return a default, which is a local
variable with a default argument allowing the user to pass in a custom
default string as desired. Yeah, I know some of you performance buffs will
want to use line[:8] == 'Subject:' to avoid the str.startswith()
method call, but guess what? Don’t forget that line[:8] results in a
str.__getslice__() call; although to be honest, for this case it is about
40 percent faster than str.startswith(), as shown in a few timeit tests:
>>> import timeit
>>> t = timeit.Timer('s[:8] == "Subject:"', 's="Subject: xxx"')
>>> t.timeit()
0.14157199859619141
>>> t.timeit()
0.1387479305267334
>>> t.timeit()
0.13623881340026855
>>>
>>> t = timeit.Timer('s.startswith("Subject:")', 's="Subject: xxx"')
>>> t.timeit()
0.23016810417175293
>>> t.timeit()
0.23104190826416016
>>> t.timeit()
0.24139499664306641
Using timeit is another best practice and we’ve just gone over one of its
most common use cases: you have a pair of snippets that do the same
thing, so you’re in a situation in which you need to know which one is
more efficient. Now let’s see how we can apply some of this knowledge on
some real code.
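The same comparison can be scripted so that it is easy to rerun on whatever interpreter you have. This is a minimal Python 3 sketch; best_time() and compare() are our own helper names, and taking the minimum of several repeats is a common convention rather than anything this section prescribes. We make no claim about which snippet wins on your machine, because the gap differs between Python releases.

```python
import timeit

SETUP = 's = "Subject: xxx"'


def best_time(stmt, number=100000, repeat=3):
    """Return the fastest of several timing runs of stmt, in seconds."""
    return min(timeit.repeat(stmt, setup=SETUP, repeat=repeat, number=number))


def compare(number=100000):
    """Time the slice test and the startswith() test side by side."""
    return (best_time('s[:8] == "Subject:"', number),
            best_time('s.startswith("Subject:")', number))


slice_t, method_t = compare()
print('slice: %.4fs   startswith: %.4fs' % (slice_t, method_t))
```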
Yahoo! Mail
Assuming that all of the preceding boilerplate code has been executed,
we’ll start with Yahoo! Mail. The code we’re going to look at is an extension
of Example 3-3. We’ll also send e-mail via SMTP but will retrieve messages
via both POP then IMAP. Here’s the prototype script:
from cStringIO import StringIO  # used by the IMAP step below

s = SMTP_SSL('smtp.mail.yahoo.com', 465)
s.login(MAILBOX, PASSWD)
s.sendmail(from_, to, msg)
s.quit()
print 'SSL: mail sent!'
s = POP3_SSL('pop.mail.yahoo.com', 995)
s.user(MAILBOX)
s.pass_(PASSWD)
rv, msg, sz = s.retr(s.stat()[0])
s.quit()
line = getSubject(msg)
print 'POP:', line
s = IMAP4_SSL('imap.n.mail.yahoo.com', 993)
s.login(MAILBOX, PASSWD)
rsp, msgs = s.select('INBOX', True)
rsp, data = s.fetch(msgs[0], '(RFC822)')
line = getSubject(StringIO(data[0][1]))
s.close()
s.logout()
print 'IMAP:', line
Assuming we stick all of this into a ymail.py file, our execution might
look something like this:
$ python ymail.py
SSL: mail sent!
POP: Subject:Meet singles for dating, romance and more.
IMAP: Subject: test SMTP send via 465/SSL
In our case, we had a Yahoo! Mail Plus account, which allows us to
download e-mail. (The sending is free regardless of whether you’re a
paying or non-paying subscriber.) However, note a couple of things that
didn’t work out quite right. The first is that the message obtained via POP
was not that of our sent message, whereas IMAP was able to find it. In general, you’ll find IMAP more reliable. Also in the preceding example, we’re
assuming that you’re a paying customer and using a current version of
Python (version 2.6.3+); reality sets in rather quickly if you’re not.
If you’re not paying for Yahoo! Mail Plus, you’re not allowed to download e-mail. Here’s a sample traceback that you’ll get if you attempt it:
Traceback (most recent call last):
File "ymail.py", line 101, in <module>
s.pass_(PASSWD)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/
python2.7/poplib.py", line 189, in pass_
return self._shortcmd('PASS %s' % pswd)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/
python2.7/poplib.py", line 152, in _shortcmd
return self._getresp()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/
python2.7/poplib.py", line 128, in _getresp
raise error_proto(resp)
poplib.error_proto: -ERR [SYS/PERM] pop not allowed for user.
Furthermore, the SMTP_SSL class was only added in version 2.6, and on
top of that, it was buggy until version 2.6.3, so that’s the minimum version
you need in order to be able to write code that uses SMTP over SSL. If you’re
using a release older than version 2.6, you won’t even get that class, and if
you’re using version 2.6(.0)–2.6.2, you’ll get an error that looks like this:
Traceback (most recent call last):
File "ymail.py", line 61, in <module>
s.login(MAILBOX, PASSWD)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/smtplib.py", line 549, in login
self.ehlo_or_helo_if_needed()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/smtplib.py", line 509, in ehlo_or_helo_if_needed
if not (200 <= self.ehlo()[0] <= 299):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/smtplib.py", line 382, in ehlo
self.putcmd(self.ehlo_msg, name or self.local_hostname)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/smtplib.py", line 318, in putcmd
self.send(str)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/smtplib.py", line 310, in send
raise SMTPServerDisconnected('please run connect() first')
smtplib.SMTPServerDisconnected: please run connect() first
These are just some of the issues you’ll discover in practice; it’s never as
perfect as what you’d find in a textbook. There are always weird, unanticipated gotchas that end up biting you. By simulating it here, hopefully it
will be less shocking for you.
Let’s clean up the output a bit. But more importantly, let’s add all these
(version) checks that you’d have to do in real life, just to get used to it. Our
final version of ymail.py can be found in Example 3-5.
Example 3-5  Yahoo! Mail SMTP, POP, IMAP Example (ymail.py)

This script exercises SMTP, POP, and IMAP for the Yahoo! Mail service.

 1  #!/usr/bin/env python
 2  'ymail.py - demo Yahoo!Mail SMTP/SSL, POP, IMAP'
 3
 4  from cStringIO import StringIO
 5  from imaplib import IMAP4_SSL
 6  from platform import python_version
 7  from poplib import POP3_SSL, error_proto
 8  from socket import error
 9
10  # SMTP_SSL added in 2.6, fixed in 2.6.3
11  release = python_version()
12  if release > '2.6.2':
13      from smtplib import SMTP_SSL, SMTPServerDisconnected
14  else:
15      SMTP_SSL = None
16
17  from secret import *  # you provide MAILBOX, PASSWD
18
19  who = '%s@yahoo.com' % MAILBOX
20  from_ = who
21  to = [who]
22
23  headers = [
24      'From: %s' % from_,
25      'To: %s' % ', '.join(to),
26      'Subject: test SMTP send via 465/SSL',
27  ]
28  body = [
29      'Hello',
30      'World!',
31  ]
32  msg = '\r\n\r\n'.join(('\r\n'.join(headers), '\r\n'.join(body)))
33
34  def getSubject(msg, default='(no Subject line)'):
35      '''\
36      getSubject(msg) - iterate over 'msg' looking for
37      Subject line; return if found otherwise 'default'
38      '''
39      for line in msg:
40          if line.startswith('Subject:'):
41              return line.rstrip()
42          if not line:
43              return default
44
45  # SMTP/SSL
46  print '*** Doing SMTP send via SSL...'
47  if SMTP_SSL:
48      try:
49          s = SMTP_SSL('smtp.mail.yahoo.com', 465)
50          s.login(MAILBOX, PASSWD)
51          s.sendmail(from_, to, msg)
52          s.quit()
53          print '    SSL mail sent!'
54      except SMTPServerDisconnected:
55          print '    error: server unexpectedly disconnected... try again'
56  else:
57      print '    error: SMTP_SSL requires 2.6.3+'
58
59  # POP
60  print '*** Doing POP recv...'
61  try:
62      s = POP3_SSL('pop.mail.yahoo.com', 995)
63      s.user(MAILBOX)
64      s.pass_(PASSWD)
65      rv, msg, sz = s.retr(s.stat()[0])
66      s.quit()
67      line = getSubject(msg)
68      print '    Received msg via POP: %r' % line
69  except error_proto:
70      print '    error: POP for Yahoo!Mail Plus subscribers only'
71
72  # IMAP
73  print '*** Doing IMAP recv...'
74  try:
75      s = IMAP4_SSL('imap.n.mail.yahoo.com', 993)
76      s.login(MAILBOX, PASSWD)
77      rsp, msgs = s.select('INBOX', True)
78      rsp, data = s.fetch(msgs[0], '(RFC822)')
79      line = getSubject(StringIO(data[0][1]))
80      s.close()
81      s.logout()
82      print '    Received msg via IMAP: %r' % line
83  except error:
84      print '    error: IMAP for Yahoo!Mail Plus subscribers only'

Line-by-Line Explanation
Lines 1–8
These are the normal header and import lines.
Lines 10–15
Here we ask for the Python release number as a string, which comes from
platform.python_version(). We import the smtplib attributes only if we’re using version 2.6.3 or newer; otherwise, we set SMTP_SSL to
None.
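
One caveat about the test on line 12: it compares release strings, and strings compare character by character, so a hypothetical release such as '2.10.0' would sort before '2.6.2'. A minimal sketch of a sturdier check follows, assuming you are free to compare tuples of integers instead. The names release_tuple() and smtp_ssl_usable() are invented for illustration and are not part of ymail.py.

```python
import re
import sys

def release_tuple(release):
    """Convert a release string such as '2.6.3' into a tuple of ints: (2, 6, 3)."""
    return tuple(int(n) for n in re.findall(r'\d+', release)[:3])

def smtp_ssl_usable(version=None):
    """Return True if 'version' (default: the running interpreter) is 2.6.3 or newer."""
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) >= (2, 6, 3)

# String comparison goes wrong once a component reaches two digits...
print('2.10.0' > '2.6.2')                   # False
# ...whereas tuples of ints compare numerically.
print(release_tuple('2.10.0') > (2, 6, 2))  # True
print(smtp_ssl_usable((2, 6, 2)))           # False
print(smtp_ssl_usable((2, 6, 3)))           # True
```

For the releases discussed in this chapter, the string comparison in the listing gives the same answer; the tuple form simply removes the dependence on single-digit version components.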
Lines 17–21
As mentioned earlier, instead of hardcoding privileged information such
as login and password, we put them somewhere else, such as a bytecode-compiled secret.pyc file, where the average user cannot reverse engineer the MAILBOX and PASSWD data. As this is just a test application, after
obtaining that information (line 17), we set the envelope sender and recipient variables to the same person (lines 19–21). Why is the sender variable
named from_ instead of from?
Lines 23–32
The next set of lines builds the e-mail message. Lines
23–27 represent the headers (which you could easily generate with
some code), and lines 28–31 are for the actual body of the message (which can
also be generated or held in an iterable). At the end (line 32), we have the line of
code that merges all of the previous information (headers + body) and creates
the entire e-mail message with the correct delimiter(s).
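
To make the delimiters concrete, here is a small sketch that repeats the join from line 32 and then hands the result to the standard library's parser to confirm that the headers and body separate cleanly. The address is a placeholder, and the use of email.message_from_string() is an addition for illustration; ymail.py itself never parses its outgoing message.

```python
from email import message_from_string

from_ = 'someone@example.com'      # hypothetical address, for illustration only
to = [from_]

headers = [
    'From: %s' % from_,
    'To: %s' % ', '.join(to),
    'Subject: test SMTP send via 465/SSL',
]
body = ['Hello', 'World!']

# Header lines and body lines are each joined with CRLF; a blank line
# (two CRLFs in a row) separates the header block from the body.
msg = '\r\n\r\n'.join(('\r\n'.join(headers), '\r\n'.join(body)))
print(repr(msg))

# Round-trip through the standard parser to confirm the layout is understood.
parsed = message_from_string(msg)
print(parsed['Subject'])
print(parsed.get_payload().splitlines())
```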
Lines 34–43
We have already discussed the getSubject() function, whose sole purpose
is to seek the Subject line within an inbound message’s e-mail headers, returning a default string if no Subject line is found. Passing that string is optional because the default parameter has a default value of its own.
Lines 45–57
This is the SMTP code. Earlier, in lines 10–15, we decided whether to use
SMTP_SSL or assign None to that name. Here, if we did get the class (line 47),
we try to connect to the server, log in, send the e-mail, and then quit
(lines 48–53). Otherwise, we alert the user that version 2.6.3 or newer is
required (lines 56–57). Occasionally you might get disconnected from the
server for a variety of reasons, such as poor connectivity. In such
cases, a retry usually does the trick, so we tell the user to try
again (lines 54–55).
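
The script only asks the user to try again. If you would rather have the program retry by itself, one possible shape is sketched below. It is an assumption-laden illustration, not part of ymail.py: send_with_retry() and flaky_send() are invented names, and the "server" is a stand-in function so that the example runs without a network connection.

```python
from smtplib import SMTPServerDisconnected

def send_with_retry(send, attempts=3):
    """Call send() until it succeeds or 'attempts' tries have failed.

    'send' is any zero-argument callable that performs the connect/login/
    sendmail/quit sequence; both names are illustrative, not from ymail.py.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except SMTPServerDisconnected:
            if attempt == attempts:
                raise            # out of retries: let the caller see the error
            print('    server disconnected, retrying (%d/%d)...' % (attempt, attempts))

# Stand-in for a flaky server: fails twice, then succeeds.
calls = []
def flaky_send():
    calls.append(1)
    if len(calls) < 3:
        raise SMTPServerDisconnected('please run connect() first')
    return 'sent'

print(send_with_retry(flaky_send))
```

In a real script, the callable would wrap lines 49–52 of the listing, and a short pause between attempts would probably be worth adding.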
Lines 59–70
This is the POP3 code that we already covered earlier for the most part
(lines 62–68). The only difference is that we’ve added a check in case
you’re not paying for the POP access but are trying to download your mail
anyway, which is why we need to catch the poplib.error_proto exception
(lines 69–70), seen earlier.
Lines 72–84
The same is true for the IMAP4 code. We wrap the basic functionality in a
try block (lines 74–82) and catch socket.error (lines 83–84). Did you also
notice that this is where we subtly use the cStringIO.StringIO object (line
79)? The reason is that IMAP returns the e-mail message as a
single large string. Because getSubject() iterates over multiple lines, we
need to provide it with something similar that it can work with, and that’s
what we get from StringIO: it takes a long string and gives it a file-like
interface.
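
The difference between the two kinds of input can be seen without a mail server. The sketch below feeds the same message to a getSubject() variant twice: once as a list of lines (the shape POP delivers) and once as a single string wrapped in StringIO (the shape IMAP delivers). The sample message is made up, and the function is a slightly hardened variant of the listing's version rather than a copy of it.

```python
from io import StringIO   # on Python 2, the listing's cStringIO.StringIO plays this role

def getSubject(msg, default='(no Subject line)'):
    """Variant of the listing's getSubject() (see the note after the code)."""
    for line in msg:
        if line.startswith('Subject:'):
            return line.rstrip()
        if not line.strip():       # blank line, with or without its line ending
            break
    return default

raw = 'From: me\r\nSubject: test\r\n\r\nHello\r\nWorld!\r\n'

pop_style = raw.splitlines()      # POP-like: a list of lines, endings removed
imap_style = StringIO(raw)        # IMAP-like: one big string behind a file-like interface

print(getSubject(pop_style))
print(getSubject(imap_style))
print(getSubject(StringIO('From: me\r\n\r\nSubject: in the body\r\n')))
```

The variant differs from the listing in two small ways: it treats a line holding only a line ending as blank, because lines read through StringIO keep their '\r\n', and it falls back to default if the input runs out. Note also that on Python 3 both poplib and imaplib hand back bytes, so a decoding step would be needed first; that is outside the scope of this sketch.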
So that, in practice, is how you would actually deal with Yahoo! Mail.
Gmail is very similar, except that all the access is “free.” In addition, Gmail
also allows standard SMTP (using TLS).
Gmail
Example 3-6 looks at Google’s Gmail service. In addition to SMTP over
SSL, Gmail also offers SMTP using Transport Layer Security (TLS), so we’ll
see one additional import of the smtplib.SMTP class with its own section of
code. Everything else (SMTP over SSL, POP, and IMAP) looks
quite similar to its equivalent for Yahoo! Mail. Because e-mail
download is completely free, we do not need the exception handlers that process access errors due to not being a subscriber.
Example 3-6  Gmail SMTPx2, POP, IMAP Example (gmail.py)

This script exercises SMTP, POP, and IMAP of the Google Gmail service.

 1  #!/usr/bin/env python
 2  'gmail.py - demo Gmail SMTP/TLS, SMTP/SSL, POP, IMAP'
 3
 4  from cStringIO import StringIO
 5  from imaplib import IMAP4_SSL
 6  from platform import python_version
 7  from poplib import POP3_SSL
 8  from smtplib import SMTP
 9
10  # SMTP_SSL added in 2.6
11  release = python_version()
12  if release > '2.6.2':
13      from smtplib import SMTP_SSL  # fixed in 2.6.3
14  else:
15      SMTP_SSL = None
16
17  from secret import *  # you provide MAILBOX, PASSWD
18
19  who = '%s@gmail.com' % MAILBOX
20  from_ = who
21  to = [who]
22
23  headers = [
24      'From: %s' % from_,
25      'To: %s' % ', '.join(to),
26      'Subject: test SMTP send via 587/TLS',
27  ]
28  body = [
29      'Hello',
30      'World!',
31  ]
32  msg = '\r\n\r\n'.join(('\r\n'.join(headers), '\r\n'.join(body)))
33
34  def getSubject(msg, default='(no Subject line)'):
35      '''\
36      getSubject(msg) - iterate over 'msg' looking for
37      Subject line; return if found otherwise 'default'
38      '''
39      for line in msg:
40          if line.startswith('Subject:'):
41              return line.rstrip()
42          if not line:
43              return default
44
45  # SMTP/TLS
46  print '*** Doing SMTP send via TLS...'
47  s = SMTP('smtp.gmail.com', 587)
48  if release < '2.6':
49      s.ehlo()  # required in older releases
50  s.starttls()
51  if release < '2.5':
52      s.ehlo()  # required in older releases
53  s.login(MAILBOX, PASSWD)
54  s.sendmail(from_, to, msg)
55  s.quit()
56  print '    TLS mail sent!'
57
58  # POP
59  print '*** Doing POP recv...'
60  s = POP3_SSL('pop.gmail.com', 995)
61  s.user(MAILBOX)
62  s.pass_(PASSWD)
63  rv, lines, sz = s.retr(s.stat()[0])  # 'lines', so msg survives for line 75
64  s.quit()
65  line = getSubject(lines)
66  print '    Received msg via POP: %r' % line
67
68  msg = msg.replace('587/TLS', '465/SSL')  # msg is a string; body is a list
69
70  # SMTP/SSL
71  if SMTP_SSL:
72      print '*** Doing SMTP send via SSL...'
73      s = SMTP_SSL('smtp.gmail.com', 465)
74      s.login(MAILBOX, PASSWD)
75      s.sendmail(from_, to, msg)
76      s.quit()
77      print '    SSL mail sent!'
78
79  # IMAP
80  print '*** Doing IMAP recv...'
81  s = IMAP4_SSL('imap.gmail.com', 993)
82  s.login(MAILBOX, PASSWD)
83  rsp, msgs = s.select('INBOX', True)
84  rsp, data = s.fetch(msgs[0], '(RFC822)')
85  line = getSubject(StringIO(data[0][1]))
86  s.close()
87  s.logout()
88  print '    Received msg via IMAP: %r' % line

Line-by-Line Explanation
Lines 1–8
These are the usual header and import lines with one addition: the import
of smtplib.SMTP. We will use this class with TLS to send an e-mail message.
Lines 10–43
These are pretty much the same as the equivalent lines in ymail.py. One
difference is that our who variable will have an @gmail.com e-mail address,
of course (line 19). The other change is that we’ll start with SMTP/TLS, so
the Subject line reflects this. We also don’t import the smtplib.SMTPServerDisconnected exception, because this exception wasn’t observed throughout our testing.
Lines 45–56
This is the SMTP code that connects to the server by using TLS. As you can
see, successive releases of Python (lines 48–52) have resulted in less boilerplate necessary to communicate with the server. It also has a different port
number than SMTP/SSL (line 47).
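
The order of calls is what matters in the TLS case: upgrade the connection first, and only then send credentials. The sketch below isolates that sequence in a function and exercises it against a recording stand-in instead of a real server. Everything here is illustrative: send_via_tls(), its need_ehlo flag, and RecordingSMTP are invented names, and the flag collapses the listing's two release tests into one for brevity.

```python
def send_via_tls(smtp, user, password, from_, to, msg, need_ehlo=False):
    """Upgrade an open SMTP connection to TLS, then log in and send.

    'smtp' is expected to behave like smtplib.SMTP. 'need_ehlo' mirrors the
    explicit ehlo() calls that the listing makes on older releases.
    """
    if need_ehlo:
        smtp.ehlo()
    smtp.starttls()        # everything after this point is encrypted
    if need_ehlo:
        smtp.ehlo()        # re-identify over the now-encrypted channel
    smtp.login(user, password)
    smtp.sendmail(from_, to, msg)
    smtp.quit()

class RecordingSMTP(object):
    """Hypothetical stand-in that records method names instead of using the network."""
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        def method(*args, **kwargs):
            self.calls.append(name)
        return method

fake = RecordingSMTP()
send_via_tls(fake, 'user', 'secret', 'a@example.com', ['a@example.com'],
             'Subject: hi\r\n\r\nHello')
print(fake.calls)   # ['starttls', 'login', 'sendmail', 'quit']
```

Against the real service, the first argument would be SMTP('smtp.gmail.com', 587), with MAILBOX, PASSWD, from_, to, and msg taken from the listing.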
Lines 58–88
The rest of the script is nearly identical to the equivalent in Yahoo! Mail.
As we mentioned earlier, there are fewer error checks because those issues
either don’t exist for Gmail or have not been observed when using Gmail.
One final minor difference is that as a result of sending both SMTP/TLS
and SMTP/SSL messages, the Subject line needed to be tweaked (line 68).
What we hope readers take away from this final pair of applications is the ability to take the concepts learned earlier in the chapter
and apply some realism to everyday application development; an appreciation that, in
practice, security is a necessity; and the knowledge that yes, sometimes there are minor differences between Python releases. As much as we’d prefer solutions that are
more pure, we know this isn’t reality, and such issues are just examples of
things that you have to take into consideration on any development project.
3.5 Related Modules
One of Python’s greatest assets is the strength of its networking support in
the standard library, particularly those oriented toward Internet protocols
and client development. The subsections that follow present related modules, first focusing on e-mail, followed by Internet protocols in general.
3.5.1 E-Mail
Python features numerous e-mail modules and packages to help you with
building an application. Some of them are listed in Table 3-6.
Table 3-6 E-Mail-Related Modules

Module/Package   Description
email            Package for processing e-mail (also supports MIME)
smtpd            SMTP server
base64           Base-16, 32, and 64 data encodings (RFC 3548)
mhlib            Classes for handling MH folders and messages
mailbox          Classes to support parsing mailbox file formats
mailcap          Support for handling “mailcap” files
mimetools        (deprecated) MIME message parsing tools (use email)
mimetypes        Converts between filenames/URLs and associated MIME types
MimeWriter       (deprecated) MIME message processing (use email)
mimify           (deprecated) Tools to MIME-process messages (use email)
quopri           Encode/decode MIME quoted-printable data
binascii         Binary and ASCII conversion
binhex           Binhex4 encoding and decoding support
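
As a quick taste of three of the modules in Table 3-6, the following sketch encodes a short byte string with base64 and quopri and asks mimetypes to guess a type from a filename. The sample data and filename are arbitrary; this is only a minimal illustration of the calls, not a recommended way to build attachments (the email package handles that for you).

```python
import base64
import mimetypes
import quopri

# base64: binary-safe encoding used for most attachments
encoded = base64.b64encode(b'Hello World!')
print(encoded)
print(base64.b64decode(encoded))

# quopri: quoted-printable keeps mostly-ASCII text readable
qp = quopri.encodestring(b'caf\xe9')
print(qp)
print(quopri.decodestring(qp) == b'caf\xe9')

# mimetypes: guess a MIME type from a filename
print(mimetypes.guess_type('py.png')[0])
```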
3.5.2 Other Internet “Client” Protocols
Table 3-7 presents other Internet “Client” Protocol-Related Modules.

Table 3-7 Internet “Client” Protocol-Related Modules

Module       Description
ftplib       FTP protocol client
xmlrpclib    XML-RPC protocol client
httplib      HTTP and HTTPS protocol client
imaplib      IMAP4 protocol client
nntplib      NNTP protocol client
poplib       POP3 protocol client
smtplib      SMTP protocol client
3.6 Exercises
FTP
3-1. Simple FTP Client. Given the FTP examples from this chapter,
write a small FTP client program that goes to your favorite
Web sites and downloads the latest versions of the applications you use. This can be a script that you run every few
months to ensure that you’re using the “latest and greatest.”
You should probably keep some sort of table with FTP location, login, and password information for your convenience.
3-2. Simple FTP Client and Pattern-Matching. Use your solution to
Exercise 3-1 as a starting point for creating another simple
FTP client that either pushes or pulls a set of files from a
remote host by using patterns. For example, if you want to
move a set of Python or PDF files from one host to another,
allow users to enter *.py or doc*.pdf and only transfer
those files whose names match.
3-3. Smart FTP Command-Line Client. Create a command-line FTP
application similar to the vanilla Unix /bin/ftp program;
however, make it a “better FTP client,” meaning it should
have additional useful features. You can take a look at the
ncFTP application as motivation. It can be found at http://ncftp.com.
For example, it has the following features: history,
bookmarks (saving FTP locations with login and password),
download progress, etc. You might need to implement readline functionality for history and curses for screen control.
3-4. FTP and Multithreading. Create an FTP client that uses Python
threads to download files. You can either upgrade your existing Smart FTP client, as in Exercise 3-3, or just write a
simpler client to download files. This can be either a command-line program in which you enter multiple files as arguments
to the program, or a GUI in which you let the user select one or more
files to transfer. Extra Credit: Allow patterns, for example, *.exe.
Use individual threads to download each file.
3-5. FTP and GUI. Take the smart FTP client that you developed
earlier and add a GUI layer on top of it to form a complete
FTP application. You can choose from any of the modern
Python GUI toolkits.
3-6. Subclassing. Derive ftplib.FTP and make a new class FTP2
where you do not need to give STOR filename and RETR
filename commands with all four (4) retr*() and stor*()
methods; you only need to pass in the filename. You can
choose to either override the existing methods or create new
ones with a 2 suffix, for example, retrlines2().
The file Tools/scripts/ftpmirror.py in the Python source distribution is a
script that can mirror FTP sites, or portions thereof, using the ftplib module. It can be used as an extended example that applies to this module. The
next five exercises feature the creation of solutions that revolve around code
such as ftpmirror.py. You can use code in ftpmirror.py or implement your
own solution with its code as your motivation.
3-7. Recursion. The ftpmirror.py script copies a remote directory recursively. Create a simpler FTP client in the spirit of
ftpmirror.py but one that does not recurse by default. Create
an -r option that instructs the application to recursively copy
subdirectories to the local filesystem.
3-8. Pattern-Matching. The ftpmirror.py script has an -s option
that lets users skip files that match the given pattern, such
as .exe. Create your own simpler FTP client or update your
solution to Exercise 3-7 so that it lets the user supply a pattern and only copy those files matching that pattern. Use
your solution to an earlier exercise as a starting point.
3-9. Recursion and Pattern-Matching. Create an FTP client that integrates both Exercises 3-7 and 3-8.
3-10. Recursion and ZIP files. This exercise is similar to Exercise 3-7;
however, instead of copying the remote files to the local filesystem, either update your existing FTP client or create a
new one to download remote files and compress them into a
ZIP (or TGZ or BZ2) file. This -z option allows your users to
back up an FTP site in an automated manner.
3-11. Kitchen Sink. Implement a single, final, all-encompassing FTP
application that has all the solutions to Exercises 3-7, 3-8, 3-9,
and 3-10, that is, -r, -s, and -z options.
NNTP
3-12. Introduction to NNTP. Change Example 3-2 (getLatestNNTP.py)
so that instead of the most recent article, it displays the first
available article, meaningfully.
3-13. Improving Code. Fix the flaw in getLatestNNTP.py where
triple-quoted lines show up in the output. This is because we
want to display Python interactive interpreter lines but not
triple-quoted text. Solve this problem by checking whether
the stuff that comes after the “>>>” is real Python code. If so,
display it as a line of data; if not, do not display this quoted
text. Extra Credit: Use your solution to solve another minor
problem—leading whitespace is not stripped from the body
because it might represent indented Python code. If it really
is code, display it; otherwise, it is text, so lstrip() that
before displaying.
3-14. Finding Articles. Create an NNTP client application that lets the
user log in and choose a newsgroup of interest. Once that has
been accomplished, prompt the user for keywords to search
article Subject lines. Bring up the list of articles that match the
requirement and display them to the user. The user should
then be allowed to choose an article to read from that list; display it and provide simple navigation such as pagination.
If no search field is entered, bring up all current articles.
3-15. Searching Bodies. Upgrade your solution to Exercise 3-14 by
searching both Subject lines and article bodies. Allow for
AND or OR searching of keywords. Also allow for AND or
OR searching of Subject lines and article bodies; that is, keyword(s) must be in Subject lines only, article bodies only,
either, or both.
3-16. Threaded Newsreader. This doesn’t mean write a multithreaded newsreader—it means organize related postings
into “article threads.” In other words, group related articles
together, independent of when the individual articles were
posted. All the articles belonging to individual threads
should be listed chronologically though. Allow the user to
do the following:
a) Select individual articles (bodies) to view, then have the
option to go back to the list view or to previous or next
article, either sequentially or related to the current thread.
b) Allow replies to threads, option to copy and quote previous article, and reply to the entire newsgroup via
another post. Extra Credit: Allow personal reply to individual via e-mail.
c) Permanently delete threads—no future related articles
should show up in the article list. For this, you will have
to temporarily keep a persistent list of deleted threads so
that they don’t show up again. You can assume a thread
is dead if no one posts an article with the same Subject
line after several months.
3-17. GUI Newsreader. Similar to an FTP exercise above, choose a
Python GUI toolkit to implement a complete standalone GUI
newsreader application.
3-18. Refactoring. Like ftpmirror.py for FTP, there is a demo script
for NNTP: Demo/scripts/newslist.py. Run it. This script was
written a long time ago and can use a facelift. For this exercise, you are to refactor this program using features of the
latest versions of Python as well as your developing skills in
Python to perform the same task but run and complete in
less time. This can include using list comprehensions or generator expressions, using smarter string concatenation, not
calling unnecessary functions, etc.
3-19. Caching. Another problem with newslist.py is that, according to its author, “I should really keep a list of ignored empty
groups and re-check them for articles on every run, but I
haven’t got around to it yet.” Make this improvement a reality. You can use the default version as-is or your newly
improved one from Exercise 3-18.
E-Mail
3-20. Identifiers. The POP3 method pass_() is used to send the password to the server after giving it the login name by using
user(). Can you give any reasons why you believe this
method was named with a trailing underscore (pass_()),
instead of just plain, old pass()?
3-21. POP and IMAP. Write an application using one of the poplib
classes (POP3 or POP3_SSL) to download e-mail, then do the
same thing using imaplib. You can borrow some of the code
seen earlier in this chapter. Why would you want to leave
your login and password information out of the source code?
The next set of exercises deals with the myMail.py application presented in
Example 3-3.
3-22. E-Mail Headers. In myMail.py, the last few lines compared the
originally sent body with the body in the received e-mail.
Create similar code to assert the original headers. Hint:
Ignore newly added headers.
3-23. Error Checking. Add SMTP and POP error-checking.
3-24. SMTP and IMAP. Add support for IMAP. Extra Credit:
Support both mail download protocols, giving the user the
ability to choose which to use.
3-25. E-Mail Composition. Further develop your solution to Exercise 3-24 by giving the users of your application the ability to
compose and send e-mail.
3-26. E-Mail Application. Further develop your e-mail application,
turning it into something more useful by adding in mailbox
management. Your application should be able to read in the
current set of e-mail messages in a user’s inbox and display
their Subject lines. Users should be able to select messages to
view. Extra Credit: Add support to view attachments via
external applications.
3-27. GUI. Add a GUI layer on top of your solution to the previous
problem to make it practically a full e-mail application.
3-28. Elements of SPAM. Unsolicited junk e-mail, or spam, is a very
real and significant problem today. There are many good
solutions out there, validating this market. We do not want
you to (necessarily) reinvent the wheel, but we would like you
to get a taste of some of the elements of spam processing.
a) “mbox” format. Before we can get started, we should
convert any e-mail messages you want to work on to a
common format, such as the mbox format. (There are others that you can use if you prefer.) Once you have several
(or all) work messages in mbox format, merge them all
into a single file. Hint: See the mailbox module and
email package.
b) Headers. Most of the clues of spam lie in the e-mail headers. (You might want to use the email package or parse
them manually yourself.) Write code that answers questions such as:
– What e-mail client appears to have originated this
message? (Check out the X-Mailer header.)
– Is the message ID (Message-ID header) format valid?
– Are there domain name mismatches between the From,
Received, and perhaps Return-Path headers? What
about domain name and IP address mismatches? Is
there an X-Authentication-Warning header? If so,
what does it report?
c) Information Servers. Based on an IP address or domain,
servers such as WHOIS, SenderBase.org, etc., might
be able to help you identify the location where a piece of
bulk e-mail originated. Find one or more of these
services and build code to find the country of origin,
and optionally the city, network owner name, contact
information, etc.
d) Keywords. Certain words keep popping up in spam. You
have no doubt seen them before, and in all of their variations, including using a number resembling a letter, capitalizing random letters, etc. Build a list of frequent words
that you have seen definitely tied to spam and quarantine
them. Extra Credit: Develop an algorithm or add keyword variations to spot such trickery in messages.
e) Phishing. These spam messages attempt to disguise themselves as valid e-mail from major banking institutions or
well-known Internet Web sites. They contain links that
lure readers to Web sites in an attempt to harvest private
and extremely sensitive information such as login names,
passwords, and credit card numbers. These fakers do a
pretty good job of giving their fraudulent messages an
accurate look-and-feel. However, they cannot hide the
fact that the actual link that they direct users to does not
belong to the company they are masquerading as. Many
of them are obvious giveaways; for example, horrible-looking domain names, raw IP addresses, and even IP
addresses in 32-bit integer format rather than in octets.
Develop code that can determine whether e-mail that
looks like official communication is real or bogus.
E-Mail Composition
The following set of exercises deals with composing e-mail messages by
using the email package and refers specifically to the code we looked at in
email-examples.py.
3-29. Multipart Alternative. What does multipart alternative mean,
anyway? We took a quick look at it earlier in the make_mpa_msg()
function, but what does it really signify? How would the
behavior of make_mpa_msg() change if we removed
'alternative' when we instantiated the MIMEMultipart
class, that is, email = MIMEMultipart()?
3-30. Python 3. Port the email-examples.py script to Python 3 (or
create a hybrid that runs without modification under both
versions 2.x and 3.x).
3-31. Multiple attachments. In the section on composing e-mail, we
looked at the make_img_msg() function, which created a single
e-mail message made up of a single image. While that’s a
great start, it isn’t as useful in the real world. Create a more
generalized function called attachImgs(), attach_images(),
or whatever you want to call it, with which users can pass in
more than one image file. Take those files and make them
individual attachments of the entire e-mail message body
and return a single multipart message object.
3-32. Robustness. Improve the solution for Exercise 3-31 for
attachImgs() by making sure that users are passing in only
image files (and throwing exceptions if not). In other words,
check the filename to ensure that the extension matches .png,
.jpg, .gif, .tif, etc. Extra Credit: Support file introspection to
take files with any, incorrect, or no extension and determine
what type they really are. To help you get started, check out
the Wikipedia page at http://en.wikipedia.org/wiki/File_format.
3-33. Robustness, Networking. Further enhance the attachImgs()
function so that in addition to local files, users can pass in a
URL to an online picture such as http://docs.python.org/_static/py.png.
3-34. Spreadsheets. Create a function called attachSheets() that
attaches one or more spreadsheet files to a multipart e-mail
message. Support the most common formats (.csv,
.xls, .xlsx, .ods, .uof/.uos, etc.). You can use attachImgs() as
a model; however, instead of using email.mime.image.MIMEImage,
you’ll be using email.mime.base.MIMEBase, and you’ll need to
specify an appropriate MIME type (for example,
'application/vnd.ms-excel'). Also don’t forget the Content-Disposition
header.
3-35. Documents. Similar to Exercise 3-34, create a function called
attachDocs() that attaches document files to a multipart
e-mail message. Support common formats, such as .doc,
.docx, .odt, .rtf, .pdf, .txt, .uof/.uot, etc.
3-36. Multiple Attachment Types. Let’s broaden the scope defined by
your solutions to Exercise 3-35. Create a new, more generalized function called attachFiles(), which takes any type of
attachment. You are welcome to merge any of the code from
the solutions for any of these exercises.
Miscellaneous
A list of various Internet protocols, including the three highlighted in this
chapter, can be found at http://networksorcery.com/enp/topic/ipsuite.htm. A
list of specific Internet protocols supported by Python can be found at
http://docs.python.org/library/internet.
3-37. Developing Alternate Internet Clients. Now that you have seen
four examples of how Python can help you to develop Internet
clients, choose another protocol with client support in a
Python Standard Library module and write a client application for it.
3-38. *Developing New Internet Clients. Much more difficult: find an
uncommon or upcoming protocol without Python support
and implement it. Be serious enough that you will consider
writing and submitting a PEP to have your module included
in the standard library distribution of a future Python release.
CHAPTER 4
Multithreaded Programming
> With Python you can start a thread, but you can’t stop it.
> Sorry. You’ll have to wait until it reaches the end of execution.
So, just the same as [comp.lang.python], then?
—Cliff Wells, Steve Holden
(and Timothy Delaney), February 2002
In this chapter...
• Introduction/Motivation
• Threads and Processes
• Threads and Python
• The thread Module
• The threading Module
• Comparing Single vs. Multithreaded Execution
• Multithreading in Practice
• Producer-Consumer Problem and the Queue/queue Module
• Alternative Considerations to Threads
• Related Modules
In this section, we will explore the different ways by which you can
achieve more parallelism in your code. We will begin by differentiating between processes and threads in the first few sections of this
chapter. We will then introduce the notion of multithreaded programming
and present some multithreaded programming features found in Python.
(Those of you already familiar with multithreaded programming can skip
directly to Section 4.3.5.) The final sections of this chapter present some
examples of how to use the threading and Queue modules to accomplish
multithreaded programming with Python.
4.1 Introduction/Motivation
Before the advent of multithreaded (MT) programming, the execution of
computer programs consisted of a single sequence of steps that were executed in synchronous order by the host’s CPU. This style of execution was
the norm whether the task itself required the sequential ordering of steps
or if the entire program was actually an aggregation of multiple subtasks.
What if these subtasks were independent, having no causal relationship
(meaning that results of subtasks do not affect other subtask outcomes)? Is
it not logical, then, to want to run these independent tasks all at the same
time? Such parallel processing could significantly improve the performance of the overall task. This is what MT programming is all about.
MT programming is ideal for programming tasks that are asynchronous
in nature, require multiple concurrent activities, and where the processing
of each activity might be nondeterministic, that is, random and unpredictable.
Such programming tasks can be organized or partitioned into multiple
streams of execution wherein each has a specific task to accomplish.
Depending on the application, these subtasks might calculate intermediate
results that could be merged into a final piece of output.
While CPU-bound tasks might be fairly straightforward to divide into
subtasks and execute sequentially or in a multithreaded manner, the task
of managing a single-threaded process with multiple external sources of
input is not as trivial. To achieve such a programming task without multithreading, a sequential program must use one or more timers and implement a multiplexing scheme.
A sequential program will need to sample each I/O terminal channel to
check for user input; however, it is important that the program does not
block when reading the I/O terminal channel, because the arrival of user
input is nondeterministic, and blocking would prevent processing of other
I/O channels. The sequential program must use non-blocked I/O or
blocked I/O with a timer (so that blocking is only temporary).
Because the sequential program is a single thread of execution, it must
juggle the multiple tasks that it needs to perform, making sure that it does
not spend too much time on any one task, and it must ensure that user
response time is appropriately distributed. The use of a sequential program for this type of task often results in a complicated flow of control
that is difficult to understand and maintain.
Using an MT program with a shared data structure such as a Queue
(a multithreaded queue data structure, discussed later in this chapter), this
programming task can be organized with a few threads that have specific
functions to perform:
• UserRequestThread: Responsible for reading client input,
perhaps from an I/O channel. A number of threads would be
created by the program, one for each current client, with
requests being entered into the queue.
• RequestProcessor: A thread that is responsible for retrieving
requests from the queue and processing them, providing
output for yet a third thread.
• ReplyThread: Responsible for taking output destined for the
user and either sending it back (if in a networked application)
or writing data to the local file system or database.
Organizing this programming task with multiple threads reduces the
complexity of the program and enables an implementation that is clean,
efficient, and well organized. The logic in each thread is typically less complex because it has a specific job to do. For example, the UserRequestThread
simply reads input from a user and places the data into a queue for further
processing by another thread, etc. Each thread has its own job to do; you
merely have to design each type of thread to do one thing and do it well.
Use of threads for specific tasks is not unlike Henry Ford’s assembly line
model for manufacturing automobiles.
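The three roles described above can be sketched with a pair of queues. This is only a minimal illustration in Python 3 (where the Queue module is named queue); the function names, the DONE sentinel, and the list standing in for the network or database are all assumptions for this example rather than code from a real application.

```python
import queue
import threading

requests = queue.Queue()   # filled by the request-reading threads
replies = queue.Queue()    # filled by the processor, drained by the reply thread
DONE = object()            # hypothetical sentinel meaning "no more work"
delivered = []             # stands in for the network, file system, or database

def user_request_thread(client_id, lines):
    # Reads one client's input (a plain list here) and queues each request.
    for line in lines:
        requests.put((client_id, line))

def request_processor():
    # Retrieves requests, processes them, and hands results to the reply thread.
    while True:
        item = requests.get()
        if item is DONE:
            replies.put(DONE)
            break
        client_id, line = item
        replies.put((client_id, line.upper()))

def reply_thread():
    # Takes finished output and "sends" it by appending to a list.
    while True:
        item = replies.get()
        if item is DONE:
            break
        delivered.append(item)

clients = {1: ['ping', 'quit'], 2: ['hello']}
readers = [threading.Thread(target=user_request_thread, args=(cid, lines))
           for cid, lines in clients.items()]
workers = [threading.Thread(target=request_processor),
           threading.Thread(target=reply_thread)]

for t in readers + workers:
    t.start()
for t in readers:
    t.join()            # all requests are queued once the readers finish
requests.put(DONE)      # tell the processor to shut down
for t in workers:
    t.join()
```

Each function does one job and knows nothing about the others beyond the queue it reads from or writes to, which is the assembly-line property the text describes.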
4.2
Threads and Processes
4.2.1 What Are Processes?
Computer programs are merely executables, binary (or otherwise), which
reside on disk. They do not take on a life of their own until loaded into
memory and invoked by the operating system. A process (sometimes called
a heavyweight process) is a program in execution. Each process has its own
address space, memory, a data stack, and other auxiliary data to keep
track of execution. The operating system manages the execution of all processes on the system, dividing the time fairly between all processes.
Processes can also fork or spawn new processes to perform other tasks, but
each new process has its own memory, data stack, etc., and cannot generally share information unless interprocess communication (IPC) is employed.
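To make the separation concrete, here is a small Python 3 sketch that spawns a child process with the standard subprocess module. The variable names and the one-line child program are assumptions for illustration; the point is only that the child cannot see the parent's variables, so data goes in as an argument and comes back through a pipe, a simple form of IPC.

```python
import subprocess
import sys

counter = 10  # lives only in this (parent) process's memory

# The child is a separate interpreter process with its own address space.
# It receives the value as a command-line argument and reports its result
# through a pipe connected to its standard output.
child_code = "import sys; print(int(sys.argv[1]) + 1)"
result = subprocess.run(
    [sys.executable, "-c", child_code, str(counter)],
    capture_output=True, text=True, check=True,
)
child_answer = int(result.stdout)
```

The parent's counter is unchanged afterward; only the explicitly communicated result crosses the process boundary.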
4.2.2
What Are Threads?
Threads (sometimes called lightweight processes) are similar to processes
except that they all execute within the same process, and thus all share the
same context. They can be thought of as “mini-processes” running in parallel within a main process or “main thread.”
A thread has a beginning, an execution sequence, and a conclusion. It has
an instruction pointer that keeps track of where within its context it is currently running. It can be preempted (interrupted) and temporarily put on
hold (also known as sleeping) while other threads are running—this is called
yielding.
Multiple threads within a process share the same data space with the
main thread and can therefore share information or communicate with
one another more easily than if they were separate processes. Threads are
generally executed in a concurrent fashion, and it is this concurrency and
data sharing that enable the coordination of multiple tasks. Naturally, it is
impossible to run truly in parallel on a single-CPU system, so
threads are scheduled in such a way that they run for a little bit, then yield
to other threads (going to the proverbial back of the line to await more
CPU time again). Throughout the execution of the entire process, each
thread performs its own, separate tasks, and communicates the results
with other threads as necessary.
Of course, such sharing is not without its dangers. If two or more
threads access the same piece of data, inconsistent results can arise because
of the ordering of data access. This is commonly known as a race condition.
Fortunately, most thread libraries come with some sort of synchronization
primitives that allow the thread manager to control execution and access.
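As a minimal sketch of one such primitive, the following Python 3 example uses threading.Lock to protect a shared counter. The names counter, counter_lock, and add_many() are made up for this illustration. Without the lock, the read-modify-write in counter += 1 can interleave between threads and lose updates.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # Incrementing is a read, an add, and a write; the lock makes sure
        # only one thread at a time performs that sequence.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, counter ends at exactly 4 * 10000 regardless of how the threads are scheduled.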
Another caveat is that threads are not guaranteed equal and fair execution
time. This is because some functions block until they have completed. If
such functions are not written specifically to take threads into account, the
distribution of CPU time is skewed in favor of these greedy functions.
4.3
Threads and Python
In this section, we discuss how to use threads in Python. This includes the
limitations of threads due to the global interpreter lock and a quick demo
script.
4.3.1
Global Interpreter Lock
Execution of Python code is controlled by the Python Virtual Machine (a.k.a.
the interpreter main loop). Python was designed in such a way that only one
thread of control may be executing in this main loop, similar to how multiple processes in a system share a single CPU. Many programs can be in
memory, but only one is live on the CPU at any given moment. Likewise,
although multiple threads can run within the Python interpreter, only one
thread is being executed by the interpreter at any given time.
Access to the Python Virtual Machine is controlled by the global interpreter lock (GIL). This lock is what ensures that exactly one thread is running. The Python Virtual Machine executes in the following manner in an
MT environment:
1. Set the GIL
2. Switch in a thread to run
3. Execute either of the following:
   a. For a specified number of bytecode instructions, or
   b. If the thread voluntarily yields control (can be accomplished
      with time.sleep(0))
4. Put the thread back to sleep (switch out thread)
5. Unlock the GIL
6. Do it all over again (lather, rinse, repeat)
When a call is made to external code—that is, any C/C++ extension
built-in function—the GIL will be locked until it has completed (because
there are no Python bytecodes to count as the interval). Extension programmers do have the ability to unlock the GIL, however, so as the Python
developer, you shouldn’t have to worry about your Python code locking
up in those situations.
As an example, for any Python I/O-oriented routines (which invoke
built-in operating system C code), the GIL is released before the I/O call is
made, allowing other threads to run while the I/O is being performed.
Code that doesn’t have much I/O will tend to keep the processor (and GIL)
for the full interval a thread is allowed before it yields. In other words,
I/O-bound Python programs stand a much better chance of being able to
take advantage of a multithreaded environment than CPU-bound code.
Those of you who are interested in the source code, the interpreter main
loop, and the GIL can take a look at the Python/ceval.c file.
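The benefit for I/O-bound code can be observed with a short Python 3 sketch in which time.sleep() stands in for a blocking I/O call (the GIL is released while a thread sleeps). The helper names are assumptions for this example. The sketch also reads sys.getswitchinterval(): in Python 3.2 and later, the interpreter asks the running thread to give up the GIL after a time interval rather than after a count of bytecode instructions.

```python
import sys
import threading
import time

def io_task(nsec):
    time.sleep(nsec)   # stands in for a blocking I/O call; the GIL is released

def run_threads(n, nsec):
    threads = [threading.Thread(target=io_task, args=(nsec,)) for _ in range(n)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

# Four waits of 0.2 seconds overlap, so the total is far less than 4 * 0.2.
elapsed = run_threads(4, 0.2)
switch_interval = sys.getswitchinterval()   # seconds between forced GIL handoffs
```

Replacing the sleep with a pure-Python computation would not show the same gain, because only one thread at a time can execute bytecode.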
4.3.2
Exiting Threads
When a thread completes execution of the function it was created for, it
exits. Threads can also quit by calling an exit function such as
thread.exit(), or any of the standard ways of exiting a Python process
such as sys.exit() or raising the SystemExit exception. You cannot, however, go and “kill” a thread.
We will discuss in detail the two Python modules related to threads in the
next section, but of the two, the thread module is the one we do not recommend. There are many reasons for this, but an obvious one is that when the
main thread exits, all other threads die without cleanup. The other module,
threading, ensures that the whole process stays alive until all “important”
child threads have exited. (For a clarification of what important means, read
the upcoming Core Tip, “Avoid using the thread module.”)
Main threads should always be good managers, though, and perform the
task of knowing what needs to be executed by individual threads, what data
or arguments each of the spawned threads requires, when they complete
execution, and what results they provide. In so doing, those main threads
can collate the individual results into a final, meaningful conclusion.
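A minimal Python 3 sketch of a main thread acting as such a manager follows. The partial_sum() worker and the way the data is split are assumptions for illustration: the main thread decides what each worker gets, waits for completion, and then collates the results.

```python
import threading

def partial_sum(numbers, results, index):
    # Each worker writes only to its own slot, so no lock is needed here.
    results[index] = sum(numbers)

data = list(range(1, 101))
chunks = [data[0:50], data[50:100]]      # the arguments each thread requires
results = [None] * len(chunks)

workers = [threading.Thread(target=partial_sum, args=(chunk, results, i))
           for i, chunk in enumerate(chunks)]
for w in workers:
    w.start()
for w in workers:
    w.join()                             # know when each worker has completed

total = sum(results)                     # collate into a final result
```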
4.3.3
Accessing Threads from Python
Python supports multithreaded programming, depending on the operating
system on which it’s running. It is supported on most Unix-based platforms,
such as Linux, Solaris, Mac OS X, *BSD, as well as Windows-based PCs.
Python uses POSIX-compliant threads, or pthreads, as they are commonly
known.
By default, threads are enabled when building Python from source
(since Python 2.0) or the Win32 installed binary. To determine whether
threads are available for your interpreter, simply attempt to import the
thread module from the interactive interpreter, as shown here (no errors
occur when threads are available):
>>> import thread
>>>
If your Python interpreter was not compiled with threads enabled, the
module import fails:
>>> import thread
Traceback (innermost last):
File "<stdin>", line 1, in ?
ImportError: No module named thread
In such cases, you might need to recompile your Python interpreter to
get access to threads. This usually involves invoking the configure script
with the --with-thread option. Check the README file for your distribution
to obtain specific instructions on how to compile Python with threads for
your system.
4.3.4
Life Without Threads
For our first set of examples, we are going to use the time.sleep() function to show how threads work. time.sleep() takes a floating point argument and “sleeps” for the given number of seconds, meaning that
execution is temporarily halted for the amount of time specified.
Let’s create two time loops: one that sleeps for 4 seconds (loop0()), and
one that sleeps for 2 seconds (loop1()), respectively. (We use the names
“loop0” and “loop1” as a hint that we will eventually have a sequence of
loops.) If we were to execute loop0() and loop1() sequentially in a one-process or single-threaded program, as onethr.py does in Example 4-1, the
total execution time would be at least 6 seconds. There might or might not
be a 1-second gap between the starting of loop0() and loop1() as well as
other execution overhead which can cause the overall time to be bumped
to 7 seconds.
Example 4-1
Loops Executed by a Single Thread (onethr.py)
This script executes two loops consecutively in a single-threaded program. One
loop must complete before the other can begin. The total elapsed time is the sum
of times taken by each loop.

    #!/usr/bin/env python

    from time import sleep, ctime

    def loop0():
        print 'start loop 0 at:', ctime()
        sleep(4)
        print 'loop 0 done at:', ctime()

    def loop1():
        print 'start loop 1 at:', ctime()
        sleep(2)
        print 'loop 1 done at:', ctime()

    def main():
        print 'starting at:', ctime()
        loop0()
        loop1()
        print 'all DONE at:', ctime()

    if __name__ == '__main__':
        main()

We can verify this by executing onethr.py, which renders the following
output:
$ onethr.py
starting at: Sun Aug 13 05:03:34 2006
start loop 0 at: Sun Aug 13 05:03:34 2006
loop 0 done at: Sun Aug 13 05:03:38 2006
start loop 1 at: Sun Aug 13 05:03:38 2006
loop 1 done at: Sun Aug 13 05:03:40 2006
all DONE at: Sun Aug 13 05:03:40 2006
Now, assume that rather than sleeping, loop0() and loop1() were separate functions that performed individual and independent computations,
all working to arrive at a common solution. Wouldn’t it be useful to have
them run in parallel to cut down on the overall running time? That is the
premise behind MT programming that we now introduce.
4.3.5
Python Threading Modules
Python provides several modules to support MT programming, including
the thread, threading, and Queue modules. Programmers can use the thread
and threading modules to create and manage threads. The thread module
provides basic thread and locking support; threading provides higher-level,
fully-featured thread management. With the Queue module, users can
create a queue data structure that can be shared across multiple threads.
We will take a look at these modules individually and present examples and
intermediate-sized applications.
CORE TIP: Avoid using the thread module
We recommend using the high-level threading module instead of the thread
module for many reasons. threading is more contemporary, has better thread
support, and some attributes in the thread module can conflict with those in the
threading module. Another reason is that the lower-level thread module has few
synchronization primitives (actually only one) while threading has many.
However, in the interest of learning Python and threading in general, we do
present some code that uses the thread module. We present these for learning
purposes only; hopefully they give you a much better insight as to why you
would want to avoid using thread. We will also show you how to use more
appropriate tools such as those available in the threading and Queue modules.
Another reason to avoid using thread is because there is no control of when
your process exits. When the main thread finishes, any other threads will also
die, without warning or proper cleanup. As mentioned earlier, at least threading
allows the important child threads to finish first before exiting.
3.x
Use of the thread module is recommended only for experts desiring lower-level thread access. To emphasize this, it is renamed to _thread in Python 3.
Any multithreaded application you create should utilize threading and perhaps other higher-level modules.
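If you do need the low-level module in code that must run under both Python 2 and Python 3, one common idiom is to try the old name first and fall back to the new one. This is only a sketch of that idiom, not a recommendation to use the module.

```python
try:
    import thread              # Python 2 name
except ImportError:
    import _thread as thread   # Python 3 name for the same low-level module

lock = thread.allocate_lock()  # works the same way under either name
```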
4.4
The thread Module
Let’s take a look at what the thread module has to offer. In addition to
being able to spawn threads, the thread module also provides a basic synchronization data structure called a lock object (a.k.a. primitive lock, simple
lock, mutual exclusion lock, mutex, and binary semaphore). As we mentioned
earlier, such synchronization primitives go hand in hand with thread
management.
Table 4-1 lists the more commonly used thread functions and LockType
lock object methods.
Table 4-1 thread Module and Lock Objects

Function/Method                      Description
-----------------------------------  ----------------------------------------------
thread Module Functions
start_new_thread(function,           Spawns a new thread and executes function
    args, kwargs=None)               with the given args and optional kwargs
allocate_lock()                      Allocates LockType lock object
exit()                               Instructs a thread to exit

LockType Lock Object Methods
acquire(wait=None)                   Attempts to acquire lock object
locked()                             Returns True if lock acquired, False otherwise
release()                            Releases lock

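The lock object methods listed in Table 4-1 can be tried out on their own, without spawning any threads. The short sketch below is written for Python 3, where the module is named _thread; the variable names are ours. Passing False to acquire() requests a non-blocking attempt, which returns immediately instead of waiting.

```python
import _thread   # named thread in Python 2

lock = _thread.allocate_lock()
state_before = lock.locked()       # False: nobody holds the lock yet
got_it = lock.acquire()            # blocks until acquired, then returns True
state_held = lock.locked()         # True
second_try = lock.acquire(False)   # non-blocking attempt on a held lock: False
lock.release()
state_after = lock.locked()        # False again
```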
The key function of the thread module is start_new_thread(). It takes a
function (object) plus arguments and optionally, keyword arguments. A
new thread is spawned specifically to invoke the function.
Let’s take our onethr.py example and integrate threading into it. By
slightly changing the call to the loop*() functions, we now present mtsleepA.py
in Example 4-2:
Example 4-2
Using the thread Module (mtsleepA.py)
The same loops from onethr.py are executed, but this time using the simple
multithreaded mechanism provided by the thread module. The two loops are
executed concurrently (with the shorter one finishing first, obviously), and the
total elapsed time is only as long as the slowest thread rather than the total time for
each separately.

    #!/usr/bin/env python

    import thread
    from time import sleep, ctime

    def loop0():
        print 'start loop 0 at:', ctime()
        sleep(4)
        print 'loop 0 done at:', ctime()

    def loop1():
        print 'start loop 1 at:', ctime()
        sleep(2)
        print 'loop 1 done at:', ctime()

    def main():
        print 'starting at:', ctime()
        thread.start_new_thread(loop0, ())
        thread.start_new_thread(loop1, ())
        sleep(6)
        print 'all DONE at:', ctime()

    if __name__ == '__main__':
        main()

start_new_thread() requires the first two arguments, so that is the reason for
passing in an empty tuple even if the executing function requires no arguments.
Upon execution of this program, our output changes drastically. Rather
than taking a full 6 or 7 seconds, both loops now finish within 4 seconds, the
length of time of our longest loop, plus any overhead (the script as a whole
still runs for 6 seconds because of the sleep(6) call in main()).
$ mtsleepA.py
starting at: Sun Aug 13 05:04:50 2006
start loop 0 at: Sun Aug 13 05:04:50 2006
start loop 1 at: Sun Aug 13 05:04:50 2006
loop 1 done at: Sun Aug 13 05:04:52 2006
loop 0 done at: Sun Aug 13 05:04:54 2006
all DONE at: Sun Aug 13 05:04:56 2006
The pieces of code that sleep for 4 and 2 seconds now occur concurrently, so both loops are done 4 seconds after starting instead of 6. You can even see how
loop 1 finishes before loop 0.
The only other major change to our application is the addition of the
sleep(6) call. Why is this necessary? The reason is that if we did not stop
the main thread from continuing, it would proceed to the next statement,
displaying “all done” and exit, killing both threads running loop0() and
loop1().
We did not have any code that directed the main thread to wait for the
child threads to complete before continuing. This is what we mean by
threads requiring some sort of synchronization. In our case, we used
another sleep() call as our synchronization mechanism. We used a value
of 6 seconds because we know that both threads (which take 4 and 2 seconds) should have completed by the time the main thread has counted to 6.
You are probably thinking that there should be a better way of managing threads than creating that extra delay of 6 seconds in the main
thread. Because of this delay, the overall runtime is no better than in our
single-threaded version. Using sleep() for thread synchronization as we
did is not reliable. What if our loops had independent and varying execution times? We could be exiting the main thread too early or too late.
This is where locks come in.
Making yet another update to our code to include locks as well as getting
rid of separate loop functions, we get mtsleepB.py, which is presented in
Example 4-3. Running it, we see that the output is similar to mtsleepA.py.
The only difference is that we did not have to wait the extra time for
mtsleepA.py to conclude. By using locks, we were able to exit as soon as
both threads had completed execution. This renders the following output:
$ mtsleepB.py
starting at: Sun Aug 13 16:34:41 2006
start loop 0 at: Sun Aug 13 16:34:41 2006
start loop 1 at: Sun Aug 13 16:34:41 2006
loop 1 done at: Sun Aug 13 16:34:43 2006
loop 0 done at: Sun Aug 13 16:34:45 2006
all DONE at: Sun Aug 13 16:34:45 2006
Example 4-3
Using thread and Locks (mtsleepB.py)
Rather than using a call to sleep() to hold up the main thread as in
mtsleepA.py, the use of locks makes more sense.

     1  #!/usr/bin/env python
     2
     3  import thread
     4  from time import sleep, ctime
     5
     6  loops = [4,2]
     7
     8  def loop(nloop, nsec, lock):
     9      print 'start loop', nloop, 'at:', ctime()
    10      sleep(nsec)
    11      print 'loop', nloop, 'done at:', ctime()
    12      lock.release()
    13
    14  def main():
    15      print 'starting at:', ctime()
    16      locks = []
    17      nloops = range(len(loops))
    18
    19      for i in nloops:
    20          lock = thread.allocate_lock()
    21          lock.acquire()
    22          locks.append(lock)
    23
    24      for i in nloops:
    25          thread.start_new_thread(loop,
    26              (i, loops[i], locks[i]))
    27
    28      for i in nloops:
    29          while locks[i].locked(): pass
    30
    31      print 'all DONE at:', ctime()
    32
    33  if __name__ == '__main__':
    34      main()

So how did we accomplish our task with locks? Let’s take a look at the
source code.
Line-by-Line Explanation
Lines 1–6
After the Unix startup line, we import the thread module and a few familiar attributes of the time module. Rather than hardcoding separate functions to count to 4 and 2 seconds, we use a single loop() function and
place these constants in a list, loops.
Lines 8–12
The loop() function acts as a proxy for the deleted loop*() functions from
our earlier examples. We had to make some cosmetic changes to loop() so
that it can now perform its duties using locks. The obvious changes are
that we need to be told which loop number we are as well as the sleep
duration. The last piece of new information is the lock itself. Each thread
will be allocated an acquired lock. When the sleep() time has concluded,
we release the corresponding lock, indicating to the main thread that this
thread has completed.
Lines 14–34
The bulk of the work is done here in main(), using three separate for
loops. We first create a list of locks, which we obtain by using the
thread.allocate_lock() function and acquire (each lock) with the
acquire() method. Acquiring a lock has the effect of “locking the lock.”
Once it is locked, we add the lock to the lock list, locks. The next loop
actually spawns the threads, invoking the loop() function per thread, and
for each thread, provides it with the loop number, the sleep duration, and
the acquired lock for that thread. So why didn’t we start the threads in the
lock acquisition loop? There are two reasons. First, we wanted to synchronize the threads, so that all the horses started out the gate around the same
time, and second, locks take a little bit of time to be acquired. If your
thread executes too fast, it is possible that it completes before the lock has
a chance to be acquired.
It is up to each thread to unlock its lock object when it has completed
execution. The final loop just sits and spins (pausing the main thread)
until both locks have been released before continuing execution. Because
we are checking each lock sequentially, we might be at the mercy of all the
slower loops if they are more toward the beginning of the set of loops. In
such cases, the majority of the wait time may be for the first loop(s). When
that lock is released, remaining locks may have already been unlocked
(meaning that corresponding threads have completed execution). The
result is that the main thread will fly through those lock checks without
pause. Finally, you should be well aware that the final pair of lines will
execute main() only if we are invoking this script directly.
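The spinning in the final loop keeps the main thread busy while it waits. One possible alternative, sketched here in Python 3 (where thread is named _thread), is to have the main thread call acquire() on each lock instead: the call simply blocks until the worker releases the lock. The sleep times are shortened and the finished list is added purely for this illustration.

```python
import _thread   # named thread in Python 2
from time import sleep

loops = [0.2, 0.1]   # shortened sleep times so the sketch runs quickly
finished = []        # records which loops have completed

def loop(nloop, nsec, lock):
    sleep(nsec)
    finished.append(nloop)
    lock.release()           # signal the main thread that this loop is done

locks = []
for _ in loops:
    lock = _thread.allocate_lock()
    lock.acquire()           # hand each thread an already-held lock
    locks.append(lock)

for i, nsec in enumerate(loops):
    _thread.start_new_thread(loop, (i, nsec, locks[i]))

for lock in locks:
    lock.acquire()           # blocks until the worker releases; no busy loop
```

The overall behavior is the same as in mtsleepB.py, but the main thread consumes no CPU time while it waits.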
As hinted in the earlier Core Tip, we presented the thread module
only to introduce the reader to threaded programming. Your MT application should use higher-level modules such as the threading module,
which we discuss in the next section.
4.5
The threading Module
We will now introduce the higher-level threading module, which gives
you not only a Thread class but also a wide variety of synchronization
mechanisms to use to your heart’s content. Table 4-2 presents a list of all
the objects available in the threading module.
Table 4-2 threading Module Objects

Object             Description
-----------------  ----------------------------------------------------------
Thread             Object that represents a single thread of execution
Lock               Primitive lock object (same lock as in thread module)
RLock              Re-entrant lock object provides ability for a single thread
                   to (re)acquire an already-held lock (recursive locking)
Condition          Condition variable object causes one thread to wait until
                   a certain "condition" has been satisfied by another
                   thread, such as changing of state or of some data value
Event              General version of condition variables, whereby any
                   number of threads are waiting for some event to occur
                   and all will awaken when the event happens
Semaphore          Provides a "counter" of finite resources shared between
                   threads; block when none are available
BoundedSemaphore   Similar to a Semaphore but ensures that it never exceeds
                   its initial value
Timer              Similar to Thread, except that it waits for an allotted
                   period of time before running
Barrier [a]        Creates a "barrier," at which a specified number of
                   threads must all arrive before they're all allowed to
                   continue

a. New in Python 3.2.
In this section, we will examine how to use the Thread class to implement threading. Because we have already covered the basics of locking,
we will not cover the locking primitives here. The Thread() class also contains a form of synchronization, so explicit use of locking primitives is not
necessary.
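Although we do not cover these synchronization objects in detail here, a brief Python 3 sketch may help make two of the entries in Table 4-2 concrete. An Event holds all the workers at a starting line, and a BoundedSemaphore limits how many of them can be inside a section at once. The worker() function and the bookkeeping variables are assumptions made for this example.

```python
import threading

ready = threading.Event()               # starts out unset
slots = threading.BoundedSemaphore(2)   # at most two holders at a time
state_lock = threading.Lock()           # protects the bookkeeping below
active = 0
peak = 0
done = []

def worker(n):
    global active, peak
    ready.wait()                  # every worker blocks here until set()
    with slots:                   # blocks while two workers are already inside
        with state_lock:
            active += 1
            peak = max(peak, active)
        with state_lock:
            active -= 1
            done.append(n)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
ready.set()                       # wake all waiting workers at once
for t in threads:
    t.join()
```

However the five workers are scheduled, peak never exceeds the semaphore's initial value of 2.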
CORE TIP: Daemon threads
Another reason to avoid using the thread module is that it does not support the
concept of daemon (or daemonic) threads. When the main thread exits, all child
threads will be killed, regardless of whether they are doing work. The concept of
daemon threads comes into play here if you do not desire this behavior.
Support for daemon threads is available in the threading module, and here is
how they work: a daemon is typically a server that waits for client requests to
service. If there is no client work to be done, the daemon sits idle. If you set the
daemon flag for a thread, you are basically saying that it is non-critical, and it is
okay for the process to exit without waiting for it to finish. As you have seen in
Chapter 2, “Network Programming,” server threads run in an infinite loop and do
not exit in normal situations.
If your main thread is ready to exit and you do not care to wait for the child
threads to finish, then set their daemon flags. A value of True denotes that a thread
is not important or, more likely, is not doing anything but waiting for a client.
To set a thread as daemonic, make this assignment: thread.daemon = True before
you start the thread. (The old-style way of calling thread.setDaemon(True) is
deprecated.) The same is true for checking on a thread’s daemonic status; just
check that value (versus calling thread.isDaemon()). A new child thread inherits its daemonic flag from its parent. The entire Python program (read as: the
main thread) will stay alive until all non-daemonic threads have exited—in
other words, when no active non-daemonic threads are left.
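The following Python 3 sketch illustrates both points: setting the flag before start(), and a child inheriting the flag from the thread that creates it. The serve_forever() function and the never-set Event are stand-ins for a real server loop and are assumptions for this example.

```python
import threading

never = threading.Event()            # never set, so wait() blocks indefinitely

def serve_forever():
    never.wait()                     # stands in for a server's endless request loop

server = threading.Thread(target=serve_forever)
server.daemon = True                 # must be assigned before start()
server.start()

inherited = []

def parent():
    child = threading.Thread(target=serve_forever)
    inherited.append(child.daemon)   # a new thread copies its creator's flag

p = threading.Thread(target=parent)
p.daemon = True
p.start()
p.join()
```

When this script reaches its end, the interpreter exits even though server is still blocked, because only daemonic threads remain.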
4.5.1
The Thread Class
The Thread class of the threading module is your primary executive
object. It has a variety of functions not available to the thread module.
Table 4-3 presents a list of attributes and methods.
Table 4-3 Thread Object Attributes and Methods

Attribute                      Description
-----------------------------  ---------------------------------------------------
Thread object data attributes
name                           The name of a thread.
ident                          The identifier of a thread.
daemon                         Boolean flag indicating whether a thread is
                               daemonic.

Thread object methods
__init__(group=None,           Instantiate a Thread object, taking target callable
    target=None, name=None,    and any args or kwargs. A name or group can also
    args=(), kwargs={},        be passed but the latter is unimplemented. A
    verbose=None,              verbose flag is also accepted. Any daemon value
    daemon=None) [c]           sets the thread.daemon attribute/flag.
start()                        Begin thread execution.
run()                          Method defining thread functionality (usually
                               overridden by application writer in a subclass).
join(timeout=None)             Suspend until the started thread terminates; blocks
                               unless timeout (in seconds) is given.
getName() [a]                  Return name of thread.
setName(name) [a]              Set name of thread.
isAlive/is_alive() [b]         Boolean flag indicating whether thread is still
                               running.
isDaemon() [c]                 Return True if thread daemonic, False otherwise.
setDaemon(daemonic) [c]        Set the daemon flag to the given Boolean daemonic
                               value (must be called before thread start()).

a. Deprecated by setting (or getting) thread.name attribute or passed in during instantiation.
b. CamelCase names deprecated and replaced starting in Python 2.6.
c. is/setDaemon() deprecated by setting thread.daemon attribute; thread.daemon can
also be set during instantiation via the optional daemon value (new in Python 3.3).
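A few of the attributes and methods in Table 4-3 can be observed over the life of a single thread. This Python 3 sketch uses a hypothetical nap() function and thread name; it shows that ident is not assigned until the thread starts and that is_alive() tracks whether the thread is running.

```python
import threading
import time

def nap(nsec):
    time.sleep(nsec)

t = threading.Thread(target=nap, args=(0.2,), name='napper')
before = (t.name, t.ident, t.is_alive())   # ident is None until start()
t.start()
during = (t.ident is not None, t.is_alive())   # running while nap() sleeps
t.join()                                   # wait for the thread to terminate
after = t.is_alive()
```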
There are a variety of ways by which you can create threads using the
Thread class. We cover three of them here, all quite similar. Pick the one
you feel most comfortable with, not to mention the most appropriate for
your application and future scalability (we like the final choice the best):
• Create Thread instance, passing in function
• Create Thread instance, passing in callable class instance
• Subclass Thread and create subclass instance
You’ll discover that you will pick either the first or third option. The latter is chosen when a more object-oriented interface is desired and the former, otherwise. The second, honestly, is a bit more awkward and slightly
harder to read, as you’ll discover.
Create Thread Instance, Passing in Function
In our first example, we will just instantiate Thread, passing in our function (and its arguments) in a manner similar to our previous examples.
This function is what will be executed when we direct the thread to begin
execution. Taking our mtsleepB.py script from Example 4-3 and tweaking
it by adding the use of Thread objects, we have mtsleepC.py, as shown in
Example 4-4.
Example 4-4
Using the threading Module (mtsleepC.py)
The Thread class from the threading module has a join() method that lets the
main thread wait for thread completion.

    #!/usr/bin/env python

    import threading
    from time import sleep, ctime

    loops = [4,2]

    def loop(nloop, nsec):
        print 'start loop', nloop, 'at:', ctime()
        sleep(nsec)
        print 'loop', nloop, 'done at:', ctime()

    def main():
        print 'starting at:', ctime()
        threads = []
        nloops = range(len(loops))

        for i in nloops:
            t = threading.Thread(target=loop,
                args=(i, loops[i]))
            threads.append(t)

        for i in nloops:            # start threads
            threads[i].start()

        for i in nloops:            # wait for all
            threads[i].join()       # threads to finish

        print 'all DONE at:', ctime()

    if __name__ == '__main__':
        main()

When we run the script in Example 4-4, we see output similar to that of
its predecessors:
$ mtsleepC.py
starting at: Sun Aug 13 18:16:38 2006
start loop 0 at: Sun Aug 13 18:16:38 2006
start loop 1 at: Sun Aug 13 18:16:38 2006
loop 1 done at: Sun Aug 13 18:16:40 2006
loop 0 done at: Sun Aug 13 18:16:42 2006
all DONE at: Sun Aug 13 18:16:42 2006
So what did change? Gone are the locks that we had to implement when
using the thread module. Instead, we create a set of Thread objects. When
each Thread is instantiated, we dutifully pass in the function (target) and
arguments (args) and receive a Thread instance in return. The biggest difference between instantiating Thread (calling Thread()) and invoking
thread.start_new_thread() is that the new thread does not begin execution right away. This is a useful synchronization feature, especially when
you don’t want the threads to start immediately.
Once all the threads have been allocated, we let them go off to the races
by invoking each thread’s start() method, but not a moment before that.
And rather than having to manage a set of locks (allocating, acquiring,
releasing, checking lock state, etc.), we simply call the join() method for
each thread. join() will wait until a thread terminates, or, if provided, a
timeout occurs. Use of join() appears much cleaner than an infinite loop
that waits for locks to be released (which is why these locks are sometimes
known as spin locks).
One other important aspect of join() is that it does not need to be
called at all. Once threads are started, they will execute until their given
function completes, at which point, they will exit. If your main thread has
things to do other than wait for threads to complete (such as other processing or waiting for new client requests), it should do so. join() is useful
only when you want to wait for thread completion.
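When the main thread does have other things to do, the optional timeout lets it check on a thread periodically rather than blocking until completion. A minimal Python 3 sketch, with a hypothetical slow_task() and a counter standing in for the other work:

```python
import threading
import time

def slow_task():
    time.sleep(0.3)

t = threading.Thread(target=slow_task)
t.start()

ticks = 0
while t.is_alive():
    ticks += 1          # stands in for other useful main-thread work
    t.join(0.05)        # wait at most 50 ms, then go around again
```

Note that join() returning after a timeout does not stop the thread; is_alive() is what tells you whether it has actually finished.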
Create Thread Instance, Passing in Callable Class Instance
A similar offshoot to passing in a function when creating a thread is having a callable class and passing in an instance for execution—this is the
more object-oriented approach to MT programming. Such a callable class
embodies an execution environment that is much more flexible than a
function or choosing from a set of functions. You now have the power of
a class object behind you, as opposed to a single function or a list/tuple of
functions.
Adding our new class ThreadFunc to the code and making other slight
modifications to mtsleepC.py, we get mtsleepD.py, shown in Example 4-5.
Example 4-5
Using Callable Classes (mtsleepD.py)
In this example, we pass in a callable class (instance) as opposed to just a
function. It presents more of an object-oriented approach than mtsleepC.py.

#!/usr/bin/env python

import threading
from time import sleep, ctime

loops = [4,2]

class ThreadFunc(object):

    def __init__(self, func, args, name=''):
        self.name = name
        self.func = func
        self.args = args

    def __call__(self):
        self.func(*self.args)

def loop(nloop, nsec):
    print 'start loop', nloop, 'at:', ctime()
    sleep(nsec)
    print 'loop', nloop, 'done at:', ctime()

def main():
    print 'starting at:', ctime()
    threads = []
    nloops = range(len(loops))

    for i in nloops:    # create all threads
        t = threading.Thread(
            target=ThreadFunc(loop, (i, loops[i]),
            loop.__name__))
        threads.append(t)

    for i in nloops:    # start all threads
        threads[i].start()

    for i in nloops:    # wait for completion
        threads[i].join()

    print 'all DONE at:', ctime()

if __name__ == '__main__':
    main()
When we run mtsleepD.py, we get the expected output:
$ mtsleepD.py
starting at: Sun Aug 13 18:49:17 2006
start loop 0 at: Sun Aug 13 18:49:17 2006
start loop 1 at: Sun Aug 13 18:49:17 2006
loop 1 done at: Sun Aug 13 18:49:19 2006
loop 0 done at: Sun Aug 13 18:49:21 2006
all DONE at: Sun Aug 13 18:49:21 2006
So what are the changes this time? The addition of the ThreadFunc class
and a minor change to instantiate the Thread object, which also instantiates ThreadFunc, our callable class. In effect, we have a double instantiation
going on here. Let’s take a closer look at our ThreadFunc class.
We want to make this class general enough to use with functions other
than our loop() function, so we added some new infrastructure, such as
having this class hold the arguments for the function, the function itself,
and also a function name string. The constructor __init__() just sets all
the values.
When the Thread code calls our ThreadFunc object because a new thread
is created, it will invoke the __call__() special method. Because we
already have our set of arguments, we do not need to pass it to the
Thread() constructor and can call the function directly.
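To make the role of __call__() concrete, here is a minimal Python 3 sketch. The ThreadFunc class mirrors the one in mtsleepD.py; the record() function and the results list are hypothetical stand-ins for loop(), added only so that the effect can be inspected.

```python
import threading

class ThreadFunc(object):
    """Bundle a function with its arguments so the instance is callable."""
    def __init__(self, func, args, name=''):
        self.name = name
        self.func = func
        self.args = args

    def __call__(self):
        # The thread calls its target with no arguments, which lands here
        self.func(*self.args)

results = []

def record(label, value):        # hypothetical stand-in for loop()
    results.append((label, value))

tf = ThreadFunc(record, ('direct', 1), record.__name__)
tf()                             # calling the instance runs __call__()

t = threading.Thread(target=ThreadFunc(record, ('threaded', 2)))
t.start()
t.join()
print(results)
```

The same instance works whether it is called directly or handed to Thread as the target, which is what makes the class reusable with functions other than loop().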
Subclass Thread and Create Subclass Instance
The final introductory example involves subclassing Thread(), which turns
out to be extremely similar to creating a callable class as in the previous
example. Subclassing is a bit easier to read when you are creating
your threads (lines 29–30). We will present the code for mtsleepE.py in
Example 4-6 as well as the output obtained from its execution, and leave it
as an exercise for you to compare mtsleepE.py to mtsleepD.py.
Example 4-6
Subclassing Thread (mtsleepE.py)
Rather than instantiating the Thread class, we subclass it. This gives us more
flexibility in customizing our threading objects and simplifies the thread
creation call.

 1  #!/usr/bin/env python
 2
 3  import threading
 4  from time import sleep, ctime
 5
 6  loops = (4, 2)
 7
 8  class MyThread(threading.Thread):
 9      def __init__(self, func, args, name=''):
10          threading.Thread.__init__(self)
11          self.name = name
12          self.func = func
13          self.args = args
14
15      def run(self):
16          self.func(*self.args)
17
18  def loop(nloop, nsec):
19      print 'start loop', nloop, 'at:', ctime()
20      sleep(nsec)
21      print 'loop', nloop, 'done at:', ctime()
22
23  def main():
24      print 'starting at:', ctime()
25      threads = []
26      nloops = range(len(loops))
27
28      for i in nloops:
29          t = MyThread(loop, (i, loops[i]),
30              loop.__name__)
31          threads.append(t)
32
33      for i in nloops:
34          threads[i].start()
35
36      for i in nloops:
37          threads[i].join()
38
39      print 'all DONE at:', ctime()
40
41  if __name__ == '__main__':
42      main()
Here is the output for mtsleepE.py. Again, it’s just as we expected:
$ mtsleepE.py
starting at: Sun Aug 13 19:14:26 2006
start loop 0 at: Sun Aug 13 19:14:26 2006
start loop 1 at: Sun Aug 13 19:14:26 2006
loop 1 done at: Sun Aug 13 19:14:28 2006
loop 0 done at: Sun Aug 13 19:14:30 2006
all DONE at: Sun Aug 13 19:14:30 2006
While you compare the source between the mtsleepD and mtsleepE
modules, we want to point out the most significant changes: 1) our MyThread
subclass constructor must first invoke the base class constructor (line 10),
and 2) the former special method __call__() must be called run() in the
subclass.
We now modify our MyThread class with some diagnostic output and
store it in a separate module called myThread (look ahead to Example 4-7)
and import this class for the upcoming examples. Rather than simply calling our functions, we also save the result to instance attribute self.res,
and create a new method to retrieve that value, getResult().
Example 4-7
MyThread Subclass of Thread (myThread.py)
To generalize our subclass of Thread from mtsleepE.py, we move the subclass to
a separate module and add a getResult() method for callables that produce
return values.

#!/usr/bin/env python

import threading
from time import ctime

class MyThread(threading.Thread):
    def __init__(self, func, args, name=''):
        threading.Thread.__init__(self)
        self.name = name
        self.func = func
        self.args = args

    def getResult(self):
        return self.res

    def run(self):
        print 'starting', self.name, 'at:', \
            ctime()
        self.res = self.func(*self.args)
        print self.name, 'finished at:', \
            ctime()
4.5.2
Other Threading Module Functions
In addition to the various synchronization and threading objects, the threading
module also has some supporting functions, as detailed in Table 4-4.
Table 4-4 threading Module Functions

Function                              Description
activeCount()/active_count() [a]      Number of currently active Thread objects
currentThread()/current_thread() [a]  Returns the current Thread object
enumerate()                           Returns list of all currently active Threads
settrace(func) [b]                    Sets a trace function for all threads
setprofile(func) [b]                  Sets a profile function for all threads
stack_size(size=0) [c]                Returns stack size of newly created threads;
                                      optional size can be set for subsequently
                                      created threads

a. CamelCase names deprecated and replaced starting in Python 2.6.
b. New in Python 2.3.
c. An alias to thread.stack_size(); (both) new in Python 2.5.
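A few of these functions can be tried out with a small Python 3 sketch. The report() function, the worker names, and the use of an Event object to keep the threads alive are assumptions made for the demonstration; they are not taken from any of the example scripts.

```python
import threading

def report(seen, gate):
    # Record the name of the Thread object that is running this function
    seen.append(threading.current_thread().name)
    gate.wait()                  # stay alive until the main thread says go

seen = []
gate = threading.Event()
workers = [threading.Thread(target=report, args=(seen, gate),
                            name='worker-%d' % i) for i in range(2)]

before = threading.active_count()
for t in workers:
    t.start()
during = threading.active_count()    # two more threads are alive now
alive = sorted(t.name for t in threading.enumerate() if t in workers)

gate.set()                           # release the workers
for t in workers:
    t.join()
print(before, during, alive, sorted(seen))
```

Note that active_count() and enumerate() include the main thread, so the sketch compares counts before and after starting the workers rather than assuming an absolute number.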
4.6
Comparing Single vs. Multithreaded Execution
The mtfacfib.py script, presented in Example 4-8, compares execution of
the recursive Fibonacci, factorial, and summation functions. This script
runs all three functions in a single-threaded manner. It then performs the
same task by using threads to illustrate one of the advantages of having
a threading environment.
Example 4-8
Fibonacci, Factorial, Summation (mtfacfib.py)
In this MT application, we execute three separate recursive functions—first in a
single-threaded fashion, followed by the alternative with multiple threads.

#!/usr/bin/env python

from myThread import MyThread
from time import ctime, sleep

def fib(x):
    sleep(0.005)
    if x < 2: return 1
    return (fib(x-2) + fib(x-1))

def fac(x):
    sleep(0.1)
    if x < 2: return 1
    return (x * fac(x-1))

def sum(x):
    sleep(0.1)
    if x < 2: return 1
    return (x + sum(x-1))

funcs = [fib, fac, sum]
n = 12

def main():
    nfuncs = range(len(funcs))

    print '*** SINGLE THREAD'
    for i in nfuncs:
        print 'starting', funcs[i].__name__, 'at:', \
            ctime()
        print funcs[i](n)
        print funcs[i].__name__, 'finished at:', \
            ctime()

    print '\n*** MULTIPLE THREADS'
    threads = []
    for i in nfuncs:
        t = MyThread(funcs[i], (n,),
            funcs[i].__name__)
        threads.append(t)

    for i in nfuncs:
        threads[i].start()

    for i in nfuncs:
        threads[i].join()
        print threads[i].getResult()

    print 'all DONE'

if __name__ == '__main__':
    main()
Running in single-threaded mode simply involves calling the functions
one at a time and displaying the corresponding results right after the function call.
When running in multithreaded mode, we do not display the result
right away. Because we want to keep our MyThread class as general as possible (being able to execute callables that do and do not produce output),
we wait until the end to call the getResult() method to finally show you
the return values of each function call.
Because these functions execute so quickly (well, maybe except for the
Fibonacci function), we added calls to sleep() to each one to slow things
down. This lets us see how threading can improve performance when the
actual work has varying execution times; in real code, you certainly
wouldn’t pad your work with calls to sleep(). Anyway, here is the output:
$ mtfacfib.py
*** SINGLE THREAD
starting fib at: Wed Nov 16 18:52:20 2011
233
fib finished at: Wed Nov 16 18:52:24 2011
starting fac at: Wed Nov 16 18:52:24 2011
479001600
fac finished at: Wed Nov 16 18:52:26 2011
starting sum at: Wed Nov 16 18:52:26 2011
78
sum finished at: Wed Nov 16 18:52:27 2011

*** MULTIPLE THREADS
starting fib at: Wed Nov 16 18:52:27 2011
starting fac at: Wed Nov 16 18:52:27 2011
starting sum at: Wed Nov 16 18:52:27 2011
fac finished at: Wed Nov 16 18:52:28 2011
sum finished at: Wed Nov 16 18:52:28 2011
fib finished at: Wed Nov 16 18:52:31 2011
233
479001600
78
all DONE
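In the run above, the single-threaded pass takes about seven seconds (18:52:20 to 18:52:27), whereas the threaded pass takes about four (18:52:27 to 18:52:31), roughly the time of the slowest function. The same effect can be measured directly with a minimal Python 3 sketch; the slow() function and the delay values are illustrative assumptions, with sleep() standing in for work that blocks.

```python
import threading
from time import sleep, time

def slow(nsec):
    sleep(nsec)                  # stands in for blocking work such as I/O

delays = (0.3, 0.2, 0.1)         # illustrative values

start = time()
for d in delays:                 # one after another
    slow(d)
serial = time() - start

start = time()
threads = [threading.Thread(target=slow, args=(d,)) for d in delays]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time() - start

print('serial:   %.2f secs' % serial)
print('threaded: %.2f secs' % threaded)
```

The serial time is close to the sum of the delays, and the threaded time is close to the largest single delay.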
4.7
Multithreading in Practice
So far, none of the simplistic sample snippets we’ve seen represent
code that you’d write in practice. They don’t really do anything useful
beyond demonstrating threads and the different ways that you can create
them. The ways we’ve started them up and waited for them to finish are all
identical, and they all just sleep, too.
We also mentioned earlier in Section 4.3.1 that because the
Python Virtual Machine is effectively single-threaded (due to the GIL),
greater concurrency in Python is only possible when threading is applied to
an I/O-bound application (versus CPU-bound applications, which only do
round-robin). So let’s look at an example of this, and as a further exercise,
try to port it to Python 3 to give you a sense of what that process entails.
4.7.1
Book Rankings Example
The bookrank.py script shown in Example 4-9 is very straightforward. It
goes to one of my favorite online retailers, Amazon, and asks for the
current rankings of books written by yours truly. In our sample code,
you’ll see a function, getRanking(), that uses a regular expression to pull
out and return the current ranking, plus _showRanking(), which displays the
result to the user.
Note that, according to their Conditions of Use guidelines, “Amazon
grants you a limited license to access and make personal use of this site and not to
download (other than page caching) or modify it, or any portion of it, except with
express written consent of Amazon.” For our application, all we’re doing is
looking at the current book rankings for a specific book and then throwing
everything away; we’re not even caching the page.
Example 4-9 is our first (but nearly-final) attempt at bookrank.py, which
is a non-threaded version.
Example 4-9
Book Rankings “Screenscraper” (bookrank.py)
This script makes calls to download book ranking information via separate
threads.

 1  #!/usr/bin/env python
 2
 3  from atexit import register
 4  from re import compile
 5  from threading import Thread
 6  from time import ctime
 7  from urllib2 import urlopen as uopen
 8
 9  REGEX = compile('#([\d,]+) in Books ')
10  AMZN = 'http://amazon.com/dp/'
11  ISBNs = {
12      '0132269937': 'Core Python Programming',
13      '0132356139': 'Python Web Development with Django',
14      '0137143419': 'Python Fundamentals',
15  }
16
17  def getRanking(isbn):
18      page = uopen('%s%s' % (AMZN, isbn)) # or str.format()
19      data = page.read()
20      page.close()
21      return REGEX.findall(data)[0]
22
23  def _showRanking(isbn):
24      print '- %r ranked %s' % (
25          ISBNs[isbn], getRanking(isbn))
26
27  def _main():
28      print 'At', ctime(), 'on Amazon...'
29      for isbn in ISBNs:
30          _showRanking(isbn)
31
32  @register
33  def _atexit():
34      print 'all DONE at:', ctime()
35
36  if __name__ == '__main__':
37      _main()
Line-by-Line Explanation
Lines 1–7
These are the startup and import lines. We’ll use the atexit.register()
function to tell us when the script is over (you’ll see why later). We’ll also
use the regular expression re.compile() function for the pattern that
matches a book’s ranking on Amazon’s product pages. Then, we save the
threading.Thread import for future improvement (coming up a bit later),
time.ctime() for the current timestamp string, and urllib2.urlopen() for
accessing each link.
Lines 9–15
We use three constants in this script. REGEX is the regular expression object
(compiled from the regex pattern that matches a book’s ranking). AMZN is the
base Amazon product link; all we need to complete each link is a book’s
International Standard Book Number (ISBN), which serves as a book’s ID,
differentiating one written work from all others. There are two standards:
the ISBN-10 ten-character value and its successor, the ISBN-13
thirteen-character ISBN. Currently, Amazon’s systems understand both ISBN
types, so we’ll just use ISBN-10 because they’re shorter. These are stored in
the third constant, the ISBNs dictionary, along with the corresponding book
titles.
Lines 17–21
The purpose of getRanking() is to take an ISBN, create the final URL with
which to communicate to Amazon’s servers, and then call urllib2.urlopen()
on it. We used the string format operator to put together the URL (on line 18)
but if you’re using version 2.6 and newer, you can also try the str.format()
method, for example, '{0}{1}'.format(AMZN,isbn).
Once you have the full URL, call urllib2.urlopen() (we shortened it to
uopen()) and expect a file-like object back once the Web server has been
contacted. Then the read() call is issued to download the entire Web page,
and the “file” is closed. If the regex is as precise as we have planned, there
should be exactly one match, so we grab it from the generated list
(any additional matches would be dropped) and return it to the caller.
Lines 23–25
The _showRanking() function is just a short snippet of code that takes an
ISBN, looks up the title of the book it represents, calls getRanking() to get
its current ranking on Amazon’s Web site, and then outputs both of these
values to the user. The leading single-underscore notation indicates that
this is a special function only to be used by code within this module and
should not be imported by any other application using this as a library or
utility module.
Lines 27–30
_main() is also a special function, only executed if this module is run
directly from the command-line (and not imported for use by another
module). It shows the start and end times (to let users know how long it
took to run the entire script) and calls _showRanking() for each ISBN to look up
and display each book’s current ranking on Amazon.
Lines 32–37
These lines present something completely different. What is atexit.register()?
It’s a function (used in a decorator role here) that registers an exit function
with the Python interpreter, meaning it’s requesting a special function be
called just before the script quits. (Instead of the decorator, you could have
also done register(_atexit).)
Why are we using it here? Well, right now, it’s definitely not needed.
The print statement could very well go at the end of _main() in lines 27–31,
but that’s not a really great place for it. Plus this is functionality that you
might really want to use in a real production application at some point.
We assume that you know what lines 36–37 are about, so onto the output:
$ python bookrank.py
At Wed Mar 30 22:11:19 2011 PDT on Amazon...
- 'Core Python Programming' ranked 87,118
- 'Python Fundamentals' ranked 851,816
- 'Python Web Development with Django' ranked 184,735
all DONE at: Wed Mar 30 22:11:25 2011
If you’re wondering, we’ve separated the process of retrieving (getRanking())
and displaying (_showRanking() and _main()) the data in case you wish to
do something other than dumping the results out to the user via the terminal. In practice, you might need to send this data back via a Web template,
store it in a database, text it to a mobile phone, etc. If you put all of this
code into a single function, it makes it harder to reuse and/or repurpose.
Also, if Amazon changes the layout of their product pages, you might
need to modify the regular expression “screenscraper” to continue to be
able to extract the data from the product page. By the way, using a regex
(or even plain old string processing) for this simple example is fine, but
you might need a more powerful markup parser, such as HTMLParser
from the standard library or third-party tools like BeautifulSoup, html5lib,
or lxml. (We demonstrate a few of these in Chapter 9, “Web Clients and
Servers.”)
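The sensitivity of the regex to page layout can be tried without touching the network. The following Python 3 sketch applies the same pattern to a made-up fragment of text; SAMPLE and the parseRanking() helper are assumptions for illustration and are not part of bookrank.py. Unlike getRanking(), which raises IndexError when nothing matches, the helper returns None.

```python
from re import compile

REGEX = compile(r'#([\d,]+) in Books ')

# Made-up fragment standing in for a downloaded product page
SAMPLE = '<li><b>Best Sellers Rank:</b> #87,118 in Books (See Top 100)</li>'

def parseRanking(data):
    """Return the first ranking found in data, or None if nothing matches."""
    matches = REGEX.findall(data)
    return matches[0] if matches else None

print(parseRanking(SAMPLE))              # 87,118
print(parseRanking('<li>no rank</li>'))  # None
```

Keeping the parsing in its own function also makes it easy to swap in a markup parser later without changing the code that downloads or displays the data.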
Add threading
Okay, you don’t have to tell me that this is still a silly single-threaded program. We’re going to change our application to use threads instead. It is an
I/O-bound application, so this is a good candidate to do so. To simplify
things, we won’t use any of the classes and object-oriented programming;
instead, we’ll use threading.Thread directly, so you can think of this more
as a derivative of mtsleepC.py than any of the succeeding examples. We’ll
just spawn the threads and start them up immediately.
Take your application and modify the _showRanking(isbn) call to the
following:
Thread(target=_showRanking, args=(isbn,)).start()
That’s it! Now you have your final version of bookrank.py and can see
that the application (typically) runs faster because of the added concurrency. But you’re still only as fast as the slowest response.
$ python bookrank.py
At Thu Mar 31 10:11:32 2011 on Amazon...
- 'Python Fundamentals' ranked 869,010
- 'Core Python Programming' ranked 36,481
- 'Python Web Development with Django' ranked 219,228
all DONE at: Thu Mar 31 10:11:35 2011
As you can see from the output, instead of taking six seconds as our
single-threaded version did, our threaded version only takes three. Also note
that the output is in “by completion” order, which is variable, versus the
single-threaded display. With the non-threaded version, the order is
always by key, but now the queries all happen in parallel with the output
coming as each thread completes its work.
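The “by completion” ordering can be reproduced without a network connection. In this Python 3 sketch, the LATENCY table and fetch() function are hypothetical; sleep() stands in for the time spent waiting on a Web server.

```python
import threading
from time import sleep

# Hypothetical per-request latencies in seconds; no network access here
LATENCY = {'A': 0.3, 'B': 0.1, 'C': 0.2}
finished = []

def fetch(key):
    sleep(LATENCY[key])          # pretend to wait on a Web server
    finished.append(key)         # list.append is atomic in CPython

threads = [threading.Thread(target=fetch, args=(key,)) for key in LATENCY]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # joined here only so we can inspect the list
print(finished)
```

The keys come out ordered by latency rather than by the order in which the threads were started, which is the behavior seen in the threaded bookrank.py output.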
In the earlier mtsleepX.py examples, we used Thread.join() on all the
threads to block execution until each thread exits. This effectively prevents
the main thread from continuing until all threads are done, so the print
statement of “all DONE at” is called at the correct time.
In those examples, it’s not necessary to join() all the threads because
none of them are daemon threads. The main thread is not going to exit the
script until all the spawned threads have completed anyway. Because of
this reasoning, we’ve dropped all the join()s in mtsleepF.py. However,
realize that if we displayed “all done” from the same spot, it would be
incorrect.
The main thread would have displayed “all done” before the threads
have completed, so we can’t have that print call above in _main(). There
are only 2 places we can put this print: after line 37 when _main() returns
(the very final line executed of our script), or use atexit.register() to
register an exit function. Because the latter is something we haven’t discussed before and might be something useful to you later on, we thought
this would be a good place to introduce it to you. This is also one interface
that remains constant between Python 2 and 3, our upcoming challenge.
Porting to Python 3
The next thing we want is a working Python 3 version of this script. As
projects and applications continue down the migration path, this is something with which you need to become familiar, anyway. Fortunately, there
are a few tools to help you, one of them being the 2to3 tool. There are generally two ways of using it:
$ 2to3 foo.py       # only output diff
$ 2to3 -w foo.py    # overwrites w/3.x code
In the first command, the 2to3 tool just displays the differences between
the version 2.x original script and its generated 3.x equivalent. The -w flag
instructs 2to3 to overwrite the original script with the newly minted 3.x
version while renaming the 2.x version to foo.py.bak.
Let’s run 2to3 on bookrank.py, writing over the existing file. It not only
spits out the differences, it also saves the new version, as we just
described:
$ 2to3 -w bookrank.py
RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- bookrank.py (original)
+++ bookrank.py (refactored)
@@ -4,7 +4,7 @@
 from re import compile
 from threading import Thread
 from time import ctime
-from urllib2 import urlopen as uopen
+from urllib.request import urlopen as uopen

 REGEX = compile('#([\d,]+) in Books ')
 AMZN = 'http://amazon.com/dp/'
@@ -21,17 +21,17 @@
     return REGEX.findall(data)[0]

 def _showRanking(isbn):
-    print '- %r ranked %s' % (
-        ISBNs[isbn], getRanking(isbn))
+    print('- %r ranked %s' % (
+        ISBNs[isbn], getRanking(isbn)))

 def _main():
-    print 'At', ctime(), 'on Amazon...'
+    print('At', ctime(), 'on Amazon...')
     for isbn in ISBNs:
         Thread(target=_showRanking,
             args=(isbn,)).start()#_showRanking(isbn)

 @register
 def _atexit():
-    print 'all DONE at:', ctime()
+    print('all DONE at:', ctime())

 if __name__ == '__main__':
     _main()
RefactoringTool: Files that were modified:
RefactoringTool: bookrank.py
The following step is optional for readers, but we renamed our files to
bookrank.py and bookrank3.py by using these POSIX commands (Windows-based PC users should use the ren command):
$ mv bookrank.py bookrank3.py
$ mv bookrank.py.bak bookrank.py
If you try to run our new next-generation script, it’s probably wishful
thinking that it’s a perfect translation and that you’re done with your
work. Something bad happened, and you’ll get the following exception in
each thread (this output is for just one thread as they’re all the same):
$ python3 bookrank3.py
Exception in thread Thread-1:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/
3.2/lib/python3.2/threading.py", line 736, in
_bootstrap_inner
self.run()
File "/Library/Frameworks/Python.framework/Versions/
3.2/lib/python3.2/threading.py", line 689, in run
self._target(*self._args, **self._kwargs)
File "bookrank3.py", line 25, in _showRanking
ISBNs[isbn], getRanking(isbn)))
File "bookrank3.py", line 21, in getRanking
return REGEX.findall(data)[0]
TypeError: can't use a string pattern on a bytes-like object
:
Darn it! Apparently the problem is that the regular expression is a (Unicode) string, whereas the data that comes back from the urlopen() file-like
object’s read() method is an ASCII/bytes string. The fix here is to compile
a bytes object instead of a text string. Therefore, change line 9 so that
re.compile() is compiling a bytes string. To
do this, add the bytes string designation b just before the opening quote,
as shown here:
REGEX = compile(b'#([\d,]+) in Books ')
Now let’s try it again:
$ python3 bookrank3.py
At Sun Apr 3 00:45:46 2011 on Amazon...
- 'Core Python Programming' ranked b'108,796'
- 'Python Web Development with Django' ranked b'268,660'
- 'Python Fundamentals' ranked b'969,149'
all DONE at: Sun Apr 3 00:45:49 2011
Aargh! What’s wrong now? Well, it’s a little bit better (no errors), but the
output looks weird. The ranking values grabbed by the regular expressions, when passed to str() show the b and quotes. Your first instinct
might be to try ugly string slicing:
>>> x = b'xxx'
>>> repr(x)
"b'xxx'"
>>> str(x)
"b'xxx'"
>>> str(x)[2:-1]
'xxx'
However, it’s just more appropriate to convert it to a real (Unicode)
string, perhaps using UTF-8:
>>> str(x, 'utf-8')
'xxx'
To do that in our script, make a similar change to line 21 so that it now
reads as:
return str(REGEX.findall(data)[0], 'utf-8')
Now, the output of our Python 3 script matches that of our Python 2 script:
$ python3 bookrank3.py
At Sun Apr 3 00:47:31 2011 on Amazon...
- 'Python Fundamentals' ranked 969,149
- 'Python Web Development with Django' ranked 268,660
- 'Core Python Programming' ranked 108,796
all DONE at: Sun Apr 3 00:47:34 2011
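Both fixes can be checked in isolation with a short Python 3 sketch. The data value is a made-up bytes string standing in for the downloaded page, not real Amazon output.

```python
from re import compile

data = b'ranked #108,796 in Books (sample bytes, not a real page)'

try:
    compile(r'#([\d,]+) in Books ').findall(data)   # str pattern, bytes data
    error = None
except TypeError as e:
    error = str(e)                                  # the mismatch is rejected

REGEX = compile(br'#([\d,]+) in Books ')            # bytes pattern
raw = REGEX.findall(data)[0]                        # still a bytes object
rank = str(raw, 'utf-8')                            # same as raw.decode('utf-8')
print(error)
print(repr(raw), rank)
```

The pattern and the data must be of the same type; once the match has been extracted, decoding it yields an ordinary string that prints without the b prefix and quotes.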
In general, you’ll find that porting from version 2.x to version 3.x follows a similar pattern: you ensure that all your unit and integration tests
pass, knock down all the basics using 2to3 (and other tools), and then
clean up the aftermath by getting the code to run and pass the same tests.
We’ll try this exercise again with our next example which demonstrates
the use of synchronization with threads.
4.7.2
Synchronization Primitives
In the main part of this chapter, we looked at basic threading concepts and
how to utilize threading in Python applications. However, we neglected to
mention one very important aspect of threaded programming: synchronization. Oftentimes in threaded code, you will have certain functions or
blocks that should not be executed by more than one thread at a time.
Usually these involve modifying a database, updating a file, or
anything similar that might cause a race condition, which, if you recall
from earlier in the chapter, is when the program behaves differently or
produces inconsistent data depending on which thread happens to run
first. (You can read more about race conditions on
the Wikipedia page at http://en.wikipedia.org/wiki/Race_condition.)
Such cases require synchronization. Synchronization is used when any
number of threads can come up to one of these critical sections of code
(http://en.wikipedia.org/wiki/Critical_section), but only one is allowed
through at any given time. The programmer makes these determinations
and chooses the appropriate synchronization primitives, or thread control
mechanisms to perform the synchronization. There are different types of
process synchronization (see http://en.wikipedia.org/wiki/Synchronization_(computer_science)) and Python supports several types, giving you enough
choices to select the best one to get the job done.
We introduced them all to you earlier at the beginning of this section, so
here we’d like to demonstrate a couple of sample scripts that use two types
of synchronization primitives: locks/mutexes, and semaphores. A lock is
the simplest and lowest-level of all these mechanisms; while semaphores
are for situations in which multiple threads are contending for a finite
resource. Locks are easier to explain, so we’ll start there, and then discuss
semaphores.
4.7.3
Locking Example
Locks have two states: locked and unlocked (surprise, surprise). They support only two functions: acquire and release. These actions mean exactly
what you think.
As multiple threads vie for a lock, the first thread to acquire one is permitted to go in and execute code in the critical section. All other threads
coming along are blocked until the first thread wraps up, exits the critical
section, and releases the lock. At this moment, any of the other waiting
threads can acquire the lock and enter the critical section. Note that there
is no ordering (first come, first served) for the blocked threads; the selection of the “winning” thread is not deterministic and can vary between
different implementations of Python.
Let’s see why locks are necessary. mtsleepF.py is an application that
spawns a random number of threads, each of which outputs when it has
completed. Take a look at the core chunk of (Python 2) source here:
from atexit import register
from random import randrange
from threading import Thread, currentThread
from time import sleep, ctime

class CleanOutputSet(set):
    def __str__(self):
        return ', '.join(x for x in self)

loops = (randrange(2,5) for x in xrange(randrange(3,7)))
remaining = CleanOutputSet()

def loop(nsec):
    myname = currentThread().name
    remaining.add(myname)
    print '[%s] Started %s' % (ctime(), myname)
    sleep(nsec)
    remaining.remove(myname)
    print '[%s] Completed %s (%d secs)' % (
        ctime(), myname, nsec)
    print '    (remaining: %s)' % (remaining or 'NONE')

def _main():
    for pause in loops:
        Thread(target=loop, args=(pause,)).start()

@register
def _atexit():
    print 'all DONE at:', ctime()
We’ll have a longer line-by-line explanation once we’ve finalized our
code with locking, but basically what mtsleepF.py does is expand on our
earlier examples. Like bookrank.py, we simplify the code a bit by skipping
object-oriented programming, drop the list of thread objects and thread
join()s, and (re)use atexit.register() (for all the same reasons as
bookrank.py).
Also as a minor change to the earlier mtsleepX.py examples, instead of
hardcoding a pair of loops/threads sleeping for 4 and 2 seconds, respectively, we wanted to mix it up a little by randomly creating between 3 and
6 threads, each of which can sleep anywhere between 2 and 4 seconds.
One of the new features that stands out is the use of a set to hold the
names of the remaining threads still running. The reason why we’re subclassing the set object instead of using it directly is because we just want to
demonstrate another use case, altering the default printable string representation of a set.
When you display a set, you get output such as set([X, Y, Z,...]). The
issue is that the users of our application don’t (and shouldn’t) need to
know anything about sets or that we’re using them. We just want to display something like X, Y, Z, ..., instead; thus the reason why we derived
from set and implemented its __str__() method.
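The effect of overriding __str__() can be seen in a small Python 3 sketch. The call to sorted() is our addition so that the order is predictable; the class in mtsleepF.py joins the members in whatever order the set yields them.

```python
class CleanOutputSet(set):
    """A set whose str() is just its members separated by commas."""
    def __str__(self):
        # sorted() is an addition for this sketch, to fix the order
        return ', '.join(sorted(self))

remaining = CleanOutputSet(['Thread-2', 'Thread-1'])
print(remaining)                 # Thread-1, Thread-2
print(repr(remaining))           # repr() is untouched by our override
remaining.clear()
print(remaining or 'NONE')       # an empty set is false, so this shows NONE
```

The last line shows why the script can write remaining or 'NONE': an empty set evaluates as false, so the fallback string is displayed instead.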
With this change, and if you’re lucky, the output will be all nice and
lined up properly:
$ python mtsleepF.py
[Sat Apr 2 11:37:26 2011] Started Thread-1
[Sat Apr 2 11:37:26 2011] Started Thread-2
[Sat Apr 2 11:37:26 2011] Started Thread-3
[Sat Apr 2 11:37:29 2011] Completed Thread-2 (3 secs)
(remaining: Thread-3, Thread-1)
[Sat Apr 2 11:37:30 2011] Completed Thread-1 (4 secs)
(remaining: Thread-3)
[Sat Apr 2 11:37:30 2011] Completed Thread-3 (4 secs)
(remaining: NONE)
all DONE at: Sat Apr 2 11:37:30 2011
However, if you’re unlucky, you might get strange output such as this
pair of example executions:
$ python mtsleepF.py
[Sat Apr 2 11:37:09 2011] Started Thread-1
[Sat Apr 2 11:37:09 2011] Started Thread-2
[Sat Apr 2 11:37:09 2011] Started Thread-3
[Sat Apr 2 11:37:12 2011] Completed Thread-1 (3 secs)
[Sat Apr 2 11:37:12 2011] Completed Thread-2 (3 secs)
(remaining: Thread-3)
(remaining: Thread-3)
[Sat Apr 2 11:37:12 2011] Completed Thread-3 (3 secs)
(remaining: NONE)
all DONE at: Sat Apr 2 11:37:12 2011
$ python mtsleepF.py
[Sat Apr 2 11:37:56 2011] Started Thread-1
[Sat Apr 2 11:37:56 2011] Started Thread-2
[Sat Apr 2 11:37:56 2011] Started Thread-3
[Sat Apr 2 11:37:56 2011] Started Thread-4
[Sat Apr 2 11:37:58 2011] Completed Thread-2 (2 secs)
[Sat Apr 2 11:37:58 2011] Completed Thread-4 (2 secs)
(remaining: Thread-3, Thread-1)
(remaining: Thread-3, Thread-1)
[Sat Apr 2 11:38:00 2011] Completed Thread-1 (4 secs)
(remaining: Thread-3)
[Sat Apr 2 11:38:00 2011] Completed Thread-3 (4 secs)
(remaining: NONE)
all DONE at: Sat Apr 2 11:38:00 2011
What’s wrong? Well, for one thing, the output might appear partially
garbled (because multiple threads might be executing I/O in parallel). You
can see this in the preceding sample runs, in which the output of two threads
is interleaved. Another problem arises when you have two threads
modifying the same variable (the set containing the names of the remaining threads).
Both the I/O and access to the same data structure are part of critical sections; therefore, we need locks to prevent more than one thread from
entering them at the same time. To add locking, you need to add a line of
code to import the Lock (or RLock) object and create a lock object, so add/
modify your code to contain these lines in the right places:
from threading import Thread, Lock, currentThread
lock = Lock()
Now you must use your lock. The following code highlights the acquire()
and release() calls that we should insert into our loop() function:
    def loop(nsec):
        myname = currentThread().name
        lock.acquire()
        remaining.add(myname)
        print '[%s] Started %s' % (ctime(), myname)
        lock.release()
        sleep(nsec)
        lock.acquire()
        remaining.remove(myname)
        print '[%s] Completed %s (%d secs)' % (
            ctime(), myname, nsec)
        print '    (remaining: %s)' % (remaining or 'NONE')
        lock.release()
Once the changes are made, you should no longer get strange output:
$ python mtsleepF.py
[Sun Apr 3 23:16:59 2011] Started Thread-1
[Sun Apr 3 23:16:59 2011] Started Thread-2
[Sun Apr 3 23:16:59 2011] Started Thread-3
[Sun Apr 3 23:16:59 2011] Started Thread-4
[Sun Apr 3 23:17:01 2011] Completed Thread-3 (2 secs)
(remaining: Thread-4, Thread-2, Thread-1)
[Sun Apr 3 23:17:01 2011] Completed Thread-4 (2 secs)
(remaining: Thread-2, Thread-1)
[Sun Apr 3 23:17:02 2011] Completed Thread-1 (3 secs)
(remaining: Thread-2)
[Sun Apr 3 23:17:03 2011] Completed Thread-2 (4 secs)
(remaining: NONE)
all DONE at: Sun Apr 3 23:17:03 2011
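One caveat about calling acquire() and release() by hand: if the code between them raises an exception, the lock stays held and every other thread waiting on it blocks forever. The following is a minimal sketch, not part of mtsleepF.py, of pairing the two calls with try/finally. The record() function, its fail flag, and the log list are hypothetical names used only for illustration.

```python
from threading import Lock

lock = Lock()
log = []

def record(message, fail=False):
    """Append message to the shared log inside a critical section."""
    lock.acquire()
    try:
        if fail:
            raise RuntimeError('simulated failure inside the critical section')
        log.append(message)
    finally:
        # Runs whether or not the body raised, so the lock is never left held.
        lock.release()

record('first')
try:
    record('second', fail=True)
except RuntimeError:
    pass
record('third')   # would block forever had the lock not been released
print(log)        # ['first', 'third']
```

The same guarantee is what the with statement provides for locks, without the explicit try/finally.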
The modified (and final) version of mtsleepF.py is shown in Example 4-10.
Example 4-10 Locks and More Randomness (mtsleepF.py)
In this example, we demonstrate the use of locks and other threading tools.
 1  #!/usr/bin/env python
 2
 3  from atexit import register
 4  from random import randrange
 5  from threading import Thread, Lock, currentThread
 6  from time import sleep, ctime
 7
 8  class CleanOutputSet(set):
 9      def __str__(self):
10          return ', '.join(x for x in self)
11
12  lock = Lock()
13  loops = (randrange(2,5) for x in xrange(randrange(3,7)))
14  remaining = CleanOutputSet()
15
16  def loop(nsec):
17      myname = currentThread().name
18      lock.acquire()
19      remaining.add(myname)
20      print '[%s] Started %s' % (ctime(), myname)
21      lock.release()
22      sleep(nsec)
23      lock.acquire()
24      remaining.remove(myname)
25      print '[%s] Completed %s (%d secs)' % (
26          ctime(), myname, nsec)
27      print '    (remaining: %s)' % (remaining or 'NONE')
28      lock.release()
29
30  def _main():
31      for pause in loops:
32          Thread(target=loop, args=(pause,)).start()
33
34  @register
35  def _atexit():
36      print 'all DONE at:', ctime()
37
38  if __name__ == '__main__':
39      _main()
Line-by-Line Explanation
Lines 1–6
These are the usual startup and import lines. Be aware that threading.currentThread() is renamed to threading.current_thread() starting
in version 2.6 but with the older name remaining intact for backward
compatibility.
Lines 8–10
This is the set subclass we described earlier. It contains an implementation
of __str__() to change the output from the default to a comma-delimited
string of its elements.
Lines 12–14
Our global variables consist of the lock, an instance of our modified set
from above, and a random number of threads (between three and six),
each of which will pause or sleep for between two and four seconds.
Lines 16–28
The loop() function saves the name of the current thread executing it, then
acquires a lock so that adding that name to the remaining set and
printing the message that the thread has started happen as one atomic step (no other
thread can enter this critical section in the meantime). After releasing the lock, this thread
sleeps for the predetermined random number of seconds, then re-acquires
the lock in order to do its final output before releasing it.
Lines 30–39
The _main() function is only executed if this script was not imported for
use elsewhere. Its job is to spawn and execute each of the threads. As mentioned before, we use atexit.register() to register the _atexit() function that the interpreter can execute before exiting.
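The expression remaining or 'NONE' in loop() works because an empty set is false in a Boolean context, so the or operator falls through to the placeholder string. The sketch below isolates that idiom together with the __str__() override. It is written for Python 3, and the call to sorted() is our addition so that the printed order is predictable; the listing joins the elements in whatever order the set yields them.

```python
class CleanOutputSet(set):
    """A set that prints as 'X, Y, Z' rather than the default set display."""
    def __str__(self):
        # sorted() is an addition for this sketch; the book's version
        # joins the elements in arbitrary set order.
        return ', '.join(sorted(self))

remaining = CleanOutputSet(['Thread-2', 'Thread-1'])
print('(remaining: %s)' % (remaining or 'NONE'))   # (remaining: Thread-1, Thread-2)

remaining.clear()
# An empty set is falsy, so `or` yields the 'NONE' placeholder instead.
print('(remaining: %s)' % (remaining or 'NONE'))   # (remaining: NONE)
```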
As an alternative to maintaining your own set of currently running
threads, you might consider using threading.enumerate(), which returns
a list of all threads that are still running (including daemon threads, but
not those which haven’t started yet). We didn’t use it for our example here
because it gives us two extra threads that we need to remove to keep our
output short: the current thread (because it hasn’t completed yet) as well
as the main thread (not necessary to show this either).
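To make the trade-off concrete, here is a rough sketch of what filtering the result of threading.enumerate() might look like. It assumes Python 3.4 or newer (for threading.main_thread()); the other_threads() helper and the worker names are hypothetical and exist only for this illustration.

```python
import threading

def other_threads():
    """Names of live threads, minus the caller and the main thread."""
    skip = (threading.current_thread(), threading.main_thread())
    return sorted(t.name for t in threading.enumerate() if t not in skip)

stop = threading.Event()
workers = [threading.Thread(target=stop.wait, name='worker-%d' % i)
           for i in range(3)]
for w in workers:
    w.start()

running = other_threads()     # the three workers are blocked in stop.wait()
print(running)

stop.set()                    # let the workers finish
for w in workers:
    w.join()
```

The filtering step is the extra work mentioned above; keeping our own set avoids it at the cost of having to protect that set with a lock.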
Don’t forget that you can also use the str.format() method instead
of the string format operator if you’re using Python 2.6 or newer (including version 3.x). In other words, this print statement
    print '[%s] Started %s' % (ctime(), myname)
can be replaced by this one in version 2.6 or 2.7:
    print '[{0}] Started {1}'.format(ctime(), myname)
or by this call to the print() function in version 3.x:
    print('[{0}] Started {1}'.format(ctime(), myname))
If you just want a count of currently running threads, you can use
threading.activeCount() (renamed to active_count() starting in version
2.6), instead.
Using Context Management
Another option for those of you using Python 2.5 and newer is to have neither the lock acquire() nor release() calls at all, simplifying your code.
When using the with statement, the context manager for each object is
responsible for calling acquire() before entering the suite and release()
when the block has completed execution.
The threading module objects Lock, RLock, Condition, Semaphore, and
BoundedSemaphore all have context managers, meaning they can be used
with the with statement. By using with, you can further simplify loop() to:
    from __future__ import with_statement # 2.5 only

    def loop(nsec):
        myname = currentThread().name
        with lock:
            remaining.add(myname)
            print '[%s] Started %s' % (ctime(), myname)
        sleep(nsec)
        with lock:
            remaining.remove(myname)
            print '[%s] Completed %s (%d secs)' % (
                ctime(), myname, nsec)
            print '    (remaining: %s)' % (
                remaining or 'NONE',)
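As a quick check of the claim that all five primitives work with the with statement, the following Python 3 sketch enters and leaves each one in turn and then confirms that a lock is released even when the suite raises. The entered list is a hypothetical name used only to record what happened.

```python
import threading

primitives = [
    threading.Lock(),
    threading.RLock(),
    threading.Condition(),
    threading.Semaphore(1),
    threading.BoundedSemaphore(1),
]

entered = []
for prim in primitives:
    with prim:                       # acquire() on entry ...
        entered.append(prim)
    # ... and release() on exit from the suite

lock = primitives[0]
try:
    with lock:
        raise ValueError('simulated failure inside the suite')
except ValueError:
    pass
print(len(entered), lock.locked())   # 5 False
```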
Porting to Python 3
Now let’s do a seemingly easy port to Python 3.x by running the 2to3 tool
on the preceding script (this output is truncated because we saw a full
diff dump earlier):
$ 2to3 -w mtsleepF.py
RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
:
RefactoringTool: Files that were modified:
RefactoringTool: mtsleepF.py
After renaming mtsleepF.py to mtsleepF3.py and mtsleepF.py.bak to
mtsleepF.py, we discover, much to our pleasant surprise, that this is one
script that ported perfectly, with no issues:
$ python3 mtsleepF3.py
[Sun Apr 3 23:29:39 2011] Started Thread-1
[Sun Apr 3 23:29:39 2011] Started Thread-2
[Sun Apr 3 23:29:39 2011] Started Thread-3
[Sun Apr 3 23:29:41 2011] Completed Thread-3 (2 secs)
(remaining: Thread-2, Thread-1)
[Sun Apr 3 23:29:42 2011] Completed Thread-2 (3 secs)
(remaining: Thread-1)
[Sun Apr 3 23:29:43 2011] Completed Thread-1 (4 secs)
(remaining: NONE)
all DONE at: Sun Apr 3 23:29:43 2011
Now let’s take our knowledge of locks, introduce semaphores, and look
at an example that uses both.
4.7.4 Semaphore Example
As stated earlier, locks are pretty simple to understand and implement. It’s
also fairly easy to decide when you need them. However, if the situation is more complex, you might need a more powerful synchronization
primitive instead. For applications with finite resources, semaphores might be a better bet.
Semaphores are some of the oldest synchronization primitives out
there. They’re basically counters that decrement when a resource is being
consumed (and increment again when the resource is released). You can
think of semaphores representing their resources as either available or
unavailable. The action of consuming a resource and decrementing the
counter is traditionally called P() (from the Dutch word probeer/proberen)
but is also known as wait, try, acquire, pend, or procure. Conversely, when a
thread is done with a resource, it needs to return it to the pool. This
action is named V() (from the Dutch word verhogen/verhoog)
but is also known as signal, increment, release, post, or vacate. Python
simplifies all the naming and uses the same function/method names as
locks: acquire and release. Semaphores are more flexible than locks
because you can have multiple threads, each using one of the instances of
the finite resource.
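To illustrate that last point, here is a small Python 3 sketch in which six threads compete for two instances of a resource. The SLOTS constant, the stats dictionary, and the worker() function are hypothetical names chosen for this illustration, and the short sleep merely stands in for real work.

```python
import threading
import time

SLOTS = 2                           # instances of the finite resource
pool = threading.Semaphore(SLOTS)
stats_lock = threading.Lock()       # protects the stats dictionary
stats = {'in_use': 0, 'peak': 0}

def worker():
    with pool:                      # P(): blocks while all SLOTS are taken
        with stats_lock:
            stats['in_use'] += 1
            stats['peak'] = max(stats['peak'], stats['in_use'])
        time.sleep(0.05)            # pretend to use the resource
        with stats_lock:
            stats['in_use'] -= 1
    # V() happens automatically on leaving the `with pool` suite

threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('peak concurrent users:', stats['peak'])
```

However the threads are scheduled, the peak can never exceed SLOTS, which is exactly the guarantee a plain lock (one holder at a time) cannot express.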
For our example, we’re going to simulate an oversimplified candy vending machine. This particular machine has only five slots
available to hold inventory (candy bars). If all slots are taken, no more
candy can be added to the machine, and similarly, if there are no more of
one particular type of candy bar, consumers wishing to purchase that
product are out of luck. We can track these finite resources (candy slots)
by using a semaphore.
Example 4-11 shows the source code (candy.py).
Example 4-11 Candy Vending Machine and Semaphores (candy.py)
This script uses locks and semaphores to simulate a candy vending machine.
 1  #!/usr/bin/env python
 2
 3  from atexit import register
 4  from random import randrange
 5  from threading import BoundedSemaphore, Lock, Thread
 6  from time import sleep, ctime
 7
 8  lock = Lock()
 9  MAX = 5
10  candytray = BoundedSemaphore(MAX)
11
12  def refill():
13      lock.acquire()
14      print 'Refilling candy...',
15      try:
16          candytray.release()
17      except ValueError:
18          print 'full, skipping'
19      else:
20          print 'OK'
21      lock.release()
22
23  def buy():
24      lock.acquire()
25      print 'Buying candy...',
26      if candytray.acquire(False):
27          print 'OK'
28      else:
29          print 'empty, skipping'
30      lock.release()
31
32  def producer(loops):
33      for i in xrange(loops):
34          refill()
35          sleep(randrange(3))
36
37  def consumer(loops):
38      for i in xrange(loops):
39          buy()
40          sleep(randrange(3))
41
42  def _main():
43      print 'starting at:', ctime()
44      nloops = randrange(2, 6)
45      print 'THE CANDY MACHINE (full with %d bars)!' % MAX
46      Thread(target=consumer, args=(randrange(
47          nloops, nloops+MAX+2),)).start() # buyer
48      Thread(target=producer, args=(nloops,)).start() #vndr
49
50  @register
51  def _atexit():
52      print 'all DONE at:', ctime()
53
54  if __name__ == '__main__':
55      _main()
Line-by-Line Explanation
Lines 1–6
The startup and import lines are quite similar to examples earlier in this
chapter. The only thing new is the semaphore. The threading module
comes with two semaphore classes, Semaphore and BoundedSemaphore. As
you know, semaphores are really just counters; they start off with some
fixed number of a finite resource.
This counter decrements when one unit of the resource is allocated, and when
that unit is returned to the pool, the counter increments. The additional
feature you get with a BoundedSemaphore is that the counter can never
increment beyond its initial value; in other words, it prevents the aberrant
use case where a semaphore is released more times than it’s acquired.
Lines 8–10
The global variables in this script are the lock, a constant representing the
maximum number of items that can be inventoried, and the tray of candy.
Lines 12–21
The refill() function is performed when the owner of the fictitious vending machine comes to add one more item to inventory. The entire routine
represents a critical section; this is why acquiring the lock is the only way
to execute all lines. The code outputs its action to the user and warns
when someone tries to exceed the maximum inventory (lines 17–18).
Lines 23–30
buy() is the converse of refill(); it allows a consumer to acquire one unit
of inventory. The conditional (line 26) detects when all finite resources
have been consumed already. The counter can never go below zero, so this
call would normally block until the counter is incremented again. Passing False as the blocking flag instructs the call not to block but to
return False if it would have blocked, indicating no more resources.
Lines 32–40
The producer() and consumer() functions merely loop and make corresponding calls to refill() and buy(), pausing momentarily between calls.
Lines 42–55
The remainder of the code contains the call to _main() if the script was executed from the command-line, the registration of the exit function, and
finally, _main(), which seeds the newly created pair of threads representing the producer and consumer of the candy inventory.
The additional math in the creation of the consumer/buyer is to randomly suggest positive bias where a customer might actually consume
more candy bars than the vendor/producer puts in the machine (otherwise, the code would never enter the situation in which the consumer
attempts to buy a candy bar from an empty machine).
Running the script results in output similar to the following:
$ python candy.py
starting at: Mon Apr 4 00:56:02 2011
THE CANDY MACHINE (full with 5 bars)!
Buying candy... OK
Refilling candy... OK
Refilling candy... full, skipping
Buying candy... OK
Buying candy... OK
Refilling candy... OK
Buying candy... OK
Buying candy... OK
Buying candy... OK
all DONE at: Mon Apr 4 00:56:08 2011
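The two semaphore behaviors that candy.py depends on can be isolated from the threads and the random timing. The Python 3 sketch below uses a tray of only two slots; try_refill() and try_buy() are hypothetical helpers that mirror the bodies of refill() and buy() without the lock or the printing.

```python
from threading import BoundedSemaphore, Semaphore

MAX = 2
tray = BoundedSemaphore(MAX)        # starts full, like the candy machine

def try_refill(sem):
    """Return True if a unit was added, False if the tray was already full."""
    try:
        sem.release()
    except ValueError:              # raised only by BoundedSemaphore
        return False
    return True

def try_buy(sem):
    """Nonblocking acquire: True if a unit was taken, False if empty."""
    return sem.acquire(False)

results = [
    try_refill(tray),   # tray is full
    try_buy(tray),
    try_buy(tray),
    try_buy(tray),      # tray is empty
    try_refill(tray),
]
print(results)          # [False, True, True, False, True]

plain = Semaphore(MAX)
plain.release()         # no error: a plain Semaphore counts past its initial value
```

With a plain Semaphore, the "full, skipping" branch would never run, and the simulated machine could hold more candy than it has slots.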
CORE TIP: Debugging might involve intervention
At some point, you might need to debug a script that uses semaphores, but to do
this, you might need to know exactly what value is in the semaphore’s counter at
any given time. In one of the exercises at the end of the chapter, you will implement such a solution to candy.py, perhaps calling it candydebug.py, and give it the
ability to display the counter’s value. To do this, you’ll need to look at the source
code for threading.py (and probably in both the Python 2 and Python 3
versions).
You’ll discover that the threading module’s synchronization primitives are
not class names even though they use CamelCase capitalization to look like a
class. In fact, they’re really just one-line functions that instantiate the objects
you’re expecting. There are two problems to consider: the first one is that you
can’t subclass them (because they’re functions); the second problem is that the
variable name changed between version 2.x and 3.x.
The entire issue could be avoided if the object gave you clean, easy access to a
counter, which it doesn’t. You can directly access the counter’s value because
it’s just an attribute of the class, but as we just mentioned, the variable name
changed from self.__value, meaning self._Semaphore__value, in Python 2
to self._value in Python 3.
For developers, the cleanest application programming interface (API) (at least
in our opinion) is to derive from threading._BoundedSemaphore class and
implement an __len__() method but use the correct counter value we just discussed if you plan to support this on both version 2.x and version 3.x.
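As a starting point only, here is what the Python 3 half of that idea might look like. It assumes a Python 3 release in which threading.BoundedSemaphore is itself a class that can be subclassed (true of recent releases), and it relies on the private _value attribute named above, which is an implementation detail rather than a documented API. CountingSemaphore is a hypothetical name; supporting Python 2 as well is left to the exercise.

```python
import threading

class CountingSemaphore(threading.BoundedSemaphore):
    """BoundedSemaphore whose current counter can be read with len()."""
    def __len__(self):
        # _value is a private CPython attribute (the Python 3 name given
        # in the text); it is not part of the documented API.
        return self._value

tray = CountingSemaphore(5)
tray.acquire()
print('inventory:', len(tray))   # inventory: 4
```

One side effect worth knowing: because the class defines __len__(), an exhausted semaphore evaluates as false in a Boolean context.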
Porting to Python 3
Similar to mtsleepF.py, candy.py is another example of how the 2to3 tool is
sufficient to generate a working Python 3 version, which we have renamed to
candy3.py. We’ll leave this as an exercise for the reader to confirm.
Summary
We’ve demonstrated only a couple of the synchronization primitives that
come with the threading module. There are plenty more for you to
explore. However, keep in mind that that’s still only what they are: “primitives.” There’s nothing wrong with using them to build your own classes
and data structures that are thread-safe. The Python Standard Library
comes with one, the Queue object.
4.8 Producer-Consumer Problem and the Queue/queue Module
The final example illustrates the producer-consumer scenario in which a
producer of goods or services creates goods and places them in a data structure such as a queue. The amount of time between producing goods is nondeterministic, as is the time the consumer takes between consuming the goods produced by the
producer.
We use the Queue module (Python 2.x; renamed to queue in version 3.x)
to provide an interthread communication mechanism that allows threads
to share data with each other. In particular, we create a queue into which
the producer (thread) places new goods and the consumer (thread) consumes them. Table 4-5 itemizes the various attributes that can be found in
this module.
Table 4-5 Common Queue/queue Module Attributes

Attribute                    Description

Queue/queue Module Classes
Queue(maxsize=0)             Creates a FIFO queue of given maxsize where
                             inserts block until there is more room, or (if
                             omitted), unbounded
LifoQueue(maxsize=0)         Creates a LIFO queue of given maxsize where
                             inserts block until there is more room, or (if
                             omitted), unbounded
PriorityQueue(maxsize=0)     Creates a priority queue of given maxsize where
                             inserts block until there is more room, or (if
                             omitted), unbounded

Queue/queue Exceptions
Empty                        Raised when a get*() method is called for an
                             empty queue
Full                         Raised when a put*() method is called for a full
                             queue

Queue/queue Object Methods
qsize()                      Returns queue size (approximate, because the
                             queue may be getting updated by other threads)
empty()                      Returns True if queue empty, False otherwise
full()                       Returns True if queue full, False otherwise
put(item, block=True,        Puts item in queue; if block is True (the default) and
    timeout=None)            timeout is None, blocks until room is available; if
                             timeout is positive, blocks at most timeout seconds
                             and then raises the Full exception; if block is False,
                             raises the Full exception immediately when there is
                             no room
put_nowait(item)             Same as put(item, False)
get(block=True,              Gets item from queue; if block is True (the default),
    timeout=None)            blocks until an item is available
get_nowait()                 Same as get(False)
task_done()                  Used to indicate work on an enqueued item
                             completed, used with join() below
join()                       Blocks until all items in queue have been processed
                             and signaled by a call to task_done() above
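A few of the entries in Table 4-5 are easier to appreciate in action. The following sketch uses the Python 3 module name, queue (in Python 2 it would be Queue), to show the Full and Empty exceptions along with the task_done()/join() handshake. The consume() function and the processed list are hypothetical names for this illustration.

```python
import queue
import threading

q = queue.Queue(maxsize=2)
q.put_nowait('a')
q.put_nowait('b')
try:
    q.put_nowait('c')               # same as q.put('c', False)
except queue.Full:
    print('queue is full')

processed = []

def consume():
    while True:
        try:
            item = q.get_nowait()   # same as q.get(False)
        except queue.Empty:
            break                   # nothing left to take
        processed.append(item)
        q.task_done()               # one call per item taken

worker = threading.Thread(target=consume)
worker.start()
q.join()                            # returns once every item is task_done()
worker.join()
print(processed)                    # ['a', 'b']
```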
We’ll use Example 4-12 (prodcons.py) to demonstrate the producer-consumer scenario with
Queue/queue. The following is the output from one execution of this script:
$ prodcons.py
starting writer at: Sun Jun 18 20:27:07 2006
producing object for Q... size now 1
starting reader at: Sun Jun 18 20:27:07 2006
consumed object from Q... size now 0
producing object for Q... size now 1
consumed object from Q... size now 0
producing object for Q... size now 1
producing object for Q... size now 2
producing object for Q... size now 3
consumed object from Q... size now 2
consumed object from Q... size now 1
writer finished at: Sun Jun 18 20:27:17 2006
consumed object from Q... size now 0
reader finished at: Sun Jun 18 20:27:25 2006
all DONE
Example 4-12 Producer-Consumer Problem (prodcons.py)
This implementation of the Producer–Consumer problem uses Queue objects
and a random number of goods produced (and consumed). The producer and
consumer are individually (and concurrently) executing threads.
 1  #!/usr/bin/env python
 2
 3  from random import randint
 4  from time import sleep
 5  from Queue import Queue
 6  from myThread import MyThread
 7
 8  def writeQ(queue):
 9      print 'producing object for Q...',
10      queue.put('xxx', 1)
11      print "size now", queue.qsize()
12
13  def readQ(queue):
14      val = queue.get(1)
15      print 'consumed object from Q... size now', \
16          queue.qsize()
17
18  def writer(queue, loops):
19      for i in range(loops):
20          writeQ(queue)
21          sleep(randint(1, 3))
22
23  def reader(queue, loops):
24      for i in range(loops):
25          readQ(queue)
26          sleep(randint(2, 5))
27
28  funcs = [writer, reader]
29  nfuncs = range(len(funcs))
30
31  def main():
32      nloops = randint(2, 5)
33      q = Queue(32)
34
35      threads = []
36      for i in nfuncs:
37          t = MyThread(funcs[i], (q, nloops),
38              funcs[i].__name__)
39          threads.append(t)
40
41      for i in nfuncs:
42          threads[i].start()
43
44      for i in nfuncs:
45          threads[i].join()
46
47      print 'all DONE'
48
49  if __name__ == '__main__':
50      main()
As you can see, the producer and consumer do not necessarily alternate
in execution. (Thank goodness for random numbers!) Seriously, though,
real life is generally random and non-deterministic.
Line-by-Line Explanation
Lines 1–6
In this module, we use the Queue.Queue object as well as our thread class
myThread.MyThread, seen earlier. We use random.randint() to make production and consumption somewhat varied. (Note that random.randint()
works just like random.randrange() but is inclusive of the upper/end
value.)
Lines 8–16
The writeQ() and readQ() functions each have a specific purpose: to place
an object in the queue—we are using the string 'xxx', for example—and
to consume a queued object, respectively. Notice that we are producing
one object and reading one object each time.
Lines 18–26
The writer() is going to run as a single thread whose sole purpose is to
produce an item for the queue, wait for a bit, and then do it again, up to the
specified number of times, chosen randomly per script execution. The
reader() will do likewise, with the exception of consuming an item, of
course.
You will notice that the random number of seconds that the writer
sleeps is in general shorter than the amount of time the reader sleeps. This
is to discourage the reader from trying to take items from an empty queue.
By giving the writer a shorter time period of waiting, it is more likely that
there will already be an object for the reader to consume by the time their
turn rolls around again.
Lines 28–29
These are just setup lines to set the total number of threads that are to be
spawned and executed.
Lines 31–47
Finally, we have our main() function, which should look quite similar to
the main() in all of the other scripts in this chapter. We create the appropriate threads and send them on their way, finishing up when both threads
have concluded execution.
We infer from this example that a program that has multiple tasks to
perform can be organized to use separate threads for each of the tasks.
This can result in a much cleaner program design than a single-threaded
program that attempts to do all of the tasks.
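For readers on Python 3 who don’t have the myThread module at hand, the same structure can be approximated with the standard library alone. This is a sketch under several assumptions: threading.Thread stands in for MyThread, the sleeps are scaled down from seconds to hundredths of a second, and the run() wrapper and the consumed list are hypothetical additions that let the result be inspected.

```python
import queue
import threading
from random import randint
from time import sleep

def writer(q, loops):
    for i in range(loops):
        q.put('xxx')                    # blocks if the queue is full
        sleep(randint(1, 3) / 100.0)    # scaled down from whole seconds

def reader(q, loops, consumed):
    for i in range(loops):
        consumed.append(q.get())        # blocks until an item arrives
        sleep(randint(2, 5) / 100.0)

def run(nloops):
    """Start one writer and one reader, wait for both, report the outcome."""
    q = queue.Queue(32)
    consumed = []
    threads = [
        threading.Thread(target=writer, args=(q, nloops)),
        threading.Thread(target=reader, args=(q, nloops, consumed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, q.qsize()

consumed, left = run(randint(2, 5))
print(len(consumed), 'items consumed;', left, 'left in queue')
```

Because both threads loop the same number of times and get() blocks, the queue always ends up empty no matter how the two threads interleave.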
In this chapter, we illustrated how a single-threaded process can limit
an application’s performance. In particular, programs with independent,
non-deterministic, and non-causal tasks that execute sequentially can be
improved by division into separate tasks executed by individual threads.
Not all applications will benefit from multithreading due to overhead and
the fact that the Python interpreter is a single-threaded application, but
now you are more cognizant of Python’s threading capabilities and can
use this tool to your advantage when appropriate.
4.9 Alternative Considerations to Threads
Before you rush off and do some threading, let’s do a quick recap: threading in general is a good thing. However, because of the restrictions of the
GIL in Python, threading is more appropriate for I/O-bound applications
(I/O releases the GIL, allowing for more concurrency) than for CPU-bound
applications. In the latter case, to achieve greater parallelism, you’ll need
processes that can be executed by other cores or CPUs.
Without going into too much detail here (some of these topics have
already been covered in the “Execution Environment” chapter of Core
Python Programming or Core Python Language Fundamentals), when looking
at multiple threads or processes, the primary alternatives to the threading
module include:
4.9.1 The subprocess Module
This is the primary alternative when desiring to spawn processes, whether
to purely execute stuff or to communicate with another process via the standard files (stdin, stdout, stderr). It was introduced to Python in version 2.4.
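As a brief illustration of communicating through the standard files, the sketch below starts a child Python interpreter, writes to its stdin, and reads back its stdout. The one-line child program is a hypothetical stand-in for whatever external command you actually need to run; the snippet is written for Python 3.

```python
import subprocess
import sys

# The child reads all of stdin and writes it back in uppercase.
child_code = 'import sys; sys.stdout.write(sys.stdin.read().upper())'

proc = subprocess.Popen(
    [sys.executable, '-c', child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    universal_newlines=True,        # exchange text rather than bytes
)
out, _ = proc.communicate('hello\n')    # send input, wait for the child to exit
print(proc.returncode, out.strip())     # 0 HELLO
```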
4.9.2 The multiprocessing Module
This module, added in Python 2.6, lets you spawn processes for multiple
cores or CPUs but with an interface very similar to that of the threading
module; it also contains various mechanisms to pass data between processes that are cooperating on shared work.
4.9.3 The concurrent.futures Module
This is a new high-level library that operates only at a “job” level, which
means that you no longer have to fuss with synchronization, or managing
threads or processes. You just specify a thread or process pool with a certain number of “workers,” submit jobs, and collate the results. It’s new in
Python 3.2, but a port for Python 2.6+ is available at http://code.google.com/p/pythonfutures.
What would bookrank3.py look like with this change? Assuming everything else stays the same, here’s the new import and modified _main()
function:
    from concurrent.futures import ThreadPoolExecutor
    . . .
    def _main():
        print('At', ctime(), 'on Amazon...')
        with ThreadPoolExecutor(3) as executor:
            for isbn in ISBNs:
                executor.submit(_showRanking, isbn)
        print('all DONE at:', ctime())
The argument given to concurrent.futures.ThreadPoolExecutor is the
thread pool size, and our application is looking for the rankings of three
books. Of course, this is an I/O-bound application for which threads are
more useful. For a CPU-bound application, we would use
concurrent.futures.ProcessPoolExecutor instead.
Once we have an executor (whether threads or processes), which is
responsible for dispatching the jobs and collating the results, we can call
its submit() method to execute what we would have had to spawn a thread
to run previously.
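The snippet above discards what submit() returns. In fact, each call hands back a Future object, which is how results get collated. Here is a small self-contained sketch that avoids the network: slow_square() is a hypothetical stand-in for an I/O-bound job such as fetching a ranking.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

def slow_square(n):
    sleep(0.01 * n)      # stand-in for an I/O-bound call
    return n * n

with ThreadPoolExecutor(3) as executor:
    # submit() returns immediately with a Future for each job
    futures = {executor.submit(slow_square, n): n for n in (3, 1, 2)}
    results = {}
    for future in as_completed(futures):    # yields futures as they finish
        results[futures[future]] = future.result()

print(sorted(results.items()))   # [(1, 1), (2, 4), (3, 9)]
```

Leaving the with block waits for all submitted jobs, which is why no explicit join() calls are needed.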
If we do a “full” port to Python 3 by replacing the string format operator
with the str.format() method, making liberal use of the with statement, and
using the executor’s map() method, we can actually delete _showRanking()
and roll its functionality into _main(). In Example 4-13, you’ll find our final
bookrank3CF.py script.
Example 4-13 Higher-Level Job Management (bookrank3CF.py)
Our friend, the book rank screenscraper, but this time using
concurrent.futures.
 1  #!/usr/bin/env python
 2
 3  from concurrent.futures import ThreadPoolExecutor
 4  from re import compile
 5  from time import ctime
 6  from urllib.request import urlopen as uopen
 7
 8  REGEX = compile(b'#([\d,]+) in Books ')
 9  AMZN = 'http://amazon.com/dp/'
10  ISBNs = {
11      '0132269937': 'Core Python Programming',
12      '0132356139': 'Python Web Development with Django',
13      '0137143419': 'Python Fundamentals',
14  }
15
16  def getRanking(isbn):
17      with uopen('{0}{1}'.format(AMZN, isbn)) as page:
18          return str(REGEX.findall(page.read())[0], 'utf-8')
19
20  def _main():
21      print('At', ctime(), 'on Amazon...')
22      with ThreadPoolExecutor(3) as executor:
23          for isbn, ranking in zip(
24                  ISBNs, executor.map(getRanking, ISBNs)):
25              print('- %r ranked %s' % (ISBNs[isbn], ranking))
26      print('all DONE at:', ctime())
27
28  if __name__ == '__main__':
29      _main()
Line-by-Line Explanation
Lines 1–14
Outside of the new import statement, everything in the first half of this
script is identical to the bookrank3.py file we looked at earlier in this chapter.
Lines 16–18
The new getRanking() uses the with statement and str.format(). You can
make the same change to bookrank.py because both features are available
in version 2.6+ (they are not unique to version 3.x).
Lines 20–26
In the previous code example, we used executor.submit() to spawn the
jobs. Here, we tweak this slightly by using executor.map() because it
allows us to absorb the functionality from _showRanking(), letting us remove
it entirely from our code.
The output is nearly identical to what we’ve seen earlier:
$ python3 bookrank3CF.py
At Wed Apr 6 00:21:50 2011 on Amazon...
- 'Core Python Programming' ranked 43,992
- 'Python Fundamentals' ranked 1,018,454
- 'Python Web Development with Django' ranked 502,566
all DONE at: Wed Apr 6 00:21:55 2011
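Pairing ISBNs with executor.map() through zip() is safe because map() yields results in the order of its inputs, not in the order the jobs happen to finish. The following sketch demonstrates this without touching the network; delayed_upper() and the word list are hypothetical, and the delay simply forces the first job to finish last.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep

def delayed_upper(word):
    sleep(0.03 if word == 'first' else 0.0)   # make the first job finish last
    return word.upper()

words = ['first', 'second', 'third']
with ThreadPoolExecutor(3) as executor:
    pairs = list(zip(words, executor.map(delayed_upper, words)))

print(pairs)   # [('first', 'FIRST'), ('second', 'SECOND'), ('third', 'THIRD')]
```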
You can read more about the concurrent.futures module and its origins at the
links below.
• http://docs.python.org/dev/py3k/library/concurrent.futures.html
• http://code.google.com/p/pythonfutures/
• http://www.python.org/dev/peps/pep-3148/
A summary of these options and other threading-related modules and
packages can be found in the next section.
4.10 Related Modules
Table 4-6 lists some of the modules that you can use when programming
multithreaded applications.
Table 4-6 Threading-Related Standard Library Modules

Module                    Description
thread [a]                Basic, lower-level thread module
threading                 Higher-level threading and synchronization objects
multiprocessing [b]       Spawn/use subprocesses with a “threading” interface
subprocess [c]            Skip threads altogether and execute processes instead
Queue                     Synchronized FIFO queue for multiple threads
mutex [d]                 Mutual exclusion objects
concurrent.futures [e]    High-level library for asynchronous execution
SocketServer              Create/manage threaded TCP or UDP servers

a. Renamed to _thread in Python 3.0.
b. New in Python 2.6.
c. New in Python 2.4.
d. Deprecated in Python 2.6 and removed in version 3.0.
e. New in Python 3.2 (but available outside the standard library for version 2.6+).
4.11 Exercises
4-1. Processes versus Threads. What are the differences between
processes and threads?
4-2. Python Threads. Which type of multithreaded application will
tend to fare better in Python, I/O-bound or CPU-bound?
4-3. Threads. Do you think anything significant happens if you
have multiple threads on a multiple CPU system? How do
you think multiple threads run on these systems?
4-4. Threads and Files.
a) Create a function that obtains a byte value and a filename
(as parameters or user input) and displays the number of
times that byte appears in the file.
b) Suppose now that the input file is extremely large. Multiple readers in a file is acceptable, so modify your solution
to create multiple threads that count in different parts of
the file such that each thread is responsible for a certain
part of the file. Collate the data from each thread and provide the correct total. Use the timeit module to time both
the single-threaded and new multithreaded solutions and
say something about the difference in performance, if
any.
4-5. Threads, Files, and Regular Expressions. You have a very large
mailbox file—if you don’t have one, put all of your e-mail messages together into a single text file. Your job is to take
the regular expressions you designed earlier in this book that
recognize e-mail addresses and Web site URLs and use them
to convert all e-mail addresses and URLs in this large file
into live links so that when the new file is saved as an .html
(or .htm) file, it will show up in a Web browser as live and
clickable. Use threads to segregate the conversion process
across the large text file and collate the results into a single
new .html file. Test the results on your Web browser to
ensure the links are indeed working.
4-6. Threads and Networking. Your solution to the chat service
application in the previous chapter required you to use
heavyweight threads or processes as part of your solution.
Convert your solution to be multithreaded.
4-7. *Threads and Web Programming. The Crawler application in
Chapter 10, “Web Programming: CGI and WSGI,” is a single-threaded application that downloads Web pages. It would
benefit from MT programming. Update crawl.py (you could
call it mtcrawl.py) such that independent threads are used to
download pages. Be sure to use some kind of locking mechanism to prevent conflicting access to the links queue.
4-8. Thread Pools. Instead of a producer thread and a consumer
thread, change the code for prodcons.py in Example 4-12 so
that you have any number of consumer threads (a thread pool)
that can process or consume more than one item from the
Queue at any given moment.
4-9. Files. Create a set of threads to count how many lines there
are in a set of (presumably large) text files. You can choose
the number of threads to use. Compare the performance
against a single-threaded version of this code. Hint: Review
the exercises at the end of Chapter 9 in Core Python
Programming or Core Python Language Fundamentals.
4-10. Concurrent Processing. Take your solution to Exercise 4-9 and
adapt it to a task of your selection, for example, processing a
set of e-mail messages, downloading Web pages, processing
RSS or Atom feeds, enhancing message processing as part of
a chat server, solving a puzzle, etc.
4-11. Synchronization Primitives. Investigate each of the synchronization primitives in the threading module. Describe what
they do, what they might be useful for, and create working
code examples for each.
Chapter 4 • Multithreaded Programming
The next couple of exercises deal with the candy.py script featured in
Example 4-11.
4-12. Porting to Python 3. Take the candy.py script and run the 2to3
tool on it to create a Python 3 version called candy3.py.
4-13. The threading module. Add debugging to the script. Specifically, for applications that use semaphores (whose initial
value is going to be greater than 1), you might need to know
exactly the counter’s value at any given time. Create a variation of candy.py, perhaps calling it candydebug.py, and give it
the ability to display the counter’s value. You will need to
look at the threading.py source code, as alluded to earlier in
the CORE TIP sidebar. Once you’re done with the modifications, you can alter its output to look something like the
following:
$ python candydebug.py
starting at: Mon Apr 4 00:24:28 2011
THE CANDY MACHINE (full with 5 bars)!
Buying candy... inventory: 4
Refilling candy... inventory: 5
Refilling candy... full, skipping
Buying candy... inventory: 4
Buying candy... inventory: 3
Refilling candy... inventory: 4
Buying candy... inventory: 3
Buying candy... inventory: 2
Buying candy... inventory: 1
Buying candy... inventory: 0
Buying candy... empty, skipping
all DONE at: Mon Apr 4 00:24:36 2011
CHAPTER
GUI Programming
GUI stuff is supposed to be hard. It builds character.
—Jim Ahlstrom, May 1995
(verbally at Python Workshop)
In this chapter...
• Introduction
• Tkinter and Python Programming
• Tkinter Examples
• A Brief Tour of Other GUIs
• Related Modules and Other GUIs
Chapter 5 • GUI Programming
In this chapter, we will give you a brief introduction to the subject of
graphical user interface (GUI) programming. If you are somewhat
new to this area or want to learn more about it, or if you want to see
how it is done in Python, then this chapter is for you. We cannot show you
everything about GUI application development in this one chapter, but we
will give you a very solid introduction to it. The primary GUI toolkit we
will be using is Tk, Python’s default GUI. We’ll access Tk from its Python
interface called Tkinter (short for “Tk interface”).
Tk is not the latest and greatest, nor does it have the most robust set of
GUI building blocks, but it is fairly simple to use, and with it, you can
build GUIs that run on most platforms. We will present several simple and
intermediate examples using Tkinter, followed by a few examples using
other toolkits. Once you have completed this chapter, you will have the
skills to build more complex applications and/or move to a more modern
toolkit. Python has bindings or adapters to most of the current major toolkits, including commercial systems.
5.1 Introduction
Before getting started with GUI programming, we first discuss Tkinter as
Python’s default UI toolkit. We begin by looking at installation because
Tkinter is not always on by default (especially when building Python
yourself). This is followed by a quick review of client/server architecture,
which is covered in Chapter 2, “Network Programming,” but has relevance here.
5.1.1 What Are Tcl, Tk, and Tkinter?
Tkinter is Python’s default GUI library. It is based on the Tk toolkit, originally designed for the Tool Command Language (Tcl). Due to Tk’s popularity, it has been ported to a variety of other scripting languages,
including Perl (Perl/Tk), Ruby (Ruby/Tk), and Python (Tkinter). The combination of Tk’s GUI development portability and flexibility, along with the
simplicity of a scripting language integrated with the power of a systems
language, gives you the tools to rapidly design and implement a wide variety
of commercial-quality GUI applications.
If you are new to GUI programming, you will be pleasantly surprised at
how easy it is. You will also find that Python, along with Tkinter, provides
a fast and exciting way to build applications that are fun (and perhaps
useful) and that would have taken much longer if you had to program
directly in C/C++ with the native windowing system’s libraries. Once you
have designed the application and the look and feel that goes along with
your program, you will use basic building blocks known as widgets to
piece together the desired components, and finally, to attach functionality
to “make it real.”
If you are an old hand at using Tk, either with Tcl or Perl, you will find
Python a refreshing way to program GUIs. On top of that, it provides an
even faster rapid prototyping system for building them. Remember that
you also have Python’s system accessibility, networking functionality,
XML, numerical and visual processing, database access, and all the other
standard library and third-party extension modules.
Once you get Tkinter up on your system, it will take less than 15 minutes to get your first GUI application running.
5.1.2 Getting Tkinter Installed and Working
Tkinter is not necessarily turned on by default on your system. You can
determine whether Tkinter is available for your Python interpreter by
attempting to import the Tkinter module (in Python 1 and 2; renamed to
tkinter in Python 3). If Tkinter is available, then no errors occur, as demonstrated in the following:
>>> import Tkinter
>>>
If your Python interpreter was not compiled with Tkinter enabled, the
module import fails:
>>> import Tkinter
Traceback (innermost last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/pythonX.Y/lib-tk/Tkinter.py", line 8, in ?
    import _tkinter # If this fails your Python may not be configured for Tk
ImportError: No module named _tkinter
You might need to recompile your Python interpreter to gain access to
Tkinter. This usually involves editing the Modules/Setup file and then
enabling all the correct settings to compile your Python interpreter with
hooks to Tkinter, or choosing to have Tk installed on your system. Check
the README file for your Python distribution for specific instructions for compiling Tkinter on your system. After compiling the interpreter, be sure that
you start the new Python interpreter; otherwise, it will act just like your old
one without Tkinter (and in fact, it is your old one).
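The import check can also be scripted so that the same test works under both module names. The following is only a minimal sketch: the helper name find_tkinter() is our own invention and not part of any library, and the message it prints is illustrative.

```python
def find_tkinter():
    """Return the Tk interface module, or None if it cannot be imported."""
    try:
        import Tkinter as tk          # module name in Python 1 and 2
    except ImportError:
        try:
            import tkinter as tk      # module name in Python 3
        except ImportError:
            return None               # this interpreter has no Tk support
    return tk

tk = find_tkinter()
if tk is None:
    print('Tkinter is not available; check the README for your distribution')
else:
    print('Tkinter is available (Tk version %s)' % tk.TkVersion)
```

Returning None rather than raising an exception lets a script fall back to a text-only interface when no GUI support is present.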
5.1.3 Client/Server Architecture—Take Two
In Chapter 2, we introduced the concept of client/server computing. A windowing system is another example of a software server. Such servers run on a computer with an attached display, such as a monitor. There are clients,
too: programs that require a windowing environment in which to execute,
also known as GUI applications. Such applications cannot run without a
windowing system.
The architecture becomes even more interesting when networking
comes into play. Usually when a GUI application is executed, it displays to
the computer that it started on (via the windowing server), but it is possible in some networked windowing environments, such as the X Window
system on Unix, to choose another computer’s window server to which the
application displays. Thus, you can be running a GUI program on one
computer, but display it on another.
5.2 Tkinter and Python Programming
In this section, we’ll introduce GUI programming in general, then focus on
how to use Tkinter and its components to build GUIs in Python.
5.2.1 The Tkinter Module: Adding Tk to your Applications
So what do you need to do to have Tkinter as part of your application?
First, it is not necessary to have an application already. You can create a
pure GUI if you want, but it probably isn’t too useful without some underlying software that does something interesting.
There are basically five main steps that are required to get your GUI up
and running:
1. Import the Tkinter module (or from Tkinter import *).
2. Create a top-level windowing object that contains your entire
GUI application.
3. Build all your GUI components (and functionality) on top of
(or within) your top-level windowing object.
4. Connect these GUI components to the underlying application code.
5. Enter the main event loop.
The first step is trivial: all GUIs that use Tkinter must import the
Tkinter module, so you first need access to Tkinter (see Section 5.1.2).
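The five steps above can be sketched in a few lines. This is an illustration under our own assumptions rather than a listing from the examples that follow: make_greeting() and main() are hypothetical names, and the import is placed inside main() only so that the application code can be loaded on a system without Tk.

```python
def make_greeting(name):
    """Underlying application code, kept independent of the GUI."""
    return 'Hello, %s!' % name

def main():
    # Step 1: import the module (its name differs in Python 2 and 3)
    try:
        import Tkinter as tk
    except ImportError:
        import tkinter as tk

    # Step 2: create the top-level windowing object
    top = tk.Tk()

    # Step 3: build the GUI components within the top-level window
    label = tk.Label(top, text='Press the button')
    label.pack()
    button = tk.Button(top, text='Greet')
    button.pack()

    # Step 4: connect a component to the underlying application code
    button.config(
        command=lambda: label.config(text=make_greeting('World')))

    # Step 5: enter the main event loop
    top.mainloop()
```

Calling main() on a system with a display opens the window; nothing happens until then, which keeps make_greeting() usable on its own.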
5.2.2 Introduction to GUI Programming
Before going to the examples, we will give you a brief introduction to GUI
application development. This will provide you with some of the general
background you need to move forward.
Setting up a GUI application is similar to how an artist produces a
painting. Conventionally, there is a single canvas onto which the artist
must put all the work. Here’s how it works: you start with a clean slate, a
“top-level” windowing object on which you build the rest of your components. Think of it as a foundation to a house or the easel for an artist. In
other words, you have to pour the concrete or set up your easel before putting together the actual structure or canvas on top of it. In Tkinter, this
foundation is known as the top-level window object.
Windows and Widgets
In GUI programming, a top-level root windowing object contains all of the
little windowing objects that will be part of your complete GUI application. These can be text labels, buttons, list boxes, etc. These individual little
GUI components are known as widgets. So when we say create a top-level
window, we just mean that you need a place where you put all your widgets. In Python, this would typically look like this line:
top = Tkinter.Tk() # or just Tk() with "from Tkinter import *"
The object returned by Tkinter.Tk() is usually referred to as the root
window, which is why some applications name it root rather than top.
Top-level windows are those that show up stand-alone
as part of your application. You can have more than one top-level window
for your GUI, but only one of them should be your root window. You can
choose to completely design all your widgets first, and then add the real
functionality, or do a little of this and a little of that along the way. (This
means mixing and matching steps 3 and 4 from our list.)
Widgets can be stand-alone or be containers. If a widget contains other
widgets, it is considered the parent of those widgets. Accordingly, if a widget is contained in another widget, it’s considered a child of the parent, the
parent being the next immediate enclosing container widget.
Usually, widgets have some associated behaviors, such as when a button is pressed, or text is filled into a text field. These types of user behaviors are called events, and the GUI’s responses to such events are known as
callbacks.
Event-Driven Processing
Events can include the actual button press (and release), mouse movement, hitting the Return or Enter key, etc. The entire system of events that
occurs from the beginning until the end of a GUI application is what
drives it. This is known as event-driven processing.
One example of an event with a callback is a simple mouse move. Suppose that the mouse pointer is sitting somewhere on top of your GUI
application. If you move the mouse to another part of your application,
something has to cause the movement of the mouse to be replicated by the
cursor on your screen so that it looks as if it is moving according to the
motion of your hand. These are mouse move events that the system must
process to portray your cursor moving across the window. When you release
the mouse, there are no more events to process, so everything just remains
idle on the screen again.
The event-driven processing nature of GUIs fits right in with client/
server architecture. When you start a GUI application, it must perform
some setup procedures to prepare for the core execution, just as a network server must allocate a socket and bind it to a local address. The GUI
application must establish all the GUI components, then draw (a.k.a. render or paint) them to the screen. This is the responsibility of the geometry
manager (more about this in a moment). When the geometry manager has
completed arranging all of the widgets, including the top-level window,
GUI applications enter their server-like infinite loop. This loop runs forever waiting for GUI events, processing them, and then going to wait for
more events to process.
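To make the idea of an event loop with callbacks concrete, here is a toy dispatcher written in plain Python. It is a sketch for illustration only and is not how Tk is implemented: ToyEventLoop, bind(), post(), and run() are names we made up, and a real GUI loop would block waiting for events from the windowing system instead of draining a prefilled queue.

```python
from collections import deque

class ToyEventLoop(object):
    """A tiny stand-in for a GUI main loop (illustration only)."""

    def __init__(self):
        self.handlers = {}       # event name -> list of callbacks
        self.pending = deque()   # events waiting to be processed

    def bind(self, event, callback):
        """Register a callback to run whenever the event occurs."""
        self.handlers.setdefault(event, []).append(callback)

    def post(self, event, data=None):
        """Queue an event, as the windowing system would."""
        self.pending.append((event, data))

    def run(self):
        """Dispatch queued events until 'quit' arrives or none remain."""
        while self.pending:
            event, data = self.pending.popleft()
            if event == 'quit':
                break
            for callback in self.handlers.get(event, []):
                callback(data)

log = []
loop = ToyEventLoop()
loop.bind('motion', lambda pos: log.append('cursor at %r' % (pos,)))
loop.bind('click', lambda pos: log.append('click at %r' % (pos,)))

loop.post('motion', (10, 20))
loop.post('click', (10, 20))
loop.post('quit')
loop.post('motion', (99, 99))    # arrives after quit; never dispatched
loop.run()
print(log)
```

All the program does after setup is wait for events and hand them to whatever callbacks were registered, which is the same shape as a network server's accept-and-handle loop.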
Geometry Managers
Tk has three geometry managers that help with positioning your widgetset.
The original one was called the Placer. It was very straightforward: you
provide the size of the widgets and locations to place them; the manager
then places them for you. The problem is that you have to do this with all
the widgets, burdening the developer with coding that should otherwise
take place automatically.
The second geometry manager, and the main one that you will use, is
the Packer, named appropriately because it packs widgets into the correct
places (namely the containing parent widgets, based on your instruction),
and for every succeeding widget, it looks for any remaining “real estate”
into which to pack the next one. The process is similar to how you would
pack elements into a suitcase when traveling.
A third geometry manager is the Grid. You use the Grid to specify GUI
widget placement, based on grid coordinates. The Grid will render each
object in the GUI in its grid position. For this chapter, we will stick with
the Packer.
Once the Packer has determined the sizes and alignments of all your
widgets, it will then place them on the screen for you.
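Although this chapter stays with the Packer, a short sketch may help show what placement by grid coordinates looks like. The names grid_position() and main() are hypothetical and chosen only for this illustration; the import sits inside main() so that the coordinate helper can be tried without a display. As a rule, a single parent widget should be managed by either the Packer or the Grid, not both.

```python
def grid_position(index, columns):
    """Map a widget's position in a flat list to a (row, column) pair."""
    return divmod(index, columns)

def main():
    try:
        import Tkinter as tk      # Python 2
    except ImportError:
        import tkinter as tk      # Python 3

    top = tk.Tk()
    for i, text in enumerate(['1', '2', '3', '4', '5', '6']):
        row, col = grid_position(i, 3)
        # sticky='ew' stretches each button across its grid cell
        tk.Button(top, text=text).grid(row=row, column=col, sticky='ew')
    top.mainloop()

print(grid_position(4, 3))        # the fifth widget in a three-column grid
```

Calling main() on a system with a display shows the six buttons in two rows of three.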
When all the widgets are in place, we instruct the application to enter
the aforementioned infinite main loop. In Tkinter, the code that does this is:
Tkinter.mainloop()
This is normally the last piece of sequential code your program runs.
When the main loop is entered, the GUI takes over execution from there.
All other actions are handled via callbacks, even exiting your application.
When you select the File menu and then click the Exit menu option or
close the window directly, a callback must be invoked to end your GUI
application.
5.2.3 Top-Level Window: Tkinter.Tk()
We mentioned earlier that all main widgets are built on the top-level window object. This object is created by the Tk class in Tkinter and is instantiated
as follows:
>>> import Tkinter
>>> top = Tkinter.Tk()
Within this window, you place individual widgets or multiple-component
pieces together to form your GUI. So what kinds of widgets are there? We
will now introduce the Tk widgets.
5.2.4 Tk Widgets
At the time of this writing, there were 18 types of widgets in Tk. We describe
these widgets in Table 5-1. The newest of the widgets are LabelFrame,
PanedWindow, and Spinbox, all three of which were added in Python 2.3 (via
Tk 8.4).
Table 5-1 Tk Widgets

Widget         Description
Button         Similar to a Label but provides additional functionality for
               mouse-overs, presses, and releases, as well as keyboard
               activity/events
Canvas         Provides ability to draw shapes (lines, ovals, polygons,
               rectangles); can contain images or bitmaps
Checkbutton    Set of boxes, of which any number can be “checked”
               (similar to HTML checkbox input)
Entry          Single-line text field with which to collect keyboard input
               (similar to HTML text input)
Frame          Pure container for other widgets
Label          Used to contain text or images
LabelFrame     Combo of a label and a frame but with extra label attributes
Listbox        Presents the user with a list of choices from which to choose
Menu           Actual list of choices “hanging” from a Menubutton from
               which the user can choose
Menubutton     Provides infrastructure to contain menus (pulldown,
               cascading, etc.)
Message        Similar to a Label, but displays multiline text
PanedWindow    A container widget with which you can control other
               widgets placed within it
Radiobutton    Set of buttons, of which only one can be “pressed” (similar
               to HTML radio input)
Scale          Linear “slider” widget providing an exact value at current
               setting; with defined starting and ending values
Scrollbar      Provides scrolling functionality to supporting widgets, for
               example, Text, Canvas, Listbox, and Entry
Spinbox        Combination of an entry with a button letting you adjust its
               value
Text           Multiline text field with which to collect (or display) text
               from user (similar to HTML textarea)
Toplevel       Similar to a Frame, but provides a separate window container

We won’t go over the Tk widgets in detail as there is plenty of good documentation available for you to read, either from the Tkinter topics page at
the main Python Web site or the abundant number of Tcl/Tk printed and
online resources (some of which are available in Appendix B, “Reference
Tables”). However, we will present several simple examples to help you
get started.
CORE NOTE: Default arguments are your friend
GUI development really takes advantage of default arguments in Python
because there are numerous default actions in Tkinter widgets. Unless you
know every single option available to you for every single widget that you are
using, it’s best to start out by setting only the parameters that you are aware of
and letting the system handle the rest. These defaults were chosen carefully. If
you do not provide these values, do not worry about your applications appearing odd on the screen. As a general rule, widgets are created with an optimized set of default arguments, and only when you know exactly how to customize
your widgets should you use values other than the defaults.
5.3 Tkinter Examples
Now we’ll look at our first GUI scripts, each introducing another widget
and perhaps showing a different way of using a widget that we’ve looked
at before. Very basic examples lead to more intermediate ones, which have
more relevance to coding GUIs in practice.
5.3.1 Label Widget
In Example 5-1, we present tkhello1.py, which is the Tkinter version of
“Hello World!” In particular, it shows you how a Tkinter application is set
up and highlights the Label widget.
Example 5-1
Label Widget Demo (tkhello1.py)
Our first Tkinter example is—well, what else could it be but “Hello World!”? In
particular, we introduce our first widget: the Label.
#!/usr/bin/env python

import Tkinter

top = Tkinter.Tk()                                # top-level (root) window
label = Tkinter.Label(top, text='Hello World!')   # widget inside it
label.pack()                                      # let the Packer place it
Tkinter.mainloop()                                # enter the event loop
After importing Tkinter, we create our top-level window. That is followed by our
Label widget, which contains the all-too-famous string. We instruct the
Packer to manage and display our widget, and then finally call mainloop()
to run our GUI application. Figure 5-1 shows what you will see when you
run this GUI application.
Figure 5-1 The Tkinter Label widget (shown on Unix with twm and on Windows).
5.3.2 The Button Widget
The next example (tkhello2.py) is pretty much the same as the first. However, instead of a simple text label, we will create a button. Example 5-2
presents the source code.
Example 5-2
Button Widget Demo (tkhello2.py)
This example is exactly the same as tkhello1.py, except that rather than using a
Label widget, we create a Button widget.
#!/usr/bin/env python

import Tkinter

top = Tkinter.Tk()
quit = Tkinter.Button(top, text='Hello World!',
    command=top.quit)     # callback: pressing the button ends the main loop
quit.pack()
Tkinter.mainloop()
The first few lines are identical. Things differ only when we create the
Button widget. Our button has one additional parameter, command, which we
set to the top.quit() method. This installs a callback to our button so that if it is pressed (and
released), the entire application will exit. The final two lines are the usual
pack() and invocation of the mainloop(). This simple button application is
shown in Figure 5-2.
Figure 5-2 The Tkinter Button widget (shown on Unix and on Windows).
5.3.3 The Label and Button Widgets
In Example 5-3, we combine tkhello1.py and tkhello2.py into tkhello3.py, a
script that has both a label and a button. In addition, we are providing
more parameters now than before when we were comfortable using all of
the default arguments that are automatically set for us.
Example 5-3
Label and Button Widget Demo (tkhello3.py)
This example features both a Label and a Button widget. Rather than
primarily using default arguments when creating the widget, we are able
to specify additional parameters now that we know more about Button
widgets and how to configure them.
#!/usr/bin/env python

import Tkinter
top = Tkinter.Tk()

hello = Tkinter.Label(top, text='Hello World!')
hello.pack()

quit = Tkinter.Button(top, text='QUIT',
    command=top.quit, bg='red', fg='white')
quit.pack(fill=Tkinter.X, expand=1)

Tkinter.mainloop()
Besides additional parameters for the widgets, we also see some arguments for the Packer. The fill parameter tells it to let the QUIT button take
up the rest of the horizontal real estate, and the expand parameter directs
it to visually fill out the entire horizontal landscape, stretching the button
to the left and right sides of the window.
As you can see in Figure 5-3, without any other instructions to the
Packer, the widgets are placed vertically (on top of each other). Horizontal
placement requires creating a new Frame object with which to add the
buttons. That frame will take the place of the parent object as a single
child object (see the buttons in the listdir.py module, [Example 5-6] in
Section 5.3.6).
Figure 5-3 Tkinter Label and Button widgets, together (shown on Unix and on Windows).
5.3.4 Label, Button, and Scale Widgets
Our final trivial example, tkhello4.py, involves the addition of a Scale
widget. In particular, the Scale is used to interact with the Label widget.
The Scale slider is a tool that controls the size of the text font in the Label
widget. The greater the slider position, the larger the font, and vice versa.
The code for tkhello4.py is presented in Example 5-4.
Example 5-4
Label, Button, and Scale Demonstration (tkhello4.py)
Our final introductory widget example introduces the Scale widget and
highlights how widgets can “communicate” with each other by using callbacks
(such as resize()). The text in the Label widget is affected by actions taken on
the Scale widget.
1   #!/usr/bin/env python
2
3   from Tkinter import *
4
5   def resize(ev=None):
6       label.config(font='Helvetica -%d bold' % \
7           scale.get())
8
9   top = Tk()
10  top.geometry('250x150')
11
12  label = Label(top, text='Hello World!',
13      font='Helvetica -12 bold')
14  label.pack(fill=Y, expand=1)
15
16  scale = Scale(top, from_=10, to=40,
17      orient=HORIZONTAL, command=resize)
18  scale.set(12)
19  scale.pack(fill=X, expand=1)
20
21  quit = Button(top, text='QUIT',
22      command=top.quit, activeforeground='white',
23      activebackground='red')
24  quit.pack()
25
26  mainloop()
New features of this script include a resize() callback function (lines 5–7),
which is attached to the Scale. This is the code that is activated when the
slider on the Scale is moved, changing the size of the text in the Label.
We also define the size (250 × 150) of the top-level window (line 10). The
final difference between this script and the first three is that we import the
attributes from the Tkinter module into our namespace by using from
Tkinter import *. Although this is not recommended because it “pollutes” your namespace, we do it here mainly because this application
involves a great number of references to Tkinter attributes. This would
require the use of their fully qualified names for each and every attribute
access. By using the undesired shortcut, we are able to access attributes
with less typing and have code that is easier to read, at some cost.
As you can see in Figure 5-4, both the slider mechanism as well as the
current set value show up in the main part of the window. The figure also
shows the state of the GUI after the user moves the scale/slider to a
value of 36. Notice in the code that the initial setting for the scale when the
application starts is 12 (line 18).
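The only computation inside resize() is the construction of a Tk font description string from the slider value. The sketch below isolates that step so it can be tried without a GUI; font_spec() is a hypothetical helper name, not something defined in tkhello4.py. In Tk font descriptions, a negative size is measured in pixels and a positive one in points, which is why the script writes a minus sign before the number.

```python
def font_spec(size, family='Helvetica', weight='bold'):
    """Build a Tk font description such as 'Helvetica -12 bold'.

    The leading minus sign asks Tk for a size in pixels, not points.
    """
    return '%s -%d %s' % (family, size, weight)

# The Scale in tkhello4.py runs from 10 to 40 and starts at 12
for size in (10, 12, 36, 40):
    print(font_spec(size))
```

With such a helper, the body of the callback would reduce to label.config(font=font_spec(scale.get())).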
Figure 5-4 Tkinter Label, Button, and Scale widgets (shown on Unix and on Windows).
5.3.5 Partial Function Application Example
Before looking at a longer GUI application, we want to review the Partial
Function Application (PFA), as introduced in Core Python Programming or
Core Python Language Fundamentals.
PFAs were added to Python in version 2.5 and are one piece in a series
of significant improvements in functional programming. Using PFAs, you
can cache function parameters by effectively “freezing” those predetermined arguments, and then at runtime, when you have the remaining
arguments you need, you can thaw them out, send in the final arguments,
and have that function called with all parameters.
Best of all, PFAs are not limited to just functions. They will work with
any “callable,” that is, any object that has a functional interface and can be
invoked with parentheses, including classes, methods, and callable instances. The
use of PFAs fits perfectly into a situation in which there are many callables
and many of the calls feature the same arguments over and over again.
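A brief, GUI-free sketch of the freezing behavior just described follows. functools.partial() is the standard library call; the Sign class and the names base_two and WarningSign are made up for this illustration.

```python
from functools import partial

# Freeze a keyword argument of a built-in callable
base_two = partial(int, base=2)
print(base_two('10010'))          # same as int('10010', base=2)

class Sign(object):
    """Hypothetical class standing in for a GUI widget."""
    def __init__(self, text, fg='black', bg='white'):
        self.text, self.fg, self.bg = text, fg, bg

# PFAs work with classes too: freeze the background color
WarningSign = partial(Sign, bg='goldenrod')

sign = WarningSign('Merging Traffic')        # supply the rest at call time
print('%s: %s on %s' % (sign.text, sign.fg, sign.bg))

# A frozen keyword argument can still be overridden in the final call
override = WarningSign('Detour', bg='orange')
print(override.bg)
```

Note that partial() does not call anything itself; it returns a new callable that remembers the frozen arguments until it is invoked.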
GUI programming makes a great use case, because there is good probability that you want some consistency in GUI widget look-and-feel, and
this consistency comes about when the same parameters are used to create
similar objects. We are now going to present an application in which multiple buttons will have the same foreground and background colors. It
would be a waste of typing to give the same arguments to the same instantiators every time we wanted a slightly different button: the foreground
and background colors are the same, but the text is slightly different.
We are going to use traffic road signs as our example, with our application attempting to create textual versions of road signs by dividing them
up into various categories of sign types, such as critical, warning, or informational (just like logging levels). The sign type determines the color
scheme when they are created. For example, critical signs have the text in
bright red with a white background; warning signs are in black text on a
goldenrod background; and informational or regulatory signs feature
black text on a white background. We have the “Do Not Enter” and
“Wrong Way” signs, which are both critical, plus “Merging Traffic” and
“Railroad Crossing,” both of which are warnings. Finally, we have the regulatory “Speed Limit” and “One Way” signs.
The application in Example 5-5 creates the signs, which are just buttons.
When users press a button, it displays the corresponding Tk dialog
(critical/error, warning, or informational) in a pop-up window. It is not too
exciting, but how the buttons are built is.
Example 5-5
Road Signs PFA GUI Application (pfaGUI2.py)
Create road signs with the appropriate foreground and background colors,
based on sign type. Use PFAs to help “templatize” common GUI parameters.
1   #!/usr/bin/env python
2
3   from functools import partial as pto
4   from Tkinter import Tk, Button, X
5   from tkMessageBox import showinfo, showwarning, showerror
6
7   WARN = 'warn'
8   CRIT = 'crit'
9   REGU = 'regu'
10
11  SIGNS = {
12      'do not enter': CRIT,
13      'railroad crossing': WARN,
14      '55\nspeed limit': REGU,
15      'wrong way': CRIT,
16      'merging traffic': WARN,
17      'one way': REGU,
18  }
19
20  critCB = lambda: showerror('Error', 'Error Button Pressed!')
21  warnCB = lambda: showwarning('Warning',
22      'Warning Button Pressed!')
23  infoCB = lambda: showinfo('Info', 'Info Button Pressed!')
24
25  top = Tk()
26  top.title('Road Signs')
27  Button(top, text='QUIT', command=top.quit,
28      bg='red', fg='white').pack()
29
30  MyButton = pto(Button, top)
31  CritButton = pto(MyButton, command=critCB, bg='white', fg='red')
32  WarnButton = pto(MyButton, command=warnCB, bg='goldenrod1')
33  ReguButton = pto(MyButton, command=infoCB, bg='white')
34
35  for eachSign in SIGNS:
36      signType = SIGNS[eachSign]
37      cmd = '%sButton(text=%r%s).pack(fill=X, expand=True)' % (
38          signType.title(), eachSign,
39          '.upper()' if signType == CRIT else '.title()')
40      eval(cmd)
41
42  top.mainloop()
When you execute this application, you will see a GUI that will look
something like Figure 5-5.
Figure 5-5 The Road signs PFA GUI application on XDarwin in Mac OS X.
Line-by-Line Explanation
Lines 1–18
We begin our application by importing functools.partial(), a few Tkinter
attributes, and the Tk dialogs (lines 1–5). Next, we define some signs along
with their categories (lines 7–18).
Lines 20–28
The Tk dialogs are assigned as button callbacks, which we will use for
each button created (lines 20–23). We then launch Tk, set the title, and create
a QUIT button (lines 25–28).
Lines 30–33
These lines represent our PFA magic. We use two levels of PFA. The first
templatizes the Button class and the root window top. This means that
every time we call MyButton, it will call Button (Tkinter.Button() creates
a button) with top as its first argument. We have frozen this into MyButton.
The second level of PFA is where we use our first one, MyButton, and
templatize that. We create separate button types for each of our sign categories. When users create a critical button CritButton (by calling it, for
example, CritButton()), it will then call MyButton along with the appropriate button callback and background and foreground colors, which
means calling Button with top, callback, and colors. You can see how it
unwinds and goes down the layers until at the very bottom, it has the call
that you would have originally had to make if this feature did not exist
yet. We repeat with WarnButton and ReguButton.
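The unwinding through the two PFA levels can be observed without opening a window by substituting a stand-in for the Button class. In this sketch, fake_button() is a hypothetical function that simply reports the arguments it finally receives, and the string 'top' stands in for the root window.

```python
from functools import partial

def fake_button(parent, **options):
    """Stand-in for Tkinter.Button: return what it was called with."""
    return parent, options

# First level: freeze the parent window
MyButton = partial(fake_button, 'top')

# Second level: freeze the color scheme for one sign category
CritButton = partial(MyButton, bg='white', fg='red')

# The final call supplies only what differs from button to button
parent, options = CritButton(text='WRONG WAY')
print(parent)
print(sorted(options.items()))
```

The innermost callable ends up receiving the parent, the colors, and the text together, exactly as if all of them had been passed in a single call.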
Lines 35–42
With the setup completed, we look at our list of signs and create them. We
put together a string that Python can evaluate, consisting of the correct button name, pass in the button label as the text argument, and pack() it. If it is
a critical sign, then we capitalize the button text; otherwise, we titlecase it.
This last bit is done in line 39, demonstrating another feature introduced in
Python 2.5, the ternary/conditional operator. Each button is instantiated
with eval(), resulting in what is shown in Figure 5-5. Finally, we start the
GUI by entering the main event loop.
You can easily replace the use of the ternary operator with the old “and/
or” syntax if running with version 2.4 or older, but functools.partial() is
a more difficult feature to replicate, so we recommend you use version 2.5
or newer with this example application.
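Two points from this discussion can be sketched without a GUI. First, the string passed to eval() can be avoided by looking up the right callable in a dictionary keyed by sign type. Second, the older “and/or” idiom can stand in for the conditional expression. This is an alternative sketch and not the approach taken in pfaGUI2.py; make_sign(), BUTTONS, label_for(), and label_for_old() are hypothetical names, and make_sign() returns a string in place of creating a real button.

```python
from functools import partial

CRIT, WARN, REGU = 'crit', 'warn', 'regu'
SIGNS = {
    'do not enter': CRIT,
    'railroad crossing': WARN,
    'one way': REGU,
}

def make_sign(kind, text):
    """Stand-in for a button constructor; returns a description."""
    return '%s: %s' % (kind, text)

# One frozen "constructor" per sign type, selected by key instead of eval()
BUTTONS = {
    CRIT: partial(make_sign, 'critical'),
    WARN: partial(make_sign, 'warning'),
    REGU: partial(make_sign, 'regulatory'),
}

def label_for(sign, sign_type):
    # Conditional expression, available in Python 2.5 and newer
    return sign.upper() if sign_type == CRIT else sign.title()

def label_for_old(sign, sign_type):
    # "and/or" idiom for 2.4 and older; the one-element lists keep an
    # empty string from being mistaken for a false condition
    return (sign_type == CRIT and [sign.upper()] or [sign.title()])[0]

results = sorted(BUTTONS[t](label_for(s, t)) for s, t in SIGNS.items())
for line in results:
    print(line)
```

The dictionary lookup keeps the set of button types explicit and avoids evaluating code built from strings.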
5.3.6 Intermediate Tkinter Example
We conclude this section with a larger script, listdir.py, which is presented in Example 5-6. This application is a directory tree traversal tool. It
starts in the current directory and provides a file listing. Double-clicking any
other directory in the list causes the tool to change to the new directory as well
as replace the original file listing with the files from the new directory.
Example 5-6
File System Traversal GUI (listdir.py)
This slightly more advanced GUI expands on the use of widgets, adding
listboxes, text entry fields, and scrollbars to our repertoire. There are also a good
number of callbacks such as mouse clicks, key presses, and scrollbar action.
  1  #!/usr/bin/env python
  2
  3  import os
  4  from time import sleep
  5  from Tkinter import *
  6
  7  class DirList(object):
  8
  9      def __init__(self, initdir=None):
 10          self.top = Tk()
 11          self.label = Label(self.top,
 12              text='Directory Lister v1.1')
 13          self.label.pack()
 14
 15          self.cwd = StringVar(self.top)
 16
 17          self.dirl = Label(self.top, fg='blue',
 18              font=('Helvetica', 12, 'bold'))
 19          self.dirl.pack()
 20
 21          self.dirfm = Frame(self.top)
 22          self.dirsb = Scrollbar(self.dirfm)
 23          self.dirsb.pack(side=RIGHT, fill=Y)
 24          self.dirs = Listbox(self.dirfm, height=15,
 25              width=50, yscrollcommand=self.dirsb.set)
 26          self.dirs.bind('<Double-1>', self.setDirAndGo)
 27          self.dirsb.config(command=self.dirs.yview)
 28          self.dirs.pack(side=LEFT, fill=BOTH)
 29          self.dirfm.pack()
 30
 31          self.dirn = Entry(self.top, width=50,
 32              textvariable=self.cwd)
 33          self.dirn.bind('<Return>', self.doLS)
 34          self.dirn.pack()
 35
 36          self.bfm = Frame(self.top)
 37          self.clr = Button(self.bfm, text='Clear',
 38              command=self.clrDir,
 39              activeforeground='white',
 40              activebackground='blue')
 41          self.ls = Button(self.bfm,
 42              text='List Directory',
 43              command=self.doLS,
 44              activeforeground='white',
 45              activebackground='green')
 46          self.quit = Button(self.bfm, text='Quit',
 47              command=self.top.quit,
 48              activeforeground='white',
 49              activebackground='red')
 50          self.clr.pack(side=LEFT)
 51          self.ls.pack(side=LEFT)
 52          self.quit.pack(side=LEFT)
 53          self.bfm.pack()
 54
 55          if initdir:
 56              self.cwd.set(os.curdir)
 57              self.doLS()
 58
 59      def clrDir(self, ev=None):
 60          self.cwd.set('')
 61
 62      def setDirAndGo(self, ev=None):
 63          self.last = self.cwd.get()
 64          self.dirs.config(selectbackground='red')
 65          check = self.dirs.get(self.dirs.curselection())
 66          if not check:
 67              check = os.curdir
 68          self.cwd.set(check)
 69          self.doLS()
 70
 71      def doLS(self, ev=None):
 72          error = ''
 73          tdir = self.cwd.get()
 74          if not tdir: tdir = os.curdir
 75
 76          if not os.path.exists(tdir):
 77              error = tdir + ': no such file'
 78          elif not os.path.isdir(tdir):
 79              error = tdir + ': not a directory'
 80
 81          if error:
 82              self.cwd.set(error)
 83              self.top.update()
 84              sleep(2)
 85              if not (hasattr(self, 'last') \
 86                  and self.last):
 87                  self.last = os.curdir
 88              self.cwd.set(self.last)
 89              self.dirs.config(\
 90                  selectbackground='LightSkyBlue')
 91              self.top.update()
 92              return
 93
 94          self.cwd.set(\
 95              'FETCHING DIRECTORY CONTENTS...')
 96          self.top.update()
 97          dirlist = os.listdir(tdir)
 98          dirlist.sort()
 99          os.chdir(tdir)
100          self.dirl.config(text=os.getcwd())
101          self.dirs.delete(0, END)
102          self.dirs.insert(END, os.curdir)
103          self.dirs.insert(END, os.pardir)
104          for eachFile in dirlist:
105              self.dirs.insert(END, eachFile)
106          self.cwd.set(os.curdir)
107          self.dirs.config(\
108              selectbackground='LightSkyBlue')
109
110  def main():
111      d = DirList(os.curdir)
112      mainloop()
113
114  if __name__ == '__main__':
115      main()
In Figure 5-6, we present what this GUI looks like on a Windows-based
PC. The POSIX UI screenshot of this application is shown in Figure 5-7.
Line-by-Line Explanation
Lines 1–5
These first few lines contain the usual Unix startup line and importation of
the os module, the time.sleep() function, and all attributes of the Tkinter
module.
Lines 9–13
These lines define the constructor for the DirList class, an object that
represents our application. The first Label we create contains the main title
of the application and the version number.
Lines 15–19
We declare a Tk variable named cwd to hold the name of the directory we
are in; we will see where this comes in handy later. Another Label is
created to display the name of the current directory.
Figure 5-6 Our List directory GUI application as it appears in Windows.
Lines 21–29
This section defines the core part of our GUI, the Listbox dirs, which contains the list of files of the directory that is being listed. A Scrollbar is
employed to allow the user to move through a listing if the number of files
exceeds the size of the Listbox. Both of these widgets are contained in a
Frame widget. Listbox entries have a callback (setDirAndGo) tied to them
by using the Listbox bind() method.
Binding means to tie a keystroke, mouse action, or some other event to a
callback to be executed when such an event is generated by the user.
setDirAndGo() will be called if any item in the Listbox is double-clicked.
The Scrollbar is tied to the Listbox by calling the Scrollbar.config()
method.
Figure 5-7 The List directory GUI application as it appears in Unix.
Lines 31–34
We then create a text Entry field in which the user can enter the name of the
directory to traverse and have its files listed in the Listbox. We add
a Return or Enter key binding to this text entry field so that the user can
press Return as an alternative to clicking a button. The same applies to
the mouse binding we saw earlier in the Listbox: when the user double-clicks a Listbox item, it has the same effect as entering the directory name
manually into the text Entry field and then clicking the List Directory button.
Lines 36–53
We then define a Button frame (bfm) to hold our three buttons: a “clear”
button (clr), a “go” button (ls, labeled List Directory), and a “quit” button (quit). Each button
has its own configuration and its own callback, which runs when the button is pressed.
Lines 55–57
The final part of the constructor initializes the GUI program, starting with
the current working directory.
Lines 59–60
The clrDir() method clears the cwd Tk string variable, which contains the
current active directory. This variable is used to keep track of what directory we are in and, more important, helps keep track of the previous directory in case errors arise. You will notice the ev variables in the callback
functions with a default value of None. Any such values would be passed
in by the windowing system. They might or might not be used in your
callback.
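The reason for the default value can be shown with a short sketch. A Button's command option calls its callback with no arguments, whereas a callback registered with bind() receives an event object, so one method can serve both only if the event parameter is optional. Handler and FakeEvent below are hypothetical names; FakeEvent merely stands in for the object the windowing system would pass.

```python
class Handler(object):
    def __init__(self):
        self.calls = []

    def refresh(self, ev=None):
        # command= invokes refresh(); bind() invokes refresh(event).
        # The default value lets one method serve both.
        self.calls.append('command' if ev is None else 'event')

class FakeEvent(object):
    """Hypothetical stand-in for the event object passed to bound callbacks."""

h = Handler()
h.refresh()             # as a Button command would call it
h.refresh(FakeEvent())  # as a bound event would call it
print(h.calls)
```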
Lines 62–69
The setDirAndGo() method sets the directory to which to traverse and
issues the call to the method that makes it all happen, doLS().
Lines 71–108
doLS() is, by far, the key to this entire GUI application. It performs all the
safety checks (e.g., is the destination a directory and does it exist?). If there
is an error, the last directory is reset to be the current directory. If all goes
well, it calls os.listdir() to get the actual set of files and replaces the
listing in the Listbox. While the background work is going on to pull in
the information from the new directory, the highlighted blue bar becomes
bright red. When the new directory has been installed, it reverts to blue.
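The non-GUI core of doLS() can be isolated as two small functions, which makes the checks easy to try out on their own. This is a sketch under assumptions: the names check_dir() and list_entries() are hypothetical and do not appear in listdir.py, although the error messages and the ordering of entries mirror the listing.

```python
import os

def check_dir(path):
    """Return an error message, or '' if path is an existing directory."""
    if not os.path.exists(path):
        return path + ': no such file'
    if not os.path.isdir(path):
        return path + ': not a directory'
    return ''

def list_entries(path):
    """Return the current and parent directory entries, then sorted names."""
    return [os.curdir, os.pardir] + sorted(os.listdir(path))

if not check_dir(os.curdir):
    print(list_entries(os.curdir)[:2])
```

Keeping the checks separate from the widgets also means they can be exercised without a display.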
Lines 110–115
The last pieces of code in listdir.py represent the main part of the code.
main() is executed only if this script is invoked directly; when main() runs,
it creates the GUI application, and then calls mainloop() to start the GUI,
which is passed control of the application.
We leave all other aspects of the application as an exercise for you to
undertake. It is easier to view the entire application as
a combination of a set of widgets and functionality: if you see the individual pieces clearly, then the entire script will not appear as daunting.
We hope that we have given you a good introduction to GUI programming with Python and Tkinter. Remember that the best way to become
familiar with Tkinter programming is by practicing and stealing a few
examples! The Python distribution comes with a large number of demonstration applications that you can study.
If you download the source code, you will find Tkinter demonstration
code in Lib/lib-tk, Lib/idlelib, and Demo/tkinter. If you have installed
the Win32 version of Python in C:\Python2x, then you can get access
to the demonstration code in Lib\lib-tk and Lib\idlelib. The latter
directory contains the most significant sample Tkinter application: the
IDLE IDE itself. For further reference, there are several books on Tk programming, one specifically on Tkinter.
5.4 A Brief Tour of Other GUIs
We hope to eventually develop an independent chapter on general GUI
development that makes use of the abundant number of graphical toolkits
that exist under Python, but alas, that is for the future. As a proxy, we
would like to present a single, simple GUI application written by using
four of the more popular toolkits: Tix (Tk Interface eXtensions), Pmw
(Python MegaWidgets Tkinter extension), wxPython (Python binding to
wxWidgets), and PyGTK (Python binding to GTK+). The final example
demonstrates how to use Tile/Ttk—in both Python 2 and 3. You can find
links to more information and/or download these toolkits in the reference
section at the end of this chapter.
The Tix module is already available in the Python Standard Library.
You must download the others, which are third party. Since Pmw is just an
extension to Tkinter, it is the easiest to install (just extract it into your site-packages directory). wxPython and PyGTK involve downloading more than one file and
building them (unless you opt for the Win32 versions, for which binaries are usually available). Once the toolkits are installed and verified, we can begin.
Rather than just sticking with the widgets we’ve already seen in this chapter, we’d like to introduce a few more complex widgets for these examples.
In addition to the Label and Button widgets, we would like to introduce
the Control (or SpinButton) and the ComboBox. The Control widget combines a text widget holding a value with a nearby set of arrow buttons that “control” the value by “spinning”
it up or down. The ComboBox is usually a
text widget plus a pulldown menu of options; the currently active or
selected item in the list is displayed in the text widget.
Our application is fairly basic: pairs of animals are being moved
around, and the number of total animals can range from a pair to a maximum of a dozen. The Control is used to keep track of the total number,
while the ComboBox is a menu containing the various types of animals that can
be selected. In Figure 5-8, each image shows the state of the GUI application
immediately after launching. Note that the default number of animals is
two, and no animal type has been selected yet.
Things are different once we start to play around with the application,
as evidenced in Figure 5-9, which shows some of the elements after we
have modified them in the Tix application.
Figure 5-8 Application using various GUIs under Win32 (panels: Tix, PyGTK, wxPython, Pmw).
Figure 5-9 The Tix GUI modified version of our application.
You can view the code for all four versions of our GUI in Examples 5-7
through 5-10. Example 5-11, which uses Tile/Ttk (the code is supported in
Python 2 and 3) supersedes these first four examples. You will note that
although relatively similar, each one differs in its own special way. Also,
we use the .pyw extension to suppress DOS command or terminal window
pop-ups.
5.4.1 Tk Interface eXtensions (Tix)
We start with Example 5-7, which uses the Tix module. Tix is an extension
library for Tcl/Tk that adds many new widgets, image types, and other
commands that keep Tk a viable GUI development toolkit. Let’s take a
look at how to use Tix with Python.
Example 5-7
Tix GUI Demo (animalTix.pyw)
Our first example uses the Tix module. Tix comes with Python!
 1  #!/usr/bin/env python
 2
 3  from Tkinter import Label, Button, END
 4  from Tix import Tk, Control, ComboBox
 5
 6  top = Tk()
 7  top.tk.eval('package require Tix')
 8
 9  lb = Label(top,
10      text='Animals (in pairs; min: pair, max: dozen)')
11  lb.pack()
12
13  ct = Control(top, label='Number:',
14      integer=True, max=12, min=2, value=2, step=2)
15  ct.label.config(font='Helvetica -14 bold')
16  ct.pack()
17
18  cb = ComboBox(top, label='Type:', editable=True)
19  for animal in ('dog', 'cat', 'hamster', 'python'):
20      cb.insert(END, animal)
21  cb.pack()
22
23  qb = Button(top, text='QUIT',
24      command=top.quit, bg='red', fg='white')
25  qb.pack()
26
27  top.mainloop()
Line-by-Line Explanation
Lines 1–7
This is all the setup code, module imports, and basic GUI infrastructure.
Line 7 asserts that the Tix module is available to the application.
Lines 8–27
These lines create all the widgets: Label (lines 9–11), Control (lines 13–16),
ComboBox (lines 18–21), and quit Button (lines 23–25). The constructors and
arguments for the widgets are fairly self-explanatory and do not require
elaboration. Finally, we enter the main GUI event loop in line 27.
5.4.2 Python MegaWidgets (PMW)
Next we take a look at Python MegaWidgets (shown in Example 5-8). This
module was created to address the aging Tkinter. It basically helps to
extend its longevity by adding more modern widgets to the GUI palette.
Example 5-8
Pmw GUI Demo (animalPmw.pyw)
Our second example uses the Python MegaWidgets package.
 1  #!/usr/bin/env python
 2
 3  from Tkinter import Button, END, Label, W
 4  from Pmw import initialise, ComboBox, Counter
 5
 6  top = initialise()
 7
 8  lb = Label(top,
 9      text='Animals (in pairs; min: pair, max: dozen)')
10  lb.pack()
11
12  ct = Counter(top, labelpos=W, label_text='Number:',
13      datatype='integer', entryfield_value=2,
14      increment=2, entryfield_validate={'validator':
15      'integer', 'min': 2, 'max': 12})
16  ct.pack()
17
18  cb = ComboBox(top, labelpos=W, label_text='Type:')
19  for animal in ('dog', 'cat', 'hamster', 'python'):
20      cb.insert(END, animal)
21  cb.pack()
22
23  qb = Button(top, text='QUIT',
24      command=top.quit, bg='red', fg='white')
25  qb.pack()
26
27  top.mainloop()
The Pmw example is so similar to our Tix example that we leave line-by-line analysis to the reader. The line of code that differs the most is the constructor for the control widget, the Pmw Counter, which provides for entry validation. Instead of specifying the smallest and largest possible values as
keyword arguments to the widget constructor, Pmw uses a “validator” to
ensure that the values do not fall outside our accepted range.
Tix and Pmw are extensions to Tk and Tkinter, respectively, but now we
are going to leave the Tk world behind and change gears to look at completely different toolkits: wxWidgets and GTK+. You will notice that the
number of lines of code starts to increase as we start programming in a
more object-oriented way with these more modern and robust GUI toolkits.
5.4.3 wxWidgets and wxPython
wxWidgets (formerly known as wxWindows) is a cross-platform toolkit
that you can use to build graphical user applications. It is implemented in
C++ and is available on a wide range of platforms, for which wxWidgets defines a consistent and common application programming interface
(API). The best part of all is that wxWidgets uses the native GUI on each
platform, so your program will have the same look-and-feel as all the
other applications on your desktop. Another feature is that you are not
restricted to developing wxWidgets applications in C++; there are interfaces to both Python and Perl. Example 5-9 shows our animal application
using wxPython.
Example 5-9
wxPython GUI Demo (animalWx.pyw)
Our third example uses wxPython (and wxWidgets). Note that we have placed
all of our widgets inside a “sizer” for organization. Also, take note of the more
object-oriented nature of this application.
 1  #!/usr/bin/env python
 2
 3  import wx
 4
 5  class MyFrame(wx.Frame):
 6      def __init__(self, parent=None, id=-1, title=''):
 7          wx.Frame.__init__(self, parent, id, title,
 8              size=(200, 140))
 9          top = wx.Panel(self)
10          sizer = wx.BoxSizer(wx.VERTICAL)
11          font = wx.Font(9, wx.SWISS, wx.NORMAL, wx.BOLD)
12          lb = wx.StaticText(top, -1,
13              'Animals (in pairs; min: pair, max: dozen)')
14          sizer.Add(lb)
15
16          c1 = wx.StaticText(top, -1, 'Number:')
17          c1.SetFont(font)
18          ct = wx.SpinCtrl(top, -1, '2', min=2, max=12)
19          sizer.Add(c1)
20          sizer.Add(ct)
21
22          c2 = wx.StaticText(top, -1, 'Type:')
23          c2.SetFont(font)
24          cb = wx.ComboBox(top, -1, '',
25              choices=('dog', 'cat', 'hamster','python'))
26          sizer.Add(c2)
27          sizer.Add(cb)
28
29          qb = wx.Button(top, -1, "QUIT")
30          qb.SetBackgroundColour('red')
31          qb.SetForegroundColour('white')
32          self.Bind(wx.EVT_BUTTON,
33              lambda e: self.Close(True), qb)
34          sizer.Add(qb)
35
36          top.SetSizer(sizer)
37          self.Layout()
38
39  class MyApp(wx.App):
40      def OnInit(self):
41          frame = MyFrame(title="wxWidgets")
42          frame.Show(True)
43          self.SetTopWindow(frame)
44          return True
45
46  def main():
47      app = MyApp()
48      app.MainLoop()
49
50  if __name__ == '__main__':
51      main()
Line-by-Line Explanation
Lines 5–37
Here we define a Frame subclass (lines 5–8), of which the sole member is
the constructor. This method’s only purpose in life is to create our widgets.
Inside the frame, we have a Panel. Inside the panel we use a BoxSizer to
contain and lay out all of our widgets (lines 10, 36), which consist of a
Label (lines 12–14), SpinCtrl (lines 16–20), ComboBox (lines 22–27), and quit
Button (lines 29–34).
We have to manually add Labels to the SpinCtrl and ComboBox widgets
because they apparently do not come with them. Once we have them all,
we add them to the sizer, set the sizer to our panel, and lay everything out.
On line 10, you will note that the sizer is vertically oriented, meaning that
our widgets will be placed top to bottom.
One weakness of the SpinCtrl widget is that it does not support “step”
functionality. With the other three examples, we are able to click an arrow
selector which increments or decrements by units of two, but that is not
possible with this widget.
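One conceivable workaround is to post-process the value the widget reports, turning each one-unit change into a whole step. The sketch below is an assumption rather than part of the example: the helper name stepped_value() is hypothetical, and hooking it up to the widget's change event is not shown.

```python
def stepped_value(previous, reported, minimum=2, maximum=12, step=2):
    """Turn a one-unit change reported by a spin control into a whole step."""
    if reported > previous:
        candidate = previous + step
    elif reported < previous:
        candidate = previous - step
    else:
        candidate = previous
    # Keep the result inside the allowed range.
    return max(minimum, min(maximum, candidate))

print(stepped_value(2, 3))    # up arrow pressed while showing 2
print(stepped_value(12, 11))  # down arrow pressed while showing 12
```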
Lines 39–51
Our application class instantiates the Frame object we just designed, renders it
to the screen, and sets it as the top-most window of our application. Finally,
the setup lines just instantiate our GUI application and start it running.
5.4.4 GTK+ and PyGTK
Finally, we have the PyGTK version, which is quite similar to the wxPython
GUI (see Example 5-10). The biggest differences are that we use only one
class and that it is more tedious to set the foreground and background
colors of objects, buttons in particular.
Example 5-10
PyGTK GUI Demo (animalGtk.pyw)
Our final example uses PyGTK (and GTK+). Like the wxPython example, this
one also uses a class for our application. It is interesting to note how similar
yet different all of our GUI applications are. This is not surprising and allows
programmers to switch between toolkits with relative ease.
 1  #!/usr/bin/env python
 2
 3  import pygtk
 4  pygtk.require('2.0')
 5  import gtk
 6  import pango
 7
 8  class GTKapp(object):
 9      def __init__(self):
10          top = gtk.Window(gtk.WINDOW_TOPLEVEL)
11          top.connect("delete_event", gtk.main_quit)
12          top.connect("destroy", gtk.main_quit)
13          box = gtk.VBox(False, 0)
14          lb = gtk.Label(
15              'Animals (in pairs; min: pair, max: dozen)')
16          box.pack_start(lb)
17
18          sb = gtk.HBox(False, 0)
19          adj = gtk.Adjustment(2, 2, 12, 2, 4, 0)
20          sl = gtk.Label('Number:')
21          sl.modify_font(
22              pango.FontDescription("Arial Bold 10"))
23          sb.pack_start(sl)
24          ct = gtk.SpinButton(adj, 0, 0)
25          sb.pack_start(ct)
26          box.pack_start(sb)
27
28          cb = gtk.HBox(False, 0)
29          c2 = gtk.Label('Type:')
30          cb.pack_start(c2)
31          ce = gtk.combo_box_entry_new_text()
32          for animal in ('dog', 'cat','hamster', 'python'):
33              ce.append_text(animal)
34          cb.pack_start(ce)
35          box.pack_start(cb)
36
37          qb = gtk.Button("")
38          red = gtk.gdk.color_parse('red')
39          sty = qb.get_style()
40          for st in (gtk.STATE_NORMAL,
41              gtk.STATE_PRELIGHT, gtk.STATE_ACTIVE):
42              sty.bg[st] = red
43          qb.set_style(sty)
44          ql = qb.child
45          ql.set_markup('<span color="white">QUIT</span>')
46          qb.connect_object("clicked",
47              gtk.Widget.destroy, top)
48          box.pack_start(qb)
49          top.add(box)
50          top.show_all()
51
52  if __name__ == '__main__':
53      animal = GTKapp()
54      gtk.main()
Line-by-Line Explanation
Lines 1–6
We import three different modules and packages, PyGTK, GTK, and Pango,
a library for layout and rendering of text, specifically for I18N purposes.
We need it here because it represents the core of text and font handling for
GTK+ (version 2.x).
Lines 8–50
The GTKapp class represents all the widgets of our application. The topmost
window is created (with handlers for closing it via the window manager),
and a vertically oriented sizer (VBox) is created to hold our primary widgets.
This is exactly what we did in the wxPython GUI.
However, wanting the static labels for the SpinButton and ComboBoxEntry
to be next to them (rather than above them, as in the wxPython example), we
create little horizontally oriented boxes to contain the label-widget pairs
(lines 18–35) and place those HBoxes into the all-encompassing VBox.
After creating the quit Button and adding the VBox to our topmost window, we render everything on screen. You will notice that we create the
button with an empty label at first. We do this so that a Label (child) object
will be created as part of the button. Then on lines 44–45, we get access to
the label and set the text with white font color.
The reason we do this is that setting the style foreground, for
instance, in the loop and auxiliary code on lines 40–43, affects only the
button’s foreground and not the label. For example, if you
set the foreground style to white and highlight the button (by pressing the
Tab key until it is “selected”), you will see that the inside dotted box identifying the selected widget is white, but the label text would still be black
had we not altered it with the markup on line 45.
Lines 52–54
Here, we create our application and enter the main event loop.
5.4.5 Tile/Ttk
Since its inception, the Tk library has established a solid reputation as a
flexible and simple library and toolkit with which to build GUI tools.
However, after its first decade, a perception grew among the current user
base as well as new developers that, without new features, major changes,
and upgrades, it had become dated and was not keeping up
with more current toolkits such as wxWidgets and GTK+.
Tix attempts to address this by providing new widgets, image types,
and new commands to extend Tk. Some of its core widgets even used
native UI code, giving them a more similar look and feel to other applications on the same windowing system. However, this effort merely extended
Tk’s capabilities.
In the mid-2000s, a more radical approach was proposed: the Tile widget set, which is a reimplementation of most of Tk’s core widgets while
adding several new ones. Not only is native code more prevalent, but Tile
comes with a theming engine.
Themed widget sets and the ability to easily create, import, and export
themes give developers (and users) much more control over the visual
appearance of applications and lend themselves to a more seamless integration with
the operating system and the windowing system that runs on it. This
aspect of Tile was compelling enough to cause it to be integrated with the
Tk core in version 8.5 as Ttk. Rather than being a replacement, the Ttk widget set is provided as an adjunct to the original core Tk widget set.
Tile/Ttk made its debut in Python 2.7 and 3.1. To use Ttk, the Python
version you’re using needs access to Tk 8.5 at a minimum;
recent but older versions of Tk will also work, as long as Tile is installed. In
Python 2.7+, Tile/Ttk is made available via the ttk module, while in 3.1+ it
has been absorbed under the tkinter umbrella, so you would import
tkinter.ttk.
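A script that must run under either interpreter can try both module locations in turn. The following is a minimal sketch; the helper name load_ttk() is hypothetical, and it returns None rather than raising when Ttk is not available at all.

```python
def load_ttk():
    """Return the ttk module for the running interpreter, or None."""
    try:
        from tkinter import ttk   # Python 3.1+: submodule of tkinter
        return ttk
    except ImportError:
        pass
    try:
        import ttk                # Python 2.7: top-level module
        return ttk
    except ImportError:
        return None

ttk_module = load_ttk()
print('Ttk available:', ttk_module is not None)
```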
In Examples 5-11 and 5-12, you’ll find Python 2 and 3 versions of our
animalTtk.pyw and animalTtk3.pyw applications. Whether using Python 2
or 3, a UI application screen similar to that found in Figure 5-10 will be
what you’ll get upon execution.
Example 5-11
Tile/Ttk GUI Demo (animalTtk.pyw)
A demonstration application using the Tile toolkit (named Ttk when integrated
into Tk 8.5).
 1  #!/usr/bin/env python
 2
 3  from Tkinter import Tk, Spinbox
 4  from ttk import Style, Label, Button, Combobox
 5
 6  top = Tk()
 7  Style().configure("TButton",
 8      foreground='white', background='red')
 9
10  Label(top,
11      text='Animals (in pairs; min: pair, '
12      'max: dozen)').pack()
13  Label(top, text='Number:').pack()
14
15  Spinbox(top, from_=2, to=12,
16      increment=2, font='Helvetica -14 bold').pack()
17
18  Label(top, text='Type:').pack()
19
20  Combobox(top, values=('dog',
21      'cat', 'hamster', 'python')).pack()
22
23  Button(top, text='QUIT',
24      command=top.quit, style="TButton").pack()
25
26  top.mainloop()
Example 5-12
Tile/Ttk Python 3 GUI Demo (animalTtk3.pyw)
A Python 3 demonstration using the Tile toolkit (named Ttk when integrated
into Tk 8.5).
 1  #!/usr/bin/env python3
 2
 3  from tkinter import Tk, Spinbox
 4  from tkinter.ttk import Style, Label, Button, Combobox
 5
 6  top = Tk()
 7  Style().configure("TButton",
 8      foreground='white', background='red')
 9
10  Label(top,
11      text='Animals (in pairs; min: pair, '
12      'max: dozen)').pack()
13  Label(top, text='Number:').pack()
14
15  Spinbox(top, from_=2, to=12,
16      increment=2, font='Helvetica -14 bold').pack()
17
18  Label(top, text='Type:').pack()
19
20  Combobox(top, values=('dog',
21      'cat', 'hamster', 'python')).pack()
22
23  Button(top, text='QUIT',
24      command=top.quit, style="TButton").pack()
25
26  top.mainloop()
Figure 5-10 The animal UI in Tile/Ttk.
Line-by-Line Explanation
Lines 1–4
The Tk core widgets received three new widgets in Tk 8.4. One of them
was the Spinbox, which we’ll be using in this application. (The other two
are LabelFrame and PanedWindow.) All others used here are Tile/Ttk widgets: Label, Button, and Combobox, plus the Style class, which helps with
the widget theming.
Lines 6–8
These lines just initiate the root window as well as a Style object, which
contains the themed elements for widgets that choose to use it. It helps
define a common look and feel to your widgets. Although it seems like a
waste to use it just for our quit button, you cannot specify individual foreground and background colors directly for buttons. This forces you to program in a more disciplined way. The minor inconvenience in this trivial
example will prove a more useful habit in practice.
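Configuring "TButton" changes the look of every Ttk button in the application. When only one button should differ, a derived style name such as "Quit.TButton" can be configured instead and requested through the style option. The sketch below assumes a hypothetical helper, make_quit_button(), that receives the ttk module as an argument; creating real widgets requires a display, and some native themes may not honor every color option.

```python
def make_quit_button(parent, ttk):
    """Create a QUIT button that uses its own derived style."""
    style = ttk.Style(parent)
    # 'Quit.TButton' inherits from 'TButton'; only widgets that ask for it
    # by name pick up these colors.
    style.configure('Quit.TButton', foreground='white', background='red')
    return ttk.Button(parent, text='QUIT', command=parent.quit,
                      style='Quit.TButton')
```

With a root window top and the ttk module already imported, make_quit_button(top, ttk).pack() would add the button, leaving other Ttk buttons with their default style.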
Lines 10–26
The majority of the rest of the code defines (and packs) the entire widgetset,
which matches pretty much what you’ve seen in this application using the
other UIs introduced in this chapter: a Label defining the application, a
Label and Spinbox combo that controls the numeric range of possible values (and increment), a Label and Combobox pair letting users select an animal, and a quit Button. We end by entering the GUI mainloop.
This line-by-line explanation is identical to that of its Python 3 sibling
shown in Example 5-12, with the only changes being in imports: Tkinter is
renamed to tkinter in Python 3, and the ttk module becomes a submodule
of tkinter.
5.5 Related Modules and Other GUIs
There are other GUI development systems that can be used with Python.
We present the appropriate modules along with their corresponding window systems in Table 5-2.
Table 5-2 GUI Systems Available for Python

GUI Library
    Description

Tk-Related Modules

Tkinter/tkinter (a)
    TK INTERface: Python’s default GUI toolkit
    http://wiki.python.org/moin/TkInter
Pmw
    Python MegaWidgets (Tkinter extension)
    http://pmw.sf.net
Tix
    Tk Interface eXtension (Tk extension)
    http://tix.sf.net
Tile/Ttk
    Tile/Ttk themed widget set
    http://tktable.sf.net
TkZinc (Zinc)
    Extended Tk canvas type (Tk extension)
    http://www.tkzinc.org
EasyGUI (easygui)
    Very simple, non-event-driven GUIs (Tkinter extension)
    http://ferg.org/easygui
TIDE + (IDE Studio)
    Tix Integrated Development Environment (including IDE Studio, a Tix-enhanced version of the standard IDLE IDE)
    http://starship.python.net/crew/mike

wxWidgets-Related Modules

wxPython
    Python binding to wxWidgets, a cross-platform GUI framework (formerly known as wxWindows)
    http://wxpython.org
Boa Constructor
    Python IDE and wxPython GUI builder
    http://boa-constructor.sf.net
PythonCard
    wxPython-based desktop application GUI construction kit (inspired by HyperCard)
    http://pythoncard.sf.net
wxGlade
    Another wxPython GUI designer (inspired by Glade, the GTK+/GNOME GUI builder)
    http://wxglade.sf.net

GTK+/GNOME-Related Modules

PyGTK
    Python wrapper for the GIMP Toolkit (GTK+) library
    http://pygtk.org
GNOME-Python
    Python binding to GNOME desktop and development libraries
    http://gnome.org/start/unstable/bindings
    http://download.gnome.org/sources/gnome-python
Glade
    A GUI builder for GTK+ and GNOME
    http://glade.gnome.org
PyGUI (GUI)
    Cross-platform “Pythonic” GUI API (built on Cocoa [Mac OS X] and GTK+ [POSIX/X11 and Win32])
    http://www.cosc.canterbury.ac.nz/~greg/python_gui

Qt/KDE-Related Modules

PyQt
    Python binding for the Qt GUI/XML/SQL C++ toolkit from Trolltech (partially open source [dual-license])
    http://riverbankcomputing.co.uk/pyqt
PyKDE
    Python binding for the KDE desktop environment
    http://riverbankcomputing.co.uk/pykde
eric
    Python IDE written in PyQt using QScintilla editor widget
    http://die-offenbachs.de/detlev/eric3
    http://ericide.python-hosting.com/
PyQtGPL
    Qt (Win32 Cygwin port), Sip, QScintilla, PyQt bundle
    http://pythonqt.vanrietpaap.nl

Other Open-Source GUI Toolkits

FXPy
    Python binding to FOX toolkit (http://fox-toolkit.org)
    http://fxpy.sf.net
pyFLTK (fltk)
    Python binding to FLTK toolkit (http://fltk.org)
    http://pyfltk.sf.net
PyOpenGL (OpenGL)
    Python binding to OpenGL (http://opengl.org)
    http://pyopengl.sf.net

Commercial

win32ui
    Microsoft MFC (via Python for Windows Extensions)
    http://starship.python.net/crew/mhammond/win32
swing
    Sun Microsystems Java/Swing (via Jython)
    http://jython.org

a. Tkinter for Python 2 and tkinter for Python 3.
You can find out more about all GUIs related to Python from the general
GUI Programming page on the Python wiki at http://wiki.python.org/moin/GuiProgramming.
5.6 Exercises
5-1. Client/Server Architecture. Describe the roles of a windows (or
windowing) server and a windows client.
5-2. Object-Oriented Programming. Describe the relationship
between child and parent widgets.
5-3. Label Widgets. Update the tkhello1.py script to display your
own message instead of “Hello World!”
5-4. Label and Button Widgets. Update the tkhello3.py script so
that there are three new buttons in addition to the QUIT button. Pressing any of the three buttons will result in changing
the text label so that it will then contain the text of the Button
(widget) that was pressed. Hint: You will need three separate
handlers or customize one handler with arguments preset
(still three function objects).
5-5. Label, Button, and Radiobutton Widgets. Modify your solution to Exercise 5-4 so that there are three Radiobuttons presenting the choices of text for the Label. There are two
buttons: the QUIT button and an Update button. When the
Update button is pressed, the text label will then be changed
to contain the text of the selected Radiobutton. If no Radiobutton
has been checked, the Label will remain unchanged.
5-6. Label, Button, and Entry Widgets. Modify your solution to
Exercise 5-5 so that the three Radiobuttons are replaced by a
single Entry text field widget with a default value of “Hello
World!” (to reflect the initial string in the Label). The Entry
field can be edited by the user with a new text string for the
Label, which will be updated if the Update button is pressed.
5-7. Label and Entry Widgets and Python I/O. Create a GUI application that provides an Entry field in which the user can provide the name of a text file. Open the file and read it,
displaying its contents in a Label.
Extra Credit (Menus): Replace the Entry widget with a menu
that has a File Open option that pops up a window to allow
the user to specify the file to read. Also add an Exit or Quit
option to the menu to augment the QUIT button.
5-8. Simple Text Editor. Use your solution to the previous problem
to create a simple text editor. A file can be created from scratch
or read and displayed into a Text widget that can be edited
by the user. When the user quits the application (either by
using the QUIT button or the Quit/Exit menu option), the
user is prompted whether to save the changes or quit without saving.
Extra Credit: Interface your script to a spellchecker and add a
button or menu option to spellcheck the file. The words that
are misspelled should be highlighted by using a different
foreground or background color in the Text widget.
5-9. Multithreaded Chat Applications. The chat programs from the
earlier chapters need completion. Create a fully-functional,
multithreaded chat server. A GUI is not really necessary for
the server unless you want to create one as a front-end to its
configuration, for example, port number, name, connection
to a name server, etc. Create a multithreaded chat client that
has separate threads to monitor user input (and sends the
message to the server for broadcast) and another thread to
accept incoming messages to display to the user. The client
front-end GUI should have two portions of the chat window:
a larger section with multiple lines to hold all the dialog, and
a smaller text entry field to accept input from the user.
5-10. Using Other GUIs. The example GUI applications using the
various toolkits are very similar; however, they are not the
same. Although it is impossible to make them all look exactly
alike, tweak them so that they are more consistent than they
are now.
5-11. Using GUI Builders. GUI builders help you to create GUI
applications faster by auto-generating the boilerplate code
for you so that all you have to do is “the hard stuff.” Download a GUI builder tool and implement the animal GUI by
just dragging the widgets from the corresponding palette.
Hook it up with callbacks so that they behave just like the
sample applications we looked at in this chapter.
What GUI builders are out there? For wxWidgets, see PythonCard, wxGlade, XRCed, wxFormBuilder, or even Boa Constructor (no longer maintained), and for GTK+, there’s Glade
(plus its friend GtkBuilder). For more tools like these, check
out the “GUI Design Tools and IDEs” section of the GUI tools
wiki page at http://wiki.python.org/moin/GuiProgramming.
CHAPTER
Database Programming
Did you really name your son Robert');
DROP TABLE Students;-- ?
—Randall Munroe, XKCD, October 2007
In this chapter...
• Introduction
• The Python DB-API
• ORMs
• Non-Relational Databases
• Related References
Chapter 6 • Database Programming
In this chapter, we discuss how to communicate with databases by
using Python. Files or simplistic persistent storage can meet the needs
of smaller applications, but larger server or high-data-volume applications might require a full-fledged database system, instead. Thus, we cover
both relational and non-relational databases as well as Object-Relational
Mappers (ORMs).
6.1
Introduction
This opening section will discuss the need for databases, present the Structured Query Language (SQL), and introduce readers to Python’s database
application programming interface (API).
6.1.1
Persistent Storage
In any application, there is a need for persistent storage. Generally, there
are three basic storage mechanisms: files, a database system, or some sort
of hybrid, such as an API that sits on top of one of those existing systems,
an ORM, file manager, spreadsheet, configuration file, etc.
In the Files chapter of Core Python Language Fundamentals or Core Python
Programming, we discussed persistent storage using plain file access as well
as the database manager (DBM), an old Unix persistent storage mechanism
layered on top of files. That discussion covered the *dbm modules, dbhash/
bsddb files, and shelve (a combination of pickle and DBM), all of which are
used through a dictionary-like object interface.
This chapter will focus on using databases for the times when files or
creating your own data storage system does not suffice for larger projects.
In such cases, you will have many decisions to make. Thus, the goal of this
chapter is to introduce you to the basics and show you as many of your
options as possible (and how to work with them from within Python) so
that you can make the right decision. We start off with SQL and relational
databases first, because they are still the prevailing form of persistent storage.
6.1.2
Basic Database Operations and SQL
Before we dig into databases and how to use them with Python, we want
to present a quick introduction (or review if you have some experience) to
some elementary database concepts and SQL.
Underlying Storage
Databases usually have a fundamental persistent storage that uses the file
system, that is, normal operating system files, special operating system
files, and even raw disk partitions.
User Interface
Most database systems provide a command-line tool with which to issue
SQL commands or queries. There are also some GUI tools that use the
command-line clients or the database client library, affording users a much
more comfortable interface.
Databases
A relational database management system (RDBMS) can usually manage
multiple databases, such as sales, marketing, customer support, etc., all on
the same server (if the RDBMS is server-based; simpler systems are usually
not). In the examples we will look at in this chapter, MySQL demonstrates a
server-based RDBMS because there is a server process running continuously, waiting for commands; neither SQLite nor Gadfly has a running
server.
Components
The table is the storage abstraction for databases. Each row of data will
have fields that correspond to database columns. The set of table definitions of columns and data types per table all put together define the database schema.
Databases are created and dropped. The same is true for tables. Adding
new rows to a database is called inserting; changing existing rows in a
table is called updating; and removing existing rows in a table is called
deleting. These actions are usually referred to as database commands or
operations. Requesting rows from a database with optional criteria is called
querying.
When you query a database, you can fetch all of the results (rows) at
once, or just iterate slowly over each resulting row. Some databases use the
concept of a cursor for issuing SQL commands, queries, and grabbing
results, either all at once or one row at a time.
SQL
Database commands and queries are given to a database via SQL. Not all
databases use SQL, but the majority of relational databases do. Following
are some examples of SQL commands. Note that most databases are configured to be case-insensitive, especially database commands. The accepted
style is to use CAPS for database keywords. Most command-line programs
require a trailing semicolon (;) to terminate a SQL statement.
Creating a Database
CREATE DATABASE test;
GRANT ALL ON test.* to user(s);
The first line creates a database named “test,” and assuming that you
are a database administrator, the second line can be used to grant permissions to specific users (or all of them) so that they can perform the database
operations that follow.
Using a Database
USE test;
If you logged into a database system without choosing which database
you want to use, this simple statement allows you to specify one with
which to perform database operations.
Dropping a Database
DROP DATABASE test;
This simple statement removes all the tables and data from the database
and deletes it from the system.
Creating a Table
CREATE TABLE users (login VARCHAR(8), userid INT, projid INT);
This statement creates a new table with a string column login and a pair
of integer fields, userid and projid.
Dropping a Table
DROP TABLE users;
This simple statement drops a database table, along with all its data.
Inserting a Row
INSERT INTO users VALUES('leanna', 2111, 1);
You can insert a new row in a database by using the INSERT statement.
You specify the table and the values that go into each field. For our example, the string 'leanna' goes into the login field, and 2111 and 1 to userid
and projid, respectively.
Updating a Row
UPDATE users SET projid=4 WHERE projid=2;
UPDATE users SET projid=1 WHERE userid=311;
To change existing table rows, you use the UPDATE statement. Use SET
for the columns that are changing and provide any criteria for determining which rows should change. In the first example, all users with a “project ID” (or projid) of 2 will be moved to project #4. In the second example,
we take one user (with a UID of 311) and move him to project #1.
Deleting a Row
DELETE FROM users WHERE projid=%d;
DELETE FROM users;
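To delete table rows, use the DELETE FROM command and specify the table
from which you want to delete rows along with any optional criteria. In the
first example, %d stands in for an integer project ID that the application
fills in. Without criteria, as in the second example, all rows will be deleted.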
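The statements above can be tried without a database server by using the sqlite3 module from the standard library. The following is a minimal sketch, assuming an in-memory SQLite database. CREATE DATABASE, USE, and GRANT are left out because SQLite has no server or user accounts, and the two extra rows ('faye' and 'serena') are made up for illustration.

```python
import sqlite3

# An in-memory database disappears when the connection is closed.
cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()

cur.execute('CREATE TABLE users (login VARCHAR(8), userid INT, projid INT)')

# Insert the row from the text plus two hypothetical rows.
cur.execute("INSERT INTO users VALUES('leanna', 2111, 1)")
cur.execute("INSERT INTO users VALUES('faye', 311, 2)")
cur.execute("INSERT INTO users VALUES('serena', 7003, 2)")

# Move everyone on project 2 to project 4, then move user 311 to project 1.
cur.execute('UPDATE users SET projid=4 WHERE projid=2')
cur.execute('UPDATE users SET projid=1 WHERE userid=311')

# Delete by criteria; the value is passed as a parameter (sqlite3 uses ?).
cur.execute('DELETE FROM users WHERE projid=?', (4,))

cur.execute('SELECT login, userid, projid FROM users ORDER BY userid')
rows = cur.fetchall()
for row in rows:
    print(row)

cur.execute('DROP TABLE users')
cxn.close()
```

Passing the project ID as a parameter, instead of building the SQL string by hand, leaves quoting to the adapter.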
Now that you are up to speed on basic database concepts, following the
rest of the chapter and its examples should be much easier. If you
need additional help, there are plenty of database tutorial books available
that can do the trick.
6.1.3
Databases and Python
We are going to cover the Python database API and look at how to access
relational databases from Python—either directly through a database interface, or via an ORM—and how you can accomplish the same task but
without necessarily having to give explicit commands in SQL.
Topics such as database principles, concurrency, schema, atomicity,
integrity, recovery, proper complex left JOINs, triggers, query optimization, transactions, stored procedures, etc., are all beyond the scope of this
text, and we will not be discussing them in this chapter other than direct
use from a Python application. Rather, we will present how to store and
retrieve data to and from RDBMSs while playing within a Python framework. You can then decide which is best for your current project or application and be able to study sample code that can get you started instantly.
The goal is to get you on top of things as quickly as possible if you need to
integrate your Python application with some sort of database system.
We are also breaking out of our mode of covering only the “batteries
included” features of the Python Standard Library. While our original goal
was to play only in that arena, it has become clear that being able to work
with databases is really a core component of everyday application development in the Python world.
As a software engineer, you can probably only make it so far in your
career without having to learn something about databases: how to use one
(command-line and/or GUI interfaces), how to extract data by using
SQL, perhaps how to add or update information in a database, etc. If
Python is your programming tool, then a lot of the hard work has already
been done for you as you add database access to your Python universe. We
first describe what the Python database API, or DB-API is, then give examples of database interfaces that conform to this standard.
We will show some examples using popular open-source RDBMSs.
However, we will not include discussions of open-source versus commercial products. Adapting to those other RDBMS systems should be fairly
straightforward. A special mention will be given to Aaron Watters’s Gadfly
database, a simple RDBMS written completely in Python.
The way to access a database from Python is via an adapter. An adapter
is a Python module with which you can interface to a relational database’s
client library, usually in C. It is recommended that all Python adapters
conform to the API of the Python database special interest group (DB-SIG). This is the first major topic of this chapter.
Figure 6-1 illustrates the layers involved in writing a Python database
application, with and without an ORM. The figure demonstrates that the
DB-API is your interface to the C libraries of the database client.
[Figure 6-1: three application stacks, each ending at the relational database (RDBMS).
1. Application (embedded SQL) > RDBMS client library
2. Python application (embedded SQL) > Python DB adapter > RDBMS client library
3. Python application (little or no SQL) > Python ORM > Python DB adapter > RDBMS client library]
Figure 6-1 Multitiered communication between application and database. The first box is
generally a C/C++ program, whereas DB-API-compliant adapters let you program applications
in Python. ORMs can simplify an application by handling all of the database-specific details.
6.2
The Python DB-API
Where can one find the interfaces necessary to talk to a database? Simple.
Just go to the database topics section at the main Python Web site. There
you will find links to the full and current DB-API (version 2.0), existing
database modules, documentation, the special interest group, etc. Since its
inception, the DB-API has been moved into PEP 249. (This PEP supersedes
the old DB-API 1.0 specification, which is PEP 248.) What is the DB-API?
The API is a specification that states a set of required objects and database access mechanisms to provide consistent access across the various
database adapters and underlying database systems. Like most community-based efforts, the API was driven by strong need.
In the “old days,” we had a scenario of many databases and many people implementing their own database adapters. It was a wheel that was
being reinvented over and over again. These databases and adapters were
implemented at different times by different people without any consistency of functionality. Unfortunately, this meant that application code
using such interfaces also had to be customized to whichever database
module was chosen, and any changes to that interface also meant updates
were needed in the application code.
A SIG for Python database connectivity was formed, and eventually, an
API was born: the DB-API version 1.0. The API provides for a consistent
interface to a variety of relational databases, and porting code between different databases is much simpler, usually only requiring tweaking several
lines of code. You will see an example of this later on in this chapter.
6.2.1
Module Attributes
The DB-API specification mandates that the features and attributes listed
below must be supplied. A DB-API-compliant module must define the
global attributes as shown in Table 6-1.
Table 6-1 DB-API Module Attributes
Attribute             Description
apilevel              The version of the DB-API with which an adapter is compliant
threadsafety          Level of thread safety of this module
paramstyle            SQL statement parameter style of this module
connect()             connect() function
(Various exceptions)  (See Table 6-4)
Data Attributes
apilevel
This string (not float) indicates the highest version of the DB-API with
which the module is compliant, for example, 1.0, 2.0, etc. If absent, 1.0
should be assumed as the default value.
threadsafety
This is an integer that can take the following possible values:
• 0: Not threadsafe, so threads should not share the module at
all
• 1: Minimally threadsafe: threads can share the module but
not connections
• 2: Moderately threadsafe: threads can share the module and
connections but not cursors
• 3: Fully threadsafe: threads can share the module,
connections, and cursors
If a resource is shared, a synchronization primitive such as a spin lock or
semaphore is required for atomic-locking purposes. Disk files and global
variables are not reliable for this purpose and can interfere with standard
mutex operation. See the threading module or go back to Chapter 4,
“Multithreaded Programming,” for more information on how to use a lock.
paramstyle
The API supports a variety of ways to indicate how parameters should be
integrated into an SQL statement that is eventually sent to the server for
execution. This attribute is just a string that specifies the form of string
substitution you will use when building rows for a query or command
(see Table 6-2).
Table 6-2 paramstyle Database Parameter Styles
Parameter Style  Description                                         Example
numeric          Numeric positional style                            WHERE name=:1
named            Named style                                         WHERE name=:name
pyformat         Python dictionary printf() format conversion        WHERE name=%(name)s
qmark            Question mark style                                 WHERE name=?
format           ANSI C printf() format conversion                   WHERE name=%s
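The module attributes and parameter styles described above can be inspected directly. The following is a small sketch that uses the standard library's sqlite3 adapter as the example module; other adapters report their own values, and the table and row used here are made up for illustration. sqlite3 advertises the qmark style and also accepts the named style.

```python
import sqlite3

# The three module-level data attributes required by the DB-API.
print(sqlite3.apilevel)      # a string, not a float
print(sqlite3.threadsafety)  # an integer from 0 to 3
print(sqlite3.paramstyle)    # the style this adapter advertises

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute('CREATE TABLE users (login VARCHAR(8), userid INT)')
cur.execute("INSERT INTO users VALUES('john', 7000)")

# qmark style: values are supplied as a sequence, in order.
cur.execute('SELECT userid FROM users WHERE login=?', ('john',))
by_qmark = cur.fetchone()

# named style: values are supplied as a mapping.
cur.execute('SELECT userid FROM users WHERE login=:name', {'name': 'john'})
by_name = cur.fetchone()

print(by_qmark, by_name)
cxn.close()
```

Checking paramstyle at run time is one way for portable code to decide which placeholder form to generate.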
Function Attribute(s)
connect()
Access to the database is made available through Connection objects.
A compliant module must implement a connect()
function, which creates and returns a Connection object. Table 6-3 shows
the arguments to connect().
Table 6-3 connect() Function Attributes
Parameter  Description
user       Username
password   Password
host       Hostname
database   Database name
dsn        Data source name
You can pass in database connection information as a string with multiple parameters (DSN) or individual parameters passed as positional arguments (if you know the exact order), or more likely, keyword arguments.
Here is an example of using connect() from PEP 249:
connect(dsn='myhost:MYDB',user='guido',password='234$')
The use of DSN versus individual parameters is based primarily on the
system to which you are connecting. For example, if you are using an API
like Open Database Connectivity (ODBC) or Java DataBase Connectivity
(JDBC), you would likely be using a DSN, whereas if you are working
directly with a database, then you are more likely to issue separate login
parameters. Another reason for this is that most database adapters have
not implemented support for DSN. The following are some examples of
non-DSN connect() calls. Note that not all adapters have implemented
the specification exactly, e.g., MySQLdb uses db instead of database.
• MySQLdb.connect(host='dbserv', db='inv', user='smith')
• PgSQL.connect(database='sales')
• psycopg.connect(database='template1', user='pgsql')
• gadfly.dbapi20.connect('csrDB', '/usr/local/database')
• sqlite3.connect('marketing/test')
Exceptions
Exceptions that should also be included in the compliant module as
globals are shown in Table 6-4.
Table 6-4 DB-API Exception Classes
Exception          Description
Warning            Root warning exception class
Error              Root error exception class
InterfaceError     Database interface (not database) error
DatabaseError      Database error
DataError          Problems with the processed data
OperationalError   Error during database operation execution
IntegrityError     Database relational integrity error
InternalError      Error that occurs within the database
ProgrammingError   SQL command failed
NotSupportedError  Unsupported operation occurred
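Because the exception classes are module globals, application code catches them by module name. The following sketch, again assuming the sqlite3 adapter and an in-memory database, provokes an error by dropping a table that does not exist. It relies on the class hierarchy given in PEP 249, in which the specific errors derive from DatabaseError and DatabaseError derives from Error.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()

# The specific classes derive from DatabaseError, which derives from Error.
print(issubclass(sqlite3.OperationalError, sqlite3.DatabaseError))
print(issubclass(sqlite3.DatabaseError, sqlite3.Error))

caught = None
try:
    # The table does not exist, so the adapter raises an exception.
    cur.execute('DROP TABLE users')
except sqlite3.Error as e:
    # Catching the root Error class handles any DB-API error from this module.
    caught = e
    print('database error:', e)
finally:
    cxn.close()
```

Catching the root Error class is a reasonable default; a handler for a narrower class such as IntegrityError can be placed before it when one kind of failure needs special treatment.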
6.2.2
Connection Objects
Connections are how your application communicates with the database.
They represent the fundamental mechanism by which commands are sent
to the server and results returned. Once a connection has been established
(or a pool of connections), you create cursors to send requests to and
receive replies from the database.
Connection Object Methods
Connection objects are not required to have any data attributes but should
define the methods shown in Table 6-5.
Table 6-5 Connection Object Methods
Method Name                            Description
close()                                Close database connection
commit()                               Commit current transaction
rollback()                             Cancel current transaction
cursor()                               Create (and return) a cursor or cursor-like object
                                       using this connection
errorhandler(cxn, cur, errcls, errval) Serves as a handler for given connection cursor
When close() is used, the same connection cannot be used again without running into an exception.
The commit() method is irrelevant if the database does not support
transactions or if it has an auto-commit feature that has been enabled. You
can implement separate methods to turn auto-commit off or on if you
wish. Since this method is required as part of the API, databases that do
not support transactions should just implement “pass” for this method.
Like commit(), rollback() only makes sense if transactions are supported in the database. After execution, rollback() should leave the database in the same state as it was when the transaction began. According to
PEP 249, “Closing a connection without committing the changes first will cause
an implicit rollback to be performed.”
If the RDBMS does not support cursors, cursor() should still return an
object that faithfully emulates or imitates a real cursor object. These are
just the minimum requirements. Each individual adapter developer can
always add special attributes specifically for their interface or database.
It is also recommended but not required for adapter writers to make all
database module exceptions (see earlier) available via a connection. If not,
then it is assumed that Connection objects will throw the corresponding
module-level exception. Once you have completed using your connection
and cursors are closed, you should commit() any operations and close()
your connection.
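The commit(), rollback(), and close() behavior described above can be observed with a short experiment. This is a sketch that assumes the sqlite3 adapter with its default transaction handling; the table and rows are made up for illustration, and other adapters can differ in when a transaction begins.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute('CREATE TABLE users (login VARCHAR(8), userid INT)')

cur.execute("INSERT INTO users VALUES('john', 7000)")
cxn.commit()        # make the first row permanent

cur.execute("INSERT INTO users VALUES('jane', 7001)")
cxn.rollback()      # discard everything since the last commit

cur.execute('SELECT login FROM users')
logins = [row[0] for row in cur.fetchall()]
print(logins)

cur.close()
cxn.close()

# A closed connection cannot be used again without raising an exception.
closed_error = None
try:
    cxn.cursor()
except sqlite3.Error as e:
    closed_error = e
    print('closed connection:', e)
```

Only the committed row survives the rollback, and the final cursor() call fails because the connection has already been closed.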
6.2.3
Cursor Objects
Once you have a connection, you can begin communicating with the database. As we mentioned earlier in the introductory section, a cursor lets a
user issue database commands and retrieve rows resulting from queries.
A Python DB-API cursor object functions as a cursor for you, even if cursors are not supported in the database. In this case, if you are creating a
database adapter, you must implement cursor objects so that they act like
cursors. This keeps your Python code consistent when you switch between
database systems that support or do not support cursors.
Once you have created a cursor, you can execute a query or command
(or multiple queries and commands) and retrieve one or more rows from the
results set. Table 6-6 presents Cursor object data attributes and methods.
Table 6-6 Cursor Object Attributes
Object Attribute                       Description
arraysize                              Number of rows to fetch at a time with
                                       fetchmany(); default is 1
connection                             Connection that created this cursor (optional)
description                            Returns cursor activity (7-item tuples): (name,
                                       type_code, display_size, internal_size,
                                       precision, scale, null_ok); only name and
                                       type_code are required
lastrowid                              Row ID of last modified row (optional; if row
                                       IDs not supported, default to None)
rowcount                               Number of rows that the last execute*()
                                       produced or affected
callproc(func[, args])                 Call a stored procedure
close()                                Close cursor
execute(op[, args])                    Execute a database query or command
executemany(op, args)                  Like execute() and map() combined; prepare
                                       and execute a database query or command over
                                       given arguments
fetchone()                             Fetch next row of query result
fetchmany([size=cursor.arraysize])     Fetch next size rows of query result
fetchall()                             Fetch all (remaining) rows of a query result
__iter__()                             Create iterator object from this cursor (optional;
                                       also see next())
messages                               List of messages (set of tuples) received from
                                       the database for cursor execution (optional)
next()                                 Used by iterator to fetch next row of query
                                       result (optional; like fetchone(), also see
                                       __iter__())
nextset()                              Move to next results set (if supported)
rownumber                              Index of cursor (by row, 0-based) in current
                                       result set (optional)
setinputsizes(sizes)                   Set maximum input size allowed (required but
                                       implementation optional)
setoutputsize(size[,col])              Set maximum buffer size for large column
                                       fetches (required but implementation optional)
The most critical attributes of cursor objects are the execute*() and the
fetch*() methods; all service requests to the database are performed by
these. The arraysize data attribute is useful in setting a default size for
fetchmany(). Of course, closing the cursor is a good thing, and if your
database supports stored procedures, then you will be using callproc().
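The execute*() and fetch*() methods work together as in the following sketch. It assumes the sqlite3 adapter and an in-memory database, and the four rows are hypothetical sample data.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute('CREATE TABLE users (login VARCHAR(8), userid INT)')

# executemany() runs one statement once per parameter tuple.
people = [('john', 7000), ('jane', 7001), ('bob', 7200), ('ann', 7300)]
cur.executemany('INSERT INTO users VALUES(?, ?)', people)

cur.execute('SELECT login, userid FROM users ORDER BY userid')
# description holds one 7-item tuple per column; the name comes first.
columns = [col[0] for col in cur.description]

first = cur.fetchone()        # next single row
next_two = cur.fetchmany(2)   # next two rows
rest = cur.fetchall()         # whatever is left

print(columns)
print(first, next_two, rest)

cur.close()
cxn.close()
```

Each fetch call picks up where the previous one stopped, so the three calls together consume the result set exactly once.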
6.2.4
Type Objects and Constructors
Oftentimes, the interface between two different systems is the most
fragile part. This is seen when converting Python objects to C types and vice
versa. Similarly, there is also a fine line between Python objects and native
database objects. As a programmer writing to Python’s DB-API, the
parameters you send to a database are given as strings, but the database
might need to convert them to a variety of different, supported data types that
are correct for any particular query.
For example, should the Python string be converted to a VARCHAR, a
TEXT, a BLOB, or a raw BINARY object, or perhaps a DATE or TIME
object if that is what the string is supposed to be? Care must be taken to
provide database input in the expected format; therefore, another requirement of the DB-API is to create constructors that build special objects that
can easily be converted to the appropriate database objects. Table 6-7
describes classes that can be used for this purpose. SQL NULL values are
mapped to and from Python’s NULL object, None.
Table 6-7 Type Objects and Constructors
Type Object                       Description
Date(yr,mo,dy)                    Object for a date value
Time(hr,min,sec)                  Object for a time value
Timestamp(yr,mo,dy,hr,min,sec)    Object for a timestamp value
DateFromTicks(ticks)              Date object, given in number of seconds since
                                  the epoch
TimeFromTicks(ticks)              Time object, given in number of seconds since
                                  the epoch
TimestampFromTicks(ticks)         Timestamp object, given in number of seconds
                                  since the epoch
Binary(string)                    Object for a binary (long) string value
STRING                            Object describing string-based columns, for
                                  example, VARCHAR
BINARY                            Object describing (long) binary columns, for
                                  example, RAW, BLOB
NUMBER                            Object describing numeric columns
DATETIME                          Object describing date/time columns
ROWID                             Object describing “row ID” columns
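A few of these constructors can be tried with the sqlite3 adapter, as in the sketch below. What the constructors return is adapter-specific: sqlite3 happens to hand back standard datetime and memoryview objects, and other adapters may return their own classes. The files table and its contents are made up for illustration.

```python
import datetime
import sqlite3

# Constructors build values in the form the adapter expects.
birthday = sqlite3.Date(2007, 10, 1)
stamp = sqlite3.Timestamp(2007, 10, 1, 12, 30, 0)
blob = sqlite3.Binary(b'\x00\x01\x02')

print(type(birthday), type(stamp), type(blob))

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute('CREATE TABLE files (name TEXT, data BLOB, note TEXT)')

# None on the Python side is stored as SQL NULL and comes back as None.
cur.execute('INSERT INTO files VALUES(?, ?, ?)', ('tiny.bin', blob, None))
cur.execute('SELECT name, data, note FROM files')
name, data, note = cur.fetchone()
print(name, bytes(data), note)
cxn.close()
```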
Changes to API Between Versions
Several important changes were made when the DB-API was revised from
version 1.0 (1996) to 2.0 (1999):
• The required dbi module was removed from the API.
• Type objects were updated.
• New attributes were added to provide better database
bindings.
• callproc() semantics and the return value of execute() were
redefined.
• Conversion to class-based exceptions.
Since version 2.0 was published, some of the additional, optional DB-API extensions that you just read about were added in 2002. There have
been no other significant changes to the API since it was published. Continuing discussions of the API occur on the DB-SIG mailing list. Topics
brought up over the last five years include the possibilities
for the next version of the DB-API, tentatively named DB-API 3.0.
These include the following:
• Better return value for nextset() when there is a new
result set.
• Switch from float to Decimal.
• Improved flexibility and support for parameter styles.
• Prepared statements or statement caching.
• Refine the transaction model.
• State the role of API with respect to portability.
• Add unit testing.
If you have strong feelings about the API or its future, feel free to participate and join in the discussion. Here are some references that you might
find handy.
• http://python.org/topics/database
• http://linuxjournal.com/article/2605 (outdated but historical)
• http://wiki.python.org/moin/DbApi3
6.2.5
Relational Databases
So, you are now ready to go, but you probably have one burning question:
“which interfaces to database systems are available to me in Python?”
That inquiry is similar to, “which platforms is Python available for?” The
answer is, “Pretty much all of them.” Following is a broad (but not exhaustive) list of interfaces:
Commercial RDBMSs
• IBM Informix
• Sybase
• Oracle
• Microsoft SQL Server
• IBM DB2
• SAP
• Embarcadero Interbase
• Ingres
Open-Source RDBMSs
• MySQL
• PostgreSQL
• SQLite
• Gadfly
Database APIs
• JDBC
• ODBC
Non-Relational Databases
• MongoDB
• Redis
• Cassandra
• SimpleDB
• Tokyo Cabinet
• CouchDB
• Bigtable (via Google App Engine Datastore API)
To find an updated (but not necessarily the most recent) list of what
databases are supported, go to the following Web site:
http://wiki.python.org/moin/DatabaseInterfaces
6.2.6
Databases and Python: Adapters
For each of the databases supported, there exist one or more adapters that
let you connect to the target database system from Python. Some databases, such as Sybase, SAP, Oracle, and SQL Server, have more than one
adapter available. The best thing to do is to determine which ones best fit
your needs. Your questions for each candidate might include: how good is
its performance, how useful is its documentation and/or Web site, whether
it has an active community or not, what is the overall quality and stability
of the driver, etc. Keep in mind that most adapters provide just
the basic necessities to get you connected to the database. It is the extras that
you might be looking for. You are also responsible for
higher-level code like threading and thread management as well as management of database connection pools, etc.
If you are squeamish and want less hands-on interaction—for example,
if you prefer to do as little SQL or database administration as possible—then
you might want to consider ORMs, which are covered later in this chapter.
Let’s now look at some examples of how to use an adapter module to
communicate with a relational database. The real secret is in setting up the
connection. Once you have this and use the DB-API objects, attributes, and
object methods, your core code should be pretty much the same, regardless of which adapter and RDBMS you use.
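One way to picture this is to keep the adapter-specific connection setup in one place and write everything else against the DB-API objects. The following is a sketch under that assumption; count_users() is a hypothetical helper, sqlite3 stands in for whichever adapter you use, and statements that take parameters would still need to match the adapter's paramstyle.

```python
import sqlite3

def count_users(cxn):
    """Count rows in users by using only standard DB-API calls."""
    cur = cxn.cursor()
    try:
        cur.execute('SELECT COUNT(*) FROM users')
        return cur.fetchone()[0]
    finally:
        cur.close()

# Only the connection setup names a specific adapter. With MySQLdb, this
# line might instead read: cxn = MySQLdb.connect(db='test')
cxn = sqlite3.connect(':memory:')

cur = cxn.cursor()
cur.execute('CREATE TABLE users (login VARCHAR(8), userid INT)')
cur.execute("INSERT INTO users VALUES('john', 7000)")
cur.execute("INSERT INTO users VALUES('jane', 7001)")
cur.close()
cxn.commit()

total = count_users(cxn)
print(total)
cxn.close()
```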
6.2.7
Examples of Using Database Adapters
First, let’s look at some sample code, from creating a database to creating
a table and using it. We present examples that use MySQL, PostgreSQL,
and SQLite.
MySQL
We will use MySQL as the example here, along with the most well-known
MySQL Python adapter: MySQLdb, a.k.a. MySQL-python—we’ll discuss the
other MySQL adapter, MySQL Connector/Python, when our conversation
turns to Python 3. In the various bits of code that follow, we’ll also expose
you (deliberately) to examples of error situations so that you have an idea
of what to expect, and for which you might want to create handlers.
We first log in as an administrator to create a database and grant permissions, then log back in as a normal client, as shown here:
>>> import MySQLdb
>>> cxn = MySQLdb.connect(user='root')
>>> cxn.query('DROP DATABASE test')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
_mysql_exceptions.OperationalError: (1008, "Can't drop, database
'test'; database doesn't exist")
>>> cxn.query('CREATE DATABASE test')
>>> cxn.query("GRANT ALL ON test.* to ''@'localhost'")
>>> cxn.commit()
>>> cxn.close()
In the preceding code, we did not use a cursor. Some adapters have
Connection objects, which can execute SQL queries with the query()
method, but not all. We recommend you either not use it or check your
adapter to ensure that it is available.
The commit() was optional for us because auto-commit is turned on by
default in MySQL. We then connect back to the new database as a regular
user, create a table, and then perform the usual queries and commands by
using SQL to get our job done via Python. This time we use cursors and
their execute() method.
The next set of interactions shows us creating a table. An attempt to create
it again (without first dropping it) would result in an error:
>>> cxn = MySQLdb.connect(db='test')
>>> cur = cxn.cursor()
>>> cur.execute('CREATE TABLE users(login VARCHAR(8), userid INT)')
0L
Now we will insert a few rows into the database and query them out:
>>> cur.execute("INSERT INTO users VALUES('john', 7000)")
1L
>>> cur.execute("INSERT INTO users VALUES('jane', 7001)")
1L
>>> cur.execute("INSERT INTO users VALUES('bob', 7200)")
1L
>>> cur.execute("SELECT * FROM users WHERE login LIKE 'j%'")
2L
>>> for data in cur.fetchall():
...     print '%s\t%s' % data
...
john    7000
jane    7001
The last bit features updating the table, either by updating or deleting
rows:
>>> cur.execute("UPDATE users SET userid=7100 WHERE userid=7001")
1L
>>> cur.execute("SELECT * FROM users")
3L
>>> for data in cur.fetchall():
...     print '%s\t%s' % data
...
john    7000
jane    7100
bob     7200
>>> cur.execute('DELETE FROM users WHERE login="bob"')
1L
>>> cur.execute('DROP TABLE users')
0L
>>> cur.close()
>>> cxn.commit()
>>> cxn.close()
MySQL is one of the most popular open-source databases in the world,
and it is no surprise that a Python adapter is available for it.
PostgreSQL
Another popular open-source database is PostgreSQL. Unlike MySQL,
there are no fewer than three Python adapters available for Postgres: psycopg,
PyPgSQL, and PyGreSQL. A fourth, PoPy, is now defunct, having contributed
its project to combine with that of PyGreSQL in 2003. Each of the three
remaining adapters has its own characteristics, strengths, and weaknesses,
so it would be a good idea to practice due diligence to determine which is
right for you.
Note that while we demonstrate the use of each of these, PyPgSQL has
not been actively developed since 2006, whereas PyGreSQL released its
most recent version (4.0) in 2009. This inactivity clearly leaves psycopg as
the sole leader of the PostgreSQL adapters, and this will be the final version of this book featuring examples of those adapters. psycopg is on its
second version, meaning that even though our examples use the version 1
psycopg module, when you download it today, you’ll be using psycopg2,
instead.
The good news is that the interfaces are similar enough that you can create an application that, for example, compares the performance of
all three (if that is a metric that is important to you). The following presents the setup code to get a Connection object for each adapter.
psycopg
>>> import psycopg
>>> cxn = psycopg.connect(user='pgsql')
PyPgSQL
>>> from pyPgSQL import PgSQL
>>> cxn = PgSQL.connect(user='pgsql')
PyGreSQL
>>> import pgdb
>>> cxn = pgdb.connect(user='pgsql')
Here is some generic code that will work for all three adapters:
>>> cur = cxn.cursor()
>>> cur.execute('SELECT * FROM pg_database')
>>> rows = cur.fetchall()
>>> for i in rows:
...     print i
>>> cur.close()
>>> cxn.commit()
>>> cxn.close()
Finally, note that the output from each adapter differs slightly.
PyPgSQL
sales
template1
template0
psycopg
('sales', 1, 0, 0, 1, 17140, '140626', '3221366099', '', None, None)
('template1', 1, 0, 1, 1, 17140, '462', '462', '', None, '{pgsql=C*T*/
pgsql}')
('template0', 1, 0, 1, 0, 17140, '462', '462', '', None, '{pgsql=C*T*/
pgsql}')
PyGreSQL
['sales', 1, 0, False, True, 17140L, '140626', '3221366099', '', None,
None]
['template1', 1, 0, True, True, 17140L, '462', '462', '', None,
'{pgsql=C*T*/pgsql}']
['template0', 1, 0, True, False, 17140L, '462', '462', '', None,
'{pgsql=C*T*/pgsql}']
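Because each adapter hands back rows in its own container type (tuples from psycopg, lists from PyGreSQL, and so on), code that compares or prints results across adapters may want to normalize them first. The following is a minimal sketch that assumes only the DB-API cursor interface. The function name fetch_rows and the dbs table are hypothetical, and sqlite3 is used purely as a stand-in connection so that the snippet runs without a PostgreSQL server.

```python
import sqlite3

def fetch_rows(cxn, sql):
    """Run sql on any DB-API connection and return a list of plain tuples."""
    cur = cxn.cursor()
    try:
        cur.execute(sql)
        return [tuple(row) for row in cur.fetchall()]
    finally:
        cur.close()

# Stand-in data; with PostgreSQL you would pass a psycopg, PgSQL, or pgdb
# connection and query pg_database instead.
cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute('CREATE TABLE dbs(name TEXT, owner INTEGER)')
cur.execute("INSERT INTO dbs VALUES('sales', 1)")
cur.close()

rows = fetch_rows(cxn, 'SELECT * FROM dbs')
cxn.close()
```

Converting every row with tuple() costs little and gives the rest of the program one shape to deal with, whichever adapter produced the data.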
SQLite
For extremely simple applications, using files for persistent storage usually suffices, but the most complex and data-driven applications demand a
full relational database. SQLite targets the intermediate systems, and
indeed is a hybrid of the two. It is extremely lightweight and fast, plus it is
serverless and requires little or no administration.
SQLite has experienced a rapid growth in popularity, and it is available
on many platforms. The introduction of the pysqlite database adapter
in Python 2.5 as the sqlite3 module marks the first time that the
Python Standard Library has featured a database adapter in any release.
It was bundled with Python not because it was favored over other databases and adapters, but because it is simple, uses files (or memory) as its
back-end store like the DBM modules do, does not require a server, and
does not have licensing issues. It is simply an alternative to other similar
persistent storage solutions included with Python but which happens to
have a SQL interface.
Having a module like this in the standard library allows you to develop
rapidly in Python by using SQLite, and then migrate to a more powerful
RDBMS such as MySQL, PostgreSQL, Oracle, or SQL Server for production purposes, if this is your intention. If you don't need all that horsepower, sqlite3 is a great solution.
Although the database adapter is now provided in the standard library,
it still depends on the SQLite library itself. Many Python distributions bundle it, but on some systems you might have to install it yourself. Once it is in place, all you need to do is start up Python (and
import the adapter) to gain immediate access:
>>> import sqlite3
>>> cxn = sqlite3.connect('sqlite_test/test')
>>> cur = cxn.cursor()
>>> cur.execute('CREATE TABLE users(login VARCHAR(8), userid INTEGER)')
>>> cur.execute('INSERT INTO users VALUES("john", 100)')
>>> cur.execute('INSERT INTO users VALUES("jane", 110)')
>>> cur.execute('SELECT * FROM users')
>>> for eachUser in cur.fetchall():
...     print eachUser
...
(u'john', 100)
(u'jane', 110)
>>> cur.execute('DROP TABLE users')
<sqlite3.Cursor object at 0x3d4320>
>>> cur.close()
>>> cxn.commit()
>>> cxn.close()
Okay, enough of the small examples. Next, we look at an application
similar to our earlier example with MySQL, but which does a few more
things:
• Creates a database (if necessary)
• Creates a table
• Inserts rows into the table
• Updates rows in the table
• Deletes rows from the table
• Drops the table
For this example, we will use two other open-source databases. SQLite
has become quite popular of late. It is very small, lightweight, and
extremely fast for all of the most common database functions. Another
database involved in this example is Gadfly, a mostly SQL-compliant
RDBMS written entirely in Python. (Some of the key data structures have a
C module available, but Gadfly can run without it [slower, of course].)
Some notes before we get to the code. Both SQLite and Gadfly require
that you specify the location to store database files (MySQL has a default
area and does not require this information). The most current incarnation of
Gadfly is not yet fully DB-API 2.0 compliant, and as a result, it is missing some
functionality, most notably the cursor attribute, rowcount, in our example.
6.2.8 A Database Adapter Example Application
In the example that follows, we demonstrate how to use Python to access a
database. For the sake of variety and exposing you to as much code as possible, we added support for three different database systems: Gadfly,
SQLite, and MySQL. To mix things up even further, we’re first going to
dump out the entire Python 2.x source, without a line-by-line explanation.
The application works exactly as described in the
bullet points in the previous subsection. You should be able to understand
its functionality without a full explanation; just start with the main()
function at the bottom. (To keep things simple, for a full system such as
MySQL that has a server, we will just log in as the root user, although
doing this is discouraged for a production application.) Here’s the source
code for this application, which is called ushuffle_db.py:
#!/usr/bin/env python

import os
from random import randrange as rand

COLSIZ = 10
FIELDS = ('login', 'userid', 'projid')
RDBMSs = {'s': 'sqlite', 'm': 'mysql', 'g': 'gadfly'}
DBNAME = 'test'
DBUSER = 'root'
DB_EXC = None
NAMELEN = 16

tformat = lambda s: str(s).title().ljust(COLSIZ)
cformat = lambda s: s.upper().ljust(COLSIZ)

def setup():
    return RDBMSs[raw_input('''
Choose a database system:

(M)ySQL
(G)adfly
(S)QLite

Enter choice: ''').strip().lower()[0]]

def connect(db):
    global DB_EXC
    dbDir = '%s_%s' % (db, DBNAME)

    if db == 'sqlite':
        try:
            import sqlite3
        except ImportError:
            # fall back to the third-party pysqlite package
            try:
                from pysqlite2 import dbapi2 as sqlite3
            except ImportError:
                return None

        DB_EXC = sqlite3
        if not os.path.isdir(dbDir):
            os.mkdir(dbDir)
        cxn = sqlite3.connect(os.path.join(dbDir, DBNAME))

    elif db == 'mysql':
        try:
            import MySQLdb
            import _mysql_exceptions as DB_EXC
        except ImportError:
            return None

        try:
            cxn = MySQLdb.connect(db=DBNAME)
        except DB_EXC.OperationalError:
            # database not there yet: log in as DBUSER and create it
            try:
                cxn = MySQLdb.connect(user=DBUSER)
                cxn.query('CREATE DATABASE %s' % DBNAME)
                cxn.commit()
                cxn.close()
                cxn = MySQLdb.connect(db=DBNAME)
            except DB_EXC.OperationalError:
                return None

    elif db == 'gadfly':
        try:
            from gadfly import gadfly
            DB_EXC = gadfly
        except ImportError:
            return None

        try:
            cxn = gadfly(DBNAME, dbDir)
        except IOError:
            # no existing Gadfly database: start a new one in dbDir
            cxn = gadfly()
            if not os.path.isdir(dbDir):
                os.mkdir(dbDir)
            cxn.startup(DBNAME, dbDir)
    else:
        return None
    return cxn

def create(cur):
    try:
        cur.execute('''
            CREATE TABLE users (
                login VARCHAR(%d),
                userid INTEGER,
                projid INTEGER)
        ''' % NAMELEN)
    except DB_EXC.OperationalError:
        # most likely the table already exists: drop it and try again
        drop(cur)
        create(cur)

drop = lambda cur: cur.execute('DROP TABLE users')

NAMES = (
    ('aaron', 8312), ('angela', 7603), ('dave', 7306),
    ('davina',7902), ('elliot', 7911), ('ernie', 7410),
    ('jess', 7912), ('jim', 7512), ('larry', 7311),
    ('leslie', 7808), ('melissa', 8602), ('pat', 7711),
    ('serena', 7003), ('stan', 7607), ('faye', 6812),
    ('amy', 7209), ('mona', 7404), ('jennifer', 7608),
)

def randName():
    pick = set(NAMES)
    while pick:
        yield pick.pop()

def insert(cur, db):
    if db == 'sqlite':
        cur.executemany("INSERT INTO users VALUES(?, ?, ?)",
            [(who, uid, rand(1,5)) for who, uid in randName()])
    elif db == 'gadfly':
        # no executemany() here, so insert one row at a time
        for who, uid in randName():
            cur.execute("INSERT INTO users VALUES(?, ?, ?)",
                (who, uid, rand(1,5)))
    elif db == 'mysql':
        cur.executemany("INSERT INTO users VALUES(%s, %s, %s)",
            [(who, uid, rand(1,5)) for who, uid in randName()])

# -1 if the cursor has no rowcount attribute
getRC = lambda cur: cur.rowcount if hasattr(cur, 'rowcount') else -1

def update(cur):
    fr = rand(1,5)
    to = rand(1,5)
    cur.execute(
        "UPDATE users SET projid=%d WHERE projid=%d" % (to, fr))
    return fr, to, getRC(cur)

def delete(cur):
    rm = rand(1,5)
    cur.execute('DELETE FROM users WHERE projid=%d' % rm)
    return rm, getRC(cur)

def dbDump(cur):
    cur.execute('SELECT * FROM users')
    print '\n%s' % ''.join(map(cformat, FIELDS))
    for data in cur.fetchall():
        print ''.join(map(tformat, data))

def main():
    db = setup()
    print '*** Connect to %r database' % db
    cxn = connect(db)
    if not cxn:
        print 'ERROR: %r not supported or unreachable, exiting' % db
        return
    cur = cxn.cursor()

    print '\n*** Create users table (drop old one if appl.)'
    create(cur)

    print '\n*** Insert names into table'
    insert(cur, db)
    dbDump(cur)

    print '\n*** Move users to a random group'
    fr, to, num = update(cur)
    print '\t(%d users moved) from (%d) to (%d)' % (num, fr, to)
    dbDump(cur)

    print '\n*** Randomly delete group'
    rm, num = delete(cur)
    print '\t(group #%d; %d users removed)' % (rm, num)
    dbDump(cur)

    print '\n*** Drop users table'
    drop(cur)
    print '\n*** Close cxns'
    cur.close()
    cxn.commit()
    cxn.close()

if __name__ == '__main__':
    main()
Trust me, this application runs. It’s available for download from this
book’s Web site if you really want to try it out. However, before we execute
it here in the book, there’s one more matter to take care of. No, we’re not
going to give you the line-by-line explanation yet.
Don’t worry, the line-by-line is coming up, but we wanted to use this
example for another purpose: to demonstrate another example of porting
to Python 3 and how it’s possible to build scripts that will run under both
Python 2 and 3 with a single source .py file and without the need for conversion using tools like 2to3 or 3to2. After the port, we’ll officially make it
Example 6-1. Furthermore, we’ll use and reuse the attributes from this
example in the examples for the remainder of the chapter, porting it to use
ORMs as well as non-relational databases.
Porting to Python 3
A handful of porting recommendations are provided in the best practices
chapter of Core Python Language Fundamentals, but we wanted to share
some specific tips here and implement them by using ushuffle_db.py.
One of the big porting differences between Python 2 and 3 is print, which
is a statement in Python 2 but a built-in function (BIF) in Python 3. Instead
of using either, you can proxy for both by using the distutils.log.warn()
function—at least you could at the time of this writing. It’s identical in
Python 2 and 3; thus, it doesn’t require any changes. To keep the code from
getting confusing, we rename this function to printf() in our application,
in homage to the print/print()-equivalent in C/C++. Also see the related
exercise at the end of this chapter.
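Note that distutils was deprecated in Python 3.10 and removed from the standard library in Python 3.12, so importing distutils.log.warn() can fail on newer interpreters. One possible fallback is sketched below. The replacement function is our own assumption, not part of the original script, and it writes to standard error on the assumption that this roughly matches where distutils sends warning-level messages.

```python
import sys

try:
    from distutils.log import warn as printf
except ImportError:
    def printf(msg):
        """Hypothetical stand-in: write msg plus a newline to stderr."""
        sys.stderr.write('%s\n' % msg)

# Either way, printf() is called with one preformatted string.
printf('*** Connect to %r database' % 'sqlite')
```

Keeping the name printf means the rest of the script does not need to know which branch supplied it.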
The second tip is for the Python 2 BIF raw_input(). It changes its name
to input() in Python 3. This is further complicated by the fact that Python 2
also has an input() function, which is a security hazard and was removed
from the language. In other words, raw_input() replaces input() and takes over
its name in Python 3. To continue honoring C/C++, we call this function
scanf() in our application.
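A common, shorter way to pick the right input function is to try the old name and fall back when Python reports that it does not exist. This is a sketch of an alternative idiom, not the approach our application takes:

```python
try:
    scanf = raw_input       # Python 2: the safe, string-returning BIF
except NameError:
    scanf = input           # Python 3: raw_input() was renamed to input()
```

After this runs, scanf(prompt) returns the user's response as a string under either major version.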
The next tip is to remind you of the changes in the syntax for handling
exceptions. This subject is covered in detail in the Errors and Exceptions
chapter of Core Python Language Fundamentals and Core Python Programming. You can read more about the update there, but for now, the fundamental change that you need to know about is this:
Old: except Exception, instance
New: except Exception as instance
However, this only matters if you save the instance because you’re interested in the cause of the exception. If it doesn’t matter or you’re not intending to use it, just leave it out. There’s nothing wrong with just: except
Exception.
That syntax does not change between Python 2 and 3. In earlier editions
of this book, we used except Exception, e. For this edition, we’ve removed
the “, e” altogether rather than changing it to “as e” to make porting easier.
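If you do need the exception instance and must also support interpreters older than Python 2.6 (where the as form is unavailable), one option is to fetch it from sys.exc_info(), which reads the same in Python 2 and 3. A minimal sketch follows; the function name parse_uid is hypothetical.

```python
import sys

def parse_uid(text):
    """Return (number, None) on success or (None, error message) on failure."""
    try:
        return int(text), None
    except ValueError:
        err = sys.exc_info()[1]     # the exception instance being handled
        return None, str(err)

ok = parse_uid('7000')
bad = parse_uid('seven')
```

The except clause itself names no variable, so the same line parses under both syntaxes.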
Finally, the last change we’re going to make is tied specifically to our example, whereas those other changes are general porting suggestions. At the
time of this writing, the main C-based MySQL-Python adapter, better
known by its package name, MySQLdb, has not yet been ported to Python 3.
However, there is another MySQL adapter, MySQL Connector/Python, whose package name is mysql.connector.
MySQL Connector/Python implements the MySQL client protocol in
pure Python, so neither MySQL libraries nor compilation are necessary,
and best of all, there is a port to Python 3. Why is this a big deal? It gives
Python 3 users access to MySQL databases, that’s all!
Making all of these changes and additions to ushuffle_db.py, we arrive
at what I’d like to refer to as the “universal” version of the application,
ushuffle_dbU.py, which you can see in Example 6-1.
Example 6-1
Database Adapter Example (ushuffle_dbU.py)
This script performs some basic operations by using a variety of databases
(MySQL, SQLite, Gadfly). It runs under Python 2 and 3 without any code
changes, and components will be (re)used in future sections of this chapter.
1   #!/usr/bin/env python
2
3   from distutils.log import warn as printf
4   import os
5   from random import randrange as rand
6
7   if isinstance(__builtins__, dict) and 'raw_input' in __builtins__:
8       scanf = raw_input
9   elif hasattr(__builtins__, 'raw_input'):
10      scanf = raw_input
11  else:
12      scanf = input
13
14  COLSIZ = 10
15  FIELDS = ('login', 'userid', 'projid')
16  RDBMSs = {'s': 'sqlite', 'm': 'mysql', 'g': 'gadfly'}
17  DBNAME = 'test'
18  DBUSER = 'root'
19  DB_EXC = None
20  NAMELEN = 16
21
22  tformat = lambda s: str(s).title().ljust(COLSIZ)
23  cformat = lambda s: s.upper().ljust(COLSIZ)
24
25  def setup():
26      return RDBMSs[scanf('''
27  Choose a database system:
28
29  (M)ySQL
30  (G)adfly
31  (S)QLite
32
33  Enter choice: ''').strip().lower()[0]]
34
35  def connect(db, DBNAME):
36      global DB_EXC
37      dbDir = '%s_%s' % (db, DBNAME)
38
39      if db == 'sqlite':
40          try:
41              import sqlite3
42          except ImportError:
43              try:
44                  from pysqlite2 import dbapi2 as sqlite3
45              except ImportError:
46                  return None
47
48          DB_EXC = sqlite3
49          if not os.path.isdir(dbDir):
50              os.mkdir(dbDir)
51          cxn = sqlite3.connect(os.path.join(dbDir, DBNAME))
52
53      elif db == 'mysql':
54          try:
55              import MySQLdb
56              import _mysql_exceptions as DB_EXC
57
58              try:
59                  cxn = MySQLdb.connect(db=DBNAME)
60              except DB_EXC.OperationalError:
61                  try:
62                      cxn = MySQLdb.connect(user=DBUSER)
63                      cxn.query('CREATE DATABASE %s' % DBNAME)
64                      cxn.commit()
65                      cxn.close()
66                      cxn = MySQLdb.connect(db=DBNAME)
67                  except DB_EXC.OperationalError:
68                      return None
69          except ImportError:
70              try:
71                  import mysql.connector
72                  import mysql.connector.errors as DB_EXC
73                  try:
74                      cxn = mysql.connector.Connect(**{
75                          'database': DBNAME,
76                          'user': DBUSER,
77                      })
78                  except DB_EXC.InterfaceError:
79                      return None
80              except ImportError:
81                  return None
82
83      elif db == 'gadfly':
84          try:
85              from gadfly import gadfly
86              DB_EXC = gadfly
87          except ImportError:
88              return None
89
90          try:
91              cxn = gadfly(DBNAME, dbDir)
92          except IOError:
93              cxn = gadfly()
94              if not os.path.isdir(dbDir):
95                  os.mkdir(dbDir)
96              cxn.startup(DBNAME, dbDir)
97      else:
98          return None
99      return cxn
100
101 def create(cur):
102     try:
103         cur.execute('''
104             CREATE TABLE users (
105                 login VARCHAR(%d),
106                 userid INTEGER,
107                 projid INTEGER)
108         ''' % NAMELEN)
109     except DB_EXC.OperationalError:
110         drop(cur)
111         create(cur)
112
113 drop = lambda cur: cur.execute('DROP TABLE users')
114
115 NAMES = (
116     ('aaron', 8312), ('angela', 7603), ('dave', 7306),
117     ('davina',7902), ('elliot', 7911), ('ernie', 7410),
118     ('jess', 7912), ('jim', 7512), ('larry', 7311),
119     ('leslie', 7808), ('melissa', 8602), ('pat', 7711),
120     ('serena', 7003), ('stan', 7607), ('faye', 6812),
121     ('amy', 7209), ('mona', 7404), ('jennifer', 7608),
122 )
123
124 def randName():
125     pick = set(NAMES)
126     while pick:
127         yield pick.pop()
128
129 def insert(cur, db):
130     if db == 'sqlite':
131         cur.executemany("INSERT INTO users VALUES(?, ?, ?)",
132             [(who, uid, rand(1,5)) for who, uid in randName()])
133     elif db == 'gadfly':
134         for who, uid in randName():
135             cur.execute("INSERT INTO users VALUES(?, ?, ?)",
136                 (who, uid, rand(1,5)))
137     elif db == 'mysql':
138         cur.executemany("INSERT INTO users VALUES(%s, %s, %s)",
139             [(who, uid, rand(1,5)) for who, uid in randName()])
140
141 getRC = lambda cur: cur.rowcount if hasattr(cur, 'rowcount') else -1
142
143 def update(cur):
144     fr = rand(1,5)
145     to = rand(1,5)
146     cur.execute(
147         "UPDATE users SET projid=%d WHERE projid=%d" % (to, fr))
148     return fr, to, getRC(cur)
149
150 def delete(cur):
151     rm = rand(1,5)
152     cur.execute('DELETE FROM users WHERE projid=%d' % rm)
153     return rm, getRC(cur)
154
155 def dbDump(cur):
156     cur.execute('SELECT * FROM users')
157     printf('\n%s' % ''.join(map(cformat, FIELDS)))
158     for data in cur.fetchall():
159         printf(''.join(map(tformat, data)))
160
161 def main():
162     db = setup()
163     printf('*** Connect to %r database' % db)
164     cxn = connect(db, DBNAME)
165     if not cxn:
166         printf('ERROR: %r not supported or unreachable, exit' % db)
167         return
168     cur = cxn.cursor()
169
170     printf('\n*** Creating users table')
171     create(cur)
172
173     printf('\n*** Inserting names into table')
174     insert(cur, db)
175     dbDump(cur)
176
177     printf('\n*** Randomly moving folks')
178     fr, to, num = update(cur)
179     printf('\t(%d users moved) from (%d) to (%d)' % (num, fr, to))
180     dbDump(cur)
181
182     printf('\n*** Randomly choosing group')
183     rm, num = delete(cur)
184     printf('\t(group #%d; %d users removed)' % (rm, num))
185     dbDump(cur)
186
187     printf('\n*** Dropping users table')
188     drop(cur)
189     printf('\n*** Close cxns')
190     cur.close()
191     cxn.commit()
192     cxn.close()
193
194 if __name__ == '__main__':
195     main()
Line-by-Line Explanation
Lines 1–32
The first part of this script imports the necessary modules, creates some
global constants (the column size for display and the set of databases we are
supporting), and features the tformat(), cformat(), and setup() functions.
After the import statements, you’ll find some curious code (lines 7–12)
that finds the right function to alias as scanf(), our designated
command-line user input function. The elif and else are simpler to
explain: we’re checking to see if raw_input() exists as a BIF. If it does,
we’re in Python (1 or) 2 and should use that. Otherwise, we’re in Python 3
and should use its new name, input().
The other bit of complexity is the if statement. __builtins__ is a
module only when your script runs as the main program. In an imported module, __builtins__ is a
dict. The conditional basically says that if we were imported, check whether
‘raw_input’ is a name in this dictionary; otherwise, it’s a module, so drop
down to the elif and else. Hope that makes sense!
With regard to the tformat() and cformat() functions, the former
formats the data values for display; its name stands for “titlecase formatter.” It’s just a cheap way to take names from the database,
which can be all lowercase (such as what we have), first letter capped correctly, all CAPS, etc., and make all the names uniform. The latter function’s
name stands for “CAPS formatter.” All it does is take each column name
and turn it into a header by calling the str.upper() method.
Both formatters left-justify their output and pad it to ten characters in
width (str.ljust() pads but does not truncate) because it’s not expected the data will exceed that; our sample data
certainly doesn’t, so if you want to use your own, change COLSIZ to whatever works for your data. It was simpler to write these as lambdas rather
than traditional functions, although you can certainly do that as well.
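To make the padding behavior concrete, here is a small sketch that applies the two formatters to a header and to one row. The sample values, including the overlong name, are made up for illustration.

```python
COLSIZ = 10
FIELDS = ('login', 'userid', 'projid')

tformat = lambda s: str(s).title().ljust(COLSIZ)
cformat = lambda s: s.upper().ljust(COLSIZ)

# One header line and one data line, each built from three padded columns.
header = ''.join(map(cformat, FIELDS))
row = ''.join(map(tformat, ('john', 7000, 3)))

# ljust() only pads: a value longer than COLSIZ is returned at full length.
wide = tformat('bartholomew')
```

Because nothing is cut off, a value wider than COLSIZ pushes the remaining columns of its row to the right, which is why COLSIZ should be at least as large as your longest value.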
One can argue that this is a lot of effort when all
scanf() will do is prompt the user in setup() to select the RDBMS to use
for any particular execution of this script (or derivatives in the remainder
of the chapter). However, the point is to show you some code that you
might be able to use elsewhere. We haven’t claimed that this is a script
you’d use in production, have we?
We already have the user output function—as mentioned earlier, we’re
using distutils.log.warn() in place of print for Python 2 and print() for
Python 3. In our application, we import it (line 3) as printf().
Most of the constants are fairly self-explanatory. One exception is
DB_EXC, which stands for DataBase EXCeption. This variable will eventually
be assigned the database exception module for the specific database system that users choose to run this application with. In other
words, for users who choose MySQL, DB_EXC will be _mysql_exceptions,
etc. If we built this application in a more object-oriented way, we would
have a class in which this would simply be an instance attribute, such as
self.db_exc_module.
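As a rough idea of what that object-oriented variant could look like, here is a minimal sketch. The Database class and its create() method are hypothetical names of ours, and sqlite3 stands in for whichever adapter module the user selects.

```python
import sqlite3

class Database(object):
    """Hypothetical wrapper that remembers which module defines its exceptions."""

    def __init__(self, module, *args, **kwargs):
        self.db_exc_module = module           # replaces the global DB_EXC
        self.cxn = module.connect(*args, **kwargs)

    def create(self, sql):
        """Run a CREATE statement; return False if the adapter reports an error."""
        cur = self.cxn.cursor()
        try:
            cur.execute(sql)
            return True
        except self.db_exc_module.OperationalError:
            return False
        finally:
            cur.close()

db = Database(sqlite3, ':memory:')
created = db.create('CREATE TABLE users(login VARCHAR(16))')
again = db.create('CREATE TABLE users(login VARCHAR(16))')   # already exists
```

Storing the module on the instance removes the need for the global statement and lets two Database objects use different adapters in the same program.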
Lines 35–99
The guts of consistent database access happen here in the connect()
function. At the beginning of each section (“section” here refers to each
database’s if clause), we attempt to load the corresponding database modules. If a suitable one is not found, None is returned to indicate that the
database system is not supported.
Once a connection is made, all of the other code is database and adapter
independent and should work across all connections. (The only exception
in our script is insert().) In all three subsections of this set of code, you
will notice that a valid connection is passed back as cxn.
If SQLite is chosen, we attempt to load a database adapter. We first try
to load the standard library’s sqlite3 module (Python 2.5+). If that fails,
we look for the third-party pysqlite2 package. This is to support version
2.4.x and older systems with the pysqlite adapter installed. If either is
found, we then check to ensure that the directory exists, because the database is file based. (You can also choose to create an in-memory database by
substituting :memory: as the filename.) When the connect() call is made to
SQLite, it will either use one that already exists or make a new one using
that path if one does not exist.
MySQL uses a default area for its database files and does not require
this to come from the user. The most popular MySQL adapter is the
MySQLdb package, so we try to import this first. Like SQLite, there is a “plan
B,” the mysql.connector package—a good choice because it’s compatible
with both Python 2 and 3. If neither is found, MySQL isn’t supported and
None is returned.
The last database supported by our application is Gadfly. (At the time of
this writing, this database is mostly, but not fully, DB-API-compliant, and
you will see this in this application.) It uses a startup mechanism similar to
that of SQLite: it starts up with the directory where the database files
should be. If it is there, fine, but if not, you have to take a roundabout way
to start up a new database. (Why this is, we are not sure. We believe that
the startup() functionality should be merged into that of the constructor gadfly.gadfly().)
Lines 101–113
The create() function creates a new users table in our database. If there is
an error, it is almost always because the table already exists. If this is the
case, drop the table and re-create it by recursively calling this function
again. This code is dangerous in that if the re-creation of the table still fails,
the function keeps calling itself until Python reaches its recursion limit and raises an error.
You will fix this problem in one of the exercises at the end of the chapter.
The table is dropped from the database with the one-liner drop(), written as a lambda.
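One way to avoid the unbounded recursion is to drop the table and retry exactly once, letting a second failure propagate to the caller. This is only a sketch of one possible approach, shown against sqlite3 so that it runs as-is; the names create_once and CREATE_SQL are hypothetical.

```python
import sqlite3

DB_EXC = sqlite3
NAMELEN = 16
CREATE_SQL = '''
    CREATE TABLE users (
        login VARCHAR(%d),
        userid INTEGER,
        projid INTEGER)
''' % NAMELEN

def create_once(cur):
    """Create users; on failure, drop the table and try one more time."""
    try:
        cur.execute(CREATE_SQL)
    except DB_EXC.OperationalError:
        cur.execute('DROP TABLE users')
        cur.execute(CREATE_SQL)     # a second failure is raised to the caller

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
create_once(cur)                    # table does not exist yet
create_once(cur)                    # table exists: dropped and re-created
cur.execute('SELECT COUNT(*) FROM users')
count = cur.fetchone()[0]
cur.close()
cxn.close()
```

If the first failure had some other cause, the DROP or the second CREATE raises and the caller sees the real error instead of a deep stack of retries.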
Lines 115–127
The next blocks of code feature a constant set of NAMES and user IDs, followed by the generator randName(). NAMES is a tuple that must be converted
to a set for use in randName() because we alter it in the generator, removing one name at a time until the names are exhausted. Because this is
destructive behavior and is used often in the application, it’s best to set
NAMES as the canonical source and just copy its contents to another data
structure to be destroyed each time the generator is used.
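The copy-then-consume pattern can be checked in isolation. The sketch below uses a shortened NAMES tuple (an assumption, to keep the example small) and runs the generator twice to show that the canonical data survives each pass.

```python
NAMES = (('aaron', 8312), ('angela', 7603), ('dave', 7306))

def randName():
    pick = set(NAMES)       # fresh copy; NAMES itself is never modified
    while pick:
        yield pick.pop()    # removes and returns an arbitrary element

first_pass = list(randName())
second_pass = list(randName())
```

Note that set.pop() returns elements in an arbitrary order rather than a truly random one, which is adequate for shuffling sample data like this.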
Lines 129–139
The insert() function is the only other place where database-dependent
code lives. This is because each database is slightly different in one way or
another. For example, the adapters for SQLite and MySQL are both DB-API-compliant, so their cursor objects have an executemany() method, whereas Gadfly’s does not, so rows must be inserted one at a time.
Another quirk is that both SQLite and Gadfly use the qmark parameter
style, whereas MySQL uses format. Because of this, the format strings are
different. If you look carefully, however, you will see that the arguments
themselves are created in a very similar fashion.
What the code does is this: for each name-userID pair, it assigns that
individual to a project group (given by its project ID or projid). The project ID is chosen randomly out of four different groups (randrange(1,5)).
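DB-API adapter modules advertise their parameter style in a module-level paramstyle attribute, so the placeholder can be chosen at run time instead of being hardcoded per database. The following is a minimal sketch; the PLACEHOLDER table and the insert_sql() helper are hypothetical, and only the two styles discussed here are handled.

```python
import sqlite3
from random import randrange as rand

PLACEHOLDER = {'qmark': '?', 'format': '%s'}    # styles covered in the text

def insert_sql(module, ncols):
    """Build an INSERT for users using the adapter module's paramstyle."""
    mark = PLACEHOLDER[module.paramstyle]
    return 'INSERT INTO users VALUES(%s)' % ', '.join([mark] * ncols)

sql = insert_sql(sqlite3, 3)

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute(
    'CREATE TABLE users(login VARCHAR(16), userid INTEGER, projid INTEGER)')
# rand(1, 5) returns 1, 2, 3, or 4; the upper bound is excluded.
cur.executemany(sql, [(who, uid, rand(1, 5))
                      for who, uid in (('jess', 7912), ('jim', 7512))])
cur.execute('SELECT projid FROM users')
projids = [row[0] for row in cur.fetchall()]
cur.close()
cxn.close()
```

With MySQLdb, whose paramstyle is format, the same helper would produce the %s form used in insert().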
Line 141
This single line represents a conditional expression (read as: Python ternary operator) that returns the rowcount of the last operation (in terms of
rows altered), or if the cursor object does not support this attribute (meaning it is not DB-API–compliant), it returns –1.
Conditional expressions were added in Python 2.5, so if you are using
version 2.4.x or older, you will need to convert it back to the “old-style”
way of doing it:
getRC = lambda cur: (hasattr(cur, 'rowcount') \
    and [cur.rowcount] or [-1])[0]
If you are confused by this line of code, don’t worry about it. Check the
FAQ to see why this is, and get a taste of why conditional expressions
were finally added to Python in version 2.5. If you are able to figure it out,
then you have developed a solid understanding of Python objects and
their Boolean values.
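The reason for the list wrapping shows up when rowcount is zero. The sketch below compares the naive and/or form, the old-style form, and the conditional expression; FakeCursor is a hypothetical stand-in for a cursor whose last operation touched no rows.

```python
class FakeCursor(object):
    """Hypothetical cursor whose last operation touched zero rows."""
    rowcount = 0

cur = FakeCursor()

# Naive and/or: 0 is false, so the expression falls through to -1 (wrong).
naive = hasattr(cur, 'rowcount') and cur.rowcount or -1

# A one-element list is always true, even when it holds 0.
old_style = (hasattr(cur, 'rowcount') and [cur.rowcount] or [-1])[0]

# The conditional expression gives the same answer more directly.
modern = cur.rowcount if hasattr(cur, 'rowcount') else -1
```

Only the naive version confuses "no rows affected" with "rowcount not supported."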
Lines 143–153
The update() and delete() functions randomly choose folks from one
group. If the operation is update, move them from their current group to
another (also randomly chosen); if it is delete, remove them altogether.
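The effect of both operations on rowcount can be seen in a small, self-contained sketch. It uses sqlite3 with fixed rather than random group numbers so that the counts are predictable, and it binds the values as parameters instead of formatting them into the SQL string, which is a substitution on our part.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cur = cxn.cursor()
cur.execute(
    'CREATE TABLE users(login VARCHAR(16), userid INTEGER, projid INTEGER)')
cur.executemany('INSERT INTO users VALUES(?, ?, ?)',
                [('jess', 7912, 1), ('jim', 7512, 1), ('larry', 7311, 2)])

# Move everyone in group 1 to group 3.
cur.execute('UPDATE users SET projid=? WHERE projid=?', (3, 1))
moved = cur.rowcount

# Remove everyone in group 2.
cur.execute('DELETE FROM users WHERE projid=?', (2,))
removed = cur.rowcount

cur.execute('SELECT login, projid FROM users ORDER BY login')
left = [tuple(row) for row in cur.fetchall()]
cur.close()
cxn.close()
```

In the application itself, the from and to groups can be the same or can match no users at all, so a reported count of zero is a normal outcome.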
Lines 155–159
The dbDump() function pulls all rows from the database, formats them for
printing, and displays them to the user. The displayed output requires the
assistance of the cformat() (to display the column headers) and tformat()
(to format each user row).
First, the data is extracted after the SELECT by
the fetchall() method. Then, as we iterate over each user, we take the three columns (login, userid, projid) and pass them to tformat() via map() to convert them to strings (if they are not already), format them as titlecase, and
then left-justify each one in a field COLSIZ characters wide (right-hand space padding).
Lines 161–195
The director of this movie is main(). It makes individual calls to each function described above that defines how this script works (assuming that it
does not exit due to either not finding a database adapter or not being able
to obtain a connection [lines 164–166]). The bulk of it should be fairly self-explanatory, given the proximity of the output statements. The last bits
wrap up the cursor and connection.
6.3 ORMs
As seen in the previous section, a variety of different database systems are
available today, and most of them have Python interfaces with which you
can harness their power. The only drawback to those systems is the need
to know SQL. If you are a programmer who feels more comfortable with
manipulating Python objects instead of SQL queries, yet still want to use a
relational database as your data back-end, then you would probably prefer
to use ORMs.
6.3.1 Think Objects, Not SQL
Creators of these systems have abstracted away much of the pure SQL
layer and implemented objects in Python that you can manipulate to
accomplish the same tasks without having to generate the required lines
of SQL. Some systems allow for more flexibility if you do have to slip in a
few lines of SQL, but for the most part, you can avoid almost all the general SQL required.
Database tables are magically converted to Python classes with columns
and features as attributes, and methods responsible for database operations. Setting up your application to use an ORM is somewhat similar to setting up
a standard database adapter. Because of the amount of work that ORMs
perform on your behalf, some things are actually more complex or require
more lines of code than using an adapter directly. Hopefully, the gains you
achieve in productivity make up for a little bit of extra work.
6.3.2 Python and ORMs
The most well-known Python ORMs today are SQLAlchemy (http://sqlalchemy.org) and SQLObject (http://sqlobject.org). We will give you examples of both because the systems are somewhat disparate due to different
philosophies, but once you figure these out, moving on to other ORMs is
much simpler.
Some other Python ORMs include Storm, PyDO/PyDO2, PDO, Dejavu,
Durus, QLime, and ForgetSQL. Larger Web-based systems can also
have their own ORM component, such as WebWare MiddleKit and
Django’s Database API. Be advised that “well-known” does not mean best
for your application. Although these others were not included in our discussion, that does not mean that they would not be right for your application.
Setup and Installation
Because neither SQLAlchemy nor SQLObject is in the standard library,
you’ll need to download and install them on your own. (Usually this is
easily taken care of with the easy_install or pip tools.)
At the time of this writing, all of the software packages described in this
chapter are available in Python 2; only SQLAlchemy, SQLite, and the
MySQL Connector/Python adapter are available in Python 3. The sqlite3
package is part of the standard library for Python 2.5+ and 3.x, so you
don’t need to do anything unless you’re using version 2.4 or older.
If you’re starting on a computer with only Python 3 installed, you’ll
need to get Distribute (which includes easy_install) first. You’ll need a
Web browser (or the curl command if you have it) and to download the
installation file (available at http://python-distribute.org/distribute_setup.py),
and then get SQLAlchemy with easy_install. Here is what this entire
process might look like on a Windows-based PC:
C:\WINDOWS\Temp>C:\Python32\python distribute_setup.py
Extracting in c:\docume~1\wesley\locals~1\temp\tmp8mcddr
Now working in c:\docume~1\wesley\locals~1\temp\tmp8mcddr\distribute0.6.21
Installing Distribute
warning: no files found matching 'Makefile' under directory 'docs'
warning: no files found matching 'indexsidebar.html' under directory
'docs'
creating build
creating build\src
:
Installing easy_install-3.2.exe script to C:\python32\Scripts
Installed c:\python32\lib\site-packages\distribute-0.6.21-py3.2.egg
Processing dependencies for distribute==0.6.21
Finished processing dependencies for distribute==0.6.21
After install bootstrap.
Creating C:\python32\Lib\site-packages\setuptools-0.6c11-py3.2.egg-info
Creating C:\python32\Lib\site-packages\setuptools.pth
C:\WINDOWS\Temp>
C:\WINDOWS\Temp>C:\Python32\Scripts\easy_install sqlalchemy
Searching for sqlalchemy
Reading http://pypi.python.org/simple/sqlalchemy/
Reading http://www.sqlalchemy.org
Best match: SQLAlchemy 0.7.2
Downloading http://pypi.python.org/packages/source/S/SQLAlchemy/
SQLAlchemy-0.7.2.tar.gz#md5=b84a26ae2e5de6f518d7069b29bf8f72
:
Adding sqlalchemy 0.7.2 to easy-install.pth file
Installed c:\python32\lib\site-packages\sqlalchemy-0.7.2-py3.2.egg
Processing dependencies for sqlalchemy
Finished processing dependencies for sqlalchemy
6.3.3
Employee Role Database Example
We will port our user shuffle application ushuffle_db.py to both SQLAlchemy and SQLObject. MySQL will be the back-end database server for
both. You will note that we implement these as classes because there is
more of an object feel to using ORMs, as opposed to using raw SQL in a
database adapter. Both examples import the set of NAMES and the random
name chooser from ushuffle_db.py. This is to avoid copying and pasting
the same code everywhere as code reuse is a good thing.
6.3.4
SQLAlchemy
We start with SQLAlchemy because its interface is somewhat closer to
SQL than SQLObject’s. SQLObject is simpler, more Pythonic, and faster,
whereas SQLAlchemy abstracts really well to the object world and also
gives you more flexibility in issuing raw SQL, if you have to.
Examples 6-2 and 6-3 illustrate that the ports of our user shuffle examples using both these ORMs are very similar in terms of setup, access, and
overall number of lines of code. Both also borrow the same set of functions and constants from ushuffle_db{,U}.py.
Example 6-2
SQLAlchemy ORM Example (ushuffle_sad.py)
This user shuffle Python 2.x and 3.x-compatible application features the
SQLAlchemy ORM paired up with MySQL or SQLite databases as back-ends.
1    #!/usr/bin/env python
2
3    from distutils.log import warn as printf
4    from os.path import dirname
5    from random import randrange as rand
6    from sqlalchemy import Column, Integer, String, create_engine, exc, orm
7    from sqlalchemy.ext.declarative import declarative_base
8    from ushuffle_dbU import DBNAME, NAMELEN, randName, FIELDS, tformat, cformat, setup
9
10   DSNs = {
11       'mysql': 'mysql://[email protected]/%s' % DBNAME,
12       'sqlite': 'sqlite:///:memory:',
13   }
14
15   Base = declarative_base()
16   class Users(Base):
17       __tablename__ = 'users'
18       login  = Column(String(NAMELEN))
19       userid = Column(Integer, primary_key=True)
20       projid = Column(Integer)
21       def __str__(self):
22           return ''.join(map(tformat,
23               (self.login, self.userid, self.projid)))
24
25   class SQLAlchemyTest(object):
26       def __init__(self, dsn):
27           try:
28               eng = create_engine(dsn)
29           except ImportError:
30               raise RuntimeError()
31
32           try:
33               eng.connect()
34           except exc.OperationalError:
35               eng = create_engine(dirname(dsn))
36               eng.execute('CREATE DATABASE %s' % DBNAME).close()
37               eng = create_engine(dsn)
38
39           Session = orm.sessionmaker(bind=eng)
40           self.ses = Session()
41           self.users = Users.__table__
42           self.eng = self.users.metadata.bind = eng
43
44       def insert(self):
45           self.ses.add_all(
46               Users(login=who, userid=userid, projid=rand(1,5)) \
47               for who, userid in randName()
48           )
49           self.ses.commit()
50
51       def update(self):
52           fr = rand(1,5)
53           to = rand(1,5)
54           i = -1
55           users = self.ses.query(
56               Users).filter_by(projid=fr).all()
57           for i, user in enumerate(users):
58               user.projid = to
59           self.ses.commit()
60           return fr, to, i+1
61
62       def delete(self):
63           rm = rand(1,5)
64           i = -1
65           users = self.ses.query(
66               Users).filter_by(projid=rm).all()
67           for i, user in enumerate(users):
68               self.ses.delete(user)
69           self.ses.commit()
70           return rm, i+1
71
72       def dbDump(self):
73           printf('\n%s' % ''.join(map(cformat, FIELDS)))
74           users = self.ses.query(Users).all()
75           for user in users:
76               printf(user)
77           self.ses.commit()
78
79       def __getattr__(self, attr):    # use for drop/create
80           return getattr(self.users, attr)
81
82       def finish(self):
83           self.ses.connection().close()
84
85   def main():
86       printf('*** Connect to %r database' % DBNAME)
87       db = setup()
88       if db not in DSNs:
89           printf('\nERROR: %r not supported, exit' % db)
90           return
91
92       try:
93           orm = SQLAlchemyTest(DSNs[db])
94       except RuntimeError:
95           printf('\nERROR: %r not supported, exit' % db)
96           return
97
98       printf('\n*** Create users table (drop old one if appl.)')
99       orm.drop(checkfirst=True)
100      orm.create()
101
102      printf('\n*** Insert names into table')
103      orm.insert()
104      orm.dbDump()
105
106      printf('\n*** Move users to a random group')
107      fr, to, num = orm.update()
108      printf('\t(%d users moved) from (%d) to (%d)' % (num, fr, to))
109      orm.dbDump()
110
111      printf('\n*** Randomly delete group')
112      rm, num = orm.delete()
113      printf('\t(group #%d; %d users removed)' % (rm, num))
114      orm.dbDump()
115
116      printf('\n*** Drop users table')
117      orm.drop()
118      printf('\n*** Close cxns')
119      orm.finish()
120
121  if __name__ == '__main__':
122      main()
Line-by-Line Explanation
Lines 1–13
As expected, we begin with module imports and constants. We follow the
suggested style guideline of importing Python Standard Library modules
first (distutils, os.path, random), followed by third-party or external
modules (sqlalchemy), and finally, local modules to our application
(ushuffle_dbU), which in our case is providing the majority of the constants and utility functions.
The other constant contains the Database Source Names (DSNs), which
you can think of as database connection URIs. In previous editions of this
book, this application only supported MySQL, so we’re happy to be able
to add SQLite to the mix. In the ushuffle_dbU.py application seen earlier,
we used the file system with SQLite. Here we’ll use the in-memory version
(line 12).
CORE NOTE: Active Record pattern
Active Record is a software design pattern (http://en.wikipedia.org/wiki/Active_
record_pattern) that ties manipulation of objects to equivalent actions on a database. ORM objects essentially represent database rows such that when an object is
created, a row representing its data is written to the database automatically. When
an object is updated, so is the corresponding row. Similarly, when an object is
removed, its row in the database is deleted.
In the beginning, SQLAlchemy didn’t have an Active Record flavored declarative layer to make working with the ORM less complex. Instead, it followed the
“Data Mapper” pattern in which objects do not have the ability to modify the
database itself; rather, they come with actions that the user can call upon to
make those changes happen. Yes, an ORM can substitute for having to issue
raw SQL, but developers are still responsible for explicitly making the equivalent database operations to persist additions, updates, and deletions.
A desire for an Active Record-like interface spawned the creation of projects
like ActiveMapper and TurboEntity. Eventually, both were replaced by Elixir
(http://elixir.ematia.de), which became the most popular declarative layer for
SQLAlchemy. Some developers find it Rails-like in nature, whereas others find
it overly simplistic, abstracting away too much functionality.
However, SQLAlchemy eventually came up with its own declarative layer
which also adheres to the Active Record pattern. It’s fairly lightweight, simple,
and gets the job done, so we’ll use it in our example because it’s more beginner-friendly. However, if you do find it too lightweight, you can still use the
__table__ object for more traditional access.
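To make the Active Record idea in this Core Note concrete, here is a minimal sketch built on the standard library's sqlite3 module. The ActiveUser class, its method names, and the sample row are hypothetical and invented for illustration; this is not how SQLAlchemy or SQLObject is implemented, only a toy showing object actions mapped directly to row actions.

```python
import sqlite3

# One shared in-memory database for the sketch (hypothetical setup).
cxn = sqlite3.connect(':memory:')
cxn.execute('CREATE TABLE users '
            '(login TEXT, userid INTEGER PRIMARY KEY, projid INTEGER)')

class ActiveUser(object):
    """Toy Active Record: each object mirrors one row of the users table."""

    def __init__(self, login, userid, projid):
        # Creating the object writes its row immediately.
        cxn.execute('INSERT INTO users VALUES (?, ?, ?)',
                    (login, userid, projid))
        cxn.commit()
        # Bypass __setattr__ so initialization does not issue UPDATEs.
        self.__dict__.update(login=login, userid=userid, projid=projid)

    def __setattr__(self, name, value):
        # Updating an attribute updates the corresponding column.
        if name not in ('login', 'projid'):
            raise AttributeError('cannot set %r' % name)
        cxn.execute('UPDATE users SET %s=? WHERE userid=?' % name,
                    (value, self.userid))
        cxn.commit()
        self.__dict__[name] = value

    def destroy(self):
        # Removing the object deletes its row.
        cxn.execute('DELETE FROM users WHERE userid=?', (self.userid,))
        cxn.commit()

def rows():
    return cxn.execute('SELECT login, userid, projid FROM users').fetchall()

user = ActiveUser('faye', 6812, 2)
after_create = rows()     # the row exists without an explicit save step
user.projid = 4
after_update = rows()     # the row reflects the attribute change
user.destroy()
after_destroy = rows()    # the row is gone
```

Under the Data Mapper pattern, by contrast, none of these three actions would touch the database until the caller explicitly asked for the change to be persisted.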
Lines 15–23
The next code block represents the use of SQLAlchemy’s declarative layer.
Its use defines objects that, as manipulated, will result in the equivalent
database operation. As mentioned in the preceding Core Note, it might
not be as feature-rich as the third-party tools, but it suffices for our simple
example here.
To use it, you must import sqlalchemy.ext.declarative.declarative_base (line 7)
and use it to make a Base class (line 15) from which you derive your data
subclasses (line 16).
The next part of the class definition contains the __tablename__ attribute, which is the database table name to which it is mapped. Alternatively, you can define a lower-level sqlalchemy.Table object explicitly, in
which case you would alias to __table__, instead. In this application, we’re
taking a hybrid approach, mostly using the objects for row access, but we’ve
saved off the table (line 41) for table-level actions (create and drop).
After that are the “column” attributes; check the docs for all allowed
data types. Finally, we have an __str__() method definition which returns
a human-readable string representation of a row of data. Because this output is customized (with the help of the tformat() function), we don’t recommend this in practice. If you wanted to reuse this code in another
application, that’s made more difficult because you might wish the output
to be formatted differently. More likely, you’ll subclass this one and modify the child class __str__() method, instead. SQLAlchemy does support
table inheritance.
Lines 25–42
The class initializer, like ushuffle_dbU.connect(), does everything it can
to ensure that there is a database available, and then saves a connection to
it. First, it attempts to use the DSN to create an engine to the database. An
engine is the main database manager. For debugging purposes, you might
wish to see the ORM-generated SQL. To do that, just set the echo parameter, e.g., create_engine('sqlite:///:memory:', echo=True).
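If you have never watched the SQL that flows to a database, the following sketch shows the same idea one layer down, using only the standard library. It is an analogy, not SQLAlchemy's echo mechanism: sqlite3's set_trace_callback() (available in Python 3.3 and newer) hands each executed statement to a function of your choice. The table and row here are made up for illustration.

```python
import sqlite3

statements = []                      # collects every SQL statement executed

cxn = sqlite3.connect(':memory:')
cxn.set_trace_callback(statements.append)

cxn.execute('CREATE TABLE users (login TEXT, userid INTEGER)')
cxn.execute("INSERT INTO users VALUES ('faye', 6812)")
cxn.commit()

for stmt in statements:
    print(stmt)
```

Depending on the Python version, the log may also contain transaction-control statements (such as BEGIN and COMMIT) that the adapter issues on your behalf.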
Engine creation failure (lines 29–30) means that SQLAlchemy isn’t able
to support the chosen database, usually an ImportError, because it cannot
find an installed adapter. In this case, we fail back to the setup() function
to inform the user.
Assuming that an engine was successfully created, the next step is to try
a database connection. A failure usually means that the database itself (or
its server) is reachable, but in this case, the database you want to use to
store your data does not exist, so we attempt to create it here and retry the
connection (lines 34–37). Notice that we were sneaky in using
os.path.dirname() to strip off the database name and leave the rest of the DSN
intact so that the connection works (line 35).
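The dirname() trick works because a DSN uses forward slashes just like a POSIX path, so the database name looks like the final path component. Here is a small sketch; the user and host in the DSN are hypothetical placeholders, and posixpath is imported directly so that the behavior is the same on every platform.

```python
from posixpath import dirname   # what os.path.dirname is on POSIX systems

DBNAME = 'test'                                # name shown in the sample output
dsn = 'mysql://user@host/%s' % DBNAME          # hypothetical DSN
server_dsn = dirname(dsn)                      # drops the trailing '/test'

print(server_dsn)                              # mysql://user@host
```

This is a string manipulation only; it knows nothing about URIs, so treat it as a convenience for simple DSNs rather than a general-purpose parser.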
This is the only place you will see raw SQL (line 36) because this type of
activity is typically an operational task, not application-oriented. All other
database operations happen under the table (pun not originally intended)
via object manipulation or by calling a database table method via delegation (more on this a bit later in lines 44–70).
The last section of code (lines 39–42) creates a session object to manage
individual transaction-flavored objects involving one or more database
operations that all must be committed for the data to be written. We then
save the session object plus the user’s table and engine as instance attributes. The additional binding of the engine to the table’s metadata (line 42)
means to bind all operations on this table to the given engine. (You can
bind to other engines or connections.)
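The notion of grouping operations so that they are written together or not at all can be sketched without an ORM. The example below uses the standard library's sqlite3 adapter, not a SQLAlchemy session, and the table and rows are made up for illustration: the first group of inserts is rolled back and leaves no trace, while the second is committed.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cxn.execute('CREATE TABLE users (login TEXT, userid INTEGER, projid INTEGER)')
cxn.commit()

def count():
    return cxn.execute('SELECT COUNT(*) FROM users').fetchone()[0]

NEW_USERS = [('faye', 6812, 2), ('amy', 7209, 2)]

# Unit of work 1: two inserts that are abandoned together.
for user in NEW_USERS:
    cxn.execute('INSERT INTO users VALUES (?, ?, ?)', user)
cxn.rollback()
after_rollback = count()

# Unit of work 2: the same inserts, this time committed together.
for user in NEW_USERS:
    cxn.execute('INSERT INTO users VALUES (?, ?, ?)', user)
cxn.commit()
after_commit = count()

print(after_rollback, after_commit)
```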
Lines 44–70
These next three methods represent the core database functionality of row
insertion (lines 44–49), update (lines 51–60), and deletion (lines 62–70).
Insertion employs a session.add_all() method, which takes an iterable
and builds up a set of insert operations. At the end, you can decide
whether to issue a commit as we did (line 49) or a rollback.
Both update() and delete() feature a session query and use the
query.filter_by() method for lookup. Updating randomly chooses members from one project group (fr) and moves them to another project by
changing those IDs to another value (to). The counter (i) tracks the rowcount of how many users were affected. Deleting involves randomly
choosing a theoretical company project by ID (rm) that was cancelled, with
its employees laid off as a result. Both commit via the session object
once the operations are carried out.
Note that there are equivalent query object update() and delete()
methods that we aren’t using in our application. They reduce the amount
of code necessary as they operate in bulk and return the rowcount. Porting
ushuffle_sad.py to using these methods is an exercise at the end of the
chapter.
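The difference between counting rows yourself and letting the database report a rowcount can be seen with a plain adapter. This sketch uses the standard library's sqlite3 module with made-up rows; it mirrors the i = -1 counting idiom from the listing and contrasts it with a single bulk statement. It does not use the SQLAlchemy query methods mentioned above.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cxn.execute('CREATE TABLE users (login TEXT, userid INTEGER, projid INTEGER)')
cxn.executemany('INSERT INTO users VALUES (?, ?, ?)',
    [('faye', 6812, 2), ('amy', 7209, 2), ('dave', 7306, 3)])

# Row at a time: fetch the matches, change each one, and count as we go.
# Starting i at -1 makes i+1 come out as 0 when nothing matches.
i = -1
matches = [row[0] for row in
           cxn.execute('SELECT userid FROM users WHERE projid=?', (2,))]
for i, userid in enumerate(matches):
    cxn.execute('UPDATE users SET projid=? WHERE userid=?', (4, userid))
moved_one_by_one = i + 1

# Bulk: one statement, and the cursor reports how many rows it touched.
cur = cxn.execute('DELETE FROM users WHERE projid=?', (4,))
removed_in_bulk = cur.rowcount
cxn.commit()

# The same idiom with no matches at all.
i = -1
for i, userid in enumerate([]):
    pass
moved_none = i + 1
```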
Here are some of the more commonly-used query methods:
• filter_by() Extract values with specific column values as keyword parameters.
• filter() Similar to filter_by() but more flexible as you provide an expression, instead. For example: query.filter_by(userid=1) is the same as query.filter(Users.userid==1).
• order_by() Analogous to the SQL ORDER BY directive. The default is ascending. You’ll need to import sqlalchemy.desc() for descending sort.
• limit() Analogous to the SQL LIMIT directive.
• offset() Analogous to the SQL OFFSET directive.
• all() Return all objects that match the query.
• one() Return only one (the next) object that matches the query.
• first() Return the first object that matches the query.
• join() Create a SQL JOIN given the desired JOIN criteria.
• update() Bulk update rows.
• delete() Bulk delete rows.
Most of these methods result in another Query object and can thus be
chained together, for example, query.order_by(desc(Users.userid)).limit(5).offset(5).
If you wish to use LIMIT and OFFSET, the more Pythonic way is to take your
query object and apply a slice to it, for example, query.order_by(Users.userid)[10:20]
for the second group of ten users with the oldest user IDs.
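The arithmetic behind that slice is simple: a slice [start:stop] corresponds to OFFSET start and LIMIT stop - start. The sketch below checks this with the standard library's sqlite3 module and thirty made-up user IDs; it illustrates the correspondence only and says nothing about how SQLAlchemy translates slices internally.

```python
import sqlite3

cxn = sqlite3.connect(':memory:')
cxn.execute('CREATE TABLE users (userid INTEGER)')
cxn.executemany('INSERT INTO users VALUES (?)',
                [(n,) for n in range(7000, 7030)])   # hypothetical IDs

# Everything, sorted, then sliced in Python.
all_ids = [r[0] for r in
           cxn.execute('SELECT userid FROM users ORDER BY userid')]
sliced = all_ids[10:20]

# The same window requested from the database: [10:20] -> LIMIT 10 OFFSET 10.
start, stop = 10, 20
paged = [r[0] for r in cxn.execute(
    'SELECT userid FROM users ORDER BY userid LIMIT ? OFFSET ?',
    (stop - start, start))]

print(paged == sliced)
```

Letting the database do the windowing matters once tables grow, because only the requested rows are transferred.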
To see Query methods, read the documentation at
http://www.sqlalchemy.org/docs/orm/query.html#sqlalchemy.orm.query.Query. JOINs are a large
topic on their own, so there is additional and more specific information at
http://www.sqlalchemy.org/docs/orm/tutorial.html#ormtutorial-joins. You’ll
get a chance to play with some of these methods in the chapter exercises.
So far, we’ve only discussed querying, thus row-level operations. What
about table create and drop actions? Shouldn’t there be functions that look
like the following?
def drop(self):
self.users.drop()
Here we made a decision to use delegation again (as introduced in the
object-oriented programming chapter in Core Python Language Fundamentals
or Core Python Programming). Delegation is where attributes missing from an
instance are requested from another object held by our instance (self.users)
that has them; for example, wherever you see __getattr__(), self.users.create(),
self.users.drop(), etc. (lines 79–80, 99–100, 117), think delegation.
Lines 72–77
The responsibility of displaying proper output to the screen belongs to the
dbDump() method. It extracts the rows from the database and pretty-prints
the data just like its equivalent in ushuffle_dbU.py. In fact, they are
nearly identical.
Lines 79–83
We just discussed delegation, and using __getattr__() lets us deliberately
avoid creating drop() and create() methods because it would just respectively call the table’s drop() or create() methods, anyway. There is no
added functionality, so why create yet another function to have to maintain? We would like to remind you that __getattr__() is only called whenever an attribute lookup fails. (This is as opposed to __getattribute__(),
which is called, regardless.)
If we call orm.drop() and find no such method, getattr(orm, 'drop') is
invoked. When that happens, __getattr__() is called and delegates the
attribute name to self.users. The interpreter will find that self.users has
a drop attribute and pass that method call to it: self.users.drop().
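Delegation through __getattr__() can be tried out without a database. In this sketch, FakeTable and Wrapper are hypothetical stand-ins for the table object and the test class; the point is only that __getattr__() is consulted when normal lookup fails, so methods defined on the wrapper itself are never delegated.

```python
class FakeTable(object):
    """Hypothetical stand-in for an object that can create and drop itself."""
    def __init__(self):
        self.log = []

    def create(self):
        self.log.append('create')

    def drop(self, checkfirst=False):
        self.log.append('drop(checkfirst=%r)' % checkfirst)

class Wrapper(object):
    def __init__(self):
        self.users = FakeTable()

    def finish(self):
        # Found by normal lookup, so __getattr__() is never involved.
        return 'finished'

    def __getattr__(self, attr):
        # Reached only when normal attribute lookup fails.
        return getattr(self.users, attr)

w = Wrapper()
w.drop(checkfirst=True)      # delegated to w.users.drop(checkfirst=True)
w.create()                   # delegated to w.users.create()
result = w.finish()          # handled by Wrapper itself

try:
    w.missing                # neither object has it
    missing_raised = False
except AttributeError:
    missing_raised = True
```

One caution with this idiom: if self.users were never assigned, looking it up inside __getattr__() would call __getattr__() again and recurse, so set the delegate early in the initializer.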
The last method is finish(), which does the final cleanup of closing the
connection. Yes, we could have written this as a lambda but chose not to in
case cleaning up of cursors and connections, etc. requires more than a single
statement.
Lines 85–122
The main() function drives our application. It creates a SQLAlchemyTest
object and uses that for all database operations. The script is the same
as that of our original application, ushuffle_dbU.py. You will notice that
the database parameter db is optional and does not serve any purpose
here in ushuffle_sad.py or the upcoming SQLObject version, ushuffle_so.py.
This is a placeholder for you to add support for other RDBMSs
in these applications (see the exercises at the end of the chapter).
Upon running this script, you might get output that looks like this on a
Windows-based PC:
C:\>python ushuffle_sad.py
*** Connect to 'test' database
Choose a database system:
(M)ySQL
(G)adfly
(S)QLite
Enter choice: s

*** Create users table (drop old one if appl.)

*** Insert names into table

LOGIN     USERID    PROJID
Faye      6812      2
Serena    7003      4
Amy       7209      2
Dave      7306      3
Larry     7311      2
Mona      7404      2
Ernie     7410      1
Jim       7512      2
Angela    7603      1
Stan      7607      2
Jennifer  7608      4
Pat       7711      2
Leslie    7808      3
Davina    7902      3
Elliot    7911      4
Jess      7912      2
Aaron     8312      3
Melissa   8602      1

*** Move users to a random group
    (3 users moved) from (1) to (3)

LOGIN     USERID    PROJID
Faye      6812      2
Serena    7003      4
Amy       7209      2
Dave      7306      3
Larry     7311      2
Mona      7404      2
Ernie     7410      3
Jim       7512      2
Angela    7603      3
Stan      7607      2
Jennifer  7608      4
Pat       7711      2
Leslie    7808      3
Davina    7902      3
Elliot    7911      4
Jess      7912      2
Aaron     8312      3
Melissa   8602      3

*** Randomly delete group
    (group #3; 7 users removed)

LOGIN     USERID    PROJID
Faye      6812      2
Serena    7003      4
Amy       7209      2
Larry     7311      2
Mona      7404      2
Jim       7512      2
Stan      7607      2
Jennifer  7608      4
Pat       7711      2
Elliot    7911      4
Jess      7912      2

*** Drop users table

*** Close cxns
C:\>
Explicit/“Classical” ORM Access
We mentioned early on that we chose to use the declarative layer in SQLAlchemy for our example. However, we feel it’s also educational to look at
the more “explicit” form of ushuffle_sad.py (User shuffle SQLAlchemy
declarative), which we’ll name as ushuffle_sae.py (User shuffle SQLAlchemy explicit). You’ll notice that they look extremely similar to each
other.
A line-by-line explanation isn’t provided due to its similarity with
ushuffle_sad.py, but it can be downloaded from http://corepython.com.
The point is to both preserve this from previous editions as well as to let
you compare explicit versus declarative. SQLAlchemy has matured since
the book’s previous edition, so we wanted to bring it up-to-date, as well.
Here is ushuffle_sae.py:
#!/usr/bin/env python

from distutils.log import warn as printf
from os.path import dirname
from random import randrange as rand
from sqlalchemy import Column, Integer, String, create_engine, \
    exc, orm, MetaData, Table
from sqlalchemy.ext.declarative import declarative_base
from ushuffle_dbU import DBNAME, NAMELEN, randName, FIELDS, \
    tformat, cformat, setup

DSNs = {
    'mysql': 'mysql://[email protected]/%s' % DBNAME,
    'sqlite': 'sqlite:///:memory:',
}

class SQLAlchemyTest(object):
    def __init__(self, dsn):
        try:
            eng = create_engine(dsn)
        except ImportError:
            raise RuntimeError()

        try:
            cxn = eng.connect()
        except exc.OperationalError:
            try:
                eng = create_engine(dirname(dsn))
                eng.execute('CREATE DATABASE %s' % DBNAME).close()
                eng = create_engine(dsn)
                cxn = eng.connect()
            except exc.OperationalError:
                raise RuntimeError()

        metadata = MetaData()
        self.eng = metadata.bind = eng
        try:
            users = Table('users', metadata, autoload=True)
        except exc.NoSuchTableError:
            users = Table('users', metadata,
                Column('login', String(NAMELEN)),
                Column('userid', Integer),
                Column('projid', Integer),
            )

        self.cxn = cxn
        self.users = users

    def insert(self):
        d = [dict(zip(FIELDS, [who, uid, rand(1,5)])) \
            for who, uid in randName()]
        return self.users.insert().execute(*d).rowcount

    def update(self):
        users = self.users
        fr = rand(1,5)
        to = rand(1,5)
        return (fr, to,
            users.update(users.c.projid==fr).execute(
            projid=to).rowcount)

    def delete(self):
        users = self.users
        rm = rand(1,5)
        return (rm,
            users.delete(users.c.projid==rm).execute().rowcount)

    def dbDump(self):
        printf('\n%s' % ''.join(map(cformat, FIELDS)))
        users = self.users.select().execute()
        for user in users.fetchall():
            printf(''.join(map(tformat, (user.login,
                user.userid, user.projid))))

    def __getattr__(self, attr):
        return getattr(self.users, attr)

    def finish(self):
        self.cxn.close()

def main():
    printf('*** Connect to %r database' % DBNAME)
    db = setup()
    if db not in DSNs:
        printf('\nERROR: %r not supported, exit' % db)
        return

    try:
        orm = SQLAlchemyTest(DSNs[db])
    except RuntimeError:
        printf('\nERROR: %r not supported, exit' % db)
        return

    printf('\n*** Create users table (drop old one if appl.)')
    orm.drop(checkfirst=True)
    orm.create()

    printf('\n*** Insert names into table')
    orm.insert()
    orm.dbDump()

    printf('\n*** Move users to a random group')
    fr, to, num = orm.update()
    printf('\t(%d users moved) from (%d) to (%d)' % (num, fr, to))
    orm.dbDump()

    printf('\n*** Randomly delete group')
    rm, num = orm.delete()
    printf('\t(group #%d; %d users removed)' % (rm, num))
    orm.dbDump()

    printf('\n*** Drop users table')
    orm.drop()
    printf('\n*** Close cxns')
    orm.finish()

if __name__ == '__main__':
    main()
The major differences between ushuffle_sad.py and ushuffle_sae.py
are that the explicit version:
• Creates a Table object instead of a declarative Base object
• Does not use Sessions, instead performing individual units of work that are auto-committed and non-transactional
• Uses the Table object for all database interaction rather than Session queries
To show sessions and explicit operations are not tied together, you’ll get
an exercise to roll Sessions into ushuffle_sae.py. Now that you’ve learned
SQLAlchemy, let’s move on to SQLObject and see a similar tool.
SQLObject
SQLObject was Python’s first major ORM. In fact, it’s a decade old! Ian
Bicking, its creator, released the first alpha version to the world in October
2002. (SQLAlchemy didn’t come along until February 2006.) At the time of
this writing, SQLObject is only available for Python 2.
As we mentioned earlier, SQLObject is more object-flavored (some feel
more Pythonic) and implemented the Active Record pattern for implicit
object-to-database access early on but doesn’t give you as much freedom
to use raw SQL for more ad hoc or customized queries. Many users claim
that it is easy to learn SQLAlchemy, but we’ll let you be the judge. Take a
look at ushuffle_so.py in Example 6-3, which is our port of
ushuffle_dbU.py and ushuffle_sad.py to SQLObject.
Example 6-3
SQLObject ORM Example (ushuffle_so.py)
This user shuffle Python 2.x and 3.x-compatible application features the
SQLObject ORM paired up with MySQL or SQLite databases as back-ends.
1    #!/usr/bin/env python
2
3    from distutils.log import warn as printf
4    from os.path import dirname
5    from random import randrange as rand
6    from sqlobject import *
7    from ushuffle_dbU import DBNAME, NAMELEN, randName, FIELDS, tformat, cformat, setup
8
9    DSNs = {
10       'mysql': 'mysql://[email protected]/%s' % DBNAME,
11       'sqlite': 'sqlite:///:memory:',
12   }
13
14   class Users(SQLObject):
15       login = StringCol(length=NAMELEN)
16       userid = IntCol()
17       projid = IntCol()
18       def __str__(self):
19           return ''.join(map(tformat,
20               (self.login, self.userid, self.projid)))
21
22   class SQLObjectTest(object):
23       def __init__(self, dsn):
24           try:
25               cxn = connectionForURI(dsn)
26           except ImportError:
27               raise RuntimeError()
28           try:
29               cxn.releaseConnection(cxn.getConnection())
30           except dberrors.OperationalError:
31               cxn = connectionForURI(dirname(dsn))
32               cxn.query("CREATE DATABASE %s" % DBNAME)
33               cxn = connectionForURI(dsn)
34           self.cxn = sqlhub.processConnection = cxn
35
36       def insert(self):
37           for who, userid in randName():
38               Users(login=who, userid=userid, projid=rand(1,5))
39
40       def update(self):
41           fr = rand(1,5)
42           to = rand(1,5)
43           i = -1
44           users = Users.selectBy(projid=fr)
45           for i, user in enumerate(users):
46               user.projid = to
47           return fr, to, i+1
48
49       def delete(self):
50           rm = rand(1,5)
51           users = Users.selectBy(projid=rm)
52           i = -1
53           for i, user in enumerate(users):
54               user.destroySelf()
55           return rm, i+1
56
57       def dbDump(self):
58           printf('\n%s' % ''.join(map(cformat, FIELDS)))
59           for user in Users.select():
60               printf(user)
61
62       def finish(self):
63           self.cxn.close()
64
65   def main():
66       printf('*** Connect to %r database' % DBNAME)
67       db = setup()
68       if db not in DSNs:
69           printf('\nERROR: %r not supported, exit' % db)
70           return
71
72       try:
73           orm = SQLObjectTest(DSNs[db])
74       except RuntimeError:
75           printf('\nERROR: %r not supported, exit' % db)
76           return
77
78       printf('\n*** Create users table (drop old one if appl.)')
79       Users.dropTable(True)
80       Users.createTable()
81
82       printf('\n*** Insert names into table')
83       orm.insert()
84       orm.dbDump()
85
86       printf('\n*** Move users to a random group')
87       fr, to, num = orm.update()
88       printf('\t(%d users moved) from (%d) to (%d)' % (num, fr, to))
89       orm.dbDump()
90
91       printf('\n*** Randomly delete group')
92       rm, num = orm.delete()
93       printf('\t(group #%d; %d users removed)' % (rm, num))
94       orm.dbDump()
95
96       printf('\n*** Drop users table')
97       Users.dropTable()
98       printf('\n*** Close cxns')
99       orm.finish()
100
101  if __name__ == '__main__':
102      main()
Line-by-Line Explanation
Lines 1–12
The imports and constant declarations for this module are practically
identical to those of ushuffle_sad.py, except that we are using SQLObject
instead of SQLAlchemy.
Lines 14–20
The Users table extends the sqlobject.SQLObject class. We define the same
columns as before and also provide an __str__() for display output.
Lines 22–34
The constructor for our class does everything it can to ensure that there is
a database available and returns a connection to it, just like our SQLAlchemy example. Similarly, this is the only place you will see real SQL. The
code works as described in the following, which bails on all errors:
• Try to establish a connection to an existing database (line 29); if it
works, we are done. For this to succeed, an RDBMS adapter must be
available, the server must be online, and beyond that, the database
itself must exist.
• Otherwise, create the database and reconnect; if that works, we are done (lines 31–33).
• Once successful, we save the connection object in self.cxn.
Lines 36–55
The database operations happen in these lines. We have Insert (lines 36–38),
Update (lines 40–47), and Delete (lines 49–55). These are analogous to
the SQLAlchemy equivalents.
CORE TIP (HACKER’S CORNER): Reducing insert() down to one
(long) line of Python
We can reduce the code from the insert() method into a more obfuscated
“one-liner:”
[Users(**dict(zip(FIELDS, (who, userid, rand(1,5))))) \
for who, userid in randName()]
We’re not in the business of encouraging code that damages readability or executes
code explicitly by using a list comprehension; however, the existing solution
does have one flaw: it requires you to create new objects by explicitly naming
the columns as keyword arguments. By using FIELDS, you don’t need to know
the column names and wouldn’t need to fix as much code if those column names
changed, especially if FIELDS was in some configuration (not application) module.
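The heart of that one-liner is dict(zip(FIELDS, values)) followed by ** unpacking. The sketch below isolates those two steps; Row is a hypothetical stand-in for an ORM class that accepts columns as keyword arguments, and the value of FIELDS is assumed from the column headings in the sample output.

```python
# Column names; assumed here to match the LOGIN/USERID/PROJID headings.
FIELDS = ('login', 'userid', 'projid')

class Row(object):
    """Hypothetical stand-in for an ORM class taking columns as keywords."""
    def __init__(self, login, userid, projid):
        self.login, self.userid, self.projid = login, userid, projid

values = ('faye', 6812, 2)

# Pair each column name with its value, then build a dictionary from the pairs.
kwargs = dict(zip(FIELDS, values))

# ** spreads the dictionary into keyword arguments:
# Row(login='faye', userid=6812, projid=2)
row = Row(**kwargs)

print(kwargs)
```

If a column were renamed, only FIELDS (and the class definition) would need to change; the code that builds the rows would stay the same.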
Lines 57–63
This block starts with the same (and expected) dbDump() method, which
pulls the rows from the database and displays things nicely to the screen.
The finish() method (lines 62–63) closes the connection. We could not
use delegation for table drop as we did for the SQLAlchemy example
because the would-be delegated method for it is called dropTable(), not
drop().
Lines 65–102
This is the main() function again. It works just like the one in
ushuffle_sad.py. Also, the db argument and DSNs constant are building
blocks for you to add support for other RDBMSs in these applications (see
the exercises at the end of the chapter).
Here is what your output might look like if you run ushuffle_so.py
(which is going to be nearly identical to the output from the
ushuffle_dbU.py and ushuffle_sa?.py scripts):
$ python ushuffle_so.py
*** Connect to 'test' database
Choose a database system:
(M)ySQL
(G)adfly
(S)QLite
Enter choice: s

*** Create users table (drop old one if appl.)

*** Insert names into table

LOGIN     USERID    PROJID
Jess      7912      2
Ernie     7410      1
Melissa   8602      1
Serena    7003      1
Angela    7603      1
Aaron     8312      4
Elliot    7911      3
Jennifer  7608      1
Leslie    7808      4
Mona      7404      4
Larry     7311      1
Davina    7902      3
Stan      7607      4
Jim       7512      2
Pat       7711      1
Amy       7209      2
Faye      6812      1
Dave      7306      4

*** Move users to a random group
    (5 users moved) from (4) to (2)

LOGIN     USERID    PROJID
Jess      7912      2
Ernie     7410      1
Melissa   8602      1
Serena    7003      1
Angela    7603      1
Aaron     8312      2
Elliot    7911      3
Jennifer  7608      1
Leslie    7808      2
Mona      7404      2
Larry     7311      1
Davina    7902      3
Stan      7607      2
Jim       7512      2
Pat       7711      1
Amy       7209      2
Faye      6812      1
Dave      7306      2

*** Randomly delete group
    (group #3; 2 users removed)

LOGIN     USERID    PROJID
Jess      7912      2
Ernie     7410      1
Melissa   8602      1
Serena    7003      1
Angela    7603      1
Aaron     8312      2
Jennifer  7608      1
Leslie    7808      2
Mona      7404      2
Larry     7311      1
Stan      7607      2
Jim       7512      2
Pat       7711      1
Amy       7209      2
Faye      6812      1
Dave      7306      2

*** Drop users table

*** Close cxns
$
6.4
Non-Relational Databases
At the beginning of this chapter, we introduced you to SQL and looked at
relational databases. We then showed you how to get data to and from
those types of systems and presented a short lesson in porting to Python 3,
as well. Those sections were followed by sections on ORMs and how they
let users avoid SQL by taking on more of an “object” approach, instead.
However, under the hood, both SQLAlchemy and SQLObject generate
SQL on your behalf. In the final section of this chapter, we’ll stay on
objects but move away from relational databases.
6.4.1
Introduction to NoSQL
Recent trends in Web and social services have led to the generation of data
in amounts and/or rates greater than relational databases can handle.
Think Facebook or Twitter scale data generation. Developers of Facebook
games or applications that handle Twitter stream data, for example, might
have applications that need to write to persistent storage at a rate of millions of rows or objects per hour. This scalability issue has led to the creation,
explosive growth, and deployment of non-relational or NoSQL databases.
There are plenty of options available here, but they’re not all the same.
In the non-relational (or non-rel for short) category alone, there are object
databases, key-value stores, document stores (or datastores), graph databases, tabular databases, columnar/extensible record/wide-column databases,
multivalue databases, etc. At the end of the chapter, we’ll provide some
links to help you with your NoSQL research. At the time of this writing,
one of the more popular document store non-rel databases is MongoDB.
6.4.2 MongoDB
MongoDB has experienced a recent boost in popularity. Besides users,
documentation, community, and professional support, it has its own regular set of conferences—another sign of adoption. The main Web site claims
a variety of marquee users, including Craigslist, Shutterfly, foursquare,
bit.ly, SourceForge, etc. See http://www.mongodb.org/display/DOCS/Production+Deployments
for these and more. Regardless of its user base,
we feel that MongoDB is a good choice to introduce readers to NoSQL and
document datastores. For those who are curious, MongoDB’s document
storage system is written in C++.
If you were to compare document stores (MongoDB, CouchDB, Riak,
Amazon SimpleDB) in general to other non-rel databases, they fit somewhere between simple key-value stores, such as Redis, Voldemort, Amazon Dynamo, etc., and column-stores, such as Cassandra, Google Bigtable,
and HBase. They’re somewhat like schemaless derivatives of relational
databases, simpler and less constrained than columnar-based storage but
more flexible than plain key-value stores. They generally store their data
as JavaScript Object Notation (JSON) objects, which allows for data types,
such as strings, numbers, lists, as well as for nesting.
Some of the MongoDB (and NoSQL) terminology also differs from that of
relational database systems. For example, instead of thinking
about rows and columns, you might have to think about documents and collections. To better wrap your head around the change in terms,
you can take a quick look at the SQL-to-Mongo Mapping Chart at
http://www.mongodb.org/display/DOCS/SQL+to+Mongo+Mapping+Chart.
MongoDB in particular stores its JSON payloads (documents)—think a
single Python dictionary—in a binary-encoded serialization, commonly
known as BSON format. However, regardless of its storage mechanism,
the main idea is that to developers, it looks like JSON, which in turn
looks like Python dictionaries, which brings us to where we want to be.
MongoDB is popular enough to have adapters available for most platforms, including Python.
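To make the dictionary analogy concrete, here is a minimal sketch that uses only the standard library json module (it does not use PyMongo or BSON). The login, userid, and projid fields mirror the user records used later in this section; the tags and contact fields are hypothetical and are there only to show that lists and nesting are allowed.

```python
import json

# A document is conceptually one Python dictionary. The login, userid,
# and projid fields match this chapter's user records; "tags" and
# "contact" are made-up fields that illustrate lists and nesting.
user = {
    'login': 'Melissa',
    'userid': 8602,
    'projid': 1,
    'tags': ['admin', 'qa'],
    'contact': {'email': 'melissa@example.com'},
}

# Serializing to JSON text and parsing it back yields an equal dict.
as_json = json.dumps(user, sort_keys=True)
round_trip = json.loads(as_json)

print(as_json)
print(round_trip == user)
```

Because no schema is declared anywhere, a second document in the same collection could carry a different set of keys.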
6.4.3 PyMongo: MongoDB and Python
Although there are a variety of MongoDB drivers for Python, the most formal of them is PyMongo. The others are either more lightweight adapters
or are special-purpose. You can perform a search on mongo at the Cheeseshop (http://pypi.python.org) to see all MongoDB-related Python packages. You can try any of them, as you prefer, but our example in this
chapter uses PyMongo.
Another benefit of the pymongo package is that it has been ported to
Python 3. Given the techniques already used earlier in this chapter, we
will only present one Python application that runs on both Python 2 and 3,
and depending on which interpreter you use to execute the script, it in
turn will utilize the appropriately-installed version of pymongo.
We won’t spend much time on installation as that is primarily beyond
the scope of this book; however, we can point you to mongodb.org to
download MongoDB and let you know that you can use easy_install or
pip to install PyMongo and/or PyMongo3. (Note: I didn’t have any problems getting pymongo3 on my Mac, but the install process choked in
Windows.) Whichever one you install (or both), it’ll look the same from
your code: import pymongo.
To confirm that you have MongoDB installed and working correctly,
check out the QuickStart guide at http://www.mongodb.org/display/DOCS/Quickstart
and similarly, to confirm the same for PyMongo, ensure
that you can import the pymongo package. To get a feel for using MongoDB
with Python, run through the PyMongo tutorial at
http://api.mongodb.org/python/current/tutorial.html.
What we’re going to do here is port our existing user shuffle
(ushuffle_*.py) application that we’ve been looking at throughout this
chapter to use MongoDB as its persistent storage. You’ll notice that the flavor of the application is similar to that of SQLAlchemy and SQLObject,
but it is even less substantial in that there isn’t as much overhead with
MongoDB as there is with a typical relational database system such as MySQL.
Example 6-4 presents the Python 2 and 3-compatible ushuffle_mongo.py,
followed by the line-by-line explanation.
Example 6-4 MongoDB Example (ushuffle_mongo.py)
Our user shuffle Python 2.x and 3.x-compatible MongoDB and PyMongo
application.
 1  #!/usr/bin/env python
 2
 3  from distutils.log import warn as printf
 4  from random import randrange as rand
 5  from pymongo import Connection, errors
 6  from ushuffle_dbU import DBNAME, randName, FIELDS, tformat, cformat
 7
 8  COLLECTION = 'users'
 9
10  class MongoTest(object):
11      def __init__(self):
12          try:
13              cxn = Connection()
14          except errors.AutoReconnect:
15              raise RuntimeError()
16          self.db = cxn[DBNAME]
17          self.users = self.db[COLLECTION]
18
19      def insert(self):
20          self.users.insert(
21              dict(login=who, userid=uid, projid=rand(1,5)) \
22              for who, uid in randName())
23
24      def update(self):
25          fr = rand(1,5)
26          to = rand(1,5)
27          i = -1
28          for i, user in enumerate(self.users.find({'projid': fr})):
29              self.users.update(user,
30                  {'$set': {'projid': to}})
31          return fr, to, i+1
32
33      def delete(self):
34          rm = rand(1,5)
35          i = -1
36          for i, user in enumerate(self.users.find({'projid': rm})):
37              self.users.remove(user)
38          return rm, i+1
39
40      def dbDump(self):
41          printf('\n%s' % ''.join(map(cformat, FIELDS)))
42          for user in self.users.find():
43              printf(''.join(map(tformat,
44                  (user[k] for k in FIELDS))))
45
46      def finish(self):
47          self.db.connection.disconnect()
48
49  def main():
50      printf('*** Connect to %r database' % DBNAME)
51      try:
52          mongo = MongoTest()
53      except RuntimeError:
54          printf('\nERROR: MongoDB server unreachable, exit')
55          return
56
57      printf('\n*** Insert names into table')
58      mongo.insert()
59      mongo.dbDump()
60
61      printf('\n*** Move users to a random group')
62      fr, to, num = mongo.update()
63      printf('\t(%d users moved) from (%d) to (%d)' % (num, fr, to))
64      mongo.dbDump()
65
66      printf('\n*** Randomly delete group')
67      rm, num = mongo.delete()
68      printf('\t(group #%d; %d users removed)' % (rm, num))
69      mongo.dbDump()
70
71      printf('\n*** Drop users table')
72      mongo.db.drop_collection(COLLECTION)
73      printf('\n*** Close cxns')
74      mongo.finish()
75
76  if __name__ == '__main__':
77      main()
Line-by-Line Explanation
Lines 1–8
The main import line brings in PyMongo’s Connection object and the
package’s exceptions (errors). You have seen everything else earlier in this
chapter. Like the ORM examples, we yet again borrow most constants and
common functions from our earlier ushuffle_dbU.py application. The last
statement sets our collection (“table”) name.
Lines 10–17
The first part of the initializer for our MongoTest class creates a connection,
raising an exception if the server cannot be reached (lines 12–15). The next
two lines are very easy to skip over because they look like mere assignments, but under the hood, these create a database or reuse an existing one
(line 16) and create or reuse an existing “users” collection, which you can
sort of consider as analogous to a database table.
Tables have defined columns, with a row for each record, whereas collections don’t have any schema requirements; they have individual documents for each record. You will notice the conspicuous absence of a “data
model” class definition in this part of the code. Each record defines itself,
so to speak—whatever record you save is what goes into the collection.
Lines 19–22
The insert() method adds values to a MongoDB collection. A collection is
made up of documents. You can think of a document as a single record in
the form of a Python dictionary. We create one for each record by using the
dict() factory function, and all are streamed to the collection’s
insert() method via a generator expression.
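The pattern of streaming dictionaries through a generator expression can be sketched without a MongoDB server. In the sketch below, rand_name() and FakeCollection are hypothetical stand-ins (for the chapter's randName() helper and for a real collection, respectively); only the shape of the insert() call mirrors the listing.

```python
from random import randrange

# Hypothetical stand-in for the chapter's randName() helper: it yields
# (login, userid) pairs.
def rand_name():
    for pair in (('Melissa', 8602), ('Serena', 7003), ('Angela', 7603)):
        yield pair

# Hypothetical stand-in for a collection: it stores whatever documents
# it is handed, with no schema check.
class FakeCollection(object):
    def __init__(self):
        self.docs = []

    def insert(self, docs):
        # extend() pulls items from the generator one at a time
        self.docs.extend(docs)

users = FakeCollection()
users.insert(
    dict(login=who, userid=uid, projid=randrange(1, 5))
    for who, uid in rand_name())

for doc in users.docs:
    print(doc)
```

Note that randrange(1, 5) excludes its upper bound, so project IDs fall between 1 and 4.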
Lines 24–31
The update() method works in the same manner as earlier in the chapter.
The difference is the collection’s update() method, which gives developers
more options than a typical database system. Here (lines 29–30), we use
the MongoDB $set directive, which updates an existing value explicitly.
Each MongoDB directive represents a modifier operation that is
efficient, useful, and convenient to the developer when updating
existing values. In addition to $set, there are also operations for incrementing a field by a value, removing a field (key-value pair), appending
and removing values to/from an array, etc.
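To get a feel for what these modifiers do to a document, here is a toy model in plain Python. It is only an illustration of the semantics described above, not MongoDB's implementation; the apply_modifiers() function is hypothetical, while $set, $inc, $unset, $push, and $pull are the names of real MongoDB modifiers.

```python
import copy

def apply_modifiers(doc, modifiers):
    """Return a modified copy of doc; a toy model, not MongoDB's code."""
    result = copy.deepcopy(doc)
    for op, fields in modifiers.items():
        for key, value in fields.items():
            if op == '$set':        # assign a value explicitly
                result[key] = value
            elif op == '$inc':      # increment a numeric field
                result[key] = result.get(key, 0) + value
            elif op == '$unset':    # remove a key-value pair
                result.pop(key, None)
            elif op == '$push':     # append a value to an array
                result.setdefault(key, []).append(value)
            elif op == '$pull':     # remove matching values from an array
                if key in result:
                    result[key] = [v for v in result[key] if v != value]
            else:
                raise ValueError('unsupported modifier: %s' % op)
    return result

user = {'login': 'Aaron', 'userid': 8312, 'projid': 2, 'tags': ['qa']}
moved = apply_modifiers(user, {'$set': {'projid': 3}})
print(moved['projid'])
```

The appeal of modifiers is that only the named fields change; the rest of the document is left alone.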
Working backward somewhat, before the update, however, we first
need to query for all the users in the system (line 28) to find those with a
project ID (projid) that matches the group we want to update. To do this,
you use the collection find() method and pass in the criteria. This takes
the place of a SQL SELECT statement.
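The idea of passing criteria as a dictionary can be illustrated with a small pure-Python sketch. The find() function below is a hypothetical stand-in that handles only simple equality matching; the real find() supports far richer queries. The sketch also shows why the listing initializes i to -1 before the loop: if nothing matches, i+1 correctly evaluates to zero.

```python
def find(docs, criteria=None):
    """Yield documents whose fields equal every value in criteria."""
    criteria = criteria or {}
    for doc in docs:
        if all(doc.get(k) == v for k, v in criteria.items()):
            yield doc

users = [
    {'login': 'Jess', 'userid': 7912, 'projid': 2},
    {'login': 'Ernie', 'userid': 7410, 'projid': 1},
    {'login': 'Aaron', 'userid': 8312, 'projid': 2},
]

# Roughly equivalent to: SELECT * FROM users WHERE projid = 2
matches = list(find(users, {'projid': 2}))
print([u['login'] for u in matches])

# Counting idiom from the listing: i stays at -1 when nothing matches.
i = -1
for i, user in enumerate(find(users, {'projid': 4})):
    pass
print(i + 1)
```

Calling find() with no criteria yields every document, which is what dbDump() relies on.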
It’s also possible to use the Collection.update() method to modify multiple documents; you would just need to set the multi flag to True. The
only bad news with this is that it currently doesn’t return the total number
of documents modified.
For more complex queries than the single criterion in our simple
script, check the corresponding page in the official documentation at
http://www.mongodb.org/display/DOCS/Advanced+Queries.
Lines 33–38
The delete() method reuses the same query as for update(). Once we
have all the users that match the query, we remove() them one at a time
(lines 36–37) and return the results. If you don’t care about the total number of documents removed, then you can simply make a single call to
self.users.remove({'projid': rm}), which deletes all matching documents. (Calling remove() with no arguments deletes all documents from a collection.)
Lines 40–44
The query performed in dbDump() has no criteria (line 42), so all users in
the collection are returned; the data is then string-formatted and displayed to the user (lines 43–44).
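The formatting step can be sketched on its own. The definitions of COLSIZ, FIELDS, cformat, and tformat below are assumptions modeled on how those names are used in the listing and on the column layout of the program output; the real ones live in ushuffle_dbU.py. The dump() function is a hypothetical variant of dbDump() that returns lines instead of printing them.

```python
# Assumed helper definitions (the real ones are in ushuffle_dbU.py).
COLSIZ = 10
FIELDS = ('login', 'userid', 'projid')
cformat = lambda s: s.upper().ljust(COLSIZ)        # column header
tformat = lambda s: str(s).title().ljust(COLSIZ)   # table cell

users = [
    {'login': 'Jess', 'userid': 7912, 'projid': 2},
    {'login': 'Ernie', 'userid': 7410, 'projid': 1},
]

def dump(docs):
    """Return the header line followed by one line per document."""
    lines = [''.join(map(cformat, FIELDS))]
    for user in docs:
        lines.append(''.join(map(tformat, (user[k] for k in FIELDS))))
    return lines

lines = dump(users)
for line in lines:
    print(line)
```

Each document is read by key, so the order of the columns is governed by FIELDS rather than by the document itself.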
Lines 46–47
The final method defined and called during application execution disconnects from the MongoDB server.
Lines 49–77
The main() driver function is self-documenting and follows the exact
same script as the previous applications seen in this chapter: connect to
database server and do preparation work; insert users into the collection
(“table”) and dump database contents; move users from one project to
another (and dump contents); remove an entire group (and dump contents); drop the entire collection; and then finally, disconnect.
While this closes our look at non-relational databases for Python, it
should only be the beginning for you. As mentioned at the beginning of
this section, there are plenty of NoSQL options to look at, and you’ll need
to investigate and perhaps prototype each to determine which among
them might be the right tool for the job. In the next section, we give various additional references for you to read further.
6.4.4 Summary
We hope that we have provided you with a good introduction to using
relational databases with Python. When your application’s needs go
beyond those offered by plain files, or specialized files, such as DBM, pickled, etc., you have many options. There are a good number of RDBMSs out
there, not to mention one completely implemented in Python, freeing you
from having to install, maintain, or administer a real database system.
In the following section, you will find information on many of the
Python adapters plus database and ORM systems. Furthermore, the community has been augmented with non-relational databases now to help
out in those situations when relational databases don’t scale to the level
that your application needs.
We also suggest checking out the DB-SIG pages as well as the Web
pages and mailing lists of all systems of interest. Like all other areas of
software development, Python makes things easy to learn and simple to
experiment with.
6.5 Related References
Table 6-8 lists most of the common databases available, along with working Python modules and packages that serve as adapters to those database
systems. Note that not all adapters are DB-API-compliant.
Table 6-8 Database-Related Modules/Packages and Web sites

Name                            Online Reference

Relational Databases
Gadfly                          gadfly.sf.net
MySQL                           mysql.com or mysql.org
  MySQLdb a.k.a. MySQL-python   sf.net/projects/mysql-python
  MySQL Connector/Python        launchpad.net/myconnpy
PostgreSQL                      postgresql.org
  psycopg                       initd.org/psycopg
  PyPgSQL                       pypgsql.sf.net
  PyGreSQL                      pygresql.org
SQLite                          sqlite.org
  pysqlite                      trac.edgewall.org/wiki/PySqlite
  sqlite3 [a]                   docs.python.org/library/sqlite3
  APSW                          code.google.com/p/apsw
MaxDB (SAP)                     maxdb.sap.com
  sdb.dbapi                     maxdb.sap.com/doc/7_7/46/702811f2042d87e10000000a1553f6/content.htm
  sdb.sql                       maxdb.sap.com/doc/7_7/46/71b2a816ae0284e10000000a1553f6/content.htm
  sapdb                         sapdb.org/sapdbPython.html
Firebird (InterBase)            firebirdsql.org
  KInterbasDB                   firebirdsql.org/en/python-driver
SQL Server                      microsoft.com/sql
  pymssql                       code.google.com/p/pymssql (requires FreeTDS [freetds.org])
  adodbapi                      adodbapi.sf.net
Sybase                          sybase.com
  sybase                        www.object-craft.com.au/projects/sybase
Oracle                          oracle.com
  cx_Oracle                     cx-oracle.sf.net
  DCOracle2                     zope.org/Members/matt/dco2 (older, for Oracle8 only)
Ingres                          ingres.com
  Ingres DBI                    community.actian.com/wiki/Ingres_Python_Development_Center
  ingmod                        www.informatik.uni-rostock.de/~hme/software/

NoSQL Document Datastores
MongoDB                         mongodb.org
  PyMongo                       pypi.python.org/pypi/pymongo
                                Docs at api.mongodb.org/python/current
  PyMongo3                      pypi.python.org/pypi/pymongo3
  Other adapters                api.mongodb.org/python/current/tools.html
CouchDB                         couchdb.apache.org
  couchdb-python                code.google.com/p/couchdb-python
                                Docs at packages.python.org/CouchDB

ORMs
SQLObject                       sqlobject.org
SQLObject2                      sqlobject.org/2
SQLAlchemy                      sqlalchemy.org
Storm                           storm.canonical.com
PyDO/PyDO2                      skunkweb.sf.net/pydo.html

a. pysqlite was added to Python 2.5 as the sqlite3 module.
In addition to the database-related modules/packages, the following are
yet more online references that you can consider:
Python and Databases
• wiki.python.org/moin/DatabaseProgramming
• wiki.python.org/moin/DatabaseInterfaces
Database Formats, Structures, and Development Patterns
• en.wikipedia.org/wiki/DSN
• www.martinfowler.com/eaaCatalog/dataMapper.html
• en.wikipedia.org/wiki/Active_record_pattern
• blog.mongodb.org/post/114440717/bson
Non-relational Databases
• en.wikipedia.org/wiki/Nosql
• nosql-database.org/
• www.mongodb.org/display/DOCS/MongoDB,+CouchDB,+MySQL+Compare+Grid
6.6 Exercises
Databases
6-1. Database API. What is the Python DB-API? Is it a good thing?
Why (or why not)?
6-2. Database API. Describe the differences between the database
module parameter styles (see the paramstyle module attribute).
6-3. Cursor Objects. What are the differences between the cursor
execute*() methods?
6-4. Cursor Objects. What are the differences between the cursor
fetch*() methods?
6-5. Database Adapters. Research your RDBMS and its Python
module. Is it DB-API compliant? What additional features
are available for that module that are extras not required by
the API?
6-6. Type Objects. Study using Type objects for your database and
DB-API adapter, and then write a small script that uses at
least one of those objects.
6-7. Refactoring. In the ushuffle_dbU.create() function, a table
that already exists is dropped and re-created by recursively
calling create() again. This is dangerous, because if recreation of the table fails (again), you will then have infinite
recursion. Fix this problem by creating a more practical solution
that does not involve copying the create query (cur.execute())
again in the exception handler. Extra Credit: Try to recreate
the table a maximum of three times before returning failure
back to the caller.
6-8. Database and HTML. Take any existing database table, and
use your Web programming knowledge to create a handler
that outputs the contents of that table as HTML for browsers.
6-9. Web Programming and Databases. Take our user shuffle example (ushuffle_db.py) and create a Web interface for it.
6-10. GUI Programming and Databases. Take our user shuffle example (ushuffle_db.py) and create a GUI for it.
6-11. Stock Portfolio Class. Create an application that manages the
stock portfolios for multiple users. Use a relational database
as the back-end and provide a Web-based user interface. You
can use the stock database class from the object-oriented
programming chapter of Core Python Language Fundamentals
or Core Python Programming.
6-12. Debugging & Refactoring. The update() and remove() functions each have a minor flaw: update() might move users
from one group into the same group. Change the random
destination group to be different from the group from which
the user is moving. Similarly, remove() might try to remove
people from a group that has no members (because they
don’t exist or were moved up with update()).
ORMs
6-13. Stock Portfolio Class. Create an alternative solution to the
Stock Portfolio (Exercise 6-11) by using an ORM instead of
going directly to an RDBMS.
6-14. Debugging and Refactoring. Port your solutions to Exercise 6-13
to both the SQLAlchemy and SQLObject examples.
6-15. Supporting Different RDBMSs. Take either the SQLAlchemy
(ushuffle_sad.py) or SQLObject (ushuffle_so.py) application, which currently support MySQL and SQLite, and add
yet another relational database of your choice.
For the next four exercises, focus on the ushuffle_dbU.py script, which features some code near the top (lines 7–12) that determines which function
should be used to get user input from the command-line.
6-16. Importing and Python. Review that code again. Why do we
need to check if __builtins__ is a dict versus a module?
6-17. Porting to Python 3. Using distutils.log.warn() is not a perfect substitute for print/print(). Prove it. Provide code snippets to show where warn() is not compatible with print().
6-18. Porting to Python 3. Some users believe that they can use
print() in Python 2 just like in Python 3. Prove them wrong.
Hint: From Guido himself: print(x, y)
6-19. Python Language. Assume that you want to use print() in
Python 3 but distutils.log.warn() in Python 2, and you
want to use the printf() name. What’s wrong with the code
below?
from distutils.log import warn
if hasattr(__builtins__, 'print'):
    printf = print
else:
    printf = warn
6-20. Exceptions. When establishing our connection to the server
using our designated database name in ushuffle_sad.py, a
failure (exc.OperationalError) indicated that our table did
not exist, so we had to back up and create the database first
before retrying the database connection. However, this is
not the only source of errors: if using MySQL and the server
itself is down, the same exception is also thrown. In this
situation, execution of CREATE DATABASE will fail, as well.
Add another handler to take care of this situation, raising
RuntimeError back to the code attempting to create an
instance.
6-21. SQLAlchemy. Augment the ushuffle_sad.dbDump() function
by adding a new default parameter named newest5 which
defaults to False. If True is passed in, rather than displaying
all users, reverse sort the list by order of Users.userid and
show only the top five representing the newest employees.
Make this special call in main() right after the call to
orm.insert() and orm.dbDump().
a) Use the Query limit() and offset() methods.
b) Use the Python slicing syntax, instead.
The updated output would look something like this:
. . .
Jess      7912      4
Aaron     8312      3
Melissa   8602      2

*** Top 5 newest employees

LOGIN     USERID    PROJID
Melissa   8602      2
Aaron     8312      3
Jess      7912      4
Elliot    7911      3
Davina    7902      3

*** Move users to a random group
        (4 users moved) from (3) to (1)

LOGIN     USERID    PROJID
Faye      6812      4
Serena    7003      2
Amy       7209      1
. . .
6-22. SQLAlchemy. Change ushuffle_sad.update() to use the
Query update() method, dropping down to 5 lines of code.
Use the timeit module to show whether it’s faster than the
original.
6-23. SQLAlchemy. Same as Exercise 6-22 but for ushuffle_
sad.delete(), use the Query delete() method.
6-24. SQLAlchemy. In the explicitly non-declarative version of
ushuffle_sad.py, ushuffle_sae.py, we removed the use of
the declarative layer as well as sessions. While using an
Active Record model is more optional, the concept of
Sessions isn’t a bad idea at all. Change all of the code that
performs database operations in ushuffle_sae.py so that
they all use/share a Session object, as in the declarative
ushuffle_sad.py.
6-25. Django Data Models. Take the Users data model class, as
implemented in our SQLAlchemy or SQLObject examples,
and create the equivalent by using the Django ORM. You
might want to read ahead to Chapter 11, “Web Frameworks:
Django.”
6-26. Storm ORM. Port the ushuffle_s*.py application to the
Storm ORM.
Non-Relational (NoSQL) Databases
6-27. NoSQL. What are some of the reasons why non-relational databases have become popular? What do they offer over traditional relational databases?
6-28. NoSQL. There are at least four different types of non-relational databases. Categorize each of the major types
and name the most well-known projects in each category.
Note the specific ones that have at least one Python adapter.
6-29. CouchDB. CouchDB is another document datastore that’s
often compared to MongoDB. Review some of the online
comparisons in the final section of this chapter, and then
download and install CouchDB. Morph ushuffle_mongo.py
into a CouchDB-compatible ushuffle_couch.py.
CHAPTER 7
*Programming Microsoft Office
Whatever you have to do, there is always a limiting factor that
determines how quickly and well you get it done. Your job is to study the
task and identify the limiting factor or constraint within it. You must
then focus all of your energies on alleviating that single choke point.
—Brian Tracy, March 2001
(from Eat That Frog, 2001, Berrett-Koehler)
In this chapter...
• Introduction
• COM Client Programming with Python
• Introductory Examples
• Intermediate Examples
• Related Modules/Packages
Note that the examples in this chapter require a Windows operating system;
they will not work on Apple computers running Microsoft Office for Mac.
This chapter represents a departure from most other sections of this
book, meaning that instead of focusing on developing networked,
GUI, Web, or command-line-based applications, we’ll be using
Python for something completely different: controlling proprietary software, specifically Microsoft Office applications, via Component Object
Model (COM) client programming.
7.1 Introduction
Like it or not, we developers live in a world in which we will interact with
Windows-based PCs. It might be intermittent or something you have to
deal with on a daily basis, but regardless of how much exposure you face,
the power of Python can be used to make our lives easier.
In this chapter, we will explore COM client programming by using
Python to control and communicate with Microsoft Office applications
such as Word, Excel, PowerPoint, and Outlook. COM is a service through
which PC applications can interact with each other. Specifically, well-known applications such as those in the Office suite provide COM services, and COM client programs can be written to drive these applications.
Traditionally, COM clients are written in Microsoft Visual Basic (VB)/
Visual Basic for Applications (VBA) or (Visual) C++, two very powerful
but very different tools. For COM programming, Python is often viewed
as a viable substitute because it is more powerful than VB, and it is more
expressive and less time-consuming than developing in C++.
IronPython, .NET, and VSTO are all newer tools that help you to write
applications that communicate with Office tools, as well, but if you look
under the hood, you’ll find COM, so the material in this chapter still
applies, even if you’re using some of these more advanced tools.
This chapter is designed for both COM developers who want to learn
how they can apply Python in their world, and also for Python programmers who need to learn how to create COM clients to automate tasks such
as generating Excel spreadsheets, creating form letters as Word documents, building slide presentations by using PowerPoint, sending e-mail
via Outlook, etc. We will not be discussing the principles or concepts of
COM, waxing philosophical on such thoughts as “Why COM?” Nor
will we be learning about COM+, ATL, IDL, MFC, DCOM, ADO, .NET,
IronPython, VSTO, etc.
Instead, we will immerse you in COM client programming by learning
how to use Python to communicate with Office applications.
7.2 COM Client Programming with Python
One of the most useful things that you can do in an everyday business
environment is to integrate support for Windows applications. Being able
to read data from and write data to such applications can often be very
handy. Your department might not be running in a Windows environment, but chances are, your management and other project teams are. Mark
Hammond’s Windows Extensions for Python allows programmers to
interact with Windows applications in their native environment.
The Windows programming universe is expansive; most of it is available
from the Windows Extensions for Python package. This bundle includes
the Windows applications programming interface (API), spawning processes, Microsoft Foundation Classes (MFC) Graphical User Interface
(GUI) development, Windows multithreaded programming, services,
remote access, pipes, server-side COM programming, and events. For the
remainder of the chapter, we are going to focus on one part of the Windows
universe: COM client programming.
7.2.1 Client-Side COM Programming
We can use COM (or its marketing name, ActiveX) to communicate with
tools such as Outlook and Excel. For programmers, the pleasure comes
with being able to “control” a native Office application directly from their
Python code.
Using a COM object (for example, launching an application and allowing
code to access the methods and data of that application) is referred to as
COM client-side programming.
Server-side COM programming is the implementation of a COM object for
clients to access.
CORE NOTE: Python and Microsoft COM (client-side) programming
Python on the Windows 32-bit platform contains connectivity to COM, a
Microsoft interfacing technology that allows objects to talk to one another, thus
facilitating higher-level applications to talk to one another, without any language or format dependence. We will see in this section how the combination
of Python and COM (client programming) presents a unique opportunity to
create scripts that can communicate directly with Microsoft Office applications
such as Word, Excel, PowerPoint, and Outlook.
7.2.2 Getting Started
The prerequisites to this section include using a PC (or other system containing a virtual machine) that is running a 32-bit or 64-bit version of Windows. You also must have .NET 2.0 (at least) installed as well as both
Python and the Python Extensions for Windows. (You can get the extensions from http://pywin32.sf.net.) Finally, you must have one or more
Microsoft applications available with which to try the examples. You can
develop from the command-line or with the PythonWin IDE that comes
with the Extensions distribution.
I must confess that I’m neither a COM expert nor a Microsoft software
developer; however, I am skilled enough to show you how to use Python
to control Office applications. Naturally our examples can be vastly
improved. We solicit you to drop us a line and send us any comments,
suggestions, or improvements that you would consider for the general
audience.
The rest of the chapter is made up of demonstration applications to get
you started in programming each of the major Office applications; it then
concludes with several intermediate examples. Before we show you examples, we want to point out that client-side COM applications all follow
similar steps in execution. The typical way in which you would interact
with these applications is something like this:
1. Launch application
2. Add appropriate document to work on (or load an existing
one)
3. Make application visible (if desired)
4. Perform all desired work on document
5. Save or discard document
6. Quit
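The six steps above can be sketched in code before we look at the real scripts. The run_excel_session() function is hypothetical; it takes the dispatch function as a parameter so that the sketch can run anywhere, using recording stand-ins (FakeExcel and friends, also hypothetical) in place of a real COM object. On Windows with the extensions installed, you would pass win32com.client.Dispatch instead. The attribute and method names (Workbooks.Add, Visible, ActiveSheet, Cells, Close, Application.Quit) are the ones used in this chapter's Excel example.

```python
def run_excel_session(dispatch):
    """Walk through the six steps against an Excel-like object."""
    xl = dispatch('Excel.Application')   # 1. launch application
    ss = xl.Workbooks.Add()              # 2. add a document to work on
    xl.Visible = True                    # 3. make application visible
    ss.ActiveSheet.Cells(1, 1).Value = 'Hello from Python'  # 4. do work
    ss.Close(False)                      # 5. discard the document
    xl.Application.Quit()                # 6. quit
    return xl

# --- Hypothetical stand-ins that only record what was called ---
class _Cell(object):
    Value = None

class _Sheet(object):
    def __init__(self):
        self.cells = {}
    def Cells(self, row, col):
        return self.cells.setdefault((row, col), _Cell())

class _Workbook(object):
    def __init__(self, log):
        self.log = log
        self.ActiveSheet = _Sheet()
    def Close(self, save_changes):
        self.log.append('close(save=%r)' % save_changes)

class _Workbooks(object):
    def __init__(self, log):
        self.log = log
    def Add(self):
        self.log.append('add')
        return _Workbook(self.log)

class FakeExcel(object):
    def __init__(self, log):
        self.log = log
        self.Visible = False
        self.Workbooks = _Workbooks(log)
        self.Application = self
    def Quit(self):
        self.log.append('quit')

log = []

def fake_dispatch(prog_id):
    log.append('launch %s' % prog_id)
    return FakeExcel(log)

app = run_excel_session(fake_dispatch)
print(log)
```

Passing the dispatcher in as an argument is a design choice made for this sketch only; the scripts in this chapter call the dispatch function directly.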
Enough talking; let’s take a look at some code. The following section
contains a series of scripts that each control a different Microsoft application.
All import the win32com.client module as well as a couple of Tk modules
to control the launch (and completion) of each application. Also, as we did
in Chapter 5, “GUI Programming,” we use the .pyw file extension to suppress the unneeded DOS command window.
7.3 Introductory Examples
In this section, we will take a look at basic examples that will get you started
developing with four major Office applications: Excel, Word, PowerPoint,
and Outlook.
7.3.1 Excel
Our first example is a demonstration using Excel. Of the entire Office
suite, we find Excel to be the most programmable. It is quite useful to pass
data to Excel so that you can both take advantage of the spreadsheet’s features as well as view data in a nice, printable format. It is also useful to be
able to read data from a spreadsheet and process it with the power of a
real programming language such as Python. We will present a more complex example at the end of this section, but we have to start somewhere, so
let’s start with Example 7-1.
Example 7-1 Excel Example (excel.pyw)
This script launches Excel and writes data to spreadsheet cells.
 1  #!/usr/bin/env python
 2
 3  from Tkinter import Tk
 4  from time import sleep
 5  from tkMessageBox import showwarning
 6  import win32com.client as win32
 7
 8  warn = lambda app: showwarning(app, 'Exit?')
 9  RANGE = range(3, 8)
10
11  def excel():
12      app = 'Excel'
13      xl = win32.gencache.EnsureDispatch('%s.Application' % app)
14      ss = xl.Workbooks.Add()
15      sh = ss.ActiveSheet
16      xl.Visible = True
17      sleep(1)
18
19      sh.Cells(1,1).Value = 'Python-to-%s Demo' % app
20      sleep(1)
21      for i in RANGE:
22          sh.Cells(i,1).Value = 'Line %d' % i
23          sleep(1)
24      sh.Cells(i+2,1).Value = "Th-th-th-that's all folks!"
25
26      warn(app)
27      ss.Close(False)
28      xl.Application.Quit()
29
30  if __name__=='__main__':
31      Tk().withdraw()
32      excel()
Line-by-Line Explanation
Lines 1–6, 31
We import Tkinter and tkMessageBox only to use the showwarning message
box upon termination of the demonstration. We withdraw() the Tk toplevel window to suppress it (line 31) before bringing up the dialog box
(line 26). If you do not initialize the top level beforehand, one will automatically be created for you; it won’t be withdrawn and will be an annoyance on screen.
Lines 11–17
After the code starts (or “dispatches”) Excel, we add a workbook (a spreadsheet that contains sheets to which the data is written; these sheets are
organized as tabs in the workbook), and then grab a handle to the active
sheet (the sheet that is displayed). Do not get all worked up about the terminology, which can be confusing mostly because a spreadsheet contains
sheets.
CORE NOTE: Static and dynamic dispatch
On line 13, we use what is known as static dispatch. Before starting up the script,
we ran the Makepy utility from PythonWin. (Start the IDE, select Tools, COM
Makepy utility, and then choose the appropriate application object library.)
This utility creates and caches the objects that are needed for the application.
Without this preparatory work, the objects and attributes will need to be built
during runtime; this is known as dynamic dispatch. If you want to run dynamically, then use the regular Dispatch() function (win32 is the name under which this script imports win32com.client):
    xl = win32.Dispatch('%s.Application' % app)
The Visible flag must be set to True to make the application visible on
your desktop (line 16); we then pause for a second (line 17) so that you can
see each step in the demonstration.
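The choice between static and dynamic dispatch described in the Core Note can be wrapped in a small helper. The following is only a sketch: get_application() and its client parameter are names invented here for illustration, and it assumes that EnsureDispatch() raises an exception when static dispatch is not possible (for example, when the generated cache cannot be built). The client parameter exists only so that the function can be tried with a stand-in object on a machine without Office.

```python
def get_application(progid, client=None):
    """Return a COM application object, preferring static dispatch.

    `client` is expected to behave like the win32com.client module; it is
    a parameter (hypothetical, for illustration) so the logic can be
    exercised without Office installed.
    """
    if client is None:
        import win32com.client as client  # requires pywin32 on Windows
    try:
        # Static dispatch: uses (or generates) the Makepy cache.
        return client.gencache.EnsureDispatch(progid)
    except Exception:
        # Dynamic dispatch: attributes are resolved at runtime, and the
        # named constants may not be available.
        return client.Dispatch(progid)
```

A script would then call xl = get_application('Excel.Application') and carry on as in Example 7-1.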
Lines 19–24
In the application portion of the script, we write out the title of our demonstration to the first (upper-left) cell, (A1) or (1, 1). We then skip a row and
write “Line N” where N is numbered from 3 to 7, pausing 1 second in
between each row so that you can see our updates happening live. (The
cell updates would occur too quickly without the delay. This is the reason
for all the sleep() calls throughout the script.)
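Cells() takes (row, column) numbers, whereas Excel itself displays addresses such as A1, so it can be handy to convert between the two when checking a script's output against the spreadsheet. The helper below is a hypothetical convenience; the name a1() is ours and is not part of any Office API.

```python
def a1(row, col):
    """Convert 1-based (row, col), as passed to Cells(), to A1 notation."""
    letters = ''
    while col > 0:
        # Column letters behave like a base-26 number with no zero digit.
        col, rem = divmod(col - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return '%s%d' % (letters, row)
```

With this, the title cell written by the script, Cells(1, 1), corresponds to a1(1, 1), and the first "Line N" cell is a1(3, 1).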
Lines 26–32
A warning dialog box appears after the demonstration, stating that you
can quit once you have observed the output. The spreadsheet is closed
without saving, ss.Close([SaveChanges=]False), and the application
exits. Finally, the “main” part of the script initializes Tk and runs the core
part of the application.
Running this script results in an Excel application window, which should
look similar to Figure 7-1.
Figure 7-1 The Python-to-Excel demonstration script (excel.pyw).
7.3.2
Word
The next demonstration involves Word. Using Word for documents is not
as applicable to the programming world because there is not much data
involved. However, you could consider using Word for generating form
letters. In Example 7-2, we create a document by writing one line of text
after another.
Example 7-2
Word Example (word.pyw)
This script launches Word and writes data to the document.
    #!/usr/bin/env python

    from Tkinter import Tk
    from time import sleep
    from tkMessageBox import showwarning
    import win32com.client as win32

    warn = lambda app: showwarning(app, 'Exit?')
    RANGE = range(3, 8)

    def word():
        app = 'Word'
        word = win32.gencache.EnsureDispatch('%s.Application' % app)
        doc = word.Documents.Add()
        word.Visible = True
        sleep(1)

        rng = doc.Range(0,0)
        rng.InsertAfter('Python-to-%s Test\r\n\r\n' % app)
        sleep(1)
        for i in RANGE:
            rng.InsertAfter('Line %d\r\n' % i)
            sleep(1)
        rng.InsertAfter("\r\nTh-th-th-that's all folks!\r\n")

        warn(app)
        doc.Close(False)
        word.Application.Quit()

    if __name__=='__main__':
        Tk().withdraw()
        word()
The Word example follows pretty much the same script as the Excel
example. The only difference is that instead of writing in cells, we insert
the strings into the text “range” of our document; each InsertAfter() call appends its string to the end of the range, so successive writes follow one another. We also must manually provide the line termination
characters, carriage RETURN followed by NEWLINE (\r\n).
When you run this script, the resulting screen might look like Figure 7-2.
Figure 7-2 The Python-to-Word demonstration script (word.pyw).
7.3.3
PowerPoint
Applying PowerPoint in an application might not seem commonplace, but
you could consider using it when you are rushed to make a presentation.
You can create your bullet points in a text file on the plane, and then upon
arrival at the hotel that evening, use a script that parses the file and autogenerates a set of slides. You can further enhance those slides by adding in
a background, animation, etc., all of which are possible through the COM
interface. Another use case would be if you had to auto-generate or modify
new or existing presentations. You can create a COM script controlled via a
shell script to create and tweak each presentation. Okay, enough speculation;
let’s take a look at Example 7-3 to see our PowerPoint example in action.
Example 7-3
PowerPoint Example (ppoint.pyw)
This script launches PowerPoint and writes data to the “shapes” on a slide.
    #!/usr/bin/env python

    from Tkinter import Tk
    from time import sleep
    from tkMessageBox import showwarning
    import win32com.client as win32

    warn = lambda app: showwarning(app, 'Exit?')
    RANGE = range(3, 8)

    def ppoint():
        app = 'PowerPoint'
        ppoint = win32.gencache.EnsureDispatch('%s.Application' % app)
        pres = ppoint.Presentations.Add()
        ppoint.Visible = True

        s1 = pres.Slides.Add(1, win32.constants.ppLayoutText)
        sleep(1)
        s1a = s1.Shapes[0].TextFrame.TextRange
        s1a.Text = 'Python-to-%s Demo' % app
        sleep(1)
        s1b = s1.Shapes[1].TextFrame.TextRange
        for i in RANGE:
            s1b.InsertAfter("Line %d\r\n" % i)
            sleep(1)
        s1b.InsertAfter("\r\nTh-th-th-that's all folks!")

        warn(app)
        pres.Close()
        ppoint.Quit()

    if __name__=='__main__':
        Tk().withdraw()
        ppoint()
Again, you will notice similarities to both the preceding Excel and Word
demonstrations. Where PowerPoint differs is in the objects to which you
write data. Instead of a single active sheet or document, PowerPoint is
somewhat trickier because with a presentation, you have multiple slides,
and each slide can have a different layout. (Recent versions of PowerPoint
have 30 different layouts!) The actions you can perform on a slide depend
on which layout you have chosen.
In our example, we just use a title and text layout (line 17) and fill in the
main title (lines 19–20), Shapes[0] or Shapes(1), and the text portion (lines
22–26), Shapes[1] or Shapes(2). (Python sequences begin at index 0, while
Microsoft software starts at 1.) To figure out which constant to use, you will
need a list of all those that are available to you. For example, ppLayoutText
is defined as a constant with a value of 2 (integer), ppLayoutTitle is 1, etc.
You can find the constants in most Microsoft VB/Office programming
books or online by just searching on the names. Alternatively, you can just
use the integer constants without having to name them via win32.constants.
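As a small illustration of both points (plain integers in place of named constants, and the off-by-one between Python and COM indexing), consider the sketch below. The two values are the ones quoted in the text; PP_LAYOUT_TITLE, PP_LAYOUT_TEXT, and com_index() are names made up for this example.

```python
# Values quoted in the text; every other layout has its own number.
PP_LAYOUT_TITLE = 1
PP_LAYOUT_TEXT = 2

def com_index(py_index):
    """Convert a 0-based Python index to the 1-based index COM collections use."""
    if py_index < 0:
        raise IndexError('negative indexes have no COM equivalent')
    return py_index + 1
```

Under these assumptions, pres.Slides.Add(1, PP_LAYOUT_TEXT) requests the same layout as line 17 of Example 7-3, and s1.Shapes(com_index(0)) refers to the same title shape as s1.Shapes[0].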
The PowerPoint screenshot is shown in Figure 7-3.
Figure 7-3 The Python-to-PowerPoint demonstration script (ppoint.pyw).
7.3.4
Outlook
Finally, we present an Outlook demonstration, which uses even more constants than PowerPoint. Outlook is a fairly common and versatile tool, so using it in an application makes sense, just as it does for Excel. There are always
e-mail addresses, messages, and other data that can be easily manipulated
in a Python program. Example 7-4 is an Outlook example that does a little
bit more than our previous examples.
Example 7-4
Outlook Example (olook.pyw)
This script launches Outlook, creates a new message, sends it, and lets you view
it by opening and displaying both the Outbox and the message itself.
    #!/usr/bin/env python

    from Tkinter import Tk
    from time import sleep
    from tkMessageBox import showwarning
    import win32com.client as win32

    warn = lambda app: showwarning(app, 'Exit?')
    RANGE = range(3, 8)

    def outlook():
        app = 'Outlook'
        olook = win32.gencache.EnsureDispatch('%s.Application' % app)

        mail = olook.CreateItem(win32.constants.olMailItem)
        recip = mail.Recipients.Add('[email protected]')
        subj = mail.Subject = 'Python-to-%s Demo' % app
        body = ["Line %d" % i for i in RANGE]
        body.insert(0, '%s\r\n' % subj)
        body.append("\r\nTh-th-th-that's all folks!")
        mail.Body = '\r\n'.join(body)
        mail.Send()

        ns = olook.GetNamespace("MAPI")
        obox = ns.GetDefaultFolder(win32.constants.olFolderOutbox)
        obox.Display()
        obox.Items.Item(1).Display()

        warn(app)
        olook.Quit()

    if __name__=='__main__':
        Tk().withdraw()
        outlook()
In this example, we use Outlook to send an e-mail to ourselves. To make
the demonstration work, you need to turn off your network access so that
you do not really send the message, and thus are able to view it in your
Outbox folder (and delete it after viewing, if you like). After launching
Outlook, we create a new mail message and fill out the various fields such
as recipient, subject, and body (lines 15–21). We then call the Send()
method (line 22) to spool the message to the Outbox, from which it will be moved
to “Sent Mail” once the message has actually been transmitted to the mail
server.
Like PowerPoint, there are many constants available; olMailItem (with a
constant value of 0) is the one used for e-mail messages. Other popular
Outlook items include olAppointmentItem (1), olContactItem (2), and
olTaskItem (3). Of course, there are more, so you will need to find a VB/
Office programming book or search for the constants and their values
online.
In the next section (lines 24–27), we use another constant, olFolderOutbox (4), to open the Outbox folder and bring it up for display. We find
the most recent item (hopefully the one we just created) and display it, as
well. Other popular folders include: olFolderInbox (6), olFolderCalendar (9), olFolderContacts (10), olFolderDrafts (16), olFolderSentMail
(5), and olFolderTasks (13). If you use dynamic dispatch, you will likely
have to use the numeric values instead of the constants’ names (see the
previous Core Note).
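If you do go the dynamic-dispatch route, one option is to keep the numeric values in a small table of your own. The sketch below uses only the values quoted in the text above; the dictionary names and the ol_constant() helper are invented for illustration and are not part of the win32com package.

```python
# Numeric values as quoted in the text above.
OL_ITEMS = {
    'olMailItem': 0,
    'olAppointmentItem': 1,
    'olContactItem': 2,
    'olTaskItem': 3,
}
OL_FOLDERS = {
    'olFolderOutbox': 4,
    'olFolderSentMail': 5,
    'olFolderInbox': 6,
    'olFolderCalendar': 9,
    'olFolderContacts': 10,
    'olFolderTasks': 13,
    'olFolderDrafts': 16,
}

def ol_constant(name):
    """Look up an Outlook constant by name in the tables above."""
    for table in (OL_ITEMS, OL_FOLDERS):
        if name in table:
            return table[name]
    raise KeyError(name)
```

A dynamically dispatched script could then write olook.CreateItem(ol_constant('olMailItem')) in place of line 15, and ns.GetDefaultFolder(ol_constant('olFolderOutbox')) in place of line 25.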
Figure 7-4 shows a screen capture of just the message window.
Figure 7-4 The Python-to-Outlook demonstration script (olook.pyw).
Before we get this far, however, Outlook’s security checks come into play. Outlook has historically been vulnerable to all kinds of attacks, so Microsoft has built in some
protection that restricts access to your address book and the ability to send
mail on your behalf. When attempting to access your Outlook data, the
screen shown in Figure 7-5 pops up, in which you must explicitly give permission to an outside program.
Figure 7-5 Outlook address book access warning.
Then, when you are trying to send a message from an external program,
a warning dialog appears, as shown in Figure 7-6; you must wait until the
timer expires before you are allowed to select Yes.
Figure 7-6 Outlook e-mail transmission warning.
Once you pass all the security checks, everything else should work
smoothly. There is software available to help you get around these checks,
but it has to be downloaded and installed separately.
On this book’s Web site at http://corepython.com, you will find an alternative script that combines these four smaller ones into a single application that lets users choose which of these demonstrations to run.
7.4
Intermediate Examples
The examples we’ve looked at so far in this chapter are to get you started
with using Python to control Microsoft Office products. Now let’s look at
several real-world useful applications, some of which I’ve used regularly
for work.
7.4.1
Excel
In this example, we’re going to combine the material from this chapter
with that of Chapter 13, “Web Services.” That chapter features a script,
stock.py (Example 13-1), that uses the Yahoo! Finance service and asks
for stock quote data. Example 7-5 shows how we can merge the stock
quote example with our Excel demonstration script; we will end up with an
application that can download stock quotes from the Net and insert them
directly into Excel, without having to create or use CSV files as a medium.
Example 7-5
Stock Quote and Excel Example (estock.pyw)
This script downloads stock quotes from Yahoo! and writes the data to Excel.
    #!/usr/bin/env python

    from Tkinter import Tk
    from time import sleep, ctime
    from tkMessageBox import showwarning
    from urllib import urlopen
    import win32com.client as win32

    warn = lambda app: showwarning(app, 'Exit?')
    RANGE = range(3, 8)
    TICKS = ('YHOO', 'GOOG', 'EBAY', 'AMZN')
    COLS = ('TICKER', 'PRICE', 'CHG', '%AGE')
    URL = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1c1p2'

    def excel():
        app = 'Excel'
        xl = win32.gencache.EnsureDispatch('%s.Application' % app)
        ss = xl.Workbooks.Add()
        sh = ss.ActiveSheet
        xl.Visible = True
        sleep(1)

        sh.Cells(1, 1).Value = 'Python-to-%s Stock Quote Demo' % app
        sleep(1)
        sh.Cells(3, 1).Value = 'Prices quoted as of: %s' % ctime()
        sleep(1)
        for i in range(4):
            sh.Cells(5, i+1).Value = COLS[i]
        sleep(1)
        sh.Range(sh.Cells(5, 1), sh.Cells(5, 4)).Font.Bold = True
        sleep(1)
        row = 6

        u = urlopen(URL % ','.join(TICKS))
        for data in u:
            tick, price, chg, per = data.split(',')
            sh.Cells(row, 1).Value = eval(tick)
            sh.Cells(row, 2).Value = ('%.2f' % round(float(price), 2))
            sh.Cells(row, 3).Value = chg
            sh.Cells(row, 4).Value = eval(per.rstrip())
            row += 1
            sleep(1)
        u.close()

        warn(app)
        ss.Close(False)
        xl.Application.Quit()

    if __name__=='__main__':
        Tk().withdraw()
        excel()
Line-by-Line Explanation
Lines 1–13
Looking ahead in Chapter 13, we will explore a simple script that fetches
stock quotes from the Yahoo! Finance service. In this chapter, we take the
core component from that script and integrate it into an example that takes
the data and imports it into an Excel spreadsheet.
Lines 15–32
The first part of the core function launches Excel (lines 17–21), as seen earlier. The title and timestamp are then written to cells (lines 23–29), along
with the column headings, which are then styled as bold (line 30). The
remaining cells are dedicated to writing the actual stock quote data, starting in row 6 (line 32).
Lines 34–43
We open the URL as before (line 34), but instead of just writing the data to
standard output, we fill in the spreadsheet cells, one column of data at a
time, and one company per row (lines 35–42).
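The script relies on eval() to strip the quotation marks from the ticker and percentage fields. A more cautious alternative, shown here only as a sketch, is to let the standard csv module do the parsing. The sample line is an assumption about what one line of the downloaded data looks like (quoted ticker, price, change, quoted percentage), inferred from how the script unpacks it; parse_quote() is a name made up for this example.

```python
import csv

# Hypothetical sample of one downloaded line; real data will differ.
SAMPLE = '"YHOO",14.21,+0.25,"+1.79%"\r\n'

def parse_quote(line):
    """Split one CSV quote line into (ticker, price, change, percent)."""
    tick, price, chg, per = next(csv.reader([line.strip()]))
    # csv removes the surrounding quotes, so no eval() is required.
    return tick, float(price), float(chg), per
```

Each row of the spreadsheet could then be filled from the tuple returned by parse_quote(data), without ever evaluating text that came from the network.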
Lines 45–51
The remaining lines of our script mirror code that we have seen before.
Figure 7-7 shows a window with real data after executing our script.
Figure 7-7 The Python-to-Excel stock quote demonstration script (estock.pyw).
Note that the data columns lose the original formatting of the numeric
strings because Excel stores them as numbers, using the default cell format. We lose the formatting of the numbers to two places after the decimal
point; for example, “34.2” is displayed, even though Python passed in
“34.20.” For the “change from previous close” column, we lose not only
the decimal places but also the plus sign (+) that indicates a positive
change in value. (Compare the output in Excel to the output from the
original text version, which you can see in Example 13-1 [stock.py], in
Chapter 13. These problems will be addressed by an exercise at the end of
this chapter.)
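One possible direction for that exercise, offered here as a hedged sketch rather than the intended solution, is to hand Excel real numbers and set each cell's NumberFormat property so that the display keeps two decimal places and an explicit sign. write_quote() is a name invented for this example; it assumes that sheet behaves like the ActiveSheet object used in Example 7-5.

```python
def write_quote(sheet, row, tick, price, chg, per):
    """Write one quote row, telling Excel how to display the numbers."""
    sheet.Cells(row, 1).Value = tick

    cell = sheet.Cells(row, 2)
    cell.Value = price
    cell.NumberFormat = '0.00'               # always two decimal places

    cell = sheet.Cells(row, 3)
    cell.Value = chg
    cell.NumberFormat = '+0.00;-0.00;0.00'   # positive;negative;zero

    sheet.Cells(row, 4).Value = per
```

The format strings are ordinary Excel number-format codes; the three sections separated by semicolons apply to positive, negative, and zero values, respectively.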
7.4.2
Outlook
At first, we wanted to give readers examples of Outlook scripts that
manipulate your address book or that send and receive e-mail. However,
given all the security issues with Outlook, we decided to avoid those categories, yet still give you a very useful example.
Those of us who work daily on the command-line building applications
are used to certain text editors to help us do our work. Without getting
into any religious wars, these tools include Emacs, vi (or its modern
replacement vim or gVim), and others. For users of these tools, editing an
e-mail reply in an Outlook dialog window may not exactly be their cup of
tea. In comes Python to the rescue.
This script, inspired by John Klassa’s original 2001 creation, is very simple: when you reply to an e-mail message in Outlook, it launches your editor of choice, brings in the content of the e-mail reply that is currently in
the crude-editing dialog window, lets you edit the rest of the message to
your heart’s desire in your favorite editor, and then when exiting, replaces
the dialog window content with the text you just edited. You only need to
click the Send button.
You can run the tool from the command-line. We’ve named it
outlook_edit.pyw. The .pyw extension suppresses the terminal window; the intention is to run a GUI application for
which interaction through a terminal isn’t necessary. Before we look at the code, let’s
describe how it works. When it’s started, you’ll see its simple user interface, as shown in Figure 7-8.
Figure 7-8 The Outlook e-mail editor GUI control panel (outlook_edit.pyw).
As you’re going through your e-mail, there might be one to which you
want to respond, so you click the Reply button to bring up a pop-up window similar to that (except for the contents, of course) in Figure 7-9.
Now, rather than editing in this poor dialog window, you prefer to use
your editor of choice. Once you’ve set one up to use with outlook_edit.pyw, click the GUI’s
Edit button. We hardcoded it to be gVim 7.3 in this example, but there’s no
reason why you can’t use an environment variable or let the user specify this
on the command-line (see the related exercise at the end of the chapter).
For the figures in this section, we’re using Outlook 2003. When this version of Outlook detects an outside script that is requesting access to it, it
displays the same warning dialog as that shown in Figure 7-5. Once you
“opt-in,” a new gVim window pops open, including the contents of the
Outlook reply dialog box. An example of ours is shown in Figure 7-10.
Figure 7-9 Standard Outlook reply dialog window.
At this point, you can add your reply, editing any other part of the message as desired. We’ll just do a quick and friendly reply (Figure 7-11). Saving the file and quitting the editor results in that window closing and the
contents of your reply pushed back into the Outlook reply dialog box (see
Figure 7-12) that you didn’t want to deal with to begin with. The only
thing you need to do here is to click the Send button, and you’re done!
Now let’s take a look at the script itself, shown in Example 7-6. You will
see from the line-by-line description of the code that this script is broken
up into four main parts: hook into Outlook and grab the current item
being worked on; clean the text in the Outlook dialog and transfer it to a
temporary file; spawn the editor opened against the temporary text file;
and read the contents of the edited text file and push them back into that
dialog window.
Figure 7-10 Outlook dialog contents in a spawned gVim editor window.
Figure 7-11 An edited reply in the gVim editor window.
Figure 7-12 Back to the Outlook dialog with our modified contents.
Example 7-6
Outlook Editor Example (outlook_edit.pyw)
Why edit your Outlook new or reply messages in a dialog window?
    #!/usr/bin/env python

    from Tkinter import Tk, Frame, Label, Button, BOTH
    import os
    import tempfile
    import win32com.client as win32

    def edit():
        olook = win32.Dispatch('Outlook.Application')
        insp = olook.ActiveInspector()
        if insp is None:
            return
        item = insp.CurrentItem
        if item is None:
            return
        body = item.Body

        tmpfd, tmpfn = tempfile.mkstemp()
        f = os.fdopen(tmpfd, 'a')
        f.write(body.encode(
            'ascii', 'ignore').replace('\r\n', '\n'))
        f.close()

        #ed = r"d:\emacs-23.2\bin\emacsclientw.exe"
        ed = r"c:\progra~1\vim\vim73\gvim.exe"
        os.spawnv(os.P_WAIT, ed, [ed, tmpfn])

        f = open(tmpfn, 'r')
        body = f.read().replace('\n', '\r\n')
        f.close()
        os.unlink(tmpfn)
        item.Body = body

    if __name__=='__main__':
        tk = Tk()
        f = Frame(tk, borderwidth=2)
        f.pack(fill=BOTH)
        Label(f,
            text="Outlook Edit Launcher v0.3").pack()
        Button(f, text="Edit",
            fg='blue', command=edit).pack(fill=BOTH)
        Button(f, text="Quit",
            fg='red', command=tk.quit).pack(fill=BOTH)
        tk.mainloop()
Line-by-Line Explanation
Lines 1–6
Although Tk does not play a huge role in any of the examples in this chapter,
it provides an execution shell with which to control the interface between
the user and the target Office application. Accordingly, we need a bunch of
Tk constants and widgets for this application. There are a bunch of operating system items that we need, so we import the os module (well, nt actually). tempfile is a Python module that we haven’t really discussed, but it
provides a variety of utilities and classes that developers can use to create
temporary files, filenames, and directories. Finally, we need our PC connectivity to Office applications and their COM servers.
Lines 8–15
The only real PC COM client lines of code are here, obtaining a handle to
the running instance of Outlook, looking for the active dialog (should be an
olMailItem) that is being worked on. If it cannot do this inspection or find
the current item, the function returns quietly. You will know if this is the
case because control of the Edit button comes back immediately rather than
being grayed-out (if all went well and the editor window pops up).
Note that we’re choosing to use dynamic dispatch here instead of static
(win32.Dispatch() vs. win32.gencache.EnsureDispatch()) because dynamic
usually has quicker startup, and we’re not using any of the cached constant values in this script.
Lines 16–22
Once the current dialog (compose new or reply) window is identified, the
first thing we do in this section is to grab the text and write it to a temporary file. Admittedly, the handling of Unicode text and diacritic characters
is not good here; we’re filtering all non-ASCII characters out of the dialog
box. (One of the exercises at the end of this chapter is to right this
wrong and tweak the script so it works correctly with Unicode.)
Originally, Unix-flavored editors did not like to deal with the carriage
RETURN-NEWLINE pair used as line termination characters in files created
on PCs, so another piece of processing that’s done pre- and post-editing is
to convert these to pure NEWLINEs before sending the file to the editor
and then add them back after editing is complete. Modern text-based editors handle \r\n more cleanly, so this isn’t as much of an issue as it was in
the past.
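The round trip through the temporary file can be isolated from Outlook and tried on its own. The sketch below is one possible arrangement, not the script's actual code: to_tempfile() and from_tempfile() are hypothetical names, and writing the file as UTF-8 (instead of dropping non-ASCII characters) is an assumption that your editor reads and writes UTF-8.

```python
import io
import os
import tempfile

def to_tempfile(body):
    """Write body to a new temporary file with bare NEWLINE terminators."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    # newline='' turns off newline translation, so we control it ourselves.
    with io.open(fd, 'w', encoding='utf-8', newline='') as f:
        f.write(body.replace(u'\r\n', u'\n'))
    return path

def from_tempfile(path):
    """Read the edited file back, restore \\r\\n pairs, and delete the file."""
    with io.open(path, 'r', encoding='utf-8', newline='') as f:
        text = f.read()
    os.unlink(path)
    # Normalize first, in case the editor saved \r\n terminators itself.
    return text.replace(u'\r\n', u'\n').replace(u'\n', u'\r\n')
```

In the script, the editor would be spawned on the returned path between the two calls, and the result of from_tempfile() would be assigned back to item.Body.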
Lines 24–26
Here’s where a bit of magic happens: after setting our editor (on line 25,
where we specify the location of the vim binary on our system; Emacs
users will do something like line 24 which is commented out), we launch
the editor with the temporary filename as the argument (assuming that
the editor takes the target filename on the command-line as the first argument after the program name). This is done via the call to os.spawnv() on
line 26.
The P_WAIT flag is used to “pause” the main (parent) process until the
spawned (child) process has completed. In other words, we do want to
keep the Edit button grayed-out so that the user does not try to edit more
than one reply at a time. It sounds like a limitation, but it helps the user
focus and not have partially-edited replies all over the desktop.
To further expand on what else you can do with spawnv(), this flag
works on both POSIX and Windows systems just like P_NOWAIT (which
does the opposite—do not wait for the child to finish, running both processes in parallel). The last two possible flags, P_OVERLAY and P_DETACH, are
only valid on Windows. P_OVERLAY causes the child to replace the parent
like the POSIX exec() call, and P_DETACH, like P_NOWAIT, starts the child
running in parallel with the parent, except it does so in the background,
“detached” from a keyboard or console.
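The wait-for-the-child behavior of P_WAIT is easy to observe without any editor at all. In this minimal sketch, a second Python interpreter stands in for the editor, and the standard subprocess module is used in place of os.spawnv(); that substitution is our choice for the illustration, not what the script does. run_and_wait() is a hypothetical name.

```python
import subprocess
import sys

def run_and_wait(argv):
    """Run argv as a child process and block until it exits, like P_WAIT."""
    return subprocess.call(argv)

# A child Python process stands in for the editor; it simply exits with
# status 3, which the parent receives only once the child has finished.
status = run_and_wait([sys.executable, '-c', 'import sys; sys.exit(3)'])
```

As with os.spawnv() in P_WAIT mode, the value that comes back is the child's exit code, and the parent does nothing else until it arrives.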
One of the exercises at the end of this chapter is to make this part of the
code more flexible. As we hinted a bit earlier, you should be able to specify
your editor of choice here via the command-line or through the use of an
environment variable.
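A minimal sketch of such a lookup follows. The precedence (command-line argument first, then environment variable, then the hardcoded default) and the variable name EDITOR are assumptions made for this illustration; choose_editor() is a hypothetical helper, not part of the script.

```python
import os

# The path hardcoded in Example 7-6 serves as the last resort.
DEFAULT_EDITOR = r'c:\progra~1\vim\vim73\gvim.exe'

def choose_editor(argv, environ=None):
    """Pick an editor: command-line argument, then EDITOR, then the default."""
    if environ is None:
        environ = os.environ
    if len(argv) > 1:
        return argv[1]
    return environ.get('EDITOR') or DEFAULT_EDITOR
```

The script could then set ed = choose_editor(sys.argv) (after importing sys) in place of the assignment on line 25.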
Lines 28–32
The next block of code opens the updated temporary file after the editor
has closed, takes its contents, deletes the temporary file, and replaces the
text in the dialog window. Note that we are merely sending this data back
to Outlook—it does not prevent Outlook from mucking with your message; that is, there can be a variety of side effects, some of which include
adding your signature (again), removing NEWLINEs, etc.
Lines 34–44
The application is built around the main block (there is no separate main() function), which uses Tk(inter) to draw up a
simple user interface with a single frame containing a Label with the
application description, plus a pair of buttons: Edit spawns an editor on
the active Outlook dialog window, and Quit terminates this application.
7.4.3
PowerPoint
Our final example of a more realistic application is one that Python users
have requested of me for many years now, and I’m happy to say that I’m
finally able to present it to the community. If you have ever seen me
deliver a presentation at a conference, you will likely have seen my ploy of
showing the audience a plain text version of my talk, perhaps to the shock
and horror of some of the attendees who have yet to hear me speak.
I then launch this script on that plain text file and let the power of
Python autogenerate a PowerPoint presentation, complete with style template, and then start the slide show, much to the amazement of the audience. However, once you realize it’s only a small, easily-written Python
script, you might be less impressed but satisfied that you can do the same
thing too!
The way it works is this: the GUI comes up (see Figure 7-13a) prompting the user to enter the location of the text file. If the user types in a valid
location for the file, things progress, but if the file is not found or “DEMO”
is entered, a demonstration will start. If a filename is given but somehow
can’t be opened by the application, the DEMO string is installed into the text
entry along with the error stating that the file can’t be opened (Figure 7-13b).
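The fallback behavior just described can be sketched independently of the GUI. This is a simplified stand-in, not the handler from Example 7-7: open_source() is a name invented here, DEMO is a shortened placeholder for the built-in demonstration text, and io.StringIO is used so that the snippet also runs on current Python versions.

```python
import io

# Shortened placeholder for the built-in demonstration text.
DEMO = u'''
PRESENTATION TITLE
    optional subtitle
'''

def open_source(filename):
    """Return (file_like, error); error is None if the file could be opened."""
    try:
        return io.open(filename, 'r'), None
    except IOError as e:
        # Typing "DEMO", or any name that cannot be opened, ends up here.
        return io.StringIO(DEMO), e
```

The caller can iterate over the returned object either way, and use the error (when there is one) to build the message shown in the entry field.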
Figure 7-13 Text-to-PowerPoint GUI control panel (txt2ppt.pyw).
(a) Filename entry field clear on start-up (b) DEMO if demo request or error otherwise.
As shown in Figure 7-14, the next step is to connect to the existing
PowerPoint application that is running (or launch one if it isn’t and then
get a handle to it), create a title slide (based on the ALL CAPS slide title),
and then create any other slides based on contents of the plain text file formatted in a pseudo-Python syntax.
Figure 7-14 PowerPoint creating the title slide of the demo presentation.
Figure 7-15 shows the script in mid-flight, creating the final slide of the
demonstration. When this screen was captured, the final line had not been
added to the slide yet (so it’s not a bug in the code).
Figure 7-15 Creating the final slide of the demo presentation.
Finally, the code adds one more auxiliary slide to tell the user the slideshow is set to go (Figure 7-16) and gives a cute little countdown from three
to zero. (The screenshot was taken as the count had already started and
progressed down to two.) The slideshow is then started without any additional processing. Figure 7-17 depicts the plain look (black text on a white
background).
To show it works, now we apply a presentation template (Figure 7-18)
to give it the desired look and feel, and then you can drive it from here on out.
Figure 7-16 Counting down to start the slideshow.
Figure 7-17 The slideshow has started, but no template has been applied (yet).
Figure 7-18 The finished PowerPoint slideshow after the template is applied.
Example 7-7 presents the txt2ppt.pyw script, followed by the corresponding code walkthrough.
Example 7-7
Text-to-PowerPoint converter (txt2ppt.pyw)
This script generates a PowerPoint presentation from a plain text file formatted like
Python code.
    #!/usr/bin/env python

    from Tkinter import Tk, Label, Entry, Button
    from time import sleep
    import win32com.client as win32

    INDENT = '    '
    DEMO = '''
    PRESENTATION TITLE
        optional subtitle

    slide 1 title
        slide 1 bullet 1
        slide 1 bullet 2

    slide 2 title
        slide 2 bullet 1
        slide 2 bullet 2
            slide 2 bullet 2a
            slide 2 bullet 2b
    '''

    def txt2ppt(lines):
        ppoint = win32.gencache.EnsureDispatch(
            'PowerPoint.Application')
        pres = ppoint.Presentations.Add()
        ppoint.Visible = True
        sleep(2)
        nslide = 1
        for line in lines:
            if not line:
                continue
            linedata = line.split(INDENT)
            if len(linedata) == 1:
                title = (line == line.upper())
                if title:
                    stype = win32.constants.ppLayoutTitle
                else:
                    stype = win32.constants.ppLayoutText

                s = pres.Slides.Add(nslide, stype)
                ppoint.ActiveWindow.View.GotoSlide(nslide)
                s.Shapes[0].TextFrame.TextRange.Text = line.title()
                body = s.Shapes[1].TextFrame.TextRange
                nline = 1
                nslide += 1
                sleep((nslide<4) and 0.5 or 0.01)
            else:
                line = '%s\r\n' % line.lstrip()
                body.InsertAfter(line)
                para = body.Paragraphs(nline)
                para.IndentLevel = len(linedata) - 1
                nline += 1
                sleep((nslide<4) and 0.25 or 0.01)

        s = pres.Slides.Add(nslide,win32.constants.ppLayoutTitle)
        ppoint.ActiveWindow.View.GotoSlide(nslide)
        s.Shapes[0].TextFrame.TextRange.Text = "It's time for a slideshow!".upper()
        sleep(1.)
        for i in range(3, 0, -1):
            s.Shapes[1].TextFrame.TextRange.Text = str(i)
            sleep(1.)

        pres.SlideShowSettings.ShowType = win32.constants.ppShowTypeSpeaker
        ss = pres.SlideShowSettings.Run()
        pres.ApplyTemplate(r'c:\Program Files\Microsoft Office\Templates\Presentation Designs\Stream.pot')
        s.Shapes[0].TextFrame.TextRange.Text = 'FINIS'
        s.Shapes[1].TextFrame.TextRange.Text = ''

    def _start(ev=None):
        fn = en.get().strip()
        try:
            f = open(fn, 'U')
        except IOError, e:
            from cStringIO import StringIO
            f = StringIO(DEMO)
            en.delete(0, 'end')
            if fn.lower() == 'demo':
                en.insert(0, fn)
            else:
                import os
                en.insert(0,
                    r"DEMO (can't open %s: %s)" % (
                    os.path.join(os.getcwd(), fn), str(e)))
            en.update_idletasks()
        txt2ppt(line.rstrip() for line in f)
        f.close()

    if __name__=='__main__':
        tk = Tk()
        lb = Label(tk, text='Enter file [or "DEMO"]:')
        lb.pack()
        en = Entry(tk)
        en.bind('<Return>', _start)
        en.pack()
        en.focus_set()
        quit = Button(tk, text='QUIT',
            command=tk.quit, fg='white', bg='red')
        quit.pack(fill='x', expand=True)
        tk.mainloop()
Line-by-Line Explanation
Lines 1–5
Surprisingly, there aren’t that many things to import. Python has almost
everything we need to solve this problem. Like the Outlook dialog editor,
we need to bring in some basic Tk functionality for a shell GUI application
to capture user input. Naturally, you can choose to do it via a command-line interface, as well, but you have enough knowledge to do that on your
own. Sometimes it’s more convenient to have the tool sitting on your desktop waiting for you to use.
The use of the time.sleep() function is purely academic. We’re only
using it to slow down our application. You can choose to leave out all
those calls if you prefer. The reason why we’re using it here as well as our
Excel stock demonstration earlier is to slow things down a bit because the
code generally executes so quickly, people are skeptical that it even did
anything or that it was staged.
The last bit of course, is the lynchpin: the PC library.
Lines 7–21
These are a pair of general global variables that represent two values. The
first is the default indentation level of four spaces, much like the recommended indentation for Python code per the PEP 8 style guide, only this
time, we’re defining the presentation bullet level. The other one is a demonstration slide presentation in case you prefer to see a demonstration of
how the script works or as a backup in case the desired source text file cannot be found by the script. This static string also serves as an example of
how you should structure your source text file. Once you’ve created a presentation, you won’t need to look at this again.
Lines 23–29
These first few lines of the main function, txt2ppt(), launch PowerPoint,
create a new presentation, make the PowerPoint application show up on
the desktop, pause for a few seconds, and then reset the slide count to one.
Lines 30–54
The txt2ppt() function takes one argument: all the lines of the source text
file that comprise the presentation. You can pretty much feed this function
any iterable with one or more lines, and a slide presentation will be created
for you. For the demonstration bullet points, we use a cStringIO.StringIO
object to iterate through the text, and for a real file, we use a generator
expression for each line. Naturally, if you’re using Python 2.3 or older,
you’ll need to change the “genexp” to a list comprehension. True, it’s not as
great for memory, especially with large source files, but what are you going to do?
Back to the processor loop; we skip blank lines, then do a little bit of
magic by string splitting on the indentation. A look at this code snippet
will show you exactly what we’re doing:
>>> 'slide title'.split('    ')
['slide title']
>>> '    1st level bullet'.split('    ')
['', '1st level bullet']
>>> '        2nd level bullet'.split('    ')
['', '', '2nd level bullet']
When there is no indentation, meaning that splitting on the indentation
only leaves a single string, this means we’re starting a new slide and the
text is the slide title. If the length of this list is greater than one, this means
that we have at least one level of indentation and that this is continuing
material of a previous slide (and not the beginning of a new one). For the
former, this affirmative part of the if clause makes up lines 35 to 47. We’ll
focus on this block first, followed by the rest.
The next five lines (35–39) determine whether this is a title slide or a
standard text slide. This is where the ALL CAPS for a title slide comes in.
We just compare the contents to an all-capitalized version of it. If they
match, meaning the text is in CAPS, this means that this slide should use
the title layout, designated by the PC constant ppLayoutTitle. Otherwise,
this is a standard slide with a title and text body (ppLayoutText).
After we’ve determined the slide layout, the new slide is created on line
41, PowerPoint is directed (in line 42) to that slide (by making it the active
slide), and the title or main shape text is set to the content, using title case
(line 43). Note that Python starts counting at zero (Shape[0]), whereas
Microsoft likes to start counting at one (Shape(1))—either syntax is acceptable.
The remaining content to come will be part of Shape[1] (or Shape(2)),
and we call that the body (line 44); for a title slide it will be the subtitle,
and for a standard slide it’s going to be bulleted lines of text.
On the remaining lines in this clause (45–47), we mark that we’ve written the first line on this slide, increment the counter tracking the total
number of slides in the presentation, and then pause so that the user can
see how the Python script was able to control PowerPoint’s execution.
Jumping over the wall to the else-clause, we move to the code that’s
executed for the remaining list on the same slide, filling in the second
shape or body of the slide. Because we have already used the indentation
to indicate where we are and the indentation level, we don’t need those
leading spaces any more, so we strip (str.lstrip()) them out, and then
insert the text into the body (lines 49–50).
The rest of the block indents the text to the correct bullet level (or no
indentation at all if it’s a title slide—setting an indentation level of zero has
no effect on the text), increments the linecount, and adds the minor pause
at the end to slow things down (lines 51–54).
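The parsing logic described for lines 30–54 can be tried out without PowerPoint at all. The following is only a sketch under stated assumptions: parse_slides() and the dictionary keys are hypothetical names that do not appear in txt2ppt.pyw, and the layouts are recorded as plain strings instead of the ppLayoutTitle and ppLayoutText constants. It assumes the same four-space indentation unit as the script.

```python
INDENT = '    '  # four spaces per bullet level, as in the script


def parse_slides(lines):
    """Group source lines into slides without touching PowerPoint."""
    slides = []
    for line in lines:
        if not line.strip():          # skip blank lines
            continue
        parts = line.split(INDENT)
        if len(parts) == 1:           # no indentation: a new slide
            slides.append({
                # ALL CAPS marks a title slide; anything else is a text slide
                'layout': 'title' if line == line.upper() else 'text',
                'title': line.title(),
                'body': [],
            })
        else:                         # indented: belongs to the current slide
            if not slides:
                raise ValueError('indented line before any slide title')
            level = len(parts) - 1    # number of leading INDENT units
            slides[-1]['body'].append((level, line.lstrip()))
    return slides


demo = [
    'PYTHON TALK',
    '    An Introduction',
    '',
    'Why Python?',
    '    Readable',
    '        Even at scale',
]
for slide in parse_slides(demo):
    print(slide)
```

Keeping the text handling separate from the COM calls in this way also makes the format easier to test, because the result is ordinary Python data.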
Lines 56–62
After all the main slides have been created, we add one more title slide at
the end, announcing that it’s time for a slideshow by changing the text
dynamically, counting down by seconds from three to zero.
Lines 64–68
The primary purpose of these lines is to start the slideshow. Actually only
the first two lines (64 and 65) do this. Line 66 applies the template. We do
this after the slideshow has started so that you can see it—it’s more
impressive that way. The last pair of lines in this block of code (67–68) reset
the “it’s time for a slideshow” slide and countdown used earlier.
Lines 70–100
The _start() function is only useful if we run this script from the command-line. We leave txt2ppt() as importable to be used elsewhere, but _start()
requires the GUI. Jumping down momentarily to lines 90–100, you can see
that we create a Tk GUI with a text entry field (with a label prompting the
user to enter a filename or “DEMO” to see the demonstration) and a Quit
button.
So _start() begins (on line 71) by extracting the contents of this entry
field and attempts to open this file (line 73; see the related exercise at the
end of the chapter). If the file is opened successfully, it skips the except
clause and calls txt2ppt() to process the file then closes it when complete
(lines 86–87).
If an exception is encountered, the handler checks to see if the demo
was selected (lines 77–79). If so, it reads the demonstration string into a
cStringIO.StringIO object (line 76) and passes that to txt2ppt(); otherwise, it runs the demonstration anyway but inserts an error message in the
text field to inform the user why the failure occurred (lines 81–84).
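The open-or-fall-back behavior of _start() can be sketched on its own, apart from the GUI. This is a Python 3 rewrite under assumptions, not the book's listing: open_source() is a hypothetical helper, io.StringIO stands in for cStringIO.StringIO, OSError replaces the Python 2 IOError, and DEMO here is a two-line stand-in for the script's demonstration text.

```python
import io
import os

DEMO = 'DEMO TALK\n    A subtitle\n'   # stand-in for the script's demo text


def open_source(name):
    """Return (file-like object, text to show in the entry field)."""
    try:
        return open(name, encoding='utf-8'), name
    except OSError as e:
        if name.lower() == 'demo':          # the user asked for the demo
            note = name
        else:                               # explain why we fell back
            note = "DEMO (can't open %s: %s)" % (
                os.path.join(os.getcwd(), name), e)
        return io.StringIO(DEMO), note


f, note = open_source('demo')
lines = [line.rstrip() for line in f]
f.close()
print(note, lines)
```

Either branch hands back an object that can be iterated line by line, which is all that txt2ppt() requires of its argument.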
7.4.4
Summary
Hopefully, by studying this chapter, you will have received a strong introduction to COM client programming with Python. Although the COM
servers on the Microsoft Office applications are the most robust and full-featured, the material you learned here will apply to other applications
with COM servers, or even OpenOffice, the open-source version of StarOffice, another alternative to Microsoft Office.
Since the acquisition by Oracle of Sun Microsystems, the original corporate sponsor of StarOffice and OpenOffice, the successor to StarOffice has
been announced as Oracle Open Office, and those in the open-source community who feel that the status of OpenOffice has become jeopardized
have forked it as LibreOffice. Since they both come from the same codebase, they share the same COM-style interface known as Universal Network Objects (UNO). You can use the PyUNO module to drive OpenOffice
or LibreOffice applications to process documents, for example, to write PDF
files or to convert from Microsoft Word to the OpenDocument text (ODT)
format, HTML, etc.
7.5
Related Modules/Packages
Python Extensions for Windows
http://pywin32.sf.net
xlrd, xlwt (Python 3 versions available)
http://www.lexicon.net/sjmachin/xlrd.htm
http://pypi.python.org/pypi/xlwt
http://pypi.python.org/pypi/xlrd
pyExcelerator
http://sourceforge.net/projects/pyexcelerator/
PyUNO
http://udk.openoffice.org/python/python-bridge.html
7.6
Exercises
7-1. Web Services. Take the Yahoo! stock quote example
(stock.py) and change the application to save the quote data
to a file instead of displaying it to the screen. Optional: You
can change the script so that users can choose to display the
quote data or save it to a file.
7-2. Excel and Web Pages. Create an application that will read data
from an Excel spreadsheet and map all of it to an equivalent
HTML table. (You can use the third-party HTMLgen module if
desired.)
7-3. Office Applications and Web Services. Interface to any existing
Web service, whether REST or URL-based, and write data to
an Excel spreadsheet, or format the data nicely into a Word
document. Format them properly for printing. Extra Credit:
Support both Excel and Word.
7-4. Outlook and Web Services. Similar to Exercise 7-3, do the same
thing, but put the data into a new e-mail message that you
send by using Outlook. Extra Credit: Do the same thing but
send the e-mail by using regular SMTP instead. (You might
want to refer to Chapter 3, “Internet Client Programming.”)
7-5. Slideshow Generation. In Exercises 7-15 through 7-24, you’ll
build new features into the slideshow generator we introduced earlier in this chapter, txt2ppt.pyw. This exercise
prompts you to think about just the basics but with a nonproprietary format. Implement a script with similar functionality to txt2ppt.pyw, except instead of interfacing with
PowerPoint, your output should use an open-source standard such as HTML5. Take a look at projects such as LandSlide, DZSlides, and HTML5Wow for inspiration. You can
find others at http://en.wikipedia.org/wiki/Web-based_
slideshow. Create a plain-text specification format for your
users, document it, and let your users use this tool to produce something that they can use on stage.
7-6. Outlook, Databases, and Your Address Book. Write a program
that will extract the contents of an Outlook address book and
store the desired fields into a database. The database can be a
text file, DBM file, or even an RDBMS. (You might want to
refer to Chapter 6, “Database Programming.”) Extra Credit:
Do the reverse; read in contact information from a database
(or allow for direct user input) and create or update records
in Outlook.
7-7. Microsoft Outlook and E-mail. Develop a program that backs
up your e-mail by taking the contents of your Inbox and/or
other important folders and saves it in (as close to) regular
“mbox” format to disk.
7-8. Outlook Calendar. Write a simple script that creates new Outlook appointments. Take at least the following as user input:
start date and time, appointment name or subject, and duration of appointment.
7-9. Outlook Calendar. Build an application that dumps the contents of your appointments to a destination of your choice,
for example, to the screen, to a database, to Excel, etc. Extra
Credit: Do the same thing to your set of Outlook tasks.
7-10. Multithreading. Update the Excel version of the stock quote
download script (estock.pyw) so that the downloads of data
happen concurrently using multiple Python threads.
Optional: You might also try this exercise with Visual C++
threads using win32process.beginthreadex().
7-11. Excel Cell Formatting. In the spreadsheet version of the stock
quote download script (estock.pyw), we saw in Figure 7-7
how the stock price does not default to two places after the
decimal point, even if we pass in a string with the trailing
zero(s). When Excel converts it to a number, it uses the
default setting for the number format.
a) Change the numeric format to correctly go out to two
decimal places by changing the cell’s NumberFormat
attribute to 0.00.
b) We also saw that the “change from previous close” column loses the “+” character in addition to the decimal
point formatting. However, we discovered that making
the correction in part (a) to both columns only solves the
decimal place problem; the plus sign is automatically
dropped for any number. The solution here is to change
this column to be text instead of a number. You can do
this by changing the cell’s NumberFormat attribute to @.
c) By changing the cell’s numeric format to text, however,
we lose the right alignment that comes automatically
with numbers. In addition to your solution to part (b),
you must also now set the cell’s HorizontalAlignment
attribute to the PC Excel constant xlRight. After you
come up with the solutions to all three parts, your output
will now look more acceptable, as shown in Figure 7-19.
Figure 7-19 Improving the Python-to-Excel stock quote script (estock.pyw).
7-12. Python 3. Example 7-8 shows the Python 3 version of our first
Excel example (excel3.pyw) along with the changes (in italics).
Given this solution, port all the other scripts in this chapter
to Python 3.
Example 7-8
Python 3 version of Excel Example (excel3.pyw)
Porting the original excel.pyw script is as simple as running the 2to3 tool.
#!/usr/bin/env python3

from time import sleep
from tkinter import Tk
from tkinter.messagebox import showwarning
import win32com.client as win32

warn = lambda app: showwarning(app, 'Exit?')
RANGE = list(range(3, 8))

def excel():
    app = 'Excel'
    xl = win32.gencache.EnsureDispatch('%s.Application' % app)
    ss = xl.Workbooks.Add()
    sh = ss.ActiveSheet
    xl.Visible = True
    sleep(1)

    sh.Cells(1,1).Value = 'Python-to-%s Demo' % app
    sleep(1)
    for i in RANGE:
        sh.Cells(i,1).Value = 'Line %d' % i
        sleep(1)
    sh.Cells(i+2,1).Value = "Th-th-th-that's all folks!"

    warn(app)
    ss.Close(False)
    xl.Application.Quit()

if __name__=='__main__':
    Tk().withdraw()
    excel()
The next pair of exercises pertain to Example 7-6 (outlook_edit.pyw).
7-13. Unicode Support. Fix the outlook_edit.pyw script so that it
works flawlessly with Unicode and diacritic characters. In
other words, do not strip these out. Instead, preserve them,
pass them to the editor, and accept them in messages after
editing so that they can be sent in e-mail messages.
7-14. Robustness. Make the script more flexible by allowing
the user to specify the editor she prefers to use from the
command-line. If one is not provided, the application should
fall back to an environment variable setting, or finally, bring
up one of the editors hardcoded as a last resort.
The next set of exercises pertain to Example 7-7 (txt2ppt.pyw).
7-15. Skip Comments. Modify your script to support comments: if
a line in the text file begins with a hash mark (‘#’, a.k.a.
pound sign, octothorpe, etc.), assume this line doesn’t exist
and move to the next one.
7-16. Improving Title Slide Designation. Come up with a better way
to signify a title slide. Using all capital letters is nice except
for certain situations in which title casing is not desired. For
example, if the user created a talk entitled, “Intro to TCP/IP”,
it will contain errors due to the capitalization of “to” and the
lowercase “cp” and “p” in “Tcp/Ip”:
>>> 'Intro to TCP/IP'.title()
'Intro To Tcp/Ip'
7-17. Side Effects. What happens in _start() if there is a text file
named “demo” in the current folder? Is this a bug or a feature? Can we improve this situation in any way? If so, code
it. If not, indicate why not.
7-18. Template Specification. Currently in the script, all presentations will apply the design template C:\Program
Files\Microsoft Office\Templates\Presentation
Designs\Stream.pot. That’s boring.
(a) Allow the user to choose from any of the other templates
in that folder or wherever your installation is.
(b) Allow the user to specify their own template (and
its location) from a new entry field in the GUI, the
command-line, or from an environment variable (your
choice). Extra Credit: Support all options here in the order
of precedence given, or give the user a pulldown in the
user interface for the default template options from
part (a).
7-19. Hyperlinking. A talk might feature links in the plain text file.
Make those links active from PowerPoint. Hint: You will
need to set the Hyperlink.Address as the URL to spawn a
browser to visit if a viewer clicks the link in the slide (see the
ActionSettings for a ppMouseClick). Extra Credit: Support
hyperlinks only on the URL text when the link isn’t the only
text on the same line; that is, set the active part of the link to
be just the URL and not any other text on that line.
7-20. Text Formatting. Add the ability to have bold, italics, and
monospaced (for example, Courier) text to presentation contents by supporting some sort of lightweight markup formatting in source text files. We strongly recommend reST
(reStructuredText), Markdown, or similar, like Wiki-style
formatting, such as, ‘monospaced’, *bold*, _italic_, etc. For
more examples, see http://en.wikipedia.org/wiki/Lightweight_markup_language.
7-21. Text Formatting. Add support for other formatting services,
such as underlining, shadowing, other fonts, text color, justification change (left, right, centered, etc.), font sizing, headers and footers, or anything else that PowerPoint supports.
7-22. Images. One important feature we need to add to our application is the ability to have slides with images. Let’s make the
problem easier by requiring you to only support slides with
a title and a single image (resized and centered on a presentation slide). You’ll need to specify a customized syntax for
your users to embed image filenames with, for example,
:IMG:C:/py/talk/images/cover.png. Hints: So far, we’ve
only used the ppLayoutTitle or ppLayoutText slide layouts;
for this exercise, we recommend ppLayoutTitleOnly. Insert
images using Shapes.AddPicture() and resize them using
ScaleHeight() and ScaleWidth() along with data points provided by PageSetup.SlideHeight and PageSetup.SlideWidth
plus the image’s Height and Width attributes.
7-23. Different Layouts. Further extend your solution to Exercise 7-22
so that your script supports slides with multiple images or
slides with images and bulleted text. Mainly, this means
playing around with other layout styles.
7-24. Embedded Videos. Another advanced feature you can add is
the ability to embed YouTube video clips (or other Adobe
Flash applications) in presentations. Similar to Exercise 7-23,
you’ll need to define your own syntax to support this, for
example, :VID:http://youtube.com/v/Tj5UmH5TdfI. Hints:
We recommend the ppLayoutTitleOnly layout again here. In
addition, you’ll need to use Shapes.AddOLEObject() with a
type of 'ShockwaveFlash.ShockwaveFlash.10' or whatever
version your Flash player is.
CHAPTER 8
Extending Python
C is very efficient. Unfortunately, C gets that efficiency by
requiring you to do a lot of low-level management of resources.
With today’s machines as powerful as they are, this is usually a bad
tradeoff—it’s smarter to use a language that uses the machine’s
time less efficiently, but your time much more efficiently.
Thus, Python.
—Eric Raymond, October 1996
In this chapter...
• Introduction/Motivation
• Extending Python by Writing Extensions
• Related Topics
In this chapter, we will discuss how to take code written externally
and integrate that functionality into the Python programming environment. We will first present the motivation for why you would do it, and
then take you through the step-by-step process of how to do it. We should
point out, though, that because extensions are primarily done in the C language, all of the example code you will see in this section is pure C, as a
lowest common denominator. You can also use C++ if you want because
it’s largely a superset of C; if you’re building extensions on PCs by using Microsoft
Visual Studio, you will be using (Visual) C++.
8.1
Introduction/Motivation
In this opening section of the chapter, we’ll define what Python extensions
are, and then try to justify why you would (or wouldn’t) consider creating one.
8.1.1
What Are Extensions?
In general, any code that you write that can be integrated or imported into
another Python script can be considered an extension. This new code can
be written in pure Python or in a compiled language such as C and C++
(or Java for Jython and C# or VisualBasic.NET for IronPython).
One great feature of Python is that its extensions interact with the interpreter in exactly the same way as the regular Python modules. Python was
designed so that the abstraction of module import hides the underlying
implementation details from the code that uses such extensions. Unless
the client programmer searches the file system, he simply wouldn’t be able
to tell whether a module is written in Python or in a compiled language.
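For the curious, the import machinery does expose enough information to tell the two apart when you go looking. The following is a Python 3 sketch, with module_kind() as a hypothetical helper name; it assumes that a module's spec origin reflects how it would be loaded, and the answer for a module such as math can differ from one Python build to another.

```python
import importlib.machinery
import importlib.util


def module_kind(name):
    """Classify a top-level module by how the import system would load it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return 'not found'
    if spec.origin in ('built-in', 'frozen'):
        return spec.origin + ' (compiled into the interpreter)'
    origin = spec.origin or ''
    if origin.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES)):
        return 'compiled extension'
    if origin.endswith(tuple(importlib.machinery.SOURCE_SUFFIXES)):
        return 'Python source'
    return 'other'


for name in ('json', 'sys', 'math'):
    print(name, '->', module_kind(name))
```

Client code that simply writes import json or import math never needs this distinction, which is the point of the abstraction.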
CORE NOTE: Creating extensions on different platforms
We will note here that extensions are generally available in a development environment in which you compile your own Python interpreter. There is a
trade-off between compiling Python manually and obtaining the binaries: although
compilation can be a bit trickier than just downloading and installing binaries,
you have the most flexibility in customizing the version of Python that you are
using. If you intend to create extensions, you should perform this task in a similar
environment.
The examples in this chapter are built on a Unix-based system (which usually
comes with a compiler), but assuming you do have access to a C/C++ (or Java)
compiler and a Python development environment in C/C++ (or Java), the only
differences are in your compilation method. The actual code to make your
extensions usable in the Python world is the same on any platform.
If you are developing for Windows-based PCs, you’ll need Visual C++ “Developer Studio.” The Python distribution comes with project files for version 7.1,
but you can use older versions of VC++.
For more information on building extensions in general:
• C++ on PCs–http://docs.python.org/extending/windows
• Java/Jython–http://wiki.python.org/jython
• IronPython–http://ironpython.codeplex.com
Caution: Although moving binaries between different hosts of the same architecture is generally a non-issue, sometimes slight differences in the compiler or
CPU will cause code not to work consistently.
8.1.2
Why You Want to Extend Python
Throughout the brief history of software engineering, programming languages have always been taken at face value. What you see is what you get;
it was impossible to add new functionality to an existing language. In
today’s programming environment, however, the ability to customize one’s
programming environment is now a desired feature; it also promotes code
reuse. Languages such as Tcl and Python are among the first languages to
provide the ability to extend the base language. So why would you want to
extend a language like Python, which is already feature-rich? There are
several good reasons:
• Added/extra (non-Python) functionality. One reason for
extending Python is the need to have new functionality that is
not provided by the core part of the language. This can be
accomplished in either pure Python or as a compiled
extension, but there are certain things such as creating new
data types or embedding Python in an existing application
that must be compiled.
• Bottleneck performance improvement. It is well known that
interpreted languages do not perform as fast as compiled
languages because the translation must happen on the fly,
at runtime. In general, moving a body of code into an
extension will improve overall performance. The problem is
that moving everything is sometimes not advantageous,
because the cost is high in terms of resources.
It is a wiser bet to do some simple profiling of the code to
identify where the bottlenecks are, and then move only those
pieces of code out to an extension. The gain can be seen more
quickly and without expending as much in terms of resources.
• Keep proprietary source code private. Another important
reason to create extensions is due to one side effect of having
a scripting language. For all the ease-of-use such languages
bring to the table, there really is no privacy as far as source
code is concerned because the executable is the source code.
Code that is moved out of Python and into a compiled
language helps keep proprietary code private because you
ship a binary object. Because these objects are compiled, they
are not as easily reverse-engineered; thus, the source remains
more private. This is key when it involves special algorithms,
encryption or software security, etc.
Another way to keep code private is to ship precompiled .pyc files only. It serves as a good middle ground
between releasing the actual source (.py files) and having to
migrate that code to extensions.
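The "simple profiling" recommended in the performance bullet above can be done with the standard library's cProfile and pstats modules. This is a minimal sketch: app() and slow_sum() are made-up stand-ins for an application and its hot spot, and the sizes are arbitrary.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """A deliberately naive hot spot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def app():
    """Stand-in for an application: one cheap step, one expensive step."""
    cheap = sum(range(100))
    costly = slow_sum(200000)
    return cheap + costly


profiler = cProfile.Profile()
profiler.enable()
app()
profiler.disable()

# Sort by cumulative time and show only the top few entries.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
report = out.getvalue()
print(report)
```

In a report like this, the functions that dominate the cumulative time column are the candidates worth considering for an extension; the rest can usually stay in Python.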
8.1.3
Why You Don’t Want to Extend Python
Before we get into how to write extensions, we want to warn you that you
might not want to do this, after all. You can consider this section a caveat
so that you don’t think there’s any false advertising going on here. Yes,
there are definitely benefits to writing extensions such as those just outlined; however, there are some drawbacks, too:
• You have to write C/C++ code.
• You’ll need to understand how to pass data between Python
and C/C++.
• You need to manage references on your own.
• There are tools that accomplish the same thing—that is, they
generate and take advantage of the performance of C/C++
code without you writing any C/C++ at all. You’ll find some of
these tools at the end of this chapter.
Don’t say we didn’t warn you! Now you may proceed...
8.2
Extending Python by Writing
Extensions
Creating extensions for Python involves three main steps:
1. Creating application code
2. Wrapping code with boilerplates
3. Compilation and testing
In this section, we will break out and expose all three stages.
8.2.1
Creating Your Application Code
First, before any code becomes an extension, you need to create a standalone “library.” In other words, create your code keeping in mind that it is
going to turn into a Python module. Design your functions and objects
with the vision that Python code will be communicating and sharing data
with your C code, and vice versa.
Next, create test code to bulletproof your software. You can even use the
Pythonic development method of designating your main() function in C as
the testing application so that if your code is compiled, linked, and loaded
into an executable (as opposed to just a shared object), the invocation of
such an executable will result in a regression test of your software library.
For our extension example that follows, this is exactly what we do.
The test case involves two C functions that we want to bring to the
world of Python programming. The first is the recursive factorial function,
fac(). The second, reverse(), is a simple string reverse algorithm, whose
main purpose is to reverse a string “in place,” that is, to return a string
whose characters are all reversed from their original positions, all without
allocating a separate string to copy in reverse order. Because this involves
the use of pointers, we need to carefully design and debug our code before
bringing Python into the picture.
Our first version, Extest1.c, is presented in Example 8-1.
Example 8-1
Pure C Version of Library (Extest1.c)
This code represents our library of C functions, which we want to wrap so that
we can use it from within the Python interpreter. main() is our tester function.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int fac(int n)
{
    if (n < 2) return(1);    /* 0! == 1! == 1 */
    return (n)*fac(n-1);     /* n! == n*(n-1)! */
}

char *reverse(char *s)
{
    register char t,                    /* tmp */
            *p = s,                     /* fwd */
            *q = (s + (strlen(s)-1));   /* bwd */

    while (p < q)                       /* if p < q */
    {                                   /* swap & mv ptrs */
        t = *p;
        *p++ = *q;
        *q-- = t;
    }
    return s;
}

int main()
{
    char s[BUFSIZ];
    printf("4! == %d\n", fac(4));
    printf("8! == %d\n", fac(8));
    printf("12! == %d\n", fac(12));
    strcpy(s, "abcdef");
    printf("reversing 'abcdef', we get '%s'\n", \
        reverse(s));
    strcpy(s, "madam");
    printf("reversing 'madam', we get '%s'\n", \
        reverse(s));
    return 0;
}
This code consists of a pair of functions, fac() and reverse(), which are
implementations of the functionality we just described. fac() takes a single
integer argument and recursively calculates the result, which is eventually
returned to the caller once it exits the outermost call.
The last piece of code is the required main() function. We use it to be
our tester, sending various arguments to fac() and reverse(). With this
function, we can determine whether our code actually works.
Now we should compile the code. For many versions of Unix with the
gcc compiler, we can use the following command:
$ gcc Extest1.c -o Extest
$
To run our program, we issue the following command and get the output:
$ Extest
4! == 24
8! == 40320
12! == 479001600
reversing 'abcdef', we get 'fedcba'
reversing 'madam', we get 'madam'
$
We stress again that you should try to complete your code as much as
possible, because you do not want to mix debugging of your library with
potential bugs when integrating with Python. In other words, keep the
debugging of your core code separate from the debugging of the integration. The closer you write your code to Python interfaces, the sooner your
code will be integrated and work correctly.
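One way to keep those two kinds of debugging apart is to write down, in plain Python, what the finished module is expected to do. The sketch below is an assumption about how you might organize such a check, not part of the book's example: it mirrors the C fac() and reverse() and reuses the expected values printed by the C test program. Note that Python integers do not overflow, whereas a 32-bit C int cannot hold 13!.

```python
def fac(n):
    """Same recursion as the C fac(): 0! == 1! == 1."""
    if n < 2:
        return 1
    return n * fac(n - 1)


def reverse(s):
    """Swap characters from both ends, as the C pointers p and q do."""
    chars = list(s)               # Python strings are immutable
    p, q = 0, len(chars) - 1
    while p < q:
        chars[p], chars[q] = chars[q], chars[p]
        p += 1
        q -= 1
    return ''.join(chars)


if __name__ == '__main__':
    # Expected values copied from the C program's output.
    assert fac(4) == 24
    assert fac(8) == 40320
    assert fac(12) == 479001600
    assert reverse('abcdef') == 'fedcba'
    assert reverse('madam') == 'madam'
    print('all checks passed')
```

Once the compiled module exists, the same assertions can be pointed at it instead, so any difference can be attributed to the integration code and not to the library.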
Each of our functions takes a single value and returns a single value. It’s
pretty cut and dried, so there shouldn’t be a problem integrating with
Python. Note that, so far, we have not seen any connection or relationship
with Python. We are simply creating a standard C or C++ application.
8.2.2
Wrapping Your Code in Boilerplate
The entire implementation of an extension primarily revolves around the
“wrapping” concept that should seem familiar to you: composite classes,
decorator functions, class delegation, etc. You should design your code in
such a way that there is a smooth transition between the world of Python
and your implementing language. This interfacing code is commonly
called boilerplate code because it is a necessity if your code is to talk to the
Python interpreter.
There are four main pieces to the boilerplate software:
1. Include a Python header file
2. Add PyObject* Module_func() Python wrappers for each module function
3. Add a PyMethodDef ModuleMethods[] array/table for each module
function
4. Add a void initModule() module initializer function
Including the Python Header File
The first thing you should do is to find your Python include files and
ensure that your compiler has access to that directory. On most Unix-based systems, this would be either /usr/local/include/python2.x or
/usr/include/python2.x, where 2.x is your version of Python. If you
compiled and installed your Python interpreter, you should not have a
problem, because the system generally knows where your files are installed.
Add the inclusion of the Python.h header file to your source. The line
will look something like:
#include "Python.h"
That is the easy part. Now you have to add the rest of the boilerplate
software.
Add PyObject* Module_func() Python Wrappers for
Each Function
This part is the trickiest. For each function that you want accessible to the
Python environment, you will create a static PyObject* function with
the module name along with an underscore (_) prepended to it.
For example, we want fac() to be one of the functions available for
import from Python and we will use Extest as the name of our final module, so we create a wrapper called Extest_fac(). In the client Python script,
there will be an import Extest and an Extest.fac() call somewhere (or
just fac() for from Extest import fac).
The job of the wrapper is to take Python values, convert them to C, and
then make a call to the appropriate function with what we want. When our
function has completed and it is time to return to the world of Python, it is
also the job of this wrapper to take whatever return values we designate,
convert them to Python, and then perform the return, passing back any values as necessary.
In the case of fac(), when the client program invokes Extest.fac(), our
wrapper will be called. We will accept a Python integer, convert it to a C
integer, call our C function fac(), and then obtain another integer result.
We then have to take that return value, convert it back to a Python integer,
and then return from the call. (Keep in mind that you are writing the code
that will proxy for a def fac(n) declaration. When you are returning, it is
as if that imaginary Python fac() function is completing.)
So, you’re asking, how does this conversion take place? The answer is
with the PyArg_Parse*() functions when going from Python to C, and
Py_BuildValue() when returning from C to Python.
The PyArg_Parse*() functions are similar to the C sscanf() function. They
take their input and, according to a format string, parcel the values
off to corresponding container variables, which, as expected, are passed
as pointer addresses. Both PyArg_ParseTuple() and PyArg_ParseTupleAndKeywords() return a nonzero (true) value on successful parsing, and 0 otherwise.
Py_BuildValue() works like sprintf(), taking a format string and converting all arguments to a single returned object containing those values in
the formats that you requested.
You will find a summary of these functions in Table 8-1.
Table 8-1 Converting Data Between Python and C/C++
Function                                Description
Python to C
int PyArg_ParseTuple()                  Converts (a tuple of) arguments passed from Python to C
int PyArg_ParseTupleAndKeywords()       Same as PyArg_ParseTuple() but also parses keyword arguments
C to Python
PyObject* Py_BuildValue()               Converts C data values into a Python return object, either a single object or a single tuple of objects
A set of conversion codes is used to convert data objects between C and
Python; they are given in Table 8-2.
Table 8-2 Python(a) and C/C++ Conversion “Format Units”
Format Unit    Python Type                  C/C++ Type
s, s#          str/unicode, len()           char*(, int)
z, z#          str/unicode/None, len()      char*/NULL(, int)
u, u#          unicode, len()               (Py_UNICODE*, int)
i              int                          int
b              int                          char
h              int                          short
l              int                          long
k              int or long                  unsigned long
I              int or long                  unsigned int
B              int                          unsigned char
H              int                          unsigned short
L              long                         long long
K              long                         unsigned long long
c              str                          char
d              float                        double
f              float                        float
D              complex                      Py_Complex*
O              (any)                        PyObject*
S              str                          PyStringObject
N(b)           (any)                        PyObject*
O&             (any)                        (any)
(a) These format codes are for Python 2 but have near equivalents in Python 3.
(b) Like “O” except it does not increment object’s reference count.
These conversion codes are the ones given in the respective format strings
that dictate how the values should be converted when moving between
both languages. Note that the conversion types are different for Java
because all data types are classes. Consult the Jython documentation to
obtain the corresponding Java types for Python objects. The same applies for
C# and VB.NET.
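Python's own struct module uses a similar idea at the Python level: a compact format string names the C types that values are packed into or unpacked from. The sketch below is only an analogy and is not part of the Python/C API. The struct codes overlap with some of the format units in Table 8-2 (such as i, h, and d) but are not identical to them, so treat it as an illustration of the format-string concept rather than a reference.

```python
import struct

# "=" requests standard sizes with no padding, so the result does not
# depend on the platform's native alignment rules.
packed = struct.pack("=ihd", 120, 7, 2.5)    # int, short, double
print(len(packed))                            # 4 + 2 + 8 bytes

# Unpacking uses the same format string to recover the Python values.
num, small, real = struct.unpack("=ihd", packed)
print(num, small, real)

# A value that does not fit the requested C type is rejected rather than
# silently truncated.
try:
    struct.pack("=b", 200)                    # signed char holds -128..127
except struct.error as exc:
    print("rejected:", exc)
```

As with the format units, the format string is the contract between the two sides: if it does not match the data, the conversion fails instead of guessing.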
Here, we show you our completed Extest_fac() wrapper function:
static PyObject *
Extest_fac(PyObject *self, PyObject *args) {
    int res;              // parse result
    int num;              // arg for fac()
    PyObject* retval;     // return value

    res = PyArg_ParseTuple(args, "i", &num);
    if (!res) {           // TypeError
        return NULL;
    }
    res = fac(num);
    retval = (PyObject*)Py_BuildValue("i", res);
    return retval;
}
The first step is to parse the data received from Python. It should be a
regular integer, so we use the “i” conversion code to indicate as such. If the
value was indeed an integer, then it is stored in the num variable. Otherwise, PyArg_ParseTuple() will return 0 and set an exception, in which case we
return NULL to signal the failure. In our case, the caller sees a TypeError exception that informs
the client user that we are expecting an integer.
We then call fac() with the value stored in num and put the result in
res, reusing that variable. Now we build our return object, a Python integer, again using a conversion code of “i.” Py_BuildValue() creates an integer Python object, which we then return. That’s all there is to it!
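From the Python side, a failed argument parse simply surfaces as an exception. Extest cannot be imported until it is built, so the sketch below uses the built-in chr(), which is implemented in C and expects an integer, as a stand-in to show what a caller would see. The helper name call_or_report is hypothetical, and the exact error message differs between Python versions.

```python
def call_or_report(func, *args):
    """Call func(*args); return its result, or a description of the TypeError."""
    try:
        return func(*args)
    except TypeError as exc:
        return "TypeError: {}".format(exc)

print(call_or_report(chr, 65))      # a valid integer argument
print(call_or_report(chr, "65"))    # wrong type: the C code rejects it
```

A wrapper such as Extest_fac() would be expected to behave the same way when handed a string instead of an integer.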
In fact, once you have created wrapper after wrapper, you tend to
shorten your code somewhat to avoid the extraneous use of variables. Try
to keep your code legible, though. We take our Extest_fac() function and
reduce it to its smaller version given here, using only one variable, num:
static PyObject *
Extest_fac(PyObject *self, PyObject *args) {
    int num;
    if (!PyArg_ParseTuple(args, "i", &num))
        return NULL;
    return (PyObject*)Py_BuildValue("i", fac(num));
}
What about reverse()? Well, given you already know how to return a
single value, we are going to change our reverse() example somewhat,
returning two values instead of one. We will return a pair of strings as a
tuple; the first element being the string as passed in to us, and the second
being the newly reversed string.
To show you that there is some flexibility, we will call this function
Extest.doppel() to indicate that its behavior differs from reverse().
Wrapping our code into an Extest_doppel() function, we get:
static PyObject *
Extest_doppel(PyObject *self, PyObject *args) {
    char *orig_str;
    if (!PyArg_ParseTuple(args, "s", &orig_str)) return NULL;
    return (PyObject*)Py_BuildValue("ss", orig_str, \
        reverse(strdup(orig_str)));
}
As in Extest_fac(), we take a single input value, this time a string, and
store it into orig_str. Notice that we use the “s” conversion code now. We
then call strdup() to create a copy of the string. (Because we want to
return the original one, as well, we need a string to reverse, so the best candidate is just a copy of the string.) strdup() creates and returns a copy,
which we immediately dispatch to reverse(). We get back a reversed string.
As you can see, Py_BuildValue() puts together both strings using a
conversion string of ss. This creates a tuple of two strings: the original
string and the reversed one. End of story, right? Unfortunately, no.
We got caught by one of the perils of C programming: the memory leak
(when memory is allocated but not freed). Memory leaks are analogous to
borrowing books from the library but not returning them. You should always
release resources that you have acquired when you no longer require them.
How did we commit such a crime with our code (which looks innocent
enough)?
When Py_BuildValue() puts together the Python object to return, it
makes copies of the data that has been passed to it. In our case here, that
would be a pair of strings. The problem is that we allocated the memory
for the second string, but we did not release that memory when we finished, leaking it. What we really want to do is to build the return object,
and then free the memory that we allocated in our wrapper. We have no
choice but to lengthen our code to:
static PyObject *
Extest_doppel(PyObject *self, PyObject *args) {
    char *orig_str;       // original string
    char *dupe_str;       // reversed string
    PyObject* retval;

    if (!PyArg_ParseTuple(args, "s", &orig_str)) return NULL;
    retval = (PyObject*)Py_BuildValue("ss", orig_str, \
        dupe_str=reverse(strdup(orig_str)));
    free(dupe_str);
    return retval;
}
We introduce the dupe_str variable to point to the newly allocated
string and build the return object. Then we free() the memory allocated
and finally return back to the caller. Now we are done.
Adding PyMethodDef ModuleMethods[] Array/Table
for Each Module Function
Now that both of our wrappers are complete, we want to list them somewhere so that the Python interpreter knows how to import and access
them. This is the job of the ModuleMethods[] array.
It is made up of an array of arrays, with each individual array containing information about each function, terminated by a NULL array that
marks the end of the list. For our Extest module, we create the following
ExtestMethods[] array:
static PyMethodDef
ExtestMethods[] = {
    { "fac", Extest_fac, METH_VARARGS },
    { "doppel", Extest_doppel, METH_VARARGS },
    { NULL, NULL },
};
The Python-accessible names are given, followed by the corresponding
wrapping functions. The constant METH_VARARGS is given, indicating a set
of arguments in the form of a tuple. If we are using
PyArg_ParseTupleAndKeywords() with keyworded arguments, we would logically OR this
flag with the METH_KEYWORDS constant. Finally, a pair of NULLs properly
terminates our list of two functions.
Adding a void initModule() Module Initializer
Function
The final piece to our puzzle is the module initializer function. This code is
called when our module is imported for use by the interpreter. In this
code, we make one call to Py_InitModule() along with the module name
and the name of the ModuleMethods[] array so that the interpreter can
access our module functions. For our Extest module, our initExtest()
procedure looks like this:
void initExtest() {
    Py_InitModule("Extest", ExtestMethods);
}
We are now done with all our wrapping. We add all this code to our
original code from Extest1.c and merge the results into a new file called
Extest2.c, concluding the development phase of our example.
Another approach to creating an extension would be to make your
wrapping code first, using stubs or test or dummy functions which will,
during the course of development, be replaced by the fully-functional
pieces of implemented code. This way, you can ensure that your interface
between Python and C is correct, and then use Python to test your C code.
8.2.3 Compilation
Now we are on to the compilation phase. To get your new wrapper Python
extension to build, you need to get it to compile with the Python library.
This task has been standardized (since version 2.0) across platforms to
make life a lot easier for extension writers. The distutils package is used to
build, install, and distribute modules, extensions, and packages. It came
about back in Python 2.0 and replaced the old version 1.x way of building extensions that used “makefiles.” Using distutils, we can follow this
easy recipe:
1. Create setup.py
2. Compile and link your code by running setup.py
3. Import your module from Python
4. Test the function
Creating setup.py
The next step is to create a setup.py file. The bulk of the work will be
done by the setup() function. All the lines of code that come before that
call are preparatory steps. For building extension modules, you need to create an Extension instance per extension. Since we only have one, we only
need one Extension instance:
Extension('Extest', sources=['Extest2.c'])
The first argument is the (full) extension name, including any high-level
packages, if necessary. The name should be in full dotted-attribute notation. Ours is stand-alone, hence the name “Extest.” sources is a list of all
the source files. Again, we only have the one, Extest2.c.
Now we are ready to call setup(). It takes a name argument for what it
is building and a list of the items to build. Because we are creating an
extension, we pass the list of extension modules to build as ext_modules.
The syntax will be like this:
setup(name='Extest', ext_modules=[...])
Because we only have one module, we combine the instantiation of our
extension module into our call to setup(), setting the module name as
“constant” MOD on the preceding line:
MOD = 'Extest'
setup(name=MOD, ext_modules=[
    Extension(MOD, sources=['Extest2.c'])])
There are many more options to setup(); in fact, they are too numerous
to list here. You can find out more about creating setup.py and calling
setup() in the official Python documentation that we refer to at the end of
this chapter. Example 8-2 shows the complete script that we are using for
our example.
Example 8-2 The Build Script (setup.py)
This script compiles our extension into the build/lib.* subdirectory.

#!/usr/bin/env python

from distutils.core import setup, Extension

MOD = 'Extest'
setup(name=MOD, ext_modules=[
    Extension(MOD, sources=['Extest2.c'])])
Compile and Link Your Code by Running setup.py
Now that we have our setup.py file, we can build our extension by running it with the build directive, as we have done here on our Mac (your
output will differ based on the version of the operating system you are
running as well as the version of Python you are using):
$ python setup.py build
running build
running build_ext
building 'Extest' extension
creating build
creating build/temp.macosx-10.x-fat-2.x
gcc -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -I/usr/include -I/usr/local/include -I/sw/include -I/usr/local/include/python2.x -c Extest2.c -o build/temp.macosx-10.x-fat-2.x/Extest2.o
creating build/lib.macosx-10.x-fat-2.x
gcc -g -bundle -undefined dynamic_lookup -L/usr/lib -L/usr/local/lib -L/sw/lib -I/usr/include -I/usr/local/include -I/sw/include build/temp.macosx-10.x-fat-2.x/Extest2.o -o build/lib.macosx-10.x-fat-2.x/Extest.so
8.2.4 Importing and Testing
The final step is to go back into Python and use our new extension as if it
were written in pure Python.
Importing Your Module from Python
Your extension module will be created in the build/lib.* directory from
where you ran your setup.py script. You can either change to that directory to test your module or install it into your Python distribution with:
$ python setup.py install
If you do install it, you will get the following output:
running install
running build
running build_ext
running install_lib
copying build/lib.macosx-10.x-fat-2.x/Extest.so ->
/usr/local/lib/python2.x/site-packages
Now we can test our module from the interpreter:
>>> import Extest
>>> Extest.fac(5)
120
>>> Extest.fac(9)
362880
>>> Extest.doppel('abcdefgh')
('abcdefgh', 'hgfedcba')
>>> Extest.doppel("Madam, I'm Adam.")
("Madam, I'm Adam.", ".madA m'I ,madaM")
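When checking an extension by hand like this, it can help to have a pure-Python reference to compare against. The following is a minimal sketch, not part of the original Extest code: the names fac, doppel, and check are our own, and check() assumes it is handed a module (or any object) that exposes fac() and doppel(). The inputs stop at 12 because the C fac() returns an int, and 13! no longer fits in a 32-bit int.

```python
def fac(n):
    """Pure-Python counterpart of the C fac() function."""
    return 1 if n < 2 else n * fac(n - 1)

def doppel(s):
    """Return the original string and its reverse, like Extest.doppel()."""
    return s, s[::-1]

def check(module):
    """Compare a module exposing fac() and doppel() against the reference."""
    for n in (0, 1, 5, 9, 12):
        assert module.fac(n) == fac(n), n
    for s in ("abcdefgh", "Madam, I'm Adam."):
        assert module.doppel(s) == doppel(s), s
    return True
```

Once the extension is built and importable, calling check(Extest) would exercise both wrappers against the same inputs used in the interactive session.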
Adding a Test Function
The one last thing we want to do is to add a test function. In fact, we
already have one, in the form of the main() function. Be aware that it is
potentially dangerous to have a main() function in our code because
there should only be one main() in the system. We remove this danger by
changing the name of our main() to test() and wrapping it, adding
Extest_test() and updating the ExtestMethods array so that they both
look like this:
static PyObject *
Extest_test(PyObject *self, PyObject *args) {
    test();
    return (PyObject*)Py_BuildValue("");
}

static PyMethodDef
ExtestMethods[] = {
    { "fac", Extest_fac, METH_VARARGS },
    { "doppel", Extest_doppel, METH_VARARGS },
    { "test", Extest_test, METH_VARARGS },
    { NULL, NULL },
};
The Extest_test() module function just runs test() and then calls
Py_BuildValue() with an empty format string, resulting in a Python value of None being returned to the
caller.
Now we can run the same test from Python:
>>> Extest.test()
4! == 24
8! == 40320
12! == 479001600
reversing 'abcdef', we get 'fedcba'
reversing 'madam', we get 'madam'
>>>
In Example 8-3, we present the final version of Extest2.c that was
used to generate the output we just saw.
Example 8-3 Python-Wrapped Version of C Library (Extest2.c)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int fac(int n)
{
    if (n < 2) return(1);
    return (n)*fac(n-1);
}

char *reverse(char *s)
{
    register char t,
                  *p = s,
                  *q = (s + (strlen(s) - 1));

    while (s && (p < q))
    {
        t = *p;
        *p++ = *q;
        *q-- = t;
    }
    return s;
}

int test()
{
    char s[BUFSIZ];
    printf("4! == %d\n", fac(4));
    printf("8! == %d\n", fac(8));
    printf("12! == %d\n", fac(12));
    strcpy(s, "abcdef");
    printf("reversing 'abcdef', we get '%s'\n", \
        reverse(s));
    strcpy(s, "madam");
    printf("reversing 'madam', we get '%s'\n", \
        reverse(s));
    return 0;
}

#include "Python.h"

static PyObject *
Extest_fac(PyObject *self, PyObject *args)
{
    int num;
    if (!PyArg_ParseTuple(args, "i", &num))
        return NULL;
    return (PyObject*)Py_BuildValue("i", fac(num));
}

static PyObject *
Extest_doppel(PyObject *self, PyObject *args)
{
    char *orig_str;
    char *dupe_str;
    PyObject* retval;

    if (!PyArg_ParseTuple(args, "s", &orig_str))
        return NULL;
    retval = (PyObject*)Py_BuildValue("ss", orig_str, \
        dupe_str=reverse(strdup(orig_str)));
    free(dupe_str);
    return retval;
}

static PyObject *
Extest_test(PyObject *self, PyObject *args)
{
    test();
    return (PyObject*)Py_BuildValue("");
}

static PyMethodDef
ExtestMethods[] =
{
    { "fac", Extest_fac, METH_VARARGS },
    { "doppel", Extest_doppel, METH_VARARGS },
    { "test", Extest_test, METH_VARARGS },
    { NULL, NULL },
};

void initExtest()
{
    Py_InitModule("Extest", ExtestMethods);
}
In this example, we chose to segregate our C code from our Python
code. It just kept things easier to read and is no problem with our short
example. In practice, these source files tend to get large, and some choose
to implement their wrappers completely in a different source file such as
ExtestWrappers.c or something of that nature.
8.2.5 Reference Counting
You might recall that Python uses reference counting as a means of keeping track of objects and de-allocating objects no longer referenced, as part
of the garbage collection mechanism. When creating extensions, you must
pay extra special attention to how you manipulate Python objects, because
you must be mindful of whether you need to change the reference count
for such objects.
There are two types of references that you can have to an object, one of
which is an owned reference, meaning that the reference count to the object
is incremented by one to indicate your ownership. One situation for which
you would definitely have an owned reference is when you create a
Python object from scratch.
When you are done with a Python object, you must dispose of your ownership, either by decrementing the reference count, transferring your ownership
by passing it on, or storing the object. Failure to dispose of an owned reference creates a memory leak.
You can also have a borrowed reference to an object. Somewhat lower on
the responsibility ladder, this is when you are passed the reference of an
object, but otherwise do not manipulate the data in any way. Nor do you
have to worry about its reference count, as long as you do not hold on to
this reference after its reference count has decreased to zero. You might
convert your borrowed reference to an owned reference simply by incrementing an object’s reference count.
Python provides a pair of C macros which are used to change the reference count to a Python object. They are given in Table 8-3.
Table 8-3 Macros for Performing Python Object Reference Counting
Function           Description
Py_INCREF(obj)     Increment the reference count to obj
Py_DECREF(obj)     Decrement the reference count to obj
In our above Extest_test() function, we return None by calling
Py_BuildValue() with an empty format string; however, this can also be accomplished by
becoming an owner of the None object, Py_None, incrementing your reference
count to it, and returning it explicitly, as in the following alternative
piece of code:
static PyObject *
Extest_test(PyObject *self, PyObject *args) {
    test();
    Py_INCREF(Py_None);
    return Py_None;
}
Py_INCREF() and Py_DECREF() also have versions that check for NULL
objects. They are Py_XINCREF() and Py_XDECREF(), respectively.
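The effect of taking and releasing references can also be observed from Python itself. The sketch below assumes the standard CPython interpreter, where sys.getrefcount() reports an object's current count (including one temporary reference held by the call); other implementations may report different numbers or none at all. It only mirrors what Py_INCREF() and Py_DECREF() do and does not call them directly.

```python
import sys

obj = object()
base = sys.getrefcount(obj)      # includes a temporary reference held by the call

alias = obj                      # taking another reference, much like Py_INCREF
after_incref = sys.getrefcount(obj)

del alias                        # giving it up again, much like Py_DECREF
after_decref = sys.getrefcount(obj)

# Report the change relative to the starting count after each step.
print(after_incref - base, after_decref - base)
```

The absolute numbers are an implementation detail; it is the change of plus one and back again that corresponds to the macros in Table 8-3.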
We strongly urge that you consult the Python documentation regarding
extending and embedding Python for all the details with regard to reference
counting (see the documentation reference in Appendix C, “Python 3: The
Evolution of a Programming Language”).
8.2.6 Threading and the GIL
Extension writers must be aware that their code might be executed in a multithreaded Python environment. In Chapter 4, “Multithreaded Programming,”
in Section 4.3.1, we introduced the Python Virtual Machine (PVM) and the
Global Interpreter Lock (GIL), describing how only one thread of execution
can be running at any given time in the PVM and that the GIL is responsible
for keeping other threads from running. Furthermore, we indicated that
code calling external functions, such as in extension code, would keep the
GIL locked until the call returns.
We also hinted that there was a remedy, a way for the extension programmer to release the GIL, for example, before performing a system call.
This is accomplished by “blocking” your code off to where threads may
(and may not) run safely using another pair of C macros,
Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS. A block of code bounded by
these macros will permit other threads to run.
As with the reference counting macros, we urge that you consult the
documentation regarding extending and embedding Python as well as
the Python/C API reference manual.
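You can see the benefit of releasing the GIL without writing any C. In CPython, time.sleep() is implemented in C and releases the GIL while it waits, so several threads blocked in it overlap instead of running one after another. The sketch below is our own illustration (the function names are hypothetical), and the timing it prints will vary from machine to machine.

```python
import threading
import time

def blocking_call(seconds):
    # time.sleep() releases the GIL while waiting, so other threads can run.
    time.sleep(seconds)

def run_concurrently(count, seconds):
    """Start `count` threads that each block for `seconds`; return elapsed time."""
    threads = [threading.Thread(target=blocking_call, args=(seconds,))
               for _ in range(count)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

elapsed = run_concurrently(4, 0.25)
print("4 threads x 0.25s took %.2fs" % elapsed)
```

If the blocking call held the GIL instead, the four waits would be serialized and the total would approach one second; an extension that wraps a long-running system call can get the same overlap by bracketing it with the macros described above.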
8.3 Related Topics
In this final section of this chapter, we’ll look at various tools representing
alternatives to writing extensions (in any supported language). We’ll introduce you to SWIG, Pyrex, Cython, psyco, and PyPy. We end the chapter
with a brief discussion about a related topic, Embedding Python.
8.3.1 The Simplified Wrapper and Interface Generator
There is an external tool available called Simplified Wrapper and Interface
Generator (SWIG). It was written by David Beazley, who is also the author
of Python Essential Reference (Addison-Wesley, 2009). It is a software tool
that can take annotated C/C++ header files and generate wrapped code,
ready to compile for Python, Tcl, and Perl. Using SWIG frees you from
having to write the boilerplate code we’ve seen in this chapter. You only
need to worry about coding the solution part of your project in C/C++. All
you have to do is create your files in the SWIG format, and it will do the
background work on your behalf. You can find out more information
about SWIG from its main Web site:
http://swig.org
http://en.wikipedia.org/wiki/SWIG
8.3.2 Pyrex
One obvious weakness of creating C/C++ extensions (raw or with SWIG) is
that you have to write C/C++ (surprise, surprise), with all of its strengths,
and, more importantly, its pitfalls. Pyrex gives you practically all of the
gains of writing extensions but none of the headache. Pyrex is a new language created specifically for writing Python extensions. It is a hybrid of C
and Python, leaning much more toward Python; in fact, the Pyrex Web site
goes as far as saying that “Pyrex is Python with C data types.” You only need
to write code in the Pyrex syntax and run the Pyrex compiler on the
source. Pyrex creates C files, which can then be compiled and used as you
would a normal extension. Some have sworn off C programming forever
upon discovering Pyrex. You can get Pyrex at its home page:
http://cosc.canterbury.ac.nz/~greg/python/Pyrex
http://en.wikipedia.org/wiki/Pyrex_(programming_language)
8.3.3 Cython
Cython is a fork of Pyrex from 2007; the first release of Cython was 0.9.6,
which came out around the same time as Pyrex 0.9.6. The Cython developers take a more agile and aggressive approach to development
than the Pyrex team, which is more cautious. The
result is that more patches, improvements, and extensions make it into
Cython sooner than into Pyrex, but both are considered active projects. You can read more about Cython and its distinctions from Pyrex via
the links below.
http://cython.org
http://wiki.cython.org/DifferencesFromPyrex
http://wiki.cython.org/FAQ
8.3.4 Psyco
Pyrex and Cython offer the benefit of no longer having to write pure C
code. However, you do need to learn some new syntax (sigh... yet another
language to have to deal with). In the end, your Pyrex/Cython code turns
into C anyway. Developers write extensions or use tools like SWIG or
Pyrex/Cython for that performance boost. However, what if you can
obtain such performance gains without having to write code in a language
other than pure Python?
Psyco’s concept is quite different from those other approaches. Rather
than writing C code, why not just make your existing Python code run
faster? Psyco serves as a just-in-time (JIT) compiler, so you do not have to
change your source other than importing the Psyco module and telling
it to start optimizing your code (during runtime).
Psyco can also profile your code to establish where it can make the most
significant improvements. You can even enable logging to see what Psyco
does while optimizing your code. The only restriction is that it solely supports
32-bit Intel 386 architectures (Linux, Mac OS X, Windows, BSD) running
Python 2.2.2-2.6.x but not version 3.x. Version 2.7 support is not complete (at
the time of this writing). For more information, go to the following links:
http://psyco.sf.net
http://en.wikipedia.org/wiki/Psyco
8.3.5 PyPy
PyPy is the successor project to Psyco. It has a much more ambitious goal
of creating a generalized environment for developing interpreted languages, independent of platform or target execution environment. It all
started innocently, to create a Python interpreter written in Python—in
fact, this is what most people still think PyPy is, while in fact, this specific
interpreter is just part of the entire PyPy ecosystem.
However, this toolset comprises the “real goods,” the power to allow
language designers to only be concerned with the parsing and semantic
analysis of their interpreter language du jour. All of the difficult stuff in
translating to a native architecture, such as memory management, bytecode translation, garbage collection, internal representation of numeric
types, and primitive data structures, is taken care of
for you.
The way it works is that you take your language and implement it with
a restricted, statically-typed version of Python, called RPython. As mentioned above, Python was the first target language, so an interpreter for it
was written in RPython—this is as close to the term “PyPy” as you’re
going to get. However, you can implement any language you want with
RPython, not just Python.
This toolchain will translate your RPython code into something lower-level, like C, Java bytecode, or Common Intermediate Language (CIL),
which is the bytecode for languages written against the Common Language Infrastructure (CLI) standard. In other words, interpreted language
developers only need to worry about language design and much less
about implementation and target architecture. For more information, go to:
http://pypy.org
http://codespeak.net/pypy
http://en.wikipedia.org/wiki/PyPy
8.3.6 Embedding
Embedding is another feature available in Python. It is the inverse of an
extension. Rather than taking C code and wrapping it into Python, you
take a C application and wrap a Python interpreter inside it. This has the
effect of giving a potentially large, monolithic, and perhaps rigid, proprietary, and/or mission-critical application the power of having an embedded
Python interpreter. Once you have Python, well, it’s like a whole new ball
game.
For extension writers, there is a set of official documents that you should
refer to for additional information.
Here are links to some of the Python documentation related to this
chapter’s topics:
Extending and Embedding
http://docs.python.org/ext
Embedding
http://docs.python.org/extending/embedding
Python/C API
http://docs.python.org/c-api
Distributing Python Modules
http://docs.python.org/distutils
8.4 Exercises
8-1. Extending Python. What are some of the advantages of
Python extensions?
8-2. Extending Python. Can you see any disadvantages or dangers
of using extensions?
8-3. Writing Extensions. Obtain a C/C++ compiler and (re)familiarize
yourself with C/C++ programming. Create a simple utility
function that you can make available and configure as an
extension. Demonstrate that your utility executes in both
C/C++ and Python.
8-4. Porting from Python to C. Take several of the exercises you did in
earlier chapters and port them to C/C++ as extension modules.
8-5. Wrapping C Code. Find a piece of C/C++ code, which you
might have done a long time ago but want to port to Python.
Instead of porting, make it an extension module.
8-6. Writing Extensions. In one of the exercises in the object-oriented programming chapter of Core Python Programming
or Core Python Language Fundamentals, you created a
dollarize() function as part of a class to format a floating-point value into a financial numeric string. Create an extension featuring a wrapped dollarize() function and integrate
a regression testing function, for example, test(), into the
module. Extra Credit: In addition to creating a C extension,
also rewrite dollarize() in Pyrex or Cython.
8-7. Extending vs. Embedding. What is the difference between
extending and embedding?
8-8. Not Writing Extensions. Take the C/C++ code you used in
Exercise 8-3, 8-4, or 8-5 and redo it in pseudo-Python via
Pyrex or Cython. Describe your experiences using Pyrex/
Cython versus integrating that code all as part of a C extension.
PART: Web Development
CHAPTER 9: Web Clients and Servers
If you have a browser from CERN’s WWW project
(World Wide Web, a distributed hypertext system) you can
browse a WWW hypertext version of the manual.
—Guido van Rossum, November 1992
(first mention of the Web on the Python mailing list)
In this chapter...
• Introduction
• Python Web Client Tools
• Web Clients
• Web (HTTP) Servers
• Related Modules
9.1 Introduction
Because the universe of Web applications is so expansive, we’ve (re)organized this book in a way that allows readers to focus specifically on multiple aspects of Web development via a set of chapters that cover individual
topics.
Before getting into the nitty-gritty, this introductory chapter on Web
programming will start you off by again focusing on client/server architecture, but this time from the perspective of the Web. It provides a solid foundation for the material in the remaining chapters of the book.
9.1.1 Web Surfing: Client/Server Computing
Web surfing falls under the same client/server architecture umbrella that
we have seen repeatedly. This time, however, Web clients are browsers,
which, of course, are applications that allow users to view documents on
the World Wide Web. On the other side are Web servers, which are processes that run on an information provider’s host computers. These servers wait for clients and their document requests, process them, and then
return the requested data. As with most servers in a client/server system,
Web servers are designed to run indefinitely. The Web surfing experience
is best illustrated by Figure 9-1. Here, a user runs a Web client program,
such as a browser, and makes a connection to a Web server elsewhere on
the Internet to obtain information.
[Figure 9-1 diagram: a Client and a Server connected through the Internet]
Figure 9-1 A Web client and Web server on the Internet. A client sends a request out over the
Internet to the server, which then responds by sending the requested data back to the client.
Clients can issue a variety of requests to Web servers. Such requests
might include obtaining a Web page for viewing or submitting a form
with data for processing. The request is then serviced by the Web server
(and possibly other systems), and the reply comes back to the client in a
special format for display purposes.
The language that is spoken by Web clients and servers, the standard
protocol used for Web communication, is called HyperText Transfer Protocol (HTTP). HTTP is written on top of the TCP and IP protocol suite,
meaning that it relies on TCP and IP to carry out its lower-level communication needs. Its responsibility is not to route or deliver messages—TCP
and IP handle that—but to respond to client requests (by sending and
receiving HTTP messages).
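To make the layering concrete, the following sketch sends a hand-written HTTP request over a plain TCP socket and reads back the raw reply. To avoid depending on any real Web site, it starts a throwaway server on the local machine using the standard library's http.server module (Python 3 module names). The handler class HelloHandler and the functions fetch_raw and demo are hypothetical names of our own, not anything used elsewhere in this book.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Tiny handler that answers every GET with a short plain-text body."""
    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demonstration quiet

def fetch_raw(host, port, path="/"):
    """Speak HTTP by hand: TCP carries the bytes, HTTP gives them meaning."""
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:        # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks)

def demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)   # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_raw("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()

reply = demo()
print(reply.decode("ascii"))
```

The reply consists of a status line, some headers, a blank line, and the body. Everything about getting those bytes from one socket to the other is left to TCP and IP.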
HTTP is known as a stateless protocol because it does not keep track of
information from one client request to the next, similar to the client/server
architecture we have seen so far. The server stays running, but client interactions are singular events structured in such a way that once a client
request is serviced, it quits. New requests can always be sent, but they are
considered separate service requests. Because of the lack of context per
request, you might notice that some URLs have a long set of variables and
values chained as part of the request to provide some sort of state information. Another alternative is the use of cookies—static data stored on the client side that generally contain state information, as well. In later parts of
this chapter, we will look at how to use both long URLs and cookies to
maintain state information.
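To make the idea of carrying state in a long URL concrete, here is a minimal sketch. It assumes Python 3, where the URL helpers discussed later in this chapter live in urllib.parse; the host name, path, and the session, cart, and page keys are invented purely for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical state that a stateless server would otherwise forget
# between two requests from the same client.
state = {'session': 'abc123', 'cart': '3', 'page': '2'}

# The client carries the state along by chaining key=value pairs
# onto the URL of its next request.
url = 'http://www.example.com/shop/view?' + urlencode(state)

# On the receiving side, the state is recovered from the query string.
query = urlsplit(url).query
recovered = {key: values[0] for key, values in parse_qs(query).items()}

print(url)
print(recovered)
```

Nothing here is remembered by the server; every request has to bring the complete state with it, which is why such URLs grow long.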
9.1.2
The Internet
The Internet is a moving and fluctuating “cloud” or “pond” of interconnected clients and servers scattered around the globe. Metaphorically
speaking, communication from client to server consists of a series of connections from one lily pad on the pond to another, with the last step connecting to the server. As a client user, all this detail is kept hidden from
your view. The abstraction is to have a direct connection between you
(the client) and the server you are visiting, but the underlying HTTP,
TCP, and IP protocols are hidden underneath, doing all of the dirty
work. Information regarding the intermediate nodes is of no concern or
consequence to the general user, anyway, so it’s good that the implementation is hidden. Figure 9-2 shows an expanded view of the Internet.
Figure 9-2 A grand view of the Internet. The left side illustrates where you would find Web
clients; the right side hints as to where Web servers are typically located.
(Diagram labels: Home User, Modem, Client, ISP, ISP Network, Internet Core, The Internet, Colocated .com Servers, Server, Corporate LAN, Internal Server, An Intranet, Corporate Local Area Network, Web Server Farm LAN, External Server, Corporate Web Site (Network).)
It’s worth mentioning that with all of the data moving around the Internet, there might be some that is more sensitive. There is no encryption
service available by default, so standard protocols just transmit the data as
they’re sent from applications. An additional level of security has been
added to ordinary sockets, called the secure socket layer (SSL), to encrypt all
transmission going across a socket created with this additional level. Now
developers can determine whether they want this additional security or not.
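As a rough sketch of what "a socket created with this additional level" looks like in code, the following uses Python 3's ssl module, which today negotiates TLS, the successor to SSL. The helper name open_secure_socket is our own invention, and calling it requires network access; only the context creation at the bottom runs offline.

```python
import socket
import ssl

def open_secure_socket(host, port=443):
    """Return an encrypted socket connected to host:port (needs a network)."""
    context = ssl.create_default_context()      # verifies certificates and host names
    raw_sock = socket.create_connection((host, port))
    # Everything sent through the wrapped socket is encrypted in transit.
    return context.wrap_socket(raw_sock, server_hostname=host)

# Creating the context by itself does not touch the network.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

The application decides whether to wrap the socket at all, which is the choice the paragraph above refers to.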
Where the Clients and Servers Are
As you can see from Figure 9-2, the Internet is made up of multiple, interconnected networks, all working with some sense of (perhaps disjointed)
harmony. The left half of the diagram is focused on the Web clients—users
who are either at home, connected via their ISP, or at work on their company’s LAN. Missing from the diagram are special-purpose (and popular)
devices such as firewalls and proxy servers.
Firewalls help fight against unauthorized access to a corporate (or home)
network by blocking known entry points, configurable on a per-network
basis. Without one of these, computers that have servers might allow
intruders to enter an unprotected port and gain system access. Network
administrators reduce the chances of hacking by locking everything out
and only opening up ports for well-known services like Web servers and
secure shell (SSH) access. (SSH encrypts its traffic too, but it is a separate protocol and is not built on the aforementioned SSL.)
Proxy servers are another useful tool that can work alongside firewalls
(or not). Network administrators might prefer that only a certain number
of computers have Internet access, perhaps to better monitor traffic in and
out of their networks. Another useful feature is if the proxy can cache data.
As an example, if Linda accesses a Web page which is proxy-cached, when
her co-worker Heather visits the same page later, she’ll experience a faster
loading time. Her browser did not need to go all the way to the Web
server; instead, it got everything it needed from the proxy. Furthermore,
the IT staff at their company now knows that at least two employees visited that Web site and when (and likely who). Such servers are also known
as forward proxies, based on what they do.
A similar type of computer is a reverse proxy. These do (sort-of) the
opposite of the forward proxy. (In actuality, you can configure a single
computer to perform as both a forward and reverse proxy.) A reverse
proxy acts like a server to which clients can connect. It will likely
hit a back-end server to obtain the information that the clients
are requesting. Reverse proxies can also cache such server data and return
it directly to the client as if they were one of the back-ends.
As you can surmise, instead of caching on their behalf, “living closer
to,” and serving clients, reverse proxies live closer to (back-end) servers.
They act on the behalf of servers, possibly caching for them, load balancing,
etc. You can also use reverse proxies as firewalls or to encrypt data (SSL,
HTTPS, Secure FTP (SFTP), etc.). They’re very useful, and it’s highly likely
that you’ll come across more than one reverse proxy during daily Web surfing. Now let’s talk about where some of those back-end Web servers are.
The right side of Figure 9-2 concentrates more on Web servers and
where they can be found. Corporations with larger Web sites will typically
have an entire Web server farm located at their ISPs. Such physical placement is called co-location, meaning that a company’s servers reside at an
ISP along with computers from other corporate customers. These servers
are either all providing different data to clients or are part of a redundant
system with duplicated information designed for heavy demand (high
number of clients). Smaller corporate Web sites might not require as much
hardware and networking gear, and hence, might only have one or several
co-located servers at their ISP.
In either case, most co-located servers are stored with a larger ISP sitting on a network backbone, meaning that they have a “fatter” (read wider)
and presumably faster connection to the Internet—closer to the core of the
Internet, if you will. This permits clients to access the servers quickly—being
on a backbone means clients do not have to hop across as many networks
to access a server, thus allowing more clients to be serviced within a given
time period.
Internet Protocols
You should also keep in mind that although Web surfing is the most common Internet application, it is not the only one and is certainly not the oldest. The Internet predates the Web by roughly two decades. Before the
Web, the Internet was mainly used for educational and research purposes,
and many of the original Internet protocols, such as FTP, SMTP, and NNTP,
are still around today.
Since Python was initially known for Internet programming, you will
find support for all of the protocols discussed above in addition to many
others. We differentiate between “Internet programming” and “Web programming” by stating that the latter pertains only to applications developed specifically for the Web, such as Web clients and servers, which are
the focus for this chapter.
Internet programming covers a wider range of applications, including
applications that use some of the Internet protocols we previously mentioned, plus network and socket programming in general, all of which are
covered in previous chapters in this book.
9.2
Python Web Client Tools
One thing to keep in mind is that a browser is only one type of Web client.
Any application that makes a request for data from a Web server is considered a client. Yes, it is possible to create other clients that retrieve documents or data from the Internet. One important reason to do this is that a
browser provides only limited capacity; it is used primarily for viewing
and interacting with Web sites. A client program, on the other hand,
has the ability to do more—not only can it download data, but it can also
store it, manipulate it, or perhaps even transmit it to another location or
application.
Applications that use the urllib module to download or access information on the Web (using either urllib.urlopen() or urllib.urlretrieve())
can be considered simple Web clients. All you need to do is provide a valid
Web address.
9.2.1
Uniform Resource Locators
Simple Web surfing involves using Web addresses called Uniform Resource
Locators (URLs). Such addresses are used to locate a document on the Web
or to call a CGI program to generate a document for your client. URLs are
part of a larger set of identifiers known as Uniform Resource Identifiers
(URIs). This superset was created in anticipation of other naming conventions that have yet to be developed. A URL is simply a URI that uses an
existing protocol or scheme (i.e., http, ftp, etc.) as part of its addressing. To
complete this picture, we’ll add that non-URL URIs are sometimes known
as Uniform Resource Names (URNs), but because URLs are the only URIs in
use today, you really don’t hear much about URIs or URNs, save for perhaps XML identifiers.
Like street addresses, Web addresses have some structure. An American
street address usually is of the form “number/street designation,” for
example, 123 Main Street. It can differ from other countries, which
might have their own rules. A URL uses the format:
prot_sch://net_loc/path;params?query#frag
Table 9-1 describes each of the components.
Table 9-1 Web Address Components

URL Component   Description
prot_sch        Network protocol or download scheme
net_loc         Location of server (and perhaps user information)
path            Slash (/) delimited path to file or CGI application
params          Optional parameters
query           Ampersand (&) delimited set of “key=value” pairs
frag            Fragment to a specific anchor within document

net_loc can be broken down into several more components, some required,
others optional. The net_loc string looks like this:
user:passwd@host:port
These individual components are described in Table 9-2.
Table 9-2 Network Location Components

Component   Description
user        User name or login
passwd      User password
host        Name or address of the computer running the Web server (required)
port        Port number (if not 80, which is the default)

Of the four, the host name is the most important. The port number is
necessary only if the Web server is running on a different port number
from the default. (If you aren’t sure what a port number is, read Chapter 2,
“Network Programming.”)
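The pieces of net_loc can be inspected directly. The sketch below assumes Python 3's urllib.parse.urlsplit(), whose result exposes username, password, hostname, and port attributes; the FTP URL and its credentials are made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical URL that uses every net_loc component.
parts = urlsplit('ftp://guest:secret@ftp.example.com:2121/pub/README')

print(parts.netloc)                        # the whole net_loc string
print(parts.username, parts.password)      # user and passwd
print(parts.hostname, parts.port)          # host and port (port is an int)

# When no port is given, .port is None and the scheme's default applies.
no_port = urlsplit('http://www.python.org/doc/FAQ.html')
print(no_port.hostname, no_port.port)
```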
User names and perhaps passwords are used only when making FTP
connections, and even then they usually aren’t necessary because the
majority of such connections are anonymous.
Python supplies two different modules that deal with URLs, each with completely different functionality and capabilities. One is urlparse, and the
other is urllib. We will briefly introduce some of their functions here.
9.2.2
The urlparse Module
The urlparse module provides basic functionality with which to manipulate URL strings. These functions include urlparse(), urlunparse(), and
urljoin().
urlparse.urlparse()
urlparse() breaks up a URL string into some of the major components
described earlier. It has the following syntax:
urlparse(urlstr, defProtSch=None, allowFrag=None)
urlparse() parses urlstr into a 6-tuple ( prot_sch, net_loc, path,
params, query, frag). Each of these components has been described earlier.
defProtSch specifies a default network protocol or download scheme in
case one is not provided in urlstr. allowFrag is a flag that signals whether
a fragment part of a URL is allowed. Here is what urlparse() outputs
when given a URL:
>>> urlparse.urlparse('http://www.python.org/doc/FAQ.html')
('http', 'www.python.org', '/doc/FAQ.html', '', '', '')
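For readers on Python 3, the same function lives in urllib.parse. The following is a small sketch under that assumption; the parameter, query, and fragment in the first URL (v=1, lang=en, install) are invented so that every field is non-empty. The defProtSch and allowFrag arguments described above correspond to the library's scheme and allow_fragments parameters.

```python
from urllib.parse import urlparse

result = urlparse('http://www.python.org/doc/FAQ.html;v=1?lang=en#install')

# The result still behaves like a 6-tuple...
print(tuple(result))
# ...but each component can also be read by name.
print(result.scheme, result.netloc, result.path)
print(result.params, result.query, result.fragment)

# A default scheme is used only when the URL string does not supply one.
relative = urlparse('//www.python.org/doc/', scheme='http')
print(relative.scheme, relative.netloc)

# With fragments disallowed, the "#..." part is not split off.
no_frag = urlparse('http://www.python.org/doc/FAQ.html#install',
                   allow_fragments=False)
print(no_frag.path, repr(no_frag.fragment))
```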
urlparse.urlunparse()
urlunparse() does the exact opposite of urlparse(): it merges a 6-tuple, urltup
(prot_sch, net_loc, path, params, query, frag), which could be the
output of urlparse(), into a single URL string and returns it. Accordingly,
we state the following equivalence:
urlunparse(urlparse(urlstr)) ≡ urlstr
You might have already surmised that the syntax of urlunparse() is as
follows:
urlunparse(urltup)
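The equivalence can be checked with a short round trip. This sketch assumes Python 3's urllib.parse; the query and fragment in the URL are made up. It also shows a common use of the pair: parse, change one component, and put the URL back together.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.python.org/doc/FAQ.html?lang=en#install'
parts = urlparse(url)

# Round trip: unparse what was just parsed.
round_trip = urlunparse(parts)
print(round_trip == url)

# Replace a single component and rebuild the URL.
secure = urlunparse(parts._replace(scheme='https'))
print(secure)
```

The library documentation notes that the rebuilt string can differ slightly from the input when the original contained unnecessary delimiters (such as a trailing "?" with an empty query), so the equivalence is best read as "same URL" rather than "same characters" in those edge cases.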
urlparse.urljoin()
The urljoin() function is useful in cases for which many related URLs are
needed, for example, the URLs for a set of pages to be generated for a Web
site. The syntax for urljoin() is:
urljoin(baseurl, newurl, allowFrag=None)
urljoin() takes baseurl and joins its base path (net_loc plus the full
path up to, but not including, a file at the end) with newurl. For example:
>>> urlparse.urljoin('http://www.python.org/doc/FAQ.html',
... 'current/lib/lib.html')
'http://www.python.org/doc/current/lib/lib.html'
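How newurl is written changes what urljoin() keeps from the base. Here is a brief sketch, assuming Python 3's urllib.parse.urljoin(); the /about/ and /download/ paths and the docs.python.org address are only illustrative.

```python
from urllib.parse import urljoin

base = 'http://www.python.org/doc/FAQ.html'

in_dir = urljoin(base, 'current/lib/lib.html')    # relative to the base directory
from_root = urljoin(base, '/about/')              # leading slash: relative to the site root
one_up = urljoin(base, '../download/')            # ".." climbs one directory
absolute = urljoin(base, 'http://docs.python.org/')  # an absolute URL replaces the base

for joined in (in_dir, from_root, one_up, absolute):
    print(joined)
```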
A summary of the functions in urlparse can be found in Table 9-3.
Table 9-3 Core urlparse Module Functions

urlparse Functions            Description
urlparse(urlstr,              Parses urlstr into separate components, using
  defProtSch=None,            defProtSch if the protocol or scheme is not
  allowFrag=None)             given in urlstr; allowFrag determines whether
                              a URL fragment is allowed
urlunparse(urltup)            Unparses a tuple of URL data (urltup) into a
                              single URL string
urljoin(baseurl, newurl,      Merges the base part of the baseurl URL with
  allowFrag=None)             newurl to form a complete URL; allowFrag is
                              the same as for urlparse()

9.2.3
urllib Module/Package
CORE MODULE: urllib in Python 2 and Python 3
Unless you are planning on writing a lower-level network client, the urllib
module provides all the functionality you need. urllib provides a high-level Web
communication library, supporting the basic Web protocols, HTTP, FTP, and
Gopher, as well as providing access to local files. Specifically, the functions of the
urllib module are designed to download data (from the Internet, local network,
or local host) using the aforementioned protocols. Use of this module generally
obviates the need for using the httplib, ftplib, and gopherlib modules unless
you desire their lower-level functionality. In those cases, such modules can be considered as alternatives. (Note: modules named *lib are generally for developing clients of the corresponding protocols. This is not always the case, however,
as perhaps urllib should then be renamed “internetlib” or something similar!)
With urllib, urlparse, urllib2, and others in Python 2, a step was taken in
Python 3 to streamline all of these related modules under a single package
named urllib, so you’ll find pieces of urllib and urllib2 unified into the
urllib.request module and urlparse turned into urllib.parse. The urllib
package in Python 3 also includes the response, error, and robotparser submodules. Keep these changes in mind as you read this chapter and try the
examples or exercises.
The urllib module provides functions to download data from given
URLs as well as encoding and decoding strings to make them suitable for
including as part of valid URL strings. The functions we will be looking at
in the upcoming section include urlopen(), urlretrieve(), quote(),
unquote(), quote_plus(), unquote_plus(), and urlencode(). We will also
look at some of the methods available to the file-like object returned by
urlopen().
urllib.urlopen()
urlopen() opens a Web connection to the given URL string and returns a
file-like object. It has the following syntax:
urlopen(urlstr, postQueryData=None)
urlopen() opens the URL pointed to by urlstr. If no protocol or download scheme is given, or if a “file” scheme is passed in, urlopen() will
open a local file.
For all HTTP requests, the normal request type is GET. In these cases,
the query string provided to the Web server (key-value pairs encoded or
quoted, such as the string output of the urlencode() function), should be
given as part of urlstr.
If the POST request method is desired, then the query string (again
encoded) should be placed in the postQueryData variable. (We’ll discuss
GET and POST some more later in the chapter, but such HTTP commands are general to Web programming and HTTP itself, not tied specifically to Python.)
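The difference between the two request types shows up in where the encoded query string is placed. The sketch below assumes Python 3, where a urllib.request.Request object can be built and inspected without sending anything; the CGI script address is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

query = urlencode({'name': 'joe mama', 'num': 6})
base = 'http://www.example.com/cgi-bin/s.py'     # hypothetical CGI script

# GET: the encoded query string travels as part of the URL itself.
get_req = Request(base + '?' + query)

# POST: the encoded query string travels in the request body, as bytes.
post_req = Request(base, data=query.encode('ascii'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to urlopen() is what would actually send the request; supplying data is what switches the method from GET to POST.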
When a successful connection is made, urlopen() returns a file-like
object, as if the destination was a file opened in read mode. If our file
object is f, for example, then our “handle” would support the expected
file methods such as f.read(), f.readline(), f.readlines(), f.close(),
and f.fileno().
In addition, an f.info() method is available which returns the Multipurpose Internet Mail Extension (MIME) headers. Such headers give the
browser information regarding which application can view returned file
types. For example, the browser itself can view HTML, plain text files, and
render PNG (Portable Network Graphics) and JPEG (Joint Photographic
Experts Group) or the old GIF (Graphics Interchange Format) graphics
files. Other files, such as multimedia or specific document types, require
external applications to view them.
Finally, a geturl() method exists to obtain the true URL of the final
opened destination, taking into consideration any redirection that might
have occurred. A summary of these file-like object methods is given in
Table 9-4.
Table 9-4 urllib.urlopen() File-like Object Methods

urlopen() Object Methods   Description
f.read([bytes])            Reads all or bytes bytes from f
f.readline()               Reads a single line from f
f.readlines()              Reads all lines from f into a list
f.close()                  Closes URL connection for f
f.fileno()                 Returns file number of f
f.info()                   Gets MIME headers of f
f.geturl()                 Returns true URL opened for f

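The methods in Table 9-4 can be tried without a network connection by pointing urlopen() at a local file. This is a sketch that assumes Python 3 (urllib.request.urlopen() with a file:// URL); the temporary index.html file is created only for the demonstration, and in Python 3 the data comes back as bytes rather than text.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Create a small local HTML file so the example runs offline.
content = b'<html>\n<body>Hello</body>\n</html>\n'
page = Path(tempfile.mkdtemp()) / 'index.html'
page.write_bytes(content)

f = urlopen(page.as_uri())     # page.as_uri() is a file:// URL
first_line = f.readline()      # one line, as bytes
rest = f.readlines()           # the remaining lines, as a list
headers = f.info()             # header object describing the resource
final_url = f.geturl()         # URL that was actually opened
f.close()

print(first_line)
print(len(rest), 'more lines')
print(final_url)
```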
If you expect to be accessing more complex URLs or want to be able to
handle more complex situations, such as basic and digest authentication,
redirections, cookies, etc., then we suggest using the urllib2 module. It
too, has a urlopen() function, but it also provides other functions and
classes for opening a variety of URLs.
If you’re staying with version 2.x for now, we strongly recommend that
you use urllib2.urlopen() instead, because it supersedes the original one
in urllib, which is deprecated starting in version 2.6 and removed in version 3.0. As
you read in the Core Module sidebar earlier, the functionality of both
modules is merged into urllib.request in Python 3. This is just another way
of saying that the version 3.x urllib.request.urlopen() function is ported
directly from version 2.x urllib2.urlopen() (and not urllib.urlopen()).
urllib.urlretrieve()
Rather than opening a URL and letting you access it like a file, urlretrieve()
just downloads the entire document and saves it as a file. Here is the syntax
for urlretrieve():
urlretrieve(urlstr, localfile=None, downloadStatusHook=None)
(In the library itself, these parameters are named url, filename, and reporthook, and a fourth, data, carries POST data.)
urlretrieve() downloads the entire file located at urlstr to your local
disk. It stores the downloaded data into localfile, if given, or a temporary file if not. If the file has already been copied from the Internet or if the
file is local, no subsequent downloading will occur.
The downloadStatusHook, if provided, is a function that is called after
each block of data has been downloaded and delivered. It is called with
the following three arguments: number of blocks read so far, the block size
in bytes, and the total (byte) size of the file. This is very useful if you are
implementing download status information to the user in a text-based or
graphical display.
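A status hook is just a function that accepts those three arguments. The sketch below assumes Python 3's urllib.request.urlretrieve() and copies a local file through a file:// URL so that it runs offline; the file names and the show_progress function are our own, and the block size is whatever the library chooses.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

calls = []

def show_progress(block_count, block_size, total_size):
    """Record and print the progress figures reported by urlretrieve()."""
    calls.append((block_count, block_size, total_size))
    if total_size > 0:
        done = min(block_count * block_size, total_size)
        print('%d of %d bytes' % (done, total_size))
    else:
        # total_size is not positive when the size is unknown.
        print('%d blocks read' % block_count)

# Set up a 20,000-byte "remote" file and a destination path.
workdir = tempfile.mkdtemp()
source = Path(workdir) / 'source.bin'
source.write_bytes(b'x' * 20000)
target = os.path.join(workdir, 'copy.bin')

filename, headers = urlretrieve(source.as_uri(), target, show_progress)
print(filename)
```

The hook is called once before any data arrives (with a block count of zero) and again after each block, so a progress bar can be drawn from the ratio of bytes read to the total size.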
urlretrieve() returns a 2-tuple (filename, mime_hdrs). filename is the
name of the local file containing the downloaded data. mime_hdrs is the
set of MIME headers returned by the responding Web server. For more
information, see the Message class of the mimetools module. mime_hdrs is
None for local files.
urllib.quote() and urllib.quote_plus()
The quote*() functions take URL data and encode it so that it is fit for
inclusion as part of a URL string. In particular, certain special characters
that are unprintable or cannot be part of valid URLs to a Web server must
be converted. This is what the quote*() functions do for you. The
quote*() functions have the following syntax:
quote(urldata, safe='/')
quote_plus(urldata, safe='')
Characters that are never converted include underscores, periods, and dashes, as well as alphanumerics. All others are subject to conversion. In particular, the disallowed characters are changed to their
hexadecimal ordinal equivalents, prepended with a percent sign (%), for
example, %xx, where xx is the hexadecimal representation of a character’s
ASCII value. When calling quote*(), the urldata string is converted to an
equivalent string that can be part of a URL string. The safe string should
contain a set of characters that should also not be converted. The default is
the slash (/) for quote() and the empty string for quote_plus().
quote_plus() is similar to quote(), except that it also encodes spaces to
plus signs (+) and, with its default safe string, encodes slashes as well. Here is an example using quote() versus quote_plus():
>>> name = 'joe mama'
>>> number = 6
>>> base = 'http://www/~foo/cgi-bin/s.py'
>>> final = '%s?name=%s&num=%d' % (base, name, number)
>>> final
'http://www/~foo/cgi-bin/s.py?name=joe mama&num=6'
>>>
>>> urllib.quote(final)
'http%3A//www/%7Efoo/cgi-bin/s.py%3Fname%3Djoe%20mama%26num%3D6'
>>>
>>> urllib.quote_plus(final)
'http%3A%2F%2Fwww%2F%7Efoo%2Fcgi-bin%2Fs.py%3Fname%3Djoe+mama%26num%3D6'
urllib.unquote() and urllib.unquote_plus()
As you have probably guessed, the unquote*() functions do the exact
opposite of the quote*() functions—they convert all characters encoded
in the %xx fashion to their ASCII equivalents. The syntax of unquote*() is
as follows:
unquote*(urldata)
Calling unquote() will decode all URL-encoded characters in urldata
and return the resulting string. unquote_plus() will also convert plus
signs back to space characters.
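A quick round trip shows how the four functions pair up. This sketch assumes Python 3, where they are found in urllib.parse; the sample string is arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

value = 'joe mama & co/sons'

quoted = quote(value)          # spaces become %20; "/" is left alone by default
plussed = quote_plus(value)    # spaces become "+"; "/" is encoded as well

print(quoted)
print(plussed)

# Each decoder reverses its own encoder.
print(unquote(quoted) == value)
print(unquote_plus(plussed) == value)

# Mixing them up leaves the plus signs in place.
print(unquote(plussed))
```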
urllib.urlencode()
urlencode() takes a dictionary of key-value pairs and encodes them to be
included as part of a query in a CGI request URL string. The pairs are in
key=value format and are delimited by ampersands (&). Furthermore, the
keys and their values are sent to quote_plus() for proper encoding. Here
is an example output from urlencode():
>>> aDict = { 'name': 'Georgina Garcia', 'hmdir': '~ggarcia' }
>>> urllib.urlencode(aDict)
'name=Georgina+Garcia&hmdir=%7Eggarcia'
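urlencode() also accepts a sequence of 2-tuples, which is handy when a key repeats or when the order of the pairs matters. The sketch below assumes Python 3's urllib.parse; the field names and values are invented, and parse_qs() is used only to show that the encoding can be reversed.

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs keeps its order and allows a key to appear twice.
fields = [('name', 'Georgina Garcia'),
          ('hmdir', '/home/ggarcia'),
          ('lang', 'en'),
          ('lang', 'es')]
query = urlencode(fields)
print(query)

# With a dictionary, doseq=True expands a list value into repeated keys.
same_langs = urlencode({'lang': ['en', 'es']}, doseq=True)
print(same_langs)

# parse_qs() decodes the string back into a dictionary of lists.
decoded = parse_qs(query)
print(decoded)
```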
There are other functions in urllib and urlparse that we don’t have the
opportunity to cover here. Refer to the documentation for more information.
A summary of the urllib functions discussed in this section can be
found in Table 9-5.
Table 9-5 Core urllib Module Functions

urllib Functions                 Description
urlopen(urlstr,                  Opens the URL urlstr, sending the query
  postQueryData=None)            data in postQueryData if a POST request
urlretrieve(urlstr,              Downloads the file located at the urlstr
  localfile=None,                URL to localfile or a temporary file
  downloadStatusHook=None)       if localfile not given; if present,
                                 downloadStatusHook is a function that
                                 can receive download statistics
quote(urldata, safe='/')         Encodes invalid URL characters of
                                 urldata; characters in safe string are not
                                 encoded
quote_plus(urldata, safe='')     Same as quote() except encodes spaces as
                                 plus (+) signs (rather than as %20)
unquote(urldata)                 Decodes encoded characters of urldata
unquote_plus(urldata)            Same as unquote() but converts plus
                                 signs to spaces
urlencode(dict)                  Encodes the key-value pairs of dict into a
                                 valid string for CGI queries and encodes
                                 the key and value strings with
                                 quote_plus()

SSL Support
Before wrapping up our discussion on urllib and looking at some examples, we want to mention that it supports opening HTTP connections
using SSL. (The core change to add SSL is implemented in the socket
module.) The httplib module supports URLs using the “https” connection scheme. In addition to those two modules, other protocol client modules with SSL support include imaplib, poplib, and smtplib.
9.2.4
An Example of urllib2 HTTP Authentication
As mentioned in the previous subsection, urllib2 can handle more complex URL opening. One example is for Web sites with basic authentication
(login and password) requirements. The most straightforward solution to
getting past security is to use the extended net_loc URL component, as
described earlier in this chapter, for example, http://username:passwd@www.python.org.
The problem with this solution is that it is not programmatic. Using urllib2, however, we can tackle this problem in two
different ways.
We can create a basic authentication handler (urllib2.HTTPBasicAuthHandler)
and register a login and password given the base URL and realm,
meaning a string defining the secure area of the Web site. Once you have a
handler, you build an opener with it and install that opener so
that all URLs opened will use our handler.
The realm comes from the defined .htaccess file for the secure part of
the Web site. One example of such a file appears here:
AuthType        basic
AuthName        "Secure Archive"
AuthUserFile    /www/htdocs/.htpasswd
require         valid-user
For this part of the Web site, the string listed for AuthName is the realm.
The username and (encrypted) password are created by using the htpasswd
command (and installed in the .htpasswd file). For more on realms and Web
authentication, see RFC 2617 (HTTP Authentication: Basic and Digest Access
Authentication) as well as the Wikipedia page at
http://en.wikipedia.org/wiki/Basic_access_authentication.
The alternative to creating an opener with an authentication handler is to
simulate a user typing the username and password when prompted by a
browser; that is, to send an HTTP client request with the appropriate
authorization headers. In Example 9-1, we demonstrate these two methods.
Example 9-1 Basic HTTP Authentication (urlopen_auth.py)
This script uses both techniques described earlier for basic HTTP authentication.
You must use urllib2 because this functionality isn’t in urllib.

1   #!/usr/bin/env python
2
3   import urllib2
4
5   LOGIN = 'wesley'
6   PASSWD = "you'llNeverGuess"
7   URL = 'http://localhost'
8   REALM = 'Secure Archive'
9
10  def handler_version(url):
11      from urlparse import urlparse
12      hdlr = urllib2.HTTPBasicAuthHandler()
13      hdlr.add_password(REALM,
14          urlparse(url)[1], LOGIN, PASSWD)
15      opener = urllib2.build_opener(hdlr)
16      urllib2.install_opener(opener)
17      return url
18
19  def request_version(url):
20      from base64 import encodestring
21      req = urllib2.Request(url)
22      b64str = encodestring('%s:%s' % (LOGIN, PASSWD))[:-1]
23      req.add_header("Authorization", "Basic %s" % b64str)
24      return req
25
26  for funcType in ('handler', 'request'):
27      print '*** Using %s:' % funcType.upper()
28      url = eval('%s_version' % funcType)(URL)
29      f = urllib2.urlopen(url)
30      print f.readline()
31      f.close()

Line-by-Line Explanation
Lines 1–8
This is the usual, expected setup plus some constants for the rest of the
script to use. We don’t need to remind you that sensitive information should
come from a secure database, or at least from environment variables or precompiled .pyc files rather than being hardcoded in plain text in a source file.
Lines 10–17
The “handler” version of the code allocates a basic handler class as
described earlier, and then adds the authentication information. The handler is then used to create a URL-opener that is subsequently installed so
that all URLs opened will use the given authentication. This code was
adapted from the official Python documentation for the urllib2 module.
Lines 19–24
The “request” version of our code just builds a Request object and adds the
simple base64-encoded authentication header into our HTTP request. This
request is then used to substitute the URL string when calling urlopen()
upon returning back to “main.” Note that the original URL was “baked
into” the urllib2.Request object, hence the reason why it was not a problem to replace it in the subsequent call to urllib2.urlopen(). This code
was inspired by Michael Foord’s and Lee Harr’s recipes in the Python Cookbook, which you can obtain at:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305288
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/267197
It would have been great to have been able to use Harr’s HTTPRealmFinder
class so that we would not need to hard-code the realm in our example.
Lines 26–31
The rest of this script just opens the given URL by using both techniques
and displays the first line (dumping the others) of the resulting HTML
page returned by the server once authentication has been validated. Note
that an HTTP error (and no HTML) would be returned if the authentication information is invalid.
The output should look something like this:
$ python urlopen_auth.py
*** Using HANDLER:
<html>
*** Using REQUEST:
<html>
In addition to the official Python documentation for urllib2, you may
find this companion piece useful:
http://www.voidspace.org.uk/python/articles/urllib2.shtml.
9.2.5
Porting the HTTP Authentication Example
to Python 3
At the time of this writing, porting this application requires a bit more
work than just using the 2to3 tool. Of course, it does the heavy lifting, but
it does require a softer (or is that “software”?) touch afterwards. Let’s take
our urlopen_auth.py script and run the tool on it:
$ 2to3 -w urlopen_auth.py
. . .
You would use a similar command on PCs. As you might have
already seen in earlier chapters, the output shows the differences
between the Python 2 and Python 3 versions of the script,
and the original file is overwritten with the Python 3 version, whereas the
Python 2 version is backed up automatically.
Rename the new file from urlopen_auth.py to urlopen_auth3.py and the
backup from urlopen_auth.py.bak to urlopen_auth.py. On a POSIX system, execute these file rename commands (and on PCs, you would do it
from Windows or use the ren DOS command):
$ mv urlopen_auth.py urlopen_auth3.py
$ mv urlopen_auth.py.bak urlopen_auth.py
This keeps with our naming strategy to help recognize our code that’s in
Python 2 versus those ported to Python 3. Anyway, running the tool is just
the beginning. If we’re optimistic that it will run the first time, our hopes
are dashed quickly:
$ python3 urlopen_auth3.py
*** Using HANDLER:
b'<HTML>\n'
*** Using REQUEST:
Traceback (most recent call last):
  File "urlopen_auth3.py", line 28, in <module>
    url = eval('%s_version' % funcType)(URL)
  File "urlopen_auth3.py", line 22, in request_version
    b64str = encodestring('%s:%s' % (LOGIN, PASSWD))[:-1]
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/base64.py", line 353, in encodestring
    return encodebytes(s)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/base64.py", line 341, in encodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
Going with our gut instinct, change the string in line 22 to a bytes string
by adding a leading “b” before the opening quote, as in b'%s:%s' %
(LOGIN, PASSWD). Now if we run it again, we get another error—welcome
to the Python 3 porting club!
$ python3 urlopen_auth3.py
*** Using HANDLER:
b'<HTML>\n'
*** Using REQUEST:
Traceback (most recent call last):
  File "urlopen_auth3.py", line 28, in <module>
    url = eval('%s_version' % funcType)(URL)
  File "urlopen_auth3.py", line 22, in request_version
    b64str = encodestring(b'%s:%s' % (LOGIN, PASSWD))[:-1]
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'
Apparently, bytes objects do not support the string format operator
because, technically, you’re not supposed to use them as strings. (Python 3.5 later added the % operator to bytes, but %s there still requires bytes arguments, so this line would fail either way.) Instead,
we need to format the string as (Unicode) text, and then convert the whole
thing into a bytes object: bytes('%s:%s' % (LOGIN, PASSWD), 'utf-8'). The
output after this change is much closer to what we want:
$ python3 urlopen_auth3.py
*** Using HANDLER:
b'<HTML>\n'
*** Using REQUEST:
b'<HTML>\n'
It’s still slightly off because we’re seeing the designation of the bytes
objects (leading “b”, quotes, etc.) instead of just the text in which we’re
interested. Change the print() call to this: print(str(f.readline(), 'utf-8')).
Now the output of the Python 3 version is identical to that of the Python 2
script:
$ python3 urlopen_auth3.py
*** Using HANDLER:
<html>
*** Using REQUEST:
<html>
As you can see, porting requires a bit of handholding, but it’s not impossible. Again, as we noted earlier, urllib, urllib2, and urlparse are all
merged together under the urllib package umbrella in Python 3. Because
of how the 2to3 tool works, an import of urllib.parse already exists at the
top. The import inside the definition of handler_version() is thus superfluous and has been
removed. You’ll find that change along with the others in Example 9-2.
Example 9-2
Python 3 HTTP Authentication Script (urlopen_auth3.py)
This represents the Python 3 version of our urlopen_auth.py script.
 1  #!/usr/bin/env python3
 2
 3  import urllib.request, urllib.error, urllib.parse
 4
 5  LOGIN = 'wesley'
 6  PASSWD = "you'llNeverGuess"
 7  URL = 'http://localhost'
 8  REALM = 'Secure Archive'
 9
10  def handler_version(url):
11      hdlr = urllib.request.HTTPBasicAuthHandler()
12      hdlr.add_password(REALM,
13          urllib.parse.urlparse(url)[1], LOGIN, PASSWD)
14      opener = urllib.request.build_opener(hdlr)
15      urllib.request.install_opener(opener)
16      return url
17
18  def request_version(url):
19      from base64 import encodestring
20      req = urllib.request.Request(url)
21      b64str = encodestring(
22          bytes('%s:%s' % (LOGIN, PASSWD), 'utf-8'))[:-1]
23      req.add_header("Authorization", "Basic %s" % b64str)
24      return req
25
26  for funcType in ('handler', 'request'):
27      print('*** Using %s:' % funcType.upper())
28      url = eval('%s_version' % funcType)(URL)
29      f = urllib.request.urlopen(url)
30      print(str(f.readline(), 'utf-8'))
31      f.close()
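One thing to watch when running this on a current interpreter: base64.encodestring() was deprecated and then removed in Python 3.9, and formatting a bytes object into a text string with %s produces its repr (b'...') rather than the bare characters. The following is a minimal sketch of one way to build the header value under those constraints; basic_auth_header is a hypothetical helper name, not something from the script or the standard library.

```python
from base64 import b64encode

def basic_auth_header(login, passwd):
    """Return the value for an HTTP Basic 'Authorization' header."""
    raw = ('%s:%s' % (login, passwd)).encode('utf-8')
    # b64encode() returns bytes with no trailing newline; decode to str
    # so the header reads 'Basic d2Vz...' rather than "Basic b'd2Vz...'".
    return 'Basic %s' % b64encode(raw).decode('ascii')

header = basic_auth_header('wesley', "you'llNeverGuess")
print(header)
```

The result could then be passed to req.add_header("Authorization", header) in place of the string built on line 23.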
Let’s now turn our attention to slightly more advanced Web clients.
9.3 Web Clients
Web browsers are basic Web clients. They are used primarily for searching
and downloading documents from the Web. You can also create Web clients
that do more than that, though. We’ll take a look at several in this section.
9.3.1 A Simple Web Crawler/Spider/Bot
One example of a slightly more complex Web client is a crawler (a.k.a. spider,
[ro]bot). These are programs that explore and download pages from the
Internet for a variety of reasons, some of which include:
• Indexing into a large search engine such as Google or Yahoo!
• Offline browsing—downloading documents onto a local hard
disk and rearranging hyperlinks to create almost a mirror
image for local browsing
• Downloading and storing for historical or archival purposes, or
• Web page caching to save superfluous downloading time on
Web site revisits.
The crawler in Example 9-3, crawl.py, takes a starting Web address (URL)
and downloads that page plus all other pages whose links appear in succeeding pages, but only those that are in the same domain as the starting page.
Without such a limitation, the crawl would wander across the Web and
eventually run out of disk space.
Example 9-3
Web Crawler (crawl.py)
The crawler consists of two classes: one to manage the entire crawling process
(Crawler), and one to retrieve and parse each downloaded Web page (Retriever).
(Refactored from earlier editions of this book.)
  1  #!/usr/bin/env python
  2
  3  import cStringIO
  4  import formatter
  5  from htmllib import HTMLParser
  6  import httplib
  7  import os
  8  import sys
  9  import urllib
 10  import urlparse
 11
 12  class Retriever(object):
 13      __slots__ = ('url', 'file')
 14      def __init__(self, url):
 15          self.url, self.file = self.get_file(url)
 16
 17      def get_file(self, url, default='index.html'):
 18          'Create usable local filename from URL'
 19          parsed = urlparse.urlparse(url)
 20          host = parsed.netloc.split('@')[-1].split(':')[0]
 21          filepath = '%s%s' % (host, parsed.path)
 22          if not os.path.splitext(parsed.path)[1]:
 23              filepath = os.path.join(filepath, default)
 24          linkdir = os.path.dirname(filepath)
 25          if not os.path.isdir(linkdir):
 26              if os.path.exists(linkdir):
 27                  os.unlink(linkdir)
 28              os.makedirs(linkdir)
 29          return url, filepath
 30
 31      def download(self):
 32          'Download URL to specific named file'
 33          try:
 34              retval = urllib.urlretrieve(self.url, self.file)
 35          except (IOError, httplib.InvalidURL) as e:
 36              retval = (('*** ERROR: bad URL "%s": %s' % (
 37                  self.url, e)),)
 38          return retval
 39
 40      def parse_links(self):
 41          'Parse out the links found in downloaded HTML file'
 42          f = open(self.file, 'r')
 43          data = f.read()
 44          f.close()
 45          parser = HTMLParser(formatter.AbstractFormatter(
 46              formatter.DumbWriter(cStringIO.StringIO())))
 47          parser.feed(data)
 48          parser.close()
 49          return parser.anchorlist
 50
 51  class Crawler(object):
 52      count = 0
 53
 54      def __init__(self, url):
 55          self.q = [url]
 56          self.seen = set()
 57          parsed = urlparse.urlparse(url)
 58          host = parsed.netloc.split('@')[-1].split(':')[0]
 59          self.dom = '.'.join(host.split('.')[-2:])
 60
 61      def get_page(self, url, media=False):
 62          'Download page & parse links, add to queue if nec'
 63          r = Retriever(url)
 64          fname = r.download()[0]
 65          if fname[0] == '*':
 66              print fname, '... skipping parse'
 67              return
 68          Crawler.count += 1
 69          print '\n(', Crawler.count, ')'
 70          print 'URL:', url
 71          print 'FILE:', fname
 72          self.seen.add(url)
 73          ftype = os.path.splitext(fname)[1]
 74          if ftype not in ('.htm', '.html'):
 75              return
 76
 77          for link in r.parse_links():
 78              if link.startswith('mailto:'):
 79                  print '... discarded, mailto link'
 80                  continue
 81              if not media:
 82                  ftype = os.path.splitext(link)[1]
 83                  if ftype in ('.mp3', '.mp4', '.m4v', '.wav'):
 84                      print '... discarded, media file'
 85                      continue
 86              if not link.startswith('http://'):
 87                  link = urlparse.urljoin(url, link)
 88              print '*', link,
 89              if link not in self.seen:
 90                  if self.dom not in link:
 91                      print '... discarded, not in domain'
 92                  else:
 93                      if link not in self.q:
 94                          self.q.append(link)
 95                          print '... new, added to Q'
 96                      else:
 97                          print '... discarded, already in Q'
 98              else:
 99                  print '... discarded, already processed'
100
101      def go(self, media=False):
102          'Process next page in queue (if any)'
103          while self.q:
104              url = self.q.pop()
105              self.get_page(url, media)
106
107  def main():
108      if len(sys.argv) > 1:
109          url = sys.argv[1]
110      else:
111          try:
112              url = raw_input('Enter starting URL: ')
113          except (KeyboardInterrupt, EOFError):
114              url = ''
115      if not url:
116          return
117      if not url.startswith('http://') and \
118              not url.startswith('ftp://'):
119          url = 'http://%s/' % url
120      robot = Crawler(url)
121      robot.go()
122
123  if __name__ == '__main__':
124      main()
Line-by-Line (Class-by-Class) Explanation
Lines 1–10
The top part of the script consists of the standard Python Unix startup line
and the import of the modules/packages to be used. Here are some brief
explanations:
• cStringIO, formatter, htmllib: We use various classes in
these modules for parsing HTML.
• httplib: We only need an exception from this module.
• os: This provides various file system functions.
• sys: We are just using argv for command-line arguments.
• urllib: We only need the urlretrieve() function for
downloading Web pages.
• urlparse: We use the urlparse() and urljoin() functions
for URL manipulation.
Lines 12–29
The Retriever class has the responsibility of downloading pages from the
Web and parsing the links located within each document, adding them to
the “to-do” queue, if necessary. A Retriever instance object is created for
each page that is downloaded from the Internet. Retriever consists of several methods to aid in its functionality: a constructor (__init__()),
get_file(), download(), and parse_links().
Skipping ahead momentarily, the get_file() method takes the given
URL and comes up with a safe and sane corresponding filename to store
the file locally—we are downloading this file from the Web. Basically, it
works by removing the http:// prefix from the URL, getting rid of any
extras such as username, password, and port number in order to arrive at
the hostname (line 20).
URLs without trailing file extensions will be given the default filename
index.html and can be overridden by the caller. You can see how this
works as well as the final filepath created on lines 21–23.
We then pull out the final destination directory (line 24) and check whether
it is already a directory; if so, we leave it alone and return the URL-filepath
pair. If we enter this if clause, the directory either doesn’t exist
or is a plain file. In the latter case, the file is erased. Finally, the target directory and any parents are created by using os.makedirs() in line 28.
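The URL-to-filename mapping just described can be tried on its own. What follows is a Python 3 sketch (the listing itself is Python 2, so urlparse comes from urllib.parse here); local_path is a hypothetical name, and unlike get_file() it deliberately leaves the file system alone, so no directories are created or removed.

```python
import os
from urllib.parse import urlparse

def local_path(url, default='index.html'):
    """Map a URL to a relative local file path (no directories are created)."""
    parsed = urlparse(url)
    # Drop any 'user:password@' prefix and ':port' suffix from the netloc.
    host = parsed.netloc.split('@')[-1].split(':')[0]
    filepath = '%s%s' % (host, parsed.path)
    # No file extension: treat the path as a directory and add the default.
    if not os.path.splitext(parsed.path)[1]:
        filepath = os.path.join(filepath, default)
    return filepath

print(local_path('http://www.null.com/home/index.html'))
print(local_path('http://user:pw@www.null.com:8080/docs'))
```

On a POSIX system, the first call keeps the existing filename, while the second strips the credentials and port and appends the default index.html.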
Now let’s go back up to the initializer __init__(). A Retriever object is
created and stores both the URL (str) and the corresponding filename
returned by get_file() as (instance) attributes. In our current design,
instances are created for every file downloaded. In the case of a Web site
with many, many files, even small instances like this add up to noticeable memory usage. To help minimize consumed resources, we create a __slots__
variable, indicating that the only attributes that instances can have are
self.url and self.file.
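The effect of __slots__ is easy to observe in isolation. In this sketch the class borrows the Retriever name but uses a simplified, hypothetical constructor that takes both values directly rather than calling get_file().

```python
class Retriever(object):
    __slots__ = ('url', 'file')      # the only attributes instances may have

    def __init__(self, url, file):
        self.url, self.file = url, file

r = Retriever('http://www.null.com/', 'www.null.com/index.html')
try:
    r.headers = {}                   # not listed in __slots__
except AttributeError as e:
    print('rejected:', e)

# Instances carry no per-instance __dict__, which is where the
# memory saving comes from.
print(hasattr(r, '__dict__'))
```

The trade-off is flexibility: any attribute you later want to store on an instance has to be added to __slots__ first.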
Lines 31–49
We’ll see the crawler momentarily, but this is a heads-up that it creates
Retriever objects for each downloaded file. The download() method, as
you can imagine, actually goes out to the Internet to download the page
with the given link (line 34). It calls urllib.urlretrieve() with the URL
and saves it to the filename (the one returned by get_file()).
If the download was successful, urlretrieve() returns a tuple whose first
item is the filename (line 34), but if there’s an error, a one-item tuple
holding an error string prefixed with *** is returned instead (lines
35–37). The crawler checks the first item of this return value and calls
parse_links() to parse links out of the just-downloaded page only if all
went well.
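The same return-a-tuple-either-way convention can be sketched in Python 3, where urlretrieve() lives in urllib.request and IOError is an alias of OSError. The standalone download() function below is an assumption made for illustration (in crawl.py it is a method of Retriever), and catching ValueError for malformed URLs is an addition that the original listing does not have.

```python
import http.client
import urllib.request

def download(url, filename):
    """Save url to filename; return urlretrieve()'s tuple or an error tuple."""
    try:
        return urllib.request.urlretrieve(url, filename)
    except (OSError, ValueError, http.client.InvalidURL) as e:
        # Mirror crawl.py: a one-element tuple whose string starts with '***'.
        return ('*** ERROR: bad URL "%s": %s' % (url, e),)

result = download('not-a-url', 'ignored.html')
if result[0].startswith('*'):
    print(result[0], '... skipping parse')
```

Because both outcomes are tuples, the caller can always index [0] and only has to look at the first character to tell success from failure.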
The more serious method in this part of our application is the
parse_links() method. Yes, the job of a crawler is to download Web
pages, but a recursive crawler (like ours) looks for additional links in each
downloaded page and processes them, too. It first opens up the downloaded Web page and extracts the entire HTML content as a single string
(lines 42–44).
The magic you see in lines 45–49 is a well-known recipe that uses the
htmllib.HTMLParser class. We would like to say something to the effect
that this is a recipe that’s been passed down by Python programmers from
generation to generation, but we would just be lying to you. Anyway, we
digress.
The main point of how it works is that the parser class doesn’t do I/O, so
it takes a formatter object to handle that. Formatter objects—Python only
has one real formatter: formatter.AbstractFormatter—parse the data and
use a writer object to dispatch its output. Similarly, Python only has one
useful writer object: formatter.DumbWriter. It optionally takes a file object
to which to write the output. If you omit it, it writes to standard output,
which is probably undesirable. To that effect, we instantiate a
cStringIO.StringIO object to absorb this output (think /dev/null, if you
know what that is). You can search online for any of the class names and
find similar code snippets in many places along with additional commentary.
Because htmllib.HTMLParser is fairly long in the tooth and deprecated
starting in version 2.6, a smaller example demonstrating some of the more
contemporary tools comes in the next subsection. We leave it in this example because it is/was such a common recipe and still can be the right tool
for this job.
Anyway, all the complexity in creating the parser is entirely contained
in a single call (lines 45–46). The rest of this block consists of passing in the
HTML, closing the parser, and then returning a list of parsed links/anchors.
Lines 51–59
The Crawler class is the star of the show, managing the entire crawling
process for one Web site. If we added threading to our application, we
would create separate instances for each site crawled. The Crawler consists
of three items stored by the constructor during the instantiation phase, the
first of which is self.q, a queue of links to download. Such a list will fluctuate during execution, shrinking as each page is processed and expanding
as new links are discovered within each downloaded page.
The other two data values for the Crawler are self.seen, a set
containing all the links that we have seen (downloaded) already, and
self.dom, the domain name for the main link, which we use to determine
whether any succeeding links are part of the same domain. All three values
are created in the initializer method __init__() in lines 54–59.
Note that we parse the domain by using urlparse.urlparse() (line 57)
in the same way that we grab the hostname out of the URL in the
Retriever (line 58). The domain name comes by just taking the final two
parts of the hostname. Note that because we don’t use the host for anything
else, you can make your code harder to read by combining lines 57–59 like
this:
self.dom = '.'.join(urlparse.urlparse(
url).netloc.split('@')[-1].split(':')[0].split('.')[-2:])
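For experimenting outside the class, the same computation can be wrapped in a small function. This is a Python 3 sketch with a hypothetical name, base_domain; it also makes visible a limitation of the "last two labels" heuristic, which treats a suffix such as co.uk as if it were the site's domain.

```python
from urllib.parse import urlparse

def base_domain(url):
    """Return the last two dot-separated labels of the URL's hostname."""
    # Strip 'user:password@' and ':port' before splitting on dots.
    host = urlparse(url).netloc.split('@')[-1].split(':')[0]
    return '.'.join(host.split('.')[-2:])

print(base_domain('http://user:pw@www.null.com:8080/home/'))
print(base_domain('http://www.example.co.uk/'))
```

For the simple .com hosts used in this chapter the heuristic is adequate; a crawler meant for arbitrary sites would need a smarter rule.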
Right above __init__(), the Crawler also has a static data item named
count. The purpose of this counter is just to keep track of the number of
objects we have downloaded from the Internet. It is incremented for every
successfully downloaded page.
Lines 61–105
Crawler has a pair of other methods in addition to its constructor:
get_page() and go(). go() is simply the method that is used to start the
Crawler. It is called from the main body of code. go() consists of a loop
that will continue to execute as long as there are new links in the queue
that need to be downloaded. The workhorse of this class, though, is the
get_page() method.
get_page() instantiates a Retriever object with the first link and lets it
go off to the races. If the page was downloaded successfully, the counter is
incremented (otherwise, links that error-out are skipped [lines 65–67]) and
the link added to the “already seen” set (line 72). We use a set because
order doesn’t matter and its lookup is much faster than using a list.
get_page() looks at all the links featured inside each downloaded page
(skipping all non-Web pages [lines 73–75]) and determines whether any
more links should be added to the queue (lines 77–99). The main loop in
go() will continue to process links until the queue is empty, at which time
victory is declared (lines 103–105).
Links are ignored and not added to the queue if they are mailto: links
(lines 78–80), are part of another domain (lines 90–91), have already
been downloaded (lines 98–99), or are already in the queue waiting to be
processed (lines 96–97). The same applies to media files (lines 81–85).
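The queue-and-seen-set logic is independent of any networking, so it can be exercised against an in-memory stand-in for a Web site. The sketch below is Python 3 and makes several assumptions: crawl() and its get_links callback are hypothetical names, the SITE dictionary is invented test data, and the media-file check is left out for brevity.

```python
def crawl(start, get_links, dom):
    """Visit pages depth-first; get_links(url) returns that page's links."""
    q, seen, order = [start], set(), []
    while q:
        url = q.pop()                 # like crawl.py, take from the end
        seen.add(url)
        order.append(url)
        for link in get_links(url):
            if link.startswith('mailto:'):
                continue              # not a Web page
            if link in seen or link in q:
                continue              # already processed or already queued
            if dom not in link:
                continue              # outside the starting domain
            q.append(link)
    return order

# Hypothetical in-memory "site" standing in for real downloads.
SITE = {
    'http://www.null.com/a': ['http://www.null.com/b', 'mailto:x@null.com',
                              'http://bogus.com/'],
    'http://www.null.com/b': ['http://www.null.com/a', 'http://www.null.com/c'],
    'http://www.null.com/c': [],
}
print(crawl('http://www.null.com/a', lambda u: SITE.get(u, []), 'null.com'))
```

Note that, as in crawl.py, the domain test is a plain substring check on the whole link, so it is a convenience rather than a strict guarantee.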
Lines 107–124
main() needs a URL to begin processing. If one is entered on the command
line (for example, when this script is invoked directly; lines 108–109), it
will just go with the one given. Otherwise, the script enters interactive
mode, prompting the user for a starting URL (line 112). With a starting
link in hand, the Crawler is instantiated, and away we go (lines 120–121).
One sample invocation of crawl.py might look like this:
$ crawl.py
Enter starting URL: http://www.null.com/home/index.html
( 1 )
URL: http://www.null.com/home/index.html
FILE: www.null.com/home/index.html
* http://www.null.com/home/overview.html ... new, added to Q
* http://www.null.com/home/synopsis.html ... new, added to Q
* http://www.null.com/home/order.html ... new, added to Q
* mailto:[email protected] ... discarded, mailto link
* http://www.null.com/home/overview.html ... discarded, already in Q
* http://www.null.com/home/synopsis.html ... discarded, already in Q
* http://www.null.com/home/order.html ... discarded, already in Q
* mailto:[email protected] ... discarded, mailto link
* http://bogus.com/index.html ... discarded, not in domain
( 2 )
URL: http://www.null.com/home/order.html
FILE: www.null.com/home/order.html
* mailto:[email protected] ... discarded, mailto link
* http://www.null.com/home/index.html ... discarded, already processed
* http://www.null.com/home/synopsis.html ... discarded, already in Q
* http://www.null.com/home/overview.html ... discarded, already in Q
( 3 )
URL: http://www.null.com/home/synopsis.html
FILE: www.null.com/home/synopsis.html
* http://www.null.com/home/index.html ... discarded, already processed
* http://www.null.com/home/order.html ... discarded, already processed
* http://www.null.com/home/overview.html ... discarded, already in Q
( 4 )
URL: http://www.null.com/home/overview.html
FILE: www.null.com/home/overview.html
* http://www.null.com/home/synopsis.html ... discarded, already processed
* http://www.null.com/home/index.html ... discarded, already processed
* http://www.null.com/home/synopsis.html ... discarded, already processed
* http://www.null.com/home/order.html ... discarded, already processed
After execution, a www.null.com directory would be created in the local
file system, with a home subdirectory. You will find all the processed files
within home.
If after reviewing the code you’re still wondering where writing a
crawler in Python can get you, you might be surprised to learn that the
original Google Web crawlers were written in Python. For more information, see http://infolab.stanford.edu/~backrub/google.html.
9.3.2 Parsing Web Content
In the previous subsection, we took a look at a crawler Web client. Part of
the spidering process involved parsing of links, or anchors as they’re officially called. For a long while, the well-known recipe htmllib.HTMLParser
was employed for parsing Web pages; however, newer and improved
modules and packages have come along. We’ll be demonstrating some of
these in this subsection.
In Example 9-4, we explore one standard library tool, the HTMLParser class
in the HTMLParser module (added in version 2.2). HTMLParser.HTMLParser
was supposed to replace htmllib.HTMLParser because it was simpler, provided a lower-level view of the content, and handled XHTML, whereas
the latter was older and more complex because it was based on the
sgmllib module (meaning it had to understand the intricacies of Standard
Generalized Markup Language [SGML]). The official documentation is
fairly sparse when describing how to use HTMLParser.HTMLParser, so hopefully we’ll give a more useful example here.
We’ll also demonstrate the use of two of the other three most popular
Web parsers, BeautifulSoup and html5lib, which are available as separate
downloads outside of the standard library. You can access them both at the
Cheeseshop, that is, the Python Package Index (PyPI) at
http://pypi.python.org. For a less stressful installation, you can also use the easy_install or pip tools to get either one.
The one we skipped was lxml; we’ll leave that as an exercise for you to
undertake. You’ll find more exercises at the end of the chapter that will
help you learn these more thoroughly by substituting them for
htmllib.HTMLParser in the crawler.
The parse_links.py script in Example 9-4 only consists of parsing anchors
out of any input data. Given a URL, it will extract all links, attempt to
make any necessary adjustments to make them full URLs, sort, and display them to the user. It runs each URL through all three parsers. For
BeautifulSoup in particular, we provide two different solutions: the first
one is simpler, parsing all tags then looking for all the anchor tags; the second requires the use of the SoupStrainer class, which specifically targets
anchor tags and only parses those.
Example 9-4
Link Parser (parse_links.py)
This script uses three different parsers to extract links from HTML anchor tags.
It features the HTMLParser standard library module as well as the third-party
BeautifulSoup and html5lib packages.
 1  #!/usr/bin/env python
 2
 3  from HTMLParser import HTMLParser
 4  from cStringIO import StringIO
 5  from urllib2 import urlopen
 6  from urlparse import urljoin
 7
 8  from BeautifulSoup import BeautifulSoup, SoupStrainer
 9  from html5lib import parse, treebuilders
10
11  URLs = (
12      'http://python.org',
13      'http://google.com',
14  )
15
16  def output(x):
17      print '\n'.join(sorted(set(x)))
18
19  def simpleBS(url, f):
20      'simpleBS() - use BeautifulSoup to parse all tags to get anchors'
21      output(urljoin(url, x['href']) for x in BeautifulSoup(
22          f).findAll('a'))
23
24  def fasterBS(url, f):
25      'fasterBS() - use BeautifulSoup to parse only anchor tags'
26      output(urljoin(url, x['href']) for x in BeautifulSoup(
27          f, parseOnlyThese=SoupStrainer('a')))
28
29  def htmlparser(url, f):
30      'htmlparser() - use HTMLParser to parse anchor tags'
31      class AnchorParser(HTMLParser):
32          def handle_starttag(self, tag, attrs):
33              if tag != 'a':
34                  return
35              if not hasattr(self, 'data'):
36                  self.data = []
37              for attr in attrs:
38                  if attr[0] == 'href':
39                      self.data.append(attr[1])
40      parser = AnchorParser()
41      parser.feed(f.read())
42      output(urljoin(url, x) for x in parser.data)
43
44  def html5libparse(url, f):
45      'html5libparse() - use html5lib to parse anchor tags'
46      output(urljoin(url, x.attributes['href']) \
47          for x in parse(f) if isinstance(x,
48          treebuilders.simpletree.Element) and \
49          x.name == 'a')
50
51  def process(url, data):
52      print '\n*** simple BS'
53      simpleBS(url, data)
54      data.seek(0)
55      print '\n*** faster BS'
56      fasterBS(url, data)
57      data.seek(0)
58      print '\n*** HTMLParser'
59      htmlparser(url, data)
60      data.seek(0)
61      print '\n*** HTML5lib'
62      html5libparse(url, data)
63
64  def main():
65      for url in URLs:
66          f = urlopen(url)
67          data = StringIO(f.read())
68          f.close()
69          process(url, data)
70
71  if __name__ == '__main__':
72      main()
Line-by-Line Explanation
Lines 1–9
In this script, we use four modules from the standard library. HTMLParser is
one of the parsers; the other three are for general use throughout. The second group of imports are of third-party (non-standard library) modules/
packages. This ordering is the generally accepted standard for imports:
standard library modules/packages first, followed by third-party installations, and finally, any modules/packages local to the application.
Lines 11–17
The URLs variable contains the Web pages to parse; feel free to add, change,
or remove URLs here. The output() function takes an iterable of links,
removes duplicates by putting them all into a set, sorts them in lexicographic order, and then merges them into a NEWLINE-delimited string
that is displayed to the user.
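The de-duplicate, sort, and join steps performed by output() can be seen with a few hand-written links. This sketch is written for Python 3 (print is a function), and format_links is a hypothetical helper split out only so that the resulting string can be inspected; the sample URLs are made up.

```python
def format_links(links):
    """Return the unique links as one NEWLINE-delimited, sorted string."""
    return '\n'.join(sorted(set(links)))

def output(links):
    print(format_links(links))

# Any iterable works, including a generator expression.
output(x for x in ['http://python.org/b',
                   'http://python.org/a',
                   'http://python.org/b'])
```

Because set() consumes the iterable once, passing a generator expression costs nothing extra compared with passing a list.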
Lines 19–27
We highlight the use of BeautifulSoup in the simpleBS() and fasterBS()
functions. In simpleBS(), the parsing happens when you instantiate BeautifulSoup with the file handle. In the following short snippet, we do
exactly that, using an already downloaded page from the PyCon Web site
as pycon.html.
>>> from BeautifulSoup import BeautifulSoup as BS
>>> f = open('pycon.html')
>>> bs = BS(f)
When you get the instance and call its findAll() method requesting
anchor (‘a’) tags, it returns a list of tags, as shown here:
>>> type(bs)
<class 'BeautifulSoup.BeautifulSoup'>
>>> tags = bs.findAll('a')
>>> type(tags)
<type 'list'>
>>> len(tags)
19
>>> tag = tags[0]
>>> tag
<a href="/2011/">PyCon 2011 Atlanta</a>
>>> type(tag)
<class 'BeautifulSoup.Tag'>
>>> tag['href']
u'/2011/'
Because the Tag object is an anchor, it should have an 'href' attribute, so
we ask for it. We then call urlparse.urljoin() and pass along the head URL
along with the link to get the full URL. Here’s our continuing example
(assuming the PyCon URL):
>>> from urlparse import urljoin
>>> url = 'http://us.pycon.org'
>>> urljoin(url, tag['href'])
u'http://us.pycon.org/2011/'
The generator expression iterates over all the final links created by
urlparse.urljoin() from all of the anchor tags and sends them to output(),
which processes them as just described. If the code is slightly more difficult to understand because of the use of the generator expression, we can
expand out the code to the equivalent:
def simpleBS(url, f):
    parsed = BeautifulSoup(f)
    tags = parsed.findAll('a')
    links = [urljoin(url, tag['href']) for tag in tags]
    output(links)
For readability purposes, this wins over our single line version, and we
would recommend that when developing open-source, work, or group collaborative projects, you always consider this over a more cryptic one-liner.
Although the simpleBS() function is fairly easy to understand, one of its
drawbacks is that the way we’re processing it isn’t as efficient as it can be.
We use BeautifulSoup to parse all the tags in this document and then look
for the anchors. It would be quicker if we could just filter only the anchor
tags (and ignore the rest).
This is what fasterBS() does, accomplishing what we just described by
using the SoupStrainer helper class (and passing that request to filter only
anchor tags as the parseOnlyThese parameter). By using SoupStrainer, you
can tell BeautifulSoup to skip all the elements it isn’t interested in when
building the parse tree, so it saves time as well as memory. Also, once
parsing has completed, only the anchors make up the parse tree, so there’s
no need to use the findAll() method before iterating.
Lines 29–42
In htmlparser(), we use the standard library class HTMLParser.HTMLParser
to do the parsing. You can see why BeautifulSoup is a popular parser;
code is shorter and less complex than using HTMLParser. Our use of
HTMLParser is also slower here because you have to manually build a list,
that is, create an empty list and repeatedly call its append() method.
You can also tell that HTMLParser is lower level than BeautifulSoup. You
subclass it and have to create a method called handle_starttag() that’s
called every time a new tag is encountered in the file stream (lines 31–39).
We skip all non-anchor tags (lines 33–34), and then add all anchor links to
self.data (lines 37–39), initializing self.data when necessary (lines 35–36).
To use your new parser, you instantiate and feed it (lines 40–41). The
results, as you know, are placed into parser.data, and we create the full
URLs and display them (line 42) as in our previous BeautifulSoup example.
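In Python 3 the same class lives in the html.parser module. The following sketch assumes that interpreter and departs from the listing in two small ways: self.data is initialized in __init__() rather than lazily, and anchors whose href has no value are skipped. The sample HTML string is made up for the demonstration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = []                # initialize up front, not lazily

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs for this tag.
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.data.append(value)

page = '<p><a href="/2011/">PyCon 2011 Atlanta</a> <a name="top">x</a></p>'
parser = AnchorParser()
parser.feed(page)
links = [urljoin('http://us.pycon.org', x) for x in parser.data]
print(links)
```

Initializing the list in the constructor also means parser.data exists even when a page contains no anchors at all.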
Lines 44–49
The final example uses html5lib, a parser for HTML documents that follow the HTML5 specification. The simplest way of using html5lib is to call
its parse() function with the payload (line 47). It builds and outputs a tree
in its custom simpletree format.
You can also choose to use any of a variety of popular tree formats,
including minidom, ElementTree, lxml, or BeautifulSoup. To choose an alternative tree format, just pass the name of the desired format in to parse() as
the treebuilder argument:
import html5lib
f = open("pycon.html")
tree = html5lib.parse(f, treebuilder="lxml")
f.close()
Unless you need a specific tree, usually simpletree is good enough. If
you were to perform a trial run and parse a generic document, you’d see
output looking something like this:
>>> import html5lib
>>> f = open("pycon.html")
>>> tree = html5lib.parse(f)
>>> f.close()
>>> for x in tree:
... print x, type(x)
...
<html> <class 'html5lib.treebuilders.simpletree.DocumentType'>
<html> <class 'html5lib.treebuilders.simpletree.Element'>
<head> <class 'html5lib.treebuilders.simpletree.Element'>
<None> <class 'html5lib.treebuilders.simpletree.TextNode'>
<meta> <class 'html5lib.treebuilders.simpletree.Element'>
<None> <class 'html5lib.treebuilders.simpletree.TextNode'>
<title> <class 'html5lib.treebuilders.simpletree.Element'>
<None> <class 'html5lib.treebuilders.simpletree.TextNode'>
<None> <class 'html5lib.treebuilders.simpletree.CommentNode'>
. . .
<img> <class 'html5lib.treebuilders.simpletree.Element'>
<None> <class 'html5lib.treebuilders.simpletree.TextNode'>
<h1> <class 'html5lib.treebuilders.simpletree.Element'>
<a> <class 'html5lib.treebuilders.simpletree.Element'>
<None> <class 'html5lib.treebuilders.simpletree.TextNode'>
<h2> <class 'html5lib.treebuilders.simpletree.Element'>
<None> <class 'html5lib.treebuilders.simpletree.TextNode'>
. . .
Most of the traversed items are either Element or TextNode objects. We
don’t really care about TextNode objects in our example here; we’re only
concerned with one specific type of Element object, the anchor. To pick
these out, we have two checks in the if clause of the generator expression:
only look at Elements, and of those, only anchors (lines 47–49). For those
that meet these criteria, we pull out their 'href' attribute, merge it into a complete URL, and output that as before (line 46).
Lines 51–72
The driver of this application is the main() function, which processes each
of the URLs listed on lines 11–14. It makes one call to download the Web page
and immediately sticks the data into a StringIO object (lines 65–68) so that
we can hand it to each of the parsers (line 69) via a call to
process().
The process() function (lines 51–62) takes the target URL and the
StringIO object, and then calls on each parser to perform its duty and output its result. With every successive parse (after the first), process() must
also reset the StringIO object back to the beginning (lines 54, 57, and 60)
for the next parser.
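Why the rewinding is needed can be seen with a tiny experiment. This sketch uses Python 3's io.StringIO in place of cStringIO.StringIO, and the one-line HTML string is a made-up stand-in for a downloaded page.

```python
from io import StringIO

data = StringIO('<a href="/2011/">PyCon</a>')
first = data.read()      # a parser reading the file-like object drains it
second = data.read()     # nothing left: the position is at the end
data.seek(0)             # rewind, as process() does between parsers
third = data.read()
print(repr(second), third == first)
```

Buffering the page once and rewinding between parsers means each URL is fetched over the network only once.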
Once you’re satisfied with the code and have it working, you can run it
and see how each parser outputs all links (sorted in alphabetical order)
found in anchor tags within the Web page’s URL. Note that at the time of
this writing, there is a preliminary port of BeautifulSoup to Python 3 but
not html5lib.
9.3.3 Programmatic Web Browsing
In this final section on Web clients, we’ll present a slightly different example that uses a third-party tool, Mechanize (based on a similarly-named
tool written for Perl), which is designed to simulate a browser. It also
spawned off a Ruby version.
In the previous example (parse_links.py), BeautifulSoup was one of the
parsers we used to decipher Web page content. We’ll use that again here.
If you wish to play along, you’ll need to have both Mechanize and
BeautifulSoup installed on your system. Again, you can obtain and install
them separately, or you can use a tool like easy_install or pip.
Example 9-5 presents the mech.py script, which is very much a script
or batch-style application. There are no classes or functions. The whole
thing is just one large main() broken up into seven parts, each of which
explores one page of the Web site we’re examining today: the PyCon conference Web site from 2011. We chose this because the site is not likely to
change over time (more recent conferences will get their own customized
application).
If it does change, however, there are many Web sites to which you can
adapt this example, such as logging in to any Web-based e-mail service you
subscribe to or some tech news or blog site you frequent. By going over
mech.py and what it does, you should have a good enough understanding of
how it works to easily port the sample code to work elsewhere.
Example 9-5
Programmatic Web Browsing (mech.py)
In a very batch-like, straightforward script, we employ the Mechanize third-party tool to explore the PyCon 2011 Web site, parsing it with another non-standard tool, BeautifulSoup.
 1  #!/usr/bin/env python
 2
 3  from BeautifulSoup import BeautifulSoup as BS, SoupStrainer
 4  from mechanize import Browser
 5
 6  br = Browser()
 7
 8  # home page
 9  rsp = br.open('http://us.pycon.org/2011/home/')
10  print '\n***', rsp.geturl()
11  print "Confirm home page has 'Log in' link; click it"
12  page = rsp.read()
13  assert 'Log in' in page, 'Log in not in page'
14  rsp = br.follow_link(text_regex='Log in')
15
16  # login page
17  print '\n***', rsp.geturl()
18  print 'Confirm at least a login form; submit invalid creds'
19  assert len(list(br.forms())) > 1, 'no forms on this page'
20  br.select_form(nr=0)
21  br.form['username'] = 'xxx'    # wrong login
22  br.form['password'] = 'xxx'    # wrong passwd
23  rsp = br.submit()
24
25  # login page, with error
26  print '\n***', rsp.geturl()
27  print 'Error due to invalid creds; resubmit w/valid creds'
28  assert rsp.geturl() == 'http://us.pycon.org/2011/account/login/', rsp.geturl()
29  page = rsp.read()
30  err = str(BS(page).find("div",
31      {"id": "errorMsg"}).find('ul').find('li').string)
32  assert err == 'The username and/or password you specified are not correct.', err
33  br.select_form(nr=0)
34  br.form['username'] = YOUR_LOGIN
35  br.form['password'] = YOUR_PASSWD
36  rsp = br.submit()
37
38  # login successful, home page redirect
39  print '\n***', rsp.geturl()
40  print 'Logged in properly on home page; click Account link'
41  assert rsp.geturl() == 'http://us.pycon.org/2011/home/', rsp.geturl()
42  page = rsp.read()
43  assert 'Logout' in page, 'Logout not in page'
44  rsp = br.follow_link(text_regex='Account')
45
46  # account page
47  print '\n***', rsp.geturl()
48  print 'Email address parseable on Account page; go back'
49  assert rsp.geturl() == 'http://us.pycon.org/2011/account/email/', rsp.geturl()
50  page = rsp.read()
51  assert 'Email Addresses' in page, 'Missing email addresses'
52  print '    Primary e-mail: %r' % str(
53      BS(page).find('table').find('tr').find('td').find('b').string)
54  rsp = br.back()
55
56  # back to home page
57  print '\n***', rsp.geturl()
58  print 'Back works, on home page again; click Logout link'
59  assert rsp.geturl() == 'http://us.pycon.org/2011/home/', rsp.geturl()
60  rsp = br.follow_link(url_regex='logout')
61
62  # logout page
63  print '\n***', rsp.geturl()
64  print 'Confirm on Logout page and Log in link at the top'
65  assert rsp.geturl() == 'http://us.pycon.org/2011/account/logout/', rsp.geturl()
66  page = rsp.read()
67  assert 'Log in' in page, 'Log in not in page'
68  print '\n*** DONE'
Line-by-Line Explanation
Lines 1–6
This script is fairly simplistic. In fact, we don’t use any standard library packages/modules, so all you see here are the imports of the mechanize.Browser
and BeautifulSoup.BeautifulSoup classes (the latter referred to in the code as BS), followed by the creation of a Browser instance.
Lines 8–14
The first place we visit on the PyCon 2011 Web site is the home page. We
display the URL to the user as a confirmation (line 10). Note that this is the
final URL that is visited because the original link might have redirected
the user elsewhere. The last part of this section (lines 12–14) confirms that
the user is not logged in by looking for the 'Log in' link and following it.
Lines 16–23
Once we’ve confirmed that we’re on a login page (that has at least one
form on it), we select the first (and only) form, fill in the authentication
fields with erroneous data (unless, unfortunately, your login and password are both 'xxx'), and submit it.
Lines 25–36
Upon confirmation of a login error on the login page (lines 28–32), we fill
in the fields with the correct credentials (which the reader must supply
[YOUR_LOGIN, YOUR_PASSWD]) and resubmit.
Lines 38–44
Once authentication has been validated, you are directed back to the home
page. This is confirmed (on lines 41–43) by checking for a “Logout” link
(which wouldn’t be there if you had not successfully logged in). We then
click the Account link.
Lines 46–54
You must register by using an e-mail address. You can have more than
one, but there must be a single primary address. Your e-mail addresses are
the first tab that you arrive at when visiting this page for your Account
information. We use BeautifulSoup to parse and display the e-mail
address table and peek into the first cell of the first row of the table (lines
52–53). The next step is to “click” the Back button to return to
the home page.
Lines 56–60
This is the shortest of all the sections; we really don’t do much here except
confirm that we’re back on the home page (line 59), then follow the “Logout” link.
Lines 62–68
The last section confirms we’re on the logout page and that you’re not
logged in. This is accomplished by checking to see if there’s a “Log in” link
on this page (lines 66–67).
This application demonstrates that using mechanize.Browser is fairly
straightforward. You just need to mentally map user activity in a browser
to the right method calls. Ultimately, the primary concern is whether the
underlying Web page or application will be altered by its developers,
potentially rendering our script out-of-date. Note that at the time of this
writing, there is no Python 3 port of Mechanize yet.
Summary
This concludes our look at various types of Web clients. We can now turn
our attention to Web servers.
9.4 Web (HTTP) Servers
Until now, we have been discussing the use of Python in creating Web clients and performing tasks to aid Web servers in request processing. We
know (and saw earlier in this chapter) that Python can be used to create
both simple and complex Web clients.
However, we have yet to explore the creation of Web servers, and that is
the focus of this section. If Google Chrome, Mozilla Firefox, Microsoft
Internet Explorer, and Opera are among the most popular Web clients,
then what are the most common Web servers? They are Apache, lighttpd,
Microsoft IIS, LiteSpeed Technologies LiteSpeed, and ACME Laboratories
thttpd. For situations in which these servers might be overkill for your
desired application, Python can be used to create simple yet useful Web
servers.
Note that although these servers are simplistic and not meant for production, they can be very useful in providing development servers for
your users. Both the Django and Google App Engine development servers
are based on the BaseHTTPServer module described in the next section.
9.4.1 Simple Web Servers in Python
The base code needed is already available in the Python standard library—you
just need to customize it for your needs. To create a Web server, a base
server and a handler are required.
The base Web server is a boilerplate item—a must-have. Its role is to
perform the necessary HTTP communication between client and server.
The base server class is (appropriately) named HTTPServer and is found in
the BaseHTTPServer module.
The handler is the piece of software that does the majority of the Web
serving. It processes the client request and returns the appropriate file,
whether static or dynamically generated. The complexity of the handler
determines the complexity of your Web server. The Python Standard
Library provides three different handlers.
The most basic, plain, vanilla handler, BaseHTTPRequestHandler, is found
in the BaseHTTPServer module, along with the base Web server. Other than
taking a client request, no other handling is implemented at all, so you
have to do it all yourself, such as in our myhttpd.py server coming up.
The SimpleHTTPRequestHandler, available in the SimpleHTTPServer
module, builds on BaseHTTPRequestHandler by implementing the standard GET and HEAD requests in a fairly straightforward manner. Still
nothing sexy, but it gets the simple jobs done.
Finally, we have the CGIHTTPRequestHandler, available in the CGIHTTPServer
module, which takes the SimpleHTTPRequestHandler and adds support for
POST requests. It has the ability to call common gateway interface (CGI)
scripts to perform the requested processing and can send the generated
HTML back to the client. In this chapter, we’re only going to explore a
CGI-processing server; the next chapter will describe to you why CGI is no
longer the way the world of the Web works, but you still need to know the
concepts.
To improve the user experience, consistency, and code maintenance,
these modules (actually their classes) have been combined into a single
module named server.py and installed as part of the http package in
Python 3. (Similarly, the Python 2 httplib [HTTP client] module has been
renamed to http.client in Python 3.) The three modules, their classes,
and the Python 3 http.server umbrella module are summarized in Table 9-6.
Table 9-6 Web Server Modules and Classes

Module                  Description
BaseHTTPServer (a)      Provides the base Web server and base handler classes,
                        HTTPServer and BaseHTTPRequestHandler, respectively
SimpleHTTPServer (a)    Contains the SimpleHTTPRequestHandler class to perform
                        GET and HEAD requests
CGIHTTPServer (a)       Contains the CGIHTTPRequestHandler class to process
                        POST requests and perform CGI execution
http.server (b)         All three Python 2 modules and classes above combined
                        into a single Python 3 module

a. Removed in Python 3.0.
b. New in Python 3.0.
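Because the classes kept their names when the modules were merged, code that has to run under both Python 2 and Python 3 can guard its imports. The following is only a minimal sketch of that idea, not code from this chapter; the handler_names() helper is hypothetical and exists just so that the result can be inspected. CGIHTTPRequestHandler is left out on purpose because recent Python 3 releases have deprecated it.

```python
# Import the base server and handlers under either major Python version.
try:
    # Python 3: everything lives in the http.server module
    from http.server import (HTTPServer, BaseHTTPRequestHandler,
                             SimpleHTTPRequestHandler)
except ImportError:
    # Python 2: the classes are spread across separate modules
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
    from SimpleHTTPServer import SimpleHTTPRequestHandler


def handler_names():
    """Return the names of the imported handler classes (hypothetical helper)."""
    return [cls.__name__ for cls in
            (BaseHTTPRequestHandler, SimpleHTTPRequestHandler)]
```

The rest of a script can then refer to HTTPServer and the handler classes without caring which branch of the try statement supplied them.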
Implementing a Simple Base Web Server
To be able to understand how the more advanced handlers found in the
SimpleHTTPServer and CGIHTTPServer modules work, we will implement
simple GET processing for a BaseHTTPRequestHandler. In Example 9-6, we
present the code for a fully working Web server, myhttpd.py.
Example 9-6 Simple Web Server (myhttpd.py)

This simple Web server can read GET requests, fetch a Web page (.html file),
and return it to the calling client. It uses the BaseHTTPRequestHandler found in
BaseHTTPServer and implements the do_GET() method to enable processing of
GET requests.

 1  #!/usr/bin/env python
 2
 3  from BaseHTTPServer import \
 4      BaseHTTPRequestHandler, HTTPServer
 5
 6  class MyHandler(BaseHTTPRequestHandler):
 7      def do_GET(self):
 8          try:
 9              f = open(self.path[1:], 'r')
10              self.send_response(200)
11              self.send_header('Content-type', 'text/html')
12              self.end_headers()
13              self.wfile.write(f.read())
14              f.close()
15          except IOError:
16              self.send_error(404,
17                  'File Not Found: %s' % self.path)
18
19  def main():
20      try:
21          server = HTTPServer(('', 80), MyHandler)
22          print 'Welcome to the machine...',
23          print 'Press ^C once or twice to quit.'
24          server.serve_forever()
25      except KeyboardInterrupt:
26          print '^C received, shutting down server'
27          server.socket.close()
28
29  if __name__ == '__main__':
30      main()
The handler class derives from BaseHTTPRequestHandler and consists of a
single do_GET() method (lines 6–7), which is called when the base server
receives a GET request. We attempt to open the path passed in by the
client, after removing the leading ‘/’ (line 9). If all goes well, we return
an “OK” status (200) and send the contents of the requested Web page to the user
(line 13) via the wfile pipe. If the file is not found, the handler returns a 404 status
(lines 15–17).
The main() function simply instantiates our Web server class and
invokes it to run our familiar infinite server loop, shutting it down if
interrupted by Ctrl+C or similar keystroke. If you have appropriate
access and can run this server, you will notice that it displays loggable
output, which will look something like this:
# myhttpd.py
Welcome to the machine... Press ^C once or twice to quit
localhost - - [26/Aug/2000 03:01:35] "GET /index.html HTTP/1.0" 200 -
localhost - - [26/Aug/2000 03:01:29] code 404, message File Not Found: x.html
localhost - - [26/Aug/2000 03:01:29] "GET /dummy.html HTTP/1.0" 404 -
localhost - - [26/Aug/2000 03:02:03] "GET /hotlist.htm HTTP/1.0" 200 -
Of course, our simple little Web server is so simple, it cannot even process plain text files. We leave that as an exercise for you to undertake (see
Exercise 9-10 at the end of this chapter).
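If you are working in Python 3, the same handler can be written against the http.server module. The version below is a sketch under that assumption rather than a listing from this book: it reads the file as bytes (because wfile expects bytes in Python 3), and the start_server() helper, its default loopback address, and its use of a background thread are our own hypothetical additions that make the server easy to try out without root access to port 80.

```python
# A Python 3 sketch of myhttpd.py (assumes the http.server module).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Strip the leading '/' and read the file as bytes,
            # because wfile expects bytes under Python 3.
            with open(self.path[1:], 'rb') as f:
                body = f.read()
        except IOError:
            # Missing or unreadable file: report 404 and stop.
            self.send_error(404, 'File Not Found: %s' % self.path)
            return
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write(body)


def start_server(port=0, host='127.0.0.1'):
    """Start the server in a background thread (hypothetical helper).

    Port 0 asks the operating system for any free port; the port
    actually chosen is available as server.server_address[1].
    """
    server = HTTPServer((host, port), MyHandler)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    return server
```

Calling start_server(8000) from the directory that holds your .html files and then visiting http://127.0.0.1:8000/index.html should return the page; server.shutdown() stops the loop again.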
More Power, Less Code: A Simple CGI Web Server
The previous example is also weak in that it cannot process CGI requests.
BaseHTTPServer is as basic as it gets. One step higher, we have the
SimpleHTTPServer. It provides the do_HEAD() and do_GET() methods on
your behalf, so you don’t have to create either, as we did with
BaseHTTPServer.
The highest-level (take that with a grain of salt) server provided in the
standard library is CGIHTTPServer. In addition to do_HEAD() and do_GET(),
it defines do_POST(), with which you can process form data. Because of
these amenities, a CGI-capable development server can be created with
just two real lines of code (so short we’re not even bothering making it a
code example in this chapter, because you can just recreate it by typing it
up on your computer now):
#!/usr/bin/env python
import CGIHTTPServer
CGIHTTPServer.test()
Note that we left off the check to quit the server by using Ctrl+C and
other fancy output, taking whatever the CGIHTTPServer.test() function
gives us, which is a lot. You start the server by just invoking it from your
shell. Below is an example of running this code on a PC—it’s quite similar
to what you’ll experience on a POSIX machine:
C:\py>python cgihttpd.py
Serving HTTP on 0.0.0.0 port 8000 ...
It starts a server by default on port 8000, but you can change that at runtime by providing a port number as a command-line argument:
C:\py>python cgihttpd.py 8080
Serving HTTP on 0.0.0.0 port 8080 ...
To test it out, just make sure that a cgi-bin folder exists (with some CGI
Python scripts) at the same level as the script. There’s no point in setting
up Apache, setting CGI handler prefixes, and all that extra stuff when you
just want to test a simple script. We’ll show you how to write CGI scripts
in Chapter 10, “Web Programming: CGI and WSGI,” as well as tell you
why you should avoid doing so.
As you can see, it doesn’t take much to have a Web server up and running in pure Python. Again, you shouldn’t be writing servers all the time.
Generally you’re creating Web applications that run on Web servers. These
server modules are meant only to create servers that are useful during
development, regardless of whether you develop applications or Web
frameworks.
In production, your live service will instead use production-worthy
servers such as Apache, lighttpd, or any of the others listed
at the beginning of this section. However, we hope this section has
shown you that complex tasks can be simplified with the power that Python gives you.
9.5 Related Modules
In Table 9-7, we present a list of modules, some of which are covered in this
chapter (and others not), that you might find useful for Web development.
Table 9-7 Web Programming Related Modules

Module/Package              Description

Web Applications
cgi                         Retrieves CGI form data
cgitb (c)                   Handles CGI tracebacks
htmllib                     Older HTML parser for simple HTML files; HTMLParser
                            class extends from sgmllib.SGMLParser
HTMLParser (c)              Newer, non-SGML-based parser for HTML and XHTML
htmlentitydefs              HTML general entity definitions
Cookie                      Server-side cookies for HTTP state management
cookielib (e)               Cookie-handling classes for HTTP clients
webbrowser (b)              Controller: launches Web documents in a browser
sgmllib                     Parses simple SGML files
robotparser (a)             Parses robots.txt files for URL “fetchability” analysis
httplib (a)                 Used to create HTTP clients
urllib                      Access servers via URL, other URL-related utilities;
                            urllib.urlopen() replaced by urllib2.urlopen() in
                            Python 3 as urllib.request.urlopen()
urllib2; urllib.request (g),
urllib.error (g)            Classes and functions to open (real-world) URLs;
                            broken up into the second two subpackages in Python 3
urlparse, urllib.parse (g)  Utilities for parsing URL strings; renamed as
                            urllib.parse in Python 3

XML Processing
xmllib                      Original simple XML parser (outdated/deprecated)
xml (b)                     XML package featuring various parsers (some following)
xml.sax (b)                 Simple API for XML (SAX) SAX2-compliant XML parser
xml.dom (b)                 Document Object Model (DOM) XML parser
xml.etree (f)               Tree-oriented XML parser based on the Element
                            flexible container object
xml.parsers.expat (b)       Interface to the non-validating Expat XML parser
xmlrpclib (c)               Client support for XML Remote Procedure Call (RPC)
                            via HTTP
SimpleXMLRPCServer (c)      Basic framework for Python XML-RPC servers
DocXMLRPCServer (d)         Framework for self-documenting XML-RPC servers

Web Servers
BaseHTTPServer              Abstract class with which to develop Web servers
SimpleHTTPServer            Serve the simplest HTTP requests (HEAD and GET)
CGIHTTPServer               In addition to serving Web files such as
                            SimpleHTTPServers, can also process CGI (HTTP POST)
                            requests
http.server (g)             New name for the combined package merging together
                            BaseHTTPServer, SimpleHTTPServer, and CGIHTTPServer
                            modules in Python 3
wsgiref (f)                 Package defining a standard interface between Web
                            servers and Web applications

Third-Party Packages (not in standard library)
HTMLgen                     CGI helper converts Python objects into valid HTML
                            http://starship.python.net/crew/friedrich/HTMLgen/html/main.html
BeautifulSoup               HTML and XML parser and screen-scraper
                            http://crummy.com/software/BeautifulSoup
Mechanize                   Web-browsing package based on WWW::Mechanize
                            http://wwwsearch.sourceforge.net/mechanize/

a. New in Python 1.6.
b. New in Python 2.0.
c. New in Python 2.2.
d. New in Python 2.3.
e. New in Python 2.4.
f. New in Python 2.5.
g. New in Python 3.0.
9.6 Exercises
9-1. urllib Module. Write a program that takes a user-input URL
(either a Web page or an FTP file such as http://python.org or
ftp://ftp.python.org/pub/python/README), and downloads
it to your computer with the same filename (or modified
name similar to the original if it is invalid on your system).
Web pages (HTTP) should be saved as .htm or .html files, and
FTP’d files should retain their extension.
9-2. urllib Module. Rewrite the grabWeb.py script of Example 11-4
of Core Python Programming or Core Python Language Fundamentals, which downloads a Web page and displays the first
and last non-blank lines of the resulting HTML file so that
you use urlopen() instead of urlretrieve() to process the
data directly (as opposed to downloading the entire file first
before processing it).
9-3. URLs and Regular Expressions. Your browser can save your
favorite Web site URLs as a bookmarks HTML file (Mozilla-flavored browsers do this) or as a set of .url files in a “favorites” directory (Internet Explorer does this). Find your
browser’s method of recording your “hot links” and the location of where and how they are stored. Without altering any
of the files, strip the URLs and names of the corresponding
Web sites (if given) and produce a two-column list of names
and links as output, and then store this data into a disk file.
Truncate site names or URLs to keep each line of output
within 80 characters in length.
9-4. URLs, urllib Module, Exceptions, and Regular Expressions. As
a follow-up problem to Exercise 9-3, add code to your script
to test each of your favorite links. Report back a list of dead
links (and their names) such as Web sites that are no longer
active or a Web page that has been removed. Only output
and save to disk the still-valid links.
Exercises 9-5 to 9-8 below pertain to Web server access log files and regular
expressions. Web servers (and their administrators) generally have to maintain an access log file (usually logs/access_log from the main Web
server directory), which tracks requests. Over a period of time, such files
become large and either need to be stored or truncated. Why not save only
the pertinent information and delete the files to conserve disk space? The
exercises below are designed to give you some practice with regular expressions and how they can be used to help archive and analyze Web server
data.
9-5. Count how many of each type of request (GET versus POST)
exist in the log file.
9-6. Count the successful page/data downloads. Display all links
that resulted in a return code of 200 (OK [no error]) and how
many times each link was accessed.
9-7. Count the errors: Show all links that resulted in errors
(return codes in the 400s or 500s) and how many times each
link was accessed.
9-8. Track IP addresses: for each IP address, output a list of each
page/data downloaded and how many times that link was
accessed.
9-9. Web Browser Cookies and Web Site Registration. The user login
registration database you worked on in various chapters
(7, 9, 13) of Core Python Programming or Core Python Language
Fundamentals had you creating a pure text-based, menu-driven script. Port it to the Web so that your user-password
information now serves as the site’s authentication system.
Extra Credit: Familiarize yourself with setting Web browser
cookies and maintain a login session for four hours from the
last successful login.
9-10. Creating Web Servers. Our code for myhttpd.py (Example 9-6)
is only able to read HTML files and return them to the calling
client. Add support for plain text files with the .txt ending.
Be sure that you return the correct MIME type of “text/plain.”
Extra Credit: Add support for JPEG files ending with either
.jpg or .jpeg and having a MIME type of “image/jpeg.”
Exercises 9-11 through 9-14 require you to update Example 9-3, crawl.py,
the Web crawler.
9-11. Web Clients. Port crawl.py so that it uses either HTMLParser,
BeautifulSoup, html5lib, or lxml parsing systems.
9-12. Web Clients. URLs given as input to crawl.py must have the
leading “http://” protocol indicator and top-level URLs must
contain a trailing slash, for example, http://www.prenhallprofessional.com/. Make crawl.py more robust by allowing
the user to input just the hostname (without the protocol part
[make it assume HTTP]) and also make the trailing slash
optional. For example, www.prenhallprofessional.com should
now be acceptable input.
9-13. Web Clients. Update the crawl.py script to also download
links that use the ftp: scheme. All mailto: links are ignored
by crawl.py. Add support to ensure that it also ignores
telnet:, news:, gopher:, and about: links.
9-14. Web Clients. The crawl.py script only downloads .html files
via links found in Web pages at the same site and does not
handle/save images that are also valid “files” for those pages.
It also does not handle servers that are susceptible to URLs
that are missing the trailing slash (/). Add a pair of classes to
crawl.py to deal with these problems.
A My404UrlOpener class should subclass urllib.FancyURLopener
and consist of a single method, http_error_404(),
which determines if a 404 error was reached
because of a URL without a trailing slash. If so, it adds the
slash and retries the request again (and only once). If it still
fails, return a real 404 error. You must set urllib._urlopener
with an instance of this class so that urllib uses it.
Create another class called LinkImageParser, which derives
from htmllib.HTMLParser. This class should contain a constructor to call the base class constructor as well as initialize a
list for the image files parsed from Web pages. The handle_image()
method should be overridden to add image filenames to the image list (instead of discarding them like the
current base class method does).
The final set of exercises pertains to the parse_links.py file, shown earlier in
this chapter as Example 9-4.
9-15. Command-line Arguments. Add command-line arguments to
let the user see output from one or more parsers (instead of
just all of them [which could be the default]).
9-16. lxml Parser. Download and install lxml, and then add support for lxml to parse_links.py.
9-17. Markup Parsers. Substitute each of the following parsers into the crawler, replacing
htmllib.HTMLParser.
a) HTMLParser.HTMLParser
b) html5lib
c) BeautifulSoup
d) lxml
9-18. Refactoring. Change the output() function to be able to support other forms of output.
a) Writing to a file
b) Sending to another process (i.e., writing to a socket)
9-19. Pythonic Coding. In the Line-by-Line Explanation of
parse_links.py, we expanded simpleBS() from a less-readable one-liner to a block of properly formatted
Python code. Do the same thing with fasterBS() and
html5libparse().
9-20. Performance and Profiling. Earlier, we described how fasterBS()
performs better than simpleBS(). Use timeit to show it runs
faster, and then find a Python memory tool online to show it
saves memory. Describe what the memory profiler tool is
and where you found it. Do any of the three standard library
profilers (profile, hotshot, cProfile) show memory usage
information?
9-21. Best Practices. In htmlparser(), suppose that we didn’t like
the thought of having to create a blank list and having to call
its append() method repeatedly to build the list; instead, you
wanted to use a list comprehension to replace lines 35–39
with the following single line of code:
self.data = [v for k, v in attrs if k == 'href']
Is this a valid substitution? In other words, could we make
this change and still have it all execute correctly? Why (or
why not)?
9-22. Data Manipulation. In parse_links.py, we sort the URLs
alphabetically (actually lexicographically). However, this
might not be the best way to organize links:
http://python.org/psf/
http://python.org/search
http://roundup.sourceforge.net/
http://sourceforge.net/projects/mysql-python
http://twistedmatrix.com/trac/
http://wiki.python.org/moin/
http://wiki.python.org/moin/CgiScripts
http://www.python.org/
Instead, a sort by domain name might make more sense:
http://python.org/psf/
http://python.org/search
http://wiki.python.org/moin/
http://wiki.python.org/moin/CgiScripts
http://www.python.org/
http://roundup.sourceforge.net/
http://sourceforge.net/projects/mysql-python
http://twistedmatrix.com/trac/
Give your script the ability to sort by domain in addition to the alpha/
lexicographic sort.
CHAPTER 10
Web Programming: CGI and WSGI
[The] benefits of WSGI are primarily for Web framework authors
and Web server authors, not Web application authors. This is
not an application API, it’s a framework-to-server glue API.
—Phillip J. Eby, August 2004
In this chapter...
• Introduction
• Helping Web Servers Process Client Data
• Building CGI Applications
• Using Unicode with CGI
• Advanced CGI
• Introduction to WSGI
• Real-World Web Development
• Related Modules
10.1 Introduction
This introductory chapter on Web programming will give you a quick and
broad overview of the kinds of things you can do with Python on the Internet, from Web surfing to creating user feedback forms, from recognizing
URLs to generating dynamic Web page output. We’ll first explore the common gateway interface (CGI), and then discuss the Web server gateway interface (WSGI).
10.2 Helping Web Servers Process Client Data
In this section, we’ll introduce you to CGI, what it means, why it exists,
and how it works in relation to Web servers. We’ll then show you how to
use Python to create CGI applications.
10.2.1 Introduction to CGI
The Web was initially developed to be a global online repository or archive
of documents (mostly educational and research-oriented). Such pieces of
information generally come in the form of static text and usually in HTML.
HTML is not as much a language as it is a text formatter, indicating
changes in font types, sizes, and styles. The main feature of HTML is in its
hypertext capability. This refers to the ability to designate certain text
(usually highlighted in some fashion) or even graphic elements as links
that point to other “documents” or locations on the Internet and Web that
are related in context to the original. Such a document can be accessed by a
simple mouse click or other user selection mechanism. These (static) HTML
documents live on the Web server and are sent to clients when requested.
As the Internet and Web services evolved, there grew a need to process
user input. Online retailers needed to be able to take individual orders,
and online banks and search engine portals needed to create accounts for
individual users. Thus fill-out forms were invented; they were the only
way a Web site could get specific information from users (until Java
applets came along). This, in turn, required that the HTML be generated
on the fly, for each client submitting user-specific data.
But Web servers are only really good at one thing: getting a user
request for a file and returning that file (e.g., an HTML file) to the client.
They do not have the “brains” to be able to deal with user-specific data
such as that which comes from form fields. Given this is not their responsibility,
Web servers farm out such requests to external applications which create
the dynamically generated HTML that is returned to the client.
The entire process begins when the Web server receives a client request
(i.e., GET or POST) and calls the appropriate application. It then waits for
the resulting HTML—meanwhile, the client also waits. Once the application has completed, it passes the dynamically generated HTML back to the
server, which then (finally) forwards it back to the user. This process of
the server receiving a form, contacting an external application, and receiving and returning the HTML takes place through the CGI. An overview of
how CGI works is presented in Figure 10-1, which shows you the execution
and data flow, step-by-step, from when a user submits a form until the
resulting Web page is returned.
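[Figure 10-1 diagram: a user at a Web browser (client), a Web server, and a CGI application. (1) The user submits the completed form to the Web server; (2) the server calls the CGI application; (3) the CGI program's response goes back to the server; (4) the server forwards the CGI program's response to the client.]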
Figure 10-1 Overview of how CGI works. CGI represents the interaction between a Web server
and the application that is required to process a user’s form and generate the dynamic HTML that
is eventually returned.
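To make the server's side of this exchange concrete, here is a minimal sketch (assuming Python 3) of how a server might hand a request to an external application and collect the result. It is not how any particular Web server is implemented. The CGI_APP script and the call_cgi() function are hypothetical stand-ins; the one thing taken from the CGI convention itself is that request details such as REQUEST_METHOD and QUERY_STRING travel in environment variables and the response comes back on the application's standard output.

```python
import os
import subprocess
import sys

# Stand-in CGI application: reads the query string from its environment
# and writes a header, a blank line, and some HTML to standard output.
CGI_APP = r'''
import html
import os
query = html.escape(os.environ.get("QUERY_STRING", ""))
print("Content-Type: text/html")
print()
print("<html><body>Query: %s</body></html>" % query)
'''


def call_cgi(query_string):
    """Run the stand-in CGI application the way a server might.

    The "server" passes request details in environment variables,
    waits for the child process to finish, and captures its output.
    """
    env = dict(os.environ)
    env['REQUEST_METHOD'] = 'GET'
    env['QUERY_STRING'] = query_string
    output = subprocess.check_output(
        [sys.executable, '-c', CGI_APP], env=env)
    return output.decode('ascii')
```

Calling call_cgi('name=wesley') returns the complete response text, which a real server would then forward to the waiting client. Note that the call blocks until the child process exits, which mirrors the waiting described above.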
Forms input on the client and sent to a Web server can include processing and perhaps some form of storage in a back-end database. Just keep in
mind that any time a Web page contains items that require user input (text
fields, radio buttons, etc.) and/or a Submit button or image, it most likely
involves some sort of CGI activity.
CGI applications that create the HTML are usually written in one of
many higher-level programming languages that have the ability to accept
user data, process it, and then return HTML back to the server. Before we
take a look at CGI, we have to issue the caveat that the typical production
Web application is no longer being implemented in CGI.
Because of its significant limitations, chiefly its limited ability to let Web
servers process a large number of simultaneous clients, CGI is
a dinosaur. Mission-critical Web services rely on compiled languages like
C/C++ to scale. A modern-day Web server is typically composed of Apache
and integrated components for database access (MySQL or PostgreSQL),
Java (Tomcat), PHP, and various modules for dynamic languages such as
Python or Ruby, and secure sockets layer (SSL)/security. However, if you
are working on small personal Web sites or those of small organizations
and do not need the power and complexity required by mission-critical
Web services, CGI is a quick way to get started. It can also be used for testing.
Furthermore, there are a good number of Web application development
frameworks out there as well as content management systems, all of
which make building CGI applications a relic of the past. However, beneath all the fluff
and abstraction, they must still, in the end, follow the same model that
CGI originally provided, and that is being able to take user input, execute
code based on that input, and then provide valid HTML as its final output
for the client. Therefore, the exercise in learning CGI is well worth it in
terms of understanding the fundamentals required to develop effective
Web services.
In this next section, we will look at how to create CGI applications in
Python, with the help of the cgi module.
10.2.2 CGI Applications
A CGI application is slightly different from a typical program. The primary differences are in the input, output, and user interaction aspects of a
computer program. When a CGI script starts, it needs to retrieve the user-supplied form data, but it has to obtain this data from the Web client, not a
user on the server computer or a disk file. This is usually known as the
request.
The output differs in that any data sent to standard output will be sent
back to the connected Web client rather than to the screen, GUI window, or
disk file. This is known as the response. The data sent back must be a set of
valid headers followed by HTML-tagged data. If it is not and the Web
client is a browser, an error (specifically, an Internal Server Error) will
occur because Web clients understand only valid HTTP data (i.e., MIME
headers and HTML).
Finally, as you can probably guess, there is no user interaction with the
script. All communication occurs among the Web client (on behalf of a
user), the Web server, and the CGI application.
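The request and response flow just described can be sketched in a few lines of Python 3. This is only an illustration, not one of the chapter's examples: it assumes the Web server hands a GET query string to the script in the QUERY_STRING environment variable, and the function name respond() and the sample field values are hypothetical.

```python
import io
import os
import sys
from urllib.parse import parse_qs


def respond(environ, out):
    """Read the request from a CGI-style environment and write the response."""
    # Input: a GET form arrives in QUERY_STRING, not from a keyboard or file.
    fields = parse_qs(environ.get('QUERY_STRING', ''))
    person = fields.get('person', ['NEW USER'])[0]

    # Output: headers, one blank line, then the HTML body.
    out.write('Content-Type: text/html\n\n')
    out.write('<HTML><BODY>Hello, %s</BODY></HTML>\n' % person)


# A real CGI script would call: respond(os.environ, sys.stdout)
demo = io.StringIO()
respond({'QUERY_STRING': 'person=Annalee+Lenday&howmany=25'}, demo)
print(demo.getvalue())
```

Passing the environment and the output stream in as arguments is a design choice made here so that the function can be exercised without a Web server.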
10.2 Helping Web Servers Process Client Data
10.2.3
The cgi Module
There is one primary class in the cgi module that does all the work: the
FieldStorage class. This class reads in all the pertinent user information
from the Web client (via the Web server); thus, it should be instantiated
when a Python CGI script begins. Once it has been instantiated, it will consist of a dictionary-like object that contains a set of key-value pairs. The
keys are the names of the input items that were passed in via the form.
The values contain the corresponding data.
Values can be one of three objects. The first are FieldStorage objects
(instances). The second are instances of a similar class called MiniFieldStorage,
which is used in cases for which no file uploads or multiple-part
form data is involved. MiniFieldStorage instances contain only the key-value pair of the name and the data. Lastly, they can be a list of such
objects. This occurs when a form contains more than one input item with
the same field name.
For simple Web forms, you will usually find all MiniFieldStorage
instances. All of our examples that follow pertain only to this general case.
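The dictionary-like view of form data, including the list case for repeated field names, can be illustrated without a Web server. The sketch below does not use FieldStorage itself; it assumes the standard library's urllib.parse.parse_qs() function (Python 3) as a stand-in that exposes the same idea, and the query string is made up for illustration.

```python
from urllib.parse import parse_qs

# Hypothetical form submission: 'howmany' appears twice, 'person' once.
query = 'person=Annalee&howmany=10&howmany=25'
form = parse_qs(query)

for name in form:
    values = form[name]          # parse_qs always returns a list per key
    if len(values) == 1:
        print('%s -> single value %r' % (name, values[0]))
    else:
        print('%s -> list of values %r' % (name, values))
```

One difference to keep in mind: parse_qs() always returns a list for every key, whereas the description above says a FieldStorage holds a list only when a field name is repeated.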
10.2.4
The cgitb Module
As we mentioned earlier, a valid response back to the Web server (which
would then forward it to the user/browser) must contain valid HTTP
headers and HTML-tagged data. Have you thought about the returned
data if your CGI application crashes? What happens when you run a
Python script that results in an error? That’s right: a traceback occurs.
Would the text of a traceback be considered as valid HTTP headers or
HTML? No.
A Web server receiving a response it doesn’t understand will just throw
up its hands and give up, returning a “500 error.” The 500 is an HTTP
response code that means an internal Web server error has occurred, most
likely from the application that is being executed. The output on the
browser doesn’t aid the developer either, as the screen is either blank or
shows “Internal Server Error,” or something similar.
When our Python programs were running on the command-line or in
an integrated development environment (IDE), errors resulted in a traceback,
upon which we could take action. Not so in the browser. What we really
want is to see the Web application’s traceback on the browser screen, not
“Internal Server Error.” This is where the cgitb module comes in.
To enable a dump of tracebacks, all we need to do is to insert the following import and call in our CGI applications:
import cgitb
cgitb.enable()
You’ll have plenty of opportunity to use cgitb as we explore CGI in the first half of
this chapter. For now, just leave these two lines out as we undertake some
simple examples. First, I want you to see the “Internal Server Error” messages and debug them the hard way. Once you realize how the server’s not
throwing you a bone, you’ll add these two lines religiously, on your own.
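To make the idea concrete, here is a rough Python 3 sketch of the underlying technique: catch the exception and send the traceback back as a well-formed response instead of letting the script die. This is not how cgitb is implemented (its output is far richer); the names run_safely() and broken_app() are hypothetical and exist only for this illustration.

```python
import html
import traceback

HEADER = 'Content-Type: text/html\n\n'


def run_safely(app):
    """Return app()'s output, or an HTML page showing the traceback."""
    try:
        return app()
    except Exception:
        # Escape the traceback so that characters such as < and > in it
        # are displayed rather than interpreted as markup.
        tb = html.escape(traceback.format_exc())
        return HEADER + '<HTML><BODY><PRE>%s</PRE></BODY></HTML>' % tb


def broken_app():
    # Hypothetical CGI body with a deliberate bug.
    return HEADER + '<HTML><BODY>%d</BODY></HTML>' % (1 / 0)


print(run_safely(broken_app))
```

Because the error page starts with a valid header and a blank line, the Web server can forward it to the browser rather than substituting its own “Internal Server Error” page.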
10.3 Building CGI Applications
In this section of the chapter, we go hands-on, showing you how to set up
a Web server, followed by a step-by-step breakdown of how to create a
CGI application in Python. We start with a simple script, then build on it
incrementally. The practices you learn here can be used for developing
applications using any Web framework.
10.3.1
Setting Up a Web Server
To experiment with CGI development in Python, you need to first install a
Web server, configure it for handling Python CGI requests, and then give
the Web server access to your CGI scripts. Some of these tasks might
require assistance from your system administrator.
Production Servers
If you want a real Web server, you will likely download and install
Apache, lighttpd, or thttpd. For Apache, there are various plug-ins or
modules for handling Python CGI, but they are not required for our examples. You might want to install those if you are planning on “going live” to
the world with your service. But even this might be overkill.
Developer Servers
For learning purposes or for simple Web sites, it might suffice to use
the Web servers that come with Python. In Chapter 9, “Web Clients and
Servers,” you were exposed to creating and configuring simple Python-based Web servers. Our examples in this chapter are simpler and use only
Python’s CGI Web server.
If you want to start up this most basic Web server, execute it directly in
Python 2.x, as follows:
$ python -m CGIHTTPServer [port]
This won’t work as easily in Python 3 because all three Web servers and
their handlers have been merged into a single module (http.server), with
one base server and three request handler classes (BaseHTTPRequestHandler,
SimpleHTTPRequestHandler, and CGIHTTPRequestHandler).
If you don’t provide the optional port number for the server, it starts at
port 8000 by default. Also, the -m option is new in version 2.4. If you are
using an older version of Python or want to see alternative ways of running it, here are your options:
• Executing the module from a command shell
This method is somewhat troublesome because you need to
know where the CGIHTTPServer.py file is physically located.
On Windows-based PCs, this is easier because the typical
installation folder is C:\Python2X:
C:\>python C:\Python27\Lib\CGIHTTPServer.py
Serving HTTP on 0.0.0.0 port 8000 ...
On POSIX systems, you need to do a bit more sleuthing:
>>> import sys, CGIHTTPServer
>>> sys.modules['CGIHTTPServer']
<module 'CGIHTTPServer' from '/usr/local/lib/python2.7/
CGIHTTPServer.py'>
>>>^D
$ python /usr/local/lib/python2.7/CGIHTTPServer.py
Serving HTTP on 0.0.0.0 port 8000 ...
• Use the -c option
Using the -c option, you can run a string consisting of Python
statements. Therefore, to import CGIHTTPServer and execute the
test() function, use the following:
$ python -c "import CGIHTTPServer; CGIHTTPServer.test()"
Serving HTTP on 0.0.0.0 port 8000 ...
Because CGIHTTPServer is merged into http.server in version 3.x,
you can issue the equivalent call (by using, for example,
Python 3.2) as the following:
$ python3.2 -c "from http.server import CGIHTTPRequestHandler,test;test(CGIHTTPRequestHandler)"
• Create a quick script
Take the import and test() call from the previous option and
insert them into an arbitrary file, say cgihttpd.py (Python 2
or 3). For Python 3.2, because there is no CGIHTTPServer.py
module to execute, the only way to get your server to start
from the command-line on a port other than 8000 is to use this
script (Python 3.4 and newer also accept python3 -m http.server --cgi 8080):
$ python3.2 cgihttpd.py 8080
Serving HTTP on 0.0.0.0 port 8080 ...
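For reference, a quick script along these lines might look like the following in Python 3. It is a sketch rather than the book's cgihttpd.py: instead of relying on test() to read the port, it assumes we parse the command-line argument ourselves, and it assumes a Python 3 release whose http.server module still provides CGIHTTPRequestHandler. The helper names make_server() and main() are hypothetical.

```python
import sys
from http.server import HTTPServer, CGIHTTPRequestHandler


def make_server(port=8000, host=''):
    """Create (but do not start) a CGI-capable development server."""
    return HTTPServer((host, port), CGIHTTPRequestHandler)


def main(argv):
    # Use the first command-line argument as the port, if one was given.
    port = int(argv[1]) if len(argv) > 1 else 8000
    server = make_server(port)
    print('Serving HTTP on port %d ...' % server.server_address[1])
    server.serve_forever()

# To run this file as a script, end it with: main(sys.argv)
```

Separating make_server() from main() keeps the blocking serve_forever() call out of the way when you only want to check that the server can be created.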
Any of these four techniques will start a Web server on port 8000 (or
whatever you chose) on your current computer from the current directory.
Then you can just create a cgi-bin directory right under the directory from
which you started the server and put your Python CGI scripts there. Put
some HTML files in that directory and perhaps some .py CGI scripts in
cgi-bin, and you are ready to “surf” directly to this Web site with
addresses looking something like these:
http://localhost:8000/friends.htm
http://localhost:8080/cgi-bin/friendsB.py
Be sure to start up your server where there is a cgi-bin directory and
ensure that your .py files are there; otherwise, the development server will
return your Python files as static text rather than executing them.
10.3.2
Creating the Form Page
In Example 10-1, we present the code for a simple Web form, friends.htm.
As you can see in the HTML, the form contains two input variables: person
and howmany. The values of these two fields will be passed to our CGI
script, friendsA.py.
You will notice in our example that we install our CGI script into the
default cgi-bin directory (see the ACTION link) on the local host. (If this
information does not correspond with your development environment,
update the form action before attempting to test the Web page and CGI
script.) Also, because a METHOD attribute is missing from the FORM tag,
all requests will be of the default type, GET. We choose the GET method
because we do not have very many form fields, and also, we want our
query string to show up in the Location (a.k.a. “Address,” “Go To”) bar so
that you can see what URL is sent to the server.
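To preview what such a URL looks like, the following Python 3 sketch encodes the two form fields the same way a browser does for a GET request. It assumes the default values from friends.htm and uses urllib.parse.urlencode() from the standard library.

```python
from urllib.parse import urlencode

# Field names and default values taken from friends.htm.
fields = {'person': 'NEW USER', 'howmany': '0'}
action = '/cgi-bin/friendsA.py'

# A GET request appends the encoded fields to the action URL after '?'.
url = action + '?' + urlencode(fields)
print(url)
```

Note how the space in the name is encoded as a plus sign and how the fields are joined with an ampersand.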
Example 10-1
Static Form Web Page (friends.htm)
This HTML file presents a form to the user with an empty field for the user’s
name and a set of radio buttons from which the user can choose.
<HTML><HEAD><TITLE>
Friends CGI Demo (static screen)
</TITLE></HEAD>
<BODY><H3>Friends list for: <I>NEW USER</I></H3>
<FORM ACTION="/cgi-bin/friendsA.py">
<B>Enter your Name:</B>
<INPUT TYPE=text NAME=person VALUE="NEW USER" SIZE=15>
<P><B>How many friends do you have?</B>
<INPUT TYPE=radio NAME=howmany VALUE="0" CHECKED> 0
<INPUT TYPE=radio NAME=howmany VALUE="10"> 10
<INPUT TYPE=radio NAME=howmany VALUE="25"> 25
<INPUT TYPE=radio NAME=howmany VALUE="50"> 50
<INPUT TYPE=radio NAME=howmany VALUE="100"> 100
<P><INPUT TYPE=submit></FORM></BODY></HTML>
Figures 10-2 and 10-3 show the screen that is rendered by friends.htm
in clients running on both Mac and Windows.
Figure 10-2 The Friends form page in Chrome “incognito mode,” on Mac OS X.
Figure 10-3 The Friends form page in Firefox 6 on Windows.
10.3.3
Generating the Results Page
The form input is submitted when the user clicks the Submit button. (Alternatively, the user can also press the Return or Enter key within the text field
to invoke the same action.) When this occurs, the script in Example 10-2,
friendsA.py, is executed via CGI.
Example 10-2
Results Screen CGI code (friendsA.py)
This CGI script grabs the person and howmany fields from the form and uses that
data to create the dynamically generated results screen. Add parentheses to the
print statement on line 17 for the Python 3 version, friendsA3.py (not
displayed here). Both are available at corepython.com.
1   #!/usr/bin/env python
2
3   import cgi
4
5   reshtml = '''Content-Type: text/html\n
6   <HTML><HEAD><TITLE>
7   Friends CGI Demo (dynamic screen)
8   </TITLE></HEAD>
9   <BODY><H3>Friends list for: <I>%s</I></H3>
10  Your name is: <B>%s</B><P>
11  You have <B>%s</B> friends.
12  </BODY></HTML>'''
13
14  form = cgi.FieldStorage()
15  who = form['person'].value
16  howmany = form['howmany'].value
17  print reshtml % (who, who, howmany)
This script contains all the programming power to read the form input
and process it as well as return the resulting HTML page back to the user.
All the “real” work in this script takes place in only four lines of Python
code (lines 14–17).
The form variable is our FieldStorage instance, containing the values
of the person and howmany fields. We read these into the Python who and
howmany variables, respectively. The reshtml variable contains the general
body of HTML text to return, with a few fields filled in dynamically, using
the data just read in from the form.
CORE TIP: HTTP headers separate from HTML
Here’s something that always catches beginners: when sending results back via
a CGI script, the CGI script must return the appropriate HTTP headers first
before any HTML. Furthermore, to distinguish between these headers and the
resulting HTML, there must be one blank line (a pair of NEWLINE characters)
inserted between both sets of data, as in line 5 of our friendsA.py example (one
explicit \n plus the implicit one at the end of line 5). You’ll notice this in the
other examples, too.
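The blank-line rule in this tip is easy to check for yourself. The following sketch (Python 3, with a made-up one-line body) builds a response the way friendsA.py does and then splits it at the first blank line, which is how the boundary between headers and HTML is found.

```python
header = 'Content-Type: text/html'
body = '<HTML><BODY>Hello</BODY></HTML>'

# One newline ends the header line; the second produces the blank line.
response = header + '\n\n' + body

# The response is split at the first blank line.
head, _, rest = response.partition('\n\n')
print(repr(head))
print(repr(rest))

# Forgetting the blank line leaves no boundary to split on.
bad = header + '\n' + body
print('\n\n' in bad)
```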
One possible resulting screen appears in Figure 10-4 (assuming the
user typed in “Annalee Lenday” as the name and clicked the “25 friends”
radio button).
If you are a Web site producer, you might be thinking, “Gee, wouldn’t it
be nice if I could automatically capitalize this person’s name, especially if
she forgot?” With Python CGI, you can accomplish this easily. (And we
shall do so soon!)
Figure 10-4 The Friends results page after the name and number of friends has been submitted.
Notice how, on a GET request, our form variables and their values
are added to the form action URL in the Address bar. Also, did you
observe that the title for the friends.htm page has the word “static” in it,
whereas the output screen from friendsA.py has the word “dynamic” in
its title? We did that for a reason: to indicate that the friends.htm file is a
static text file while the results page is dynamically generated. In other
words, the HTML for the results page did not exist on disk as a text file;
rather, it was generated by our CGI script, which returned it as if it were a
local file.
In our next example, we bypass static files altogether by updating our
CGI script to be somewhat more multifaceted.
10.3.4
Generating Form and Results Pages
We retire friends.htm and merge it into friendsB.py. The script will
now generate both the form page as well as the results page. But how can
we tell which page to generate? Well, if there is form data being sent to us,
that means that we should be creating a results page. If we do not get any
information at all, that tells us that we should generate a form page for
the user to enter his data. Our new friendsB.py script is presented in
Example 10-3.
Example 10-3
Generating Form and Results Pages (friendsB.py)
Both friends.htm and friendsA.py are merged into friendsB.py. The
resulting script can now output both form and results pages as dynamically
generated HTML and has the smarts to know which page to output. To port this
to the Python 3 version, friendsB3.py, you need to add parentheses to both
print statements and change the form action to friendsB3.py.
1   #!/usr/bin/env python
2
3   import cgi
4
5   header = 'Content-Type: text/html\n\n'
6
7   formhtml = '''<HTML><HEAD><TITLE>
8   Friends CGI Demo</TITLE></HEAD>
9   <BODY><H3>Friends list for: <I>NEW USER</I></H3>
10  <FORM ACTION="/cgi-bin/friendsB.py">
11  <B>Enter your Name:</B>
12  <INPUT TYPE=hidden NAME=action VALUE=edit>
13  <INPUT TYPE=text NAME=person VALUE="NEW USER" SIZE=15>
14  <P><B>How many friends do you have?</B>
15  %s
16  <P><INPUT TYPE=submit></FORM></BODY></HTML>'''
17
18  fradio = '<INPUT TYPE=radio NAME=howmany VALUE="%s" %s> %s\n'
19
20  def showForm():
21      friends = []
22      for i in (0, 10, 25, 50, 100):
23          checked = ''
24          if i == 0:
25              checked = 'CHECKED'
26          friends.append(fradio % (str(i), checked, str(i)))
27
28      print '%s%s' % (header, formhtml % ''.join(friends))
29
30  reshtml = '''<HTML><HEAD><TITLE>
31  Friends CGI Demo</TITLE></HEAD>
32  <BODY><H3>Friends list for: <I>%s</I></H3>
33  Your name is: <B>%s</B><P>
34  You have <B>%s</B> friends.
35  </BODY></HTML>'''
36
37  def doResults(who, howmany):
38      print header + reshtml % (who, who, howmany)
39
40  def process():
41      form = cgi.FieldStorage()
42      if 'person' in form:
43          who = form['person'].value
44      else:
45          who = 'NEW USER'
46
47      if 'howmany' in form:
48          howmany = form['howmany'].value
49      else:
50          howmany = 0
51
52      if 'action' in form:
53          doResults(who, howmany)
54      else:
55          showForm()
56
57  if __name__ == '__main__':
58      process()
Line-by-Line Explanation
Lines 1–5
In addition to the usual startup and module import lines, we separate the
HTTP MIME header from the rest of the HTML body because we will use
it for both types of pages (form page and results page) returned and we
don’t want to duplicate the text. We will add this header string to the corresponding HTML body when it’s time for output to occur.
Lines 7–28
All of this code is related to the now-integrated friends.htm form page in
our CGI script. We have a variable for the form page text, formhtml, and
we also have a string to build the list of radio buttons, fradio. We could
have duplicated this radio button HTML text as it is in friends.htm, but
we wanted to show how we could use Python to generate more dynamic
output—see the for loop in lines 22–26.
The showForm() function has the responsibility of generating a form for
user input. It builds a set of text for the radio buttons, merges those lines of
HTML into the main body of formhtml, prepends the header to the form,
and then returns the entire collection of data back to the client by sending
the entire string to standard output.
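The radio-button generation step can be isolated into a small helper, sketched here so that it runs under Python 3 as well. The template string is the one from friendsB.py; the function name radio_buttons() and its selected parameter are hypothetical additions for illustration.

```python
FRADIO = '<INPUT TYPE=radio NAME=howmany VALUE="%s" %s> %s\n'


def radio_buttons(choices, selected):
    """Return the HTML for one radio button per choice."""
    lines = []
    for choice in choices:
        # Mark exactly the button whose value matches 'selected'.
        checked = 'CHECKED' if choice == selected else ''
        lines.append(FRADIO % (choice, checked, choice))
    return ''.join(lines)


print(radio_buttons((0, 10, 25, 50, 100), 0))
```

Making the selected value a parameter means the same loop can preselect any button, not just the first one.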
There are a couple of interesting things to note about this code. The first
is the “hidden” variable in the form called action, containing the value
edit on line 12. This field is the only way we can tell which screen to display (i.e., the form page or the results page). We will see this field come
into play in lines 52–55.
Also, observe that we set the 0 radio button as the default by “checking”
it within the loop that generates all the buttons. This will also allow us to
update the layout of the radio buttons and/or their values on a single line
of code (line 18) rather than over multiple lines of text. It will also offer
some more flexibility in letting the logic determine which radio button is
checked; see the next update to our script, friendsC.py, coming up.
Now you might be thinking, “Why do we need an action variable when
I could just as well be checking for the presence of person or howmany?”
That is a valid question, because yes, you could have just used person or
howmany in this situation.
However, the action variable is more conspicuous, both in its name and
in what it does, so the code is easier to understand. The
person and howmany variables are used for their values, whereas the action
variable is used as a flag.
The other reason for creating action is that we will be using it again to
help us determine which page to generate. In particular, we will need to
display a form when a person variable is present (rather than a results
page). Code that relies solely on the presence of a
person variable would break in that case.
Lines 30–38
The code to display the results page is practically identical to that of
friendsA.py.
Lines 40–55
Because there are different pages that can result from this one script, we
created an overall process() function to get the form data and decide
which action to take. The main portion of process() will also look similar
to the main body of code in friendsA.py. There are two major differences,
however.
Because the script might or might not be getting the expected fields
(invoking the script the first time to generate a form page, for example,
will not pass any fields to the server), we need to “bracket” our retrieval of
the form fields with if statements to check if they are even there. Also, we
mentioned the action field above, which helps us decide which page to
bring up. The code that performs this determination is in lines 52–55.
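The decision logic can be tried out without a Web server by substituting a plain dictionary for the FieldStorage instance. The sketch below makes that assumption; choose_page() is a hypothetical name, and it returns a tuple describing the page instead of printing HTML.

```python
def choose_page(form):
    """Decide which page to generate from a dict of submitted fields."""
    # Missing fields fall back to defaults instead of raising KeyError.
    who = form.get('person', 'NEW USER')
    howmany = form.get('howmany', 0)

    # The hidden 'action' field is the flag that marks a submitted form.
    if 'action' in form:
        return ('results', who, howmany)
    return ('form', who, howmany)


print(choose_page({}))
print(choose_page({'action': 'edit', 'person': 'Cynthia Gilbert',
                   'howmany': '50'}))
```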
Figure 10-5 illustrates that the auto-generated form looks identical to
the static form presented in Figure 10-2; however, instead of a link ending
in .htm, it ends in .py. If we enter “Cynthia Gilbert” for the name and
select 50 friends, clicking the Submit button results in what is shown in
Figure 10-6.
Figure 10-5 The autogenerated Friends form page in Chrome on Windows.
Figure 10-6 The Friends results page after submitting the name and friend count.
Note that a static friends.htm does not show up in the URL because
friendsB.py is responsible for both the form and results pages.
10.3.5
Fully Interactive Web Sites
Our final example will complete the circle. As in the past, a user enters her
information from the form page. We then process the data and output a
results page. This time, however, we will add a link to the results page that
will allow the user to go back to the form page, but rather than presenting a
blank form, we will fill in the data that the user has already provided. We
will also add some error processing to give you an example of how it can be
accomplished. The new friendsC.py is shown in Example 10-4.
Example 10-4
Full User Interaction and Error Processing (friendsC.py)
By adding a link to return to the form page with information already provided,
we have come full circle, giving the user a fully interactive Web surfing
experience. Our application also now performs simple error checking, which
notifies the user if no radio button was selected.
1   #!/usr/bin/env python
2
3   import cgi
4   from urllib import quote_plus
5
6   header = 'Content-Type: text/html\n\n'
7   url = '/cgi-bin/friendsC.py'
8
9   errhtml = '''<HTML><HEAD><TITLE>
10  Friends CGI Demo</TITLE></HEAD>
11  <BODY><H3>ERROR</H3>
12  <B>%s</B><P>
13  <FORM><INPUT TYPE=button VALUE=Back
14  ONCLICK="window.history.back()"></FORM>
15  </BODY></HTML>'''
16
17  def showError(error_str):
18      print header + errhtml % error_str
19
20  formhtml = '''<HTML><HEAD><TITLE>
21  Friends CGI Demo</TITLE></HEAD>
22  <BODY><H3>Friends list for: <I>%s</I></H3>
23  <FORM ACTION="%s">
24  <B>Enter your Name:</B>
25  <INPUT TYPE=hidden NAME=action VALUE=edit>
26  <INPUT TYPE=text NAME=person VALUE="%s" SIZE=15>
27  <P><B>How many friends do you have?</B>
28  %s
29  <P><INPUT TYPE=submit></FORM></BODY></HTML>'''
30
31  fradio = '<INPUT TYPE=radio NAME=howmany VALUE="%s" %s> %s\n'
32
33  def showForm(who, howmany):
34      friends = []
35      for i in (0, 10, 25, 50, 100):
36          checked = ''
37          if str(i) == howmany:
38              checked = 'CHECKED'
39          friends.append(fradio % (str(i), checked, str(i)))
40      print '%s%s' % (header, formhtml % (
41          who, url, who, ''.join(friends)))
42
43  reshtml = '''<HTML><HEAD><TITLE>
44  Friends CGI Demo</TITLE></HEAD>
45  <BODY><H3>Friends list for: <I>%s</I></H3>
46  Your name is: <B>%s</B><P>
47  You have <B>%s</B> friends.
48  <P>Click <A HREF="%s">here</A> to edit your data again.
49  </BODY></HTML>'''
50
51  def doResults(who, howmany):
52      newurl = url + '?action=reedit&person=%s&howmany=%s'%\
53          (quote_plus(who), howmany)
54      print header + reshtml % (who, who, howmany, newurl)
55
56  def process():
57      error = ''
58      form = cgi.FieldStorage()
59
60      if 'person' in form:
61          who = form['person'].value.title()
62      else:
63          who = 'NEW USER'
64
65      if 'howmany' in form:
66          howmany = form['howmany'].value
67      else:
68          if 'action' in form and \
69                  form['action'].value == 'edit':
70              error = 'Please select number of friends.'
71          else:
72              howmany = 0
73
74      if not error:
75          if 'action' in form and \
76                  form['action'].value != 'reedit':
77              doResults(who, howmany)
78          else:
79              showForm(who, howmany)
80      else:
81          showError(error)
82
83  if __name__ == '__main__':
84      process()
friendsC.py is not too different from friendsB.py. We invite you to compare the
two; we present a brief summary of the major changes for you here.
Abridged Line-by-Line Explanation
Line 7
We take the URL out of the form because we now need it in two places, the
results page being the new customer in addition to the user input form.
Lines 9–18, 68–70, 74–81
All of these lines deal with the new feature of having an error screen. If the
user does not select a radio button indicating the number of friends, the
howmany field is not passed to the server. In such a case, the showError()
function returns the error page to the user.
The error page also features a JavaScript “Back” button. Because buttons are input types, we need a form, but no action is needed because we
are just going back one page in the browsing history. Although our script
currently supports (a.k.a. tests for) only one type of error, we still use a
generic error variable in case we want to continue development of this
script to add more error detection in the future.
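The single check that friendsC.py performs can be expressed as a small function that returns an error string, which leaves room for more checks later. In this sketch (runnable under Python 3), a plain dictionary stands in for the FieldStorage instance and find_error() is a hypothetical name.

```python
def find_error(form):
    """Return an error message for a submitted form, or '' if it is valid."""
    submitted = form.get('action') == 'edit'
    if submitted and 'howmany' not in form:
        return 'Please select number of friends.'
    # Further checks could be added here, each returning its own message.
    return ''


print(repr(find_error({'action': 'edit', 'person': 'foo bar'})))
print(repr(find_error({'action': 'edit', 'person': 'foo bar',
                       'howmany': '50'})))
print(repr(find_error({})))
```

An empty string is treated as "no error", which matches the "if not error" test used in the script.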
Lines 26–28, 37–40, 47, and 51–54
One goal for this script is to create a meaningful link back to the form page
from the results page. This is implemented as a link to give the user the
ability to return to a form page to update or edit the data he entered. The
new form page makes sense only if it contains information pertaining to
the data that has already been entered by the user. (It is frustrating for
users to re-enter their information from scratch!)
To accomplish this, we need to embed the current values into the
updated form. In line 26, we add a value for the name. This value will be
inserted into the name field, if given. On the initial form page, it defaults to NEW USER. In lines 37–38, we set the radio button corresponding to the
number of friends currently chosen. Finally, on line 48 and in the updated
doResults() function on lines 52–54, we create the link with all the existing information, which returns the user to our modified form page.
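To see what the generated link looks like, the URL-building step can be run on its own. This sketch uses the Python 3 import location for quote_plus() and wraps the expression from doResults() in a hypothetical helper named reedit_link().

```python
from urllib.parse import quote_plus

url = '/cgi-bin/friendsC.py'


def reedit_link(who, howmany):
    """Build the link that reopens the form with the current values."""
    # quote_plus() makes the name safe for a query string (space -> '+').
    return url + '?action=reedit&person=%s&howmany=%s' % (
        quote_plus(who), howmany)


print(reedit_link('Foo Bar', '50'))
```

Without quote_plus(), a name containing a space or an ampersand would produce a malformed query string.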
Line 61
Finally, we added a simple feature that we thought would be a nice aesthetic touch. In the screens for friendsA.py and friendsB.py, the text
entered by the user as her name is taken verbatim. If you look at the equivalent line in friendsA.py and friendsB.py, you’ll notice that we leave the
names alone from form to display. This means that if users enter names in
all lowercase, they will show up in all lowercase, etc. So, we added a call to
str.title() to automatically capitalize a user’s name. The title() string
method titlecases the string on which it is called. This might or might not be a desired
feature, but we thought that we would share it with you so that you know
that such functionality exists.
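A few sample inputs show both the benefit and the limits of title(). The names below are made up for illustration.

```python
names = ['foo bar', 'ANNALEE LENDAY', "o'neil", 'mcdonald']

# title() uppercases the first letter of each word and lowercases the rest.
titled = [name.title() for name in names]

for name, result in zip(names, titled):
    print('%r -> %r' % (name, result))
```

The method has no knowledge of naming conventions, so a name with an internal capital such as "McDonald" comes out as "Mcdonald"; this is one reason the feature might not always be desired.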
Figures 10-7 through 10-10 show the progression of user interaction
with this CGI form and script.
In Figure 10-7, we invoke friendsC.py to bring up the form page. We
enter a name “foo bar,” but deliberately avoid checking any of the radio buttons. The resulting error after submitting the form can be seen in Figure 10-8.
Figure 10-7 The Friends initial form page without friends selection.
Figure 10-8 An error page appears due to invalid user input.
We click the Back button, click the 50 radio button, and then resubmit
our form. The results page, shown in Figure 10-9, is also familiar, but now
has an extra link at the bottom, which will take us back to the form page.
The only difference between the new form page and our original is that all
the data filled in by the user is now set as the default settings, meaning
that the values are already available in the form. (Hopefully you’ll notice
the automatic name capitalization too.) We can see this in Figure 10-10.
Figure 10-9 The Friends results page with valid input.
Figure 10-10 The Friends form page redux.
Now the user is able to make changes to either of the fields and resubmit her form.
As the developer, however, you will no doubt begin to notice that as our
forms and data become more complicated, so does the generated HTML,
especially for complex results pages. If you ever get to a point where generating the HTML text is interfering with your application, you might consider trying Python packages, such as HTMLgen, xist, or HSC. These third-party tools specialize in HTML generation directly from Python objects.
Finally, in Example 10-5, we want to show you the Python 3 equivalent,
friendsC3.py.
Example 10-5
Python 3 port of friendsC.py (friendsC3.py)
The equivalent of friendsC.py in Python 3. What are the differences?
#!/usr/bin/env python

import cgi
from urllib.parse import quote_plus

header = 'Content-Type: text/html\n\n'
url = '/cgi-bin/friendsC3.py'

errhtml = '''<HTML><HEAD><TITLE>
Friends CGI Demo</TITLE></HEAD>
<BODY><H3>ERROR</H3>
<B>%s</B><P>
<FORM><INPUT TYPE=button VALUE=Back
ONCLICK="window.history.back()"></FORM>
</BODY></HTML>'''

def showError(error_str):
    print(header + errhtml % (error_str))

formhtml = '''<HTML><HEAD><TITLE>
Friends CGI Demo</TITLE></HEAD>
<BODY><H3>Friends list for: <I>%s</I></H3>
<FORM ACTION="%s">
<B>Enter your Name:</B>
<INPUT TYPE=hidden NAME=action VALUE=edit>
<INPUT TYPE=text NAME=person VALUE="%s" SIZE=15>
<P><B>How many friends do you have?</B>
%s
<P><INPUT TYPE=submit></FORM></BODY></HTML>'''

fradio = '<INPUT TYPE=radio NAME=howmany VALUE="%s" %s> %s\n'

def showForm(who, howmany):
    friends = []
    for i in (0, 10, 25, 50, 100):
        checked = ''
        if str(i) == howmany:
            checked = 'CHECKED'
        friends.append(fradio % (str(i), checked, str(i)))
    print('%s%s' % (header, formhtml % (
        who, url, who, ''.join(friends))))

reshtml = '''<HTML><HEAD><TITLE>
Friends CGI Demo</TITLE></HEAD>
<BODY><H3>Friends list for: <I>%s</I></H3>
Your name is: <B>%s</B><P>
You have <B>%s</B> friends.
<P>Click <A HREF="%s">here</A> to edit your data again.
</BODY></HTML>'''

def doResults(who, howmany):
    newurl = url + '?action=reedit&person=%s&howmany=%s' % (
        quote_plus(who), howmany)
    print(header + reshtml % (who, who, howmany, newurl))

def process():
    error = ''
    form = cgi.FieldStorage()

    if 'person' in form:
        who = form['person'].value.title()
    else:
        who = 'NEW USER'

    if 'howmany' in form:
        howmany = form['howmany'].value
    else:
        if 'action' in form and \
                form['action'].value == 'edit':
            error = 'Please select number of friends.'
        else:
            howmany = 0

    if not error:
        if 'action' in form and \
                form['action'].value != 'reedit':
            doResults(who, howmany)
        else:
            showForm(who, howmany)
    else:
        showError(error)

if __name__ == '__main__':
    process()
10.4 Using Unicode with CGI
In the “Sequences” chapter of Core Python Programming or Core Python
Language Fundamentals, we introduced the use of Unicode strings. In one
particular section, we gave a simple example of a script that takes a Unicode string, writes it out to a file, and then reads it back in. Here, we’ll
demonstrate a similar CGI script that produces Unicode output. We’ll
show you how to give your browser enough clues to be able to render the
characters properly. The one requirement is that you must have East
Asian fonts installed on your computer so that the browser can display
them.
To see Unicode in action, we will build a CGI script to generate a multilingual Web page. First, we define the message in a Unicode string. We
assume that your text editor can only enter ASCII. Therefore, the non-ASCII characters are input by using the \u escape. In practice, the message
can also be read from a file or database.
# Greeting in English, Spanish,
# Chinese and Japanese.
UNICODE_HELLO = u"""
Hello!
\u00A1Hola!
\u4F60\u597D!
\u3053\u3093\u306B\u3061\u306F!
"""
The first output generated by the CGI is the content-type HTTP header.
It is very important to declare here that the content is transmitted in the
UTF-8 encoding so that the browser can correctly interpret it.
print 'Content-type: text/html; charset=UTF-8\r'
print '\r'
Then, output the actual message. Use the string’s encode() method to
translate the string into UTF-8 sequences first.
print UNICODE_HELLO.encode('UTF-8')
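For comparison, here is how the same encoding step might look in Python 3, where every str is already Unicode and the u prefix is unnecessary. This is a sketch rather than a port of the book's script: build_response() is a hypothetical helper, and it assumes the bytes would be written to standard output through its binary buffer.

```python
import sys

CODEC = 'UTF-8'
UNICODE_HELLO = '''
Hello!
\u00A1Hola!
\u4F60\u597D!
\u3053\u3093\u306B\u3061\u306F!
'''


def build_response():
    """Return the whole response as bytes encoded with CODEC."""
    header = 'Content-Type: text/html; charset=%s\r\n\r\n' % CODEC
    return (header + UNICODE_HELLO).encode(CODEC)


payload = build_response()
# A Python 3 CGI script could emit it with: sys.stdout.buffer.write(payload)
print(len(UNICODE_HELLO), 'characters ->',
      len(UNICODE_HELLO.encode(CODEC)), 'bytes')
```

The byte count is larger than the character count because each non-ASCII character occupies more than one byte in UTF-8.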
You can look through the code in Example 10-6, whose output will look
like the browser window shown in Figure 10-11.
Example 10-6
Simple Unicode CGI Example (uniCGI.py)
This script outputs Unicode strings to your Web browser.
#!/usr/bin/env python

CODEC = 'UTF-8'
UNICODE_HELLO = u'''
Hello!
\u00A1Hola!
\u4F60\u597D!
\u3053\u3093\u306B\u3061\u306F!
'''

print 'Content-Type: text/html; charset=%s\r' % CODEC
print '\r'
print '<HTML><HEAD><TITLE>Unicode CGI Demo</TITLE></HEAD>'
print '<BODY>'
print UNICODE_HELLO.encode(CODEC)
print '</BODY></HTML>'
Figure 10-11 A simple Unicode CGI demonstration output in Firefox.
10.5 Advanced CGI
We will now take a look at some of the more advanced aspects of CGI programming. These include the use of cookies (cached data saved on the client
side), multiple values for the same CGI field, and file upload using multipart form submissions. To save space, we show you all three of these
features with a single application. Let’s take a look at multipart submissions first.
10.5.1 Multipart Form Submission and File Uploading
Currently, the CGI specifications only allow two types of form encodings:
“application/x-www-form-urlencoded” and “multipart/form-data.” Because
the former is the default, there is never a need to state the encoding in the
FORM tag like this:
<FORM enctype="application/x-www-form-urlencoded" ...>
But for multipart forms, you must explicitly give the encoding as:
<FORM enctype="multipart/form-data" ...>
You can use either type of encoding for form submissions, but at this
time, file uploads can only be performed with the multipart encoding.
Multipart encoding was invented by Netscape in the early days of the Web
and has since been adopted by all major browsers.
File uploads are accomplished by using the file input type:
<INPUT type=file name=...>
This directive presents an empty text field with a button on the side
which allows you to browse your file directory structure for a file to
upload. When using multipart, your Web client’s form submission to the
server will look amazingly like (multipart) e-mail messages with attachments. A separate encoding was needed because it would not be wise to
“urlencode” a file, especially a binary file. The information still gets to the
server, but it is just packaged in a different way.
Regardless of whether you use the default encoding or the multipart,
the cgi module will process them in the same manner, providing keys
and corresponding values in the form submission. You will simply access
the data through your FieldStorage instance, as before.
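To make the resemblance to multipart e-mail concrete, the following Python 3 sketch feeds a hand-written multipart/form-data body to the standard library's email parser. This is not how the cgi module is used in this chapter; it only illustrates the packaging. The boundary string, field values, and the parse_multipart() helper are hypothetical.

```python
from email.parser import BytesParser
from email.policy import default

# Hypothetical boundary and field contents, chosen only for illustration.
CONTENT_TYPE = 'multipart/form-data; boundary=XXboundaryXX'
BODY = (
    b'--XXboundaryXX\r\n'
    b'Content-Disposition: form-data; name="person"\r\n'
    b'\r\n'
    b'Wesley\r\n'
    b'--XXboundaryXX\r\n'
    b'Content-Disposition: form-data; name="upfile"; filename="hello.txt"\r\n'
    b'Content-Type: text/plain\r\n'
    b'\r\n'
    b'Hello World!\r\n'
    b'--XXboundaryXX--\r\n'
)

def parse_multipart(content_type, body):
    """Return {field name: (filename or None, payload bytes)}."""
    # Prepend the header a server would receive, so the body parses
    # exactly like a MIME e-mail message.
    prefix = ('Content-Type: %s\r\nMIME-Version: 1.0\r\n\r\n'
              % content_type).encode('ascii')
    message = BytesParser(policy=default).parsebytes(prefix + body)
    fields = {}
    for part in message.iter_parts():
        name = part.get_param('name', header='content-disposition')
        fields[name] = (part.get_filename(), part.get_payload(decode=True))
    return fields

FIELDS = parse_multipart(CONTENT_TYPE, BODY)
```

Each form field arrives as its own MIME part, and a file upload is simply a part that also carries a filename and its own content type.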
10.5.2 Multivalued Fields
In addition to file uploads, we are going to show you how to process fields
with multiple values. The most common case is when you provide checkboxes for a user to select from various choices. Each of the checkboxes is
labeled with the same field name, but to differentiate them, each will have
a different value associated with a particular checkbox.
As you know, the data from the user is sent to the server in key-value
pairs during form submission. When more than one checkbox is submitted, you will have multiple values associated with the same key. In these
cases, rather than being given a single MiniFieldStorage instance for
your data, the cgi module will create a list of such instances that you will
iterate over to obtain the different values. Not too painful at all.
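The idea of one key carrying several values can be seen without any CGI machinery. The Python 3 sketch below uses urllib.parse.parse_qs on a made-up urlencoded submission (the query string is an assumption for illustration); it is a stand-in for the list of MiniFieldStorage instances described above, not the cgi module itself.

```python
from urllib.parse import parse_qs

# Hypothetical form submission: two checkboxes share the field name "lang".
QUERY = 'person=Wesley&lang=Python&lang=Ruby'

FIELDS = parse_qs(QUERY)             # every key maps to a list of values
LANGS = FIELDS.get('lang', [])       # all values submitted under "lang"
```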
10.5.3 Cookies
Finally, we will use cookies in our example. If you are not familiar with
cookies, they are just bits of data that a server at a Web site
will request to be saved on the client side (the browser).
Because HTTP is a stateless protocol, information that has to be carried
from one page to another can be passed along by using key-value pairs in
the request, as you have seen in the GET requests and screens earlier in
this chapter. Another way of doing it, as we have also seen before, is by
using hidden form fields such as the action variable in some of the later
friends*.py scripts. These variables and their values are managed by the
server because the pages they return to the client must embed these in
generated pages.
One alternative to maintaining persistency in state across multiple page
views is to save the data on the client side, instead. This is where cookies
come in. Rather than embedding data to be saved in the returned Web
pages, a server will make a request to the client to save a cookie. The
cookie is linked to the domain of the originating server (so a server cannot
set or override cookies from other Web sites) and has an expiration date
(so your browser doesn’t become cluttered with cookies).
These two characteristics are tied to a cookie along with the key-value
pair representing the data item of interest. There are other attributes of
cookies such as a domain subpath or a request that a cookie should only
be delivered in a secure environment.
By using cookies, we no longer have to pass the data from page to page
to track a user. Although they have been subject to a good amount of controversy with regard to privacy, most Web sites use cookies responsibly. To
prepare you for the code, a Web server requests that a client store a cookie by
sending the “Set-Cookie” header immediately before the requested file.
Once cookies are set on the client side, requests to the server will automatically have those cookies sent to the server using the HTTP_COOKIE
environment variable. The cookies are delimited by semicolons (;), and
each key-value pair is separated by equal signs (=). All your application
needs to do to access the data values is to split the string several times (i.e.,
using str.split() or manual parsing).
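A minimal sketch of that splitting, written for Python 3, might look like this. The parse_cookie_header() name and the sample cookie string are assumptions made for illustration.

```python
def parse_cookie_header(raw):
    """Split an HTTP_COOKIE-style string into a dict of key-value pairs."""
    cookies = {}
    for pair in raw.split(';'):          # cookies are delimited by semicolons
        pair = pair.strip()
        if '=' in pair:
            key, value = pair.split('=', 1)   # split on the first '=' only
            cookies[key] = value
    return cookies

# Hypothetical value of os.environ['HTTP_COOKIE']
COOKIES = parse_cookie_header('CPPuser=abc; CPPinfo=Wesley%3APython')
```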
Like multipart encoding, cookies originated from Netscape, which
wrote up the first specification that is still mostly valid today. You can
access this document at the following Web site:
http://www.netscape.com/newsref/std/cookie_spec.html
Once cookies are standardized and this document finally made obsolete,
you will be able to get more current information from Request for Comment
documents (RFCs). The first RFC published on cookies was RFC 2109 in 1997. It
was then replaced by RFC 2965 a few years later in 2000. The most recent
one (which supersedes the other two) at the time of this writing is RFC
6265, published in April 2011.
10.5.4 Cookies and File Upload
We now present our CGI application, advcgi.py, which has code and
functionality not too unlike the friendsC.py script earlier in this chapter.
The default first page is a user fill-out form consisting of four main parts:
user-set cookie string, name field, checkbox list of programming languages, and file submission box. Figure 10-12 presents an image of this
screen along with some sample input.
All of the data is submitted to the server using multipart encoding, and
retrieved in the same manner on the server side using the FieldStorage
instance. The only tricky part is in retrieving the uploaded file. In our
application, we choose to iterate over the file, reading it line by line. It is
also possible to read in the entire contents of the file if you are not wary of
its size.
Because this is the first occasion data is received by the server, it is at
this time, when returning the results page back to the client, that we use
the “Set-Cookie:” header to cache our data in browser cookies.
Figure 10-12 An advanced CGI cookie, upload, and multivalue form page.
In Figure 10-13, you will see the results after submitting our form data.
All the fields the user entered are shown on the page. The given file in the
final dialog box was uploaded to the server and displayed, as well.
You will also notice the link at the bottom of the results page, which
returns us to the form page, again using the same CGI script.
If we click that link at the bottom, no form data is submitted to our
script, causing a form page to be displayed. Yet, as you can see from Figure 10-14, what shows up is anything but an empty form; information previously entered by the user is already present. How did we accomplish
this with no form data (either hidden or as query arguments in the URL)?
The secret is that the data is stored on the client side in cookies—two of
them, in fact.
Figure 10-13 Our advanced CGI application results page.
The user cookie holds the string of data typed in by the user in the
“Enter cookie value” form field, and the user’s name, languages he is
familiar with, and uploaded files are stored in the information cookie.
When the script detects no form data, it shows the form page, but before
the form page has been created, it grabs the cookies from the client (which
are automatically transmitted by the client when the user clicks the link)
and fills out the form accordingly. So when the form is finally displayed,
all the previously entered information appears to the user like magic.
We are certain you are eager to see this application, so take a
look at it in Example 10-7.
Figure 10-14 The new form page with data loaded from cookies, except the uploaded file.
Example 10-7
Advanced CGI Application (advcgi.py)
This script has one main class, AdvCGI, that does a bit more. It has methods to
show either form, error, or results pages, as well as those that read or write
cookies from/to the client (a Web browser).
1   #!/usr/bin/env python
2
3   from cgi import FieldStorage
4   from os import environ
5   from cStringIO import StringIO
6   from urllib import quote, unquote
7
8   class AdvCGI(object):
9       header = 'Content-Type: text/html\n\n'
10      url = '/cgi-bin/advcgi.py'
11
12      formhtml = '''<HTML><HEAD><TITLE>
13  Advanced CGI Demo</TITLE></HEAD>
14  <BODY><H2>Advanced CGI Demo Form</H2>
15  <FORM METHOD=post ACTION="%s" ENCTYPE="multipart/form-data">
16  <H3>My Cookie Setting</H3>
17  <LI> <CODE><B>CPPuser = %s</B></CODE>
18  <H3>Enter cookie value<BR>
19  <INPUT NAME=cookie value="%s"> (<I>optional</I>)</H3>
20  <H3>Enter your name<BR>
21  <INPUT NAME=person VALUE="%s"> (<I>required</I>)</H3>
22  <H3>What languages can you program in?
23  (<I>at least one required</I>)</H3>
24  %s
25  <H3>Enter file to upload <SMALL>(max size 4K)</SMALL></H3>
26  <INPUT TYPE=file NAME=upfile VALUE="%s" SIZE=45>
27  <P><INPUT TYPE=submit>
28  </FORM></BODY></HTML>'''
29
30      langSet = ('Python', 'Ruby', 'Java', 'C++', 'PHP', 'C', 'JavaScript')
31      langItem = '<INPUT TYPE=checkbox NAME=lang VALUE="%s"%s> %s\n'
32
33      def getCPPCookies(self):    # reads cookies from client
34          if 'HTTP_COOKIE' in environ:
35              cookies = [x.strip() for x in environ['HTTP_COOKIE'].split(';')]
36              for eachCookie in cookies:
37                  if len(eachCookie)>6 and eachCookie[:3]=='CPP':
38                      tag = eachCookie[3:7]
39                      try:
40                          self.cookies[tag] = eval(unquote(eachCookie[8:]))
41                      except (NameError, SyntaxError):
42                          self.cookies[tag] = unquote(eachCookie[8:])
43              if 'info' not in self.cookies:
44                  self.cookies['info'] = ''
45              if 'user' not in self.cookies:
46                  self.cookies['user'] = ''
47          else:
48              self.cookies['info'] = self.cookies['user'] = ''
49
50          if self.cookies['info'] != '':
51              self.who, langStr, self.fn = self.cookies['info'].split(':')
52              self.langs = langStr.split(',')
53          else:
54              self.who = self.fn = ' '
55              self.langs = ['Python']
56
57      def showForm(self):
58          self.getCPPCookies()
59
60          # put together language checkboxes
61          langStr = []
62          for eachLang in AdvCGI.langSet:
63              langStr.append(AdvCGI.langItem % (eachLang,
64                  ' CHECKED' if eachLang in self.langs else '',
65                  eachLang))
66
67          # see if user cookie set up yet
68          if not ('user' in self.cookies and self.cookies['user']):
69              cookStatus = '<I>(cookie has not been set yet)</I>'
70              userCook = ''
71          else:
72              userCook = cookStatus = self.cookies['user']
73
74          print '%s%s' % (AdvCGI.header, AdvCGI.formhtml % (
75              AdvCGI.url, cookStatus, userCook, self.who,
76              ''.join(langStr), self.fn))
77
78      errhtml = '''<HTML><HEAD><TITLE>
79  Advanced CGI Demo</TITLE></HEAD>
80  <BODY><H3>ERROR</H3>
81  <B>%s</B><P>
82  <FORM><INPUT TYPE=button VALUE=Back
83  ONCLICK="window.history.back()"></FORM>
84  </BODY></HTML>'''
85
86      def showError(self):
87          print AdvCGI.header + AdvCGI.errhtml % (self.error)
88
89      reshtml = '''<HTML><HEAD><TITLE>
90  Advanced CGI Demo</TITLE></HEAD>
91  <BODY><H2>Your Uploaded Data</H2>
92  <H3>Your cookie value is: <B>%s</B></H3>
93  <H3>Your name is: <B>%s</B></H3>
94  <H3>You can program in the following languages:</H3>
95  <UL>%s</UL>
96  <H3>Your uploaded file...<BR>
97  Name: <I>%s</I><BR>
98  Contents:</H3>
99  <PRE>%s</PRE>
100 Click <A HREF="%s"><B>here</B></A> to return to form.
101 </BODY></HTML>'''
102
103     def setCPPCookies(self):    # tell client to store cookies
104         for eachCookie in self.cookies.keys():
105             print 'Set-Cookie: CPP%s=%s; path=/' % \
106                 (eachCookie, quote(self.cookies[eachCookie]))
107
108     def doResults(self):        # display results page
109         MAXBYTES = 4096
110         langList = ''.join(
111             '<LI>%s<BR>' % eachLang for eachLang in self.langs)
112         filedata = self.fp.read(MAXBYTES)
113         if len(filedata) == MAXBYTES and self.fp.read():
114             filedata = '%s%s' % (filedata,
115                 '... <B><I>(file truncated due to size)</I></B>')
116         self.fp.close()
117         if filedata == '':
118             filedata = '<B><I>(file not given or upload error)</I></B>'
119         filename = self.fn
120
121         # see if user cookie set up yet
122         if not ('user' in self.cookies and self.cookies['user']):
123             cookStatus = '<I>(cookie has not been set yet)</I>'
124             userCook = ''
125         else:
126             userCook = cookStatus = self.cookies['user']
127
128         # set cookies
129         self.cookies['info'] = ':'.join(
130             (self.who, ','.join(self.langs), filename))
131         self.setCPPCookies()
132
133         print '%s%s' % (AdvCGI.header, AdvCGI.reshtml % (
134             cookStatus, self.who, langList,
135             filename, filedata, AdvCGI.url))
136
137     def go(self):
138         # determine which page to return
139         self.cookies = {}
140         self.error = ''
141         form = FieldStorage()
142         if not form.keys():
143             self.showForm()
144             return
145         if 'person' in form:
146             self.who = form['person'].value.strip().title()
147             if self.who == '':
148                 self.error = 'Your name is required. (blank)'
149         else:
150             self.error = 'Your name is required. (missing)'
151
152         self.cookies['user'] = unquote(form['cookie'].value.strip()) if 'cookie' in form else ''
153         if 'lang' in form:
154             langData = form['lang']
155             if isinstance(langData, list):
156                 self.langs = [eachLang.value for eachLang in langData]
157             else:
158                 self.langs = [langData.value]
159         else:
160             self.error = 'At least one language required.'
161
162         if 'upfile' in form:
163             upfile = form['upfile']
164             self.fn = upfile.filename or ''
165             if upfile.file:
166                 self.fp = upfile.file
167             else:
168                 self.fp = StringIO('(no data)')
169         else:
170             self.fp = StringIO('(no file)')
171             self.fn = ''
172
173         if not self.error:
174             self.doResults()
175         else:
176             self.showError()
177
178 if __name__ == '__main__':
179     page = AdvCGI()
180     page.go()
advcgi.py looks strikingly similar to our friendsC.py CGI scripts seen
earlier in this chapter. It has a form, results, and error pages to return. In
addition to all of the advanced CGI features that are part of our new script,
we are also infusing more of an object-oriented feel to our script by using a
class with methods instead of just a set of functions. The HTML text for
our pages is now static data for our class, meaning that they will remain
constant across all instances—even though there is actually only one
instance in our case.
Line-by-Line Explanation
Lines 1–6
The usual startup and import lines appear here. If you’re not familiar with
the StringIO class, it is a file-like data structure whose core element is a
string—think in-memory text stream.
For Python 2, this class is found in either the StringIO module or its C-equivalent, cStringIO. In Python 3, it has been moved into the io module.
Similarly, the Python 2 urllib.quote() and urllib.unquote() functions
have been moved into the urllib.parse module for Python 3.
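One common way to cope with these relocations in code that must run on both versions is a guarded import. The following is a sketch of that pattern, not something taken from advcgi.py itself.

```python
try:
    # Python 3 locations
    from io import StringIO
    from urllib.parse import quote, unquote
except ImportError:
    # Python 2 locations
    from cStringIO import StringIO
    from urllib import quote, unquote

# Round-trip a value containing characters that must be escaped.
ESCAPED = quote('a b:c')
RESTORED = unquote(ESCAPED)
```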
Lines 8–28
After the AdvCGI class is declared, the header and url (static class) variables
are created for use by the methods displaying all the different pages. The
static text form HTML comes next, followed by the programming language set and HTML element for each language.
Lines 33–55
This example uses cookies. Somewhere further down in this application is
the setCPPCookies() method, which our application calls to send cookies
(from the Web server) back to the browser and store them there.
The getCPPCookies() method does the opposite. When a browser makes
subsequent calls to the application, it sends those same cookies back to the
server via HTTP headers. By the time our application executes, those values are available to us (the application) via the HTTP_COOKIE environment
variable.
This method parses the cookies, specifically seeking those that start
with the CPP string (line 37). In our application, we’re only looking for
cookies named “CPPuser” and “CPPinfo.” The keys 'user' and 'info' are
extracted as the tag on line 38, the equal sign at index 7 is skipped, and the
value starting at index 8 is unquoted and evaluated into a Python object
on lines 39–42. The exception handler catches cookie payloads
that are not valid Python objects and just saves the string value. If either of
the cookies is missing, it is assigned the empty string (lines
43–48). The getCPPCookies() method is only called from showForm().
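The slicing logic can be tried on its own with a Python 3 sketch such as the one below. It assumes the same "CPP" + four-letter tag + "=" layout; the function names and the sample header are hypothetical, and the eval() step is deliberately left out so that cookie text sent by a client is never executed.

```python
from urllib.parse import unquote

def get_cpp_cookies(raw_header):
    """Extract the 'user' and 'info' values from a raw cookie string."""
    cookies = {'user': '', 'info': ''}      # defaults when a cookie is missing
    for each in (c.strip() for c in raw_header.split(';')):
        if len(each) > 6 and each[:3] == 'CPP':
            tag = each[3:7]                  # 'user' or 'info'
            cookies[tag] = unquote(each[8:]) # skip the '=' at index 7
    return cookies

def split_info(info):
    """Unpack 'who:lang1,lang2:filename', mirroring lines 50-55."""
    if not info:
        return ' ', ['Python'], ' '
    who, lang_str, filename = info.split(':')
    return who, lang_str.split(','), filename

# Hypothetical HTTP_COOKIE value; the "theme" cookie belongs to another app.
RAW = 'CPPuser=hi%20there; theme=dark; CPPinfo=Wesley%3APython%2CRuby%3Anotes.txt'
COOKIES = get_cpp_cookies(RAW)
WHO, LANGS, FILENAME = split_info(COOKIES['info'])
```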
We parse the cookies ourselves in this simple example, but if things get
more complex, you will likely use the Cookie module (renamed to
http.cookies in Python 3) to perform this task.
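As a rough idea of what that module offers, here is a Python 3 sketch using http.cookies.SimpleCookie both to parse an incoming cookie string and to produce a Set-Cookie header. The cookie names reuse those from this example; the values are made up.

```python
from http.cookies import SimpleCookie

# Parse cookies as they would arrive in HTTP_COOKIE (sample value).
incoming = SimpleCookie()
incoming.load('CPPuser=hello; CPPinfo=Wesley')
VALUES = {name: morsel.value for name, morsel in incoming.items()}

# Build a cookie to send back to the client.
outgoing = SimpleCookie()
outgoing['CPPuser'] = 'hello'
outgoing['CPPuser']['path'] = '/'
HEADER_LINE = outgoing.output()     # a complete "Set-Cookie: ..." header line
```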
Similarly, if you’re writing Web clients and need to manage all the cookies stored in the browser (a cookie jar) and communication to Web servers,
you’ll likely use the cookielib module (renamed to http.cookiejar in
Python 3).
Lines 57–76
Both showForm() and doResults() check whether the user-supplied cookie
value has been set (lines 67–72 and 121–126). Both the
form and results HTML templates display this value.
The showForm() method’s only purpose is to display the form to the
user. It relies on getCPPCookies() to retrieve cookies from previous
requests (if any) and format the form as appropriate.
Lines 78–87
This block of code is responsible for the error page.
Lines 89–101
This is just the HTML template for the results page. It is used in
doResults(), which fills in all the required data.
Lines 102–135
The results page is created by using these blocks of code. The setCPPCookies()
method requests that a client store the cookies for our application, and the
doResults() method puts together all the data and sends the output back
to the client.
The latter, called from the go() method, does all the heavy lifting to put
together the output. In the first block of this method (lines 109–119), we
process the user input: the set of programming languages chosen (at least
one required; see the go() method), any uploaded file, and the user-supplied cookie value, the latter two of which are optional.
The final steps of doResults() (lines 128–135) cram all this data into a
single “CPPinfo” cookie for use later, and then render the results template with all the data.
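The packing step can be sketched in isolation as follows (Python 3). The helper names are hypothetical; the colon and comma separators and the "Set-Cookie: CPP...; path=/" layout follow lines 105 and 129–130 of the listing.

```python
from urllib.parse import quote, unquote

def pack_info(who, langs, filename):
    """Join name, languages, and filename into one URL-quoted cookie value."""
    return quote(':'.join((who, ','.join(langs), filename)))

def set_cookie_line(tag, value):
    """Format the header line that asks the client to store the cookie."""
    return 'Set-Cookie: CPP%s=%s; path=/' % (tag, value)

PACKED = pack_info('Wesley', ['Python', 'Ruby'], 'notes.txt')
LINE = set_cookie_line('info', PACKED)
```

Note that this simple scheme breaks if a name or filename itself contains a colon or comma, which is one reason a larger application would likely choose a more robust serialization.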
Lines 137–180
The script begins by instantiating an AdvCGI page object and then calling
its go() method to start the ball rolling. The go() method contains the logic
that reads all incoming data and decides which page to show.
The error page will be displayed if no name was given or if no languages were checked. The showForm() method is called to output the
form if no input data was received; otherwise, the doResults() method is
invoked to display the results page. Error situations are created by setting
the self.error variable, which serves two purposes. It lets you set an error
reason as a string and also serves as a flag to indicate that an error has
occurred. If this value is not blank, the user will be forwarded to the error
page.
Handling the person field (lines 145–150) is the same as we have seen in
the past: a single key-value pair. However, collecting the language information (lines 153–160) is a bit trickier because we must check for either a
(Mini)FieldStorage instance or a list of such instances. We will employ
the familiar isinstance() built-in function for this purpose. In the end, we
will have a list of a single language name or many, depending on the
user’s selections.
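The single-versus-list check is easy to exercise outside of CGI. In this sketch, Field is a hypothetical stand-in for a (Mini)FieldStorage object (anything with a value attribute), and get_values() is an illustrative helper name.

```python
class Field(object):
    """Hypothetical stand-in for a (Mini)FieldStorage instance."""
    def __init__(self, value):
        self.value = value

def get_values(data):
    """Always return a list of values, whether one or many were submitted."""
    if isinstance(data, list):
        return [each.value for each in data]
    return [data.value]

ONE = get_values(Field('Python'))
MANY = get_values([Field('Python'), Field('Ruby'), Field('C')])
```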
The use of cookies to contain data illustrates how they can be used to
avoid using any kind of CGI field pass-through. In our previous examples
in this chapter, we passed such values as CGI variables. Now we are only
using cookies. You will notice in the code that obtains such data that no
CGI processing is invoked, meaning that the data does not come from the
FieldStorage object. The data is passed to us by the Web client with each
request and the values (user’s chosen data as well as information to fill in a
succeeding form with pre-existing information) are obtained from cookies.
Because the doResults() method receives the new input from the
user, it has the responsibility of setting the cookies, which it does by calling
setCPPCookies(). However, showForm() must read in the cookies’ values
in order to display a form page with the current user selections. This is
done by its invocation of the getCPPCookies() method.
Finally, we get to the file upload processing (lines 162–171). Regardless
of whether a file was actually uploaded, FieldStorage is given a file handle in the file attribute. On line 171, if there was no filename given, then
we just set it to a blank string. As a better alternative, you can access the
file pointer—the file attribute—and perhaps read only one line at a time or
other kind of slower processing.
In our case, file uploads are only part of user submissions, so we simply
pass on the file pointer to the doResults() function to extract the data
from the file. doResults() will display only the first 4KB (as set on line
112) of the file for space reasons and to show you that it is not necessary
(or necessarily productive or useful) to display a 4GB binary file.
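The read-then-probe idea behind that limit can be sketched as a small Python 3 helper. The read_preview() name is hypothetical, and an in-memory BytesIO object stands in for the uploaded file.

```python
from io import BytesIO

MAXBYTES = 4096

def read_preview(fp, limit=MAXBYTES):
    """Read at most `limit` bytes and report whether more data remained."""
    data = fp.read(limit)
    # Only probe for extra data when the limit was actually reached.
    truncated = len(data) == limit and bool(fp.read(1))
    fp.close()
    return data, truncated

BIG_DATA, BIG_TRUNCATED = read_preview(BytesIO(b'x' * 5000))
SMALL_DATA, SMALL_TRUNCATED = read_preview(BytesIO(b'abc'))
```

Probing with a one-byte read avoids pulling the rest of a large upload into memory just to learn that it exists.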
Existing Core Python readers will notice that we have refactored this
code significantly from previous editions of this book. The original was
over a decade old and did not reflect contemporary Python practices. It is
likely this incarnation of advcgi.py will not run in Python older than version 2.5. However, you can still access the code from earlier editions of this
script from the book’s Web site as well as the equivalent Python 3 version.
10.6 Introduction to WSGI
This section of the chapter introduces you to everything you need to know
about WSGI, starting with the motivation and background. The second
half of this section covers how to write Web applications without having
to worry about how they will be executed.
10.6.1 Motivation (CGI Alternatives)
Okay, now you have a good understanding of what CGI does and why
something like it is needed: servers cannot create dynamic content; they
don’t have knowledge of user-specific application data, such
as authentication, bank accounts, online purchases, etc. Web servers must
communicate with an outside process to do this custom work.
In the first two-thirds of this chapter, we discussed how CGI solves this
problem and taught you how it works. We also mentioned that it is woefully inadequate because it does not scale; CGI processes (like Python
interpreters) are created per-request then thrown away. If your application
receives thousands of requests, spawning a like number of language
interpreters will quickly bring your servers to a halt. Two widely used
methods to combat this performance issue are: server integration and
external processes. Let’s briefly discuss each of these.
10.6.2 Server Integration
Server integration is also known as a server API. These include proprietary
solutions like the Netscape Server Application Programming Interface
(NSAPI) and Microsoft’s Internet Server Application Programming Interface
(ISAPI). The most widely used server solution today (since the mid-1990s) is
the Apache HTTP Web server, an open-source solution. Apache, as it is commonly called, has a server API as well, and uses the term module to describe
compiled plug-in components that extend its functionality and capability.
All three of these and similar solutions address the CGI performance
problem by integrating the gateway into the server. In other words,
instead of the server forking off a separate language interpreter to handle
a request, it merely makes a function call, running any application code
and coming up with the response in-process. These servers may process
their work via a set of pre-created processes or threads, depending on their
API. Most can be adjusted to suit the requirements of the supported applications. General features that servers also provide include compression,
security, proxying, and virtual hosting, to name a few.
Of course, no solution is without its downsides, and for server APIs,
these include buggy code affecting server performance, language implementations that are not fully compatible, requiring
the API developer to code in the same programming language as
the Web server implementation, integration into a proprietary solution (if
not using an open-source server API), requiring that applications be
thread-safe, etc.
10.6.3 External Processes
Another solution is an external process. These are CGI applications that
permanently run outside of the server. When a request comes in, the
server passes it off to such a process. They scale better than pure CGI
because these processes are long-lived as opposed to being spawned for
individual requests then terminated. The most well-known external process solution is FastCGI. With external processes, you get the benefits of
server APIs but not as many of the drawbacks because, for instance, you
get to run outside the server, they can be implemented in your language of
choice, application defects might not affect the Web server, you’re not
forced to code against a proprietary source, etc.
Naturally, there is a Python implementation of FastCGI, as well as a
variety of Python modules for Apache (PyApache, mod_snake, mod_python,
etc.), some of which are no longer being maintained. All these plus the
original pure CGI solution make up the gamut of Web server API gateway
solutions to calling Python Web applications.
Because of these different invocation mechanisms, an additional burden
has been placed on the developer. You not only need to build your application, but you must also decide on integration with these Web servers. In
fact, when you write your application, you need to know exactly in which
one of these mechanisms it will execute and code it that way.
This problem is more acute for Web framework developers, because
you want to give your users the most flexibility. If you don’t want to force
them to create multiple versions of their applications, you’ll need to provide interfaces to all server solutions in order to promote adoption of your
framework. This dilemma certainly doesn’t sound like it lends itself to
being Pythonic, thus it has led to the creation of the Web Server Gateway
Interface (WSGI) standard.
10.6.4 Introducing WSGI
WSGI is not a server, an API you program against, or an actual piece of code,
but it does define an interface. The WSGI specification was created as PEP
333 in 2003 to address the wide proliferation of disparate Web frameworks, Web servers, and various invocation styles just discussed (pure
CGI, server API, external process).
The goal was to reduce this type of fragmentation and improve interoperability
with a standard that targets a common API between the Web server and
Web framework layers. Since its creation, WSGI adoption has become
commonplace. Nearly all of the Python-based Web servers are WSGI-compliant. Having WSGI as a standard is advantageous to application
developers, framework creators, and the community as a whole.
A WSGI application is defined as a callable which (always) takes the following parameters: a dictionary containing the server environment variables, and another callable that initializes the response with an HTTP
status code and HTTP headers to return back to the client. The application callable
must return an iterable which makes up the payload.
In the sample “Hello World” WSGI application that follows, these parameters are named environ and start_response(), respectively:
def simple_wsgi_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/plain')]
    start_response(status, headers)
    return ['Hello world!']
The environ variable contains familiar environment variables, such as
HTTP_HOST, HTTP_USER_AGENT, SERVER_PROTOCOL, etc. The start_response()
callable must be executed within the application to prepare the
response that will eventually be sent back to the client. The response must
include an HTTP return code (200, 300, etc.) as well as HTTP response
headers.
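To see both parameters in action without a Web server, you can call the application by hand. The sketch below is written for Python 3, where the updated specification (PEP 3333) requires the payload to be bytes, so the string literal gains a b prefix. The call_app() helper is hypothetical; setup_testing_defaults() comes from the standard library's wsgiref.util module and fills in a plausible environ.

```python
from wsgiref.util import setup_testing_defaults

def simple_wsgi_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-Type', 'text/plain')]
    start_response(status, headers)
    return [b'Hello world!']

def call_app(app):
    """Invoke a WSGI app once and capture what it hands back."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    environ = {}
    setup_testing_defaults(environ)     # adds SERVER_PROTOCOL, wsgi.*, etc.
    body = b''.join(app(environ, start_response))
    return captured['status'], captured['headers'], body

STATUS, HEADERS, BODY = call_app(simple_wsgi_app)
```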
In this first version of the WSGI standard, start_response() should also
return a write() function in order to support legacy servers that stream
results back. Using it is discouraged; instead, return just an iterable and let the Web server manage returning the data back to the client
(rather than having the application do so, as that is not in its realm of expertise). Because of this, most applications just drop the return value from
start_response() or don’t use or save it otherwise.
In the previous example, you can see that a 200 status code is set as well
as the Content-Type header. Both are passed into start_response() to formally begin the response. Everything else that comes after should be some
iterable (a list, generator, etc.) that makes up the actual response payload. In this example, we’re only returning a list containing a single string,
but you can certainly imagine a lot more data going back. It can also be any
iterable, not just a list; a generator or callable instance are great alternatives.
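A generator-based application might look like the following Python 3 sketch (the application and the collect() helper are illustrative names). One consequence worth knowing: because the function body of a generator does not run until iteration starts, start_response() is not called until the server begins consuming the payload.

```python
def generator_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    for chunk in (b'Hello', b' ', b'world!'):
        yield chunk                      # payload is produced piece by piece

def collect(app):
    """Run an app and record when start_response() is actually invoked."""
    calls = []

    def start_response(status, headers, exc_info=None):
        calls.append(status)

    result = app({}, start_response)
    called_before_iteration = bool(calls)
    body = b''.join(result)
    return called_before_iteration, calls, body

EARLY, CALLS, BODY = collect(generator_app)
```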
The last thing we wanted to say about start_response() is the third and
optional exception information parameter, usually known by its abbreviation, exc_info. If an application has set the headers to say “200 OK” (but
has not actually sent them) and encounters problems during execution, it’s
possible to change the headers to something else, like “403 Forbidden” or
“500 Internal Server Error,” if desired.
To make this happen, we can assume that the application called start_response()
with the regular pair of parameters at the beginning of
execution. When errors occur, start_response() can be called again, but
with exc_info passed in along with the new status and headers that will
replace the existing ones.
It is an error to call start_response() a second time without exc_info.
Again, this must all happen before any HTTP headers are sent. If the headers have already been sent, an exception must be raised, such as raise
exc_info[0], exc_info[1], exc_info[2].
For more information on the start_response() callable, refer to PEP 333 at
http://www.python.org/dev/peps/pep-0333/#the-start-response-callable.
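The rules for exc_info can be sketched from the server's point of view as follows. This is a simplified Python 3 illustration modeled on the behavior the PEP describes, not a complete server; the state dictionary, make_start_response(), and failing_app() are hypothetical names. Python 3 re-raises with with_traceback() instead of the three-argument raise shown above.

```python
import sys

def make_start_response(state):
    """Return a start_response() that enforces the exc_info rules."""
    def start_response(status, headers, exc_info=None):
        if exc_info:
            try:
                if state['headers_sent']:
                    # Too late to change anything: re-raise the original error.
                    raise exc_info[1].with_traceback(exc_info[2])
            finally:
                exc_info = None          # avoid a circular reference
        elif state['status'] is not None:
            raise AssertionError('second call requires exc_info')
        state['status'] = status         # new status replaces the old one
        state['headers'] = headers
    return start_response

def failing_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    try:
        raise ValueError('something broke')
    except ValueError:
        # Headers not sent yet, so they may still be replaced.
        start_response('500 Internal Server Error',
                       [('Content-Type', 'text/plain')], sys.exc_info())
        return [b'An error occurred.']

STATE = {'status': None, 'headers': None, 'headers_sent': False}
BODY = failing_app({}, make_start_response(STATE))
```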
10.6.5 WSGI servers
On the server side, we need to call the application (as we discussed previously), pass in the environment and start_response() callable, and then
wait for the application to complete. When it does, we should get an iterable as the return value and return this data back to the client. In the following script, we present a simplistic and limited example of what a WSGI
Web server would look like:
import StringIO
import sys

def run_wsgi_app(app, environ):
    body = StringIO.StringIO()

    def start_response(status, headers):
        body.write('Status: %s\r\n' % status)
        for header in headers:
            body.write('%s: %s\r\n' % header)
        return body.write

    iterable = app(environ, start_response)
    try:
        if not body.getvalue():
            raise RuntimeError("start_response() not called by app!")
        body.write('\r\n%s\r\n' % '\r\n'.join(line for line in iterable))
    finally:
        if hasattr(iterable, 'close') and callable(iterable.close):
            iterable.close()

    sys.stdout.write(body.getvalue())
    sys.stdout.flush()
10.6 Introduction to WSGI
The underlying server/gateway will take the application as provided by
the developer and put it together with the environ dictionary, which holds the contents of os.environ plus the WSGI-specified wsgi.* environment variables
(see the PEP, but expect elements such as wsgi.input, wsgi.errors,
wsgi.version, etc.) as well as any framework or middleware environment
variables. (More on middleware coming soon.) With both of these items, it
will then call run_wsgi_app(), which returns the response to the client.
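To make the environ assembly a little more concrete, here is a minimal sketch of how a CGI-style gateway might build that dictionary. The function name build_environ() is our own, and the wsgi.* values are modeled on the sample CGI gateway in the PEP; a Python 3 server would normally hand over a byte stream (for example, sys.stdin.buffer) as wsgi.input.

```python
import os
import sys

def build_environ(stdin=None, stderr=None):
    """Assemble a WSGI environ the way a simple CGI-style gateway might."""
    environ = dict(os.environ)  # copy, so the process environment is untouched
    environ.update({
        'wsgi.version': (1, 0),
        'wsgi.url_scheme': 'http',
        'wsgi.input': stdin if stdin is not None else sys.stdin,
        'wsgi.errors': stderr if stderr is not None else sys.stderr,
        'wsgi.multithread': False,
        'wsgi.multiprocess': True,
        'wsgi.run_once': True,
    })
    return environ

environ = build_environ()
print(sorted(key for key in environ if key.startswith('wsgi.')))
```

Any framework or middleware variables would be added to the same dictionary before the application is called.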
In reality, as an application developer, you wouldn’t be interested in
minutiae such as this. Creating servers is for those wanting to provide, with
WSGI specifications, a consistent execution framework for applications.
You can see from the preceding example that WSGI provides a clean break
between the application side and the server side. Any application can be
passed to the server described above (or any other WSGI server). Similarly,
in any application, you don’t care what kind of server is calling you; all
you care about is the environment you’re given and the start_response()
callable that you need to execute before returning data to the client.
10.6.6
Reference Server
As we just mentioned, application developers shouldn’t be forced to write
servers too, so rather than having to create and manage code like
run_wsgi_app(), you should be able to choose any WSGI server you want,
and if none are handy, Python provides a simple reference server in the
standard library: wsgiref.simple_server.WSGIServer.
You can build one using the class directly; however, the wsgiref package
itself features a convenience function called make_server() that you can
employ for simple access to the reference server. Let’s do so with our sample application, simple_wsgi_app():
#!/usr/bin/env python
from wsgiref.simple_server import make_server
httpd = make_server('', 8000, simple_wsgi_app)
print "Started app serving on port 8000..."
httpd.serve_forever()
This takes the application we created earlier, simple_wsgi_app(), wraps
it in a server running on port 8000, and starts the server loop. If you visit
http://localhost:8000 in a browser (or whatever [host, port] pair you’re
using), you should see the plain text output of “Hello World!”
For the truly lazy, you don’t have to write the application or the server.
The wsgiref module also has a demonstration application,
wsgiref.simple_server.demo_app(). The demo_app() is nearly identical to
simple_wsgi_app(), except that in addition, it displays the environment variables.
Here’s the code for running the demonstration application with the reference server:
#!/usr/bin/env python
from wsgiref.simple_server import make_server, demo_app
httpd = make_server('', 8000, demo_app)
print "Started app serving on port 8000..."
httpd.serve_forever()
Start up the server by running this script, and then browse to the application; you should
see the “Hello World!” output along with the environment variable dump.
This is just the reference model for a WSGI-compliant server. It is not
full-featured or intended to serve in production use. However, server creators can take a page from this to design their own products and make
them WSGI-compliant. The same is true for demo_app() as a reference
WSGI-compliant application for application developers.
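Because a WSGI application is just a callable, you can also exercise demo_app() without opening a port at all. The sketch below assumes Python 3 and uses wsgiref.util.setup_testing_defaults() to fill in a plausible environ; call_app() is a hypothetical helper of our own, not part of wsgiref.

```python
from wsgiref.simple_server import demo_app
from wsgiref.util import setup_testing_defaults

def call_app(app, environ=None):
    """Invoke a WSGI app in-process and return (status, headers, body bytes)."""
    environ = dict(environ or {})
    setup_testing_defaults(environ)  # fills in REQUEST_METHOD, wsgi.input, etc.
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers
        return lambda data: None  # legacy write() callable, unused here

    iterable = app(environ, start_response)
    try:
        body = b''.join(iterable)
    finally:
        if hasattr(iterable, 'close'):
            iterable.close()
    return captured['status'], captured['headers'], body

status, headers, body = call_app(demo_app)
print(status)
print(body.decode('utf-8').splitlines()[0])
```

The same helper works for any other WSGI application, which makes it a handy way to check an application before wiring it to a real server.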
10.6.7
Sample WSGI Applications
As mentioned earlier, WSGI is now the standard, and nearly all Python
Web frameworks support it, even if it doesn’t look like it. For example, a
Google App Engine handler class, given the usual imports, might contain
code that looks something like this:
class MainHandler(webapp.RequestHandler):
    def get(self):
        self.response.out.write('Hello world!')

application = webapp.WSGIApplication([
    ('/', MainHandler)], debug=True)
run_wsgi_app(application)
Not all frameworks will have an exact match as far as code goes, but
you can clearly see the WSGI reference. For a much closer comparison,
you can go one level lower and take a look at the run_bare_wsgi_app()
function found in the util.py module of the webapp subpackage of the
App Engine Python SDK. You’ll find this code looks much more like a
derivative of simple_wsgi_app().
10.6.8
Middleware and Wrapping WSGI Applications
There might be situations in which you want to let the application run as-is, but you want to inject preprocessing before the application executes (on the request) or
post-processing after it executes (on the response). This is commonly known as
middleware, which is additional functionality that sits between the Web
server and the Web application. You’re either massaging the data coming
from the user before passing it to the application, or you need to do some
final tweaks to the results from the application before returning the payload back to the user. This is commonly referred to as a middleware onion,
indicating the application is at the heart, with additional layers in between.
Preprocessing can include activities, such as intercepting the request
parameters; modifying them; adding or removing them; altering the environment (including any user-submitted form [CGI] variables); using the
URL path to dispatch application functionality; forwarding or redirecting
requests; load-balancing based on network traffic via the inbound client IP
address; delegating to altered functionality (e.g., using the User-Agent
header to send mobile users to a simplified UI/app); etc.
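As one illustration of preprocessing, the following sketch inspects the User-Agent header and delegates to one of two applications. It is a deliberately naive example: pick_by_user_agent() and make_text_app() are hypothetical names, the substring test on 'Mobile' is an assumption rather than a reliable way to detect devices, and the byte-string bodies assume Python 3.

```python
def pick_by_user_agent(desktop_app, mobile_app, marker='Mobile'):
    """Return a WSGI app that delegates based on the User-Agent header."""
    def middleware(environ, start_response):
        agent = environ.get('HTTP_USER_AGENT', '')
        app = mobile_app if marker in agent else desktop_app
        return app(environ, start_response)
    return middleware

def make_text_app(text):
    """Hypothetical helper: build a tiny app that always returns `text`."""
    def app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [text.encode('utf-8')]
    return app

site = pick_by_user_agent(make_text_app('full site'),
                          make_text_app('mobile site'))
```

The server only ever sees site; neither of the wrapped applications needs to know that a dispatch step ran before it.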
Examples of post-processing primarily involve manipulating the output from the application. The following script is an example, similar to the
timestamp server that we created in Chapter 2, “Network Programming”:
for each line from the application’s results, we’re going to prepend it with
a timestamp. In practice of course, this is much more complicated, but this
is an example similar to others you can find online that capitalize or lowercase application output. Here, we’ll wrap our call to simple_wsgi_app()
with ts_simple_wsgi_app() and install the latter as the application that the
server registers:
#!/usr/bin/env python

from time import ctime
from wsgiref.simple_server import make_server

def ts_simple_wsgi_app(environ, start_response):
    return ('[%s] %s' % (ctime(), x) for x in \
        simple_wsgi_app(environ, start_response))

httpd = make_server('', 8000, ts_simple_wsgi_app)
print "Started app serving on port 8000..."
httpd.serve_forever()
For those of you with more of an object bent, you can use a class wrapper instead of a function wrapper. On top of this, we can reduce environ
and start_response() into a single variable argument tuple (see stuff in
the example that follows) to shorten the code a bit because we added some
with the inclusion of a class and definition of a pair of methods:
class Ts_ci_wrapp(object):
    def __init__(self, app):
        self.orig_app = app

    def __call__(self, *stuff):
        return ('[%s] %s' % (ctime(), x) for x in
            self.orig_app(*stuff))

httpd = make_server('', 8000, Ts_ci_wrapp(simple_wsgi_app))
print "Started app serving on port 8000..."
httpd.serve_forever()
We’ve named the class Ts_ci_wrapp, which is short for “timestamp callable instance wrapped application.” It is instantiated when we create the
server. The initializer takes the original application and caches it for use
later. When the server executes the application, it still passes in the environ dict and start_response() callable, as before. With this change, the
instance itself will be called (hence the __call__() method definition).
Both environ and start_response() are passed to the original application
via stuff.
Although we used a callable instance here and a function earlier, keep
in mind that any callable will work. Also note that none of these last few
examples modify simple_wsgi_app() in any way. The main point is that
WSGI provides a clean break between the Web application and the Web
server. This helps compartmentalize development, allows teams to more
easily divide the work, and gives a consistent and flexible way to allow
Web applications to run with any type of WSGI-compliant back-end. It
also frees the Web server creator from having to incorporate any custom or
specific hooks for users who choose to run applications by using their
(Web) server software.
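Middleware can post-process more than the body. A wrapper can also hand the application its own version of start_response() and adjust the headers on the way out. The sketch below assumes Python 3; AddHeader and hello_app() are hypothetical names, with hello_app() standing in for simple_wsgi_app().

```python
class AddHeader(object):
    """Middleware sketch: append one response header to whatever the app sets."""
    def __init__(self, app, name, value):
        self.app = app
        self.extra = (name, value)

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Copy the app's headers, add ours, then defer to the real callable.
            new_headers = list(headers) + [self.extra]
            if exc_info is None:
                return start_response(status, new_headers)
            return start_response(status, new_headers, exc_info)
        return self.app(environ, patched_start_response)

def hello_app(environ, start_response):
    """Stand-in for simple_wsgi_app(), returning bytes as Python 3 requires."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World!']

wrapped = AddHeader(hello_app, 'X-Served-By', 'sketch')
```

As with the timestamp wrappers, the wrapped application is untouched, and wrapped can be passed to make_server() or to another layer of middleware.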
10.6.9
Updates to WSGI in Python 3
PEP 333 defined the WSGI standard for Python 2. PEP 3333 offers
enhancements to PEP 333 to bring the standard to Python 3. Specifically, it calls
out that the network traffic is all done in bytes. While such strings are
native to Python 2, native Python 3 strings are Unicode to emphasize that
they represent text data while the original ASCII strings were renamed to
the bytes type.
Specifically, PEP 3333 clarifies that “native” strings—the data type
named str, regardless of whether you’re using Python 2 or 3—are those
used for all HTTP headers and corresponding metadata. It also states that
“byte” strings are those which are used for the HTTP payloads (requests/
responses, GET/POST/PUT input data, HTML output, etc.). For more
information on PEP 3333, take a look at its definition, which you can find at
www.python.org/dev/peps/pep-3333/.
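In practice, the split looks like this in a Python 3 application: the status line and headers are ordinary str objects, while the body is encoded to bytes before it is returned. This is a small sketch; py3_wsgi_app() is a hypothetical name, and UTF-8 is our choice of encoding.

```python
def py3_wsgi_app(environ, start_response):
    """Hypothetical app illustrating the str/bytes split described in PEP 3333."""
    text = 'Hello World! \u2713\n'
    payload = text.encode('utf-8')      # body: bytes
    status = '200 OK'                   # status line: native str
    headers = [                         # header names and values: native str
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(payload))),
    ]
    start_response(status, headers)
    return [payload]
```

Note that Content-Length is computed from the encoded payload, not from the text, because the two lengths differ as soon as a non-ASCII character appears.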
Independent of PEP 3333, there are other related proposals that will
make for good reading. One is PEP 444, which is a first attempt to define a
“WSGI 2,” if such a thing takes on that name. The community generally
regards PEP 3333 as a “WSGI 1.0.1,” an enhancement to the original
PEP 333 specification, whereas PEP 444 is a consideration for WSGI’s next
generation.
10.7 Real-World Web Development
CGI was the way things used to work, and the concepts it brought still
apply in Web programming today; hence, the reason why we spent so
much time looking at it. The introduction to WSGI brought you one step
closer to reality.
Today, new Python Web programmers have a wealth of choices, and
while the big names in the Web framework space are still Django, Pyramid, and Google App Engine, there are plenty more options for users to
choose from—perhaps a mind-numbing selection, actually. Frameworks
aren’t even necessary: you could go straight down to a WSGI-compliant
Web server without any of the extra “fluff” or framework features. However, the chances are more likely that you will go with a framework because
of the convenience of having the rest of the Web stack available to you.
A modern Web execution environment will likely consist of either a
multithreaded or multiprocess server model, signed/secure cookies, basic
user authentication, and session management. Many of these things regular application developers already know; authentication represents user
registration with a login name and password, and cookies are ways of
maintaining user information, sometimes session information, as well. We
also know that in order to scale, Web servers need to be able to handle
requests from multiple users; hence, the use of threads or processes. However, one thing that hasn’t been covered is the need for sessions.
If you look at all the application code in this entire chapter that runs on
Web servers, it might take a while for you to know that aside from the
obvious differences from scripts that run from beginning to end or server
loops which just run forever, Web applications (or servlets in Java parlance)
are executed for every request. There’s no state saved within the code, and
we already mentioned that HTTP is stateless, as well. In other words, don’t
expect data to be saved in variables, global or otherwise. Think of a
request like a single transaction. It comes in, does its business, and finishes, leaving nothing behind in the codebase.
This is why session management—saving of a user’s state across one or
more requests within a well-defined duration of time—is needed. Generally, this is accomplished by using some sort of persistent storage, such as
memcache, flat (or not-so-flat) files, and even databases. Developers can
certainly roll their own, especially when writing lower-level code, as
we’ve seen in this chapter. But without question this wheel has already
been (re)invented several times, which is why many of the larger, more
well-known Web frameworks, including Django, come with their own session management software. (This leads directly into our next chapter.)
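To see roughly what "rolling your own" involves, consider the following sketch of cookie-based sessions for a WSGI application. It makes several simplifying assumptions: the module-level SESSIONS dict merely stands in for real persistent storage such as memcache or a database, the cookie name sid is arbitrary, load_session() and counter_app() are hypothetical names, and there is no expiry or cookie signing. It uses the Python 3 http.cookies module (named Cookie in Python 2).

```python
import uuid
from http.cookies import SimpleCookie

SESSIONS = {}  # stand-in for persistent storage (memcache, files, a database)

def load_session(environ):
    """Return (session_id, data, is_new) for the request's session cookie."""
    cookie = SimpleCookie(environ.get('HTTP_COOKIE', ''))
    morsel = cookie.get('sid')
    if morsel is not None and morsel.value in SESSIONS:
        return morsel.value, SESSIONS[morsel.value], False
    sid = uuid.uuid4().hex  # unknown or missing cookie: start a new session
    SESSIONS[sid] = {}
    return sid, SESSIONS[sid], True

def counter_app(environ, start_response):
    """Hypothetical app: counts visits per session across separate requests."""
    sid, data, is_new = load_session(environ)
    data['visits'] = data.get('visits', 0) + 1
    headers = [('Content-Type', 'text/plain')]
    if is_new:
        headers.append(('Set-Cookie', 'sid=%s; Path=/; HttpOnly' % sid))
    start_response('200 OK', headers)
    return [('Visit number %d\n' % data['visits']).encode('utf-8')]
```

Each request is still handled independently; the only thing tying two requests together is the identifier the browser sends back in its cookie. A framework's session layer does the same job with proper storage, expiry, and tamper protection.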
10.8 Related Modules
In Table 10-1, we present a list of modules that you might find useful
for Web development. You might also take a look at Chapter 3, “Internet
Client Programming,” and Chapter 13, “Web Services,” for other useful
Web application modules.
Table 10-1 Web Programming Related Modules

Module/Package       Description
-------------------  --------------------------------------------------------
Web Applications
cgi                  Retrieves CGI form data
cgitb [c]            Handles CGI tracebacks
htmllib              Older HTML parser for simple HTML files; HTMLParser
                     class extends from sgmllib.SGMLParser
HTMLParser [c]       Newer, non-SGML-based parser for HTML and XHTML
htmlentitydefs       HTML general entity definitions
Cookie               Server-side cookies for HTTP state management
cookielib [e]        Cookie-handling classes for HTTP clients
webbrowser [b]       Controller: launches Web documents in a browser
sgmllib              Parses simple SGML files
robotparser [a]      Parses robots.txt files for URL “fetchability” analysis
httplib [a]          Used to create HTTP clients

Web Servers
BaseHTTPServer       Abstract class with which to develop Web servers
SimpleHTTPServer     Serve the simplest HTTP requests (HEAD and GET)
CGIHTTPServer        In addition to serving Web files like SimpleHTTPServers,
                     can also process CGI (HTTP POST) requests
http.server [g]      New name for the combined package merging together
                     BaseHTTPServer, SimpleHTTPServer, and CGIHTTPServer
                     modules in Python 3
wsgiref [f]          WSGI reference module

3rd party packages (not in standard library)
BeautifulSoup        Regex-based HTML and XML parser
                     http://crummy.com/software/BeautifulSoup
html5lib             HTML5 parser
                     http://code.google.com/p/html5lib
lxml                 Comprehensive HTML and XML parser (supports both
                     of the above parsers) http://lxml.de

[a] New in Python 1.6.
[b] New in Python 2.0.
[c] New in Python 2.2.
[d] New in Python 2.3.
[e] New in Python 2.4.
[f] New in Python 2.5.
[g] New in Python 3.0.
10.9 Exercises
CGI and Web Applications
10-1. urllib Module and Files. Update the friendsC.py script so
that it stores names and corresponding number of friends
into a two-column text file on disk and continues to add
names each time the script is run.
Extra Credit: Add code to dump the contents of such a file to
the Web browser (in HTML format). Additional Extra Credit:
Create a link that clears all the names in this file.
10-2. Error Checking. The friendsC.py script reports an error if no
radio button was selected to indicate the number of friends.
Update the CGI script to also report an error if no name (e.g.,
blank or whitespace) is entered.
Extra Credit: We have so far explored only server-side error
checking. Explore JavaScript programming and implement
client-side error checking by creating JavaScript code to
check for both error situations so that these errors are
stopped before they reach the server.
10-3. Simple CGI. Create a “Comments” or “Feedback” page for
a Web site. Take user feedback via a form, process the data in
your script, and then return a “thank you” screen.
10-4. Simple CGI. Create a Web guestbook. Accept a name, an
e-mail address, and a journal entry from a user, and then log
it to a file (format of your choice). Like Exercise 10-3, return a
“thanks for filling out a guestbook entry” page. Also provide
a link so that users can view guestbooks.
10-5. Web Browser Cookies and Web Site Registration. Create a user
authentication service for a Web site. Manage user names
and passwords in an encrypted way. You may have done
a plain text version of this exercise in either Core Python
Programming or Core Python Language Fundamentals and can
use parts of that solution if you wish.
Extra Credit: Familiarize yourself with setting Web browser
cookies and maintain a login session for four hours from the
last successful login.
Extra Credit: Allow for federated authentication via OpenID,
allowing users to log in via Google, Yahoo!, AOL, WordPress, or even proprietary authentication systems such as
“Facebook Connect” or “sign in with Twitter.” You can also
use the Google Identity Toolkit that you can download from
http://code.google.com/apis/identitytoolkit.
10-6. Errors. What happens when a CGI script crashes? How can
the cgitb module be helpful?
10-7. CGI, File Updates, and Zip Files. Create a CGI application that
not only saves files to the server’s disk, but also intelligently
unpacks Zip files (or other archive) into a subdirectory
named after the archive file.
10-8. Web Database Application. Think of a database schema that
you want to provide as part of a Web database application.
For this multi-user application, you want to grant everyone
read access to the entire contents of the database, but perhaps only write access to each individual. One example
might be an address book for your family and relatives. Each
family member, once successfully logged in, is presented
with a Web page with several options: add an entry, view my
entry, update my entry, remove or delete my entry, and view
all entries (entire database).
Design a UserEntry class and create a database entry for
each instance of this class. You can use any solution created for any previous problem to implement the registration
framework. Finally, you can use any type of storage mechanism for your database, either a relational database such as
MySQL or some of the simpler Python persistent storage
modules such as anydbm or shelve.
10-9. Electronic Commerce Engine. Create an e-commerce/online
shopping Web service that is generic and can be “reskinned”
for multiple clients. Add your own authentication system as
well as classes for users and shopping carts. (If you have Core
Python Programming or Core Python Language Fundamentals,
you can use the classes created for your solutions to Exercises 4 and 11 in the Object-Oriented Programming chapter.)
Don’t forget that you will also need code to manage your
products, whether they are hard goods or services. You
might want to connect to a payment system such as those
offered by PayPal or Google. After reading the next few
chapters, port this temporary CGI solution to Django, Pyramid, or Google App Engine.
10-10. Python 3. Examine the differences between friendsC.py and
friendsC3.py. Describe each change.
10-11. Python 3, Unicode/Text vs. Data/Bytes. Port the Unicode example,
uniCGI.py, to Python 3.
WSGI
10-12. Background. What is WSGI and what were some of the reasons
behind its creation?
10-13. Background. What are/were some of the techniques used to
get around the scalability issue of CGI?
10-14. Background. Name some well-known frameworks that are WSGI-compliant, and do some research to find some that are not.
10-15. Background. What is the difference between WSGI and CGI?
10-16. WSGI Applications. WSGI applications can be what kind(s) of
Python object(s)?
10-17. WSGI Applications. What are the two required arguments for
a WSGI application? Go into more detail about the second one.
10-18. WSGI Applications. What is (are) the possible return type(s) of
a WSGI application?
10-19. WSGI Applications. Solutions to Exercises 10-1 through 10-11
only work if/when your server processes form data in the
same manner as CGI. Choose one of them to port to WSGI,
where it will work regardless of which WSGI-compliant
server you choose, with perhaps only slight modifications.
10-20. WSGI Servers. The WSGI servers presented in Section 10.6.5
featured a sample run_wsgi_app() server function which
executes a WSGI application.
a) The run_wsgi_app() function currently does not feature
the optional third parameter exc_info. Study PEPs 333
and 3333 and add support for exc_info.
b) Create a Python 3 port of this function.
10-21. Case Study. Compare and contrast the WSGI implementations of the following Python Web frameworks: Werkzeug,
WebOb, Django, Google App Engine’s webapp.
10-22. Standards. While PEP 3333 includes clarifications and
enhancements to PEP 333 for Python 3, PEP 444 is something
else. Describe what PEP 444 is all about and how it relates to
the existing PEPs.
CHAPTER 11
Web Frameworks: Django
Python: the only language with more Web frameworks than keywords.
—Harald Armin Massa, December 2005
In this chapter...
• Introduction
• Web Frameworks
• Introduction to Django
• Projects and Apps
• Your “Hello World” Application (A Blog)
• Creating a Model to Add Database Service
• The Python Application Shell
• The Django Administration App
• Creating the Blog’s User Interface
• Improving the Output
• Working with User Input
• Forms and Model Forms
• More About Views
• *Unit Testing
• *Look-and-Feel Improvements
• *An Intermediate Django App: The TweetApprover
• Resources
Chapter 11 • Web Frameworks: Django
11.1 Introduction
In this chapter, we’ll go outside the Python Standard Library and explore
one popular Web framework for Python: Django. We’ll first go over Web
frameworks in general, and then expose you to developing applications by
using Django. This discussion starts with the basics and a “Hello World”
application, and then takes you beyond that with other areas that you’ll likely
come across when developing a real application. This roadmap essentially
defines the structure of this chapter: a solid introduction followed by an
intermediate application involving Twitter, e-mail, and OAuth, which is
an open protocol for authorization to gain access to data via application
programming interfaces (APIs).
The goal is to introduce you to a real tool that Python developers use
every day to get their jobs done. We’ll give you the skills and provide
enough knowledge for you to build more complex applications via
Django. You can also take these skills and jump to any of the other Python
Web frameworks. To get started, let’s define the topic.
11.2 Web Frameworks
We hope that you gained a greater understanding of Web development
from the material presented in Chapter 10, “Web Programming: CGI and
WSGI.” Rather than doing everything by hand, you can take advantage of
the significant body of work done by others to make your life easier. These
Web development environments are generically called Web frameworks,
and their goal is to help you to perform your job by pushing common
tasks “under the hood” and/or providing resources for you to create,
update, execute, and scale applications with a minimal amount of work.
Also, as we explained earlier, using CGI is no longer an option, due to scalability limitations. So, people in the Python community look to more powerful Web server solutions such as Apache, ligHTTPD (pronounced as
“lighty”), or nginx. Some servers, such as Pylons and CherryPy, have their
own framework ecosystem around them. However, serving content is
only one aspect of creating Web applications. You still need to worry about
ancillary tools such as a JavaScript framework, an object-relational mapper
(ORM) or lower-level database adapter, a Web templating system, and,
orthogonal but necessary for any type of development, a unit-testing and/
or continuous integration framework. Python Web frameworks are either
individual (or multiple) subcomponents or complete full-stack systems.
The term full-stack means that you can develop code for all phases and
levels of a Web application. Frameworks that are considered as such will
provide all related services, such as a Web server, database ORM, templating, and all necessary middleware hooks. Some even provide a JavaScript
library. Django is arguably one of the most well-known Web frameworks
on the market today; many consider it as Python’s answer to Ruby on
Rails. It includes all of the services mentioned above as a single, all-in-one
solution (except for a built-in JavaScript library, because you can use
whichever one you like). We’ll see in Chapter 12, “Cloud Computing:
Google App Engine,” that Google App Engine also provides many of these
components but is geared more specifically for scalability and fast request/
response Web and non-Web applications hosted by the Internet giant.
Although Django was created as a single entity by one engineering
team, not all frameworks follow this philosophy. TurboGears, for example, is a best-of-breed full-stack system, built by a scattered team of developers, serving as glue code that ties together well-known individual
components in the stack, such as ToscaWidgets (high-level Web widgets
that can utilize a variety of JavaScript frameworks, such as ExtJS, jQuery,
etc.), SQLAlchemy (ORM), Pylons (Web server), and Genshi (templating).
Frameworks that follow this architectural style provide greater flexibility
in that users can choose from a variety of templating systems, JS libraries,
tools to generate raw SQL, and multiple Web servers. You only need to
sacrifice a bit of consistency and any peace of mind that comes with using
only one tool. However, that might not be that different from what you’re
used to.
Pyramid is also very popular and is the successor to both repoze.bfg (or
“BFG” for short) and the Pylons Web frameworks. Its approach is even
simpler: it provides you with only the basics, such as URL dispatch, templating, security, and resources. If you need anything else, you must add
those capabilities yourself. Its minimalistic approach along with its strong
sense of testing and documentation, plus its inheritance of users from both
the Pylons and BFG communities, make it a strong contender in today’s set
of Web frameworks available for Python.
If you’re new to Python, you might be coming from Rails or perhaps
PHP, which has significantly expanded from its original intention as an
HTML-embedded scripting language to its own large monolithic universe.
One benefit you gain from Python is that you’re not locked to a “single
language, single framework” type of scenario. There are many frameworks out there from which to choose; hence, the quote at the beginning of
this chapter. Web framework popularity was accelerated by the creation of
the Web Server Gateway Interface (WSGI) standard, defined by PEP 333 at
http://python.org/dev/peps/pep-0333.
If you don’t already know about WSGI, it’s not really code or an API as
much as it is an interface definition that frees the Web framework developer from having to create a custom Web server for the framework, which
in turn frees application developers from having to use that server when
perhaps they would prefer something else. With WSGI, it’s easy for application developers to swap between WSGI-compliant servers (or develop
new ones) without worrying about being forced to change application
code. For more on WSGI, take a look back at Chapter 10.
I don’t know if it’s a good thing to say this (especially in print), but
when passionate Python developers become dissatisfied with the choices
out there, they’ll just come up with a new framework. After all, there are
more Web frameworks than keywords in Python, right? Other frameworks you’ll undoubtedly hear about at some point will include web2py,
web.py, Tornado, Diesel, and Zope. One good resource is the wiki page on
the Python Web site at http://wiki.python.org/moin/WebFrameworks.
Okay, enough idle chatter, let’s engage our Web development knowledge and take a look at Django.
11.3 Introduction to Django
Django bills itself as “the Web framework for perfectionists with deadlines.” It
originated in the early 2000s, created by Web developers at the online presence of the Lawrence Journal-World newspaper, which introduced it to the
world in 2005 as a way of “developing code with journalism deadlines.” We’ll
put ourselves on a deadline and see how fast we can produce a very simple
blog by using Django, and later do the same with Google App Engine.
(You’ll have to work on your perfectionist side on your own.) Although
we’re going to blast through this example, we’ll still give you enough in the
way of explanation so that you know what’s going on. However, if you
would like to explore a full treatment of this exact example, you’ll find it in
Chapter 2 of Python Web Development with Django (Addison-Wesley, 2009),
written by my esteemed colleagues, Jeff Forcier (lead developer of Fabric)
and Paul Bissex (creator of dpaste), plus yours truly.
CORE TIP: Python 3 availability forthcoming
At the time of this writing, Django is not available for Python 3, so all of the
examples in this chapter are Python 2.x only. However, because the Python 3
port currently passes all tests (at the time of this writing), a release will be
forthcoming once the documentation is ready. When this occurs, look for
Python 3 versions of the code from this chapter on the book’s Web site. I strongly
believe that Python 3 adoption will definitely experience a significant uptick
once large frameworks like Django, along with other infrastructure libraries
such as database adapters, become available on that next generation platform.
11.3.1
Installation
Before jumping into Django development, we first need to install the necessary components, which include installation of the prerequisites followed by Django itself.
Prerequisites
Before you install Django, Python must already be installed. Because you’re
more than knee-deep in a Python book, we’re going to assume that’s
already been taken care of. Also, most POSIX-compliant (Mac OS X, Linux,
*BSD) operating systems already come with Python installed. Microsoft
Windows users are typically the only ones that need to download and
install Python.
Apache is the king of Web servers, so this is what most deployments
use. The Django team recommends the mod_wsgi Apache module and provides simple instructions at
http://docs.djangoproject.com/en/dev/topics/install/#install-apache-and-mod-wsgi as well as a more comprehensive
document at http://docs.djangoproject.com/en/dev/howto/deployment/modwsgi/.
Another great document for more complex installations (those
that host multiple Django Web sites, or projects, using only one instance of
Apache) can be found at http://forum.webfaction.com/viewtopic.php?id=3646.
If you’re wondering about mod_python, it’s mostly found in older installations or part of operating system distributions before mod_wsgi became the
standard. Support for mod_python is now officially deprecated (and in fact
removed in Django 1.5).
As we close our discussion of Web servers,1 it’s good to remind you that
you don’t need to use Apache for your production server. As just mentioned there are other options, as well, with many of them lighter in memory footprint and faster; perhaps one of those might be a better fit for your
application. You can find out more about some of the possible Web server
arrangements at http://code.djangoproject.com/wiki/ServerArrangements.
Django does require a database. The standard version of Django (currently) only runs on SQL-based relational database management systems
(RDBMSs). The four main databases employed by users are PostgreSQL,
MySQL, Oracle, and SQLite. By far, the easiest to set up is SQLite. Furthermore, SQLite is the only one of the four that does not require running a
database server, so it’s also the simplest. Of course, that doesn’t make it a toy;
it performs admirably against its more well-known brethren.
Why is it easy to set up? The SQLite database adapter comes bundled in
all versions of Python, starting with version 2.5. Be aware that we’re only
talking about the adapter here. Some distributions come bundled with
SQLite, others link to the system-installed SQLite, and everyone else will
need to download and install it manually.
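If you want to check this on your own machine, the following minimal sketch asks the bundled adapter which SQLite library it is linked against and then round-trips one value through a throwaway in-memory database. It uses only the standard sqlite3 module; the function names are ours, chosen for illustration.

```python
import sqlite3


def sqlite_versions():
    """Return (adapter module name, version of the linked SQLite library)."""
    return sqlite3.__name__, sqlite3.sqlite_version


def smoke_test():
    """Round-trip one value through a throwaway in-memory database."""
    conn = sqlite3.connect(':memory:')
    try:
        cur = conn.cursor()
        cur.execute('CREATE TABLE t (n INTEGER)')
        cur.execute('INSERT INTO t (n) VALUES (?)', (42,))
        conn.commit()
        cur.execute('SELECT n FROM t')
        return cur.fetchone()[0]
    finally:
        conn.close()


if __name__ == '__main__':
    print('adapter: %s, SQLite library: %s' % sqlite_versions())
    print('round trip OK: %s' % (smoke_test() == 42))
```

If the import on the first line fails, the adapter (or the SQLite library it depends on) is missing from your Python installation.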
SQLite is just one RDBMS supported by Django, so don’t feel you’re
stuck with that, especially if your company is already using one of the
server-based databases. You can read more about Django and database
installation at http://docs.djangoproject.com/en/dev/topics/install/#database-installation.
We have also seen a recent rapid proliferation of non-relational (NoSQL)
databases. Presumably this is due to the additional scalability offered by
such systems in the face of an ever-increasing amount of data. If you’re
talking about the volume of data on the scale of Facebook, Twitter, or similar services, a relational database usually requires manual partitioning,
also known as sharding. If you wish to develop for NoSQL databases such
as MongoDB or Google App Engine’s native datastore, try Django-nonrel
so that users have the option of using either relational or non-relational
databases, as opposed to just one type. (As an FYI, Google App Engine
also has a relational [MySQL-compatible] database option, Google Cloud
SQL.)
1. A Web server is not required until deployment, so you can hold off on
this if you prefer. Django comes with a development server (which we’ll
take a look at) that aids you during the creation and testing of your
application until you’re ready to go live.
You can download Django-nonrel from http://www.allbuttonspressed.com/projects/django-nonrel followed by one of the adapters, https://github.com/FlaPer87/django-mongodb-engine (Django with MongoDB),
or http://www.allbuttonspressed.com/projects/djangoappengine (Django
on Google App Engine’s datastore). Because Django-nonrel is (at the time
of this writing) a fork of Django, you can just install it instead of a stock
Django package. The main reason for doing that is that you want to
use the same version for both development and production. As stated at
http://www.allbuttonspressed.com/projects/django-nonrel, “the modifications
to Django are minimal (maybe less than 100 lines).” Django-nonrel is available
as a Zip file, so you would just unzip it, go into the folder, and issue the
following command:
$ sudo python setup.py install
These are the same instructions as if you went to download the stock
Django tarball (see below), so you can completely skip the next subsection
(Installing Django) to the start of the tutorial.
Installing Django
There are several ways of installing Django on your system, which are
listed here in increasing order of effort and/or complexity:
• Python package manager
• Operating system package manager
• Standard release tarball
• Source code repository
The simplest download and installation process takes advantage of
Python package management tools like easy_install from Setuptools
(http://packages.python.org/distribute/easy_install.html) or pip (http://pip.openplans.org), both of which are available for all platforms. For Windows users with Setuptools, the easy_install.exe file should be installed
in the Scripts folder in which your Python distribution is located. You
only need to issue a single command; this is the command you would use
from a DOS Command window:
C:\WINDOWS\system32>easy_install django
Searching for django
Reading http://pypi.python.org/simple/django/
Reading http://www.djangoproject.com/
Best match: Django 1.2.7
Downloading http://media.djangoproject.com/releases/1.2/Django-1.2.7.tar.gz
Processing Django-1.2.7.tar.gz
. . .
Adding django 1.2.7 to easy-install.pth file
Installing django-admin.py script to c:\python27\Scripts
Installed c:\python27\lib\site-packages\django-1.2.7-py2.7.egg
Processing dependencies for django
Finished processing dependencies for django
To avoid having to type in the full path of easy_install.exe, we recommend that you add C:\Python2x\Scripts to your PATH environment variable,2 depending on which Python 2.x you have installed. If you’re on a
POSIX system, easy_install will be installed in a well-known path such
as /usr/bin or /usr/local/bin, so you don’t have to worry about adding a
new directory to your PATH, but you will probably need to use the sudo
command to install it in the typical system directories such as /usr/local.
Your command will look something like
$ sudo easy_install django
or, like this:
$ sudo pip install django
Using sudo is only necessary if you’re installing in a location for which
superuser access is required; if installing in user-land then it isn’t necessary. We also encourage you to consider “container” environments such as
virtualenv. Using virtualenv gives you the ability to have multiple installations with multiple versions of Python and/or Django, different databases, etc. Each environment runs in its own container and can be created,
managed, executed, and destroyed at your convenience. You can find out
more about virtualenv at http://pypi.python.org/pypi/virtualenv.
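When you juggle several environments, it helps to know whether the interpreter you are running belongs to one of them. The sketch below is a best-effort check using only attributes of the standard sys module; the function name is ours, and the exact attributes an environment tool sets can vary by tool and version, so treat the result as a hint.

```python
import sys


def in_virtual_environment():
    """Best-effort check for an isolated (virtualenv-style) interpreter."""
    # Older virtualenv releases record the original prefix as sys.real_prefix.
    if hasattr(sys, 'real_prefix'):
        return True
    # Newer tools expose sys.base_prefix, which differs from sys.prefix
    # inside an environment; fall back to sys.prefix if it is absent.
    return getattr(sys, 'base_prefix', sys.prefix) != sys.prefix


if __name__ == '__main__':
    print('inside an environment: %s' % in_virtual_environment())
```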
Another way to install Django is by using your operating system’s package manager, if your system has one. These are generally confined to POSIX
computers (Linux and Mac OS X). You’ll issue a command similar to the
following:
(Linux)    $ sudo COMMAND install django
(Mac OS X) $ sudo port install django
2. Windows-based PC users can modify their PATH by right-clicking My
Computer, and then selecting Properties. In the dialog box that opens,
select the Advanced tab, and then click the Environment Variables button.
For Linux, COMMAND is your distribution’s package manager, for example,
apt-get, yum, aptitude, etc. You can find instructions for installing from distributions at http://docs.djangoproject.com/en/dev/misc/distributions.
In addition to the methods just described, you can simply download
and install the original release tarball from the Django Web site. Once you
unzip it, you can run the usual installation command:
$ sudo python setup.py install
You can find more specific instructions at http://docs.djangoproject.com/en/dev/topics/install/#installing-an-official-release.
Hardcore developers might prefer to get the latest from the Subversion
source tree itself. You can find the instructions at http://docs.djangoproject.com/en/dev/topics/install/#installing-the-development-version.
Finally, here are the overall installation instructions:
http://docs.djangoproject.com/en/dev/topics/install/#install-the-django-code
The next step is to bring up a server and confirm that everything installed
properly and is working correctly. But first, let’s talk about some basic
Django concepts: projects and apps.
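Before going further, you can confirm from Python itself that the installation is importable, whichever method you chose. This is a minimal sketch: the helper name is ours, and django.get_version() is the only Django call it relies on.

```python
def installed_django_version():
    """Return Django's version string, or None if Django is not importable."""
    try:
        import django
    except ImportError:
        return None
    return django.get_version()


if __name__ == '__main__':
    version = installed_django_version()
    if version is None:
        print('Django is not installed for this interpreter')
    else:
        print('Django %s is installed' % version)
```

If you have more than one Python on your system, run this with the same interpreter you intend to use for your project.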
11.4 Projects and Apps
What are projects and apps in Django? Simply put, you can consider a
project as the set of all files necessary to create and run an entire Web site.
Within a project folder are a set of one or more subdirectories that have
specific functionality; these are called apps, although apps don’t necessarily need to be inside the project folder. Apps can be specific to the project,
or they can be reusable components that you can take from project to project. Apps are the individual subcomponents of functionality, the sum of
which form an entire Web experience. You can have apps that solicit and
manage user/reader feedback, update real-time information, process feed
data, aggregate data from other sites, etc.
One of the more well-known sets of reusable Django apps can be found
in a platform called Pinax. Such apps include (but are not limited to)
authentication (OpenID support, password management, etc.), messaging
(e-mail verification, notifications, user-to-user contact, interest groups,
threaded discussions, etc.), and more stand-alone features, such as project
management, blogging, tagging, and contact import. You can read more
about Pinax at http://pinaxproject.com.
502
Chapter 11 • Web Frameworks: Django
The concept of projects and apps makes this type of plug-n-play functionality feasible and gives the added bonus of strongly encouraging agile
design and code reuse. Okay, now that you know what projects and apps
are, let’s create a project!
11.4.1 Creating a Project in Django
Django comes with a utility called django-admin.py that can streamline
tasks such as the creation of the aforementioned project directories. On
POSIX platforms, it will usually be installed into directories such as /usr/
local/bin, /usr/bin, etc.; if you’re on a Windows-based computer, it goes
into the Scripts folder, which is directly in your Python installation folder,
e.g., C:\Python27\Scripts. For either POSIX computers or Windows computers, you should make sure that django-admin.py is in your PATH environment variable so that it can be executed from the command-line (unless
you like calling interpreters by using full pathnames).
For Windows computers, you will likely have to manually add
c:\python27 and c:\python27\scripts to your system PATH variable for
everything to work well (or whatever directory you installed Python in).
You do this by opening the Control Panel and then clicking System, or you
can right-click My Computer, and then choose Properties. From here,
select the Advanced tab, and then click the Environment Variables button.
You can choose to edit the PATH entry either for a single user (the top
listbox) or for all users (the bottom listbox), and then add ;c:\python27;c:\
python27\scripts after any text in the Variable value textbox. Some of
what you see appears in Figure 11-1.
Once your PATH is set (on either type of platform), you should be able
to run python and get an interactive interpreter, and run Django’s django-admin.py command to see its usage. You can test this by opening up a Unix
shell or DOS Command window and issuing those command names.
Once you’ve confirmed that everything is working, we can proceed.
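If a command isn’t found, it can be useful to see how a PATH lookup works. The following sketch walks the directories in PATH in order and returns the first executable match. The function name is ours, and the sketch deliberately ignores Windows details such as implied file extensions, so on a PC you should pass the full filename (for example, python.exe).

```python
import os


def find_on_path(command, path=None):
    """Return the first executable named command found on PATH, else None."""
    if path is None:
        path = os.environ.get('PATH', '')
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == '__main__':
    for name in ('python', 'django-admin.py'):
        print('%s -> %s' % (name, find_on_path(name)))
```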
The next step is to go to a directory or folder in which you want to place
your code. To create the project in the current working directory, issue the
following command (we’ll use a generic project name such as mysite, but
you can call it anything you wish):
$ django-admin.py startproject mysite
Figure 11-1 Adding Python to the Windows PATH variable.
Note that if you’re on a Windows PC, you’ll first need to open a DOS
Command window. Of course, your prompt will look more like
C:\WINDOWS\system32> instead of the POSIX dollar sign
($) or, for the old-timers, the percent symbol (%).
Now let’s take a look at the contents of the directory to see what this
command has created for you. It should look something like the following
on a POSIX computer:
$ cd mysite
$ ls -l
total 32
-rw-r--r--  1 wesley  admin     0 Dec  7 17:13 __init__.py
-rw-r--r--  1 wesley  admin   546 Dec  7 17:13 manage.py
-rw-r--r--  1 wesley  admin  4778 Dec  7 17:13 settings.py
-rw-r--r--  1 wesley  admin   482 Dec  7 17:13 urls.py
If you are developing in Windows, opening an Explorer window to that
folder will appear similar to Figure 11-2, if we had earlier created a folder
named C:\py\django with the intention of putting our project there.
Figure 11-2 The mysite folder on a Windows-based PC.
In Django, a barebones project consists of the four files, __init__.py,
manage.py, settings.py, and urls.py (you will add your applications
later). Table 11-1 explains the purpose of each file.
Table 11-1 Django Project Files
Filename       Description/Purpose
__init__.py    Specifies to Python that this is a package
urls.py        Global URL configuration (“URLconf”)
settings.py    Project-specific configuration
manage.py      Command-line interface for applications
You’ll notice that every file created by the startproject command is
Python source code—there are no .ini files, XML data, or funky configuration syntax. Django pursues a “pure Python” philosophy wherever possible. This gives you a lot of flexibility without adding complexity to the
framework as well as the ability to have your settings file import
additional settings from some other file, based on the current configuration,
or calculate a value instead of having it hardcoded. There is no barrier, it’s
just Python. We’re sure you’ve also figured out that django-admin.py is a
Python script, too. It serves as a command-line interface between you and
your project. You’ll use manage.py in a similar way to manage your apps.
(Both commands have a Help option with which you can get more information on how to use each.)
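To make the “it’s just Python” point concrete, here is a minimal sketch of the kind of logic a settings file can contain: one value calculated from the project’s location and one derived from the environment. The helper names, the MYSITE_DEBUG variable, and the DB_FILE name are all hypothetical; only DEBUG is an actual Django setting.

```python
import os


def project_path(base_dir, *parts):
    """Build an absolute path under the project directory."""
    return os.path.join(os.path.abspath(base_dir), *parts)


def debug_from_env(environ, name='MYSITE_DEBUG'):
    """Derive a boolean from an environment variable (hypothetical name)."""
    return environ.get(name, '0') == '1'


# In a settings file these would be ordinary module-level assignments, e.g.:
#   DEBUG = debug_from_env(os.environ)
#   DB_FILE = project_path(os.path.dirname(__file__), 'db', 'mysite.db')
```

Because the values are computed when the settings module is imported, the same file can behave differently on a development machine and on a server without any special configuration syntax.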
11.4.2 Running the Development Server
At this point, you haven’t created an app yet, but nonetheless, there are
some Django conveniences in place for your use. One of the handiest is
Django’s built-in Web server. It’s a server designed for the development
phase that runs on your local computer. Note that we strongly recommend
against using it for deploying public sites because it is not a production-worthy server.
Why does the development server exist? Here are some of the reasons:
1. You can use it to run your project (and apps) without requiring a full production environment just to test some code.
2. It automatically detects when you make changes to your
Python source files and reloads those modules. This saves
time and is more convenient than systems that require you to manually restart every time you edit your code.
3. The development server knows how to find and display static
media files for the Django Administration (or “admin”) application so that you can get started working with that right
away. (You will meet the admin soon. For now, just don’t get it
confused with the django-admin.py script.)
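The change detection described in the second item can be pictured as comparing file modification times between two points in time. The sketch below illustrates that general idea only; it is not Django’s actual implementation, and the function names are ours.

```python
import os


def snapshot(paths):
    """Map each existing path to its last-modified time."""
    return dict((p, os.path.getmtime(p)) for p in paths if os.path.exists(p))


def changed_files(before, after):
    """Return paths that are new or whose modification time differs."""
    return sorted(p for p in after if before.get(p) != after[p])
```

A reloader built on this idea would take a snapshot of the project’s source files, sleep briefly, take another, and restart whenever changed_files() returns a non-empty list.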
Running the development server is as simple as issuing the following
single command from your project’s manage.py utility:
(POSIX) $ python ./manage.py runserver
(PCs)   C:\py\django\mysite> python manage.py runserver
If you’re using a POSIX system and assign your script execute permission,
that is, $ chmod 755 manage.py, you won’t need to explicitly call python, for
example, $ ./manage.py runserver. The same is true in a DOS Command
window, if Python is correctly installed in your Windows registry.
Once the server has started, you should see output similar to that in the
following example (Windows uses a different quit key combination):
Validating models...
0 errors found.
Django version 1.2, using settings 'mysite.settings'
Development server is running at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
Open that link (http://127.0.0.1:8000/ or http://localhost:8000/) in
your browser, and you should see Django’s “It worked!” screen, as shown
in Figure 11-3.
Figure 11-3 Django’s initial “It worked!” screen.
Note that if you want to run your server on a different port, you can
specify that on the command-line. For example, if you want to run it on
port 8080, instead, issue this command: $ python ./manage.py runserver
8080. You can find all of the runserver options at http://docs.djangoproject.com/en/dev/ref/django-admin/#django-admin-runserver.
If you’re seeing the “It worked!” screen in Figure 11-3, then everything
is in great shape. Meanwhile, if you look in your terminal session, you’ll
see that the development server has logged your GET request:
[11/Dec/2010 14:15:51] "GET / HTTP/1.1" 200 2051
The four sections of the log line are, from left to right, the timestamp,
request, HTTP response code, and byte count (yours might be slightly different). The “It worked!” page is Django’s friendly way of telling you that
the development server is working, and that you can create applications
now. If your server isn’t working at this point, retrace your steps. Be ruthless! It’s probably easier to delete your entire project and start from scratch
than it is to debug at this point.
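To see the four sections of that log line spelled out, the following sketch pulls them apart with a regular expression. The pattern is our own and assumes the layout shown in the sample line above; it is not something Django provides.

```python
import re

# timestamp in brackets, quoted request, 3-digit status, byte count
LOG_LINE = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<size>\d+)'
)


def parse_log_line(line):
    """Split a development-server log line into its four sections."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields['status'] = int(fields['status'])
    fields['size'] = int(fields['size'])
    return fields


if __name__ == '__main__':
    sample = '[11/Dec/2010 14:15:51] "GET / HTTP/1.1" 200 2051'
    print(parse_log_line(sample))
```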
When the server is running successfully, we can move on to setting up
your first Django application.
11.5 Your “Hello World” Application (A Blog)
Now that we have a project, we can create apps within it. To create our
blog application, use manage.py again:
$ ./manage.py startapp blog
As with your project, you can call your application blog as we did or
anything else that you prefer. It’s just as simple as starting a project. Now
we have a blog directory inside our project directory. Here’s what’s in
it, first in POSIX format, then in a screenshot of the folder in Windows
(Figure 11-4):
$ ls -l blog
total 24
-rw-r--r--  1 wesley  admin    0 Dec  8 18:08 __init__.py
-rw-r--r--  1 wesley  admin  175 Dec 10 18:30 models.py
-rw-r--r--  1 wesley  admin  514 Dec  8 18:08 tests.py
-rw-r--r--  1 wesley  admin   26 Dec  8 18:08 views.py
Figure 11-4 The blog folder on a Windows-based PC.
Descriptions of the app-level files are given in Table 11-2.
Table 11-2 Django App Files
Filename       Description/Purpose
__init__.py    Specifies to Python that this is a package
urls.py        The app’s URL configuration (“URLconf”); unlike the project URLconf, this isn’t created automatically (which is why it’s missing from the listing above)
models.py      Data models
views.py       View functions (think “controllers”)
tests.py       Unit tests
As with your project, your app is a Python package, too, but in this case,
the models.py and views.py files have no real code in them (yet); they’re
merely placeholders for you to put your stuff into. The unit tests that go
into tests.py haven’t been written yet and are waiting for your input
there, as well. Similarly, even though you can use your project’s URLconf
to direct all the traffic, one for a local app isn’t automatically created for
you. You’ll need to do it yourself, and then use the include() directive
from the project’s URLconf to have requests routed to an app’s URLconf.
To inform Django that this new app is part of your project, you need to
edit settings.py (which we can also refer to as your settings file). Open it
in your editor and find the INSTALLED_APPS tuple near the bottom. Add
your app name (blog) as a member of that tuple (usually toward the bottom), so that it looks like this:
INSTALLED_APPS = (
    . . .
    'blog',
)
Although it isn’t necessary, we add a trailing comma after the last entry so that
appending more entries to this tuple later doesn’t require adding one first. Django uses
INSTALLED_APPS to determine the configuration of various parts of the system, including the automatic administration application and the testing
framework.
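The trailing comma is optional in a tuple that already has several members, but it is what makes a single-member tuple a tuple at all, which is a good reason to get into the habit. The short sketch below shows the difference in plain Python; the add_app() helper is ours, for illustration only.

```python
without_comma = ('blog')    # just a parenthesized string
with_comma = ('blog',)      # a one-element tuple


def add_app(installed_apps, app_name):
    """Return a new tuple with app_name appended (tuples are immutable)."""
    return installed_apps + (app_name,)


if __name__ == '__main__':
    print(type(without_comma).__name__)
    print(type(with_comma).__name__)
    print(add_app(('django.contrib.auth',), 'blog'))
```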
11.6 Creating a Model to Add Database Service
We’ve now arrived at the core of your Django-based blog application: the
models.py file. This is where we’ll define the data structures of the blog.
Following the principle of Don't Repeat Yourself (DRY), Django gets a lot of
mileage out of the model information you provide for your application.
Let’s create a basic model and then see all the stuff Django does for us
using that information.
The data model represents the type of data that will be stored per
record in the database. Django provides a variety of fields to help you map
your data into your app. We’ll use three different field types in our app
(see the code sample that follows).
Open models.py in your editor and add the following model class
directly after the import statement already present in the file:
# models.py
from django.db import models

class BlogPost(models.Model):
    title = models.CharField(max_length=150)
    body = models.TextField()
    timestamp = models.DateTimeField()
That’s a complete model, representing a “blog post” object with three
fields. (To be accurate, it has four fields; by default, Django automatically creates an
auto-incrementing, unique ID field for each model.) You can
see that our newly minted class, BlogPost, is a subclass of django.db.models.Model.
That’s Django’s standard base class for data models, which is the
core of Django’s powerful ORM. The fields are defined like regular class
attributes, with each one being an instance of a particular field class, and
each instance of the model class corresponds to a single database record.
For our app, we chose the CharField for the blog post title, limiting the
field to a maximum length. A CharField is appropriate for short, single
lines of text. For larger chunks of text, such as the body of a blog post, we
picked the TextField type. Finally, the timestamp is a DateTimeField. A
DateTimeField is represented by a Python datetime.datetime object.
Those field classes are also defined in django.db.models, and there are
many more types than the three we’re using here, from BooleanField to
XMLField. For a comprehensive list of all that are available, read the official
documentation at http://docs.djangoproject.com/en/dev/ref/models/fields/#field-types.
11.6.1 Setting Up the Database
If you don’t have a database server installed and running, we recommend
SQLite as the easiest way to get going. It’s fast, widely available, and stores
its database as a single file in the file system. Access controls are simply
file permissions. If you do have a database server—MySQL, PostgreSQL,
Oracle—and want to use it rather than SQLite, then use your database’s
administration tools to create a new database for your Django project. In
the examples here, our database is called mysite.db, but you can call it
whatever you like.
Using MySQL
With your (empty) database in place, all that remains is to instruct Django
on how to use it. This is where your project’s settings.py file comes in
(again). There are six potentially relevant settings here (though you might
need only two): ENGINE, NAME, HOST, PORT, USER, and PASSWORD. Their names
render their respective purposes pretty obvious. Just plug in the correct
values corresponding to the database server you’ll be using with Django.
For example, settings for MySQL will look something like the following:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'testdb',
        'USER': 'wesley',
        'PASSWORD': 's3Cr3T',
        'HOST': '',
        'PORT': '',
    }
}
(Note that if you’re using an older version of Django, then instead of
everything being in a single dictionary, you’ll find these as stand-alone,
module-level variables.)
We left PORT blank because it’s only needed if your database
server is running on a non-standard port. For example, MySQL’s server
uses port 3306 by default. Unless you’ve changed the setup, you don’t
need to specify PORT. HOST was left blank to indicate that the database
server is running on the same computer that runs our application. Be
sure that you’ve already executed CREATE DATABASE testdb (or whatever
you named your database) and that the user (and its password) already
exist before you continue with Django. Setting up PostgreSQL is more like
setting up MySQL than setting up Oracle is.
For details on setting up new databases, users, and your settings, see the
Django documentation at http://docs.djangoproject.com/en/dev/intro/tutorial01/#database-setup and http://docs.djangoproject.com/en/dev/ref/settings/#std:setting-DATABASES as well as Appendix B of Python Web
Development with Django, if you have the book.
Using SQLite
SQLite is a popular choice for testing. It’s even a good candidate for
deployment in scenarios for which there isn’t a great deal of simultaneous
writing going on. No host, port, user, or password information is needed
because SQLite uses the local file system for storage and the native file
system permissions for access control—you can also choose a pure in-memory database. This is why our DATABASES configuration in settings.py
shown in the following code only has ENGINE and NAME when directing
Django to use your SQLite database.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': '/tmp/mysite.db', # use full pathname to avoid confusion
    }
}
When using SQLite with a real Web server like Apache, you’ll need to
ensure that the account that owns the Web server process has write access
both for the database file itself and the directory containing that database
file. When working with the development server as we are here, permissions are typically not an issue because the user running the development
server (you) also owns the project files and directories.
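If you want to check those two permissions before deploying, the following sketch tests write access to both the database file and its containing directory. The function name is ours. Note that os.access() answers for the user running the script, so to learn what the Web server account can do, you would need to run it as that account.

```python
import os


def sqlite_write_access(db_path):
    """Return (directory_writable, file_writable) for the current user."""
    directory = os.path.dirname(os.path.abspath(db_path))
    dir_ok = os.access(directory, os.W_OK)
    if os.path.exists(db_path):
        file_ok = os.access(db_path, os.W_OK)
    else:
        # The file does not exist yet; it can only be created
        # if the containing directory is writable.
        file_ok = dir_ok
    return dir_ok, file_ok


if __name__ == '__main__':
    # Path taken from the settings example above; adjust to match yours.
    print(sqlite_write_access('/tmp/mysite.db'))
```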
SQLite is also one of the most popular choices on Windows-based PCs
because it comes included with the Python distribution (starting with version 2.5). Given that we have already created a C:\py\django folder with
our project (and application), let’s create a db directory, as well, and specify
the name of the database file that will be created later:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': r'C:\py\django\db\mysite.db', # full pathname
    }
}
If you’ve been working with Python for some time, you’re probably
aware that the r before the pathname designates that this is a Python raw
string. This just means to take each string character verbatim and to not
translate special characters, meaning that “\n” should be interpreted as a
backslash (\) followed by the letter “n” instead of a single NEWLINE
character. DOS file pathnames and regular expressions are two of the most
common use cases for Python raw strings because they often include the
backslash character, which in Python is a special escape character. See the
section on strings in the Sequences chapter of Core Python Programming or
Core Python Language Fundamentals for more details.
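A small experiment makes the difference visible. The sketch below writes the same (hypothetical) pathname three ways: as a raw string, with doubled backslashes, and with neither, which silently turns two of its character pairs into a TAB and a NEWLINE.

```python
raw = r'C:\temp\new'        # backslashes kept verbatim
escaped = 'C:\\temp\\new'   # same string, each backslash doubled
mangled = 'C:\temp\new'     # \t becomes a TAB, \n becomes a NEWLINE

if __name__ == '__main__':
    print(raw == escaped)
    print(raw == mangled)
    print(repr(mangled))
```

The raw and escaped forms compare equal; the third does not, and its repr() shows the special characters that crept in.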
11.6.2 Creating the Tables
Now we need to instruct Django to use the connection information you’ve
given it to connect to the database and set up the tables that your application needs. You’ll use manage.py and its syncdb command, as demonstrated in the following sample execution:
$ ./manage.py syncdb
Creating tables ...
Creating table auth_permission
Creating table auth_group_permissions
Creating table auth_group
Creating table auth_user_user_permissions
Creating table auth_user_groups
Creating table auth_user
Creating table auth_message
Creating table django_content_type
Creating table django_session
Creating table django_site
Creating table blog_blogpost
When you issue the syncdb command, Django looks for a models.py file
in each of your INSTALLED_APPS. For each model it finds, it creates a database table. (There are exceptions to this rule but it’s true for the most part.)
If you are using SQLite, you will also notice that the mysite.db database
file is created exactly where you specified in your settings.
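If you are using SQLite and are curious, you can look inside that file without Django at all. The sketch below lists the tables in a SQLite database by querying its built-in sqlite_master catalog; the function name is ours.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name")
    return [row[0] for row in cur.fetchall()]
```

To try it, open a connection with sqlite3.connect() using the same pathname you put in your settings file and pass it to list_tables(). The names returned should correspond to the “Creating table” lines in the syncdb output, including blog_blogpost (the app name, an underscore, and the lowercased model name).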
The other items in INSTALLED_APPS—the items that were there by
default—all have models, too. The output from manage.py syncdb confirms
this; you can see Django is creating one or more tables for each of those
apps. That’s not all the output from the syncdb command, though. There
are also some interactive queries related to the django.contrib.auth app
(see the following example). We recommend you create a superuser,
because we’ll need one soon. Here’s how this process works from the tail
end of the syncdb command:
You just installed Django's auth system, which means you don't have
any superusers defined.
Would you like to create one now? (yes/no): yes
Username (Leave blank to use 'wesley'):
E-mail address: ****@****.com
Password:
Password (again):
Superuser created successfully.
Installing custom SQL ...
Installing indexes ...
No fixtures found.
Now you have one superuser (hopefully yourself) in the auth system.
This will come in handy in a moment, when we add in Django’s automatic
admin application.
Finally, the setup process wraps up with a line relating to a feature
called fixtures, which represent serialized, pre-existing contents of a database. You can use fixtures to pre-load this type of data in any newly created applications. Your initial database setup is now complete. The next
time you run the syncdb command on this project (which you’ll do any
time you add an application or model), you’ll see a bit less output, because
it doesn’t need to set up any of those tables a second time or prompt you to
create a superuser.
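To give you an idea of what a fixture looks like, the following sketch builds one in the JSON layout that Django’s serializer uses: a list of records, each naming its model, its primary key, and its field values. The helper name and the sample data are ours, and we are assuming the default JSON serialization format here, so consult the Django documentation before relying on the details.

```python
import json


def make_fixture(posts):
    """Serialize (title, body, timestamp) tuples as a JSON fixture string."""
    records = []
    for pk, (title, body, timestamp) in enumerate(posts, 1):
        records.append({
            'model': 'blog.blogpost',   # app label + lowercased model name
            'pk': pk,
            'fields': {'title': title, 'body': body, 'timestamp': timestamp},
        })
    return json.dumps(records, indent=2)


if __name__ == '__main__':
    print(make_fixture([('First post', 'Hello, world!', '2010-12-08 18:08:00')]))
```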
At this point we’ve completed the data model portion of our app. It’s
ready to accept user input; however, we don’t have any way of doing this,
yet. If you subscribe to the model-view-controller (MVC) pattern of Web
application design, you’ll recognize that only the model is done. There is
no view (user-facing HTML, templating, etc.) or controller (application
logic) yet.
CORE TIP: MVC vs. MTV
The Django community uses an alternate representation of the MVC pattern. In
Django, it’s called model-template-view or MTV. The data model remains the
same, but the view is known as the template in Django because templates are
used to define what the users see. Finally, the “view” in Django represents
view functions, the sum of which form all of the logic of the controller. It’s all
the same, but just a different interpretation of the roles. To read more about
Django’s philosophy with regard to this matter, check out the FAQ answer at
http://docs.djangoproject.com/en/dev/faq/general/#django-appears-to-be-a-mvc-framework-but-you-call-the-controller-the-view-and-the-view-the-template-how-come-you-don-t-use-the-standard-names.
11.7 The Python Application Shell
Python programmers know how useful the interactive interpreter is. The
creators of Django know this as well, and have integrated it to aid in
everyday Django development. In these subsections, we’ll explore how to
use the Python shell to perform low-level data introspection and manipulation when such things are not so easily accomplished with Web application development.
11.7.1 Using the Python Shell in Django
Even without the template (view) or view (controller), we can still test out
our data model by adding some BlogPost entries. If your app is backed by
an RDBMS, as most Django apps are, you would be adding rows to a table
per blog entry. If you end up using a NoSQL database such as MongoDB
or Google App Engine’s datastore, you would be adding objects, documents, or entities into the database, instead.
How do we do this? Django provides a Python application shell that
you can use to instantiate your models and otherwise interact with your
app. Python users will recognize the familiar interactive interpreter startup and prompt when using the shell command of the manage.py script:
$ python2.5 ./manage.py shell
Python 2.5.1 (r251:54863, Feb 9 2009, 18:49:36)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>>
The difference between this Django shell and the standard Python interactive interpreter is that the Django shell offers everything the latter does and is also
aware of your Django project’s environment. You can interact with your
view functions and your data models because the shell automatically sets
up the environment, including your sys.path, giving it access to
the modules and packages in both Django and your project that you
would otherwise need to configure manually. In addition to the standard
shell, there are a couple of alternative interactive interpreters that you can
consider, some of which we cover in Chapter 1 of Core Python Programming
or Core Python Language Fundamentals.
Rich shells such as IPython and bpython are actually preferred by
Django because they provide extremely useful functionality on top of the
vanilla interpreter. When you run the shell command, Django searches
first for a rich shell, employing the first one it finds or reverting to the
standard interpreter if none are available.
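The “first one found, otherwise the standard interpreter” behavior can be pictured as a chain of import attempts. The sketch below illustrates that fallback idea only; it is not Django’s actual code, and the function name and candidate list are our own assumptions.

```python
def pick_shell(candidates=('IPython', 'bpython')):
    """Return the first importable rich shell, or 'python' if none is."""
    for name in candidates:
        try:
            __import__(name)
        except ImportError:
            continue
        return name
    return 'python'


if __name__ == '__main__':
    print('shell that would be used: %s' % pick_shell())
```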
In the previous example, we used a Python 2.5 interpreter without a
rich shell, which is why the standard interpreter came up. When
we execute manage.py shell with an installation in which a rich shell (IPython) is available, it comes
up instead:
$ ./manage.py shell
Python 2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)] on darwin
Type "copyright", "credits" or "license" for more information.
IPython 0.10.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]:
You can also use the --plain option to force a vanilla interpreter:
$ ./manage.py shell --plain
Python 2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>>
Note that having a rich shell or not has nothing to do with the version of
Python you have installed, as in the preceding example; it just so happens
I have IPython available only for the version 2.7 installation on my computer but not for version 2.5.
If you want to install a rich shell, just use easy_install or pip, as
explained earlier when we described the different methods for installing
Django. Here’s what it looks like for Windows PC users to install IPython
on their system:
C:\WINDOWS\system32>\python27\Scripts\easy_install ipython
Searching for ipython
Reading http://pypi.python.org/simple/ipython/
Reading http://ipython.scipy.org
Reading http://ipython.scipy.org/dist/0.10
Reading http://ipython.scipy.org/dist/0.9.1
. . .
Installing ipengine-script.py script to c:\python27\Scripts
Installing ipengine.exe script to c:\python27\Scripts
Installed c:\python27\lib\site-packages\ipython-0.10.1-py2.7.egg
Processing dependencies for ipython
Finished processing dependencies for ipython
11.7.2 Experimenting with Our Data Model
Now that we know how to start a Python shell, let’s play around with our
application and its data model by starting IPython and giving a few
Python or IPython commands:
In [1]: from datetime import datetime
In [2]: from blog.models import BlogPost
In [3]: BlogPost.objects.all() # no objects saved yet!
Out[3]: []
In [4]: bp = BlogPost(title='test cmd-line entry', body='''
   ....: yo, my 1st blog post...
   ....: it's even multilined!''',
   ....: timestamp=datetime.now())
In [5]: bp
Out[5]: <BlogPost: BlogPost object>
In [6]: bp.save()
In [7]: BlogPost.objects.count()
Out[7]: 1
In [8]: exec _i3 # repeat cmd #3; should have 1 object now
Out[8]: [<BlogPost: BlogPost object>]
In [9]: bp = BlogPost.objects.all()[0]
In [10]: print bp.title
test cmd-line entry
In [11]: print bp.body # yes an extra \n in front, see above
yo, my 1st blog post...
it's even multilined!
In [12]: bp.timestamp.ctime()
Out[12]: 'Sat Dec 11 16:38:37 2010'
The first couple of commands just bring in the objects we need. Step #3
queries the database for BlogPost objects, of which there are none, so in
step #4, we add the first one to our database by instantiating a BlogPost
object, passing in its attributes that were defined earlier (title, body, and
timestamp). Once our object is created, we need to write it to the database
(step #6) with the BlogPost.save() method.
When that’s done, we can confirm the object count in the database has
gone from 0 to 1 by using the BlogPost.objects.count() method (step #7). In
step #8, we take advantage of an IPython feature (exec _i3 reruns input #3) to repeat step #3 to
get a list of all the BlogPost objects stored in the database—we could have
just retyped BlogPost.objects.all(), but we wanted to demonstrate a
rich shell feature. The last steps involve grabbing the first (and only) element of the list of all BlogPost objects (step #9) and dumping out all the
data to show that we were able to successfully retrieve the data we just
stored moments ago.
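If your app is backed by an RDBMS, each of these calls ultimately turns into SQL. The sketch below imitates the session with the standard library's sqlite3 module and an in-memory database. The table name follows Django's default app_model naming, but the column types are simplified assumptions, and the SQL that Django really generates will differ:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE blog_blogpost ('
             'id INTEGER PRIMARY KEY, title TEXT, body TEXT, timestamp TEXT)')

def count_posts():
    # Roughly what BlogPost.objects.count() asks the database.
    return conn.execute('SELECT COUNT(*) FROM blog_blogpost').fetchone()[0]

before = count_posts()
# Roughly what bp.save() does for a brand-new object: insert one row.
conn.execute(
    'INSERT INTO blog_blogpost (title, body, timestamp) VALUES (?, ?, ?)',
    ('test cmd-line entry', 'yo, my 1st blog post...',
     datetime.now().isoformat()))
after = count_posts()
# Roughly what BlogPost.objects.all()[0].title retrieves.
first_title = conn.execute(
    'SELECT title FROM blog_blogpost ORDER BY id').fetchone()[0]
```

Here before is 0 and after is 1, mirroring steps #3 and #7 of the session.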
The preceding is just a sampling of what you can do with an interactive
interpreter tied to your app. You can read more about the shell’s features at
http://docs.djangoproject.com/en/dev/intro/tutorial01/#playing-with-the-api. These Python shells are great developer tools. In addition to the standard command-line tool that comes bundled with Python, you’ll find them
incorporated into integrated development environments (IDEs) as well as
augmented with even more functionality in third-party interactive interpreters such as IPython and bpython.
Almost all users and many developers prefer a web-based create, read,
update, delete (CRUD) tool instead, and this is true for every web app
that’s developed. But do developers really want to create such an administration Web console for every single app they create? Seems like you’d
always want to have one, and that’s where the Django admin app comes in.
11.8 The Django Administration App
The automatic back-end administration application, or admin for short, has
been described as Django’s crown jewel. For anyone who has tired of creating simple CRUD interfaces for Web applications, it’s a godsend. Admin
is an app that every Web site needs. Why? Well, you might want to confirm your app’s ability to insert a new record as well as update or delete it.
You understand that, but if your app hasn’t been completed yet, that
makes this a bit more difficult. The admin app solves this problem for you
by giving developers the ability to validate their data manipulation code
before the full UI has been completed.
11.8.1 Setting Up the Admin
Although the admin app comes free with Django, it’s still optional, so
you’ll need to explicitly enable it by specifying this in your configuration
settings, just like you did with your own blog application. Open settings.py
and let’s zoom down to the INSTALLED_APPS tuple again. You added
'blog', earlier, but you probably overlooked the four lines right above it:
INSTALLED_APPS = (
    . . .
    # Uncomment the next line to enable the admin:
    # 'django.contrib.admin',
    # Uncomment the next line to enable admin documentation:
    # 'django.contrib.admindocs',
    'blog',
)
The one we care about is the first commented-out entry, 'django.contrib.admin'.
Remove the hash character (#), a.k.a. the octothorpe,
pound sign, or comment symbol, at the beginning of the line to enable it.
The second one is optional, representing the Django admin documentation generator. The admindocs app auto-generates documents for your
project by extracting Python documentation strings (“docstrings”) and
makes those available to the admin. If you want to enable it, that’s fine, but
we won’t be using it in our example here.
Every time you add a new application to your project, you should perform a syncdb to ensure that the tables it needs have been created in your
database. Here we can see that adding the admin app to INSTALLED_APPS
and running syncdb triggers the creation of one more table in our database:
$ ./manage.py syncdb
Creating tables ...
Creating table django_admin_log
Installing custom SQL ...
Installing indexes ...
No fixtures found.
Now that the app is set up, all we need to do is give it a URL so that we
can get to it. In the automatically generated (project) urls.py, you’ll notice
these lines near the top:
# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()
You’ll also see this 2-tuple commented out near the bottom of the
urlpatterns global variable:
# Uncomment the next line to enable the admin:
# (r'^admin/', include(admin.site.urls)),
Uncomment all three real lines of code and save the file. You’ve just
directed Django to load up the default admin site when visitors to the Web
site hit the URL http://localhost:8000/admin.
Finally, your applications need to specify to Django which models
should show up for editing in the admin screens. To do so, you simply
need to register your BlogPost model with it. Create blog/admin.py with
the following lines:
# admin.py
from django.contrib import admin
from blog import models
admin.site.register(models.BlogPost)
The first two lines import the admin and our data model(s). They are
followed by the line that registers our BlogPost class with the admin. This
enables the admin to manage objects of this type in the database (in addition to the others already registered).
11.8.2 Trying Out the Admin
Now that we’ve registered our model with the admin, let’s take it out for a
spin. Issue the manage.py runserver command again, and then go to the
same link as earlier (either http://127.0.0.1:8000 or http://localhost:8000).
What do you get? Hopefully, you actually get an error. Specifically, you
should get a 404 error that looks similar to the one depicted in Figure 11-5.
Figure 11-5 The 404 error screen.
Why do you get this error? It’s because you haven’t defined an action for
the '/' URL yet. The only one that you’ve enabled for your app is /admin, so
you need to go directly to that URL instead; that is, you need to go to
http://127.0.0.1:8000/admin or http://localhost:8000/admin, or just add /admin to the
existing path in your browser.
In fact, if you look carefully at the error screen, Django itself informs
you that only /admin is available because it tries them all before it gives up.
Note that the “It Worked!” page is a special case for which you have no
URLs set for your app. (If it weren’t for that special case, you would’ve
received a 404 error, as well.)
When you do arrive at the admin safely, you’ll be prompted to log in
with a nice, friendly screen, as shown in Figure 11-6.
Type in the superuser username and password that you created earlier.
Once you’ve logged in, you’ll see the admin home page, as shown in
Figure 11-7.
What you’ll see is the set of all classes that have registered with the
admin app. Because the admin allows you to manipulate all of these
classes which live in the database, including Users, this means that you
can add standard, “staff,” or other superusers (and from a friendly Web
interface, not a command-line or a shell environment).
Figure 11-6 The admin login screen.
Figure 11-7 The admin home page.
CORE TIP: My class isn’t there!
Sometimes, your class might not appear in the list. The three most common
causes of “my app’s data doesn’t show up in the admin” issues are:
1. Forgetting to register your model class with admin.site.register()
2. Errors in the app’s models.py file
3. Forgetting to add the app to the INSTALLED_APPS tuple in your settings.py file
Now, let’s explore the real power of the admin: the ability to manipulate
your data. If you click the “Blog posts” link, you’ll go to a page listing all
of the BlogPost objects in the database (see Figure 11-8)—so far, we only
have the one that we entered from the shell, earlier.
Figure 11-8 Our solitary BlogPost object.
Notice in the figure that it’s identified with a very generic tag of “BlogPost object.” Why is the post given such an awkward name? Django is
designed to flexibly handle an infinite variety of content types, so it
doesn’t take guesses about what field might be the best handle for a given
piece of content. As a result, it’s direct and not so interesting.
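In plain Python terms, a generic label like this is simply a fallback string method that knows nothing about your fields. The sketch below uses hypothetical stand-in classes (ModelSketch is not django.db.models.Model) to show the fallback and how a class that knows its best handle could supply its own; real models of this era would do so through a method such as __unicode__() under Python 2:

```python
class ModelSketch(object):
    """Hypothetical stand-in for a model base class."""
    def __str__(self):
        # Generic fallback label: the class name plus the word "object".
        return '%s object' % self.__class__.__name__

class BlogPost(ModelSketch):
    def __init__(self, title):
        self.title = title

class TitledBlogPost(BlogPost):
    def __str__(self):
        # A class that knows its best "handle" can provide its own label.
        return self.title

generic_label = str(BlogPost('test cmd-line entry'))
custom_label = str(TitledBlogPost('test cmd-line entry'))
```

The first label comes out as the familiar “BlogPost object,” the second as the post's title.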
Because you are fairly certain that this post represents the data you
entered earlier, and you’re not going to confuse this entry with other BlogPost objects, no additional information about this object is needed. Go
ahead and click it to enter the edit screen shown in Figure 11-9.
Figure 11-9 Web view of our command-line BlogPost entry.
Feel free to make any changes you desire (or none at all), and then click
Save and add another so that we can experiment with adding an entry
from a Web form (instead of from the shell). Figure 11-10 illustrates how
the form is identical to that in which you edited the previous post a
moment ago.
Figure 11-10 With the previous post saved, we’re ready to add a new one.
What’s a new BlogPost without content? Give your post a title and some
scintillating content, perhaps similar to what you see in Figure 11-11. For
the timestamp, you can click the Today and Now shortcut links to fill in
the current date and time. You can also click the calendar and clock icons
to pull up handy date and time pickers. When you’re done writing your
masterpiece, click the Save button.
After your post has been saved to the database, a screen pops up that
displays a confirmation message (The blog post “BlogPost object” was
added successfully.) along with a list of all your blog posts, as shown in
Figure 11-12.
Note that this output has not improved any—in fact, it has become
worse because we now have two BlogPost objects, but there’s no way to
distinguish between them. You just aren’t going to feel satisfied seeing all
the entries generically labeled as “BlogPost object.” You’re certainly not
alone if you’re thinking, “There has got to be a way to make it look more
useful!” Well, Django gives you the power to do just that.
Figure 11-11 Adding a new post directly from the admin.
Earlier, we enabled the admin tool with the bare minimum configuration, namely registering our model with the admin app all by itself. However, with an extra two lines of code and a modification of the registration
call, we can make the presentation of the listing much nicer and more useful.
Update your blog/admin.py file with a new BlogPostAdmin class, and add
it to the registration line so that it now looks like this:
# admin.py
from django.contrib import admin
from blog import models

class BlogPostAdmin(admin.ModelAdmin):
    list_display = ('title', 'timestamp')

admin.site.register(models.BlogPost, BlogPostAdmin)
Note that because we define BlogPostAdmin here in blog/admin.py, we do not reference it
as an attribute of our blog/models.py module; that is, we don’t register
models.BlogPostAdmin. If you refresh the admin page for BlogPost objects
(see Figure 11-13), you will now see much more useful output, based on
the new list_display variable you added to your BlogPostAdmin class.
The image in Figure 11-13 must seem like a breath of fresh air as we’re
no longer looking at a pair of generically labeled BlogPost objects. If you’re new to
Django, it might surprise you that adding two lines and editing a third is
all it takes to change the output to something much more relevant.
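Conceptually, list_display is just a tuple of attribute names to pull from each object when building the listing. Here is a minimal sketch of that idea in plain Python; the Post stand-in, the function name change_list_rows, and the second post's body and timestamp are made up for illustration, and this is not how the admin is actually implemented:

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for saved BlogPost objects.
Post = namedtuple('Post', 'title body timestamp')

def change_list_rows(objects, list_display):
    """Build one row per object, one cell per name in list_display."""
    return [tuple(getattr(obj, name) for name in list_display)
            for obj in objects]

posts = [
    Post('test cmd-line entry', 'yo, my 1st blog post...',
         datetime(2010, 12, 11, 16, 38, 37)),
    Post('test admin entry', 'placeholder body',
         datetime(2010, 12, 13, 0, 13, 1)),  # placeholder timestamp
]
rows = change_list_rows(posts, ('title', 'timestamp'))
```

Each row now carries a title and a timestamp, which is exactly the information that makes the two posts distinguishable.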
Figure 11-12 The new BlogPost has been saved. Now we have a pair of posts, but there’s no way to tell them apart.
Figure 11-13 Much better!
Try clicking the Title and Timestamp column headers that have
appeared; each one affects how your items are sorted. For example, click
the Title column header once to sort in ascending order by title; click it a second time to change to descending order. Also try sorting by timestamp.
Yes, these features are already built into the admin! You didn’t have
to roll your own like in the good ol’ days.
The admin has many other useful features that can be activated with
just a line or two of code: searching, custom ordering, filters, and more.
We’ve barely touched the features in the admin, but hopefully, we’ve
given you enough of a taste to whet your appetite.
11.9 Creating the Blog’s User Interface
Everything that we have just done was strictly for you, the developer,
right? Users of your app will not be using the Django shell and probably
not the admin tool either. We now need to build the public-facing side of
your app. From Django’s perspective, a Web page has the following three
typical components:
• A template that displays information passed to it (via a Python
dictionary-like object).
• A view function or “view” that performs the core logic for a
request. It will likely fetch (and format) the information to be
displayed, typically from a database.
• A URL pattern that matches an incoming request with the
corresponding view, optionally passing parameters to the
view, as well.
When you think about it, you can see how when Django processes a
request, it processes the request bottom-up: it starts by finding the matching URL pattern. It then calls the corresponding view function which then
returns the data rendered into a template back to the user.
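That flow can be imitated in a few lines of plain Python. The following is only a toy with made-up names (handle, a string standing in for a template file); it is meant to show the order of events, not Django's real machinery:

```python
import re

def archive(request_path):
    """Stand-in view: gather data, then render it through a 'template'."""
    template = '<h2>%(title)s</h2>'             # stand-in for a template file
    context = {'title': 'test cmd-line entry'}  # data the view collected
    return template % context

# Stand-in URLconf: (regular expression, view) pairs, tried in order.
urlpatterns = [(r'^blog/$', archive)]

def handle(path):
    for regex, view in urlpatterns:
        if re.search(regex, path):  # 1. find the matching URL pattern
            return view(path)       # 2. call its view, which renders the template
    return '404 Not Found'          # no pattern matched
```

A request for blog/ reaches archive(), while any other path falls through to the 404 case.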
We’re going to build our app in a slightly different order:
1. A basic template comes first because we need to be able to see
stuff.
2. Design a quick URL pattern so that Django can access our app
right away.
3. Prototype and then iterate as we develop the view function.
The main reason for this order is that your template and URL pattern
aren’t going to change very much. The heart and soul of your application
will be in the view, so we want to employ an agile way of building it. By
creating the view a few steps at a time, we’re more in line with the test-driven
development (TDD) model.
11.9.1 Creating a Template
Django’s template language is easy enough to read that we can jump right
in to example code. This is a simple template for displaying a single blog
post (based on the attributes of our BlogPost object):
<h2>{{ post.title }}</h2>
<p>{{ post.timestamp }}</p>
<p>{{ post.body }}</p>
You probably noticed that it’s just HTML (though Django templates
can be used for any kind of textual output) plus special tags in curly
braces: {{ ... }}. These tags are called variable tags. They display the contents of the object within the braces. Inside a variable tag, you can use
Python-style dot-notation to access attributes of these variables. The values can be pure data or callables—if they’re the latter, they will automatically be called without requiring you to include “()” to indicate a
function/method call.
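To make the dot-notation and auto-calling rules concrete, here is a simplified sketch of resolving a dotted name against a context. The resolve() function and the Post class are hypothetical, and the real template engine handles more cases than this (list indexes and missing names, for example):

```python
def resolve(expression, context):
    """Resolve a dotted name such as 'post.title' against a context dict."""
    parts = expression.split('.')
    value = context[parts[0]]
    for part in parts[1:]:
        if isinstance(value, dict):
            value = value[part]           # dictionary-style lookup
        else:
            value = getattr(value, part)  # attribute lookup
        if callable(value):
            value = value()               # callables are invoked without "()"
    return value

class Post(object):
    title = 'test cmd-line entry'
    def shout(self):
        return self.title.upper()

context = {'post': Post()}
```

With this context, resolve('post.title', context) returns the title, and resolve('post.shout', context) returns the result of calling the method.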
There are also special functions that you can use in variable tags called
filters. These are functions that you can apply immediately to a variable
while inside the tag. All you need to do is to insert a pipe symbol (|) right
after the variable, followed by the filter name. For example, if we wanted
to titlecase the BlogPost title, we would simply apply the title filter like
this:
<h2>{{ post.title|title }}</h2>
This means that when the template encounters our post.title of “test
admin entry,” the final HTML output will be <h2>Test Admin Entry</h2>.
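For this particular input, the effect is the same as Python's own str.title() method. The function below is only a stand-in to show the transformation; the actual filter may treat edge cases differently:

```python
def title_filter(value):
    """Stand-in for the template filter: capitalize each word."""
    return value.title()

heading = '<h2>%s</h2>' % title_filter('test admin entry')
```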
Variables are passed to the template in the form of a special Python dictionary called a context. In the preceding example, we’re assuming a BlogPost object called “post” has been passed in via the context. The three lines
of the template fetch the BlogPost object’s title, timestamp, and body fields,
respectively. Now let’s enhance the template to make it more
useful by passing in all blog posts via the context so that we can loop
through and display them:
<!-- archive.html -->
{% for post in posts %}
<h2>{{ post.title }}</h2>
<p>{{ post.timestamp }}</p>
<p>{{ post.body }}</p>
<hr>
{% endfor %}
The original three lines are unchanged; we’ve simply wrapped this core
functionality with a loop over all posts. In doing so, we’ve introduced
another construct of Django’s templating language: block tags. Whereas
variable tags are delimited by using pairs of curly braces, block tags use
braces and percent symbols as enclosing pairs: {% ... %}. They are used to
embed logic such as loops and conditionals into your HTML template.
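If it helps to think of the template in Python terms, the loop in archive.html does roughly what the function below does. This is a sketch only: render_archive is a made-up name, posts are plain dictionaries here, and unlike the template engine it does no escaping of HTML in the data:

```python
def render_archive(posts):
    """Plain-Python equivalent of the loop in archive.html."""
    chunks = []
    for post in posts:  # corresponds to {% for post in posts %}
        chunks.append('<h2>%s</h2>' % post['title'])
        chunks.append('<p>%s</p>' % post['timestamp'])
        chunks.append('<p>%s</p>' % post['body'])
        chunks.append('<hr>')
    return '\n'.join(chunks)
```

An empty list of posts produces an empty string, just as the template produces no output when there is nothing to loop over.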
Save the HTML template code above into a simple template in a file
called archive.html and put it in a directory called templates, inside your
app’s folder; thus, the path to your template file should be mysite/blog/
templates/archive.html. The name of the template itself is arbitrary (we
could have called it foo.html), but the templates directory name is mandatory. By default, when searching for templates, Django will look for a
templates directory inside each of your installed applications.
To learn more about templates and tags, check out the official documents page at http://docs.djangoproject.com/en/dev/ref/templates/api/#basics.
The next step is to prepare for the creation of the view function that
users are eventually going to execute to see the output from our brand
new template. Before we create the view, let’s approach this from the
user’s point of view.
11.9.2 Creating a URL Pattern
In this section, we’re going to discuss how the pathnames of URLs in
your users’ browsers are mapped to various parts of your app. When
users issue a client request from their browsers, the Internet magic of mapping hostnames to IP addresses happens, followed by the client making a
connection to the server’s address at port 80 or another designated port
(the Django development server uses 8000 by default).
The Project’s URLconf
The server, through the magic of WSGI, will end up calling the endpoint of
Django, which passes the request down the line. The type of request (GET,
POST, etc.) and path (the remainder of the URL beyond the protocol, host,
and port) are accepted and arrive at the project URLconf (mysite/urls.py)
file. Here, there must be a valid (regular expression) match on the
path that resolves the request; otherwise, the server will return a 404 error
just like the one we encountered earlier in the “Trying Out the Admin”
subsection, because we did not define a handler for '/'.
We could create the needed URL pattern directly inside mysite/urls.py,
but that makes for a messy coupling between our project and our app.
We might want to use our blog app somewhere else, so it would
be nice if it were responsible for its own URLs. This falls in line with code
reuse principles such as DRY and debugging the same code in one place. To keep
our project and app appropriately compartmentalized, we’ll define the
URL mapping in two simple steps and create two URLconfs: one for the
project, and one for the app.
The first step is much like enabling the admin that you saw earlier. In
mysite/urls.py, there’s an autogenerated, commented-out example line
that is almost what we need. It appears near the top of your urlpatterns
variable:
urlpatterns = patterns('',
# Example:
# (r'^mysite/', include('mysite.foo.urls')),
. . .
Edit out the comment and make the necessary name changes so that it
points to our app’s URLconf:
(r'^blog/', include('blog.urls')),
The include() function defers taking action here to another URLconf
(the app’s URLconf, naturally). In our example here, we’re catching
requests that begin with blog/ and passing them on to the mysite/blog/urls.py
that we’re about to create. (More on include() coming up soon.)
Along with setting up the admin app that we did earlier, now your
entire project URLconf should look like this:
# mysite/urls.py
from django.conf.urls.defaults import *
from django.contrib import admin
admin.autodiscover()

urlpatterns = patterns('',
    (r'^blog/', include('blog.urls')),
    (r'^admin/', include(admin.site.urls)),
)
The patterns() function takes a group of 2-tuples (URL regular expression, destination). The regex is straightforward, but what is the destination? It’s either directly a view function that’s called for URLs that match
the pattern, or it’s a call to include() another URLconf file.
When include() is used, the current URL path head is removed, and
the remainder of the path is passed to the patterns() function of the
downstream URLconf. For example, when the URL http://localhost:8000/blog/foo/bar
is entered into the client browser, the project’s URLconf
receives blog/foo/bar. It matches the '^blog/' regex and finds an include()
function (as opposed to a view function), so it passes foo/bar down to the
matching URL handler in mysite/blog/urls.py.
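The "remove the head, pass on the rest" step is easy to picture with Python's re module. The helper below, strip_prefix(), is a hypothetical name used only for this sketch; it is not part of Django:

```python
import re

def strip_prefix(regex, path):
    """Return the part of path left over after regex matches, or None."""
    match = re.search(regex, path)
    if match is None:
        return None           # this pattern does not apply to the path
    return path[match.end():]  # what the downstream URLconf would receive
```

For example, strip_prefix(r'^blog/', 'blog/foo/bar') leaves foo/bar, and a bare blog/ leaves the empty string for the app's URLconf to match.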
You can see this in the parameter to include(): 'blog.urls'. A similar
scenario exists for http://localhost:8000/admin/xxx/yyy/zzz; the xxx/yyy/zzz
would be passed to the admin site’s URL patterns, as specified by
include(admin.site.urls). Now, if your eyes are sharp enough, you might notice
something odd in the code snippet: something small and perhaps missing? It is nearly an optical illusion. Take a careful look at the calls to the
include() function.
Do you see how the reference to blog.urls is in quotes, but not
admin.site.urls? Nope, it’s not a typo. Both patterns() and include()
accept strings or objects. Generally strings are used, but some developers
prefer the more concrete use of passing in objects. The only thing you need
to remember when passing in objects is to ensure that they are imported.
In the preceding example, the import of django.contrib.admin does the job.
Another example of this usage is coming up in the next subsection. To
read more about strings versus objects, take a look at the documents page
on this topic at http://docs.djangoproject.com/en/dev/topics/http/urls/#passing-callable-objects-instead-of-strings.
The App’s URLconf
With the include() of blog.urls, we’re on the hook to define URLs to
match remaining path elements inside the blog application package itself.
Create a new file, mysite/blog/urls.py, that contains these lines:
# urls.py
from django.conf.urls.defaults import *
import blog.views

urlpatterns = patterns('',
    (r'^$', blog.views.archive),
)
It looks quite similar to our project URLconf. First, let’s remind you that
the head (blog/) part of the request URL on which our root URLconf was
matching has been stripped, so we only need to match the empty string,
which is handled by the regex ^$. Our blog application is now reusable
and shouldn’t care if it’s mounted at blog/ or news/ or what/i/had/for/lunch/.
The only mystery here is the archive() view function to which our
request is sent.
Incorporating new view functions as part of your app is as simple as
adding individual lines to your URLconf, not adding ten lines here, editing another five lines of some complex XML file there, etc. In other words,
if you were to add view functions foo() and bar(), your updated urlpatterns
would just have to be changed to the following (but don’t really make
these changes to yours):
urlpatterns = patterns('',
    (r'^$', blog.views.archive),
    (r'^foo/$', blog.views.foo),
    (r'^bar/$', blog.views.bar),
)
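A word on anchors: assuming URL regexes are applied with search semantics, as with Python's re.search(), a pattern without ^ and $ can match in the middle of a path. The short sketch below uses made-up paths and a hypothetical matches() helper to show the difference:

```python
import re

def matches(regex, path):
    """True if regex is found anywhere in path (search semantics)."""
    return re.search(regex, path) is not None

loose_hit = matches(r'foo/', 'not-foo/extra')     # found mid-path
strict_hit = matches(r'^foo/$', 'not-foo/extra')  # anchors reject it
exact_hit = matches(r'^foo/$', 'foo/')            # the intended path
empty_hit = matches(r'^$', '')                    # the archive pattern
```

Only strict_hit is False here; the unanchored pattern happily accepts a path it was probably never meant to serve.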
So that’s great, but if you continue to develop in Django and come back
to look at this file again and again, you’ll begin to notice a lot of repetition
here, violating DRY, of course. Do you see all the references to blog.views
to get to the view functions? This is a good indicator that we should use a
feature in patterns(), namely the first argument, which has been an
empty string all this time.
That parameter is a prefix for the views, so we can move blog.views up
there, remove the repetition, and tweak the import so that it doesn’t
NameError-out. Here’s what the modified URLconf would look like:
from django.conf.urls.defaults import *
from blog.views import *

urlpatterns = patterns('blog.views',
    (r'^$', archive),
    (r'^foo/$', foo),
    (r'^bar/$', bar),
)
Based on the import statement, all three functions are expected to be in
blog.views, meaning mysite/blog/views.py. From the earlier discussion,
you know that because we imported it, we can pass in the objects as we
just did in the preceding example (archive, foo, bar). But, would it be so
bad of us to be even lazier and just not even have that import statement?
As described in the previous subsection, Django supports strings in
addition to objects so that you don’t even need that import. If you remove
it and put quotes around your view names, that’s fine, too:
from django.conf.urls.defaults import *

urlpatterns = patterns('blog.views',
    (r'^$', 'archive'),
    (r'^foo/$', 'foo'),
    (r'^bar/$', 'bar'),
)
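How can a string stand in for a function? Conceptually, the prefix and the view name are joined into a dotted path, the module is imported, and the function is looked up by name. Here is a minimal sketch of that idea using only the standard library; resolve_view() is a hypothetical helper, not Django's implementation, and it is demonstrated with the math module because blog.views exists only inside our project:

```python
import importlib

def resolve_view(prefix, name):
    """Turn a dotted module prefix plus a function name into a callable."""
    module = importlib.import_module(prefix)  # e.g. 'blog.views'
    return getattr(module, name)              # e.g. 'archive'

# Stand-in demonstration with a standard-library module.
sqrt = resolve_view('math', 'sqrt')
```

Inside the project, resolve_view('blog.views', 'archive') would hand back the archive() function in the same way.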
Okay, we know that foo() and bar() don’t exist in our example application, but you can expect that real projects will have multiple views in their
apps’ URLconfs. We were just showing you how to do basic cleanup. You
can find more information on reducing the clutter in URLconf files in
the Django documentation at http://docs.djangoproject.com/en/dev/intro/tutorial03/#simplifying-the-urlconfs.
The final piece of our puzzle is the controller, the view function, which
is called upon seeing a matching URL path.
11.9.3 Creating a View Function
In this section, we focus on the view function, the core functionality of
your app. The development process can take some time, so we’ll first
show you how to get started quickly for those who are impatient, and then
go into more detail so that you know how to do it right in practice.
“Hello World” Fake View
So, you want to debug your HTML template and URLconf right away
without having to create your complete view at this early stage
of development? Let’s do this! Mock up a fake BlogPost and render it into
the template immediately. Create this “Hello World” mysite/blog/views.py
six-statement file now:
# views.py
from datetime import datetime
from django.shortcuts import render_to_response
from blog.models import BlogPost

def archive(request):
    post = BlogPost(title='mocktitle', body='mockbody',
        timestamp=datetime.now())
    return render_to_response('archive.html', {'posts': [post]})
We know the view needs to be called archive() because of its designation in the URLconf, so that’s easy. The code creates a fake blog post and
passes it to the template as a single-element posts list. (Don’t call
post.save() because... well, guess why not?!?)
We’ll come back to render_to_response() shortly, but if you just use
your imagination and guess that it takes a template (archive.html, found
in mysite/blog/templates) and a context dictionary, merges them
together, and spits back the generated HTML to the user, then your imagination would be correct.
Bring up your development server (or run it live by using a real Web
server). Work through any errors you have in your URLconf or template,
and then when you’ve got it working, you’ll see something similar to that
shown in Figure 11-14.
Figure 11-14 The output from our fake “view.”
Coming up with a fake view with semi-mocked data is the fastest way
to get instant gratification and validation that your basic setup is okay.
This iterative process is agile, and when things are good, it signals to you
that it’s safe to begin the real work.
The Real View
Now we’re going to create the real thing, a simple view function (we’ll actually write it
twice) that will fetch all of our blog posts from the database and display
them to users by employing our template. First, we’re going to do it the
“formal” way, which means strict adherence to the following steps, from
obtaining the data to returning the HTTP response back to the client:
• Query the database for all blog entries
• Load the template file
• Create the context dictionary for the template
• Pass the context to the template
• Render the template into HTML
• Return the HTML via the HTTP response
Open blog/views.py and enter the following lines of code, exactly as
shown. This will execute our preceding recipe—it pretty much replaces all
of your earlier fake views.py file:
# views.py
from django.http import HttpResponse
from django.template import loader, Context
from blog.models import BlogPost

def archive(request):
    posts = BlogPost.objects.all()
    t = loader.get_template("archive.html")
    c = Context({'posts': posts})
    return HttpResponse(t.render(c))
Check the development (or real Web) server, then go to the app again in
your browser. You should see a simple, bare-bones rendering (with real
data) of any blog posts that you have entered, complete with title, timestamp, and post body, separated by a horizontal rule (<hr>), similar to
what you see in Figure 11-15 (if you created the first and only pair of posts
that we made earlier).
That’s great! But in keeping with the tradition of not repeating yourself,
the developers of Django noticed that this was an extremely common pattern (get data, render in template, return response), so they created a
shortcut when rendering a template from a simple view function. This is
where we run into our friend, render_to_response(), once again.
Chapter 11 • Web Frameworks: Django
Figure 11-15 The user’s view of blogposts.
We saw render_to_response() earlier in our fake view, but let’s roll that
into our real view now. Add its import from django.shortcuts, remove
the now-superfluous imports of loader, Context, and HttpResponse, and
replace those last three lines of your view. You should be left with this:
# views.py
from django.shortcuts import render_to_response
from blog.models import BlogPost

def archive(request):
    posts = BlogPost.objects.all()
    return render_to_response('archive.html', {'posts': posts})
If you refresh your browser, nothing will change because you’ve only
shortened your code and haven’t changed any real functionality. To read
more about using render_to_response(), check out these pages from the
official documentation:
• http://docs.djangoproject.com/en/dev/intro/tutorial03/#a-shortcut-render-to-response
• http://docs.djangoproject.com/en/dev/topics/http/shortcuts/#render-to-response
Shortcuts are just the beginning. There are other, special types of view
functions that we’ll discuss later called generic views, which are even more
hands-off than render_to_response(). With a generic view, for example,
you wouldn’t even need to write a view function—you’d just use a premade generic view that Django provides and map to it directly from the
URLconf. That is one of the main goals of generic views if you can believe
it: not having to write any code at all!
11.10 Improving the Output
That’s it! You did the three steps it takes to get a working app to the point
where we now have a user-facing interface (and don’t have to rely on the
Admin for CRUD of data). So now what? We’ve got a simple blog working. It responds to client requests, extracts the information from the database, and displays all posts to the user. This is good but we can certainly
make some useful improvements to exhibit more realistic behavior.
One logical direction to take is to show the posts in reverse chronological
order; it makes sense to see the most recent posts first. Another is to limit the
output. If you have any more than 10 (or even 5) posts showing on the page,
it is certainly too long for users. First, let’s tackle reverse-chronological order.
It’s easy for us to tell Django to do that. In fact, we have a choice as to
where we want to tell it to do so. We can either add a default ordering to
our model, or we can add it to the query in our view code. We’ll do the latter first because it’s the simplest to explain.
11.10.1 Query Change
Taking a quick step back, BlogPost is your data model class. Its objects
attribute is a model Manager, and it has an all() method that gives you a
QuerySet. You can think of a QuerySet as an object that represents the rows
of data returned from the database. Don’t push the analogy any further,
though: a QuerySet does not hold the actual rows, because QuerySets perform “lazy
iteration.”
The database isn’t actually hit until the QuerySet is evaluated. In other
words, you can do all kinds of QuerySet manipulation without touching
the data at all. To find out when a QuerySet is evaluated, check out the official documentation at http://docs.djangoproject.com/en/dev/ref/models/querysets/.
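The idea of lazy evaluation can be sketched in a few lines of plain Python. The LazyQuery class below is a toy written for this illustration and is not how Django implements QuerySet; the fake_db() function and the hits list are likewise assumptions that let us observe when the "database" is actually read.

```python
class LazyQuery(object):
    """Toy stand-in for a QuerySet: records operations, fetches only on demand."""

    def __init__(self, fetch, ordering=None):
        self._fetch = fetch          # callable that "hits the database"
        self._ordering = ordering    # e.g. '-timestamp', or None
        self._cache = None           # filled in on first evaluation

    def order_by(self, field):
        # Returns a new lazy object; nothing is fetched here.
        return LazyQuery(self._fetch, field)

    def __iter__(self):
        if self._cache is None:      # first evaluation: fetch and sort now
            rows = list(self._fetch())
            if self._ordering:
                reverse = self._ordering.startswith("-")
                key = self._ordering.lstrip("-")
                rows.sort(key=lambda row: row[key], reverse=reverse)
            self._cache = rows
        return iter(self._cache)


hits = []  # records each time the fake database is read


def fake_db():
    hits.append("hit")
    return [{"title": "a", "timestamp": 1}, {"title": "b", "timestamp": 2}]


query = LazyQuery(fake_db).order_by("-timestamp")
hits_before = len(hits)               # building the query has not fetched anything
titles = [row["title"] for row in query]
hits_after = len(hits)                # iterating is what triggered the fetch
```

Building and refining the query costs nothing; only the loop over it reads the data, and a second loop reuses the cached rows.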
Now we have the background out of the way. We could have simply
told you to add a call to the order_by() method and provide a sort parameter. In our case, we want to sort newest first, which means reverse order
by timestamp. It’s as simple as changing your query statement to the
following:
posts = BlogPost.objects.all().order_by('-timestamp')
By prepending the minus sign (-) to timestamp, we are specifying a
descending chronological sort. For normal ascending order, remove the
minus sign.
To test reading in the top ten posts, we need more than just two BlogPost
entries in the database, so here’s a great place to whip up a few lines of code
using the Django shell (plain one this time; we don’t need the power of
IPython or bpython) and auto-generate a bunch of records in the database:
$ ./manage.py shell --plain
Python 2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from datetime import datetime as dt
>>> from blog.models import BlogPost
>>> for i in range(10):
...     bp = BlogPost(title='post #%d' % i,
...         body='body of post #%d' % i, timestamp=dt.now())
...     bp.save()
...
Figure 11-16 shows the change reflected in the browser when you perform a refresh.
The shell can also be used to test the change that we just made as well as
the new query we want to use:
>>> posts = BlogPost.objects.all().order_by('-timestamp')
>>> for p in posts:
...     print p.timestamp.ctime(), p.title
...
Fri Dec 17 15:59:37 2010 post #9
Fri Dec 17 15:59:37 2010 post #8
Fri Dec 17 15:59:37 2010 post #7
Fri Dec 17 15:59:37 2010 post #6
Fri Dec 17 15:59:37 2010 post #5
Fri Dec 17 15:59:37 2010 post #4
Fri Dec 17 15:59:37 2010 post #3
Fri Dec 17 15:59:37 2010 post #2
Fri Dec 17 15:59:37 2010 post #1
Fri Dec 17 15:59:37 2010 post #0
Mon Dec 13 00:13:01 2010 test admin entry
Sat Dec 11 16:38:37 2010 test cmd-line entry
Figure 11-16 The original pair of blog entries, plus ten more.
This gives us some degree of certainty that when the core bits are copied
to the view function, things should pretty much work right away.
Furthermore, the output can be limited to only the top 10 by using
Python’s friendly slice syntax ([:10]), so add that, too. Take these changes
and update your blog/views.py file so that it looks like the following:
# views.py
from django.shortcuts import render_to_response
from blog.models import BlogPost

def archive(request):
    posts = BlogPost.objects.all().order_by('-timestamp')[:10]
    return render_to_response('archive.html', {'posts': posts})
Save the change and refresh your browser again. You should see two
changes: the blog posts appear in reverse-chronological order, and only the ten
most recent posts show up. In other words, of 12 total entries, you should
no longer see either of the two original posts, as demonstrated in Figure 11-17.
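If you want to convince yourself of the "newest first, top ten" logic without a database, the following sketch reproduces it on plain Python dictionaries. The twelve posts and their one-minute spacing are invented for the illustration; sorted() with reverse=True plays the role of order_by('-timestamp'), and the slice plays the role of [:10].

```python
from datetime import datetime, timedelta

start = datetime(2010, 12, 11, 16, 38, 37)

# Twelve fake posts, one minute apart; "post #0" is the oldest.
posts = [
    {"title": "post #%d" % i, "timestamp": start + timedelta(minutes=i)}
    for i in range(12)
]

# Equivalent in spirit to order_by('-timestamp'): newest first.
newest_first = sorted(posts, key=lambda p: p["timestamp"], reverse=True)

# Equivalent in spirit to [:10]: keep only the first ten of the sorted list.
top_ten = newest_first[:10]

titles = [p["title"] for p in top_ten]
print(titles[0], titles[-1])
```

The two oldest entries fall off the end, just as the two original posts do in the browser. One difference worth knowing: this sketch sorts in Python after loading everything, whereas Django normally pushes both the ordering and the limit into the SQL query.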
Figure 11-17 Only the ten newest blog posts appear here.
So changing the query is fairly straightforward, but for our particular
case, setting a default ordering in the model is a more logical option
because this (most recent, top N posts) is pretty much the only type of
ordering that makes sense for a blog.
Setting the Model Default Ordering
If we set our preferred ordering in the model, any other Django-based app
or project that accesses our data will use that ordering. To set default
ordering for your model, give it an inner class called Meta and set the
ordering attribute in that class:
class Meta:
    ordering = ('-timestamp',)
This effectively moves order_by('-timestamp') from the query to the
model. Make these changes to both files, and you should be left with the code
shown here:
# models.py
from django.db import models

class BlogPost(models.Model):
    title = models.CharField(max_length=150)
    body = models.TextField()
    timestamp = models.DateTimeField()

    class Meta:
        ordering = ('-timestamp',)

# views.py
from django.shortcuts import render_to_response
from blog.models import BlogPost

def archive(request):
    posts = BlogPost.objects.all()[:10]
    return render_to_response('archive.html', {'posts': posts})
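The notion of a default ordering that applies unless a query says otherwise can be sketched as follows. Everything here is hypothetical and written only for this illustration: PostTable stands in for a model, default_ordering for Meta.ordering, and apply_ordering() shows one plausible way to interpret field names with an optional leading minus sign.

```python
def apply_ordering(rows, ordering):
    """Sort dictionaries by field names; a leading '-' means descending."""
    result = list(rows)
    # Sort by the last field first so that earlier fields take priority
    # (Python's sort is stable).
    for field in reversed(ordering):
        descending = field.startswith("-")
        name = field[1:] if descending else field
        result.sort(key=lambda row: row[name], reverse=descending)
    return result


class PostTable(object):
    """Hypothetical stand-in for a model with a Meta-style default ordering."""

    default_ordering = ("-timestamp",)

    def __init__(self, rows):
        self.rows = rows

    def all(self, order_by=None):
        # An explicit ordering wins; otherwise fall back to the default.
        return apply_ordering(self.rows, order_by or self.default_ordering)


table = PostTable([
    {"title": "oldest", "timestamp": 1},
    {"title": "newest", "timestamp": 3},
    {"title": "middle", "timestamp": 2},
])

default_titles = [row["title"] for row in table.all()]
ascending_titles = [row["title"] for row in table.all(order_by=("timestamp",))]
```

The caller that asks for nothing special gets newest-first for free, which is the benefit the model-level setting gives to every app that shares the data.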
CORE TIP (HACKER’S CORNER): Reducing archive() down to one
(long) line of Python
It’s possible to reduce archive() down to a single line if you feel comfortable
using lambda:
archive = lambda req: render_to_response('archive.html',
{'posts': BlogPost.objects.all()[:10]})
Readability is one of the hallmarks of Pythonic code. Another
goal of expressive languages such as Python is to reduce the number of
lines of code needed to attain such readability. Although this one-liner does reduce the number
of lines, I can’t say that it makes the code any easier to read, which is why it’s in
this Hacker’s Corner.
Other differences to the original: the request variable was reduced to just req,
and we do save a tiny bit of memory without having the posts variable. If
you’re new to Python, we recommend you check out the Functions chapter of
Core Python Programming or Core Python Language Fundamentals which covers
lambda.
If you refresh your Web browser, you should see no changes at all, as it
should be. Now that we’ve spent some time improving data retrieval from
the database, we’re going to suggest that you minimize database interaction.
11.11 Working with User Input
So now our app is complete, right? You’re able to add blog posts via the
shell or admin… check. You can view the data with our user-facing data
dumper… check. Are we really done? Not so fast!
Maybe you will be satisfied entering data by creating objects in the shell
or through the more user-friendly admin, but your users probably don’t
know what a Python shell is, much less how to use it, and do you really
want to give people access to your project’s admin app? No way!
If you’ve understood the material in Chapter 10 pretty well, and include
what you’ve learned so far in this chapter, you might be wise enough to
realize that it’s still the same three-step process:
• Add an HTML form in which the user can enter data
• Insert the (URL, view) URLconf entry
• Create the view to handle the user input
We’ll take these on in the same order as our first view, earlier.
11.11.1 The Template: Adding an HTML Form
The first step is pretty simple: create a form for users. To make it easier for
us during development, just add the following HTML to the top of blog/
templates/archive.html (above the BlogPost object display) for now; we
can split it off to another file later.
<!-- archive.html -->
<form action="/blog/create/" method="post">
Title:
<input type=text name=title><br>
Body:
<textarea name=body rows=3 cols=60></textarea><br>
<input type=submit>
</form>
<hr>
{% for post in posts %}
. . .
The reason we’re putting it in the same template during development is that it’s helpful to have both the user input and the blog post(s)
display on a single page. In other words, you won’t need to click and flip
back and forth between a separate form entry page and the BlogPost listing display.
11.11.2 Adding the URLconf Entry
The next step is to add our URLconf entry. Using the preceding HTML,
we’re going to use a path of /blog/create/, so we need to hook that up to
a view function we’re going to write that will save the entry to the database. Let’s call our view create_blogpost(); add the appropriate 2-tuple to
urlpatterns in your app’s URLconf so that it looks like this:
# urls.py
from django.conf.urls.defaults import *

urlpatterns = patterns('blog.views',
    (r'^$', 'archive'),
    (r'^create/', 'create_blogpost'),
)
The remaining task is to come up with the code for create_blogpost().
11.11.3 The View: Processing User Input
Processing Web forms in Django looks quite similar to handling the common gateway interface (CGI) variables that you saw in Chapter 10: you
just need to do the Django equivalent. You can do a casual flip-through of
the Django documentation to get enough knowledge to whip up the snippets of code to add to blog/views.py. First you’ll need some new imports,
as shown in the following:
from datetime import datetime
from django.http import HttpResponseRedirect
The actual view function then would look something like this:
def create_blogpost(request):
    if request.method == 'POST':
        BlogPost(
            title=request.POST.get('title'),
            body=request.POST.get('body'),
            timestamp=datetime.now(),
        ).save()
    return HttpResponseRedirect('/blog/')
Like the archive() view function, the request is automatically passed
in. The form input is coming in via a POST, so we need to check for that.
Next, we create a new BlogPost entry with the form data plus the current
time as the timestamp, and then save() it to the database. Then we’re
going to redirect back to /blog to see our newest post (as well as another
blank form at the top for the next blog entry).
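The control flow of this view (save on POST, then always redirect) can be exercised without a Web server. In the sketch below, FakeRequest, the saved_posts list, and the ("redirect", url) tuple are all stand-ins invented for the illustration; they are not Django classes.

```python
from datetime import datetime

saved_posts = []  # stand-in for the database table


class FakeRequest(object):
    """Hypothetical, minimal imitation of a request: a method and POST data."""

    def __init__(self, method, post=None):
        self.method = method
        self.POST = post or {}


def create_blogpost(request):
    # Only a POST carries form data worth saving.
    if request.method == "POST":
        saved_posts.append({
            "title": request.POST.get("title"),   # None if the field is missing
            "body": request.POST.get("body"),
            "timestamp": datetime.now(),
        })
    # Either way, send the browser back to the listing page.
    return ("redirect", "/blog/")


get_result = create_blogpost(FakeRequest("GET"))
count_after_get = len(saved_posts)
post_result = create_blogpost(FakeRequest("POST", {"title": "hi", "body": "there"}))
count_after_post = len(saved_posts)
```

A plain GET to the create URL saves nothing but still redirects, so a curious user who types the address by hand simply lands back on the blog listing.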
Again, double-check either your development or real Web server and
visit your app’s page. You’ll now see the form on top of the data dump (see
Figure 11-18), enabling you to test drive your new feature.
Figure 11-18 Our first user form (followed by previous entries).
11.11.4 Cross-Site Request Forgery
Not so fast! If you were able to debug your app so that you get a form and
submit, you’ll see that your browser does try to access the /blog/create/
URL, but it’s getting stopped by the error shown in Figure 11-19.
Django comes with a data-preserving feature that disallows POSTs
which are not secure against cross-site request forgery (CSRF) attacks. Explanations of CSRF are beyond the scope of this book, but you can read more
about them here:
• http://docs.djangoproject.com/en/dev/intro/tutorial04/#write-a-simple-form
• http://docs.djangoproject.com/en/dev/ref/contrib/csrf/
For your simple app, the fix has two parts, both of which involve adding
minor snippets of code to what you already have:
1. Add a CSRF token ({% csrf_token %}) to forms that POST
back to your site
2. Send the request context instance to the template so that the token can be rendered
Figure 11-19 The CSRF error screen.
A request context is exactly what it sounds like: a dictionary that contains information about the request. If you go to the CSRF documentation
pages that we just provided, you’ll find out that django.template.RequestContext
is always processed in a way that includes built-in CSRF protection.
The first step is accomplished by adding the token to the form. Edit the
<FORM> header line in mysite/blog/templates/archive.html, adding the CSRF
token inside the form so that it looks like this:
<form action="/blog/create/" method=post>{% csrf_token %}
The second part involves editing mysite/blog/views.py. Alter the return
line in your archive() view function by adding the RequestContext instance,
as shown here:
return render_to_response('archive.html', {'posts': posts,},
RequestContext(request))
Don’t forget to import django.template.RequestContext:
from django.template import RequestContext
Once you save these changes, you’ll be able to submit data to your
application from a form (not the admin or the shell). CSRF errors will
cease and you’ll experience a successful BlogPost entry submission.
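Although a full treatment of CSRF is out of scope, the general shape of a token check is easy to sketch. The following is a generic illustration using the Python 3 standard library, not Django's actual mechanism; the session dictionary, the csrf_token key, and both function names are assumptions made for the example.

```python
import hmac
import secrets


def issue_token(session):
    """Create a random token, remember it server-side, and return it."""
    token = secrets.token_hex(16)
    session["csrf_token"] = token
    return token


def is_valid_post(session, form_data):
    """Accept the POST only if the form echoes back the session's token."""
    expected = session.get("csrf_token", "")
    submitted = form_data.get("csrf_token", "")
    # compare_digest avoids leaking information through timing differences.
    return bool(expected) and hmac.compare_digest(expected, submitted)


session = {}                                   # stand-in for per-user state
token = issue_token(session)                   # embedded in the form as a hidden field

good_form = {"title": "hello", "csrf_token": token}
forged_form = {"title": "hello"}               # a third-party site cannot know the token
```

The form served by your own site carries the secret and is accepted; a form submitted from somewhere else lacks it and is refused. That is the role the {% csrf_token %} tag and the request context play together.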
11.12 Forms and Model Forms
In the previous section, we demonstrated how to work with user input by
showing you the steps to create an HTML form. Now, we will show you
how Django simplifies the effort required to accept user data (Django
Forms), especially forms containing the exact fields that make up a data
model (Django Model Forms).
11.12.1 Introducing Django Forms
Discounting the one-time additional work required to handle CSRFs, the
three earlier steps to integrate a simple input form frankly look too laborious and repetitious. After all, this is Django, virtuous student of the DRY
principle.
The most suspiciously repetitious parts of our app involve seeing our
data model embedded everywhere. In the form, we see the title and body:
Title: <input type=text name=title><br>
Body: <textarea name=body rows=3 cols=60></textarea><br>
And in the create_blogpost() view, we see pretty much the same:
BlogPost(
    title=request.POST.get('title'),
    body=request.POST.get('body'),
    timestamp=datetime.now(),
).save()
The point is that once you’ve defined the data model, it should be the
only place where you see title, body, and perhaps timestamp (although the
last is a special case because we do not ask the user to input this value).
Based on the data model alone, isn’t it straightforward to expect the Web
framework to come up with the form fields? Why should the developer
have to write this in addition to the data model? This is where Django
forms come in.
First, let’s create a Django form for our input data:
from django import forms

class BlogPostForm(forms.Form):
    title = forms.CharField(max_length=150)
    body = forms.CharField(widget=forms.Textarea)
    timestamp = forms.DateTimeField()
Okay, that’s not quite complete. In our HTML form, we specified the
HTML textarea element to have three rows and a width of sixty characters.
Because we’re replacing the raw HTML by writing code that automatically
generates it, we need to find a way to specify these requirements, and in
this case, the solution is to pass these attributes directly:
body = forms.CharField(
    widget=forms.Textarea(attrs={'rows':3, 'cols':60})
)
11.12.2 The Case for Model Forms
Aside from the minor blip regarding specifying attributes, did you do a
double-take when looking at the BlogPostForm definition? I mean, wasn’t it
repetitious too? As you can see in the following, it looks nearly identical to
the data model:
class BlogPost(models.Model):
    title = models.CharField(max_length=150)
    body = models.TextField()
    timestamp = models.DateTimeField()
Yes, you would be correct: they look almost like fraternal twins. This is
far too much duplication for any self-respecting Django script. What we
did previously by creating a stand-alone Form object is fine if we wanted to
create a form for a Web page from scratch without a data model backing it.
However, if the form fields are an exact match with a data model, then a
Form isn’t what we’re looking for; instead, you would really do better with
a Django ModelForm, as demonstrated here:
class BlogPostForm(forms.ModelForm):
    class Meta:
        model = BlogPost
Much better—now that’s the laziness we’re looking for. By switching
from a Form to a ModelForm, we can define a Meta class that designates on
which data model the form should be based. When the HTML form is generated, it will have fields for all attributes of the data model.
In our case though, we don’t trust the user to enter the correct timestamp; instead, we want our app to add that content programmatically, per post entry. That’s not a problem: we only need to add one more
attribute named exclude to remove form items from the generated HTML.
Add the import, and append the full BlogPostForm class presented in the
following example to the bottom of your blog/models.py file, following
your definition of BlogPost:
# blog/models.py
from django.db import models
from django import forms

class BlogPost(models.Model):
    . . .

class BlogPostForm(forms.ModelForm):
    class Meta:
        model = BlogPost
        exclude = ('timestamp',)
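The principle behind a model form (derive the form's fields from the model, minus any exclusions) can be sketched with standard-library introspection. This is an analogy written for Python 3, not Django's machinery: BlogPostRecord is a hypothetical dataclass standing in for the BlogPost model, and form_field_names() is a helper invented for the example.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class BlogPostRecord:
    """Hypothetical plain-Python stand-in for the BlogPost model."""
    title: str
    body: str
    timestamp: datetime


def form_field_names(model, exclude=()):
    """List the model's field names, minus any that are excluded."""
    return [f.name for f in fields(model) if f.name not in exclude]


all_fields = form_field_names(BlogPostRecord)
user_fields = form_field_names(BlogPostRecord, exclude=("timestamp",))
print(user_fields)
```

The field names are written once, in the model, and the form simply asks the model what it contains. Adding a field to the model later would make it show up in the form with no further edits.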
11.12.3 Using the ModelForm to Generate the
HTML Form
What does this buy us? Well, right off the bat we can just cut out the fields
in our form. Thus, change the code at the top of mysite/blog/templates/
archive.html to:
<form action="/blog/create/" method=post>{% csrf_token %}
<table>{{ form }}</table><br>
<input type=submit>
</form>
Yeah, you need to leave the submit button in there. Also, as you can see,
the form defaults to the innards of a table. Want some proof? Just go into
the Django shell, make a BlogPostForm, and then mess around with it a little.
It’s as easy as this:
>>> from blog.models import BlogPostForm
>>> form = BlogPostForm()
>>> form
<blog.models.BlogPostForm object at 0x12d32d0>
>>> str(form)
'<tr><th><label for="id_title">Title:</label></th><td><input
id="id_title" type="text" name="title" maxlength="150" /></td></
tr>\n<tr><th><label for="id_body">Body:</label></th><td><textarea
id="id_body" rows="10" cols="40" name="body"></textarea></td></tr>'
That’s all the HTML that you didn’t have to write. (Again, note that due
to our exclude, the timestamp is left out of the form. For fun, you can temporarily comment it out and see the additional timestamp field in the generated HTML.)
If you want output different from HTML table rows and cells, you can
request it by using the as_*() methods: {{ form.as_p }} for <p>...</p>
delimited text, {{ form.as_ul }} for a bulleted list with <li> elements, etc.
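To make the difference between these output styles concrete, here is a small sketch that renders the same two fields three ways. The FIELDS list and the three functions are simplified imitations written for this illustration; the real methods produce richer markup (labels, ids, error lists) than shown here.

```python
FIELDS = [
    ("Title", '<input type="text" name="title">'),
    ("Body", '<textarea name="body"></textarea>'),
]


def as_table(fields):
    # One table row per field: label in <th>, widget in <td>.
    return "\n".join(
        "<tr><th>%s:</th><td>%s</td></tr>" % (label, widget)
        for label, widget in fields
    )


def as_p(fields):
    # One paragraph per field.
    return "\n".join("<p>%s: %s</p>" % (label, widget) for label, widget in fields)


def as_ul(fields):
    # One list item per field; the caller supplies the surrounding <ul>.
    return "\n".join("<li>%s: %s</li>" % (label, widget) for label, widget in fields)


print(as_p(FIELDS))
```

In each case only the wrapper around a field changes, which is why the template must still supply the enclosing <table> or <ul> element, the <form> tags, and the submit button.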
The URLconf stays the same, so the last modification necessary is
updating the view function to send the ModelForm over to the template. To
do this, you instantiate it and pass it as an additional key-value pair of the
context dictionary. So, change the final line of archive() in blog/views.py
to the following:
return render_to_response('archive.html', {'posts': posts,
'form': BlogPostForm()}, RequestContext(request))
Don’t forget to add the import for both your data and form models at the
top of views.py:
from blog.models import BlogPost, BlogPostForm
11.12.4 Processing the ModelForm Data
The changes we just made were to create the ModelForm and have it generate
the HTML to present to the user. What about after the user has submitted
her information? We still see duplication in the create_blogpost() view
which, as you know, is also in blog/views.py. Similar to how we defined the
Meta class for BlogPostForm to instruct it to take its fields from BlogPost, we
shouldn’t have to create our object like this in create_blogpost():
def create_blogpost(request):
    if request.method == 'POST':
        BlogPost(
            title=request.POST.get('title'),
            body=request.POST.get('body'),
            timestamp=datetime.now(),
        ).save()
    return HttpResponseRedirect('/blog/')
There should be no need to mention title, body, etc., because they’re in
the data model. We should be able to shorten this view to the following:
def create_blogpost(request):
    if request.method == 'POST':
        form = BlogPostForm(request.POST)
        if form.is_valid():
            form.save()
    return HttpResponseRedirect('/blog/')
Unfortunately, we can’t do this because of the timestamp. We had to
make an exception in the preceding HTML form generation, so we need to
do likewise here. Here is the if clause that we need to use:
if form.is_valid():
    post = form.save(commit=False)
    post.timestamp = datetime.now()
    post.save()
As you can see, we have to add the timestamp to our data and then
manually save the object to get our desired result. Note that the first call is
the form’s save(), not the model’s save(); it returns an instance of the BlogPost
model, but because commit=False, no data is written to the database until
post.save() is called. Once these changes are in place, you can start using
the form normally, as illustrated in Figure 11-20.
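The two-stage save can be mimicked in plain Python to show why the intermediate step matters. Post, PostForm, and the database list below are hypothetical stand-ins created for this sketch; only the calling pattern (build without committing, fill in the missing field, then save) mirrors the view code.

```python
from datetime import datetime

database = []  # stand-in for the blog post table


class Post(object):
    def __init__(self, title, body, timestamp=None):
        self.title, self.body, self.timestamp = title, body, timestamp

    def save(self):
        # Like the real model, refuse to store a post without a timestamp.
        if self.timestamp is None:
            raise ValueError("timestamp is required")
        database.append(self)


class PostForm(object):
    """Hypothetical form that knows about title and body but not timestamp."""

    def __init__(self, data):
        self.data = data

    def is_valid(self):
        return bool(self.data.get("title")) and bool(self.data.get("body"))

    def save(self, commit=True):
        post = Post(self.data["title"], self.data["body"])
        if commit:
            post.save()       # fails in this sketch: no timestamp yet
        return post


form = PostForm({"title": "hello", "body": "world"})
if form.is_valid():
    post = form.save(commit=False)   # build the object, do not store it
    rows_before = len(database)
    post.timestamp = datetime.now()  # fill in the excluded field
    post.save()                      # now store it
```

Calling form.save() with the default commit=True would try to store an incomplete object, which is exactly the situation that the commit=False detour avoids.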
Figure 11-20 The automatically generated user form.
11.13 More About Views
The final most important thing that we need to discuss is a topic that no
Django book should omit: generic views. So far, when you’ve needed a controller or logic for you