Programming Perl, Third Edition (English original)


Programming Perl
Third Edition

Larry Wall, Tom Christiansen & Jon Orwant

Beijing • Cambridge • Farnham • Köln • Paris • Sebastopol • Taipei • Tokyo

Programming Perl, Third Edition
by Larry Wall, Tom Christiansen, and Jon Orwant

Copyright © 2000, 1996, 1991 O’Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.

Published by O’Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.

Editor, First Edition: Tim O’Reilly
Editor, Second Edition: Steve Talbott
Editor, Third Edition: Linda Mui
Technical Editor: Nathan Torkington
Production Editor: Melanie Wang
Cover Designer: Edie Freedman

Printing History:
    January 1991: First Edition.
    September 1996: Second Edition.
    July 2000: Third Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a camel and the Perl language is a trademark of O’Reilly & Associates, Inc. Permission may be granted for non-commercial use; please inquire by sending mail to camel@oreilly.com.

While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Library of Congress Cataloging-in-Publication Data

Wall, Larry.
    Programming Perl / Larry Wall, Tom Christiansen & Jon Orwant. -- 3rd ed.
        p. cm.
    ISBN 0-596-00027-8
    1. Perl (Computer program language) I. Christiansen, Tom. II. Orwant, Jon. III. Title.
    QA76.73.P22 W35 2000
    005.13'3--dc21    00-055799

ISBN: 0-596-00027-8 [M]

Table of Contents

Preface

I: Overview

1: An Overview of Perl
    Getting Started
    Natural and Artificial Languages
    An Average Example
    Filehandles
    Operators
    Control Structures
    Regular Expressions
    List Processing
    What You Don’t Know Won’t Hurt You (Much)

II: The Gory Details

2: Bits and Pieces
    Atoms
    Molecules
    Built-in Data Types
    Variables
    Names
    Scalar Values
    Context
    List Values and Arrays
    Hashes
    Typeglobs and Filehandles
    Input Operators

3: Unary and Binary Operators
    Terms and List Operators (Leftward)
    The Arrow Operator
    Autoincrement and Autodecrement
    Exponentiation
    Ideographic Unary Operators
    Binding Operators
    Multiplicative Operators
    Additive Operators
    Shift Operators
    Named Unary and File Test Operators
    Relational Operators
    Equality Operators
    Bitwise Operators
    C-Style Logical (Short-Circuit) Operators
    Range Operator
    Conditional Operator
    Assignment Operators
    Comma Operators
    List Operators (Rightward)
    Logical and, or, not, and xor
    C Operators Missing from Perl

4: Statements and Declarations
    Simple Statements
    Compound Statements
    if and unless Statements
    Loop Statements
    Bare Blocks
    goto
    Global Declarations
    Scoped Declarations
    Pragmas

5: Pattern Matching
    The Regular Expression Bestiary
    Pattern-Matching Operators
    Metacharacters and Metasymbols
    Character Classes
    Quantifiers
    Positions
    Capturing and Clustering
    Alternation
    Staying in Control
    Fancy Patterns

6: Subroutines
    Syntax
    Semantics
    Passing References
    Prototypes
    Subroutine Attributes

7: Formats
    Format Variables
    Footers

8: References
    What Is a Reference?
    Creating References
    Using Hard References
    Symbolic References
    Braces, Brackets, and Quoting

9: Data Structures
    Arrays of Arrays
    Hashes of Arrays
    Arrays of Hashes
    Hashes of Hashes
    Hashes of Functions
    More Elaborate Records
    Saving Data Structures

10: Packages
    Symbol Tables
    Autoloading

11: Modules
    Using Modules
    Creating Modules
    Overriding Built-in Functions

12: Objects
    Brief Refresher on Object-Oriented Lingo
    Perl’s Object System
    Method Invocation
    Object Construction
    Class Inheritance
    Instance Destructors
    Managing Instance Data
    Managing Class Data
    Summary

13: Overloading
    The overload Pragma
    Overload Handlers
    Overloadable Operators
    The Copy Constructor (=)
    When an Overload Handler Is Missing (nomethod and fallback)
    Overloading Constants
    Public Overload Functions
    Inheritance and Overloading
    Run-Time Overloading
    Overloading Diagnostics

14: Tied Variables
    Tying Scalars
    Tying Arrays
    Tying Hashes
    Tying Filehandles
    A Subtle Untying Trap
    Tie Modules on CPAN

III: Perl as Technology

15: Unicode
    Building Character
    Effects of Character Semantics
    Caution, Working

16: Interprocess Communication
    Signals
    Files
    Pipes
    System V IPC
    Sockets

17: Threads
    The Process Model
    The Thread Model

18: Compiling
    The Life Cycle of a Perl Program
    Compiling Your Code
    Executing Your Code
    Compiler Backends
    Code Generators
    Code Development Tools
    Avant-Garde Compiler, Retro Interpreter

19: The Command-Line Interface
    Command Processing
    Environment Variables

20: The Perl Debugger
    Using the Debugger
    Debugger Commands
    Debugger Customization
    Unattended Execution
    Debugger Support
    The Perl Profiler

21: Internals and Externals
    How Perl Works
    Internal Data Types
    Extending Perl (Using C from Perl)
    Embedding Perl (Using Perl from C)
    The Moral of the Story

IV: Perl as Culture

22: CPAN
    The CPAN modules Directory
    Using CPAN Modules
    Creating CPAN Modules

23: Security
    Handling Insecure Data
    Handling Timing Glitches
    Handling Insecure Code

24: Common Practices
    Common Goofs for Novices
    Efficiency
    Programming with Style
    Fluent Perl
    Program Generation

25: Portable Perl
    Newlines
    Endianness and Number Width
    Files and Filesystems
    System Interaction
    Interprocess Communication (IPC)
    External Subroutines (XS)
    Standard Modules
    Dates and Times
    Internationalization
    Style

26: Plain Old Documentation
    Pod in a Nutshell
    Pod Translators and Modules
    Writing Your Own Pod Tools
    Pod Pitfalls
    Documenting Your Perl Programs

27: Perl Culture
    History Made Practical
    Perl Poetry

V: Reference Material

28: Special Names
    Special Names Grouped by Type
    Special Variables in Alphabetical Order

29: Functions
    Perl Functions by Category
    Perl Functions in Alphabetical Order

30: The Standard Perl Library
    Library Science
    A Tour of the Perl Library

31: Pragmatic Modules
    use attributes
    use autouse
    use base
    use blib
    use bytes
    use charnames
    use constant
    use diagnostics
    use fields
    use filetest
    use integer
    use less
    use lib
    use locale
    use open
    use overload
    use re
    use sigtrap
    use strict
    use subs
    use vars
    use warnings

32: Standard Modules
    Listings by Type
    Benchmark
    Carp
    CGI
    CGI::Carp
    Class::Struct
    Config
    CPAN
    Cwd
    Data::Dumper
    DB_File
    Dumpvalue
    English
    Errno
    Exporter
    Fatal
    Fcntl
    File::Basename
    File::Compare
    File::Copy
    File::Find
    File::Glob
    File::Spec
    File::stat
    File::Temp
    FileHandle
    Getopt::Long
    Getopt::Std
    IO::Socket
    IPC::Open2
    IPC::Open3
    Math::BigInt
    Math::Complex
    Math::Trig
    Net::hostent
    POSIX
    Safe
    Socket
    Symbol
    Sys::Hostname
    Sys::Syslog
    Term::Cap
    Text::Wrap
    Time::Local
    Time::localtime
    User::grent
    User::pwent

33: Diagnostic Messages

Glossary

Index

Tables

1-1  Logical Operators
2-1  Backslashed Character Escapes
2-2  Translation Escapes
2-3  Quote Constructs
3-1  Operator Precedence
3-2  Named Unary Operators
3-3  Ambiguous Characters
3-4  File Test Operators
3-5  Relational Operators
3-6  Equality Operators
5-1  m// Modifiers
5-2  s/// Modifiers
5-3  tr/// Modifiers
5-4  General Regex Metacharacters
5-5  Regex Quantifiers
5-6  Extended Regex Sequences
5-7  Alphanumeric Regex Metasymbols
5-8  Classic Character Classes
5-9  Composite Unicode Properties
5-10 Standard Unicode Properties
5-11 POSIX Character Classes
5-12 Regex Quantifiers Compared
13-1 Overloadable Operators
14-1 Tie Modules on CPAN
18-1 What Happens When
19-1 -D Options
29-1 Modes for open
29-2 I/O Disciplines
29-3 Template Characters for pack/unpack
29-4 Formats for sprintf
29-5 Fields Returned by stat

I: Overview

Preface

The Pursuit of Happiness

Perl is a language for getting your job done. Of course, if your job is programming, you can get your job done with any “complete” computer language, theoretically speaking. But we know from experience that computer languages differ not so much in what they make possible, but in what they make easy. At one extreme, the so-called fourth-generation languages make it easy to do some things, but nearly impossible to do other things. At the other extreme, so-called industrial-strength languages make it equally difficult to do almost everything.

Perl is different. In a nutshell, Perl is designed to make the easy jobs easy, without making the hard jobs impossible. And what are these “easy jobs” that ought to be easy? The ones you do every day, of course.
You want a language that makes it easy to manipulate numbers and text, files and directories, computers and networks, and especially programs. It should be easy to run external programs and scan their output for interesting tidbits. It should be easy to send those same tidbits off to other programs that can do special things with them. It should be easy to develop, modify, and debug your own programs too. And, of course, it should be easy to compile and run your programs, and do it portably, on any modern operating system. Perl does all that, and a whole lot more.

Initially designed as a glue language for Unix, Perl has long since spread to most other operating systems. Because it runs nearly everywhere, Perl is one of the most portable programming environments available today. To program C or C++ portably, you have to put in all those strange #ifdef markings for different operating systems. To program Java portably, you have to understand the idiosyncrasies of each new Java implementation. To program a shell script portably, you have to remember the syntax for each operating system’s version of each command and somehow find the common factor that (you hope) works everywhere. And to program Visual Basic portably, you just need a more flexible definition of the word “portable”. :-)

Perl happily avoids such problems while retaining many of the benefits of these other languages, with some additional magic of its own. Perl’s magic comes from many sources: the utility of its feature set, the inventiveness of the Perl community, and the exuberance of the open source movement in general. But much of this magic is simply hybrid vigor; Perl has a mixed heritage and has always viewed diversity as a strength rather than a weakness. Perl is a “give me your tired, your poor” language. If you feel like a huddled mass longing to be free, Perl is for you. Perl reaches out across cultures.
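As a small illustration of those everyday jobs, here is a sketch of our own (not from the book) that scans some lines of text — standing in for a program’s output or a logfile — and picks out the interesting tidbits. The input data and pattern are invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input, standing in for an external program's output.
my @lines = (
    "2000-07-01 ok   request served",
    "2000-07-01 FAIL disk full",
    "2000-07-02 ok   request served",
);

# Scan for the interesting tidbits: lines that mention a failure.
my @failures = grep { /FAIL/ } @lines;
print scalar(@failures), " failure(s) found\n";
print "$_\n" for @failures;
```

In real use you would fill @lines from a filehandle or from backticks around an external command; the pattern-match-and-grep idiom stays the same.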
Much of the explosive growth of Perl has been fueled by the hankerings of former Unix systems programmers who wanted to take along with them as much of the “old country” as they could. For them, Perl is the portable distillation of Unix culture, an oasis in the desert of “can’t get there from here”. On the other hand, it also works in the other direction: Windows-based web designers are often delighted to discover that they can take their Perl programs and run them unchanged on the company’s Unix server.

Although Perl is especially popular with systems programmers and web developers, that’s just because they discovered it first; Perl appeals to a much broader audience. From its small start as a text-processing language, Perl has grown into a sophisticated, general-purpose programming language with a rich software development environment complete with debuggers, profilers, cross-referencers, compilers, libraries, syntax-directed editors, and all the rest of the trappings of a “real” programming language — if you want them. But those are all about making hard things possible, and lots of languages can do that. Perl is unique in that it never lost its vision for keeping easy things easy.

Because Perl is both powerful and accessible, it is being used daily in every imaginable field, from aerospace engineering to molecular biology, from mathematics to linguistics, from graphics to document processing, from database manipulation to network management. Perl is used by people who are desperate to analyze or convert lots of data quickly, whether you’re talking DNA sequences, web pages, or pork belly futures. Indeed, one of the jokes in the Perl community is that the next big stock market crash will probably be triggered by a bug in someone’s Perl script. (On the brighter side, any unemployed stock analysts will still have a marketable skill, so to speak.)

There are many reasons for the success of Perl.
Perl was a successful open source project long before the open source movement got its name. Perl is free, and will always be free. You can use Perl however you see fit, subject only to a very liberal licensing policy. If you are in business and want to use Perl, go right ahead. You can embed Perl in the commercial applications you write without fee or restriction. And if you have a problem that the Perl community can’t fix, you have the ultimate backstop: the source code itself. The Perl community is not in the business of renting you their trade secrets in the guise of “upgrades”. The Perl community will never “go out of business” and leave you with an orphaned product.

It certainly helps that Perl is free software. But that’s not enough to explain the Perl phenomenon, since many freeware packages fail to thrive. Perl is not just free; it’s also fun. People feel like they can be creative in Perl because they have freedom of expression: they get to choose what to optimize for, whether that’s computer speed or programmer speed, verbosity or conciseness, readability or maintainability or reusability or portability or learnability or teachability. You can even optimize for obscurity, if you’re entering an Obfuscated Perl Contest.

Perl can give you all these degrees of freedom because it’s a language with a split personality. It’s simultaneously a very simple language and a very rich language. Perl has taken good ideas from nearly everywhere and installed them into an easy-to-use mental framework. To those who merely like it, Perl is the Practical Extraction and Report Language. To those who love it, Perl is the Pathologically Eclectic Rubbish Lister. And to the minimalists in the crowd, Perl seems like a pointless exercise in redundancy. But that’s okay. The world needs a few reductionists (mainly as physicists). Reductionists like to take things apart. The rest of us are just trying to get it together.

There are many ways in which Perl is a simple language.
You don’t have to know many special incantations to compile a Perl program — you can just execute it like a batch file or shell script. The types and structures used by Perl are easy to use and understand. Perl doesn’t impose arbitrary limitations on your data—your strings and arrays can grow as large as they like (as long as you have memory), and they’re designed to scale well as they grow. Instead of forcing you to learn new syntax and semantics, Perl borrows heavily from other languages you may already be familiar with (such as C, and awk, and BASIC, and Python, and English, and Greek). In fact, just about any programmer can read a well-written piece of Perl code and have some idea of what it does.

Most important, you don’t have to know everything there is to know about Perl before you can write useful programs. You can learn Perl “small end first”. You can program in Perl Baby-Talk, and we promise not to laugh. Or more precisely, we promise not to laugh any more than we’d giggle at a child’s creative way of putting things. Many of the ideas in Perl are borrowed from natural language, and one of the best ideas is that it’s okay to use a subset of the language as long as you get your point across. Any level of language proficiency is acceptable in Perl culture. We won’t send the language police after you. A Perl script is “correct” if it gets the job done before your boss fires you.

Though simple in many ways, Perl is also a rich language, and there is much to learn about it. That’s the price of making hard things possible. Although it will take some time for you to absorb all that Perl can do, you will be glad that you have access to the extensive capabilities of Perl when the time comes that you need them.
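To make the “no arbitrary limitations” point concrete, here is a small sketch of our own (again, not from the book): arrays and strings simply grow as you add to them, with no size declarations anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Arrays grow on demand; no size declaration is ever needed.
my @squares;
push @squares, $_ * $_ for 1 .. 10;
print "ten squares, ending in $squares[-1]\n";   # negative index: last element

# Strings grow the same way.
my $banner = "-";
$banner x= 40;   # string repetition: $banner is now 40 characters long
print length($banner), "\n";
```

Push a million more elements onto @squares and the only limit you hit is your machine’s memory.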
Because of its heritage, Perl was a rich language even when it was “just” a data-reduction language, designed for navigating files, scanning large amounts of text, creating and obtaining dynamic data, and printing easily formatted reports based on that data. But somewhere along the line, Perl started to blossom. It also became a language for filesystem manipulation, process management, database administration, client-server programming, secure programming, web-based information management, and even for object-oriented and functional programming. These capabilities were not just slapped onto the side of Perl—each new capability works synergistically with the others because Perl was designed to be a glue language from the start.

But Perl can glue together more than its own features. Perl is designed to be modularly extensible. Perl allows you to rapidly design, program, debug, and deploy applications, and it also allows you to easily extend the functionality of these applications as the need arises. You can embed Perl in other languages, and you can embed other languages in Perl. Through the module importation mechanism, you can use these external definitions as if they were built-in features of Perl. Object-oriented external libraries retain their object-orientedness in Perl.

Perl helps you in other ways, too. Unlike strictly interpreted languages such as command files or shell scripts, which compile and execute a program one command at a time, Perl first compiles your whole program quickly into an intermediate format. Like any other compiler, it performs various optimizations and gives you instant feedback on everything from syntax and semantic errors to library binding mishaps. Once Perl’s compiler frontend is happy with your program, it passes off the intermediate code to the interpreter to execute (or optionally to any of several modular back ends that can emit C or bytecode).
This all sounds complicated, but the compiler and interpreter are quite efficient, and most of us find that the typical compile-run-fix cycle is measured in mere seconds. Together with Perl’s many fail-soft characteristics, this quick turnaround capability makes Perl a language in which you really can do rapid prototyping. Then later, as your program matures, you can tighten the screws on yourself and make yourself program with less flair but more discipline. Perl helps you with that, too, if you ask nicely.

Perl also helps you to write programs more securely. In addition to all the typical security interfaces provided by other languages, Perl also guards against accidental security errors through a unique data-tracing mechanism that automatically determines which data came from insecure sources and prevents dangerous operations before they can happen. Finally, Perl lets you set up specially protected compartments in which you can safely execute Perl code of dubious origin, disallowing dangerous operations.

But, paradoxically, the way in which Perl helps you the most has almost nothing to do with Perl and everything to do with the people who use Perl. Perl folks are, frankly, some of the most helpful folks on earth. If there’s a religious quality to the Perl movement, then this is at the heart of it. Larry wanted the Perl community to function like a little bit of heaven, and by and large he seems to have gotten his wish, so far. Please do your part to keep it that way.

Whether you are learning Perl because you want to save the world, or just because you are curious, or because your boss told you to, this handbook will lead you through both the basics and the intricacies. And although we don’t intend to teach you how to program, the perceptive reader will pick up some of the art, and a little of the science, of programming. We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris.
Along the way, we hope you find the book mildly amusing in some spots (and wildly amusing in others). And if none of this is enough to keep you awake, just keep reminding yourself that learning Perl will increase the value of your resume. So keep reading.

What’s New in This Edition

Well, almost everything. Even where we kept the good bits from the previous edition (and there were quite a few good bits, we’ll admit), we’ve heavily revised and reorganized the current edition with several goals in mind.

First, we wanted to increase the accessibility of the book to people coming from backgrounds other than computer science. We’ve made fewer assumptions about what the reader will know in advance. At the same time, we’ve kept the exposition lively in the hope that people who are already familiar with some of the material will not fall asleep reading it.

Second, we wanted to present the very latest developments in Perl itself. To that end, we have not been shy about presenting the current state of the work, even where we feel that it is still experimental. While the core of Perl has been rock solid for years, the pace of development for some of the experimental extensions can be quite torrid at times. We’ll tell you honestly when we think the online documentation will be more reliable than what we have written here. Perl is a blue-collar language, so we’re not afraid to call a spade a shovel.

Third, we wanted you to be able to find your way around in the book more easily, so we’ve broken this edition up into smaller, more coherent chapters and reorganized them into meaningful parts. Here’s how the new edition is laid out:

Part 1, Overview
Getting started is always the hardest part. This part presents the fundamental ideas of Perl in an informal, curl-up-in-your-favorite-chair fashion. Not a full tutorial, it merely offers a quick jump-start, which may not serve everyone’s need.
See the section “Offline Documentation” for books that might better suit your learning style.

Part 2, The Gory Details
This part consists of an in-depth, no-holds-barred discussion of the guts of the language at every level of abstraction, from data types, variables, and regular expressions to subroutines, modules, and objects. You’ll gain a good sense of how the language works and, in the process, pick up a few hints on good software design. (And if you’ve never used a language with pattern matching, you’re in for a special treat.)

Part 3, Perl as Technology
You can do a lot with Perl all by itself, but this part will take you to a higher level of wizardry. Here you’ll learn how to make Perl jump through whatever hoops your computer sets up for it, from dealing with Unicode, interprocess communication, and multithreading, through compiling, invoking, debugging, and profiling Perl, on up to writing your own external extensions in C or C++ or interfaces to any existing API you feel like. Perl will be quite happy to talk to any interface on your computer, or for that matter, on any other computer on the Internet, weather permitting.

Part 4, Perl as Culture
Everyone understands that a culture must have a language, but the Perl community has always understood that a language must have a culture. This part is where we view Perl programming as a human activity, embedded in the real world of people. We’ll cover how you can improve the way you deal with both good people and bad people. We’ll also dispense a great deal of advice on how you can become a better person yourself and on how to make your programs more useful to other people.

Part 5, Reference Material
Here we’ve put together all the chapters in which you might want to look something up alphabetically, from special variables and functions to standard modules and pragmas. The Glossary will be particularly helpful to those who are unfamiliar with the jargon of computer science.
For example, if you don’t know what the meaning of “pragma” is, you could look it up right now. (If you don’t know what the meaning of “is” is, we can’t help you with that.)

The Standard Distribution

Most operating system vendors these days include Perl as a standard component of their systems. As of this writing, AIX, BeOS, BSDI, Debian, DG/UX, DYNIX/ptx, FreeBSD, IRIX, LynxOS, Mac OS X, OpenBSD, OS390, RedHat, SINIX, Slackware, Solaris, SuSE, and Tru64 all came with Perl as part of their standard distributions. Some companies provide Perl on separate CDs of contributed freeware or through their customer service groups. Third-party companies like ActiveState offer prebuilt Perl distributions for a variety of different operating systems, including those from Microsoft.

Even if your vendor does ship Perl as standard, you’ll probably eventually want to compile and install Perl on your own. That way you’ll know you have the latest version, and you’ll be able to choose where to install your libraries and documentation. You’ll also be able to choose whether to compile Perl with support for optional extensions such as multithreading, large files, or the many low-level debugging options available through the -D command-line switch. (The user-level Perl debugger is always supported.)

The easiest way to download a Perl source kit is probably to point your web browser to Perl’s home page at www.perl.com, where you’ll find download information prominently featured on the start-up page, along with links to precompiled binaries for platforms that have misplaced their C compilers. You can also head directly to CPAN (the Comprehensive Perl Archive Network, described in Chapter 22, CPAN), using http://www.perl.com/CPAN or http://www.cpan.org. If those are too slow for you (and they might be, because they’re very popular), you should find a mirror close to you.
The following URLs are just a few of the CPAN mirrors around the world, now numbering over one hundred:

http://www.funet.fi/pub/languages/perl/CPAN/
ftp://ftp.funet.fi/pub/languages/perl/CPAN/
ftp://ftp.cs.colorado.edu/pub/perl/CPAN/
ftp://ftp.cise.ufl.edu/pub/perl/CPAN/
ftp://ftp.perl.org/pub/perl/CPAN/
http://www.perl.com/CPAN-local
http://www.cpan.org/
http://www.perl.org/CPAN/
http://www.cs.uu.nl/mirror/CPAN/
http://CPAN.pacific.net.hk/

The first pair in that list, those at the funet.fi site, point to the master CPAN repository. The MIRRORED.BY file there contains a list of all other CPAN sites, so you can just get that file and then pick your favorite mirror. Some of them are available through FTP, others through HTTP (which makes a difference behind some corporate firewalls). The http://www.perl.com/CPAN multiplexor attempts to make this selection for you. You can change your selection later if you like.

Once you’ve fetched the source code and unpacked it into a directory, you should read the README and the INSTALL files to learn how to build Perl. There may also be an INSTALL.platform file for you to read there, where platform represents your operating system platform.

If your platform happens to be some variety of Unix, then your commands to fetch, configure, build, and install Perl might resemble what follows. First, you must choose a command to fetch the source code. You can fetch with ftp:

% ftp ftp://ftp.funet.fi/pub/languages/perl/CPAN/src/latest.tar.gz

(Again, feel free to substitute a nearby CPAN mirror. Of course, if you live in Finland, that is your nearby CPAN mirror.) If you can’t use ftp, you can download via the Web using a browser or a command-line tool:

% wget http://www.funet.fi/pub/languages/perl/CPAN/src/latest.tar.gz

Now unpack, configure, build, and install:

% tar zxf latest.tar.gz        # Or gunzip first, then tar xf.
% cd perl-5.6.0                # Or 5.* for whatever number.
% sh Configure -des            # Assumes default answers.
% make test && make install    # Install typically requires superuser.

This uses a conventional C development environment, so if you don’t have a C compiler, you can’t compile Perl. See the CPAN ports directory for up-to-date status on each platform to learn whether Perl comes bundled (and if so, what version), whether you can get by with the standard source kit, or whether you need a special port. Download links are given for those systems that typically require special ports or for systems from vendors who normally don’t provide a C compiler (or rather, who abnormally don’t provide a C compiler).

Online Documentation

Perl’s extensive online documentation comes as part of the standard Perl distribution. (See the next section for offline documentation.) Additional documentation shows up whenever you install a module from CPAN. When we refer to a “Perl manpage” in this book, we’re talking about this set of online Perl manual pages, sitting on your computer. The term manpage is purely a convention meaning a file containing documentation—you don’t need a Unix-style man program to read one. You may even have the Perl manpages installed as HTML pages, especially on non-Unix systems.

The online manpages for Perl have been divided into separate sections, so you can easily find what you are looking for without wading through hundreds of pages of text. Since the top-level manpage is simply called perl, the Unix command man perl should take you to it.* That page in turn directs you to more specific pages. For example, man perlre will display the manpage for Perl’s regular expressions. The perldoc command often works on systems where the man command won’t. On Macs, you need to use the Shuck program. Your port may also provide the Perl manpages in HTML format or your system’s native help format. Check with your local sysadmin—unless you’re the local sysadmin.
Navigating the Standard Manpages

In the Beginning (of Perl, that is, back in 1987), the perl manpage was a terse document, filling about 24 pages when typeset and printed. For example, its section on regular expressions was only two paragraphs long. (That was enough, if you knew egrep.) In some ways, nearly everything has changed since then. Counting the standard documentation, the various utilities, the per-platform porting information, and the scads of standard modules, we’re now up over 1,500 typeset pages of documentation spread across many separate manpages. (And that’s not even counting any CPAN modules you install, which is likely to be quite a few.)

But in other ways, nothing has changed: there’s still a perl manpage kicking around. And it’s still the right place to start when you don’t know where to start. The difference is that once you arrive, you can’t just stop there. Perl documentation is no longer a cottage industry; it’s a supermall with hundreds of stores. When you walk in the door, you need to find the YOU ARE HERE to figure out which shop or department store sells what you’re shopping for. Of course, once you get familiar with the mall, you’ll usually know right where to go.

* If you still get a truly humongous page when you do that, you’re probably picking up the ancient release 4 manpage. Check your MANPATH for archeological sites. (Say perldoc perl to find out how to configure your MANPATH based on the output of perl -V:man.dir.)
Here are a few of the store signs you’ll see:

Manpage      Covers
perl         What Perl manpages are available
perldata     Data types
perlsyn      Syntax
perlop       Operators and precedence
perlre       Regular expressions
perlvar      Predefined variables
perlsub      Subroutines
perlfunc     Built-in functions
perlmod      How to make Perl modules work
perlref      References
perlobj      Objects
perlipc      Interprocess communication
perlrun      How to run Perl commands, plus switches
perldebug    Debugging
perldiag     Diagnostic messages

That’s just a small excerpt, but it has the important parts. You can tell that if you want to learn about an operator, perlop is apt to have what you’re looking for. And if you want to find something out about predefined variables, you’d check in perlvar. If you got a diagnostic message you didn’t understand, you’d go to perldiag. And so on.

Part of the standard Perl manual is the frequently asked questions (FAQ) list. It’s split up into these nine different pages:

Manpage      Covers
perlfaq1     General questions about Perl
perlfaq2     Obtaining and learning about Perl
perlfaq3     Programming tools
perlfaq4     Data manipulation
perlfaq5     Files and formats
perlfaq6     Regular expressions
perlfaq7     General Perl language issues
perlfaq8     System interaction
perlfaq9     Networking

Some manpages contain platform-specific notes:

Manpage      Covers
perlamiga    The Amiga port
perlcygwin   The Cygwin port
perldos      The MS-DOS port
perlhpux     The HP-UX port
perlmachten  The Power MachTen port
perlos2      The OS/2 port
perlos390    The OS/390 port
perlvms      The DEC VMS port
perlwin32    The MS-Windows port

(See also Chapter 25, Portable Perl, and the CPAN ports directory described earlier for porting information.)

Searching the Manpages

Nobody expects you to read through all 1,500 typeset pages just to find a needle in a haystack. There’s an old saying that you can’t grep dead trees.*
Besides the customary search capabilities inherent in most document-viewing programs, as of the 5.6.1 release of Perl, each main Perl manpage has its own search and display capability. You can search individual pages by using the name of the manpage as the command and passing a Perl regular expression (see Chapter 5, Pattern Matching) as the search pattern:

% perlop comma
% perlfunc split
% perlvar ARGV
% perldiag ’assigned to typeglob’

When you don’t quite know where something is in the documentation, you can expand your search. For example, to search all the FAQs, use the perlfaq command (which is also a manpage):

% perlfaq round

The perltoc command (which is also a manpage) searches all the manpages’ collective tables of contents:

% perltoc typeglob
perl5005delta: Undefined value assigned to typeglob
perldata: Typeglobs and Filehandles
perldiag: Undefined value assigned to typeglob

Or to search the complete online Perl manual, including all headers, descriptions, and examples, for any instances of the string, use the perlhelp command:

% perlhelp CORE::GLOBAL

See the perldoc manpage for details.

Non-Perl Manpages

When we refer to non-Perl documentation, as in getitimer (2), this refers to the getitimer manpage from section 2 of the Unix Programmer’s Manual.* Manpages for syscalls such as getitimer may not be available on non-Unix systems, but that’s probably okay, because you couldn’t use the Unix syscall there anyway. If you really do need the documentation for a Unix command, syscall, or library function, many organizations have put their manpages on the web—a quick search of AltaVista for “+crypt(3) +manual” will find many copies.

Although the top-level Perl manpages are typically installed in section 1 of the standard man directories, we will omit appending a (1) to those manpage names in this book. You can recognize them anyway because they are all of the form “perlmumble”.

* Don’t forget there’s a Glossary if you need it.
Offline Documentation

If you'd like to learn more about Perl, here are some related publications that we recommend:

• Perl 5 Pocket Reference, 3d ed., by Johan Vromans (O'Reilly, 2000). This small booklet serves as a convenient quick reference for Perl.

• Perl Cookbook, by Tom Christiansen and Nathan Torkington (O'Reilly, 1998). This is the companion volume to the book you have in your hands right now.

* Section 2 is only supposed to contain direct calls into the operating system. (These are often called "system calls", but we'll consistently call them syscalls in this book to avoid confusion with the system function, which has nothing to do with syscalls.) However, systems vary somewhat in which calls are implemented as syscalls and which are implemented as C library calls, so you could conceivably find getitimer (2) in section 3 instead.

• Elements of Programming with Perl, by Andrew L. Johnson (Manning, 1999). This book aims to teach non-programmers how to program from the ground up, and to do so using Perl.

• Learning Perl, 2d ed., by Randal Schwartz and Tom Christiansen (O'Reilly, 1997). This book teaches Unix sysadmins and Unix programmers the 30% of basic Perl that they'll use 70% of the time. Erik Olson retargeted a version of this book for Perl programmers on Microsoft systems; it is called Learning Perl for Win32 Systems.

• Perl: The Programmer's Companion, by Nigel Chapman (Wiley, 1997). This fine book is geared for professional computer scientists and programmers without regard to platform. It covers Perl quickly but completely.

• Mastering Regular Expressions, by Jeffrey Friedl (O'Reilly, 1997). Although it doesn't cover the latest additions to Perl regular expressions, this book is an invaluable reference for anyone seeking to learn how regular expressions really work.

• Object Oriented Perl, by Damian Conway (Manning, 1999).
For beginning as well as advanced OO programmers, this astonishing book explains common and esoteric techniques for writing powerful object systems in Perl.

• Mastering Algorithms with Perl, by Jon Orwant, Jarkko Hietaniemi, and John Macdonald (O'Reilly, 1999). All the useful techniques from a computer science algorithms course, but without the painful proofs. This book covers fundamental and useful algorithms in the fields of graphs, text, sets, and much more.

• Writing Apache Modules with Perl and C, by Lincoln Stein and Doug MacEachern (O'Reilly, 1999). This guide to web programming teaches you how to extend the capabilities of the Apache web server, especially using the turbocharged mod_perl for fast CGI scripts and via the Perl-accessible Apache API.

• The Perl Journal, edited by Jon Orwant. This quarterly magazine by programmers and for programmers regularly features programming insights, techniques, the latest news, and more.

There are many other Perl books and publications out there, and out of senility, we have undoubtedly forgotten to mention some good ones. (Out of mercy, we have neglected to mention some bad ones.)

In addition to the Perl-related publications listed above, we recommend the following books. They aren't about Perl directly but still come in handy for reference, consultation, and inspiration.

• The Art of Computer Programming, by Donald Knuth, vol. 1, Fundamental Algorithms; vol. 2, Seminumerical Algorithms; and vol. 3, Sorting and Searching (Addison-Wesley, 1998).

• Introduction to Algorithms, by Cormen, Leiserson, and Rivest (MIT Press and McGraw-Hill, 1990).

• Algorithms in C: Fundamental Data Structures, Sorting, Searching, 3d ed., by Robert Sedgewick (Addison-Wesley, 1997).

• The Elements of Programming Style, by Kernighan and Plauger (Prentice-Hall, 1988).

• The Unix Programming Environment, by Kernighan and Pike (Prentice-Hall, 1984).

• POSIX Programmer's Guide, by Donald Lewine (O'Reilly, 1991).
• Advanced Programming in the UNIX Environment, by W. Richard Stevens (Addison-Wesley, 1992).

• TCP/IP Illustrated, vols. 1–3, by W. Richard Stevens (Addison-Wesley, 1994–1996).

• The Lord of the Rings, by J. R. R. Tolkien (most recent printing: Houghton Mifflin, 1999).

Additional Resources

The Internet is a wonderful invention, and we're all still discovering how to use it to its full potential. (Of course, some people prefer to "discover" the Internet the way Tolkien discovered Middle-earth.)

Perl on the Web

Visit the Perl home page at http://www.perl.com/. It tells what's new in the Perl world and contains source code and ports, feature articles, documentation, conference schedules, and a lot more.

Also visit the Perl Mongers' web page at http://www.perl.org for a grassroots-level view of Perl's, er, grass roots, which grow quite thickly in every part of the world, except at the South Pole, where they have to be kept indoors. Local PM groups hold regular small meetings where you can exchange Perl lore with other Perl hackers who live in your part of the world.

Usenet Newsgroups

The Perl newsgroups are a great, if sometimes cluttered, source of information about Perl. Your first stop might be comp.lang.perl.moderated, a moderated, low-traffic newsgroup that includes announcements and technical discussions. Because of the moderation, the newsgroup is quite readable.

The high-traffic comp.lang.perl.misc group discusses everything from technical issues to Perl philosophy to Perl games and Perl poetry. Like Perl itself, comp.lang.perl.misc is meant to be useful, and no question is too silly to ask.*

The comp.lang.perl.tk group discusses how to use the popular Tk toolkit from Perl. The comp.lang.perl.modules group is about the development and use of Perl modules, which are the best way to get reusable code. There may be other comp.lang.perl.whatever newsgroups by the time you read this; look around.
If you aren't using a regular newsreader to access Usenet, but a web browser instead, prepend "news:" to the newsgroup name to get at one of these named newsgroups. (This only works if you have a news server.) Alternatively, if you use a Usenet searching service like Alta Vista or Deja, specify "*perl*" as the newsgroups to search for.

One other newsgroup you might want to check out, at least if you're doing CGI programming on the Web, is comp.infosystems.www.authoring.cgi. While it isn't strictly speaking a Perl group, most of the programs discussed there are written in Perl. It's the right place to go for web-related Perl issues, unless you're using mod_perl under Apache, in which case you might check out comp.infosystems.www.servers.unix.

Bug Reports

In the unlikely event that you should encounter a bug that's in Perl proper and not just in your own program, you should try to reduce it to a minimal test case and then report it with the perlbug program that comes with Perl. See http://bugs.perl.org for more info.

* Of course, some questions are too silly to answer. (Especially those already answered in the online manpages and FAQs. Why ask for help on a newsgroup when you could find the answer by yourself in less time than it takes to type in the question?)

Conventions Used in This Book

Some of our conventions get larger sections of their very own. Coding conventions are discussed in the section "Programming with Style" in Chapter 24, Common Practices. In a sense, our lexical conventions are given in the Glossary (our lexicon).

The following typographic conventions are used in this book:

Italic
is used for URLs, manpages, pathnames, and programs. New terms are also italicized when they first appear in the text. Many of these terms will have alternative definitions in the Glossary if the one in the text doesn't do it for you.

Constant width
is used in examples and in regular text to show any literal code.
Data values are represented by constant width in quotes (" "), which are not part of the value.

Constant width bold
is used for command-line switches. This allows one to distinguish, for example, between the -w warnings switch and the -w filetest operator. It is also used in the examples to indicate the text you type in literally.

Constant width italic
is used for generic code terms for which you must substitute particular values.

We give lots of examples, most of which are pieces of code that should go into a larger program. Some examples are complete programs, which you can recognize because they begin with a #! line. We start nearly all of our longer programs with:

    #!/usr/bin/perl

Still other examples are things to be typed on a command line. We've used % to indicate a generic shell prompt:

    % perl -e 'print "Hello, world.\n"'
    Hello, world.

This style is representative of a standard Unix command line, where single quotes represent the "most quoted" form. Quoting and wildcard conventions on other systems vary. For example, many command-line interpreters under MS-DOS and VMS require double quotes instead of single quotes when you need to group arguments with spaces or wildcards in them.

Acknowledgments

Here we say nice things in public about our reviewers to make up for all the rude things we said to them in private: Todd Miller, Sharon Hopkins Rauenzahn, Rich Rauenzahn, Paul Marquess, Paul Grassie, Nathan Torkington, Johan Vromans, Jeff Haemer, Gurusamy Sarathy, Gloria Wall, Dan Sugalski, and Abigail.

We'd like to express our special gratitude to Tim O'Reilly (and his Associates) for encouraging authors to write the sort of books people might enjoy reading.

We'd Like to Hear from You

We have tested and verified all of the information in this book to the best of our ability, but you may find that features have changed (or even that we have made mistakes!).
Please let us know about any errors you find, as well as your suggestions for future editions, by writing:

    O'Reilly & Associates, Inc.
    101 Morris Street
    Sebastopol, CA 95472
    1-800-998-9938 (in the US or Canada)
    1-707-829-0515 (international/local)
    1-707-829-0104 (fax)

You can also send messages electronically. To be put on the O'Reilly mailing list or request a catalog, send mail to info@oreilly.com. To ask technical questions or comment on this book, send mail to bookquestions@oreilly.com.

We have a web site for the book, where we'll list any errata and other Camel-related information:

    http://www.oreilly.com/catalog/pperl3

Here you'll also find all the example code from the book available for download so you don't have to type it all in, like we did.

II The Gory Details

1 An Overview of Perl

Getting Started

We think that Perl is an easy language to learn and use, and we hope to convince you that we're right. One thing that's easy about Perl is that you don't have to say much before you say what you want to say. In many programming languages, you have to declare the types, variables, and subroutines you are going to use before you can write the first statement of executable code. And for complex problems demanding complex data structures, declarations are a good idea. But for many simple, everyday problems, you'd like a programming language in which you can simply say:

    print "Howdy, world!\n";

and expect the program to do just that. Perl is such a language. In fact, this example is a complete program,* and if you feed it to the Perl interpreter, it will print "Howdy, world!" on your screen. (The \n in the example produces a newline at the end of the output.)

And that's that. You don't have to say much after you say what you want to say, either. Unlike many languages, Perl thinks that falling off the end of your program is just a normal way to exit the program.
You certainly may call the exit function explicitly if you wish, just as you may declare some of your variables, or even force yourself to declare all your variables. But it's your choice. With Perl you're free to do The Right Thing, however you care to define it.

There are many other reasons why Perl is easy to use, but it would be pointless to list them all here, because that's what the rest of the book is for. The devil may be in the details, as they say, but Perl tries to help you out down there in the hot place too. At every level, Perl is about helping you get from here to there with minimum fuss and maximum enjoyment. That's why so many Perl programmers go around with a silly grin on their face.

* Or script, or application, or executable, or doohickey. Whatever.

This chapter is an overview of Perl, so we're not trying to present Perl to the rational side of your brain. Nor are we trying to be complete, or logical. That's what the following chapters are for. Vulcans, androids, and like-minded humans should skip this overview and go straight to Chapter 2, Bits and Pieces, for maximum information density. If, on the other hand, you're looking for a carefully paced tutorial, you should probably get Randal's nice book, Learning Perl (published by O'Reilly & Associates). But don't throw this book out just yet.

This chapter presents Perl to the other side of your brain, whether you prefer to call it associative, artistic, passionate, or merely spongy. To that end, we'll be presenting various views of Perl that will give you as clear a picture of Perl as the blind men had of the elephant. Well, okay, maybe we can do better than that. We're dealing with a camel here (see the cover). Hopefully, at least one of these views of Perl will help get you over the hump.

Natural and Artificial Languages

Languages were first invented by humans, for the benefit of humans.
In the annals of computer science, this fact has occasionally been forgotten.* Since Perl was designed (loosely speaking) by an occasional linguist, it was designed to work smoothly in the same ways that natural language works smoothly. Naturally, there are many aspects to this, since natural language works well at many levels simultaneously.

We could enumerate many of these linguistic principles here, but the most important principle of language design is that easy things should be easy, and hard things should be possible. (Actually, that's two principles.) They may seem obvious to you, but many computer languages fail at one or the other. Natural languages are good at both because people are continually trying to express both easy things and hard things, so the language evolves to handle both.

Perl was designed first of all to evolve, and indeed it has evolved. Many people have contributed to the evolution of Perl over the years. We often joke that a camel is a horse designed by a committee, but if you think about it, the camel is pretty well adapted for life in the desert. The camel has evolved to be relatively self-sufficient. (On the other hand, the camel has not evolved to smell good. Neither has Perl.) This is one of the many strange reasons we picked the camel to be Perl's mascot, but it doesn't have much to do with linguistics.

* More precisely, this fact has occasionally been remembered.

Now when someone utters the word "linguistics", many folks focus in on one of two things. Either they think of words, or they think of sentences. But words and sentences are just two handy ways to "chunk" speech. Either may be broken down into smaller units of meaning or combined into larger units of meaning. And the meaning of any unit depends heavily on the syntactic, semantic, and pragmatic context in which the unit is located.

Natural language has words of various sorts: nouns and verbs and such.
If someone says "dog" in isolation, you think of it as a noun, but you can also use the word in other ways. That is, a noun can function as a verb, an adjective, or an adverb when the context demands it. If you dog a dog during the dog days of summer, you'll be a dog tired dogcatcher.*

Perl also evaluates words differently in various contexts. We will see how it does that later. Just remember that Perl is trying to understand what you're saying, like any good listener does. Perl works pretty hard to try to keep up its end of the bargain. Just say what you mean, and Perl will usually "get it". (Unless you're talking nonsense, of course—the Perl parser understands Perl a lot better than either English or Swahili.)

But back to nouns. A noun can name a particular object, or it can name a class of objects generically without specifying which one is currently being referred to. Most computer languages make this distinction, only we call the particular one a value and the generic one a variable. A value just exists somewhere, who knows where, but a variable gets associated with one or more values over its lifetime. So whoever is interpreting the variable has to keep track of that association. That interpreter may be in your brain or in your computer.

Variable Syntax

A variable is just a handy place to keep something, a place with a name, so you know where to find your special something when you come back looking for it later. As in real life, there are various kinds of places to store things, some of them rather private, and some of them out in public. Some places are temporary, and other places are more permanent. Computer scientists love to talk about the "scope" of variables, but that's all they mean by it. Perl has various handy ways of dealing with scoping issues, which you'll be happy to learn later when the time is right. Which is not yet.
(Look up the adjectives local, my, and our in Chapter 29, Functions, when you get curious, or see "Scoped Declarations" in Chapter 4, Statements and Declarations.)

But a more immediately useful way of classifying variables is by what sort of data they can hold. As in English, Perl's primary type distinction is between singular and plural data. Strings and numbers are singular pieces of data, while lists of strings or numbers are plural. (And when we get to object-oriented programming, you'll find that the typical object looks singular from the outside but plural from the inside, like a class of students.)

* And you're probably dog tired of all this linguistics claptrap. But we'd like you to understand why Perl is different from the typical computer language, doggone it!

We call a singular variable a scalar, and a plural variable an array. Since a string can be stored in a scalar variable, we might write a slightly longer (and commented) version of our first example like this:

    $phrase = "Howdy, world!\n";  # Set a variable.
    print $phrase;                # Print the variable.

Note that we did not have to predefine what kind of variable $phrase is. The $ character tells Perl that phrase is a scalar variable, that is, one containing a singular value. An array variable, by contrast, would start with an @ character. (It may help you to remember that a $ is a stylized "s", for "scalar", while @ is a stylized "a", for "array".)

Perl has some other variable types, with unlikely names like "hash", "handle", and "typeglob". Like scalars and arrays, these types of variables are also preceded by funny characters.
For completeness, here are all the funny characters you'll encounter:

    Type        Character  Example    Is a name for:
    Scalar      $          $cents     An individual value (number or string)
    Array       @          @large     A list of values, keyed by number
    Hash        %          %interest  A group of values, keyed by string
    Subroutine  &          &how       A callable chunk of Perl code
    Typeglob    *          *struck    Everything named struck

Some language purists point to these funny characters as a reason to abhor Perl. This is superficial. These characters have many benefits, not least of which is that variables can be interpolated into strings with no additional syntax. Perl scripts are also easy to read (for people who have bothered to learn Perl!) because the nouns stand out from verbs. And new verbs can be added to the language without breaking old scripts. (We told you Perl was designed to evolve.) And the noun analogy is not frivolous—there is ample precedent in English and other languages for requiring grammatical noun markers. It's how we think! (We think.)

Singularities

From our earlier example, you can see that scalars may be assigned a new value with the = operator, just as in many other computer languages. Scalar variables can be assigned any form of scalar value: integers, floating-point numbers, strings, and even esoteric things like references to other variables, or to objects. There are many ways of generating these values for assignment.
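To make the table of funny characters concrete, here is a small sketch of our own (the variable names are borrowed from the table; this is not code from the book):

```perl
@large = (1, 2, 3);            # an array, marked with @
%interest = ("rate", 5.25);    # a hash, marked with %
$cents = 42;                   # a scalar, marked with $

# A scalar interpolates into a double-quoted string with no extra syntax:
print "I have $cents cents\n";

# Individual elements of arrays and hashes are singular, so they use $:
print "first: $large[0], rate: $interest{rate}\n";
```

Run as a program, this prints "I have 42 cents" and then "first: 1, rate: 5.25".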
As in the Unix* shell, you can use different quoting mechanisms to make different kinds of values. Double quotation marks (double quotes) do variable interpolation† and backslash interpolation (such as turning \n into a newline), while single quotes suppress interpolation. And backquotes (the ones leaning to the left) will execute an external program and return the output of the program, so you can capture it as a single string containing all the lines of output.

    $answer = 42;                # an integer
    $pi = 3.14159265;            # a "real" number
    $avocados = 6.02e23;         # scientific notation
    $pet = "Camel";              # string
    $sign = "I love my $pet";    # string with interpolation
    $cost = 'It costs $100';     # string without interpolation
    $thence = $whence;           # another variable's value
    $salsa = $moles * $avocados; # a gastrochemical expression
    $exit = system("vi $file");  # numeric status of a command
    $cwd = `pwd`;                # string output from a command

And while we haven't covered fancy values yet, we should point out that scalars may also hold references to other data structures, including subroutines and objects.

    $ary = \@myarray;  # reference to a named array
    $hsh = \%myhash;   # reference to a named hash
    $sub = \&mysub;    # reference to a named subroutine

    $ary = [1,2,3,4,5];          # reference to an unnamed array
    $hsh = {Na => 19, Cl => 35}; # reference to an unnamed hash
    $sub = sub { print $state }; # reference to an unnamed subroutine

    $fido = new Camel "Amelia";  # reference to an object

If you use a variable that has never been assigned a value, the uninitialized variable automatically springs into existence as needed. Following the principle of least surprise, the variable is created with a null value, either "" or 0. Depending on where you use them, variables will be interpreted automatically as strings, as numbers, or as "true" and "false" values (commonly called Boolean values). Remember how important context is in human languages. In Perl, various operators expect certain kinds of singular values as parameters, so we will speak of those operators as "providing" or "supplying" a scalar context to those parameters. Sometimes we'll be more specific, and say it supplies a numeric context, a string context, or a Boolean context to those parameters. (Later we'll also talk about list

* Here and elsewhere, when we say Unix, we mean any operating system resembling Unix, including BSD, Linux, and, of course, Unix.
† Sometimes called "substitution" by shell programmers, but we prefer to reserve that word for something else in Perl. So please call it interpolation. We're using the term in the textual sense ("this passage is a Gnostic interpolation") rather than in the mathematical sense ("this point on the graph is an interpolation between two other points").

context, which is the opposite of scalar context.) Perl will automatically convert the data into the form required by the current context, within reason. For example, suppose you said this:

    $camels = '123';
    print $camels + 1, "\n";

The original value of $camels is a string, but it is converted to a number to add 1 to it, and then converted back to a string to be printed out as 124. The newline, represented by "\n", is also in string context, but since it's already a string, no conversion is necessary. But notice that we had to use double quotes there—using single quotes to say '\n' would result in a two-character string consisting of a backslash followed by an "n", which is not a newline by anybody's definition.

So, in a sense, double quotes and single quotes are yet another way of specifying context. The interpretation of the innards of a quoted string depends on which quotes you use. (Later, we'll see some other operators that work like quotes syntactically but use the string in some special way, such as for pattern matching or substitution. These all work like double-quoted strings too. The double-quote context is the "interpolative" context of Perl, and is supplied by many operators that don't happen to resemble double quotes.)

Similarly, a reference behaves as a reference when you give it a "dereference" context, but otherwise acts like a simple scalar value. For example, we might say:

    $fido = new Camel "Amelia";
    if (not $fido) { die "dead camel"; }
    $fido->saddle();

Here we create a reference to a Camel object and put it into the variable $fido.
On the next line, we test $fido as a scalar Boolean to see if it is "true", and we throw an exception (that is, we complain) if it is not true, which in this case would mean that the new Camel constructor failed to make a proper Camel object. But on the last line, we treat $fido as a reference by asking it to look up the saddle() method for the object held in $fido, which happens to be a Camel, so Perl looks up the saddle() method for Camel objects. More about that later. For now, just remember that context is important in Perl because that's how Perl knows what you want without your having to say it explicitly, as many other computer languages force you to do.

Pluralities

Some kinds of variables hold multiple values that are logically tied together. Perl has two types of multivalued variables: arrays and hashes. In many ways, these behave like scalars—they spring into existence with nothing in them when needed, for instance. But they are different from scalars in that, when you assign to them, they supply a list context to the right side of the assignment rather than a scalar context.

Arrays and hashes also differ from each other. You'd use an array when you want to look something up by number. You'd use a hash when you want to look something up by name. The two concepts are complementary. You'll often see people using an array to translate month numbers into month names, and a corresponding hash to translate month names back into month numbers. (Though hashes aren't limited to holding only numbers. You could have a hash that translates month names to birthstone names, for instance.)

Arrays. An array is an ordered list of scalars, accessed* by the scalar's position in the list. The list may contain numbers, or strings, or a mixture of both. (It might also contain references to subarrays or subhashes.)
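The month-lookup idiom mentioned a moment ago can be sketched like this (our own illustration, not code from the book):

```perl
# An array translates month numbers into month names...
@month_name = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

# ...and a corresponding hash translates the names back into numbers.
for $i (0 .. $#month_name) {
    $month_number{ $month_name[$i] } = $i;
}

print $month_name[3], "\n";        # prints "Apr" (arrays count from 0)
print $month_number{"Apr"}, "\n";  # prints 3
```

The array is the right tool for number-to-name, and the hash for name-to-number, which is exactly the complementarity described above.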
To assign a list value to an array, you simply group the values together (with a set of parentheses):

    @home = ("couch", "chair", "table", "stove");

Conversely, if you use @home in a list context, such as on the right side of a list assignment, you get back out the same list you put in. So you could set four scalar variables from the array like this:

    ($potato, $lift, $tennis, $pipe) = @home;

These are called list assignments. They logically happen in parallel, so you can swap two variables by saying:

    ($alpha, $omega) = ($omega, $alpha);

As in C, arrays are zero-based, so while you would talk about the first through fourth elements of the array, you would get to them with subscripts 0 through 3.† Array subscripts are enclosed in square brackets [like this], so if you want to select an individual array element, you would refer to it as $home[n], where n is the subscript (one less than the element number) you want. See the example that follows. Since the element you are dealing with is a scalar, you always precede it with a $.

* Or keyed, or indexed, or subscripted, or looked up. Take your pick.

† If this seems odd to you, just think of the subscript as an offset, that is, the count of how many array elements come before it. Obviously, the first element doesn't have any elements before it, and so has an offset of 0. This is how computers think. (We think.)

If you want to assign to one array element at a time, you could write the earlier assignment as:

    $home[0] = "couch";
    $home[1] = "chair";
    $home[2] = "table";
    $home[3] = "stove";

Since arrays are ordered, you can do various useful operations on them, such as the stack operations push and pop. A stack is, after all, just an ordered list, with a beginning and an end. Especially an end. Perl regards the end of your array as the top of a stack. (Although most Perl programmers think of an array as horizontal, with the top of the stack on the right.)

Hashes.
A hash is an unordered set of scalars, accessed* by some string value that is associated with each scalar. For this reason hashes are often called associative arrays. But that's too long for lazy typists to type, and we talk about them so often that we decided to name them something short and snappy. The other reason we picked the name "hash" is to emphasize the fact that they're disordered. (They are, coincidentally, implemented internally using a hash-table lookup, which is why hashes are so fast, and stay so fast no matter how many values you put into them.) You can't push or pop a hash though, because it doesn't make sense. A hash has no beginning or end. Nevertheless, hashes are extremely powerful and useful. Until you start thinking in terms of hashes, you aren't really thinking in Perl. Figure 1-1 shows the ordered elements of an array and the unordered (but named) elements of a hash.

Since the keys to a hash are not automatically implied by their position, you must supply the key as well as the value when populating a hash. You can still assign a list to it like an ordinary array, but each pair of items in the list will be interpreted as a key and a value. Since we're dealing with pairs of items, hashes use the funny character % to mark hash names. (If you look carefully at the % character, you can see the key and the value with a slash between them. It may help to squint.)

Suppose you wanted to translate abbreviated day names to the corresponding full names. You could write the following list assignment:

    %longday = ("Sun", "Sunday", "Mon", "Monday", "Tue", "Tuesday",
                "Wed", "Wednesday", "Thu", "Thursday", "Fri", "Friday",
                "Sat", "Saturday");

But that's rather difficult to read, so Perl provides the => (equals sign, greater-than sign) sequence as an alternative separator to the comma. Using this syntactic sugar

* Or keyed, or indexed, or subscripted, or looked up. Take your pick.
Figure 1-1. An array and a hash

(and some creative formatting), it is much easier to see which strings are the keys and which strings are the associated values.

    %longday = (
        "Sun" => "Sunday",
        "Mon" => "Monday",
        "Tue" => "Tuesday",
        "Wed" => "Wednesday",
        "Thu" => "Thursday",
        "Fri" => "Friday",
        "Sat" => "Saturday",
    );

Not only can you assign a list to a hash, as we did above, but if you mention a hash in list context, it'll convert the hash back to a list of key/value pairs, in a weird order. This is occasionally useful. More often people extract a list of just the keys, using the (aptly named) keys function. The key list is also unordered, but can easily be sorted if desired, using the (aptly named) sort function. Then you can use the ordered keys to pull out the corresponding values in the order you want.

Because hashes are a fancy kind of array, you select an individual hash element by enclosing the key in braces (those fancy brackets also known as "curlies"). So, for example, if you want to find out the value associated with Wed in the hash above, you would use $longday{"Wed"}. Note again that you are dealing with a scalar value, so you use $ on the front, not %, which would indicate the entire hash.

Linguistically, the relationship encoded in a hash is genitive or possessive, like the word "of" in English, or like "'s". The wife of Adam is Eve, so we write:

    $wife{"Adam"} = "Eve";

Complexities

Arrays and hashes are lovely, simple, flat data structures. Unfortunately, the world does not always cooperate with our attempts to oversimplify. Sometimes you need to build not-so-lovely, not-so-simple, not-so-flat data structures. Perl lets you do this by pretending that complicated values are really simple ones.
To put it the other way around, Perl lets you manipulate simple scalar references that happen to refer to complicated arrays and hashes. We do this all the time in natural language when we use a simple singular noun like "government" to represent an entity that is completely convoluted and inscrutable. Among other things.

To extend our previous example, suppose we want to switch from talking about Adam’s wife to Jacob’s wife. Now, as it happens, Jacob had four wives. (Don’t try this at home.) In trying to represent this in Perl, we find ourselves in the odd situation where we’d like to pretend that Jacob’s four wives were really one wife. (Don’t try this at home, either.) You might think you could write it like this:

    $wife{"Jacob"} = ("Leah", "Rachel", "Bilhah", "Zilpah");   # WRONG

But that wouldn’t do what you want, because even parentheses and commas are not powerful enough to turn a list into a scalar in Perl. (Parentheses are used for syntactic grouping, and commas for syntactic separation.) Rather, you need to tell Perl explicitly that you want to pretend that a list is a scalar. It turns out that square brackets are powerful enough to do that:

    $wife{"Jacob"} = ["Leah", "Rachel", "Bilhah", "Zilpah"];   # ok

That statement creates an unnamed array and puts a reference to it into the hash element $wife{"Jacob"}. So we have a named hash containing an unnamed array. This is how Perl deals with both multidimensional arrays and nested data structures. As with ordinary arrays and hashes, you can also assign individual elements, like this:

    $wife{"Jacob"}[0] = "Leah";
    $wife{"Jacob"}[1] = "Rachel";
    $wife{"Jacob"}[2] = "Bilhah";
    $wife{"Jacob"}[3] = "Zilpah";

You can see how that looks like a multidimensional array with one string subscript and one numeric subscript. To see something that looks more tree-structured, like a nested data structure, suppose we wanted to list not only Jacob’s wives but all the sons of each of his wives.
In this case we want to treat a hash as a scalar. We can use braces for that. (Inside each hash value we’ll use square brackets to represent arrays, just as we did earlier. But now we have an array in a hash in a hash.)

    $kids_of_wife{"Jacob"} = {
        "Leah"   => ["Reuben", "Simeon", "Levi", "Judah", "Issachar", "Zebulun"],
        "Rachel" => ["Joseph", "Benjamin"],
        "Bilhah" => ["Dan", "Naphtali"],
        "Zilpah" => ["Gad", "Asher"],
    };

That would be more or less equivalent to saying:

    $kids_of_wife{"Jacob"}{"Leah"}[0]   = "Reuben";
    $kids_of_wife{"Jacob"}{"Leah"}[1]   = "Simeon";
    $kids_of_wife{"Jacob"}{"Leah"}[2]   = "Levi";
    $kids_of_wife{"Jacob"}{"Leah"}[3]   = "Judah";
    $kids_of_wife{"Jacob"}{"Leah"}[4]   = "Issachar";
    $kids_of_wife{"Jacob"}{"Leah"}[5]   = "Zebulun";
    $kids_of_wife{"Jacob"}{"Rachel"}[0] = "Joseph";
    $kids_of_wife{"Jacob"}{"Rachel"}[1] = "Benjamin";
    $kids_of_wife{"Jacob"}{"Bilhah"}[0] = "Dan";
    $kids_of_wife{"Jacob"}{"Bilhah"}[1] = "Naphtali";
    $kids_of_wife{"Jacob"}{"Zilpah"}[0] = "Gad";
    $kids_of_wife{"Jacob"}{"Zilpah"}[1] = "Asher";

You can see from this that adding a level to a nested data structure is like adding another dimension to a multidimensional array. Perl lets you think of it either way, but the internal representation is the same.

The important point here is that Perl lets you pretend that a complex data structure is a simple scalar. On this simple kind of encapsulation, Perl’s entire object-oriented structure is built. When we earlier invoked the Camel constructor like this:

    $fido = new Camel "Amelia";

we created a Camel object that is represented by the scalar $fido. But the inside of the Camel is more complicated. As well-behaved object-oriented programmers, we’re not supposed to care about the insides of Camels (unless we happen to be the people implementing the methods of the Camel class).
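As an aside, the nested structure above can be walked level by level. The following sketch is ours, not the book’s; we rebuild %kids_of_wife and loop over both hash levels with made-up loop variables:

```perl
# Rebuild the hash-of-hash-of-arrays structure from above.
%kids_of_wife = (
    "Jacob" => {
        "Leah"   => ["Reuben", "Simeon", "Levi", "Judah", "Issachar", "Zebulun"],
        "Rachel" => ["Joseph", "Benjamin"],
        "Bilhah" => ["Dan", "Naphtali"],
        "Zilpah" => ["Gad", "Asher"],
    },
);

# One foreach per level of nesting: husbands, then wives, then sons.
foreach $husband (sort keys %kids_of_wife) {
    foreach $wife (sort keys %{ $kids_of_wife{$husband} }) {
        $sons = join(", ", @{ $kids_of_wife{$husband}{$wife} });
        print "$husband and $wife: $sons\n";
    }
}
```

Note the %{ } and @{ } wrappers, which tell Perl to treat the scalars stored in the hash as references to a hash and an array again.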
But generally, an object like a Camel would consist of a hash containing the particular Camel’s attributes, such as its name ("Amelia" in this case, not "fido"), and the number of humps (which we didn’t specify, but probably defaults to 1; check the front cover).

Simplicities

If your head isn’t spinning a bit from reading that last section, then you have an unusual head. People don’t generally like to deal with complex data structures, whether governmental or genealogical. So in our natural languages, we have many ways of sweeping complexity under the carpet. Many of these fall into the category of topicalization, which is just a fancy linguistics term for agreeing with someone about what you’re going to talk about (and by exclusion, what you’re probably not going to talk about). This happens on many levels in language. On a high level, we divide ourselves up into various subcultures that are interested in various subtopics and establish sublanguages that talk primarily about those subtopics. The lingo of the doctor’s office ("indissoluble asphyxiant") is different from the lingo of the chocolate factory ("everlasting gobstopper"). Most of us automatically switch contexts as we go from one lingo to another.

On a conversational level, the context switch has to be more explicit, so our language gives us many ways of saying what we’re about to say. We put titles on our books and headers on our sections. On our sentences, we put quaint phrases like "In regard to your recent query" or "For all X". Usually, though, we just say things like, "You know that dangley thingy that hangs down in the back of your throat?"

Perl also has several ways of topicalizing. One important topicalizer is the package declaration. Suppose you want to talk about Camels in Perl. You’d likely start off your Camel module by saying:

    package Camel;

This has several notable effects.
One of them is that Perl will assume from this point on that any unspecified verbs or nouns are about Camels. It does this by automatically prefixing any global name with the module name "Camel::". So if you say:

    package Camel;
    $fido = &fetch();

then the real name of $fido is $Camel::fido (and the real name of &fetch is &Camel::fetch, but we’re not talking about verbs yet). This means that if some other module says:

    package Dog;
    $fido = &fetch();

Perl won’t get confused, because the real name of this $fido is $Dog::fido, not $Camel::fido. A computer scientist would say that a package establishes a namespace. You can have as many namespaces as you like, but since you’re only in one of them at a time, you can pretend that the other namespaces don’t exist. That’s how namespaces simplify reality for you. Simplification is based on pretending. (Of course, so is oversimplification, which is what we’re doing in this chapter.)

Now it’s important to keep your nouns straight, but it’s just as important to keep your verbs straight. It’s nice that &Camel::fetch is not confused with &Dog::fetch within the Camel and Dog namespaces, but the really nice thing about packages is that they classify your verbs so that other packages can use them. When we said:

    $fido = new Camel "Amelia";

we were actually invoking the &new verb in the Camel package, which has the full name of &Camel::new. And when we said:

    $fido->saddle();

we were invoking the &Camel::saddle routine, because $fido remembers that it is pointing to a Camel. This is how object-oriented programming works.

When you say package Camel, you’re starting a new package. But sometimes you just want to borrow the nouns and verbs of an existing package. Perl lets you do that with a use declaration, which not only borrows verbs from another package, but also checks that the module you name is loaded in from disk.
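The two-package situation just described can be sketched concretely. The fetch subroutines below are stand-ins of our own invention (the book never defines them); they return fixed strings so that the separate namespaces are visible:

```perl
package Camel;
sub fetch { return "a thornbush" }   # hypothetical Camel verb
$fido = fetch();                     # this is really $Camel::fido

package Dog;
sub fetch { return "a stick" }       # a completely separate &fetch
$fido = fetch();                     # this is really $Dog::fido

package main;
# The two $fido variables never collided:
print "Camel fetched $Camel::fido\n";
print "Dog fetched $Dog::fido\n";
```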
In fact, you must say something like:

    use Camel;

before you say:

    $fido = new Camel "Amelia";

because otherwise Perl wouldn’t know what a Camel is. The interesting thing is that you yourself don’t really need to know what a Camel is, provided you can get someone else to write the Camel module for you. Even better would be if someone had already written the Camel module for you. It could be argued that the most powerful thing about Perl is not Perl itself, but CPAN (Comprehensive Perl Archive Network), which contains myriads of modules that accomplish many different tasks that you don’t have to know how to do. You just have to download it and know how to say:

    use Some::Cool::Module;

and then use the verbs from that module in a manner appropriate to the topic under discussion. So, like topicalization in a natural language, topicalization in Perl "warps" the language that you’ll use from there to the end of the program. In fact, some of the built-in modules don’t actually introduce verbs at all, but simply warp the Perl language in various useful ways. These special modules we call pragmas. For instance, you’ll often see people use the pragma strict, like this:

    use strict;

What the strict module does is tighten up some of the rules so that you have to be more explicit about various things that Perl would otherwise guess about, such as how you want your variables to be scoped. Making things explicit is helpful when you’re working on large projects. By default Perl is optimized for small projects, but with the strict pragma, Perl is also good for large projects that need to be more maintainable. Since you can add the strict pragma at any time, Perl is also good for evolving small projects into large ones, even when you didn’t expect that to happen. Which is usually.

Verbs

As is typical of your typical imperative computer language, many of the verbs in Perl are commands: they tell the Perl interpreter to do something.
On the other hand, as is typical of a natural language, the meanings of Perl verbs tend to mush off in various directions depending on the context. A statement starting with a verb is generally purely imperative and evaluated entirely for its side effects. (We sometimes call these verbs procedures, especially when they’re user-defined.) A frequently seen built-in command (in fact, you’ve seen it already) is the print command:

    print "Adam's wife is $wife{'Adam'}.\n";

This has the side effect of producing the desired output:

    Adam's wife is Eve.

But there are other "moods" besides the imperative mood. Some verbs are for asking questions and are useful in conditionals such as if statements. Other verbs translate their input parameters into return values, just as a recipe tells you how to turn raw ingredients into something (hopefully) edible. We tend to call these verbs functions, in deference to generations of mathematicians who don’t know what the word "functional" means in normal English. An example of a built-in function would be the exponential function:

    $e = exp(1);   # 2.718281828459 or thereabouts

But Perl doesn’t make a hard distinction between procedures and functions. You’ll find the terms used interchangeably. Verbs are also sometimes called operators (when built-in), or subroutines (when user-defined).* But call them whatever you like — they all return a value, which may or may not be a meaningful value, which you may or may not choose to ignore.

As we go on, you’ll see additional examples of how Perl behaves like a natural language. But there are other ways to look at Perl too. We’ve already sneakily introduced some notions from mathematical language, such as subscripts, addition, and the exponential function. But Perl is also a control language, a glue language, a prototyping language, a text-processing language, a list-processing language, and an object-oriented language. Among other things.
But Perl is also just a plain old computer language. And that’s how we’ll look at it next.

An Average Example

Suppose you’ve been teaching a Perl class, and you’re trying to figure out how to grade your students. You have a set of exam scores for each member of a class, in random order. You’d like a combined list of all the grades for each student, plus their average score. You have a text file (imaginatively named grades) that looks like this:

    Noël 25
    Ben 76
    Clementine 49
    Norm 66
    Chris 92
    Doug 42
    Carol 25
    Ben 12
    Clementine 0
    Norm 66
    ...

You can use the following script to gather all their scores together, determine each student’s average, and print them all out in alphabetical order. This program assumes rather naively that you don’t have two Carols in your class. That is, if there is a second entry for Carol, the program will assume it’s just another score for the first Carol (not to be confused with the first Noël).

* Historically, Perl required you to put an ampersand character (&) on any calls to user-defined subroutines (see $fido = &fetch(); earlier). But with Perl version 5, the ampersand became optional, so that user-defined verbs can now be called with the same syntax as built-in verbs ($fido = fetch();). We still use the ampersand when talking about the name of the routine, such as when we take a reference to it ($fetcher = \&fetch;). Linguistically speaking, you can think of the ampersand form &fetch as an infinitive, "to fetch", or the similar form "do fetch". But we rarely say "do fetch" when we can just say "fetch". That’s the real reason we dropped the mandatory ampersand in Perl 5.

By the way, the line numbers are not part of the program, any other resemblances to BASIC notwithstanding.

    1   #!/usr/bin/perl
    2
    3   open(GRADES, "grades") or die "Can't open grades: $!\n";
    4   while ($line = <GRADES>) {
    5       ($student, $grade) = split(" ", $line);
    6       $grades{$student} .= $grade . " ";
    7   }
    8
    9   foreach $student (sort keys %grades) {
    10      $scores = 0;
    11      $total = 0;
    12      @grades = split(" ", $grades{$student});
    13      foreach $grade (@grades) {
    14          $total += $grade;
    15          $scores++;
    16      }
    17      $average = $total / $scores;
    18      print "$student: $grades{$student}\tAverage: $average\n";
    19  }

Now before your eyes cross permanently, we’d better point out that this example demonstrates a lot of what we’ve covered so far, plus quite a bit more that we’ll explain presently. But if you let your eyes go just a little out of focus, you may start to see some interesting patterns. Take some wild guesses now as to what’s going on, and then later on we’ll tell you if you’re right. We’d tell you to try running it, but you may not know how yet.

How to Do It

Gee, right about now you’re probably wondering how to run a Perl program. The short answer is that you feed it to the Perl language interpreter program, which coincidentally happens to be named perl. The long answer starts out like this: There’s More Than One Way To Do It.*

The first way to invoke perl (and the way most likely to work on any operating system) is to simply call perl explicitly from the command line.† If you are doing something fairly simple, you can use the -e switch (% in the following example represents a standard shell prompt, so don’t type it). On Unix, you might type:

    % perl -e 'print "Hello, world!\n";'

On other operating systems, you may have to fiddle with the quotes some.

* That’s the Perl Slogan, and you’ll get tired of hearing it, unless you’re the Local Expert, in which case you’ll get tired of saying it. Sometimes it’s shortened to TMTOWTDI, pronounced "tim-toady". But you can pronounce it however you like. After all, TMTOWTDI.

† Assuming that your operating system provides a command-line interface. If you’re running an older Mac, you might need to upgrade to a version of BSD such as Mac OS X.
But the basic principle is the same: you’re trying to cram everything Perl needs to know into 80 columns or so.*

* These types of scripts are often referred to as "one-liners". If you ever end up hanging out with other Perl programmers, you’ll find that some of us are quite fond of creating intricate one-liners. Perl has occasionally been maligned as a write-only language because of these shenanigans.

For longer scripts, you can use your favorite text editor (or any other text editor) to put all your commands into a file and then, presuming you named the script gradation (not to be confused with graduation), you’d say:

    % perl gradation

You’re still invoking the Perl interpreter explicitly, but at least you don’t have to put everything on the command line every time. And you no longer have to fiddle with quotes to keep the shell happy.

The most convenient way to invoke a script is just to name it directly (or click on it), and let the operating system find the interpreter for you. On some systems, there may be ways of associating various file extensions or directories with a particular application. On those systems, you should do whatever it is you do to associate the Perl script with the perl interpreter. On Unix systems that support the #! "shebang" notation (and most Unix systems do, nowadays), you can make the first line of your script be magical, so the operating system will know which program to run. Put a line resembling line 1 of our example into your program:

    #!/usr/bin/perl

(If perl isn’t in /usr/bin, you’ll have to change the #! line accordingly.) Then all you have to say is:

    % gradation

Of course, this didn’t work because you forgot to make sure the script was executable (see the manpage for chmod(1)) and in your PATH. If it isn’t in your PATH, you’ll have to provide a complete filename so that the operating system knows how to find your script. Something like:

    % /home/sharon/bin/gradation

Finally, if you are unfortunate enough to be on an ancient Unix system that doesn’t support the magic #! line, or if the path to your interpreter is longer than 32 characters (a built-in limit on many systems), you may be able to work around it like this:

    #!/bin/sh -- # perl, to stop looping
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if 0;

Some operating systems may require variants of this to deal with /bin/csh, DCL, COMMAND.COM, or whatever happens to be your default command interpreter. Ask your Local Expert.

Throughout this book, we’ll just use #!/usr/bin/perl to represent all these notions and notations, but you’ll know what we really mean by it.

A random clue: when you write a test script, don’t call your script test. Unix systems have a built-in test command, which will likely be executed instead of your script. Try try instead.

A not-so-random clue: while learning Perl, and even after you think you know what you’re doing, we suggest using the -w switch, especially during development. This option will turn on all sorts of useful and interesting warning messages, not necessarily in that order. You can put the -w switch on the shebang line, like this:

    #!/usr/bin/perl -w

Now that you know how to run your own Perl program (not to be confused with the perl program), let’s get back to our example.

Filehandles

Unless you’re using artificial intelligence to model a solipsistic philosopher, your program needs some way to communicate with the outside world. In lines 3 and 4 of our Average Example you’ll see the word GRADES, which exemplifies another of Perl’s data types, the filehandle. A filehandle is just a name you give to a file, device, socket, or pipe to help you remember which one you’re talking about, and to hide some of the complexities of buffering and such. (Internally, filehandles are similar to streams from a language like C++ or I/O channels from BASIC.)
Filehandles make it easier for you to get input from and send output to many different places. Part of what makes Perl a good glue language is that it can talk to many files and processes at once. Having nice symbolic names for various external objects is just part of being a good glue language.*

* Some of the other things that make Perl a good glue language are: it’s 8-bit clean, it’s embeddable, and you can embed other things in it via extension modules. It’s concise, and it "networks" easily. It’s environmentally conscious, so to speak. You can invoke it in many different ways (as we saw earlier). But most of all, the language itself is not so rigidly structured that you can’t get it to "flow" around your problem. It comes back to that TMTOWTDI thing again.

You create a filehandle and attach it to a file by using open. The open function takes at least two parameters: the filehandle and filename you want to associate it with. Perl also gives you some predefined (and preopened) filehandles. STDIN is your program’s normal input channel, while STDOUT is your program’s normal output channel. And STDERR is an additional output channel that allows your program to make snide remarks off to the side while it transforms (or attempts to transform) your input into your output.*

Since you can use the open function to create filehandles for various purposes (input, output, piping), you need to be able to specify which behavior you want. As you might do on the command line, you simply add characters to the filename.

    open(SESAME, "filename")                # read from existing file
    open(SESAME, "<filename")               # (same thing, explicitly)
    open(SESAME, ">filename")               # create file and write to it
    open(SESAME, ">>filename")              # append to existing file
    open(SESAME, "| output-pipe-command")   # set up an output filter
    open(SESAME, "input-pipe-command |")    # set up an input filter

As you can see, the name you pick for the filehandle is arbitrary.
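Here is a sketch of our own (the filename is invented) that combines two of the modes above, writing a file with one open and then reading it back through the same SESAME filehandle:

```perl
# ">" means create the file and write to it.
open(SESAME, ">sesame_demo.txt") or die "Can't write sesame_demo.txt: $!\n";
print SESAME "treasure\n";
print SESAME "more treasure\n";
close(SESAME);

# "<" (or no prefix at all) means read from an existing file.
open(SESAME, "<sesame_demo.txt") or die "Can't read sesame_demo.txt: $!\n";
while ($line = <SESAME>) {
    push @lines, $line;          # each line still ends with its newline
}
close(SESAME);
unlink "sesame_demo.txt";        # tidy up the demo file

print "Read ", scalar(@lines), " lines\n";
```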
Once opened, the filehandle SESAME can be used to access the file or pipe until it is explicitly closed (with, you guessed it, close(SESAME)), or until the filehandle is attached to another file by a subsequent open on the same filehandle.†

Once you’ve opened a filehandle for input, you can read a line using the line reading operator, <>. This is also known as the angle operator because it’s made of angle brackets. The angle operator encloses the filehandle (<SESAME>) you want to read lines from. The empty angle operator, <>, will read lines from all the files specified on the command line, or STDIN, if none were specified. (This is standard behavior for many filter programs.) An example using the STDIN filehandle to read an answer supplied by the user would look something like this:

    print STDOUT "Enter a number: ";           # ask for a number
    $number = <STDIN>;                         # input the number
    print STDOUT "The number is $number.\n";   # print the number

* These filehandles are typically attached to your terminal, so you can type to your program and see its output, but they may also be attached to files (and such). Perl can give you these predefined handles because your operating system already provides them, one way or another. Under Unix, processes inherit standard input, output, and error from their parent process, typically a shell. One of the duties of a shell is to set up these I/O streams so that the child process doesn’t need to worry about them.

† Opening an already opened filehandle implicitly closes the first file, making it inaccessible to the filehandle, and opens a different file. You must be careful that this is what you really want to do. Sometimes it happens accidentally, like when you say open($handle,$file), and $handle happens to contain a constant string. Be sure to set $handle to something unique, or you’ll just open a new file on the same filehandle. Or you can leave $handle undefined, and Perl will fill it in for you.
Did you see what we just slipped by you? What’s that STDOUT doing there in those print statements? Well, that’s just one of the ways you can use an output filehandle. A filehandle may be supplied as the first argument to the print statement, and if present, tells the output where to go. In this case, the filehandle is redundant, because the output would have gone to STDOUT anyway. Much as STDIN is the default for input, STDOUT is the default for output. (In line 18 of our Average Example, we left it out to avoid confusing you up till now.)

If you try the previous example, you may notice that you get an extra blank line. This happens because the line-reading operation does not automatically remove the newline from your input line (your input would be, for example, "9\n"). For those times when you do want to remove the newline, Perl provides the chop and chomp functions. chop will indiscriminately remove (and return) the last character of the string, while chomp will only remove the end of record marker (generally, "\n") and return the number of characters so removed. You’ll often see this idiom for inputting a single line:

    chop($number = <STDIN>);   # input number and remove newline

which means the same thing as:

    $number = <STDIN>;   # input number
    chop($number);       # remove newline

Operators

As we alluded to earlier, Perl is also a mathematical language. This is true at several levels, from low-level bitwise logical operations, up through number and set manipulation, on up to larger predicates and abstractions of various sorts. And as we all know from studying math in school, mathematicians love strange symbols. What’s worse, computer scientists have come up with their own versions of these strange symbols. Perl has a number of these strange symbols too, but take heart, most are borrowed directly from C, FORTRAN, sed(1) or awk(1), so they’ll at least be familiar to users of those languages.
The rest of you can take comfort in knowing that, by learning all these strange symbols in Perl, you’ve given yourself a head start on all those other strange languages. Perl’s built-in operators may be classified by number of operands into unary, binary, and trinary (or ternary) operators. They may be classified by whether they’re prefix operators (which go in front of their operands) or infix operators (which go in between their operands). They may also be classified by the kinds of objects they work with, such as numbers, strings, or files. Later, we’ll give you a table of all the operators, but first here are some handy ones to get you started.

Some Binary Arithmetic Operators

Arithmetic operators do what you would expect from learning them in school. They perform some sort of mathematical function on numbers. For example:

    Example     Name            Result
    $a + $b     Addition        Sum of $a and $b
    $a * $b     Multiplication  Product of $a and $b
    $a % $b     Modulus         Remainder of $a divided by $b
    $a ** $b    Exponentiation  $a to the power of $b

Yes, we left out subtraction and division—we suspect you can figure out how they should work. Try them and see if you’re right. (Or cheat and look in Chapter 3, Unary and Binary Operators.) Arithmetic operators are evaluated in the order your math teacher taught you (exponentiation before multiplication; multiplication before addition). You can always use parentheses to make it come out differently.

String Operators

There is also an "addition" operator for strings that performs concatenation (that is, joining strings end to end). Unlike some languages that confuse this with numeric addition, Perl defines a separate operator (.) for string concatenation:

    $a = 123;
    $b = 456;
    print $a + $b;   # prints 579
    print $a . $b;   # prints 123456

There’s also a "multiply" operator for strings, called the repeat operator.
Again, it’s a separate operator (x) to keep it distinct from numeric multiplication:

    $a = 123;
    $b = 3;
    print $a * $b;   # prints 369
    print $a x $b;   # prints 123123123

These string operators bind as tightly as their corresponding arithmetic operators. The repeat operator is a bit unusual in taking a string for its left argument but a number for its right argument. Note also how Perl is automatically converting from numbers to strings. You could have put all the literal numbers above in quotes, and it would still have produced the same output. Internally though, it would have been converting in the opposite direction (that is, from strings to numbers).

A couple more things to think about. String concatenation is also implied by the interpolation that happens in double-quoted strings. And when you print out a list of values, you’re also effectively concatenating strings. So the following three statements produce the same output:

    print $a . ' is equal to ' . $b . ".\n";   # dot operator
    print $a, ' is equal to ', $b, ".\n";      # list
    print "$a is equal to $b.\n";              # interpolation

Which of these you use in any particular situation is entirely up to you. (But bear in mind that interpolation is often the most readable.)

The x operator may seem relatively worthless at first glance, but it is quite useful at times, especially for things like this:

    print "-" x $scrwid, "\n";

which draws a line across your screen, presuming $scrwid contains your screen width, and not your screw identifier.

Assignment Operators

Although it’s not exactly a mathematical operator, we’ve already made extensive use of the simple assignment operator, =. Try to remember that = means "gets set to" rather than "equals". (There is also a mathematical equality operator == that means "equals", and if you start out thinking about the difference between them now, you’ll save yourself a lot of headache later.
The == operator is like a function that returns a Boolean value, while = is more like a procedure that is evaluated for the side effect of modifying a variable.)

Like the operators described earlier, assignment operators are binary infix operators, which means they have an operand on either side of the operator. The right operand can be any expression you like, but the left operand must be a valid lvalue (which, when translated to English, means a valid storage location like a variable, or a location in an array). The most common assignment operator is simple assignment. It determines the value of the expression on its right side, and then sets the variable on the left side to that value:

    $a = $b;
    $a = $b + 5;
    $a = $a * 3;

Notice the last assignment refers to the same variable twice; once for the computation, once for the assignment. There’s nothing wrong with that, but it’s a common enough operation that there’s a shortcut for it (borrowed from C). If you say:

    lvalue operator= expression

it is evaluated as if it were:

    lvalue = lvalue operator expression

except that the lvalue is not computed twice. (This only makes a difference if evaluation of the lvalue has side effects. But when it does make a difference, it usually does what you want. So don’t sweat it.) So, for example, you could write the previous example as:

    $a *= 3;

which reads "multiply $a by 3". You can do this with almost any binary operator in Perl, even some that you can’t do it with in C:

    $line .= "\n";   # Append newline to $line.
    $fill x= 80;     # Make string $fill into 80 repeats of itself.
    $val ||= "2";    # Set $val to 2 if it isn't already "true".

Line 6 of our Average Example* contains two string concatenations, one of which is an assignment operator. And line 14 contains a +=.
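To watch the .= operator from line 6 accumulate a string on its own, here is a sketch of ours using invented scores:

```perl
# Append each score plus a trailing space, just as line 6 of the
# Average Example does while reading the grades file.
%grades = ();
foreach $pair (["Carol", 25], ["Carol", 84], ["Carol", 92]) {
    ($student, $grade) = @$pair;
    $grades{$student} .= $grade . " ";
}
print "Carol's scores: $grades{'Carol'}\n";

# A couple of the other compound assignments:
$count = 0;
$count += 5;   # $count is now 5
$count *= 3;   # $count is now 15
```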
Regardless of which kind of assignment operator you use, the final value of the variable on the left is returned as the value of the assignment as a whole.† This will not surprise C programmers, who will already know how to use this idiom to zero out variables:

    $a = $b = $c = 0;

You’ll also frequently see assignment used as the condition of a while loop, as in line 4 of our average example. What will surprise C programmers is that assignment in Perl returns the actual variable as an lvalue, so that you can modify the same variable more than once in a statement. For instance, you could say:

    ($temp -= 32) *= 5/9;

to do an in-place conversion from Fahrenheit to Celsius. This is also why earlier in this chapter we could say:

    chop($number = <STDIN>);

and have it chop the final value of $number. Generally speaking, you can use this feature whenever you want to copy something and at the same time do something else with it.

* Thought we’d forgotten it, didn’t you?

† This is unlike, say, Pascal, in which assignment is a statement and returns no value. We said earlier that assignment is like a procedure, but remember that in Perl, even procedures return values.

Unary Arithmetic Operators

As if $variable += 1 weren’t short enough, Perl borrows from C an even shorter way to increment a variable. The autoincrement (and autodecrement) operators simply add (or subtract) one from the value of the variable. They can be placed on either side of the variable, depending on when you want them to be evaluated:

    Example       Name           Result
    ++$a, $a++    Autoincrement  Add 1 to $a
    --$a, $a--    Autodecrement  Subtract 1 from $a

If you place one of these "auto" operators before the variable, it is known as a pre-incremented (pre-decremented) variable. Its value will be changed before it is referenced. If it is placed after the variable, it is known as a post-incremented (post-decremented) variable, and its value is changed after it is used.
For example:

    $a = 5;        # $a is assigned 5
    $b = ++$a;     # $b is assigned the incremented value of $a, 6
    $c = $a--;     # $c is assigned 6, then $a is decremented to 5

Line 15 of our Average Example increments the number of scores by one, so that we'll know how many scores we're averaging. It uses a post-increment operator ($scores++), but in this case it doesn't matter, since the expression is in a void context, which is just a funny way of saying that the expression is being evaluated only for the side effect of incrementing the variable. The value returned is being thrown away.*

Logical Operators

Logical operators, also known as "short-circuit" operators, allow the program to make decisions based on multiple criteria without using nested if statements. They are known as short-circuit operators because they skip (short-circuit) the evaluation of their right argument if they decide the left argument has already supplied enough information to decide the overall value. This is not just for efficiency. You are explicitly allowed to depend on this short-circuiting behavior to avoid evaluating code in the right argument that you know would blow up if the left argument were not "guarding" it. You can say "California or bust!" in Perl without busting (presuming you do get to California).

Perl actually has two sets of logical operators: a traditional set borrowed from C and a newer (but even more traditional) set of ultralow-precedence operators borrowed from BASIC. Both sets contribute to readability when used appropriately. C's punctuational operators work well when you want your logical operators to bind more tightly than commas, while BASIC's word-based operators work well when you want your commas to bind more tightly than your logical operators.

* The optimizer will notice this and optimize the post-increment into a pre-increment, because that's a bit faster to execute. (You didn't need to know that, but we hoped it would cheer you up.)
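The precedence difference matters in practice. Because the word operators bind even more loosely than assignment, the two lines below behave differently; a cautionary sketch:

```perl
my $a = 0 || "fallback";    # || binds tighter than =, so $a gets "fallback"
my $b = 0 or "fallback";    # or binds looser than =, so $b gets 0 (!)
print "$a $b\n";
```

The second line parses as (my $b = 0) or "fallback", which is why the word operators are usually reserved for flow control rather than for computing values.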
Often they work the same, and which set you use is a matter of personal preference. (For contrastive examples, see the section "Logical and, or, not, and xor" in Chapter 3.) Although the two sets of operators are not interchangeable due to precedence, once they're parsed, the operators themselves behave identically; precedence merely governs the extent of their arguments. Table 1-1 lists the logical operators.

Table 1-1. Logical Operators

    Example     Name   Result
    $a && $b    And    $a if $a is false, $b otherwise
    $a || $b    Or     $a if $a is true, $b otherwise
    ! $a        Not    True if $a is not true
    $a and $b   And    $a if $a is false, $b otherwise
    $a or $b    Or     $a if $a is true, $b otherwise
    not $a      Not    True if $a is not true
    $a xor $b   Xor    True if $a or $b is true, but not both

Since the logical operators "short-circuit" the way they do, they're often used in Perl to conditionally execute code. The following line (line 3 from our Average Example) tries to open the file grades:

    open(GRADES, "grades") or die "Can't open file grades: $!\n";

If it opens the file, it will jump to the next line of the program. If it can't open the file, it will provide us with an error message and then stop execution. Literally, this line means "Open grades or bust!" Besides being another example of natural language, the short-circuit operators preserve the visual flow. Important actions are listed down the left side of the screen, and secondary actions are hidden off to the right. (The $! variable contains the error message returned by the operating system; see Chapter 28, Special Names.)

Of course, these logical operators can also be used within the more traditional kinds of conditional constructs, such as the if and while statements.

Some Numeric and String Comparison Operators

Comparison, or relational, operators tell us how two scalar values (numbers or strings) relate to each other. There are two sets of operators; one does numeric
comparison and the other does string comparison. (In either case, the arguments will be "coerced" to have the appropriate type first.) Assuming left and right arguments of $a and $b, we have:

    Comparison            Numeric   String   Return Value
    Equal                 ==        eq       True if $a is equal to $b
    Not equal             !=        ne       True if $a is not equal to $b
    Less than             <         lt       True if $a is less than $b
    Greater than          >         gt       True if $a is greater than $b
    Less than or equal    <=        le       True if $a is not greater than $b
    Comparison            <=>       cmp      0 if equal, 1 if $a greater, -1 if $b greater

The last pair of operators (<=> and cmp) are entirely redundant. However, they're incredibly useful in sort subroutines (see Chapter 29).*

Some File Test Operators

The file test operators allow you to test whether certain file attributes are set before you go and blindly muck about with the files. The most basic file attribute is, of course, whether the file exists. For example, it would be very nice to know whether your mail aliases file already exists before you go and open it as a new file, wiping out everything that was in there before. Here are a few of the file test operators:

    Example   Name        Result
    -e $a     Exists      True if file named in $a exists
    -r $a     Readable    True if file named in $a is readable
    -w $a     Writable    True if file named in $a is writable
    -d $a     Directory   True if file named in $a is a directory
    -f $a     File        True if file named in $a is a regular file
    -T $a     Text File   True if file named in $a is a text file

You might use them like this:

    -e "/usr/bin/perl" or warn "Perl is improperly installed\n";
    -f "/vmlinuz" and print "I see you are a friend of Linus\n";

* Some folks feel that such redundancy is evil because it keeps a language from being minimalistic, or orthogonal. But Perl isn't an orthogonal language; it's a diagonal language. By this we mean that Perl doesn't force you to always go at right angles. Sometimes you just want to follow the hypotenuse of the triangle to get where you're going. TMTOWTDI is about shortcuts. Shortcuts are about efficiency.
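Here is a runnable sketch of a few of those operators, using a scratch file we create ourselves (the filename is made up for the example):

```perl
my $tmp = "scratch_$$.txt";          # $$ is the current process ID
open(SCRATCH, "> $tmp") or die "Can't create $tmp: $!\n";
print SCRATCH "90 100 80\n";
close(SCRATCH);

my $safe = (-e $tmp and -f $tmp and -r $tmp);  # exists, is regular, is readable
print "OK to read $tmp\n" if $safe;

unlink($tmp);                        # clean up the scratch file
```

Testing before you muck about with a file is exactly what these operators are for.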
Note that a regular file is not the same thing as a text file. Binary files like /vmlinuz are regular files, but they aren't text files. Text files are the opposite of binary files, while regular files are the opposite of "irregular" files like directories and devices.

There are a lot of file test operators, many of which we didn't list. Most of the file tests are unary Boolean operators, which is to say they take only one operand (a scalar that evaluates to a filename or a filehandle), and they return either a true or false value. A few of them return something fancier, like the file's size or age, but you can look those up when you need them in the section "Named Unary and File Test Operators" in Chapter 3.

Control Structures

So far, except for our one large example, all of our examples have been completely linear; we executed each command in order. We've seen a few examples of using the short-circuit operators to cause a single command to be (or not to be) executed. While you can write some very useful linear programs (a lot of CGI scripts fall into this category), you can write much more powerful programs if you have conditional expressions and looping mechanisms. Collectively, these are known as control structures. So you can also think of Perl as a control language.

But to have control, you have to be able to decide things, and to decide things, you have to know the difference between what's true and what's false.

What Is Truth?

We've bandied about the term truth,* and we've mentioned that certain operators return a true or a false value. Before we go any further, we really ought to explain exactly what we mean by that. Perl treats truth a little differently than most computer languages, but after you've worked with it a while, it will make a lot of sense. (Actually, we hope it'll make a lot of sense after you've read the following.) Basically, Perl holds truths to be self-evident.
That's a glib way of saying that you can evaluate almost anything for its truth value. Perl uses practical definitions of truth that depend on the type of thing you're evaluating. As it happens, there are many more kinds of truth than there are of nontruth.

* Strictly speaking, this is not true.

Truth in Perl is always evaluated in a scalar context. Other than that, no type coercion is done. So here are the rules for the various kinds of values a scalar can hold:

1. Any string is true except for "" and "0".
2. Any number is true except for 0.
3. Any reference is true.
4. Any undefined value is false.

Actually, the last two rules can be derived from the first two. Any reference (rule 3) would point to something with an address and would evaluate to a number or string containing that address, which is never 0 because it's always defined. And any undefined value (rule 4) would always evaluate to 0 or the null string.

And in a way, you can derive rule 2 from rule 1 if you pretend that everything is a string. Again, no string coercion is actually done to evaluate truth, but if the string coercion were done, then any numeric value of 0 would simply turn into the string "0" and be false. Any other number would not turn into the string "0", and so would be true. Let's look at some examples so we can understand this better:

    0           # would become the string "0", so false.
    1           # would become the string "1", so true.
    10 - 10     # 10 minus 10 is 0, would convert to string "0", so false.
    0.00        # equals 0, would convert to string "0", so false.
    "0"         # is the string "0", so false.
    ""          # is a null string, so false.
    "0.00"      # is the string "0.00", neither "" nor "0", so true!
    "0.00" + 0  # would become the number 0 (coerced by the +), so false.
    \$a         # is a reference to $a, so true, even if $a is false.
    undef()     # is a function returning the undefined value, so false.
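Those rules are easy to verify directly; a small sketch (the truth helper is ours, not a built-in):

```perl
sub truth { return $_[0] ? "true" : "false" }

my $x;
print truth(0),      "\n";   # false
print truth("0.00"), "\n";   # true: the string "0.00" is neither "" nor "0"
print truth(""),     "\n";   # false
print truth(\$x),    "\n";   # true: any reference is true, even one to an undefined variable
```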
Since we mumbled something earlier about truth being evaluated in a scalar context, you might be wondering what the truth value of a list is. Well, the simple fact is, none of the operations in Perl will return a list in a scalar context. They'll all notice they're in a scalar context and return a scalar value instead, and then you apply the rules of truth to that scalar. So there's no problem, as long as you can figure out what any given operator will return in a scalar context. As it happens, both arrays and hashes return scalar values that conveniently happen to be true if the array or hash contains any elements. More on that later.

The if and unless statements

We saw earlier how a logical operator could function as a conditional. A slightly more complex form of the logical operators is the if statement. The if statement evaluates a truth condition (that is, a Boolean expression) and executes a block if the condition is true:

    if ($debug_level > 0) {
        # Something has gone wrong. Tell the user.
        print "Debug: Danger, Will Robinson, danger!\n";
        print "Debug: Answer was '54', expected '42'.\n";
    }

A block is one or more statements grouped together by a set of braces. Since the if statement executes a block, the braces are required by definition. If you know a language like C, you'll notice that this is different. Braces are optional in C if you have a single statement, but the braces are not optional in Perl.

Sometimes, just executing a block when a condition is met isn't enough. You may also want to execute a different block if that condition isn't met. While you could certainly use two if statements, one the negation of the other, Perl provides a more elegant solution. After the block, if can take an optional second condition, called else, to be executed only if the truth condition is false. (Veteran computer programmers will not be surprised at this point.) At times you may even have more than two possible choices.
In this case, you'll want to add an elsif truth condition for the other possible choices. (Veteran computer programmers may well be surprised by the spelling of "elsif", for which nobody here is going to apologize. Sorry.)

    if ($city eq "New York") {
        print "New York is northeast of Washington, D.C.\n";
    }
    elsif ($city eq "Chicago") {
        print "Chicago is northwest of Washington, D.C.\n";
    }
    elsif ($city eq "Miami") {
        print "Miami is south of Washington, D.C. And much warmer!\n";
    }
    else {
        print "I don't know where $city is, sorry.\n";
    }

The if and elsif clauses are each computed in turn, until one is found to be true or the else condition is reached. When one of the conditions is found to be true, its block is executed and all remaining branches are skipped.

Sometimes, you don't want to do anything if the condition is true, only if it is false. Using an empty if with an else may be messy, and a negated if may be illegible; it sounds weird in English to say "if not this is true, do something". In these situations, you would use the unless statement:

    unless ($destination eq $home) {
        print "I'm not going home.\n";
    }

There is no elsunless though. This is generally construed as a feature.

Iterative (Looping) Constructs

Perl has four main iterative statement types: while, until, for, and foreach. These statements allow a Perl program to repeatedly execute the same code.

The while and until statements

The while and until statements behave just like the if and unless statements, except that they'll execute the block repeatedly. That is, they loop. First, the conditional part of the statement is checked. If the condition is met (if it is true for a while or false for an until), the block of the statement is executed.

    while ($tickets_sold < 10000) {
        $available = 10000 - $tickets_sold;
        print "$available tickets are available.
            How many would you like: ";
        $purchase = <STDIN>;
        chomp($purchase);
        $tickets_sold += $purchase;
    }

Note that if the original condition is never met, the loop will never be entered at all. For example, if we've already sold 10,000 tickets, we might want to have the next line of the program say something like:

    print "This show is sold out, please come back later.\n";

In our Average Example earlier, line 4 reads:

    while ($line = <GRADES>) {

This assigns the next line to the variable $line and, as we explained earlier, returns the value of $line so that the condition of the while statement can evaluate $line for truth. You might wonder whether Perl will get a false negative on blank lines and exit the loop prematurely. The answer is that it won't. The reason is clear if you think about everything we've said. The line input operator leaves the newline on the end of the string, so a blank line has the value "\n". And you know that "\n" is not one of the canonical false values. So the condition is true, and the loop continues even on blank lines.

On the other hand, when we finally do reach the end of the file, the line input operator returns the undefined value, which always evaluates to false. And the loop terminates, just when we wanted it to. There's no need for an explicit test of the eof function in Perl, because the input operators are designed to work smoothly in a conditional context.

In fact, almost everything is designed to work smoothly in a conditional (Boolean) context. If you mention an array in a scalar context, the length of the array is returned. So you often see command-line arguments processed like this:

    while (@ARGV) {
        process(shift @ARGV);
    }

The shift operator removes one element from the argument list each time through the loop (and returns that element). The loop automatically exits when array @ARGV is exhausted, that is, when its length goes to 0. And 0 is already false in Perl.
In a sense, the array itself has become "false".*

The for statement

Another iterative statement is the for loop. The for loop runs exactly like the while loop, but looks a good deal different. (C programmers will find it very familiar though.)

    for ($sold = 0; $sold < 10000; $sold += $purchase) {
        $available = 10000 - $sold;
        print "$available tickets are available. How many would you like: ";
        $purchase = <STDIN>;
        chomp($purchase);
    }

This for loop takes three expressions within the loop's parentheses: an expression to set the initial state of the loop variable, a condition to test the loop variable, and an expression to modify the state of the loop variable. When a for loop starts, the initial state is set and the truth condition is checked. If the condition is true, the block is executed. When the block finishes, the modification expression is executed, the truth condition is again checked, and if true, the block is rerun with the next value. As long as the truth condition remains true, the block and the modification expression will continue to be executed. (Note that only the middle expression is evaluated for its value. The first and third expressions are evaluated only for their side effects, and the resulting values are thrown away!)

The foreach statement

The last of Perl's iterative statements is the foreach statement, which is used to execute the same code for each of a known set of scalars, such as an array:

    foreach $user (@users) {
        if (-f "$home{$user}/.nexrc") {
            print "$user is cool... they use a perl-aware vi!\n";
        }
    }

* This is how Perl programmers think. So there's no need to compare 0 to 0 to see if it's false. Despite the fact that other languages force you to, don't go out of your way to write explicit comparisons like while (@ARGV != 0). That's just inefficient for both you and the computer. And anyone who has to maintain your code.
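Because the loop variable is an alias for each element in turn, modifying it modifies the array itself; a quick sketch:

```perl
my @scores = (90, 75, 80);
foreach my $score (@scores) {
    $score += 5;          # a 5-point curve: this changes @scores in place
}
print "@scores\n";        # prints "95 80 85"
```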
Unlike the if and while statements, which provide scalar context to a conditional expression, the foreach statement provides a list context to the expression in parentheses. So the expression is evaluated to produce a list (not a scalar, even if there's only one scalar in the list). Then each element of the list is aliased to the loop variable in turn, and the block of code is executed once for each list element. Note that the loop variable refers to the element itself, rather than a copy of the element. Hence, modifying the loop variable also modifies the original array.

You'll find many more foreach loops in the typical Perl program than for loops, because it's very easy in Perl to generate the kinds of lists that foreach wants to iterate over. One idiom you'll often see is a loop to iterate over the sorted keys of a hash:

    foreach $key (sort keys %hash) {

In fact, line 9 of our Average Example does precisely that.

Breaking out: next and last

The next and last operators allow you to modify the flow of your loop. It is not at all uncommon to have a special case; you may want to skip it, or you may want to quit when you encounter it. For example, if you are dealing with Unix accounts, you may want to skip the system accounts (like root or lp). The next operator would allow you to skip to the end of your current loop iteration and start the next iteration. The last operator would allow you to skip to the end of your block, as if your loop's test condition had returned false. This might be useful if, for example, you are looking for a specific account and want to quit as soon as you find it.

    foreach $user (@users) {
        if ($user eq "root" or $user eq "lp") {
            next;
        }
        if ($user eq "special") {
            print "Found the special account.\n";
            # do some processing
            last;
        }
    }

It's possible to break out of multilevel loops by labeling your loops and specifying which loop you want to break out of.
Together with statement modifiers (another form of conditional, which we'll talk about later), this can make for extremely readable loop exits (if you happen to think English is readable):

    LINE: while ($line = <ARTICLE>) {
        last LINE if $line eq "\n";   # stop on first blank line
        next LINE if $line =~ /^#/;   # skip comment lines
        # your ad here
    }

You may be saying, "Wait a minute, what's that funny ^# thing there inside the leaning toothpicks? That doesn't look much like English." And you're right. That's a pattern match containing a regular expression (albeit a rather simple one). And that's what the next section is about. Perl is the best text processing language in the world, and regular expressions are at the heart of Perl's text processing.

Regular Expressions

Regular expressions (a.k.a. regexes, regexps, or REs) are used by many search programs such as grep and findstr, text-munging programs like sed and awk, and editors like vi and emacs. A regular expression is a way of describing a set of strings without having to list all the strings in your set.* Many other computer languages incorporate regular expressions (some of them even advertise "Perl5 regular expressions"!), but none of these languages integrates regular expressions into the language the way Perl does.

Regular expressions are used several ways in Perl. First and foremost, they're used in conditionals to determine whether a string matches a particular pattern, because in a Boolean context they return true and false. So when you see something that looks like /foo/ in a conditional, you know you're looking at an ordinary pattern-matching operator:

    if (/Windows 95/) { print "Time to upgrade?\n" }

Second, if you can locate patterns within a string, you can replace them with something else. So when you see something that looks like s/foo/bar/, you know it's asking Perl to substitute "bar" for "foo", if possible. We call that the substitution operator.
It also happens to return true or false depending on whether it succeeded, but usually it's evaluated for its side effect:

    s/Windows/Linux/;

Finally, patterns can specify not only where something is, but also where it isn't. So the split operator uses a regular expression to specify where the data isn't. That is, the regular expression defines the separators that delimit the fields of data. Our Average Example has a couple of trivial examples of this. Lines 5 and 12 each split strings on the space character in order to return a list of words. But you can split on any separator you can specify with a regular expression:

    ($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");

* A good source of information on regular expression concepts is Jeffrey Friedl's book, Mastering Regular Expressions (O'Reilly & Associates).

(There are various modifiers you can use in each of these situations to do exotic things like ignore case when matching alphabetic characters, but these are the sorts of gory details that we'll cover later when we get to the gory details.)

The simplest use of regular expressions is to match a literal expression. In the case of the split above, we matched on a single comma character. But if you match on several characters in a row, they all have to match sequentially. That is, the pattern looks for a substring, much as you'd expect. Let's say we want to show all the lines of an HTML file that contain HTTP links (as opposed to FTP links). Let's imagine we're working with HTML for the first time, and we're being a little naïve. We know that these links will always have "http:" in them somewhere. We could loop through our file with this:

    while ($line = <FILE>) {
        if ($line =~ /http:/) {
            print $line;
        }
    }

Here, the =~ (pattern-binding operator) is telling Perl to look for a match of the regular expression "http:" in the variable $line.
If it finds the expression, the operator returns a true value and the block (a print statement) is executed.* By the way, if you don't use the =~ binding operator, Perl will search a default string instead of $line. It's like when you say, "Eek! Help me find my contact lens!" People automatically know to look around near you without your actually having to tell them that. Likewise, Perl knows that there is a default place to search for things when you don't say where to search for them. This default string is actually a special scalar variable that goes by the odd name of $_. In fact, it's not the default just for pattern matching; many operators in Perl default to using the $_ variable, so a veteran Perl programmer would likely write the last example as:

    while (<FILE>) {
        print if /http:/;
    }

(Hmm, another one of those statement modifiers seems to have snuck in there. Insidious little beasties.)

This stuff is pretty handy, but what if we wanted to find all of the link types, not just the HTTP links? We could give a list of link types, like "http:", "ftp:", "mailto:", and so on. But that list could get long, and what would we do when a new kind of link was added?

    while (<FILE>) {
        print if /http:/;
        print if /ftp:/;
        print if /mailto:/;
        # What next?
    }

Since regular expressions are descriptive of a set of strings, we can just describe what we are looking for: a number of alphabetic characters followed by a colon. In regular expression talk (Regexese?), that would be /[a-zA-Z]+:/, where the brackets define a character class. The a-z and A-Z represent all alphabetic characters (the dash means the range of all characters between the starting and ending character, inclusive).

* This is very similar to what the Unix command grep 'http:' file would do. On MS-DOS you could use the find command, but it doesn't know how to do more complicated regular expressions. (However, the misnamed findstr program of Windows NT does know about regular expressions.)
And the + is a special character that says "one or more of whatever was before me". It's what we call a quantifier, meaning a gizmo that says how many times something is allowed to repeat. (The slashes aren't really part of the regular expression, but rather part of the pattern-match operator. The slashes are acting like quotes that just happen to contain a regular expression.)

Because certain classes like the alphabetics are so commonly used, Perl defines shortcuts for them:

    Name             ASCII Definition   Code
    Whitespace       [ \t\n\r\f]        \s
    Word character   [a-zA-Z_0-9]       \w
    Digit            [0-9]              \d

Note that these match single characters. A \w will match any single word character, not an entire word. (Remember that + quantifier? You can say \w+ to match a word.) Perl also provides the negation of these classes by using the uppercased character, such as \D for a nondigit character.

We should note that \w is not always equivalent to [a-zA-Z_0-9] (and \d is not always [0-9]). Some locales define additional alphabetic characters outside the ASCII sequence, and \w respects them. Newer versions of Perl also know about Unicode letter and digit properties and treat Unicode characters with those properties accordingly. (Perl also considers ideographs to be \w characters.)

There is one other very special character class, written with a ".", that will match any character whatsoever.* For example, /a./ will match any string containing an "a" that is not the last character in the string. Thus it will match "at" or "am" or even "a!", but not "a", since there's nothing after the "a" for the dot to match. Since it's searching for the pattern anywhere in the string, it'll match "oasis" and "camel", but not "sheba". It matches "caravan" on the first "a". It could match on the second "a", but it stops after it finds the first suitable match, searching from left to right.

* Except that it won't normally match a newline. When you think about it, a "." doesn't normally match a newline in grep(1) either.

Quantifiers

The characters and character classes we've talked about all match single characters. We mentioned that you could match multiple "word" characters with \w+. The + is one kind of quantifier, but there are others. All of them are placed after the item being quantified.

The most general form of quantifier specifies both the minimum and maximum number of times an item can match. You put the two numbers in braces, separated by a comma. For example, if you were trying to match North American phone numbers, the sequence \d{7,11} would match at least seven digits, but no more than eleven digits. If you put a single number in the braces, the number specifies both the minimum and the maximum; that is, the number specifies the exact number of times the item can match. (All unquantified items have an implicit {1} quantifier.)

If you put the minimum and the comma but omit the maximum, then the maximum is taken to be infinity. In other words, it will match at least the minimum number of times, plus as many as it can get after that. For example, \d{7} will match only the first seven digits (a local North American phone number, for instance, or the first seven digits of a longer number), while \d{7,} will match any phone number, even an international one (unless it happens to be shorter than seven digits). There is no special way of saying "at most" a certain number of times. Just say .{0,5}, for example, to find at most five arbitrary characters.

Certain combinations of minimum and maximum occur frequently, so Perl defines special quantifiers for them. We've already seen +, which is the same as {1,}, or "at least one of the preceding item".
There is also *, which is the same as {0,}, or "zero or more of the preceding item", and ?, which is the same as {0,1}, or "zero or one of the preceding item" (that is, the preceding item is optional).

You need to be careful of a couple of things about quantification. First of all, Perl quantifiers are by default greedy. This means that they will attempt to match as much as they can as long as the whole pattern still matches. For example, if you are matching /\d+/ against "1234567890", it will match the entire string. This is something to watch out for especially when you are using ".", any character. Often, someone will have a string like:

    larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh

and will try to match "larry:" with /.+:/. However, since the + quantifier is greedy, this pattern will match everything up to and including "/home/larry:", because it matches as much as possible before the last colon, including all the other colons. Sometimes you can avoid this by using a negated character class, that is, by saying /[^:]+:/, which says to match one or more noncolon characters (as many as possible), up to the first colon. It's that little caret in there that negates the Boolean sense of the character class.*

The other point to be careful about is that regular expressions will try to match as early as possible. This even takes precedence over being greedy. Since scanning happens left-to-right, this means that the pattern will match as far left as possible, even if there is some other place where it could match longer. (Regular expressions may be greedy, but they aren't into delayed gratification.) For example, suppose you're using the substitution command (s///) on the default string (variable $_, that is), and you want to remove a string of x's from the middle of the string. If you say:

    $_ = "fred xxxxxxx barney";
    s/x*//;

it will have absolutely no effect!
This is because the x* (meaning zero or more "x" characters) will be able to match the "nothing" at the beginning of the string, since the null string happens to be zero characters wide and there's a null string just sitting there plain as day before the "f" of "fred".†

There's one other thing you need to know. By default, quantifiers apply to a single preceding character, so /bam{2}/ will match "bamm" but not "bambam". To apply a quantifier to more than one character, use parentheses. So to match "bambam", use the pattern /(bam){2}/.

* Sorry, we didn't pick that notation, so don't blame us. That's just how negated character classes are customarily written in Unix culture.
† Don't feel bad. Even the authors get caught by this from time to time.

Minimal Matching

If you were using an ancient version of Perl and you didn't want greedy matching, you had to use a negated character class. (And really, you were still getting greedy matching of a constrained variety.) In modern versions of Perl, you can force nongreedy, minimal matching by placing a question mark after any quantifier. Our same username match would now be /.*?:/. That .*? will now try to match as few characters as possible, rather than as many as possible, so it stops at the first colon rather than at the last.

Nailing Things Down

Whenever you try to match a pattern, it's going to try to match in every location till it finds a match. An anchor allows you to restrict where the pattern can match. Essentially, an anchor is something that matches a "nothing", but a special kind of nothing that depends on its surroundings. You could also call it a rule, or a constraint, or an assertion. Whatever you care to call it, it tries to match something of zero width, and either succeeds or fails. (Failure merely means that the pattern can't match that particular way. The pattern will go on trying to match some other way, if there are any other ways left to try.)
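Before we look at particular anchors, here is the username example again as a runnable sketch, contrasting the greedy, negated-class, and minimal forms described above:

```perl
$_ = "larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh";

my ($greedy)  = /(.+:)/;     # greedy: runs through the *last* colon
my ($class)   = /([^:]+:)/;  # negated class: stops at the first colon
my ($minimal) = /(.+?:)/;    # minimal: also stops at the first colon

print "greedy  = $greedy\nminimal = $minimal\n";
```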
The special symbol \b matches at a word boundary, which is defined as the “nothing” between a word character (\w) and a nonword character (\W), in either order. (The characters that don’t exist off the beginning and end of your string are considered to be nonword characters.) For example, /\bFred\b/ would match “Fred” in both “The Great Fred” and “Fred the Great”, but not in “Frederick the Great” because the “d” in “Frederick” is not followed by a nonword character.

In a similar vein, there are also anchors for the beginning of the string and the end of the string. If it is the first character of a pattern, the caret (^) matches the “nothing” at the beginning of the string. Therefore, the pattern /^Fred/ would match “Fred” in “Frederick the Great” but not in “The Great Fred”, whereas /Fred^/ wouldn’t match either. (In fact, it doesn’t even make much sense.) The dollar sign ($) works like the caret, except that it matches the “nothing” at the end of the string instead of the beginning.* So now you can probably figure out that when we said:

next LINE if $line =~ /^#/;

we meant “Go to the next iteration of the LINE loop if this line happens to begin with a # character.”

Earlier we said that the sequence \d{7,11} would match a number from seven to eleven digits long. While strictly true, the statement is misleading: when you use that sequence within a real pattern match operator such as /\d{7,11}/, it does not preclude there being extra unmatched digits after the 11 matched digits! You often need to anchor quantified patterns on either or both ends to get what you expect.

* This is a bit oversimplified, since we’re assuming here that your string contains no newlines; ^ and $ are actually anchors for the beginnings and endings of lines rather than strings. We’ll try to straighten this all out in Chapter 5, Pattern Matching (to the extent that it can be straightened out).
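To make the \d{7,11} caveat concrete, here is a minimal sketch (the sample string is ours, not from the text) showing how \b anchors change what the quantified pattern will accept:

```perl
my $string = "serial 1234567890123 end";   # a 13-digit run

# Unanchored: \d{7,11} happily matches 11 digits inside the longer run.
print "unanchored match\n" if $string =~ /\d{7,11}/;     # matches

# Anchored with \b on both ends: the whole digit run must itself be
# 7 to 11 digits long, so a 13-digit run fails to match at all.
print "anchored match\n" if $string =~ /\b\d{7,11}\b/;   # no match
```

The same idea applies with ^ and $ when you want the entire string, rather than a word within it, to be 7 to 11 digits.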
Backreferences

We mentioned earlier that you can use parentheses to group things for quantifiers, but you can also use parentheses to remember bits and pieces of what you matched. A pair of parentheses around a part of a regular expression causes whatever was matched by that part to be remembered for later use. It doesn’t change what the part matches, so /\d+/ and /(\d+)/ will still match as many digits as possible, but in the latter case they will be remembered in a special variable to be backreferenced later.

How you refer back to the remembered part of the string depends on where you want to do it from. Within the same regular expression, you use a backslash followed by an integer. The integer corresponding to a given pair of parentheses is determined by counting left parentheses from the beginning of the pattern, starting with one. So for example, to match something similar to an HTML tag like “<B>Bold</B>”, you might use /<(.*?)>.*?<\/\1>/. This forces the two parts of the pattern to match the exact same string, such as the “B” in this example.

Outside the regular expression itself, such as in the replacement part of a substitution, you use a $ followed by an integer, that is, a normal scalar variable named by the integer. So, if you wanted to swap the first two words of a string, for example, you could use:

s/(\S+)\s+(\S+)/$2 $1/

The right side of the substitution (between the second and third slashes) is mostly just a funny kind of double-quoted string, which is why you can interpolate variables there, including backreference variables. This is a powerful concept: interpolation (under controlled circumstances) is one of the reasons Perl is a good text-processing language. The other reason is the pattern matching, of course. Regular expressions are good for picking things apart, and interpolation is good for putting things back together again. Perhaps there’s hope for Humpty Dumpty after all.
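Both kinds of backreference can be seen in one small sketch (the sample strings are ours):

```perl
my $html = '<B>Bold</B> and <I>Italic</I>';

# Inside the pattern, \1 forces the closing tag to repeat whatever
# the first pair of parentheses captured for the opening tag.
while ($html =~ /<(.*?)>(.*?)<\/\1>/g) {
    print "tag=$1 text=$2\n";
}
# tag=B text=Bold
# tag=I text=Italic

# Outside the pattern, in a substitution's replacement, the same
# captures appear as the ordinary scalars $1, $2, and so on.
my $s = "hello world";
$s =~ s/(\S+)\s+(\S+)/$2 $1/;
print "$s\n";   # world hello
```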
List Processing

Much earlier in this chapter, we mentioned that Perl has two main contexts, scalar context (for dealing with singular things) and list context (for dealing with plural things). Many of the traditional operators we’ve described so far have been strictly scalar in their operation. They always take singular arguments (or pairs of singular arguments for binary operators) and always produce a singular result, even in list context. So if you write this:

@array = (1 + 2, 3 - 4, 5 * 6, 7 / 8);

you know that the list on the right side contains exactly four values, because the ordinary math operators always produce scalar values, even in the list context provided by the assignment to an array. However, other Perl operators can produce either a scalar or a list value, depending on their context. They just “know” whether a scalar or a list is expected of them. But how will you know that? It turns out to be pretty easy to figure out, once you get your mind around a few key concepts.

First, list context has to be provided by something in the “surroundings”. In the previous example, the list assignment provides it. Earlier we saw that the list of a foreach loop provides it. The print operator also provides it. But you don’t have to learn these one by one. If you look at the various syntax summaries scattered throughout the rest of the book, you’ll see various operators that are defined to take a LIST as an argument. Those are the operators that provide a list context. Throughout this book, LIST is used as a specific technical term to mean “a syntactic construct that provides a list context”. For example, if you look up sort, you’ll find the syntax summary:

sort LIST

That means that sort provides a list context to its arguments.

Second, at compile time (that is, while Perl is parsing your program and translating to internal opcodes), any operator that takes a LIST provides a list context to each syntactic element of that LIST.
So every top-level operator or entity in the LIST knows at compile time that it’s supposed to produce the best list it knows how to produce. This means that if you say:

sort @dudes, @chicks, other();

then each of @dudes, @chicks, and other() knows at compile time that it’s supposed to produce a list value rather than a scalar value. So the compiler generates internal opcodes that reflect this.

Later, at run time (when the internal opcodes are actually interpreted), each of those LIST elements produces its list in turn, and then (this is important) all the separate lists are joined together, end to end, into a single list. And that squashed-flat, one-dimensional list is what is finally handed off to the function that wanted the LIST in the first place. So if @dudes contains (Fred,Barney), @chicks contains (Wilma,Betty), and the other() function returns the single-element list (Dino), then the LIST that sort sees is:

(Fred,Barney,Wilma,Betty,Dino)

and the LIST that sort returns is:

(Barney,Betty,Dino,Fred,Wilma)

Some operators produce lists (like keys), while some consume them (like print), and others transform lists into other lists (like sort). Operators in the last category can be considered filters, except that, unlike in the shell, the flow of data is from right to left, since list operators operate on arguments passed in from the right. You can stack up several list operators in a row:

print reverse sort map {lc} keys %hash;

That takes the keys of %hash and returns them to the map function, which lowercases all the keys by applying the lc operator to each of them, and passes them to the sort function, which sorts them, and passes them to the reverse function, which reverses the order of the list elements, and passes them to the print function, which prints them. As you can see, that’s much easier to describe in Perl than in English. There are many other ways in which list processing produces more natural code.
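The flattening and the right-to-left pipeline can both be seen in a minimal runnable sketch (the data is the example from the text):

```perl
my @dudes  = ('Fred', 'Barney');
my @chicks = ('Wilma', 'Betty');
sub other { return ('Dino') }

# The three lists are squashed flat into one five-element list
# before sort ever sees them.
my @all = sort @dudes, @chicks, other();
print "@all\n";   # Barney Betty Dino Fred Wilma

# Stacked list operators: data flows from right to left.
print join(' ', reverse sort map { lc } qw(Foo BAR baz)), "\n";   # foo baz bar
```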
We can’t enumerate all the ways here, but for an example, let’s go back to regular expressions for a moment. We talked about using a pattern in a scalar context to see whether it matched, but if instead you use a pattern in a list context, it does something else: it pulls out all the backreferences as a list. Suppose you’re searching through a log file or a mailbox, and you want to parse a string containing a time of the form “12:59:59 am”. You might say this:

($hour, $min, $sec, $ampm) = /(\d+):(\d+):(\d+) *(\w+)/;

That’s a convenient way to set several variables simultaneously. But you could just as easily say:

@hmsa = /(\d+):(\d+):(\d+) *(\w+)/;

and put all four values into one array. Oddly, by decoupling the power of regular expressions from the power of Perl expressions, list context increases the power of the language. We don’t often admit it, but Perl is actually an orthogonal language in addition to being a diagonal language. Have your cake, and eat it too.

What You Don’t Know Won’t Hurt You (Much)

Finally, allow us to return once more to the concept of Perl as a natural language. Speakers of a natural language are allowed to have differing skill levels, to speak different subsets of the language, to learn as they go, and generally, to put the language to good use before they know the whole language. You don’t know all of Perl yet, just as you don’t know all of English. But that’s Officially Okay in Perl culture. You can work with Perl usefully, even though we haven’t even told you how to write your own subroutines yet. We’ve scarcely begun to explain how to view Perl as a system management language, or a rapid prototyping language, or a networking language, or an object-oriented language. We could write entire chapters about some of these things. (Come to think of it, we already did.) But in the end, you must create your own view of Perl. It’s your privilege as an artist to inflict the pain of creativity on yourself.
We can teach you how we paint, but we can’t teach you how you paint. There’s More Than One Way To Do It. Have the appropriate amount of fun.

2
Bits and Pieces

We’re going to start small, so this chapter is about the elements of Perl. Since we’re starting small, the progression through the next several chapters is necessarily from small to large. That is, we take a bottom-up approach, beginning with the smallest components of Perl programs and building them into more elaborate structures, much like molecules are built out of atoms.

The disadvantage of this approach is that you don’t necessarily get the Big Picture before getting lost in a welter of details. The advantage is that you can understand the examples as we go along. (If you’re a top-down person, just turn the book over and read the chapters backward.)

Each chapter does build on the preceding chapter (or the subsequent chapter, if you’re reading backward), so you’ll need to be careful if you’re the sort of person who skips around. You’re certainly welcome to peek at the reference materials toward the end of the book as we go along. (That doesn’t count as skipping around.) In particular, any isolated word in typewriter font is likely to be found in Chapter 29, Functions. And although we’ve tried to stay operating-system neutral, if you are unfamiliar with Unix terminology and run into a word that doesn’t seem to mean what you think it ought to mean, you should check whether the word is in the Glossary. If the Glossary doesn’t work, the index probably will.

Atoms

Although there are various invisible things going on behind the scenes that we’ll explain presently, the smallest things you generally work with in Perl are individual characters. And we do mean characters; historically, Perl freely confused bytes with characters and characters with bytes, but in this new era of global networking, we must be careful to distinguish the two.
Perl may, of course, be written entirely in the 7-bit ASCII character set. Perl also allows you to write in any 8-bit or 16-bit character set, whether it’s a national character set or some other legacy character set. However, if you choose to write in one of these older, non-ASCII character sets, you may use non-ASCII characters only within string literals. You are responsible for making sure that the semantics of your program are consistent with the particular national character set you’ve chosen. For instance, if you’re using a 16-bit encoding for an Asian national character set, keep in mind that Perl will generally think of each of your characters as two bytes, not as one character.

As described in Chapter 15, Unicode, we’ve recently added support for Unicode to Perl.* This support is pervasive throughout the language: you can use Unicode characters in identifiers (variable names and such) as well as within literal strings. When you are using Unicode, you don’t need to worry about how many bits or bytes it takes to represent a character. Perl just pretends all Unicode characters are the same size (that is, size 1), even though any given character might be represented by multiple bytes internally. Perl normally represents Unicode internally as UTF-8, a variable-length encoding. (For instance, a Unicode smiley character, U+263A, would be represented internally as a three-byte sequence.)

If you’ll let us drive our analogy of the physical elements a bit further, characters are atomic in the same sense as the individual atoms of the various elements. Yes, they’re composed of smaller particles known as bits and bytes, but if you break a character apart (in a character accelerator, no doubt), the individual bits and bytes lose the distinguishing chemical properties of the character as a whole. Just as neutrons are an implementation detail of the U-238 atom, so too bytes are an implementation detail of the U+263A character.
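The character/byte distinction can be seen directly with the smiley itself; here is a minimal sketch (the scoped use bytes pragma is described shortly):

```perl
my $smiley = "\x{263A}";          # WHITE SMILING FACE, one Unicode character

print length($smiley), "\n";      # 1 -- Perl counts characters by default

{
    use bytes;                    # within this block, look at raw bytes instead
    print length($smiley), "\n";  # 3 -- the internal UTF-8 encoding of U+263A
}
```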
So we’ll be careful to say “characters” when we mean characters, and “bytes” when we mean bytes. But we don’t mean to scare you — you can still do the good old-fashioned byte processing easily enough. All you have to do is tell Perl that you still want to think of bytes as characters. You can do that with a use bytes pragma (see Chapter 31, Pragmatic Modules). But even if you don’t do that, Perl will still do a pretty good job of keeping small characters in 8 bits when you expect it to. So don’t sweat the small stuff. Let’s move on to bigger and better things.

* As excited as we are about Unicode support, most of our examples will be in ASCII, since not everyone has a decent Unicode editor yet.

Molecules

Perl is a free-form language, but that doesn’t mean that Perl is totally free of form. As computer folks usually use the term, a free-form language is one in which you can put spaces, tabs, and newlines anywhere you like—except where you can’t. One obvious place you can’t put a whitespace character is in the middle of a token. A token is what we call a sequence of characters with a unit of meaning, much like a simple word in natural language. But unlike the typical word, a token might contain other characters besides letters, just as long as they hang together to form a unit of meaning. (In that sense, they’re more like molecules, which don’t have to be composed of only one particular kind of atom.) For example, numbers and mathematical operators are considered tokens. An identifier is a token that starts with a letter or underscore and contains only letters, digits, and underscores. A token may not contain whitespace characters because this would split the token into two tokens, just as a space in an English word turns it into two words.*

Although whitespace is allowed between any two tokens, whitespace is required only between tokens that would otherwise be confused as a single token. All whitespace is equivalent for this purpose.
Newlines are distinguished from spaces and tabs only within quoted strings, formats, and certain line-oriented forms of quoting. Specifically, newlines do not terminate statements as they do in certain other languages (such as FORTRAN or Python). Statements in Perl are terminated with semicolons, just as they are in C and its various derivatives.

Unicode whitespace characters are allowed in a Unicode Perl program, but you need to be careful. If you use the special Unicode paragraph and line separators, be aware that Perl may count line numbers differently than your text editor does, so error messages may be more difficult to interpret. It’s best to stick with good old-fashioned newlines.

Tokens are recognized greedily; if at a particular point the Perl parser has a choice between recognizing a short token or a long token, it will choose the long one. If you meant it to be two tokens, just insert some whitespace between the tokens. (We tend to put extra space around most operators anyway, just for readability.)

Comments are indicated by the # character and extend from there through the end of the line. A comment counts as whitespace for separating tokens. The Perl language attaches no special meaning to anything you might put into a comment.†

* The astute reader will point out that literal strings may contain whitespace characters. But strings can get away with it only because they have quotes on both ends to keep the spaces from leaking out.
† Actually, that’s a small fib. The Perl parser does look for command-line switches on an initial #! line (see Chapter 19, The Command-Line Interface). It can also interpret the line number directives that various preprocessors produce (see the section “Generating Perl in Other Languages” in Chapter 24, Common Practices).

One other oddity is that if a line begins with = anywhere a statement would be legal, Perl ignores everything from that line down to the next line that begins with =cut.
The ignored text is assumed to be pod, or “plain old documentation”. The Perl distribution has programs that will extract pod commentary from Perl modules and turn it into flat text, manpages, LaTeX, HTML, or (someday soon) XML documents. In a complementary fashion, the Perl parser extracts the Perl code from Perl modules and ignores the pod. So you may consider this an alternate, multiline form of commenting. You may also consider it completely nuts, but Perl modules documented this way never lose track of their documentation.

See Chapter 26, Plain Old Documentation, for details on pod, including a description of how to effect multiline comments in Perl. But don’t look down on the normal comment character. There’s something comforting about the visual effect of a nice row of # characters down the left side of a multiline comment. It immediately tells your eyes: “This is not code.” You’ll note that even in languages with multiline commenting mechanisms like C, people often put a row of * characters down the left side of their comments anyway. Appearances are often more important than they appear.

In Perl, just as in chemistry and in language, you can build larger and larger structures out of the smaller ones. We already mentioned the statement; it’s just a sequence of tokens that make up a command, that is, a sentence in the imperative mood. You can combine a sequence of statements into a block that is delimited by braces (also known affectionately as “curlies” by people who confuse braces with suspenders). Blocks can in turn be combined into larger blocks. Some blocks function as subroutines, which can be combined into modules, which can be combined into programs. But we’re getting ahead of ourselves—those are subjects for coming chapters. Let’s build some more tokens out of characters.

Built-in Data Types

Before we start talking about various kinds of tokens you can build from characters, we need a few more abstractions.
To be specific, we need three data types. Computer languages vary in how many and what kinds of data types they provide. Unlike some commonly used languages that provide many confusing types for similar kinds of values, Perl provides just a few built-in data types. Consider C, in which you might run into char, short, int, long, long long, bool, wchar_t, size_t, off_t, regex_t, uid_t, u_longlong_t, pthread_key_t, fp_exception_field_type, and so on. That’s just some of the integer types! Then there are floating-point numbers, and pointers, and strings.

All these complicated types correspond to just one type in Perl: the scalar. (Usually Perl’s simple data types are all you need, but if not, you’re free to define fancy dynamic types using Perl’s object-oriented features — see Chapter 12, Objects.) Perl’s three basic data types are: scalars, arrays of scalars, and hashes of scalars (also known as associative arrays). Some people may prefer to call these data structures rather than types. That’s okay.

Scalars are the fundamental type from which more complicated structures are built. A scalar stores a single, simple value—typically a string or a number. Elements of this simple type may be combined into either of the two aggregate types. An array is an ordered list of scalars that you access with an integer subscript (or index). All indexing in Perl starts at 0. Unlike many programming languages, however, Perl treats negative subscripts as valid: instead of counting from the beginning, negative subscripts count back from the end of whatever it is you’re indexing into. (This applies to various substring and sublist operations as well as to regular subscripting.) A hash, on the other hand, is an unordered set of key/value pairs that you access using strings (the keys) as subscripts to look up the scalars (the values) corresponding to a given key. Variables are always one of these three types.
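A minimal sketch of the three types, including a negative subscript counting back from the end of an array:

```perl
my $camels = 42;                             # a scalar: one simple value
my @beasts = ('camel', 'llama', 'alpaca');   # an array of scalars

print $beasts[0],  "\n";   # camel  -- indexing starts at 0
print $beasts[-1], "\n";   # alpaca -- negative subscripts count from the end

my %humps = (camel => 2, dromedary => 1);    # a hash of scalars
print $humps{camel}, "\n"; # 2 -- a string key looks up its value
```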
(Other than variables, Perl also has other abstractions that you can think of as data types, such as filehandles, directory handles, formats, subroutines, symbol tables, and symbol table entries.)

Abstractions are wonderful, and we’ll collect more of them as we go along, but they’re also useless in a way. You can’t do anything with an abstraction directly. That’s why computer languages have syntax. We need to introduce you to the various kinds of syntactic terms you can use to pull your abstract data into expressions. We like to use the technical term term when we want to talk in terms of these syntactic units. (Hmm, this could get terminally confusing. Just remember how your math teacher used to talk about the terms of an equation, and you won’t go terribly wrong.)

Just like the terms in a math equation, the purpose of most terms in Perl is to produce values for operators like addition and multiplication to operate on. Unlike in a math equation, however, Perl has to do something with the values it calculates, not just think with a pencil in its hand about whether the two sides of the equation are equal. One of the most common things to do with a value is to store it somewhere:

$x = $y;

That’s an example of the assignment operator (not the numeric equality operator, which is spelled == in Perl). The assignment gets the value from $y and puts it into $x. Notice that we aren’t using the term $x for its value; we’re using it for its location. (The old value of $x gets clobbered by the assignment.) We say that $x is an lvalue, meaning it’s the sort of storage location we can use on the left side of an assignment. We say that $y is an rvalue because it’s used on the right side.

There’s also a third kind of value, called a temporary value, that you need to understand if you want to know what Perl is really doing with your lvalues and rvalues.
If we do some actual math and say:

$x = $y + 1;

Perl takes the rvalue $y and adds the rvalue 1 to it, which produces a temporary value that is eventually assigned to the lvalue $x. It may help you to visualize what is going on if we tell you that Perl stores these temporary values in an internal structure called a stack.* The terms of an expression (the ones we’re talking about in this chapter) tend to push values onto the stack, while the operators of the expression (which we’ll discuss in the next chapter) tend to pop them back off the stack, perhaps leaving another temporary result on the stack for the next operator to work with. The pushes and pops all balance out—by the time the expression is done, the stack is entirely empty (or as empty as it was when we started). More about temporary values later.

Some terms can only be rvalues, such as the 1 above, while others can serve as either lvalues or rvalues. In particular, as the assignments above illustrate, a variable may function as either. And that’s what our next section is about.

Variables

Not surprisingly, there are three variable types corresponding to the three abstract data types we mentioned earlier. Each of these is prefixed by what we call a funny character.† Scalar variables are always named with an initial $, even when referring to a scalar that is part of an array or hash. It works a bit like the English word “the”. Thus, we have:

    Construct       Meaning
    $days           Simple scalar value $days
    $days[28]       29th element of array @days
    $days{'Feb'}    "Feb" value from hash %days

Note that we can use the same name for $days, @days, and %days without Perl getting confused.

* A stack works just like one of those spring-loaded plate dispensers you see in a buffet restaurant — you can push plates onto the top of the stack, or you can pop them off again (to use the Comp. Sci. vernacular).
† That’s another technical term in computer science. (And if it wasn’t before, it is now.)
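The three variables sharing one name can be seen side by side in a small sketch (the values are ours):

```perl
my $days = 365;                        # the scalar $days
my @days = (31, 28, 31);               # the array @days -- a different variable
my %days = (Jan => 31, Feb => 28);     # the hash %days -- different again

print $days,      "\n";   # 365
print $days[1],   "\n";   # 28 -- an element of @days, despite the initial $
print $days{Feb}, "\n";   # 28 -- a value from %days
```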
There are other, fancier scalar terms, useful in specialized situations that we won’t go into yet. They look like this:

    Construct            Meaning
    ${days}              Same as $days but unambiguous before alphanumerics
    $Dog::days           Different $days variable, in the Dog package
    $#days               Last index of array @days
    $days->[28]          29th element of array pointed to by reference $days
    $days[0][2]          Multidimensional array
    $days{2000}{'Feb'}   Multidimensional hash
    $days{2000,'Feb'}    Multidimensional hash emulation

Entire arrays (or slices of arrays and hashes) are named with the funny character @, which works much like the words “these” or “those”:

    Construct            Meaning
    @days                Array containing ($days[0], $days[1], ... $days[n])
    @days[3, 4, 5]       Array slice containing ($days[3], $days[4], $days[5])
    @days[3..5]          Array slice containing ($days[3], $days[4], $days[5])
    @days{'Jan','Feb'}   Hash slice containing ($days{'Jan'}, $days{'Feb'})

Entire hashes are named by %:

    Construct   Meaning
    %days       (Jan => 31, Feb => $leap ? 29 : 28, ...)

Any of these constructs may also serve as an lvalue, specifying a location you could assign a value to. With arrays, hashes, and slices of arrays or hashes, the lvalue provides multiple locations to assign to, so you can assign multiple values to them all at once:

@days = 1 .. 7;

Names

We’ve talked about storing values in variables, but the variables themselves (their names and their associated definitions) also need to be stored somewhere. In the abstract, these places are known as namespaces. Perl provides two kinds of namespaces, which are often called symbol tables and lexical scopes.* You may have an arbitrary number of symbol tables or lexical scopes, but every name you define gets stored in one or the other.

* We also call them packages and pads when we’re talking about Perl’s specific implementations, but those longer monikers are the generic industry terms, so we’re pretty much stuck with them. Sorry.
We’ll explain both kinds of namespaces as we go along. For now we’ll just say that symbol tables are global hashes that happen to contain symbol table entries for global variables (including the hashes for other symbol tables). In contrast, lexical scopes are unnamed scratchpads that don’t live in any symbol table, but are attached to a block of code in your program. They contain variables that can only be seen by the block. (That’s what we mean by a scope. The lexical part just means “having to do with text”, which is not at all what a lexicographer would mean by it. Don’t blame us.)

Within any given namespace (whether global or lexical), every variable type has its own subnamespace, determined by the funny character. You can, without fear of conflict, use the same name for a scalar variable, an array, or a hash (or, for that matter, a filehandle, a subroutine name, a label, or your pet llama). This means that $foo and @foo are two different variables. Together with the previous rules, it also means that $foo[1] is an element of @foo totally unrelated to the scalar variable $foo. This may seem a bit weird, but that’s okay, because it is weird.

Subroutines may be named with an initial &, although the funny character is optional when calling the subroutine. Subroutines aren’t generally considered lvalues, though recent versions of Perl allow you to return an lvalue from a subroutine and assign to that, so it can look as though you’re assigning to the subroutine.

Sometimes you just want a name for “everything named foo” regardless of its funny character. So symbol table entries can be named with an initial *, where the asterisk stands for all the other funny characters. These are called typeglobs, and they have several uses. They can also function as lvalues. Assignment to typeglobs is how Perl implements importing of symbols from one symbol table to another. More about that later too.
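Subnamespaces and typeglob assignment can be sketched in a few lines (variable names are ours; no strict in effect, since typeglob aliasing creates package variables):

```perl
our $foo = "scalar foo";
our @foo = ("array", "foo");

# $foo and @foo live in separate subnamespaces; no conflict at all.
print "$foo / @foo\n";    # scalar foo / array foo

# The typeglob *foo names the whole symbol table entry at once, so this
# one assignment aliases every kind of "bar" to the matching "foo":
*bar = *foo;
print "$bar / @bar\n";    # scalar foo / array foo
```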
Like most computer languages, Perl has a list of reserved words that it recognizes as special keywords. However, because variable names always start with a funny character, reserved words don’t actually conflict with variable names. Certain other kinds of names don’t have funny characters, though, such as labels and filehandles. With these, you do have to worry (a little) about conflicting with reserved words. Since most reserved words are entirely lowercase, we recommend that you pick label and filehandle names that contain uppercase letters. For example, if you say open(LOG, "logfile") rather than the regrettable open(log, "logfile"), you won’t confuse Perl into thinking you’re talking about the built-in log operator (which does logarithms, not tree trunks). Using uppercase filehandles also improves readability* and protects you from conflict with reserved words we might add in the future. For similar reasons, user-defined modules are typically named with initial capitals so that they’ll look different from the built-in modules known as pragmas, which are named in all lowercase. And when we get to object-oriented programming, you’ll notice that class names are usually capitalized for the same reason.

As you might deduce from the preceding paragraph, case is significant in identifiers — FOO, Foo, and foo are all different names in Perl. Identifiers start with a letter or underscore and may be of any length (for values of “any” ranging between 1 and 251, inclusive) and may contain letters, digits, and underscores. This includes Unicode letters and digits. Unicode ideographs also count as letters, but we don’t recommend you use them unless you can read them. See Chapter 15. Names that follow funny characters don’t have to be identifiers, strictly speaking.

* One of the design principles of Perl is that different things should look different. Contrast this with languages that try to force different things to look the same, to the detriment of readability.
They can start with a digit, in which case they may only contain more digits, as in $123. Names that start with anything other than a letter, digit, or underscore are (usually) limited to that one character (like $? or $$), and generally have a predefined significance to Perl. For example, just as in the Bourne shell, $$ is the current process ID and $? the exit status of your last child process.

As of version 5.6, Perl also has an extensible syntax for internal variable names. Any variable of the form ${^NAME} is a special variable reserved for use by Perl. All these non-identifier names are forced to be in the main symbol table. See Chapter 28, Special Names, for some examples.

It’s tempting to think of identifiers and names as the same thing, but when we say name, we usually mean a fully qualified name, that is, a name that says which symbol table it lives in. Such names may be formed of a sequence of identifiers separated by the :: token:

$Santa::Helper::Reindeer::Rudolph::nose

That works just like the directories and filenames in a pathname:

/Santa/Helper/Reindeer/Rudolph/nose

In the Perl version of that notion, all the leading identifiers are the names of nested symbol tables, and the last identifier is the name of the variable within the most deeply nested symbol table. For instance, in the variable above, the symbol table is named Santa::Helper::Reindeer::Rudolph::, and the actual variable within that symbol table is $nose. (The value of that variable is, of course, “red”.)

A symbol table in Perl is also known as a package, so these are often called package variables. Package variables are nominally private to the package in which they exist, but are global in the sense that the packages themselves are global. That is, anyone can name the package to get at the variable; it’s just hard to do this by accident. For instance, any program that mentions $Dog::bert is asking for the $bert variable within the Dog:: package.
That is an entirely separate variable from $Cat::bert. See Chapter 10, Packages.

Variables attached to a lexical scope are not in any package, so lexically scoped variable names may not contain the :: sequence. (Lexically scoped variables are declared with a my declaration.)

Name Lookups

So the question is, what’s in a name? How does Perl figure out what you mean if you just say $bert? Glad you asked. Here are the rules the Perl parser uses while trying to understand an unqualified name in context:

1. First, Perl looks earlier in the immediately enclosing block to see whether the variable is declared in that same block with a my (or our) declaration (see those entries in Chapter 29, as well as the section “Scoped Declarations” in Chapter 4, Statements and Declarations). If there is a my declaration, the variable is lexically scoped and doesn’t exist in any package—it exists only in that lexical scope (that is, in the block’s scratchpad). Because lexical scopes are unnamed, nobody outside that chunk of program can even name your variable.*

2. If that doesn’t work, Perl looks for the block enclosing that block and tries again for a lexically scoped variable in the larger block. Again, if Perl finds one, the variable belongs only to the lexical scope from the point of declaration through the end of the block in which it is declared—including any nested blocks, like the one we just came from in step 1. If Perl doesn’t find a declaration, it repeats step 2 until it runs out of enclosing blocks.

3. When Perl runs out of enclosing blocks, it examines the whole compilation unit for declarations as if it were a block. (A compilation unit is just the entire current file, or the string currently being compiled by an eval STRING operator.) If the compilation unit is a file, that’s the largest possible lexical scope, and Perl will look no further for lexically scoped variables, so we go to step 4. If the compilation unit is a string, however, things get fancier.
A string compiled as Perl code at run time pretends that it’s a block within the lexical scope from which the eval STRING is running, even though the actual boundaries of the lexical scope are the limits of the string containing the code rather than any real braces. So if Perl doesn’t find the variable in the lexical scope of the string, we pretend that the eval STRING is a block and go back to step 2, only this time starting with the lexical scope of the eval STRING operator instead of the lexical scope inside its string.

* If you use an our declaration instead of a my declaration, this only declares a lexically scoped alias (a nickname) for a package variable, rather than declaring a true lexically scoped variable the way my does. Outside code can still get at the real variable through its package, but in all other respects an our declaration behaves like a my declaration. This is handy when you’re trying to limit your own use of globals with the use strict pragma (see the strict pragma in Chapter 31). But you should always prefer my if you don’t need a global.

4. If we get here, it means Perl didn’t find any declaration (either my or our) for your variable. Perl now gives up on lexically scoped variables and assumes that your variable is a package variable. If the strict pragma is in effect, you will now get an error, unless the variable is one of Perl’s predefined variables or has been imported into the current package. This is because that pragma disallows the use of unqualified global names. However, we aren’t done with lexical scopes just yet. Perl does the same search of lexical scopes as it did in steps 1 through 3, only this time it searches for package declarations instead of variable declarations. If it finds such a package declaration, it knows that the current code is being compiled for the package in question and prepends the declared package name to the front of the variable.

5.
If there is no package declaration in any surrounding lexical scope, Perl looks for the variable name in the unnamed top-level package, which happens to have the name main when it isn’t going around without a name tag. So in the absence of any declarations to the contrary, $bert means the same as $::bert, which means the same as $main::bert. (But because main is just another package in the top-level unnamed package, it’s also $::main::bert, and $main::main::bert, $::main::main::bert, and so on. This could be construed as a useless fact. But see “Symbol Tables” in Chapter 10.)

There are several implications to these search rules that might not be obvious, so we’ll make them explicit.

• Because the file is the largest possible lexical scope, a lexically scoped variable can never be visible outside the file in which it’s declared. File scopes do not nest.

• Any particular bit of Perl is compiled in at least one lexical scope and exactly one package scope. The mandatory lexical scope is, of course, the file itself. Additional lexical scopes are provided by each enclosing block. All Perl code is also compiled in the scope of exactly one package, and although the declaration of which package you’re in is lexically scoped, packages themselves are not lexically constrained. That is, they’re global.

• An unqualified variable name may therefore be searched for in many lexical scopes, but only one package scope, whichever one is currently in effect (which is lexically determined).

• A variable name may only attach to one scope. Although at least two different scopes (lexical and package) are active everywhere in your program, a variable can only exist in one of those scopes.

• An unqualified variable name can therefore resolve to only a single storage location, either in the first enclosing lexical scope in which it is declared, or else in the current package—but not both.
The search stops as soon as that storage location is resolved, and any storage location that it would have found had the search continued is effectively hidden.

• The location of the typical variable name can be completely determined at compile time.

Now that you know all about how the Perl compiler deals with names, you sometimes have the problem that you don’t know the name of what you want at compile time. Sometimes you want to name something indirectly; we call this the problem of indirection. So Perl provides a mechanism: you can always replace an alphanumeric variable name with a block containing an expression that returns a reference to the real data. For instance, instead of saying:

    $bert

you might say:

    ${ some_expression() }

and if the some_expression() function returns a reference to variable $bert (or even the string, "bert"), it will work just as if you’d said $bert in the first place. On the other hand, if the function returns a reference to $ernie, you’ll get his variable instead. The syntax shown is the most general (and least legible) form of indirection, but we’ll cover several convenient variations in Chapter 8, References.

Scalar Values

Whether it’s named directly or indirectly, and whether it’s in a variable, or an array element, or is just a temporary value, a scalar always contains a single value. This value may be a number, a string, or a reference to another piece of data. Or, there might even be no value at all, in which case the scalar is said to be undefined. Although we might speak of a scalar as “containing” a number or a string, scalars are typeless: you are not required to declare your scalars to be of type integer or floating-point or string or whatever.*

* Future versions of Perl will allow you to insert int, num, and str type declarations, not to enforce strong typing, but only to give the optimizer hints about things that it might not figure out for itself.
Generally, you’d only consider doing this in tight code that must run very fast, so we’re not going to tell you how to do it yet. Optional types are also used by the pseudohash mechanism, in which case they can function as types do in a more strongly typed language. See Chapter 8 for more.

Perl stores strings as sequences of characters, with no arbitrary constraints on length or content. In human terms, you don’t have to decide in advance how long your strings are going to get, and you can include any characters including null bytes within your string. Perl stores numbers as signed integers if possible, or as double-precision floating-point values in the machine’s native format otherwise. Floating-point values are not infinitely precise. This is important to remember because comparisons like (10/3 == 1/3*10) tend to fail mysteriously.

Perl converts between the various subtypes as needed, so you can treat a number as a string or a string as a number, and Perl will do the Right Thing. To convert from string to number, Perl internally uses something like the C library’s atof(3) function. To convert from number to string, it does the equivalent of an sprintf(3) with a format of "%.14g" on most machines. Improper conversions of a nonnumeric string like foo to a number count as numeric 0; these trigger warnings if you have them enabled, but are silent otherwise. See Chapter 5, Pattern Matching, for examples of detecting what sort of data a string holds.

Although strings and numbers are interchangeable for nearly all intents, references are a bit different. They’re strongly typed, uncastable pointers with built-in reference-counting and destructor invocation. That is, you can use them to create complex data types, including user-defined objects. But they’re still scalars, for all that, because no matter how complicated a data structure gets, you often want to treat it as a single value.
By uncastable, we mean that you can’t, for instance, convert a reference to an array into a reference to a hash. References are not castable to other pointer types. However, if you use a reference as a number or a string, you will get a numeric or string value, which is guaranteed to retain the uniqueness of the reference even though the “referenceness” of the value is lost when the value is copied from the real reference. You can compare such values or extract their type. But you can’t do much else with the values, since there’s no way to convert numbers or strings back into references. Usually, this is not a problem, because Perl doesn’t force you to do pointer arithmetic—or even allow it. See Chapter 8 for more on references.

Numeric Literals

Numeric literals are specified in any of several customary* floating-point or integer formats:

    $x = 12345;             # integer
    $x = 12345.67;          # floating point
    $x = 6.02e23;           # scientific notation
    $x = 4_294_967_296;     # underline for legibility
    $x = 0377;              # octal
    $x = 0xffff;            # hexadecimal
    $x = 0b1100_0000;       # binary

* Customary in Unix culture, that is. If you’re from a different culture, welcome to ours!

Because Perl uses the comma as a list separator, you cannot use it to separate the thousands in a large number. Perl does allow you to use an underscore character instead. The underscore only works within literal numbers specified in your program, not for strings functioning as numbers or data read from somewhere else. Similarly, the leading 0x for hexadecimal, 0b for binary, and 0 for octal work only for literals. The automatic conversion of a string to a number does not recognize these prefixes—you must do an explicit conversion* with the oct function—which works for hex and binary numbers, too, as it happens, provided you supply the 0x or 0b on the front.

String Literals

String literals are usually surrounded by either single or double quotes.
They work much like Unix shell quotes: double-quoted string literals are subject to backslash and variable interpolation, but single-quoted strings are not (except for \' and \\, so that you can embed single quotes and backslashes into single-quoted strings). If you want to embed any other backslash sequences such as \n (newline), you must use the double-quoted form. (Backslash sequences are also known as escape sequences, because you “escape” the normal interpretation of characters temporarily.)

A single-quoted string must be separated from a preceding word by a space because a single quote is a valid—though archaic—character in an identifier. Its use has been replaced by the more visually distinct :: sequence. That means that $main'var and $main::var are the same thing, but the second is generally considered easier to read for people and programs.

Double-quoted strings are subject to various forms of character interpolation, many of which will be familiar to programmers of other languages. These are listed in Table 2-1.

Table 2-1. Backslashed Character Escapes

    Code        Meaning
    \n          Newline (usually LF)
    \r          Carriage return (usually CR)
    \t          Horizontal tab
    \f          Form feed
    \b          Backspace
    \a          Alert (bell)
    \e          ESC character
    \033        ESC in octal
    \x7f        DEL in hexadecimal
    \cC         Control-C
    \x{263a}    Unicode (smiley)
    \N{NAME}    Named character

The \N{NAME} notation is usable only in conjunction with the use charnames pragma described in Chapter 31. This allows you to specify character names symbolically, as in \N{GREEK SMALL LETTER SIGMA}, \N{greek:Sigma}, or \N{sigma}—depending on how you call the pragma.

* Sometimes people think Perl should convert all incoming data for them. But there are far too many decimal numbers with leading zeros in the world to make Perl do this automatically. For example, the Zip Code for the O’Reilly & Associates office in Cambridge, MA, is 02140. The postmaster would get confused if your mailing label program turned 02140 into 1120 decimal.
See also Chapter 15.

There are also escape sequences to modify the case or “meta-ness” of subsequent characters. See Table 2-2.

Table 2-2. Translation Escapes

    Code    Meaning
    \u      Force next character to uppercase (“titlecase” in Unicode).
    \l      Force next character to lowercase.
    \U      Force all following characters to uppercase.
    \L      Force all following characters to lowercase.
    \Q      Backslash all following nonalphanumeric characters.
    \E      End \U, \L, or \Q.

You may also embed newlines directly in your strings; that is, they can begin and end on different lines. This is often useful, but it also means that if you forget a trailing quote, the error will not be reported until Perl finds another line containing the quote character, which may be much further on in the script. Fortunately, this usually causes an immediate syntax error on the same line, and Perl is then smart enough to warn you that you might have a runaway string where it thought the string started.

Besides the backslash escapes listed above, double-quoted strings are subject to variable interpolation of scalar and list values. This means that you can insert the values of certain variables directly into a string literal. It’s really just a handy form of string concatenation.*

Variable interpolation may be done for scalar variables, entire arrays (but not hashes), single elements from an array or hash, or slices (multiple subscripts) of an array or hash. Nothing else interpolates. In other words, you may only interpolate expressions that begin with $ or @, because those are the two characters (along with backslash) that the string parser looks for. Inside strings, a literal @ that is not part of an array or slice identifier but is followed by an alphanumeric character must be escaped with a backslash (\@), or else a compilation error will result.
Although a complete hash specified with a % may not be interpolated into the string, single hash values or hash slices are okay, because they begin with $ and @ respectively. The following code segment prints out “The price is $100.”:

    $Price = '$100';                # not interpolated
    print "The price is $Price.\n"; # interpolated

As in some shells, you can put braces around the identifier to distinguish it from following alphanumerics: "How ${verb}able!". An identifier within such braces is forced to be a string, as is any single identifier within a hash subscript. For example:

    $days{'Feb'}

can be written as:

    $days{Feb}

and the quotes will be assumed. Anything more complicated in the subscript is interpreted as an expression, and then you’d have to put in the quotes:

    $days{'February 29th'}      # Ok.
    $days{"February 29th"}      # Also ok. "" doesn't have to interpolate.
    $days{ February 29th }      # WRONG, produces parse error.

In particular, you should always use quotes in slices such as:

    @days{'Jan','Feb'}          # Ok.
    @days{"Jan","Feb"}          # Also ok.
    @days{ Jan, Feb }           # Kinda wrong (breaks under use strict)

Apart from the subscripts of interpolated array and hash variables, there are no multiple levels of interpolation. Contrary to the expectations of shell programmers, backticks do not interpolate within double quotes, nor do single quotes impede evaluation of variables when used within double quotes.

* With warnings enabled, Perl may report undefined values interpolated into strings as using the concatenation or join operations, even though you don’t actually use those operators there. The compiler created them for you anyway.

Interpolation is extremely powerful but strictly controlled in Perl. It happens only inside double quotes, and in certain other “double-quotish” operations that we’ll describe in the next section:

    print "\n";     # Ok, print a newline.
    print  \n ;     # WRONG, no interpolative context.
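The interpolating forms just enumerated can be sketched together in one place (the variable names and data here are our own, for illustration only):

```perl
# Every form that interpolates: a scalar, a whole array,
# single elements, and slices of arrays and hashes.
my $camel  = "Amelia";
my @beasts = ("camel", "llama", "alpaca");
my %humps  = (camel => 2, dromedary => 1);

print "Scalar:       $camel\n";
print "Whole array:  @beasts\n";                  # joined with $" (a space)
print "One element:  $beasts[1]\n";
print "Hash element: $humps{camel}\n";
print "Array slice:  @beasts[0,2]\n";
print "Hash slice:   @humps{'camel','dromedary'}\n";
```

Note that the whole-hash form, "%humps", would not interpolate at all; only the $ and @ forms shown above do.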
Pick Your Own Quotes

Although we usually think of quotes as literal values, in Perl they function more like operators, providing various kinds of interpolating and pattern-matching capabilities. Perl provides the customary quote characters for these behaviors, but also provides a more general way for you to choose your quote character for any of them. In Table 2-3, any nonalphanumeric, nonwhitespace delimiter may be used in place of /. (The newline and space characters are no longer allowed as delimiters, although ancient versions of Perl once allowed this.)

Table 2-3. Quote Constructs

    Customary   Generic   Meaning                 Interpolates
    ''          q//       Literal string          No
    ""          qq//      Literal string          Yes
    ``          qx//      Command execution       Yes
    ()          qw//      Word list               No
    //          m//       Pattern match           Yes
    s///        s///      Pattern substitution    Yes
    y///        tr///     Character translation   No
    ""          qr//      Regular expression      Yes

Some of these are simply forms of “syntactic sugar” to let you avoid putting too many backslashes into quoted strings, particularly into pattern matches where your regular slashes and backslashes tend to get all tangled. If you choose single quotes for delimiters, no variable interpolation is done even on those forms that ordinarily interpolate. If the opening delimiter is an opening parenthesis, bracket, brace, or angle bracket, the closing delimiter will be the corresponding closing character. (Embedded occurrences of the delimiters must match in pairs.) Examples:

    $single = q!I said, "You said, 'She said it.'"!;
    $double = qq(Can't we get some "good" $variable?);
    $chunk_of_code = q {
        if ($condition) {
            print "Gotcha!";
        }
    };

The last example demonstrates that you can use whitespace between the quote specifier and its initial bracketing character. For two-element constructs like s/// and tr///, if the first pair of quotes is a bracketing pair, the second part gets its own starting quote character. In fact, the second pair needn’t be the same as the first pair.
So you can write things like s<foo>(bar) or tr(a-f)[A-F]. Because whitespace is also allowed between the two inner quote characters, you could even write that last one as:

    tr (a-f)
       [A-F];

Whitespace is not allowed, however, when # is being used as the quoting character. q#foo# is parsed as the string 'foo', while q #foo# is parsed as the quote operator q followed by a comment. Its delimiter will be taken from the next line. Comments can also be placed in the middle of two-element constructs, which allows you to write:

    s {foo}  # Replace foo
      {bar}; # with bar.

    tr [a-f] # Transliterate lowercase hex
       [A-F]; # to uppercase hex

Or Leave the Quotes Out Entirely

A name that has no other interpretation in the grammar will be treated as if it were a quoted string. These are known as barewords.* As with filehandles and labels, a bareword that consists entirely of lowercase letters risks conflict with future reserved words. If you have warnings enabled, Perl will warn you about barewords. For example:

    @days = (Mon,Tue,Wed,Thu,Fri);
    print STDOUT hello, ' ', world, "\n";

sets the array @days to the short form of the weekdays and prints “hello world” followed by a newline on STDOUT. If you leave the filehandle out, Perl tries to interpret hello as a filehandle, resulting in a syntax error. Because this is so error-prone, some people may wish to avoid barewords entirely. The quoting operators listed earlier provide many convenient forms, including the qw// “quote words” construct, which nicely quotes a list of space-separated words:

    @days = qw(Mon Tue Wed Thu Fri);
    print STDOUT "hello world\n";

* Variable names, filehandles, labels, and the like are not considered barewords because they have a meaning forced by a preceding token or a following token (or both). Predeclared names such as subroutines aren’t barewords either. It’s only a bareword when the parser has no clue.

You can go as far as to outlaw barewords entirely.
If you say:

    use strict 'subs';

then any bareword will produce a compile-time error. The restriction lasts through the end of the enclosing scope. An inner scope may countermand this by saying:

    no strict 'subs';

Note that the bare identifiers in constructs like:

    "${verb}able"
    $days{Feb}

are not considered barewords since they’re allowed by explicit rule rather than by having “no other interpretation in the grammar”.

An unquoted name with a trailing double colon, such as main:: or Dog::, is always treated as the package name. Perl turns the would-be bareword Camel:: into the string “Camel” at compile time, so this usage is not subject to rebuke by use strict.

Interpolating Array Values

Array variables are interpolated into double-quoted strings by joining all elements of the array with the separator specified in the $" variable* (which contains a space by default). The following are equivalent:

    $temp = join( $", @ARGV );
    print $temp;

    print "@ARGV";

* $LIST_SEPARATOR if you use the English module bundled with Perl.

Within search patterns, which also undergo double-quotish interpolation, there is an unfortunate ambiguity: is /$foo[bar]/ to be interpreted as /${foo}[bar]/ (where [bar] is a character class for the regular expression) or as /${foo[bar]}/ (where [bar] is the subscript to array @foo)? If @foo doesn’t otherwise exist, it’s obviously a character class. If @foo exists, Perl takes a good guess about [bar], and is almost always right.† If it does guess wrong, or if you’re just plain paranoid, you can force the correct interpretation with braces as shown earlier. Even if you’re merely prudent, it’s probably not a bad idea.

† The guesser is too boring to describe in full, but basically takes a weighted average of all the things that look like character classes (a-z, \w, initial ^) versus things that look like expressions (variables or reserved words).
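The brace trick for forcing one interpretation or the other can be sketched as follows (the variables and test strings here are invented for illustration):

```perl
# With both $foo and @foo in existence, explicit braces remove
# any ambiguity about what $foo[...] means inside a pattern.
my @foo = ("abc", "xyz");
my $foo = "x";

# Subscript form: ${foo[1]} interpolates the element "xyz".
print "subscript form matched\n" if "xyz" =~ /${foo[1]}/;

# Character-class form: ${foo} interpolates "x", then [bar]
# is a character class matching one of b, a, or r.
print "class form matched\n"     if "xr" =~ /${foo}[bar]/;
```

Both print statements fire, because each pattern was forced to the interpretation we intended rather than left to Perl's guess.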
“Here” Documents

A line-oriented form of quoting is based on the Unix shell’s here-document syntax. It’s line-oriented in the sense that the delimiters are lines rather than characters. The starting delimiter is the current line, and the terminating delimiter is a line consisting of the string you specify. Following a <<, you specify the string to terminate the quoted material, and all lines following the current line down to but not including the terminating line are part of the string. The terminating string may be either an identifier (a word) or some quoted text. If quoted, the type of quote determines the treatment of the text, just as it does in regular quoting. An unquoted identifier works as though it were in double quotes. A backslashed identifier works as though it were in single quotes (for compatibility with shell syntax). There must be no space between the << and an unquoted identifier, although whitespace is permitted if you specify a quoted string instead of the bare identifier. (If you insert a space, it will be treated as a null identifier, which is valid but deprecated, and matches the first blank line—see the first Hurrah! example below.) The terminating string must appear by itself, unquoted and with no extra whitespace on either side, on the terminating line.

    print <<EOF;        # same as earlier example
    The price is $Price.
    EOF

    print <<"EOF";      # same as above, with explicit quotes
    The price is $Price.
    EOF

    print <<'EOF';      # single-quoted quote
    All things (e.g. a camel's journey through
    A needle's eye) are possible, it's true.
    But picture how the camel feels, squeezed out
    In one long bloody thread, from tail to snout.
                                   -- C.S. Lewis
    EOF

    print << x 10;      # print next line 10 times
    The camels are coming! Hurrah! Hurrah!

    print <<"" x 10;    # the preferred way to write that
    The camels are coming! Hurrah! Hurrah!

    print <<`EOC`;      # execute commands
    echo hi there
    echo lo there
    EOC

    print <<"dromedary", <<"camelid";   # you can stack them
    I said bactrian.
    dromedary
    She said llama.
    camelid

    funkshun(<<"THIS", 23, <<'THAT');   # doesn't matter if they're in parens
    Here's a line or two.
    THIS
    And here's another.
    THAT

Just don’t forget that you have to put a semicolon on the end to finish the statement, because Perl doesn’t know you’re not going to try to do this:

    print <<'odd'
    2345
    odd
    + 10000;        # prints 12345

If you want your here docs to be indented with the rest of the code, you’ll need to remove leading whitespace from each line manually:

    ($quote = <<'QUOTE') =~ s/^\s+//gm;
        The Road goes ever on and on,
        down from the door where it began.
    QUOTE

You could even populate an array with the lines of a here document as follows:

    @sauces = <<End_Lines =~ m/(\S.*\S)/g;
        normal tomato
        spicy tomato
        green chile
        pesto
        white wine
    End_Lines

V-String Literals

A literal that begins with a v and is followed by one or more dot-separated integers is treated as a string literal composed of characters with the specified ordinal values:

    $crlf = v13.10;     # ASCII carriage return, line feed

These are called v-strings, short for “vector strings” or “version strings” or anything else you can think of that starts with “v” and deals with lists of integers. They provide an alternate and more legible way to construct strings when you want to specify the numeric values of each character. Thus, v1.20.300.4000 is a more winsome way to produce the same string value as any of:

    "\x{1}\x{14}\x{12c}\x{fa0}"
    pack("U*", 1, 20, 300, 4000)
    chr(1) . chr(20) . chr(300) . chr(4000)

If such a literal has two or more dots (three or more integers), the leading v may be omitted.

    print v9786;            # prints UTF-8 encoded SMILEY, "\x{263a}"
    print v102.111.111;     # prints "foo"
    print 102.111.111;      # same thing

    use 5.6.0;              # require a particular Perl version (or later)

    $ipaddr = 204.148.40.9; # the IPv4 address of oreilly.com

V-strings are useful for representing IP addresses and version numbers.
In particular, since characters can have an ordinal value larger than 255 these days, v-strings provide a way to represent version numbers of any size that can be correctly compared with a simple string comparison.

Version numbers and IP addresses stored in v-strings are not human readable, since the individual integers are stored as arbitrary characters. To produce something legible, use the v flag in a printf mask, like "%vd", as described under sprintf in Chapter 29. For more on Unicode strings, see Chapter 15 and the use bytes pragma in Chapter 31; for comparing version strings using string comparison operators, see $^V in Chapter 28; and for representing IPv4 addresses, see gethostbyaddr in Chapter 29.

Other Literal Tokens

You should consider any identifier that both begins and ends with a double underscore to be reserved for special syntactic use by Perl. Two such special literals are __LINE__ and __FILE__, which represent the current line number and filename at that point in your program. They may only be used as separate tokens; they will not be interpolated into strings. Likewise, __PACKAGE__ is the name of the package the current code is being compiled into. If there is no current package (due to an empty package; directive), __PACKAGE__ is the undefined value.

The token __END__ (or alternatively, a Control-D or Control-Z character) may be used to indicate the logical end of the script before the real end-of-file. Any following text is ignored, but may be read via the DATA filehandle. The __DATA__ token functions similarly to the __END__ token, but opens the DATA filehandle within the current package’s namespace, so that files you require can each have their own DATA filehandles open simultaneously. For more information, see DATA in Chapter 28.

Context

Until now we’ve seen several terms that can produce scalar values. Before we can discuss terms further, though, we must come to terms with the notion of context.
Scalar and List Context

Every operation* that you invoke in a Perl script is evaluated in a specific context, and how that operation behaves may depend on the requirements of that context. There are two major contexts: scalar and list. For example, assignment to a scalar variable, or to a scalar element of an array or hash, evaluates the righthand side in a scalar context:

    $x = funkshun();            # scalar context
    $x[1] = funkshun();         # scalar context
    $x{"ray"} = funkshun();     # scalar context

* Here we use the term “operation” loosely to mean either an operator or a term. The two concepts fuzz into each other when you start talking about functions that parse like terms but look like unary operators.

But assignment to an array or a hash, or to a slice of either, evaluates the righthand side in a list context, even if the slice picks out only one element:

    @x        = funkshun();     # list context
    @x[1]     = funkshun();     # list context
    @x{"ray"} = funkshun();     # list context
    %x        = funkshun();     # list context

Assignment to a list of scalars also provides a list context to the righthand side, even if there’s only one element in the list:

    ($x,$y,$z) = funkshun();    # list context
    ($x)       = funkshun();    # list context

These rules do not change at all when you declare a variable by modifying the term with my or our, so we have:

    my $x   = funkshun();       # scalar context
    my @x   = funkshun();       # list context
    my %x   = funkshun();       # list context
    my ($x) = funkshun();       # list context

You will be miserable until you learn the difference between scalar and list context, because certain operators (such as our mythical funkshun() function above) know which context they are in, and return a list in contexts wanting a list but a scalar value in contexts wanting a scalar. (If this is true of an operation, it will be mentioned in the documentation for that operation.) In computer lingo, the operations are overloaded on their return type. But it’s a very simple kind of
overloading, based only on the distinction between singular and plural values, and nothing else.

If some operators respond to context, then obviously something around them has to supply the context. We’ve shown that assignment can supply a context to its right operand, but that’s not terribly surprising, since all operators supply some kind of context to each of their operands. What you really want to know is which operators supply which context to their operands. As it happens, you can easily tell which ones supply a list context because they all have LIST in their syntactic descriptions. Everything else supplies a scalar context. Generally, it’s quite intuitive.* If necessary, you can force a scalar context onto an argument in the middle of a LIST by using the scalar pseudofunction. Perl provides no way to force a list context in a scalar context, because anywhere you would want a list context it’s already provided by the LIST of some controlling function.

Scalar context can be further classified into string context, numeric context, and don’t-care context. Unlike the scalar versus list distinction we just made, operations never know or care which scalar context they’re in. They simply return whatever kind of scalar value they want to and let Perl translate numbers to strings in string context, and strings to numbers in numeric context. Some scalar contexts don’t care whether a string or a number or a reference is returned, so no conversion will happen. This happens, for example, when you are assigning the value to another variable. The new variable just takes on the same subtype as the old value.

Boolean Context

Another special don’t-care scalar context is called Boolean context. Boolean context is simply any place where an expression is being evaluated to see whether it’s true or false.
When we say "true" and "false" in this book, we mean the technical definition that Perl uses: a scalar value is true if it is not the null string "" or the number 0 (or its string equivalent, "0"). A reference is always true because it represents an address which is never 0. An undefined value (often called undef) is always false because it looks like either "" or 0, depending on whether you treat it as a string or a number. (List values have no Boolean value because list values are never produced in a scalar context!)

Because Boolean context is a don't-care context, it never causes any scalar conversions to happen, though of course the scalar context itself is imposed on any operand that cares. And for many operands that care, the scalar they produce in scalar context represents a reasonable Boolean value. That is, many operators that would produce a list in list context can be used for a true/false test in Boolean context. For instance, in list context such as that provided by the unlink operator, an array name produces the list of its values:

    unlink @files;   # Delete all files, ignoring errors.

But if you use the array in a conditional (that is, in a Boolean context), the array knows it's in a scalar context and returns the number of elements in the array, which conveniently is true as long as there are any elements left.

* Note, however, that the list context of a LIST can propagate down through subroutine calls, so it's not always obvious from inspection whether a given statement is going to be evaluated in a scalar or list context. The program can find out its context within a subroutine by using the wantarray function.
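Here's a quick sketch exercising those truth rules (the values are our own picks; note the surprises):

```perl
# False values: the null string, the number 0, and the string "0".
# Everything else is true, including some strings that *look* like zero.
foreach my $value ("", "0", 0) {
    print "false\n" unless $value;
}
foreach my $value ("00", "0.0", " ", \"some ref") {
    print "true\n" if $value;   # "00" and "0.0" are nonempty strings
}                               # other than "0", so they are true
```

The string "0.0" being true surprises nearly everyone at least once.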
So supposing you wanted to get warnings on each file that wasn't deleted properly, you might write a loop like this:

    while (@files) {
        my $file = shift @files;
        unlink $file or warn "Can't delete $file: $!\n";
    }

Here @files is evaluated in the Boolean context supplied by the while statement, so Perl evaluates the array itself to see whether it's a "true array" or a "false array". It's a true array as long as there are filenames in it, but it becomes a false array as soon as the last filename is shifted out. Note that what we earlier said still holds. Despite the fact that an array contains (and can produce) a list value, we are not evaluating a list value in scalar context. We are telling the array it's a scalar and asking what it thinks of itself.

Do not be tempted to use defined @files for this. It doesn't work because the defined function is asking whether a scalar is equal to undef, but an array is not a scalar. The simple Boolean test suffices.

Void Context

Another peculiar kind of scalar context is the void context. This context not only doesn't care what the return value's type is, it doesn't even want a return value. From the standpoint of how functions work, it's no different from an ordinary scalar context. But if you have warnings enabled, the Perl compiler will warn you if you use an expression with no side effects in a place that doesn't want a value, such as in a statement that doesn't return a value. For example, if you use a string as a statement:

    "Camel Lot";

you may get a warning like this:

    Useless use of a constant in void context in myprog line 123;

Interpolative Context

We mentioned earlier that double-quoted literal strings do backslash interpretation and variable interpolation, but that the interpolative context (often called "double-quote context" because nobody can pronounce "interpolative") applies to more than just double-quoted strings.
Some other double-quotish constructs are the generalized backtick operator qx//, the pattern match operator m//, the substitution operator s///, and the quote regex operator, qr//. The substitution operator does interpolation on its left side before doing a pattern match, and then does interpolation on its right side each time the left side matches. The interpolative context only happens inside quotes, or things that work like quotes, so perhaps it’s not fair to call it a context in the same sense as scalar and list contexts. (Then again, maybe it is.) List Values and Arrays Now that we’ve talked about context, we can talk about list literals and how they behave in context. You’ve already seen some list literals. List literals are denoted by separating individual values by commas (and enclosing the list in parentheses where precedence requires it). Because it (almost) never hurts to use extra parentheses, the syntax diagram of a list value is usually indicated like this: (LIST) Earlier we said that LIST in a syntax description indicates something that supplies list context to its arguments, but a bare list literal itself is the one partial exception to that rule, in that it supplies a list context to its arguments only when the list as a whole is in list context. The value of a list literal in list context is just the values of the arguments in the order specified. As a fancy sort of term in an expression, a list literal merely pushes a series of temporary values onto Perl’s stack, to be collected off the stack later by whatever operator wants the list. In a scalar context, however, the list literal doesn’t really behave like a LIST, in that it doesn’t supply list context to its values. Instead, it merely evaluates each of its arguments in scalar context, and returns the value of the final element. 
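A tiny sketch of that final-element rule (values invented for the illustration):

```perl
# A list literal assigned to a scalar: each element is evaluated
# in scalar context, but only the final element's value survives.
my $x = ("ignored", "also ignored", "kept");
print $x, "\n";   # prints "kept"
```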
That's because it's really just the C comma operator in disguise, which is a binary operator that always throws away the value on the left and returns the value on the right. In terms of what we discussed earlier, the left side of the comma operator really provides a void context. Because the comma operator is left associative, if you have a series of comma-separated values, you always end up with the last value because the final comma throws away whatever any previous commas produced. So, to contrast the two, the list assignment:

    @stuff = ("one", "two", "three");

assigns the entire list value to array @stuff, but the scalar assignment:

    $stuff = ("one", "two", "three");

assigns only the value "three" to variable $stuff. Like the @files array we mentioned earlier, the comma operator knows whether it is in a scalar or list context, and chooses its behavior accordingly.

It bears repeating that a list value is different from an array. A real array variable also knows its context, and in a list context, it would return its internal list of values just like a list literal. But in a scalar context it returns only the length of the array. The following assigns to $stuff the value 3:

    @stuff = ("one", "two", "three");
    $stuff = @stuff;

If you expected it to get the value "three", you were probably making a false generalization by assuming that Perl uses the comma operator rule to throw away all but one of the temporary values that @stuff put on the stack. But that's not how it works. The @stuff array never put all its values on the stack. It never put any of its values on the stack, in fact. It only put one value, the length of the array, because it knew it was in scalar context. No term or operator in scalar context will ever put a list on the stack. Instead, it will put one scalar on the stack, whatever it feels like, which is unlikely to be the last value of the list it would have returned in list context, because the last value is not likely to be the most useful value in scalar context. Got that? (If not, you'd better reread this paragraph, because it's important.)

Now back to true LISTs, the ones that do list context. Until now we've pretended that list literals were just lists of literals. But just as a string literal might interpolate other substrings, a list literal can interpolate other sublists. Any expression that returns values may be used within a list. The values so used may be either scalar values or list values, but they all become part of the new list value because LISTs do automatic interpolation of sublists. That is, when a LIST is evaluated, each element of the list is evaluated in a list context, and the resulting list value is interpolated into LIST just as if each individual element were a member of LIST. Thus arrays lose their identity in a LIST.* The list:

    (@stuff,@nonsense,funkshun())

contains the elements of @stuff, followed by the elements of @nonsense, followed by whatever values the subroutine &funkshun decides to return when called in list context. Note that any or all of these might have interpolated a null (empty) list, in which case it's as if no array or function call had been interpolated at that point.

The null list itself is represented by the literal (). As with a null array, which interpolates as a null list and is therefore effectively ignored, interpolating the null list into another list has no effect. Thus, ((),(),()) is equivalent to (). A corollary to this rule is that you may place an optional comma at the end of any list value.

* Some people seem to think this is a problem, but it's not. You can always interpolate a reference to an array if you do not want it to lose its identity. See Chapter 8.
This makes it easy to come back later and add more elements after the last one:

    @releases = (
        "alpha",
        "beta",
        "gamma",
    );

Or you can do away with the commas entirely: another way to specify a literal list is with the qw (quote words) syntax we mentioned earlier. This construct is equivalent to splitting a single-quoted string on whitespace. For example:

    @froots = qw(
        apple       banana      carambola
        coconut     guava       kumquat
        mandarin    nectarine   peach
        pear        persimmon   plum
    );

(Note that those parentheses are behaving as quote characters, not ordinary parentheses. We could just as easily have picked angle brackets or braces or slashes. But parens are pretty.)

A list value may also be subscripted like a normal array. You must put the list in parentheses (real ones) to avoid ambiguity. Though it's often used to fetch a single value out of a list, it's really a slice of the list, so the syntax is:

    (LIST)[LIST]

Examples:

    # Stat returns list value.
    $modification_time = (stat($file))[9];

    # SYNTAX ERROR HERE.
    $modification_time = stat($file)[9];   # OOPS, FORGOT PARENS

    # Find a hex digit.
    $hexdigit = ('a','b','c','d','e','f')[$digit-10];

    # A "reverse comma operator".
    return (pop(@foo),pop(@foo))[0];

    # Get multiple values as a slice.
    ($day, $month, $year) = (localtime)[3,4,5];

List Assignment

A list may be assigned to only if each element of the list is itself legal to assign to:

    ($a, $b, $c) = (1, 2, 3);
    ($map{red}, $map{green}, $map{blue}) = (0xff0000, 0x00ff00, 0x0000ff);

You may assign to undef in a list. This is useful for throwing away some of the return values of a function:

    ($dev, $ino, undef, undef, $uid, $gid) = stat($file);

The final list element may be an array or a hash:

    ($a, $b, @rest) = split;
    my ($a, $b, %rest) = @arg_list;

You can actually put an array or hash anywhere in the list you assign to, but the first array or hash in the list will soak up all the remaining values, and anything after it will be set to the undefined value. This may be useful in a local or my, where you probably want the arrays initialized to be empty anyway.

You can even assign to the empty list:

    () = funkshun();

That ends up calling your function in list context, but discarding the return values. If you had just called the function without an assignment, it would have instead been called in void context, which is a kind of scalar context, and might have caused the function to behave completely differently.

List assignment in scalar context returns the number of elements produced by the expression on the right side of the assignment:

    $x = ( ($a, $b) = (7,7,7) );   # set $x to 3, not 2
    $x = ( ($a, $b) = funk() );    # set $x to funk()'s return count
    $x = ( () = funk() );          # also set $x to funk()'s return count

This is handy when you want to do a list assignment in a Boolean context, because most list functions return a null list when finished, which when assigned produces a 0, which is interpreted as false. Here's how you might use it in a while statement:

    while (($login, $password) = getpwent) {
        if (crypt($login, $password) eq $password) {
            print "$login has an insecure password!\n";
        }
    }

Array Length

You may find the number of elements in the array @days by evaluating @days in a scalar context, such as:

    @days + 0;       # implicitly force @days into a scalar context
    scalar(@days)    # explicitly force @days into a scalar context

Note that this only works for arrays. It does not work for list values in general. As we mentioned earlier, a comma-separated list evaluated in scalar context returns the last value, like the C comma operator. But because you almost never actually need to know the length of a list in Perl, this is not a problem.

Closely related to the scalar evaluation of @days is $#days. This will return the subscript of the last element of the array, or one less than the length, since there is (ordinarily) a 0th element. Assigning to $#days changes the length of the array.
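A short sketch of these forms side by side (the contents of @days are our own):

```perl
my @days = qw(Mon Tue Wed Thu Fri);

my $count = @days;     # 5: an array in scalar context yields its length
my $last  = $#days;    # 4: the subscript of the last element

$#days = 1;            # assigning to $#days shortens the array...
print scalar(@days), "\n";   # ...so it now holds just ("Mon", "Tue")
```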
Shortening an array by this method destroys intervening values. You can gain some measure of efficiency by pre-extending an array that is going to get big. (You can also extend an array by assigning to an element beyond the end of the array.) You can truncate an array down to nothing by assigning the null list () to it. The following two statements are equivalent:

    @whatever = ();
    $#whatever = -1;

And the following is always true:

    scalar(@whatever) == $#whatever + 1;

Truncating an array does not recover its memory. You have to undef(@whatever) to free its memory back to your process's memory pool. You probably can't free it all the way back to your system's memory pool, because few operating systems support this.

Hashes

As we said earlier, a hash is just a funny kind of array in which you look values up using key strings instead of numbers. A hash defines associations between keys and values, so hashes are often called associative arrays by people who are not lazy typists.

There really isn't any such thing as a hash literal in Perl, but if you assign an ordinary list to a hash, each pair of values in the list will be taken to indicate one key/value association:

    %map = ('red',0xff0000,'green',0x00ff00,'blue',0x0000ff);

This has the same effect as:

    %map = ();              # clear the hash first
    $map{red} = 0xff0000;
    $map{green} = 0x00ff00;
    $map{blue} = 0x0000ff;

It is often more readable to use the => operator between key/value pairs.
The => operator is just a synonym for a comma, but it's more visually distinctive and also quotes any bare identifiers to the left of it (just like the identifiers in braces above), which makes it convenient for several sorts of operation, including initializing hash variables:

    %map = (
        red   => 0xff0000,
        green => 0x00ff00,
        blue  => 0x0000ff,
    );

or initializing anonymous hash references to be used as records:

    $rec = {
        NAME  => 'John Smith',
        RANK  => 'Captain',
        SERNO => '951413',
    };

or using named parameters to invoke complicated functions:

    $field = radio_group(
        NAME      => 'animals',
        VALUES    => ['camel', 'llama', 'ram', 'wolf'],
        DEFAULT   => 'camel',
        LINEBREAK => 'true',
        LABELS    => \%animal_names,
    );

But we're getting ahead of ourselves again. Back to hashes.

You can use a hash variable (%hash) in a list context, in which case it interpolates all its key/value pairs into the list. But just because the hash was initialized in a particular order doesn't mean that the values come back out in that order. Hashes are implemented internally using hash tables for speedy lookup, which means that the order in which entries are stored is dependent on the internal hash function used to calculate positions in the hash table, and not on anything interesting. So the entries come back in a seemingly random order. (The two elements of each key/value pair come out in the right order, of course.) For examples of how to arrange for an output ordering, see the keys function in Chapter 29.

When you evaluate a hash variable in a scalar context, it returns a true value only if the hash contains any key/value pairs whatsoever. If there are any key/value pairs at all, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much only useful to find out whether Perl's (compiled in) hashing algorithm is performing poorly on your data set.
For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/8", which means only one out of eight buckets has been touched. Presumably that one bucket contains all 10,000 of your items. This isn't supposed to happen. To find the number of keys in a hash, use the keys function in a scalar context: scalar(keys(%HASH)).

You can emulate a multidimensional hash by specifying more than one key within the braces, separated by commas. The listed keys are concatenated together, separated by the contents of $; ($SUBSCRIPT_SEPARATOR), which has a default value of chr(28). The resulting string is used as the actual key to the hash. These two lines do the same thing:

    $people{ $state, $county } = $census_results;
    $people{ join $; => $state, $county } = $census_results;

This feature was originally implemented to support a2p, the awk-to-Perl translator. These days, you'd usually just use a real (well, realer) multidimensional array as described in Chapter 9, Data Structures. One place the old style is still useful is for hashes tied to DBM files (see DB_File in Chapter 32, Standard Modules), which don't support multidimensional keys. Don't confuse multidimensional hash emulations with slices. The one represents a scalar value, and the other represents a list value:

    $hash{ $x, $y, $z }   # a single value
    @hash{ $x, $y, $z }   # a slice of three values

Typeglobs and Filehandles

Perl uses a special type called a typeglob to hold an entire symbol table entry. (The symbol table entry *foo contains the values of $foo, @foo, %foo, &foo, and several interpretations of plain old foo.) The type prefix of a typeglob is a * because it represents all types. One use of typeglobs (or references thereto) is for passing or storing filehandles. If you want to save away a filehandle, do it this way:

    $fh = *STDOUT;

or perhaps as a real reference, like this:

    $fh = \*STDOUT;

This is also the way to create a local filehandle.
For example:

    sub newopen {
        my $path = shift;
        local *FH;    # not my() nor our()
        open(FH, $path) or return undef;
        return *FH;   # not \*FH!
    }
    $fh = newopen('/etc/passwd');

See the open function for other ways to generate new filehandles.

The main use of typeglobs nowadays is to alias one symbol table entry to another symbol table entry. Think of an alias as a nickname. If you say:

    *foo = *bar;

it makes everything named "foo" a synonym for every corresponding thing named "bar". You can alias just one variable from a typeglob by assigning a reference instead:

    *foo = \$bar;

makes $foo an alias for $bar, but doesn't make @foo an alias for @bar, or %foo an alias for %bar. All these affect global (package) variables only; lexicals cannot be accessed through symbol table entries. Aliasing global variables like this may seem like a silly thing to want to do, but it turns out that the entire module export/import mechanism is built around this feature, since there's nothing that says the symbol you're aliasing has to be in your namespace. This:

    local *Here::blue = \$There::green;

temporarily makes $Here::blue an alias for $There::green, but doesn't make @Here::blue an alias for @There::green, or %Here::blue an alias for %There::green. Fortunately, all these complicated typeglob manipulations are hidden away where you don't have to look at them. See the sections "Handle References" and "Symbol Table References" in Chapter 8, the section "Symbol Tables" in Chapter 10, and Chapter 11, Modules, for more discussion on typeglobs and importation.

Input Operators

There are several input operators we'll discuss here because they parse as terms. Sometimes we call them pseudoliterals because they act like quoted strings in many ways. (Output operators like print parse as list operators and are discussed in Chapter 29.)
Command Input (Backtick) Operator

First of all, we have the command input operator, also known as the backtick operator, because it looks like this:

    $info = `finger $user`;

A string enclosed by backticks (grave accents, technically) first undergoes variable interpolation just like a double-quoted string. The result is then interpreted as a command line by the system, and the output of that command becomes the value of the pseudoliteral. (This is modeled after a similar operator in Unix shells.) In scalar context, a single string consisting of all the output is returned. In list context, a list of values is returned, one for each line of output. (You can set $/ to use a different line terminator.)

The command is executed each time the pseudoliteral is evaluated. The numeric status value of the command is saved in $? (see Chapter 28 for the interpretation of $?, also known as $CHILD_ERROR). Unlike the csh version of this command, no translation is done on the return data: newlines remain newlines. Unlike in any of the shells, single quotes in Perl do not hide variable names in the command from interpretation. To pass a $ through to the shell you need to hide it with a backslash. The $user in our finger example above is interpolated by Perl, not by the shell. (Because the command undergoes shell processing, see Chapter 23, Security, for security concerns.)

The generalized form of backticks is qx// (for "quoted execution"), but the operator works exactly the same way as ordinary backticks. You just get to pick your quote characters. As with similar quoting pseudofunctions, if you happen to choose a single quote as your delimiter, the command string doesn't undergo double-quote interpolation:

    $perl_info  = qx(ps $$);     # that's Perl's $$
    $shell_info = qx'ps $$';     # that's the shell's $$

Line Input (Angle) Operator

The most heavily used input operator is the line input operator, also known as the angle operator or the readline function (since that's what it calls internally). Evaluating a filehandle in angle brackets (<STDIN>, for example) yields the next line from the associated filehandle. (The newline is included, so according to Perl's criteria for truth, a freshly input line is always true, up until end-of-file, at which point an undefined value is returned, which is conveniently false.) Ordinarily, you would assign the input value to a variable, but there is one situation where an automatic assignment happens. If and only if the line input operator is the only thing inside the conditional of a while loop, the value is automatically assigned to the special variable $_. The assigned value is then tested to see whether it is defined. (This may seem like an odd thing to you, but you'll use the construct frequently, so it's worth learning.) Anyway, the following lines are equivalent:

    while (defined($_ = <STDIN>)) { print $_; }   # the longest way
    while ($_ = <STDIN>) { print; }               # explicitly to $_
    while (<STDIN>) { print; }                    # the short way
    for (;<STDIN>;) { print; }                    # while loop in disguise
    print $_ while defined($_ = <STDIN>);         # long statement modifier
    print while $_ = <STDIN>;                     # explicitly to $_
    print while <STDIN>;                          # short statement modifier

Remember that this special magic requires a while loop. If you use the input operator anywhere else, you must assign the result explicitly if you want to keep the value:

    while (<FH1> && <FH2>) { ... }          # WRONG: discards both inputs
    if (<STDIN>) { print; }                 # WRONG: prints old value of $_
    if ($_ = <STDIN>) { print; }            # suboptimal: doesn't test defined
    if (defined($_ = <STDIN>)) { print; }   # best

When you're implicitly assigning to $_ in a while loop, this is the global variable by that name, not one localized to the while loop. You can protect an existing value of $_ this way:

    while (local $_ = <STDIN>) { print; }   # use local $_

Any previous value is restored when the loop is done. $_ is still a global variable, though, so functions called from inside that loop could still access it, intentionally or otherwise. You can avoid this, too, by declaring a lexical variable:

    while (my $line = <STDIN>) { print $line; }   # now private

(Both of these while loops still implicitly test for whether the result of the assignment is defined, because my and local don't change how assignment is seen by the parser.)

The filehandles STDIN, STDOUT, and STDERR are predefined and preopened. Additional filehandles may be created with the open or sysopen functions. See those functions' documentation in Chapter 29 for details on this.

In the while loops above, we were evaluating the line input operator in a scalar context, so the operator returns each line separately. However, if you use the operator in a list context, a list consisting of all remaining input lines is returned, one line per list element. It's easy to make a large data space this way, so use this feature with care:

    $one_line = <MYFILE>;    # Get first line.
    @all_lines = <MYFILE>;   # Get the rest of the lines.

There is no while magic associated with the list form of the input operator, because the condition of a while loop always provides a scalar context (as does any conditional).

Using the null filehandle within the angle operator is special; it emulates the command-line behavior of typical Unix filter programs such as sed and awk. When you read lines from <>, it magically gives you all the lines from all the files mentioned on the command line. If no files were mentioned, it gives you standard input instead, so your program is easy to insert into the middle of a pipeline of processes. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is null, $ARGV[0] is set to "-", which when opened gives you standard input. The @ARGV array is then processed as a list of filenames. More explicitly, the loop:

    while (<>) { ... }   # code for each line

is equivalent to the following Perl-like pseudocode:

    @ARGV = ('-') unless @ARGV;   # assume STDIN iff empty
    while (@ARGV) {
        $ARGV = shift @ARGV;      # shorten @ARGV each time
        if (!open(ARGV, $ARGV)) {
            warn "Can't open $ARGV: $!\n";
            next;
        }
        while (<ARGV>) {
            ...                   # code for each line
        }
    }

except that it isn't so cumbersome to say, and will actually work. It really does shift array @ARGV and put the current filename into the global variable $ARGV. It also uses the special filehandle ARGV internally: <> is just a synonym for the more explicitly written <ARGV>, which is a magical filehandle. (The pseudocode above doesn't work because it treats <ARGV> as nonmagical.)

You can modify @ARGV before the first <> as long as the array ends up containing the list of filenames you really want. Because Perl uses its normal open function here, a filename of "-" counts as standard input wherever it is encountered, and the more esoteric features of open are automatically available to you (such as opening a "file" named "gzip -dc < file.gz|"). Line numbers ($.) continue as if the input were one big happy file. (But see the example under eof in Chapter 29 for how to reset line numbers on each file.)
If you want to set @ARGV to your own list of files, go right ahead:

    # default to README file if no args given
    @ARGV = ("README") unless @ARGV;

If you want to pass switches into your script, you can use one of the Getopt::* modules or put a loop on the front like this:

    while (@ARGV and $ARGV[0] =~ /^-/) {
        $_ = shift;
        last if /^--$/;
        if (/^-D(.*)/) { $debug = $1 }
        if (/^-v/)     { $verbose++ }
        ...   # other switches
    }
    while (<>) {
        ...   # code for each line
    }

The <> symbol will return false only once. If you call it again after this, it will assume you are processing another @ARGV list, and if you haven't set @ARGV, it will input from STDIN.

If the string inside the angle brackets is a scalar variable (for example, <$foo>), that variable contains an indirect filehandle, either the name of the filehandle to input from or a reference to such a filehandle. For example:

    $fh = \*STDIN;
    $line = <$fh>;

or:

    open($fh, "<data.txt");
    $line = <$fh>;

Filename Globbing Operator

You might wonder what happens to a line input operator if you put something fancier inside the angle brackets. What happens is that it mutates into a different operator. If the string inside the angle brackets is anything other than a filehandle name or a scalar variable (even if there are just extra spaces), it is interpreted as a filename pattern to be "globbed".* The pattern is matched against the files in the current directory (or the directory specified as part of the fileglob pattern), and the filenames so matched are returned by the operator. As with line input, names are returned one at a time in scalar context, or all at once in list context. The latter usage is more common; you often see things like:

    @files = <*.xml>;

As with other kinds of pseudoliterals, one level of variable interpolation is done first, but you can't say <$foo> because that's an indirect filehandle as explained earlier. In older versions of Perl, programmers would insert braces to force interpretation as a fileglob: <${foo}>. These days, it's considered cleaner to call the internal function directly as glob($foo), which is probably the right way to have invented it in the first place. So instead you'd write

    @files = glob("*.xml");

if you despise overloading the angle operator for this. Which you're allowed to do.

Whether you use the glob function or the old angle-bracket form, the fileglob operator also does while magic like the line input operator, assigning the result to $_. (That was the rationale for overloading the angle operator in the first place.) For example, if you wanted to change the permissions on all your C code files, you might say:

    while (glob "*.c") {
        chmod 0644, $_;
    }

which is equivalent to:

    while (<*.c>) {
        chmod 0644, $_;
    }

The glob function was originally implemented as a shell command in older versions of Perl (and in even older versions of Unix), which meant it was comparatively expensive to execute and, worse still, wouldn't work exactly the same everywhere. Nowadays it's a built-in, so it's more reliable and a lot faster. See the description of the File::Glob module in Chapter 32 for how to alter the default behavior of this operator, such as whether to treat spaces in its operand (argument) as pathname separators, whether to expand tildes or braces, whether to be case insensitive, and whether to sort the return values, amongst other things.

* Fileglobs have nothing to do with the previously mentioned typeglobs, other than that they both use the * character in a wildcard fashion. The * character has the nickname "glob" when used like this. With typeglobs, you're globbing symbols with the same name from the symbol table. With a fileglob, you're doing wildcard matching on the filenames in a directory, just as the various shells do.
Of course, the shortest and arguably the most readable way to do the chmod command above is to use the fileglob as a list operator:

    chmod 0644, <*.c>;

A fileglob evaluates its (embedded) operand only when starting a new list. All values must be read before the operator will start over. In a list context, this isn't important because you automatically get them all anyway. In a scalar context, however, the operator returns the next value each time it is called, or a false value if you've just run out. Again, false is returned only once. So if you're expecting a single value from a fileglob, it is much better to say:

    ($file) = <blurch*>;   # list context

than to say:

    $file = <blurch*>;     # scalar context

because the former returns all matched filenames and resets the operator, whereas the latter alternates between returning filenames and returning false. If you're trying to do variable interpolation, it's definitely better to use the glob operator because the older notation can cause confusion with the indirect filehandle notation. This is where it becomes apparent that the borderline between terms and operators is a bit mushy:

    @files = <$dir/*.[ch]>;         # Works, but avoid.
    @files = glob("$dir/*.[ch]");   # Call glob as function.
    @files = glob $some_pattern;    # Call glob as operator.

We left the parentheses off of the last example to illustrate that glob can be used either as a function (a term) or as a unary operator; that is, a prefix operator that takes a single argument. The glob operator is an example of a named unary operator, which is just one kind of operator we'll talk about in the next chapter. Later, we'll talk about pattern-matching operators, which also parse like terms but behave like operators.

3 Unary and Binary Operators

In the last chapter, we talked about the various kinds of terms you might use in an expression, but to be honest, isolated terms are a bit boring. Many terms are party animals. They like to have relationships with each other.
The typical young term feels strong urges to identify with and influence other terms in various ways, but there are many different kinds of social interaction and many different levels of commitment. In Perl, these relationships are expressed using operators. Sociology has to be good for something.

From a mathematical perspective, operators are just ordinary functions with special syntax. From a linguistic perspective, operators are just irregular verbs. But as any linguist will tell you, the irregular verbs in a language tend to be the ones you use most often. And that's important from an information theory perspective, because the irregular verbs tend to be shorter and more efficient in both production and recognition. In practical terms, operators are handy.

Operators come in various flavors, depending on their arity (how many operands they take), their precedence (how hard they try to take those operands away from surrounding operators), and their associativity (whether they prefer to do things right to left or left to right when associated with operators of the same precedence).

Perl operators come in three arities: unary, binary, and trinary (or ternary, if your native tongue is Shibboleth). Unary operators are always prefix operators (except for the postincrement and postdecrement operators).* The others are all infix operators — unless you count the list operators, which can prefix any number of arguments. But most people just think of list operators as normal functions that you can forget to put parentheses around. Here are some examples:

    ! $x                # a unary operator
    $x * $y             # a binary operator
    $x ? $y : $z        # a trinary operator
    print $x, $y, $z    # a list operator

An operator's precedence controls how tightly it binds. Operators with higher precedence grab the arguments around them before operators with lower precedence.
The archetypal example is straight out of elementary math, where multiplication takes precedence over addition:

    2 + 3 * 4       # yields 14, not 20

The order in which two operators of the same precedence are executed depends on their associativity. These rules also follow math conventions to some extent:

    2 * 3 * 4       # means (2 * 3) * 4, left associative
    2 ** 3 ** 4     # means 2 ** (3 ** 4), right associative
    2 != 3 != 4     # illegal, nonassociative

Table 3-1 lists the associativity and arity of the Perl operators from highest precedence to lowest.

Table 3-1. Operator Precedence

    Associativity   Arity   Precedence Class
    None            0       Terms, and list operators (leftward)
    Left            2       ->
    None            1       ++ --
    Right           2       **
    Right           1       ! ~ \ and unary + and -
    Left            2       =~ !~
    Left            2       * / % x
    Left            2       + - .
    Left            2       << >>
    Right           0,1     Named unary operators
    None            2       < > <= >= lt gt le ge
    None            2       == != <=> eq ne cmp
    Left            2       &
    Left            2       | ^
    Left            2       &&
    Left            2       ||
    None            2       .. ...
    Right           3       ?:
    Right           2       = += -= *= and so on
    Left            2       , =>
    Right           0+      List operators (rightward)
    Right           1       not
    Left            2       and
    Left            2       or xor

* Though you can think of various quotes and brackets as circumfix operators that delimit terms.

It may seem to you that there are too many precedence levels to remember. Well, you're right, there are. Fortunately, you've got two things going for you here. First, the precedence levels as they're defined usually follow your intuition, presuming you're not psychotic. And second, if you're merely neurotic, you can always put in extra parentheses to relieve your anxiety.

Another helpful hint is that any operators borrowed from C keep the same precedence relationship with each other, even where C's precedence is slightly screwy. (This makes learning Perl easier for C folks and C++ folks. Maybe even Java folks.)

The following sections cover these operators in precedence order.
With very few exceptions, these all operate on scalar values only, not list values. We'll mention the exceptions as they come up.

Although references are scalar values, using most of these operators on references doesn't make much sense, because the numeric value of a reference is only meaningful to the internals of Perl. Nevertheless, if a reference points to an object of a class that allows overloading, you can call these operators on such objects, and if the class has defined an overloading for that particular operator, it will define how the object is to be treated under that operator. This is how complex numbers are implemented in Perl, for instance. For more on overloading, see Chapter 13, Overloading.

Terms and List Operators (Leftward)

Any term is of highest precedence in Perl. Terms include variables, quote and quotelike operators, most expressions in parentheses, brackets, or braces, and any function whose arguments are parenthesized. Actually, there aren't really any functions in this sense, just list operators and unary operators behaving as functions because you put parentheses around their arguments. Nevertheless, the name of Chapter 29 is Functions.

Now listen carefully. Here are a couple of rules that are very important and simplify things greatly, but may occasionally produce counterintuitive results for the unwary. If any list operator (such as print) or any named unary operator (such as chdir) is followed by a left parenthesis as the next token (ignoring whitespace), the operator and its parenthesized arguments are given highest precedence, as if it were a normal function call. The rule is this: if it looks like a function call, it is a function call. You can make it look like a nonfunction by prefixing the parentheses with a unary plus, which does absolutely nothing, semantically speaking — it doesn't even coerce the argument to be numeric.
For example, since || has lower precedence than chdir, we get:

    chdir $foo    || die;       # (chdir $foo) || die
    chdir($foo)   || die;       # (chdir $foo) || die
    chdir ($foo)  || die;       # (chdir $foo) || die
    chdir +($foo) || die;       # (chdir $foo) || die

but, because * has higher precedence than chdir, we get:

    chdir $foo * 20;            # chdir ($foo * 20)
    chdir($foo) * 20;           # (chdir $foo) * 20
    chdir ($foo) * 20;          # (chdir $foo) * 20
    chdir +($foo) * 20;         # chdir ($foo * 20)

Likewise for any numeric operator that happens to be a named unary operator, such as rand:

    rand 10 * 20;               # rand (10 * 20)
    rand(10) * 20;              # (rand 10) * 20
    rand (10) * 20;             # (rand 10) * 20
    rand +(10) * 20;            # rand (10 * 20)

In the absence of parentheses, the precedence of list operators such as print, sort, or chmod is either very high or very low depending on whether you look at the left side or the right side of the operator. (That's what the "Leftward" is doing in the title of this section.) For example, in:

    @ary = (1, 3, sort 4, 2);
    print @ary;         # prints 1324

the commas on the right of the sort are evaluated before the sort, but the commas on the left are evaluated after. In other words, a list operator tends to gobble up all the arguments that follow it, and then act like a simple term with regard to the preceding expression. You still have to be careful with parentheses:

    # These evaluate exit before doing the print:
    print($foo, exit);  # Obviously not what you want.
    print $foo, exit;   # Nor this.

    # These do the print before evaluating exit:
    (print $foo), exit; # This is what you want.
    print($foo), exit;  # Or this.
    print ($foo), exit; # Or even this.

The easiest place to get burned is where you're using parentheses to group mathematical arguments, and you forget that parentheses are also used to group function arguments:

    print ($foo & 255) + 1, "\n";   # prints ($foo & 255)

That probably doesn't do what you expect at first glance.
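To get the addition applied before printing, either hand print the fully parenthesized expression as its argument list, or use the unary plus trick mentioned earlier so the parentheses no longer look like an argument list. This is a quick sketch of ours, not from the original text:

```perl
my $foo = 300;

# Wrap the whole expression so it all lands inside print's argument list:
print((($foo & 255) + 1), "\n");    # prints 45

# Or defuse the parentheses with a unary plus:
print +($foo & 255) + 1, "\n";      # prints 45
```

Both lines print 45 and a newline, because 300 & 255 is 44.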
Fortunately, mistakes of this nature generally produce warnings like "Useless use of addition in a void context" when warnings are enabled.

Also parsed as terms are the do {} and eval {} constructs, as well as subroutine and method calls, the anonymous array and hash composers [] and {}, and the anonymous subroutine composer sub {}.

The Arrow Operator

Just as in C and C++, the binary -> operator is an infix dereference operator. If the right side is a [...] array subscript, a {...} hash subscript, or a (...) subroutine argument list, the left side must be a reference (either hard or symbolic) to an array, a hash, or a subroutine, respectively. In an lvalue (assignable) context, if the left side is not a reference, it must be a location capable of holding a hard reference, in which case such a reference will be autovivified for you. For more on this (and some warnings about accidental autovivification) see Chapter 8, References.

    $aref->[42]             # an array dereference
    $href->{"corned beef"}  # a hash dereference
    $sref->(1,2,3)          # a subroutine dereference

Otherwise, it's a method call of some kind. The right side must be a method name (or a simple scalar variable containing the method name), and the left side must evaluate to either an object (a blessed reference) or a class name (that is, a package name):

    $yogi = Bear->new("Yogi");  # a class method call
    $yogi->swipe($picnic);      # an object method call

The method name may be qualified with a package name to indicate in which class to start searching for the method, or with the special package name SUPER::, to indicate that the search should start in the parent class. See Chapter 12, Objects.

Autoincrement and Autodecrement

The ++ and -- operators work as in C. That is, when placed before a variable, they increment or decrement the variable before returning the value, and when placed after, they increment or decrement the variable after returning the value.
For example, $a++ increments the value of scalar variable $a, returning the value before it performs the increment. Similarly, --$b{(/(\w+)/)[0]} decrements the element of the hash %b indexed by the first "word" in the default search variable ($_) and returns the value after the decrement.*

The autoincrement operator has a little extra built-in magic. If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment. If, however, the variable has only been used in string contexts since it was set, has a value that is not the null string, and matches the pattern /^[a-zA-Z]*[0-9]*$/, the increment is done as a string, preserving each character within its range, with carry:

    print ++($foo = '99');      # prints '100'
    print ++($foo = 'a0');      # prints 'b1'
    print ++($foo = 'Az');      # prints 'Ba'
    print ++($foo = 'zz');      # prints 'aaa'

As of this writing, magical autoincrement has not been extended to Unicode letters and digits, but it might be in the future. The autodecrement operator, however, is not magical, and we have no plans to make it so.

* Okay, so that wasn't exactly fair. We just wanted to make sure you were paying attention. Here's how that expression works. First the pattern match finds the first word in $_ using the regular expression \w+. The parentheses around that cause the word to be returned as a single-element list value, because the pattern match is in a list context. The list context is supplied by the list slice operator, (...)[0], which returns the first (and only) element of the list. That value is used as the key for the hash, and the hash entry (value) is decremented and returned. In general, when confronted with a complex expression, analyze it from the inside out to see what order things happen in.

Exponentiation

Binary ** is the exponentiation operator. Note that it binds even more tightly than unary minus, so -2**4 is -(2**4), not (-2)**4.
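A quick demonstration of that binding (our sketch, not from the original text):

```perl
my $x = -2**4;      # parsed as -(2**4)
my $y = (-2)**4;    # explicit parentheses override the default binding
print "$x $y\n";    # prints "-16 16"

# Fractional powers work too, thanks to the floating-point implementation:
my $root = 2**0.5;  # the square root of 2, about 1.41421
```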
The operator is implemented using C's pow(3) function, which works with floating-point numbers internally. It calculates using logarithms, which means that it works with fractional powers, but you sometimes get results that aren't as exact as a straight multiplication would produce.

Ideographic Unary Operators

Most unary operators just have names (see "Named Unary and File Test Operators" later in this chapter), but some operators are deemed important enough to merit their own special symbolic representation. All of these operators seem to have something to do with negation. Blame the mathematicians.

Unary ! performs logical negation, that is, "not". See not for a lower precedence version of logical negation. The value of a negated operand is true (1) if the operand is false (numeric 0, string "0", the null string, or undefined) and false ("") if the operand is true.

Unary - performs arithmetic negation if the operand is numeric. If the operand is an identifier, a string consisting of a minus sign concatenated with the identifier is returned. Otherwise, if the string starts with a plus or minus, a string starting with the opposite sign is returned. One effect of these rules is that -bareword is equivalent to "-bareword". This is most useful for Tk programmers.

Unary ~ performs bitwise negation, that is, 1's complement. By definition, this is somewhat nonportable when limited by the word size of your machine. For example, on a 32-bit machine, ~123 is 4294967172, while on a 64-bit machine, it's 18446744073709551492. But you knew that already.

What you perhaps didn't know is that if the argument to ~ happens to be a string instead of a number, a string of identical length is returned, but with all the bits of the string complemented. This is a fast way to flip a lot of bits all at once, and it's a way to flip those bits portably, since it doesn't depend on the word size of your computer.
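Here's a small sketch of that string-flipping behavior (our example, not from the original text):

```perl
my $bits    = "\x00\xFF\x0F";           # three bytes
my $flipped = ~$bits;                   # complement every bit of every byte
print unpack("H*", $flipped), "\n";     # prints "ff00f0"
print length($flipped), "\n";           # prints 3: same length as the input
```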
Later we'll also cover the bitwise logical operators, which have string-oriented variants as well.

Unary + has no semantic effect whatsoever, even on strings. It is syntactically useful for separating a function name from a parenthesized expression that would otherwise be interpreted as the complete list of function arguments. (See examples under the section "Terms and List Operators".) If you think about it sideways, + negates the effect that parentheses have of turning prefix operators into functions.

Unary \ creates a reference to whatever follows it. Used on a list, it creates a list of references. See the section "The Backslash Operator" in Chapter 8 for details. Do not confuse this behavior with the behavior of backslash within a string, although both forms do convey the vaguely negational notion of protecting the next thing from interpretation. This resemblance is not entirely accidental.

Binding Operators

Binary =~ binds a string expression to a pattern match, substitution, or transliteration (loosely called translation). These operations would otherwise search or modify the string contained in $_ (the default variable). The string you want to bind is put on the left, while the operator itself is put on the right. The return value indicates the success or failure of the operator on the right, since the binding operator doesn't really do anything on its own.

If the right argument is an expression rather than a pattern match, substitution, or transliteration, it will be interpreted as a search pattern at run time. That is to say, $_ =~ $pat is equivalent to $_ =~ /$pat/. This is less efficient than an explicit search, since the pattern must be checked and possibly recompiled every time the expression is evaluated. You can avoid this recompilation by precompiling the original pattern using the qr// (quote regex) operator.

Binary !~ is just like =~ except the return value is negated logically.
The following expressions are functionally equivalent:

    $string !~ /pattern/
    not $string =~ /pattern/

We said that the return value indicates success, but there are many kinds of success. Substitutions return the number of successful matches, as do transliterations. (In fact, the transliteration operator is often used to count characters.) Since any nonzero result is true, it all works out.

The most spectacular kind of true value is a list assignment of a pattern: in a list context, pattern matches can return substrings matched by the parentheses in the pattern. But again, according to the rules of list assignment, the list assignment itself will return true if anything matched and was assigned, and false otherwise. So you sometimes see things like:

    if ( ($k,$v) = $string =~ m/(\w+)=(\w*)/ ) {
        print "KEY $k VALUE $v\n";
    }

Let's pick that apart. The =~ has precedence over =, so =~ happens first. The =~ binds $string to the pattern match on the right, which is scanning for occurrences of things that look like KEY=VALUE in your string. It's in a list context because it's on the right side of a list assignment. If the pattern matches, it returns a list to be assigned to $k and $v. The list assignment itself is in a scalar context, so it returns 2, the number of values on the right side of the assignment. And 2 happens to be true, since our scalar context is also a Boolean context. When the match fails, no values are assigned, which returns 0, which is false.

For more on the politics of patterns, see Chapter 5, Pattern Matching.

Multiplicative Operators

Perl provides the C-like operators * (multiply), / (divide), and % (modulo). The * and / work exactly as you would expect, multiplying or dividing their two operands. Division is done in floating point, unless you've used the integer pragmatic module.

The % operator converts its operands to integers before finding the remainder according to integer division.
(However, it does this integer division in floating point if necessary, so your operands can be up to 15 digits long on most 32-bit machines.) Assume that your two operands are called $a and $b. If $b is positive, then the result of $a % $b is $a minus the largest multiple of $b that is not greater than $a (which means the result will always be in the range 0 .. $b-1). If $b is negative, then the result of $a % $b is $a minus the smallest multiple of $b that is not less than $a (which means the result will be in the range $b+1 .. 0).

When use integer is in scope, % gives you direct access to the modulus operator as implemented by your C compiler. This operator is not well defined for negative operands, but will execute faster.

Binary x is the repetition operator. Actually, it's two operators. In scalar context, it returns a concatenated string consisting of the left operand repeated the number of times specified by the right operand. (For backward compatibility, it also does this in list context if the left argument is not in parentheses.)

    print '-' x 80;                         # print row of dashes
    print "\t" x ($tab/8), ' ' x ($tab%8);  # tab over

In list context, if the left operand is a list in parentheses, the x works as a list replicator rather than a string replicator. This is useful for initializing all the elements of an array of indeterminate length to the same value:

    @ones = (1) x 80;       # a list of 80 1's
    @ones = (5) x @ones;    # set all elements to 5

Similarly, you can also use x to initialize array and hash slices:

    @keys = qw(perls before swine);
    @hash{@keys} = ("") x @keys;

If this mystifies you, note that @keys is being used both as a list on the left side of the assignment and as a scalar value (returning the array length) on the right side of the assignment.
The previous example has the same effect on %hash as:

    $hash{perls}  = "";
    $hash{before} = "";
    $hash{swine}  = "";

Additive Operators

Strangely enough, Perl also has the customary + (addition) and - (subtraction) operators. Both operators convert their arguments from strings to numeric values if necessary and return a numeric result.

Additionally, Perl provides the . operator, which does string concatenation. For example:

    $almost = "Fred" . "Flintstone";    # returns FredFlintstone

Note that Perl does not place a space between the strings being concatenated. If you want the space, or if you have more than two strings to concatenate, you can use the join operator, described in Chapter 29, Functions. Most often, though, people do their concatenation implicitly inside a double-quoted string:

    $fullname = "$firstname $lastname";

Shift Operators

The bit-shift operators (<< and >>) return the value of the left argument shifted to the left (<<) or to the right (>>) by the number of bits specified by the right argument. The arguments should be integers. For example:

    1 << 4;     # returns 16
    32 >> 4;    # returns 2

Be careful, though. Results on large (or negative) numbers may vary depending on the number of bits your machine uses to represent integers.

Named Unary and File Test Operators

Some of the "functions" described in Chapter 29 are really unary operators. Table 3-2 lists all the named unary operators.

Table 3-2. Named Unary Operators

    -X (file tests)   gethostbyname    localtime   return
    alarm             getnetbyname     lock        rmdir
    caller            getpgrp          log         scalar
    chdir             getprotobyname   lstat       sin
    chroot            glob             my          sleep
    cos               gmtime           oct         sqrt
    defined           goto             ord         srand
    delete            hex              quotemeta   stat
    do                int              rand        uc
    eval              lc               readlink    ucfirst
    exists            lcfirst          ref         umask
    exit              length           require     undef

Unary operators have a higher precedence than some of the binary operators.
For example:

    sleep 4 | 3;

does not sleep for 7 seconds; it sleeps for 4 seconds and then takes the return value of sleep (typically zero) and bitwise ORs that with 3, as if the expression were parenthesized as:

    (sleep 4) | 3;

Compare this with:

    print 4 | 3;

which does take the value of 4 ORed with 3 before printing it (7 in this case), as if it were written:

    print (4 | 3);

This is because print is a list operator, not a simple unary operator. Once you've learned which operators are list operators, you'll have no trouble telling unary operators and list operators apart. When in doubt, you can always use parentheses to turn a named unary operator into a function. Remember, if it looks like a function, it is a function.

Another funny thing about named unary operators is that many of them default to $_ if you don't supply an argument. However, if you omit the argument but the token following the named unary operator looks like it might be the start of an argument, Perl will get confused because it's expecting a term. Whenever the Perl tokener gets to one of the characters listed in Table 3-3, the tokener returns different token types depending on whether it expects a term or operator.

Table 3-3. Ambiguous Characters

    Character   Operator                Term
    +           Addition                Unary plus
    -           Subtraction             Unary minus
    *           Multiplication          *typeglob
    /           Division                /pattern/
    <           Less than, left shift   <HANDLE>, <<END
    .           Concatenation           .3333
    ?           ?:                      ?pattern?
    %           Modulo                  %assoc
    &           &, &&                   &subroutine

So a typical boo-boo is:

    next if length < 80;

in which the < looks to the parser like the beginning of the <> input symbol (a term) instead of the "less than" (an operator) you were thinking of. There's really no way to fix this and still keep Perl pathologically eclectic.
If you're so incredibly lazy that you cannot bring yourself to type the two characters $_, then use one of these instead:

    next if length() < 80;
    next if (length) < 80;
    next if 80 > length;
    next unless length >= 80;

When a term is expected, a minus sign followed by a single letter will always be interpreted as a file test operator. A file test operator is a unary operator that takes one argument, either a filename or a filehandle, and tests the associated file to see whether something is true about it. If the argument is omitted, it tests $_, except for -t, which tests STDIN. Unless otherwise documented, it returns 1 for true and "" for false, or the undefined value if the file doesn't exist or is otherwise inaccessible. Currently implemented file test operators are listed in Table 3-4.

Table 3-4. File Test Operators

    Operator    Meaning
    -r          File is readable by effective UID/GID.
    -w          File is writable by effective UID/GID.
    -x          File is executable by effective UID/GID.
    -o          File is owned by effective UID.
    -R          File is readable by real UID/GID.
    -W          File is writable by real UID/GID.
    -X          File is executable by real UID/GID.
    -O          File is owned by real UID.
    -e          File exists.
    -z          File has zero size.
    -s          File has nonzero size (returns size).
    -f          File is a plain file.
    -d          File is a directory.
    -l          File is a symbolic link.
    -p          File is a named pipe (FIFO).
    -S          File is a socket.
    -b          File is a block special file.
    -c          File is a character special file.
    -t          Filehandle is opened to a tty.
    -u          File has setuid bit set.
    -g          File has setgid bit set.
    -k          File has sticky bit set.
    -T          File is a text file.
    -B          File is a binary file (opposite of -T).
    -M          Age of file (at startup) in days since modification.
    -A          Age of file (at startup) in days since last access.
    -C          Age of file (at startup) in days since inode change.

Note that -s/a/b/ does not do a negated substitution. Saying -exp($foo) still works as expected, however — only single letters following a minus are interpreted as file tests.
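As a brief sketch of a few of these tests in action (our example; the filename demo.tmp is made up):

```perl
my $file = "demo.tmp";
open my $fh, ">", $file or die "can't create $file: $!";
binmode $fh;
print $fh "hello";                  # five bytes
close $fh;

print "exists\n"     if -e $file;
print "plain file\n" if -f $file;
print "size is ", -s $file, "\n";   # -s returns the size itself: 5

unlink $file;                       # clean up
```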
The interpretation of the file permission operators -r, -R, -w, -W, -x, and -X is based solely on the mode of the file and the user and group IDs of the user. There may be other reasons you can't actually read, write, or execute the file, such as Andrew File System (AFS) access control lists.* Also note that for the superuser, -r, -R, -w, and -W always return 1, and -x and -X return 1 if any execute bit is set in the mode. Thus, scripts run by the superuser may need to do a stat in order to determine the actual mode of the file or temporarily set the UID to something else.

The other file test operators don't care who you are. Anybody can use the test for "regular" files:

    while (<>) {
        chomp;
        next unless -f $_;      # ignore "special" files
        ...
    }

The -T and -B switches work as follows. The first block or so of the file is examined for strange characters such as control codes or bytes with the high bit set (that don't look like UTF-8). If more than a third of the bytes appear to be strange, it's a binary file; otherwise, it's a text file. Also, any file containing ASCII NUL (\0) in the first block is considered a binary file. If -T or -B is used on a filehandle, the current input (standard I/O or "stdio") buffer is examined rather than the first block of the file. Both -T and -B return true on an empty file, or on a file at EOF (end-of-file) when testing a filehandle. Because Perl has to read a file to do the -T test, you don't want to use -T on special files that might hang or give you other kinds of grief. So on most occasions you'll want to test with a -f first, as in:

    next unless -f $file && -T $file;

If any of the file tests (or either the stat or lstat operator) are given the special filehandle consisting of a solitary underline, then the stat structure of the previous file test (or stat operator) is used, thereby saving a system call.
(This doesn't work with -t, and you need to remember that lstat and -l will leave values in the stat structure for the symbolic link, not the real file. Likewise, -l _ will always be false after a normal stat.) Here are a couple of examples:

    print "Can do.\n" if -r $a || -w _ || -x _;

    stat($filename);
    print "Readable\n"   if -r _;
    print "Writable\n"   if -w _;
    print "Executable\n" if -x _;
    print "Setuid\n"     if -u _;
    print "Setgid\n"     if -g _;
    print "Sticky\n"     if -k _;
    print "Text\n"       if -T _;
    print "Binary\n"     if -B _;

* You may, however, override the built-in semantics with the use filetest pragma. See Chapter 31, Pragmatic Modules.

File ages for -M, -A, and -C are returned in days (including fractional days) since the script started running. This time is stored in the special variable $^T ($BASETIME). Thus, if the file changed after the script started, you would get a negative time. Note that most time values (86,399 out of 86,400, on average) are fractional, so testing for equality with an integer without using the int function is usually futile. Examples:

    next unless -M $file > .5;      # files are older than 12 hours
    &newfile if -M $file < 0;       # file is newer than process
    &mailwarning if int(-A) == 90;  # file ($_) was accessed 90 days ago today

To reset the script's start time to the current time, say this:

    $^T = time;

Relational Operators

Perl has two classes of relational operators. One class operates on numeric values, the other on string values, as shown in Table 3-5.

Table 3-5. Relational Operators

    Numeric     String      Meaning
    >           gt          Greater than
    >=          ge          Greater than or equal to
    <           lt          Less than
    <=          le          Less than or equal to

These operators return 1 for true and "" for false. Note that relational operators are nonassociating, which means that $a < $b < $c is a syntax error.
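The two classes really are different, as this sketch of ours shows, so pick the class that matches your data:

```perl
print "numerically smaller\n" if 9 < 10;      # true: compares as numbers
print "stringwise smaller\n"  if "10" lt "9"; # true: "1" sorts before "9"

# Using the wrong class silently gives you the other ordering:
print "oops\n" if "9" lt "10";                # never prints
```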
In the absence of locale declarations, string comparisons are based on the ASCII/Unicode collating sequences, and, unlike in some computer languages, trailing spaces count in the comparison. With a locale declaration, the collation order specified by the locale is used. (Locale-based collation mechanisms may or may not interact well with the Unicode collation mechanisms currently in development.)

Equality Operators

The equality operators listed in Table 3-6 are much like the relational operators.

Table 3-6. Equality Operators

    Numeric     String      Meaning
    ==          eq          Equal to
    !=          ne          Not equal to
    <=>         cmp         Comparison, with signed result

The equal and not-equal operators return 1 for true and "" for false (just as the relational operators do). The <=> and cmp operators return -1 if the left operand is less than the right operand, 0 if they are equal, and +1 if the left operand is greater than the right. Although the equality operators appear to be similar to the relational operators, they do have a lower precedence level, so $a < $b <=> $c < $d is syntactically valid.

For reasons that are apparent to anyone who has seen Star Wars, the <=> operator is known as the "spaceship" operator.

Bitwise Operators

Like C, Perl has bitwise AND, OR, and XOR (exclusive OR) operators: &, |, and ^. You'll have noticed from your painstaking examination of the table at the start of this chapter that bitwise AND has a higher precedence than the others, but we've cheated and combined them in this discussion.

These operators work differently on numeric values than they do on strings. (This is one of the few places where Perl cares about the difference.) If either operand is a number (or has been used as a number), both operands are converted to integers, and the bitwise operation is performed between the two integers. These integers are guaranteed to be at least 32 bits long, but can be 64 bits on some machines.
The point is that there's an arbitrary limit imposed by the machine's architecture.

If both operands are strings (and have not been used as numbers since they were set), the operators do bitwise operations between corresponding bits from the two strings. In this case, there's no arbitrary limit, since strings aren't arbitrarily limited in size. If one string is longer than the other, the shorter string is considered to have a sufficient number of 0 bits on the end to make up the difference.

For example, if you AND together two strings:

    "123.45" & "234.56"

you get another string:

    "020.44"

But if you AND together a string and a number:

    "123.45" & 234.56

the string is first converted to a number, giving:

    123.45 & 234.56

The numbers are then converted to integers:

    123 & 234

which evaluates to 106.

Note that all bit strings are true (unless they result in the string "0"). This means that if you want to see whether any byte came out to nonzero, instead of writing this:

    if ( "fred" & "\1\2\3\4" ) { ... }

you need to write this:

    if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }

C-Style Logical (Short-Circuit) Operators

Like C, Perl provides the && (logical AND) and || (logical OR) operators. They evaluate from left to right (with && having slightly higher precedence than ||) testing the truth of the statement. These operators are known as short-circuit operators because they determine the truth of the statement by evaluating the fewest number of operands possible. For example, if the left operand of an && operator is false, the right operand is never evaluated because the result of the operator is false regardless of the value of the right operand.

    Example     Name    Result
    $a && $b    And     $a if $a is false, $b otherwise
    $a || $b    Or      $a if $a is true, $b otherwise

Such short circuits not only save time, but are frequently used to control the flow of evaluation.
For example, an oft-appearing idiom in Perl programs is:

    open(FILE, "somefile") || die "Can't open somefile: $!\n";

In this case, Perl first evaluates the open function. If the value is true (because somefile was successfully opened), the execution of the die function is unnecessary, and so is skipped. You can read this literally as "Open some file or die!"

The && and || operators differ from C's in that, rather than returning 0 or 1, they return the last value evaluated. In the case of ||, this has the delightful result that you can select the first of a series of scalar values that happens to be true. Thus, a reasonably portable way to find out the user's home directory might be:

    $home = $ENV{HOME}
         || $ENV{LOGDIR}
         || (getpwuid($<))[7]
         || die "You're homeless!\n";

On the other hand, since the left argument is always evaluated in scalar context, you can't use || for selecting between two aggregates for assignment:

    @a = @b || @c;              # This doesn't do the right thing
    @a = scalar(@b) || @c;      # because it really means this.
    @a = @b ? @b : @c;          # This works fine, though.

Perl also provides lower precedence and and or operators that some people find more readable and don't force you to use parentheses on list operators. They also short-circuit. See Table 1-1 for a complete list.

Range Operator

The .. range operator is really two different operators depending on the context. In scalar context, .. returns a Boolean value. The operator is bi-stable, like an electronic flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each scalar .. operator maintains its own Boolean state. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, after which the range operator becomes false again. The operator doesn't become false until the next time it is evaluated.
It can test the right operand and become false on the same evaluation as the one where it became true (the way awk's range operator behaves), but it still returns true once. If you don't want it to test the right operand until the next evaluation (which is how sed's range operator works), just use three dots (...) instead of two. With both .. and ..., the right operand is not evaluated while the operator is in the false state, and the left operand is not evaluated while the operator is in the true state.

The value returned is either the null string for false or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint. You can exclude the beginning point by waiting for the sequence number to be greater than 1.

If either operand of scalar .. is a numeric literal, that operand is implicitly compared to the $. variable, which contains the current line number for your input file. Examples:

    if (101 .. 200) { print; }      # print 2nd hundred lines
    next LINE if (1 .. /^$/);       # skip header lines
    s/^/> / if (/^$/ .. eof());     # quote body

In list context, .. returns a list of values counting (by ones) from the left value to the right value. This is useful for writing for (1..10) loops and for doing slice operations on arrays:

    for (101 .. 200) { print; }     # prints 101102...199200
    @foo = @foo[0 .. $#foo];        # an expensive no-op
    @foo = @foo[-5 .. -1];          # slice last 5 items

If the left value is greater than the right value, a null list is returned. (To produce a list in reverse order, see the reverse operator.)

If its operands are strings, the range operator makes use of the magical autoincrement algorithm discussed earlier.* So you can say:

    @alphabet = ('A' .. 'Z');

to get all the letters of the (English) alphabet, or:

    $hexdigit = (0 .. 9, 'a' .. 'f')[$num & 15];

to get a hexadecimal digit, or:

    @z2 = ('01' .. '31');
    print $z2[$mday];

to get dates with leading zeros. You can also say:

    @combos = ('aa' .. 'zz');

to get all combinations of two lowercase letters. However, be careful of something like:

    @bigcombos = ('aaaaaa' .. 'zzzzzz');

since that will require lots of memory. More precisely, it'll need space to store 308,915,776 scalars. Let's hope you allocated a very large swap partition. Perhaps you should consider an iterative approach instead.

* If the final value specified is not in the sequence that the magical increment would produce, the sequence continues until the next value is longer than the final value specified.

Conditional Operator

As in C, ?: is the only trinary operator. It's often called the conditional operator because it works much like an if-then-else, except that, since it's an expression and not a statement, it can be safely embedded within other expressions and function calls. As a trinary operator, its two parts separate three expressions:

    COND ? THEN : ELSE

If the condition COND is true, only the THEN expression is evaluated, and the value of that expression becomes the value of the entire expression. Otherwise, only the ELSE expression is evaluated, and its value becomes the value of the entire expression. Scalar or list context propagates downward into the second or third argument, whichever is selected. (The first argument is always in scalar context, since it's a conditional.)

    $a = $ok ? $b : $c;    # get a scalar
    @a = $ok ? @b : @c;    # get an array
    $a = $ok ? @b : @c;    # get a count of an array's elements

You'll often see the conditional operator embedded in lists of values to format with printf, since nobody wants to replicate the whole statement just to switch between two related values.

    printf "I have %d camel%s.\n", $n, $n == 1 ? "" : "s";

Conveniently, the precedence of ?: is higher than a comma but lower than most operators you'd use inside (such as == in this example), so you don't usually have to parenthesize anything. But you can add parentheses for clarity if you like.

For conditional operators nested within the THEN parts of other conditional operators, we suggest that you put in line breaks and indent as if they were ordinary if statements:

    $leapyear = $year % 4 == 0
                ? $year % 100 == 0
                    ? $year % 400 == 0
                        ? 1
                        : 0
                    : 1
                : 0;

For conditionals nested within the ELSE parts of earlier conditionals, you can do a similar thing:

    $leapyear = $year % 4
                ? 0
                : $year % 100
                    ? 1
                    : $year % 400
                        ? 0
                        : 1;

but it's usually better to line up all the COND and THEN parts vertically:

    $leapyear = $year % 4   ? 0 :
                $year % 100 ? 1 :
                $year % 400 ? 0 :
                              1;

Lining up the question marks and colons can make sense of even fairly cluttered structures:

    printf "Yes, I like my %s book!\n",
        $i18n eq "french"   ? "chameau"           :
        $i18n eq "german"   ? "Kamel"             :
        $i18n eq "japanese" ? "\x{99F1}\x{99DD}"  :
                              "camel";

You can assign to the conditional operator* if both the second and third arguments are legal lvalues (meaning that you can assign to them), and both are scalars or both are lists (otherwise, Perl won't know which context to supply to the right side of the assignment):

    ($a_or_b ? $a : $b) = $c;    # sets either $a or $b to have the value of $c

Bear in mind that the conditional operator binds more tightly than the various assignment operators. Usually this is what you want (see the $leapyear assignments above, for example), but you can't have it the other way without using parentheses. Using embedded assignments without parentheses will get you into trouble, and you might not get a parse error because the conditional operator can be parsed as an lvalue. For example, you might write this:

    $a % 2 ? $a += 10 : $a += 2    # WRONG

But that would be parsed like this:

    (($a % 2) ? ($a += 10) : $a) += 2

* This is not necessarily guaranteed to contribute to the readability of your program. But it can be used to create some cool entries in an Obfuscated Perl contest.

Assignment Operators

Perl recognizes the C assignment operators, as well as providing some of its own. There are quite a few of them:

    =     **=   +=   -=    .=    *=    /=    %=    x=
    &=    |=    ^=   <<=   >>=   &&=   ||=

Each operator requires a target lvalue (typically a variable or array element) on the left side and an expression on the right side. For the simple assignment operator:

    TARGET = EXPR

the value of the EXPR is stored into the variable or location designated by TARGET. For the other operators, Perl evaluates the expression:

    TARGET OP= EXPR

as if it were written:

    TARGET = TARGET OP EXPR

That's a handy mental rule, but it's misleading in two ways. First, assignment operators always parse at the precedence level of ordinary assignment, regardless of the precedence that OP would have by itself. Second, TARGET is evaluated only once. Usually that doesn't matter unless there are side effects, such as an autoincrement:

    $var[$a++] += $value;                # $a is incremented once
    $var[$a++] = $var[$a++] + $value;    # $a is incremented twice

Unlike in C, the assignment operator produces a valid lvalue. Modifying an assignment is equivalent to doing the assignment and then modifying the variable to which it was assigned. This is useful for modifying a copy of something, like this:

    ($tmp = $global) += $constant;

which is the equivalent of:

    $tmp = $global + $constant;

Likewise:

    ($a += 2) *= 3;

is equivalent to:

    $a += 2;
    $a *= 3;

That's not terribly useful, but here's an idiom you see frequently:

    ($new = $old) =~ s/foo/bar/g;

In all cases, the value of the assignment is the new value of the variable.
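To make the copy-and-modify idiom concrete, here is a minimal sketch (the variable names and sample string are ours):

```perl
my $old = "foo bar foo";
(my $new = $old) =~ s/foo/bar/g;   # the assignment is an lvalue, so the
                                   # substitution applies to the fresh copy
print "$old\n";    # prints "foo bar foo"  (original untouched)
print "$new\n";    # prints "bar bar bar"
```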
Since assignment operators associate right-to-left, this can be used to assign many variables the same value, as in:

    $a = $b = $c = 0;

which assigns 0 to $c, and the result of that (still 0) to $b, and the result of that (still 0) to $a.

List assignment may be done only with the plain assignment operator, =. In list context, list assignment returns the list of new values just as scalar assignment does. In scalar context, list assignment returns the number of values that were available on the right side of the assignment, as mentioned in Chapter 2, Bits and Pieces. This makes it useful for testing functions that return a null list when unsuccessful (or no longer successful), as in:

    while (($key, $value) = each %gloss) { ... }
    next unless ($dev, $ino, $mode) = stat $file;

Comma Operators

Binary "," is the comma operator. In scalar context it evaluates its left argument in void context, throws that value away, then evaluates its right argument in scalar context and returns that value. This is just like C's comma operator. For example:

    $a = (1, 3);

assigns 3 to $a. Do not confuse the scalar context use with the list context use. In list context, a comma is just the list argument separator, and inserts both its arguments into the LIST. It does not throw any values away. For example, if you change the previous example to:

    @a = (1, 3);

you are constructing a two-element list, while:

    atan2(1, 3);

is calling the function atan2 with two arguments.

The => digraph is mostly just a synonym for the comma operator. It's useful for documenting arguments that come in pairs. It also forces any identifier to its immediate left to be interpreted as a string.

List Operators (Rightward)

The right side of a list operator governs all the list operator's arguments, which are comma separated, so the precedence of a list operator is lower than a comma if you're looking to the right.
Once a list operator starts chewing up comma-separated arguments, the only things that will stop it are tokens that stop the entire expression (like semicolons or statement modifiers), or tokens that stop the current subexpression (like right parentheses or brackets), or the low precedence logical operators we'll talk about next.

Logical and, or, not, and xor

As lower precedence alternatives to &&, ||, and !, Perl provides the and, or, and not operators. The behavior of these operators is identical; in particular, and and or short-circuit like their counterparts, which makes them useful not only for logical expressions but also for control flow. Since the precedence of these operators is much lower than the ones borrowed from C, you can safely use them after a list operator without the need for parentheses:

    unlink "alpha", "beta", "gamma"
        or gripe(), next LINE;

With the C-style operators you'd have to write it like this:

    unlink("alpha", "beta", "gamma")
        || (gripe(), next LINE);

But you can't just up and replace all instances of || with or. Suppose you change this:

    $xyz = $x || $y || $z;

to this:

    $xyz = $x or $y or $z;    # WRONG

That wouldn't do the same thing at all! The precedence of the assignment is higher than or but lower than ||, so it would always assign $x to $xyz, and then do the ors. To get the same effect as ||, you'd have to write:

    $xyz = ( $x or $y or $z );

The moral of the story is that you still must learn precedence (or use parentheses) no matter which variety of logical operators you use.

There is also a logical xor operator that has no exact counterpart in C or Perl, since the only other exclusive-OR operator (^) works on bits. The xor operator can't short-circuit, since both sides must be evaluated. The best equivalent for $a xor $b is perhaps !$a != !$b. One could also write !$a ^ !$b or even $a ? !$b : !!$b, of course.
The point is that both $a and $b have to evaluate to true or false in a Boolean context, and the existing bitwise operator doesn't provide a Boolean context without help.

C Operators Missing from Perl

Here is what C has that Perl doesn't:

unary &
    The address-of operator. Perl's \ operator (for taking a reference) fills the same ecological niche, however:

        $ref_to_var = \$var;

    But Perl references are much safer than C pointers.

unary *
    The dereference-address operator. Since Perl doesn't have addresses, it doesn't need to dereference addresses. It does have references though, so Perl's variable prefix characters serve as dereference operators, and indicate type as well: $, @, %, and &. Oddly enough, there actually is a * dereference operator, but since * is the funny character indicating a typeglob, you wouldn't use it the same way.

(TYPE)
    The typecasting operator. Nobody likes to be typecast anyway.

4 Statements and Declarations

A Perl program consists of a sequence of declarations and statements. A declaration may be placed anywhere a statement may be placed, but its primary effect occurs at compile time. A few declarations do double duty as ordinary statements, but most are totally transparent at run time. After compilation, the main sequence of statements is executed just once.

Unlike many programming languages, Perl doesn't require variables to be explicitly declared; they spring into existence upon their first use, whether you've declared them or not. If you try to use a value from a variable that's never had a value assigned to it, it's quietly treated as 0 when used as a number, as "" (the null string) when used as a string, or simply as false when used as a logical value. If you prefer to be warned about using undefined values as though they were real strings or numbers, or even to treat doing so as an error, the use warnings declaration will take care of that; see the section "Pragmas" at the end of this chapter.
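A tiny sketch of those defaults (our own example; without use warnings, the undefined uses are silent):

```perl
my $undef;                            # declared but never assigned
my $sum    = $undef + 5;              # undef as a number: 0, so $sum is 5
my $string = $undef . "end";          # undef as a string: "", so "end"
my $truth  = $undef ? "yes" : "no";   # undef as a boolean: false
print "$sum $string $truth\n";        # prints "5 end no"
```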
You may declare your variables though, if you like, using either my or our in front of the variable name. You can even make it an error to use an undeclared variable. This kind of discipline is fine, but you have to declare that you want the discipline. Normally, Perl minds its own business about your programming habits, but under the use strict declaration, the use of undeclared variables is apprehended at compile time. Again, see the "Pragmas" section.

Simple Statements

A simple statement is an expression evaluated for its side effects. Every simple statement must end in a semicolon, unless it is the final statement in a block. In that case, the semicolon is optional; Perl knows that you must be done with the statement, since you've finished the block. But put the semicolon in anyway if it's at the end of a multiline block, because you might eventually add another line.

Even though operators like eval {}, do {}, and sub {} all look like compound statements, they really aren't. True, they allow multiple statements on the inside, but that doesn't count. From the outside, those operators are just terms in an expression, and thus they need an explicit semicolon if used as the last item in a statement.

Any simple statement may optionally be followed by a single modifier, just before the terminating semicolon (or block ending). The possible modifiers are:

    if EXPR
    unless EXPR
    while EXPR
    until EXPR
    foreach LIST

The if and unless modifiers work pretty much as they do in English:

    $trash->take('out') if $you_love_me;
    shutup() unless $you_want_me_to_leave;

The while and until modifiers evaluate repeatedly.
As you might expect, a while modifier keeps executing the expression as long as its expression remains true, and an until modifier keeps executing only as long as it remains false:

    $expression++ while -e "$file$expression";
    kiss('me') until $I_die;

The foreach modifier (also spelled for) evaluates once for each element in its LIST, with $_ aliased to the current element:

    s/java/perl/ for @resumes;
    print "field: $_\n" foreach split /:/, $dataline;

The while and until modifiers have the usual while-loop semantics (conditional evaluated first), except when applied to a do BLOCK (or to the now-deprecated do SUBROUTINE statement), in which case the block executes once before the conditional is evaluated. This allows you to write loops like this:

    do {
        $line = <STDIN>;
        ...
    } until $line eq ".\n";

See the three different do entries in Chapter 29, Functions. Note also that the loop-control operators described later will not work in this construct, since modifiers don't take loop labels. You can always place an extra block around it to terminate early, or inside it to iterate early, as described later in the section "Bare Blocks". Or you could write a real loop with multiple loop controls inside. Speaking of real loops, we'll talk about compound statements next.

Compound Statements

A sequence of statements within a scope* is called a block. Sometimes the scope is the entire file, such as a required file or the file containing your main program. Sometimes the scope is a string being evaluated with eval. But generally, a block is surrounded by braces ({}). When we say scope, we mean any of these three. When we mean a block with braces, we'll use the term BLOCK.

Compound statements are built out of expressions and BLOCKs. Expressions are built out of terms and operators. In our syntax descriptions, we'll use the word EXPR to indicate a place where you can use any scalar expression. To indicate an expression evaluated in list context, we'll say LIST.

* Scopes and namespaces are described in Chapter 2, Bits and Pieces, in the "Names" section.
The following statements may be used to control conditional and repeated execution of BLOCKs. (The LABEL portion is optional.)

    if (EXPR) BLOCK
    if (EXPR) BLOCK else BLOCK
    if (EXPR) BLOCK elsif (EXPR) BLOCK ...
    if (EXPR) BLOCK elsif (EXPR) BLOCK ... else BLOCK
    unless (EXPR) BLOCK
    unless (EXPR) BLOCK else BLOCK
    unless (EXPR) BLOCK elsif (EXPR) BLOCK ...
    unless (EXPR) BLOCK elsif (EXPR) BLOCK ... else BLOCK
    LABEL while (EXPR) BLOCK
    LABEL while (EXPR) BLOCK continue BLOCK
    LABEL until (EXPR) BLOCK
    LABEL until (EXPR) BLOCK continue BLOCK
    LABEL for (EXPR; EXPR; EXPR) BLOCK
    LABEL foreach (LIST) BLOCK
    LABEL foreach VAR (LIST) BLOCK
    LABEL foreach VAR (LIST) BLOCK continue BLOCK
    LABEL BLOCK
    LABEL BLOCK continue BLOCK

Note that unlike in C and Java, these are defined in terms of BLOCKs, not statements. This means that the braces are required; no dangling statements allowed. If you want to write conditionals without braces there are several ways to do so. The following all do the same thing:

    unless (open(FOO, $foo)) { die "Can't open $foo: $!" }
    if (!open(FOO, $foo))    { die "Can't open $foo: $!" }

    die "Can't open $foo: $!" unless open(FOO, $foo);
    die "Can't open $foo: $!" if !open(FOO, $foo);

    open(FOO, $foo) || die "Can't open $foo: $!";
    open FOO, $foo  or die "Can't open $foo: $!";

Under most circumstances, we tend to prefer the last pair. These forms come with less eye-clutter than the others, especially the "or die" version. With the || form you need to get used to using parentheses religiously, but with the or version, it doesn't matter if you forget. But the main reason we like the last versions better is because of how they pull the important part of the statement right up to the front of the line where you'll see it first.
The error handling is shoved off to the side so that you don't have to pay attention to it unless you want to.* If you tab all your "or die" checks over to the same column on the right each time, it's even easier to read:

    chdir $dir          or die "chdir $dir: $!";
    open FOO, $file     or die "open $file: $!";
    @lines = <FOO>      or die "$file is empty?";
    close FOO           or die "close $file: $!";

* (Like this footnote.)

if and unless Statements

The if statement is straightforward. Because BLOCKs are always bounded by braces, there is never any ambiguity regarding which particular if an else or elsif goes with. In any given sequence of if/elsif/else BLOCKs, only the first one whose condition evaluates to true is executed. If none of them is true, then the else BLOCK, if there is one, is executed. It's usually a good idea to put an else at the end of a chain of elsifs to guard against a missed case.

If you use unless in place of if, the sense of its test is reversed. That is:

    unless ($x == 1) ...

is equivalent to:

    if ($x != 1) ...

or even to the unsightly:

    if (!($x == 1)) ...

The scope of a variable declared in the controlling condition extends from its declaration through the rest of that conditional only, including any elsifs and the final else clause if present, but not beyond:

    if ((my $color = <STDIN>) =~ /red/i) {
        $value = 0xff0000;
    }
    elsif ($color =~ /green/i) {
        $value = 0x00ff00;
    }
    elsif ($color =~ /blue/i) {
        $value = 0x0000ff;
    }
    else {
        warn "unknown RGB component '$color', using black instead\n";
        $value = 0x000000;
    }

After the else, the $color variable is no longer in scope. If you want the scope to extend further, declare the variable beforehand.

Loop Statements

All loop statements have an optional LABEL in their formal syntax. (You can put a label on any statement, but it has a special meaning to a loop.) If present, the label consists of an identifier followed by a colon.
It's customary to make the label uppercase to avoid potential confusion with reserved words, and so it stands out better. And although Perl won't get confused if you use a label that already has a meaning like if or open, your readers might.

while and until Statements

The while statement repeatedly executes the block as long as EXPR is true. If the word while is replaced by the word until, the sense of the test is reversed; that is, it executes the block only as long as EXPR remains false. The conditional is still tested before the first iteration, though.

The while or until statement can have an optional extra block: the continue block. This block is executed every time the block is continued, either by falling off the end of the first block or by an explicit next (a loop-control operator that goes to the next iteration). The continue block is not heavily used in practice, but it's in here so we can define the for loop rigorously in the next section.

Unlike the foreach loop we'll see in a moment, a while loop never implicitly localizes any variables in its test condition. This can have "interesting" consequences when while loops use globals for loop variables. In particular, see the section "Line input (angle) operator" in Chapter 2 for how implicit assignment to the global $_ can occur in certain while loops, along with an example of how to deal with the problem by explicitly localizing $_. For other loop variables, however, it's best to declare them with my, as in the next example.

A variable declared in the test condition of a while or until statement is visible only in the block or blocks governed by that test. It is not part of the surrounding scope.
For example:

    while (my $line = <STDIN>) {
        $line = lc $line;
    }
    continue {
        print $line;    # still visible
    }
    # $line now out of scope here

Here the scope of $line extends from its declaration in the control expression throughout the rest of the loop construct, including the continue block, but not beyond. If you want the scope to extend further, declare the variable before the loop.

for Loops

The three-part for loop has three semicolon-separated expressions within its parentheses. These expressions function respectively as the initialization, the condition, and the re-initialization expressions of the loop. All three expressions are optional (but not the semicolons); if omitted, the condition is always true. Thus, the three-part for loop can be defined in terms of the corresponding while loop. This:

    LABEL: for (my $i = 1; $i <= 10; $i++) {
        ...
    }

is like this:

    {
        my $i = 1;
        LABEL: while ($i <= 10) {
            ...
        }
        continue {
            $i++;
        }
    }

except that there's not really an outer block. (We just put one there to show how the scope of the my is limited.)

If you want to iterate through two variables simultaneously, just separate the parallel expressions with commas:

    for ($i = 0, $bit = 1; $i < 32; $i++, $bit <<= 1) {
        print "Bit $i is set\n" if $mask & $bit;
    }
    # the values in $i and $bit persist past the loop

Or declare those variables to be visible only inside the for loop:

    for (my ($i, $bit) = (0, 1); $i < 32; $i++, $bit <<= 1) {
        print "Bit $i is set\n" if $mask & $bit;
    }
    # loop's versions of $i and $bit now out of scope

Besides the normal looping through array indices, for can lend itself to many other interesting applications. It doesn't even need an explicit loop variable. Here's one example that avoids the problem you get when you explicitly test for end-of-file on an interactive file descriptor, causing your program to appear to hang.

    $on_a_tty = -t STDIN && -t STDOUT;
    sub prompt { print "yes? " if $on_a_tty }
    for ( prompt(); <STDIN>; prompt() ) {
        # do something
    }

Another traditional application for the three-part for loop results from the fact that all three expressions are optional, and the default condition is true. If you leave out all three expressions, you have written an infinite loop:

    for (;;) {
        ...
    }

This is the same as writing:

    while (1) {
        ...
    }

If the notion of infinite loops bothers you, we should point out that you can always fall out of the loop at any point with an explicit loop-control operator such as last. Of course, if you're writing the code to control a cruise missile, you may not actually need an explicit loop exit. The loop will be terminated automatically at the appropriate moment.*

foreach Loops

The foreach loop iterates over a list of values by setting the control variable (VAR) to each successive element of the list:

    foreach VAR (LIST) {
        ...
    }

The foreach keyword is just a synonym for the for keyword, so you can use foreach and for interchangeably, whichever you think is more readable in a given situation. If VAR is omitted, the global $_ is used. (Don't worry; Perl can easily distinguish for (@ARGV) from for ($i=0; $i<$#ARGV; $i++) because the latter contains semicolons.) Here are some examples:

    $sum = 0;
    foreach $value (@array) { $sum += $value }

    for $count (10,9,8,7,6,5,4,3,2,1,'BOOM') {    # do a countdown
        print "$count\n"; sleep(1);
    }

    for (reverse 'BOOM', 1 .. 10) {               # same thing
        print "$_\n"; sleep(1);
    }

    for $field (split /:/, $data) {               # any LIST expression
        print "Field contains: '$field'\n";
    }

    foreach $key (sort keys %hash) {
        print "$key => $hash{$key}\n";
    }

That last one is the canonical way to print out the values of a hash in sorted order. See the keys and sort entries in Chapter 29 for more elaborate examples.

There is no way with foreach to tell where you are in a list.
You may compare adjacent elements by remembering the previous one in a variable, but sometimes you just have to break down and write a three-part for loop with subscripts. That's what the other kind of for loop is there for, after all.

If LIST consists entirely of assignable values (meaning variables, generally, not enumerated constants), you can modify each of those variables by modifying VAR inside the loop. That's because the foreach loop index variable is an implicit alias for each item in the list that you're looping over. Not only can you modify a single array in place, you can also modify multiple arrays and hashes in a single list:

    foreach $pay (@salaries) {            # grant 8% raises
        $pay *= 1.08;
    }

    for (@christmas, @easter) {           # change menu
        s/ham/turkey/;
    }

    s/ham/turkey/ for @christmas, @easter;    # same thing

    for ($scalar, @array, values %hash) {
        s/^\s+//;                         # strip leading whitespace
        s/\s+$//;                         # strip trailing whitespace
    }

The loop variable is valid only from within the dynamic or lexical scope of the loop and will be implicitly lexical if the variable was previously declared with my. This renders it invisible to any function defined outside the lexical scope of the variable, even if called from within that loop. However, if no lexical declaration is in scope, the loop variable will be a localized (dynamically scoped) global variable; this allows functions called from within the loop to access that variable. In either case, any previous value the localized variable had before the loop will be restored automatically upon loop exit.

If you prefer, you may explicitly declare which kind of variable (lexical or global) to use. This makes it easier for maintainers of your code to know what's really going on; otherwise, they'll need to search back up through enclosing scopes for a previous declaration to figure out which kind of variable it is:

    for my $i (1 .. 10)      { ... }    # $i always lexical
    for our $Tick (1 .. 10)  { ... }    # $Tick always global

When a declaration accompanies the loop variable, the shorter for spelling is preferred over foreach, since it reads better in English.

Here's how a C or Java programmer might first think to code up a particular algorithm in Perl:

    for ($i = 0; $i < @ary1; $i++) {
        for ($j = 0; $j < @ary2; $j++) {
            if ($ary1[$i] > $ary2[$j]) {
                last;    # Can't go to outer loop. :-(
            }
            $ary1[$i] += $ary2[$j];
        }
        # this is where that last takes me
    }

But here's how a veteran Perl programmer might do it:

    WID: foreach $this (@ary1) {
        JET: foreach $that (@ary2) {
            next WID if $this > $that;
            $this += $that;
        }
    }

See how much easier that was in idiomatic Perl? It's cleaner, safer, and faster. It's cleaner because it's less noisy. It's safer because if code gets added between the inner and outer loops later on, the new code won't be accidentally executed, since next (explained below) explicitly iterates the outer loop rather than merely breaking out of the inner one. And it's faster because Perl executes a foreach statement more rapidly than it would the equivalent for loop, since the elements are accessed directly instead of through subscripting. But write it however you like. TMTOWTDI.

Like the while statement, the foreach statement can also take a continue block. This lets you execute a bit of code at the bottom of each loop iteration no matter whether you got there in the normal course of events or through a next. Speaking of which, now we can finally say it: next is next.

Loop Control

We mentioned that you can put a LABEL on a loop to give it a name. The loop's LABEL identifies the loop for the loop-control operators next, last, and redo. The LABEL names the loop as a whole, not just the top of the loop. Hence, a loop-control operator referring to the loop doesn't actually "go to" the loop label itself.

* That is, the fallout from the loop tends to occur automatically.
As far as the computer is concerned, the label could just as easily have been placed at the end of the loop. But people like things labeled at the top, for some reason.

Loops are typically named for the item the loop is processing on each iteration. This interacts nicely with the loop-control operators, which are designed to read like English when used with an appropriate label and a statement modifier. The archetypal loop works on lines, so the archetypal loop label is LINE:, and the archetypal loop-control operator is something like this:

    next LINE if /^#/;      # discard comments

The syntax for the loop-control operators is:

    last LABEL
    next LABEL
    redo LABEL

The LABEL is optional; if omitted, the operator refers to the innermost enclosing loop. But if you want to jump past more than one level, you must use a LABEL to name the loop you want to affect. That LABEL does not have to be in your lexical scope, though it probably ought to be. In fact, the LABEL can be anywhere in your dynamic scope. If this forces you to jump out of an eval or subroutine, Perl issues a warning (upon request).

Just as you may have as many return operators in a function as you like, you may have as many loop-control operators in a loop as you like. This is not to be considered wicked or even uncool. During the early days of structured programming, some people insisted that loops and subroutines have only one entry and one exit. The one-entry notion is still a good idea, but the one-exit notion has led people to write a lot of unnatural code. Much of programming consists of traversing decision trees. A decision tree naturally starts with a single trunk but ends with many leaves. Write your code with the number of loop exits (and function returns) that is natural to the problem you're trying to solve. If you've declared your variables with reasonable scopes, everything gets automatically cleaned up at the appropriate moment, no matter how you leave the block.
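The multi-level jump can be seen in a short, self-contained sketch of the veteran's idiom shown earlier (the array contents here are invented for illustration):

```perl
use strict;
use warnings;

my @ary1 = (1, 10);
my @ary2 = (2, 3, 9);

# next WID abandons the rest of the inner JET loop and moves the
# outer WID loop on to its next element; $this aliases @ary1 in place.
WID: foreach my $this (@ary1) {
    JET: foreach my $that (@ary2) {
        next WID if $this > $that;
        $this += $that;
    }
}
# @ary1 is now (15, 10): 1+2+3+9 = 15, while 10 > 2 bailed out at once.
```

Note that because $this is an alias, the additions land in @ary1 itself; no subscripting is needed.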
The last operator immediately exits the loop in question. The continue block, if any, is not executed. The following example bombs out of the loop on the first blank line:

    LINE: while (<STDIN>) {
        last LINE if /^$/;      # exit when done with mail header
        ...
    }

The next operator skips the rest of the current iteration of the loop and starts the next one. If there is a continue clause on the loop, it is executed just before the condition is re-evaluated, just like the third component of a three-part for loop. Thus it can be used to increment a loop variable, even when a particular iteration of the loop has been interrupted by a next:

    LINE: while (<STDIN>) {
        next LINE if /^#/;      # skip comments
        next LINE if /^$/;      # skip blank lines
        ...
    } continue {
        $count++;
    }

The redo operator restarts the loop block without evaluating the conditional again. The continue block, if any, is not executed. This operator is often used by programs that want to fib to themselves about what was just input. Suppose you were processing a file that sometimes had a backslash at the end of a line to continue the record on the next line. Here's how you could use redo for that:

    while (<>) {
        chomp;
        if (s/\\$//) {
            $_ .= <>;
            redo unless eof;    # don't read past each file's eof
        }
        # now process $_
    }

which is the customary Perl shorthand for the more explicitly (and tediously) written version:

    LINE: while (defined($line = <ARGV>)) {
        chomp($line);
        if ($line =~ s/\\$//) {
            $line .= <ARGV>;
            redo LINE unless eof(ARGV);
        }
        # now process $line
    }

Here's an example from a real program that uses all three loop-control operators.
Although this particular strategy of parsing command-line arguments is less common now that we have the Getopt::* modules bundled with Perl, it's still a nice illustration of the use of loop-control operators on named, nested loops:

    ARG: while (@ARGV && $ARGV[0] =~ s/^-(?=.)//) {
        OPT: for (shift @ARGV) {
            m/^$/      && do {                                next ARG; };
            m/^-$/     && do {                                last ARG; };
            s/^d//     && do { $Debug_Level++;                redo OPT; };
            s/^l//     && do { $Generate_Listing++;           redo OPT; };
            s/^i(.*)// && do { $In_Place = $1 || ".bak";      next ARG; };
            say_usage("Unknown option: $_");
        }
    }

One more point about loop-control operators. You may have noticed that we are not calling them "statements". That's because they aren't statements, although like any expression, they can be used as statements. You can almost think of them as unary operators that just happen to cause a change in control flow. So you can use them anywhere it makes sense to use them in an expression. In fact, you can even use them where it doesn't make sense. One sometimes sees this coding error:

    open FILE, $file
        or warn "Can't open $file: $!\n", next FILE;    # WRONG

The intent is fine, but the next FILE is being parsed as one of the arguments to warn, which is a list operator. So the next executes before the warn gets a chance to emit the warning. In this case, it's easily fixed by turning the warn list operator into the warn function call with some suitably situated parentheses:

    open FILE, $file
        or warn("Can't open $file: $!\n"), next FILE;   # okay

However, you might find it easier to read this:

    unless (open FILE, $file) {
        warn "Can't open $file: $!\n";
        next FILE;
    }

Bare Blocks

A BLOCK by itself (labeled or not) is semantically equivalent to a loop that executes once. Thus you can use last to leave the block or redo to restart the block.* Note that this is not true of the blocks in eval {}, sub {}, or, much to everyone's surprise, do {}.
These three are not loop blocks because they're not BLOCKs by themselves; the keyword in front makes them mere terms in an expression that just happen to include a code block. Since they're not loop blocks, they cannot be given a label to apply loop controls to. Loop controls may only be used on true loops, just as a return may only be used within a subroutine (well, or an eval).

Loop controls don't work in an if or unless, either, since those aren't loops. But you can always introduce an extra set of braces to give yourself a bare block, which does count as a loop:

    if (/pattern/) {{
        last if /alpha/;
        last if /beta/;
        last if /gamma/;
        # do something here only if still in if()
    }}

Here's how a bare block can be used to let loop-control operators work with a do {} construct. To next or redo a do, put a bare block inside:

    do {{
        next if $x == $y;
        # do something here
    }} until $x++ > $z;

* For reasons that may (or may not) become clear upon reflection, a next also exits the once-through block. There is a slight difference, however: a next will execute a continue block, but a last won't.

For last, you have to be more elaborate:

    {
        do {
            last if $x = $y ** 2;
            # do something here
        } while $x++ <= $z;
    }

And if you want both loop controls available, you'll have to put a label on those blocks so you can tell them apart:

    DO_LAST: {
        do {
            DO_NEXT: {
                next DO_NEXT if $x == $y;
                last DO_LAST if $x = $y ** 2;
                # do something here
            }
        } while $x++ <= $z;
    }

But certainly by that point (if not before), you'd be better off using an ordinary infinite loop with last at the end:

    for (;;) {
        next if $x == $y;
        last if $x = $y ** 2;
        # do something here
        last unless $x++ <= $z;
    }

Case Structures

Unlike some other programming languages, Perl has no official switch or case statement. That's because Perl doesn't need one, having many ways to do the same thing. A bare block is particularly convenient for doing case structures (multiway switches).
Here's one:

    SWITCH: {
        if (/^abc/) { $abc = 1; last SWITCH; }
        if (/^def/) { $def = 1; last SWITCH; }
        if (/^xyz/) { $xyz = 1; last SWITCH; }
        $nothing = 1;
    }

and here's another:

    SWITCH: {
        /^abc/ && do { $abc = 1; last SWITCH; };
        /^def/ && do { $def = 1; last SWITCH; };
        /^xyz/ && do { $xyz = 1; last SWITCH; };
        $nothing = 1;
    }

or, formatted so that each case stands out more:

    SWITCH: {
        /^abc/      && do {
                            $abc = 1;
                            last SWITCH;
                        };
        /^def/      && do {
                            $def = 1;
                            last SWITCH;
                        };
        /^xyz/      && do {
                            $xyz = 1;
                            last SWITCH;
                        };
        $nothing = 1;
    }

or even (horrors!):

    if    (/^abc/) { $abc     = 1 }
    elsif (/^def/) { $def     = 1 }
    elsif (/^xyz/) { $xyz     = 1 }
    else           { $nothing = 1 }

In this next example, notice how the last operators ignore the do {} blocks, which aren't loops, and exit the for loop instead:

    for ($very_nasty_long_name[$i++][$j++]->method()) {
        /this pattern/   and do { push @flags, '-e'; last; };
        /that one/       and do { push @flags, '-h'; last; };
        /something else/ and do { last; };
        die "unknown value: `$_'";
    }

You might think it odd to loop over a single value, since you'll only go through the loop once. But it's convenient to use for/foreach's aliasing capability to make a temporary, localized assignment to $_. On repeated compares against the same long value, this makes it much easier to type and therefore harder to mistype. It avoids possible side effects from evaluating the expression again. And pertinent to this section, it's also one of the most commonly seen standard idioms for implementing a switch or case structure.

Cascading use of the ?: operator can also work for simple cases. Here we again use a for for its aliasing property to make repeated comparisons more legible:

    for ($user_color_preference) {
        $value =  /red/   ? 0xFF0000
                : /green/ ? 0x00FF00
                : /blue/  ? 0x0000FF
                :           0x000000  # black if all fail
                ;
    }

For situations like this last one, it's sometimes better to build yourself a hash and quickly index into it to pull the answer out.
Unlike the cascading conditionals we just looked at, a hash scales to an unlimited number of entries, and it takes no more time to look up the first one than the last. The disadvantage is that you can only do an exact lookup, not a pattern match. If you have a hash like this:

    %color_map = (
        azure      => 0xF0FFFF,
        chartreuse => 0x7FFF00,
        lavender   => 0xE6E6FA,
        magenta    => 0xFF00FF,
        turquoise  => 0x40E0D0,
    );

then exact string lookups run quickly:

    $value = $color_map{ lc $user_color_preference } || 0x000000;

Even complicated multiway branching statements (with each case involving the execution of several different statements) can be turned into fast lookups. You just need to use a hash of references to functions. See the section "Hashes of Functions" in Chapter 9, Data Structures, for how to handle those.

goto

Although not for the faint of heart (nor for the pure of heart), Perl does support a goto operator. There are three forms: goto LABEL, goto EXPR, and goto &NAME.

The goto LABEL form finds the statement labeled with LABEL and resumes execution there. It can't be used to jump into any construct that requires initialization, such as a subroutine or a foreach loop. It also can't be used to jump into a construct that has been optimized away (see Chapter 18, Compiling). It can be used to go almost anywhere else within the current block or any block in your dynamic scope (that is, a block you were called from). You can even goto out of subroutines, but it's usually better to use some other construct. The author of Perl has never felt the need to use this form of goto (in Perl, that is; C is another matter).

The goto EXPR form is just a generalization of goto LABEL. It expects the expression to produce a label name, whose location obviously has to be resolved dynamically by the interpreter.
This allows for computed gotos per FORTRAN, but isn't necessarily recommended if you're optimizing for maintainability:

    goto(("FOO", "BAR", "GLARCH")[$i]);     # hope 0 <= i < 3

    @loop_label = qw/FOO BAR GLARCH/;
    goto $loop_label[rand @loop_label];     # random teleport

In almost all cases like this, it's usually a far, far better idea to use the structured control-flow mechanisms of next, last, or redo instead of resorting to a goto. For certain applications, a hash of references to functions or the catch-and-throw pair of eval and die for exception processing can also be prudent approaches.

The goto &NAME form is highly magical and sufficiently removed from the ordinary goto to exempt its users from the opprobrium to which goto users are customarily subjected. It substitutes a call to the named subroutine for the currently running subroutine. This behavior is used by AUTOLOAD subroutines to load another subroutine and then pretend that the other subroutine was called in the first place. After the goto, not even caller will be able to tell that this routine was called first. The autouse, AutoLoader, and SelfLoader modules all use this strategy to define functions the first time they're called, and then to jump right to them without anyone ever knowing the functions weren't there all along.

Global Declarations

Subroutine and format declarations are global declarations. No matter where you place them, what they declare is global. (It's local to a package, but packages are global to the program, so everything in a package is visible from anywhere.) A global declaration can be put anywhere a statement can, but it has no effect on the execution of the primary sequence of statements; the declarations take effect at compile time. This means you can't conditionally declare subroutines or formats by hiding them from the compiler inside a run-time conditional like an if, since only the interpreter pays attention to those conditions.
Subroutine and format declarations (and use and no declarations) are seen by the compiler no matter where they occur.

Global declarations are typically put at the beginning or the end of your program, or off in some other file. However, if you're declaring any lexically scoped variables (see the next section), you'll want to make sure your format or subroutine definition falls within the scope of the variable declarations if you expect it to be able to access those private variables.

Note that we sneakily switched from talking about declarations to definitions. Sometimes it helps to split the definition of the subroutine from its declaration. The only syntactic difference between the two is that the definition supplies a BLOCK containing the code to be executed, while the declaration doesn't. (A subroutine definition acts as its own declaration if no declaration has been seen.) Splitting the definition from the declaration allows you to put the subroutine declaration at the front of the file and the definition at the end (with your lexically scoped variable declarations happily in the middle):

    sub count (@);          # Compiler now knows how to call count().
    my $x;                  # Compiler now knows about lexical variable.
    $x = count(3,2,1);      # Compiler can now validate function call.
    sub count (@) { @_ }    # Compiler now knows what count() means.

As this example shows, subroutines don't actually have to be defined before calls to them can be compiled (indeed, the definition can even be delayed until first use, if you use autoloading), but declaring subroutines helps the compiler in various ways and gives you more options in how you can call them. Declaring a subroutine allows it to be used without parentheses, as if it were a built-in operator, from that point forward in the compilation. (We used parentheses to call count in the last example, but we didn't actually need to.)
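Here's a minimal runnable sketch of that split, using a hypothetical unary sub of our own invention; the ($) prototype makes double parse as a unary operator, so the call needs no parentheses:

```perl
use strict;
use warnings;

sub double ($);             # declaration only: from here on, the
                            # compiler parses "double EXPR" as a
                            # unary operator

my $x = double 21;          # call without parentheses

sub double ($) {            # the definition can sit at the end of
    return 2 * $_[0];       # the file; it's compiled before runtime
}
```

Because all definitions are compiled before the main line of the program runs, the call at the top finds double defined by the time it executes.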
You can declare a subroutine without defining it just by saying:

    sub myname;
    $me = myname $0 or die "can't get myname";

A bare declaration like that declares the function to be a list operator, not a unary operator, so be careful to use or there instead of ||. The || operator binds too tightly to use after list operators, though you can always use parentheses around the list operator's arguments to turn the list operator back into something that behaves more like a function call. Alternatively, you can use the prototype ($) to turn the subroutine into a unary operator:

    sub myname ($);
    $me = myname $0 || die "can't get myname";

That now parses as you'd expect, but you still ought to get in the habit of using or in that situation. For more on prototypes, see Chapter 6, Subroutines.

You do need to define the subroutine at some point, or you'll get an error at run time indicating that you've called an undefined subroutine. Other than defining the subroutine yourself, there are several ways to pull in definitions from elsewhere.

You can load definitions from other files with a simple require statement; this was the best way to load files in Perl 4, but there are two problems with it. First, the other file will typically insert subroutine names into a package (a symbol table) of its own choosing, not your package's. Second, a require happens at run time, so it occurs too late to serve as a declaration in the file invoking the require. There are times, however, when delayed loading is what you want.

A more useful way to pull in declarations and definitions is with the use declaration, which effectively requires the module at compile time (because use counts as a BEGIN block) and then lets you import some of the module's declarations into your own program. Thus use can be considered a kind of global declaration, in that it imports names at compile time into your own (global) package just as if you'd declared them yourself.
See the section "Symbol Tables" in Chapter 10, Packages, for low-level mechanics on how importation works between packages; Chapter 11, Modules, for how to set up a module's imports and exports; and Chapter 18 for an explanation of BEGIN and its cousins, CHECK, INIT, and END, which are also global declarations of a sort because they're dealt with at compile time and can have global effects.

Scoped Declarations

Like global declarations, lexically scoped declarations have an effect at the time of compilation. Unlike global declarations, lexically scoped declarations only apply from the point of the declaration through the end of the innermost enclosing scope (block, file, or eval, whichever comes first). That's why we call them lexically scoped, though perhaps "textually scoped" would be more accurate, since lexical scoping has little to do with lexicons. But computer scientists the world over know what "lexically scoped" means, so we perpetuate the usage here.

Perl also supports dynamically scoped declarations. A dynamic scope also extends to the end of the innermost enclosing block, but in this case "enclosing" is defined dynamically at run time rather than textually at compile time. To put it another way, blocks nest dynamically by invoking other blocks, not by including them. This nesting of dynamic scopes may correlate somewhat with the nesting of lexical scopes, but the two are generally not identical, especially when any subroutines have been invoked.

We mentioned that some aspects of use could be considered global declarations, but other aspects of use are lexically scoped. In particular, use not only imports package symbols but also implements various magical compiler hints, known as pragmas (or if you're into classical forms, pragmata). Most pragmas are lexically scoped, including the use strict 'vars' pragma, which forces you to declare your variables before you can use them. See the later section "Pragmas".
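A small sketch of what use strict 'vars' does and doesn't permit (the variable names here are invented): declared lexicals, declared globals, and fully qualified package variables all pass, while an undeclared bareword variable is a compile-time error.

```perl
use strict 'vars';
use warnings;

our $Global  = "visible";           # declared global: allowed
my  $lexical = "private";           # lexical: allowed
$main::qualified = "also fine";     # fully qualified name: allowed

# $undeclared = 1;    # would fail to compile: Global symbol
#                     # "$undeclared" requires explicit package name

my $all = join " ", $Global, $lexical, $main::qualified;
```

The explicit package qualification escape hatch is what lets strict-clean code still reach globals it genuinely needs.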
A package declaration, oddly enough, is itself lexically scoped, despite the fact that a package is a global entity. But a package declaration merely declares the identity of the default package for the rest of the enclosing block. Undeclared, unqualified variable names* are looked up in that package. In a sense, a package is never declared at all, but springs into existence when you refer to something that belongs to that package. It's all very Perlish.

* Also unqualified names of subroutines, filehandles, directory handles, and formats.

Scoped Variable Declarations

Most of the rest of the chapter is about using global variables. Or rather, it's about not using global variables. There are various declarations that help you not use global variables, or at least not use them foolishly.

We already mentioned the package declaration, which was introduced into Perl long ago to allow globals to be split up into separate packages. This works pretty well for certain kinds of variables. Packages are used by libraries, modules, and classes to store their interface data (and some of their semi-private data) to avoid conflicting with variables and functions of the same name in your main program or in other modules. If you see someone write $Some::stuff,* they're using the $stuff scalar variable from the package Some. See Chapter 10.

If this were all there were to the matter, Perl programs would quickly become unwieldy as they got longer. Fortunately, Perl's three scoping declarations make it easy to create completely private variables (using my), to give selective access to global ones (using our), and to provide temporary values to global variables (using local):

    my $nose;
    our $House;
    local $TV_channel;

If more than one variable is listed, the list must be placed in parentheses. For my and our, the elements may only be simple scalar, array, or hash variables.
For local, the constraints are somewhat more relaxed: you may also localize entire typeglobs and individual elements or slices of arrays and hashes:

    my ($nose, @eyes, %teeth);
    our ($House, @Autos, %Kids);
    local (*Spouse, $phone{HOME});

* Or the archaic $Some'stuff, which probably shouldn't be encouraged outside of Perl poetry.

Each of these modifiers offers a different sort of "confinement" to the variables they modify. To oversimplify slightly: our confines names to a scope, local confines values to a scope, and my confines both names and values to a scope.

Each of these constructs may be assigned to, though they differ in what they actually do with the values, since they have different mechanisms for storing values. They also differ somewhat if you don't (as we didn't above) assign any values to them: my and local cause the variables in question to start out with values of undef or (), as appropriate; our, on the other hand, leaves the current value of its associated global unchanged.

Syntactically, my, our, and local are simply modifiers (like adjectives) on an lvalue expression. When you assign to a modified lvalue, the modifier doesn't change whether the lvalue is viewed as a scalar or a list. To figure out how the assignment will work, just pretend that the modifier isn't there. So either of:

    my ($foo) = <STDIN>;
    my @array = <STDIN>;

supplies a list context to the righthand side, while:

    my $foo = <STDIN>;

supplies a scalar context.

Modifiers bind more tightly (with higher precedence) than the comma does. The following example erroneously declares only one variable, not two, because the list following the modifier is not enclosed in parentheses:

    my $foo, $bar = 1;      # WRONG

This has the same effect as:

    my $foo;
    $bar = 1;

You'll get a warning about the mistake if warnings are enabled, whether via the -w or -W command-line switches or, preferably, through the use warnings declaration explained later in "Pragmas".
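The "pretend the modifier isn't there" rule can be checked with a tiny sketch (the data is invented): the parentheses, not the my, decide whether the righthand side is seen in list or scalar context.

```perl
use strict;
use warnings;

my @data = (10, 20, 30);

my ($first) = @data;    # list assignment: $first gets the element 10
my $count   = @data;    # scalar assignment: $count gets the length 3
my ($x, $y) = @data;    # list assignment: the extra 30 is discarded
```

The same distinction is why my ($foo) = <STDIN> slurps every line while my $foo = <STDIN> reads just one.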
In general, it's best to declare a variable in the smallest possible scope that suits it. Since variables declared in a control-flow statement are visible only in the block governed by that statement, their visibility is reduced. It reads better in English this way, too:

    sub check_warehouse {
        for my $widget (our @Current_Inventory) {
            print "I have a $widget in stock today.\n";
        }
    }

The most frequently seen form of declaration is my, which declares lexically scoped variables for which both the names and values are stored in the current scope's temporary scratchpad and may not be accessed globally. Closely related is the our declaration, which enters a lexically scoped name in the current scope, just as my does, but actually refers to a global variable that anyone else could access if they wished. In other words, it's a global variable masquerading as a lexical. The other form of scoping, dynamic scoping, applies to local variables, which despite the word "local" are really global variables and have nothing to do with the local scratchpad.

Lexically Scoped Variables: my

To help you avoid the maintenance headaches of global variables, Perl provides lexically scoped variables, often called lexicals for short. Unlike globals, lexicals guarantee you privacy. Assuming you don't hand out references to these private variables that would let them be fiddled with indirectly, you can be certain that every possible access to these private variables is restricted to code within one discrete and easily identifiable section of your program. That's why we picked the keyword my, after all.

A statement sequence may contain declarations of lexically scoped variables. Such declarations tend to be placed at the front of the statement sequence, but this is not a requirement.
In addition to declaring variable names at compile time, the declarations act like ordinary run-time statements: each of them is elaborated within the sequence of statements as if it were an ordinary statement without the modifier:

    my $name = "fred";
    my @stuff = ("car", "house", "club");
    my ($vehicle, $home, $tool) = @stuff;

These lexical variables are totally hidden from the world outside their immediately enclosing scope. Unlike the dynamic scoping effects of local (see the next section), lexicals are hidden from any subroutine called from their scope. This is true even if the same subroutine is called from itself or elsewhere; each instance of the subroutine gets its own "scratchpad" of lexical variables.

Unlike block scopes, file scopes don't nest; there's no "enclosing" going on, at least not textually. If you load code from a separate file with do, require, or use, the code in that file cannot access your lexicals, nor can you access lexicals from that file. However, any scope within a file (or even the file itself) is fair game. It's often useful to have scopes larger than subroutine definitions, because this lets you share private variables among a limited set of subroutines. This is how you create variables that a C programmer would think of as "static":

    {
        my $state = 0;
        sub on     { $state = 1 }
        sub off    { $state = 0 }
        sub toggle { $state = !$state }
    }

The eval STRING operator also works as a nested scope, since the code in the eval can see its caller's lexicals (as long as the names aren't hidden by identical declarations within the eval's own scope). Anonymous subroutines can likewise access any lexical variables from their enclosing scopes; if they do so, they're what are known as closures.* Combining those two notions, if a block evals a string that creates an anonymous subroutine, the subroutine becomes a closure with full access to the lexicals of both the eval and the block, even after the eval and the block have exited.
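Here's a runnable sketch of a closure (the sub name make_counter is our own invention): each call to make_counter creates a fresh lexical $n, and the anonymous sub it returns keeps that $n alive after make_counter has exited.

```perl
use strict;
use warnings;

sub make_counter {
    my $n = shift || 0;     # a brand-new lexical $n on every call
    return sub { $n++ };    # the anonymous sub closes over $n
}

my $c1 = make_counter(5);
my $c2 = make_counter(100);

my @got = ($c1->(), $c1->(), $c2->());  # (5, 6, 100): separate $n's
```

Each counter carries its own private state, just like the static-style $state shared by on, off, and toggle above, except that here new instances can be minted on demand.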
See the section "Closures" in Chapter 8, References.

* As a mnemonic, note the common element between "enclosing scope" and "closure". (The actual definition of closure comes from a mathematical notion concerning the completeness of sets of values and operations on those values.)

The newly declared variable (or value, in the case of local) does not show up until the statement after the statement containing the declaration. Thus you could mirror a variable this way:

    my $x = $x;

That initializes the new inner $x with the current value of $x, whether the current meaning of $x is global or lexical. (If you don't initialize the new variable, it starts out with an undefined or empty value.)

Declaring a lexical variable of a particular name hides any previously declared lexical of the same name. It also hides any unqualified global variable of the same name, but you can always get to the global variable by explicitly qualifying it with the name of the package the global is in, for example, $PackageName::varname.

Lexically Scoped Global Declarations: our

A better way to access globals, especially for programs and modules running under the use strict declaration, is the our declaration. This declaration is lexically scoped in that it applies only through the end of the current scope. But unlike the lexically scoped my or the dynamically scoped local, our does not isolate anything to the current lexical or dynamic scope. Instead, it provides access to a global variable in the current package, hiding any lexicals of the same name that would have otherwise hidden that global from you. In this respect, our variables act just like my variables.

If you place an our declaration outside any brace-delimited block, it lasts through the end of the current compilation unit. Often, though, people put it just inside the top of a subroutine definition to indicate that they're accessing a global variable:

    sub check_warehouse {
        our @Current_Inventory;
        my $widget;
        foreach $widget (@Current_Inventory) {
            print "I have a $widget in stock today.\n";
        }
    }

Since global variables are longer in life and broader in visibility than private variables, we like to use longer and flashier names for them than for temporary variables. This practice alone, if studiously followed, can do as much as use strict can toward discouraging the use of global variables, especially in less prestidigitatorial typists.

Repeated our declarations do not meaningfully nest. Every nested my produces a new variable, and every nested local a new value. But every time you use our, you're talking about the same global variable, irrespective of nesting. When you assign to an our variable, the effects of that assignment persist after the scope of the declaration. That's because our never creates values; it just exposes a limited form of access to the global, which lives forever:

    our $PROGRAM_NAME = "waiter";
    {
        our $PROGRAM_NAME = "server";
        # Code called here sees "server".
        ...
    }
    # Code executed here still sees "server".

Contrast this with what happens under my or local, where after the block, the outer variable or value becomes visible again:

    my $i = 10;
    {
        my $i = 99;
        ...
    }
    # Code compiled here sees outer variable.

    local $PROGRAM_NAME = "waiter";
    {
        local $PROGRAM_NAME = "server";
        # Code called here sees "server".
        ...
    }
    # Code executed here sees "waiter" again.

It usually only makes sense to assign to an our declaration once, probably at the very top of the program or module, or, more rarely, when you preface the our with a local of its own:

    {
        local our @Current_Inventory = qw(bananas);
        check_warehouse();      # no, we haven't no bananas :-)
    }

Dynamically Scoped Variables: local

Using a local operator on a global variable gives it a temporary value each time local is executed, but it does not affect that variable's global visibility.
When the program reaches the end of that dynamic scope, this temporary value is discarded and the original value restored. But it’s always still a global variable that just happens to hold a temporary value while that block is executing. If you call some other function while your global contains the temporary value and that function accesses that global variable, it sees the temporary value, not the original one. In other words, that other function is in your dynamic scope, even though it’s presumably not in your lexical scope.* If you have a local that looks like this: { local $var = $newvalue; some_func(); ... } you can think of it purely in terms of run-time assignments: { $oldvalue = $var; $var = $newvalue; some_func(); ... } continue { $var = $oldvalue; } The difference is that with local the value is restored no matter how you exit the block, even if you prematurely return from that scope. The variable is still the same global variable, but the value found there depends on which scope the function was called from. That’s why it’s called dynamic scoping—because it changes during run time. As with my, you can initialize a local with a copy of the same global variable. Any changes to that variable during the execution of a subroutine (and any others called from within it, which of course can still see the dynamically scoped global) * That’s why lexical scopes are sometimes called static scopes: to contrast them with dynamic scopes and emphasize their compile-time determinability. Don’t confuse this use of the term with how static is used in C or C++. The term is heavily overloaded, which is why we avoid it. 136 Chapter 4: Statements and Declarations will be thrown away when the subroutine returns. You’d certainly better comment what you are doing, though: # WARNING: Changes are temporary to this dynamic scope. 
    local $Some_Global = $Some_Global;

A global variable, then, is still completely visible throughout your whole program, no matter whether it was explicitly declared with our or just allowed to spring into existence, or whether it’s holding a local value destined to be discarded when the scope exits. In tiny programs, this isn’t so bad, but for large ones, you’ll quickly lose track of where in the code all these global variables are being used. You can forbid accidental use of globals, if you want, through the use strict 'vars' pragma, described in the next section.

Although both my and local confer some degree of protection, by and large you should prefer my over local. Sometimes, though, you have to use local so you can temporarily change the value of an existing global variable, like those listed in Chapter 28, Special Names. Only alphanumeric identifiers may be lexically scoped, and many of those special variables aren’t strictly alphanumeric. You also need to use local to make temporary changes to a package’s symbol table as shown in the section “Symbol Tables” in Chapter 10. Finally, you can use local on a single element or a whole slice of an array or a hash. This even works if the array or hash happens to be a lexical variable, layering local’s dynamic scoping behavior on top of those lexicals. We won’t talk much more about the semantics of local here. See local in Chapter 29 for more information.

Pragmas

Many programming languages allow you to give hints to the compiler. In Perl, these hints are conveyed to the compiler with the use declaration. Some pragmas are:

    use warnings;
    use strict;
    use integer;
    use bytes;
    use constant pi => ( 4 * atan2(1,1) );

Perl pragmas are all described in Chapter 31, Pragmatic Modules, but right now we’ll just talk specifically about a couple that are most useful with the material covered in this chapter.
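Localizing a single aggregate element looks like this; the configuration hash here is a made-up example:

```perl
use strict;
use warnings;

our %Config = (verbose => 0, color => "auto");

sub report { print "verbose=$Config{verbose} color=$Config{color}\n" }

{
    local $Config{verbose} = 1;   # localize just this one element
    report();                     # verbose=1 color=auto
}
report();                         # verbose=0 color=auto (restored)
```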
Although a few pragmas are global declarations that affect global variables or the current package, most are lexically scoped declarations whose effects are constrained to last only until the end of the enclosing block, file, or eval (whichever comes first). A lexically scoped pragma can be countermanded in an inner scope with a no declaration, which works just like use but in reverse.

Controlling Warnings

To show how this works, we’ll manipulate the warnings pragma to tell Perl whether to issue warnings for questionable practices:

    use warnings;      # Enable warnings from here till end of file.
    ...
    {
        no warnings;   # Disable warnings through end of block.
        ...
    }
    # Warnings are automatically enabled again here.

Once warnings are enabled, Perl complains about variables used only once, variable declarations that mask other declarations in the same scope, improper conversions of strings into numbers, using undefined values as legitimate strings or numbers, trying to write to files you only opened read-only (or didn’t open at all), and many other conditions documented in Chapter 33, Diagnostic Messages.

The use warnings pragma is the preferred way to control warnings. Old programs could only use the -w command-line switch or modify the global $^W variable:

    {
        local $^W = 0;
        ...
    }

It’s much better to use the use warnings and no warnings pragmas. A pragma is better because it happens at compile time, because it’s a lexical declaration and therefore cannot affect code it wasn’t intended to affect, and because (although we haven’t shown you in these simple examples) it affords fine-grained control over discrete classes of warnings. For more about the warnings pragma, including how to convert merely noisy warnings into fatal errors, and how to override the pragma to turn on warnings globally even if a module says not to, see use warnings in Chapter 31.
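For instance, that fine-grained control lets you silence just one class of warnings in one scope; the category name 'numeric' is one of the standard warning classes:

```perl
use warnings;

my $n;
{
    no warnings 'numeric';   # hush non-numeric conversion gripes here only
    $n = "3 apples" + 2;     # quietly yields 5
}
print "$n\n";                # prints "5"
# Outside the block, the same addition would draw a warning about
# a non-numeric argument; other warning classes stay enabled throughout.
```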
Controlling the Use of Globals

Another commonly seen declaration is the use strict pragma, which has several functions, one of which is to control the use of global variables. Normally, Perl lets you create new globals (or all too often, step on old globals) just by mentioning them. No variable declarations are necessary—by default, that is. Because unbridled use of globals can make large programs or modules painful to maintain, you may sometimes wish to discourage their accidental use. As an aid to preventing such accidents, you can say:

    use strict 'vars';

This means that any variable mentioned from here to the end of the enclosing scope must refer either to a lexical variable or to an explicitly allowed global. If it’s not one of those, a compilation error results. A global is explicitly allowed if one of the following is true:

•   It’s one of Perl’s program-wide special variables (see Chapter 28).
•   It’s fully qualified with its package name (see Chapter 10).
•   It’s imported into the current package (see Chapter 11).
•   It’s masquerading as a lexically scoped variable via an our declaration. (This is the main reason we added our declarations to Perl.)

Of course, there’s always the fifth alternative—if the pragma proves burdensome, simply countermand it within an inner block using:

    no strict 'vars';

You can also turn on strict checking of symbolic dereferences and accidental use of barewords with this pragma. Normally people just say:

    use strict;

to enable all three strictures. See the use strict entry in Chapter 31 for more information.

5
Pattern Matching

Perl’s built-in support for pattern matching lets you search large amounts of data conveniently and efficiently.
Whether you run a huge commercial portal site scanning every newsfeed in existence for interesting tidbits, or a government organization dedicated to figuring out human demographics (or the human genome), or an educational institution just trying to get some dynamic information up on your web site, Perl is the tool of choice, in part because of its database connections, but largely because of its pattern-matching capabilities. If you take “text” in the widest possible sense, perhaps 90% of what you do is 90% text processing. That’s really what Perl is all about and always has been about—in fact, it’s even part of Perl’s name: Practical Extraction and Report Language. Perl’s patterns provide a powerful way to scan through mountains of mere data and extract useful information from it.

You specify a pattern by creating a regular expression (or regex), and Perl’s regular expression engine (the “Engine”, for the rest of this chapter) then takes that expression and determines whether (and how) the pattern matches your data. While most of your data will probably be text strings, there’s nothing stopping you from using regexes to search and replace any byte sequence, even what you’d normally think of as “binary” data. To Perl, bytes are just characters that happen to have an ordinal value less than 256. (More on that in Chapter 15, Unicode.)

If you’re acquainted with regular expressions from some other venue, we should warn you that regular expressions are a bit different in Perl. First, they aren’t entirely “regular” in the theoretical sense of the word, which means they can do much more than the traditional regular expressions taught in computer science classes. Second, they are used so often in Perl that they have their own special variables, operators, and quoting conventions which are tightly integrated into the language, not just loosely bolted on like any other library.
Programmers new to Perl often look in vain for functions like these:

    match( $string, $pattern );
    subst( $string, $pattern, $replacement );

But matching and substituting are such fundamental tasks in Perl that they merit one-letter operators: m/PATTERN/ and s/PATTERN/REPLACEMENT/ (m// and s///, for short). Not only are they syntactically brief, but they’re also parsed like double-quoted strings rather than ordinary operators; nevertheless, they operate like operators, so we’ll call them that. Throughout this chapter, you’ll see these operators used to match patterns against a string. If some portion of the string fits the pattern, we say that the match is successful. There are lots of cool things you can do with a successful pattern match. In particular, if you are using s///, a successful match causes the matched portion of the string to be replaced with whatever you specified as the REPLACEMENT.

This chapter is all about how to build and use patterns. Perl’s regular expressions are potent, packing a lot of meaning into a small space. They can therefore be daunting if you try to intuit the meaning of a long pattern as a whole. But if you can break it up into its parts, and if you know how the Engine interprets those parts, you can understand any regular expression. It’s not unusual to see a hundred-line C or Java program expressed with a one-line regular expression in Perl. That regex may be a little harder to understand than any single line out of the longer program; on the other hand, the regex will likely be much easier to understand than the longer program taken as a whole. You just have to keep these things in perspective.

The Regular Expression Bestiary

Before we dive into the rules for interpreting regular expressions, let’s see what some patterns look like. Most characters in a regular expression simply match themselves. If you string several characters in a row, they must match in order, just as you’d expect.
So if you write the pattern match:

    /Frodo/

you can be sure that the pattern won’t match unless the string contains the substring “Frodo” somewhere. (A substring is just a part of a string.) The match could be anywhere in the string, just as long as those five characters occur somewhere, next to each other and in that order.

Other characters don’t match themselves, but “misbehave” in some way. We call these metacharacters. (All metacharacters are naughty in their own right, but some are so bad that they also cause other nearby characters to misbehave as well.) Here are the miscreants:

    \ | ( ) [ { ^ $ * + ? .

Metacharacters are actually very useful and have special meanings inside patterns; we’ll tell you all those meanings as we go along. But we do want to reassure you that you can always match any of these twelve characters literally by putting a backslash in front of it. For example, backslash is itself a metacharacter, so to match a literal backslash, you’d backslash the backslash: \\.

You see, backslash is one of those characters that makes other characters misbehave. It just works out that when you make a misbehaving metacharacter misbehave, it ends up behaving—a double negative, as it were. So backslashing a character to get it to be taken literally works, but only on punctuational characters; backslashing an (ordinarily well-behaved) alphanumeric character does the opposite: it turns the literal character into something special. Whenever you see such a two-character sequence:

    \b \D \t \3 \s

you’ll know that the sequence is a metasymbol that matches something strange. For instance, \b matches a word boundary, while \t matches an ordinary tab character. Notice that a tab is one character wide, while a word boundary is zero characters wide because it’s the spot between two characters. So we call \b a zero-width assertion. Still, \t and \b are alike in that they both assert something about a particular spot in the string.
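A quick sketch of backslashing working in both directions; the strings are our own:

```perl
use strict;
use warnings;

# Backslashing a metacharacter makes it literal...
print "literal dot\n" if "3.14" =~ /3\.14/;    # matches an actual "."
print "any char\n"    if "3x14" =~ /3.14/;     # unescaped . matches "x" too

# ...while backslashing an ordinary letter makes it special.
print "boundary\n"   if "the catalog" =~ /\bcat/;    # "cat" starts at a word boundary
print "whole word\n" if "the cat sat" =~ /\bcat\b/;  # "cat" as a whole word
print "no match\n" unless "the catalog" =~ /\bcat\b/; # no boundary between "cat" and "alog"
```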
Whenever you assert something in a regular expression, you’re just claiming that that particular something has to be true in order for the pattern to match. Most pieces of a regular expression are some sort of assertion, including the ordinary characters that simply assert that they match themselves. To be precise, they also assert that the next thing will match one character later in the string, which is why we talk about the tab character being “one character wide”. Some assertions (like \t) eat up some of the string as they match, and others (like \b) don’t. But we usually reserve the term “assertion” for the zero-width assertions. To avoid confusion, we’ll call the thing with width an atom. (If you’re a physicist, you can think of nonzero-width atoms as massive, in contrast to the zero-width assertions, which are massless like photons.)

You’ll also see some metacharacters that aren’t assertions; rather, they’re structural (just as braces and semicolons define the structure of ordinary Perl code, but don’t really do anything). These structural metacharacters are in some ways the most important ones because the crucial first step in learning to read regular expressions is to teach your eyes to pick out the structural metacharacters. Once you’ve learned that, reading regular expressions is a breeze.*

One such structural metacharacter is the vertical bar, which indicates alternation:

    /Frodo|Pippin|Merry|Sam/

That means that any of those strings can trigger a match; this is covered in “Alternation” later in the chapter. And in the “Capturing and Clustering” section after that, we’ll show you how to use parentheses around portions of your pattern to do grouping:

    /(Frodo|Drogo|Bilbo) Baggins/

or even:

    /(Frod|Drog|Bilb)o Baggins/

Another thing you’ll see are what we call quantifiers, which say how many of the previous thing should match in a row. Quantifiers look like this:

    *  +  ?  *?  {3}  {2,5}

You’ll never see them in isolation like that, though. Quantifiers only make sense when attached to atoms—that is, to assertions that have width.† Quantifiers attach to the previous atom only, which in human terms means they normally quantify only one character. If you want to match three copies of “bar” in a row, you need to group the individual characters of “bar” into a single “molecule” with parentheses, like this:

    /(bar){3}/

That will match “barbarbar”. If you’d said /bar{3}/, that would match “barrr”—which might qualify you as Scottish but disqualify you as barbarbaric. (Then again, maybe not. Some of our favorite metacharacters are Scottish.) For more on quantifiers, see “Quantifiers” later.

Now that you’ve seen a few of the beasties that inhabit regular expressions, you’re probably anxious to start taming them. However, before we discuss regular expressions in earnest, we need to backtrack a little and talk about the pattern-matching operators that make use of regular expressions. (And if you happen to spot a few more regex beasties along the way, just leave a decent tip for the tour guide.)

* Admittedly, a stiff breeze at times, but not something that will blow you away.
† Quantifiers are a bit like the statement modifiers in Chapter 4, Statements and Declarations, which can only attach to a single statement. Attaching a quantifier to a zero-width assertion would be like trying to attach a while modifier to a declaration—either of which makes about as much sense as asking your local apothecary for a pound of photons. Apothecaries only deal in atoms and such.

Pattern-Matching Operators

Zoologically speaking, Perl’s pattern-matching operators function as a kind of cage for regular expressions, to keep them from getting out. This is by design; if we were to let the regex beasties wander throughout the language, Perl would be a total jungle.
The world needs its jungles, of course—they’re the engines of biological diversity, after all—but jungles should stay where they belong. Similarly, despite being the engines of combinatorial diversity, regular expressions should stay inside pattern match operators where they belong. It’s a jungle in there. As if regular expressions weren’t powerful enough, the m// and s/// operators also provide the (likewise confined) power of double-quote interpolation. Since patterns are parsed like double-quoted strings, all the normal double-quote conventions will work, including variable interpolation (unless you use single quotes as the delimiter) and special characters indicated with backslash escapes. (See “Specific Characters” later in this chapter.) These are applied before the string is interpreted as a regular expression. (This is one of the few places in the Perl language where a string undergoes more than one pass of processing.) The first pass is not quite normal double-quote interpolation, in that it knows what it should interpolate and what it should pass on to the regular expression parser. So, for instance, any $ immediately followed by a vertical bar, closing parenthesis, or the end of the string will be treated not as a variable interpolation, but as the traditional regex assertion meaning end-of-line. So if you say: $foo = "bar"; /$foo$/; the double-quote interpolation pass knows that those two $ signs are functioning differently. It does the interpolation of $foo, then hands this to the regular expression parser: /bar$/; Another consequence of this two-pass parsing is that the ordinary Perl tokener finds the end of the regular expression first, just as if it were looking for the terminating delimiter of an ordinary string. Only after it has found the end of the string (and done any variable interpolation) is the pattern treated as a regular expression. 
Among other things, this means you can’t “hide” the terminating delimiter of a pattern inside a regex construct (such as a character class or a regex comment, which we haven’t covered yet). Perl will see the delimiter wherever it is and terminate the pattern at that point. 144 Chapter 5: Pattern Matching You should also know that interpolating variables into a pattern slows down the pattern matcher, because it feels it needs to check whether the variable has changed, in case it has to recompile the pattern (which will slow it down even further). See “Variable Interpolation” later in this chapter. The tr/// transliteration operator does not interpolate variables; it doesn’t even use regular expressions! (In fact, it probably doesn’t belong in this chapter at all, but we couldn’t think of a better place to put it.) It does share one feature with m// and s///, however: it binds to variables using the =˜ and !˜ operators. The =˜ and !˜ operators, described in Chapter 3, Unary and Binary Operators, bind the scalar expression on their lefthand side to one of three quote-like operators on their right: m// for matching a pattern, s/// for substituting some string for a substring matched by a pattern, and tr/// (or its synonym, y///) for transliterating one set of characters to another set. (You may write m// as //, without the m, if slashes are used for the delimiter.) If the righthand side of =˜ or !˜ is none of these three, it still counts as a m// matching operation, but there’ll be no place to put any trailing modifiers (see “Pattern Modifiers” later), and you’ll have to handle your own quoting: print "matches" if $somestring =˜ $somepattern; Really, there’s little reason not to spell it out explicitly: print "matches" if $somestring =˜ m/$somepattern/; When used for a matching operation, =˜ and !˜ are sometimes pronounced “matches” and “doesn’t match” respectively (although “contains” and “doesn’t contain” might cause less confusion). 
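Those pronunciations map directly onto code; the strings here are invented for the example:

```perl
use strict;
use warnings;

my $line = "error: disk full";

print "matches\n"       if $line =~ /error/;     # $line "matches" /error/
print "doesn't match\n" if $line !~ /warning/;   # $line "doesn't match" /warning/

# tr/// binds the same way, even though it uses no pattern at all:
(my $rot13 = "perl") =~ tr/a-zA-Z/n-za-mN-ZA-M/;
print "$rot13\n";                                # prints "crey"
```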
Apart from the m// and s/// operators, regular expressions show up in two other places in Perl. The first argument to the split function is a special match operator specifying what not to return when breaking a string into multiple substrings. See the description and examples for split in Chapter 29, Functions. The qr// (“quote regex”) operator also specifies a pattern via a regex, but it doesn’t try to match anything (unlike m//, which does). Instead, the compiled form of the regex is returned for future use. See “Variable Interpolation” for more information. You apply one of the m//, s///, or tr/// operators to a particular string with the =˜ binding operator (which isn’t a real operator, just a kind of topicalizer, linguistically speaking). Here are some examples: $haystack =˜ m/needle/ $haystack =˜ /needle/ # match a simple pattern # same thing $italiano =˜ s/butter/olive oil/ # a healthy substitution $rotate13 =˜ tr/a-zA-Z/n-za-mN-ZA-M/ # easy encryption (to break) Pattern-Matching Operators 145 Without a binding operator, $_ is implicitly used as the “topic”: /new life/ and /new civilizations/ # search in $_ and (if found) # boldly search $_ again s/sugar/aspartame/ # substitute a substitute into $_ tr/ATCG/TAGC/ # complement the DNA stranded in $_ Because s/// and tr/// change the scalar to which they’re applied, you may only use them on valid lvalues: "onshore" =˜ s/on/off/; # WRONG: compile-time error However, m// works on the result of any scalar expression: if ((lc $magic_hat->fetch_contents->as_string) =˜ /rabbit/) { print "Nyaa, what’s up doc?\n"; } else { print "That trick never works!\n"; } But you have to be a wee bit careful, since =˜ and !˜ have rather high precedence — in our previous example the parentheses are necessary around the left term.* The !˜ binding operator works like =˜, but negates the logical result of the operation: if ($song !˜ /words/) { print qq/"$song" appears to be a song without words.\n/; } Since m//, s///, and tr/// are 
quote operators, you may pick your own delimiters. These work in the same way as the quoting operators q//, qq//, qr//, and qw// (see the section “Pick your own quotes” in Chapter 2, Bits and Pieces):

    $path =~ s#/tmp#/var/tmp/scratch#;

    if ($dir =~ m[/bin]) {
        print "No binary directories please.\n";
    }

When using paired delimiters with s/// or tr///, if the first part is one of the four customary bracketing pairs (angle, round, square, or curly), you may choose different delimiters for the second part than you chose for the first:

    s(egg)<larva>;
    s{larva}{pupa};
    s[pupa]/imago/;

* Without the parentheses, the lower-precedence lc would have applied to the whole pattern match instead of just the method call on the magic hat object.

Whitespace is allowed in front of the opening delimiters:

    s (egg) <larva>;
    s {larva} {pupa};
    s [pupa] /imago/;

Each time a pattern successfully matches (including the pattern in a substitution), it sets the $`, $&, and $' variables to the text left of the match, the whole match, and the text right of the match. This is useful for pulling apart strings into their components:

    "hot cross buns" =~ /cross/;
    print "Matched: <$`> $& <$'>\n";    # Matched: <hot > cross < buns>
    print "Left:    <$`>\n";            # Left:    <hot >
    print "Match:   <$&>\n";            # Match:   <cross>
    print "Right:   <$'>\n";            # Right:   < buns>

For better granularity and efficiency, use parentheses to capture the particular portions that you want to keep around. Each pair of parentheses captures the substring corresponding to the subpattern in the parentheses.
The pairs of parentheses are numbered from left to right by the positions of the left parentheses; the substrings corresponding to those subpatterns are available after the match in the numbered variables, $1, $2, $3, and so on:* $_ = "Bilbo Baggins’s birthday is September 22"; /(.*)’s birthday is (.*)/; print "Person: $1\n"; print "Date: $2\n"; $‘, $&, $’, and the numbered variables are global variables implicitly localized to the enclosing dynamic scope. They last until the next successful pattern match or the end of the current scope, whichever comes first. More on this later, in a different scope. Once Perl sees that you need one of $‘, $&, or $’ anywhere in the program, it provides them for every pattern match. This will slow down your program a bit. Perl uses a similar mechanism to produce $1, $2, and so on, so you also pay a price for each pattern that contains capturing parentheses. (See “Clustering” to avoid the cost of capturing while still retaining the grouping behavior.) But if you never use $‘ $&, or $’, then patterns without capturing parentheses will not be penalized. So it’s usually best to avoid $‘, $&, and $’ if you can, especially in library modules. But if you must use them once (and some algorithms really appreciate their convenience), then use them at will, because you’ve already paid the price. $& is not so costly as the other two in recent versions of Perl. * Not $0, though, which holds the name of your program. Pattern-Matching Operators 147 Pattern Modifiers We’ll discuss the individual pattern-matching operators in a moment, but first we’d like to mention another thing they all have in common, modifiers. Immediately following the final delimiter of an m//, s///, qr//, or tr/// operator, you may optionally place one or more single-letter modifiers, in any order. For clarity, modifiers are usually written as “the /o modifier” and pronounced “the slash oh modifier”, even though the final delimiter might be something other than a slash. 
(Sometimes people say “flag” or “option” to mean “modifier”; that’s okay too.) Some modifiers change the behavior of the individual operator, so we’ll describe those in detail later. Others change how the regex is interpreted, so we’ll talk about them here. The m//, s///, and qr// operators* all accept the following modifiers after their final delimiter:

    Modifier   Meaning
    /i         Ignore alphabetic case distinctions (case insensitive).
    /s         Let . match newline and ignore deprecated $* variable.
    /m         Let ^ and $ match next to embedded \n.
    /x         Ignore (most) whitespace and permit comments in pattern.
    /o         Compile pattern once only.

* The tr/// operator does not take regexes, so these modifiers do not apply.

The /i modifier says to match both upper- and lowercase (and title case, under Unicode). That way /perl/i would also match the strings “PROPERLY” or “Perlaceous” (amongst other things). A use locale pragma may also have some influence on what is considered to be equivalent. (This may be a negative influence on strings containing Unicode.)

The /s and /m modifiers don’t involve anything kinky. Rather, they affect how Perl treats matches against a string that contains newlines. But they aren’t about whether your string actually contains newlines; they’re about whether Perl should assume that your string contains a single line (/s) or multiple lines (/m), because certain metacharacters work differently depending on whether they’re expected to behave in a line-oriented fashion or not.

Ordinarily, the metacharacter “.” matches any one character except a newline, because its traditional meaning is to match characters within a line. With /s, however, the “.” metacharacter can also match a newline, because you’ve told Perl to ignore the fact that the string might contain multiple newlines. (The /s modifier also makes Perl ignore the deprecated $* variable, which we hope you too have
The /m modifier, on the other hand, changes the interpretation of the ˆ and $ metacharacters by letting them match next to newlines within the string instead of considering only the ends of the string. See the examples in the section “Positions” later in this chapter. The /o modifier controls pattern recompilation. Unless the delimiters chosen are single quotes (m’PATTERN’, s’PATTERN’REPLACEMENT’, or qr’PATTERN’), any variables in the pattern will be interpolated (and may cause the pattern to be recompiled) every time the pattern operator is evaluated. If you want such a pattern to be compiled once and only once, use the /o modifier. This prevents expensive run-time recompilations; it’s useful when the value you are interpolating won’t change during execution. However, mentioning /o constitutes a promise that you won’t change the variables in the pattern. If you do change them, Perl won’t even notice. For better control over recompilation, use the qr// regex quoting operator. See “Variable Interpolation” later in this chapter for details. The /x is the expressive modifier: it allows you to exploit whitespace and explanatory comments in order to expand your pattern’s legibility, even extending the pattern across newline boundaries. Er, that is to say, /x modifies the meaning of the whitespace characters (and the # character): instead of letting them do self-matching as ordinary characters do, it turns them into metacharacters that, oddly, now behave as whitespace (and comment characters) should. Hence, /x allows spaces, tabs, and newlines for formatting, just like regular Perl code. It also allows the # character, not normally special in a pattern, to introduce a comment that extends through the end of the current line within the pattern string.* If you want to match a real whitespace character (or the # character), then you’ll have to put it into a character class, or escape it with a backslash, or encode it using an octal or hex escape. 
(But whitespace is normally matched with a \s* or \s+ sequence, so the situation doesn’t arise often in practice.) Taken together, these features go a long way toward making traditional regular expressions a readable language. In the spirit of TMTOWTDI, there’s now more than one way to write a given regular expression. In fact, there’s more than two ways: m/\w+:(\s+\w+)\s*\d+/; # A word, colon, space, word, space, digits. m/\w+: (\s+ \w+) \s* \d+/x; # A word, colon, space, word, space, digits. m{ \w+: # Match a word and a colon. * Be careful not to include the pattern delimiter in the comment—because of its “find the end first” rule, Perl has no way of knowing you didn’t intend to terminate the pattern at that point. Pattern-Matching Operators ( \s+ \w+ ) \s* \d+ 149 # # # # # # (begin group) Match one or more spaces. Match another word. (end group) Match zero or more spaces. Match some digits }x; We’ll explain those new metasymbols later in the chapter. (This section was supposed to be about pattern modifiers, but we’ve let it get out of hand in our excitement about /x. Ah well.) Here’s a regular expression that finds duplicate words in paragraphs, stolen right out of the Perl Cookbook. It uses the /x and /i modifiers, as well as the /g modifier described later. # Find duplicate words in paragraphs, possibly spanning line boundaries. # Use /x for space and comments, /i to match both ‘is’ # in "Is is this ok?", and use /g to find all dups. $/ = ""; # "paragrep" mode while (<>) { while ( m{ \b # start at a word boundary (\w\S+) # find a wordish chunk ( \s+ # separated by some whitespace \1 # and that chunk again ) + # repeat ad lib \b # until another word boundary }xig ) { print "dup word ’$1’ at paragraph $.\n"; } } When run on this chapter, it produces warnings like this: dup word ’that’ at paragraph 100 As it happens, we know that that particular instance was intentional. 
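Before turning to the operators themselves, here is a compact demonstration of how /s and /m change what “.” and “^” mean, against a two-line string of our own devising:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

print "dot spans\n" if $text =~ /line.second/s;   # /s lets . match the \n
print "no span\n" unless $text =~ /line.second/;  # without /s, . stops at \n

print "multiline\n" if $text =~ /^second/m;       # /m lets ^ match just after \n
print "anchored\n" unless $text =~ /^second/;     # without /m, ^ means start of string
```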
The m// Operator (Matching)

    EXPR =~ m/PATTERN/cgimosx
    EXPR =~ /PATTERN/cgimosx
    EXPR =~ ?PATTERN?cgimosx
    m/PATTERN/cgimosx
    /PATTERN/cgimosx
    ?PATTERN?cgimosx

The m// operator searches the string in the scalar EXPR for PATTERN. If / or ? is the delimiter, the initial m is optional. Both ? and ' have special meanings as delimiters: the first is a once-only match; the second suppresses variable interpolation and the six translation escapes (\U and company, described later).

If PATTERN evaluates to a null string, either because you specified it that way using // or because an interpolated variable evaluated to the empty string, the last successfully executed regular expression not hidden within an inner block (or within a split, grep, or map) is used instead.

In scalar context, the operator returns true (1) if successful, false ("") otherwise. This form is usually seen in Boolean context:

    if ($shire =~ m/Baggins/) { ... }   # search for Baggins in $shire
    if ($shire =~ /Baggins/)  { ... }   # search for Baggins in $shire
    if ( m#Baggins# )         { ... }   # search right here in $_
    if ( /Baggins/ )          { ... }   # search right here in $_

Used in list context, m// returns a list of substrings matched by the capturing parentheses in the pattern (that is, $1, $2, $3, and so on) as described later under “Capturing and Clustering”. The numbered variables are still set even when the list is returned. If the match fails in list context, a null list is returned. If the match succeeds in list context but there were no capturing parentheses (nor /g), a list value of (1) is returned. Since it returns a null list on failure, this form of m// can also be used in Boolean context, but only when participating indirectly via a list assignment:

    if (($key,$value) = /(\w+): (.*)/) { ... }

Valid modifiers for m// (in whatever guise) are shown in Table 5-1.

Table 5-1. m// Modifiers

    Modifier   Meaning
    /i         Ignore alphabetic case.
    /m         Let ^ and $ match next to embedded \n.
    /s         Let . match newline and ignore deprecated $*.
    /x         Ignore (most) whitespace and permit comments in pattern.
    /o         Compile pattern once only.
    /g         Globally find all matches.
    /cg        Allow continued search after failed /g match.

The first five modifiers apply to the regex and were described earlier. The last two change the behavior of the match operation itself. The /g modifier specifies global matching—that is, matching as many times as possible within the string. How it behaves depends on context. In list context, m//g returns a list of all matches found. Here we find all the places someone mentioned “perl”, “Perl”, “PERL”, and so on:

    if (@perls = $paragraph =~ /perl/gi) {
        printf "Perl mentioned %d times.\n", scalar @perls;
    }

If there are no capturing parentheses within the /g pattern, then the complete matches are returned. If there are capturing parentheses, then only the strings captured are returned. Imagine a string like:

    $string = "password=xyzzy verbose=9 score=0";

Also imagine you want to use that to initialize a hash like this:

    %hash = (password => "xyzzy", verbose => 9, score => 0);

Except, of course, you don’t have a list, you have a string. To get the corresponding list, you can use the m//g operator in list context to capture all of the key/value pairs from the string:

    %hash = $string =~ /(\w+)=(\w+)/g;

The (\w+) sequence captures an alphanumeric word. See the section “Capturing and Clustering”.

Used in scalar context, the /g modifier indicates a progressive match, which makes Perl start the next match on the same variable at a position just past where the last one stopped. The \G assertion represents that position in the string; see “Positions” later in this chapter for a description of \G. If you use the /c (for “continue”) modifier in addition to /g, then when the /g runs out, the failed match doesn’t reset the position pointer.

If a ?
is the delimiter, as in ?PATTERN?, this works just like a normal /PATTERN/ search, except that it matches only once between calls to the reset operator. This can be a convenient optimization when you want to match only the first occurrence of the pattern during the run of the program, not all occurrences. The operator runs the search every time you call it, up until it finally matches something, after which it turns itself off, returning false until you explicitly turn it back on with reset. Perl keeps track of the match state for you. The ?? operator is most useful when an ordinary pattern match would find the last rather than the first occurrence:

    open DICT, "/usr/dict/words"
        or die "Can't open words: $!\n";
    while (<DICT>) {
        $first = $1 if ?(^neur.*)?;
        $last  = $1 if /(^neur.*)/;
    }
    print $first,"\n";   # prints "neurad"
    print $last,"\n";    # prints "neurypnology"

The reset operator will reset only those instances of ?? compiled in the same package as the call to reset. Saying m?? is equivalent to saying ??.

The s/// Operator (Substitution)

    LVALUE =~ s/PATTERN/REPLACEMENT/egimosx
    s/PATTERN/REPLACEMENT/egimosx

This operator searches a string for PATTERN and, if found, replaces the matched substring with the REPLACEMENT text. (Modifiers are described later in this section.)

    $lotr = $hobbit;            # Just copy The Hobbit
    $lotr =~ s/Bilbo/Frodo/g;   # and write a sequel the easy way.

The return value of an s/// operation (in scalar and list contexts alike) is the number of times it succeeded (which can be more than once if used with the /g modifier, as described earlier). On failure, since it substituted zero times, it returns false (""), which is numerically equivalent to 0.

    if ($lotr =~ s/Bilbo/Frodo/) { print "Successfully wrote sequel." }
    $change_count = $lotr =~ s/Bilbo/Frodo/g;

The replacement portion is treated as a double-quoted string.
You may use any of the dynamically scoped pattern variables described earlier ($`, $&, $', $1, $2, and so on) in the replacement string, as well as any other double-quote gizmos you care to employ. For instance, here's an example that finds all the strings "revision", "version", or "release", and replaces each with its capitalized equivalent, using the \u escape in the replacement portion:

    s/revision|version|release/\u$&/g;   # Use | to mean "or" in a pattern

All scalar variables expand in double-quote context, not just these strange ones. Suppose you had a %Names hash that mapped revision numbers to internal project names; for example, $Names{"3.0"} might be code-named "Isengard". You could use s/// to find version numbers and replace them with their corresponding project names:

    s/version ([0-9.]+)/the $Names{$1} release/g;

In the replacement string, $1 returns what the first (and only) pair of parentheses captured. (You could also use \1 as you would in the pattern, but that usage is deprecated in the replacement. In an ordinary double-quoted string, \1 means a Control-A.)

If PATTERN is a null string, the last successfully executed regular expression is used instead. Both PATTERN and REPLACEMENT are subject to variable interpolation, but a PATTERN is interpolated each time the s/// operator is evaluated as a whole, while the REPLACEMENT is interpolated every time the pattern matches. (The PATTERN can match multiple times in one evaluation if you use the /g modifier.)

As before, the first five modifiers in Table 5-2 alter the behavior of the regex; they're the same as in m// and qr//. The last two alter the substitution operator itself.

Table 5-2. s/// Modifiers

    Modifier   Meaning
    /i         Ignore alphabetic case (when matching).
    /m         Let ^ and $ match next to embedded \n.
    /s         Let . match newline and ignore deprecated $*.
    /x         Ignore (most) whitespace and permit comments in pattern.
    /o         Compile pattern once only.
    /g         Replace globally, that is, all occurrences.
    /e         Evaluate the right side as an expression.

The /g modifier is used with s/// to replace every match of PATTERN with the REPLACEMENT value, not just the first one found. An s///g operator acts as a global search and replace, making all the changes at once, much like list m//g, except that m//g doesn't change anything. (And s///g is not a progressive match as scalar m//g was.)

The /e modifier treats the REPLACEMENT as a chunk of Perl code rather than as an interpolated string. The result of executing that code is used as the replacement string. For example, s/([0-9]+)/sprintf("%#x", $1)/ge would convert all numbers into hexadecimal, changing, for example, 2851 into 0xb23. Or suppose that, in our earlier example, you weren't sure that you had names for all the versions, so you wanted to leave any others unchanged. With a little creative /x formatting, you could say:

    s{ version \s+ ( [0-9.]+ ) }
     { $Names{$1}
           ? "the $Names{$1} release"
           : $&
     }xge;

The righthand side of your s///e (or in this case, the lower side) is syntax-checked and compiled at compile time along with the rest of your program. Any syntax error is detected during compilation, but run-time exceptions are left uncaught. Each additional /e after the first one (like /ee, /eee, and so on) is equivalent to calling eval STRING on the result of the code, once per extra /e. This evaluates the result of the code expression and traps exceptions in the special $@ variable. See the section "Programmatic Patterns" later in the chapter for more details.

Modifying strings en passant

Sometimes you want a new, modified string without clobbering the old one upon which the new one was based. Instead of writing:

    $lotr = $hobbit;
    $lotr =~ s/Bilbo/Frodo/g;

you can combine these into one statement. Due to precedence, parentheses are required around the assignment, as they are with most combinations applying =~ to an expression.
    ($lotr = $hobbit) =~ s/Bilbo/Frodo/g;

Without the parentheses around the assignment, you'd only change $hobbit and get the number of replacements stored into $lotr, which would make a rather dull sequel.

You can't use a s/// operator directly on an array. For that, you need a loop. By a lucky coincidence, the aliasing behavior of for/foreach, combined with its use of $_ as the default loop variable, yields the standard Perl idiom to search and replace each element in an array:

    for (@chapters) { s/Bilbo/Frodo/g }   # Do substitutions chapter by chapter.
    s/Bilbo/Frodo/g for @chapters;        # Same thing.

As with a simple scalar variable, you can combine the substitution with an assignment if you'd like to keep the original values around, too:

    @oldhues = ('bluebird', 'bluegrass', 'bluefish', 'the blues');
    for (@newhues = @oldhues) { s/blue/red/ }
    print "@newhues\n";   # prints: redbird redgrass redfish the reds

The idiomatic way to perform repeated substitutions on the same variable is to use a once-through loop. For example, here's how to canonicalize whitespace in a variable:

    for ($string) {
        s/^\s+//;     # discard leading whitespace
        s/\s+$//;     # discard trailing whitespace
        s/\s+/ /g;    # collapse internal whitespace
    }

which just happens to produce the same result as:

    $string = join(" ", split " ", $string);

You can also use such a loop with an assignment, as we did in the array case:

    for ($newshow = $oldshow) {
        s/Fred/Homer/g;
        s/Wilma/Marge/g;
        s/Pebbles/Lisa/g;
        s/Dino/Bart/g;
    }

When a global substitution just isn't global enough

Occasionally, you can't just use a /g to get all the changes to occur, either because the substitutions have to happen right-to-left or because you need the length of $` to change between matches. You can usually do what you want by calling s/// repeatedly. However, you want the loop to stop when the s/// finally fails, so you have to put it into the conditional, which leaves nothing to do in the main part of the loop.
So we just write a 1, which is a rather boring thing to do, but bored is the best you can hope for sometimes. Here are some examples that use a few more of those odd regex beasties that keep popping up:

    # put commas in the right places in an integer
    1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

    # expand tabs to 8-column spacing
    1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;

    # remove (nested (even deeply nested (like this))) remarks
    1 while s/\([^()]*\)//g;

    # remove duplicate words (and triplicate (and quadruplicate...))
    1 while s/\b(\w+) \1\b/$1/gi;

That last one needs a loop because otherwise it would turn this:

    Paris in THE THE THE THE spring.

into this:

    Paris in THE THE spring.

which might cause someone who knows a little French to picture Paris sitting in an artesian well emitting iced tea, since "thé" is French for "tea". A Parisian is never fooled, of course.

The tr/// Operator (Transliteration)

    LVALUE =~ tr/SEARCHLIST/REPLACEMENTLIST/cds
    tr/SEARCHLIST/REPLACEMENTLIST/cds

For sed devotees, y/// is provided as a synonym for tr///. This is why you can't call a function named y, any more than you can call a function named q or m. In all other respects, y/// is identical to tr///, and we won't mention it again.

This operator might not appear to fit into a chapter on pattern matching, since it doesn't use patterns. This operator scans a string, character by character, and replaces each occurrence of a character found in SEARCHLIST (which is not a regular expression) with the corresponding character from REPLACEMENTLIST (which is not a replacement string). It looks a bit like m// and s///, though, and you can even use the =~ or !~ binding operators on it, so we describe it here. (qr// and split are pattern-matching operators, but you don't use the binding operators on them, so they're elsewhere in the book. Go figure.) Transliteration returns the number of characters replaced or deleted.
If no string is specified via the =~ or !~ operator, the $_ string is altered. The SEARCHLIST and REPLACEMENTLIST may define ranges of sequential characters with a dash:

    $message =~ tr/A-Za-z/N-ZA-Mn-za-m/;   # rot13 encryption.

Note that a range like A-Z assumes a linear character set like ASCII. But each character set has its own ideas of how characters are ordered and thus of which characters fall in a particular range. A sound principle is to use only ranges that begin from and end at either alphabets of equal case (a-e, A-E) or digits (0-4). Anything else is suspect. When in doubt, spell out the character sets in full: ABCDE.

The SEARCHLIST and REPLACEMENTLIST are not variable interpolated as double-quoted strings; you may, however, use those backslash sequences that map to a specific character, such as \n or \015.

Table 5-3 lists the modifiers applicable to the tr/// operator. They're completely different from those you apply to m//, s///, or qr//, even if some look the same.

Table 5-3. tr/// Modifiers

    Modifier   Meaning
    /c         Complement SEARCHLIST.
    /d         Delete found but unreplaced characters.
    /s         Squash duplicate replaced characters.

If the /c modifier is specified, the character set in SEARCHLIST is complemented; that is, the effective search list consists of all the characters not in SEARCHLIST. In the case of Unicode, this can represent a lot of characters, but since they're stored logically, not physically, you don't need to worry about running out of memory.

The /d modifier turns tr/// into what might be called the "transobliteration" operator: any characters specified by SEARCHLIST but not given a replacement in REPLACEMENTLIST are deleted. (This is slightly more flexible than the behavior of some tr(1) programs, which delete anything they find in SEARCHLIST, period.)

If the /s modifier is specified, sequences of characters converted to the same character are squashed down to a single instance of the character.
If the /d modifier is used, REPLACEMENTLIST is always interpreted exactly as specified. Otherwise, if REPLACEMENTLIST is shorter than SEARCHLIST, the final character is replicated until it is long enough. If REPLACEMENTLIST is null, the SEARCHLIST is replicated, which is surprisingly useful if you just want to count characters, not change them. It's also useful for squashing characters using /s.

    tr/aeiou/!/;                  # change any vowel into !
    tr{/\\\r\n\b\f. }{_};         # change strange chars into an underscore

    tr/A-Z/a-z/ for @ARGV;        # canonicalize to lowercase ASCII

    $count = ($para =~ tr/\n//);  # count the newlines in $para
    $count = tr/0-9//;            # count the digits in $_

    $word =~ tr/a-zA-Z//s;        # bookkeeper -> bokeper

    tr/@$%*//d;                   # delete any of those
    tr#A-Za-z0-9+/##cd;           # remove non-base64 chars

    # change en passant
    ($HOST = $host) =~ tr/a-z/A-Z/;

    $pathname =~ tr/a-zA-Z/_/cs;  # change non-(ASCII)alphas to single underbar

    tr [\200-\377]
       [\000-\177];               # strip 8th bit, bytewise

If the same character occurs more than once in SEARCHLIST, only the first is used. Therefore, this:

    tr/AAA/XYZ/

will change any single character A to an X (in $_).

Although variables aren't interpolated into tr///, you can still get the same effect by using eval EXPR:

    $count = eval "tr/$oldlist/$newlist/";
    die if $@;   # propagates exception from illegal eval contents

One more note: if you want to change your text to uppercase or lowercase, don't use tr///. Use the \U or \L sequences in a double-quoted string (or the equivalent uc and lc functions), since they will pay attention to locale or Unicode information and tr/a-z/A-Z/ won't. Additionally, in Unicode strings, the \u sequence and its corresponding ucfirst function understand the notion of titlecase, which for some languages may be distinct from simply converting to uppercase.
Metacharacters and Metasymbols

Now that we've admired all the fancy cages, we can go back to looking at the critters in the cages, those funny-looking symbols you put inside the patterns. By now you'll have cottoned to the fact that these symbols aren't regular Perl code like function calls or arithmetic operators. Regular expressions are their own little language nestled inside of Perl. (There's a bit of the jungle in all of us.)

For all their power and expressivity, patterns in Perl recognize the same 12 traditional metacharacters (the Dirty Dozen, as it were) found in many other regular expression packages:

    \ | ( ) [ { ^ $ * + ? .

Some of those bend the rules, making otherwise normal characters that follow them special. We don't like to call the longer sequences "characters", so when they make longer sequences, we call them metasymbols (or sometimes just "symbols"). But at the top level, those twelve metacharacters are all you (and Perl) need to think about. Everything else proceeds from there.

Some simple metacharacters stand by themselves, like . and ^ and $. They don't directly affect anything around them. Some metacharacters work like prefix operators, governing what follows them, like \. Others work like postfix operators, governing what immediately precedes them, like *, +, and ?. One metacharacter, |, acts like an infix operator, standing between the operands it governs. There are even bracketing metacharacters that work like circumfix operators, governing something contained inside them, like (...) and [...]. Parentheses are particularly important, because they specify the bounds of | on the inside, and of *, +, and ? on the outside.

If you learn only one of the twelve metacharacters, choose the backslash. (Er... and the parentheses.) That's because backslash disables the others. When a backslash precedes a nonalphanumeric character in a Perl pattern, it always makes that next character a literal.
If you need to match one of the twelve metacharacters in a pattern literally, you write them with a backslash in front. Thus, \. matches a real dot, \$ a real dollar sign, \\ a real backslash, and so on. This is known as "escaping" the metacharacter, or "quoting it", or sometimes just "backslashing" it. (Of course, you already know that backslash is used to suppress variable interpolation in double-quoted strings.)

Although a backslash turns a metacharacter into a literal character, its effect upon a following alphanumeric character goes the other direction. It takes something that was regular and makes it special. That is, together they make a metasymbol. An alphabetical list of these metasymbols can be found below in Table 5-7.

Metasymbol Tables

In the following tables, the Atomic column says "yes" if the given metasymbol is quantifiable (if it can match something with width, more or less). Also, we've used "..." to represent "something else". Please see the later discussion to find out what "..." means, if it is not clear from the one-line gloss in the table.

Table 5-4 shows the basic traditional metasymbols. The first four of these are the structural metasymbols we mentioned earlier, while the last three are simple metacharacters. The . metacharacter is an example of an atom because it matches something with width (the width of a character, in this case); ^ and $ are examples of assertions, because they match something of zero width, and because they are only evaluated to see if they're true or not.

Table 5-4. General Regex Metacharacters

    Symbol    Atomic   Meaning
    \...      Varies   De-meta next nonalphanumeric character, meta next alphanumeric character (maybe).
    ...|...   No       Alternation (match one or the other).
    (...)     Yes      Grouping (treat as a unit).
    [...]     Yes      Character class (match one character from a set).
    ^         No       True at beginning of string (or after any newline, maybe).
    .         Yes      Match one character (except newline, normally).
    $         No       True at end of string (or before any newline, maybe).

The quantifiers, which are further described in their own section, indicate how many times the preceding atom (that is, single character or grouping) should match. These are listed in Table 5-5.

Table 5-5. Regex Quantifiers

    Quantifier    Atomic   Meaning
    *             No       Match 0 or more times (maximal).
    +             No       Match 1 or more times (maximal).
    ?             No       Match 1 or 0 times (maximal).
    {COUNT}       No       Match exactly COUNT times.
    {MIN,}        No       Match at least MIN times (maximal).
    {MIN,MAX}     No       Match at least MIN but not more than MAX times (maximal).
    *?            No       Match 0 or more times (minimal).
    +?            No       Match 1 or more times (minimal).
    ??            No       Match 0 or 1 time (minimal).
    {MIN,}?       No       Match at least MIN times (minimal).
    {MIN,MAX}?    No       Match at least MIN but not more than MAX times (minimal).

A minimal quantifier tries to match as few characters as possible within its allowed range. A maximal quantifier tries to match as many characters as possible within its allowed range. For instance, .+ is guaranteed to match at least one character of the string, but will match all of them given the opportunity. The opportunities are discussed later in "The Little Engine That /Could(n't)?/".

You'll note that quantifiers may never be quantified.

We wanted to provide an extensible syntax for new kinds of metasymbols. Given that we only had a dozen metacharacters to work with, we chose a formerly illegal regex sequence to use for arbitrary syntactic extensions. These metasymbols are all of the form (?KEY...); that is, a (balanced) parenthesis followed by a question mark, followed by a KEY and the rest of the subpattern. The KEY character indicates which particular regex extension it is. See Table 5-6 for a list of these. Most of them behave structurally since they're based on parentheses, but they also have additional meanings.
Again, only atoms may be quantified because they represent something that's really there (potentially).

Table 5-6. Extended Regex Sequences

    Extension          Atomic   Meaning
    (?#...)            No       Comment, discard.
    (?:...)            Yes      Cluster-only parentheses, no capturing.
    (?imsx-imsx)       No       Enable/disable pattern modifiers.
    (?imsx-imsx:...)   Yes      Cluster-only parentheses plus modifiers.
    (?=...)            No       True if lookahead assertion succeeds.
    (?!...)            No       True if lookahead assertion fails.
    (?<=...)           No       True if lookbehind assertion succeeds.
    (?<!...)           No       True if lookbehind assertion fails.
    (?>...)            Yes      Match nonbacktracking subpattern.
    (?{...})           No       Execute embedded Perl code.
    (??{...})          Yes      Match regex from embedded Perl code.
    (?(...)...|...)    Yes      Match with if-then-else pattern.
    (?(...)...)        Yes      Match with if-then pattern.

And finally, Table 5-7 shows all of your favorite alphanumeric metasymbols. (Symbols that are processed by the variable interpolation pass are marked with a dash in the Atomic column, since the Engine never even sees them.)

Table 5-7. Alphanumeric Regex Metasymbols

    Symbol     Atomic   Meaning
    \0         Yes      Match the null character (ASCII NUL).
    \NNN       Yes      Match the character given in octal, up to \377.
    \n         Yes      Match nth previously captured string (decimal).
    \a         Yes      Match the alarm character (BEL).
    \A         No       True at the beginning of a string.
    \b         Yes      Match the backspace character (BS).
    \b         No       True at word boundary.
    \B         No       True when not at word boundary.
    \cX        Yes      Match the control character Control-X (\cZ, \c[, etc.).
    \C         Yes      Match one byte (C char) even in utf8 (dangerous).
    \d         Yes      Match any digit character.
    \D         Yes      Match any nondigit character.
    \e         Yes      Match the escape character (ASCII ESC, not backslash).
    \E         —        End case (\L, \U) or metaquote (\Q) translation.
    \f         Yes      Match the form feed character (FF).
    \G         No       True at end-of-match position of prior m//g.
    \l         —        Lowercase the next character only.
    \L         —        Lowercase till \E.
    \n         Yes      Match the newline character (usually NL, but CR on Macs).
    \N{NAME}   Yes      Match the named char (\N{greek:Sigma}).
    \p{PROP}   Yes      Match any character with the named property.
    \P{PROP}   Yes      Match any character without the named property.
    \Q         —        Quote (de-meta) metacharacters till \E.
    \r         Yes      Match the return character (usually CR, but NL on Macs).
    \s         Yes      Match any whitespace character.
    \S         Yes      Match any nonwhitespace character.
    \t         Yes      Match the tab character (HT).
    \u         —        Titlecase next character only.
    \U         —        Uppercase (not titlecase) till \E.
    \w         Yes      Match any "word" character (alphanumerics plus "_").
    \W         Yes      Match any nonword character.
    \x{abcd}   Yes      Match the character given in hexadecimal.
    \X         Yes      Match Unicode "combining character sequence" string.
    \z         No       True at end of string only.
    \Z         No       True at end of string or before optional newline.

The braces are optional on \p and \P if the property name is one character. The braces are optional on \x if the hexadecimal number is two digits or less. The braces are never optional on \N.

Only metasymbols with "Match the..." or "Match any..." descriptions may be used within character classes (square brackets). That is, character classes are limited to containing specific sets of characters, so within them you may only use metasymbols that describe other specific sets of characters, or that describe specific individual characters. Of course, these metasymbols may also be used outside character classes, along with all the other nonclassificatory metasymbols. Note, however, that \b is two entirely different beasties: it's a backspace character inside the character class, but a word boundary assertion outside.

There is some amount of overlap between the characters that a pattern can match and the characters an ordinary double-quoted string can interpolate. Since regexes undergo two passes, it is sometimes ambiguous which pass should process a given character.
When there is ambiguity, the variable interpolation pass defers the interpretation of such characters to the regular expression parser. But the variable interpolation pass can only defer to the regex parser when it knows it is parsing a regex. You can specify regular expressions as ordinary double-quoted strings, but then you must follow normal double-quote rules. Any of the previous metasymbols that happen to map to actual characters will still work, even though they're not being deferred to the regex parser. But you can't use any of the other metasymbols in ordinary double quotes (or in any similar constructs such as `...`, qq(...), qx(...), or the equivalent here documents). If you want your string to be parsed as a regular expression without doing any matching, you should be using the qr// (quote regex) operator.

Note that the case and metaquote translation escapes (\U and friends) must be processed during the variable interpolation pass because the purpose of those metasymbols is to influence how variables are interpolated. If you suppress variable interpolation with single quotes, you don't get the translation escapes either. Neither variables nor translation escapes (\U, etc.) are expanded in any single-quoted string, nor in single-quoted m'...' or qr'...' operators. Even when you do interpolation, these translation escapes are ignored if they show up as the result of variable interpolation, since by then it's too late to influence variable interpolation.

Although the transliteration operator doesn't take regular expressions, any metasymbol we've discussed that matches a single specific character also works in a tr/// operation. The rest do not (except for backslash, which continues to work in the backward way it always works).

Specific Characters

As mentioned before, everything that's not special in a pattern matches itself. That means an /a/ matches an "a", an /=/ matches an "=", and so on.
Some characters, though, aren't very easy to type in from the keyboard or, even if you manage that, don't show up on a printout; control characters are notorious for this. In a regular expression, Perl recognizes the following double-quotish character aliases:

    Escape   Meaning
    \0       Null character (ASCII NUL)
    \a       Alarm (BEL)
    \e       Escape (ESC)
    \f       Form feed (FF)
    \n       Newline (NL, CR on Mac)
    \r       Return (CR, NL on Mac)
    \t       Tab (HT)

Just as in double-quoted strings, Perl also honors the following four metasymbols in patterns:

\cX
    A named control character, like \cC for Control-C, \cZ for Control-Z, \c[ for ESC, and \c? for DEL.

\NNN
    A character specified using its two- or three-digit octal code. The leading 0 is optional, except for values less than 010 (8 decimal), since (unlike in double-quoted strings) the single-digit versions are always considered to be backreferences to captured strings within a pattern. Multiple digits are interpreted as the nth backreference if you've captured at least n substrings earlier in the pattern (where n is considered as a decimal number). Otherwise, they are interpreted as a character specified in octal.

\x{LONGHEX}
\xHEX
    A character number specified as one or two hex digits ([0-9a-fA-F]), as in \x1B. The one-digit form is usable only if the character following it is not a hex digit. If braces are used, you may use as many digits as you'd like, which may result in a Unicode character. For example, \x{262f} matches a Unicode YIN YANG.

\N{NAME}
    A named character, such as \N{GREEK SMALL LETTER EPSILON}, \N{greek:epsilon}, or \N{epsilon}. This requires the use charnames pragma described in Chapter 31, Pragmatic Modules, which also determines which flavors of those names you may use (":long", ":full", ":short" respectively, corresponding to the three styles just shown). A list of all Unicode character names can be found in your closest Unicode standards document, or in PATH_TO_PERLLIB/unicode/Names.txt.
Wildcard Metasymbols

Three special metasymbols serve as generic wildcards, each of them matching "any" character (for certain values of "any"). These are the dot ("."), \C, and \X. None of these may be used in a character class. You can't use the dot there because it would match (nearly) any character in existence, so it's something of a universal character class in its own right. If you're going to include or exclude everything, there's not much point in having a character class. The special wildcards \C and \X have special structural meanings that don't map well to the notion of choosing a single Unicode character, which is the level at which character classes work.

The dot metacharacter matches any one character other than a newline. (And with the /s modifier, it matches that, too.) Like any of the dozen special characters in a pattern, to match a dot literally, you must escape it with a backslash. For example, this checks whether a filename ends with a dot followed by a one-character extension:

    if ($pathname =~ /\.(.)\z/s) {
        print "Ends in $1\n";
    }

The first dot, the escaped one, is the literal character, and the second says "match any character". The \z says to match only at the end of the string, and the /s modifier lets the dot match a newline as well. (Yes, using a newline as a file extension Isn't Very Nice, but that doesn't mean it can't happen.)

The dot metacharacter is most often used with a quantifier. A .* matches a maximal number of characters, while a .*? matches a minimal number of characters. But it's also sometimes used without a quantifier for its width: /(..):(..):(..)/ matches three colon-separated fields, each of which is two characters long.

If you use a dot in a pattern compiled under the lexically scoped use utf8 pragma, then it will match any Unicode character. (You're not supposed to need a use utf8 for that, but accidents will happen. The pragma may not be necessary by the time you read this.)
    use utf8;
    use charnames qw/:full/;
    $BWV[887] = "G\N{MUSIC SHARP SIGN} minor";
    ($note, $black, $mode) = $BWV[887] =~ /^([A-G])(.)\s+(\S+)/;
    print "That's lookin' sharp!\n" if $black eq chr(9839);

The \X metasymbol matches a character in a more extended sense. It really matches a string of one or more Unicode characters known as a "combining character sequence". Such a sequence consists of a base character followed by any "mark" characters (diacritical markings like cedillas or diereses) that combine with that base character to form one logical unit. \X is exactly equivalent to (?:\PM\pM*). This allows it to match one logical character, even when that really comprises several separate characters. The length of the match in /\X/ would exceed one character if it matched any combining characters. (And that's character length, which has little to do with byte length.)

If you are using Unicode and really want to get at a single byte instead of a single character, you can use the \C metasymbol. This will always match one byte (specifically, one C language char type), even if this gets you out of sync with your Unicode character stream. See the appropriate warnings about doing this in Chapter 15.

Character Classes

In a pattern match, you may match any character that has—or that does not have—a particular property. There are four ways to specify character classes. You may specify a character class in the traditional way, using square brackets and enumerating the possible characters, or you may use any of three mnemonic shortcuts: the classic Perl classes, the new Perl Unicode properties, or the standard POSIX classes. Each of these shortcuts matches only one character from its set. Quantify them to match larger expanses, such as \d+ to match one or more digits. (An easy mistake is to think that \w matches a word. Use \w+ to match a word.)
Custom Character Classes

An enumerated list of characters in square brackets is called a character class and matches any one of the characters in the list. For example, [aeiouy] matches a letter that can be a vowel in English. (For Welsh add a “w”, for Scottish an “r”.) To match a right square bracket, either backslash it or place it first in the list.

Character ranges may be indicated using a hyphen and the a-z notation. Multiple ranges may be combined; for example, [0-9a-fA-F] matches one hex “digit”. You may use a backslash to protect a hyphen that would otherwise be interpreted as a range delimiter, or just put it at the beginning or end of the class (a practice which is arguably less readable but more traditional).

A caret (or circumflex, or hat, or up arrow) at the front of the character class inverts the class, causing it to match any single character not in the list. (To match a caret, either don’t put it first, or better, escape it with a backslash.) For example, [^aeiouy] matches any character that isn’t a vowel. Be careful with character class negation, though, because the universe of characters is expanding. For example, that character class matches consonants—and also matches spaces, newlines, and anything (including vowels) in Cyrillic, Greek, or nearly any other script, not to mention every ideograph in Chinese, Japanese, and Korean. And someday maybe even Cirth, Tengwar, and Klingon. (Linear B and Etruscan, for sure.) So it might be better to specify your consonants explicitly, such as [cbdfghjklmnpqrstvwxyz], or [b-df-hj-np-tv-z] for short. (This also solves the issue of “y” needing to be in two places at once, which a set complement would preclude.)

Normal character metasymbols are supported inside a character class (see “Specific Characters”), such as \n, \t, \cX, \NNN, and \N{NAME}. Additionally, you may use \b within a character class to mean a backspace, just as it does in a double-quoted string.
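Here is a small sketch pulling those pieces together—ranges, negation, and a literal right bracket placed first in the list (the sample strings are our own):

```perl
# Three ranges make a hex-digit class; a leading ^ negates the class;
# a ] placed first in the list is a literal right bracket.
print "looks like hex\n"  if "#00ff7f" =~ /\A#[0-9a-fA-F]{6}\z/;
print "has a nonvowel\n"  if "rhythm"  =~ /[^aeiouy]/;
print "bracket matched\n" if "a]b"     =~ /[]x]/;   # class of "]" and "x"
```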
Normally, in a pattern match, \b means a word boundary. But zero-width assertions don’t make any sense in character classes, so here \b returns to its normal meaning in strings. You may also use any predefined character class described later in the chapter (classic, Unicode, or POSIX), but don’t try to use them as endpoints of a range—that doesn’t make sense, so the “-” will be interpreted literally.

All other metasymbols lose their special meaning inside square brackets. In particular, you can’t use any of the three generic wildcards: “.”, \X, or \C. The first often surprises people, but it doesn’t make much sense to use the universal character class within a restricted one, and you often want to match a literal dot as part of a character class—when you’re matching filenames, for instance. It’s also meaningless to specify quantifiers, assertions, or alternation inside a character class, since the characters are interpreted individually. For example, [fee|fie|foe|foo] means the same thing as [feio|].

Classic Perl Character Class Shortcuts

Since the beginning, Perl has provided a number of character class shortcuts. These are listed in Table 5-8. All of them are backslashed alphabetic metasymbols, and in each case, the uppercase version is the negation of the lowercase version. The meanings of these are not quite as fixed as you might expect; the meanings can be influenced by locale settings. Even if you don’t use locales, the meanings can change whenever a new Unicode standard comes out, adding scripts with new digits and letters. (To keep the old byte meanings, you can always use bytes. For explanations of the utf8 meanings, see “Unicode Properties” later in this chapter. In any case, the utf8 meanings are a superset of the byte meanings.)

Table 5-8. Classic Character Classes

    Symbol  Meaning               As Bytes        As utf8
    \d      Digit                 [0-9]           \p{IsDigit}
    \D      Nondigit              [^0-9]          \P{IsDigit}
    \s      Whitespace            [ \t\n\r\f]     \p{IsSpace}
    \S      Nonwhitespace         [^ \t\n\r\f]    \P{IsSpace}
    \w      Word character        [a-zA-Z0-9_]    \p{IsWord}
    \W      Non-(word character)  [^a-zA-Z0-9_]   \P{IsWord}

(Yes, we know most words don’t have numbers or underscores in them; \w is for matching “words” in the sense of tokens in a typical programming language. Or Perl, for that matter.)

These metasymbols may be used either outside or inside square brackets, that is, either standalone or as part of a constructed character class:

    if ($var =~ /\D/)       { warn "contains non-digit" }
    if ($var =~ /[^\w\s.]/) { warn "contains non-(word, space, dot)" }

Unicode Properties

Unicode properties are available using \p{PROP} and its set complement, \P{PROP}. For the rare properties with one-character names, braces are optional, as in \pN to indicate a numeric character (not necessarily decimal—Roman numerals are numeric characters too). These property classes may be used by themselves or combined in a constructed character class:

    if ($var =~ /^\p{IsAlpha}+$/)      { print "all alphabetic" }
    if ($var =~ s/[\p{Zl}\p{Zp}]/\n/g) { print "fixed newline wannabes" }

Some properties are directly defined in the Unicode standard, and some properties are composites defined by Perl, based on the standard properties. Zl and Zp are standard Unicode properties representing line separators and paragraph separators, while IsAlpha is defined by Perl to be a property class combining the standard properties Ll, Lu, Lt, and Lo (that is, letters that are lowercase, uppercase, titlecase, or other). As of version 5.6.0 of Perl, you need to use utf8 for these properties to work. This restriction will be relaxed in the future.

There are a great many properties. We’ll list the ones we know about, but the list is necessarily incomplete.
New properties are likely to be in new versions of Unicode, and you can even define your own properties. More about that later.

The Unicode Consortium produces the online resources that turn into the various files Perl uses in its Unicode implementation. For more about these files, see Chapter 15. You can get a nice overview of Unicode in the document PATH_TO_PERLLIB/unicode/Unicode3.html where PATH_TO_PERLLIB is what is printed out by:

    perl -MConfig -le 'print $Config{privlib}'

Most Unicode properties are of the form \p{IsPROP}. The Is is optional, since it’s so common, but you may prefer to leave it in for readability.

Perl’s Unicode properties

First, Table 5-9 lists Perl’s composite properties. They’re defined to be reasonably close to the standard POSIX definitions for character classes.

Table 5-9. Composite Unicode Properties

    Property  Equivalent
    IsASCII   [\x00-\x7f]
    IsAlnum   [\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}]
    IsAlpha   [\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}]
    IsCntrl   \p{IsC}
    IsDigit   \p{Nd}
    IsGraph   [^\pC\p{IsSpace}]
    IsLower   \p{IsLl}
    IsPrint   \P{IsC}
    IsPunct   \p{IsP}
    IsSpace   [\t\n\f\r\p{IsZ}]
    IsUpper   [\p{IsLu}\p{IsLt}]
    IsWord    [_\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}]
    IsXDigit  [0-9a-fA-F]

Perl also provides the following composites for each of the main categories of standard Unicode properties (see the next section):

    Property  Meaning                       Normative
    IsC       Crazy control codes and such  Yes
    IsL       Letters                       Partly
    IsM       Marks                         Yes
    IsN       Numbers                       Yes
    IsP       Punctuation                   No
    IsS       Symbols                       No
    IsZ       Separators (Zeparators?)      Yes

Standard Unicode properties

Table 5-10 lists the most basic standard Unicode properties, derived from each character’s category. No character is a member of more than one category. Some properties are normative; others are merely informative. See the Unicode Standard for the standard spiel on just how normative the normative information is, and just how informative the informative information isn’t.

Table 5-10. Standard Unicode Properties

    Property  Meaning                     Normative
    IsCc      Other, Control              Yes
    IsCf      Other, Format               Yes
    IsCn      Other, Not assigned         Yes
    IsCo      Other, Private Use          Yes
    IsCs      Other, Surrogate            Yes
    IsLl      Letter, Lowercase           Yes
    IsLm      Letter, Modifier            No
    IsLo      Letter, Other               No
    IsLt      Letter, Titlecase           Yes
    IsLu      Letter, Uppercase           Yes
    IsMc      Mark, Combining             Yes
    IsMe      Mark, Enclosing             Yes
    IsMn      Mark, Nonspacing            Yes
    IsNd      Number, Decimal digit       Yes
    IsNl      Number, Letter              Yes
    IsNo      Number, Other               Yes
    IsPc      Punctuation, Connector      No
    IsPd      Punctuation, Dash           No
    IsPe      Punctuation, Close          No
    IsPf      Punctuation, Final quote    No
    IsPi      Punctuation, Initial quote  No
    IsPo      Punctuation, Other          No
    IsPs      Punctuation, Open           No
    IsSc      Symbol, Currency            No
    IsSk      Symbol, Modifier            No
    IsSm      Symbol, Math                No
    IsSo      Symbol, Other               No
    IsZl      Separator, Line             Yes
    IsZp      Separator, Paragraph        Yes
    IsZs      Separator, Space            Yes

Another useful set of properties has to do with whether a given character can be decomposed (either canonically or compatibly) into other simpler characters. Canonical decomposition doesn’t lose any formatting information. Compatibility decomposition may lose formatting information such as whether a character is a superscript.
    Property      Information Lost
    IsDecoCanon   Nothing
    IsDecoCompat  Something (one of the following)
    IsDCcircle    Circle around character
    IsDCfinal     Final position preference (Arabic)
    IsDCfont      Variant font preference
    IsDCfraction  Vulgar fraction characteristic
    IsDCinitial   Initial position preference (Arabic)
    IsDCisolated  Isolated position preference (Arabic)
    IsDCmedial    Medial position preference (Arabic)
    IsDCnarrow    Narrow characteristic
    IsDCnoBreak   Nonbreaking preference on space or hyphen
    IsDCsmall     Small characteristic
    IsDCsquare    Square around CJK character
    IsDCsub       Subscription
    IsDCsuper     Superscription
    IsDCvertical  Rotation (horizontal to vertical)
    IsDCwide      Wide characteristic
    IsDCcompat    Identity (miscellaneous)

Here are some properties of interest to people doing bidirectional rendering:

    Property    Meaning
    IsBidiL     Left-to-right
    IsBidiLRE   Left-to-right embedding
    IsBidiLRO   Left-to-right override
    IsBidiR     Right-to-left (Arabic, Hebrew)
    IsBidiAL    Right-to-left Arabic
    IsBidiRLE   Right-to-left embedding
    IsBidiRLO   Right-to-left override
    IsBidiPDF   Pop directional format
    IsBidiEN    European number
    IsBidiES    European number separator
    IsBidiET    European number terminator
    IsBidiAN    Arabic number
    IsBidiCS    Common number separator
    IsBidiNSM   Nonspacing mark
    IsBidiBN    Boundary neutral
    IsBidiB     Paragraph separator
    IsBidiS     Segment separator
    IsBidiWS    Whitespace
    IsBidiON    Other Neutrals
    IsMirrored  Reverse when used right-to-left

The following properties classify various syllabaries according to vowel sounds:

    IsSylA    IsSylAA   IsSylAAI  IsSylAI   IsSylC    IsSylE    IsSylEE   IsSylI
    IsSylII   IsSylN    IsSylO    IsSylOO   IsSylU    IsSylV    IsSylWA   IsSylWAA
    IsSylWC   IsSylWE   IsSylWEE  IsSylWI   IsSylWII  IsSylWO   IsSylWOO  IsSylWU
    IsSylWV

For example, \p{IsSylA} would match \N{KATAKANA LETTER KA} but not \N{KATAKANA LETTER KU}.

Now that we’ve basically told you all these Unicode 3.0 properties, we should point out that a few of the more esoteric ones aren’t implemented in version 5.6.0 of Perl because its implementation was based in part on Unicode 2.0, and things like the bidirectional algorithm were still being worked out. However, by the time you read this, the missing properties may well be implemented, so we listed them anyway.

Unicode block properties

Some Unicode properties are of the form \p{InSCRIPT}. (Note the distinction between Is and In.) The In properties are for testing block ranges of a particular SCRIPT. If you have a character, and you wonder whether it was written in Greek script, you could test with:

    print "It's Greek to me!\n" if chr(931) =~ /\p{InGreek}/;

That works by checking whether a character is “in” the valid range of that script type. This may be negated with \P{InSCRIPT} to find out whether something isn’t in a particular script’s block, such as \P{InDingbats} to test whether a string contains a non-dingbat.
Block properties include the following:

    InArabic      InArmenian    InArrows      InBasicLatin  InBengali
    InBopomofo    InBoxDrawing  InCherokee    InCyrillic    InDevanagari
    InDingbats    InEthiopic    InGeorgian    InGreek       InGujarati
    InGurmukhi    InHangulJamo  InHebrew      InHiragana    InKanbun
    InKannada     InKatakana    InKhmer       InLao         InMalayalam
    InMongolian   InMyanmar     InOgham       InOriya       InRunic
    InSinhala     InSpecials    InSyriac      InTamil       InTelugu
    InThaana      InThai        InTibetan     InYiRadicals  InYiSyllables

Not to mention jawbreakers like these:

    InAlphabeticPresentationForms       InHalfwidthandFullwidthForms
    InArabicPresentationForms-A         InHangulCompatibilityJamo
    InArabicPresentationForms-B         InHangulSyllables
    InBlockElements                     InHighPrivateUseSurrogates
    InBopomofoExtended                  InHighSurrogates
    InBraillePatterns                   InIdeographicDescriptionCharacters
    InCJKCompatibility                  InIPAExtensions
    InCJKCompatibilityForms             InKangxiRadicals
    InCJKCompatibilityIdeographs        InLatin-1Supplement
    InCJKRadicalsSupplement             InLatinExtended-A
    InCJKSymbolsandPunctuation          InLatinExtended-B
    InCJKUnifiedIdeographs              InLatinExtendedAdditional
    InCJKUnifiedIdeographsExtensionA    InLetterlikeSymbols
    InCombiningDiacriticalMarks         InLowSurrogates
    InCombiningHalfMarks                InMathematicalOperators
    InCombiningMarksforSymbols          InMiscellaneousSymbols
    InControlPictures                   InMiscellaneousTechnical
    InCurrencySymbols                   InNumberForms
    InEnclosedAlphanumerics             InOpticalCharacterRecognition
    InEnclosedCJKLettersandMonths       InPrivateUse
    InGeneralPunctuation                InSmallFormVariants
    InGeometricShapes                   InSpacingModifierLetters
    InGreekExtended                     InSuperscriptsandSubscripts

And the winner is:

    InUnifiedCanadianAboriginalSyllabics

See PATH_TO_PERLLIB/unicode/In/*.pl to get an up-to-date listing of all of these character block properties. Note that these In properties are only testing to see if the character is in the block of characters allocated for that script.
There is no guarantee that all characters in that range are defined; you also need to test against one of the Is properties discussed earlier to see if the character is defined. There is also no guarantee that a particular language doesn’t use characters outside its assigned block. In particular, many European languages mix extended Latin characters with Latin-1 characters.

But hey, if you need a particular property that isn’t provided, that’s not a big problem. Read on.

Defining your own character properties

To define your own property, you need to write a subroutine with the name of the property you want (see Chapter 6, Subroutines). The subroutine should be defined in the package that needs the property (see Chapter 10, Packages), which means that if you want to use it in multiple packages, you’ll either have to import it from a module (see Chapter 11, Modules), or inherit it as a class method from the package in which it is defined (see Chapter 12, Objects). Once you’ve got that all settled, the subroutine should return data in the same format as the files in the PATH_TO_PERLLIB/unicode/Is directory. That is, just return a list of characters or character ranges in hexadecimal, one per line. If there is a range, the two numbers are separated by a tab.

Suppose you wanted a property that would be true if your character is in the range of either of the Japanese syllabaries, known as hiragana and katakana. (Together they’re known as kana.) You can just put in the two ranges like this:

    sub InKana {
        return <<'END';
    3040	309F
    30A0	30FF
    END
    }

Alternatively, you could define it in terms of existing property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

You can also do set subtraction using a “-” prefix. Suppose you only wanted the actual characters, not just the block ranges of characters.
You could weed out all the undefined ones like this:

    sub IsKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

You can also start with a complemented character set using the “!” prefix:

    sub IsNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Perl itself uses exactly the same tricks to define the meanings of its “classic” character classes (like \w) when you include them in your own custom character classes (like [-.\w\s]). You might think that the more complicated you get with your rules, the slower they will run, but in fact, once Perl has calculated the bit pattern for a particular 64-bit swatch of your property, it caches it so it never has to recalculate the pattern again. (It does it in 64-bit swatches so that it doesn’t even have to decode your utf8 to do its lookups.) Thus, all character classes, builtin or custom, run at essentially the same speed (fast) once they get going.

POSIX-Style Character Classes

Unlike Perl’s other character class shortcuts, the POSIX-style character-class syntax notation, [:CLASS:], is available for use only when constructing other character classes, that is, inside an additional pair of square brackets. For example, /[.,[:alpha:][:digit:]]/ will search for one character that is either a literal dot (because it’s in a character class), a comma, an alphabetic character, or a digit. The POSIX classes available as of revision 5.6 of Perl are shown in Table 5-11.

Table 5-11. POSIX Character Classes

    Class   Meaning
    alnum   Any alphanumeric, that is, an alpha or a digit.
    alpha   Any letter. (That's a lot more letters than you think, unless
            you're thinking Unicode, in which case it's still a lot.)
    ascii   Any character with an ordinal value between 0 and 127.
    cntrl   Any control character. Usually characters that don't produce
            output as such, but instead control the terminal somehow; for
            example, newline, form feed, and backspace are all control
            characters. Characters with an ord value less than 32 are most
            often classified as control characters.
    digit   A character representing a decimal digit, such as 0 to 9.
            (Includes other characters under Unicode.) Equivalent to \d.
    graph   Any alphanumeric or punctuation character.
    lower   A lowercase letter.
    print   Any alphanumeric or punctuation character or space.
    punct   Any punctuation character.
    space   Any space character. Includes tab, newline, form feed, and
            carriage return (and a lot more under Unicode). Equivalent to \s.
    upper   Any uppercase (or titlecase) letter.
    word    Any identifier character, either an alnum or underline.
    xdigit  Any hexadecimal digit. Though this may seem silly ([0-9a-fA-F]
            works just fine), it is included for completeness.

You can negate the POSIX character classes by prefixing the class name with a ^ following the [:. (This is a Perl extension.) For example:

    POSIX        Classic
    [:^digit:]   \D
    [:^space:]   \S
    [:^word:]    \W

If the use utf8 pragma is not requested, but the use locale pragma is, the classes correlate directly with the equivalent functions in the C library’s isalpha(3) interface (except for word, which is a Perl extension, mirroring \w). If the utf8 pragma is used, POSIX character classes are exactly equivalent to the corresponding Is properties listed in Table 5-9. For example, [:lower:] and \p{Lower} are equivalent, except that the POSIX classes may only be used within constructed character classes, whereas Unicode properties have no such restriction and may be used in patterns wherever Perl shortcuts like \s and \w may be used. The brackets are part of the POSIX-style [::] construct, not part of the whole character class. This leads to writing patterns like /^[[:lower:][:digit:]]+$/ to match a string consisting entirely of lowercase letters or digits (plus an optional trailing newline).
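A short sketch of the doubled-bracket syntax, including the negated form (the sample strings are our own):

```perl
# POSIX classes only work inside an enclosing character class, hence [[ ]].
print "identifier-ish\n" if "user_42" =~ /\A[[:alpha:]_][[:word:]]*\z/;
print "no whitespace\n"  if "user_42" =~ /\A[[:^space:]]+\z/;  # negated class
print "all hex\n"        if "BEEF"    =~ /\A[[:xdigit:]]+\z/;
```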
In particular, this does not work:

    42 =~ /^[:digit:]$/     # WRONG

That’s because it’s not inside a character class. Rather, it is a character class, the one representing the characters “:”, “i”, “t”, “g”, and “d”. Perl doesn’t care that you specified “:” twice. Here’s what you need instead:

    42 =~ /^[[:digit:]]+$/

The POSIX character classes [.cc.] and [=cc=] are recognized but produce an error indicating they are not supported. Trying to use any POSIX character class in older versions of Perl is likely to fail miserably, and perhaps even silently. If you’re going to use POSIX character classes, it’s best to require a new version of Perl by saying:

    use 5.6.0;

Quantifiers

Unless you say otherwise, each item in a regular expression matches just once. With a pattern like /nop/, each of those characters must match, each right after the other. Words like “panoply” or “xenophobia” are fine, because where the match occurs doesn’t matter. If you wanted to match both “xenophobia” and “Snoopy”, you couldn’t use the /nop/ pattern, since that requires just one “o” between the “n” and the “p”, and Snoopy has two. This is where quantifiers come in handy: they say how many times something may match, instead of the default of matching just once.

Quantifiers in a regular expression are like loops in a program; in fact, if you think of a regex as a program, then they are loops. Some loops are exact, like “repeat this match five times only” ({5}). Others give both lower and upper bounds on the match count, like “repeat this match at least twice but no more than four times” ({2,4}). Others have no closed upper bound at all, like “match this at least twice, but as many times as you’d like” ({2,}). Table 5-12 shows the quantifiers that Perl recognizes in a pattern.

Table 5-12. Regex Quantifiers Compared

    Maximal     Minimal      Allowed Range
    {MIN,MAX}   {MIN,MAX}?   Must occur at least MIN times but no more than MAX times
    {MIN,}      {MIN,}?      Must occur at least MIN times
    {COUNT}     {COUNT}?     Must match exactly COUNT times
    *           *?           0 or more times (same as {0,})
    +           +?           1 or more times (same as {1,})
    ?           ??           0 or 1 time (same as {0,1})

Something with a * or a ? doesn’t actually have to match. That’s because they can match 0 times and still be considered a success. A + may often be a better fit, since it has to be there at least once.

Don’t be confused by the use of “exactly” in the previous table. It refers only to the repeat count, not the overall string. For example, $n =~ /\d{3}/ doesn’t say “is this string exactly three digits long?” It asks whether there’s any point within $n at which three digits occur in a row. Strings like “101 Morris Street” test true, but so do strings like “95472” or “1-800-555-1212”. All contain three digits at one or more points, which is all you asked about. See the section “Positions” for how to use positional assertions (as in /^\d{3}$/) to nail this down.

Given the opportunity to match something a variable number of times, maximal quantifiers will elect to maximize the repeat count. So when we say “as many times as you’d like”, the greedy quantifier interprets this to mean “as many times as you can possibly get away with”, constrained only by the requirement that this not cause specifications later in the match to fail. If a pattern contains two open-ended quantifiers, then obviously both cannot consume the entire string: characters used by one part of the match are no longer available to a later part. Each quantifier is greedy at the expense of those that follow it, reading the pattern left to right. That’s the traditional behavior of quantifiers in regular expressions. However, Perl permits you to reform the behavior of its quantifiers: by placing a ? after that quantifier, you change it from maximal to minimal.
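To make the “Snoopy” problem and the “exactly” caveat concrete, here is a quick sketch (the sample strings come from the discussion above):

```perl
# /no+p/ allows one or more "o"s, so it matches both words below.
print "snoopy\n" if "Snoopy"     =~ /no+p/i;
print "xeno\n"   if "xenophobia" =~ /no+p/;

# \d{3} asks for three digits *somewhere*, not a three-digit string.
print "has 3 digits\n"     if "101 Morris Street" =~ /\d{3}/;
print "exactly 3 digits\n" if "555" =~ /\A\d{3}\z/;   # anchored version
```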
That doesn’t mean that a minimal quantifier will always match the smallest number of repetitions allowed by its range, any more than a maximal quantifier must always match the greatest number allowed in its range. The overall match must still succeed, and the minimal match will take as much as it needs to succeed, and no more. (Minimal quantifiers value contentment over greed.) For example, in the match:

    "exasperate" =~ /e(.*)e/        # $1 now "xasperat"

the .* matches “xasperat”, the longest possible string for it to match. (It also stores that value in $1, as described in the section “Capturing and Clustering” later in the chapter.) Although a shorter match was available, a greedy match doesn’t care. Given two choices at the same starting point, it always returns the longer of the two. Contrast this with this:

    "exasperate" =~ /e(.*?)e/       # $1 now "xasp"

Here, the minimal matching version, .*?, is used. Adding the ? to * makes *? take on the opposite behavior: now given two choices at the same starting point, it always returns the shorter of the two. Although you could read *? as saying to match zero or more of something but preferring zero, that doesn’t mean it will always match zero characters. If it did so here, for example, and left $1 set to "", then the second “e” wouldn’t be found, since it doesn’t immediately follow the first one.

You might also wonder why, in minimally matching /e(.*?)e/, Perl didn’t stick “rat” into $1. After all, “rat” also falls between two e’s, and is shorter than “xasp”. In Perl, the minimal/maximal choice applies only when selecting the shortest or longest from among several matches that all have the same starting point. If two possible matches exist, but these start at different offsets in the string, then their lengths don’t matter—nor does it matter whether you’ve used a minimal quantifier or a maximal one. The earliest of several valid matches always wins out over all latecomers.
It’s only when multiple possible matches start at the same point that you use minimal or maximal matching to break the tie. If the starting points differ, there’s no tie to break. Perl’s matching is normally leftmost longest; with minimal matching, it becomes leftmost shortest. But the “leftmost” part never varies and is the dominant criterion.*

There are two ways to defeat the leftward leanings of the pattern matcher. First, you can use an earlier greedy quantifier (typically .*) to try to slurp earlier parts of the string. In searching for a match for a greedy quantifier, it tries for the longest match first, which effectively searches the rest of the string right-to-left:

    "exasperate" =~ /.*e(.*?)e/     # $1 now "rat"

But be careful with that, since the overall match now includes the entire string up to that point. The second way to defeat leftmostness is to use positional assertions, discussed in the next section.

* Not all regex engines work this way. Some believe in overall greed, in which the longest match always wins, even if it shows up later. Perl isn’t that way. You might say that eagerness holds priority over greed (or thrift). For a more formal discussion of this principle and many others, see the section “The Little Engine That /Could(n’t)?/”.

Positions

Some regex constructs represent positions in the string to be matched, which is a location just to the left or right of a real character. These metasymbols are examples of zero-width assertions because they do not correspond to actual characters in the string. We often just call them “assertions”. (They’re also known as “anchors” because they tie some part of the pattern to a particular position.)

You can always manipulate positions in a string without using patterns. The builtin substr function lets you extract and assign to substrings, measured from the beginning of the string, the end of the string, or from a particular numeric offset. This might be all you need if you were working with fixed-length records, for instance. Patterns are only necessary when a numeric offset isn’t sufficient. But most of the time, offsets aren’t sufficient—at least, not sufficiently convenient, compared to patterns.

Beginnings: The \A and ^ Assertions

The \A assertion matches only at the beginning of the string, no matter what. However, the ^ assertion is the traditional beginning-of-line assertion as well as a beginning-of-string assertion. Therefore, if the pattern uses the /m modifier* and the string has embedded newlines, ^ also matches anywhere inside the string immediately following a newline character:

    /\Abar/     # Matches "bar" and "barstool"
    /^bar/      # Matches "bar" and "barstool"
    /^bar/m     # Matches "bar" and "barstool" and "sand\nbar"

Used in conjunction with /g, the /m modifier lets ^ match many times in the same string:

    s/^\s+//gm;             # Trim leading whitespace on each line
    $total++ while /^./mg;  # Count nonblank lines

Endings: The \z, \Z, and $ Assertions

The \z metasymbol matches at the end of the string, no matter what’s inside. \Z matches right before the newline at the end of the string if there is a newline, or at the end if there isn’t. The $ metacharacter usually means the same as \Z. However, if the /m modifier was specified and the string has embedded newlines, then $ can also match anywhere inside the string right in front of a newline:

    /bot\z/     # Matches "robot"
    /bot\Z/     # Matches "robot" and "abbot\n"
    /bot$/      # Matches "robot" and "abbot\n"
    /bot$/m     # Matches "robot" and "abbot\n" and "robot\nrules"

    /^robot$/   # Matches "robot" and "robot\n"
    /^robot$/m  # Matches "robot" and "robot\n" and "this\nrobot\n"
    /\Arobot\Z/ # Matches "robot" and "robot\n"
    /\Arobot\z/ # Matches only "robot" -- but why didn't you use eq?

As with ^, the /m modifier lets $ match many times in the same string when used with /g.
(These examples assume that you’ve read a multiline record into $_, perhaps by setting $/ to "" before reading.)

    s/\s*$//gm;     # Trim trailing whitespace on each line in paragraph

    while (/^([^:]+):\s*(.*)/gm) {      # get mail header
        $headers{$1} = $2;
    }

* Or you’ve set the deprecated $* variable to 1 and you’re not overriding $* with the /s modifier.

In “Variable Interpolation” later in this chapter, we’ll discuss how you can interpolate variables into patterns: if $foo is “bc”, then /a$foo/ is equivalent to /abc/. Here, the $ does not match the end of the string. For a $ to match the end of the string, it must be at the end of the pattern or immediately be followed by a vertical bar or closing parenthesis.

Boundaries: The \b and \B Assertions

The \b assertion matches at any word boundary, defined as the position between a \w character and a \W character, in either order. If the order is \W\w, it’s a beginning-of-word boundary, and if the order is \w\W, it’s an end-of-word boundary. (The ends of the string count as \W characters here.) The \B assertion matches any position that is not a word boundary, that is, the middle of either \w\w or \W\W.

    /\bis\b/    # matches "what it is" and "that is it"
    /\Bis\B/    # matches "thistle" and "artist"
    /\bis\B/    # matches "istanbul" and "so--isn't that butter?"
    /\Bis\b/    # matches "confutatis" and "metropolis near you"

Because \W includes all punctuation characters (except the underscore), there are \b boundaries in the middle of strings like “isn’t”, “booktech@oreilly.com”, “M.I.T.”, and “key/value”. Inside a character class ([\b]), a \b represents a backspace rather than a word boundary.
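As a quick sketch of those boundary rules, watch what a bounded \w+ extracts from a string containing an apostrophe (the sample text is our own):

```perl
# The apostrophe is a \W character, so "isn't" contains two \w runs
# as far as \b is concerned.
my @words = "isn't it" =~ /\b(\w+)\b/g;
print "@words\n";    # prints "isn t it"
```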
Progressive Matching

When used with the /g modifier, the pos function allows you to read or set the offset where the next progressive match will start:

    $burglar = "Bilbo Baggins";
    while ($burglar =~ /b/gi) {
        printf "Found a B at %d\n", pos($burglar)-1;
    }

(We subtract one from the position because that was the length of the string we were looking for, and pos is always the position just past the match.)

The code above prints:

    Found a B at 0
    Found a B at 3
    Found a B at 6

After a failure, the match position normally resets back to the start. If you also apply the /c (for “continue”) modifier, then when the /g runs out, the failed match doesn’t reset the position pointer. This lets you continue your search past that point without starting over at the very beginning.

    $burglar = "Bilbo Baggins";
    while ($burglar =~ /b/gci) {        # ADD /c
        printf "Found a B at %d\n", pos($burglar)-1;
    }
    while ($burglar =~ /i/gi) {
        printf "Found an I at %d\n", pos($burglar)-1;
    }

Besides the three B’s it found earlier, Perl now reports finding an i at position 10. Without the /c, the second loop’s match would have restarted from the beginning and found another i at position 6 first.

Where You Left Off: The \G Assertion

Whenever you start thinking in terms of the pos function, it’s tempting to start carving your string up with substr, but this is rarely the right thing to do. More often, if you started with pattern matching, you should continue with pattern matching. However, if you’re looking for a positional assertion, you’re probably looking for \G.

The \G assertion represents within the pattern the same point that pos represents outside of it. When you’re progressively matching a string with the /g modifier (or you’ve used the pos function to directly select the starting point), you can use \G to specify the position just after the previous match. That is, it matches the location immediately before whatever character would be identified by pos.
This allows you to remember where you left off:

($recipe = <<'DISH') =~ s/^\s+//gm;
    Preheat oven to 451 deg. fahrenheit.
    Mix 1 ml. dilithium with 3 oz. NaCl
    and stir in 4 anchovies. Glaze with
    1 g. mercury. Heat for 4 hours and
    let cool for 3 seconds. Serves 10 aliens.
DISH

$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;    # $1 is now "deg"
$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;    # $1 is now "ml"
$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;    # $1 is now "oz"

The \G metasymbol is often used in a loop, as we demonstrate in our next example. We “pause” after every digit sequence, and at that position, we test whether there’s an abbreviation. If so, we grab the next two words. Otherwise, we just grab the next word:

pos($recipe) = 0;    # Just to be safe, reset \G to 0
while ( $recipe =~ /(\d+) /g ) {
    my $amount = $1;
    if ($recipe =~ / \G (\w{0,3}) \. \s+ (\w+) /x) {    # abbrev. + word
        print "$amount $1 of $2\n";
    }
    else {
        $recipe =~ / \G (\w+) /x;    # just a word
        print "$amount $1\n";
    }
}

That produces:

451 deg of fahrenheit
1 ml of dilithium
3 oz of NaCl
4 anchovies
1 g of mercury
4 hours
3 seconds
10 aliens

Capturing and Clustering

Patterns allow you to group portions of your pattern together into subpatterns and to remember the strings matched by those subpatterns. We call the first behavior clustering and the second one capturing.

Capturing

To capture a substring for later use, put parentheses around the subpattern that matches it. The first pair of parentheses stores its substring in $1, the second pair in $2, and so on. You may use as many parentheses as you like; Perl just keeps defining more numbered variables for you to represent these captured strings. Some examples:

/(\d)(\d)/    # Match two digits, capturing them into $1 and $2
/(\d+)/       # Match one or more digits, capturing them all into $1
/(\d)+/       # Match a digit one or more times, capturing the last into $1

Note the difference between the second and third patterns.
The second form is usually what you want. The third form does not create multiple variables for multiple digits. Parentheses are numbered when the pattern is compiled, not when it is matched.

Captured strings are often called backreferences because they refer back to parts of the captured text. There are actually two ways to get at these backreferences. The numbered variables you’ve seen are how you get at backreferences outside of a pattern, but inside the pattern, that doesn’t work. You have to use \1, \2, etc.* So to find doubled words like “the the” or “had had”, you might use this pattern:

/\b(\w+) \1\b/i

But most often, you’ll be using the $1 form, because you’ll usually apply a pattern and then do something with the substrings. Suppose you have some text (a mail header) that looks like this:

From: gnat@perl.com
To: camelot@oreilly.com
Date: Mon, 17 Jul 2000 09:00:00 -1000
Subject: Eye of the needle

and you want to construct a hash that maps the text before each colon to the text afterward. If you were looping through this text line by line (say, because you were reading it from a file) you could do that as follows:

while (<>) {
    /^(.*?): (.*)$/;    # Pre-colon text into $1, post-colon into $2
    $fields{$1} = $2;
}

Like $`, $&, and $', these numbered variables are dynamically scoped through the end of the enclosing block or eval string, or to the next successful pattern match, whichever comes first. You can use them in the righthand side (the replacement part) of a substitute, too:

s/^(\S+) (\S+)/$2 $1/;    # Swap first two words

Groupings can nest, and when they do, the groupings are counted by the location of the left parenthesis. So given the string “Primula Brandybuck”, the pattern:

/^((\w+) (\w+))$/

would capture “Primula Brandybuck” into $1, “Primula” into $2, and “Brandybuck” into $3. This is depicted in Figure 5-1.
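Here is a minimal sketch of that doubled-word pattern in action (the sample sentence is our own):

```perl
# Find a doubled word with the \1 backreference.
my $text = "Now is the the winter of our discontent";
if ($text =~ /\b(\w+) \1\b/i) {
    print "Doubled word: $1\n";    # prints "Doubled word: the"
}
```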
* You can’t use $1 for a backreference within the pattern because that would already have been interpolated as an ordinary variable back when the regex was compiled. So we use the traditional \1 backreference notation inside patterns. For two- and three-digit backreference numbers, there is some ambiguity with octal character notation, but that is neatly solved by considering how many captured patterns are available. For instance, if Perl sees a \11 metasymbol, it’s equivalent to $11 only if there are at least 11 substrings captured earlier in the pattern. Otherwise, it’s equivalent to \011, that is, a tab character.

Figure 5-1. Creating backreferences with parentheses (the outer pair captures into $1, spanning the two inner pairs that capture into $2 and $3)

Patterns with captures are often used in list context to populate a list of values, since the pattern is smart enough to return the captured substrings as a list:

($first, $last) = /^(\w+) (\w+)$/;
($full, $first, $last) = /^((\w+) (\w+))$/;

With the /g modifier, a pattern can return multiple substrings from multiple matches, all in one list. Suppose you had the mail header we saw earlier all in one string (in $_, say). You could do the same thing as our line-by-line loop, but with one statement:

%fields = /^(.*?): (.*)$/gm;

The pattern matches four times, and each time it matches, it finds two substrings. The /gm match returns all of these as a flat list of eight strings, which the list assignment to %fields will conveniently interpret as four key/value pairs, thus restoring harmony to the universe.

Several other special variables deal with text captured in pattern matches. $& contains the entire matched string, $` everything to the left of the match, $' everything to the right. $+ contains the contents of the last backreference.

$_ = "Speak, <EM>friend</EM>, and enter.";
m[ (<.*?>) (.*?)
   (</.*?>) ]x;    # A tag, then chars, then an end tag

print "prematch: $`\n";     # Speak,
print "match: $&\n";        # <EM>friend</EM>
print "postmatch: $'\n";    # , and enter.
print "lastmatch: $+\n";    # </EM>

For more explanation of these magical Elvish variables (and for a way to write them in English), see Chapter 28, Special Names.

The @- (@LAST_MATCH_START) array holds the offsets of the beginnings of any submatches, and @+ (@LAST_MATCH_END) holds the offsets of the ends:

#!/usr/bin/perl
$alphabet = "abcdefghijklmnopqrstuvwxyz";
$alphabet =~ /(hi).*(stu)/;
print "The entire match began at $-[0] and ended at $+[0]\n";
print "The first match began at $-[1] and ended at $+[1]\n";
print "The second match began at $-[2] and ended at $+[2]\n";

If you really want to match a literal parenthesis character instead of having it interpreted as a metacharacter, backslash it:

/\(e.g., .*?\)/

This matches a parenthesized example (e.g., this statement). But since dot is a wildcard, this also matches any parenthetical statement with the first letter e and third letter g (ergo, this statement too).

Clustering

Bare parentheses both cluster and capture. But sometimes you don’t want that. Sometimes you just want to group portions of the pattern without creating a backreference. You can use an extended form of parentheses to suppress capturing: the (?:PATTERN) notation will cluster without capturing.

There are at least three reasons you might want to cluster without capturing:

1. To quantify something.

2. To limit the scope of interior alternation; for example, /^cat|cow|dog$/ needs to be /^(?:cat|cow|dog)$/ so that the cat doesn’t run away with the ^.

3. To limit the scope of an embedded pattern modifier to a particular subpattern, such as in /foo(?-i:Case_Matters)bar/i. (See the next section, “Cloistered Pattern Modifiers.”)

In addition, it’s more efficient to suppress the capture of something you’re not going to use.
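As a sketch of the first reason, the (?:...) group below is quantified three times without tying up a capture variable; the dotted-quad string is just an example of ours:

```perl
# Cluster without capturing, so the quantifier applies to the whole group
# but only the final run of digits is captured.
my $addr = "10.0.0.1";
if ($addr =~ /^(?:\d+\.){3}(\d+)$/) {
    print "last part: $1\n";    # prints "last part: 1"
}
```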
On the minus side, the notation is a little noisier, visually speaking.

In a pattern, a left parenthesis immediately followed by a question mark denotes a regex extension. The current regular expression bestiary is relatively fixed—we don’t dare create a new metacharacter, for fear of breaking old Perl programs. Instead, the extension syntax is used to add new features to the bestiary. In the remainder of the chapter, we’ll see many more regex extensions, all of which cluster without capturing, as well as doing something else. The (?:PATTERN) extension is just special in that it does nothing else. So if you say:

@fields = split(/\b(?:a|b|c)\b/)

it’s like:

@fields = split(/\b(a|b|c)\b/)

but doesn’t spit out extra fields. (The split operator is a bit like m//g in that it will emit extra fields for all the captured substrings within the pattern. Ordinarily, split only returns what it didn’t match. For more on split, see Chapter 29.)

Cloistered Pattern Modifiers

You may cloister the /i, /m, /s, and /x modifiers within a portion of your pattern by inserting them (without the slash) between the ? and : of the clustering notation. If you say:

/Harry (?i:s) Truman/

it matches both “Harry S Truman” and “Harry s Truman”, whereas:

/Harry (?x: [A-Z] \.? \s )?Truman/

matches both “Harry S Truman” and “Harry S. Truman”, as well as “Harry Truman”, and:

/Harry (?ix: [A-Z] \.? \s )?Truman/

matches all five, by combining the /i and /x modifiers within the cloister. You can also subtract modifiers from a cloister with a minus sign:

/Harry (?x-i: [A-Z] \.? \s )?Truman/i

This matches any capitalization of the name—but if the middle initial is provided, it must be capitalized, since the /i applied to the overall pattern is suspended inside the cloister.

By omitting the colon and PATTERN, you can export modifier settings to an outer cluster, turning it into a cloister.
That is, you can selectively turn modifiers on and off for the cluster one level outside the modifiers’ parentheses, like so:

/(?i)foo/            # Equivalent to /foo/i
/foo((?-i)bar)/i     # "bar" must be lower case
/foo((?x-i) bar)/    # Enables /x and disables /i for "bar"

Note that the second and third examples create backreferences. If that wasn’t what you wanted, then you should have been using (?-i:bar) and (?x-i: bar), respectively. Setting modifiers on a portion of your pattern is particularly useful when you want “.” to match newlines in part of your pattern but not in the rest of it. Setting /s on the whole pattern doesn’t help you there.

Alternation

Inside a pattern or subpattern, use the | metacharacter to specify a set of possibilities, any one of which could match. For instance:

/Gandalf|Saruman|Radagast/

matches Gandalf or Saruman or Radagast. The alternation extends only as far as the innermost enclosing parentheses (whether capturing or not):

/prob|n|r|l|ate/      # Match prob, n, r, l, or ate
/pro(b|n|r|l)ate/     # Match probate, pronate, prorate, or prolate
/pro(?:b|n|r|l)ate/   # Match probate, pronate, prorate, or prolate

The second and third forms match the same strings, but the second form captures the variant character in $1 and the third form does not.

At any given position, the Engine tries to match the first alternative, and then the second, and so on. The relative length of the alternatives does not matter, which means that in this pattern:

/(Sam|Samwise)/

$1 will never be set to Samwise no matter what string it’s matched against, because Sam will always match first. When you have overlapping matches like this, put the longer ones at the beginning. But the ordering of the alternatives only matters at a given position.
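A short sketch of that ordering rule, using strings of our own devising:

```perl
# Alternatives are tried left to right at each starting position,
# so an overlapping shorter alternative can shadow a longer one.
"Samwise" =~ /(Sam|Samwise)/;
my $first = $1;                    # "Sam" -- it is tried first
"Samwise" =~ /(Samwise|Sam)/;
my $second = $1;                   # now "Samwise" wins
print "$first then $second\n";     # prints "Sam then Samwise"
```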
The outer loop of the Engine does left-to-right matching, so the following always matches the first Sam:

"'Sam I am,' said Samwise" =~ /(Samwise|Sam)/;      # $1 eq "Sam"

But you can force right-to-left scanning by making use of greedy quantifiers, as discussed earlier in “Quantifiers”:

"'Sam I am,' said Samwise" =~ /.*(Samwise|Sam)/;    # $1 eq "Samwise"

You can defeat any left-to-right (or right-to-left) matching by including any of the various positional assertions we saw earlier, such as \G, ^, and $. Here we anchor the pattern to the end of the string:

"'Sam I am,' said Samwise" =~ /(Samwise|Sam)$/;     # $1 eq "Samwise"

That example factors the $ out of the alternation (since we already had a handy pair of parentheses to put it after), but in the absence of parentheses you can also distribute the assertions to any or all of the individual alternatives, depending on how you want them to match. This little program displays lines that begin with either a __DATA__ or __END__ token:

#!/usr/bin/perl
while (<>) {
    print if /^__DATA__|^__END__/;
}

But be careful with that. Remember that the first and last alternatives (before the first | and after the last one) tend to gobble up the other elements of the regular expression on either side, out to the ends of the expression, unless there are enclosing parentheses. A common mistake is to ask for:

/^cat|dog|cow$/

when you really mean:

/^(cat|dog|cow)$/

The first matches “cat” at the beginning of the string, or “dog” anywhere, or “cow” at the end of the string. The second matches any string consisting solely of “cat” or “dog” or “cow”. It also captures $1, which you may not want. You can also say:

/^cat$|^dog$|^cow$/

We’ll show you another solution later.

An alternative can be empty, in which case it always matches.

/com(pound|)/;       # Matches "compound" or "com"
/com(pound(s|)|)/;   # Matches "compounds", "compound", or "com"

This is much like using the ?
quantifier, which matches 0 times or 1 time:

/com(pound)?/;       # Matches "compound" or "com"
/com(pound(s?))?/;   # Matches "compounds", "compound", or "com"
/com(pounds?)?/;     # Same, but doesn’t use $2

There is one difference, though. When you apply the ? to a subpattern that captures into a numbered variable, that variable will be undefined if there’s no string to go there. If you used an empty alternative, it would still be false, but would be a defined null string instead.

Staying in Control

As any good manager knows, you shouldn’t micromanage your employees. Just tell them what you want, and let them figure out the best way of doing it. Similarly, it’s often best to think of a regular expression as a kind of specification: “Here’s what I want; go find a string that fits the bill.”

On the other hand, the best managers also understand the job their employees are trying to do. The same is true of pattern matching in Perl. The more thoroughly you understand how Perl goes about the task of matching any particular pattern, the more wisely you’ll be able to make use of Perl’s pattern-matching capabilities. One of the most important things to understand about Perl’s pattern matching is when not to use it.

Letting Perl Do the Work

When people of a certain temperament first learn regular expressions, they’re often tempted to see everything as a problem in pattern matching. And while that may even be true in the larger sense, pattern matching is about more than just evaluating regular expressions. It’s partly about looking for your car keys where you dropped them, not just under the streetlamp where you can see better. In real life, we all know that it’s a lot more efficient to look in the right places than the wrong ones. Similarly, you should use Perl’s control flow to decide which patterns to execute, and which ones to skip. A regular expression is pretty smart, but it’s smart like a horse. It can get distracted if it sees too much at once.
So sometimes you have to put blinders onto it. For example, you’ll recall our earlier example of alternation:

/Gandalf|Saruman|Radagast/

That works as advertised, but not as well as it might, because it searches every position in the string for every name before it moves on to the next position. Astute readers of The Lord of the Rings will recall that, of the three wizards named above, Gandalf is mentioned much more frequently than Saruman, and Saruman is mentioned much more frequently than Radagast. So it’s generally more efficient to use Perl’s logical operators to do the alternation:

/Gandalf/ || /Saruman/ || /Radagast/

This is yet another way of defeating the “leftmost” policy of the Engine. It only searches for Saruman if Gandalf was nowhere to be seen. And it only searches for Radagast if Saruman is also absent. Not only does this change the order in which things are searched, but it sometimes allows the regular expression optimizer to work better. It’s generally easier to optimize searching for a single string than for several strings simultaneously. Similarly, anchored searches can often be optimized if they’re not too complicated.

You don’t have to limit your control of the control flow to the || operator. Often you can control things at the statement level. You should always think about weeding out the common cases first. Suppose you’re writing a loop to process a configuration file. Many configuration files are mostly comments.
It’s often best to discard comments and blank lines early before doing any heavy-duty processing, even if the heavy-duty processing would throw out the comments and blank lines in the course of things:

while (<CONF>) {
    next if /^#/;
    next if /^\s*(#|$)/;
    chomp;
    munchabunch($_);
}

Even if you’re not trying to be efficient, you often need to alternate ordinary Perl expressions with regular expressions simply because you want to take some action that is not possible (or very difficult) from within the regular expression, such as printing things out. Here’s a useful number classifier:

warn "has nondigits"        if     /\D/;
warn "not a natural number" unless /^\d+$/;             # rejects -3
warn "not an integer"       unless /^-?\d+$/;           # rejects +3
warn "not an integer"       unless /^[+-]?\d+$/;
warn "not a decimal number" unless /^-?\d+\.?\d*$/;     # rejects .2
warn "not a decimal number" unless /^-?(?:\d+(?:\.\d*)?|\.\d+)$/;
warn "not a C float"        unless /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/;

We could stretch this section out a lot longer, but really, that sort of thing is what this whole book is about. You’ll see many more examples of the interplay of Perl code and pattern matching as we go along. In particular, see the later section “Programmatic Patterns”. (It’s okay to read the intervening material first, of course.)

Variable Interpolation

Using Perl’s control flow mechanisms to control regular expression matching has its limits. The main difficulty is that it’s an “all or nothing” approach; either you run the pattern, or you don’t. Sometimes you know the general outlines of the pattern you want, but you’d like to have the capability of parameterizing it. Variable interpolation provides that capability, much like parameterizing a subroutine lets you have more influence over its behavior than just deciding whether to call it or not. (More about subroutines in the next chapter.)

One nice use of interpolation is to provide a little abstraction, along with a little readability.
With regular expressions you may certainly write things concisely:

if ($num =~ /^[-+]?\d+\.?\d*$/) { ... }

But what you mean is more apparent when you write:

$sign = '[-+]?';
$digits = '\d+';
$decimal = '\.?';
$more_digits = '\d*';
$number = "$sign$digits$decimal$more_digits";
...
if ($num =~ /^$number$/o) { ... }

We’ll cover this use of interpolation more under “Generated patterns” later in this chapter. We’ll just point out that we used the /o modifier to suppress recompilation because we don’t expect $number to change its value over the course of the program.

Another cute trick is to turn your tests inside out and use the variable string to pattern-match against a set of known strings:

chomp($answer = <STDIN>);
if    ("SEND"  =~ /^\Q$answer/i) { print "Action is send\n"  }
elsif ("STOP"  =~ /^\Q$answer/i) { print "Action is stop\n"  }
elsif ("ABORT" =~ /^\Q$answer/i) { print "Action is abort\n" }
elsif ("LIST"  =~ /^\Q$answer/i) { print "Action is list\n"  }
elsif ("EDIT"  =~ /^\Q$answer/i) { print "Action is edit\n"  }

This lets your user perform the “send” action by typing any of S, SE, SEN, or SEND (in any mixture of upper- and lowercase). To “stop”, they’d have to type at least ST (or St, or sT, or st).

When backslashes happen

When you think of double-quote interpolation, you usually think of both variable and backslash interpolation. But as we mentioned earlier, for regular expressions there are two passes, and the interpolation pass defers most of the backslash interpretation to the regular expression parser (which we discuss later). Ordinarily, you don’t notice the difference, because Perl takes pains to hide the difference. (One sequence that’s obviously different is the \b metasymbol, which turns into a word boundary assertion—outside of character classes, anyway. Inside a character class, where assertions make no sense, it reverts to being a backspace, as it is normally.)
It’s actually fairly important that the regex parser handle the backslashes. Suppose you’re searching for tab characters in a pattern with a /x modifier:

($col1, $col2) = /(.*?) \t+ (.*?)/x;

If Perl didn’t defer the interpretation of \t to the regex parser, the \t would have turned into whitespace, which the regex parser would have ignorantly ignored because of the /x. But Perl is not so ignoble, or tricky.

You can trick yourself, though. Suppose you abstracted out the column separator, like this:

$colsep = "\t+";    # (double quotes)
($col1, $col2) = /(.*?) $colsep (.*?)/x;

Now you’ve just blown it, because the \t turns into a real tab before it gets to the regex parser, which will think you said /(.*?)+(.*?)/ after it discards the whitespace. Oops. To fix, avoid /x, or use single quotes. Or better, use qr//. (See the next section.)

The only double-quote escapes that are processed as such are the six translation escapes: \U, \u, \L, \l, \Q, and \E. If you ever look into the inner workings of the Perl regular expression compiler, you’ll find code for handling escapes like \t for tab, \n for newline, and so on. But you won’t find code for those six translation escapes. (We only listed them in Table 5-7 because people expect to find them there.) If you somehow manage to sneak any of them into the pattern without going through double-quotish evaluation, they won’t be recognized.

How could they find their way in? Well, you can defeat interpolation by using single quotes as your pattern delimiter. In m'...', qr'...', and s'...'...', the single quotes suppress variable interpolation and the processing of translation escapes, just as they would in a single-quoted string. Saying m'\ufrodo' won’t find a capitalized version of poor frodo. However, since the “normal” backslash characters aren’t really processed on that level anyway, m'\t\d' still matches a real tab followed by any digit.
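Returning to the column-separator gotcha above, here is a hedged sketch of the qr// fix (the sample data is ours): the separator is compiled ahead of time, so its tab never passes through double-quote interpolation inside the /x pattern.

```perl
# Precompile the separator so /x can't throw away its tab.
my $colsep = qr/\t+/;    # a regex object, not the raw string "\t+"
my ($col1, $col2) = "alpha\t\tbeta" =~ /(.*?) $colsep (.*?)$/x;
print "$col1|$col2\n";   # prints "alpha|beta"
```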
Another way to defeat interpolation is through interpolation itself. If you say:

$var = '\U';
/${var}frodo/;

poor frodo remains uncapitalized. Perl won’t redo the interpolation pass for you just because you interpolated something that looks like it might want to be reinterpolated. You can’t expect that to work any more than you’d expect this double interpolation to work:

$hobbit = 'Frodo';
$var = '$hobbit';    # (single quotes)
/$var/;              # means m'$hobbit', not m'Frodo'.

Here’s another example that shows how most backslashes are interpreted by the regex parser, not by variable interpolation. Imagine you have a simple little grep-style program written in Perl:*

* If you didn’t know what a grep program was before, you will now. No system should be without grep—we believe grep is the most useful small program ever invented. (It logically follows that we don’t believe Perl is a small program.)

#!/usr/bin/perl
$pattern = shift;
while (<>) {
    print if /$pattern/o;
}

If you name that program pgrep and call it this way:

% pgrep '\t\d' *.c

then you’ll find that it prints out all lines of all your C source files in which a digit follows a tab. You didn’t have to do anything special to get Perl to realize that \t was a tab. If Perl’s patterns were just double-quote interpolated, you would have; fortunately, they aren’t. They’re recognized directly by the regex parser.

The real grep program has a -i switch that turns off case-sensitive matching. You don’t have to add such a switch to your pgrep program; it can already handle that without modification. You just pass it a slightly fancier pattern, with an embedded /i modifier:

% pgrep '(?i)ring' LotR*.pod

That now searches for any of “Ring”, “ring”, “RING”, and so on. You don’t see this feature too much in literal patterns, since you can always just write /ring/i. But for patterns passed in on the command line, in web search forms, or embedded in configuration files, it can be a lifesaver.
(Speaking of rings.)

The qr// quote regex operator

Variables that interpolate into patterns necessarily do so at run time, not compile time. This slows down execution because Perl has to check whether you’ve changed the contents of the variable; if so, it would have to recompile the regular expression. As mentioned in “Pattern-Matching Operators”, if you promise never to change the pattern, you can use the /o option to interpolate and compile only once:

print if /$pattern/o;

Although that works fine in our pgrep program, in the general case, it doesn’t. Imagine you have a slew of patterns, and you want to match each of them in a loop, perhaps like this:

foreach $item (@data) {
    foreach $pat (@patterns) {
        if ($item =~ /$pat/) { ... }
    }
}

You couldn’t write /$pat/o because the meaning of $pat varies each time through the inner loop.

The solution to this is the qr/PATTERN/imosx operator. This operator quotes—and compiles—its PATTERN as a regular expression. PATTERN is interpolated the same way as in m/PATTERN/. If ' is used as the delimiter, no interpolation of variables (or the six translation escapes) is done. The operator returns a Perl value that may be used instead of the equivalent literal in a corresponding pattern match or substitute. For example:

$regex = qr/my.STRING/is;
s/$regex/something else/;

is equivalent to:

s/my.STRING/something else/is;

So for our nested loop problem above, preprocess your pattern first using a separate loop:

@regexes = ();
foreach $pat (@patterns) {
    push @regexes, qr/$pat/;
}

Or all at once, using Perl’s map operator:

@regexes = map { qr/$_/ } @patterns;

And then change the loop to use those precompiled regexes:

foreach $item (@data) {
    foreach $re (@regexes) {
        if ($item =~ /$re/) { ... }
    }
}

Now when you run the match, Perl doesn’t have to create a compiled regular expression on each if test, because it sees that it already has one.
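To make the loop concrete, here is a self-contained sketch with made-up patterns and data:

```perl
# Precompiled patterns matched in a nested loop.
my @patterns = ('\d+', '^[A-Z]');
my @regexes  = map { qr/$_/ } @patterns;    # compile each pattern once

my $hits = 0;
foreach my $item ("Bilbo", "plan 9", "frodo") {
    foreach my $re (@regexes) {
        $hits++ if $item =~ /$re/;          # no recompilation here
    }
}
print "$hits\n";    # prints "2" -- "Bilbo" starts uppercase, "plan 9" has digits
```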
The result of a qr// may even be interpolated into a larger match, as though it were a simple string:

$regex = qr/$pattern/;
$string =~ /foo${regex}bar/;    # interpolate into larger patterns

This time, Perl does recompile the pattern, but you could always chain several qr// operators together into one. The reason this works is that the qr// operator returns a special kind of object that has a stringification overload, as described in Chapter 13, Overloading. If you print out the return value, you’ll see the equivalent string:

$re = qr/my.STRING/is;
print $re;    # prints (?si-xm:my.STRING)

The /s and /i modifiers were enabled in the pattern because they were supplied to qr//. The /x and /m, however, are disabled because they were not.

Any time you interpolate strings of unknown provenance into a pattern, you should be prepared to handle any exceptions thrown by the regex compiler, in case someone fed you a string containing untamable beasties:

$re = qr/$pat/is;                         # might escape and eat you
$re = eval { qr/$pat/is } || warn ...     # caught it in an outer cage

For more on the eval operator, see Chapter 29.

The Regex Compiler

After the variable interpolation pass has had its way with the string, the regex parser finally gets a shot at trying to understand your regular expression. There’s not actually a great deal that can go wrong at this point, apart from messing up the parentheses, or using a sequence of metacharacters that doesn’t mean anything. The parser does a recursive-descent analysis of your regular expression and, if it parses, turns it into a form suitable for interpretation by the Engine (see the next section). Most of the interesting stuff that goes on in the parser involves optimizing your regular expression to run as fast as possible. We’re not going to explain that part. It’s a trade secret. (Rumors that looking at the regular expression code will drive you insane are greatly exaggerated. We hope.)
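Speaking of messing up the parentheses, the eval guard from the previous section can be sketched with a deliberately broken pattern of our own:

```perl
# Trap a regex compilation error instead of dying.
my $pat = "(unbalanced";           # invalid: unmatched parenthesis
my $re  = eval { qr/$pat/ };
print defined($re) ? "compiled ok\n"
                   : "bad pattern: caught\n";    # prints "bad pattern: caught"
```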
But you might like to know what the parser actually thought of your regular expression, and if you ask it politely, it will tell you. By saying use re "debug", you can examine how the regex parser processes your pattern. (You can also see the same information by using the -Dr command-line switch, which is available to you if your Perl was compiled with the -DDEBUGGING flag during installation.)

#!/usr/bin/perl
use re "debug";
"Smeagol" =~ /^Sm(.*)g[aeiou]l$/;

The output is below. You can see that prior to execution Perl compiles the regex and assigns meaning to the components of the pattern: BOL for the beginning of line (^), REG_ANY for the dot, and so on:

Compiling REx `^Sm(.*)g[aeiou]l$'
size 24 first at 2
rarest char l at 0
rarest char S at 0
   1: BOL(2)
   2: EXACT <Sm>(4)
   4: OPEN1(6)
   6: STAR(8)
   7:   REG_ANY(0)
   8: CLOSE1(10)
  10: EXACT <g>(12)
  12: ANYOF[aeiou](21)
  21: EXACT <l>(23)
  23: EOL(24)
  24: END(0)
anchored `Sm' at 0 floating `l'$ at 4..2147483647 (checking anchored) anchored(BOL) minlen 5
Omitting $` $& $' support.

Some of the lines summarize the conclusions of the regex optimizer. It knows that the string must start with “Sm”, and that therefore there’s no reason to do the ordinary left-to-right scan. It knows that the string must end with an “l”, so it can reject out of hand any string that doesn’t. It knows that the string must be at least five characters long, so it can ignore any string shorter than that right off the bat. It also knows what the rarest character in each constant string is, which can help in searching “studied” strings. (See study in Chapter 29.)

It then goes on to trace how it executes the pattern:

EXECUTING...
Guessing start of match, REx `^Sm(.*)g[aeiou]l$' against `Smeagol'...
Guessed: match at offset 0
Matching REx `^Sm(.*)g[aeiou]l$' against `Smeagol'
Setting an EVAL scope, savestack=3
  0 <> <Smeagol>        |  1: BOL
  0 <> <Smeagol>        |  2: EXACT <Sm>
  2 <Sm> <eagol>        |  4: OPEN1
  2 <Sm> <eagol>        |  6: STAR
                        REG_ANY can match 5 times out of 32767...
Setting an EVAL scope, savestack=3
  7 <Smeagol> <>        |  8: CLOSE1
  7 <Smeagol> <>        | 10: EXACT <g>
    failed...
  6 <Smeago> <l>        |  8: CLOSE1
  6 <Smeago> <l>        | 10: EXACT <g>
    failed...
  5 <Smeag> <ol>        |  8: CLOSE1
  5 <Smeag> <ol>        | 10: EXACT <g>
    failed...
  4 <Smea> <gol>        |  8: CLOSE1
  4 <Smea> <gol>        | 10: EXACT <g>
  5 <Smeag> <ol>        | 12: ANYOF[aeiou]
  6 <Smeago> <l>        | 21: EXACT <l>
  7 <Smeagol> <>        | 23: EOL
  7 <Smeagol> <>        | 24: END
Match successful!
Freeing REx: `^Sm(.*)g[aeiou]l$'

If you follow the stream of whitespace down the middle of Smeagol, you can actually see how the Engine overshoots to let the .* be as greedy as possible, then backtracks on that until it finds a way for the rest of the pattern to match. But that’s what the next section is about.

The Little Engine That /Could(n’t)?/

And now we’d like to tell you the story of the Little Regex Engine that says, “I think I can. I think I can. I think I can.” In this section, we lay out the rules used by Perl’s regular expression engine to match your pattern against a string. The Engine is extremely persistent and hardworking. It’s quite capable of working even after you think it should quit. The Engine doesn’t give up until it’s certain there’s no way to match the pattern against the string. The Rules below explain how the Engine “thinks it can” for as long as possible, until it knows it can or can’t. The problem for our Engine is that its task is not merely to pull a train over a hill. It has to search a (potentially) very complicated space of possibilities, keeping track of where it has been and where it hasn’t.
The Engine uses a nondeterministic finite-state automaton (NFA, not to be confused with NFL, a nondeterministic football league) to find a match. That just means that it keeps track of what it has tried and what it hasn't, and when something doesn't pan out, it backs up and tries something else. This is known as backtracking. (Er, sorry, we didn't invent these terms. Really.) The Engine is capable of trying a million subpatterns at one spot, then giving up on all those, backing up to within one choice of the beginning, and trying the million subpatterns again at a different spot. The Engine is not terribly intelligent; just persistent and thorough. If you're cagey, you can give the Engine an efficient pattern that doesn't let it do a lot of silly backtracking.

When someone trots out a phrase like "Regexes choose the leftmost, longest match", that means that Perl generally prefers the leftmost match over the longest match. But the Engine doesn't realize it's "preferring" anything, and it's not really thinking at all, just gutting it out. The overall preferences are an emergent behavior resulting from many individual and unrelated choices. Here are those choices:*

* Some of these choices may be skipped if the regex optimizer has any say, which is equivalent to the Little Engine simply jumping through the hill via quantum tunneling. But for this discussion we're pretending the optimizer doesn't exist.

Rule 1

The Engine tries to match as far left in the string as it can, such that the entire regular expression matches under Rule 2.

The Engine starts just before the first character and tries to match the entire pattern starting there. The entire pattern matches if and only if the Engine reaches the end of the pattern before it runs off the end of the string. If it matches, it quits immediately—it doesn't keep looking for a "better" match, even though the pattern might match in many different ways.
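You can watch the "quit immediately" behavior with a quick throwaway match (our own sketch, using the standard @- array of match-start offsets):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The Engine reports the first overall success it finds, scanning
# start positions left to right; it never holds out for a longer
# match further along.  Here "bbb" (offset 2) would be the longer
# match, but the lone "a" (offset 1) is found first:
"cabbb" =~ /bbb|a/;
printf "matched <%s> at offset %d\n", $&, $-[0];
# matched <a> at offset 1
```

The leftmost start position wins before length is ever considered, exactly as Rule 1 describes.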
If it is unable to match the pattern at the first position in the string, it admits temporary defeat and moves to the next position in the string, between the first and second characters, and tries all the possibilities again. If it succeeds, it stops. If it fails, it continues on down the string. The pattern match as a whole doesn't fail until it has tried to match the entire regular expression at every position in the string, including after the last character.

A string of n characters actually provides n + 1 positions to match at. That's because the beginnings and the ends of matches are between the characters of the string. This rule sometimes surprises people when they write a pattern like /x*/ that can match zero or more "x" characters. If you try that pattern on a string like "fox", it won't find the "x". Instead, it will immediately match the null string before the "f" and never look further. If you want it to match one or more x characters, you need to use /x+/ instead. See the quantifiers under Rule 5.

A corollary to this rule is that any pattern matching the null string is guaranteed to match at the leftmost position in the string (in the absence of any zero-width assertions to the contrary).

Rule 2

When the Engine encounters a set of alternatives (separated by | symbols), either at the top level or at the current "cluster" level, it tries them left-to-right, stopping on the first successful match that allows successful completion of the entire pattern.

A set of alternatives matches a string if any of the alternatives match under Rule 3. If none of the alternatives matches, it backtracks to the Rule that invoked this Rule, which is usually Rule 1, but could be Rule 4 or 6, if we're within a cluster. That rule will then look for a new position at which to apply Rule 2. If there's only one alternative, then either it matches or it doesn't, and Rule 2 still applies. (There's no such thing as zero alternatives, because a null string always matches.)
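Both Rules can be checked from the command line with a few throwaway matches (a quick sketch of our own):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rule 1's null-string surprise: /x*/ succeeds at the very first of
# the n+1 positions, matching zero x's before the "f".
"fox" =~ /x*/;
printf "x*: <%s> at offset %d\n", $&, $-[0];   # x*: <> at offset 0

# /x+/ must consume at least one "x", so the Engine keeps sliding
# right until it finds the real one.
"fox" =~ /x+/;
printf "x+: <%s> at offset %d\n", $&, $-[0];   # x+: <x> at offset 2

# Rule 2: alternatives are tried strictly left to right, and the
# first one that lets the whole pattern succeed wins -- so order
# them accordingly if you care which one matches.
"cats" =~ /cat|cats/;  print "cat|cats matched <$&>\n";   # <cat>
"cats" =~ /cats|cat/;  print "cats|cat matched <$&>\n";   # <cats>
```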
Rule 3

Any particular alternative matches if every item listed in the alternative matches sequentially according to Rules 4 and 5 (such that the entire regular expression can be satisfied).

An item consists of either an assertion, which is covered in Rule 4, or a quantified atom, covered by Rule 5. Items that have choices on how to match are given a "pecking order" from left to right. If the items cannot be matched in order, the Engine backtracks to the next alternative under Rule 2.

Items that must be matched sequentially aren't separated in the regular expression by anything syntactic—they're merely juxtaposed in the order they must match. When you ask to match /^foo/, you're actually asking for four items to be matched one after the other. The first is a zero-width assertion, matched under Rule 4, and the other three are ordinary characters that must match themselves, one after the other, under Rule 5.

The left-to-right pecking order means that in a pattern like:

    /x*y*/

x* gets to pick one way to match, and then y* tries all its ways. If that fails, then x* gets to pick its second choice, and make y* try all of its ways again. And so on. The items to the right "vary faster", to borrow a phrase from multidimensional arrays.

Rule 4

If an assertion does not match at the current position, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.

Some assertions are fancier than others. Perl supports many regex extensions, some of which are zero-width assertions. For example, the positive lookahead (?=...) and the negative lookahead (?!...) don't actually match any characters, but merely assert that the regular expression represented by ... would (or would not) match at this point, were we to attempt it, hypothetically speaking.*

Rule 5

A quantified atom matches only if the atom itself matches some number of times that is allowed by the quantifier. (The atom itself is matched according to Rule 6.)
* In actual fact, the Engine does attempt it. The Engine goes back to Rule 2 to test the subpattern, and then wipes out any record of how much string was eaten, returning only the success or failure of the subpattern as the value of the assertion. (It does, however, remember any captured substrings.)

Different quantifiers require different numbers of matches, and most of them allow a range of numbers of matches. Multiple matches must all match in a row; that is, they must be adjacent within the string. An unquantified atom is assumed to have a quantifier requiring exactly one match (that is, /x/ is the same as /x{1}/). If no match can be found at the current position for any allowed quantity of the atom in question, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.

The quantifiers are *, +, ?, *?, +?, ??, and the various brace forms. If you use the {COUNT} form, then there is no choice, and the atom must match exactly that number of times or not at all. Otherwise, the atom can match over a range of quantities, and the Engine keeps track of all the choices so that it can backtrack if necessary. But then the question arises as to which of these choices to try first. One could start with the maximal number of matches and work down, or the minimal number of matches and work up.

The traditional quantifiers (without a trailing question mark) specify greedy matching; that is, they attempt to match as many characters as possible. To find the greediest match, the Engine has to be a little bit careful. Bad guesses are potentially rather expensive, so the Engine doesn't actually count down from the maximum value, which after all could be Very Large and cause millions of bad guesses. What the Engine actually does is a little bit smarter: it first counts up to find out how many matching atoms (in a row) are really there in the string, and then it uses that actual maximum as its first choice.
(It also remembers all the shorter choices in case the longest one doesn't pan out.) It then (at long last) tries to match the rest of the pattern, assuming the longest choice to be the best. If the longest choice fails to produce a match for the rest of the pattern, it backtracks and tries the next longest.

If you say /.*foo/, for example, it will try to match the maximal number of "any" characters (represented by the dot) clear out to the end of the line before it ever tries looking for "foo"; and then when the "foo" doesn't match there (and it can't, because there's not enough room for it at the end of the string), the Engine will back off one character at a time until it finds a "foo". If there is more than one "foo" in the line, it'll stop on the last one, since that will really be the first one it encounters as it backtracks. When the entire pattern succeeds using some particular length of .*, the Engine knows it can throw away all the other shorter choices for .* (the ones it would have used had the current "foo" not panned out).

By placing a question mark after any greedy quantifier, you turn it into a frugal quantifier that chooses the smallest quantity for the first try. So if you say /.*?foo/, the .*? first tries to match 0 characters, then 1 character, then 2, and so on until it can match the "foo". Instead of backtracking backward, it backtracks forward, so to speak, and ends up finding the first "foo" on the line instead of the last.

Rule 6

Each atom matches according to the designated semantics of its type. If the atom doesn't match (or does match, but doesn't allow a match of the rest of the pattern), the Engine backtracks to Rule 5 and tries the next choice for the atom's quantity.

Atoms match according to the following types:

• A regular expression in parentheses, (...), matches whatever the regular expression (represented by ...) matches according to Rule 2.
Parentheses therefore serve as a clustering operator for quantification. Bare parentheses also have the side effect of capturing the matched substring for later use in a backreference. This side effect can be suppressed by using (?:...) instead, which has only the clustering semantics—it doesn't store anything in $1, $2, and so on. Other forms of parenthetical atoms (and assertions) are possible—see the rest of this chapter.

• A dot matches any character, except maybe newline.

• A list of characters in square brackets (a character class) matches any one of the characters specified by the list.

• A backslashed letter matches either a particular character or a character from a set of characters, as listed in Table 5-7.

• Any other backslashed character matches that character.

• Any character not mentioned above matches itself.

That all sounds rather complicated, but the upshot of it is that, for each set of choices given by a quantifier or alternation, the Engine has a knob it can twiddle. It will twiddle those knobs until the entire pattern matches. The Rules just say in which order the Engine is allowed to twiddle those knobs. Saying the Engine prefers the leftmost match merely means it twiddles the start position knob the slowest. And backtracking is just the process of untwiddling the knob you just twiddled in order to try twiddling a knob higher in the pecking order, that is, one that varies slower.

Here's a more concrete example, a program that detects when two consecutive words share a common ending and beginning:

    $a = 'nobody';
    $b = 'bodysnatcher';
    if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
        print "$2 overlaps in $1-$2-$3\n";
    }

This prints:

    body overlaps in no-body-snatcher

You might think that $1 would first grab up all of "nobody" due to greediness. And in fact, it does—at first. But once it's done so, there aren't any further characters to put in $2, which needs characters put into it because of the + quantifier.
So the Engine backs up and $1 begrudgingly gives up one character to $2. This time the space character matches successfully, but then it sees \2, which represents a measly "y". The next character in the string is not a "y", but a "b". This makes the Engine back up all the way and try several more times, eventually forcing $1 to surrender the body to $2. Habeas corpus, as it were.

Actually, that won't quite work out if the overlap is itself the product of a doubling, as in the two words "rococo" and "cocoon". The algorithm above would have decided that the overlapping string, $2, must be just "co" rather than "coco". But we don't want a "rocococoon"; we want a "rococoon". Here's one of those places you can outsmart the Engine. Adding a minimal matching quantifier to the $1 part gives the much better pattern /^(\w+?)(\w+) \2(\w+)$/, which does exactly what we want.

For a much more detailed discussion of the pros and cons of various kinds of regular expression engines, see Jeffrey Friedl's book, Mastering Regular Expressions. Perl's regular expression Engine works very well for many of the everyday problems you want to solve with Perl, and it even works okay for those not-so-everyday problems, if you give it a little respect and understanding.

Fancy Patterns

Lookaround Assertions

Sometimes you just need to sneak a peek. There are four regex extensions that help you do just that, and we call them lookaround assertions because they let you scout around in a hypothetical sort of way, without committing to matching any characters. What these assertions assert is that some pattern would (or would not) match if we were to try it. The Engine works it all out for us by actually trying to match the hypothetical pattern, and then pretending that it didn't match (if it did). When the Engine peeks ahead from its current position in the string, we call it a lookahead assertion. If it peeks backward, we call it a lookbehind assertion.
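Before the four forms get their own sections, here's a small sketch of both kinds of peeking at once (the CSS-flavored string is just our own illustrative data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Grab each number that is preceded by a space (lookbehind) and
# followed by "px" (lookahead).  Neither assertion consumes any
# characters; only the digits themselves are captured.
my $style = "margin: 10px 20em 30px";
my @px = $style =~ /(?<= )(\d+)(?=px)/g;
print "@px\n";    # 10 30
```

The "20" is skipped because its lookahead fails, even though its lookbehind succeeds.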
The lookahead patterns can be any regular expression, but the lookbehind patterns may only be fixed width, since they have to know where to start the hypothetical match from.

While these four extensions are all zero-width assertions, and hence do not consume characters (at least, not officially), you can in fact capture substrings within them if you supply extra levels of capturing parentheses.

(?=PATTERN) (positive lookahead)

When the Engine encounters (?=PATTERN), it looks ahead in the string to ensure that PATTERN occurs. If you'll recall, in our earlier duplicate word remover, we had to write a loop because the pattern ate too much each time through:

    $_ = "Paris in THE THE THE THE spring.";
    # remove duplicate words (and triplicate (and quadruplicate...))
    1 while s/\b(\w+) \1\b/$1/gi;

Whenever you hear the phrase "ate too much", you should always think "lookahead assertion". (Well, almost always.) By peeking ahead instead of gobbling up the second word, you can write a one-pass duplicate word remover like this:

    s/ \b(\w+) \s (?= \1\b ) //gxi;

Of course, this isn't quite right, since it will mess up valid phrases like "The clothes you DON DON't fit."

(?!PATTERN) (negative lookahead)

When the Engine encounters (?!PATTERN), it looks ahead in the string to ensure that PATTERN does not occur. To fix our previous example, we can add a negative lookahead assertion after the positive assertion to weed out the case of contractions:

    s/ \b(\w+) \s (?= \1\b (?! '\w))//xgi;

That final \w is necessary to avoid confusing contractions with words at the ends of single-quoted strings. We can take this one step further, since earlier in this chapter we intentionally used "that that particular", and we'd like our program to not "fix" that for us. So we can add an alternative to the negative lookahead in order to pre-unfix that "that", (thereby demonstrating that any pair of parentheses can be used to cluster alternatives):

    s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular))//gix;

Now we know that that particular phrase is safe. Unfortunately, the Gettysburg Address is still broken. So we add another exception:

    s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular | \s nation))//igx;

This is just starting to get out of hand. So let's do an Official List of Exceptions, using a cute interpolation trick with the $" variable to separate the alternatives with the | character:

    @thatthat = qw(particular nation);
    local $" = '|';
    s/ \b(\w+) \s (?= \1\b (?! '\w | \s (?: @thatthat )))//xig;

(?<=PATTERN) (positive lookbehind)

When the Engine encounters (?<=PATTERN), it looks backward in the string to ensure that PATTERN already occurred. Our example still has a problem. Although it now lets Honest Abe say things like "that that nation", it also allows "Paris, in the the nation of France". We can add a positive lookbehind assertion in front of our exception list to make sure that we apply our @thatthat exceptions only to a real "that that".

    s/ \b(\w+) \s (?= \1\b (?! '\w | (?<= that) \s (?: @thatthat )))//ixg;

Yes, it's getting terribly complicated, but that's why this section is called "Fancy Patterns", after all. If you need to complicate the pattern any more than we've done so far, judicious use of comments and qr// will help keep you sane. Or at least saner.

(?<!PATTERN) (negative lookbehind)

When the Engine encounters (?<!PATTERN), it looks backward in the string to ensure that PATTERN did not occur. Let's go for a really simple example this time. How about the easy version of that old spelling rule, "I before E except after C"? In Perl, you spell it:

    s/(?<!c)ei/ie/g

You'll have to weigh for yourself whether you want to handle any of the exceptions. (For example, "weird" is spelled weird, especially when you spell it "wierd".)

Nonbacktracking Subpatterns

As described in "The Little Engine That /Could(n't)?/", the Engine often backtracks as it proceeds through the pattern.
You can block the Engine from backtracking back through a particular set of choices by creating a nonbacktracking subpattern. A nonbacktracking subpattern looks like (?>PATTERN), and it works exactly like a simple (?:PATTERN), except that once PATTERN has found a match, it suppresses backtracking on any of the quantifiers or alternatives inside the subpattern. (Hence, it is meaningless to use this on a PATTERN that doesn't contain quantifiers or alternatives.) The only way to get it to change its mind is to backtrack to something before the subpattern and reenter the subpattern from the left.

It's like going into a car dealership. After a certain amount of haggling over the price, you deliver an ultimatum: "Here's my best offer; take it or leave it." If they don't take it, you don't go back to haggling again. Instead, you backtrack clear out the door. Maybe you go to another dealership, and start haggling again. You're allowed to haggle again, but only because you reentered the nonbacktracking pattern again in a different context. For devotees of Prolog or SNOBOL, you can think of this as a scoped cut or fence operator.

Consider how in "aaab" =~ /(?:a*)ab/, the a* first matches three a's, but then gives up one of them because the last a is needed later. The subgroup sacrifices some of what it wants in order for the whole match to succeed. (Which is like letting the car salesman talk you into giving him more of your money because you're afraid to walk away from the deal.) In contrast, the subpattern in "aaab" =~ /(?>a*)ab/ will never give up what it grabs, even though this behavior causes the whole match to fail. (As the song says, you have to know when to hold 'em, when to fold 'em, and when to walk away.)

Although (?>PATTERN) is useful for changing the behavior of a pattern, it's mostly used for speeding up the failure of certain matches that you know will fail anyway (unless they succeed outright).
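The car-dealership behavior is easy to watch (a minimal sketch of the two matches just described):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The cooperative cluster haggles: a* gives back its last "a" so
# the trailing "ab" can still match.
print "plain:  ", ("aaab" =~ /(?:a*)ab/ ? "match" : "no match"), "\n";
# plain:  match

# The nonbacktracking version walks out the door instead.  At every
# starting offset, (?>a*) grabs all the a's it can and refuses to
# return one, so the literal "ab" never finds its "a".
print "atomic: ", ("aaab" =~ /(?>a*)ab/ ? "match" : "no match"), "\n";
# atomic: no match
```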
The Engine can take a spectacularly long time to fail, particularly with nested quantifiers. The following pattern will succeed almost instantly:

    $_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab";
    /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/;

But success is not the problem. Failure is. If you remove that final "b" from the string, the pattern will probably run for many, many years before failing. Many, many millennia. Actually, billions and billions of years.* You can see by inspection that the pattern can't succeed if there's no "b" on the end of the string, but the regex optimizer is not smart enough (as of this writing) to figure out that /[b]/ is equivalent to /b/. But if you give it a hint, you can get it to fail quickly while still letting it succeed where it can:

    /(?>a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*)[b]/;

* Actually, it's more on the order of septillions and septillions. We don't know exactly how long it would take. We didn't care to wait around watching it not fail. In any event, your computer is likely to crash before the heat death of the universe, and this regular expression takes longer than either of those.

For a (hopefully) more realistic example, imagine a program that's supposed to read in a paragraph at a time and show just the lines that are continued, where continuation lines are specified with trailing backslashes. Here's a sample from Perl's Makefile that uses this line-continuation convention:

    # Files to be built with variable substitution before miniperl
    # is available.
    sh = Makefile.SH cflags.SH config_h.SH makeaperl.SH makedepend.SH \
            makedir.SH myconfig.SH writemain.SH

You could write your simple program this way:

    #!/usr/bin/perl -00p
    while ( /( (.+) ( (?<=\\) \n .* )+ ) /gx) {
        print "GOT $.: $1\n\n";
    }

That works, but it's really quite slow.
That's because the Engine backtracks a character at a time from the end of the line, shrinking what's in $1. This is pointless. And writing it without the extraneous captures doesn't help much. Using:

    (.+(?:(?<=\\)\n.*)+)

for a pattern is somewhat faster, but not much. This is where a nonbacktracking subpattern helps a lot. The pattern:

    ((?>.+)(?:(?<=\\)\n.*)+)

does the same thing, but more than an order of magnitude faster because it doesn't waste time backtracking in search of something that isn't there. You'll never get a success with (?>...) that you wouldn't get with (?:...) or even a simple (...). But if you're going to fail, it's best to fail quickly and get on with your life.

Programmatic Patterns

Most Perl programs tend to follow an imperative (also called procedural) programming style, like a series of discrete commands laid out in a readily observable order: "Preheat oven, mix, glaze, heat, cool, serve to aliens." Sometimes into this mix you toss a few dollops of functional programming ("Use a little more glaze than you think you need, even after taking this into account, recursively"), or sprinkle it with bits of object-oriented techniques ("but please hold the anchovy objects"). Often it's a combination of all of these.

But the regular expression Engine takes a completely different approach to problem solving, more of a declarative approach. You describe goals in the language of regular expressions, and the Engine implements whatever logic is needed to solve your goals. Logic programming languages (such as Prolog) don't always get as much exposure as the other three styles, but they're more common than you'd think. Perl couldn't even be built without make(1) or yacc(1), both of which could be considered, if not purely declarative languages, at least hybrids that blend imperative and logic programming together.
You can do this sort of thing in Perl, too, by blending goal declarations and imperative code together more miscibly than we've done so far, drawing upon the strengths of both. You can programmatically build up the string you'll eventually present to the regex Engine, in a sense creating a program that writes a new program on the fly. You can also supply ordinary Perl expressions as the replacement part of s/// via the /e modifier. This allows you to dynamically generate the replacement string by executing a bit of code every time the pattern matches. Even more elaborately, you can interject bits of code wherever you'd like in the middle of a pattern using the (?{ CODE }) extension, and that code will be executed every time the Engine encounters that code as it advances and recedes in its intricate backtracking dance. Finally, you can use s///ee or (??{ CODE }) to add another level of indirection: the results of executing those code snippets will themselves be re-evaluated for further use, creating bits of program and pattern on the fly, just in time.

Generated patterns

It has been said* that programs that write programs are the happiest programs in the world. In Jeffrey Friedl's book, Mastering Regular Expressions, the final tour de force demonstrates how to write a program that produces a regular expression to determine whether a string conforms to the RFC 822 standard; that is, whether it contains a standards-compliant, valid mail header. The pattern produced is several thousand characters long, and about as easy to read as a crash dump in pure binary. But Perl's pattern matcher doesn't care about that; it just compiles up the pattern without a hitch and, even more interestingly, executes the match very quickly—much more quickly, in fact, than many short patterns with complex backtracking requirements. That's a very complicated example.

* By Andrew Hume, the famous Unix philosopher.
Earlier we showed you a very simple example of the same technique when we built up a $number pattern out of its components (see the section "Variable Interpolation"). But to show you the power of this programmatic approach to producing a pattern, let's work out a problem of medium complexity. Suppose you wanted to pull out all the words with a certain vowel-consonant sequence; for example, "audio" and "eerie" both follow a VVCVV pattern.

Although describing what counts as a consonant or a vowel is easy, you wouldn't ever want to type that in more than once. Even for our simple VVCVV case, you'd need to type in a pattern that looked something like this:

    ^[aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy]$

A more general-purpose program would accept a string like "VVCVV" and programmatically generate that pattern for you. For even more flexibility, it could accept a word like "audio" as input and use that as a template to infer "VVCVV", and from that, the long pattern above. It sounds complicated, but really isn't, because we'll let the program generate the pattern for us. Here's a simple cvmap program that does all of that:

    #!/usr/bin/perl
    $vowels = 'aeiouy';
    $cons   = 'cbdfghjklmnpqrstvwxzy';
    %map = (C => $cons, V => $vowels);        # init map for C and V

    for $class ($vowels, $cons) {             # now for each type
        for (split //, $class) {              # get each letter of that type
            $map{$_} .= $class;               # and map the letter back to the type
        }
    }

    for $char (split //, shift) {             # for each letter in template word
        $pat .= "[$map{$char}]";              # add appropriate character class
    }

    $re = qr/^${pat}$/i;                      # compile the pattern
    print "REGEX is $re\n";                   # debugging output

    @ARGV = ('/usr/dict/words')               # pick a default dictionary
        if -t && !@ARGV;

    while (<>) {                              # and now blaze through the input
        print if /$re/;                       # printing any line that matches
    }

The %map variable holds all the interesting bits.
Its keys are each letter of the alphabet, and the corresponding value is all the letters of its type. We throw in C and V, too, so you can specify either "VVCVV" or "audio", and still get out "eerie". Each character in the argument supplied to the program is used to pull out the right character class to add to the pattern. Once the pattern is created and compiled up with qr//, the match (even a very long one) will run quickly. Here's what you might get if you run this program on "fortuitously":

    % cvmap fortuitously /usr/dict/words
    REGEX is (?i-xsm:^[cbdfghjklmnpqrstvwxzy][aeiouy][cbdfghjklmnpqrstvwxzy][cbd
    fghjklmnpqrstvwxzy][aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy][c
    bdfghjklmnpqrstvwxzy][cbdfghjklmnpqrstvwxzy][aeiouycbdfghjklmnpqrstvwxzy]$)
    carriageable
    circuitously
    fortuitously
    languorously
    marriageable
    milquetoasts
    sesquiquarta
    sesquiquinta
    villainously

Looking at that REGEX, you can see just how much villainous typing you saved by programming languorously, albeit circuitously.

Substitution evaluations

When the /e modifier ("e" is for expression evaluation) is used on an s/PATTERN/CODE/e expression, the replacement portion is interpreted as a Perl expression, not just as a double-quoted string. It's like an embedded do { CODE }. Even though it looks like a string, it's really just a code block that gets compiled up at the same time as the rest of your program, long before the substitution actually happens. You can use the /e modifier to build replacement strings with fancier logic than double-quote interpolation allows. This shows the difference:

    s/(\d+)/$1 * 2/;     # Replaces "42" with "42 * 2"
    s/(\d+)/$1 * 2/e;    # Replaces "42" with "84"

And this converts Celsius temperatures into Fahrenheit:

    $_ = "Preheat oven to 233C.\n";
    s/\b(\d+\.?\d*)C\b/int($1 * 1.8 + 32) . "F"/e;    # convert to 451F

Applications of this technique are limitless.
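For instance, here's a quick sketch of our own that bumps every version number in a string by one; the replacement "$1 + 1" runs as Perl code rather than being interpolated as text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# /g applies the substitution to every number; /e evaluates the
# replacement as an expression each time.
my $versions = "v1, v2, v9";
$versions =~ s/(\d+)/$1 + 1/ge;
print "$versions\n";    # v2, v3, v10
```

Without the /e, each number would be replaced by the literal text "1 + 1", "2 + 1", and so on.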
Here's a filter that modifies its files in place (like an editor) by adding 100 to every number that starts a line (and that is followed by a colon, which we only peek at, but don't actually match, or replace):

    % perl -pi -e 's/^(\d+)(?=:)/100 + $1/e' filename

Now and then, you want to do more than just use the string you matched in another computation. Sometimes you want that string to be a computation, whose own evaluation you'll use for the replacement value. Each additional /e modifier after the first wraps an eval around the code to execute. The following two lines do the same thing, but the first one is easier to read:

    s/PATTERN/CODE/ee
    s/PATTERN/eval(CODE)/e

You could use this technique to replace mentions of simple scalar variables with their values:

    s/(\$\w+)/$1/eeg;    # Interpolate most scalars' values

Because it's really an eval, the /ee even finds lexical variables. A slightly more elaborate example calculates a replacement for simple arithmetical expressions on (nonnegative) integers:

    $_ = "I have 4 + 19 dollars and 8/2 cents.\n";
    s{
        (
          \d+ \s*     # find an integer
          [+*/-]      # and an arithmetical operator
          \s* \d+     # and another integer
        )
    }{ $1 }eegx;      # then expand $1 and run that code
    print;            # "I have 23 dollars and 4 cents."

Like any other eval STRING, compile-time errors (like syntax problems) and run-time exceptions (like dividing by zero) are trapped. When one is, the $@ ($EVAL_ERROR) variable says what went wrong.

Match-time code evaluation

In most programs that use regular expressions, the surrounding program's run-time control structure drives the logical execution flow. You write if or while loops, or make function or method calls, that wind up calling a pattern-matching operation now and then. Even with s///e, it's the substitution operator that is in control, executing the replacement code only after a successful match. With code subpatterns, the normal relationship between regular expression and program code is inverted.
As the Engine is applying its Rules to your pattern at match time, it may come across a regex extension of the form (?{ CODE }). When triggered, this subpattern doesn't do any matching or any looking about. It's a zero-width assertion that always "succeeds", evaluated only for its side effects. Whenever the Engine needs to progress over the code subpattern as it executes the pattern, it runs that code.

    "glyph" =~ /.+ (?{ print "hi" }) ./x;    # Prints "hi" twice.

As the Engine tries to match glyph against this pattern, it first lets the .+ eat up all five letters. Then it prints "hi". When it finds that final dot, all five letters have been eaten, so it needs to backtrack back to the .+ and make it give up one of the letters. Then it moves forward through the pattern again, stopping to print "hi" again before assigning h to the final dot and completing the match successfully.

The braces around the CODE fragment are intended to remind you that it is a block of Perl code, and it certainly behaves like a block in the lexical sense. That is, if you use my to declare a lexically scoped variable in it, it is private to the block. But if you use local to localize a dynamically scoped variable, it may not do what you expect. A (?{ CODE }) subpattern creates an implicit dynamic scope that is valid throughout the rest of the pattern, until it either succeeds or backtracks through the code subpattern. One way to think of it is that the block doesn't actually return when it gets to the end. Instead, it makes an invisible recursive call to the Engine to try to match the rest of the pattern. Only when that recursive call is finished does it return from the block, delocalizing the localized variables.*

In the next example, we initialize $i to 0 by including a code subpattern at the beginning of the pattern. Then we match any number of characters with .*—but we place another code subpattern in between the . and the * so we can count how many times . matches.
    $_ = 'lothlorien';
    m/ (?{ $i = 0 })                # Set $i to 0
       (. (?{ $i++ }))*             # Update $i, even after backtracking
       lori                         # Forces a backtrack
    /x;

The Engine merrily goes along, setting $i to 0 and letting the .* gobble up all 10 characters in the string. When it encounters the literal lori in the pattern, it backtracks and gives up those four characters from the .*. After the match, $i will still be 10. If you wanted $i to reflect how many characters the .* actually ended up with, you could make use of the dynamic scope within the pattern:

    $_ = 'lothlorien';
    m/ (?{ $i = 0 })
       (. (?{ local $i = $i + 1; }))*   # Update $i, backtracking-safe.
       lori
       (?{ $result = $i })              # Copy to non-localized location.
    /x;

Here, we use local to ensure that $i contains the number of characters matched by .*, regardless of backtracking. $i will be forgotten after the regular expression ends, so the code subpattern, (?{ $result = $i }), ensures that the count will live on in $result. The special variable $^R (described in Chapter 28) holds the result of the last (?{ CODE }) that was executed as part of a successful match.

* People who are familiar with recursive descent parsers may find this behavior confusing because such compilers return from a recursive function call whenever they figure something out. The Engine doesn't do that—when it figures something out, it goes deeper into recursion (even when exiting a parenthetical group!). A recursive descent parser is at a minimum of recursion when it succeeds at the end, but the Engine is at a local maximum of recursion when it succeeds at the end of the pattern. You might find it helpful to dangle the pattern from its left end and think of it as a skinny representation of a call graph tree. If you can get that picture into your head, the dynamic scoping of local variables will make more sense. (And if you can't, you're no worse off than before.)

You can use a (?{ CODE }) extension as the COND of a (?(COND)IFTRUE|IFFALSE).
If you do this, $^R will not be set, and you may omit the parentheses around the conditional:

    "glyph" =~ /.+(?(?{ $foo{bar} gt "symbol" }).|signet)./;

Here, we test whether $foo{bar} is greater than symbol. If so, we include . in the pattern, and if not, we include signet in the pattern. Stretched out a bit, it might be construed as more readable:

    "glyph" =~ m{
        .+                          # some anythings
        (?(?{ $foo{bar}             # if
              gt "symbol" })        # this is true
            .                       # match another anything
        |                           # else
            signet                  # match signet
        )
        .                           # and one more anything
    }x;

When use re 'eval' is in effect, a regex is allowed to contain (?{ CODE }) subpatterns even if the regular expression interpolates variables:

    /(.*?) (?{length($1) < 3 && warn}) $suffix/;   # Error without use re 'eval'

This is normally disallowed since it is a potential security risk. Even though the pattern above may be innocuous because $suffix is innocuous, the regex parser can't tell which parts of the string were interpolated and which ones weren't, so it just disallows code subpatterns entirely if there were any interpolations. If the pattern is obtained from tainted data, even use re 'eval' won't allow the pattern match to proceed.

When use re 'taint' is in effect and a tainted string is the target of a regex, the captured subpatterns (either in the numbered variables or in the list of values returned by m// in list context) are tainted. This is useful when regex operations on tainted data are meant not to extract safe substrings, but merely to perform other transformations. See Chapter 23, Security, for more on tainting.

For the purpose of this pragma, precompiled regular expressions (usually obtained from qr//) are not considered to be interpolated:

    /foo${pat}bar/

This is allowed if $pat is a precompiled regular expression, even if $pat contains (?{ CODE }) subpatterns.

Earlier we showed you a bit of what use re 'debug' prints out.
A more primitive debugging solution is to use (?{ CODE }) subpatterns to print out what's been matched so far during the match:

    "abcdef" =~ / .+ (?{print "Matched so far: $&\n"}) bcdef $/x;

This prints:

    Matched so far: abcdef
    Matched so far: abcde
    Matched so far: abcd
    Matched so far: abc
    Matched so far: ab
    Matched so far: a

showing the .+ grabbing all the letters and giving them up one by one as the Engine backtracks.

Match-time pattern interpolation

You can build parts of your pattern from within the pattern itself. The (??{ CODE }) extension allows you to insert code that evaluates to a valid pattern. It's like saying /$pattern/, except that you can generate $pattern at run time—more specifically, at match time. For instance:

    /\w (??{ if ($threshold > 1) { "red" } else { "blue" } }) \d/x;

This is equivalent to /\wred\d/ if $threshold is greater than 1, and /\wblue\d/ otherwise.

You can include backreferences inside the evaluated code to derive patterns from just-matched substrings (even if they will later become unmatched through backtracking). For instance, this matches all strings that read the same backward as forward (known as palindromedaries, phrases with a hump in the middle):

    /^ (.+) .? (??{quotemeta reverse $1}) $/xi;

You can balance parentheses like so:

    $text =~ /( \(+ ) (.*?) (??{ '\)' x length $1 })/x;

This matches strings of the form (shazam!) and (((shazam!))), sticking shazam! into $2. Unfortunately, it doesn't notice whether the parentheses in the middle are balanced. For that we need recursion.

Fortunately, you can do recursive patterns too. You can have a compiled pattern that uses (??{ CODE }) to refer to itself. Recursive matching is pretty irregular, as regular expressions go. Any text on regular expressions will tell you that a standard regex can't match nested parentheses correctly. And that's correct. It's also correct that Perl's regexes aren't standard.
The following pattern* matches a set of nested parentheses, however deep they go:

    $np = qr{
               \(
               (?:
                  (?> [^()]+ )        # Non-parens without backtracking
                |
                  (??{ $np })         # Group with matching parens
               )*
               \)
            }x;

You could use it like this to match a function call:

    $funpat = qr/\w+$np/;
    'myfunfun(1,(2*(3+4)),5)' =~ /^$funpat$/;   # Matches!

Conditional interpolation

The (?(COND)IFTRUE|IFFALSE) regex extension is similar to Perl's ?: operator. If COND is true, the IFTRUE pattern is used; otherwise, the IFFALSE pattern is used. The COND can be a backreference (expressed as a bare integer, without the \ or $), a lookaround assertion, or a code subpattern. (See the sections "Lookaround Assertions" and "Match-time code evaluation" earlier in this chapter.)

If the COND is an integer, it is treated as a backreference. For instance, consider:

    #!/usr/bin/perl
    $x = 'Perl is free.';
    $y = 'ManagerWare costs $99.95.';
    foreach ($x, $y) {
        /^(\w+) (?:is|(costs)) (?(2)(\$\d+)|\w+)/;   # Either (\$\d+) or \w+
        if ($3) {
            print "$1 costs money.\n";        # ManagerWare costs money.
        } else {
            print "$1 doesn't cost money.\n"; # Perl doesn't cost money.
        }
    }

Here, the COND is (2), which is true if a second backreference exists. If that's the case, (\$\d+) is included in the pattern at that point (creating the $3 backreference); otherwise, \w+ is used.

* Note that you can't declare the variable in the same statement in which you're going to use it. You can always declare it earlier, of course.

If the COND is a lookaround or code subpattern, the truth of the assertion is used to determine whether to include IFTRUE or IFFALSE:

    /[ATGC]+(?(?<=AA)G|C)$/;

This uses a lookbehind assertion as the COND to match a DNA sequence that ends in either AAG, or some other base combination and C.

You can omit the |IFFALSE alternative.
If you do, the IFTRUE pattern will be included in the pattern as usual if the COND is true, but if the condition isn't true, the Engine will move on to the next portion of the pattern.

Defining Your Own Assertions

You can't change how Perl's Engine works, but if you're sufficiently warped, you can change how it sees your pattern. Since Perl interprets your pattern similarly to double-quoted strings, you can use the wonder of overloaded string constants to see to it that text sequences of your choosing are automatically translated into other text sequences.

In the example below, we specify two transformations to occur when Perl encounters a pattern. First, we define \tag so that when it appears in a pattern, it's automatically translated to (?:<.*?>), which matches most HTML and XML tags. Second, we "redefine" the \w metasymbol so that it handles only English letters. We'll define a package called Tagger that hides the overloading from our main program.

Once we do that, we'll be able to say:

    use Tagger;
    $_ = '<I>camel</I>';
    print "Tagged camel found" if /\tag\w+\tag/;

Here's Tagger.pm, couched in the form of a Perl module (see Chapter 11):

    package Tagger;
    use overload;

    sub import  { overload::constant 'qr' => \&convert }
    sub convert {
        my $re = shift;
        $re =~ s/ \\tag /<.*?>/xg;
        $re =~ s/ \\w   /[A-Za-z]/xg;
        return $re;
    }

    1;

The Tagger module is handed the pattern immediately before interpolation, so you can bypass the overloading by bypassing interpolation, as follows:

    $re = '\tag\w+\tag';    # This string begins with \t, a tab
    print if /$re/;         # Matches a tab, followed by an "a"...
If you wanted the interpolated variable to be customized, call the convert function directly:

    $re = '\tag\w+\tag';          # This string begins with \t, a tab
    $re = Tagger::convert $re;    # expand \tag and \w
    print if /$re/;               # $re becomes <.*?>[A-Za-z]+<.*?>

Now if you're still wondering what those sub thingies are there in the Tagger module, you'll find out soon enough, because that's what our next chapter is all about.

6
Subroutines

Like many languages, Perl provides for user-defined subroutines.* These subroutines may be defined anywhere in the main program, loaded in from other files via the do, require, or use keywords, or generated at run time using eval. You can even load them at run time with the mechanism described in the section "Autoloading" in Chapter 10, Packages.

You can call a subroutine indirectly, using a variable containing either its name or a reference to the routine, or through an object, letting the object determine which subroutine should really be called. You can generate anonymous subroutines, accessible only through references, and if you want, use these to clone new, nearly identical functions via closures, which are covered in the section by that name in Chapter 8, References.

* We'll also call them functions, but functions are the same thing as subroutines in Perl. Sometimes we'll even call them methods, which are defined the same way, but called differently.

Syntax

To declare a named subroutine without defining it, use one of these forms:

    sub NAME
    sub NAME PROTO
    sub NAME ATTRS
    sub NAME PROTO ATTRS

To declare and define a named subroutine, add a BLOCK:

    sub NAME BLOCK
    sub NAME PROTO BLOCK
    sub NAME ATTRS BLOCK
    sub NAME PROTO ATTRS BLOCK

To create an anonymous subroutine or closure, leave out the NAME:

    sub BLOCK
    sub PROTO BLOCK
    sub ATTRS BLOCK
    sub PROTO ATTRS BLOCK

PROTO and ATTRS stand for the prototype and attributes, each of which is discussed in its own section later in the chapter.
They're not so important—the NAME and the BLOCK are the essential parts, even when they're missing.

For the forms without a NAME, you still have to provide some way of calling the subroutine. So be sure to save the return value, since this form of sub declaration is not only compiled at compile time as you would expect, but also produces a run-time return value:

    $subref = sub BLOCK;

To import subroutines defined in another module, say:

    use MODULE qw(NAME1 NAME2 NAME3...);

To call subroutines directly, say:

    NAME(LIST)          # & is optional with parentheses.
    NAME LIST           # Parens optional if sub predeclared/imported.
    &NAME               # Exposes current @_ to that subroutine,
                        # (and circumvents prototypes).

To call subroutines indirectly (by name or by reference), use any of these:

    &$subref(LIST)      # The & is not optional on indirect call
    $subref->(LIST)     # (unless using infix notation).
    &$subref            # Exposes current @_ to that subroutine.

The official name of a subroutine includes the & prefix. A subroutine may be called using the prefix, but the & is usually optional, and so are the parentheses if the subroutine has been predeclared. However, the & is not optional when you're just naming the subroutine, such as when it's used as an argument to defined or undef or when you want to generate a reference to a named subroutine by saying $subref = \&name. Nor is the & optional when you want to make an indirect subroutine call using the &$subref() or &{$subref}() constructs. However, the more convenient $subref->() notation does not require it. See Chapter 8 for more about references to subroutines.

Perl doesn't force a particular capitalization style on your subroutine names. However, one loosely held convention is that functions called indirectly by Perl's run-time system (BEGIN, CHECK, INIT, END, AUTOLOAD, DESTROY, and all the functions mentioned in Chapter 14, Tied Variables) are in all capitals, so you might want to avoid using that style.
(But subroutines used for constant values are customarily named with all caps too. That's okay. We hope . . . )

Semantics

Before you get too worked up over all that syntax, just remember that the normal way to define a simple subroutine ends up looking like this:

    sub razzle {
        print "Ok, you've been razzled.\n";
    }

and the normal way to call it is simply:

    razzle();

In this case, we ignored inputs (arguments) and outputs (return values). But the Perl model for passing data into and out of a subroutine is really quite simple: all function parameters are passed as one single, flat list of scalars, and multiple return values are likewise returned to the caller as one single, flat list of scalars. As with any LIST, any arrays or hashes passed in these lists will interpolate their values into the flattened list, losing their identities—but there are several ways to get around this, and the automatic list interpolation is frequently quite useful. Both parameter lists and return lists may contain as many or as few scalar elements as you'd like (though you may put constraints on the parameter list by using prototypes). Indeed, Perl is designed around this notion of variadic functions (those taking any number of arguments), unlike C, where they're sort of grudgingly kludged in so that you can call printf(3).

Now, if you're going to design a language around the notion of passing varying numbers of arbitrary arguments, you'd better make it easy to process those arbitrary lists of arguments. Any arguments passed to a Perl routine come in as the array @_. If you call a function with two arguments, they are accessible inside the function as the first two elements of that array: $_[0] and $_[1]. Since @_ is just a regular array with an irregular name, you can do anything to it you'd normally do to an array.* The array @_ is a local array, but its values are aliases to the actual scalar parameters. (This is known as pass-by-reference semantics.)

* This is an area where Perl is more orthogonal than the typical programming language.
Thus you can modify the actual parameters if you modify the corresponding element of @_. (This is rarely done, however, since it's so easy to return interesting values in Perl.)

The return value of the subroutine (or of any other block, for that matter) is the value of the last expression evaluated. Or you may use an explicit return statement to specify the return value and exit the subroutine from any point in the subroutine. Either way, as the subroutine is called in a scalar or list context, so also is the final expression of the routine evaluated in that same scalar or list context.

Tricks with Parameter Lists

Perl does not yet have named formal parameters, but in practice all you do is copy the values of @_ to a my list, which serves nicely for a list of formal parameters. (Not coincidentally, copying the values changes the pass-by-reference semantics into pass-by-value, which is how people usually expect parameters to work anyway, even if they don't know the fancy computer science terms for it.) Here's a typical example:

    sub maysetenv {
        my ($key, $value) = @_;
        $ENV{$key} = $value unless $ENV{$key};
    }

But you aren't required to name your parameters, which is the whole point of the @_ array. For example, to calculate a maximum, you can just iterate over @_ directly:

    sub max {
        my $max = shift(@_);
        for my $item (@_) {
            $max = $item if $max < $item;
        }
        return $max;
    }
    $bestday = max($mon,$tue,$wed,$thu,$fri);

Or you can fill an entire hash at once:

    sub configuration {
        my %options = @_;
        print "Maximum verbosity.\n" if $options{VERBOSE} == 9;
    }
    configuration(PASSWORD => "xyzzy", VERBOSE => 9, SCORE => 0);

Here's an example of not naming your formal arguments so that you can modify your actual arguments:

    upcase_in($v1, $v2);   # this changes $v1 and $v2
    sub upcase_in {
        for (@_) { tr/a-z/A-Z/ }
    }

You aren't allowed to modify constants in this way, of course.
If an argument were actually a scalar literal like "hobbit" or a read-only scalar variable like $1, and you tried to change it, Perl would raise an exception (presumably fatal, possibly career-threatening). For example, this won't work:

    upcase_in("frederick");

It would be much safer if the upcase_in function were written to return a copy of its parameters instead of changing them in place:

    ($v3, $v4) = upcase($v1, $v2);
    sub upcase {
        my @parms = @_;
        for (@parms) { tr/a-z/A-Z/ }
        # Check whether we were called in list context.
        return wantarray ? @parms : $parms[0];
    }

Notice how this (unprototyped) function doesn't care whether it was passed real scalars or arrays. Perl will smash everything into one big, long, flat @_ parameter list. This is one of the places where Perl's simple argument-passing style shines. The upcase function will work perfectly well without changing the upcase definition even if we feed it things like this:

    @newlist = upcase(@list1, @list2);
    @newlist = upcase( split /:/, $var );

Do not, however, be tempted to do this:

    (@a, @b) = upcase(@list1, @list2);   # WRONG

Why not? Because, like the flat incoming parameter list in @_, the return list is also flat. So this stores everything in @a and empties out @b by storing the null list there. See the later section "Passing References" for alternatives.

Error Indications

If you want your function to return in such a way that the caller will realize there's been an error, the most natural way to do this in Perl is to use a bare return statement without an argument. That way when the function is used in scalar context, the caller gets undef, and when used in list context, the caller gets a null list.

Under extraordinary circumstances, you might choose to raise an exception to indicate an error. Use this measure sparingly, though; otherwise, your whole program will be littered with exception handlers.
For example, failing to open a file in a generic file-opening function is hardly an exceptional event. However, ignoring that failure might well be. The wantarray built-in returns undef if your function was called in void context, so you can tell if you're being ignored:

    if ($something_went_awry) {
        return if defined wantarray;  # good, not void context.
        die "Pay attention to my error, you danglesocket!!!\n";
    }

Scoping Issues

Subroutines may be called recursively because each call gets its own argument array, even when the routine calls itself. If a subroutine is called using the & form, the argument list is optional. If the & is used but the argument list is omitted, something special happens: the @_ array of the calling routine is supplied implicitly. This is an efficiency mechanism that new users may wish to avoid.

    &foo(1,2,3);   # pass three arguments
    foo(1,2,3);    # the same

    foo();         # pass a null list
    &foo();        # the same

    &foo;          # foo() gets current args, like foo(@_), but faster!
    foo;           # like foo() if sub foo predeclared, else bareword "foo"

Not only does the & form make the argument list optional, but it also disables any prototype checking on the arguments you do provide. This is partly for historical reasons and partly to provide a convenient way to cheat if you know what you're doing. See the section "Prototypes" later in this chapter.

Variables you access from inside a function that haven't been declared private to that function are not necessarily global variables; they still follow the normal block-scoping rules of Perl. As explained in the "Names" section of Chapter 2, Bits and Pieces, this means they look first in the surrounding lexical scope (or scopes) for resolution, then on to the single package scope. From the viewpoint of a subroutine, then, any my variables from an enclosing lexical scope are still perfectly visible.
For example, the bumpx function below has access to the file-scoped $x lexical variable because the scope where the my was declared—the file itself—hasn't been closed off before the subroutine is defined:

    # top of file
    my $x = 10;            # declare and initialize variable
    sub bumpx { $x++ }     # function can see outer lexical variable

C and C++ programmers would probably think of $x as a "file static" variable. It's private as far as functions in other files are concerned, but global from the perspective of functions declared after the my. C programmers who come to Perl looking for what they would call "static variables" for files or functions find no such keyword in Perl. Perl programmers generally avoid the word "static", because static systems are dead and boring, and because the word is so muddled in historical usage.

Although Perl doesn't include the word "static" in its lexicon, Perl programmers have no problem creating variables that are private to a function and persist across function calls. There's just no special word for these. Perl's richer scoping primitives combine with automatic memory management in ways that someone looking for a "static" keyword might never think of trying. Lexical variables don't get automatically garbage collected just because their scope has exited; they wait to get recycled until they're no longer used, which is much more important.

To create private variables that aren't automatically reset across function calls, enclose the whole function in an extra block and put both the my declaration and the function definition within that block. You can even put more than one function there for shared access to an otherwise private variable:

    {
        my $counter = 0;
        sub next_counter { return ++$counter }
        sub prev_counter { return --$counter }
    }

As always, access to the lexical variable is limited to code within the same lexical scope.
The names of the two functions, on the other hand, are globally accessible (within the package), and, since they were defined inside $counter's scope, they can still access that variable even though no one else can.

If this function is loaded via require or use, then this is probably just fine. If it's all in the main program, you'll need to make sure any run-time assignment to my is executed early enough, either by putting the whole block before your main program, or alternatively, by placing a BEGIN or INIT block around it to make sure it gets executed before your program starts:

    BEGIN {
        my @scale = ('A' .. 'G');
        my $note  = -1;
        sub next_pitch { return $scale[ ($note += 1) %= @scale ] };
    }

The BEGIN doesn't affect the subroutine definition, nor does it affect the persistence of any lexicals used by the subroutine. It's just there to ensure the variables get initialized before the subroutine is ever called. For more on declaring private and global variables, see my and our respectively in Chapter 29, Functions. The BEGIN and INIT constructs are explained in Chapter 18, Compiling.

Passing References

If you want to pass more than one array or hash into or out of a function, and you want them to maintain their integrity, then you'll need to use an explicit pass-by-reference mechanism. Before you do that, you need to understand references as detailed in Chapter 8. This section may not make much sense to you otherwise. But hey, you can always look at the pictures . . .

Here are a few simple examples. First, let's define a function that expects a reference to an array.
When the array is large, it's much faster to pass it in as a single reference than a long list of values:

    $total = sum ( \@a );
    sub sum {
        my ($aref)  = @_;
        my ($total) = 0;
        foreach (@$aref) { $total += $_ }
        return $total;
    }

Let's pass in several arrays to a function and have it pop each of them, returning a new list of all their former last elements:

    @tailings = popmany ( \@a, \@b, \@c, \@d );
    sub popmany {
        my @retlist = ();
        for my $aref (@_) {
            push @retlist, pop @$aref;
        }
        return @retlist;
    }

Here's how you might write a function that does a kind of set intersection by returning a list of keys occurring in all the hashes passed to it:

    @common = inter( \%foo, \%bar, \%joe );
    sub inter {
        my %seen;
        for my $href (@_) {
            while (my $k = each %$href) {
                $seen{$k}++;
            }
        }
        return grep { $seen{$_} == @_ } keys %seen;
    }

So far, we're just using the normal list return mechanism. What happens if you want to pass or return a hash? Well, if you're only using one of them, or you don't mind them concatenating, then the normal calling convention is okay, although a little expensive. As we explained earlier, where people get into trouble is here:

    (@a, @b) = func(@c, @d);

or here:

    (%a, %b) = func(%c, %d);

That syntax simply won't work. It just sets @a or %a and clears @b or %b. Plus the function doesn't get two separate arrays or hashes as arguments: it gets one long list in @_, as always.

You may want to arrange for your functions to use references for both input and output. Here's a function that takes two array references as arguments and returns the two array references ordered by the number of elements they have in them:

    ($aref, $bref) = func(\@c, \@d);
    print "@$aref has more than @$bref\n";
    sub func {
        my ($cref, $dref) = @_;
        if (@$cref > @$dref) {
            return ($cref, $dref);
        } else {
            return ($dref, $cref);
        }
    }

For passing filehandles or directory handles into or out of functions, see the sections "Filehandle References" and "Symbol Table References" in Chapter 8.
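The same reference technique works for returning hashes intact. Here's a small sketch of our own (not from the book; the function name and data are made up for illustration) showing two distinct hashes surviving the round trip as references:

```perl
# Illustrative sketch (hypothetical names): returning two hashes
# intact by handing back references instead of one flat list.
sub split_by_parity {
    my %in = @_;                  # flat key/value list becomes a hash
    my (%even, %odd);
    while (my ($k, $v) = each %in) {
        ($v % 2 ? $odd{$k} : $even{$k}) = $v;
    }
    return (\%even, \%odd);       # two separate hashes survive the trip
}

my ($eref, $oref) = split_by_parity(a => 1, b => 2, c => 3);
print "odd:  ", join(",", sort keys %$oref), "\n";   # odd:  a,c
print "even: ", join(",", sort keys %$eref), "\n";   # even: b
```

Had we written return (%even, %odd) instead, the caller would have received one flat key/value list with no way to tell where the first hash ended.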
Prototypes

Perl lets you define your own functions to be called like Perl's built-in functions. Consider push(@array, $item), which must tacitly receive a reference to @array, not just the list values held in @array, so that the array can be modified. Prototypes let you declare subroutines to take arguments just like many of the built-ins, that is, with certain constraints on the number and types of arguments. We call them "prototypes", but they work more like automatic templates for the calling context than like what C or Java programmers would think of as prototypes. With these templates, Perl will automatically add implicit backslashes, or calls to scalar, or whatever else it takes to get things to show up in a way that matches the template. For instance, if you declare:

    sub mypush (\@@);

then mypush takes arguments exactly like push does. For this to work, the declaration of the function to be called must be visible at compile time. The prototype only affects the interpretation of function calls when the & character is omitted. In other words, if you call it like a built-in function, it behaves like a built-in function. If you call it like an old-fashioned subroutine, then it behaves like an old-fashioned subroutine. The & suppresses prototype checks and associated contextual effects.

Since prototypes are taken into consideration only at compile time, it naturally falls out that they have no influence on subroutine references like \&foo or on indirect subroutine calls like &{$subref} or $subref->(). Method calls are not influenced by prototypes, either. That's because the actual function to be called is indeterminate at compile time, depending as it does on inheritance, which is dynamically determined in Perl.
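To make the template effect concrete, here's a sketch of our own (the mypush name follows the declaration above; the body is our assumed obvious implementation, not code from the book):

```perl
# Illustrative sketch: a push-like sub whose \@@ prototype makes Perl
# pass the array by implicit reference, the way built-in push does.
sub mypush (\@@) {
    my ($aref, @values) = @_;    # first argument arrives as an array ref
    push @$aref, @values;
}

my @stack = (1, 2);
mypush @stack, 3, 4;             # no backslash needed at the call site
print "@stack\n";                # 1 2 3 4
```

Because the prototype is consulted at compile time, this works only when the declaration appears before the call; saying &mypush(@stack, 3, 4) instead would bypass the template and flatten @stack into plain values.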
Since the intent is primarily to let you define subroutines that work like built-in functions, here are some prototypes you might use to emulate the corresponding built-ins:

    Declared as              Called as

    sub mylink ($$)          mylink $old, $new
    sub myreverse (@)        myreverse $a,$b,$c
    sub myjoin ($@)          myjoin ":",$a,$b,$c
    sub mypop (\@)           mypop @array
    sub mysplice (\@$$@)     mysplice @array,@array,0,@pushme
    sub mykeys (\%)          mykeys %{$hashref}
    sub mypipe (**)          mypipe READHANDLE, WRITEHANDLE
    sub myindex ($$;$)       myindex &getstring, "substr"
                             myindex &getstring, "substr", $start
    sub mysyswrite (*$;$$)   mysyswrite OUTF, $buf
                             mysyswrite OUTF, $buf, length($buf)-$off, $off
    sub myopen (*;$@)        myopen HANDLE
                             myopen HANDLE, $name
                             myopen HANDLE, "-|", @cmd
    sub mygrep (&@)          mygrep { /foo/ } $a,$b,$c
    sub myrand ($)           myrand 42
    sub mytime ()            mytime

Any backslashed prototype character (shown between parentheses in the left column above) represents an actual argument (exemplified in the right column), which absolutely must start with that character. Just as the first argument to keys must start with %, so too must the first argument to mykeys.

A semicolon separates mandatory arguments from optional arguments. (It would be redundant before @ or %, since lists can be null.)

Unbackslashed prototype characters have special meanings. Any unbackslashed @ or % eats all the rest of the actual arguments and forces list context. (It's equivalent to LIST in a syntax description.) An argument represented by $ has scalar context forced on it. An & requires a reference to a named or anonymous subroutine. A * allows the subroutine to accept anything in that slot that would be accepted by a built-in as a filehandle: a bare name, a constant, a scalar expression, a typeglob, or a reference to a typeglob. The value will be available to the subroutine either as a simple scalar or (in the latter two cases) as a reference to the typeglob.
If you wish to always convert such arguments to a typeglob reference, use Symbol::qualify_to_ref as follows:

    use Symbol 'qualify_to_ref';

    sub foo (*) {
        my $fh = qualify_to_ref(shift, caller);
        ...
    }

Note how the last three examples in the table are treated specially by the parser. mygrep is parsed as a true list operator, myrand is parsed as a true unary operator with unary precedence the same as rand, and mytime is truly argumentless, just like time. That is, if you say:

    mytime +2;

you'll get mytime() + 2, not mytime(2), which is how it would be parsed without the prototype, or with a unary prototype.

The mygrep example also illustrates how & is treated specially when it is the first argument. Ordinarily, an & prototype would demand an argument like \&foo or sub{}. When it is the first argument, however, you can leave off the sub of your anonymous subroutine, and just pass a bare block in the "indirect object" slot (with no comma after it). So one nifty thing about the & prototype is that you can generate new syntax with it, provided the & is in the initial position:

    sub try (&$) {
        my ($try, $catch) = @_;
        eval { &$try };
        if ($@) {
            local $_ = $@;
            &$catch;
        }
    }
    sub catch (&) { $_[0] }

    try {
        die "phooey";
    }       # not the end of the function call!
    catch {
        /phooey/ and print "unphooey\n";
    };

This prints "unphooey". What happens is that try is called with two arguments, the anonymous function {die "phooey";} and the return value of the catch function, which in this case is nothing but its own argument, the entire block of yet another anonymous function. Within try, the first function argument is called while protected within an eval block to trap anything that blows up.
If something does blow up, the second function is called with a local version of the global $_ variable set to the raised exception.* If this all sounds like pure gobbledygook, you'll have to read about die and eval in Chapter 29, and then go check out anonymous functions and closures in Chapter 8. On the other hand, if it intrigues you, you might check out the Error module on CPAN, which uses this to implement elaborately structured exception handling with try, catch, except, otherwise, and finally clauses.

Here's a reimplementation of the grep operator (the built-in one is more efficient, of course):

    sub mygrep (&@) {
        my $coderef = shift;
        my @result;
        foreach $_ (@_) {
            push(@result, $_) if &$coderef;
        }
        return @result;
    }

Some folks would prefer to see full alphanumeric prototypes. Alphanumerics have been intentionally left out of prototypes for the express purpose of someday adding named, formal parameters. (Maybe.) The current mechanism's main goal is to let module writers enforce a certain amount of compile-time checking on module users.

* Yes, there are still unresolved issues having to do with the visibility of @_. We're ignoring that question for the moment. But if we make @_ lexically scoped someday, as already occurs in the experimental threaded versions of Perl, those anonymous subroutines can act like closures.

Inlining Constant Functions

Functions prototyped with (), meaning that they take no arguments at all, are parsed like the time built-in. More interestingly, the compiler treats such functions as potential candidates for inlining. If the result of that function, after Perl's optimization and constant-folding pass, is either a constant or a lexically scoped scalar with no other references, then that value will be used in place of calls to that function. Calls made using &NAME are never inlined, however, just as they are not subject to any other prototype effects.
(See the use constant pragma in Chapter 31, Pragmatic Modules, for an easy way to declare such constants.) Both versions of these functions to compute π will be inlined by the compiler:

    sub pi ()               { 3.14159 }             # Not exact, but close
    sub PI ()               { 4 * atan2(1, 1) }     # As good as it gets

In fact, all of the following functions are inlined because Perl can determine everything at compile time:

    sub FLAG_FOO ()         { 1 << 8 }
    sub FLAG_BAR ()         { 1 << 9 }
    sub FLAG_MASK ()        { FLAG_FOO | FLAG_BAR }

    sub OPT_GLARCH ()       { (0x1B58 & FLAG_MASK) == 0 }
    sub GLARCH_VAL () {
        if (OPT_GLARCH) { return 23 }
        else            { return 42 }
    }

    sub N () { int(GLARCH_VAL) / 3 }

    BEGIN {                   # compiler runs this block at compile time
        my $prod = 1;         # persistent, private variable
        for (1 .. N) { $prod *= $_ }
        sub NFACT () { $prod }
    }

In the last example, the NFACT function is inlined because it has a void prototype and the variable it returns is not changed by that function—and furthermore can't be changed by anyone else, since it's in a lexical scope. So the compiler replaces uses of NFACT with that value, which was precomputed at compile time because of the surrounding BEGIN.

If you redefine a subroutine that was eligible for inlining, you'll get a mandatory warning. (You can use this warning to tell whether the compiler inlined a particular subroutine.) The warning is considered severe enough not to be optional, because previously compiled invocations of the function will still use the old value of the function. If you need to redefine the subroutine, ensure that it isn't inlined, either by dropping the () prototype (which changes calling semantics, so beware) or by thwarting the inlining mechanism in some other way, such as:

    sub not_inlined () { return 23 if $$; }

See Chapter 18 for more about what happens during the compilation and execution phases of your program's life.

Care with Prototypes

It's probably best to put prototypes on new functions, not retrofit prototypes onto older ones.
These are context templates, not ANSI C prototypes, so you must be especially careful about silently imposing a different context. Suppose, for example, you decide that a function should take just one parameter, like this:

    sub func ($) {
        my $n = shift;
        print "you gave me $n\n";
    }

That makes it a unary operator (like the rand built-in) and changes how the compiler determines the function's arguments. With the new prototype, the function consumes just one, scalar-context argument instead of many arguments in list context. If someone has been calling it with an array or list expression, even if that array or list contained just a single element, where before it worked, now you've got something completely different:

    func @foo;             # counts @foo elements
    func split /:/;        # counts number of fields returned
    func "a", "b", "c";    # passes "a" only, discards "b" and "c"
    func("a", "b", "c");   # suddenly, a compiler error!

You've just supplied an implicit scalar in front of the argument list, which can be more than a bit surprising. The old @foo that used to hold one thing doesn't get passed in. Instead, 1 (the number of elements in @foo) is now passed to func. And the split, being called in scalar context, scribbles all over your @_ parameter list. In the third example, because func has been prototyped as a unary operator, only "a" is passed in; then the return value from func is discarded as the comma operator goes on to evaluate the next two items and return "c". In the final example, the user now gets a syntax error at compile time on code that used to compile and run just fine.
If you're writing new code and would like a unary operator that takes only a scalar variable, not any old scalar expression, you could prototype it to take a scalar reference:

    sub func (\$) {
        my $nref = shift;
        print "you gave me $$nref\n";
    }

Now the compiler won't let anything by that doesn't start with a dollar sign:

    func @foo;           # compiler error, saw @, want $
    func split /:/;      # compiler error, saw function, want $
    func $s;             # this one is ok -- got real $ symbol
    func $a[3];          # and this one
    func $h{stuff}[-1];  # or even this
    func 2+5;            # scalar expr still a compiler error
    func ${ \(2+5) };    # ok, but is the cure worse than the disease?

If you aren't careful, you can get yourself into trouble with prototypes. But if you are careful, you can do a lot of neat things with them. This is all very powerful, of course, and should only be used in moderation to make the world a better place.

Subroutine Attributes

A subroutine declaration or definition may have a list of attributes associated with it. If such an attribute list is present, it is broken up at whitespace or colon boundaries and treated as though a use attributes had been seen. See the use attributes pragma in Chapter 31 for internal details. There are three standard attributes for subroutines: locked, method, and lvalue.

The locked and method Attributes

    # Only one thread is allowed into this function.
    sub afunc : locked { ... }

    # Only one thread is allowed into this function on a given object.
    sub afunc : locked method { ... }

Setting the locked attribute is meaningful only when the subroutine or method is intended to be called by multiple threads simultaneously. When set on a nonmethod subroutine, Perl ensures that a lock is acquired on the subroutine itself before that subroutine is entered. When set on a method subroutine (that is, one also marked with the method attribute), Perl ensures that any invocation of it implicitly locks its first argument (the object) before execution.
Semantics of this lock are the same as using the lock operator on the subroutine as the first statement in that routine. See Chapter 17, Threads, for more on locking.

The method attribute can be used by itself:

    sub afunc : method { ... }

Currently this has only the effect of marking the subroutine so as not to trigger the "Ambiguous call resolved as CORE::%s" warning. (We may make it mean more someday.)

The attribute system is user-extensible, letting you create your own attribute names. These new attributes must be valid as simple identifier names (without any punctuation other than the "_" character). They may have a parameter list appended, which is currently only checked for whether its parentheses nest properly.

Here are examples of valid syntax (even though the attributes are unknown):

    sub fnord (&\%) : switch(10,foo(7,3)) : expensive;
    sub plugh () : Ugly('\(") :Bad;
    sub xyzzy : _5x5 { ... }

Here are examples of invalid syntax:

    sub fnord : switch(10,foo();  # ()-string not balanced
    sub snoid : Ugly('(');        # ()-string not balanced
    sub xyzzy : 5x5;              # "5x5" not a valid identifier
    sub plugh : Y2::north;        # "Y2::north" not a simple identifier
    sub snurt : foo + bar;        # "+" not a colon or space

The attribute list is passed as a list of constant strings to the code that associates them with the subroutine. Exactly how this works (or doesn't) is highly experimental. Check attributes(3) for current details on attribute lists and their manipulation.

The lvalue Attribute

It is possible to return a modifiable scalar value from a subroutine, but only if you declare the subroutine to return an lvalue:

    my $val;
    sub canmod : lvalue { $val; }
    sub nomod           { $val; }

    canmod() = 5;   # Assigns to $val.
    nomod()  = 5;   # ERROR

If you're passing parameters to an lvalued subroutine, you'll usually want parentheses to disambiguate what's being assigned:

    canmod $x  = 5;   # assigns 5 to $x first!
    canmod 42  = 5;   # can't change a constant; compile-time error
    canmod($x) = 5;   # this is ok
    canmod(42) = 5;   # and so is this

If you want to be sneaky, you can get around this in the particular case of a subroutine that takes one argument. Declaring the function with a prototype of ($) causes the function to be parsed with the precedence of a named unary operator. Since named unaries have higher precedence than assignment, you no longer need the parentheses. (Whether this is desirable or not is left up to the style police.)

You don't have to be sneaky in the particular case of a subroutine that allows zero arguments (that is, with a () prototype). You can without ambiguity say this:

    canmod = 5;

That works because no valid term begins with =. Similarly, lvalued method calls can omit the parentheses when you don't pass any arguments:

    $obj->canmod = 5;

We promise not to break those two constructs in future versions of Perl. They're handy when you want to wrap object attributes in method calls (so that they can be inherited like method calls but accessed like variables).

The scalar or list context of both the lvalue subroutine and the righthand side of an assignment to that subroutine is determined as if the subroutine call were replaced by a scalar. For example, consider:

    data(2,3) = get_data(3,4);

Both subroutines here are called in scalar context, while in:

    (data(2,3)) = get_data(3,4);

and in:

    (data(2),data(3)) = get_data(3,4);

all the subroutines are called in list context.

The current implementation does not allow arrays and hashes to be returned from lvalue subroutines directly. You can always return a reference instead.

7 Formats

Perl has a mechanism to help you generate simple reports and charts. To facilitate this, Perl helps you code up your output page close to how it will look when it's printed. It can keep track of things like how many lines are on a page, the current page number, when to print page headers, and so on.
Keywords are borrowed from FORTRAN: format to declare and write to execute; see the relevant entries in Chapter 29, Functions. Fortunately, the layout is much more legible, more like the PRINT USING statement of BASIC. Think of it as a poor man's nroff(1). (If you know nroff, that may not sound like a recommendation.)

Formats, like packages and subroutines, are declared rather than executed, so they may occur at any point in your program. (Usually it's best to keep them all together.) They have their own namespace apart from all the other types in Perl. This means that if you have a function named "Foo", it is not the same thing as a format named "Foo". However, the default name for the format associated with a given filehandle is the same as the name of that filehandle. Thus, the default format for STDOUT is named "STDOUT", and the default format for filehandle TEMP is named "TEMP". They just look the same. They aren't.

Output record formats are declared as follows:

    format NAME =
    FORMLIST
    .

If NAME is omitted, format STDOUT is defined. FORMLIST consists of a sequence of lines, each of which may be of one of three types:

•   A comment, indicated by putting a # in the first column.
•   A "picture" line giving the format for one output line.
•   An argument line supplying values to plug into the previous picture line.

Picture lines are printed exactly as they look, except for certain fields that substitute values into the line.* Each substitution field in a picture line starts with either @ (at) or ^ (caret). These lines do not undergo any kind of variable interpolation. The @ field (not to be confused with the array marker @) is the normal kind of field; the other kind, the ^ field, is used to do rudimentary multiline text-block filling. The length of the field is supplied by padding out the field with multiple <, >, or | characters to specify, respectively, left justification, right justification, or centering.
If the variable exceeds the width specified, it is truncated. As an alternate form of right justification, you may also use # characters (after an initial @ or ˆ) to specify a numeric field. You can insert a . in place of one of the # characters to line up the decimal points. If any value supplied for these fields contains a newline, only the text up to the newline is printed. Finally, the special field @* can be used for printing multiline, nontruncated values; it should generally appear on a picture line by itself. The values are specified on the following line in the same order as the picture fields. The expressions providing the values should be separated by commas. The expressions are all evaluated in a list context before the line is processed, so a single list expression could produce multiple list elements. The expressions may be spread out to more than one line if enclosed in braces. (If so, the opening brace must be the first token on the first line). This lets you line up the values under their respective format fields for easier reading. If an expression evaluates to a number with a decimal part, and if the corresponding picture specifies that the decimal part should appear in the output (that is, any picture except multiple # characters without an embedded .), the character used for the decimal point is always determined by the current LC_NUMERIC locale. This means that if, for example, the run-time environment happens to specify a German locale, a comma will be used instead of a period. See the perllocale manpage for more information. Inside an expression, the whitespace characters \n, \t, and \f are all considered equivalent to a single space. Thus, you could think of this filter as being applied to each value in the format: * Even those fields maintain the integrity of the columns you put them in, however. There is nothing in a picture line that can cause fields to grow or shrink or shift back and forth. 
The columns you see are sacred in a WYSIWYG sense—assuming you’re using a fixed-width font. Even control characters are assumed to have a width of one. 236 Chapter 7: Formats $value =˜ tr/\n\t\f/ /; The remaining whitespace character, \r, forces the printing of a new line if the picture line allows it. Picture fields that begin with ˆ rather than @ are treated specially. With a # field, the field is blanked out if the value is undefined. For other field types, the caret enables a kind of fill mode. Instead of an arbitrary expression, the value supplied must be a scalar variable name that contains a text string. Perl puts as much text as it can into the field, and then chops off the front of the string so that the next time the variable is referenced, more of the text can be printed. (Yes, this means that the variable itself is altered during execution of the write call and is not preserved. Use a scratch variable if you want to preserve the original value.) Normally you would use a sequence of fields lined up vertically to print out a block of text. You might wish to end the final field with the text “...”, which will appear in the output if the text was too long to appear in its entirety. You can change which characters are legal to “break” on (or after) by changing the variable $: (that’s $FORMAT_LINE_BREAK_CHARACTERS if you’re using the English module) to a list of the desired characters. Using ˆ fields can produce variable-length records. If the text to be formatted is short, just repeat the format line with the ˆ field in it a few times. If you just do this for short data you’d end up getting a few blank lines. To suppress lines that would end up blank, put a ˜ (tilde) character anywhere in the line. (The tilde itself will be translated to a space upon output.) If you put a second tilde next to the first, the line will be repeated until all the text in the fields on that line are exhausted. (This works because the ˆ fields chew up the strings they print. 
But if you use a field of the @ variety in conjunction with two tildes, the expression you supply had better not give the same value every time forever! Use a shift, or some other operator with a side effect that exhausts the set of values.) Top-of-form processing is by default handled by a format with the same name as the current filehandle with _TOP concatenated to it. It’s triggered at the top of each page. See write in Chapter 29. Here are some examples: # a report on the /etc/passwd file format STDOUT_TOP = Passwd File Name Login Office Uid Gid Home -----------------------------------------------------------------. format STDOUT = @<<<<<<<<<<<<<<<<<< @||||||| @<<<<<<@>>>> @>>>> @<<<<<<<<<<<<<<<<< $name, $login, $office,$uid,$gid, $home . Format Variables 237 # a report from a bug report form format STDOUT_TOP = Bug Reports @<<<<<<<<<<<<<<<<<<<<<<< @||| @>>>>>>>>>>>>>>>>>>>>>>> $system, $%, $date -----------------------------------------------------------------. format STDOUT = Subject: @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $subject Index: @<<<<<<<<<<<<<<<<<<<<<<<<<<<< ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $index, $description Priority: @<<<<<<<<<< Date: @<<<<<<< ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $priority, $date, $description From: @<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $from, $description Assigned to: @<<<<<<<<<<<<<<<<<<<<<< ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $programmer, $description ˜ ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $description ˜ ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $description ˜ ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $description ˜ ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<< $description ˜ ˆ<<<<<<<<<<<<<<<<<<<<<<<... $description . Lexical variables are not visible within a format unless the format is declared within the scope of the lexical variable. It is possible to intermix prints with writes on the same output channel, but you’ll have to handle the $- special variable ($FORMAT_LINES_LEFT if you’re using the English module) yourself. 
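Before moving on to format variables, here's a small self-contained sketch that pulls the pieces above together: a named format, selected onto an in-memory filehandle so the result can be inspected as a string. (The format name, field widths, and variable names here are all arbitrary.)

```perl
# Write a one-line report through a named format into a string.
my $out = '';
open(my $fh, '>', \$out) or die "can't open in-memory handle: $!";

format REPORT =
Name: @<<<<<<<<<<  Age: @>>
      $name,            $age
.

($name, $age) = ("Alice", 30);   # package variables, visible to the format

my $old = select($fh);   # make $fh the currently selected output handle
$~ = "REPORT";           # use our format instead of the default for $fh
write;                   # fill the picture line and emit it
select($old);            # restore the previous handle

print $out;              # the padded report line
```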
Format Variables The current format name is stored in the variable $˜ ($FORMAT_NAME), and the current top-of-form format name is in $ˆ ($FORMAT_TOP_NAME). The current output page number is stored in $% ($FORMAT_PAGE_NUMBER), and the number of lines on the page is in $= ($FORMAT_LINES_PER_PAGE). Whether to flush the output buffer on this handle automatically is stored in $| ($OUTPUT_AUTOFLUSH). The string to be output before each top of page (except the first) is stored in $ˆL ($FORMAT_FORMFEED). These variables are set on a per-filehandle basis, so you’ll need to select the filehandle associated with a format in order to affect its format variables: 238 Chapter 7: Formats select((select(OUTF), $˜ = "My_Other_Format", $ˆ = "My_Top_Format" )[0]); Pretty ugly, eh? It’s a common idiom though, so don’t be too surprised when you see it. You can at least use a temporary variable to hold the previous filehandle: $ofh = select(OUTF); $˜ = "My_Other_Format"; $ˆ = "My_Top_Format"; select($ofh); This is a much better approach in general because not only does legibility improve, but you now have an intermediary statement in the code to stop on when you’re single-stepping in the debugger. If you use the English module, you can even read the variable names: use English; $ofh = select(OUTF); $FORMAT_NAME = "My_Other_Format"; $FORMAT_TOP_NAME = "My_Top_Format"; select($ofh); But you still have those funny calls to select. If you want to avoid them, use the FileHandle module bundled with Perl. Now you can access these special variables using lowercase method names instead: use FileHandle; OUTF->format_name("My_Other_Format"); OUTF->format_top_name("My_Top_Format"); Much better! Since the values line following your picture line may contain arbitrary expressions (for @ fields, not ˆ fields), you can farm out more sophisticated processing to other functions, like sprintf or one of your own. For example, to insert commas into a number: format Ident = @<<<<<<<<<<<<<<< commify($n) . 
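The commify function used in that example isn't defined in the text; a common implementation (just a sketch, and only for plain nonnegative integers) looks like this:

```perl
# Insert commas into an integer: 1234567 becomes "1,234,567".
# A hypothetical helper; it does not handle signs or decimal parts.
sub commify {
    my $n = reverse shift;          # work from the right-hand end
    $n =~ s/(\d{3})(?=\d)/$1,/g;    # comma after every third digit
    return scalar reverse $n;
}
```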
To get a real @, ˜, or ˆ into the field, do this: format Ident = I have an @ here. "@" . Format Variables 239 To center a whole line of text, do something like this: format Ident = @|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| "Some text line" . The > field-length indicator ensures that the text will be right-justified within the field, but the field as a whole occurs exactly where you show it occurring. There is no built-in way to say “float this field to the right-hand side of the page, however wide it is.” You have to specify where it goes relative to the left margin. The truly desperate can generate their own format on the fly, based on the current number of columns (not supplied), and then eval it: $format = "format STDOUT = \n" . ’ˆ’ . ’<’ x $cols . "\n" . ’$entry’ . "\n" . "\tˆ" . "<" x ($cols-8) . "˜˜\n" . ’$entry’ . "\n" . ".\n"; print $format if $Debugging; eval $format; die $@ if $@; The most important line there is probably the print. What the print would print out looks something like this: format STDOUT = ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $entry ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<˜˜ $entry . Here’s a little program that behaves like the fmt (1) Unix utility: format = ˆ<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ˜˜ $_ . $/ = ""; while (<>) { s/\s*\n\s*/ /g; write; } 240 Chapter 7: Formats Footers While $ˆ ($FORMAT_TOP_NAME) contains the name of the current header format, there is no corresponding mechanism to do the same thing automatically for a footer. Not knowing how big a format is going to be until you evaluate it is one of the major problems. It’s on the TODO list.* Here’s one strategy: if you have a fixed-size footer, you can get footers by checking $- ($FORMAT_LINES_LEFT) before each write and then print the footer yourself if necessary. 
Here’s another strategy; open a pipe to yourself, using open(MESELF, "|-") (see the open entry in Chapter 29) and always write to MESELF instead of STDOUT. Have your child process postprocess its STDIN to rearrange headers and footers however you like. Not very convenient, but doable. Accessing Formatting Internals For low-level access to the internal formatting mechanism, you may use the builtin formline operator and access $ˆA (the $ACCUMULATOR variable) directly. (Formats essentially compile into a sequence of calls to formline.) For example: $str = formline <<’END’, 1,2,3; @<<< @||| @>>> END print "Wow, I just stored ‘$ˆA’ in the accumulator!\n"; Or to create an swrite subroutine that is to write as sprintf is to printf, do this: use Carp; sub swrite { croak "usage: swrite PICTURE ARGS" unless @_; my $format = shift; $ˆA = ""; formline($format, @_); return $ˆA; } $string = swrite(<<’END’, 1, 2, 3); Check me out @<<< @||| @>>> END print $string; * That doesn’t guarantee we’ll ever do it, of course. Formats are somewhat passé in this age of WWW, Unicode, XML, XSLT, and whatever the next few things after that are. Footers 241 If you were using the FileHandle module, you could use formline as follows to wrap a block of text at column 72: use FileHandle; STDOUT->formline("ˆ" . ("<" x 72) . "˜˜\n", $long_text); 8 References For both practical and philosophical reasons, Perl has always been biased in favor of flat, linear data structures. And for many problems, this is just what you want. Suppose you wanted to build a simple table (two-dimensional array) showing vital statistics — age, eye color, and weight—for a group of people. 
You could do this by first creating an array for each individual: @john = (47, "brown", 186); @mary = (23, "hazel", 128); @bill = (35, "blue", 157); You could then construct a single, additional array consisting of the names of the other arrays: @vitals = (’john’, ’mary’, ’bill’); To change John’s eyes to “red” after a night on the town, we want a way to change the contents of the @john array given only the simple string “john”. This is the basic problem of indir ection, which various languages solve in various ways. In C, the most common form of indirection is the pointer, which lets one variable hold the memory address of another variable. In Perl, the most common form of indirection is the refer ence. What Is a Reference? In our example, $vitals[0] has the value “john”. That is, it contains a string that happens to be the name of another (global) variable. We say that the first variable refers to the second, and this sort of reference is called a symbolic reference, since 242 What Is a Reference? 243 Perl has to look up @john in a symbol table to find it. (You might think of symbolic references as analogous to symbolic links in the filesystem.) We’ll talk about symbolic references later in this chapter. The other kind of reference is a hard reference, and this is what most Perl programmers use to accomplish their indirections (if not their indiscretions). We call them hard references not because they’re difficult, but because they’re real and solid. If you like, think of hard references as real references and symbolic references as fake references. It’s like the difference between true friendship and mere name-dropping. When we don’t specify which type of reference we mean, it’s a hard reference. Figure 8-1 depicts a variable named $bar referring to the contents of a scalar named $foo which has the value “bot”. $foo "bot" $foo $foo = "bot" $bar "bot" $foo = "bot" $bar $bar = \$foo "foo" $bar = "foo" Figur e 8-1. 
A hard reference and a symbolic reference

Unlike a symbolic reference, a real reference refers not to the name of another variable (which is just a container for a value) but to an actual value itself, some internal glob of data. There's no good word for that thing, but when we have to, we'll call it a referent. Suppose, for example, that you create a hard reference to a lexically scoped array named @array. This hard reference, and the referent it refers to, will continue to exist even after @array goes out of scope. A referent is only destroyed when all the references to it are eliminated.

A referent doesn't really have a name of its own, apart from the references to it. To put it another way, every Perl variable name lives in some kind of symbol table, holding one hard reference to its underlying (otherwise nameless) referent. That referent might be simple, like a number or string, or complex, like an array or hash. Either way, there's still exactly one reference from the variable to its value. You might create additional hard references to the same referent, but if so, the variable doesn't know (or care) about them.*

A symbolic reference is just a string that happens to name something in a package symbol table. It's not so much a distinct type as it is something you do with a string. But a hard reference is a different beast entirely. It is the third of the three kinds of fundamental scalar data types, the other two being strings and numbers. A hard reference doesn't need to know something's name just to refer to it, and it's actually completely normal for there to be no name to use in the first place. Such totally nameless referents are called anonymous; we discuss them in "Anonymous Data" below. To reference a value, in the terminology of this chapter, is to create a hard reference to it.

* If you're curious, you can determine the underlying refcount with the Devel::Peek module, bundled with Perl.
(There’s a special operator for this creative act.) The reference so created is simply a scalar, which behaves in all familiar contexts just like any other scalar. To der efer ence this scalar means to use the reference to get at the referent. Both referencing and dereferencing occur only when you invoke certain explicit mechanisms; implicit referencing or dereferencing never occurs in Perl. Well, almost never. A function call can use implicit pass-by-reference semantics — if it has a prototype declaring it that way. If so, the caller of the function doesn’t explicitly pass a reference, although you still have to dereference it explicitly within the function. See the section “Prototypes” in Chapter 6, Subr outines. And to be perfectly honest, there’s also some behind-the-scenes dereferencing happening when you use certain kinds of filehandles, but that’s for backward compatibility and is transparent to the casual user. Finally, two built-in functions, bless and lock, each take a reference for their argument but implicitly dereference it to work their magic on what lies behind. But those confessions aside, the basic principle still holds that Perl isn’t interested in muddling your levels of indirection. A reference can point to any data structure. Since references are scalars, you can store them in arrays and hashes, and thus build arrays of arrays, arrays of hashes, hashes of arrays, arrays of hashes and functions, and so on. There are examples of these in Chapter 9, Data Structures. Keep in mind, though, that Perl arrays and hashes are internally one-dimensional. That is, their elements can hold only scalar values (strings, numbers, and references). When we use a phrase like “array of arrays”, we really mean “array of references to arrays”, just as when we say “hash of functions” we really mean “hash of references to subroutines”. 
But since references are the only way to implement such structures in Perl, it follows that the shorter, less accurate phrase is not so inaccurate as to be false, and therefore should not be totally despised, unless you’re into that sort of thing. Creating References 245 Creating References There are several ways to create references, most of which we will describe before explaining how to use (dereference) the resulting references. The Backslash Operator You can create a reference to any named variable or subroutine with a backslash. (You may also use it on an anonymous scalar value like 7 or "camel", although you won’t often need to.) This operator works like the & (address-of) operator in C—at least at first glance. Here are some examples: $scalarref $constref $arrayref $hashref $coderef $globref = = = = = = \$foo; \186_282.42; \@ARGV; \%ENV; \&handler; \*STDOUT; The backslash operator can do more than produce a single reference. It will generate a whole list of references if applied to a list. See the section “Other Tricks You Can Do with Hard References” for details. Anonymous Data In the examples just shown, the backslash operator merely makes a duplicate of a reference that is already held in a variable name—with one exception. The 186_282.42 isn’t referenced by a named variable—it’s just a value. It’s one of those anonymous referents we mentioned earlier. Anonymous referents are accessed only through references. This one happens to be a number, but you can create anonymous arrays, hashes, and subroutines as well. The anonymous array composer You can create a reference to an anonymous array with square brackets: $arrayref = [1, 2, [’a’, ’b’, ’c’, ’d’]]; Here we’ve composed an anonymous array of three elements, whose final element is a reference to an anonymous array of four elements (depicted in Figure 8-2). (The multidimensional syntax described later can be used to access this. For example, $arrayref->[2][1] would have the value “b”.) 
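As a quick check of that claim, here's a sketch using the arrow and @{...} dereferencing syntax that the chapter explains in detail later:

```perl
# The nested anonymous array from the text above.
my $arrayref = [1, 2, ['a', 'b', 'c', 'd']];

my $elem  = $arrayref->[2][1];      # "b", as promised
my @inner = @{ $arrayref->[2] };    # the four-element inner array
my $count = scalar @inner;          # 4
```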
246 Chapter 8: References ’a’ ’b’ ’c’ ’d’ 1 2 $arrayref Figur e 8-2. A refer ence to an array, whose third element is itself an array refer ence We now have one way to represent the table at the beginning of the chapter: $table = [ [ "john", 47, "brown", 186], [ "mary", 23, "hazel", 128], [ "bill", 35, "blue", 157] ]; Square brackets work like this only where the Perl parser is expecting a term in an expression. They should not be confused with the brackets in an expression like $array[6] —although the mnemonic association with arrays is intentional. Inside a quoted string, square brackets don’t compose anonymous arrays; instead, they become literal characters in the string. (Square brackets do still work for subscripting in strings, or you wouldn’t be able to print string values like "VAL=$array[6]\n". And to be totally honest, you can in fact sneak anonymous array composers into strings, but only when embedded in a larger expression that is being interpolated. We’ll talk about this cool feature later in the chapter because it involves dereferencing as well as referencing.) The anonymous hash composer You can create a reference to an anonymous hash with braces: $hashref = { ’Adam’ => ’Eve’, ’Clyde’ => $bonnie, ’Antony’ => ’Cleo’ . ’patra’, }; For the values (but not the keys) of the hash, you can freely mix other anonymous array, hash, and subroutine composers to produce as complicated a structure as you like. Creating References 247 We now have another way to represent the table at the beginning of the chapter: $table = { "john" => [ 47, "brown", 186 ], "mary" => [ 23, "hazel", 128 ], "bill" => [ 35, "blue", 157 ], }; That’s a hash of arrays. Choosing the best data structure is a tricky business, and the next chapter is devoted to it. 
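The difference between the first two of those options is easy to demonstrate; the subroutine names below are just for illustration:

```perl
sub hashem_wrong {  { @_ } }   # braces parsed as a block: returns @_ itself
sub hashem_right { +{ @_ } }   # leading + forces the hash-ref composer

my $r = hashem_right(a => 1, b => 2);
# ref($r) is "HASH", and $r->{b} is 2

my @list = hashem_wrong(a => 1, b => 2);
# per the text above, this returns the flattened list, not a reference
```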
But as a teaser, we could even use a hash of hashes for our table:

    $table = {
        "john" => {
            age    => 47,
            eyes   => "brown",
            weight => 186,
        },
        "mary" => {
            age    => 23,
            eyes   => "hazel",
            weight => 128,
        },
        "bill" => {
            age    => 35,
            eyes   => "blue",
            weight => 157,
        },
    };

As with square brackets, braces work like this only where the Perl parser is expecting a term in an expression. They should not be confused with the braces in an expression like $hash{key}—although the mnemonic association with hashes is (again) intentional. The same caveats apply to the use of braces within strings.

There is one additional caveat which didn’t apply to square brackets. Since braces are also used for several other things (including blocks), you may occasionally have to disambiguate braces at the beginning of a statement by putting a + or a return in front, so that Perl realizes the opening brace isn’t starting a block. For example, if you want a function to make a new hash and return a reference to it, you have these options:

    sub hashem {        { @_ } }   # Silently WRONG -- returns @_.
    sub hashem {       +{ @_ } }   # Ok.
    sub hashem { return { @_ } }   # Ok.

The anonymous subroutine composer

You can create a reference to an anonymous subroutine by using sub without a subroutine name:

    $coderef = sub { print "Boink!\n" };   # Now &$coderef prints "Boink!"

Note the presence of the semicolon, required here to terminate the expression. (It isn’t required after the more common usage of sub NAME {} that declares and defines a named subroutine.) A nameless sub {} is not so much a declaration as it is an operator—like do {} or eval {}—except that the code inside isn’t executed immediately. Instead, it just generates a reference to the code, which in our example is stored in $coderef. However, no matter how many times you execute the line shown above, $coderef will still refer to the same anonymous subroutine.*

Object Constructors

Subroutines can also return references.
That may sound trite, but sometimes you are supposed to use a subroutine to create a reference rather than creating the reference yourself. In particular, special subroutines called constructors create and return references to objects. An object is simply a special kind of reference that happens to know which class it’s associated with, and constructors know how to create that association. They do so by taking an ordinary referent and turning it into an object with the bless operator, so we can speak of an object as a blessed reference. There’s nothing religious going on here; since a class acts as a user-defined type, blessing a referent simply makes it a user-defined type in addition to a built-in one. Constructors are often named new—especially by C++ programmers—but they can be named anything in Perl. Constructors can be called in any of these ways:

    $objref = Doggie::->new(Tail => 'short', Ears => 'long');   #1
    $objref = new Doggie:: Tail => 'short', Ears => 'long';     #2
    $objref = Doggie->new(Tail => 'short', Ears => 'long');     #3
    $objref = new Doggie Tail => 'short', Ears => 'long';       #4

The first and second invocations are the same. They both call a function named new that is supplied by the Doggie module. The third and fourth invocations are the same as the first two, but are slightly more ambiguous: the parser will get confused if you define your own subroutine named Doggie. (Which is why people typically stick with lowercase names for subroutines and uppercase for modules.) The fourth invocation can also get confused if you’ve defined your own new subroutine and don’t happen to have done either a require or a use of the Doggie module, either of which has the effect of declaring the module. Always declare your modules if you want to use #4. (And watch out for stray Doggie subroutines.) See Chapter 12, Objects, for a discussion of Perl objects.
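As a sketch of what such a constructor might look like on the inside (this Doggie package is our own illustration of the calling forms above, not code from any real module):

```perl
use strict;
use warnings;

package Doggie;

# A typical constructor: compose an anonymous hash from the
# named arguments, then bless it into the invoking class.
sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}

package main;

my $objref = Doggie->new(Tail => 'short', Ears => 'long');
print ref($objref), "\n";      # Prints "Doggie", not "HASH"
print $objref->{Ears}, "\n";   # Prints "long"
```

Passing $class to bless (rather than hardcoding "Doggie") lets the constructor work for subclasses too, a point Chapter 12 returns to.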
* But even though there’s only one anonymous subroutine, there may be several copies of the lexical variables in use by the subroutine, depending on when the subroutine reference was generated. These are discussed later in the section “Closures”.

Handle References

References to filehandles or directory handles can be created by referencing the typeglob of the same name:

    splutter(\*STDOUT);
    sub splutter {
        my $fh = shift;
        print $fh "her um well a hmmm\n";
    }

    $rec = get_rec(\*STDIN);
    sub get_rec {
        my $fh = shift;
        return scalar <$fh>;
    }

If you’re passing around filehandles, you can also use the bare typeglob to do so: in the example above, you could have used *STDOUT or *STDIN instead of \*STDOUT and \*STDIN. Although you can usually use typeglobs and references to typeglobs interchangeably, there are a few places where you can’t. Simple typeglobs can’t be blessed into objectdom, and typeglob references can’t be passed back out of the scope of a localized typeglob.

When generating new filehandles, older code would often do something like this to open a list of files:

    for $file (@names) {
        local *FH;
        open(*FH, $file) || next;
        $handle{$file} = *FH;
    }

That still works, but now it’s just as easy to let an undefined variable autovivify an anonymous typeglob:

    for $file (@names) {
        my $fh;
        open($fh, $file) || next;
        $handle{$file} = $fh;
    }

With indirect filehandles, it doesn’t matter whether you use typeglobs, references to typeglobs, or one of the more exotic I/O objects. You just use a scalar that—one way or another—gets interpreted as a filehandle. For most purposes, you can use either a typeglob or a typeglob reference almost indiscriminately. As we admitted earlier, there is some implicit dereferencing magic going on here.

Symbol Table References

In unusual circumstances, you might not know what type of reference you need when your program is written.
A reference can be created by using a special syntax, affectionately known as the *foo{THING} syntax. *foo{THING} returns a reference to the THING slot in *foo, which is the symbol table entry holding the values of $foo, @foo, %foo, and friends.

    $scalarref = *foo{SCALAR};      # Same as \$foo
    $arrayref  = *ARGV{ARRAY};      # Same as \@ARGV
    $hashref   = *ENV{HASH};        # Same as \%ENV
    $coderef   = *handler{CODE};    # Same as \&handler
    $globref   = *foo{GLOB};        # Same as \*foo
    $ioref     = *STDIN{IO};        # Er...

All of these are self-explanatory except for *STDIN{IO}. It yields the actual internal IO::Handle object that the typeglob contains, that is, the part of the typeglob that the various I/O functions are actually interested in. For compatibility with previous versions of Perl, *foo{FILEHANDLE} is a synonym for the hipper *foo{IO} notation.

In theory, you can use a *HANDLE{IO} anywhere you’d use a *HANDLE or a \*HANDLE, such as for passing handles into or out of subroutines, or storing them in larger data structures. (In practice, there are still some wrinkles to be ironed out.) The advantage of them is that they access only the real I/O object you want, not the whole typeglob, so you run no risk of clobbering more than you want to through a typeglob assignment (although if you always assign to a scalar variable instead of to a typeglob, you’ll be okay). One disadvantage is that there’s no way to autovivify one as of yet.*

    splutter(*STDOUT);
    splutter(*STDOUT{IO});
    sub splutter {
        my $fh = shift;
        print $fh "her um well a hmmm\n";
    }

Both invocations of splutter() print “her um well a hmmm”.

The *foo{THING} thing returns undef if that particular THING hasn’t been seen by the compiler yet, except when THING is SCALAR. It so happens that *foo{SCALAR} returns a reference to an anonymous scalar even if $foo hasn’t been seen yet. (Perl always adds a scalar to any typeglob as an optimization to save a bit of code elsewhere. But don’t depend on it to stay that way in future releases.)
* Currently, open my $fh autovivifies a typeglob instead of an IO::Handle object, but someday we may fix that, so you shouldn’t rely on the typeglobbedness of what open currently autovivifies.

Implicit Creation of References

A final method for creating references is not really a method at all. References of an appropriate type simply spring into existence if you dereference them in an lvalue context that assumes they exist. This is extremely useful, and is also What You Expect. This topic is covered later in this chapter, where we’ll discuss how to dereference all of the references we’ve created so far.

Using Hard References

Just as there are numerous ways to create references, there are also several ways to use, or dereference, a reference. There is just one overriding principle: Perl does no implicit referencing or dereferencing.* When a scalar is holding a reference, it always behaves like a simple scalar. It doesn’t magically start being an array or hash or subroutine; you have to tell it explicitly to do so, by dereferencing it.

Using a Variable as a Variable Name

When you encounter a scalar like $foo, you should be thinking “the scalar value of foo.” That is, there’s a foo entry in the symbol table, and the $ funny character is a way of looking at whatever scalar value might be inside. If what’s inside is a reference, you can look inside that (dereferencing $foo) by prepending another funny character. Or looking at it the other way around, you can replace the literal foo in $foo with a scalar variable that points to the actual referent. This is true of any variable type, so not only is $$foo the scalar value of whatever $foo refers to, but @$bar is the array value of whatever $bar refers to, %$glarch is the hash value of whatever $glarch refers to, and so on.
The upshot is that you can put an extra funny character on the front of any simple scalar variable to dereference it:

    $foo = "three humps";
    $scalarref = \$foo;           # $scalarref is now a reference to $foo
    $camel_model = $$scalarref;   # $camel_model is now "three humps"

Here are some other dereferences:

    $bar = $$scalarref;
    push(@$arrayref, $filename);
    $$arrayref[0] = "January";                    # Set the first element of @$arrayref
    @$arrayref[4..6] = qw/May June July/;         # Set several elements of @$arrayref
    %$hashref = (KEY => "RING", BIRD => "SING");  # Initialize whole hash
    $$hashref{KEY} = "VALUE";                     # Set one key/value pair
    @$hashref{"KEY1","KEY2"} = ("VAL1","VAL2");   # Set two more pairs
    &$coderef(1,2,3);
    print $handleref "output\n";

This form of dereferencing can only make use of a simple scalar variable (one without a subscript). That is, dereferencing happens before (or binds tighter than) any array or hash lookups. Let’s use some braces to clarify what we mean: an expression like $$arrayref[0] is equivalent to ${$arrayref}[0] and means the first element of the array referred to by $arrayref. That is not at all the same as ${$arrayref[0]}, which is dereferencing the first element of the (probably nonexistent) array named @arrayref. Likewise, $$hashref{KEY} is the same as ${$hashref}{KEY}, and has nothing to do with ${$hashref{KEY}}, which would be dereferencing an entry in the (probably nonexistent) hash named %hashref. You will be miserable until you understand this.

You can achieve multiple levels of referencing and dereferencing by concatenating the appropriate funny characters. The following prints “howdy”:

    $refrefref = \\\"howdy";
    print $$$$refrefref;

You can think of the dollar signs as operating right to left. But the beginning of the chain must still be a simple, unsubscripted scalar variable.

* We already confessed that this was a small fib. We’re not about to do so again.
There is, however, a way to get fancier, which we already sneakily used earlier, and which we’ll explain in the next section.

Using a BLOCK as a Variable Name

Not only can you dereference a simple variable name, you can also dereference the contents of a BLOCK. Anywhere you’d put an alphanumeric identifier as part of a variable or subroutine name, you can replace the identifier with a BLOCK returning a reference of the correct type. In other words, the earlier examples could all be disambiguated like this:

    $bar = ${$scalarref};
    push(@{$arrayref}, $filename);
    ${$arrayref}[0] = "January";
    @{$arrayref}[4..6] = qw/May June July/;
    ${$hashref}{"KEY"} = "VALUE";
    @{$hashref}{"KEY1","KEY2"} = ("VAL1","VAL2");
    &{$coderef}(1,2,3);

not to mention:

    $refrefref = \\\"howdy";
    print ${${${$refrefref}}};

Admittedly, it’s silly to use the braces in these simple cases, but the BLOCK can contain any arbitrary expression. In particular, it can contain subscripted expressions. In the following example, $dispatch{$index} is assumed to contain a reference to a subroutine (sometimes called a “coderef”). The example invokes the subroutine with three arguments.

    &{ $dispatch{$index} }(1, 2, 3);

Here, the BLOCK is necessary. Without that outer pair of braces, Perl would have treated $dispatch as the coderef instead of $dispatch{$index}.

Using the Arrow Operator

For references to arrays, hashes, or subroutines, a third method of dereferencing involves the use of the -> infix operator. This form of syntactic sugar makes it easier to get at individual array or hash elements, or to call a subroutine indirectly. The type of the dereference is determined by the right operand, that is, by what follows directly after the arrow. If the next thing after the arrow is a bracket or brace, the left operand is treated as a reference to an array or a hash, respectively, to be subscripted by the expression on the right.
If the next thing is a left parenthesis, the left operand is treated as a reference to a subroutine, to be called with whatever parameters you supply in the parentheses on the right.

Each of these next trios is equivalent, corresponding to the three notations we’ve introduced. (We’ve inserted some spaces to line up equivalent elements.)

    $  $arrayref  [2] = "Dorian";    #1
    ${ $arrayref }[2] = "Dorian";    #2
       $arrayref->[2] = "Dorian";    #3

    $  $hashref  {KEY} = "F#major";  #1
    ${ $hashref }{KEY} = "F#major";  #2
       $hashref->{KEY} = "F#major";  #3

    &  $coderef  (Presto => 192);    #1
    &{ $coderef }(Presto => 192);    #2
       $coderef->(Presto => 192);    #3

You can see that the initial funny character is missing from the third notation in each trio. The funny character is guessed at by Perl, which is why it can’t be used to dereference complete arrays, complete hashes, or slices of either. As long as you stick with scalar values, though, you can use any expression to the left of the ->, including another dereference, because multiple arrow operators associate left to right:

    print $array[3]->{"English"}->[0];

You can deduce from this expression that the fourth element of @array is intended to be a hash reference, and the value of the “English” entry in that hash is intended to be an array reference.

Note that $array[3] and $array->[3] are not the same. The first is talking about the fourth element of @array, while the second one is talking about the fourth element of the (possibly anonymous) array whose reference is contained in $array.

Suppose now that $array[3] is undefined. The following statement is still legal:

    $array[3]->{"English"}->[0] = "January";

This is one of those cases mentioned earlier in which references spring into existence (or “autovivify”) when used as an lvalue (that is, when a value is being assigned to it). If $array[3] was undefined, it’s automatically defined as a hash reference so that we can set a value for $array[3]->{"English"} in it.
Once that’s done, $array[3]->{"English"} is automatically defined as an array reference so that we can assign something to the first element in that array. Note that rvalues are a little different: print $array[3]->{"English"}->[0] only defines $array[3] and $array[3]->{"English"}, not $array[3]->{"English"}->[0], since the final element is not an lvalue. (The fact that it defines the first two at all in an rvalue context could be considered a bug. We may fix that someday.)

The arrow is optional between brackets or braces, or between a closing bracket or brace and a parenthesis for an indirect function call. So you can shrink the previous code down to:

    $dispatch{$index}(1, 2, 3);
    $array[3]{"English"}[0] = "January";

In the case of ordinary arrays, this gives you multidimensional arrays that are just like C’s arrays:

    $answer[$x][$y][$z] += 42;

Well, okay, not entirely like C’s arrays. For one thing, C doesn’t know how to grow its arrays on demand, while Perl does. Also, some constructs that are similar in the two languages parse differently. In Perl, the following two statements do the same thing:

    $listref->[2][2] = "hello";   # Pretty clear
    $$listref[2][2] = "hello";    # A bit confusing

The second of these statements may disconcert the C programmer, who is accustomed to using *a[i] to mean “what’s pointed to by the ith element of a”. But in Perl, the five characters ($ @ * % &) effectively bind more tightly than braces or brackets.* Therefore, it is $$listref and not $listref[2] that is taken to be a reference to an array.

* But not because of operator precedence. The funny characters in Perl are not operators in that sense. Perl’s grammar simply prohibits anything more complicated than a simple variable or block from following the initial funny character, for various funny reasons.
If you want the C behavior, either you have to write ${$listref[2]} to force the $listref[2] to get evaluated before the leading $ dereferencer, or you have to use the -> notation:

    $listref[2]->[$greeting] = "hello";

Using Object Methods

If a reference happens to be a reference to an object, then the class that defines that object probably provides methods to access the innards of the object, and you should generally stick to those methods if you’re merely using the class (as opposed to implementing it). In other words, be nice, and don’t treat an object like a regular reference, even though Perl lets you when you really need to. Perl does not enforce encapsulation. We are not totalitarians here. We do expect some basic civility, however. In return for this civility, you get complete orthogonality between objects and data structures. Any data structure can behave as an object when you want it to. Or not, when you don’t.

Pseudohashes

A pseudohash is any reference to an array whose first element is a reference to a hash. You can treat the pseudohash reference as either an array reference (as you would expect) or a hash reference (as you might not expect). Here’s an example of a pseudohash:

    $john = [ {age => 1, eyes => 2, weight => 3},
              47, "brown", 186 ];

The underlying hash in $john->[0] defines the names ("age", "eyes", "weight") of the array elements that follow (47, "brown", 186). Now you can access an element with both hash and array notations:

    $john->{weight}   # Treats $john as a hashref
    $john->[3]        # Treats $john as an arrayref

Pseudohash magic is not deep; it only knows one “trick”: how to turn a hash dereference into an array dereference.
When adding another element to a pseudohash, you have to explicitly tell the underlying mapping hash where the element will reside before you can use the hash notation:

    $john->[0]{height} = 4;      # height is to be element 4
    $john->{height} = "tall";    # Or $john->[4] = "tall"

Perl raises an exception if you try to delete a key from a pseudohash, although you can always delete keys from the mapping hash. Perl also raises an exception if you try to access a nonexistent key, where “existence” means presence in the mapping hash:

    delete $john->[0]{height};   # Deletes from the underlying hash only
    $john->{height};             # This now raises an exception
    $john->[4];                  # Still returns "tall"

Don’t try to splice the array unless you know what you’re doing. If the array elements move around, the mapping hash values will still refer to the old element positions, unless you change those explicitly, too. Pseudohash magic is not deep.

To avoid inconsistencies, you can use the fields::phash function provided by the use fields pragma to create a pseudohash:

    use fields;
    $ph = fields::phash(age => 47, eyes => "brown", weight => 186);
    print $ph->{age};

There are two ways to check for the existence of a key in a pseudohash. The first is to use exists, which checks whether the given field has ever been set. It acts this way to match the behavior of a real hash. For instance:

    use fields;
    $ph = fields::phash([qw(age eyes weight)], [47]);
    $ph->{eyes} = undef;
    print exists $ph->{age};     # True, 'age' was set in declaration.
    print exists $ph->{weight};  # False, 'weight' has not been used.
    print exists $ph->{eyes};    # True, your 'eyes' have been touched.

The second way is to use exists on the mapping hash sitting in the first array element.
This checks whether the given key is a valid field for that pseudohash:

    print exists $ph->[0]{age};    # True, 'age' is a valid field
    print exists $ph->[0]{name};   # False, 'name' can't be used

Unlike what happens in a real hash, calling delete on a pseudohash element deletes only the array value corresponding to the key, not the real key in the mapping hash. To delete the key, you have to explicitly delete it from the mapping hash. Once you do that, you may no longer use that key name as a pseudohash subscript:

    print delete $ph->{age};       # Removes and returns $ph->[1], 47
    print exists $ph->{age};       # Now false
    print exists $ph->[0]{age};    # True, 'age' key still usable
    print delete $ph->[0]{age};    # Now 'age' key is gone
    print $ph->{age};              # Run-time exception

You’ve probably begun to wonder what could possibly have motivated this masquerade of arrays prancing about in hashes’ clothing. Arrays provide faster lookups and more efficient storage, while hashes offer the convenience of naming (instead of numbering) your data; pseudohashes provide the best of both worlds. But it’s not until you consider Perl’s compilation phase that the greatest benefit becomes apparent. With the help of a pragma or two, the compiler can verify proper access to valid fields, so you can find out about nonexistent subscripts (or spelling errors) before your program starts to run. Pseudohashes’ properties of speed, efficiency, and compile-time access checking (you might even think of it as type safety) are especially handy for creating efficient and robust class modules. See the discussion of the use fields pragma in Chapter 12 and Chapter 31, Pragmatic Modules.

Pseudohashes are a new and relatively experimental feature; as such, the underlying implementation may well change in the future. To protect yourself from such changes, always go through the fields module’s documented interface via its phash and new functions.
Other Tricks You Can Do with Hard References

As mentioned earlier, the backslash operator is usually used on a single referent to generate a single reference, but it doesn’t have to be. When used on a list of referents, it produces a list of corresponding references. The second line of the following example does the same thing as the first line, since the backslash is automatically distributed throughout the whole list.

    @reflist = (\$s, \@a, \%h, \&f);   # List of four references
    @reflist = \($s, @a, %h, &f);      # Same thing

If a parenthesized list contains exactly one array or hash, then all of its values are interpolated and references to each returned:

    @reflist = \(@x);                  # Interpolate array, then get refs
    @reflist = map { \$_ } @x;         # Same thing

This also occurs when there are internal parentheses:

    @reflist = \(@x, (@y));            # But only single aggregates expand
    @reflist = (\@x, map { \$_ } @y);  # Same thing

If you try this with a hash, the result will contain references to the values (as you’d expect), but references to copies of the keys (as you might not expect).

Since array and hash slices are really just lists, you can backslash a slice of either of these to get a list of references. Each of the next three lines does exactly the same thing:

    @envrefs = \@ENV{'HOME', 'TERM'};          # Backslashing a slice
    @envrefs = \( $ENV{HOME}, $ENV{TERM} );    # Backslashing a list
    @envrefs = ( \$ENV{HOME}, \$ENV{TERM} );   # A list of two references

Since functions can return lists, you can apply a backslash to them.
If you have more than one function to call, first interpolate each function’s return values into a larger list and then backslash the whole thing:

    @reflist = \fx();
    @reflist = map { \$_ } fx();               # Same thing

    @reflist = \( fx(), fy(), fz() );
    @reflist = ( \fx(), \fy(), \fz() );        # Same thing
    @reflist = map { \$_ } fx(), fy(), fz();   # Same thing

The backslash operator always supplies a list context to its operand, so those functions are all called in list context. If the backslash is itself in scalar context, you’ll end up with a reference to the last value of the list returned by the function:

    @reflist = \localtime();   # Ref to each of nine time elements
    $lastref = \localtime();   # Ref to whether it's daylight savings time

In this regard, the backslash behaves like the named Perl list operators, such as print, reverse, and sort, which always supply a list context on their right no matter what might be happening on their left. As with named list operators, use an explicit scalar to force what follows into scalar context:

    $dateref = \scalar localtime();   # \"Sat Jul 16 11:42:18 2000"

You can use the ref operator to determine what a reference is pointing to. Think of ref as a “typeof” operator that returns true if its argument is a reference and false otherwise. The value returned depends on the type of thing referenced. Built-in types include SCALAR, ARRAY, HASH, CODE, GLOB, REF, LVALUE, IO, IO::Handle, and Regexp. Here, we use it to check subroutine arguments:

    sub sum {
        my $arrayref = shift;
        warn "Not an array reference" if ref($arrayref) ne "ARRAY";
        return eval join("+", @$arrayref);
    }

If you use a hard reference in a string context, it’ll be converted to a string containing both the type and the address: SCALAR(0x1fc0e). (The reverse conversion cannot be done, since reference count information is lost during stringification—and also because it would be dangerous to let programs access a memory address named by an arbitrary string.)
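Since ref returns those type strings, you can also dispatch on them. Here is a small sketch of our own (the describe helper is hypothetical, not from the text) that turns ref's return value into a readable description:

```perl
use strict;
use warnings;

# A hypothetical helper that names what a reference points to,
# using the strings ref() returns for unblessed referents.
sub describe {
    my $thing = shift;
    my $type = ref($thing)
        or return "not a reference";
    return "a reference to " . lc($type);
}

print describe(\"camel"),   "\n";   # Prints "a reference to scalar"
print describe([1, 2, 3]),  "\n";   # Prints "a reference to array"
print describe({ a => 1 }), "\n";   # Prints "a reference to hash"
print describe(42),         "\n";   # Prints "not a reference"
```

For a blessed reference, ref would return the class name instead, as the next paragraph explains.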
You can use the bless operator to associate a referent with a package functioning as an object class. When you do this, ref returns the class name instead of the internal type. An object reference used in a string context returns a string with the external and internal types, and the address in memory: MyType=HASH(0x20d10) or IO::Handle=IO(0x186904). See Chapter 12 for more details about objects.

Since the way in which you dereference something always indicates what sort of referent you’re looking for, a typeglob can be used the same way a reference can, despite the fact that a typeglob contains multiple referents of various types. So ${*main::foo} and ${\$main::foo} both access the same scalar variable, although the latter is more efficient.

Here’s a trick for interpolating the return value of a subroutine call into a string:

    print "My sub returned @{[ mysub(1,2,3) ]} that time.\n";

It works like this. At compile time, when the @{...} is seen within the double-quoted string, it’s parsed as a block that returns a reference. Within the block, there are square brackets that create a reference to an anonymous array from whatever is in the brackets. So at run time, mysub(1,2,3) is called in list context, and the results are loaded into an anonymous array, a reference to which is then returned within the block. That array reference is then immediately dereferenced by the surrounding @{...}, and the array value is interpolated into the double-quoted string just as an ordinary array would be. This chicanery is also useful for arbitrary expressions, such as:

    print "We need @{ [$n + 5] } widgets!\n";

Be careful though: square brackets supply a list context to their expression. In this case it doesn’t matter, although the earlier call to mysub might care.
When it does matter, use an explicit scalar to force the context:

    print "mysub returns @{ [scalar mysub(1,2,3)] } now.\n";

Closures

Earlier we talked about creating anonymous subroutines with a nameless sub {}. You can think of those subroutines as defined at run time, which means that they have a time of generation as well as a location of definition. Some variables might be in scope when the subroutine is created, and different variables might be in scope when the subroutine is called.

Forgetting about subroutines for a moment, consider a reference that refers to a lexical variable:

    {
        my $critter = "camel";
        $critterref = \$critter;
    }

The value of $$critterref will remain “camel” even though $critter disappears after the closing curly brace. But $critterref could just as well have referred to a subroutine that refers to $critter:

    {
        my $critter = "camel";
        $critterref = sub { return $critter };
    }

This is a closure, which is a notion out of the functional programming world of LISP and Scheme.* It means that when you define an anonymous function in a particular lexical scope at a particular moment, it pretends to run in that scope even when later called from outside that scope. (A purist would say it doesn’t have to pretend—it actually does run in that scope.) In other words, you are guaranteed to get the same copy of a lexical variable each time, even if other instances of that lexical variable have been created before or since for other instances of that closure. This gives you a way to set values used in a subroutine when you define it, not just when you call it.

You can also think of closures as a way to write a subroutine template without using eval. The lexical variables act as parameters for filling in the template, which is useful for setting up little bits of code to run later. These are commonly called callbacks in event-based programming, where you associate a bit of code with a keypress, mouse click, window exposure, and so on.
When used as callbacks, closures do exactly what you expect, even if you don’t know the first thing about functional programming. (Note that this closure business only applies to my variables. Global variables work as they’ve always worked, since they’re neither created nor destroyed the way lexical variables are.)

Another use for closures is within function generators; that is, functions that create and return brand new functions. Here’s an example of a function generator implemented with closures:

    sub make_saying {
        my $salute = shift;
        my $newfunc = sub {
            my $target = shift;
            print "$salute, $target!\n";
        };
        return $newfunc;               # Return a closure
    }

    $f = make_saying("Howdy");         # Create a closure
    $g = make_saying("Greetings");     # Create another closure

    # Time passes...
    $f->("world");
    $g->("earthlings");

This prints:

    Howdy, world!
    Greetings, earthlings!

Note in particular how $salute continues to refer to the actual value passed into make_saying, despite the fact that the my $salute has gone out of scope by the time the anonymous subroutine runs. That’s what closures are all about. Since $f and $g hold references to functions that, when called, still need access to the distinct versions of $salute, those versions automatically stick around. If you now overwrite $f, its version of $salute would automatically disappear. (Perl only cleans up when you’re not looking.)

Perl doesn’t provide references to object methods (described in Chapter 12), but you can get a similar effect using a closure. Suppose you want a reference not just to the subroutine the method represents, but one which, when invoked, would call that method on a particular object.

* In this context, the word “functional” should not be construed as an antonym of “dysfunctional”.
You can conveniently remember both the object and the method as lexical variables bound up inside a closure:

    sub get_method_ref {
        my ($self, $methodname) = @_;
        my $methref = sub {
            # the @_ below is not the same as the one above!
            return $self->$methodname(@_);
        };
        return $methref;
    }

    my $dog = new Doggie:: Name => "Lucky",
                          Legs => 3,
                          Tail => "clipped";
    our $wagger = get_method_ref($dog, 'wag');
    $wagger->("tail");                 # Calls $dog->wag('tail').

Not only can you get Lucky to wag what’s left of his tail now, even once the lexical $dog variable has gone out of scope and Lucky is nowhere to be seen, the global $wagger variable can still get him to wag his tail, wherever he is.

Closures as function templates

Using a closure as a function template allows you to generate many functions that act similarly. Suppose you want a suite of functions that generate HTML font changes for various colors:

    print "Be ", red("careful"), "with that ", green("light"), "!!!";

The red and green functions would be very similar. We’d like to name our functions, but closures don’t have names since they’re just anonymous subroutines with an attitude. To get around that, we’ll perform the cute trick of naming our anonymous subroutines. You can bind a coderef to an existing name by assigning it to a typeglob of the name of the function you want. (See the section “Symbol Tables” in Chapter 10, Packages.) In this case, we’ll bind it to two different names, one uppercase and one lowercase:

    @colors = qw(red blue green yellow orange purple violet);
    for my $name (@colors) {
        no strict 'refs';   # Allow symbolic references
        *$name = *{uc $name} = sub { "<FONT COLOR='$name'>@_</FONT>" };
    }

Now you can call functions named red, RED, blue, BLUE, and so on, and the appropriate closure will be invoked. This technique reduces compile time and conserves memory, and is less error-prone as well, since syntax checks happen during compilation.
It’s critical that any variables in the anonymous subroutine be lexicals in order to create a closure. That’s the reason for the my above. This is one of the few places where giving a prototype to a closure makes sense. If you wanted to impose scalar context on the arguments of these functions (probably not a wise idea for this example), you could have written it this way instead:

    *$name = sub ($) { "<FONT COLOR='$name'>$_[0]</FONT>" };

That’s almost good enough. However, since prototype checking happens during compile time, the run-time assignment above happens too late to be of much use. You could fix this by putting the whole loop of assignments within a BEGIN block, forcing it to occur during compilation. (More likely, you’d put it out in a module that you use at compile time.) Then the prototypes will be visible during the rest of the compilation.

Nested subroutines

If you are accustomed (from other programming languages) to using subroutines nested within other subroutines, each with their own private variables, you’ll have to work at it a bit in Perl. Named subroutines do not nest properly, although anonymous ones do.* Anyway, we can emulate nested, lexically scoped subroutines using closures. Here’s an example:

    sub outer {
        my $x = $_[0] + 35;
        local *inner = sub { return $x * 19 };
        return $x + inner();
    }

* To be more precise, globally named subroutines don’t nest. Unfortunately, that’s the only kind of named subroutine declaration we have. We haven’t yet implemented lexically scoped, named subroutines (known as my subs), but when we do, they should nest correctly.

Now inner can only be called from within outer, because of the temporary assignments of the closure. But when it is, it has normal access to the lexical variable $x from the scope of outer. This has the interesting effect of creating a function local to another function, something not normally supported in Perl.
Because local is dynamically scoped, and because function names are global to their package, any other function that outer called could also call the temporary version of inner. To prevent that, you’d need an extra level of indirection:

    sub outer {
        my $x = $_[0] + 35;
        my $inner = sub { return $x * 19 };
        return $x + $inner->();
    }

Symbolic References

What happens if you try to dereference a value that is not a hard reference? The value is then treated as a symbolic reference. That is, the reference is interpreted as a string representing the name of a global variable. Here is how this works:

    $name = "bam";
    $$name = 1;             # Sets $bam
    $name->[0] = 4;         # Sets the first element of @bam
    $name->{X} = "Y";       # Sets the X element of %bam to Y
    @$name = ();            # Clears @bam
    keys %$name;            # Yields the keys of %bam
    &$name;                 # Calls &bam

This is very powerful, and slightly dangerous, in that it’s possible to intend (with the utmost sincerity) to use a hard reference, but to accidentally use a symbolic reference instead. To protect against that, you can say:

    use strict 'refs';

and then only hard references will be allowed for the rest of the enclosing block. An inner block may countermand the decree with:

    no strict 'refs';

It is also important to understand the difference between the following two lines of code:

    ${identifier};      # Same as $identifier.
    ${"identifier"};    # Also $identifier, but a symbolic reference.

Because the second form is quoted, it is treated as a symbolic reference and will generate an error if use strict 'refs' is in effect. Even if strict 'refs' is not in effect, it can only refer to a package variable. But the first form is identical to the unbracketed form, and will refer to even a lexically scoped variable if one is declared. The next example shows this (and the next section discusses it). Only package variables are accessible through symbolic references, because symbolic references always go through the package symbol table.
Since lexical variables aren’t in a package symbol table, they are invisible to this mechanism. For example:

    our $value = "global";
    {
        my $value = "private";
        print "Inside, mine is ${value}, ";
        print "but ours is ${'value'}.\n";
    }
    print "Outside, ${value} is again ${'value'}.\n";

which prints:

    Inside, mine is private, but ours is global.
    Outside, global is again global.

Braces, Brackets, and Quoting

In the previous section, we pointed out that ${identifier} is not treated as a symbolic reference. You might wonder how this interacts with reserved words, and the short answer is that it doesn’t. Despite the fact that push is a reserved word, these two statements print “pop on over”:

    $push = "pop on ";
    print "${push}over";

The reason is that, historically, this use of braces is how Unix shells have isolated a variable name from subsequent alphanumeric text that would otherwise be interpreted as part of the name. It’s how many people expect variable interpolation to work, so we made it work the same way in Perl. But with Perl, the notion extends further and applies to any braces used in generating references, whether or not they’re inside quotes. This means that:

    print ${push} . 'over';

or even (since spaces never matter):

    print ${ push } . 'over';

both print “pop on over”, even though the braces are outside of double quotes. The same rule applies to any identifier used for subscripting a hash. So, instead of writing:

    $hash{ "aaa" }{ "bbb" }{ "ccc" }

you can just write:

    $hash{ aaa }{ bbb }{ ccc }

or:

    $hash{aaa}{bbb}{ccc}

and not worry about whether the subscripts are reserved words. So this:

    $hash{ shift }

is interpreted as $hash{"shift"}.
You can force interpretation as a reserved word by adding anything that makes it more than a mere identifier:

    $hash{ shift() }
    $hash{ +shift }
    $hash{ shift @_ }

References Don’t Work as Hash Keys

Hash keys are stored internally as strings.* If you try to store a reference as a key in a hash, the key value will be converted into a string:

    $x{ \$a } = $a;
    ($key, $value) = each %x;
    print $$key;            # WRONG

* They’re also stored externally as strings, such as when you put them into a DBM file. In fact, DBM files require that their keys (and values) be strings.

We mentioned earlier that you can’t convert a string back to a hard reference. So if you try to dereference $key, which contains a mere string, it won’t return a hard dereference, but rather a symbolic dereference—and since you probably don’t have a variable named SCALAR(0x1fc0e), you won’t accomplish what you’re attempting. You might want to do something more like:

    $r = \@a;
    $x{ $r } = $r;

Then at least you can use the hash value, which will be a hard reference, instead of the key, which won’t. Although you can’t store a reference as a key, if (as in the earlier example) you use a hard reference in a string context, it is guaranteed to produce a unique string, since the address of the reference is included as part of the resulting string. So you can in fact use a reference as a unique hash key. You just can’t dereference it later.

There is one special kind of hash in which you are able to use references as keys. Through the magic* of the Tie::RefHash module bundled with Perl, the thing we just said you couldn’t do, you can do:

    use Tie::RefHash;
    tie my %h, 'Tie::RefHash';
    %h = (
        ["this", "here"]  => "at home",
        ["that", "there"] => "elsewhere",
    );
    while ( my($keyref, $value) = each %h ) {
        print "@$keyref is $value\n";
    }

In fact, by tying different implementations to the built-in types, you can make scalars, hashes, and arrays behave in many of the ways we’ve said you can’t. That’ll show us!
Stupid authors . . . For more about tying, see Chapter 14, Tied Variables.

* Yes, that is a technical term, as you’ll notice if you muddle through the mg.c file in the Perl source distribution.

Garbage Collection, Circular References, and Weak References

High-level languages typically allow programmers not to worry about deallocating memory when they’re done using it. This automatic reclamation process is known as garbage collection. For most purposes, Perl uses a fast and simple reference-based garbage collector. When a block is exited, its locally scoped variables are normally freed up, but it is possible to hide your garbage so that Perl’s garbage collector can’t find it. One serious concern is that unreachable memory with a nonzero reference count will normally not get freed. Therefore, circular references are a bad idea:

    {   # make $a and $b point to each other
        my ($a, $b);
        $a = \$b;
        $b = \$a;
    }

or more simply:

    {   # make $a point to itself
        my $a;
        $a = \$a;
    }

Even though $a should be deallocated at the end of the block, it isn’t. When building recursive data structures, you’ll have to break (or weaken; see below) the self-reference yourself if you want to reclaim the memory before your program (or thread) exits. (Upon exit, the memory will be reclaimed for you automatically via a costly but complete mark-and-sweep garbage collection.) If the data structure is an object, you can use a DESTROY method to break the reference automatically; see “Garbage Collection with DESTROY Methods” in Chapter 12.

A similar situation can occur with caches—repositories of data designed for faster-than-normal retrieval. Outside the cache, there are references to data inside the cache. The problem occurs when all of those references are deleted, but the cache data with its internal reference remains.
The existence of any reference prevents the referent from being reclaimed by Perl, even though we want cache data to disappear as soon as it’s no longer needed. As with circular references, we want a reference that doesn’t affect the reference count, and therefore doesn’t delay garbage collection. Weak references solve the problems caused by circular references and cache data by allowing you to “weaken” any reference; that is, make it not affect the reference count. When the last nonweak reference to an object is deleted, the object is destroyed and all the weak references to the object are automatically freed. To use this feature, you need the WeakRef package from CPAN, which contains additional documentation. Weak references are an experimental feature. But hey, somebody’s gotta be the guinea pig.

9. Data Structures

Perl provides for free many of the data structures that you have to build yourself in other programming languages. The stacks and queues that budding computer scientists learn about are both just arrays in Perl. When you push and pop (or shift and unshift) an array, it’s a stack; when you push and shift (or unshift and pop) an array, it’s a queue. And many of the tree structures in the world are built only to provide fast, dynamic access to a conceptually flat lookup table. Hashes, of course, are built into Perl, and provide fast, dynamic access to a conceptually flat lookup table, only without the mind-numbingly recursive data structures that are claimed to be beautiful by people whose minds have been suitably numbed already. But sometimes you want nested data structures because they most naturally model the problem you’re trying to solve. So Perl lets you combine and nest arrays and hashes to create arbitrarily complex data structures. Properly applied, they can be used to create linked lists, binary trees, heaps, B-trees, sets, graphs, and anything else you can devise.
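To make the stack and queue idioms mentioned above concrete, here’s a minimal sketch (our illustration, using only core Perl builtins):

```perl
use strict;
use warnings;

# A stack: push adds to the end, pop removes from the end (LIFO).
my @stack;
push @stack, 1, 2, 3;
my $top = pop @stack;           # $top is now 3

# A queue: push adds to the end, shift removes from the front (FIFO).
my @queue;
push @queue, "fred", "barney", "wilma";
my $first = shift @queue;       # $first is now "fred"

print "top=$top first=$first\n";    # prints "top=3 first=fred"
```

The same array works either way; only your choice of which end to remove from makes it a stack or a queue.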
See Mastering Algorithms with Perl (O’Reilly, 1999), the Perl Cookbook (O’Reilly, 1998), or CPAN, the central repository for all such modules. But simple combinations of arrays and hashes may be all you ever need, so they’re what we’ll talk about in this chapter.

Arrays of Arrays

There are many kinds of nested data structures. The simplest kind to build is an array of arrays, also called a two-dimensional array or a matrix. (The obvious generalization applies: an array of arrays of arrays is a three-dimensional array, and so on for higher dimensions.) It’s reasonably easy to understand, and nearly everything that applies here will also be applicable to the fancier data structures that we’ll explore in subsequent sections.

Creating and Accessing a Two-Dimensional Array

Here’s how to put together a two-dimensional array:

    # Assign a list of array references to an array.
    @AoA = (
        [ "fred",   "barney" ],
        [ "george", "jane",  "elroy" ],
        [ "homer",  "marge", "bart" ],
    );
    print $AoA[2][1];   # prints "marge"

The overall list is enclosed by parentheses, not brackets, because you’re assigning a list and not a reference. If you wanted a reference to an array instead, you’d use brackets:

    # Create a reference to an array of array references.
    $ref_to_AoA = [
        [ "fred", "barney", "pebbles", "bamm bamm", "dino", ],
        [ "homer", "bart", "marge", "maggie", ],
        [ "george", "jane", "elroy", "judy", ],
    ];
    print $ref_to_AoA->[2][3];   # prints "judy"

Remember that there is an implied -> between every pair of adjacent braces or brackets. Therefore these two lines:

    $AoA[2][3]
    $ref_to_AoA->[2][3]

are equivalent to these two lines:

    $AoA[2]->[3]
    $ref_to_AoA->[2]->[3]

There is, however, no implied -> before the first pair of brackets, which is why the dereference of $ref_to_AoA requires the initial ->. Also remember that you can count backward from the end of an array with a negative index, so:

    $AoA[0][-2]

is the next-to-last element of the first row.
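The same nested subscripting is all you need for other whole-matrix operations. For example, here’s a sketch (our addition, not from the original text) that transposes the array of arrays above, swapping rows and columns:

```perl
use strict;
use warnings;

my @AoA = (
    [ "fred",   "barney" ],
    [ "george", "jane",  "elroy" ],
    [ "homer",  "marge", "bart" ],
);

# Walk every cell; rows of different lengths simply leave undef
# holes in the transposed structure.
my @transposed;
for my $x ( 0 .. $#AoA ) {
    for my $y ( 0 .. $#{ $AoA[$x] } ) {
        $transposed[$y][$x] = $AoA[$x][$y];
    }
}

print $transposed[1][2];    # prints "marge"
```

Note that $transposed[1][2] holds what $AoA[2][1] held, which is exactly what transposition means.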
Growing Your Own

Those big list assignments are well and good for creating a fixed data structure, but what if you want to calculate each element on the fly, or otherwise build the structure piecemeal? Let’s read in a data structure from a file. We’ll assume that it’s a plain text file, where each line is a row of the structure, and each line consists of elements delimited by whitespace. Here’s how to proceed:*

    while (<>) {
        @tmp = split;           # Split elements into an array.
        push @AoA, [ @tmp ];    # Add an anonymous array reference to @AoA.
    }

Of course, you don’t need to name the temporary array, so you could also say:

    while (<>) {
        push @AoA, [ split ];
    }

If you want a reference to an array of arrays, you can do this:

    while (<>) {
        push @$ref_to_AoA, [ split ];
    }

Both of those examples add new rows to the array of arrays. What about adding new columns? If you’re just dealing with two-dimensional arrays, it’s often easiest to use simple assignment:†

    for $x (0 .. 9) {                       # For each row...
        for $y (0 .. 9) {                   # For each column...
            $AoA[$x][$y] = func($x, $y);    # ...set that cell
        }
    }

    for $x ( 0..9 ) {                       # For each row...
        $ref_to_AoA->[$x][3] = func2($x);   # ...set the fourth column
    }

It doesn’t matter in what order you assign the elements, nor does it matter whether the subscripted elements of @AoA are already there or not; Perl will gladly create them for you, setting intervening elements to the undefined value as need be. (Perl will even create the original reference in $ref_to_AoA for you if it needs to.) If you just want to append to a row, you have to do something a bit funnier:

    # Append new columns to an existing row.
    push @{ $AoA[0] }, "wilma", "betty";

* Here as in other chapters, we omit (for clarity) the my declarations that you would ordinarily put in. In this example, you’d normally write my @tmp = split.
† As with the temp assignment earlier, we’ve simplified; the loops in this chapter would likely be written for my $x in real code.
Notice that this wouldn’t work:

    push $AoA[0], "wilma", "betty";     # WRONG!

That won’t even compile, because the argument to push must be a real array, not just a reference to an array. Therefore, the first argument absolutely must begin with an @ character. What comes after the @ is somewhat negotiable.

Access and Printing

Now let’s print the data structure. If you only want one element, this is sufficient:

    print $AoA[3][2];

But if you want to print the whole thing, you can’t just say:

    print @AoA;         # WRONG

It’s wrong because you’ll see stringified references instead of your data. Perl never automatically dereferences for you. Instead, you have to roll yourself a loop or two. The following code prints the whole structure, looping through the elements of @AoA and dereferencing each inside the print statement:

    for $row ( @AoA ) {
        print "@$row\n";
    }

If you want to keep track of subscripts, you might do this:

    for $i ( 0 .. $#AoA ) {
        print "row $i is: @{$AoA[$i]}\n";
    }

or maybe even this (notice the inner loop):

    for $i ( 0 .. $#AoA ) {
        for $j ( 0 .. $#{$AoA[$i]} ) {
            print "element $i $j is $AoA[$i][$j]\n";
        }
    }

As you can see, things are getting a bit complicated. That’s why sometimes it’s easier to use a temporary variable on your way through:

    for $i ( 0 .. $#AoA ) {
        $row = $AoA[$i];
        for $j ( 0 .. $#{$row} ) {
            print "element $i $j is $row->[$j]\n";
        }
    }

Slices

If you want to access a slice (part of a row) of a multidimensional array, you’re going to have to do some fancy subscripting. The pointer arrows give us a nice way to access a single element, but no such convenience exists for slices.
You can always extract the elements of your slice one-by-one with a loop:

    @part = ();
    for ($y = 7; $y < 13; $y++) {
        push @part, $AoA[4][$y];
    }

This particular loop could be replaced with an array slice:

    @part = @{ $AoA[4] } [ 7..12 ];

If you want a two-dimensional slice, say, with $x running from 4..8 and $y from 7..12, here’s one way to do it:

    @newAoA = ();
    for ($startx = $x = 4; $x <= 8; $x++) {
        for ($starty = $y = 7; $y <= 12; $y++) {
            $newAoA[$x - $startx][$y - $starty] = $AoA[$x][$y];
        }
    }

In this example, the individual values within our destination two-dimensional array, @newAoA, are assigned one by one, taken from a two-dimensional subarray of @AoA. An alternative is to create anonymous arrays, each consisting of a desired slice of an @AoA subarray, and then put references to these anonymous arrays into @newAoA. We would then be writing references into @newAoA (subscripted once, so to speak) instead of subarray values into a twice-subscripted @newAoA. This method eliminates the innermost loop:

    for ($x = 4; $x <= 8; $x++) {
        push @newAoA, [ @{ $AoA[$x] } [ 7..12 ] ];
    }

Of course, if you do this often, you should probably write a subroutine called something like extract_rectangle. And if you do it very often with large collections of multidimensional data, you should probably use the PDL (Perl Data Language) module, available from CPAN.

Common Mistakes

As mentioned earlier, Perl arrays and hashes are one-dimensional. In Perl, even “multidimensional” arrays are actually one-dimensional, but the values along that dimension are references to other arrays, which collapse many elements into one. If you print these values out without dereferencing them, you will get the stringified references rather than the data you want.
For example, these two lines:

    @AoA = ( [2, 3], [4, 5, 7], [0] );
    print "@AoA";

result in something like:

    ARRAY(0x83c38) ARRAY(0x8b194) ARRAY(0x8b1d0)

On the other hand, this line displays 7:

    print $AoA[1][2];

When constructing an array of arrays, remember to compose new references for the subarrays. Otherwise, you will just create an array containing the element counts of the subarrays, like this:

    for $i (1..10) {
        @array = somefunc($i);
        $AoA[$i] = @array;      # WRONG!
    }

Here @array is being accessed in a scalar context, and therefore yields the count of its elements, which is dutifully assigned to $AoA[$i]. The proper way to assign the reference will be shown in a moment.

After making the previous mistake, people realize they need to assign a reference, so the next mistake people naturally make involves taking a reference to the same memory location over and over again:

    for $i (1..10) {
        @array = somefunc($i);
        $AoA[$i] = \@array;     # WRONG AGAIN!
    }

Every reference generated by the second assignment in the loop is the same, namely, a reference to the single array @array. Yes, this array changes on each pass through the loop, but when everything is said and done, @AoA contains 10 references to the same array, which now holds the last set of values assigned to it. print @{$AoA[1]} will reveal the same values as print @{$AoA[2]}. Here’s a more successful approach:

    for $i (1..10) {
        @array = somefunc($i);
        $AoA[$i] = [ @array ];  # RIGHT!
    }

The brackets around @array create a new anonymous array, into which the elements of @array are copied. We then store a reference to that new array.

A similar result — though more difficult to read — would be produced by:

    for $i (1..10) {
        @array = somefunc($i);
        @{$AoA[$i]} = @array;
    }

Since $AoA[$i] needs to be a new reference, the reference springs into existence.
Then, the preceding @ dereferences this new reference, with the result that the values of @array are assigned (in list context) to the array referenced by $AoA[$i]. You might wish to avoid this construct for clarity’s sake. But there is a situation in which you might use it. Suppose @AoA is already an array of references to arrays. That is, you’ve made assignments like:

    $AoA[3] = \@original_array;

And now suppose that you want to change @original_array (that is, you want to change the fourth row of @AoA) so that it refers to the elements of @array. This code will work:

    @{$AoA[3]} = @array;

In this case, the reference itself does not change, but the elements of the referenced array do. This overwrites the values of @original_array.

Finally, the following dangerous-looking code actually works fine:

    for $i (1..10) {
        my @array = somefunc($i);
        $AoA[$i] = \@array;
    }

That’s because the lexically scoped my @array variable is created afresh on each pass through the loop. So even though it looks as though you’ve stored the same variable reference each time, you haven’t. This is a subtle distinction, but the technique can produce more efficient code, at the risk of misleading less-enlightened programmers. (It’s more efficient because there’s no copy in the final assignment.) On the other hand, if you have to copy the values anyway (which the first assignment in the loop is doing), then you might as well use the copy implied by the brackets and avoid the temporary variable:

    for $i (1..10) {
        $AoA[$i] = [ somefunc($i) ];
    }

In summary:

    $AoA[$i] = [ @array ];      # Safest, sometimes fastest
    $AoA[$i] = \@array;         # Fast but risky, depends on my-ness of array
    @{ $AoA[$i] } = @array;    # A bit tricky

Once you’ve mastered arrays of arrays, you’ll want to tackle more complex data structures. If you’re looking for C structures or Pascal records, you won’t find any special reserved words in Perl to set these up for you. What you get instead is a more flexible system.
If your idea of a record structure is less flexible than this, or if you’d like to provide your users with something more opaque and rigid, then you can use the object-oriented features detailed in Chapter 12, Objects.

Perl has just two ways of organizing data: as ordered lists stored in arrays and accessed by position, or as unordered key/value pairs stored in hashes and accessed by name. The best way to represent a record in Perl is with a hash reference, but how you choose to organize such records will vary. You might want to keep an ordered list of these records that you can look up by number, in which case you’d use an array of hash references to store the records. Or, you might wish to look the records up by name, in which case you’d maintain a hash of hash references. You could even do both at once, with pseudohashes.

In the following sections, you will find code examples detailing how to compose (from scratch), generate (from other sources), access, and display several different data structures. We first demonstrate three straightforward combinations of arrays and hashes, followed by a hash of functions and more irregular data structures. We end with a demonstration of how these data structures can be saved. These examples assume that you have already familiarized yourself with the explanations set forth earlier in this chapter.

Hashes of Arrays

Use a hash of arrays when you want to look up each array by a particular string rather than merely by an index number. In our example of television characters, instead of looking up the list of names by the zeroth show, the first show, and so on, we’ll set it up so we can look up the cast list given the name of the show. Because our outer data structure is a hash, we can’t order the contents, but we can use the sort function to specify a particular output order.

Composition of a Hash of Arrays

You can create a hash of anonymous arrays as follows:

    # We customarily omit quotes when the keys are identifiers.
    %HoA = (
        flintstones => [ "fred", "barney" ],
        jetsons     => [ "george", "jane", "elroy" ],
        simpsons    => [ "homer", "marge", "bart" ],
    );

To add another array to the hash, you can simply say:

    $HoA{teletubbies} = [ "tinky winky", "dipsy", "laa-laa", "po" ];

Generation of a Hash of Arrays

Here are some techniques for populating a hash of arrays. To read from a file with the following format:

    flintstones: fred barney wilma dino
    jetsons: george jane elroy
    simpsons: homer marge bart

you could use either of the following two loops:

    while ( <> ) {
        next unless s/^(.*?):\s*//;
        $HoA{$1} = [ split ];
    }

    while ( $line = <> ) {
        ($who, $rest) = split /:\s*/, $line, 2;
        @fields = split ' ', $rest;
        $HoA{$who} = [ @fields ];
    }

If you have a subroutine get_family that returns an array, you can use it to stuff %HoA with either of these two loops:

    for $group ( "simpsons", "jetsons", "flintstones" ) {
        $HoA{$group} = [ get_family($group) ];
    }

    for $group ( "simpsons", "jetsons", "flintstones" ) {
        @members = get_family($group);
        $HoA{$group} = [ @members ];
    }

You can append new members to an existing array like so:

    push @{ $HoA{flintstones} }, "wilma", "pebbles";

Access and Printing of a Hash of Arrays

You can set the first element of a particular array as follows:

    $HoA{flintstones}[0] = "Fred";

To capitalize the second Simpson, apply a substitution to the appropriate array element:

    $HoA{simpsons}[1] =~ s/(\w)/\u$1/;

You can print all of the families by looping through the keys of the hash:

    for $family ( keys %HoA ) {
        print "$family: @{ $HoA{$family} }\n";
    }

With a little extra effort, you can add array indices as well:

    for $family ( keys %HoA ) {
        print "$family: ";
        for $i ( 0 ..
                  $#{ $HoA{$family} } ) {
            print " $i = $HoA{$family}[$i]";
        }
        print "\n";
    }

Or sort the arrays by how many elements they have:

    for $family ( sort { @{$HoA{$b}} <=> @{$HoA{$a}} } keys %HoA ) {
        print "$family: @{ $HoA{$family} }\n"
    }

Or even sort the arrays by the number of elements and then order the elements ASCIIbetically (or, to be precise, utf8ically):

    # Print the whole thing sorted by number of members and name.
    for $family ( sort { @{$HoA{$b}} <=> @{$HoA{$a}} } keys %HoA ) {
        print "$family: ", join(", ", sort @{ $HoA{$family} }), "\n";
    }

Arrays of Hashes

An array of hashes is useful when you have a bunch of records that you’d like to access sequentially, and each record itself contains key/value pairs. Arrays of hashes are used less frequently than the other structures in this chapter.

Composition of an Array of Hashes

You can create an array of anonymous hashes as follows:

    @AoH = (
        {
            husband => "barney",
            wife    => "betty",
            son     => "bamm bamm",
        },
        {
            husband => "george",
            wife    => "jane",
            son     => "elroy",
        },
        {
            husband => "homer",
            wife    => "marge",
            son     => "bart",
        },
    );

To add another hash to the array, you can simply say:

    push @AoH, { husband => "fred", wife => "wilma", daughter => "pebbles" };

Generation of an Array of Hashes

Here are some techniques for populating an array of hashes.
To read from a file with the following format:

    husband=fred friend=barney

you could use either of the following two loops:

    while ( <> ) {
        $rec = {};
        for $field ( split ) {
            ($key, $value) = split /=/, $field;
            $rec->{$key} = $value;
        }
        push @AoH, $rec;
    }

    while ( <> ) {
        push @AoH, { split /[\s=]+/ };
    }

If you have a subroutine get_next_pair that returns key/value pairs, you can use it to stuff @AoH with either of these two loops:

    while ( @fields = get_next_pair() ) {
        push @AoH, { @fields };
    }

    while (<>) {
        push @AoH, { get_next_pair($_) };
    }

You can append new members to an existing hash like so:

    $AoH[0]{pet} = "dino";
    $AoH[2]{pet} = "santa's little helper";

Access and Printing of an Array of Hashes

You can set a key/value pair of a particular hash as follows:

    $AoH[0]{husband} = "fred";

To capitalize the husband of the second array, apply a substitution:

    $AoH[1]{husband} =~ s/(\w)/\u$1/;

You can print all of the data as follows:

    for $href ( @AoH ) {
        print "{ ";
        for $role ( keys %$href ) {
            print "$role=$href->{$role} ";
        }
        print "}\n";
    }

and with indices:

    for $i ( 0 .. $#AoH ) {
        print "$i is { ";
        for $role ( keys %{ $AoH[$i] } ) {
            print "$role=$AoH[$i]{$role} ";
        }
        print "}\n";
    }

Hashes of Hashes

A multidimensional hash is the most flexible of Perl’s nested structures. It’s like building up a record that itself contains other records. At each level, you index into the hash with a string (quoted when necessary). Remember, however, that the key/value pairs in the hash won’t come out in any particular order; you can use the sort function to retrieve the pairs in whatever order you like.

Composition of a Hash of Hashes

You can create a hash of anonymous hashes as follows:

    %HoH = (
        flintstones => {
            husband => "fred",
            pal     => "barney",
        },
        jetsons => {
            husband   => "george",
            wife      => "jane",
            "his boy" => "elroy",   # Key quotes needed.
        },
        simpsons => {
            husband => "homer",
            wife    => "marge",
            kid     => "bart",
        },
    );

To add another anonymous hash to %HoH, you can simply say:

    $HoH{ mash } = {
        captain  => "pierce",
        major    => "burns",
        corporal => "radar",
    };

Generation of a Hash of Hashes

Here are some techniques for populating a hash of hashes. To read from a file with the following format:

    flintstones: husband=fred pal=barney wife=wilma pet=dino

you could use either of the following two loops:

    while ( <> ) {
        next unless s/^(.*?):\s*//;
        $who = $1;
        for $field ( split ) {
            ($key, $value) = split /=/, $field;
            $HoH{$who}{$key} = $value;
        }
    }

    while ( <> ) {
        next unless s/^(.*?):\s*//;
        $who = $1;
        $rec = {};
        $HoH{$who} = $rec;
        for $field ( split ) {
            ($key, $value) = split /=/, $field;
            $rec->{$key} = $value;
        }
    }

If you have a subroutine get_family that returns a list of key/value pairs, you can use it to stuff %HoH with either of these three snippets:

    for $group ( "simpsons", "jetsons", "flintstones" ) {
        $HoH{$group} = { get_family($group) };
    }

    for $group ( "simpsons", "jetsons", "flintstones" ) {
        @members = get_family($group);
        $HoH{$group} = { @members };
    }

    sub hash_families {
        my @ret;
        for $group ( @_ ) {
            push @ret, $group, { get_family($group) };
        }
        return @ret;
    }
    %HoH = hash_families( "simpsons", "jetsons", "flintstones" );

You can append new members to an existing hash like so:

    %new_folks = (
        wife => "wilma",
        pet  => "dino",
    );

    for $what (keys %new_folks) {
        $HoH{flintstones}{$what} = $new_folks{$what};
    }

Access and Printing of a Hash of Hashes

You can set a key/value pair of a particular hash as follows:

    $HoH{flintstones}{wife} = "wilma";

To capitalize a particular key/value pair, apply a substitution to an element:

    $HoH{jetsons}{'his boy'} =~ s/(\w)/\u$1/;

You can print all the families by looping through the keys of the outer hash and then looping through the keys of the inner hash:

    for $family ( keys %HoH ) {
        print "$family: ";
        for $role ( keys %{ $HoH{$family} } )
        {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }

In very large hashes, it may be slightly faster to retrieve both keys and values at the same time using each (which precludes sorting):

    while ( ($family, $roles) = each %HoH ) {
        print "$family: ";
        while ( ($role, $person) = each %$roles ) {
            print "$role=$person ";
        }
        print "\n";
    }

(Unfortunately, it’s the large hashes that really need to be sorted, or you’ll never find what you’re looking for in the printout.) You can sort the families and then the roles as follows:

    for $family ( sort keys %HoH ) {
        print "$family: ";
        for $role ( sort keys %{ $HoH{$family} } ) {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }

To sort the families by the number of members (instead of ASCIIbetically, or utf8ically), you can use keys in a scalar context:

    for $family ( sort { keys %{$HoH{$a}} <=> keys %{$HoH{$b}} } keys %HoH ) {
        print "$family: ";
        for $role ( sort keys %{ $HoH{$family} } ) {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }

To sort the members of a family in some fixed order, you can assign ranks to each:

    $i = 0;
    for ( qw(husband wife son daughter pal pet) ) { $rank{$_} = ++$i }

    for $family ( sort { keys %{$HoH{$a}} <=> keys %{$HoH{$b}} } keys %HoH ) {
        print "$family: ";
        for $role ( sort { $rank{$a} <=> $rank{$b} } keys %{ $HoH{$family} } ) {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }

Hashes of Functions

When writing a complex application or network service in Perl, you might want to make a large number of commands available to your users. Such a program might have code like this to examine the user’s selection and take appropriate action:
Such a program might have code like this to examine the user's selection and take appropriate action:

if    ($cmd =~ /^exit$/i)   { exit }
elsif ($cmd =~ /^help$/i)   { show_help() }
elsif ($cmd =~ /^watch$/i)  { $watch = 1 }
elsif ($cmd =~ /^mail$/i)   { mail_msg($msg) }
elsif ($cmd =~ /^edit$/i)   { $edited++; editmsg($msg); }
elsif ($cmd =~ /^delete$/i) { confirm_kill() }
else {
    warn "Unknown command: '$cmd'; Try 'help' next time\n";
}

You can also store references to functions in your data structures, just as you can store references to arrays or hashes:

%HoF = (                    # Compose a hash of functions
    exit   => sub { exit },
    help   => \&show_help,
    watch  => sub { $watch = 1 },
    mail   => sub { mail_msg($msg) },
    edit   => sub { $edited++; editmsg($msg); },
    delete => \&confirm_kill,
);

if ($HoF{lc $cmd}) { $HoF{lc $cmd}->() }    # Call function
else { warn "Unknown command: '$cmd'; Try 'help' next time\n" }

In the second to last line, we check whether the specified command name (in lowercase) exists in our "dispatch table", %HoF. If so, we invoke the appropriate command by dereferencing the hash value as a function and pass that function an empty argument list. We could also have dereferenced it as &{ $HoF{lc $cmd} }(), or, as of the 5.6 release of Perl, simply $HoF{lc $cmd}().

More Elaborate Records

So far, what we've seen in this chapter are simple, two-level, homogeneous data structures: each element contains the same kind of referent as all the other elements at that level. It certainly doesn't have to be that way. Any element can hold any kind of scalar, which means that it could be a string, a number, or a reference to anything at all. The reference could be an array or hash reference, or a pseudohash, or a reference to a named or anonymous function, or an object. The only thing you can't do is to stuff multiple referents into one scalar.
If you find yourself trying to do that, it's a sign that you need an array or hash reference to collapse multiple values into one.

In the sections that follow, you will find code examples designed to illustrate many of the possible types of data you might want to store in a record, which we'll implement using a hash reference. The keys are uppercase strings, a convention sometimes employed (and occasionally unemployed, but only briefly) when the hash is being used as a specific record type.

Composition, Access, and Printing of More Elaborate Records

Here is a record with six disparate fields:

$rec = {
    TEXT     => $string,
    SEQUENCE => [ @old_values ],
    LOOKUP   => { %some_table },
    THATCODE => \&some_function,
    THISCODE => sub { $_[0] ** $_[1] },
    HANDLE   => \*STDOUT,
};

The TEXT field is a simple string, so you can just print it:

print $rec->{TEXT};

SEQUENCE and LOOKUP are regular array and hash references:

print $rec->{SEQUENCE}[0];
$last = pop @{ $rec->{SEQUENCE} };
print $rec->{LOOKUP}{"key"};
($first_k, $first_v) = each %{ $rec->{LOOKUP} };

THATCODE is a named subroutine and THISCODE is an anonymous subroutine, but they're invoked identically:

$that_answer = $rec->{THATCODE}->($arg1, $arg2);
$this_answer = $rec->{THISCODE}->($arg1, $arg2);

With an extra pair of braces, you can treat $rec->{HANDLE} as an indirect object:

print { $rec->{HANDLE} } "a string\n";

If you're using the FileHandle module, you can even treat the handle as a regular object:

use FileHandle;
$rec->{HANDLE}->autoflush(1);
$rec->{HANDLE}->print("a string\n");

Composition, Access, and Printing of Even More Elaborate Records

Naturally, the fields of your data structures can themselves be arbitrarily complex data structures in their own right:

%TV = (
    flintstones => {
        series  => "flintstones",
        nights  => [ "monday", "thursday", "friday" ],
        members => [
            { name => "fred",    role => "husband", age => 36, },
            { name => "wilma",   role => "wife",    age => 31, },
            { name => "pebbles", role => "kid",     age => 4,  },
        ],
    },

    jetsons => {
        series  => "jetsons",
        nights  => [ "wednesday", "saturday" ],
        members => [
            { name => "george", role => "husband", age => 41, },
            { name => "jane",   role => "wife",    age => 39, },
            { name => "elroy",  role => "kid",     age => 9,  },
        ],
    },

    simpsons => {
        series  => "simpsons",
        nights  => [ "monday" ],
        members => [
            { name => "homer", role => "husband", age => 34, },
            { name => "marge", role => "wife",    age => 37, },
            { name => "bart",  role => "kid",     age => 11, },
        ],
    },
);

Generation of a Hash of Complex Records

Because Perl is quite good at parsing complex data structures, you might just put your data declarations in a separate file as regular Perl code, and then load them in with the do or require built-in functions. Another popular approach is to use a CPAN module (such as XML::Parser) to load in arbitrary data structures expressed in some other language (such as XML).

You can build data structures piecemeal:

$rec = {};
$rec->{series} = "flintstones";
$rec->{nights} = [ find_days() ];

Or read them in from a file (here, assumed to be in field=value syntax):

@members = ();
while (<>) {
    %fields = split /[\s=]+/;
    push @members, { %fields };
}
$rec->{members} = [ @members ];

And fold them into larger data structures keyed by one of the subfields:

$TV{ $rec->{series} } = $rec;

You can use extra pointer fields to avoid duplicate data. For example, you might want a "kids" field included in a person's record, which might be a reference to an array containing references to the kids' own records. By having parts of your data structure refer to other parts, you avoid the data skew that would result from updating the data in one place but not in another:

for $family (keys %TV) {
    my $rec = $TV{$family};    # temporary pointer
    @kids = ();
    for $person ( @{$rec->{members}} ) {
        if ($person->{role} =~ /kid|son|daughter/) {
            push @kids, $person;
        }
    }
    # $rec and $TV{$family} point to same data!
    $rec->{kids} = [ @kids ];
}

The $rec->{kids} = [ @kids ] assignment copies the array contents—but they are merely references to uncopied data. This means that if you age Bart as follows:

$TV{simpsons}{kids}[0]{age}++;    # increments to 12

then you'll see the following result, because $TV{simpsons}{kids}[0] and $TV{simpsons}{members}[2] both point to the same underlying anonymous hash table:

print $TV{simpsons}{members}[2]{age};    # also prints 12

Now, to print the entire %TV structure:

for $family ( keys %TV ) {
    print "the $family";
    print " is on ", join (" and ", @{ $TV{$family}{nights} }), "\n";
    print "its members are:\n";
    for $who ( @{ $TV{$family}{members} } ) {
        print " $who->{name} ($who->{role}), age $who->{age}\n";
    }
    print "children: ";
    print join (", ", map { $_->{name} } @{ $TV{$family}{kids} } );
    print "\n\n";
}

Saving Data Structures

If you want to save your data structures for use by another program later, there are many ways to do it. The easiest way is to use Perl's Data::Dumper module, which turns a (possibly self-referential) data structure into a string that can be saved externally and later reconstituted with eval or do.

use Data::Dumper;
$Data::Dumper::Purity = 1;    # since %TV is self-referential
open (FILE, "> tvinfo.perldata") or die "can't open tvinfo: $!";
print FILE Data::Dumper->Dump([\%TV], ['*TV']);
close FILE or die "can't close tvinfo: $!";

A separate program (or the same program) can then read in the file later:

open (FILE, "< tvinfo.perldata") or die "can't open tvinfo: $!";
undef $/;       # read in file all at once
eval <FILE>;    # recreate %TV
die "can't recreate tv data from tvinfo.perldata: $@" if $@;
close FILE or die "can't close tvinfo: $!";
print $TV{simpsons}{members}[2]{age};

or simply:

do "tvinfo.perldata" or die "can't recreate tvinfo: $! $@";
print $TV{simpsons}{members}[2]{age};

Many other solutions are available, with storage formats ranging from packed binary (very fast) to XML (very interoperable).
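One of those packed-binary options is the standard Storable module (bundled with recent Perls, and on CPAN before that). Here is a minimal sketch of saving and reloading a nested structure with it; the filename tvinfo.storable and the cut-down %TV data are our own inventions for this example:

```perl
# Save and reload a nested structure in packed binary form with Storable.
use Storable qw(store retrieve);

my %TV = (
    simpsons => {
        series  => "simpsons",
        nights  => [ "monday" ],
        members => [ { name => "homer", role => "husband", age => 34 } ],
    },
);

store(\%TV, "tvinfo.storable") or die "can't store tvinfo: $!";

my $tv = retrieve("tvinfo.storable");    # returns a hash reference
print $tv->{simpsons}{members}[0]{name}, "\n";    # prints "homer"
```

Storable's format is binary and not meant for human eyes, so Data::Dumper remains the handier choice when you want to read or debug the saved data.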
Check out a CPAN mirror near you today!

10 Packages

In this chapter, we get to start having fun, because we get to start talking about software design. If we're going to talk about good software design, we have to talk about Laziness, Impatience, and Hubris, the basis of good software design.

We've all fallen into the trap of using cut-and-paste when we should have defined a higher-level abstraction, if only just a loop or subroutine.* To be sure, some folks have gone to the opposite extreme of defining ever-growing mounds of higher-level abstractions when they should have used cut-and-paste.† Generally, though, most of us need to think about using more abstraction rather than less. Caught somewhere in the middle are the people who have a balanced view of how much abstraction is good, but who jump the gun on writing their own abstractions when they should be reusing existing code.‡

Whenever you're tempted to do any of these things, you need to sit back and think about what will do the most good for you and your neighbor over the long haul. If you're going to pour your creative energies into a lump of code, why not make the world a better place while you're at it? (Even if you're only aiming for the program to succeed, you need to make sure it fits the right ecological niche.)

The first step toward ecologically sustainable programming is simply this: don't litter in the park. When you write a chunk of code, think about giving the code its own namespace, so that your variables and functions don't clobber anyone else's, or vice versa. A namespace is a bit like your home, where you're allowed to be as messy as you like, as long as you keep your external interface to other citizens moderately civil. In Perl, a namespace is called a package.

* This is a form of False Laziness.
† This is a form of False Hubris.
‡ You guessed it—this is False Impatience. But if you're determined to reinvent the wheel, at least try to invent a better one.
Packages provide the fundamental building block upon which the higher-level concepts of modules and classes are constructed. Like the notion of “home”, the notion of “package” is a bit nebulous. Packages are independent of files. You can have many packages in a single file, or a single package that spans several files, just as your home could be one small garret in a larger building (if you’re a starving artist), or it could comprise several buildings (if your name happens to be Queen Elizabeth). But the usual size of a home is one building, and the usual size of a package is one file. Perl provides some special help for people who want to put one package in one file, as long as you’re willing to give the file the same name as the package and use an extension of .pm, which is short for “perl module”. The module is the fundamental unit of reusability in Perl. Indeed, the way you use a module is with the use command, which is a compiler directive that controls the importation of subroutines and variables from a module. Every example of use you’ve seen until now has been an example of module reuse. The Comprehensive Perl Archive Network, or CPAN, is where you should put your modules if other people might find them useful. Perl has thrived because of the willingness of programmers to share the fruits of their labor with the community. Naturally, CPAN is also where you can find modules that others have thoughtfully uploaded for everyone to use. See Chapter 22, CPAN, and www.cpan.org for details. The trend over the last 25 years or so has been to design computer languages that enforce a state of paranoia. You’re expected to program every module as if it were in a state of siege. Certainly there are some feudal cultures where this is appropriate, but not all cultures are like this. 
In Perl culture, for instance, you're expected to stay out of someone's home because you weren't invited in, not because there are bars on the windows.*

This is not a book about object-oriented methodology, and we're not here to convert you into a raving object-oriented zealot, even if you want to be converted. There are already plenty of books out there for that. Perl's philosophy of object-oriented design fits right in with Perl's philosophy of everything else: use object-oriented design where it makes sense, and avoid it where it doesn't. Your call.

In OO-speak, every object belongs to a grouping called a class. In Perl, classes and packages and modules are all so closely related that novices can often think of them as being interchangeable. The typical class is implemented by a module that defines a package with the same name as the class. We'll explain all of this in the next few chapters.

When you use a module, you benefit from direct software reuse. With classes, you benefit from indirect software reuse when one class uses another through inheritance. And with classes, you get something more: a clean interface to another namespace. Everything in a class is accessed indirectly, insulating the class from the outside world.

As we mentioned in Chapter 8, References, object-oriented programming in Perl is accomplished through references whose referents know which class they belong to. In fact, now that you know about references, you know almost everything difficult about objects. The rest of it just "lays under the fingers", as a pianist would say. You will need to practice a little, though.

One of your basic finger exercises consists of learning how to protect different chunks of code from inadvertently tampering with each other's variables.

* But Perl provides some bars if you want them, too. See "Handling Insecure Code" in Chapter 23, Security.
Every chunk of code belongs to a particular package, which determines what variables and subroutines are available to it. As Perl encounters a chunk of code, that code is compiled into what we call the current package. The initial current package is called "main", but you can switch the current package to another one at any time with the package declaration. The current package determines which symbol table is used to find your variables, subroutines, I/O handles, and formats.

Any variable not declared with my is associated with a package—even seemingly omnipresent variables like $_ and %SIG. In fact, there's really no such thing as a global variable in Perl, just package variables. (Special identifiers like _ and SIG merely seem global because they default to the main package instead of the current one.)

The scope of a package declaration is from the declaration itself through the end of the enclosing scope (block, file, or eval—whichever comes first) or until another package declaration at the same level, which supersedes the earlier one. (This is a common practice.) All subsequent identifiers (including those declared with our, but not including those declared with my or those qualified with a different package name) will be placed in the symbol table belonging to the current package. (Variables declared with my are independent of packages; they are always visible within, and only within, their enclosing scope, regardless of any package declarations.)

Typically, a package declaration will be the first statement of a file meant to be included by require or use. But again, that's by convention. You can put a package declaration anywhere you can put a statement. You could even put it at the end of a block, in which case it would have no effect whatsoever. You can switch into a package in more than one place; a package declaration merely selects the symbol table to be used by the compiler for the rest of that block. (This is how a given package can span more than one file.)

You can refer to identifiers* in other packages by prefixing ("qualifying") the identifier with the package name and a double colon: $Package::Variable. If the package name is null, the main package is assumed. That is, $::sail is equivalent to $main::sail.†

The old package delimiter was a single quote, so in old Perl programs you'll see variables like $main'sail and $somepack'horse. But the double colon is now the preferred delimiter, in part because it's more readable to humans, and in part because it's more readable to emacs macros. It also makes C++ programmers feel like they know what's going on—as opposed to using the single quote as the separator, which was there to make Ada programmers feel like they knew what's going on. Because the old-fashioned syntax is still supported for backward compatibility, if you try to use a string like "This is $owner's house", you'll be accessing $owner::s; that is, the $s variable in package owner, which is probably not what you meant. Use braces to disambiguate, as in "This is ${owner}'s house".

The double colon can be used to chain together identifiers in a package name: $Red::Blue::var. This means the $var belonging to the Red::Blue package. The Red::Blue package has nothing to do with any Red or Blue packages that might happen to exist. That is, a relationship between Red::Blue and Red or Blue may have meaning to the person writing or using the program, but it means nothing to Perl. (Well, other than the fact that, in the current implementation, the symbol table Red::Blue happens to be stored in the symbol table Red. But the Perl language makes no use of that directly.)

For this reason, every package declaration must declare a complete package name. No package name ever assumes any kind of implied "prefix", even if (seemingly) declared within the scope of some other package declaration.
* By identifiers, we mean the names used as symbol table keys for accessing scalar variables, array variables, hash variables, subroutines, file or directory handles, and formats. Syntactically speaking, labels are also identifiers, but they aren't put into a particular symbol table; rather, they are attached directly to the statements in your program. Labels cannot be package qualified.
† To clear up another bit of potential confusion, in a variable name like $main::sail, we use the term "identifier" to talk about main and sail, but not main::sail. We call that a variable name instead, because identifiers cannot contain colons.

Only identifiers (names starting with letters or an underscore) are stored in a package's symbol table. All other symbols are kept in the main package, including all the nonalphabetic variables, like $!, $?, and $_. In addition, when unqualified, the identifiers STDIN, STDOUT, STDERR, ARGV, ARGVOUT, ENV, INC, and SIG are forced to be in package main, even when used for other purposes than their built-in ones. Don't name your package m, s, y, tr, q, qq, qr, qw, or qx unless you're looking for a lot of trouble. For instance, you won't be able to use the qualified form of an identifier as a filehandle because it will be interpreted instead as a pattern match, a substitution, or a transliteration.

Long ago, variables beginning with an underscore were forced into the main package, but we decided it was more useful for package writers to be able to use a leading underscore to indicate semi-private identifiers meant for internal use by that package only. (Truly private variables can be declared as file-scoped lexicals, but that works best when the package and module have a one-to-one relationship, which is common but not required.)

The %SIG hash (which is for trapping signals; see Chapter 16, Interprocess Communication) is also special.
If you define a signal handler as a string, it's assumed to refer to a subroutine in the main package unless another package name is explicitly used. Use a fully qualified signal handler name if you want to specify a particular package, or avoid strings entirely by assigning a typeglob or a function reference instead:

$SIG{QUIT} = "Pkg::quit_catcher";    # fully qualified handler name
$SIG{QUIT} = "quit_catcher";         # implies "main::quit_catcher"
$SIG{QUIT} = *quit_catcher;          # forces current package's sub
$SIG{QUIT} = \&quit_catcher;         # forces current package's sub
$SIG{QUIT} = sub { print "Caught SIGQUIT\n" };    # anonymous sub

The notion of "current package" is both a compile-time and run-time concept. Most variable name lookups happen at compile time, but run-time lookups happen when symbolic references are dereferenced, and also when new bits of code are parsed under eval. In particular, when you eval a string, Perl knows which package the eval was invoked in and propagates that package inward when evaluating the string. (You can always switch to a different package inside the eval string, of course, since an eval string counts as a block, just like a file loaded in with do, require, or use.) Alternatively, if an eval wants to find out what package it's in, the special symbol __PACKAGE__ contains the current package name. Since you can treat it as a string, you could use it in a symbolic reference to access a package variable. But if you were doing that, chances are you should have declared the variable with our instead so it could be accessed as if it were a lexical.

Symbol Tables

The contents of a package are collectively called a symbol table. Symbol tables are stored in a hash whose name is the same as the package, but with two colons appended. The main symbol table's name is thus %main::. Since main also happens to be the default package, Perl provides %:: as an abbreviation for %main::.
Likewise, the symbol table for the Red::Blue package is named %Red::Blue::. As it happens, the main symbol table contains all other top-level symbol tables, including itself, so %Red::Blue:: is also %main::Red::Blue::.

When we say that a symbol table "contains" another symbol table, we mean that it contains a reference to the other symbol table. Since main is the top-level package, it contains a reference to itself, with the result that %main:: is the same as %main::main::, and %main::main::main::, and so on, ad infinitum. It's important to check for this special case if you write code that traverses all symbol tables.

Inside a symbol table's hash, each key/value pair matches a variable name to its value. The keys are the symbol identifiers, and the values are the corresponding typeglobs. So when you use the *NAME typeglob notation, you're really just accessing a value in the hash that holds the current package's symbol table. In fact, the following have (nearly) the same effect:

*sym = *main::variable;
*sym = $main::{"variable"};

The first is more efficient because the main symbol table is accessed at compile time. It will also create a new typeglob by that name if none previously exists, whereas the second form will not.

Since a package is a hash, you can look up the keys of the package and get to all the variables of the package. Since the values of the hash are typeglobs, you can dereference them in several ways. Try this:

foreach $symname (sort keys %main::) {
    local *sym = $main::{$symname};
    print "\$$symname is defined\n" if defined $sym;
    print "\@$symname is nonnull\n" if @sym;
    print "\%$symname is nonnull\n" if %sym;
}

Since all packages are accessible (directly or indirectly) through the main package, you can write Perl code to visit every package variable in your program. The Perl debugger does precisely that when you ask it to dump all your variables with the V command.
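To see why that self-reference matters, here is one hedged sketch of such a traversal (the subroutine name walk_symtab is our own invention, not a standard routine). It guards on each table's address rather than its name, since %main:: appears inside itself under the key "main::":

```perl
# Count the named symbols in every package, guarding against the
# fact that %main:: contains a reference to itself.
sub walk_symtab {
    my ($name, $tab, $seen, $count) = @_;
    return if $seen->{"$tab"}++;           # already visited this table
    for my $sym (keys %$tab) {
        if ($sym =~ /::\z/) {              # a nested symbol table
            no strict 'refs';
            walk_symtab("$name$sym", \%{"$name$sym"}, $seen, $count);
        }
        else {
            $$count++;                     # an ordinary symbol
        }
    }
}

my $total = 0;
walk_symtab("main::", \%main::, {}, \$total);
print "visited $total symbols\n";
```

Stringifying the hash reference ("$tab") yields a unique per-table address, so %main::main:: is recognized as %main:: already seen and the recursion terminates.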
Note that if you do this, you won't see variables declared with my since those are independent of packages, although you will see variables declared with our. See Chapter 20, The Perl Debugger.

Earlier we said that only identifiers are stored in packages other than main. That was a bit of a fib: you can use any string you want as the key in a symbol table hash—it's just that it wouldn't be valid Perl if you tried to use a non-identifier directly:

$!@#$%              = 0;    # WRONG, syntax error.
${'!@#$%'}          = 1;    # Ok, though unqualified.
${'main::!@#$%'}    = 2;    # Can qualify within the string.
print ${ $main::{'!@#$%'} };    # Ok, prints 2!

Assignment to a typeglob performs an aliasing operation; that is,

*dick = *richard;

causes variables, subroutines, formats, and file and directory handles accessible via the identifier richard to also be accessible via the symbol dick. If you want to alias only a particular variable or subroutine, assign a reference instead:

*dick = \$richard;

That makes $richard and $dick the same variable, but leaves @richard and @dick as separate arrays. Tricky, eh?

This is how the Exporter works when importing symbols from one package to another. For example:

*SomePack::dick = \&OtherPack::richard;

imports the &richard function from package OtherPack into SomePack, making it available as the &dick function. (The Exporter module is described in the next chapter.) If you precede the assignment with a local, the aliasing will only last as long as the current dynamic scope.

This mechanism may be used to retrieve a reference from a subroutine, making the referent available as the appropriate data type:

*units = populate();    # Assign \%newhash to the typeglob
print $units{kg};       # Prints 70; no dereferencing needed!
sub populate {
    my %newhash = (km => 10, kg => 70);
    return \%newhash;
}

Likewise, you can pass a reference into a subroutine and use it without dereferencing:

%units = (miles => 6, stones => 11);
fillerup( \%units );     # Pass in a reference
print $units{quarts};    # Prints 4

sub fillerup {
    local *hashsym = shift;    # Assign \%units to the typeglob
    $hashsym{quarts} = 4;      # Affects %units; no dereferencing needed!
}

These are tricky ways to pass around references cheaply when you don't want to have to explicitly dereference them. Note that both techniques only work with package variables; they would not have worked had we declared %units with my.

Another use of symbol tables is for making "constant" scalars:

*PI = \3.14159265358979;

Now you cannot alter $PI, which is probably a good thing, all in all. This isn't the same as a constant subroutine, which is optimized at compile time. A constant subroutine is one prototyped to take no arguments and to return a constant expression; see the section "Inlining Constant Functions" in Chapter 6, Subroutines, for details. The use constant pragma (see Chapter 31, Pragmatic Modules) is a convenient shorthand:

use constant PI => 3.14159;

Under the hood, this uses the subroutine slot of *PI, instead of the scalar slot used earlier. It's equivalent to the more compact (but less readable):

*PI = sub () { 3.14159 };

That's a handy idiom to know anyway—assigning a sub {} to a typeglob is the way to give a name to an anonymous subroutine at run time.

Assigning a typeglob reference to another typeglob (*sym = \*oldvar) is the same as assigning the entire typeglob, because Perl automatically dereferences the typeglob reference for you. And when you set a typeglob to a simple string, you get the entire typeglob named by that string, because Perl looks up the string in the current symbol table.
The following are all equivalent to one another, though the first two compute the symbol table entry at compile time, while the last two do so at run time:

*sym = *oldvar;
*sym = \*oldvar;       # autodereference
*sym = *{"oldvar"};    # explicit symbol table lookup
*sym = "oldvar";       # implicit symbol table lookup

When you perform any of the following assignments, you're replacing just one of the references within the typeglob:

*sym = \$frodo;
*sym = \@sam;
*sym = \%merry;
*sym = \&pippin;

If you think about it sideways, the typeglob itself can be viewed as a kind of hash, with entries for the different variable types in it. In this case, the keys are fixed, since a typeglob can contain exactly one scalar, one array, one hash, and so on. But you can pull out the individual references, like this:

*pkg::sym{SCALAR}     # same as \$pkg::sym
*pkg::sym{ARRAY}      # same as \@pkg::sym
*pkg::sym{HASH}       # same as \%pkg::sym
*pkg::sym{CODE}       # same as \&pkg::sym
*pkg::sym{GLOB}       # same as \*pkg::sym
*pkg::sym{IO}         # internal file/dir handle, no direct equivalent
*pkg::sym{NAME}       # "sym" (not a reference)
*pkg::sym{PACKAGE}    # "pkg" (not a reference)

You can say *foo{PACKAGE} and *foo{NAME} to find out what name and package the *foo symbol table entry comes from. This may be useful in a subroutine that is passed typeglobs as arguments:

sub identify_typeglob {
    my $glob = shift;
    print 'You gave me ', *{$glob}{PACKAGE}, '::', *{$glob}{NAME}, "\n";
}
identify_typeglob(*foo);
identify_typeglob(*bar::glarch);

This prints:

You gave me main::foo
You gave me bar::glarch

The *foo{THING} notation can be used to obtain references to individual elements of *foo. See the section "Symbol Table References" in Chapter 8 for details. This syntax is primarily used to get at the internal filehandle or directory handle reference, because the other internal references are already accessible in other ways.
(The old *foo{FILEHANDLE} is still supported to mean *foo{IO}, but don't let its name fool you into thinking it can distinguish filehandles from directory handles.) But we thought we'd generalize it because it looks kind of pretty. Sort of. You probably don't need to remember all this unless you're planning to write another Perl debugger.

Autoloading

Normally, you can't call a subroutine that isn't defined. However, if there is a subroutine named AUTOLOAD in the undefined subroutine's package (or in the case of an object method, in the package of any of the object's base classes), then the AUTOLOAD subroutine is called with the same arguments that would have been passed to the original subroutine. You can define the AUTOLOAD subroutine to return values just like a regular subroutine, or you can make it define the routine that didn't exist and then call that as if it'd been there all along.

The fully qualified name of the original subroutine magically appears in the package-global $AUTOLOAD variable, in the same package as the AUTOLOAD routine. Here's a simple example that gently warns you about undefined subroutine invocations instead of exiting:

sub AUTOLOAD {
    our $AUTOLOAD;
    warn "Attempt to call $AUTOLOAD failed.\n";
}
blarg(10);    # our $AUTOLOAD will be set to main::blarg
print "Still alive!\n";

Or you can return a value on behalf of the undefined subroutine:

sub AUTOLOAD {
    our $AUTOLOAD;
    return "I see $AUTOLOAD(@_)\n";
}
print blarg(20);    # prints: I see main::blarg(20)

Your AUTOLOAD subroutine might load a definition for the undefined subroutine using eval or require, or use the glob assignment trick discussed earlier, and then execute that subroutine using the special form of goto that can erase the stack frame of the AUTOLOAD routine without a trace.
Here we define the subroutine by assigning a closure to the glob:

sub AUTOLOAD {
    my $name = our $AUTOLOAD;
    *$AUTOLOAD = sub { print "I see $name(@_)\n" };
    goto &$AUTOLOAD;    # Restart the new routine.
}
blarg(30);    # prints: I see main::blarg(30)
glarb(40);    # prints: I see main::glarb(40)
blarg(50);    # prints: I see main::blarg(50)

The standard AutoSplit module is used by module writers to split their modules into separate files (with filenames ending in .al), each holding one routine. The files are placed in the auto/ directory of your system's Perl library, after which the files can be autoloaded on demand by the standard AutoLoader module.

A similar approach is taken by the SelfLoader module, except that it autoloads functions from the file's own DATA area, which is less efficient in some ways and more efficient in others. Autoloading of Perl functions by AutoLoader and SelfLoader is analogous to dynamic loading of compiled C functions by DynaLoader, except that autoloading is done at the granularity of the function call, whereas dynamic loading is done at the granularity of the complete module, and will usually link in many C or C++ functions all at once. (Note that many Perl programmers get along just fine without the AutoSplit, AutoLoader, SelfLoader, or DynaLoader modules. You just need to know that they're there, in case you can't get along just fine without them.)

One can have great fun with AUTOLOAD routines that serve as wrappers to other interfaces. For example, let's pretend that any function that isn't defined should just call system with its arguments. All you'd do is this:

sub AUTOLOAD {
    my $program = our $AUTOLOAD;
    $program =~ s/.*:://;    # trim package name
    system($program, @_);
}

(Congratulations, you've now implemented a rudimentary form of the Shell module that comes standard with Perl.)
You can call your autoloader (on Unix) like this:

    date();
    who('am', 'i');
    ls('-l');
    echo("Abadugabudabuda...");

In fact, if you predeclare the functions you want to call that way, you can pretend they’re built-ins and omit the parentheses on the call:

    sub date (;$$);    # Allow date;                       zero to two arguments.
    sub who (;$$$$);   # Allow who "am", "i";              zero to four args.
    sub ls;            # Allow ls "-l";                    any number of args.
    sub echo ($@);     # Allow echo "That's all, folks!";  at least one arg.

11
Modules

The module is the fundamental unit of code reuse in Perl. Under the hood, it’s just a package defined in a file of the same name (with .pm on the end). In this chapter, we’ll explore how you can use other people’s modules and create your own.

Perl comes bundled with a large number of modules, which you can find in the lib directory of your Perl distribution. Many of those modules are described in Chapter 32, Standard Modules, and Chapter 31, Pragmatic Modules. All the standard modules also have extensive online documentation, which (horrors) may be more up-to-date than this book. Try the perldoc command if your man command doesn’t work.

The Comprehensive Perl Archive Network (CPAN) contains a worldwide repository of modules contributed by the Perl community, and is discussed in Chapter 22, CPAN. See also http://www.cpan.org.

Using Modules

Modules come in two flavors: traditional and object-oriented. Traditional modules define subroutines and variables for the caller to import and use. Object-oriented modules function as class definitions and are accessed through method calls, described in Chapter 12, Objects. Some modules do both.

Perl modules are typically included in your program by saying:

    use MODULE LIST;

or just:

    use MODULE;

MODULE must be an identifier naming the module’s package and file. (The syntax descriptions here are meant to be suggestive; the full syntax of the use statement is given in Chapter 29, Functions.)
The use statement does a preload of MODULE at compile time and then an import of the symbols you’ve requested so that they’ll be available for the rest of the compilation. If you do not supply a LIST of symbols that you want, the symbols named in the module’s internal @EXPORT array are used — assuming you’re using the Exporter module, described in “Module Privacy and the Exporter” later in this chapter. (If you do supply a LIST, all your symbols must be mentioned in the module’s @EXPORT or @EXPORT_OK arrays, or an error will result.)

Since modules use the Exporter to import symbols into the current package, you can use symbols from the module without providing a package qualifier:

    use Fred;        # If Fred.pm has @EXPORT = qw(flintstone)
    flintstone();    # ...this calls Fred::flintstone().

All Perl module files have the extension .pm. Both use and require assume this (as well as the quotes) so that you don’t have to spell out "MODULE.pm". Using the bare identifier helps to differentiate new modules from .pl and .ph libraries used in old versions of Perl. It also introduces MODULE as an official module name, which helps the parser in certain ambiguous situations. Any double colons in the module name are translated into your system’s directory separator, so if your module is named Red::Blue::Green, Perl might look for it as Red/Blue/Green.pm.

Perl will search for modules in each of the directories listed in the @INC array. Since use loads modules at compile time, any modifications to @INC need to occur at compile time as well. You can do this with the lib pragma described in Chapter 31 or with a BEGIN block. Once a module is included, a key/value pair will be added to the %INC hash. The key will be the module filename (Red/Blue/Green.pm in our example) and the value will be the full pathname, which might be something like C:/perl/site/lib/Red/Blue/Green.pm for a properly installed module on a Windows system.
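Because use happens at compile time, any change to @INC must also happen at compile time, and %INC can be inspected afterward to see what was loaded from where. Here is a minimal sketch (the /opt/perl/extra directory is purely illustrative, not a real convention):

```perl
# Prepend a search directory at compile time; the 'use lib' pragma does
# the same thing (plus a little extra bookkeeping) and is the usual idiom.
BEGIN { unshift @INC, "/opt/perl/extra" }    # hypothetical directory

# Load a standard module, then see how %INC records it:
use File::Basename;
print "File/Basename.pm => $INC{'File/Basename.pm'}\n";
```

The printed value is the full path where Perl actually found the module, so %INC doubles as a record of which copy of a module your program is really using.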
Module names should be capitalized unless they’re functioning as pragmas. Pragmas are in effect compiler directives (hints for the compiler), so we reserve the lowercase pragma names for future use.

When you use a module, any code inside the module is executed, just as it would be for an ordinary require. If you really don’t care whether the module is pulled in at compile time or run time, you can just say:

    require MODULE;

In general, however, use is preferred over require because it looks for modules during compilation, so you learn about any mistakes sooner.

These two statements do almost the same thing:

    require MODULE;
    require "MODULE.pm";

They differ in two ways, however. In the first statement, require translates any double colons in the module name into your system’s directory separator, just as use does. The second case does no translation, forcing you to specify the pathname of your module literally, which is less portable. The other difference is that the first require tells the compiler that expressions with indirect object notation involving “MODULE” (such as $ob = purge MODULE) are method calls, not function calls. (Yes, this really can make a difference, if there’s a conflicting definition of purge in your own module.)

Because the use declaration and the related no declaration imply a BEGIN block, the compiler loads the module (and runs any executable initialization code in it) as soon as it encounters that declaration, before it compiles the rest of the file. This is how pragmas can change the compiler’s behavior, and also how modules are able to declare subroutines that are then visible as list operators for the remainder of compilation. This will not work if you use require instead of use. Just about the only reason to use require is if you have two modules that each need a function from the other. (And we’re not sure that’s a good reason.)
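The compile-time effect of use (via its implicit BEGIN block) is easy to observe in a small self-contained sketch that records when each phase runs:

```perl
# A BEGIN block runs as soon as it is compiled -- the same mechanism
# that makes 'use MODULE' equivalent to BEGIN { require MODULE; import MODULE }.
our @order;
push @order, "run time";               # executes second, when the program runs
BEGIN { push @order, "compile time" }  # executes first, during compilation
print "$_\n" for @order;               # prints "compile time" then "run time"
```

Even though the BEGIN block appears textually after the push, it runs first, which is exactly why a use statement can affect the compilation of everything that follows it while a require cannot.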
Perl modules always load a .pm file, but that file may in turn load associated files, such as dynamically linked C or C++ libraries or autoloaded Perl subroutine definitions. If so, the additional shenanigans will be entirely transparent to the module user. It is the responsibility of the .pm file to load (or arrange to autoload) any additional functionality. The POSIX module happens to perform both dynamic loading and autoloading, but the user can say just:

    use POSIX;

to get all the exported functions and variables.

Creating Modules

Earlier, we said that there are two ways for a module to make its interface available to your program: by exporting symbols or by allowing method calls. We’ll show you an example of the first style here; the second style is for object-oriented modules and is described in the next chapter. (Object-oriented modules should export nothing, since the whole idea of methods is that Perl finds them for you automatically, based on the type of the object.)

To construct a module called Bestiary, create a file called Bestiary.pm that looks like this:

    package Bestiary;
    require Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT    = qw(camel);     # Symbols to be exported by default
    our @EXPORT_OK = qw($weight);   # Symbols to be exported on request
    our $VERSION   = 1.00;          # Version number

    ### Include your variables and functions here

    sub camel { print "One-hump dromedary" }

    $weight = 1024;

    1;

A program can now say use Bestiary to be able to access the camel function (but not the $weight variable), and use Bestiary qw(camel $weight) to access both the function and the variable.

You can also create modules that dynamically load code written in C. See Chapter 21, Internals and Externals, for details.

Module Privacy and the Exporter

Perl does not automatically patrol private/public borders within its modules — unlike languages such as C++, Java, and Ada, Perl isn’t obsessed with enforced privacy.
A Perl module would prefer that you stay out of its living room because you weren’t invited, not because it has a shotgun. The module and its user have a contract, part of which is common law and part of which is written. Part of the common law contract is that a module refrain from changing any namespace it wasn’t asked to change. The written contract for the module (that is, the documentation) may make other provisions. But then, having read the contract, you presumably know that when you say use RedefineTheWorld you’re redefining the world, and you’re willing to risk the consequences. The most common way to redefine worlds is to use the Exporter module. As we’ll see later in the chapter, you can even redefine built-ins with this module.

When you use a module, the module typically makes some variables or functions available to your program, or more specifically, to your program’s current package. This act of exporting symbols from the module (and thus importing them into your program) is sometimes called polluting your namespace. Most modules use Exporter to do this; that’s why most modules say something like this near the top:

    require Exporter;
    our @ISA = ("Exporter");

These two lines make the module inherit from the Exporter class. Inheritance is described in the next chapter, but all you need to know is that our Bestiary module can now export symbols into other packages with lines like these:

    our @EXPORT      = qw($camel %wolf ram);     # Export by default
    our @EXPORT_OK   = qw(leopard @llama $emu);  # Export by request
    our %EXPORT_TAGS = (                         # Export as group
        camelids => [qw($camel @llama)],
        critters => [qw(ram $camel %wolf)],
    );

From the viewpoint of the exporting module, the @EXPORT array contains the names of variables and functions to be exported by default: what your program gets when it says use Bestiary. Variables and functions in @EXPORT_OK are exported only when the program specifically requests them in the use statement.
Finally, the key/value pairs in %EXPORT_TAGS allow the program to include particular groups of the symbols listed in @EXPORT and @EXPORT_OK. From the viewpoint of the importing package, the use statement specifies a list of symbols to import, a group named in %EXPORT_TAGS, a pattern of symbols, or nothing at all, in which case the symbols in @EXPORT would be imported from the module into your program.

You can include any of these statements to import symbols from the Bestiary module:

    use Bestiary;                          # Import @EXPORT symbols
    use Bestiary ();                       # Import nothing
    use Bestiary qw(ram @llama);           # Import the ram function and @llama array
    use Bestiary qw(:camelids);            # Import $camel and @llama
    use Bestiary qw(:DEFAULT);             # Import @EXPORT symbols
    use Bestiary qw(/am/);                 # Import $camel, @llama, and ram
    use Bestiary qw(/^\$/);                # Import all scalars
    use Bestiary qw(:critters !ram);       # Import the critters, but exclude ram
    use Bestiary qw(:critters !:camelids); # Import critters, but no camelids

Leaving a symbol off the export lists (or removing it explicitly from the import list with the exclamation point) does not render it inaccessible to the program using the module. The program will always be able to access the contents of the module’s package by fully qualifying the package name, like %Bestiary::gecko. (Since lexical variables do not belong to packages, privacy is still possible: see “Private Methods” in the next chapter.) You can say BEGIN { $Exporter::Verbose=1 } to see how the specifications are being processed and what is actually being imported into your package.

The Exporter is itself a Perl module, and if you’re curious you can see the typeglob trickery it uses to export symbols from one package into another. Inside the Exporter module, the key function is named import, which performs the necessary aliasing to make a symbol in one package appear to be in another.
In fact, a use Bestiary LIST statement is exactly equivalent to:

    BEGIN { require Bestiary; import Bestiary LIST; }

This means that your modules don’t have to use the Exporter. A module can do anything it jolly well pleases when it’s used, since use just calls the ordinary import method for the module, and you can define that method to do anything you like.

Exporting without using Exporter’s import method

The Exporter defines a method called export_to_level, used for situations where (for some reason) you can’t directly call Exporter’s import method. The export_to_level method is invoked like this:

    MODULE->export_to_level($where_to_export, @what_to_export);

where $where_to_export is an integer indicating how far up the calling stack to export your symbols, and @what_to_export is an array listing the symbols to export (usually @_). For example, suppose our Bestiary had an import function of its own:

    package Bestiary;
    @ISA = qw(Exporter);
    @EXPORT_OK = qw($zoo);

    sub import {
        $Bestiary::zoo = "menagerie";
    }

The presence of this import function prevents Exporter’s import function from being inherited. If you want Bestiary’s import function to behave just like Exporter’s import function once it sets $Bestiary::zoo, you’d define it as follows:

    sub import {
        $Bestiary::zoo = "menagerie";
        Bestiary->export_to_level(1, @_);
    }

This exports symbols to the package one level “above” the current package. That is, to whatever program or module is using the Bestiary.

Version checking

If your module defines a $VERSION variable, a program using your module can ensure that the module is sufficiently recent. For example:

    use Bestiary 3.14;     # The Bestiary must be version 3.14 or later
    use Bestiary v1.0.4;   # The Bestiary must be version 1.0.4 or later

These are converted into calls to Bestiary->require_version, which your module then inherits.

Managing unknown symbols

In some situations, you may want to prevent certain symbols from being exported.
Typically, this applies to modules that have functions or constants that might not make sense on some systems. You can prevent the Exporter from exporting those symbols by placing them in the @EXPORT_FAIL array.

If a program attempts to import any of these symbols, the Exporter gives the module an opportunity to handle the situation before generating an error. It does this by calling an export_fail method with a list of the failed symbols, which you might define as follows (assuming your module uses the Carp module):

    sub export_fail {
        my $class = shift;
        carp "Sorry, these symbols are unavailable: @_";
        return @_;
    }

The Exporter provides a default export_fail method, which simply returns the list unchanged and makes the use fail with an exception raised for each symbol. If export_fail returns an empty list, no error is recorded and all the requested symbols are exported.

Tag-handling utility functions

Since the symbols listed within %EXPORT_TAGS must also appear in either @EXPORT or @EXPORT_OK, the Exporter provides two functions to let you add those tagged sets of symbols:

    %EXPORT_TAGS = (foo => [qw(aa bb cc)], bar => [qw(aa cc dd)]);

    Exporter::export_tags('foo');     # add aa, bb and cc to @EXPORT
    Exporter::export_ok_tags('bar');  # add aa, cc and dd to @EXPORT_OK

Specifying names that are not tags is erroneous.

Overriding Built-in Functions

Many built-in functions may be overridden, although (like knocking holes in your walls) you should do this only occasionally and for good reason. Typically, this might be done by a package attempting to emulate missing built-in functionality on a non-Unix system. (Do not confuse overriding with overloading, which adds additional object-oriented meanings to built-in operators, but doesn’t override much of anything. See the discussion of the overload module in Chapter 13, Overloading, for more on that.)

Overriding may be done only by importing the name from a module — ordinary predeclaration isn’t good enough.
To be perfectly forthcoming, it’s the assignment of a code reference to a typeglob that triggers the override, as in *open = \&myopen. Furthermore, the assignment must occur in some other package; this makes accidental overriding through typeglob aliasing intentionally difficult. However, if you really want to do your own overriding, don’t despair, because the subs pragma lets you predeclare subroutines via the import syntax, so those names then override the built-in ones:

    use subs qw(chdir chroot chmod chown);
    chdir $somewhere;
    sub chdir { ... }

In general, modules should not export built-in names like open or chdir as part of their default @EXPORT list, since these names may sneak into someone else’s namespace and change the semantics unexpectedly. If the module includes the name in the @EXPORT_OK list instead, importers will be forced to explicitly request that the built-in name be overridden, thus keeping everyone honest.

The original versions of the built-in functions are always accessible via the CORE pseudopackage. Therefore, CORE::chdir will always be the version originally compiled into Perl, even if the chdir keyword has been overridden. Well, almost always.

The foregoing mechanism for overriding built-in functions is restricted, quite deliberately, to the package that requests the import. But there is a more sweeping mechanism you can use when you wish to override a built-in function everywhere, without regard to namespace boundaries. This is achieved by defining the function in the CORE::GLOBAL pseudopackage. Below is an example that replaces the glob operator with something that understands regular expressions. (Note that this example does not implement everything needed to cleanly override Perl’s built-in glob, which behaves differently depending on whether it appears in a scalar or list context. Indeed, many Perl built-ins have such context-sensitive behaviors, and any properly written override should adequately support these.
For a fully functional example of glob overriding, study the File::Glob module bundled with Perl.) Anyway, here’s the antisocial version:

    *CORE::GLOBAL::glob = sub {
        my $pat = shift;
        my @got;
        local *D;
        if (opendir D, '.') {
            @got = grep /$pat/, readdir D;
            closedir D;
        }
        return @got;
    };

    package Whatever;
    print <^[a-z_]+\.pm$>;   # show all pragmas in the current directory

By overriding glob globally, this preemptively forces a new (and subversive) behavior for the glob operator in every namespace, without the cognizance or cooperation of modules that own those namespaces. Naturally, this must be done with extreme caution — if it must be done at all. And it probably mustn’t. Our overriding philosophy is: it’s nice to be important, but it’s more important to be nice.

12
Objects

First of all, you need to understand packages and modules; see Chapter 10, Packages, and Chapter 11, Modules. You also need to know about references and data structures; see Chapter 8, References, and Chapter 9, Data Structures. It’s also helpful to understand a little about object-oriented programming (OOP), so in the next section we’ll give you a little course on OOL (object-oriented lingo).

Brief Refresher on Object-Oriented Lingo

An object is a data structure with a collection of behaviors. We generally speak of the behaviors as acted out by the object directly, sometimes to the point of anthropomorphizing the object. For example, we might say that a rectangle “knows” how to display itself on the screen, or that it “knows” how to compute its own area.

Every object gets its behaviors by virtue of being an instance of a class. The class defines methods: behaviors that apply to the class and its instances. When the distinction matters, we refer to methods that apply only to a particular object as instance methods and those that apply to the entire class as class methods.
But this is only a convention — to Perl, a method is just a method, distinguished only by the type of its first argument.

You can think of an instance method as some action performed by a particular object, such as printing itself out, copying itself, or altering one or more of its properties (“set this sword’s name to Anduril”). Class methods might perform operations on many objects collectively (“display all swords”) or provide other operations that aren’t dependent on any particular object (“from now on, whenever a new sword is forged, register its owner in this database”). Methods that generate instances (objects) of a class are called constructor methods (“create a sword with a gem-studded hilt and a secret inscription”). These are usually class methods (“make me a new sword”) but can also be instance methods (“make a copy just like this sword here”).

A class may inherit methods from parent classes, also known as base classes or superclasses. If it does, it’s known as a derived class or a subclass. (Confusing the issue further, some literature uses “base class” to mean a “most super” superclass. That’s not what we mean by it.) Inheritance makes a new class that behaves just like an existing one but also allows for altered or added behaviors not found in its parents. When you invoke a method whose definition is not found in the class, Perl automatically consults the parent classes for a definition. For example, a sword class might inherit its attack method from a generic blade class. Parents can themselves have parents, and Perl will search those classes as well when it needs to. The blade class might in turn inherit its attack method from an even more generic weapon class. When the attack method is invoked on an object, the resulting behavior may depend on whether that object is a sword or an arrow.
Perhaps there wouldn’t be any difference at all, which would be the case if both swords and arrows inherited their attacking behavior from the generic weapon class. But if there were a difference in behaviors, the method dispatch mechanism would always select the attack method suitable for the object in question. The useful property of always selecting the most appropriate behavior for a particular type of object is known as polymorphism. It’s an important form of not caring.

You have to care about the innards of your objects when you’re implementing a class, but when you use a class, you should be thinking of its objects as black boxes. You can’t see what’s inside, you shouldn’t need to know how it works, and you interact with the box only on its terms: via the methods provided by the class. Even if you know what those methods do to the object, you should resist the urge to fiddle around yourself. It’s like the remote control for your television set: even if you know what’s going on inside it, you shouldn’t monkey with its innards without good reason.

Perl lets you peer inside the object from outside the class when you need to. But doing so breaks its encapsulation, the principle that all access to an object should be through methods alone. Encapsulation decouples the published interface (how an object should be used) from the implementation (how it actually works). Perl does not have an explicit interface facility apart from this unwritten contract between designer and user. Both parties are expected to exercise common sense and common decency: the user by relying only upon the documented interface, the designer by not breaking that interface.

Perl doesn’t force a particular style of programming on you, and it doesn’t have the obsession with privacy that some other object-oriented languages do.
Perl does have an obsession with freedom, however, and one of the freedoms you have as a Perl programmer is the right to select as much or as little privacy as you like. In fact, Perl can have stronger privacy in its classes and objects than C++. That is, Perl does not restrict you from anything, and in particular it doesn’t restrict you from restricting yourself, if you’re into that kind of thing. The sections “Private Methods” and “Closures as Objects” later in this chapter demonstrate how you can increase your dosage of discipline.

Admittedly, there’s a lot more to objects than this, and a lot of ways to find out more about object-oriented design. But that’s not our purpose here. So, on we go.

Perl’s Object System

Perl doesn’t provide any special syntax for defining objects, classes, or methods. Instead, it reuses existing constructs to implement these three concepts.* Here are some simple definitions that you may find reassuring:

An object is simply a reference . . . er, a referent. Since references let individual scalars represent larger collections of data, it shouldn’t be a surprise that references are used for all objects. Technically, an object isn’t the reference proper — it’s really the referent that the reference points at. This distinction is frequently blurred by Perl programmers, however, and since we feel it’s a lovely metonymy, we will perpetuate the usage here when it suits us.†

A class is simply a package. A package serves as a class by using the package’s subroutines to execute the class’s methods, and by using the package’s variables to hold the class’s global data. Often, a module is used to hold one or more classes.

A method is simply a subroutine. You just declare subroutines in the package you’re using as the class; these will then be used as the class’s methods. Method invocation, a new way to call subroutines, passes an extra argument: the object or package used for invoking the method.

* Now there’s an example of software reuse for you!
† We prefer linguistic vigor over mathematical rigor. Either you will agree or you won’t.

Method Invocation

If you were to boil down all of object-oriented programming into one quintessential notion, it would be abstraction. It’s the single underlying theme you’ll find running through all those 10-dollar words that OO enthusiasts like to bandy about, like polymorphism and inheritance and encapsulation. We believe in those fancy words, but we’ll address them from the practical viewpoint of what it means to invoke methods. Methods lie at the heart of the object system because they provide the abstraction layer needed to implement all these fancy terms. Instead of directly accessing a piece of data sitting in an object, you invoke an instance method. Instead of directly calling a subroutine in some package, you invoke a class method. By interposing this level of indirection between class use and class implementation, the program designer remains free to tinker with the internal workings of the class, with little risk of breaking programs that use it.

Perl supports two different syntactic forms for invoking methods. One uses a familiar style you’ve already seen elsewhere in Perl, and the second is a form you may recognize from other programming languages. No matter which form of method invocation is used, the subroutine constituting the method is always passed an extra initial argument. If a class is used to invoke the method, that argument will be the name of the class. If an object is used to invoke the method, that argument will be the reference to the object. Whichever it is, we’ll call it the method’s invocant. For a class method, the invocant is the name of a package. For an instance method, the invocant is a reference that specifies an object. In other words, the invocant is whatever the method was invoked with. Some OO literature calls this the method’s agent or its actor.
Grammatically, the invocant is neither the subject of the action nor the receiver of that action. It’s more like an indirect object, the beneficiary on whose behalf the action is performed — just like the word “me” in the command, “Forge me a sword!” Semantically, you can think of the invocant as either an invoker or an invokee, whichever fits better into your mental apparatus. We’re not going to tell you how to think. (Well, not about that.)

Most methods are invoked explicitly, but methods may also be invoked implicitly when triggered by object destructors, overloaded operators, or tied variables. Properly speaking, these are not regular subroutine calls, but rather method invocations automatically triggered by Perl on behalf of an object. Destructors are described later in this chapter, overloading is described in Chapter 13, Overloading, and ties are described in Chapter 14, Tied Variables.

One difference between methods and regular subroutines is when their packages are resolved — that is, how early (or late) Perl decides which code should be executed for the method or subroutine. A subroutine’s package is resolved during compilation, before your program begins to run.* In contrast, a method’s package isn’t resolved until it is actually invoked. (Prototypes are checked at compile time, which is why regular subroutines can use them but methods can’t.)

The reason a method’s package can’t be resolved earlier is relatively straightforward: the package is determined by the class of the invocant, and the invocant isn’t known until the method is actually invoked. At the heart of OO is this simple chain of logic: once the invocant is known, the invocant’s class is known, and once the class is known, the class’s inheritance is known, and once the class’s inheritance is known, the actual subroutine to call is known.

The logic of abstraction comes at a price.
Because of the late resolution of methods, an object-oriented solution in Perl is likely to run slower than the corresponding non-OO solution. For some of the fancier techniques described later, it could be a lot slower. However, many common problems are solved not by working faster, but by working smarter. That’s where OO shines.

Method Invocation Using the Arrow Operator

We mentioned that there are two styles of method invocation. The first style for invoking a method looks like this:

    INVOCANT->METHOD(LIST)
    INVOCANT->METHOD

For obvious reasons, this style is usually called the arrow form of invocation. (Do not confuse -> with =>, the “double-barreled” arrow used as a fancy comma.) Parentheses are required if there are any arguments. When executed, the invocation first locates the subroutine determined jointly by the class of the INVOCANT and the METHOD name, and then calls that subroutine, passing INVOCANT as its first argument.

When INVOCANT is a reference, we say that METHOD is invoked as an instance method, and when INVOCANT is a package name, we say that METHOD is invoked as a class method. There really is no difference between the two, other than that the package name is more obviously associated with the class itself than with the objects of the class. You’ll have to take our word for it that the objects also know their class. We’ll tell you in a bit how to associate an object with a class name, but you can use objects without knowing that.

* More precisely, the subroutine call is resolved down to a particular typeglob, a reference to which is stuffed into the compiled opcode tree. The meaning of that typeglob is negotiable even at run time — this is how AUTOLOAD can autoload a subroutine for you. Normally, however, the meaning of the typeglob is also resolved at compile time by the definition of an appropriately named subroutine.
For example, to construct an object using the class method summon and then invoke the instance method speak on the resulting object, you might say this:

    $mage = Wizard->summon("Gandalf");   # class method
    $mage->speak("friend");              # instance method

The summon and speak methods are defined by the Wizard class — or one of the classes from which it inherits. But you shouldn’t worry about that. Do not meddle in the affairs of Wizards.

Since the arrow operator is left associative (see Chapter 3, Unary and Binary Operators), you can even combine the two statements into one:

    Wizard->summon("Gandalf")->speak("friend");

Sometimes you want to invoke a method without knowing its name ahead of time. You can use the arrow form of method invocation and replace the method name with a simple scalar variable:

    $method = "summon";
    $mage = Wizard->$method("Gandalf");    # Invoke Wizard->summon

    $travel = $companion eq "Shadowfax" ? "ride" : "walk";
    $mage->$travel("seven leagues");       # Invoke $mage->ride or $mage->walk

Although you’re using the name of the method to invoke it indirectly, this usage is not forbidden by use strict 'refs', since all method calls are in fact looked up symbolically at the time they’re resolved. In our example, we stored the name of a subroutine in $travel, but you could also store a subroutine reference. This bypasses the method lookup algorithm, but sometimes that’s exactly what you want to do. See both the section “Private Methods” and the discussion of the can method in the section “UNIVERSAL: The Ultimate Ancestor Class”. To create a reference to a particular method being called on a particular instance, see the section “Closures” in Chapter 8.

Method Invocation Using Indirect Objects

The second style of method invocation looks like this:

    METHOD INVOCANT (LIST)
    METHOD INVOCANT LIST
    METHOD INVOCANT

The parentheses around LIST are optional; if omitted, the method acts as a list operator.
So you can have statements like the following, all of which use this style of method call:

    $mage = summon Wizard "Gandalf";
    $nemesis = summon Balrog home => "Moria", weapon => "whip";
    move $nemesis "bridge";
    speak $mage "You cannot pass";
    break $staff;               # safer to use: break $staff ();

The list operator syntax should be familiar to you; it’s the same style used for passing filehandles to print or printf:

    print STDERR "help!!!\n";

It’s also similar to English sentences like “Give Gollum the Preciousss”, so we call it the indirect object form. The invocant is expected in the indirect object slot. When you read about passing a built-in function like system or exec something in its “indirect object slot”, this means that you’re supplying this extra, comma-less argument in the same place you would when you invoke a method using the indirect object syntax.

The indirect object form even permits you to specify the INVOCANT as a BLOCK that evaluates to an object (reference) or class (package). This lets you combine those two invocations into one statement this way:

    speak { summon Wizard "Gandalf" } "friend";

Syntactic Snafus with Indirect Objects

One syntax will often be more readable than the other. The indirect object syntax is less cluttered, but suffers from several forms of syntactic ambiguity. The first is that the LIST part of an indirect object invocation is parsed the same as any other list operator. Thus, the parentheses of:

    enchant $sword ($pips + 2) * $cost;

are assumed to surround all the arguments, regardless of what comes afterward. It would therefore be equivalent to this:

    ($sword->enchant($pips + 2)) * $cost;

That’s unlikely to do what you want: enchant is only being called with $pips + 2, and the method’s return value is then multiplied by $cost. As with other list operators, you must also be careful of the precedence of && and || versus and and or. For example, this:

    name $sword $oldname || "Glamdring";    # can’t use "or" here!
becomes the intended:

    $sword->name($oldname || "Glamdring");

but this:

    speak $mage "friend" && enter();        # should’ve been "and" here!

becomes the dubious:

    $mage->speak("friend" && enter());

which could be fixed by rewriting into one of these equivalent forms:

    enter() if $mage->speak("friend");
    $mage->speak("friend") && enter();
    speak $mage "friend" and enter();

The second syntactic infelicity of the indirect object form is that its INVOCANT is limited to a name, an unsubscripted scalar variable, or a block.* As soon as the parser sees one of these, it has its INVOCANT, so it starts looking for its LIST. So these invocations:

    move $party->{LEADER};      # probably wrong!
    move $riders[$i];           # probably wrong!

actually parse as these:

    $party->move->{LEADER};
    $riders->move([$i]);

rather than what you probably wanted:

    $party->{LEADER}->move;
    $riders[$i]->move;

The parser only looks a little ways ahead to find the invocant for an indirect object, not even as far as it would look for a unary operator. This oddity does not arise with the first notation, so you might wish to stick with the arrow as your weapon of choice.

Even English has a similar issue here. Think about the command, “Throw your cat out the window a toy mouse to play with.” If you parse that sentence too quickly, you’ll end up throwing the cat, not the mouse (unless you notice that the cat is already out the window). Like Perl, English has two different syntaxes for expressing the agent: “Throw your cat the mouse” and “Throw the mouse to your cat.” Sometimes the longer form is clearer and more natural, and sometimes the shorter one is. At least in Perl, you’re required to use braces around any complicated indirect object.

* Attentive readers will recall that this is precisely the same list of syntactic items that are allowed after a funny character to indicate a variable dereference—for example, @ary, @$aryref, or @{$aryref}.
Package-Quoted Classes

The final syntactic ambiguity with the indirect object style of method invocation is that it may not be parsed as a method call at all, because the current package may have a subroutine of the same name as the method. When using a class method with a literal package name as the invocant, there is a way to resolve this ambiguity while still keeping the indirect object syntax: package-quote the classname by appending a double colon to it.

    $obj = method CLASS::;      # forced to be "CLASS"->method

This is important because the commonly seen notation:

    $obj = new CLASS;           # might not parse as method

will not always behave properly if the current package has a subroutine named new or CLASS. Even if you studiously use the arrow form instead of the indirect object form to invoke methods, this can, on rare occasion, still be a problem. At the cost of extra punctuation noise, the CLASS:: notation guarantees how Perl will parse your method invocation. The first two examples below do not always parse the same way, but the second two do:

    $obj = new ElvenRing;       # could be new("ElvenRing")
                                #   or even new(ElvenRing())
    $obj = ElvenRing->new;      # could be ElvenRing()->new()

    $obj = new ElvenRing::;     # always "ElvenRing"->new()
    $obj = ElvenRing::->new;    # always "ElvenRing"->new()

This package-quoting notation can be made prettier with some creative alignment:

    $obj = new ElvenRing::
        name   => "Narya",
        owner  => "Gandalf",
        domain => "fire",
        stone  => "ruby";

Still, you may say, “Oh, ugh!” at that double colon, so we’ll tell you that you can almost always get away with a bare class name, provided two things are true. First, there is no subroutine of the same name as the class. (If you follow the convention that subroutine names like new start lowercase and class names like ElvenRing start uppercase, this is never a problem.)
Second, the class has been loaded with one of:

    use ElvenRing;
    require ElvenRing;

Either of these declarations ensures that Perl knows ElvenRing is a module name, which forces any bare name like new before the class name ElvenRing to be interpreted as a method call, even if you happen to have declared a new subroutine of your own in the current package. People don’t generally get into trouble with indirect objects unless they start cramming multiple classes into the same file, in which case Perl might not know that a particular package name was supposed to be a class name. People who name subroutines with names that look like ModuleNames also come to grief eventually.

Object Construction

All objects are references, but not all references are objects. A reference won’t work as an object unless its referent is specially marked to tell Perl what package it belongs to. The act of marking a referent with a package name—and therefore, its class, since a class is just a package—is known as blessing. You can think of the blessing as turning a reference into an object, although it’s more accurate to say that it turns the reference into an object reference.

The bless function takes either one or two arguments. The first argument is a reference and the second is the package to bless the referent into. If the second argument is omitted, the current package is used.

    $obj = { };                 # Get reference to anonymous hash.
    bless($obj);                # Bless hash into current package.
    bless($obj, "Critter");     # Bless hash into class Critter.

Here we’ve used a reference to an anonymous hash, which is what people usually use as the data structure for their objects. Hashes are extremely flexible, after all. But allow us to emphasize that you can bless a reference to anything you can make a reference to in Perl, including scalars, arrays, subroutines, and typeglobs. You can even bless a reference to a package’s symbol table hash if you can think of a good reason to.
(Or even if you can’t.) Object orientation in Perl is completely orthogonal to data structure.

Once the referent has been blessed, calling the built-in ref function on its reference returns the name of the blessed class instead of the built-in type, such as HASH. If you want the built-in type, use the reftype function from the attributes module. See use attributes in Chapter 31, Pragmatic Modules.

And that’s how to make an object. Just take a reference to something, give it a class by blessing it into a package, and you’re done. That’s all there is to it if you’re designing a minimal class. If you’re using a class, there’s even less to it, because the author of a class will have hidden the bless inside a method called a constructor, which creates and returns instances of the class. Because bless returns its first argument, a typical constructor can be as simple as this:

    package Critter;
    sub spawn { bless {}; }

Or, spelled out slightly more explicitly:

    package Critter;
    sub spawn {
        my $self = {};          # Reference to an empty anonymous hash
        bless $self, "Critter"; # Make that hash a Critter object
        return $self;           # Return the freshly generated Critter
    }

With that definition in hand, here’s how one might create a Critter object:

    $pet = Critter->spawn;

Inheritable Constructors

Like all methods, a constructor is just a subroutine, but we don’t call it as a subroutine. We always invoke it as a method—a class method, in this particular case, because the invocant is a package name. Method invocations differ from regular subroutine calls in two ways. First, they get the extra argument we discussed earlier. Second, they obey inheritance, allowing one class to use another’s methods.

We’ll describe the underlying mechanics of inheritance more rigorously in the next section, but for now, some simple examples of its effects should help you design your constructors. For instance, suppose we have a Spider class that inherits methods from the Critter class.
In particular, suppose the Spider class doesn’t have its own spawn method. The following correspondences apply:

    Method Call         Resulting Subroutine Call

    Critter->spawn()    Critter::spawn("Critter")
    Spider->spawn()     Critter::spawn("Spider")

The subroutine called is the same in both cases, but the argument differs. Note that our spawn constructor above completely ignored its argument, which means our Spider object was incorrectly blessed into class Critter. A better constructor would provide the package name (passed in as the first argument) to bless:

    sub spawn {
        my $class = shift;      # Store the package name
        my $self = { };
        bless($self, $class);   # Bless the reference into that package
        return $self;
    }

Now you could use the same subroutine for both these cases:

    $vermin = Critter->spawn;
    $shelob = Spider->spawn;

And each object would be of the proper class. This even works indirectly, as in:

    $type = "Spider";
    $shelob = $type->spawn;     # same as "Spider"->spawn

That’s still a class method, not an instance method, because its invocant holds a string and not a reference.

If $type were an object instead of a class name, the previous constructor definition wouldn’t have worked, because bless needs a class name. But for many classes, it makes sense to use an existing object as the template from which to create another. In these cases, you can design your constructors so that they work with either objects or class names:

    sub spawn {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;    # Object or class name
        my $self = { };
        bless($self, $class);
        return $self;
    }

Initializers

Most objects maintain internal information that is indirectly manipulated by the object’s methods. All our constructors so far have created empty hashes, but there’s no reason to leave them empty. For instance, we could have the constructor accept extra arguments to store into the hash as key/value pairs.
The OO literature often refers to such data as properties, attributes, accessors, member data, instance data, or instance variables. The section “Instance Variables” later in this chapter discusses attributes in more detail.

Imagine a Horse class with instance attributes like “name” and “color”:

    $steed = Horse->new(name => "Shadowfax", color => "white");

If the object is implemented as a hash reference, the key/value pairs can be interpolated directly into the hash once the invocant is removed from the argument list:

    sub new {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        my $self = { @_ };      # Remaining args become attributes
        bless($self, $class);   # Bestow objecthood
        return $self;
    }

This time we used a method named new for the class’s constructor, which just might lull C++ programmers into thinking they know what’s going on. But Perl doesn’t consider “new” to be anything special; you may name your constructors whatever you like. Any method that happens to create and return an object is a de facto constructor. In general, we recommend that you name your constructors whatever makes sense in the context of the problem you’re solving. For example, constructors in the Tk module are named after the widgets they create. In the DBI module, a constructor named connect returns a database handle object, and another constructor named prepare is invoked as an instance method and returns a statement handle object. But if there is no suitable context-specific constructor name, new is perhaps not a terrible choice. Then again, maybe it’s not such a bad thing to pick a random name to force people to read the interface contract (meaning the class documentation) before they use its constructors.
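As a small sketch of such a context-specific constructor name (the Forge class and its temper method are invented here purely for illustration; they are not part of any standard module):

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Forge;
# "temper" reads better than "new" for a class that forges things.
sub temper {
    my $invocant = shift;
    my $class = ref($invocant) || $invocant;
    my $self = { metal => "steel", @_ };    # default, overridable
    return bless $self, $class;
}

package main;
my $blade = Forge->temper(metal => "mithril");
print ref($blade), " of ", $blade->{metal}, "\n";
```

The method is still an ordinary constructor in every respect; only its name has changed to match the problem domain.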
Elaborating further, you can set up your constructor with default key/value pairs, which the user could later override by supplying them as arguments:

    sub new {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        my $self = {
            color => "bay",
            legs  => 4,
            owner => undef,
            @_,                 # Override previous attributes
        };
        return bless $self, $class;
    }

    $ed = Horse->new;                           # A 4-legged bay horse
    $stallion = Horse->new(color => "black");   # A 4-legged black horse

This Horse constructor ignores its invocant’s existing attributes when used as an instance method. You could create a second constructor designed to be called as an instance method, and if designed properly, you could use the values from the invoking object as defaults for the new one:

    $steed = Horse->new(color => "dun");
    $foal  = $steed->clone(owner => "EquuGen Guild, Ltd.");

    sub clone {
        my $model = shift;
        my $self = $model->new(%$model, @_);
        return $self;           # Previously blessed by ->new
    }

(You could also have rolled this functionality directly into new, but then the name wouldn’t quite fit the function.)

Notice how even in the clone constructor, we don’t hardcode the name of the Horse class. We have the original object invoke its own new method, whatever that may be. If we had written that as Horse->new instead of $model->new, then the class wouldn’t have facilitated inheritance by a Zebra or Unicorn class. You wouldn’t want to clone Pegasus and suddenly find yourself with a horse of a different color.

Sometimes, however, you have the opposite problem: rather than trying to share one constructor among several classes, you’re trying to have several constructors share one class’s object. This happens whenever a constructor wants to call a base class’s constructor to do part of the construction work. Perl doesn’t do hierarchical construction for you.
That is, Perl does not automatically call the constructors (or the destructors) for any base classes of the class requested, so your constructor will have to do that itself and then add any additional attributes the derived class needs. So the situation is not unlike the clone routine, except that instead of copying an existing object into the new object, you want to call your base class’s constructor and then transmogrify the new base object into your new derived object.

Class Inheritance

As with the rest of Perl’s object system, inheritance of one class by another requires no special syntax to be added to the language. When you invoke a method for which Perl finds no subroutine in the invocant’s package, that package’s @ISA array* is examined. This is how Perl implements inheritance: each element of a given package’s @ISA array holds the name of another package, which is searched when methods are missing. For example, the following makes the Horse class a subclass of the Critter class. (We declare @ISA with our because it has to be a package variable, not a lexical declared with my.)

    package Horse;
    our @ISA = "Critter";

You should now be able to use a Horse class or object everywhere that a Critter was previously used. If your new class passes this empty subclass test, you know that Critter is a proper base class, fit for inheritance.

Suppose you have a Horse object in $steed and invoke a move method on it:

    $steed->move(10);

Because $steed is a Horse, Perl’s first choice for that method is the Horse::move subroutine. If there isn’t one, instead of raising a run-time exception, Perl consults the first element of @Horse::ISA, which directs it to look in the Critter package for Critter::move.
If this subroutine isn’t found either, and Critter has its own @Critter::ISA array, then that too will be consulted for the name of an ancestral package that might supply a move method, and so on back up the inheritance hierarchy until we come to a package without an @ISA.

* Pronounced “is a”, as in “A horse is a critter.”

The situation we just described is single inheritance, where each class has only one parent. Such inheritance is like a linked list of related packages. Perl also supports multiple inheritance; just add more packages to the class’s @ISA. This kind of inheritance works more like a tree data structure, because every package can have more than one immediate parent. Some people find this to be sexier.

When you invoke a method methname on an invocant of type classname, Perl tries six different ways to find a subroutine to use:

1. First, Perl looks in the invocant’s own package for a subroutine named classname::methname. If that fails, inheritance kicks in, and we go to step 2.

2. Next, Perl checks for methods inherited from base classes by looking in all parent packages listed in @classname::ISA for a parent::methname subroutine. The search is left-to-right, recursive, and depth-first. The recursion assures that grandparent classes, great-grandparent classes, great-great-grandparent classes, and so on, are all searched.

3. If that fails, Perl looks for a subroutine named UNIVERSAL::methname.

4. At this point, Perl gives up on methname and starts looking for an AUTOLOAD. First, it looks for a subroutine named classname::AUTOLOAD.

5. Failing that, Perl searches all parent packages listed in @classname::ISA, for any parent::AUTOLOAD subroutine. The search is again left-to-right, recursive, and depth-first.

6. Finally, Perl looks for a subroutine named UNIVERSAL::AUTOLOAD.

Perl stops after the first successful attempt and invokes that subroutine.
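The left-to-right, depth-first order can be seen in a short, self-contained sketch (the class and method names here are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Critter;
sub move { return "Critter::move" }     # supplied by a distant ancestor

package Horse;
our @ISA = ("Critter");                 # Horse has no move of its own

package Donkey;
sub move { return "Donkey::move" }
sub bray { return "Donkey::bray" }

package Mule;
our @ISA = ("Horse", "Donkey");         # searched left-to-right, depth-first

package main;
# Mule::move and Horse::move are missing, so the search descends into
# Horse's ancestors (finding Critter::move) before it ever tries Donkey:
print Mule->move, "\n";     # prints "Critter::move", not "Donkey::move"
print Mule->bray, "\n";     # only Donkey supplies bray
```

This is exactly why the depth-first rule can surprise you under multiple inheritance: an ancestor of the leftmost parent beats a direct method on a parent further to the right.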
If no subroutine is found, an exception is raised, one that you’ll see frequently:

    Can't locate object method "methname" via package "classname"

If you’ve built a debugging version of Perl using the -DDEBUGGING option to your C compiler, then by using Perl’s -Do switch, you can watch it go through each of these steps when it resolves method invocation. We will discuss the inheritance mechanism in more detail as we go along.

Inheritance Through @ISA

If @ISA contains more than one package name, the packages are all searched in left-to-right order. The search is depth-first, so if you have a Mule class set up for inheritance this way:

    package Mule;
    our @ISA = ("Horse", "Donkey");

Perl looks for any methods missing from Mule first in Horse (and any of its ancestors, like Critter) before going on to search through Donkey and its ancestors.

If a missing method is found in a base class, Perl internally caches that location in the current class for efficiency, so the next time it has to find the method, it doesn’t have to look as far. Changing @ISA or defining new methods invalidates the cache and causes Perl to perform the lookup again.

When Perl searches for a method, it makes sure that you haven’t created a circular inheritance hierarchy. This could happen if two classes inherit from one another, even indirectly through other classes. Trying to be your own great-grandfather is too paradoxical even for Perl, so the attempt raises an exception. However, Perl does not consider it an error to inherit from more than one class sharing a common ancestry, which is rather like cousins marrying. Your inheritance hierarchy just stops looking like a tree and starts to look like a directed acyclic graph. This doesn’t bother Perl—so long as the graph really is acyclic.

When you set @ISA, the assignment normally happens at run time, so unless you take precautions, code in BEGIN, CHECK, or INIT blocks won’t be able to use the inheritance hierarchy.
One precaution (or convenience) is the use base pragma, which lets you require classes and add them to @ISA at compile time. Here’s how you might use it:

    package Mule;
    use base ("Horse", "Donkey");   # declare superclasses

This is a shorthand for:

    package Mule;
    BEGIN {
        our @ISA = ("Horse", "Donkey");
        require Horse;
        require Donkey;
    }

except that use base also takes into account any use fields declarations.

Sometimes folks are surprised that including a class in @ISA doesn’t require the appropriate module for you. That’s because Perl’s class system is largely orthogonal to its module system. One file can hold many classes (since they’re just packages), and one package may be mentioned in many files. But in the most common situation, where one package and one class and one module and one file all end up being pretty interchangeable if you squint enough, the use base pragma offers a declarative syntax that establishes inheritance, loads in module files, and accommodates any declared base class fields. It’s one of those convenient diagonals we keep mentioning. See the descriptions of use base and use fields in Chapter 31 for further details.

Accessing Overridden Methods

When a class defines a method, that subroutine overrides methods of the same name in any base classes. Imagine that you’ve a Mule object (which is derived from class Horse and class Donkey), and you decide to invoke your object’s breed method. Although the parent classes have their own breed methods, the designer of the Mule class overrode those by supplying the Mule class with its own breed method. That means the following cross is unlikely to be productive:

    $stallion = Horse->new(gender => "male");
    $molly = Mule->new(gender => "female");
    $colt = $molly->breed($stallion);

Now suppose that through the miracle of genetic engineering, you find some way around a mule’s notorious sterility problem, so you want to skip over the nonviable Mule::breed method.
You could call your method as an ordinary subroutine, being sure to pass the invocant explicitly:

    $colt = Horse::breed($molly, $stallion);

However, this sidesteps inheritance, which is nearly always the wrong thing to do. It’s perfectly imaginable that no Horse::breed subroutine exists because both Horses and Donkeys derive that behavior from a common parent class called Equine. If, on the other hand, you want to specify that Perl should start searching for a method in a particular class, just use ordinary method invocation but qualify the method name with the class:

    $colt = $molly->Horse::breed($stallion);

Occasionally, you’ll want a method in a derived class to act as a wrapper around some method in a base class. The method in the derived class can itself invoke the method in the base class, adding its own actions before or after that invocation. You could use the notation just demonstrated to specify at which class to start the search. But in most cases of overridden methods, you don’t want to have to know or specify which parent class’s overridden method to execute. That’s where the SUPER pseudoclass comes in handy. It lets you invoke an overridden base class method without having to specify which class defined that method.* The following subroutine looks in the current package’s @ISA without making you specify particular classes:

    package Mule;
    our @ISA = qw(Horse Donkey);
    sub kick {
        my $self = shift;
        print "The mule kicks!\n";
        $self->SUPER::kick(@_);
    }

The SUPER pseudopackage is meaningful only when used inside a method. Although the implementer of a class can employ SUPER in their own code, someone who merely uses a class’s objects cannot.

* This is not to be confused with the mechanism mentioned in Chapter 11 for overriding Perl’s built-in functions, which aren’t object methods and so aren’t overridden by inheritance. You call overridden built-ins via the CORE pseudopackage, not the SUPER pseudopackage.
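SUPER is also the conventional way to handle the hierarchical construction problem mentioned earlier, where a derived class’s constructor must call the base class’s constructor itself. A minimal sketch, with the Equine and Racehorse classes invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Equine;
sub new {
    my $invocant = shift;
    my $class = ref($invocant) || $invocant;
    my $self = { legs => 4, @_ };
    return bless $self, $class;
}

package Racehorse;
our @ISA = ("Equine");
sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);   # let the base class construct
    $self->{wins} = 0;                      # then add derived-class attributes
    return $self;
}

package main;
my $filly = Racehorse->new(name => "Epona");
print ref($filly), ": $filly->{legs} legs, $filly->{wins} wins\n";
# prints "Racehorse: 4 legs, 0 wins"
```

Because the base constructor uses ref($invocant) || $invocant and blesses into $class, the object comes out blessed as a Racehorse even though Equine::new did the actual blessing.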
SUPER does not always work as you might like when multiple inheritance is involved. As you’d expect, it follows @ISA just as the regular inheritance mechanism does: in left-to-right, recursive, depth-first order. If both Horse and Donkey had a speak method, and you preferred the Donkey method, you’d have to name that parent class explicitly:

    sub speak {
        my $self = shift;
        print "The mule speaks!\n";
        $self->Donkey::speak(@_);
    }

More elaborate approaches to multiple inheritance situations can be crafted using the UNIVERSAL::can method described in the next section. Or you can grab the Class::Multimethods module from CPAN, which provides many elaborate solutions, including finding the closest match instead of the leftmost one.

Every bit of code in Perl knows what its current package is, as determined by the last package statement. A SUPER method consults the @ISA only of the package into which the call to SUPER was compiled. It does not care about the class of the invocant, nor about the package of the subroutine that was called. This can cause problems if you try to define methods in another class by merely playing tricks with the method name:

    package Bird;
    use Dragonfly;
    sub Dragonfly::divebomb { shift->SUPER::divebomb(@_) }

Unfortunately, this invokes Bird’s superclass, not Dragonfly’s. To do what you’re trying to do, you need to explicitly switch into the appropriate package for the compilation of SUPER as well:

    package Bird;
    use Dragonfly;
    {
        package Dragonfly;
        sub divebomb { shift->SUPER::divebomb(@_) }
    }

As this example illustrates, you never need to edit a module file just to add methods to an existing class. Since a class is just a package, and a method just a subroutine, all you have to do is define a function in that package as we’ve done here, and the class suddenly has a new method. No inheritance required. Only the package matters, and since packages are global, any package can be accessed from anywhere in the program.
(Did we mention we’re going to install a jacuzzi in your living room next week?)

UNIVERSAL: The Ultimate Ancestor Class

If no method definition with the right name is found after searching the invocant’s class and all its ancestor classes recursively, one more check for a method of that name is made in the special predefined class called UNIVERSAL. This package never appears in an @ISA, but is always consulted when an @ISA check fails. You can think of UNIVERSAL as the ultimate ancestor from which all classes implicitly derive.

The following predefined methods are available in class UNIVERSAL, and thus in all classes. These all work regardless of whether they are invoked as class methods or object methods.

INVOCANT->isa(CLASS)
    The isa method returns true if INVOCANT’s class is CLASS or any class inheriting from CLASS. Instead of a package name, CLASS may also be one of the built-in types, such as “HASH” or “ARRAY”. (Checking for an exact type does not bode well for encapsulation or polymorphism, though. You should be relying on method dispatch to give you the right method.)

        use FileHandle;
        if (FileHandle->isa("Exporter")) {
            print "FileHandle is an Exporter.\n";
        }

        $fh = FileHandle->new();
        if ($fh->isa("IO::Handle")) {
            print "\$fh is some sort of IOish object.\n";
        }
        if ($fh->isa("GLOB")) {
            print "\$fh is really a GLOB reference.\n";
        }

INVOCANT->can(METHOD)
    The can method returns a reference to the subroutine that would be called if METHOD were applied to INVOCANT. If no such subroutine is found, can returns undef.
        if ($invocant->can("copy")) {
            print "Our invocant can copy.\n";
        }

    This could be used to conditionally invoke a method only if one exists:

        $obj->snarl if $obj->can("snarl");

    Under multiple inheritance, this allows a method to invoke all overridden base class methods, not just the leftmost one:

        sub snarl {
            my $self = shift;
            print "Snarling: @_\n";
            my %seen;
            for my $parent (@ISA) {
                if (my $code = $parent->can("snarl")) {
                    $self->$code(@_) unless $seen{$code}++;
                }
            }
        }

    We use the %seen hash to keep track of which subroutines we’ve already called, so we can avoid calling the same subroutine more than once. This could happen if several parent classes shared a common ancestor. Methods that would trigger an AUTOLOAD (described in the next section) will not be accurately reported unless the package has declared (but not defined) the subroutines it wishes to have autoloaded.

INVOCANT->VERSION(NEED)
    The VERSION method returns the version number of INVOCANT’s class, as stored in the package’s $VERSION variable. If the NEED argument is provided, it verifies that the current version isn’t less than NEED and raises an exception if it is. This is the method that use invokes to determine whether a module is sufficiently recent.

        use Thread 1.0;     # calls Thread->VERSION(1.0)
        print "Running version ", Thread->VERSION, " of Thread.\n";

    You may supply your own VERSION method to override the method in UNIVERSAL. However, this will cause any classes derived from your class to use the overridden method, too. If you don’t want that to happen, you should design your method to delegate other classes’ version requests back up to UNIVERSAL.

The methods in UNIVERSAL are built-in Perl subroutines, which you may call if you fully qualify them and pass two arguments, as in UNIVERSAL::isa($formobj, "HASH"). (This is not recommended, though, because can usually has the answer you’re really looking for.) You’re free to add your own methods to class UNIVERSAL.
(You should be careful, of course; you could really mess someone up who is expecting not to find the method name you’re defining, perhaps so that they can autoload it from somewhere else.) Here we create a copy method that objects of all classes can use if they’ve not defined their own. (We fail spectacularly if invoked on a class instead of an object.)

    use Data::Dumper;
    use Carp;
    sub UNIVERSAL::copy {
        my $self = shift;
        if (ref $self) {
            return eval Dumper($self);  # no CODE refs
        } else {
            confess "UNIVERSAL::copy can't copy class $self";
        }
    }

This Data::Dumper strategy doesn’t work if the object contains any references to subroutines, because they cannot be properly reproduced. Even if the source were available, the lexical bindings would be lost.

Method Autoloading

Normally, when you call an undefined subroutine in a package that defines an AUTOLOAD subroutine, the AUTOLOAD subroutine is called in lieu of raising an exception (see the section “Autoloading” in Chapter 10). With methods, this works a little differently. If the regular method search (through the class, its ancestors, and finally UNIVERSAL) fails to find a match, the same sequence is run again, this time looking for an AUTOLOAD subroutine. If found, this subroutine is called as a method, with the package’s $AUTOLOAD variable set to the fully qualified name of the subroutine on whose behalf AUTOLOAD was called.

You need to be a bit cautious when autoloading methods. First, the AUTOLOAD subroutine should return immediately if it’s being called on behalf of a method named DESTROY, unless your goal was to simulate DESTROY, which has a special meaning to Perl described in the section “Instance Destructors” later in this chapter.

    sub AUTOLOAD {
        return if our $AUTOLOAD =~ /::DESTROY$/;
        ...
    }

Second, if the class is providing an AUTOLOAD safety net, you won’t be able to use UNIVERSAL::can on a method name to check whether it’s safe to invoke.
You have to check for AUTOLOAD separately:

    if ($obj->can("methname") || $obj->can("AUTOLOAD")) {
        $obj->methname();
    }

Finally, under multiple inheritance, if a class inherits from two or more classes each of which has an AUTOLOAD, only the leftmost will ever be triggered, since Perl stops as soon as it finds the first AUTOLOAD.

The last two quirks are easily circumvented by declaring the subroutines in the package whose AUTOLOAD is supposed to manage those methods. You can do this either with individual declarations:

    package Goblin;
    sub kick;
    sub bite;
    sub scratch;

or with the use subs pragma, which is more convenient if you have many methods to declare:

    package Goblin;
    use subs qw(kick bite scratch);

Even though you've only declared these subroutines and not defined them, this is enough for the system to think they're real. They show up in a UNIVERSAL::can check, and, more importantly, they show up in step 2 of the search for a method, which will never progress to step 3, let alone step 4.

"But, but," you exclaim, "they invoke AUTOLOAD, don't they?" Well, yes, they do eventually, but the mechanism is different. Having found the method stub via step 2, Perl tries to call it. When it is discovered that the method isn't all it was cracked up to be, the AUTOLOAD method search kicks in again, but this time, it starts its search in the class containing the stub, which restricts the method search to that class and its ancestors (and UNIVERSAL). That's how Perl finds the correct AUTOLOAD to run and knows to ignore AUTOLOADs from the wrong part of the original inheritance tree.

Private Methods

There is one way to invoke a method so that Perl ignores inheritance altogether. If instead of a literal method name, you specify a simple scalar variable containing a reference to a subroutine, then the subroutine is called immediately.
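For instance, here is a small self-contained sketch (the Animal and Dog classes are invented for illustration) of calling a method through a code reference, which skips the inheritance search entirely:

```perl
use strict;
use warnings;

package Animal;
sub new   { my $class = shift; return bless({}, $class) }
sub speak { return "generic noise" }

package Dog;
our @ISA = ("Animal");        # Dog inherits from Animal
sub speak { return "woof" }   # ...and overrides speak()

package main;
my $dog = Dog->new;
print $dog->speak(), "\n";    # normal dispatch finds Dog::speak: "woof"

# Grab the base class's version as a code ref and call it directly,
# bypassing inheritance altogether:
my $base_speak = Animal->can("speak");
print $dog->$base_speak(), "\n";   # "generic noise"
```

The arrow in $dog->$base_speak() looks like a method call, but dispatch has already happened: whatever subroutine the scalar refers to is what runs, with $dog passed as the first argument.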
In the description of UNIVERSAL->can in the previous section, the last example invokes all overridden methods using the subroutine's reference, not its name. An intriguing aspect of this behavior is that it can be used to implement private method calls. If you put your class in a module, you can make use of the file's lexical scope for privacy. First, store an anonymous subroutine in a file-scoped lexical:

    # declare private method
    my $secret_door = sub {
        my $self = shift;
        ...
    };

Later on in the file, you can use that variable as though it held a method name. The closure will be called directly, without regard to inheritance. As with any other method, the invocant is passed as an extra argument.

    sub knock {
        my $self = shift;
        if ($self->{knocked}++ > 5) {
            $self->$secret_door();
        }
    }

This enables the file's own subroutines (the class methods) to invoke a method that code outside that lexical scope cannot access.

Instance Destructors

As with any other referent in Perl, when the last reference to an object goes away, its memory is implicitly recycled. With an object, you have the opportunity to capture control just as this is about to happen by defining a DESTROY subroutine in the class's package. This method is triggered automatically at the appropriate moment, with the about-to-be-recycled object as its only argument.

Destructors are rarely needed in Perl, because memory management is handled automatically for you. Some objects, though, may have state outside the memory system that you'd like to attend to, such as filehandles or database connections.

    package MailNotify;
    sub DESTROY {
        my $self = shift;
        my $fh = $self->{mailhandle};
        my $id = $self->{name};
        print $fh "\n$id is signing off at " . localtime() .
              "\n";
        close $fh;    # close pipe to mailer
    }

Just as Perl uses only a single method to construct an object even when the constructor's class inherits from one or more other classes, Perl also uses only one DESTROY method per object destroyed regardless of inheritance. In other words, Perl does not do hierarchical destruction for you. If your class overrides a superclass's destructor, then your DESTROY method may need to invoke the DESTROY method for any applicable base classes:

    sub DESTROY {
        my $self = shift;
        # check for an overridden destructor...
        $self->SUPER::DESTROY if $self->can("SUPER::DESTROY");
        # now do your own thing before or after
    }

This applies only to inherited classes; an object that is simply contained within the current object — as, for example, one value in a larger hash — will be freed and destroyed automatically. This is one reason why containership via mere aggregation (sometimes called a "has-a" relationship) is often cleaner and clearer than inheritance (an "is-a" relationship). In other words, often you really only need to store one object inside another directly instead of through inheritance, which can add unnecessary complexity. Sometimes when users reach for multiple inheritance, single inheritance will suffice.

Explicitly calling DESTROY is possible but seldom needed. It might even be harmful, since running the destructor more than once on the same object could prove unpleasant.

Garbage Collection with DESTROY Methods

As described in the section "Garbage Collection, Circular References, and Weak References" in Chapter 8, a variable that refers to itself (or multiple variables that refer to one another indirectly) will not be freed until the program (or embedded interpreter) is about to exit. If you want to reclaim the memory any earlier, you usually have to explicitly break the reference or weaken it using the WeakRef module on CPAN.
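As a sketch of that explicit-breaking idea (the Ring class below is invented for illustration), a container's DESTROY can sever the cycle so that ordinary reference counting can reclaim the nodes:

```perl
use strict;
use warnings;

package Ring;
our $cycles_broken = 0;   # just so we can observe DESTROY firing

sub new {
    my $class = shift;
    my $node = { name => shift };
    $node->{next} = $node;    # self-referential: refcounting alone
                              # would never reclaim this node
    return bless({ node => $node }, $class);   # container holds it
}

sub DESTROY {
    my $self = shift;
    # Manually break the circularity; the node's refcount can now
    # reach zero and be reclaimed.
    undef $self->{node}{next};
    $cycles_broken++;
}

package main;
{
    my $ring = Ring->new("one");
}   # container leaves scope here; DESTROY breaks the cycle
print "cycles broken: $Ring::cycles_broken\n";   # prints 1
```

The container itself has no circular references, so its DESTROY fires as soon as it goes out of scope, which is exactly the hook needed to dismantle the knot inside.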
With objects, an alternative solution is to create a container class that holds a pointer to the self-referential data structure. Define a DESTROY method for the containing object's class that manually breaks the circularities in the self-referential structure. You can find an example of this in Chapter 13 of the Perl Cookbook, in recipe 13.13, "Coping with Circular Data Structures".

When an interpreter shuts down, all its objects are destroyed, which is important for multithreaded or embedded Perl applications. Objects are always destroyed in a separate pass before ordinary references. This is to prevent DESTROY methods from using references that have themselves been destroyed. (And also because plain references are only garbage-collected in embedded interpreters, since exiting a process is a very fast way of reclaiming references. But exiting won't run the object destructors, so Perl does that first.)

Managing Instance Data

Most classes create objects that are essentially just data structures with several internal data fields (instance variables) plus methods to manipulate them. Perl classes inherit methods, not data, but as long as all access to the object is through method calls anyway, this works out fine. If you want data inheritance, you have to effect it through method inheritance. By and large, this is not a necessity in Perl, because most classes store the attributes of their object in an anonymous hash. The object's instance data is contained within this hash, which serves as its own little namespace to be carved up by whatever classes do something with the object. For example, if you want an object called $city to have a data field named elevation, you can simply access $city->{elevation}. No declarations are necessary. But method wrappers have their uses. Suppose you want to implement a Person object.
You decide to have a data field called "name", which by a strange coincidence you'll store under the key name in the anonymous hash that will serve as the object. But you don't want users touching the data directly. To reap the rewards of encapsulation, users need methods to access that instance variable without lifting the veil of abstraction. For example, you might make a pair of accessor methods:

    sub get_name {
        my $self = shift;
        return $self->{name};
    }
    sub set_name {
        my $self = shift;
        $self->{name} = shift;
    }

which leads to code like this:

    $him = Person->new();
    $him->set_name("Frodo");
    $him->set_name( ucfirst($him->get_name) );

You could even combine both methods into one:

    sub name {
        my $self = shift;
        if (@_) { $self->{name} = shift }
        return $self->{name};
    }

This would then lead to code like this:

    $him = Person->new();
    $him->name("Frodo");
    $him->name( ucfirst($him->name) );

The advantage of writing a separate function for each instance variable (which for our Person class might be name, age, height, and so on) is that it is direct, obvious, and flexible. The drawback is that every time you want a new class, you end up defining one or two nearly identical methods per instance variable. This isn't too bad for the first few, and you're certainly welcome to do it that way if you'd like. But when convenience is preferred over flexibility, you might prefer one of the techniques described in the following sections.

Note that we will be varying the implementation, not the interface. If users of your class respect the encapsulation, you'll be able to transparently swap one implementation for another without the users noticing. (Family members in your inheritance tree using your class for a subclass or superclass might not be so forgiving, since they know you far better than strangers do.) If your users have been peeking and poking into the private affairs of your class, the inevitable disaster is their own fault and none of your concern.
All you can do is live up to your end of the contract by maintaining the interface. Trying to stop everyone else in the world from ever doing something slightly wicked will take up all your time and energy — and in the end, fail anyway.

Dealing with family members is more challenging. If a subclass overrides a superclass's attribute accessor, should it access the same field in the hash, or not? An argument can be made either way, depending on the nature of the attribute. For the sake of safety in the general case, each accessor can prefix the name of the hash field with its own classname, so that subclass and superclass can both have their own version. Several of the examples below, including the standard Class::Struct module, use this subclass-safe strategy. You'll see accessors resembling this:

    sub name {
        my $self = shift;
        my $field = __PACKAGE__ . "::name";
        if (@_) { $self->{$field} = shift }
        return $self->{$field};
    }

In each of the following examples, we create a simple Person class with fields name, race, and aliases, each with an identical interface but a completely different implementation. We're not going to tell you which one we like the best, because we like them all the best, depending on the occasion. And tastes differ. Some folks prefer stewed conies; others prefer fissssh.

Field Declarations with use fields

Objects don't have to be implemented as anonymous hashes. Any reference will do.
For example, if you used an anonymous array, you could set up a constructor like this:

    sub new {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        return bless [], $class;
    }

and have accessors like these:

    sub name {
        my $self = shift;
        if (@_) { $self->[0] = shift }
        return $self->[0];
    }
    sub race {
        my $self = shift;
        if (@_) { $self->[1] = shift }
        return $self->[1];
    }
    sub aliases {
        my $self = shift;
        if (@_) { $self->[2] = shift }
        return $self->[2];
    }

Arrays are somewhat faster to access than hashes and don't take up quite as much memory, but they're not at all convenient to use. You have to keep track of the index numbers (not just in your class, but in your superclass, too), which must somehow indicate which pieces of the array your class is using. Otherwise, you might reuse a slot.

The use fields pragma addresses all of these points:

    package Person;
    use fields qw(name race aliases);

This pragma does not create accessor methods for you, but it does rely on some built-in magic (called pseudohashes) to do something similar. (You may wish to wrap accessors around the fields anyway, as we do in the following example.) Pseudohashes are array references that you can use like hashes because they have an associated key map table. The use fields pragma sets this key map up for you, effectively declaring which fields are valid for the Person object; this makes the Perl compiler aware of them. If you declare the type of your object variable (as in my Person $self, in the next example), the compiler is smart enough to optimize access to the fields into straight array accesses. Perhaps more importantly, it validates field names for type safety (well, typo safety, really) at compile time. (See the section "Pseudohashes" in Chapter 8.)
A constructor and sample accessors would look like this:

    package Person;
    use fields qw(name race aliases);
    sub new {
        my $type = shift;
        my Person $self = fields::new(ref $type || $type);
        $self->{name}    = "unnamed";
        $self->{race}    = "unknown";
        $self->{aliases} = [];
        return $self;
    }
    sub name {
        my Person $self = shift;
        $self->{name} = shift if @_;
        return $self->{name};
    }
    sub race {
        my Person $self = shift;
        $self->{race} = shift if @_;
        return $self->{race};
    }
    sub aliases {
        my Person $self = shift;
        $self->{aliases} = shift if @_;
        return $self->{aliases};
    }
    1;

If you misspell one of the literal keys used to access the pseudohash, you won't have to wait until run time to learn about this. The compiler knows what type of object $self is supposed to refer to (because you told it), so it can check that the code accesses only those fields that Person objects actually have. If you have horses on the brain and try to access a nonexistent field (such as $self->{mane}), the compiler can flag this error right away and will never turn the erroneous program over to the interpreter to run.

There's still a bit of repetition in declaring methods to get at instance variables, so you still might like to automate the creation of simple accessor methods using one of the techniques below. However, because all these techniques use some sort of indirection, if you use them, you will lose the compile-time benefits of typo-checking lexically typed hash accesses. You'll still keep the (small) time and space advantages, though.

If you do elect to use a pseudohash to implement your class, any class that inherits from this one must be aware of that underlying pseudohash implementation. If an object is implemented as a pseudohash, all participants in the inheritance hierarchy should employ the use base and use fields declarations.
For example,

    package Wizard;
    use base "Person";
    use fields qw(staff color sphere);

This makes the Wizard module a subclass of class Person, and loads the Person.pm file. It also registers three new fields in this class to go along with those from Person. That way when you write:

    my Wizard $mage = fields::new("Wizard");

you'll get a pseudohash object with access to both classes' fields:

    $mage->name("Gandalf");
    $mage->color("Grey");

Since all subclasses must know that they are using a pseudohash implementation, they should use the direct pseudohash notation for both efficiency and type safety:

    $mage->{name}  = "Gandalf";
    $mage->{color} = "Grey";

If you want to keep your implementations interchangeable, however, outside users of your class must use the accessor methods. Although use base supports only single inheritance, this is seldom a severe restriction. See the descriptions of use base and use fields in Chapter 31.

Generating Classes with Class::Struct

The standard Class::Struct module exports a function named struct. This creates all the trappings you'll need to get started on an entire class. It generates a constructor named new, plus accessor methods for each of the data fields (instance variables) named in that structure. For example, if you put the class in a Person.pm file:

    package Person;
    use Class::Struct;
    struct Person => {       # create a definition for a "Person"
        name    => '$',      # name field is a scalar
        race    => '$',      # race field is also a scalar
        aliases => '@',      # but aliases field is an array ref
    };
    1;

Then you could use that module this way:

    use Person;
    my $mage = Person->new();
    $mage->name("Gandalf");
    $mage->race("Istar");
    $mage->aliases( ["Mithrandir", "Olorin", "Incanus"] );

The Class::Struct module created all four of those methods.
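The generated accessors can be exercised in a single file rather than a separate Person.pm; a quick runnable check (the index-style call on the '@' field follows the module's documented accessor interface):

```perl
use strict;
use warnings;

package Person;
use Class::Struct;
struct Person => {
    name    => '$',
    race    => '$',
    aliases => '@',
};

package main;
my $mage = Person->new();
$mage->name("Gandalf");
$mage->race("Istar");
$mage->aliases( ["Mithrandir", "Olorin", "Incanus"] );

print $mage->name, " the ", $mage->race, "\n";
print "also known as: ", join(", ", @{ $mage->aliases }), "\n";
# '@' accessors also take an index to fetch one element:
print "first alias: ", $mage->aliases(0), "\n";
```

Calling the '@' accessor with no arguments returns the array reference; with an index it returns that element; with an array reference it replaces the whole list.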
Because it follows the subclass-safe policy of always prefixing the field name with the class name, it also permits an inherited class to have its own separate field of the same name as a base class field without conflict. That means in this case that "Person::name" rather than just "name" is used for the hash key for that particular instance variable.

Fields in a struct declaration don't have to be basic Perl types. They can also specify other classes, but classes created with struct work best because the function makes assumptions about how the classes behave that aren't generally true of all classes. For example, the new method for the appropriate class is invoked to initialize the field, but many classes have constructors with other names. See the description of Class::Struct in Chapter 32, Standard Modules, and its online documentation for more information. Many standard modules use Class::Struct to implement their classes, including User::pwent and Net::hostent. Reading their code can prove instructive.

Generating Accessors with Autoloading

As we mentioned earlier, when you invoke a nonexistent method, Perl has two different ways to look for an AUTOLOAD method, depending on whether you declared a stub method. You can use this property to provide access to the object's instance data without writing a separate function for each instance variable. Inside the AUTOLOAD routine, the name of the method actually invoked can be retrieved from the $AUTOLOAD variable.
Consider the following code:

    use Person;
    $him = Person->new;
    $him->name("Aragorn");
    $him->race("Man");
    $him->aliases( ["Strider", "Estel", "Elessar"] );
    printf "%s is of the race of %s.\n", $him->name, $him->race;
    print "His aliases are: ", join(", ", @{$him->aliases}), ".\n";

As before, this version of the Person class implements a data structure with three fields: name, race, and aliases:

    package Person;
    use Carp;

    my %Fields = (
        "Person::name"    => "unnamed",
        "Person::race"    => "unknown",
        "Person::aliases" => [],
    );

    # The next declaration guarantees we get our own autoloader.
    use subs qw(name race aliases);

    sub new {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        my $self = { %Fields, @_ };    # clone like Class::Struct
        bless $self, $class;
        return $self;
    }

    sub AUTOLOAD {
        my $self = shift;
        # only handle instance methods, not class methods
        croak "$self not an object" unless ref($self);
        my $name = our $AUTOLOAD;
        return if $name =~ /::DESTROY$/;
        unless (exists $self->{$name}) {
            croak "Can't access '$name' field in $self";
        }
        if (@_) { return $self->{$name} = shift }
        else    { return $self->{$name} }
    }

As you see, there are no methods named name, race, or aliases anywhere to be found. The AUTOLOAD routine takes care of all that. When someone uses $him->name("Aragorn"), the AUTOLOAD subroutine is called with $AUTOLOAD set to "Person::name". Conveniently, by leaving it fully qualified, it's in exactly the right form for accessing fields of the object hash. That way if you use this class as part of a larger class hierarchy, you don't conflict with uses of the same name in other classes.

Generating Accessors with Closures

Most accessor methods do essentially the same thing: they simply fetch or store a value from that instance variable. In Perl, the most natural way to create a family of near-duplicate functions is looping around a closure.
But closures are anonymous functions lacking names, and methods need to be named subroutines in the class's package symbol table so that they can be called by name. This is no problem — just assign the closure reference to a typeglob of the appropriate name.

    package Person;
    sub new {
        my $invocant = shift;
        my $self = bless({}, ref $invocant || $invocant);
        $self->init();
        return $self;
    }
    sub init {
        my $self = shift;
        $self->name("unnamed");
        $self->race("unknown");
        $self->aliases([]);
    }
    for my $field (qw(name race aliases)) {
        my $slot = __PACKAGE__ . "::$field";
        no strict "refs";    # So symbolic ref to typeglob works.
        *$field = sub {
            my $self = shift;
            $self->{$slot} = shift if @_;
            return $self->{$slot};
        };
    }

Closures are the cleanest hand-rolled way to create a multitude of accessor methods for your instance data. It's efficient for both the computer and you. Not only do all the accessors share the same bit of code (they only need their own lexical pads), but later if you decide to add another attribute, the changes required are minimal: just add one more word to the for loop's list, and perhaps something to the init method.

Using Closures for Private Objects

So far, these techniques for managing instance data have offered no mechanism for "protection" from external access. Anyone outside the class can open up the object's black box and poke about inside — if they don't mind voiding the warranty. Enforced privacy tends to get in the way of people trying to get their jobs done. Perl's philosophy is that it's better to encapsulate one's data with a sign that says:

    IN CASE OF FIRE
    BREAK GLASS

You should respect such encapsulation when possible, but still have easy access to the contents in an emergency situation, like for debugging. But if you do want to enforce privacy, Perl isn't about to get in your way.
Perl offers low-level building blocks that you can use to surround your class and its objects with an impenetrable privacy shield — one stronger, in fact, than that found in many popular object-oriented languages. Lexical scopes and the lexical variables inside them are the key components here, and closures play a pivotal role.

In the section "Private Methods," we saw how a class can use closures to implement methods that are invisible outside the module file. Later we'll look at accessor methods that regulate class data so private that not even the rest of the class has unrestricted access. Those are still fairly traditional uses of closures. The truly interesting approach is to use a closure as the very object itself. The object's instance variables are locked up inside a scope to which the object alone — that is, the closure — has free access. This is a very strong form of encapsulation; not only is it proof against external tampering, even other methods in the same class must use the proper access methods to get at the object's instance data.

Here's an example of how this might work. We'll use closures both for the objects themselves and for the generated accessors:

    package Person;
    sub new {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        my $data = {
            NAME    => "unnamed",
            RACE    => "unknown",
            ALIASES => [],
        };
        my $self = sub {
            my $field = shift;
            #############################
            ### ACCESS CHECKS GO HERE ###
            #############################
            if (@_) { $data->{$field} = shift }
            return $data->{$field};
        };
        bless($self, $class);
        return $self;
    }
    # generate method names
    for my $field (qw(name race aliases)) {
        no strict "refs";    # for access to the symbol table
        *$field = sub {
            my $self = shift;
            return $self->(uc $field, @_);
        };
    }

The object created and returned by the new method is no longer a hash, as it was in other constructors we've looked at. It's a closure with unique access to the attribute data stored in the hash referred to by $data.
Once the constructor call is finished, the only access to $data (and hence to the attributes) is via the closure. In a call like $him->name("Bombadil"), the invoking object stored in $self is the closure that was blessed and returned by the constructor. There's not a lot one can do with a closure beyond calling it, so we do just that with $self->(uc $field, @_). Don't be fooled by the arrow; this is just a regular indirect function call, not a method invocation. The initial argument is the string "NAME", and any remaining arguments are whatever else was passed in.* Once we're executing inside the closure, the hash reference inside $data is again accessible. The closure is then free to permit or deny access to whatever it pleases.

No one outside the closure object has unmediated access to this very private instance data, not even other methods in the class. They could try to call the closure the way the methods generated by the for loop do, perhaps setting an instance variable the class never heard of. But this approach is easily blocked by inserting various bits of code in the constructor where you see the comment about access checks. First, we need a common preamble:

    use Carp;
    local $Carp::CarpLevel = 1;    # Keeps croak messages short
    my ($cpack, $cfile) = caller();

Now for each of the checks.

* Sure, the double-function call is slow, but if you wanted fast, would you really be using objects in the first place?
The first one makes sure the specified attribute name exists:

    croak "No valid field '$field' in object"
        unless exists $data->{$field};

This one allows access only by callers from the same file:

    carp "Unmediated access denied to foreign file"
        unless $cfile eq __FILE__;

This one allows access only by callers from the same package:

    carp "Unmediated access denied to foreign package ${cpack}::"
        unless $cpack eq __PACKAGE__;

And this one allows access only by callers whose classes inherit ours:

    carp "Unmediated access denied to unfriendly class ${cpack}::"
        unless $cpack->isa(__PACKAGE__);

All these checks block unmediated access only. Users of the class who politely use the class's designated methods are under no such restriction. Perl gives you the tools to be just as persnickety as you want to be. Fortunately, not many people want to be. But some people ought to be. Persnickety is good when you're writing flight control software. If you either want or ought to be one of those people, and you prefer using working code over reinventing everything on your own, check out Damian Conway's Tie::SecureHash module on CPAN. It implements restricted hashes with support for public, protected, and private persnicketations. It also copes with the inheritance issues that we've ignored in the previous example.

Damian has also written an even more ambitious module, Class::Contract, that imposes a formal software engineering regimen over Perl's flexible object system. This module's feature list reads like a checklist from a computer science professor's software engineering textbook,* including enforced encapsulation, static inheritance, and design-by-contract condition checking for object-oriented Perl, along with a declarative syntax for attribute, method, constructor, and destructor definitions at both the object and class level, and preconditions, postconditions, and class invariants. Whew!
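Stripped of the access checks, the closure-as-object pattern fits in a short runnable sketch (two fields only, names as before):

```perl
use strict;
use warnings;

package Person;
sub new {
    my $invocant = shift;
    my $class = ref($invocant) || $invocant;
    # Instance data lives only in this lexical; the closure below
    # is the sole door into it.
    my $data = { NAME => "unnamed", RACE => "unknown" };
    my $self = sub {
        my $field = shift;
        $data->{$field} = shift if @_;
        return $data->{$field};
    };
    return bless($self, $class);
}

# Generate the public accessors that call into the closure.
for my $field (qw(name race)) {
    no strict "refs";    # symbolic assignment to the typeglob
    *$field = sub {
        my $self = shift;
        return $self->(uc $field, @_);
    };
}

package main;
my $him = Person->new;
$him->name("Bombadil");
print $him->name, " of race ", $him->race, "\n";
```

Note that ref($him) still reports "Person": blessing works on any referent, code references included, so method dispatch is unchanged even though there is no hash to poke at from outside.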
New Tricks

As of release 5.6 of Perl, you can also declare a method to indicate that it returns an lvalue. This is done with the lvalue subroutine attribute (not to be confused with object attributes). This experimental feature allows you to treat the method as something that would appear on the lefthand side of an equal sign:

    package Critter;
    sub new {
        my $class = shift;
        my $self = { pups => 0, @_ };    # Override default.
        bless $self, $class;
    }
    sub pups : lvalue {                  # We'll assign to pups() later.
        my $self = shift;
        $self->{pups};
    }

    package main;
    $varmint = Critter->new(pups => 4);
    $varmint->pups *= 2;                 # Assign to $varmint->pups!
    $varmint->pups =~ s/(.)/$1$1/;       # Modify $varmint->pups in place!
    print $varmint->pups;                # Now we have 88 pups.

This lets you pretend $varmint->pups is a variable while still obeying encapsulation. See the section "The lvalue Attribute" in Chapter 6, Subroutines.

If you're running a threaded version of Perl and want to ensure that only one thread can call a particular method on an object, you can use the locked and method attributes to do that:

    sub pups : locked method { ... }

When any thread invokes the pups method on an object, Perl locks the object before execution, preventing other threads from doing the same. See the section "The locked and method Attributes" in Chapter 6.

* Can you guess what Damian's job is? By the way, we highly recommend his book, Object Oriented Perl (Manning Publications, 1999).

Managing Class Data

We've looked at several approaches to accessing per-object data values. Sometimes, though, you want some common state shared by all objects of a class. Instead of being an attribute of just one instance of the class, these variables are global to the entire class, no matter which class instance (object) you use to access them. (C++ programmers would think of these as static member data.)
Here are some situations where class variables might come in handy:

• To keep a count of all objects ever created, or how many are still kicking around.
• To keep a list of all objects over which you can iterate.
• To store the name or file descriptor of a log file used by a class-wide debugging method.
• To keep collective data, like the total amount of cash dispensed by all ATMs in a network in a given day.
• To track the last object created by a class, or the most accessed object.
• To keep a cache of in-memory objects that have already been reconstituted from persistent memory.
• To provide an inverted lookup table so you can find an object based on the value of one of its attributes.

The question comes down to deciding where to store the state for those shared attributes. Perl has no particular syntactic mechanism to declare class attributes, any more than it has for instance attributes. Perl provides the developer with a broad set of powerful but flexible features that can be uniquely crafted to the particular demands of the situation. You can then select the mechanism that makes the most sense for the given situation instead of having to live with someone else's design decisions. Alternatively, you can live with the design decisions someone else has packaged up and put onto CPAN. Again, TMTOWTDI.

Like anything else pertaining to a class, class data shouldn't be accessed directly, especially from outside the implementation of the class itself. It doesn't say much for encapsulation to set up carefully controlled accessor methods for instance variables but then invite the public in to diddle your class variables directly, such as by setting $SomeClass::Debug = 1. To establish a clear firewall between interface and implementation, you can create accessor methods to manipulate class data similar to those you use for instance data.

Imagine we want to keep track of the total world population of Critter objects.
We'll store that number in a package variable, but provide a method called population so that users of the class don't have to know about the implementation.

    Critter->population()     # Access via class name
    $gollum->population()     # Access via instance

Since a class in Perl is just a package, the most natural place to store class data is in a package variable. Here's a simple implementation of such a class. The population method ignores its invocant and just returns the current value of the package variable, $Population. (Some programmers like to capitalize their globals.)

    package Critter;
    our $Population = 0;
    sub population { return $Population; }
    sub DESTROY { $Population-- }
    sub spawn {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        $Population++;
        return bless { name => shift || "anon" }, $class;
    }
    sub name {
        my $self = shift;
        $self->{name} = shift if @_;
        return $self->{name};
    }

If you want to make class data methods that work like accessors for instance data, do this:

    our $Debugging = 0;       # class datum
    sub debug {
        shift;                # intentionally ignore invocant
        $Debugging = shift if @_;
        return $Debugging;
    }

Now you can set the overall debug level through the class or through any of its instances. Because it's a package variable, $Debugging is globally accessible. But if you change the our variable to my, then only code later in that same file can see it. You can go still further — you can restrict unfettered access to class attributes even from the rest of the class itself. Wrap the variable declaration in a block scope:

    {
        my $Debugging = 0;    # lexically scoped class datum
        sub debug {
            shift;            # intentionally ignore invocant
            $Debugging = shift if @_;
            return $Debugging;
        }
    }

Now no one is allowed to read or write the class attributes without using the accessor method, since only that subroutine is in the same scope as the variable and has access to it.
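Assembled into one runnable file, the population counter behaves like this (the block scope below is contrived so that DESTROY visibly decrements the count):

```perl
use strict;
use warnings;

package Critter;
our $Population = 0;

sub population { return $Population }
sub DESTROY    { $Population-- }

sub spawn {
    my $invocant = shift;
    my $class = ref($invocant) || $invocant;
    $Population++;
    return bless { name => shift || "anon" }, $class;
}

package main;
my $gollum = Critter->spawn("Gollum");
{
    my $shelob = Critter->spawn("Shelob");
    print "population: ", Critter->population, "\n";   # 2
}   # $shelob's last reference vanishes here; DESTROY runs
print "population: ", $gollum->population, "\n";       # back to 1
```

Both call sites work because population ignores its invocant: the class name and the instance are equally good ways to reach the shared package variable.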
If a derived class inherits these class accessors, then these still access the original data, no matter whether the variables were declared with our or my. The data isn't package-relative. You might look at it as methods executing in the class in which they were originally defined, not in the class that invoked them.

For some kinds of class data, this approach works fine, but for others, it doesn't. Suppose we create a Warg subclass of Critter. If we want to keep our populations separate, Warg can't inherit Critter's population method, because that method as written always returns the value of $Critter::Population.

You'll have to decide on a case-by-case basis whether it makes any sense for class attributes to be package relative. If you want package-relative attributes, use the invocant's class to locate the package holding the class data:

    sub debug {
        my $invocant = shift;
        my $class = ref($invocant) || $invocant;
        my $varname = $class . "::Debugging";
        no strict "refs";       # to access package data symbolically
        $$varname = shift if @_;
        return $$varname;
    }

We temporarily rescind strict references because otherwise we couldn't use the fully qualified symbolic name for the package global. This is perfectly reasonable: since all package variables by definition live in a package, there's nothing wrong with accessing them via that package's symbol table.

Another approach is to make everything an object needs—even its global class data—available via that object (or passed in as parameters). To do this, you'll often have to make a dedicated constructor for each class, or at least have a dedicated initialization routine to be called by the constructor. In the constructor or initializer, you store references to any class data directly in the object itself, so nothing ever has to go looking for it. The accessor methods use the object to find a reference to the data.
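Here is a short sketch of the package-relative accessor in action. The Warg subclass is the one hypothesized above; everything else repeats the debug method just shown:

```perl
package Critter;
sub debug {
    my $invocant = shift;
    my $class = ref($invocant) || $invocant;
    my $varname = $class . "::Debugging";
    no strict "refs";       # access the package datum symbolically
    $$varname = shift if @_;
    return $$varname;
}

package Warg;
our @ISA = ("Critter");     # Warg inherits debug()

package main;
Critter->debug(1);          # sets $Critter::Debugging
Warg->debug(5);             # sets $Warg::Debugging instead
print Critter->debug(), " ", Warg->debug(), "\n";   # 1 5
```

Because the handler builds the variable name from the invocant's class, each package gets its own $Debugging, which is exactly the package-relative behavior the inherited our/my accessors couldn't provide.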
Rather than put the complexity of locating the class data in each method, just let the object tell the method where the data is located. This approach works well only when the class data accessor methods are invoked as instance methods, because the class data could be in unreachable lexicals you couldn't get at using a package name.

No matter how you roll it, package-relative class data is always a bit awkward. It's really a lot cleaner if, when you inherit a class data accessor method, you effectively inherit the state data that it's accessing as well. See the perltootc manpage for numerous, more elaborate approaches to management of class data.

Summary

That's about all there is to it, except for everything else. Now you just need to go off and buy a book about object-oriented design methodology and bang your forehead with it for the next six months or so.

13
Overloading

Objects are cool, but sometimes they're just a little too cool. Sometimes you would rather they behaved a little less like objects and a little more like regular data types. But there's a problem: objects are referents represented by references, and references aren't terribly useful except as references. You can't add references, or print them, or (usefully) apply many of Perl's built-in operators. The only thing you can do is dereference them. So you find yourself writing many explicit method invocations, like this:

    print $object->as_string;
    $new_object = $subject->add($object);

Such explicit dereferencing is in general a good thing; you should never confuse your references with your referents, except when you want to confuse them. Now would be one of those times. If you design your class with overloading, you can pretend the references aren't there and simply say:

    print $object;
    $new_object = $subject + $object;

When you overload one of Perl's built-in operators, you define how it behaves when it's applied to objects of a particular class.
A number of standard Perl modules use overloading, such as Math::BigInt, which lets you create Math::BigInt objects that behave just like regular integers but have no size limits. You can add them with +, divide them with /, compare them with <=>, and print them with print.

Note that overloading is not the same as autoloading, which is loading a missing function or method on demand. Neither is it the same as overriding, which is one function or method masking another. Overloading hides nothing; it adds meaning to an operation that would have been nonsense on a mere reference.

The overload Pragma

The use overload pragma implements operator overloading. You provide it with a key/value list of operators and their associated behaviors:

    package MyClass;

    use overload
        '+'   => \&myadd,                # code reference
        '<'   => "less_than",            # named method
        'abs' => sub { return @_ };      # anonymous subroutine

Now when you try to add two MyClass objects, the myadd subroutine will be called to create the result. When you try to compare two MyClass objects with the < operator, Perl notices that the behavior is specified as a string and interprets the string as a method name and not simply as a subroutine name. In the example above, the less_than method might be supplied by the MyClass package itself or inherited from a base class of MyClass, but the myadd subroutine must be supplied by the current package. The anonymous subroutine for abs supplies itself even more directly. However these routines are supplied, we'll call them handlers.

For unary operators (those taking only one operand, like abs), the handler specified for the class is invoked whenever the operator is applied to an object of that class. For binary operators like + or <, the handler is invoked whenever the first operand is an object of the class or when the second operand is an object of the class and the first operand has no overloading behavior.
That's so you can say either:

    $object + 6

or:

    6 + $object

without having to worry about the order of operands. (In the second case, the operands will be swapped when passed to the handler.) If our expression was:

    $animal + $vegetable

and $animal and $vegetable were objects of different classes, both of which used overloading, then the overloading behavior of $animal would be triggered. (We'll hope the animal likes vegetables.)

There is only one trinary (ternary) operator in Perl, ?:, and you can't overload it. Fortunately.

Overload Handlers

When an overloaded operator is, er, operated, the corresponding handler is invoked with three arguments. The first two arguments are the two operands. If the operator only uses one operand, the second argument is undef. The third argument indicates whether the first two arguments were swapped. Even under the rules of normal arithmetic, some operations, like addition or multiplication, don't usually care about the order of their arguments, but others, like subtraction and division, do.* Consider the difference between:

    $object - 6

and:

    6 - $object

If the first two arguments to a handler have been swapped, the third argument will be true. Otherwise, the third argument will be false, in which case there is a finer distinction as well: if the handler has been triggered by another handler involving assignment (as in += using + to figure out how to add), then the third argument is not merely false, but undef. This distinction enables some optimizations.

As an example, here is a class that lets you manipulate a bounded range of numbers. It overloads both + and - so that the result of adding or subtracting objects constrains the values within the range 0 and 255:

    package ClipByte;

    use overload
        '+' => \&clip_add,
        '-' => \&clip_sub;

    sub new {
        my $class = shift;
        my $value = shift;
        return bless \$value => $class;
    }

    sub clip_add {
        my ($x, $y) = @_;
        my ($value) = ref($x) ? $$x : $x;
        $value += ref($y) ? $$y : $y;
        $value = 255 if $value > 255;
        $value = 0   if $value < 0;
        return bless \$value => ref($x);
    }

    sub clip_sub {
        my ($x, $y, $swap) = @_;
        my ($value) = (ref $x) ? $$x : $x;
        $value -= (ref $y) ? $$y : $y;
        if ($swap) { $value = -$value }
        $value = 255 if $value > 255;
        $value = 0   if $value < 0;
        return bless \$value => ref($x);
    }

    package main;

    $byte1 = ClipByte->new(200);
    $byte2 = ClipByte->new(100);

    $byte3 = $byte1 + $byte2;    # 255
    $byte4 = $byte1 - $byte2;    # 100
    $byte5 = 150 - $byte2;       # 50

You'll note that every function here is by necessity a constructor, so each one takes care to bless its new object back into the current class, whatever that is; we assume our class might be inherited. We also assume that if $y is a reference, it's a reference to an object of our own type. Instead of testing ref($y), we could have called $y->isa("ClipByte") if we wanted to be more thorough (and run slower).

* Your overloaded objects are not required to respect the rules of normal arithmetic, of course, but it's usually best not to surprise people. Oddly, many languages make the mistake of overloading + with string concatenation, which is not commutative and only vaguely additive. For a different approach, see Perl.

Overloadable Operators

You can only overload certain operators, which are shown in Table 13-1. The operators are also listed in the %overload::ops hash made available when you use overload, though the categorization is a little different there.

Table 13-1. Overloadable Operators

    Category        Operators
    Conversion      "" 0+ bool
    Arithmetic      + - * / % ** x . neg
    Logical         !
    Bitwise         & | ~ ^ << >>
    Assignment      += -= *= /= %= **= x= .= <<= >>= ++ --
    Comparison      == < <= > >= != <=> lt le gt ge eq ne cmp
    Mathematical    atan2 cos sin exp abs log sqrt
    Iterative       <>
    Dereference     ${} @{} %{} &{} *{}
    Pseudo          nomethod fallback =

Note that neg, bool, nomethod, and fallback are not actual Perl operators.
The five dereferencers, "", and 0+ probably don't seem like operators either. Nevertheless, they are all valid keys for the parameter list you provide to use overload. This is not really a problem. We'll let you in on a little secret: it's a bit of a fib to say that the overload pragma overloads operators. It overloads the underlying operations, whether invoked explicitly via their "official" operators, or implicitly via some related operator. (The pseudo-operators we mentioned can only be invoked implicitly.) In other words, overloading happens not at the syntactic level, but at the semantic level. The point is not to look good. The point is to do the right thing. Feel free to generalize.

Note also that = does not overload Perl's assignment operator, as you might expect. That would not do the right thing. More on that later.

We'll start by discussing the conversion operators, not because they're the most obvious (they aren't), but because they're the most useful. Many classes overload nothing but stringification, specified by the "" key. (Yes, that really is two double quotes in a row.)

Conversion operators: "", 0+, bool

These three keys let you provide behaviors for Perl's automatic conversions to strings, numbers, and Boolean values, respectively. We say that stringification occurs when any nonstring variable is used as a string. It's what happens when you convert a variable into a string via printing, interpolation, concatenation, or even by using it as a hash key. Stringification is also why you see something like SCALAR(0xba5fe0) when you try to print an object.

We say that numification occurs when a nonnumeric variable is converted into a number in any numeric context, such as any mathematical expression, array index, or even as an operand of the .. range operator.
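To make numification concrete, here is a small sketch of ours (the Temperature class is hypothetical, not from the text) that overloads 0+ and "", so the same object numifies and stringifies differently:

```perl
package Temperature;
use overload
    '0+' => sub { ${ $_[0] } },                  # numification
    '""' => sub { ${ $_[0] } . " degrees" },     # stringification
    fallback => 1;    # let ordinary operators use these conversions

sub new {
    my ($class, $value) = @_;
    return bless \$value => $class;
}

package main;
my $t = Temperature->new(21.5);
print $t + 0.5, "\n";    # numeric context uses the 0+ handler: 22
print "$t\n";            # string context uses "": 21.5 degrees
my @forecast = (18, 21, 25);
print $forecast[$t - 20], "\n";   # array index numifies too: 21
```

With fallback set to a true value, operators we never overloaded (like + here) quietly fall back to the numified or stringified value, which is usually what you want for a value-like class.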
Finally, while nobody here quite has the nerve to call it boolification, you can define how an object should be interpreted in a Boolean context (such as if, unless, while, for, and, or, &&, ||, ?:, or the block of a grep expression) by creating a bool handler.

Any of the three conversion operators can be autogenerated if you have any one of them (we'll explain autogeneration later). Your handlers can return any value you like. Note that if the operation that triggered the conversion is also overloaded, that overloading will occur immediately afterward.

Here's a demonstration of "" that invokes an object's as_string handler upon stringification. Don't forget to quote the quotes:

    package Person;

    use overload q("") => \&as_string;

    sub new {
        my $class = shift;
        return bless { @_ } => $class;
    }

    sub as_string {
        my $self = shift;
        my ($key, $value, $result);
        while (($key, $value) = each %$self) {
            $result .= "$key => $value\n";
        }
        return $result;
    }

    $obj = Person->new(height => 72, weight => 165, eyes => "brown");

    print $obj;

Instead of something like Person=HASH(0xba1350), this prints (in hash order):

    weight => 165
    height => 72
    eyes => brown

(We sincerely hope this person was not measured in kg and cm.)

Arithmetic operators: +, -, *, /, %, **, x, ., neg

These should all be familiar except for neg, which is a special overloading key for the unary minus: the - in -123. The distinction between the neg and - keys allows you to specify different behaviors for unary minus and binary minus, more commonly known as subtraction.

If you overload - but not neg, and then try to use a unary minus, Perl will emulate a neg handler for you. This is known as autogeneration, where certain operators can be reasonably deduced from other operators (on the assumption that the overloaded operators will have the same relationships as the regular operators). Since unary minus can be expressed as a function of binary minus (that is, -123 is equivalent to 0 - 123), Perl doesn't force you to overload neg when - will do. (Of course, if you've arbitrarily defined binary minus to divide the second argument by the first, unary minus will be a fine way to throw a divide-by-0 exception.)

Concatenation via the . operator can be autogenerated via the stringification handler (see "" above).

Logical operator: !

If a handler for ! is not specified, it can be autogenerated using the bool, "", or 0+ handler. If you overload the ! operator, the not operator will also trigger whatever behavior you requested. (Remember our little secret?)
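A quick sketch of that autogeneration (our example, not the book's; the Toggle class is hypothetical): overload bool alone, and both ! and not come along for free:

```perl
package Toggle;
use overload
    bool => sub { ${ $_[0] } };   # truth comes from the payload;
                                  # ! is autogenerated from bool

sub new {
    my ($class, $v) = @_;
    return bless \$v => $class;
}

package main;
my $on  = Toggle->new(1);
my $off = Toggle->new(0);
print "on\n"  if $on;          # bool handler says true
print "off\n" if !$off;        # ! autogenerated from bool
print "not\n" if not $off;     # not triggers the same behavior
```

This prints all three lines, because each Boolean test ends up consulting the single bool handler.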
Since unary minus can be expressed as a function of binary minus (that is, -123 is equivalent to 0 - 123), Perl doesn’t force you to overload neg when - will do. (Of course, if you’ve arbitrarily defined binary minus to divide the second argument by the first, unary minus will be a fine way to throw a divide-by-0 exception.) Concatenation via the . operator can be autogenerated via the stringification handler (see "" above). Logical operator: ! If a handler for ! is not specified, it can be autogenerated using the bool, "", or 0+ handler. If you overload the ! operator, the not operator will also trigger whatever behavior you requested. (Remember our little secret?) Overloadable Operators 353 You may be surprised at the absence of the other logical operators, but most logical operators can’t be overloaded because they short-circuit. They’re really control-flow operators that need to be able to delay evaluation of some of their arguments. That’s also the reason the ?: operator isn’t overloaded. Bitwise operators: &, |, ˜, ˆ, <<, >> The ˜ operator is a unary operator; all the others are binary. Here’s how we could overload >> to do something like chop: package ShiftString; use overload ’>>’ => \&right_shift, ’""’ => sub { ${ $_[0] } }; sub new { my $class = shift; my $value = shift; return bless \$value => $class; } sub right_shift { my ($x, $y) = @_; my $value = $$x; substr($value, -$y) = ""; return bless \$value => ref($x); } $camel = ShiftString->new("Camel"); $ram = $camel >> 2; print $ram; # Cam Assignment operators: +=, -=, *=, /=, %=, **=, x=, .=, <<=, >>=, ++, -These assignment operators might change the value of their arguments or leave them as is. The result is assigned to the lefthand operand only if the new value differs from the old one. This allows the same handler to be used to overload both += and +. 
Although this is permitted, it is seldom recommended, since by the semantics described later under "When an Overload Handler Is Missing (nomethod and fallback)", Perl will invoke the handler for + anyway, assuming += hasn't been overloaded directly.

Concatenation (.=) can be autogenerated using stringification followed by ordinary string concatenation. The ++ and -- operators can be autogenerated from + and - (or += and -=). Handlers implementing ++ and -- are expected to mutate (alter) their arguments. If you wanted autodecrement to work on letters as well as numbers, you could do that with a handler as follows:

    package MagicDec;

    use overload
        q(--) => \&decrement,
        q("") => sub { ${ $_[0] } };

    sub new {
        my $class = shift;
        my $value = shift;
        bless \$value => $class;
    }

    sub decrement {
        my @string = reverse split(//, ${ $_[0] } );
        my $i;
        for ($i = 0; $i < @string; $i++ ) {
            last unless $string[$i] =~ /a/i;
            $string[$i] = chr( ord($string[$i]) + 25 );
        }
        $string[$i] = chr( ord($string[$i]) - 1 );
        my $result = join('', reverse @string);
        $_[0] = bless \$result => ref($_[0]);
    }

    package main;

    for $normal (qw/perl NZ Pa/) {
        $magic = MagicDec->new($normal);
        $magic--;
        print "$normal goes to $magic\n";
    }

That prints out:

    perl goes to perk
    NZ goes to NY
    Pa goes to Oz

exactly reversing Perl's magical string autoincrement operator.

The ++$a operation can be autogenerated using $a += 1 or $a = $a + 1, and $a-- using $a -= 1 or $a = $a - 1. However, this does not trigger the copying behavior that a real ++ operator would. See "The Copy Constructor" later in this chapter.

Comparison operators: ==, <, <=, >, >=, !=, <=>, lt, le, gt, ge, eq, ne, cmp

If <=> is overloaded, it can be used to autogenerate behaviors for <, <=, >, >=, ==, and !=. Similarly, if cmp is overloaded, it can be used to autogenerate behaviors for lt, le, gt, ge, eq, and ne.
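To illustrate that autogeneration, here is a sketch of ours (the Version class and its two-part scheme are hypothetical): overload only <=>, and all six numeric comparisons come along for free:

```perl
package Version;
use overload
    '<=>' => \&vcmp,    # < <= > >= == != all autogenerate from this
    '""'  => sub { join ".", @{ $_[0] } };

sub new {
    my $class = shift;
    return bless [ @_ ] => $class;
}

sub vcmp {
    my ($x, $y, $swap) = @_;
    my $result = ($x->[0] <=> $y->[0]) || ($x->[1] <=> $y->[1]);
    return $swap ? -$result : $result;    # honor swapped operands
}

package main;
my $old = Version->new(5, 6);
my $new = Version->new(5, 8);
print "older\n" if $old < $new;     # < autogenerated from <=>
print "same\n"  if $old == $old;    # == as well
```

Both lines print, since each comparison is rewritten by Perl in terms of the one vcmp handler.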
Note that overloading cmp won't let you sort objects as easily as you'd like, because what will be compared are the stringified versions of the objects instead of the objects themselves. If that was your goal, you'd want to overload "" as well.

Mathematical functions: atan2, cos, sin, exp, abs, log, sqrt

If abs is unavailable, it can be autogenerated from < or <=> combined with either unary minus or subtraction. An overloaded - can be used to autogenerate missing handlers for unary minus or for the abs function, which may also be separately overloaded. (Yes, we know that abs looks like a function, whereas unary minus looks like an operator, but they aren't all that different as far as Perl's concerned.)

Iterative operator: <>

The <> handler can be triggered by using either readline (when it reads from a filehandle, as in while (<FH>)) or glob (when it is used for fileglobbing, as in @files = <*.*>).

    package LuckyDraw;

    use overload
        '<>' => sub {
            my $self = shift;
            return splice @$self, rand @$self, 1;
        };

    sub new {
        my $class = shift;
        return bless [@_] => $class;
    }

    package main;

    $lotto = new LuckyDraw 1 .. 51;

    for (qw(1st 2nd 3rd 4th 5th 6th)) {
        $lucky_number = <$lotto>;
        print "The $_ lucky number is: $lucky_number.\n";
    }

    $lucky_number = <$lotto>;
    print "\nAnd the bonus number is: $lucky_number.\n";

In California, this prints:

    The 1st lucky number is: 18.
    The 2nd lucky number is: 11.
    The 3rd lucky number is: 40.
    The 4th lucky number is: 7.
    The 5th lucky number is: 51.
    The 6th lucky number is: 33.

    And the bonus number is: 5.

Dereference operators: ${}, @{}, %{}, &{}, *{}

Attempts to dereference scalar, array, hash, subroutine, and glob references can be intercepted by overloading these five symbols. The online Perl documentation for overload demonstrates how you can use this operator to simulate your own pseudohashes. Here's a simpler example that implements an object as an anonymous array but permits hash referencing.
Don't try to treat it as a real hash; you won't be able to delete key/value pairs from the object. If you want to combine array and hash notations, use a real pseudohash (as it were).

    package PsychoHash;

    use overload '%{}' => \&as_hash;

    sub as_hash {
        my ($x) = shift;
        return { @$x };
    }

    sub new {
        my $class = shift;
        return bless [ @_ ] => $class;
    }

    $critter = new PsychoHash( height => 72, weight => 365, type => "camel" );

    print $critter->{weight};    # prints 365

Also see Chapter 14, Tied Variables, for a mechanism to let you redefine basic operations on hashes, arrays, and scalars.

When overloading an operator, try not to create objects with references to themselves. For instance,

    use overload '+' => sub { bless [ \$_[0], \$_[1] ] };

This is asking for trouble, since if you say $animal += $vegetable, the result will make $animal a reference to a blessed array reference whose first element is $animal. This is a circular reference, which means that even if you destroy $animal, its memory won't be freed until your process (or interpreter) terminates. See "Garbage Collection, Circular References, and Weak References" in Chapter 8, References.

The Copy Constructor (=)

Although it looks like a regular operator, = has a special and slightly subintuitive meaning as an overload key. It does not overload the Perl assignment operator. It can't, because that operator has to be reserved for assigning references, or everything breaks. The handler for = is used in situations where a mutator (such as ++, --, or any of the assignment operators) is applied to a reference that shares its object with another reference. The = handler lets you intercept the mutator and copy the object yourself so that the copy alone is mutated. Otherwise, you'd clobber the original.

    $copy = $original;    # copies only the reference
    ++$copy;              # changes underlying shared object

Now, bear with us. Suppose that $original is a reference to an object.
To make ++$copy modify only $copy and not $original, a copy of $copy is first made, and $copy is assigned a reference to this new object. This operation is not performed until ++$copy is executed, so $copy coincides with $original before the increment — but not afterward. In other words, it's the ++ that recognizes the need for the copy and calls out to your copy constructor.

The need for copying is recognized only by mutators such as ++ or +=, or by nomethod, which is described later. If the operation is autogenerated via +, as in:

    $copy = $original;
    $copy = $copy + 1;

then no copying occurs, because + doesn't know it's being used as a mutator.

If the copy constructor is required during the execution of some mutator, but a handler for = was not specified, it can be autogenerated as a string copy provided the object is a plain scalar and not something fancier. For example, the code actually executed for the sequence:

    $copy = $original;
    ...
    ++$copy;

might end up as something like this:

    $copy = $original;
    ...
    $copy = $copy->clone(undef, "");
    $copy->incr(undef, "");

This assumes $original points to an overloaded object, ++ was overloaded with \&incr, and = was overloaded with \&clone. Similar behavior is triggered by $copy = $original++, which is interpreted as $copy = $original; ++$original.

When an Overload Handler Is Missing (nomethod and fallback)

If you apply an unoverloaded operator to an object, Perl first tries to autogenerate a behavior from other overloaded operators using the rules described earlier. If that fails, Perl looks for an overloading behavior for nomethod and uses that if available. That handler is to operators what an AUTOLOAD subroutine is to subroutines: it's what you do when you can't think of what else to do.

If used, the nomethod key should be followed by a reference to a handler that accepts four arguments (not three as all the other handlers expect).
The first three arguments are no different than in any other handler; the fourth is a string corresponding to the operator whose handler is missing. This serves the same purpose as the $AUTOLOAD variable does in AUTOLOAD subroutines. If Perl has to look for a nomethod handler but can't find one, an exception is raised.

If you want to prevent autogeneration from occurring, or you want a failed autogeneration attempt to result in no overloading at all, you can define the special fallback overloading key. It has three useful states:

undef
    If fallback is not set, or is explicitly set to undef, the sequence of overloading events is unaffected: handlers are sought, autogeneration is attempted, and finally the nomethod handler is invoked. If that fails, an exception is raised.

false
    If fallback is set to a defined but false value (like 0), autogeneration is never attempted. Perl will call the nomethod handler if one exists, but raise an exception otherwise.

true
    This is nearly the same behavior as for undef, but no exception is raised if an appropriate handler cannot be synthesized via autogeneration. Instead, Perl reverts to following the unoverloaded behavior for that operator, as though there were no use overload pragma in the class at all.

Overloading Constants

You can change how constants are interpreted by Perl with overload::constant, which is most usefully placed in a package's import method. (If you do this, you should properly invoke overload::remove_constant in the package's unimport method so that the package can clean up after itself when you ask it to.)

Both overload::constant and overload::remove_constant expect a list of key/value pairs. The keys should be any of integer, float, binary, q, and qr, and each value should be the name of a subroutine, an anonymous subroutine, or a code reference that will handle the constants.
    sub import {
        overload::constant (
            integer => \&integer_handler,
            float   => \&float_handler,
            binary  => \&base_handler,
            q       => \&string_handler,
            qr      => \&regex_handler,
        );
    }

Any handlers you provide for integer and float will be invoked whenever the Perl tokener encounters a constant number. This is independent of the use constant pragma; simple statements such as

    $year = cube(12) + 1;      # integer
    $pi   = 3.14159265358979;  # float

will trigger whatever handler you requested.

The binary key lets you intercept binary, octal, and hexadecimal constants. q handles single-quoted strings (including strings introduced with q) and constant substrings within qq- and qx-quoted strings and here documents. Finally, qr handles constant pieces within regular expressions, as described at the end of Chapter 5, Pattern Matching.

The handler will be passed three arguments. The first argument is the original constant, in whatever form it was provided to Perl. The second argument is how Perl actually interpreted the constant; for instance, 123_456 will appear as 123456. The third argument is defined only for strings handled by the q and qr handlers, and will be one of qq, q, s, or tr depending on how the string is to be used. qq means that the string is from an interpolated context, such as double quotes, backticks, an m// match, or the pattern of an s/// substitution. q means that the string is from an uninterpolated context, s means that the constant is a replacement string in an s/// substitution, and tr means that it's a component of a tr/// or y/// expression.

The handler should return a scalar, which will be used in place of the constant.
Often, that scalar will be a reference to an overloaded object, but there's nothing preventing you from doing something more dastardly:

    package DigitDoubler;    # A module to be placed in DigitDoubler.pm

    use overload;

    sub import {
        overload::constant (
            integer => \&handler,
            float   => \&handler,
        );
    }

    sub handler {
        my ($orig, $interp, $context) = @_;
        return $interp * 2;    # double all constants
    }

    1;

Note that handler is shared by both keys, which works okay in this case. Now when you say:

    use DigitDoubler;

    $trouble  = 123;     # trouble is now 246
    $jeopardy = 3.21;    # jeopardy is now 6.42

you redefine the world.

If you intercept string constants, it is recommended that you provide a concatenation operator (".") as well, since an interpolated expression like "ab$cd!!" is merely a shortcut for the longer 'ab' . $cd . '!!'. Similarly, negative numbers are considered negations of positive constants, so you should provide a handler for neg when you intercept integers or floats. (We didn't need to do that earlier, because we're returning actual numbers, not overloaded object references.)

Note that overload::constant does not propagate into run-time compilation inside eval, which can be either a bug or a feature depending on how you look at it.

Public Overload Functions

As of the 5.6 release of Perl, the use overload pragma provides the following functions for public consumption.

overload::StrVal(OBJ)
    This function returns the string value that OBJ would have in the absence of stringification overloading ("").

overload::Overloaded(OBJ)
    This function returns a true value if OBJ is subject to any operator overloading at all, and false otherwise.

overload::Method(OBJ, OPERATOR)
    This function returns a reference to whatever code implements the overloading for OPERATOR when it operates on OBJ, or undef if no such overloading exists.

Inheritance and Overloading

Inheritance interacts with overloading in two ways.
The first occurs when a handler is named as a string rather than provided as a code reference or anonymous subroutine. When named as a string, the handler is interpreted as a method, and can therefore be inherited from superclasses.

The second interaction between inheritance and overloading is that any class derived from an overloaded class is itself subject to that overloading. In other words, overloading is itself inherited. The set of handlers in a class is the union of handlers of all that class's ancestors, recursively. If a handler can be found in several different ancestors, the handler actually used is governed by the usual rules for method inheritance. For example, if class Alpha inherits from classes Beta and Gamma in that order, and class Beta overloads + with \&Beta::plus_sub, but class Gamma overloads + with the string "plus_meth", then Beta::plus_sub will be called when you try to apply + to an Alpha object.

Since the value of the fallback key is not a handler, its inheritance is not governed by the rules given above. In the current implementation, the fallback value from the first overloaded ancestor is used, but this is accidental and subject to change without notice (well, without much notice).

Run-Time Overloading

Since use statements are executed at compile time, the only way to change overloading during run time is:

    eval " use overload '+' => \&my_add ";

You can also say:

    eval " no overload '+', '--', '<=' ";

although the use of these constructs during run time is questionable.

Overloading Diagnostics

If your Perl was compiled with -DDEBUGGING, you can view diagnostic messages for overloading when you run a program with the -Do switch or its equivalent. You can also deduce which operations are overloaded using the m command of Perl's built-in debugger.

If you're feeling overloaded now, maybe the next chapter will tie things back together for you.

14
Tied Variables

Some human endeavors require a disguise.
Sometimes the intent is to deceive, but more often, the intent is to communicate something true at a deeper level. For instance, many job interviewers expect you to dress up in a tie to indicate that you're seriously interested in fitting in, even though both of you know you'll never wear a tie on the job. It's odd when you think about it: tying a piece of cloth around your neck can magically get you a job. In Perl culture, the tie operator plays a similar role: it lets you create a seemingly normal variable that, behind the disguise, is actually a full-fledged Perl object that is expected to have an interesting personality of its own. It's just an odd bit of magic, like pulling Bugs Bunny out of a hat.

Put another way, the funny characters $, @, %, or * in front of a variable name tell Perl and its programmers a great deal—they each imply a particular set of archetypal behaviors. You can warp those behaviors in various useful ways with tie, by associating the variable with a class that implements a new set of behaviors. For instance, you can create a regular Perl hash, and then tie it to a class that makes the hash into a database, so that when you read values from the hash, Perl magically fetches data from an external database file, and when you set values in the hash, Perl magically stores data in the external database file. In this case, "magically" means "transparently doing something very complicated". You know the old saying: any technology sufficiently advanced is indistinguishable from a Perl script. (Seriously, people who play with the guts of Perl use magic as a technical term referring to any extra semantics attached to variables such as %ENV or %SIG. Tied variables are just an extension of that.)

Perl already has built-in dbmopen and dbmclose functions that magically tie hash variables to databases, but those functions date back to the days when Perl had no tie. Now tie provides a more general mechanism.
In fact, Perl itself implements dbmopen and dbmclose in terms of tie. You can tie a scalar, array, hash, or filehandle (via its typeglob) to any class that provides appropriately named methods to intercept and emulate normal accesses to those variables. The first of those methods is invoked at the point of the tie itself: tying a variable always invokes a constructor, which, if successful, returns an object that Perl squirrels away where you don't see it, down inside the "normal" variable. You can always retrieve that object later using the tied function on the normal variable:

    tie VARIABLE, CLASSNAME, LIST;    # binds VARIABLE to CLASSNAME
    $object = tied VARIABLE;

Those two lines are equivalent to:

    $object = tie VARIABLE, CLASSNAME, LIST;

Once it's tied, you treat the normal variable normally, but each access automatically invokes methods on the underlying object; all the complexity of the class is hidden behind those method invocations. If later you want to break the association between the variable and the class, you can untie the variable:

    untie VARIABLE;

You can almost think of tie as a funny kind of bless, except that it blesses a bare variable instead of an object reference. It also can take extra parameters, just as a constructor can—which is not terribly surprising, since it actually does invoke a constructor internally, whose name depends on which type of variable you're tying: either TIESCALAR, TIEARRAY, TIEHASH, or TIEHANDLE.* These constructors are invoked as class methods with the specified CLASSNAME as their invocant, plus any additional arguments you supplied in LIST. (The VARIABLE is not passed to the constructor.) These four constructors each return an object in the customary fashion. They don't really care whether they were invoked from tie, nor do any of the other methods in the class, since you can always invoke them directly if you'd like. In one sense, all the magic is in the tie, not in the class implementing the tie.
It's just an ordinary class with funny method names, as far as the class is concerned. (Indeed, some tied modules provide extra methods that aren't visible through the tied variable; these methods must be called explicitly as you would any other object method. Such extra methods might provide services like file locking, transaction protection, or anything else an instance method might do.)

So these constructors bless and return an object reference just as any other constructor would. That reference need not refer to the same type of variable as the one being tied; it just has to be blessed, so that the tied variable can find its way back to your class for succor. For instance, our long TIEARRAY example will use a hash-based object, so it can conveniently hold additional information about the array it's emulating.

The tie function will not use or require a module for you—you must do that yourself explicitly, if necessary, before calling the tie. (On the other hand, the dbmopen function will, for backward compatibility, attempt to use one or another DBM implementation. But you can preempt its selection with an explicit use, provided the module you use is one of the modules in dbmopen's list of modules to try. See the online docs for the AnyDBM_File module for a fuller explanation.)

The methods called by a tied variable have predetermined names like FETCH and STORE, since they're invoked implicitly (that is, triggered by particular events) from within the innards of Perl. These names are in ALLCAPS, a convention we often follow for such implicitly called routines.

* Since the constructors have separate names, you could even provide a single class that implements all of them. That would allow you to tie scalars, arrays, hashes, and filehandles all to the same class, although this is not generally done, since it would make the other magical methods tricky to write.
(Other special names that follow this convention include BEGIN, CHECK, INIT, END, DESTROY, and AUTOLOAD, not to mention UNIVERSAL->VERSION. In fact, nearly all of Perl's predefined variables and filehandles are in uppercase: STDIN, SUPER, CORE, CORE::GLOBAL, DATA, @EXPORT, @INC, @ISA, @ARGV, and %ENV. Of course, built-in operators and pragmas go to the opposite extreme and have no capitals at all.) The first thing we'll cover is extremely simple: how to tie a scalar variable.

Tying Scalars

To implement a tied scalar, a class must define the following methods: TIESCALAR, FETCH, and STORE (and possibly DESTROY). When you tie a scalar variable, Perl calls TIESCALAR. When you read the tied variable, it calls FETCH, and when you assign a value to the variable, it calls STORE. If you've kept the object returned by the initial tie (or if you retrieve it later using tied), you can access the underlying object yourself—this does not trigger its FETCH or STORE methods. As an object, it's not magical at all, but rather quite objective.

If a DESTROY method exists, Perl invokes it when the last reference to the tied object disappears, just as for any other object. That happens when your program ends or when you call untie, which eliminates the reference used by the tie. However, untie doesn't eliminate any outstanding references you might have stored elsewhere; DESTROY is deferred until those references are gone, too.

The Tie::Scalar and Tie::StdScalar packages, both found in the standard Tie::Scalar module, provide some simple base class definitions if you don't want to define all of these methods yourself. Tie::Scalar provides elemental methods that do very little, and Tie::StdScalar provides methods that make a tied scalar behave like a regular Perl scalar.
(Which seems singularly useless, but sometimes you just want a bit of a wrapper around the ordinary scalar semantics, for example, to count the number of times a particular variable is set.)

Before we show you our elaborate example and complete description of all the mechanics, here's a taste just to whet your appetite—and to show you how easy it really is. Here's a complete program:

    #!/usr/bin/perl
    package Centsible;
    sub TIESCALAR { bless \my $self, shift }
    sub STORE { ${ $_[0] } = $_[1] }                      # do the default thing
    sub FETCH { sprintf "%.02f", ${ my $self = shift } }  # round value

    package main;
    tie $bucks, "Centsible";
    $bucks = 45.00;
    $bucks *= 1.0715;     # tax
    $bucks *= 1.0715;     # and double tax!
    print "That will be $bucks, please.\n";

When run, that program produces:

    That will be 51.67, please.

To see the difference it makes, comment out the call to tie; then you'll get:

    That will be 51.66505125, please.

Admittedly, that's more work than you'd normally go through to round numbers.

Scalar-Tying Methods

Now that you've seen a sample of what's to come, let's develop a more elaborate scalar-tying class. Instead of using any canned package for the base class (especially since scalars are so simple), we'll look at each of the four methods in turn, building an example class named ScalarFile. Scalars tied to this class contain regular strings, and each such variable is implicitly associated with a file where that string is stored. (You might name your variables to remind you which file you're referring to.) Variables are tied to the class this way:

    use ScalarFile;        # load ScalarFile.pm
    tie $camel, "ScalarFile", "/tmp/camel.lot";

Once the variable has been tied, its previous contents are clobbered, and the internal connection between the variable and its object overrides the variable's normal semantics.
When you ask for the value of $camel, it now reads the contents of /tmp/camel.lot, and when you assign a value to $camel, it writes the new contents out to /tmp/camel.lot, obliterating any previous occupants. The tie is on the variable, not the value, so the tied nature of a variable does not propagate across assignment. For example, let's say you copy a variable that's been tied:

    $dromedary = $camel;

Instead of reading the value in the ordinary fashion from the $camel scalar variable, Perl invokes the FETCH method on the associated underlying object. It's as though you'd written this:

    $dromedary = (tied $camel)->FETCH();

Or if you remember the object returned by tie, you could use that reference directly, as in the following sample code:

    $clot = tie $camel, "ScalarFile", "/tmp/camel.lot";
    $dromedary = $camel;            # through the implicit interface
    $dromedary = $clot->FETCH();    # same thing, but explicitly

If the class provides methods besides TIESCALAR, FETCH, STORE, and DESTROY, you could use $clot to invoke them manually. However, one normally minds one's own business and leaves the underlying object alone, which is why you often see the return value from tie ignored. You can still get at the object via tied if you need it later (for example, if the class happens to document any extra methods you need). Ignoring the returned object also eliminates certain kinds of errors, which we'll cover later.

Here's the preamble of our class, which we will put into ScalarFile.pm:

    package ScalarFile;
    use Carp;                 # Propagate error messages nicely.
    use strict;               # Enforce some discipline on ourselves.
    use warnings;             # Turn on lexically scoped warnings.
    use warnings::register;   # Allow user to say "use warnings 'ScalarFile'".
    my $count = 0;            # Internal count of tied ScalarFiles.

The standard Carp module exports the carp, croak, and confess subroutines, which we'll use in the code later in this section. As usual, see Chapter 32, Standard Modules, or the online docs for more about Carp.
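Before moving on, a word about that use warnings::register line: it creates a warnings category named after the package, which callers can switch on and off lexically and which the class can test with warnings::enabled. Here is a minimal self-contained sketch of the mechanism (our illustration, not the book's code; the package name Gadget is made up):

```perl
use strict;
use warnings;

package Gadget;
use warnings::register;      # registers a warnings category named "Gadget"

sub poke {
    # warnings::enabled() is true only if the *caller's* lexical scope
    # has the "Gadget" category enabled.
    warn "Gadget: dubious poke\n" if warnings::enabled();
}

package main;
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };   # collect warnings

{
    use warnings 'Gadget';   # category enabled in this lexical scope
    Gadget::poke();          # warns
}
{
    no warnings 'Gadget';    # category disabled here
    Gadget::poke();          # silent
}
print scalar(@caught), " warning(s) caught\n";        # 1 warning(s) caught
```

ScalarFile uses exactly this test, via warnings::enabled(), before carping in its constructor below.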
The following methods are defined by the class.

CLASSNAME->TIESCALAR(LIST)

The TIESCALAR method of the class is triggered whenever you tie a scalar variable. The optional LIST contains any parameters needed to initialize the object properly. (In our example, there is only one parameter: the name of the file.) The method should return an object, but this doesn't have to be a reference to a scalar. In our example, though, it is.

    sub TIESCALAR {           # in ScalarFile.pm
        my $class    = shift;
        my $filename = shift;
        $count++;             # A file-scoped lexical, private to class.
        return bless \$filename, $class;
    }

Since there's no scalar equivalent to the anonymous array and hash composers, [] and {}, we merely bless a lexical variable's referent, which effectively becomes anonymous as soon as the name goes out of scope. This works fine (you could do the same thing with arrays and hashes) as long as the variable really is lexical. If you try this trick on a global, you might think you're getting away with it, until you try to create another camel.lot. Don't be tempted to write something like this:

    sub TIESCALAR { bless \$_[1], $_[0] }    # WRONG, could refer to global.

A more robustly written constructor might check that the filename is accessible. We check first to see if the file is readable, since we don't want to clobber the existing value. (In other words, we shouldn't assume the user is going to write first. They might be treasuring their old Camel Lot file from a previous run of the program.) If we can't open or create the filename specified, we'll indicate the error gently by returning undef and optionally printing a warning via carp. (We could just croak instead—it's a matter of taste whether you prefer fish or frogs.)
We'll use the warnings pragma to determine whether the user is interested in our warning:

    sub TIESCALAR {           # in ScalarFile.pm
        my $class    = shift;
        my $filename = shift;
        my $fh;
        if (open $fh, "<", $filename or
            open $fh, ">", $filename)
        {
            close $fh;
            $count++;
            return bless \$filename, $class;
        }
        carp "Can't tie $filename: $!" if warnings::enabled();
        return;
    }

Given such a constructor, we can now associate the scalar $string with the file camel.lot:

    tie ($string, "ScalarFile", "camel.lot") or die;

(We're still assuming some things we shouldn't. In a production version of this, we'd probably open the filehandle once and remember the filehandle as well as the filename for the duration of the tie, keeping the handle exclusively locked with flock the whole time. Otherwise we're open to race conditions—see "Timing Glitches" in Chapter 23, Security.)

SELF->FETCH

This method is invoked whenever you access the tied variable (that is, read its value). It takes no arguments beyond the object tied to the variable. In our example, that object contains the filename.

    sub FETCH {
        my $self = shift;
        confess "I am not a class method" unless ref $self;
        return unless open my $fh, $$self;
        read($fh, my $value, -s $fh);  # NB: don't use -s on pipes!
        return $value;
    }

This time we've decided to blow up (raise an exception) if FETCH gets something other than a reference. (Either it was invoked as a class method, or someone miscalled it as a subroutine.) There's no other way for us to return an error, so it's probably the right thing to do. In fact, Perl would have raised an exception in any event as soon as we tried to dereference $self; we're just being polite and using confess to spew a complete stack backtrace onto the user's screen. (If that can be considered polite.) We can now see the contents of camel.lot when we say this:

    tie($string, "ScalarFile", "camel.lot");
    print $string;

SELF->STORE(VALUE)

This method is run when the tied variable is set (assigned).
The first argument, SELF, is as always the object associated with the variable; VALUE is whatever was assigned to the variable. (We use the term "assigned" loosely—any operation that modifies the variable can call STORE.)

    sub STORE {
        my($self,$value) = @_;
        ref $self                  or confess "not a class method";
        open my $fh, ">", $$self   or croak "can't clobber $$self: $!";
        syswrite($fh, $value) == length $value
                                   or croak "can't write to $$self: $!";
        close $fh                  or croak "can't close $$self: $!";
        return $value;
    }

After "assigning" it, we return the new value—because that's what assignment does. If the assignment wasn't successful, we croak out the error. Possible causes might be that we didn't have permission to write to the associated file, or the disk filled up, or gremlins infested the disk controller. Sometimes you control the magic, and sometimes the magic controls you.

We can now write to camel.lot when we say this:

    tie($string, "ScalarFile", "camel.lot");
    $string  = "Here is the first line of camel.lot\n";
    $string .= "And here is another line, automatically appended.\n";

SELF->DESTROY

This method is triggered when the object associated with the tied variable is about to be garbage collected, in case it needs to do something special to clean up after itself. As with other classes, such a method is seldom necessary, since Perl deallocates the moribund object's memory for you automatically. Here, we'll define a DESTROY method that decrements our count of tied files:

    sub DESTROY {
        my $self = shift;
        confess "wrong type" unless ref $self;
        $count--;
    }

We might then also supply an extra class method to retrieve the current count. Actually, it doesn't care whether it's called as a class method or an object method, but you don't have an object anymore after the DESTROY, now do you?
    sub count {               # in ScalarFile.pm
        my $invocant = shift;
        $count;
    }

You can call this as a class method at any time like this:

    if (ScalarFile->count) {
        warn "Still some tied ScalarFiles sitting around somewhere...\n";
    }

That's about all there is to it. Actually, it's more than all there is to it, since we've done a few nice things here for the sake of completeness, robustness, and general aesthetics (or lack thereof). Simpler TIESCALAR classes are certainly possible.

Magical Counter Variables

Here's a simple Tie::Counter class, inspired by the CPAN module of the same name. Variables tied to this class increment themselves by 1 every time they're used. For example:

    tie my $counter, "Tie::Counter", 100;
    @array = qw/Red Green Blue/;
    for my $color (@array) {         # Prints:
        print " $counter $color\n";  #    100 Red
    }                                #    101 Green
                                     #    102 Blue

The constructor takes as an optional extra argument the first value of the counter, which defaults to 0. Assigning to the counter will set a new value. Here's the class:

    package Tie::Counter;
    sub FETCH { ++ ${ $_[0] } }
    sub STORE { ${ $_[0] } = $_[1] }
    sub TIESCALAR {
        my ($class, $value) = @_;
        $value = 0 unless defined $value;
        bless \$value => $class;
    }
    1;  # if in module

See how small that is? It doesn't take much code to put together a class like this.

Magically Banishing $_

This curiously exotic tie class is used to outlaw unlocalized uses of $_. Instead of pulling in the module with use, which invokes the class's import method, this module should be loaded with no to call the seldom-used unimport method. The user says:

    no Underscore;

And then all uses of $_ as an unlocalized global raise an exception.
Here's a little test suite for the module:

    #!/usr/bin/perl
    no Underscore;
    @tests = (
        "Assignment"  =>  sub { $_ = "Bad" },
        "Reading"     =>  sub { print },
        "Matching"    =>  sub { $x = /badness/ },
        "Chop"        =>  sub { chop },
        "Filetest"    =>  sub { -x },
        "Nesting"     =>  sub { for (1..3) { print } },
    );
    while ( ($name, $code) = splice(@tests, 0, 2) ) {
        print "Testing $name: ";
        eval { &$code };
        print $@ ? "detected" : " missed!";
        print "\n";
    }

which prints out the following:

    Testing Assignment: detected
    Testing Reading: detected
    Testing Matching: detected
    Testing Chop: detected
    Testing Filetest: detected
    Testing Nesting: 123 missed!

The last one was "missed" because it was properly localized by the for loop and thus safe to access.

Here's the curiously exotic Underscore module itself. (Did we mention that it's curiously exotic?) It works because tied magic is effectively hidden by a local. The module does the tie in its own initialization code so that a require also works.

    package Underscore;
    use Carp;
    sub TIESCALAR { bless \my $dummy => shift }
    sub FETCH { croak 'Read access to $_ forbidden' }
    sub STORE { croak 'Write access to $_ forbidden' }
    sub unimport { tie($_, __PACKAGE__) }
    sub import { untie $_ }
    tie($_, __PACKAGE__) unless tied $_;
    1;

It's hard to usefully mix calls to use and no for this class in your program, because they all happen at compile time, not run time. You could call Underscore->import and Underscore->unimport directly, just as use and no do. Normally, though, to renege and let yourself freely use $_ again, you'd just use local on it, which is the whole point.

Tying Arrays

A class implementing a tied array must define at least the methods TIEARRAY, FETCH, and STORE. There are many optional methods: the ubiquitous DESTROY method, of course, but also the STORESIZE and FETCHSIZE methods used to provide $#array and scalar(@array) access.
In addition, CLEAR is triggered when Perl needs to empty the array, and EXTEND when Perl would have pre-extended allocation in a real array. You may also define the POP, PUSH, SHIFT, UNSHIFT, SPLICE, DELETE, and EXISTS methods if you want the corresponding Perl functions to work on the tied array. The Tie::Array class can serve as a base class to implement the first five of those functions in terms of FETCH and STORE. (Tie::Array's default implementation of DELETE and EXISTS simply calls croak.) As long as you define FETCH and STORE, it doesn't matter what kind of data structure your object contains.

On the other hand, the Tie::StdArray class (defined in the standard Tie::Array module) provides a base class with default methods that assume the object contains a regular array. Here's a simple array-tying class that makes use of this. Because it uses Tie::StdArray as its base class, it only needs to define the methods that should be treated in a nonstandard way.

    #!/usr/bin/perl
    package ClockArray;
    use Tie::Array;
    our @ISA = 'Tie::StdArray';
    sub FETCH {
        my($self,$place) = @_;
        $self->[ $place % 12 ];
    }
    sub STORE {
        my($self,$place,$value) = @_;
        $self->[ $place % 12 ] = $value;
    }

    package main;
    tie my @array, 'ClockArray';
    @array = ( "a" .. "z" );
    print "@array\n";

When run, the program prints out "y z o p q r s t u v w x". This class provides an array with only a dozen slots, like hours of a clock, numbered 0 through 11. If you ask for the 15th array index, you really get the 3rd one. Think of it as a travel aid for people who haven't learned how to read 24-hour clocks.

Array-Tying Methods

That's the simple way. Now for some nitty-gritty details. To demonstrate, we'll implement an array whose bounds are fixed at its creation. If you try to access anything beyond those bounds, an exception is raised.
For example:

    use BoundedArray;
    tie @array, "BoundedArray", 2;

    $array[0] = "fine";
    $array[1] = "good";
    $array[2] = "great";
    $array[3] = "whoa";   # Prohibited; displays an error message.

The preamble code for the class is as follows:

    package BoundedArray;
    use Carp;
    use strict;

To avoid having to define SPLICE later, we'll inherit from the Tie::Array class:

    use Tie::Array;
    our @ISA = ("Tie::Array");

CLASSNAME->TIEARRAY(LIST)

As the constructor for the class, TIEARRAY should return a blessed reference through which the tied array will be emulated. In this next example, just to show you that you don't really have to return an array reference, we'll choose a hash reference to represent our object. A hash works out well as a generic record type: the value in the hash's "BOUND" key will store the maximum bound allowed, and its "DATA" value will hold the actual data. If someone outside the class tries to dereference the object returned (doubtless thinking it an array reference), an exception is raised.

    sub TIEARRAY {
        my $class = shift;
        my $bound = shift;
        confess "usage: tie(\@ary, 'BoundedArray', max_subscript)"
            if @_ || $bound =~ /\D/;
        return bless { BOUND => $bound, DATA => [] }, $class;
    }

We can now say:

    tie(@array, "BoundedArray", 3);   # maximum allowable index is 3

to ensure that the array will never have more than four elements. Whenever an individual element of the array is accessed or stored, FETCH and STORE will be called just as they were for scalars, but with an extra index argument.

SELF->FETCH(INDEX)

This method is run whenever an individual element in the tied array is accessed. It receives one argument after the object: the index of the value we're trying to fetch.

    sub FETCH {
        my ($self, $index) = @_;
        if ($index > $self->{BOUND}) {
            confess "Array OOB: $index > $self->{BOUND}";
        }
        return $self->{DATA}[$index];
    }

SELF->STORE(INDEX, VALUE)

This method is invoked whenever an element in the tied array is set.
It takes two arguments after the object: the index at which we're trying to store something and the value we're trying to put there. For example:

    sub STORE {
        my($self, $index, $value) = @_;
        if ($index > $self->{BOUND} ) {
            confess "Array OOB: $index > $self->{BOUND}";
        }
        return $self->{DATA}[$index] = $value;
    }

SELF->DESTROY

Perl calls this method when the tied variable needs to be destroyed and its memory reclaimed. This is almost never needed in a language with garbage collection, so for this example we'll just leave it out.

SELF->FETCHSIZE

The FETCHSIZE method should return the total number of items in the tied array associated with SELF. It's equivalent to scalar(@array), which is usually equal to $#array + 1.

    sub FETCHSIZE {
        my $self = shift;
        return scalar @{$self->{DATA}};
    }

SELF->STORESIZE(COUNT)

This method sets the total number of items in the tied array associated with SELF to be COUNT. If the array shrinks, you should remove entries beyond COUNT. If the array grows, you should make sure the new positions are undefined. For our BoundedArray class, we also ensure that the array doesn't grow beyond the limit initially set.

    sub STORESIZE {
        my ($self, $count) = @_;
        if ($count > $self->{BOUND}) {
            confess "Array OOB: $count > $self->{BOUND}";
        }
        $#{$self->{DATA}} = $count;
    }

SELF->EXTEND(COUNT)

Perl uses the EXTEND method to indicate that the array is likely to expand to hold COUNT entries. That way you can allocate memory in one big chunk instead of in many successive calls later on. Since our BoundedArrays have fixed upper bounds, we won't define this method.

SELF->EXISTS(INDEX)

This method verifies that the element at INDEX exists in the tied array. For our BoundedArray, we just employ Perl's built-in exists after verifying that it's not an attempt to look past the fixed upper bound.
    sub EXISTS {
        my ($self, $index) = @_;
        if ($index > $self->{BOUND}) {
            confess "Array OOB: $index > $self->{BOUND}";
        }
        exists $self->{DATA}[$index];
    }

SELF->DELETE(INDEX)

The DELETE method removes the element at INDEX from the tied array SELF. For our BoundedArray class, the method looks nearly identical to EXISTS, but this is not the norm.

    sub DELETE {
        my ($self, $index) = @_;
        print STDERR "deleting!\n";
        if ($index > $self->{BOUND}) {
            confess "Array OOB: $index > $self->{BOUND}";
        }
        delete $self->{DATA}[$index];
    }

SELF->CLEAR

This method is called whenever the array has to be emptied. That happens when the array is set to a list of new values (or an empty list), but not when it's provided to the undef function. Since a cleared BoundedArray always satisfies the upper bound, we don't need to check anything here:

    sub CLEAR {
        my $self = shift;
        $self->{DATA} = [];
    }

If you set the array to a list, CLEAR will trigger but won't see the list values. So if you violate the upper bound like so:

    tie(@array, "BoundedArray", 2);
    @array = (1, 2, 3, 4);

the CLEAR method will still return successfully. The exception will only be raised on the subsequent STORE. The assignment triggers one CLEAR and four STOREs.

SELF->PUSH(LIST)

This method appends the elements of LIST to the array. Here's how it might look for our BoundedArray class:

    sub PUSH {
        my $self = shift;
        if (@_ + $#{$self->{DATA}} > $self->{BOUND}) {
            confess "Attempt to push too many elements";
        }
        push @{$self->{DATA}}, @_;
    }

SELF->UNSHIFT(LIST)

This method prepends the elements of LIST to the array. For our BoundedArray class, the subroutine would be similar to PUSH.

SELF->POP

The POP method removes the last element of the array and returns it. For BoundedArray, it's a one-liner:

    sub POP { my $self = shift; pop @{$self->{DATA}} }

SELF->SHIFT

The SHIFT method removes the first element of the array and returns it. For BoundedArray, it's similar to POP.
SELF->SPLICE(OFFSET, LENGTH, LIST)

This method lets you splice the SELF array. To mimic Perl's built-in splice, OFFSET should be optional and default to zero, with negative values counting back from the end of the array. LENGTH should also be optional, defaulting to the rest of the array. LIST can be empty. If it's properly mimicking the built-in, the method will return a list of the original LENGTH elements at OFFSET (that is, the list of elements to be replaced by LIST).

Since splicing is a somewhat complicated operation, we won't define it at all; we'll just use the SPLICE subroutine from the Tie::Array module that we got for free when we inherited from Tie::Array. This way we define SPLICE in terms of other BoundedArray methods, so the bounds checking will still occur.

That completes our BoundedArray class. It warps the semantics of arrays just a little. But we can do better, and in very much less space.

Notational Convenience

One of the nice things about variables is that they interpolate. One of the not-so-nice things about functions is that they don't. You can use a tied array to make a function that can be interpolated. Suppose you want to interpolate random integers in a string. You can just say:

    #!/usr/bin/perl
    package RandInterp;
    sub TIEARRAY { bless \my $self };
    sub FETCH { int rand $_[1] };

    package main;
    tie @rand, "RandInterp";
    for (1,10,100,1000) {
        print "A random integer less than $_ would be $rand[$_]\n";
    }
    $rand[32] = 5;  # Will this reformat our system disk?

When run, this prints:

    A random integer less than 1 would be 0
    A random integer less than 10 would be 3
    A random integer less than 100 would be 46
    A random integer less than 1000 would be 755
    Can't locate object method "STORE" via package "RandInterp" at foo line 10.

As you can see, it's no big deal that we didn't even implement STORE. It just blows up like normal.

Tying Hashes

A class implementing a tied hash should define eight methods.
TIEHASH constructs new objects. FETCH and STORE access the key/value pairs. EXISTS reports whether a key is present in the hash, and DELETE removes a key along with its associated value.* CLEAR empties the hash by deleting all key/value pairs. FIRSTKEY and NEXTKEY iterate over the key/value pairs when you call keys, values, or each. And as usual, if you want to perform particular actions when the object is deallocated, you may define a DESTROY method. (If this seems like a lot of methods, you didn't read the last section on arrays attentively. In any event, feel free to inherit the default methods from the standard Tie::Hash module, redefining only the interesting ones. Again, Tie::StdHash assumes the implementation is also a hash.)

* Remember that Perl distinguishes between a key not existing in the hash and a key existing in the hash but having a corresponding value of undef. The two possibilities can be tested with exists and defined, respectively.

For example, suppose you want to create a hash where every time you assign a value to a key, instead of overwriting the previous contents, the new value is appended to an array of values. That way when you say:

    $h{$k} = "one";
    $h{$k} = "two";

It really does:

    push @{ $h{$k} }, "one";
    push @{ $h{$k} }, "two";

That's not a very complicated idea, so you should be able to use a pretty simple module. Using Tie::StdHash as a base class, it is. Here's a Tie::AppendHash that does just that:

    package Tie::AppendHash;
    use Tie::Hash;
    our @ISA = ("Tie::StdHash");
    sub STORE {
        my ($self, $key, $value) = @_;
        push @{$self->{$key}}, $value;
    }
    1;

Hash-Tying Methods

Here's an example of an interesting tied-hash class: it gives you a hash representing a particular user's dot files (that is, files whose names begin with a period, which is a naming convention for initialization files under Unix). You index into the hash with the name of the file (minus the period) and get back that dot file's contents.
For example:

    use DotFiles;
    tie %dot, "DotFiles";
    if ( $dot{profile} =~ /MANPATH/ or
         $dot{login}   =~ /MANPATH/ or
         $dot{cshrc}   =~ /MANPATH/    )
    {
        print "you seem to set your MANPATH\n";
    }

Here's another way to use our tied class:

    # Third argument is the name of a user whose dot files we will tie to.
    tie %him, "DotFiles", "daemon";
    foreach $f (keys %him) {
        printf "daemon dot file %s is size %d\n", $f, length $him{$f};
    }

In our DotFiles example we implement the object as a regular hash containing several important fields, of which only the {CONTENTS} field will contain what the user thinks of as the hash. Here are the object's actual fields:

    Field     Contents
    USER      Whose dot files this object represents.
    HOME      Where those dot files live.
    CLOBBER   Whether we are allowed to change or remove those dot files.
    CONTENTS  The hash of dot file names and content mappings.

Here's the start of DotFiles.pm:

    package DotFiles;
    use Carp;
    sub whowasi { (caller(1))[3] . "()" }
    my $DEBUG = 0;
    sub debug { $DEBUG = @_ ? shift : 1 }

For our example, we want to be able to turn on debugging output to help in tracing during development, so we set up $DEBUG for that. We also keep one convenience function around internally to help print out warnings: whowasi returns the name of the function that called it, so a message issued from, say, FETCH can identify FETCH as its source.
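If the caller(1) subscripting looks cryptic, this tiny standalone sketch (ours, not the book's; the sub names are made up) shows what whowasi actually returns:

```perl
use strict;
use warnings;

# Element [3] of caller(EXPR) is the fully qualified subroutine name at
# that call frame: frame 0 is whowasi itself, frame 1 is whoever called it.
sub whowasi { (caller(1))[3] . "()" }

sub FETCHish { return "called by " . whowasi() }   # stand-in for a tie method

print FETCHish(), "\n";    # prints "called by main::FETCHish()"
```

So when whowasi is interpolated into an error message inside a method like TIEHASH, the message automatically names that method.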
Here are the methods for the DotFiles tied hash:

CLASSNAME->TIEHASH(LIST)

Here's the DotFiles constructor:

    sub TIEHASH {
        my $self = shift;
        my $user = shift || $>;
        my $dotdir = shift || "";
        croak "usage: @{[ &whowasi ]} [USER [DOTDIR]]" if @_;
        $user = getpwuid($user) if $user =~ /^\d+$/;
        my $dir = (getpwnam($user))[7]
            or croak "@{ [&whowasi] }: no user $user";
        $dir .= "/$dotdir" if $dotdir;

        my $node = {
            USER     => $user,
            HOME     => $dir,
            CONTENTS => {},
            CLOBBER  => 0,
        };

        opendir DIR, $dir
            or croak "@{[&whowasi]}: can't opendir $dir: $!";
        for my $dot ( grep /^\./ && -f "$dir/$_", readdir(DIR)) {
            $dot =~ s/^\.//;
            $node->{CONTENTS}{$dot} = undef;
        }
        closedir DIR;
        return bless $node, $self;
    }

It's probably worth mentioning that if you're going to apply file tests to the values returned by the above readdir, you'd better prepend the directory in question (as we do). Otherwise, since no chdir was done, you'd likely be testing the wrong file.

SELF->FETCH(KEY)

This method implements reading an element from the tied hash. It takes one argument after the object: the key whose value we're trying to fetch. The key is a string, and you can do anything you like with it (consistent with its being a string). Here's the fetch for our DotFiles example:

    sub FETCH {
        carp &whowasi if $DEBUG;
        my $self = shift;
        my $dot = shift;
        my $dir = $self->{HOME};
        my $file = "$dir/.$dot";

        unless (exists $self->{CONTENTS}->{$dot} || -f $file) {
            carp "@{[&whowasi]}: no $dot file" if $DEBUG;
            return undef;
        }

        # Implement a cache.
        if (defined $self->{CONTENTS}->{$dot}) {
            return $self->{CONTENTS}->{$dot};
        } else {
            return $self->{CONTENTS}->{$dot} = `cat $dir/.$dot`;
        }
    }

We cheated a little by running the Unix cat(1) command, but it would be more portable (and more efficient) to open the file ourselves. On the other hand, since dotfiles are a Unixy concept, we're not that concerned. Or shouldn't be. Or something . . .
SELF->STORE(KEY, VALUE)
This method does the dirty work whenever an element in the tied hash is set (written). It takes two arguments after the object: the key under which we're storing the new value, and the value itself.

For our DotFiles example, we won't let users overwrite a file without first invoking the clobber method on the original object returned by tie:

    sub STORE {
        carp &whowasi if $DEBUG;
        my $self = shift;
        my $dot = shift;
        my $value = shift;
        my $file = $self->{HOME} . "/.$dot";

        croak "@{[&whowasi]}: $file not clobberable"
            unless $self->{CLOBBER};

        open(F, "> $file") or croak "can't open $file: $!";
        print F $value;
        close(F);
    }

If someone wants to clobber something, they can say:

    $ob = tie %daemon_dots, "DotFiles", "daemon";
    $ob->clobber(1);
    $daemon_dots{signature} = "A true daemon\n";

But they could alternatively set {CLOBBER} with tied:

    tie %daemon_dots, "DotFiles", "daemon";
    tied(%daemon_dots)->clobber(1);

or as one statement:

    (tie %daemon_dots, "DotFiles", "daemon")->clobber(1);

The clobber method is simply:

    sub clobber {
        my $self = shift;
        $self->{CLOBBER} = @_ ? shift : 1;
    }

SELF->DELETE(KEY)
This method handles requests to remove an element from the hash. If your emulated hash uses a real hash somewhere, you can just call the real delete. Again, we'll be careful to check whether the user really wants to clobber files:

    sub DELETE {
        carp &whowasi if $DEBUG;
        my $self = shift;
        my $dot = shift;
        my $file = $self->{HOME} . "/.$dot";
        croak "@{[&whowasi]}: won't remove file $file"
            unless $self->{CLOBBER};
        delete $self->{CONTENTS}->{$dot};
        unlink $file or carp "@{[&whowasi]}: can't unlink $file: $!";
    }

SELF->CLEAR
This method is run when the whole hash needs to be cleared, usually by assigning the empty list to it. In our example, that would remove all the user's dot files!
It's such a dangerous thing that we'll require CLOBBER to be set higher than 1 before this can happen:

    sub CLEAR {
        carp &whowasi if $DEBUG;
        my $self = shift;
        croak "@{[&whowasi]}: won't remove all dotfiles for $self->{USER}"
            unless $self->{CLOBBER} > 1;
        for my $dot ( keys %{$self->{CONTENTS}}) {
            $self->DELETE($dot);
        }
    }

SELF->EXISTS(KEY)
This method runs when the user invokes the exists function on a particular hash. In our example, we'll look at the {CONTENTS} hash element to find the answer:

    sub EXISTS {
        carp &whowasi if $DEBUG;
        my $self = shift;
        my $dot = shift;
        return exists $self->{CONTENTS}->{$dot};
    }

SELF->FIRSTKEY
This method is called when the user begins to iterate through the hash, such as with a keys, values, or each call. By calling keys in a scalar context, we reset its internal state to ensure that the next each used in the return statement will get the first key.

    sub FIRSTKEY {
        carp &whowasi if $DEBUG;
        my $self = shift;
        my $temp = keys %{$self->{CONTENTS}};
        return scalar each %{$self->{CONTENTS}};
    }

SELF->NEXTKEY(PREVKEY)
This method is the iterator for a keys, values, or each function. PREVKEY is the last key accessed, which Perl knows to supply. This is useful if the NEXTKEY method needs to know its previous state to calculate the next state. For our example, we are using a real hash to represent the tied hash's data, except that this hash is stored in the hash's CONTENTS field instead of in the hash itself. So we can just rely on Perl's each iterator:

    sub NEXTKEY {
        carp &whowasi if $DEBUG;
        my $self = shift;
        return scalar each %{ $self->{CONTENTS} }
    }

SELF->DESTROY
This method is triggered when a tied hash's object is about to be deallocated. You don't really need it except for debugging and extra cleanup.
Here's a very simple version:

    sub DESTROY {
        carp &whowasi if $DEBUG;
    }

Now that we've given you all those methods, your homework is to go back and find the places we interpolated @{[&whowasi]} and replace them with a simple tied scalar named $whowasi that does the same thing.

Tying Filehandles

A class implementing a tied filehandle should define the following methods: TIEHANDLE and at least one of PRINT, PRINTF, WRITE, READLINE, GETC, and READ. The class can also provide a DESTROY method, and BINMODE, OPEN, CLOSE, EOF, FILENO, SEEK, TELL, READ, and WRITE methods to enable the corresponding Perl built-ins for the tied filehandle. (Well, that isn't quite true: WRITE corresponds to syswrite and has nothing to do with Perl's built-in write function for printing with format declarations.)

Tied filehandles are especially useful when Perl is embedded in another program (such as Apache or vi) and output to STDOUT or STDERR needs to be redirected in some special way. But filehandles don't actually have to be tied to a file at all. You can use output statements to build up an in-memory data structure and input statements to read them back in. Here's an easy way to reverse a sequence of print and printf statements without reversing the individual lines:

    package ReversePrint;
    use strict;

    sub TIEHANDLE {
        my $class = shift;
        bless [], $class;
    }
    sub PRINT {
        my $self = shift;
        push @$self, join '', @_;
    }
    sub PRINTF {
        my $self = shift;
        my $fmt = shift;
        push @$self, sprintf $fmt, @_;
    }
    sub READLINE {
        my $self = shift;
        pop @$self;
    }

    package main;
    my $m = "--MORE--\n";
    tie *REV, "ReversePrint";

    # Do some prints and printfs.
    print REV "The fox is now dead.$m";
    printf REV <<"END", int rand 10000000;
    The quick brown fox jumps over
    over the lazy dog %d times!
    END

    print REV <<"END";
    The quick brown fox jumps
    over the lazy dog.
    END

    # Now read back from the same handle.
    print while <REV>;

This prints:

    The quick brown fox jumps
    over the lazy dog.
    The quick brown fox jumps over
    over the lazy dog 3179357 times!
    The fox is now dead.--MORE--

Filehandle-Tying Methods

For our extended example, we'll create a filehandle that uppercases strings printed to it. Just for kicks, we'll begin the file with <SHOUT> when it's opened and end with </SHOUT> when it's closed. That way we can rant in well-formed XML. Here's the top of our Shout.pm file that will implement the class:

    package Shout;
    use Carp;        # So we can croak our errors

We'll now list the method definitions in Shout.pm.

CLASSNAME->TIEHANDLE(LIST)
This is the constructor for the class, which as usual should return a blessed reference.

    sub TIEHANDLE {
        my $class = shift;
        my $form = shift;
        open my $self, $form, @_ or croak "can't open $form@_: $!";
        if ($form =~ />/) {
            print $self "<SHOUT>\n";
            $$self->{WRITING} = 1;       # Remember to do end tag
        }
        return bless $self, $class;      # $self is a glob ref
    }

Here, we open a new filehandle according to the mode and filename passed to the tie operator, write <SHOUT> to the file, and return a blessed reference to it. There's a lot of stuff going on in that open statement, but we'll just point out that, in addition to the usual "open or die" idiom, the my $self furnishes an undefined scalar to open, which knows to autovivify it into a typeglob. The fact that it's a typeglob is also significant, because not only does the typeglob contain the real I/O object of the file, but it also contains various other handy data structures that come along for free, like a scalar ($$$self), an array (@$$self), and a hash (%$$self). (We won't mention the subroutine, &$$self.)

The $form is the filename-or-mode argument. If it's a filename, @_ is empty, so it behaves as a two-argument open. Otherwise, $form is the mode for the rest of the arguments. After the open, we test to see whether we should write the beginning tag. If so, we do.
And right away, we use one of those glob data structures we mentioned. That $$self->{WRITING} is an example of using the glob to store interesting information. In this case, we remember whether we did the beginning tag so we know whether to do the corresponding end tag. We're using the %$$self hash, so we can give the field a decent name. We could have used the scalar as $$$self, but that wouldn't be self-documenting. (Or it would only be self-documenting, depending on how you look at it.)

SELF->PRINT(LIST)
This method implements a print to the tied handle. The LIST is whatever was passed to print. Our method below uppercases each element of LIST:

    sub PRINT {
        my $self = shift;
        print $self map {uc} @_;
    }

SELF->READLINE
This method supplies the data when the filehandle is read from via the angle operator (<FH>) or readline. The method should return undef when there is no more data.

    sub READLINE {
        my $self = shift;
        return <$self>;
    }

Here, we simply return <$self> so that the method will behave appropriately depending on whether it was called in scalar or list context.

SELF->GETC
This method runs whenever getc is used on the tied filehandle.

    sub GETC {
        my $self = shift;
        return getc($self);
    }

Like several of the methods in our Shout class, the GETC method simply calls its corresponding Perl built-in and returns the result.

SELF->OPEN(LIST)
Our TIEHANDLE method itself opens a file, but a program using the Shout class that calls open afterward triggers this method.

    sub OPEN {
        my $self = shift;
        my $form = shift;
        my $name = "$form@_";
        $self->CLOSE;
        open($self, $form, @_) or croak "can't reopen $name: $!";
        if ($form =~ />/) {
            print $self "<SHOUT>\n" or croak "can't start print: $!";
            $$self->{WRITING} = 1;      # Remember to do end tag
        }
        else {
            $$self->{WRITING} = 0;      # Remember not to do end tag
        }
        return 1;
    }

We invoke our own CLOSE method to explicitly close the file in case the user didn't bother to.
Then we open a new file with whatever filename was specified in the open and shout at it.

SELF->CLOSE
This method deals with the request to close the handle. Here, we seek to the end of the file and, if that was successful, print </SHOUT> before using Perl's built-in close.

    sub CLOSE {
        my $self = shift;
        if ($$self->{WRITING}) {
            $self->SEEK(0, 2)            or return;
            $self->PRINT("</SHOUT>\n")   or return;
        }
        return close $self;
    }

SELF->SEEK(LIST)
When you seek on a tied filehandle, the SEEK method gets called.

    sub SEEK {
        my $self = shift;
        my ($offset, $whence) = @_;
        return seek($self, $offset, $whence);
    }

SELF->TELL
This method is invoked when tell is used on the tied handle.

    sub TELL {
        my $self = shift;
        return tell $self;
    }

SELF->PRINTF(LIST)
This method is run whenever printf is used on the tied handle. The LIST will contain the format and the items to be printed.

    sub PRINTF {
        my $self = shift;
        my $template = shift;
        return $self->PRINT(sprintf $template, @_);
    }

Here, we use sprintf to generate the formatted string and pass it to PRINT for uppercasing. There's nothing that requires you to use the built-in sprintf function, though. You could interpret the percent escapes to suit your own purpose.

SELF->READ(LIST)
This method responds when the handle is read using read or sysread. Note that we modify the first argument of LIST "in-place", mimicking read's ability to fill in the scalar passed in as its second argument.

    sub READ {
        my ($self, undef, $length, $offset) = @_;
        my $bufref = \$_[1];
        return read($self, $$bufref, $length, $offset);
    }

SELF->WRITE(LIST)
This method gets invoked when the handle is written to with syswrite. Here, we uppercase the string to be written.
    sub WRITE {
        my $self = shift;
        my $string = uc(shift);
        my $length = shift || length $string;
        my $offset = shift || 0;
        return syswrite $self, $string, $length, $offset;
    }

SELF->EOF
This method returns a Boolean value when a filehandle tied to the Shout class is tested for its end-of-file status using eof.

    sub EOF {
        my $self = shift;
        return eof $self;
    }

SELF->BINMODE(DISC)
This method specifies the I/O discipline to be used on the filehandle. If none is specified, it puts the tied filehandle into binary mode (the :raw discipline), for filesystems that distinguish between text and binary files.

    sub BINMODE {
        my $self = shift;
        my $disc = shift || ":raw";
        return binmode $self, $disc;
    }

That's how you'd write it, but it's actually useless in our case because the open already wrote on the handle. So in our case we should probably make it say:

    sub BINMODE { croak("Too late to use binmode") }

SELF->FILENO
This method should return the file descriptor (fileno) associated with the tied filehandle by the operating system.

    sub FILENO {
        my $self = shift;
        return fileno $self;
    }

SELF->DESTROY
As with the other types of ties, this method is triggered when the tied object is about to be destroyed. This is useful for letting the object clean up after itself. Here, we make sure that the file is closed, in case the program forgot to call close. We could just say close $self, but it's better to invoke the CLOSE method of the class. That way if the designer of the class decides to change how files are closed, this DESTROY method won't have to be modified.

    sub DESTROY {
        my $self = shift;
        $self->CLOSE;    # Close the file using Shout's CLOSE method.
    }

Here's a demonstration of our Shout class:

    #!/usr/bin/perl
    use Shout;
    tie(*FOO, Shout::, ">filename");
    print FOO "hello\n";    # Prints HELLO.
    seek FOO, 0, 0;         # Rewind to beginning.
    @lines = <FOO>;         # Calls the READLINE method.
    close FOO;              # Close file explicitly.
    open(FOO, "+<", "filename");  # Reopen FOO, calling OPEN.
    seek(FOO, 8, 0);              # Skip the "<SHOUT>\n".
    sysread(FOO, $inbuf, 5);      # Read 5 bytes from FOO into $inbuf.
    print "found $inbuf\n";       # Should print "hello".
    seek(FOO, -5, 1);             # Back up over the "hello".
    syswrite(FOO, "ciao!\n", 6);  # Write 6 bytes into FOO.
    untie(*FOO);                  # Calls the CLOSE method implicitly.

After running this, the file contains:

    <SHOUT>
    CIAO!
    </SHOUT>

Here are some more strange and wonderful things to do with that internal glob. We use the same hash as before, but with new keys PATHNAME and DEBUG. First we install a stringify overloading so that printing one of our objects reveals the pathname (see Chapter 13, Overloading):

    # This is just so totally cool!
    use overload q("") => sub { $_[0]->pathname };

    # This is the stub to put in each function you want to trace.
    sub trace {
        my $self = shift;
        local $Carp::CarpLevel = 1;
        Carp::cluck("\ntrace magical method") if $self->debug;
    }

    # Overload handler to print out our path.
    sub pathname {
        my $self = shift;
        confess "i am not a class method" unless ref $self;
        $$self->{PATHNAME} = shift if @_;
        return $$self->{PATHNAME};
    }

    # Dual moded.
    sub debug {
        my $self = shift;
        my $var = ref $self ? \$$self->{DEBUG} : \our $Debug;
        $$var = shift if @_;
        return ref $self ? $$self->{DEBUG} || $Debug : $Debug;
    }

And then call trace on entry to all your ordinary methods like this:

    sub GETC {
        $_[0]->trace;               # NEW
        my($self) = @_;
        getc($self);
    }

And also set the pathname in TIEHANDLE and OPEN:

    sub TIEHANDLE {
        my $class = shift;
        my $form = shift;
        my $name = "$form@_";                                   # NEW
        open my $self, $form, @_ or croak "can't open $name: $!";
        if ($form =~ />/) {
            print $self "<SHOUT>\n";
            $$self->{WRITING} = 1;      # Remember to do end tag
        }
        bless $self, $class;            # $fh is a glob ref
        $self->pathname($name);                                 # NEW
        return $self;
    }

    sub OPEN {
        $_[0]->trace;                                           # NEW
        my $self = shift;
        my $form = shift;
        my $name = "$form@_";
        $self->CLOSE;
        open($self, $form, @_) or croak "can't reopen $name: $!";
        $self->pathname($name);                                 # NEW
        if ($form =~ />/) {
            print $self "<SHOUT>\n" or croak "can't start print: $!";
            $$self->{WRITING} = 1;      # Remember to do end tag
        }
        else {
            $$self->{WRITING} = 0;      # Remember not to do end tag
        }
        return 1;
    }

Somewhere you also have to call $self->debug(1) to turn debugging on. When you do that, all your Carp::cluck calls will produce meaningful messages. Here's one that we get while doing the reopen above. It shows us three deep in method calls, as we're closing down the old file in preparation for opening the new one:

    trace magical method at foo line 87
        Shout::SEEK('>filename', '>filename', 0, 2) called at foo line 81
        Shout::CLOSE('>filename') called at foo line 65
        Shout::OPEN('>filename', '+<', 'filename') called at foo line 141

Creative Filehandles

You can tie the same filehandle to both the input and the output of a two-ended pipe. Suppose you wanted to run the bc(1) (arbitrary precision calculator) program this way:

    use Tie::Open2;
    tie *CALC, 'Tie::Open2', "bc -l";
    $sum = 2;
    for (1 .. 7) {
        print CALC "$sum * $sum\n";
        $sum = <CALC>;
        print "$_: $sum";
        chomp $sum;
    }
    close CALC;

One would expect it to print this:

    1: 4
    2: 16
    3: 256
    4: 65536
    5: 4294967296
    6: 18446744073709551616
    7: 340282366920938463463374607431768211456

One's expectations would be correct if one had the bc(1) program on one's computer, and one also had Tie::Open2 defined as follows. This time we'll use a blessed array for our internal object. It contains our two actual filehandles for reading and writing. (The dirty work of opening a double-ended pipe is done by IPC::Open2; we're just doing the fun part.)

    package Tie::Open2;
    use strict;
    use Carp;
    use Tie::Handle;    # do not inherit from this!
    use IPC::Open2;

    sub TIEHANDLE {
        my ($class, @cmd) = @_;
        no warnings 'once';
        my @fhpair = \do { local(*RDR, *WTR) };
        bless $_, 'Tie::StdHandle' for @fhpair;
        bless(\@fhpair => $class)->OPEN(@cmd) || die;
        return \@fhpair;
    }
    sub OPEN {
        my ($self, @cmd) = @_;
        $self->CLOSE if grep {defined} @{ $self->FILENO };
        open2(@$self, @cmd);
    }
    sub FILENO {
        my $self = shift;
        [ map { fileno $self->[$_] } 0,1 ];
    }

    for my $outmeth ( qw(PRINT PRINTF WRITE) ) {
        no strict 'refs';
        *$outmeth = sub {
            my $self = shift;
            $self->[1]->$outmeth(@_);
        };
    }
    for my $inmeth ( qw(READ READLINE GETC) ) {
        no strict 'refs';
        *$inmeth = sub {
            my $self = shift;
            $self->[0]->$inmeth(@_);
        };
    }
    for my $doppelmeth ( qw(BINMODE CLOSE EOF)) {
        no strict 'refs';
        *$doppelmeth = sub {
            my $self = shift;
            $self->[0]->$doppelmeth(@_) && $self->[1]->$doppelmeth(@_);
        };
    }
    for my $deadmeth ( qw(SEEK TELL)) {
        no strict 'refs';
        *$deadmeth = sub {
            croak("can't $deadmeth a pipe");
        };
    }
    1;

The final four loops are just incredibly snazzy, in our opinion. For an explanation of what's going on, look back at the section entitled "Closures as Function Templates" in Chapter 8, References.

Here's an even wackier set of classes. The package names should give you a clue as to what they do.
    use strict;
    package Tie::DevNull;
    sub TIEHANDLE {
        my $class = shift;
        my $fh = local *FH;
        bless \$fh, $class;
    }
    for (qw(READ READLINE GETC PRINT PRINTF WRITE)) {
        no strict 'refs';
        *$_ = sub { return };
    }

    package Tie::DevRandom;
    sub READLINE { rand() . "\n"; }
    sub TIEHANDLE {
        my $class = shift;
        my $fh = local *FH;
        bless \$fh, $class;
    }
    sub FETCH { rand() }
    sub TIESCALAR {
        my $class = shift;
        bless \my $self, $class;
    }

    package Tie::Tee;
    sub TIEHANDLE {
        my $class = shift;
        my @handles;
        for my $path (@_) {
            open(my $fh, ">$path") || die "can't write $path";
            push @handles, $fh;
        }
        bless \@handles, $class;
    }
    sub PRINT {
        my $self = shift;
        my $ok = 0;
        for my $fh (@$self) {
            $ok += print $fh @_;
        }
        return $ok == @$self;
    }

The Tie::Tee class emulates the standard Unix tee(1) program, which sends one stream of output to multiple different destinations. The Tie::DevNull class emulates the null device, /dev/null on Unix systems. And the Tie::DevRandom class produces random numbers either as a handle or as a scalar, depending on whether you call TIEHANDLE or TIESCALAR! Here's how you call them:

    package main;
    tie *SCATTER,  "Tie::Tee", qw(tmp1 - tmp2 >tmp3 tmp4);
    tie *RANDOM,   "Tie::DevRandom";
    tie *NULL,     "Tie::DevNull";
    tie my $randy, "Tie::DevRandom";

    for my $i (1..10) {
        my $line = <RANDOM>;
        chomp $line;
        for my $fh (*NULL, *SCATTER) {
            print $fh "$i: $line $randy\n";
        }
    }

This produces something like the following on your screen:

    1: 0.124115571686165 0.20872819474074
    2: 0.156618299751194 0.678171662366353
    3: 0.799749050426126 0.300184963960792
    4: 0.599474551447884 0.213935286029916
    5: 0.700232143543861 0.800773751296671
    6: 0.201203608274334 0.0654303290639575
    7: 0.605381294683365 0.718162304090487
    8: 0.452976481105495 0.574026269121667
    9: 0.736819876983848 0.391737610662044
    10: 0.518606540417331 0.381805078272308

But that's not all! It wrote to your screen because of the - in the *SCATTER tie above.
But that line also told it to create files tmp1, tmp2, and tmp4, as well as to append to file tmp3. (We also wrote to the *NULL filehandle in the loop, though of course that didn't show up anywhere interesting, unless you're interested in black holes.)

A Subtle Untying Trap

If you intend to make use of the object returned from tie or tied, and the class defines a destructor, there is a subtle trap you must guard against. Consider this (admittedly contrived) example of a class that uses a file to log all values assigned to a scalar:

    package Remember;

    sub TIESCALAR {
        my $class = shift;
        my $filename = shift;
        open(my $handle, ">", $filename)
            or die "Cannot open $filename: $!\n";
        print $handle "The Start\n";
        bless {FH => $handle, VALUE => 0}, $class;
    }
    sub FETCH {
        my $self = shift;
        return $self->{VALUE};
    }
    sub STORE {
        my $self = shift;
        my $value = shift;
        my $handle = $self->{FH};
        print $handle "$value\n";
        $self->{VALUE} = $value;
    }
    sub DESTROY {
        my $self = shift;
        my $handle = $self->{FH};
        print $handle "The End\n";
        close $handle;
    }
    1;

Here is an example that makes use of our Remember class:

    use strict;
    use Remember;

    my ($fred, $x);
    $x = tie $fred, "Remember", "camel.log";
    $fred = 1;
    $fred = 4;
    $fred = 5;
    untie $fred;
    system "cat camel.log";

This is the output when it is executed:

    The Start
    1
    4
    5
    The End

So far, so good. Let's add an extra method to the Remember class that allows comments in the file—say, something like this:

    sub comment {
        my $self = shift;
        my $message = shift;
        print { $self->{FH} } $message, "\n";
    }

And here is the previous example, modified to use the comment method:

    use strict;
    use Remember;

    my ($fred, $x);
    $x = tie $fred, "Remember", "camel.log";
    $fred = 1;
    $fred = 4;
    comment $x "changing...";
    $fred = 5;
    untie $fred;
    system "cat camel.log";

Now the file will be empty, which probably wasn't what you intended. Here's why. Tying a variable associates it with the object returned by the constructor.
This object normally has only one reference: the one hidden behind the tied variable itself. Calling "untie" breaks the association and eliminates that reference. Since there are no remaining references to the object, the DESTROY method is triggered.

However, in the example above we stored a second reference to the object tied to $x. That means that after the untie there will still be a valid reference to the object. DESTROY won't get triggered, and the file won't get flushed and closed. That's why there was no output: the filehandle's buffer was still in memory. It won't hit the disk until the program exits.

To detect this, you could use the -w command-line flag, or include the use warnings "untie" pragma in the current lexical scope. Either technique would identify a call to untie while there were still references to the tied object remaining. If so, Perl prints this warning:

    untie attempted while 1 inner references still exist

To get the program to work properly and silence the warning, eliminate any extra references to the tied object before calling untie. You can do that explicitly:

    undef $x;
    untie $fred;

Often, though, you can solve the problem simply by making sure your variables go out of scope at the appropriate time.

Tie Modules on CPAN

Before you get all inspired to write your own tie module, you should check to see if someone's already done it. There are lots of tie modules on CPAN, with more every day. (Well, every month, anyway.) Table 14-1 lists some of them.

Table 14-1. Tie Modules on CPAN

    Module                Description
    GnuPG::Tie::Encrypt   Ties a filehandle interface to encryption with the GNU Privacy Guard.
    IO::WrapTie           Wraps tied objects in an IO::Handle interface.
    MLDBM                 Transparently stores complex data values, not just flat strings, in a DBM file.
    Net::NISplusTied      Ties hashes to NIS+ tables.
    Tie::Cache::LRU       Implements a least-recently used cache.
    Tie::Const            Provides constant scalars and hashes.
    Tie::Counter          Enchants a scalar variable to increment upon each access.
    Tie::CPHash           Implements a case-preserving but case-insensitive hash.
    Tie::DB_FileLock      Provides locking access to Berkeley DB 1.x.
    Tie::DBI              Ties hashes to DBI relational databases.
    Tie::DB_Lock          Ties hashes to databases using shared and exclusive locks.
    Tie::Dict             Ties a hash to an RPC dict server.
    Tie::Dir              Ties a hash for reading directories.
    Tie::DirHandle        Ties directory handles.
    Tie::FileLRUCache     Implements a lightweight, filesystem-based, persistent LRU cache.
    Tie::FlipFlop         Implements a tie that alternates between two values.
    Tie::HashDefaults     Lets a hash have default values.
    Tie::HashHistory      Tracks history of all changes to a hash.
    Tie::IxHash           Provides ordered associative arrays for Perl.
    Tie::LDAP             Implements an interface to an LDAP database.
    Tie::Persistent       Provides persistent data structures via tie.
    Tie::Pick             Randomly picks (and removes) an element from a set.
    Tie::RDBM             Ties hashes to relational databases.
    Tie::SecureHash       Supports namespace-based encapsulation.
    Tie::STDERR           Sends output of your STDERR to another process such as a mailer.
    Tie::Syslog           Ties a filehandle to automatically syslog its output.
    Tie::TextDir          Ties a directory of files.
    Tie::TransactHash     Edits a hash in transactions without changing the order during the transaction.
    Tie::VecArray         Provides an array interface to a bit vector.
    Tie::Watch            Places watch points on Perl variables.
    Win32::TieRegistry    Provides powerful and easy ways to manipulate a Microsoft Windows registry.

Part III: Perl as Technology

Chapter 15: Unicode

If you do not yet know what Unicode is, you will soon—even if you skip reading this chapter—because working with Unicode is becoming a necessity. (Some people think of it as a necessary evil, but it's really more of a necessary good. In either case, it's a necessary pain.)
Historically, people made up character sets to reflect what they needed to do in the context of their own culture. Since people of all cultures are naturally lazy, they've tended to include only the symbols they needed, excluding the ones they didn't need. That worked fine as long as we were only communicating with other people of our own culture, but now that we're starting to use the Internet for cross-cultural communication, we're running into problems with the exclusive approach. It's hard enough to figure out how to type accented characters on an American keyboard. How in the world (literally) can one write a multilingual web page? Unicode is the answer, or at least part of the answer (see also XML).

Unicode is an inclusive rather than an exclusive character set. While people can and do haggle over the various details of Unicode (and there are plenty of details to haggle over), the overall intent is to make everyone sufficiently happy* with Unicode so that they'll willingly use Unicode as the international medium of exchange for textual data. Nobody is forcing you to use Unicode, just as nobody is forcing you to read this chapter (we hope). People will always be allowed to use their old exclusive character sets within their own culture. But in that case (as we say), portability suffers.

* Or in some cases, insufficiently unhappy.

The Law of Conservation of Suffering says that if we reduce the suffering in one place, suffering must increase elsewhere. In the case of Unicode, we must suffer the migration from byte semantics to character semantics. Since, through an accident of history, Perl was invented by an American, Perl has historically confused the notions of bytes and characters. In migrating to Unicode, Perl must somehow unconfuse them.
Paradoxically, by getting Perl itself to unconfuse bytes and characters, we can allow the Perl programmer to confuse them, relying on Perl to keep them straight, just as we allow programmers to confuse numbers and strings and rely on Perl to convert back and forth as necessary. To the extent possible, Perl's approach to Unicode is the same as its approach to everything else: Just Do The Right Thing. Ideally, we'd like to achieve these four Goals:

Goal #1: Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

Goal #2: Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

Goal #3: Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

Goal #4: Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.

Taken together, these Goals are practically impossible to reach. But we've come remarkably close. Or rather, we're still in the process of coming remarkably close, since this is a work in progress. As Unicode continues to evolve, so will Perl. But our overarching plan is to provide a safe migration path that gets us where we want to go with minimal casualties along the way. How we do that is the subject of the next section.

Building Character

In releases of Perl prior to 5.6, all strings were viewed as sequences of bytes.* In versions 5.6 and later, however, a string may contain characters wider than a byte. We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit computers, 0 .. 2**64-1). These numbers represent abstract characters, and the larger the number, the "wider" the character, in some sense; but unlike many languages, Perl is not tied to any particular width of character representation. Perl uses a variable-length encoding (based on UTF-8), so these abstract character numbers may, or may not, be packed one number per byte. Obviously, character number 18,446,744,073,709,551,615 (that is, "\x{ffff_ffff_ffff_ffff}") is never going to fit into a byte (in fact, it takes 13 bytes), but if all the characters in your string are in the range 0..127 decimal, then they are certainly packed one per byte, since UTF-8 is the same as ASCII in the lowest seven bits.

* You may prefer to call them "octets"; that's okay, but we think the two words are pretty much synonymous these days, so we'll stick with the blue-collar word.

Perl uses UTF-8 only when it thinks it is beneficial, so if all the characters in your string are in the range 0..255, there's a good chance the characters are all packed in bytes—but in the absence of other knowledge, you can't be sure because internally Perl converts between fixed 8-bit characters and variable-length UTF-8 characters as necessary. The point is, you shouldn't have to worry about it most of the time, because the character semantics are preserved at an abstract level regardless of representation. In any event, if your string contains any character numbers larger than 255 decimal, the string is certainly stored in UTF-8. More accurately, it is stored in Perl's extended version of UTF-8, which we call utf8, in honor of a pragma by that name, but mostly because it's easier to type. (And because "real" UTF-8 is only allowed to contain character numbers blessed by the Unicode Consortium. Perl's utf8 is allowed to contain any character numbers you need to get your job done. Perl doesn't give a rip whether your character numbers are officially correct or just correct.)

We said you shouldn't worry about it most of the time, but people like to worry anyway. Suppose you use a v-string to represent an IPv4 address:

    $locaddr = v127.0.0.1;     # Certainly stored as bytes.
    $oreilly = v204.148.40.9;  # Might be stored as bytes or utf8.
    $badaddr = v2004.148.40.9; # Certainly stored as utf8.

Everyone can figure out that $badaddr will not work as an IP address. So it's easy to think that if O'Reilly's network address gets forced into a UTF-8 representation, it will no longer work. But the characters in the string are abstract numbers, not bytes. Anything that uses an IPv4 address, such as the gethostbyaddr function, should automatically coerce the abstract character numbers back into a byte representation (and fail on $badaddr).

The interfaces between Perl and the real world have to deal with the details of the representation. To the extent possible, existing interfaces try to do the right thing without your having to tell them what to do. But you do occasionally have to give instructions to some interfaces (such as the open function), and if you write your own interface to the real world, it will need to be either smart enough to figure things out for itself or at least smart enough to follow instructions when you want it to behave differently than it would by default.*

Since Perl worries about maintaining transparent character semantics within the language itself, the only place you need to worry about byte versus character semantics is in your interfaces. By default, all your old Perl interfaces to the outside world are byte-oriented, so they produce and consume byte-oriented data. That is to say, on the abstract level, all your strings are sequences of numbers in the range 0..255, so if nothing in the program forces them into utf8 representations, your old program continues to work on byte-oriented data just as it did before. So put a check mark by Goal #1 above.

If you want your old program to work on new character-oriented data, you must mark your character-oriented interfaces such that Perl knows to expect character-oriented data from those interfaces.
Once you've done this, Perl should automatically do any conversions necessary to preserve the character abstraction. The only difference is that you've introduced some strings into your program that are marked as potentially containing characters higher than 255, so if you perform an operation between a byte string and a utf8 string, Perl will internally coerce the byte string into a utf8 string before performing the operation. Typically, utf8 strings are coerced back to byte strings only when you send them to a byte interface, at which point, if the string contains characters larger than 255, you have a problem that can be handled in various ways depending on the interface in question. So you can put a check mark by Goal #2.

Sometimes you want to mix code that understands character semantics with code that has to run with byte semantics, such as I/O code that reads or writes fixed-size blocks. In this case, you may put a use bytes declaration around the byte-oriented code to force it to use byte semantics even on strings marked as utf8 strings. You are then responsible for any necessary conversions. But it's a way of enforcing a stricter local reading of Goal #1, at the expense of a looser global reading of Goal #2.

* On some systems, there may be ways of switching all your interfaces at once. If the -C command-line switch is used (or the global ${^WIDE_SYSTEM_CALLS} variable is set to 1), all system calls will use the corresponding wide-character APIs. (This is currently only implemented on Microsoft Windows.) The current plan of the Linux community is that all interfaces will switch to UTF-8 mode if $ENV{LC_CTYPE} is set to "UTF-8". Other communities may take other approaches. Your mileage may vary.
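The byte-to-utf8 coercion just described can be sketched in a few lines (a minimal illustration of our own; the point is that the character-level answer is the same no matter how the strings happen to be stored internally):

```perl
my $bytes = "\xe9";          # character 233: fits in a byte
my $wide  = "\x{263A}";      # a smiley: must be stored as utf8

# Concatenation mixes the two, so Perl silently upgrades
# $bytes to utf8 before performing the operation.
my $both = $bytes . $wide;

print length($both), "\n";   # 2 -- two characters, not a byte count
```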
Goal #3 has largely been achieved, partly by doing lazy conversions between byte and utf8 representations and partly by being sneaky in how we implement potentially slow features of Unicode, such as character property lookups in huge tables.

Goal #4 has been achieved by sacrificing a small amount of interface compatibility in pursuit of the other Goals. By one way of looking at it, we didn't fork into two different Perls; but by another way of looking at it, revision 5.6 of Perl is a forked version of Perl with regard to earlier versions, and we don't expect people to switch from earlier versions until they're sure the new version will do what they want. But that's always the case with new versions, so we'll allow ourselves to put a check mark by Goal #4 as well.

Effects of Character Semantics

The upshot of all this is that a typical built-in operator will operate on characters unless it is in the scope of a use bytes pragma. However, even outside the scope of use bytes, if all of the operands of the operator are stored as 8-bit characters (that is, none of the operands are stored in utf8), then character semantics are indistinguishable from byte semantics, and the result of the operator will be stored in 8-bit form internally. This preserves backward compatibility as long as you don't feed your program any characters wider than Latin-1.

The utf8 pragma is primarily a compatibility device that enables recognition of UTF-8 in literals and identifiers encountered by the parser. It may also be used for enabling some of the more experimental Unicode support features. Our long-term goal is to turn the utf8 pragma into a no-op.

The use bytes pragma will never turn into a no-op. Not only is it necessary for byte-oriented code, but it also has the side effect of defining byte-oriented wrappers around certain functions for use outside the scope of use bytes.
As of this writing, the only defined wrapper is for length, but there are likely to be more as time goes by. To use such a wrapper, say:

    use bytes ();    # Load wrappers without importing byte semantics.
    ...
    $charlen = length("\x{ffff_ffff}");           # Returns 1.
    $bytelen = bytes::length("\x{ffff_ffff}");    # Returns 7.

Outside the scope of a use bytes declaration, Perl version 5.6 works (or at least, is intended to work) like this:

•   Strings and patterns may now contain characters that have an ordinal value larger than 255:

        use utf8;
        $convergence = "  ";

    Presuming you have a Unicode-capable editor to edit your program, such characters will typically occur directly within the literal strings as UTF-8 characters. For now, you have to declare a use utf8 at the top of your program to enable the use of UTF-8 in literals.

    If you don't have a Unicode editor, you can always specify a particular character in ASCII with an extension of the \x notation. A character in the Latin-1 range may be written either as \x{ab} or as \xab, but if the number exceeds two hexadecimal digits, you must use braces. Unicode characters are specified by putting the hexadecimal code within braces after the \x. For instance, a Unicode smiley face is \x{263A}. There is no syntactic construct in Perl that assumes Unicode characters are exactly 16 bits, so you may not use \u263A as you can in other languages; \x{263A} is the closest equivalent. For inserting named characters via \N{CHARNAME}, see the use charnames pragma in Chapter 31, Pragmatic Modules.

•   Identifiers within the Perl script may contain Unicode alphanumeric characters, including ideographs:

        use utf8;
        $ ++;    # A child is born.

    Again, use utf8 is needed (for now) to recognize UTF-8 in your script. You are currently on your own when it comes to using the canonical forms of characters—Perl doesn't (yet) attempt to canonicalize variable names for you.
    We recommend that you canonicalize your programs to Normalization Form C, since that's what Perl will someday canonicalize to by default. See www.unicode.org for the latest technical report on canonicalization.

•   Regular expressions match characters instead of bytes. For instance, dot matches a character instead of a byte. If the Unicode Consortium ever gets around to approving the Tengwar script, then (despite the fact that such characters are represented in four bytes of UTF-8) this matches:

        "\N{TENGWAR LETTER SILME NUQUERNA}" =~ /^.$/

    The \C pattern is provided to force a match on a single byte ("char" in C, hence \C). Use \C with care, since it can put you out of sync with the character boundaries in your string, and you may get "Malformed UTF-8 character" errors. You may not use \C in square brackets, since it doesn't represent any particular character or set of characters.

•   Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. So \w can be used to match an ideograph:

        " " =~ /\w/

•   Named Unicode properties and block ranges can be used as character classes via the new \p (matches property) and \P (doesn't match property) constructs. For instance, \p{Lu} matches any character with the Unicode uppercase property, while \p{M} matches any mark character. Single-letter properties may omit the brackets, so mark characters can be matched by \pM also. Many predefined character classes are available, such as \p{IsMirrored} and \p{InTibetan}:

        "\N{greek:Iota}" =~ /\p{Lu}/

    You may also use \p and \P within square bracket character classes. (In version 5.6.0 of Perl, you need to use utf8 for character properties to work right. This restriction will be lifted in the future.) See Chapter 5, Pattern Matching, for details of matching on Unicode properties.
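    As a concrete sketch of the property matching just described (our own example, assuming a 5.6-era Perl; remember that in 5.6.0 you need use utf8 for properties to work right):

```perl
use utf8;    # required in 5.6.0 for \p and \P to behave

print "upper\n"     if "A"        =~ /\p{Lu}/;   # uppercase property
print "mark\n"      if "\x{0301}" =~ /\pM/;      # COMBINING ACUTE ACCENT
print "not upper\n" if "a"        =~ /\P{Lu}/;   # \P negates the property
```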
•   The special pattern \X matches any extended Unicode sequence (a "combining character sequence" in Standardese), where the first character is a base character and subsequent characters are mark characters that apply to the base character. It is equivalent to (?:\PM\pM*):

        "o\N{COMBINING TILDE BELOW}" =~ /\X/

    You may not use \X in square brackets, because it might match multiple characters and it doesn't match any particular character or set of characters.

•   The tr/// operator transliterates characters instead of bytes. To turn all characters outside the Latin-1 range into a question mark, you could say:

        tr/\0-\x{10ffff}/\0-\xff?/;    # utf8 to latin1 char

•   Case translation operators use the Unicode case translation tables when provided character input. Note that uc translates to uppercase, while ucfirst translates to titlecase (for languages that make the distinction). Naturally the corresponding backslash sequences have the same semantics:

        $x = "\u$word";    # titlecase first letter of $word
        $x = "\U$word";    # uppercase $word
        $x = "\l$word";    # lowercase first letter of $word
        $x = "\L$word";    # lowercase $word

    Be careful, because the Unicode case translation tables don't attempt to provide round-trip mappings in every instance, particularly for languages that use different numbers of characters for titlecase or uppercase than they do for the equivalent lowercase letter. As they say in the standard, while the case properties themselves are normative, the case mappings are only informational.

•   Most operators that deal with positions or lengths in the string will automatically switch to using character positions, including chop, substr, pos, index, rindex, sprintf, write, and length. Operators that deliberately don't switch include vec, pack, and unpack. Operators that really don't care include chomp, as well as any other operator that treats a string as a bucket of bits, such as the default sort and the operators dealing with filenames.
        use bytes;
        $bytelen = length("I do  .");    # 15 bytes
        no bytes;
        $charlen = length("I do  .");    # but 9 characters

•   The pack/unpack letters "c" and "C" do not change, since they're often used for byte-oriented formats. (Again, think "char" in the C language.) However, there is a new "U" specifier that will convert between UTF-8 characters and integers:

        pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

•   The chr and ord functions work on characters:

        chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000

    In other words, chr and ord are like pack("U") and unpack("U"), not like pack("C") and unpack("C"). In fact, the latter two are how you now emulate byte-oriented chr and ord if you're too lazy to use bytes.

•   And finally, scalar reverse reverses by character rather than by byte:

        " " eq reverse " "

If you look in directory PATH_TO_PERLLIB/unicode, you'll find a number of files that have to do with defining the semantics above. The Unicode properties database from the Unicode Consortium is in a file called Unicode.300 (for Unicode 3.0). This file has already been processed by mktables.PL into lots of little .pl files in the same directory (and in subdirectories Is/, In/, and To/ ), some of which are automatically slurped in by Perl to implement things like \p (see the Is/ and In/ directories) and uc (see the To/ directory). Other files are slurped in by modules like the use charnames pragma (see Name.pl ). But as of this writing, there are still a number of files that are just sitting there waiting for you to write an access module for them:

    ArabLink.pl
    ArabLnkGrp.pl
    Bidirectional.pl
    Block.pl
    Category.pl
    CombiningClass.pl
    Decomposition.pl
    JamoShort.pl
    Number.pl
    To/Digit.pl

A much more readable summary of Unicode, with many hyperlinks, is in PATH_TO_PERLLIB/unicode/Unicode3.html. Note that when the Unicode Consortium comes out with a new version, some of these filenames are likely to change, so you'll have to poke around.
You can find PATH_TO_PERLLIB with the following incantation:

    % perl -MConfig -le 'print $Config{privlib}'

To find out just about everything there is to find out about Unicode, you should check out The Unicode Standard, Version 3.0 (ISBN 0-201-61633-5).

Caution, Working

As of this writing (that is, with respect to version 5.6.0 of Perl), there are still some caveats on use of Unicode. (Check your online docs for updates.)

•   The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination of whether a particular pattern will match Unicode characters is made when the pattern is compiled (based on whether the pattern contains Unicode characters) and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

•   There is currently no easy way to mark data read from a file or other external source as being utf8. This will be a major area of focus in the near future and is probably already fixed as you read this.

•   There is no method for automatically coercing input and output to some encoding other than UTF-8. This is planned in the near future, however, so check your online docs.

•   Use of locales with utf8 may lead to odd results. Currently, there is some attempt to apply 8-bit locale information to characters in the range 0..255, but this is demonstrably incorrect for locales that use characters above that range (when mapped into Unicode). It will also tend to run slower. Avoidance of locales is strongly encouraged.

Unicode is fun—you just have to define fun correctly.

Chapter 16: Interprocess Communication

Computer processes have almost as many ways of communicating as people do. The difficulties of interprocess communication should not be underestimated. It doesn't do you any good to listen for verbal cues when your friend is using only body language.
Likewise, two processes can communicate only when they agree on the means of communication, and on the conventions built on top of that. As with any kind of communication, the conventions to be agreed upon range from lexical to pragmatic: everything from which lingo you'll use, up to whose turn it is to talk. These conventions are necessary because it's very difficult to communicate bare semantics in the absence of context.

In our lingo, interprocess communication is usually pronounced IPC. The IPC facilities of Perl range from the very simple to the very complex. Which facility you should use depends on the complexity of the information to be communicated. The simplest kind of information is almost no information at all: just the awareness that a particular event has happened at a particular point in time. In Perl, these events are communicated via a signal mechanism modeled on the Unix signal system.

At the other extreme, the socket facilities of Perl allow you to communicate with any other process on the Internet using any mutually supported protocol you like. Naturally, this freedom comes at a price: you have to go through a number of steps to set up the connections and make sure you're talking the same language as the process on the other end. This may in turn require you to adhere to any number of other strange customs, depending on local conventions. To be protocoligorically correct, you might even be required to speak a language like XML, or Java, or Perl. Horrors.

Sandwiched in between are some facilities intended primarily for communicating with processes on the same machine. These include good old-fashioned files, pipes, FIFOs, and the various System V IPC syscalls.
Support for these facilities varies across platforms; modern Unix systems (including Apple's Mac OS X) should support all of them, and, except for signals and SysV IPC, most of the rest are supported on any recent Microsoft operating systems, including pipes, forking, file locking, and sockets.*

More information about porting in general can be found in the standard Perl documentation set (in whatever format your system displays it) under perlport. Microsoft-specific information can be found under perlwin32 and perlfork, which are installed even on non-Microsoft systems. For textbooks, we suggest the following:

•   The Perl Cookbook, by Tom Christiansen and Nathan Torkington (O'Reilly and Associates, 1998), chapters 16 through 18.

•   Advanced Programming in the UNIX Environment, by W. Richard Stevens (Addison-Wesley, 1992).

•   TCP/IP Illustrated, by W. Richard Stevens, Volumes I-III (Addison-Wesley, 1992-1996).

Signals

Perl uses a simple signal-handling model: the %SIG hash contains references (either symbolic or hard) to user-defined signal handlers. Certain events cause the operating system to deliver a signal to the affected process. The handler corresponding to that event is called with one argument containing the name of the signal that triggered it. To send a signal to another process, you use the kill function. Think of it as sending a one-bit piece of information to the other process.† If that process has installed a signal handler for that signal, it can execute code when it receives the signal. But there's no way for the sending process to get any sort of return value, other than knowing that the signal was legally sent. The sender receives no feedback saying what, if anything, the receiving process did with the signal.

We've classified this facility as a form of IPC, but in fact, signals can come from various sources, not just other processes.
A signal might also come from your own process, or it might be generated when the user at the keyboard types a particular sequence like Control-C or Control-Z, or it might be manufactured by the kernel when a special event transpires, such as when a child process exits, or when your process runs out of stack space or hits a file size or memory limit. But your own process can't easily distinguish among these cases. A signal is like a package that arrives mysteriously on your doorstep with no return address. You'd best open it carefully.

* Well, except for AF_UNIX sockets.
† Actually, it's more like five or six bits, depending on how many signals your OS defines and on whether the other process makes use of the fact that you didn't send a different signal.

Since entries in the %SIG array can be hard references, it's common practice to use anonymous functions for simple signal handlers:

    $SIG{INT}  = sub { die "\nOutta here!\n" };
    $SIG{ALRM} = sub { die "Your alarm clock went off" };

Or you could create a named function and assign its name or reference to the appropriate slot in the hash. For example, to intercept interrupt and quit signals (often bound to Control-C and Control-\ on your keyboard), set up a handler like this:

    sub catch_zap {
        my $signame = shift;
        our $shucks++;
        die "Somebody sent me a SIG$signame!";
    }
    $shucks = 0;
    $SIG{INT}  = 'catch_zap';     # always means &main::catch_zap
    $SIG{INT}  = \&catch_zap;     # best strategy
    $SIG{QUIT} = \&catch_zap;     # catch another, too

Notice how all we do in the signal handler is set a global variable and then raise an exception with die. Whenever possible, try to avoid anything more complicated than that, because on most systems the C library is not re-entrant.
Signals are delivered asynchronously,* so calling any print functions (or even anything that needs to malloc (3) more memory) could in theory trigger a memory fault and subsequent core dump if you were already in a related C library routine when the signal was delivered. (Even the die routine is a bit unsafe unless the process is executing within an eval, which suppresses the I/O from die, which keeps it from calling the C library. Probably.)

An even easier way to trap signals is to use the sigtrap pragma to install simple, default signal handlers:

    use sigtrap qw(die INT QUIT);
    use sigtrap qw(die untrapped normal-signals stack-trace any error-signals);

The pragma is useful when you don't want to bother writing your own handler, but you still want to catch dangerous signals and perform an orderly shutdown. By default, some of these signals are so fatal to your process that your program will just stop in its tracks when it receives one. Unfortunately, that means that any END functions for at-exit handling and DESTROY methods for object finalization are not called. But they are called on ordinary Perl exceptions (such as when you call die), so you can use this pragma to painlessly convert the signals into exceptions. Even though you aren't dealing with the signals yourself, your program still behaves correctly. See the description of use sigtrap in Chapter 31, Pragmatic Modules, for many more features of this pragma.

* Synchronizing signal delivery with Perl-level opcodes is scheduled for a future release of Perl, which should solve the matter of signals and core dumps.
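If you do want to deal with the signal yourself, the same convert-to-exception idea looks like this sketch (long_running_task is a hypothetical stand-in for your real work):

```perl
# Turn SIGINT into an ordinary Perl exception, locally scoped.
eval {
    local $SIG{INT} = sub { die "caught SIGINT\n" };
    long_running_task();    # hypothetical; might run for minutes
};
if ($@ eq "caught SIGINT\n") {
    print "Interrupted, but shutting down in an orderly fashion.\n";
}
```

Because the eval catches the die, any END functions and DESTROY methods still run normally on the way out.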
You may also set the %SIG handler to either of the strings "IGNORE" or "DEFAULT", in which case Perl will try to discard the signal or allow the default action for that signal to occur (though some signals can be neither trapped nor ignored, such as the KILL and STOP signals; see signal (3), if you have it, for a list of signals available on your system and their default behaviors).

The operating system thinks of signals as numbers rather than names, but Perl, like most people, prefers symbolic names to magic numbers. To find the names of the signals, list out the keys of the %SIG hash, or use the kill -l command if you have one on your system. You can also use Perl's standard Config module to determine your operating system's mapping between signal names and signal numbers. See Config (3) for an example of this.

Because %SIG is a global hash, assignments to it affect your entire program. It's often more considerate to the rest of your program to confine your signal catching to a restricted scope. Do this with a local signal handler assignment, which goes out of effect once the enclosing block is exited. (But remember that local values are visible in functions called from within that block.)

    {
        local $SIG{INT} = 'IGNORE';
        ...        # Do whatever you want here, ignoring all SIGINTs.
        fn();      # SIGINTs ignored inside fn() too!
        ...        # And here.
    }              # Block exit restores previous $SIG{INT} value.
    fn();          # SIGINTs not ignored inside fn() (presumably).

Signaling Process Groups

Processes (under Unix, at least) are organized into process groups, generally corresponding to an entire job. For example, when you fire off a single shell command that consists of a series of filter commands that pipe data from one to the other, those processes (and their child processes) all belong to the same process group. That process group has a number corresponding to the process number of the process group leader.
If you send a signal to a positive process number, it just sends the signal to the process, but if you send a signal to a negative number, it sends that signal to every process whose process group number is the corresponding positive number, that is, the process number of the process group leader. (Conveniently for the process group leader, the process group ID is just $$.)

Suppose your program wants to send a hang-up signal to all child processes it started directly, plus any grandchildren started by those children, plus any great-grandchildren started by those grandchildren, and so on. To do this, your program first calls setpgrp(0,0) to become the leader of a new process group, and any processes it creates will be part of the new group. It doesn't matter whether these processes were started manually via fork, automatically via piped opens, or as backgrounded jobs with system("cmd &"). Even if those processes had children of their own, sending a hang-up signal to your entire process group will find them all (except for processes that have set their own process group or changed their UID to give themselves diplomatic immunity to your signals).

    {
        local $SIG{HUP} = 'IGNORE';    # exempt myself
        kill(HUP, -$$);                # signal my own process group
    }

Another interesting signal is signal number 0. This doesn't actually affect the target process, but instead checks that it's alive and hasn't changed its UID. That is, it checks whether it's legal to send a signal, without actually sending one.

    unless (kill 0 => $kid_pid) {
        warn "something wicked happened to $kid_pid";
    }

Signal number 0 is the only signal that works the same under Microsoft ports of Perl as it does in Unix. On Microsoft systems, kill does not actually deliver a signal. Instead, it forces the target process to exit with the status indicated by the signal number. This may be fixed someday. The magic 0 signal, however, still behaves in the standard, nondestructive fashion.
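One use of that nondestructive check is a polling loop that waits for another process to disappear (a sketch of ours; $kid_pid is assumed to already hold the target's process ID, and note that on Unix an unreaped zombie child will keep answering until someone waits on it):

```perl
# Poll with the magic 0 signal until the process goes away.
while (kill 0 => $kid_pid) {
    sleep 1;    # still signalable; check again in a second
}
warn "$kid_pid is gone\n";
```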
Reaping Zombies

When a process exits, its parent is sent a CHLD signal by the kernel and the process becomes a zombie* until the parent calls wait or waitpid. If you start another process in Perl using anything except fork, Perl takes care of reaping your zombied children, but if you use a raw fork, you're expected to clean up after yourself. On many but not all kernels, a simple hack for autoreaping zombies is to set $SIG{CHLD} to 'IGNORE'. A more flexible (but tedious) approach is to reap them yourself. Because more than one child may have died before you get around to dealing with them, you must gather your zombies in a loop until there aren't any more:

    use POSIX ":sys_wait_h";
    sub REAPER { 1 until waitpid(-1, WNOHANG) == -1 }

To run this code as needed, you can either set a CHLD signal handler for it:

    $SIG{CHLD} = \&REAPER;

or, if you're running in a loop, just arrange to call the reaper every so often. This is the best approach because it isn't subject to the occasional core dump that signals can sometimes trigger in the C library. However, it's expensive if called in a tight loop, so a reasonable compromise is to use a hybrid strategy where you minimize the risk within the handler by doing as little as possible and waiting until outside to reap zombies:

    our $zombies = 0;
    $SIG{CHLD} = sub { $zombies++ };
    sub reaper {
        my $zombie;
        our %Kid_Status;    # store each exit status
        $zombies = 0;
        while (($zombie = waitpid(-1, WNOHANG)) != -1) {
            $Kid_Status{$zombie} = $?;
        }
    }
    while (1) {
        reaper() if $zombies;
        ...
    }

This code assumes your kernel supports reliable signals. Old SysV traditionally didn't, which made it impossible to write correct signal handlers there. Ever since way back in the 5.003 release, Perl has used the sigaction (2) syscall where available, which is a lot more dependable.

* Yes, that really is the technical term.
This means that unless you're running on an ancient operating system or with an ancient Perl, you won't have to reinstall your handlers and risk missing signals. Fortunately, all BSD-flavored systems (including Linux, Solaris, and Mac OS X) plus all POSIX-compliant systems provide reliable signals, so the old broken SysV behavior is more a matter of historical note than of current concern.

With these newer kernels, many other things will work better, too. For example, "slow" syscalls (those that can block, like read, wait, and accept) will restart automatically if interrupted by a signal. In the bad old days, user code had to remember to check explicitly whether each slow syscall failed with $! ($ERRNO) set to EINTR and, if so, restart. This wouldn't happen just from INT signals; even innocuous signals like TSTP (from a Control-Z) or CONT (from foregrounding the job) would abort the syscall. Perl now restarts the syscall for you automatically if the operating system allows it to. This is generally construed to be a feature.

You can check whether you have the more rigorous POSIX-style signal behavior by loading the Config module and checking whether $Config{d_sigaction} has a true value. To find out whether slow syscalls are restartable, check your system documentation on sigaction (2) or sigvec (3), or scrounge around your C sys/signal.h file for SV_INTERRUPT or SA_RESTART. If one or both symbols are found, you probably have restartable syscalls.

Timing Out Slow Operations

A common use for signals is to impose time limits on long-running operations.
If you're on a Unix system (or any other POSIX-conforming system that supports the ALRM signal), you can ask the kernel to send your process an ALRM at some point in the future:

    use Fcntl ':flock';
    eval {
        local $SIG{ALRM} = sub { die "alarm clock restart" };
        alarm 10;                      # schedule alarm in 10 seconds
        eval {
            flock(FH, LOCK_EX)         # a blocking, exclusive lock
                or die "can't flock: $!";
        };
        alarm 0;                       # cancel the alarm
    };
    alarm 0;                           # race condition protection
    die if $@ && $@ !~ /alarm clock restart/;    # reraise

If the alarm hits while you're waiting for the lock, and you simply catch the signal and return, you'll go right back into the flock because Perl automatically restarts syscalls where it can. The only way out is to raise an exception through die and then let eval catch it. (This works because the exception winds up calling the C library's longjmp (3) function, which is what really gets you out of the restarting syscall.)

The nested exception trap is included because calling flock would raise an exception if flock is not implemented on your platform, and you need to make sure to clear the alarm anyway. The second alarm 0 is provided in case the signal comes in after running the flock but before getting to the first alarm 0. Without the second alarm, you would risk a tiny race condition—but size doesn't matter in race conditions; they either exist or they don't. And we prefer that they don't.

Blocking Signals

Now and then, you'd like to delay receipt of a signal during some critical section of code. You don't want to blindly ignore the signal, but what you're doing is too important to interrupt.
Perl's %SIG hash doesn't implement signal blocking, but the POSIX module does, through its interface to the sigprocmask (2) syscall:

    use POSIX qw(:signal_h);

    $sigset   = POSIX::SigSet->new;
    $blockset = POSIX::SigSet->new(SIGINT, SIGQUIT, SIGCHLD);
    sigprocmask(SIG_BLOCK, $blockset, $sigset)
        or die "Could not block INT,QUIT,CHLD signals: $!\n";

Once the three signals are all blocked, you can do whatever you want without fear of being bothered. When you're done with your critical section, unblock the signals by restoring the old signal mask:

    sigprocmask(SIG_SETMASK, $sigset)
        or die "Could not restore INT,QUIT,CHLD signals: $!\n";

If any of the three signals came in while blocked, they are delivered immediately. If two or more different signals are pending, the order of delivery is not defined. Additionally, no distinction is made between having received a particular signal once while blocked and having received it many times.* For example, if nine child processes exited while you were blocking CHLD signals, your handler (if you had one) would still be called only once after you unblocked. That's why, when you reap zombies, you should always loop until they're all gone.

Files

Perhaps you've never thought about files as an IPC mechanism before, but they shoulder the lion's share of interprocess communication—far more than all other means combined. When one process deposits its precious data in a file and another process later retrieves that data, those processes have communicated. Files offer something unique among all forms of IPC covered here: like a papyrus scroll unearthed after millennia buried in the desert, a file can be unearthed and read long after its writer's personal end.† Factoring in persistence with comparative ease of use, it's no wonder that files remain popular. Using files to transmit information from the dead past to some unknown future poses few surprises. You write the file to some permanent medium like a disk, and that's about it.
(You might tell a web server where to find it, if it contains HTML.)

* Traditionally, that is. Countable signals may be implemented on some real-time systems according to the latest specs, but we haven't seen these yet.
† Presuming that a process can have a personal end.

The interesting challenge is when all parties are still alive and trying to communicate with one another. Without some agreement about whose turn it is to have their say, reliable communication is impossible; agreement may be achieved through file locking, which is covered in the next section. In the section after that, we discuss the special relationship that exists between a parent process and its children, which allows related parties to exchange information through inherited access to the same files.

Files certainly have their limitations when it comes to things like remote access, synchronization, reliability, and session management. Other sections of the chapter cover various IPC mechanisms invented to address such limitations.

File Locking

In a multitasking environment, you need to be careful not to collide with other processes that are trying to use the same file you're using. As long as all processes are just reading, there's no problem, but as soon as even one process needs to write to the file, complete chaos ensues unless some sort of locking mechanism acts as traffic cop.

Never use the mere existence of a filename (that is, -e $file) as a locking indication, because a race condition exists between the test for existence of that filename and whatever you plan to do with it (like create it, open it, or unlink it). See the section "Handling Race Conditions" in Chapter 23, Security, for more about this.

Perl's portable locking interface is the flock(HANDLE,FLAGS) function, described in Chapter 29, Functions. Perl maximizes portability by using only the simplest and most widespread locking features found on the broadest range of platforms.
These semantics are simple enough that they can be emulated on most systems, including those that don’t support the traditional syscall of that name, such as System V or Windows NT. (If you’re running a Microsoft system earlier than NT, though, you’re probably out of luck, as you would be if you’re running a system from Apple before Mac OS X.)

Locks come in two varieties: shared (the LOCK_SH flag) and exclusive (the LOCK_EX flag). Despite the suggestive sound of “exclusive”, processes aren’t required to obey locks on files. That is, flock only implements advisory locking, which means that locking a file does not stop another process from reading or even writing the file. Requesting an exclusive lock is just a way for a process to let the operating system suspend it until all current lockers, whether shared or exclusive, are finished with it. Similarly, when a process asks for a shared lock, it is just suspending itself until there is no exclusive locker. Only when all parties use the file-locking mechanism can a contended file be accessed safely.

Therefore, flock is a blocking operation by default. That is, if you can’t get the lock you want immediately, the operating system suspends your process till you can. Here’s how to get a blocking, shared lock, typically used for reading a file:

    use Fcntl qw(:DEFAULT :flock);
    open(FH, "< filename")  or die "can't open filename: $!";
    flock(FH, LOCK_SH)      or die "can't lock filename: $!";
    # now read from FH

You can try to acquire a lock in a nonblocking fashion by including the LOCK_NB flag in the flock request. If you can’t be given the lock right away, the function fails and immediately returns false. Here’s an example:

    flock(FH, LOCK_SH | LOCK_NB)
        or die "can't lock filename: $!";

You may wish to do something besides raising an exception as we did here, but you certainly don’t dare do any I/O on the file.
If you are refused a lock, you shouldn’t access the file until you can get the lock. Who knows what scrambled state you might find the file in? The main purpose of the nonblocking mode is to let you go off and do something else while you wait. But it can also be useful for producing friendlier interactions by warning users that it might take a while to get the lock, so they don’t feel abandoned:

    use Fcntl qw(:DEFAULT :flock);
    open(FH, "< filename")  or die "can't open filename: $!";
    unless (flock(FH, LOCK_SH | LOCK_NB)) {
        local $| = 1;
        print "Waiting for lock on filename...";
        flock(FH, LOCK_SH)  or die "can't lock filename: $!";
        print "got it.\n"
    }
    # now read from FH

Some people will be tempted to put that nonblocking lock into a loop. The main problem with nonblocking mode is that, by the time you get back to checking again, someone else may have grabbed the lock because you abandoned your place in line. Sometimes you just have to get in line and wait. If you’re lucky there will be some magazines to read.

Locks are on filehandles, not on filenames.* When you close the file, the lock dissolves automatically, whether you close the file explicitly by calling close or implicitly by reopening the handle or by exiting your process.

* Actually, locks aren’t on filehandles—they’re on the file descriptors associated with the filehandles since the operating system doesn’t know about filehandles. That means that all our die messages about failing to get a lock on filenames are technically inaccurate. But error messages of the form “I can’t get a lock on the file represented by the file descriptor associated with the filehandle originally opened to the path filename, although by now filename may represent a different file entirely than our handle does” would just confuse the user (not to mention the reader).

To get an exclusive lock, typically used for writing, you have to be more careful.
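Between blocking forever and spinning on LOCK_NB, there is a middle ground: bound the wait. The helper below is our own sketch, not code from the book; the name lock_with_timeout is made up, and it assumes a Unix-like system where alarm can interrupt a blocking flock (the usual eval/alarm idiom).

```perl
use Fcntl qw(:flock);

# Try for an exclusive lock, but give up after $timeout seconds
# rather than blocking forever. Returns true if the lock was won.
sub lock_with_timeout {
    my ($fh, $timeout) = @_;
    my $got = eval {
        local $SIG{ALRM} = sub { die "lock timeout\n" };
        alarm($timeout);
        my $ok = flock($fh, LOCK_EX);
        alarm(0);
        $ok;
    };
    alarm(0);    # clear the pending alarm if the eval died
    return $got ? 1 : 0;
}
```

If the alarm fires first, the die unwinds the eval and the function returns false, much as a failed LOCK_NB attempt would; unlike the polling loop, though, you stay in line for the whole wait instead of repeatedly abandoning your place.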
You cannot use a regular open for this; if you use an open mode of <, it will fail on files that don’t exist yet, and if you use >, it will clobber any files that do. Instead, use sysopen on the file so it can be locked before getting overwritten. Once you’ve safely opened the file for writing but haven’t yet touched it, successfully acquire the exclusive lock and only then truncate the file. Now you may overwrite it with the new data.

    use Fcntl qw(:DEFAULT :flock);
    sysopen(FH, "filename", O_WRONLY | O_CREAT)
        or die "can't open filename: $!";
    flock(FH, LOCK_EX)  or die "can't lock filename: $!";
    truncate(FH, 0)     or die "can't truncate filename: $!";
    # now write to FH

If you want to modify the contents of a file in place, use sysopen again. This time you ask for both read and write access, creating the file if needed. Once the file is opened, but before you’ve done any reading or writing, get the exclusive lock and keep it around your entire transaction. It’s often best to release the lock by closing the file because that guarantees all buffers are written before the lock is released.

An update involves reading in old values and writing out new ones. You must do both operations under a single exclusive lock, lest another process read the (imminently incorrect) value after (or even before) you do, but before you write. (We’ll revisit this situation when we cover shared memory later in this chapter.)
    use Fcntl qw(:DEFAULT :flock);
    sysopen(FH, "counterfile", O_RDWR | O_CREAT)
        or die "can't open counterfile: $!";
    flock(FH, LOCK_EX)          or die "can't write-lock counterfile: $!";
    $counter = <FH> || 0;       # first time would be undef
    seek(FH, 0, 0)              or die "can't rewind counterfile: $!";
    print FH $counter+1, "\n"   or die "can't write counterfile: $!";
    # next line technically superfluous in this program, but
    # a good idea in the general case
    truncate(FH, tell(FH))      or die "can't truncate counterfile: $!";
    close(FH)                   or die "can't close counterfile: $!";

You can’t lock a file you haven’t opened yet, and you can’t have a single lock that applies to more than one file. What you can do, though, is use a completely separate file to act as a sort of semaphore, like a traffic light, to provide controlled access to something else through regular shared and exclusive locks on the semaphore file.

This approach has several advantages. You can have one lockfile that controls access to multiple files, avoiding the kind of deadlock that occurs when one process tries to lock those files in one order while another process is trying to lock them in a different order. You can use a semaphore file to lock an entire directory of files. You can even control access to something that’s not even in the filesystem, like a shared memory object or the socket upon which several preforked servers would like to call accept.

If you have a DBM file that doesn’t provide its own explicit locking mechanism, an auxiliary lockfile is the best way to control concurrent access by multiple agents. Otherwise, your DBM library’s internal caching can get out of sync with the file on disk. Before calling dbmopen or tie, open and lock the semaphore file. If you open the database with O_RDONLY, you’ll want to use LOCK_SH for the lock. Otherwise, use LOCK_EX for exclusive access to updating the database.
(Again, this only works if all participants agree to pay attention to the semaphore.)

    use Fcntl qw(:DEFAULT :flock);
    use DB_File;    # demo purposes only; any db is fine
    $DBNAME = "/path/to/database";
    $LCK = $DBNAME . ".lockfile";
    # use O_RDWR if you expect to put data in the lockfile
    sysopen(DBLOCK, $LCK, O_RDONLY | O_CREAT)
        or die "can't open $LCK: $!";
    # must get lock before opening database
    flock(DBLOCK, LOCK_SH)
        or die "can't LOCK_SH $LCK: $!";
    tie(%hash, "DB_File", $DBNAME, O_RDWR | O_CREAT)
        or die "can't tie $DBNAME: $!";

Now you can safely do whatever you’d like with the tied %hash. When you’re done with your database, make sure you explicitly release those resources, and in the opposite order that you acquired them:

    untie %hash;    # must close database before lockfile
    close DBLOCK;   # safe to let go of lock now

If you have the GNU DBM library installed, you can use the standard GDBM_File module’s implicit locking. Unless the initial tie contains the GDBM_NOLOCK flag, the library makes sure that only one writer may open a GDBM file at a time, and that readers and writers do not have the database open at the same time.

Passing Filehandles

Whenever you create a child process using fork, that new process inherits all its parent’s open filehandles. Using filehandles for interprocess communication is easiest to illustrate by using plain files first. Understanding how this works is essential for mastering the fancier mechanisms of pipes and sockets described later in this chapter. The simplest example opens a file and starts up a child process.
The child then uses the filehandle already opened for it:

    open(INPUT, "< /etc/motd")      or die "/etc/motd: $!";
    if ($pid = fork) { waitpid($pid,0) }
    else {
        defined($pid)               or die "fork: $!";
        while (<INPUT>) { print "$.: $_" }
        exit;   # don't let child fall back into main code
    }
    # INPUT handle now at EOF in parent

Once access to a file has been granted by open, it stays granted until the filehandle is closed; changes to the file’s permissions or to the owner’s access privileges have no effect on accessibility. Even if the process later alters its user or group IDs, or the file has its ownership changed to a different user or group, that doesn’t affect filehandles that are already open. Programs running under increased permissions (like set-id programs or system daemons) often open a file under their increased rights and then hand off the filehandle to a child process that could not have opened the file on its own.

Although this feature is of great convenience when used intentionally, it can also create security issues if filehandles accidentally leak from one program to the next. To avoid granting implicit access to all possible filehandles, Perl automatically closes any filehandles it has opened (including pipes and sockets) whenever you explicitly exec a new program or implicitly execute one through a call to a piped open, system, or qx// (backticks). The system filehandles STDIN, STDOUT, and STDERR are exempt from this because their main purpose is to provide communications linkage between programs.
So one way of passing a filehandle to a new program is to copy the filehandle to one of the standard filehandles:

    open(INPUT, "< /etc/motd")  or die "/etc/motd: $!";
    if ($pid = fork) { wait }
    else {
        defined($pid)           or die "fork: $!";
        open(STDIN, "<&INPUT")  or die "dup: $!";
        exec("cat", "-n")       or die "exec cat: $!";
    }

If you really want the new program to gain access to a filehandle other than these three, you can, but you have to do one of two things. When Perl opens a new file (or pipe or socket), it checks the current setting of the $^F ($SYSTEM_FD_MAX) variable. If the numeric file descriptor used by that new filehandle is greater than $^F, the descriptor is marked as one to close. Otherwise, Perl leaves it alone, and new programs you exec will inherit access. It’s not always easy to predict what file descriptor your newly opened filehandle will have, but you can temporarily set your maximum system file descriptor to some outrageously high number for the duration of the open:

    # open file and mark INPUT to be left open across execs
    {
        local $^F = 10_000;
        open(INPUT, "< /etc/motd")  or die "/etc/motd: $!";
    }   # old value of $^F restored on scope exit

Now all you have to do is get the new program to pay attention to the descriptor number of the filehandle you just opened. The cleanest solution (on systems that support this) is to pass a special filename that equates to a file descriptor. If your system has a directory called /dev/fd or /proc/$$/fd containing files numbered from 0 through the maximum number of supported descriptors, you can probably use this strategy. (Many Linux operating systems have both, but only the /proc version tends to be correctly populated. BSD and Solaris prefer /dev/fd. You’ll have to poke around at your system to see which looks better for you.)
First, open and mark your filehandle as one to be left open across execs as shown in the previous code, then fork like this:

    if ($pid = fork) { wait }
    else {
        defined($pid)   or die "fork: $!";
        $fdfile = "/dev/fd/" . fileno(INPUT);
        exec("cat", "-n", $fdfile)  or die "exec cat: $!";
    }

If your system supports the fcntl syscall, you may diddle the filehandle’s close-on-exec flag manually. This is convenient for those times when you didn’t realize back when you created the filehandle that you would want to share it with your children.

    use Fcntl qw/F_SETFD/;
    fcntl(INPUT, F_SETFD, 0)
        or die "Can't clear close-on-exec flag on INPUT: $!\n";

You can also force a filehandle to close:

    fcntl(INPUT, F_SETFD, 1)
        or die "Can't set close-on-exec flag on INPUT: $!\n";

You can also query the current status:

    use Fcntl qw/F_SETFD F_GETFD/;
    printf("INPUT will be %s across execs\n",
        fcntl(INPUT, F_GETFD, 1) ? "closed" : "left open");

If your system doesn’t support file descriptors named in the filesystem, and you want to pass a filehandle other than STDIN, STDOUT, or STDERR, you can still do so, but you’ll have to make special arrangements with that program. Common strategies for this are to pass the descriptor number through an environment variable or a command-line option.

If the executed program is in Perl, you can use open to convert a file descriptor into a filehandle. Instead of specifying a filename, use “&=” followed by the descriptor number.

    if (defined($ENV{input_fdno}) && $ENV{input_fdno} =~ /^\d+$/) {
        open(INPUT, "<&=$ENV{input_fdno}")
            or die "can't fdopen $ENV{input_fdno} for input: $!";
    }

It gets even easier than that if you’re going to be running a Perl subroutine or program that expects a filename argument. You can use the descriptor-opening feature of Perl’s regular open function (but not sysopen or three-argument open) to make this happen automatically.
Imagine you have a simple Perl program like this:

    #!/usr/bin/perl -p
    # nl - number input lines
    printf "%6d ", $.;

Presuming you’ve arranged for the INPUT handle to stay open across execs, you can call that program this way:

    $fdspec = '<&=' . fileno(INPUT);
    system("nl", $fdspec);

or to catch the output:

    @lines = `nl '$fdspec'`;    # single quotes protect spec from shell

Whether or not you exec another program, if you use file descriptors inherited across fork, there’s one small gotcha. Unlike variables copied across a fork, which actually get duplicate but independent copies, file descriptors really are the same in both processes. If one process reads data from the handle, the seek pointer (file position) advances in the other process, too, and that data is no longer available to either process. If they take turns reading, they’ll leapfrog over each other in the file. This makes intuitive sense for handles attached to serial devices, pipes, or sockets, since those tend to be read-only devices with ephemeral data. But this behavior may surprise you with disk files. If this is a problem, reopen any files that need separate tracking after the fork.

The fork operator is a concept derived from Unix, which means it might not be implemented correctly on all non-Unix/non-POSIX platforms. Notably, fork works on Microsoft systems only if you’re running Perl 5.6 (or better) on Windows 98 (or later). Although fork is implemented via multiple concurrent execution streams within the same program on these systems, these aren’t the sort of threads where all data is shared by default; here, only file descriptors are. See also Chapter 17, Threads.

Pipes

A pipe is a unidirectional I/O channel that can transfer a stream of bytes from one process to another. Pipes come in both named and nameless varieties. You may be more familiar with nameless pipes, so we’ll talk about those first.
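The leapfrog effect described under “Passing Filehandles” is easy to see with a short, self-contained experiment of our own devising (not code from the book). We use sysread because Perl’s buffered readline would slurp a whole block on the first read and hide the shared file position:

```perl
# Each record is exactly 7 bytes ("line N\n"), so fixed-size
# sysreads step through the file one record at a time.
my $path = "/tmp/leapfrog.$$";
open(my $out, ">", $path) or die "can't create $path: $!";
print $out "line $_\n" for 1 .. 4;
close($out);

open(my $in, "<", $path) or die "can't open $path: $!";
my $buf;
sysread($in, $buf, 7);          # read "line 1\n" before forking
if (my $pid = fork) {
    waitpid($pid, 0);           # let the child take its turn
    sysread($in, $buf, 7);      # parent gets "line 3\n", not "line 2\n"
    print "parent leapfrogged to: $buf";
} else {
    die "cannot fork: $!" unless defined $pid;
    sysread($in, $buf, 7);      # child consumes "line 2\n"
    exit;
}
unlink($path);
```

Because fork duplicates the descriptor rather than the underlying open file, the child’s read advances the one shared offset, and the parent’s next read skips a record.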
Anonymous Pipes

Perl’s open function opens a pipe instead of a file when you append or prepend a pipe symbol to the second argument to open. This turns the rest of the arguments into a command, which will be interpreted as a process (or set of processes) that you want to pipe a stream of data either into or out of. Here’s how to start up a child process that you intend to write to:

    open SPOOLER, "| cat -v | lpr -h 2>/dev/null"
        or die "can't fork: $!";
    local $SIG{PIPE} = sub { die "spooler pipe broke" };
    print SPOOLER "stuff\n";
    close SPOOLER   or die "bad spool: $! $?";

This example actually starts up two processes, the first of which (running cat) we print to directly. The second process (running lpr) then receives the output of the first process. In shell programming, this is often called a pipeline. A pipeline can have as many processes in a row as you like, as long as the ones in the middle know how to behave like filters; that is, they read standard input and write standard output.

Perl uses your default system shell (/bin/sh on Unix) whenever a pipe command contains special characters that the shell cares about. If you’re only starting one command, and you don’t need—or don’t want—to use the shell, you can use the multi-argument form of a piped open instead:

    open SPOOLER, "|-", "lpr", "-h"     # requires 5.6.1
        or die "can't run lpr: $!";

If you reopen your program’s standard output as a pipe to another program, anything you subsequently print to STDOUT will be standard input for the new program. So to page your program’s output,* you’d use:

    if (-t STDOUT) {    # only if stdout is a terminal
        my $pager = $ENV{PAGER} || 'more';
        open(STDOUT, "| $pager")    or die "can't fork a pager: $!";
    }
    END { close(STDOUT)             or die "can't close STDOUT: $!" }

When you’re writing to a filehandle connected to a pipe, always explicitly close that handle when you’re done with it. That way your main program doesn’t exit before its offspring.
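As a small, self-contained illustration of that advice (our own sketch, not code from the book): pipe some numbers through the external sort(1) program, then close the handle explicitly and check the result.

```perl
# Write unsorted numbers down a pipe to sort(1); the explicit
# close waits for the child and reports any failure via $! and $?.
my $outfile = "/tmp/sorted.$$";
open(my $sorter, "| sort -n > $outfile") or die "can't fork sort: $!";
local $SIG{PIPE} = sub { die "sort pipe broke" };
print $sorter "$_\n" for (3, 1, 2);
close($sorter) or die "sort failed: $! $?";
```

Without the explicit close, you would never see sort’s exit status, and a failure in the child could go entirely unnoticed.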
Here’s how to start up a child process that you intend to read from:

    open STATUS, "netstat -an 2>/dev/null |"
        or die "can't fork: $!";
    while (<STATUS>) {
        next if /^(tcp|udp)/;
        print;
    }
    close STATUS    or die "bad netstat: $! $?";

You can open a multistage pipeline for input just as you can for output. And as before, you can avoid the shell by using an alternate form of open:

    open STATUS, "-|", "netstat", "-an"     # requires 5.6.1
        or die "can't run netstat: $!";

But then you don’t get I/O redirection, wildcard expansion, or multistage pipes, since Perl relies on your shell to do those. You might have noticed that you can use backticks to accomplish the same effect as opening a pipe for reading:

    print grep { !/^(tcp|udp)/ } `netstat -an 2>&1`;
    die "bad netstat" if $?;

* That is, let them view it one screenful at a time, not set off random bird calls.

While backticks are extremely handy, they have to read the whole thing into memory at once, so it’s often more efficient to open your own piped filehandle and process the file one line or record at a time. This gives you finer control over the whole operation, letting you kill off the child process early if you like. You can also be more efficient by processing the input as it’s coming in, since computers can interleave various operations when two or more processes are running at the same time. (Even on a single-CPU machine, input and output operations can happen while the CPU is doing something else.)

Because you’re running two or more processes concurrently, disaster can strike the child process any time between the open and the close. This means that the parent must check the return values of both open and close. Checking the open isn’t good enough, since that will only tell you whether the fork was successful, and possibly whether the subsequent command was successfully launched.
(It can tell you this only in recent versions of Perl, and only if the command is executed directly by the forked child, not via the shell.) Any disaster that happens after that is reported from the child to the parent as a nonzero exit status. When the close function sees that, it knows to return a false value, indicating that the actual status value should be read from the $? ($CHILD_ERROR) variable. So checking the return value of close is just as important as checking open. If you’re writing to a pipe, you should also be prepared to handle the PIPE signal, which is sent to you if the process on the other end dies before you’re done sending to it.

Talking to Yourself

Another approach to IPC is to make your program talk to itself, in a manner of speaking. Actually, your process talks over pipes to a forked copy of itself. It works much like the piped open we talked about in the last section, except that the child process continues executing your script instead of some other command. To represent this to the open function, you use a pseudocommand consisting of a minus. So the second argument to open looks like either “-|” or “|-”, depending on whether you want to pipe from yourself or to yourself. As with an ordinary fork command, the open function returns the child’s process ID in the parent process but 0 in the child process. Another asymmetry is that the filehandle named by the open is used only in the parent process. The child’s end of the pipe is hooked to either STDIN or STDOUT as appropriate.
That is, if you open a pipe to minus with |-, you can write to the filehandle you opened, and your kid will find this in STDIN:

    if (open(TO, "|-")) {
        print TO $fromparent;
    }
    else {
        $tochild = <STDIN>;
        exit;
    }

If you open a pipe from minus with -|, you can read from the filehandle you opened, which will return whatever your kid writes to STDOUT:

    if (open(FROM, "-|")) {
        $toparent = <FROM>;
    }
    else {
        print STDOUT $fromchild;
        exit;
    }

One common application of this construct is to bypass the shell when you want to open a pipe from a command. You might want to do this because you don’t want the shell to interpret any possible metacharacters in the filenames you’re trying to pass to the command. If you’re running release 5.6.1 or greater of Perl, you can use the multi-argument form of open to get the same result.

Another use of a forking open is to safely open a file or command even while you’re running under an assumed UID or GID. The child you fork drops any special access rights, then safely opens the file or command and acts as an intermediary, passing data between its more powerful parent and the file or command it opened. Examples can be found in the section “Accessing Commands and Files Under Reduced Privileges”, in Chapter 23.

One creative use of a forking open is to filter your own output. Some algorithms are much easier to implement in two separate passes than they are in just one pass. Here’s a simple example in which we emulate the Unix tee(1) program by sending our normal output down a pipe. The agent on the other end of the pipe (one of our own subroutines) distributes our output to all the files specified:

    tee("/tmp/foo", "/tmp/bar", "/tmp/glarch");
    while (<>) { print "$ARGV at line $.
=> $_"; } close(STDOUT) or die "can’t close STDOUT: $!"; sub tee { my @output = @_; my @handles = (); for my $path (@output) { my $fh; # open will fill this in unless (open ($fh, ">", $path)) { warn "cannot write to $path: $!"; next; } push @handles, $fh; } 430 Chapter 16: Interprocess Communication # reopen STDOUT in parent and return return if my $pid = open(STDOUT, "|-"); die "cannot fork: $!" unless defined $pid; # process STDIN in child while (<STDIN>) { for my $fh (@handles) { print $fh $_ or die "tee output failed: $!"; } } for my $fh (@handles) { close($fh) or die "tee closing failed: $!"; } exit; # don’t let the child return to main! } This technique can be applied repeatedly to push as many filters on your output stream as you wish. Just keep calling functions that fork-open STDOUT, and have the child read from its parent (which it sees as STDIN) and pass the massaged output along to the next function in the stream. Another interesting application of talking to yourself with fork-open is to capture the output from an ill-mannered function that always splats its results to STDOUT. Imagine if Perl only had printf and no sprintf. What you’d need would be something that worked like backticks, but with Perl functions instead of external commands: badfunc("arg"); # drat, escaped! $string = forksub(\&badfunc, "arg"); # caught it as string @lines = forksub(\&badfunc, "arg"); # as separate lines sub forksub { my $kidpid = open my $self, "-|"; defined $kidpid or die "cannot fork: $!"; shift->(@_), exit unless $kidpid; local $/ unless wantarray; return <$self>; # closes on scope exit } We’re not claiming this is efficient; a tied filehandle would probably be a good bit faster. But it’s a lot easier to code up if you’re in more of a hurry than your computer is. Bidirectional Communication Although using open to connect to another command over a pipe works reasonably well for unidirectional communication, what about bidirectional communication? 
The obvious approach doesn’t actually work:

    open(PROG_TO_READ_AND_WRITE, "| some program |")    # WRONG!

and if you forget to enable warnings, then you’ll miss out entirely on the diagnostic message:

    Can't do bidirectional pipe at myprog line 3.

The open function doesn’t allow this because it’s rather prone to deadlock unless you’re quite careful. But if you’re determined, you can use the standard IPC::Open2 library module to attach two pipes to a subprocess’s STDIN and STDOUT. There’s also an IPC::Open3 module for tridirectional I/O (allowing you to also catch your child’s STDERR), but this requires either an awkward select loop or the somewhat more convenient IO::Select module. But then you’ll have to avoid Perl’s buffered input operations like <> (readline). Here’s an example using open2:

    use IPC::Open2;
    local (*Reader, *Writer);
    $pid = open2(\*Reader, \*Writer, "bc -l");
    $sum = 2;
    for (1 .. 5) {
        print Writer "$sum * $sum\n";
        chomp($sum = <Reader>);
    }
    close Writer;
    close Reader;
    waitpid($pid, 0);
    print "sum is $sum\n";

You can also autovivify lexical filehandles:

    my ($fhread, $fhwrite);
    $pid = open2($fhread, $fhwrite, "cat -u -n");

The problem with this in general is that standard I/O buffering is really going to ruin your day. Even though your output filehandle is autoflushed (the library does this for you) so that the process on the other end will get your data in a timely manner, you can’t usually do anything to force it to return the favor. In this particular case, we were lucky: bc expects to operate over a pipe and knows to flush each output line. But few commands are so designed, so this seldom works out unless you yourself wrote the program on the other end of the double-ended pipe. Even simple, apparently interactive programs like ftp fail here because they won’t do line buffering on a pipe. They’ll only do it on a tty device.
The IO::Pty and Expect modules from CPAN can help with this because they provide a real tty (actually, a real pseudo-tty, but it acts like a real one). This gets you line buffering in the other process without modifying its program.

If you split your program into several processes and want these to all have a conversation that goes both ways, you can’t use Perl’s high-level pipe interfaces, because these are all unidirectional. You’ll need to use two low-level pipe function calls, each handling one direction of the conversation:

    pipe(FROM_PARENT, TO_CHILD)     or die "pipe: $!";
    pipe(FROM_CHILD, TO_PARENT)     or die "pipe: $!";
    select((select(TO_CHILD), $| = 1)[0]);      # autoflush
    select((select(TO_PARENT), $| = 1)[0]);     # autoflush

    if ($pid = fork) {
        close FROM_PARENT; close TO_PARENT;
        print TO_CHILD "Parent Pid $$ is sending this\n";
        chomp($line = <FROM_CHILD>);
        print "Parent Pid $$ just read this: `$line'\n";
        close FROM_CHILD; close TO_CHILD;
        waitpid($pid,0);
    }
    else {
        die "cannot fork: $!" unless defined $pid;
        close FROM_CHILD; close TO_CHILD;
        chomp($line = <FROM_PARENT>);
        print "Child Pid $$ just read this: `$line'\n";
        print TO_PARENT "Child Pid $$ is sending this\n";
        close FROM_PARENT; close TO_PARENT;
        exit;
    }

On many Unix systems, you don’t actually have to make two separate pipe calls to achieve full duplex communication between parent and child. The socketpair syscall provides bidirectional connections between related processes on the same machine. So instead of two pipes, you only need one socketpair.

    use Socket;
    socketpair(Child, Parent, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";

    # or letting perl pick filehandles for you
    my ($kidfh, $dadfh);
    socketpair($kidfh, $dadfh, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";

After the fork, the parent closes the Parent handle, then reads and writes via the Child handle.
Meanwhile, the child closes the Child handle, then reads and writes via the Parent handle.

If you’re looking into bidirectional communications because the process you’d like to talk to implements a standard Internet service, you should usually just skip the middleman and use a CPAN module designed for that exact purpose. (See the “Sockets” section later for a list of some of these.)

Named Pipes

A named pipe (often called a FIFO) is a mechanism for setting up a conversation between unrelated processes on the same machine. The names in a “named” pipe exist in the filesystem, which is just a funny way to say that you can put a special file in the filesystem namespace that has another process behind it instead of a disk.*

A FIFO is convenient when you want to connect a process to an unrelated one. When you open a FIFO, your process will block until there’s a process on the other end. So if a reader opens the FIFO first, it blocks until the writer shows up — and vice versa.

To create a named pipe, use the POSIX mkfifo function — if you’re on a POSIX system, that is. On Microsoft systems, you’ll instead want to look into the Win32::Pipe module, which, despite its possible appearance to the contrary, creates named pipes. (Win32 users create anonymous pipes using pipe just like the rest of us.)

For example, let’s say you’d like to have your .signature file produce a different answer each time it’s read. Just make it a named pipe with a Perl program on the other end that spits out random quips. Now every time any program (like a mailer, newsreader, finger program, and so on) tries to read from that file, that program will connect to your program and read in a dynamic signature. In the following example, we use the rarely seen -p file test operator to determine whether anyone (or anything) has accidentally removed our FIFO.† If they have, there’s no reason to try to open it, so we treat this as a request to exit.
If we’d used a simple open function with a mode of “> $fpath”, there would have been a tiny race condition that would have risked accidentally creating the signature as a plain file if it disappeared between the -p test and the open. We couldn’t use a “+< $fpath” mode, either, because opening a FIFO for read-write is a nonblocking open (this is only true of FIFOs). By using sysopen and omitting the O_CREAT flag, we avoid this problem by never creating a file by accident.

    use Fcntl;              # for sysopen
    chdir;                  # go home
    $fpath = '.signature';
    $ENV{PATH} .= ":/usr/games";

    unless (-p $fpath) {    # not a pipe
        if (-e _) {         # but a something else
            die "$0: won't overwrite .signature\n";
        }
        else {
            require POSIX;
            POSIX::mkfifo($fpath, 0666)
                or die "can't mknod $fpath: $!";
            warn "$0: created $fpath as a named pipe\n";
        }
    }

    while (1) {
        # exit if signature file manually removed
        die "Pipe file disappeared" unless -p $fpath;
        # next line blocks until there's a reader
        sysopen(FIFO, $fpath, O_WRONLY)
            or die "can't write $fpath: $!";
        print FIFO "John Smith (smith\@host.org)\n", `fortune -s`;
        close FIFO;
        select(undef, undef, undef, 0.2);   # sleep 1/5th second
    }

* You can do the same thing with Unix-domain sockets, but you can’t use open on those.
† Another use is to see if a filehandle is connected to a pipe, named or anonymous, as in -p STDIN.

The short sleep after the close is needed to give the reader a chance to read what was written. If we just immediately loop back up around and open the FIFO again before our reader has finished reading the data we just sent, then no end-of-file is seen because there’s once again a writer. We’ll both go round and round until during one iteration, the writer falls a little behind and the reader finally sees that elusive end-of-file. (And we were worried about race conditions?)

System V IPC

Everyone hates System V IPC.
It's slower than paper tape, carves out insidious little namespaces completely unrelated to the filesystem, uses human-hostile numbers to name its objects, and is constantly losing track of its own mind. Every so often, your sysadmin has to go on a search-and-destroy mission to hunt down these lost SysV IPC objects with ipcs(1) and kill them with ipcrm(1), hopefully before the system runs out of memory.

Despite all this pain, ancient SysV IPC still has a few valid uses. The three kinds of IPC objects are shared memory, semaphores, and messages. For message passing, sockets are the preferred mechanisms these days, and they're a lot more portable, too. For simple uses of semaphores, the filesystem tends to get used. As for shared memory — well, now there's a problem for you. If you have it, the more modern mmap(2) syscall fits the bill,* but the quality of the implementation varies from system to system. It also requires a bit of care to avoid letting Perl reallocate your strings from where mmap(2) put them. But when programmers look into using mmap(2), they hear these incoherent mumbles from the resident wizards about how it suffers from dodgy cache coherency issues on systems without something called a "unified buffer cache" — or maybe it was a "flycatcher unibus" — and, figuring the devil they know is better than the one they don't, run quickly back to the SysV IPC they know and hate for all their shared memory needs.

* There's even an Mmap module on CPAN.

Here's a little program that demonstrates controlled access to a shared memory buffer by a brood of sibling processes. SysV IPC objects can also be shared among unrelated processes on the same computer, but then you have to figure out how they're going to find each other. To mediate safe access, we'll create a semaphore per piece.* Every time you want to get or put a new value into the shared memory, you have to go through the semaphore first.
This can get pretty tedious, so we'll wrap access in an object class. IPC::Shareable goes one step further, wrapping its object class in a tie interface. This program runs until you interrupt it with a Control-C or equivalent:

    #!/usr/bin/perl -w
    use v5.6.0;     # or better
    use strict;
    use sigtrap qw(die INT TERM HUP QUIT);
    my $PROGENY = shift(@ARGV) || 3;

    eval { main() };            # see DESTROY below for why
    die if $@ && $@ !~ /^Caught a SIG/;
    print "\nDone.\n";
    exit;

    sub main {
        my $mem = ShMem->alloc("Original Creation at " . localtime);
        my(@kids, $child);
        $SIG{CHLD} = 'IGNORE';
        for (my $unborn = $PROGENY; $unborn > 0; $unborn--) {
            if ($child = fork) {
                print "$$ begat $child\n";
                next;
            }
            die "cannot fork: $!" unless defined $child;
            eval {
                while (1) {
                    $mem->lock();
                    $mem->poke("$$ " . localtime)
                        unless $mem->peek =~ /^$$\b/o;
                    $mem->unlock();
                }
            };
            die if $@ && $@ !~ /^Caught a SIG/;
            exit;               # child death
        }
        while (1) {
            print "Buffer is ", $mem->get, "\n";
            sleep 1;
        }
    }

* It would be more realistic to create a pair of semaphores for each bit of shared memory, one for reading and the other for writing, and in fact, that's what the IPC::Shareable module on CPAN does. But we're trying to keep things simple here. It's worth admitting, though, that with a couple of semaphores, you could then make use of pretty much the only redeeming feature of SysV IPC: you could perform atomic operations on entire sets of semaphores as one unit, which is occasionally useful.

And here's the ShMem package, which that program uses. You can just tack it on to the end of the program, or put it in its own file (with a "1;" at the end) and require it from the main program. (The two IPC modules it uses in turn are found in the standard Perl distribution.)

    package ShMem;
    use IPC::SysV qw(IPC_PRIVATE IPC_RMID IPC_CREAT S_IRWXU);
    use IPC::Semaphore;

    sub MAXBUF() { 2000 }

    sub alloc {     # constructor method
        my $class = shift;
        my $value = @_ ?
shift : '';
        my $key = shmget(IPC_PRIVATE, MAXBUF, S_IRWXU) or die "shmget: $!";
        my $sem = IPC::Semaphore->new(IPC_PRIVATE, 1, S_IRWXU | IPC_CREAT)
            or die "IPC::Semaphore->new: $!";
        $sem->setval(0,1) or die "sem setval: $!";
        my $self = bless {
            OWNER  => $$,
            SHMKEY => $key,
            SEMA   => $sem,
        } => $class;
        $self->put($value);
        return $self;
    }

Now for the fetch and store methods. The get and put methods lock the buffer, but peek and poke don't, so the latter two should be used only while the object is manually locked — which you have to do when you want to retrieve an old value and store back a modified version, all under the same lock. The demo program does this in its while (1) loop. The entire transaction must occur under the same lock, or the testing and setting wouldn't be atomic and might bomb.

    sub get {
        my $self = shift;
        $self->lock;
        my $value = $self->peek(@_);
        $self->unlock;
        return $value;
    }

    sub peek {
        my $self = shift;
        shmread($self->{SHMKEY}, my $buff = '', 0, MAXBUF) or die "shmread: $!";
        substr($buff, index($buff, "\0")) = '';
        return $buff;
    }

    sub put {
        my $self = shift;
        $self->lock;
        $self->poke(@_);
        $self->unlock;
    }

    sub poke {
        my($self, $msg) = @_;
        shmwrite($self->{SHMKEY}, $msg, 0, MAXBUF) or die "shmwrite: $!";
    }

    sub lock {
        my $self = shift;
        $self->{SEMA}->op(0, -1, 0) or die "semop: $!";
    }

    sub unlock {
        my $self = shift;
        $self->{SEMA}->op(0, 1, 0) or die "semop: $!";
    }

Finally, the class needs a destructor so that when the object goes away, we can manually deallocate the shared memory and the semaphore stored inside the object. Otherwise, they'll outlive their creator, and you'll have to resort to ipcs and ipcrm (or a sysadmin) to get rid of them. That's why we went through the elaborate wrappers in the main program to convert signals into exceptions: so that all destructors get run, SysV IPC objects get deallocated, and sysadmins get off our case.
    sub DESTROY {
        my $self = shift;
        return unless $self->{OWNER} == $$;     # avoid dup dealloc
        shmctl($self->{SHMKEY}, IPC_RMID, 0)    or warn "shmctl RMID: $!";
        $self->{SEMA}->remove()                 or warn "sema->remove: $!";
    }

Sockets

The IPC mechanisms discussed earlier all have one severe restriction: they're designed for communication between processes running on the same computer. (Even though files can sometimes be shared across machines through mechanisms like NFS, locking fails miserably on many NFS implementations, which takes away most of the fun of concurrent access.) For general-purpose networking, sockets are the way to go. Although sockets were invented under BSD, they quickly spread to other forms of Unix, and nowadays you can find a socket interface on nearly every viable operating system out there. If you don't have sockets on your machine, you're going to have tremendous difficulty using the Internet.

With sockets, you can do both virtual circuits (as TCP streams) and datagrams (as UDP packets). You may be able to do even more, depending on your system. But the most common sort of socket programming uses TCP over Internet-domain sockets, so that's the kind we cover here. Such sockets provide reliable connections that work a little bit like bidirectional pipes that aren't restricted to the local machine. The two killer apps of the Internet, email and web browsing, both rely almost exclusively on TCP sockets.

You also use UDP heavily without knowing it. Every time your machine tries to find a site on the Internet, it sends UDP packets to your DNS server asking it for the actual IP address. You might use UDP yourself when you want to send and receive datagrams. Datagrams are cheaper than TCP connections precisely because they aren't connection oriented; that is, they're less like making a telephone call and more like dropping a letter in the mailbox.
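To make the letter-in-the-mailbox model concrete, here is a minimal self-contained sketch that bounces one UDP datagram between two sockets on the loopback interface, using only the standard IO::Socket::INET class (the loopback address and the "ping" payload are ours, just for illustration):

```perl
use strict;
use warnings;
use IO::Socket::INET;

# One socket plays mailbox (bound to an ephemeral port on loopback)...
my $receiver = IO::Socket::INET->new(
    Proto     => 'udp',
    LocalAddr => '127.0.0.1',
    LocalPort => 0,                 # let the kernel pick a free port
) or die "receiver: $!";

# ...and the other drops a letter in it.  No connection setup, no
# accept loop: each send is an independent datagram.
my $sender = IO::Socket::INET->new(
    Proto    => 'udp',
    PeerAddr => '127.0.0.1',
    PeerPort => $receiver->sockport,
) or die "sender: $!";

$sender->send("ping") or die "send: $!";
$receiver->recv(my $datagram, 1024);    # blocks until a datagram arrives
print "$datagram\n";                    # prints "ping"
```

Over the loopback interface this is effectively reliable; across a real network that same send could silently vanish, which is the trade-off described above.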
But UDP also lacks the reliability that TCP provides, making it more suitable for situations where you don't care whether a packet or two gets folded, spindled, or mutilated, or for when you know that a higher-level protocol will enforce some degree of redundancy or failsoftness (which is what DNS does).

Other choices are available but far less common. You can use Unix-domain sockets, but they only work for local communication. Various systems support various other non-IP-based protocols. Doubtless these are somewhat interesting to someone somewhere, but we'll restrain ourselves from talking about them somehow.

The Perl functions that deal with sockets have the same names as the corresponding syscalls in C, but their arguments tend to differ for two reasons: first, Perl filehandles work differently from C file descriptors; and second, Perl already knows the length of its strings, so you don't need to pass that information. See Chapter 29 for details on each socket-related syscall. Like most syscalls, the socket-related ones quietly but politely return undef when they fail, instead of raising an exception. It is therefore essential to check these functions' return values, since if you pass them garbage, they aren't going to be very noisy about it.

One problem with ancient socket code in Perl was that people would use hardcoded values for constants passed into socket functions, which destroys portability. If you ever see code that does anything like explicitly setting $AF_INET = 2, you know you're in for big trouble. An immeasurably superior approach is to use the Socket module or the even friendlier IO::Socket module, both of which are standard. These modules provide various constants and helper functions you'll need for setting up clients and servers.
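For instance, the Socket module exports the genuine constants along with helpers for packing and unpacking socket addresses, so nothing needs to be hardcoded. A minimal sketch (the address and port here are arbitrary):

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_STREAM inet_aton inet_ntoa
              sockaddr_in unpack_sockaddr_in);

# The module supplies the right value for this system; no magic 2s.
printf "AF_INET is %d on this system\n", AF_INET;

# Pack a port and an IP address into a sockaddr_in structure,
# then unpack it again; connect() and bind() take the packed form.
my $packed = sockaddr_in(80, inet_aton("127.0.0.1"));
my($port, $iaddr) = unpack_sockaddr_in($packed);
print inet_ntoa($iaddr), ":$port\n";    # prints "127.0.0.1:80"
```

Because the constants and the packed layout come from your C library at build time, the same code works unchanged on systems where the raw numbers differ.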
For optimal success, your socket programs should always start out like this (and don't forget to add the -T taint-checking switch to the shebang line for servers):

    #!/usr/bin/perl -w
    use strict;
    use sigtrap;
    use Socket;             # or IO::Socket

As noted elsewhere, Perl is at the mercy of your C libraries for much of its system behavior, and not all systems support all sorts of sockets. It's probably safest to stick with normal TCP and UDP socket operations. For example, if you want your code to stand a chance of being portable to systems you haven't thought of, don't expect there to be support for a reliable sequenced-packet protocol. Nor should you expect to pass open file descriptors between unrelated processes over a local Unix-domain socket. (Yes, you can really do that on many Unix machines — see your local recvmsg(2) manpage.)

If you just want to use a standard Internet service like mail, news, domain name service, FTP, Telnet, the Web, and so on, then instead of starting from scratch, try using existing CPAN modules for these. Prepackaged modules designed for these include Net::SMTP (or Mail::Mailer), Net::NNTP, Net::DNS, Net::FTP, Net::Telnet, and the various HTTP-related modules. The libnet and libwww module suites both comprise many individual networking modules. Module areas on CPAN you'll want to look at are section 5 on Networking and IPC, section 15 on WWW-related modules, and section 16 on Server and Daemon Utilities.

In the sections that follow, we present several sample clients and servers without a great deal of explanation of each function used, as that would mostly duplicate the descriptions we've already provided in Chapter 29.

Networking Clients

Use Internet-domain sockets when you want reliable client-server communication between potentially different machines.
To create a TCP client that connects to a server somewhere, it's usually easiest to use the standard IO::Socket::INET module:

    use IO::Socket::INET;
    $socket = IO::Socket::INET->new(PeerAddr => $remote_host,
                                    PeerPort => $remote_port,
                                    Proto    => "tcp",
                                    Type     => SOCK_STREAM)
        or die "Couldn't connect to $remote_host:$remote_port : $!\n";

    # send something over the socket,
    print $socket "Why don't you call me anymore?\n";

    # read the remote answer,
    $answer = <$socket>;

    # and terminate the connection when we're done.
    close($socket);

A shorthand form of the call is good enough when you just have a host and port combination to connect to, and are willing to use defaults for all other fields:

    $socket = IO::Socket::INET->new("www.yahoo.com:80")
        or die "Couldn't connect to port 80 of yahoo: $!";

To connect using the basic Socket module:

    use Socket;

    # create a socket
    socket(Server, PF_INET, SOCK_STREAM, getprotobyname('tcp'));

    # build the address of the remote machine
    $internet_addr = inet_aton($remote_host)
        or die "Couldn't convert $remote_host into an Internet address: $!\n";
    $paddr = sockaddr_in($remote_port, $internet_addr);

    # connect
    connect(Server, $paddr)
        or die "Couldn't connect to $remote_host:$remote_port: $!\n";

    select((select(Server), $| = 1)[0]);    # enable command buffering

    # send something over the socket
    print Server "Why don't you call me anymore?\n";

    # read the remote answer
    $answer = <Server>;

    # terminate the connection when done
    close(Server);

If you want to close only your side of the connection, so that the remote end gets an end-of-file, but you can still read data coming from the server, use the shutdown syscall for a half-close:

    # no more writing to server
    shutdown(Server, 1);        # Socket::SHUT_WR constant in v5.6

Networking Servers

Here's a corresponding server to go along with it.
It's pretty easy with the standard IO::Socket::INET class:

    use IO::Socket::INET;
    $server = IO::Socket::INET->new(LocalPort => $server_port,
                                    Type      => SOCK_STREAM,
                                    Reuse     => 1,
                                    Listen    => 10)    # or SOMAXCONN
        or die "Couldn't be a tcp server on port $server_port: $!\n";

    while ($client = $server->accept()) {
        # $client is the new connection
    }
    close($server);

You can also write that using the lower-level Socket module:

    use Socket;

    # make the socket
    socket(Server, PF_INET, SOCK_STREAM, getprotobyname('tcp'));

    # so we can restart our server quickly
    setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);

    # build up my socket address
    $my_addr = sockaddr_in($server_port, INADDR_ANY);
    bind(Server, $my_addr)
        or die "Couldn't bind to port $server_port: $!\n";

    # establish a queue for incoming connections
    listen(Server, SOMAXCONN)
        or die "Couldn't listen on port $server_port: $!\n";

    # accept and process connections
    while (accept(Client, Server)) {
        # do something with new Client connection
    }
    close(Server);

The client doesn't need to bind to any address, but the server does. We've specified its address as INADDR_ANY, which means that clients can connect from any available network interface. If you want to sit on a particular interface (like the external side of a gateway or firewall machine), use that interface's real address instead. (Clients can do this, too, but rarely need to.)

If you want to know which machine connected to you, call getpeername on the client connection. This returns an IP address, which you'll have to translate into a name on your own (if you can):

    use Socket;
    $other_end = getpeername(Client)
        or die "Couldn't identify other end: $!\n";
    ($port, $iaddr) = unpack_sockaddr_in($other_end);
    $actual_ip = inet_ntoa($iaddr);
    $claimed_hostname = gethostbyaddr($iaddr, AF_INET);

This is trivially spoofable because the owner of that IP address can set up their reverse tables to say anything they want.
For a small measure of additional confidence, translate back the other way again:

    @name_lookup = gethostbyname($claimed_hostname)
        or die "Could not reverse $claimed_hostname: $!\n";
    @resolved_ips = map { inet_ntoa($_) }
                    @name_lookup[ 4 .. $#name_lookup ];
    $might_spoof = !grep { $actual_ip eq $_ } @resolved_ips;

Once a client connects to your server, your server can do I/O both to and from that client handle. But while the server is so engaged, it can't service any further incoming requests from other clients. To avoid getting locked down to just one client at a time, many servers immediately fork a clone of themselves to handle each incoming connection. (Others fork in advance, or multiplex I/O between several clients using the select syscall.)

    REQUEST:
    while (accept(Client, Server)) {
        if ($kidpid = fork) {
            close Client;           # parent closes unused handle
            next REQUEST;
        }
        defined($kidpid) or die "cannot fork: $!";
        close Server;               # child closes unused handle

        select(Client);
        $| = 1;                     # new default for p