Mastering Regular Expressions (1997)


Mastering Regular Expressions: Table of Contents

Tables
Preface
1 Introduction to Regular Expressions
2 Extended Introductory Examples
3 Overview of Regular Expression Features and Flavors
4 The Mechanics of Expression Processing
5 Crafting a Regular Expression
6 Tool-Specific Information
7 Perl Regular Expressions
A Online Information
B Email Regex Program
Index

Mastering Regular Expressions
Powerful Techniques for Perl and Other Tools
Jeffrey E.F. Friedl

O'REILLY™
Cambridge • Köln • Paris • Sebastopol • Tokyo

Mastering Regular Expressions
by Jeffrey E.F. Friedl

Copyright © 1997 O'Reilly & Associates, Inc. All rights reserved. Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.

Editor: Andy Oram
Production Editor: Jeffrey Friedl

Printing History:
January 1997: First Edition.
March 1997: Second printing; Minor corrections.
May 1997: Third printing; Minor corrections.
July 1997: Fourth printing; Minor corrections.
November 1997: Fifth printing; Minor corrections.
August 1998: Sixth printing; Minor corrections.
December 1998: Seventh printing; Minor corrections.

Nutshell Handbook and the Nutshell Handbook logo are registered trademarks and The Java Series is a trademark of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Table of Contents

Preface
1: Introduction to Regular Expressions
  Solving Real Problems
  Regular Expressions as a Language
  The Filename Analogy
  The Language Analogy
  The Regular-Expression Frame of Mind
  Searching Text Files: Egrep
  Egrep Metacharacters
  Start and End of the Line
  Character Classes
  Matching Any Character—Dot
  Alternation
  Word Boundaries
  In a Nutshell
  Optional Items
  Other Quantifiers: Repetition
  Ignoring Differences in Capitalization
  Parentheses and Backreferences
  The Great Escape
  Expanding the Foundation
  Linguistic Diversification
  The Goal of a Regular Expression
  A Few More Examples
  Regular Expression Nomenclature
  Improving on the Status Quo
  Summary
  Personal Glimpses
2: Extended Introductory Examples
  About the Examples
  A Short Introduction to Perl
  Matching Text with Regular Expressions
  Toward a More Real-World Example
  Side Effects of a Successful Match
  Intertwined Regular Expressions
  Intermission
  Modifying Text with Regular Expressions
  Automated Editing
  A Small Mail Utility
  That Doubled-Word Thing
3: Overview of Regular Expression Features and Flavors
  A Casual Stroll Across the Regex Landscape
  The World According to Grep
  The Times They Are a Changin'
  At a Glance
  POSIX
  Care and Handling of Regular Expressions
  Identifying a Regex
  Doing Something with the Matched Text
  Other Examples
  Care and Handling: Summary
  Engines and Chrome Finish
  Chrome and Appearances
  Engines and Drivers
  Common Metacharacters
  Character Shorthands
  Strings as Regular Expression
  Class Shorthands, Dot, and Character Classes
  Anchoring
  Grouping and Retrieving
  Quantifiers
  Alternation
  Guide to the Advanced Chapters
  Tool-Specific Information
4: The Mechanics of Expression Processing
  Start Your Engines!
  Two Kinds of Engines
  New Standards
  Regex Engine Types
  From the Department of Redundancy Department
  Match Basics
  About the Examples
  Rule 1: The Earliest Match Wins
  The "Transmission" and the Bump-Along
  Engine Pieces and Parts
  Rule 2: Some Metacharacters Are Greedy
  Regex-Directed vs. Text-Directed
  NFA Engine: Regex-Directed
  DFA Engine: Text-Directed
  The Mysteries of Life Revealed
  Backtracking
  A Really Crummy Analogy
  Two Important Points on Backtracking
  Saved States
  Backtracking and Greediness
  More About Greediness
  Problems of Greediness
  Multi-Character "Quotes"
  Laziness?
  Greediness Always Favors a Match
  Is Alternation Greedy?
  Uses for Non-Greedy Alternation
  Greedy Alternation in Perspective
  Character Classes vs. Alternation
  NFA, DFA, and POSIX
  "The Longest-Leftmost"
  POSIX and the Longest-Leftmost Rule
  Speed and Efficiency
  DFA and NFA in Comparison
  Practical Regex Techniques
  Contributing Factors
  Be Specific
  Difficulties and Impossibilities
  Watching Out for Unwanted Matches
  Matching Delimited Text
  Knowing Your Data and Making Assumptions
  Additional Greedy Examples
  Summary
  Match Mechanics Summary
  Some Practical Effects of Match Mechanics
5: Crafting a Regular Expression
  A Sobering Example
  A Simple Change-Placing Your Best Foot Forward
  More Advanced-Localizing the Greediness
  Reality Check
  A Global View of Backtracking
  More Work for a POSIX NFA
  Work Required During a Non-Match
  Being More Specific
  Alternation Can Be Expensive
  A Strong Lead
  The Impact of Parentheses
  Internal Optimization
  First-Character Discrimination
  Fixed-String Check
  Simple Repetition
  Needless Small Quantifiers
  Length Cognizance
  Match Cognizance
  Need Cognizance
  String/Line Anchors
  Compile Caching
  Testing the Engine Type
  Basic NFA vs. DFA Testing
  Traditional NFA vs. POSIX NFA Testing
  Unrolling the Loop
  Method 1: Building a Regex From Past Experiences
  The Real "Unrolling the Loop" Pattern
  Method 2: A Top-Down View
  Method 3: A Quoted Internet Hostname
  Observations
  Unrolling C Comments
  Regex Headaches
  A Naive View
  Unrolling the C Loop
  The Freeflowing Regex
  A Helping Hand to Guide the Match
  A Well-Guided Regex is a Fast Regex
  Wrapup
  Think!
  The Many Twists and Turns of Optimizations
6: Tool-Specific Information
  Questions You Should Be Asking
  Something as Simple as Grep
  In This Chapter
  Awk
  Differences Among Awk Regex Flavors
  Awk Regex Functions and Operators
  Tcl
  Tcl Regex Operands
  Using Tcl Regular Expressions
  Tcl Regex Optimizations
  GNU Emacs
  Emacs Strings as Regular Expressions
  Emacs's Regex Flavor
  Emacs Match Results
  Benchmarking in Emacs
  Emacs Regex Optimizations
7: Perl Regular Expressions
  The Perl Way
  Regular Expressions as a Language Component
  Perl's Greatest Strength
  Perl's Greatest Weakness
  A Chapter, a Chicken, and The Perl Way
  An Introductory Example: Parsing CSV Text
  Regular Expressions and The Perl Way
  Perl Unleashed
  Regex-Related Perlisms
  Expression Context
  Dynamic Scope and Regex Match Effects
  Special Variables Modified by a Match
  "Doublequotish Processing" and Variable Interpolation
  Perl's Regex Flavor
  Quantifiers-Greedy and Lazy
  Grouping
  String Anchors
  Multi-Match Anchor
  Word Anchors
  Convenient Shorthands and Other Notations
  Character Classes
  Modification with \Q and Friends: True Lies
  The Match Operator
  Match-Operand Delimiters
  Match Modifiers
  Specifying the Match Target Operand
  Other Side Effects of the Match Operator
  Match Operator Return Value
  Outside Influences on the Match Operator
  The Substitution Operator
  The Replacement Operand
  The /e Modifier
  Context and Return Value
  Using /g with a Regex That Can Match Nothingness
  The Split Operator
  Basic Split
  Advanced Split
  Advanced Split's Match Operand
  Scalar-Context Split
  Split's Match Operand with Capturing Parentheses
  Perl Efficiency Issues
  "There's More Than One Way to Do It"
  Regex Compilation, the /o Modifier, and Efficiency
  Unsociable $& and Friends
  The Efficiency Penalty of the /i Modifier
  Substitution Efficiency Concerns
  Benchmarking
  Regex Debugging Information
  The Study Function
  Putting It All Together
  Stripping Leading and Trailing Whitespace
  Adding Commas to a Number
  Removing C Comments
  Matching an Email Address
  Final Comments
  Notes for Perl4
A: Online Information
B: Email Regex Program

Tables

  1-1 Summary of Metacharacters Seen So Far
  1-2 Summary of Quantifier "Repetition Metacharacters"
  1-3 Egrep Metacharacter Summary
  3-1 A (Very) Superficial Look at the Flavor of a Few Common Tools
  3-2 Overview of POSIX Regex Flavors
  3-3 A Few Utilities and Some of the Shorthand Metacharacters They Provide
  3-4 String/Line Anchors, and Other Newline-Related Issues
  4-1 Some Tools and Their Regex Engines
  5-1 Match Efficiency for a Traditional NFA
  5-2 Unrolling-The-Loop Example Cases
  5-3 Unrolling-The-Loop Components for C Comments
  6-1 A Superficial Survey of a Few Common Programs' Flavor
  6-2 A Comical Look at a Few Greps
  6-3 A Superficial Look at a Few Awks
  6-4 Tcl's NFA Regex Flavor
  6-5 GNU Emacs's Search-Related Primitives
  6-6 GNU Emacs's String Metacharacters
  6-7 Emacs's NFA Regex Flavor
  6-8 Emacs Syntax Classes
  7-1 Overview of Perl's Regular-Expression Language
  7-2 Overview of Perl's Regex-Related Items
  7-3 The meaning of local
  7-4 Perl's Quantifiers (Greedy and Lazy)
  7-5 Overview of Newline-Related Match Modes
  7-6 Summary of Anchor and Dot Modes
  7-7 Regex Shorthands and Special-Character Encodings
  7-8 String and Regex-Operand Case-Modification Constructs
  7-9 Examples of m/…/g with a Can-Match-Nothing Regex
  7-10 Standard Libraries That Are Naughty (That Reference $& and Friends)
  7-11 Somewhat Formal Description of an Internet Email Address

Preface

This book is about a powerful tool called "regular expressions." Here, you will learn how to use regular expressions to solve problems and get the most out of tools that provide them. Not only that, but much more: this book is about mastering regular expressions.

If you use a computer, you can benefit from regular expressions all the time (even if you don't realize it). When accessing World Wide Web search engines, with your editor, word processor, configuration scripts, and system tools, regular expressions are often provided as "power user" options. Languages such as Awk, Elisp, Expect, Perl, Python, and Tcl have regular-expression support built in (regular expressions are the very heart of many programs written in these languages), and regular-expression libraries are available for most other languages. For example, quite soon after Java became available, a regular-expression library was built and made freely available on the Web. Regular expressions are found in editors and programming environments such as vi, Delphi, Emacs, Brief, Visual C++, Nisus Writer, and many, many more. Regular expressions are very popular.

There's a good reason that regular expressions are found in so many diverse applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data.
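As a rough illustration of the two low-level uses just mentioned (verifying a user's input and sifting through data), here is a minimal sketch in Python, one of the languages listed above as having regular-expression support built in. The username rule, the sample log lines, and the names `is_valid_username` and `sift` are assumptions invented for this example; none of them come from the book.

```python
import re

# Hypothetical rule, for illustration only: a username is 3 to 16
# letters, digits, or underscores.
USERNAME = re.compile(r"[A-Za-z0-9_]{3,16}")


def is_valid_username(text: str) -> bool:
    """Verify input: the whole string must fit the pattern."""
    return USERNAME.fullmatch(text) is not None


# Hypothetical log format: we want lines that contain the whole word
# ERROR or WARNING anywhere in the line.
ALERT = re.compile(r"\b(?:ERROR|WARNING)\b")


def sift(lines):
    """Sift data: keep only the lines in which the pattern appears."""
    return [line for line in lines if ALERT.search(line)]


sample_log = [
    "12:00:01 INFO service started",
    "12:00:07 WARNING disk usage at 91%",
    "12:00:09 ERROR could not open config",
    "12:00:12 INFO ERRORS counted: 1",
]

print(is_valid_username("jfriedl"))
print(is_valid_username("no spaces allowed"))
for line in sift(sample_log):
    print(line)
```

The first function asks whether an entire string fits a description, while the second asks whether a description occurs somewhere within each line. The word-boundary markers in the second pattern are what keep the line mentioning "ERRORS" out of the result.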
Why I Wrote This Book

You might think that with their wide availability, general popularity, and unparalleled power, regular expressions would be employed to their fullest, wherever found. You might also think that they would be well documented, with introductory tutorials for the novice just starting out, and advanced manuals for the expert desiring that little extra edge.

Sadly, that hasn't been the case. Regular-expression documentation is certainly plentiful, and has been available for a long time. (I read my first regular-expression-related manual back in 1981.) The problem, it seems, is that the documentation has traditionally centered on the "low-level view" that I mentioned a moment ago. You can talk all you want about how paints adhere to canvas, and the science of how colors blend, but this won't make you a great painter. With painting, as with any art, you must touch on the human aspect to really make a statement. Regular expressions, composed of a mixture of symbols and text, might seem to be a cold, scientific enterprise, but I firmly believe they are very much creatures of the right half of the brain. They can be an outlet for creativity, for cunningly brilliant programming, and for the elegant solution.

I'm not talented at anything that most people would call art. I go to karaoke bars in Kyoto a lot, but I make up for the lack of talent simply by being loud. I do, however, feel very artistic when I can devise an elegant solution to a tough problem. In much of my work, regular expressions are often instrumental in developing those elegant solutions. Because it's one of the few outlets for the artist in me, I have developed somewhat of a passion for regular expressions. It is my goal in writing this book to share some of that passion.

Intended Audience

This book will interest anyone who has an opportunity to use regular expressions.
In particular, if you don't yet understand the power that regular expressions can provide, you should benefit greatly as a whole new world is opened up to you. Many of the popular cross-platform utilities and languages that are featured in this book are freely available for MacOS, DOS/Windows, Unix, VMS, and more. Appendix A has some pointers on how to obtain many of them.

Anyone who uses GNU Emacs or vi, or programs in Perl, Tcl, Python, or Awk, should find a gold mine of detail, hints, tips, and understanding that can be put to immediate use. The detail and thoroughness is simply not found anywhere else.

Regular expressions are an idea—one that is implemented in various ways by various utilities (many, many more than are specifically presented in this book). If you master the general concept of regular expressions, it's a short step to mastering a particular implementation. This book concentrates on that idea, so most of the knowledge presented here transcends the utilities used in the examples.

How to Read This Book

This book is part tutorial, part reference manual, and part story, depending on when you use it. Readers familiar with regular expressions might feel that they can immediately begin using this book as a detailed reference, flipping directly to the section on their favorite utility. I would like to discourage that.

This Book, as a Story

To get the most out of this book, read it first as a story. I have found that certain habits and ways of thinking can be a great help to reaching a full understanding, but such things are absorbed over pages, not merely memorized from a list.

Here's a short quiz: define the word "between." Remember, you can't use the word in its definition! Have you come up with a good definition? No? It's tough! It's lucky that we all know what "between" means because most of us would have a devil of a time trying to explain it to someone who didn't know.
It's a simple concept, but it's hard to describe to someone who isn't already familiar with it. To some extent, describing the details of regular expressions can be similar. Regular expressions are not really that complex, but the descriptions can tend to be.

I've crafted a story and a way of thinking that begins with Chapter 1, so I hope you begin reading there. Some of the descriptions are complex, so don't be alarmed if some of the more detailed sections require a second reading. Experience is 9/10 of the law (or something like that), so it takes time and experience before the overall picture can sink in.

This Book, as a Reference

This book tells a story, but one with many details. Once you've read the story to get the overall picture, this book is also useful as a reference. I've used cross references liberally, and I've worked hard to make the index as useful as possible. (Cross references are often presented as a symbol followed by a page number.)

Until you read the full story, its use as a reference makes little sense. Before reading the story, you might look at one of the tables, such as the huge chart on page 182, and think it presents all the relevant information you need to know. But a great deal of background information does not appear in the charts themselves, but rather in the associated story. Once you've read the story, you'll have an appreciation for the issues, what you can remember off the top of your head, and what is important to check up on.

Organization

The seven chapters of this book can be logically divided into roughly three parts, with two additional appendices. Here's a quick overview:

The Introduction
  Chapter 1 introduces the concept of regular expressions.
  Chapter 2 takes a look at text processing with regular expressions.
  Chapter 3 provides an overview of features and utilities, plus a bit of history.
The Details
  Chapter 4 explains the details of how regular expressions work.
  Chapter 5 discusses ramifications and practical applications of the details.
Tool-Specific Information
  Chapter 6 looks at a few tool-specific issues of several common utilities.
  Chapter 7 looks at everything to do with regular expressions in Perl.
Appendices
  Appendix A tells how to acquire many of the tools mentioned in this book.
  Appendix B provides a full listing of a program developed in Chapter 7.

The Introduction

The introduction elevates the absolute novice to "issue-aware" novice. Readers with a fair amount of experience can feel free to skim the early chapters, but I particularly recommend Chapter 3 even for the grizzled expert.

• Chapter 1, Introduction to Regular Expressions, is geared toward the complete novice. I introduce the concept of regular expressions using the widely available program egrep, and offer my perspective on how to think regular expressions, instilling a solid foundation for the advanced concepts in later chapters. Even readers with former experience would do well to skim this first chapter.

• Chapter 2, Extended Introductory Examples, looks at real text processing in a programming language that has regular-expression support. The additional examples provide a basis for the detailed discussions of later chapters, and show additional important thought processes behind crafting advanced regular expressions. To provide a feel for how to "speak in regular expressions," this chapter takes a problem requiring an advanced solution and shows ways to solve it using two unrelated regular-expression-wielding tools.

• Chapter 3, Overview of Regular Expression Features and Flavors, provides an overview of the wide range of regular expressions commonly found in tools today. Due to their turbulent history, current commonly used regular expression flavors can differ greatly. This chapter also takes a look at a bit of the history and evolution of regular expressions and the programs that use them.
The end of this chapter also contains the "Guide to the Advanced Chapters." This guide is your road map to getting the most out of the advanced material that follows.

The Details

Once you have the basics down, it's time to investigate the how and the why. Like the "teach a man to fish" parable, truly understanding the issues will allow you to apply that knowledge whenever and wherever regular expressions are found. That true understanding begins in:

• Chapter 4, The Mechanics of Expressi...