Mastering Regular Expressions
Table of Contents
1 Introduction to Regular Expressions
2 Extended Introductory Examples
3 Overview of Regular Expression Features and Flavors
4 The Mechanics of Expression Processing
5 Crafting a Regular Expression
6 Tool-Specific Information
7 Perl Regular Expressions
A Online Information
B Email Regex Program
Index
Mastering Regular Expressions
Powerful Techniques for Perl and Other Tools
Jeffrey E.F. Friedl
O'REILLY™
Cambridge • Köln • Paris • Sebastopol • Tokyo
Mastering Regular Expressions
by Jeffrey E.F. Friedl
Copyright © 1997 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA
Editor: Andy Oram
Production Editor: Jeffrey Friedl
January 1997: First Edition.
March 1997: Second printing; Minor corrections.
May 1997: Third printing; Minor corrections.
July 1997: Fourth printing; Minor corrections.
November 1997: Fifth printing; Minor corrections.
August 1998: Sixth printing; Minor corrections.
December 1998: Seventh printing; Minor corrections.
Nutshell Handbook and the Nutshell Handbook logo are registered trademarks
and The Java Series is a trademark of O'Reilly & Associates, Inc.
Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Where those designations appear in this
book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the
designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the
publisher assumes no responsibility for errors or omissions, or for damages
resulting from the use of the information contained herein.
Table of Contents
Preface
1: Introduction to Regular Expressions
  Solving Real Problems
  Regular Expressions as a Language
  The Filename Analogy
  The Language Analogy
  The Regular-Expression Frame of Mind
  Searching Text Files: Egrep
  Egrep Metacharacters
  Start and End of the Line
  Character Classes
  Matching Any Character—Dot
  Alternation
  Word Boundaries
  In a Nutshell
  Optional Items
  Other Quantifiers: Repetition
  Ignoring Differences in Capitalization
  Parentheses and Backreferences
  The Great Escape
  Expanding the Foundation
  Linguistic Diversification
  The Goal of a Regular Expression
  A Few More Examples
  Regular Expression Nomenclature
  Improving on the Status Quo
  Summary
  Personal Glimpses
2: Extended Introductory Examples
  About the Examples
  A Short Introduction to Perl
  Matching Text with Regular Expressions
  Toward a More Real-World Example
  Side Effects of a Successful Match
  Intertwined Regular Expressions
  Intermission
  Modifying Text with Regular Expressions
  Automated Editing
  A Small Mail Utility
  That Doubled-Word Thing
3: Overview of Regular Expression Features and Flavors
  A Casual Stroll Across the Regex Landscape
  The World According to Grep
  The Times They Are a Changin'
  At a Glance
  POSIX
  Care and Handling of Regular Expressions
  Identifying a Regex
  Doing Something with the Matched Text
  Other Examples
  Care and Handling: Summary
  Engines and Chrome Finish
  Chrome and Appearances
  Engines and Drivers
  Common Metacharacters
  Character Shorthands
  Strings as Regular Expression
  Class Shorthands, Dot, and Character Classes
  Anchoring
  Grouping and Retrieving
  Quantifiers
  Alternation
  Guide to the Advanced Chapters
  Tool-Specific Information
4: The Mechanics of Expression Processing
  Start Your Engines!
  Two Kinds of Engines
  New Standards
  Regex Engine Types
  From the Department of Redundancy Department
  Match Basics
  About the Examples
  Rule 1: The Earliest Match Wins
  The "Transmission" and the Bump-Along
  Engine Pieces and Parts
  Rule 2: Some Metacharacters Are Greedy
  Regex-Directed vs. Text-Directed
  NFA Engine: Regex-Directed
  DFA Engine: Text-Directed
  The Mysteries of Life Revealed
  Backtracking
  A Really Crummy Analogy
  Two Important Points on Backtracking
  Saved States
  Backtracking and Greediness
  More About Greediness
  Problems of Greediness
  Multi-Character "Quotes"
  Laziness?
  Greediness Always Favors a Match
  Is Alternation Greedy?
  Uses for Non-Greedy Alternation
  Greedy Alternation in Perspective
  Character Classes vs. Alternation
  NFA, DFA, and POSIX
  "The Longest-Leftmost"
  POSIX and the Longest-Leftmost Rule
  Speed and Efficiency
  DFA and NFA in Comparison
  Practical Regex Techniques
  Contributing Factors
  Be Specific
  Difficulties and Impossibilities
  Watching Out for Unwanted Matches
  Matching Delimited Text
  Knowing Your Data and Making Assumptions
  Additional Greedy Examples
  Summary
  Match Mechanics Summary
  Some Practical Effects of Match Mechanics
5: Crafting a Regular Expression
  A Sobering Example
  A Simple Change-Placing Your Best Foot Forward
  More Advanced-Localizing the Greediness
  Reality Check
  A Global View of Backtracking
  More Work for a POSIX NFA
  Work Required During a Non-Match
  Being More Specific
  Alternation Can Be Expensive
  A Strong Lead
  The Impact of Parentheses
  Internal Optimization
  First-Character Discrimination
  Fixed-String Check
  Simple Repetition
  Needless Small Quantifiers
  Length Cognizance
  Match Cognizance
  Need Cognizance
  String/Line Anchors
  Compile Caching
  Testing the Engine Type
  Basic NFA vs. DFA Testing
  Traditional NFA vs. POSIX NFA Testing
  Unrolling the Loop
  Method 1: Building a Regex From Past Experiences
  The Real "Unrolling the Loop" Pattern
  Method 2: A Top-Down View
  Method 3: A Quoted Internet Hostname
  Observations
  Unrolling C Comments
  Regex Headaches
  A Naive View
  Unrolling the C Loop
  The Freeflowing Regex
  A Helping Hand to Guide the Match
  A Well-Guided Regex is a Fast Regex
  Wrapup
  Think!
  The Many Twists and Turns of Optimizations
6: Tool-Specific Information
  Questions You Should Be Asking
  Something as Simple as Grep
  In This Chapter
  Differences Among Awk Regex Flavors
  Awk Regex Functions and Operators
  Tcl
  Tcl Regex Operands
  Using Tcl Regular Expressions
  Tcl Regex Optimizations
  GNU Emacs
  Emacs Strings as Regular Expressions
  Emacs's Regex Flavor
  Emacs Match Results
  Benchmarking in Emacs
  Emacs Regex Optimizations
7: Perl Regular Expressions
  The Perl Way
  Regular Expressions as a Language Component
  Perl's Greatest Strength
  Perl's Greatest Weakness
  A Chapter, a Chicken, and The Perl Way
  An Introductory Example: Parsing CSV Text
  Regular Expressions and The Perl Way
  Perl Unleashed
  Regex-Related Perlisms
  Expression Context
  Dynamic Scope and Regex Match Effects
  Special Variables Modified by a Match
  "Doublequotish Processing" and Variable Interpolation
  Perl's Regex Flavor
  Quantifiers-Greedy and Lazy
  Grouping
  String Anchors
  Multi-Match Anchor
  Word Anchors
  Convenient Shorthands and Other Notations
  Character Classes
  Modification with \Q and Friends: True Lies
  The Match Operator
  Match-Operand Delimiters
  Match Modifiers
  Specifying the Match Target Operand
  Other Side Effects of the Match Operator
  Match Operator Return Value
  Outside Influences on the Match Operator
  The Substitution Operator
  The Replacement Operand
  The /e Modifier
  Context and Return Value
  Using /g with a Regex That Can Match Nothingness
  The Split Operator
  Basic Split
  Advanced Split
  Advanced Split's Match Operand
  Scalar-Context Split
  Split's Match Operand with Capturing Parentheses
  Perl Efficiency Issues
  "There's More Than One Way to Do It"
  Regex Compilation, the /o Modifier, and Efficiency
  Unsociable $& and Friends
  The Efficiency Penalty of the /i Modifier
  Substitution Efficiency Concerns
  Benchmarking
  Regex Debugging Information
  The Study Function
  Putting It All Together
  Stripping Leading and Trailing Whitespace
  Adding Commas to a Number
  Removing C Comments
  Matching an Email Address
  Final Comments
  Notes for Perl4
A: Online Information
B: Email Regex Program
Tables
1-1 Summary of Metacharacters Seen So Far 15
1-2 Summary of Quantifier "Repetition Metacharacters" 18
1-3 Egrep Metacharacter Summary 29
3-1 A (Very) Superficial Look at the Flavor of a Few Common Tools 63
3-2 Overview of POSIX Regex Flavors 64
3-3 A Few Utilities and Some of the Shorthand Metacharacters They Provide 73
3-4 String/Line Anchors, and Other Newline-Related Issues 82
4-1 Some Tools and Their Regex Engines 90
5-1 Match Efficiency for a Traditional NFA 143
5-2 Unrolling-The-Loop Example Cases 163
5-3 Unrolling-The-Loop Components for C Comments 172
6-1 A Superficial Survey of a Few Common Programs' Flavor 182
6-2 A Comical Look at a Few Greps 183
6-3 A Superficial Look at a Few Awks 184
6-4 Tcl's FA Regex Flavor 189
6-5 GNU Emacs's Search-Related Primitives 193
6-6 GNU Emacs's String Metacharacters 194
6-7 Emacs's NFA Regex Flavor 194
6-8 Emacs Syntax Classes 195
7-1 Overview of Perl's Regular-Expression Language 201
7-2 Overview of Perl's Regex-Related Items 203
7-3 The meaning of local 213
7-4 Perl's Quantifiers (Greedy and Lazy) 225
7-5 Overview of Newline-Related Match Modes 232
7-6 Summary of Anchor and Dot Modes 236
7-7 Regex Shorthands and Special-Character Encodings 241
7-8 String and Regex-Operand Case-Modification Constructs 245
7-9 Examples of m/…/g with a Can-Match-Nothing Regex 250
7-10 Standard Libraries That Are Naughty (That Reference $& and Friends) 278
7-11 Somewhat Formal Description of an Internet Email Address 295
Preface
This book is about a powerful tool called "regular expressions."
Here, you will learn how to use regular expressions to solve problems and get the
most out of tools that provide them. Not only that, but much more: this book is
about mastering regular expressions.
If you use a computer, you can benefit from regular expressions all the time (even
if you don't realize it). When accessing World Wide Web search engines, with
your editor, word processor, configuration scripts, and system tools, regular
expressions are often provided as "power user" options. Languages such as Awk,
Elisp, Expect, Perl, Python, and Tcl have regular-expression support built in
(regular expressions are the very heart of many programs written in these
languages), and regular-expression libraries are available for most other
languages. For example, quite soon after Java became available, a
regular-expression library was built and made freely available on the Web.
Regular expressions are found in editors and programming environments such as
vi, Delphi, Emacs, Brief, Visual C++, Nisus Writer, and many, many more.
Regular expressions are very popular.
There's a good reason that regular expressions are found in so many diverse
applications: they are extremely powerful. At a low level, a regular expression
describes a chunk of text. You might use it to verify a user's input, or perhaps to
sift through large amounts of data. On a higher level, regular expressions allow
you to master your data. Control it. Put it to work for you. To master regular
expressions is to master your data.
Why I Wrote This Book
You might think that with their wide availability, general popularity, and
unparalleled power, regular expressions would be employed to their fullest,
wherever found. You might also think that they would be well documented, with
introductory tutorials for the novice just starting out, and advanced manuals for
the expert desiring that little extra edge.
Sadly, that hasn't been the case. Regular-expression documentation is certainly
plentiful, and has been available for a long time. (I read my first
regular-expression-related manual back in 1981.) The problem, it seems, is that
the documentation has traditionally centered on the "low-level view" that I
mentioned a moment ago. You can talk all you want about how paints adhere to
canvas, and the science of how colors blend, but this won't make you a great
painter. With painting, as with any art, you must touch on the human aspect to
really make a statement. Regular expressions, composed of a mixture of symbols
and text, might seem to be a cold, scientific enterprise, but I firmly believe they
are very much creatures of the right half of the brain. They can be an outlet for
creativity, for cunningly brilliant programming, and for the elegant solution.
I'm not talented at anything that most people would call art. I go to karaoke bars
in Kyoto a lot, but I make up for the lack of talent simply by being loud. I do,
however, feel very artistic when I can devise an elegant solution to a tough
problem. In much of my work, regular expressions are often instrumental in
developing those elegant solutions. Because it's one of the few outlets for the
artist in me, I have developed somewhat of a passion for regular expressions. It is
my goal in writing this book to share some of that passion.
This book will interest anyone who has an opportunity to use regular expressions.
In particular, if you don't yet understand the power that regular expressions can
provide, you should benefit greatly as a whole new world is opened up to you.
Many of the popular cross-platform utilities and languages that are featured in this
book are freely available for MacOS, DOS/Windows, Unix, VMS, and more.
Appendix A has some pointers on how to obtain many of them. Anyone who uses GNU Emacs or vi, or programs in Perl, Tcl, Python, or Awk,
should find a gold mine of detail, hints, tips, and understanding that can be put to
immediate use. The detail and thoroughness is simply not found anywhere else.
Regular expressions are an idea—one that is implemented in various ways by
various utilities (many, many more than are specifically presented in this book). If
you master the general concept of regular expressions, it's a short step to
mastering a particular implementation. This book concentrates on that idea, so most of the
knowledge presented here transcends the utilities used in the examples.
How to Read This Book
This book is part tutorial, part reference manual, and part story, depending on
when you use it. Readers familiar with regular expressions might feel that they
can immediately begin using this book as a detailed reference, flipping directly to
the section on their favorite utility. I would like to discourage that.
This Book, as a Story
To get the most out of this book, read it first as a story. I have found that certain
habits and ways of thinking can be a great help to reaching a full understanding,
but such things are absorbed over pages, not merely memorized from a list. Here's
a short quiz: define the word "between." Remember, you can't use the word in its
definition! Have you come up with a good definition? No? It's tough! It's lucky
that we all know what "between" means because most of us would have a devil of
a time trying to explain it to someone that didn't know. It's a simple concept, but
it's hard to describe to someone who isn't already familiar with it. To some extent,
describing the details of regular expressions can be similar. Regular expressions
are not really that complex, but the descriptions can tend to be. I've crafted a story
and a way of thinking that begins with Chapter 1, so I hope you begin reading
there. Some of the descriptions are complex, so don't be alarmed if some of the
more detailed sections require a second reading. Experience is 9/10 of the law (or
something like that), so it takes time and experience before the overall picture can
This Book, as a Reference
This book tells a story, but one with many details. Once you've read the story to
get the overall picture, this book is also useful as a reference. I've used cross
references liberally, and I've worked hard to make the index as useful as possible.
(Cross references are often presented as a symbol followed by a page number.) Until
you read the full story, its use as a reference makes little sense. Before reading the
story, you might look at one of the tables, such as the huge chart on page 182, and
think it presents all the relevant information you need to know. But a great deal of
background information does not appear in the charts themselves, but rather in the
associated story. Once you've read the story, you'll have an appreciation for the
issues, what you can remember off the top of your head, and what is important to
check up on.
Organization
The seven chapters of this book can be logically divided into roughly three parts,
with two additional appendices. Here's a quick overview:
Chapter 1 introduces the concept of regular expressions.
Chapter 2 takes a look at text processing with regular expressions.
Chapter 3 provides an overview of features and utilities, plus a bit of history.
Chapter 4 explains the details of how regular expressions work.
Chapter 5 discusses ramifications and practical applications of the details.
Chapter 6 looks at a few tool-specific issues of several common utilities.
Chapter 7 looks at everything to do with regular expressions in Perl.
Appendix A tells how to acquire many of the tools mentioned in this book.
Appendix B provides a full listing of a program developed in Chapter 7.
The introduction elevates the absolute novice to "issue-aware" novice. Readers
with a fair amount of experience can feel free to skim the early chapters, but I
particularly recommend Chapter 3 even for the grizzled expert.
• Chapter 1, Introduction to Regular Expressions, is geared toward the
complete novice. I introduce the concept of regular expressions using the
widely available program egrep, and offer my perspective on how to think
regular expressions, instilling a solid foundation for the advanced concepts in
later chapters. Even readers with former experience would do well to skim this
first chapter.
• Chapter 2, Extended Introductory Examples, looks at real text processing in a
programming language that has regular-expression support. The additional
examples provide a basis for the detailed discussions of later chapters, and
show additional important thought processes behind crafting advanced regular
expressions. To provide a feel for how to "speak in regular expressions," this
chapter takes a problem requiring an advanced solution and shows ways to
solve it using two unrelated regular-expression-wielding tools.
• Chapter 3, Overview of Regular Expression Features and Flavors, provides
an overview of the wide range of regular expressions commonly found in tools
today. Due to their turbulent history, current commonly used regular expression
flavors can differ greatly. This chapter also takes a look at a bit of the history
and evolution of regular expressions and the programs that use them. The
end of this chapter also contains the "Guide to the Advanced Chapters." This
guide is your road map to getting the most out of the advanced material that
Once you have the basics down, it's time to investigate the how and the why. Like
the "teach a man to fish" parable, truly understanding the issues will allow you to
apply that knowledge whenever and wherever regular expressions are found. That
true understanding begins in:
• Chapter 4, The Mechanics of Expressi...