’s. So the regular expression for matching one or more a is /aa*/, meaning one a followed by zero or more a’s. More complex patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s” (not “zero or more right square braces”). This will match strings like aaaa or ababab or bbbb.

For specifying multiple digits (useful for finding prices) we can extend /[0-9]/, the regular expression for a single digit. An integer (a string of digits) is thus /[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)

Sometimes it’s annoying to have to write the regular expression for digits twice, so there is a shorter way to specify “at least one” of some character. This is the Kleene +, which means “one or more of the previous character”. Thus, the expression /[0-9]+/ is the normal way to specify “a sequence of digits”. There are thus two ways to specify the sheep language: /baaa*!/ or /baa+!/.

One very important special character is the period (/./), a wildcard expression that matches any single character (except a carriage return), as shown in Fig. 2.6.

RE         Match                              Example Matches
/beg.n/    any character between beg and n    begin, beg’n, begun

Figure 2.6 The use of the period . to specify any character.

The wildcard is often used together with the Kleene star to mean “any string of characters”. For example, suppose we want to find any line in which a particular word, for example, aardvark, appears twice. We can specify this with the regular expression /aardvark.*aardvark/.

Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are the caret ^ and the dollar sign $. The caret ^ matches the start of a line. The pattern /^The/ matches the word The only at the start of a line. Thus, the caret ^ has three uses: to match the start of a line, to indicate a negation inside of square brackets, and just to mean a caret.
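The patterns discussed so far can be tried out directly. What follows is a minimal sketch using Python's standard re module, not a definitive recipe: the sample strings and variable names are invented for illustration, and the slashes that delimit patterns in the text are dropped because Python takes the pattern as a plain string. One difference worth noting is that in Python the period matches any character except a newline by default.

```python
import re

# Two equivalent ways to write the sheep language: b, two or more a's, then !
sheep_star = re.compile(r"baaa*!")
sheep_plus = re.compile(r"baa+!")
samples = ["baa!", "baaaaa!", "ba!", "b!"]  # illustrative strings
star_hits = [s for s in samples if sheep_star.fullmatch(s)]
plus_hits = [s for s in samples if sheep_plus.fullmatch(s)]
print(star_hits, plus_hits)

# [ab]* repeats the whole bracket expression, so any mix of a's and b's matches
ab_hits = [s for s in ["aaaa", "ababab", "bbbb", "abc"] if re.fullmatch(r"[ab]*", s)]
print(ab_hits)

# [0-9]+ finds runs of one or more digits
digit_runs = re.findall(r"[0-9]+", "Tickets cost 15 dollars, or 1200 cents")
print(digit_runs)

# The period stands for exactly one character, so "begn" is not matched
wild_hits = re.findall(r"beg.n", "begin beg'n begun begn")
print(wild_hits)

# .* between two copies of a word finds lines where the word appears twice
twice = re.compile(r"aardvark.*aardvark")
lines = ["an aardvark met another aardvark", "just one aardvark here"]
twice_hits = [ln for ln in lines if twice.search(ln)]
print(twice_hits)

# ^ anchors the match to the start of the line
starts_with_the = re.compile(r"^The")
caret_hits = [ln for ln in ["The dog barked", "See The dog"] if starts_with_the.search(ln)]
print(caret_hits)
```

The sketch uses fullmatch where the whole string must belong to the language and search or findall where the pattern may occur anywhere in a line, which mirrors how the patterns are described in the text.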
(What are the contexts that allow grep or Python to know which function a given caret is supposed to have?) The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by the dollar sign) is a useful pattern for matching a space at the end of a line, and /^The dog\.$/ matches a line that contains only the phrase The dog. (We have to use the backslash here since we want the . to mean “period” and not the wildcard.)

There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other. More technically, a “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters; this is based on the definition of “words” in programming languages. For example, /\b99\b/ will match the string 99 in There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in There are 299 bottles of beer on the wall (since 99 follows a digit). But it will match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
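The end-of-line and word-boundary anchors can be checked the same way. This is a small sketch with Python's standard re module, under the assumption that each test string is a single line; the strings and variable names are illustrative. Raw strings (the r prefix) matter here, because in an ordinary Python string literal \b would be read as a backspace character rather than a word boundary.

```python
import re

# A space followed by $ matches only when the line ends in a space
trailing_space = re.compile(r" $")
space_hits = [ln for ln in ["ends with a space ", "no trailing space"]
              if trailing_space.search(ln)]
print(space_hits)

# ^...$ pins the pattern to the whole line; \. is a literal period
only_the_dog = re.compile(r"^The dog\.$")
dog_hits = [ln for ln in ["The dog.", "The dog. It barked.", "The dog!"]
            if only_the_dog.search(ln)]
print(dog_hits)

# \b requires a word boundary on each side, so "the" inside "other" is skipped
the_word = re.compile(r"\bthe\b")
the_hits = [s for s in ["the cat", "other"] if the_word.search(s)]
print(the_hits)

# \B is the opposite: "the" must sit strictly inside a longer word
the_inside = re.compile(r"\Bthe\B")
inside_hits = [s for s in ["the cat", "other"] if the_inside.search(s)]
print(inside_hits)

# 99 is matched after a space or a dollar sign, but not after another digit
ninety_nine = re.compile(r"\b99\b")
nn_hits = [s for s in ["There are 99 bottles", "There are 299 bottles", "$99"]
           if ninety_nine.search(s)]
print(nn_hits)
```

The last list shows the point made in the text: whether 99 counts as a separate "word" depends only on whether its neighbors are digits, underscores, or letters.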
