
Here’s the full regex: 「^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$」

A string that matches this has no mismatched tags (with a few caveats we’ll look at a bit later). This may appear to be quite complex, but it’s manageable when broken down into its component parts. The expression’s outer 「^(···)$」 wraps the main body of the regex to ensure that the entire subject string is matched before success is returned. That main body is also wrapped with an additional set of capturing parentheses, which, as we’ll soon see, allows a later recursive reference to “the main body.”

The main body of this expression

The main body of the regex, then, is three alternatives (separated by 「|」 within the regex) wrapped in 「(?:···)*+」 to allow any mix of them to match. The three alternatives attempt to match, respectively: matched tags, non-tag text, and self-closing tags.
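As a quick check of the claim that a matching string has no mismatched tags, the regex can be handed to PHP's `preg_match`. This is only a minimal sketch: the `~` delimiters (chosen so the slashes in the pattern need no escaping), the variable `$balanced`, and the helper `has_balanced_tags` are illustrative names that do not come from the text.

```php
<?php
// The full regex from the text, wrapped in '~' delimiters (an assumption of
// this sketch) so that the '/' characters in the pattern need no escaping.
$balanced = '~^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$~';

// Hypothetical helper: true when the whole of $html matches $pattern.
function has_balanced_tags(string $html, string $pattern): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $html) === 1;
}

var_dump(has_balanced_tags('plain text', $balanced));                   // bool(true)
var_dump(has_balanced_tags('<p>some <b>bold</b> text</p>', $balanced)); // bool(true)
var_dump(has_balanced_tags('<p>some <b>bold</p></b>', $balanced));      // bool(false): tags cross
var_dump(has_balanced_tags('<p>unclosed', $balanced));                  // bool(false): no </p>
```

The nested case works because `(?1)` re-enters the first set of capturing parentheses, while `\2` after the recursion still refers to the tag name captured at the current nesting level.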
Because what each alternative can match is unique to that alternative (that is, where one alternative has matched, neither of the others may match), I know that later backtracking will never allow another alternative to match the same text. I can take advantage of that knowledge to make the process more efficient by using a possessive 「*」 on the “allow any mix to match” parentheses. This tells the engine to not even bother trying to backtrack, thereby hastening a result when a match can’t be found.

For the same reason, the three alternatives may be placed in any order, so I put first the alternatives I felt were most likely to match most often (☞ 260).

Now let’s look at the alternatives one at a time . . .

The second alternative: non-tag text

I’ll start with the middle alternative, because it’s the simplest: 「[^<>]++」. This alternative matches non-tag spans of text. The use of the possessive quantifier here may be overkill considering that the wrapping 「(?:···)*+」 is also possessive, but to be safe, I like to use a possessive quantifier when I know it can’t hurt. (A possessive quantifier is often used for its efficiency, but it can also change the semantics of a match. The change can be useful, but make sure you understand its ramifications; ☞ 259.)

The third alternative: self-closing tags

The third alternative, 「<\w[^>]*+/>」, matches self-closing tags such as <br/> and <img ···/> (self-closing tags are characterized by the ‘/’ immediately before the closing bracket). As before, the use of a possessive quantifier here may be overkill, but it certainly doesn’t hurt.

The first alternative: a matched set of tags

Finally, let’s look at the first alternative: 「<(\w++)[^>]*+(?<!/)>(?1)</\2>」

The first part of this subexpression, 「<(\w++)[^>]*+(?<!/)>」, matches an opening tag, with its 「(\w++)」 capturing the tag name within what turns out to be the overall regex’s second set of capturing parentheses. (The use of a possessive quantifier in 「\w++」 is an important point that we’ll look at in a bit.) 「(?<!/)」 is negative lookbehind (☞ 133) ensuring that we haven’t just matched a slash. We put it right before the 「>」 at the end of the match-an-opening-tag section
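The behavior of the individual pieces can be probed in isolation. The sketch below runs three checks under PHP's `preg_match`: what changes when the tag-name capture is greedy rather than possessive, how the negative lookbehind keeps the opening-tag part from accepting a self-closing tag, and how the third alternative behaves on `<br/>`. The helper `matches`, the variable names, the `~` delimiters, and the added `^ ... $` anchors around the sub-patterns are all assumptions made for illustration. The greedy and lookbehind variants are not taken from the text; they are included only for comparison. On the third check: with the pattern as it is printed in this excerpt, the possessive `[^>]*+` also consumes the slash, so the literal `/>` that follows has nothing left to match. A lookbehind form is shown as one possible adjustment, not as the author's pattern.

```php
<?php
// Hypothetical helper for these checks: true when $pattern matches $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// 1. Possessive vs. greedy capture of the tag name inside the full pattern.
$possessive = '~^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$~';
$greedy     = '~^((?:<(\w+)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$~';
var_dump(matches($possessive, '<bold>x</b>')); // bool(false)
// With greedy \w+, the engine can back off to the name "b" and let
// [^>]*+ absorb "old", so </b> appears to close <bold>.
var_dump(matches($greedy, '<bold>x</b>'));     // bool(true)

// 2. The opening-tag part alone: the lookbehind rejects a trailing slash.
$openTag = '~^<(\w++)[^>]*+(?<!/)>$~';
var_dump(matches($openTag, '<img src="a.png">')); // bool(true)
var_dump(matches($openTag, '<br/>'));             // bool(false)

// 3. The third alternative alone, as printed and with a lookbehind instead.
$selfClosingAsPrinted  = '~^<\w[^>]*+/>$~';
$selfClosingLookbehind = '~^<\w[^>]*+(?<=/)>$~'; // assumption, not from the text
var_dump(matches($selfClosingAsPrinted, '<br/>'));  // bool(false): [^>]*+ kept the "/"
var_dump(matches($selfClosingLookbehind, '<br/>')); // bool(true)
```

The first and third checks are two sides of the same point made above about possessive quantifiers: they can change what matches, not only how quickly a failure is reported.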