Here’s the full regex:

「^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$」

A string that matches this has no mismatched tags (with a few caveats we’ll look at a bit later).

This may appear to be quite complex, but it’s manageable when broken down into its component parts. The expression’s outer 「^(···)$」 wraps the main body of the regex to ensure that the entire subject string is matched before success is returned. That main body is also wrapped with an additional set of capturing parentheses, which, as we’ll soon see, allows a later recursive reference to “the main body.”

The main body of this expression

The main body of the regex, then, is three alternatives (each underlined within the regex, for visual clarity) wrapped in 「(?:···)*+」 to allow any mix of them to match. The three alternatives attempt to match, respectively: matched tags, non-tag text, and self-closing tags.

Extended Examples 481
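The pattern can be tried out directly with PHP's preg_match. What follows is a minimal sketch, not a definitive implementation: it assumes the regex reads as 「^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$」, and the constant name NESTED_TAGS_REGEX, the function name has_balanced_tags, and the sample strings are hypothetical, introduced only for illustration. The samples stick to paired tags and plain text. Self-closing inputs such as <br/> are worth checking separately, because the possessive 「[^>]*+」 in that alternative can also consume the slash.

```php
<?php
// Assumed reading of the full regex; { and } serve as the pattern delimiters.
const NESTED_TAGS_REGEX = '{^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$}';

// Hypothetical helper: true when the whole string matches the pattern.
function has_balanced_tags(string $text): bool
{
    // preg_match returns 1 on a match, 0 on no match, and false on error.
    return preg_match(NESTED_TAGS_REGEX, $text) === 1;
}

$samples = [
    '<a>hello <b>world</b></a>',  // properly nested pairs
    '<a><b>crossed</a></b>',      // closing tags in the wrong order
    'plain text',                 // no tags at all
    '<a>unclosed',                // opening tag never closed
];

foreach ($samples as $sample) {
    $verdict = has_balanced_tags($sample) ? 'balanced' : 'not balanced';
    printf("%-28s %s\n", $sample, $verdict);
}
```

In this sketch the first and third samples should be reported as balanced and the other two as not balanced. The crossed pair fails because 「</\2>」 must repeat the name captured by the opening tag at the same nesting level.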
482 Chapter 10: PHP

Because what each alternative can match is unique to that alternative (that is, where one alternative has matched, neither of the others may match), I know that later backtracking will never allow another alternative to match the same text. I can take advantage of that knowledge to make the process more efficient by using a possessive 「*」 on the “allow any mix to match” parentheses. This tells the engine to not even bother trying to backtrack, thereby hastening a result when a match can’t be found.

For the same reason, the three alternatives may be placed in any order, so I put first the alternatives I felt were most likely to match most often (☞ 260).

Now let’s look at the alternatives one at a time . . .

The second alternative: non-tag text

I’ll start with the middle alternative, because it’s the simplest: 「[^<>]++」. This alternative matches non-tag spans of text. The use of the possessive quantifier here may be overkill considering that the wrapping 「(?:···)*+」 is also possessive, but to be safe, I like to use a possessive quantifier when I know it can’t hurt. (A possessive quantifier is often used for its efficiency, but it can also change the semantics of a match. The change can be useful, but make sure you understand its ramifications ☞ 259).

The third alternative: self-closing tags

The third alternative, 「<\w[^>]*+/>」, matches self-closing tags such as <br/> and <img···/> (self-closing tags are characterized by the ‘/’ immediately before the closing bracket). As before, the use of a possessive quantifier here may be overkill, but it certainly doesn’t hurt.

The first alternative: a matched set of tags

Finally, let’s look at the first alternative: 「<(\w++)[^>]*+(?<!/)>(?1)</\2>」

The first part of this subexpression, 「<(\w++)[^>]*+(?<!/)>」 (marked with an underline), matches an opening tag, with its 「(\w++)」 capturing the tag name within what turns out to be the overall regex’s second set of capturing parentheses. (The use of a possessive quantifier in 「\w++」 is an important point that we’ll look at in a bit.)

「(?<!/)」 is negative lookbehind (☞ 133) ensuring that we haven’t just matched a slash. We put it right before the 「>」 at the end of the match-an-opening-tag section