Here’s the full regex:
    「^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+/>)*+)$」
A string that matches this has no mismatched tags (with a few caveats we’ll look at
a bit later).
This may appear to be quite complex, but it’s manageable when broken down
into its component parts. The expression’s outer 「^(⋯)$」 wraps the main body of
the regex to ensure that the entire subject string is matched before success is
returned. That main body is also wrapped with an additional set of capturing
parentheses, which, as we’ll soon see, allows a later recursive reference to “the
main body.”
The main body of this expression
The main body of the regex, then, is three alternatives (each underlined within the
regex, for visual clarity) wrapped in 「(?:⋯)*+」 to allow any mix of them to match.
The three alternatives attempt to match, respectively: matched tags, non-tag text,
and self-closing tags.
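To see how the three alternatives fit together, here is a minimal PHP sketch that lays the expression out with the x modifier, one alternative per line. The names BALANCED_TAGS and hasBalancedTags are hypothetical, chosen only for this illustration. One deliberate departure, and it is my assumption rather than the text's: the third alternative here ends with 「(?<=/)>」 instead of a literal 「/>」, because a possessive 「[^>]*+」 also consumes the slash and never gives it back, which leaves nothing for a literal 「/>」 to match.

```php
<?php
// The chapter's regex, spread out with the x modifier so that each
// alternative sits on its own line. Whitespace and #-comments inside the
// pattern are ignored by the regex engine in this mode.
// ASSUMPTION: alternative 3 uses (?<=/)> in place of the printed />,
// since the possessive [^>]*+ would otherwise swallow the slash.
const BALANCED_TAGS = '{
    ^(
        (?:
            <(\w++) [^>]*+ (?<!/) > (?1) </\2>   # alternative 1: matched pair of tags
          | [^<>]++                              # alternative 2: non-tag text
          | <\w [^>]*+ (?<=/) >                  # alternative 3: self-closing tag
        )*+
    )$
}x';

// Hypothetical helper: true when the whole subject has no mismatched tags.
function hasBalancedTags(string $subject): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    return preg_match(BALANCED_TAGS, $subject) === 1;
}

$samples = [
    '<b>bold <i>nested</i></b> text<br/>',
    '<b><i>crossed</b></i>',
    '<p>unclosed',
    'plain text',
];

foreach ($samples as $sample) {
    printf("%-40s %s\n", $sample, hasBalancedTags($sample) ? 'balanced' : 'not balanced');
}
```

The first and last samples should be reported as balanced, and the crossed and unclosed ones as not balanced.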
Because what each alternative can match is unique to that alternative (that is,
where one alternative has matched, neither of the others may match), I know that
later backtracking will never allow another alternative to match the same text. I
can take advantage of that knowledge to make the process more efficient by using
a possessive 「*」 on the “allow any mix to match” parentheses. This tells the engine
to not even bother trying to backtrack, thereby hastening a result when a match
can’t be found.
For the same reason, the three alternatives may be placed in any order, so I put
first the alternatives I felt were most likely to match most often (☞ 260).
Now let’s look at the alternatives one at a time . . .
The second alternative: non-tag text
I’ll start with the middle alternative, because it’s the simplest: 「[^<>]++」. This
alternative matches non-tag spans of text. The use of the possessive quantifier here
may be overkill considering that the wrapping 「(?:⋯)*+」 is also possessive, but to
be safe, I like to use a possessive quantifier when I know it can’t hurt. (A
possessive quantifier is often used for its efficiency, but it can also change the
semantics of a match. The change can be useful, but make sure you understand its
ramifications ☞ 259).
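As a small illustration of how a possessive quantifier can change what matches, and not only how fast, consider the following sketch. The patterns and the subject string are made up for this example and are not taken from the tag-matching regex.

```php
<?php
// Greedy: \w+ first takes all of "abc1", then gives back the "1"
// so that \d can match it. The overall match succeeds.
$greedy = preg_match('/^\w+\d$/', 'abc1');

// Possessive: \w++ takes all of "abc1" and never gives anything back,
// so nothing is left for \d. The overall match fails.
$possessive = preg_match('/^\w++\d$/', 'abc1');

var_dump($greedy === 1);      // bool(true)
var_dump($possessive === 1);  // bool(false)
```

The two expressions differ only in the extra 「+」, yet one matches and the other does not, which is why a possessive quantifier should be added only where giving characters back could never help.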
The third alternative: self-closing tags
The third alternative, 「<\w[^>]*+/>」, matches self-closing tags such as <br/> and
<img⋯/> (self-closing tags are characterized by the ‘/’ immediately before the
closing bracket). As before, the use
of a possessive quantifier here may be overkill, but it certainly doesn’t hurt.
The first alternative: a matched set of tags
Finally, let’s look at the first alternative:
    「<(\w++)[^>]*+(?<!/)>(?1)</\2>」
The first part of this subexpression, 「<(\w++)[^>]*+(?<!/)>」, matches an opening
tag, with its 「(\w++)」 capturing the tag name within what turns out to be the
overall regex’s second set of capturing parentheses. (The use of a possessive
quantifier in 「\w++」 is an important point that we’ll look at in a bit.)
「(?<!/)」 is negative lookbehind (☞ 133) ensuring that we haven’t just matched a
slash. We put it right before the 「>」 at the end of the match-an-opening-tag section

