Once-only subpatterns
With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.
Consider, for example, the pattern \d+foo when applied to
the subject line
123456bar
After matching all 6 digits and then failing to match "foo",
the normal action of the matcher is to try again with only 5
digits matching the \d+ item, and then with 4, and so on,
before ultimately failing. Once-only subpatterns provide the
means for specifying that once a portion of the pattern has
matched, it is not to be re-evaluated in this way, so the
matcher would give up immediately on failing to match "foo"
the first time. The notation is another kind of special
parenthesis, starting with (?> as in this example:
(?>\d+)foo
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical standalone
pattern would match, if anchored at the current point
in the subject string.
Once-only subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, (?>\d+) can only match an entire sequence of digits.
This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.
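The difference between the three kinds of repeat can be seen in a
short experiment. The following is a minimal sketch in Python, not
PCRE itself: it assumes Python's standard re module, which accepts
the (?> notation only from version 3.11, and on older versions it
falls back to the well-known lookahead-plus-backreference emulation.
The names NATIVE, ATOMIC_DIGITS and matched_text are invented for
this illustration.

```python
import re
import sys

# Python's re accepts (?>...) from version 3.11. On older versions this sketch
# falls back to the emulation (?=(\d+))\1, which adds a capturing group but
# locks the digits in the same way.
NATIVE = sys.version_info >= (3, 11)
ATOMIC_DIGITS = r"(?>\d+)" if NATIVE else r"(?=(\d+))\1"


def matched_text(pattern, subject):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, subject)
    return m.group() if m else None


subject = "123456"

# Maximizing: takes all six digits, then gives one back so the final \d matches.
greedy = matched_text(r"\d+\d", subject)

# Minimizing: takes one digit and stops as soon as the final \d can match.
lazy = matched_text(r"\d+?\d", subject)

# Once-only: takes every digit available and refuses to give any back,
# so the final \d has nothing left to match at any starting position.
atomic = matched_text(ATOMIC_DIGITS + r"\d", subject)

# When what follows does match, the once-only group makes no difference.
atomic_ok = matched_text(ATOMIC_DIGITS + "bar", "123456bar")

print(greedy, lazy, atomic, atomic_ok)
```

The maximizing and minimizing repeats both find a match by adjusting
how many digits they take, whereas the once-only version finds none.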
Once-only subpatterns can be used in conjunction with
look-behind assertions to specify efficient matching at the end
of the subject string. Consider a simple pattern such as
abcd$
when applied to a long string which does not match. Because
matching proceeds from left to right, PCRE will look for
each "a" in the subject and then see if what follows matches
the rest of the pattern. If the pattern is specified as
^.*abcd$
then the initial .* matches the entire string at first, but
when this fails (because there is no following "a"), it
backtracks to match all but the last character, then all but
the last two characters, and so on. Once again the search
for "a" covers the entire string, from right to left, so we
are no better off. However, if the pattern is written as
^(?>.*)(?<=abcd)
then there can be no backtracking for the .* item; it can
match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If
it fails, the match fails immediately. For long strings,
this approach makes a significant difference to the processing time.
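The end-of-string technique can be sketched as follows. This is an
illustration in Python rather than PCRE, assuming Python's re module
(native (?> support from version 3.11, with a lookahead-plus-
backreference emulation otherwise) and a single-line subject, since
. does not match a newline by default. The helper name
ends_with_abcd is hypothetical.

```python
import re
import sys

# Native once-only syntax on Python 3.11+, emulation on older versions.
NATIVE = sys.version_info >= (3, 11)
ATOMIC_ALL = r"(?>.*)" if NATIVE else r"(?=(.*))\1"

# Swallow the whole line without backtracking, then make a single
# lookbehind test on the last four characters.
ENDS_WITH_ABCD = re.compile("^" + ATOMIC_ALL + "(?<=abcd)")


def ends_with_abcd(line):
    """Return True if the single-line subject ends with 'abcd'."""
    return ENDS_WITH_ABCD.match(line) is not None


hit = ends_with_abcd("x" * 1000 + "abcd")    # subject ends with abcd
miss = ends_with_abcd("abcd" + "x" * 1000)   # abcd present, but not at the end
empty = ends_with_abcd("")                   # nothing for the lookbehind to see

print(hit, miss, empty)
```

In the failing cases the .* item is never re-evaluated, so the
rejection costs one pass over the line plus one four-character test.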
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to
avoid some failing matches taking a very long time indeed.
The pattern
(\D+|<\d+>)*[!?]
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in <>, followed by
either ! or ?. When it matches, it runs quickly. However, if
it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried. (The example
uses [!?] rather than a single character at the end,
because both PCRE and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
match, and fail early if it is not present in the string.)
If the pattern is changed to
((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens quickly.
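The effect can be observed with a small timing experiment. The sketch
below uses Python's re module as a stand-in for PCRE (an assumption:
its backtracking behaves similarly for this pattern, but its
optimizations differ), with native (?> support from version 3.11 and
an emulation otherwise. The slow pattern is only run on a short
subject so that the script finishes quickly; timed_match and the
variable names are invented for the example.

```python
import re
import sys
import time

NATIVE = sys.version_info >= (3, 11)

PLAIN = re.compile(r"(\D+|<\d+>)*[!?]")
if NATIVE:
    ONCE_ONLY = re.compile(r"((?>\D+)|<\d+>)*[!?]")
else:
    # Emulation for older Pythons: the lookahead captures the run of
    # non-digits and the backreference consumes it without backtracking.
    ONCE_ONLY = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")


def timed_match(regex, subject):
    """Return (matched text or None, elapsed seconds) for an anchored match."""
    start = time.perf_counter()
    m = regex.match(subject)
    elapsed = time.perf_counter() - start
    return (m.group() if m else None), elapsed


# A short failing subject: the plain pattern already has to try a large
# number of ways of dividing the string between the two repeats.
short_subject = "a" * 18
plain_result, plain_seconds = timed_match(PLAIN, short_subject)
once_result, once_seconds = timed_match(ONCE_ONLY, short_subject)

# The full-length subject from the text, once-only version only.
long_result, long_seconds = timed_match(ONCE_ONLY, "a" * 52)

# A subject on which the once-only version still matches.
bracket_result, _ = timed_match(ONCE_ONLY, "<123>!")

print("plain, 18 chars:     %.6f s" % plain_seconds)
print("once-only, 18 chars: %.6f s" % once_seconds)
print("once-only, 52 chars: %.6f s" % long_seconds)
```

The timings depend on the machine and the regex engine, so no figures
are given here. Note also that because \D matches ! and ? as well, the
once-only form is not a drop-in replacement: a subject such as abc!
matches the original pattern from its start but not the rewritten one.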