PHP: Criteri ricorsivi

Criteri ricorsivi

Si consideri il caso in cui si debba riconoscere una stringa tra parentesi, si supponga inoltre di dovere ammettere un numero arbitrario di parentesi annidate. Senza l'uso della ricorsione, il miglior metodo consiste nel preparare un criterio che identifichi un numero finito di parentesi annidate. Però, in questo modo non si gestisce un numero arbitrario si parentesi annidate. Nella versione 5.6 di Perl è stata introdotta, a livello sperimentale, la possibilità per i criteri di riconoscimento di essere ricorsivi. Allo scopo è stata definita la sequenza speciale (?R). Il criterio illustrato di seguito risolve il problema delle parentesi (si assume che l'opzione CRE_EXTENDED sia attivata in modo da ignorare gli spazi): \( ( (?>[^()]+) | (?R) )* \)

In principio si identifica la parentesi di apertura. Quindi si cerca un numero indefinito di stringhe che possano essere composte da caratteri che non siano parentesi, oppure che siano riconosciute ricorsivamente dal criterio (ad esempio una stringa correttamente tra parentesi). Infine si cerca la parentesi di chiusura.

In questo particolare esempio si hanno arbitrarie ripetizioni annidate, e quindi l'uso delle sotto-regole a riconoscimento singolo per il riconoscimento della stringa priva di parentesi è particolarmente utile se il criterio viene applicato ad un testo non riconoscibile. Ad esempio, applicando il criterio a (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() si ottiene un esito di mancato riconoscimento rapidamente. Se, al contrario, non si fosse usato il criterio a riconoscimento singolo, sarebbe occorso molto tempo per avere un esito dato che vi sono moltissimi modi in cui i caratteri di ripetizione + e * possono essere applicati al testo, richiedendo, pertanto, la verifica di tutte le combinazioni prima di fornire l'esito negativo.

Il valori restituiti da una sotto-regola di cattura sono quelli ottenuti dal livello di ricorsione più esterno. Se il criterio illustrato in precedenza venisse applicato al testo (ab(cd)ef) Il valore ottenuto sarebbe "ef", che rappresenta l'ultimo valore catturato al livello più alto. Se si aggiungono altre parentesi al criterio, ottenendo \( ( ( (?>[^()]+) | (?R) )* ) \) allora il testo ottenuto sarebbe "ab(cd)ef", cioè il valore delle parentesi di livello più alto. Se nel criterio vi sono più di 15 sotto-regole di cattura, PCRE ha necessità di ottenere maggiore memoria per archiviare le informazioni durante la ricorsione. Questa memoria è ottenuta utilizzando pcre_malloc, al termine verrà liberata con pcre_free. Se non può essere ottenuta ulteriore memoria, si ha il salvataggio delle informazioni dei primi 15 testi catturati, dato che non vi è modo di restituire un messaggio di memoria insufficiente dalla ricorsione.

(?1), (?2) e così via possono pure essere usati per sottoregole ricorsive. È anche possibile usare sottoregole con nome: (?P>name) oppure (?P&name).

Se la sintassi di un riferimento a sottoregola (identificata per numero o per nome) è usata al di fuori delle parentesi a cui si riferisce, si comporta come una funzione di un linguaggio di programmazione. Un esempio precedente mostrava come la regola (sens|respons)e and \1ibility trova "sense and sensibility" e "response and responsibility", ma non "sense and responsibility". Se si usa invece la regola (sens|respons)e and (?1)ibility, questa identifica anche "sense and responsibility", come pure le altre due stringhe. Questi riferimenti devono, comunque, seguire le sottoregole a cui si riferiscono.

La lunghezza massima di una stringa corrisponde al pìù grande numero positivo ammesso per una variabile integer. D'altra parte, PCRE usa la ricorsione per gestire le sottoregole e le ripetizioni indefinite. Questo significa che lo spazio di stack disponibile può limitare la dimensione di una stringa che può essere processata da determinate regole.

add a note

User Contributed Notes 8 notes

down

horvath at webarticum dot hu ¶

11 years ago


With the (?R) item you can link only to the full pattern, because it quasi equals to (?0). You can not use anchors, asserts etc., and you can only check that string CONTAINS a valid hierarchy or not.

This is wrong: ^\(((?>[^()]+)|(?R))*\)$

However, you can bracketing the full expression, and replace (?R) to the relative link (?-2). This make it reusable. So you can check complex expressions, for example:
<?php

$bracket_system = "(\\(((?>[^()]+)|(?-2))*\\))"; // (reuseable)
$bracket_systems = "((?>[^()]+)?$bracket_system)*(?>[^()]+)?"; // (reuseable)
$equation = "$bracket_systems=$bracket_systems"; // Both side of the equation must be contain valid bracket systems
var_dump(preg_match("/^$equation\$/","a*(a-(2a+2))=4(a+3)-2(a-(a-2))")); // Outputs 'int(1)'
var_dump(preg_match("/^$equation\$/","a*(a-(2a+2)=4(a+3)-2(a-(a-2)))")); // Outputs 'int(0)'

?>

You can also catch multibyte quotes with the 'u' modifier (if you use UTF-8), eg:
<?php

$quoted = "(»((?>[^»«]+)|(?-2))*«)"; // (reuseable)
$prompt = "[\\w ]+: $quoted";
var_dump(preg_match("/^$prompt\$/u","Your name: »write here«")); // Outputs 'int(1)'

?>

down

emanueledelgrande at email dot it ¶

15 years ago


The recursion in regular expressions is the only way to allow the parsing of HTML code with nested tags of indefinite depth.
It seems it's not yet a spreaded practice; not so much contents are available on the web regarding regexp recursion, and until now no user contribute notes have been published on this manual page.
I made several tests with complex patterns to get tags with specific attributes or namespaces, studying the recursion of a subpattern only instead of the full pattern.
Here's an example that may power a fast LL parser with recursive descent (http://en.wikipedia.org/wiki/Recursive_descent_parser):

$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";

The performances of a preg_match or preg_match_all function call over an avarage (x)HTML document are quite fast and may drive you to chose this way instead of classic DOM object methods, which have a lot of limits and are usually poor in performance with their workarounds, too.
I post a sample application in a brief function (easy to be turned into OOP), which returns an array of objects:

<?php
// test function:
function parse($html) {
    // I have split the pattern in two lines not to have long lines alerts by the PHP.net form:
    $pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|".
    "(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/sm";
    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
    $elements = array();
    
    foreach ($matches[0] as $key => $match) {
        $elements[] = (object)array(
            'node' => $match[0],
            'offset' => $match[1],
            'tagname' => $matches[1][$key][0],
            'attributes' => isset($matches[2][$key][0]) ? $matches[2][$key][0] : '',
            'omittag' => ($matches[4][$key][1] > -1), // boolean
            'inner_html' => isset($matches[6][$key][0]) ? $matches[6][$key][0] : ''
        );
    }
    return $elements;
}

// random html nodes as example:
$html = <<<EOD
<div id="airport">
    <div geo:position="1.234324,3.455546" class="index">
        <!-- comment test:
        <div class="index_top" />
        -->
        <div class="element decorator">
                <ul class="lister">
                    <li onclick="javascript:item.showAttribute('desc');">
                        <h3 class="outline">
                            <a href="http://php.net/manual/en/regexp.reference.recursive.php" onclick="openPopup()">Link</a>
                        </h3>
                        <div class="description">Sample description</div>
                    </li>
                </ul>
        </div>
        <div class="clean-line"></div>
    </div>
</div>
<div id="omittag_test" rel="rootChild" />
EOD;

// application:
$elements = parse($html);

if (count($elements) > 0) {
    echo "Elements found: <b>".count($elements)."</b><br />";
    
    foreach ($elements as $element) {
        echo "<p>Tpl node: <pre>".htmlentities($element->node)."</pre>
        Tagname: <tt>".$element->tagname."</tt><br />
        Attributes: <tt>".$element->attributes."</tt><br />
        Omittag: <tt>".($element->omittag ? 'true' : 'false')."</tt><br />
        Inner HTML: <pre>".htmlentities($element->inner_html)."</pre></p>";
    }
}
?>

down

Onyxagargaryll ¶

13 years ago


Here's an approach to create a multidimensional array according to a string's delimiters, i.e. we want to analyze...

"some text (aaa(b(c1)(c2)d)e)(test) more text"

... as multidimensional layers.

<?php
$string = "some text (aaa(b(c1)(c2)d)e)(test) more text";

/*
 * Analyses the string multidimensionally by its opening and closing delimiters
 */
function recursiveSplit($string, $layer) {
    preg_match_all("/\((([^()]*|(?R))*)\)/",$string,$matches);
    // iterate thru matches and continue recursive split
    if (count($matches) > 1) {
        for ($i = 0; $i < count($matches[1]); $i++) {
            if (is_string($matches[1][$i])) {
                if (strlen($matches[1][$i]) > 0) {
                    echo "<pre>Layer ".$layer.":   ".$matches[1][$i]."</pre><br />";
                    recursiveSplit($matches[1][$i], $layer + 1);
                }
            }
        }
    }
}

recursiveSplit($string, 0);

/*

Output:

Layer 0:   aaa(b(c1)(c2)d)e

Layer 1:   b(c1)(c2)d

Layer 2:   c1

Layer 2:   c2

Layer 0:   test
*/
?>

down

Anonymous ¶

8 years ago


sass parse sample

<?php

$data = 'a { b { 1 } c { d { 2 } } }';

preg_match('/a (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);
preg_match('/b (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);
preg_match('/c (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);
preg_match('/d (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);

/*
Array (
    [0] => a { b { 1 } c { d { 2 } } }
    [R] => { b { 1 } c { d { 2 } } }
    [1] => { b { 1 } c { d { 2 } } }
)
Array (
    [0] => b { 1 }
    [R] => { 1 }
    [1] => { 1 }
)
Array (
    [0] => c { d { 2 } }
    [R] => { d { 2 } }
    [1] => { d { 2 } }
)
Array (
    [0] => d { 2 }
    [R] => { 2 }
    [1] => { 2 }
)
*/

down

jonah at nucleussystems dot com ¶

14 years ago


An unexpected behavior came up that introduced a very hard-to-track bug in some code I was working on.  It has to do with the preg_match_all PREG_OFFSET_CAPTURE flag.  When you capture the offset of a sub-match, it's offset is given _relative_ to it's parent.  For example, if you extract the value between < and > recursively in this string:

<this is a <string>>

You will get an array that looks like this:

Array
(
    [0] => Array
    (
        [0] => Array
        (
            [0] => <this is a <string>>
            [1] => 0
        )
        [1] => Array
        (
            [0] => this is a <string>
            [1] => 1
        )
    )
    [1] => Array
    (
        [0] => Array
        (
            [0] => <string>
            [1] => 0
        )
        [1] => Array
        (
            [0] => string
            [1] => 1
        )
    )
)

Notice that the offset in the last index is one, not the twelve we expected.  The best way to solve this problem is to run over the results with a recursive function, adding the parent's offset.

down

-1

mzvarik at gmail dot com ¶

4 years ago


This regexp can be used for parsing IF conditions.

$str = '
(IF_MYVAR)My var is printed
(OR_MYVARTWO)My var two is printed
(OR_ANOTHER)if you use OR you don't have to END everytime
(ELSE)Whatever bro(END)

(IF_BLUE)Something (IF_SUPERB)super(END) blue - this is simple IF condition(END)
';

function isCondition($k) {
    // put your user conditions here
    $conds = [];
    $conds['BLUE'] = true;
    $conds['MYVARTWO'] = true;
    $conds['ELSE'] = true; // always true
     return $conds[$k];
}

function findConditions($str) {
    
    $pattern = '~ \(if_([^\)]+)\) ((?: (?!\((end|if_)). | (?R) )*+) \(end\) ~xis';
    
    $str = preg_replace_callback($pattern, function($m) {
        
        $k = $m[1];
        $v = $m[2];
        $v = findConditions($v) ?: $v;
        
        $ors = preg_split('~(?=\((OR_[^\)]+|ELSE))~is', $v);
        
        $v = array_shift($ors); // hlavní
                
        if (isCondition($k)) return findConditions($v);
        else {
            foreach ($ors as $or) {
                list($k, $v) = explode(")", $or, 2);
                $k = substr($k, 1);
                if (isCondition($k)) return findConditions($v);
            }
        }
        return ''; // no matching condition
    }, $str);
    return $str;
};

// will output:  Whatever bro  \n\n  Something blue
echo findConditions($str);

down

-1

Daniel Klein ¶

12 years ago


The order of non-mutually exclusive alternatives within a recursed sub-pattern is important.
<?php
$pattern = '/^(?<octet>[01]?\d?\d|2[0-4]\d|25[0-5])(?:\.(?P>octet)){3}$/';
?>

You might expect that this pattern will match any IP address in dotted-decimal notation (e.g. '123.45.67.89'). The pattern is intended to match four octets in the following ranges: 0-9, 00-99 & 000-255, each separated by a single dot. However, only the first octet can include values from 200-255; the remainder can only have values less than 200. The reason for this is that if the rest of the pattern fails, recursion is not back-tracked into to find an alternative match. The first part of the sub-pattern will match the first two digits of any octet from 200-255. The rest of the pattern will then fail because the third digit in the octet does not match either '\.' or '$'.

<?php
var_export(preg_match($pattern, '255.123.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.200.45.67')); // 0 (false)
var_export(preg_match($pattern, '255.123.45.200')); // 0 (false)
?>

The correct pattern is:
<?php
$pattern = '/^(?<octet>25[0-5]|2[0-4]\d|[01]?\d?\d)(?:\.(?P>octet)){3}$/';
?>

Note that the first two alternatives are mutually exclusive so their order is unimportant. The third alternative, however, is not mutually exclusive but it will now only match when the first two fail.

<?php
var_export(preg_match($pattern, '255.123.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.200.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.123.45.200')); // 1 (true)
?>

down

-2

horvath at webarticum dot hu ¶

11 years ago


Below are some reusable patterns. I used comments with the 'x' modifier for human-readability.

You can write also a function, that generates patterns for specified bracket/quote pairs.

<?php

/* normal paretheses */
$simple_pattern = "(    (?#root pattern)
  (                     (?#text or expression)
    (?>[^\\(\\)]+)      (?#text)
    |
    \\((?-2)\\)         (?#expression and recursion)
  )*
)";

$simple_okay_text = "5( 2a + (b - c) ) - a * ( 2b - (c * 3(b - (c + a) ) ) )";
$simple_bad_text = "5( 2)a + (b - c) ) - )a * ( ((2b - (c * 3(b - (c + a) ) ) )";

echo "Simple pattern results:\n";
var_dump(preg_match("/^$simple_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$simple_pattern\$/x",$simple_bad_text));
echo "\n----------\n";

/* some brackets */
$full_pattern = "(                  (?#root pattern)
  (                                 (?#text or expression)
    (?>[^\\(\\)\\{\\}\\[\\]<>]+)    (?#text not contains brackets)
    |
    (
      [\\(\\{\\[<]                  (?#start bracket)
        (?(?<=\\()(?-3)\\)|           (?#if normal)
        (?(?<=\\{)(?-3)\\}|           (?#if coursed)
        (?(?<=\\[)(?-3)\\]|           (?#if squared)
        (?1)\\>                       (?#else so if tag)
      )))                           (?#close nested-but-logically-the-some-level subpatterns)
    )
  )*
)";

$full_okay_text = "5( 2a + [b - c] ) - a * ( 2b - {c * 3<b - (c + a) > } )";
$full_bad_text = "5[ 2a + [b - c} ) - a * ( 2b - [c * 3(b - c + a) ) ) }";

echo "Full pattern results:\n";
var_dump(preg_match("/^$full_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$full_pattern\$/x",$full_okay_text));
var_dump(preg_match("/^$full_pattern\$/x",$full_bad_text));
echo "\n----------\n";

/* some brackets and quotes */
$extrafull_pattern = "(             (?#root pattern)
  (                                 (?#text or expression)
    (?>[^\\(\\)\\{\\}\\[\\]<>'\"]+) (?#text not contains brackets and quotes)
    |
    (
      ([\\(\\{\\[<'\"])             (?#start bracket)
        (?(?<=\\()(?-4)\\)|           (?#if normal)
        (?(?<=\\{)(?-4)\\}|           (?#if coursed)
        (?(?<=\\[)(?-4)\\]|           (?#if squared)
        (?(?<=\\<)(?-4)\\>|           (?#if tag)
        ['\"]                         (?#else so if static)
      ))))                          (?#close nested-but-logically-the-some-level subpatterns)
    )
  )*
)";

$extrafull_okay_text = "5( 2a + ['b' - c] ) - a * ( 2b - \"{c * 3<b - (c + a) > }\" )";
$extrafull_bad_text = "5( 2a + ['b' - c] ) - a * ( 2b - \"{c * 3<b - (c + a) > }\" )";

echo "Extra-full pattern results:\n";
var_dump(preg_match("/^$extrafull_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$full_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$extrafull_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$extrafull_bad_text));

/*

Outputs:

Simple pattern results:
int(0)
int(0)

----------
Full pattern results:
int(0)
int(0)
int(0)

----------
Extra-full pattern results:
int(0)
int(0)
int(0)
int(0)

*/

?>

add a note