PHP: Unicode character properties

Unicode character properties

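Since PHP 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}: a character with the xx property
\P{xx}: a character without the xx property
\X: an extended Unicode sequence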

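The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by placing a ^ after the opening brace {. For example, \p{^Lu} is the same as \P{Lu}.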

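If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case the curly braces in the escape sequence are optional; the following two examples are equivalent: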

\p{L}
\pL
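As a minimal sketch of the equivalences described above (the sample characters and the `$results` array are chosen here for illustration only), the following checks that the braced and unbraced forms agree and that `\p{^Lu}` behaves like `\P{Lu}`:

```php
<?php
// Sample characters: an upper case letter, a lower case letter, a digit.
$samples = ['A', 'é', '7'];

$results = [];
foreach ($samples as $char) {
    $results[$char] = [
        'p{L}'   => preg_match('/^\p{L}$/u', $char),   // braced form
        'pL'     => preg_match('/^\pL$/u', $char),     // unbraced form
        'P{Lu}'  => preg_match('/^\P{Lu}$/u', $char),  // not an upper case letter
        'p{^Lu}' => preg_match('/^\p{^Lu}$/u', $char), // Perl-style negation
    ];
}

print_r($results);
```

Each pair of columns (`p{L}`/`pL` and `P{Lu}`/`p{^Lu}`) should hold identical values for every sample.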

**Supported Unicode properties**
Property	Matches	Notes
`C`	Other
`Cc`	Control
`Cf`	Format
`Cn`	Unassigned
`Co`	Private use
`Cs`	Surrogate
`L`	Letter	Includes the following properties: `Ll`, `Lm`, `Lo`, `Lt` and `Lu`.
`Ll`	Lower case letter
`Lm`	Modifier letter
`Lo`	Other letter
`Lt`	Title case letter
`Lu`	Upper case letter
`M`	Mark
`Mc`	Spacing mark
`Me`	Enclosing mark
`Mn`	Non-spacing mark
`N`	Number
`Nd`	Decimal number
`Nl`	Letter number
`No`	Other number
`P`	Punctuation
`Pc`	Connector punctuation
`Pd`	Dash punctuation
`Pe`	Close punctuation
`Pf`	Final punctuation
`Pi`	Initial punctuation
`Po`	Other punctuation
`Ps`	Open punctuation
`S`	Symbol
`Sc`	Currency symbol
`Sk`	Modifier symbol
`Sm`	Mathematical symbol
`So`	Other symbol
`Z`	Separator
`Zl`	Line separator
`Zp`	Paragraph separator
`Zs`	Space separator
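A short sketch of how the one-letter and two-letter properties from the table might be used in practice. The sample text and variable names are assumptions made for this example:

```php
<?php
$text = 'Hello, world! Price: 5€';

// \p{P} covers every punctuation subcategory (Pc, Pd, Pe, Pf, Pi, Po, Ps).
$noPunctuation = preg_replace('/\p{P}+/u', '', $text);

// \p{Sc} matches currency symbols only.
preg_match_all('/\p{Sc}/u', $text, $currency);

echo $noPunctuation, PHP_EOL;
print_r($currency[0]);
```

The euro sign survives the first replacement because it belongs to the Symbol group (`Sc`), not to Punctuation.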

Extended properties such as InMusicalSymbols are not supported by PCRE.

Specifying case-insensitive matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

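Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example: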

\p{Greek}
\P{Han}

Characters that do not belong to an identified script are lumped together as Common. The current list of scripts is:

**Supported scripts**
`Arabic`	`Armenian`	`Avestan`	`Balinese`	`Bamum`
`Batak`	`Bengali`	`Bopomofo`	`Brahmi`	`Braille`
`Buginese`	`Buhid`	`Canadian_Aboriginal`	`Carian`	`Chakma`
`Cham`	`Cherokee`	`Common`	`Coptic`	`Cuneiform`
`Cypriot`	`Cyrillic`	`Deseret`	`Devanagari`	`Egyptian_Hieroglyphs`
`Ethiopic`	`Georgian`	`Glagolitic`	`Gothic`	`Greek`
`Gujarati`	`Gurmukhi`	`Han`	`Hangul`	`Hanunoo`
`Hebrew`	`Hiragana`	`Imperial_Aramaic`	`Inherited`	`Inscriptional_Pahlavi`
`Inscriptional_Parthian`	`Javanese`	`Kaithi`	`Kannada`	`Katakana`
`Kayah_Li`	`Kharoshthi`	`Khmer`	`Lao`	`Latin`
`Lepcha`	`Limbu`	`Linear_B`	`Lisu`	`Lycian`
`Lydian`	`Malayalam`	`Mandaic`	`Meetei_Mayek`	`Meroitic_Cursive`
`Meroitic_Hieroglyphs`	`Miao`	`Mongolian`	`Myanmar`	`New_Tai_Lue`
`Nko`	`Ogham`	`Old_Italic`	`Old_Persian`	`Old_South_Arabian`
`Old_Turkic`	`Ol_Chiki`	`Oriya`	`Osmanya`	`Phags_Pa`
`Phoenician`	`Rejang`	`Runic`	`Samaritan`	`Saurashtra`
`Sharada`	`Shavian`	`Sinhala`	`Sora_Sompeng`	`Sundanese`
`Syloti_Nagri`	`Syriac`	`Tagalog`	`Tagbanwa`	`Tai_Le`
`Tai_Tham`	`Tai_Viet`	`Takri`	`Tamil`	`Telugu`
`Thaana`	`Thai`	`Tibetan`	`Tifinagh`	`Ugaritic`
`Vai`	`Yi`
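To illustrate script matching, here is a small sketch that uses `Greek` and `Han` from the list above. The mixed sample string is an assumption for demonstration:

```php
<?php
$mixed = 'abc漢字αβγ';

// Does the string contain at least one Greek character?
$hasGreek = preg_match('/\p{Greek}/u', $mixed);

// Remove every run of characters that are NOT in the Han script.
$hanOnly = preg_replace('/\P{Han}+/u', '', $mixed);

// Collect contiguous runs of Greek characters.
preg_match_all('/\p{Greek}+/u', $mixed, $greekRuns);

var_dump($hasGreek, $hanOnly, $greekRuns[0]);
```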

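The \X escape matches a Unicode extended grapheme cluster. An extended grapheme cluster is one or more Unicode characters that combine to express a single glyph. In effect, \X can be thought of as the Unicode equivalent of ., in that it matches one composed character regardless of how many individual characters are actually used to render it.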
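The difference between `.` and `\X` can be seen with a decomposed accented letter. This is a minimal sketch, assuming PHP 7 or later for the `\u{...}` string escape:

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
$text = "e\u{0301}a";

// With the u modifier, "." matches one code point at a time.
$codePoints = preg_match_all('/./u', $text);

// \X matches one grapheme cluster at a time.
$clusters = preg_match_all('/\X/u', $text, $matches);

var_dump($codePoints, $clusters);
```

The first cluster in `$matches[0]` contains both the base letter and its combining accent.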

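In versions of PCRE older than 8.32 (which corresponds to PHP versions before 5.4.14 when using the bundled PCRE library), \X is equivalent to (?>\PM\pM*). That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.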
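The older definition can still be written out by hand, which makes it possible to compare it with the current behaviour of `\X`. The sketch below assumes a PCRE version of 8.32 or later, where a carriage return followed by a line feed is treated as one grapheme cluster:

```php
<?php
$legacy = '/(?>\PM\pM*)/u'; // the pre-8.32 definition, spelled out
$modern = '/\X/u';

$accented = "e\u{0301}a";   // "e" + combining acute accent, then "a"
$crlf     = "\r\n";         // carriage return + line feed

// Both definitions agree on base letters followed by marks.
$legacyAccented = preg_match_all($legacy, $accented);
$modernAccented = preg_match_all($modern, $accented);

// CR and LF are not marks, so the legacy pattern sees two units.
$legacyCrlf = preg_match_all($legacy, $crlf);
$modernCrlf = preg_match_all($modern, $crlf);

var_dump($legacyAccented, $modernAccented, $legacyCrlf, $modernCrlf);
```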

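Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.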

User Contributed Notes 10 notes

huhwatnouDONTspamPLEASE at hotmail dot com ¶

9 years ago


To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).

I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. " :-$ and then to find out how to do so.
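A small sketch of the effect of the "u" modifier on the sharp S mentioned above. Without it the subject is processed byte by byte, so the two bytes of ß (0xC3 0x9F) are examined separately, and the second one falls in the U+0080-U+009F control range when read as a single-byte character, which would explain the match reported in this note:

```php
<?php
$sharpS = "\u{00DF}"; // ß, stored as the two bytes 0xC3 0x9F

// With "u" the pattern sees one character, and it is not a control character.
$isControlWithU = preg_match('/\p{Cc}/u', $sharpS);

// Count the units that "." sees with and without UTF-8 mode.
$unitsWithU    = preg_match_all('/./su', $sharpS);
$unitsWithoutU = preg_match_all('/./s', $sharpS);

var_dump($isControlWithU, $unitsWithU, $unitsWithoutU);
```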

mercury at caucasus dot net ¶

14 years ago


An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html

xuantoaiph at gmail dot com ¶

11 years ago


My country, Vietnam, has its own alphabet:
http://en.wikipedia.org/wiki/Vietnamese_alphabet
I hope PHP will support Vietnamese better.

o_shes01 at uni-muenster dot de ¶

14 years ago


For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter. 
For example, there are three codepoints for the "LJ" digraph in Unicode: 
  (*) uppercase "LJ": U+01C7 
  (*) titlecase "Lj": U+01C8 
  (*) lowercase "lj": U+01C9
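The three code points can be checked against `\p{Lu}`, `\p{Lt}` and `\p{Ll}` directly. A minimal sketch, with array names chosen for this example:

```php
<?php
$digraphs = [
    'upper' => "\u{01C7}", // LJ
    'title' => "\u{01C8}", // Lj
    'lower' => "\u{01C9}", // lj
];

$categories = [];
foreach ($digraphs as $name => $char) {
    foreach (['Lu', 'Lt', 'Ll'] as $property) {
        // Record the property that matches the whole one-character string.
        if (preg_match('/^\p{' . $property . '}$/u', $char) === 1) {
            $categories[$name] = $property;
        }
    }
}

print_r($categories);
```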

suit at rebell dot at ¶

14 years ago


these properties are usually only available if PCRE is compiled with "--enable-unicode-properties"

if you want to match any word but want to provide a fallback, you can do something like this:

<?php
// preg_match_all() returns false on error, for example when the
// pattern cannot be compiled because \p is not supported
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
  // fallback goes here
  // for example just '/\w+/u' for a less accurate match
}
?>

Steve ¶

1 year ago


Examples are always useful! See https://unicodeplus.com/category for more.

C    Other     
Cc   Control      (Unicode code points in the ranges U+0000-U+001F and U+007F-U+009F)
Cf   Format       (Soft hyphen (U+00AD), zero width space (U+200B), etc.)
Cn   Unassigned   (Any code point that is not in the Unicode table)
Co   Private use     
Cs   Surrogate    (Characters in the range U+D800 to U+DFFF, which are invalid in utf-8)

L    Letter
Ll   Lower case letter (a-z, µßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ and more)
Lm   Modifier letter   (Letter-like characters that are usually combined with others, but here they stand alone:
                        ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁˆˇˈˉˊˋˌˍˎˏːˑˠˡˢˣˤˬˮʹͺՙ and more)
Lo   Other letter      (ªºƻǀǁǂǃʔ and many more ideographs and letters from unicase alphabets)
Lt   Title case letter (ǅǈǋǲᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾼῌῼ)
Lu   Upper case letter (A-Z, ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ and more)
L&   Ordinary letter   (Any character that has the Lu, Ll, or Lt property)

M    Mark
Mc   Spacing mark      (None in latin scripts)
Me   Enclosing mark    (Combining enclosing square (U+20DE) like in a⃞ , combining enclosing circle backslash (U+20E0) like in a⃠)
Mn   Non-spacing mark  (Combining diacritical marks U+0300-U+036f, like the accents on this letter a: áâãāa̅ăȧäảåa̋ǎa̍a̎ȁa̐ȃ)

N    Number      
Nd   Decimal number (0123456789, ٠١٢٣٤٥٦٧٨٩ and digits in many other scripts.)
Nl   Letter number  (ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ and some more)
No   Other number   (⁰¹²³⁴⁵⁶⁷⁸⁹ ₀₁₂₃₄₅₆₇₈₉ ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒ ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳, etc.)

P    Punctuation      
Pc   Connector punctuation (_ underscore (U+005F), ‿ undertie U+203F, ⁀ character tie (U+2040), etc.)
Pd   Dash punctuation      (- hyphen-minus (U+002D), ‐ hyphen (U+2010), ‑ non-breaking hyphen (U+2011), ‒ figure dash (U+2012),
                            – en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), etc.)
Pe   Close punctuation     (right parenthesis, bracket, or brace: `)` (U+0029), `]` (U+005D), `}` (U+007D), etc.) 
Pf   Final punctuation     (right quotation marks: » (U+00BB), ’ (U+2019), ” (U+201D), etc.)
Pi   Initial punctuation   (left  quotation marks: « (U+00AB), ‘ (U+2018), “ (U+201C), etc.)
Po   Other punctuation     (!"#%&'*,./:;?@\¡§¶·¿)
Ps   Open punctuation      (left parenthesis, bracket, or brace: `(` (U+0028), `[` (U+005B), `{` (U+007B), etc.) 

S    Symbol      
Sc   Currency symbol     ($¢£¤¥, ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ ₿ (U+20A0-U+20BF), etc.)
Sk   Modifier symbol     (Symbol-like characters that are usually combined with others, but here they stand alone:
                          ^`¨¯´¸ and more)
Sm   Mathematical symbol (+<=>|~¬±×÷϶ and many more)
So   Other symbol        (¦ broken bar (U+00A6), © copyright sign (U+00A9), ® registered sign (U+00AE), ° degree sign (U+00B0);
                          arrows, signs, emojis and many many more)

Z    Separator      
Zl   Line separator      (line separator (U+2028))
Zp   Paragraph separator (paragraph separator (U+2029))
Zs   Space separator     (space, no-break space, en quad, em quad, en space, em space, figure space, thin space, hair space, etc.)
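The list above can be explored interactively with a small classifier. This is only a sketch: `generalCategory()` is a hypothetical helper written for this note, not a PHP function, and it simply tries each two-letter property in turn.

```php
<?php
// Hypothetical helper: returns the two-letter general category of a single
// UTF-8 character, or null if none of the listed properties matches
// (for example when given an empty string).
function generalCategory(string $char): ?string
{
    $categories = [
        'Cc', 'Cf', 'Cn', 'Co', 'Cs',
        'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
        'Mc', 'Me', 'Mn',
        'Nd', 'Nl', 'No',
        'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
        'Sc', 'Sk', 'Sm', 'So',
        'Zl', 'Zp', 'Zs',
    ];
    foreach ($categories as $category) {
        // \A and \z anchor the match to the whole one-character string.
        if (preg_match('/\A\p{' . $category . '}\z/u', $char) === 1) {
            return $category;
        }
    }
    return null;
}

foreach (['A', 'ß', '5', '_', '€', ' '] as $char) {
    printf("'%s' => %s\n", $char, generalCategory($char));
}
```

Trying thirty patterns per character is slow, so this is meant for experimenting with the categories, not for processing large inputs.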

php at lnx-bsp dot net ¶

7 years ago


Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:

<?php preg_match( '/[\p{N}\p{L}]+/u', $data ) ?>

Will match any combination of letters and numbers.
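A slightly fuller sketch of the same idea, with the "u" modifier added so that multi-byte UTF-8 input is handled as characters. The sample string is an assumption, and the second call shows that properties also work inside a negated class:

```php
<?php
$data = 'Straße 42, αβγ!';

// Runs of letters and numbers from any script.
preg_match_all('/[\p{N}\p{L}]+/u', $data, $matches);

// The negated class splits on everything that is neither a letter nor a number.
$tokens = preg_split('/[^\p{L}\p{N}]+/u', $data, -1, PREG_SPLIT_NO_EMPTY);

print_r($matches[0]);
print_r($tokens);
```

Both approaches should produce the same three tokens for this input.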

Yzmir Ramirez ¶

11 years ago


If you are working with older environments you will need to first check to see if the version of PCRE will work with unicode directives described above:

<?php

// Need to check PCRE version because some environments are
// running older versions of the PCRE library
// (run in *nix environment `pcretest -C`)

$allowInternational = false;
if (defined('PCRE_VERSION')) {
    if (intval(PCRE_VERSION) >= 7) { // constant available since PHP 5.2.4
        $allowInternational = true;
    }
}
?>

Now you can do a fallback regex (e.g. use "/[a-z]/i"), when the PCRE library version is too old or not available.
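One way the flag might be wired into a fallback is sketched below. `countLetters()` is a hypothetical helper introduced for this example, and the version check simply mirrors the one above:

```php
<?php
$allowInternational = defined('PCRE_VERSION') && intval(PCRE_VERSION) >= 7;

// Hypothetical helper: counts letters with whichever pattern is usable.
function countLetters(string $text, bool $allowInternational): int
{
    // Unicode-aware pattern when supported, ASCII-only fallback otherwise.
    $pattern = $allowInternational ? '/\p{L}/u' : '/[a-z]/i';
    $count = preg_match_all($pattern, $text);
    return $count === false ? 0 : $count;
}

echo countLetters('Straße', $allowInternational), PHP_EOL;
```

With the fallback pattern the ß is not counted, which is the loss of accuracy the fallback trades for compatibility.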

phpnet at N_O_S_P_A_M dot osps dot net ¶

2 years ago


I found the predefined "supported" scripts helpful, except that there's no apparent definition of what Unicode character ranges are covered by those definitions. So I wrote this to determine them and print out the equivalent PCRE character class definitions. An example fragment of output is (I can't include all output due to PHP.net Note-posting limits)

Canadian_Aboriginal=[\x{1400}-\x{167f}\x{18b0}-\x{18f5}]

The program:

<?php

$scriptNames = array(
    'Arabic',
    'Armenian',
    'Avestan',
    'Balinese',
    'Bamum',
    'Batak',
    'Bengali',
    'Bopomofo',
    'Brahmi',
    'Braille',
    'Buginese',
    'Buhid',
    'Canadian_Aboriginal',
    'Carian',
    'Chakma',
    'Cham',
    'Cherokee',
    'Common',
    'Coptic',
    'Cuneiform',
    'Cypriot',
    'Cyrillic',
    'Deseret',
    'Devanagari',
    'Egyptian_Hieroglyphs',
    'Ethiopic',
    'Georgian',
    'Glagolitic',
    'Gothic',
    'Greek',
    'Gujarati',
    'Gurmukhi',
    'Han',
    'Hangul',
    'Hanunoo',
    'Hebrew',
    'Hiragana',
    'Imperial_Aramaic',
    'Inherited',
    'Inscriptional_Pahlavi',
    'Inscriptional_Parthian',
    'Javanese',
    'Kaithi',
    'Kannada',
    'Katakana',
    'Kayah_Li',
    'Kharoshthi',
    'Khmer',
    'Lao',
    'Latin',
    'Lepcha',
    'Limbu',
    'Linear_B',
    'Lisu',
    'Lycian',
    'Lydian',
    'Malayalam',
    'Mandaic',
    'Meetei_Mayek',
    'Meroitic_Cursive',
    'Meroitic_Hieroglyphs',
    'Miao',
    'Mongolian',
    'Myanmar',
    'New_Tai_Lue',
    'Nko',
    'Ogham',
    'Old_Italic',
    'Old_Persian',
    'Old_South_Arabian',
    'Old_Turkic',
    'Ol_Chiki',
    'Oriya',
    'Osmanya',
    'Phags_Pa',
    'Phoenician',
    'Rejang',
    'Runic',
    'Samaritan',
    'Saurashtra',
    'Sharada',
    'Shavian',
    'Sinhala',
    'Sora_Sompeng',
    'Sundanese',
    'Syloti_Nagri',
    'Syriac',
    'Tagalog',
    'Tagbanwa',
    'Tai_Le',
    'Tai_Tham',
    'Tai_Viet',
    'Takri',
    'Tamil',
    'Telugu',
    'Thaana',
    'Thai',
    'Tibetan',
    'Tifinagh',
    'Ugaritic',
    'Vai',
    'Yi'
);
$scriptTypes = array();
foreach( $scriptNames as $n ) $scriptTypes[ $n ] = array();
// 0x10ffff is the highest Unicode code point
for( $i=0; $i <= 0x10ffff; $i++ ) {
//echo $i.PHP_EOL;
    foreach( $scriptNames as $scriptName ) {

        if ( preg_match( '/[\p{'. $scriptName .'}]/u', mb_chr( $i, 'UTF-8') ) ) {

            if (empty( $scriptTypes[ $scriptName ])
                || ( ($i - $scriptTypes[ $scriptName ][ count( $scriptTypes[ $scriptName ] ) - 1 ][1]) > 1)
            ) {

                $scriptTypes[ $scriptName ][] = [$i, $i];

            } else {

                $scriptTypes[ $scriptName ][ count( $scriptTypes[ $scriptName ] ) - 1 ][1] = $i;
            }
        }
    }
}
foreach( $scriptTypes as $scriptName => $unicodeRanges ) {

    printf(
        '%s=[',
        $scriptName
    );
    foreach( $unicodeRanges as $r ) {

        printf(
            '\x{%04x}',
            $r[0]
        );
        if ($r[1] > $r[0] )
            printf(
                '-\x{%04x}',
                $r[1]
            );
    }
    printf(
        ']'.PHP_EOL
    );
}
