mb_check_encoding

(PHP 4 >= 4.4.3, PHP 5 >= 5.1.3, PHP 7, PHP 8)

mb_check_encoding - Check if strings are valid for the specified encoding

Description

mb_check_encoding(array|string|null $value = null, ?string $encoding = null): bool

Checks whether the given byte stream is valid for the given encoding. If value is an array, all of its keys and values are validated recursively. This is useful for preventing so-called "Invalid Encoding Attacks".
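A minimal sketch of the basic call, assuming the input should be UTF-8. The variable names and sample bytes are made up for illustration; only mb_check_encoding() itself comes from the page.

```php
<?php
// Hypothetical input values, for example taken from a request.
$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1: 0xE9 starts a UTF-8 sequence that never completes

$validIsUtf8     = mb_check_encoding($valid, 'UTF-8');        // true
$invalidIsUtf8   = mb_check_encoding($invalid, 'UTF-8');      // false
$invalidIsLatin1 = mb_check_encoding($invalid, 'ISO-8859-1'); // true: the same bytes are fine as ISO-8859-1

var_dump($validIsUtf8, $invalidIsUtf8, $invalidIsLatin1);
```

Rejecting the request when the check fails, rather than trying to repair the bytes, is usually the simpler policy.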

Parameters

value: The byte stream or array to check. If it is omitted, this function checks all the input from the beginning of the request.

Warning
As of PHP 8.1.0, omitting this parameter or passing null is deprecated.

encoding: The expected character encoding. If it is omitted or null, the internal character encoding is used.
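The effect of leaving out the encoding can be sketched as follows. This is an illustration only: it temporarily sets the internal encoding to UTF-8 with mb_internal_encoding() so that the result does not depend on the local configuration, and the variable names are hypothetical.

```php
<?php
// Remember the current internal encoding so it can be restored afterwards.
$previous = mb_internal_encoding();
mb_internal_encoding('UTF-8');

$bytes = "\xE9"; // "é" in ISO-8859-1, but an incomplete sequence in UTF-8

$implicit = mb_check_encoding($bytes);               // no encoding given: checked against the internal encoding (UTF-8 here)
$explicit = mb_check_encoding($bytes, 'ISO-8859-1'); // encoding named explicitly

mb_internal_encoding($previous);

var_dump($implicit, $explicit);
```

Naming the encoding explicitly keeps the check independent of whatever the internal encoding happens to be.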

Return Values

Returns true on success or false on failure.

Changelog

Version	Description
8.1.0	Calling this function with `null` as `value`, or without `value`, is deprecated.
8.0.0	`value` and `encoding` are now nullable.
7.2.0	This function now also accepts an array as `value`. Formerly, only strings were supported.
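The array form (PHP 7.2.0 and later) might be used like this. The array contents are invented for the example; the point is that nested values and keys are both checked.

```php
<?php
// Hypothetical nested input; one nested value contains a byte that is never valid in UTF-8.
$input = [
    'name' => "Z\xC3\xBCrich",     // "Zürich" in UTF-8
    'tags' => ['ok', "bad\xFF"],   // 0xFF cannot occur in UTF-8
];
$inputIsValid = mb_check_encoding($input, 'UTF-8');    // false because of the nested value

// Keys are validated too, not only values.
$badKey        = ["k\xFF" => 'v'];
$badKeyIsValid = mb_check_encoding($badKey, 'UTF-8');  // false because of the key

$clean        = ['name' => "Z\xC3\xBCrich", 'tags' => ['ok', 'fine']];
$cleanIsValid = mb_check_encoding($clean, 'UTF-8');    // true

var_dump($inputIsValid, $badKeyIsValid, $cleanIsValid);
```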

User Contributed Notes 8 notes

Riikka K, 10 years ago

To get more information about how this function validates UTF-8, I ran some tests on PHP 5.5.10, PHP 5.4.24 and PHP 5.3.28. It seems that the function detects valid and invalid byte sequences correctly according to UTF-8 and the Unicode specifications, except for one issue:

In PHP 5.3.28, the function allows code points above U+10FFFF, and therefore also accepts five- and six-byte sequences. Later versions have corrected this issue.

Other than that, each version works correctly. Overlong encodings, surrogates, any lone bytes at or above 0x80 and byte sequences that are too short are all considered invalid. All valid code points in Unicode are considered valid when encoded with the correct number of bytes (including the astral planes, i.e. four-byte sequences up to U+10FFFF).

mb_detect_encoding() gave similar results with the strict parameter enabled (except on PHP 5.3.28, where it performed worse than mb_check_encoding()).
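The categories listed in this note can be tried out with a small table of sample byte strings. This is only a sketch: the labels and samples are chosen here for illustration, and the results in the comments assume a PHP version with the corrected behaviour (5.4 or later according to the tests described above).

```php
<?php
// Sample byte strings, one per category mentioned in the note.
$samples = [
    'ascii'             => "A",
    'two_byte'          => "\xC3\xA9",         // U+00E9
    'astral'            => "\xF0\x9F\x98\x80", // U+1F600, four bytes
    'max_code_point'    => "\xF4\x8F\xBF\xBF", // U+10FFFF
    'overlong'          => "\xC0\xAF",         // overlong form of "/"
    'surrogate'         => "\xED\xA0\x80",     // U+D800
    'lone_continuation' => "\x80",
    'truncated'         => "\xE3\x81",         // three-byte sequence cut short
    'above_max'         => "\xF4\x90\x80\x80", // would be U+110000
];

$results = [];
foreach ($samples as $label => $bytes) {
    $results[$label] = mb_check_encoding($bytes, 'UTF-8');
}

// The first four entries are expected to be true, the rest false.
var_dump($results);
```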

dorbah NOSPAM at rambler [dot] ru, 6 years ago

Building on the useful function added by jbricci at ya-right dot com:

<?php
function checkEncoding ( $string, $string_encoding )
{
    $fs = $string_encoding == 'UTF-8' ? 'UTF-32' : $string_encoding;
    $ts = $string_encoding == 'UTF-32' ? 'UTF-8' : $string_encoding;

    return $string === mb_convert_encoding ( mb_convert_encoding ( $string, $fs, $ts ), $ts, $fs );
}
?>

I've made a function that guesses the code page:
<?php
function detectEncoding($string)
{
    $arr_encodings = [
        'CP1251',
        'UCS-2LE',
        'UCS-2BE',
        'UTF-8',
        'UTF-16',
        'UTF-16BE',
        'UTF-16LE',
        'CP866',
    ];

    foreach($arr_encodings as $encoding){
        if (checkEncoding($string, $encoding)){
            return $encoding;
        }
    }
    
    return false;
}
?>
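A similar guess can be sketched with mb_check_encoding() alone, without the round-trip helper. The function name firstValidEncoding and the candidate list are assumptions made for this example, not part of mbstring, and the result is still only a guess: single-byte code pages accept almost any byte string.

```php
<?php
// Hypothetical helper: return the first candidate encoding for which
// mb_check_encoding() accepts the bytes, or false if none does.
function firstValidEncoding($bytes, array $candidates)
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return false;
}

// The stricter multi-byte encoding goes first; a single-byte code page
// listed earlier would match nearly everything.
$candidates = ['UTF-8', 'Windows-1251'];

// "Привет" encoded as UTF-8 and as Windows-1251 respectively.
$utf8Guess   = firstValidEncoding("\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82", $candidates);
$cp1251Guess = firstValidEncoding("\xCF\xF0\xE8\xE2\xE5\xF2", $candidates);

var_dump($utf8Guess, $cp1251Guess);
```

The order of the candidate list decides ties, so it should reflect which encodings are actually expected.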

jbricci at ya-right dot com, 15 years ago

This function does not check for bad byte sequence(s); it only checks if the byte stream is valid. If you want to verify that an encoded string is valid (i.e. does not contain any bad byte sequences), do the following:

<?php

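/* check a string's encoded value by round-tripping it through another encoding */

function checkEncoding ( $string, $string_encoding )
{
    /* UTF-8 goes via UTF-32 and UTF-32 via UTF-8; any other encoding is converted to itself */
    if ( $string_encoding == 'UTF-8' ) $via = 'UTF-32';
    elseif ( $string_encoding == 'UTF-32' ) $via = 'UTF-8';
    else $via = $string_encoding;

    /* mb_convert_encoding ( string, to, from ): a bad byte sequence does not survive the round trip */
    return $string === mb_convert_encoding ( mb_convert_encoding ( $string, $via, $string_encoding ), $string_encoding, $via );
}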

/* test 1 variables */

$string = "\x00\x81";

$encoding = "Shift_JIS";

/* test 1 mb_check_encoding (test for bad byte stream) */

if ( true === mb_check_encoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte stream!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte stream!<br />';
}

/* test 1 checkEncoding (test for bad byte sequence(s)) */

if ( true === checkEncoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte sequence!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte sequence!<br />';
}

/* test 2 */

/* test 2 variables */

$string = "\x00\xE3";

$encoding = "UTF-8";

/* test 2 mb_check_encoding (test for bad byte stream) */

if ( true === mb_check_encoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte stream!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte stream!<br />';
}

/* test 2 checkEncoding (test for bad byte sequence(s)) */

if ( true === checkEncoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte sequence!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte sequence!<br />';
}

?>

javalc6 at gmail dot com, 15 years ago

To check whether a string is correctly encoded in UTF-8, I suggest the following function, which implements RFC 3629 better than mb_check_encoding():

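<?php
function check_utf8($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c > 127) {                                 // non-ASCII byte: must start a multi-byte sequence
            if ($c > 247) return false;                 // 0xF8..0xFF never start a sequence
            elseif ($c > 239) $bytes = 4;               // 0xF0..0xF7
            elseif ($c > 223) $bytes = 3;               // 0xE0..0xEF
            elseif ($c > 191) $bytes = 2;               // 0xC0..0xDF
            else return false;                          // 0x80..0xBF: stray continuation byte
            if (($i + $bytes) > $len) return false;     // sequence runs past the end of the string
            while ($bytes > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false; // continuation bytes must be 0x80..0xBF
                $bytes--;
            }
        }
    }
    return true;
} // end of check_utf8
?>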

eyecatchup at gmail dot com, 11 years ago

Contrary to what other comments suggest, there's no need to serialize a string in order to use preg_match's "u" modifier to test whether it is valid UTF-8. You can just use:
<?php
function is_utf8($str) {
    return (bool) preg_match('//u', $str);
}
?>

Stefan W, 11 years ago

Note that the algorithm in javalc6's comment checks only the structure of each UTF-8 sequence: a lead byte followed by the right number of continuation bytes.

This means that overlong byte sequences will pass. For example, 0xC0 0xAF can be decoded as U+002F, the slash character, which is properly encoded as 0x2F. RFC 3629 does not allow overlong sequences, and they should be rejected; they have been, and still are, used in various attacks (like directory traversal attacks).

It also means that high Unicode characters outside the Basic Multilingual Plane will pass; this means characters above U+FFFF, encoded as four bytes (hieroglyphs, cuneiform, etc.). You need to decide if you want those characters or not. If you do, be aware that they often cause compatibility problems (for example with JSON and some databases).

mb_check_encoding(), mb_detect_encoding(x, y, TRUE), and the other comments up to now all reject overlong sequences.
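The overlong slash example can be worked through by hand. The sketch below decodes 0xC0 0xAF the way a purely structural two-byte decoder would, then asks mb_check_encoding() and preg_match() with the "u" modifier about the same bytes. The variable names are illustrative only.

```php
<?php
$overlong = "\xC0\xAF";

// Structural two-byte decoding: five payload bits from the lead byte,
// six payload bits from the continuation byte.
$lead      = ord($overlong[0]);                       // 0xC0 = 1100 0000
$cont      = ord($overlong[1]);                       // 0xAF = 1010 1111
$codePoint = (($lead & 0x1F) << 6) | ($cont & 0x3F);  // (0 << 6) | 0x2F = 0x2F
$decoded   = chr($codePoint);                         // "/"

// Both built-in checks refuse the overlong form.
$mbAccepts   = mb_check_encoding($overlong, 'UTF-8');
$pcreAccepts = preg_match('//u', $overlong) === 1;

var_dump($codePoint, $decoded, $mbAccepts, $pcreAccepts);
```

A decoder that skips the overlong check would turn these two bytes into a slash after any filtering on the raw bytes has already happened, which is what makes the form useful in path traversal attacks.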

CertaiN, 11 years ago

The best way to validate a UTF-8 sequence.
This works not only for scalars, but also recursively for arrays and objects.

<?php

function is_valid_utf8($text) {
    return (bool)preg_match('//u', serialize($text));
}

?>

CertaiN, 11 years ago

To support non-scalar variables:

<?php
function validate_utf8($input) {
    return (bool)preg_match('//u', serialize($input));
}
?>
