html_entity_decode

(PHP 4 >= 4.3.0, PHP 5, PHP 7, PHP 8)

html_entity_decode - Convert HTML entities to their corresponding characters

Description

html_entity_decode(string $string, int $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, ?string $encoding = null): string

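html_entity_decode() is the opposite of htmlentities(): it converts HTML entities in string to their corresponding characters.

More precisely, this function decodes every entity (including all numeric entities) that meets both of the following conditions, and leaves all other entities as they are. 1) The entity is necessarily valid for the chosen document type. For XML, this means that named entities which might be defined in some DTD are not decoded. 2) Its character or characters are in the coded character set associated with the chosen encoding and are permitted in the chosen document type.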
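
The two conditions can be seen in a minimal sketch like the one below. It assumes PHP 8, and the sample strings are chosen only for illustration.

```php
<?php
// Condition 1: under ENT_XML1, HTML-only named entities such as &eacute; are
// left alone, while numeric entities and the predefined XML entities are decoded.
$xml = html_entity_decode('&eacute; &#233; &amp;', ENT_QUOTES | ENT_XML1, 'UTF-8');

// Condition 2: the euro sign does not exist in ISO-8859-1, so &euro; is left
// untouched for that encoding but decoded when the encoding is UTF-8.
$latin1 = html_entity_decode('&euro;', ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');
$utf8   = html_entity_decode('&euro;', ENT_QUOTES | ENT_HTML401, 'UTF-8');

var_dump($xml);           // "&eacute; é &" (the é is the two UTF-8 bytes c3 a9)
var_dump($latin1);        // "&euro;"
var_dump(bin2hex($utf8)); // "e282ac", the UTF-8 bytes of U+20AC
```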

Parameters

string

The input string.

flags

A bitmask of one or more of the following flags, which specify how to handle quotes and which document type to use. The default is ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.

**Available `flags` constants**
Constant Name	Description
`ENT_COMPAT`	Converts double quotes and leaves single quotes alone.
`ENT_QUOTES`	Converts both double and single quotes.
`ENT_NOQUOTES`	Leaves both double and single quotes unconverted.
`ENT_SUBSTITUTE`	Replaces invalid code unit sequences with a Unicode Replacement Character instead of returning an empty string: U+FFFD for UTF-8, &#xFFFD; otherwise.
`ENT_HTML401`	Handles code as HTML 4.01.
`ENT_XML1`	Handles code as XML 1.
`ENT_XHTML`	Handles code as XHTML.
`ENT_HTML5`	Handles code as HTML 5.
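
As a rough illustration of the three quote-handling flags in the table above, the sketch below decodes the same input with each of them. The input string and variable names are made up for the example.

```php
<?php
$input = '&quot;double&quot; and &#039;single&#039;';

// Each call combines one quote flag with the HTML 4.01 document type.
$compat   = html_entity_decode($input, ENT_COMPAT | ENT_HTML401, 'UTF-8');
$quotes   = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$noquotes = html_entity_decode($input, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

echo $compat, "\n";   // "double" and &#039;single&#039;
echo $quotes, "\n";   // "double" and 'single'
echo $noquotes, "\n"; // &quot;double&quot; and &#039;single&#039;
```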

encoding

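An optional argument defining the encoding used when converting characters.

If omitted, encoding defaults to the value of the default_charset configuration option.

Although this argument is technically optional, you are strongly encouraged to specify the appropriate value, because default_charset may be set to a character set that differs from that of the input.

The following character sets are supported.

**Supported character sets**
Charset	Aliases	Description
ISO-8859-1	ISO8859-1	Western European, Latin-1.
ISO-8859-5	ISO8859-5	Little-used Cyrillic charset (Latin/Cyrillic).
ISO-8859-15	ISO8859-15	Western European, Latin-9. Adds the Euro sign and the French and Finnish letters missing from Latin-1 (ISO-8859-1).
UTF-8		ASCII-compatible multi-byte 8-bit Unicode.
cp866	ibm866, 866	DOS-specific Cyrillic charset.
cp1251	Windows-1251, win-1251, 1251	Windows-specific Cyrillic charset.
cp1252	Windows-1252, 1252	Windows-specific charset for Western European.
KOI8-R	koi8-ru, koi8r	Russian.
BIG5	950	Traditional Chinese, mainly used in Taiwan.
GB2312	936	Simplified Chinese, national standard character set.
BIG5-HKSCS		Big5 with Hong Kong extensions, Traditional Chinese.
Shift_JIS	SJIS, SJIS-win, cp932, 932	Japanese.
EUC-JP	EUCJP, eucJP-win	Japanese.
MacRoman		Charset used by Mac OS.
`''`		An empty string activates detection from the script encoding (Zend multibyte), default_charset and the current locale (see nl_langinfo() and setlocale()), in this order. Not recommended.

Note: Any other character sets are not recognized. The default encoding is used instead and a warning is emitted.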

Return Values

Returns the decoded string.

Changelog

Version	Description
8.1.0	The default value of `flags` changed from `ENT_COMPAT` to `ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401`.
8.0.0	`encoding` is now nullable.

Examples

Example #1 Decoding HTML entities

<?php
$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);

$b = html_entity_decode($a);

echo $a; // I&#039;ll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now (PHP 8.1+; earlier versions leave the apostrophe as is)

echo $b; // I'll "walk" the <b>dog</b> now
?>

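Notes

Note:
You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string. That's because the '&nbsp;' entity does not decode to ASCII code 32 (which trim() strips) but to the no-break space: code 160 (0xa0) in ISO-8859-1, or U+00A0 (the bytes 0xC2 0xA0) in UTF-8.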
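
The behavior described in this note can be checked with a small sketch such as the following, which assumes UTF-8 as the target encoding. Replacing the no-break space before trimming is only one possible workaround, not something the manual prescribes.

```php
<?php
$decoded = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo bin2hex($decoded), "\n";       // c2a0: U+00A0 encoded as UTF-8
var_dump(trim($decoded) === '');    // bool(false): trim() leaves it alone

// Possible workaround: turn no-break spaces into plain spaces first.
$normalized = str_replace("\u{00A0}", ' ', $decoded);
var_dump(trim($normalized) === ''); // bool(true)
```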

See Also

htmlentities() - Convert all applicable characters to HTML entities
htmlspecialchars() - Convert special characters to HTML entities
get_html_translation_table() - Returns the translation table used by htmlspecialchars and htmlentities
urldecode() - Decodes URL-encoded string

User Contributed Notes

Martin

13 years ago


If you need something that converts &#[0-9]+; entities to UTF-8, this is simple and works:

<?php
/* Entity-encoded input. */
$input = "Fovi&#269;";

// Note: the "HTML-ENTITIES" mbstring encoding is deprecated as of PHP 8.2.
$output = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $input);

/* Plain UTF-8. */
echo $output;
?>

txnull

9 years ago


Use the following to decode all entities:
<?php html_entity_decode($string, ENT_QUOTES | ENT_XML1, 'UTF-8') ?>

I've checked these special entities: 
- double quotes (&#34;)
- single quotes (&#39; and &apos;) 
- non printable chars (e.g. &#13;)
With other $flags some or all won't be decoded.

It seems that ENT_XML1 and ENT_XHTML are identical when decoding.
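
The named entity &apos; is one case where the document-type flag changes the result, since it is not part of HTML 4.01. A minimal sketch, assuming PHP 8 and UTF-8:

```php
<?php
$input = '&apos;';

// &apos; is not an HTML 4.01 named entity, so it is left as is there.
$html401 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');
// XML 1 and HTML 5 both define &apos;, so it is decoded.
$xml1    = html_entity_decode($input, ENT_QUOTES | ENT_XML1, 'UTF-8');
$html5   = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($html401); // "&apos;"
var_dump($xml1);    // "'"
var_dump($html5);   // "'"
```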

aidan at php dot net

20 years ago


This functionality is now implemented in the PEAR package PHP_Compat.

More information about using this function without upgrading your version of PHP can be found at the link below:

http://pear.php.net/package/PHP_Compat

Benjamin

11 years ago


The following function decodes named and numeric HTML entities and works on UTF-8. Requires iconv.

function decodeHtmlEnt($str) {
    $ret = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
    $p2 = -1;
    for(;;) {
        $p = strpos($ret, '&#', $p2+1);
        if ($p === FALSE)
            break;
        $p2 = strpos($ret, ';', $p);
        if ($p2 === FALSE)
            break;
            
        if (substr($ret, $p+2, 1) == 'x')
            $char = hexdec(substr($ret, $p+3, $p2-$p-3));
        else
            $char = intval(substr($ret, $p+2, $p2-$p-2));
            
        //echo "$char\n";
        $newchar = iconv(
            'UCS-4', 'UTF-8',
            chr(($char>>24)&0xFF).chr(($char>>16)&0xFF).chr(($char>>8)&0xFF).chr($char&0xFF) 
        );
        //echo "$newchar<$p<$p2<<\n";
        $ret = substr_replace($ret, $newchar, $p, 1+$p2-$p);
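        // The next search starts at $p2+1, so point $p2 at the last byte of the
        // new character; otherwise an entity directly following it is skipped.
        $p2 = $p + strlen($newchar) - 1;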
    }
    return $ret;
}

Daniel A.

6 years ago


I wanted to use this function today and I found the documentation, especially about the flags, not particularly helpful.

Running the code below, for example, failed because the flag I used was the wrong one...

$string = 'Donna&#039;s Bakery';
$title = html_entity_decode($string, ENT_HTML401, 'UTF-8');
echo $title;

The correct flag to use in this case is ENT_QUOTES.

My understanding is that the flag to use is the one that corresponds to the expected decoded result: ENT_QUOTES for an entity that decodes to a single or double quote, and so on.

Please help make the documentation a bit clearer.
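
One way to read the result above: flags is a bitmask, and ENT_HTML401 has the same integer value (0) as ENT_NOQUOTES, so passing ENT_HTML401 alone also switches off quote decoding. A short sketch under that reading, reusing the string from this note:

```php
<?php
$string = 'Donna&#039;s Bakery';

// Both constants are 0, so ENT_HTML401 on its own carries no quote flag.
var_dump(ENT_HTML401 === 0, ENT_NOQUOTES === 0); // bool(true) bool(true)

$withoutQuoteFlag = html_entity_decode($string, ENT_HTML401, 'UTF-8');
$withQuoteFlag    = html_entity_decode($string, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $withoutQuoteFlag, "\n"; // Donna&#039;s Bakery
echo $withQuoteFlag, "\n";    // Donna's Bakery
```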

Matt Robinson

15 years ago


I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'. 

If you don't want a UTF-8 string, you'll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding(). 

If you're producing XML, which doesn't recognise most HTML entities:

When producing a UTF-8 document (the default), use htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8'). You only need to escape <, > and & unless you're printing inside the XML tags themselves.

Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document's DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).
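
The decode-then-escape round trip for UTF-8 XML output could look roughly like this. The sample string and variable names are only illustrative.

```php
<?php
$html = 'caf&eacute; &amp; cr&egrave;me &lt;br&gt; &quot;fresh&quot;';

// Step 1: decode every HTML entity into real UTF-8 characters.
$text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

// Step 2: re-escape only <, > and &, which XML understands natively.
$xmlSafe = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');

echo $xmlSafe; // café &amp; crème &lt;br&gt; "fresh"
```

The named entities &eacute; and &egrave; end up as literal UTF-8 characters, so the result no longer depends on an HTML DTD.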

php dot net at c dash ovidiu dot tk

19 years ago


Quick & dirty code that translates numeric entities to UTF-8.

<?php

    function replace_num_entity($ord)
    {
        $ord = $ord[1];
        if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
        {
            $ord = hexdec($match[1]);
        }
        else
        {
            $ord = intval($ord);
        }
        
        $no_bytes = 0;
        $byte = array();

        if ($ord < 128)
        {
            return chr($ord);
        }
        elseif ($ord < 2048)
        {
            $no_bytes = 2;
        }
        elseif ($ord < 65536)
        {
            $no_bytes = 3;
        }
        elseif ($ord < 1114112)
        {
            $no_bytes = 4;
        }
        else
        {
            return;
        }

        switch($no_bytes)
        {
            case 2:
            {
                $prefix = array(31, 192);
                break;
            }
            case 3:
            {
                $prefix = array(15, 224);
                break;
            }
            case 4:
            {
                $prefix = array(7, 240);
            }
        }

        for ($i = 0; $i < $no_bytes; $i++)
        {
            $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
        }

        $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

        $ret = '';
        for ($i = 0; $i < $no_bytes; $i++)
        {
            $ret .= chr($byte[$i]);
        }

        return $ret;
    }

    $test = 'This is a &#269;&#x5d0; test&#39;';

    echo $test . "<br />\n";
    echo preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);

?>

Free at Key dot no

14 years ago


Handy function to convert remaining HTML entities into human-readable characters (for entities which do not exist in the target charset):

<?php
function cleanString($in, $offset = 0)
{
    $out = trim($in);
    if (!empty($out))
    {
        $entity_start = strpos($out, '&', $offset);
        if ($entity_start === false)
        {
            // ideal: nothing left to replace
            return $out;
        }
        else
        {
            $entity_end = strpos($out, ';', $entity_start);
            if ($entity_end === false)
            {
                return $out;
            }
            // too long to be an entity
            else if ($entity_end > $entity_start + 7)
            {
                // and carry on
                $out = cleanString($out, $entity_start + 1);
            }
            // gotcha!
            else
            {
                $clean = substr($out, 0, $entity_start);
                $subst = substr($out, $entity_start + 1, 1);
                // &scaron; => "s" / &#353; => "_"
                $clean .= ($subst != "#") ? $subst : "_";
                $clean .= substr($out, $entity_end + 1);
                // and carry on
                $out = cleanString($clean, $entity_start + 1);
            }
        }
    }
    return $out;
}
?>

neurotic dot neu at gmail dot com

14 years ago


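This is a safe rawurldecode with UTF-8 detection:

<?php
// Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
function utf8_rawurldecode($raw_url_encoded){
    $enc = rawurldecode($raw_url_encoded);
    // If a round trip through ISO-8859-1 leaves the string unchanged, treat it as UTF-8 already.
    if(utf8_encode(utf8_decode($enc))==$enc){
        return rawurldecode($raw_url_encoded);
    }else{
        return utf8_encode(rawurldecode($raw_url_encoded));
    }
}
?>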

Anonymous

3 years ago


Why doesn't the html_entity_decode() function convert entities without the last semicolon (like &#x41 or &#65) to characters? 

---
<?php
echo 'like &#x41 or &#65';
---

Browser displays fine:
----
like A or A
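
A small sketch of the behavior described in this note, plus one possible workaround. The helper add_missing_semicolons() is hypothetical and not part of PHP; it simply appends the missing semicolon to numeric references before decoding.

```php
<?php
$input = 'like &#x41 or &#65';

// Without the trailing semicolon the references are left untouched.
$unchanged = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Hypothetical helper: append ";" to numeric references that lack one.
// Possessive quantifiers (++) keep the regex from backtracking into the digits.
function add_missing_semicolons(string $s): string
{
    return preg_replace('/&#(x[0-9a-f]++|[0-9]++)(?!;)/i', '&#$1;', $s) ?? $s;
}

$decoded = html_entity_decode(add_missing_semicolons($input), ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $unchanged, "\n"; // like &#x41 or &#65
echo $decoded, "\n";   // like A or A
```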

me at richardsnazell dot com

17 years ago


I had a problem getting the 'TM' trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn't work, but directly replacing the entity with its single-byte equivalent (0x99, the trademark sign in Windows-1252 rather than an ASCII character) did:

$subject = str_replace('&trade;', chr(153), $subject);

Victor

13 years ago


We were having very peculiar behavior regarding foreign characters such as e-acute.

However, it was only showing up as a problem when extracting those characters out of our mysql database and when being displayed through a proxy server of ours that handles dns issues.

As other users have made a note of, the default character setting wasn't what they were expecting it to be when they left theirs blank.

When we changed our default_charset to "UTF-8", our problems went away and we no longer needed functions like these to handle foreign characters such as e-acute. Good enough for us!

jojo

18 years ago


This decodes strings that were encoded with JavaScript's escape() function. It is useful when the page uses multibyte characters.

javascript escape('aaああaa') ..... 'aa%u3042%u3042aa'
php  jsEscape_decode('aa%u3042%u3042aa')..'aaああaa'

<?php
function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){
    // Split on "%u"; every piece after the first starts with 4 hex digits.
    $arrMojis = explode("%u",$jsEscaped);
    for ($i = 1;$i < count($arrMojis);$i++){
        $c = substr($arrMojis[$i],0,4);
        // Convert the UTF-16 code unit to the requested output encoding.
        $cc = mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16');
        $arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4);
    }
    return implode('',$arrMojis);
}
?>

florianborn (at) yahoo (dot) de

19 years ago


Note that

<?php

 echo urlencode(html_entity_decode("&nbsp;"));

?>

will output "%A0" (or "%C2%A0" when the entity is decoded to UTF-8) instead of "+".

marion at figmentthinking dot com

15 years ago


I just ran into the:

Bug #27626 html_entity_decode bug - cannot yet handle MBCS in html_entity_decode()!



The simple solution if you're still running PHP 4 is to wrap the html_entity_decode() function with the utf8_decode() function.



<?php

$string = '&nbsp;';

$utf8_encode = utf8_encode(html_entity_decode($string));

?>



By default html_entity_decode() returns the ISO-8859-1 character set, and by default utf8_decode()...



http://us.php.net/manual/en/function.utf8-decode.php

"Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"

slickriptide at gmail dot com

8 years ago


When using this function, it's a good idea to pay attention when it says that leaving the charset parameter empty is "not recommended". 

I had an issue where I was storing text files, with entities converted, into a database. When I retrieved them later and ran

$text_file = html_entity_decode($text_data);

the entities were NOT decoded.

Once I was aware of the problem, I changed the decode call to fully specify all of the parameters:

$text_file = html_entity_decode($text_data, ENT_COMPAT | ENT_HTML5,'utf-8');

This converted the entities as expected.

daniel at brightbyte dot de

20 years ago


This function seems to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character codings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting to UTF-8 again. But that's quite ugly and destroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

<?php
function decode_entities($text) {
    $text= html_entity_decode($text,ENT_QUOTES,"ISO-8859-1"); #NOTE: UTF-8 does not work!
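    // The /e modifier was removed in PHP 7, so preg_replace_callback() is used instead.
    $text= preg_replace_callback('/&#(\d+);/', function ($m) { return chr((int) $m[1]); }, $text); #decimal notation
    $text= preg_replace_callback('/&#x([a-f0-9]+);/i', function ($m) { return chr(hexdec($m[1])); }, $text);  #hex notation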
    return $text;
}
?>

HTH

grvg (at) free (dot) fr

18 years ago


Here are the ultimate functions to convert HTML entities to UTF-8:

The main function is htmlentities2utf8.

The others are helper functions.



<?php

function chr_utf8($code)

    {

        if ($code < 0) return false;

        elseif ($code < 128) return chr($code);

        elseif ($code < 160) // Map Windows-1252 code points in the 128-159 range to Unicode

        {

            if ($code==128) $code=8364;

            elseif ($code==129) $code=160; // not affected

            elseif ($code==130) $code=8218;

            elseif ($code==131) $code=402;

            elseif ($code==132) $code=8222;

            elseif ($code==133) $code=8230;

            elseif ($code==134) $code=8224;

            elseif ($code==135) $code=8225;

            elseif ($code==136) $code=710;

            elseif ($code==137) $code=8240;

            elseif ($code==138) $code=352;

            elseif ($code==139) $code=8249;

            elseif ($code==140) $code=338;

            elseif ($code==141) $code=160; // not affected

            elseif ($code==142) $code=381;

            elseif ($code==143) $code=160; // not affected

            elseif ($code==144) $code=160; // not affected

            elseif ($code==145) $code=8216;

            elseif ($code==146) $code=8217;

            elseif ($code==147) $code=8220;

            elseif ($code==148) $code=8221;

            elseif ($code==149) $code=8226;

            elseif ($code==150) $code=8211;

            elseif ($code==151) $code=8212;

            elseif ($code==152) $code=732;

            elseif ($code==153) $code=8482;

            elseif ($code==154) $code=353;

            elseif ($code==155) $code=8250;

            elseif ($code==156) $code=339;

            elseif ($code==157) $code=160; // not affected

            elseif ($code==158) $code=382;

            elseif ($code==159) $code=376;

        }

        if ($code < 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code & 63));

        elseif ($code < 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));

        else return chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));

    }



    // Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);

    function html_entity_replace($matches)

    {

        if ($matches[2])

        {

            return chr_utf8(hexdec($matches[3]));

        } elseif ($matches[1])

        {

            return chr_utf8((int) $matches[3]); // cast so the bit shifts operate on an integer

        }

        switch ($matches[3])

        {

            case "nbsp": return chr_utf8(160);

            case "iexcl": return chr_utf8(161);

            case "cent": return chr_utf8(162);

            case "pound": return chr_utf8(163);

            case "curren": return chr_utf8(164);

            case "yen": return chr_utf8(165);

            //... etc with all named HTML entities

        }

        return $matches[0]; // leave unknown entities untouched instead of deleting them

    }

    

    function htmlentities2utf8 ($string) // because of the html_entity_decode() bug with UTF-8

    {

        $string = preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $string);

        return $string;

    }

?>

jl dot garcia at gmail dot com

15 years ago


I created this function to filter all the text that goes in or comes out of the database.



<?php

function filter_string($string, $nohtml='', $save='') {

    if(!empty($nohtml)) {

        $string = trim($string);

        if(!empty($save)) $string = htmlentities(trim($string), ENT_QUOTES, 'ISO-8859-15');

        else $string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-15');

    }

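    // Note: mysql_real_escape_string() was removed in PHP 7.0; prefer prepared statements (PDO or mysqli).
    if(!empty($save)) $string = mysql_real_escape_string($string);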

    else $string = stripslashes($string);

    return($string);

}

?>

kae at verens dot com

16 years ago


the references to 'chr()' in the example unhtmlentities() function should be changed to unichr, using the example unichr() function described in the 'chr' reference (http://php.net/chr).

the reason for this is characters such as &#x20AC; which do not break down into an ASCII number (that's the Euro, by the way).
