• Replace HTML Special Characters With Entities – But Without Touching Tags

    I came across a problem during the development of a CMS at work: I had to take a string of HTML source code and make sure all special HTML characters were replaced with their entities. For example, & (ampersand) should become &amp;.


    PHP has a couple of useful functions for this sort of thing, namely htmlentities and htmlspecialchars. However, running my string through either of these was no good to me, because doing so would convert the characters used in the HTML tags too. For example, the following:

    <p class="foo">This is a paragraph & that ampersand needs fixing</p>

    Would become:

    &lt;p class=&quot;foo&quot;&gt;This is a paragraph &amp; that ampersand needs fixing&lt;/p&gt;

    The ampersand is converted nicely, but now the HTML is useless. My first thought was to parse the string using PHP's XML parser in order to get at the character data directly. That idea didn't last long, since the very characters I was trying to fix would have broken the parser.
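    To illustrate why the parser route fails, here is a minimal sketch. It assumes SimpleXML as the parser (the post does not say which XML API was tried), and the variable names are only for illustration.

```php
<?php
// Collect libxml errors quietly instead of emitting warnings.
libxml_use_internal_errors(true);

$raw     = '<p class="foo">This is a paragraph & that ampersand needs fixing</p>';
$escaped = '<p class="foo">This is a paragraph &amp; that ampersand needs fixing</p>';

// A bare "&" is not well-formed XML, so parsing the raw string fails.
$broken = simplexml_load_string($raw);
var_dump($broken === false); // bool(true)

// Once the ampersand is an entity, the same markup parses fine.
$parsed = simplexml_load_string($escaped);
echo (string) $parsed, "\n"; // This is a paragraph & that ampersand needs fixing
```

    In other words, the string would need to be fixed before it could be parsed, which is the problem being solved in the first place.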

    In the end I settled on using a regular expression to match the content in between tags but leave the tags themselves alone. I also added some functionality to leave anything between <?php ?> tags alone, so I could pass through HTML with embedded PHP and not have it break.
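    The core idea can be sketched on its own, before the embedded PHP handling is added. This is only an illustration on a single hand-picked string; $pattern, $html and $out are names chosen for the example.

```php
<?php
// Match a run of characters that starts right after a ">" and
// stops just before anything that looks like the start of a tag.
$pattern = '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/ius';

$html = '<p class="foo">Fish & chips</p>';

// Only the matched text run is passed to htmlspecialchars();
// the tags on either side are never part of a match.
$out = preg_replace_callback(
    $pattern,
    function ($matches) {
        return htmlspecialchars($matches[0]);
    },
    $html
);

echo $out, "\n"; // <p class="foo">Fish &amp; chips</p>
```

    The lookbehind is what keeps the attribute text inside the opening tag out of the match: a match may only begin immediately after a closing angle bracket.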

    Here is the function. It is coded to work with UTF-8, hence the multibyte functions and the /u modifier on the regex, but if you are working with a single-byte character set you can just swap these out accordingly.

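    <?php
    function clean_entities($string) {

        // Decode any entities already present so they are not encoded twice.
        $string = htmlspecialchars_decode($string);

        // Split out embedded PHP blocks, keeping each one as its own part.
        $parts = preg_split('/(<\?.*?\?>)/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

        $string = '';

        foreach ($parts as $part) {
            if (false === mb_strpos(trim($part), '<?')) {
                // Plain HTML: encode the text that follows a tag, stopping before the next tag.
                $string .= preg_replace_callback(
                    '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/ius',
                    // A closure is used here because create_function() was removed in PHP 8.0.
                    function ($matches) {
                        return htmlspecialchars($matches[0]);
                    },
                    $part
                );
            } else {
                // Embedded PHP: pass it through untouched.
                $string .= $part;
            }
        }

        return $string;

    }
    ?>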

    This results in nice valid entities, but the tags and any embedded PHP are left alone:

    <p class="foo">This is a paragraph &amp; that ampersand fixed!</p>
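
    The way embedded PHP survives is easiest to see by looking at the split step in isolation. The sketch below uses a made-up input string, and it uses plain strpos rather than mb_strpos, which is an assumption that suits this ASCII-only example.

```php
<?php
$source = '<p>Hello <?php echo $name; ?> & goodbye</p>';

// PREG_SPLIT_DELIM_CAPTURE keeps the captured PHP blocks in the
// result instead of throwing them away as separators.
$parts = preg_split('/(<\?.*?\?>)/us', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $part) {
    // Any part containing "<?" is treated as PHP and would be passed through as-is.
    $kind = (false !== strpos($part, '<?')) ? 'php ' : 'html';
    echo $kind, ' | ', $part, "\n";
}
```

    Here $parts holds three entries: the HTML before the PHP block, the PHP block itself, and the HTML after it. Only the HTML entries would go on to the entity replacement.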
