Strip all HTML to get plain text

This function removes all HTML markup and keeps only the plain text. It enhances PHP's strip_tags() function by also stripping out styles, scripts, embedded objects, and other unwanted page code, whose contents strip_tags() would otherwise leave behind as text.

<?php
/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects.  Add line breaks around
 * block-level tags to prevent word joining after tag removal.
 */
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
          // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
          // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0",
        ),
        $text );
    return strip_tags( $text );
}
?>
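
To see why the invisible content is removed before calling strip_tags(), here is a minimal sketch on a made-up input string. It uses only one of the patterns from the function above plus a single block-level pattern, so it is an illustration of the idea rather than a replacement for strip_html_tags().

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<p>Hello</p><script>alert("x");</script><p>World</p>';

// strip_tags() alone removes the tags but keeps the script body as text.
$plain = strip_tags( $html );

// Removing the whole <script> element first drops the script body too.
$filtered = preg_replace( '@<script[^>]*?>.*?</script>@siu', ' ', $html );

// Insert a line break before each <p> or </p> so words do not run together.
$filtered = preg_replace( '@</?p@iu', "\n\$0", $filtered );
$clean = strip_tags( $filtered );

var_dump( $plain );  // still contains the JavaScript source
var_dump( $clean );  // contains only "Hello" and "World" plus whitespace
```

With strip_tags() alone, the JavaScript source ends up in the output and "Hello" runs straight into it. The pre-filtered version keeps the two words on separate lines.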

Example

<?php
/* Read an HTML file */
$raw_text = file_get_contents( $filename );
 
/* Get the file's character encoding from a <meta> tag */
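preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
/* Assumed default: fall back to UTF-8 when the page declares no charset */
$encoding = isset( $matches[3] ) ? $matches[3] : "utf-8";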
 
/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );
 
/* Strip HTML tags and invisible text */
$utf8_text = strip_html_tags( $utf8_text );
 
/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "UTF-8" );

?>
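
The charset lookup in the example only recognizes the http-equiv form of the <meta> tag, with that exact attribute order and double quotes. As a rough sketch, the lookup could be wrapped in a helper that also checks the shorter <meta charset="..."> form and returns a default when nothing is found. The name detect_meta_charset() and the UTF-8 default are assumptions for illustration, not part of the referenced article.

```php
<?php
// Hypothetical helper: find the charset declared in a <meta> tag.
// Returns $default when no declaration is found.
function detect_meta_charset( $html, $default = 'UTF-8' )
{
    // Short form: <meta charset="...">
    if ( preg_match( '@<meta\s+charset=["\']?([\w-]+)@i', $html, $m ) ) {
        return $m[1];
    }
    // Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
    $pattern = '@<meta\s+http-equiv=["\']?Content-Type["\']?\s+'
             . 'content=["\'][^"\']*charset=([\w-]+)@i';
    if ( preg_match( $pattern, $html, $m ) ) {
        return $m[1];
    }
    return $default;
}

echo detect_meta_charset( '<meta charset="iso-8859-1">' ), "\n";  // prints iso-8859-1
```

The returned value can then be passed to iconv() as the source encoding. Pages that declare an encoding iconv() does not recognize would still need separate error handling.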

Reference: http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_w...
