How to get the charset from HTML code with Simple HTML DOM

With the Simple HTML DOM library included, you can extract any HTML element and read its attributes. This example reads the charset from the Content-Type meta tag: for the meta tag below, the function _get_html_charset() returns "utf-8". The function takes the HTML source as a string; a comment inside it shows how to start from a URL instead.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

<?php
/**
* Include Simple HTML Dom
*/
function feed_control_init(){
  require_once('sites/default/thirdparty/simplehtmldom/simple_html_dom.php');
}

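/**
* Get the HTML encoding from source code
*
* @param string $raw_text Raw HTML source
* @return string|null Charset name (e.g. "utf-8"), or NULL if none is found
*/
function _get_html_charset($raw_text){
  $html = str_get_html($raw_text);
  // Or start from a URL instead: $html = file_get_html('http://www.mysite.com/');
  if (!$html) {
    return NULL; // Parsing failed (e.g. empty input)
  }

  // First <meta http-equiv="Content-Type"> element, or NULL if there is none
  $el = $html->find('meta[http-equiv=Content-Type]', 0);
  if (!$el) {
    return NULL;
  }

  // The content attribute looks like "text/html; charset=utf-8"
  $fullvalue = $el->content;

  // Capture only the charset name, stopping at whitespace, quotes or ";"
  if (preg_match('/charset=([^\s;"\']+)/i', $fullvalue, $matches)) {
    return $matches[1];
  }
  return NULL;
}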
?>
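
The charset-parsing step can also be tried on its own, without loading the library. The following is only a minimal sketch: extract_charset_from_content() is a hypothetical helper name, not part of Simple HTML DOM, and it assumes you already have the value of the content attribute as a string.

```php
<?php
/**
 * Hypothetical helper (not part of Simple HTML DOM): pull the charset
 * out of a Content-Type value such as "text/html; charset=utf-8".
 *
 * @param string $content Value of the meta tag's content attribute
 * @return string|null Lowercased charset name, or null if none is present
 */
function extract_charset_from_content($content) {
    // Allow optional spaces and quotes around the value, and stop at
    // whitespace, quotes or ";" so trailing parameters are not captured.
    if (preg_match('/charset\s*=\s*["\']?([^\s"\';]+)/i', $content, $matches)) {
        return strtolower($matches[1]);
    }
    return null; // No charset parameter in this value
}

// Usage with the content value from the meta tag above
$charset = extract_charset_from_content('text/html; charset=utf-8');
```

Lowercasing the result is a design choice made here so that "UTF-8" and "utf-8" compare equal; drop the strtolower() call if you need the value exactly as written in the page.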

Source: http://simplehtmldom.sourceforge.net/
