How to split a text into chunks containing a maximum number of characters and retain full sentences
This is how you can split a text into chunks containing a maximum number of characters while keeping sentences whole. For example, you may have a limit of 5000 characters, as in the case of the Google Translate API, and want to divide a longer text into chunks without cutting sentences or words in the middle.
The following code receives a text and, if it exceeds the limit, looks up the last full stop before the limit and splits the text there. This is repeated until the whole text has been divided into chunks. The parts are stored in an array and can then be sent to any API or function. The results are then merged again if desired.
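To try the splitting step on its own, here is a minimal sketch that runs without Drupal or a translation service. The function name `split_into_chunks()` and the fallback to the last space (when a window contains no full stop) are assumptions made for illustration; they are not part of the original snippet.

```php
<?php
// Hypothetical helper: splits $text into chunks of at most $limit bytes,
// cutting after the last full stop that still fits into the limit.
function split_into_chunks(string $text, int $limit): array {
  if ($limit < 1) {
    throw new InvalidArgumentException('The limit must be at least 1.');
  }
  $chunks = [];
  $text = trim($text);
  while (strlen($text) > $limit) {
    // Only the first $limit bytes are candidates for this chunk.
    $window = substr($text, 0, $limit);
    $cut = strrpos($window, '.');
    if ($cut === false) {
      // Assumed fallback: cut at the last space, or hard-cut at the limit.
      $cut = strrpos($window, ' ');
      $cut = ($cut === false) ? $limit - 1 : $cut;
    }
    $chunks[] = trim(substr($text, 0, $cut + 1));
    // Continue with the remainder, without the leading whitespace.
    $text = ltrim(substr($text, $cut + 1));
  }
  if ($text !== '') {
    $chunks[] = $text;
  }
  return $chunks;
}

$chunks = split_into_chunks('One. Two two. Three three three.', 20);
print_r($chunks); // two chunks: 'One. Two two.' and 'Three three three.'
```

Note that `strlen()` and `substr()` count bytes, so for multibyte text a chunk may hold fewer characters than the limit suggests.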
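<?php
/**
 * Splits a text at sentence ends into chunks of at most $limit bytes,
 * translates each chunk and joins the results.
 *
 * $source has no default value here: an optional parameter declared before a
 * required one is deprecated since PHP 8.0.
 */
function cdt_translate_text($text, $source, $target) {
  // Low limit that suits the test data below; the Google Translate case needs 5000.
  $limit = 200;
  // Test data: this overwrites the $text argument. Remove the line for real use.
  echo $text = "The following code receives a text. And if it exceeds the limit. It looks up the last punctuation point before the limit. And parts of the text there until the whole text has been divided into chunks. The parts are stored in an array. And can then be used to send to any API or any function. The results are then merged again if desired.";
  $text = strip_tags($text);
  $text_copy = $text;
  $text_sub = array();
  $translated = array();
  $r = 0;
  do {
    $last_point = _get_last_point($text_copy, $limit);
    // Keep everything up to and including the full stop.
    $text_sub[] = substr($text_copy, 0, $last_point + 1);
    // Continue after the full stop and drop the whitespace that follows it.
    $text_copy = ltrim((string) substr($text_copy, $last_point + 1));
    $left = strlen($text_copy);
    if ($r++ > 10) {
      drupal_set_message(t("Maximum text chunks to translate has been exceeded"));
      // Stop splitting; the remaining text is not translated.
      break;
    }
  } while ($left > 0);
  foreach ($text_sub as $sub_text) {
    $translated[] = _translate($sub_text, $source, $target);
  }
  return implode(" ", $translated);
}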
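/**
 * Returns the byte offset of the last full stop before $limit, or the text
 * length if the whole text fits within the limit.
 */
function _get_last_point($text, $limit = 5000) {
  $text_length = strlen($text);
  if ($text_length <= $limit) {
    return $text_length;
  }
  // Collect the byte offsets of all full stops in the text.
  preg_match_all("/\./", $text, $matches, PREG_OFFSET_CAPTURE);
  $last_point = NULL;
  foreach ($matches[0] as $value) {
    if ($value[1] >= $limit) {
      break;
    }
    $last_point = $value[1];
  }
  if ($last_point !== NULL) {
    return $last_point;
  }
  // No full stop before the limit: cut at the first full stop after it,
  // or return the whole text if it contains no full stop at all.
  return isset($matches[0][0]) ? $matches[0][0][1] : $text_length;
}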
?>
Check out the PHP functions:
chunk_split()
wordwrap()
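As a quick comparison, the sketch below shows what these two built-in functions do with a short sample string. The sample text and the width of 15 are arbitrary choices for illustration. Neither function knows about sentences: `wordwrap()` keeps words whole, while `chunk_split()` cuts at a fixed byte count.

```php
<?php
$text = 'First sentence here. Second one.';

// wordwrap() breaks at spaces, so words stay whole,
// but a sentence can still be spread over several lines.
$lines = explode("\n", wordwrap($text, 15, "\n", true));
print_r($lines); // 'First sentence', 'here. Second', 'one.'

// chunk_split() inserts the separator after every 15 bytes,
// regardless of word or sentence boundaries.
$pieces = chunk_split($text, 15, "\n");
echo $pieces; // the second piece ends in the middle of the word "one"
```

If only word boundaries matter, `wordwrap()` may already be enough; keeping whole sentences needs a custom function like the one above.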
