Generating Unicode Codes and Symbols with PHP

We have already talked about Unicode, and we know that a symbol encoded in UTF-8 takes from 1 to 4 bytes. This follows from the number of symbols defined in the Unicode table and from how the UTF-8 encoding/decoding algorithm works: the higher the code point, the more bytes it needs.
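To make the 1-to-4-byte range concrete, the short sketch below prints the encoded size and the raw bytes of four sample characters. The sample characters are chosen here only for illustration, and the "\u{...}" escape syntax assumes PHP 7.0 or later.

```php
<?php
// Assumption: PHP 7.0+, where "\u{...}" inside double quotes yields UTF-8 bytes.
$samples = [
    'letter A'      => "A",         // U+0041
    'c-cedilla'     => "\u{E7}",    // U+00E7
    'euro sign'     => "\u{20AC}",  // U+20AC
    'grinning face' => "\u{1F600}", // U+1F600
];

foreach ($samples as $name => $char) {
    // strlen() counts bytes, not characters; bin2hex() shows the raw bytes
    printf("%-14s %d byte(s): %s\n", $name, strlen($char), bin2hex($char));
}
```

Because strlen() counts bytes, the four samples report 1, 2, 3 and 4 respectively, which is why byte-oriented string functions need help when the text is UTF-8.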

Some operations on strings containing UTF-8 text may require extra helpers. The following functions are available for working with UTF-8:

Language: PHP

License: LGPL 3 or later

/**
 * Returns the code point of a UTF-8 character
 * @param string $c UTF-8 character (1 to 4 bytes)
 * @return int|false Code point of the given character, or false if it is invalid
 */
function ord_utf8($c) {

    // UTF-8 characters take 8 to 32 bits, of which 7 to 21 are significant (see table)

    // http://tools.ietf.org/html/rfc3629
    //   Code point range  |        Octet sequence
    //    (hexadecimal)    |           (binary)
    // --------------------+--------------------------------------
    // 00000000 - 0000007F | 0xxxxxxx
    // 00000080 - 000007FF | 110xxxxx 10xxxxxx
    // 00000800 - 0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    // 00010000 - 0010FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    $tam_real = strlen($c);
    if ($tam_real == 0) {
        // Without this guard, $vt_ord[0] below would be undefined
        trigger_error('Invalid UTF-8 character: empty string', E_USER_NOTICE);
        return false;
    }

    /// 1 - Get the numeric value of each byte of the given character
    $vt_ord = array();
    for ($i = 0; $i < $tam_real; $i++) {
        $vt_ord[$i] = ord($c[$i]);
    }

    /// 2 - Check whether the character is plain ASCII (1 byte): 7 significant bits
    if ($vt_ord[0] <= 0x7F) { // byte1 <= 01111111
        return $vt_ord[0];
    }

    /// 3 - Validate the character and extract the significant bits

    // 2 bytes expected: 8 to 11 significant bits
    if ($vt_ord[0] <= 0xDF) {                  // byte1 <= 11011111
        if ($tam_real == 2 &&                  // has 2 bytes
            $vt_ord[0] >= 0xC2 &&              // byte1 is a lead byte (0x80-0xBF are continuations, 0xC0-0xC1 are never valid)
            (($vt_ord[1] & 0xC0) == 0x80)) {   // byte2 & 11000000 == 10000000 (byte2 == 10xxxxxx)

            return ($vt_ord[1] & 0x3F) |       // byte2 & 00111111 (6 bits)
                   (($vt_ord[0] & 0x1F) << 6); // byte1 & 00011111 (+ 5 bits)
        }

    // 3 bytes expected: 12 to 16 significant bits
    } elseif ($vt_ord[0] <= 0xEF) {            // byte1 <= 11101111
        if ($tam_real == 3 &&                  // has 3 bytes
            (($vt_ord[1] & 0xC0) == 0x80) &&   // byte2 & 11000000 == 10000000 (byte2 == 10xxxxxx)
            (($vt_ord[2] & 0xC0) == 0x80)) {   // byte3 & 11000000 == 10000000 (byte3 == 10xxxxxx)

            return ($vt_ord[2] & 0x3F) |        // byte3 & 00111111 (6 bits)
                   (($vt_ord[1] & 0x3F) << 6) | // byte2 & 00111111 (+ 6 bits)
                   (($vt_ord[0] & 0x0F) << 12); // byte1 & 00001111 (+ 4 bits)
        }

    // 4 bytes expected: 17 to 21 significant bits
    } elseif ($vt_ord[0] <= 0xF4) { // byte1 <= 11110100
        if ($tam_real == 4 &&       // has 4 bytes
            (($vt_ord[1] & 0xC0) == 0x80) &&   // byte2 & 11000000 == 10000000 (byte2 == 10xxxxxx)
            (($vt_ord[2] & 0xC0) == 0x80) &&   // byte3 & 11000000 == 10000000 (byte3 == 10xxxxxx)
            (($vt_ord[3] & 0xC0) == 0x80)) {   // byte4 & 11000000 == 10000000 (byte4 == 10xxxxxx)

            return ($vt_ord[3] & 0x3F) |         // byte4 & 00111111 (6 bits)
                   (($vt_ord[2] & 0x3F) << 6) |  // byte3 & 00111111 (+ 6 bits)
                   (($vt_ord[1] & 0x3F) << 12) | // byte2 & 00111111 (+ 6 bits)
                   (($vt_ord[0] & 0x07) << 18);  // byte1 & 00000111 (+ 3 bits)
        }
    }

    // The given UTF-8 is invalid: report its bytes in binary
    $vt_binario = array();
    for ($i = 0; $i < $tam_real; $i++) {
        $vt_binario[] = sprintf('%08b', $vt_ord[$i]);
    }
    $binario = implode(' ', $vt_binario);

    trigger_error('Invalid UTF-8 character: '.$binario, E_USER_NOTICE);
    return false;
}


/**
 * Builds a UTF-8 character from its code point (7 to 21 bits)
 * @param int $ord code point of the character
 * @return string|false UTF-8 character, or false if the code point cannot be encoded
 */
function chr_utf8($ord) {

    // 1 byte (7 significant bits)
    if ($ord <= 0x7F) {
        return chr($ord);

    // 2 bytes (11 significant bits = 5 + 6)
    } elseif ($ord <= 0x7FF) {
        return chr((($ord >> 6) & 0x1F) | 0xC0).   // ((ord >> 6) & 00011111) | 11000000
               chr((   $ord     & 0x3F) | 0x80);   // (   ord     & 00111111) | 10000000

    // 3 bytes (16 significant bits = 4 + 6 + 6)
    } elseif ($ord <= 0xFFFF) {
        return chr((($ord >> 12) & 0xF)  | 0xE0).  // ((ord >> 12) & 00001111) | 11100000
               chr(( ($ord >> 6) & 0x3F) | 0x80).  // ( (ord >> 6) & 00111111) | 10000000
               chr((    $ord     & 0x3F) | 0x80);  // (    ord     & 00111111) | 10000000

    // 4 bytes (21 significant bits = 3 + 6 + 6 + 6)
    } elseif ($ord <= 0x10FFFF) {
        return chr((($ord >> 18) & 0x7)  | 0xF0).  // ((ord >> 18) & 00000111) | 11110000
               chr((($ord >> 12) & 0x3F) | 0x80).  // ((ord >> 12) & 00111111) | 10000000
               chr((($ord >> 6)  & 0x3F) | 0x80).  // ( (ord >> 6) & 00111111) | 10000000
               chr((    $ord     & 0x3F) | 0x80);  // (    ord     & 00111111) | 10000000
    }
    trigger_error('The code '.$ord.' cannot be represented in UTF-8', E_USER_NOTICE);
    return false;
}


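/**
 * Returns the expected length of a UTF-8 character
 * @param string $c UTF-8 character (1 to 4 bytes); only its first byte is inspected
 * @return int|false length of the character in bytes, or false if the first byte is not a valid lead byte
 */
function tamanho_utf8($c) {
    $ord = ord($c[0]);

    if ($ord <= 0x7F) {                       // byte <= 01111111
        return 1;
    } elseif ($ord >= 0xC2 && $ord <= 0xDF) { // 11000010 <= byte <= 11011111
        return 2;
    } elseif ($ord >= 0xE0 && $ord <= 0xEF) { // 11100000 <= byte <= 11101111
        return 3;
    } elseif ($ord >= 0xF0 && $ord <= 0xF4) { // 11110000 <= byte <= 11110100
        return 4;
    }

    // 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never appear in UTF-8
    trigger_error('The given byte does not start a UTF-8 character', E_USER_NOTICE);
    return false;
}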


/**
 * Returns the UTF-8 character at a given position of a UTF-8 string
 * @param string $str UTF-8 encoded string
 * @param int $pos position of the desired character (zero-based, counted in characters)
 * @return string|false character at the given position, or false if it does not exist
 */
function get_char_utf8($str, $pos) {

    $len = strlen($str);
    $tam_caractere = 0;

    $caractere = 0;
    for ($i = 0; $i < $len; $i += $tam_caractere, $caractere++) {

        // Check the length of the UTF-8 character
        $tam_caractere = tamanho_utf8($str[$i]);
        if ($tam_caractere === false) {
            // Invalid lead byte: stop here, otherwise $i would never advance
            return false;
        }

        if ($caractere == $pos) {
            return substr($str, $i, $tam_caractere);
        }
    }
    trigger_error('Position '.$pos.' does not exist in the string "'.$str.'"', E_USER_NOTICE);
    return false;
}
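The bit arithmetic used by ord_utf8 and chr_utf8 can be checked by hand. The sketch below decodes one 2-byte character and encodes one 3-byte character using only PHP's built-in ord(), chr() and bin2hex(); the two sample characters are chosen here for illustration, and the variable names are not part of the functions above.

```php
<?php
// Decoding: "ç" (U+00E7) is stored as the bytes 0xC3 0xA7 = 110_00011 10_100111
$bytes = "\xC3\xA7";
$b1 = ord($bytes[0]); // 0xC3, lead byte of a 2-byte sequence
$b2 = ord($bytes[1]); // 0xA7, continuation byte

// Keep the 5 payload bits of byte 1 and the 6 payload bits of byte 2
$code = (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
printf("U+%04X (%d)\n", $code, $code); // U+00E7 (231)

// Encoding: U+20AC (euro sign) needs 3 bytes, carrying 4 + 6 + 6 bits
$ord = 0x20AC;
$encoded = chr((($ord >> 12) & 0x0F) | 0xE0)  // 1110xxxx
         . chr((($ord >> 6)  & 0x3F) | 0x80)  // 10xxxxxx
         . chr(( $ord        & 0x3F) | 0x80); // 10xxxxxx
echo bin2hex($encoded), "\n"; // e282ac
```

The same masks and shifts appear in the 2-byte branch of ord_utf8 and in the 3-byte branch of chr_utf8.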

With these functions you can even implement an HTML entities encoder for Unicode characters. For example, to generate entities for every Unicode symbol in a string except the ASCII ones, just do:

$string = 'Atenção';
$string_codificada = html_entities_utf8($string);
echo $string_codificada; // Prints Aten&#231;&#227;o

/**
 * Converts the non-ASCII UTF-8 characters of a string to numeric HTML entities
 * @param string $string Unicode text
 * @return string Text with HTML entities
 */
function html_entities_utf8($string) {
    $saida = '';
    $len = strlen($string);
    $i = 0;
    while ($i < $len && $char_len = tamanho_utf8(substr($string, $i, 4))) {
        $char = substr($string, $i, $char_len);
        if ($char_len == 1) {
            $saida .= $char;
        } else {
            $ord = ord_utf8($char);
            $saida .= '&#' . $ord . ';';
        }
        $i += $char_len;
    }
    return $saida;
}
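As a cross-check, and assuming the mbstring extension is loaded, PHP's built-in mb_encode_numericentity should give the same result as html_entities_utf8 for this input. This is a sketch of an alternative, not part of the functions above; the conversion map values are chosen here for illustration.

```php
<?php
// Assumption: the mbstring extension is loaded.
// Conversion map: [first code point, last code point, offset, mask].
// It covers every code point above ASCII and leaves the value unchanged.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$string = "Aten\u{E7}\u{E3}o"; // "Atenção" written with escapes (PHP 7.0+)
$encoded = mb_encode_numericentity($string, $convmap, 'UTF-8');
echo $encoded, "\n"; // Aten&#231;&#227;o
```

Comparing both outputs on a few strings is a quick way to confirm that the hand-written decoder handles 2, 3 and 4-byte characters correctly.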
