Problem description
I have run into a problem with a UTF-8 string. I want to read a single character from the string, for example:
$string = "üÜöÖäÄ";
echo $string[0];
I expect to see ü, but I get � instead. Why?
Recommended answer
Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.
What happens in your code is that the expression $string[0] returns the first byte of the UTF-8 encoded representation of your string, because PHP strings are effectively arrays of bytes (PHP does not track a string's encoding internally).
Because the first character in your string is encoded as more than one byte under UTF-8's rules, you are getting only part of the character. Those same rules also mean that the byte you retrieve is not valid as a character on its own, which is why you see the replacement character (�) instead of ü.
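To make the byte-level picture concrete, here is a minimal sketch that dumps the raw bytes involved. It assumes PHP 7.0 or later (for the \u{...} escapes) and the mbstring extension; the variable names $firstByte and $firstChar are chosen only for illustration.

```php
<?php
// Same text as in the question, written with \u{...} escapes so the
// bytes do not depend on how this file happens to be saved.
$string = "\u{FC}\u{DC}\u{F6}\u{D6}\u{E4}\u{C4}"; // "üÜöÖäÄ"

$firstByte = $string[0];                        // a single byte
$firstChar = mb_substr($string, 0, 1, 'UTF-8'); // one whole character

echo bin2hex($firstByte), "\n"; // c3
echo bin2hex($firstChar), "\n"; // c3bc

// A lone lead byte is not valid UTF-8 on its own.
var_dump(mb_check_encoding($firstByte, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($firstChar, 'UTF-8')); // bool(true)
```

The byte 0xC3 only announces that a second byte follows, so a terminal or browser that expects UTF-8 has nothing valid to render and falls back to �.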
mb_substr knows the encoding rules, so it will not naively give you back just one byte; it returns as many bytes as are needed to encode the first character.
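The same idea applies to any position in the string, not only the first. The sketch below is illustrative only and assumes the source file is saved as UTF-8 with precomposed characters; $third is a hypothetical variable name.

```php
<?php
$string = "üÜöÖäÄ"; // assumes this file is saved as UTF-8

// Byte count and character count differ for multi-byte text.
echo strlen($string), "\n";             // 12 (bytes)
echo mb_strlen($string, 'UTF-8'), "\n"; // 6  (characters)

// Offset and length are counted in characters, not bytes.
$third = mb_substr($string, 2, 1, 'UTF-8');
echo $third, "\n"; // ö
```

By contrast, $string[2] would return the first byte of the second character, since byte offsets 0 and 1 are both taken up by ü.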
You can see that $string[0] gives you back just one byte with:
$string = "üÜöÖäÄ";
echo strlen($string[0]);
While mb_substr gives you back two bytes:
$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));
And these two bytes are in fact just one character (you need mb_strlen to count characters rather than bytes):
$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');
Finally, as Marwelln points out below, things become more tolerable if you call mb_internal_encoding to get rid of the redundant 'utf-8' arguments:
$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
You can see most of the above in action.
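If you need every character rather than just one, a possible approach is to split the string once and then index the resulting array. This is a sketch, not part of the original answer; it assumes PHP 7.4 or later (where mb_str_split was added) and a source file saved as UTF-8.

```php
<?php
mb_internal_encoding('UTF-8');

$string = "üÜöÖäÄ"; // assumes this file is saved as UTF-8

// mb_str_split() returns one array element per character and uses
// the internal encoding when no encoding argument is passed.
$chars = mb_str_split($string);

echo count($chars), "\n"; // 6
echo $chars[0], "\n";     // ü
```

On older PHP versions, preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) should give the same array for valid UTF-8 input.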