Question
I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.
I've found some information about the differences between characters and code points: characters are what is displayed to human users, and code points are the values that encode those specific characters. However, I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?
In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.
Recommended answer
To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.
A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109384 symbols, which is far more than 2^16 (65,536).
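As a rough illustration of "a number that identifies a symbol", the sketch below asks Java for the code point of three sample symbols. The class name CodePointDemo and the helper firstCodePoint are made up for this example; String.codePointAt and Character.MAX_CODE_POINT are standard library members.

```java
class CodePointDemo {
    // Hypothetical helper: returns the code point of the first symbol in s.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // 'A' is part of ASCII, so its number is below 128.
        System.out.println(firstCodePoint("A"));            // 65
        // U+20AC EURO SIGN is outside ASCII but still below 2^16.
        System.out.println(firstCodePoint("\u20AC"));       // 8364
        // U+1F600 GRINNING FACE is above 2^16 = 65536.
        System.out.println(firstCodePoint("\uD83D\uDE00")); // 128512
        // The largest code point Unicode allows is U+10FFFF.
        System.out.println(Character.MAX_CODE_POINT);       // 1114111
    }
}
```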
Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
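To make the difference between these encodings concrete, here is a minimal sketch that measures how many bytes the same symbol occupies in each one. EncodingSizeDemo and its helper methods are hypothetical names. The UTF-32 size is computed as four bytes per code point rather than through a charset lookup, because the JDK's StandardCharsets class does not include a UTF-32 constant.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    // Number of bytes the string occupies in UTF-8 (1 to 4 bytes per symbol).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16; the big-endian variant writes no byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 always uses 4 bytes per code point, so the size can be computed directly.
    static int utf32Length(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = { "A", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + ": UTF-8=" + utf8Length(s)
                    + " UTF-16=" + utf16Length(s)
                    + " UTF-32=" + utf32Length(s));
        }
    }
}
```

For these samples, "A" takes 1, 2 and 4 bytes, the euro sign takes 3, 2 and 4 bytes, and U+1F600 takes 4 bytes in all three encodings.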
When you try to use an encoding that uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16-bit units), you need some workaround.
Thus, surrogates are 16-bit values that, used in pairs, represent symbols that do not fit into a single two-byte value.
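The answer does not spell out how a surrogate pair is built, so the following is a sketch of the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then place the top 10 bits of the result in a "high" surrogate (0xD800 to 0xDBFF) and the bottom 10 bits in a "low" surrogate (0xDC00 to 0xDFFF). SurrogateDemo and its two methods are illustrative names; in real code the built-in Character.toChars and Character.toCodePoint do the same job.

```java
class SurrogateDemo {
    // Splits a code point above U+FFFF into its high and low surrogate.
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("fits in one 16-bit value: " + codePoint);
        }
        int offset = codePoint - 0x10000;               // at most 20 bits
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Reverses the split: combines a high and low surrogate into one code point.
    static int fromSurrogatePair(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F600);
        // Prints "D83D DE00"
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        // The hand-rolled version agrees with the JDK's own conversion.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x1F600))); // true
        System.out.println(fromSurrogatePair(pair[0], pair[1]) == 0x1F600);            // true
    }
}
```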
Java uses UTF-16 internally to represent text.
In particular, a char (character) is an unsigned two-byte value that contains a single UTF-16 code unit, so a symbol outside the 16-bit range takes two chars (a surrogate pair).
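This is what the comment quoted in the question was getting at: stepping through a String char by char can hand you half of a surrogate pair. The sketch below, with the made-up class name CharVsCodePointDemo and a sample string chosen for illustration, contrasts walking a string by char with walking it by code point.

```java
class CharVsCodePointDemo {
    // Number of 16-bit char values in the string.
    static int countChars(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(countChars(s));      // 4
        System.out.println(countCodePoints(s)); // 3

        // Walking by char exposes the two surrogates individually.
        for (char c : s.toCharArray()) {
            System.out.println(String.format("char %04X surrogate=%b",
                    (int) c, Character.isSurrogate(c)));
        }

        // Walking by code point yields each symbol exactly once.
        s.codePoints().forEach(cp ->
                System.out.println("code point U+" + Integer.toHexString(cp).toUpperCase()));
    }
}
```

If the text may contain symbols above U+FFFF, iterating with codePoints() (or codePointAt plus Character.charCount) is usually the safer choice.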
If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2