Question
I am actually confused regarding the encoding of strings in Java. I have a couple of questions. Please help me if you know the answer to them:
1) What is the native encoding of Java strings in memory? When I write String a = "Hello", in which format will it be stored? Since Java is machine independent, I don't think the system will do the encoding.
2) I read on the net that "UTF-16" is the default encoding, but I got confused because when I write int a = 'c', I get the number of the character in the ASCII table. So are ASCII and UTF-16 the same?
3) Also, I wasn't sure what the storage of a string in memory depends on: the OS, the language?
Recommended answer
1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.
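As a rough illustration of the "array of 16-bit units" view, the sketch below inspects a string through the public API. It shows the logical UTF-16 view that String exposes, not the JVM's actual memory layout, which is an implementation detail. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: StringUnitsDemo and its methods are hypothetical names.
class StringUnitsDemo {
    // Number of 16-bit char units the string exposes.
    static int charUnits(String s) {
        return s.toCharArray().length;
    }

    // Number of bytes when the string is encoded as UTF-16, big-endian, no BOM.
    static int utf16ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String a = "Hello";
        System.out.println(charUnits(a));       // 5
        System.out.println(utf16ByteLength(a)); // 10, i.e. two bytes per char
    }
}
```

UTF_16BE is used rather than UTF_16 because the latter prepends a byte-order mark when encoding, which would add two bytes to the count.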
2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99. ASCII and UTF-16 are not the same encoding, but the first 128 Unicode code points coincide with ASCII, which is why the value matches the ASCII table.
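A minimal sketch of that conversion, with illustrative class and method names that are not part of the original answer:

```java
// Illustrative sketch only: CharToIntDemo and its methods are hypothetical names.
class CharToIntDemo {
    // Returns the numeric value of a char via the implicit char-to-int widening.
    static int codeOf(char c) {
        return c;
    }

    // True when the char lies in the 7-bit ASCII range (U+0000 to U+007F).
    static boolean isAscii(char c) {
        return c < 0x80;
    }

    public static void main(String[] args) {
        int a = 'c';
        System.out.println(a);                      // 99
        System.out.println(Integer.toHexString(a)); // 63
        // U+00E9 (e with acute accent) is a valid char but is not in ASCII.
        System.out.println(codeOf('\u00e9'));       // 233
        System.out.println(isAscii('\u00e9'));      // false
    }
}
```

The value for 'c' matches the ASCII table only because it falls within the first 128 code points; a character such as U+00E9 has a char value with no ASCII counterpart.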
3) Since each String is an object, it contains information beyond its class members (e.g., a class descriptor word, a lock/semaphore word, etc.).
ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).
EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).
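The arithmetic in this example can be sketched as a back-of-the-envelope estimate. Everything here is an assumption made for illustration: a 4-byte word, the exact layout described above, and no alignment padding. Real JVMs differ in header size, padding, and string representation, so the result is not a measurement.

```java
// Illustrative sketch only: a rough estimate under the assumptions stated above.
class StringFootprintSketch {
    static final int WORD_BYTES = 4; // assumption: 32-bit words

    // String object: two header words, an int length, and a char[] reference.
    static int stringObjectBytes() {
        int header = 2 * WORD_BYTES;
        int lengthField = 4;        // a Java int is always 4 bytes
        int arrayRef = WORD_BYTES;  // assumption: a reference occupies one word
        return header + lengthField + arrayRef;
    }

    // char[] object: two header words, a length word, and 2 bytes per char.
    static int charArrayBytes(int charCount) {
        return 2 * WORD_BYTES + WORD_BYTES + 2 * charCount;
    }

    // Combined estimate for a string holding charCount chars, ignoring padding.
    static int estimate(int charCount) {
        return stringObjectBytes() + charArrayBytes(charCount);
    }

    public static void main(String[] args) {
        System.out.println(estimate(5)); // 38 for "Hello" under these assumptions
    }
}
```

Under these assumptions the fixed overhead (28 bytes) dominates short strings; only 10 of the 38 bytes for "Hello" are character data.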
ADDENDUM 2
It is only true in most cases that one char represents one Unicode character. That would imply UCS-2 encoding, and it held before 2005. Unicode has since grown larger, so Strings have to be encoded in UTF-16, where, alas, a single Unicode character may use two chars in a Java String.
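A short sketch of the two-char case, using U+1F600 as an example of a character outside the 16-bit range. The class and method names are illustrative; the Character and String calls are standard API.

```java
// Illustrative sketch only: SurrogateDemo and its members are hypothetical names.
class SurrogateDemo {
    // U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
    static final String EMOJI = new String(Character.toChars(0x1F600));

    // Number of 16-bit char units in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(EMOJI));      // 2
        System.out.println(codePointCount(EMOJI)); // 1
        System.out.println(Character.isHighSurrogate(EMOJI.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(EMOJI.charAt(1)));  // true
        System.out.println(Integer.toHexString(EMOJI.codePointAt(0)));  // 1f600
    }
}
```

Code that must handle such characters correctly can use codePointAt and codePointCount rather than charAt and length.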
Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html