Problem description
I am working on converting a string from one charset to another. After reading many examples, I found the code below, which looks good to me. As a newbie to charset encoding, I want to know whether it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
}
To convert a String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
Recommended answer
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies two conditions:
- Your input data is bytes in one encoding
- Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
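As a concrete illustration of the call above, here is a minimal sketch that transcodes one accented character from UTF-8 to ISO-8859-1. The class name TranscodeDemo and the sample input are assumptions made for this example; the method body is the one from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only transcodeField comes from the question.
final class TranscodeDemo {
    // Decode the bytes with the source charset, then re-encode with the target charset.
    static byte[] transcodeField(byte[] source, Charset from, Charset to) {
        return new String(source, from).getBytes(to);
    }

    public static void main(String[] args) {
        // "\u00e9" is the letter e with an acute accent.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9
        byte[] latin1 = transcodeField(utf8,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1); // 0xE9
        System.out.println(utf8.length + " -> " + latin1.length); // prints "2 -> 1"
    }
}
```

The same character ends up as different bytes in the two encodings, which is exactly the byte-to-byte conversion the method is meant for.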
If the input data contains characters that can't be represented in the output encoding (such as when converting complex UTF-8 to ASCII), those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
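If silent replacement is not acceptable, one option is to use the java.nio.charset encoder and decoder directly and ask them to report errors. The sketch below is not part of the original answer; the class name StrictTranscoder and the method name transcodeStrict are hypothetical, and it assumes you would rather get an exception than corrupted output.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: transcodes bytes, but fails instead of substituting '?'.
final class StrictTranscoder {
    static byte[] transcodeStrict(byte[] source, Charset from, Charset to)
            throws CharacterCodingException {
        // Decode, reporting bytes that are not valid in the source charset.
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(source));
        // Encode, reporting characters the target charset cannot represent.
        ByteBuffer encoded = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        try {
            transcodeStrict(utf8, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
            System.out.println("transcoded without loss");
        } catch (CharacterCodingException e) {
            // Reached here because the accented e has no US-ASCII representation.
            System.out.println("cannot transcode without loss: " + e);
        }
    }
}
```

Whether to fail or to substitute is a design choice; failing loudly is usually safer when the bytes are stored or forwarded to another system.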
However, a lot of people ask "How do I convert a String from one encoding to another?", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array containing the characters encoded in the specified encoding (again, characters that can't be encoded are converted to ?). The String constructor that takes a second parameter creates a new String from a byte array, treating the bytes as being in the specified encoding. Since you just used source.getBytes(inputEncoding) to get those bytes, they are encoded in inputEncoding, not in outputEncoding, so decoding them as outputEncoding misreads them (except where the two encodings use the same byte values, which is common for "normal" characters like abcd but not for more complex ones such as the accented characters éêäöñ).
So what does this mean? It means that while you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, because then you need to decide on an encoding. Choosing an encoding that covers all of Unicode, such as UTF-8 or UTF-16, is great: your characters will still be safe even if your String contains all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only characters supported by that encoding, or the result will be corrupted bytes.
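One way to check up front whether a String survives a given encoding is CharsetEncoder.canEncode. This is a small sketch rather than part of the original answer; the class name EncodableCheck and the method name fitsIn are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: tests whether every character of a String can be encoded.
final class EncodableCheck {
    static boolean fitsIn(String text, Charset charset) {
        // A fresh encoder per call keeps this safe to use from several threads.
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // \u98a8\u6c34 are the two Chinese characters for "feng shui".
        String text = "Feng shui \u98a8\u6c34";
        System.out.println(fitsIn(text, StandardCharsets.UTF_8));    // prints "true"
        System.out.println(fitsIn(text, StandardCharsets.US_ASCII)); // prints "false"
    }
}
```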
Now, finally, some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both encodings support the Nordic characters, they use different bytes to represent them, and decoding with the wrong encoding results in mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding explicitly (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
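A convenient way to follow this advice is to pass the constants from java.nio.charset.StandardCharsets, which also avoids the checked UnsupportedEncodingException thrown by the overloads that take a charset name. The sketch below is illustrative; the class name ExplicitCharsets and its two helper methods are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers that never rely on the platform default charset.
final class ExplicitCharsets {
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nordic = "H\u00e4r \u00e4r n\u00e5gra merkkej\u00e4";
        // The round trip is lossless because both directions use the same explicit charset.
        System.out.println(fromUtf8(toUtf8(nordic)).equals(nordic)); // prints "true"
        // The default differs between platforms and JVM versions, so its value is not predictable.
        System.out.println("platform default: " + Charset.defaultCharset());
    }
}
```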
As a last point, charset and encoding aren't the same thing, but they're very closely related.
¹ Technically, a String is stored internally in the JVM as UTF-16 up to Java 8, and with a variable internal encoding (Latin-1 or UTF-16, depending on content) from Java 9 onwards, but the developer doesn't need to care about that.
Note
It's possible to have a corrupted String and to be able to uncorrupt it by fiddling with the encoding, which may be where the "convert a String to another encoding" misunderstanding originates.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string is now "fixed". However, the proper approach is to use the correct encoding when reading input, not to fix it afterwards, especially if there's a chance of it becoming corrupted.
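Reading with the correct encoding usually means passing the charset to the reader that turns bytes into characters. The following is a minimal sketch that assumes the input arrives as an InputStream; the class name ReadWithCharset and the method name readAll are hypothetical, and a ByteArrayInputStream stands in for the network or file source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a whole stream using an explicitly chosen charset.
final class ReadWithCharset {
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStreamReader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "H\u00e4r \u00e4r n\u00e5gra merkkej\u00e4";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // what the sender wrote

        // Correct charset: the text arrives intact.
        String good = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        System.out.println(good.equals(original)); // prints "true"

        // Misconfigured charset: every two-byte character becomes two wrong characters.
        String bad = readAll(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);
        System.out.println(bad.equals(original)); // prints "false"
    }
}
```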