What is the difference between float and double?
Question

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?

Recommended answer

Huge difference.

As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits is calculated:

double has 52 mantissa bits + 1 hidden bit: log(2^53)÷log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(2^24)÷log(10) = 7.22 digits
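
The same arithmetic can be checked against the platform's own types. The following is a minimal sketch, assuming the compiler's float and double are the IEEE formats described in footnote [1]; the names decimal_digits, print_precision and LOG10_OF_2 are made up for this illustration, and log10(2) is hard-coded so the math library is not needed.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* log10(2), hard-coded so this sketch does not depend on the math library. */
#define LOG10_OF_2 0.30102999566398120

/* Decimal digits carried by a significand of `bits` binary digits
   (hidden bit included): log10(2^bits) = bits * log10(2). */
double decimal_digits(int bits)
{
    assert(bits > 0);
    return bits * LOG10_OF_2;
}

/* Prints the significand width reported by <float.h> and the
   equivalent number of decimal digits for float and double. */
void print_precision(void)
{
    printf("float:  %d bits, about %.2f decimal digits\n",
           FLT_MANT_DIG, decimal_digits(FLT_MANT_DIG));
    printf("double: %d bits, about %.2f decimal digits\n",
           DBL_MANT_DIG, decimal_digits(DBL_MANT_DIG));
}
```

Calling print_precision() from main reports whatever widths the compiler actually uses, so it also shows when a platform deviates from the 24-bit and 53-bit significands assumed above.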

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
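
To make the factorial example concrete, here is a small sketch, assuming IEEE binary32 and binary64 arithmetic with each intermediate result rounded to its own type (the usual case on x86-64). The function names are hypothetical. 60! is roughly 8.3e81, which is beyond the float range but well inside the double range.

```c
#include <assert.h>
#include <float.h>

/* n! accumulated in single precision. */
float factorial_float(int n)
{
    assert(n >= 0);
    float result = 1.0f;
    for (int i = 2; i <= n; ++i)
        result *= (float)i;
    return result;
}

/* n! accumulated in double precision. */
double factorial_double(int n)
{
    assert(n >= 0);
    double result = 1.0;
    for (int i = 2; i <= n; ++i)
        result *= (double)i;
    return result;
}

/* Nonzero when the value has overflowed to +infinity: only infinity
   compares greater than the largest finite value of the type. */
int overflowed_float(float x)   { return x > FLT_MAX; }
int overflowed_double(double x) { return x > DBL_MAX; }
```

With these helpers, overflowed_float(factorial_float(60)) is true while overflowed_double(factorial_double(60)) is false, which is exactly the kind of test case that passes with double and fails with float.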

Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
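
As an illustration of the money-processing advice, the sketch below adds a price of 0.10 repeatedly, once as a double and once as an integer number of cents. It assumes IEEE binary64 doubles; the function names and the choice of cents as the unit are assumptions made for this example only.

```c
#include <assert.h>

/* Adds a price of 0.10 `count` times in double precision.
   0.10 has no exact binary representation, so every addition
   carries a small rounding error. */
double total_as_double(int count)
{
    assert(count >= 0);
    double total = 0.0;
    for (int i = 0; i < count; ++i)
        total += 0.10;
    return total;
}

/* Adds the same price expressed as 10 whole cents.
   Integer addition is exact as long as it does not overflow. */
long total_in_cents(int count)
{
    assert(count >= 0);
    long total = 0;
    for (int i = 0; i < count; ++i)
        total += 10;
    return total;
}
```

Ten additions already show the difference: total_as_double(10) does not compare equal to 1.0, whereas total_in_cents(10) is exactly 100.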

Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use math.fsum. Otherwise, try to implement the Kahan summation algorithm.
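
For reference, Kahan summation might look roughly like this in C. This is a sketch rather than a tuned implementation: the names naive_sum and kahan_sum are hypothetical, and it assumes the compiler is not allowed to reassociate floating-point operations (for example, no -ffast-math), since that would optimize the compensation term away.

```c
#include <assert.h>
#include <stddef.h>

/* Plain += accumulation, shown for comparison. */
float naive_sum(const float *values, size_t count)
{
    assert(values != NULL || count == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i)
        sum += values[i];
    return sum;
}

/* Kahan (compensated) summation: `compensation` tracks the low-order
   bits lost by each addition and feeds them back into the next one. */
float kahan_sum(const float *values, size_t count)
{
    assert(values != NULL || count == 0);
    float sum = 0.0f;
    float compensation = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float y = values[i] - compensation; /* corrected next term */
        float t = sum + y;                  /* low bits of y are lost here */
        compensation = (t - sum) - y;       /* recover what was lost */
        sum = t;
    }
    return sum;
}
```

Applied to an array holding 729 copies of 1.f / 81, as in the float loop above, kahan_sum should land much closer to 9 than naive_sum does.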

[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).
