A fast method to round a double to a 32-bit int explained


Question

    When reading Lua's source code, I noticed that Lua uses a macro to round double values to 32-bit int values. The macro is defined in the llimits.h header file and reads as follows:

    union i_cast {double d; int i[2];};
    #define double2int(i, d, t) \
        {volatile union i_cast u; u.d = (d) + 6755399441055744.0; \
        (i) = (t)u.i[ENDIANLOC];}
    

    Here ENDIANLOC is defined according to endianness: 0 for little endian, 1 for big endian architectures; Lua carefully handles endianness. The t argument is substituted with an integer type like int or unsigned int.

    I did a little research and found that there is a simpler format of that macro which uses the same technique:

    #define double2int(i, d) \
        {double t = ((d) + 6755399441055744.0); i = *((int *)(&t));}
    

    Or, in a C++-style:

    inline int double2int(double d)
    {
        d += 6755399441055744.0;
        return reinterpret_cast<int&>(d);
    }
    

    This trick works on any machine that uses IEEE 754 doubles (which means pretty much every machine today), provided the default round-to-nearest mode is in effect. It works for both positive and negative numbers, and ties are rounded to the nearest even integer (banker's rounding). This is not surprising, since that is the IEEE 754 default.

    I wrote a little program to test it:

    #include <stdio.h>

    int main(void)
    {
        double d = -12345678.9;
        int i;
        double2int(i, d);
        printf("%d\n", i);
        return 0;
    }
    

    And it outputs -12345679, as expected.
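The pointer cast in the simplified macro breaks C's aliasing rules and silently assumes a little-endian layout. As a hedged sketch, the same trick can be written with `memcpy`. The helper name `double2int_memcpy` is hypothetical and does not come from Lua. It assumes IEEE 754 doubles, the default round-to-nearest mode, and an input within 32-bit int range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

/* Hypothetical helper (not from Lua): the magic-number trick, with the bits
   copied out through memcpy instead of a pointer cast or a union. */
static int32_t double2int_memcpy(double d)
{
    /* volatile forces the sum to be stored as a real 64-bit double, which
       matters where intermediates may carry extra precision (e.g. x87). */
    volatile double biased = d + 6755399441055744.0;  /* 2^52 + 2^51 */
    double stored = biased;
    uint64_t bits;
    memcpy(&bits, &stored, sizeof bits);
    /* Keep only the low 32 bits; the uint32_t -> int32_t conversion wraps
       on mainstream compilers (it is implementation-defined in C). */
    return (int32_t)(uint32_t)bits;
}
```

With these assumptions, `double2int_memcpy(-12345678.9)` gives -12345679, and the ties 0.5, 1.5 and 2.5 give 0, 2 and 2, which is the round-half-to-even behaviour described above. Because the low 32 bits are taken from a 64-bit integer, no `ENDIANLOC` index is needed as long as doubles and integers share the same byte order.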

    I would like to understand how this tricky macro works in detail. The magic number 6755399441055744.0 is actually 2^51 + 2^52, or 1.5 × 2^52, and 1.5 in binary can be represented as 1.1. When any 32-bit integer is added to this magic number...

    Well, I’m lost from here. How does this trick work?

    Update

    1. As @Mysticial points out, this method does not limit itself to a 32-bit int; it can also be expanded to a 64-bit int as long as the number is in the range of 2^52. (Although the macro needs some modification.)

    2. Some materials say this method cannot be used in Direct3D.

    3. When working with Microsoft assembler for x86, there is an even faster macro written in assembly code (the following is also extracted from Lua source):

       #define double2int(i,n)  __asm {__asm fld n   __asm fistp i}
      

    4. There is a similar magic number for single precision numbers: 1.5 × 2^23.
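The single-precision variant mentioned in item 4 could look roughly like the sketch below. This is an assumption about how the constant would be used, not code from Lua. A float has only 23 fraction bits, so instead of truncating to a narrower integer the sketch subtracts the bit pattern of the magic number itself (`0x4B400000`). The helper name `float2int` is hypothetical, and the result is only meaningful for inputs with absolute value below 2^22.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");

/* Hypothetical helper: round a float to the nearest integer (ties to even)
   using the magic number 1.5 * 2^23 = 12582912. Valid for |f| < 2^22. */
static int32_t float2int(float f)
{
    /* volatile forces the sum to be rounded to a real 32-bit float */
    volatile float biased = f + 12582912.0f;
    float stored = biased;
    uint32_t bits;
    memcpy(&bits, &stored, sizeof bits);
    /* 0x4B400000 is the bit pattern of 12582912.0f; the difference of the
       two patterns is the rounded integer, negative values included. */
    return (int32_t)bits - (int32_t)0x4B400000;
}
```

Under these assumptions, `float2int(2.5f)` gives 2, `float2int(3.5f)` gives 4 and `float2int(-3.7f)` gives -4.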

    Answer

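    A value of the double floating-point type is laid out as follows:

    [Figure: IEEE 754 double-precision layout, from most to least significant bit: 1 sign bit, 11 exponent bits, 52 fraction (mantissa) bits]

    It can be seen as two 32-bit integers. The int taken in all the versions of your code (supposing it's a 32-bit int) is the one on the right in the figure, that is, the least significant half, so what you are doing in the end is just taking the lowest 32 bits of the mantissa.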


    Now, to the magic number; as you correctly stated, 6755399441055744 is 2^51 + 2^52; adding such a number forces the double to go into the "sweet range" between 2^52 and 2^53, which, as explained by Wikipedia, has an interesting property:

    Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992, the representable numbers are exactly the integers.

    This follows from the fact that the mantissa is 52 bits wide.

    The other interesting fact about adding 2^51 + 2^52 is that it affects the mantissa only in the two highest bits, which are discarded anyway, since we are taking only its lowest 32 bits.
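To make the bit layout concrete, here is a small illustrative sketch, assuming IEEE 754 doubles. The helper names `double_bits` and `fraction_field` are hypothetical. It exposes the raw 64-bit pattern of a double and its 52-bit fraction field, so the effect of adding the magic number can be inspected directly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

/* Hypothetical helper: the raw IEEE 754 bit pattern of a double. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Hypothetical helper: only the 52-bit fraction (mantissa) field. */
static uint64_t fraction_field(double d)
{
    return double_bits(d) & 0xFFFFFFFFFFFFFULL;
}
```

For example, `double_bits(6755399441055744.0)` is `0x4338000000000000`: the exponent field is `0x433` and the only fraction bit set is bit 51. Adding 5 gives `0x4338000000000005`, so the integer sits untouched in the low bits of the fraction field.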


    Last but not least: the sign.

    IEEE 754 floating point uses a magnitude and sign representation, while integers on "normal" machines use 2’s complement arithmetic; how is this handled here?

    We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so its absolute value is at most 2^31; call it −a. Such a number is obviously made positive by adding the magic number, and the resulting value is 2^52 + 2^51 + (−a).

    Now, what do we get if we interpret the mantissa in 2's complement representation? It must be the result of the 2's complement sum of (2^52 + 2^51) and (−a). Again, the first term affects only the upper two bits; what remains in bits 0–50 is the 2's complement representation of (−a) (again, minus the upper two bits).

    Since reduction of a 2’s complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits gives us correctly (−a) in 32-bit, 2’s complement arithmetic.
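The sign argument can be checked with plain integer arithmetic. The sketch below is illustrative only, with hypothetical helper names, and assumes IEEE 754 doubles. It models the fraction field as 2^51 plus the integer, computed in wrapping 64-bit unsigned arithmetic, and compares it with the fraction field of the actual double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

#define FRACTION_MASK 0xFFFFFFFFFFFFFULL   /* low 52 bits */

/* Hypothetical helper: fraction field of (magic + a), read from the double. */
static uint64_t fraction_of_biased(int32_t a)
{
    double biased = 6755399441055744.0 + (double)a;   /* exact: an integer */
    uint64_t bits;
    memcpy(&bits, &biased, sizeof bits);
    return bits & FRACTION_MASK;
}

/* Hypothetical model: 2^51 + a as a wrapping (2's complement) 64-bit sum. */
static uint64_t fraction_model(int32_t a)
{
    return ((1ULL << 51) + (uint64_t)(int64_t)a) & FRACTION_MASK;
}

/* Cutting the model down to 32 bits recovers a. The uint32_t -> int32_t
   conversion wraps on mainstream compilers (implementation-defined in C). */
static int32_t truncate_model(int32_t a)
{
    return (int32_t)(uint32_t)fraction_model(a);
}
```

For a = −5, both `fraction_of_biased` and `fraction_model` give `0x7FFFFFFFFFFFB`, whose low 32 bits are `0xFFFFFFFB`, the 32-bit 2's complement encoding of −5.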
