Problem description
I am writing a program for embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires 64-bit double-precision addition and comparison. I am trying to emulate the double datatype using a tuple of two floats, so a double d will be emulated as a struct containing the tuple (float d.high, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit tricky because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry?
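The pair representation and the lexicographic comparison mentioned above could be sketched roughly as follows. The type name dfloat, the field names high and low, and the function df_compare are hypothetical names chosen for illustration. The sketch assumes each pair is normalized (high is the float nearest to the exact value high + low) and that neither operand is NaN; without normalization, comparing the high parts first can give the wrong answer.

```c
/* Hypothetical pair type: the represented value is high + low. */
typedef struct {
    float high;  /* leading part, carries most of the value */
    float low;   /* trailing correction, much smaller than high */
} dfloat;

/* Lexicographic comparison: returns -1 if a < b, 1 if a > b, 0 if equal.
   Assumes both pairs are normalized and contain no NaN. */
static int df_compare(dfloat a, dfloat b)
{
    if (a.high < b.high) return -1;
    if (a.high > b.high) return  1;
    /* High parts are equal, so the low parts decide. */
    if (a.low < b.low) return -1;
    if (a.low > b.low) return  1;
    return 0;
}
```

With normalized inputs, the pair (1.0f, 0.0f) compares less than (1.0f, 2^-30), even though both have the same high part.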
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
Recommended answer
Double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, at the cost of a slight reduction of the single-precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs; however, much of the material covered in these papers is applicable independent of platform, so it should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443 Guillaume Da Graça, David Defour, Implementation of float-float operators on graphics hardware, 7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf Andrew Thall, Extended-Precision Floating-Point Numbers for GPU Computation.
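As a rough illustration of how addition works in this kind of scheme, here is a minimal sketch built on the classic error-free addition (often called two-sum). There is no fixed base and no carry to detect: the low part simply stores the rounding error of the high part. The names dfloat, two_sum, quick_two_sum, and df_add are hypothetical, and the code is not taken verbatim from either paper. It assumes float expressions are evaluated in plain single precision and that value-changing optimizations (for example GCC's -ffast-math) are disabled, since those can optimize the error terms away.

```c
/* Hypothetical pair type: the represented value is high + low. */
typedef struct {
    float high;
    float low;
} dfloat;

/* Error-free addition: high = a + b rounded to float, low = the exact
   rounding error of that addition. Works for any ordering of a and b. */
static dfloat two_sum(float a, float b)
{
    dfloat r;
    float bv;
    r.high = a + b;
    bv = r.high - a;                        /* part of b absorbed into the sum */
    r.low = (a - (r.high - bv)) + (b - bv); /* what was lost to rounding */
    return r;
}

/* Cheaper variant, valid when |a| >= |b| (or a == 0). */
static dfloat quick_two_sum(float a, float b)
{
    dfloat r;
    r.high = a + b;
    r.low = b - (r.high - a);
    return r;
}

/* Adds two pairs and returns a normalized pair. */
static dfloat df_add(dfloat a, dfloat b)
{
    dfloat s = two_sum(a.high, b.high);  /* high parts, with exact error */
    dfloat t = two_sum(a.low, b.low);    /* low parts, with exact error */
    s.low += t.high;
    s = quick_two_sum(s.high, s.low);    /* renormalize */
    s.low += t.low;
    return quick_two_sum(s.high, s.low); /* renormalize again */
}
```

For example, adding the pairs (1.0f, 0.0f) and (2^-30, 0.0f) yields high = 1.0f and low = 2^-30, a sum that a single float would round to 1.0f.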