Problem description
I am writing a program for embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires 64-bit double-precision addition and comparison. I am trying to emulate the double datatype using a tuple of two floats, so a double d will be emulated as a struct containing the tuple (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit trickier because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry? How can this be done?
Edit (clarity): I need the extra significant digits rather than the extra range.
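The lexicographic comparison mentioned in the question can be sketched as follows. This is only an illustration under stated assumptions: the type name df64, the field names hi and lo, and the function df64_cmp are hypothetical, and the ordering is only valid when every pair is kept normalized (hi is the float nearest to hi + lo) and contains no NaNs.

```c
/* Hypothetical double-float type: the represented value is hi + lo.
   Assumption: pairs are normalized, i.e. hi is the float nearest to
   hi + lo, so |lo| is at most half a unit in the last place of hi. */
typedef struct {
    float hi;   /* leading part */
    float lo;   /* trailing part (rounding error of hi) */
} df64;

/* Lexicographic comparison of two normalized pairs.
   Returns -1 if a < b, 0 if a == b, 1 if a > b.
   Assumption: neither operand contains a NaN. */
static int df64_cmp(df64 a, df64 b)
{
    if (a.hi < b.hi) return -1;
    if (a.hi > b.hi) return  1;
    /* Leading parts are equal: the trailing parts decide. */
    if (a.lo < b.lo) return -1;
    if (a.lo > b.lo) return  1;
    return 0;
}
```

For example, comparing {1.0f, 1.0e-9f} with {1.0f, -1.0e-9f} yields 1, because the leading parts tie and the first trailing part is larger. If the pairs are not normalized, the same value can have several representations and this comparison may give the wrong answer.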
Recommended answer
Double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, at the cost of a slight reduction of the single-precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs. Much of the material covered in these papers is applicable independent of platform, so it should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443 Guillaume Da Graça, David Defour. Implementation of float-float operators on graphics hardware. 7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf Andrew Thall. Extended-Precision Floating-Point Numbers for GPU Computation.
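As a rough illustration of how addition works in this representation, here is a minimal sketch built from the classic error-free transformations (Knuth's two-sum and Dekker's fast two-sum). There is no fixed base and no carry flag in this scheme: the low word simply holds the rounding error of the high word. The names df64, two_sum, quick_two_sum, and df64_add are hypothetical and are not taken verbatim from the papers above. The sketch assumes IEEE 754 round-to-nearest float arithmetic with no extended-precision intermediates, and it must be compiled without options such as -ffast-math that let the compiler reassociate the expressions.

```c
/* Hypothetical double-float type: the represented value is hi + lo. */
typedef struct {
    float hi;   /* leading part */
    float lo;   /* trailing part */
} df64;

/* Knuth's two-sum: returns s = fl(a + b) and the exact rounding
   error e, so that s + e == a + b (barring overflow). */
static df64 two_sum(float a, float b)
{
    float s  = a + b;
    float bv = s - a;            /* portion of b absorbed into s */
    float av = s - bv;           /* portion of a absorbed into s */
    df64 r = { s, (a - av) + (b - bv) };
    return r;
}

/* Dekker's fast two-sum: same result, but only valid when
   |a| >= |b| (or a == 0). */
static df64 quick_two_sum(float a, float b)
{
    float s = a + b;
    df64 r = { s, b - (s - a) };
    return r;
}

/* Double-float addition: add the leading and trailing parts with
   two_sum, then fold the error terms back in and renormalize. */
static df64 df64_add(df64 a, df64 b)
{
    df64 s = two_sum(a.hi, b.hi);
    df64 t = two_sum(a.lo, b.lo);
    s.lo += t.hi;
    s = quick_two_sum(s.hi, s.lo);
    s.lo += t.lo;
    s = quick_two_sum(s.hi, s.lo);
    return s;
}
```

For instance, adding {1.0f, 0.0f} and {0x1p-30f, 0.0f} gives {1.0f, 0x1p-30f}, whereas a plain float addition of 1.0f and 0x1p-30f would round back to 1.0f and lose the small term.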