Problem description
Why does this bit of code,
const float x[16] = { 1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                      1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}
run more than 10 times faster than the following bit (identical except where noted)?
const float x[16] = { 1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                      1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}
when compiling with Visual Studio 2010 SP1. The optimization level was -O2 with SSE2 enabled. I haven't tested with other compilers.
Recommended answer
Welcome to the world of denormalized floating-point numbers! They can wreak havoc on performance.
Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating-point representation. Operations on denormalized floating-point values can be tens to hundreds of times slower than on normalized ones. This is because many processors can't handle them directly and must trap and resolve them using microcode.
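As a rough illustration of where the subnormal range sits for a 32-bit float, the sketch below classifies a few values with the standard std::fpclassify. The helper name float_class and the two constants are made up for this example and are not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <string>

// Hypothetical helper (not from the answer): names the IEEE 754 class of a float.
std::string float_class(float v) {
    switch (std::fpclassify(v)) {
        case FP_ZERO:      return "zero";
        case FP_SUBNORMAL: return "subnormal";
        case FP_NORMAL:    return "normal";
        case FP_INFINITE:  return "infinite";
        default:           return "nan";
    }
}

// Smallest positive normal float (about 1.18e-38).
const float smallest_normal = std::numeric_limits<float>::min();
// Smallest positive subnormal float (about 1.4e-45).
const float smallest_subnormal = std::numeric_limits<float>::denorm_min();
```

Anything positive and below smallest_normal, such as a value around 6.3e-44, lands in the subnormal class, while a value around 1.8e-7 is an ordinary normal float.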
If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1f is used.
Here's the test code compiled on x64:
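#include <cstdlib>   // system
#include <iostream>  // cout, endl
#include <omp.h>     // omp_get_wtime (requires OpenMP support)
using namespace std;

// Define FLOATING to build the 0.1f variant; leave it undefined for the 0 variant.
int main() {
    double start = omp_get_wtime();
    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif
            // Print the values once they have settled.
            if (j > 10000)
                cout << y[i] << " ";
        }
        if (j > 10000)
            cout << endl;
    }
    double end = omp_get_wtime();
    cout << end - start << endl;  // elapsed wall-clock time in seconds
    system("pause");
    return 0;
}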
Output:
#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
Note how in the second run the numbers are very close to zero.
Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
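One way to see why the two runs settle on such different values is to look at a single "+ c, - c" round trip in isolation. The sketch below is only an illustration: add_then_subtract is a hypothetical helper, and 1e-40f is an arbitrary subnormal chosen for the example. Adding 0.1f to a tiny value rounds the sum to 0.1f, so subtracting 0.1f again leaves exactly zero, whereas adding and subtracting 0 leaves the subnormal untouched.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: one "+ c, - c" round trip, as in the inner loop.
float add_then_subtract(float y, float c) {
    volatile float t = y + c;  // volatile forces the sum to be rounded to float
    return t - c;
}
```

With c = 0.1f the round trip can only return zero or a multiple of the spacing between floats near 0.1 (about 7.45e-9), which is consistent with the values printed in the first run. With c = 0 nothing is rounded away, so the values can keep shrinking into the subnormal range.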
To demonstrate that this has everything to do with denormalized numbers, we can flush denormals to zero by adding this to the start of the code:
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled; the macro is declared in <xmmintrin.h>.)
This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.
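A minimal sketch of the effect of this mode, assuming an x86 target where float arithmetic is done with SSE instructions (the default on x64): halving the smallest normal float normally gives a subnormal, but gives zero once flush-to-zero is on. The helper halve_smallest_normal is hypothetical; only the _MM_GET_FLUSH_ZERO_MODE and _MM_SET_FLUSH_ZERO_MODE macros from <xmmintrin.h> are real.

```cpp
#include <cassert>
#include <cfloat>
#include <xmmintrin.h>

// Hypothetical helper: halves the smallest normal float, which underflows
// into the subnormal range unless flush-to-zero is enabled.
float halve_smallest_normal(bool flush_to_zero) {
    const unsigned int saved = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);
    // volatile keeps the multiply from being folded at compile time or
    // moved outside the region where the chosen mode is active.
    volatile float tiny = FLT_MIN;
    volatile float half = 0.5f;
    volatile float result = tiny * half;
    _MM_SET_FLUSH_ZERO_MODE(saved);  // restore the caller's mode
    return result;
}
```

The mode is a per-thread setting in the MXCSR register, so the sketch saves and restores it instead of leaving it changed for the caller.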
Timings (seconds), Core i7 920 @ 3.5 GHz:
// Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669
// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406
In the end, this really has nothing to do with whether the constant is an integer or a floating-point value. The 0 or 0.1f is converted and stored into a register outside of both loops, so that has no effect on performance.