Question
Every time I mention the slow performance of C++ standard library iostreams, I am met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (with full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management does give an order-of-magnitude improvement.
What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide implementations of iostreams that are competitive with manual buffer management?
To get matters moving, I've written a few short programs to exercise the internal buffering of iostreams:
- putting binary data into an ostringstream: http://ideone.com/2PPYw
- putting binary data into a char[] buffer: http://ideone.com/Ni5ct
- putting binary data into a vector<char> using back_inserter: http://ideone.com/Mj2Fi
- NEW: vector<char> simple iterator: http://ideone.com/9iitv
- NEW: putting binary data directly into stringbuf: http://ideone.com/qc9QA
- NEW: vector<char> simple iterator plus bounds check: http://ideone.com/YyrKy
Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.
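The linked programs are not reproduced here. As a rough sketch of the kind of comparison they make, assuming each append is a single int written as raw bytes and assuming a C++11 compiler, something like the following could be used. The names fill_ostringstream, fill_raw and time_ms are hypothetical and not taken from the linked code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: append `count` ints, one at a time, via ostringstream::write.
std::string fill_ostringstream(std::size_t count) {
    std::ostringstream out;
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        out.write(reinterpret_cast<const char*>(&value), sizeof value);
    }
    assert(out.good());  // every write should have succeeded
    return out.str();
}

// Hypothetical helper: produce the same bytes with memcpy into preallocated storage.
std::vector<char> fill_raw(std::size_t count) {
    std::vector<char> storage(count * sizeof(int));
    char* dest = storage.data();
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        std::memcpy(dest, &value, sizeof value);
        dest += sizeof value;  // advance the end-of-data pointer
    }
    return storage;
}

// Hypothetical helper: wall-clock time of one call to `f`, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms([] { fill_ostringstream(1000000); }) and time_ms([] { fill_raw(1000000); }) gives two figures to compare. An optimizer may discard work whose result is unused, so a careful benchmark would consume the returned data.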
On ideone, the ostringstream is about 3 times slower than std::copy + back_inserter + std::vector<char>, and about 15 times slower than memcpy into a raw buffer. This feels consistent with the before-and-after profiling from when I switched my real application to custom buffering.
These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse the observed slowness of C++ standard library iostreams.
It would be nice to see benchmarks on other systems and commentary on what common implementations do (such as GCC's libstdc++, Visual C++, Intel C++), and how much of the overhead is mandated by the standard.
A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. The real reason for performance-testing the internal buffering, though, applies to typical formatted I/O as well: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting too?
All of these timings are per iteration of the outer (k) loop.
On ideone (gcc-4.3.4, unknown OS and hardware):
- ostringstream: 53 ms
- stringbuf: 27 ms
- vector<char> and back_inserter: 17.6 ms
- vector<char> with ordinary iterator: 10.6 ms
- vector<char> iterator and bounds check: 11.4 ms
- char[]: 3.7 ms
On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):
- ostringstream: 73.4 ms, 71.6 ms
- stringbuf: 21.7 ms, 21.3 ms
- vector<char> and back_inserter: 34.6 ms, 34.4 ms
- vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
- vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
- char[]: 1.48 ms, 1.57 ms
Visual C++ 2010 x86, with Profile-Guided Optimization (cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure):
- ostringstream: 61.2 ms, 60.5 ms
- vector<char> with ordinary iterator: 1.04 ms, 1.03 ms
Same laptop, same OS, using cygwin gcc 4.3.4 (g++ -O3):
- ostringstream: 62.7 ms, 60.5 ms
- stringbuf: 44.4 ms, 44.5 ms
- vector<char> and back_inserter: 13.5 ms, 13.6 ms
- vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
- vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
- char[]: 3.57 ms, 3.75 ms
Same laptop, Visual C++ 2008 SP1 (cl /Ox /EHsc):
- ostringstream: 88.7 ms, 87.6 ms
- stringbuf: 23.3 ms, 23.4 ms
- vector<char> and back_inserter: 26.1 ms, 24.5 ms
- vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
- vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
- char[]: 1.52 ms, 1.25 ms
Same laptop, Visual C++ 2010 64-bit compiler:
- ostringstream: 48.6 ms, 45.0 ms
- stringbuf: 16.2 ms, 16.0 ms
- vector<char> and back_inserter: 26.3 ms, 26.5 ms
- vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
- vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
- char[]: 1.25 ms, 1.24 ms
I ran everything twice to see how consistent the results were. Pretty consistent, IMO.
NOTE: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.
Oops, I found a bug in the vector-with-ordinary-iterator version: the iterator wasn't being advanced, and therefore there were too many cache hits. I had been wondering how vector<char> was outperforming char[]. The fix didn't make much difference though; vector<char> is still faster than char[] under VC++ 2010.
Buffering of output streams requires three steps each time data is appended:
- Check that the incoming block fits the available buffer space.
- Copy the incoming block.
- Update the end-of-data pointer.
The latest code snippet I posted, "vector<char> simple iterator plus bounds check", not only does this, it also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that; it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output, and it is exactly what is needed to make a working in-memory buffer.
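The three steps, plus the grow-and-move case, can be sketched roughly as follows. This is not the code from the ideone link; GrowingBuffer and its members are hypothetical names, and the doubling growth policy is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical in-memory output buffer illustrating the three steps.
class GrowingBuffer {
public:
    explicit GrowingBuffer(std::size_t initial_capacity)
        : storage_(initial_capacity > 0 ? initial_capacity : 1), used_(0) {}

    void append(const void* data, std::size_t size) {
        // Step 1: check whether the incoming block fits the available space.
        if (size > storage_.size() - used_) {
            std::size_t new_capacity = storage_.size();
            while (size > new_capacity - used_) {
                new_capacity *= 2;  // assumed policy: double until it fits
            }
            storage_.resize(new_capacity);  // may reallocate and move existing data
        }
        // Step 2: copy the incoming block.
        if (size > 0) {
            std::memcpy(storage_.data() + used_, data, size);
        }
        // Step 3: update the end-of-data position.
        used_ += size;
        assert(used_ <= storage_.size());
    }

    std::size_t size() const { return used_; }
    const char* data() const { return storage_.data(); }

private:
    std::vector<char> storage_;  // backing storage; its size() is the capacity
    std::size_t used_;           // number of bytes appended so far
};
```

A file-backed buffer would replace the resize branch with a flush of the current contents followed by resetting used_ to zero, which is why the in-memory version is a reasonable upper bound.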
So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.
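For readers unfamiliar with using a stream buffer directly: the stringbuf variant skips ostream::write and calls the buffer's public sputn member, which forwards to the virtual xsputn. A minimal sketch, again assuming one int per append (fill_stringbuf is a hypothetical name, not the linked code):

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: append `count` ints directly into a std::stringbuf.
std::string fill_stringbuf(std::size_t count) {
    std::stringbuf buf(std::ios_base::out);
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        // sputn returns the number of characters actually stored.
        const std::streamsize written =
            buf.sputn(reinterpret_cast<const char*>(&value), sizeof value);
        assert(written == static_cast<std::streamsize>(sizeof value));
        (void)written;  // silence unused-variable warnings when NDEBUG is set
    }
    return buf.str();
}
```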
Recommended answer
This doesn't answer the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p. 68). Most relevant to your question is Section 6.1.2 ("Execution Speed"):
"Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case - by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5."
Since the report was written in 2006, one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.
As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running GProf on your ostringstream code compiled with GCC gives the following breakdown:
- 44.23% in std::basic_streambuf::xsputn(char const*, int)
- 34.62% in std::ostream::write(char const*, int)
- 12.50% in main
- 6.73% in std::ostream::sentry::sentry(std::ostream&)
- 0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
- 0.96% in std::basic_ostringstream::basic_ostringstream(std::_Ios_Openmode)
- 0.00% in std::fpos::fpos(long long)
So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++/bits/streambuf.tcc for the details).
My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data four bytes at a time, and incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation: consider how negligible the penalty would have been if write had been called once on an array of 1 million ints instead of 1 million times on one int. And in a real-life situation one would really appreciate the important features of IOStreams, namely its memory-safe and type-safe design. Such benefits come at a price, and you've written a test that makes these costs dominate the execution time.
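To make the contrast concrete, here is a small sketch of the two call patterns. Both produce identical bytes; only the number of write() calls differs. The function names are hypothetical, and no timings are claimed here.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>
#include <vector>

// One write() call per int: the per-call checks are paid values.size() times.
std::string write_per_element(const std::vector<int>& values) {
    std::ostringstream out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        out.write(reinterpret_cast<const char*>(&values[i]), sizeof(int));
    }
    assert(out.good());
    return out.str();
}

// A single write() call for the whole array: the per-call checks are paid once.
std::string write_whole_array(const std::vector<int>& values) {
    std::ostringstream out;
    if (!values.empty()) {
        out.write(reinterpret_cast<const char*>(values.data()),
                  static_cast<std::streamsize>(values.size() * sizeof(int)));
    }
    assert(out.good());
    return out.str();
}
```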