Problem description
I have N points in D dimensions, where, say, N is 1 million and D is 100. All my points have binary coordinates, i.e. they lie in {0, 1}^D, and I am only interested in speed.
Currently my implementation uses std::vector<int>. I am wondering whether I could get faster execution by changing my data structure. I only do insertions and searches (I never change the bits).
All the related questions I found mention std::vector<char>, std::vector<bool> and std::bitset, but they all focus on the space savings these structures give.
What is the appropriate data structure for binary data in C++ when speed is the main concern?
I intend to populate my data structure with the binary data and then do a lot of contiguous searches (that is, I don't really care about the i-th coordinate of a point; whenever I access a point, I read all of its coordinates in sequence). I will compute the Hamming distances between the points.
Recommended answer
Locality of reference will likely be the driving force. So the natural choice is to represent the D coordinates of a single point as one contiguous bit vector. std::bitset<D> would be a logical choice.
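As a minimal sketch of that representation, assuming D = 100 as in the question: each point becomes one std::bitset<D>, and the Hamming distance between two points is the number of set bits in their XOR. The names Point, hamming and make_point are illustrative and not from the original post; make_point is a hypothetical helper for converting the existing std::vector<int> data.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;   // number of dimensions, as in the question
using Point = std::bitset<D>;    // one point = D bits packed contiguously

// Hamming distance: XOR marks the coordinates that differ, count() tallies them.
inline std::size_t hamming(const Point& a, const Point& b) {
    return (a ^ b).count();
}

// Hypothetical helper: pack a 0/1 coordinate vector (e.g. the current
// std::vector<int> representation) into a Point.
inline Point make_point(const std::vector<int>& coords) {
    assert(coords.size() == D);
    Point p;
    for (std::size_t i = 0; i < D; ++i) {
        p[i] = (coords[i] != 0);
    }
    return p;
}
```

Storing all points in a std::vector<Point> then keeps the whole data set in one contiguous allocation. On mainstream compilers count() usually maps to hardware popcount instructions when the target supports them, but that is an implementation detail and not a guarantee of the standard.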
However, the next important thing to realize is that locality benefits show up easily for working sets of up to about 4KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group points into blocks of 4KB each, and compare those blocks against each other. Both ways are O(N*N), but the second will be much faster.
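The grouping idea can be sketched as a blocked (tiled) loop over all pairs. This is only an illustration under stated assumptions: points are std::bitset<100>, the block size is derived from sizeof(Point) (implementation-defined, commonly 16 bytes on 64-bit platforms, which gives about 256 points per 4KB block), and the function sum_pairwise_blocked with its "sum of all distances" result is a placeholder for whatever you actually do with each distance.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

// Points per block so that one block occupies roughly 4 KB (assumed target size).
constexpr std::size_t kBlock = 4096 / sizeof(Point);
static_assert(kBlock > 0, "a single point must fit in one block");

// Sums the Hamming distance over all unordered pairs, visiting the pairs
// block by block so that both blocks stay hot in cache while they are compared.
std::uint64_t sum_pairwise_blocked(const std::vector<Point>& pts,
                                   std::size_t block = kBlock) {
    assert(block > 0);
    const std::size_t n = pts.size();
    std::uint64_t total = 0;
    for (std::size_t ib = 0; ib < n; ib += block) {
        const std::size_t iend = std::min(ib + block, n);
        for (std::size_t jb = ib; jb < n; jb += block) {
            const std::size_t jend = std::min(jb + block, n);
            for (std::size_t i = ib; i < iend; ++i) {
                // On the diagonal block start at i + 1 so each pair is counted once.
                const std::size_t jstart = (jb == ib) ? i + 1 : jb;
                for (std::size_t j = jstart; j < jend; ++j) {
                    total += (pts[i] ^ pts[j]).count();
                }
            }
        }
    }
    return total;
}
```

The work is still O(N*N); only the order in which pairs are visited changes. The block parameter is exposed so the size can be tuned by measurement, since the best value depends on the cache hierarchy of the machine.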
You may be able to beat O(N*N) by using the triangle inequality: Hamming(a,b) + Hamming(b,c) >= Hamming(a,c). I'm just not sure how. It probably depends on how you want your output. The naive output would be an N*N set of distances, and that is unavoidably O(N*N).
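One possible way to use the triangle inequality, which the answer above leaves open, is pivot-based pruning for a radius query. This sketch assumes a different output from the full N*N table: all points within Hamming distance r of a query q. Rearranging the inequality gives |Hamming(q,p) - Hamming(p,x)| <= Hamming(q,x) for any pivot p, so a point x can be skipped when that lower bound already exceeds r. The function within_radius, the single pivot, and the precomputed pivot_dist table are all assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

inline std::size_t hamming(const Point& a, const Point& b) {
    return (a ^ b).count();
}

// Returns the indices of all points within Hamming distance r of q.
// pivot_dist[i] must hold hamming(pivot, pts[i]), computed once up front.
std::vector<std::size_t> within_radius(const std::vector<Point>& pts,
                                       const std::vector<std::size_t>& pivot_dist,
                                       const Point& pivot,
                                       const Point& q,
                                       std::size_t r) {
    assert(pivot_dist.size() == pts.size());
    const std::size_t dq = hamming(q, pivot);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const std::size_t dp = pivot_dist[i];
        // Lower bound on hamming(q, pts[i]) from the triangle inequality.
        const std::size_t lower = (dq > dp) ? dq - dp : dp - dq;
        if (lower > r) continue;  // cannot be within r, skip the exact check
        if (hamming(q, pts[i]) <= r) hits.push_back(i);
    }
    return hits;
}
```

With 100-bit points the exact distance is already only a couple of XOR and popcount operations, so the bound check alone may save little. It is more likely to pay off if the points are sorted by pivot_dist so that whole ranges can be skipped, and that would need to be measured.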