Problem description
I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
0 1 0 1 0 1 0 1
0 0 1 1 0 0 1 1
0 0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason I'm doing it this way is that computing the product of such a matrix and an 8-bit vector becomes really simple and efficient (just one bitwise AND and one parity computation per row), which is much better than calculating each bit individually.
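To make the "one AND plus one parity per row" idea concrete, here is a minimal sketch. The names `parity8` and `mat_vec_gf2` are hypothetical, and packing the result with row r in bit r is an assumption made for illustration, not something the question specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Parity (XOR of all bits) of an 8-bit value, computed by XOR folding.
inline uint8_t parity8(uint8_t v) {
    v ^= static_cast<uint8_t>(v >> 4);
    v ^= static_cast<uint8_t>(v >> 2);
    v ^= static_cast<uint8_t>(v >> 1);
    return static_cast<uint8_t>(v & 1u);
}

// Multiplies a bit matrix (one uint8_t per row) by an 8-bit vector over GF(2).
// Bit r of the result is the product for row r (assumed layout).
uint32_t mat_vec_gf2(const uint8_t* rows, std::size_t n_rows, uint8_t vec) {
    assert(n_rows <= 32);  // the result is packed into 32 bits
    uint32_t result = 0;
    for (std::size_t r = 0; r < n_rows; ++r) {
        // One AND selects the matching entries, one parity sums them mod 2.
        const uint8_t selected = static_cast<uint8_t>(rows[r] & vec);
        result |= static_cast<uint32_t>(parity8(selected)) << r;
    }
    return result;
}
```

With the example matrix above and the vector 0x01, every row has its last bit set, so all three result bits come out as 1.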
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without manually calculating each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that works with arbitrary-sized matrices, but I am also interested in algorithms that can only handle certain sizes.
Recommended answer
I've spent more time looking for a solution, and I've found some good ones.
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions, it is possible to process a 16×8 matrix at a time.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've received so far for this question.
The idea is simple:
- #include <emmintrin.h>
- Pack 16 uint8_t variables into an __m128i
- Use _mm_movemask_epi8 to get the MSBs of each byte, producing a uint16_t
- Use _mm_slli_epi64 to shift the 128-bit register left by one
- Repeat until you've got all 8 uint16_ts
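The steps above can be sketched roughly as follows. This is a minimal illustration that assumes an x86 target with SSE2. The function name `transpose16x8_sse2` is hypothetical, and the output bit order (row r ends up in bit r) follows from how `_mm_movemask_epi8` numbers bytes, which mirrors the bit order used in the question's example.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Transposes a 16x8 bit matrix.
// in[r]  : row r, with bit 7 holding column 0.
// out[c] : column c as 16 bits, with row r stored in bit r.
void transpose16x8_sse2(const uint8_t in[16], uint16_t out[8]) {
    assert(in != nullptr && out != nullptr);
    // Pack the 16 row bytes into one 128-bit register (unaligned load).
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    for (int c = 0; c < 8; ++c) {
        // Collect the MSB of every byte: byte r contributes bit r.
        out[c] = static_cast<uint16_t>(_mm_movemask_epi8(v));
        // Move the next column into the MSB position of every byte.
        // Bits leaking in from the neighbouring byte never reach bit 7
        // within the 7 shifts performed here, so they are never read.
        v = _mm_slli_epi64(v, 1);
    }
}
```

A caller would fill a 16-byte array with rows (unused rows set to zero) and read the eight 16-bit columns back from `out`.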
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPU (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any of the answers that suggest treating an 8×8 block as a uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply by such that the MSBs of the four bytes end up next to each other in the upper 32 bits of the result.
- Pack 4 uint8_ts into a 32-bit variable
- Mask the MSB of each byte (using 0x80808080)
- Multiply it by 0x02040810
- Take the 4 LSBs of the upper 32 bits of the multiplication

Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply by the magic number shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants, so shifting them does not incur any performance penalty. There's some trouble with the last series of 4 bits, because one LSB is lost; in that case I needed to shift the input left by 8 bits and use the same method as for the first series of 4 bits.
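Here is a rough sketch of the multiplication trick for one 4×8 block. The names `gather_column` and `transpose4x8` are hypothetical. The mask and magic number are shifted as described for N = 0..6. For N = 7 the shifted magic number would no longer fit in 32 bits, so this sketch moves bit 0 of each byte up to bit 7 and reuses the N = 0 constants; that is one possible workaround chosen for illustration, not necessarily the one used in the original code. The cast to uint64_t is only there to reach the upper 32 bits of the product.

```cpp
#include <cassert>
#include <cstdint>

// Gathers bit (7 - n) of each of the four bytes in `packed` into a 4-bit
// value. Result bit k comes from byte k (byte 0 = least significant byte).
inline uint32_t gather_column(uint32_t packed, unsigned n) {
    assert(n < 8);
    if (n == 7) {
        // 0x02040810 << 7 overflows 32 bits, so shift the input instead
        // (assumed workaround) and fall back to the n == 0 constants.
        packed <<= 7;
        n = 0;
    }
    const uint32_t mask  = 0x80808080u >> n;  // selects bit (7 - n) per byte
    const uint32_t magic = 0x02040810u << n;  // lines the bits up at 32..35
    const uint64_t product = static_cast<uint64_t>(packed & mask) * magic;
    return static_cast<uint32_t>(product >> 32) & 0xFu;
}

// Transposes a 4x8 block: in[r] is row r (bit 7 = column 0),
// out[c] holds column c in its low 4 bits, with row r in bit r.
void transpose4x8(const uint8_t in[4], uint8_t out[8]) {
    const uint32_t packed = static_cast<uint32_t>(in[0])
                          | static_cast<uint32_t>(in[1]) << 8
                          | static_cast<uint32_t>(in[2]) << 16
                          | static_cast<uint32_t>(in[3]) << 24;
    for (unsigned n = 0; n < 8; ++n) {
        out[n] = static_cast<uint8_t>(gather_column(packed, n));
    }
}
```

The trick works because the 16 partial products of the masked value and the magic number all land on distinct bit positions, so no carries disturb bits 32 to 35.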
If you do this with two 4×8 blocks, you get an 8×8 block done and can arrange the resulting bits so that everything goes into the right place.
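Combining two 4×8 blocks into an 8×8 transpose might look like the sketch below. All names are hypothetical. To keep the helper short, it shifts the input left by the column index and always uses the constants 0x80808080 and 0x02040810; this is a simplification for readability, at the cost of a run-time shift, rather than the constant-shifting scheme described above. Row r is placed in bit r of each output byte, which is an assumed bit order.

```cpp
#include <cassert>
#include <cstdint>

// Returns a 4-bit value whose bit k is bit (7 - col) of byte k of `packed`.
inline uint32_t column_nibble(uint32_t packed, unsigned col) {
    assert(col < 8);
    const uint64_t product =
        static_cast<uint64_t>((packed << col) & 0x80808080u) * 0x02040810u;
    return static_cast<uint32_t>(product >> 32) & 0xFu;
}

// Packs four consecutive row bytes into one 32-bit word (p[0] lowest).
inline uint32_t pack4(const uint8_t* p) {
    return static_cast<uint32_t>(p[0])
         | static_cast<uint32_t>(p[1]) << 8
         | static_cast<uint32_t>(p[2]) << 16
         | static_cast<uint32_t>(p[3]) << 24;
}

// Transposes an 8x8 bit matrix: in[r] is row r (bit 7 = column 0),
// out[c] is column c with row r stored in bit r.
void transpose8x8(const uint8_t in[8], uint8_t out[8]) {
    const uint32_t lo = pack4(in);      // rows 0..3
    const uint32_t hi = pack4(in + 4);  // rows 4..7
    for (unsigned c = 0; c < 8; ++c) {
        // Rows 0..3 fill the low nibble, rows 4..7 the high nibble.
        out[c] = static_cast<uint8_t>(column_nibble(lo, c)
                                      | (column_nibble(hi, c) << 4));
    }
}
```

If the mirrored bit order from the question's example is needed, the two nibbles and the bits within them would have to be arranged differently.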