Question
I started working with OpenMP using C++.
I have two questions:
- What is #pragma omp for schedule?
- What is the difference between dynamic and static?
Please explain with examples.
Recommended answer
Others have since answered most of the question, but I would like to point out some specific cases where a particular scheduling type is better suited than the others. The schedule controls how loop iterations are divided among threads. Choosing the right schedule can have a great impact on the speed of the application.
static schedule means that blocks of iterations are mapped statically to the execution threads in a round-robin fashion. The nice thing about static scheduling is that the OpenMP run-time guarantees the following: if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. In the second loop the same thread can then access the same memory location faster, since it resides on the same NUMA node.
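The repeatability described above can be checked with a small sketch. The function name static_mapping_is_repeatable and the bound MAX_ITER are hypothetical, introduced only for illustration. Both loops are kept inside one parallel region so that the thread team is certainly the same for both, and the build falls back to a single thread when OpenMP is not enabled.

```c
#include <assert.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_ITER 64  /* hypothetical upper bound, chosen for illustration */

/* Number of the calling thread, or 0 when built without OpenMP. */
static int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Runs two identical static loops and records which thread executed each
 * iteration. Returns 1 if both loops used the same iteration-to-thread
 * mapping, 0 otherwise. */
int static_mapping_is_repeatable(int n)
{
    assert(n > 0 && n <= MAX_ITER);
    int first[MAX_ITER] = {0};
    int second[MAX_ITER] = {0};

    #pragma omp parallel
    {
        #pragma omp for schedule(static,1)
        for (int i = 0; i < n; i++)
            first[i] = current_thread();

        #pragma omp for schedule(static,1)
        for (int i = 0; i < n; i++)
            second[i] = current_thread();
    }

    return memcmp(first, second, (size_t)n * sizeof first[0]) == 0;
}
```

Built with an OpenMP flag such as gcc -fopenmp, a call like static_mapping_is_repeatable(16) should return 1 for any team size. Replacing static with dynamic in both loops removes that guarantee.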
Imagine there are two NUMA nodes, node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:
|             | core 0 | thread 0 |
| socket 0    | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
|             | core 3 | thread 3 |

|             | core 4 | thread 4 |
| socket 1    | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
|             | core 7 | thread 7 |
Each core can access memory from each NUMA node, but remote access is slower than local node access (1.5x - 1.9x slower on Intel). You run something like this:
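/* needs <stdlib.h> for malloc() and <string.h> for memset() */
char *a = (char *)malloc(8*4096);

/* each of the 8 threads zeroes exactly one 4096-byte page */
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 0, 4096);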
Here 4096 bytes is the standard size of one memory page on Linux on x86 when huge pages are not used. This code zeroes the whole 32 KiB array a. The malloc() call only reserves virtual address space and does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does).
The array is now contiguous, but only in virtual memory. In physical memory, half of it lies in the memory attached to socket 0 and half in the memory attached to socket 1. This happens because different parts are zeroed by different threads, those threads reside on different cores, and the first touch NUMA policy applies: a memory page is allocated on the NUMA node of the thread that first "touched" it.
|             | core 0 | thread 0 | a[0]     ... a[4095]
| socket 0    | core 1 | thread 1 | a[4096]  ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192]  ... a[12287]
|             | core 3 | thread 3 | a[12288] ... a[16383]

|             | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1    | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
|             | core 7 | thread 7 | a[28672] ... a[32767]
Now let's run another loop like this:
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 1, 4096);
Each thread accesses physical memory that is already mapped, with the same mapping of thread to memory region as in the first loop. Threads therefore only access memory located in their local memory blocks, which is fast.
Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This "chops" the iteration space into blocks of two iterations, giving 4 such blocks in total. The result is the following mapping of threads to memory locations (through the iteration number):
|             | core 0 | thread 0 | a[0]     ... a[8191]  <- OK, same memory node
| socket 0    | core 1 | thread 1 | a[8192]  ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
|             | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory

|             | core 4 | thread 4 | <idle>
| socket 1    | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
|             | core 7 | thread 7 | <idle>
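Two bad things happen here:
- threads 4 to 7 remain idle, so half of the compute capability is lost;
- threads 2 and 3 access non-local memory and take about twice as long to finish, during which time threads 0 and 1 remain idle.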
So one of the advantages of static scheduling is that it improves locality in memory access. The disadvantage is that a bad choice of scheduling parameters can ruin performance.
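The two mapping tables above can be reproduced with a short sketch of the round-robin rule: under schedule(static, chunk), chunk number i / chunk goes to thread (i / chunk) % nthreads. The helper names static_owner, node_of_thread and remote_iterations are hypothetical, and the node layout (threads 0-3 on node 0, threads 4-7 on node 1) is the one assumed in the example, not something OpenMP guarantees.

```c
#include <assert.h>

/* Thread that executes iteration i under schedule(static, chunk) with
 * nthreads threads: chunks are handed out round-robin by thread number. */
int static_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}

/* NUMA node of a thread in the two-socket example above:
 * threads 0-3 sit on node 0, threads 4-7 on node 1 (an assumption). */
int node_of_thread(int tid)
{
    return tid / 4;
}

/* Counts the iterations whose page was first touched under
 * schedule(static,1) on one node but is later processed, under
 * schedule(static,chunk), by a thread on the other node. */
int remote_iterations(int n, int chunk, int nthreads)
{
    int remote = 0;
    for (int i = 0; i < n; i++) {
        int first_touch_node = node_of_thread(static_owner(i, 1, nthreads));
        int second_loop_node = node_of_thread(static_owner(i, chunk, nthreads));
        if (first_touch_node != second_loop_node)
            remote++;
    }
    return remote;
}
```

With 8 iterations and 8 threads, remote_iterations(8, 1, 8) gives 0 and remote_iterations(8, 2, 8) gives 4, matching the four pages that threads 2 and 3 reach remotely in the second table.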
dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely would) produce completely different "iteration space" -> "threads" mappings, as one can easily verify:
$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
  int i;

  #pragma omp parallel num_threads(8)
  {
    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
  }

  return 0;
}
$ icc -openmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4
(The same behaviour is observed when gcc is used instead.)
If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. This fact is often overlooked, and suboptimal performance is the result.
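One way to arrive at the 1/70 figure, assuming each of the 8 threads picks up exactly one iteration and every assignment is equally likely: locality survives only if the 4 iterations whose pages live on node 0 land on the 4 threads of node 0, which is one of C(8,4) = 70 ways to choose those iterations. This reading of the number is an assumption, and the helper names binomial and locality_probability are hypothetical.

```c
#include <assert.h>

/* Binomial coefficient C(n, k). After step j the running value equals
 * C(n - k + j, j), so every division is exact. */
unsigned long binomial(unsigned n, unsigned k)
{
    assert(k <= n);
    unsigned long result = 1;
    for (unsigned j = 1; j <= k; j++)
        result = result * (n - k + j) / j;
    return result;
}

/* Chance that a random one-iteration-per-thread assignment keeps every
 * page on its original node, for `threads` threads with `per_node`
 * threads on the first of two nodes. */
double locality_probability(unsigned threads, unsigned per_node)
{
    return 1.0 / (double)binomial(threads, per_node);
}
```

For the example machine, binomial(8, 4) is 70, so locality_probability(8, 4) is about 0.0143.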
There is another reason to choose between static and dynamic scheduling: workload balancing. If the time each iteration takes varies widely around the mean, a high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first one, and hence the first thread will be idle for 2/3 of the compute time. A dynamic schedule introduces some additional overhead but in that particular case leads to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each thread as the work progresses.
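The factor of three can be checked with a few lines of arithmetic. The sketch assumes a hypothetical cost model in which iteration i costs i + 1 units and the iteration space is split into two contiguous halves; block_cost and two_thread_imbalance are illustrative names.

```c
#include <assert.h>

/* Total cost of iterations [begin, end) when iteration i costs i + 1
 * units (a hypothetical linear cost model). */
long block_cost(long begin, long end)
{
    assert(0 <= begin && begin <= end);
    long cost = 0;
    for (long i = begin; i < end; i++)
        cost += i + 1;
    return cost;
}

/* Work of the second thread divided by work of the first thread when
 * n iterations are split into two contiguous halves. */
double two_thread_imbalance(long n)
{
    assert(n >= 2);
    long half = n / 2;
    return (double)block_cost(half, n) / (double)block_cost(0, half);
}
```

For n = 1000 the first half costs 125250 units and the second half 375250, a ratio just under 3. The first thread therefore finishes after about one third of the total run time and waits for the rest.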
Since precompiled code can be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This makes it possible to test different scheduling types without recompiling the application and also lets the end user fine-tune for his or her platform.
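A minimal sketch of a loop that defers the choice to OMP_SCHEDULE follows. The function name sum_with_runtime_schedule is hypothetical. The result is the same under every schedule, so only the timing changes.

```c
#include <assert.h>

/* Sums 0 .. n-1. The schedule type and chunk size are read from the
 * OMP_SCHEDULE environment variable when the loop starts. */
long sum_with_runtime_schedule(long n)
{
    assert(n >= 0);
    long sum = 0;

    #pragma omp parallel for schedule(runtime) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;

    return sum;
}
```

A program calling this function and built with OpenMP enabled could then be run as OMP_SCHEDULE="dynamic,4" ./prog or OMP_SCHEDULE="guided" ./prog (the program name is illustrative) to compare schedules without recompiling.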