


English Academic Paper Assignment

Hybrid Parallel Programming on GPU Clusters

Abstract—Nowadays, NVIDIA’s CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions – a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach using hybrid CUDA and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA run by the processor cores in the same computational node.

Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming

I. INTRODUCTION

Nowadays, NVIDIA’s CUDA [1, 16] is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions – a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA [1, 16] to achieve dramatic speedups on production and research codes.

NVIDIA builds its CUDA chips out of hundreds of cores, and here we will try to use the computing devices NVIDIA provides for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications, but also keeps the application code portable.

In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP and MPI [3] programming, which partitions loop iterations according to the performance weighting of multi-core [4] nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.
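
As a rough illustration of this scheme, the following sketch (our own, not code from the paper) block-partitions N loop iterations across MPI processes and lets OpenMP threads work on the local chunk; compute(i) is a hypothetical stand-in for the real loop body, which in this paper would be offloaded to CUDA.

    /* Hypothetical sketch: block-partition N iterations across MPI processes,
     * then process each local chunk in parallel with OpenMP threads.        */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000L

    /* placeholder for the real per-iteration work */
    static double compute(long i) { return (double)i * 0.5; }

    int main(int argc, char *argv[])
    {
        int rank = 0, size = 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* contiguous block of iterations owned by this MPI process */
        long chunk = (N + size - 1) / size;
        long begin = rank * chunk;
        long end   = begin + chunk > N ? N : begin + chunk;

        double local = 0.0, total = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = begin; i < end; i++)
            local += compute(i);

        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }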

In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes only slave nodes do computation work. Empirical results show that in both heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.
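
The performance functions themselves are not listed in this section; as one hedged reading, the chunk assigned to a node at a scheduling step could simply be proportional to its estimated weight, as in the following sketch (w[] is a hypothetical weight array):

    /* Hypothetical helper: give node k a share of the remaining iterations
     * proportional to its performance weight w[k]; weights need not sum to 1. */
    long chunk_for_node(long remaining, const double *w, int nodes, int k)
    {
        double sum = 0.0;
        for (int i = 0; i < nodes; i++)
            sum += w[i];
        return (long)(remaining * (w[k] / sum));
    }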

The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application programs are presented. Concluding remarks and future work are given in Section 5.

II. BACKGROUND REVIEW

A. History of GPU and CUDA

In the past, we had to use more than one computer with multiple CPUs for parallel computing. As the history of display chips shows, early chips did not need much computation power; then, driven by games and by 3D graphics, 3D accelerator cards appeared, graphics processing gradually moved to a separate display chip, and eventually that chip grew into something similar to a CPU in its own right, namely the GPU. We know GPU computing can give us the answers we want, but why do we choose to use the GPU? A comparison of current CPUs and GPUs makes this clear. First, a CPU today has at most eight cores, while a GPU has grown to 260 cores; from the core count alone we can see that a GPU can run a large number of parallel programs, and even though each core runs at a relatively low frequency, we believe the aggregate parallel computing power is not weaker than that of a single-issue CPU. Next, the GPU has its own on-board memory, and comparing CPU access to main memory with GPU access to GPU memory, we find that the GPU's memory bandwidth is roughly 10 times that of the CPU, around 90 GB/s. This is a remarkable gap, and it means that computations that access large amounts of data can be improved considerably by a good GPU.

CPUs use advanced flow control such as branch prediction or delayed branches together with large caches to reduce memory access latency, whereas a GPU's caches are relatively small and its flow control is simple. The GPU's approach is instead to use a very large number of computing threads to cover up memory latency: suppose one GPU memory access takes 5 seconds; then 100 threads accessing memory simultaneously still take about 5 seconds in total, whereas if a CPU memory access takes 0.1 seconds, 100 accesses issued one after another take 10 seconds. Therefore, GPU parallel processing can hide memory latency even though each individual access is slower than on the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.

Therefore, the GPU has an advantage in arithmetic logic, and we try to use the many cores that NVIDIA makes available to help us with large amounts of computation, writing programs for those cores with the parallel programming API that NVIDIA Corporation provides.

Must we use the programming form provided by NVIDIA Corporation to run GPU computations? Not really. We can use NVIDIA CUDA, ATI CTM, or OpenCL (Open Computing Language) proposed by Apple. CUDA was developed earliest and has the most users at this stage, but CUDA only supports NVIDIA's own graphics cards; at this stage, however, almost all graphics cards used for GPU computing are from NVIDIA. ATI developed its own language, CTM, and Apple proposed OpenCL (Open Computing Language), which is now supported by both NVIDIA and ATI, while ATI has since given up CTM in favor of OpenCL. Because of how GPUs were used in the past, they usually supported only single-precision floating-point operations, and in science precision is a very important indicator; therefore, the computing graphics cards introduced this year have added support for double-precision floating-point operations.

B. CUDA Programming

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing [2] architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT [17] and CUBLAS [18]. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces including OpenGL [9] and DirectCompute. CUDA’s parallel programming model is designed to overcome the challenge of scaling to increasing numbers of cores while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization – that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability since each sub-problem can be scheduled to be solved on any of the available processor cores: A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.
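
As a minimal sketch (ours, not the paper's code) of how these three abstractions appear in CUDA C, the kernel below uses a grid of thread blocks, a per-block shared-memory tile, and __syncthreads() as the barrier; it assumes the array length is a multiple of the 256-thread block size.

    /* Each 256-thread block reverses its own tile of the array.
     * Launch example: reverse_tiles<<<n / 256, 256>>>(d_data, n);  */
    __global__ void reverse_tiles(float *data, int n)
    {
        __shared__ float tile[256];              /* per-block shared memory    */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = data[i];         /* cooperative load           */
        __syncthreads();                         /* barrier: tile fully loaded */
        int j = blockDim.x - 1 - threadIdx.x;    /* partner index in the tile  */
        if (i < n)
            data[i] = tile[j];                   /* write reversed element     */
    }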

C. CUDA Processing Flow

The CUDA processing flow is described in Figure 3 [16]. The first step is to copy data from main memory to GPU memory; second, the CPU instructs the GPU to start processing; third, the GPU executes in parallel on each core; finally, the result is copied from GPU memory back to main memory.
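
On the host side, these four steps map directly onto the CUDA runtime API; the short program below is a hedged sketch (array scaling is our own example, not one from the paper):

    /* Host-side sketch of the four-step flow of Figure 3:
     * (1) copy input to GPU memory, (2) CPU launches the kernel,
     * (3) GPU cores execute in parallel, (4) copy the result back. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            v[i] *= 2.0f;                                  /* step 3 */
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d = NULL;
        for (int i = 0; i < n; i++)
            h[i] = 1.0f;

        cudaMalloc((void **)&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   /* step 1 */
        scale<<<(n + 255) / 256, 256>>>(d, n);             /* step 2 */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   /* step 4 */

        printf("h[0] = %f\n", h[0]);                       /* expect 2.0 */
        cudaFree(d);
        free(h);
        return 0;
    }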

III. SYSTEM HARDWARE

A. Tesla C1060 GPU Computing Processor

The NVIDIA Tesla C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk-side that is much faster and more energy-efficient than a shared cluster in the data center. The NVIDIA Tesla C1060 computing processor board, which consists of 240 cores, is a PCI Express 2.0 form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 [15] is capable of 933 GFLOPs [13] of processing performance and comes standard with 4 GB of GDDR3 memory at 102 GB/s bandwidth.

A computer system with an available PCI Express ×16 slot is required for the Tesla C1060. For the best system bandwidth between the host processor and the Tesla C1060, it is recommended (but not required) that the Tesla C1060 be installed in a PCI Express ×16 Gen2 slot. The Tesla C1060 is based on the massively parallel, many-core Tesla processor, which is coupled with the standard CUDA C programming environment [14] to simplify many-core programming.
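
The device properties quoted above (240 cores, i.e. 30 multiprocessors with 8 cores each, and 4 GB of memory) can be checked at run time with the standard runtime query; a small sketch:

    /* Sketch: print the properties of each CUDA device on the node,
     * e.g. a Tesla C1060 should report 30 multiprocessors and ~4 GB. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; dev++) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, dev);
            printf("device %d: %s, %d multiprocessors, %.1f GB global memory\n",
                   dev, p.name, p.multiProcessorCount,
                   p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }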

B. Tesla S1070 GPU Computing System

The NVIDIA Tesla S1070 [12] computing system speeds the transition to energy-efficient parallel computing [2]. With 960 processor cores and a standard C compiler that simplifies application development, the Tesla S1070 scales to solve the world’s most important computing challenges more quickly and accurately. The NVIDIA Tesla S1070 Computing System is a 1U rack-mount system with four Tesla T10 computing processors. This system connects to one or two host systems via one or two PCI Express cables. A Host Interface Card (HIC) [5] is used to connect each PCI Express cable to a host. The host interface cards are compatible with both PCI Express 1x and PCI Express 2x systems.

The Tesla S1070 GPU computing system is based on the T10 GPU from NVIDIA. It can be connected to a single host system via two PCI Express connections to that host, or to two separate host systems via one connection to each host. Each NVIDIA switch and its corresponding PCI Express cable connects to two of the four GPUs in the Tesla S1070. If only one PCI Express cable is connected to the Tesla S1070, only two of the GPUs will be used. To connect all four GPUs in the S1070 to a single host system, the host must have two available PCI Express slots and be configured with two cables.
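
Since the S1070 exposes two or four GPUs to a host depending on the cabling, each MPI process on that host has to select one device explicitly; the following sketch (a plausible mapping of ours, not code from the paper) binds ranks to GPUs round-robin:

    /* Hypothetical sketch: map each MPI rank on a host to one of the GPUs
     * that the Tesla S1070 exposes through its PCI Express connections.   */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank = 0, ngpus = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaGetDeviceCount(&ngpus);        /* 2 or 4, depending on cabling */
        cudaSetDevice(rank % ngpus);       /* round-robin ranks over GPUs  */
        printf("rank %d uses GPU %d of %d\n", rank, rank % ngpus, ngpus);

        MPI_Finalize();
        return 0;
    }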

IV. CONCLUSIONS

In conclusion, we propose a parallel programming approach using hybrid CUDA and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070.

During the experiments, loop iterations assigned to one MPI process were processed in parallel by CUDA run by the processor cores in the same computational node. The experiments reveal that hybrid parallel programming of multi-core GPU clusters with CUDA, OpenMP and MPI is a powerful approach for composing high-performance clusters.

REFERENCES

[2] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski, and S. Turek, “Exploring weak scalability for FEM calculations on a GPU-enhanced cluster,” Parallel Computing, vol. 33, pp. 685-699, Nov. 2007.

[3] P. Alonso, R. Cortina, F.J. Martínez-Zaldívar, and J. Ranilla, “Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA,” Journal of Supercomputing.

[4] Francois Bodin and Stephane Bihan, “Heterogeneous multicore parallel programming for graphics processing units,” Scientific Programming, vol. 17, no. 4, pp. 325-336, Nov. 2009.

[5] NVIDIA Tesla S1070 GPU Computing System Specification.

[7] Message Passing Interface (MPI)

[8] MPICH, A Portable Implementation of MPI

[9] D. Shreiner, M. Woo, J. Neider, and T. Davis, OpenGL(R) Programming Guide: The Official Guide to Learning OpenGL(R), Addison-Wesley, Reading, MA, August 2005.

[10] (2008) Intel 64 Tesla Linux Cluster Lincoln webpage. [Online]. Available:

[11] Romain Dolbeau, Stéphane Bihan, and François Bodin, “HMPP: A Hybrid Multi-core Parallel Programming Environment.”

[12] The NVIDIA Tesla S1070 1U Computing System - Scalable Many Core Supercomputing for Data Centers

[13] Top 500 Super Computer Sites, What is Gflop/s,

[17] CUFFT, CUDA Fast Fourier Transform (FFT) library.

[18] CUBLAS, BLAS(Basic Linear Algebra Subprograms) on CUDA

