Abstract
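Algorithms and programs based on graphics processing units (GPUs) have opened a path to removing computational bottlenecks in quantum chemistry. The authors designed GPU-based quantum chemistry algorithms and programs that implement the evaluation of two-electron repulsion integrals, the construction of the Fock matrix, and the evaluation of exchange-correlation functionals in the Hartree-Fock method and density functional theory. Because the compute kernels are written in the OpenCL programming framework, the program can run on computing devices of various architectures. Tests of the individual computational modules and of molecular self-consistent-field calculations show that the OpenCL-based GPU program achieves a speedup of up to 148 times over a serial CPU program.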
Keywords: graphics processing unit, OpenCL, Hartree-Fock, density functional theory, direct self-consistent-field calculation
Graphics processing units (GPUs) have become a promising architecture for tackling many computational bottlenecks in quantum chemistry calculations. Herein, we present our new development on using GPUs to accelerate Hartree-Fock (HF) and density functional theory (DFT) calculations in the Beijing Density Functional (BDF) package. Our program uses the OpenCL platform and can therefore run on a variety of computing devices from different vendors, such as NVIDIA and AMD. All time-consuming steps in HF/DFT, such as the calculation of electron repulsion integrals (ERIs), the formation of the Fock matrix, and the exchange-correlation (XC) functional related work, have been implemented on the GPU. In our algorithm, the Coulomb and exchange matrices are calculated directly on the GPU by contracting the primitive ERIs with the density matrix. The 1T1PI (1 thread 1 primitive integral) algorithm, in which each thread evaluates one primitive ERI, is used to schedule the computational tasks on the GPU. To this end, the primitive Gaussian basis shell pairs μν are first prescreened and sorted according to their values. The Gaussian product theorem (GPT) is applied to each shell pair, and the intermediate values are calculated and copied into GPU memory for further use. Then, a one-dimensional mapping assigns 32 work items (threads) to one workgroup to calculate the J matrix elements, and the permutation symmetry of the primitive ERIs is fully utilized. To calculate the K matrix, a two-dimensional mapping is used and every 64 work items are packed into one workgroup. The permutation symmetry of exchanging the bra pair μλ and the ket pair νσ is ignored in order to reduce the expensive communication between different workgroups on the GPU. After a batch of Coulomb or exchange matrix elements is computed on the GPU, the results are copied back to the CPU and accumulated into the Fock matrix. The XC terms are calculated through a numerical procedure because of the complex form of the XC functionals.
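The shell-pair preprocessing described above (Gaussian product theorem, prescreening, sorting) can be illustrated with a minimal CPU-side sketch in Python. It is not the BDF implementation. It assumes s-type primitives only, and it assumes that the "value" used for prescreening and sorting is the GPT prefactor K; the abstract does not say which quantity is actually used. The names `gaussian_product`, `build_pair_list`, and `threshold` are hypothetical.

```python
import math

def gaussian_product(a, A, b, B):
    """Gaussian product theorem for two s-type primitives.

    exp(-a|r-A|^2) * exp(-b|r-B|^2) == K * exp(-p|r-P|^2)
    """
    p = a + b                                             # combined exponent
    P = tuple((a * x + b * y) / p for x, y in zip(A, B))  # weighted center
    ab2 = sum((x - y) ** 2 for x, y in zip(A, B))         # |A - B|^2
    K = math.exp(-a * b / p * ab2)                        # pair prefactor
    return p, P, K

def build_pair_list(primitives, threshold=1e-12):
    """Prescreen primitive pairs and sort them, largest prefactor first.

    ASSUMPTION: K is used as the pair magnitude; the source does not
    specify the screening quantity.
    `primitives` is a list of (exponent, center) tuples.
    """
    pairs = []
    for i, (a, A) in enumerate(primitives):
        for j in range(i + 1):  # j <= i exploits the mu <-> nu symmetry
            b, B = primitives[j]
            p, P, K = gaussian_product(a, A, b, B)
            if K >= threshold:  # drop negligible pairs
                pairs.append({"i": i, "j": j, "p": p, "P": P, "K": K})
    pairs.sort(key=lambda pair: pair["K"], reverse=True)
    return pairs

# Three illustrative primitives: two close together, one far away.
primitives = [
    (1.2, (0.0, 0.0, 0.0)),
    (0.5, (0.0, 0.0, 1.4)),
    (3.0, (0.0, 0.0, 8.0)),
]
pairs = build_pair_list(primitives, threshold=1e-10)
```

In a GPU code, a sorted list of this kind would be copied to device memory once, so that work items with neighboring indices handle pairs of similar magnitude. That motivation is a common design rationale and is stated here as an assumption, not as a claim from the abstract.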
We first pack the numerical grid points into batches of 128 points each. Then the Gaussian basis shells that are nonzero on each grid batch are sifted out. The grid batches and the basis function sieving indices are duplicated in CPU and GPU memory to avoid unnecessary communication between the CPU and the GPU. The computational tasks are scheduled dynamically on the GPU according to the grid batches. All steps, such as calculating the numerical grids and their weights, the electron density and density gradient, the XC functional and its derivative, and the XC energy and the matrix elements of the XC potential, are optimized step by step on the GPU. All calculations are carried out in 64-bit double precision to achieve the same numerical precision as on the CPU. Benchmark calculations are carried out on several different GPUs from NVIDIA and AMD to assess the performance of our code. The benchmark results indicate that the algorithm implemented on the GPU can achieve up to a 148-fold speedup over a serial CPU implementation, and that the total energy calculated on the GPU is as accurate as the result calculated on the CPU.
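The grid batching and basis-shell sieving steps can be sketched in plain Python as follows. The batch size of 128 comes from the abstract. Everything else is an assumption made for illustration: the cutoff criterion (a shell is kept when its most diffuse primitive exceeds `tol` on at least one point of the batch), the shell representation as `(center, min_exponent)`, and the names `make_batches`, `shell_cutoff_radius`, and `significant_shells`.

```python
import math

BATCH_SIZE = 128  # grid points per batch, as stated in the abstract

def make_batches(points, batch_size=BATCH_SIZE):
    """Split the grid points into consecutive batches."""
    return [points[k:k + batch_size] for k in range(0, len(points), batch_size)]

def shell_cutoff_radius(min_exponent, tol=1e-10):
    """Radius beyond which exp(-alpha * r^2) drops below tol.

    ASSUMPTION: only the most diffuse primitive decides the cutoff.
    """
    return math.sqrt(-math.log(tol) / min_exponent)

def significant_shells(batch, shells, tol=1e-10):
    """Indices of shells that are non-negligible on at least one batch point."""
    keep = []
    for idx, (center, min_exponent) in enumerate(shells):
        r_cut = shell_cutoff_radius(min_exponent, tol)
        if any(math.dist(pt, center) <= r_cut for pt in batch):
            keep.append(idx)
    return keep

# Illustrative data: 300 points along the z axis and three shells.
points = [(0.0, 0.0, 0.1 * k) for k in range(300)]
shells = [
    ((0.0, 0.0, 0.0), 2.0),    # compact shell at the origin
    ((0.0, 0.0, 20.0), 2.0),   # compact shell far from the origin
    ((0.0, 0.0, 10.0), 0.05),  # diffuse shell in the middle
]
batches = make_batches(points)
sieve = [significant_shells(batch, shells) for batch in batches]
```

The per-batch index lists in `sieve` play the role of the sieving indices that the abstract says are kept in both CPU and GPU memory; each batch then only touches the shells listed for it.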
Key words: GPU, OpenCL, Hartree-Fock, density functional theory, direct self-consistent-field calculation