Acceleration of Lattice Boltzmann Simulation via OpenACC
Shuhao Guo1,2,Jie Wu1,2
(1.State Key Laboratory of Mechanics and Control of Mechanical Structures,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China;2.Department of Aerodynamics,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China)
Abstract:
The parallel acceleration of well-developed serial codes for numerical simulations of fluid dynamic problems is implemented in this paper. To solve the flow field, the lattice Boltzmann method (LBM) is used. The OpenACC (OpenACCelerator) application programming interface, which is a relatively new standard for parallel computation, is adopted to perform the acceleration. By parallelizing the computation intensive loops and decreasing some meaningless data movement, the serial codes based on LBM are obviously accelerated. Several benchmark problems are considered and then simulated through the OpenACC programming model to assess performance and computational efficiency. After optimization, a reasonable speedup is obtained in comparison with the original version.
Key words: OpenACC lattice Boltzmann simulation parallel computing
DOI:10.11916/j.issn.1005-9113.18051
Clc Number:O368
Fund:
Shuhao Guo, Jie Wu. Acceleration of Lattice Boltzmann Simulation via OpenACC[J]. Journal of Harbin Institute of Technology (New Series), 2018, 25(5): 44-52. DOI: 10.11916/j.issn.1005-9113.18051.
Fund Sponsored by the Research Fund of State Key Laboratory of Mechanics and Control of Mechanical Structures (Nanjing University of Aeronautics and Astronautics) (Grant No. MCMS-0117G01) Corresponding author Jie Wu, Winner of the National Science Fund for Excellent Young Scholars. E-mail: wuj@nuaa.edu.cn Article history Received: 2018-05-15
Contents Abstract Full text Figures/Tables PDF
Acceleration of Lattice Boltzmann Simulation via OpenACC
Shuhao Guo1,2, Jie Wu1,2
1. State Key Laboratory of Mechanics and Control of Mechanical Structures, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China;
2. Department of Aerodynamics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Received: 2018-05-15
Fund: Sponsored by the Research Fund of State Key Laboratory of Mechanics and Control of Mechanical Structures (Nanjing University of Aeronautics and Astronautics) (Grant No. MCMS-0117G01)
Corresponding author: Jie Wu, Winner of the National Science Fund for Excellent Young Scholars. E-mail: wuj@nuaa.edu.cn
Abstract: The parallel acceleration of well-developed serial codes for numerical simulations of fluid dynamic problems is implemented in this paper. To solve the flow field, the lattice Boltzmann method (LBM) is used. The OpenACC (OpenACCelerator) application programming interface, which is a relatively new standard for parallel computation, is adopted to perform the acceleration. By parallelizing the computation intensive loops and decreasing some meaningless data movement, the serial codes based on LBM are obviously accelerated. Several benchmark problems are considered and then simulated through the OpenACC programming model to assess performance and computational efficiency. After optimization, a reasonable speedup is obtained in comparison with the original version.
Key words: OpenACC lattice Boltzmann simulation parallel computing
1 Introduction With the rapid growth of simulation scale, there exists an urgent demand for higher computational efficiency. According to the Moore's law, the performance of computer chips will double every two years until the physical size of microprocessor reaches tens of nanometer. At present, the most advanced microprocessor is reaching this limit, and the improvement of the chip's performance is slowing down. As a result, traditional multi-core CPUs computers can no longer satisfy researchers′ requests for large computing capability. Therefore, heterogeneous architectures, which combine traditional multi-core CPUs with highly efficient accelerators[1], come into service. Among these accelerators, the graphics processing unit (GPU) has gained great attention because of its remarkable computing power. For example, the NVIDIA Tesla K80 GPU accelerator is capable of processing nearly 3000 G floating-point operations per second, so it is an order of magnitude higher compared with the contemporaneous Intel CPU processor.
During the past two decades, the lattice Boltzmann method (LBM)[2] has evolved into a promising model to simulate various fluid dynamic problems. Considering its algorithmic simplicity, easy boundary treatment and intrinsic nature of parallel computation, LBM can be a high-efficiency tool in fluid simulation. Thus, there exists a great potential to explore in massively parallel programming of LBM by using GPU.
However, the architecture of GPU is a far cry from CPU, which means programmers must be familiar with the hardware architectures before starting the parallel programming. At present, CUDA (Compute Unified Device Architecture) from NVIDIA is widely used to transform the serial code into a parallel one. That's quite a challenge for most researchers. Extra time will be consumed to master a GPU programming language and much more time is needed to rewrite the serial code thoroughly, not to mention the various platforms with uncertain prospect on the market. Whilst, OpenACC (OpenACCelerator)[3] is a platform independent, high level, and directive based programming standard. After adding some OpenACC directives as annotations, the compilers are capable of understanding the programmer's intentions, and then locate and offload the computationally intensive region to the GPU accelerator. Automatically, it will take full advantage of accelerator's extraordinary compute capability. By using the OpenACC directives, users can implement the parallel computation easily. In addition, they also can freely compile the same code and conduct computations on either CPU or GPU from different vendors. To the best of our knowledge, there already exists some research in utilizing the GPU accelerators in parallelizing the LBM simulation codes[4-5], but very few implementations of LBM together with OpenACC acceleration can be found in Refs. [6-9]. On the other hand, some previous studies have shown the great potential of OpenACC in the field of fluid simulation[10-14]. It motivates us to continue to apply OpenACC to the acceleration of LB simulations.
This paper is structured as follows: Section 2 introduces the numerical methods used briefly. Section 3 describes OpenACC programming model and our efforts in OpenACC implementation and optimization. Section 4 analyzes the benchmark performance results, and all the work is summarized in Section 5.
2 Numerical Methods 2.1 Lattice Boltzmann Method To simulate the two-dimensional viscous and incompressible flow by using LBM, the following equation is solved, which can be written as[2]
$\begin{array}{*{20}{c}}{{f_\alpha }\left( {\mathit{\boldsymbol{x}} + {\mathit{\boldsymbol{e}}_\alpha }\delta t,t + \delta t} \right) - {f_\alpha }\left( {\mathit{\boldsymbol{x}},t} \right) = }\\{ - \frac{1}{\mathit{\boldsymbol{\tau }}}\left[ {{f_\alpha }\left( {\mathit{\boldsymbol{x}},t} \right) - f_\alpha ^{eq}\left( {\mathit{\boldsymbol{x}},t} \right)} \right]}\end{array}$ (1)
In Eq.(1), the right hand side means the collision sub-process, and the left hand side represents the stream sub-process. In addition, fα is the density distribution function, fαeq is the equilibrium status, τ is the single relaxation time, δt is the time step, and eα is the lattice velocity. In this paper, the D2Q9 lattice model is used and its lattice velocity set is
${\mathit{\boldsymbol{e}}_\alpha } = \left\{ {\begin{array}{*{20}{c}}\begin{array}{l}\left( {0,0} \right),\\\left( { \pm 1,0} \right),\left( {0, \pm 1} \right),\\\left( { \pm 1, \pm 1} \right),\end{array}&\begin{array}{l}\alpha = 0\\\alpha = 1 - 4\\\alpha = 5 - 8\end{array}\end{array}} \right.$ (2)
and then the equilibrium distribution function is
$\begin{array}{l}f_\alpha ^{eq}\left( {\mathit{\boldsymbol{x}},t} \right) = \\\;\;\;\;\;\rho {w_\alpha }\left[ {1 + \frac{{{\mathit{\boldsymbol{e}}_\alpha } \cdot \mathit{\boldsymbol{u}}}}{{c_s^2}} + \frac{{{{\left( {{\mathit{\boldsymbol{e}}_\alpha } \cdot u} \right)}^2} - {{\left( {{c_s}\left| \mathit{\boldsymbol{u}} \right|} \right)}^2}}}{{2c_s^4}}} \right]\end{array}$ (3)
where ρ is flow density, u is the flow velocity vector, and cs2=1/3. wα is the coefficient, w0=4/9, w1~4=1/9 and w5~8=1/36. The relationship between the relaxation time and the kinematic viscosity of fluid is υ=(τ-0.5)cs2δt.
2.2 Immersed Boundary Method When the object is introduced into the flow field, the immersed boundary method (IBM) can be used to deal with the boundary of object. In the framework of LBM, therefore, the resultant governing equations of immersed boundary-lattice Boltzmann method (IB-LBM) can be written as[15]
$\begin{array}{*{20}{c}}{{f_\alpha }\left( {\mathit{\boldsymbol{x}} + {\mathit{\boldsymbol{e}}_\alpha }\delta t,t + \delta t} \right) - {f_\alpha }\left( {\mathit{\boldsymbol{x}},t} \right) = }\\{ - \frac{1}{\mathit{\boldsymbol{\tau }}}\left[ {{f_\alpha }\left( {\mathit{\boldsymbol{x}},t} \right) - f_\alpha ^{eq}\left( {\mathit{\boldsymbol{x}},t} \right)} \right] + {F_\alpha }\delta t}\end{array}$ (4)
${F_\alpha } = \left( {1 - \frac{1}{{2\tau }}} \right){\mathit{\boldsymbol{w}}_\alpha }\left( {\frac{{{\mathit{\boldsymbol{e}}_\alpha } - \mathit{\boldsymbol{u}}}}{{c_s^2}} + \frac{{{\mathit{\boldsymbol{e}}_\alpha } \cdot \mathit{\boldsymbol{u}}}}{{c_s^4}}{\mathit{\boldsymbol{e}}_\alpha }} \right) \cdot f$ (5)
$\rho \mathit{\boldsymbol{u}} = \sum\limits_\alpha {{\mathit{\boldsymbol{e}}_\alpha }{f_\alpha }} + \frac{1}{2}\mathit{\boldsymbol{f}}\delta t$ (6)
where Fα describes the effect of immersed boundary, f is the force density in flow field, which can be distributed from the boundary force density.
To guarantee the no-slip boundary condition, the force density f in Eqs. (5) and (6) can firstly be set as unknown. Then the force density f can be transformed to the fluid velocity correction δu, which is determined by the boundary velocity correction δuB. So the equation system related to δuB can be written as
$\mathit{\boldsymbol{AX}} = \mathit{\boldsymbol{B}}$ (7)
where
$X = {\left\{ {\delta \mathit{\boldsymbol{u}}_B^1,\delta \mathit{\boldsymbol{u}}_B^2, \cdots ,\delta \mathit{\boldsymbol{u}}_B^m} \right\}^{\rm{T}}}$
$\mathit{\boldsymbol{A}} = \left( {\begin{array}{*{20}{c}}{{\delta _{11}}}&{{\delta _{12}}}& \cdots &{{\delta _{1n}}}\\{{\delta _{21}}}&{{\delta _{22}}}& \cdots &{{\delta _{2n}}}\\ \vdots&\vdots&\ddots&\vdots \\{{\delta _{m1}}}&{{\delta _{m2}}}& \cdots &{{\delta _{mn}}}\end{array}} \right)\left( {\begin{array}{*{20}{c}}{\delta _{11}^B}&{\delta _{12}^B}& \cdots &{\delta _{1m}^B}\\{\delta _{21}^B}&{\delta _{22}^B}& \cdots &{\delta _{2m}^B}\\ \vdots&\vdots&\ddots&\vdots \\{\delta _{n1}^B}&{\delta _{n2}^B}& \cdots &{\delta _{nm}^B}\end{array}} \right)$
$\mathit{\boldsymbol{B}} = \left( {\begin{array}{*{20}{c}}{\mathit{\boldsymbol{U}}_B^1}\\{\mathit{\boldsymbol{U}}_B^2}\\ \vdots \\{\mathit{\boldsymbol{U}}_B^m}\end{array}} \right) - \left( {\begin{array}{*{20}{c}}{{\delta _{11}}}&{{\delta _{12}}}& \cdots &{{\delta _{1n}}}\\{{\delta _{21}}}&{{\delta _{22}}}& \cdots &{{\delta _{2n}}}\\ \vdots&\vdots&\ddots&\vdots \\{{\delta _{m1}}}&{{\delta _{m2}}}& \cdots &{{\delta _{mn}}}\end{array}} \right)\left( {\begin{array}{*{20}{c}}{\mathit{\boldsymbol{u}}_1^ * }\\{\mathit{\boldsymbol{u}}_2^ * }\\ \vdots \\{\mathit{\boldsymbol{u}}_n^ * }\end{array}} \right)$
Here, m is the number of boundary points, n is the number of surrounding Eulerian mesh points, δuBl (l=1, 2, …, m) is the unknown boundary velocity correction vector, δij and δijB are the delta functions, which are used to connect the flow field and the boundary. For more details of the IB-LBM used here, one can refer to our previous work[15].
3 OpenACC Technique The only purpose of parallelization is to make the best use of existing hardware resources thus accelerating the program. This is also a direction where the OpenACC programming model works. In OpenACC, two tasks must be done for parallelization, one is parallelizing the computation and the other is data management. To accomplish these two jobs, OpenACC uses different directives as annotations, compute constructs like kernels or parallel for parallelizing, and data clauses like data/end data or enter/exit data for data management. During the parallel execution, all the data needed is transferred from CPU host to the accelerator device, and the calculation job is offloaded to the GPU.
To assess the performance and efficiency of OpenACC programming precisely, all the attempts in this work are conducted in one computer with Windows 10 (Enterprise Edition) system, which is equipped with an Intel i7-7700K CPU, a Gloway DDR4 RAM with 16GB and a NVIDIA GeForce GTX 1070 GPU accelerator with 8GB under the compiler platform of PGI 17.10 community edition. All the simulations below are performed by using one CPU core.
The OpenACC implementation of LB simulations code consists of several steps. The first one is to locate the computationally intensive regions of code. After assessing the numerical algorithm, it's easy to find the hotspot in code which is computationally intensive and contains potential in parallelization. Fig. 1 shows the flow chart of LB simulations. There are several subroutines in the code: collision, stream, flow variable, and IBM if required. Due to the intrinsic parallel nature of LBM, these subroutines are specifically suitable for parallel computation. Then, the second step is to add OpenACC kernels/parallel loop directives to offload the most computationally intense loops to the GPU accelerator, as shown in Fig. 2. At this stage, all the distribution function and flow variable arrays are copied form CPU host to GPU and transferred back to CPU host for each time step. This may result in a significant loss of performance owing to redundant data movement.
Figure 1
Figure 1 Flow chart of LB simulations
Figure 2
Figure 2 Use of ACC directives in LB simulations
Fig. 3 plots the data movement profiling through NVIDIA Visual Profiler. As a tiny test, the iteration steps are set as 200. As can be seen from the figure, the data movement occurs every time the accelerator launches a parallel kernel. It costs 15.853 s to finish the computation. Comparing with the serial version of LB code (16.318 s), it shows no speedup in efficiency. Nonetheless, the computation task has been offloaded to the GPU and the result obtained is correct. In the next step, it will improve dramatically.
Figure 3
Figure 3 Data movement profiling without data clause
Therefore, the next step is to decrease the meaningless data movement between the host and accelerator. During the iteration, all the data is calculated in GPU accelerator. If they are temporarily stored in the GPU, data exchange can be removed to overcome the high cost of data transfer and kernel launch. After initializing the parameters and flow field, all the arrays needed can be copied to the GPU storage space without exchange until the end of the iteration. To realize this, as shown in Fig. 4, a data region is established by data clauses, which includes "copy", "copyin", "copyout" and "create". All the arrays can obtain allocated memory in GPU by different data clauses depending on their data characteristics (such as global variable, intermediate variable, etc).
Figure 4
Figure 4 Use of data region for time iteration
As illustrated in Fig. 4, in this sample code for flow over a stationary circular cylinder, some arrays are declared in the "acc data" clause. Once the flow field initialization is accomplished, the "copyin" clause is used since the mesh and cylinder information is constant, and then updating the data to CPU host is junk effort. These data remain in the device memory and will not be sent back after the end of iteration. There are also some variables acting as intermediate variables and only work in a particular subroutine. They do not need initial values from the CPU host, nor does the host needs theirs. The "create" clause fits them well. Last but not least, the data exchange of flow variables (such as velocity, density) between host and device are necessary, and so "copy" clause works. After adding the data clause, as shown in Fig. 5, the time for data movement is reduced significantly. It costs only 1.525 s to finish the 200 iterations, which is a speed increase of over 10 times. With precise data management, heavy data transfer burden is relieved, which can lead to an improvement in performance.
Figure 5
Figure 5 Data movement profiling with data clause
4 Numerical Examples In this section, three fluid dynamic problems, i.e., lid-driven cavity flow, flows over a stationary circular cylinder, and an elastically mounted cylinder, are simulated to evaluate the performance of parallel LB codes ported by OpenACC. To show the efficiency and performance of parallel computation, the metrics of million lattice updates per second (MLUPS) is introduced, which is defined as[8]
$MLUPS = \frac{{\left( {mesh\;size} \right) \times \left( {iteration\;steps} \right)}}{{\left( {running\;time} \right) \times {{10}^6}}}$ (8)
With a given mesh size, the iteration steps are also fixed. Then, the running time can be measured. Less running time consumed, which produces higher MLUPS, means better parallel performance.
4.1 Lid-Driven Cavity Flow To validate the LBM parallel code with OpenACC, the lid-driven cavity flow at different Reynolds numbers is simulated. For this problem, the Reynolds number is defined as Re=UL/υ. In which, U is the driving velocity of lid, and L is the length of cavity. Three left walls are stationary. The computational domain is discretized by using the uniform mesh. Four Reynolds numbers are considered, i.e., Re=1 000, 5 000, 7 500 and 10 000. Correspondingly, the mesh sizes are 241×241, 361×361, 521×521 and 841×841.
To verify the results obtained by the parallelized LBM code, Fig. 6 presents the velocity profiles along the centerlines of the cavity at Re=1 000 and 7 500. The results computed by using LBM serial code, which have been widely validated[2], are involved to make comparison. It can be found from the figure that both results compare very well with each other. Fig. 7 shows the streamlines of parallel simulations at different Reynolds numbers, in which the clear development of vortex structures is demonstrated[2].
Figure 6
Figure 6 Comparison of centerline velocity profiles at Re=1 000 and 7 500
Figure 7
Figure 7 Streamlines of lid-driven cavity flow at Re=1 000, 5 000, 7500 and 10 000
Table 1 summarizes the performance between serial computation and parallel computation at four different mesh sizes and Reynolds numbers. From the table, one can find an interesting tendency. With the increase of mesh size, the speedup also grows. It means that although the parallel efficiency is directly relevant to the capability of accelerator device, the performance of parallelism also can be improved with the boost of computation scale. However, there is a physical limit. When the computation scale increases up to a certain degree, enlarging the scale solely can no longer bring an increased acceleration profit.
表 1
Table 1 Performance comparison of serial and parallel computations for lid-driven cavity flow Running set MLUPS (serial) MLUPS (parallel) Speedup
Re=1 000
(mesh size: 241×241) 5.790 297 75.764 862 13.1×
Re=5 000
(mesh size: 361×361) 4.032 279 99.152 402 24.6×
Re=7 500
(mesh size: 521×521) 3.467 184 112.746 055 32.5×
Re=10 000
(mesh size: 841×841) 2.727 646 128.750 964 47.2×
Table 1 Performance comparison of serial and parallel computations for lid-driven cavity flow
4.2 Flow Over a Stationary Circular Cylinder To also validate the IB-LBM parallel code with OpenACC, the flow over a stationary circular cylinder is simulated. The Reynolds number for this problem is defined as Re=UD/υ, where U is the free stream velocity and D is the diameter of the cylinder. The free stream condition is applied to four boundaries. The computational domain is discretized by using the non-uniform mesh, in which the mesh is fine and uniform around the cylinder. Two different Reynolds numbers are considered, i.e., Re=40 and 200, which represent steady and unsteady states, respectively. For both states, the mesh size is fixed at 461×421.
Similar to the lid-driven cavity flow problem above, the IB-LBM serial code has also been run. For the case of Re=40, the drag coefficient and recirculation region behind the cylinder from parallel simulation are 1.563 and 4.63, respectively. They are the same as the results of serial simulation. For the case of Re=200, the same time histories of drag and lift coefficients can be observed from the comparison in Fig. 8. In addition, the streamlines and instantaneous vorticity contours from parallel simulations, which are plotted in Fig. 9, also compare very well with that of our previous study[15].
Figure 8
Figure 8 Comparison of time history of drag and lift coefficients at Re=200
Figure 9
Figure 9 Streamlines and instantaneous vorticity contours of flow over a stationary circular cylinder at Re=40 and 200
Table 2 provides the performance of serial and parallel computations at different Reynolds numbers. As can be found from the table, the implementation of parallel acceleration for the problem of flow over a stationary circular cylinder gets a speedup about 22×. Compared with the case of lid-driven cavity flow with the similar mesh size, there is a bit of performance loss. It is partially owing to the introduction of an immersed boundary method, which brings in extra complexity in algorithm and parallel implementation. Nevertheless, the improvement of performance comparing with the serial one is still satisfactory.
表 2
Table 2 Performance comparison of serial and parallel computations for flow over a stationary circular cylinder Running set MLUPS (serial) MLUPS (parallel) Speedup
Re=40
(mesh size: 461×421) 2.374 135 51.735 030 21.8×
Re=200
(mesh size: 461×421) 2.372 472 51.707 537 21.8×
Table 2 Performance comparison of serial and parallel computations for flow over a stationary circular cylinder
4.3 Flow Over an Elastically Mounted Cylinder After simulating the flow over a stationary circular cylinder, the current IB-LBM parallel code is further used to simulate the flow over an elastically mounted cylinder. Since the cylinder is elastically mounted, it may experience vibration under some specific conditions, which is called as vortex-induced vibration (VIV).
For an elastically mounted cylinder having one degree of freedom, its motion can be described by the following equation[16]:
$\ddot Y + {\left( {\frac{{2{\rm{ \mathsf{ π} }}}}{{{U^ * }}}} \right)^2}Y = \frac{{{C_l}}}{{2{m^ * }}}$ (9)
where Y=y/D and y is the displacement of the cylinder, U* is the reduced velocity of the cylinder that is related to the response of the cylinder to fluid flow, Cl is the lift coefficient acting on the cylinder, m* is the mass ratio of the cylinder to the fluid. In the current simulation, U*=5 and m*=2 are used, which can produce the obvious periodic VIV of cylinder. Moreover, the Reynolds number is fixed at Re=200. Same as the stationary cylinder problem, the mesh size is also chosen as 461×421.
Fig. 10 plots the time histories of drag and lift coefficients and the displacement of cylinder from serial and parallel computations. Again, perfect agreement can be achieved. The maximum displacement is 0.478, which is close to that of Borazjani and Sotiropoulos (0.467)[16]. Fig. 11 presents the streamlines and instantaneous vorticity contours. At this moment, the cylinder is moving upward. Compared with the case of stationary cylinder, it can be found that the streamlines and vorticity contours show clear difference.
Figure 10
Figure 10 Comparison of time history of drag and lift coefficients and displacement of cylinder at Re=200
Figure 11
Figure 11 Streamlines and instantaneous vorticity contours of flow over an elastically mounted cylinder at Re=200
Since the cylinder can oscillate along y direction, the position data of cylinder needs to be updated each iteration step. It brings extra complexity for parallelization and burden for data management. But still the performance gets improved about 11 times faster than the serial implementation.
5 Conclusions In this work, the acceleration of the lattice Boltzmann simulation has been implemented by using the GPU parallel technique. Different from the CUDA programming language, OpenACC is a high level directive based programming standard. During the code parallelization process, OpenACC directives are very effective. After introducing precise data transfer arrangements, the use of OpenACC can receive a reasonable speedup of over 30 times for the LB simulations. Three typical fluid dynamic problems, i.e., lid-driven cavity flow, flows over a stationary circular cylinder and an elastically mounted cylinder, are simulated to validate the parallel codes. Compared with the serial codes, a dramatic improvement in performance can be achieved. The extension to use the advanced features of OpenACC and the interoperability with platform (like cuBLAS) will be implemented in future work.
References
[1] Foster I. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. New Jersey: Addison-Wesley, 1995. (0)
[2] Aidun C K, Clausen J R. Lattice-Boltzmann method for complex flows. Annual Review of Fluid Mechanics, 2010, 42(1): 439-472. DOI:10.1146/annurev-fluid-121108-145519 (0)
[3] Chandrasekaran S, Juckeland G. OpenACC for Programmers: Concepts and Strategies. New Jersey: Addison-Wesley, 2018. (0)
[4] Riesinger C, Bakhtiari A, Schreiber M, et al. A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters. Computation, 2017, 5(4): 48. DOI:10.3390/computation5040048 (0)
[5] Calore E, Gabbana A, Schifano S F, et al. Optimization of lattice Boltzmann simulations on heterogeneous computers. The International Journal of High Performance Computing Applications, 2018. Online First. DOI: 10.1177/1094342017703771. (0)
[6] Calore E, Kraus J, Schifano S F, et al. Accelerating lattice Boltzmann applications with OpenACC. In: Traff J, Hunold S, Versaci F.(eds) European Conference on Parallel Processing, Lecture Notes in Computer Science, 2015, 9233: 613-624. DOI: 10.1007/978-3-662-48096-0_47. (0)
[7] Calore E, Gabbana A, Kraus J, et al. Performance and portability of accelerated lattice Boltzmann applications with OpenACC. Concurrency and Computation: Practice and Experience, 2016, 28(12): 3485-3502. DOI:10.1002/cpe.3862 (0)
[8] Xu A, Shi L, Zhao T S. Accelerated lattice Boltzmann simulation using GPU and OpenACC with data management. International Journal of Heat and Mass Transfer, 2017, 109: 577-588. DOI:10.1016/j.ijheatmasstransfer.2017.02.032 (0)
[9] Xu A, Shyy W, Zhao T S. Lattice Boltzmann modeling of transport phenomena in fuel cells and flow batteries. Acta Mechanica Sinica, 2017, 33(3): 555-574. DOI:10.1007/s10409-017-0667-6 (0)
[10] Couder-Castañeda C, Barrios-Piña H, Gitler I, et al. Performance of a code migration for the simulation of supersonic ejector flow to SMP, MIC, and GPU using OpenMP, OpenMP+LEO, and OpenACC directives. Scientific Programming, 2015. Article ID 739107. DOI: 10.1155/2015/739107. (0)
[11] Rostami R M, Ghaffari-Miab M. Fast computation of finite difference generated time-domain Green's functions of layered media using OpenACC on graphics processors. 2017 Iranian Conference on Electrical Engineering, Piscataway: IEEE, 2017. 1596-1599. DOI: 10.1109/IranianCEE.2017.7985300. (0)
[12] Zhang S H, Yuan R, Wu Y, et al. Implementation and efficiency analysis of parallel computation using OpenACC: A case study using flow field simulations. International Journal of Computational Fluid Dynamics, 2016, 30(1): 79-88. DOI:10.1080/10618562.2016.1164309 (0)
[13] Xia Y, Lou J, Luo H, et al. OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows. International Journal for Numerical Methods in Fluids, 2015, 78(3): 123-139. DOI:10.1002/fld.4009 (0)
[14] Ma W, Lu Z, Yuan W, et al. Parallelization of an unsteady ALE solver with deforming mesh using OpenACC. Scientific Programming, 2017. Article ID 4610138. DOI: 10.1155/2017/4610138. (0)
[15] Wu J, Shu C. Implicit velocity correction-based immersed boundary-lattice Boltzmann method and its applications. Journal of Computational Physics, 2009, 228(6): 1963-1979. DOI:10.1016/j.jcp.2008.11.019 (0)
[16] Borazjani I, Sotiropoulos F. Vortex-induced vibrations of two cylinders in tandem arrangement in the proximity-wake interference region. Journal of Fluid Mechanics, 2009, 621: 321-364. DOI:10.1017/S0022112008004850 (0)
删除或更新信息,请邮件至freekaoyan#163.com(#换成@)
Acceleration of Lattice Boltzmann Simulation via OpenACC
本站小编 哈尔滨工业大学/2019-10-23
相关话题/Acceleration Lattice Boltzmann Simulation via
Modified Adsorption Capacity to Space Molecular Pollutant of Zeolite via Interface Engineering with
Modified Adsorption Capacity to Space Molecular Pollutant of Zeolite via Interface Engineering with Atomic Layer Deposition Author NameAffiliationXianghua GongSchool of Chemistry and Chemical Engineering, Harbin Institute of Technology, Harbin 150000, ChinaYang LiSchool of C ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2020-12-05Relationship on the Head Deviation Rate and Laying Length of Porous Tubes
Relationship on the Head Deviation Rate and Laying Length of Porous Tubes Author NameAffiliationYameng WangCollege of Water Conservancy and Architecture Engineering, Shihezi University, Shihezi 832000, Xinjiang, ChinaHuanfang LiuCollege of Water Conservancy and Architecture ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2020-03-06Numerical Simulation of Bubble Migration in Liquid Titanium Alloy Melt During Vertical Centrifugal C
Numerical Simulation of Bubble Migration in Liquid Titanium Alloy Melt During Vertical Centrifugal Casting Process Author NameAffiliationQin XuSchool of Mechanic and Electrical Engineering, Henan University of Technology, Zhengzhou 450001, ChinaXing WangSchool of Mechanic an ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2020-03-06基于GPS平台的高效k-bisimulation计算算法
基于GPS平台的高效k-bisimulation计算算法阚忠良1,李建中1,2,徐俊2,王宏志2(1.黑龙江大学计算机科学技术学院,哈尔滨150080;2.哈尔滨工业大学计算机科学与技术学院,哈尔滨150001)摘要:计算图的互模划分在许多应用领域中起着至关重要的作用.图中两个点是互模的当且仅当这两 ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24减少上行CQI反馈的OFDIVIA系统资源分配
减少上行CQI反馈的OFDIVIA系统资源分配蒋樟琳,张中兆,沙学军,孙丽楠哈尔滨工业大学电子与信息工程学院,150080哈尔滨摘要:针对目前OFDMA系统上行信道信息的反馈量大和公平性欠佳的问题,提出了减少上行反馈的资源分配算法,该算法在高信噪比下,小区中心的用户不需要反馈CQI,而在低信噪比的小 ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24平板间方柱绕流的格子Boltzmann方法模拟
平板间方柱绕流的格子Boltzmann方法模拟周超英,葛家,ISLAMShamsUl哈尔滨工业大学深圳研究生院摘要:为研究平板间方柱绕流上下平板对置于其中的方柱绕流所产生的影响,采用格子Boltzmann方法对二维平板间低雷诺数(Re=100)方柱绕流问题进行了数值模拟研究.分析了3种不同阻塞比下, ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24Saving Strategy and Simulation Analysis of Green Log Walls in Severe Cold Area
Saving Strategy and Simulation Analysis of Green Log Walls in Severe Cold Area Hong-Peng Xu, Si Li (School of Architecture, Harbin Institute of Technology, Harbin 150001, China) Abstract: ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24Optimized Simulation Design of Double Glass Curtain Wall Shading System in Hot Summer and Cold Winte
Optimized Simulation Design of Double Glass Curtain Wall Shading System in Hot Summer and Cold Winter Zone Li-Sha He, Lin-Sen Mu, Jie Shen (College of Civil Engineering and Architecture, Zhejiang University, Hangzhou 310058, Chi ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24Study on the Outdoor Wind Environment Simulation and Design Strategies of Rural Settlements in Cold
Study on the Outdoor Wind Environment Simulation and Design Strategies of Rural Settlements in Cold Areas Ming Li, Hong Jin,Xin-Yu Zhang (School of Architecture, Harbin Institute of Technology, Harbin 150001, China) ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24Hardware Implementation and Simulations of Joint Coding and Modulation Diversity MIMO-OFDM Scheme
Hardware Implementation and Simulations of Joint Coding and Modulation Diversity MIMO-OFDM Scheme Xiang Gao, Zhan-Ji Wu, Long-Fei Hua, Jun-Peng Liu (School of Information and Communication Engineering, Beijing University of Pos ...哈尔滨工业大学科研学术 本站小编 哈尔滨工业大学 2019-10-24