基于矩阵三角化分解的Cholesky分解及FPGA并行结构设计

删除或更新信息，请邮件至freekaoyan#163.com(#换成@)

清华大学辅仁网/2017-07-07

刘书勇, 林俊宇, 吴艳霞, 张博为

哈尔滨工程大学计算机科学与技术学院, 哈尔滨 150001

Cholesky decomposition and parallel structure design based on matrix triangularization decomposition

LIU Shuyong, LIN Junyu, WU Yanxia, ZHANG Bowei

Collage of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

摘要:

输出: BibTeX | EndNote (RIS)

摘要矩阵运算是高性能计算中核心问题之一，矩阵分解是提高矩阵运算并行性的重要途径，飞速发展的FPGA为并行运算结构提供了有力的环境支持。该文基于子矩阵更新同一化算法实现了Cholesky分解，基于FPGA设计了相应的并行结构。实验结果表明：与通用处理器的软件实现相比，本文实现的Cholesky分解的FPGA并行结果在核心计算性能上可以取得10倍以上的加速比，该算法针对矩阵三角化计算过程具有更高的数据和流水并行性。

关键词 ：矩阵三角化分解,Cholesky分解,并行结构,现场可编程门阵列

Abstract：Matrix computing is one of the core problems in high performance computing with matrix decomposition being an important way to improve the parallelism of matrix computations. FPGA gives a powerful environment for parallel computing. This study uses Cholesky decomposition based on a hardware-adaptive parallel sub-matrix identity updating algorithm. Its parallel structure is based on FPGA. Tests show that this structure achieves more than 10 fold speedup compared to general-purpose processors in terms of the kernel computational speed because the algorithm has better data-parallelism and pipeline-parallelism during matrix triangularization.

Key words：matrix triangularization decomposition Cholesky decomposition parallel structure field programmable gate array

收稿日期: 2016-04-11 出版日期: 2016-09-22

ZTFLH:

TP302.1

通讯作者:林俊宇,讲师,E-mail:linjunyu@hrbeu.edu.cnE-mail: linjunyu@hrbeu.edu.cn

引用本文:

刘书勇, 林俊宇, 吴艳霞, 张博为. 基于矩阵三角化分解的Cholesky分解及FPGA并行结构设计[J]. 清华大学学报（自然科学版）, 2016, 56(9): 963-968.
LIU Shuyong, LIN Junyu, WU Yanxia, ZHANG Bowei. Cholesky decomposition and parallel structure design based on matrix triangularization decomposition. Journal of Tsinghua University(Science and Technology), 2016, 56(9): 963-968.

链接本文:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2016.21.060或 http://jst.tsinghuajournals.com/CN/Y2016/V56/I9/963

图表:

图1 L^T-SC的计算过程

图2 L^T-SC的计算依赖图

图3 L^T-SC的阵列结构与计算时空图

图4 L^T-SC并行结构PE的配置

图5 L^T-SC总体并行结构

表1L^T-SC并行结构中PE单元的综合结果

图6L^T-SC阵列并行结构与通用处理器软件实现的加速比和最大频率随矩阵规模增加的变化

参考文献:

[1] Satchidanand G H, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture [J]. Parallel Algorithms and Applications, 2004, 19(4): 211-226.
[2] Kung H T, Leiserson C E. Systolic arrays technical report [R]. Pitsburg, PA, USA: Carneige Mellon University, 1978.
[3] YANG Depeng, Gregory D P, LI Husheng. Compressed sensing and Cholesky decomposition on FPGAs and GPUs [J]. Parallel Computing, 2012, 38(8): 421-437.
[4] Alfredo B, Julien L, Jakub K, et al. A class of parallel tiled linear algebra algorithms for multicore architectures [J]. Parallel Computing, 2009, 35(1): 38-53.
[5] DOU Yong, Vassiliadis S, Kuzmanov G, et al. 64-bit floating-point FPGA matrix multiplication [C]//Proceedings of the 13th ACM/SIGDA International Symposium on Field Programmable Gate Arrays(FPGA’05), New York, NY, USA: ACM, 2005: 86-95.
[6] Vasily V, James W D. LU, QR and Cholesky factorizations using vector capabilities of GPUs [R]. Berkeley, CA, USA: EECS Department, University of California, Berkeley, 2008.
[7] Oleg M, Volodymyr L, Anatoli S, et al. Parallel implementation of Cholesky LLT-algorithm in FPGA-based processor [J]. Parallel Processing and Applied Mathematics, 2008, 49(67): 137-147.
[8] Eunice E S, CHU Peiyun. Efficient and optimal parallel algorithms for Cholesky decomposition [J]. Journal of Mathematical Modeling and Algorithms, 2003, 2(3): 217-234.
[9] Aatonio R, George A. A high throughput FPGA-based floating point conjugate gradient implementation for dense matrices [J]. ACM Transactions on Reconfigurable Technology and Systems, 2010, 3(1): 1-19.
[10] Badia J M, Vidal A M. Parallel algorithms to computer the eigenvalues and eigenvectors of symmetric Toeplitz matrices [J]. Parallel Algorithms and Applications, 1998, 13(1): 75-93.
[11] Anderson E, BAI Z, Bischof C, et al. LAPACK Users Guide [M]. 3rd ED. Philadelphia, PA, USA: The society of industrial and applied mathematics, 1999.
[12] Blackford L S, Choi J, Cleary A, et al. ScaLAPACK Users Guide [M]. Philadelphia, PA, USA: The Society of Industrial and Applied Mathematics, 1997.
[13] WU Guiming, DOU Yong, SUN Junqing, et al. A high performance and memory efficient LU decomposer on FPGAs [J]. Computers, IEEE Transactions, 2012, 61(3): 366-378.
[14] WU Guiming, DOU Yong, Gregory D P. Blocking LU decomposition for FPGAs [C]//Proceedings of the 18th IEEE Symposium on Field-Programmable Custom Computing Machines(FCCM’10). Charlotte, NC,USA: IEEE, 2010, 109-112.
[15] Haridas S G, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared memory multiprocessor architecture [J]. International Journal of Parallel, Emergent and Distributed Systems, 2004, 19(4): 211-226.
[16] 郭磊, 唐玉华, 周杰, 等. 基于FPGA的Cholesky分解细粒度并行结构与实现 [J]. 计算机研究与发展, 48 (Suppl), 2011: 258-265.GUO Lei, TANG Yuhua, ZHOU Jie, et al. Fine-grained architecture and implementation for Cholesky decomposition on FPGA [J]. Journal of Computer Research and Development, 2011, 48(suppl): 258-265. (in Chinese)
[17] WANG Xiaojun, Leeser M. A truly two-dimensional systolic array FPGA implementation of QR decomposition [J]. ACM Transactions on Embedded Computing Systems (TECS), 2009, 9(1): 17-18.
[18] 邬贵明, 窦勇, 王淼. Cholesky分解细粒度并行算法 [J]. 计算机工程与科学, 2010, 32(9): 102-106.WU Guiming, DOU Yong, WANG Miao. A fine-grained parallel algorithm for the Cholesky decomposition [J]. Computer Engineering & Science, 2010, 32(9): 102-106. (in Chinese)
[19] 刘书勇,吴艳霞,张博文,等. 基于可重构计算系统的矩阵三角化分解硬件并行结构研究 [J]. 电子学报, 43(8), 2015: 1642-1650. LIU Shuyong, WU Yanxia, ZHANG Bowen, et al. Research of parallel hardware architecture for matrix triangularization decomposition based on reconfigurable computing system [J]. ACTA Electronica Sinica, 2015, 43 (8): 1642-1650. (in Chinese)

[1]	彭卓, 邓焱, 马骋, 熊剑平, 尹永利. 基于FPGA的高精度正弦信号发生器设计与实现[J]. 清华大学学报（自然科学版）, 2014, 54(2): 197-201.