[1] Achille A, Soatto S. Information dropout: Learning optimal representations through noisy computation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12):2897-2905.
[2] Borzì A, Schulz V, Schillings C, Winckel G V. On the treatment of distributed uncertainties in PDE-constrained optimization[J]. GAMM-Mitteilungen, 2010, 33(2):230-246.
[3] Borzì A, von Winckel G. A POD framework to determine robust controls in PDE optimization[J]. Computing and Visualization in Science, 2011, 14(3):91-103.
[4] Bottou L, Curtis F E, Nocedal J. Optimization methods for large-scale machine learning[J]. SIAM Review, 2018, 60(2):223-311.
[5] Chang B, Meng L, Haber E, Tung F, Begert D. Multi-level Residual Networks from Dynamical Systems View[J]. arXiv preprint arXiv:1710.10348, 2017.
[6] Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, Zecchina R. Entropy-SGD: Biasing gradient descent into wide valleys[J]. arXiv preprint arXiv:1611.01838, 2016.
[7] Chaudhari P, Oberman A, Osher S, Soatto S, Carlier G. Deep Relaxation: Partial differential equations for optimizing deep neural networks[J]. arXiv preprint arXiv:1704.04932, 2017.
[8] Chen R, Rubanova Y, Bettencourt J, Duvenaud D. Neural ordinary differential equations[C]. In Advances in Neural Information Processing Systems, 2018.
[9] Cheng Y, Wang D, Zhou P, Zhang T. A survey of model compression and acceleration for deep neural networks[J]. arXiv preprint arXiv:1710.09282, 2017.
[10] Dauphin Y N, Pascanu R, Gulcehre C, Cho K, Ganguli S, Bengio Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization[C]. In Advances in Neural Information Processing Systems, 2014, 2933-2941.
[11] Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database[C]. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, 248-255.
[12] Dinh L, Pascanu R, Bengio S, Bengio Y. Sharp minima can generalize for deep nets[C]. In Proceedings of the 34th International Conference on Machine Learning, JMLR.org, 2017, 1019-1028.
[13] E W. A proposal on machine learning via dynamical systems[J]. Communications in Mathematics and Statistics, 2017, 5(1):1-11.
[14] Evans L C. Partial differential equations[M]. American Mathematical Society, 1998.
[15] Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning[C]. In International Conference on Machine Learning, 2016, 1050-1059.
[16] Gastaldi X. Shake-shake regularization[J]. International Conference on Learning Representations Workshop Track, 2017.
[17] Goodfellow I, Bengio Y, Courville A. Deep learning[M], volume 1. MIT Press, Cambridge, 2016.
[18] Gunzburger M D, Webster C G, Zhang G. Stochastic finite element methods for partial differential equations with random input data[J]. Acta Numerica, 2014, 23:521-650.
[19] Haber E, Ruthotto L, Holtham E. Learning across scales - A multiscale method for Convolution Neural Networks[J]. 2017.
[20] He H, Huang G, Yuan Y. Asymmetric Valleys: Beyond Sharp and Flat Local Minima[C]. In Advances in Neural Information Processing Systems, 2019, 2549-2560.
[21] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition[C]. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770-778.
[22] He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks[C]. In European Conference on Computer Vision, Springer, 2016, 630-645.
[23] Higham D J. An algorithmic introduction to numerical simulation of stochastic differential equations[J]. SIAM Review, 2001, 43(3):525-546.
[24] Hoffer E, Hubara I, Soudry D. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks[C]. In Advances in Neural Information Processing Systems, 2017, 1731-1741.
[25] Hu W, Li C J, Li L, Liu J G. On the diffusion approximation of nonconvex stochastic gradient descent[J]. arXiv preprint arXiv:1705.07562, 2017.
[26] Huang G, Liu Z, Maaten L V D, Weinberger K Q. Densely Connected Convolutional Networks[C]. In CVPR, 2017, 4700-4708.
[27] Huang G, Sun Y, Liu Z, Sedra D, Weinberger K Q. Deep networks with stochastic depth[C]. In European Conference on Computer Vision, Springer, 2016, 646-661.
[28] Itô K. Diffusion processes[M]. Wiley Online Library, 1974.
[29] Izmailov P, Podoprikhin D, Garipov T, Vetrov D, Wilson A G. Averaging weights leads to wider optima and better generalization[J]. arXiv preprint arXiv:1803.05407, 2018.
[30] Kawaguchi K, Kaelbling L P, Bengio Y. Generalization in deep learning[J]. arXiv preprint arXiv:1710.05468, 2017.
[31] Keskar N S, Mudigere D, Nocedal J, Smelyanskiy M, Tang P T P. On large-batch training for deep learning: Generalization gap and sharp minima[J]. arXiv preprint arXiv:1609.04836, 2016.
[32] Kloeden P E, Pearson R A. The numerical solution of stochastic differential equations[M]. Springer Berlin Heidelberg, 2010.
[33] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images[J]. 2009.
[34] Labach A, Salehinejad H, Valaee S. Survey of dropout methods for deep neural networks[J]. arXiv preprint arXiv:1904.13310, 2019.
[35] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553):436-444.
[36] Lee H C, Gunzburger M D. Comparison of approaches for random PDE optimization problems based on different matching functionals[J]. Computers & Mathematics with Applications, 2017, 73(8):1657-1672.
[37] Li H, Xu Z, Taylor G, Studer C, Goldstein T. Visualizing the loss landscape of neural nets[C]. In Advances in Neural Information Processing Systems, 2018, 6389-6399.
[38] Li Q, Lin T, Shen Z. Deep Learning via Dynamical Systems: An Approximation Perspective[J]. arXiv preprint arXiv:1912.10382, 2019.
[39] Li Q, Tai C, E W. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms[C]. In International Conference on Machine Learning, 2017, 2101-2110.
[40] Li X, Chen S, Hu X, Yang J. Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift[J]. arXiv preprint arXiv:1801.05134, 2018.
[41] Li Z, Shi Z. Deep Residual Learning and PDEs on Manifold[J]. arXiv preprint arXiv:1708.05115, 2017.
[42] Liu X, Xiao T, Si S, Cao Q, Kumar S, Hsieh C J. Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise[J]. arXiv preprint arXiv:1906.02355, 2019.
[43] Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations[J]. arXiv preprint arXiv:1710.10121, 2017.
[44] Øksendal B. Stochastic differential equations[G]. In Stochastic Differential Equations, Springer, 2003, 65-84.
[45] Osher S, Wang B, Yin P, Luo X, Pham M, Lin A. Laplacian Smoothing Gradient Descent[J]. arXiv preprint arXiv:1806.06317, 2018.
[46] Schmidhuber J. Deep learning in neural networks: An overview[J]. Neural Networks, 2015, 61:85-117.
[47] Smith S L, Kindermans P J, Ying C, Le Q V. Don't decay the learning rate, increase the batch size[J]. arXiv preprint arXiv:1711.00489, 2017.
[48] Srivastava N, Hinton G E, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research, 2014, 15(1):1929-1958.
[49] Sun Q, Du Q. A Distributed Optimal Control Problem with Averaged Stochastic Gradient Descent[J]. Communications in Computational Physics, 2020, 27(3):753-774.
[50] Sun Q, Tao Y, Du Q. Stochastic training of residual networks: A differential equation viewpoint[J]. arXiv preprint arXiv:1812.00174, 2018.
[51] Thorpe M, van Gennip Y. Deep limits of residual neural networks[J]. arXiv preprint arXiv:1810.11741, 2018.
[52] Veit A, Wilber M J, Belongie S. Residual networks behave like ensembles of relatively shallow networks[C]. In Advances in Neural Information Processing Systems, 2016, 550-558.
[53] Wan L, Zeiler M, Zhang S, Cun Y L, Fergus R. Regularization of neural networks using DropConnect[C]. In International Conference on Machine Learning, 2013, 1058-1066.
[54] Wang B, Shi Z, Osher S. ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies[C]. In Advances in Neural Information Processing Systems, 2019, 1655-1665.
[55] Wang B, Yuan B, Shi Z, Osher S J. EnResNet: ResNet Ensemble via the Feynman-Kac Formalism[J]. arXiv preprint arXiv:1811.10745, 2018.
[56] Wang K, Sun W, Du Q. A cooperative game for automated learning of elasto-plasticity knowledge graphs and models with AI-guided experimentation[J]. Computational Mechanics, 2019, 64(2):467-499.
[57] Warming R, Hyett B. The modified equation approach to the stability and accuracy analysis of finite-difference methods[J]. Journal of Computational Physics, 1974, 14(2):159-179.
[58] Zagoruyko S, Komodakis N. Wide residual networks[J]. arXiv preprint arXiv:1605.07146, 2016.
[59] Zhang H M, Dong B. A Review on Deep Learning in Medical Image Reconstruction[J]. Journal of the Operations Research Society of China, 2019, 1-30.