[1] Achille A, Soatto S. Information dropout: Learning optimal representations through noisy computation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12):2897-2905.
[2] Borzì A, Schulz V, Schillings C, Winckel G V. On the treatment of distributed uncertainties in PDE-constrained optimization[J]. GAMM-Mitteilungen, 2010, 33(2):230-246.
[3] Borzì A, von Winckel G. A POD framework to determine robust controls in PDE optimization[J]. Computing and Visualization in Science, 2011, 14(3):91-103.
[4] Bottou L, Curtis F E, Nocedal J. Optimization methods for large-scale machine learning[J]. SIAM Review, 2018, 60(2):223-311.
[5] Chang B, Meng L, Haber E, Tung F, Begert D. Multi-level Residual Networks from Dynamical Systems View[J]. arXiv preprint arXiv:1710.10348, 2017.
[6] Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, Zecchina R. Entropy-SGD: Biasing gradient descent into wide valleys[J]. arXiv preprint arXiv:1611.01838, 2016.
[7] Chaudhari P, Oberman A, Osher S, Soatto S, Carlier G. Deep Relaxation: Partial differential equations for optimizing deep neural networks[J]. arXiv preprint arXiv:1704.04932, 2017.
[8] Chen R, Rubanova Y, Bettencourt J, Duvenaud D. Neural ordinary differential equations[C]. In Advances in Neural Information Processing Systems, 2018.
[9] Cheng Y, Wang D, Zhou P, Zhang T. A survey of model compression and acceleration for deep neural networks[J]. arXiv preprint arXiv:1710.09282, 2017.
[10] Dauphin Y N, Pascanu R, Gulcehre C, Cho K, Ganguli S, Bengio Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization[C]. In Advances in Neural Information Processing Systems, 2014, 2933-2941.
[11] Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database[C]. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, 248-255.
[12] Dinh L, Pascanu R, Bengio S, Bengio Y. Sharp minima can generalize for deep nets[C]. In Proceedings of the 34th International Conference on Machine Learning, JMLR.org, 2017, 1019-1028.
[13] E W. A proposal on machine learning via dynamical systems[J]. Communications in Mathematics and Statistics, 2017, 5(1):1-11.
[14] Evans L C. Partial differential equations[M]. American Mathematical Society, 1998.
[15] Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning[C]. In International Conference on Machine Learning, 2016, 1050-1059.
[16] Gastaldi X. Shake-shake regularization[J]. International Conference on Learning Representations Workshop Track, 2017.
[17] Goodfellow I, Bengio Y, Courville A. Deep learning[M], volume 1. MIT Press, Cambridge, 2016.
[18] Gunzburger M D, Webster C G, Zhang G. Stochastic finite element methods for partial differential equations with random input data[J]. Acta Numerica, 2014, 23:521-650.
[19] Haber E, Ruthotto L, Holtham E. Learning across scales - A multiscale method for Convolution Neural Networks[J]. 2017.
[20] He H, Huang G, Yuan Y. Asymmetric Valleys: Beyond Sharp and Flat Local Minima[C]. In Advances in Neural Information Processing Systems, 2019, 2549-2560.
[21] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition[C]. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770-778.
[22] He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks[C]. In European Conference on Computer Vision, Springer, 2016, 630-645.
[23] Higham D J. An algorithmic introduction to numerical simulation of stochastic differential equations[J]. SIAM Review, 2001, 43(3):525-546.
[24] Hoffer E, Hubara I, Soudry D. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks[C]. In Advances in Neural Information Processing Systems, 2017, 1731-1741.
[25] Hu W, Li C J, Li L, Liu J G. On the diffusion approximation of nonconvex stochastic gradient descent[J]. arXiv preprint arXiv:1705.07562, 2017.
[26] Huang G, Liu Z, Maaten L V D, Weinberger K Q. Densely Connected Convolutional Networks[C]. In CVPR, 2017, 4700-4708.
[27] Huang G, Sun Y, Liu Z, Sedra D, Weinberger K Q. Deep networks with stochastic depth[C]. In European Conference on Computer Vision, Springer, 2016, 646-661.
[28] Itô K. Diffusion processes[M]. Wiley Online Library, 1974.
[29] Izmailov P, Podoprikhin D, Garipov T, Vetrov D, Wilson A G. Averaging weights leads to wider optima and better generalization[J]. arXiv preprint arXiv:1803.05407, 2018.
[30] Kawaguchi K, Kaelbling L P, Bengio Y. Generalization in deep learning[J]. arXiv preprint arXiv:1710.05468, 2017.
[31] Keskar N S, Mudigere D, Nocedal J, Smelyanskiy M, Tang P T P. On large-batch training for deep learning: Generalization gap and sharp minima[J]. arXiv preprint arXiv:1609.04836, 2016.
[32] Kloeden P E, Pearson R A. The numerical solution of stochastic differential equations[M]. Springer Berlin Heidelberg, 2010.
[33] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images[J]. 2009.
[34] Labach A, Salehinejad H, Valaee S. Survey of dropout methods for deep neural networks[J]. arXiv preprint arXiv:1904.13310, 2019.
[35] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553):436-444.
[36] Lee H C, Gunzburger M D. Comparison of approaches for random PDE optimization problems based on different matching functionals[J]. Computers & Mathematics with Applications, 2017, 73(8):1657-1672.
[37] Li H, Xu Z, Taylor G, Studer C, Goldstein T. Visualizing the loss landscape of neural nets[C]. In Advances in Neural Information Processing Systems, 2018, 6389-6399.
[38] Li Q, Lin T, Shen Z. Deep Learning via Dynamical Systems: An Approximation Perspective[J]. arXiv preprint arXiv:1912.10382, 2019.
[39] Li Q, Tai C, E W. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms[C]. In International Conference on Machine Learning, 2017, 2101-2110.
[40] Li X, Chen S, Hu X, Yang J. Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift[J]. arXiv preprint arXiv:1801.05134, 2018.
[41] Li Z, Shi Z. Deep Residual Learning and PDEs on Manifold[J]. arXiv preprint arXiv:1708.05115, 2017.
[42] Liu X, Xiao T, Si S, Cao Q, Kumar S, Hsieh C J. Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise[J]. arXiv preprint arXiv:1906.02355, 2019.
[43] Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations[J]. arXiv preprint arXiv:1710.10121, 2017.
[44] Øksendal B. Stochastic differential equations[G]. In Stochastic Differential Equations, Springer, 2003, 65-84.
[45] Osher S, Wang B, Yin P, Luo X, Pham M, Lin A. Laplacian Smoothing Gradient Descent[J]. arXiv preprint arXiv:1806.06317, 2018.
[46] Schmidhuber J. Deep learning in neural networks: An overview[J]. Neural Networks, 2015, 61:85-117.
[47] Smith S L, Kindermans P J, Ying C, Le Q V. Don't decay the learning rate, increase the batch size[J]. arXiv preprint arXiv:1711.00489, 2017.
[48] Srivastava N, Hinton G E, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research, 2014, 15(1):1929-1958.
[49] Sun Q, Du Q. A Distributed Optimal Control Problem with Averaged Stochastic Gradient Descent[J]. Communications in Computational Physics, 2020, 27(3):753-774.
[50] Sun Q, Tao Y, Du Q. Stochastic training of residual networks: A differential equation viewpoint[J]. arXiv preprint arXiv:1812.00174, 2018.
[51] Thorpe M, van Gennip Y. Deep limits of residual neural networks[J]. arXiv preprint arXiv:1810.11741, 2018.
[52] Veit A, Wilber M J, Belongie S. Residual networks behave like ensembles of relatively shallow networks[C]. In Advances in Neural Information Processing Systems, 2016, 550-558.
[53] Wan L, Zeiler M, Zhang S, Cun Y L, Fergus R. Regularization of neural networks using DropConnect[C]. In International Conference on Machine Learning, 2013, 1058-1066.
[54] Wang B, Shi Z, Osher S. ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies[C]. In Advances in Neural Information Processing Systems, 2019, 1655-1665.
[55] Wang B, Yuan B, Shi Z, Osher S J. EnResNet: ResNet Ensemble via the Feynman-Kac Formalism[J]. arXiv preprint arXiv:1811.10745, 2018.
[56] Wang K, Sun W, Du Q. A cooperative game for automated learning of elasto-plasticity knowledge graphs and models with AI-guided experimentation[J]. Computational Mechanics, 2019, 64(2):467-499.
[57] Warming R, Hyett B. The modified equation approach to the stability and accuracy analysis of finite-difference methods[J]. Journal of Computational Physics, 1974, 14(2):159-179.
[58] Zagoruyko S, Komodakis N. Wide residual networks[J]. arXiv preprint arXiv:1605.07146, 2016.
[59] Zhang H M, Dong B. A Review on Deep Learning in Medical Image Reconstruction[J]. Journal of the Operations Research Society of China, 2019, 1-30.